Managing content in Confluence: Archiving and retention policies

This article is part of a series that focuses on the difficulty of managing content in Confluence and offers improvement suggestions through constructive criticism. This time I will talk about keeping content up to date, i.e. archiving, retention policies, automatic notifications etc.

Update 7th August 2012: This article is now partly deprecated as Midori has published a new version of the Archiving Plugin that addresses nearly every issue outlined in this post. As I am traveling and not working at the moment, I cannot write a follow-up article or update this one comprehensively. But suffice to say that having worked on the spec with Midori and having seen their documentation on the new version, the Archiving Plugin 2.0 should cover nearly all your archiving needs in Confluence (the only shortcomings being attachment trashing/archiving and an auto-purge trash bin feature). It is now a truly kick-ass plugin. Find the list of improvements at Atlassian Marketplace and the full documentation at

As I stated in the first article of the series, creating content in Confluence is fun and easy. This leads to a lot of content being created. So much, in fact, that soon you will have a lot of content that is out of date, irrelevant, or both. So how do you discover potentially dated content? What do you do about pages that are obsolete? Pages that need updating? There are many questions to answer and probably more ways than one to implement a workable archiving strategy in Confluence.

I’m glad to be able to say that this article is much more encouraging than the others in the series, since we actually have a solution that works. :) I will outline our archiving strategy, point out the tools we have used and describe the workflows we have established around archiving. There is still room for improvement, as is always the case with technology, so I will point out weaknesses and suggest improvements to our strategy in the last section of the article.

Lots of content, what to do about it?

About a year ago we started to realize we were in a bind. We already had 70 or so global spaces, thousands of pages of content, and no way to know which content was current and relevant and which was out of date, needing either updating (if the page is still relevant) or scrapping (in case it’s obsolete).

Midori Archiving Plugin to the rescue! I looked around and found the Archiving Plugin (AP), which ended up second best in the 2009 Codegeist competition. An awesome plugin all in all, especially considering it’s completely free.

Archiving Plugin configuration pane

Space-specific settings for the AP

The AP took care of many things beautifully without any configuration. Simply install and activate for the spaces you want to keep current. There are only two values you can set per space:

  • How long does content have to be untouched before it expires? (x days, in this example I will use 180 days – roughly six months)
  • How long does content have to be untouched before it is archived? (y days, in this example I will use 365 days – one year)

In addition to these values you can choose to:

  • Schedule the archiving to take place weekly every Monday (schedule not configurable, merely on/off)
  • Send warnings about expired pages to page authors, contributors and space administrators (also not configurable, just on/off)
  • Automatically archive pages that are obsolete (on/off)

Let me explain what the different terms here mean and how the plugin generally works. (If you already know how the plugin works and are interested in my improvement suggestions or our workflows around the plugin, feel free to skip ahead to the criticism or the workflow solution.)

The Archiving Plugin gutted and analyzed

”Untouched” here means that the page content has not been updated for 180 days. The plugin ignores commenting and labeling, but takes into account attachments and descendant pages. If you don’t want the page to be archived, simply edit the contents or attachments of the page. It is now ”valid” again until it remains untouched for 180 days straight.

”Expiring” means that warning emails are sent next Monday to the page creator, most recent editor, and the space administrators, informing them that the page needs to be dealt with within the next 185 days (365-180=185) or it will be automatically archived. ”Obsolete” means that the page is either past its deadline or has been labeled for archiving manually.
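
To make the two thresholds concrete, here is a minimal Python sketch of how a page's state could be derived from its last content change. This is my own illustration, not the plugin's code: the function names are made up, and whether the plugin treats exactly 180 or 365 days as inclusive is an assumption on my part. Pages labeled for archiving by hand are ignored here.

```python
from datetime import date, timedelta


def classify_page(last_modified: date, today: date,
                  expire_days: int = 180, archive_days: int = 365) -> str:
    """Return 'valid', 'expired' or 'obsolete' based on days untouched.

    Assumption: thresholds are inclusive (>=); the plugin may differ.
    """
    untouched = (today - last_modified).days
    if untouched >= archive_days:
        return "obsolete"
    if untouched >= expire_days:
        return "expired"
    return "valid"


def days_until_archive(last_modified: date, today: date,
                       archive_days: int = 365) -> int:
    """Days left before the page would be archived automatically."""
    return max(0, archive_days - (today - last_modified).days)


today = date(2012, 1, 1)
just_expired = today - timedelta(days=180)
print(classify_page(just_expired, today))       # expired
print(days_until_archive(just_expired, today))  # 185
```

With the example values, a page that has just expired has 365 - 180 = 185 days left, which matches the deadline mentioned in the warning emails.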

The plugin uses four labels to mark pages as included or excluded from the archiving process.

  • archive-single means ”upon the next archiving run, archive this page but not its descendants”. This is kind of odd, since the descendants will be orphaned afterwards. Can anyone provide a use case for this? I sure can’t think of one.
  • archive means ”upon the next archiving run, archive this page along with all descendant pages”.
  • noarchive-single means ”never archive this specific page”.
  • noarchive means ”never archive this page or any of its descendant pages”. So essentially the entire pagetree below this page is ignored by the archiving run.
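
The way the four labels interact can be sketched roughly as follows. Note that this is guesswork on the details: I don't know how the plugin resolves conflicting labels, so the precedence used here (noarchive beats archive on the same page, and the nearest labeled ancestor wins) is purely an assumption, as is the function name.

```python
from typing import List, Set


def label_decision(page_labels: Set[str],
                   ancestor_labels: List[Set[str]]) -> str:
    """Return 'archive', 'never' or 'by-age' for a single page.

    ancestor_labels is ordered from the nearest parent up to the root.
    Assumed precedence (not documented): 'noarchive*' beats 'archive*'
    on the same page, and the nearest labeled ancestor wins.
    """
    # Labels on the page itself: both the -single and tree-wide forms apply.
    if page_labels & {"noarchive", "noarchive-single"}:
        return "never"
    if page_labels & {"archive", "archive-single"}:
        return "archive"
    # Only the tree-wide labels are inherited by descendants.
    for labels in ancestor_labels:
        if "noarchive" in labels:
            return "never"
        if "archive" in labels:
            return "archive"
    # No label applies: fall back to the age-based thresholds.
    return "by-age"


print(label_decision(set(), [{"archive-single"}]))  # by-age
print(label_decision(set(), [set(), {"noarchive"}]))  # never
```

The first call shows the orphaning oddity of archive-single: the parent is archived, but the child is left to the normal age-based rules.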

”Archiving” means several things here – there’s actually a lot to explain. Many things take place; many do not. When a page is archived:

  1. First, the plugin checks for the existence of an archive space. If one does not exist, the plugin creates it. The archive space is a minimal copy of the original space, but the word ”Archive” is appended to the spacekey. So if we are archiving a page from the space Development, spacekey DEV, the archive space will be called ”Development (Archive)”, spacekey DEVArchive. So every space for which archiving is enabled will have a clone space with the archived pages.
  2. Next, the plugin checks whether all required parent pages exist in the archive space. If they do not, the plugin creates every parent page required to maintain the page’s location in the page tree, identically to the original space.
  3. Third, the plugin copies the page, complete with content, attachments, labels and comments, into the archive space. It is worth noting that page history is lost upon archiving. It is not documented why the page is not simply moved, but I suspect it has to do with the timestamp of the archived page. No trace is left upon moving a page, but a copy operation at least creates a new page with a distinct timestamp. Another good reason is that when people use short links or page IDs to navigate to a page, they might not notice that the page has been archived and will edit an obsolete page. Copying ensures that users clicking on old links will land on a ”Page not found” page, informing them that the page has indeed been archived and is no longer available. On the whole, losing the page history may be preferable to confusing users about whether a page is current or not. (It does save a bit of space, too, though that should be no concern today with plenty of cheap storage capacity.)
  4. Finally, after copying the page, the plugin trashes the original page, removing it from the active space.
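
The four steps above can be sketched with a small in-memory model in Python. Everything here is hypothetical: the Space class, the archive_page function and the idea of representing a page by its path in the tree are my own simplifications and have nothing to do with the plugin's or Confluence's real API. I also assume that missing parent pages are created as empty placeholders, which the documentation does not confirm.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Path = Tuple[str, ...]  # e.g. ("Home", "Specs", "Old spec")


@dataclass
class Space:
    key: str
    name: str
    pages: Dict[Path, str] = field(default_factory=dict)  # path -> content
    trash: List[Path] = field(default_factory=list)


def archive_page(spaces: Dict[str, Space], space_key: str, path: Path) -> Space:
    """Archive one page and return the archive space it ended up in."""
    source = spaces[space_key]

    # Step 1: find the archive space, or create it if it does not exist.
    archive_key = source.key + "Archive"
    archive = spaces.get(archive_key)
    if archive is None:
        archive = Space(archive_key, f"{source.name} (Archive)")
        spaces[archive_key] = archive

    # Step 2: create missing parent pages (assumed: empty placeholders).
    for depth in range(1, len(path)):
        archive.pages.setdefault(path[:depth], "")

    # Step 3: copy the page; only the current content is carried over.
    archive.pages[path] = source.pages[path]

    # Step 4: trash the original.
    del source.pages[path]
    source.trash.append(path)
    return archive


spaces = {"DEV": Space("DEV", "Development", {
    ("Home",): "home", ("Home", "Specs"): "specs",
    ("Home", "Specs", "Old spec"): "old content"})}
archive = archive_page(spaces, "DEV", ("Home", "Specs", "Old spec"))
print(archive.key, "/", archive.name)  # DEVArchive / Development (Archive)
```

Because step 3 copies only the current content, the sketch also shows why the page history does not survive: nothing but the latest version is ever handed to the archive space.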

The mail notification feature is probably the smartest, most thought-out feature of the entire plugin. It is shockingly simple and efficient. Here’s how it works:

  • Every Monday, a scheduled operation runs. For every space where scheduled archiving is activated, the plugin runs through all pages, looking for expired and obsolete content.
  • For the spaces where notifications are activated, warning emails are sent to the creator and last modifier of each expired page, as well as the administrator of the enclosing space. Space admins and page creators are additionally informed when pages are archived.

The above feature could generate a storm of emails, but this is the best part: Midori has managed to create a succinct report that looks something like this (sensitive data censored):

Email report of expired pages for contributors (sensitive data censored)

If you are a space administrator, you will also get a report like this (sensitive data censored):

Email report of expired pages for Space Administrators (sensitive data censored)

So at most, two emails are sent to each person that has interacted with any of the expired or obsolete pages that week. The first email contains all the expired pages that week, organized by space, showing all relevant information:

  • When the oldest page in the space expired
  • When the oldest page in each space will be automatically archived
  • What the name and location of each page is (title, space, location in space hierarchy)
    • Each space and page is additionally linked to
    • A direct edit link is also provided for each page
  • Who last modified each page and when
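
The trick that keeps the email volume down is batching: one digest per person instead of one email per page. A rough Python sketch of the contributor report might look like this. The ExpiredPage record and build_digests function are hypothetical names of mine, and the real report carries more fields (dates, links) than shown here. The space administrator report would be a second pass of the same kind, keyed by administrator.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class ExpiredPage(NamedTuple):
    space: str
    title: str
    creator: str
    last_modifier: str


def build_digests(pages: Iterable[ExpiredPage]) -> Dict[str, Dict[str, List[str]]]:
    """Group expired pages into one digest per contributor, organized by space."""
    digests = defaultdict(lambda: defaultdict(list))
    for page in pages:
        # A set ensures someone who both created and last edited
        # the page sees it only once in their digest.
        for user in {page.creator, page.last_modifier}:
            digests[user][page.space].append(page.title)
    return {user: dict(by_space) for user, by_space in digests.items()}


pages = [
    ExpiredPage("DEV", "Build guide", "ann", "bob"),
    ExpiredPage("DEV", "Release notes", "ann", "ann"),
    ExpiredPage("OPS", "Backup policy", "bob", "ann"),
]
for user, by_space in sorted(build_digests(pages).items()):
    print(user, by_space)
```

Three expired pages and two contributors result in two emails, however many pages each person has touched.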

When pages are archived, administrators get another email that contains all the pages archived that week, in the same format as above.

Email report of archived pages for Space Administrators

In addition to the scheduled event, the archiving process can be manually activated at any time for each individual space (although not globally), sending out notifications or archiving obsolete pages.

Room for improvement

By now you probably understand why the plugin did well in the Codegeist competition. :) But enough raving about the excellence of the plugin. What about the caveats? Every plugin has weaknesses… So on with the critique.

  • The AP cannot be enabled or run globally; only on a per-space basis. It would be nice if it could be on by default for all new spaces and if it could be run once globally.
  • Attachments cannot be archived separately from pages. Incidentally, blog posts cannot be archived either, but those are usually temporary content entities to begin with. Still, maybe old news should disappear from production to reduce clutter and keep search results relevant.
  • The archiving schedule cannot be customized. A slight annoyance if you would like it to run daily or monthly instead of weekly. Or on another weekday. Or at another hour.
  • The email notifications cannot be customized. It would be nice if some statistics could be included, such as the percentage of expired pages per space and globally. Perhaps a pie chart showing the breakdown of expired pages by space… Okay, maybe I am asking too much, but a guy can dream, right? Oh, that reminds me…
  • No reporting tools come bundled with the plugin. It would be nice to be able to view the above statistics in real time, as well as the number of archived pages per space and globally. Maybe even the average age of pages per space and globally (of course separating production spaces from archive spaces)!
  • No tracking for ”processed” expired pages. How many expired pages have been updated? How many have been archived? Who was it that processed these pages? These statistics would be an excellent tool for encouraging participation in the reviewing and archiving process. The most active archivers could even be rewarded. :)
  • No trace of who archived a page or why it has been archived. This is obviously because labels are used to mark a page for archiving, and labeling is not tracked. But the unfortunate consequence is that when viewing an archived page, it is impossible to know who archived the page and why. This should be remedied by at least recording who marked the page for archiving. But an even better solution would ask the archiver to provide a reason for archiving the page. This would minimize the number of pages archived by mistake or for invalid reasons.
  • The Confluence search function is not augmented with the option to include or exclude archive spaces. The only way to ensure relevant search results is to hide the archive spaces using space permissions. In this case, a separate account needs to be created which does have access to the archive spaces and can perform searches specifically from the archive spaces. This is a pretty high-maintenance solution though, and personally a pain in my behind. It would be great if this were a built-in feature in Confluence: simply an extra field in the SPACES table denoting ”production” or ”archive”, used to include or exclude spaces from search results.

How we solved the workflow problem

Now that you know the ins and outs of the Archiving Plugin, let me describe our solution to the problem of processing out-of-date content. How do you know what the overall state of your content is? How do you keep your users informed of that state, so they will have an incentive to review expired pages? How do you lower the threshold for processing those pages? Building tools and a workflow that enabled all this was no easy task, but in the end we came up with a workable solution. Actually, getting everyone in our department to understand the workflow and use the tools was the hardest part. But I digress…

In order to build reasonable review and archiving workflows, page states like ”obsolete”, ”needs updating”, ”update pending” and so on would be required. There is a separate plugin that handles such workflows in a wonderful fashion, but at the time we had to figure all this out, we didn’t really know anything about Ad Hoc Workflows (AHW). I now know that it is probably the best tool to use for workflow building, so we might switch to that eventually. You should know that it does come with a price, but most great plugins do. We have not yet configured AHW for use with the AP, but I’ll be sure to post an update to this series if and when we do. :)

For now I’ll only outline the archiving strategy that I would recommend, based on my experience with our process (about 9 months in production now).

  1. Define suitable expiration periods. How often do pages in your wiki and your individual spaces need to be revisited? Once a year? Twice a year? Every three months? Should different expiration periods be defined for different types of content?
    Our solution is to have 1 year as the expiration period and 2 years as the automatic archival period. We intend to expand the use of the plugin into other spaces where shorter intervals can be used. We have excluded content whose validity is determined by the clients (process descriptions agreed upon with the clients). These are revisited once a year by policy anyway.
  2. Define a workflow for expired pages. How are pages processed when they are reviewed? How many states should pages have? How do they transition from one state to another? How do you task the updating of a page to another person? Who does the actual reviewing, is it the same person that updates pages?
    This is obviously the most difficult part of defining an archiving strategy. We ended up labeling pages ”obsolete” or ”in need of updating” respectively for situations where a page is deemed probably irrelevant but not surely enough to directly archive it; or a page is probably still relevant but needs to be brought up-to-date. These labeled pages pop up on reporting pages where another support person, an architect, or a manager can review the page and confirm or revoke the assessment of the original reviewer. So it’s either a one- or two-step process, depending on the situation.
  3. Build and exploit statistics about the state of your content. Use the numbers both to show management you are making progress getting out-of-date content out of the way and to motivate your users to process those expired pages. Users are much more inclined to contribute when they see they have an impact on the big picture!
    What I did was (1) have one page that lists all expired pages, built with the Reporting Plugin so they can be sorted by expiration date; (2) another page that lists expired pages by space and space group; (3) a third page that lists pages marked for updating or archival review; and finally (4) a weekly email report that shows the overall state of content up-to-dateness (silly word, isn’t it?). It’s been quite a bit of work building the report, what with hand-built SQL queries and Excel spreadsheets, but I believe the reports also strongly encourage participation. It would be wonderful to have a built-in tool for all this.

    A crude manually crafted Excel table that lists expired content per space and space group. This is sent to all production teams weekly in an email report (4) along with the expired pages list (1).

  4. Organize archiving workshops. We had to do this a few weeks ago when the number of expired pages surpassed 10% of the total. About 15 people got together and went through more than 200 pages in one evening (4 hours). We were able to lower the expired ratio from 10% to 2.6%! All it took was some pizza, some beer, and a decent work plan. Lessons learned: Make task lists of the expired pages so they can be checked off as they are processed. Assign a team to each space (or space group) so they can work together. This approach has several advantages compared to people working alone:
    1. With plenty of time dedicated to processing pages, users get better acquainted with the content in the wiki
    2. With guidance, users grow more comfortable with the wiki UI
    3. People discuss the content as they review it, increasing their confidence in content management
    4. All of the above combined contribute to increased participation overall, which is every wiki administrator’s dream. :)
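
The expired ratios mentioned in steps 3 and 4 are simple to compute once you have a list of pages and their state. Here is a minimal Python sketch, assuming each page is given as a (space key, is expired) pair. That input format and both function names are made up for illustration; in our case the numbers actually came out of hand-built SQL queries and Excel.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def expired_ratio_by_space(pages: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the share of expired pages for each space (0.0 to 1.0)."""
    total = Counter()
    expired = Counter()
    for space, is_expired in pages:
        total[space] += 1
        if is_expired:
            expired[space] += 1
    return {space: expired[space] / total[space] for space in total}


def overall_expired_ratio(pages: List[Tuple[str, bool]]) -> float:
    """Return the share of expired pages across all spaces."""
    if not pages:
        return 0.0
    return sum(1 for _, is_expired in pages if is_expired) / len(pages)


pages = [("DEV", True), ("DEV", False), ("DEV", False), ("DEV", False),
         ("OPS", True)]
print(expired_ratio_by_space(pages))  # {'DEV': 0.25, 'OPS': 1.0}
print(overall_expired_ratio(pages))   # 0.4
```

Tracking the overall figure week by week is what told us when it was time for a workshop.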

Closing thoughts

So what’s my gripe with Atlassian if the AP does such a great job and we have a well-working archiving workflow? It’s that Atlassian could well adopt the AP as their own, bundle it with Confluence, and expand on it a little with retention policies and integrated search. I’ve explained the search issue already, but the retention policies might need a bit of elaboration.

Retention policies could encompass the features of the AP, address the drawbacks I mentioned, and add a little twist: let the administrator choose between trashing or archiving old content automatically, and enable limits on the maximum number or age of items in the trash (oldest items are emptied as new ones are trashed). Also, moving pages to the trash should mark them as ”removed” in statistics. Currently statistics only count pages that are emptied from trash, not pages that are trashed. It makes statistics look odd when no pages are removed for months at a time, and suddenly 250 pages are removed at once. Whoa, wait, what happened here? Oh right, I emptied the trash bins. It would be useful to know when and how users trash content, but that information is not available as it currently stands.
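
To illustrate the trash limits I have in mind, here is a minimal Python sketch of a trash bin that empties its oldest items as new ones arrive. This is a wish-list feature, not something Confluence or the AP offers: the TrashBin class and its parameters are entirely hypothetical, and the sketch assumes pages are trashed in chronological order.

```python
from collections import deque
from datetime import date, timedelta
from typing import Deque, List, Optional, Tuple


class TrashBin:
    """Hypothetical trash bin with optional size and age limits."""

    def __init__(self, max_items: Optional[int] = None,
                 max_age_days: Optional[int] = None) -> None:
        self.max_items = max_items
        self.max_age_days = max_age_days
        self._items: Deque[Tuple[date, str]] = deque()  # oldest first

    def trash(self, title: str, today: date) -> List[str]:
        """Trash a page and return the titles purged to stay within limits."""
        self._items.append((today, title))
        return self.purge(today)

    def purge(self, today: date) -> List[str]:
        """Empty the oldest items until both limits are satisfied."""
        purged = []
        while self._items and self._over_limit(today):
            purged.append(self._items.popleft()[1])
        return purged

    def _over_limit(self, today: date) -> bool:
        if self.max_items is not None and len(self._items) > self.max_items:
            return True
        oldest_date = self._items[0][0]
        return (self.max_age_days is not None
                and (today - oldest_date).days > self.max_age_days)


trash_bin = TrashBin(max_items=2, max_age_days=30)
start = date(2012, 1, 1)
print(trash_bin.trash("A", start))                       # []
print(trash_bin.trash("B", start + timedelta(days=1)))   # []
print(trash_bin.trash("C", start + timedelta(days=2)))   # ['A']
print(trash_bin.trash("D", start + timedelta(days=40)))  # ['B', 'C']
```

Returning the purged titles from every call would also give the statistics something to count at the moment a page actually disappears, instead of in one big lump when an admin empties the bin.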

That sums it up for archiving and retention policies. On the whole it’s a pretty good affair compared to the other issues I have outlined in this article series. The most serious complaint right now is the lack of tracking exactly who archived a page. Finding out why some pages have mysteriously been archived is time-consuming and frustrating.