
Managing content in Confluence: Statistics and reporting

The fourth installment of my Confluence article series has been a long time coming. In the fall I was busy building the spec for Midori’s Archiving Plugin 2.0 (Midori started a discussion with me based on my previous article), and soon after I started to train my successor at Appelsiini, as I would leave my post in January for a round-the-world trip. I wanted to add some images and/or illustrations to make the article easier on the eyes, but I’m finally at a point where I just want to get it published.

So here is my next (and perhaps last) pet peeve with Confluence, namely statistics and reporting. The built-in tools don’t offer a lot of useful statistics, and reporting on daily activity could use a lot of improvement.

Why are statistics and reporting needed?

This may seem like a silly question with an obvious answer. But I’ll present our business case with some concrete examples for reference so I can point out flaws and suggest improvements.

As Confluence content grows beyond a few hundred pages and/or activity surpasses a hundred events per day, you start losing track of the big picture. You can look at the Dashboard any time of day and see recent events, sure. You can see the total amount of content on the System Information page, sure. There is even global and space-specific tracking for view and edit events, plus lists of most popular content and most active users. But these don’t quite cut it when you try to analyze trends in content and behavior.

The specific questions we need answers to are:

  • What’s in the wiki? How many pages, blogposts, attachments, comments? Both in absolute numbers and in percentages. What’s the breakdown by space and space category? Where is the most content located?
  • Is the content current? How much of the content is up-to-date? Where is the most out-of-date content?
  • What happens in the wiki? How many views and edits are there globally, by space, by space category? Where is the action? How does space activity compare to other spaces? How has activity changed over time? How much has content grown or diminished?
  • Who uses the wiki and how? How many people visited the wiki? How many have visited daily, weekly? Which OS/browser did they use? How long did they stay? How much did they contribute? Who contributed the most? How large a percentage was that of the total contributions?
  • Which content is useful? Which is useless? What’s the most and least viewed content? Which spaces are most / least active? Which content is most commented on (hot topic)? Which content is updated the most (useful)? Which content has been touched by the most contributors (popular)?
  • What happened recently? Yesterday? Last week? Last month?

What Confluence has

  • Usage tracking plugin
    • Space Activity
    • Global Activity
  • Macros: usage, popular, topusers
  • Daily email report
  • Real-time notifications for watched pages and news
  • RSS feeds for favorite content
  • Coming up: Weekly e-mail report with popular content

What Confluence needs: Numbers, percentages, lists, charts, timelines… and control

The data from content and activity can be shown in many ways. I’ll describe the ideas I have for showing the answers to the questions outlined above.

What’s in the wiki?

I’d like a content breakdown by space, by space category, by content type, by contributor, by label. Show me the current situation (pie chart) and a timeline (lines or bars) so I can see trends over the past quarter or year. Better yet, let me pick a point in time or a period of time and show me the statistics thereof (average over a certain period).
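As a sketch of the kind of aggregation I have in mind: given a content inventory (the data below is made up; in practice it would come from a database query or an export), the breakdown by space or by content type, in both absolute numbers and percentages, is trivial to compute.

```python
from collections import Counter

# Hypothetical content inventory as (space, content_type) pairs.
# Space keys and types are examples, not real Confluence identifiers.
content = [
    ("DEV", "page"), ("DEV", "page"), ("DEV", "attachment"),
    ("HR", "page"), ("HR", "blogpost"), ("HR", "comment"),
    ("DEV", "comment"), ("HR", "page"),
]

def breakdown(items, key_index):
    """Absolute counts and percentages along one dimension."""
    counts = Counter(item[key_index] for item in items)
    total = sum(counts.values())
    return {k: (n, round(100 * n / total, 1)) for k, n in counts.items()}

by_space = breakdown(content, 0)  # e.g. {"DEV": (4, 50.0), "HR": (4, 50.0)}
by_type = breakdown(content, 1)   # e.g. {"page": (4, 50.0), ...}
```

The same counts, bucketed by month, would feed the timeline view directly.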

It would also be useful to separate, for instance, non-image attachments from image attachments. After all, non-image attachments are often used differently than images: they are usually office documents, text files, batch files, zip files etc. They might be used in isolation from wiki pages, whereas images are usually part of the page content. For content-type distribution purposes, I would say images should be tracked as part of page content and thus ignored when calculating attachment numbers.

Another note related to attachments is that reports on attachment sizes would be useful. How many attachments are over 1 MB? 5 MB? 10 MB? 50 MB? How much disk space do attachments need in total? Which spaces host the largest disk space hogs?
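The size report itself is just threshold bucketing. A minimal sketch, with invented byte counts standing in for real attachment data:

```python
# Bucket attachment sizes to answer "how many are over 1 MB? 5 MB? ..."
# Sizes are in bytes; the values here are made up for illustration.
MB = 1024 * 1024
attachment_sizes = [300_000, 2 * MB, 7 * MB, 12 * MB, 60 * MB, 800_000]

thresholds = [1, 5, 10, 50]  # MB
over = {t: sum(1 for s in attachment_sizes if s > t * MB) for t in thresholds}
total_mb = sum(attachment_sizes) / MB
```

Grouping the same sizes by space (as in the breakdown above) would answer the "where are the disk space hogs" question.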

Is the content current?

This need will be mostly addressed by Midori Archiving Plugin 2.0, due out sometime in the spring of 2012. I discussed the weaknesses in Confluence and the current version of the Archiving Plugin in my previous post: Archiving and retention policies.

What happens in the wiki?

I’d like a list of all activity. How many views, edits, create events, remove events? What’s the difference between added and removed content? These should also be viewable by space, space category or globally and the timeframe should be easily adjustable.

A note on administrative actions: sometimes batch operations are performed by script. These operations easily skew statistics if they can’t be filtered out from ”real” human activity. There should be a built-in (customizable) ”automation” account through which scripted mass operations can be performed without being included in statistics that measure human activity and are used to analyze user behavior. It should also be possible to exclude certain spaces or space categories from statistics; there are always test or clone spaces in which non-production behavior takes place.

More than once I’ve wondered why there was an activity spike for a certain month… Until I realized I caused it myself by copying a space with all its 500 pages and 300 comments. Suddenly the activity for that month more than doubled, oops! It’s difficult, if not impossible, to correct for these mistakes after the fact. It is then awkward when you want to show management how activity has been (or should have been) impacted by certain events. ”Yes well here the numbers aren’t quite reliable… I don’t really know what the actual activity level was for that month.”

To summarize, Confluence should by default include methods that filter out scripted and testing activity so admins can produce reliable data about user behavior in the wiki. Currently data has to be manipulated (sometimes heavily) to reflect real user activity and thereby become a useful tool for analyzing behavior and designing the wiki. The process is time-consuming, manual and unreliable. I doubt that most admins currently even bother to build meaningful statistics since maintaining them is so much work.
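The filtering itself would be simple if the raw events were accessible. A sketch of the idea, where the account names, space keys and event format are all hypothetical:

```python
# Filter scripted and test-space activity out of an event stream
# before computing any human-behavior statistics.
AUTOMATION_ACCOUNTS = {"automation", "migration-bot"}  # example names
EXCLUDED_SPACES = {"SANDBOX", "TEST"}                  # example space keys

events = [
    {"user": "alice", "space": "DEV", "type": "edit"},
    {"user": "automation", "space": "DEV", "type": "edit"},   # scripted
    {"user": "bob", "space": "SANDBOX", "type": "view"},      # test space
    {"user": "alice", "space": "HR", "type": "view"},
]

human_events = [
    e for e in events
    if e["user"] not in AUTOMATION_ACCOUNTS
    and e["space"] not in EXCLUDED_SPACES
]
```

The point is that this filter should be applied once, centrally and by configuration, instead of being re-invented by every admin massaging exported numbers.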

Who uses the wiki and how?

Who are the most active users? We need relative statistics in addition to absolute ones: was the top contributor responsible for 2% or 20% of all contributions that month? How many contributions were there in total? Who made the broadest contributions (many different pages, many different spaces)? All statistics should be filterable by space and space category, in order to see group-specific behavior.

What about profiles? It would be useful to be able to define ”contributor” and ”active user” profiles so that participation levels could be tracked. Let’s say that a ”contributor” is a person who updates or comments on at least 10 different pages per month. How many users are then ”contributors”? 20%? 80%? What is the participation level in the organization?

The other profile, ”active user”, could be a person who visits every day and views at least 30 different pieces of content (pages, attachments etc.) per week. How many users would then be ”active”? 50%? 90%? For how many users is the wiki a daily tool? What is the wiki penetration in the company/department/team?
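The two profiles above reduce to counting distinct content per user over a time window. A minimal sketch against a hypothetical event log (the thresholds are the example values from the text; the event fields are made up):

```python
from collections import defaultdict

CONTRIBUTOR_PAGES_PER_MONTH = 10  # distinct pages updated/commented
ACTIVE_VIEWS_PER_WEEK = 30        # distinct content items viewed

def classify(events):
    """Return (contributors, active_users) for one month/week of events."""
    edited = defaultdict(set)  # user -> distinct pages updated or commented
    viewed = defaultdict(set)  # user -> distinct content items viewed
    for e in events:
        if e["type"] in ("update", "comment"):
            edited[e["user"]].add(e["content_id"])
        elif e["type"] == "view":
            viewed[e["user"]].add(e["content_id"])
    contributors = {u for u, pages in edited.items()
                    if len(pages) >= CONTRIBUTOR_PAGES_PER_MONTH}
    active = {u for u, items in viewed.items()
              if len(items) >= ACTIVE_VIEWS_PER_WEEK}
    return contributors, active
```

Dividing the size of each set by the total user count gives the participation percentage the text asks about.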

Which content is useful? Which is useless?

The above statistics are user-oriented, but we also need better content-oriented statistics. Which content is viewed most? Which is viewed the least or not at all? Which content is updated the most? Which content is used by the largest number of users? These should also be available globally and by space or space category.

These numbers are important in that we might find useful content and learn something from it. Why is this content used a lot? Why is that content not? We can probably archive or delete pages that haven’t been used in a long time. There are pages that receive a lot of views, but only by a small group of people. Which pages are truly popular, as in almost everyone uses them? Why are they popular? If those pages have poor usability (language, layout or structure), it’s worthwhile to put some time into improving these pages to further improve their usefulness.

Tip: If you want to give users a voice as well as track activity, you can take a look at the Content Survey and Reporting plugin and the Adaptavist Content Voting plugin. However, with our meager userbase of 120 or so, we don’t believe users would spend a lot of time liking and rating content. We haven’t deemed it necessary or useful, so we haven’t implemented any voting or rating controls yet.

What happened recently?

The options for staying on top of Confluence events are quite disappointing right now. The built-in functions work well for small instances with a few dozen users and a few hundred events per month. But scale that up to 120 spaces, 8000 pages and 3500 events a month, and suddenly you need more fine-grained options if you are not to be overwhelmed by wiki activity. It’s important that the e-mails you receive from Confluence contain relevant information in a concise manner: not too much, and not too frequently.

So let me decide exactly which content I receive reports on and how often. Let me create categories with different priorities so I can place the most important content first. It shouldn’t be simply on/off.

For instance, I’d prefer three categories:

  1. Important pages that I’m working on right now and pages whose development and feedback I wish to follow closely. (Watched pages)
    Schedule: Send me real-time reports while pages are being edited (at most one [summarized] e-mail every 30 minutes).
  2. Semi-important pages whose development and feedback I wish to follow on a daily basis. (Favorited pages)
    Schedule: Send me a summarized e-mail report once a day on events for these pages.
  3. New content that might interest me because it’s in my favorite spaces. (Favorited spaces)
    Schedule: Send me a summarized e-mail report once a day on new pages, comments and attachments in my favorite spaces.
Note: If the schedule is identical for two or more categories, combine the reports into one e-mail (separate the categories with headings). The fewer messages sent, the better!

Of course news/blogposts are used differently from pages. They should have their own rules:

  1. Important spaces whose news I want to follow closely. (Watched / News-watched spaces)
    Schedule: Send me an e-mail for every created blogpost right after it has been created. Send me e-mail reports at most every 15 minutes on updates and comments to those items.
  2. Spaces whose news I want to follow daily. (Favorited spaces)
    Schedule: Send me a daily e-mail report every morning with the previous day’s blogposts from these spaces.

One thing that really should be changed is that e-mails should not be sent out one-for-one (one e-mail per page save). Many users have suggested a method where an e-mail report is not sent until 5 minutes after the most recent edit. That way, even if a user makes 10 consecutive edits, all within 5 minutes of each other, the watchers receive only one e-mail, after the last edit has been made. Page versioning could follow the same logic: merge consecutive edits into one page version, and only increment the version number once no one has edited the page for 5 minutes (or another user has edited the page in between your edits).
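The suggested behavior is a classic debounce. A minimal sketch of the grouping logic, with timestamps as plain seconds and the 5-minute quiet period from the text:

```python
# Debounce notifications: one e-mail per batch, sent only after
# the page has been quiet for QUIET_PERIOD seconds.
QUIET_PERIOD = 5 * 60

def batch_edits(edit_times, quiet=QUIET_PERIOD):
    """Group edit timestamps into batches; one notification would be
    sent `quiet` seconds after the last edit in each batch."""
    batches = []
    current = []
    for t in sorted(edit_times):
        if current and t - current[-1] > quiet:
            batches.append(current)  # quiet period elapsed: close the batch
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches
```

Ten edits made within 5 minutes of each other all land in one batch, so the watchers get exactly one e-mail.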

Concerning attachments, too: If a user uploads 150 attachments at once, I don’t want a separate e-mail for each one! As described above, only after 5 minutes have passed since the last upload, should an e-mail be sent with the list of uploaded attachments. If multiple pages were updated with attachments, a separate e-mail could be generated for each page. But not one per attachment! Recently-updated lists could be summarized as well: a single line could be displayed that says ”John Doe uploaded 20 attachments to [page name] in [space name]”. The line could be expandable, sure, but the default of displaying each attachment upload as a separate event is a tad verbose.
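Collapsing an upload burst into per-page summary lines is equally simple. A sketch with made-up user, page and space names:

```python
from collections import Counter

# One (user, page, space) tuple per uploaded attachment; the goal is
# a single line per page, not one notification per file.
uploads = [
    ("John Doe", "Specs", "DEV"),
    ("John Doe", "Specs", "DEV"),
    ("John Doe", "Specs", "DEV"),
    ("Jane Roe", "Minutes", "HR"),
]

summary = [
    f"{user} uploaded {n} attachment(s) to {page} in {space}"
    for (user, page, space), n in Counter(uploads).items()
]
```

150 uploads to one page become a single line instead of 150 list entries or e-mails.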

These sorts of rules would help users stay in the loop with relevant reports that don’t inundate them with e-mails. As a system administrator that sees all content, my daily email report contains 100 to 200 items. Every day. That’s impossible to process on a daily basis. Instead I watch certain spaces and pages. But even so I may receive 100 distinct e-mail reports a day because I get one e-mail for every little edit and every uploaded attachment. The reports seriously need grouping or else all hell breaks loose whenever a user goes wild with the content. Which I of course want to encourage! Just not the e-mail storms. ;)

Tip: If you haven’t stumbled on it before, the Descendant Notification plugin is worth a shot (as soon as they fix the double notification bug). You can watch entire page families using the plugin and it even watches new child pages! This is often a better solution than watching an entire space.

Closing thoughts

Using Confluence is great fun. It’s even quite fun to administrate, despite the many drawbacks I have described in this article series. But developing the wiki is difficult without good metrics on content and behavior. Much needs to be done before an admin can simply install Confluence and start watching the numbers take visual form.

You can certainly build your own statistics using a combination of the built-in macros, the SQL plugin, Reporting Plugin, Apache Webalizer, Google Analytics and so on… But it quickly becomes a mess of tools that is difficult and time-consuming to maintain. Just take a look at this page to see why. Atlassian should work on getting a basic package working to make the job easier for administrators of medium-to-large size wikis.

The other, more user-related issue, was staying on top of current events in Confluence. This has been discussed elsewhere but I wanted to add my two cents to the discussion. Confluence simply sends out too much e-mail right now and some notification consolidation is in order.

A final note: All the statistics listed in this article should be reportable by e-mail just like the ”recent events” personalized emails. It would be the perfect Monday morning report for me (the administrator) to read and analyze while sipping my morning coffee. ;)