How to explain a dataset
Have you ever got hold of a random spreadsheet and thought "but what does this all mean?" Or have you ever found a dataset online that meets your needs, only to find out later that it's outdated or wrong? If so, you've probably been let down by the information available about the dataset – your information needs as a data user haven't been met!
If you're a custodian of a service, product or app that produces data (that's most of them, by the way), the chances are that you're not telling people enough about your data. Even if you've got a documentation site or manual, few people appreciate the full range of information that users of a dataset need explained.
Explaining a dataset is as much about internal users as it is about external ones. If your team doesn't understand how to work with the dataset that powers your service, their ability to make quick and correct decisions is hampered. Good quality documentation and content starts with meeting your own needs, and then expands to the needs of the people around you.
We at Register Dynamics are implementers of dataset search engines and data catalogues, including the flagship data.gov.uk, so we've done a lot of research and testing into how to get users working successfully with data. Over a number of years of projects with governments and startups, we've developed a working model that captures the information needs of dataset users. In particular, we successfully put this model into use when writing the UK Tariff Data Standard, which contains all the different types of content that we talk about here.
We use this model regularly to reason about and explain what data users need to know and how to answer their questions effectively. In this post, we’ll talk you through the model and give you an understanding of what dataset users are crying out for information about!
Maturity of users a.k.a. the data usage lifecycle
Understanding the different stages that users go through in using a dataset is a useful mental model that crops up in many different situations. Our favourite model splits users into four stages of maturity, from least to most mature understanding of the data:
Discover the data that is relevant. The user is looking for a dataset that can meet their need. This user is considering a large number of candidate datasets, and so is trying to access a small amount of basic information that helps them prioritise further work.
Evaluate if it solves my problem. The user has found a promising-looking dataset and is now trying to decide whether it meets their needs. The user may put a considerable amount of time into evaluating the dataset, certainly accessing it if they can and examining properties such as its quality and licensing restrictions.
Integrate with my own process. The user has decided to use a dataset and is now building it into their product, service or process. They now need a high level of detail about how the data works and what they can do with it.
Maintain as the data changes. The user has previously integrated a dataset and is now using it to run some live processes. They care a lot about how their use of the data might need to change and are likely to want to be informed about or involved with decisions that affect their usage.
Users at these different stages require very different types of information. Many data catalogues and documentation sites are very good at providing information at the "discover" level, but very few manage to communicate the sort of detail that is required at the "integrate" or "maintain" level. It's important to make sure that information needs are met at all levels – because unmet information needs turn into comparatively expensive support requests or (worse) disengaged users.
The users are not just technical
You might assume that the users of data documentation are mainly people like data scientists and developers. Whilst they are important users, we reason that there is actually a much broader set of people that care about a dataset.
Practitioners
We refer to any user who is going to work with the data directly as a "practitioner". These users of course account for a big proportion of the overall user base but cover a wide range of different disciplines.
It's important to remember that practitioners might vary considerably in their skills. We talk about practitioners as having skills on three independent axes:
Numeracy: ability to understand and analyse the statistical quality of the data and reason about the statistical concepts it encodes. A statistician might be a user who is highly numerate but who doesn't necessarily have the technical skills to transform the data or build with it.
Data literacy: ability to transform, visualise and otherwise wield the data. A data analyst or data scientist might be a user who is expert at using packages like R or Python to work with the data, but may not have advanced statistics knowledge or programming know-how.
Technical literacy: ability to build infrastructure, databases and apps that can collect and use the data. A developer might be a user who knows a lot about getting data into and out of databases, but does not necessarily have much knowledge about how to analyse or interpret it.
An expert practitioner might have comparable knowledge in two of these disciplines, but it's rare to encounter users who are expert at all three!
Designers
People who are responsible for the user experience of a product or service, both at the micro and macro level, also need to know plenty about the data that service will use. User-centred designers and business analysts need to understand what the data means and how it changes so that they can integrate the data into their services and processes.
Delivery leaders
People who are responsible for delivering something on time and to budget need to understand how the data is going to change over time. They are looking for potential risks to their delivery from integrating a dataset that may suddenly change or stop receiving updates.
Policy thinkers
People who are responsible for setting new policies need to have at least a basic understanding of what the data stored about or collected by their policy allows them to express. Experience tells us that the less dissonance there is between the policy and the data, the more easily it can be explained and implemented.
For example, if a policy thinker wants to introduce a new kind of behaviour for a certain segment of users, they need to understand whether that segment is easily identifiable in the data. If it is not, operationalising the policy will take longer and be more complex.
Users need different types of information
Now that we understand the users, we can turn our attention to the sort of information that they will require.
We talk about information needs as being broken into five broad categories:
Form
The "form" of the data is everything about what it contains. Form includes semantic content of the data like what each record and field actually mean, any business rules or invariants that apply, and any reference data and its source. It also includes technical content about how the data is physically structured, like machine-readable schemas, any APIs for codelists, examples, and compatible tools.
As an example, the technical content will tell you that a data record contains a date field and how the date is formatted, and the semantic content will tell you what the date field means and how it should be used.
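To make the distinction concrete, here's a minimal sketch of how the two kinds of content might be captured and used for that date field – the field name, format and business rule are all invented for illustration:

```python
from datetime import date, datetime

# Hypothetical "form" documentation for one field, pairing the technical
# content (type, format) with the semantic content (meaning, business rules).
FIELD_SPEC = {
    "name": "validity_start_date",
    "technical": {
        "type": "date",
        "format": "YYYY-MM-DD (ISO 8601)",  # how the value is physically encoded
    },
    "semantic": {
        "meaning": "the first day on which this record has effect",
        "rules": ["must not be earlier than 2021-01-01",
                  "must be on or before validity_end_date, if present"],
    },
}

def parse_and_check(value: str) -> date:
    """Parse a raw value using the technical spec, then apply a business rule."""
    parsed = datetime.strptime(value, "%Y-%m-%d").date()  # the technical format
    if parsed < date(2021, 1, 1):                         # a semantic business rule
        raise ValueError(f"{FIELD_SPEC['name']} is before the scheme started: {parsed}")
    return parsed

print(parse_and_check("2023-05-01"))
```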
Management
"Management" of the data is information about how the data itself changes, like when and how records are created or removed (but not changes to the structure or semantics). For some datasets this might be a very simple answer about an automatic process (e.g. "a user visits a page on our website") but other datasets the answer might be very nuanced and sophisticated (e.g. "an expert group decides that a new disease is a high risk…").
It also needs to talk about how any upstream data sources are integrated – if the dataset contains reference data from some other source, the management content needs to describe when that reference data can change and how quickly those changes are integrated.
For some datasets it might be possible to communicate large or important changes ahead of time. The management content might also say where and how those upcoming changes will be published.
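As a rough sketch of how a data user might act on that management content – the codelist URL, dates and update cadence below are all invented for illustration:

```python
from datetime import date, timedelta

# Hypothetical management metadata for an upstream codelist: when our local
# copy was last refreshed, and how often the publisher says it can change.
UPSTREAM_CODELIST = {
    "source": "https://example.org/risk-codelist",  # illustrative URL
    "last_refreshed": date(2024, 3, 1),
    "update_cadence_days": 30,
}

def may_be_stale(today: date) -> bool:
    """Flag when the local copy could have missed an upstream change."""
    age = today - UPSTREAM_CODELIST["last_refreshed"]
    return age > timedelta(days=UPSTREAM_CODELIST["update_cadence_days"])

if may_be_stale(date.today()):
    print("Reference data may be out of date - check the upstream publisher.")
```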
Quality
Information in the "quality" category is everything the user needs to know about the ways in which the data might be wrong or incomplete and what they can do about it. This covers basics like where the data comes from and what guarantees there are about what it will contain, meaningful measures of quality for the specific dataset and historic values for them, observational content like known issues and any remediations, and also information on feedback loops that users can use to report quality problems.
For example, a dataset containing driving licenses may need to explain how the license records are created and what guarantees that business process can make, possible quality issues where license details aren't checked to the same high standard (like when a license has originally come from overseas) and how these can be detected and counted, and what email address to contact if another process can show that information on the license is out of date.
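As a minimal sketch of "detected and counted" – the file name, the origin field and its values are invented for illustration:

```python
import csv

def count_unverified(path: str) -> int:
    """Count licence records whose details weren't checked to the usual standard."""
    with open(path, newline="") as f:
        return sum(1 for row in csv.DictReader(f)
                   if row.get("origin") == "overseas_exchange")

# A count like this could be published as a quality measure alongside the
# dataset, with corrections sent to the feedback address in the documentation.
print(count_unverified("licences.csv"), "records could not be fully verified")
```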
Access
Often some of the trickiest issues with data are about "access". At the basic level, this may be about where to get records from the dataset, but information needs in this category quickly evolve into the "how" of data access, like whether an NDA (non-disclosure agreement) needs to be signed or a DPIA (data protection impact assessment) written, and what any agreement actually allows the user to do.
"Access" also needs to cover long-term access to data that changes often. Most users at the beginning of their journey just want to see any data they can, but more mature users want to build automatic systems that can process changes or consume new versions. For fast-changing datasets where the information needs to be up-to-date constantly, users will also have questions about how to meet their own service levels.
For example, if a user wants to provide a service that is available 99.9% of the time but relies on an external dataset via an API, they will want to know what measures they need to take to achieve their own reliability needs. It may be enough if the API is also guaranteed to be available 99.9% of the time, but if not, the user might need to explore options such as keeping local caches of results or excluding downtime caused by external data sources from their uptime agreement.
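One common mitigation is a local cache that serves the last good response whenever the upstream API is unavailable. A rough sketch, assuming an invented endpoint and cache file:

```python
import json
import urllib.request

API_URL = "https://api.example.org/tariff"  # illustrative endpoint, not a real API
CACHE_PATH = "tariff_cache.json"            # last good response, kept locally

def fetch_with_fallback() -> dict:
    """Use the live API when it's up; fall back to the local cache when it isn't."""
    try:
        with urllib.request.urlopen(API_URL, timeout=2) as resp:
            data = json.load(resp)
        with open(CACHE_PATH, "w") as f:    # refresh the cache on every success
            json.dump(data, f)
        return data
    except OSError:                         # API down or unreachable
        with open(CACHE_PATH) as f:
            return json.load(f)
```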
Lifecycle
Information about the "lifecycle" of the data talks about the control, ownership and decision-making processes that apply to it and is crucial for describing how changes to the structure and semantics of the data will be made. This might include a changelog that describes both past changes and upcoming future ones, and any discussion groups, chat channels or mailing lists where proposals can be discussed before they are agreed.
The ability to know and communicate what changes are going to happen ahead of time (instead of just making changes that might break data users) is normally a sign of a mature, important and well-used dataset!
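A changelog that is machine-readable as well as human-readable makes this much easier for data users to consume. The format below is a minimal, invented sketch rather than any particular standard:

```python
from datetime import date

# Each entry records what changed (or will change), when it takes effect,
# and where the decision was discussed. All values are illustrative.
CHANGELOG = [
    {"effective": date(2023, 6, 1),
     "change": "added region_code field",
     "discussion": "https://example.org/proposals/12"},
    {"effective": date(2026, 1, 1),
     "change": "status codelist gains the value 'suspended'",
     "discussion": "https://example.org/proposals/27"},
]

# Warn integrators about changes that haven't taken effect yet.
for entry in CHANGELOG:
    if entry["effective"] > date.today():
        print(f"Upcoming change on {entry['effective']}: {entry['change']}")
```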
Most content about data tends to focus on the "form" with some scant details on the "access" – typically because data documentation is written by practitioners and this is where their interests lie. But users putting data into live use have information needs in all of these categories.
Putting it all together
By considering the overlap of these categorisations, we can start to drill into the detailed questions that each user at each level of maturity might have in each category. For example, a "practitioner" user at the "integrate" stage of their data journey might need "form" information that answers the question "how do I extract the meaning that I need?"
A useful summary of the model is shown in this table:
Whilst this summary doesn't show every possible user, it's useful as a prompt to explore what each type of user will need to know. For example, each type of user might need to answer "how will I understand quality issues?" for different reasons:
Practitioners may need to use or build different technology that can detect possible quality problems and prevent erroneous data from poisoning other parts of the system
Designers may need to think about how data with quality issues is explained to the user or what business processes might be required to handle it
Delivery leaders may need to work out how to test for quality issues before a delivery goes live, and what possible mitigations they will need to deploy
Policy thinkers might need to consider whether or not likely quality issues will stop their policy from meeting its aims or will prevent the impact of their policy from being effectively measured
Sources
This model is not just our own work and isn't just anecdotal. There is serious research and development going into understanding how users make sense of data. One of our favourite research studies on this topic is "Everything you always wanted to know about a dataset: studies in data summarisation" – the entire paper is worth reading, but Table 8 and the dataset summary template in section 6.4 are a good summary of what users need to know, particularly at the "discover" and "evaluate" journey stages.
A great example of the theory put into practice is the International Aid Transparency Initiative (IATI) data standard. Their material covers all of the aspects mentioned here and more, and is an excellent example of what good looks like when it comes to documenting datasets and data standards.
Finally, our model has also been influenced by the wider data community. A session at OpenDataCamp 7 resulted in a wiki-format document called "Getting Started (Pragmatically) with Metadata" that includes a number of crowdsourced insights about data user information needs.
We'd love feedback on our model
And that's it! We've developed our model over a number of years of delivering data solutions to governments and startups, so we hope you find it informative and useful. If you put it into practice (or need help to do just that) we'd love to hear from you! Get in touch with us at hello@register-dynamics.co.uk.