Examining reference data in the StatsWales publishing service
Recently we have undertaken an examination of the potential for reference data within the StatsWales publishing service. This is a service provided by the Welsh Government where data from public sector organisations can be published for use by interested parties. It comprises over a thousand datasets covering various aspects of public services such as health, housing, education and agriculture, to name just a few.
We have been able to download and examine the majority of this data through the platform’s OData functionality, and here we would like to discuss the state of reference data within the current service, and how we could set up a database of reference data that could be used by this or similar services in the future.
StatsWales data structure
In StatsWales datasets are structured as an OLAP data cube, consisting of a fact table with associated lookup tables for each dimension. For reference data we are particularly interested in lookup tables and how they compare across datasets.
We see that there are a number of common dimensions, such as local authorities and various other geographic groupings, along with things like age, ethnicity, etc. However, each dataset has its own unique set of lookup tables, there is no cross-referencing or shared lookup table between datasets with the same dimensions.
There is some indicator of similarity through the use of semantic keys, which are functionally ways of labelling similar data types, however there is no standard for a semantic key and no enforcement of commonality across semantic keys.
Examining StatsWales 2 data we find that dimensions with the same semantic key can often have different ids in them and even have completely different categories. For example, there are multiple different codes used for local authorities, and no consistency in labelling dimensions (many geographic dimensions are simply called ‘Area’). There are even cases where these semantic keys are used completely incorrectly, such as a dimension of years being labelled with the key for local authorities, or a dimension of genders being labelled using the key for age.
Creating good reference data and encouraging its use would reduce such inconsistencies, and allow for more comparison between datasets.
Reference Data Types
Through examining the existing semantic keys and data within StatsWales, we can identify several categories that could form the basis of a set of reference data for the service. Along with the items within them
Geographies - Any kind of geographic areas (e.g. regions), as well as a few specific locations (such as train stations). Sub-categories could include Local Authorities, Economic Areas, Constituencies, Electoral Wards, LSOAs, National Parks, Local Health Boards, etc.
Age - A variety of ways of categorising age groups, ranging from single years to broad categories. Includes single year bands, five year bands, ten year bands, children/adults/pensioners etc.
Ethnicity - Lists of recognised ethnicities, both broad (e.g. White) and detailed (e.g. Welsh).
Service Provider - Healthcare providers: Local Health Boards, Hospitals, GP practices.
Gender - A list of potential gender classifications. (single sub-category).
International Geography - A list of recognised countries. (single sub-category).
Religion - A list of recognised religions. (single sub-category).
Source and maintenance of reference data
Reference data should be maintained centrally by approved staff, there could be a reference data forum where relevant stakeholders can discuss reference data and suggest changes. Staff responsible may have different roles with different responsibilities, one role may be responsible for determining what counts as reference data whereas other roles may be responsible for populating those categories.
Where possible reference data should be sourced from official sources. Or based on commonly used categories within the existing service.
The reference data we have put together comes from a variety of sources. The list of geography categories was populated with data from sources such as DataMapWales, ONS, Open Geographies Portal and UK or Welsh government departments, with a similar situation for Service Providers. International geography comes from the list of countries recognised by the UK Foreign Office. Religion and Ethnicity lists are taken from the UK Census, as are some of the Age categories with others being based on what we see in StatsWales. Gender is based on classifications currently used in StatsWales.
Reference Data structure
Reference data should be structured in such a way that it is simple to maintain and update and easy to find and extract data from.
For structuring the reference data, we have split it into three primary tables.
A list of categories, the high level divisions (e.g. Geographies, Age, etc)
A list of category_keys, the subdivisions (e.g. Local Authority, Region, etc for Geographies)
A list of reference data, the actual data items within these categories (e.g. Carmarthen, Swansea, etc in the case of LAs)
All these tables contain just the ids for their respective data, the actual names/descriptions are separated out into info tables that contain the English and Welsh language names/descriptions for each id.
There is also a hierarchies table, detailing all the potential hierarchies for the reference data.
Structuring the reference data in this normalised way helps improve data governance and ensure data accuracy, whilst making it easy to query and update.
Version numbers help to keep track of changes to the data. Validity periods indicate time periods that the reference data should be used for (though the user would still be able to use these reference data with facts outside that period if they need to).
Separating the data out like this allows for users with different privileges to access the relevant parts of the data, for example some users may be responsible for determining the approved categories for reference data, whist other users may be tasked with populating the reference_data table with the relevant items, which they will only be able to add if their category_keys exist in the category_key table.
Separating out the language data allows for this to be updated independently of the items themselves (i.e. name changes or corrections do not necessitate the creation of an entirely new data item), and allows for easy adding of other languages should the need ever arise.
When reference data needs to be updated, it would simply be a case of adding additional rows to the relevant tables. For example, if a new sub-category is created (e.g. Geog/Towns), we simply add a row to the category_key table (and the relevant descriptions to the category_key_info table).
Actual reference data itself can be created as a table, matching the structure of the existing reference_data table that can just be appended to the existing one (likewise for the _info table).
Updating existing reference data is simply a case of adding updated rows with new version_nos.
Reference Data Extraction and Matching
When using reference data in publishing a dataset, the user will be invited to select a category and sub-category (or sub-categories). They would then be presented with a preview of this reference data to ensure it is what they want. They would also be invited to select a hierarchy.
The system will then do a test match to ensure that all the items in the relevant column of their fact table can be matched against something in the selected reference data. If there are items that don’t match the user will be invited to review and correct them before moving forward.
Once this is done the system will extract the relevant reference data into a lookup table that can go into the dataset’s data cube.
Wider applicability
Whilst we have focused here on how reference data can be applied in the context of the StatsWales service. Many of the principles discussed can be applied to developing reference data for applications more generally. Sourcing and maintaining reference data should be similar in most cases. The data structure can also be pretty universally applied, with maybe some minor tweaks. For example, you may not need categories and subcategories, but rather just a single category table if the reference data needed is not as nuanced. Applying the designs developed here could be beneficial in creating a reliable, authoritative and easy to use source of reference data for many different services.
Author
Tags: