The challenges of disconnected classifications

Organisations that collect data usually have a lot of things to collect data about, and they want to make some overall sense of the data about all those things. A valuable tool to help with that is categorisation. Look at any nontrivial online shop and you’ll find products organised into categories, so that entire groups of products can be considered together. It’s nice for users to be able to browse to a category to find things, and valuable for you to be able to ask “How are our sales of garden furniture correlated with the weather?”.

The lifecycle of a categorisation system

It all seems so easy at the beginning.

People build categorisations intuitively, because our brains are wired to find patterns. As your hypothetical e-commerce business grows, you’ll naturally group your products together.

But while doing so, you’ll find a few outliers that don’t fit neatly into the categories. Is a paddling pool garden furniture, or a kid’s toy? After a moment’s reflection about customer buying patterns, you decide to put paddling pools under garden furniture; people are more likely to buy one while looking for things for their garden than when looking for a present for a grandchild. Perhaps your software lets you put the same product in more than one category, which can help avoid conflicts by letting you choose both. That is also very convenient if you have special categories like “Special Offers”, because you can put products in there without having to take them out of the category they’re naturally in.

Complexity rises

Your category system will make distinctions that matter to you and your users, while ignoring distinctions that don’t. Perhaps you have lots of different kinds of garden furniture, so you decide to subdivide the Garden Furniture category into patio furniture, barbecues, children’s garden toys, and so on. Other categories, where your range is small, aren’t worth splitting into subcategories that might only have one thing in each.

Your sales analytics will probably make extensive use of product categories to summarise sales, but this is complicated if you’ve got products in multiple categories - which category should a sale of that product contribute towards in your reports? And your reports probably want a somewhat coarser categorisation into a handful of top-level lines, rather than the potentially hundreds of fine-grained subcategories you want to help users find things in your massive product catalogue. And sales reports might care about distinctions such as “Delivery-only products” versus “In-store products” that your customers won’t be so interested in when searching. Do you need to assign every product two categories in two different systems - one for user searching and one for sales reporting - despite them closely overlapping in most cases?

If you ship things abroad wholesale, there’s another categorisation that may be forced upon you - tariff categories for export duties. Garden furniture might attract different rates based on whether it’s mainly made of metal or wood, and where those materials came from, and where final assembly was performed. This is certainly not the kind of distinction desired in a customer-facing search categorisation, nor for your sales analytics. So you need to place every product in yet another category system, which again often overlaps with the others, but also often doesn’t…

The underlying problem

Really, there are three different demands on your product categorisation:

Users want to use it to find the things they want.
You want a categorisation for your sales analytics.
Your export department needs to use the categorisations imposed on them by export legislation and tariffs.

Although these demands lead to broadly similar categories, they inevitably differ in certain details. A single classification system attempting to meet all those demands will be full of fine-grained distinctions that one group of users requires but that are meaningless to the others. Any item that sits on a boundary between categories, and so needs to be treated as belonging to different categories in different contexts, will have to be put in a special category of its own. Any one categorisation that tries to meet all these needs will be a mess.

Some solutions

But this doesn’t mean you need three different category systems, largely overlapping, differing only around edge cases.

If your software permits it, you could give each product a classification in a single super-categorisation system that captures every distinction that matters - which then has a mapping to each of the other categorisations.
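As a rough illustration of this first approach, here is a minimal Python sketch. Everything in it is an assumption made for the example: the SKUs, the master category names, and the search, sales and tariff labels are invented placeholders (the tariff labels in particular are not real tariff codes), and a real system would most likely keep these mappings in database tables rather than in dictionaries.

```python
# Hypothetical master categories. Each one is fine-grained enough to capture
# every distinction that any of the three systems cares about, and maps to
# exactly one category in each system. All names are invented for illustration.
MASTER_TO_SYSTEMS = {
    "wooden-patio-chair": {
        "search": "Patio Furniture",
        "sales": "Garden",
        "tariff": "furniture-wood",
    },
    "metal-patio-chair": {
        "search": "Patio Furniture",
        "sales": "Garden",
        "tariff": "furniture-metal",
    },
    "paddling-pool": {
        "search": "Garden Furniture",
        "sales": "Garden",
        "tariff": "plastic-goods",
    },
}

# Each product is classified once, in the master system only.
PRODUCT_MASTER_CATEGORY = {
    "SKU-1001": "wooden-patio-chair",
    "SKU-1002": "metal-patio-chair",
    "SKU-2001": "paddling-pool",
}


def category_for(sku: str, system: str) -> str:
    """Look up a product's category in one system via its master category."""
    master = PRODUCT_MASTER_CATEGORY[sku]
    return MASTER_TO_SYSTEMS[master][system]
```

In this sketch the two patio chairs share a search category and a sales line but differ for tariff purposes, and that difference is recorded once, on the master category, rather than on every product.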

Or give everything a classification in the user-search category as it’s the easiest one to think about, and have an approximate mapping from that to the other category systems, with optional overrides on a per-product basis for the edge cases when that approximate mapping gets it wrong.
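This second approach could be sketched in Python roughly as follows. Again, this is only an illustration under stated assumptions: the `Product` class, the `DEFAULT_MAPPINGS` table, the `category_for` function and all category labels are hypothetical names invented for the example, and the tariff labels are placeholders rather than real tariff codes.

```python
from dataclasses import dataclass, field
from typing import Dict

# Approximate default mappings from the user-search category to each of the
# other systems. All labels are invented placeholders.
DEFAULT_MAPPINGS: Dict[str, Dict[str, str]] = {
    "sales": {
        "Patio Furniture": "Garden",
        "Garden Furniture": "Garden",
        "Toys": "Toys and Games",
    },
    "tariff": {
        "Patio Furniture": "furniture-wood",
        "Garden Furniture": "furniture-wood",
        "Toys": "toys",
    },
}


@dataclass
class Product:
    sku: str
    search_category: str
    # Optional per-product overrides, keyed by system name, for the edge
    # cases where the approximate mapping gets it wrong.
    overrides: Dict[str, str] = field(default_factory=dict)


def category_for(product: Product, system: str) -> str:
    """Return the product's category in the given system."""
    if system == "search":
        return product.search_category
    if system in product.overrides:
        return product.overrides[system]
    return DEFAULT_MAPPINGS[system][product.search_category]


# Most products need only a search category...
wooden_bench = Product("SKU-1001", "Patio Furniture")
# ...while an edge case carries an explicit override.
metal_bench = Product("SKU-1002", "Patio Furniture",
                      overrides={"tariff": "furniture-metal"})
```

The trade-off in this design is that the common case stays cheap, with one classification per product, while every exception is visible as an explicit override on the product that needs it.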

Conclusion

Databases will often have to file things under multiple similar yet competing classification systems; it’s an unavoidable fact of life in many problem domains. But with careful thought, a way can usually be found to minimise the headache of representing and managing these categorisations. We have extensive experience in finding the best way to represent complex data, so get in touch and let us help!

Author

Alaric Snell-Pym

Alaric is an engineer specialising in understanding complex problems and producing simple solutions. They have a wide range of experience implementing everything from line-of-business systems to distributed databases comprising thousands of nodes.