Having been asked to speak on a panel at the Data Management Summit in London last week, I've been mulling over something said by one of my fellow panellists. When asked to define ‘data quality’ he turned the phrase around and said that our focus should really be on ‘quality data’—achieving a level of data quality that is ‘fit for purpose’.
That still leaves an open question: how do you define what’s ‘fit for purpose’? And therein lies the challenge. What is ‘fit for purpose’ can be subjective. One man’s ‘quality data’ may not be another’s. For example, there may be subtle nuances in the way different vendors calculate yield curves, so instead of an objective single version of the truth, the argument over which source to trust may boil down to subjective preference or opinion. Who is the best data provider for swaps pricing or bond yields? Who offers the most accurate corporate actions data? In the absence of definitive answers to those questions, perhaps the best we can hope for is consensus.
But reaching consensus on anything requires good governance – which segues nicely into another question that we discussed on the panel. Do most institutions have the right level of data governance in place? Is ownership clearly defined? Is there board-level sponsorship for data quality initiatives? And when there are compromises to be made in defining what a single source should look like, has anyone got big enough boots to make that happen?
While those questions are all fairly open-ended, there is still a lot we can do in the meantime to objectively improve levels of data quality. Cleaning up clearly erroneous data, filling in fields that are incomplete, standardising taxonomies to make sure everything is described consistently – all of these exercises are bread and butter to data managers.
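As a minimal sketch of those bread-and-butter exercises – where the record fields, validity rules and taxonomy mapping are all hypothetical examples, not any particular firm’s standard – the three steps might look like:

```python
# Map vendor-specific asset-class labels onto one standard taxonomy
# (illustrative values only).
TAXONOMY = {
    "corp bond": "CORPORATE_BOND",
    "corporate": "CORPORATE_BOND",
    "govt": "GOVERNMENT_BOND",
    "gilt": "GOVERNMENT_BOND",
}

def clean_record(record: dict) -> dict:
    cleaned = dict(record)
    # 1. Clean up clearly erroneous data: a negative price is impossible,
    #    so null it out rather than let it flow downstream.
    if cleaned.get("price") is not None and cleaned["price"] < 0:
        cleaned["price"] = None
    # 2. Fill in fields that are incomplete with an explicit default.
    if not cleaned.get("currency"):
        cleaned["currency"] = "UNKNOWN"
    # 3. Standardise the taxonomy so everything is described consistently.
    raw = (cleaned.get("asset_class") or "").strip().lower()
    cleaned["asset_class"] = TAXONOMY.get(raw, "UNCLASSIFIED")
    return cleaned

record = {"price": -99.5, "currency": "", "asset_class": "Corp Bond"}
print(clean_record(record))
# {'price': None, 'currency': 'UNKNOWN', 'asset_class': 'CORPORATE_BOND'}
```

The point of the sketch is that each of the three steps is mechanical and rule-driven – exactly the kind of work that tooling can take over.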
Thankfully, there is now a growing collection of tools and techniques to help us do those things. And those tools are constantly evolving. From big data storage and retrieval and natural language processing through to artificial intelligence, perhaps the day will come when many data quality processes can be automated.
For now, resolving most issues relating to data quality remains manually intensive. Being able to rely on a centralised enterprise data management team, or even an external utility, might help reap economies of scale and ensure data only has to be cleansed, verified and enriched once. Even so, ensuring that a single source gets adopted across the enterprise is something that many firms struggle with, as my colleague Ilya Finkelshteyn pointed out in his recent blog Fixing Reference Data Distribution.
Looking forward, one trend that offers a potential solution to our woes lies in the use of metadata to provide increasingly rich descriptions of data items. Ultimately, the goal is to have datasets that are self-describing. That means being able to ascertain provenance and lineage (where data has come from), how it has been transformed or enriched and by whom, and maybe even information relating to licensing terms and conditions to ensure there are no misunderstandings regarding usage costs.
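To make the idea of a self-describing dataset concrete, here is a hypothetical sketch – the field names are illustrative, not drawn from any standard metadata schema – that captures provenance, a lineage trail of transformations and who applied them, and licensing terms:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DatasetMetadata:
    """Illustrative self-describing metadata for a dataset."""
    source: str                                   # provenance: where the data came from
    lineage: list = field(default_factory=list)   # transformations applied, and by whom
    licence: str = "unspecified"                  # usage terms, to avoid cost misunderstandings

    def record_step(self, step: str, actor: str) -> None:
        """Append a transformation/enrichment step to the lineage trail."""
        self.lineage.append({
            "step": step,
            "actor": actor,
            "at": datetime.now(timezone.utc).isoformat(),
        })

meta = DatasetMetadata(source="VendorA yield curves", licence="internal-use-only")
meta.record_step("filled missing tenors by interpolation", "edm-team")
print(meta.lineage[0]["step"])  # filled missing tenors by interpolation
```

A consumer of this dataset can then answer the questions posed above – where did it come from, what was done to it, and what may it be used for – without chasing down the team that produced it.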
Maintaining rich and accurate metadata can drive more informed debate around what constitutes ‘fit for purpose’, potentially helping to drive consensus. Perhaps it could even negate the need for a single ‘golden source’, allowing an enterprise to agree on definitions of ‘quality data’ that are fit for multiple purposes. The front office and back office could agree to disagree, using different sources or methodologies to calculate a particular data item, yet still reconcile discrepancies easily using the metadata that describes each source. That tension could ultimately create its own system of checks and balances. After all, if everyone uses the same ‘golden source’ and that source ends up being wrong, the repercussions are much more significant.
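The agree-to-disagree idea can be sketched in a few lines. In this hypothetical example (the source names, values and tolerance fields are all invented for illustration), each source’s metadata declares its methodology and a tolerance within which discrepancies are considered reconcilable:

```python
# Two desks use different sources for the same data item; each source's
# metadata declares the methodology used and an acceptable tolerance.
sources = {
    "front_office": {"value": 2.847, "methodology": "vendor_a_bootstrap", "tolerance": 0.01},
    "back_office":  {"value": 2.851, "methodology": "vendor_b_spline",    "tolerance": 0.01},
}

def reconcile(a: dict, b: dict) -> bool:
    """Agree to disagree: flag only differences beyond the declared tolerances."""
    allowed = max(a["tolerance"], b["tolerance"])
    return abs(a["value"] - b["value"]) <= allowed

ok = reconcile(sources["front_office"], sources["back_office"])
print("reconciled" if ok else "investigate")  # reconciled
```

Because the tolerance lives in the metadata rather than in anyone’s head, a difference that breaches it becomes an objective trigger for investigation – the checks-and-balances effect described above.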
So whether it’s data quality or quality data that you’re striving for, metadata may well hold the key to your objectives.