Regulatory demands for data storage, processing and retrieval have fuelled the demand for data lake solutions. Increasingly, businesses see these as a means to drive new business ideas. Citihub Consulting speaks to Associate Partner Paul Jones about data lake challenges and opportunities. Paul is one of the firm’s most senior software development consultants. He has previously delivered highly successful client engagements on market data inventories, exchange connectivity mapping databases and, most recently, a data lake solution for MiFID II transactional data, including Best Execution reporting.
What does a data lake mean to you?
“Data Lake” is a term we associate with Big Data – a set of data that is too big to be processed on a single machine. To process Big Data, we typically combine the storage, memory and compute resources of potentially thousands of machines that can work in tandem. This is just the infrastructure of a data lake, though: there is more to it.
A data lake is a paradigm where any data, particularly data which is unstructured, can be stored relatively cheaply.
In a traditional database, a schema is designed in advance. To ingest data, we must map to the structure we design, which comes at a cost. A common data lake approach is to store data in its raw form and apply what’s called “schema on read” – i.e. only impose a structure on the data when we decide to use it for something. This makes ingestion simpler whilst offering flexibility in selecting the specific data and structure we need for a given use case.
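As a rough illustration of schema on read, the sketch below stores raw JSON lines untouched and only imposes types when a particular use case reads them. The field names (`order_id`, `qty`, `price`, `venue`), the `NOTIONAL_SCHEMA` mapping and the `read_with_schema` helper are all hypothetical, chosen purely for this example.

```python
import json

# Raw records are kept exactly as received; nothing is validated on write.
RAW_LINES = [
    '{"order_id": "A1", "qty": "100", "price": "10.5", "venue": "XLON"}',
    '{"order_id": "A2", "qty": "250", "price": "9.75"}',
]

# A schema here is simply the fields, and their types, that one use case needs.
NOTIONAL_SCHEMA = {"order_id": str, "qty": int, "price": float}


def read_with_schema(lines, schema):
    """Parse raw lines and impose the given schema at read time."""
    for line in lines:
        record = json.loads(line)
        # Fields outside the schema are ignored; missing fields become None.
        yield {
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        }


rows = list(read_with_schema(RAW_LINES, NOTIONAL_SCHEMA))
notional = sum(row["qty"] * row["price"] for row in rows)
```

A different use case could read the same raw lines with a different schema, for example one that keeps `venue`, without any change to what was ingested.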
This doesn’t imply that a data lake is unstructured or ungoverned – significant effort should go into designing the processes by which data is moved into and through a lake, into ensuring data quality and good data lineage, into tracking where data is stored, etc.
A data lake is not just about data storage; storage is simply at the bottom of the data lake technology stack. A data lake should also provide the ability to query, analyse and process data.
A common data lake principle is not to delete raw data – you never know when you might need to reprocess what you’ve got, or process for a different use case. It’s an important principle, but it often presents design challenges that are more difficult to overcome than they are with a traditional database.
What do you think a data lake means to other people?
Not so long ago, it seemed that the data warehouse was the hot topic amongst firms. A data lake is not a data warehouse, although both are sometimes considered a place to dump data with little concern for further processing. For example, if I were to need FIX messages or a log of the actions that a user made on my website, I should be able to get them from a data lake.
But a data lake should be much more than a fire-and-forget store – raw data should be available, but possibly not to every user. A data lake should make available data that is structured, cleaned, enriched and easy to query.
A data lake should also work with other components in an overall reporting architecture. For example, it may not be the most suitable choice for real-time reporting.
Describe the foundational components of a data lake.
A data lake is not synonymous with Hadoop, but it’s difficult to discuss a data lake without mentioning it. There are alternatives, but the Hadoop ecosystem is rich, widely used and offered by multiple enterprise software vendors, and it provides a good illustration of the stack.
Hadoop starts with its distributed, resilient filesystem (HDFS). In a distributed filesystem, a dataset can be seamlessly stored across multiple machines. With resiliency baked in, it becomes more acceptable to use commodity hardware, accepting that hardware failures will occur. High volumes of data can be stored reliably and cheaply.
When we have large volumes of distributed data, we need a means of processing it efficiently. This is where MapReduce (a programming model proposed by Google) originally came in. If we have a computation that needs to run across a distributed dataset, we figure out a way of splitting the task into chunks. Each chunk can then be processed independently, with results collected at the end.
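The idea can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce steps using a word count, not the Hadoop API; the function names and the sample chunks are assumptions made for the example.

```python
from collections import defaultdict


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


# Each inner list stands in for a chunk held on a different machine.
chunks = [
    ["buy buy sell"],
    ["sell buy", "hold"],
]

# Chunks are mapped independently; results are only combined at the end.
mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
counts = dict(reduce_group(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster the map calls would run in parallel on the machines that hold each chunk, and the shuffle would move data across the network.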
HDFS & MapReduce were the foundation, but ‘Hadoop’ has since grown to mean any of the number of projects and subprojects that now make up the ‘Hadoop ecosystem’.
A number of these projects arose because of some of the disadvantages of MapReduce & HDFS – MapReduce can be relatively slow, and it is an unfamiliar model to those without experience. Later projects sought to abstract away from MapReduce, trying to bring the experience closer to that of a traditional database or programming model. Now, there is less of a need to write ‘pure’ MapReduce – projects like Hive, Pig, Spark and Impala allow ‘higher level’ programs to be run on top of HDFS, often using normal SQL, sometimes with significant performance improvements.
What things do you know now that you wish you had known when building out the lake?
With any technology implementation, it can be the edge cases that end up being costly – edge cases and technical limitations that are hidden in the small print of the manual. A data lake build is no different. However, in my experience, there are some data lake considerations that should be given specific attention when you start out.
There are some difficulties that are unique to Big Data. For example, in a traditional database, we can make liberal use of indexes to ensure that reads and writes are fast – indexes are simple, well understood and well supported in all RDBMS platforms. At scale, things are not quite so easy. Indexing may not be supported or may only be possible on a single key. Queries tend to need to be planned in advance, with the data structure designed according to the queries that it needs to support. The key is to avoid designing yourself into a corner.
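To make the point about planning queries in advance concrete, here is a minimal sketch in which records are laid out by a single partition key. The `trade_date` key, the placeholder ISIN labels and the `query` helper are hypothetical; the sketch only counts how many partitions each query has to read.

```python
from collections import defaultdict

PARTITION_KEY = "trade_date"  # assumed: the one key the layout is designed around

trades = [
    {"trade_date": "2018-01-03", "isin": "ISIN-A", "qty": 100},
    {"trade_date": "2018-01-03", "isin": "ISIN-B", "qty": 50},
    {"trade_date": "2018-01-04", "isin": "ISIN-A", "qty": 75},
    {"trade_date": "2018-01-05", "isin": "ISIN-C", "qty": 20},
]


def partition_by(records, key):
    """Lay records out in one partition per distinct value of the key."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record)
    return partitions


def query(partitions, field, value):
    """Return matching rows and the number of partitions that had to be read."""
    if field == PARTITION_KEY:
        # Planned query: go straight to the one relevant partition.
        return list(partitions.get(value, [])), 1
    # Unplanned query: every partition has to be scanned.
    matches = [r for rows in partitions.values() for r in rows if r[field] == value]
    return matches, len(partitions)


by_date = partition_by(trades, PARTITION_KEY)
date_rows, date_reads = query(by_date, "trade_date", "2018-01-03")
isin_rows, isin_reads = query(by_date, "isin", "ISIN-A")
```

With three partitions the difference is trivial, but the same pattern across years of data is what turns an unplanned query into a full scan.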
Another example is the seemingly easy task of deleting data. There is a philosophical argument that data lake data shouldn’t be deleted at all, but philosophy can easily lose in a regulated environment where confidential data must be kept confidential. Some components, for example, earlier versions of Hive, do not support deletion of individual rows.
There is a mantra of “storage is cheap”. Whilst it might be cheap to buy a disk, there is still an overhead to get hardware into corporate data centres.
Finally, do not underestimate compute requirements. When estimating the number of terabytes of storage a data lake might need, try to also estimate the compute needed to ingest, refine and query your data. This is much more difficult to do.
Once a firm has a data lake, what things are they starting to use them for that they did not plan on during inception?
The payback on a data lake investment is not that unexpected. MiFID II Order Recordkeeping and periodic reporting needs may have pushed firms towards a data lake but in many cases, businesses always had other regulatory and analytical use cases in mind.
MiFID II has meant that consistent instrument (ISIN) and client (LEI) identifiers are now likely to be available, which opens up a wealth of analytical possibilities.
The kind of data ingested into a data lake within a bank (transactions, orders, quotes, market data) is for the first time stored in one place and gives potential users huge flexibility in querying. For example, whilst MiFID II may not require it, it is not a big leap to include additional data alongside the regulatory information – such as the terminal state of an RFQ. This could turn the data, ostensibly for a regulatory report, into a tool for analysing and improving actual trading performance and P&L.
What Governance needs to be in place for a data lake?
Taking the example of MiFID II, for some, a data lake might be a dumping ground for regulatory data that then becomes someone else’s problem when that same data is needed by a regulator. An ungoverned data lake is of limited use. Best practice is to manage the data lake using zones. This is a common concept (https://dzone.com/articles/data-lake-governance-best-practices); however, the exact number and purpose of the zones can vary from data lake to data lake.
For each piece of data ingested into a data lake, a typical lifecycle could be Raw, then Refined, then Enriched, then Trusted.
In a model like this:
Raw – “raw means raw” – the literal copy. This should never be modified – it might be needed in its raw form again in the future. The raw form might be XML, CSV, JSON, FIX, internal messaging formats, tag-value pairs, perhaps even some compressed internal binary format.
Refined – raw data should be cleaned and normalised. If data was in a binary or key/value pair format, it can be unpacked into columns.
Enriched – data sources can be combined and enriched with external content, ready for final presentation. Here, data is still typically for ‘internal use only’.
Trusted – at the top of the data hierarchy, data can be masked and permissioned according to the desired security model. It can be sliced into views to serve specific use cases. Most users of the data lake would be expected to use the Trusted Zone.
A well-designed and operated data lake is much more than just raw storage. Data should move through each zone, and each zone should have appropriate entitlements and access restrictions.
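The zone lifecycle described above can be sketched as a chain of small functions, one per zone. Everything specific here is an assumption for illustration: the pipe-delimited raw format, the field names, the `REFERENCE` lookup table and the simple masking rule are hypothetical and not taken from any particular implementation.

```python
# Raw zone: the literal copy, never modified.
RAW = "client=ACME Ltd|symbol=vod.l|qty=100|price=1.25"

# Hypothetical external reference data used for enrichment.
REFERENCE = {"VOD.L": {"currency": "GBP"}}


def refine(raw):
    """Refined zone: unpack key/value pairs into typed, normalised columns."""
    fields = dict(pair.split("=", 1) for pair in raw.split("|"))
    return {
        "client": fields["client"].strip(),
        "symbol": fields["symbol"].upper(),
        "qty": int(fields["qty"]),
        "price": float(fields["price"]),
    }


def enrich(refined, reference):
    """Enriched zone: combine with reference data and derive new columns."""
    enriched = dict(refined)
    enriched.update(reference.get(refined["symbol"], {}))
    enriched["notional"] = refined["qty"] * refined["price"]
    return enriched


def to_trusted(enriched, allowed_fields):
    """Trusted zone: mask confidential values and expose only a permitted view."""
    masked = dict(enriched, client="***")
    return {name: masked[name] for name in allowed_fields}


refined = refine(RAW)
enriched = enrich(refined, REFERENCE)
trusted = to_trusted(enriched, ("client", "symbol", "notional", "currency"))
zones = {"raw": RAW, "refined": refined, "enriched": enriched, "trusted": trusted}
```

Each function returns a new value rather than changing its input, so the raw copy remains available for reprocessing if a later zone's logic changes.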
Key Lessons Learnt
Building a successful data lake requires the same discipline, forethought and technical skill as any other complex technology project. Key considerations Citihub Consulting recommend for successful data lake projects are:
1. “Shift Left” on Security: include security teams at the design stage, integrate authentication and entitlements as early as possible.
2. Be ruthless in seeking code efficiencies: seemingly simple operations can have a high cost when run on large datasets; jobs can start to fail as datasets grow.
3. Make an early investment in housekeeping: for example, Hadoop prefers fewer, larger files – be ready with a strategy to coalesce small files before they can cause an impact.
4. Treat logging and monitoring as first class: a distributed environment can present some unique challenges, and frameworks such as Apache Spark can be too verbose if left unchecked.
5. Design data models carefully: when large joins are expensive and rows cannot be easily updated (or updated at all), problems cannot always be solved in the way they would with an RDBMS.
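As one possible starting point for the housekeeping in point 3, the sketch below plans which small files could be coalesced together. It only produces a plan from file names and sizes; it does not touch a filesystem. The 128 MB target, the file names and the `plan_compaction` helper are assumptions for illustration.

```python
MB = 1024 * 1024
TARGET_BYTES = 128 * MB  # assumed target size for a coalesced file


def plan_compaction(file_sizes, target_bytes=TARGET_BYTES):
    """Group small files into batches whose combined size stays within the target."""
    batches, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        if size >= target_bytes:
            continue  # already large enough, so leave it alone
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    # A batch of one file gains nothing from being rewritten.
    return [batch for batch in batches if len(batch) > 1]


files = {
    "part-0001": 40 * MB,
    "part-0002": 50 * MB,
    "part-0003": 60 * MB,
    "part-0004": 200 * MB,
    "part-0005": 10 * MB,
}
plan = plan_compaction(files)
```

This greedy grouping is deliberately simple; a production job would also need to consider partition boundaries and files that are still being written.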