One Less Fish in the Data Lake: Achieving Cost Efficient Electronic Record Keeping

It was with sadness that I recently heard Teradata will stop the development of the RainStor archive product that it acquired less than 18 months ago.  In 2012-13, I helped a client deploy RainStor as a long-term online archive for their electronic record keeping.  The beauty of the solution was that it massively compressed data, allowing the client to migrate from high-end SAN to commodity NAS and to significantly simplify the operational procedures required to replicate and back up the data.  The project replaced a VERY large database cluster and 160TB of SAN with an array of virtual machines – the solution used 12x less storage, on disk that was 10x cheaper.  It wasn’t all roses of course; industrialising the process of loading data into RainStor was a challenge, and some heavy-duty regulatory reports required complex query optimisation before performance was acceptable.  However, the long-term benefits were obvious; the solution gave the bank a way to align the operational cost of retaining the data with the fact that the intrinsic value of the data decreases significantly with age.

Somewhat coincidentally, at the same time as hearing this news, I’d been discussing with my colleagues the impact of MiFID II on our clients’ electronic record keeping and data retention obligations.  Coming into effect in January 2018, MiFID II electronic record keeping standards require that European firms retain all electronic trading transactions for 5 years.  In the US, the SEC’s Rule 613 for a Consolidated Audit Trail imposes similar obligations on US entities.  Likewise, Dodd-Frank record keeping and reporting rules have placed a significant burden on capital markets firms, where retention requirements may exceed 30 years.

For a top investment bank, it’s not unreasonable to expect that the total number of transactions (Quotes, Orders and Fills) will run into tens of billions per year.  If, for simplicity’s sake, we take 10 billion records at 500 bytes each on average, then that’s of the order of 5TB of data per year.  Archiving to tape or offline storage media naturally carries performance and data retrieval penalties which are unlikely to be acceptable under the new trade reporting guidelines.  Storing this in a traditional database brings overheads for indexes, logs, dump files and filesystem headroom which can inflate the actual storage required more than 3x.  Then double that to ensure you have a DR copy.  So we’re now at 30TB of storage per year (that’s usable storage – multiply up again for raw disk!) which will become 150TB over the 5-year term needed to meet the regulations, assuming zero growth.  From a cost perspective, a conservative benchmark for internal chargeback for SAN is in the region of $0.50/GB/month, so that 150TB is going to set you back roughly $1M/year before we include compute and database costs.  Using native database compression features may help to alleviate these effects; however, it’s safe to say that very large data sets will be stored and managed with increased cost and complexity.
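For anyone who wants to rerun that back-of-envelope sum with their own numbers, here is a minimal sketch in Python.  The inputs are the illustrative figures from the paragraph above; the function and parameter names are my own, decimal units are assumed (1TB = 1,000GB), and the annual cost is the run rate once the full 5 years of history is held.

```python
# Back-of-envelope archive sizing, using the illustrative figures above.
# All names are hypothetical; decimal units assumed (1 TB = 1,000 GB).

def archive_estimate(
    records_per_year=10_000_000_000,  # lower end of "tens of billions"
    bytes_per_record=500,             # assumed average record size
    db_overhead=3,                    # indexes, logs, dumps, headroom
    dr_copies=2,                      # primary plus one DR copy
    retention_years=5,                # MiFID II retention period
    san_cost_per_gb_month=0.50,       # internal chargeback benchmark, USD
):
    """Return (raw TB/year, usable TB/year, total TB, annual cost in USD)."""
    raw_tb_per_year = records_per_year * bytes_per_record / 10**12
    usable_tb_per_year = raw_tb_per_year * db_overhead * dr_copies
    total_tb = usable_tb_per_year * retention_years  # assumes zero growth
    annual_cost = total_tb * 1_000 * san_cost_per_gb_month * 12
    return raw_tb_per_year, usable_tb_per_year, total_tb, annual_cost


if __name__ == "__main__":
    raw, usable, total, cost = archive_estimate()
    print(f"Raw data per year:    {raw:.0f} TB")
    print(f"Usable storage/year:  {usable:.0f} TB")
    print(f"Total over retention: {total:.0f} TB")
    print(f"Annual SAN cost:      ${cost:,.0f}")
```

With the defaults this gives 5TB raw, 30TB usable per year, 150TB in total and $900,000 a year, which is where the "roughly $1M" comes from.  Changing the record count or the overhead factor shows how quickly the bill moves.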

As an industry, we are going to have to be more creative than just throwing disks at the problem.  Given the demanding recovery and reporting requirements of regulators, it is not going to be as simple as just archiving the data offline to tape.  The industry needs a cost-effective mechanism for retaining large volumes of data online so that the entire history can be queried in near real-time.  Unfortunately, it looks like one solution to this challenge, RainStor, will no longer be available.

