What would be the best approach to handle real-time intraday data storage?
For personal research I've only ever imported historical EOD data from flat files into memory, so I don't have much experience with this. I'm currently working on a side project which would require daily stock quotes updated every minute from an external feed. For the time being, I suppose any popular database solution should handle this scenario without breaking a sweat. But I would like the adopted solution to scale easily once real-time ticks become necessary.
A similar problem has been mentioned by Marko, though it was mostly specific to R. I'm looking for a universal data store accessible both to lightweight web front-ends (PHP/Ruby/Flex) and to an analytical back-end (C++, R or Python, I don't know yet).
From what chrisaycock mentioned, column-oriented databases should be the most viable solution, and that seems to be the case.
But I'm not sure I understand all the intricacies of column-oriented storage in some exemplary usage scenarios:
- Fetching all or a subset of price data for a specific ticker for front-end charting
- Compared to row-based solutions, fetching price data should be faster because it's a sequential read. But how does storing multiple tickers in one place affect this? For example, a statement like "select all timestamps and price data where ticker is equal to something": don't I have to compare the ticker on every row I fetch? And when I have to provide complete data to some front-end application, wouldn't serving a raw flat file for the requested instrument be more efficient?
- Analytics performed in the back-end
- Things like computing single values for a stock (e.g. variance, return for the last x days) and dependent time series (daily returns, technical indicators, etc.). Fetching the input data for computations should be more efficient, as in the preceding case, but what about writing? The gain I see is bulk-writing the final result (like the value of a computed indicator for every timestamp), but I still don't know how the database handles my mashup of different tickers in one table. Does horizontal partitioning/sharding handle it for me automatically, or am I better off splitting manually into a table-per-instrument structure (which seems unnecessarily cumbersome)?
- Updating the database with new incoming ticks
- Using a row-based orientation would be more efficient here, wouldn't it? And the same goes for updating aggregated data (for example, daily OHLC tables). Won't this be a possible bottleneck?
All this is in the context of available open-source solutions. I initially thought about InfiniDB or HBase, but I've seen MonetDB and InfoBright mentioned around here too. I don't really need "production quality" (at least not yet), as mentioned by chrisaycock in the referenced question, so would any of these be a better choice than the others?
And the last issue: at approximately what load do specialized time-series databases become necessary? Unfortunately, things like kdb+ or FAME are out of scope in this case, so I'm contemplating how much can be done on commodity hardware with standard relational databases (MySQL/PostgreSQL) or key-value stores (like Tokyo/Kyoto Cabinet's B+ tree). Is that really a dead end? Should I just stick with one of the aforementioned column-oriented solutions, given that my application is not mission critical, or is even that an unnecessary precaution?
Thanks in advance for your input on this. If some part is too convoluted, let me know in a comment and I will try to amend it accordingly.
EDIT:
It seems that, strictly speaking, HBase is not a column-oriented store but rather a sparse, distributed, persistent multidimensional sorted map, so I've crossed it out of the original question.
After some research I'm mostly inclined towards InfiniDB. It has all the features I need, supports SQL (standard MySQL connectors/wrappers can be used for access) and the full DML subset. The only things missing from the open source edition are on-the-fly compression and scaling out to clusters. But I guess it's still good bang for the buck, considering it's free.
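For reference, this is roughly how I picture accessing it from the Python back-end through a stock MySQL connector (PyMySQL here just as an example; the database, table and column names are placeholders I made up):

```python
# Sketch only: InfiniDB speaks the MySQL protocol, so a standard connector
# such as PyMySQL should work. Credentials, database, table and column
# names below are hypothetical.
import pymysql

conn = pymysql.connect(host="localhost", user="quant",
                       password="secret", db="marketdata")
try:
    with conn.cursor() as cur:
        # One trading day of minute bars for a single ticker, for charting.
        cur.execute(
            """
            SELECT ts, `open`, `high`, `low`, `close`, volume
            FROM minute_bars
            WHERE symbol = %s AND ts >= %s AND ts < %s
            ORDER BY ts
            """,
            ("AAPL", "2011-05-02 09:30:00", "2011-05-02 16:00:00"),
        )
        bars = cur.fetchall()
finally:
    conn.close()
```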
Column-oriented storage is faster for reading because of cache efficiency. Looking at your sample query:
select price, time from data where symbol = `AAPL
Here I'm concerned with three columns: price, time, and symbol. If all ticks were stored by row, the database would have to read through all rows just to search for the symbols. It would look like this on disk:
IBM | 09:30:01 | 164.05; IBM | 09:30:02 | 164.02; AAPL | 09:30:02 | 336.85
So the software must skip over the price and time entries just to read the symbols. That would cause a cache miss for every tick!
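To make that concrete, here is a toy Python sketch of the row-wise scan, with a list of tuples standing in for the on-disk rows (just the three ticks above):

```python
# Row-oriented toy layout: each record holds (symbol, time, price).
# Filtering on symbol still drags the time and price fields of every
# row through the cache, even though only the symbol is tested.
rows = [
    ("IBM",  "09:30:01", 164.05),
    ("IBM",  "09:30:02", 164.02),
    ("AAPL", "09:30:02", 336.85),
]

result = [(price, time) for (symbol, time, price) in rows if symbol == "AAPL"]
print(result)   # [(336.85, '09:30:02')]
```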
Now let's look at the column-oriented storage:
IBM | IBM | AAPL; 09:30:01 | 09:30:02 | 09:30:02; 164.05 | 164.02 | 336.85
Here the database can sequentially scan the symbol list. This is cache efficient. Once the software has the array indices that represent the symbol locations of interest, the database can jump to the specific time and price entries via random access. (You may notice that the columns are actually associative arrays; the first element in each column refers to the first row in aggregate, so jumping to the Nth row means simply accessing the Nth element in each array.)
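The same query against a toy column layout looks like this (plain Python lists standing in for the column files); note how the time and price columns are only touched by direct indexing after the sequential scan of the symbol column:

```python
# Column-oriented toy layout: three parallel arrays, where element i of
# every column belongs to the same logical row i.
symbol = ["IBM", "IBM", "AAPL"]
time   = ["09:30:01", "09:30:02", "09:30:02"]
price  = [164.05, 164.02, 336.85]

# One sequential scan of the symbol column yields the matching row indices...
idx = [i for i, s in enumerate(symbol) if s == "AAPL"]

# ...and the other columns are then read by jumping straight to those indices.
result = [(price[i], time[i]) for i in idx]
print(result)   # [(336.85, '09:30:02')]
```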
As you can imagine, column-oriented storage really shines during analytics. To compute the moving average of the prices per symbol, the database will index-sort the symbol column to determine the proper ordering of the price entries, and then begin the calculation with the prices in contiguous (sequential) layout. Again, cache efficient.
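As a rough illustration of that last step, assuming one symbol's prices have already been gathered into a contiguous NumPy array (the numbers here are made up):

```python
# A trailing moving average over a contiguous price array is a single
# sequential pass, exactly the access pattern caches like.
import numpy as np

prices = np.array([164.05, 164.02, 164.10, 164.20, 164.15])
window = 3

# Average of each 3-tick window: [164.0567, 164.1067, 164.15]
moving_avg = np.convolve(prices, np.ones(window) / window, mode="valid")
print(moving_avg)
```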
Beyond the column-oriented layout, many of these newer databases also keep everything in memory when performing calculations. That is, if the data set is small enough, the software will read the entire tick history into memory, which eliminates page faults when running queries. Thus, it never has to touch the disk!
A second optimization, which kdb+ performs, is automatic enumeration of text. (This feature is inspired by Lisp symbols.) So searching for a particular stock does not involve typical string searching; it's simply an integer search after the initial enumeration lookup.
With the sequential storage, in-memory allocation, and the automatic text enumeration, searching for a symbol is really just scanning for an integer in an array. That's why a database like kdb+ is a few orders of magnitude faster than common relational databases for reading and analytics.
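Here is a toy Python version of the enumeration idea; kdb+'s actual mechanism is its own, but the intern-once, compare-integers principle is the same:

```python
# Intern each symbol string once, so per-tick storage and lookups
# deal only with small integers instead of strings.
symtab = {}                                  # string -> integer id

def intern(sym):
    return symtab.setdefault(sym, len(symtab))

ticks = ["IBM", "IBM", "AAPL", "MSFT", "AAPL"]
symbol_col = [intern(s) for s in ticks]      # [0, 0, 1, 2, 1]

# "where symbol = AAPL" becomes a scan for one integer, not string compares.
target = symtab["AAPL"]
matches = [i for i, s in enumerate(symbol_col) if s == target]
print(matches)   # [2, 4]
```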
As you've pointed out in your question, writing is a weakness of column-oriented storage. Because each column is an array (in memory) or a file (on disk), changing a single row means updating each array or file individually, as opposed to simply streaming the entire row at once. Furthermore, appending data in memory or on disk is pretty straightforward, as is updating/inserting data in memory, but updating/inserting data on disk is practically impossible. That is, the user can't change historical data without some massive hack.
For this reason, historical data (stored on-disk) is often considered append-only. In practice, column-oriented databases require the user to adopt a bitemporal or point-in-time schema. (I advise this schema for financial applications anyway for both better time-series analysis and proper compliance reporting.)
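A minimal sketch of what that append-only, point-in-time style looks like (field names and the correction scenario are hypothetical): a correction is appended under a new as-of timestamp, and queries pick the latest version known at a given time instead of overwriting history:

```python
# Each record carries both the event time and the time we learned of it
# (the as-of time). Corrections are new appended rows, never updates.
ticks = []   # (event_time, as_of_time, symbol, price)

ticks.append(("09:30:02", "09:30:02", "AAPL", 336.85))   # original print
ticks.append(("09:30:02", "16:45:00", "AAPL", 336.80))   # later correction

def price_as_known(event_time, sym, as_of):
    """Price for (event_time, sym) using only rows known by `as_of`."""
    known = [t for t in ticks
             if t[0] == event_time and t[2] == sym and t[1] <= as_of]
    return max(known, key=lambda t: t[1])[3] if known else None

print(price_as_known("09:30:02", "AAPL", "10:00:00"))   # 336.85 (pre-correction)
print(price_as_known("09:30:02", "AAPL", "17:00:00"))   # 336.8  (post-correction)
```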
I don't know enough about your application to determine performance or production-level requirements. I just hope the above guide will help you make an informed decision with regard to why column-oriented storage is often your best bet for analytics.