Saturday, October 9, 2010

Behold the ideal database tooling combination

As I've mentioned earlier, I've spent most of my time in the last year managing data instead of developing strategies. Its a dirty job... but somebody's gonna do it  and with the right tooling somebody has already done it for you.
In my search for the ideal database tooling I've tried native matlab files, excel, MySql, Access and plain csv, before settling for far-from-ideal xml - *.mat combination. Still, a lot of programming is required to manage data this way.
A couple of days ago I finally found the solution: SQLite + mksqlite . Both packages are simply excellent: a serverless database contained in a single file plus a very fast matlab interface. The best thing with this combination is that one does not need a db server or any drivers to take full advantage of an SQL database.
Thanks to the open source community, I can now really start focusing on strategy development instead of data management.


  1. Another vote for sqlite. I use it with python and R (and my web UI which is in Django).

    I also use memcached (very simple key/value cache which runs as a separate server) to cache intermediate strategies, indicators, etc while I'm testing additional stuff. I could just use a python dictionary but, find memcached more flexible (I don't have to worry about caching within Python modules, cleaning up cache, etc, plus its useful with my web UI too).

  2. I'd suggest going with HDF5. It's supported by Matlab, R, Java, C++, Fortran, Python and perhaps some other platforms/languages. When using built-in gzip compression filter it easily achives over 1:20 compression ratio for time series data represented as array of doubles. Performance is staggering, esp. when compared to any relational DB solution.

    At first glance the HDF5 data model may look tricky but all the code you need to manage your data won't exceed a few hundreds of lines.

    Of course if you don't operate on intraday data then even a plain text files as storage will be fine.

  3. @wburzyns: Thanks for the tip, this could be the way for dealing with really huge datafiles. For now I'm sticking with SQL untill I run into performance issues. Just because it is easy.

  4. Could you elaborate on how you use SQLite with Matlab (e.g., what data do you store, how do you store it (schema), how do you maintain stored historical data for name/ticker changes/splits, etc., how do you update the data/add new data)?