Thursday, March 18, 2010

Intraday data handling can be easy, when you've got the right tools.

The title of this post could be just as well 'In praise of Matlab' ( or possibly SciLab or R, but not Excel). Yesterday it took me only a couple of hours to complete a task that otherwise would be almost impossible to achieve without the right tooling. I was so excited about how easy it turned out to be that I decided to share my experience.
For some time now I was going to take a look at Forex intraday data. It tempted me because of the law of  'large numbers'. A model should be allowed to make a couple of hundred of trades before its profitability can be estimated reliably. For a swing-trading system this should mean at least half a year paper trading and I just don't like to wait. In case of an intraday system, one only needs a couple of days!  Forex data is freely available (intraday data archive)   making it a good candidate to test some intraday strategies.
But once you download some data, a problem becomes obvious: the data is sampled at irregular intervals, making it very difficult to compare one dataset to another. Take a look at the beginning of the data files:
USD/CHF:
1051976017,D,USD/CHF,2010-01-31 17:04:45,1.060100,1.060600
1051976080,D,USD/CHF,2010-01-31 17:05:20,1.060200,1.060600
1051976102,D,USD/CHF,2010-01-31 17:05:22,1.060100,1.060600
1051976119,D,USD/CHF,2010-01-31 17:05:22,1.060200,1.060700
------------------------------------------------------------------------

The sampling time seems to vary between less than a second to more than a minute! Even if you don't want to compare these two datasets, the data is still unusable for a backtest.

Now imagine trying to align these datasets in Excel. If somebody has an idea, please let me know ;-).
However, in a technical software tool (like Matlab ) there should be an interpolation function.
In Matlab it is the wonderful interp1 . It accepts available data along with a new vector of time values for interpolation.
I've used a 10-second interval for interpolation, while synchronizing start times between different datasets. The result is a neat matrix with prices at even time periods. Of course this introduces an error , but looking at the interpolated data (see graph below, deeply zoomed in), less than 1/2 pip, nothing serious.

Again, it gives me a headache just thinking about having to complete this task without the right tooling.