Sunday, December 30, 2012

PCA - how it really works

I suppose that my previous post did not provide insights on how PCA really works. Here is another try at the subject, using a simple pair as an example.
Let's take SPY and IWM, which are highly correlated. If daily returns of IWM are plotted against daily returns of SPY, the relationship is highly linear (see left chart).
Applying PCA to this data gives two principal component vectors, plotted in red (first) and green (second). These two vectors are orthogonal, with the first one pointing in the direction of highest variance. The transformed data is nothing more than the original data projected onto the new coordinate axes formed by these two vectors. The transformed data is shown in the right chart. As you can clearly see, all points are still there, but the dataset is rotated.
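To make the rotation concrete, here is a minimal sketch in plain Python. It assumes synthetic daily returns (not real SPY/IWM data) generated with a hypothetical beta of 1.2, and the function name `pca_2d` is made up for this example. For two series, the principal axes can be read off the 2x2 covariance matrix directly.

```python
import math
import random


def pca_2d(x, y):
    """PCA for two equally long series via their 2x2 covariance matrix.

    Returns (v1, v2, scores): the two unit component vectors and the
    centered data expressed in the new coordinate system.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]

    # Sample covariance matrix [[a, b], [b, c]]
    a = sum(v * v for v in xc) / (n - 1)
    c = sum(v * v for v in yc) / (n - 1)
    b = sum(p * q for p, q in zip(xc, yc)) / (n - 1)

    # Angle of the direction of highest variance for a symmetric 2x2 matrix
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    v1 = (math.cos(theta), math.sin(theta))   # first component
    v2 = (-math.sin(theta), math.cos(theta))  # second component, orthogonal to v1

    # Project every centered point onto the two new axes
    scores = [(p * v1[0] + q * v1[1], p * v2[0] + q * v2[1])
              for p, q in zip(xc, yc)]
    return v1, v2, scores


# Synthetic stand-ins for daily returns (assumption: beta of 1.2 plus noise)
random.seed(0)
spy = [random.gauss(0.0, 0.01) for _ in range(250)]
iwm = [1.2 * r + random.gauss(0.0, 0.003) for r in spy]

v1, v2, scores = pca_2d(spy, iwm)
print("first component: ", v1)
print("second component:", v2)
```

The scores are the same points seen from the rotated axes: each point keeps its distance from the mean, and nearly all of the variance ends up on the first axis.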
In this case the second vector is -0.78 SPY + 0.62 IWM, which produces a market-neutral spread. Of course, much the same result would be achieved by using the beta of IWM.
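A rough way to check the link between the second component and beta is to compute both on the same data. The sketch below again assumes synthetic returns with a hypothetical beta of 1.2, so the numbers will not match real SPY/IWM data; the helper `cov` and all variable names are illustrative only.

```python
import math
import random


def cov(x, y):
    """Sample covariance of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((p - mx) * (q - my) for p, q in zip(x, y)) / (n - 1)


# Synthetic stand-ins for daily returns (assumption: beta of 1.2 plus noise)
random.seed(1)
spy = [random.gauss(0.0, 0.01) for _ in range(250)]
iwm = [1.2 * r + random.gauss(0.0, 0.003) for r in spy]

a, b, c = cov(spy, spy), cov(spy, iwm), cov(iwm, iwm)

# Second principal component: orthogonal to the direction of highest variance
theta = 0.5 * math.atan2(2.0 * b, a - c)
w_spy, w_iwm = -math.sin(theta), math.cos(theta)

pca_ratio = -w_spy / w_iwm   # units of SPY shorted per unit of IWM held long
ols_beta = b / a             # regression beta of IWM on SPY

# The spread is the second component applied to the two return series
spread = [w_spy * s + w_iwm * i for s, i in zip(spy, iwm)]

post_ratio = 0.78 / 0.62     # ratio implied by the vector quoted above, about 1.26

print("PCA hedge ratio:", pca_ratio)
print("OLS beta:       ", ols_beta)
print("spread variance:", cov(spread, spread), "vs IWM variance:", c)
```

On this kind of data the two ratios come out close to each other but not identical, since PCA minimizes distances perpendicular to the fitted line while the regression minimizes vertical distances.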
The fun thing about PCA is that it is also useful for building spreads with three or more legs. The procedure is exactly the same as above, but the transformation is done in a higher-dimensional space.
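As an illustration of the higher-dimensional case, here is a sketch for three legs in plain Python. Everything in it is an assumption made for the example: the three return series are synthetic (one common market factor with hypothetical betas of 1.0, 1.2 and 0.9), the eigenvectors come from a small hand-written Jacobi rotation routine rather than a library call, and taking the lowest-variance component as the spread is a choice of this sketch.

```python
import math
import random


def covariance_matrix(series):
    """Sample covariance matrix of a list of equally long series."""
    n = len(series[0])
    centered = [[v - sum(s) / n for v in s] for s in series]
    return [[sum(p * q for p, q in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]


def jacobi_eigh(matrix, sweeps=30):
    """Eigenvalues and eigenvectors (columns of V) of a small symmetric matrix."""
    n = len(matrix)
    A = [row[:] for row in matrix]
    V = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the off-diagonal element A[p][q]
                diff = A[q][q] - A[p][p]
                theta = 0.5 * math.atan(2.0 * A[p][q] / diff) if diff != 0.0 else math.pi / 4
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of A
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q of A
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # accumulate the rotation into V
                    vkp, vkq = V[k][p], V[k][q]
                    V[k][p], V[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [A[i][i] for i in range(n)], V


# Synthetic returns: one market factor, three legs with hypothetical betas
random.seed(2)
market = [random.gauss(0.0, 0.01) for _ in range(250)]
betas = [1.0, 1.2, 0.9]
legs = [[beta * m + random.gauss(0.0, 0.002) for m in market] for beta in betas]

C = covariance_matrix(legs)
eigvals, V = jacobi_eigh(C)

# Use the component with the smallest variance as the spread weights
k = min(range(3), key=lambda i: eigvals[i])
weights = [V[i][k] for i in range(3)]
spread = [sum(w * leg[t] for w, leg in zip(weights, legs))
          for t in range(len(market))]

print("spread weights:", weights)
print("spread variance:", eigvals[k])
```

The weights play the same role as the (-0.78, 0.62) pair in the two-legged example, one capital weight per leg, and the variance of the resulting spread is no larger than that of any single leg.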


  1. Thanks for trying again. I understand PCA; I am just wondering how this would resolve to a portfolio. Should I short SPY with a 0.78 weight and go long IWM with a 0.62 weight to create a neutral spread?

    1. Exactly. The weights are in capital: $78 short SPY and $62 long IWM. Note that 78/62 = 1.26 is pretty close to the beta of IWM, which is 1.2 according to Google.

  2. Hi Jev, thanks for the interesting blog. I am trying to repeat your calculation on SPY and IWM for data from 1/1/2012 through 12/31/2012. However, the Coeff values that princomp produces for me are different. In my runs MATLAB returns:
    Coeff =
    0.8883 -0.4593
    0.4593 0.8883

    Any idea why the difference from your result?
    My MATLAB code is pretty simple:


    % d1 and d2 were set to the beginning and end of 2012
    [date, close, open, low, high, volume, closeadj] ...
    = StockQuoteQuery(smb1,d1,d2,freq) ;

    % adjusted price data from yahoo
    % I checked and the data is correct
    SPY = closeadj;

    [date, close, open, low, high, volume, closeadj] ...
    = StockQuoteQuery(smb2,d1,d2,freq) ;

    IWM = closeadj;

    comb1 = [SPY,IWM];
    comb2 = [SPY-mean(SPY),IWM-mean(IWM)];

    [COEFF2,SCORE2] = princomp(comb2);
    [COEFF1,SCORE1] = princomp(comb1);

    % with or without subtracting the mean
    % (which is requested by the PCA
    % article you referred to), the
    % result comes out the same

    Thanks for your help,

    1. Try applying PCA to daily returns instead of raw prices.

  3. Thanks. That indeed produces the same result as yours; however, is it correct? PCA finds the core orthogonal components of a signal. When I feed it the IWM and SPY prices, those prices are the signal, not the price changes.

    And if you run IWM and SPY through the two sets of coefficients, the one you get from calculating on daily changes like you did and the one you get from calculating on the actual price signal, the latter seems to produce a much smoother output that is more "market neutral", IMHO.

    I tested for 2012 (signal from 1/3/2012 to 12/31/2012). With PCA factors of (-0.78, 0.62) you get a signal that deviates from zero for months and has a stdev of 1.43% over the year. With factors of (-0.8883, 0.4593) you get a smaller amplitude that is always near zero. The average is the same, but the stdev of this signal is much smaller, only 0.6%. It looks more market neutral.

    So which is right?
    Best... Zvi

  4. Which two number sets are you talking about? The result has to be different if you transform the data one way or the other. Is there any formula you can point to?