Dimensionality reduction - part 1


Nowadays data sets contain a lot of features (thousands, tens of thousands, etc.), and if you would like to train a model on such a set you need to be patient, as it can take A LOT OF TIME!!! However, there is a way to speed up the process: dimensionality reduction (please note that it may decrease the performance of your model). So, I would like to review a number of algorithms for that.

1. The most common approach is Principal Component Analysis (PCA).

There is a lot of information about this method on the Web. Personally, I like the following video from HSE by Boris Demeshev (only if you know Russian, of course :) ):
Let’s have a look at this using Python:
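For illustration, take a small made-up matrix X (4 samples, 2 features); the values below are just an assumption for the sketch:

    import numpy as np

    # A tiny toy data set: 4 samples (rows) x 2 features (columns)
    X = np.array([[2.0, 1.0],
                  [3.0, 4.0],
                  [5.0, 0.0],
                  [7.0, 6.0]])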




Centring the features:
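Continuing with the toy matrix above, centring just means subtracting each column’s mean so that every feature has zero mean:

    # Subtract the per-feature (column) means
    X_centred = X - X.mean(axis=0)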

Using Singular Value Decomposition (SVD):
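With NumPy this could look like the following (np.linalg.svd returns U, the singular values s, and Vt):

    # Decompose the centred matrix: X_centred = U @ np.diag(s) @ Vt
    U, s, Vt = np.linalg.svd(X_centred, full_matrices=False)
    print(Vt)  # each row of Vt holds the coefficients (loadings) for one component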



We see that the svd method returns the linear coefficients for our features (the rows of Vt), so in order to get the principal components we need to multiply our centred matrix by these coefficients.
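Continuing the sketch, this projection is a single matrix product:

    # Project the centred data onto the directions found by SVD
    components = X_centred @ Vt.T
    print(components)  # the principal component scores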

The same output using sklearn:
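A minimal equivalent with sklearn might be:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)
    components_sklearn = pca.fit_transform(X)  # sklearn centres X internally
    print(components_sklearn)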



The only difference is the sign: the equation that identifies the linear coefficients has two roots (please see the link to Boris’s video), so we can use either of them. BTW, we don’t need to centre the data ourselves, as sklearn does it for us.



