Data Science in a Box using IPython: Scipy and Scikit-Learn (3/4)

04/17/2013

In the first two posts of this series, we installed IPython Notebook with the minimum requirements.

This third post walks you through some of the common packages used for data science.

SciPy and NumPy are usually mentioned together. At this point, we have not installed SciPy. SciPy is a collection of numerical packages, including the linear solvers that we used in a previous post, "Enter the Big Data Matrix: analyzing meanings and relations of everything (2/2)".

To install the package, type: sudo apt-get install python-scipy
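Once the package is installed, a quick way to confirm that SciPy works is to solve a small linear system from Python. The sketch below is only an illustration: the matrix and right-hand side are made-up values, not data from the earlier post.

```python
import numpy as np
from scipy import linalg

# Hypothetical 2x2 system chosen only for illustration:
#   3x + 1y = 9
#   1x + 2y = 8
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# linalg.solve returns the vector x for which A.dot(x) equals b
x = linalg.solve(A, b)
print(x)

# Substitute the solution back into the system as a sanity check
print(np.allclose(A.dot(x), b))
```

If the import fails, the SciPy package was not installed for the Python interpreter you are running.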

Scikit-Learn is a fantastic Python-based machine learning package. It includes algorithms for both supervised and unsupervised learning, along with sample datasets, data import tools, and model evaluation.
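To make those pieces concrete, here is a minimal sketch that touches a bundled sample dataset, a supervised algorithm, and an evaluation metric. The choice of the iris dataset, the SVC classifier, and the every-third-sample split are assumptions made for illustration; they are not taken from the samples discussed later in this post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Sample dataset bundled with the package: 150 flowers, 3 classes
iris = load_iris()
X, y = iris.data, iris.target

# Simple illustrative split: hold out every third sample for testing
is_test = (np.arange(len(y)) % 3 == 0)
X_train, y_train = X[~is_test], y[~is_test]
X_test, y_test = X[is_test], y[is_test]

# Supervised learning: fit a support vector classifier on the training part
model = SVC()
model.fit(X_train, y_train)

# Model evaluation: fraction of held-out samples classified correctly
accuracy = accuracy_score(y_test, model.predict(X_test))
print("Held-out accuracy: %.2f" % accuracy)
```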

Scikit-Learn is included with your Ubuntu distribution, but the default version is about two releases behind. The best way to install Scikit-Learn is to use pip.

Type: pip install scikit-learn

The installation builds many of the components from source, since much of the code base is written in C. Check the installation output for errors. You can verify the result by looking for a new sklearn directory in /usr/local/lib/python2.7/dist-packages.
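Besides looking at the directory, you can ask Python itself whether the packages import cleanly. The helper below is a small sketch; `check_package` is a hypothetical name introduced here, not part of any library.

```python
import importlib

def check_package(name):
    """Return (version, path) for an importable package, or None if missing."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Not every module defines these attributes, so fall back to "unknown"
    version = getattr(module, "__version__", "unknown")
    path = getattr(module, "__file__", "unknown")
    return version, path

for name in ("numpy", "scipy", "sklearn"):
    info = check_package(name)
    if info is None:
        print("%s: NOT installed" % name)
    else:
        print("%s: version %s at %s" % (name, info[0], info[1]))
```

The printed path shows which copy of each package the interpreter picked up, which helps when both the Ubuntu package and the pip build are present.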

Getting samples

It is easy to find samples and run them in IPython Notebook. You can get them from various websites and tutorials. To save you time, I've made a small collection at: https://github.com/wenming/BigDataSamples

Get the package by typing: wget https://github.com/wenming/BigDataSamples/archive/master.zip

On your Ubuntu box, you might have to install unzip by typing: sudo apt-get install unzip

Unzip master.zip, then copy the contents of BigDataSamples-master/ipythonMLsamples into your IPython directory. A sample command may look like:

cp /home/azureuser/samples/BigDataSamples-master/ipythonMLsamples/* /home/azureuser/.ipython/

Check to make sure the files have been copied.
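The unzip, copy, and check steps can also be scripted from Python. The sketch below assumes master.zip was downloaded to /home/azureuser/samples (inferred from the sample command above); `install_samples` is a hypothetical helper name, and the paths should be adjusted for your machine.

```python
import glob
import os
import shutil
import zipfile

def install_samples(zip_path, extract_dir, notebook_dir,
                    subfolder="BigDataSamples-master/ipythonMLsamples"):
    """Extract zip_path into extract_dir, then copy the sample files found
    in subfolder into notebook_dir. Returns the names of the copied files."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)

    source_dir = os.path.join(extract_dir, *subfolder.split("/"))
    if not os.path.isdir(notebook_dir):
        os.makedirs(notebook_dir)

    copied = []
    for path in sorted(glob.glob(os.path.join(source_dir, "*"))):
        if os.path.isfile(path):  # like `cp dir/*`, skip subdirectories
            shutil.copy(path, notebook_dir)
            copied.append(os.path.basename(path))
    return copied

if __name__ == "__main__":
    # Assumed download location; change it to wherever wget saved the file.
    zip_file = "/home/azureuser/samples/master.zip"
    if os.path.exists(zip_file):
        files = install_samples(zip_file,
                                "/home/azureuser/samples",
                                "/home/azureuser/.ipython")
        print("Copied %d files: %s" % (len(files), files))
    else:
        print("master.zip not found; download it first with wget.")
```

The returned list doubles as the check that the files were copied.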

Running the samples

Go back to the IPython Notebook website and log in. The copied files should show up in the root directory. Click on "A demo of K-Means clustering on the handwritten digits data".

Click on the Play button to run the machine learning sample.

The code uses the K-means algorithm with 3 different types of initialization, then plots the results.
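For readers who want to see the idea without opening the notebook, here is a condensed sketch. It assumes the three initializations are k-means++, random, and PCA-based, as in the scikit-learn demo of the same name; the actual notebook code may differ in its details and reports more metrics.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)   # standardize each pixel feature
n_digits = 10               # one cluster per digit class, 0 through 9

# The PCA components act as deterministic starting centers for the third run
pca = PCA(n_components=n_digits).fit(data)

estimators = {
    "k-means++": KMeans(init="k-means++", n_clusters=n_digits,
                        n_init=10, random_state=0),
    "random": KMeans(init="random", n_clusters=n_digits,
                     n_init=10, random_state=0),
    # Explicit centers are deterministic, so a single run is enough
    "PCA-based": KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
}

results = {}
for name, estimator in sorted(estimators.items()):
    estimator.fit(data)
    # inertia_ is the sum of squared distances of samples to their closest center
    results[name] = estimator.inertia_
    print("%-10s inertia = %.0f" % (name, estimator.inertia_))
```

Lower inertia means tighter clusters, so comparing the three numbers gives a rough sense of how much the starting centers matter.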

The notebook also contains the code for making the colored scatter plot.
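A simplified version of such a plot can be sketched as follows. This is not the notebook's own plotting code: it assumes matplotlib is installed, reduces the data to two dimensions with PCA, and writes the figure to a hypothetical file name, kmeans_digits.png.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

data = scale(load_digits().data)

# Project the 64-dimensional digit images onto 2 principal components
reduced = PCA(n_components=2).fit_transform(data)

# Cluster in the reduced space so the centers can be drawn on the same axes
kmeans = KMeans(init="k-means++", n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)
centers = kmeans.cluster_centers_

fig, ax = plt.subplots(figsize=(6, 5))
# Color each point by the cluster it was assigned to
ax.scatter(reduced[:, 0], reduced[:, 1], c=labels, s=8, cmap="tab10")
# Mark the cluster centers with a black X
ax.scatter(centers[:, 0], centers[:, 1], marker="x", s=150,
           linewidths=3, color="k")
ax.set_title("K-means clustering on the digits dataset (PCA-reduced data)")
ax.set_xlabel("First principal component")
ax.set_ylabel("Second principal component")

# Illustrative output file name; inside the notebook the figure can be shown inline instead
fig.savefig("kmeans_digits.png")
```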

Additional samples to explore

These samples are also included; feel free to explore them on your own.

A demo of K-Means clustering on the handwritten digits data
A demo of structured Ward hierarchical clustering on Lena image
Faces dataset decompositions
Gaussian Processes regression
Manifold learning
Non-linear SVM
Hand writing recognition using SVM
Hierarchical clustering-structured vs unstructured ward
demo2 of the K Means clustering algorithm
Weighted SVM
Visualizing the stock market structure

Conclusion

IPython Notebook gives us a quick and easy way to share compute resources through its web-based interface. Scikit-Learn, NumPy, and SciPy all work out of the box with IPython Notebook. This simple yet powerful combination lets users focus on learning and getting the data analysis done.

In the next post, we will introduce additional Python packages that can be used for data analysis, including scaling out using clustering.
