Python for the Data Scientist
In a previous notebook I introduced the R programming language and environment. R is very powerful, widely used, and has a large collection of packages, but another language, Python, is also popular with Data Scientists. Yes, you can do amazing things in R. In fact, part of R is written in R (think about that for a moment), and the number of packages you can get for it is simply huge.
But Python has some distinct differences that make it attractive for working in data analytics. It scales well, is fairly easy to learn and use, has an extensible framework, has support for almost every platform around, and you can use it to write extensive programs that work with almost any other system and platform.
It used to be that if you wanted to write scalable programs that worked in a complete system, you used Python, and if you wanted to specialize in statistics you used R. But that’s changed. R has grown to encompass more functionality, and with the RRE packages it scales well beyond installed memory. On the other hand, Python has added so many libraries dealing with math, statistics, science, data and more that it is starting to rival the R language in usefulness for a Data Scientist.
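To give a flavor of that, here is a minimal sketch using the statistics module that ships with Python 3.4 and later, so nothing extra needs to be installed. The sample values are made up purely for illustration.

```python
import statistics

# Hypothetical sample data, invented for illustration only.
samples = [2.5, 3.1, 4.7, 5.0, 6.2]

mean_value = statistics.mean(samples)      # arithmetic mean
median_value = statistics.median(samples)  # middle value of the sorted data
stdev_value = statistics.stdev(samples)    # sample standard deviation

print("mean:", mean_value)
print("median:", median_value)
print("stdev:", round(stdev_value, 3))
```

For heavier work you would likely reach for third-party libraries, but the standard library alone covers the basics.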
So in short, if you’re dealing primarily with statistics and data, R is a great language to learn and use. If you want to add in more functionality dealing with systems not specifically involved in statistics and data, Python is great to learn and use. Actually, if you’re serious about Data Science, you should learn both.
Installing the tools
In this notebook entry I’ll show you a couple of tools you should install to use Python, assuming that you’re on Windows. For Linux or Mac, the process is similar but the tools are different, so I’ll cover those in another post.
Begin by installing Python itself. You can find it here: https://www.python.org/downloads/.
Right away you face a choice – Version 3 or Version 2? And why is that even a choice?
Well, Python is a victim of its own success. It does so many things, and so many things well, that it was adopted into many organizations in a big way. Like R, it has packages (called “Modules”) that allow it to be extended significantly. So many modules were written for the 2.x version of Python that it is taking a long time to convert them to 3.x. In some cases, they simply won’t be ported at all. Since organizations may depend on these Modules, the earlier version is still around.
I use the 3.x version. Most of the functions a Data Scientist needs are ported, and new ones are being developed for 3.x, not 2.x.
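If you are not sure which version a given machine is running, a short script can tell you. The sketch below is only an illustration. The helper name describe_interpreter is my own and not part of Python. It also shows two well-known behaviors of the 3.x line: print is a function, and the / operator always performs true division.

```python
import sys

def describe_interpreter(version_info=sys.version_info):
    """Return a short label such as 'Python 3.5' for the given version info."""
    return "Python {}.{}".format(version_info[0], version_info[1])

# Report the interpreter that is running this script.
print(describe_interpreter())

# In 3.x, / is true division and // is floor division.
if sys.version_info[0] >= 3:
    print(7 / 2)   # true division
    print(7 // 2)  # floor division
```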
The Module list is huge (here are a few: https://wiki.python.org/moin/UsefulModules ). To check whether the one you want to work with is supported in version 3, use this link: http://py3readiness.org/.
Still confused? Read more here: https://wiki.python.org/moin/Python2orPython3
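Once Python is installed, you can also ask your own interpreter whether a particular Module is available. This is a rough sketch that assumes Python 3.4 or later. The helper name is_available is hypothetical, and the module names in the list are only examples. It is meant for top-level module names.

```python
import importlib.util

def is_available(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None

# Example names only; substitute the Modules you care about.
for name in ["json", "statistics", "numpy"]:
    status = "available" if is_available(name) else "not installed"
    print(name, status)
```

This only tells you what is installed locally. It says nothing about whether a Module has been ported, which is what the links above are for.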
Python comes with an editor, called IDLE. I actually like using a full Integrated Development Environment, so I use Visual Studio 2015 Community Edition. It’s free, robust, and if you select “Custom” during the installation, you can allow add-ins to the product. I did that during my installation and selected “Python Tools”. Now I can work with Python code in Visual Studio.
If you want to install Visual Studio Community Edition, download it here: https://www.visualstudio.com/en-us/products/visual-studio-community-vs.aspx#.
Learn more about working with Visual Studio here: https://msdn.microsoft.com/en-us/library/dn762121%28v=vs.140%29.aspx?f=255&MSPPError=-2147217396
Getting started with the language
With Python installed, you have a wealth of resources you can use to learn the language. My favorite is “Learn Python the Hard Way”, located here: http://learnpythonthehardway.org/.
You can find the Beginner’s Guide on the Python wiki here: https://wiki.python.org/moin/BeginnersGuide/Overview
And once you’re familiar with Python, check out Data Science and Python: http://www.kdnuggets.com/2014/01/tutorial-data-science-python.html
There are dozens of other resources you can use to learn Python. Another one I like is at Codecademy: https://www.codecademy.com/. Nothing to install, and it’s free! I’ve completed this course and it’s quite good.
One more tool worth knowing about is the IPython Notebook. No, not the notebook you’re reading now; this is something else entirely. It’s a way of working with and integrating Python code directly in a web document. The basic concepts are here: https://ipython.org/notebook.html
And something you’ll see used quite frequently among Data Scientists is Jupyter, the project that grew out of the IPython Notebook. You can find it here: http://jupyter.org/
And just like that, you’re on your way.