Volume 31 Number 3
Introduction to SciPy Programming for C# Developers
There’s no formal definition of the term data science, but I think of it as using software to analyze data with classical statistics techniques and machine learning algorithms. Until recently, much of data science analysis was performed with expensive commercial products, but in the past few years the use of open source alternatives has increased greatly.
Based on conversations with my colleagues, the three most common open source approaches for data science analysis are the R language, the Python language combined with the SciPy (“scientific Python”) library, and integrated language and execution environments such as SciLab and Octave.
In this article I’ll give you a quick tour of programming with SciPy so you can understand exactly what it is and determine if you want to spend time learning it. This article assumes you have some experience with C# or a similar general-purpose programming language such as Java, but doesn’t assume you know anything about Python or SciPy.
In my opinion, the most difficult part about learning a new programming language or technology is just getting started, so I’ll describe in detail how to install (and uninstall) the software needed to run a SciPy program. Then, I’ll describe several ways to edit and execute a SciPy program and explain why I prefer using the IDLE (Integrated Development and Learning Environment) program.
I’ll conclude by walking you through a representative program that uses SciPy to solve a system of linear equations, in order to demonstrate similarities and differences with C# programming. Figure 1 shows the output of the demo program and gives you an idea of where this article is headed.
Figure 1 Output from a Representative SciPy Program
Installing the SciPy Stack
The SciPy stack has three components: Python, NumPy and SciPy. The Python language has basic features such as while loop control structures and a general-purpose list data type, but interestingly, no built-in array type. The NumPy library adds support for arrays and matrices, plus some relatively simple functions such as array search and array sort. The SciPy library adds intermediate and advanced functions that work with data stored in arrays and matrices.
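To make the difference between a Python list and a NumPy array concrete, here is a minimal sketch. The variable names are illustrative only and are not part of the demo program. It calls print with parentheses and a single argument so that it runs under Python 3.x as well as 2.7.

```python
import numpy as np

values = [5.0, 1.0, 3.0]         # plain Python list
arr = np.array(values)           # NumPy array built from the list

repeated = values * 2            # list: repetition, giving six items
doubled = arr * 2                # array: elementwise multiplication

sorted_arr = np.sort(arr)        # sorted copy; arr itself is unchanged
hits = np.where(arr == 3.0)[0]   # indices of the cells equal to 3.0

print(doubled)
print(sorted_arr)
print(hits)
```

The same * operator repeats a list but multiplies each cell of an array, which is one reason numeric code is usually written against arrays rather than lists.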
To run a SciPy program (technically a script because Python is interpreted rather than compiled), you install Python, then NumPy, then SciPy. Installation isn’t too difficult, and you can install a software bundle that includes all three components. One common bundle is the Anaconda distribution, which is maintained by Continuum Analytics at continuum.io. However, I’ll show you how to install the components individually.
Python is supported on nearly all versions of Windows. To install Python, go to python.org/downloads, where you’ll find the option to install either a Python 3.x version or a 2.x version (see Figure 2). The two versions aren’t fully compatible, but the NumPy and SciPy libraries are supported on both. I suggest installing the 2.x version because there are some third-party functions that aren’t yet supported on the 3.x version.
Figure 2 Installing Python
When you click a download button, you’ll get the option to either run the .msi installer program immediately or save it so you can run it later. You can click the Run button. The installer uses a wizard. The first screen asks if you want to install for all users or just the current user. The default is for all users so click the Next button.
The next screen asks you to specify the root installation directory. The default is C:\Python27 (rather than the more usual C:\Program Files directory) and I suggest you use the default location and click Next. The following screen lets you include or exclude various features such as documentation and utility tools like pip (“pip installs Python”). The default Python features are fine, so click on the Next button.
The installation starts and you’ll see a window with a familiar blue progress bar. When installation finishes, you’ll see a window with a Finish button. Click on that button.
By default, the Python installation process doesn’t modify your machine’s PATH environment variable. You’ll want to add C:\Python27, C:\Python27\Scripts and C:\Python27\Lib\idlelib to the PATH variable so you can run Python from a command shell and launch the IDLE editor without having to navigate to their directory locations.
You should verify that Python is installed correctly. Launch a command shell and navigate to your system root directory by entering a cd \ command. Now enter the command python --version (note the two dash characters, with no space between them and the word version). If Python responds, it’s been successfully installed.
Installing NumPy and SciPy
It’s possible to install the NumPy and SciPy packages from source code using the Python pip utility. The pip approach works well with packages that are pure Python code, but NumPy and SciPy have hooks to compiled C language code, so installing them using pip is quite tricky.
Luckily, members of the Python community have created pre-compiled binary installers for NumPy and SciPy. The ones I recommend using are maintained in the SourceForge repository. To install NumPy, go to bit.ly/1Q3mo4M, where you’ll see links to various versions. I recommend using the most recent version that has the most download activity.
You’ll see a list of links. Look for a link with a name that resembles numpy-1.10.2-win32-superpack-python2.7.exe, as shown in Figure 3. Make sure you have the executable that corresponds to your version of Python and click on that link. After a brief delay you’ll get an option to run the self-extracting executable installer immediately, or save it to install later. Click on the Run button.
Figure 3 Installing NumPy
The NumPy installer uses a wizard. The first screen just shows an introductory splash window. Click the Next button. The next screen asks you to specify the installation directory. The installer will find your existing Python installation and recommend installing NumPy in the C:\Python27\Lib\site-packages directory. Accept this location and click Next.
The next screen gives you a last chance to back out of the install, but don’t do so. Click the Next button. You’ll see a progress window during installation and if you watch closely, you’ll see some interesting logging messages. When the NumPy installation finishes, you’ll be presented with a Finish button. Click on it. Then you’ll see a final Setup Completed window with a Close button. Click on it.
After you’ve installed NumPy, the next step is to install the SciPy package; the process is nearly identical to installing NumPy. Go to bit.ly/1QbwJ0z and find a recent, well-used directory. Go into that directory and find a link to an executable with a name like scipy-0.16.1-win32-superpack-python2.7.exe and click on it to launch the self-extracting executable installer.
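Once all three components are in place, a quick way to confirm that Python can find NumPy and SciPy is to import them and display their version strings. This is only a sketch of one possible check; it relies on the __version__ attribute that both packages expose, and the version numbers you see depend on what you installed.

```python
import sys
import numpy as np
import scipy

# Each package exposes its version as a string attribute
print("Python " + sys.version.split()[0])
print("NumPy  " + np.__version__)
print("SciPy  " + scipy.__version__)
```

If either import statement fails, the corresponding package isn't installed for the Python interpreter you're running.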
One nice characteristic of the SciPy stack is that it’s very easy to uninstall components. You can go to the Windows Control Panel, Programs and Features, select the component (that is, Python, or NumPy, or SciPy) to remove and then click the Uninstall button, and the component will be quickly and cleanly removed.
Editing and Running a SciPy Program
If you write programs using a .NET language, there aren’t many options available and you almost certainly use Visual Studio. But when writing a Python program you have many options. I recommend using the IDLE editor and execution environment.
The idle.bat program launcher file is located by default in the C:\Python27\Lib\idlelib directory. If you added this directory to your system PATH environment variable, you can launch IDLE by opening a command shell and entering the command idle. This will start the IDLE Python Shell program as shown in the top part of Figure 4.
Figure 4 Editing and Running a Program Using IDLE
You can create a new Python source code file by clicking on the File | New File item on the menu bar. This opens a similar-looking separate editor window as shown in the bottom part of Figure 4. Type these seven statements in the editor window:
# test.py
import numpy as np
import scipy.misc as sm
arr = np.array([1.0, 3.0, 5.0])
print arr
n = sm.factorial(4)
print "4! = " + str(n)
Then save your program as test.py in any convenient directory. Now you can run your program by clicking on the Run | Run Module menu item in the editor window, or by hitting the F5 shortcut key. Program output will be displayed in the Python Shell window. Simple!
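One caveat, stated as an assumption about your environment: in SciPy releases newer than the 0.16 series used in this article, the factorial function is found in the scipy.special sub-module and is no longer available from scipy.misc. If test.py fails on the sm.factorial call, a sketch of an equivalent script looks like this. It also writes print with parentheses so it runs under Python 3.x.

```python
import numpy as np
import scipy.special as sp   # assumption: factorial lives here in your SciPy

arr = np.array([1.0, 3.0, 5.0])
print(arr)
n = sp.factorial(4, exact=True)   # exact=True asks for integer arithmetic
print("4! = " + str(n))
```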
Some experienced Python developers take potshots at IDLE because it is rather simple. But that’s exactly why I like it. You don’t get anything near the sophisticated programming environment of Visual Studio, but you do get syntax coloring and a good error message generator when you’ve written incorrect code.
Instead of using IDLE to edit and run programs, you can use any text editor, including Notepad, to write and save a Python program. Then you can execute the program from a command line like this:
C:\IntroToPython> python test.py
This assumes you have the path to the python.exe interpreter in your system PATH environment variable. Output will be displayed in the command shell.
There are many Python IDEs. One popular open source IDE that’s specifically intended for use with SciPy is the Scientific Python Development Environment (Spyder) program. You can find information about it at pythonhosted.org/spyder.
An interesting alternative to IDLE and Spyder is the open source Python Tools for Visual Studio (PTVS) plug-in. As the name implies, PTVS allows you to edit and run Python programs using Visual Studio. You can find information about PTVS at microsoft.github.io/PTVS.
A SciPy Demo Program
Take a look at the Python program in Figure 5, or better yet, type or download the file that accompanies this article into a Python editor and run the program. The demo is not intended to be a comprehensive set of SciPy examples, but it is designed to give you a good feel for what SciPy programming is like.
Figure 5 A Representative SciPy Program
# linear_systems.py
# Python 2.7

import numpy as np
import scipy.linalg as spla

def my_print(arr, cols, dec, nl):
  n = len(arr)
  fmt = "%." + str(dec) + "f" # like %.4f
  for i in xrange(n): # alt: for x in arr
    if i > 0 and i % cols == 0:
      print ""
    print fmt % arr[i],
  if nl == True:
    print "\n"

def main():
  print "\nBegin linear system using SciPy demo \n"
  print "Goal is to solve the system: \n"
  print "3x0 + 4x1 - 8x2 = 9"
  print "2x0 - 5x1 + 6x2 = 7"
  print " x0 + 9x1 - 7x2 = 3"
  print ""
  A = np.matrix([[3.0, 4.0, -8.0],
    [2.0, -5.0, 6.0],
    [1.0, 9.0, -7.0]])
  b = np.array([9.0, 7.0, 3.0]) # b is an array
  b = np.reshape(b, (3,1)) # b is a col vector
  print "Matrix A is "
  print A
  print ""
  print "Array b is "
  my_print(b, b.size, 2, True)
  d = spla.det(A)
  if d == 0.0:
    print "Determinant of A is zero so no solution "
  else:
    Ai = spla.inv(A)
    print "Determinant of A is non-zero "
    print "Inverse of A is "
    print Ai
  print ""
  Aib = np.dot(Ai, b)
  print "A inverse times b is "
  print Aib
  print ""
  x = spla.solve(A, b)
  print "Using x = linalg.solve(A,b) gives x = "
  print x
  print ""
  try:
    A = np.array([[2.0, 4.0], [3.0, 6.0]])
    print "Matrix A is "
    print A
    print ""
    print "Inverse of A is "
    Ai = spla.inv(A)
    print Ai
  except Exception, e:
    print "Fatal error: " + str(e)
  print "\nEnd SciPy demo \n"

if __name__ == "__main__":
  main()
The demo program begins with two comment lines:
# linear_systems.py
# Python 2.7
Because Python 2.x and 3.x versions are not fully compatible, it’s not a bad idea to be explicit about which version of Python you used. Next, the demo loads the entire NumPy module and one SciPy sub-module:
import numpy as np
import scipy.linalg as spla
You can think of these statements as somewhat like adding a reference from a Microsoft .NET Framework DLL to a C# program and then bringing the assembly into scope with a using statement. The linalg sub-module stands for linear algebra. SciPy is organized into 16 primary sub-modules plus two utility sub-modules. Next, the demo implements a program-defined function to display an array:
def my_print(arr, cols, dec, nl):
  n = len(arr)
  fmt = "%." + str(dec) + "f" # like %.4f
  for i in xrange(n): # alt: for x in arr
    if i > 0 and i % cols == 0:
      print ""
    print fmt % arr[i],
  if nl == True:
    print "\n"
Python uses indentation rather than curly brace characters to delimit code blocks. Here, I use two spaces for indentation to save space; most Python programmers use four spaces for indentation.
Function my_print has four parameters: an array to display, the number of columns to display the values, the number of decimals for each value and a flag indicating whether to print a newline. The len function returns the size (number of cells) of the array. An alternative is to use the array size property:
n = arr.size
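The two alternatives agree for the arrays used in the demo, but not for every array. As a small sketch with illustrative variable names: len gives the length of the first dimension, while size gives the total number of cells.

```python
import numpy as np

b = np.array([9.0, 7.0, 3.0])            # 1-D array, three cells
col = np.reshape(b, (3, 1))              # 3x1 column, as in the demo
m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # 2x3 array

lengths = (len(b), len(col), len(m))     # length of the first dimension
sizes = (b.size, col.size, m.size)       # total number of cells

print("len:  " + str(lengths))
print("size: " + str(sizes))
```

For the 2x3 array, len reports 2 (the number of rows) while size reports 6, so the two are interchangeable only for one-dimensional arrays and single-column shapes like the demo's b.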
The xrange function returns an iterator and is the standard way to traverse an array. An alternative is to use a “for x in arr” pattern, which is similar to the C# foreach statement.
Because both Python and C# have roots in the C language, much of Python syntax is familiar to C# programmers. In the demo, % is the modulo operator, but it’s also used to format floating point output; the keyword and is the logical operator, rather than &&; == is a check for equality; and True and False (capitalized) are the Boolean constants.
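The following sketch pulls those pieces of syntax together in one place. The function format_values is a hypothetical variation on my_print that returns a string instead of printing, which lets the same code run under both Python 2.7 and 3.x.

```python
def format_values(values, cols, dec):
    """Return values as text, cols per line, dec decimals each."""
    fmt = "%." + str(dec) + "f"            # for example "%.2f"
    lines = []
    current = []
    for i, x in enumerate(values):         # foreach-style loop that also yields an index
        if i > 0 and i % cols == 0:        # % as modulo, and instead of &&
            lines.append(" ".join(current))
            current = []
        current.append(fmt % x)            # % as the string-formatting operator
    lines.append(" ".join(current))
    return "\n".join(lines)

text = format_values([9.0, 7.0, 3.0, 1.5], cols=3, dec=2)
print(text)
```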
Next, the demo creates a program-defined function named main, which starts with some print statements that explain the problem to solve:
def main():
  print "\nBegin linear system using SciPy demo \n"
  print "Goal is to solve the system: \n"
  print "3x0 + 4x1 - 8x2 = 9"
  print "2x0 - 5x1 + 6x2 = 7"
  print " x0 + 9x1 - 7x2 = 3"
The goal is to find values for variables x0, x1 and x2 so that all three equations are satisfied. The name main is not a Python keyword, so it could have been called anything. Having a main function of some sort isn’t required. For short programs (typically less than one page of code), I usually dispense with a main function and just start with executable statements.
Next, the demo program sets up the problem by putting the coefficient values into a NumPy 3x3 matrix named A and the constants into a NumPy array named b:
A = np.matrix([[3.0, 4.0, -8.0],
  [2.0, -5.0, 6.0],
  [1.0, 9.0, -7.0]])
b = np.array([9.0, 7.0, 3.0])
The matrix and array functions here actually accept Python lists (indicated by square brackets) with hardcoded values as their arguments. You can also create matrices and arrays using the NumPy zeros function, and you can read data from a text file into a matrix or an array using the loadtxt function.
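As a rough illustration of those two alternatives, the sketch below creates a matrix of zeros and then reads the demo's coefficients with loadtxt. To keep the example self-contained, an in-memory io.StringIO object stands in for a text file; in practice you would more likely pass a file name, and the file layout shown (values separated by spaces) is an assumption.

```python
import io
import numpy as np

Z = np.zeros((3, 3))                      # 3x3 array filled with 0.0

# Stand-in for a text file with one matrix row per line
text = io.StringIO(u"3.0 4.0 -8.0\n2.0 -5.0 6.0\n1.0 9.0 -7.0\n")
A = np.loadtxt(text)                      # whitespace-delimited by default

print(Z.shape)
print(A)
```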
If you took an algebra class, you might remember that to solve a system of equations Ax = b for x, where A is a square matrix of coefficients and b is a column matrix (that is, n rows but only 1 column) of the constants, you must find the matrix inverse of A and then matrix-multiply the inverse times column matrix b.
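Written out for the demo system, the idea looks like this. The notation is a sketch in standard linear algebra form, and the inverse step is valid only when A is invertible.

```latex
A x = b
\quad\Longrightarrow\quad
x = A^{-1} b,
\qquad
A = \begin{bmatrix} 3 & 4 & -8 \\ 2 & -5 & 6 \\ 1 & 9 & -7 \end{bmatrix},\;
x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix},\;
b = \begin{bmatrix} 9 \\ 7 \\ 3 \end{bmatrix}
```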
At this point in the demo, b is an array with three cells rather than a 3x1 column matrix. To convert b into a column matrix, the demo program uses the reshape function:
b = np.reshape(b, (3,1))
The NumPy library has many functions that can manipulate arrays and matrices. For example, the flatten function will convert a matrix to an array. Now, as it turns out, the NumPy dot function used for matrix multiplication is smart enough to infer what you intend if you multiply a matrix and an array, so the call to reshape isn’t really necessary here.
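Here is a small sketch of reshape, flatten and the shape inference just described. It assumes A is a plain NumPy array created with np.array rather than the np.matrix used in the demo, because the result shapes are easiest to see that way.

```python
import numpy as np

A = np.array([[3.0, 4.0, -8.0],
              [2.0, -5.0, 6.0],
              [1.0, 9.0, -7.0]])
b = np.array([9.0, 7.0, 3.0])        # shape (3,)

col = np.reshape(b, (3, 1))          # shape (3, 1), a column
flat = col.flatten()                 # back to shape (3,)

p1 = np.dot(A, b)                    # 1-D operand gives a 1-D result
p2 = np.dot(A, col)                  # column operand gives a column result

print(p1.shape)
print(p2.shape)
```

Both products hold the same three values; only the shape of the result differs.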
Next, the demo program displays the values in matrices A and b:
print "Matrix A is "
print A
print ""
print "Array b is "
my_print(b, b.size, 2, True)
In Python 2.x, print is a statement, so parentheses are optional; in Python 3.x, print is a function and parentheses are required. The program-defined my_print function doesn’t return a value, so it’s equivalent to a void C# method and is called as you might expect. Python supports named-parameter calls so the call could’ve been:
my_print(arr=b, cols=3, dec=2, nl=True)
Next, the demo program finds the inverse of matrix A:
d = spla.det(A)
if d == 0.0:
  print "Determinant of A is zero so no solution "
else:
  Ai = spla.inv(A)
  print "Determinant of A is non-zero "
  print "Inverse of A is "
  print Ai
The SciPy det function returns the determinant of a square matrix. If a matrix of coefficients for a system of linear equations has a determinant equal to zero, the matrix can’t be inverted. The Python if-else statement should look familiar to you. Python has a neat “elif” keyword for if-else-if control structures, for example:
if n < 0:
  print "n is negative"
elif n == 0:
  print "n equals zero"
else:
  print "n is positive"
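A side note on the determinant check in the demo: comparing a floating point result to exactly 0.0 can miss a matrix that is singular in theory but whose computed determinant comes out as a tiny non-zero value because of round-off. One possible alternative, sketched here, is to compare against a small tolerance. The helper name is_singular and the tolerance 1.0e-10 are my own assumptions, not part of SciPy, and a suitable tolerance depends on the scale of your data.

```python
import numpy as np
import scipy.linalg as spla

def is_singular(A, tol=1.0e-10):
    """Hypothetical helper: treat |det(A)| below tol as zero."""
    return abs(spla.det(A)) < tol

A = np.array([[3.0, 4.0, -8.0],
              [2.0, -5.0, 6.0],
              [1.0, 9.0, -7.0]])
S = np.array([[2.0, 4.0],
              [3.0, 6.0]])               # second row is 1.5 times the first

print(is_singular(A))
print(is_singular(S))
```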
Next, the demo solves the system of equations using matrix multiplication via the NumPy dot function:
Aib = np.dot(Ai, b)
print "A inverse times b is "
print Aib
The dot function is so named because matrix multiplication is a form of what’s called the dot product.
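To see the connection, note that each cell of a matrix product is the dot product of one row of the first matrix with one column of the second. The sketch below uses two small made-up matrices to show this.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

C = np.dot(A, B)                   # full matrix product
c01 = np.dot(A[0, :], B[:, 1])     # row 0 of A dotted with column 1 of B

print(C)
print(c01)
```

The value in c01 is (1.0 * 6.0) + (2.0 * 8.0) = 22.0, which is the same value stored in cell [0,1] of C.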
Next, the demo program solves the system of equations directly, using the SciPy solve function:
x = spla.solve(A, b)
print "Using x = linalg.solve(A,b) gives x = "
print x
Many SciPy and NumPy functions have optional parameters with default values, which is somewhat equivalent to C# method overloading. The SciPy solve function has five optional parameters. The point is that when you see a SciPy or NumPy example function call, even if you think you understand the example, it’s a good idea to take a look at the documentation to see if there are any useful optional parameters.
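If optional parameters with default values are new to you, this sketch shows the mechanism using a made-up function. The name scale and its parameters are hypothetical and have nothing to do with SciPy; in C# you might cover the same three call shapes with overloads.

```python
def scale(values, factor=2.0, rounded=False):
    """Hypothetical function: factor and rounded are optional."""
    result = [v * factor for v in values]
    if rounded:
        result = [round(v) for v in result]
    return result

print(scale([1.5, 2.5]))                  # both defaults used
print(scale([1.5, 2.5], 3.0))             # factor supplied by position
print(scale([1.5, 2.5], rounded=True))    # factor skipped, rounded supplied by name
```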
There’s some overlap between the NumPy and SciPy libraries. For example, the NumPy package also has a linalg sub-module that has a solve function. However, the NumPy solve function has no optional parameters.
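For the demo system, the two solve functions and the explicit inverse approach should all agree to within round-off. The sketch below checks that, and also substitutes the solution back into the equations. It uses np.array for A rather than np.matrix, and np.allclose to compare floating point values within a tolerance; both choices are mine rather than the demo's.

```python
import numpy as np
import scipy.linalg as spla

A = np.array([[3.0, 4.0, -8.0],
              [2.0, -5.0, 6.0],
              [1.0, 9.0, -7.0]])
b = np.array([[9.0], [7.0], [3.0]])          # 3x1 column of constants

x_scipy = spla.solve(A, b)                   # SciPy version
x_numpy = np.linalg.solve(A, b)              # NumPy version
x_inv = np.dot(spla.inv(A), b)               # inverse times b

print(np.allclose(x_scipy, x_numpy))
print(np.allclose(x_scipy, x_inv))
print(np.allclose(np.dot(A, x_scipy), b))    # A times x reproduces b
```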
Next, the demo program shows an example of the Python try-except mechanism:
try:
  A = np.array([[2.0, 4.0], [3.0, 6.0]])
  Ai = spla.inv(A)
  print Ai
except Exception, e:
  print "Fatal error: " + str(e)
This pattern should look familiar to you if you’ve ever used the C# try-catch statements. In C#, when you concatenate strings, you can do so implicitly. For example, in C# you could write:
int n = 99;
Console.WriteLine("The value of n is " + n);
But when you concatenate strings in Python, you must do so explicitly with a cast using the str function:
n = 99
print "The value of n is " + str(n)
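One portability note on the demo's try-except code: the except Exception, e form is Python 2 syntax and is a syntax error in Python 3.x. The sketch below uses the except ... as form, which works in Python 2.6 and later as well as 3.x. I've swapped in a different singular matrix, one whose second row is exactly twice the first, and the message variable is just for illustration.

```python
import numpy as np
import scipy.linalg as spla

S = np.array([[1.0, 2.0],
              [2.0, 4.0]])          # singular: row 2 is twice row 1
message = None
try:
    Si = spla.inv(S)
    print(Si)
except Exception as e:              # "as" form works in Python 2.6+ and 3.x
    message = "Fatal error: " + str(e)   # str converts the exception to text
    print(message)
```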
The demo program concludes with a print statement and a special Python incantation:
  print "\nEnd SciPy demo \n"

if __name__ == "__main__":
  main()
The last statement of the demo program could have been just main(), which would be interpreted as an instruction to call program-defined function main, and the program would run fine. Adding the if __name__ == "__main__" pattern (note that there are two underscore characters before and after name and main) establishes the current module as the program entry point. When a Python program begins execution, the interpreter internally labels the initial module as:
__name__ = "__main__"
So, suppose you had some other program-defined modules with executable statements and you imported them. Without the if-check, the Python interpreter would see executable statements in the imported modules and execute them. Put slightly differently, if you add the if-check to your program-defined Python files, these files can be imported by other Python programs and won’t cause trouble.
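The effect of the if-check can be simulated in a single script. In the sketch below, the same source text is loaded twice, once under the module name helper (as if it had been imported) and once under the name __main__ (as if it had been run directly). The load function, the SOURCE text and the module name helper are all hypothetical; using types.ModuleType with exec is just a convenient way to mimic what the interpreter does.

```python
import types

SOURCE = '''
calls = []
def main():
    calls.append("main ran")
if __name__ == "__main__":
    main()
'''

def load(name):
    """Execute SOURCE in a fresh module whose __name__ is name."""
    module = types.ModuleType(name)      # sets module.__name__ to name
    exec(SOURCE, module.__dict__)
    return module

imported = load("helper")        # like: import helper
as_script = load("__main__")     # like: python helper.py

print(imported.calls)
print(as_script.calls)
```

Only the copy labeled __main__ calls main, so the list in the imported copy stays empty.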
So, What’s the Point?
Your initial reaction to this article could well be something like, “Well, this is all somewhat interesting, but in my day-to-day job I really don’t need to solve systems of linear equations or use obscure math functions.” My response would be, “Well, that’s true, but perhaps one of the reasons you don’t use some of the functionality of the SciPy library is that you’re not aware of what types of problems you can solve.”
Put another way, in my opinion, developers tend to tackle problems for which they have the tools. For example, if you know Windows Communication Foundation (WCF) then you’ll use WCF (and have my sympathy). If you add SciPy to your personal skill set, you might discover you have data you can turn into useful information.
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. Dr. McCaffrey can be reached at firstname.lastname@example.org.
Thanks to the following Microsoft technical experts for reviewing this article: Dan Liebling and Kirk Olynyk