Learn Data Science from the experts
Guest blog by Ilias Chrysovergis, Microsoft Student Partner at Imperial College London
Hey! My name is Ilias Chrysovergis and I am doing my MSc in Communications & Signal Processing at Imperial College London. My main academic interests are artificial intelligence, machine learning, signal processing and data science, but I am also really keen on Virtual, Augmented and Mixed Reality. I really enjoy learning new technologies in order to understand their impact on humanity. Having participated in various projects, I always try to combine different technologies to create innovative solutions for real-life problems. In the summer of 2016, my team AMANDA (http://amandaproject.net/) and I won first place in the World Citizenship category at the World Finals of the Imagine Cup. I am always open to new challenges and willing to work with extraordinary people and teams. You can find me on LinkedIn via the following link: https://www.linkedin.com/in/ilias-chrysovergis-99872690/
What are you going to read about?
Massive Open Online Courses (MOOCs) are courses provided via the web, which aim to democratize education by offering high-quality training on a large scale. In a very cost-effective way, anyone with a device and an internet connection can take courses from the best universities, institutes and experts in every academic and technological field. In a fast-growing, mobile-first and cloud-first world, Microsoft is eager to empower every person and every organization to achieve more. By leveraging MOOC technology, it offers a variety of courses, from design thinking and web development to programming languages and data science. I am going to tell you about my journey in data science and what you will learn if you decide to start your own journey by taking the same path as mine.
What is Data Science?
The huge number of existing sensors, devices, apps and data sources provides humanity with a plethora of data every day. These data are crucial for our lives and can improve living standards, if we manage to use them effectively to understand how we and our environment behave. However, it is not easy to use those data to make our lives better, because the process of extracting knowledge from them is complicated. The data science process, as can be seen in the following figure (Fig. 1), is not a trivial task and most of the time it requires the cooperation of experts from different fields.
Figure 1: Data science process flowchart from "Doing Data Science", Cathy O'Neil and Rachel Schutt, 2013
According to Cathy O’Neil and Rachel Schutt (Fig. 1), there are many different steps that one must complete in order to turn collected raw data into a data product ready to be consumed by people or organizations. First of all, the data need to be processed and transformed into a form that is easy to use. Afterwards, the dataset must be cleaned, because it almost always contains missing values, duplicates and redundant or useless information. Then, the data must be visualized and explored in order to understand them, form hypotheses about them and decide which algorithms an analyst should use to model them correctly. Next, in the modelling step, the machine learning takes place. Different algorithms with different parameters are implemented and the analyst has to decide which one provides the best results. As soon as the models and algorithms step has been completed, the data scientist can visualize his or her findings and results in order to make a decision or to help a supervisor or client make decisions. A data product can also be built in order to deliver retrospective analytics, predictive analytics, real-time analytics or intelligent SaaS apps, which are going to be used by people, companies or other organizations.
But why should I learn data science?
All those sensors, devices, apps and data sources are bombarding us with data. Unfortunately, most of them are redundant or useless, which results in information overload. Therefore, it is crucial for every person and organization on Earth to be capable of leveraging their own data, and the data of their environment, in an effective and productive way. In order to achieve this, big groups of data scientists and analysts will join forces in the near future to create really effective and useful data products. In the information era that we are living through, everyone should help in that direction. Humanity needs you, too, to play your part and make everyone able to use data and information. We must stop data from using humans and let humans use data instead.
Data Science Essentials at edX
I first came across the “Data Science Essentials” course from Microsoft on edX two years ago. During my summer holidays I wanted to explore new things and, after completing the “Introduction to R” course, I wanted to learn more about machine learning, because everyone was talking about it but it was all Greek to me (although my home country is Greece :P). Back then, the course I am writing about was called “Data Science and Machine Learning Essentials”. I decided to take it, and that was when I realized that I had entered a new world. It was a chance to learn a lot about the data science process and about the different techniques and algorithms used in machine learning. I was also taught how to work with data and visualize them. Furthermore, it was the first time I encountered Azure Machine Learning Studio, a tool developed by Microsoft that helps you easily create and consume machine learning projects.
Two years later, I enrolled in the successor of that course and realized that only a few things had changed. Now I am convinced that the basic principles of data science and machine learning do not change, and that the course I took two years ago is not going to become obsolete in the future. While watching the videos, I remembered the first days I came across all those new things, as well as the times I used them in real-life problems for projects I was working on or for university assignments. Therefore, I urge you to take that course, as a new path will open up in front of you.
So, let’s see the different modules that the course has been divided into:
a. Introduction to Data Science
This module teaches you how to think analytically about data and gives you an overview of the data science process. It also introduces you to the most popular data science technologies and tools: the R and Python programming languages, as well as Azure Machine Learning Studio and Jupyter Notebooks. This course assumes that you have some experience in those programming languages, so if you do not, you can first take courses on edX such as “Introduction to R for Data Science” and “Introduction to Python: Fundamentals”.
b. Probability and Statistics for Data Science
Being a data scientist is not an easy task. You must leverage different technologies and know different theories and techniques in order to become an expert. Probability and statistics play a crucial role in the data science process, so make sure that you have understood the basic principles. Discrete and continuous random variables and probability or cumulative distributions should not be unknown terms to you. Also, make sure that you understand the difference between descriptive and summary statistics, that you know what correlation means, and that you know why correlation does not imply causation.
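To make these terms a bit more concrete, here is a minimal sketch that computes a few summary statistics and a Pearson correlation coefficient using only Python's standard library. The numbers and the names (hours, scores, pearson) are made up for this illustration and do not come from the course.

```python
import statistics

# Hypothetical sample: hours studied and exam score for six students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = [52.0, 55.0, 61.0, 64.0, 70.0, 73.0]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


mean_score = statistics.fmean(scores)
median_score = statistics.median(scores)
stdev_score = statistics.stdev(scores)  # sample standard deviation
r = pearson(hours, scores)

print(f"mean={mean_score:.2f} median={median_score:.2f} stdev={stdev_score:.2f}")
print(f"correlation between hours and scores: {r:.3f}")
```

A coefficient close to 1 only tells you that the two variables move together in this sample. It does not tell you that studying longer causes the higher score, which is exactly the correlation versus causation point.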
c. Simulation and Hypothesis Testing
Dealing with big data is not as simple as one may think. Because your data depend on many different parameters with complex interactions among them, it may be extraordinarily difficult to predict how your data will behave in various scenarios and environments. Most of the time, the distribution that your data follow will not be one of the common ones that you have learned. It may be a combination of different types of distributions, such as normal, uniform or exponential. The ability to simulate different distributions and data sources will be very beneficial to you. In this module you will learn the basics of simulation, as well as how to perform a simulation in R or Python, or even both if you want to become a data science ninja.
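As a rough idea of what simulating a combination of distributions can look like in Python, the sketch below draws samples from a mixture of a normal and an exponential distribution with the standard library. The mixture weights, the distribution parameters and the function name sample_mixture are assumptions chosen for this example, not something prescribed by the course.

```python
import random
import statistics


def sample_mixture(rng, n):
    """Draw n values from a hypothetical mixture: 70% normal, 30% exponential."""
    values = []
    for _ in range(n):
        if rng.random() < 0.7:
            values.append(rng.gauss(10.0, 2.0))      # normal: mean 10, sd 2
        else:
            values.append(rng.expovariate(1 / 3.0))  # exponential: mean 3
    return values


rng = random.Random(42)  # fixed seed so the run is reproducible
samples = sample_mixture(rng, 100_000)
sample_mean = statistics.fmean(samples)

# The theoretical mean of this mixture is 0.7 * 10 + 0.3 * 3 = 7.9.
print(f"simulated mean: {sample_mean:.2f} (theory: 7.90)")
```

With enough samples, the simulated mean should land close to the theoretical value, which is a quick sanity check that the simulation does what you intended.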
In addition, you will very frequently have to determine whether an observed value is representative of the probable value in the whole population. You will need to apply different hypothesis tests, such as Z-tests, t-tests, chi-squared tests or others, depending on your data. You must also learn what p-values, Type I and Type II errors and confidence intervals are. Do not panic! This module is very detailed, so you will not have a problem dealing with those topics. Just pay attention to what the instructor says and focuses on.
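To give you a feeling for these terms, here is a small sketch of a two-sided one-sample Z-test and a confidence interval, written with Python's standard library only. It assumes that the population standard deviation is known, and the sample values, the hypothesized mean of 5.0 and the function names are invented for the illustration.

```python
import math
from statistics import NormalDist, fmean


def z_test(sample, pop_mean, pop_sd):
    """Two-sided one-sample Z-test, assuming a known population sd."""
    se = pop_sd / math.sqrt(len(sample))  # standard error of the mean
    z = (fmean(sample) - pop_mean) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def confidence_interval(sample, pop_sd, level=0.95):
    """Confidence interval for the mean, assuming a known population sd."""
    se = pop_sd / math.sqrt(len(sample))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    m = fmean(sample)
    return m - z_crit * se, m + z_crit * se


# Hypothetical measurements; null hypothesis: the population mean is 5.0.
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.5, 5.7, 5.2, 5.4, 5.5]
z, p = z_test(sample, pop_mean=5.0, pop_sd=0.5)
low, high = confidence_interval(sample, pop_sd=0.5)

print(f"z = {z:.2f}, p-value = {p:.4f}")
print(f"95% confidence interval for the mean: ({low:.2f}, {high:.2f})")
```

If the p-value is below the significance level you chose in advance (0.05 is a common choice), you reject the null hypothesis. Rejecting a null hypothesis that is actually true is a Type I error, and failing to reject a false one is a Type II error.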
d. Exploring and Visualizing Data
In this module you will learn different ways to explore and manipulate data in Azure Machine Learning Studio as well as with R or Python. The dplyr and pandas libraries, for R and Python respectively, will be very helpful for operations like data frame manipulation, computing columns and chaining. Additionally, you will learn to create visualizations, because most of the time it will be extremely difficult to understand patterns and relationships in your data by just looking at huge tables with tens of columns and hundreds or thousands of rows. You will start with univariate and 2-D plots and then dive deep into aesthetics for multidimensional plots and faceted plots. You will come across some very clever ways to visualize four or five dimensions in a two-dimensional graph.
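For the Python side, a computed column and a chain of operations in pandas might look like the sketch below. It assumes that pandas is installed, and the tiny weather table with its column names is invented for the example.

```python
import pandas as pd

# Hypothetical data: a few temperature and humidity readings per city.
df = pd.DataFrame({
    "city": ["London", "London", "Athens", "Athens", "Athens"],
    "temp_c": [12.0, 15.0, 25.0, 28.0, 30.0],
    "humidity": [80, 70, 40, 35, 30],
})

summary = (
    df.assign(temp_f=lambda d: d["temp_c"] * 9 / 5 + 32)  # computed column
      .query("humidity < 75")                              # keep drier readings
      .groupby("city", as_index=False)                     # one group per city
      .agg(mean_temp_f=("temp_f", "mean"), n=("temp_f", "size"))
      .sort_values("mean_temp_f", ascending=False)
      .reset_index(drop=True)
)

print(summary)
```

Each step returns a new data frame, so the whole transformation reads from top to bottom, much like a dplyr pipeline in R.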
e. Data Cleansing and Manipulation
After everything you have already learned, it is important to understand the different ways you can ingest data. In this module you will learn about the different data sources and formats that you can use in Azure ML Studio. Then you will learn how to join data, a very useful technique, because most of the time you will acquire data from different sources and in a variety of formats. You will also learn what metadata are.
I know that you may be exhausted by all that stuff and want to jump into the machine learning module, but please wait for one more topic. Data cleansing is crucial in the data science process. If you do not clean your data, it is almost certain that your machine learning algorithms will give you very bad results. Cleaning, filtering and transforming your data is vital, because otherwise you are going to fall into the trap that most data scientists call “Garbage In – Garbage Out”. So, be patient and watch this topic too. You will learn to handle missing and repeated values, extract and engineer new features, find outliers and errors in your data and, finally, scale your data. And you will learn to deal with all this stuff not only in R and Python but also in Azure ML Studio.
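Here is a minimal sketch of a few of those cleaning steps in Python with pandas (assuming it is installed). The small table, the decision to fill gaps with the column median, the 1.5 × IQR rule for outliers and min-max scaling are all choices made for this illustration. The course may use different data and different rules.

```python
import pandas as pd

# Hypothetical raw data with a repeated row, missing values and an outlier.
raw = pd.DataFrame({
    "age": [25, 32, None, 41, 32, 29, 120],
    "income": [30000, 45000, 38000, None, 45000, 41000, 52000],
})

# 1. Remove repeated rows.
clean = raw.drop_duplicates()

# 2. Fill missing values with the median of each column.
clean = clean.fillna(clean.median())

# 3. Drop age outliers using the 1.5 * IQR rule.
q1, q3 = clean["age"].quantile(0.25), clean["age"].quantile(0.75)
iqr = q3 - q1
in_range = clean["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range].reset_index(drop=True)

# 4. Scale age to the [0, 1] range (min-max scaling).
age_min, age_max = clean["age"].min(), clean["age"].max()
clean["age_scaled"] = (clean["age"] - age_min) / (age_max - age_min)

print(clean)
```

The order of the steps matters: if you scaled the data before removing the outlier, the age of 120 would squeeze all the other values into a small part of the range.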
f. Introduction to Machine Learning
Finally, the time has come. You are ready to get into the basic principles of machine learning. You will be taught both supervised and unsupervised learning. The first chapter of the module is divided into three main topics: classification, regression and clustering. You will examine the basics of each topic, evaluate classifiers and regression models, and create classification and regression models as well as k-means clustering in Azure ML.
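In the course you build these models in Azure ML, but to see what k-means clustering actually does, here is a minimal plain-Python sketch of the standard algorithm (assign each point to its nearest centroid, then move each centroid to the mean of its points). The six 2-D points, the starting centroids and the function name kmeans are invented for the illustration, and this is not how Azure ML implements it.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

for centroid, cluster in zip(centroids, clusters):
    print(f"centroid {centroid} -> {cluster}")
```

Because no labels are involved, this is an example of unsupervised learning: the algorithm discovers the two groups on its own.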
Machine learning is a subset of artificial intelligence. As you can see in the following figure (Fig. 2), artificial intelligence deals with creating agents that are able to perceive the world around them and make plans and decisions that help them achieve their goals. According to Arthur Samuel, machine learning gives computers the ability to learn without being explicitly programmed. This is crucial in developing intelligent agents, because it is extraordinarily difficult for humans to know everything about the agents’ environment in order to pre-program all the rules and models of the world that they are going to operate in.
Machine learning is divided into supervised, unsupervised and reinforcement learning. In this course you are going to get a taste of supervised and unsupervised learning only. For more on machine learning you can read the article from which I took the following figure:
Figure 2: Machine learning is one of the many subfields of artificial intelligence
In the second chapter, you will investigate how to publish and use an Azure ML model, because the primary objective of creating one is for it to be consumed by you, your organization or your clients in the future. Therefore, in that last topic you will learn the basics of creating and consuming an Azure ML web service.
g. Labs and Final Exam
The course provides you with a variety of labs and questions for each module. It will be very beneficial for you to attempt these labs and try to answer the questions of each module. By practicing these techniques, you will dive deeper into the data science process and gain insight into it. There is also a final exam that challenges you to work on data cleansing and exploration as well as machine learning. If you want to earn a verified certificate from Microsoft, keep in mind that the lab questions account for 60% of the grade and the final exam for 40%. To pass the course you must achieve an overall score of 70%, which the labs alone cannot give you, so you should complete both the labs and the final exam. Otherwise, if you do not want to earn a verified certificate, or to take the labs and the exam even just for the experience, you will still have learned the essentials of data science.
Where to go next?
After finishing “Data Science Essentials”, a whole new world is there for you. You can concentrate on machine learning by taking the “Principles of Machine Learning” or “Applied Machine Learning” courses. You can also learn how to build predictive solutions for big data by enrolling in the “Developing Big Data Solutions with Azure Machine Learning” course. If you are more into artificial intelligence, try “Deep Learning Explained”. If you have experience in building apps, try “Developing Intelligent Apps and Bots”. And finally, if you want to become an expert and get your dream data science job, start the “Microsoft Professional Program in Data Science”.