ML 01 - Machine Learning and R

This post introduces Machine Learning (ML) and shows practical and theoretical examples of its use with Open Source Software and Microsoft technologies. Machine learning is programming computers to optimize a performance criterion using example data or past experience (1). In other words, it means making computers more intelligent: giving them insight and the ability to predict the missing parts of a dataset. To achieve these goals you need data. The more data you have, the more precise the computer's decisions and forecasts will be. The amount of data is not the only thing that matters; its quality and content are just as important.

Imagine you want to make a weather forecast and predict tomorrow's weather from the data that you already have. If you have just a few data samples from the past and all of them were collected during winter, you may not be able to predict tomorrow's weather precisely. You need data covering not only the winter period but all days and seasons of the year. On the other hand, if your data covers every season but only contains last year's daily values, it may still not be enough for a precise forecast, because you need more history: not just last year's data but possibly many years of it. Even comprehensive past weather data may not be enough for a better prediction; you may also need additional parameters such as air pollution per region, moisture, soil humidity, etc.

To summarize, you may need:

  • A massive amount of past data (not only the previous few days or years).
  • Comprehensive data (homogeneously distributed over the cyclic time period; not only from a specific season but from all seasons).
  • Additional parameters/data that affect the value to be predicted (not only temperature but also humidity, etc.).

You can develop programs with machine learning algorithms in almost every programming language, but one language and environment is particularly well suited to this purpose: R, a language for statistical computing and graphics. You can get the free, open source development environment and libraries from https://www.r-project.org. A free IDE with a GUI can be found at https://www.rstudio.com. There are also a large number of open source algorithms and software libraries available at https://cran.r-project.org. In the next posts of this series, we will cover the Microsoft Azure Machine Learning service, which runs custom R and Python scripts together with pre-installed CRAN libraries.

Let's start working on a sample project where we will predict data based on an existing dataset, that is, a set of data samples gathered from observations, either by manual processing or from sensor-like devices. Assuming you have Microsoft Excel or a similar tool, create two columns X and Y with 30 rows: a header row and the data values from 1 to 29. There is no specific reason for selecting the range 1 to 29; it might be 1 to 100 or -50 to -15, whatever you prefer. The image below is the plot of these X, Y data pairs on an Excel chart.

It is intuitive that for any X value, the corresponding Y value is the same as X. Here the Y values are the labels that correspond to the X values, which means you know the label of any specific X value. Now create a new column named YNoise containing 29 random numbers, each between -1 and 1. You can use "=IF(RANDBETWEEN(0,1), -1 * RAND(), RAND())" or a better function to generate these random values. Finally, create two more columns named X and YwNoise: copy the original X values to the new X column, then put the sum of the original Y and YNoise columns in the new YwNoise column. Plot the resulting values on a new chart.
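For readers who prefer to skip the spreadsheet, the same steps can be sketched in R. This is only an illustration: the column names mirror the ones above, `runif` stands in for the Excel formula (both draw uniformly between -1 and 1), and the fixed seed is an assumption added so that the run is repeatable.

```r
# Sketch of the spreadsheet steps in R (assumed equivalent, not the original workbook)
set.seed(42)                                   # assumption: fixed seed for repeatability

X <- 1:29                                      # original X column
Y <- X                                         # ideal labels: Y is the same as X
YNoise <- runif(length(X), min = -1, max = 1)  # uniform random noise between -1 and 1
YwNoise <- Y + YNoise                          # noisy labels

sheet <- data.frame(X, Y, YNoise, YwNoise)
head(sheet)

# Plot the noisy pairs, as on the second Excel chart
plot(sheet$X, sheet$YwNoise, xlab = "X", ylab = "YwNoise")
```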

Assume the original dataset (without noise) is the ideal set that we are trying to model and predict. In a real-world scenario you have no clue about this ideal dataset, and with ML techniques you attempt to find the best model that mimics it. For the sake of this example we provide a portion of this ideal dataset together with its formula, which is Y = X. Based on this formula, if we take X = 100, which is not in our existing dataset (our X values range from 1 to 29), we can directly say that the corresponding Y value is 100. Because we know the formula, we can predict the corresponding value for any X.

In real life, most of the time you don't have ideal data. Your labels (the Y values in the sample above) are affected by various, mostly unpredictable, dynamic parameters, i.e. noise. That is the reason we add random noise to the ideal dataset: to simulate these dynamic parameters. With this noisy dataset (X and YwNoise), it is not as easy to find the formula or relation between the two columns (especially with more complex or bigger data). When we plot this noisy dataset on a chart, it is intuitive that there is a linear relation between the X and Y values. Using the Linear Regression method, we can fit a best-fit line to the noisy dataset and get the formula of this line (something like y = mx + b; for this specific example m is close to 1 and b is close to 0). Using this formula, we can predict the corresponding Y value for any other X value that is not in our existing dataset.
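To make the best-fit line concrete, here is a minimal sketch of how the slope m and intercept b of an ordinary least squares line can be computed by hand and compared against R's built-in `lm`. The data is regenerated here with an assumed seed, so the exact numbers will differ from any spreadsheet run.

```r
# Least squares line computed by hand (sketch with regenerated, assumed data)
set.seed(1)                                # assumption: fixed seed for repeatability
x <- 1:29
y <- x + runif(length(x), -1, 1)           # noisy labels around the ideal line y = x

m_hat <- cov(x, y) / var(x)                # slope of the least squares line
b_hat <- mean(y) - m_hat * mean(x)         # intercept of the least squares line

# Predict the label for an x value outside the dataset
y_100 <- m_hat * 100 + b_hat
c(slope = m_hat, intercept = b_hat, prediction_at_100 = y_100)

# The built-in linear model should give the same coefficients
coef(lm(y ~ x))
```

The estimated slope and intercept should land near 1 and 0, but not exactly on them, because the fit only sees the noisy labels.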

 

 

The dataset that we used is very simple and two-dimensional. In a real-life scenario you may have more complex, multidimensional data, e.g. columns X1, X2, X3, …XN and Y, where Y is the label of the X1 to XN value set. Here X1 may be the price of a car, X2 the engine power, X3 the maximum speed, X4 the time in seconds to reach 100 mph, and Y the brand of the car. Having this car dataset, in the future you can take any X1 to X4 values and predict the likely brand of the car (the Y value).

Summary: given a dataset whose internal relations you know nothing about, you can build a model to predict the missing values of a new data sample that is not in your initial dataset.

As mentioned earlier, the size of the dataset is also important. Consider the sample below: if we have only three samples in our dataset, then following the above procedure to find the best-fit line (linear regression) and its formula over these three points is more difficult. As seen on the chart below, there may be many alternative lines (L1, L2, L3) passing through or close to these three points, and each line leads to a different formula and therefore a completely different result set. For example, for x = 27, L1 gives y = 40, L2 gives y = 26 and L3 gives y = 21.
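The effect of dataset size can be sketched with a small simulation. Everything here is hypothetical: the helper `predict_at`, the ideal line y = x, the noise level and the sample sizes are assumptions chosen only to show how much the prediction at x = 27 moves around when the line is fitted on very few points.

```r
# Spread of predictions at x = 27 for tiny versus large samples (hypothetical setup)
set.seed(7)                                       # assumption: fixed seed for repeatability

predict_at <- function(n, x_new = 27, noise_sd = 5) {
  x <- runif(n, 0, 30)                            # n random x positions
  y <- x + rnorm(n, sd = noise_sd)                # noisy labels around y = x
  fit <- lm(y ~ x)                                # best-fit line for this sample
  unname(predict(fit, data.frame(x = x_new)))     # predicted y at x_new
}

small <- replicate(200, predict_at(3))            # 200 fits on 3 points each
large <- replicate(200, predict_at(300))          # 200 fits on 300 points each

# Standard deviation of the predicted y across repeated samples
c(sd_small = sd(small), sd_large = sd(large))
```

With three points per fit, the predictions scatter much more widely than with three hundred, which is the same effect as the competing lines L1, L2 and L3.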

Below is another sample with a larger dataset. There are 30 points, but the data is concentrated in one specific region, so when we look at it from a larger scale we may again draw more than one line that fits the dataset. At close range there seems to be no big difference between the lines, but at a larger scale the results they give will be very different.
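A rough way to see this in numbers is to fit lines on 30 points that are either squeezed into a narrow x range or spread over a wide one, and then predict far outside the data. The helper `extrapolate`, the ranges and the target x = 100 are assumptions made up for this sketch.

```r
# Extrapolating from a narrow cluster versus a wide spread (hypothetical setup)
set.seed(3)                                       # assumption: fixed seed for repeatability

extrapolate <- function(x_min, x_max, x_new = 100, n = 30) {
  x <- runif(n, x_min, x_max)                     # 30 points inside the given x range
  y <- x + rnorm(n)                               # noisy labels around y = x
  fit <- lm(y ~ x)
  unname(predict(fit, data.frame(x = x_new)))     # prediction far outside the data
}

narrow <- replicate(200, extrapolate(10, 12))     # data clustered in a small region
wide <- replicate(200, extrapolate(0, 30))        # data spread over a wide region

# Standard deviation of the prediction at x = 100 across repeated samples
c(sd_narrow = sd(narrow), sd_wide = sd(wide))
```

Both cases use the same number of points; only the coverage of the x axis changes, yet the clustered data gives far less stable predictions.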

Linear or Non-Linear

In the next sections we will go into detail about choosing the right model and algorithm for an ML solution. For now, we will keep things simple and cover the basics. For the sample dataset described earlier, we used the linear regression method. Without any knowledge of the ideal dataset, we intuitively decided on a linear method. What if we have a larger dataset that is spread over different clusters and is not linear? Take a look at the chart below:

 

For the above dataset, linear regression is not the best model. We can still find the best-fitting line that passes through these two clusters, but it will not give precise results when predicting missing values. Non-linear models will give more precise results; for example, fitting a 3rd degree (cubic) polynomial may be the right decision. The image below shows the 3rd degree polynomial fitted to our dataset.
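As a sketch of the difference, the code below fits both a straight line and a cubic polynomial to made-up non-linear data and compares their errors. The data-generating curve, the noise level and the helper `rmse` are assumptions for illustration; they are not the dataset shown in the chart.

```r
# Linear fit versus cubic polynomial fit on hypothetical non-linear data
set.seed(11)                                      # assumption: fixed seed for repeatability
x <- seq(-3, 3, length.out = 60)
y <- x^3 - 4 * x + rnorm(length(x), sd = 0.5)     # made-up cubic relation plus noise

linear_fit <- lm(y ~ x)                           # straight line
cubic_fit <- lm(y ~ poly(x, 3, raw = TRUE))       # 3rd degree polynomial

# Root mean squared error of each model on the training data
rmse <- function(fit) sqrt(mean(residuals(fit)^2))
c(linear = rmse(linear_fit), cubic = rmse(cubic_fit))

# Draw both fits over the points
plot(x, y)
abline(linear_fit, lty = 2)                       # dashed line: linear model
lines(x, fitted(cubic_fit), col = "red")          # red curve: cubic model
```

A lower training error alone does not prove that the cubic model is the right choice; it only shows that the line cannot follow this particular shape.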

Again, we can't guarantee that the cubic polynomial is the best option. What if there were other data values in another cluster that is not on the path of the curve? Then we would have to find another method. This shows the importance of data size and quality, and also of the model that we use. If we have some insight that the data is roughly linear, then a linear model is the best option because of its simplicity and low computational requirements. Using more complex models for simple cases and datasets is possible, but it wastes computational resources and time. So it is very important to use the correct model in the right place.

Everything mentioned so far concerns two-dimensional data, and it is not always possible to build a model with polynomials, because the higher the degree of the polynomial, the more complex the computation it requires. What if we have multidimensional data? Imagine the surface below, with parameters x and y and the labels as z.
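For the simplest multidimensional case, the same `lm` function accepts more than one input column. The sketch below fits a flat plane z = a*x + b*y + c to made-up data; the plane, the grid and the noise level are assumptions, and a curved surface like the one pictured would need a richer model.

```r
# Fitting a plane to hypothetical data with two inputs (x, y) and one label (z)
set.seed(5)                                       # assumption: fixed seed for repeatability

grid <- expand.grid(x = seq(-2, 2, 0.5), y = seq(-2, 2, 0.5))
grid$z <- 2 * grid$x - grid$y + 1 + rnorm(nrow(grid), sd = 0.2)  # made-up plane plus noise

surface_fit <- lm(z ~ x + y, data = grid)         # linear model with two predictors
coef(surface_fit)                                 # estimated intercept and slopes

# Predict the label of a new (x, y) pair that is not in the grid
z_new <- predict(surface_fit, data.frame(x = 3, y = 3))
z_new
```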

Depending on the complexity, the required accuracy, and the computation time and memory constraints, there are many algorithms to choose from, such as neural networks and decision trees, and we have to find the best one to build our solution.

RStudio

RStudio is one of the most commonly used Open Source IDEs for developing programs in the R language. In this section we will use this tool to get a brief overview of the R language, its syntax and its capabilities. You can copy/paste the R code below into a new R Script editor in RStudio and run it with the CTRL+ALT+R key combination. This self-explanatory code generates a dataset, plots it on a chart and writes it to a CSV file for further use.

 

# Generate noisy data for supervised learning sample

# Generate x values that range from -70 to 70 with step value 3
x <- seq(-70, 70, 3)

# Generate a point cloud around the line
# with a slope of 1 and a y-intercept of 0
noise_magnitude <- 5
m <- 1
b <- 0

# rnorm(length(x)) draws one standard normal noise value per x
y <- m * x + b + noise_magnitude * rnorm(length(x))

# Plot the point cloud on a chart
plot(x, y)

# Combine the two columns to create a data grid
linoise <- cbind(x, y)

# Write out to a CSV file
write.csv(linoise, file = "linoise.csv", row.names = FALSE)

Using the dataset generated with the above R code, we will train a linear regression model with R's "lm" (linear model) function. Next we define three new x values which do not exist in our dataset. Here I selected 95, 51 and 33. You can pick any other values, and as many of them as you wish. In this example, try to pick numbers between -100 and 100; otherwise don't forget to update the axis limit parameters of the "plot" command in the following code. The aim of the code below is to plot our dataset on a chart, fit a line to it, predict the y values corresponding to the x values that you picked, and plot them on the chart as red marks.

# Read the data into a variable
csvdata <- read.csv("linoise.csv")

# Create a linear model using the existing data
fit <- lm(y ~ x, data = csvdata)

# Define one or more new x values whose
# corresponding y labels you want to predict
rnd <- data.frame(x = c(95, 51, 33))

# Using the linear model fit, predict the y values
py <- predict(fit, rnd)

# Plot the existing data on the chart
plot(csvdata$x, csvdata$y, xlim = c(-100, 100), ylim = c(-100, 100), xlab = NA, ylab = NA)

# Draw the best-fit line over the point data
abline(fit)

# Add the 3 new points and their predicted labels as red dots to the same chart
points(rnd$x, py, pch = 19, col = "red")

Executing the above code produces the following chart. The three chosen x values and their predicted y values lie exactly on the fitted line, as expected, because the predictions of a linear model are points on that line.
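Lying on the line says nothing about how far the true labels may be from it. As a hedged sketch, `predict` can also return a prediction interval around each predicted value. The data is regenerated here with an assumed seed instead of reading "linoise.csv", so the numbers will not match the chart exactly.

```r
# Inspecting the fitted line and the uncertainty of its predictions (regenerated data)
set.seed(2)                                       # assumption: fixed seed for repeatability
x <- seq(-70, 70, 3)
y <- x + 5 * rnorm(length(x))                     # same recipe as the generator script
fit <- lm(y ~ x)

coef(fit)                                         # estimated intercept (b) and slope (m)

# Predicted values with a 95% prediction interval for the three new x values
new_x <- data.frame(x = c(95, 51, 33))
pred <- predict(fit, new_x, interval = "prediction", level = 0.95)
cbind(new_x, pred)                                # columns: x, fit, lwr, upr

# The predictions are exactly the points m * x + b on the fitted line
on_line <- coef(fit)[1] + coef(fit)[2] * new_x$x
all.equal(unname(pred[, "fit"]), unname(on_line))
```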

To conclude, along with a practical definition of ML, we have demonstrated a simple ML scenario with simple code written in the R language.

 

I would love to hear what you think about ML. Please leave your thoughts and questions in the comments box below.

 

  1. Introduction to Machine Learning, Ethem Alpaydin