View and summarize SQL Server data using R (walkthrough)
This lesson introduces you to functions in the RevoScaleR package and steps you through the following tasks:
- Connect to SQL Server
- Define a query that has the data you need, or specify a table or view
- Define one or more compute contexts to use when running R code
- Optionally, define transformations that are applied to the data source while it is being read from the source
Define a SQL Server compute context
Run the following R statements in an R environment on the client workstation. This section assumes a data science workstation with Microsoft R Client, because it includes all the RevoScaleR packages, as well as a basic, lightweight set of R tools. For example, you can use Rgui.exe to run the R script in this section.
If the RevoScaleR package is not already loaded, run this line of R code:
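For example, assuming the package is installed under its standard name, RevoScaleR:

```R
# Load the RevoScaleR package into the current R session
library("RevoScaleR")
```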
The quotation marks are optional in this case, though recommended.
If you get an error, make sure that your R development environment is using a library that includes the RevoScaleR package. Use a command such as .libPaths() to view the current library path.
Create the connection string for SQL Server and save it in an R variable, connStr.
You must change the placeholder "your_server_name" to a valid SQL Server instance name. For the server name, you might be able to use only the instance name, or you might need to fully qualify the name, depending on your network.
For SQL Server authentication, the connection syntax is as follows:
connStr <- "Driver=SQL Server;Server=your_server_name;Database=nyctaxi_sample;Uid=your-sql-login;Pwd=your-login-password"
For Windows authentication, the syntax is a bit different:
connStr <- "Driver=SQL Server;Server=your_server_name;Database=nyctaxi_sample;Trusted_Connection=True"
Generally, we recommend that you use Windows authentication where possible, to avoid saving passwords in your R code.
Define variables to use in creating a new compute context. After you create the compute context object, you can use it to run R code on the SQL Server instance.
```R
sqlShareDir <- paste("C:\\AllShare\\", Sys.getenv("USERNAME"), sep = "")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
```
R uses a temporary directory when serializing R objects back and forth between your workstation and the SQL Server computer. You can specify the local directory that is used as sqlShareDir, or accept the default.
Use sqlWait to indicate whether you want R to wait for results from the server. For a discussion of waiting versus non-waiting jobs, see Distributed and parallel computing with RevoScaleR in Microsoft R.
Use the argument sqlConsoleOutput to indicate that you don't want to see output from the R console.
Call the RxInSqlServer constructor to create the compute context object with the variables and connection string already defined, and save the new object in the R variable sqlcc.
sqlcc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir, wait = sqlWait, consoleOutput = sqlConsoleOutput)
By default, the compute context is local, so you need to explicitly set the active compute context.
- rxSetComputeContext returns the previously active compute context invisibly so that you can use it
- rxGetComputeContext returns the active compute context
Note that setting a compute context only affects operations that use functions in the RevoScaleR package; the compute context does not affect the way that open-source R operations are performed.
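For example, a minimal sketch using the sqlcc object defined earlier:

```R
# Make the SQL Server compute context active.
# The previous compute context is returned invisibly,
# so you can save it and restore it later if needed.
oldContext <- rxSetComputeContext(sqlcc)

# Confirm which compute context is now active.
rxGetComputeContext()
```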
Create a data source using RxSqlServer
When using the Microsoft R libraries like RevoScaleR and MicrosoftML, a data source is an object you create using RevoScaleR functions. The data source object specifies some set of data that you want to use for a task, such as model training or feature extraction. You can get data from a variety of sources including SQL Server. For the list of currently supported sources, see RxDataSource.
Earlier you defined a connection string, and saved that information in an R variable. You can re-use that connection information to specify the data you want to get.
Save a SQL query as a string variable. The query defines the data for training the model.
```R
sampleDataQuery <- "SELECT TOP 1000 tipped, fare_amount, passenger_count, trip_time_in_secs, trip_distance, pickup_datetime, dropoff_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude FROM nyctaxi_sample"
```
We've used a TOP clause here to make things run faster, but because the query has no ORDER BY, the actual rows returned can vary. Hence, your summary results might also differ from those listed below. Feel free to remove the TOP clause.
Pass the query definition as an argument to the RxSqlServerData function.
```R
inDataSource <- RxSqlServerData(
    sqlQuery = sampleDataQuery,
    connectionString = connStr,
    colClasses = c(pickup_longitude = "numeric", pickup_latitude = "numeric",
        dropoff_longitude = "numeric", dropoff_latitude = "numeric"),
    rowsPerRead = 500
)
```
The argument colClasses specifies the column types to use when moving the data between SQL Server and R. This is important because SQL Server uses different data types than R, and more data types. For more information, see R Libraries and Data Types.
The argument rowsPerRead is important for managing memory usage and efficient computations. Most of the enhanced analytical functions in R Services (In-Database) process data in chunks and accumulate intermediate results, returning the final computations after all of the data has been read. By adding the rowsPerRead parameter, you can control how many rows of data are read into each chunk for processing. If the value of this parameter is too large, data access might be slow because you don't have enough memory to efficiently process such a large chunk of data. On some systems, setting rowsPerRead to an excessively small value can also slow performance.
At this point, you've created the inDataSource object, but it doesn't contain any data. The data is not pulled from the SQL query into the local environment until you run a function such as rxImport or rxSummary.
However, now that you've defined the data source object, you can use it as the argument to other functions.
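For example, a minimal sketch that pulls the query results into a local data frame with rxImport (this is the point at which the data is actually read from SQL Server):

```R
# Read the rows defined by inDataSource into a local data frame
localData <- rxImport(inData = inDataSource)

# Inspect the first few rows
head(localData)
```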
Use the SQL Server data in R summaries
In this section, you'll try out several of the functions provided in R Services (In-Database) that support remote compute contexts. By applying R functions to the data source, you can explore, summarize, and chart the SQL Server data.
Call the function rxGetVarInfo to get a list of the variables in the data source and their data types.
rxGetVarInfo is a handy function; you can call it on any data frame, or on a set of data in a remote data object, to get information such as the maximum and minimum values, the data type, and the number of levels in factor columns.
Consider running this function after any kind of data input, feature transformation, or feature engineering. By doing so, you can ensure that all the features you want to use in your model are of the expected data type and avoid errors.
rxGetVarInfo(data = inDataSource)
```
Var 1: tipped, Type: integer
Var 2: fare_amount, Type: numeric
Var 3: passenger_count, Type: integer
Var 4: trip_time_in_secs, Type: numeric, Storage: int64
Var 5: trip_distance, Type: numeric
Var 6: pickup_datetime, Type: character
Var 7: dropoff_datetime, Type: character
Var 8: pickup_longitude, Type: numeric
Var 9: pickup_latitude, Type: numeric
Var 10: dropoff_longitude, Type: numeric
```
Now, call the RevoScaleR function rxSummary to get more detailed statistics about individual variables.
rxSummary is based on the R summary function, but has some additional features and advantages: rxSummary works in multiple compute contexts and supports chunking. You can also use rxSummary to transform values, or summarize based on factor levels.
In this example, you summarize the fare amount based on the number of passengers.
```R
start.time <- proc.time()
rxSummary(~fare_amount:F(passenger_count, 1, 6), data = inDataSource)
used.time <- proc.time() - start.time
# CPU time is user time plus system time; elapsed time is the third element
print(paste("It takes CPU Time=", round(used.time[1] + used.time[2], 2),
    " seconds, Elapsed Time=", round(used.time[3], 2),
    " seconds to summarize the inDataSource.", sep = ""))
```
- The first argument to rxSummary specifies the formula or term to summarize by. Here, the F() function is used to convert the values in passenger_count into factors before summarizing. You also have to specify the minimum value (1) and maximum value (6) for the passenger_count factor variable.
- If you do not specify the statistics to output, by default rxSummary outputs Mean, StDev, Min, Max, and the number of valid and missing observations.
- This example also includes some code to track the time the function starts and completes, so that you can compare performance.
If the rxSummary function runs successfully, you should see results like these, followed by a list of statistics by category.
```
rxSummary(formula = ~fare_amount:F(passenger_count, 1, 6), data = inDataSource)

Data: inDataSource (RxSqlServerData Data Source)
Number of valid observations: 1000
```
Bonus exercise on big data
Try defining a new query string that returns all the rows. We recommend that you set up a new data source object for this experiment. You might also try changing the rowsPerRead parameter to see how it affects throughput.
```R
bigDataQuery <- "SELECT tipped, fare_amount, passenger_count, trip_time_in_secs, trip_distance, pickup_datetime, dropoff_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude FROM nyctaxi_sample"

bigDataSource <- RxSqlServerData(
    sqlQuery = bigDataQuery,
    connectionString = connStr,
    colClasses = c(pickup_longitude = "numeric", pickup_latitude = "numeric",
        dropoff_longitude = "numeric", dropoff_latitude = "numeric"),
    rowsPerRead = 500
)

start.time <- proc.time()
rxSummary(~fare_amount:F(passenger_count, 1, 6), data = bigDataSource)
used.time <- proc.time() - start.time
# CPU time is user time plus system time; elapsed time is the third element
print(paste("It takes CPU Time=", round(used.time[1] + used.time[2], 2),
    " seconds, Elapsed Time=", round(used.time[3], 2),
    " seconds to summarize the bigDataSource.", sep = ""))
```
While this is running, you can use a tool like Process Explorer or SQL Profiler to see how the connection is made and the R code is run using SQL Server services.
Another option is to monitor R jobs running on SQL Server using these custom reports.