Load data into memory using rxImport (SQL Server and RevoScaleR tutorial)

Applies to: SQL Server 2016 (13.x) and later versions

This is tutorial 10 of the RevoScaleR tutorial series on how to use RevoScaleR functions with SQL Server.

In this tutorial, you'll learn how to get data from SQL Server, and then use the rxImport function to put the data of interest into a local file. That way, you can analyze it in the local compute context repeatedly, without having to re-query the database.

The rxImport function can be used to move data from a data source into a data frame in session memory, or into an XDF file on disk. If you don't specify a file as destination, data is put into memory as a data frame.

Extract a subset of data from SQL Server to local memory

You've decided that you want to examine only the high risk individuals in more detail. The source table in SQL Server is big, so you want to get the information about just the high-risk customers. You then load that data into a data frame in the memory of the local workstation.

  1. Reset the compute context to your local workstation.

    rxSetComputeContext("local")
    
  2. Create a new SQL Server data source object, providing a valid SQL statement in the sqlQuery parameter. This example gets a subset of the observations with the highest risk scores. That way, only the data you really need is put in local memory.

    sqlServerProbDS \<- RxSqlServerData(
        sqlQuery = paste("SELECT * FROM ccScoreOutput2",
        "WHERE (ccFraudProb > .99)"),
        connectionString = sqlConnString)
    
  3. Call the function rxImport to read the data into a data frame in the local R session.

    highRisk <- rxImport(sqlServerProbDS)
    

    If the operation was successful, you should see a status message like this one: "Rows Read: 35, Total Rows Processed: 35, Total Chunk Time: 0.036 seconds"

  4. Now that the high-risk observations are in an in-memory data frame, you can use various R functions to manipulate the data frame. For example, you can order customers by their risk score, and print a list of the customers who pose the highest risk.

    orderedHighRisk <- highRisk[order(-highRisk$ccFraudProb),]
    row.names(orderedHighRisk) <- NULL
    head(orderedHighRisk)
    

Results

ccFraudLogitScore   state gender cardholder balance numTrans numIntlTrans creditLine ccFraudProb1
9.786345    SD   Male  Principal   23456       25            5 75   0.99994382
9.433040    FL Female  Principal   20629       24           28 75   0.99992003
8.556785    NY Female  Principal   19064       82           53 43   0.99980784
8.188668    AZ Female  Principal   19948       29            0 75   0.99972235
7.551699    NY Female  Principal   11051       95            0 75   0.99947516
7.335080    NV   Male  Principal   21566        4            6  75   0.9993482

More about rxImport

You can use rxImport not just to move data, but to transform data in the process of reading it. For example, you can specify the number of characters for fixed-width columns, provide a description of the variables, set levels for factor columns, and even create new levels to use after importing.

The rxImport function assigns variable names to the columns during the import process, but you can indicate new variable names by using the colInfo parameter, or change data types using the colClasses parameter.

By specifying additional operations in the transforms parameter, you can do elementary processing on each chunk of data that is read.

Next steps