Sample data in SQL Server on Azure

This article shows how to sample data stored in SQL Server on Azure using either SQL or the Python programming language. It also shows how to move sampled data into Azure Machine Learning by saving it to a file, uploading it to an Azure blob, and then reading it into Azure Machine Learning Studio.

The Python sampling uses the pyodbc ODBC library to connect to SQL Server on Azure and the Pandas library to do the sampling.

Note

The sample SQL code in this document assumes that the data is in a SQL Server on Azure. If it is not, refer to Move data to SQL Server on Azure article for instructions on how to move your data to SQL Server on Azure.

Why sample your data? If the dataset you plan to analyze is large, it's usually a good idea to down-sample the data to reduce it to a smaller but representative and more manageable size. Sampling facilitates data understanding, exploration, and feature engineering. Its role in the Team Data Science Process (TDSP) is to enable fast prototyping of the data processing functions and machine learning models.

This sampling task is a step in the Team Data Science Process (TDSP).

Using SQL

This section describes several methods using SQL to perform simple random sampling against the data in the database. Choose a method based on your data size and its distribution.

The following two items show how to use newid in SQL Server to perform the sampling. The method you choose depends on how random you want the sample to be (pk_id in the following sample code is assumed to be an autogenerated primary key).

  1. Less strict random sample

    select  * from <table_name> where <primary_key> in 
    (select top 10 percent <primary_key> from <table_name> order by newid())
    
  2. More random sample

    SELECT * FROM <table_name>
    WHERE 0.1 >= CAST(CHECKSUM(NEWID(), <primary_key>) & 0x7fffffff AS float)/ CAST (0x7fffffff AS int)
    

Tablesample can be used for sampling the data as well. This option may be a better approach if your data size is large (assuming that data on different pages is not correlated) and for the query to complete in a reasonable time.

SELECT *
FROM <table_name> 
TABLESAMPLE (10 PERCENT)

Note

You can explore and generate features from this sampled data by storing it in a new table

Connecting to Azure Machine Learning

You may directly use the sample queries above in Azure Machine Learning code (perhaps a notebook, or code inserted into Designer). See this link for more dtails about how to connect to storage with an Azure Machine Learning datastore.

Using the Python programming language

This section demonstrates using the pyodbc library to establish an ODBC connect to a SQL server database in Python. The database connection string is as follows: (replace servername, dbname, username, and password with your configuration):

#Set up the SQL Azure connection
import pyodbc    
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=<servername>;DATABASE=<dbname>;UID=<username>;PWD=<password>')

The Pandas library in Python provides a rich set of data structures and data analysis tools for data manipulation for Python programming. The following code reads a 0.1% sample of the data from a table in Azure SQL Database into a Pandas data:

import pandas as pd

# Query database and load the returned results in pandas data frame
data_frame = pd.read_sql('''select column1, column2... from <table_name> tablesample (0.1 percent)''', conn)

You can now work with the sampled data in the Pandas data frame.

Connecting to Azure Machine Learning

You can use the following sample code to save the down-sampled data to a file and upload it to an Azure blob. The data in the blob can be directly read into an Azure Machine Learning Experiment using the Import Data module. The steps are as follows:

  1. Write the pandas data frame to a local file

    dataframe.to_csv(os.path.join(os.getcwd(),LOCALFILENAME), sep='\t', encoding='utf-8', index=False)
    
  2. Upload local file to Azure blob

    from azure.storage import BlobService
    import tables
    
    STORAGEACCOUNTNAME= <storage_account_name>
    LOCALFILENAME= <local_file_name>
    STORAGEACCOUNTKEY= <storage_account_key>
    CONTAINERNAME= <container_name>
    BLOBNAME= <blob_name>
    
    output_blob_service=BlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTKEY)    
    localfileprocessed = os.path.join(os.getcwd(),LOCALFILENAME) #assuming file is in current working directory
    
    try:
    
    #perform upload
    output_blob_service.put_block_blob_from_path(CONTAINERNAME,BLOBNAME,localfileprocessed)
    
    except:            
        print ("Something went wrong with uploading blob:"+BLOBNAME)
    
  3. This guide provides an overview of the next step to access data in Azure Machine Learning through datastores and datasets.

The Team Data Science Process in Action example

To walk through an example of the Team Data Science Process a using a public dataset, see Team Data Science Process in Action: using SQL Server.