Sample data in Azure blob containers, SQL Server, and Hive tables
The following articles describe how to sample data stored in one of three Azure locations:
- Azure blob container data is sampled by downloading it programmatically and then sampling it with example Python code.
- SQL Server data is sampled using both SQL and the Python programming language.
- Hive table data is sampled using Hive queries.
This sampling task is a step in the Team Data Science Process (TDSP).
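As a rough illustration of the blob container approach, the sketch below samples rows from CSV data that is assumed to have been downloaded already. The function name `sample_csv_rows`, the `fraction` and `seed` parameters, and the in-memory stand-in data are all hypothetical and are not taken from the linked articles; the download step itself is omitted.

```python
import csv
import io
import random


def sample_csv_rows(lines, fraction, seed=0):
    """Return the header and a random subset of the data rows.

    Each data row is kept independently with probability `fraction`,
    so the sample size is approximately fraction * number of rows.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    reader = csv.reader(lines)
    header = next(reader)  # keep the header row out of the sampling
    sampled = [row for row in reader if rng.random() < fraction]
    return header, sampled


# Stand-in for a file downloaded from a blob container (hypothetical data).
downloaded = io.StringIO(
    "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))
)

header, rows = sample_csv_rows(downloaded, fraction=0.1, seed=42)
print(header, len(rows))
```

Because rows are read one at a time, this style of sampling does not require the full file to fit in memory, which may matter for large downloads.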
Why sample data?
If the dataset you plan to analyze is large, it's usually a good idea to down-sample the data to reduce it to a smaller but still representative and more manageable size. Down-sampling can make data understanding, exploration, and feature engineering easier. The role of sampling in the Cortana Analytics Process is to enable fast prototyping of the data processing functions and machine learning models.
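To make "smaller but representative" concrete, here is a minimal sketch that assumes simple random sampling and uses synthetic data; the population, its size, and the sample size are illustrative choices only. It compares the mean of a 1% random sample with the mean of the full dataset.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is reproducible

# Synthetic "large" dataset: 100,000 values drawn from a normal distribution.
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

# Down-sample to 1,000 values (1%) by simple random sampling without replacement.
sample = rng.sample(population, 1_000)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

With a uniformly random sample of this size, the two means should be close, which is why prototyping on the sample is usually a reasonable proxy for working with the full data.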