Stratified Splitting using rxExecBy
Microsoft R Server 9.1 introduces a new function, rxExecBy(), which partitions an input data source by keys and applies a user-defined function to each partition. You can read more about rxExecBy() here: Pleasingly Parallel using rxExecBy
In this article, we will look at how to use rxExecBy to perform stratified splitting. With stratified splitting, the data is divided so that the same percentage of rows for each value of the strata column (the stratification key column) goes into the training dataset, with the rest going into the test dataset. As a result, both datasets contain a representative sample of the values in the strata column. This is useful when the column you choose as the strata has categories that should be balanced. For example, you might want to balance by income, gender, etc. In stratified splitting, the strata column must contain nominal or categorical data. If the strata column you specify has continuous numeric data or too many unique values, it is not a good candidate for splitting over strata.
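To make the idea concrete, here is a minimal sketch in base R on a small made-up data frame. It does not use rxExecBy or XDF files; the helper name stratified_split, the toy data, and the 75% fraction are assumptions chosen purely for illustration.

```r
# Toy data (made up): a categorical strata column with unequal category sizes.
set.seed(42)
df <- data.frame(
  id = 1:200,
  DayOfWeek = factor(rep(c("Monday", "Tuesday", "Wednesday", "Thursday"),
                         times = c(80, 60, 40, 20)))
)

# Hypothetical helper: within each stratum, randomly pick train_frac of the
# rows for training; every remaining row goes to test.
stratified_split <- function(data, strata, train_frac = 0.75) {
  # Row indices grouped by the value of the strata column
  groups <- split(seq_len(nrow(data)), data[[strata]])
  train_idx <- unlist(lapply(groups, function(rows) {
    n_train <- round(train_frac * length(rows))
    rows[sample.int(length(rows), size = n_train)]
  }), use.names = FALSE)
  test_idx <- setdiff(seq_len(nrow(data)), train_idx)
  list(train = data[train_idx, , drop = FALSE],
       test  = data[test_idx, , drop = FALSE])
}

parts <- stratified_split(df, "DayOfWeek")

# Per-category row counts in train and test, side by side
counts <- cbind(train = table(parts$train$DayOfWeek),
                test  = table(parts$test$DayOfWeek))
print(counts)
```

Because the sampling is done separately inside each category, every category keeps the same 75/25 ratio: Monday's 80 rows, for instance, end up as 60 training rows and 20 test rows. A plain random split of the whole data frame would only match these proportions approximately.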
In the following example, we will split the AirlineDemoSmall XDF file into a 75% training XDF and a 25% test XDF, stratifying on the column DayOfWeek. After splitting, we will check the counts of each DayOfWeek category in the training and test data to verify the stratified split.