Remove Duplicate Rows
Removes the duplicate rows from a dataset
Category: Data Transformation / Manipulation
This article describes how to use the Remove Duplicate Rows module in Azure Machine Learning Studio, to remove potential duplicates from a dataset.
For example, assume your data looks like the following, and represents multiple records for patients.
Clearly, this example has multiple columns with potentially duplicate data. Whether they are actually duplicates depends on your knowledge of the data.
For example, you might know that many patients have the same name. You wouldn't eliminate duplicates using any name columns, only the ID column. That way, only the rows with duplicate ID values are filtered out, regardless of whether the patients have the same name or not.
Alternatively, you might decide to allow duplicates in the ID field, and use some other combination of fileds to find unique records, such as first name, last name, age, and gender.
To set the criteria for whether a row is duplicate or not, you specify a single column or a set of columns to use as keys. Two rows are considered duplicates only when the values in all key columns are equal.
When you run the module, it creates a candidate dataset, and returns a set of rows that have no duplicates across the set of columns you specified.
The source dataset is not altered; this module creates a new dataset that is filtered to exclude duplicates, based on the criteria you specify.
How to use Remove Duplicate Rows
Add the module to your experiment. You can find the Remove Duplicate Rows module under Data Transformation, Manipulation.
Connect the dataset that you want to check for duplicate rows.
In the Properties pane, under Key column selection filter expression, click Launch column selector, to choose columns to use in identifying duplicates.
In this context, Key does not mean a unique identifier. All columns that you select using the Column Selector are designated as key columns. All un-selected columns are considered non-key columns. The combination of columns that you select as keys determines the uniqueness of the records. (Think of it as a SQL statement that uses multiple equality joins.)
- "I want to ensure that IDs are unique": Choose only the ID column.
- "I want to ensure that the combination of first name, last name, and ID is unique": Select all three columns.
Use the Retain first duplicate row checkbox to indicate which row to return when duplicates are found:
- If selected, the first row is returned and others discarded.
If you uncheck this option, the last duplicate row is kept in the results, and others are discarded.
See the Technical notes section for information on how missinng values are handled.
Run the experiment, or click the module and select Run Selected.
To review the results, right-click the module, select Results dataset, and click Visualize.
If the results are difficult to understand, or if you want to exclude some columns from consideration, you can remove columns by using the Select Columns in Dataset module.
To see examples of how this module is used, see the Azure AI Gallery:
Breast cancer detection: Remove Duplicate Rows is used to consolidate the training and test datasets after adding feature columns.
Movie recommendation: Uses Remove Duplicate Rows to ensure that there is only one user rating per movie.
Twitter sentiment analysis: Remove Duplicate Rows is applied to only the ID and popularity columns, to ensure that there is only one ordinal ranking value per movie. In other words, a movie cannot be both 1st and 3rd, so a single value is used even if users ranked the movie differently.
This section contains implementation details, tips, and answers to frequently asked questions.
The module works by looping through all rows of the input dataset. It collects into a candidate output dataset all rows where the unique combination of key column values appears for the first time.
The column array type is preserved independently of the results of row filtering. You cannot force the array to a particular data type by filtering out invalid values; the column array type is based on all values in the column. This restriction also applies when filtering missing values.
The algorithm used for comparing data values is hash-forced.
The input dataset might have missing values in non-key columns and key columns. These rules apply to missing values:
A missing value is considered a valid value in key columns. Missing values can be present in both key.
In a sparse dataset, the missing value is considered equal only if it equals the default representation of a sparse value.
In key columns, a missing value is considered equal to other missing values, but not equal to non-missing values.
|Dataset||Data Table||Input dataset|
|Key column selection filter expression||any||ColumnSelection||Choose the key columns to use when searching for duplicates.|
|Retain first duplicate row||any||Boolean||true||Indicate whether to keep the first row of a set of duplicates and discard others. If False, the last duplicate row encountered is kept.|
|Results dataset||Data Table||Filtered dataset|
|Error 0003||An exception occurs if one or more of the input datasets are null or empty.|
|Error 0020||An exception occurs if the number of columns in some of the datasets passed to the module is too small.|
|Error 0017||An exception occurs if one or more specified columns have a type that is unsupported by the current module.|
For a list of errors specific to Studio modules, see Machine Learning Error codes
For a list of API exceptions, see Machine Learning REST API Error Codes.