Evaluate Recommender

Evaluates the accuracy of recommender model predictions

Category: Machine Learning / Evaluate

Note

Applies to: Machine Learning Studio

This content pertains only to Studio. Similar drag and drop modules have been added to the visual interface in Machine Learning service. Learn more in this article comparing the two versions.

Module overview

This article describes how to use the Evaluate Recommender module in Azure Machine Learning Studio to measure the accuracy of predictions made by a recommendation model. Using this module, you can evaluate four different kinds of recommendations:

  • Ratings predicted for a given user and item

  • Items recommended for a given user

  • A list of users found to be related to a given user

  • A list of items found to be related to a given item

When you create predictions using a recommendation model, slightly different results are returned for each of these supported prediction types. The Evaluate Recommender module deduces the kind of prediction from the column format of the scored dataset. For example, the scored dataset might contain:

  • user-item-rating triples
  • users and their recommended items
  • users and their related users
  • items and their related items

The module also applies the appropriate performance metrics, based on the type of prediction being made.
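
The following Python sketch illustrates how a prediction type could be deduced from column names alone. It is not the module's actual implementation: the function name `infer_prediction_kind` and the returned labels are hypothetical, and the only facts taken from this article are the required column names (User, Item, Rating; User, Item 1, Item 2, ...; User, Related User 1, ...; Item, Related Item 1, ...).

```python
def infer_prediction_kind(columns):
    """Guess the prediction type from scored-dataset column names (illustrative only)."""
    cols = list(columns)
    if cols == ["User", "Item", "Rating"]:
        return "rating prediction"
    if len(cols) < 2:
        return "unknown"
    first, rest = cols[0], cols[1:]

    def numbered(prefix):
        # True when the remaining columns are exactly "<prefix> 1", "<prefix> 2", ...
        return all(c == f"{prefix} {i}" for i, c in enumerate(rest, start=1))

    if first == "User" and numbered("Item"):
        return "item recommendation"
    if first == "User" and numbered("Related User"):
        return "related users"
    if first == "Item" and numbered("Related Item"):
        return "related items"
    return "unknown"


print(infer_prediction_kind(["User", "Item 1", "Item 2", "Item 3"]))
```

A check like this can be useful before connecting a scored dataset, because a misnamed column is a likely cause of evaluation failures.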

Tip

Learn everything you need to know about the end-to-end experience of building a recommendation system in this tutorial from the .NET development team. Includes sample code and discussion of how to call Azure Machine Learning from an application.

Building recommendation engine for .NET applications using Azure Machine Learning

How to configure Evaluate Recommender

The Evaluate Recommender module compares the predictions output by a recommendation model with the corresponding "ground truth" data. For example, the Score Matchbox Recommender module produces scored datasets that can be analyzed with Evaluate Recommender.

Requirements

Evaluate Recommender requires the following datasets as input.

Test dataset

The test dataset contains the "ground truth" data in the form of user-item-rating triples.

If you already have a dataset containing user-item-rating triples, you can apply the Split Data module, using the RecommenderSplit option, to create a training dataset and a related test set from the existing dataset.

Scored dataset

The scored dataset contains the predictions that were generated by the recommendation model.

The columns in this second dataset depend on the kind of prediction you were performing during scoring. For example, the scored dataset might contain any of the following:

  • Users, items, and the ratings the user would likely give for the item
  • A list of users and items recommended for them
  • A list of users, with users who are probably similar to them
  • A list of items, together with similar items

Metrics

Performance metrics for the model are generated based on the type of input. For details, see the sections that follow.

Evaluate predicted ratings

When evaluating predicted ratings, the scored dataset (the second input to Evaluate Recommender) must contain user-item-rating triples, meeting these requirements:

  • The first column of the dataset contains user identifiers.

  • The second column contains the item identifiers.

  • The third column contains the corresponding user-item ratings.

Important

For evaluation to succeed, the column names must be User, Item, and Rating, respectively.

Evaluate Recommender compares the ratings in the ground truth dataset to the predicted ratings of the scored dataset, and computes the mean absolute error (MAE) and the root mean squared error (RMSE).
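
As an illustration of the two metrics, the Python sketch below computes MAE and RMSE from two lists of user-item-rating triples. The function name `mae_rmse` and the sample data are hypothetical. The sketch assumes that scored pairs without a ground truth rating are simply skipped; this article does not say how the module treats such pairs.

```python
import math


def mae_rmse(test_triples, scored_triples):
    """Return (MAE, RMSE) over user-item pairs present in both datasets."""
    truth = {(user, item): rating for user, item, rating in test_triples}
    # Assumption: pairs missing from the test dataset are ignored.
    errors = [
        predicted - truth[(user, item)]
        for user, item, predicted in scored_triples
        if (user, item) in truth
    ]
    if not errors:
        raise ValueError("no user-item pairs in common")
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


test = [("u1", "a", 4.0), ("u1", "b", 2.0), ("u2", "a", 5.0)]
scored = [("u1", "a", 3.0), ("u1", "b", 2.0), ("u2", "a", 3.0)]
mae, rmse = mae_rmse(test, scored)
print(mae, rmse)
```

Because RMSE squares each error before averaging, it penalizes large individual errors more heavily than MAE does.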

The other parameters of Evaluate Recommender have no effect on evaluation of rating predictions.

Evaluate item recommendations

When evaluating item recommendations, use a scored dataset that includes the recommended items for each user:

  • The first column of the dataset must contain the user identifier.

  • All subsequent columns should contain the corresponding recommended item identifiers, ordered by how relevant an item is to the user.

    Before connecting this dataset, we recommend that you sort the dataset so that the most relevant items come first.

The other parameters of Evaluate Recommender have no effect on evaluation of item recommendations.

Important

For Evaluate Recommender to work, the column names must be User, Item 1, Item 2, Item 3 and so forth.

Evaluate Recommender computes the average normalized discounted cumulative gain (NDCG) and returns it in the output dataset.

Because it is impossible to know the actual "ground truth" for the recommended items, Evaluate Recommender uses the user-item ratings in the test dataset as gains in the computation of the NDCG. For evaluation to work, the recommender scoring module must therefore produce recommendations only for items that have ground truth ratings in the test dataset.
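
The Python sketch below shows one common way to compute an average NDCG with test-set ratings as gains. The exact discount and normalization used by Evaluate Recommender are not stated in this article, so the log2 discount and the "ideal ordering" normalization here are assumptions, as are the function names and the dictionary-based input format.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount (assumed)."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))


def ndcg_for_user(recommended_items, user_ratings):
    # Every recommended item must have a ground truth rating;
    # a missing item raises KeyError in this sketch.
    gains = [user_ratings[item] for item in recommended_items]
    # Assumption: the ideal list is the user's best-rated items, same length.
    ideal = sorted(user_ratings.values(), reverse=True)[: len(gains)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def average_ndcg(scored_rows, test_triples):
    """scored_rows maps each user to an ordered list of recommended items."""
    ratings = {}
    for user, item, rating in test_triples:
        ratings.setdefault(user, {})[item] = rating
    values = [ndcg_for_user(items, ratings[user]) for user, items in scored_rows.items()]
    return sum(values) / len(values)


triples = [("u1", "a", 5.0), ("u1", "b", 3.0), ("u1", "c", 1.0)]
print(average_ndcg({"u1": ["a", "b", "c"]}, triples))
print(average_ndcg({"u1": ["c", "b", "a"]}, triples))
```

In this sketch, a recommendation list that ranks the highest-rated items first scores higher than the same items in reverse order, which is the behavior NDCG is intended to reward.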

Evaluate predictions of related users

When evaluating predictions of related users, use a scored dataset that contains the related users for each user of interest:

  • The first column must contain the identifiers for each user of interest.

  • All subsequent columns contain the identifiers for the predicted related users. Related users are ordered by the strength of the relationship (most related user first).

  • For Evaluate Recommender to work, the column names must be User, Related User 1, Related User 2, Related User 3, and so forth.

Tip

You can influence evaluation by setting the minimum number of items that a user of interest and its related users must have in common.

Evaluate Recommender computes the average normalized discounted cumulative gain (NDCG), based on Manhattan (L1 Sim NDCG) and Euclidean (L2 Sim NDCG) distances, and returns both values in the output dataset. Because there is no actual ground truth for the related users, Evaluate Recommender uses the following procedure to compute the average NDCGs.

For each user of interest in the scored dataset:

  1. Find all items in the test dataset which have been rated by both the user of interest and the related user under consideration.

  2. Create two vectors from the ratings of these items: one for the user of interest, and one for the related user under consideration.

  3. Compute the gain as the similarity of the resulting two rating vectors, in terms of their Manhattan (L1) or Euclidean (L2) distance.

  4. Compute the L1 Sim NDCG and the L2 Sim NDCG, using the gains of all related users.

  5. Average the NDCG values over all users in the scored dataset.

In other words, gain is computed as the similarity (normalized Manhattan or Euclidean distances) between a user of interest (the entry in the first column of the scored dataset) and a given related user (the entry in the n-th column of the scored dataset). The gain of this user pair is computed using all items that both users have rated in the original data (test set). The NDCG is then computed by aggregating the individual gains for a single user of interest and all related users, using logarithmic discounting. That is, one NDCG value is computed for each user of interest (each row in the scored dataset). The number that is finally reported is the arithmetic average over all users of interest in the scored dataset (i.e. its rows).

Hence, for evaluation to work, the recommender scoring module must predict only related users who have rated items with ground truth ratings in the test dataset.
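
The procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the module's implementation: the article does not specify how distances are normalized or turned into similarities, so the per-item averaging, the 1 / (1 + distance) mapping, the handling of pairs below the minimum (they are skipped), and all function names are hypothetical.

```python
import math


def similarity_gains(ratings_a, ratings_b, min_common=2):
    """Return (L1 gain, L2 gain) for two users, or None if too few common items."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < min_common:
        return None
    diffs = [abs(ratings_a[i] - ratings_b[i]) for i in common]
    l1 = sum(diffs) / len(common)                         # assumed normalization
    l2 = math.sqrt(sum(d * d for d in diffs) / len(common))
    return 1.0 / (1.0 + l1), 1.0 / (1.0 + l2)             # assumed distance-to-similarity map


def ndcg(gains):
    def dcg(values):
        return sum(g / math.log2(p + 1) for p, g in enumerate(values, start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def average_sim_ndcg(scored_rows, ratings_by_user, min_common=2):
    """scored_rows maps each user of interest to an ordered list of related users."""
    l1_values, l2_values = [], []
    for user, related in scored_rows.items():
        pairs = [similarity_gains(ratings_by_user[user], ratings_by_user[r], min_common)
                 for r in related]
        pairs = [p for p in pairs if p is not None]  # assumption: skip pairs below the minimum
        if pairs:
            l1_values.append(ndcg([p[0] for p in pairs]))
            l2_values.append(ndcg([p[1] for p in pairs]))
    n = len(l1_values)
    return (sum(l1_values) / n, sum(l2_values) / n) if n else (0.0, 0.0)


ratings = {"u1": {"a": 5, "b": 3}, "u2": {"a": 5, "b": 3}, "u3": {"a": 1, "b": 1}}
print(average_sim_ndcg({"u1": ["u2", "u3"]}, ratings))
print(average_sim_ndcg({"u1": ["u3", "u2"]}, ratings))
```

The `min_common` argument plays the role of the module parameter that sets the minimum number of items the two users must have rated in common.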

Evaluate predictions of related items

When evaluating the prediction of related items, use a scored dataset that contains the related items for each item of interest:

  • The first column must contain identifiers for the items of interest.

  • All subsequent columns should contain identifiers for the predicted related items, ordered by how related they are to the item of interest (most related item first).

  • For Evaluate Recommender to work, the column names must be Item, Related Item 1, Related Item 2, Related Item 3, and so forth.

Tip

You can influence evaluation by setting the minimum number of users that an item of interest and its related items must have in common.

Evaluate Recommender computes the average normalized discounted cumulative gain (NDCG) based on Manhattan (L1 Sim NDCG) and Euclidean (L2 Sim NDCG) distances and returns both values in the output dataset. Because there is no actual ground truth for the related items, Evaluate Recommender computes the average NDCGs as follows:

For each item of interest in the scored dataset:

  1. Find all users in the test dataset who have rated both the item of interest and the related item under consideration.

  2. Create two vectors from the ratings of these users: one for the item of interest, and one for the related item under consideration.

  3. Compute the gain as the similarity of the resulting two rating vectors in terms of their Manhattan (L1) or Euclidean (L2) distance.

  4. Compute the L1 Sim NDCG and the L2 Sim NDCG using the gains of all related items.

  5. Average the NDCG values over all items of interest in the scored dataset.

In other words, gain is computed as the similarity (normalized Manhattan or Euclidean distances) between an item of interest (the entry in the first column of the scored dataset) and a given related item (the entry in the n-th column of the scored dataset). The gain of this item pair is computed using all users who have rated both of these items in the original data (test set). The NDCG is then computed by aggregating the individual gains for a single item of interest and all its related items, using logarithmic discounting. That is, one NDCG value is computed for each item of interest (each row in the scored dataset). The number that is finally reported is the arithmetic average over all items of interest in the scored dataset (i.e. its rows).

Therefore, for evaluation to work, the recommender scoring module must predict only related items that have ground truth ratings in the test dataset.
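
Steps 1 and 2 of the procedure for related items (finding the common raters and building the two rating vectors) can be sketched in Python as shown below. The function name `paired_item_vectors` is hypothetical, and returning `None` for pairs with too few common users is an assumption. The default of 2 mirrors the default of the corresponding module parameter.

```python
def paired_item_vectors(test_triples, item, related_item, min_common_users=2):
    """Build aligned rating vectors for two items from user-item-rating triples."""
    by_item = {}
    for user, rated_item, rating in test_triples:
        by_item.setdefault(rated_item, {})[user] = rating
    ratings_a = by_item.get(item, {})
    ratings_b = by_item.get(related_item, {})
    # Step 1: users who rated both the item of interest and the related item.
    common_users = sorted(set(ratings_a) & set(ratings_b))
    if len(common_users) < min_common_users:
        return None  # assumption: the pair is not usable for computing a gain
    # Step 2: one vector per item, aligned on the same users.
    return ([ratings_a[u] for u in common_users],
            [ratings_b[u] for u in common_users])


triples = [
    ("u1", "x", 5), ("u1", "y", 4),
    ("u2", "x", 3), ("u2", "y", 1),
    ("u3", "x", 2),
]
print(paired_item_vectors(triples, "x", "y"))
```

The two vectors can then be compared with a Manhattan or Euclidean distance to obtain the gain described in step 3.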

Examples

For examples of how recommendation models are used in Azure Machine Learning, see the Azure AI Gallery.

Expected inputs

| Name | Type | Description |
|---|---|---|
| Test dataset | Data Table | Test dataset |
| Scored dataset | Data Table | Scored dataset |

Module parameters

| Name | Range | Type | Default | Description |
|---|---|---|---|---|
| Minimum number of items that the query user and the related user must have rated in common | >=1 | Integer | 2 | Specify the minimum number of items that must have been rated by both the query user and the related user. This parameter is optional. |
| Minimum number of users that the query item and the related item must have been rated by in common | >=1 | Integer | 2 | Specify the minimum number of users that must have rated both the query item and the related item. This parameter is optional. |

Outputs

| Name | Type | Description |
|---|---|---|
| Metric | Data Table | A table of evaluation metrics |

Exceptions

| Exception | Description |
|---|---|
| Error 0022 | Exception occurs if the number of selected columns in the input dataset does not equal the expected number. |
| Error 0003 | Exception occurs if one or more inputs are null or empty. |
| Error 0017 | Exception occurs if one or more specified columns have a type that is unsupported by the current module. |
| Error 0034 | Exception occurs if more than one rating exists for a given user-item pair. |
| Error 0018 | Exception occurs if the input dataset is not valid. |
| Error 0002 | Exception occurs if one or more parameters could not be parsed or converted from the specified type into the type required by the target method. |
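
Error 0034 can be anticipated by checking a dataset for repeated user-item pairs before it is connected to the module. The Python sketch below is one way to do that outside of Studio. The function name `duplicate_pairs` and the sample data are hypothetical.

```python
from collections import Counter


def duplicate_pairs(triples):
    """Return the user-item pairs that appear more than once."""
    counts = Counter((user, item) for user, item, _rating in triples)
    return sorted(pair for pair, n in counts.items() if n > 1)


triples = [("u1", "a", 4), ("u1", "a", 5), ("u2", "a", 3)]
print(duplicate_pairs(triples))
```

An empty result means that each user-item pair has at most one rating.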

For a list of errors specific to Studio modules, see Machine Learning Error codes.

For a list of API exceptions, see Machine Learning REST API Error Codes.

See also

Train Matchbox Recommender
Score Matchbox Recommender