Generate movie recommendations by using Apache Mahout with Hadoop in HDInsight (PowerShell)

Learn how to use the Apache Mahout machine learning library with Azure HDInsight to generate movie recommendations. The example in this document uses Azure PowerShell to run Mahout jobs.

Prerequisites

Generate recommendations by using Azure PowerShell

Warning

The job in this section runs through Azure PowerShell. Many of the classes provided with Mahout do not currently work with Azure PowerShell. For a list of classes that do not work with Azure PowerShell, see the Troubleshooting section.

For an example of using SSH to connect to HDInsight and run Mahout examples directly on the cluster, see Generate movie recommendations using Mahout and HDInsight (SSH).

One of the functions provided by Mahout is a recommendation engine. This engine accepts data in the format userID, itemId, and prefValue (the user's preference for the item). Mahout uses the data to find users with similar item preferences, which can then be used to make recommendations.

The following example is a simplified walk-through of how the recommendation process works:

  • Co-occurrence: Joe, Alice, and Bob all liked Star Wars, The Empire Strikes Back, and Return of the Jedi. Mahout determines that users who like any one of these movies also like the other two.

  • Co-occurrence: Bob and Alice also liked The Phantom Menace, Attack of the Clones, and Revenge of the Sith. Mahout determines that users who liked the previous three movies also like these three.

  • Similarity recommendation: Because Joe liked the first three movies, Mahout looks at movies that others with similar preferences liked but that Joe has not yet watched, liked, or rated. In this case, Mahout recommends The Phantom Menace, Attack of the Clones, and Revenge of the Sith.
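The walk-through above can be sketched locally in a few lines of PowerShell. This is only an illustration under stated assumptions: the $likes table, the pair-counting scheme, and the scoring rule are hypothetical and much simpler than what Mahout's RecommenderJob actually does.

```powershell
# Illustrative only: a tiny in-memory version of the walk-through.
# $likes and the counting logic are hypothetical, not Mahout's implementation.
$originals = @('Star Wars', 'The Empire Strikes Back', 'Return of the Jedi')
$prequels  = @('The Phantom Menace', 'Attack of the Clones', 'Revenge of the Sith')
$likes = @{
    Joe   = $originals
    Alice = $originals + $prequels
    Bob   = $originals + $prequels
}

# Count how often each ordered pair of movies is liked by the same user.
$cooccurrence = @{}
foreach ($user in $likes.Keys) {
    foreach ($a in $likes[$user]) {
        foreach ($b in $likes[$user]) {
            if ($a -ne $b) {
                $key = "$a|$b"
                # A missing key yields $null, which casts to 0.
                $cooccurrence[$key] = 1 + [int]$cooccurrence[$key]
            }
        }
    }
}

# Score every movie Joe has not liked by summing its co-occurrence
# counts with the movies he has liked.
$scores = @{}
foreach ($key in $cooccurrence.Keys) {
    $a, $b = $key.Split('|')
    if (($likes['Joe'] -contains $a) -and ($likes['Joe'] -notcontains $b)) {
        $scores[$b] = $cooccurrence[$key] + [int]$scores[$b]
    }
}

# Movies recommended to Joe, in alphabetical order.
$recommended = $scores.Keys | Sort-Object
$recommended
```

In this toy data set only the three prequels receive a score for Joe, because they are the only movies that co-occur with titles he liked and that he has not liked himself.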

Understanding the data

GroupLens Research provides rating data for movies in a format that is compatible with Mahout. This data is available on the default storage for your cluster at /HdiSamples/HdiSamples/MahoutMovieData.

There are two files, moviedb.txt (information about the movies) and user-ratings.txt. The user-ratings.txt file is used during analysis. The moviedb.txt file is used to provide user-friendly text when displaying the results of the analysis.

The data contained in user-ratings.txt has a structure of userID, movieID, userRating, and timestamp, which tells how highly each user rated a movie. Here is an example of the data:

196    242    3    881250949
186    302    3    891717742
22     377    1    878887116
244    51     2    880606923
166    346    1    886397596
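To make the four columns concrete, the following sketch parses a few of the rows above into objects. It assumes the file is tab-delimited (the display script later in this document splits on tabs) and that the timestamp is Unix epoch seconds in UTC; the property names are chosen for illustration only.

```powershell
# Sample rows taken from the excerpt; tabs are assumed as the delimiter.
$sample = @(
    "196`t242`t3`t881250949",
    "186`t302`t3`t891717742",
    "22`t377`t1`t878887116"
)

$ratings = foreach ($line in $sample) {
    $userId, $movieId, $rating, $timestamp = $line.Split("`t")
    [pscustomobject]@{
        UserId  = [int]$userId
        MovieId = [int]$movieId
        Rating  = [int]$rating
        # Assumption: the timestamp is Unix epoch seconds (UTC).
        RatedOn = [DateTimeOffset]::FromUnixTimeSeconds([long]$timestamp).UtcDateTime
    }
}

$ratings | Format-Table
```

Mahout's recommender job reads only the first three columns (user, item, preference), so the timestamp matters only if you want to inspect the data yourself.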

Run the job

Use the following PowerShell script to run a job that uses the Mahout recommendation engine with the movie data:

Note

This script prompts you for information that is used to connect to your HDInsight cluster and run jobs. It may take several minutes for the jobs to complete and for the output.txt file to download.

# Login to your Azure subscription
# Is there an active Azure subscription?
$sub = Get-AzureRmSubscription -ErrorAction SilentlyContinue
if(-not($sub))
{
    Add-AzureRmAccount
}

# If you have multiple subscriptions, set the one to use
# $subscriptionID = "<subscription ID to use>"
# Select-AzureRmSubscription -SubscriptionId $subscriptionID

# Get cluster info
$clusterName = Read-Host -Prompt "Enter the HDInsight cluster name"
$creds=Get-Credential -UserName "admin" -Message "Enter the login for the cluster"

#Get the cluster info so we can get the resource group, storage, etc.
$clusterInfo = Get-AzureRmHDInsightCluster -ClusterName $clusterName
$resourceGroup = $clusterInfo.ResourceGroup
$storageAccountName = $clusterInfo.DefaultStorageAccount.split('.')[0]
$container = $clusterInfo.DefaultStorageContainer
$storageAccountKey = (Get-AzureRmStorageAccountKey `
    -Name $storageAccountName `
    -ResourceGroupName $resourceGroup)[0].Value

#Create a storage context, used later to download the results
$context = New-AzureStorageContext `
    -StorageAccountName $storageAccountName `
    -StorageAccountKey $storageAccountKey

#Use Hive to figure out the path to the mahout examples
#Because the file name/path has a version number in it that changes
$queryString = "!ls /usr/hdp/current/mahout-client"
$hiveJobDefinition = New-AzureRmHDInsightHiveJobDefinition -Query $queryString
$hiveJob = Start-AzureRmHDInsightJob -ClusterName $clusterName -JobDefinition $hiveJobDefinition -HttpCredential $creds
Wait-AzureRmHDInsightJob -ClusterName $clusterName -JobId $hiveJob.JobId -HttpCredential $creds > $null
#Get the files returned from Hive
$files = Get-AzureRmHDInsightJobOutput -ClusterName $clusterName -JobId $hiveJob.JobId -DefaultContainer $container -DefaultStorageAccountName $storageAccountName -DefaultStorageAccountKey $storageAccountKey -HttpCredential $creds
#Find the file that starts with mahout-examples and ends in job.jar
$jarFile = $files | Select-String "mahout-examples.+job\.jar" | ForEach-Object { $_.Matches.Value }
#Add the full path
$jarFile = "file:///usr/hdp/current/mahout-client/$jarFile"

# The arguments for the mahout job
# * input - the path to the data uploaded to HDInsight
# * output - the path to store output data
# * tempDir - the directory for temp files
$jobArguments = "-s", "SIMILARITY_COOCCURRENCE", `
                "--input", "/HdiSamples/HdiSamples/MahoutMovieData/user-ratings.txt",
                "--output", "/example/out",
                "--tempDir", "/example/temp"

# Create the job definition
$jobDefinition = New-AzureRmHDInsightMapReduceJobDefinition `
    -JarFile $jarFile `
    -ClassName "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob" `
    -Arguments $jobArguments

# Start the job
$job = Start-AzureRmHDInsightJob `
    -ClusterName $clusterName `
    -JobDefinition $jobDefinition `
    -HttpCredential $creds

# Wait on the job to complete
Write-Host "Wait for the job to complete ..." -ForegroundColor Green
Wait-AzureRmHDInsightJob `
        -ClusterName $clusterName `
        -JobId $job.JobId `
        -HttpCredential $creds

# Write out any error information
Write-Host "STDERR"
Get-AzureRmHDInsightJobOutput `
        -Clustername $clusterName `
        -JobId $job.JobId `
        -HttpCredential $creds `
        -DisplayOutputType StandardError

# Download the output
Get-AzureStorageBlobContent `
        -Blob example/out/part-r-00000 `
        -Container $container `
        -Destination output.txt `
        -Context $context
#Download movie and user files for use in displaying results
Get-AzureStorageBlobContent -blob "HdiSamples/HdiSamples/MahoutMovieData/moviedb.txt" `
        -Container $container `
        -Destination moviedb.txt `
        -Context $context
Get-AzureStorageBlobContent -blob "HdiSamples/HdiSamples/MahoutMovieData/user-ratings.txt" `
        -Container $container `
        -Destination user-ratings.txt `
        -Context $context

Note

Mahout jobs do not remove temporary data that is created while processing the job. The --tempDir parameter is specified in the example job to isolate the temporary files in a specific directory.

The Mahout job does not return its output to STDOUT. Instead, it stores the output in the specified output directory as part-r-00000. The script downloads this file to output.txt in the current directory on your workstation.

The following text is an example of the content of this file:

1    [234:5.0,347:5.0,237:5.0,47:5.0,282:5.0,275:5.0,88:5.0,515:5.0,514:5.0,121:5.0]
2    [282:5.0,210:5.0,237:5.0,234:5.0,347:5.0,121:5.0,258:5.0,515:5.0,462:5.0,79:5.0]
3    [284:5.0,285:4.828125,508:4.7543354,845:4.75,319:4.705128,124:4.7045455,150:4.6938777,311:4.6769233,248:4.65625,272:4.649266]
4    [690:5.0,12:5.0,234:5.0,275:5.0,121:5.0,255:5.0,237:5.0,895:5.0,282:5.0,117:5.0]

The first column is the userID. The values contained in '[' and ']' are movieId:recommendationScore pairs.
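As a quick check of that layout, the sketch below splits one output line into its user ID and a list of movie/score objects. The line is shortened from the sample output above, a tab between the two columns is assumed (the display script in this document splits on tabs), and the variable and property names are illustrative.

```powershell
# One line shortened from the sample output; a tab separates the two columns.
$line = "3`t[284:5.0,285:4.828125,508:4.7543354]"

$userId, $list = $line.Split("`t")

# Strip the surrounding brackets, then split into movieId:score pairs.
$recs = foreach ($pair in $list.TrimStart('[').TrimEnd(']').Split(',')) {
    $movieId, $score = $pair.Split(':')
    [pscustomobject]@{
        MovieId = [int]$movieId
        Score   = [double]$score
    }
}

$recs | Sort-Object -Property Score -Descending | Format-Table
```

The movie IDs are still numeric at this point; resolving them to titles requires the moviedb.txt file.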

The script also downloads the moviedb.txt and user-ratings.txt files, which are needed to format the output to be more readable.

View the output

Although the generated output might be suitable for use in an application, it is not easy to read. The moviedb.txt file downloaded from the cluster storage can be used to resolve each movieId to a movie name. Use the following PowerShell script to display recommendations with movie names:

<#
.SYNOPSIS
    Displays recommendations for movies.
.DESCRIPTION
    Displays recommendations generated by Mahout
    with HDInsight example in a human readable format.
.EXAMPLE
    .\Show-Recommendation -userId 4
        -userDataFile "user-ratings.txt"
        -movieFile "moviedb.txt"
        -recommendationFile "output.txt"
#>

Param(
    #The user ID
    [Parameter(Mandatory = $true)]
    [String]$userId,

    [Parameter(Mandatory = $true)]
    [String]$userDataFile,

    [Parameter(Mandatory = $true)]
    [String]$movieFile,

    [Parameter(Mandatory = $true)]
    [String]$recommendationFile
)
# Read movie ID & description into hash table
$movieById = @{}
foreach($line in Get-Content $movieFile)
{
    $tokens = $line.Split("|")
    $movieById[$tokens[0]] = $tokens[1]
}
# Load movies user has already seen (rated)
# into a hash table
$ratedMovieIds = @{}
foreach($line in Get-Content $userDataFile)
{
    $tokens = $line.Split("`t")
    if($tokens[0] -eq $userId)
    {
        # Resolve the ID to the movie name
        $ratedMovieIds[$movieById[$tokens[1]]] = $tokens[2]
    }
}
# Read recommendations generated by Mahout
$recommendations = @{}
foreach($line in get-content $recommendationFile)
{
    $tokens = $line.Split("`t")
    if($tokens[0] -eq $userId)
    {
        #Trim leading/trailing [] and split at ,
        $movieIdAndScores = $tokens[1].TrimStart("[").TrimEnd("]").Split(",")
        foreach($movieIdAndScore in $movieIdAndScores)
        {
            #Split at : and store title and score in a hash table
            $idAndScore = $movieIdAndScore.Split(":")
            $recommendations[$movieById[$idAndScore[0]]] = $idAndScore[1]
        }
        break
    }
}

# Write-Host is used here because Write-Output has no -ForegroundColor parameter
Write-Host "Rated movies" -ForegroundColor Green
Write-Host "---------------------------" -ForegroundColor Green
$ratedFormat = @{Expression={$_.Name};Label="Movie";Width=40}, `
                @{Expression={$_.Value};Label="Rating"}
$ratedMovieIds | Format-Table $ratedFormat
Write-Host "---------------------------" -ForegroundColor Green

Write-Host "Recommended movies" -ForegroundColor Green
Write-Host "---------------------------" -ForegroundColor Green
$recommendationFormat = @{Expression={$_.Name};Label="Movie";Width=40}, `
                        @{Expression={$_.Value};Label="Score"}
$recommendations | Format-Table $recommendationFormat

Assuming the script is saved as show-recommendation.ps1, use the following command to display the recommendations in a user-friendly format:

.\show-recommendation.ps1 -userId 4 -userDataFile .\user-ratings.txt -movieFile .\moviedb.txt -recommendationFile .\output.txt

The output is similar to the following text:

Reading movies descriptions
Reading rated movies
Reading recommendations
Rated movies
---------------------------
Movie                                    Rating
-----                                    ------
Devil's Own, The (1997)                  1
Alien: Resurrection (1997)               3
187 (1997)                               2
(lines omitted)

---------------------------
Recommended movies
---------------------------

Movie                                    Score
-----                                    -----
Good Will Hunting (1997)                 4.6504064
Swingers (1996)                          4.6862745
Wings of the Dove, The (1997)            4.6666665
People vs. Larry Flynt, The (1996)       4.834559
Everyone Says I Love You (1996)          4.707071
Secrets & Lies (1996)                    4.818182
That Thing You Do! (1996)                4.75
Grosse Pointe Blank (1997)               4.8235292
Donnie Brasco (1997)                     4.6792455
Lone Star (1996)                         4.7099237

Troubleshooting

Cannot overwrite files

Mahout jobs do not clean up temporary files that were created during processing. In addition, the jobs do not overwrite an existing output file.

To avoid errors when running Mahout jobs, delete temporary and output files between runs. To remove the files created by the earlier scripts in this document, use the following PowerShell script:

# Login to your Azure subscription
# Is there an active Azure subscription?
$sub = Get-AzureRmSubscription -ErrorAction SilentlyContinue
if(-not($sub))
{
    Connect-AzureRmAccount
}

# Get cluster info
$clusterName = Read-Host -Prompt "Enter the HDInsight cluster name"
$creds=Get-Credential -Message "Enter the login for the cluster"

#Get the cluster info so we can get the resource group, storage, etc.
$clusterInfo = Get-AzureRmHDInsightCluster -ClusterName $clusterName
$resourceGroup = $clusterInfo.ResourceGroup
$storageAccountName = $clusterInfo.DefaultStorageAccount.split('.')[0]
$container = $clusterInfo.DefaultStorageContainer
$storageAccountKey = (Get-AzureRmStorageAccountKey `
    -Name $storageAccountName `
    -ResourceGroupName $resourceGroup)[0].Value

#Create a storage context, used to list and delete the blobs
$context = New-AzureStorageContext `
    -StorageAccountName $storageAccountName `
    -StorageAccountKey $storageAccountKey

#Azure PowerShell can't delete blobs using wildcard,
#so have to get a list and delete one at a time
# Start with the output
$blobs = Get-AzureStorageBlob -Container $container -Context $context -Prefix "example/out"
foreach($blob in $blobs)
{
    Remove-AzureStorageBlob -Blob $blob.Name -Container $container -context $context
}
# Next the temp files
$blobs = Get-AzureStorageBlob -Container $container -Context $context -Prefix "example/temp"
foreach($blob in $blobs)
{
    Remove-AzureStorageBlob -Blob $blob.Name -Container $container -context $context
}

Classes that do not work with Azure PowerShell

Mahout jobs that use the following classes return various error messages when run from PowerShell:

  • org.apache.mahout.utils.clustering.ClusterDumper
  • org.apache.mahout.utils.SequenceFileDumper
  • org.apache.mahout.utils.vectors.lucene.Driver
  • org.apache.mahout.utils.vectors.arff.Driver
  • org.apache.mahout.text.WikipediaToSequenceFile
  • org.apache.mahout.clustering.streaming.tools.ResplitSequenceFiles
  • org.apache.mahout.clustering.streaming.tools.ClusterQualitySummarizer
  • org.apache.mahout.classifier.sgd.TrainLogistic
  • org.apache.mahout.classifier.sgd.RunLogistic
  • org.apache.mahout.classifier.sgd.TrainAdaptiveLogistic
  • org.apache.mahout.classifier.sgd.ValidateAdaptiveLogistic
  • org.apache.mahout.classifier.sgd.RunAdaptiveLogistic
  • org.apache.mahout.classifier.sequencelearning.hmm.BaumWelchTrainer
  • org.apache.mahout.classifier.sequencelearning.hmm.ViterbiEvaluator
  • org.apache.mahout.classifier.sequencelearning.hmm.RandomSequenceGenerator
  • org.apache.mahout.classifier.df.tools.Describe

To run jobs that use these classes, connect to the HDInsight cluster using SSH and run the jobs from the command line. For an example of using SSH to run Mahout jobs, see Generate movie recommendations using Mahout and HDInsight (SSH).

Next steps

Now that you have learned how to use Mahout, discover other ways of working with data on HDInsight: