ngram: Machine Learning Feature Extractors

Article
05/23/2023

Feature extractors that can be used with mtText.

Usage

  ngramCount(ngramLength = 1, skipLength = 0, maxNumTerms = 1e+07,
    weighting = "tf")

  ngramHash(ngramLength = 1, skipLength = 0, hashBits = 16,
    seed = 314489979, ordered = TRUE, invertHash = 0)

Arguments

`ngramLength`

An integer that specifies the maximum number of tokens to take when constructing an n-gram. The default value is 1.

`skipLength`

An integer that specifies the maximum number of tokens to skip when constructing an n-gram. If the value specified as skip length is k, then n-grams can contain up to k skips (not necessarily consecutive). For example, if k=2, then the 3-grams extracted from the text "the sky is blue today" are: "the sky is", "the sky blue", "the sky today", "the is blue", "the is today" and "the blue today". The default value is 0.
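The skip rule can be sketched in base R. This is only an illustration of one reading of the description, not the library's implementation: it assumes a "skip" is any token left out between the first and last token of an n-gram, and the helper name `skip_ngrams` is hypothetical.

```r
# Enumerate n-grams of length n that skip at most k tokens in total.
# Assumption: skipped tokens are those omitted between the first and
# last token of the n-gram. skip_ngrams is a hypothetical helper.
skip_ngrams <- function(tokens, n, k) {
  idx <- combn(length(tokens), n)          # every increasing index set, one per column
  skips <- idx[n, ] - idx[1, ] - (n - 1)   # tokens omitted inside each candidate
  keep <- idx[, skips <= k, drop = FALSE]
  apply(keep, 2, function(i) paste(tokens[i], collapse = " "))
}

tokens <- strsplit("the sky is blue today", " ")[[1]]
grams <- skip_ngrams(tokens, n = 3, k = 2)

# The 3-grams that begin with "the" are the six listed in the text.
starting_with_the <- grams[startsWith(grams, "the ")]
print(starting_with_the)
```

Under this reading the sketch also returns the 3-grams that start at later tokens (for example "sky is blue"), and with k = 0 it reduces to ordinary contiguous 3-grams.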

`maxNumTerms`

An integer that specifies the maximum number of categories to include in the dictionary. The default value is 10000000.

`weighting`

A character string that specifies the weighting criteria. The default value is "tf".

"tf": to use term frequency.
"idf": to use inverse document frequency.
"tfidf": to use both term frequency and inverse document frequency.
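As a rough illustration of how the three weightings differ, the base R sketch below computes them for a tiny corpus. It assumes the textbook definition idf = log(N / df), which may not match the exact formula the library uses; the sample documents and variable names are made up for the example.

```r
# Three toy documents (assumed sample data, not from the library).
docs <- c("love it", "love it a lot", "hate it")
tokens <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tokens)))

# tf: raw count of each vocabulary term per document (rows = documents).
tf <- t(vapply(tokens,
               function(tk) as.numeric(table(factor(tk, levels = vocab))),
               numeric(length(vocab))))
colnames(tf) <- vocab

# idf: log(number of documents / number of documents containing the term).
# Assumption: textbook formula, no smoothing.
df <- colSums(tf > 0)
idf <- log(nrow(tf) / df)

# tfidf: term frequency scaled column-wise by inverse document frequency.
tfidf <- sweep(tf, 2, idf, `*`)

print(tf)
print(round(idf, 3))
print(round(tfidf, 3))
```

In this sketch "it" occurs in every document, so its idf and tfidf weights are zero, while rarer terms such as "hate" receive larger weights.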

`hashBits`

An integer value. Number of bits to hash into. Must be between 1 and 30, inclusive. The default value is 16.
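The effect of the bit count can be sketched with a toy hash in base R. The function `toy_hash` below is hypothetical and is not the hash used by ngramHash; it only shows that hashing into b bits yields 2^b slots, so a small b forces distinct terms to share slots.

```r
# Toy polynomial hash, for illustration only (NOT the library's hash).
# Returns a slot index in 0 .. 2^hash_bits - 1.
toy_hash <- function(term, hash_bits, seed = 314489979) {
  stopifnot(hash_bits >= 1, hash_bits <= 30)
  size <- 2^hash_bits
  h <- seed %% size
  for (code in utf8ToInt(term)) {
    h <- (h * 31 + code) %% size   # stays well within exact double range
  }
  h
}

# Nine distinct terms (assumed sample vocabulary).
terms <- c("i", "love", "it", "hate", "really", "a", "lot", "not", "much")

slots3  <- vapply(terms, toy_hash, numeric(1), hash_bits = 3)   # 8 slots
slots16 <- vapply(terms, toy_hash, numeric(1), hash_bits = 16)  # 65536 slots

print(slots3)
print(slots16)
```

With 3 bits there are only 8 slots for 9 terms, so at least two terms must collide; a larger bit count makes collisions less likely at the cost of a wider feature vector.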

`seed`

An integer value. Hashing seed. The default value is 314489979.

`ordered`

TRUE to include the position of each term in the hash. Otherwise, FALSE. The default value is TRUE.

`invertHash`

An integer that specifies the limit on the number of keys that can be used to generate the slot name. 0 means no invert hashing; -1 means no limit. While a zero value gives better performance, a non-zero value is needed to get meaningful coefficient names. The default value is 0.
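The trade-off can be pictured with a small base R sketch. Everything here is hypothetical (the toy hash, the helper `slot_names`, and the "slotN:term|term" naming format); it only illustrates why remembering which terms fell into each slot yields readable names, and how a limit caps the number of terms kept per slot.

```r
# Toy hash, hypothetical and NOT the library's: slot in 0 .. 2^hash_bits - 1.
toy_hash <- function(term, hash_bits) {
  h <- 0
  for (code in utf8ToInt(term)) h <- (h * 31 + code) %% 2^hash_bits
  h
}

# Name each occupied slot under an invertHash-style limit:
#    0 -> remember no terms, slots are named by index only
#   -1 -> remember every term that hashed into the slot
#    n -> remember at most n terms per slot
slot_names <- function(terms, hash_bits, invert_hash) {
  terms <- unique(terms)
  slots <- vapply(terms, toy_hash, numeric(1), hash_bits = hash_bits,
                  USE.NAMES = FALSE)
  by_slot <- split(terms, slots)   # list of terms, named by slot index
  vapply(names(by_slot), function(s) {
    if (invert_hash == 0) return(paste0("slot", s))
    kept <- by_slot[[s]]
    if (invert_hash > 0) kept <- head(kept, invert_hash)
    paste0("slot", s, ":", paste(kept, collapse = "|"))
  }, character(1), USE.NAMES = FALSE)
}

terms <- c("i", "love", "it", "hate", "really")
names_none <- slot_names(terms, hash_bits = 2, invert_hash = 0)
names_all  <- slot_names(terms, hash_bits = 2, invert_hash = -1)
names_one  <- slot_names(terms, hash_bits = 2, invert_hash = 1)

print(names_none)
print(names_all)
print(names_one)
```

Keeping the term lists costs extra bookkeeping, which is consistent with the note that a zero value performs better.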

Details

ngramCount lets you define arguments for count-based feature extraction. It accepts the following options: ngramLength, skipLength, maxNumTerms and weighting.

ngramHash lets you define arguments for hash-based feature extraction. It accepts the following options: ngramLength, skipLength, hashBits, seed, ordered and invertHash.

Value

A character string that defines the transform.

Author(s)

Microsoft Corporation Microsoft Technical Support

See also

featurizeText.

Examples


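  library(MicrosoftML)  # provides rxLogisticRegression, featurizeText, ngramHash, ngramCount

  # Ten short opinions: the first five are labeled TRUE, the last five FALSE.
  myData <- data.frame(opinion = c(
      "I love it!",
      "I love it!",
      "Love it!",
      "I love it a lot!",
      "Really love it!",
      "I hate it",
      "I hate it",
      "I hate it.",
      "Hate it",
      "Hate"),
      like = rep(c(TRUE, FALSE), each = 5),
      stringsAsFactors = FALSE)

  # Hash-based features: terms are hashed into 3 bits, and invertHash = -1
  # places no limit on the keys used to build slot names.
  outModel1 <- rxLogisticRegression(like ~ opinionCount, data = myData,
      mlTransforms = list(featurizeText(vars = c(opinionCount = "opinion"),
          wordFeatureExtractor = ngramHash(invertHash = -1, hashBits = 3))))
  summary(outModel1)

  # Count-based features: a dictionary of at most 5 terms, weighted by term frequency.
  outModel2 <- rxLogisticRegression(like ~ opinionCount, data = myData,
      mlTransforms = list(featurizeText(vars = c(opinionCount = "opinion"),
          wordFeatureExtractor = ngramCount(maxNumTerms = 5, weighting = "tf"))))
  summary(outModel2)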