Creating a Document's Keyword Wordcloud with Azure Cognitive Services and R
In this post we will look at how to call the Azure Cognitive Services APIs from R, so that R code can take advantage of the pretrained deep neural networks implemented as part of Azure Cognitive Services.
Specifically, we will look at the KeyPhrases API and how we can use it to give a visual summary of the contents of a document. This gives an instant understanding of a large body of text without needing to scan through the document itself. Alternatively, the key phrases can be used to categorise document contents for statistical purposes or for document retrieval.
In the following example we take a letter of complaint and show how the complaint can be understood at a glance. Note that if you receive letters as physical, handwritten correspondence, you could use the handwriting recognition offered by the Computer Vision API (https://azure.microsoft.com/en-gb/services/cognitive-services/computer-vision/) to turn them into text first.
To begin with, we need to go to Azure and create a Cognitive Services Text Analytics API resource.
(There is a free tier that allows 5000 calls per month, which is more than enough for a proof of concept.)
The Text Analytics API can also be trialed via the following website: https://azure.microsoft.com/en-gb/services/cognitive-services/text-analytics/
The following R code sends the text to the Text Analytics API and produces a word cloud from the key phrases it returns. All you need to do is replace the text "[Cognitive API Key]" with your key, which can be found in the Azure portal.
library(httr)
library(jsonlite)
library(wordcloud)

subscription.key <- "[Cognitive API Key]"

# For this demo we will load just one record of text into a data frame, but we could submit
# multiple lines in a batch as well (complaint text from citizensadvice.org.uk)
request_body <- data.frame(
  language = c("en"),
  id = c("1"),
  text = c("Re: Unsatisfactory Holiday at Hotel Balfour, Torrevieja on 12 August 2014 to 19 August 2014 Booking ref: 123456789 I have just returned from a holiday at Hotel Balfour, Torrevieja with my wife and children, which was most disappointing. Please find below a list of our complaints: 1) There was no shower in the hotel as specified in the brochure 2) The kitchens were closed for the whole of our stay 3) The hotel was 5 miles from the beach and not 1 mile as it said in the brochure We contacted your representative at the resort on 14 August 2015, but they were unable to resolve the matter and advised us to complain upon our return home. Under The Package Travel, Package Holidays and Package Tours Regulations 1992 you have a responsibility to provide all the elements of the package contracted for as they were described. We are legally entitled to receive compensation from you for loss of value, consequential losses and for the disappointment and loss of enjoyment we suffered. As you failed to provide us with the holiday we booked we are seeking £150 compensation from you for the problems we encountered, and for the distress and disappointment we suffered as a result. I have also sent a copy of this letter and enclosure to ABTA (of which I note you are a member). I look forward to receiving a response from you within 14 days of receipt of this letter."),
  stringsAsFactors = FALSE
)

# Convert to JSON for the API call
request_body_json <- toJSON(list(documents = request_body), auto_unbox = TRUE)

# Call the key phrases API
result <- POST(
  "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases",
  body = request_body_json,
  add_headers(.headers = c(
    "Content-Type" = "application/json",
    "Ocp-Apim-Subscription-Key" = subscription.key
  ))
)
Output <- content(result)

# We now have the data in a list, so convert the key phrases of the
# first (and only) document to a vector for the word cloud
key_phrases <- unlist(Output$documents[[1]]$keyPhrases)

wordcloud(
  key_phrases,
  c(length(key_phrases):1), # reverse order for the weighting of the words
  scale = c(1.5, .1),
  random.order = TRUE,
  random.color = TRUE,
  colors = rainbow(length(key_phrases))
)
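The code comment above mentions that several documents could be submitted in one batch. A minimal sketch of how that might look is given here, assuming the response has the same shape as in the single-document case (a `documents` list whose entries carry an `id` and a `keyPhrases` list). The helper names `build_request_json` and `extract_key_phrases` are hypothetical and not part of any package, and the example response is hand-written rather than real API output.

```r
library(jsonlite)

# Hypothetical helper: build the JSON request body for a batch of texts,
# giving each document a sequential id ("1", "2", ...)
build_request_json <- function(texts, language = "en") {
  docs <- data.frame(
    language = language,
    id = as.character(seq_along(texts)),
    text = texts,
    stringsAsFactors = FALSE
  )
  toJSON(list(documents = docs), auto_unbox = TRUE)
}

# Hypothetical helper: given the parsed response (the list returned by
# httr::content()), return a named list of key phrase vectors, one per id
extract_key_phrases <- function(parsed) {
  phrases <- lapply(parsed$documents, function(doc) unlist(doc$keyPhrases))
  names(phrases) <- vapply(parsed$documents, function(doc) doc$id, character(1))
  phrases
}

# Hand-written list that mimics the assumed response shape (not real output)
example_response <- list(documents = list(
  list(id = "1", keyPhrases = list("hotel", "brochure")),
  list(id = "2", keyPhrases = list("refund"))
))
phrases_by_doc <- extract_key_phrases(example_response)
```

The JSON from `build_request_json()` would be passed as the `body` argument of `POST()`, and each element of `phrases_by_doc` could then be drawn as its own word cloud.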
Running this code will produce the following word cloud
Without reading the whole document we can now get a good understanding of the complaint that was made, which streamlines the process and allows a faster resolution of the problem.
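As mentioned earlier, key phrases can also be used to categorise documents for statistical purposes. One possible approach, sketched below with made-up key phrases for three hypothetical letters, is to count how many documents each phrase appears in. These counts could then be used as the word cloud weights in place of the reverse-order weighting used above.

```r
# Made-up key phrases for three hypothetical complaint letters
phrases_by_doc <- list(
  "1" = c("hotel", "brochure", "compensation"),
  "2" = c("hotel", "refund"),
  "3" = c("compensation", "hotel")
)

# Count the number of documents in which each phrase appears
# (unique() ensures a phrase is counted at most once per document)
all_phrases <- unlist(lapply(phrases_by_doc, unique), use.names = FALSE)
phrase_counts <- sort(table(all_phrases), decreasing = TRUE)

# The counts can be passed to wordcloud() as frequencies, for example:
# wordcloud(names(phrase_counts), as.vector(phrase_counts), min.freq = 1)
```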