NickRomanek-5256 asked PramodValavala-MSFT answered

Web Scraper using Azure Logic App - SIMPLE

Hi,

I'm building my first "web scraper". There are probably a dozen ways to do this, but I want it all to be done in Azure, which leads me to Function Apps and Logic Apps.

I built a logic app that runs an HTTP GET against a URL, then creates an HTML file in a SharePoint document library. I was very close to accomplishing this, but the way the page is formatted prevents it from capturing all the text on the webpage (https://www.teamviewer.com/en-us/eula/#eula).

The other way I could accomplish this is simply by downloading the HTML file from the URL. If I right-click and save the page, the file displays everything I want, but I'm not finding a way to download a webpage file using a Logic App or Function App.

I will attach two pictures: one of the logic app that I made, and one showing the file the logic app creates in SharePoint next to how the original site looks.





azure-functions, azure-logic-apps
logicapp.png (23.0 KiB)
eula.png (360.0 KiB)

NickRomanek-5256 answered

I did solve this using Power Automate Desktop (PAD), in case that helps anyone. Ideally, this would all be done in the cloud, but here is a screenshot of my solution using PAD.

It clicks on a shortcut that takes you to the site, and then saves the file with the current date into SharePoint.

If anyone can help me get it working in the cloud, that would be appreciated.



pad.png (43.2 KiB)

PramodValavala-MSFT answered

@NickRomanek-5256 When you save a page from a browser, it saves other supporting files as well. In your case, based on the screenshot that you've shared, you are likely missing the required CSS files.

For that, you would have to do two things:
1. Download the linked CSS files (and JS too if required, but that would be more complex for dynamic sites without SSR)
2. Update the links in the downloaded HTML to point to the downloaded supporting files

This is more or less what browsers do today, I believe. You could replicate the same by parsing the HTML to find the supporting files and updating the content accordingly, though that might be a bit too complex to do with Logic Apps. It would be better to offload this to Azure Functions and leverage an existing scraping library that already does the job for you.
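
As a rough illustration of the two steps above (not a replacement for a proper scraping library), here is a minimal Python sketch that uses only the standard library. The names `StylesheetFinder`, `localize_stylesheets`, and the `fetch` parameter are hypothetical and exist only for this example. It assumes the stylesheet links use double-quoted `href` attributes without HTML entities, and it ignores JS, images, and `@import` rules inside the CSS.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class StylesheetFinder(HTMLParser):
    """Collects the href of every <link rel="stylesheet"> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "stylesheet" in rel and attr.get("href"):
            self.hrefs.append(attr["href"])


def localize_stylesheets(html, page_url, fetch=None):
    """Return (rewritten_html, {local_name: css_bytes}).

    `fetch` is a hypothetical hook: any callable that takes a URL and
    returns bytes. By default it downloads with urllib.
    """
    fetch = fetch or (lambda url: urlopen(url).read())
    finder = StylesheetFinder()
    finder.feed(html)

    files = {}
    # dict.fromkeys drops duplicate hrefs while keeping their order.
    for index, href in enumerate(dict.fromkeys(finder.hrefs)):
        # Step 1: resolve the link against the page URL and download it.
        absolute = urljoin(page_url, href)
        name = posixpath.basename(urlparse(absolute).path) or "style.css"
        local_name = f"{index}_{name}"
        files[local_name] = fetch(absolute)
        # Step 2: point the HTML at the local copy. This is a plain
        # string replace, so it only matches double-quoted hrefs.
        html = html.replace(f'href="{href}"', f'href="{local_name}"')
    return html, files
```

Inside an Azure Function, the rewritten HTML and each entry in the returned dictionary would then be saved side by side (for example in the same SharePoint folder) so the relative links resolve. For pages that build their content with JavaScript, this approach alone would likely not be enough.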
