Web Scraper using Azure Logic App - SIMPLE

Nick Romanek 1 Reputation point
2021-09-21T01:54:38.76+00:00

Hi,

I'm building my first "web scraper". There are probably a dozen ways to accomplish this, but I want it all done in Azure, which leads me to Function Apps and Logic Apps.

I built a logic app that runs an HTTP GET against a URL, then creates an HTML file in a SharePoint document library. I was very close to accomplishing this, but the way the page is formatted prevents it from downloading all the text within the webpage. (https://www.teamviewer.com/en-us/eula/#eula).

The other way I could accomplish this is simply by downloading the HTML file from the URL. If I right-click and save the page, the file displays everything I want, but I haven't found a way to download a webpage file using a Logic App or Function App.
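For what it's worth, the "just download the file" part is short in most runtimes. Here is a minimal Python sketch of the kind of handler an Azure Function could run (the browser-like User-Agent header is an assumption for illustration, since some sites serve different content to unknown clients):

```python
import urllib.request

def download_page(url: str, timeout: float = 30.0) -> bytes:
    """Fetch the raw HTML of a page, like the main document a browser
    saves; linked CSS/JS files would still be separate requests."""
    # A browser-like User-Agent is an assumption: some sites block
    # or alter responses for default library clients.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

The bytes returned here are what would then be written to the SharePoint document library, but as the answers below note, this only captures the main HTML document, not its supporting files.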

I will attach two pictures: one of the logic app I made, and the other showing the file the logic app creates in SharePoint next to how the original site looks.

133763-logicapp.png
133699-eula.png

Azure Functions
An Azure service that provides an event-driven serverless compute platform.
Azure Logic Apps
An Azure service that automates the access and use of data across clouds without writing code.

2 answers

  1. Nick Romanek 1 Reputation point
    2021-09-21T03:38:38.52+00:00

    I did solve this using Power Automate Desktop, in case that helps anyone. Ideally, this would all be done in the cloud, but here is a screenshot of my solution using PAD.

    It clicks a shortcut that opens the site, then saves the file with the current date into SharePoint.

    If anyone can help me get it working in the cloud, that would be appreciated.

    133768-pad.png


  2. Pramod Valavala 20,591 Reputation points Microsoft Employee
    2021-09-21T11:33:10.993+00:00

    @Nick Romanek When you save a page from the browser, it saves other supporting files as well. In your case, based on the screenshot you've shared, you are likely missing the required CSS files.

    For that, you would have to do two things:

    1. Download the linked CSS files (and JS too if required, though that would be more complex for dynamic sites without SSR)
    2. Update the links in the downloaded HTML to point to the downloaded supporting files

    This is more or less what browsers do today, I believe, and you could replicate it by parsing the HTML to find the supporting files and updating the content accordingly (which might be a bit too complex to do with Logic Apps). It would be better to offload this to Azure Functions and leverage an existing scraping library that already does the job for you.
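    The two steps above can be sketched in Python using only the standard library. The filename scheme and the naive string replacement are assumptions for illustration; a real implementation would use a proper scraping library:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class StylesheetFinder(HTMLParser):
    """Collect the href of every <link rel="stylesheet"> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.hrefs.append(attrs["href"])

def localize_stylesheets(html, base_url):
    """Return (rewritten_html, downloads): downloads maps each absolute
    CSS URL to the local filename the HTML now points at. The caller
    still has to fetch each URL and save it next to the HTML file."""
    finder = StylesheetFinder()
    finder.feed(html)
    downloads = {}
    for i, href in enumerate(finder.hrefs):
        local_name = f"styles_{i}.css"  # hypothetical naming scheme
        downloads[urljoin(base_url, href)] = local_name
        # Naive string replace is enough for a sketch; a production
        # version would rewrite the attribute with a real HTML rewriter.
        html = html.replace(f'href="{href}"', f'href="{local_name}"')
    return html, downloads
```

    Fetching each URL in `downloads` and saving it alongside the HTML in the SharePoint library would then give a page that renders with its styling intact.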
