Download an HTML page and recursively download every page it links to, however deep the links go

nellie 126 Reputation points
2021-02-07T16:26:28.087+00:00

Hi there,
How would you do this?
Start from a main HTML page and download it, then retrieve all the links on it and download those sub-pages. Then do the same for every sub-page: retrieve its links and download those pages too.
It's a recursive procedure that should get all the pages, regardless of how deep the links to other pages go.
Is there a way to do this in C#?

Thanks.

ASP.NET
C#

Accepted answer
  1. XuDong Peng-MSFT 10,181 Reputation points Microsoft Vendor
    2021-02-08T06:51:19.043+00:00

    Hi @nellie ,

    According to your description, I think it can be implemented in C#.

    First, you can use WebClient to download an HTML page, either to a file or into a string.

    using System.Net;  
      
    using (WebClient client = new WebClient ()) // WebClient class inherits IDisposable  
    {  
        client.DownloadFile("http://yoursite.com/page.html", @"C:\localfile.html");  
      
        // Or you can get the file content without saving it  
        string htmlCode = client.DownloadString("http://yoursite.com/page.html");  
    }  
    

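    Then use Html Agility Pack to traverse all the <a> tags in the downloaded page and filter them to get the hyperlink addresses to download. Other problems can come up along the way (for example, a request that fails, a page with no links, or pages that link back to each other), so you need to add some exception handling.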

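    // At the top of the file:
    using System.Collections.Generic;
    using System.IO;
    using System.Net;
    using HtmlAgilityPack;

    public static int i = 1;
    // URLs already downloaded, so pages that link to each other are not fetched in an endless loop
    public static HashSet<string> visited = new HashSet<string>();

    public static void downloadRes(string url)
    {
        if (!visited.Add(url)) return; // Add returns false if the URL was already processed
        string html;
        using (WebClient client = new WebClient())
        {
            try { html = client.DownloadString(url); }
            catch (WebException) { return; } // skip pages that fail to download
        }
        File.WriteAllText("D:\\localfile" + i++ + ".html", html);

        // Parse the HTML that was just downloaded instead of requesting the page a second time
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes returns null (not an empty collection) when no node matches
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return;
        foreach (HtmlNode link in links)
        {
            string href = link.Attributes["href"].Value;
            if (href.StartsWith("https")) // only absolute https links are followed
            {
                downloadRes(href);
            }
        }
    }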
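
    The snippet above only follows hrefs that start with "https", so relative links such as "page2.html" are skipped, and links to other sites are followed without limit. One possible way to handle both, shown here only as a sketch: resolve each href against the URL of the page it was found on, and keep it only if it stays on the same host. The class name LinkResolver and the method Resolve are hypothetical names chosen for this example, and restricting the crawl to one host is an assumption about what you want, not something the question states.

    ```csharp
    using System;

    public static class LinkResolver
    {
        // Hypothetical helper: returns the absolute URL for an href found on pageUrl,
        // or null if the link should not be crawled.
        public static string Resolve(string pageUrl, string href)
        {
            Uri baseUri = new Uri(pageUrl);

            // Combine the page URL with the (possibly relative) href
            if (!Uri.TryCreate(baseUri, href, out Uri absolute)) return null;

            // Ignore mailto:, javascript:, ftp: and other non-web schemes
            if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
                return null;

            // Assumption: stay on the starting site instead of crawling the whole web
            if (!string.Equals(absolute.Host, baseUri.Host, StringComparison.OrdinalIgnoreCase))
                return null;

            // Drop the #fragment so page.html#top and page.html count as one page
            return absolute.GetLeftPart(UriPartial.Query);
        }
    }
    ```

    Inside the foreach loop you could then call LinkResolver.Resolve(url, href) and recurse only when the result is not null, in place of the StartsWith check.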
    

    Hope this helps.

    Best regards,
    Xudong Peng

