How to remove HTML tags and keep line breaks in plain text

peter liles 556 Reputation points
2024-03-12T14:25:43.47+00:00
I want to reformat the HTML email body below as legible plain text, with a line break wherever the HTML contains one. The regular expression example I am following for simplicity does not seem to work.
Note: I am working in Visual Basic, not C#.

 Dim altViewPlainText As System.Net.Mail.AlternateView = System.Net.Mail.AlternateView.CreateAlternateViewFromString(System.Text.RegularExpressions.Regex.Replace(Body, "<(.|\n)*?>", String.Empty), Nothing, "text/plain")

---

<h3>MYPROJECT.COM</h3><br /><hr /><h2>Lovely Chromebooks laptop not to be missed</h2><TABLE style='font: 16px verdana, arial;' ><TR style='vertical-align:top;'><TD><img src='cid:mylogo' width='100' alt='Logo1' style='margin: 20px 0px 0px 20px;' ></TD><TD>Order Date: <B>29/02/2024</B><br>Dispatch Date: <B>UnKnown</B><br>Delivery Date: <B>Unknown</B><br>Order ID: <B>5339</B></TD></TR></TABLE><hr /><div style='font: 16px verdana, arial;'>For attention of:-<br /><strong>Peter10</strong> (Seller)<br /><strong>pete</strong> (Buyer)<br />We note from our record that a request has been made to Return item/s recently purchased from MyProject<br />All correspondance between both Seller And Buyer should be conducted using the Messaging service provided, accessed from your Account page.<br />The Seller should respond And let Buyer know what to do next.<br />If Seller has accepted responsibility for the Return he Is obliged to supply the Return label And make available as an image attachment And accessed via Message Service.<br />To review return purchase details, Click the Link address shown And login with your credentials.<br />Thank you.<br />


ASP.NET
A set of technologies in the .NET Framework for building web applications and XML web services.

Accepted answer
  1. Lan Huang-MSFT 25,556 Reputation points Microsoft Vendor
    2024-03-13T07:30:39.0433333+00:00

    Hi @peter liles,

    You can try the regex <(?!br[\x20/>])[^<>]+>, which removes every tag except a lowercase <br>, and change "text/plain" to MediaTypeNames.Text.Html. If "text/plain" is left unchanged, the <br/> tags are shown as literal text and no line breaks are rendered.

    Dim altViewPlainText As AlternateView = AlternateView.CreateAlternateViewFromString(Regex.Replace(Body, "<(?!br[\x20/>])[^<>]+>", String.Empty), Nothing, MediaTypeNames.Text.Html)
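
    To see what this pattern keeps and what it removes, here is a minimal console sketch. The module name, the helper StripTagsKeepBreaks and the sample string are made up for illustration; only the pattern comes from the line above.

    ```vbnet
    Imports System
    Imports System.Text.RegularExpressions

    Module KeepBreaksDemo
        ' Hypothetical helper: removes every tag except lowercase <br> tags.
        Function StripTagsKeepBreaks(body As String) As String
            ' (?!br[\x20/>]) is a negative lookahead: the tag is skipped when it
            ' starts with "br" followed by a space, a slash or ">".
            Return Regex.Replace(body, "<(?!br[\x20/>])[^<>]+>", String.Empty)
        End Function

        Sub Main()
            ' Assumed sample input; the real Body comes from your email template.
            Dim body As String = "<h2>Order</h2>Order ID: <B>5339</B><br />Thank you.<br>"

            ' Prints: OrderOrder ID: 5339<br />Thank you.<br>
            Console.WriteLine(StripTagsKeepBreaks(body))
        End Sub
    End Module
    ```

    The match is case-sensitive, so an uppercase <BR> would be stripped along with the other tags. Passing RegexOptions.IgnoreCase as a fourth argument to Regex.Replace should keep those as well.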
    


    Best regards,
    Lan Huang



1 additional answer

  1. Michael Taylor 48,281 Reputation points
    2024-03-12T15:12:31.7666667+00:00

    You cannot use a regular expression (RE) to parse HTML. REs are not that powerful: they do pattern matching, and that is not enough to parse HTML. If it were, we would never need actual parsers. REs are used to scan and identify tokens, and parsers are built on top of those tokens to understand what they mean. For example, the following HTML would defeat any RE you tried to use on it unless you explicitly wrote the RE to handle this very specific case.

    <!-- <br /> -->
    <script>  
       function foo () {
          console.log("<BR/>");
       }
    </script>
    

    Unless your HTML is incredibly simple AND you make assumptions about what it contains, your only real option is to use an HTML parser. I have had a lot of luck using HtmlAgilityPack in the past. Basically, you load the HTML into the parser and then swap out the things you don't want, like line breaks.
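
    As a rough sketch of that load-and-swap approach, assuming the HtmlAgilityPack NuGet package is referenced: the module and function names below are hypothetical, and the choice to drop comments, scripts and styles is an assumption about what you want to discard.

    ```vbnet
    Imports System
    Imports HtmlAgilityPack

    Module HtmlToTextSketch
        ' Hypothetical helper: returns the text content of an HTML fragment,
        ' with each <br> turned into a real newline.
        Function HtmlToPlainText(html As String) As String
            Dim doc As New HtmlDocument()
            doc.LoadHtml(html)

            ' Remove nodes whose content should never reach the plain text.
            ' SelectNodes returns Nothing when there are no matches.
            Dim unwanted = doc.DocumentNode.SelectNodes("//comment()|//script|//style")
            If unwanted IsNot Nothing Then
                For Each node As HtmlNode In unwanted
                    node.Remove()
                Next
            End If

            ' Swap every <br> element for a text node holding a newline.
            Dim breaks = doc.DocumentNode.SelectNodes("//br")
            If breaks IsNot Nothing Then
                For Each br As HtmlNode In breaks
                    br.ParentNode.ReplaceChild(doc.CreateTextNode(Environment.NewLine), br)
                Next
            End If

            ' InnerText concatenates the remaining text nodes;
            ' DeEntitize converts entities such as &amp; back to characters.
            Return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText)
        End Function
    End Module
    ```

    Because the parser knows which <br /> sits inside a comment or a script string, the earlier problem case is handled without any special-casing.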

    It isn't clear to me what "to a plain text style in a legible presentation" really means to you. It sounds like you want to replace the HTML with the equivalent text (e.g. an actual table for a <table>, a header for <h1>, etc). But that is exactly what the browser is for. So if you want the formatted equivalent then I'd say load the HTML into a browser engine and then "print" the output. You won't get the raw text back but that isn't going to happen anyway because there is no textual representation of a table structure (or divs or images).

    If you really need a "plain text" equivalent then you'll have to define that structure yourself and an HTML parser can at least handle the parsing aspect. You then just have to decide how to "render" each HTML element (e.g. horizontal line, image, etc). This is interestingly like how Markdown works but it goes the other way.
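
    One possible shape for such a renderer, again assuming HtmlAgilityPack. Every rendering rule here (a dashed line for <hr>, the alt text in brackets for <img>, two spaces between table cells) is an arbitrary choice made for illustration, and the module and function names are hypothetical.

    ```vbnet
    Imports System
    Imports System.Text
    Imports HtmlAgilityPack

    Module PlainTextRenderer
        ' Hypothetical entry point: walks the parsed tree and emits text.
        Function Render(html As String) As String
            Dim doc As New HtmlDocument()
            doc.LoadHtml(html)
            Dim sb As New StringBuilder()
            RenderNode(doc.DocumentNode, sb)
            Return sb.ToString()
        End Function

        Private Sub RenderNode(node As HtmlNode, sb As StringBuilder)
            Select Case node.NodeType
                Case HtmlNodeType.Comment
                    Return                      ' comments produce no output
                Case HtmlNodeType.Text
                    sb.Append(HtmlEntity.DeEntitize(node.InnerText))
                    Return
            End Select

            ' Rules applied before the children are rendered.
            Select Case node.Name               ' element names are lowercased by the parser
                Case "script", "style"
                    Return                      ' skip the element and its content
                Case "br"
                    sb.AppendLine()
                Case "hr"
                    sb.AppendLine().AppendLine(New String("-"c, 40))
                Case "img"
                    sb.Append("[" & node.GetAttributeValue("alt", "image") & "]")
            End Select

            For Each child As HtmlNode In node.ChildNodes
                RenderNode(child, sb)
            Next

            ' Rules applied after the children are rendered.
            Select Case node.Name
                Case "h1", "h2", "h3", "p", "div", "tr"
                    sb.AppendLine()             ' block-level elements end the line
                Case "td", "th"
                    sb.Append("  ")             ' crude column separator
            End Select
        End Sub
    End Module
    ```

    Adding support for another element means adding one Case, which keeps the "how do I render this tag" decisions in one place.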

    If you really just want the HTML but with linebreaks converted then a simple call to Replace would be sufficient. It won't work properly with comments or scripts (as mentioned earlier) but that may not be an issue you care about.

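    ' Requires: Imports System.Text.RegularExpressions
    ' Declare both members inside a Module or Class.

    ' Matches <br>, <br/> and <br /> in any letter case.
    Private ReadOnly s_htmlBreakRe As New Regex("<br\s*/?>", RegexOptions.IgnoreCase)

    ' Replaces every HTML line break with the platform newline.
    Function ReplaceHtmlLineBreak(html As String) As String
        Return s_htmlBreakRe.Replace(html, Environment.NewLine)
    End Function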
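
    If the goal is a text/plain alternate view rather than HTML, the same replacement can be followed by a tag-stripping pass. The sketch below is one way to combine the two using only framework classes; the module and function names are hypothetical, and the same caveat about comments and scripts applies.

    ```vbnet
    Imports System
    Imports System.Net
    Imports System.Net.Mail
    Imports System.Net.Mime
    Imports System.Text.RegularExpressions

    Module PlainTextViewSketch
        ' <br>, <br/> and <br /> in any letter case.
        Private ReadOnly s_breakRe As New Regex("<br\s*/?>", RegexOptions.IgnoreCase)
        ' Any remaining tag. Naive: does not understand comments or scripts.
        Private ReadOnly s_tagRe As New Regex("<[^<>]+>")

        ' Hypothetical helper: line breaks first, then strip the other tags.
        Function ToPlainText(html As String) As String
            Dim text As String = s_breakRe.Replace(html, Environment.NewLine)
            text = s_tagRe.Replace(text, String.Empty)
            ' Turn entities such as &amp; back into plain characters.
            Return WebUtility.HtmlDecode(text)
        End Function

        ' Builds the text/plain view from the HTML body.
        Function CreatePlainTextView(html As String) As AlternateView
            Return AlternateView.CreateAlternateViewFromString(ToPlainText(html), Nothing, MediaTypeNames.Text.Plain)
        End Function

        Sub Main()
            ' Assumed sample input, not the full email from the question.
            Console.WriteLine(ToPlainText("Order ID: <B>5339</B><br />Thank you.<BR>"))
        End Sub
    End Module
    ```

    The order matters: if the tags were stripped first, the <br> elements would already be gone by the time the line-break replacement ran.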