How to remove html tags and line break plain text

Question

I want to reformat the HTML email body below as plain text in a legible layout, with line breaks wherever breaks are present in the HTML. The regular expression example I am following, shown here in simplified form, does not seem to work.
Note: I am working in Visual Basic, not C#.

```
Dim altViewPlainText As System.Net.Mail.AlternateView = System.Net.Mail.AlternateView.CreateAlternateViewFromString(System.Text.RegularExpressions.Regex.Replace(Body, "<(.|\n)*?>", String.Empty), Nothing, "text/plain")
```

---

MYPROJECT.COM

Lovely Chromebooks laptop not to be missed

Order Date: 29/02/2024
Dispatch Date: UnKnown
Delivery Date: Unknown
Order ID: 5339

For attention of:-
Peter10 (Seller)
pete (Buyer)
We note from our record that a request has been made to Return item/s recently purchased from MyProject
All correspondance between both Seller And Buyer should be conducted using the Messaging service provided, accessed from your Account page.
The Seller should respond And let Buyer know what to do next.
If Seller has accepted responsibility for the Return he Is obliged to supply the Return label And make available as an image attachment And accessed via Message Service.
To review return purchase details, Click the Link address shown And login with your credentials.
Thank you.

Accepted Answer

Hi @peter liles,

You can try the regex <(?!br[\x20/>])[^<>]+>, which removes every tag except <br>, and change "text/plain" to MediaTypeNames.Text.Html. If "text/plain" is not changed, the <br> tag is displayed as literal text and no line break is rendered.

Dim altViewPlainText As AlternateView = AlternateView.CreateAlternateViewFromString(Regex.Replace(Body, "<(?!br[\x20/>])[^<>]+>", String.Empty), Nothing, MediaTypeNames.Text.Html)

Best regards,
Lan Huang

Answer

You cannot use a regular expression (RE) to parse HTML. REs are not that powerful: they do pattern matching, and pattern matching alone cannot parse HTML. If it could, we would never need actual parsers. REs are used to scan and identify tokens, and parsers are built on top of those tokens to understand what they mean. For example, HTML with markup-like text inside a comment or script would defeat any RE you tried to use on it unless you explicitly wrote the RE to handle that very specific case.
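As a hypothetical illustration of the kind of input that trips up a tag-stripping pattern, consider a comment that contains a > character. The markup and the pattern below are assumptions chosen for this sketch, not taken from the thread:

```vb
Imports System
Imports System.Text.RegularExpressions

Module RegexFailureDemo
    Sub Main()
        ' Hypothetical input: the comment body contains a ">" character.
        Dim html As String = "<p>Hello</p><!-- if x > 0 then --><p>World</p>"

        ' Naive "strip anything between < and >" pattern.
        Dim stripped As String = Regex.Replace(html, "<[^<>]+>", String.Empty)

        ' The comment is cut at its first ">", so the rest of it leaks into the text.
        Console.WriteLine(stripped) ' prints: Hello 0 then -->World
    End Sub
End Module
```

The pattern has no notion of "inside a comment", so it treats the first > it sees as the end of a tag.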

Unless your HTML is incredibly simple AND you make assumptions about what it contains then your only real option is to use an HTML parser. I have had a lot of luck using HtmlAgilityPack in the past. Basically you load the HTML into the parser and then you can swap out the things you don't want, like linebreaks.
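A minimal sketch of that approach, assuming the third-party HtmlAgilityPack NuGet package is referenced. The module and function names are hypothetical, and the choice of which nodes to drop is an assumption:

```vb
Imports System
Imports System.Linq
Imports HtmlAgilityPack ' third-party NuGet package "HtmlAgilityPack"

Module HtmlToTextSketch
    Function HtmlToPlainText(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Drop nodes whose content should never appear in the text.
        ' SelectNodes returns Nothing when there are no matches.
        Dim unwanted As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//script|//style|//comment()")
        If unwanted IsNot Nothing Then
            For Each node As HtmlNode In unwanted.ToList()
                node.Remove()
            Next
        End If

        ' Swap each <br> element for a real line break.
        Dim breaks As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//br")
        If breaks IsNot Nothing Then
            For Each br As HtmlNode In breaks.ToList()
                br.ParentNode.ReplaceChild(doc.CreateTextNode(Environment.NewLine), br)
            Next
        End If

        ' InnerText concatenates the remaining text nodes;
        ' DeEntitize decodes entities such as &amp;.
        Return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText)
    End Function
End Module
```

The result could then be passed to AlternateView.CreateAlternateViewFromString with "text/plain", as in the question.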

It isn't clear to me what "to a plain text style in a legible presentation" really means to you. It sounds like you want to replace the HTML with the equivalent text (e.g. an actual table for a <table>, a header for an <h1>, etc). But that is exactly what the browser is for. So if you want the formatted equivalent then I'd say load the HTML into a browser engine and then "print" the output. You won't get the raw text back but that isn't going to happen anyway because there is no textual representation of a table structure (or divs or images).

If you really need a "plain text" equivalent then you'll have to define that structure yourself and an HTML parser can at least handle the parsing aspect. You then just have to decide how to "render" each HTML element (e.g. horizontal line, image, etc). This is interestingly like how Markdown works but it goes the other way.
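One way to picture "deciding how to render each element" is a small recursive walk over the parsed tree. This is only a sketch, again assuming HtmlAgilityPack. The rendering rules (dashes for a horizontal line, a bracketed placeholder for an image, a line break after block elements) and all names are hypothetical choices:

```vb
Imports System
Imports System.Text
Imports HtmlAgilityPack ' third-party NuGet package "HtmlAgilityPack"

Module PlainTextRenderer
    ' Hypothetical rendering rules; adjust to whatever "plain text" should mean for you.
    Sub Render(node As HtmlNode, output As StringBuilder)
        Select Case node.NodeType
            Case HtmlNodeType.Comment
                Return ' comments never reach the output
            Case HtmlNodeType.Text
                output.Append(HtmlEntity.DeEntitize(node.InnerText))
                Return
        End Select

        ' HtmlNode.Name is the lower-cased element name.
        Select Case node.Name
            Case "script", "style"
                Return ' skip the element and everything inside it
            Case "br"
                output.AppendLine()
            Case "hr"
                output.AppendLine().AppendLine(New String("-"c, 40))
            Case "img"
                output.Append("[image: " & node.GetAttributeValue("alt", "") & "]")
            Case Else
                For Each child As HtmlNode In node.ChildNodes
                    Render(child, output)
                Next
                ' End block-level elements with a line break.
                If node.Name = "p" OrElse node.Name = "div" OrElse node.Name = "tr" Then
                    output.AppendLine()
                End If
        End Select
    End Sub

    Function RenderHtml(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        Dim output As New StringBuilder()
        Render(doc.DocumentNode, output)
        Return output.ToString()
    End Function
End Module
```

Whitespace between tags in the source HTML is passed through unchanged here, so real email markup would probably need an extra normalisation step.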

If you really just want the HTML but with linebreaks converted then a simple call to Replace would be sufficient. It won't work properly with comments or scripts (as mentioned earlier) but that may not be an issue you care about.

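Imports System.Text.RegularExpressions

' Assumed pattern: matches <br>, <br/> and <br />.
Private ReadOnly s_htmlBreakRe As New Regex("<br\s*/?>", RegexOptions.IgnoreCase Or RegexOptions.Singleline)

' Replaces every HTML line break with the platform newline; all other markup is left untouched.
Function ReplaceHtmlLineBreak(html As String) As String
    Return s_htmlBreakRe.Replace(html, Environment.NewLine)
End Function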
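If the HTML is simple enough that a regex is acceptable, the line-break replacement can be combined with tag stripping to build the "text/plain" view from the question. This is a sketch under that assumption. The module, function, and field names are hypothetical, and it inherits the comment and script limitations already mentioned:

```vb
Imports System
Imports System.Net
Imports System.Net.Mail
Imports System.Text.RegularExpressions

Module PlainTextViewSketch
    ' Assumed patterns: any <br> variant, then any remaining tag.
    Private ReadOnly s_breakRe As New Regex("<br\s*/?>", RegexOptions.IgnoreCase)
    Private ReadOnly s_tagRe As New Regex("<[^<>]+>")

    Function ToPlainText(html As String) As String
        ' 1. Turn line-break tags into real newlines before they are stripped.
        Dim text As String = s_breakRe.Replace(html, Environment.NewLine)
        ' 2. Remove every remaining tag.
        text = s_tagRe.Replace(text, String.Empty)
        ' 3. Decode entities such as &amp; and &nbsp;.
        Return WebUtility.HtmlDecode(text)
    End Function

    Function CreatePlainTextView(body As String) As AlternateView
        Return AlternateView.CreateAlternateViewFromString(ToPlainText(body), Nothing, "text/plain")
    End Function
End Module
```

The order matters: stripping all tags first, as the expression in the question does, removes the <br> elements before they can be converted.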
