Using Powershell to parse local HTML documents (then replace text within specific tags) fails for specific HTML files - $HTML = New-Object -Com "HTMLFile"

Swin 51

Hi All,

Windows 10 - Windows PS 5.1

We have local HTML documents produced with an authoring tool (MadCap Flare), which produces well-formatted HTML. Unfortunately, it adds non-printing HTML characters (&nbsp;) within <code> tags, which then cause issues when users copy and paste text into another application. I want to run a post-processing PS script to parse the HTML, pull out the <code> tags, and replace the text within those tags, but ONLY within those tags.

I have tried to create an HTMLFile object to see if I could pull out the text within the tags using the following code, but it returns nothing. I have even tried some really basic HTML and just looked at <p> tags, but it still returns nothing.

$Source = Get-Content -path example.html -raw
$HTML = New-Object -Com "HTMLFile"
$HTML.IHTMLDocument2_write($Source)

$HTML.all.tags("code") | % InnerText

I was thinking of perhaps converting the HTML to XML using ConvertTo-XML, which seems to work, but then I'm not sure where to go next.

I tried to upload some example HTML, but these forums will not allow me to do that, either in line or as a txt file :(

Even tried DropBox and OneDrive links - both are denied :(

Chris Swinney 1

Hey @Rich Matheisen (or anyone :) )

Other things sidetracked me for a while (like some marriage type thing), but I finally managed to sit back down today to re-look at this.

Interestingly, I managed to get something working by doing as I did in the video, i.e. reading in a dummy HTML file first, then re-reading in the real HTML file. I can loop through the NBSP characters within the given tags and replace them; however, the result is just not quite what I intended :(

The problem we are having with the NBSP characters is that they break certain browser views. These NBSP characters are specifically inside a <CODE> tag as mentioned, and those tags are within a <PRE> tag. What I was trying to do was replace each NBSP with a completely normal space, i.e. " ".

$HTML.all.tags('pre') | ForEach-Object{$_.outerHTML = $_.outerHTML -replace '&nbsp;', ' '}

Unfortunately, whilst ALL NBSP characters are replaced, they are replaced with nothing, so they are effectively removed.

This is what we start with:

   <div style="mc-code-lang: Python;" class="codeSnippetBody" data-mc-continue="False" data-mc-line-number-start="1" data-mc-use-line-numbers="False"><pre><code>{<br />&#160;&#160;{% <span style="color: #a71d5d; ">if</span> service_config %}<br />&#160;&#160;&#160;&#160;<span style="color: #df5000; ">"action"</span> : <span style="color: #df5000; ">"continue"</span>,<br />&#160;&#160;&#160;&#160;<span style="color: #df5000; ">"result"</span> : {{service_config|pex_to_json}}<br />&#160;&#160;{% <span style="color: #a71d5d; ">else</span> %}<br />&#160;&#160;&#160;&#160;&#160;<span style="color: #df5000; ">"action"</span> : <span style="color: #df5000; ">"reject"</span>,<br />&#160;&#160;&#160;&#160;&#160;<span style="color: #df5000; ">"result"</span> : {}<br />&#160;&#160;{% endif %}<br />}</code></pre>  
       </div>

And after reading in, this is what we get in the outerHTML property:

   outerHTML                    : <CODE>{<BR>&nbsp;&nbsp;{% <SPAN style="COLOR: #a71d5d">if</SPAN> service_config %}<BR>&nbsp;&nbsp;&nbsp;&nbsp;<SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR:   
                                      #df5000">"continue"</SPAN>,<BR>&nbsp;&nbsp;&nbsp;&nbsp;<SPAN style="COLOR: #df5000">"result"</SPAN> : {{service_config|pex_to_json}}<BR>&nbsp;&nbsp;{% <SPAN style="COLOR: #a71d5d">else</SPAN>   
                                      %}<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"reject"</SPAN>,<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<SPAN style="COLOR:   
                                      #df5000">"result"</SPAN> : {}<BR>&nbsp;&nbsp;{% endif %}<BR>}</CODE>

OK so far. But after parsing and writing back out to HTML, we end up with:

   <DIV class=codeSnippetBody style="mc-code-lang: Python" data-mc-use-line-numbers="False" data-mc-line-number-start="1" data-mc-continue="False"><PRE><CODE>{<BR>{% <SPAN style="COLOR: #a71d5d">if</SPAN> service_config %}<BR><SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"continue"</SPAN>,<BR><SPAN style="COLOR: #df5000">"result"</SPAN> : {{service_config|pex_to_json}}<BR>{% <SPAN style="COLOR: #a71d5d">else</SPAN> %}<BR><SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"reject"</SPAN>,<BR><SPAN style="COLOR: #df5000">"result"</SPAN> : {}<BR>{% endif %}<BR>}</CODE></PRE></DIV>

This doesn't work, as all the text is left justified.

However, if the HTML file is manually edited so that simple spaces are inserted where the NBSP characters used to be, everything works as it should. No browser views are broken and all is good.

   <DIV class=codeSnippetBody style="mc-code-lang: Python" data-mc-use-line-numbers="False" data-mc-line-number-start="1" data-mc-continue="False"><PRE><CODE>{<BR>  {% <SPAN style="COLOR: #a71d5d">if</SPAN> service_config %}<BR>    <SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"continue"</SPAN>,<BR>    <SPAN style="COLOR: #df5000">"result"</SPAN> : {{service_config|pex_to_json}}<BR>  {% <SPAN style="COLOR: #a71d5d">else</SPAN> %}<BR>     <SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"reject"</SPAN>,<BR>     <SPAN style="COLOR: #df5000">"result"</SPAN> : {}<BR>  {% endif %}<BR>}</CODE></PRE></DIV>

So, whilst I'm over the oddity in the way the HTML ComObject works, I can't get my head around simply replacing NBSP with ' '.

Rich Matheisen 45,111 Reputation points

2020-10-08T20:09:57.667+00:00

I think your problem is the use of styles inside PRE tags. The <pre> tag is supposed to represent preformatted text, so whitespace and line breaks are preserved. Outside a <pre> tag, multiple consecutive spaces are collapsed into a single space.

In your two examples (of what doesn't work, and what works), the length of the data is different (by 22 characters), which probably accounts for the left justification if the missing characters are spaces at the beginning of the line of code.

It seems to me that the use of the <CODE> tags to format and colorize the data would make the <PRE> tags kinda pointless. This may account for your original problem of copying the code and preserving the formatting.

You might want to find an HTML discussion board and ask about this as I'm no HTML guru.

Chris Swinney 1 Reputation point

2020-10-09T19:26:10.813+00:00

In your two examples (of what doesn't work, and what works), the length of the data is different (by 22 characters), which probably accounts for the left justification if the missing characters are spaces at the beginning of the line of code

Hey @Rich Matheisen, indeed the additional characters are simply me manually adding whitespace between the <BR> and <SPAN> tags, which does the trick but is not particularly efficient over a few hundred files.

As I understand it, code tags nested inside pre tags are fine, and indeed this seems to be a recommended practice (https://www.w3.org/TR/2011/WD-html5-author-20110809/the-code-element.html). Whitespace is used to format the text (as I did manually above).

I have had the issue clarified. The code tags are used to define jinja2 code snippets that users will cut&paste into a text window. Unfortunately, the NBSP characters are copied, and this breaks the jinja2 parser :(.

I might have to see if I can build a regex find/replace on the raw text file rather than parsing it through the COM HTMLfile object, which I wasn't looking forward to....
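
A minimal sketch of that raw-text approach, for illustration only. It assumes <code> elements are never nested, and that the NBSP shows up as &#160;, &#xA0;, &nbsp; or a raw U+00A0 character. The function name Convert-CodeNbsp is hypothetical (not from any module), and a regex over HTML is fragile compared with a real parser, so treat this as a starting point rather than a finished tool. It uses [regex] with a MatchEvaluator because the script-block form of -replace is not available in Windows PS 5.1.

```powershell
# Hypothetical helper: replace non-breaking spaces with plain spaces,
# but only between <code ...> and </code>.
function Convert-CodeNbsp {
    param(
        [Parameter(Mandatory)]
        [string]$Html
    )

    # (?is) = ignore case, and let '.' match newlines.
    # The lazy '.*?' stops at the first </code>, so nested <code> is NOT handled.
    $codeBlock = [regex]'(?is)(<code\b[^>]*>)(.*?)(</code>)'

    # Called once per <code>...</code> match; only the body (group 2) is changed.
    $evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
        param($m)
        # Assumed spellings of NBSP: numeric, hex and named entity, or raw U+00A0.
        $body = $m.Groups[2].Value -replace '&#160;|&#xA0;|&nbsp;|\u00A0', ' '
        $m.Groups[1].Value + $body + $m.Groups[3].Value
    }

    $codeBlock.Replace($Html, $evaluator)
}
```

Usage would be along the lines of `Convert-CodeNbsp -Html (Get-Content -Path example.html -Raw)`, with the result written back out via Set-Content. The file encoding on read and write would need checking, since the PS 5.1 defaults may not match what the authoring tool produces.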
Rich Matheisen 45,111 Reputation points

2020-10-09T19:55:03.657+00:00

I can see that the "works" HTML is longer because it contains more spaces. :-) I guess my point was that if you replace the characters with PowerShell in the HTMLFile COM object and then write the changes to the file, the HTML parser will compress the consecutive spaces to a single space, giving it the "left justified" look. I know that shouldn't happen inside a <pre> tag, but the HTML parser used by the COM object may have a bug.