Using Powershell to parse local HTML documents (then replace text within specific tags) fails for specific HTML files - $HTML = New-Object -Com "HTMLFile"

Swin 51 Reputation points
2020-08-16T15:26:33.883+00:00

Hi All,

Windows 10 - Windows PS 5.1

We have local HTML documents produced with an authoring tool (MadCap Flare), which produces well-formatted HTML. Unfortunately, it adds non-breaking space characters ( &#160; ) within <code> tags, and these cause issues when users copy and paste the text into another application. I want to run a post-processing PowerShell script to parse the HTML, pull out the <code> tags, and replace the text within those tags, but ONLY within those tags.

I have tried to create an HTMLFile object and pull out the text within the tags using the following code, but it returns nothing. I have even tried some really basic HTML and just looked at <p> tags, but that still returns nothing.

$Source = Get-Content -path example.html -raw
$HTML = New-Object -Com "HTMLFile"
$HTML.IHTMLDocument2_write($Source)

$HTML.all.tags("code") | % InnerText

I was thinking of perhaps converting the HTML to XML using ConvertTo-XML, which seems to work, but then I'm not sure where to go next.

I tried to upload some example HTML, but these forums will not allow me to do that, either in line or as a txt file :(

Even tried DropBox and OneDrive links - both are denied :(

Windows Server PowerShell

8 answers

  1. Chris Swinney 1 Reputation point
    2020-10-08T18:12:31.94+00:00

    Hey @Rich Matheisen (or anyone :) )

    Other things sidetracked me for a while (like some marriage-type thing), but I finally managed to sit back down today to look at this again.

    Interestingly, I managed to get something working by doing as I did in the video, i.e. reading in a dummy HTML file first, then re-reading in the real HTML file. I can loop through the NBSP characters within the given tags and replace them; however, the result is just not quite what I intended :(

    The problem we are having with the NBSP characters is that they break certain browser views. These NBSP characters are specifically inside a <CODE> tag, as mentioned, and those tags are within a <PRE> tag. What I was trying to do was replace each NBSP with a completely normal space - i.e. " ".

    $HTML.all.tags('pre') | ForEach-Object{$_.outerHTML = $_.outerHTML -replace '&nbsp;', ' '}  
    

    Unfortunately, whilst ALL NBSP characters are replaced, they are replaced with nothing, so they are effectively removed.

    This is what we start with:

       <div style="mc-code-lang: Python;" class="codeSnippetBody" data-mc-continue="False" data-mc-line-number-start="1" data-mc-use-line-numbers="False"><pre><code>{<br />&#160;&#160;{% <span style="color: #a71d5d; ">if</span> service_config %}<br />&#160;&#160;&#160;&#160;<span style="color: #df5000; ">"action"</span> : <span style="color: #df5000; ">"continue"</span>,<br />&#160;&#160;&#160;&#160;<span style="color: #df5000; ">"result"</span> : {{service_config|pex_to_json}}<br />&#160;&#160;{% <span style="color: #a71d5d; ">else</span> %}<br />&#160;&#160;&#160;&#160;&#160;<span style="color: #df5000; ">"action"</span> : <span style="color: #df5000; ">"reject"</span>,<br />&#160;&#160;&#160;&#160;&#160;<span style="color: #df5000; ">"result"</span> : {}<br />&#160;&#160;{% endif %}<br />}</code></pre>  
           </div>  
    

    And after reading in, this is what we get in the outerHTML property:

       outerHTML                    : <CODE>{<BR>&nbsp;&nbsp;{% <SPAN style="COLOR: #a71d5d">if</SPAN> service_config %}<BR>&nbsp;&nbsp;&nbsp;&nbsp;<SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR:   
                                          #df5000">"continue"</SPAN>,<BR>&nbsp;&nbsp;&nbsp;&nbsp;<SPAN style="COLOR: #df5000">"result"</SPAN> : {{service_config|pex_to_json}}<BR>&nbsp;&nbsp;{% <SPAN style="COLOR: #a71d5d">else</SPAN>   
                                          %}<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"reject"</SPAN>,<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<SPAN style="COLOR:   
                                          #df5000">"result"</SPAN> : {}<BR>&nbsp;&nbsp;{% endif %}<BR>}</CODE>  
    

    OK so far. But after parsing and writing back out to HTML, we end up with:

       <DIV class=codeSnippetBody style="mc-code-lang: Python" data-mc-use-line-numbers="False" data-mc-line-number-start="1" data-mc-continue="False"><PRE><CODE>{<BR>{% <SPAN style="COLOR: #a71d5d">if</SPAN> service_config %}<BR><SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"continue"</SPAN>,<BR><SPAN style="COLOR: #df5000">"result"</SPAN> : {{service_config|pex_to_json}}<BR>{% <SPAN style="COLOR: #a71d5d">else</SPAN> %}<BR><SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"reject"</SPAN>,<BR><SPAN style="COLOR: #df5000">"result"</SPAN> : {}<BR>{% endif %}<BR>}</CODE></PRE></DIV>  
    

    This doesn't work, as all the text is left-justified.

    However, if the HTML file is manually edited so that simple spaces are inserted where the NBSP characters used to be, everything works as it should. No browser views are broken and all is good.

       <DIV class=codeSnippetBody style="mc-code-lang: Python" data-mc-use-line-numbers="False" data-mc-line-number-start="1" data-mc-continue="False"><PRE><CODE>{<BR>  {% <SPAN style="COLOR: #a71d5d">if</SPAN> service_config %}<BR>    <SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"continue"</SPAN>,<BR>    <SPAN style="COLOR: #df5000">"result"</SPAN> : {{service_config|pex_to_json}}<BR>  {% <SPAN style="COLOR: #a71d5d">else</SPAN> %}<BR>     <SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"reject"</SPAN>,<BR>     <SPAN style="COLOR: #df5000">"result"</SPAN> : {}<BR>  {% endif %}<BR>}</CODE></PRE></DIV>  
    

    So, whilst I am over the oddity in the way the HTML COM object works, I can't get my head around simply replacing NBSP with ' '.


  2. Rich Matheisen 45,111 Reputation points
    2020-10-08T20:09:57.667+00:00

    I think your problem is the use of styles inside PRE tags. The <pre> tag is supposed to represent preformatted text, in which whitespace and line breaks are preserved. Outside a <pre> tag, multiple consecutive spaces are collapsed into a single space.

    In your two examples (of what doesn't work, and what works), the length of the data is different (by 22 characters), which probably accounts for the left justification if the missing characters are spaces at the beginning of the line of code.

    It seems to me that the use of the <CODE> tags to format and colorize the data would make the <PRE> tags kinda pointless. This may account for your original problem of copying the code and preserving the formatting.

    You might want to find an HTML discussion board and ask about this as I'm no HTML guru.


  3. Chris Swinney 1 Reputation point
    2020-10-09T19:26:10.813+00:00

    > In your two examples (of what doesn't work, and what works), the length of the data is different (by 22 characters), which probably accounts for the left justification if the missing characters are spaces at the beginning of the line of code.

    Hey @Rich Matheisen, indeed the additional characters are simply me manually adding whitespace between the <BR> and <SPAN> tags, which does the trick but is not particularly efficient over a few hundred files.

    As I understand it, code tags nested inside pre tags are fine, and indeed this seems to be a recommended practice (https://www.w3.org/TR/2011/WD-html5-author-20110809/the-code-element.html). Whitespace is used to format the text (as I did manually above).

    I have had the issue clarified. The code tags are used to define Jinja2 code snippets that users will cut and paste into a text window. Unfortunately, the NBSP characters are copied along with the code, and this breaks the Jinja2 parser :(.

    I might have to see if I can build a regex find/replace on the raw text file rather than parsing it through the COM HTMLfile object, which I wasn't looking forward to....
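
A minimal sketch of the raw-text regex approach mentioned above. It is not a confirmed solution from this thread. It assumes that <code> elements are never nested inside one another and that the non-breaking spaces appear as `&#160;`, `&#xA0;`, `&nbsp;` or a literal U+00A0 character. The function name `Convert-CodeNbsp` and the file names in the usage comments are hypothetical.

```powershell
# Hypothetical helper: replaces non-breaking spaces with ordinary spaces,
# but only inside <code>...</code> elements of the raw HTML text.
# Assumption: <code> elements are never nested inside one another.
function Convert-CodeNbsp {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Html
    )

    # Singleline lets '.' span line breaks; IgnoreCase also matches <CODE>.
    $options = [System.Text.RegularExpressions.RegexOptions]'IgnoreCase, Singleline'
    $pattern = '<code\b[^>]*>.*?</code>'

    # Runs once per matched <code> element and returns its replacement text.
    $evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
        param($m)
        # Covers the numeric, hex and named entities plus a literal U+00A0.
        $m.Value -replace '&#160;|&#xA0;|&nbsp;|\u00A0', ' '
    }

    [regex]::Replace($Html, $pattern, $evaluator, $options)
}

# Example usage (file names are placeholders):
# $source = Get-Content -Path 'example.html' -Raw -Encoding UTF8
# Convert-CodeNbsp -Html $source | Set-Content -Path 'example.fixed.html' -Encoding UTF8
```

Because this works on the raw text and never goes through the HTMLFile COM object, everything outside the <code> elements is left byte-for-byte as the authoring tool wrote it, including tag case and attribute order. Note that in Windows PowerShell 5.1, `Set-Content -Encoding UTF8` writes a byte order mark, which may or may not matter for the target files.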