question

swinster asked RichMatheisen-8856 commented

Using Powershell to parse local HTML documents (then replace text within specific tags) fails for specific HTML files - $HTML = New-Object -Com "HTMLFile"

Hi All,

Windows 10 - Windows PS 5.1

We have local HTML documents produced with an authoring tool (MadCap Flare), which generates well-formatted HTML. Unfortunately, it inserts non-breaking space entities ( &#160; ) within <code> tags, which then cause issues when users copy and paste the text into another application. I want to run a post-process PS script to parse the HTML, pull out the <code> tags, and replace the text within those tags, but ONLY within those tags.

I tried creating an HTMLFile object to see whether I could pull out the text within the tags using the following code, but it returns nothing. I even tried some really basic HTML and just looked at <p> tags, but still got nothing.

$Source = Get-Content -path example.html -raw
$HTML = New-Object -Com "HTMLFile"
$HTML.IHTMLDocument2_write($Source)

$HTML.all.tags("code") | % InnerText


I was thinking that perhaps converting the HTML to XML using ConvertTo-XML might help, and that seems to work, but then I'm not sure where to go next.


I tried to upload some example HTML, but these forums will not allow me to do that, either in line or as a txt file :(

Even tried DropBox and OneDrive links - both are denied :(

windows-server-powershell
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

RichMatheisen-8856 answered RichMatheisen-8856 edited

You may not have to do any replacement at all. The parser should replace the numeric code with the mnemonic. But if it is a problem you should be able to modify the code tags using something like this (admittedly crude) code:

Okay, I quit. The freaking filters on the Microsoft side have decided that posting code in an area supposedly devoted to a programming language is forbidden. This sucks.

If there are any "moderators" reading this, they need to get this fixed. It isn't the first time I've run into this problem since they shut down the old forums and switched to this POS system.

$HTML.all.tags('code') | ForEach-Object{$_.innerHTML = $_.innerHTML -Replace '&nbsp;', 'NON-BREAKING-SPACE'}

You can save the modified data by using the same code as above, but substitute "HTML" for "code" and pipe to "ForEach-Object{$_.document.documentElement.OuterHTML} | Out-File ...."

I really do apologize for having to butcher the code like this, but this freaking system refuses to accept it if it's properly formatted.


swinster answered RichMatheisen-8856 commented

@RichMatheisen-8856 - np and fully understand - I spent a huge amount of time trying to upload something that I thought would help, and failed dismally.


Well, FWIW, here's the code . . . but as a screenshot. :-( I hope that makes what I was trying to say somewhat more clear.

18907-capture.jpg


swinster answered RichMatheisen-8856 commented

@RichMatheisen-8856 Thanks, although I am utterly confused about what is going on. Not by what you posted, which is fine and indeed pretty much what I posted in the first place; what confuses me is the results I am getting in PowerShell.

The reason I posted in the first place was that I was getting no result when I simply tried to output the "code" tags ( $HTML.all.tags('code') ), i.e. seemingly nothing was parsed into the HTMLFile document. This is so bizarre to my mind that I have had to record it on video to see if you, or anyone else, can understand what is going on.




I cannot tell you how frustrating it is to use this freaking system! ARRRRGGGHHHHH! Again I received "you do not have permission to post"!

Instead of text you're going to get a screenshot of text! I hope it helps.

19640-capture.jpg


swinster answered swinster edited

Confusing PowerShell Behaviour Video - 3 mins 30 sec


So, it seems that my file (which I cannot seem to post here in any way, shape or form) cannot be written into a new HTMLFile object. Once that has been tried, subsequent attempts to write other HTML files to the same object also fail. BUT if you first write a simple HTML file to a new HTMLFile object, you can then write my more complex file to it with no issue :/

swinster answered RichMatheisen-8856 commented

A more complex HTML file


It would appear as if direct HTML links don't work, but the Hyperlink button does.


Are you doing this from actual files? Any chance you could get the page from the web server instead? Using Invoke-WebRequest would provide you with a much cleaner interface than using COM objects.

I wish Invoke-WebRequest used URIs other than HTTP and HTTPS, but it doesn't.

swinster avatar image swinster RichMatheisen-8856 ·

@RichMatheisen-8856 - Unfortunately, we are parsing local files (as per the link), which are generated by our documentation authoring tool (MadCap Flare) :(. The intention is for this to be a post-process PS1 script that MadCap calls after it generates the HTML files.

What did you make of the video?


I think the video explained the problem pretty well. It's left me flummoxed, though. I've got nothing I can add!

ChrisSwinney-1188 answered ChrisSwinney-1188 published

Hey @RichMatheisen-8856 (or anyone :) )

Other things sidetracked me for a while (like some marriage type thing), but I finally managed to sit back down today to re-look at this.

Interestingly, I managed to get something working by doing as I did in the video, i.e. reading in a dummy HTML file first, then reading in the real HTML file. I can loop through the NBSP characters within the given tags and replace them; however, the result is just not quite what I intended :(

The problem we are having with the NBSP characters is that they break certain browser views. These NBSPs specifically are inside a <CODE> tag as mentioned, and those tags are within a <PRE> tag. What I was trying to do was replace each NBSP with a completely normal space, i.e. " ".

 $HTML.all.tags('pre') | ForEach-Object{$_.outerHTML = $_.outerHTML -replace '&nbsp;', ' '}

Unfortunately, whilst ALL NBSP characters are replaced, they are replaced with nothing, so they are effectively removed.

This is what we start with:

<div style="mc-code-lang: Python;" class="codeSnippetBody" data-mc-continue="False" data-mc-line-number-start="1" data-mc-use-line-numbers="False"><pre><code>{<br />&#160;&#160;{% <span style="color: #a71d5d; ">if</span> service_config %}<br />&#160;&#160;&#160;&#160;<span style="color: #df5000; ">"action"</span> : <span style="color: #df5000; ">"continue"</span>,<br />&#160;&#160;&#160;&#160;<span style="color: #df5000; ">"result"</span> : {
                 {service_config|pex_to_json}}<br />&#160;&#160;{% <span style="color: #a71d5d; ">else</span> %}<br />&#160;&#160;&#160;&#160;&#160;<span style="color: #df5000; ">"action"</span> : <span style="color: #df5000; ">"reject"</span>,<br />&#160;&#160;&#160;&#160;&#160;<span style="color: #df5000; ">"result"</span> : {}<br />&#160;&#160;{% endif %}<br />}</code></pre>
    </div>


And after reading in, this is what we get in the outerHTML property:

outerHTML                    : <CODE>{<BR>&nbsp;&nbsp;{% <SPAN style="COLOR: #a71d5d">if</SPAN> service_config %}<BR>&nbsp;&nbsp;&nbsp;&nbsp;<SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: 
                                   #df5000">"continue"</SPAN>,<BR>&nbsp;&nbsp;&nbsp;&nbsp;<SPAN style="COLOR: #df5000">"result"</SPAN> : {
                 {service_config|pex_to_json}}<BR>&nbsp;&nbsp;{% <SPAN style="COLOR: #a71d5d">else</SPAN> 
                                   %}<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"reject"</SPAN>,<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<SPAN style="COLOR: 
                                   #df5000">"result"</SPAN> : {}<BR>&nbsp;&nbsp;{% endif %}<BR>}</CODE>


OK so far. But after parsing and writing back out to HTML, we end up with:

<DIV class=codeSnippetBody style="mc-code-lang: Python" data-mc-use-line-numbers="False" data-mc-line-number-start="1" data-mc-continue="False"><PRE><CODE>{<BR>{% <SPAN style="COLOR: #a71d5d">if</SPAN> service_config %}<BR><SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"continue"</SPAN>,<BR><SPAN style="COLOR: #df5000">"result"</SPAN> : {
                 {service_config|pex_to_json}}<BR>{% <SPAN style="COLOR: #a71d5d">else</SPAN> %}<BR><SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"reject"</SPAN>,<BR><SPAN style="COLOR: #df5000">"result"</SPAN> : {}<BR>{% endif %}<BR>}</CODE></PRE></DIV>


This doesn't work, as all text is left-justified.

However, if the HTML file is manually edited so that simple spaces are inserted where the NBSP characters used to be, everything works as it should. No browser views are broken and all is good.

<DIV class=codeSnippetBody style="mc-code-lang: Python" data-mc-use-line-numbers="False" data-mc-line-number-start="1" data-mc-continue="False"><PRE><CODE>{<BR>  {% <SPAN style="COLOR: #a71d5d">if</SPAN> service_config %}<BR>    <SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"continue"</SPAN>,<BR>    <SPAN style="COLOR: #df5000">"result"</SPAN> : {
                 {service_config|pex_to_json}}<BR>  {% <SPAN style="COLOR: #a71d5d">else</SPAN> %}<BR>     <SPAN style="COLOR: #df5000">"action"</SPAN> : <SPAN style="COLOR: #df5000">"reject"</SPAN>,<BR>     <SPAN style="COLOR: #df5000">"result"</SPAN> : {}<BR>  {% endif %}<BR>}</CODE></PRE></DIV>


So, whilst I am over the oddity in the way the HTMLFile COM object works, I can't get my head around simply replacing NBSP with ' '.

RichMatheisen-8856 answered

I think your problem is the use of styles inside PRE tags. The <pre> tag is supposed to represent preformatted text, in which whitespace and line breaks are preserved. Outside a <pre> tag, multiple consecutive spaces are collapsed into a single space.

In your two examples (of what doesn't work, and what works), the length of the data is different (by 22 characters), which probably accounts for the left justification if the missing characters are spaces at the beginning of the line of code.

It seems to me that the use of the <CODE> tags to format and colorize the data would make the <PRE> tags kinda pointless. This may account for your original problem of copying the code and preserving the formatting.

You might want to find an HTML discussion board and ask about this as I'm no HTML guru.

ChrisSwinney-1188 answered RichMatheisen-8856 commented

In your two examples (of what doesn't work, and what works), the length of the data is different (by 22 characters), which probably accounts for the left justification if the missing characters are spaces at the beginning of the line of code

Hey @RichMatheisen-8856, indeed the additional characters are simply me manually adding whitespace between the <BR> and <SPAN> tags, which does the trick but is not particularly efficient over a few hundred files.

As I understand it, code tags nested inside pre tags are fine, and indeed this seems to be a recommended practice (https://www.w3.org/TR/2011/WD-html5-author-20110809/the-code-element.html). Whitespace is used to format the text (as I did manually above).

I have had the issue clarified. The code tags are used to define jinja2 code snippets that users will cut&paste into a text window. Unfortunately, the NBSP characters are copied, and this breaks the jinja2 parser :(.
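
To illustrate why the pasted text differs from what users expect, here is a minimal sketch that decodes the entity roughly the way a browser would before the text reaches the clipboard, then lists the resulting code points. The sample string is made up for illustration; [System.Net.WebUtility]::HtmlDecode is the standard .NET decoder.

```powershell
# Decode the entities in a made-up sample, as a browser would when rendering the page.
$decoded = [System.Net.WebUtility]::HtmlDecode('&#160;&#160;x')

# List the code point of each decoded character.
# 160 is U+00A0 (no-break space); a plain space would be 32.
$codePoints = $decoded.ToCharArray() | ForEach-Object { [int]$_ }
$codePoints -join ','
```

The two leading characters look like spaces on screen but are a different character, which is presumably what the jinja2 parser objects to.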

I might have to see if I can build a regex find/replace on the raw text file rather than parsing it through the COM HTMLfile object, which I wasn't looking forward to....
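
A minimal sketch of what such a raw-text find/replace might look like, assuming the <code> blocks are never nested and the files are UTF-8. The function name Convert-CodeNbsp and the file names in the comments are hypothetical; this is an illustration of the idea, not a substitute for a real HTML parser.

```powershell
# Hypothetical helper: replaces non-breaking spaces with plain spaces,
# but only inside <code>...</code> blocks of the raw HTML text.
function Convert-CodeNbsp {
    param([Parameter(Mandatory)][string]$Html)

    # (?is): case-insensitive, and '.' also matches newlines,
    # so blocks that span several lines are matched as one unit.
    $codeBlock = [regex]'(?is)<code\b[^>]*>.*?</code>'

    # The evaluator runs once per matched block and returns its replacement text.
    $evaluator = [System.Text.RegularExpressions.MatchEvaluator]{
        param($m)
        # Cover the numeric entity, the named entity, and the literal U+00A0 character.
        $m.Value -replace '&#160;|&nbsp;|\u00A0', ' '
    }

    $codeBlock.Replace($Html, $evaluator)
}

# Example use on one file (both paths are placeholders):
# $source = Get-Content -Path example.html -Raw -Encoding UTF8
# Convert-CodeNbsp -Html $source | Set-Content -Path example-clean.html -Encoding UTF8
```

Because the replacement happens on the raw text, the rest of the markup (attribute order, tag case, entities outside <code>) is left exactly as the authoring tool wrote it.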


I can see that the "works" HTML is longer because it contains more spaces. :-) I guess my point was that if you replace the &#160; characters with PowerShell in the HTMLFile COM object and then write the changes to the file, the HTML parser will compress the consecutive spaces to a single space, giving it the "left justified" look. I know that shouldn't happen inside a <pre> tag, but the HTML parser used by the COM object may have a bug.
