Using PowerShell to parse local HTML documents (then replace text within specific tags) fails for specific HTML files - $HTML = New-Object -Com "HTMLFile"

Swin 51 Reputation points
2020-08-16T15:26:33.883+00:00

Hi All,

Windows 10 - Windows PS 5.1

We have local HTML documents produced with an authoring tool (MadCap Flare), which produces well-formatted HTML. Unfortunately, it adds non-breaking space entities ( &#160; ) within <code> tags, and these cause issues when users copy and paste the text into another application. I want to run a post-processing PowerShell script that parses the HTML, pulls out the <code> tags, and replaces the text within those tags, but ONLY within those tags.

I have tried to create an HTMLFile object and pull out the text within the tags using the following code, but it returns nothing. I have even tried some really basic HTML and just looked at <p> tags, but that also returns nothing.

$Source = Get-Content -path example.html -raw
$HTML = New-Object -Com "HTMLFile"
$HTML.IHTMLDocument2_write($Source)

$HTML.all.tags("code") | % InnerText

I was thinking that converting the HTML to XML using ConvertTo-XML might help. That seems to work, but I'm not sure where to go next.

I tried to upload some example HTML, but these forums will not allow me to do that, either in line or as a txt file :(

Even tried DropBox and OneDrive links - both are denied :(
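To make the goal concrete, here is a minimal sketch of the replacement I am after. It does not use the HTMLFile COM object at all; it swaps in a plain regular expression instead, so it is only an illustration of "replace inside <code> and nowhere else", not a real HTML parser. The function name Repair-CodeTagSpaces is made up for this example, and the sketch assumes <code> elements are not nested.

```powershell
# Hypothetical helper (name invented for this sketch): replaces non-breaking
# spaces with ordinary spaces, but only inside <code>...</code> elements.
# Assumption: <code> elements are not nested; regex is not a full HTML parser.
function Repair-CodeTagSpaces {
    param(
        [Parameter(Mandatory)]
        [string]$Html
    )

    # Group 1 = opening tag, group 2 = inner markup, group 3 = closing tag.
    # (?is) = case-insensitive, and '.' also matches newlines.
    $pattern = '(?is)(<code\b[^>]*>)(.*?)(</code\s*>)'

    # Called once per <code> element; only the inner markup is changed.
    $evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
        param($m)
        # Covers the numeric entity, the named entity, and the raw U+00A0 character.
        $inner = $m.Groups[2].Value -replace '&#160;|&nbsp;|\u00A0', ' '
        $m.Groups[1].Value + $inner + $m.Groups[3].Value
    }

    [regex]::Replace($Html, $pattern, $evaluator)
}

# Usage: the entity inside <p> is left alone, the one inside <code> is replaced.
$sample = '<p>a&#160;b</p><code>Get-Item&#160;C:\Temp</code>'
Repair-CodeTagSpaces -Html $sample
# → <p>a&#160;b</p><code>Get-Item C:\Temp</code>
```

For a whole file, the same function could be fed the output of Get-Content -Raw and its result piped to Set-Content, but I would still prefer a proper parser if I can get one to work.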

Windows Server PowerShell

8 answers

  1. Rich Matheisen 45,091 Reputation points
    2020-08-16T20:40:08.857+00:00

    You may not have to do any replacement at all. The parser should replace the numeric code with the mnemonic. But if it is a problem you should be able to modify the code tags using something like this (admittedly crude) code:

    Okay, I quit. The freaking filters on the Microsoft side have decided that posting code in an area supposedly devoted to a programming language is forbidden. This sucks.

    If there are any "moderators" reading this, they need to get this fixed. It isn't the first time I've run into this problem since they shut down the old forums and switched to this POS system.

    $HTML.all.tags('code') | ForEach-Object { $_.innerHTML = $_.innerHTML -Replace ' ', 'NON-BREAKING-SPACE' }

    You can save the modified data by using the same code as above, but substitute "HTML" for "code" and pipe to "ForEach-Object{$_.document.documentElement.OuterHTML} | Out-File ...."

    I really do apologize for having to butcher the code like this, but this freaking system refuses to accept it if it's properly formatted.


  2. Swin 51 Reputation points
    2020-08-19T15:40:13.527+00:00

    @Rich Matheisen - np and fully understand - I spent a huge amount of time trying to upload something that I thought would help, and failed dismally.


  3. Swin 51 Reputation points
    2020-08-23T19:32:52.6+00:00

    @Rich Matheisen Thanks - although I am utterly confused by what is going on. Not by what you posted; that is fine, and indeed pretty much what I posted in the first place. What I am confused about is the results I am getting in PowerShell.

    The reason I posted in the first place was that I got no result when I simply tried to output the "code" tags ( $HTML.all.tags('code') ) - i.e. seemingly nothing was parsed into the HTMLFile document. This is so bizarre to my mind that I have had to record it on video to see if you, or anyone else, can understand what is going on.


  4. Swin 51 Reputation points
    2020-08-23T20:02:55.467+00:00

    Confusing PowerShell Behaviour Video - 3 mins 30 sec

    So, it seems as if my file (which I cannot seem to post here in any way, shape or form) cannot be written into a new HTMLFile object. Once I have tried with this file, subsequent attempts to write other HTML files to the same object also fail. BUT if you first write a simple HTML file to a new HTMLFile object, you can then write my more complex file to it with no issue :/


  5. Swin 51 Reputation points
    2020-08-23T20:14:59.807+00:00

    A more complex HTML file

    It would appear as if direct HTML links don't work, but the Hyperlink button does.