un-messing Unicode in PowerShell

PowerShell has a bit of a problem with accepting the output of native commands that print Unicode into its pipelines. PowerShell tries to be smart in determining whether a command prints Unicode or ASCII, so if the output happens to be nicely formed and starts with the proper Unicode byte order mark (0xFF 0xFE), then it gets accepted OK. But if it doesn't, PowerShell mangles the output by taking it as ASCII and internally converting it to Unicode. Even redirecting the output to a file doesn't help, because PowerShell implements the redirection by pipelining and then saving to the file itself, so everything gets mangled just the same.
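
To see what the mangling looks like, you can dump the character codes of a captured line. This is only an illustration, with mycommand standing in for any native program that writes UTF-16LE to its standard output:

 # Every other "character" comes out as a NUL (code 0), because PowerShell has
 # split each 2-byte UTF-16LE code unit into two single-byte characters.
 $line = (mycommand | Select-Object -First 1)
 ($line.ToCharArray() | ForEach-Object { [int]$_ }) -join ' '
 # prints something like: 72 0 101 0 108 0 108 0 111 0 ... instead of 72 101 108 108 111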

One workaround that works is to start the command through an explicit cmd.exe and do the redirection in cmd.exe:

 cmd /c "mycommand >output.txt 2>error.txt"

Then you can read the file with Get-Content -Encoding Unicode. Unfortunately, there is no such encoding override for the pipelines, nor in Encode-Command.
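
Putting the two pieces together, a minimal sketch (mycommand is again the stand-in for the native program):

 # Let cmd.exe write the files, so the UTF-16LE bytes reach the disk unmangled,
 # then read them back with the explicit Unicode encoding.
 cmd /c "mycommand >output.txt 2>error.txt"
 $output = Get-Content output.txt -Encoding Unicode
 $errout = Get-Content error.txt -Encoding Unicode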

If you really need a pipeline, another workaround is again to start cmd.exe, now with two commands: the first one prints the Unicode byte order mark, and the second is your command. But there is no easy way to print that mark from cmd itself, so you'll have to write the first command yourself.
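
A sketch of how that could look, with a made-up helper name printbom.exe, compiled on the fly with Add-Type; all it does is write the two bytes 0xFF 0xFE to its standard output:

 # Build the hypothetical printbom.exe: it prints the UTF-16LE byte order mark
 # and nothing else.
 $src = 'public static class PrintBom { public static void Main() {' +
     ' var o = System.Console.OpenStandardOutput();' +
     ' o.Write(new byte[] { 0xFF, 0xFE }, 0, 2); o.Flush(); } }'
 Add-Type -TypeDefinition $src -OutputType ConsoleApplication -OutputAssembly .\printbom.exe

 # With the mark prepended, PowerShell accepts the combined output as Unicode.
 $data = cmd /c ".\printbom.exe & mycommand"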

Well, yet another workaround is to de-mangle the mangled output. Here is the function that does it:

 function ConvertFrom-Unicode
{
<#
.SYNOPSIS
Convert a misformatted string produced by reading the Unicode UTF-16LE text
as ASCII to the proper Unicode string.

It's slow, so whenever possible, it's better to read the text directly as
Unicode. One case where it's impossible is piping the input from an
exe into PowerShell.

WARNING:
The conversion is correct only if the original text contained only ASCII
(even though expanded to Unicode). The Unicode characters with codes 10
or 13 in the lower byte throw off PowerShell's line-splitting, and it's
impossible to reassemble the original characters back together.
#>
    param(
        ## The input string.
        [Parameter(ValueFromPipeline = $true)]
        [string] $String,
        ## Auto-detect whether the input string is misformatted, and
        ## do the conversion only if it is, otherwise return the string as-is.
        [switch] $AutoDetect
    )

    process {
        $len = $String.Length

        if ($len -eq 0) {
            return $String # nothing to do, and would confuse the computation
        }

        $i = 0
        if ([int32]$String[0] -eq 0xFF -and [int32]$String[1] -eq 0xFE) {
            $i = 2 # skip the byte order mark characters
        } else {
            if ([int32]$String[0] -eq 0) {
                # Weird case when the high byte of Unicode CR or LF gets split off and
                # prepended to the next line. Skip that byte.
                $i = 1
                if ($len -eq 1) {
                    return # This string was created by breaking up CR-LF, return nothing
                }
            } elseif ($AutoDetect) {
                if ($len -lt 2 -or [int32]$String[1] -ne 0) {
                    return $String # this looks like ASCII
                }
            }
        }

        $out = New-Object System.Text.StringBuilder
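        # Reassemble each pair of characters (low byte, high byte) back into
        # a single UTF-16LE code unit.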
        for (; $i -lt $len; $i+=2) {
            $null = $out.Append([char](([int32]$String[$i]) -bor (([int32]$String[$i+1]) -shl 8)))
        }
        $out.ToString()
    }
}

Export-ModuleMember -Function ConvertFrom-Unicode
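
Since the code ends with Export-ModuleMember, it is meant to be saved as a module. Assuming a made-up file name like ConvertFrom-Unicode.psm1, load it before use:

 Import-Module .\ConvertFrom-Unicode.psm1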

Here is an example of use:

 $data = (Receive-Job $Buf.job | ConvertFrom-Unicode)
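
The function can also be applied directly to a native command in the pipeline; with -AutoDetect it passes plain-ASCII output through untouched (mycommand is again the hypothetical native program):

 $data = (mycommand | ConvertFrom-Unicode -AutoDetect)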

See Also: all the text tools