question

AJ-AJ asked

PowerShell to remove special characters in a text file

Hi there,

When I run the script below against the attached input file, the output contains unwanted special characters. How can I remove them while keeping the single quotes as they are? You can see the special characters in output lines 9, 10, 11 and 12.

output

["Continental District – Denver", "Org Unit"],
["The Team Lead*’s Roadmap", "Org Unit"], --<<The single quotes in the word Lead's need to be retained>>
["Data Services
 - ABC", "Stem Face"],
["App tttm
Â* - CST", "Stem Face"],


script used

$file = Get-Content -Path "C:\temp\pinput.txt" -Raw
$file = $file -ireplace '(?<match1>[\"[^]])\r\n(?<match2>[^]]\"])','${match1}${match2}'
$file |Out-File -FilePath "C:\temp\oput.txt"

[155372-pinput.txt][1]
[1]: /answers/storage/attachments/155372-pinput.txt

Thanks.

windows-server-powershell
pinput.txt (245 B)

1 Answer

DaveKwas answered
 Get-Content C:\temp\pinput.txt -Encoding UTF8 | Set-Content C:\temp\poutput.txt

I believe the output from this command retains the single quote but removes all other special characters.
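
For context on why this helps: sequences in the output that begin with `â€` are what UTF-8 encoded punctuation (an en dash, a curly apostrophe) looks like when its bytes are decoded as Windows-1252. The characters are being misread rather than needing removal. A minimal sketch of the effect, assuming code page 1252 is the legacy encoding in play (the variable names are illustrative only):

```powershell
# Assumption: Windows-1252 is the legacy ("ANSI") code page being applied by mistake.
$utf8   = [System.Text.Encoding]::UTF8
$cp1252 = [System.Text.Encoding]::GetEncoding(1252)

# U+2013 is an en dash; building it from its code point keeps this script ASCII-only.
$original = "District $([char]0x2013) Denver"

# The bytes as they would be stored in a UTF-8 file (the dash becomes E2 80 93).
$bytes = $utf8.GetBytes($original)

$misread = $cp1252.GetString($bytes)   # wrong decoder: the dash turns into three characters
$correct = $utf8.GetString($bytes)     # right decoder: the text round-trips unchanged

$misread
$correct
```

Reading the file with `-Encoding UTF8` selects the second decoder, which is why the stray characters disappear while the apostrophe survives.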


Thanks. The above works as expected. Any suggestions on how to integrate it into my existing code above?

Is there a way to do something like this?

$file = Get-Content -Path "C:\temp\pinput.txt" -Raw # When I use -Encoding here, the logic does not work. Notice that [abcd] is split across two lines in the input file; the logic fixes it by bringing [abcd] back onto one line.
$file = $file -ireplace '(?<match1>[\"[^]])\r\n(?<match2>[^]]\"])','${match1}${match2}'
$file # Can I read this content using -Encoding UTF8 and write it to the output? I tried the line below, but it fails. Sorry, I'm new to PowerShell. I appreciate your help.
$file = $file -Encoding UTF8 | Set-Content c:\temp\poutput.txt

Thanks


Are you using the "-Raw" and "-Encoding UTF8" together? Or are you replacing "-Raw" with "-Encoding UTF8"?

AJ-AJ replied to RichMatheisen-8856:

I replaced -Raw with -Encoding UTF8. When I do that, the existing logic for abcd (split across two lines in the input, joined into a single line by the logic) stops working for some reason.

Maybe I'm missing something?
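
One likely reason the join stops working: without `-Raw`, `Get-Content` returns an array with one string per line and the line endings already stripped, so a pattern containing `\r\n` has nothing to match. `-Raw` and `-Encoding` are independent parameters and can be passed together. `-Encoding` belongs to the cmdlets, not to a string, which is why `$file -Encoding UTF8` is not valid syntax. A small sketch of the difference, using a hypothetical scratch file in the temp folder:

```powershell
# Hypothetical scratch file, created only for this demonstration.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'raw-demo.txt'
[System.IO.File]::WriteAllText($path, "[ab`r`ncd]`r`n", [System.Text.Encoding]::UTF8)

# Without -Raw: an array of lines, with the CRLF separators removed.
$lines = Get-Content -Path $path -Encoding UTF8

# With -Raw: one string that still contains the CRLF separators.
$text = Get-Content -Path $path -Encoding UTF8 -Raw

$lines.Count                    # 2
@($lines -match '\r\n').Count   # 0, no element contains a line break
$text -match '\r\n'             # True

Remove-Item -Path $path

# Combined form for the script in this thread (paths as in the question):
#   $file = Get-Content -Path "C:\temp\pinput.txt" -Raw -Encoding UTF8
#   ...apply the -replace step to $file...
#   $file | Set-Content -Path "C:\temp\poutput.txt" -Encoding UTF8
```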


I'll admit to having a very limited understanding of regex, but even running just your code with no alterations, I can't see the regex affecting the output in any way. You mention the abcd line that is split, which I believe is lines 7 and 8 of the source file. When I run it, I see no formatting change between the value of $file after reading the source file and its value once the regex is applied (see image).
155733-pout-img.jpg



Could you provide an example of what you expect the full poutput.txt to look like after running your script?

If you can get the required output from your regex without the -Encoding UTF8 option, could you take the output with the fixed formatting (abcd brought back together, etc.) and do another open/save using the UTF8 encoding?

 # Pass 1: join the split lines and write the result.
 $file = Get-Content -Path "C:\temp\pinput.txt" -Raw
 $file = $file -ireplace '(?<match1>[\"[^]])\r\n(?<match2>[^]]\"])','${match1}${match2}'
 $file | Out-File -FilePath "C:\temp\poutput.txt"
 # Pass 2: reopen the result as UTF-8 and save it again.
 $file = Get-Content "C:\temp\poutput.txt" -Encoding UTF8
 $file | Set-Content "C:\temp\poutput.txt" -Force


pout-img.jpg (80.8 KiB)

His intention is to drop the intervening newline (CrLf) between the two patterns. That would join the two patterns on a single line.

[ab<newline>
cd]

would become

[abcd]

He hasn't mentioned that this is a continuation of an earlier post he made about how to accomplish that task. The regex needs work. His original post only asked to combine the lines when a line began with "[" and didn't end with "]", and was followed by a line that did not begin with "[" but did end with "]". He later added additional conditions and said he was satisfied with his own solution.
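
The rule as originally stated can be sketched as a single replacement. This is an illustrative pattern written for this note, not the regex from the earlier post, and it assumes the text was read as one string (for example with -Raw):

```powershell
# Illustrative pattern (an assumption, not the regex posted in this thread).
#   ^(\[[^\r\n]*[^\]\r\n])   a line that begins with "[" and does not end with "]"
#   \r?\n                    the line break to drop (CRLF or LF)
#   (?!\[)([^\r\n]*\])       a next line that does not begin with "[" and ends with "]"
#   (?=\r?\n|\z)             ...where "ends" means a line break or end of text follows
$pattern = '(?m)^(\[[^\r\n]*[^\]\r\n])\r?\n(?!\[)([^\r\n]*\])(?=\r?\n|\z)'

# Sample text: the first entry is split across two lines, the second is intact.
$text = "[ab`r`ncd]`r`n[ef]`r`n"

# '$1$2' is single-quoted so PowerShell passes the group references to the regex engine.
$joined = $text -replace $pattern, '$1$2'
$joined
```

With this sample, the split entry comes back as [abcd] and the [ef] line is left alone.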


My bad, Dave. Not sure if it was a copy-paste issue. The encoding issue is now resolved per your and Rich's suggestions, so I have marked the thread as answered. Thanks for your help.

Below is the correct code. If you give it a shot, it should show the intended behavior.

$file = Get-Content -Path "C:\temp\pinput.txt" -Raw
$file = $file -ireplace '(?<match1>[\"[^]])\r\n(?<match2>[^]]\"])','${match1}${match2}'
$file

Input: the pattern should find a line that contains [" but does not contain "] anywhere, then concatenate the next line onto it. Other lines stay as they are. One remaining problem: if the second line has leading spaces, I'm not able to remove them.
["ab
cd"]

expected output
["abcd"]

Thanks a lot for your help.
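
On the leading-spaces point, one possible approach is to let the same replacement swallow any spaces or tabs that follow the line break. The sketch below is not the regex used above. It is an illustrative pattern that assumes the stated rule (a line starting with [" that contains no "] is joined with the next line) and assumes the text was read with -Raw:

```powershell
# Illustrative pattern (an assumption, not the regex posted in this thread).
#   ^(\["(?:(?!"\])[^\r\n])*)   a line starting with [" that never reaches "]
#   \r?\n[ \t]*                 the line break plus any leading spaces or tabs on the next line
$pattern = '(?m)^(\["(?:(?!"\])[^\r\n])*)\r?\n[ \t]*'

# Sample text: the first entry is split and its second half is indented.
$text = '["ab' + "`r`n" + '   cd"]' + "`r`n" + '["xy"]' + "`r`n"

# Keep only the first line's text; the break and the indentation are dropped.
$joined = $text -replace $pattern, '$1'
$joined
```

Each match merges only one following line, so an entry split across three or more lines would need the replacement applied repeatedly.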
