How to convert a UTF16LE CSV file to UTF8 without losing commas - PowerShell

We receive Cognos reports that are encoded as UTF16LE. I am trying to create a PowerShell script to convert the UTF16LE files to UTF8. My logic so far loops through the directory (whichever directory I place the script in, since hardcoding directory names that contain dates/numbers caused errors) and saves the files as UTF-8; however, the delimiters seem to be removed.
I believe it may be due to the way I am reading the data, as I am not specifying UTF16LE; however, I am unsure how to do that. My script so far is:
$files = Get-ChildItem
$dt = Get-Date -Format yyyyMMdd
$extension = "_" + "$dt" + "_utf8.csv"
ForEach ($file in $files) {
    $file_name = $file.BaseName
    $new_file = "$file_name" + "$extension"
    echo $new_file
    #Get-Content $file | Set-Content -Encoding UTF8 $new_file
}
Read-Host -Prompt "Press Enter to Close Window"
Any and all insight into this issue would be greatly appreciated.

PowerShell's Import-CSV and Export-CSV cmdlets support the -Encoding parameter (see the Microsoft Docs), so you could replace your line
Get-Content $file | Set-Content -Encoding UTF8 $new_file
with
Import-CSV -Path $File -Encoding Unicode | Export-CSV -Path $New_File -Encoding UTF8
(UTF16LE encoding is what PowerShell calls "Unicode"; UTF16BE is "BigEndianUnicode". In PowerShell (Core) 6+ the default is UTF8NoBOM, UTF-8 without a Byte Order Mark; Windows PowerShell 5.1 defaults differ per cmdlet, e.g. Export-CSV writes ASCII by default.)
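Dropped into your existing loop, a minimal sketch might look like this (the *.csv filter and the -NoTypeInformation switch are my additions, not part of your original script):
$dt = Get-Date -Format yyyyMMdd
$extension = "_" + $dt + "_utf8.csv"
# Re-parse each UTF-16LE CSV and write it back out as UTF-8.
ForEach ($file in (Get-ChildItem -Filter *.csv)) {
    $new_file = $file.BaseName + $extension
    Import-Csv -Path $file.FullName -Encoding Unicode |
        Export-Csv -Path $new_file -Encoding UTF8 -NoTypeInformation
}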

Since all you want to do is convert the character encoding, reading and writing as a string would be the most straightforward. As always, read a text file with the character encoding it was written with:
Get-Content -Encoding Unicode $file | Set-Content -Encoding UTF8 $new_file
Encoding "Unicode" for UTF-16 harkens back to the infancy of the Unicode character set when UCS-2 was going to be "it" for many environments. Then the explosion happened and UTF-16 was born from UCS-2. Systems invented since then quite reasonably use UTF16 or similar when they mean UTF-16 and "Unicode" for UTF-16 is esoteric and imponderable.

Related

ASCII encoding does not work in PowerShell

Just can't get ASCII encoding to work in PowerShell. Tried a bunch of different approaches.
Whatever I try I get a UTF-8 encoded file (that is what Notepad++ tells me):
$newLine = "Ein Test öäü"
$newLine | Out-File -FilePath "c:\temp\check.txt" -Encoding ascii
PSVersion = 5.1.14393.5066
Any hint is welcome!
ASCII is a 7-bit character set and doesn't contain any accented characters, so obviously storing öäü as ASCII cannot work. If you need UTF-8 then you need to specify the encoding as utf8:
$newLine = "Ein Test öäü"
$newLine | Out-File -FilePath "c:\temp\check.txt" -Encoding utf8
If you need another encoding then specify it accordingly. For example, to get the ANSI code page use this:
$newLine = "Ein Test öäü"
$newLine | Out-File -FilePath "c:\temp\check.txt" -Encoding default
-Encoding default will save the file in the current ANSI code page and -Encoding oem will use the current OEM code page. Just press Tab after -Encoding and PowerShell will cycle through the list of supported encodings. For encodings not in that list you can use System.Text.Encoding directly.
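To see exactly what System.Text.Encoding can resolve, you can list every encoding .NET knows about and pick one by code page or by name:
# List all encodings with their code pages.
[Text.Encoding]::GetEncodings() | Sort-Object CodePage | Format-Table CodePage, Name, DisplayName -AutoSize
[Text.Encoding]::GetEncoding(1252)          # Windows-1252, by code page
[Text.Encoding]::GetEncoding("iso-8859-1")  # Latin-1, by name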
Note that "ANSI code page" is a misnomer and the actual encoding changes depending on each environment so it won't be reliable. For example if you change the code page manually then it won't work anymore. For a more reliable behavior you need to explicitly specify the encoding (typically Windows-1252 for Western European languages). In older PowerShell use
[IO.File]::WriteAllLines("c:\temp\check.txt", $newLine, [Text.Encoding]::GetEncoding(1252)
and in PowerShell Core you can use
$newLine | Out-File -FilePath "check2.txt" -Encoding ([Text.Encoding]::GetEncoding(1252))
See How do I change my Powershell script so that it writes out-file in ANSI - Windows-1252 encoding?
Found the solution:
$file = "c:\temp\check-ansi.txt"
$newLine = "Ein Test öÖäÄüÜ"
Remove-Item $file
[IO.File]::WriteAllLines($file, $newLine, [Text.Encoding]::GetEncoding(28591))

Keep Same Encoding With Set-Content Multiple Files in PowerShell

I'm attempting to write a script to be used when migrating an application from one server and/or drive letter to another. My goal is to copy the directory to the new location and then run a script that replaces all instances of the old hostname, IP address, and drive letter with the new hostname, IP address, and drive letter on the new server. This appears to do exactly that:
ForEach ($File in (Get-ChildItem $path\* -Include *.xml,*.config -Recurse)) {
    (Get-Content $File.FullName -Raw) -replace [RegEx]::Escape($oldhost),$newhost `
        -replace [RegEx]::Escape($oldip),$newip `
        -replace "$olddriveletter(?=:\\Application)",$newDriveLetter |
        Set-Content $File.FullName -NoNewline
}
The one problem I am having is that the files all have different types of encoding. Some ANSI, some UTF-8, some Unicode, etc. When I run the script, it saves everything as ANSI and then my application fails to work. I know how to add the encoding parameter, but is there any way to keep the same encoding on each individual file, without writing out a script specifying each individual file in the directory and the encoding that each individual file has?
That would be difficult. It's too bad that Get-Content doesn't expose an encoding property. Here's a script that tries to get the encoding if there's a signature (BOM). Maybe you can just run it first and check them all. But some Windows files are Unicode with no BOM. At least XML files can declare their encoding: get-childitem *.xml | select-string encoding. There might be a better way to load XML files; see the bottom answer of Powershell: Setting Encoding for Get-Content Pipeline.
# encoding.ps1
# https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding
param([Parameter(ValueFromPipeline=$True)] $filename)
process {
    # StreamReader detects a BOM when the third constructor argument is $true;
    # Peek() forces it to actually examine the stream.
    $reader = [IO.StreamReader]::new($filename, [Text.Encoding]::Default, $true)
    $peek = $reader.Peek()
    $encoding = $reader.CurrentEncoding
    $reader.Close()
    [pscustomobject]@{
        Name         = Split-Path $filename -Leaf
        BodyName     = $encoding.BodyName
        EncodingName = $encoding.EncodingName
    }
}
# end encoding.ps1
PS C:\users\me> get-childitem chinese16.txt | encoding
Name BodyName EncodingName
---- -------- ------------
chinese16.txt utf-16 Unicode
Something like this will use the encoding declared in the XML file, even if it didn't truly match beforehand. (This also pretty-prints the XML.)
PS C:\users\me> [xml]$xml = get-content file.xml
PS C:\users\me> $xml.save("$pwd\file.xml")
Use file.exe from the Git binaries to find out the encoding, then add the encoding parameter to the Set-Content line with if/else statements to meet your needs.
ForEach ($File in (Get-ChildItem $path\*)) {
    # Parenthesize Get-Content so that -replace operates on the content string.
    $Content = (Get-Content $File.FullName -Raw) -replace [RegEx]::Escape($oldhost),$newhost `
        -replace [RegEx]::Escape($oldip),$newip `
        -replace "$olddriveletter(?=:\\Application)",$newDriveLetter
    $Encoding = file --mime-encoding $File.FullName
    $FullName = $File.FullName
    Write-Host "$FullName - $Encoding"
    # file reports e.g. "C:\path\app.config: utf-8"; rewrite UTF files as UTF-8
    # and let everything else fall back to the default encoding.
    if ($Encoding -like "*utf*") {
        Set-Content -Path $FullName -Value $Content -NoNewline -Encoding UTF8
    }
    else {
        Set-Content -Path $FullName -Value $Content -NoNewline
    }
}
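If two branches aren't enough, a lookup table from file's output to Windows PowerShell encoding names is a straightforward extension. A sketch (the mapping below is an assumption; adjust it to the encodings that actually occur in your tree):
# Hypothetical map from `file --mime-encoding` labels to -Encoding values.
$encodingMap = @{
    'utf-8'    = 'UTF8'
    'utf-16le' = 'Unicode'
    'utf-16be' = 'BigEndianUnicode'
    'us-ascii' = 'Ascii'
}
# file prints "path: label"; strip everything up to and including the colon.
$mime = (file --mime-encoding $File.FullName) -replace '^.*:\s*', ''
$enc = if ($encodingMap.ContainsKey($mime)) { $encodingMap[$mime] } else { 'Default' }
Set-Content -Path $File.FullName -Value $Content -NoNewline -Encoding $enc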
Reference:
https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.management/set-content
http://gnuwin32.sourceforge.net/packages/file.htm

Modify a JSON file with PowerShell without writing BOM

I need to modify an existing UTF8 encoded JSON file with PowerShell. I tried with the following code:
$fileContent = ConvertFrom-Json "$(Get-Content $filePath -Encoding UTF8)"
$fileContent.someProperty = "someValue"
$fileContent | ConvertTo-Json -Depth 999 | Out-File $filePath
This adds a BOM to the file and also encodes it as UTF16. Is it possible to have ConvertFrom-Json and ConvertTo-Json not change the encoding or add the BOM?
This has nothing to do with ConvertTo-Json or ConvertFrom-Json. The encoding is defined by the output cmdlet: in Windows PowerShell, Out-File defaults to Unicode (UTF-16LE) and Set-Content to the system's ANSI code page. With either of them the desired encoding can be defined explicitly:
... | Out-File $filePath -Encoding UTF8
or
... | Set-Content $filePath -Encoding UTF8
That will still write a (UTF8) BOM to the output file, but I wouldn't consider UTF-8 encoding without BOM a good practice anyway.
If you want ASCII-encoded output files (no BOM) replace UTF8 with Ascii:
... | Out-File $filePath -Encoding Ascii
or
... | Set-Content $filePath -Encoding Ascii
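If what you actually need is UTF-8 without a BOM in Windows PowerShell (where -Encoding UTF8 always writes one), here is a sketch using System.Text.UTF8Encoding, along the same lines as the WriteAllLines trick in the next question:
$utf8NoBom = New-Object System.Text.UTF8Encoding($false)  # $false = no BOM
$json = $fileContent | ConvertTo-Json -Depth 999
# Pass an absolute path: .NET's working directory can differ from PowerShell's.
[IO.File]::WriteAllLines($filePath, $json, $utf8NoBom)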

UTF8 encoding without BOM - PowerShell

I have a bat file where I encode some CSV files. The problem is that there is one extra character at the beginning of the file once the encoding has been done (a BOM byte, I guess). This character bothers me because after encoding I use this file to generate a database.
Here is the line for encoding (inside bat file):
powershell -Command "&{ param($Path); (Get-Content $Path) | Out-File $Path -Encoding UTF8 }" CSVs\\pass.csv
Is there any way to encode the file without a BOM (if this is the problem)?
Thanks!
I found the solution.
Just change the line to this:
powershell -Command "&{ param($Path); $Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False); $MyFile = Get-Content $Path; [System.IO.File]::WriteAllLines($Path, $MyFile, $Utf8NoBomEncoding) }" CSVs\\pass.csv

Powershell: Setting Encoding for Get-Content Pipeline

I have a file saved as UCS-2 Little Endian I want to change the encoding so I ran the following code:
cat tmp.log -encoding UTF8 > new.log
The resulting file is still in UCS-2 Little Endian. Is this because the pipeline is always in that format? Is there an easy way to pipe this to a new file as UTF8?
As suggested here:
Get-Content tmp.log | Out-File -Encoding UTF8 new.log
I would do it like this:
get-content tmp.log -encoding Unicode | set-content new.log -encoding UTF8
My understanding is that the -Encoding option selects the encoding that the file should be read or written in.
To load content from an XML file with an explicit encoding:
(Get-Content -Encoding UTF8 $fileName)
If you are reading an XML file, here's an even better way that adapts to the encoding of your XML file:
$xml = New-Object -TypeName XML
$xml.load('foo.xml')
PowerShell's Get-Content/Set-Content -Encoding parameter doesn't handle every encoding. You may need to use IO.File, for example to load a file using Windows-1252:
$myString = [IO.File]::ReadAllText($filePath, [Text.Encoding]::GetEncoding(1252))
See [Text.Encoding]::GetEncoding and [Text.Encoding]::GetEncodings.
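Putting read and write together, here is a sketch that re-encodes a Windows-1252 file as UTF-8 via IO.File (the file paths are placeholders):
# [Text.Encoding]::UTF8 writes a BOM; use [Text.UTF8Encoding]::new($false) to omit it.
$text = [IO.File]::ReadAllText("C:\temp\legacy-1252.txt", [Text.Encoding]::GetEncoding(1252))
[IO.File]::WriteAllText("C:\temp\converted-utf8.txt", $text, [Text.Encoding]::UTF8)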