Check if file is not encoded twice - powershell

I used the answer to this question:
Using PowerShell to write a file in UTF-8 without the BOM
to encode a file (UCS-2) to UTF-8. The problem is that if I run the encoding twice (or more times), the Cyrillic text gets broken. How can I stop the encoding if the file is already UTF-8?
The code is:
$MyFile = Get-Content $MyPath
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
[System.IO.File]::WriteAllLines($MyPath, $MyFile, $Utf8NoBomEncoding)

Use:
$MyFile = Get-Content -Encoding UTF8 $MyPath
On the first run, when $MyPath is still UTF-16LE-encoded ("Unicode" encoding, which I assume is what you meant by UCS-2), PowerShell will ignore the -Encoding parameter due to the presence of a BOM in the file, which unambiguously identifies the encoding.
If your original file does not have a BOM, more work is needed.
Once you've saved $MyPath as UTF-8 without BOM, you must tell Windows PowerShell[1] that you expect UTF-8 encoding with -Encoding UTF8, as it interprets files as "ANSI"-encoded by default (encoded according to the typically single-byte code page associated with the legacy system locale).
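To guard against re-encoding, you can also check the file's first bytes up front and convert only when the UTF-16LE BOM (FF FE) is still present. Here is a minimal sketch, assuming the original file was saved as "Unicode" (UTF-16LE) with a BOM:
$fullPath = Convert-Path $MyPath   # .NET APIs require a full, native path
$bytes = [System.IO.File]::ReadAllBytes($fullPath)
if ($bytes.Length -ge 2 -and $bytes[0] -eq 0xFF -and $bytes[1] -eq 0xFE) {
    # UTF-16LE BOM found - convert; otherwise the file was already converted.
    $MyFile = Get-Content $fullPath   # the BOM makes the encoding unambiguous
    $Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
    [System.IO.File]::WriteAllLines($fullPath, $MyFile, $Utf8NoBomEncoding)
}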
[1] Note that the cross-platform PowerShell Core edition defaults to BOM-less UTF-8.

Related

ASCII encoding does not work in powershell

I just can't get ASCII encoding to work in PowerShell. I've tried a bunch of different approaches.
Whatever I try, I get a UTF-8-encoded file (that is what Notepad++ tells me):
$newLine = "Ein Test öäü"
$newLine | Out-File -FilePath "c:\temp\check.txt" -Encoding ascii
PSVersion = 5.1.14393.5066
Any hint is welcome!
ASCII is a 7-bit character set and doesn't contain any accented characters, so storing öäü as ASCII obviously can't work. If you need UTF-8, specify the encoding as utf8:
$newLine = "Ein Test öäü"
$newLine | Out-File -FilePath "c:\temp\check.txt" -Encoding utf8
If you need another encoding, specify it accordingly. For example, to get the ANSI code page, use this:
$newLine = "Ein Test öäü"
$newLine | Out-File -FilePath "c:\temp\check.txt" -Encoding default
-Encoding default will save the file in the current ANSI code page and -Encoding oem will use the current OEM code page. Just press Tab after -Encoding and PowerShell will cycle through the list of supported encodings. Encodings not in that list can still be used easily via System.Text.Encoding.
Note that "ANSI code page" is a misnomer: the actual encoding differs from environment to environment, so it isn't reliable. For example, if the code page is changed manually, it will no longer work. For reliable behavior you need to specify the encoding explicitly (typically Windows-1252 for Western European languages). In older PowerShell versions use
[IO.File]::WriteAllLines("c:\temp\check.txt", $newLine, [Text.Encoding]::GetEncoding(1252)
and in PowerShell Core you can use
$newLine | Out-File -FilePath "check2.txt" -Encoding ([Text.Encoding]::GetEncoding(1252))
See How do I change my Powershell script so that it writes out-file in ANSI - Windows-1252 encoding?
Found the solution:
$file = "c:\temp\check-ansi.txt"
$newLine = "Ein Test öÖäÄüÜ"
Remove-Item $file -ErrorAction Ignore   # tolerate a missing file
# Note: code page 28591 is ISO 8859-1 (Latin-1); use GetEncoding(1252) for actual Windows-1252.
[IO.File]::WriteAllLines($file, $newLine, [Text.Encoding]::GetEncoding(28591))

PowerShell : Set-Content Replace word and Encoding UTF8 without BOM

I'd like to escape \ to \\ in a CSV file before uploading it to Redshift.
The following simple PowerShell script replaces $TargetWord \ with $ReplaceWord \\ as expected, but it exports UTF-8 with a BOM, which sometimes causes a Redshift COPY error.
Any advice on improving it would be appreciated. Thank you in advance.
Exp_Escape.ps1
Param(
    [string]$StrExpFile,
    [string]$TargetWord,
    [string]$ReplaceWord
)
$(Get-Content "$StrExpFile").replace($TargetWord,$ReplaceWord) | Set-Content -Encoding UTF8 "$StrExpFile"
In PowerShell (Core) 7+, you would get BOM-less UTF-8 files by default; -Encoding utf8 and -Encoding utf8NoBom express that default explicitly; to use a BOM, -Encoding utf8BOM is needed.
In Windows PowerShell, unfortunately, you must use a workaround to get BOM-less UTF-8, because -Encoding utf8 only produces UTF-8 files with BOM (and no other utf8-related values are supported).
The workaround requires combining Out-String with New-Item, which (curiously) creates BOM-less UTF-8 files by default even in Windows PowerShell:
Param(
    [string]$StrExpFile,
    [string]$TargetWord,
    [string]$ReplaceWord
)

$null =
New-Item -Force $StrExpFile -Value (
    (Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord) | Out-String
)
Note:
$null = is needed to discard the output object that New-Item emits (a file-info object describing the newly created file).
-Force is needed in order to quietly overwrite an existing file by the same name (as Set-Content and Out-File do by default).
The -Value argument must be a single (multi-line) string to write to the file, which is what Out-String ensures.
Caveats:
For non-string input objects, Out-String creates the same rich for-display representations as Out-File and as you would see in the console by default.
New-Item itself does not append a trailing newline when it writes the string to the file, but Out-String curiously does; while this happens to be handy here, it is generally problematic, as discussed in GitHub issue #14444.
The alternative to using Out-String is to create the multi-line string manually, which is a bit more cumbersome ("`n" is used to create LF-only newlines, which PowerShell and most programs happily accept even on Windows; for platform-native newlines (CRLF) on Windows, use [Environment]::NewLine instead):
$null =
New-Item -Force $StrExpFile -Value (
    ((Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord) -join "`n") + "`n"
)
Since the entire file content must be passed as an argument,[1] it must fit into memory as a whole; the convenience function discussed next avoids this problem.
For a convenience wrapper function around Out-File for use in Windows PowerShell that creates BOM-less UTF-8 files, see this answer.
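For illustration, here is a minimal sketch of what such a wrapper could look like (the function name Out-FileUtf8NoBom and all implementation details are assumptions here, not the linked answer's actual code); it streams each input object to a StreamWriter that uses a BOM-less UTF8Encoding:
function Out-FileUtf8NoBom {
    param(
        [Parameter(Mandatory)] [string] $LiteralPath,
        [Parameter(ValueFromPipeline)] $InputObject
    )
    begin {
        # .NET needs a full, native path, irrespective of PowerShell's current location.
        if ([IO.Path]::IsPathRooted($LiteralPath)) {
            $fullPath = $LiteralPath
        } else {
            $fullPath = Join-Path (Get-Location).ProviderPath $LiteralPath
        }
        $writer = New-Object IO.StreamWriter $fullPath, $false, (New-Object Text.UTF8Encoding $false)
    }
    process {
        # Out-String yields the same for-display formatting as Out-File;
        # trim its trailing newline and let WriteLine re-add one per object.
        $writer.WriteLine(($InputObject | Out-String).TrimEnd("`r", "`n"))
    }
    end { $writer.Dispose() }
}
Usage sketch: read the file to completion first, then write it back out without a BOM:
$content = (Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord)
$content | Out-FileUtf8NoBom $StrExpFile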
Alternative, with direct use of .NET APIs:
.NET APIs produce BOM-less UTF-8 files by default.
However, because .NET's working directory usually differs from PowerShell's, full file paths must always be used, which requires more effort:
# In order for .NET API calls to work as expected,
# file paths must be expressed as *full, native* paths.
$OutDir = Split-Path -Parent $StrExpFile
if ($OutDir -eq '') { $OutDir = '.' }
$strExpFileFullPath = Join-Path (Convert-Path $OutDir) (Split-Path -Leaf $StrExpFile)
# Note: .NET APIs create BOM-less UTF-8 files *by default*
[IO.File]::WriteAllLines(
    $strExpFileFullPath,
    (Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord)
)
The above uses the System.IO.File.WriteAllLines method.
[1] Note that while New-Item technically supports receiving the content to write to the file via the pipeline, it unfortunately writes each input object to the target file individually, one after another, so only the last one ends up in the file.

Get SamAccountName from text file with DisplayNames

I have a script that works, but since we have coworkers with ö, ü, ä in their names, the CSV resolves those characters into ? (example: Hörnlima = H?rnlima). Because of this it doesn't give me back any SamAccountName and the list is no longer correct. How can I fix that?
Script:
Import-Csv D:\Files\PowerShell\Test\4ME\DisplaynameToSamAccountName\Displaynames.txt | ForEach {
Get-ADUser -Filter "DisplayName -eq '$($_.DisplayName)'" -Properties Name, SamAccountName |
Select Name,SamAccountName
} | Export-CSV -path D:\Files\PowerShell\Test\4ME\DisplaynameToSamAccountName\Accountnames.csv -NoTypeInformation
Any ideas appreciated.
tl;dr:
Use, e.g., Export-Csv -Encoding utf8 ... to save your file with UTF-8 character encoding, which ensures that accented characters such as ö are preserved.
In Windows PowerShell, Export-Csv regrettably defaults to ASCII encoding, which means that any characters outside the US-ASCII range - notably accented characters such as ö - are transliterated to literal ?.
That is, such characters are lost, because they cannot be represented in ASCII encoding.
In PowerShell [Core] v6+, all cmdlets, including Export-Csv, now thankfully default to BOM-less UTF-8 encoding.
As for the behavior when you append to a preexisting CSV file with the -Append switch without specifying -Encoding, see this answer.
Therefore, especially in Windows PowerShell, use the -Encoding parameter to specify the desired character encoding (a complete example for the question's script follows this list):
-Encoding utf8 is advisable, because it is capable of encoding all Unicode characters.
In Windows PowerShell, the resulting file will invariably have a BOM.
In PowerShell [Core] v6+, it will be BOM-less, which is generally better for cross-platform compatibility, but you can alternatively use -Encoding utf8BOM to use a BOM.
-Encoding Unicode (UTF-16LE) is another option, but it results in larger files (most characters are encoded as 2 bytes). This encoding always results in a BOM.
-Encoding Default (Windows PowerShell) or
-Encoding (Get-Culture).TextInfo.ANSICodePage (PowerShell [Core] v6+) on Windows uses your system's active ANSI code page to create a BOM-less file.
This legacy encoding is best avoided, however, for multiple reasons:
Many modern applications assume UTF-8 encoding in the absence of a BOM.
Even those that read the file as ANSI-encoded may interpret it differently if the host system happens to have a different active ANSI code page.
Since the active ANSI code page is (for Western cultures) a fixed, single-byte encoding, only 256 characters can be represented, which is only a small subset of all Unicode characters.
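Applied to the question's script, the fix is simply to add -Encoding utf8 to the Export-Csv call (paths as in the question):
Import-Csv D:\Files\PowerShell\Test\4ME\DisplaynameToSamAccountName\Displaynames.txt | ForEach-Object {
    Get-ADUser -Filter "DisplayName -eq '$($_.DisplayName)'" -Properties Name, SamAccountName |
        Select-Object Name, SamAccountName
} | Export-Csv -Path D:\Files\PowerShell\Test\4ME\DisplaynameToSamAccountName\Accountnames.csv -NoTypeInformation -Encoding utf8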
Note that when PowerShell reads a file that is BOM-less, including source code, the behavior differs between the two editions:
In Windows PowerShell, Default is assumed, i.e. the system's active ANSI code page.
Note that in recent versions of Windows 10 it is now possible to make UTF-8 the ANSI code page, but such a system-wide change can have unintended consequences - see this answer.
In PowerShell [Core] v6+, UTF-8 is assumed.

Powershell Chinese characters encoding error

I have a file called test.txt which contains a single Chinese character, 中, in it.
Under a hex editor's view, the character's bytes are e4 b8 ad (screenshot omitted).
If I do get-content test.txt | Out-File test_output.txt, the content of test_output.txt is different from test.txt. Why is this happening?
I've tried all the encoding parameters listed here ("Unicode", "UTF7", "UTF8", "UTF32", "ASCII", "BigEndianUnicode", "Default", and "OEM"), but none of them correctly converts the Chinese character.
How can I correctly convert Chinese characters using Get-Content and Out-File?
The encoding, e4 b8 ad, looks like the URL-encoding of 中; is this why none of the encoding parameters are compatible with this Chinese character?
I use Notepad++ and Notepad++'s hex-editor plugin as my text-editor and hex-editor, respectively.
I tried get-content test.txt -encoding UTF8 | Out-File test_output.txt -encoding UTF8
My test.txt is "e4 b8 ad 0a". And the output is "ef bb bf e4 b8 ad 0d 0a"
test.txt is in UTF-8.
In Windows PowerShell, Get-Content doesn't recognize UTF-8 without a BOM, and Out-File writes UTF-16LE ("Unicode") by default.
So specifying the encoding for both commands is necessary.
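For example, in Windows PowerShell (this matches the command the asker tried; the lingering BOM is expected):
Get-Content test.txt -Encoding UTF8 | Out-File test_output.txt -Encoding UTF8
# Note: Out-File -Encoding UTF8 still writes a BOM in Windows PowerShell, which is
# exactly the ef bb bf prefix observed above; use [IO.File]::WriteAllLines with a
# BOM-less UTF8Encoding if the BOM is unwanted.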
In my case, the Unicode encoding solved my problem with the Chinese characters. The file I was modifying contained C# code on a TFS server.
$path = "test.cs"
(Get-Content -Path $path -Encoding Unicode) |
    Set-Content -Path $path -Encoding Unicode
It might help somebody else.

Powershell: Setting Encoding for Get-Content Pipeline

I have a file saved as UCS-2 Little Endian I want to change the encoding so I ran the following code:
cat tmp.log -encoding UTF8 > new.log
The resulting file is still in UCS-2 Little Endian. Is this because the pipeline is always in that format? Is there an easy way to pipe this to a new file as UTF8?
As suggested here:
Get-Content tmp.log | Out-File -Encoding UTF8 new.log
I would do it like this:
get-content tmp.log -encoding Unicode | set-content new.log -encoding UTF8
My understanding is that the -encoding option selects the encoding that the file should be read or written in.
To load content from an XML file with an explicit encoding:
(Get-Content -Encoding UTF8 $fileName)
If you are reading an XML file, here's an even better way that adapts to the encoding of your XML file:
$xml = New-Object -TypeName XML
$xml.Load('foo.xml')   # Load() honors the encoding declared in the XML document itself
PowerShell's Get-Content/Set-Content encoding flag doesn't handle all encoding types. You may need to use IO.File instead, for example to load a file using Windows-1252:
$myString = [IO.File]::ReadAllText($filePath, [Text.Encoding]::GetEncoding(1252))
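The matching write call uses the same encoding object (a sketch, reusing $filePath and $myString from above):
[IO.File]::WriteAllText($filePath, $myString, [Text.Encoding]::GetEncoding(1252))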
See also: [Text.Encoding]::GetEncoding and [Text.Encoding]::GetEncodings