How to keep UTF-8 in batch for csv file? - powershell

Hi Stackoverflow community!
I have a .csv file with some values "{Null}" and "Null". I use a batch file (.cmd) with a PowerShell command to replace those values with "".
The issue is that the output file has a different encoding (UTF-16LE) than the input (UTF-8). Is there a way to keep the original encoding?
powershell -Command "(gc myfile.csv) -replace '{NULL}', '' | Out-File myfile_replaced.csv"
I tried to find a solution and understood that Notepad by default uses UTF-16LE encoding. Theoretically, I could change the encoding in Notepad++, but this is not an option, as the code will be shared with others.
And this should be implemented in the batch file; otherwise, I could manually search and replace the values.

Out-File supports using -Encoding as a parameter. This is true for various other cmdlets that write files (e.g. Export-Csv) as well.
As per the documentation (quoted here from Export-Csv; Out-File accepts the same set of -Encoding values in PowerShell (Core)):
-Encoding
Specifies the encoding for the exported CSV file. The default value is UTF8NoBOM.
The acceptable values for this parameter are as follows:
ASCII: Uses the encoding for the ASCII (7-bit) character set.
BigEndianUnicode: Encodes in UTF-16 format using the big-endian byte order.
OEM: Uses the default encoding for MS-DOS and console programs.
Unicode: Encodes in UTF-16 format using the little-endian byte order.
UTF7: Encodes in UTF-7 format.
UTF8: Encodes in UTF-8 format.
UTF8BOM: Encodes in UTF-8 format with Byte Order Mark (BOM)
UTF8NoBOM: Encodes in UTF-8 format without Byte Order Mark (BOM)
UTF32: Encodes in UTF-32 format.
Beginning with PowerShell 6.2, the Encoding parameter also allows numeric IDs of registered code pages (like -Encoding 1251) or string names of registered code pages (like -Encoding "windows-1251"). For more information, see the .NET documentation for Encoding.CodePage.
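For example, a minimal sketch applied to the command from the question (assuming that writing UTF-8 is the goal; note that in Windows PowerShell 5.1, -Encoding utf8 writes a BOM, and the UTF8NoBOM value is only available in PowerShell (Core) 6+):
powershell -Command "(gc myfile.csv) -replace '{NULL}', '' | Out-File -Encoding utf8 myfile_replaced.csv"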

Unfortunately, Out-File and the > / >> redirection operators default to "Unicode" (UTF-16LE) encoding in Windows PowerShell. You can even end up mixing two encodings in the same file with >> or Out-File -Append. You can use Set-Content instead, or Out-File -Encoding utf8. Set-Content actually defaults to "ANSI" encoding, but without special characters the result is identical to UTF-8 (without the BOM); you can also use an -Encoding option with Set-Content. Notepad defaults to ANSI, but it can recognize UTF-8 or Unicode even without BOMs or encoding signatures.
powershell -Command "(gc myfile.csv) -replace '{NULL}', '' | set-content myfile_replaced.csv"
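If the input also contains non-ASCII characters, it may help to make both the read and the write encodings explicit - a sketch (again, in Windows PowerShell the UTF8 value writes a BOM; use UTF8NoBOM on PowerShell 6.2+ if a BOM-less file is required):
powershell -Command "(gc -Encoding utf8 myfile.csv) -replace '{NULL}', '' | Set-Content -Encoding utf8 myfile_replaced.csv"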

Related

Simple way to convert txt file from UTF-8 to ASCII

I am trying to convert just one file from UTF-8 to ASCII. I found the following script online, and it creates the output file, but it does not change the encoding to ASCII. Why is this not working?
Get-Content -Path "File/Path/to/file.txt" | Out-File -FilePath "File/Path/to/processed.txt" -Encoding ASCII
tl;dr
-Encoding ASCII does work, though your editor's GUI may still report the resulting file as UTF-8-encoded, for the reasons explained below.
First, a general caveat:
If your input file also contains non-ASCII-range characters, they will be transliterated to verbatim ?, i.e. you'll potentially lose information.
Conversely, if your input files are UTF-8-encoded but do not contain non-ASCII characters, they in effect already are ASCII-encoded files; see below.
ASCII encoding is a subset of UTF-8 encoding (except that ASCII encoding never involves a BOM).
Therefore, any (BOM-less) file composed exclusively of bytes representing ASCII characters is by definition also a valid UTF-8 file.
Modern editors default to BOM-less UTF-8; that is, if a file doesn't start with a BOM, they assume that it is UTF-8-encoded, and that's what their GUIs reflect - even if a given file happens to be composed of ASCII characters only.
To verify that your output file is indeed only composed of ASCII characters, use the following:
# This should return $false; '\P{IsBasicLatin}' matches any NON-ASCII character.
(Get-Content -Raw File/Path/to/processed.txt) -cmatch '\P{IsBasicLatin}'
For an explanation of this test, especially with respect to needing to use -cmatch, the case-sensitive variant of the -match operator, see this answer.
A complete example:
# Write a string that contains non-ASCII characters to a
# file with -Encoding Ascii
# The resulting file will contain 1 line, with content 'caf?'
# That is, the "é" character was "lossily" transliterated to (ASCII) "?"
'café' | Out-File -Encoding Ascii temp.txt
# Examining the file for non-ASCII characters now indicates that
# there are none, i.e., $false is returned.
(Get-Content -Raw temp.txt) -cmatch '\P{IsBasicLatin}'

using powershell to replace extended ascii character in a text file

I need to replace the hex 93 character with a "" string inside several CSV files. Below is the code I'm using, but it is not working. I think the reason it does not work is that the hex value is greater than 7F (decimal 127). I've tried several other methods to no avail. Any help would be appreciated.
$q1 = [String](0x93 -as [char])
Get-ChildItem ".\*.csv" -Recurse | ForEach {
    (Get-Content $_ | ForEach { $_.replace($q1, '""') }) |
        Set-Content $_
}
Note: Attached is an image of the Format-Hex dump of my test file. The first character is the one that I need to perform the replace on.
In Windows PowerShell, the default character encoding when reading from / writing to[1] files is "ANSI", i.e., the legacy 8-bit code page implied by the active system locale.
(By contrast, PowerShell Core defaults to UTF-8.)
For instance, the code page associated with the system locale on a US-English system is 1252, i.e., Windows-1252, where code point 0x93 is the non-ASCII “ quotation mark.
However, once a text file's content has been read into memory, a string's characters are represented in memory as UTF-16LE code units, i.e., as .NET [string] instances.
As a Unicode character, “ has code point U+201C, expressed as 0x201c in UTF-16LE.
Therefore - because in memory all strings are UTF-16LE code units - what you need to replace is [char] 0x201c:
$q1 = [char] 0x201c # “
Get-ChildItem *.csv -Recurse | ForEach-Object {
    (Get-Content $_.FullName) -replace $q1, '""' | Set-Content $_.FullName
}
Note that Set-Content too uses the default character encoding, so the rewritten files will use "ANSI" encoding too - use the -Encoding parameter to change the output encoding, if desired.
Also note the (...) around the Get-Content call, which ensures that the input file is read into memory in full up front, which in turn enables writing back to the same file in the same pipeline.
While this approach is convenient, note that it bears a slight risk of data loss if writing back to the input file is interrupted before completion.
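For instance, a minimal sketch of the same loop with an explicit output encoding (note that in Windows PowerShell, the UTF8 value writes UTF-8 with a BOM):
Get-ChildItem *.csv -Recurse | ForEach-Object {
    (Get-Content $_.FullName) -replace $q1, '""' | Set-Content -Encoding UTF8 $_.FullName
}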
Converting an "ANSI" code point to a Unicode code point
The following shows how an "ANSI" (8-bit) code point such as 0x93 can be converted to its equivalent UTF-16 code point, 0x201c:
# Convert an array of "ANSI" code points (1 byte each) to the UTF-16
# string they represent.
# Note: In Windows PowerShell, [Text.Encoding]::Default contains
# the "ANSI" encoding set by the system locale.
$str = [Text.Encoding]::Default.GetString([byte[]] 0x93) # -> '“'
# Get the UTF-16 code points of the characters making up the string.
$codePoints = [int[]] [char[]] $str
# Format the first and only code point as a hex. number.
'0x{0:x}' -f $codePoints[0] # -> '0x201c'
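Conversely, a minimal sketch of converting the UTF-16 code point back to its "ANSI" byte value (again assuming Windows PowerShell with the Windows-1252 "ANSI" code page, as above):
# Convert the single-character string back to its "ANSI" byte(s).
$bytes = [Text.Encoding]::Default.GetBytes([string] [char] 0x201c)
# Format the first and only byte as a hex. number.
'0x{0:x}' -f $bytes[0] # -> '0x93'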
[1] Writing files with Set-Content, that is; using Out-File / >, by contrast, creates UTF-16LE ("Unicode") files. The cmdlets in Windows PowerShell display a bewildering array of differing encodings: see this answer. Fortunately, PowerShell Core now consistently defaults to (BOM-less) UTF-8.

Powershell Out-file special characters

I have a script that processes data from files and writes results, based on a condition, to a txt file. The given data are strings with words like "Distribución" or "México". When processed, special characters like "é" and "ó" come out broken (the typical white square or question mark).
How can I encode the output file to make it work with those characters? I tried encoding in UTF-8 and UTF-8 without BOM; it doesn't work. Here is the file-writing line:
...| Out-file -encoding XXX .\result.txt
In XXX I tried ASCII and UTF-8; nothing works :/
Out-File in Windows PowerShell will always add a BOM. It's a particularly annoying "feature" of that cmdlet. Unfortunately - to my knowledge - there is no quick way to save a file using UTF-8 WITHOUT a BOM in Windows PowerShell. You can, however, leverage .NET to do this. This isn't really production-ready, but here's a quick example:
$outputPath = "D:\temp.txt"
$data = "Distribución or México"
[System.IO.File]::WriteAllLines($outputPath, $data)
Wrap it in a cmdlet, function, and/or module to make it reusable. Of course, you can take more control over the file encoding with .NET too.
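For instance, a minimal sketch of such a wrapper (the function name Write-Utf8NoBomFile and its parameters are made up here for illustration; it simply forwards to the same .NET call with an explicit BOM-less UTF-8 encoding):
function Write-Utf8NoBomFile {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(ValueFromPipeline)] [string[]] $InputObject
    )
    begin   { $lines = [System.Collections.Generic.List[string]]::new() }
    process { $lines.AddRange($InputObject) }
    end {
        # $false = do NOT emit a BOM
        $encoding = [System.Text.UTF8Encoding]::new($false)
        [System.IO.File]::WriteAllLines($Path, $lines, $encoding)
    }
}
"Distribución or México" | Write-Utf8NoBomFile -Path "D:\temp.txt"
Note that a relative -Path would be resolved against .NET's current directory, which is not necessarily PowerShell's current location, so an absolute path is safest here.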

How to expand file content with powershell

I want to do this:
$content = get-content "test.html"
$template = get-content "template.html"
$template | out-file "out.html"
where template.html contains
<html>
<head>
</head>
<body>
$content
</body>
</html>
and test.html contains:
<h1>Test Expand</h1>
<div>Hello</div>
I get weird characters in the first 2 characters of out.html:
��
and the content is not expanded.
How can I fix this?
To complement Mathias R. Jessen's helpful answer with a solution that:
is more efficient.
ensures that the input files are read as UTF-8, even if they don't have a (pseudo-)BOM (byte-order mark).
avoids the "weird character" problem altogether by writing a UTF-8-encoded output file without that pseudo-BOM.
# Explicitly read the input files as UTF-8, as a whole.
$content = get-content -raw -encoding utf8 test.html
$template = get-content -raw -encoding utf8 template.html
# Write to output file using UTF-8 encoding *without a BOM*.
[IO.File]::WriteAllText(
    "$PWD/out.html",
    $ExecutionContext.InvokeCommand.ExpandString($template)
)
get-content -raw (PSv3+) reads the files in as a whole, into a single string (instead of an array of strings, line by line), which, while more memory-intensive, is faster. With HTML files, memory usage shouldn't be a concern.
An additional advantage of reading the files in full is that if the template were to contain multi-line subexpressions ($(...)), the expansion would still function correctly.
get-content -encoding utf8 ensures that the input files are interpreted as using character encoding UTF-8, as is typical in the web world nowadays.
This is crucial, given that UTF-8-encoded HTML files normally do not have the 3-byte pseudo-BOM that PowerShell needs in order to correctly identify a file as UTF-8-encoded (see below).
A single $ExecutionContext.InvokeCommand.ExpandString() call is then sufficient to perform the template expansion.
Out-File -Encoding utf8 would invariably create a file with the pseudo-BOM, which is undesired.
Instead, [IO.File]::WriteAllText() is used, taking advantage of the fact that the .NET Framework by default creates UTF-8-encoded files without the BOM.
Note the use of $PWD/ before out.html, which is needed to ensure that the file gets written in PowerShell's current location (directory); unfortunately, what the .NET Framework considers the current directory is not in sync with PowerShell's.
Finally, the obligatory security warning: use this expansion technique only on input that you trust, given that arbitrary embedded commands may get executed.
Optional background information
PowerShell's Out-File, > and >> use UTF-16 LE character encoding with a BOM (byte-order mark) by default (the "weird characters", as mentioned).
While Out-File -Encoding utf8 allows creating UTF-8 output files instead,
PowerShell invariably prepends a 3-byte pseudo-BOM to the output file, which some utilities, notably those with Unix heritage, have problems with - so you would still get "weird characters" (albeit different ones).
If you want a more PowerShell-like way of creating BOM-less UTF-8 files,
see this answer of mine, which defines an Out-FileUtf8NoBom function that otherwise emulates the core functionality of Out-File.
Conversely, on reading files, you must use Get-Content -Encoding utf8 to ensure that BOM-less UTF-8 files are recognized as such.
In the absence of the UTF-8 pseudo-BOM, Get-Content assumes that the file uses the single-byte, extended-ASCII encoding specified by the system's legacy codepage (e.g., Windows-1252 on English-language systems, an encoding that PowerShell calls Default).
Note that while Windows-only editors such as Notepad create UTF-8 files with the pseudo-BOM (if you explicitly choose to save as UTF-8; default is the legacy codepage encoding, "ANSI"), increasingly popular cross-platform editors such as Visual Studio Code, Atom, and Sublime Text by default do not use the pseudo-BOM when they create files.
For the "weird characters", they're probably BOMs (Byte-order marks). Specify the output encoding explicitly with the -Encoding parameter when using Out-File, for example:
$Template |Out-File out.html -Encoding UTF8
For the string expansion, you need to explicitly tell PowerShell to do so:
$Template = $Template | ForEach-Object {
    $ExecutionContext.InvokeCommand.ExpandString($_)
}
$Template | Out-File out.html -Encoding UTF8

Does powershell string manipulation add junk characters?

When I place my $string variable's contents into a text file, it seems to add some junk characters. For example, I write a utility name to a string and copy that string to a file as follows.
$CodeCount+="ccount.exe"
$CodeCount | Out-file "C:\CodeCount.bat"
When I execute this batch file, it fails, showing some junk characters in front.
I even tried Trim(), but I still get the same result.
How can I avoid adding junk characters in front?
Don't use Out-File, at least not in its default form. Use Set-Content, or explicitly set the encoding in Out-File, e.g. Out-File -Encoding ascii. Out-File by default uses UTF-16 (UCS-2, actually) encoding, which doesn't get treated as plain text in many applications (including version control systems like Hg and Git) and adds the "funny" characters that you mention.
Note that the redirection operator > is the same as Out-File and will give you the same results.
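Applied to the example from the question, a minimal sketch using Set-Content (which in Windows PowerShell writes "ANSI" text without a BOM, so cmd.exe can run the resulting batch file):
$CodeCount = "ccount.exe"
$CodeCount | Set-Content "C:\CodeCount.bat"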
That's because the default encoding used by Out-File is Unicode (UTF-16LE), which adds BOM bytes at the beginning of the file (you can see these values when you open your file with a hex editor). To avoid these bytes, use -Encoding ASCII:
$CodeCount | Out-file -encoding ascii "C:\CodeCount.bat"
The default encoding of Out-File is little-endian Unicode (UTF-16LE). cmd.exe doesn't work well with this encoding, so just use ASCII:
Out-File -Encoding ASCII