Import-Csv / Export-Csv with German umlauts (ä, ö, ü) - PowerShell

I came across a little issue when dealing with CSV exports that contain mutated vowels like ä, ö, ü (German umlauts).
I simply export with
Get-WinEvent -FilterHashtable @{Path=$_;ID=4627} -ErrorAction SilentlyContinue |export-csv -NoTypeInformation -Encoding Default -Force ("c:\temp\CSV_temp\"+ $_.basename + ".csv")
which works fine; the ä, ö, ü appear correctly in my CSV file.
After that I do a little sorting with:
Get-ChildItem 'C:\temp\*.csv' |
ForEach-Object { Import-Csv $_.FullName } |
Sort-Object { [DateTime]::ParseExact($_.TimeCreated, $pattern, $culture) } |
Export-Csv 'C:\temp\merged.csv' -Encoding Default -NoTypeInformation -Force
I played around with all the encodings (ASCII, BigEndianUnicode, the Unicode variants) with no success.
How can I preserve the special characters ä, ö, ü and others when exporting and sorting?

Mathias R. Jessen provides the crucial pointer in a comment on the question:
It is the Import-Csv call, not Export-Csv, that is the cause of the problem in your case:
Like Export-Csv, Import-Csv too needs to be passed -Encoding Default in order to properly process text files encoded with the system's active "ANSI" legacy code page, which is an 8-bit, single-byte character encoding such as Windows-1252.
In Windows PowerShell, even though the generic text-file processing Get-Content / Set-Content cmdlet pair defaults to Default encoding (as the name suggests), regrettably and surprisingly, Import-Csv and Export-Csv do not.
Note that, on reading, a default encoding is only assumed if the input file has no BOM (byte-order mark, a.k.a. Unicode signature: a magic byte sequence at the start of the file that unambiguously identifies its encoding).
Not only do Import-Csv and Export-Csv have defaults that differ from Get-Content / Set-Content, they individually have different defaults:
Import-Csv defaults to UTF-8.
Export-Csv defaults to ASCII(!), which means that any non-ASCII characters - such as ä, ö, ü - are transliterated to literal ? characters, resulting in loss of data.
By contrast, in PowerShell Core, the cross-platform edition built on .NET Core, the default encoding is (BOM-less) UTF-8, consistently, across all cmdlets, which greatly simplifies matters and makes it much easier to determine when you do need to use the -Encoding parameter.
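Applied to the OP's pipeline, the fix amounts to passing -Encoding Default to Import-Csv as well. A minimal sketch based on the code in the question ($pattern and $culture are assumed to be defined as in the OP's setup):
Get-ChildItem 'C:\temp\*.csv' |
  ForEach-Object { Import-Csv $_.FullName -Encoding Default } |  # read with the system's ANSI code page
  Sort-Object { [DateTime]::ParseExact($_.TimeCreated, $pattern, $culture) } |
  Export-Csv 'C:\temp\merged.csv' -Encoding Default -NoTypeInformation -Force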
Demonstration of the Windows PowerShell Import-Csv / Export-Csv behavior
Import-Csv - defaults to UTF-8:
# Sample CSV content.
$str = @'
Column1
aäöü
'@
# Write sample CSV file 't.csv' using UTF-8 encoding *without a BOM*
# (Note that this cannot be done with standard PowerShell cmdlets.)
$null = new-item -type file t.csv -Force
[io.file]::WriteAllLines((Convert-Path t.csv), $str)
# Use Import-Csv to read the file, which correctly preserves the UTF-8-encoded
# umlauts
Import-Csv .\t.csv
The above yields:
Column1
-------
aäöü
As you can see, the umlauts were correctly preserved.
By contrast, had the file been "ANSI"-encoded ($str | Set-Content t.csv; -Encoding Default implied), the umlauts would have gotten corrupted.
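To see that failure mode for yourself, you could write the same sample content as "ANSI" and read it back with Import-Csv's UTF-8 default (a sketch; the exact replacement characters displayed depend on your console and font):
# Write the sample content with the ANSI code page (Set-Content's default in
# Windows PowerShell), then read it back with Import-Csv's UTF-8 default:
$str | Set-Content t.csv
Import-Csv .\t.csv   # the ä, ö, ü bytes are invalid as UTF-8 and show up as � (U+FFFD)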
Export-Csv - defaults to ASCII - risk of data loss:
Building on the above example:
Import-Csv .\t.csv | Export-Csv .\t.new.csv
Get-Content .\t.new.csv
yields:
"Column1"
"a???"
As you can see, the umlauts were replaced by literal question marks (?).
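Passing an explicit -Encoding to Export-Csv avoids the data loss; a minimal sketch using UTF8 (Default or Unicode would equally preserve these characters):
Import-Csv .\t.csv | Export-Csv .\t.new.csv -Encoding UTF8
Get-Content .\t.new.csv -Encoding UTF8   # "Column1" / "aäöü" - umlauts preserved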

Related

Replace String in a Binary Clipboard Dump from OneNote

I'm using an AHK script to dump the current clipboard contents (a copy of part of a Microsoft OneNote page) to a file.
I would like to modify this binary file to search for a specific string and be able to import it back into AHK.
I tried the following, but it looks like PowerShell is doing something additional to the file (like changing the encoding), and the import of the file into the clipboard is failing.
$ThisFile = 'B:\Users\Desktop\onenote-new-entry.txt'
$data = Get-Content $ThisFile
$data = $data.Replace('asdf','TESTREPLACE!')
$data | Out-File -encoding utf8 $ThisFile
Any suggestions on doing a string replace to the file without changing existing encoding?
I tried manually modifying in a text editor and it works fine. Obviously though I would like to have the modifications be done in mass and automatically which is why I need a script.
The text copied from OneNote and then dumped to the file via AHK is ordinary readable text; note, however, that the clipboard dump file also contains a lot of other binary metadata when opened in an editor.
Since your file is a mix of binary data and UTF-8 text, you cannot use text processing (as you tried with Out-File -Encoding utf8), because the binary data would invariably be interpreted as text too, resulting in its corruption.
PowerShell offers no simple method for editing binary files, but you can solve your problem via an auxiliary "hex string" representation of the file's bytes:
# To compensate for a difference between Windows PowerShell and PowerShell (Core) 7+
# with respect to how byte processing is requested: -Encoding Byte vs. -AsByteStream
$byteEncParam =
if ($IsCoreCLR) { @{ AsByteStream = $true } }
else            { @{ Encoding = 'Byte' } }
# Read the file *as a byte array*.
$ThisFile = 'B:\Users\Desktop\onenote-new-entry.txt'
$data = Get-Content @byteEncParam -ReadCount 0 $ThisFile
# Convert the array to a "hex string" in the form "nn-nn-nn-...",
# where nn represents a two-digit hex representation of each byte,
# e.g. '41-42' for 0x41, 0x42, which, if interpreted as a
# single-byte encoding (ASCII), is 'AB'.
$dataAsHexString = [BitConverter]::ToString($data)
# Define the search and replace strings, and convert them into
# "hex strings" too, using their UTF-8 byte representation.
$search = 'asdf'
$replacement = 'TESTREPLACE!'
$searchAsHexString = [BitConverter]::ToString([Text.Encoding]::UTF8.GetBytes($search))
$replaceAsHexString = [BitConverter]::ToString([Text.Encoding]::UTF8.GetBytes($replacement))
# Perform the replacement.
$dataAsHexString = $dataAsHexString.Replace($searchAsHexString, $replaceAsHexString)
# Convert the modified "hex string" back to a byte[] array.
$modifiedData = [byte[]] ($dataAsHexString -split '-' -replace '^', '0x')
# Save the byte array back to the file.
Set-Content @byteEncParam $ThisFile -Value $modifiedData
Note:
As discussed in the comments, in the case at hand this can only be expected to work if the search and replacement strings are of the same length, because the file also contains metadata denoting the position and length of the embedded text parts. A replacement string of a different length would require adjusting that metadata accordingly (see the sketch after these notes).
The string replacement performed is (a) literal, (b) case-sensitive, and (c) - for accented characters such as é - only works if the input, like string literals in .NET, uses the composed Unicode normalization form, where é is a single code point (and is therefore encoded as a single multi-byte UTF-8 sequence).
More sophisticated replacements, such as regex-based ones, would only be possible if you knew how to split the file data into binary and textual parts, allowing you to operate on the textual parts directly.
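If you want to enforce the same-length requirement from the first note programmatically, a minimal guard could abort before the replacement is attempted (an addition for illustration only, not part of the original code):
# Hypothetical sanity check: abort if the UTF-8 byte lengths differ, since a
# length-changing replacement would invalidate the file's embedded length metadata.
$searchBytes      = [Text.Encoding]::UTF8.GetBytes($search)
$replacementBytes = [Text.Encoding]::UTF8.GetBytes($replacement)
if ($searchBytes.Length -ne $replacementBytes.Length) {
  throw "Search and replacement strings must have the same UTF-8 byte length."
}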
Optional reading: Modifying a UTF-8 file without incidental alterations:
Note:
The following applies to text-only files that are UTF-8-encoded.
Unless extra steps are taken, reading and re-saving such files in PowerShell can result in unwanted incidental changes to the file. Avoiding them is discussed below.
PowerShell never preserves information about the character encoding of an input file, such as one read with Get-Content. Also, unless you use -Raw, information about the specific newline format is lost, as well as whether the file had a trailing newline or not.
Assuming that you know the encoding:
Read the file with Get-Content -Raw and specify the encoding with -Encoding (if necessary). You'll receive the file's content as a single, multi-line .NET string.
Use Set-Content -NoNewLine to save the modified string back to the file, using -Encoding with the original encoding.
Caveat: In Windows PowerShell, -Encoding utf8 invariably creates a UTF-8 file with BOM, unlike in PowerShell (Core) 7+, which defaults to BOM-less UTF-8 and requires you to use -Encoding utf8BOM if you want a BOM.
If you're using Windows PowerShell and do not want a UTF-8 BOM, use $null = New-Item -Force ... as a workaround, and pass the modified string to the -Value parameter.
Therefore:
$ThisFile = 'B:\Users\Desktop\onenote-new-entry.txt'
$data = Get-Content -Raw -Encoding utf8 $ThisFile
$data = $data.Replace('asdf','TESTREPLACE!')
# !! Note the caveat re BOM mentioned above.
$data | Set-Content -NoNewLine -Encoding utf8 $ThisFile
Streamlined reformulation, in a single pipeline:
(Get-Content -Raw -Encoding utf8 $ThisFile) |
ForEach-Object Replace 'asdf', 'TESTREPLACE!' |
Set-Content -NoNewLine -Encoding utf8 $ThisFile
With the New-Item workaround, if the output file mustn't have a BOM:
(Get-Content -Raw -Encoding utf8 $ThisFile) |
ForEach-Object Replace 'asdf', 'TESTREPLACE!' |
New-Item -Force $ThisFile |
Out-Null # suppress New-Item's output (a file-info object)

Issues merging multiple CSV files in PowerShell

I found a nifty command here - http://www.stackoverflow.com/questions/27892957/merging-multiple-csv-files-into-one-using-powershell that I am using to merge CSV files -
Get-ChildItem -Filter *.csv | Select-Object -ExpandProperty FullName | Import-Csv | Export-Csv .\merged\merged.csv -NoTypeInformation -Append
Now this does what it says on the tin and works great for the most part. I have 2 issues with it however, and I am wondering if there is a way they can be overcome:
Firstly, the merged csv file has CRLF line endings, and I am wondering how I can make the line endings just LF, as the file is being generated?
Also, it looks like there are some shenanigans with quote marks being added/moved around. As an example:
Sample row from initial CSV:
"2021-10-05"|"00:00"|"1212"|"160477"|"1.00"|"3.49"LF
Same row in the merged CSV:
"2021-10-05|""00:00""|""1212""|""160477""|""1.00""|""3.49"""CRLF
So you can see that the first field has lost its closing quote, other fields have doubled quotes, and the end of the row has an additional quote. I'm not quite sure what is going on here, so any help would be much appreciated!
For dealing with the quotes, the cause of the “problem” is that your CSV does not use the default field delimiter that Import-CSV assumes - the C in CSV stands for comma, and you’re using the vertical bar. Add the parameter -Delimiter "|" to both the Import-CSV and Export-CSV cmdlets.
I don’t think you can do anything about the line-end characters (CRLF vs LF); that’s almost certainly operating-system dependent.
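In terms of the merge command from the question, that would look roughly like this (a sketch; paths are the OP's):
Get-ChildItem -Filter *.csv |
  Select-Object -ExpandProperty FullName |
  Import-Csv -Delimiter '|' |
  Export-Csv .\merged\merged.csv -Delimiter '|' -NoTypeInformation -Append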
Jeff Zeitlin's helpful answer explains the quote-related part of your problem well.
As for your line-ending problem:
As of PowerShell 7.2, there are no PowerShell-native features that allow you to control the newline format of file-writing cmdlets such as Export-Csv.
However, if you use plain-text processing, you can use multi-line strings built with the newline format of interest and save / append them with Set-Content and its -NoNewLine switch, which writes the input strings as-is, without a (newline) separator.
In fact, to significantly speed up processing in your case, plain-text handling is preferable, since in essence your operation amounts to concatenating text files, the only twist being that the header lines of all but the first file should be skipped; using plain-text handling also bypasses your quote problem:
$tokenCount = 1
Get-ChildItem -Filter *.csv |
Get-Content -Raw |
ForEach-Object {
# Get the file content and replace CRLF with LF.
# Include the first line (the header) only for the first file.
$content = ($_ -split '\r?\n', $tokenCount)[-1].Replace("`r`n", "`n")
$tokenCount = 2 # Subsequent files should have their header ignored.
# Make sure that each file content ends in a LF
if (-not $content.EndsWith("`n")) { $content += "`n" }
# Output the modified content.
$content
} |
Set-Content -NoNewLine ./merged/merged.csv # add -Encoding as needed.

How can I keep UNIX LF line endings?

I have a large (9 GiB), ASCII encoded, pipe delimited file with UNIX-style line endings; 0x0A.
I want to sample the first 100 records into a file for investigation. The following will produce 100 records (1 header record and 99 data records). However, it changes the line endings to DOS/Windows style: CRLF, 0x0D0A.
Get-Content -Path .\wellmed_hce_elig_20191223.txt |
Select-Object -first 100 |
Out-File -FilePath .\elig.txt -Encoding ascii
I know about iconv, recode, and dos2unix. Those programs are not on my system and are not permitted to be installed. I have searched and found a number of places on how to get to CRLF. I have not found anything on getting to or keeping LF.
How can I produce the file with LF line endings instead of CRLF?
To complement Theo's helpful answer with a performance optimization based on the little-used -ReadCount parameter:
Set-Content -NoNewLine -Encoding ascii .\outfile.txt -Value (
  ((Get-Content -First 100 -ReadCount 100 .\file.txt) -join "`n") + "`n"
)
-First 100 instructs Get-Content to read (at most) 100 lines.
-ReadCount 100 causes these 100 lines to be read and emitted together, as an array, which speeds up reading and subsequent processing.
Note: In PowerShell [Core] v7.0+ you can use shorthand -ReadCount 0 in combination with -First <n> to mean: read the requested <n> lines as a single array; due to a bug in earlier versions, including Windows PowerShell, -ReadCount 0 always reads the entire file, even in the presence of -First (aka -TotalCount aka -Head).
Also, even as of PowerShell [Core] 7.0.0-rc.2 (current as of this writing), combining -ReadCount 0 with -Last <n> (aka -Tail) should be avoided (for now): while output produced is correct, behind the scenes it is again the whole file that is read; see this GitHub issue.
Note the + "`n", which ensures that the output file will have a trailing newline as well (which text files in the Unix world are expected to have).
While the above also works with -Last <n> (-Tail <n>) to extract from the end of the file, Theo's (slower) Select-Object solution offers more flexibility with respect to extracting arbitrary ranges of lines, thanks to available parameters -Skip, -SkipLast, and -Index; however, offering these parameters also directly on Get-Content for superior performance is being proposed in this GitHub feature request.
Also note that I've used Set-Content instead of Out-File.
If you know you're writing text, Set-Content is sufficient and generally faster (though in this case this won't matter, given that the data to write is passed as a single value).
For a comprehensive overview of the differences between Set-Content and Out-File / >, see this answer.
Set-Content vs. Out-File benchmark:
Note: This benchmark compares the two cmdlets with respect to writing many input strings received via the pipeline to a file.
# Sample array of 100,000 lines.
$arr = (, 'foooooooooooooooooooooo') * 1e5
# Time writing the array lines to a file, first with Set-Content, then
# with Out-File.
$file = [IO.Path]::GetTempFileName()
{ $arr | Set-Content -Encoding Ascii $file },
{ $arr | Out-File -Encoding Ascii $file } | % { (Measure-Command $_).TotalSeconds }
Remove-Item $file
Sample timing in seconds from my Windows 10 VM with Windows PowerShell v5.1:
2.6637108 # Set-Content
5.1850954 # Out-File; took almost twice as long.
You could join the lines from the Get-Content cmdlet with the Unix "`n" newline and save that.
Something like
((Get-Content -Path .\wellmed_hce_elig_20191223.txt |
Select-Object -first 100) -join "`n") |
Out-File -FilePath .\elig.txt -Encoding ascii -NoNewLine

Getting extra text in PowerShell Export-Csv file

I have an array of objects in PowerShell. It was working, but now when I do an Export-Csv on the array, its property names and values are transformed like:
Account_No -> +ACI-Account+AF8-No+ACI-
Does anyone know why it is doing this?
Thanks
I am using PS 5.1, and the command is:
$rowsWithErrs | Export-Csv -Path $rowErrCsvPath -NoTypeInformation -Encoding UTF7
It looks like there isn't anything wrong with what you are doing. Everything is getting sent out in the format that you are expecting.
The only problem is that the application that you are using to view your data is not using the same encoding that was used to write the data.
The extra characters are what you see when text that was encoded as UTF-7 is interpreted as UTF-8 (or another UTF-8-compatible encoding, which is the standard on most systems).
example
> "Account_No" | Out-File -FilePath test.txt -Encoding UTF7
> Get-Content test.txt -Encoding UTF8
Account+AF8-No
> Get-Content test.txt -Encoding UTF7
Account_No
If reading the CSV data in PowerShell, you can do the following:
> $csv = Import-Csv -FilePath $filepath -Encoding UTF7
If reading the CSV data in Excel, on the Data tab select From Text/CSV; then, at the top of the import window, select File Origin: 65000: Unicode (UTF-7).
For other applications like VS Code or Notepad++ you may be out of luck if you want to view the data there because it looks like they do not support UTF-7 encoding.
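If you have control over the file, an alternative worth considering (a sketch, not part of the original answer; $utf8Path is a hypothetical output path) is to re-save the data with a widely supported encoding such as UTF-8, so that any editor can display it:
# Round-trip the UTF-7 CSV through Import-Csv / Export-Csv to produce a UTF-8 copy.
Import-Csv -Path $filepath -Encoding UTF7 |
  Export-Csv -Path $utf8Path -NoTypeInformation -Encoding UTF8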

Output ("echo") a variable to a text file

I'm running a PowerShell script against many servers, and it is logging output to a text file.
I'd like to capture the server the script is currently running on. So far I have:
$file = "\\server\share\file.txt"
$computername = $env:computername
$computername | Add-Content -Path $file
This last line adds question marks in the output file. Oops.
How do I output a variable to a text file in PowerShell?
The simplest Hello World example...
$hello = "Hello World"
$hello | Out-File c:\debug.txt
Note: The answer below is written from the perspective of Windows PowerShell.
However, it applies to the cross-platform PowerShell (Core) v6+ as well, except that the latter - commendably - consistently defaults to BOM-less UTF-8 as the character encoding, which is the most widely compatible one across platforms and cultures.
To complement bigtv's helpful answer with a more concise alternative and background information:
# > $file is effectively the same as | Out-File $file
# Objects are written the same way they display in the console.
# Default character encoding is UTF-16LE (mostly 2 bytes per char.), with BOM.
# Use Out-File -Encoding <name> to change the encoding.
$env:computername > $file
# Set-Content calls .ToString() on each object to output.
# Default character encoding is "ANSI" (culture-specific, single-byte).
# Use Set-Content -Encoding <name> to change the encoding.
# Use Set-Content rather than Add-Content; the latter is for *appending* to a file.
$env:computername | Set-Content $file
When outputting to a text file, you have 2 fundamental choices that use different object representations and, in Windows PowerShell (as opposed to PowerShell Core), also employ different default character encodings:
Out-File (or >) / Out-File -Append (or >>):
Suitable for output objects of any type, because PowerShell's default output formatting is applied to the output objects.
In other words: you get the same output as when printing to the console.
The default encoding, which can be changed with the -Encoding parameter, is Unicode, i.e. UTF-16LE, in which most characters are encoded as 2 bytes. The advantage of a Unicode encoding such as UTF-16LE is that it is a global alphabet, capable of encoding all characters from all human languages.
In PSv5.1+, you can change the encoding used by > and >>, via the $PSDefaultParameterValues preference variable, taking advantage of the fact that > and >> are now effectively aliases of Out-File and Out-File -Append. To change to UTF-8 (invariably with a BOM, in Windows PowerShell), for instance, use:
$PSDefaultParameterValues['Out-File:Encoding']='UTF8'
Set-Content / Add-Content:
For writing strings and instances of types known to have meaningful string representations, such as the .NET primitive data types (Booleans, integers, ...).
The .psobject.ToString() method is called on each output object, which results in meaningless representations for types that don't explicitly implement a meaningful string representation; [hashtable] instances are an example:
@{ one = 1 } | Set-Content t.txt writes literal System.Collections.Hashtable to t.txt, which is the result of @{ one = 1 }.ToString().
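For contrast, piping the same hashtable to Out-File applies PowerShell's default output formatting, so the file contains the table view you would see in the console (a quick illustration):
@{ one = 1 } | Out-File t.txt   # t.txt contains the formatted Name/Value table,
                                # the same output you'd see in the console.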
The default encoding, which can be changed with the -Encoding parameter, is Default, which is the system's active ANSI code page, i.e. the single-byte culture-specific legacy encoding for non-Unicode applications, which is most commonly Windows-1252.
Note that the documentation currently incorrectly claims that ASCII is the default encoding.
Note that Add-Content's purpose is to append content to an existing file, and it is only equivalent to Set-Content if the target file doesn't exist yet.
If the file exists and is nonempty, Add-Content tries to match the existing encoding.
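A quick illustration of that difference (hypothetical file name):
'first line'  | Set-Content t.txt   # creates or truncates t.txt
'second line' | Add-Content t.txt   # appends, matching the existing file's encoding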
Out-File / > / Set-Content / Add-Content all act culture-sensitively, i.e., they produce representations suitable for the current culture (locale), if available (though custom formatting data is free to define its own, culture-invariant representation - see Get-Help about_format.ps1xml).
This contrasts with PowerShell's string expansion (string interpolation in double-quoted strings), which is culture-invariant - see this answer of mine.
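For instance, on a system whose current culture uses a decimal comma (say, de-DE), the two behaviors differ as follows (a sketch illustrating the claim above):
1.5 > t.txt        # culture-sensitive formatting: the file contains "1,5"
"$(1.5)" > t.txt   # string interpolation is culture-invariant: the file contains "1.5"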
As for performance:
Since Set-Content doesn't have to apply default formatting to its input, it performs better, and therefore is the preferred choice if your input is composed of strings and/or of objects whose default stringification via the standard .NET .ToString() method is sufficient.
As for the OP's symptom with Add-Content:
Since $env:COMPUTERNAME cannot contain non-ASCII characters (or verbatim ? characters), Add-Content's addition to the file should not result in ? characters, and the likeliest explanation is that the ? instances were part of the preexisting content in output file $file, which Add-Content appended to.
After some trial and error, I found that
$computername = $env:computername
works to get a computer name, but sending $computername to a file via Add-Content doesn't work.
I also tried $computername.Value.
Instead, if I use
$computername = get-content env:computername
I can send it to a text file using
$computername | Out-File $file
Your sample code seems to be OK. Thus, the root problem needs to be dug up somehow. Let's eliminate the chance of typos in the script. First off, make sure you put Set-StrictMode -Version 2.0 at the beginning of your script. This will help you to catch misspelled variable names. Like so,
# Test.ps1
set-strictmode -version 2.0 # Comment this line and no error will be reported.
$foo = "bar"
set-content -path ./test.txt -value $fo # Error! Should be "$foo"
PS C:\temp> .\test.ps1
The variable '$fo' cannot be retrieved because it has not been set.
At C:\temp\test.ps1:3 char:40
+ set-content -path ./test.txt -value $fo <<<<
+ CategoryInfo : InvalidOperation: (fo:Token) [], RuntimeException
+ FullyQualifiedErrorId : VariableIsUndefined
The next part about question marks sounds like you have a problem with Unicode. What's the output when you type the file with PowerShell like so,
$file = "\\server\share\file.txt"
cat $file
Here is an easy one:
$myVar > "c:\myfilepath\myfilename.myextension"
You can also try:
Get-content "c:\someOtherPath\someOtherFile.myextension" > "c:\myfilepath\myfilename.myextension"