I have a large (9 GiB), ASCII encoded, pipe delimited file with UNIX-style line endings; 0x0A.
I want to sample the first 100 records into a file for investigation. The following will produce 100 records (1 header record and 99 data records). However, it changes the line endings to DOS/Windows style; CRLF, 0x0D0A.
Get-Content -Path .\wellmed_hce_elig_20191223.txt |
Select-Object -first 100 |
Out-File -FilePath .\elig.txt -Encoding ascii
I know about iconv, recode, and dos2unix. Those programs are not on my system and are not permitted to be installed. I have searched and found a number of places on how to get to CRLF. I have not found anything on getting to or keeping LF.
How can I produce the file with LF line endings instead of CRLF?
To complement Theo's helpful answer with a performance optimization based on the little-used -ReadCount parameter:
Set-Content -NoNewLine -Encoding ascii .\outfile.txt -Value (
  ((Get-Content -First 100 -ReadCount 100 .\file.txt) -join "`n") + "`n"
)
-First 100 instructs Get-Content to read (at most) 100 lines.
-ReadCount 100 causes these 100 lines to be read and emitted together, as an array, which speeds up reading and subsequent processing.
Note: In PowerShell [Core] v7.0+ you can use shorthand -ReadCount 0 in combination with -First <n> to mean: read the requested <n> lines as a single array; due to a bug in earlier versions, including Windows PowerShell, -ReadCount 0 always reads the entire file, even in the presence of -First (aka -TotalCount aka -Head).
Also, even as of PowerShell [Core] 7.0.0-rc.2 (current as of this writing), combining -ReadCount 0 with -Last <n> (aka -Tail) should be avoided (for now): while output produced is correct, behind the scenes it is again the whole file that is read; see this GitHub issue.
Note the + "`n", which ensures that the output file will have a trailing newline as well (which text files in the Unix world are expected to have).
While the above also works with -Last <n> (-Tail <n>) to extract from the end of the file, Theo's (slower) Select-Object solution offers more flexibility with respect to extracting arbitrary ranges of lines, thanks to available parameters -Skip, -SkipLast, and -Index; however, offering these parameters also directly on Get-Content for superior performance is being proposed in this GitHub feature request.
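For example, to extract lines 11 through 20 with LF endings using that more flexible Select-Object approach (an illustrative range; file names as in the example above):

Set-Content -NoNewLine -Encoding ascii .\range.txt -Value (
  ((Get-Content .\file.txt | Select-Object -Skip 10 -First 10) -join "`n") + "`n"
)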
Also note that I've used Set-Content instead of Out-File.
If you know you're writing text, Set-Content is sufficient and generally faster (though in this case this won't matter, given that the data to write is passed as a single value).
For a comprehensive overview of the differences between Set-Content and Out-File / >, see this answer.
Set-Content vs. Out-File benchmark:
Note: This benchmark compares the two cmdlets with respect to writing many input strings received via the pipeline to a file.
# Sample array of 100,000 lines.
$arr = (, 'foooooooooooooooooooooo') * 1e5
# Time writing the array lines to a file, first with Set-Content, then
# with Out-File.
$file = [IO.Path]::GetTempFileName()
{ $arr | Set-Content -Encoding Ascii $file },
{ $arr | Out-File -Encoding Ascii $file } | % { (Measure-Command $_).TotalSeconds }
Remove-Item $file
Sample timing in seconds from my Windows 10 VM with Windows PowerShell v5.1:
2.6637108 # Set-Content
5.1850954 # Out-File; took almost twice as long.
You could join the lines from the Get-Content cmdlet with the Unix "`n" newline and save that.
Something like
((Get-Content -Path .\wellmed_hce_elig_20191223.txt |
    Select-Object -First 100) -join "`n") |
  Out-File -FilePath .\elig.txt -Encoding ascii -NoNewLine
I found a nifty command here - http://www.stackoverflow.com/questions/27892957/merging-multiple-csv-files-into-one-using-powershell that I am using to merge CSV files -
Get-ChildItem -Filter *.csv | Select-Object -ExpandProperty FullName | Import-Csv | Export-Csv .\merged\merged.csv -NoTypeInformation -Append
Now this does what it says on the tin and works great for the most part. I have 2 issues with it however, and I am wondering if there is a way they can be overcome:
Firstly, the merged csv file has CRLF line endings, and I am wondering how I can make the line endings just LF, as the file is being generated?
Also, it looks like there are some shenanigans with quote marks being added/moved around. As an example:
Sample row from initial CSV:
"2021-10-05"|"00:00"|"1212"|"160477"|"1.00"|"3.49"LF
Same row in the merged CSV:
"2021-10-05|""00:00""|""1212""|""160477""|""1.00""|""3.49"""CRLF
So see that the first field has lost its closing quote, other fields have doubled quotes, and the end of the row has an additional quote. I'm not quite sure what is going on here, so any help would be much appreciated!
For dealing with the quotes, the cause of the “problem” is that your CSV does not use the default field delimiter that Import-CSV assumes - the C in CSV stands for comma, and you’re using the vertical bar. Add the parameter -Delimiter "|" to both the Import-CSV and Export-CSV cmdlets.
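For example, the merge command from the question would then look something like this (a sketch based on the file layout shown above):

Get-ChildItem -Filter *.csv | Select-Object -ExpandProperty FullName |
  Import-Csv -Delimiter '|' |
  Export-Csv .\merged\merged.csv -Delimiter '|' -NoTypeInformation -Append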
I don’t think you can do anything about the line-end characters (CRLF vs LF); that’s almost certainly operating-system dependent.
Jeff Zeitlin's helpful answer explains the quote-related part of your problem well.
As for your line-ending problem:
As of PowerShell 7.2, there are no PowerShell-native features that allow you to control the newline format of file-writing cmdlets such as Export-Csv.
However, if you use plain-text processing, you can use multi-line strings built with the newline format of interest and save / append them with Set-Content and its -NoNewLine switch, which writes the input strings as-is, without a (newline) separator.
In fact, to significantly speed up processing in your case, plain-text handling is preferable, since in essence your operation amounts to concatenating text files, the only twist being that the header lines of all but the first file should be skipped; using plain-text handling also bypasses your quote problem:
$tokenCount = 1
Get-ChildItem -Filter *.csv |
  Get-Content -Raw |
  ForEach-Object {
    # Get the file content and replace CRLF with LF.
    # Include the first line (the header) only for the first file.
    $content = ($_ -split '\r?\n', $tokenCount)[-1].Replace("`r`n", "`n")
    $tokenCount = 2 # Subsequent files should have their header ignored.
    # Make sure that each file's content ends in a LF.
    if (-not $content.EndsWith("`n")) { $content += "`n" }
    # Output the modified content.
    $content
  } |
  Set-Content -NoNewLine ./merged/merged.csv # add -Encoding as needed.
I am doing some file clean up before loading into my data warehouse and have run into a file sizing issue:
(Get-Content -path C:\Workspace\workfile\myfile.txt -Raw) -replace '\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
My file is about 2GB. I am receiving the following error and not sure how to correct.
Get-Content : Exception of type 'System.OutOfMemoryException' was
thrown, ........
I am NOT a coder, but I do like learning so am building my own data warehouse. So if you do respond, keep my experience level in mind :)
A performant way of reading a text file line by line - without loading the entire file into memory - is to use a switch statement with the -File parameter.
A performant way of writing a text file is to use a System.IO.StreamWriter instance.
As Mathias points out in his answer, using verbatim \" with the regex-based -replace operator actually replaces " alone, due to the escaping rules of regexes. While you could address that with '\\"', in this case a simpler and better-performing alternative is to use the [string] type's Replace() method, which operates on literal substrings.
To put it all together:
# Note: Be sure to use a *full* path, because .NET's working dir. usually
# differs from PowerShell's.
$streamWriter = [System.IO.StreamWriter]::new('C:\Workspace\workfile\myfileCLEAN.txt')
switch -File C:\Workspace\workfile\myfile.txt {
    default { $streamWriter.WriteLine($_.Replace('\"', '"')) }
}
$streamWriter.Close()
Note: If you're using an old version of Windows PowerShell, namely version 4 or below, use
New-Object System.IO.StreamWriter 'C:\Workspace\workfile\myfileCLEAN.txt'
instead of
[System.IO.StreamWriter]::new('C:\Workspace\workfile\myfileCLEAN.txt')
Get-Content -Raw makes PowerShell read the entire file into a single string.
.NET can't store individual objects over 2GB in size in memory, and each character in a string takes up 2 bytes, so after reading the first ~1 billion characters (roughly equivalent to a 1GB ASCII-encoded text file), it reaches the memory limit.
Remove the -Raw switch; -replace is perfectly capable of operating on multiple input strings at once:
(Get-Content -path C:\Workspace\workfile\myfile.txt) -replace '\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
Beware that -replace is a regex operator, and if you want to remove \ from a string, you need to escape it:
(Get-Content -path C:\Workspace\workfile\myfile.txt) -replace '\\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
While this will work, it'll still be slow due to the fact that we're still loading >2GB of data into memory before applying -replace and writing to the output file.
Instead, you might want to pipe the output from Get-Content to the ForEach-Object cmdlet:
Get-Content -path C:\Workspace\workfile\myfile.txt | ForEach-Object {
    $_ -replace '\\"','"'
} | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
This allows Get-Content to start pushing output prior to finishing reading the file, and PowerShell therefore no longer needs to allocate as much memory as before, resulting in faster execution.
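If you want to verify the speed-up on your own data, a quick sketch using Measure-Command (paths as in the question; time the -Raw variant the same way for comparison):

Measure-Command {
  Get-Content -path C:\Workspace\workfile\myfile.txt | ForEach-Object {
    $_ -replace '\\"','"'
  } | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
}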
Get-Content loads the whole file into memory.
Try processing line by line to improve memory utilization.
$infile = "C:\Workspace\workfile\myfile.txt"
$outfile = "C:\Workspace\workfile\myfileCLEAN.txt"

# Stream the input one line at a time instead of loading the whole file,
# appending each cleaned-up line to the output file.
foreach ($line in [System.IO.File]::ReadLines($infile)) {
    Add-Content -Path $outfile -Value ($line -replace '\\"','"')
}
I came across a little issue when dealing with CSV exports which contain mutated vowels like ä, ö, ü (German-language umlauts).
I simply export with
Get-WinEvent -FilterHashtable @{Path=$_;ID=4627} -ErrorAction SilentlyContinue | Export-Csv -NoTypeInformation -Encoding Default -Force ("c:\temp\CSV_temp\" + $_.basename + ".csv")
which works fine. I have the ä, ö, ü in my CSV file correctly.
After that I do a little sorting with:
Get-ChildItem 'C:\temp\*.csv' |
ForEach-Object { Import-Csv $_.FullName } |
Sort-Object { [DateTime]::ParseExact($_.TimeCreated, $pattern, $culture) } |
Export-Csv 'C:\temp\merged.csv' -Encoding Default -NoTypeInformation -Force
I played around with all the encodings (ASCII, BigEndianUnicode, the Unicode variants) with no success.
How can I preserve the special characters ä, ö, ü and others when exporting and sorting?
Mathias R. Jessen provides the crucial pointer in a comment on the question:
It is the Import-Csv call, not Export-Csv, that is the cause of the problem in your case:
Like Export-Csv, Import-Csv too needs to be passed -Encoding Default in order to properly process text files encoded with the system's active "ANSI" legacy code page, which is an 8-bit, single-byte character encoding such as Windows-1252.
In Windows PowerShell, even though the generic text-file processing Get-Content / Set-Content cmdlet pair defaults to Default encoding (as the name suggests), regrettably and surprisingly, Import-Csv and Export-Csv do not.
Note that, on reading, a default encoding is only assumed if the input file has no BOM (byte-order mark, a.k.a. Unicode signature: a magic byte sequence at the start of the file that unambiguously identifies the file's encoding).
Not only do Import-Csv and Export-Csv have defaults that differ from Get-Content / Set-Content, they individually have different defaults:
Import-Csv defaults to UTF-8.
Export-Csv defaults to ASCII(!), which means that any non-ASCII characters, such as ä, ö, ü, are transliterated to literal ? characters, resulting in loss of data.
By contrast, in PowerShell Core, the cross-platform edition built on .NET Core, the default encoding is (BOM-less) UTF-8, consistently, across all cmdlets, which greatly simplifies matters and makes it much easier to determine when you do need to use the -Encoding parameter.
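Applied to your sorting command, the fix is therefore to pass -Encoding Default to the Import-Csv call as well (Windows PowerShell; $pattern and $culture as defined in your session):

Get-ChildItem 'C:\temp\*.csv' |
  ForEach-Object { Import-Csv $_.FullName -Encoding Default } |
  Sort-Object { [DateTime]::ParseExact($_.TimeCreated, $pattern, $culture) } |
  Export-Csv 'C:\temp\merged.csv' -Encoding Default -NoTypeInformation -Force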
Demonstration of the Windows PowerShell Import-Csv / Export-Csv behavior
Import-Csv - defaults to UTF-8:
# Sample CSV content.
$str = @'
Column1
aäöü
'@
# Write sample CSV file 't.csv' using UTF-8 encoding *without a BOM*
# (Note that this cannot be done with standard PowerShell cmdlets.)
$null = new-item -type file t.csv -Force
[io.file]::WriteAllLines((Convert-Path t.csv), $str)
# Use Import-Csv to read the file, which correctly preserves the UTF-8-encoded
# umlauts
Import-Csv .\t.csv
The above yields:
Column1
-------
aäöü
As you can see, the umlauts were correctly preserved.
By contrast, had the file been "ANSI"-encoded ($str | Set-Content t.csv; -Encoding Default implied), the umlauts would have gotten corrupted.
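To see that corruption for yourself, a quick sketch (file name t.ansi.csv is just for illustration):

# Write the same sample with the ANSI code page (Set-Content's default
# in Windows PowerShell), then read it back with Import-Csv's UTF-8 default:
$str | Set-Content t.ansi.csv
Import-Csv .\t.ansi.csv   # ä, ö, ü come out garbled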
Export-Csv - defaults to ASCII - risk of data loss:
Building on the above example:
Import-Csv .\t.csv | Export-Csv .\t.new.csv
Get-Content .\t.new.csv
yields:
"Column1"
"a???"
As you can see, the umlauts were replaced by literal question marks (?).
I am trying to remove start and end spaces in column data in a CSV file. I've got a solution that removes all spaces in the CSV, but it creates non-readable text in the description column.
Get-Content -Path test.csv | ForEach-Object { $_.Trim() -replace "\s+" } | Out-File -FilePath out.csv -Encoding ascii
e.g.
'192.168.1.2' ' test-1-TEST' 'Ping Down at least 1 min' '3/11/2017' 'Unix Server' 'Ping' 'critical'
'192.168.1.3' ' test-2-TEST' ' Ping Down at least 3 min' '3/11/2017' 'windows Server' 'Ping' 'critical'
I only want to remove space only from ' test-1-TEST' and not from 'Ping Down at least 1 min'. Is this possible?
"IP","ServerName","Status","Date","ServerType","Test","State"
"192.168.1.2"," test-1-TEST","Ping Down at least 1 min","3/11/2017","Unix Server","Ping","critical"
"192.168.1.3"," test-2-TEST"," Ping Down at least 3 min","3/11/2017","windows Server","Ping","critical"
For the example file above:
Import-Csv C:\folder\file.csv | ForEach-Object {
    $_.ServerName = $_.ServerName.Trim()
    $_
} | Export-Csv C:\folder\file2.csv -NoTypeInformation
Replace ServerName with the name of the Column you want to remove spaces from (aka trim).
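If several columns might carry stray spaces, a variant (a sketch, not part of the original answer) that trims every property of each row:

Import-Csv C:\folder\file.csv | ForEach-Object {
    foreach ($prop in $_.PSObject.Properties) {
        $prop.Value = $prop.Value.Trim()
    }
    $_
} | Export-Csv C:\folder\file2.csv -NoTypeInformation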
If your CSV does not have a header (which means it's not a true CSV) and/or you want to better preserve the original file structure and formatting, you could try to expand on your regex a little.
(Get-Content c:\temp\test.txt -Raw) -replace "(?<=')\s+(?=[^' ])" -replace "(?<=[^' ])\s+(?=')"
That should remove all leading and trailing spaces inside the quoted values, but not the delimiters themselves.
Read the file in as one string. That could be a bad idea depending on file size, but it is not required: the file can just as well be read line by line with the same transformation, achieving the same result. The approach uses two similar replacements: the first looks for spaces that come after a single quote but are not followed by another quote or space; the second looks for spaces before a quote that are not preceded by a quote or space.
Just wanted to give a regex example. You can look into it in more detail, with an explanation, at regex101.com. There you will see an alternation pattern used instead of two separate replacements:
(Get-Content c:\temp\test.txt -Raw) -replace "(?<=')\s+(?=[^' ])|(?<=[^' ])\s+(?=')"
The first example is a little easier on the eyes.
I was having trouble replicating this consistently, but if you find that the replacement also removes newlines, you can do it one line at a time instead, which should work as well:
(Get-Content c:\temp\test.txt) | ForEach-Object {
    $_ -replace "(?<=')\s+(?=[^' ])|(?<=[^' ])\s+(?=')"
} | Set-Content c:\temp\test.txt
I am trying to extract each line starting with "%%" in all files in a folder and then copy those lines to a separate text file. Currently using this code in PowerShell code, but I am not getting any results.
$files = Get-ChildItem "folder" -Filter *.txt
foreach ($file in $files)
{
    if ($_ -like "*%%*")
    {
        Set-Content "Output.txt"
    }
}
I think that mklement0's suggestion to use Select-String is the way to go. Adding to his answer, you can pipe the output of Get-ChildItem into Select-String so that the entire process becomes a PowerShell one-liner.
Something like this:
Get-ChildItem "folder" -Filter *.txt | Select-String -Pattern '^%%' | Select -ExpandProperty line | Set-Content "Output.txt"
The Select-String cmdlet offers a much simpler solution (PSv3+ syntax):
(Select-String -Path folder\*.txt -Pattern '^%%').Line | Set-Content Output.txt
Select-String accepts a filename/path pattern via its -Path parameter, so, in this simple case, there is no need for Get-ChildItem.
If, by contrast, your input file selection is recursive or uses more complex criteria, you can pipe Get-ChildItem's output to Select-String, as demonstrated in Dave Sexton's helpful answer.
Note that, according to the docs, Select-String by default assumes that the input files are UTF-8-encoded, but you can change that with the -Encoding parameter; also consider the output encoding discussed below.
Select-String's -Pattern parameter expects a regular expression rather than a wildcard expression.
^%% only matches literal %% at the start (^) of a line.
Select-String outputs [Microsoft.PowerShell.Commands.MatchInfo] objects that contain information about each match; each object's .Line property contains the full text of an input line that matched.
Set-Content Output.txt sends all matching lines to a single output file, Output.txt.
In Windows PowerShell, Set-Content uses the system's legacy Windows code page (an 8-bit, single-byte encoding), even though the documentation mistakenly claims that ASCII files are produced.
If you want to control the output encoding explicitly, use the -Encoding parameter; e.g., ... | Set-Content Output.txt -Encoding Utf8.
By contrast, >, the output redirection operator always creates UTF-16LE files (an encoding PowerShell calls Unicode), as does Out-File by default (which can be changed with -Encoding).
Also note that > / Out-File apply PowerShell's default formatting to the input objects to obtain the string representation to write to the output file, whereas Set-Content treats the input as strings (calls .ToString() on input objects, if necessary). In the case at hand, since all input objects are already strings, there is no difference (except for the character encoding, potentially).
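To see that formatting difference yourself, compare both cmdlets on a non-string object (illustrative file names of.txt / sc.txt):

Get-Item . | Out-File of.txt      # writes the formatted (table-style) view
Get-Item . | Set-Content sc.txt   # writes the object's .ToString(), i.e. its path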
As for what you've tried:
$_ inside your foreach ($file in $files) refers to a file (a [System.IO.FileInfo] object), so you're effectively evaluating your wildcard expression *%%* against the input file's name rather than its contents.
Aside from that, wildcard pattern *%%* will match %% anywhere in the input string, not just at its start (you'd have to use %%* instead).
The Set-Content "Output.txt" call is missing input, because it is not part of a pipeline and, in the absence of pipeline input, no -Value argument was passed.
Even if you did provide input, however, output file Output.txt would get rewritten as a whole in each iteration of your foreach loop.
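For completeness, here is what a corrected version of your loop could look like (a sketch only; the Select-String solutions above remain preferable):

$files = Get-ChildItem "folder" -Filter *.txt
$matchingLines = foreach ($file in $files) {
    Get-Content $file.FullName | Where-Object { $_ -like '%%*' }
}
Set-Content Output.txt -Value $matchingLines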
First you have to use Get-Content in order to get the content of the file. Then you do the string match, and based on that you set the content back to the file. Use Get-Content and put another loop inside the foreach to iterate over all the lines in the file.
I hope this logic helps you.
ls *.txt | % {
    $f = $_
    gc $f.FullName | % {
        if ($_.StartsWith("%%")) {
            $_ >> Output.txt
        } #end if
    } #end gc
} #end ls
Alias
ls - Get-ChildItem
gc - Get-Content
% - ForEach-Object
$_ - Iterator variable for loop
>> - Redirection construct
# - Comment
http://ss64.com/ps/