Issues merging multiple CSV files in PowerShell

I found a nifty command here - http://www.stackoverflow.com/questions/27892957/merging-multiple-csv-files-into-one-using-powershell - that I am using to merge CSV files:
Get-ChildItem -Filter *.csv | Select-Object -ExpandProperty FullName | Import-Csv | Export-Csv .\merged\merged.csv -NoTypeInformation -Append
Now this does what it says on the tin and works great for the most part. I have 2 issues with it, however, and I am wondering if there is a way they can be overcome:
Firstly, the merged CSV file has CRLF line endings; how can I make the line endings just LF as the file is being generated?
Also, it looks like there are some shenanigans with quote marks being added/moved around. As an example:
Sample row from initial CSV:
"2021-10-05"|"00:00"|"1212"|"160477"|"1.00"|"3.49"LF
Same row in the merged CSV:
"2021-10-05|""00:00""|""1212""|""160477""|""1.00""|""3.49"""CRLF
So you can see that the first field has lost its closing quote, other fields have doubled quotes, and the end of the row has an additional quote. I'm not quite sure what is going on here, so any help would be much appreciated!

For dealing with the quotes, the cause of the “problem” is that your CSV does not use the default field delimiter that Import-Csv assumes - the C in CSV stands for comma, and you’re using the vertical bar. Add the parameter -Delimiter "|" to both the Import-Csv and Export-Csv cmdlets.
I don’t think you can do anything about the line-end characters (CRLF vs LF); that’s almost certainly operating-system dependent.
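Putting the delimiter fix together with the original one-liner, the merge might look like this (a sketch only; the input filter and output path are carried over from the question):

```powershell
# Merge pipe-delimited files; -Delimiter must be given on BOTH cmdlets,
# otherwise each whole line is treated as a single comma-separated field.
Get-ChildItem -Filter *.csv |
  Select-Object -ExpandProperty FullName |
  Import-Csv -Delimiter "|" |
  Export-Csv .\merged\merged.csv -Delimiter "|" -NoTypeInformation -Append
```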

Jeff Zeitlin's helpful answer explains the quote-related part of your problem well.
As for your line-ending problem:
As of PowerShell 7.2, there are no PowerShell-native features that allow you to control the newline format of file-writing cmdlets such as Export-Csv.
However, if you use plain-text processing, you can use multi-line strings built with the newline format of interest and save / append them with Set-Content and its -NoNewLine switch, which writes the input strings as-is, without a (newline) separator.
In fact, to significantly speed up processing in your case, plain-text handling is preferable, since in essence your operation amounts to concatenating text files, the only twist being that the header lines of all but the first file should be skipped; using plain-text handling also bypasses your quote problem:
$tokenCount = 1
Get-ChildItem -Filter *.csv |
  Get-Content -Raw |
  ForEach-Object {
    # Get the file content and replace CRLF with LF.
    # Include the first line (the header) only for the first file.
    $content = ($_ -split '\r?\n', $tokenCount)[-1].Replace("`r`n", "`n")
    $tokenCount = 2 # Subsequent files should have their header ignored.
    # Make sure that each file's content ends in a LF.
    if (-not $content.EndsWith("`n")) { $content += "`n" }
    # Output the modified content.
    $content
  } |
  Set-Content -NoNewLine ./merged/merged.csv # add -Encoding as needed.

Related

How to remove white space using powershell in multiple text file in the same folder?

folder name is: c:\home\alltext\
inside has: 2 text files with different names(each text contents extra whitespace that I want to trim)
text1.txt
text2.txt
I don't want to use Notepad++ and trim each text file one by one, especially if I have more than 2 files.
I tried PowerShell, but it returns both text1 and text2 together in the same one text file.
How can I trim them in one command and return individual txt?
This is my command:
(get-content c:\home\alltext\*.txt).trim() -ne '' | Set-content c:\home\alltext\*.txt
You need to process the input files one by one:
Get-ChildItem c:\home\alltext\*.txt | ForEach-Object {
  Set-Content -LiteralPath $_.FullName -Value (($_ | Get-Content).Trim() -ne '')
}
Note that PowerShell never preserves the original character encoding when reading text files, so you may have to use the -Encoding parameter with Set-Content.
As for what you tried:
(get-content c:\home\alltext\*.txt).trim() -ne '' streams the non-blank lines of all files matching wildcard expression c:\home\alltext\*.txt, across file boundaries.
Perhaps surprisingly, not only does Set-Content's (positionally implied) -Path parameter accept wildcard expressions too, it writes the same content (the stringified versions of whatever input it receives) to whatever files happen to match that wildcard expression.
This problematic behavior is discussed in GitHub issue #6729; unfortunately, it was decided to retain the current behavior.
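As a minimal sketch of the pitfall (using the asker's folder and files):

```powershell
# CAUTION: with a wildcard -Path, Set-Content writes the SAME combined
# content to EVERY matching file, overwriting each one - which is why
# both text1.txt and text2.txt ended up with identical merged content.
(Get-Content c:\home\alltext\*.txt).Trim() -ne '' |
  Set-Content c:\home\alltext\*.txt
```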

How to compare file on local PC and github?

I have a file on my PC called test.ps1
I have a file hosted on my github called test.ps1
both of them have the same contents a string inside them
I am using the following script to try and compare them:
$fileA = Get-Content -Path "C:\Users\User\Desktop\test.ps1"
$fileB = (Invoke-webrequest -URI "https://raw.githubusercontent.com/repo/Scripts/test.ps1")
if(Compare-Object -ReferenceObject $fileA -DifferenceObject ($fileB -split '\r?\n'))
{"files are different"}
Else {"Files are the same"}
echo ""
Write-Host $fileA
echo ""
Write-Host $fileB
however my output is showing the exact same data for both but it says the files are different. The output:
files are different
a string
a string
is there some weird EOL thing going on or something?
tl;dr
# Remove a trailing newline from the downloaded file content
# before splitting into lines.
# Parameter names omitted for brevity.
Compare-Object $fileA ($fileB -replace '\r?\n\z' -split '\r?\n')
If the files are truly identical (save for any character-encoding and newline-format differences, and whether or not the local file has a trailing newline), you'll see no output (because Compare-Object only reports differences by default).
If the lines look the same, it sounds like character encoding is not the problem, though it's worth pointing out that Get-Content in Windows PowerShell, in the absence of a BOM, assumes that a file is ANSI-encoded, so a UTF-8 file without BOM that contains characters outside the ASCII range will be misinterpreted - use -Encoding utf8 to fix that.
Assuming that the files are truly identical (including not having variations in whitespace, such as trailing spaces at the end of lines), the likeliest explanation is that the file being retrieved has a trailing newline, as is typical for text files.
Thus, if the downloaded file has a trailing newline, as is to be expected, if you apply -split '\r?\n' to the multi-line string representing the entire file content in order to split it into lines, you'll end up with an extra, empty array element at the end, which causes Compare-Object to report that element as a difference.
Compare-Object emitting an object is evaluated as $true in the implied Boolean context of your if statement's conditional, which is why files are different is output.
The above -replace operation, -replace '\r?\n\z' (\z matches the very end of a (multi-line) string), compensates for that, by removing the trailing newline before splitting into lines.
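As a minimal sketch of the implied Boolean conversion described above (comparing two in-memory strings rather than files):

```powershell
# Compare-Object only emits objects for differences; any output at all
# makes the if condition evaluate to $true.
$a = 'a string' -split '\r?\n'       # -> @('a string')
$b = "a string`r`n" -split '\r?\n'   # -> @('a string', '') - extra empty element
if (Compare-Object $a $b) { 'files are different' } else { 'Files are the same' }
# The trailing empty element is reported as a difference,
# so this prints 'files are different'.
```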

Powershell: I need to clean a set of csv files, there are an inconsistent number of garbage rows above the headers that must go before import

I have a set of CSV files that I need to import data from. The issue I'm running into is that the number of garbage rows above the header line, and their content, is always different. The header rows themselves are consistent, so I could use that to detect what the starting point should be.
I'm not quite sure where to start, the files are structured as below.
Here there be garbage.
So much garbage, between 12 and 25 lines of it.
Header1,Header2,Header3,Header4,Header5
Data1,Data2,Data3,Data4,Data5
My assumption on the best method to do this would be to do something that checks for the line number of the header row and then a get-content function specifying the starting line number be the result of the preceding check.
Any guidance would be most appreciated.
If the header line is as you say consistent, you could do something like this:
$header = 'Header1,Header2,Header3,Header4,Header5'
# read the file as single multiline string
# and split on the escaped header line
$data = ((Get-Content -Path 'D:\theFile.csv' -Raw) -split [regex]::Escape($header), 2)[1] |
  ConvertFrom-Csv -Header $($header -split ',')
As per your comment, you really only wanted to clean up these files rather than import data from them (even though your question says "I need to import data"). In that case, all you have to do is append this line of code:
$data | Export-Csv -Path 'D:\theFile.csv' -NoTypeInformation
The line ConvertFrom-Csv -Header $($header -split ',') parses the data into an array of objects, (re)using the header line that was taken off by the split.
A pure textual approach (without parsing of the data) still needs to write out the header line, because splitting the file content on it removed it from the resulting string:
$header = 'Header1,Header2,Header3,Header4,Header5'
# read the file as single multiline string
# and split on the escaped header line
$data = ((Get-Content -Path 'D:\theFile.csv' -Raw) -split [regex]::Escape($header), 2)[1]
# rewrite the file with just the header line
$header | Set-Content -Path 'D:\theFile.csv'
# then write all data lines we captured in variable $data
$data | Add-Content -Path 'D:\theFile.csv'
To offer a slightly more concise (and marginally more efficient) alternative to Theo's helpful answer, using the -replace operator:
If you want to import the malformed CSV file directly:
(Get-Content -Raw file.csv) -replace '(?sm)\A.*(?=^Header1,Header2,Header3,Header4,Header5$)' |
ConvertFrom-Csv
If you want to save the cleaned-up data back to the original file (adjust -Encoding as needed):
(Get-Content -Raw file.csv) -replace '(?sm)\A.*(?=^Header1,Header2,Header3,Header4,Header5$)' |
Set-Content -NoNewLine -Encoding utf8 file.csv
Explanation of the regex:
(?sm) sets the following regex options: single-line (s: make . match newlines too) and multi-line (m: make ^ and $ also match the start and end of individual lines inside a multi-line string).
\A.* matches any (possibly empty) text (.*) from the very start (\A) of the input string.
(?=...) is a positive lookahead assertion that matches the enclosed subexpression (symbolized by ... here) without consuming it, i.e. without making it part of what the regex considers the matching part of the string.
^Header1,Header2,Header3,Header4,Header5$ matches the header line of interest, as a full line.

Powershell .txt to CSV Formatting Irregularities

I have a large number of .txt files pulled from pdf and formatted with comma delimiters.
I'm trying to append these text files to one another with a new line between each. Earlier in the formatting process I took multi-line input and formatted it into one line with entries separated by commas.
Yet when appending one txt file to another in a csv the previous formatting with many line breaks returns. So my final output is valid csv, but not representative of each text file being one line of csv entries. How can I ensure the transition from txt to csv retains the formatting of the txt files?
I've used Export-CSV, Add-Content, and the >> operator with similar outcomes.
To summarize, individual .txt files with the following format:
,927,Dance like Misty"," shine like Lupita"," slay like Serena. speak like Viola"," fight like Rosa! ,United States ,16 - 65+
Turn into the following when appended together in a csv file:
,927
,Dance like Misty"," shine like Lupita"," slay like Serena. speak like Viola"," fight like Rosa!
,United States
,16 - 65+
How the data was prepped:
Removing new lines
Foreach($f in $FILES){(Get-Content $f -Raw).Replace("`n","") | Set-Content $f -Force}
Adding one new line to the end of each txt file
foreach($f in $FILES){Add-Content -Path $f -value "`n" |Set-Content $f -Force}
Trying to Convert to CSV, one text file per line with comma delimiter:
cat $FILES | sc csv.csv
Or
foreach($f in $FILES){import-csv $f -delimiter "," | export-csv $f}
Or
foreach($f in $FILES){ Export-Csv -InputObject $f -append -path "test.csv"}
Return csv with each comma separated value on a new line, instead of each txt file as one line.
This was resolved by realizing that even though Notepad was showing no newlines, there were still hidden carriage return characters. On loading the apparently one-line CSV files into Notepad++ and toggling "show hidden characters", this oversight became evident.
By replacing both CR and LF characters before converting to CSV (note that in PowerShell string literals the escapes are `r and `n, not \r and \n):
Foreach ($f in $FILES) { (Get-Content $f -Raw).Replace("`n","").Replace("`r","") |
  Set-Content $f -Force }
The CSV conversion process worked as planned using the following
cat $FILES | sc final.csv
Final verdict --
The text file that appeared to be a one line entry ready to become CSV
,927,Dance like Misty"," shine like Lupita"," slay like Serena. speak like Viola"," fight like Rosa! ,United States ,16 - 65+
Still had carriage return characters between each value. This was made evident by trying another text editor with a "show hidden characters" feature.

Remove start and end spaces in specific csv column

I am trying to remove leading and trailing spaces from column data in a CSV file. I've got a solution that removes all spaces in the CSV, but it creates non-readable text in the description column.
Get-Content –path test.csv| ForEach-Object {$_.Trim() -replace "\s+" } | Out-File -filepath out.csv -Encoding ascii
e.g.
'192.168.1.2' ' test-1-TEST' 'Ping Down at least 1 min' '3/11/2017' 'Unix Server' 'Ping' 'critical'
'192.168.1.3' ' test-2-TEST' ' Ping Down at least 3 min' '3/11/2017' 'windows Server' 'Ping' 'critical'
I only want to remove space only from ' test-1-TEST' and not from 'Ping Down at least 1 min'. Is this possible?
"IP","ServerName","Status","Date","ServerType","Test","State"
"192.168.1.2"," test-1-TEST","Ping Down at least 1 min","3/11/2017","Unix Server","Ping","critical"
"192.168.1.3"," test-2-TEST"," Ping Down at least 3 min","3/11/2017","windows Server","Ping","critical"
For example file above:
Import-Csv C:\folder\file.csv | ForEach-Object {
$_.ServerName = $_.ServerName.trim()
$_
} | Export-Csv C:\folder\file2.csv -NoTypeInformation
Replace ServerName with the name of the Column you want to remove spaces from (aka trim).
If your CSV does not have a header (which means it's not a true CSV) and/or you want to better preserve the original file structure and formatting, you could try to expand on your regex a little.
(Get-Content c:\temp\test.txt -Raw) -replace "(?<=')\s+(?=[^' ])|(?<=[^' ])\s+(?=')"
That should remove all leading and trailing spaces inside the quoted values, but not the delimiters themselves.
The file is read in as one string, which could be a bad idea depending on file size; it isn't required, though, as the solution doesn't depend on it - the file can just as well be read line by line with the same transformation, achieving the same result. The pattern uses two similar lookaround-based alternatives: the first looks for spaces that exist after a single quote but are not followed by another quote or space; the second looks for spaces before a quote that are not preceded by a quote or space.
Just wanted to give a regex example. You can look into it in more detail, with a full explanation, at regex101.com; there you will see how the alternation pattern combines the two cases into a single replacement.
I was having trouble replicating this consistently, but if you find the replacement also affecting newlines, you can do the replacement one line at a time instead, which should work as well:
(Get-Content c:\temp\test.txt) | ForEach-Object {
  $_ -replace "(?<=')\s+(?=[^' ])|(?<=[^' ])\s+(?=')"
} | Set-Content c:\temp\test.txt