PowerShell: I need to clean a set of CSV files; there is an inconsistent number of garbage rows above the headers that must go before import

I have a set of CSV files that I need to import data from. The issue I'm running into is that the number of garbage rows above the header line, and their content, differs from file to file. The header row itself is consistent, so I could use that to detect what the starting point should be.
I'm not quite sure where to start, the files are structured as below.
Here there be garbage.
So much garbage, between 12 and 25 lines of it.
Header1,Header2,Header3,Header4,Header5
Data1,Data2,Data3,Data4,Data5
My assumption is that the best method would be something that checks for the line number of the header row, followed by a Get-Content call that starts reading at the line number returned by that check.
Any guidance would be most appreciated.

If the header line is as you say consistent, you could do something like this:
$header = 'Header1,Header2,Header3,Header4,Header5'
# read the file as single multiline string
# and split on the escaped header line
$data = ((Get-Content -Path 'D:\theFile.csv' -Raw) -split [regex]::Escape($header), 2)[1] |
ConvertFrom-Csv -Header $($header -split ',')
As per your comment, you really only wanted to clean up these files rather than import data from them (your question does say "I need to import data"). In that case, all you have to do is append this line of code:
$data | Export-Csv -Path 'D:\theFile.csv' -NoTypeInformation
The line ConvertFrom-Csv -Header $($header -split ',') parses the data into an array of objects, (re)using the header line that was taken off by the split.
A purely textual approach (without parsing the data) still needs to write out the header line, because splitting the file content on it removed it from the resulting array:
$header = 'Header1,Header2,Header3,Header4,Header5'
# read the file as single multiline string
# and split on the escaped header line
$data = ((Get-Content -Path 'D:\theFile.csv' -Raw) -split [regex]::Escape($header), 2)[1]
# rewrite the file with just the header line
$header | Set-Content -Path 'D:\theFile.csv'
# then write all data lines we captured in variable $data
$data | Add-Content -Path 'D:\theFile.csv'
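The line-number approach sketched in the question would also work; here's a minimal sketch using Select-String to locate the header row (the file path is a placeholder; note that -SimpleMatch matches the header as a literal substring):

```powershell
$header = 'Header1,Header2,Header3,Header4,Header5'
# Find the 1-based line number of the first occurrence of the header row.
$headerLine = (Select-String -Path 'D:\theFile.csv' -SimpleMatch $header |
    Select-Object -First 1).LineNumber
# Skip everything above the header, then let ConvertFrom-Csv
# consume the header line itself.
Get-Content -Path 'D:\theFile.csv' |
    Select-Object -Skip ($headerLine - 1) |
    ConvertFrom-Csv
```

Because the header line is kept in the stream, ConvertFrom-Csv needs no explicit -Header argument here.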

To offer a slightly more concise (and marginally more efficient) alternative to Theo's helpful answer, using the -replace operator:
If you want to import the malformed CSV file directly:
(Get-Content -Raw file.csv) -replace '(?sm)\A.*(?=^Header1,Header2,Header3,Header4,Header5$)' |
ConvertFrom-Csv
If you want to save the cleaned-up data back to the original file (adjust -Encoding as needed):
(Get-Content -Raw file.csv) -replace '(?sm)\A.*(?=^Header1,Header2,Header3,Header4,Header5$)' |
Set-Content -NoNewLine -Encoding utf8 file.csv
Explanation of the regex:
(?sm) sets the following regex options: single-line (s: make . match newlines too) and multi-line (m: make ^ and $ also match the start and end of individual lines inside a multi-line string).
\A.* matches any (possibly empty) text (.*) from the very start (\A) of the input string.
(?=...) is a positive lookahead assertion that matches the enclosed subexpression (symbolized by ... here) without consuming it, i.e. without making it part of what the regex considers the matching part of the string; this is what keeps the header line in the result.
^Header1,Header2,Header3,Header4,Header5$ matches the header line of interest, as a full line.
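The regex's behavior can be verified quickly against an inline sample string (a sketch with made-up garbage lines, using LF-only newlines):

```powershell
$raw = "garbage line 1`ngarbage line 2`nHeader1,Header2,Header3,Header4,Header5`nData1,Data2,Data3,Data4,Data5"
# Everything before the header line is matched and removed;
# the lookahead keeps the header line itself in the result.
$raw -replace '(?sm)\A.*(?=^Header1,Header2,Header3,Header4,Header5$)'
# -> the header line followed by the data line, garbage removed
```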


Issues merging multiple CSV files in Powershell

I found a nifty command here - http://www.stackoverflow.com/questions/27892957/merging-multiple-csv-files-into-one-using-powershell that I am using to merge CSV files -
Get-ChildItem -Filter *.csv | Select-Object -ExpandProperty FullName | Import-Csv | Export-Csv .\merged\merged.csv -NoTypeInformation -Append
Now this does what it says on the tin and works great for the most part. I have two issues with it, however, and I am wondering if they can be overcome:
Firstly, the merged csv file has CRLF line endings, and I am wondering how I can make the line endings just LF, as the file is being generated?
Also, it looks like there are some shenanigans with quote marks being added/moved around. As an example:
Sample row from initial CSV:
"2021-10-05"|"00:00"|"1212"|"160477"|"1.00"|"3.49"LF
Same row in the merged CSV:
"2021-10-05|""00:00""|""1212""|""160477""|""1.00""|""3.49"""CRLF
So you can see that the first field has lost its closing quote, the other fields have doubled quotes, and the end of the row has an extra quote. I'm not quite sure what is going on here, so any help would be much appreciated!
For dealing with the quotes, the cause of the “problem” is that your CSV does not use the default field delimiter that Import-CSV assumes - the C in CSV stands for comma, and you’re using the vertical bar. Add the parameter -Delimiter "|" to both the Import-CSV and Export-CSV cmdlets.
I don’t think you can do anything about the line-end characters (CRLF vs LF); that’s almost certainly operating-system dependent.
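Putting that together with the merge command from the question, the corrected version could look like this (a sketch; it assumes all input files use the vertical-bar delimiter):

```powershell
# Merge all CSV files in the current directory, telling both cmdlets
# that the field delimiter is "|" rather than the default comma.
Get-ChildItem -Filter *.csv |
    Select-Object -ExpandProperty FullName |
    Import-Csv -Delimiter '|' |
    Export-Csv .\merged\merged.csv -Delimiter '|' -NoTypeInformation -Append
```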
Jeff Zeitlin's helpful answer explains the quote-related part of your problem well.
As for your line-ending problem:
As of PowerShell 7.2, there are no PowerShell-native features that allow you to control the newline format of file-writing cmdlets such as Export-Csv.
However, if you use plain-text processing, you can use multi-line strings built with the newline format of interest and save / append them with Set-Content and its -NoNewLine switch, which writes the input strings as-is, without a (newline) separator.
In fact, to significantly speed up processing in your case, plain-text handling is preferable, since in essence your operation amounts to concatenating text files, the only twist being that the header lines of all but the first file should be skipped; using plain-text handling also bypasses your quote problem:
$tokenCount = 1
Get-ChildItem -Filter *.csv |
    Get-Content -Raw |
    ForEach-Object {
        # Get the file content and replace CRLF with LF.
        # Include the first line (the header) only for the first file.
        $content = ($_ -split '\r?\n', $tokenCount)[-1].Replace("`r`n", "`n")
        $tokenCount = 2 # Subsequent files should have their header ignored.
        # Make sure that each file's content ends in a LF.
        if (-not $content.EndsWith("`n")) { $content += "`n" }
        # Output the modified content.
        $content
    } |
    Set-Content -NoNewLine ./merged/merged.csv # add -Encoding as needed.

How to efficiently delete the last line of a multiline string when the line is empty/blank?

I'm trying to delete the blank line at the bottom of each sqlcmd output file, provided by another vendor.
$List = Get-ChildItem * -Include *.csv
foreach ($file in $List) {
    $data = Get-Content $file
    $name = $file.Name
    $length = $data.Length - 1
    $data[$length] = $null
    $data | Out-File $name -Encoding utf8
}
It takes quite a long time to remove the blank line. Does anyone know a more efficient way?
Using Get-Content -Raw to load files as a whole, as a single string into memory and operating on that string will give you the greatest speed boost.
While that isn't always an option depending on file size, you mention sqlcmd files, which can be assumed to be small enough.
Note:
By blank line I mean a line that is either completely empty or contains whitespace (other than newlines) only.
The trimmed string will not have a final terminating newline following the last line, but if you pass it to Set-Content (or Out-File), one is appended by default; use -NoNewline to suppress that, but note that, especially on Unix-like platforms, even the last line of a text file is expected to have a trailing newline.
Trailing (or leading) whitespace on a non-blank line is by design not trimmed, except where noted.
The solutions use the -replace operator, which operates on regexes (regular expressions).
Remove all trailing blank lines:
Note: If you really want to remove only the last line if it happens to be blank, see the second-to-last solution below.
(Get-Content -Raw $file) -replace '\r?\n\s*$'
In the context of your command (slightly modified):
Get-ChildItem -Filter *.sqlcmd | ForEach-Object {
    (Get-Content -Raw $_.FullName) -replace '\r?\n\s*$' |
        Set-Content $_.FullName -Encoding utf8 -WhatIf # save back to same file
}
Note: The -WhatIf common parameter in the command above previews the operation. Remove -WhatIf once you're sure the operation will do what you want.
If it's acceptable / desirable to also trim trailing whitespace from the last non-blank line, you can more simply write:
(Get-Content -Raw $file).TrimEnd()
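The removal can be checked quickly against an inline string (a sketch; `r`n stands for Windows-style CRLF newlines):

```powershell
$s = "line1`r`nline2`r`n`r`n   `r`n"
# \r?\n\s*$ matches the newline that ends the last non-blank line
# plus all trailing whitespace (including further newlines),
# removing everything in one operation.
$s -replace '\r?\n\s*$'
# -> "line1`r`nline2"
```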
Remove all blank lines, wherever they occur in the file:
(Get-Content -Raw $file) -replace '(?m)\A\s*\r?\n|\r?\n\s*$'
Here's a conceptually much simpler version that operates on the array of lines output by Get-Content without -Raw (and also returns an array), but it performs much worse.
@(Get-Content $file) -notmatch '^\s*$'
Do not combine this with Set-Content / Out-File -NoNewline, as that would concatenate the lines stored in the array elements directly, without line breaks between them. Without -NoNewline, you'll invariably get a terminating newline after the last line.
Remove only the last line if it is blank:
(Get-Content -Raw $file) -replace '\r?\n[ \t]*\Z'
Note:
[ \t] matches spaces and tabs, whereas \s more generally matches all forms of Unicode whitespace, including that outside the ASCII range.
An optional trailing newline at the very end of the file (to terminate the last line) is not considered a blank line in this case - whether such a newline is present or not does not make a difference.
Unconditionally remove the last line, whether it is blank or not:
(Get-Content -Raw $file) -replace '\r?\n[^\n]*\Z'
Note:
An optional trailing newline at the very end of the file (to terminate the last line) is not considered a blank line in this case - whether such a newline is present or not does not make a difference.
If you want to remove the last non-blank line, use
(Get-Content -Raw $file).TrimEnd() -replace '\r?\n[^\n]*\Z'
Try replacing with this line; you will not have blank lines in your array value $data:
$data = Get-Content $file.FullName | Where-Object { $_.Trim() -ne '' }

How to sort a txt file in a specific order in PowerShell

I have this first text for example
today is sunny in the LA
and the temperature is 21C
today is cloudy in the NY
and the temperature is 18C
today is sunny in the DC
and the temperature is 25C
and this is the order I want:
18C
25C
21C
I want to change the first file to be the same order as the second one but without deleting anything:
today is cloudy in the NY
and the temperature is 18C
today is sunny in the DC
and the temperature is 25C
today is sunny in the LA
and the temperature is 21C
Note: The PSv3+ solution below answers a different question: it sorts the paragraphs numerically by the temperature values contained in them, not in an externally prescribed order.
As such, it may still be of interest, given the question's generic title.
For an answer to the question as asked, see my other post.
Here's a concise solution, but note that it requires reading the input file into memory as a whole (in any event, Sort-Object collects its input objects all in memory as well, since it does not use temporary files to ease potential memory pressure):
((Get-Content -Raw file.txt) -split '\r?\n\r?\n' -replace '\r?\n$' |
Sort-Object { [int] ($_ -replace '(?s).+ (\d+)C$', '$1') }) -join
[Environment]::NewLine * 2
(Get-Content -Raw file.txt) reads the input file into memory as a whole, as a single, multi-line string.
-split '\r?\n\r?\n' breaks the multi-line string into an array of paragraphs (blocks of lines separated by an empty line), and -replace '\r?\n$' removes a trailing newline, if any, from the paragraph at the very end of the file.
Regex \r?\n matches both Windows-style CRLF and Unix-style LF-only newlines.
Sort-Object { [int] ($_ -replace '(?s).+ (\d+)C$', '$1') } numerically sorts the paragraphs by the temperature number at the end of each paragraph (e.g. 18).
$_ represents the input paragraph at hand.
-replace '...', '...' performs string replacement based on a regex, which in this case extracts the temperature number string from the end of the paragraph.
See Get-Help about_Regular_Expressions for information about regexes (regular expressions) and Get-Help about_Comparison_Operators for information about the -replace operator.
Cast [int] converts the number string to an integer for proper numerical sorting.
-join [Environment]::NewLine * 2 reassembles the sorted paragraphs into a single multi-line string, with paragraphs separated by an empty line.
[Environment]::NewLine is the platform-appropriate newline sequence; you can alternatively hard-code newlines as "`r`n" (CRLF) or "`n" (LF).
You can send the output to a new file by appending something like
... | Set-Content sortedFile.txt (which makes the file "ANSI"-encoded in Windows PowerShell, and UTF-8-encoded in PowerShell Core by default; use -Encoding as needed).
Since the entire input file is read into memory up front, it is possible to write the results directly back to the input file (... | Set-Content file.txt), but doing so bears the slight risk of data loss, namely if writing is interrupted before completion.
Nas' helpful answer works, but it is an O(m*n) operation; that is, with m paragraphs to output in prescribed order and n input paragraphs, m * n operations are required; if all input paragraphs are to be output (in the prescribed order), i.e., if m equals n, the effort is quadratic.
The following PSv4+ solution will scale better, as it only requires linear rather than quadratic effort:
# The tokens prescribing the sort order, which may come from
# another file read with Get-Content, for instance.
$tokensToSortBy = '18C', '25C', '21C'
# Create a hashtable that indexes the input file's paragraphs by the sort
# token embedded in each.
$htParagraphsBySortToken = @{}
((Get-Content -Raw file.txt) -split '\r?\n\r?\n' -replace '\r?\n$').ForEach({
    $htParagraphsBySortToken[$_ -replace '(?s).* (\d+C)$(?:\r?\n)?', '$1'] = $_
})
# Loop over the tokens prescribing the sort order, and retrieve the
# corresponding paragraph, then reassemble the paragraphs into a single,
# multi-line string with -join
$tokensToSortBy.ForEach({ $htParagraphsBySortToken[$_] }) -join [Environment]::NewLine * 2
(Get-Content -Raw file.txt) reads the input file into memory as a whole, as a single, multi-line string.
-split '\r?\n\r?\n' breaks the multi-line string into an array of paragraphs (blocks of lines separated by an empty line), and -replace '\r?\n$' removes a trailing newline, if any, from the paragraph at the very end of the file.
Regex \r?\n matches both Windows-style CRLF and Unix-style LF-only newlines.
$_ -replace '(?s).* (\d+C)$(?:\r?\n)?', '$1' extracts the sort token (e.g., 25C) from each paragraph, which becomes the hashtable's key.
$_ represents the input paragraph at hand.
-replace '...', '...' performs string replacement based on a regex.
See Get-Help about_Regular_Expressions for information about regexes (regular expressions) and Get-Help about_Comparison_Operators for information about the -replace operator.
-join [Environment]::NewLine * 2 reassembles the sorted paragraphs into a single multi-line string, with paragraphs separated by an empty line.
[Environment]::NewLine is the platform-appropriate newline sequence; you can alternatively hard-code newlines as "`r`n" (CRLF) or "`n" (LF).
You can send the output to a new file by appending something like
... | Set-Content sortedFile.txt to the last statement (which makes the file "ANSI"-encoded in Windows PowerShell, and UTF-8-encoded in PowerShell Core by default; use -Encoding as needed).
$text = Get-Content -Path C:\text.txt
$order = '18C', '25C', '21C'
foreach ($item in $order) {
    $text | ForEach-Object {
        if ($_ -match "$item`$") { # `$ to match the string at the end of the line
            Write-Output $text[($_.ReadCount - 2)..($_.ReadCount)] # output the lines before and after the match
        }
    }
}

Remove start and end spaces in specific csv column

I am trying to remove leading and trailing spaces from column data in a CSV file. I found a solution that removes all spaces in the CSV, but it creates unreadable text in the description column.
Get-Content -Path test.csv | ForEach-Object { $_.Trim() -replace "\s+" } | Out-File -FilePath out.csv -Encoding ascii
e.g.
'192.168.1.2' ' test-1-TEST' 'Ping Down at least 1 min' '3/11/2017' 'Unix Server' 'Ping' 'critical'
'192.168.1.3' ' test-2-TEST' ' Ping Down at least 3 min' '3/11/2017' 'windows Server' 'Ping' 'critical'
I only want to remove space only from ' test-1-TEST' and not from 'Ping Down at least 1 min'. Is this possible?
"IP","ServerName","Status","Date","ServerType","Test","State"
"192.168.1.2"," test-1-TEST","Ping Down at least 1 min","3/11/2017","Unix Server","Ping","critical"
"192.168.1.3"," test-2-TEST"," Ping Down at least 3 min","3/11/2017","windows Server","Ping","critical"
For example file above:
Import-Csv C:\folder\file.csv | ForEach-Object {
$_.ServerName = $_.ServerName.trim()
$_
} | Export-Csv C:\folder\file2.csv -NoTypeInformation
Replace ServerName with the name of the Column you want to remove spaces from (aka trim).
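If you wanted to trim every column rather than just one, a hedged sketch that iterates over each row's properties (the paths are placeholders):

```powershell
Import-Csv C:\folder\file.csv | ForEach-Object {
    # Trim leading/trailing whitespace from every field in the row.
    foreach ($prop in $_.PSObject.Properties) {
        $prop.Value = $prop.Value.Trim()
    }
    $_
} | Export-Csv C:\folder\file2.csv -NoTypeInformation
```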
If your CSV does not have a header (which means it's not a true CSV) and/or you want to better preserve the original file's structure and formatting, you could try to expand on your regex a little.
(Get-Content c:\temp\test.txt -Raw) -replace "(?<=')\s+(?=[^' ])|(?<=[^' ])\s+(?=')"
That should remove all leading and trailing spaces inside the quoted values, but not the delimiters themselves.
The file is read in as one string, which could be a bad idea depending on file size, but the solution does not depend on that: the file can still be read line by line, applying the same transformation and achieving the same result. The pattern uses two similar alternatives: the first looks for spaces that come after a single quote and are not followed by another quote or space; the second looks for spaces before a quote that are not preceded by a quote or space.
Just wanted to give a regex example. You can look into this with more detail and explanation at regex101.com. There you will see an alternation pattern instead of two separate replacements.
The single alternation pattern is a little easier on the eyes than two separate replacements.
I was having issues replicating this consistently, but if you find that the replacement also affects newlines, you can do the replacement one line at a time instead, which should work as well:
(Get-Content c:\temp\test.txt) | ForEach-Object {
    $_ -replace "(?<=')\s+(?=[^' ])|(?<=[^' ])\s+(?=')"
} | Set-Content c:\temp\test.txt

Matching lines in file from list

I have two text files, Text1.txt and Text2.txt
Text2.txt is a list of keywords, one keyword per line. I want to read from Text1.txt and any time a keyword in the Text2.txt list shows up, pipe that entire line of text to a new file, output.txt
Without using Text2.txt I figured out how to do it manually in PowerShell.
Get-Content .\Text1.txt | Where-Object {$_ -match 'CAPT'} | Set-Content output.txt
That seems to work, it searches for "CAPT" and returns the entire line of text, but I don't know how to replace the manual text search with a variable that pulls from Text2.txt
Any ideas?
Using some simple regex, you can build an alternation pattern from all the keywords in the file Text2.txt:
$pattern = (Get-Content .\Text2.txt | ForEach-Object{[regex]::Escape($_)}) -Join "|"
Get-Content .\Text1.txt | Where-Object {$_ -match $pattern} | Set-Content output.txt
In case your keywords contain special regex characters, we need to be sure they are escaped; the .NET regex method [regex]::Escape() handles that.
This is not an efficient approach for large files, but it is certainly a simple method. If your keywords were all similar, like CAPT, CAPS, CAPZ, the pattern could be optimized, but that is probably not worth it, depending on how often the keywords change.
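For larger files, a streaming variant that avoids loading Text1.txt into memory could look like this (a sketch using Select-String; for line-based matching the result is equivalent):

```powershell
$pattern = (Get-Content .\Text2.txt | ForEach-Object { [regex]::Escape($_) }) -join '|'
# Select-String streams the input file line by line and emits match objects;
# expanding the Line property recovers the original text of each matching line.
Select-String -Path .\Text1.txt -Pattern $pattern |
    Select-Object -ExpandProperty Line |
    Set-Content output.txt
```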
Changing the pattern
If you wanted to just match the first 4 characters from the lines in your input file that is just a matter of making a change in the loop.
$pattern = (Get-Content .\Text2.txt | ForEach-Object{[regex]::Escape($_.Substring(0,4))}) -Join "|"