stripping extra text qualifier from a CSV - part 2 - powershell

For part 1, see this SO post
I have a CSV that has certain fields separated by the " symbol as a TextQualifier.
See below for an example. Note that each integer (e.g. 1, 2, 3) is supposed to be a string. The qualified strings are surrounded by the " symbol.
1,2,3,"qualifiedString1",4,5,6,7,8,9,10,11,12,13,14,15,16,"qualifiedString2""
Notice how the last qualified string has a " symbol as part of the string.
User @mjolinor suggested this PowerShell script, which fixes the scenario above but does not fix the "Part 2" scenario below.
(get-content file.txt -ReadCount 0) -replace '([^,]")"','$1' |
set-content newfile.txt
Here is part 2 of the question. I need a solution for this:
The extra " symbol can appear randomly in the string. Here's another example:
1,2,3,"qualifiedString1",4,5,6,7,8,9,10,11,12,13,14,15,16,"qualifiedS"tring2"
Can you suggest an elegant way to automate the cleaning of the CSV to eliminate redundant " qualifiers?

You just need a different regex:
(get-content file.txt -ReadCount 0) -replace '(?<!,)"(?!,|$)',''|
set-content newfile.txt
That one removes any double quote that is neither immediately preceded by a comma nor immediately followed by a comma or the end of the line.
$text = '1,2,3,"qualifiedString1",4,5,6,7,8,9,10,11,12,13,14,15,16,"qualifiedS"tring2"'
$text -replace '(?<!,)"(?!,|$)',''
1,2,3,"qualifiedString1",4,5,6,7,8,9,10,11,12,13,14,15,16,"qualifiedString2"
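As a quick check, the sketch below runs the same pattern over both the part 1 and the part 2 shapes. The sample rows are shortened, made-up versions of the ones in the question. The last statement is an assumed variant, not part of the original answer: it also treats the start of the line as a delimiter, because the pattern as given would strip the opening quote of a quoted first field.

```powershell
# Sample rows (shortened, hypothetical): one stray quote at the end, one in the middle.
$samples = @(
    '1,2,"qualifiedString1",3,"qualifiedString2""'
    '1,2,"qualifiedString1",3,"qualifiedS"tring2"'
)

# Pattern from the answer: drop quotes that are not next to a delimiter.
$cleaned = $samples -replace '(?<!,)"(?!,|$)', ''
$cleaned

# Assumed variant: also accept the start of the line in the lookbehind,
# so a quoted first field keeps its opening quote.
$firstQuoted = '"first",2,"se"cond"' -replace '(?<!^|,)"(?!,|$)', ''
$firstQuoted
```

Neither pattern can repair a stray quote that happens to sit right next to a comma, so it is worth spot-checking the cleaned file.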

Related

PowerShell not removing new line characters

Environment: Windows 10 pro 20H2, PowerShell 5.1.19041.1237
In a .txt file, the following PowerShell code is not replacing the newline character(s) with " ". Question: What am I missing here, and how can I make it work?
C:\MyFolder\Test.txt File:
This is first line.
This is second line.
This is third line.
This is fourth line.
Desired output [after replacing the newline characters with " " character]:
This is first line. This is second line. This is third line. This is fourth line.
PowerShell code:
PS C:\MyFolder\test.txt> $content = get-content "Test.txt"
PS C:\MyFolder\test.txt> $content = $content.replace("`r`n", " ")
PS C:\MyFolder\test.txt> $content | out-file "Test.txt"
Remarks
The above code works fine if I replace some other character(s) in the file. For example, if I change the second line of the above code to $content = $content.replace("third", "3rd"), the code successfully replaces third with 3rd in the above file.
You need to pass the -Raw parameter to Get-Content. By default, without -Raw, the content is returned as an array of strings, one per line, with the newline characters already stripped.
Get-Content "Test.txt" -Raw
Quoting from the documentation,
-Raw
Ignores newline characters and returns the entire contents of a file in one string with the newlines preserved. By default, newline characters in a file are used as delimiters to separate the input into an array of strings. This parameter was introduced in PowerShell 3.0.
The simplest way is not to read the file with -Raw and then replace the newlines, but to take advantage of the fact that Get-Content already splits the content on newlines for you.
All it then takes is to join the array with a space character.
(Get-Content -Path "Test.txt") -join ' ' | Set-Content -Path "Test.txt"
As for what you have tried:
By using Get-Content without the -Raw switch, the cmdlet returns a string array of lines split on the Newlines.
That means there are no Newlines in the resulting strings anymore to replace and all that is needed is to 'stitch' the lines together with a space character.
If you do use the -Raw switch, the cmdlet returns a single, multiline string including the Newlines.
In your case, you then need to do the splitting or replacing yourself and for that, don't use the string method .Replace, but the regex operator -split or -replace with a search string '\r?\n'.
The question mark in there makes sure you split on newlines in Windows format (CRLF), but also works on *nix format (LF).
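A minimal sketch of both routes side by side, using a throwaway file in the temp folder (the file name is an assumption made for the demo):

```powershell
# Create a small demo file; Set-Content writes one line per array element.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'newline-demo.txt'
Set-Content -Path $path -Value 'This is first line.', 'This is second line.', 'This is third line.'

# Route 1: no -Raw. Get-Content returns one string per line, so just join them.
$lines   = Get-Content -Path $path
$joined1 = $lines -join ' '

# Route 2: -Raw. One multi-line string; trim the final newline, then replace the rest.
$joined2 = (Get-Content -Path $path -Raw).TrimEnd() -replace '\r?\n', ' '

$joined1
$joined2

Remove-Item -Path $path
```

Both variables should hold the same single line. Route 1 never sees a newline character at all, which is why the original .Replace call had nothing to replace.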

How to remove a multi line block of text from $pattern in Powershell

I'm getting the contents of a text file which is partly created by gsutil and I'm trying to put its contents in $body but I want to omit a block of text that contains special characters. The problem is that I'm not able to match this block of text in order for it to be removed. So when I print out $body it still contains all the text that I'm trying to omit.
Here's a part of my code:
$pattern = @"
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this you and any
users that download such composite files will need to have a compiled
crcmod installed (see "gsutil help crcmod").
"@
$pattern = ([regex]::Escape($pattern))
$body = Get-Content -Path C:\temp\file.txt -Raw | Select-String -Pattern $pattern -NotMatch
So basically I need it to display everything inside the text file except for the block of text in $pattern. I tried without -Raw and without ([regex]::Escape($pattern)) but it won't remove that entire block of text.
It has to be because of the special characters, probably the " , . () because if I make the pattern simple such as:
$pattern = @"
NOTE: You are uploading one or more
"@
then it works and this part of text is removed from $body.
It'd be nice if everything inside $pattern between the @" and "@ was treated literally. I'd like the simplest solution without functions, etc. I'd really appreciate it if someone could help me out with this.
With the complete text of your question stored in file .\SO_55538262.txt
This script with manually escaped pattern:
$pattern = '(?sm)^==\> NOTE: You .*?"gsutil help crcmod"\)\.'
$body = (Get-Content .\SO_55538262.txt -raw) -replace $pattern
$body
Returns here:
I'm getting the contents of a text file which is partly created by gsutil and I'm trying to put its contents in $body but I want to omit a block of text that contains special characters. The problem is that I'm not able to match this block of text in order for it to be removed. So when I print out $body it still contains all the text that I'm trying to omit.
Here's a part of my code:
$pattern = @"
"@
$pattern = ([regex]::Escape($pattern))
$body = Get-Content -Path C:\temp\file.txt -Raw | Select-String -Pattern $pattern -NotMatch
So basically I need it to display everything inside the text file except for the block of text in $pattern. I tried without -Raw and without ([regex]::Escape($pattern)) but it won't remove that entire block of text.
It has to be because of the special characters, probably the " , . () because if I make the pattern simple such as:
$pattern = @"
NOTE: You are uploading one or more
"@
then it works and this part of text is removed from $body.
It'd be nice if everything inside $pattern between the @" and "@ was treated literally. I'd like the simplest solution without functions, etc.
Explanation of the RegEx from regex101.com:
(?sm)^==\> NOTE: You .*?"gsutil help crcmod"\)\.
(?sm) match the remainder of the pattern with the following effective flags: gms
s modifier: single line. Dot matches newline characters
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
^ asserts position at start of a line
== matches the characters == literally (case sensitive)
\> matches the character > literally (case sensitive)
NOTE: You matches the characters NOTE: You literally (case sensitive)
.*?
. matches any character
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
"gsutil help crcmod" matches the characters "gsutil help crcmod" literally (case sensitive)
\) matches the character ) literally (case sensitive)
\. matches the character . literally (case sensitive)
An easy way to tackle this task (without regex) would be using the -notin operator. Since Get-Content is returning your file content as a string[]:
#requires -Version 4
$set = @('==> NOTE: You are uploading one or more large file(s), which would run'
'significantly faster if you enable parallel composite uploads. This'
'feature can be enabled by editing the'
'"parallel_composite_upload_threshold" value in your .boto'
'configuration file. However, note that if you do this you and any'
'users that download such composite files will need to have a compiled'
'crcmod installed (see "gsutil help crcmod").')
$filteredContent = @(Get-Content -Path $path).Where({ $_.Trim() -notin $set }) # trim added for misc whitespace
v2 compatible solution:
@(Get-Content -Path $path) |
    Where-Object { $set -notcontains $_.Trim() }
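For the "treat it literally" part of the question, another option is the .Replace() string method, which does no regex interpretation at all. Two things are worth keeping in mind: Select-String filters whole input objects rather than deleting the matched part, and the line endings inside a here-string need not match those in the file. The sketch below is only an illustration under those assumptions; the two-line block and the surrounding text are invented stand-ins for the real gsutil notice and for the output of Get-Content -Raw.

```powershell
# Literal block to remove; a single-quoted here-string performs no expansion.
$block = @'
==> NOTE: first line of the notice (hypothetical)
see "gsutil help crcmod").
'@

# Stand-in for (Get-Content -Raw file.txt), built with CRLF line endings.
$text = "line before`r`n" + ($block -replace '\r?\n', "`r`n") + "`r`nline after"

# Normalize both sides to LF so the literal comparison cannot fail on CR characters.
$blockLf = $block -replace '\r\n', "`n"
$textLf  = $text  -replace '\r\n', "`n"

# String.Replace treats its arguments literally, so no escaping is needed.
$body = $textLf.Replace($blockLf, '')
$body
```

Note that $body ends up with LF-only line endings; convert back with -replace "`n", "`r`n" if the consumer needs CRLF.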

how to sort a txt file in specific order in Powershell

I have this first text for example
today is sunny in the LA
and the temperature is 21C
today is cloudy in the NY
and the temperature is 18C
today is sunny in the DC
and the temperature is 25C
and this is the order I want:
18C
25C
21C
I want to change the first file to be the same order as the second one but without deleting anything:
today is cloudy in the NY
and the temperature is 18C
today is sunny in the DC
and the temperature is 25C
today is sunny in the LA
and the temperature is 21C
Note: The PSv3+ solution below answers a different question: it sorts the paragraphs numerically by the temperature values contained in them, not in an externally prescribed order.
As such, it may still be of interest, given the question's generic title.
For an answer to the question as asked, see my other post.
Here's a concise solution, but note that it requires reading the input file into memory as a whole (in any event, Sort-Object collects its input objects all in memory as well, since it does not use temporary files to ease potential memory pressure):
((Get-Content -Raw file.txt) -split '\r?\n\r?\n' -replace '\r?\n$' |
Sort-Object { [int] ($_ -replace '(?s).+ (\d+)C$', '$1') }) -join
[Environment]::NewLine * 2
(Get-Content -Raw file.txt) reads the input file into memory as a whole, as a single, multi-line string.
-split '\r?\n\r?\n' breaks the multi-line string into an array of paragraphs (blocks of lines separated by an empty line), and -replace '\r?\n$' removes a trailing newline, if any, from the paragraph at the very end of the file.
Regex \r?\n matches both Windows-style CRLF and Unix-style LF-only newlines.
Sort-Object { [int] ($_ -replace '(?s).+ (\d+)C$', '$1') }) numerically sorts the paragraphs by the temperature number at the end of each paragraph (e.g. 18).
$_ represents the input paragraph at hand.
-replace '...', '...' performs string replacement based on a regex, which in this case extracts the temperature number string from the end of the paragraph.
See Get-Help about_Regular_Expressions for information about regexes (regular expressions) and Get-Help about_Comparison_Operators for information about the -replace operator.
Cast [int] converts the number string to an integer for proper numerical sorting.
-join [Environment]::NewLine * 2 reassembles the sorted paragraphs into a single multi-line string, with paragraphs separated by an empty line.
[Environment]::NewLine is the platform-appropriate newline sequence; you can alternatively hard-code newlines as "`r`n" (CRLF) or "`n" (LF).
You can send the output to a new file by appending something like
... | Set-Content sortedFile.txt (which makes the file "ANSI"-encoded in Windows PowerShell, and UTF-8-encoded in PowerShell Core by default; use -Encoding as needed).
Since the entire input file is read into memory up front, it is possible to write the results directly back to the input file (... | Set-Content file.txt), but doing so bears the slight risk of data loss, namely if writing is interrupted before completion.
Nas' helpful answer works, but it is an O(m*n) operation: with m paragraphs to output in the prescribed order and n input paragraphs, m * n operations are required. If all input paragraphs are to be output (in the prescribed order), i.e., if m equals n, the effort is quadratic.
The following PSv4+ solution will scale better, as it only requires linear rather than quadratic effort:
# The tokens prescribing the sort order, which may come from
# another file read with Get-Content, for instance.
$tokensToSortBy = '18C', '25C', '21C'
# Create a hashtable that indexes the input file's paragraphs by the sort
# token embedded in each.
$htParagraphsBySortToken = @{}
((Get-Content -Raw file.txt) -split '\r?\n\r?\n' -replace '\r?\n$').ForEach({
  $htParagraphsBySortToken[$_ -replace '(?s).* (\d+C)$(?:\r?\n)?', '$1'] = $_
})
# Loop over the tokens prescribing the sort order, and retrieve the
# corresponding paragraph, then reassemble the paragraphs into a single,
# multi-line string with -join
$tokensToSortBy.ForEach({ $htParagraphsBySortToken[$_] }) -join [Environment]::NewLine * 2
(Get-Content -Raw file.txt) reads the input file into memory as a whole, as a single, multi-line string.
-split '\r?\n\r?\n' breaks the multi-line string into an array of paragraphs (blocks of lines separated by an empty line), and -replace '\r?\n$' removes a trailing newline, if any, from the paragraph at the very end of the file.
Regex \r?\n matches both Windows-style CRLF and Unix-style LF-only newlines.
$_ -replace '(?s).* (\d+C)$(?:\r?\n)?', '$1' extracts the sort token (e.g., 25C) from each paragraph, which becomes the hashtable's key.
$_ represents the input paragraph at hand.
-replace '...', '...' performs string replacement based on a regex.
See Get-Help about_Regular_Expressions for information about regexes (regular expressions) and Get-Help about_Comparison_Operators for information about the -replace operator.
-join [Environment]::NewLine * 2 reassembles the sorted paragraphs into a single multi-line string, with paragraphs separated by an empty line.
[Environment]::NewLine is the platform-appropriate newline sequence; you can alternatively hard-code newlines as "`r`n" (CRLF) or "`n" (LF).
You can send the output to a new file by appending something like
... | Set-Content sortedFile.txt to the last statement (which makes the file "ANSI"-encoded in Windows PowerShell, and UTF-8-encoded in PowerShell Core by default; use -Encoding as needed).
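To try the hashtable idea without touching any files, here is a self-contained sketch that builds the sample text in memory. The variable names and the use of -match with $Matches (instead of -replace) are choices made for this illustration, not part of the answer above.

```powershell
# In-memory stand-in for (Get-Content -Raw file.txt); paragraphs are separated by an empty line.
$text = @(
    'today is sunny in the LA', 'and the temperature is 21C', '',
    'today is cloudy in the NY', 'and the temperature is 18C', '',
    'today is sunny in the DC', 'and the temperature is 25C'
) -join "`n"

# Prescribed output order.
$order = '18C', '25C', '21C'

# Index each paragraph by the token at its end (one pass over the input).
$byToken = @{}
foreach ($paragraph in ($text -split '\r?\n\r?\n')) {
    if ($paragraph -match '(\d+C)\s*$') { $byToken[$Matches[1]] = $paragraph }
}

# Look up the paragraphs in the prescribed order (one pass over the tokens).
$sorted = ($order | ForEach-Object { $byToken[$_] }) -join "`n`n"
$sorted
```

A token that has no paragraph simply contributes an empty entry here; real input may need an explicit check for missing or duplicate tokens.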
$text = Get-Content -path C:\text.txt
$order = '18C','25C','21C'
foreach ($item in $order)
{
    $text | ForEach-Object {
        if ($_ -match "$item`$") { # `$ to match string at the end of the line
            Write-Output $text[($_.ReadCount-2)..($_.ReadCount)] # output lines before and after match
        }
    }
}

Remove start and end spaces in specific csv column

I am trying to remove start and end spaces in column data in CSV file. I've got a solution to remove all spaces in the csv, but it's creating non-readable text in description column.
Get-Content –path test.csv| ForEach-Object {$_.Trim() -replace "\s+" } | Out-File -filepath out.csv -Encoding ascii
e.g.
'192.168.1.2' ' test-1-TEST' 'Ping Down at least 1 min' '3/11/2017' 'Unix Server' 'Ping' 'critical'
'192.168.1.3' ' test-2-TEST' ' Ping Down at least 3 min' '3/11/2017' 'windows Server' 'Ping' 'critical'
I only want to remove the spaces from ' test-1-TEST' and not from 'Ping Down at least 1 min'. Is this possible?
"IP","ServerName","Status","Date","ServerType","Test","State"
"192.168.1.2"," test-1-TEST","Ping Down at least 1 min","3/11/2017","Unix Server","Ping","critical"
"192.168.1.3"," test-2-TEST"," Ping Down at least 3 min","3/11/2017","windows Server","Ping","critical"
For example file above:
Import-Csv C:\folder\file.csv | ForEach-Object {
$_.ServerName = $_.ServerName.trim()
$_
} | Export-Csv C:\folder\file2.csv -NoTypeInformation
Replace ServerName with the name of the Column you want to remove spaces from (aka trim).
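The same idea can be tried without any files by parsing CSV text in memory. This is a small sketch with a cut-down, partly invented version of the sample data (three columns only), showing that just the chosen column is trimmed while the others keep their spaces:

```powershell
# Cut-down sample data; ConvertFrom-Csv parses it just like Import-Csv parses a file.
$rows = @'
"IP","ServerName","Status"
"192.168.1.2"," test-1-TEST","Ping Down at least 1 min"
"192.168.1.3"," test-2-TEST "," Ping Down at least 3 min"
'@ | ConvertFrom-Csv

# Trim only the ServerName column; every other column is left as it was.
foreach ($row in $rows) {
    $row.ServerName = $row.ServerName.Trim()
}

# Show the result as CSV text (pipe to Set-Content or use Export-Csv to save it).
$rows | ConvertTo-Csv -NoTypeInformation
```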
If your CSV does not have a header (which means it's not a true CSV) and/or you want to better preserve the original file structure and formatting, you could try to expand on your regex a little.
(Get-Content c:\temp\test.txt -Raw) -replace "(?<=')\s+(?=[^' ])" -replace "(?<=[^' ])\s+(?=')"
That should remove all leading and trailing spaces inside the quoted values, but not the delimiters themselves.
Read the file in as one string. This could be a bad idea depending on file size, and it is not required: the file can still be read line by line with the same transformation and the same result. Use two similar replacements. The first looks for spaces that come after a single quote and are not followed by another quote or a space. The second looks for spaces that come before a quote and are not preceded by a quote or a space.
Just wanted to give a regex example. You can look into this in more detail, with an explanation, at regex101.com. There you will see an alternation pattern instead of two separate replacements.
(Get-Content c:\temp\test.txt -Raw) -replace "(?<=')\s+(?=[^' ])|(?<=[^' ])\s+(?=')"
The first example is a little easier on the eyes.
I was having issues consistently replicating this but if you are having issues with it replacing newlines as well then you can just do the replacement one line at a time and that should work as well.
(Get-Content c:\temp\test.txt) | Foreach-Object{
$_ -replace "(?<=')\s+(?=[^' ])|(?<=[^' ])\s+(?=')"
} | Set-Content c:\temp\test.txt
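To see what the pattern does and does not touch, here is a sketch on a single made-up line in the same single-quoted, space-delimited layout as the question's sample:

```powershell
# Hypothetical input line: leading/trailing spaces inside some quoted values.
$line = "'192.168.1.3' ' test-2-TEST' ' Ping Down at least 3 min ' 'Ping'"

# Same alternation pattern as above; the replacement operand is omitted,
# so every match is replaced with an empty string.
$trimmed = $line -replace "(?<=')\s+(?=[^' ])|(?<=[^' ])\s+(?=')"
$trimmed
```

The spaces between words and the single space between two quoted values survive, because neither alternative matches a space that has a quote on both sides or a non-quote character on both sides.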

Add quotes to each column in a CSV via Powershell

I am trying to create a PowerShell script which wraps quotes around each column of the file on export to CSV. However, the Export-Csv cmdlet only places these where they are needed, i.e. where the text has a space or similar within it.
I have tried to use the following to wrap the quotes on each line but it ends up wrapping three quotes on each column.
$r.SURNAME = '"'+$r.SURNAME+'"';
Is anyone able to share how to force these on each column of the file? So far I can only find info on stripping them out.
Thanks
Perhaps a better approach would be to simply convert to CSV (not export) and then a simple regex expression could add the quotes then pipe it out to file.
Assuming you are exporting the whole object $r:
$r | ConvertTo-Csv -NoTypeInformation `
| % { $_ -replace ',(.*?),',',"$1",' } `
| Select -Skip 1 | Set-Content C:\temp\file.csv
The Select -Skip 1 removes the header. If you want the header just take it out.
To clarify what the regex expression is doing:
Match: ,(.*?),
Explanation: This will match any section of a line that has a comma followed by any number of characters (.*) without being greedy (the ? basically means it will only match the minimum number of characters needed to complete the match) and that finally ends with a comma. The parentheses hold everything between the two commas in a match variable to be used later in the replace.
Replace: ,"$1",
Explanation: The $1 holds the text captured between the two parentheses mentioned above. I am surrounding it with quotes and re-adding the commas: since I matched on those as well, they must be put back or they are simply consumed. Please note that while the match portion of -replace can use double quotes without an issue, the replace section must be surrounded by single quotes, or the $1 gets interpreted by PowerShell as a PowerShell variable and not a match variable.
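Because that pattern needs a comma on both sides, the first and last columns are never wrapped. If every column must be quoted, one alternative is to build each line yourself from the object's properties. This is a rough sketch, not the method from the answer above; the object and its property values are invented for the demo. PowerShell 7 and later also offer a -UseQuotes parameter on Export-Csv and ConvertTo-Csv, which may make this unnecessary there.

```powershell
# Hypothetical record standing in for $r.
$r = [pscustomobject]@{ SURNAME = 'Smith'; FIRSTNAME = 'Ann Marie'; AGE = 42 }

# Wrap every value in quotes, doubling any embedded quote as CSV expects,
# then join the quoted values with commas.
$line = ($r.PSObject.Properties | ForEach-Object {
    '"{0}"' -f ("$($_.Value)" -replace '"', '""')
}) -join ','
$line
```

For a collection of objects, the same expression can run inside ForEach-Object and the resulting lines can be piped to Set-Content.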
You can also use the following code:
$r.SURNAME = "`"$($r.SURNAME)`""
I have cheated to get what I want by re-parsing the file through the following - guess that it acts as a simple find and replace on the file.
get-content C:\Data\Downloads\file2.csv |
    foreach-object { $_ -replace '"""', '"' } |
    set-content C:\Data\Downloads\file3.csv
Thanks for the help on this.