I have some CSV data I need to clean up by removing inline linefeeds and special characters like typographic quotes. I feel like I could get this working with Python or Unix utils, but I'm stuck on a pretty vanilla Windows 2012 box, so I'm giving PowerShell v5 a shot despite my lack of experience with it.
Here's what I'm looking to achieve:
$InputFile:
"INCIDENT_NUMBER","FIRST_NAME","LAST_NAME","DESCRIPTION"{CRLF}
"00020306","John","Davis","Employee was not dressed appropriately."{CRLF}
"00020307","Brad","Miller","Employee told customer, ""Go shop somewhere else!"""{CRLF}
"00020308","Ted","Jones","Employee told supervisor, “That’s not my job”"{CRLF}
"00020309","Bob","Meyers","Employee did the following:{LF}
• Showed up late{LF}
• Did not complete assignments{LF}
• Left work early"{CRLF}
"00020310","John","Davis","Employee was not dressed appropriately."{CRLF}
$OutputFile:
"INCIDENT_NUMBER","FIRST_NAME","LAST_NAME","DESCRIPTION"{CRLF}
"00020307","Brad","Miller","Employee told customer, ""Go shop somewhere else!"""{CRLF}
"00020308","Ted","Jones","Employee told supervisor, ""That's not my job"""{CRLF}
"00020309","Bob","Meyers","Employee did the following: * Showed up late * Did not complete assignments * Left work early"{CRLF}
"00020310","John","Davis","Employee was not dressed appropriately."{CRLF}
The following code works:
(Get-Content $InputFile -Raw) `
-replace '(?<!\x0d)\x0a',' ' `
-replace "[‘’´]","'" `
-replace '[“”]','""' `
-replace "\xa0"," " `
-replace '[•·]','*' | Set-Content $OutputFile -Encoding ASCII
However, the actual data I'm dealing with is a 4GB file with over a million lines. Get-Content -Raw runs out of memory. I tried Get-Content -ReadCount 10000, but that removes all linefeeds, presumably because it reads line-wise.
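As far as I can tell, -ReadCount hands the pipeline arrays of lines with the newline characters already stripped, so the CRLF/LF distinction is gone before any replacement can run. A sketch of the kind of thing I tried (exact code may have differed):
$InputFile = 'in.csv'; $OutputFile = 'out.csv'
Get-Content $InputFile -ReadCount 10000 | ForEach-Object {
    # $_ is a [string[]] of up to 10000 lines. "• Showed up late" arrives as its
    # own element, indistinguishable from a line that ended in CRLF, so the
    # '(?<!\x0d)\x0a' replacement has nothing left to match here.
    $_ -replace '(?<!\x0d)\x0a', ' '
} | Set-Content $OutputFile -Encoding ASCII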
More Googling brought me to Import-Csv which I got from here:
Import-Csv $InputFile | ForEach {
$_.notes = $_.notes -replace '(?<!\x0d)\x0a',' '
$_
} | Export-Csv $OutputFile -NoTypeInformation -Encoding ASCII
but I don't appear to have a notes property on my objects:
Exception setting "notes": "The property 'notes' cannot be found on this object. Verify that the property exists and can be set."
At C:\convert.ps1:53 char:5
+ $_.notes= $_.notes -replace '(?<!\x0d)\x0a',' '
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [], SetValueInvocationException
+ FullyQualifiedErrorId : ExceptionWhenSetting
I found another example using the Value property, but I got the same error.
I tried running Get-Member on each object and it looks like it's assigning properties based on the header from the file, like I may be able to get it with $_.DESCRIPTION, but I don't know enough PowerShell to run the replacements on all of the properties :(
Please help? Thanks!
Update:
I ended up giving up on PS and coding this in AutoIT. It's not great, and it will be more difficult to maintain, especially since there hasn't been a new release in 2.5 years. But it works, and it crunches the prod file in 4 minutes.
Unfortunately, I couldn't key on the LF easily either, so I ended up going with the logic to create new lines based on ^"[^",] (Line starts with a quote and the second character is not a quote or comma).
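To illustrate how that pattern classifies the example lines (shown here with PowerShell's -match; the AutoIt code uses the same regex):
'"00020309","Bob","Meyers","Employee did the following:' -match '^"[^",]'   # True  -> starts a new row
'• Showed up late'                                        -match '^"[^",]'   # False -> continuation of the previous row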
Here's the AutoIT code:
#include <FileConstants.au3>

If $CmdLine[0] <> 2 Then
    ConsoleWriteError("Error in parameters" & @CRLF)
    Exit 1
EndIf

Local Const $sInputFilePath = $CmdLine[1]
Local Const $sOutputFilePath = $CmdLine[2]

ConsoleWrite("Input file: " & $sInputFilePath & @CRLF)
ConsoleWrite("Output file: " & $sOutputFilePath & @CRLF)
ConsoleWrite("***** WARNING *****" & @CRLF)
ConsoleWrite($sOutputFilePath & " is being OVERWRITTEN!" & @CRLF & @CRLF)

Local $bFirstLine = True

Local $hInputFile = FileOpen($sInputFilePath, $FO_ANSI)
If $hInputFile = -1 Then
    ConsoleWriteError("An error occurred when reading the file.")
    Exit 1
EndIf

Local $hOutputFile = FileOpen($sOutputFilePath, $FO_OVERWRITE + $FO_ANSI)
If $hOutputFile = -1 Then
    ConsoleWriteError("An error occurred when opening the output file.")
    Exit 1
EndIf

ConsoleWrite("Processing..." & @CRLF)

While True
    $sLine = FileReadLine($hInputFile)
    If @error = -1 Then ExitLoop

    ; Replace typographic single quotes and backtick with apostrophe
    $sLine = StringRegExpReplace($sLine, "[‘’´]", "'")
    ; Replace typographic double quotes with a normal quote (doubled for in-field CSV)
    $sLine = StringRegExpReplace($sLine, '[“”]', '""')
    ; Replace bullet and middot with asterisk
    $sLine = StringRegExpReplace($sLine, '[•·]', '*')
    ; Replace non-breaking space (0xA0) and delete (0x7F) with space
    $sLine = StringRegExpReplace($sLine, "[\xa0\x7f]", " ")

    ; A line that starts a new row gets a CRLF; a continuation line is appended with a space.
    If $bFirstLine = False Then
        If StringRegExp($sLine, '^"[^",]') Then
            $sLine = @CRLF & $sLine
        Else
            $sLine = " " & $sLine
        EndIf
    Else
        $bFirstLine = False
    EndIf

    FileWrite($hOutputFile, $sLine)
WEnd

ConsoleWrite("Done!" & @CRLF)

FileClose($hInputFile)
FileClose($hOutputFile)
The first answer may be better than this, as I'm not sure whether PS needs to load everything into memory this way (though I think it does), but going off what you started above, I was thinking along these lines...
# Import CSV into a variable
$InputFile = Import-Csv $InputFilePath
# Gets all field names, stores in $Fields
$InputFile | Get-Member -MemberType NoteProperty |
Select-Object Name | Set-Variable Fields
# Updates each field entry
$InputFile | ForEach-Object {
$thisLine = $_
$Fields | ForEach-Object {
($thisLine).($_.Name) = ($thisLine).($_.Name) `
-replace '(?<!\x0d)\x0a',' ' `
-replace "[‘’´]","'" `
-replace '[“”]','""' `
-replace "\xa0"," " `
-replace '[•·]','*'
}
$thisLine | Export-Csv $OutputFile -NoTypeInformation -Encoding ASCII -Append
}
Here's another "line-by-line" attempt, somewhat akin to mklement0's answer. It assumes that no "row-continuation" line begins with a double-quote character. Hopefully it performs much better!
# Clear contents of file (Not sure if you need/want this...)
if (Test-Path -type leaf $OutputFile) { Clear-Content $OutputFile }
# Flag for first entry, since no data manipulation needed there
$firstEntry = $true
foreach($line in [System.IO.File]::ReadLines($InputFile)) {
if ($firstEntry) {
Add-Content -Path $OutputFile -Value $line -NoNewline
$firstEntry = $false
}
else {
if ($line[0] -eq '"') { Add-Content -Path $OutputFile "`r`n" -NoNewline}
else { Add-Content -Path $OutputFile " " -NoNewline}
$sanitizedLine = $line -replace '(?<!\x0d)\x0a',' ' `
-replace "[‘’´]","'" `
-replace '[“”]','""' `
-replace "\xa0"," " `
-replace '[•·]','*'
Add-Content -Path $OutputFile -Value $sanitizedLine -NoNewline
}
}
The technique is based on this other answer and its comments: https://stackoverflow.com/a/47146987/7649168
(Also thanks to mklement0 for explaining the performance issues of my previous answer.)
Note:
See my other answer for a robust solution.
The answer below may still be of interest as a general line-by-line processing solution that performs well, although it invariably treats LF-only instances as line separators too (it has been updated to use the same regex you use in the AutoIt solution added to the question to distinguish a line that starts a row from one that continues it).
Given the size of your file, I suggest sticking with plain-text processing for performance reasons:
The switch statement enables fast line-by-line processing; it recognizes both CRLF and LF as newlines, as PowerShell generally does. Note, however, that because each returned line has its trailing newline stripped, you won't be able to tell whether the input line ended in CRLF or just LF.
Using a .NET type directly, System.IO.StreamWriter, bypasses the pipeline and enables fast writes to the output file.
For general PowerShell performance tips, see this answer.
$inputFile = 'in.csv'
$outputFile = 'out.csv'
# Create a stream writer for the output file.
# Default to BOM-less UTF-8, but you can pass a [System.Text.Encoding]
# instance as the second argument.
# Note: Pass a *full* path, because .NET's working dir. usually differs from PowerShell's
$outFileWriter = [System.IO.StreamWriter]::new("$PWD/$outputFile")
# Use a `switch` statement to read the input file line by line.
$outLine = ''
switch -File $inputFile -Regex {
'^"[^",]' { # (Start of) a new row.
if ($outLine) { # write previous, potentially synthesized line
$outFileWriter.WriteLine($outLine)
}
$outLine = $_ -replace "[‘’´]", "'" -replace '[“”]', '""' -replace '\u00a0', ' '
}
default { # Continuation of a row.
$outLine += ' ' + $_ -replace "[‘’´]", "'" -replace '[“”]', '""' -replace '\u00a0', ' ' `
-replace '[•·]', '*' -replace '\n'
}
}
# Write the last line.
$outFileWriter.WriteLine($outLine)
$outFileWriter.Close()
Note: The above assumes that no row continuation also matches regex pattern '^"[^",]', which is hopefully robust enough (you've deemed it to be, given that you based your AutoIt solution on it).
This simple distinction between the start of a row and continuations on subsequent lines obviates the need for lower-level file I/O in order to distinguish between CRLF and LF newlines, which my other answer does.
The following two approaches would work in principle, but are too slow with a large input file such as yours.
Object-oriented processing with Import-Csv / Export-Csv:
Use Import-Csv to parse the CSV into objects, modify the objects' DESCRIPTION property values, then reexport with Export-Csv. Since the row-internal LF-only newlines are inside double-quoted fields, they are recognized as being part of the same row.
While a robust and conceptually elegant approach, it is by far the slowest and also very memory-intensive - see GitHub issue #7603, which discusses the reasons, and GitHub feature request #11027 to improve the situation by outputting hashtables rather than custom objects ([pscustomobject]).
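For completeness, a minimal sketch of what that would look like (fine for small files, but too slow and memory-hungry at 4 GB for the reasons above):
Import-Csv $inputFile | ForEach-Object {
    # Curly quotes become a single " here; Export-Csv re-escapes it as "" on output.
    $_.DESCRIPTION = $_.DESCRIPTION -replace '\r?\n', ' ' -replace "[‘’´]", "'" `
        -replace '[“”]', '"' -replace '\u00a0', ' ' -replace '[•·]', '*'
    $_
} | Export-Csv $outputFile -NoTypeInformation -Encoding ASCII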
Plain-text processing with Get-Content / Set-Content:
Use Get-Content -Delimiter "`r`n" to split the text file into lines by CRLF only, not also LF, transform each line as needed and save it to the output file with Set-Content.
While you pay a performance penalty for the conceptual elegance of using the pipeline in general (which makes saving the results line by line with Set-Content somewhat slow), Get-Content is especially slow, because it decorates each output string (line) with additional properties about the originating file, which is costly. See the green-lighted, but not yet implemented, GitHub feature request #7537 to improve performance (and memory use) by omitting this decoration.
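Again for completeness, a sketch of that variant; the '\r\n$' trim is defensive, in case Get-Content keeps the delimiter at the end of each chunk (behavior has varied across versions):
Get-Content $inputFile -Delimiter "`r`n" | ForEach-Object {
    # Each chunk is one CRLF-delimited row that may still contain in-field LF-only newlines.
    ($_ -replace '\r\n$') -replace '\n', ' ' -replace "[‘’´]", "'" -replace '[“”]', '""' `
        -replace '\u00a0', ' ' -replace '[•·]', '*'
} | Set-Content $outputFile -Encoding ASCII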
Solution:
For performance reasons, direct use of .NET APIs is therefore required.
Note: If the PowerShell solution should still be too slow, consider creating a helper class via ad-hoc compilation of C# code using Add-Type; ultimately, of course, using only compiled code will perform best.
While there is no direct equivalent to Get-Content -Delimiter "`r`n", you can read text files in fixed-size blocks (arrays) of characters using the System.IO.StreamReader.ReadBlock() method (.NET Framework 4.5+ / .NET Core 1+), on which you can then perform the desired transformations, as shown below.
Note:
For best performance, choose a high $BUFSIZE value below to minimize the number of reads and processing iterations; obviously, the value must be chosen so that you don't run out of memory.
There's not even a need to parse the blocks read into CRLF newlines, because you can simply target the LF-only lines with a regex that is a modified version of the one from your original approach, '(?<!\r|^)\n' (see code comments below).
For brevity, error handling is omitted, but the .Close() calls to close the files should generally be placed in the finally block of a try / catch / finally statement (see the sketch after the code below).
# In- and output file paths.
# Note: Be sure to use *full* paths, because .NET's working dir. usually
# differs from PowerShell's.
$inFile = "$PWD/in.csv"
$outFile = "$PWD/out.csv"
# How many characters to read at once.
# This is a tradeoff between execution speed and memory use.
$BUFSIZE = 100MB
$buf = [char[]]::new($BUFSIZE)
$inStream = [IO.StreamReader]::new($inFile)
$outStream = [IO.StreamWriter]::new($outFile)
# Process the file in fixed-size blocks of characters.
while ($charsRead = $inStream.ReadBlock($buf, 0, $BUFSIZE)) {
# Convert the array of chars. to a string.
$block = [string]::new($buf, 0, $charsRead)
# Transform the block and write it to the output file.
$outStream.Write(
# Transform block-internal LF-only newlines to spaces and perform other
# substitutions.
# Note: The |^ part inside the negative lookbehind is to deal with the
# case where the block starts with "`n" due to the block boundaries
# accidentally having split a CRLF sequence.
($block -replace '(?<!\r|^)\n', ' ' -replace "[‘’´]", "'" -replace '[“”]', '""' -replace '\u00a0', ' ' -replace '[•·]', '*')
)
}
$inStream.Close()
$outStream.Close()
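If you do want the Close() calls protected as noted above, the overall shape would be roughly as follows (sketch only; the ReadBlock loop from above goes where indicated):
$inStream  = [IO.StreamReader]::new($inFile)
$outStream = [IO.StreamWriter]::new($outFile)
try {
    # ... the ReadBlock / transform / Write loop shown above ...
}
finally {
    # Runs even if the loop throws, so both file handles are always released.
    $inStream.Close()
    $outStream.Close()
}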
I have a tab separated file like:
tyuy wqf fdfd
zx c vbn 733t 601 asd
Last line is like zx c[tab]vbn[tab]733t 601[tab]asd.
I need to trim the data before the first tab in a 2 GB file with roughly 100 characters per line.
I want to copy the content of the file line by line, keeping only what comes after the first tab:
wqf fdfd
vbn 733t 601 asd
I wrote a script that works on small test files
powershell -Command "(gc in.txt) -replace '^[^\t]+\t' , '$1' | Out-File -encoding ASCII out.txt"
However, it consumed 10 GB of memory and took hours to run.
Is there a way to make this script faster? A bat file for cmd.exe would work too. Python and Perl cannot be installed on that computer.
I would use the -split operator to get the part after the first tab character.
Because you are working with a large file, these options may work better for you:
Using [System.IO.File]::ReadLines
foreach ($line in [System.IO.File]::ReadLines("D:\in.txt")) {
Add-Content -Path 'D:\out.txt' -Value ($line -split '\t', 2 )[-1]
}
But it is perhaps even faster to use StreamReader and StreamWriter:
$reader = New-Object System.IO.StreamReader("D:\in.txt")
$writer = New-Object System.IO.StreamWriter("D:\out.txt")
while (($line = $reader.ReadLine()) -ne $null) {
$writer.WriteLine(($line -split '\t', 2 )[-1])
}
$reader.Dispose()
$writer.Dispose()
Get-Content is inefficient for large files. Using methods of the .NET System.IO.File class is a better way to go.
Check out this article for a comparison of different techniques: Reading large text files with Powershell
We have a 23 GB SQL file that cannot be opened in any text editor, so I am trying to use a PowerShell script.
((Get-Content D:\test.sql) -replace 'XXY','UUUUU') |Set-Content D:\test.sql
The problem is that it takes too long; the text I want to modify is within the first 20 lines, so I tried
((Get-Content D:\test.sql) |Select -First 20 -replace 'xxxx','UUUU') |Set-Content D:\test.sql
No luck: no errors, and nothing happens. Am I missing anything?
Here are two options to speed up the reading/writing process.
Both read one line at a time, replace the stuff you need replaced and write that line to a new output file.
The first makes use of the .NET System.IO.File ReadLines method:
foreach ($line in [System.IO.File]::ReadLines("D:\test.sql")) {
Add-Content -Path 'D:\test2.sql' -Value ($line -replace 'XXY','UUUUU')
}
Perhaps even faster than the above would be to use the .NET System.IO.StreamReader and System.IO.StreamWriter classes:
$reader = New-Object System.IO.StreamReader("D:\test.sql")
$writer = New-Object System.IO.StreamWriter("D:\test2.sql")
while (($line = $reader.ReadLine()) -ne $null) {
$writer.WriteLine(($line -replace 'XXY','UUUUU'))
}
$reader.Dispose()
$writer.Dispose()
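Since you mention the text to change sits in the first 20 lines, a variant of the StreamReader/StreamWriter version that only applies the replacement there and copies the rest through untouched may save a bit more time (a sketch; adjust the line count as needed):
$reader = New-Object System.IO.StreamReader("D:\test.sql")
$writer = New-Object System.IO.StreamWriter("D:\test2.sql")
$lineNumber = 0
while (($line = $reader.ReadLine()) -ne $null) {
    $lineNumber++
    if ($lineNumber -le 20) {
        # Only the first 20 lines need the replacement.
        $line = $line -replace 'XXY','UUUUU'
    }
    $writer.WriteLine($line)
}
$reader.Dispose()
$writer.Dispose()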
I am currently trying to create a script that allows me to check multiple web url's in order to see if they are online and active. My company has multiple servers with different environments active (Production, Staging, Development etc.) I need a script that can check all the environments URL's and tell me whether or not they are online each and every morning so I can be ahead of the game in addressing any servers or websites being down.
My issue, however, is that I can't base the logic solely on an HTTP status code to deem a site online or not; some of our websites may be online from an HTTP standpoint but have components or web parts that are down, displaying an error message on the page.
I am having trouble coming up with a script that can not only check the HTTP status but also scan the page, parse out any error messages, and then write to host, based on both pieces of logic, whether the site is "Online" or "Down".
Here is what I have so far, you will notice it does not include anything regarding parse for key words as I don't know how to implement...
#Lower Environments Checklist Automated Script
Write-Host Report generated at (Get-date)
write-host("Lower Environments Status Check");
$msg = ""
$array = get-content C:\LowerEnvChecklist\appurls.txt
$log = "C:\LowerEnvChecklist\lowerenvironmentslog.txt"
write-host("Checking appurls.txt...One moment please.");
("`n--------------------------------------------------------------------------- ") | out-file $log -Append
Get-Date | Out-File $log -Append
("`n***Checking Links***") | out-file $log -Append
("`n") | out-file $log -Append
for ($i=0; $i -lt $array.length; $i++) {
$HTTP_Status = -1
$HTTP_Request = [System.Net.WebRequest]::Create($array[$i])
$HTTP_Request.Timeout =60000
$HTTP_Response = $HTTP_Request.GetResponse()
$HTTP_Status = [int]$HTTP_Response.StatusCode
If ($HTTP_Status -eq 200) {
$msg = $array[$i] + " is ONLINE!"
}
Else {
$msg = $array[$i] + " may be DOWN, please check!"
}
$HTTP_Response.Close()
$msg | Out-File $log -Append -width 120
write-host $msg
}
("`n") | out-file $log -Append
("`n***Lower Environments Checklist Completed***") | out-file $log -Append
write-host("Lower Environments Checklist Completed");
appurls.txt just contains the internal URLs I need checked FYI.
Any help would be much appreciated! Thanks.
Here is something to at least give you an idea of what to do. You need to capture the website content in order to parse it. Then we run a regex query against it, built from an array of strings; those strings are texts that might be seen on a page that is not working.
# build a regex query of error strings to match against.
$errorTexts = "error has occurred","Oops","Unable to display widget data","unexpected error occurred","temporarily unavailable"
$regex = ($errorTexts | ForEach-Object{[regex]::Escape($_)}) -join "|"
# Other preproccessing would go here
# Loop through each element of the array
ForEach($target in $array){
# Erase results for the next pass in case of error.
$result, $response, $stream, $page = $null
# Navigate to the website.
$result = [System.Net.WebRequest]::Create($target)
$response = $result.GetResponse()
$stream = [System.IO.StreamReader]$response.GetResponseStream()
$page = $stream.ReadToEnd()
# Determine if the page is truly up based on the information above.
If($response.StatusCode -eq 200){
# While the page might have rendered need to determine there are no errors present
if($page -notmatch $regex){
$msg = "$target is online!"
} else {
$msg = "$target may be DOWN, please check!"
}
} else {
$msg = "$target may be DOWN, please check!"
}
# Log Result
$msg | Out-File $log -Append -width 120
# Close the connection
$response.Close()
}
# Other postproccessing would go here
I wanted to show what a here-string looks like, to replace some of your Out-File repetition. Your log file header used to take several of those calls; I have reduced it to one.
#"
---------------------------------------------------------------------------
$(Get-Date)
***Checking Links***
"# | Out-File $log -Append
Also consider CodeReview.SE for critiquing working code. There are other areas which could in theory be improved but are out of scope for this question.
Following situation:
A PowerShell script creates a file with UTF-8 encoding
The user may or may not edit the file, possibly losing the BOM and possibly changing the line separators, but should keep the encoding as UTF-8
The same PowerShell script reads the file, adds some more content and writes it all as UTF-8 back to the same file
This can be iterated many times
With Get-Content and Out-File -Encoding UTF8 I have problems reading the file correctly. It stumbles over the BOM it has written before (putting it into the content and breaking my parsing regex), does not read the file as UTF-8, and even deletes line breaks in the original content.
I need a function that can read any file with UTF-8 encoding, ignore and delete the BOM and not modify the content. What should I use?
Update
I have added a little test script that shows what I'm trying to do and what happens instead.
# Read data if exists
$data = ""
$startRev = 1;
if (Test-Path test.txt)
{
$data = Get-Content -Path test.txt
if ($data -match "^[0-9-]{10} - r([0-9]+)")
{
$startRev = [int]$matches[1] + 1
}
}
Write-Host Next revision is $startRev
# Define example data to add
$startRev = $startRev + 10
$newMsgs = "2014-04-01 - r" + $startRev + "`r`n`r`n" + `
"Line 1`r`n" + `
"Line 2`r`n`r`n"
# Write new data back
$data = $newMsgs + $data
$data | Out-File test.txt -Encoding UTF8
After running it a few times, new sections should be added to the beginning of the file, the existing content should not be altered in any way (currently loses line breaks) and no additional new lines should be added at the end of the file (seems to happen sometimes).
Instead, the second run gives me an error.
If the file is supposed to be UTF-8, why don't you try reading it with UTF-8 decoding:
Get-Content -Path test.txt -Encoding UTF8
Really, JPBlanc is right. If you want it read as UTF-8, then specify that when the file is read.
On a side note, you're losing formatting with the [String]+[String] concatenation, and your regex match doesn't work. Check out the changes to the regex search, to $newMsgs, and to the way I'm outputting your data to the file.
# Read data if exists
$data = ""
$startRev = 1;
if (Test-Path test.txt)
{
$data = Get-Content -Path test.txt #-Encoding UTF8
if($data -match "\br([0-9]+)\b"){
$startRev = [int]([regex]::Match($data,"\br([0-9]+)\b")).groups[1].value + 1
}
}
Write-Host Next revision is $startRev
# Define example data to add
$startRev = $startRev + 10
$newMsgs = @"
2014-04-01 - r$startRev`r`n`r`n
Line 1`r`n
Line 2`r`n`r`n
"#
# Write new data back
$newmsgs,$data | Out-File test.txt -Encoding UTF8
Get-Content doesn't seem to handle UTF-8 files without a BOM at all (if you omit the -Encoding flag). System.IO.File.ReadLines seems to be an alternative; examples:
PS C:\temp\powershellutf8> $a = Get-Content .\utf8wobom.txt
PS C:\temp\powershellutf8> $b = Get-Content .\utf8wbom.txt
PS C:\temp\powershellutf8> $a2 = Get-Content .\utf8wbom.txt -Encoding UTF8
PS C:\temp\powershellutf8> $a
ABCDEFGHIJKLMNOPQRSTUVWXYZÃ…Ã„Ã– <== This doesn't seem to be right at all
PS C:\temp\powershellutf8> $b
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
PS C:\temp\powershellutf8> $a2
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
PS C:\temp\powershellutf8>
PS C:\temp\powershellutf8> $c = [IO.File]::ReadLines('.\utf8wbom.txt');
PS C:\temp\powershellutf8> $c
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
PS C:\temp\powershellutf8> $d = [IO.File]::ReadLines('.\utf8wobom.txt');
PS C:\temp\powershellutf8> $d
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ <== Works!
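If you also need to write the file back as UTF-8 without re-introducing a BOM (so later reads never pick it up as content), the same .NET File class can do that as well. A sketch, with the path as a placeholder and $newMsgs taken from the question's script:
# ReadAllText defaults to UTF-8 and transparently skips a BOM if one is present.
# (Full path, because .NET's working directory usually differs from PowerShell's.)
$data = [IO.File]::ReadAllText("$PWD\test.txt")
# An explicit UTF8Encoding($false) writes UTF-8 without a BOM.
$utf8NoBom = New-Object System.Text.UTF8Encoding $false
[IO.File]::WriteAllText("$PWD\test.txt", $newMsgs + $data, $utf8NoBom)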