how to use variable in parameter name - powershell

I am trying to read multiple files in a PowerShell script in a for loop, switching the file name dynamically on each iteration.
I have tried different notations, but I'm still getting errors.
The code looks like this:
$fileName1 = "D:\file333"
$fileName2 = "D:\file444"
$datePattern =" C.{3} "
[Char[]]$buffer = new-object char[] 10000
for ($j=1; $j -le 2; $j++)
{
[string]$fileNumber = $j.ToString()
$inFile = new-object -TypeName System.IO.StreamReader -ArgumentList
$fileName($fileNumber)
[int]$bytesRead = $inFile.Read($buffer, 0, $buffer.Length)
while ($bytesRead -gt 0) {
[string]$bufferString = -join $buffer
$results = $bufferString | Select-String $datePattern -AllMatches
$results.Matches.Value
[int]$bytesRead = $inFile.Read($buffer, 0, $buffer.Length)
}
}
I have two questions:
What would be the right notation for -ArgumentList to get $fileName1 and then $fileName2?
Does the Read method applied to $inFile really require a Char[] argument for $buffer? Is it possible to get a [String] instead?
Can someone give me a hint?

Okay,
I'm sure you were reminded when you first submitted the question, but please remember to put your code in a code block, like so:
$fileName1 = "D:\file333"
$fileName2 = "D:\file444"
$datePattern =" C.{3} "
[Char[]]$buffer = new-object char[] 10000
for ($j=1; $j -le 2; $j++)
{
[string]$fileNumber = $j.ToString()
$inFile = new-object -TypeName System.IO.StreamReader -ArgumentList
$fileName($fileNumber)
[int]$bytesRead = $inFile.Read($buffer, 0, $buffer.Length)
while ($bytesRead -gt 0)
{
[string]$bufferString = -join $buffer
$results = $bufferString | Select-String $datePattern -AllMatches
$results.Matches.Value
[int]$bytesRead = $inFile.Read($buffer, 0, $buffer.Length)
}
}
Now, I'm REALLY unsure of what you're trying to do here. Whatever it is, I strongly suggest you create a well-founded question around it.
In any case, PowerShell doesn't support the $fileName($fileNumber) notation you're trying to use. It's an easy newbie trap to fall into: I want to read files 1 and 2, so I make a loop and use the iterator, right? Wrong!
You need to loop over a collection containing the files. PowerShell doesn't treat the files as part of a collection unless you put them in one.
Try this.
$files = @("D:\file333", "D:\file444")
$datePattern = " C.{3} "
[Char[]]$buffer = New-Object char[] 10000
foreach ($file in $files)
{
# Open a reader for the current file
$inFile = New-Object -TypeName System.IO.StreamReader -ArgumentList $file
[int]$bytesRead = $inFile.Read($buffer, 0, $buffer.Length)
while ($bytesRead -gt 0)
{
# Convert only the characters actually read, not stale buffer content
[string]$bufferString = [string]::new($buffer, 0, $bytesRead)
$results = $bufferString | Select-String $datePattern -AllMatches
$results.Matches.Value
[int]$bytesRead = $inFile.Read($buffer, 0, $buffer.Length)
}
$inFile.Dispose()
}
Here, we put the files into a collection and iterate over that collection with a foreach loop. Each iteration opens its own StreamReader for the current file and reads it into the buffer in chunks.
I have NO IDEA what you're trying to do in the while loop, so I'd be happy to help you make it a bit more powershell friendly, but you'll have to ask the question you mean to ask before I can do so.
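For completeness, the two questions as asked can also be addressed directly. The sketch below is illustrative only: it builds the variable name at run time with Get-Variable (a collection is usually cleaner), and it reads the file with ReadLine(), which returns a [string] instead of filling a Char[] buffer. The temporary file and its contents are made up for the demonstration.

```powershell
# Question 1: look up $fileName1, $fileName2, ... by a computed name.
$fileName1 = "D:\file333"
$fileName2 = "D:\file444"
$resolvedNames = for ($j = 1; $j -le 2; $j++) {
    Get-Variable -Name "fileName$j" -ValueOnly
}

# Question 2: ReadLine() returns a string, so no Char[] buffer is needed.
# A temporary file with made-up content stands in for the real input.
$sample = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllLines($sample, [string[]]@('a Cabc b', 'x Cxyz y', 'no match'))

$datePattern = " C.{3} "
$inFile = New-Object -TypeName System.IO.StreamReader -ArgumentList $sample
try {
    $found = while ($null -ne ($line = $inFile.ReadLine())) {
        $result = $line | Select-String $datePattern -AllMatches
        if ($result) { $result.Matches.Value }
    }
}
finally {
    $inFile.Dispose()
    Remove-Item -LiteralPath $sample
}
$found
```

Reading line by line means a match can never be split across two buffer reads, which the fixed-size Char[] approach does not guarantee.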

How to sort 30Million csv records in Powershell

I am using an OleDbConnection to sort the first column of a CSV file. The OLE DB query completes successfully for up to 9 million records within 6 minutes. But when I run it against 10 million records, I get the following error message:
Exception calling "ExecuteReader" with "0" argument(s): "The query cannot be completed. Either the size of the query result is larger than the maximum size of a database (2 GB), or there is not enough temporary storage space on the disk to store the query result."
Is there any other solution to sort 30 million records using PowerShell?
Here is my script:
$OutputFile = "D:\Performance_test_data\output1.csv"
$stream = [System.IO.StreamWriter]::new( $OutputFile )
$sb = [System.Text.StringBuilder]::new()
$sw = [Diagnostics.Stopwatch]::StartNew()
$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source='D:\Performance_test_data\';Extended Properties='Text;HDR=Yes;CharacterSet=65001;FMT=Delimited';")
$cmd=$conn.CreateCommand()
$cmd.CommandText="Select * from 1crores.csv order by col6"
$conn.open()
$data = $cmd.ExecuteReader()
echo "Query has been completed!"
$stream.WriteLine( "col1,col2,col3,col4,col5,col6")
while ($data.read())
{
$stream.WriteLine( $data.GetValue(0) +',' + $data.GetValue(1)+',' + $data.GetValue(2)+',' + $data.GetValue(3)+',' + $data.GetValue(4)+',' + $data.GetValue(5))
}
echo "data written successfully!!!"
$stream.close()
$sw.Stop()
$sw.Elapsed
$cmd.Dispose()
$conn.Dispose()
You can try using this:
$CSVPath = 'C:\test\CSVTest.csv'
$Delimiter = ';'
# list we use to hold the results
$ResultList = [System.Collections.Generic.List[Object]]::new()
# Create a stream (I use OpenText because it returns a streamreader)
$File = [System.IO.File]::OpenText($CSVPath)
# Read and parse the header
$HeaderString = $File.ReadLine()
# Get the properties from the string, replace quotes
$Properties = $HeaderString.Split($Delimiter).Replace('"',$null)
$PropertyCount = $Properties.Count
# now read the rest of the data, parse it, build an object and add it to a list
while ($File.EndOfStream -ne $true)
{
# Read the line
$Line = $File.ReadLine()
# split the fields and replace the quotes
$LineData = $Line.Split($Delimiter).Replace('"',$null)
# Create a hashtable with the properties (we convert this to a PSCustomObject later on). I use an ordered hashtable to keep the order
$PropHash = [System.Collections.Specialized.OrderedDictionary]@{}
# for loop to add the properties and values
for ($i = 0; $i -lt $PropertyCount; $i++)
{
$PropHash.Add($Properties[$i],$LineData[$i])
}
# Now convert the data to a PSCustomObject and add it to the list
$ResultList.Add($([PSCustomObject]$PropHash))
}
# Now you can sort this list using Linq:
Add-Type -AssemblyName System.Linq
# Sort using propertyname (my sample data had a prop called "Name")
$Sorted = [Linq.Enumerable]::OrderBy($ResultList, [Func[object,string]] { $args[0].Name })
Instead of using import-csv I've written a quick parser which uses a streamreader and parses the CSV data on the fly and puts it in a PSCustomObject.
This is then added to a list.
edit: fixed the linq sample
Putting performance aside, and to at least come to a solution that works (meaning one that doesn't hang due to memory shortage), I would rely on the PowerShell pipeline. The issue, though, is that sorting objects stalls the pipeline, because the last object might potentially become the first object.
To resolve this part, I would first do a coarse division on the first character(s) of the property concerned. Once that is done, fine sort each coarse division and append the results:
Function Sort-BigObject {
[CmdletBinding()] param(
[Parameter(ValueFromPipeLine = $True)]$InputObject,
[Parameter(Position = 0)][String]$Property,
[ValidateRange(1,9)]$Coarse = 1,
[System.Text.Encoding]$Encoding = [System.Text.Encoding]::Default
)
Begin {
$TemporaryFiles = [System.Collections.SortedList]::new()
}
Process {
if ($InputObject.$Property) {
$Grain = $InputObject.$Property.SubString(0, $Coarse)
if (!$TemporaryFiles.Contains($Grain)) { $TemporaryFiles[$Grain] = New-TemporaryFile }
$InputObject | Export-Csv $TemporaryFiles[$Grain] -Encoding $Encoding -Append
} else { $InputObject } # objects without a sort value are passed through immediately
}
End {
Foreach ($TemporaryFile in $TemporaryFiles.Values) {
Import-Csv $TemporaryFile -Encoding $Encoding | Sort-Object $Property
Remove-Item -LiteralPath $TemporaryFile
}
}
}
Usage
(Don't assign the stream to a variable and don't use parentheses.)
Import-Csv .\1crores.csv | Sort-BigObject <PropertyName> | Export-Csv .\output.csv
If the temporary files still get too big to handle, you might need to increase the -Coarse parameter
Caveats (improvement considerations)
Objects with an empty sort property are output immediately.
The sort column is presumed to be a (single) string column.
I presume the performance is poor (I didn't do a full test on 30 million records, but 10,000 records take about 8 seconds, which extrapolates linearly to roughly 7 hours). Consider replacing the native PowerShell cmdlets with .NET streaming methods, buffering or caching file input and output, or parallel processing.
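The idea of swapping the cmdlets for .NET streaming methods could be sketched as follows. This is a simplified illustration under several assumptions, not a drop-in replacement for Sort-BigObject: Sort-LinesByBucket is a hypothetical name, it sorts whole lines rather than a CSV column (header handling is left out), it uses ordinal ordering so that bucket order and line order agree, it keeps one open writer per distinct first character, and it expects absolute paths because .NET file APIs do not follow the PowerShell location.

```powershell
# Hypothetical helper: sorts the lines of a text file in ordinal order.
function Sort-LinesByBucket {
    param(
        [string]$InputFile,   # absolute path assumed
        [string]$OutputFile   # absolute path assumed
    )
    $ordinal = [System.StringComparer]::Ordinal
    # Bucket key (first character) -> temporary file path, kept in key order
    $paths   = [System.Collections.Generic.SortedDictionary[string, string]]::new($ordinal)
    $writers = [System.Collections.Generic.Dictionary[string, System.IO.StreamWriter]]::new($ordinal)

    # Pass 1: coarse division, streaming each line into its bucket file
    $reader = [System.IO.StreamReader]::new($InputFile)
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            $key = if ($line.Length -gt 0) { $line.Substring(0, 1) } else { '' }
            if (-not $writers.ContainsKey($key)) {
                $paths[$key]   = [System.IO.Path]::GetTempFileName()
                $writers[$key] = [System.IO.StreamWriter]::new($paths[$key])
            }
            $writers[$key].WriteLine($line)
        }
    }
    finally {
        $reader.Dispose()
        foreach ($writer in $writers.Values) { $writer.Dispose() }
    }

    # Pass 2: fine sort, one bucket at a time, appended in key order
    $output = [System.IO.StreamWriter]::new($OutputFile)
    try {
        foreach ($path in $paths.Values) {
            $lines = [System.IO.File]::ReadAllLines($path)
            [Array]::Sort($lines, $ordinal)
            foreach ($sortedLine in $lines) { $output.WriteLine($sortedLine) }
            Remove-Item -LiteralPath $path
        }
    }
    finally {
        $output.Dispose()
    }
}
```

Only one bucket is held in memory at a time, so the peak memory use is bounded by the largest bucket rather than by the whole file.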
You could try SQLite:
$OutputFile = "D:\Performance_test_data\output1.csv"
$sw = [Diagnostics.Stopwatch]::StartNew()
sqlite3 output1.db '.mode csv' '.import 1crores.csv 1crores' '.headers on' ".output $OutputFile" 'Select * from 1crores order by 最終アクセス日時'
echo "data written successfully!!!"
$sw.Stop()
$sw.Elapsed
I have added a new answer as this is a completely different approach to tackle this issue.
Instead of creating temporary files (which presumably causes a lot of file opens and closes), you might consider creating an ordered list of indices and then going over the input file (-InputFile) multiple times, each time processing a selected number of lines (-BufferSize = 1Gb; you might have to tweak this "memory usage vs. performance" parameter):
Function Sort-Csv {
[CmdletBinding()] param(
[string]$InputFile,
[String]$Property,
[string]$OutputFile,
[Char]$Delimiter = ',',
[System.Text.Encoding]$Encoding = [System.Text.Encoding]::Default,
[Int]$BufferSize = 1Gb
)
Begin {
if ($InputFile.StartsWith('.\')) { $InputFile = Join-Path (Get-Location) $InputFile }
$Index = 0
$Dictionary = [System.Collections.Generic.SortedDictionary[string, [Collections.Generic.List[Int]]]]::new()
Import-Csv $InputFile -Delimiter $Delimiter -Encoding $Encoding | Foreach-Object {
if (!$Dictionary.ContainsKey($_.$Property)) { $Dictionary[$_.$Property] = [Collections.Generic.List[Int]]::new() }
$Dictionary[$_.$Property].Add($Index++)
}
$Indices = [int[]]($Dictionary.Values | ForEach-Object { $_ })
$Dictionary = $Null # we only need the sorted index list
}
Process {
$Start = 0
$ChunkSize = [int]($BufferSize / (Get-Item $InputFile).Length * $Indices.Count / 2.2)
While ($Start -lt $Indices.Count) {
[System.GC]::Collect()
$End = $Start + $ChunkSize - 1
if ($End -ge $Indices.Count) { $End = $Indices.Count - 1 }
$Chunk = @{}
For ($i = $Start; $i -le $End; $i++) { $Chunk[$Indices[$i]] = $i }
$Reader = [System.IO.StreamReader]::new($InputFile, $Encoding)
$Header = $Reader.ReadLine()
$i = $Start
$Count = 0
For ($i = 0; ($Line = $Reader.ReadLine()) -and $Count -lt $ChunkSize; $i++) {
if ($Chunk.Contains($i)) { $Chunk[$i] = $Line }
}
$Reader.Dispose()
if ($OutputFile) {
if ($OutputFile.StartsWith('.\')) { $OutputFile = Join-Path (Get-Location) $OutputFile }
$Writer = [System.IO.StreamWriter]::new($OutputFile, ($Start -ne 0), $Encoding)
if ($Start -eq 0) { $Writer.WriteLine($Header) }
For ($i = $Start; $i -le $End; $i++) { $Writer.WriteLine($Chunk[$Indices[$i]]) }
$Writer.Dispose()
} else {
$Start..$End | ForEach-Object { $Header } { $Chunk[$Indices[$_]] } | ConvertFrom-Csv -Delimiter $Delimiter
}
$Chunk = $Null
$Start = $End + 1
}
}
}
Basic usage
Sort-Csv .\Input.csv <PropertyName> -Output .\Output.csv
Sort-Csv .\Input.csv <PropertyName> | ... | Export-Csv .\Output.csv
Note that for 1crores.csv it will probably just export the full file at once unless you set -BufferSize to a lower amount, e.g. 500Kb.
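The index-building step inside Sort-Csv can be looked at in isolation. The following minimal sketch uses a made-up list of sort-column values in place of the Import-Csv pipeline; everything else mirrors the Begin block above.

```powershell
# Made-up sort-column values, in file order (rows 0, 1, 2, 3)
$values = 'pear', 'apple', 'fig', 'apple'

$Index = 0
$Dictionary = [System.Collections.Generic.SortedDictionary[string, [Collections.Generic.List[Int]]]]::new()
foreach ($value in $values) {
    # Each key collects the row numbers that carry that value
    if (!$Dictionary.ContainsKey($value)) { $Dictionary[$value] = [Collections.Generic.List[Int]]::new() }
    $Dictionary[$value].Add($Index++)
}
# Flatten the per-key row numbers in key order
$Indices = [int[]]($Dictionary.Values | ForEach-Object { $_ })
$Indices -join ','
```

The result lists rows 1 and 3 (both "apple") first, then row 2, then row 0. Only these integers need to stay in memory while the file itself is re-read in chunks.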
I downloaded gnu sort.exe from here: http://gnuwin32.sourceforge.net/packages/coreutils.htm It also requires libiconv2.dll and libintl3.dll from the dependency zip. I basically did this within cmd.exe, and it used a little less than a gig of ram and took about 5 minutes. It's a 500 meg file of about 30 million random numbers. This command can also merge sorted files with --merge. You can also specify begin and end key position for sorting --key. It automatically uses temp files.
.\sort.exe < file1.csv > file2.csv
Actually it works in a similar way with the windows sort from the cmd prompt. The windows sort also has a /+n option to specify what character column to start the sort by.
sort.exe < file1.csv > file2.csv

Powershell: exclusive lock for a file during multiple set-content and get-content operations

I am trying to write a script which runs on multiple client machines and writes to a single text file on a network share.
I want to ensure that only one machine can manipulate the file at any one time, whilst the other machines run a loop to check whether the file is available.
The script runs this first:
Set-Content -Path $PathToHost -Value (get-content -Path $PathToHost | Select-String -Pattern "$HostName " -NotMatch) -ErrorAction Stop
Which removes some lines if they are matching the criteria. Then I want to append a new line with this:
Add-Content $PathToHost "$heartbeat$_" -ErrorAction Stop
The problem is that between the execution of those two commands another client has access to the file and begins to write to the file as well.
I have explored the solution here: Locking the file while writing in PowerShell
$path = "C:\file.txt"
$mode = "Open"
$access = "ReadWrite"
$share = "None"
$file = [System.IO.File]::Open($path, $mode, $access, $share)
$file.close()
Which can definitely lock the file, but I am not sure how to proceed to then read and write to the file.
Any help is much appreciated.
EDIT: Solution as below thanks to twinlakes' answer
$path = "C:\Users\daniel_mladenov\hostsTEST.txt"
$mode = "Open"
$access = "ReadWrite"
$share = "none"
$file = [System.IO.File]::Open($path, $mode, $access, $share)
$fileread = [System.IO.StreamReader]::new($file, [Text.Encoding]::UTF8)
# Counts number of lines in file
$imax=0
while ($fileread.ReadLine() -ne $null){
$imax++
}
echo $imax
#resets read position to beginning
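$fileread.basestream.position = 0
$fileread.DiscardBufferedData() # clear the reader's internal buffer after moving the stream position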
#reads content of whole file and discards matching lines
$content=@()
for ($i=0; $i -lt $imax; $i++){
$ContentLine = $fileread.ReadLine()
If($ContentLine -notmatch "$HostIP\s" -and $ContentLine -notmatch "$HostName\s"){
$content += $ContentLine
}
}
echo $content
#Writes remaining lines back to file
$filewrite = [System.IO.StreamWriter]::new($file)
$filewrite.basestream.position = 0
for ($i=0; $i -lt $content.length; $i++){
$filewrite.WriteLine($content[$i])
}
$filewrite.WriteLine($heartbeat)
$filewrite.Flush()
$file.SetLength($file.Position) #trims file to the content which has been written, discarding any content past that point
$file.close()
$file is a System.IO.FileStream object. You will need to call the write method on that object, which requires a byte array.
$string = 'text to write' # the string to write to the file
$bytes = [Text.Encoding]::UTF8.GetBytes($string)
$file.Write($bytes, 0, $bytes.Length)
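The "other machines run a loop" part of the question could be handled by retrying the exclusive open until it succeeds. The following is a minimal sketch under assumptions: Open-ExclusiveFile is a hypothetical helper name, the retry count and delay are arbitrary, and the demonstration uses a temporary file with made-up host entries instead of the network share.

```powershell
# Hypothetical helper: keeps trying until the file can be opened with an exclusive lock.
function Open-ExclusiveFile {
    param(
        [string]$Path,
        [int]$MaxAttempts = 50,          # assumed limits; tune for your environment
        [int]$DelayMilliseconds = 200
    )
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            # FileShare 'None' keeps every other handle out until this one is closed
            return [System.IO.File]::Open($Path, 'Open', 'ReadWrite', 'None')
        }
        catch [System.IO.IOException] {
            # Most likely another client holds the lock: wait and retry
            Start-Sleep -Milliseconds $DelayMilliseconds
        }
    }
    throw "Could not lock '$Path' after $MaxAttempts attempts."
}

# Demonstration on a temporary file with made-up host entries
$path = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllLines($path, [string[]]@('hostA alive', 'hostB alive'))
$HostName  = 'hostA'
$heartbeat = 'hostA alive again'

$file = Open-ExclusiveFile -Path $path
try {
    # Read everything and keep the lines that do not belong to this host
    $reader = [System.IO.StreamReader]::new($file)
    $kept = @(while ($null -ne ($line = $reader.ReadLine())) {
        if ($line -notmatch "^$([regex]::Escape($HostName))\s") { $line }
    })
    # Rewrite the file from the start while still holding the lock
    $file.SetLength(0)
    $file.Position = 0
    $writer = [System.IO.StreamWriter]::new($file)
    foreach ($line in $kept) { $writer.WriteLine($line) }
    $writer.WriteLine($heartbeat)
    $writer.Flush()
}
finally {
    $file.Dispose()   # releases the lock
}
```

Because the read, the filtering, and the write all happen on the same handle, no other client can slip in between the two steps.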

Search large .log for specific string quickly without streamreader

Problem: I need to search a large log file that is currently being used by another process. I cannot stop this other process or put a lock on the .log file. I need to quickly search this file, and I can't read it all into memory. I get that StreamReader() is the fastest, but I can't figure out how to avoid it attempting to grab a lock on the file.
$p = "Seachterm:Search"
$files = "\\remoteserver\c\temp\tryingtofigurethisout.log"
$SearchResult= Get-Content -Path $files | Where-Object { $_ -eq $p }
The below doesn't work because I can't get a lock of the file.
$reader = New-Object System.IO.StreamReader($files)
$lines = @()
if ($reader -ne $null) {
while (!$reader.EndOfStream) {
$line = $reader.ReadLine()
if ($line.Contains($p)) {
$lines += $line
}
}
}
$lines | Select-Object -Last 1
This takes too long:
get-content $files -ReadCount 500 |
foreach { $_ -match $p }
I would greatly appreciate any pointers in how I can go about quickly and efficiently (memory wise) searching a large log file.
Perhaps this will work for you. It reads the lines of the file as fast as possible, much like your second approach (which is roughly what [System.IO.File]::ReadAllLines() would do), but it opens the file with a different sharing mode.
To collect the lines, I use a List object, which performs faster than appending to an array using +=.
$p = "Seachterm:Search"
$path = "\\remoteserver\c$\temp\tryingtofigurethisout.log"
if (!(Test-Path -Path $path -PathType Leaf)) {
Write-Warning "File '$path' does not exist"
}
else {
try {
$fileStream = [System.IO.FileStream]::new($path, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
$streamReader = [System.IO.StreamReader]::new($fileStream)
# or use this syntax:
# $fileMode = [System.IO.FileMode]::Open
# $fileAccess = [System.IO.FileAccess]::Read
# $fileShare = [System.IO.FileShare]::ReadWrite
# $fileStream = New-Object -TypeName System.IO.FileStream $path, $fileMode, $fileAccess, $fileShare
# $streamReader = New-Object -TypeName System.IO.StreamReader -ArgumentList $fileStream
# use a List object of type String or an ArrayList to collect the strings quickly
$lines = New-Object System.Collections.Generic.List[string]
# read the lines as fast as you can and add them to the list
while (!$streamReader.EndOfStream) {
$lines.Add($streamReader.ReadLine())
}
# close and dispose the objects used
$streamReader.Close()
$fileStream.Dispose()
# do the 'Contains($p)' after reading the file to not slow that part down
$lines.ToArray() | Where-Object { $_.Contains($p) } | Select-Object -Last 1
}
catch [System.IO.IOException] { Write-Warning $_.Exception.Message }
}
Basically, it does what your second code does, but with the difference that using just the StreamReader, the file is opened with [System.IO.FileShare]::Read, whereas this code opens the file with [System.IO.FileShare]::ReadWrite
Note that there may be exceptions thrown using this because another application has write permissions to the file, hence the try{...} catch{...}
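If holding every line in a list is too much for the available memory, the same FileShare.ReadWrite open can be combined with a test on each line as it is read, keeping only the latest hit. A minimal sketch, assuming only the last match is needed; Find-LastMatchingLine is a hypothetical name and the temporary log content is made up.

```powershell
# Hypothetical helper: returns the last line containing $Term, or $null if none does.
function Find-LastMatchingLine {
    param([string]$Path, [string]$Term)
    # ReadWrite sharing lets the process that owns the log keep writing to it
    $fileStream = [System.IO.FileStream]::new($Path, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
    $reader = [System.IO.StreamReader]::new($fileStream)
    try {
        $last = $null
        while ($null -ne ($line = $reader.ReadLine())) {
            # Only one line at a time is held in memory
            if ($line.Contains($Term)) { $last = $line }
        }
        return $last
    }
    finally {
        $reader.Dispose()   # also closes the underlying FileStream
    }
}

# Demonstration on a temporary file with made-up log lines
$log = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllLines($log, [string[]]@('start', 'Seachterm:Search first', 'noise', 'Seachterm:Search last', 'end'))
$hit = Find-LastMatchingLine -Path $log -Term 'Seachterm:Search'
$hit
```

This trades the separate filtering step for constant memory use; whether it is faster in practice depends on the file and on how often lines match.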
Hope that helps

Using Powershell to Print a Folder of Text files to PDF (Retaining the Original Base name)

First time posting - but I think this is a good one as I've spent 2 days researching, talked with local experts, and still haven't found this done.
Individual print jobs must be regularly initiated on a large set of files (.txt files), and this must be converted through the print job to a local file (i.e. through a PDF printer) which retains the original base name for each file. Further, the script must be highly portable.
The objective will not be met if the file is simply converted (and not printed), the original base file name is not retained, or the print process requires manual interaction at each print.
After my research, this is what stands so far in PowerShell:
PROBLEM: This script does everything but actually print the contents of the file.
It iterates through the files, and "prints" a .pdf while retaining the original file name base; but the .pdf is empty.
I know I'm missing something critical (i.e. maybe a stream use?); but after searching and searching have not been able to find it. Any help is greatly appreciated.
As mentioned in the code, the heart of the print function is gathered from this post:
# The heart of this script (ConvertTo-PDF) is largely taken and slightly modified from https://social.technet.microsoft.com/Forums/ie/en-US/04ddfe8c-a07f-4d9b-afd6-04b147f59e28/automating-printing-to-pdf?forum=winserverpowershell
# The $OutputFolder variable can be disregarded at the moment. It is an added bonus, and a work in progress, but not critical to the objective.
function ConvertTo-PDF {
param(
$TextDocumentPath, $OutputFolder
)
Write-Host "TextDocumentPath = $TextDocumentPath"
Write-Host "OutputFolder = $OutputFolder"
Add-Type -AssemblyName System.Drawing
$doc = New-Object System.Drawing.Printing.PrintDocument
$doc.DocumentName = $TextDocumentPath
$doc.PrinterSettings = new-Object System.Drawing.Printing.PrinterSettings
$doc.PrinterSettings.PrinterName = 'Microsoft Print to PDF'
$doc.PrinterSettings.PrintToFile = $true
$file=[io.fileinfo]$TextDocumentPath
Write-Host "file = $file"
$pdf= [io.path]::Combine($file.DirectoryName, $file.BaseName) + '.pdf'
Write-Host "pdf = $pdf"
$doc.PrinterSettings.PrintFileName = $pdf
$doc.Print()
Write-Host "Attempted Print: $pdf"
$doc.Dispose()
}
# get the relative path of the TestFiles and OutputFolder folders.
$scriptPath = split-path -parent $MyInvocation.MyCommand.Definition
Write-Host "scriptPath = $scriptPath"
$TestFileFolder = "$scriptPath\TestFiles\"
Write-Host "TestFileFolder = $TestFileFolder"
$OutputFolder = "$scriptPath\OutputFolder\"
Write-Host "OutputFolder = $OutputFolder"
# initialize the files variable with content of the TestFiles folder (relative to the script location).
$files = Get-ChildItem -Path $TestFileFolder
# Send each test file to the print job
foreach ($testFile in $files)
{
$testFile = "$TestFileFolder$testFile"
Write-Host "Attempting Print from: $testFile"
Write-Host "Attempting Print to : $OutputFolder"
ConvertTo-PDF $testFile $OutputFolder
}
You are missing a handler that reads the text file and passes the text to the printer. It is defined as a scriptblock like this:
$PrintPageHandler =
{
param([object]$sender, [System.Drawing.Printing.PrintPageEventArgs]$ev)
# More code here - see below for details
}
and is added to the PrintDocument object like this:
$doc.add_PrintPage($PrintPageHandler)
The full code that you need is below:
$PrintPageHandler =
{
param([object]$sender, [System.Drawing.Printing.PrintPageEventArgs]$ev)
$linesPerPage = 0
$yPos = 0
$count = 0
$leftMargin = $ev.MarginBounds.Left
$topMargin = $ev.MarginBounds.Top
$line = $null
$printFont = New-Object System.Drawing.Font "Arial", 10
# Calculate the number of lines per page.
$linesPerPage = $ev.MarginBounds.Height / $printFont.GetHeight($ev.Graphics)
# Print each line of the file.
while ($count -lt $linesPerPage -and (($line = $streamToPrint.ReadLine()) -ne $null))
{
$yPos = $topMargin + ($count * $printFont.GetHeight($ev.Graphics))
$ev.Graphics.DrawString($line, $printFont, [System.Drawing.Brushes]::Black, $leftMargin, $yPos, (New-Object System.Drawing.StringFormat))
$count++
}
# If more lines exist, print another page.
if ($line -ne $null)
{
$ev.HasMorePages = $true
}
else
{
$ev.HasMorePages = $false
}
}
function Out-Pdf
{
param($InputDocument, $OutputFolder)
Add-Type -AssemblyName System.Drawing
$doc = New-Object System.Drawing.Printing.PrintDocument
$doc.DocumentName = $InputDocument.FullName
$doc.PrinterSettings = New-Object System.Drawing.Printing.PrinterSettings
$doc.PrinterSettings.PrinterName = 'Microsoft Print to PDF'
$doc.PrinterSettings.PrintToFile = $true
$streamToPrint = New-Object System.IO.StreamReader $InputDocument.FullName
$doc.add_PrintPage($PrintPageHandler)
$doc.PrinterSettings.PrintFileName = "$($InputDocument.DirectoryName)\$($InputDocument.BaseName).pdf"
$doc.Print()
$streamToPrint.Close()
}
Get-Childitem -Path "$PSScriptRoot\TextFiles" -File -Filter "*.txt" |
ForEach-Object { Out-Pdf $_ $_.Directory }
Incidentally, this is based on the official Microsoft C# example for the PrintDocument class.

Changing the Delimiter in a large CSV file using Powershell

I am in need of a way to change the delimiter in a CSV file from a comma to a pipe. Because of the size of the CSV files (~750 Mb to several Gb), using Import-CSV and/or Get-Content is not an option. What I'm using (and what works, albeit slowly) is the following code:
$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")
While(!$reader.EndOfData)
{
$line = $reader.ReadFields()
$details = [ordered]@{
"Plugin ID" = $line[0]
CVE = $line[1]
CVSS = $line[2]
Risk = $line[3]
}
$export = New-Object PSObject -Property $details
$export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"
}
This little loop took nearly 2 minutes to process a 20 Mb file. Scaling up at this speed would mean over an hour for the smallest CSV file I'm currently working with.
I've tried this as well:
While(!$reader.EndOfData)
{
$line = $reader.ReadFields()
$details = [ordered]@{
# Same data as before
}
$export.Add($details) | Out-Null
}
$export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"
This is MUCH FASTER but doesn't provide the right information in the new CSV. Instead I get rows and rows of this:
"Count"|"IsReadOnly"|"Keys"|"Values"|"IsFixedSize"|"SyncRoot"|"IsSynchronized"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"
So, two questions:
1) Can the first block of code be made faster?
2) How can I unwrap the arraylist in the second example to get to the actual data?
EDIT: Sample data found here - http://pastebin.com/6L98jGNg
This is simple text-processing, so the bottleneck should be disk read speed:
1 second per 100 MB or 10 seconds per 1GB for the OP's sample (repeated to the mentioned size) as measured here on i7. The results would be worse for files with many/all small quoted fields.
The algorithm is simple:
1. Read the file in big string chunks, e.g. 1 MB. It's much faster than reading millions of lines separated by CR/LF because:
   - fewer checks are performed, as we mostly look only for doublequotes;
   - fewer iterations of our code are executed by the interpreter, which is slow.
2. Find the next doublequote.
3. Depending on the current $inQuotedField flag, decide whether the found doublequote starts a quoted field (it should be preceded by , plus optionally some spaces) or ends the current quoted field (it should be followed by any even number of doublequotes, optionally spaces, then ,).
4. Replace delimiters in the preceding span, or up to the end of the 1 MB chunk if no quotes were found.
The code makes some reasonable assumptions but it may fail to detect an escaped field if its doublequote is followed or preceded by more than 3 spaces before/after field delimiter. The checks won't be too hard to add, and I might've missed some other edge case, but I'm not that interested.
$sourcePath = 'c:\path\file.csv'
$targetPath = 'd:\path\file2.csv'
$targetEncoding = [Text.UTF8Encoding]::new($false) # no BOM
$delim = [char]','
$newDelim = [char]'|'
$buf = [char[]]::new(1MB)
$sourceBase = [IO.FileStream]::new(
$sourcePath,
[IO.FileMode]::open,
[IO.FileAccess]::read,
[IO.FileShare]::read,
$buf.length, # let OS prefetch the next chunk in background
[IO.FileOptions]::SequentialScan)
$source = [IO.StreamReader]::new($sourceBase, $true) # autodetect encoding
$target = [IO.StreamWriter]::new($targetPath, $false, $targetEncoding, $buf.length)
$bufStart = 0
$bufPadding = 4
$inQuotedField = $false
$fieldBreak = [char[]]@($delim, "`r", "`n")
$out = [Text.StringBuilder]::new($buf.length)
while ($nRead = $source.Read($buf, $bufStart, $buf.length-$bufStart)) {
$s = [string]::new($buf, 0, $nRead+$bufStart)
$len = $s.length
$pos = 0
$out.Clear() >$null
do {
$iQuote = $s.IndexOf([char]'"', $pos)
if ($inQuotedField) {
$iDelim = if ($iQuote -ge 0) { $s.IndexOf($delim, $iQuote+1) }
if ($iDelim -eq -1 -or $iQuote -le 0 -or $iQuote -ge $len - $bufPadding) {
# no closing quote in buffer safezone
$out.Append($s.Substring($pos, $len-$bufPadding-$pos)) >$null
break
}
if ($s.Substring($iQuote, $iDelim-$iQuote+1) -match "^(""+)\s*$delim`$") {
# even number of quotes are just quoted quotes
$inQuotedField = $matches[1].length % 2 -eq 0
}
$out.Append($s.Substring($pos, $iDelim-$pos+1)) >$null
$pos = $iDelim + 1
continue
}
if ($iQuote -ge 0) {
$iDelim = $s.LastIndexOfAny($fieldBreak, $iQuote)
if (!$s.Substring($iDelim+1, $iQuote-$iDelim-1).Trim()) {
$inQuotedField = $true
}
$replaced = $s.Substring($pos, $iQuote-$pos+1).Replace($delim, $newDelim)
} elseif ($pos -gt 0) {
$replaced = $s.Substring($pos).Replace($delim, $newDelim)
} else {
$replaced = $s.Replace($delim, $newDelim)
}
$out.Append($replaced) >$null
$pos = $iQuote + 1
} while ($iQuote -ge 0)
$target.Write($out)
$bufStart = 0
for ($i = $out.length; $i -lt $s.length; $i++) {
$buf[$bufStart++] = $buf[$i]
}
}
if ($bufStart) { $target.Write($buf, 0, $bufStart) }
$source.Close()
$target.Close()
Still not what I would call fast, but using the -join operator is considerably faster than what you have listed:
$reader = New-Object Microsoft.VisualBasic.fileio.textfieldparser $source
$reader.SetDelimiters(",")
While(!$reader.EndOfData){
$line = $reader.ReadFields()
$line -join '|' | Add-Content C:\Temp\TestOutput.csv
}
That took a hair under 32 seconds to process a 20MB file. At that rate your 750MB file would be done in under 20 minutes, and bigger files should go at about 26 minutes per gig.
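On the second question in the post (the rows of "Count"|"IsReadOnly"|...), the output appears to describe the dictionaries themselves rather than their contents, because ordered dictionaries were added to the list as-is. Casting each one to [pscustomobject] before adding it turns the keys into columns. A minimal sketch with made-up rows standing in for what ReadFields() returns:

```powershell
# Made-up rows standing in for what $reader.ReadFields() would return
$rows = @(
    @('10001', 'CVE-0000-0001', '7.5', 'High'),
    @('10002', 'CVE-0000-0002', '2.1', 'Low')
)

$export = [System.Collections.Generic.List[object]]::new()
foreach ($line in $rows) {
    $details = [ordered]@{
        'Plugin ID' = $line[0]
        CVE         = $line[1]
        CVSS        = $line[2]
        Risk        = $line[3]
    }
    # Cast the dictionary to an object so its keys become CSV columns
    $export.Add([pscustomobject]$details)
}

# ConvertTo-Csv shows the result; Export-Csv accepts the same -Delimiter parameter
$csv = $export | ConvertTo-Csv -Delimiter '|' -NoTypeInformation
$csv
```

Collecting all objects in a list still holds the whole file in memory, so for the multi-gigabyte files mentioned in the question a streaming approach remains the safer choice.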