How to sort 30Million csv records in Powershell - powershell

I am using oledbconnection to sort the first column of csv file. Oledb connection is executed up to 9 million records within 6 min duration successfully. But when am executing 10 million records, getting following alert message.
Exception calling "ExecuteReader" with "0" argument(s): "The query cannot be completed. Either the size of the query result is larger than the maximum size of a database (2 GB), or
there is not enough temporary storage space on the disk to store the query result."
is there any other solution to sort 30 million using Powershell?
here is my script
$OutputFile = "D:\Performance_test_data\output1.csv"
$stream = [System.IO.StreamWriter]::new( $OutputFile )
$sb = [System.Text.StringBuilder]::new()
$sw = [Diagnostics.Stopwatch]::StartNew()
$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source='D:\Performance_test_data\';Extended Properties='Text;HDR=Yes;CharacterSet=65001;FMT=Delimited';")
$cmd=$conn.CreateCommand()
$cmd.CommandText="Select * from 1crores.csv order by col6"
$conn.open()
$data = $cmd.ExecuteReader()
echo "Query has been completed!"
$stream.WriteLine( "col1,col2,col3,col4,col5,col6")
while ($data.read())
{
$stream.WriteLine( $data.GetValue(0) +',' + $data.GetValue(1)+',' + $data.GetValue(2)+',' + $data.GetValue(3)+',' + $data.GetValue(4)+',' + $data.GetValue(5))
}
echo "data written successfully!!!"
$stream.close()
$sw.Stop()
$sw.Elapsed
$cmd.Dispose()
$conn.Dispose()

You can try using this:
$CSVPath = 'C:\test\CSVTest.csv'
$Delimiter = ';'
# list we use to hold the results
$ResultList = [System.Collections.Generic.List[Object]]::new()
# Create a stream (I use OpenText because it returns a streamreader)
$File = [System.IO.File]::OpenText($CSVPath)
# Read and parse the header
$HeaderString = $File.ReadLine()
# Get the properties from the string, replace quotes
$Properties = $HeaderString.Split($Delimiter).Replace('"',$null)
$PropertyCount = $Properties.Count
# now read the rest of the data, parse it, build an object and add it to a list
while ($File.EndOfStream -ne $true)
{
# Read the line
$Line = $File.ReadLine()
# split the fields and replace the quotes
$LineData = $Line.Split($Delimiter).Replace('"',$null)
# Create a hashtable with the properties (we convert this to a PSCustomObject later on). I use an ordered hashtable to keep the order
$PropHash = [System.Collections.Specialized.OrderedDictionary]#{}
# if loop to add the properties and values
for ($i = 0; $i -lt $PropertyCount; $i++)
{
$PropHash.Add($Properties[$i],$LineData[$i])
}
# Now convert the data to a PSCustomObject and add it to the list
$ResultList.Add($([PSCustomObject]$PropHash))
}
# Now you can sort this list using Linq:
Add-Type -AssemblyName System.Linq
# Sort using propertyname (my sample data had a prop called "Name")
$Sorted = [Linq.Enumerable]::OrderBy($ResultList, [Func[object,string]] { $args[0].Name })
Instead of using import-csv I've written a quick parser which uses a streamreader and parses the CSV data on the fly and puts it in a PSCustomObject.
This is then added to a list.
edit: fixed the linq sample

Putting the performance aside and at least come to a solution that works (meaning one that doesn't hang due to memory shortage) I would rely on the PowerShell pipeline. The issue is thou that for sorting an object you will need to stall te pipeline as the last object might potentially become the first object.
To resolve this part, I would do a coarse division on the first character(s) of the concern property first. Once that is done, fine sort each coarse division and append the results:
Function Sort-BigObject {
[CmdletBinding()] param(
[Parameter(ValueFromPipeLine = $True)]$InputObject,
[Parameter(Position = 0)][String]$Property,
[ValidateRange(1,9)]$Coarse = 1,
[System.Text.Encoding]$Encoding = [System.Text.Encoding]::Default
)
Begin {
$TemporaryFiles = [System.Collections.SortedList]::new()
}
Process {
if ($InputObject.$Property) {
$Grain = $InputObject.$Property.SubString(0, $Coarse)
if (!$TemporaryFiles.Contains($Grain)) { $TemporaryFiles[$Grain] = New-TemporaryFile }
$InputObject | Export-Csv $TemporaryFiles[$Grain] -Encoding $Encoding -Append
} else { $InputObject.$Property }
}
End {
Foreach ($TemporaryFile in $TemporaryFiles.Values) {
Import-Csv $TemporaryFile -Encoding $Encoding | Sort-Object $Property
Remove-Item -LiteralPath $TemporaryFile
}
}
}
Usage
(Don't assign the stream to a variable and don't use parenthesis.)
Import-Csv .\1crores.csv | Sort-BigObject <PropertyName> | Export-Csv .\output.csv
If the temporary files still get too big to handle, you might need to increase the -Coarse parameter
Caveats (improvement considerations)
Objects with an empty sort property will be immediately outputted
The sort column is presumed to be a (single) string column
I presume the performance is poor (I didn't do a full test on 30 million records, but 10.000 records take about 8 second which means about 8 hours). Consider replacing native PowerShell cmdlets with .Net streaming methods. buffer/cache file input and outputs, parallel processing?

You could try SQLite:
$OutputFile = "D:\Performance_test_data\output1.csv"
$sw = [Diagnostics.Stopwatch]::StartNew()
sqlite3 output1.db '.mode csv' '.import 1crores.csv 1crores' '.headers on' ".output $OutputFile" 'Select * from 1crores order by 最終アクセス日時'
echo "data written successfully!!!"
$sw.Stop()
$sw.Elapsed

I have added a new answer as this is a complete different approach to tackle this issue.
Instead of creating temporary files (which presumable causes a lot of file opens and closures), you might consider to create a ordered list of indices and than go over the input file (-FilePath) multiple times and each time, process a selective number of lines (-BufferSize = 1Gb, you might have to tweak this "memory usage vs. performance" parameter):
Function Sort-Csv {
[CmdletBinding()] param(
[string]$InputFile,
[String]$Property,
[string]$OutputFile,
[Char]$Delimiter = ',',
[System.Text.Encoding]$Encoding = [System.Text.Encoding]::Default,
[Int]$BufferSize = 1Gb
)
Begin {
if ($InputFile.StartsWith('.\')) { $InputFile = Join-Path (Get-Location) $InputFile }
$Index = 0
$Dictionary = [System.Collections.Generic.SortedDictionary[string, [Collections.Generic.List[Int]]]]::new()
Import-Csv $InputFile -Delimiter $Delimiter -Encoding $Encoding | Foreach-Object {
if (!$Dictionary.ContainsKey($_.$Property)) { $Dictionary[$_.$Property] = [Collections.Generic.List[Int]]::new() }
$Dictionary[$_.$Property].Add($Index++)
}
$Indices = [int[]]($Dictionary.Values | ForEach-Object { $_ })
$Dictionary = $Null # we only need the sorted index list
}
Process {
$Start = 0
$ChunkSize = [int]($BufferSize / (Get-Item $InputFile).Length * $Indices.Count / 2.2)
While ($Start -lt $Indices.Count) {
[System.GC]::Collect()
$End = $Start + $ChunkSize - 1
if ($End -ge $Indices.Count) { $End = $Indices.Count - 1 }
$Chunk = #{}
For ($i = $Start; $i -le $End; $i++) { $Chunk[$Indices[$i]] = $i }
$Reader = [System.IO.StreamReader]::new($InputFile, $Encoding)
$Header = $Reader.ReadLine()
$i = $Start
$Count = 0
For ($i = 0; ($Line = $Reader.ReadLine()) -and $Count -lt $ChunkSize; $i++) {
if ($Chunk.Contains($i)) { $Chunk[$i] = $Line }
}
$Reader.Dispose()
if ($OutputFile) {
if ($OutputFile.StartsWith('.\')) { $OutputFile = Join-Path (Get-Location) $OutputFile }
$Writer = [System.IO.StreamWriter]::new($OutputFile, ($Start -ne 0), $Encoding)
if ($Start -eq 0) { $Writer.WriteLine($Header) }
For ($i = $Start; $i -le $End; $i++) { $Writer.WriteLine($Chunk[$Indices[$i]]) }
$Writer.Dispose()
} else {
$Start..$End | ForEach-Object { $Header } { $Chunk[$Indices[$_]] } | ConvertFrom-Csv -Delimiter $Delimiter
}
$Chunk = $Null
$Start = $End + 1
}
}
}
Basic usage
Sort-Csv .\Input.csv <PropertyName> -Output .\Output.csv
Sort-Csv .\Input.csv <PropertyName> | ... | Export-Csv .\Output.csv
Note that for 1Crones.csv it will probably just export the full file in once unless you set the -BufferSize to a lower amount e.g. 500Kb.

I downloaded gnu sort.exe from here: http://gnuwin32.sourceforge.net/packages/coreutils.htm It also requires libiconv2.dll and libintl3.dll from the dependency zip. I basically did this within cmd.exe, and it used a little less than a gig of ram and took about 5 minutes. It's a 500 meg file of about 30 million random numbers. This command can also merge sorted files with --merge. You can also specify begin and end key position for sorting --key. It automatically uses temp files.
.\sort.exe < file1.csv > file2.csv
Actually it works in a similar way with the windows sort from the cmd prompt. The windows sort also has a /+n option to specify what character column to start the sort by.
sort.exe < file1.csv > file2.csv

Related

Changing the Delimiter in a large CSV file using Powershell

I am in need of a way to change the delimiter in a CSV file from a comma to a pipe. Because of the size of the CSV files (~750 Mb to several Gb), using Import-CSV and/or Get-Content is not an option. What I'm using (and what works, albeit slowly) is the following code:
$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")
While(!$reader.EndOfData)
{
$line = $reader.ReadFields()
$details = [ordered]#{
"Plugin ID" = $line[0]
CVE = $line[1]
CVSS = $line[2]
Risk = $line[3]
}
$export = New-Object PSObject -Property $details
$export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"
}
This little loop took nearly 2 minutes to process a 20 Mb file. Scaling up at this speed would mean over an hour for the smallest CSV file I'm currently working with.
I've tried this as well:
While(!$reader.EndOfData)
{
$line = $reader.ReadFields()
$details = [ordered]#{
# Same data as before
}
$export.Add($details) | Out-Null
}
$export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"
This is MUCH FASTER but doesn't provide the right information in the new CSV. Instead I get rows and rows of this:
"Count"|"IsReadOnly"|"Keys"|"Values"|"IsFixedSize"|"SyncRoot"|"IsSynchronized"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"
So, two questions:
1) Can the first block of code be made faster?
2) How can I unwrap the arraylist in the second example to get to the actual data?
EDIT: Sample data found here - http://pastebin.com/6L98jGNg
This is simple text-processing, so the bottleneck should be disk read speed:
1 second per 100 MB or 10 seconds per 1GB for the OP's sample (repeated to the mentioned size) as measured here on i7. The results would be worse for files with many/all small quoted fields.
The algo is simple:
Read the file in big string chunks e.g. 1MB.
It's much faster than reading millions of lines separated by CR/LF because:
less checks are performed as we mostly/primarily look only for doublequotes;
less iterations of our code executed by the interpreter which is slow.
Find the next doublequote.
Depending on the current $inQuotedField flag decide whether the found doublequote starts a quoted field (should be preceded by , + some spaces optionally) or ends the current quoted field (should be followed by any even number of doublequotes, optionally spaces, then ,).
Replace delimiters in the preceding span or to the end of 1MB chunk if no quotes were found.
The code makes some reasonable assumptions but it may fail to detect an escaped field if its doublequote is followed or preceded by more than 3 spaces before/after field delimiter. The checks won't be too hard to add, and I might've missed some other edge case, but I'm not that interested.
$sourcePath = 'c:\path\file.csv'
$targetPath = 'd:\path\file2.csv'
$targetEncoding = [Text.UTF8Encoding]::new($false) # no BOM
$delim = [char]','
$newDelim = [char]'|'
$buf = [char[]]::new(1MB)
$sourceBase = [IO.FileStream]::new(
$sourcePath,
[IO.FileMode]::open,
[IO.FileAccess]::read,
[IO.FileShare]::read,
$buf.length, # let OS prefetch the next chunk in background
[IO.FileOptions]::SequentialScan)
$source = [IO.StreamReader]::new($sourceBase, $true) # autodetect encoding
$target = [IO.StreamWriter]::new($targetPath, $false, $targetEncoding, $buf.length)
$bufStart = 0
$bufPadding = 4
$inQuotedField = $false
$fieldBreak = [char[]]#($delim, "`r", "`n")
$out = [Text.StringBuilder]::new($buf.length)
while ($nRead = $source.Read($buf, $bufStart, $buf.length-$bufStart)) {
$s = [string]::new($buf, 0, $nRead+$bufStart)
$len = $s.length
$pos = 0
$out.Clear() >$null
do {
$iQuote = $s.IndexOf([char]'"', $pos)
if ($inQuotedField) {
$iDelim = if ($iQuote -ge 0) { $s.IndexOf($delim, $iQuote+1) }
if ($iDelim -eq -1 -or $iQuote -le 0 -or $iQuote -ge $len - $bufPadding) {
# no closing quote in buffer safezone
$out.Append($s.Substring($pos, $len-$bufPadding-$pos)) >$null
break
}
if ($s.Substring($iQuote, $iDelim-$iQuote+1) -match "^(""+)\s*$delim`$") {
# even number of quotes are just quoted quotes
$inQuotedField = $matches[1].length % 2 -eq 0
}
$out.Append($s.Substring($pos, $iDelim-$pos+1)) >$null
$pos = $iDelim + 1
continue
}
if ($iQuote -ge 0) {
$iDelim = $s.LastIndexOfAny($fieldBreak, $iQuote)
if (!$s.Substring($iDelim+1, $iQuote-$iDelim-1).Trim()) {
$inQuotedField = $true
}
$replaced = $s.Substring($pos, $iQuote-$pos+1).Replace($delim, $newDelim)
} elseif ($pos -gt 0) {
$replaced = $s.Substring($pos).Replace($delim, $newDelim)
} else {
$replaced = $s.Replace($delim, $newDelim)
}
$out.Append($replaced) >$null
$pos = $iQuote + 1
} while ($iQuote -ge 0)
$target.Write($out)
$bufStart = 0
for ($i = $out.length; $i -lt $s.length; $i++) {
$buf[$bufStart++] = $buf[$i]
}
}
if ($bufStart) { $target.Write($buf, 0, $bufStart) }
$source.Close()
$target.Close()
Still not what I would call fast, but this is considerably faster than what you have listed by using the -Join operator:
$reader = New-Object Microsoft.VisualBasic.fileio.textfieldparser $source
$reader.SetDelimiters(",")
While(!$reader.EndOfData){
$line = $reader.ReadFields()
$line -join '|' | Add-Content C:\Temp\TestOutput.csv
}
That took a hair under 32 seconds to process a 20MB file. At that rate your 750MB file would be done in under 20 minutes, and bigger files should go at about 26 minutes per gig.

Powershell Script for 500 errors

I want power shell script to fetch all 500 entries from IIS logs from multiple servers. I have written a script that fetches 500 from single servers for previous hours. Could someone check and help me how I can go for fetching multiple servers. Script that I have:
#Set Time Variable -60
$time = (Get-Date -Format "HH:mm:ss"(Get-Date).addminutes(-60))
# Location of IIS LogFile
#$servers = get-content C:\Users\servers.txt
$File = "\\server\D$\Logs\W3SVC89\"+"u_ex"+(get-date).ToString("yyMMddHH")+".log"
# Get-Content gets the file, pipe to Where-Object and skip the first 3 lines.
$Log = Get-Content $File | where {$_ -notLike "#[D,S-V]*" }
# Replace unwanted text in the line containing the columns.
$Columns = (($Log[0].TrimEnd()) -replace "#Fields: ", "" -replace "-","" -replace "\(","" -replace "\)","").Split(" ")
# Count available Columns, used later
$Count = $Columns.Length
# Strip out the other rows that contain the header (happens on iisreset)
$Rows = $Log | where {$_ -like "*500 0 0*"}
# Create an instance of a System.Data.DataTable
#Set-Variable -Name IISLog -Scope Global
$IISLog = New-Object System.Data.DataTable "IISLog"
# Loop through each Column, create a new column through Data.DataColumn and add it to the DataTable
foreach ($Column in $Columns) {
$NewColumn = New-Object System.Data.DataColumn $Column, ([string])
$IISLog.Columns.Add($NewColumn)
}
# Loop Through each Row and add the Rows.
foreach ($Row in $Rows) {
$Row = $Row.Split(" ")
$AddRow = $IISLog.newrow()
for($i=0;$i -lt $Count; $i++) {
$ColumnName = $Columns[$i]
$AddRow.$ColumnName = $Row[$i]
}
$IISLog.Rows.Add($AddRow)
}
$IISLog | select #{n="DateTime"; e={Get-Date ("$($_.date) $($_.time)")}},sip,csuristem,scstatus | ? { $_.DateTime -ge $time } |Out-File C:\Users\Servers\results.csv
Assuming your logfile is always on the same path, and that servers.txt contains you server list,
you can read the server list then execute your code against each one using a foreach loop :
something like that ( a result file is create for each server) :
#Set Time Variable -60
$time = (Get-Date -Format "HH:mm:ss"(Get-Date).addminutes(-60))
# Location of IIS LogFile
$servers = get-content C:\Users\servers.txt
$servers| foreach{
#inside the foreach loop $_ will represent the current server
$File = "\\$_\D$\Logs\W3SVC89\"+"u_ex"+(get-date).ToString("yyMMddHH")+".log"
# Get-Content gets the file, pipe to Where-Object and skip the first 3 lines.
$Log = Get-Content $File | where {$_ -notLike "#[D,S-V]*" }
# Replace unwanted text in the line containing the columns.
$Columns = (($Log[0].TrimEnd()) -replace "#Fields: ", "" -replace "-","" -replace "\(","" -replace "\)","").Split(" ")
# Count available Columns, used later
$Count = $Columns.Length
# Strip out the other rows that contain the header (happens on iisreset)
$Rows = $Log | where {$_ -like "*500 0 0*"}
# Create an instance of a System.Data.DataTable
#Set-Variable -Name IISLog -Scope Global
$IISLog = New-Object System.Data.DataTable "IISLog"
# Loop through each Column, create a new column through Data.DataColumn and add it to the DataTable
foreach ($Column in $Columns) {
$NewColumn = New-Object System.Data.DataColumn $Column, ([string])
$IISLog.Columns.Add($NewColumn)
}
# Loop Through each Row and add the Rows.
foreach ($Row in $Rows) {
$Row = $Row.Split(" ")
$AddRow = $IISLog.newrow()
for($i=0;$i -lt $Count; $i++) {
$ColumnName = $Columns[$i]
$AddRow.$ColumnName = $Row[$i]
}
$IISLog.Rows.Add($AddRow)
}
$IISLog | select #{n="DateTime"; e={Get-Date ("$($_.date) $($_.time)")}},sip,csuristem,scstatus | ? { $_.DateTime -ge $time } |Out-File C:\Users\Servers\$_results.csv
}
Note that this will run your code sequentially on each of your server wich can be time consumming. If you are facing duration issue, you can try to use invoke-command and the -asjob parameter in order to launch you code asynchronoulsy

Import Excel data into PowerShell variables

I have an Excel File which has an unknown number of records in it, and these 3 columns:
Variable Name, Store Number, Email Address
I use this in QlikView to import data for certain stores and then create a separate report for each store in the list. I then need to email each report to each individual store (store number will be in the report file name).
So in PowerShell I would like to read the Excel File and set variables for each store:
$Store1 = The Store Number in Row 2 of the Excel File
$Store1Email = The Store Email in Row 2 of the Excel File
$Store2 = The Store Number in Row 3 of the Excel File
$Store2Email = The Store Email in Row 3 of the Excel File
etc. for each Storein the file (can be any number of stores).
Please note the "Variable Name" in the excel file must be ignored (that is for QLikView) and the PowerShell variables must be named as per my above examples, each time incrementing the number.
Check out my PowerShell Excel Module on Github. You can also grab it from the PowerShell Gallery.
$stores = Import-Excel C:\Temp\store.xlsx
$stores[2].Name
$stores[2].StoreNumber
$stores[2].EmailAddress
''
'All stores'
'----------'
$stores
Ok, first off if you are going to be working with actual .XLS or .XLSX or .XLSM files I would highly suggest using the Import-XLS function from the TechNet gallery (found here).
After that, just reference the object it imports to send the emails instead of making objects for each store. Such as:
$StoreList = Import-XLS <path to Excel file>
GC <report folder> | %{
$Current = $_
$Store = $StoreList|?{$_.StoreNumber -match $Current.BaseName}|Select -ExpandProperty StoreNumber
$Email = $StoreList|?{$_.StoreNumber -match $Current.BaseName}|Select -ExpandProperty StoreEmail
<code to send $Current to $Email>
}
My preference is to Save-As the Excel file to a '.csv' type. The comma separated value can easily be imported into PowerShell.
$csvFile = Import-Csv -Path c:\scripts\temp\excelFile.csv
#now the entire Excel '.csv' file is saved into csvFile variable
$csvFile |Get-Member
#look at the properties
Remember to study the greats so your PowerShell script looks great. Jeffery Snover, Jason Hicks, Don Jones, Ashley McGlone, and anyone on their friends list ha ha
The above answers usually work, but I just had a project with excel datasheets that caused some problems.
edit: Here's a much more advanced version that will pull it into an object, can handle blank and duplicate column names, and can skip human information at the beginning of the worksheet by looking for something in the header row. I've also included some example usages
Your example:
$file = New-Module -AsCustomObject -ScriptBlock $file_template
$file.from_excel("c:\folder\file.xls")
$Store1 = $file.data[0]."Store Number" #first row, column named "Store Number"
$Store1Email = $file.data[0]."Store Email" #first row, column named "Store Email"
foreach ($row in $file.data)
{
write-host "Store: $($row."Store Number")"
write-host "Store Email: $($row."Store Email")"
}
Example 1:
# Simplest example
$file = New-Module -AsCustomObject -ScriptBlock $file_template
$file.from_excel("c:\folder\file.xls")
$file.data[0]
Example 2:
#advanced usage
$file = New-Module -AsCustomObject -ScriptBlock $file_template
$file.header_contains="First Name" # if included it will drop everything before the first line that contains this, useful if there are instructions for humans in the worksheet
$file.indexer_column = 5 # Default: 1 (first column); This column's contents will set the minimum number of rows, use if there are blank rows in your file but more data after them
$file.worksheet_index = "January" # Default: 1; can be a sheet index or sheet name
$file.filename = "c:\folder\file.xls" #can set this independently, useful for validation and troubleshooting
$file.from_excel() #This is where we actually pull from excel
$collected = $file.data|ogv -pass thru #this is a neat way to select some rows you want
$file.headers.count # It stores an array of the headers here, useful for troubleshooting and advanced logic
Excel Reader pseudoclass
$file_template = {
# -- universal --
$filename = ""
$delimiter = ","
$headers = #()
$data = #()
# -- used by some functions --
# we put these here to allow assigning them before calling functions, which improves readability and auditability
$header_contains=""
$indexer_column=1
$worksheet_index=1
function from_excel(
$filename=$this.filename,
$worksheet_index=$this.worksheet_index
)`
{
$this.filename = $filename
$this.worksheet_index = $worksheet_index
$data_by_row = $this.from_excel_as_csv() # $data_by_row = $file.from_excel_as_csv($test_file)
$data_by_row = $data_by_row -split"`n"
#if ($this.headers.count -lt 1) {$this.headers = $data_by_row[0] -split $this.delimiter} #this would let us set headers elsewhere which is more flexible but less adaptive, Because columns change unpredicably we need something more adaptive
$temp_headers = $data_by_row[0] -split $this.delimiter
$temp_headers = $this.fix_blank_headers($temp_headers)
$this.headers = $this.dedupe_headers($temp_headers)
$this.data = $data_by_row|select -Skip 1|ConvertFrom-Csv -Header $this.headers -Delimiter $this.delimiter
}
function from_csv($filename=$this.filename)`
{
$this.filename = $filename
$this.headers = (Get-Content $this.filename -ReadCount 1|select -first 1) -split $this.delimiter
$this.data = Get-Content $this.filename|ConvertFrom-Csv -Delimiter $this.delimiter
}
function from_excel_as_csv(
$filename=$this.filename,
$worksheet_index=$this.worksheet_index
)`
{
$this.filename = $filename
$this.worksheet_index = $worksheet_index
#set up excel
Write-Host "Importing from excel, this may take a little while..."
$excel = New-Object -ComObject Excel.Application
$excel.DisplayAlerts = $false
$excel.Visible = $false
$workbook = $excel.workbooks.open($this.filename)
$worksheet = $workbook.Worksheets.Item($this.worksheet_index)
#import from excel
try{
$data_by_row = ""
$indexed_column = $worksheet.columns.item($this.indexer_column).value2 #we use this to work around some files having headers with blank space
$minimum_rows = (($indexed_column -join "◘").TrimEnd("◘") -split "◘").count # This Strips the million or so extra blank rows excel appends to get a realistic column length.
[bool]$header_found = 0
$i=1
do `
{
$row = $worksheet.rows.item($i).value2
$row_as_text = $row -join "◘" # ◘ (alt+8) is just a placeholder that's unlikely to show up in the text
$row_as_text = $row_as_text -replace $this.delimiter,"."
$row_as_text = $row_as_text.TrimEnd("◘")
$row_as_text = $row_as_text -replace "◘",$this.delimiter
if ($row_as_text -like "*$($this.header_contains)*"){[bool]$header_found=1}
if ($header_found) {$data_by_row+="$row_as_text`n"}
$i++
}
while ( ($row_as_text.Length -gt 1) -or ($i -lt $minimum_rows) )
}
catch {Write-Warning "ERROR Importing from excel"}
#close excel
$workbook.Close()
$excel.Quit()
write-host "Done importing from excel"
return $data_by_row
}
function dedupe_headers($headers){
$dupes = ($headers|group)|?{$_.count -gt 1}
if ($dupes.count -ge 1)
{
foreach ($dupe in $dupes)
{ #$dupe = $dupes[0]
$i=1
$new_headers = #()
foreach ($header in $headers)
{ #$header = $headers[0]
if ($header -eq $dupe.name)
{
$header = "$($header)_$($i)" # "header_#"
$i++
}
$new_headers += $header
}
}
}
else {$new_headers = $headers} # no duplicates found
return $new_headers
}
function fix_blank_headers($headers)
{
$replace_blanks_with = "_"
$new_headers = #()
foreach ($header in $headers)
{
if ($header -eq "") {$new_headers += $replace_blanks_with}
else {$new_headers += $header}
}
if ($new_headers.count -ne $header)
{
$error_json = #($headers),#($new_headers)|ConvertTo-Json -Compress
Write-Error "Error when fixing blank headers, original and new counts are different $($error_json)"
}
return $new_headers
}
<# function some_function($some_parameter){return $some_parameter} #>
Export-ModuleMember -Function * -Variable *
}
Forgive the ugliness here. I am not a programmer, so there are undoubtedly more optimized ways to do this, as well as better formatting. It will work, however, if I understand your requirements correctly.
$excelfile = import-csv "c:\myfile.csv"
$i = 1
$excelfile | ForEach-Object {
New-Variable "Store$i" $_."Store Number"
$iemail = $i.ToString() + "Email"
New-Variable "Store$iemail" $_."Email Address"
$i ++
}
edit: as per the reply to your original post, this works with a csv file. Just save it to csv first if necessary.
$excelfile = import-csv "C:\Temp\store.csv"
$i = 1 $excelfile | ForEach-Object {
$NA= $_."Name"
$SN= $_."StoreNumber"
Write-Output "row $i"
$NA
$SN
$i++ }

Split CSV with powershell

I have large CSV files (50-500 MB each). Running complicated power shell commands on these takes forever and/or hits memory issues.
Processing the data requires grouping by common fields, say in ColumnA. So assuming that the data is already sorted by that column, if I split these files randomly (i.e. each x-thousand lines) then matching entries could still end up in different parts. There are thousands of different groups in A, so splitting every one into a single file would create to many files.
How can I split it into files of 10,000-ish lines and not lose the groups? E.g. rows 1-13 would be A1 in Column A, rows 14-17 would be A2 etc. and row 9997-10012 would be A784. In this case i would want the first file to contain rows 1-10012 and the next one to start with row 10013.
Obviously I would want to keep the entire rows (rather than just Column A), so if I pasted all the resulting files together this would be the same as the original file.
Not tested. This assumes ColumnA is the first column and it's common comma-delimited data. You'll need to adjust the line that creates the regex to suit your data.
$count = 0
$header = get-content file.csv -TotalCount 1
get-content file.csv -ReadCount 1000 |
foreach {
#add tail entries from last batch to beginning of this batch
$newbatch = $tail + $_
#create regex to match last entry in this batch
$regex = '^' + [regex]::Escape(($newbatch[-1].split(',')[0]))
#Extract everything that doesn't match the last entry to new file
#Add header if this is not the first file
if ($count)
{
$header |
set-content "c:\somedir\filepart_$count"
}
$newbatch -notmatch $regex |
add-content "c:\somedir\filepart_$count"
#Extact tail entries to add to next batch
$tail = #($newbatch -match $regex)
#Increment file counter
$count++
}
This is my attempt, it got messy :-P It will load the whole file into memory while splitting it, but this is pure text. It should take less memory then imported objects, but still about the size of the file.
$filepath = "C:\Users\graimer\Desktop\file.csv"
$file = Get-Item $filepath
$content = Get-Content $file
$csvheader = $content[0]
$lines = $content.Count
$minlines = 10000
$filepart = 1
$start = 1
while ($start -lt $lines - 1) {
#Set minimum $end value (last line)
if ($start + $minlines -le $lines - 1) { $end = $start + $minlines - 1 } else { $end = $lines - 1 }
#Value to compare. ColA is first column in my file = [0] . ColB is second column = [1]
$avalue = $content[$end].split(",")[0]
#If not last line in script
if ($end -ne $lines -1) {
#Increase $end by 1 while ColA is the same
while ($content[$end].split(",")[0] -eq $avalue) { $end++ }
#Return to last line with equal ColA value
$end--
}
#Create new csv-part
$filename = $file.FullName.Replace($file.BaseName, ($file.BaseName + ".part$filepart"))
#($csvheader, $content[$start..$end]) | Set-Content $filename
#Fix counters
$filepart++
$start = $end + 1
}
file.csv:
ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60
A5,1,10
A6,1,10
A7,1,10
Results (I used $minlines = 5):
file.part1.csv:
ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20
file.part2.csv:
ColA,ColB,ColC
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60
file.part3.csv:
ColA,ColB,ColC
A5,1,10
A6,1,10
A7,1,10
This requires PowerShell v3 (due to -append on Export-CSV).
Also, I'm assuming that you have column headers and the first column is named col1. Adjust as necessary.
import-csv MYFILE.csv|foreach-object{$_|export-csv -notypeinfo -noclobber -append ($_.col1 + ".csv")}
This will create one file for each distinct value in the first column, with that value as the file name.
To compliment the helpful answer from mjolinor with a reusable function with a few additional parameters and using the steppable pipeline which is about a factor 8 faster:
function Split-Content {
[CmdletBinding()]
param (
[Parameter(Mandatory=$true)][String]$Path,
[ULong]$HeadSize,
[ValidateRange(1, [ULong]::MaxValue)][ULong]$DataSize = [ULong]::MaxValue,
[Parameter(Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]$Value
)
begin {
$Header = [Collections.Generic.List[String]]::new()
$DataCount = 0
$PartNr = 1
}
Process {
$ReadCount = 0
while ($ReadCount -lt #($_).Count -and $Header.Count -lt $HeadSize) {
if (#($_)[$ReadCount]) { $Header.Add(#($_)[$ReadCount]) }
$ReadCount++
}
if ($ReadCount -lt #($_).Count -and $Header.Count -ge $HeadSize) {
do {
if ($DataCount -le 0) { # Should never be less
$FileInfo = [System.IO.FileInfo]$ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Path)
$FileName = $FileInfo.BaseName + $PartNr++ + $FileInfo.Extension
$LiteralPath = [System.IO.Path]::Combine($FileInfo.DirectoryName, $FileName)
$steppablePipeline = { Set-Content -LiteralPath $LiteralPath }.GetSteppablePipeline()
$steppablePipeline.Begin($PSCmdlet)
$steppablePipeline.Process($Header)
}
$Next = [math]::min(($DataSize - $DataCount), #($_).Count)
if ($Next -gt $ReadCount) { $steppablePipeline.Process(#($_)[$ReadCount..($Next - 1)]) }
$DataCount = ($DataCount + $Next - $ReadCount) % $DataSize
if ($DataCount -le 0) { $steppablePipeline.End() }
$ReadCount = $Next % #($_).Count
} while ($ReadCount)
}
}
End {
if ($steppablePipeline) { $steppablePipeline.End() }
}
}
Parameters
Value
Specifies the listed content lines to be broken into parts. Multiple lines sent through the pipeline at a time (aka sub arrays like Object[]) will also be passed to the output file at a time (assuming that is fits the -DataSize).
Path
Specifies a path to one or more locations. Each filename in the location is suffixed with a part number (starting with 1).
HeadSize
The specifies the number of lines of the header that will be taken from the input and preceded in each file part. The default is 0, meaning no header line are copied.
DataSize
The specifies the number of lines that will be successively taken (after the header) from the input as data and pasted into each file part. The default is [ULong]::MaxValue, basically meaning that all data is copied to a single file.
Example 1:
Get-Content -ReadCount 1000 .\Test.Csv |Split-Content -Path .\Part.Csv -HeadSize 1 -DataSize 10000
This will split the .\Test.Csv file in chuncks of csv files with 10000 rows
Note that the performance of this Split-Content function highly depends on the -ReadCount of the prior Get-Content cmdlet.
Example 2:
Get-Process |Out-String -Stream |Split-Content -Path .\Process.Txt -HeadSize 2 -DataSize 20
This will write chunks of 20 processes to the .\Process<PartNr>.Txt files preceded with the standard (2 line) header format:
NPM(K) PM(M) WS(M) CPU(s) Id SI ProcessName
------ ----- ----- ------ -- -- -----------
... # 20 rows following

How can I split a text file using PowerShell?

I need to split a large (500 MB) text file (a log4net exception file) into manageable chunks like 100 5 MB files would be fine.
I would think this should be a walk in the park for PowerShell. How can I do it?
A word of warning about some of the existing answers - they will run very slow for very big files. For a 1.6 GB log file I gave up after a couple of hours, realising it would not finish before I returned to work the next day.
Two issues: the call to Add-Content opens, seeks and then closes the current destination file for every line in the source file. Reading a little of the source file each time and looking for the new lines will also slows things down, but my guess is that Add-Content is the main culprit.
The following variant produces slightly less pleasant output: it will split files in the middle of lines, but it splits my 1.6 GB log in less than a minute:
$from = "C:\temp\large_log.txt"
$rootName = "C:\temp\large_log_chunk"
$ext = "txt"
$upperBound = 100MB
$fromFile = [io.file]::OpenRead($from)
$buff = new-object byte[] $upperBound
$count = $idx = 0
try {
do {
"Reading $upperBound"
$count = $fromFile.Read($buff, 0, $buff.Length)
if ($count -gt 0) {
$to = "{0}.{1}.{2}" -f ($rootName, $idx, $ext)
$toFile = [io.file]::OpenWrite($to)
try {
"Writing $count to $to"
$tofile.Write($buff, 0, $count)
} finally {
$tofile.Close()
}
}
$idx ++
} while ($count -gt 0)
}
finally {
$fromFile.Close()
}
Simple one-liner to split based on number of lines (100 in this case):
$i=0; Get-Content .....log -ReadCount 100 | %{$i++; $_ | Out-File out_$i.txt}
This is a somewhat easy task for PowerShell, complicated by the fact that the standard Get-Content cmdlet doesn't handle very large files too well. What I would suggest to do is use the .NET StreamReader class to read the file line by line in your PowerShell script and use the Add-Content cmdlet to write each line to a file with an ever-increasing index in the filename. Something like this:
$upperBound = 50MB # calculated by Powershell
$ext = "log"
$rootName = "log_"
$reader = new-object System.IO.StreamReader("C:\Exceptions.log")
$count = 1
$fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
while(($line = $reader.ReadLine()) -ne $null)
{
Add-Content -path $fileName -value $line
if((Get-ChildItem -path $fileName).Length -ge $upperBound)
{
++$count
$fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
}
}
$reader.Close()
Same as all the answers here, but using StreamReader/StreamWriter to split on new lines (line by line, instead of trying to read the whole file into memory at once). This approach can split big files in the fastest way I know of.
Note: I do very little error checking, so I can't guarantee it'll work smoothly for your case. It did for mine (1.7 GB TXT file of 4 million lines split in 100,000 lines per file in 95 seconds).
#split test
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
$filename = "C:\Users\Vincent\Desktop\test.txt"
$rootName = "C:\Users\Vincent\Desktop\result"
$ext = ".txt"
$linesperFile = 100000#100k
$filecount = 1
$reader = $null
try{
$reader = [io.file]::OpenText($filename)
try{
"Creating file number $filecount"
$writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
$filecount++
$linecount = 0
while($reader.EndOfStream -ne $true) {
"Reading $linesperFile"
while( ($linecount -lt $linesperFile) -and ($reader.EndOfStream -ne $true)){
$writer.WriteLine($reader.ReadLine());
$linecount++
}
if($reader.EndOfStream -ne $true) {
"Closing file"
$writer.Dispose();
"Creating file number $filecount"
$writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
$filecount++
$linecount = 0
}
}
} finally {
$writer.Dispose();
}
} finally {
$reader.Dispose();
}
$sw.Stop()
Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"
Output splitting a 1.7 GB file:
...
Creating file number 45
Reading 100000
Closing file
Creating file number 46
Reading 100000
Closing file
Creating file number 47
Reading 100000
Closing file
Creating file number 48
Reading 100000
Split complete in 95.6308289 seconds
I often need to do the same thing. The trick is getting the header repeated into each of the split chunks. I wrote the following cmdlet (PowerShell v2 CTP 3) and it does the trick.
##############################################################################
#.SYNOPSIS
# Breaks a text file into multiple text files in a destination, where each
# file contains a maximum number of lines.
#
#.DESCRIPTION
# When working with files that have a header, it is often desirable to have
# the header information repeated in all of the split files. Split-File
# supports this functionality with the -rc (RepeatCount) parameter.
#
#.PARAMETER Path
# Specifies the path to an item. Wildcards are permitted.
#
#.PARAMETER LiteralPath
# Specifies the path to an item. Unlike Path, the value of LiteralPath is
# used exactly as it is typed. No characters are interpreted as wildcards.
# If the path includes escape characters, enclose it in single quotation marks.
# Single quotation marks tell Windows PowerShell not to interpret any
# characters as escape sequences.
#
#.PARAMETER Destination
# (Or -d) The location in which to place the chunked output files.
#
#.PARAMETER Count
# (Or -c) The maximum number of lines in each file.
#
#.PARAMETER RepeatCount
# (Or -rc) Specifies the number of "header" lines from the input file that will
# be repeated in each output file. Typically this is 0 or 1 but it can be any
# number of lines.
#
#.EXAMPLE
# Split-File bigfile.csv 3000 -rc 1
#
#.LINK
# Out-TempFile
##############################################################################
function Split-File {
[CmdletBinding(DefaultParameterSetName='Path')]
param(
[Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
[String[]]$Path,
[Alias("PSPath")]
[Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
[String[]]$LiteralPath,
[Alias('c')]
[Parameter(Position=2,Mandatory=$true)]
[Int32]$Count,
[Alias('d')]
[Parameter(Position=3)]
[String]$Destination='.',
[Alias('rc')]
[Parameter()]
[Int32]$RepeatCount
)
process {
# yeah! the cmdlet supports wildcards
if ($LiteralPath) { $ResolveArgs = #{LiteralPath=$LiteralPath} }
elseif ($Path) { $ResolveArgs = #{Path=$Path} }
Resolve-Path #ResolveArgs | %{
$InputName = [IO.Path]::GetFileNameWithoutExtension($_)
$InputExt = [IO.Path]::GetExtension($_)
if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }
# get the input file in manageable chunks
$Part = 1
Get-Content $_ -ReadCount:$Count | %{
# make an output filename with a suffix
$OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))
# In the first iteration the header will be
# copied to the output file as usual
# on subsequent iterations we have to do it
if ($RepeatCount -and $Part -gt 1) {
Set-Content $OutputFile $Header
}
# write this chunk to the output file
Write-Host "Writing $OutputFile"
Add-Content $OutputFile $_
$Part += 1
}
}
}
}
I found this question while trying to split multiple contacts in a single vCard VCF file to separate files. Here's what I did based on Lee's code. I had to look up how to create a new StreamReader object and changed null to $null.
$reader = new-object System.IO.StreamReader("C:\Contacts.vcf")
$count = 1
$filename = "C:\Contacts\{0}.vcf" -f ($count)
while(($line = $reader.ReadLine()) -ne $null)
{
Add-Content -path $fileName -value $line
if($line -eq "END:VCARD")
{
++$count
$filename = "C:\Contacts\{0}.vcf" -f ($count)
}
}
$reader.Close()
Many of these answers were too slow for my source files. My source files were SQL files between 10 MB and 800 MB that needed to split into files of roughly equal line counts.
I found some of the previous answers which use Add-Content to be quite slow. Waiting many hours for a split to finish wasn't uncommon.
I didn't try Typhlosaurus's answer, but it looks to only do splits by file size, not line count.
The following has suited my purposes.
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
Write-Host "Reading source file..."
$lines = [System.IO.File]::ReadAllLines("C:\Temp\SplitTest\source.sql")
$totalLines = $lines.Length
Write-Host "Total Lines :" $totalLines
$skip = 0
$count = 100000; # Number of lines per file
# File counter, with sort friendly name
$fileNumber = 1
$fileNumberString = $filenumber.ToString("000")
while ($skip -le $totalLines) {
$upper = $skip + $count - 1
if ($upper -gt ($lines.Length - 1)) {
$upper = $lines.Length - 1
}
# Write the lines
[System.IO.File]::WriteAllLines("C:\Temp\SplitTest\result$fileNumberString.txt",$lines[($skip..$upper)])
# Increment counters
$skip += $count
$fileNumber++
$fileNumberString = $filenumber.ToString("000")
}
$sw.Stop()
Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"
For a 54 MB file, I get the output...
Reading source file...
Total Lines : 910030
Split complete in 1.7056578 seconds
I hope others looking for a simple, line-based splitting script that matches my requirements will find this useful.
There's also this quick (and somewhat dirty) one-liner:
$linecount=0; $i=0; Get-Content .\BIG_LOG_FILE.txt | %{ Add-Content OUT$i.log "$_"; $linecount++; if ($linecount -eq 3000) {$I++; $linecount=0 } }
You can tweak the number of first lines per batch by changing the hard-coded 3000 value.
Do this:
FILE 1
There's also this quick (and somewhat dirty) one-liner:
$linecount=0; $i=0;
Get-Content .\BIG_LOG_FILE.txt | %
{
Add-Content OUT$i.log "$_";
$linecount++;
if ($linecount -eq 3000) {$I++; $linecount=0 }
}
You can tweak the number of first lines per batch by changing the hard-coded 3000 value.
Get-Content C:\TEMP\DATA\split\splitme.txt | Select -First 5000 | out-File C:\temp\file1.txt -Encoding ASCII
FILE 2
Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 5000 | Select -First 5000 | out-File C:\temp\file2.txt -Encoding ASCII
FILE 3
Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 10000 | Select -First 5000 | out-File C:\temp\file3.txt -Encoding ASCII
etc…
I've made a little modification to split files based on size of each part.
##############################################################################
#.SYNOPSIS
# Breaks a text file into multiple text files in a destination, where each
# file contains a maximum number of lines.
#
#.DESCRIPTION
# When working with files that have a header, it is often desirable to have
# the header information repeated in all of the split files. Split-File
# supports this functionality with the -rc (RepeatCount) parameter.
#
#.PARAMETER Path
# Specifies the path to an item. Wildcards are permitted.
#
#.PARAMETER LiteralPath
# Specifies the path to an item. Unlike Path, the value of LiteralPath is
# used exactly as it is typed. No characters are interpreted as wildcards.
# If the path includes escape characters, enclose it in single quotation marks.
# Single quotation marks tell Windows PowerShell not to interpret any
# characters as escape sequences.
#
#.PARAMETER Destination
# (Or -d) The location in which to place the chunked output files.
#
#.PARAMETER Size
# (Or -s) The maximum size of each file. Size must be expressed in MB.
#
#.PARAMETER RepeatCount
# (Or -rc) Specifies the number of "header" lines from the input file that will
# be repeated in each output file. Typically this is 0 or 1 but it can be any
# number of lines.
#
#.EXAMPLE
# Split-File bigfile.csv -s 20 -rc 1
#
#.LINK
# Out-TempFile
##############################################################################
function Split-File {
[CmdletBinding(DefaultParameterSetName='Path')]
param(
[Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
[String[]]$Path,
[Alias("PSPath")]
[Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
[String[]]$LiteralPath,
[Alias('s')]
[Parameter(Position=2,Mandatory=$true)]
[Int32]$Size,
[Alias('d')]
[Parameter(Position=3)]
[String]$Destination='.',
[Alias('rc')]
[Parameter()]
[Int32]$RepeatCount
)
process {
# yeah! the cmdlet supports wildcards
if ($LiteralPath) { $ResolveArgs = #{LiteralPath=$LiteralPath} }
elseif ($Path) { $ResolveArgs = #{Path=$Path} }
Resolve-Path #ResolveArgs | %{
$InputName = [IO.Path]::GetFileNameWithoutExtension($_)
$InputExt = [IO.Path]::GetExtension($_)
if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }
Resolve-Path #ResolveArgs | %{
$InputName = [IO.Path]::GetFileNameWithoutExtension($_)
$InputExt = [IO.Path]::GetExtension($_)
if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }
# get the input file in manageable chunks
$Part = 1
$buffer = ""
Get-Content $_ -ReadCount:1 | %{
# make an output filename with a suffix
$OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))
# In the first iteration the header will be
# copied to the output file as usual
# on subsequent iterations we have to do it
if ($RepeatCount -and $Part -gt 1) {
Set-Content $OutputFile $Header
}
# test buffer size and dump data only if buffer is greater than size
if ($buffer.length -gt ($Size * 1MB)) {
# write this chunk to the output file
Write-Host "Writing $OutputFile"
Add-Content $OutputFile $buffer
$Part += 1
$buffer = ""
} else {
$buffer += $_ + "`r"
}
}
}
}
}
}
Sounds like a job for the UNIX command split:
split MyBigFile.csv
Just split my 55 GB csv file in 21k chunks in less than 10 minutes.
It's not native to PowerShell though, but comes with, for instance, the git for windows package https://git-scm.com/download/win
As the lines can be variable in logs I thought it best to take a number of lines per file approach. The following code snippet processed a 4 million line log file in under 19 seconds (18.83.. seconds)splitting it into 500,000 line chunks:
$sourceFile = "c:\myfolder\mylargeTextyFile.csv"
$partNumber = 1
$batchSize = 500000
$pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"
[System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001) # utf8 this one
$fs=New-Object System.IO.FileStream ($sourceFile,"OpenOrCreate", "Read", "ReadWrite",8,"None")
$streamIn=New-Object System.IO.StreamReader($fs, $enc)
$streamout = new-object System.IO.StreamWriter $pathAndFilename
$line = $streamIn.readline()
$counter = 0
while ($line -ne $null)
{
$streamout.writeline($line)
$counter +=1
if ($counter -eq $batchsize)
{
$partNumber+=1
$counter =0
$streamOut.close()
$pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"
$streamout = new-object System.IO.StreamWriter $pathAndFilename
}
$line = $streamIn.readline()
}
$streamin.close()
$streamout.close()
This can easily be turned into a function or script file with parameters to make it more versatile. It uses a StreamReader and StreamWriter to achieve its speed and tiny memory footprint
My requirement was a bit different. I often work with Comma Delimited and Tab Delimited ASCII files where a single line is a single record of data. And they're really big, so I need to split them into manageable parts (whilst preserving the header row).
So, I reverted back to my classic VBScript method and bashed together a small .vbs script that can be run on any Windows computer (it gets automatically executed by the WScript.exe script host engine on Window).
The benefit of this method is that it uses Text Streams, so the underlying data isn't loaded into memory (or, at least, not all at once). The result is that it's exceptionally fast and it doesn't really need much memory to run. The test file I just split using this script on my i7 was about 1 GB in file size, had about 12 million lines of text and was split into 25 part files (each with about 500k lines each) – the processing took about 2 minutes and it didn’t go over 3 MB memory used at any point.
The caveat here is that it relies on the text file having "lines" (meaning each record is delimited with a CRLF) as the Text Stream object uses the "ReadLine" function to process a single line at a time. But hey, if you're working with TSV or CSV files, it's perfect.
Option Explicit
Private Const INPUT_TEXT_FILE = "c:\bigtextfile.txt"
Private Const REPEAT_HEADER_ROW = True
Private Const LINES_PER_PART = 500000
Dim oFileSystem, oInputFile, oOutputFile, iOutputFile, iLineCounter, sHeaderLine, sLine, sFileExt, sStart
sStart = Now()
sFileExt = Right(INPUT_TEXT_FILE,Len(INPUT_TEXT_FILE)-InstrRev(INPUT_TEXT_FILE,".")+1)
iLineCounter = 0
iOutputFile = 1
Set oFileSystem = CreateObject("Scripting.FileSystemObject")
Set oInputFile = oFileSystem.OpenTextFile(INPUT_TEXT_FILE, 1, False)
Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
If REPEAT_HEADER_ROW Then
iLineCounter = 1
sHeaderLine = oInputFile.ReadLine()
Call oOutputFile.WriteLine(sHeaderLine)
End If
Do While Not oInputFile.AtEndOfStream
sLine = oInputFile.ReadLine()
Call oOutputFile.WriteLine(sLine)
iLineCounter = iLineCounter + 1
If iLineCounter Mod LINES_PER_PART = 0 Then
iOutputFile = iOutputFile + 1
Call oOutputFile.Close()
Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
If REPEAT_HEADER_ROW Then
Call oOutputFile.WriteLine(sHeaderLine)
End If
End If
Loop
Call oInputFile.Close()
Call oOutputFile.Close()
Set oFileSystem = Nothing
Call MsgBox("Done" & vbCrLf & "Lines Processed:" & iLineCounter & vbCrLf & "Part Files: " & iOutputFile & vbCrLf & "Start Time: " & sStart & vbCrLf & "Finish Time: " & Now())
If this may help, it works perfectly for me.
Script check a folder, parse all CSV files and check nb of lines per file.
If file contains more than 55000 lines in file, script split the file into sub-files of 50000 lines and name them " _1, _2, ...."
At end of the script, original file is renamed to avoid a load.
foreach ($MyFile in $MyFolder)
{
# Read parent CSV
$InputFilename = $MyFile
$InputFile = Get-Content $MyFile
$OutputFilenamePattern = "$MyFile"+"_"
Write-Host ".........."
Write-Host ". File to process"
Write-Host ".........."
WRITE-HOST "$MyVar_file_Path"
Write-Host "$InputFilename"
Write-Host "$OutputFilenamePattern"
Write-Host ".........."
$LineLimit = 50000
# Initialize
$line = 0
$i = 0
$file = 0
$start = 0
$nb_lines = (Get-Content $MyFile).Length
Write-Host ".........."
Write-Host "$nb_lines lines in the file"
Write-Host ".........."
if ($nb_lines -gt 55000)
{
# Loop all text lines
while ($line -le $InputFile.Length)
{
# Generate child CSVs
if ($i -eq $LineLimit -Or $line -eq $InputFile.Length)
{
$file++
$Filename = "$OutputFilenamePattern$file.csv"
# $InputFile[0] | Out-File $Filename -Force # Writes Header at the beginning of the line.
If ($file -ne 1) {$InputFile[0] | Out-File $Filename -Force}
$InputFile[$start..($line - 1)] | Out-File $Filename -Force -Append # Original line 19 with the addition of -Append so it doesn't overwrite the headers you just wrote.
# $InputFile[$start..($line-1)] | Out-File $Filename -Force
$start = $line;
$i = 0
Write-Host "$Filename"
}
# Increment counters
$i++;
$line++
}
$Source_name = $MyVar_file_Path2 + "\" + $InputFilename
$Destination_name = $MyVar_file_Path2 + "\" + "Splitted_" + $InputFilename
Write-Host ".........."
Write-Host ". File to rename"
Write-Host ".........."
Write-Host "$Source_name"
Write-Host "$Destination_name"
Write-Host ".........."
Rename-Item $Source_name -NewName $Destination_name
}
Write-Host "."
Write-Host "."
}
Here is my solution to split a file called patch6.txt (about 32,000 lines) into separate files of 1000 lines each. Its not quick, but it does the job.
$infile = "D:\Malcolm\Test\patch6.txt"
$path = "D:\Malcolm\Test\"
$lineCount = 1
$fileCount = 1
foreach ($computername in get-content $infile)
{
write $computername | out-file -Append $path_$fileCount".txt"
$lineCount++
if ($lineCount -eq 1000)
{
$fileCount++
$lineCount = 1
}
}