looping through blocks of records from csv - powershell

I have a CSV with around 8 rows.
There are some calculations to do, and based on those I have to group the data into blocks of 3 rows each, sum them and get the average of each block.
e.g.
timestamp;user_MIN;user_MAX;user_AVERAGE;nice_MIN;nice_MAX;nice_AVERAGE;system_MIN;system_MAX;system_AVERAGE;idle_MIN;idle_MAX;idle_AVERAGE;iowait_MIN;iowait_MAX;iowait_AVERAGE;irq_MIN;irq_MAX;irq_AVERAGE;softirq_MIN;softirq_MAX;softirq_AVERAGE
1. 1600013580;0.40213333333;0.40213333333;0.40213333333;0;0;0;0.63016666667;0.63016666667;0.63016666667;98.6436;98.6436;98.6436;0.32213333333;0.32213333333;0.32213333333;0;0;0;0.0019666666667;0.0019666666667;0.0019666666667
2. 1600013640;0.3618;0.3618;0.3618;0;0;0;0.59983333333;0.59983333333;0.59983333333;98.748533333;98.748533333;98.748533333;0.29786666667;0.29786666667;0.29786666667;0;0;0;0;0;0
3. 1600013700;0.3618;0.3618;0.3618;0;0;0;0.59983333333;0.59983333333;0.59983333333;98.748533333;98.748533333;98.748533333;0.29786666667;0.29786666667;0.29786666667;0;0;0;0;0;0
4. 1600013760;0.3618;0.3618;0.3618;0;0;0;0.59983333333;0.59983333333;0.59983333333;98.748533333;98.748533333;98.748533333;0.29786666667;0.29786666667;0.29786666667;0;0;0;0;0;0
5. 1600013820;0.3618;0.3618;0.3618;0;0;0;0.59983333333;0.59983333333;0.59983333333;98.748533333;98.748533333;98.748533333;0.29786666667;0.29786666667;0.29786666667;0;0;0;0;0;0
6. 1600013880;0.3618;0.3618;0.3618;0;0;0;0.59983333333;0.59983333333;0.59983333333;98.748533333;98.748533333;98.748533333;0.29786666667;0.29786666667;0.29786666667;0;0;0;0;0;0
7. 1600013940;0.30983333333;0.30983333333;0.30983333333;0;0;0;0.46146666667;0.46146666667;0.46146666667;98.932633333;98.932633333;98.932633333;0.29803333333;0.29803333333;0.29803333333;0;0;0;0;0;0
8. 1600014000;0.30983333333;0.30983333333;0.30983333333;0;0;0;0.46146666667;0.46146666667;0.46146666667;98.932633333;98.932633333;98.932633333;0.29803333333;0.29803333333;0.29803333333;0;0;0;0;0;0
Note: added row numbers for sample
So here the first block will be rows 1-3, then 4-6, and then 7-8.
For each of the groups above, I need to calculate the sum of all rows and divide by the number of rows in the block (i.e. 3, and 2 for the last one).
I have the code for calculating the sum and the average (with some of the variable names given):
$Hash = [ordered]@{}
foreach ($Property in $Properties) {
    $Hash[$Property] = (($New_Extracted_Data.$Property | Measure -Sum).Sum) / $NumberOfRowsToPick
}
I need to know how to divide the data into blocks; please give me some idea of how to do this.
Below is code that divides my data into equal blocks, but the problem is that it does not consider the last two rows (i.e. 7-8).
Please let me know what I am doing wrong here.
function break_csv($BlockCount)
{
$SourceDir = "C:\Script"
$InFileName = 'Server_raw_data.csv'
$InFullFileName = Join-Path -Path $SourceDir -ChildPath $InFileName
$BatchCount = $BlockCount
$DestDir = "C:\Script"
$OutFileName = 'LF_Batch_.csv'
$OutFullFileName = Join-Path -Path $DestDir -ChildPath $OutFileName
$CsvAsText = Get-Content -LiteralPath $InFullFileName
[array]$HeaderLine = $CsvAsText[0]
$rowcount = ($CsvAsText.Count) - 2
$BatchSize = [int]($rowcount / $BatchCount)
$StartLine = 1
foreach ($B_Index in 1..$BatchCount)
{
if ($B_Index -ne 1)
{
$StartLine = $StartLine + $BatchSize
$CurrentOutFullFileName = $OutFullFileName.Replace('_.', ('_{0}.' -f $B_Index))
$HeaderLine + $CsvAsText[$StartLine..(($StartLine + $BatchSize) - 1)] | Set-Content -LiteralPath $CurrentOutFullFileName
}
else
{
$CurrentOutFullFileName = $OutFullFileName.Replace('_.', ('_{0}.' -f $B_Index))
$HeaderLine + $CsvAsText[$StartLine..$BatchSize] | Set-Content -LiteralPath $CurrentOutFullFileName
}
}
}
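A likely reason rows 7-8 are skipped: $BatchSize is $rowcount / $BatchCount cast to [int], so any remainder rows after the last full batch are never selected (and the -2 in $rowcount probably drops one more data row unless the file ends with a blank line). A minimal, untested sketch of the batching loop with the size rounded up and the end index clamped, keeping the same variable names:
# sketch only: round the batch size up and clamp the end index to the last line
$rowcount  = $CsvAsText.Count - 1                        # exclude only the header line
$BatchSize = [Math]::Ceiling($rowcount / $BatchCount)    # round up so remainder rows still get a batch
$StartLine = 1
foreach ($B_Index in 1..$BatchCount)
{
    # clamp the end index so the last batch stops at the final data line
    $EndLine = [Math]::Min($StartLine + $BatchSize - 1, $CsvAsText.Count - 1)
    if ($StartLine -le $EndLine)
    {
        $CurrentOutFullFileName = $OutFullFileName.Replace('_.', ('_{0}.' -f $B_Index))
        $HeaderLine + $CsvAsText[$StartLine..$EndLine] | Set-Content -LiteralPath $CurrentOutFullFileName
    }
    $StartLine += $BatchSize
}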

Separate the selection of blocks from the calculation of averages by implementing the average calculation routine in its own function:
function Get-RowAverage
{
param(
[Parameter(Mandatory = $true, Position = 0)]
[psobject[]]$InputObject,
[Parameter(Mandatory = $true)]
[ValidateNotNullOrEmpty()]
[string[]]$Property
)
# Calculate average for the selected $Property values here and return result
}
Now you just move your calculations into that function and divide by $InputObject.Count, and then pass each "block" of rows individually:
$allRows = Import-Csv .\path\to\input.csv
$firstBlockAvg = Get-RowAverage $allRows[0..2] -Property $Property
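To process every block (including the trailing partial one), you could step through the imported rows in chunks of the block size. Below is a minimal sketch along those lines; the property list is a placeholder, the timestamp handling is only illustrative, and note that the sample above is semicolon-delimited, so Import-Csv would need -Delimiter ';':
$blockSize = 3
$Property  = 'user_AVERAGE', 'system_AVERAGE', 'idle_AVERAGE'   # placeholder: pick the columns you need

$blockAverages = for ($i = 0; $i -lt $allRows.Count; $i += $blockSize) {
    # the last block may be shorter than $blockSize, so clamp the upper index
    $block  = $allRows[$i..([Math]::Min($i + $blockSize, $allRows.Count) - 1)]
    $result = [ordered]@{ timestamp = $block[0].timestamp }   # illustrative: tag each block with its first timestamp
    foreach ($p in $Property) {
        # Import-Csv returns strings, so cast to double before summing
        $values     = [double[]]$block.$p
        $result[$p] = ($values | Measure-Object -Sum).Sum / $block.Count
    }
    [pscustomobject]$result
}
$blockAverages | Format-Table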

Related

PowerShell Username Generator - Add to File/Check Against

I am creating usernames as follows: the first 3 letters of the first name, then a randomly generated 4-digit number. Ryan Smith = RYA4859. I am getting the random number from this PowerShell command:
Get-Random -Minimum 1000 -Maximum 10000
I need to know how to create a script that will add the username to a .txt file after it has been generated. I also want the script to first check the .txt file to see if the randomly generated number already exists and, if it does, generate a new 4-digit number that does not exist and then add that to the .txt file.
The flow should be:
generate random 4 digit number
check txt file if number exists
if yes - generate new number
if no - append file and add generated number to file
You want to run a do...until loop that keeps generating until the randomly generated number doesn't exist in your text file:
$file = "C:\users.txt"
$userId = "RYA"
# get the contents of your text file
$existingUserList = Get-Content $file
do
{
$userNumber = Get-Random -Minimum 1000 -Maximum 10000
# remove all alpha characters in the file, so only an array of numbers remains
$userListReplaced = $existingUserList -replace "[^0-9]" , ''
# the loop runs until the randomly generated number is not in the array of numbers
} until (-not ($userNumber -in $userListReplaced))
# concatenates your user name with the random number
$user = $userId + $userNumber
# appends the concatenated username into the text file
$user | Out-File -FilePath $file -Append
Without the 3-character prefix:
$file = "C:\users.txt"
# get the contents of your text file
$existingUserList = Get-Content $file
do
{
$userNumber = Get-Random -Minimum 1000 -Maximum 10000
# remove all alpha characters in the file, so only an array of numbers remains
$userListReplaced = $existingUserList -replace "[^0-9]" , ''
# the loop runs until the randomly generated number is not in the array of numbers
} until (-not ($userNumber -in $userListReplaced))
# appends the generated number to the text file
$userNumber| Out-File -FilePath $file -Append
Note: Hashtables in general will find keys in less time than it takes to find a matching element in an unsorted array. This difference in performance increases as the number of elements increases. While a binary search on a sorted array may come closer in performance, the sorting process itself can be a major performance hit and adds complexity to the code.
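A rough way to see that difference for yourself (the list size here is arbitrary and the timings are machine-dependent):
# illustrative comparison: scanning an unsorted array vs. hashtable key lookups
$array = 1..100000 | ForEach-Object { "USR$_" }
$hash  = @{}
foreach ($item in $array) { $hash[$item] = $true }

(Measure-Command { foreach ($n in 1..1000) { $null = $array -contains 'USR99999' } }).TotalMilliseconds
(Measure-Command { foreach ($n in 1..1000) { $null = $hash.ContainsKey('USR99999') } }).TotalMilliseconds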
The main difference between the version of the code described in the comment on the question and the following code is that I'm appending the new user name to the file instead of overwriting the file, and I added a loop near the end to repeatedly ask whether the code should continue.
function RandomDigits {
[CmdletBinding()]
param (
[Parameter()]
[int]$DigitCount = 2
)
$RandString = [string](Get-Random -Minimum 100000 -Maximum 10000000)
$RandString.Substring($RandString.Length-$DigitCount)
}
function GenUserName {
[CmdletBinding()]
param(
[Parameter(Mandatory = $true, Position = 0)]
[string]$Prefix
)
"$Prefix$(RandomDigits 4)"
}
function ReadAndMatchRegex {
[CmdletBinding()]
param(
[Parameter(Mandatory = $true, Position = 0)]
[string]$Regex,
[Parameter(Mandatory = $true, Position = 1)]
[string]$Prompt,
[Parameter(Mandatory = $false, Position = 2)]
[string]$ErrMsg = "Incorrect, please enter needed info (Type 'exit' to exit)."
)
$FirstPass = $true
do {
if (-not $FirstPass) {
Write-Host $ErrMsg -ForegroundColor Red
Write-Host
}
$ReadText = Read-Host -Prompt $Prompt
$ReadText = $ReadText.ToUpper()
if($ReadText -eq 'exit') {exit}
$FirstPass = $false
} until ($ReadText -match $Regex)
$ReadText
}
$Usernames = @{}
$UsernameFile = "$PSScriptRoot\Usernames.txt"
if(Test-Path -Path $UsernameFile -PathType Leaf) {
foreach($line in Get-Content $UsernameFile) { $Usernames[$Line]=$true }
}
do {
Write-Host
$UserPrefix = ReadAndMatchRegex '^[A-Z]{3}$' "Please enter 3 letters for user's ID"
do {
$NewUserName = GenUserName $UserPrefix
} while ($Usernames.ContainsKey($NewUserName))
$NewUserName | Out-File $UsernameFile -Append
$UserNames[$NewUserName]=$true
$UserNames.Keys
$Continue = ReadAndMatchRegex '^(Y|y|YES|yes|Yes|N|n|NO|no|No)$' 'Continue?[Y/N]'
} while ($Continue -match '^(Y|y|YES|yes|Yes)$')

Split a large csv file into multiple csv files according to the size in powershell

I have a large CSV file and I want to split it by size, with the header included in every file.
For example, I have a 1.6 MB file and I want the child files to be no more than 512 KB each, so the parent file should produce 4 child files.
I tried the simple program below, but it splits the file into blank child files.
function csvSplitter {
$csvFile = "D:\Test\PTest\Dummy.csv";
$split = 10;
$content = Import-Csv $csvFile;
$start = 1;
$end = 0;
$records_per_file = [int][Math]::Ceiling($content.Count / $split);
for($i = 1; $i -le $split; $i++) {
$end += $records_per_file;
$content | Where-Object {[int]$_.Id -ge $start -and [int]$_.Id -le $end} | Export-Csv -Path "D:\Test\PTest\Destination\file$i.csv" -NoTypeInformation;
$start = $end + 1;
}
}
csvSplitter
The logic for the file size limit is yet to be written.
I tried to attach both files, but I guess there is no option to attach files.
this takes a slightly different path to a solution. [grin]
it ...
loads the CSV as a plain text file
saves the 1st line as a header line
calcs the batch size from the total line count & the batch count
uses array index ranges to grab the lines for each batch
combines the header line with the current batch of lines
writes that out to a text file
the reason for such a roundabout method is to save RAM. one drawback to loading the file as a CSV is the sheer amount of RAM needed. just loading the lines of text requires noticeably less RAM.
$SourceDir = $env:TEMP
$InFileName = 'LargeFile.csv'
$InFullFileName = Join-Path -Path $SourceDir -ChildPath $InFileName
$BatchCount = 4
$DestDir = $env:TEMP
$OutFileName = 'LF_Batch_.csv'
$OutFullFileName = Join-Path -Path $DestDir -ChildPath $OutFileName
#region >>> build file to work with
# remove this region when you are ready to do this with your test data OR to do this with real data
if (-not (Test-Path -LiteralPath $InFullFileName))
{
Get-ChildItem -LiteralPath $env:APPDATA -Recurse -File |
Sort-Object -Property Name |
Select-Object Name, Length, LastWriteTime, Directory |
Export-Csv -LiteralPath $InFullFileName -NoTypeInformation
}
#endregion >>> build file to work with
$CsvAsText = Get-Content -LiteralPath $InFullFileName
[array]$HeaderLine = $CsvAsText[0]
$BatchSize = [int]($CsvAsText.Count / $BatchCount) + 1
$StartLine = 1
foreach ($B_Index in 1..$BatchCount)
{
if ($B_Index -ne 1)
{
$StartLine = $StartLine + $BatchSize + 1
}
$CurrentOutFullFileName = $OutFullFileName.Replace('_.', ('_{0}.' -f $B_Index))
$HeaderLine + $CsvAsText[$StartLine..($StartLine + $BatchSize)] |
Set-Content -LiteralPath $CurrentOutFullFileName
}
there is no output on screen, but i got 4 files named LF_Batch_1.csv thru LF_Batch_4.csv that contained the four parts of the source file as expected. the last file has a slightly smaller number of rows, but that is what happens when the row count is not evenly divisible by the batch count. [grin]
Try this:
Add-Type -AssemblyName System.Collections
function Split-Csv {
param (
[string]$filePath,
[int]$partsNum
)
# Use generic lists for import/export
[System.Collections.Generic.List[object]]$contentImport = @()
[System.Collections.Generic.List[object]]$contentExport = @()
# import csv-file
$contentImport = Import-Csv $filePath
# how many lines per export file
$linesPerFile = [Math]::Max( [int]($contentImport.Count / $partsNum), 1 )
# start pointer for source list
$startPointer = 0
# counter for file name
$counter = 1
# main loop
while( $startPointer -lt $contentImport.Count ) {
# clear export list
[void]$contentExport.Clear()
# determine from-to from source list to export
$endPointer = [Math]::Min( $startPointer + $linesPerFile, $contentImport.Count )
# move lines to export to export list
[void]$contentExport.AddRange( $contentImport.GetRange( $startPointer, $endPointer - $startPointer ) )
# export
$contentExport | Export-Csv -Path ($filePath.Replace('.', $counter.ToString() + '.' ) ) -NoTypeInformation -Force
# move pointer
$startPointer = $endPointer
# increase counter for filename
$counter++
}
}
Split-Csv -filePath 'test.csv' -partsNum 7
try running this script:
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
$FilePath = $HOME +'\Documents\Projects\ADOPT\Data8277.csv'
$SplitDir = $HOME +'\Documents\Projects\ADOPT\Split\'
CSV-FileSplitter -Path $FilePath -PartSizeBytes 35MB -SplitDir $SplitDir #-Verbose
$sw.Stop()
Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"
I created this for files larger than 50 GB.
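CSV-FileSplitter itself isn't shown here, but a size-based splitter along those lines could look roughly like the sketch below; the function name, parameters and the per-line CRLF byte estimate are assumptions, not the original implementation:
# sketch only: an illustrative size-based splitter (not the CSV-FileSplitter used above)
function Split-CsvBySize {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [long]$PartSizeBytes = 512KB,
        [string]$SplitDir = (Split-Path -Parent $Path)
    )
    $reader = [System.IO.StreamReader]::new($Path)
    try {
        $header  = $reader.ReadLine()
        $part    = 1
        $writer  = $null
        $written = 0
        while ($null -ne ($line = $reader.ReadLine())) {
            # start a new part whenever the current one would exceed the size limit
            if (-not $writer -or ($written + $line.Length + 2) -gt $PartSizeBytes) {
                if ($writer) { $writer.Close() }
                $outPath = Join-Path $SplitDir ('part_{0}.csv' -f $part++)
                $writer  = [System.IO.StreamWriter]::new($outPath)
                $writer.WriteLine($header)
                $written = $header.Length + 2
            }
            $writer.WriteLine($line)
            $written += $line.Length + 2   # rough size estimate: characters + CRLF
        }
    }
    finally {
        if ($writer) { $writer.Close() }
        $reader.Close()
    }
}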

Changing the Delimiter in a large CSV file using Powershell

I am in need of a way to change the delimiter in a CSV file from a comma to a pipe. Because of the size of the CSV files (~750 MB to several GB), using Import-CSV and/or Get-Content is not an option. What I'm using (and what works, albeit slowly) is the following code:
$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")
While(!$reader.EndOfData)
{
$line = $reader.ReadFields()
$details = [ordered]@{
"Plugin ID" = $line[0]
CVE = $line[1]
CVSS = $line[2]
Risk = $line[3]
}
$export = New-Object PSObject -Property $details
$export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"
}
This little loop took nearly 2 minutes to process a 20 Mb file. Scaling up at this speed would mean over an hour for the smallest CSV file I'm currently working with.
I've tried this as well:
While(!$reader.EndOfData)
{
$line = $reader.ReadFields()
$details = [ordered]@{
# Same data as before
}
$export.Add($details) | Out-Null
}
$export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"
This is MUCH FASTER but doesn't provide the right information in the new CSV. Instead I get rows and rows of this:
"Count"|"IsReadOnly"|"Keys"|"Values"|"IsFixedSize"|"SyncRoot"|"IsSynchronized"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"
So, two questions:
1) Can the first block of code be made faster?
2) How can I unwrap the arraylist in the second example to get to the actual data?
EDIT: Sample data found here - http://pastebin.com/6L98jGNg
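As for question 2: those rows appear because the ordered dictionaries themselves were exported, so Export-Csv serialized their properties (Count, Keys, Values and so on) instead of your fields. Casting each dictionary to [pscustomobject] before collecting it gives the actual data; a sketch of the faster variant with that change, reusing the $reader from the first snippet:
# sketch: collect PSCustomObjects instead of the raw dictionaries, then export once
$export = [System.Collections.Generic.List[object]]::new()
while (!$reader.EndOfData) {
    $line = $reader.ReadFields()
    $details = [ordered]@{
        "Plugin ID" = $line[0]
        CVE         = $line[1]
        CVSS        = $line[2]
        Risk        = $line[3]
    }
    # convert the dictionary to an object so Export-Csv sees the real columns
    $export.Add([pscustomobject]$details)
}
$export | Export-Csv -Delimiter "|" -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"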
This is simple text-processing, so the bottleneck should be disk read speed:
1 second per 100 MB or 10 seconds per 1 GB for the OP's sample (repeated to the mentioned size), as measured here on an i7. The results would be worse for files with many/all small quoted fields.
The algo is simple:
Read the file in big string chunks e.g. 1MB.
It's much faster than reading millions of lines separated by CR/LF because:
fewer checks are performed as we mostly/primarily look only for doublequotes;
fewer iterations of our code are executed by the interpreter, which is slow.
Find the next doublequote.
Depending on the current $inQuotedField flag decide whether the found doublequote starts a quoted field (should be preceded by , + some spaces optionally) or ends the current quoted field (should be followed by any even number of doublequotes, optionally spaces, then ,).
Replace delimiters in the preceding span or to the end of 1MB chunk if no quotes were found.
The code makes some reasonable assumptions, but it may fail to detect an escaped field if its doublequote is followed or preceded by more than 3 spaces before/after the field delimiter. The checks won't be too hard to add, and I might've missed some other edge case, but I'm not that interested.
$sourcePath = 'c:\path\file.csv'
$targetPath = 'd:\path\file2.csv'
$targetEncoding = [Text.UTF8Encoding]::new($false) # no BOM
$delim = [char]','
$newDelim = [char]'|'
$buf = [char[]]::new(1MB)
$sourceBase = [IO.FileStream]::new(
$sourcePath,
[IO.FileMode]::open,
[IO.FileAccess]::read,
[IO.FileShare]::read,
$buf.length, # let OS prefetch the next chunk in background
[IO.FileOptions]::SequentialScan)
$source = [IO.StreamReader]::new($sourceBase, $true) # autodetect encoding
$target = [IO.StreamWriter]::new($targetPath, $false, $targetEncoding, $buf.length)
$bufStart = 0
$bufPadding = 4
$inQuotedField = $false
$fieldBreak = [char[]]@($delim, "`r", "`n")
$out = [Text.StringBuilder]::new($buf.length)
while ($nRead = $source.Read($buf, $bufStart, $buf.length-$bufStart)) {
$s = [string]::new($buf, 0, $nRead+$bufStart)
$len = $s.length
$pos = 0
$out.Clear() >$null
do {
$iQuote = $s.IndexOf([char]'"', $pos)
if ($inQuotedField) {
$iDelim = if ($iQuote -ge 0) { $s.IndexOf($delim, $iQuote+1) }
if ($iDelim -eq -1 -or $iQuote -le 0 -or $iQuote -ge $len - $bufPadding) {
# no closing quote in buffer safezone
$out.Append($s.Substring($pos, $len-$bufPadding-$pos)) >$null
break
}
if ($s.Substring($iQuote, $iDelim-$iQuote+1) -match "^(""+)\s*$delim`$") {
# even number of quotes are just quoted quotes
$inQuotedField = $matches[1].length % 2 -eq 0
}
$out.Append($s.Substring($pos, $iDelim-$pos+1)) >$null
$pos = $iDelim + 1
continue
}
if ($iQuote -ge 0) {
$iDelim = $s.LastIndexOfAny($fieldBreak, $iQuote)
if (!$s.Substring($iDelim+1, $iQuote-$iDelim-1).Trim()) {
$inQuotedField = $true
}
$replaced = $s.Substring($pos, $iQuote-$pos+1).Replace($delim, $newDelim)
} elseif ($pos -gt 0) {
$replaced = $s.Substring($pos).Replace($delim, $newDelim)
} else {
$replaced = $s.Replace($delim, $newDelim)
}
$out.Append($replaced) >$null
$pos = $iQuote + 1
} while ($iQuote -ge 0)
$target.Write($out)
$bufStart = 0
for ($i = $out.length; $i -lt $s.length; $i++) {
$buf[$bufStart++] = $buf[$i]
}
}
if ($bufStart) { $target.Write($buf, 0, $bufStart) }
$source.Close()
$target.Close()
Still not what I would call fast, but this is considerably faster than what you have listed by using the -Join operator:
$reader = New-Object Microsoft.VisualBasic.fileio.textfieldparser $source
$reader.SetDelimiters(",")
While(!$reader.EndOfData){
$line = $reader.ReadFields()
$line -join '|' | Add-Content C:\Temp\TestOutput.csv
}
That took a hair under 32 seconds to process a 20MB file. At that rate your 750MB file would be done in under 20 minutes, and bigger files should go at about 26 minutes per gig.

Count records in csv file - Powershell

I have the following PS script to get a count. Is there a way to count the rows (minus the header) without importing the entire CSV? Sometimes the CSV file is very large, and sometimes it has no records.
Get-ChildItem 'C:\Temp\*.csv' | ForEach {
$check = Import-Csv $_
If ($check) { Write-Host "$($_.FullName) contains data" }
Else { Write-Host "$($_.FullName) does not contain data" }
}
To count the rows without worrying about the header use this:
$c = (Import-Csv $_.FullName).count
However, this has to read the entire file into memory. A faster way to count would be to use Get-Content with the -ReadCount flag, like so:
$c = 0
Get-Content $_.FullName -ReadCount 1000 | % {$c += $_.Length}
$c -= 1
To remove the header row from the count, you just subtract 1. If your files with no rows don't have a header, you can avoid them being counted as -1 like so:
$c = 0
Get-Content $_.FullName -ReadCount 1000 | % {$c += $_.Length}
$c -= @{$true = 0; $false = 1}[$c -eq 0]
Here is a function that checks whether a CSV file is empty (returns True if empty, False otherwise), with the following features:
Can skip headers
Works in PS 2.0 (PS 2.0 doesn't have the -ReadCount switch for the Get-Content cmdlet)
Doesn't load the entire file into memory
Aware of the CSV file structure (won't count empty/invalid lines)
It accepts the following arguments:
FileName: Path to the CSV file.
MaxLines: Maximum number of lines to read from the file.
NoHeader: If this switch is not specified, the function will skip the first line of the file.
Usage example:
Test-IsCsvEmpty -FileName 'c:\foo.csv' -MaxLines 2 -NoHeader
function Test-IsCsvEmpty
{
Param
(
[Parameter(Mandatory = $true, ValueFromPipeline = $true, ValueFromPipelineByPropertyName = $true)]
[string]$FileName,
[Parameter(ValueFromPipelineByPropertyName = $true)]
[ValidateRange(1, [int]::MaxValue)]
[int]$MaxLines = 2,
[Parameter(ValueFromPipelineByPropertyName = $true)]
[switch]$NoHeader
)
Begin
{
# Setup regex for CSV parsing
$DQuotes = '"'
$Separator = ','
# http://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
}
Process
{
# Open file in StreamReader
$InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList $FileName -ErrorAction Stop
# Set inital values for Raw\Data lines count
$CsvRawLinesCount = 0
$CsvDataLinesCount = 0
# Loop over lines in file
while(($line = $InFile.ReadLine()) -ne $null)
{
# Increase Raw line counter
$CsvRawLinesCount++
# Skip header, if requested
if(!$NoHeader -and ($CsvRawLinesCount -eq 1))
{
continue
}
# Stop processing if MaxLines limit is reached
if($CsvRawLinesCount -gt $MaxLines)
{
break
}
# Try to parse line as CSV
if($line -match $SplitRegex)
{
# If success, increase CSV Data line counter
$CsvDataLinesCount++
}
}
}
End
{
# Close file, dispose StreamReader
$InFile.Close()
$InFile.Dispose()
# Write result to the pipeline
if($CsvDataLinesCount -gt 0)
{
$false
}
else
{
$true
}
}
}

Split CSV with powershell

I have large CSV files (50-500 MB each). Running complicated PowerShell commands on these takes forever and/or hits memory issues.
Processing the data requires grouping by common fields, say in ColumnA. So, assuming the data is already sorted by that column, if I split these files randomly (i.e. every x-thousand lines) then matching entries could still end up in different parts. There are thousands of different groups in A, so splitting every one into its own file would create too many files.
How can I split it into files of 10,000-ish lines without losing the groups? E.g. rows 1-13 would be A1 in ColumnA, rows 14-17 would be A2, etc., and rows 9997-10012 would be A784. In this case I would want the first file to contain rows 1-10012 and the next one to start with row 10013.
Obviously I would want to keep the entire rows (rather than just Column A), so if I pasted all the resulting files together this would be the same as the original file.
Not tested. This assumes ColumnA is the first column and it's common comma-delimited data. You'll need to adjust the line that creates the regex to suit your data.
$count = 0
$header = get-content file.csv -TotalCount 1
get-content file.csv -ReadCount 1000 |
foreach {
#add tail entries from last batch to beginning of this batch
$newbatch = $tail + $_
#create regex to match last entry in this batch
$regex = '^' + [regex]::Escape(($newbatch[-1].split(',')[0]))
#Extract everything that doesn't match the last entry to new file
#Add header if this is not the first file
if ($count)
{
$header |
set-content "c:\somedir\filepart_$count"
}
$newbatch -notmatch $regex |
add-content "c:\somedir\filepart_$count"
#Extact tail entries to add to next batch
$tail = @($newbatch -match $regex)
#Increment file counter
$count++
}
This is my attempt, it got messy :-P It will load the whole file into memory while splitting it, but this is pure text. It should take less memory than imported objects, but still about the size of the file.
$filepath = "C:\Users\graimer\Desktop\file.csv"
$file = Get-Item $filepath
$content = Get-Content $file
$csvheader = $content[0]
$lines = $content.Count
$minlines = 10000
$filepart = 1
$start = 1
while ($start -lt $lines - 1) {
#Set minimum $end value (last line)
if ($start + $minlines -le $lines - 1) { $end = $start + $minlines - 1 } else { $end = $lines - 1 }
#Value to compare. ColA is first column in my file = [0] . ColB is second column = [1]
$avalue = $content[$end].split(",")[0]
#If not last line in script
if ($end -ne $lines -1) {
#Increase $end by 1 while ColA is the same
while ($content[$end].split(",")[0] -eq $avalue) { $end++ }
#Return to last line with equal ColA value
$end--
}
#Create new csv-part
$filename = $file.FullName.Replace($file.BaseName, ($file.BaseName + ".part$filepart"))
@($csvheader, $content[$start..$end]) | Set-Content $filename
#Fix counters
$filepart++
$start = $end + 1
}
file.csv:
ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60
A5,1,10
A6,1,10
A7,1,10
Results (I used $minlines = 5):
file.part1.csv:
ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20
file.part2.csv:
ColA,ColB,ColC
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60
file.part3.csv:
ColA,ColB,ColC
A5,1,10
A6,1,10
A7,1,10
This requires PowerShell v3 (due to -append on Export-CSV).
Also, I'm assuming that you have column headers and the first column is named col1. Adjust as necessary.
import-csv MYFILE.csv|foreach-object{$_|export-csv -notypeinfo -noclobber -append ($_.col1 + ".csv")}
This will create one file for each distinct value in the first column, with that value as the file name.
To complement the helpful answer from mjolinor, here is a reusable function with a few additional parameters that uses the steppable pipeline, which is about a factor of 8 faster:
function Split-Content {
[CmdletBinding()]
param (
[Parameter(Mandatory=$true)][String]$Path,
[ULong]$HeadSize,
[ValidateRange(1, [ULong]::MaxValue)][ULong]$DataSize = [ULong]::MaxValue,
[Parameter(Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]$Value
)
begin {
$Header = [Collections.Generic.List[String]]::new()
$DataCount = 0
$PartNr = 1
}
Process {
$ReadCount = 0
while ($ReadCount -lt @($_).Count -and $Header.Count -lt $HeadSize) {
if (@($_)[$ReadCount]) { $Header.Add(@($_)[$ReadCount]) }
$ReadCount++
}
if ($ReadCount -lt @($_).Count -and $Header.Count -ge $HeadSize) {
do {
if ($DataCount -le 0) { # Should never be less
$FileInfo = [System.IO.FileInfo]$ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Path)
$FileName = $FileInfo.BaseName + $PartNr++ + $FileInfo.Extension
$LiteralPath = [System.IO.Path]::Combine($FileInfo.DirectoryName, $FileName)
$steppablePipeline = { Set-Content -LiteralPath $LiteralPath }.GetSteppablePipeline()
$steppablePipeline.Begin($PSCmdlet)
$steppablePipeline.Process($Header)
}
$Next = [math]::min(($DataSize - $DataCount), @($_).Count)
if ($Next -gt $ReadCount) { $steppablePipeline.Process(@($_)[$ReadCount..($Next - 1)]) }
$DataCount = ($DataCount + $Next - $ReadCount) % $DataSize
if ($DataCount -le 0) { $steppablePipeline.End() }
$ReadCount = $Next % @($_).Count
} while ($ReadCount)
}
}
End {
if ($steppablePipeline) { $steppablePipeline.End() }
}
}
Parameters
Value
Specifies the listed content lines to be broken into parts. Multiple lines sent through the pipeline at a time (aka sub-arrays like Object[]) will also be passed to the output file at a time (assuming it fits the -DataSize).
Path
Specifies a path to one or more locations. Each filename in the location is suffixed with a part number (starting with 1).
HeadSize
Specifies the number of header lines that will be taken from the input and prepended to each file part. The default is 0, meaning no header lines are copied.
DataSize
Specifies the number of lines (after the header) that will be successively taken from the input as data and written to each file part. The default is [ULong]::MaxValue, basically meaning that all data is copied to a single file.
Example 1:
Get-Content -ReadCount 1000 .\Test.Csv |Split-Content -Path .\Part.Csv -HeadSize 1 -DataSize 10000
This will split the .\Test.Csv file into chunks of CSV files with 10000 rows each.
Note that the performance of this Split-Content function highly depends on the -ReadCount of the prior Get-Content cmdlet.
Example 2:
Get-Process |Out-String -Stream |Split-Content -Path .\Process.Txt -HeadSize 2 -DataSize 20
This will write chunks of 20 processes to the .\Process<PartNr>.Txt files, preceded by the standard (2-line) header format:
NPM(K) PM(M) WS(M) CPU(s) Id SI ProcessName
------ ----- ----- ------ -- -- -----------
... # 20 rows following