Powershell to Break up CSV by Number of Rows

So I am now tasked with getting constant reports that are more than 1 million lines long.
My last question did not explain everything, so I'm trying to ask a better question.
I'm getting a dozen-plus daily reports that come in as CSV files. I don't know what the headers are or anything like that as I get them.
They are huge; I can't open them in Excel.
I want to basically break them up into the same report, just with each part maybe 100,000 lines long.
The code I wrote below does not work, as I keep getting:
Exception of type 'System.OutOfMemoryException' was thrown.
I am guessing I need a better way to do this.
I just need these files broken down to a more manageable size.
It does not matter how long it takes, as I can run it overnight.
I found this on the internet and tried to adapt it, but I can't get it to work.
$PSScriptRoot
Write-Host $PSScriptRoot
$loc = $PSScriptRoot
$location = $loc
# how many rows per CSV?
$rowsMax = 10000;
# Get all CSV under current folder
$allCSVs = Get-ChildItem "$location\Split.csv"
# Read and split all of them
$allCSVs | ForEach-Object {
    Write-Host $_.Name;
    $content = Import-Csv "$location\Split.csv"
    $insertLocation = ($_.Name.Length - 4);
    for ($i = 1; $i -le $content.Length; $i += $rowsMax) {
        $newName = $_.Name.Insert($insertLocation, "splitted_" + $i)
        $content | Select-Object -First $i | Select-Object -Last $rowsMax |
            ConvertTo-Csv -NoTypeInformation |
            ForEach-Object { $_ -replace '"', "" } |
            Out-File $location\$newName -Force -Encoding ascii
    }
}

The key is not to read large files into memory in full, which is what you're doing by capturing the output from Import-Csv in a variable ($content = Import-Csv "$location\Split.csv").
That said, while using a single pipeline would solve your memory problem, performance will likely be poor, because you're converting from and back to CSV, which incurs a lot of overhead.
Even reading and writing the files as text with Get-Content and Set-Content is slow, however.
Therefore, I suggest a .NET-based approach for processing the files as text, which should substantially speed up processing.
The following code demonstrates this technique:
Get-ChildItem $PSScriptRoot/*.csv | ForEach-Object {
    $csvFile = $_.FullName

    # Construct a file-path template for the sequentially numbered chunk
    # files; e.g., "...\file_split_001.csv"
    $csvFileChunkTemplate = $csvFile -replace '(.+)\.(.+)', '$1_split_{0:000}.$2'

    # Set how many lines make up a chunk.
    $chunkLineCount = 10000

    # Read the file lazily and save every chunk of $chunkLineCount
    # lines to a new file.
    $i = 0; $chunkNdx = 0
    foreach ($line in [IO.File]::ReadLines($csvFile)) {
        if ($i -eq 0) { ++$i; $header = $line; continue } # Save header line.
        if ($i++ % $chunkLineCount -eq 1) { # Create a new chunk file.
            # Close the previous file, if any.
            if (++$chunkNdx -gt 1) { $fileWriter.Dispose() }
            # Construct the file path for the next chunk, by
            # instantiating the template with the next sequence number.
            $csvFileChunk = $csvFileChunkTemplate -f $chunkNdx
            Write-Verbose "Creating chunk: $csvFileChunk"
            # Create the next chunk file and write the header.
            $fileWriter = [IO.File]::CreateText($csvFileChunk)
            $fileWriter.WriteLine($header)
        }
        # Write a data row to the current chunk file.
        $fileWriter.WriteLine($line)
    }
    $fileWriter.Dispose() # Close the last file.
}
Note that the above code creates BOM-less UTF-8 files; if your input contains ASCII-range characters only, these files will effectively be ASCII files.
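If you need the chunk files in a specific encoding instead, you could swap the [IO.File]::CreateText() call above for an [IO.StreamWriter] constructed with an explicit encoding. A minimal sketch, with ASCII chosen purely as an example:
# Sketch only: create the chunk file with an explicit encoding.
# StreamWriter(path, append, encoding); $false means overwrite rather than append.
$fileWriter = [IO.StreamWriter]::new($csvFileChunk, $false, [Text.Encoding]::ASCII)
$fileWriter.WriteLine($header)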
Here's the equivalent single-pipeline solution, which is likely to be substantially slower.
Get-ChildItem $PSScriptRoot/*.csv | ForEach-Object {
    $csvFile = $_.FullName

    # Construct a file-path template for the sequentially numbered chunk
    # files; e.g., ".../file_split_001.csv"
    $csvFileChunkTemplate = $csvFile -replace '(.+)\.(.+)', '$1_split_{0:000}.$2'

    # Set how many lines make up a chunk.
    $chunkLineCount = 10000

    $i = 0; $chunkNdx = 0
    Get-Content -LiteralPath $csvFile | ForEach-Object {
        if ($i -eq 0) { ++$i; $header = $_; return } # Save header line.
        if ($i++ % $chunkLineCount -eq 1) { # Create a new chunk file.
            # Construct the file path for the next chunk.
            $csvFileChunk = $csvFileChunkTemplate -f ++$chunkNdx
            Write-Verbose "Creating chunk: $csvFileChunk"
            # Create the next chunk file and write the header.
            Set-Content -Encoding ASCII -LiteralPath $csvFileChunk -Value $header
        }
        # Write data row to the current chunk file.
        Add-Content -Encoding ASCII -LiteralPath $csvFileChunk -Value $_
    }
}

Another option, from the Linux world, is the split command. To get it on Windows, just install Git Bash; then you'll be able to use many Linux tools from CMD/PowerShell.
Below is the syntax to achieve your goal:
split -l 100000 --numeric-suffixes --suffix-length 3 --additional-suffix=.csv sourceFile.csv outputfile
It's very fast. If you want, you can wrap split.exe as a cmdlet, for example:
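A minimal sketch of such a wrapper function, assuming split.exe is on your PATH (e.g. via Git for Windows); the function and parameter names here are just placeholders:
function Split-TextFile {
    # Thin wrapper around GNU split (ships with Git for Windows).
    param(
        [Parameter(Mandatory)] [string] $Path,
        [int] $LineCount = 100000,
        [string] $OutputPrefix = 'chunk_'
    )
    split.exe -l $LineCount --numeric-suffixes --suffix-length 3 --additional-suffix=.csv $Path $OutputPrefix
}
# Example call:
# Split-TextFile -Path .\sourceFile.csv -OutputPrefix .\sourceFile_part_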

Related

Scanning log file using ForEach-Object and replacing text is taking a very long time

I have a PowerShell script that scans log files and replaces text when a match is found. The list is currently 500 lines, and I plan to double or triple that. The log files can range from 400KB to 800MB in size.
Currently, using the code below, a 42MB file takes 29 minutes, and I'm looking for help if anyone can see any way to make this faster.
I tried replacing ForEach-Object with ForEach-ObjectFast, but it caused the script to take significantly longer. I also tried changing the first ForEach-Object to a for loop, but it still took ~29 minutes.
$lookupTable = @{
    'aaa:bbb:123' = 'WORDA:WORDB:NUMBER1'
    'bbb:ccc:456' = 'WORDB:WORDBC:NUMBER456'
}
Get-Content -Path $inputfile | ForEach-Object {
    $line = $_
    $lookupTable.GetEnumerator() | ForEach-Object {
        if ($line -match $_.Key)
        {
            $line = $line -replace $_.Key, $_.Value
        }
    }
    $line
} | Set-Content -Path $outputfile
Since you say your input file could be 800MB in size, reading and updating the entire content in memory might simply not fit.
The way to go, then, is to use a fast line-by-line method, and the fastest I know of is the switch statement:
# hardcoded here for demo purposes.
# In real life you get/construct these from the Get-ChildItem
# cmdlet you use to iterate the log files in the root folder.
$inputfile  = 'D:\Test\test.txt'
$outputfile = 'D:\Test\test_new.txt' # absolute full file path because we use .NET here

# because we are going to Append to the output file, make sure it doesn't exist yet
if (Test-Path -Path $outputfile -PathType Leaf) { Remove-Item -Path $outputfile -Force }

$lookupTable = @{
    'aaa:bbb:123' = 'WORDA:WORDB:NUMBER1'
}

# create a regex string from the Keys of your lookup table,
# merging the strings with a pipe symbol (the regex 'OR').
# your Keys could contain characters that have special meaning in regex, so we need to escape those
$regexLookup = '({0})' -f (($lookupTable.Keys | ForEach-Object { [regex]::Escape($_) }) -join '|')

# create a StreamWriter object to write the lines to the new output file
# Note: use an ABSOLUTE full file path for this
$streamWriter = [System.IO.StreamWriter]::new($outputfile, $true) # $true for Append

switch -Regex -File $inputfile {
    $regexLookup {
        # do the replacement using the value in the lookup table.
        # because in one line there may be multiple matches to replace,
        # get a System.Text.RegularExpressions.Match object to loop through all matches
        $line = $_
        $match = [regex]::Match($line, $regexLookup)
        while ($match.Success) {
            # because we escaped the keys, to find the correct entry we now need to unescape
            $line = $line -replace $match.Value, $lookupTable[[regex]::Unescape($match.Value)]
            $match = $match.NextMatch()
        }
        $streamWriter.WriteLine($line)
    }
    default { $streamWriter.WriteLine($_) } # write unchanged
}
# dispose of the StreamWriter object
$streamWriter.Dispose()

How can I create a combined file from two text files in PowerShell?

How can I create a file with combined lines from two different text files, producing a new file like this one:
First line from text file A
First line from text file B
Second line from text file A
Second line from text file B
...
For a solution that:
- keeps memory use constant (doesn't load the whole files into memory up front), and
- performs acceptably with larger files,
direct use of .NET APIs is needed:
# Input files, assumed to be in the current directory.
# Important: always use FULL paths when calling .NET methods.
$dir = $PWD.ProviderPath
$fileA = [System.IO.File]::ReadLines("$dir/fileA.txt").GetEnumerator()
$fileB = [System.IO.File]::ReadLines("$dir/fileB.txt").GetEnumerator()

# Create the output file.
$fileOut = [System.IO.File]::CreateText("$dir/merged.txt")

# Iterate over the files' lines in tandem, and write each pair
# to the output file.
while ($fileA.MoveNext(), $fileB.MoveNext() -contains $true) {
    if ($null -ne $fileA.Current) { $fileOut.WriteLine($fileA.Current) }
    if ($null -ne $fileB.Current) { $fileOut.WriteLine($fileB.Current) }
}

# Dispose of (close) the files.
$fileA.Dispose(); $fileB.Dispose(); $fileOut.Dispose()
Note: .NET APIs use UTF-8 by default, but you can pass the desired encoding, if needed.
See also: The relevant .NET API help topics:
System.IO.File.ReadLines
System.IO.File.CreateText
A solution that uses only PowerShell features:
Note: Using PowerShell-only features you can only lazily enumerate one file's lines at a time, so reading the other into memory in full is required.
(However, you could again use a lazy enumerable via the .NET API, i.e. [System.IO.File]::ReadLines() as shown above, or read both files into memory in full up front.)
The key to acceptable performance is to only have one Set-Content call (plus possibly one Add-Content call) which processes all output lines.
However, given that Get-Content (without -Raw) is quite slow itself, due to decorating each line read with additional properties, the solution based on .NET APIs will perform noticeably better.
# Read the 2nd file into an array of lines up front.
# Note: -ReadCount 0 greatly speeds up reading, by returning
# the lines directly as a single array.
$fileBLines = Get-Content fileB.txt -ReadCount 0

$i = 0 # Initialize the index into array $fileBLines.

# Lazily enumerate the lines of file A.
Get-Content fileA.txt | ForEach-Object {
    $_ # Output the line from file A.
    # If file B hasn't run out of lines yet, output the corresponding file B line.
    if ($i -lt $fileBLines.Count) { $fileBLines[$i++] }
} | Set-Content Merged.txt

# If file B still has lines left, append them now:
if ($i -lt $fileBLines.Count) {
    Add-Content Merged.txt -Value $fileBLines[$i..($fileBLines.Count-1)]
}
Note: Windows PowerShell's Set-Content cmdlet defaults to "ANSI" encoding, whereas PowerShell (Core) (v6+) uses BOM-less UTF-8; use the -Encoding parameter as needed.
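For example, a tiny sketch of requesting UTF-8 explicitly in both cmdlet calls (sample content used only for illustration):
# Sketch: pass -Encoding explicitly so both PowerShell editions behave the same.
# (Windows PowerShell writes UTF-8 with a BOM here; PowerShell 7+ writes it without one.)
'line 1 from A', 'line 1 from B' | Set-Content Merged.txt -Encoding UTF8
Add-Content Merged.txt -Encoding UTF8 -Value 'leftover line from B'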
$file1content = Get-Content -Path "IN_1.txt"
$file2content = Get-Content -Path "IN_2.txt"
$filesLength = @($file1content.Length, $file2content.Length)
for ($i = 0; $i -lt ($filesLength | Measure-Object -Maximum).Maximum; $i++)
{
    Add-Content -Path "OUT.txt" $file1content[$i]
    Add-Content -Path "OUT.txt" $file2content[$i]
}

How can I convert CSV files with a metadata header row into flat tables using PowerShell?

I have a few thousand CSV files with a format similar to this (i.e. a table with a metadata row at the top):
dinosaur.csv,water,Benjamin.Field.12.Location53.Readings,
DATE,VALUE,QUALITY,STATE
2018-06-01,73.83,Good,0
2018-06-02,45.53,Good,0
2018-06-03,89.123,Good,0
Is it possible to use PowerShell to convert these CSV files into a simple table format such as this?
DATE,VALUE,QUALITY,STATE,FILENAME,PRODUCT,TAG
2018-06-01,73.83,Good,0,dinosaur.csv,water,Benjamin.Field.12.Location53.Readings
2018-06-02,45.53,Good,0,dinosaur.csv,water,Benjamin.Field.12.Location53.Readings
2018-06-03,89.123,Good,0,dinosaur.csv,water,Benjamin.Field.12.Location53.Readings
Or is there a better alternative for preparing these CSVs in a straightforward format to be ingested?
I have used PS to process simple CSVs before, but not ones with a metadata row that mattered.
Thanks
Note: This is a faster alternative to thepip3r's helpful answer, and also covers the aspect of saving the modified content back to CSV files:
By using the switch statement to efficiently loop over the lines of the files as text, the costly calls to ConvertFrom-Csv, Select-Object and Export-Csv can be avoided.
Note that the switch statement is enclosed in $(), the subexpression operator, so as to enable writing back to the same file in a single pipeline; however, doing so requires keeping the entire (modified) file in memory. If that's not an option, enclose the switch statement in & { ... } and pipe it to Set-Content to write to a temporary file, which you can later use to replace the original file (see the sketch after the sample output below).
# Create a sample CSV file in the current dir.
@'
dinosaur.csv,water,Benjamin.Field.12.Location53.Readings,
DATE,VALUE,QUALITY,STATE
2018-06-01,73.83,Good,0
2018-06-02,45.53,Good,0
2018-06-03,89.123,Good,0
'@ > sample.csv

# Loop over all *.csv files in the current dir.
foreach ($csvFile in Get-Item *.csv) {
    $ndx = 0
    $(
        switch -File $csvFile.FullName {
            default {
                if ($ndx -eq 0) { # 1st line
                    $suffix = $_ -replace ',$' # save the suffix to append to data rows later
                } elseif ($ndx -eq 1) { # header row
                    $_ + ',FILENAME,PRODUCT,TAG' # add additional column headers
                } else { # data rows
                    $_ + ',' + $suffix # append suffix
                }
                ++$ndx
            }
        }
    ) # | Set-Content $csvFile.FullName # <- activate this to write back to the same file.
      #   Use -Encoding as needed.
}
The above yields the following:
DATE,VALUE,QUALITY,STATE,FILENAME,PRODUCT,TAG
2018-06-01,73.83,Good,0,dinosaur.csv,water,Benjamin.Field.12.Location53.Readings
2018-06-02,45.53,Good,0,dinosaur.csv,water,Benjamin.Field.12.Location53.Readings
2018-06-03,89.123,Good,0,dinosaur.csv,water,Benjamin.Field.12.Location53.Readings
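As noted above, if keeping the entire modified file in memory isn't an option, the temporary-file variant could look something like the following sketch (the .tmp file name is just a placeholder):
foreach ($csvFile in Get-Item *.csv) {
    $tempFile = "$($csvFile.FullName).tmp" # temporary output file (placeholder name)
    & {
        $ndx = 0
        switch -File $csvFile.FullName {
            default {
                if ($ndx -eq 0) { # 1st line: save the suffix
                    $suffix = $_ -replace ',$'
                } elseif ($ndx -eq 1) { # header row: add the extra column headers
                    $_ + ',FILENAME,PRODUCT,TAG'
                } else { # data rows: append the suffix
                    $_ + ',' + $suffix
                }
                ++$ndx
            }
        }
    } | Set-Content $tempFile # streams each line to disk as it is produced
    Move-Item -Force $tempFile $csvFile.FullName # replace the original file
}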
## If your initial block is an accurate representation
$s = Get-Content .\test.txt

## Get the 'metadata' line
$metaline = $s[0]

## Remove the metadata line from the original and turn it into a custom PowerShell object
$n = $s | Where-Object { $_ -ne $metaline } | ConvertFrom-Csv

## Split the metadata line by a comma to get the different parts for appending to the other content
$m = $metaline.Split(',')

## Loop through each item and append the metadata information to each entry
for ($i = 0; $i -lt $n.Count; $i++) {
    $n[$i] = $n[$i] | Select-Object -Property *,FILENAME,PRODUCT,TAG ## This is a cheap way to create new properties on an object
    $n[$i].Filename = $m[0]
    $n[$i].Product = $m[1]
    $n[$i].Tag = $m[2]
}

## Display that the new objects report as the desired output
$n | Format-Table

Is there a "split" equivalent in Powershell?

I am looking for a PowerShell equivalent to the *NIX "split" command, such as seen here: http://www.computerhope.com/unix/usplit.htm
split outputs fixed-size pieces of input INPUT to files named
PREFIXaa, PREFIXab, ...
This is NOT referring to .split() as used on strings. The goal is to take a LARGE array from the pipeline and store it into X number of files, each with the same number of lines.
In my use case, the content being piped is a list of over 1 million files...
Get-ChildItem $rootPath -Recurse | select -ExpandProperty FullName | foreach{ $_.Trim()} | {...means of splitting file here...}
I don't think a cmdlet exists that does exactly what you want, but you can quickly build a function that does.
It's something of a duplicate of How can I split a text file using PowerShell? and you will find more script solutions if you google "powershell split a text file into smaller files".
Here is a piece of code to get you started; my advice is to use the .NET class System.IO.StreamReader to handle big files more efficiently. (A sketch of wrapping this logic into a reusable function follows the code.)
$sourcefilename = "D:\temp\theFiletosplit.txt"
$desFolderPathSplitFile = "D:\temp\TFTS"
$maxsize = 2 # The number of lines per file
$filenumber = 0
$linecount = 0

$reader = New-Object System.IO.StreamReader($sourcefilename)
while (($line = $reader.ReadLine()) -ne $null)
{
    Add-Content "$desFolderPathSplitFile$filenumber.txt" $line
    $linecount++
    if ($linecount -eq $maxsize)
    {
        $filenumber++
        $linecount = 0
    }
}
$reader.Close()
$reader.Dispose()
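Following the "build a function" suggestion above, a minimal sketch of wrapping the same StreamReader loop into a reusable, parameterized function (the function and parameter names are just placeholders):
function Split-File {
    param(
        [Parameter(Mandatory)] [string] $SourceFile,
        [Parameter(Mandatory)] [string] $DestinationPrefix,
        [int] $LinesPerFile = 100000
    )
    $filenumber = 0
    $linecount = 0
    $reader = New-Object System.IO.StreamReader($SourceFile)
    while (($line = $reader.ReadLine()) -ne $null) {
        # Chunk files are named <prefix>0.txt, <prefix>1.txt, ...
        Add-Content "$DestinationPrefix$filenumber.txt" $line
        if (++$linecount -eq $LinesPerFile) {
            $filenumber++
            $linecount = 0
        }
    }
    $reader.Dispose()
}
# Example call (paths are placeholders):
# Split-File -SourceFile D:\temp\theFiletosplit.txt -DestinationPrefix D:\temp\TFTS -LinesPerFile 100000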

Slow Powershell script for CSV modification

I'm using a powershell script to append data to the end of a bunch of files.
Each file is a CSV around 50Mb (Say 2 millionish lines), there are about 50 files.
The script I'm using looks like this:
$MyInvocation.MyCommand.Path
$files = ls *.csv
foreach ($f in $files)
{
    $basename = [System.IO.Path]::GetFileNameWithoutExtension($f)
    $year = $basename.Substring(0,4)
    Write-Host "Starting" $basename
    $r = [IO.File]::OpenText($f)
    while ($r.Peek() -ge 0) {
        $line = $r.ReadLine()
        $line + "," + $year | Add-Content $(".\DR_" + $basename + ".CSV")
    }
    $r.Dispose()
}
Problem is, it's pretty slow. It's taken about 12 hours to get through them.
It's not super complex, so I wouldn't expect it to take that long to run.
What could I do to speed it up?
Reading and writing a file row by row can be a bit slow. Maybe your antivirus is contributing to the slowness as well. Use Measure-Command to see which parts of the script are the slow ones.
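For instance, a quick sketch of timing the read and a single append separately (paths are placeholders):
# Sketch: time individual stages to see where the time goes.
$readTime   = Measure-Command { $null = [IO.File]::ReadAllLines('D:\temp\input.csv') }
$appendTime = Measure-Command { Add-Content 'D:\temp\output.csv' -Value 'test line' }
"Read: $($readTime.TotalMilliseconds) ms; one Add-Content call: $($appendTime.TotalMilliseconds) ms"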
As general advice, write a few large blocks rather than lots of small ones. You can achieve this by storing some content in a StringBuilder and appending its contents to the output file every, say, 1000 processed rows. Like so:
$sb = New-Object Text.StringBuilder # New StringBuilder for buffering output
$i = 1 # Row counter
while ($r.Peek() -ge 0) {
    # Add formatted stuff into the buffer
    [void]$sb.Append($("{0},{1}{2}" -f $r.ReadLine(), $year, [Environment]::NewLine))
    if (++$i % 1000 -eq 0) { # When 1000 rows are added, dump contents into file
        Add-Content $(".\DR_" + $basename + ".CSV") $sb.ToString()
        $sb = New-Object Text.StringBuilder # Reset the StringBuilder
    }
}
# Don't miss the tail of the contents
Add-Content $(".\DR_" + $basename + ".CSV") $sb.ToString()
Don't go into .NET Framework static methods and building up strings when there are cmdlets that can do the work on objects. Collect your data, add the year column, then export to your new file. You're also doing a ton of file I/O, which will slow you down further.
This will probably require a little bit more memory. But it reads the whole file at once, and writes the whole file at once. It also assumes that your CSV files have column headings. But it's much easier for someone else to look at and understand exactly what's going on (write your scripts so they can be read!).
# Always use full cmdlet names in scripts, not aliases
$files = Get-ChildItem *.csv;
foreach ($f in $files)
{
    # BaseName is a property of the file object in PowerShell; there's no need to call a static method
    $basename = $f.BaseName;
    $year = $f.BaseName.Substring(0,4)

    # Every time you use Write-Host, a puppy dies
    "Starting $basename";

    # If you've got CSV data, treat it as CSV data. PowerShell can import it into a collection natively.
    $data = Import-Csv $f;
    $exportData = @();
    foreach ($row in $data) {
        # Add a year "property" to each row object
        $row | Add-Member -MemberType NoteProperty -Name "Year" -Value $year;
        # Export the modified row to the output file
        $row | Export-Csv -NoTypeInformation -Path $("r:\DR_" + $basename + ".CSV") -Append -NoClobber
    }
}