Split a large CSV file into multiple CSV files according to size in PowerShell

I have a large CSV file and I want to split it by size, with the header included in every file.
For example, I have a 1.6 MB file and I want each child file to be no more than 512 KB, so the parent file should produce 4 child files.
I tried the simple program below, but it splits the file into blank child files.
function csvSplitter {
    $csvFile = "D:\Test\PTest\Dummy.csv";
    $split = 10;
    $content = Import-Csv $csvFile;
    $start = 1;
    $end = 0;
    $records_per_file = [int][Math]::Ceiling($content.Count / $split);
    for ($i = 1; $i -le $split; $i++) {
        $end += $records_per_file;
        $content | Where-Object { [int]$_.Id -ge $start -and [int]$_.Id -le $end } |
            Export-Csv -Path "D:\Test\PTest\Destination\file$i.csv" -NoTypeInformation;
        $start = $end + 1;
    }
}
csvSplitter
The logic for splitting by file size is yet to be written.
I tried to add both files, but I guess there is no option to add files.
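As a side note on why the child files come out blank: the Where-Object filter assumes an Id column with sequential values from 1 to the record count; if the CSV has no such column, or the Ids fall outside the $start..$end window, nothing matches and the exported files end up blank. A minimal sketch that splits by record position instead, so it works regardless of the columns, could look like this (the paths and part count are placeholders):
$csvFile = 'D:\Test\PTest\Dummy.csv'   # placeholder path from the question
$split   = 4                           # target number of child files
$content = Import-Csv $csvFile
$recordsPerFile = [Math]::Ceiling($content.Count / $split)
for ($i = 0; $i -lt $split; $i++) {
    $chunk = $content | Select-Object -Skip ($i * $recordsPerFile) -First $recordsPerFile
    if ($chunk) {
        # Export-Csv writes the header into every child file automatically
        $chunk | Export-Csv -Path "D:\Test\PTest\Destination\file$($i + 1).csv" -NoTypeInformation
    }
}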

this takes a slightly different path to a solution. [grin]
it ...
- loads the CSV as a plain text file
- saves the 1st line as a header line
- calcs the batch size from the total line count & the batch count
- uses array index ranges to grab the lines for each batch
- combines the header line with the current batch of lines
- writes that out to a text file
the reason for such a roundabout method is to save RAM. one drawback to loading the file as a CSV is the sheer amount of RAM needed. just loading the lines of text requires noticeably less RAM.
$SourceDir = $env:TEMP
$InFileName = 'LargeFile.csv'
$InFullFileName = Join-Path -Path $SourceDir -ChildPath $InFileName
$BatchCount = 4
$DestDir = $env:TEMP
$OutFileName = 'LF_Batch_.csv'
$OutFullFileName = Join-Path -Path $DestDir -ChildPath $OutFileName
#region >>> build file to work with
# remove this region when you are ready to do this with your test data OR to do this with real data
if (-not (Test-Path -LiteralPath $InFullFileName))
{
    Get-ChildItem -LiteralPath $env:APPDATA -Recurse -File |
        Sort-Object -Property Name |
        Select-Object Name, Length, LastWriteTime, Directory |
        Export-Csv -LiteralPath $InFullFileName -NoTypeInformation
}
#endregion >>> build file to work with
$CsvAsText = Get-Content -LiteralPath $InFullFileName
[array]$HeaderLine = $CsvAsText[0]
$BatchSize = [int]($CsvAsText.Count / $BatchCount) + 1
$StartLine = 1
foreach ($B_Index in 1..$BatchCount)
{
    if ($B_Index -ne 1)
    {
        $StartLine = $StartLine + $BatchSize + 1
    }
    $CurrentOutFullFileName = $OutFullFileName.Replace('_.', ('_{0}.' -f $B_Index))
    $HeaderLine + $CsvAsText[$StartLine..($StartLine + $BatchSize)] |
        Set-Content -LiteralPath $CurrentOutFullFileName
}
there is no output on screen, but i got 4 files named LF_Batch_1.csv thru LF_Batch_4.csv that contained the four parts of the source file as expected. the last file has a slightly smaller number of rows, but that is what happens when the row count is not evenly divisible by the batch count. [grin]

Try this:
Add-Type -AssemblyName System.Collections
function Split-Csv {
    param (
        [string]$filePath,
        [int]$partsNum
    )
    # Use generic lists for import/export
    [System.Collections.Generic.List[object]]$contentImport = @()
    [System.Collections.Generic.List[object]]$contentExport = @()
    # import csv-file
    $contentImport = Import-Csv $filePath
    # how many lines per export file
    $linesPerFile = [Math]::Max( [int]($contentImport.Count / $partsNum), 1 )
    # start pointer for source list
    $startPointer = 0
    # counter for file name
    $counter = 1
    # main loop
    while( $startPointer -lt $contentImport.Count ) {
        # clear export list
        [void]$contentExport.Clear()
        # determine from-to from source list to export
        $endPointer = [Math]::Min( $startPointer + $linesPerFile, $contentImport.Count )
        # move lines to export to export list
        [void]$contentExport.AddRange( $contentImport.GetRange( $startPointer, $endPointer - $startPointer ) )
        # export
        $contentExport | Export-Csv -Path ($filePath.Replace('.', $counter.ToString() + '.' ) ) -NoTypeInformation -Force
        # move pointer
        $startPointer = $endPointer
        # increase counter for filename
        $counter++
    }
}
Split-Csv -filePath 'test.csv' -partsNum 7
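If the goal is a target size per part rather than a fixed number of parts, one option is to derive partsNum from the source file's size before calling Split-Csv. This is only a sketch and assumes the rows are roughly uniform in length:
# assumption: rows are roughly the same length, so file size / target size is a fair part count
$filePath   = 'test.csv'
$targetSize = 512KB
$partsNum   = [int][Math]::Max([Math]::Ceiling((Get-Item $filePath).Length / $targetSize), 1)
Split-Csv -filePath $filePath -partsNum $partsNum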

try running this script:
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
$FilePath = $HOME +'\Documents\Projects\ADOPT\Data8277.csv'
$SplitDir = $HOME +'\Documents\Projects\ADOPT\Split\'
CSV-FileSplitter -Path $FilePath -PartSizeBytes 35MB -SplitDir $SplitDir #-Verbose
$sw.Stop()
Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"
I created this for files larger than 50 GB.
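The CSV-FileSplitter function itself isn't included in the post. As a rough idea of what a size-based, header-preserving splitter could look like, here is an assumed sketch (the function name and parameters mirror the call above, the body is not the original poster's implementation, and character counts stand in for bytes):
function CSV-FileSplitter {
    # NOTE: assumed sketch, not the original poster's code
    [CmdletBinding()]
    param (
        [string]$Path,
        [long]$PartSizeBytes,
        [string]$SplitDir
    )
    $reader = [System.IO.StreamReader]::new($Path)
    try {
        $header  = $reader.ReadLine()   # repeat the header in every part
        $writer  = $null
        $part    = 0
        $written = 0
        while ($null -ne ($line = $reader.ReadLine())) {
            # start a new part when none is open yet or the (approximate) size limit is reached
            if ($null -eq $writer -or $written -ge $PartSizeBytes) {
                if ($writer) { $writer.Close() }
                $part++
                $baseName = [System.IO.Path]::GetFileNameWithoutExtension($Path)
                $outPath  = Join-Path $SplitDir ("{0}_{1}.csv" -f $baseName, $part)
                $writer   = [System.IO.StreamWriter]::new($outPath)
                $writer.WriteLine($header)
                $written  = $header.Length
                Write-Verbose "Writing $outPath"
            }
            $writer.WriteLine($line)
            # character count is used as a rough stand-in for bytes
            $written += $line.Length
        }
    }
    finally {
        if ($writer) { $writer.Close() }
        $reader.Close()
    }
}
Because it streams line by line instead of loading the whole file, this style of splitter stays within a small, constant amount of memory even for very large inputs.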

Related

How to make a script that merges all .txt files into one .csv file with multiple columns in PowerShell

I don't know how to merge multiple .txt files with data into one .csv file, with each .txt file separated into its own column.
This is my code so far:
$location = (Get-Location).Path
$files = Get-ChildItem $location -Filter "*.asd.txt"
$data = @()
foreach ($file in $files) {
    $fileData = Get-Content $file.FullName
    foreach ($line in $fileData) {
        $lineData = $line -split "\t"
        $data = $lineData[1]
        Add-Content -Path "$location\output.csv" -Value $data
    }
}
Each of the files looks like this:
I want to keep the first column, "Wavelength", and put the second columns from all the files in the folder next to each other. Each header will start with the exact name,
"stovikmladyDoupno2 2020080500001.asd" or "stovikmladyDoupno2 2020080500002.asd" and so on ....
so it should look like this:
I have been looking for information for two days and still don't know how. I tried putting "," at the end of each line, thinking Excel would handle that, but nothing helped.
Here I provide a few files as test data:
https://mega.nz/folder/zNhTzR4Z#rpc-BQdRfm3wxl87r9XUkw
A few lines of data:
Wavelength stovikmladyDoupno2 2020080500000.asd
350 6.38961399706465E-02
351 6.14107911262903E-02
352 6.04866108251357E-02
353 5.83485359067184E-02
354 0.054978792413247
355 5.27014859356317E-02
356 5.34849237528764E-02
357 5.32841277775603E-02
358 5.23466655229364E-02
359 5.47595002186027E-02
360 5.22061034631109E-02
361 4.90149806042666E-02
362 4.81633530421385E-02
363 4.83974076557941E-02
364 4.65219929658367E-02
365 0.044800930294557
366 4.47830287392802E-02
367 4.46947539436297E-02
368 0.043756926558447
369 4.31725380363072E-02
370 4.36867609723618E-02
371 4.33227601805265E-02
372 4.29978664449687E-02
373 4.23860463187361E-02
374 4.12183604375401E-02
375 4.14306521081773E-02
376 4.11760903772502E-02
377 4.06421127128478E-02
378 4.09771489689262E-02
379 4.10083126746385E-02
380 4.05161601354181E-02
381 3.97904564387456E-02
I assumed a location, since I'm not fond of scripts that don't declare a literal path. Please adjust the path as needed.
$Files = Get-ChildItem J:\Test\*.txt -Recurse
$Filecount = 0
$ObjectCollectionArray = @()
# First, parse and collect each row in an array, while keeping the datetime information from the filename.
foreach ($File in $Files) {
    $Filecount++
    Write-Host $Filecount
    $DateTime = $File.fullname.split(" ").split(".")[1]
    $Content = Get-Content $File.FullName
    foreach ($Row in $Content) {
        $Split = $Row.Split("`t")
        if ($Split[0] -ne 'Wavelength') {
            $Object = [PSCustomObject]@{
                'Datetime'   = $DateTime
                'Number'     = $Split[0]
                'Wavelength' = $Split[1]
            }
            $ObjectCollectionArray += $Object
        }
    }
}
# Match by number and create a new object relating the number to each different datetime.
$GroupedCollection = @()
$Grouped = $ObjectCollectionArray | Group-Object number
foreach ($GroupedNumber in $Grouped) {
    $NumberObject = [PSCustomObject]@{
        'Number' = $GroupedNumber.Name
    }
    foreach ($Occurance in $GroupedNumber.Group) {
        $NumberObject | Add-Member -NotePropertyName $Occurance.Datetime -NotePropertyValue $Occurance.wavelength
    }
    $GroupedCollection += $NumberObject
}
$GroupedCollection | Export-Csv -Path J:\Test\result.csv -NoClobber -NoTypeInformation
What you're looking to do is quite a hard task; there are a few ways to do it. This method requires that all files be in memory to process them. You can definitely treat these files as TSVs, so Import-Csv -Delimiter "`t" is an option, letting you deal with objects instead of plain text.
# using this temp dictionary to create objects for each line of each tsv
$tmp = [ordered]@{}
# get all files and enumerate
$csvs = Get-ChildItem $location -Filter *.asd.txt | ForEach-Object {
    # get their content as objects
    $content = $_ | Import-Csv -Delimiter "`t"
    # get their property Name that is not `Wavelength`
    $property = $content[0].PSObject.Properties.Where{ $_.Name -ne 'Wavelength' }.Name
    # output an object holding the total lines of this csv,
    # its content and the property name of interest
    [pscustomobject]@{
        Lines    = $content.Count
        Content  = $content
        Property = $property
    }
}
# use a scriptblock to allow streaming so `Export-Csv` starts exporting as
# output is going through the pipeline
& {
    # for loop used for each line of the Tsv having the highest number of lines
    for($i = 0; $i -lt [System.Linq.Enumerable]::Max([int[]] $csvs.Lines); $i++) {
        # this boolean is used to preserve the "Wavelength" value of the first Tsv
        $isFirstCsv = $true
        foreach($csv in $csvs) {
            # if this is the first object
            if($isFirstCsv) {
                # add the value of "Wavelength"
                $tmp['Wavelength'] = $csv.Content[$i].Wavelength
                # and set the bool to false, since we are only using this once
                $isFirstCsv = $false
            }
            # then add the value of each property of each Tsv to the temp dictionary
            $tmp[$csv.Property] = $csv.Content[$i].($csv.Property)
        }
        # then output this object
        [pscustomobject] $tmp
        # clear the temp dictionary
        $tmp.Clear()
    }
} | Export-Csv path\to\result.csv -NoTypeInformation
Here is a much more efficient approach that treats the files as plain text; this method is much faster and more memory efficient, though not as reliable. It uses a StreamReader to read each file's contents line by line and a StringBuilder to construct each output line.
& {
    # get all files and enumerate
    $readers = Get-ChildItem $location -Filter *.asd.txt | ForEach-Object {
        # create a stream reader for each file
        [System.IO.StreamReader] $_.FullName
    }
    # this StringBuilder is used to construct each line
    $sb = [System.Text.StringBuilder]::new()
    # while any of the readers has more content
    while($readers.EndOfStream -contains $false) {
        # signals this is our first Tsv
        $isFirstReader = $true
        # enumerate each reader
        foreach($reader in $readers) {
            # if this is the first Tsv
            if($isFirstReader) {
                # append the line as-is, only trimming excess white space
                $sb = $sb.Append($reader.ReadLine().Trim())
                $isFirstReader = $false
                # go to next reader
                continue
            }
            # if this is not the first Tsv,
            # split on Tab and exclude the first token (Wavelength)
            $null, $line = $reader.ReadLine().Trim() -split '\t'
            # append a Tab + this line
            $sb = $sb.Append("`t$line")
        }
        # append a new line and output the constructed string
        $sb.AppendLine().ToString()
        # and clear it for next lines
        $sb = $sb.Clear()
    }
    # dispose all readers when done
    $readers | ForEach-Object Dispose
} | Set-Content path\to\result.tsv -NoNewline

How to delete rows in a file under a certain condition?

I've got the file 'Test.txt', which is updated automatically. Every hour a new value is added to this file, like:
Some text 1:57
Some text 2:57
Some text 3:57
Some text 4:57
And I need to check when this file is more than 100 MB in size and then delete the FIRST half of the file. I mean 'Some text 1:57' and 'Some text 2:57' should be deleted in this case, if the file has 4 values.
For now I have the following code, which gives me the current size in bytes.
$TestFileSize = (get-item C:\Test.txt).Length
if($TestFileSize -gt 100000){
# --- here the code that should delete the first 50 rows if file has 100 and so on.
}
Any advice? Thanks!
There are several ways of doing this. Here are two:
$file = 'C:\Test.txt'
$TestFileSize = (Get-Item -Path $file).Length
if ($TestFileSize -gt 100000) {
    # --- here the code that should delete the first 50 rows if file has 100 and so on.
    $nlines = @([System.IO.File]::ReadAllLines($file)).Count
    (Get-Content -Path $file -Tail ([math]::Ceiling($nlines / 2))) |
        Set-Content -Path $file
}
or
$file = 'C:\Test.txt'
$TestFileSize = (Get-Item -Path $file).Length
if ($TestFileSize -gt 100000) {
    $content = Get-Content -Path $file
    $nlines = @($content).Count
    $content[([math]::Ceiling($nlines / 2))..($nlines - 1)] | Set-Content -Path $file
}

Adding new columns to a CSV file is incredibly slow

I have a bunch of .csv files and I'm trying to add in some new column headers and their values (which are all blank anyway) and then output this to a new .csv file. My script currently runs and works fine but it takes about 5 minutes to complete the operation on a 60MB file with about 70,000 rows - I have about 100 files to do this on so it will take a while using this script.
My code is below, it's quite simple but clearly inefficient!
Import-Csv $strFilePath |
    Select-Object *, @{Name='NewHeader';Expression={''}},
        @{Name='NewHeader2';Expression={''}},
        @{Name='NewHeader3';Expression={''}},
        @{Name='NewHeader4';Expression={''}} |
    Export-Csv $($strFilePath + ".new") -NoTypeInformation
As pointed out in the comments, I think it would be better to treat it as simple text, without the needless conversion to and from objects.
$path = 'C:\test'
$newHeaders = 'NewHeader1','NewHeader2','NewHeader3','NewHeader4'
$files = Get-ChildItem -LiteralPath $path -Filter *.csv
$newHeadersString = @(''; $newHeaders | foreach { '"{0}"' -f $_ }) -join ','
$newColumnsString = ',""' * $newHeaders.Count
foreach ($file in $files) {
    $sr = $file.OpenText()
    $outfile = New-Item ($file.FullName + '.new') -Force
    $sw = [IO.StreamWriter]::new($outfile.FullName)
    $sw.WriteLine($sr.ReadLine() + $newHeadersString)
    while (!$sr.EndOfStream) { $sw.WriteLine($sr.ReadLine() + $newColumnsString) }
    $sr.Close()
    $sw.Close()
}

Using PowerShell to recursively search a directory for files that only contain zeros

I have a directory that contains millions of files in binary format. Some of these files were written to the disk wrong (no idea how). The files are not empty, but they only contain zeros. Here's an example: http://pastebin.com/5b7jHjgr
I need to search this directory, find the files that are all zeros and write their path out to a file.
I've been experimenting with Format-Hex and Get-Content, but my limited PowerShell experience is tripping me up. Format-Hex reads the entire file when I only need the first few bytes, and Get-Content expects text files.
Use IO.BinaryReader:
Get-ChildItem r:\1\ -Recurse -File | Where {
    $bin = [IO.BinaryReader][IO.File]::OpenRead($_.FullName)
    foreach ($byte in $bin.ReadBytes(16)) {
        if ($byte) { $bin.Close(); return $false }
    }
    $bin.Close()
    $true
}
In the old PowerShell 2.0, which lacks the -File parameter, you'll need to filter manually:
Get-ChildItem r:\1\ -Recurse | Where { $_ -is [IO.FileInfo] } | Where { ..... }
You can use a System.IO.FileStream object to read the first n bytes of each file.
The following code reads the first ten bytes of each file:
Get-ChildItem -Path C:\Temp -File -Recurse | ForEach-Object -Process {
    # Open file for reading
    $file = [System.IO.FileStream]([System.IO.File]::OpenRead($_.FullName))
    # Go through the first ten bytes of the file
    $containsTenZeros = $true
    for( $i = 0; $i -lt $file.Length -and $i -lt 10; $i++ )
    {
        if( $file.ReadByte() -ne 0 )
        {
            $containsTenZeros = $false
        }
    }
    # Close the file handle so it is not left open
    $file.Close()
    # If the first ten bytes are all zero then add the file's full path to List.txt
    if( $containsTenZeros )
    {
        Add-Content -Path List.txt -Value $_.FullName
    }
}

PowerShell - Pass a list of directory paths to a FOR loop - Output results to CSV

The code below works. Rather than specify the path manually, I would like to pass a list of values from a CSV file, E:\Data\paths.csv, and then output an individual CSV file for each path processed, showing the directories down to $Depth for that path.
$StartLevel = 0 # 0 = include base folder, 1 = sub-folders only, 2 = start at 2nd level
$Depth = 10     # How many levels deep to scan
$Path = "E:\Data\MyPath" # starting path
For ($i = $StartLevel; $i -le $Depth; $i++) {
    $Levels = "\*" * $i
    (Resolve-Path $Path$Levels).ProviderPath | Get-Item | Where PsIsContainer |
        Select FullName
}
Thanks,
Phil
Get-Help Import-Csv will help you in this regard.
regards,
kvprasoon
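As a minimal illustration of that hint (assuming E:\Data\paths.csv has a header row with a Path column):
# assumption: E:\Data\paths.csv contains a "Path" column
foreach ($row in Import-Csv 'E:\Data\paths.csv') {
    Write-Host "Processing $($row.Path)"
}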
I assume you want something like the following:
# Create sample input CSV
@"
Path,StartLevel,Depth
"E:\Data\MyPath",0,10
"@ > PathSpecs.csv
# Loop over each input CSV row (object with properties
# .Path, .StartLevel, and .Depth)
foreach ($pathSpec in Import-Csv PathSpecs.csv) {
    & { For ([int] $i = $pathSpec.StartLevel; $i -le $pathSpec.Depth; $i++) {
            $Levels = "\*" * $i
            Resolve-Path "$($pathSpec.Path)$Levels" | Get-Item | Where PsIsContainer |
                Select FullName
        } } | # Export paths to a CSV file named "Path-<input-path-with-punct-stripped>.csv"
        Export-Csv -NoTypeInformation "Path-$($pathSpec.Path -replace '[^\p{L}0-9]+', '_').csv"
}
Note that your approach to breadth-first enumeration of subdirectories in the subtree works, but will be quite slow with large subtrees.
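As an aside, on PowerShell 5.0 and later Get-ChildItem has a -Depth parameter that does the depth-limited enumeration in a single call, which is typically much faster. A rough sketch reusing the same PathSpecs.csv layout (note that -Depth does not reproduce the StartLevel offset, so filter afterwards if you need it):
# assumes PowerShell 5.0+ (for -Depth) and the PathSpecs.csv layout shown above
foreach ($pathSpec in Import-Csv PathSpecs.csv) {
    Get-ChildItem -LiteralPath $pathSpec.Path -Directory -Recurse -Depth ([int]$pathSpec.Depth) |
        Select-Object FullName |
        Export-Csv -NoTypeInformation "Path-$($pathSpec.Path -replace '[^\p{L}0-9]+', '_').csv"
}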