I am merging a lot of large CSV files, e.g. while skipping the leading junk and appending the filename to each line:
Get-ChildItem . | Where Name -match "Q[0-4]20[0-1][0-9].csv" |
    ForEach-Object {
        $file = $_.BaseName
        Get-Content $_.FullName | Select-Object -Skip 3 | % {
            "$_,${file}" | Out-File -Append temp.csv -Encoding ASCII
        }
    }
In PowerShell this is incredibly slow even on an i7/16GB machine (~5 megabyte/minute). Can I make it more efficient or should I just switch to e.g. Python?
Get-Content / Set-Content perform poorly with larger files. .NET streams are a good alternative when performance is key, so with that in mind let's use one to read in each file and another to write out the results.
$rootPath = "C:\temp"
$outputPath = "C:\test\somewherenotintemp.csv"
$streamWriter = [System.IO.StreamWriter]$outputPath
Get-ChildItem $rootPath -Filter "*.csv" -File | ForEach-Object {
    $file = $_.BaseName
    [System.IO.File]::ReadAllLines($_.FullName) |
        Select-Object -Skip 3 | ForEach-Object {
            $streamWriter.WriteLine(('{0},"{1}"' -f $_, $file))
        }
}
$streamWriter.Close(); $streamWriter.Dispose()
Create a StreamWriter ($streamWriter) to output the edited lines of each file. We could read and write each file in larger batches, which would be faster, but since we need to ignore a few lines and make changes to each remaining one, processing line by line is simpler. Avoid writing anything to the console during this time, as it will just slow everything down.
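For illustration, here is a rough sketch of that batch idea (not the original code above): it reuses the $streamWriter and $file variables from the script, transforms a whole file in memory, and writes it with a single call per file:

$lines = [System.IO.File]::ReadAllLines($_.FullName) |
    Select-Object -Skip 3 |
    ForEach-Object { '{0},"{1}"' -f $_, $file }   # same per-line transform as above
# one write per file instead of one per line
if ($lines) { $streamWriter.WriteLine($lines -join [Environment]::NewLine) }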
What '{0},"{1}"' -f $_,$file does is quote that last "column" that is added in case the basename contains spaces.
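For example (made-up values):

'{0},"{1}"' -f 'a,b,c', 'Q1 2015 Report'   # -> a,b,c,"Q1 2015 Report"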
Measure-Command -Expression {
    Get-ChildItem C:\temp | Where Name -like "*.csv" | ForEach-Object {
        $file = $_.BaseName
        Get-Content $_.FullName | Select-Object -Skip 3 | ForEach-Object {
            "$_,$($file)" | Out-File -Append C:\temp\t\tempe1.csv -Encoding ASCII -Force
        }
    }
} # TotalSeconds : 12,0526802 for 11415 lines
If you first put everything into an array in memory, things go a lot faster:
Measure-Command -Expression {
    $arr = @()
    Get-ChildItem C:\temp | Where Name -like "*.csv" | ForEach-Object {
        $file = $_.BaseName
        $arr += Get-Content $_.FullName | Select-Object -Skip 3 | ForEach-Object {
            "$_,$($file)"
        }
    }
    $arr | Out-File -Append C:\temp\t\tempe2.csv -Encoding ASCII -Force
} # TotalSeconds : 0,8197193 for 11415 lines
EDIT: Fixed it so that your filename was added to each row.
To keep -Append from ruining the performance of your script, you can use a buffer array variable:
# Initialize buffer
$csvBuffer = @()

Get-ChildItem *.csv | ForEach-Object {
    $file = $_.BaseName
    $content = Get-Content $_.FullName | Select-Object -Skip 3 | % {
        "$_,${file}"
    }

    # Populate buffer
    $csvBuffer += $content

    # Write buffer to disk once it contains 5000 lines or more
    if( $csvBuffer.Count -ge 5000 )
    {
        $csvBuffer | Out-File -FilePath temp.csv -Encoding ASCII -Append
        $csvBuffer = @()
    }
}

# Important : flush the remainder of the buffer
if( $csvBuffer.Count -gt 0 )
{
    $csvBuffer | Out-File -FilePath temp.csv -Encoding ASCII -Append
}
I'm trying (badly) to work through combining CSV files into one file and prepending a column that contains the file name. I'm new to PowerShell, so hopefully someone can help here.
I tried initially to do the well documented approach of using Import-Csv / Export-Csv, but I don't see any options to add columns.
Get-ChildItem -Filter *.csv | Select-Object -ExpandProperty FullName | Import-Csv | Export-Csv CombinedFile.txt -UseQuotes Never -NoTypeInformation -Append
Next I'm trying to loop through the files and append the name, which kind of works, but for some reason this stops after the first row is generated. Since it's not a CSV process, I have to use the switch to skip the first title row of each file.
$getFirstLine = $true
Get-ChildItem -Filter *.csv | Where-Object {$_.Name -NotMatch "Combined.csv"} | foreach {
$filePath = $_
$collection = Get-Content $filePath
foreach($lines in $collection) {
$lines = ($_.Basename + ";" + $lines)
}
$linesToWrite = switch($getFirstLine) {
$true {$lines}
$false {$lines | Select -Skip 1}
}
$getFirstLine = $false
Add-Content "Combined.csv" $linesToWrite
}
This is where the -PipelineVariable parameter comes in really handy. You can set a variable to represent the current iteration in the pipeline, so you can do things like this:
Get-ChildItem -Filter *.csv -PipelineVariable File | Where-Object {$_.Name -NotMatch "Combined.csv"} | ForEach-Object { Import-Csv $File.FullName } | Select *,@{l='OriginalFile';e={$File.Name}} | Export-Csv Combined.csv -NoTypeInformation
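As a standalone sketch of what -PipelineVariable buys you (file names here are hypothetical): $File keeps referring to the current Get-ChildItem item even in later pipeline stages, where $_ has already become something else:

Get-ChildItem -Filter *.csv -PipelineVariable File |
    ForEach-Object { Get-Content $_.FullName -TotalCount 1 } |
    ForEach-Object { '{0} -> {1}' -f $File.Name, $_ }   # e.g. "report1.csv -> Col1,Col2"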
Merging your CSVs into one and adding a column for the file's name can be done as follows, using a calculated property on Select-Object:
Get-ChildItem -Filter *.csv | ForEach-Object {
    $fileName = $_.Name
    Import-Csv $_.FullName | Select-Object @{
        Name       = 'FileName'
        Expression = { $fileName }
    }, *
} | Export-Csv path/to/merged.csv -NoTypeInformation
I'm trying to merge CSV files in Powershell. I've read numerous answers here but I'm stuck on this problem.
I have a list of CSV files, with two difficulties:
[A] each file has a metadata line; the headers are on the second line.
[B] each file has the same structure, but sometimes quotes surround a column value to escape its content.
Thanks to this question : Merging multiple CSV files into one using PowerShell,
I'm able to solve these two problems individually.
However, I'm stuck at combining the solutions.
Partial solution A
Skips every metadata line, as well as the header of subsequent files
Adapting the answer from kemiller2002:
$sourcefilefolderPath = "C:\CSV_folder"
$destinationfilePath = "C:\appended_files.csv"
$getHeader = $true

Get-ChildItem -Path $sourcefilefolderPath -Filter *.csv -Recurse | ForEach-Object {
    $filePath = $_.FullName
    $lines = Get-Content $filePath
    $linesToWrite = switch($getHeader) {
        $true  {$lines | Select -Skip 1} # skips only the metadata line
        $false {$lines | Select -Skip 2} # skips both the metadata line and the headers
    }
    $getHeader = $false
    Add-Content $destinationfilePath $linesToWrite
}
The problem: Import-Csv $destinationfilePath gives inconsistent results, as the quoting can differ between source files.
Partial solution B
successfully handles randomly quoted columns
Solution provided by stinkyfriend.
Import-Csv seems to import the data gracefully even when the quoting differs from one column to the next, as long as it is consistent for every line of the source file.
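For instance, this small demo (illustrative data, not from the actual files) parses cleanly despite the mixed quoting:

@'
Name,Value
"Alpha",1
Beta,"2"
'@ -split "`r?`n" | ConvertFrom-Csv   # -> two objects with Name/Value properties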
I could not combine this solution with the one above.
Get-ChildItem -Path $sourcefilefolderPath -File -Filter *.csv -Recurse |
    Select-Object -ExpandProperty FullName |
    Import-Csv |
    Export-Csv $destinationfilePath -NoTypeInformation -Append
Thanks a lot for your help !
Solution C
produces a blank file on my PC
using a suggestion from Mathias R. Jessen
Get-ChildItem -Path $sourcefilefolderPath -File -Filter *.csv -Recurse | foreach {
    Write-Host $_.FullName |
    Get-Content $_.FullName | Select-Object -Skip 1 | ConvertFrom-Csv |
    Export-Csv $destinationfilePath -NoTypeInformation -Append
}
--- EDIT ---
RESULT
I could solve the problem by creating appended_files.csv from the first matching source file and then appending to it.
$pattern_sourceFile = "*.csv*"

$list_files = Get-ChildItem -Path $sourcefilefolderPath -File -Recurse | Where {
    $_.FullName -like $pattern_sourceFile }

Get-Content $list_files[0].FullName |
    Select-Object -Skip 1 | # skips the metadata line
    ConvertFrom-Csv | Export-Csv $destinationfilePath -NoTypeInformation

$list_files |
    Select-Object -Skip 1 | # skips $list_files[0]
    ForEach-Object { Get-Content $_.FullName |
        Select-Object -Skip 1 | # skips the metadata line
        ConvertFrom-Csv |
        Export-Csv $destinationfilePath -NoTypeInformation -Append }
Use ConvertFrom-Csv instead of Import-Csv; that way you can still control how many lines to skip:
Get-Content $file | Select -Skip 1 | ConvertFrom-Csv
So you'll end up with something like:
$sourcefilefolderPath = "C:\CSV_folder"
$destinationfilePath = "C:\appended_files.csv"
Get-ChildItem -Path $sourcefilefolderPath -Filter *.csv -Recurse | ForEach-Object {
    Get-Content $_.FullName | Select-Object -Skip 1 | ConvertFrom-Csv |
        Export-Csv -Path $destinationfilePath -NoTypeInformation -Append
}
I'm getting a memory exception while running this code. Is there a way to filter one file at a time, write the output, and append after processing each file? The code below seems to load everything into memory.
$inputFolder = "C:\Change\2019\October"
$outputFile = "C:\Change\2019\output.csv"
Get-ChildItem $inputFolder -File -Filter '*.csv' |
ForEach-Object { Import-Csv $_.FullName } |
Where-Object { $_.machine_type -eq 'workstations' } |
Export-Csv $outputFile -NoType
Maybe you can export and filter your files one by one and append the result to your output file, like this:
$inputFolder = "C:\Change\2019\October"
$outputFile = "C:\Change\2019\output.csv"
Remove-Item $outputFile -Force -ErrorAction SilentlyContinue
Get-ChildItem $inputFolder -Filter "*.csv" -file | %{import-csv $_.FullName | where machine_type -eq 'workstations' | export-csv $outputFile -Append -notype }
Note: The reason for not using Get-ChildItem ... | Import-Csv ... - i.e., for not piping Get-ChildItem directly to Import-Csv and instead calling Import-Csv from the script block ({ ... }) of an auxiliary ForEach-Object call - is a bug in Windows PowerShell that has since been fixed in PowerShell Core; see the bottom section for a more concise workaround.
However, even output from ForEach-Object script blocks should stream to the remaining pipeline commands, so you shouldn't run out of memory - after all, a salient feature of the PowerShell pipeline is object-by-object processing, which keeps memory use constant, irrespective of the size of the (streaming) input collection.
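To make that concrete, here is a sketch contrasting the two shapes (same hypothetical folder and column as in the question); the first pipeline streams row by row, the second materializes everything in memory before writing:

# Streaming: rows flow through one at a time and are released as they are written.
Get-ChildItem $inputFolder -File -Filter *.csv |
    ForEach-Object { Import-Csv $_.FullName } |
    Where-Object { $_.machine_type -eq 'workstations' } |
    Export-Csv $outputFile -NoTypeInformation

# Collecting: the entire data set is held in $all before filtering and writing,
# so memory grows with the total number of rows.
$all = Get-ChildItem $inputFolder -File -Filter *.csv | ForEach-Object { Import-Csv $_.FullName }
$all | Where-Object { $_.machine_type -eq 'workstations' } | Export-Csv $outputFile -NoTypeInformation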
You've since confirmed that avoiding the aux. ForEach-Object call does not solve the problem, so we still don't know what causes your out-of-memory exception.
Update:
This GitHub issue contains clues as to the reason for excessive memory use, especially with many properties that contain small amounts of data.
This GitHub feature request proposes using strongly typed output objects to help the issue.
The following workaround, which uses the switch statement to process the files as text files, may help:
$header = ''
Get-ChildItem $inputFolder -Filter *.csv | ForEach-Object {
    $i = 0
    switch -Wildcard -File $_.FullName {
        '*workstations*' {
            # NOTE: If no other columns contain the word `workstations`, you can
            # simplify and speed up the command by omitting the `ConvertFrom-Csv` call
            # (you can make the wildcard matching more robust with something
            # like '*,workstations,*')
            if ((ConvertFrom-Csv "$header`n$_").machine_type -ne 'workstations') { continue }
            $_ # row whose 'machine_type' column value equals 'workstations'
        }
        default {
            if ($i++ -eq 0) {
                if ($header) { continue }       # header already written
                else { $header = $_; $_ }       # header row of 1st file
            }
        }
    }
} | Set-Content $outputFile
Here's a workaround for the bug of not being able to pipe Get-ChildItem output directly to Import-Csv, by passing it as an argument instead:
Import-Csv -LiteralPath (Get-ChildItem $inputFolder -File -Filter *.csv) |
Where-Object { $_.machine_type -eq 'workstations' } |
Export-Csv $outputFile -NoType
Note that in PowerShell Core you could more naturally write:
Get-ChildItem $inputFolder -File -Filter *.csv | Import-Csv |
Where-Object { $_.machine_type -eq 'workstations' } |
Export-Csv $outputFile -NoType
Solution 2 :
$inputFolder = "C:\Change\2019\October"
$outputFile = "C:\Change\2019\output.csv"
$encoding = [System.Text.Encoding]::UTF8 # modify encoding if necessary
$Delimiter=','
# find the header for your files => take the first row of the first non-empty file
$Header = Get-ChildItem -Path $inputFolder -Filter *.csv | Where Length -gt 0 | Select -First 1 | Get-Content -TotalCount 1

# if no header was found, there is no file with size > 0 => quit
if (!$Header) { return }

# create an array from the header
$HeaderArray = $Header -split $Delimiter -replace '"', ''

# open the output file
$w = New-Object System.IO.StreamWriter($outputfile, $true, $encoding)

# write the header that was found
$w.WriteLine($Header)

# loop over the csv files
Get-ChildItem $inputFolder -File -Filter "*.csv" | ForEach-Object {
    # open the file for reading
    $r = New-Object System.IO.StreamReader($_.FullName, $encoding)
    $skiprow = $true
    while (($line = $r.ReadLine()) -ne $null)
    {
        # skip the header row of this file
        if ($skiprow)
        {
            $skiprow = $false
            continue
        }

        # build an object for the current row using the header found above
        $Object = $line | ConvertFrom-Csv -Header $HeaderArray -Delimiter $Delimiter

        # write the row to the output file if it matches the requested condition
        if ($Object.machine_type -eq 'workstations') { $w.WriteLine($line) }
    }
    $r.Close()
    $r.Dispose()
}
$w.Close()
$w.Dispose()
You have to read from and write to the .csv files one row at a time, using StreamReader and StreamWriter:
$filepath = "C:\Change\2019\October"
$outputfile = "C:\Change\2019\output.csv"
$encoding = [System.Text.Encoding]::UTF8

$files = Get-ChildItem -Path $filePath -Filter *.csv
$w = New-Object System.IO.StreamWriter($outputfile, $true, $encoding)
$header = $null
foreach ($file in $files)
{
    $r = New-Object System.IO.StreamReader($file.FullName, $encoding)
    $firstLine = $true
    while (($line = $r.ReadLine()) -ne $null)
    {
        if ($firstLine)
        {
            # write the header only once, taken from the first file
            if (-not $header) { $header = $line; $w.WriteLine($header) }
            $firstLine = $false
            continue
        }
        # keep only rows whose machine_type column equals 'workstations'
        # (simple comma split of the header; fine as long as headers contain no embedded commas)
        $row = $line | ConvertFrom-Csv -Header ($header -split ',')
        if ($row.machine_type -eq 'workstations') { $w.WriteLine($line) }
    }
    $r.Close()
    $r.Dispose()
}
$w.Close()
$w.Dispose()
get-content *.csv | add-content combined.csv
Make sure combined.csv doesn't exist when you run this, or it's going to go full Ouroboros.
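If you can't guarantee that, a hedged variant that simply excludes the output file from the input set:

Get-ChildItem *.csv | Where-Object Name -ne 'combined.csv' | Get-Content | Add-Content combined.csv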
I need to take a slew of csv files from a directory and get them into an array in Powershell (to eventually manipulate and write back to a CSV).
The problem is there are 5 file types. I need around 8 columns from each. The columns are essentially the same, but have different headings.
Is there an easy way to do this? I started creating a custom object with my 8 fields, looping through the files, importing each one, looking at the filename (which tells me which column names I need), and then using a bunch of ifs to add the data to my custom object array.
I was wondering if there is a simpler way...like with a template saying which columns from each file.
I wound up doing this. It may not have been the most efficient, but it works. I ended up writing out each file separately and combining them at the end, as PowerShell really got bogged down (over a million rows combined).
$Newcsv = @()
$path = "c:\scrap\BWFILES\"
$files = gci -path $path -recurse -filter *.csv | Where-Object { ! ($_.PSIsContainer) }
$counter = 1
foreach ($file in $files)
{
    $csv = Import-Csv $file.FullName
    if ($file.Name -like '*SAV*')
    {
        $Newcsv = $csv | Select-Object @{Name="PRODUCT";Expression={"SV"}},DMBRCH,DMACCT,DMSHRT
    }
    if ($file.Name -like '*TIME*')
    {
        $Newcsv = $csv | Select-Object @{Name="PRODUCT";Expression={"TM"}},TMBRCH,TMACCT,TMSHRT
    }
    if ($file.Name -like '*TRAN*')
    {
        $Newcsv = $csv | Select-Object @{Name="PRODUCT";Expression={"TR"}},DMBRCH,DMACCT,DMSHRT
    }
    if ($file.Name -like '*LN*')
    {
        $Newcsv = $csv | Select-Object @{Name="PRODUCT";Expression={"LN"}},LNBRCH,LNNOTE,LNSHRT
    }
    $Newcsv | Export-Csv "C:\scrap\$($file.Name)$counter.csv" -Force -NoTypeInformation
    $counter++
}
get-childItem "c:\scrap\*.csv" | foreach {
$filePath = $_
$lines = $lines = Get-Content $filePath
$linesToWrite = switch($getFirstLine) {
$true {$lines}
$false {$lines | Select -Skip 1}
}
$getFirstLine = $false
Add-Content "c:\scrap\combined.csv" $linesToWrite
}
With a hashtable for reference, a little regex matching, and the automatic variable $Matches in a ForEach-Object loop (alias % used), that can all be shortened to:
$path = "c:\scrap\BWFILES\"
$Reference = #{
'SAV' = 'SV'
'TIME' = 'TM'
'TRAN' = 'TR'
'LN'='LN'
}
Set-Content -Value "PRODUCT,BRCH,ACCT,SHRT" -Path 'c:\scrap\combined.csv'
gci -path $path -recurse -filter *.csv | Where-Object { !($_.psiscontainer) -and $_.Name -match ".*(SAV|TIME|TRAN|LN).*"}|%{
$Product = $Reference[($Matches[1])]
Import-CSV $_.FullName | Select-Object #{Name="PRODUCT";Expression={$Product}},*BRCH,#{l='Acct';e={$_.LNNOTE, $_.DMACCT, $_.TMACCT|?{$_}}},*SHRT | ConvertTo-Csv -NoTypeInformation | Select -Skip 1 | Add-Content 'c:\scrap\combined.csv'
}
That should produce the exact same file. The only slightly tricky part was the LNNOTE/TMACCT/DMACCT field, since you obviously can't use a simple wildcard like *SHRT for it.
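If the ?{$_} part looks opaque: it lists the three candidate properties and keeps whichever are non-empty, so for a typical row only the populated account column survives. A tiny illustration with a made-up row:

$row = [pscustomobject]@{ LNNOTE = ''; DMACCT = $null; TMACCT = 'T-123' }
$row.LNNOTE, $row.DMACCT, $row.TMACCT | Where-Object { $_ }   # -> T-123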
I have a CSV file (file1) that looks like: (User dirs and the size)
Initials,Size
User1,10
User2,100
User3,131
User4,140
I have another CSV file (file2) that looks like: (VIP users)
User2
User4
Now what I'm trying to do is update file1 so it looks like:
User1,10
User3,131
User2 and User4 are removed because they are in file2.
I can get them removed, but at the same time I remove the size for all users, so my output contains only the users:
User1
User3
My code:
$SourcePath = "\\server1\info\SYSINFO\UsrSize"
$DestinationFile = "\\server1\info\SYSINFO\UsrSize\OverLimit\UsersOverLimit1.log"
$VIP_Exclusion_List = "\\server1\info\SYSINFO\UsrSize\OverLimit\_VIP_EXCLUSION_LIST.txt"
$Database = "\\server1\info\SYSINFO\UsrSize\OverLimit\_UsersOverLimitDATABASE.log"
$INT_SizeToLookFor = 100
dir $SourcePath -Filter usr*.txt | import-csv -delimiter "`t" |
Where-Object {[INT] $_."Size excl. Backup/Pst" -ge $INT_SizeToLookFor} |
Select-Object Initials,"Size excl. Backup/Pst" | convertto-csv -NoTypeInformation | % { $_ -replace '"', ""} | out-file $DestinationFile ;
$Userlist = import-csv $DestinationFile | Select-Object Initials |
convertto-csv -NoTypeInformation | % { $_ -replace '"', ""};
compare-object ($Userlist) (get-content $VIP_Exclusion_List) |
select-object inputObject | convertto-csv -NoTypeInformation |
% { $_ -replace '"', ""} | out-file "\\server1\info\SYSINFO\UsrSize\OverLimit\UsersOverLimitThisTime.log";
If the files are small-ish and you don't care too much about performance, then the following would be a trivial way:
$data = Import-Csv file1
$vips = Get-Content file2   # file2 is a plain list of names with no header, so read it as text
$data = $data | ?{ $vips -notcontains $_.Initials }
$data | Export-Csv file1_new -NoTypeInformation
A faster way would be to put the names to remove into a set, but given the numbers you're talking about here, I doubt you'll get into the range of thousands or millions of users.
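For completeness, a sketch of that set-based variant (it assumes, as above, that file2 is a plain list of initials, one per line; PowerShell 5+ for ::new):

$vips = [System.Collections.Generic.HashSet[string]]::new([string[]](Get-Content file2))
Import-Csv file1 |
    Where-Object { -not $vips.Contains($_.Initials) } |
    Export-Csv file1_new -NoTypeInformation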
I solved it using this code:
$ArrayVIP = get-content $VIP_Exclusion_List
select-string $DestinationFile -pattern $ArrayVIP -notmatch |
select -expand line |
out-file $DestinationFile
Taken from here: Removing lines from a CSV