Why Import-Csv's Sort-Object is slow for 1 million records

I need to sort CSV files by their first column (the column may differ from file to file).
My CSV files have more than a million records, and the command below takes about 10 minutes to run.
Is there any other way to optimize the code and speed up execution?
$CsvFile = "D:\Performance\10_lakh_records.csv"
$OutputFile ="D:\Performance\output.csv"
Import-Csv $CsvFile | Sort-Object { $_.psobject.Properties.Value[1] } | Export-Csv -Encoding default -Path $OutputFile -NoTypeInformation

You could try the [array]::Sort() static method, which may prove faster than Sort-Object, although it takes an extra step to first build a one-dimensional array of the values to sort on.
Try
$CsvFile = "D:\Performance\10_lakh_records.csv"
$OutputFile = "D:\Performance\output.csv"
# import the data
$data = Import-Csv -Path $CsvFile
# determine the column name to sort on. In this demo the first column
# of course, if you know the column name you don't need that and can simply use the name as-is
$column = $data[0].PSObject.Properties.Name[0]
# use the Sort(Array, Array) overload method to sort the data by the
# values of the column you have chosen.
# see https://learn.microsoft.com/en-us/dotnet/api/system.array.sort?view=net-5.0#System_Array_Sort_System_Array_System_Array_
[array]::Sort($data.$column, $data)
$data | Export-Csv -Encoding default -Path $OutputFile -NoTypeInformation
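As a small, self-contained illustration of the Sort(Array, Array) overload used above: the first array holds the keys, the second holds the items, and both are reordered in place. The sample rows and the Id/Value property names are made up for this demo and are not taken from the question's file. Keep in mind that Import-Csv returns every value as a string, so the keys sort as text; a numeric column would presumably need its keys cast to a numeric type first.

```powershell
# Hypothetical in-memory rows standing in for the output of Import-Csv
$rows = @(
    [pscustomobject]@{ Id = 'c'; Value = 3 }
    [pscustomobject]@{ Id = 'a'; Value = 1 }
    [pscustomobject]@{ Id = 'b'; Value = 2 }
)

# Build a one-dimensional array of the key values to sort on
$keys = [string[]]$rows.Id

# Sort(keys, items): sorts $keys and applies the same reordering to $rows
[array]::Sort($keys, $rows)

# $rows is now ordered by Id
$rows.Id -join ','
```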

Related

powershell: Write specific rows from files to formatted csv

The following code gives me the correct output to console. But I would need it in a csv file:
$array = @{}
$files = Get-ChildItem "C:\Temp\Logs\*"
foreach($file in $files){
foreach($row in (Get-Content $file | select -Last 2)){
if($row -like "Total peak job memory used:*"){
$sp_memory = $row.Split(" ")[5]
$array.Add(($file.BaseName),([double]$sp_memory))
break
}
}
}
$array.GetEnumerator() | sort Value -Descending |Format-Table -AutoSize
current output (console):
required output (csv):
In order to increase performance I would like to avoid the array and write output directly to csv (no append).
Thanks in advance!

Change your last line to this -
$array.GetEnumerator() | sort Value -Descending | select @{l='FileName'; e={$_.Name}}, @{l='Memory (MB)'; e={$_.Value }} | Export-Csv -path $env:USERPROFILE\Desktop\Output.csv -NoTypeInformation
This will give you a csv file named Output.csv on your desktop.
I am using Calculated properties to change the column headers to FileName and Memory (MB) and piping the output of $array to Export-Csv cmdlet.
Just to let you know, your variable $array is of type Hashtable which won't store duplicate keys. If you need to store duplicate key/value pairs, you can use arrays. Just suggesting! :)
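To make the point about duplicate keys concrete, here is a minimal sketch. The names server01 and MemoryMB are invented for the demo. Adding the same key twice to a Hashtable throws, whereas an array of objects accepts repeated names and can still be sorted and exported.

```powershell
# A Hashtable rejects a second Add() with the same key
$table = @{}
$table.Add('server01', 12.5)
$duplicateRejected = $false
try {
    $table.Add('server01', 20.0)
} catch {
    $duplicateRejected = $true
}

# An array of objects can hold the same name more than once
$list = @(
    [pscustomobject]@{ FileName = 'server01'; MemoryMB = 12.5 }
    [pscustomobject]@{ FileName = 'server01'; MemoryMB = 20.0 }
)
$sorted = $list | Sort-Object MemoryMB -Descending
```

The sorted objects could then be piped to Export-Csv in the same way as the hashtable entries above.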

I use -NoTypeInformation so why do I get header back when using Out-File?

I filtered by date this file data1.csv
2017.11.1,09:55,1.1,1.2,1.3,1.4,1
2017.11.2,09:55,1.5,1.6,1.7,1.8,2
I don't get a header with -NoTypeInformation:
$CutOff = (Get-Date).AddDays(-2)
$filePath = "data1.csv"
$Data = Import-Csv $filePath -Header Date,Time,A,B,C,D,E
$Data2 = $Data | Where-Object {$_.Date -as [datetime] -gt $Cutoff} | convertto-csv -NoTypeInformation -Delimiter "," | % {$_ -replace '"',''}
But when rewriting with Out-File
$Data2 | Out-File "data2.csv" -Encoding utf8 -Force
I get header back as data2.csv contains:
Date,Time,A,B,C,D,E
2017.11.2,09:55,1.5,1.6,1.7,1.8,2
Why do I have Date,Time,A,B,C,D,E ?

-NoTypeInformation is not about the header but the data type of the rows in the file. Remove it to see what shows up. From Microsoft
Omits the type information header from the output. By default, the string in the output contains #TYPE followed by the fully-qualified name of the object type.
Emphasis mine.
CSVs need headers. That is why it is making one. If you don't want to see the header in the output use Select-Object -Skip 1 to remove it.
$Data |
Where-Object {$_.Date -as [datetime] -gt $Cutoff} |
ConvertTo-CSV -NoTypeInformation -Delimiter "," |
Select-Object -Skip 1 |
% {$_ -replace '"'}
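A minimal sketch of the same idea on in-memory data (the single sample row is made up): ConvertTo-Csv always emits the column header as its first line, -NoTypeInformation only suppresses the #TYPE line, and Select-Object -Skip 1 drops the header.

```powershell
# One sample object standing in for the filtered Import-Csv output
$rows = @(
    [pscustomobject]@{ Date = '2017.11.2'; Time = '09:55' }
)

# First line is the column header, second line is the data row
$lines = @($rows | ConvertTo-Csv -NoTypeInformation)

# Skip the header and strip the quotes that ConvertTo-Csv adds
$noHeader = $lines | Select-Object -Skip 1 | ForEach-Object { $_ -replace '"' }
```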
I would not pipe Out-File to itself. You could pipe to Set-Content here just as well.
I am guessing this whole process is to keep the source file in the same state just with some lines filtered out based on date. You could skip most of this just by parsing the date out in each line.
$threshold = (Get-Date).AddDays(-2)
$filePath = "c:\temp\bagel.txt"
(Get-Content $filePath) | Where-Object{
$date,$null=$_.Split(",",2)
[datetime]$date -gt $threshold
} | Set-Content $filePath
Now you don't have to worry about PowerShell CSV object structure or output since we act on the raw data of the file itself.
That will take each line of the input file and filter it out if the parsed date does not match the threshold. Change encoding on the input output cmdlets as you see necessary. What $date,$null=$_.Split(",",2) is doing is splitting the line on the comma into 2 parts. First of which becomes $date and since this is just a filtering condition we dump the rest of the line into $null.
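Here is the split step on its own, using one line from the question's sample data. The explicit ParseExact call and the 'yyyy.M.d' format string are my assumption for this date layout; they are an alternative to the [datetime] cast used above, not a correction of it.

```powershell
$line = '2017.11.1,09:55,1.1,1.2,1.3,1.4,1'

# Split on the first comma only: the date, then the untouched rest of the line
$date, $rest = $line.Split(',', 2)

# Parse the date with an explicit format so the result does not depend on culture settings
$parsed = [datetime]::ParseExact($date, 'yyyy.M.d', [cultureinfo]::InvariantCulture)
```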

Properly-formed CSV files must have column headers. Your use of -NoTypeInformation in generating the CSV does not affect column headers; instead, it affects whether the PowerShell object type information is included. If you Export-CSV without -NoTypeInformation, the first line of your CSV file will have a line that looks like #TYPE System.PSCustomObject, which you don't want if you're going to open the CSV in a spreadsheet program.
If you subsequently Import-CSV, the headers (Date, Time, A, B, C) are used to create the fields of a PSObject, so that you can refer to them using the standard dot notation (e.g., $CSV[$line].Date).
The ability to specify -Header on Import-CSV is essentially a "hack" to allow the cmdlet to handle files that are comma-separated, but which did not include column headers.

Powershell removing columns and rows from CSV

I'm having trouble making some changes to a series of CSV files, all with the same data structure. I'm trying to combine all of the files into one CSV file or one tab delimited text file (don't really mind), however each file needs to have 2 empty rows removed and two of the columns removed, below is an example:
col1,col2,col3,col4,col5,col6 <-remove
col1,col2,col3,col4,col5,col6 <-remove
col1,col2,col3,col4,col5,col6
col1,col2,col3,col4,col5,col6
           ^         ^
         remove    remove
End Result:
col1,col2,col4,col6
col1,col2,col4,col6
This is my attempt at doing this (I'm very new to Powershell)
$ListofFiles = "example.csv" #this is an list of all the CSV files
ForEach ($file in $ListofFiles)
{
$content = Get-Content ($file)
$content = $content[2..($content.Count)]
$contentArray = @()
[string[]]$contentArray = $content -split ","
$content = $content[0..2 + 4 + 6]
Add-Content '...\output.txt' $content
}
Where am I going wrong here...

Your example file should be read before the foreach, to fetch the file list:
$ListofFiles = get-content "example.csv"
Inside the foreach you are getting the content of the main file
$content = Get-Content ($ListofFiles)
instead of
$content = Get-Content $file
For removing rows I would recommend this, which keeps only the rows at the listed indexes:
$obj = get-content C:\t.csv | select -Index 0,1,3
To remove columns, keep only the ones you want (here the columns at indexes 0, 1, 3 and 5):
$obj | %{(($_.split(","))[0,1,3,5]) -join "," } | out-file test.csv -Append

Assuming the initial files look like
col1,col2,col3,col4,col5,col6
col1,col2,col3,col4,col5,col6
,,,,,
,,,,,
You can also try this one liner
Import-Csv D:\temp\*.csv -Header 'C1','C2','C3','C4','C5','C6' | where {$_.c1 -ne ''} | select -Property 'C1','C2','C5' | Export-Csv 'd:\temp\final.csv' -NoTypeInformation
Since your CSVs all have the same structure, you can open them directly by providing the header, then remove the objects with missing data, then export all the remaining objects to a single CSV file.

It is sufficient to specify fictitious column names (their number can exceed the number of columns in the file), filter the rows you want, and exclude the columns you do not want to keep.
gci "c:\yourdirwithcsv" -file -filter *.csv |
%{ Import-Csv $_.FullName -Header C1,C2,C3,C4,C5,C6 |
where C1 -ne '' |
# -Property * is needed for -ExcludeProperty to take effect in Windows PowerShell
select -Property * -ExcludeProperty C3, C4
} |
# export once, after the loop, so each file does not overwrite the previous one
export-csv "c:\temp\merged.csv" -NoTypeInformation
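The header-then-filter approach can be tried without any files by feeding the same kind of lines to ConvertFrom-Csv. This is only a sketch: the sample values are invented, and the selected columns C1, C2, C4 and C6 are chosen to match the "End Result" in the question.

```powershell
# Sample lines shaped like the question's files: two data rows, two empty rows
$lines = 'a1,a2,a3,a4,a5,a6', 'b1,b2,b3,b4,b5,b6', ',,,,,', ',,,,,'

$rows = $lines |
    ConvertFrom-Csv -Header C1,C2,C3,C4,C5,C6 |   # fictitious column names
    Where-Object { $_.C1 } |                      # drop rows whose first field is empty
    Select-Object C1, C2, C4, C6                  # keep only the wanted columns

# Back to header-less, quote-free text
$out = $rows | ConvertTo-Csv -NoTypeInformation |
    Select-Object -Skip 1 |
    ForEach-Object { $_ -replace '"' }
```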

Using powershell to transform CSV file

I have CSV files which have a lot of columns. I need to transform several columns, for example, some date columns have text string of "Missing" and I want to replace "Missing" to an empty string, etc.
The following code may work, but it will be a long file since there are a lot of columns. Is there a better way to write it?
Import-Csv $file |
select @(
@{l="xxx"; e={ ....}},
# repeat many times for each column....
) | export-Csv

You could use an imperative style rather than a pipelined style:
$records = Import-Csv $file
foreach ($record in $records)
{
if ($record.Date -eq 'Missing')
{
$record.Date = ''
}
}
$records | Export-Csv $file
Edit: To use a pipelined style, you could do it like this:
import-csv $file |
select -ExcludeProperty Name1,Name2 -Property *,@{n='Name1'; e={"..."}},@{n='Name2'; e={'...'}}
The * is a wildcard that matches all properties. I couldn't find a way to format this code in a nicer way, so it is kind of ugly looking.
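If many columns need the same treatment, the imperative style can loop over each record's properties instead of naming every column. This is a sketch on made-up in-memory data (the Name/Start/End columns are hypothetical), assuming the only transformation wanted is replacing "Missing" with an empty string.

```powershell
# Sample data standing in for Import-Csv $file
$records = @('Name,Start,End', 'x,Missing,2020-01-01' | ConvertFrom-Csv)

foreach ($record in $records) {
    # Visit every column of the row without listing the column names
    foreach ($prop in $record.PSObject.Properties) {
        if ($prop.Value -eq 'Missing') {
            $prop.Value = ''
        }
    }
}
```

The modified $records can then be piped to Export-Csv as in the first example.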

If all you want to do is a find-replace, you don't really need to read it as a CSV.
You could do this instead:
(Get-Content $file) | %{$_.ToString().Replace("Missing", "")} | Out-File $file
The parentheses make Get-Content finish reading the file before Out-File overwrites it.

Add Column to CSV Windows PowerShell

I have a fairly standard CSV file with headers. I want to add a new column and set all the rows to the same value.
Original:
column1, column2
1,b
2,c
3,5
After
column1, column2, column3
1,b, setvalue
2,c, setvalue
3,5, setvalue
I can't find anything on this; if anybody could point me in the right direction, that would be great. Sorry, I'm very new to PowerShell.

Here's one way to do that using Calculated Properties:
(Import-Csv file.csv) |
Select-Object *,@{Name='column3';Expression={'setvalue'}} |
Export-Csv file.csv -NoTypeInformation
You can find more on calculated properties here: http://technet.microsoft.com/en-us/library/ff730948.aspx.
In a nutshell, you import the file, pipe the content to the Select-Object cmdlet, select all existing properties (i.e. '*'), then add a new one.
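The calculated property can be tried on in-memory data first. In this sketch ConvertFrom-Csv stands in for Import-Csv, and the two sample rows follow the question's example.

```powershell
# Sample rows standing in for Import-Csv file.csv
$rows = 'column1,column2', '1,b', '2,c' | ConvertFrom-Csv

# Keep every existing property and append a constant-valued column3
$withNew = $rows | Select-Object *, @{ Name = 'column3'; Expression = { 'setvalue' } }

# Column names in output order
$names = $withNew[0].PSObject.Properties.Name
```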

Shay Levy's answer also works for me!
If you don't want to provide a value for each object yet, the code is even easier...
(Import-Csv file.csv) |
Select-Object *,"column3" |
Export-Csv file.csv -NoTypeInformation

None of the scripts I've seen are dynamic in nature, so they're fairly limited in their scope & what you can do with them.. that's probably because most PS Users & even Power Users aren't programmers. You very rarely see the use of arrays in Powershell. I took Shay Levy's answer & improved upon it.
Note here: The Import needs to be consistent (two columns for instance), but it would be fairly easy to modify this to dynamically count the columns & generate headers that way too. For this particular question, that wasn't asked. Or simply don't generate a header unless it's needed.
The code below will pull in as many CSV files as exist in the folder, add a header, and then later strip it. The reason I add the header is for consistency in the data; it also makes manipulating the columns later down the line fairly straightforward (if you choose to do so). You can modify this to your heart's content, and feel free to use it for other purposes too. This is generally the format I stick with for just about any of my PowerShell needs. The use of a counter basically allows you to manipulate individual files, so there are a lot of possibilities here.
$chargeFiles = 'C:\YOURFOLDER\BLAHBLAH\'
$existingReturns = Get-ChildItem $chargeFiles
for ($i = 0; $i -lt $existingReturns.count; $i++)
{
$CSV = Import-Csv -Path $existingReturns[$i].FullName -Header Header1,Header2
$csv | select *, @{Name='Header3';Expression={'Header3 Static'}} |
select *, @{Name='Header4';Expression={'Header4 Static Text'}} |
select *, @{Name='Header5';Expression={'Header5 Static Text'}} |
CONVERTTO-CSV -DELIMITER "," -NoTypeInformation |
SELECT-OBJECT -SKIP 1 | % {$_ -replace '"', ""} |
OUT-FILE -FilePath $existingReturns[$i].FullName -FORCE -ENCODING ASCII
}

You could also use Add-Member:
$csv = Import-Csv 'input.csv'
foreach ($row in $csv)
{
$row | Add-Member -NotePropertyName 'MyNewColumn' -NotePropertyValue 'MyNewValue'
}
$csv | Export-Csv 'output.csv' -NoTypeInformation

For some applications, I found that producing a hashtable and using its .Values as the column works well (it allows cross-reference validation against another object that is being enumerated).
In this case, #powershell on freenode brought my attention to an ordered hashtable (since the column header must be used).
Here is an example without any validation of the .values
$newcolumnobj = [ordered]@{}
#input data into a hash table so that we can more easily reference the `.values` as an object to be inserted in the CSV
$newcolumnobj.add("volume name", $currenttime)
#enumerate $deltas (this will be the object that contains the volume information, `$volumedeltas`)
# add just the new deltas to the newcolumn object
foreach ($item in $deltas){
$newcolumnobj.add($item.volume,$item.delta)
}
$originalcsv = @(import-csv $targetdeltacsv)
#thanks to pscookiemonster in #powershell on freenode
for($i=0; $i -lt $originalcsv.count; $i++){
$originalcsv[$i] | Select-Object *, @{l="$currenttime"; e={$newcolumnobj.item($i)}}
}
Example is related to How can I perform arithmetic to find differences of values in two CSVs?
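The reason the ordered hashtable matters here is that its entries keep insertion order and can be read by position, which is what .item($i) relies on in the loop. A minimal sketch with made-up keys and values:

```powershell
# [ordered] keeps entries in the order they were added
$column = [ordered]@{}
$column.Add('volume name', 'header')
$column.Add('vol1', 10)
$column.Add('vol2', 20)

# An integer argument reads by position, a string reads by key
$first  = $column.Item(0)
$byKey  = $column['vol2']
$values = @($column.Values)
```

Note that positional lookup only lines up with the CSV rows if the entries were added in the same order as the rows appear in the file.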

Create a CSV file with nothing in it:
$csv >> "$PSScriptRoot/dpg.csv"
Define the CSV file's path. Here $PSScriptRoot is the folder containing the script:
$csv = "$PSScriptRoot/dpg.csv"
Now add columns to it:
$csv | select vds, protgroup, vlan, ports | Export-Csv $csv
