PowerShell CSV row/column transpose and manipulation

I'm a newbie in PowerShell. I'm trying to transpose rows and columns in a medium-sized CSV file (around 10000 rows). The original CSV consists of around 10000 rows with 3 columns ("Time","Id","IOT"), as below:
"Time","Id","IOT"
"00:03:56","23","26"
"00:03:56","24","0"
"00:03:56","25","0"
"00:03:56","26","1"
"00:03:56","27","0"
"00:03:56","28","0"
"00:03:56","29","0"
"00:03:56","30","1953"
"00:03:56","31","22"
"00:03:56","32","39"
"00:03:56","33","8"
"00:03:56","34","5"
"00:03:56","35","269"
"00:03:56","36","5"
"00:03:56","37","0"
"00:03:56","38","0"
"00:03:56","39","0"
"00:03:56","40","1251"
"00:03:56","41","103"
"00:03:56","42","0"
"00:03:56","43","0"
"00:03:56","44","0"
"00:03:56","45","0"
"00:03:56","46","38"
"00:03:56","47","14"
"00:03:56","48","0"
"00:03:56","49","0"
"00:03:56","2013","0"
"00:03:56","2378","0"
"00:03:56","2380","32"
"00:03:56","2758","0"
"00:03:56","3127","0"
"00:03:56","3128","0"
"00:09:16","23","22"
"00:09:16","24","0"
"00:09:16","25","0"
"00:09:16","26","2"
"00:09:16","27","0"
"00:09:16","28","0"
"00:09:16","29","21"
"00:09:16","30","48"
"00:09:16","31","0"
"00:09:16","32","4"
"00:09:16","33","4"
"00:09:16","34","7"
"00:09:16","35","382"
"00:09:16","36","12"
"00:09:16","37","0"
"00:09:16","38","0"
"00:09:16","39","0"
"00:09:16","40","1882"
"00:09:16","41","42"
"00:09:16","42","0"
"00:09:16","43","3"
"00:09:16","44","0"
"00:09:16","45","0"
"00:09:16","46","24"
"00:09:16","47","22"
"00:09:16","48","0"
"00:09:16","49","0"
"00:09:16","2013","0"
"00:09:16","2378","0"
"00:09:16","2380","19"
"00:09:16","2758","0"
"00:09:16","3127","0"
"00:09:16","3128","0"
...
...
...
I tried to do the transpose using code based on the PowerShell script downloaded from https://gallery.technet.microsoft.com/scriptcenter/Powershell-Script-to-7c8368be
Basically my PowerShell code is as below:
$b = @()
foreach ($Time in $a.Time | Select -Unique) {
    $Props = [ordered]@{ Time = $Time }
    foreach ($Id in $a.Id | Select -Unique){
        $IOT = ($a.Where({ $_.Id -eq $Id -and $_.Time -eq $Time })).IOT
        $Props += @{ $Id = $IOT }
    }
    $b += New-Object -TypeName PSObject -Property $Props
}
$b | FT -AutoSize
$b | Out-GridView
The above code gives me the result I expect: all "Id" values become column headers, all "Time" values become unique rows, and the "IOT" values sit at the intersection of "Id" x "Time", as below:
"Time","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","2013","2378","2380","2758","3127","3128"
"00:03:56","26","0","0","1","0","0","0","1953","22","39","8","5","269","5","0","0","0","1251","103","0","0","0","0","38","14","0","0","0","0","32","0","0","0"
"00:09:16","22","0","0","2","0","0","21","48","0","4","4","7","382","12","0","0","0","1882","42","0","3","0","0","24","22","0","0","0","0","19","0","0","0"
While it only involves a few hundred rows, the result comes out quickly as expected. The problem is that when processing the whole CSV file with 10000 rows, the script above keeps executing, doesn't seem to finish for a long time (hours), and never spits out any results.
So could some PowerShell experts from Stack Overflow help to assess the code above and suggest how to modify it to speed up the results?
Many thanks for the advice.

10000 records is a lot, but I don't think it is enough to justify StreamReader* and manually parsing the CSV. The biggest thing working against you, though, is the following line:
$b += New-Object -TypeName PSObject -Property $Props
What PowerShell is doing here is making a new array and appending that element to it. This is a very memory-intensive operation that you are repeating thousands of times. The better thing to do in this case is to use the pipeline to your advantage:
$data = Import-Csv -Path "D:\temp\data.csv"
$headers = $data.ID | Sort-Object {[int]$_} -Unique
$data | Group-Object Time | ForEach-Object{
    $props = [ordered]@{Time = $_.Name}
    foreach($header in $headers){
        $props."$header" = ($_.Group | Where-Object{$_.ID -eq $header}).IOT
    }
    [pscustomobject]$props
} | Export-Csv D:\temp\testing.csv -NoTypeInformation
$data will be your entire file in memory as objects. We need to get all the $headers, which will become the column headers.
We group the data by each Time. Then inside each time group we get the value for every ID. If an ID does not exist during that time, the entry will show as null.
This is not the best way, but it should be faster than yours. I ran 10000 records in under a minute (51-second average over 3 passes). I just ran your code once with my own data and it took 13 minutes, so I think it is safe to say that mine performs faster.
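If the inner Where-Object ever becomes the bottleneck, a possible further tweak (not benchmarked here) is to index each time group by Id once, so every header lookup is a hashtable hit instead of a scan of the group:
$data | Group-Object Time | ForEach-Object{
    # build an Id -> IOT lookup for this time slice once
    $byId = @{}
    foreach($row in $_.Group){ $byId[$row.ID] = $row.IOT }
    $props = [ordered]@{Time = $_.Name}
    foreach($header in $headers){
        $props."$header" = $byId[$header]   # $null when the Id is missing for this time
    }
    [pscustomobject]$props
} | Export-Csv D:\temp\testing.csv -NoTypeInformation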
The dummy data was made with this logic, FYI:
1..100 | %{
    $time = Get-Date -Format "hh:mm:ss"
    sleep -Seconds 1
    1..100 | % {
        [pscustomobject][ordered]@{
            time = $time
            id   = $_
            iot  = Get-Random -Minimum 0 -Maximum 7
        }
    }
} | Export-Csv D:\temp\data.csv -NoTypeInformation
* StreamReader is not a stellar fit for your case. I'm just pointing it out as the better way to read large files; you then need to parse each string line by line.
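For completeness, a minimal, hedged sketch of that line-by-line approach, assuming a file shaped like the question's data (quoted "Time","Id","IOT" fields with no embedded commas):
$reader = New-Object System.IO.StreamReader 'D:\temp\data.csv'
try {
    $null = $reader.ReadLine()   # skip the header row
    while (-not $reader.EndOfStream) {
        $line = $reader.ReadLine()
        # naive split: strip the outer quotes, then break on ","
        $time, $id, $iot = $line.Trim('"') -split '","'
        # ... accumulate $iot keyed by $time and $id here ...
    }
}
finally {
    $reader.Close()
}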

Related

PowerShell output random numbers to csv. CSV full of empty lines

This is actually a two-part question. The code below outputs nothing but 1000 blank lines to the CSV. I'm just trying to output a random range of numbers to a CSV, and I need to follow up with 4 more columns of randomly generated numbers like this first attempt. So the second part is: once this first issue is resolved, how would I direct the next ranges to the other columns?
Get-Random -Count 998 -InputObject (8000..8999) | Export-Csv -Path SingleColumn.csv -NoTypeInformation
Export-Csv, same as ConvertTo-Csv, is not designed to deal with an array of values:
0..10 | ConvertTo-Csv # Outputs `AutomationNull.Value`
Both cmdlets require you to feed them objects:
0..10 | ForEach-Object { [pscustomobject]@{ foo = $_ } } | ConvertTo-Csv
You can create new objects easily with PSCustomObject.
As for the second question, you can dynamically create a dataset by tweaking this code:
$columnsCount = 5
$numberOfrows = 998
$min = 8000; $max = 9000
1..$numberOfrows | ForEach-Object {
    $out = [ordered]@{}
    foreach($column in 1..$columnsCount) {
        $out["Column $column"] = Get-Random -Minimum $min -Maximum $max
    }
    [pscustomobject] $out
} | Export-Csv path/to/csv.csv -NoTypeInformation
A few lines of the CSV output would look something like this:
"Column 1","Column 2","Column 3","Column 4","Column 5"
"8314","8937","8789","8946","8267"
"8902","8500","8107","8006","8287"
"8655","8204","8552","8681","8863"
"8643","8375","8891","8476","8475"
"8338","8243","8175","8568","8917"
"8747","8629","8054","8505","8351"
"8102","8859","8564","8018","8817"
"8810","8154","8845","8074","8436"
"8626","8731","8070","8156","8459"
....

PowerShell Slowness Two Large CSV Files

I have this script working, but with 100k+ rows in File1 and 200k+ in File2 it will take days to complete. I got the .Where({ part down to less than a second with both CSV files as data tables, but with that route I can't get the data out the way I want. This script outputs the data the way I want, but it takes 4 seconds per lookup. What can I do to speed this up?
I thought ContainsKey might help somewhere, but on PRACT_ID there is a one-to-many relationship, so I'm not sure how to handle those. Thanks.
Invoke-Expression "C:\SHC\MSO\DataTable\functionlibrary.ps1"
[System.Data.DataTable]$Script:MappingTable = New-Object System.Data.DataTable
$File1 = Import-Csv "C:\File1.csv" -Delimiter '|' | Sort-Object PRACT_ID
$File2 = Get-Content "C:\File2.csv" | Select-Object -Skip 1 | Sort-Object
$Script:MappingTable = $File1 | Out-DataTable
$Logs = "C:\Testing1.7.csv"
[System.Object]$UserOutput = @()
foreach ($name in $File1) {
    [string]$userMatch = $File2.Where( { $_.Split("|")[0] -eq $name.PRACT_ID })
    if ($userMatch) {
        # Process the data
        $UserOutput += New-Object PsObject -Property @{
            ID_NUMBER                = $name.ID_NUMBER
            PRACT_ID                 = $name.PRACT_ID
            LAST_NAME                = $name.LAST_NAME
            FIRST_NAME               = $name.FIRST_NAME
            MIDDLE_INITIAL           = $name.MIDDLE_INITIAL
            DEGREE                   = $name.DEGREE
            EMAILADDRESS             = $name.EMAILADDRESS
            PRIMARY_CLINIC_PHONE     = $name.PRIMARY_CLINIC_PHONE
            SPECIALTY_NAME           = $name.SPECIALTY_NAME
            State_License            = $name.State_License
            NPI_Number               = $name.NPI_Number
            'University Affiliation' = $name.'University Affiliation'
            Teaching_Title           = $name.Teaching_Title
            FACILITY                 = $userMatch
        }
    }
}
$UserOutput |
    Select-Object ID_NUMBER, PRACT_ID, LAST_NAME, FIRST_NAME, MIDDLE_INITIAL, DEGREE, EMAILADDRESS, PRIMARY_CLINIC_PHONE, SPECIALTY_NAME, State_License, NPI_Number, 'University Affiliation', Teaching_Title, FACILITY |
    Export-Csv $Logs -NoTypeInformation
Load $File2 into a hashtable with the $_.Split('|')[0] value as the key - you can then also skip the object creation completely and offload everything to Select-Object:
$File2 = Get-Content "C:\File2.csv" | Select-Object -Skip 1 | Sort-Object
# load $File2 into a hashtable
$userTable = @{}
foreach($userEntry in $File2){
    $userTable[$userEntry.Split('|')[0]] = $userEntry
}
# prepare the existing property names we want to preserve
$propertiesToSelect = 'ID_NUMBER', 'PRACT_ID', 'LAST_NAME', 'FIRST_NAME', 'MIDDLE_INITIAL', 'DEGREE', 'EMAILADDRESS', 'PRIMARY_CLINIC_PHONE', 'SPECIALTY_NAME', 'State_License', 'NPI_Number', 'University Affiliation', 'Teaching_Title'
# read the file, filter on existence in $userTable, add the FACILITY calculated property before export
Import-Csv "C:\File1.csv" -Delimiter '|' |
    Where-Object { $userTable.ContainsKey($_.PRACT_ID) } |
    Select-Object $propertiesToSelect, @{Name='FACILITY'; Expression={ $userTable[$_.PRACT_ID] }} |
    Export-Csv $Logs -NoTypeInformation
There are a multitude of ways to increase the speed of the operations you're doing, and they can be broken down into in-script and out-of-script possibilities:
Out of Script Possibilities:
Since the files are large, how much memory does the machine you're running this on have? And are you maxing it out during this operation?
If you are paging to disk, this will be the single biggest impact on the overall process!
If you are, the two ways to address this are:
Throw more hardware at the problem (easiest to deal with)
Write your code to iterate over each file in small chunks at a time so you don't load it all into RAM at once (very difficult if you're not familiar with it; see the sketch below).
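As a rough, hypothetical illustration of the chunked approach (the file name is taken from the question; the per-line matching logic itself is elided):
# read File2 in blocks of 5000 lines instead of all at once
Get-Content "C:\File2.csv" -ReadCount 5000 | ForEach-Object {
    foreach ($line in $_) {          # $_ is an array of up to 5000 lines
        $key = $line.Split('|')[0]
        # ... do the per-line lookup / matching work here ...
    }
}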
In-Script:
Don't use @() with += (it's really slow, especially over large datasets).
Use an ArrayList instead. Here is a quick sample of the perf difference (ArrayList is ~40x faster on 10,000 and ~500x faster on 100,000 entries, consistently -- the difference gets larger as the dataset grows; in other words, @() += gets slower as the dataset gets bigger):
(Measure-Command {
    $arr = [System.Collections.ArrayList]::new()
    1..100000 | % {
        [void]$arr.Add($_)
    }
}).TotalSeconds
(Measure-Command {
    $arr = @()
    1..100000 | % {
        $arr += $_
    }
}).TotalSeconds
0.8258113
451.5413987
If you need to do multiple key-based lookups on the data, then iterating over the data millions of times will be slow. Import the data as CSV, then structure a couple of hashtables with the associated information as key -> data and/or key -> data[]; you can then do index lookups instead of iterating through the arrays millions of times... it will be MUCH faster, assuming you have the RAM available for the extra objects.
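A rough sketch of the key -> data[] variant for the one-to-many case the asker mentions ($File2 and $name refer to the question's script; the other names are illustrative):
$byPractId = @{}
foreach ($entry in $File2) {
    $key = $entry.Split('|')[0]
    if (-not $byPractId.ContainsKey($key)) {
        $byPractId[$key] = [System.Collections.ArrayList]::new()
    }
    [void]$byPractId[$key].Add($entry)
}
# later: one O(1) lookup per record instead of scanning $File2 every time
$facilities = $byPractId[$name.PRACT_ID]   # $null when the key is absent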
EDIT for @RoadRunner:
My experience with Get-Content may be out of date... it used to be horrendously slow on large files, but in newer PowerShell versions this appears to have been fixed:
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\10MB.txt",  ('8' * 10MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\50MB.txt",  ('8' * 50MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\100MB.txt", ('8' * 100MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\500MB.txt", ('8' * 500MB))
$10MB  = gi .\10MB.txt
$50MB  = gi .\50MB.txt
$100MB = gi .\100MB.txt
$500MB = gi .\500MB.txt
0..10 | % {
    $n = [pscustomobject] @{
        'GC_10MB'     = (Measure-Command { Get-Content $10MB }).TotalSeconds
        'RAL_10MB'    = (Measure-Command { [System.IO.File]::ReadAllLines($10MB) }).TotalSeconds
        'GC_50MB'     = (Measure-Command { Get-Content $50MB }).TotalSeconds
        'RAL_50MB'    = (Measure-Command { [System.IO.File]::ReadAllLines($50MB) }).TotalSeconds
        'GC_100MB'    = (Measure-Command { Get-Content $100MB }).TotalSeconds
        'RAL_100MB'   = (Measure-Command { [System.IO.File]::ReadAllLines($100MB) }).TotalSeconds
        'GC_500MB'    = (Measure-Command { Get-Content $500MB }).TotalSeconds
        'RAL_500MB'   = (Measure-Command { [System.IO.File]::ReadAllLines($500MB) }).TotalSeconds
        'Delta_10MB'  = $null
        'Delta_50MB'  = $null
        'Delta_100MB' = $null
        'Delta_500MB' = $null
    }
    $n.Delta_10MB  = "{0:P}" -f ($n.GC_10MB / $n.RAL_10MB)
    $n.Delta_50MB  = "{0:P}" -f ($n.GC_50MB / $n.RAL_50MB)
    $n.Delta_100MB = "{0:P}" -f ($n.GC_100MB / $n.RAL_100MB)
    $n.Delta_500MB = "{0:P}" -f ($n.GC_500MB / $n.RAL_500MB)
    $n
}

Read large CSV in PowerShell parse multiple columns for unique values save results based on oldest value in column

I have a large 10 million row file (currently CSV). I need to read through the file, and remove duplicate items based on multiple columns.
Example line of data would look something like:
ComputerName, IPAddress, MacAddress, CurrentDate, FirstSeenDate
I want to check MacAddress and ComputerName for duplicates, and if a duplicate is discovered, keep the unique entry with the oldest FirstSeenDate.
I have read the CSV into a variable using Import-Csv and then processed the variable using Sort-Object, etc., but it's horribly slow:
$data | Group-Object -Property ComputerName,MacAddress | ForEach-Object{$_.Group | Sort-Object -Property FirstSeenDate | Select-Object -First 1}
I am thinking I could use StreamReader and read the CSV line by line, building a unique array based on array-contains logic.
Thoughts?
I would probably use Python if performance were a major concern. Or LogParser.
However, if I had to use PowerShell, I would probably try something like this:
$CultureInfo = [CultureInfo]::InvariantCulture
$DateFormat = 'M/d/yyyy' # Use whatever date format is appropriate
# We need to convert the strings that represent dates. You can skip the ParseExact() calls if the dates are already in a string sortable format (e.g., yyyy-MM-dd).
$Data = Import-Csv $InputFile |
    Select-Object -Property ComputerName, IPAddress, MacAddress,
        @{n = 'CurrentDate';   e = { [DateTime]::ParseExact($_.CurrentDate, $DateFormat, $CultureInfo) }},
        @{n = 'FirstSeenDate'; e = { [DateTime]::ParseExact($_.FirstSeenDate, $DateFormat, $CultureInfo) }}
$Results = @{}
foreach ($Record in $Data) {
    $Key = $Record.ComputerName + ';' + $Record.MacAddress
    if (!$Results.ContainsKey($Key)) {
        $Results[$Key] = $Record
    }
    elseif ($Record.FirstSeenDate -lt $Results[$Key].FirstSeenDate) {
        $Results[$Key] = $Record
    }
}
$Results.Values | Sort-Object -Property ComputerName, MacAddress | Export-Csv $OutputFile -NoTypeInformation
That may very well be faster because Group-Object is often a bottleneck even though it is quite powerful.
If you really want to try using a stream reader, try the Microsoft.VisualBasic.FileIO.TextFieldParser class, which is part of the .NET Framework in spite of its slightly misleading name. You can access it by running Add-Type -AssemblyName Microsoft.VisualBasic.
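A hedged sketch of what that could look like for the file in the question (the path is a placeholder, and the de-duplication body is elided):
Add-Type -AssemblyName Microsoft.VisualBasic
$parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser 'C:\data\computers.csv'
$parser.TextFieldType = [Microsoft.VisualBasic.FileIO.FieldType]::Delimited
$parser.SetDelimiters(',')
$parser.HasFieldsEnclosedInQuotes = $true
$null = $parser.ReadFields()   # consume the header row
while (-not $parser.EndOfData) {
    # string[]: ComputerName, IPAddress, MacAddress, CurrentDate, FirstSeenDate
    $fields = $parser.ReadFields()
    # ... apply the same oldest-FirstSeenDate logic as in the hashtable example above ...
}
$parser.Close()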
You could also import the data into a database (e.g. SQLite) and then query:
SELECT
MIN(FirstSeenDate) AS FirstSeenDate,
ComputerName,
IPAddress,
MacAddress
FROM importedData
GROUP BY ComputerName, IPAddress, MacAddress

How to process large CSV file in powershell

I am trying to find the number of rows in a CSV file that have a value above a certain threshold. The code I have goes something like this:
$T6=Import-Csv $file | Where-Object {$_."Value" -ge 0.6 } | Measure-Object
This works well for smaller files, but for large CSV files (1 GB or more) it will run forever. Is there a better way to parse CSV files like this in PowerShell?
Import-Csv is the official cmdlet for this. One comment, though: everything imported is a string, so you had better cast the Value property to the correct type. For instance:
$T6 = Import-Csv $file | Where-Object { [float]$_.Value -ge 0.6 } | Measure-Object
You can try to get rid of Import-Csv:
$values = ([System.IO.File]::ReadAllText('c:\pst\New Microsoft Office Excel Worksheet.csv')).Split(";") | where {$_ -ne ""}
$items = New-Object "System.Collections.Generic.List[decimal]"
foreach($value in $values)
{
    [decimal]$out = New-Object decimal
    if ([System.Decimal]::TryParse($value, [ref] $out))
    {
        if ($out -ge 10){ $items.Add($out) }
    }
}
$items | Measure-Object
For speed when processing large files, consider using a StreamReader; Roman's answer here demonstrates its usage.
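If it helps, a minimal sketch of that idea for this specific question, assuming a comma-delimited file whose "Value" column position is found from the header row (quote handling here is naive):
$reader = New-Object System.IO.StreamReader $file
$header = $reader.ReadLine() -split ','
$valueIndex = [array]::IndexOf($header, 'Value')   # adjust if the header cells are quoted
$count = 0
while (-not $reader.EndOfStream) {
    $fields = $reader.ReadLine() -split ','
    $value = 0.0
    if ([double]::TryParse($fields[$valueIndex].Trim('"'), [ref]$value) -and $value -ge 0.6) {
        $count++
    }
}
$reader.Close()
$count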

Powershell Select-Object from array not working

I am trying to separate values in an array so I can pass them to another function.
I am using Select-Object within a for loop to go through each line and separate the timestamp and value fields.
However, no matter what I do, the code below only displays the first Select-Object variable for each line. The second Select-Object command doesn't seem to work, as my output is a blank line for each of the 6 rows.
Any ideas on how to get both values?
$ReportData = $SystemStats.get_performance_graph_csv_statistics( (,$Query) )
### Allocate a new encoder and turn the byte array into a string
$ASCII = New-Object -TypeName System.Text.ASCIIEncoding
$csvdata = $ASCII.GetString($ReportData[0].statistic_data)
$csv2 = ConvertFrom-Csv $csvdata
$newarray = $csv2 | Where-Object {$_.utilization -ne "0.0000000000e+00" -and $_.utilization -ne "nan" }
for ( $n = 0; $n -lt $newarray.Length; $n++)
{
    $nTime = $newarray[$n]
    $nUtil = $newarray[$n]
    $util = $nUtil | Select-Object Utilization
    $util
    $tstamp = $nTime | Select-Object timestamp
    $tstamp
}
Let me slightly modify the processing code to see if that helps:
$csv2 |
    Where-Object {$_.utilization -ne "0.0000000000e+00" -and $_.utilization -ne "nan" } |
    Select-Object Utilization,TimeStamp
It will produce somewhat different output, but one that should be easier to work with.
The results are objects with the properties Utilization and TimeStamp. You can pass them to the other function as you mention.
Generally it is better to use pipes instead of for loops: you don't need to care about indexes, and the code works with arrays as well as with scalar values (see the note below).
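One related caution: if the Where-Object filter matches only a single row, $newarray holds one object rather than an array, and on older PowerShell versions $newarray.Length and indexing then misbehave in the original loop. Wrapping the result in @() (or staying in the pipeline, as above) sidesteps that:
$newarray = @($csv2 | Where-Object {$_.utilization -ne "0.0000000000e+00" -and $_.utilization -ne "nan" })
$newarray.Length   # now always a count, even when zero or one row matched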
If my updated code doesn't work: is the TimeStamp property really filled with any value?