PowerShell slowness with two large CSV files

I have this script working, but with 100k+ rows in File1 and 200k+ rows in File2 it will take days to complete. I got the .Where({ }) part down to less than a second with both CSV files as data tables, but with that route I can't get the data out the way I want. This script outputs the data the way I want, but it takes 4 seconds per lookup. What can I do to speed this up?
I thought ContainsKey might help somewhere, but PRACT_ID has a one-to-many relationship, so I'm not sure how to handle those. Thanks.
Invoke-Expression "C:\SHC\MSO\DataTable\functionlibrary.ps1"
[System.Data.DataTable]$Script:MappingTable = New-Object System.Data.DataTable
$File1 = Import-csv "C:\File1.csv" -Delimiter '|' | Sort-Object PRACT_ID
$File2 = Get-Content "C:\File2.csv" | Select-Object -Skip 1 | Sort-Object
$Script:MappingTable = $File1 | Out-DataTable
$Logs = "C:\Testing1.7.csv"
[System.Object]$UserOutput = @()
foreach ($name in $File1) {
[string]$userMatch = $File2.Where( { $_.Split("|")[0] -eq $name.PRACT_ID })
if ($userMatch) {
# Process the data
$UserOutput += New-Object PsObject -Property @{
ID_NUMBER = $name.ID_NUMBER
PRACT_ID = $name.PRACT_ID
LAST_NAME = $name.LAST_NAME
FIRST_NAME = $name.FIRST_NAME
MIDDLE_INITIAL = $name.MIDDLE_INITIAL
DEGREE = $name.DEGREE
EMAILADDRESS = $name.EMAILADDRESS
PRIMARY_CLINIC_PHONE = $name.PRIMARY_CLINIC_PHONE
SPECIALTY_NAME = $name.SPECIALTY_NAME
State_License = $name.State_License
NPI_Number = $name.NPI_Number
'University Affiliation' = $name.'University Affiliation'
Teaching_Title = $name.Teaching_Title
FACILITY = $userMatch
}
}
}
$UserOutput | Select-Object ID_NUMBER, PRACT_ID, LAST_NAME, FIRST_NAME, MIDDLE_INITIAL, DEGREE, EMAILADDRESS, PRIMARY_CLINIC_PHONE, SPECIALTY_NAME, State_License, NPI_Number, 'University Affiliation', Teaching_Title, FACILITY |
Export-Csv $logs -NoTypeInformation

Load $File2 into a hashtable with the $_.Split('|')[0] value as the key - you can then also skip the object creation completely and offload everything to Select-Object:
$File2 = Get-Content "C:\File2.csv" | Select-Object -Skip 1 | Sort-Object
# load $file2 into hashtable
$userTable = @{}
foreach($userEntry in $File2){
$userTable[$userEntry.Split('|')[0]] = $userEntry
}
# prepare the existing property names we want to preserve
$propertiesToSelect = 'ID_NUMBER', 'PRACT_ID', 'LAST_NAME', 'FIRST_NAME', 'MIDDLE_INITIAL', 'DEGREE', 'EMAILADDRESS', 'PRIMARY_CLINIC_PHONE', 'SPECIALTY_NAME', 'State_License', 'NPI_Number', 'University Affiliation', 'Teaching_Title'
# read file, filter on existence in $userTable, add the FACILITY calculated property before export
Import-csv "C:\File1.csv" -Delimiter '|' |Where-Object {$userTable.ContainsKey($_.PRACT_ID)} |Select-Object $propertiesToSelect,#{Name='FACILITY';Expression={$userTable[$_.PRACT_ID]}} |Export-Csv $logs -NoTypeInformation

There are many ways to speed up the operations you're doing, and they can be broken down into in-script and out-of-script possibilities:
Out of Script Possibilities:
Since the files are large, how much memory does the machine you're running this on have? And are you maxing it out during this operation?
If you are paging to disk, this will be the single biggest impact to the overall process!
If you are, the two ways to address this are:
Throw more hardware at the problem (easiest to deal with)
Write your code to iterate over each file in small chunks at a time so you don't load it all into RAM at once (harder if you're not familiar with it; a minimal streaming sketch follows right after this list).
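As a rough sketch of that chunked/streaming idea, assuming a plain pipe-delimited file and using .NET's StreamReader (the path and the per-line work are placeholders):
$reader = [System.IO.StreamReader]::new('C:\File2.csv')
try {
    $null = $reader.ReadLine()          # skip the header row
    while ($null -ne ($line = $reader.ReadLine())) {
        # process one line at a time instead of holding the whole file in RAM
        $fields = $line.Split('|')
        # ... do the per-line work here ...
    }
}
finally {
    $reader.Dispose()
}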
In-Script:
Don't use @() with += (it's really slow, especially over large datasets).
Use an ArrayList instead. Here is a quick sample of the perf difference (ArrayList is ~40x faster on 10,000 entries and ~500x faster on 100,000 entries, consistently; the difference gets larger as the dataset grows, or in other words, @() with += gets slower as the dataset gets bigger). A generic List[T] variant is sketched after the benchmark:
(Measure-Command {
$arr = [System.Collections.ArrayList]::new()
1..100000 | % {
[void]$arr.Add($_)
}
}).TotalSeconds
(Measure-Command {
$arr = @()
1..100000 | % {
$arr += $_
}
}).TotalSeconds
0.8258113
451.5413987
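On PowerShell 5 or later, a generic List[T] is a common alternative to ArrayList for the same purpose; its Add() returns nothing, so the [void] cast isn't needed. A comparable sketch (not part of the original benchmark):
(Measure-Command {
    $list = [System.Collections.Generic.List[int]]::new()
    1..100000 | % {
        $list.Add($_)
    }
}).TotalSeconds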
If you need to do multiple key-based lookups on the data, iterating over the data millions of times will be slow. Import the data as CSV and then build a couple of hashtables with the associated information, key -> data and/or key -> data[]; then you can do index lookups instead of iterating through the arrays millions of times, which will be MUCH faster, assuming you have RAM available for the extra objects.
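One quick way to build such a key -> data[] lookup from an imported CSV is Group-Object -AsHashTable; this is only a sketch against the File1/PRACT_ID data from the question, and the lookup key at the end is just an example value:
$rows = Import-Csv "C:\File1.csv" -Delimiter '|'
# one hashtable entry per PRACT_ID; each value is the array of rows sharing that key
$byPractId = $rows | Group-Object -Property PRACT_ID -AsHashTable -AsString
# keyed lookup instead of scanning the whole array (the key value here is illustrative)
$byPractId['12345']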
EDIT for @RoadRunner:
My experience with Get-Content may be out of date... it used to be horrendously slow on large files, but in newer PowerShell versions it appears to have been fixed:
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\10MB.txt", ('8' * 10MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\50MB.txt", ('8' * 50MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\100MB.txt", ('8' * 100MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\500MB.txt", ('8' * 500MB))
$10MB = gi .\10MB.txt
$50MB = gi .\50MB.txt
$100MB = gi .\100MB.txt
$500MB = gi .\500MB.txt
0..10 | % {
$n = [pscustomobject] @{
'GC_10MB' = (Measure-Command { Get-Content $10MB }).TotalSeconds
'RAL_10MB' = (Measure-Command { [System.IO.File]::ReadAllLines($10MB) }).TotalSeconds
'GC_50MB' = (Measure-Command { Get-Content $50MB }).TotalSeconds
'RAL_50MB' = (Measure-Command { [System.IO.File]::ReadAllLines($50MB) }).TotalSeconds
'GC_100MB' = (Measure-Command { Get-Content $100MB }).TotalSeconds
'RAL_100MB' = (Measure-Command { [System.IO.File]::ReadAllLines($100MB) }).TotalSeconds
'GC_500MB' = (Measure-Command { Get-Content $500MB }).TotalSeconds
'RAL_500MB' = (Measure-Command { [System.IO.File]::ReadAllLines($500MB) }).TotalSeconds
'Delta_10MB' = $null
'Delta_50MB' = $null
'Delta_100MB' = $null
'Delta_500MB' = $null
}
$n.Delta_10MB = "{0:P}" -f ($n.GC_10MB / $n.RAL_10MB)
$n.Delta_50MB = "{0:P}" -f ($n.GC_50MB / $n.RAL_50MB)
$n.Delta_100MB = "{0:P}" -f ($n.GC_100MB / $n.RAL_100MB)
$n.Delta_500MB = "{0:P}" -f ($n.GC_500MB / $n.RAL_500MB)
$n
}

Related

PowerShell: Find unique values from multiple CSV files

Let's say that I have several CSV files and I need to check a specific column and find values that exist in one file but not in any of the others. I'm having a bit of trouble coming up with the best way to go about it, as I wanted to use Compare-Object and possibly keep all columns, not just the one that contains the values I'm checking.
So I do indeed have several CSV files and they all have a Service Code column, and I'm trying to create a list for each Service Code that only appears in one file. So I would have "Service Codes only in CSV1", "Service Codes only in CSV2", etc.
Based on some testing and a semi-related question, I've come up with a workable solution, but with all of the nesting and For loops, I'm wondering if there is a more elegant method out there.
Here's what I do have:
$files = Get-ChildItem -LiteralPath "C:\temp\ItemCompare" -Include "*.csv"
$HashList = [System.Collections.Generic.List[System.Collections.Generic.HashSet[String]]]::New()
For ($i = 0; $i -lt $files.Count; $i++){
$TempHashSet = [System.Collections.Generic.HashSet[String]]::New([String[]](Import-Csv $files[$i])."Service Code")
$HashList.Add($TempHashSet)
}
$FinalHashList = [System.Collections.Generic.List[System.Collections.Generic.HashSet[String]]]::New()
For ($i = 0; $i -lt $HashList.Count; $i++){
$UniqueHS = [System.Collections.Generic.HashSet[String]]::New($HashList[$i])
For ($j = 0; $j -lt $HashList.Count; $j++){
#Skip the check when the HashSet would be compared to itself
If ($j -eq $i){Continue}
$UniqueHS.ExceptWith($HashList[$j])
}
$FinalHashList.Add($UniqueHS)
}
It seems a bit messy to me using so many different .NET references, and I know I could make it cleaner with a tag to say using namespace System.Collections.Generic, but I'm wondering if there is a way to make it work using Compare-Object which was my first attempt, or even just a simpler/more efficient method to filter each file.
I believe I found an "elegant" solution based on Group-Object, using only a single pipeline:
# Import all CSV files.
Get-ChildItem $PSScriptRoot\csv\*.csv -File -PipelineVariable file | Import-Csv |
# Add new column "FileName" to distinguish the files.
Select-Object *, @{ label = 'FileName'; expression = { $file.Name } } |
# Group by ServiceCode to get a list of files per distinct value.
Group-Object ServiceCode |
# Filter by ServiceCode values that exist only in a single file.
# Sort-Object -Unique takes care of possible duplicates within a single file.
Where-Object { ( $_.Group.FileName | Sort-Object -Unique ).Count -eq 1 } |
# Expand the groups so we get the original object structure back.
ForEach-Object Group |
# Format-Table requires sorting by FileName, for -GroupBy.
Sort-Object FileName |
# Finally pretty-print the result.
Format-Table -Property ServiceCode, Foo -GroupBy FileName
Test Input
a.csv:
ServiceCode,Foo
1,fop
2,fip
3,fap
b.csv:
ServiceCode,Foo
6,bar
6,baz
3,bam
2,bir
4,biz
c.csv:
ServiceCode,Foo
2,bla
5,blu
1,bli
Output
FileName: b.csv
ServiceCode Foo
----------- ---
4 biz
6 bar
6 baz
FileName: c.csv
ServiceCode Foo
----------- ---
5 blu
Looks correct to me. The values 1, 2 and 3 are duplicated between multiple files, so they are excluded. 4, 5 and 6 exist only in single files, while 6 is a duplicate value only within a single file.
Understanding the code
Maybe it is easier to understand how this code works, by looking at the intermediate output of the pipeline produced by the Group-Object line:
Count Name Group
----- ---- -----
2 1 {@{ServiceCode=1; Foo=fop; FileName=a.csv}, @{ServiceCode=1; Foo=bli; FileName=c.csv}}
3 2 {@{ServiceCode=2; Foo=fip; FileName=a.csv}, @{ServiceCode=2; Foo=bir; FileName=b.csv}, @{ServiceCode=2; Foo=bla; FileName=c.csv}}
2 3 {@{ServiceCode=3; Foo=fap; FileName=a.csv}, @{ServiceCode=3; Foo=bam; FileName=b.csv}}
1 4 {@{ServiceCode=4; Foo=biz; FileName=b.csv}}
1 5 {@{ServiceCode=5; Foo=blu; FileName=c.csv}}
2 6 {@{ServiceCode=6; Foo=bar; FileName=b.csv}, @{ServiceCode=6; Foo=baz; FileName=b.csv}}
Here the Name contains the unique ServiceCode values, while Group "links" the data to the files.
From here it should already be clear how to find values that exist only in single files. If duplicate ServiceCode values within a single file weren't allowed, we could even simplify the filter to Where-Object Count -eq 1. Since it was stated that dupes within single files may exist, we need the Sort-Object -Unique to count multiple equal file names within a group as only one.
It is not completely clear what you expect as an output.
If this is just the ServiceCodes that intersect, then this is actually a duplicate of:
Comparing two arrays & get the values which are not common
Union and Intersection in PowerShell?
But taking that you actually want the related object and files, you might use this approach:
$HashTable = @{}
ForEach ($File in Get-ChildItem .\*.csv) {
ForEach ($Object in (Import-Csv $File)) {
$HashTable[$Object.ServiceCode] = $Object |Select-Object *,
@{ n='File'; e={ $File.Name } },
@{ n='Count'; e={ $HashTable[$Object.ServiceCode].Count + 1 } }
}
}
$HashTable.Values |Where-Object Count -eq 1
Here is my take on this fun exercise. I'm using a similar approach to yours with the HashSet, but adding [System.StringComparer]::OrdinalIgnoreCase and leveraging the .Contains(..) method:
using namespace System.Collections.Generic
# Generate Random CSVs:
$charset = 'abABcdCD0123xXyYzZ'
$ran = [random]::new()
$csvs = @{}
foreach($i in 1..50) # Create 50 CSVs for testing
{
$csvs["csv$i"] = foreach($z in 1..50) # With 50 Rows
{
$index = (0..2).ForEach({ $ran.Next($charset.Length) })
[pscustomobject]@{
ServiceCode = [string]::new($charset[$index])
Data = $ran.Next()
}
}
}
# Get Unique 'ServiceCode' per CSV:
$result = @{}
foreach($key in $csvs.Keys)
{
# Get all unique `ServiceCode` from the other CSVs
$tempHash = [HashSet[string]]::new(
[string[]]($csvs[$csvs.Keys -ne $key].ServiceCode),
[System.StringComparer]::OrdinalIgnoreCase
)
# Filter the unique `ServiceCode`
$result[$key] = foreach($line in $csvs[$key])
{
if(-not $tempHash.Contains($line.ServiceCode))
{
$line
}
}
}
# Test if the code worked,
# If something is returned from here, it means it didn't work
foreach($key in $result.Keys)
{
$tmp = $result[$result.Keys -ne $key].ServiceCode
foreach($val in $result[$key])
{
if($val.ServiceCode -in $tmp)
{
$val
}
}
}
I was able to get unique items as follows:
# Get all items of CSVs in a single variable with adding the file name at the last column
$CSVs = Get-ChildItem "C:\temp\ItemCompare\*.csv" | ForEach-Object {
$CSV = Import-CSV -Path $_.FullName
$FileName = $_.Name
$CSV | Select-Object *,@{N='Filename';E={$FileName}}
}
Foreach($line in $CSVs){
$ServiceCode = $line.ServiceCode
$file = $line.Filename
if (!($CSVs | where {$_.ServiceCode -eq $ServiceCode -and $_.filename -ne $file})){
$line
}
}

Which operator provides quicker output: -match, -contains, or Where-Object, for large CSV files?

I am trying to build logic where I have to query 4 large CSV files against 1 CSV file, specifically finding an AD object across 4 domains and storing the results in variables for attribute comparison.
I have tried importing all the files into different variables and used the 3 different code samples below to get the desired output, but completion takes longer than expected.
CSV import:
$AllMainFile = Import-csv c:\AllData.csv
#Input file contains below
EmployeeNumber,Name,Domain
Z001,ABC,Test.com
Z002,DEF,Test.com
Z003,GHI,Test1.com
Z001,ABC,Test2.com
$AAA = Import-csv c:\AAA.csv
#Input file contains below
EmployeeNumber,Name,Domain
Z001,ABC,Test.com
Z002,DEF,Test.com
Z003,GHI,Test1.com
Z001,ABC,Test2.com
Z004,JKL,Test.com
$BBB = Import-Csv C:\BBB.csv
$CCC = Import-Csv C:\CCC.csv
$DDD = Import-Csv c:\DDD.csv
Sample code 1:
foreach ($x in $AllMainFile) {
$AAAoutput += $AAA | ? {$_.employeeNumber -eq $x.employeeNumber}
$BBBoutput += $BBB | ? {$_.employeeNumber -eq $x.employeeNumber}
$CCCoutput += $CCC | ? {$_.employeeNumber -eq $x.employeeNumber}
$DDDoutput += $DDD | ? {$_.employeeNumber -eq $x.employeeNumber}
if ($DDDoutput.Count -le 1 -and $AAAoutput.Count -le 1 -and $BBBoutput.Count -le 1 -and $CCCoutput.Count -le 1) {
#### My Other script execution code here
} else {
#### My Other script execution code here
}
}
Sample code 2 (just replacing with -match instead of Where-Object):
foreach ($x in $AllMainFile) {
$AAAoutput += $AAA -match $x.EmployeeNumber
$BBBoutput += $BBB -match $x.EmployeeNumber
$CCCoutput += $CCC -match $x.EmployeeNumber
$DDDoutput += $AllMainFile -match $x.EmployeeNumber
if ($DDDoutput.Count -le 1 -and $AAAoutput.Count -le 1 -and $BBBoutput.Count -le 1 -and $CCCoutput.Count -le 1) {
#### My Other script execution code here
} else {
#### My Other script execution code here
}
}
Sample code 3 (just replacing with -contains operator):
foreach ($x in $AllMainFile) {
foreach ($c in $AAA){ if ($AllMainFile.employeeNumber -contains $c.employeeNumber) {$AAAoutput += $c}}
foreach ($c in $BBB){ if ($AllMainFile.employeeNumber -contains $c.employeeNumber) {$BBBoutput += $c}}
foreach ($c in $CCC){ if ($AllMainFile.employeeNumber -contains $c.employeeNumber) {$CCCoutput += $c}}
foreach ($c in $DDD){ if ($AllMainFile.employeeNumber -contains $c.employeeNumber) {$DDDoutput += $c}}
if ($DDDoutput.Count -le 1 -and $AAAoutput.Count -le 1 -and $BBBoutput.Count -le 1 -and $CCCoutput.Count -le 1) {
#### My Other script execution code here
} else {
#### My Other script execution code here
}
}
I am expecting the script to execute as quickly as possible while comparing and looking up all 4 CSV files against 1 input file. Each file contains more than 1000k objects/rows with 5 columns.
Performance
Before answering the question, I would like to clear the air about measuring the performance of PowerShell cmdlets. Native PowerShell is very good at streaming objects and can therefore save a lot of memory if you stream correctly (do not assign a stream to a variable or wrap it in brackets). PowerShell is also capable of invoking almost every existing .NET method (like Add()) and technology, such as LINQ.
The usual way of measuring the performance of a command is:
(Measure-Command {<myCommand>}).TotalMilliseconds
If you use this on native PowerShell streaming cmdlets, they appear not to perform very well in comparison with statements and .NET commands. It is often concluded that e.g. LINQ outperforms native PowerShell commands by well over a factor of a hundred. The reason is that LINQ is reactive and uses deferred (lazy) execution: it reports that it has done the job, but it actually does the work at the moment you request a result (besides that, it caches a lot of results, which is easiest to exclude from a benchmark by starting a new session). Native PowerShell, by contrast, is rather proactive: it passes each resolved item immediately back into the pipeline, and the next cmdlet (e.g. Export-Csv) can then finalize the item and release it from memory.
In other words, if you have a slow input (see: Advocating native PowerShell) or have a large amount of data to process (e.g. larger than the physical memory available), it might be better and easier to use the native PowerShell approach.
Anyway, if you are comparing results, you should test them in practice and end-to-end, not just on data that is already available in memory. A small demonstration of the deferred-execution effect follows below.
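As a small illustration of that deferred-execution point (a sketch only, assuming a PowerShell version that can convert script blocks to delegates; exact timings will vary): building the LINQ query measures close to zero because nothing is enumerated until a result is actually consumed.
$numbers = [int[]](1..1000000)
$predicate = [Func[int, bool]] { param($n) $n % 2 -eq 0 }
# Building the query does no work yet, so this measures close to zero
(Measure-Command { $script:lazy = [System.Linq.Enumerable]::Where($numbers, $predicate) }).TotalMilliseconds
# The filtering actually runs here, when the sequence is enumerated
(Measure-Command { [System.Linq.Enumerable]::Count($script:lazy) }).TotalMilliseconds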
Building a list
I agree that using the Add() method on a list is much faster than using +=, which concatenates the new item with the current array and then reassigns the result back to the array.
But again, both approaches stall the pipeline because they collect all the data in memory, where you might be better off releasing intermediate results to disk, as in the sketch below.
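A sketch of what that streaming alternative could look like for one of the input files from the question (the filter and the output path are placeholders): each finished object goes straight into the pipeline and Export-Csv writes it out, so nothing accumulates in a list.
# stream straight to disk instead of collecting rows in an in-memory list
Import-Csv "C:\AAA.csv" |
    Where-Object { $_.EmployeeNumber } |
    ForEach-Object {
        # emit one output object at a time; Export-Csv writes it and it can be released from memory
        [pscustomobject]@{
            EmployeeNumber = $_.EmployeeNumber
            Name           = $_.Name
            Domain         = $_.Domain
        }
    } |
    Export-Csv "C:\OutputAAA.csv" -NoTypeInformation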
HashTables
You will probably find the most performance improvement in using a hash table, since hash tables are optimized for keyed lookups.
As it is required to compare two collections to each other, you can't stream both, but as explained it might be best and easiest to use one hash table for one side and compare it against each item in a stream on the other side. Because you want to compare AllData with each of the other tables, it is best to index that table into memory (in the form of a hash table).
This is how I would do this:
$Main = @{}
ForEach ($Item in $All) {
$Main[$Item.EmployeeNumber] = @{MainName = $Item.Name; MainDomain = $Item.Domain}
}
ForEach ($Name in 'AAA', 'BBB', 'CCC', 'DDD') {
Import-Csv "C:\$Name.csv" | Where-Object {$Main.ContainsKey($_.EmployeeNumber)} | ForEach-Object {
[PSCustomObject](@{EmployeeNumber = $_.EmployeeNumber; Name = $_.Name; Domain = $_.Domain} + $Main[$_.EmployeeNumber])
} | Export-Csv "C:\Output$Name.csv"
}
Addendum
Based on the comment (and the duplicates in the lists), it appears that a join on all keys is actually requested, not just on the EmployeeNumber. For this you need to concatenate the concerned keys (separated by a separator that is not used in the data) and use that as the key for the hash table.
Not in the question, but from the comment it also appears that a full join is expected. For the right-join part, this can be done by returning the right object when no match is found in the main table ($Main.ContainsKey($Key)). For the left-join part it is more complex, as you will need to track which items in main have already been matched ($InnerMain) and return the leftover items at the end:
$Main = @{}
$Separator = "`t" # Chose a separator that isn't used in any value
ForEach ($Item in $All) {
$Key = $Item.EmployeeNumber, $Item.Name, $Item.Domain -Join $Separator
$Main[$Key] = @{MainEmployeeNumber = $Item.EmployeeNumber; MainName = $Item.Name; MainDomain = $Item.Domain} # What output is expected?
}
ForEach ($Name in 'AAA', 'BBB', 'CCC', 'DDD') {
$InnerMain = @($False) * $Main.Count
$Index = 0
Import-Csv "C:\$Name.csv" | ForEach-Object {
$Key = $_.EmployeeNumber, $_.Name, $_.Domain -Join $Separator
If ($Main.ContainsKey($Key)) {
$InnerMain[$Index] = $True
[PSCustomObject](@{EmployeeNumber = $_.EmployeeNumber; Name = $_.Name; Domain = $_.Domain} + $Main[$Key])
} Else {
[PSCustomObject](@{EmployeeNumber = $_.EmployeeNumber; Name = $_.Name; Domain = $_.Domain; MainEmployeeNumber = $Null; MainName = $Null; MainDomain = $Null})
}
$Index++
} | Export-Csv "C:\Output$Name.csv"
$Index = 0
$(ForEach ($Item in $All) {
If (!$InnerMain[$Index]) {
$Key = $Item.EmployeeNumber, $Item.Name, $Item.Domain -Join $Separator
[PSCustomObject](@{EmployeeNumber = $Null; Name = $Null; Domain = $Null} + $Main[$Key])
}
$Index++
}) | Export-Csv "C:\Output$Name.csv" -Append
}
Join-Object
Just FYI, I have made a few improvements to the Join-Object cmdlet (use and installation are very simple, see: In Powershell, what's the best way to join two tables into one?), including an easier way to change multiple joins, which might come in handy for a request like this one. Although I still do not have a full understanding of what exactly you are looking for (and have minor questions, like: how could the domains differ in a domain column if it is an extract from one specific domain?).
I take the general description "Particularly finding an AD object against 4 domains and store them in variable for attribute comparison" as leading.
Here I presume that $AllMainFile is actually just an intermediate table consisting of a concatenation of all the concerned tables (not really necessary, and just confusing, as it might contain two types of duplicates: employee numbers from the same domain and employee numbers from other domains). If this is correct, you can just omit this table using the Join-Object cmdlet:
$AAA = ConvertFrom-Csv @'
EmployeeNumber,Name,Domain
Z001,ABC,Domain1
Z002,DEF,Domain2
Z003,GHI,Domain3
'@
$BBB = ConvertFrom-Csv @'
EmployeeNumber,Name,Domain
Z001,ABC,Domain1
Z002,JKL,Domain2
Z004,MNO,Domain4
'@
$CCC = ConvertFrom-Csv @'
EmployeeNumber,Name,Domain
Z005,PQR,Domain2
Z001,ABC,Domain1
Z001,STU,Domain2
'@
$DDD = ConvertFrom-Csv @'
EmployeeNumber,Name,Domain
Z005,VWX,Domain4
Z006,XYZ,Domain1
Z001,ABC,Domain3
'@
$AAA | FullJoin $BBB -On EmployeeNumber -Discern AAA |
FullJoin $CCC -On EmployeeNumber -Discern BBB |
FullJoin $DDD -On EmployeeNumber -Discern CCC,DDD | Format-Table
Result:
EmployeeNumber AAAName AAADomain BBBName BBBDomain CCCName CCCDomain DDDName DDDDomain
-------------- ------- --------- ------- --------- ------- --------- ------- ---------
Z001 ABC Domain1 ABC Domain1 ABC Domain1 ABC Domain3
Z001 ABC Domain1 ABC Domain1 STU Domain2 ABC Domain3
Z002 DEF Domain2 JKL Domain2
Z003 GHI Domain3
Z004 MNO Domain4
Z005 PQR Domain2 VWX Domain4
Z006 XYZ Domain1

CSV file - count distinct, group by, sum

I have a file that looks like the following:
Visitor ID,Revenue,Channel,Flight
1234,100,Email,BA123
2345,200,PPC,BA112
456,150,Email,BA456
I need to produce a file that contains:
The count of distinct Visitor IDs (3)
The total revenue (450)
The count of each Channel
Email 2
PPC 1
The count of each Flight
BA123 1
BA112 1
BA456 1
So far I have the following code; however, when executing it on the 350MB file, it takes too long and in some cases breaks the memory limit. As I have to run this function on multiple columns, it goes through the file many times. I ideally need to do this in one file pass.
$file = 'log.txt'
function GroupBy($columnName)
{
$objects = Import-Csv -Delimiter "`t" $file | Group-Object $columnName |
Select-Object #{n=$columnName;e={$_.Group[0].$columnName}}, Count
for($i=0;$i -lt $objects.count;$I++) {
$line += $columnName +"|"+$objects[$I]."$columnName" +"|Count|"+ $objects[$I].'Count' + $OFS
}
return $line
}
$finalOutput += GroupBy "Channel"
$finalOutput += GroupBy "Flight"
Write-Host $finalOutput
Any help would be much appreciated.
Thanks,
Craig
The fact that you are importing the CSV again for each column is what is killing your script. Try to do the loading once, then re-use the data. For example:
$data = Import-Csv .\data.csv
$flights = $data | Group-Object Flight -NoElement | ForEach-Object {[PsCustomObject]@{Flight=$_.Name;Count=$_.Count}}
$visitors = ($data | Group-Object "Visitor ID" | Measure-Object).Count
$revenue = ($data | Measure-Object Revenue -Sum).Sum
$channel = $data | Group-Object Channel -NoElement | ForEach-Object {[PsCustomObject]@{Channel=$_.Name;Count=$_.Count}}
You can display the data like this:
"Revenue : $revenue"
"Visitors: $visitors"
$flights | Format-Table -AutoSize
$channel | Format-Table -AutoSize
This will probably work, using hashtables.
Pros: It will be faster and use less memory.
Cons: It is far less readable than Group-Object, and requires more code.
To make it even less memory-hungry, read the CSV file line by line; a streaming variant is sketched after the code below.
$data = Import-Csv -Path "C:\temp\data.csv" -Delimiter ","
$DistinctVisitors = @{}
$TotalRevenue = 0
$ChannelCount = @{}
$FlightCount = @{}
$data | ForEach-Object {
$DistinctVisitors[$_.'Visitor ID'] = $true
$TotalRevenue += $_.Revenue
if (-not $ChannelCount.ContainsKey($_.Channel)) {
$ChannelCount[$_.Channel] = 0
}
$ChannelCount[$_.Channel] += 1
if (-not $FlightCount.ContainsKey($_.Flight)) {
$FlightCount[$_.Flight] = 0
}
$FlightCount[$_.Flight] += 1
}
$DistinctVisitorsCount = $DistinctVisitors.Keys | Measure-Object | Select-Object -ExpandProperty Count
Write-Output "The count of distinc Visitor IDs $DistinctVisitorsCount"
Write-Output "The total revenue $TotalRevenue"
Write-Output "The Count of each Channel"
$ChannelCount.Keys | ForEach-Object {
Write-Output "$_ $($ChannelCount[$_])"
}
Write-Output "The count of each Flight"
$FlightCount.Keys | ForEach-Object {
Write-Output "$_ $($FlightCount[$_])"
}
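As mentioned under Cons above, the same loop can be made less memory-hungry by consuming the CSV as a stream instead of loading it into $data first; only the small hashtables stay in memory. A sketch with the same column names (the reporting code above then works unchanged on these hashtables):
$DistinctVisitors = @{}
$TotalRevenue = 0
$ChannelCount = @{}
$FlightCount = @{}
# No intermediate $data variable: rows flow straight from Import-Csv into the loop and are released afterwards
Import-Csv -Path "C:\temp\data.csv" -Delimiter "," | ForEach-Object {
    $DistinctVisitors[$_.'Visitor ID'] = $true
    $TotalRevenue += $_.Revenue
    $ChannelCount[$_.Channel] = 1 + $ChannelCount[$_.Channel]
    $FlightCount[$_.Flight] = 1 + $FlightCount[$_.Flight]
}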

How do I join two lines in PowerShell

I'm trying to get Information about our VMs in Hyper-V via PowerShell.
This is what I got so far:
$Path = 'c:/VM.csv'
"Name,CPUs,Dynamischer Arbeitsspeicher,RAM Maximum (in MB),RAM Minimum (in MB), Size" > $Path
$line1 = Get-VM | %{($_.Name, $_.ProcessorCount, $_.DynamicMemoryEnabled, ($_.MemoryMaximum/1MB), ($_.MemoryMinimum/1MB)) -join ","}
$line2 = Get-VM -VMName * | Select-Object VMId | Get-VHD | %{$_.FileSize/1GB -join ","}
$out = $line1+","+ $line2
Write-Output $out | Out-File $Path -Append
Import-Csv -Path $Path | Out-GridView
The problem is that the second object ($line2) should end up on the same line as $line1, as the Size column. As you can see, currently the information about the size of the VMs ($line2) is written in rows under the output of $line1. Also, the order is wrong.
Any idea what is wrong with my code?
Thanks in advance.
I think you messed it up a little; Export-Csv will do the job without the need to manually define the CSV structure.
Anyway, regarding your code, I think you can improve it a little (I don't have Hyper-V to test this, but I think it should work).
What I've done is create a results array to hold the final data, then use a foreach loop to iterate over the Get-VM results and create a row for each VM; at the end of each iteration the row is added to the final results array:
$Results = @()
foreach ($VM in (Get-VM))
{
$Row = "" | Select Name,CPUs,'Dynamischer Arbeitsspeicher','RAM Maximum (inMB)','RAM Minimum (in MB)',Size
$Row.Name = $VM.Name
$Row.CPUs = $VM.ProcessorCount
$Row.'Dynamischer Arbeitsspeicher' = $VM.DynamicMemoryEnabled
$Row.'RAM Maximum (inMB)' = $VM.MemoryMaximum/1MB
$Row.'RAM Minimum (in MB)' = $VM.MemoryMinimum/1MB
$Total=0; ($VM.VMId | Get-VHD | %{$Total += ($_.FileSize/1GB)})
$Row.Size = [math]::Round($Total)
$Results += $Row
}
$Results | Export-Csv c:\vm.csv -NoTypeInformation
This is definitely not the proper way to build an object list; instead you should do something like this:
Get-VM | Select Name,
@{Name = "CPUs"; Expression = {$_.ProcessorCount}},
@{Name = "Dynamischer Arbeitsspeicher"; Expression = {$_.DynamicMemoryEnabled}},
@{Name = "RAM Maximum (in MB)"; Expression = {$_.MemoryMaximum/1MB}},
@{Name = "RAM Minimum (in MB)"; Expression = {$_.MemoryMinimum/1MB}},
@{Name = "Size"; Expression = {($_.VMId | Get-VHD | %{$_.FileSize/1GB}) -join ","}} |
Export-Csv c:\vm.csv -NoTypeInformation

Powershell csv row column transpose and manipulation

I'm a newbie in PowerShell. I'm trying to transpose rows and columns in a medium-sized CSV-based record set (around 10000 rows). The original CSV consists of around 10000 rows with 3 columns ("Time","Id","IOT"), as below:
"Time","Id","IOT"
"00:03:56","23","26"
"00:03:56","24","0"
"00:03:56","25","0"
"00:03:56","26","1"
"00:03:56","27","0"
"00:03:56","28","0"
"00:03:56","29","0"
"00:03:56","30","1953"
"00:03:56","31","22"
"00:03:56","32","39"
"00:03:56","33","8"
"00:03:56","34","5"
"00:03:56","35","269"
"00:03:56","36","5"
"00:03:56","37","0"
"00:03:56","38","0"
"00:03:56","39","0"
"00:03:56","40","1251"
"00:03:56","41","103"
"00:03:56","42","0"
"00:03:56","43","0"
"00:03:56","44","0"
"00:03:56","45","0"
"00:03:56","46","38"
"00:03:56","47","14"
"00:03:56","48","0"
"00:03:56","49","0"
"00:03:56","2013","0"
"00:03:56","2378","0"
"00:03:56","2380","32"
"00:03:56","2758","0"
"00:03:56","3127","0"
"00:03:56","3128","0"
"00:09:16","23","22"
"00:09:16","24","0"
"00:09:16","25","0"
"00:09:16","26","2"
"00:09:16","27","0"
"00:09:16","28","0"
"00:09:16","29","21"
"00:09:16","30","48"
"00:09:16","31","0"
"00:09:16","32","4"
"00:09:16","33","4"
"00:09:16","34","7"
"00:09:16","35","382"
"00:09:16","36","12"
"00:09:16","37","0"
"00:09:16","38","0"
"00:09:16","39","0"
"00:09:16","40","1882"
"00:09:16","41","42"
"00:09:16","42","0"
"00:09:16","43","3"
"00:09:16","44","0"
"00:09:16","45","0"
"00:09:16","46","24"
"00:09:16","47","22"
"00:09:16","48","0"
"00:09:16","49","0"
"00:09:16","2013","0"
"00:09:16","2378","0"
"00:09:16","2380","19"
"00:09:16","2758","0"
"00:09:16","3127","0"
"00:09:16","3128","0"
...
...
...
I tried to do the transpose using code based on the PowerShell script downloaded from https://gallery.technet.microsoft.com/scriptcenter/Powershell-Script-to-7c8368be
Basically my PowerShell code is as below:
$b = @()
foreach ($Time in $a.Time | Select -Unique) {
$Props = [ordered]@{ Time = $time }
foreach ($Id in $a.Id | Select -Unique){
$IOT = ($a.where({ $_.Id -eq $Id -and $_.time -eq $time })).IOT
$Props += @{ $Id = $IOT }
}
$b += New-Object -TypeName PSObject -Property $Props
}
$b | FT -AutoSize
$b | Out-GridView
The code above gives me the result I expect: all "Id" values become column headers, all "Time" values become unique rows, and the "IOT" values fill the intersection of "Id" x "Time", as below:
"Time","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","2013","2378","2380","2758","3127","3128"
"00:03:56","26","0","0","1","0","0","0","1953","22","39","8","5","269","5","0","0","0","1251","103","0","0","0","0","38","14","0","0","0","0","32","0","0","0"
"00:09:16","22","0","0","2","0","0","21","48","0","4","4","7","382","12","0","0","0","1882","42","0","3","0","0","24","22","0","0","0","0","19","0","0","0"
While it only involves a few hundred rows, the result comes out quickly as expected, but when processing the whole CSV file with 10000 rows, the script above keeps executing, doesn't seem able to finish for a long time (hours), and never spits out any results.
So could some PowerShell experts from Stack Overflow help assess the code above and perhaps suggest modifications to speed up the results?
Many thanks for the advice.
10000 records is a lot, but I don't think it is enough to advise StreamReader* and manually parsing the CSV. The biggest thing going against you, though, is the following line:
$b += New-Object -TypeName PSObject -Property $Props
What PowerShell is doing here is making a new array and appending that element to it. This is a very memory-intensive operation that you are repeating thousands of times. The better thing to do in this case is use the pipeline to your advantage.
$data = Import-Csv -Path "D:\temp\data.csv"
$headers = $data.ID | Sort-Object {[int]$_} -Unique
$data | Group-Object Time | ForEach-Object{
$props = [ordered]@{Time = $_.Name}
foreach($header in $headers){
$props."$header" = ($_.Group | Where-Object{$_.ID -eq $header}).IOT
}
[pscustomobject]$props
} | export-csv d:\temp\testing.csv -NoTypeInformation
$data will be your entire file in memory as objects. We need to get all the $headers that will become the column headers.
Group the data by each Time. Then inside each time object we get the value for every ID. If the ID does not exist during that time then the entry will show as null.
This is not the best way but should be faster than yours. I ran 10000 records in under a minute (51 second average over 3 passes). Will benchmark to show you if I can.
I just ran your code once with my own data and it took 13 minutes. I think it is safe to say that mine performs faster.
Dummy data was made with this logic FYI
1..100 | %{
$time = get-date -Format "hh:mm:ss"
sleep -Seconds 1
1..100 | % {
[pscustomobject][ordered]@{
time = $time
id = $_
iot = Get-Random -Minimum 0 -Maximum 7
}
}
} | Export-Csv d:\temp\data.csv -notypeinformation
* Not a stellar example of StreamReader for your case; I'm just pointing it out to show that it is the better way to read very large files. You just need to parse the string line by line.
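For completeness, a rough sketch of that StreamReader route for this particular 3-column file; the quote handling is naive (it just assumes the exact "value","value","value" layout shown in the question) and the accumulation step is left as a placeholder:
$reader = [System.IO.StreamReader]::new('D:\temp\data.csv')
try {
    $null = $reader.ReadLine()                       # skip the "Time","Id","IOT" header
    while ($null -ne ($line = $reader.ReadLine())) {
        # naive parse: trim the outer quotes, then split the three quoted fields
        $time, $id, $iot = $line.Trim('"') -split '","'
        # ... accumulate into the Time/Id lookup structure here ...
    }
}
finally {
    $reader.Dispose()
}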