let's say that I have several CSV files and I need to check a specific column and find values that exist in one file, but not in any of the others. I'm having a bit of trouble coming up with the best way to go about it as I wanted to use Compare-Object and possibly keep all columns and not just the one that contains the values I'm checking.
So I do indeed have several CSV files and they all have a Service Code column, and I'm trying to create a list for each Service Code that only appears in one file. So I would have "Service Codes only in CSV1", "Service Codes only in CSV2", etc.
Based on some testing and a semi-related question, I've come up with a workable solution, but with all of the nesting and For loops, I'm wondering if there is a more elegant method out there.
Here's what I do have:
$files = Get-ChildItem -LiteralPath "C:\temp\ItemCompare" -Include "*.csv"
$HashList = [System.Collections.Generic.List[System.Collections.Generic.HashSet[String]]]::New()
For ($i = 0; $i -lt $files.Count; $i++){
$TempHashSet = [System.Collections.Generic.HashSet[String]]::New([String[]](Import-Csv $files[$i])."Service Code")
$HashList.Add($TempHashSet)
}
$FinalHashList = [System.Collections.Generic.List[System.Collections.Generic.HashSet[String]]]::New()
For ($i = 0; $i -lt $HashList.Count; $i++){
$UniqueHS = [System.Collections.Generic.HashSet[String]]::New($HashList[$i])
For ($j = 0; $j -lt $HashList.Count; $j++){
#Skip the check when the HashSet would be compared to itself
If ($j -eq $i){Continue}
$UniqueHS.ExceptWith($HashList[$j])
}
$FinalHashList.Add($UniqueHS)
}
It seems a bit messy to me using so many different .NET references, and I know I could make it cleaner with a tag to say using namespace System.Collections.Generic, but I'm wondering if there is a way to make it work using Compare-Object which was my first attempt, or even just a simpler/more efficient method to filter each file.
I believe I found an "elegant" solution based on Group-Object, using only a single pipeline:
# Import all CSV files.
Get-ChildItem $PSScriptRoot\csv\*.csv -File -PipelineVariable file | Import-Csv |
# Add new column "FileName" to distinguish the files.
Select-Object *, #{ label = 'FileName'; expression = { $file.Name } } |
# Group by ServiceCode to get a list of files per distinct value.
Group-Object ServiceCode |
# Filter by ServiceCode values that exist only in a single file.
# Sort-Object -Unique takes care of possible duplicates within a single file.
Where-Object { ( $_.Group.FileName | Sort-Object -Unique ).Count -eq 1 } |
# Expand the groups so we get the original object structure back.
ForEach-Object Group |
# Format-Table requires sorting by FileName, for -GroupBy.
Sort-Object FileName |
# Finally pretty-print the result.
Format-Table -Property ServiceCode, Foo -GroupBy FileName
Test Input
a.csv:
ServiceCode,Foo
1,fop
2,fip
3,fap
b.csv:
ServiceCode,Foo
6,bar
6,baz
3,bam
2,bir
4,biz
c.csv:
ServiceCode,Foo
2,bla
5,blu
1,bli
Output
FileName: b.csv
ServiceCode Foo
----------- ---
4 biz
6 bar
6 baz
FileName: c.csv
ServiceCode Foo
----------- ---
5 blu
Looks correct to me. The values 1, 2 and 3 are duplicated between multiple files, so they are excluded. 4, 5 and 6 exist only in single files, while 6 is a duplicate value only within a single file.
Understanding the code
Maybe it is easier to understand how this code works, by looking at the intermediate output of the pipeline produced by the Group-Object line:
Count Name Group
----- ---- -----
2 1 {#{ServiceCode=1; Foo=fop; FileName=a.csv}, #{ServiceCode=1; Foo=bli; FileName=c.csv}}
3 2 {#{ServiceCode=2; Foo=fip; FileName=a.csv}, #{ServiceCode=2; Foo=bir; FileName=b.csv}, #{ServiceCode=2; Foo=bla; FileName=c.csv}}
2 3 {#{ServiceCode=3; Foo=fap; FileName=a.csv}, #{ServiceCode=3; Foo=bam; FileName=b.csv}}
1 4 {#{ServiceCode=4; Foo=biz; FileName=b.csv}}
1 5 {#{ServiceCode=5; Foo=blu; FileName=c.csv}}
2 6 {#{ServiceCode=6; Foo=bar; FileName=b.csv}, #{ServiceCode=6; Foo=baz; FileName=b.csv}}
Here the Name contains the unique ServiceCode values, while Group "links" the data to the files.
From here it should already be clear how to find values that exist only in single files. If duplicate ServiceCode values within a single file wouldn't be allowed, we could even simplify the filter to Where-Object Count -eq 1. Since it was stated that dupes within single files may exist, we need the Sort-Object -Unique to count multiple equal file names within a group as only one.
It is not completely clear what you expect as an output.
If this is just the ServiceCodes that intersect then this is actually a duplicate with:
Comparing two arrays & get the values which are not common
Union and Intersection in PowerShell?
But taking that you actually want the related object and files, you might use this approach:
$HashTable = #{}
ForEach ($File in Get-ChildItem .\*.csv) {
ForEach ($Object in (Import-Csv $File)) {
$HashTable[$Object.ServiceCode] = $Object |Select-Object *,
#{ n='File'; e={ $File.Name } },
#{ n='Count'; e={ $HashTable[$Object.ServiceCode].Count + 1 } }
}
}
$HashTable.Values |Where-Object Count -eq 1
Here is my take on this fun exercise, I'm using a similar approach as yours with the HashSet but adding [System.StringComparer]::OrdinalIgnoreCase to leverage the .Contains(..) method:
using namespace System.Collections.Generic
# Generate Random CSVs:
$charset = 'abABcdCD0123xXyYzZ'
$ran = [random]::new()
$csvs = #{}
foreach($i in 1..50) # Create 50 CSVs for testing
{
$csvs["csv$i"] = foreach($z in 1..50) # With 50 Rows
{
$index = (0..2).ForEach({ $ran.Next($charset.Length) })
[pscustomobject]#{
ServiceCode = [string]::new($charset[$index])
Data = $ran.Next()
}
}
}
# Get Unique 'ServiceCode' per CSV:
$result = #{}
foreach($key in $csvs.Keys)
{
# Get all unique `ServiceCode` from the other CSVs
$tempHash = [HashSet[string]]::new(
[string[]]($csvs[$csvs.Keys -ne $key].ServiceCode),
[System.StringComparer]::OrdinalIgnoreCase
)
# Filter the unique `ServiceCode`
$result[$key] = foreach($line in $csvs[$key])
{
if(-not $tempHash.Contains($line.ServiceCode))
{
$line
}
}
}
# Test if the code worked,
# If something is returned from here means it didn't work
foreach($key in $result.Keys)
{
$tmp = $result[$result.Keys -ne $key].ServiceCode
foreach($val in $result[$key])
{
if($val.ServiceCode -in $tmp)
{
$val
}
}
}
i was able to get unique items as follow
# Get all items of CSVs in a single variable with adding the file name at the last column
$CSVs = Get-ChildItem "C:\temp\ItemCompare\*.csv" | ForEach-Object {
$CSV = Import-CSV -Path $_.FullName
$FileName = $_.Name
$CSV | Select-Object *,#{N='Filename';E={$FileName}}
}
Foreach($line in $CSVs){
$ServiceCode = $line.ServiceCode
$file = $line.Filename
if (!($CSVs | where {$_.ServiceCode -eq $ServiceCode -and $_.filename -ne $file})){
$line
}
}
I have a script, that runs with a for loop.
It outputs 3 variables: $A, $B and $C.
I would like at each iteration to outout those 3 variables onto the same line inside a file, instead of the standard output (each of them separaed by a comma).
I would like to add a header. I am cerating a CSV file.
I have seen several way to output variable inside a CSV file, but not inside a for loop.
Any way to concatenate those 3 variables and append them to a file?
Write-Host $A ',' $B ',' $C | Out-File -FilePath C:\temp\TEST.csv -Append
$A = 'foo'
$B = 'bar'
$C = 'baz'
"$A,$B,$C" # outputs Foo,Bar,Baz
However, if you're working with CSVs, the preferred method is to use Import-Csv and Export-Csv, which handles the formatting for you.
Write-Host output goes directly to the host console. It cannot be pipelined. Also, as has already been mentioned, you normally want to use the Import-Csv and Export-Csv cmdlets when dealing with CSVs (particularly since you want a CSV with headers).
To get a bunch of variables in a form that is exportable by Export-Csv construct a custom object like this:
New-Object -Type PSObject -Property #{
'X' = $A
'Y' = $B
'Z' = $C
} | Export-Csv 'C:\Temp\test.csv' -NoType
The keys of the property hashtable become the column titles of the CSV.
Since you say you want to export data from a for loop you'll need to add the parameter -Append to Export-Csv:
for (...) {
New-Object ... | Export-Csv 'C:\Temp\test.csv' -NoType -Append
}
That is because for loops don't write to the pipeline, meaning that something like this won't work:
for (...) {
New-Object ...
} | Export-Csv 'C:\Temp\test.csv' -NoType
However, depending on what your actual loop looks like you might be able to substitute it with a combination of the range operator (..) and a ForEach-Object loop:
1..5 | ForEach-Object {
New-Object ...
} | Export-Csv 'C:\Temp\test.csv' -NoType
Provided at each iteration to outout those 3 variables is not meant literal,
you could gather the for loops output in a variable via a PSCustomObject and then Export-Csv the result.
For demonstration a simple counting loop:
$Result = for ($i=1;$i -lt 5;$i++){[pscustomobject]#{X=$i;Y=$i+1;Z=$i+2}}
$Result
X Y Z
- - -
1 2 3
2 3 4
3 4 5
4 5 6
Then Export/ConvertTo-Csv:
$Result|ConvertTo-Csv -NoTypeInformation
"X","Y","Z"
"1","2","3"
"2","3","4"
"3","4","5"
"4","5","6"
I don't have much experience with CSV, so apologies if I'm really blind here.
I have a basic CSV and script setup to test this with. The CSV has two columns, Letter and Number. Letter goes from A-F and Number goes from 1-10. This means that Number has more rows than Letter, so when running the following script, the output can sometimes provide an empty Letter.
$L = ipcsv ln.csv | Get-Random | Select-Object -ExpandProperty Letter
$N = ipcsv ln.csv | Get-Random | Select-Object -ExpandProperty Number
Write-Output $L
Write-Output $N
Some outputs come out as
B
9
while others can come out as
5
I don't know whether the issue is my script not ignoring empty lines or my CSV being written incorrectly, which is posted below.
Letter,Number
A,1
B,2
C,3
D,4
E,5
F,6
,7
,8
,9
,10
What's my issue here and how do I go about fixing it?
Your asking for a random object from your CSV, not a random letter. Since some of the lines are missing a letter, you might end up picking one that has an empty Letter-value.
If you want to pick any line with a letter, you need to filter the rows first to only pick from the ones with a value. Also, you sould avoid reading the same file twice, use a varible
#$csv = Import-CSV -Path ln.csv
$csv = #"
Letter,Number
A,1
B,2
C,3
D,4
E,5
F,6
,7
,8
,9
,10
"# | ConvertFrom-Csv
$L = $csv | Where-Object { $_.Letter } | Get-Random | Select-Object -ExpandProperty Letter
$N = $csv | Where-Object { $_.Number } | Get-Random | Select-Object -ExpandProperty Number
Write-Output $L
Write-Output $N
CSV migtht not be the best solution for this scenario. Ex. you could store these as arrays in the script, like:
$chars = [char[]](65..70) #A-F uppercase letters
$numbers = 1..10
$L = $chars | Get-Random
$N = $numbers | Get-Random
Write-Output $L
Write-Output $N
Import-Csv turns each line into an object, with a property for each column.
Even though one or more property values may be empty, the object still exists, and Get-Random has no reason determine that an object with a certain property (such as Letter) having the value "" (ie. an empty string), should not be picked.
You can fix this by expanding the property values first, then filter for empty values and then finally pick the random value from those that weren't empty:
$L = ipcsv ln.csv |Select-Object -ExpandProperty Letter |Where-Object {$_} |Get-Random
$N = ipcsv ln.csv |Select-Object -ExpandProperty Number |Where-Object {$_} |Get-Random
I'm newbie in Powershell. I tried to process / transpose row-column against a medium size csv based record (around 10000 rows). The original CSV consist of around 10000 rows with 3 columns ("Time","Id","IOT") as below:
"Time","Id","IOT"
"00:03:56","23","26"
"00:03:56","24","0"
"00:03:56","25","0"
"00:03:56","26","1"
"00:03:56","27","0"
"00:03:56","28","0"
"00:03:56","29","0"
"00:03:56","30","1953"
"00:03:56","31","22"
"00:03:56","32","39"
"00:03:56","33","8"
"00:03:56","34","5"
"00:03:56","35","269"
"00:03:56","36","5"
"00:03:56","37","0"
"00:03:56","38","0"
"00:03:56","39","0"
"00:03:56","40","1251"
"00:03:56","41","103"
"00:03:56","42","0"
"00:03:56","43","0"
"00:03:56","44","0"
"00:03:56","45","0"
"00:03:56","46","38"
"00:03:56","47","14"
"00:03:56","48","0"
"00:03:56","49","0"
"00:03:56","2013","0"
"00:03:56","2378","0"
"00:03:56","2380","32"
"00:03:56","2758","0"
"00:03:56","3127","0"
"00:03:56","3128","0"
"00:09:16","23","22"
"00:09:16","24","0"
"00:09:16","25","0"
"00:09:16","26","2"
"00:09:16","27","0"
"00:09:16","28","0"
"00:09:16","29","21"
"00:09:16","30","48"
"00:09:16","31","0"
"00:09:16","32","4"
"00:09:16","33","4"
"00:09:16","34","7"
"00:09:16","35","382"
"00:09:16","36","12"
"00:09:16","37","0"
"00:09:16","38","0"
"00:09:16","39","0"
"00:09:16","40","1882"
"00:09:16","41","42"
"00:09:16","42","0"
"00:09:16","43","3"
"00:09:16","44","0"
"00:09:16","45","0"
"00:09:16","46","24"
"00:09:16","47","22"
"00:09:16","48","0"
"00:09:16","49","0"
"00:09:16","2013","0"
"00:09:16","2378","0"
"00:09:16","2380","19"
"00:09:16","2758","0"
"00:09:16","3127","0"
"00:09:16","3128","0"
...
...
...
I tried to do the transpose using code based from powershell script downloaded from https://gallery.technet.microsoft.com/scriptcenter/Powershell-Script-to-7c8368be
Basically my powershell code is as below:
$b = #()
foreach ($Time in $a.Time | Select -Unique) {
$Props = [ordered]#{ Time = $time }
foreach ($Id in $a.Id | Select -Unique){
$IOT = ($a.where({ $_.Id -eq $Id -and $_.time -eq $time })).IOT
$Props += #{ $Id = $IOT }
}
$b += New-Object -TypeName PSObject -Property $Props
}
$b | FT -AutoSize
$b | Out-GridView
Above code could give me the result as I expected which are all "Id" values will become column headers while all "Time" values will become unique row and "IOT" values as the intersection from "Id" x "Time" as below:
"Time","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","2013","2378","2380","2758","3127","3128"
"00:03:56","26","0","0","1","0","0","0","1953","22","39","8","5","269","5","0","0","0","1251","103","0","0","0","0","38","14","0","0","0","0","32","0","0","0"
"00:09:16","22","0","0","2","0","0","21","48","0","4","4","7","382","12","0","0","0","1882","42","0","3","0","0","24","22","0","0","0","0","19","0","0","0"
While it only involves a few hundreds rows, the result comes out quickly as expected, but the problem now when processing the whole csv file with 10000 rows, the script above 'keep executing' and doesn't seem able to finish for long time (hours) and couldn't spit out any results.
So probably if some powershell experts from stackoverflow could help to asses the code above and probably could help to modify to speed up the results?
Many thanks for the advise
10000 records is a lot but I don't think it is enough to advise streamreader* and manually parsing the CSV. The biggest thing going against you though is the following line:
$b += New-Object -TypeName PSObject -Property $Props
What PowerShell is doing here is making a new array and appending that element to it. This is a very memory intensive operation that you are repeating 1000's of times. Better thing to do in this case is use the pipeline to your advantage.
$data = Import-Csv -Path "D:\temp\data.csv"
$headers = $data.ID | Sort-Object {[int]$_} -Unique
$data | Group-Object Time | ForEach-Object{
$props = [ordered]#{Time = $_.Name}
foreach($header in $headers){
$props."$header" = ($_.Group | Where-Object{$_.ID -eq $header}).IOT
}
[pscustomobject]$props
} | export-csv d:\temp\testing.csv -NoTypeInformation
$data will be your entire file in memory as an object. Need to get all the $headers that will be the column headers.
Group the data by each Time. Then inside each time object we get the value for every ID. If the ID does not exist during that time then the entry will show as null.
This is not the best way but should be faster than yours. I ran 10000 records in under a minute (51 second average over 3 passes). Will benchmark to show you if I can.
I just ran your code once with my own data and it took 13 minutes. I think it is safe to say that mine performs faster.
Dummy data was made with this logic FYI
1..100 | %{
$time = get-date -Format "hh:mm:ss"
sleep -Seconds 1
1..100 | % {
[pscustomobject][ordered]#{
time = $time
id = $_
iot = Get-Random -Minimum 0 -Maximum 7
}
}
} | Export-Csv d:\temp\data.csv -notypeinformation
* Not a stellar example for your case of streamreader. Just pointing it out to show that it is the better way to read large files. Just need to parse string line by line.