Fastest way to match two large arrays of objects by key in PowerShell

I have two PowerShell arrays of objects generated via Import-Csv, and I must match them by one of their properties. Specifically, it is a 1:n relationship, so currently I'm following this pattern:
foreach ($line in $array1) {
    $match = $array2 | where {$_.key -eq $line.key} # could be 1 or n results
    ... # process the 1 to n matching lines here
}
This is not very efficient (both tables have many columns) and takes an unacceptably long time for our needs. Is there a faster way to perform this match?
Both data sources come from a CSV file, so using something instead of Import-Csv would also be welcome.
Thanks

The standard method is to index the data using a hashtable (or dictionary/map in other languages).
function buildIndex($csv, [string]$keyName) {
    $index = @{}
    foreach ($row in $csv) {
        $key = $row.($keyName)
        $data = $index[$key]
        if ($data -is [Collections.ArrayList]) {
            $data.Add($row) >$null
        } elseif ($data) {
            $index[$key] = [Collections.ArrayList]@($data, $row)
        } else {
            $index[$key] = $row
        }
    }
    $index
}
$csv1 = Import-Csv 'r:\1.csv'
$csv2 = Import-Csv 'r:\2.csv'
$index2 = buildIndex $csv2 'key' # arguments are space-separated; a comma would pass a single array
foreach ($row in $csv1) {
    $matchedInCsv2 = $index2[$row.key]
    foreach ($row2 in $matchedInCsv2) {
        # ........
    }
}
Also, if you need speed and iterate over a big collection, avoid | pipelining, as it's many times slower than the foreach/while/do statements. And don't use anything with a ScriptBlock, like where {$_.key -eq $line.key}, in such a loop, because the execution context creation adds a ridiculously big overhead compared to the simple code inside.
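To see the difference on your own data, here is a rough timing sketch (the file paths and the key column come from the example above; absolute numbers vary by machine and PowerShell version):
$csv1 = Import-Csv 'r:\1.csv'
$csv2 = Import-Csv 'r:\2.csv'
# pipeline + ScriptBlock filter: rescans $csv2 for every row of $csv1
(Measure-Command {
    foreach ($line in $csv1) {
        $match = $csv2 | where {$_.key -eq $line.key}
    }
}).TotalSeconds
# hashtable index built once, then constant-time lookups
(Measure-Command {
    $index2 = buildIndex $csv2 'key'
    foreach ($line in $csv1) {
        $match = $index2[$line.key]
    }
}).TotalSeconds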

Related

Check if a condition is met by a line within a TXT but "in an advanced way"

I have a TXT file of 1300 megabytes (a huge thing). I want to build code that does two things:
1. Every line contains a unique ID at the beginning. I want to check, for all lines with the same unique ID, whether the conditions are met for that "group" of IDs. (This answers: for how many lines with the unique ID X have all conditions been met?)
2. When the script is finished, I want to remove from the TXT all lines where the condition was met (see 1), so I can rerun the script with another condition set to "narrow down" the whole document.
After a few cycles I would finally have a set of conditions that applies to all lines in the document.
My current approach seems to be very slow (one cycle takes hours).
If you find an easier way to do that, feel free to recommend.
Help is welcome :)
Code so far (does not fulfill everything from 1 & 2):
foreach ($item in $liste)
{
    # Check conditions
    if (($item -like "*XXX*") -and ($item -like "*YYY*") -and ($item -notlike "*ZZZ*")) {
        # Add a line to a document to see which lines match the condition
        Add-Content "C:\Desktop\it_seems_to_match.txt" "$item"
        # Retrieve the unique ID from the line and feed the array
        $array += $item.Split("/")[1]
        # Remove the line from the final document
        $liste = $liste -replace $item, ""
    }
}
# Pipe the "new, cleaned" list somewhere
$liste | Set-Content -Path "C:\NewListToWorkWith.txt"
# Show me the counts
$array | group | % { $h = @{} } { $h[$_.Name] = $_.Count } { $h } | Out-File "C:\Desktop\count.txt"
Demo Lines:
images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
Performance considerations:
Add-Content "C:\Desktop\it_seems_to_match.txt" "$item"
Try to avoid wrapping cmdlet pipelines in a loop (here, Add-Content reopens the file on every iteration); a combined one-pass sketch follows this list.
See also: Mastering the (steppable) pipeline
$array += $item.Split("/")[1]
Try to avoid using the increase assignment operator (+=) to create a collection
See also: Why should I avoid using the increase assignment operator (+=) to create a collection
$liste = $liste -replace $item, ""
This is a very expensive operation considering that you are reassigning (copying) a long list ($liste) with each iteration.
Besides, it is bad practice to change an array that you are currently iterating over.
$array | group | ...
Group-Object is a rather slow cmdlet; you'd better collect (or count) the items on the fly (where you do $array += $item.Split("/")[1]) using a hashtable, something like:
$Name = $item.Split("/")[1]
if (!$HashTable.Contains($Name)) { $HashTable[$Name] = [Collections.Generic.List[String]]::new() }
$HashTable[$Name].Add($Item)
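Putting these points together, a one-pass sketch of the whole job (the conditions and output paths mirror the question; the input path is a placeholder, and [System.IO.File]::ReadLines streams the file line by line instead of loading it all at once):
$matched   = [System.Collections.Generic.List[string]]::new()
$remaining = [System.Collections.Generic.List[string]]::new()
$counts    = @{}
foreach ($item in [System.IO.File]::ReadLines('C:\Desktop\input.txt')) {
    if ($item -like "*XXX*" -and $item -like "*YYY*" -and $item -notlike "*ZZZ*") {
        $matched.Add($item)
        $id = $item.Split("/")[1]
        $counts[$id] = 1 + $counts[$id]   # a missing key counts as 0
    } else {
        $remaining.Add($item)             # keep only the lines that did not match
    }
}
# write each output file once instead of once per line
[System.IO.File]::WriteAllLines('C:\Desktop\it_seems_to_match.txt', $matched)
[System.IO.File]::WriteAllLines('C:\NewListToWorkWith.txt', $remaining)
$counts.GetEnumerator() | Sort-Object Name | Out-File 'C:\Desktop\count.txt'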
To minimize memory usage it may be better to read one line at a time and check whether it already exists. In the code below I used a StringReader; you can replace it with a StreamReader to read from a file. I'm checking whether the entire string exists, but you may want to split the line. Notice there are duplicates in the input but not in the dictionary. See the code below:
$rows = @"
images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
"#
$dict = [System.Collections.Generic.Dictionary[int, System.Collections.Generic.List[string]]]::new()
$reader = [System.IO.StringReader]::new($rows)
while (($row = $reader.ReadLine()) -ne $null)
{
    $hash = $row.GetHashCode()
    if ($dict.ContainsKey($hash))
    {
        # check if the list contains the string
        if ($dict[$hash].Contains($row))
        {
            # string is a duplicate
        }
        else
        {
            # hash collision: add the string to the existing list
            $dict[$hash].Add($row)
        }
    }
    else
    {
        # add a new hash value to the dictionary
        $list = [System.Collections.Generic.List[string]]::new()
        $list.Add($row)
        $dict.Add($hash, $list)
    }
}
$dict
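As an aside, if the goal is just de-duplication, .NET's HashSet[string] handles the hashing and collision checks internally, so a shorter sketch (reusing the $rows here-string above) would be:
$seen = [System.Collections.Generic.HashSet[string]]::new()
$reader = [System.IO.StringReader]::new($rows)   # swap in a StreamReader for a file
while (($row = $reader.ReadLine()) -ne $null)
{
    if (-not $seen.Add($row))
    {
        # Add returns $false when the string was already present, i.e. a duplicate
    }
}
$seen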

Fast compare of two large CSVs (both rows and columns) in PowerShell

I have two large CSVs to compare. Both CSVs are basically data from the same system, captured 1 day apart. The number of rows is around 12k, with 30 columns.
The aim is to identify which column data has changed for each primary key (ID).
My idea was to loop through the CSVs to identify which rows have changed and dump these into a separate CSV. Once done, I again loop through the changed rows and identify the exact change in the columns.
$NewCSV = Import-Csv -Path ".\Data_A.csv"
$OldCSV = Import-Csv -Path ".\Data_B.csv"
foreach ($LineNew in $NewCSV)
{
    foreach ($LineOld in $OldCSV)
    {
        if ($LineNew -eq $LineOld)
        {
            Write-Host $LineNew, " Match"
        } else {
            Write-Host $LineNew, " Not Match"
        }
    }
}
But as soon as I run the loop, it takes forever to finish for 12k rows. I was hoping there must be a more efficient way to compare large files in PowerShell. Something that is quicker.
Well, you can give this a try. I'm not claiming it will be fast, for the reasons vonPryz has already pointed out, but it should give you a good side-by-side perspective to compare what has changed from OldCsv to NewCsv.
Note: Those cells that have the same value on both CSVs will be ignored.
$NewCSV = Import-Csv -Path ".\Data_A.csv"
$OldCSV = Import-Csv -Path ".\Data_B.csv" | Group-Object ID -AsHashTable -AsString
$properties = $NewCSV[0].PSObject.Properties.Name
$result = foreach ($line in $NewCSV)
{
    if ($ref = $OldCSV[$line.ID])
    {
        foreach ($prop in $properties)
        {
            if ($line.$prop -ne $ref.$prop)
            {
                [pscustomobject]@{
                    ID       = $line.ID
                    Property = $prop
                    OldValue = $ref.$prop
                    NewValue = $line.$prop
                }
            }
        }
        continue
    }
    Write-Warning "ID $($line.ID) could not be found on Old Csv!!"
}
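If you want to keep the comparison results, $result can be exported directly (the output path here is just an example):
$result | Export-Csv '.\Differences.csv' -NoTypeInformation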
As vonPryz hints in the comments, you've written an algorithm with quadratic time complexity (O(n²) in Big-O notation): every time the input size doubles, the number of computations performed increases 4-fold.
To avoid this, I'd suggest using a hashtable or other dictionary type to hold each data set, and use the primary key from the input as the dictionary key. This way you get constant-time lookup of corresponding records, and the time complexity of your algorithm becomes near-linear (O(2n + k)):
$NewCSV = @{}
Import-Csv -Path ".\Data_A.csv" | ForEach-Object {
    $NewCSV[$_.ID] = $_
}
$OldCSV = @{}
Import-Csv -Path ".\Data_B.csv" | ForEach-Object {
    $OldCSV[$_.ID] = $_
}
Now that we can efficiently resolve each row by its ID, we can inspect the whole of the data sets with an independent loop over each:
foreach ($entry in $NewCSV.GetEnumerator()) {
    if (-not $OldCSV.ContainsKey($entry.Key)) {
        # $entry.Value is a new row, not seen in the old data set
        continue
    }
    $newRow = $entry.Value
    $oldRow = $OldCSV[$entry.Key]
    # do the individual comparison of the rows here
}
Do another loop like the one above, but with $NewCSV in place of $OldCSV, to detect deletions.
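A minimal sketch of that second pass, mirroring the loop above:
foreach ($entry in $OldCSV.GetEnumerator()) {
    if (-not $NewCSV.ContainsKey($entry.Key)) {
        # $entry.Value was deleted: present in the old data set but not in the new one
    }
}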

PS Object unescape character

I have a small error when running my code. I assign a string to a custom object, but the string is being parsed on its own and throws an error.
Code:
foreach ($item in $hrdblistofobjects) {
    [string]$content = Get-Content -Path $item
    [string]$content = $content.Replace("[", "").Replace("]", "")
    # here is line 43 which is shown in the error as well
    foreach ($object in $listofitemsdb) {
        $result = $content -match $object
        $OurObject = [PSCustomObject]@{
            ObjectName     = $null
            TestObjectName = $null
            Result         = $null
        }
        $OurObject.ObjectName = $item
        $OurObject.TestObjectName = $object # here is line 52 which is the other part of the error
        $OurObject.Result = $result
        $Resultsdb += $OurObject
    }
}
This code loads an item and checks whether an object exists within that item, basically whether one string part exists within another string part, and then saves the result to a variable. I am using this code for other objects and items, but they don't have that \p part, which I am assuming is the issue. I can't put $object into single quotes for obvious reasons (this was suggested on the internet, but in my case it's not possible). So is there any other option for how to escape \p? I tried $object.Replace("\PMS","\\PMS") but that did not work either (this was suggested somewhere too).
EDIT:
$Resultsdb = @(foreach ($item in $hrdblistofobjects) {
    [string]$content = Get-Content -Path $item
    [string]$content = $content.Replace("[", "").Replace("]", "")
    foreach ($object in $listofitemsdb) {
        [PSCustomObject]@{
            ObjectName     = $item
            TestObjectName = $object
            Result         = $content -match $object
        }
    }
})
$Resultsdb is not defined as an array, hence you get that error when you try to add one object to another object that doesn't implement the addition operator.
You shouldn't be appending to an array in a loop anyway. That will perform poorly, because with each iteration it creates a new array with the size increased by one, copies all elements from the existing array, puts the new item in the new free slot, and then replaces the original array with the new one.
A better approach is to just output your objects in the loop and collect the loop output in a variable:
$Resultsdb = foreach ($item in $hrdblistofobjects) {
...
foreach ($object in $listofitemsdb) {
[PSCustomObject]#{
ObjectName = $item
TestObjectName = $object
Result = $content -match $object
}
}
}
Run the loop in an array subexpression if you need to ensure that the result is an array; otherwise it will be empty or a single object when the loop returns fewer than two results.
$Resultsdb = @(foreach ($item in $hrdblistofobjects) {
    ...
})
Note that you need to suppress other output on the default output stream in the loop, so that it doesn't pollute your result.
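If you really do need to grow a collection imperatively, a resizable list avoids the copy-on-every-append cost of +=. A minimal sketch reusing the question's variable names:
$Resultsdb = [System.Collections.Generic.List[object]]::new()
foreach ($item in $hrdblistofobjects) {
    [string]$content = Get-Content -Path $item
    $content = $content.Replace("[", "").Replace("]", "")
    foreach ($object in $listofitemsdb) {
        # List.Add is cheap; it does not copy the whole collection like += does
        $Resultsdb.Add([PSCustomObject]@{
            ObjectName     = $item
            TestObjectName = $object
            Result         = $content -match $object
        })
    }
}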
I changed the match part to this and it's working fine: $result = $content -match $object.Replace("\PMS","\\PMS").
Sorry for errors in posting. I will amend that.
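For what it's worth, a more general fix than hand-escaping \PMS is to escape the whole pattern with [regex]::Escape, which neutralizes every regex metacharacter in $object so it is matched literally:
$result = $content -match [regex]::Escape($object)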

Is this the best way to replace text in all of an object's properties in powershell?

I have a large CSV file in which some fields have a newline embedded. Excel 2016 produces errors when importing a CSV whose rows have fields with an embedded newline.
Based on this post, I wrote code to replace any newline in any field with a space. Below is a code block that duplicates the functionality and the issue. Option 1 works. Option 2, which is commented out, casts my object to a string. I was hoping Option 2 might run faster.
Question: Is there a better way to do this to optimize for performance processing very large files?
$array = @([PSCustomObject]@{"ID"="1"; "Name"="Joe`nSmith"},
           [PSCustomObject]@{"ID"="2"; "Name"="Jasmine Baker"})
$array = $array | ForEach-Object {
    # Option 1: produces an object, but is the code optimized?
    foreach ($n in $_.PSObject.Properties.Name) {
        $_.PSObject.Properties[$n].Value = `
            $_.PSObject.Properties[$n].Value -replace "`n"," "
    }
    # Option 2: produces a string, not an object
    #$_ = $_ -replace "`n"," "
    $_
}
Keep in mind that in my real-world use case, each row has > 15 fields and any combination of them may have one or more new lines embedded.
Use the fast TextFieldParser to read, process, and build the CSV from the file (PowerShell 3+):
[Reflection.Assembly]::LoadWithPartialName('Microsoft.VisualBasic') >$null
$parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser 'r:\1.csv'
$parser.SetDelimiters(',')
$header = $parser.ReadFields()
$CSV = while (!$parser.EndOfData) {
    $i = 0
    $row = [ordered]@{}
    foreach ($field in $parser.ReadFields()) {
        $row[$header[$i++]] = $field.Replace("`n", ' ')
    }
    [PSCustomObject]$row
}
Or modify each field in-place in an already existing CSV array:
foreach ($row in $CSV) {
    foreach ($field in $row.PSObject.Properties) {
        $field.Value = $field.Value.Replace("`n", ' ')
    }
}
Notes:
foreach statement is much faster than piping to ForEach-Object (also aliased as foreach)
$stringVariable.Replace() is faster than the -replace operator; a quick timing sketch follows.
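The .Replace() vs -replace difference is easy to check with a micro-benchmark (numbers are illustrative and vary by PowerShell version):
$s = "Joe`nSmith"
(Measure-Command { foreach ($i in 1..100000) { $null = $s.Replace("`n", ' ') } }).TotalMilliseconds
(Measure-Command { foreach ($i in 1..100000) { $null = $s -replace "`n", ' ' } }).TotalMilliseconds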

Powershell: how to fetch a single column from a multi-dimensional array?

Is there a function, method, or language construct that allows retrieving a single column from a multi-dimensional array in PowerShell?
$my_array = @()
$my_array += ,@(1,2,3)
$my_array += ,@(4,5,6)
$my_array += ,@(7,8,9)
# I currently use this, and I want to find a better way:
foreach ($line in $my_array) {
    [array]$single_column += $line[1] # fetch column 1
}
# now $single_column contains only 2, 5 and 8
My final goal is to find non-duplicated values from one column.
Sorry, I don't think anything like that exists. I would go with:
@($my_array | foreach { $_[1] })
To quickly find unique values I tend to use the hashtable-keys hack:
$UniqueArray = @($my_array | foreach -Begin {
    $unique = @{}
} -Process {
    $unique.($_[1]) = $null
} -End {
    $unique.Keys
})
Obviously it has its limitations...
To extract one column:
$single_column = $my_array | foreach { $_[1] }
To extract any columns:
$some_columns = $my_array | foreach { ,@($_[2],$_[1]) } # any order
To find non-duplicated values from one column:
$unique_array = $my_array | foreach {$_[1]} | sort-object -unique
# caveat: the resulting array is sorted,
# so BartekB has a better solution if sorting is a problem
I tried @BartekB's solution and it worked for me. But for the unique part I did the following:
@($my_array | foreach { $_[1] } | select -Unique)
I am not very familiar with PowerShell, but I am posting this hoping it helps others, since it worked for me.