I have two CSV files. The first CSV is Card Data, which holds about 30,000 records and contains the card's name, UUID, and price (which is currently empty). The second CSV is Pricing Data, which holds around 50,000 records and contains UUID and some pricing information for that specific UUID.
These are two separate CSV files that are generated elsewhere.
For each record in Card Data CSV I am taking the UUID and finding the corresponding UUID in the Pricing Data CSV using the Where-Object function in PowerShell. This is so I can find the pricing information for the respective card and run that through a pricing algorithm to generate a price for each record in the Card Data CSV.
At the moment it seems to take around 1 second per record in the Card Data CSV file, and with 30,000 records to process it would take over 8 hours to run through. Is there a better, more efficient way to perform this task?
Code:
Function Calculate-Price ([float]$A, [float]$B, [float]$C) {
#Pricing Algorithm
....
$Card.'Price' = $CCPrice
}
$PricingData = Import-Csv "$Path\Pricing.csv"
$CardData = Import-Csv "$Update\Cards.csv"
Foreach ($Card In $CardData) {
$PricingCard = $PricingData | Where-Object { $_.UUID -eq $Card.UUID }
. Calculate-Price -A $PricingCard.'A-price' -B $PricingCard.'B-price' -C $PricingCard.'C-price'
}
$CardData | Select "Title","Price","UUID" |
Export-Csv -Path "$Update\CardsUpdated.csv" -NoTypeInformation
The first CSV is Card Data, which holds about 30,000 records
The second CSV is Pricing Data, which holds around 50,000 records
No wonder it's slow: you're evaluating the expression $_.UUID -eq $Card.UUID roughly 30,000 × 50,000 = 1.5 billion times - that alone is compute-heavy, and we haven't even considered the overhead of the pipeline binding input arguments to Where-Object the same number of times.
Instead of using the array of objects returned by Import-Csv directly, use a hashtable to "index" the records in the data set you need to search, by the property that you're joining on later!
$PricingData = Import-Csv "$Path\Pricing.csv"
$CardData = Import-Csv "$Update\Cards.csv"
$PricingByUUID = @{}
$PricingData |ForEach-Object {
# Let's index the price cards using their UUID value
$PricingByUUID[$_.UUID] = $_
}
Foreach ($Card In $CardData) {
# No need to search through the whole set anymore
$PricingCard = $PricingByUUID[$Card.UUID]
. Calculate-Price -A $PricingCard.'A-price' -B $PricingCard.'B-price' -C $PricingCard.'C-price'
}
Under the hood, hashtables (and most other dictionary types in .NET) are implemented in a way so that they have extremely fast constant-time lookup/retrieval performance - which is exactly the kind of thing you want in this situation!
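If you want to see the difference for yourself, here's a minimal sketch (reusing $CardData, $PricingData and $PricingByUUID from above, and timing only a small sample, since the full linear scan would take hours) that compares the two lookups with Measure-Command:
# Compare both lookup strategies on the first 100 cards only
$sample = $CardData | Select-Object -First 100
$linear = Measure-Command {
    Foreach ($Card In $sample) {
        $null = $PricingData | Where-Object { $_.UUID -eq $Card.UUID }
    }
}
$indexed = Measure-Command {
    Foreach ($Card In $sample) {
        $null = $PricingByUUID[$Card.UUID]
    }
}
'Where-Object scan: {0:n2}s  hashtable lookup: {1:n2}s' -f $linear.TotalSeconds, $indexed.TotalSeconds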
Related
I feel like this shouldn't be that hard, but I'm having trouble getting a data structure in mind that will give me what I want. I have a large amount of data and I need to find the instances where there are multiple Secondary Identifiers as defined below.
Primary Identifier,Secondary Identifier,Fruit
11111,1,apple
11111,1,pear
22222,1,banana
22222,1,grapefruit
33333,1,apple
33333,1,pear
33333,2,apple
33333,2,orange
That might not be a great example to use - but basically only two of the columns matter. What I'd really like is to return the Primary Identifiers where the unique count of Secondary Identifiers is greater than 1. So I'm thinking maybe a HashTable would be my best bet, but I tried doing something in a pipeline-oriented way and failed, so I'm wondering if there is an easier method or Cmdlet that I haven't tried.
The final array (or hashtable) would be something like this:
ID Count of Secondary ID
----- ---------------------
11111 1
22222 1
33333 2
At that point, getting the instances of multiple would be as easy as $array | Where-Object {$_."Count of Secondary ID" -gt 1}
If this example sucks or what I'm after doesn't make sense, let me know and I can rewrite it; but it's almost like I need an implementation of Select-Object -Unique that would allow you to use two or more input objects/columns. Basically the same as Excel's Remove Duplicates and then selecting which headers to include - except there are too many rows to open in Excel.
Use Group-Object twice - first to group the objects by common Primary Identifier, then use Group-Object again to count the number of distinct Secondary Identifiers within each group:
$data = @'
Primary Identifier,Secondary Identifier,Fruit
11111,1,apple
11111,1,pear
22222,1,banana
22222,1,grapefruit
33333,1,apple
33333,1,pear
33333,2,apple
33333,2,orange
'@ |ConvertFrom-Csv
$data |Group-Object 'Primary Identifier' |ForEach-Object {
[pscustomobject]@{
# Primary Identifier value will be the name of the group, since that's what we grouped by
'Primary Identifier' = $_.Name
# Use `Group-Object -NoElement` to count unique values - you could also use `Sort-Object -Unique`
'Count of distinct Secondary Identifiers' = @($_.Group |Group-Object 'Secondary Identifier' -NoElement).Count
}
}
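If you assign the output of that pipeline to a variable, say $summary, pulling out the IDs you're actually after is just the filter you already described:
$summary | Where-Object { $_.'Count of distinct Secondary Identifiers' -gt 1 } |
    Select-Object -ExpandProperty 'Primary Identifier'
With the sample data above this returns only 33333.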
I am doing two separate SQL queries on separate databases/connections in a PowerShell script. The goal is to export the results of both requests into a single CSV file.
What I am doing now is:
# Create a data table for Clients
$ClientsTable = new-object "System.Data.DataTable"
# Create text commands
$ClientsCommand1 = $connection1.CreateCommand()
$ClientsCommand1.CommandText = $ClientsQuery1
$ClientsCommand2 = $connection2.CreateCommand()
$ClientsCommand2.CommandText = $ClientsQuery2
# Get Clients results
$ClientsResults1 = $ClientsCommand1.ExecuteReader()
$ClientsResults2 = $ClientsCommand2.ExecuteReader()
# Load Clients in data table
$ClientsTable.Load($ClientsResults1)
$ClientsTable.Load($ClientsResults2)
# Export Clients data table to CSV
$ClientsTable | export-csv -Encoding UTF8 -NoTypeInformation -delimiter ";" "C:\test\clients.csv"
where $connection1 and $connection2 are open System.Data.SqlClient.SqlConnection objects.
Both requests work fine and both output data with exactly the same columns names. If I export the 2 results sets to 2 separate CSV files, all is fine.
But loading the results in the data table as above fails with the following message:
Failed to enable constraints. One or more rows contain values violating non-null, unique, or foreign-key constraints.
If instead I switch the order in which I load data into the data tables, like
$ClientsTable.Load($ClientsResults2)
$ClientsTable.Load($ClientsResults1)
(load second results set before the first one), then the error goes away and my CSV is generated without any problem with the data from the 2 requests. I cannot think of why appending data in one way, or the other, would trigger this error, or work fine.
Any idea?
I'm skeptical that reversing the order works. More likely, it's doing something like appending to the CSV file that was already created from the first attempt.
It is possible, though, that different primary key definitions in the original data could produce the results you're seeing. DataTable.Load() can do unexpected things when pulling data from an additional source: it will try to MERGE the data rather than simply append it, using different matching strategies depending on the overload and arguments. If the primary key from one of the result sets causes nothing to match and no records to merge, while the primary key from the other matches everything, that might explain it.
If you just want to append the results, what you want to do instead is Load() the first result into the DataTable, export it to CSV, clear the table, load the second result into the table, and then export again in append mode.
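A rough sketch of that sequence, reusing the readers from the question (note that Export-Csv -Append requires PowerShell 3.0 or later):
$ClientsTable = New-Object System.Data.DataTable
# Load and export the first result set, then empty the table (the schema is kept)
$ClientsTable.Load($ClientsResults1)
$ClientsTable | Export-Csv -Encoding UTF8 -NoTypeInformation -Delimiter ";" "C:\test\clients.csv"
$ClientsTable.Clear()
# Load the second result set and append it to the same file
$ClientsTable.Load($ClientsResults2)
$ClientsTable | Export-Csv -Encoding UTF8 -NoTypeInformation -Delimiter ";" "C:\test\clients.csv" -Append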
I'm mostly just looking to be pointed in the right direction so I can piece it together myself. I have a decent amount of batch file scripting experience. I'm a PS noob but I think PS would be better for the project below.
We have software which requires the client ID to be part of the install string (along with switches, usr/pass, other switches, logging paths, etc).
I've created a batch file (hundreds, actually) which I execute with PsExec on remote machines. This does work, but it's unwieldy to maintain: the only change in each is the client ID.
What I'm attempting to do is have a CSV with 2 columns as input (so I just have to maintain the CSV): machine name (as presented by %hostname%) and client ID. I want to create a script which matches %hostname% to a corresponding row in column 1, reads the data in column 2 of the same row, and then lets me call that as a variable in the install string.
E.G.
If my CSV has bobs-pc in column 1, row 6, then insert the data from column 2, row 6 (let's call it 0006) in the following install string:
install.exe /client_ID=0006
No looping: I don't want it to install on all machines simultaneously, due to the multiple time zones we operate in.
Something like this would be really useful for many projects I have so I'm more interested in learning than having anyone write it for me.
I understand I should be using Import-Csv. I've created a sample csv and can get certain fields to print out in PS. What I need is for a script to be able to insert those fields as variables in the install string.
Sounds like you want something along the lines of this (assumes your CSV has a header row of col1 and col2):
$hostname = 'server1'
$value = Import-CSV myfile.csv | where { $_.col1 -eq $hostname } | select -expandproperty col2
Install.exe /client_id=$value
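In a fuller script you would typically take the hostname from the environment and invoke the installer with the call operator; a hedged sketch (the CSV file name, column names and switch are illustrative):
# $env:COMPUTERNAME is what %hostname% expands to in batch
$hostname = $env:COMPUTERNAME
$clientId = Import-Csv .\clients.csv |
    Where-Object { $_.col1 -eq $hostname } |
    Select-Object -ExpandProperty col2
# The call operator (&) runs the installer with the ID substituted into the switch
& .\install.exe "/client_ID=$clientId"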
I am trying to read in a large CSV with millions of rows for testing. I know that I can treat the CSV as a database using the provider Microsoft.ACE.OLEDB.12.0
Using a small data set I am able to read the row contents positionally using .GetValue(int). I am having a tough time finding a better way to read the data (assuming there even is one). If I know the column names beforehand this is easy. However, if I didn't know them I would have to read in the first line of the file to get that data, which seems silly.
#"
id,first_name,last_name,email,ip_address
1,Edward,Richards,erichards0#businessweek.com,201.133.112.30
2,Jimmy,Scott,jscott1#clickbank.net,103.231.149.144
3,Marilyn,Williams,mwilliams2#chicagotribune.com,52.180.157.43
4,Frank,Morales,fmorales3#google.ru,218.175.165.205
5,Chris,Watson,cwatson4#ed.gov,75.251.1.149
6,Albert,Ross,aross5#abc.net.au,89.56.133.54
7,Diane,Daniels,ddaniels6#washingtonpost.com,197.156.129.45
8,Nancy,Carter,ncarter7#surveymonkey.com,75.162.65.142
9,John,Kennedy,jkennedy8#tumblr.com,85.35.177.235
10,Bonnie,Bradley,bbradley9#dagondesign.com,255.67.106.193
"# | Set-Content .\test.csv
$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source='C:\Users\Matt';Extended Properties='Text;HDR=Yes;FMT=Delimited';")
$cmd=$conn.CreateCommand()
$cmd.CommandText="Select * from test.csv where first_name like '%n%'"
$conn.open()
$data = $cmd.ExecuteReader()
$data | ForEach-Object{
[pscustomobject]@{
id=$_.GetValue(0)
first_name=$_.GetValue(1)
last_name=$_.GetValue(2)
ip_address=$_.GetValue(4)
}
}
$cmd.Dispose()
$conn.Dispose()
Is there a better way to deal with the output from $cmd.ExecuteReader()? I'm finding it hard to get information about importing from a CSV; most of the web deals with exporting to CSV from a SQL database using this provider. The logic here would be applied to a large CSV so that I don't need to read the whole thing in just to ignore most of the data.
I should have looked closer on TechNet at the OleDbDataReader class. There are a few methods and properties that help in understanding the data returned from the SQL statement.
FieldCount: Gets the number of columns in the current row.
So if nothing else you know how many columns your rows have.
Item[Int32]: Gets the value of the specified column in its native format given the column ordinal.
Which I can use to pull back the data from each row. This appears to work the same as GetValue().
GetName(Int32): Gets the name of the specified column.
So if you don't know what the column is named this is what you can use to get it from a given index.
There are many other methods and some properties, but those are enough to shed light if you are not sure what data is contained within a CSV (assuming you don't want to verify manually beforehand). So, knowing that, a more dynamic way to get the same information would be...
$data | ForEach-Object{
# Save the current row as its own object so that it can be used in other scopes
$dataRow = $_
# Blank hashtable that will be built into a "row" object
$properties = @{}
# For every field that exists we will add its name and value to the hashtable
0..($dataRow.FieldCount - 1) | ForEach-Object{
$properties.($dataRow.GetName($_)) = $dataRow.Item($_)
}
# Send the newly created object down the pipeline.
[pscustomobject]$properties
}
$cmd.Dispose()
$conn.Dispose()
The only downside of this is that the columns will likely not be output in the same order as the originating CSV. That can be addressed by saving the column names in a separate variable and using a Select-Object at the end of the pipe, as in the sketch below. This answer was mostly about making sense of the column names and values returned.
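A possible sketch of that fix, assuming the same $data reader as above: read the column names from the reader's metadata first, then reapply that order at the end of the pipeline.
# Column names, in their original order, straight from the reader's metadata
$columnNames = 0..($data.FieldCount - 1) | ForEach-Object { $data.GetName($_) }
$data | ForEach-Object{
    $dataRow = $_
    $properties = @{}
    0..($dataRow.FieldCount - 1) | ForEach-Object{
        $properties.($dataRow.GetName($_)) = $dataRow.Item($_)
    }
    [pscustomobject]$properties
} | Select-Object $columnNames    # restores the original column order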
I created a table in a PowerShell script with some data in it. I need to find a way to do something like a replace in a column: for every value (x) I find in the column, change it to (y). How is this possible in PowerShell?
Thanks in advance. Also, I could not find anything like this on Google, and it has to be done after the table is already built, not while building the table columns and rows.
I'm not sure exactly what you mean by a table. However, assuming you are referring to a collection of objects, it's simple:
$collectionToUpdate | Where-Object { $_.PropertyToCheck -eq $valueToCheck } | ForEach-Object { $_.PropertyToCheck = $replacementValue }
Obviously, replace the names of the variables and property with the correct values from your code.
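For example, with a small illustrative collection (the property name and values are made up):
$table = @'
Name,Status
srv01,x
srv02,ok
srv03,x
'@ | ConvertFrom-Csv
# Change every Status of "x" to "y" after the table has been built
$table | Where-Object { $_.Status -eq 'x' } | ForEach-Object { $_.Status = 'y' }
$table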