I feel like this shouldn't be that hard, but I'm having trouble coming up with a data structure that will give me what I want. I have a large amount of data and I need to find the instances where there are multiple Secondary Identifiers, as defined below.
Primary Identifier,Secondary Identifier,Fruit
11111,1,apple
11111,1,pear
22222,1,banana
22222,1,grapefruit
33333,1,apple
33333,1,pear
33333,2,apple
33333,2,orange
That might not be a great example to use - but basically only two of the columns matter. What I'd really like is to return the Primary Identifiers where the unique count of Secondary Identifiers is greater than 1. So I'm thinking maybe a hashtable would be my best bet, but I tried doing something in a pipeline-oriented way and failed, so I'm wondering if there is an easier method or cmdlet that I haven't tried.
The final array (or hashtable) would be something like this:
ID Count of Secondary ID
----- ---------------------
11111 1
22222 1
33333 2
At that point, getting the instances of multiple would be as easy as $array | Where-Object {$_."Count of Secondary ID" -gt 1}
If this example sucks or what I'm after doesn't make sense, let me know and I can rewrite it; but it's almost like I need an implementation of Select-Object -Unique that would allow you to use two or more input objects/columns. Basically the same as Excel's remove duplicates and then selecting which headers to include - except there are too many rows to open the file in Excel.
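(For what it's worth, the two-column de-duplication on its own could probably be handled with something like the following, assuming the CSV is imported into $data - but it still wouldn't give me the counts:)

# Hypothetical: de-duplicate on only the two columns that matter
$data | Sort-Object 'Primary Identifier', 'Secondary Identifier' -Unique |
    Select-Object 'Primary Identifier', 'Secondary Identifier'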
Use Group-Object twice - first to group the objects by common Primary Identifier, then use Group-Object again to count the number of distinct Secondary Identifiers within each group:
$data = @'
Primary Identifier,Secondary Identifier,Fruit
11111,1,apple
11111,1,pear
22222,1,banana
22222,1,grapefruit
33333,1,apple
33333,1,pear
33333,2,apple
33333,2,orange
'@ |ConvertFrom-Csv
$data |Group-Object 'Primary Identifier' |ForEach-Object {
    [pscustomobject]@{
        # Primary Identifier value will be the name of the group, since that's what we grouped by
        'Primary Identifier' = $_.Name
        # Use `Group-Object -NoElement` to count unique values - you could also use `Sort-Object -Unique`
        'Count of distinct Secondary Identifiers' = @($_.Group |Group-Object 'Secondary Identifier' -NoElement).Count
    }
}
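With the sample data above, that should produce output roughly like this (exact column alignment may differ):

Primary Identifier Count of distinct Secondary Identifiers
------------------ ----------------------------------------
11111              1
22222              1
33333              2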
I have two CSV files. The first CSV is Card Data, which holds about 30,000 records and contains the card's name, UUID, and price (which is currently empty). The second CSV is Pricing Data, which holds around 50,000 records and contains UUID and some pricing information for that specific UUID.
These are two separate CSV files that are generated elsewhere.
For each record in Card Data CSV I am taking the UUID and finding the corresponding UUID in the Pricing Data CSV using the Where-Object function in PowerShell. This is so I can find the pricing information for the respective card and run that through a pricing algorithm to generate a price for each record in the Card Data CSV.
At the moment it seems to take around 1 second per record in the Card Data CSV file, and with 30,000 records to process it would take over 8 hours to run through. Is there a better, more efficient way to perform this task?
Code:
Function Calculate-Price ([float]$A, [float]$B, [float]$C) {
    #Pricing Algorithm
    ....
    $Card.'Price' = $CCPrice
}
$PricingData = Import-Csv "$Path\Pricing.csv"
$CardData = Import-Csv "$Update\Cards.csv"
Foreach ($Card In $CardData) {
    $PricingCard = $PricingData | Where-Object { $_.UUID -eq $Card.UUID }
    . Calculate-Price -A $PricingCard.'A-price' -B $PricingCard.'B-price' -C $PricingCard.'C-price'
}
$CardData | Select "Title","Price","UUID" |
Export-Csv -Path "$Update\CardsUpdated.csv" -NoTypeInformation
The first CSV is Card Data, which holds about 30,000 records
The second CSV is Pricing Data, which holds around 50,000 records
No wonder it's slow: you're evaluating the expression $_.UUID -eq $Card.UUID roughly 1,500,000,000 times (that's 1.5 BILLION, or 1500 MILLION) - that already sounds pretty compute-heavy, and we haven't even considered the overhead of the pipeline having to bind input arguments to Where-Object the same number of times.
Instead of using the array of objects returned by Import-Csv directly, use a hashtable to "index" the records in the data set you need to search, by the property that you're joining on later!
$PricingData = Import-Csv "$Path\Pricing.csv"
$CardData = Import-Csv "$Update\Cards.csv"
$PricingByUUID = @{}
$PricingData |ForEach-Object {
    # Let's index the price cards using their UUID value
    $PricingByUUID[$_.UUID] = $_
}
Foreach ($Card In $CardData) {
    # No need to search through the whole set anymore
    $PricingCard = $PricingByUUID[$Card.UUID]
    . Calculate-Price -A $PricingCard.'A-price' -B $PricingCard.'B-price' -C $PricingCard.'C-price'
}
Under the hood, hashtables (and most other dictionary types in .NET) are implemented so that they have extremely fast, constant-time lookup/retrieval performance - which is exactly the kind of thing you want in this situation!
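If you want to convince yourself of the difference, a rough (unscientific) timing comparison along these lines should show it - $CardData, $PricingData and $PricingByUUID are the variables from the snippet above, and the numbers will vary with your data:

# Compare a full linear scan with Where-Object against a hashtable lookup
# for a single UUID taken from the card data.
$sampleUUID = $CardData[0].UUID

Measure-Command {
    $PricingData | Where-Object { $_.UUID -eq $sampleUUID }   # scans all ~50,000 rows
}

Measure-Command {
    $PricingByUUID[$sampleUUID]                               # single hash lookup
}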
I am trying to read in a large CSV with millions of rows for testing. I know that I can treat the CSV as a database using the provider Microsoft.ACE.OLEDB.12.0
Using a small data set I am able to read the row contents positionally using .GetValue(int). I am having a tough time finding a better way to read the data (assuming there even is one). If I know the column names beforehand this is easy. However, if I didn't know them I would have to read in the first line of the file to get that data, which seems silly.
#"
id,first_name,last_name,email,ip_address
1,Edward,Richards,erichards0#businessweek.com,201.133.112.30
2,Jimmy,Scott,jscott1#clickbank.net,103.231.149.144
3,Marilyn,Williams,mwilliams2#chicagotribune.com,52.180.157.43
4,Frank,Morales,fmorales3#google.ru,218.175.165.205
5,Chris,Watson,cwatson4#ed.gov,75.251.1.149
6,Albert,Ross,aross5#abc.net.au,89.56.133.54
7,Diane,Daniels,ddaniels6#washingtonpost.com,197.156.129.45
8,Nancy,Carter,ncarter7#surveymonkey.com,75.162.65.142
9,John,Kennedy,jkennedy8#tumblr.com,85.35.177.235
10,Bonnie,Bradley,bbradley9#dagondesign.com,255.67.106.193
"# | Set-Content .\test.csv
$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source='C:\Users\Matt';Extended Properties='Text;HDR=Yes;FMT=Delimited';")
$cmd=$conn.CreateCommand()
$cmd.CommandText="Select * from test.csv where first_name like '%n%'"
$conn.open()
$data = $cmd.ExecuteReader()
$data | ForEach-Object{
    [pscustomobject]@{
        id=$_.GetValue(0)
        first_name=$_.GetValue(1)
        last_name=$_.GetValue(2)
        ip_address=$_.GetValue(4)
    }
}
$cmd.Dispose()
$conn.Dispose()
Is there a better way to deal with the output from $cmd.ExecuteReader()? I'm finding it hard to get information on importing from a CSV. Most of the web deals with exporting to CSV from a SQL database using this provider. The logic here would be applied to a large CSV so that I don't need to read the whole thing in just to ignore most of the data.
I should have looked more closely on TechNet at the OleDbDataReader class. There are a few methods and properties that help make sense of the data returned from the SQL statement.
FieldCount: Gets the number of columns in the current row.
So if nothing else you know how many columns your rows have.
Item[Int32]: Gets the value of the specified column in its native format given the column ordinal.
Which I can use to pull back the data from each row. This appears to work the same as GetValue().
GetName(Int32): Gets the name of the specified column.
So if you don't know what the column is named this is what you can use to get it from a given index.
There are many other methods and some properties, but those are enough to shed light if you are not sure what data is contained within a CSV (assuming you don't want to manually verify beforehand). So, knowing that, a more dynamic way to get the same information would be...
$data | ForEach-Object{
    # Save the current row as its own object so that it can be used in other scopes
    $dataRow = $_
    # Blank hashtable that will be built into a "row" object
    $properties = @{}
    # For every field that exists we will add its name and value to the hashtable
    0..($dataRow.FieldCount - 1) | ForEach-Object{
        $properties.($dataRow.GetName($_)) = $dataRow.Item($_)
    }
    # Send the newly created object down the pipeline.
    [pscustomobject]$properties
}
$cmd.Dispose()
$conn.Dispose()
The only downside of this is that the columns will likely not be output in the same order as the originating CSV. That can be addressed by saving the column names in a separate variable and using a Select-Object at the end of the pipeline. This answer was mostly about making sense of the column names and values returned.
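For example, an untested sketch of that variation of the loop above:

# Capture the column names from the first row while building the objects,
# then re-select them at the end so the output order matches the source CSV.
$columnNames = @()
$rows = $data | ForEach-Object {
    $dataRow = $_
    if ($columnNames.Count -eq 0) {
        $columnNames = 0..($dataRow.FieldCount - 1) | ForEach-Object { $dataRow.GetName($_) }
    }
    $properties = @{}
    0..($dataRow.FieldCount - 1) | ForEach-Object {
        $properties.($dataRow.GetName($_)) = $dataRow.Item($_)
    }
    [pscustomobject]$properties
}
# Re-order the properties using the saved column names
$rows | Select-Object -Property $columnNames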
I created a table in a PowerShell script with some data in it. I need to find a way to do something like a replace on a column: for every value (x) I find in the column, change it to (y). How is this possible in PowerShell?
Thanks in advance. Also, I could not find anything like this on Google, and it has to be done after the table is already built, not while building the table's columns and rows. Thanks!
I'm not sure exactly what you mean by a table. However, assuming you are referring to a collection of objects, it's simple:
$collectionToUpdate | Where-Object { $_.PropertyToCheck -eq $valueToCheck } | ForEach-Object { $_.PropertyToCheck = $replacementValue }
Obviously, replace the names of the variables and property with the correct values from your code.
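For example, with a small made-up collection where the Fruit column plays the role of PropertyToCheck, changing every value 'x' to 'y' looks like this:

# Made-up sample data imported from CSV text
$table = @'
Name,Fruit
Alice,x
Bob,pear
Carol,x
'@ | ConvertFrom-Csv

# Replace every 'x' in the Fruit column with 'y'
$table | Where-Object { $_.Fruit -eq 'x' } | ForEach-Object { $_.Fruit = 'y' }
$table | Format-Table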
I have a question: I want to store data in a variable with three columns and then process it. I looked at an example with hash tables, which seemed great, but I need three columns, and I want to be able to run queries against it when it has, say, 100 rows.
What's the best way of doing this?
Example
You can create custom objects, each with three properties. That will give you the three-column output. If you have V3 you can create custom objects using a hashtable like so:
$obj = [pscustomobject]@{Name='John';Age=42;Hobby='Music'}
PS> $obj | ft -auto
Name Age Hobby
---- --- -----
John 42 Music
If you are on V2 you can create these objects with New-Object:
$obj = new-object psobject -Property @{Name='John';Age=42;Hobby='Music'}
I would create an array or collection of custom PS objects, each having 3 properties, then use the PowerShell comparison operators on that array/collection to do my queries - see the short sketch after the help topics below.
see:
Get-Help about_object_creation
Get-Help about_comparison_operators
Get-Help Where-Object
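A minimal sketch of that approach, with made-up names and data:

# Build a collection of custom objects, each with three properties
$people = 1..100 | ForEach-Object {
    [pscustomobject]@{
        Name  = "Person$_"
        Age   = 20 + ($_ % 50)
        Hobby = 'Music'
    }
}

# Query the collection using Where-Object and the comparison operators
$people | Where-Object { $_.Age -gt 60 }
$people | Where-Object { $_.Name -like '*5' -and $_.Age -le 30 }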
I'm searching for the best way to store lists associated with a key in a key-value database (like BerkeleyDB or leveldb).
For example:
I have users and orders from user to user
I want to store a list of order IDs for each user, for fast access with range selects (for pagination).
How should I store this structure?
I don't want to store it in a serialized format for each user:
user_1_orders = serialize(1,2,3..)
user_2_orders = serialize(1,2,3..)
because the list can be long.
I've thought about a separate DB file for each user, with the order IDs stored as keys in it, but this doesn't solve the range-select problem. What if I want to get user IDs in the range [5000:5050]?
I know about Redis, but I'm interested in a key-value implementation like BerkeleyDB or leveldb.
Let's start with a single list. You can work with a single hashmap:
store in row 0 the count of the user's orders
for each new order store a new row with the count incremented
So your hashmap looks like the following:
key | value
-------------
0 | 5
1 | tomato
2 | celery
3 | apple
4 | pie
5 | meat
Steadily incrementing the key makes sure that every key is unique. Given that the db is key-ordered and that the pack function translates integers into byte arrays that sort correctly, you can fetch slices of the list. To fetch orders between 5000 and 5050 you can use bsddb's Cursor.set_range or leveldb's createReadStream (JS API).
Now let's expand to multiple users' orders. If you can open several hashmaps, you can apply the above with one hashmap per user. But you may hit some system limits (maximum number of open file descriptors, or maximum number of files per directory), so instead you can share a single hashmap between several users.
What I explain in the following works for both leveldb and bsddb, provided you pack keys correctly so that they sort in lexicographic (byte) order. So I will assume that you have a pack function. In bsddb you have to build the pack function yourself. Have a look at wiredtiger.packing or bytekey for inspiration.
The principle is to namespace the keys using the user's id. It's also called key composition.
Say your database looks like the following:
key | value
-------------------
1 | 0 | 2 <--- count column for user 1
1 | 1 | tomato
1 | 2 | orange
... ...
32 | 0 | 1 <--- count column for user 32
32 | 1 | banana
... | ...
You create this database with the following (pseudo) code:
db.put(pack(1, make_uid(1)), 'tomato')
db.put(pack(1, make_uid(1)), 'orange')
...
db.put(pack(32, make_uid(32)), 'banana')
make_uid implementation looks like this:
def make_uid(user_uid):
    # retrieve the current count
    counter_key = pack(user_uid, 0)
    value = db.get(counter_key)
    value += 1  # increment
    # save new count
    db.put(counter_key, value)
    return value
Then you have to do the correct range lookup; it's similar to the single-list case, but using the composite key. Using the bsddb API's cursor.set_range(key), we retrieve all items between 5000 and 5050 for user 42:
def user_orders_slice(user_id, start, end):
    key, value = cursor.set_range(pack(user_id, start))
    while True:
        user_id, order_id = unpack(key)
        if order_id > end:
            break
        else:
            # the value is probably packed somehow...
            yield value
        key, value = cursor.next()
No error checks are done. Among other things, slicing with user_orders_slice(42, 5000, 5050) is not guaranteed to return 51 items if you delete items from the list. A correct way to query, say, 50 items is to implement a user_orders_query(user_id, start, limit).
I hope you get the idea.
You can use Redis to store the list in a zset (sorted set), like this:
// this line is called whenever a user places an order
$redis->zadd($user_1_orders, time(), $order_id);
// list orders of the user
$redis->zrange($user_1_orders, 0, -1);
Redis is fast enough. But one thing you should know about Redis is that it stores all data in memory, so if the data eventually exceeds the physical memory, you have to shard the data on your own.
Also, you can use SSDB (https://github.com/ideawu/ssdb), which is a wrapper around leveldb with APIs similar to Redis's, but it stores most data on disk and uses memory only for caching. That means SSDB's capacity is 100 times that of Redis - up to TBs.
One way you could model this in a key-value store which supports scans, like leveldb, would be to add the order id to the key for each user. So the new keys would be userId_orderId for each order. Now, to get the orders for a particular user, you can do a simple prefix scan - scan(userId*). This makes the userId range query slow, though; in that case you can maintain another table just for userIds, or use another key convention: Id_userId for getting userIds between [5000-5050].
Recently I have seen hyperdex adding data types support on top of leveldb : ex: http://hyperdex.org/doc/04.datatypes/#lists , so you could give that a try too.
In BerkeleyDB you can store multiple values per key, in either sorted or unsorted order. This would be the most natural solution. LevelDB has no such feature. You should look into LMDB (http://symas.com/mdb/) though; it also supports sorted multi-value keys, and it is smaller, faster, and more reliable than either of the others.