I have a pipe-delimited flat file from which I need to remove duplicate entries based on one field. A sample of the file:
"001A"|"1"|"*"||"A"|"504367667"|"1"|"2005-06-10-16.57.23.000000"|
"001A"|"1"|"*"||"A"|"504367667"|"1"|"2005-10-24-16.52.29.000000"|
"001A"|"1"|"*"||"A"|"504367667"|"1"|"2007-12-13-15.48.47.000000"|
"001A"|"1"|"*"||"A"|"504367667"|"1"|"2008-12-09-17.10.39.000000"|
"001B"|"1"|"*"||"B"|"800026800"|"1"|"2005-08-08-10.48.16.000000"|
"001C"|"1"|"*"||"C"|"490349139"|"1"|"2006-01-19-12.03.08.000000"|
"001C"|"1"|"*"||"C"|"490349139"|"1"|"2009-03-12-15.08.11.000000"|
The first field is the ID and the last field is a timestamp. I want to deduplicate the entries so that only the entry with the latest timestamp is kept for each ID. The output I need is:
"001A"|"1"|"*"||"A"|"504367667"|"1"|"2008-12-09-17.10.39.000000"|
"001B"|"1"|"*"||"B"|"800026800"|"1"|"2005-08-08-10.48.16.000000"|
"001C"|"1"|"*"||"C"|"490349139"|"1"|"2009-03-12-15.08.11.000000"|
I read the file into an array of objects with distinct property names, then tried:
$inputdeduped = $inputfilearray | Sort-Object Date
$inputdeduped = $inputdeduped | Select-Object ID -Unique
I hoped that, once the entries were sorted by date, -Unique (which I assumed works like the Get-Unique cmdlet) would keep either the first or the last of each set of duplicates in the sorted array, so that I could sort the date in ascending or descending order accordingly. It doesn't; it seems to pick an entry at random.
Please help me out, or help me understand how the Get-Unique cmdlet works.
You can try this:
$newInputdeduped = $inputfilearray | sort id, date -Descending | group -Property id |
    select @{n="list"; e={ $_.group | select -first 1 }} |
    select -expa list
This is what I do with your example data after saving it as a txt file:
> $a = Import-Csv -Header "id","n1","n2","v1","n3","n4","n5","date" -Path .\c.txt -delimiter '|'
> $a | ft -AutoSize
id   n1 n2 v1 n3 n4        n5 date
--   -- -- -- -- --        -- ----
001A 1  *     A  504367667 1  2005-06-10-16.57.23.000000
001A 1  *     A  504367667 1  2005-10-24-16.52.29.000000
001A 1  *     A  504367667 1  2007-12-13-15.48.47.000000
001A 1  *     A  504367667 1  2008-12-09-17.10.39.000000
001B 1  *     B  800026800 1  2005-08-08-10.48.16.000000
001C 1  *     C  490349139 1  2006-01-19-12.03.08.000000
001C 1  *     C  490349139 1  2009-03-12-15.08.11.000000
> $b = $a | sort id, date -Descending | group -Property id | select @{n="list";e={ $_.group | select -first 1 }} | select -expa list
> $b | ft -AutoSize
id   n1 n2 v1 n3 n4        n5 date
--   -- -- -- -- --        -- ----
001C 1  *     C  490349139 1  2009-03-12-15.08.11.000000
001B 1  *     B  800026800 1  2005-08-08-10.48.16.000000
001A 1  *     A  504367667 1  2008-12-09-17.10.39.000000
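For comparison, here is a minimal sketch of the same idea that sorts inside each group rather than before grouping. It assumes the timestamps are fixed-width, so that sorting them as plain strings gives chronological order. The column names are the same made-up headers used above, and the sample rows are inlined only so the sketch runs on its own.

```powershell
# Sample rows from the question, inlined so the sketch is self-contained.
$lines = @(
    '"001A"|"1"|"*"||"A"|"504367667"|"1"|"2005-06-10-16.57.23.000000"|'
    '"001A"|"1"|"*"||"A"|"504367667"|"1"|"2008-12-09-17.10.39.000000"|'
    '"001B"|"1"|"*"||"B"|"800026800"|"1"|"2005-08-08-10.48.16.000000"|'
    '"001C"|"1"|"*"||"C"|"490349139"|"1"|"2006-01-19-12.03.08.000000"|'
    '"001C"|"1"|"*"||"C"|"490349139"|"1"|"2009-03-12-15.08.11.000000"|'
)

# Hypothetical header names; the file itself has no header row.
$header = 'id', 'n1', 'n2', 'v1', 'n3', 'n4', 'n5', 'date'
$rows = $lines | ConvertFrom-Csv -Delimiter '|' -Header $header

# For every id, keep only the row with the greatest (latest) timestamp string.
$latest = $rows | Group-Object id | ForEach-Object {
    $_.Group | Sort-Object date -Descending | Select-Object -First 1
}

$latest | Format-Table -AutoSize
```

Sorting inside the group means the result does not depend on the order in which Group-Object emits the groups.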
I have a file with the columns transaction_date, transaction_amount and debit_credit_indicator. I want to write a program that shows the total count and total amount for each date.
The total amount is calculated as follows:
if debit_credit_indicator is 'C', add the amount; if it is 'D', subtract it.
I got as far as grouping by indicator but don't know how to proceed afterwards.
My output looks like this:
TRANSACTION_DATE DEBIT_CREDIT_INDICATOR TotalAmount Count
---------------- ---------------------- ----------- -----
2019-02-26 C 1478
2019-02-25 D 100
2019-02-26 D 200
param([string]$inputFileName=30)
(Get-Content $inputFileName) -replace '\|', ',' | Set-Content c:\learnpowershell\test.csv
$transactionData = Import-csv c:\learnpowershell\test.csv | Group-Object -Property TRANSACTION_DATE, DEBIT_CREDIT_INDICATOR
[Array] $newsbData += foreach($gitem in $transactionData)
{
    $gitem.group | Select -Unique TRANSACTION_DATE, DEBIT_CREDIT_INDICATOR, `
        @{Name = 'TotalAmount';Expression = {(($gitem.group) | measure -Property TRANSACTION_AMOUNT -sum).sum}},
        @{Name = 'Count';Expression = {(($gitem.group) | Measure-Object -count).count}}
};
write-output $newsbData
I suppose you replace '|' with ',' because you don't know about the -Delimiter option of Import-Csv; otherwise, keep your replace code. Here is my code for your problem:
# import and group by date
import-csv "c:\learnpowershell\test.csv" -Delimiter '|' | group TRANSACTION_DATE | %{
    $TotalCredit=0
    $TotalDebit=0
    $CountRowCredit=0
    $CountRowDebit=0
    $HasProblem=$false
    # calculation by date for every group
    $_.Group | %{
        if ($_.DEBIT_CREDIT_INDICATOR -EQ 'C')
        {
            $TotalCredit+=$_.transaction_amount
            $CountRowCredit++
        }
        elseif ($_.DEBIT_CREDIT_INDICATOR -EQ 'D')
        {
            $TotalDebit+=$_.transaction_amount
            $CountRowDebit++
        }
        else
        {
            # any indicator other than 'C' or 'D' flags the whole date
            $HasProblem=$true
        }
    }
    # output result
    [pscustomobject]@{
        TRANSACTION_DATE=$_.Name
        CountRow=$_.Count
        Credit_Total=$TotalCredit
        Credit_CountRow=$CountRowCredit
        Debit_Total=-$TotalDebit
        Debit_CountRow=$CountRowDebit
        Total_DebitCredit=$TotalCredit - $TotalDebit
        HasProblem=$HasProblem
    }
}
You can append ' | Format-Table' if you want the result printed as a formatted table.
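If only the count and the signed total per date are needed, as the question asks, a shorter sketch could look like the following. The sample rows are made up for illustration (they are not the asker's real file), and the amounts are cast to [decimal] on the assumption that they are plain numbers.

```powershell
# Hypothetical pipe-delimited sample with a header row.
$csv = @'
TRANSACTION_DATE|TRANSACTION_AMOUNT|DEBIT_CREDIT_INDICATOR
2019-02-25|100|D
2019-02-26|1478|C
2019-02-26|200|D
'@

$summary = $csv | ConvertFrom-Csv -Delimiter '|' | Group-Object TRANSACTION_DATE | ForEach-Object {
    $net = 0
    foreach ($row in $_.Group) {
        # CSV fields are strings, so convert before doing arithmetic.
        $amount = [decimal]$row.TRANSACTION_AMOUNT
        if ($row.DEBIT_CREDIT_INDICATOR -eq 'C') { $net += $amount }      # credit: add
        elseif ($row.DEBIT_CREDIT_INDICATOR -eq 'D') { $net -= $amount }  # debit: subtract
    }
    [pscustomobject]@{
        TRANSACTION_DATE = $_.Name
        Count            = $_.Count
        TotalAmount      = $net
    }
}

$summary | Format-Table -AutoSize
```

Rows with an indicator other than 'C' or 'D' are counted but do not change the total in this sketch; the HasProblem flag in the answer above is the place to catch them.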
Quick question. I have the following:
$domain = "my.new.domain.com"
$domain.Split('.')[0,1]
...which returns the value:
my
new
That's great except I need the LAST TWO (domain.com) and am unsure how to do that. Unfortunately the number of splits is variable (e.g. test.my.new.domain.com). How does one say "go to the end and count X splits backwards"?
To take last N elements of an array, you can use either of the following options:
$array | select -Last n
$array[-n..-1] (← '..' is the Range Operator)
Example
$domain = "my.new.domain.com"
$domain.Split('.') | select -Last 2
Will result in:
domain
com
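Since the question asks for the single string domain.com rather than two separate elements, the selected parts still have to be joined again. A minimal sketch using the range-operator form and -join (the variable names $parts and $lastTwo are just for illustration):

```powershell
$domain = "test.my.new.domain.com"

# Split on dots, take the last two elements, and glue them back together.
$parts   = $domain.Split('.')
$lastTwo = $parts[-2..-1] -join '.'

$lastTwo
```

The same works with the pipeline form: ($domain.Split('.') | select -Last 2) -join '.'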
Note
Using the select cmdlet, you can do some jobs that you usually do using LINQ in .NET, for example:
Take first N elements: $array | select -First N
Take last N elements: $array | select -Last N
Skip first N elements: $array | select -Skip N
Skip last N elements: $array | select -SkipLast N
Skip and take from first: $array | select -Skip N -First M
Skip and take from last: $array | select -Skip N -Last M
Select distinct elements: $array | select -Unique
Select elements at index: $array | select -Index (0,2,4)
I have a CSV file with the following columns:
Error_ID
Date
hh (hour in two digit)
Error description
It looks like this:
In SQL it was very easy:
SELECT X,Y,Count(1)
FROM #Table
GROUP BY X,Y
In PowerShell it's a bit different.
The Group-Object cmdlet allows grouping by multiple properties:
Import-Csv 'C:\path\to\your.csv' | Group-Object ErrorID, Date
which will give you a result like this:
Count Name          Group
----- ----          -----
    3 1, 15/07/2016 {@{ErrorID=1; Date=15/07/2016; Hour=16}, @{ErrorID=1; Da...
    1 2, 16/07/2016 {@{ErrorID=2; Date=16/07/2016; Hour=9}}
However, to display the grouped values in tabular form, as an SQL query would, you need to extract them from the groups with calculated properties:
Import-Csv 'C:\path\to\your.csv' | Group-Object ErrorID, Date |
    Select-Object @{n='ErrorID';e={$_.Group[0].ErrorID}},
                  @{n='Date';e={$_.Group[0].Date}}, Count
which will produce output like this:
ErrorID Date       Count
------- ----       -----
1       15/07/2016     3
2       16/07/2016     1
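As a variation, the grouping values can also be read from the Values property of each group object instead of from the first group member. The sketch below inlines a small made-up CSV (only the first Hour value comes from the output above; the others are invented) so that it runs on its own.

```powershell
# Made-up sample data shaped like the CSV in the question.
$rows = @'
ErrorID,Date,Hour
1,15/07/2016,16
1,15/07/2016,17
1,15/07/2016,18
2,16/07/2016,9
'@ | ConvertFrom-Csv

# Rough equivalent of: SELECT ErrorID, Date, COUNT(1) ... GROUP BY ErrorID, Date
$counts = $rows | Group-Object ErrorID, Date | ForEach-Object {
    [pscustomobject]@{
        ErrorID = $_.Values[0]   # first grouping property
        Date    = $_.Values[1]   # second grouping property
        Count   = $_.Count
    }
}

$counts | Format-Table -AutoSize
```

Values holds the grouping keys in the order the properties were passed to Group-Object, so the indexes must match that order.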
You can use the following:
$csv = import-csv path/to/csv.csv
$csv | group-object errorid
Count Name Group
----- ---- -----
    2 1    {@{errorID=1; time=15/7/2016; description=bad}, @{errorID=1; time=15/8/2016; description=wow}}
    1 3    {@{errorID=3; time=15/7/2016; description=worse}}
    1 5    {@{errorID=5; time=15/8/2016; description=the worst}}
$csv | where {$_.errorid -eq "1"}
errorID time description
------- ---- -----------
1 15/7/2016 bad
1 15/8/2016 wow
You can combine the first and second examples in one pipeline to get the desired result.
If I have a hashtable $states = @{ 1 = 15; 2 = 5; 3 = 41 }, the result shows:
Name Value
---- -----
3 41
2 5
1 15
I used $states.GetEnumerator() | sort value -Descending | select -Last 1 to find the minimum value that I need.
The result is:
Name Value
---- -----
2 5
However, I cannot use the value (5) in a new variable to do a calculation, because the result contains both the name and the value. Is there any way to get only the minimum value from the result?
Use the .Values property from the beginning:
$states.Values | Sort-Object -Descending | Select-Object -Last 1
Or expand the .Value property:
$states.GetEnumerator() | sort value -Descending | select -Last 1 -ExpandProperty Value
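Another option that avoids sorting altogether is Measure-Object with -Minimum. This is only a sketch; $min and $doubled are illustrative variable names, not something from the question.

```powershell
$states = @{ 1 = 15; 2 = 5; 3 = 41 }

# Measure-Object returns a measurement object; its Minimum property is the bare number.
$min = ($states.Values | Measure-Object -Minimum).Minimum

# The value can now be used in a calculation.
$doubled = $min * 2

# If the matching key is needed as well, sort ascending and take the first entry.
$minEntry = $states.GetEnumerator() | Sort-Object Value | Select-Object -First 1
```

Here $minEntry.Name is the key and $minEntry.Value the minimum value.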
Imagine the following hash:
$h=@{}
$h.Add(1,'a')
$h.Add(2,'b')
$h.Add(3,'c')
$h.Add(4,'d')
$h.Add(5,'a')
$h.Add(6,'c')
What query would return the 2 duplicate values 'a' and 'c' ?
Basically, I am looking for the PowerShell equivalent of the following SQL query (assuming the table h(c1,c2)):
select c1
from h
group by c1
having count(*) > 1
You could try this:
$h.GetEnumerator() | Group-Object Value | ? { $_.Count -gt 1 }
Count Name Group
----- ---- -----
2 c {System.Collections.DictionaryEntry, System.Collections.DictionaryEntry}
2 a {System.Collections.DictionaryEntry, System.Collections.DictionaryEntry}
If you store the results, you can dig into a group to get the key names of the duplicate entries. For example:
$a = $h.GetEnumerator() | Group-Object Value | ? { $_.Count -gt 1 }
# Check the first group (the one with 'c' as value)
$a[0].Group
Name Value
---- -----
6 c
3 c
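To get every duplicated value together with its keys in one pass, the groups can be projected into small objects. This is a sketch, not the only way to do it; the property name DuplicateKeys is made up for the example.

```powershell
$h = @{ 1 = 'a'; 2 = 'b'; 3 = 'c'; 4 = 'd'; 5 = 'a'; 6 = 'c' }

# Group the entries by value, keep groups with more than one member,
# and list the keys that share each value.
$dupes = $h.GetEnumerator() | Group-Object Value | Where-Object { $_.Count -gt 1 } | ForEach-Object {
    [pscustomobject]@{
        Value         = $_.Name
        DuplicateKeys = @($_.Group | ForEach-Object { $_.Key } | Sort-Object)
    }
}

$dupes | Format-Table -AutoSize
```

The keys are sorted only to make the output stable, since a hashtable does not guarantee enumeration order.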
You can use another hash table:
$h=@{}
$h.Add(1,'a')
$h.Add(2,'b')
$h.Add(3,'c')
$h.Add(4,'d')
$h.Add(5,'a')
$h.Add(6,'c')
$h1=@{}
$h.GetEnumerator() | foreach { $h1[$_.Value] += @($_.name) }
$h1.GetEnumerator() | where { $_.value.count -gt 1}
Name Value
---- -----
c {6, 3}
a {5, 1}
This answers a slightly different question:
How to list the duplicate items of a PowerShell array
The solution is similar to the one from Frode F.:
$Duplicates = $Array | Group | ? {$_.Count -gt 1} | Select -ExpandProperty Name