Powershell Help: How can I remove duplicates (using multiple columns simultaneously, not sequentially)?

I have tried several different variations based on some other stack overflow articles, but I will share a sample of what I have and a sample output and then some cobbled-together code hoping for some direction from the community:
C:\Scripts\contacts.csv:
id,first_name,last_name,email
1,john,smith,jsmith@notreal.com
1,jane,smith,jsmith@notreal.com
2,jane,smith,jsmith@notreal.com
2,john,smith,jsmith@notreal.com
3,sam,jones,sjones@notreal.com
3,sandy,jones,sandy@notreal.com
I need to turn this into a file where each email appears only once per id. In other words, duplicate addresses are allowed, but only under different ids.
Desired output, C:\Scripts\contacts-trimmed.csv:
id,first_name,last_name,email
1,john,smith,jsmith@notreal.com
2,john,smith,jsmith@notreal.com
3,sam,jones,sjones@notreal.com
3,sandy,jones,sandy@notreal.com
I have tried this with a few different variations:
Import-Csv C:\Scripts\contacts.csv | sort first_name | Sort-Object -Property id,email -Unique | Export-Csv C:\Scripts\contacts-trim.csv -NoTypeInformation
Any help or direction would be most appreciated

You'll want to use the Group-Object cmdlet, to, well, group together records with similar values:
$records = @'
id,first_name,last_name,email
1,john,smith,jsmith@notreal.com
1,jane,smith,jsmith@notreal.com
2,jane,smith,jsmith@notreal.com
2,john,smith,jsmith@notreal.com
3,sam,jones,sjones@notreal.com
3,sandy,jones,sandy@notreal.com
'@ |ConvertFrom-Csv
# group records based on id and email column
$records |Group-Object id,email |ForEach-Object {
    # grab only the first record from each group
    $_.Group |Select-Object -First 1
} |Export-Csv .\no_duplicates.csv -NoTypeInformation

Related

Sort CSV powershell script delete duplicate, keep the one with a special value in 3rd column

How do I delete duplicate entries in a CSV by one column and keep the one that has a special value in another column?
Example: I got a csv with
Name;Employeenumber;Accessrights
Max;123456;ReadOnly
Berta;133556;Write
Jhonny;161771;ReadOnly
Max;123456;Write
I want to end up with:
Name;Employeenumber;Accessrights
Max;123456;Write
Berta;133556;Write
Jhonny;161771;ReadOnly
I tried Get-Content with Select-Object -Unique, but that does not solve the problem: for duplicated entries it should keep the row whose Accessrights value is "Write".
So I have no clue at all.

You can use a combination of sorting and grouping:
@'
Name;Employeenumber;Accessrights
Max;123456;ReadOnly
Berta;133556;Write
Jhonny;161771;ReadOnly
Max;123456;Write
'@ |
    ConvertFrom-Csv -Delimiter ';' |
    Sort-Object -Property Name, Accessrights -Descending |
    Group-Object -Property Name |
    ForEach-Object {
        $_.Group[0]
    }

Using Powershell, how can I export and delete csv rows, where a particular value is not found in a different csv?

I have two files. One is called allper.csv
institutiongroup,studentid,iscomplete
institutionId=22343,123,FALSE
institutionId=22343,456,FALSE
institutionId=22343,789,FALSE
The other one is called actswithpersons.csv
abc,123;456
def,456
ghi,123
jkl,123;456
Note: The actswithpersons.csv does not have headers - they are going to be added in later via an excel power query so don't want them in there now. The actswithpersons csv columns are delimited with commas - there are only two columns, and the second one contains multiple personids - again Excel will deal with this later.
I want to remove all rows from allper.csv where the personid doesn't appear in actswithpersons.csv, and export them to another csv. So in the desired outcome, allper.csv would look like this
institutiongroup,studentid,iscomplete
institutionId=22343,123,FALSE
institutionId=22343,456,FALSE
and the export.csv would look like this
institutiongroup,studentid,iscomplete
institutionId=22343,789,FALSE
I've got as far as the below, which will put into the shell whether the personid is found in the actswithpersons.csv file.
$donestuff = (Get-Content .\ActsWithpersons.csv | ConvertFrom-Csv); $ids=(Import-Csv .\allper.csv);foreach($id in $ids.personid) {echo $id;if($donestuff -like "*$id*" )
{
echo 'Contains String'
}
else
{
echo 'Does not contain String'
}}
However, I'm not sure how to go the last step, and export & remove the unwanted rows from allper.csv
I've tried (among many things)
$donestuff = (Get-Content .\ActsWithpersons.csv | ConvertFrom-Csv);
Import-Csv .\allper.csv |
Where-Object {$donestuff -notlike $_.personid} |
Export-Csv -Path export.csv -NoTypeInformation
This took a really long time and left me with an empty csv. So, if you can give any guidance, please help.

Since your actswithpersons.csv doesn't have headers, in order for you to import as csv, you can specify the -Header parameter in either Import-Csv or ConvertFrom-Csv; with the former cmdlet being the better solution.
With that said, you can use any header name for those 2 columns then filter by the given column name (ID in this case) after your import of allper.csv using Where-Object:
$awp = (Import-Csv -Path '.\actswithpersons.csv' -Header 'blah','ID').ID.Split(';')
Import-Csv -Path '.\allper.csv' | Where-Object -Property 'Studentid' -notin $awp
This should give you:
institutiongroup    studentid iscomplete
----------------    --------- ----------
institutionId=22343 789       FALSE
If you'd rather use Get-Content, you can split each line on both delimiters, , and ;. That produces one flat list of values ($awp), which you can then filter against exactly as above to get the same results:
$awp = (Get-Content -Path '.\actswithpersons.csv') -split ",|;"
Import-Csv -Path '.\allper.csv' | Where-Object -Property 'Studentid' -notin $awp
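The question also asked for the unmatched rows to be written to export.csv and removed from allper.csv, which the filter above does not do on its own. The following is a minimal sketch of that last step, assuming both files fit comfortably in memory. The scratch folder and the variable names $dir, $all, $keep and $drop are illustrative and not from the original answer; the sample files are recreated so the snippet runs on its own.

```powershell
# Work in a scratch folder so the sketch does not touch real files (assumption).
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null

# Recreate the two sample files from the question.
@'
institutiongroup,studentid,iscomplete
institutionId=22343,123,FALSE
institutionId=22343,456,FALSE
institutionId=22343,789,FALSE
'@ | Set-Content -Path (Join-Path $dir 'allper.csv')

@'
abc,123;456
def,456
ghi,123
jkl,123;456
'@ | Set-Content -Path (Join-Path $dir 'actswithpersons.csv')

# Collect every value that appears anywhere in actswithpersons.csv.
$awp = (Get-Content -Path (Join-Path $dir 'actswithpersons.csv')) -split ',|;'

# Read allper.csv completely before overwriting it, then split the rows in two.
$all  = @(Import-Csv -Path (Join-Path $dir 'allper.csv'))
$keep = @($all | Where-Object -Property studentid -In $awp)
$drop = @($all | Where-Object -Property studentid -NotIn $awp)

# Write the removed rows to export.csv and the remaining rows back to allper.csv.
$drop | Export-Csv -Path (Join-Path $dir 'export.csv') -NoTypeInformation
$keep | Export-Csv -Path (Join-Path $dir 'allper.csv') -NoTypeInformation
```

Loading allper.csv into $all first matters: piping Import-Csv straight into an Export-Csv that targets the same file would try to write the file while it is still being read.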

Powershell - Import-CSV Group-Object SUM a number from grouped objects and then combine all grouped objects to single rows

I have a question similar to this one but with a twist:
Powershell Group Object in CSV and exporting it
My file has 42 existing headers. The delimiter is a standard comma, and there are no quotation marks in this file.
master_account_number,sub,txn,cur,last,first,address,address2,city,state,zip,ssn,credit,email,phone,cell,workphn,dob,chrgnum,cred,max,allow,neg,plan,downpayment,pmt2,min,clid,cliname,owner,merch,legal,is_active,apply,ag,offer,settle_perc,min_pay,plan2,lstpmt,orig,placedate
The file's data (the first 6 columns) looks like this:
master_account_number,sub,txn,cur,last,first
001,12,35,50.25,BIRD, BIG
001,34,47,100.10,BIRD, BIG
002,56,9,10.50,BUNNY, BUGS
002,78,3,20,BUNNY, BUGS
003,54,7,250,DUCK, DAFFY
004,44,88,25,MOUSE, JERRY
I am only working with the first column master_account_number and the 4th column cur.
I want to check for duplicates in the "master_account_number" column. If any are found, I want to add up the values from the 4th column, "cur", for only those duplicates, and then combine the rows we just summed into one. The summed value from the duplicates should replace the cur value in the combined row.
With that said, our output should look like this:
master_account_number,sub,txn,cur,last,first
001,12,35,150.35,BIRD, BIG
002,56,9,30.50,BUNNY, BUGS
003,54,7,250,DUCK, DAFFY
004,44,88,25,MOUSE, JERRY
Now that we have that out the way, here is how this question differs. I want to keep all 42 columns intact in the out-put file. In the other question I referenced above, the input was 5 columns and the out-put was 4 columns and this is not what I'm trying to achieve. I have so many more headers, I'd hate to have specify individually all 42 columns. That seems inefficient anyhow.
As for what I have so far for code... not much.
$revNB = "\\server\path\example.csv"
$global:revCSV = import-csv -Path $revNB | ? {$_.is_active -eq "Y"}
$dupesGrouped = $revCSV | Group-Object master_account_number | Select-Object @{Expression={ ($_.Group|Measure-Object cur -Sum).Sum }}
Ultimately I want the output to look identical to the input, only the output should merge duplicate account numbers rows, and add all the "cur" values, where the merged row contains the sum of the grouped cur values, in the cur field.
Last update: I tried Rich's solution and got an error. I modified what he had to this: $dupesGrouped = $revCSV | Group-Object master_account_number | Select-Object Name, @{Name='curSum'; Expression={ ($_.Group | Measure-Object cur -Sum).Sum}}
And this gets me exactly what my own code got me so I am still looking for a solution. I need to output this CSV with all 42 headers. Even for items with no duplicates.
Other things I've tried:
This doesn't give me the data I need in the columns, the columns are there but they are blank.
$dupesGrouped = $revCSV | Group-Object master_account_number | Select-Object @{ expression={$_.Name}; label='master_account_number' },
sub_account_number,
charge_txn,
@{Name='current_balance'; Expression={ ($_.Group | Measure-Object current_balance -Sum).Sum },
last,
}

You're pretty close, but you used current_balance where you probably meant cur.
Here's a start:
$dupesGrouped = $revCSV | Group-Object master_account_number |
    Select-Object Name, @{N='curSum'; E={ ($_.Group | Measure-Object cur -Sum).Sum }},
        @{N='last'; E={ ($_.Group | Select-Object last -First 1).last }}
You can add the other fields by adding Name;Expression hashtables for each of the fields you want to summarize. I assumed you would want to select the first occurrence of repeated last name for the same master_account_number. The output will be incorrect if the last name differs for the same master_account_number.

In the case of changing only part of the data, there is also the following way.
$dupesGrouped = $revCSV | Group-Object master_account_number | ForEach-Object {
    # copy the first data in order not to change original data
    $new = $_.Group[0].psobject.Copy()
    # update the value of cur property
    $new.cur = ($_.Group | Measure-Object cur -Sum).Sum
    # output
    $new
}
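To check that this copy-and-overwrite approach keeps every column, here is a small sketch run against the six sample columns from the question. It assumes the fields carry no space after the comma (the question's sample shows "BIRD, BIG"; the space is dropped here), and the $merged and $row names are illustrative only.

```powershell
# Sample rows from the question (first six columns only, spaces after commas removed).
$revCSV = @'
master_account_number,sub,txn,cur,last,first
001,12,35,50.25,BIRD,BIG
001,34,47,100.10,BIRD,BIG
002,56,9,10.50,BUNNY,BUGS
002,78,3,20,BUNNY,BUGS
003,54,7,250,DUCK,DAFFY
004,44,88,25,MOUSE,JERRY
'@ | ConvertFrom-Csv

$merged = $revCSV | Group-Object master_account_number | ForEach-Object {
    # Clone the first row of the group so the imported data is left untouched.
    $row = $_.Group[0].psobject.Copy()
    # Replace cur with the total for the whole group.
    $row.cur = ($_.Group | Measure-Object cur -Sum).Sum
    $row
}

# Every original column is still present; only cur has changed.
$merged | ConvertTo-Csv -NoTypeInformation
```

Because each output row is a copy of an input row, the column list never has to be spelled out, which is what makes this practical for a 42-column file. Swapping the last line for Export-Csv writes the result to disk.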

Select specific column based on data supplied using Powershell

I have a csv file that may have unknown headers, one of the columns will contain email addresses for example.
Is there a way to select only the column that contains the email addresses and save it as a list to a variable?
One csv could have the header say email, another could say emailaddresses, another could say email addresses another file might not even have the word email in the header. As you can see, the headers are different. So I want to be able to detect the correct column first and use that data further in the script. Once the column is identified based on the data it contains, select that column only.
I've tried the where-object and select-string cmdlets. With both, the output is the entire array and not just the data in the column I am wanting.
$CSV = import-csv file.csv
$CSV | Where {$_ -like "*@domain.com"}
This outputs the entire array as all rows will contain this data.

Sample Data for visualization
id,first_name,bagel,last_name
1,Base,bcruikshank0@homestead.com,Cruikshank
2,Regan,rbriamo1@ebay.co.uk,Briamo
3,Ryley,rsacase2@mysql.com,Sacase
4,Siobhan,sdonnett3@is.gd,Donnett
5,Patty,pesmonde4@diigo.com,Esmonde
Bagel is obviously what we are trying to find. And we will play pretend in that we have no knowledge of the columns name or position ahead of time.
Find column dynamically
# Import the CSV
$data = Import-CSV $path
# Take the first row and get its columns
$columns = $data[0].psobject.properties.name
# Cycle the columns to find the one that has an email address for a row value
# Use a VERY crude regex to validate an email address.
$emailColumn = $columns | Where-Object{$data[0].$_ -match ".+@.+\..+"}
# Example of using the found column(s) to display data.
$data | Select-Object $emailColumn
Basically, read in the CSV as normal and use the first row's data to work out which column holds the email addresses. One caveat: if more than one column matches, all of them are returned.
To enforce a single result, pipe to Select-Object -First 1. Then you just have to hope the first one is the "right" one.
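As a sketch of that last point, the snippet below keeps only the first matching column and then expands it into a plain list of strings, which is what the question asked to store in a variable. It uses three of the sample rows inline; the pattern '.+@.+\..+' is a deliberately crude stand-in for real email validation, and the $emails variable name is illustrative.

```powershell
# Three of the sample rows, inline so the snippet runs on its own.
$data = @'
id,first_name,bagel,last_name
1,Base,bcruikshank0@homestead.com,Cruikshank
2,Regan,rbriamo1@ebay.co.uk,Briamo
3,Ryley,rsacase2@mysql.com,Sacase
'@ | ConvertFrom-Csv

# Column names, taken from the first row.
$columns = $data[0].psobject.Properties.Name

# Keep only the first column whose first-row value looks like an email address.
$emailColumn = $columns |
    Where-Object { $data[0].$_ -match '.+@.+\..+' } |
    Select-Object -First 1

# Expand that column into a plain list of strings instead of one-property objects.
$emails = $data | Select-Object -ExpandProperty $emailColumn
```

Select-Object $emailColumn would return objects with a single property; -ExpandProperty returns the bare values, which is usually easier to work with later in a script.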

If you're using Import-Csv, the result is a PSCustomObject.
$CsvObject = Import-Csv -Path 'C:\Temp\Example.csv'
$Header = ($CsvObject | Get-Member | Where-Object { $_.Name -like '*email*' }).Name
$CsvObject.$Header
This filters for the header containing email, then selects that column from the object.
Edit for requirement:
$Str = @((Get-Content -Path 'C:\Temp\Example.csv') -like '*@domain.com*')
$Headers = @((Get-Content -Path 'C:\Temp\Example.csv' -TotalCount 1) -split ',')
$Str | ConvertFrom-Csv -Delimiter ',' -Header $Headers

Other method:
$PathFile = "c:\temp\test.csv"
$columnName = $null
$content = Get-Content $PathFile
foreach ($item in $content)
{
    $SplitRow = $item -split ','
    # index of the first field in this row that looks like an email address
    $Cpt = 0..($SplitRow.Count - 1) | where {$SplitRow[$_] -match ".+@.+\..+"} | select -first 1
    # compare against $null so that a match in column 0 is not treated as "not found"
    if ($null -ne $Cpt)
    {
        $columnName = ($content[0] -split ',')[$Cpt]
        break
    }
}
if ($columnName)
{
    Import-Csv $PathFile | select $columnName
}
else
{
    "No email column found"
}

How to merge two csv files with PowerShell

I have two .CSV files that contain information about the employees from where I work. The first file (ActiveEmploye.csv) has approximately 70 different fields with 2500 entries. The other one (EmployeEmail.csv) has four fields (FirstName, LastName, Email, FullName) and 700 entries. Both files have the FullName field in common and it is the only thing I can use to compare each file. What I need to do is to add the Email address (from EmployeEmail.csv) to the corresponding employees in ActiveEmploye.csv. And, for those who don't have email address, the field can be left blank.
I tried to use a Join-Object function I found on the internet a few days ago, but the first CSV file contains far too many fields for the function to handle.
Any suggestions are highly appreciated!

There are probably several ways to skin this cat. I would try this one:
Import-Csv EmployeeEmail.csv | ForEach-Object -Begin {
    $Employees = @{}
} -Process {
    $Employees.Add($_.FullName,$_.email)
}
Import-Csv ActiveEmployees.csv | ForEach-Object {
    $_ | Add-Member -MemberType NoteProperty -Name email -Value $Employees."$($_.FullName)" -PassThru
} | Export-Csv -NoTypeInformation Joined.csv

Here is an alternative to @BartekB's method, which I used to join multiple fields to the left and which gave me better processing times.
Import-Csv EmployeeEmail.csv | ForEach-Object -Begin {
    $Employees = @{}
} -Process {
    $Employees.Add($_.FullName,$_)
}
Import-Csv ActiveEmployees.csv |
    Select *,@{Name="email";Expression={$Employees."$($_.FullName)"."email"}} |
    Export-Csv Joined.csv -NoTypeInformation
This looks up each FullName as a key in the $Employees hashtable and then reads a named property (here, email) from the row stored under that key.
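One thing to watch with both versions is that .Add() throws if the same FullName appears twice in the email file. The sketch below shows the same hashtable join on a few made-up rows, using index assignment so that a repeated name overwrites the earlier entry instead of raising an error. All names and addresses here are hypothetical sample data, not from the question, and the Department column stands in for the roughly 70 real fields.

```powershell
# Hypothetical stand-in for EmployeeEmail.csv.
$emailRows = @'
FirstName,LastName,Email,FullName
Ann,Lee,ann.lee@example.com,Ann Lee
Bob,Ray,bob.ray@example.com,Bob Ray
'@ | ConvertFrom-Csv

# Hypothetical stand-in for ActiveEmployees.csv; Cy Poe has no email entry.
$activeRows = @'
FullName,Department
Ann Lee,Sales
Cy Poe,IT
Bob Ray,HR
'@ | ConvertFrom-Csv

# Index the email rows by FullName. Assignment overwrites on a repeated name
# instead of throwing the way .Add() does.
$Employees = @{}
foreach ($row in $emailRows) { $Employees[$row.FullName] = $row }

# Append an email column. A missing key yields $null, which exports as an empty field.
$joined = $activeRows |
    Select-Object *, @{ Name = 'email'; Expression = { $Employees[$_.FullName].Email } }
```

Whether overwriting is the right behavior for repeated names depends on the data; if duplicates signal a problem, keeping .Add() and letting it fail may be preferable.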
