Alternative way to remove duplicates from CSV other than Sort-Object -unique?

Alternative way to remove duplicates from CSV other than Sort-Object -unique? - powershell

I have a bug I cannot beat. When I run my script gets to this chunk of code it is incorrectly removing unique values:
import-csv "$LocalPath\A1-$abbrMonth$Year.csv" |
where {$_."CustomerName" -match $Customersregex} |
select "SubmitterID","SubmitterName","JobDate","JobTime",#{Name="Form";Expression={if ($_.FormName -match "Copy"){"C"};if ($_.FormName -match "Letter"){"L"} else {""} }},"TotalDocs",#{Name="AddnPages";Expression={$_.TotalAdditionalPages}},"InputFilename",#{Name="ActualDocs";Expression={[string]([int]$_.RegularDocs + [int]$_.UnqualifiedDocs)}}|
sort "InputFilename" -Unique |
export-csv "$LocalPath\A2-$abbrMonth$Year.csv" -NoTypeInformation
It's occurring during the "sort "InputFilename" -Unique" line, however it will work properly when I cut it up and execute it line by line, but not in the original script.
Is there any other way to remove duplicates based on the value of a column? I've tried using the "-unique" parameter on the Select-Object statement but I can't find a way to limit it to only one column.
EDIT: To clarify the issue I'm having, I have a LARGE list of accounting data. I'm trying to remove duplicate entries by using "Sort -unique". After the above code is running, there are entries missing that should not be because they are unique. I can isolate them in their own CSV, run the above code and all entries are present that should be, however when I run my master CSV file through the above code (and only that code, nothing else) and search for those entries they are missing.
EDIT 2: Looks like it was an issue with the data file. Good grief.

You can always group things, then expand the first item in the group. It's not fast, but it works for what you're doing.
import-csv "$LocalPath\A1-$abbrMonth$Year.csv" |
where {$_."CustomerName" -match $Customersregex} |
group InputFilename |
% { $_.Group[0] } |
select "SubmitterID","SubmitterName","JobDate","JobTime",#{Name="Form";Expression={if ($_.FormName -match "Copy"){"C"};if ($_.FormName -match "Letter"){"L"} else {""} }},"TotalDocs",#{Name="AddnPages";Expression={$_.TotalAdditionalPages}},"InputFilename",#{Name="ActualDocs";Expression={[string]([int]$_.RegularDocs + [int]$_.UnqualifiedDocs)}}|
sort "InputFilename" |
export-csv "$LocalPath\A2-$abbrMonth$Year.csv" -NoTypeInformation

Related

Create Csv with loop and output

This basically works
foreach ($cprev in $CopyPreventeds) {
Write-Host ("prevented copy $(($cprev)."Name")")
$cprev | Select-Object Path, Name, Length, LastWrite, DestinationNewer | Export-Csv '.\prevented.csv' -NoTypeInformation
}
But only the last output is written to the csv. How could I write all contents to a new csv with an output at the same time for the user in PowerShell.

Maybe I'm missing something?
While I appreciate a solution has already been proposed in the comments, I have to ask, given the narrow scope of the question why are we using an obscure, albeit clever technique? And/or, repeatedly invoking Export-Csv...
The question doesn't mention sparing a variable. Moreover, There doesn't appear to be a need for the ForEach loop.
$CopyPreventeds |
Select-Object Path, Name, Length, LastWrite, DestinationNewer |
Export-Csv '.\prevented.csv' -NoTypeInformation
In the above $CopyPreventeds already exists and remains so, unmolested after the export. You would need only to output it again for the benefit of an interactive user. All taking advantage of PowerShell's intuitive pipeline and features.
Moreover, since the iteration variable $cprev isn't needed you are still less one variable.
Note: You don't need -Append because you are streaming into a single Export-Csv command, as opposed to repeatedly invoking it.
There are at least 2 ways (probably many more) you could conveniently output to an interactive user.
1: Echo a header, something like "The following copies were prevented:" then echo the variable $CopyPreventeds, presumably to a table.
Note: That given multiple points at which you seem only interested in a subset of properties. You may think about trimming those objects beforehand:
$CopyPreventeds =
$CopyPreventeds |
Select-Object Path, Name, Length, LastWrite, DestinationNewer
$CopyPreventeds | Export-Csv '.\prevented.csv' -NoTypeInformation
Write-Host "The following copies were prevented:"
$CopyPreventeds | Format-Table -AutoSize | Out-Host
Note: More than 4 Properties in a [PSCustomObject] (resulting from Select-Object) where custom formatting hasn't been defined will by default output as a list, so use Format-Table to overcome that. Out-Host is then used to prevent pipeline pollution.
2: Return to using a ForEach-Object Loop for the output between the Select-Object and the Export-Csv command.
$CopyPreventeds |
Select-Object Path, Name, Length, LastWrite, DestinationNewer
ForEach-Object{
"Prevented Copy : {0}, {1}, {2}, {3}, {4}" -f $_.Path, $_.Name, $_.Length, $_.LastWrite, $_.DestinationNewer |
Write-Host
$_
} |
Export-Csv '.\prevented.csv' -NoTypeInformation
In this example, when you are done outputting to the screen (admittedly a little messy), you emit $_ from the loop, thus piping it to Export-Csv just the same.
Note: there are a number of ways to construct strings, I choose to use the -f operator here because it's a little cleaning than imbedding numerous $() sub expressions. And, of course this assume you want to prefix on every line Which I personally think is gratuitous, so I'd choose something more like #1..

Powershell "if more than one, then delete all but one"

Is there a way to do something like this in Powershell:
"If more than one file includes a certain set of text, delete all but one"
Example:
"...Cam1....jpg"
"...Cam2....jpg"
"...Cam2....jpg"
"...Cam3....jpg"
Then I would want one of the two "...Cam2....jpg" deleted, while the other one should stay.
I know that I can use something like
gci *Cam2* | del
but I don't know how I can make one of these files stay.
Also, for this to work, I need to look through all the files to see if there are any duplicates, which defeats the purpose of automating this process with a Powershell script.
I searched for a solution to this for a long time, but I just can't find something that is applicable to my scenario.

Get a list of files into a collection and use range operator to select a subset of its elements. To remove all but first element, start from index one. Like so,
$cams = gci "*cam2*"
if($cams.Count -gt 1) {
$cams[1..$cams.Count] | remove-item
}

Expanding on the idea of commenter boxdog:
# Find all duplicately named files.
$dupes = Get-ChildItem c:\test -file -recurse | Group-Object Name | Where-Object Count -gt 1
# Delete all duplicates except the 1st one per group.
$dupes | ForEach-Object { $_.Group | Select-Object -Skip 1 | Remove-Item -Force }
I've split this up into two sub tasks to make it easier to understand. Also it is a good idea to always separate directory iteration from file deletion, to avoid inconsistent results.
First statement uses Group-Object to group files by names. It outputs a Count property containing the number of files per group. Then Where-Object is used to get only groups that contain more than one file, which will be the dupes. The result is stored in variable $dupes, which is an array that looks like this:
Count Name Group
----- ---- -----
2 file1.txt {C:\test\subdir1\file1.txt, C:\test\subdir2\file1.txt}
2 file2.txt {C:\test\subdir1\file2.txt, C:\test\subdir2\file2.txt}
The second statement uses ForEach-Object to iterate over all groups of duplicates. From the Group-Object call of the 1st statement we got a Group property that contains an array of file informations. Using Select-Object -Skip 1 we select all but the 1st element of this array, which are passed to Remove-Item to delete the files.

From a CSV file get the file header and a portion of the file based on starting and ending line number parameters using PowerShell

So I have a very huge CSV file, the first line has the column headers. I want to keep the first line as a header and add a portion of the file from the file's mid-section or perhaps the end. I'm also trying to select only a few of the columns from the file. And finally, it would be great if the solution also changed the file delimiter from a comma to a tab.
I'm aiming for a solution that's a one-liner or perhaps 2?
Non-working Code version 30 ...
Get-Content -Tail 100 filename.csv | Export-Csv -Delimiter "`t" -NoTypeInformation -Path .\filename_out.csv
I'm trying to get a better grip on PowerShell. So far, so good but I'm not quite there yet. But trying to solve such challenges are helping me (and hopefully others) build a good collection of coding idioms. (FYI - the boss is trying PowerShell due to our efforts so.)

OK thanks to iRon tip. Import-CSV defaults to comma separated, the Select-Object -Property get the columns I want, the select -Last gets the last 200 rows, and the Export-CSV changes the delimiter to a tab:
Import-Csv iarf.csv |
Select-Object -Property Id,Name,RecordTypeId,CreatedDate |
select -Last 200 |
Export-Csv -Delimiter "`t" -NoTypeInformation -Path .\iarf100props6.csv

iRon provided the crucial pointer: Using Import-Csv rather than Get-Content allows you to retrieve arbitrary ranges from the original file as objects, if selected via Select-Object, and exporting these objects again via Export-Csv automatically includes a header line whose column names are the input objects' property names, as initially derived from the input file's header line.
In order to select an arbitrary range of rows, combine Select-Object's -Skip and -First parameters:
To only get rows from the beginning, use just -First $count:
To only get rows from the end, use just -Last $count
To get rows in a given range, use just -Skip $startRowMinus1 -First $rangeRowCount
For instance, the following command extracts rows 10 through 30:
Import-Csv iarf.csv |
Select-Object -Property Id,Name,RecordTypeId,CreatedDate -Skip 9 -First 20 |
Export-Csv -Delimiter "`t" -NoTypeInformation -Path .\iarf100props6.csv

Comparing first line in multiple CSV

I have a folder containing about 130 .csv files that all appear to contain similar fields (column names). However, based on some of the file names, I am under the impression that some of the .CSV files may have slightly diffrent schemas (e.g., xxxxxx_new_format.cvs, xxxxx_version_2.csv). My thought is to copy the first line of each .csv into a text doc for comparison.
So I created the following script:
Get-Content "C:\*.csv" | ForEach-Object {
Select-Object -First 1 | Out-File "C:\compare.txt"
}
which seemed to go into an infinite loop.
How should I attack this problem? If there is a better method for comparison (i.e., I should be using python) please let me know.

try this
Get-ChildItem "C:\temp2\*.csv" |
%{[pscustomobject]#{FileName=$_.FullName;Header=gc $_.FullName -TotalCount 1}} |
group Header

Reformat column names in a csv with PowerShell

Question
How do I reformat an unknown CSV column name according to a formula or subroutine (e.g. rename column " Arbitrary Column Name " to "Arbitrary Column Name" by running a trim or regex or something) while maintaining data?
Goal
I'm trying to more or less sanitize columns (the names) in a hand-produced (or at least hand-edited) csv file that needs to be processed by an existing PowerShell script. In this specific case, the columns have spaces that would be removed by a call to [String]::Trim(), or which could be ignored with an appropriate regex, but I can't figure a way to call or use those techniques when importing or processing a CSV.
Short Background
Most files and columns have historically been entered into the CSV properly, but recently a few columns were being dropped during processing; I determined it was because the files contained a space (e.g., Select-Object was being told to get "RFC", but Import-CSV retrieved "RFC ", so no matchy-matchy). Telling the customer to enter it correctly by hand (though preferred and much simpler) is not an option in this case.
Options considered
I could manually process the text of the file, but that is a messy and error prone way to re-invent the wheel. I wonder if there's a syntax with Select-Object that would allow a softer match for column names, but I can't find that info.
The closest I have come conceptually is using a calculated property in the call to Select-Object to rename the column, but I can only find ways to rename a known column to another known column. So, this would require enumerating the columns and matching them exactly (preferred) or a softer match (like comparing after trimming or matching via regex as a fallback) with expected column names, then creating a collection of name mappings to use in constructing calculated properties from that information to select into a new object.
That seems like it would work, but more it's work than I'd prefer, and I can't help but hope that there's a simpler way I haven't been able to find via Google. Maybe I should try Bing?

Sample File
Let's say you have a file.csv like this:
" RFC "
"1"
"2"
"3"
Code
Now try to run the following:
$CSV = Get-Content file.csv -First 2 | ConvertFrom-Csv
$FixedHeaders = $CSV.PSObject.Properties.Name.Trim(' ')
Import-Csv file.csv -Header $FixedHeaders |
Select-Object -Skip 1 -Property RFC
Output
You will get this output:
RFC
---
1
2
3
Explanation
First we use Get-Content with parameter -First 2 to get the first two lines. Piping to ConvertFrom-Csv will allow us to access the headers with PSObject.Properties.Name. Use Import-Csv with the -Header parameter to use the trimmed headers. Pipe to Select-Object and use -Skip 1 to skip the original headers.

I'm not sure about comparisons in terms of efficiency, but I think this is a little more hardened, and imports the CSV only once. You might be able to use #lahell's approach and Get-Content -raw, but this was done and it works, so I'm gonna leave it to the community to determine which is better...
#import the CSV
$rawCSV = Import-Csv $Path
#get actual header names and map to their reformatted versions
$CSVColumns = #{}
$rawCSV |
Get-Member |
Where-Object {$_.MemberType -eq "NoteProperty"} |
Select-Object -ExpandProperty Name |
Foreach-Object {
#add a mapping to the original from a trimmed and whitespace-reduced version of the original
$CSVColumns.Add(($_.Trim() -replace '(\s)\s+', '$1'), "$_")
}
#Create the array of names and calculated properties to pass to Select-Object
$SelectColumns = #()
$CSVColumns.GetEnumerator() |
Foreach-Object {
$SelectColumns += {
if ($CSVColumns.values -contains $_.key) {$_.key}
else { #{Name = $_.key; Expression = $CSVColumns[$_.key]} }
}
}
$FormattedCSV = $rawCSV |
Select-Object $SelectColumns
This was hand-copied to a computer where I don't have the rights to run it, so there might be an error - I tried to copy it correctly

You can use gocsv https://github.com/DataFoxCo/gocsv to see the headers of the csv, you can then rename the headers, behead the file, swap columns, join, merge, any number of transformations you want

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse