Comparing the first line in multiple CSVs - PowerShell

I have a folder containing about 130 .csv files that all appear to contain similar fields (column names). However, based on some of the file names, I am under the impression that some of the .csv files may have slightly different schemas (e.g., xxxxxx_new_format.csv, xxxxx_version_2.csv). My thought is to copy the first line of each .csv into a text doc for comparison.
So I created the following script:
Get-Content "C:\*.csv" | ForEach-Object {
Select-Object -First 1 | Out-File "C:\compare.txt"
}
which seemed to go into an infinite loop.
How should I attack this problem? If there is a better method for comparison (i.e., I should be using python) please let me know.

Try this:
Get-ChildItem "C:\temp2\*.csv" |
    %{[pscustomobject]@{FileName=$_.FullName;Header=gc $_.FullName -TotalCount 1}} |
    group Header
This reads the first line of each file (gc is Get-Content) and groups the files by identical header, so each group represents one distinct schema.
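To see which files share a schema, you can expand each group's members; a small usage sketch building on the answer above (the path and the output format are illustrative, not from the original answer):
Get-ChildItem "C:\temp2\*.csv" |
    %{[pscustomobject]@{FileName=$_.FullName;Header=gc $_.FullName -TotalCount 1}} |
    group Header |
    %{ "{0} file(s) with header: {1}" -f $_.Count, $_.Name; $_.Group.FileName }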

Powershell - Extract first line from CSV files and export the results

Thanks in advance for the help.
I have a folder with multiple CSV files. I'd like to be able to extract the first line of each of the files and store the results in a separate CSV file. The newly created CSV file will have the file name as the first column and the first line of the file as the second column.
The output should look something like this (as an exported CSV File):
FileName,FirstLine
FileName1,Col1,Col2,Col3
FileName2,Col1,Col2,Col3
Notes:
There are other files that should be ignored. I’d like the code to loop through all CSV files which match the name pattern. I’m able to locate the files using the below code:
$targetDir ="C:\CSV_Testing\"
Get-ChildItem -Path $targetDir -Recurse -Filter "em*"
I’m also able to read the first line of one file with the below code:
Get-Content C:\CSV_Testing\testing.csv | Select -First 1
I guess I just need someone to help with looping through the files and exporting the results. Is anyone able to assist?
Thanks
You basically need a loop to enumerate each file; for this you can use ForEach-Object. Then, to construct the output, you need to instantiate new objects, and [pscustomobject] is the easiest choice for that. Finally, Export-Csv will convert those objects into CSV.
$targetDir = "C:\CSV_Testing"
Get-ChildItem -Path $targetDir -Recurse -Filter "em*.csv" | ForEach-Object {
    [pscustomobject]@{
        FileName  = $_.Name
        FirstLine = $_ | Get-Content -TotalCount 1
    }
} | Export-Csv path\to\theResult.csv -NoTypeInformation
I have assumed the files actually have the .csv extension, hence I changed your filter to -Filter "em*.csv"; if that's not the case, you can use the filter as you currently have it.

Powershell "if more than one, then delete all but one"

Is there a way to do something like this in Powershell:
"If more than one file includes a certain set of text, delete all but one"
Example:
"...Cam1....jpg"
"...Cam2....jpg"
"...Cam2....jpg"
"...Cam3....jpg"
Then I would want one of the two "...Cam2....jpg" deleted, while the other one should stay.
I know that I can use something like
gci *Cam2* | del
but I don't know how I can make one of these files stay.
Also, for this to work, I need to look through all the files to see if there are any duplicates, which defeats the purpose of automating this process with a Powershell script.
I searched for a solution to this for a long time, but I just can't find something that is applicable to my scenario.
Get the files into a collection and use the range operator to select a subset of its elements. To remove all but the first element, start from index one. Like so:
$cams = gci "*cam2*"
if ($cams.Count -gt 1) {
    # 1..Count runs one index past the end; the out-of-range index is silently ignored in a slice
    $cams[1..$cams.Count] | Remove-Item
}
Expanding on the idea of commenter boxdog:
# Find all duplicately named files.
$dupes = Get-ChildItem c:\test -file -recurse | Group-Object Name | Where-Object Count -gt 1
# Delete all duplicates except the 1st one per group.
$dupes | ForEach-Object { $_.Group | Select-Object -Skip 1 | Remove-Item -Force }
I've split this up into two subtasks to make it easier to understand. It is also a good idea to always separate directory iteration from file deletion, to avoid inconsistent results.
The first statement uses Group-Object to group files by name. It outputs a Count property containing the number of files per group. Then Where-Object is used to get only the groups that contain more than one file, which are the dupes. The result is stored in the variable $dupes, which is an array that looks like this:
Count Name      Group
----- ----      -----
    2 file1.txt {C:\test\subdir1\file1.txt, C:\test\subdir2\file1.txt}
    2 file2.txt {C:\test\subdir1\file2.txt, C:\test\subdir2\file2.txt}
The second statement uses ForEach-Object to iterate over all groups of duplicates. From the Group-Object call of the first statement we get a Group property that contains an array of file information objects. Using Select-Object -Skip 1 we select all but the first element of this array, which is passed to Remove-Item to delete the files.
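If you want to preview what would be deleted before committing, Remove-Item supports the standard -WhatIf switch, for example:
# Preview the deletions without actually removing any files
$dupes | ForEach-Object { $_.Group | Select-Object -Skip 1 | Remove-Item -WhatIf }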

Get-Content Measure-Object Command: Additional rows are added to the actual row count

This is my first post here - my apologies in advance if I didn't follow a certain etiquette for posting. I'm a newbie to powershell, but I'm hoping someone can help me figure something out.
I'm using the following PowerShell script to tell me the total count of rows in a CSV file, minus the header. The result is written to a text file.
$x = (Get-Content -Path "C:\mysql\out_data\18*.csv" | Measure-Object -Line).Lines
$logfile = "C:\temp\MyLog.txt"
$files = get-childitem "C:\mysql\out_data\18*.csv"
foreach ($file in $files)
{
    $x--
    "File: $($file.name) Count: $x" | out-file $logfile -Append
}
I am doing this for 10 individual files. But there is just ONE file that keeps adding exactly 807 more rows to the actual count. For example, for the code above, the actual row count (minus the header) in the file is 25,083. But my script above generates 25,890 as the count. I've tried running this for different iterations of the same type of file (same data, different days), but it keeps adding exactly 807 to the row count.
Even when running only (Get-Content -Path "C:\mysql\out_data\18*.csv" | Measure-Object -Line).Lines, I still see the wrong record count in the powershell window.
I suspect that there may be a problem with the csv file itself. I'm coming to that conclusion since 9 out of 10 files generate the correct row count. Thank you in advance for your time.
To measure the items in a csv you should use Import-Csv rather than Get-Content. This way you don't have to worry about headers or empty lines.
(Import-Csv -Path $csvfile | Measure-Object).Count
It's definitely possible there's a problem with that csv file. Also, note that if the csv has cells that include line breaks, that will confuse Get-Content, so also try Import-Csv.
I'd start with this:
$PathToQuestionableFile = "c:\somefile.csv"
$TestContents = Get-Content -Path $PathToQuestionableFile
Write-Host "`n-------`nUsing Get-Content:"
$TestContents.count
$TestContents[0..10]
$TestCsv = Import-CSV -Path $PathToQuestionableFile
Write-Host "`n-------`nUsing Import-CSV:"
$TestCsv.count
$TestCsv[0..10] | Format-Table
That will let you see what Get-Content is pulling so you can narrow down where the problem is.
If it is in the file itself and using Import-Csv doesn't fix it, I'd try using Notepad++ to check both the encoding and the line endings:
Encoding is a drop-down menu; compare it to the other csv files.
Line endings can be seen via View > Show Symbol > Show All Characters (a PowerShell check follows the list below). They should be consistent across the file, and should be one of these:
CR (typically if it came from a mac)
LF (typically if it came from *nix or the internet)
CRLF (typically if it came from windows)
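If Notepad++ isn't handy, you can get a rough count of each line-ending style from PowerShell itself; a minimal sketch, reusing $PathToQuestionableFile from the snippet above:
# Read the file as one raw string and count each line-ending style
$raw = Get-Content -Path $PathToQuestionableFile -Raw
"CRLF: " + [regex]::Matches($raw, "`r`n").Count
"LF only: " + [regex]::Matches($raw, "(?<!`r)`n").Count
"CR only: " + [regex]::Matches($raw, "`r(?!`n)").Count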

Alternative way to remove duplicates from CSV other than Sort-Object -unique?

I have a bug I cannot beat. When my script gets to this chunk of code, it incorrectly removes unique values:
import-csv "$LocalPath\A1-$abbrMonth$Year.csv" |
where {$_."CustomerName" -match $Customersregex} |
select "SubmitterID","SubmitterName","JobDate","JobTime",#{Name="Form";Expression={if ($_.FormName -match "Copy"){"C"};if ($_.FormName -match "Letter"){"L"} else {""} }},"TotalDocs",#{Name="AddnPages";Expression={$_.TotalAdditionalPages}},"InputFilename",#{Name="ActualDocs";Expression={[string]([int]$_.RegularDocs + [int]$_.UnqualifiedDocs)}}|
sort "InputFilename" -Unique |
export-csv "$LocalPath\A2-$abbrMonth$Year.csv" -NoTypeInformation
It occurs at the sort "InputFilename" -Unique line. It works properly when I cut the pipeline up and execute it piece by piece, but not in the original script.
Is there any other way to remove duplicates based on the value of a column? I've tried using the -Unique parameter on the Select-Object statement, but I can't find a way to limit it to only one column.
EDIT: To clarify the issue I'm having: I have a LARGE list of accounting data and am trying to remove duplicate entries using Sort -Unique. After the above code runs, entries are missing that should not be, because they are unique. If I isolate those entries in their own CSV and run the above code, all of them are present; but when I run my master CSV file through the same code (and only that code) and search for those entries, they are missing.
EDIT 2: Looks like it was an issue with the data file. Good grief.
You can always group things, then expand the first item in the group. It's not fast, but it works for what you're doing.
import-csv "$LocalPath\A1-$abbrMonth$Year.csv" |
where {$_."CustomerName" -match $Customersregex} |
group InputFilename |
% { $_.Group[0] } |
select "SubmitterID","SubmitterName","JobDate","JobTime",#{Name="Form";Expression={if ($_.FormName -match "Copy"){"C"};if ($_.FormName -match "Letter"){"L"} else {""} }},"TotalDocs",#{Name="AddnPages";Expression={$_.TotalAdditionalPages}},"InputFilename",#{Name="ActualDocs";Expression={[string]([int]$_.RegularDocs + [int]$_.UnqualifiedDocs)}}|
sort "InputFilename" |
export-csv "$LocalPath\A2-$abbrMonth$Year.csv" -NoTypeInformation

Reformat column names in a csv with PowerShell

Question
How do I reformat an unknown CSV column name according to a formula or subroutine (e.g. rename column " Arbitrary Column Name " to "Arbitrary Column Name" by running a trim or regex or something) while maintaining data?
Goal
I'm trying to more or less sanitize columns (the names) in a hand-produced (or at least hand-edited) csv file that needs to be processed by an existing PowerShell script. In this specific case, the columns have spaces that would be removed by a call to [String]::Trim(), or which could be ignored with an appropriate regex, but I can't figure a way to call or use those techniques when importing or processing a CSV.
Short Background
Most files and columns have historically been entered into the CSV properly, but recently a few columns were being dropped during processing; I determined it was because the column names contained a space (e.g., Select-Object was being told to get "RFC", but Import-Csv retrieved "RFC ", so no matchy-matchy). Telling the customer to enter it correctly by hand (though preferred and much simpler) is not an option in this case.
Options considered
I could manually process the text of the file, but that is a messy and error-prone way to re-invent the wheel. I wonder if there's a syntax for Select-Object that would allow a softer match on column names, but I can't find that info.
The closest I have come conceptually is using a calculated property in the call to Select-Object to rename the column, but I can only find ways to rename a known column to another known column. So, this would require enumerating the columns and matching them exactly (preferred) or a softer match (like comparing after trimming or matching via regex as a fallback) with expected column names, then creating a collection of name mappings to use in constructing calculated properties from that information to select into a new object.
That seems like it would work, but it's more work than I'd prefer, and I can't help but hope that there's a simpler way I haven't been able to find via Google. Maybe I should try Bing?
Sample File
Let's say you have a file.csv like this:
" RFC "
"1"
"2"
"3"
Code
Now try to run the following:
$CSV = Get-Content file.csv -First 2 | ConvertFrom-Csv
$FixedHeaders = $CSV.PSObject.Properties.Name.Trim(' ')
Import-Csv file.csv -Header $FixedHeaders |
Select-Object -Skip 1 -Property RFC
Output
You will get this output:
RFC
---
1
2
3
Explanation
First we use Get-Content with parameter -First 2 to get the first two lines. Piping to ConvertFrom-Csv will allow us to access the headers with PSObject.Properties.Name. Use Import-Csv with the -Header parameter to use the trimmed headers. Pipe to Select-Object and use -Skip 1 to skip the original headers.
I'm not sure about comparisons in terms of efficiency, but I think this is a little more hardened, and it imports the CSV only once. You might be able to use @lahell's approach and Get-Content -Raw, but this was done and it works, so I'm gonna leave it to the community to determine which is better...
# import the CSV
$rawCSV = Import-Csv $Path
# get actual header names and map to their reformatted versions
$CSVColumns = @{}
$rawCSV |
    Get-Member |
    Where-Object {$_.MemberType -eq "NoteProperty"} |
    Select-Object -ExpandProperty Name |
    ForEach-Object {
        # add a mapping to the original from a trimmed and whitespace-reduced version of the original
        $CSVColumns.Add(($_.Trim() -replace '(\s)\s+', '$1'), "$_")
    }
# Create the array of names and calculated properties to pass to Select-Object
$SelectColumns = @()
$CSVColumns.GetEnumerator() |
    ForEach-Object {
        # use the name as-is if it needed no cleanup, otherwise build a calculated property
        $SelectColumns += $(
            if ($CSVColumns.Values -contains $_.Key) { $_.Key }
            else { @{Name = $_.Key; Expression = $CSVColumns[$_.Key]} }
        )
    }
$FormattedCSV = $rawCSV |
    Select-Object $SelectColumns
This was hand-copied to a computer where I don't have the rights to run it, so there might be an error - I tried to copy it correctly
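To sanity-check the mapping after running it, you could compare the property names before and after (a quick illustrative check, not part of the original answer):
# Original header names, then the cleaned-up ones
($rawCSV | Get-Member -MemberType NoteProperty).Name
($FormattedCSV | Get-Member -MemberType NoteProperty).Name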
You can use gocsv (https://github.com/DataFoxCo/gocsv) to see the headers of the csv; you can then rename the headers, behead the file, swap columns, join, merge, or apply any number of other transformations.
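For instance, inspecting and then stripping the header row might look like the following (subcommand names taken from the gocsv README; treat these invocations as an unverified sketch and check the docs for exact flags):
# Print the header names of file.csv
gocsv headers file.csv
# Write a copy of file.csv with the header row removed
gocsv behead file.csv > beheaded.csv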