CSV file - count distinct, group by, sum - powershell

I have a file that looks like the following:
Visitor ID,Revenue,Channel,Flight
1234,100,Email,BA123
2345,200,PPC,BA112
456,150,Email,BA456
I need to produce a file that contains:
The count of distinct Visitor IDs (3)
The total revenue (450)
The count of each Channel
Email 2
PPC 1
The count of each Flight
BA123 1
BA112 1
BA456 1
So far I have the following code; however, when executing it on the 350 MB file, it takes too long and in some cases hits the memory limit. As I have to run this function on multiple columns, it goes through the file many times. Ideally I need to do this in one pass of the file.
$file = 'log.txt'
function GroupBy($columnName)
{
$objects = Import-Csv -Delimiter "`t" $file | Group-Object $columnName |
Select-Object @{n=$columnName;e={$_.Group[0].$columnName}}, Count
for($i=0;$i -lt $objects.count;$I++) {
$line += $columnName +"|"+$objects[$I]."$columnName" +"|Count|"+ $objects[$I].'Count' + $OFS
}
return $line
}
$finalOutput += GroupBy "Channel"
$finalOutput += GroupBy "Flight"
Write-Host $finalOutput
Any help would be much appreciated.
Thanks,
Craig

The fact that you are importing the CSV again for each column is what is killing your script. Do the loading once, then re-use the data. For example:
$data = Import-Csv .\data.csv
$flights = $data | Group-Object Flight -NoElement | ForEach-Object {[PsCustomObject]@{Flight=$_.Name;Count=$_.Count}}
$visitors = ($data | Group-Object "Visitor ID" | Measure-Object).Count
$revenue = ($data | Measure-Object Revenue -Sum).Sum
$channel = $data | Group-Object Channel -NoElement | ForEach-Object {[PsCustomObject]@{Channel=$_.Name;Count=$_.Count}}
You can display the data like this:
"Revenue : $revenue"
"Visitors: $visitors"
$flights | Format-Table -AutoSize
$channel | Format-Table -AutoSize
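If you need the summary written to a file rather than to the console (the question asks for a file), a minimal sketch along these lines should do; the output path is just a placeholder:
@(
"Revenue : $revenue"
"Visitors: $visitors"
($flights | Format-Table -AutoSize | Out-String)
($channel | Format-Table -AutoSize | Out-String)
) | Set-Content .\summary.txt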

This will probably work better: use hashtables.
Pros: it will be faster and use less memory.
Cons: it is far less readable than Group-Object, and requires more code.
To make it even less memory-hungry, read the CSV file line by line instead of loading it all first (see the streaming sketch after the code below).
$data = Import-CSV -Path "C:\temp\data.csv" -Delimiter ","
$DistinctVisitors = @{}
$TotalRevenue = 0
$ChannelCount = @{}
$FlightCount = @{}
$data | ForEach-Object {
$DistinctVisitors[$_.'Visitor ID'] = $true
$TotalRevenue += $_.Revenue
if (-not $ChannelCount.ContainsKey($_.Channel)) {
$ChannelCount[$_.Channel] = 0
}
$ChannelCount[$_.Channel] += 1
if (-not $FlightCount.ContainsKey($_.Flight)) {
$FlightCount[$_.Flight] = 0
}
$FlightCount[$_.Flight] += 1
}
$DistinctVisitorsCount = $DistinctVisitors.Keys | Measure-Object | Select-Object -ExpandProperty Count
Write-Output "The count of distinc Visitor IDs $DistinctVisitorsCount"
Write-Output "The total revenue $TotalRevenue"
Write-Output "The Count of each Channel"
$ChannelCount.Keys | ForEach-Object {
Write-Output "$_ $($ChannelCount[$_])"
}
Write-Output "The count of each Flight"
$FlightCount.Keys | ForEach-Object {
Write-Output "$_ $($FlightCount[$_])"
}
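To actually read the CSV line by line, as suggested above, pipe Import-Csv straight into ForEach-Object instead of assigning it to $data first; rows then stream through one at a time instead of the whole file sitting in memory. A sketch, assuming the same column names and path:
$DistinctVisitors = @{}
$TotalRevenue = 0
$ChannelCount = @{}
$FlightCount = @{}
Import-Csv -Path "C:\temp\data.csv" -Delimiter "," | ForEach-Object {
$DistinctVisitors[$_.'Visitor ID'] = $true
$TotalRevenue += [double]$_.Revenue # Import-Csv yields strings, so cast before adding
$ChannelCount[$_.Channel] += 1 # $null + 1 is 1, so no ContainsKey check is needed
$FlightCount[$_.Flight] += 1
}
The reporting part stays exactly the same.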

Find out Text data in CSV File Numeric Columns in Powershell

I am very new to PowerShell.
I am trying to validate my CSV file by finding out whether there is any text value in my numeric fields. I can define which columns are numeric.
My source data looks like this:
ColA ColB ColC ColD
23 23 ff 100
2.30E+01 34 2.40E+01 23
df 33 ss df
34 35 36 37
I need output something like this (only the text values, if any are found in a column):
ColA ColC ColD
2.30E+01 ff df
df 2.40E+01
ss
I have tried some code but am not getting the results I need, only output like this:
System.Object[]
---------------
xxx fff' ddd 3.54E+03
...
This is what I was trying:
#
cls
function Is-Numeric ($Value) {
return $Value -match "^[\d\.]+$"
}
$arrResult = @()
$arraycol = @()
$FileCol = @("ColA","ColB","ColC","ColD")
$dif_file_path = "C:\Users\$env:username\desktop\f2.csv"
#Importing CSVs
$dif_file = Import-Csv -Path $dif_file_path -Delimiter ","
############## Test Datatype (Is-Numeric)##########
foreach($col in $FileCol)
{
foreach ($line in $dif_file) {
$val = $line.$col
$isnum = Is-Numeric($val)
if ($isnum -eq $false) {
$arrResult += $line.$col
$arraycol += $col
}
}
}
[pscustomobject]@{$arraycol = "$arrResult"}| out-file "C:\Users\$env:username\Desktop\Errors1.csv"
####################
Can someone guide me in the right direction?
Thanks
You can try something like this:
function Is-Numeric ($Value) {
return $Value -match "^[\d\.]+$"
}
$dif_file_path = "C:\Users\$env:username\desktop\f2.csv"
#Importing CSVs
$dif_file = Import-Csv -Path $dif_file_path -Delimiter ","
#$columns = $dif_file | Get-member -MemberType 'NoteProperty' | Select-Object -ExpandProperty 'Name'
# Use this to specify certain columns
$columns = "ColB", "ColC", "ColD"
foreach($row in $dif_file) {
foreach ($col in $columns) {
if (!(Is-Numeric $row.$col)) {
$row.$col = ""
}
}
}
$dif_file | Export-Csv C:\temp\formatted.txt
Look up the names of the columns as you go (or specify them explicitly).
Look up the value of each column in each row and, if it is not numeric, change it to "".
Export the updated file.
I think not displaying columns that have no data creates the challenge here. You can do the following:
$csv = Import-Csv "C:\Users\$env:username\desktop\f2.csv"
$finalprops = [collections.generic.list[string]]@()
$out = foreach ($line in $csv) {
$props = $line.psobject.properties | Where {$_.Value -notmatch '^[\d\.]+$'} |
Select-Object -Expand Name
$props | Where {$_ -notin $finalprops} | Foreach-Object { $finalprops.add($_) }
if ($props) {
$line | Select $props
}
}
$out | Select-Object ($finalprops | Sort)
Given the nature of Format-Table or tabular output, you only see the properties of the first object in the collection. So if object1 has ColA only, but object2 has ColA and ColB, you only see ColA.
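You can see this with two made-up objects:
[pscustomobject]@{ColA = 1}, [pscustomobject]@{ColA = 2; ColB = 3} | Format-Table
only shows a ColA column, because the first object determines which properties the table displays.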
The output order you want is quite different than the input CSV; you're tracking bad text data not by first occurrence, but by column order, which requires some extra steps.
test.csv file contents:
ColA,ColB,ColC,ColD
23,23,ff,100
2.30E+01,34,2.40E+01,23
df,33,ss,df
34,35,36,37
Sample code tested to meet your description:
$csvIn = Import-Csv "$PSScriptRoot\test.csv";
# create working data set with headers in same order as input file
$data = [ordered]@{};
$csvIn[0].PSObject.Properties | foreach {
$data.Add($_.Name, (New-Object System.Collections.ArrayList));
};
# add fields with text data
$csvIn | foreach {
$_.PSObject.Properties | foreach {
if ($_.Value -notmatch '^-?[\d\.]+$') {
$null = $data[$_.Name].Add($_.Value);
}
}
}
$removes = @(); # remove `good` columns with numeric data
$rowCount = 0; # column with most bad values
$data.GetEnumerator() | foreach {
$badCount = $_.Value.Count;
if ($badCount -eq 0) { $removes += $_.Key; }
if ($badCount -gt $rowCount) { $rowCount = $badCount; }
}
$removes | foreach { $data.Remove($_); }
0..($rowCount - 1) | foreach {
$h = [ordered]@{};
foreach ($key in $data.Keys) {
$h.Add($key, $data[$key][$_]);
}
[PSCustomObject]$h;
} |
Export-Csv -NoTypeInformation -Path "$PSScriptRoot\text-data.csv";
output file contents:
"ColA","ColC","ColD"
"2.30E+01","ff","df"
"df","2.40E+01",
,"ss",
@Jawad, finally I have tried:
function Is-Numeric ($Value) {
return $Value -match "^[\d\.]+$"
}
$arrResult = @()
$columns = "ColA","ColB","ColC","ColD"
$dif_file_path = "C:\Users\$env:username\desktop\f1.csv"
$dif_file = Import-Csv -Path $dif_file_path -Delimiter "," |select $columns
$columns = $dif_file | Get-member -MemberType 'NoteProperty' | Select-Object -ExpandProperty 'Name'
foreach($row in $dif_file) {
foreach ($col in $columns) {
$val = $row.$col
$isnum = Is-Numeric($val)
if ($isnum -eq $false) {
$arrResult += $col+ " " +$row.$col
}}}
$arrResult | out-file "C:\Users\$env:username\desktop\Errordata.csv"
I get the correct result in my output file, but the order is very ambiguous, like:
ColA ss
ColB 5.74E+03
ColA ss
ColC rrr
ColB 3.54E+03
ColD ss
ColB 8.31E+03
ColD cc
Any idea how to get a proper format? Thanks.
Note: with your suggested code, I get the complete source file with all the data, not just the specific error data.

How to export two variables into same CSV as joined via PowerShell?

I have a PowerShell script employing the PoshWSUS module, like below:
$FileOutput = "C:\WSUSReport\WSUSReport.csv"
$ProcessLog = "C:\WSUSReport\QueryLog2.txt"
$WSUSServers = "C:\WSUSReport\Computers.txt"
$WSUSPort = "8530"
import-module poshwsus
ForEach ($Server in Get-Content $WSUSServers)
{
& connect-poshwsusserver $Server -port $WSUSPort | out-file $ProcessLog -append
$r1 = & Get-PoshWSUSClient | select @{name="Computer";expression={$_.FullDomainName}},@{name="LastUpdated";expression={if ([datetime]$_.LastReportedStatusTime -gt [datetime]"1/1/0001 12:00:00 AM") {$_.LastReportedStatusTime} else {$_.LastSyncTime}}}
$r2 = & Get-PoshWSUSUpdateSummaryPerClient -UpdateScope (new-poshwsusupdatescope) -ComputerScope (new-poshwsuscomputerscope) | Select Computer,NeededCount,DownloadedCount,NotApplicableCount,NotInstalledCount,InstalledCount,FailedCount
}
What I need to do is export a CSV output that joins the two results (like an "inner join") with the columns:
Computer, NeededCount, DownloadedCount, NotApplicableCount, NotInstalledCount, InstalledCount, FailedCount, LastUpdated
I have tried to use the line below inside the foreach, but it didn't work as I expected.
$r1 + $r2 | export-csv -NoTypeInformation -append $FileOutput
I would appreciate any help or advice.
EDIT --> The output I've got:
ComputerName LastUpdate
X A
Y B
X
Y
So there is no error: the first two rows come from $r2, the last two rows from $r1; it is not joining the tables as I expected.
Thanks!
I found my guidance in this post: Inner Join in PowerShell (without SQL)
I modified my query accordingly, like below, and it works like a charm.
$FileOutput = "C:\WSUSReport\WSUSReport.csv"
$ProcessLog = "C:\WSUSReport\QueryLog.txt"
$WSUSServers = "C:\WSUSReport\Computers.txt"
$WSUSPort = "8530"
import-module poshwsus
function Join-Records($tab1, $tab2){
$prop1 = $tab1 | select -First 1 | % {$_.PSObject.Properties.Name} #properties from t1
$prop2 = $tab2 | select -First 1 | % {$_.PSObject.Properties.Name} #properties from t2
$join = $prop1 | ? {$prop2 -Contains $_}
$unique1 = $prop1 | ?{ $join -notcontains $_}
$unique2 = $prop2 | ?{ $join -notcontains $_}
if ($join) {
$tab1 | % {
$t1 = $_
$tab2 | % {
$t2 = $_
foreach ($prop in $join) {
if (!$t1.$prop.Equals($t2.$prop)) { return; }
}
$result = @{}
$join | % { $result.Add($_,$t1.$_) }
$unique1 | % { $result.Add($_,$t1.$_) }
$unique2 | % { $result.Add($_,$t2.$_) }
[PSCustomObject]$result
}
}
}
}
ForEach ($Server in Get-Content $WSUSServers)
{
& connect-poshwsusserver $Server -port $WSUSPort | out-file $ProcessLog -append
$r1 = & Get-PoshWSUSClient | select @{name="Computer";expression={$_.FullDomainName}},@{name="LastUpdated";expression={if ([datetime]$_.LastReportedStatusTime -gt [datetime]"1/1/0001 12:00:00 AM") {$_.LastReportedStatusTime} else {$_.LastSyncTime}}}
$r2 = & Get-PoshWSUSUpdateSummaryPerClient -UpdateScope (new-poshwsusupdatescope) -ComputerScope (new-poshwsuscomputerscope) | Select Computer,NeededCount,DownloadedCount,NotApplicableCount,NotInstalledCount,InstalledCount,FailedCount
Join-Records $r1 $r2 | Select Computer,NeededCount,DownloadedCount,NotApplicableCount,NotInstalledCount,InstalledCount,FailedCount, LastUpdated | export-csv -NoTypeInformation -append $FileOutput
}
I think this could be made simpler. Since Select-Object's -Property parameter accepts an array of values, you can create an array of the properties you want to display. The array can be constructed by comparing your two objects' properties and outputting a unique list of those properties.
$selectProperties = $r1.psobject.properties.name | Compare-Object $r2.psobject.properties.name -IncludeEqual -PassThru
$r1,$r2 | Select-Object -Property $selectProperties
Compare-Object by default will output only differences between a reference object and a difference object. Adding the -IncludeEqual switch displays different and equal comparisons. Adding the -PassThru parameter outputs the actual objects that are compared rather than the default PSCustomObject output.
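As a quick illustration with made-up property-name lists:
'Computer','LastUpdated' | Compare-Object 'Computer','NeededCount' -IncludeEqual -PassThru
returns Computer, LastUpdated and NeededCount, each once: the union of the two name lists, which is exactly the array that Select-Object's -Property parameter needs.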

Multiple Criteria Matching in PowerShell

Hello PowerShell scriptwriters,
My objective is to count rows based on matching multiple criteria. My PowerShell script can fetch the end result, but it takes too much time (the more rows there are, the longer it takes). Is there a way to optimize my existing code? I've shared it below for reference.
$csvfile = Import-csv "D:\file\filename.csv"
$name_unique = $csvfile | ForEach-Object {$_.Name} | Select-Object -Unique
$region_unique = $csvfile | ForEach-Object {$_."Region Location"} | Select-Object -Unique
$cost_unique = $csvfile | ForEach-Object {$_."Product Cost"} | Select-Object -Unique
Write-host "Save Time on Report" $csvfile.Length
foreach($nu in $name_unique)
{
$inc = 1
foreach($au in $region_unique)
{
foreach($tu in $cost_unique)
{
foreach ($mainfile in $csvfile)
{
if (($mainfile."Region Location" -eq $au) -and ($mainfile.'Product Cost' -eq $tu) -and ($mainfile.Name -eq $nu))
{
$inc++ #Matching Counter
}
}
}
}
$inc # expected to display the row values with the total count, and export the result as CSV
}
You can do this quite simply using the Group-Object cmdlet.
$csvfile = Import-csv "D:\file\filename.csv"
$csvfile | Group Name,"Region Location","Product Cost" | Select Name, Count
This gives output something like the following:
Name Count
---- ------
f1, syd, 10 2
f2, syd, 10 1
f3, syd, 20 1
f4, melb, 10 2
f2, syd, 40 1
P.S. The code you provided above is not matching all of the fields; it is simply checking the Name parameter (looping through the other parameters needlessly).
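Since you mention wanting to export the result as CSV, the same pipeline can feed Export-Csv directly; the output path here is only an example:
$csvfile | Group Name,"Region Location","Product Cost" | Select Name, Count | Export-Csv "D:\file\summary.csv" -NoTypeInformation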

Combining like objects in an array

I am attempting to analyze a group of text files (MSFTP logs) and count the IP addresses that have submitted bad credentials. I think I have it worked out, except I don't think the array is being passed to/from the function correctly. As a result, I get duplicate entries if the same IP appears in multiple log files. What am I doing wrong?
Function LogBadAttempt($FTPLog,$BadPassesArray)
{
$BadPassEx="PASS - 530"
Foreach($Line in $FTPLog)
{
if ($Line -match $BadPassEx)
{
$IP=($Line.Split(' '))[1]
if($BadPassesArray.IP -contains $IP)
{
$CurrentIP=$BadPassesArray | Where-Object {$_.IP -like $IP}
[int]$CurrentCount=$CurrentIP.Count
$CurrentCount++
$CurrentIP.Count=$CurrentCount
}else{
$info=@{"IP"=$IP;"Count"='1'}
$BadPass=New-Object -TypeName PSObject -Property $info
$BadPassesArray += $BadPass
}
}
}
return $BadPassesArray
}
$BadPassesArray=@()
$FTPLogs = Get-Childitem \\ftpserver\MSFTPSVC1\test
$Result = ForEach ($LogFile in $FTPLogs)
{
$FTPLog=Get-Content ($LogFile.fullname)
LogBadAttempt $FTPLog
}
$Result | Export-csv C:\Temp\test.csv -NoTypeInformation
The result looks like...
Count IP
7 209.59.17.20
20 209.240.83.135
18441 209.59.17.20
13059 200.29.3.98
and I would like it to combine the entries for 209.59.17.20.
You're making this way too complicated. Process the files in a pipeline and use a hashtable to count the occurrences of each IP address:
$BadPasswords = @{}
Get-ChildItem '\\ftpserver\MSFTPSVC1\test' | Get-Content | ? {
$_ -like '*PASS - 530*'
} | % {
$ip = ($_ -split ' ')[1]
$BadPasswords[$ip]++
}
$BadPasswords.GetEnumerator() |
select @{n='IP';e={$_.Name}}, @{n='Count';e={$_.Value}} |
Export-Csv 'C:\Temp\test.csv' -NoType
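If you would rather see the most frequent offenders first, a Sort-Object can be slotted in before the export (just a variation of the same pipeline):
$BadPasswords.GetEnumerator() |
select @{n='IP';e={$_.Name}}, @{n='Count';e={$_.Value}} |
Sort-Object Count -Descending |
Export-Csv 'C:\Temp\test.csv' -NoType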

Powershell : merge two CSV files with partially duplicate lines

I have scraped two files from a website in order to list the companies in my city.
The first lists: name, city, phone number, email
The second lists: name, city, phone number
If I merge them I will have duplicate lines; as an example, I will have the following:
> "Firm1";"Los Angeles";"000000";"info#firm1.lol"
> "Firm1";"Los Angeles";"000000";""
> "Firm2";"Los Angeles";"111111";""
> "Firm3";"Los Angeles";"000000";"contact#firm3.lol"
> "Firm3";"Los Angeles";"000000";""
> ...
Is there a way to merge the two files and keep the maximum info, like this:
> "Firm1";"Los Angeles";"000000";"info#firm1.lol"
> "Firm2";"Los Angeles";"111111";""
> "Firm3";"Los Angeles";"000000";"contact#firm3.lol"
> ...
Assuming you've got a file like this called 'firm.csv':
"Firm1";"Los Angeles";"000000";"info#firm1.lol"
"Firm1";"Los Angeles";"000000";""
"Firm2";"Los Angeles";"111111";""
"Firm3";"Los Angeles";"000000";"contact#firm3.lol"
"Firm3";"Los Angeles";"000000";""
You can load it using:
$firms = import-csv C:\temp\firm.csv -Header 'Firm','Town','Tel','Mail' -Delimiter ';'
Then
$firms | Sort-Object -Unique -Property 'Firm'
Following Joey's comment, I improved the solution:
$firms | Group-Object -Property 'firm' | % {$_.group | Sort-Object -Property mail -Descending | Select-Object -first 1}
EDIT: just realized the two files don't contain the same headers. Here is an update.
$main = Import-Csv firm1.csv -Header 'Firm','Town','Tel','Mail' -Delimiter ";"
$alt = Import-Csv firm2.csv -Header 'Firm','Town','Tel' -Delimiter ";"
foreach ($f in $alt)
{
$found = $false
foreach($g in $main)
{
if ($g.Firm -eq $f.Firm -and $g.Town -eq $f.Town)
{
$found = $true
if ($g.Tel -eq "")
{
$g.Tel = $f.Tel
}
}
}
if ($found -eq $false)
{
$main += $f
}
}
# Everything is merged into the $main array
$main
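A shorter sketch of the same two-file merge that reuses the grouping idea from above; the file names and the empty Mail placeholder added to the second file are assumptions:
$withMail = Import-Csv C:\temp\firm1.csv -Header 'Firm','Town','Tel','Mail' -Delimiter ';'
$noMail = Import-Csv C:\temp\firm2.csv -Header 'Firm','Town','Tel' -Delimiter ';' |
Select-Object Firm, Town, Tel, @{n='Mail';e={''}}
@($withMail) + @($noMail) | Group-Object Firm |
ForEach-Object { $_.Group | Sort-Object Mail -Descending | Select-Object -First 1 }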
There must be a better approach, but this is one costly way to do it.
$firms = import-csv C:\firm.csv -Header 'Firm','Town','Tel','Mail' -Delimiter ';'
$Result = @()
ForEach($i in $firms){
$found = 0;
ForEach($m in $Result){
if($m.Firm -eq $i.Firm){
$found = 1
if( $i.Mail.length -ne 0 )
{
$m.Mail = $i.Mail
}
break;
}
}
if($found -eq 0){
$Result += [pscustomobject] @{Firm=$i.Firm; Town=$i.Town; Tel=$i.Tel; Mail=$i.Mail}
}
}
$Result | export-csv C:\out.csv