PowerShell performance tuning for an aggregation operation on big delimited files

I have a delimited file with 350 columns. The delimiter is \034 (field separator).
I have to extract a particular column's value and find the count of each distinct value of that column in the file. If the count of a distinct value is greater than or equal to 2, I need to output it to a file.
The source file is 1 GB. I have written the following command, but it is very slow.
Get-Content E:\Test\test.txt | Foreach {($_ -split '\034')[117]} | Group-Object -Property { $_ } | %{ if($_.Count -ge 2) { Select-Object -InputObject $_ -Property Name,Count} } | Export-csv -Path "E:\Test\test2.csv" -NoTypeInformation
Please help!

I suggest using a switch statement to process the input file quickly (by PowerShell standards):
# Get an array of all the column values of interest.
$allColValues = switch -File E:\Test\test.txt {
    default { # each input line
        # For better performance with *literal* separators,
        # use the .Split() *method*.
        # Generally, however, use of the *regex*-based -split *operator* is preferable.
        $_.Split([char] 0x1c)[117] # hex 0x1c is octal 034
    }
}
# Group the column values, and only output those that occur at least
# twice.
$allColValues | Group-Object -NoElement | Where-Object Count -ge 2 |
    Select-Object Name, Count | Export-Csv E:\Test\test2.csv -NoTypeInformation
Tip of the hat to Mathias R. Jessen for suggesting the -NoElement switch, which streamlines the Group-Object call by only maintaining abstract group information; that is, only the grouping criteria (as reflected in .Name), not also the individual objects that make up the group (as normally reflected in .Group), are returned via the output objects.
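To illustrate what -NoElement does, a quick ad-hoc demo (not part of the original answer):

'a', 'b', 'a' | Group-Object -NoElement

# Count Name    <- only the count and the grouping criterion are kept;
# ----- ----       there is no .Group property holding the original objects
#     2 a
#     1 b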
As for what you tried:
Get-Content with line-by-line streaming in the pipeline is slow, both generally (the object-by-object passing introduces overhead) and, specifically, because Get-Content decorates each line it outputs with ETS (Extended Type System) metadata.
GitHub issue #7537 proposes adding a way to opt-out of this decoration.
At the expense of memory consumption and potentially additional work for line-splitting, the -Raw switch reads the entire file as a single, multi-line string, which is much faster (see the sketch after these points).
Passing -Property { $_ } to Group-Object isn't necessary - just omit it. Without a -Property argument, the input objects are grouped as a whole.
Chaining Where-Object and Select-Object - rather than filtering via an if statement in a ForEach-Object call combined with multiple Select-Object calls - is not only conceptually clearer, but performs better.
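As a rough sketch of the Get-Content -Raw alternative (same file and column index as above; this trades memory for speed by reading the whole 1GB file at once):

# Read the entire file as one string, split it into lines, then extract
# column 118 (index 117) from each line.
$allColValues = foreach ($line in (Get-Content -Raw E:\Test\test.txt) -split '\r?\n') {
    $line.Split([char] 0x1c)[117]
}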

How to add information to a row in a csv with PowerShell?

I have a csv file with 3 columns: SID, SamAccountName, ENABLED.
I also have a folder containing files whose names are a combination of "UVHD-" + SID.
I am trying to update the csv file with Length and LastWriteTime,
so it will be like this, for example:
SID SAMAccountName Enabled Length LastWriteTime
S-... FelixR False 205520896 02/02/2021 9:13:40
I tried many things and all failed;
this is the best I could get:
Import-Csv $path\SID-ListTEST2.csv | select -ExpandProperty SID | ForEach-Object { Get-Childitem –Path $path\"UVHD-"$_.vhdx | Export-Csv $path\SID-ListTEST2.csv -Append | where $_ }
Use calculated properties:
(
    Import-Csv $path\SID-ListTEST2.csv |
        Select-Object *,
            @{
                Name = 'LastWriteTime'
                Expression = { (Get-Item "$path\UVHD-$($_.SID).vhdx").LastWriteTime }
            }
) # | Export-Csv -NoTypeInformation -Encoding utf8 $path\SID-ListTEST2.csv
Outputs to the display; remove the # from the last line to export to a CSV file instead.
Note the (...) around the pipeline, which ensures that all output is collected up front, which is the prerequisite for saving the results back to the original input file. Note that the original character encoding isn't necessarily preserved - use -Encoding to specify the desired one.
This adds one additional property, LastWriteTime; construct the other ones analogously.
For improved performance, you could cache the result of the Get-Item call, so that it doesn't have to be repeated in every calculated property: In the simplest case, use ($script:file = Get-Item ...) in the first calculated property, which you can then reuse as $script:file (or just $file) in the subsequent ones. Note that the $script: scope modifier is necessary, because the script blocks of calculated properties run in child scopes.[1]
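For instance, a minimal sketch of that caching approach, adding both Length and LastWriteTime (the two properties named in the question; -ErrorAction Ignore is an added assumption, to keep missing files silent):

Import-Csv $path\SID-ListTEST2.csv |
    Select-Object *,
        @{
            Name = 'Length'
            # Cache the file-info object in the script scope for reuse below.
            Expression = { ($script:file = Get-Item "$path\UVHD-$($_.SID).vhdx" -ErrorAction Ignore).Length }
        },
        @{
            Name = 'LastWriteTime'
            # Reuse the cached object instead of repeating the Get-Item call.
            Expression = { $script:file.LastWriteTime }
        }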
Note that if no matching file exists, the Get-Item call fails silently and defaults to $null.
[1] Therefore, the more robust - but more cumbersome - approach would be to use Set-Variable -Scope 1 file (Get-Item ...) instead of $script:file = Get-Item ..., to ensure that the variable is created in the immediate parent scope, whatever it happens to be.

PowerShell, can't get LastWriteTime

I have this working, but need LastWriteTime and can't get it.
Get-ChildItem -Recurse | Select-String -Pattern "CYCLE" | Select-Object Path, Line, LastWriteTime
I get an empty column and zero date-time data.
Select-String's output objects, which are of type Microsoft.PowerShell.Commands.MatchInfo, only contain the input file path (string), no other metadata such as LastWriteTime.
To obtain it, use a calculated property, combined with the common -PipelineVariable parameter,
which allows you to reference the input file at hand in the calculated property's expression script block as a System.IO.FileInfo instance as output by Get-ChildItem, whose .LastWriteTime property value you can return:
Get-ChildItem -File -Recurse -PipelineVariable file |
    Select-String -Pattern "CYCLE" |
    Select-Object Path,
                  Line,
                  @{
                      Name = 'LastWriteTime'
                      Expression = { $file.LastWriteTime }
                  }
Note how the pipeline variable, $file, must be passed without the leading $ (i.e. as file) as the -PipelineVariable argument. -PipelineVariable can be abbreviated to -pv.
LastWriteTime is a property of System.IO.FileSystemInfo, which is the base type of the items Get-ChildItem returns for the Filesystem provider (which is System.IO.FileInfo for files). Path and Line are properties of Microsoft.PowerShell.Commands.MatchInfo, which contains information about the match, not the file you passed in. Select-Object operates on the information piped into it, which comes from the previous expression in the pipeline, your Select-String in this case.
You can't do this as a (well-written) one-liner if you want the file name, line match, and the last write time of the actual file to be returned. I recommend using an intermediary PSCustomObject for this, so we can loop over the found files and matches individually:
# Use -File to only get file objects
$foundMatchesInFiles = Get-ChildItem -Recurse -File | ForEach-Object {
    # Assign $PSItem/$_ to $file since we will need it in the second loop
    $file = $_
    # Run Select-String on each found file
    $file | Select-String -Pattern CYCLE | ForEach-Object {
        [PSCustomObject]@{
            Path              = $_.Path
            Line              = $_.Line
            FileLastWriteTime = $file.LastWriteTime
        }
    }
}
Note: I used a slightly altered name of FileLastWriteTime to exemplify that this comes from the returned file and not the match provided by Select-String, but you could use LastWriteTime if you wish to retain the original property name.
Now $foundMatchesInFiles will be a collection of files which have CYCLE occurring within them, the path of the file itself (as returned by Select-String), and the last write time of the file itself as was returned by the initial Get-ChildItem.
Additional considerations
You could also use Select-Object and calculated properties, but IMO the above is a more concise approach when merging properties from unrelated objects. While not a poor approach, Select-Object outputs data with a type name derived from the original object type (e.g. Selected.Microsoft.PowerShell.Commands.MatchInfo). The code may work fine, but it can cause some confusion when others who consume this object in the future inspect the output members. LastWriteTime, for example, belongs to FileSystemInfo, not MatchInfo. Another developer may not understand at first where the property came from if the object references the MatchInfo type. It is generally a better design to create a new object with the merged properties.
That said, this is a minor issue which largely comes down to stylistic preference and whether this object might be consumed by others aside from you. I write modules and scripts that many other teams in my organization consume, so this is a concern for me. It may not be for you. @mklement0's answer is an excellent example of how to use calculated properties with Select-Object to achieve the same functional result as this answer.
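To see the type name in question, an illustrative check (Select-Object prefixes the original type name with "Selected."):

$result = Get-ChildItem -File -Recurse |
    Select-String -Pattern CYCLE |
    Select-Object Path, Line -First 1
$result.PSObject.TypeNames[0]   # -> Selected.Microsoft.PowerShell.Commands.MatchInfo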

Create Csv with loop and output

This basically works
foreach ($cprev in $CopyPreventeds) {
    Write-Host ("prevented copy $(($cprev)."Name")")
    $cprev | Select-Object Path, Name, Length, LastWrite, DestinationNewer |
        Export-Csv '.\prevented.csv' -NoTypeInformation
}
But only the last output is written to the csv. How could I write all contents to a new csv, while also showing output to the user, in PowerShell?
Maybe I'm missing something?
While I appreciate that a solution has already been proposed in the comments, I have to ask, given the narrow scope of the question: why are we using an obscure, albeit clever, technique? And/or repeatedly invoking Export-Csv?
The question doesn't mention sparing a variable. Moreover, there doesn't appear to be a need for the foreach loop.
$CopyPreventeds |
    Select-Object Path, Name, Length, LastWrite, DestinationNewer |
    Export-Csv '.\prevented.csv' -NoTypeInformation
In the above, $CopyPreventeds already exists and remains so, unmolested, after the export. You would need only to output it again for the benefit of an interactive user, all while taking advantage of PowerShell's intuitive pipeline and features.
Moreover, since the iteration variable $cprev isn't needed, you are down one variable as well.
Note: You don't need -Append because you are streaming into a single Export-Csv command, as opposed to repeatedly invoking it.
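For contrast, a loop-based version (roughly what the question attempted) would need -Append on every Export-Csv call, and would still be less efficient than the single streaming export above:

foreach ($cprev in $CopyPreventeds) {
    $cprev | Select-Object Path, Name, Length, LastWrite, DestinationNewer |
        Export-Csv '.\prevented.csv' -NoTypeInformation -Append
}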
There are at least 2 ways (probably many more) you could conveniently output to an interactive user.
1: Echo a header, something like "The following copies were prevented:", then echo the variable $CopyPreventeds, presumably as a table.
Note: Given the multiple points at which you seem interested in only a subset of properties, you may want to trim those objects beforehand:
$CopyPreventeds = $CopyPreventeds |
    Select-Object Path, Name, Length, LastWrite, DestinationNewer

$CopyPreventeds | Export-Csv '.\prevented.csv' -NoTypeInformation

Write-Host "The following copies were prevented:"
$CopyPreventeds | Format-Table -AutoSize | Out-Host
Note: More than 4 properties in a [PSCustomObject] (resulting from Select-Object) for which no custom formatting has been defined will by default output as a list, so Format-Table is used to overcome that. Out-Host is then used to prevent pipeline pollution.
2: Return to using a ForEach-Object loop for the output, between the Select-Object and the Export-Csv commands.
$CopyPreventeds |
    Select-Object Path, Name, Length, LastWrite, DestinationNewer |
    ForEach-Object {
        "Prevented Copy : {0}, {1}, {2}, {3}, {4}" -f $_.Path, $_.Name, $_.Length, $_.LastWrite, $_.DestinationNewer |
            Write-Host
        $_
    } |
    Export-Csv '.\prevented.csv' -NoTypeInformation
In this example, when you are done outputting to the screen (admittedly a little messy), you emit $_ from the loop, thus piping it to Export-Csv just the same.
Note: There are a number of ways to construct strings; I chose the -f operator here because it's a little cleaner than embedding numerous $() subexpressions. And, of course, this assumes you want the prefix on every line, which I personally think is gratuitous, so I'd choose something more like #1.

How does PowerShell lazily evaluate this statement?

I was searching for a way to read only the first few lines of a csv file and came across this answer. The accepted answer suggests using
Get-Content "C:\start.csv" | select -First 10 | Out-File "C:\stop.csv"
Another answer suggests using
Get-Content C:\Temp\Test.csv -TotalCount 3
Because my csv is fairly large I went with the second option. It worked fine. Out of curiosity, I decided to try the first option, assuming I could Ctrl+C if it took forever. I was surprised to see that it returned just as quickly.
Is it safe to use the first approach when working with large files? How does PowerShell achieve this?
Yes, Select-Object -First n is "safe" for large files, provided you want to read only a small number of lines, so that pipeline overhead is insignificant; otherwise, Get-Content -TotalCount n will be more efficient.
It works like break in a loop, exiting the pipeline early once the given number of items has been processed. Internally, it throws a special exception that the PowerShell pipeline machinery recognizes.
Here is a demonstration that "abuses" Select-Object to break from a ForEach-Object "loop", which is not possible using a normal break statement:
1..10 | ForEach-Object {
    Write-Host $_            # goes directly to the console, so is ignored by Select-Object
    if ($_ -ge 3) { $true }  # "break" by outputting one item
} | Select-Object -First 1 | Out-Null
Output:
1
2
3
As you can see, Select-Object -First n actually breaks the pipeline instead of first reading all input and then selecting only the specified number of items.
Another, more common use case is when you want to find only a single item in the output of a pipeline. Then it makes sense to exit from the pipeline as soon as you have found that item:
Get-ChildItem -Recurse | Where-Object { SomeCondition } | Select-Object -First 1
According to Microsoft, the Get-Content cmdlet has a parameter called -ReadCount. The documentation states:
Specifies how many lines of content are sent through the pipeline at a time. The default value is 1. A value of 0 (zero) sends all of the content at one time.
This parameter does not change the content displayed, but it does affect the time it takes to display the content. As the value of ReadCount increases, the time it takes to return the first line increases, but the total time for the operation decreases. This can make a perceptible difference in large items.
Since -ReadCount defaults to 1, Get-Content effectively acts as a generator, reading a file line by line.
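Note that raising -ReadCount changes the shape of the output: lines are then emitted in batches (arrays), which interacts with downstream cmdlets such as Select-Object. A quick sketch (the file path is just a placeholder):

# Default (-ReadCount 1): each line is sent separately, so -First 3 selects 3 lines.
Get-Content C:\Temp\Test.csv | Select-Object -First 3

# With -ReadCount 1000, lines arrive in arrays of up to 1000 lines, so -First 3
# would select 3 *batches*; re-enumerate them to get back individual lines.
Get-Content C:\Temp\Test.csv -ReadCount 1000 |
    ForEach-Object { $_ } |
    Select-Object -First 3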

Reformat column names in a csv with PowerShell

Question
How do I reformat an unknown CSV column name according to a formula or subroutine (e.g. rename column " Arbitrary Column Name " to "Arbitrary Column Name" by running a trim or regex or something) while maintaining data?
Goal
I'm trying to more or less sanitize columns (the names) in a hand-produced (or at least hand-edited) csv file that needs to be processed by an existing PowerShell script. In this specific case, the columns have spaces that would be removed by a call to [String]::Trim(), or which could be ignored with an appropriate regex, but I can't figure a way to call or use those techniques when importing or processing a CSV.
Short Background
Most files and columns have historically been entered into the CSV properly, but recently a few columns were being dropped during processing; I determined it was because the column names contained a space (e.g., Select-Object was being told to get "RFC", but Import-Csv retrieved "RFC ", so no matchy-matchy). Telling the customer to enter it correctly by hand (though preferred and much simpler) is not an option in this case.
Options considered
I could manually process the text of the file, but that is a messy and error-prone way to reinvent the wheel. I wonder if there's a syntax with Select-Object that would allow a softer match for column names, but I can't find that info.
The closest I have come conceptually is using a calculated property in the call to Select-Object to rename the column, but I can only find ways to rename a known column to another known column. So, this would require enumerating the columns and matching them exactly (preferred) or a softer match (like comparing after trimming or matching via regex as a fallback) with expected column names, then creating a collection of name mappings to use in constructing calculated properties from that information to select into a new object.
That seems like it would work, but it's more work than I'd prefer, and I can't help but hope there's a simpler way I haven't been able to find via Google. Maybe I should try Bing?
Sample File
Let's say you have a file.csv like this:
" RFC "
"1"
"2"
"3"
Code
Now try to run the following:
$CSV = Get-Content file.csv -First 2 | ConvertFrom-Csv
$FixedHeaders = $CSV.PSObject.Properties.Name.Trim(' ')
Import-Csv file.csv -Header $FixedHeaders |
    Select-Object -Skip 1 -Property RFC
Output
You will get this output:
RFC
---
1
2
3
Explanation
First, we use Get-Content with the -First 2 parameter to get the first two lines. Piping to ConvertFrom-Csv allows us to access the headers via PSObject.Properties.Name. Then we use Import-Csv with the -Header parameter to apply the trimmed headers, and pipe to Select-Object with -Skip 1 to skip the original header row.
I'm not sure about comparisons in terms of efficiency, but I think this is a little more hardened, and it imports the CSV only once. You might be able to use @lahell's approach and Get-Content -Raw, but this was done and it works, so I'm gonna leave it to the community to determine which is better...
# Import the CSV
$rawCSV = Import-Csv $Path

# Get the actual header names and map them to their reformatted versions
$CSVColumns = @{}
$rawCSV |
    Get-Member |
    Where-Object { $_.MemberType -eq "NoteProperty" } |
    Select-Object -ExpandProperty Name |
    ForEach-Object {
        # Add a mapping to the original from a trimmed and whitespace-reduced version of the original
        $CSVColumns.Add(($_.Trim() -replace '(\s)\s+', '$1'), "$_")
    }

# Create the array of names and calculated properties to pass to Select-Object
$SelectColumns = @()
$CSVColumns.GetEnumerator() |
    ForEach-Object {
        $SelectColumns += if ($CSVColumns.Values -contains $_.Key) { $_.Key }
                          else { @{ Name = $_.Key; Expression = $CSVColumns[$_.Key] } }
    }

$FormattedCSV = $rawCSV | Select-Object $SelectColumns
This was hand-copied to a computer where I don't have the rights to run it, so there might be an error - I tried to copy it correctly.
You can use gocsv (https://github.com/DataFoxCo/gocsv) to see the headers of the csv; you can then rename the headers, behead the file, swap columns, join, merge - any number of transformations you want.