How does PowerShell lazily evaluate this statement?

I was searching for a way to read only the first few lines of a CSV file and came across this answer. The accepted answer suggests using
Get-Content "C:\start.csv" | select -First 10 | Out-File "C:\stop.csv"
Another answer suggests using
Get-Content C:\Temp\Test.csv -TotalCount 3
Because my CSV is fairly large, I went with the second option. It worked fine. Out of curiosity I decided to try the first option, assuming I could press Ctrl+C if it took forever. I was surprised to see that it returned just as quickly.
Is it safe to use the first approach when working with large files? How does PowerShell achieve this?

Yes, Select-Object -First n is "safe" for large files (provided you want to read only a small number of lines, so pipeline overhead will be insignificant, else Get-Content -TotalCount n will be more efficient).
It works like break in a loop, by exiting the pipeline early, when the given number of items have been processed. Internally it throws a special exception that the PowerShell pipeline machinery recognizes.
Here is a demonstration that "abuses" Select-Object to break from a ForEach-Object "loop", which is not possible using a normal break statement.
1..10 | ForEach-Object {
    Write-Host $_                 # goes directly to console, so is ignored by Select-Object
    if( $_ -ge 3 ) { $true }      # "break" by outputting one item
} | Select-Object -First 1 | Out-Null
Output:
1
2
3
As you can see, Select-Object -First n actually breaks the pipeline instead of first reading all input and then selecting only the specified number of items.
Another, more common use case is when you want to find only a single item in the output of a pipeline. Then it makes sense to exit from the pipeline as soon as you have found that item:
Get-ChildItem -Recurse | Where-Object { SomeCondition } | Select-Object -First 1
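One rough way to observe the early exit described above is to count how many items the upstream stage actually handles. This is only a sketch, assuming PowerShell 3.0 or later; the variable names $produced and $firstThree are made up for illustration.

```powershell
# Counter incremented by the upstream stage. ForEach-Object script blocks
# run in the caller's scope, so this variable is updated directly.
$produced = 0

$firstThree = 1..1000000 |
    ForEach-Object { $produced++; $_ } |   # count each number, then pass it on
    Select-Object -First 3                 # stops the pipeline after 3 items

"Selected: $($firstThree -join ', ')"
"Items processed upstream: $produced"
```

If Select-Object had to read all of its input first, the counter would end up at one million rather than at the number of selected items.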

According to Microsoft the Get-Content cmdlet has a parameter called -ReadCount. Their documentation states
Specifies how many lines of content are sent through the pipeline at a time. The default value is 1. A value of 0 (zero) sends all of the content at one time.
This parameter does not change the content displayed, but it does affect the time it takes to display the content. As the value of ReadCount increases, the time it takes to return the first line increases, but the total time for the operation decreases. This can make a perceptible difference in large items.
Since -ReadCount defaults to 1, Get-Content effectively acts as a generator that reads a file line by line.
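To make the effect of -ReadCount visible, the following sketch writes ten lines to a temporary file and records how many lines arrive in each pipeline object. The file and the variable names ($tmp, $perLine, $batched) are assumptions for illustration only.

```powershell
# Create a throwaway file with 10 lines.
$tmp = [IO.Path]::GetTempFileName()
1..10 | ForEach-Object { "line $_" } | Set-Content -LiteralPath $tmp

# Default (-ReadCount 1): each pipeline object is a single line.
$perLine = Get-Content -LiteralPath $tmp | ForEach-Object { @($_).Count }

# -ReadCount 5: each pipeline object is an array of up to 5 lines.
$batched = Get-Content -LiteralPath $tmp -ReadCount 5 | ForEach-Object { @($_).Count }

"Objects with default ReadCount: $(@($perLine).Count)"
"Batch sizes with -ReadCount 5:  $($batched -join ', ')"

Remove-Item -LiteralPath $tmp
```

Larger batches mean fewer trips through the pipeline, which is why the total time drops while the first line takes longer to appear.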

Related

Powershell - Return Line or Row number from input file

I found an answer to a previous question incredibly helpful, but I can't quite figure out how Get-Content is able to store the 'line number' from the input.
Basically I'm wondering whether PSObjects store information such as a line number or row number. In the example below, Get-Content appears to store the line number in a way you can use later: in the pipeline, it is available as $_.psobject.Properties.value[5]
A bit of that seems redundant to me since $_ is an object (I think), but still it is very cool that .value[5] seems to be the line number or row number. The same is not true of Import-CSV and while I'm looking for a similar option with Import-CSV; I'd like to better understand why this works the way it does.
https://stackoverflow.com/a/23119235/15243610
Get-Content $colCnt | ?{$_} | Select -Skip 1 | %{if(!($_.split("|").Count -eq 210)){"Process stopped at line number $($_.psobject.Properties.value[5]), incorrect column count of: $($_.split("|").Count).";break}}
The answer in the other question works because Get-Content does indeed include the line number when it reads in the strings. When you run Get-Content each line will have a $_.ReadCount property as the 6th property on the object, which in my old answer I referenced in the PSObject for it as $_.psobject.Properties.value[5] (it was 7 years ago and I didn't know better yet, sorry). Mind you, if you use the -ReadCount parameter it will send that many lines through at a time, so Get-Content $file -readcount 5 | Select -first 1 | ForEach-Object{ $_.ReadCount } will come out as 5. Also -Raw sends everything through at once so it won't work with that.
Honestly, this isn't that hard to adapt to Import-Csv, we just increment a variable defined in the ForEach-Object loop.
Import-Csv C:\Path\To\SomeFile.csv | ForEach-Object -Begin {$x=1} -Process {
    If($_.Something -eq $SomethingElse){
        Write-Warning "Somethin' bad happened on line $x!"
        break
    }else{$_}
    $x++
}
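As a small self-contained illustration of both ideas, the sketch below reads the ReadCount property that Get-Content attaches to each line, and then numbers Import-Csv rows with a manual counter. The sample data and variable names are invented; the counter starts at 1 on the assumption that the header occupies line 1 of the file.

```powershell
$tmp = [IO.Path]::GetTempFileName()
'Name,Qty', 'apple,1', 'pear,2' | Set-Content -LiteralPath $tmp

# Get-Content decorates every line with a ReadCount note property.
$numbered = Get-Content -LiteralPath $tmp |
    ForEach-Object { '{0}: {1}' -f $_.ReadCount, $_ }
$numbered

# Import-Csv has no such property, so keep a counter ourselves.
$line = 1   # assumption: the header is line 1, so data starts at line 2
$rows = Import-Csv -LiteralPath $tmp | ForEach-Object {
    $line++
    [pscustomobject]@{ Line = $line; Name = $_.Name }
}
$rows | ForEach-Object { "line $($_.Line): $($_.Name)" }

Remove-Item -LiteralPath $tmp
```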

Powershell Performance tuning for aggregation operation on big delimited files

I have a delimited file with 350 columns. The delimiter is \034(Field separator).
I have to extract a particular column value and find out the count of each distinct value of that column in the file. If the count of distinct value is greater or equal to 2, I need to output it to a file.
The source file is 1GB. I have written the following command. It is very slow.
Get-Content E:\Test\test.txt | Foreach {($_ -split '\034')[117]} | Group-Object -Property { $_ } | %{ if($_.Count -ge 2) { Select-Object -InputObject $_ -Property Name,Count} } | Export-csv -Path "E:\Test\test2.csv" -NoTypeInformation
Please help!
I suggest using a switch statement to process the input file quickly (by PowerShell standards):
# Get an array of all the column values of interest.
$allColValues = switch -File E:\Test\test.txt {
    default { # each input line
        # For better performance with *literal* separators,
        # use the .Split() *method*.
        # Generally, however, use of the *regex*-based -split *operator* is preferable.
        $_.Split([char] 0x1c)[117] # hex 0x1c is octal 034
    }
}
# Group the column values, and only output those that occur at least
# twice.
$allColValues | Group-Object -NoElement | Where-Object Count -ge 2 |
    Select-Object Name, Count | Export-Csv E:\Test\test2.csv -NoTypeInformation
Tip of the hat to Mathias R. Jessen for suggesting the -NoElement switch, which streamlines the Group-Object call by only maintaining abstract group information; that is, only the grouping criteria (as reflected in .Name), not also the individual objects that make up the group (as normally reflected in .Group), are returned via the output objects.
As for what you tried:
Get-Content with line-by-line streaming in the pipeline is slow, both generally (the object-by-object passing introduces overhead) and, specifically, because Get-Content decorates each line it outputs with ETS (Extended Type System) metadata.
GitHub issue #7537 proposes adding a way to opt-out of this decoration.
At the expense of memory consumption and potentially additional work for line-splitting, the -Raw switch reads the entire file as a single, multi-line string, which is much faster.
Passing -Property { $_ } to Group-Object isn't necessary - just omit it. Without a -Property argument, the input objects are grouped as a whole.
Chaining Where-Object and Select-Object - rather than filtering via an if statement in a ForEach-Object call combined with multiple Select-Object calls - is not only conceptually clearer, but performs better.
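The same technique can be tried on a few in-memory lines before pointing it at a 1 GB file. This is a minimal sketch with made-up sample data: it uses a two-column layout and column index 1 instead of the real 350 columns and index 117, and iterates an array with switch rather than using switch -File.

```powershell
$sep = [char] 0x1c   # the field separator (octal 034)

# Hypothetical sample lines: two columns each, separated by 0x1c.
$lines = "a${sep}x", "b${sep}y", "c${sep}x"

# switch enumerates the array; the default branch runs once per line.
$colValues = switch ($lines) {
    default { $_.Split($sep)[1] }
}

# Keep only values that occur at least twice.
$dupes = $colValues | Group-Object -NoElement | Where-Object Count -ge 2 |
    Select-Object Name, Count

$dupes | ForEach-Object { "$($_.Name) occurs $($_.Count) times" }
```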

Powershell "if more than one, then delete all but one"

Is there a way to do something like this in Powershell:
"If more than one file includes a certain set of text, delete all but one"
Example:
"...Cam1....jpg"
"...Cam2....jpg"
"...Cam2....jpg"
"...Cam3....jpg"
Then I would want one of the two "...Cam2....jpg" deleted, while the other one should stay.
I know that I can use something like
gci *Cam2* | del
but I don't know how I can make one of these files stay.
Also, for this to work, I need to look through all the files to see if there are any duplicates, which defeats the purpose of automating this process with a Powershell script.
I searched for a solution to this for a long time, but I just can't find something that is applicable to my scenario.
Get the list of files into a collection and use the range operator to select a subset of its elements. To remove all but the first element, start from index one and stop at the last valid index (Count - 1). Like so:
$cams = @(gci "*cam2*")   # @(...) ensures an array even if only one file matches
if($cams.Count -gt 1) {
    $cams[1..($cams.Count - 1)] | remove-item
}
Expanding on the idea of commenter boxdog:
# Find all duplicately named files.
$dupes = Get-ChildItem c:\test -file -recurse | Group-Object Name | Where-Object Count -gt 1
# Delete all duplicates except the 1st one per group.
$dupes | ForEach-Object { $_.Group | Select-Object -Skip 1 | Remove-Item -Force }
I've split this up into two sub tasks to make it easier to understand. Also it is a good idea to always separate directory iteration from file deletion, to avoid inconsistent results.
The first statement uses Group-Object to group files by name. It outputs a Count property containing the number of files per group. Then Where-Object is used to get only groups that contain more than one file, which will be the dupes. The result is stored in variable $dupes, which is an array that looks like this:
Count Name      Group
----- ----      -----
    2 file1.txt {C:\test\subdir1\file1.txt, C:\test\subdir2\file1.txt}
    2 file2.txt {C:\test\subdir1\file2.txt, C:\test\subdir2\file2.txt}
The second statement uses ForEach-Object to iterate over all groups of duplicates. From the Group-Object call of the 1st statement we got a Group property that contains an array of file information objects. Using Select-Object -Skip 1 we select all but the 1st element of this array, and these are passed to Remove-Item to delete the files.
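Before deleting anything for real, the two statements can be rehearsed in a throwaway directory. The sketch below is a dry run under stated assumptions: the directory layout and file names are invented, and -WhatIf makes Remove-Item only report what it would delete.

```powershell
# Build a temporary tree with one duplicated file name and one unique file.
$root = Join-Path ([IO.Path]::GetTempPath()) ("dupe-demo-" + [guid]::NewGuid())
foreach ($sub in 'subdir1', 'subdir2') {
    $dir = Join-Path $root $sub
    New-Item -ItemType Directory -Path $dir -Force | Out-Null
    Set-Content -LiteralPath (Join-Path $dir 'file1.txt') -Value "copy in $sub"
}
Set-Content -LiteralPath (Join-Path $root 'unique.txt') -Value 'no duplicate'

# Statement 1: find names that occur more than once.
$dupes = Get-ChildItem $root -File -Recurse | Group-Object Name | Where-Object Count -gt 1

# Statement 2: everything except the first file of each group.
$toDelete = $dupes | ForEach-Object { $_.Group | Select-Object -Skip 1 }
$toDelete | Remove-Item -WhatIf   # dry run: nothing is actually deleted

# Clean up the demo directory.
Remove-Item -LiteralPath $root -Recurse -Force
```

Dropping -WhatIf turns the rehearsal into the actual deletion.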

PowerShell: How to get count of piped collection?

Suppose I have a process that generates a collection of objects. For a very simple example, consider $(1 | get-member). I can get the number of objects generated:
PS C:\WINDOWS\system32> $(1 | get-member).count
21
or I can do something with those objects.
PS C:\WINDOWS\system32> $(1 | get-member) | ForEach-object {write-host $_.name}
CompareTo
Equals
...
With only 21 objects, doing the above is no problem. But what if the process generates hundreds of thousands of objects? Then I don't want to run the process once just to count the objects and then run it again to execute what I want to do with them. So how can I get a count of objects in a collection sent down the pipeline?
A similar question was asked before, and the accepted answer was to use a counter variable inside the script block that works on the collection. The problem is that I already have that counter and what I want is to check that the outcome of that counter is correct. So I don't want to just count inside the script block. I want a separate, independent measure of the size of the collection that I sent down the pipeline. How can I do that?
If processing and counting is needed:
Doing your own counting inside a ForEach-Object script block is your best bet to avoid processing in two passes.
The problem is that I already have that counter and what I want is to check that the outcome of that counter is correct.
ForEach-Object is reliably invoked for each and every input object, including $null values, so there should be no need to double-check.
If you want a cleaner separation of processing and counting, you can pass multiple -Process script blocks to ForEach-Object (in this example, { $_ + 1 } is the input-processing script block and { ++$count } is the input-counting one):
PS> 1..5 | ForEach-Object -Begin { $count = 0 } `
                          -Process { $_ + 1 }, { ++$count } `
                          -End { "--- count: $count" }
2
3
4
5
6
--- count: 5
Note that, due to a quirk in ForEach-Object's parameter binding, passing -Begin and -End script blocks is actually required in order to pass multiple -Process (per-input-object) blocks; pass $null if you don't actually need -Begin and/or -End - see GitHub issue #4513.
Also note that the $count variable lives in the caller's scope, and is not scoped to the ForEach-Object call; that is, $count = 0 potentially updates a preexisting $count variable, and, if it didn't previously exist, lives on after the ForEach-Object call.
If only counting is needed:
Measure-Object is the cmdlet to use with large, streaming input collections in the pipeline[1]:
The following example generates 100,000 integers one by one and has Measure-Object count them one by one, without collecting the entire input in memory.
PS> (& { $i=0; while ($i -lt 1e5) { (++$i) } } | Measure-Object).Count
100000
Caveat: Measure-Object ignores $null values in the input collection - see GitHub issue #10905.
Note that while counting input objects is Measure-Object's default behavior, it supports a variety of other operations as well, such as summing (-Sum) and averaging (-Average), optionally combined in a single invocation.
[1] Measure-Object, as a cmdlet, is capable of processing input in a streaming fashion, meaning it counts objects it receives one by one, as they're being received, which means that even very large streaming input sets (those also created one by one, such as enumerating the rows of a large CSV file with Import-Csv) can be processed without the risk of running out of memory - there is no need to load the input collection as a whole into memory. However, if (a) the input collection already is in memory, or (b) it can fit into memory and performance is important, then use (...).Count.
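As a small illustration of combining operations in one streaming pass, the sketch below feeds five generated integers to Measure-Object with both -Sum and -Average. The variable name $stats is arbitrary.

```powershell
# Generate 1..5 one by one and measure them in a single pass.
$stats = & { $i = 0; while ($i -lt 5) { (++$i) } } |
    Measure-Object -Sum -Average

"Count: $($stats.Count), Sum: $($stats.Sum), Average: $($stats.Average)"
```

The Count property is filled in regardless of which additional measurements are requested, so it can serve as the independent check the question asks for.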

Delete a file, if it is empty except for a header row

I am trying to write a PowerShell script to delete a file if it's empty, apart from the header.
postanote's answer provides some useful background information on the use of the Measure-Object cmdlet.
In the case at hand, however, it's simpler and faster to use the following:
$file = 'C:\path\to\FileOfInterest'
if ((Get-Content -First 2 $file).Count -le 1) {
    Remove-Item $file
}
Get-Content -First 2 $file returns up to 2 lines from the start of file $file, as an array.
Note: -First is a more descriptive alias for the -TotalCount parameter; in PowerShell v2, use the latter.
(...).Count counts the elements of that array, i.e., the number of lines actually read.[1]
-le 1 (-le meaning less-than-or-equal) returns $true if, despite asking for 2 lines, only 0 or 1 are returned.
The Remove-Item call then removes file $file.
[1] Up to PowerShell version 2, .Count would return $null if only 1 line had been read, because PowerShell returns a single output object as-is instead of wrapping it in a single-element array. However, since $null is coerced to 0 in a numerical comparison such as with -le, this solution works in v2 as well. PowerShell versions 3 and higher implicitly implement a .Count property even on scalars (single objects), which - sensibly - returns 1.
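The check can be tried on temporary files before using it on real data. In this sketch the helper name Test-HeaderOnly and the sample contents are hypothetical; wrapping the call in @(...) is an extra precaution so that .Count behaves the same for zero, one, or two lines.

```powershell
function Test-HeaderOnly {
    param([string] $Path)
    # Read at most 2 lines; a header-only (or empty) file yields 0 or 1.
    @(Get-Content -LiteralPath $Path -TotalCount 2).Count -le 1
}

# Two throwaway files: one with only a header, one with a data row.
$headerOnly = [IO.Path]::GetTempFileName()
$withData   = [IO.Path]::GetTempFileName()
Set-Content -LiteralPath $headerOnly -Value 'Id,Name'
Set-Content -LiteralPath $withData   -Value 'Id,Name', '1,Alice'

foreach ($file in $headerOnly, $withData) {
    if (Test-HeaderOnly -Path $file) {
        "header only, removing: $file"
        Remove-Item -LiteralPath $file
    }
    else {
        "has data, keeping: $file"
    }
}

$headerOnlyExists = Test-Path -LiteralPath $headerOnly
$withDataExists   = Test-Path -LiteralPath $withData
Remove-Item -LiteralPath $withData   # clean up the remaining demo file
```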
Agreed Olaf...
Khader - What did you search for? There are samples of how to count lines in a file all over the web.
Just search for 'powershell count lines in file'
Example hits.
Use a PowerShell Cmdlet to Count Files, Words, and Lines
How to count number of lines and words in a file using Powershell?
If I want to know how many lines are contained in the file, I use the
Measure-Object cmdlet with the line switch. This command is shown
here:
Get-Content C:\fso\a.txt | Measure-Object -Line
If I need to know the number of characters, I use the character
switch:
Get-Content C:\fso\a.txt | Measure-Object -Character
There is also a words switched parameter that will return the number
of words in the text file. It is used similarly to the character or
line switched parameter. The command is shown here:
Get-Content C:\fso\a.txt | Measure-Object -Word
In the following figure, I use the Measure-Object cmdlet to count
lines; then lines and characters; and finally lines, characters, and
words. These commands illustrate combining the switches to return
specific information.
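Since the figure itself is not reproduced here, the following sketch shows roughly what combining the switches looks like. It measures two in-memory strings instead of C:\fso\a.txt, and the sample text and variable name $m are made up.

```powershell
# Two sample "lines" piped in place of a file's content.
$m = 'one two', 'three' | Measure-Object -Line -Word -Character

"Lines: $($m.Lines)"
"Words: $($m.Words)"
"Characters: $($m.Characters)"
```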
Update for OP.
You should have updated your original question for context rather than putting your code in a comment.
As for …
Is there any way I can return just the count and use it with an if
statement to check if it is equal to 1, and then del the file
Just use an if statement that checks whether the 'Lines' count is at most 1, so that only files with nothing beyond the header are deleted:
If (Get-Content $_.FullName | Measure-Object -Line | Where-Object -Property Lines -le 1)
{
    'Count is at most one'
    Remove-Item ...
}
Again, this is very basic PowerShell overview stuff, so it's prudent you take Olaf's suggestion to limit future confusion, frustrations, misconceptions and errors you are going to encounter.