If I have a long file with lots of lines of varying lengths, how can I count the occurrences of each line length?
Example:
this
is
a
sample
file
with
several
lines
of
varying
length
Output:
Length Occurrences
1 1
2 2
4 3
5 1
6 2
7 2
Have you got any ideas?
For high-volume work, use Get-Content with -ReadCount:
$ht = @{}
Get-Content <file> -ReadCount 1000 |
foreach {
foreach ($line in $_)
{$ht[$line.length]++}
}
$ht.GetEnumerator() | sort Name
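To see why the nested foreach is needed, here is a minimal sketch. The temporary file and the batch size of 2 are assumptions made only for illustration. With -ReadCount greater than 1, each pipeline object is a batch of lines rather than a single line.

```powershell
# Hypothetical sample file, created only for this illustration.
$tmp = New-TemporaryFile
'this', 'is', 'a', 'sample', 'file' | Set-Content -Path $tmp.FullName

# With -ReadCount 2, $_ is a batch of up to two lines, so record each batch size.
$batchSizes = @(Get-Content -Path $tmp.FullName -ReadCount 2 |
    ForEach-Object { $_.Count })

Remove-Item -Path $tmp.FullName
$batchSizes -join ','
```

Larger batches mean fewer pipeline objects, which is where the speed-up for big files would be expected to come from.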
How about
get-content <file> | Group-Object -Property Length | sort -Property Name
Depending on how long your file is, you may want to do something more efficient.
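One thing to keep in mind with the Group-Object approach is that the Name of each group is a string, so sorting by Name orders lengths as text ("10" before "2"). A hedged sketch that sorts numerically and uses the column names from the question follows. The in-memory $lines array is an assumption that stands in for the file content.

```powershell
# Sample lines from the question; in practice they would come from Get-Content.
$lines = 'this', 'is', 'a', 'sample', 'file', 'with', 'several', 'lines', 'of', 'varying', 'length'

$table = $lines |
    Group-Object -Property Length -NoElement |
    Sort-Object -Property { [int]$_.Name } |
    Select-Object @{ Name = 'Length'; Expression = { [int]$_.Name } },
                  @{ Name = 'Occurrences'; Expression = { $_.Count } }

$table
```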
Thank you for taking the time to read this and maybe help me!
I was doing an assignment that counts the characters in a document, but now I want to see whether I can count the characters without counting the first 3 pages of the document.
I did some research but couldn't find much about it, since I am fairly new to PowerShell.
Clear-Host
$b = Read-Host 'Enter destination folder' # Asks the user to enter the destination folder
Get-Content -Path $b | Measure -Line -Word -Character | Out-File C:\Users\TimHen\Desktop\output.txt # Counts lines, words and characters in the document.
# Prints vowels and consonants
Judging from your code, I'd say you're reading a text file. A text file doesn't have pages, but what you could do is skip the first x amount of lines.
Get-Content $b | Select-Object -Skip 160 | Measure -Line -Word -Character | Out-File C:\Users\TimHen\Desktop\output.txt
Another possibility (but not really applicable in your scenario) is to use the Tail parameter of Get-Content. That will give you the x last lines of the file.
Get-Content $b -Tail 3000 | Measure -Line -Word -Character | Out-File C:\Users\TimHen\Desktop\output.txt
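The effect of both variants can be tried on a small in-memory sample. This is only a sketch: the ten generated lines are an assumption standing in for the file, and Select-Object -Last is used as the in-memory counterpart of Get-Content -Tail.

```powershell
# Ten sample lines standing in for the file content (an assumption for illustration).
$text = 1..10 | ForEach-Object { "line $_ of text" }

# Skip the first 3 lines, then measure what is left.
$afterSkip = $text | Select-Object -Skip 3 | Measure-Object -Line -Word -Character

# In-memory counterpart of Get-Content -Tail: keep only the last 4 lines.
$lastFour = $text | Select-Object -Last 4 | Measure-Object -Line -Word -Character

$afterSkip, $lastFour | Format-Table Lines, Words, Characters
```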
I am trying to get a list of files and a count of the number of rows in each file displayed in a table consisting of two columns, Name and Lines.
I have tried using Format-Table, but I think the problem is less about the table format and more about my results being separate objects. See below.
#Get a list of files in the filepath location
$files = Get-ChildItem $filepath
$files | ForEach-Object { $_ ; $_ | Get-Content | Measure-Object -Line} | Format-Table Name,Lines
Expected results
Name   Lines
File A 9
File B 89
Actual Results
Name Lines
File A
9
File B
89
Another way to build a custom object like this is to use PowerShell's calculated properties:
$files | Select-Object -Property @{ N = 'Name' ; E = { $_.Name } },
                                 @{ N = 'Lines'; E = { ($_ | Get-Content | Measure-Object -Line).Lines } }
Name Lines
---- -----
dotNetEnumClass.ps1 232
DotNetVersions.ps1 9
dotNETversionTable.ps1 64
Typically you would make a custom object like this, instead of outputting two different kinds of objects.
$files | ForEach-Object {
    $lines = $_ | Get-Content | Measure-Object -Line
    [pscustomobject]@{
        name  = $_.name
        lines = $lines.lines
    }
}
name lines
---- -----
rof.ps1 11
rof.ps1~ 7
wai.ps1 2
wai.ps1~ 1
Donald Knuth was once given the task of writing a literate program that computes the word frequencies of a file:
Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.
Doug McIlroy famously rewrote the 10 pages of Pascal in a few lines of sh:
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
As a little exercise, I converted this to PowerShell:
(-split ((Get-Content -Raw test.txt).ToLower() -replace '[^a-zA-Z]',' ')) |
Group-Object |
Sort-Object -Property count -Descending |
Select-Object -First $Args[0] |
Format-Table count, name
I like that PowerShell combines sort | uniq -c into a single Group-Object.
The first line looks ugly, so I wonder whether it can be written more elegantly. Maybe there is a way to load the file with a regex delimiter?
One obvious way to shorten the code would be to use aliases, but that does not help readability.
I would do it this way.
PS C:\users\me> Get-Content words.txt
One one
two
two
three,three.
two;two
PS C:\users\me> (Get-Content words.txt) -Split '\W' | Group-Object
Count Name Group
----- ---- -----
2 One {One, one}
4 two {two, two, two, two}
2 three {three, three}
1 {}
EDIT: Some code from Bruce Payette's Windows PowerShell in Action
# top 10 most frequent words, hash table
$s = gc songlist.txt
$s = [string]::join(" ", $s)
$words = $s.Split(" `t", [stringsplitoptions]::RemoveEmptyEntries)
$uniq = $words | sort -Unique
$words | % {$h=@{}} {$h[$_] += 1}
$frequency = $h.keys | sort {$h[$_]}
-1..-10 | %{ $frequency[$_]+" "+$h[$frequency[$_]]}
# or
$grouped = $words | group | sort count
$grouped[-1..-10]
Thanks js2010 and LotPings for important hints. To document what is probably the best solution:
$Input -split '\W+' |
Group-Object -NoElement |
Sort-Object count -Descending |
Select-Object -First $Args[0]
Things I learned:
$Input contains stdin. This is closer to McIlroy's code than Get-Content on some file.
-split can actually take regex delimiters.
The -NoElement parameter lets me get rid of the Format-Table line.
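The last two points can be tried on an in-memory string. This is a minimal sketch, not the poster's script: the sample text, the fixed count of 2, and the Where-Object filter for empty tokens are assumptions added for illustration.

```powershell
# Sample text standing in for stdin (an assumption for illustration).
$text = 'One one two, two; three. Two'

$top = $text -split '\W+' |               # regex delimiter: runs of non-word characters
    Where-Object { $_ } |                 # drop empty tokens, if any
    Group-Object -NoElement |             # case-insensitive by default; no Group column
    Sort-Object -Property Count -Descending |
    Select-Object -First 2

$top
```

Because -NoElement leaves only Count and Name on each group, the default table output already has the two wanted columns.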
Windows 10 64-bit. PowerShell 5
How to find which whole word (the, but not -the- or weather), regardless of case, is used most frequently in a text file, and how many times it is used, using PowerShell:
Replace 1.txt with your file.
$z = gc 1.txt -raw
-split $z | group -n | sort c* | select -l 1
Results:
Count Name
----- ----
30 THE
I am using PowerShell to collect lists of names from multiple text files. Many of the names in these files are similar or repeated. I am trying to ensure that PowerShell returns a single text file with all of the unique items. Looking at the data, the script appears to gather only 271 of the 296 unique items. I'm guessing that some of the data is being flagged as duplicate when it shouldn't be. Any suggestions?
#Take content of each file (all names) and add unique values to text file
#for each unique value, create a row & check to see which txt files contain
function List {
$nofiles = Read-Host "How many files are we pulling from?"
$data = @()
for ($i = 0;$i -lt $nofiles; $i++)
{
$data += Read-Host "Give me the file name for file # $($i+1)"
}
return $data
}
function Aggregate ($array) {
Get-Content $array | Sort-Object -unique | Out-File newaggregate.txt
}
#SCRIPT BODY
$data = List
aggregate ($data)
I was expecting this code to catch everything, but it's missing some items that look very similar. List of missing names and their similar match:
CORPINZUTL16 MISSING FROM OUTFILE
CORPINZTRACE MISSING FROM OUTFILE
CORPINZADMIN Found In File
I have about 20 examples like this one. Apparently Sort-Object -Unique is not checking every character in a line. Can anyone recommend a better way of checking each line, or of forcing the comparison to use the full names?
Just for demonstration, this line creates 3 txt files with numbers:
for($i=1;$i -lt 4;$i++){set-content -path "$i.txt" -value ($i..$($i+7))}
1.txt | 2.txt | 3.txt | newaggregate.txt
1 | | | 1
2 | 2 | | 2
3 | 3 | 3 | 3
4 | 4 | 4 | 4
5 | 5 | 5 | 5
6 | 6 | 6 | 6
7 | 7 | 7 | 7
8 | 8 | 8 | 8
| 9 | 9 | 9
| | 10 | 10
Here, using Get-Content with a wildcard range [1-3] of files:
Get-Content [1-3].txt | Sort-Object {[int]$_} -Unique | Out-File newaggregate.txt
$All = Get-Content .\newaggregate.txt
foreach ($file in (Get-ChildItem [1-3].txt)){
Compare-Object $All (Get-Content $file.FullName) |
        Select-Object @{n='File';e={$File}},
                      @{n="Missing";e={$_.InputObject}} -ExcludeProperty SideIndicator
}
File Missing
---- -------
Q:\Test\2019\05\07\1.txt 9
Q:\Test\2019\05\07\1.txt 10
Q:\Test\2019\05\07\2.txt 1
Q:\Test\2019\05\07\2.txt 10
Q:\Test\2019\05\07\3.txt 1
Q:\Test\2019\05\07\3.txt 2
There are two ways to achieve this. One is Select-Object -Unique, which works when the data is not sorted and is suitable for small data sets or lists.
When dealing with large files, we can use the Get-Unique command, which requires sorted input; if the input is not sorted, it gives wrong results.
Get-ChildItem *.txt | Get-Content | measure -Line #225949
Get-ChildItem *.txt | Get-Content | sort | Get-Unique | measure -Line #119650
Here is my command for multiple files:
Get-ChildItem *.txt | Get-Content | sort | Get-Unique >> Unique.txt
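The difference between the two approaches described above can be seen on a small in-memory list. This is only a sketch: the five sample names are an assumption for illustration.

```powershell
$names = 'b', 'a', 'b', 'c', 'a'

# Get-Unique only drops an item when it equals the one right before it,
# so on this unsorted input nothing is removed.
$unsortedGetUnique = @($names | Get-Unique)

# Sorting first puts duplicates next to each other, so Get-Unique can remove them.
$sortedGetUnique = @($names | Sort-Object | Get-Unique)

# Select-Object -Unique does not need sorted input and keeps first-seen order.
$selectUnique = @($names | Select-Object -Unique)

"Get-Unique, unsorted input: $($unsortedGetUnique -join ' ')"
"Sort-Object | Get-Unique:   $($sortedGetUnique -join ' ')"
"Select-Object -Unique:      $($selectUnique -join ' ')"
```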
Looking for a PowerShell script that looks in a text file for rows that have too many (or too few) tabs.
I found this PowerShell script that does exactly what I want (almost).
This counts the number of tabs per row:
Get-Content test.txt | ForEach-Object {
($_ | Select-String `t -all).matches | Measure-Object | Select-Object count
}
Can someone extend/modify/re-write this to return only the rows (with row numbers) that have more than, or less than, X number of tabs per row?
Don't use Get-Content before piping to Select-String; you'll lose contextual information about each line.
Instead, use the -Path parameter with Select-String:
$Tabs = Select-String -Path .\test.txt -Pattern "`t" -AllMatches
$Tabs | Select-Object LineNumber,Line,@{Name='TabCount';Expression={ $_.Matches.Count }}
To return only the lines where the number of tabs is greater than $x, use Where-Object:
$x = 3
$Tabs | Where-Object { $_.Matches.Count -gt $x } | Select-Object -ExpandProperty Line
If you just want a quick overview of the distribution, you could also use Group-Object:
Get-Content .\test.txt | Group-Object { "{0} tabs" -f [regex]::Matches($_,"`t").Count }
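For the case asked about in the question (rows with more or fewer than X tabs, with their row numbers), a minimal in-memory sketch could look like this. The sample lines, the $expected value, and the variable names are assumptions for illustration. With a real file, the lines would come from Get-Content.

```powershell
# Expected number of tabs per row (hypothetical value).
$expected = 2

# Sample rows with 2, 1 and 3 tabs respectively.
$lines = "a`tb`tc", "a`tb", "a`tb`tc`td"

$lineNumber = 0
$bad = @(foreach ($line in $lines) {
    $lineNumber++
    $tabCount = [regex]::Matches($line, "`t").Count
    if ($tabCount -ne $expected) {
        # Emit only rows whose tab count differs from the expected value.
        [pscustomobject]@{ LineNumber = $lineNumber; TabCount = $tabCount; Line = $line }
    }
})

$bad | Format-Table LineNumber, TabCount, Line
```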
Lots of ways to do this. Get-Content works just fine for me and we create a custom object that you can then filter as desired.
Get-Content test.txt | ForEach-Object{
    New-Object PSObject -Property @{
Line = $_
LineNumber = $_.ReadCount
NumberofTabs = [regex]::matches($_,"`t").count
}
}
Use the .NET regex method to count the tabs in each line and populate a value based on the result.
NumberofTabs LineNumber Line
------------ ---------- ----
8 1 ;lkjasfdsa
8 2 asdfasdf
4 3 asdfasdfasdfa
2 4 fasdfjasdlfjas;l
Now you can use PowerShell to filter as you see fit.
} | Where-Object { $_.NumberofTabs -ne 4}
So if 4 were the perfect number, line 3 would be omitted from the results.