Donald Knuth was once asked to write a literate program that computes the word frequencies of a file:
Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.
Doug McIlroy famously rewrote the 10 pages of Pascal in a few lines of sh:
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
As a little exercise, I converted this to PowerShell:
(-split ((Get-Content -Raw test.txt).ToLower() -replace '[^a-zA-Z]',' ')) |
Group-Object |
Sort-Object -Property count -Descending |
Select-Object -First $Args[0] |
Format-Table count, name
I like that PowerShell combines sort | uniq -c into a single Group-Object.
The first line looks ugly, so I wonder whether it can be written more elegantly. Maybe there is a way to load the file with a regex delimiter?
One obvious way to shorten the code would be to use the aliases, but that does not help readability.
I would do it this way.
PS C:\users\me> Get-Content words.txt
One one
two
two
three,three.
two;two
PS C:\users\me> (Get-Content words.txt) -Split '\W' | Group-Object
Count Name Group
----- ---- -----
2 One {One, one}
4 two {two, two, two, two}
2 three {three, three}
1 {}
EDIT: Some code from Bruce Payette's Windows PowerShell in Action
# top 10 most frequent words, hash table
$s = gc songlist.txt
$s = [string]::join(" ", $s)
$words = $s.Split(" `t", [stringsplitoptions]::RemoveEmptyEntries)
$uniq = $words | sort -Unique
$words | % {$h=@{}} {$h[$_] += 1}
$frequency = $h.keys | sort {$h[$_]}
-1..-10 | %{ $frequency[$_]+" "+$h[$frequency[$_]]}
# or
$grouped = $words | group | sort count
$grouped[-1..-10]
Thanks js2010 and LotPings for important hints. To document what is probably the best solution:
$Input -split '\W+' |
Group-Object -NoElement |
Sort-Object count -Descending |
Select-Object -First $Args[0]
Things I learned:
$Input contains stdin. This is closer to McIlroy's code than Get-Content on some file.
-split can actually take regex delimiters.
The -NoElement parameter lets me get rid of the Format-Table line.
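The points above can be sketched as a small function. This is only an illustration under stated assumptions: the name Get-TopWord and its parameters are made up for this example and are not part of the original script. It also filters out the empty string that -split '\W+' produces when the text starts or ends with a delimiter (for example a trailing period).

```powershell
# Hypothetical helper; the name Get-TopWord and its parameters are illustrative only.
function Get-TopWord {
    param(
        [string]$Text,
        [int]$Count = 3
    )
    # Split on runs of non-word characters, then drop the empty strings
    # that appear when the text starts or ends with a delimiter.
    $Text -split '\W+' |
        Where-Object { $_ } |
        Group-Object -NoElement |
        Sort-Object Count -Descending |
        Select-Object -First $Count
}

$sample = 'One one two two three,three. two;two'
# Group-Object is case-insensitive by default, so One and one are counted together.
Get-TopWord -Text $sample -Count 1   # the most frequent word here is "two", with a count of 4
```

The function takes the text as a parameter instead of reading $Input, purely to keep the sketch easy to try out interactively.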
Windows 10 64-bit. PowerShell 5
How to find which whole word (for example the, but not -the- or weather), regardless of case, is used most frequently in a text file, and how many times it is used, using PowerShell:
Replace 1.txt with your file.
$z = gc 1.txt -raw
-split $z | group -n | sort c* | select -l 1
Results:
Count Name
----- ----
30 THE
I am attempting to rename a folder based on the first 10 characters inside a file using a PowerShell command.
I got as far as pulling the data I need for the rename, but I don't know how to pass it.
Get-Content 'C:\DATA\Company.dat' |
Select-Object -first 10 |
rename 'C:\DATA\FOLDER' 'C:\DATA\FOLDER (first 10)'
The part I'm stuck on is (first 10): I don't know what to pass in that position to complete my task.
Select-Object -first 10 will take the first 10 objects. In your case this will be the first 10 lines of the file, not 10 characters.
You can use something like this
Rename-Item -Path 'C:\DATA\FOLDER' -NewName "C:\DATA\$((Get-Content 'C:\DATA\Company.dat' | Select-Object -first 1).Substring(0,10))"
Using -first 1 to get the first line and .Substring(0,10) to get the first 10 characters.
Edit:
Or, as @AdminOfThings mentioned, without the Select-Object:
Rename-Item -Path 'C:\DATA\FOLDER' -NewName "C:\DATA\$((Get-Content 'C:\DATA\Company.dat' -raw).Substring(0,10))"
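One caveat worth guarding against: .Substring(0,10) throws an exception when the string is shorter than 10 characters. A minimal sketch of a safer variant follows. The function name Get-LinePrefix and the demo file in the temp directory are assumptions made for this example only.

```powershell
# Hypothetical helper; Get-LinePrefix is not a built-in cmdlet.
function Get-LinePrefix {
    param(
        [string]$Path,
        [int]$Length = 10
    )
    # Read only the first line of the file.
    $line = Get-Content -Path $Path -TotalCount 1
    # An empty file yields $null and an empty first line yields ''; return '' in both cases.
    if (-not $line) { return '' }
    # Never ask Substring for more characters than the line has.
    $line.Substring(0, [Math]::Min($Length, $line.Length))
}

# Assumption: a throwaway demo file in the temp directory.
$file = Join-Path ([System.IO.Path]::GetTempPath()) 'Company-demo.dat'
Set-Content -Path $file -Value 'ACME'

$prefix  = Get-LinePrefix -Path $file   # ACME (the line has only 4 characters)
$newName = "FOLDER $prefix"             # FOLDER ACME
$newName
```

The resulting string could then be passed to Rename-Item as the -NewName argument.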
To complement Michael B.'s helpful answer with a 3rd approach:
If the characters of interest are known to be all on the 1st line (which is a safe assumption in your case), you can use Get-Content -First 1 ... (same as: Get-Content -TotalCount 1 ...) to retrieve that 1st line directly (and exclusively), which:
performs better than Get-Content ... | Select-Object -First 1
avoids having to read the entire file into memory with Get-Content -Raw ...
Rename-Item 'C:\DATA\FOLDER' `
"FOLDER $((Get-Content -First 1 C:\DATA\Company.dat).Substring(0, 10))"
Note:
It is sufficient to pass only the new name to Rename-Item's 2nd positional argument (the -NewName parameter); e.g., FOLDER 1234567890, not the whole path. While you can pass the whole path, it must refer to the same location as the input path.
The substring-extraction command is embedded inside an expandable string ("...") by way of $(...) the subexpression operator.
As for what you tried:
Select-Object -First 10 gets the first 10 input objects, which are the file's lines output by Get-Content; in other words: you'll send 10 lines rather than 10 characters through the pipeline, and even if they were characters, they'd be sent one by one.
While it is possible to solve this problem in the pipeline, it would be cumbersome and slow:
-join ( # -join, given an array of chars., returns a string
Get-Content -First 1 C:\DATA\Company.dat | # get 1st line
ForEach-Object ToCharArray | # convert to a char. array
Select-Object -First 10 # get first 10 chars.
) |
Rename-Item 'C:\DATA\FOLDER' { 'FOLDER ' + $_ }
That said, you could transform the above into something faster and more concise:
-join (Get-Content -First 1 C:\DATA\Company.dat)[0..9] |
Rename-Item 'C:\DATA\FOLDER' { 'FOLDER ' + $_ }
Note:
Get-Content -First 1 returns (at most) 1 line, in which case PowerShell returns that line as-is, not wrapped in an array.
Indexing into a string ([...]) with the range operator (..) - e.g., [0..9] - implicitly extracts the characters at the specified positions as an array; it is as if you had called .ToCharArray()[0..9]
Note how the new name is determined via a delay-bind script-block argument ({ ... }) in which $_ refers to the input object (the 10-character string, in this case); it is this technique that enables a renaming command to operate on multiple inputs, where each new name is derived from the specific input at hand.
I have a CSV file with 1600 lines; the first ten lines are given below:
N,EQ,ADANIPORTS,ADANI PORT & SEZ LTD,384.5,385,387.8,375,376.75,792818726.1,2085488,Y, ,40850,452.35,350.45
N,EQ,ASIANPAINT,ASIAN PAINTS LIMITED,1394.75,1395,1411,1385.05,1393.5,1284559258,919355,Y, ,36117,1490.6,1090.1
N,EQ,AXISBANK,AXIS BANK LIMITED,631.75,638.05,643.4,634,639.9,9599936309,15035968,Y, ,144038,644.65,447.5
N,EQ,BAJAJ-AUTO,BAJAJ AUTO LIMITED,2685.55,2683.9,2697,2664,2682.25,1476618943,551229,Y, ,23611,3468.35,2605
N,EQ,BAJAJFINSV,BAJAJ FINSERV LTD.,7092.1,7092,7129,7025.25,7050.65,909166393.3,128111,Y, ,19707,7200,4500
N,EQ,BAJFINANCE,BAJAJ FINANCE LIMITED,2893.85,2892,2943.4,2891.05,2916.6,3884349778,1327710,Y, ,52356,2943.4,1511.2
N,EQ,BHARTIARTL,BHARTI AIRTEL LIMITED,369.9,370,370.8,365,368.95,768282183.8,2089422,Y, ,26515,564.8,331
N,EQ,BPCL,BHARAT PETROLEUM CORP LT,357.75,358.25,362,353.5,356.95,1738725370,4865929,Y, ,77863,551.55,353.5
N,EQ,CIPLA,CIPLA LTD,657.95,658,658,645,651.2,1235846442,1904031,Y, ,38575,665,507.2
N,EQ,COALINDIA,COAL INDIA LTD,289.05,287.85,293.6,287.8,291,791484837,2713583,Y, ,55421,316.95,235.85
I want to sort on the 10th column in descending order so that I can find the top 20 values.
The file name is Pd240818.csv.
My PowerShell code is as follows:
# To remove unwanted few lines
sls ",BE,",",EQ," .\Pd240818.csv | select -exp line | Where-Object {$_ -notmatch ',EQ, ,'} > .\temp.csv
#Sorting line is as follows
gc .\temp.csv | Where-Object {$_ -notmatch 'MKT,'}|%{$_.split(",")[9]}|Sort-Object -Descending| Select-Object -first 20 > temp.txt
Sorted
I get temp.txt as follows:
99988.7
99896.5
9989273.6
99769.75
996134.55
9933960.45
99228.65
99199.95
989418.15
988423057.7
9884111.1
98572145.2
982146.5
981497584.9
97982.75
9786178.9
9775915.05
9760482.5
97384498.85
971033.85
Whereas if I sort the same column in Excel, I get the following:
28818819313
9599936309
8459873415
6175554483
5889553012
5690666055
5439638100
5121938441
5079530750
5042021707
4972762046
4889394601
4742835986
3884349778
3690976213
3486309023
3388956937
3336437125
3206801588
3114870807
Where am I going wrong? How do I correct it?
The clue is that numbers of different lengths are all sorted together.
This is a common problem: the numbers are being sorted as text instead of as numeric values. When we sort words, their length does not matter; we put all the a's together, then all the b's, and so on. Do the same with numbers, putting all the 9s together, then all the 8s, and you get this varying-length sort:
99896.5
9989273.6
99769.75
The solution is to convert the text to numbers, while sorting, then they will sort on the value:
.. | Sort-Object -Descending -Property { $_ -as [decimal] } | ..
Then the output is more like you want:
988423057.7
981497584.9
98572145.2
97384498.85
9989273.6
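The difference between the two sort modes can be seen in a small self-contained sketch. The four sample values are taken from the listings above; the variable names are chosen for this example only.

```powershell
# Values as they come out of .Split(","): strings, not numbers.
$values = '99988.7', '9989273.6', '988423057.7', '28818819313'

# Text sort: compared character by character, so the strings starting with 9 come first.
$asText = $values | Sort-Object -Descending

# Numeric sort: each string is converted to [decimal] for the comparison only;
# the output items are still the original strings.
$asNumbers = $values | Sort-Object -Descending -Property { $_ -as [decimal] }

$asText[0]      # 99988.7
$asNumbers[0]   # 28818819313
```

Because the conversion happens only inside the -Property script block, the values written to the output file keep their original formatting.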
How to find the count of unique numbers in a CSV file? When I use the following command in PowerShell ISE
1,2,3,4,2 | Sort-Object | Get-Unique
I can get the unique numbers but I'm not able to get this to work with CSV files. If for example I use
$A = Import-Csv C:\test.csv | Sort-Object | Get-Unique
$A.Count
it returns 0. I would like to count unique numbers for all the files in a given folder.
My data looks similar to this:
Col1,Col2,Col3,Col4
5,,7,4
0,,9,
3,,5,4
And the result should be 6 unique values (preferably written inside the same CSV file).
Or would it be easier to do it with Notepad++? So far I have found examples only on how to count the unique rows.
You can try the following (PSv3+):
PS> (Import-CSV C:\test.csv |
ForEach-Object { $_.psobject.properties.value -ne '' } |
Sort-Object -Unique).Count
6
The key is to extract all property (column) values from each input object (CSV row), which is what $_.psobject.properties.value does;
-ne '' filters out empty values.
Note that, given that Sort-Object has a -Unique switch, you don't need Get-Unique (you need Get-Unique only if your input already is sorted).
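To try the technique without touching a file, here is a minimal sketch that feeds the sample data from the question through ConvertFrom-Csv instead of Import-Csv. That substitution is made for this example only; the rest of the pipeline is unchanged.

```powershell
# The sample data from the question, as an in-memory here-string.
$csv = @'
Col1,Col2,Col3,Col4
5,,7,4
0,,9,
3,,5,4
'@

$unique = $csv | ConvertFrom-Csv |
    ForEach-Object { $_.psobject.Properties.Value -ne '' } |   # all non-empty column values of the row
    Sort-Object -Unique

$unique.Count   # 6  (the values 0, 3, 4, 5, 7, 9)
```

Note that the values are compared as strings here, which is sufficient for counting distinct values.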
That said, if your CSV file is structured as simply as yours, you can speed up processing by reading it as a text file (PSv2+):
PS> (Get-Content C:\test.csv | Select-Object -Skip 1 |
ForEach-Object { $_ -split ',' -ne '' } |
Sort-Object -Unique).Count
6
Get-Content reads the CSV file line by line, emitting each line as a string.
Select-Object -Skip 1 skips the header line.
$_ -split ',' -ne '' splits each line into values by commas and weeds out empty values.
As for what you tried:
Import-CSV C:\test.csv | Sort-Object | Get-Unique:
Fundamentally, Sort-Object emits the input objects as a whole (just in sorted order), it doesn't extract property values, yet that is what you need.
Because no -Property argument is passed to Sort-Object to base the sorting on, it compares the custom objects that Import-Csv emits as a whole, by their .ToString() values, which happen to be empty[1], so they all compare the same, and in effect no sorting happens.
Similarly, Get-Unique also determines uniqueness by .ToString() here, so that, again, all objects are considered the same and only the very first one is output.
[1] This may be surprising, given that using a custom object in an expandable string does yield a value: compare $obj = [pscustomobject] @{ foo ='bar' }; $obj.ToString(); '---'; "$obj". This inconsistency is discussed in this GitHub issue.
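The question also asked about all the files in a folder. One possible way to extend the approach is sketched below. The demo folder, the file names, and the File/UniqueCount property names are assumptions for this example; the sketch reports the counts rather than writing them back into the CSV files.

```powershell
# Assumption: a throwaway folder with two small CSV files, created only for this demo.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) 'csv-unique-demo'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
Set-Content -Path (Join-Path $dir 'a.csv') -Value 'Col1,Col2', '5,', '0,9'
Set-Content -Path (Join-Path $dir 'b.csv') -Value 'Col1,Col2', '3,5', '3,'

$report = Get-ChildItem -Path $dir -Filter *.csv | ForEach-Object {
    $file = $_   # keep a reference to the file; $_ is reused by the inner pipeline
    $values = Import-Csv -Path $file.FullName |
        ForEach-Object { $_.psobject.Properties.Value -ne '' } |
        Sort-Object -Unique
    # @(...) guarantees an array, so .Count is reliable even for zero or one value.
    [pscustomobject]@{ File = $file.Name; UniqueCount = @($values).Count }
}
$report   # a.csv has 3 unique values (0, 5, 9); b.csv has 2 (3, 5)
```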
Using PowerShell, I can import the CSV file and count how many objects are equal to "a". For example,
@(Import-csv location | where-Object{$_.id -eq "a"}).Count
Is there a way to go through every column and row looking for the same String "a" and adding onto count? Or do I have to do the same command over and over for every column, just with a different keyword?
So I made a dummy file that contains 5 columns of people names. Now to show you how the process will work I will show you how often the text "Ann" appears in any field.
$file = "C:\temp\MOCK_DATA (3).csv"
gc $file | %{$_ -split ","} | Group-Object | Where-Object{$_.Name -like "Ann*"}
Don't focus on the code but the output below.
Count Name Group
----- ---- -----
5 Ann {Ann, Ann, Ann, Ann...}
9 Anne {Anne, Anne, Anne, Anne...}
12 Annie {Annie, Annie, Annie, Annie...}
19 Anna {Anna, Anna, Anna, Anna...}
"Ann" appears 5 times on its own. However, it is also part of other names. Let's use a simple regex to find all the values that are only "Ann".
(select-string -Path 'C:\temp\MOCK_DATA (3).csv' -Pattern "\bAnn\b" -AllMatches | Select-Object -ExpandProperty Matches).Count
That will return 5 since \b is for a word boundary. In essence it is only looking at what is between commas or beginning or end of each line. This omits results like "Anna" and "Annie" that you might have. Select-Object -ExpandProperty Matches is important to have if you have more than one match on a single line.
Small Caveat
To keep the code simple, I don't account for the possibility that a column header matches the value you are looking for. That is unlikely, but if it is a possibility, we could use Get-Content with Select -Skip 1 instead and pipe the remaining lines to Select-String.
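A minimal sketch of that header-skipping variant follows. The sample lines are made up for this example and stand in for the output of Get-Content on the real file.

```powershell
# Made-up sample data; the first line plays the role of a header that happens to contain "Ann".
$lines = 'Ann,Name2', 'Ann,Anna', 'Annie,Ann', 'Bob,Anne'

$count = ($lines |
    Select-Object -Skip 1 |                                # drop the header line
    Select-String -Pattern '\bAnn\b' -AllMatches |         # whole-word matches only
    Select-Object -ExpandProperty Matches).Count

$count   # 2  (the header's "Ann" is not counted)
```

Keep in mind that \b is a word boundary rather than a field boundary, so a field such as "Mary Ann" would also be counted.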
Try cycling through properties like this:
(Import-Csv location | %{$record = $_; $record | Get-Member -MemberType Properties |
?{$record.$($_.Name) -eq 'a';}}).Count
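A possible variation on the same idea, sketched with made-up in-memory data: instead of calling Get-Member for every record, read all column values of each row through psobject.Properties and count exact matches.

```powershell
# Made-up sample data; ConvertFrom-Csv stands in for Import-Csv here.
$rows = @'
id,name,nick
a,b,a
c,a,d
'@ | ConvertFrom-Csv

# Emit every column value of every row, then keep the exact (case-insensitive) matches.
$count = @($rows |
    ForEach-Object { $_.psobject.Properties.Value } |
    Where-Object { $_ -eq 'a' }).Count

$count   # 3
```

Since -eq compares whole field values, a field like "ab" is not counted, and the header row is never part of the data.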