How to find unique words in a text file and then store the unique words in a text file using PowerShell

I am using PowerShell. I want to remove duplicate words from a text file and then store the unique words in a text file. Here is what I do:
$A = $( foreach ($line in Get-Content C:\Test1\File1.txt) {
$line.tolower().split(" ")
}) | Sort-Object | Get-Unique
$A | export-csv "somefile.csv"
Here is my file.

PowerShell can use a .NET type called a HashSet, which is perfect for doing exactly this, and at the figurative speed of light too!
First we read the file into memory in PowerShell and assign it to a variable called $lines.
Next, we split into just the unique $words.
Finally, we create a hashset which will only allow unique words or items.
$lines = get-content "C:\Users\Stephen\OneDrive\Documents\quotes.txt"
[string[]]$words = $lines.Split()
$uniqueWords = [System.Collections.Generic.HashSet[string]]::new($words)
Here's some info on how this works: we're using the HashSet constructor overload that accepts an input collection.
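As a small extension (my addition, not part of the original answer), the same constructor family also accepts an equality comparer, which is handy if 'The' and 'the' should count as one word, and you can then write the set out to a file as the question asks. The output path below is a placeholder.
$lines = Get-Content "C:\Users\Stephen\OneDrive\Documents\quotes.txt"
[string[]]$words = $lines.Split()
# Pass a comparer so casing doesn't create duplicate entries (optional).
$uniqueWords = [System.Collections.Generic.HashSet[string]]::new(
    $words,
    [System.StringComparer]::OrdinalIgnoreCase
)
# Persist the unique words, one per line.
$uniqueWords | Sort-Object | Set-Content "C:\Users\Stephen\OneDrive\Documents\uniqueWords.txt"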
But it's FAST!
It is amazingly fast to use a hashset too! I measured the performance on a reasonably sized file of 10MB of text from samplefile.com with a number of famous quotes and other info.
Method TotalMs
------ -------
Get-Unique 21484.4956
Using Hashset 1840.7407
The HashSet is dramatically faster. It's an order of magnitude faster in the worst case, and I've seen it be two orders of magnitude or more before.
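If you want to reproduce a comparison like that yourself, roughly the following sketch would do it (the path is a placeholder; this is not the original benchmark script, and your numbers will differ):
# Compare the Get-Unique pipeline against the hashset approach on the same file.
$path = 'C:\temp\quotes.txt'

$getUniqueMs = (Measure-Command {
    Get-Content $path | ForEach-Object { $_.ToLower().Split(' ') } |
        Sort-Object | Get-Unique
}).TotalMilliseconds

$hashsetMs = (Measure-Command {
    $lines = Get-Content $path
    [string[]]$words = $lines.Split()
    $null = [System.Collections.Generic.HashSet[string]]::new($words)
}).TotalMilliseconds

[pscustomobject]@{ 'Get-Unique' = $getUniqueMs; 'Using Hashset' = $hashsetMs }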

Or simply as a one-liner:
(Get-Content 'C:\Test1\File1.txt' -Raw) -split '\W' | Sort-Object -Unique | Set-Content -Path 'C:\Test1\File2.txt'
\W is regex for a Non-Word character like space, comma, etc.
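Note that splitting on \W also produces empty tokens (for example between consecutive separators). If you want to drop those, a slight variant of the one-liner (my tweak, not the original) splits on runs of non-word characters and filters out empties:
(Get-Content 'C:\Test1\File1.txt' -Raw) -split '\W+' |
    Where-Object { $_ } |
    Sort-Object -Unique |
    Set-Content -Path 'C:\Test1\File2.txt'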

The main issue with your example was that you weren't processing the returned array from the split function:
Get-Content hello.txt | ForEach-Object { $wrds=$_.Split(" "); foreach ($i in $wrds) { Write-Output $i } } | Sort-Object | Get-Unique
Split each line into the array $wrds, then loop over its contents to write each word to the output, before sorting and removing the duplicates.
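If you also want to land the result in a text file, as the question asks, the same idea can simply end in Set-Content (a condensed sketch; the output file name is my own placeholder):
Get-Content hello.txt |
    ForEach-Object { $_.Split(' ') } |
    Sort-Object |
    Get-Unique |
    Set-Content unique-words.txt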

Related

Powershell: Import-csv, rename all headers

In our company there are many users and many applications with restricted access, and a database with evidence of those accesses. I don't have access to that database, but what I do have is an automatically generated (once a day) CSV file with all accesses of all my users. I want them to have a chance to check their access situation, so I am writing a simple PowerShell script for this purpose.
CSV:
user;database1_dat;database2_dat;database3_dat
john;0;0;1
peter;1;0;1
I can do:
import-csv foo.csv | where {$_.user -eq $user}
But this will show me the original ugly headers (with the "_dat" suffix). Can I delete the last four characters from every header which ends with "_dat", when I can't predict how many headers will be there tomorrow?
I am aware of calculated property like:
Select-Object @{ expression={$_.database1_dat}; label='database1' }
but I have to know all column names for that, as far as I know.
Am I doomed to "over-engineer" it with a separate function and build the whole "calculated property expression" from scratch dynamically, or is there a simpler way I am missing?
Thanks :-)
Assuming that file foo.csv fits into memory as a whole, the following solution performs well:
If you need a memory-throttled - but invariably much slower - solution, see Santiago Squarzon's helpful answer or the alternative approach in the bottom section.
$headerRow, $dataRows = (Get-Content -Raw foo.csv) -split '\r?\n', 2
# You can pipe the result to `where {$_.user -eq $user}`
ConvertFrom-Csv ($headerRow -replace '_dat(?=;|$)'), $dataRows -Delimiter ';'
Get-Content -Raw reads the entire file into memory, which is much faster than reading it line by line (the default).
-split '\r?\n', 2 splits the resulting multi-line string into two: the header line and all remaining lines.
Regex \r?\n matches a newline (both a CRLF (\r\n) and a LF-only newline (\n))
, 2 limits the number of tokens to return to 2, meaning that splitting stops once the 1st token (the header row) has been found, and the remainder of the input string (comprising all data rows) is returned as-is as the last token.
Note that the header row is captured directly as the first token here, so there is no empty leading token that would need to be discarded.
$headerRow -replace '_dat(?=;|$)'
-replace '_dat(?=;|$)' uses a regex to remove any _dat column-name suffixes (followed by a ; or the end of the string); if substring _dat only ever occurs as a name suffix (not also inside names), you can simplify to -replace '_dat'
ConvertFrom-Csv directly accepts arrays of strings, so the cleaned-up header row and the string with all data rows can be passed as-is.
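Putting it together with the filtering mentioned in the code comment above (the $user value and the final export step are illustrative assumptions on my part, not part of the original answer):
$user = 'peter'   # example value

$headerRow, $dataRows = (Get-Content -Raw foo.csv) -split '\r?\n', 2

ConvertFrom-Csv ($headerRow -replace '_dat(?=;|$)'), $dataRows -Delimiter ';' |
    Where-Object { $_.user -eq $user } |
    Export-Csv cleaned.csv -NoTypeInformation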
Alternative solution: algorithmic renaming of an object's properties:
Note: This solution is slow, but may be an option if you only extract a few objects from the CSV file.
As you note in the question, use of Select-Object with calculated properties is not an option in your case, because you neither know the column names nor their number in advance.
However, you can use a ForEach-Object command in which you use .psobject.Properties, an intrinsic member, for reflection on the input objects:
Import-Csv -Delimiter ';' foo.csv | where { $_.user -eq $user } | ForEach-Object {
    # Initialize an aux. ordered hashtable to store the renamed
    # property name-value pairs.
    $renamedProperties = [ordered] @{}
    # Process all properties of the input object and
    # add them with cleaned-up names to the hashtable.
    foreach ($prop in $_.psobject.Properties) {
        $renamedProperties[($prop.Name -replace '_dat$')] = $prop.Value
    }
    # Convert the aux. hashtable to a custom object and output it.
    [pscustomobject] $renamedProperties
}
You can do something like this:
$textInfo = (Get-Culture).TextInfo
$headers = (Get-Content .\test.csv | Select-Object -First 1).Split(';') |
ForEach-Object {
$textInfo.ToTitleCase($_) -replace '_dat'
}
$user = 'peter'
Get-Content .\test.csv | Select-Object -Skip 1 |
ConvertFrom-Csv -Delimiter ';' -Header $headers |
Where-Object User -EQ $user
User Database1 Database2 Database3
---- --------- --------- ---------
peter 1 0 1
Not super efficient but does the trick.

Using PowerShell only, sort a text file with 300 lines, first by string length and then, within each length, sort them all alphabetically

I am trying to do this as a one liner in powershell, so that I can move on to check these strings against a check string.
The trouble I am having is that no matter what I do, I can only set it by string length.
The following attempts failed to get the required result. One sorts the text fine, the other sorts by length successfully. I have also tried to pipe one into the other, but I believe that neither accepts pipeline input.
Your help is appreciated as I am new to PowerShell.
PS C:\Users\IEUser> Get-Content Desktop/dict.txt | Sort-Object
PS C:\Users\IEUser> Get-Content Desktop/dict.txt | Sort-Object -Property Length
Given the sample array $toSort:
$toSort = @(
'abcdefghwxyefg'
'abcdefghghijkl'
'abcdefghwxyefgabcdzx'
'abcdefghwxyefgabcdef'
'abcdzx'
'abcdef'
'abzxc'
'abcde'
'wxy'
'efg'
'abcdefgh'
'ijklmnop'
)
You can use Sort-Object to sort the array first by the Length property and then by alphabetical order like this:
$toSort | Sort-Object Length, { $_ }
Thanks Mathias for pointing it out, I was previously using { $_[0] }, which would sort only by the first char of each line.
Here is how the actual answer would look:
Get-Content Desktop/dict.txt | Sort-Object -Property Length, { $_ } |
Out-File path/to/sortedDict.txt
If you want to have some fun with LINQ you can accomplish the same using first OrderBy to sort by Length and ThenBy alphabetical order:
[Linq.Enumerable]::ThenBy(
[Linq.Enumerable]::OrderBy($toSort, [Func[object, int]]{param($s) $s.Length }),
[Func[object, string]]{param($s) $s }
)
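Note that OrderBy/ThenBy return a lazy IOrderedEnumerable; enumerating the result (for example by piping it) is what actually runs the sort, so writing it to a file could look like this (a sketch, reusing the output path from earlier):
[Linq.Enumerable]::ThenBy(
    [Linq.Enumerable]::OrderBy($toSort, [Func[object, int]]{param($s) $s.Length }),
    [Func[object, string]]{param($s) $s }
) | Out-File path/to/sortedDict.txt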

Cleanup huge text file containing domain

I have a database that contains a log of domains listed in the following manner:
.youtube.com
.ziprecruiter.com
0.etsystatic.com
0.sparkpost.com
00.mail.ne1.yahoo.com
00072e01.pphosted.com
00111b01.pphosted.com
001d4f01.pphosted.com
011.mail.bf1.yahoo.com
1.amazonaws.com
How would I go about cleaning them up using PowerShell or grep (though I'd rather use PowerShell), so that they contain only the root domain with the .com extension, removing whatever word and . comes before that?
I'm thinking the best way to do this is a query that looks at the dots from right to left and removes the second dot and everything before it. For example, for 1.amazonaws.com we remove the second dot from the right and whatever is before it?
i.e.
youtube.com
ziprecruiter.com
etsystatic.com
yahoo.com
pphosted.com
amazonaws.com
You can read each line into an array of strings with Get-Content, Split on "." using Split(), get the last two items with [-2,-1], then join the array back up using -join. We can then retrieve unique items using -Unique from Select-Object.
Get-Content -Path .\database_export.txt | ForEach-Object {
$_.Split('.')[-2,-1] -join '.'
} | Select-Object -Unique
Or using Select-Object -Last 2 to fetch the last two items, then piping to Join-String.
Get-Content -Path .\database_export.txt | ForEach-Object {
$_.Split('.') | Select-Object -Last 2 | Join-String -Separator '.'
} | Select-Object -Unique
Output:
youtube.com
ziprecruiter.com
etsystatic.com
sparkpost.com
yahoo.com
pphosted.com
amazonaws.com
You can use the String.Trim() method to clean leading and trailing dots, then use the regex -replace operator to remove everything but the top- and second-level domain name:
$strings = Get-Content database_export.txt
@($strings | ForEach-Object Trim '.') -replace '.*?(\w+\.\w+)$', '$1' | Sort-Object -Unique
here is yet another method. [grin]
what it does ...
creates an array of strings to work with
when ready to do this for real, remove the entire #region/#endregion section and use Get-Content to load the file.
iterates thru the $InStuff collection of strings
splits the current item on the dots
grabs the last two items in the resulting array
joins them with a dot
outputs the new string to the $Results collection
shows that on screen
the code ...
#region >>> fake reading in a text file
# in real life, use Get-Content
$InStuff = @'
.youtube.com
.ziprecruiter.com
0.etsystatic.com
0.sparkpost.com
00.mail.ne1.yahoo.com
00072e01.pphosted.com
00111b01.pphosted.com
001d4f01.pphosted.com
011.mail.bf1.yahoo.com
1.amazonaws.com
'@ -split [System.Environment]::NewLine
#endregion >>> fake reading in a text file
$Results = foreach ($IS_Item in $InStuff)
{
$IS_Item.Split('.')[-2, -1] -join '.'
}
$Results
output ...
youtube.com
ziprecruiter.com
etsystatic.com
sparkpost.com
yahoo.com
pphosted.com
pphosted.com
pphosted.com
yahoo.com
amazonaws.com
please note that this code expects the strings to be more-or-less-valid URLs. i can think of invalid ones that end with a dot ... and those would fail. if you need to deal with such, add the needed validation code.
another idea ... if the file is large [tens of thousands of strings], you may want to use the ForEach-Object pipeline cmdlet [as shown by RoadRunner] to save RAM at the expense of speed.
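As a sketch of both points above (trimming stray dots, and streaming with ForEach-Object), something like this could work; the validation rule is my own assumption, not part of the answer above:
Get-Content -Path .\database_export.txt | ForEach-Object {
    # trim stray leading/trailing dots, then keep only the last two labels
    $labels = $_.Trim('.').Split('.')
    if ($labels.Count -ge 2) {
        $labels[-2, -1] -join '.'
    }
} | Select-Object -Unique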

Count unique numbers in CSV (PowerShell or Notepad++)

How to find the count of unique numbers in a CSV file? When I use the following command in PowerShell ISE
1,2,3,4,2 | Sort-Object | Get-Unique
I can get the unique numbers but I'm not able to get this to work with CSV files. If for example I use
$A = Import-Csv C:\test.csv | Sort-Object | Get-Unique
$A.Count
it returns 0. I would like to count unique numbers for all the files in a given folder.
My data looks similar to this:
Col1,Col2,Col3,Col4
5,,7,4
0,,9,
3,,5,4
And the result should be 6 unique values (preferably written inside the same CSV file).
Or would it be easier to do it with Notepad++? So far I have found examples only on how to count the unique rows.
You can try the following (PSv3+):
PS> (Import-CSV C:\test.csv |
ForEach-Object { $_.psobject.properties.value -ne '' } |
Sort-Object -Unique).Count
6
The key is to extract all property (column) values from each input object (CSV row), which is what $_.psobject.properties.value does;
-ne '' filters out empty values.
Note that, given that Sort-Object has a -Unique switch, you don't need Get-Unique (you need Get-Unique only if your input already is sorted).
That said, if your CSV file is structured as simply as yours, you can speed up processing by reading it as a text file (PSv2+):
PS> (Get-Content C:\test.csv | Select-Object -Skip 1 |
ForEach-Object { $_ -split ',' -ne '' } |
Sort-Object -Unique).Count
6
Get-Content reads the CSV file as an array of lines (strings).
Select-Object -Skip 1 skips the header line.
$_ -split ',' -ne '' splits each line into values by commas and weeds out empty values.
As for what you tried:
Import-CSV C:\test.csv | Sort-Object | Get-Unique:
Fundamentally, Sort-Object emits the input objects as a whole (just in sorted order), it doesn't extract property values, yet that is what you need.
Because no -Property argument is passed to Sort-Object to base the sorting on, it compares the custom objects that Import-Csv emits as a whole, by their .ToString() values, which happen to be empty[1], so they all compare the same, and in effect no sorting happens.
Similarly, Get-Unique also determines uniqueness by .ToString() here, so that, again, all objects are considered the same and only the very first one is output.
[1] This may be surprising, given that using a custom object in an expandable string does yield a value: compare $obj = [pscustomobject] @{ foo = 'bar' }; $obj.ToString(); '---'; "$obj". This inconsistency is discussed in this GitHub issue.

Adding columns and manipulating existing column values in csv file using powershell

I have a lot of csv files with values arranged like so:
X1,Y1
X2,Y2
...,...
Xn,Yn
I find it very tedious processing these with Excel, so I want to set up a batch script to process these files such that they appear like this:
#where N is a specified value like 65536
X1,N-Y1,1
X2,N-Y2,2
...,...,...
Xn,N-Yn,n
I have only recently started using powershell for image processing (really simple scripts) and file name appending, so I am not certain how to go about this. A lot of the scripts I have encountered looking to answer this question use csv files with titles per column whereas my files are just arrays of values without object titles in the first row. I would like to avoid running multiple scripts to add titles.
My bonus question is something I have yet to find a good answer to at all, and it is the most tedious part of processing. Using Excel's sort function, I usually change the order of the Yn values in Col2 such that they are sorted in the exported CSV like so:
X1,N-Yn,n
...,...,...
Xn-1,N-Y2,2
Xn,N-Y1,1
Using the Col3 values as the sorting order (largest to smallest), then I delete this column so that the final saved csv only contains the first two columns (crucial step). Any help at all would be greatly appreciated, I apologize for the long-winded-ness of this question.
A lot of the scripts I have encountered looking to answer this question use csv files with titles per column whereas my files are just arrays of values without object titles in the first row.
The -Header parameter of Import-Csv is for adding column headers when the file does not contain them. It takes an array of strings, of however many columns there are.
I would like to avoid running multiple scripts to add titles.
If you couldn't use -Header, you could read the lines with Get-Content into memory, add a header in memory, and then use ConvertFrom-CSV all in one script.
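A rough sketch of that fallback, with placeholder header names (only needed if -Header were unavailable for some reason):
# Prepend a header line in memory, then let ConvertFrom-Csv build the objects.
$lines = Get-Content -LiteralPath 'c:\test\test.csv'
$rows  = @('ColX,ColY') + $lines | ConvertFrom-Csv
$rows | Format-Table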
That said, if I'm reading it rightly, you want:
No headers in the input file, and I imagine no headers in the output file
The whole point of adding the third column and sorting and removing it is just to reverse the lines?
The only column you keep is column 1?
I wouldn't use Import-Csv for this, it won't make it much nicer.
$n = 65536
# Read lines into a list, and reverse it
$lines = [Collections.Generic.List[String]](Get-Content -LiteralPath 'c:\test\test.csv')
$lines.Reverse()
# Split each line into two, create a new line with X and N-Y
# write new lines to an output file
$lines | ForEach-Object {
    $x, $y = $_.split(',')
    "$x,$($n - [int]$y)"
} | Set-Content -LiteralPath 'c:\test\output.csv' -Encoding Ascii
If you do want to use CSV handling, then:
$n = 65536
$counter = 1
Import-Csv -LiteralPath 'C:\test\test.csv' -Header 'ColX', 'ColY' |
    Add-Member -MemberType ScriptProperty -Name 'ColN-Y' -Value {$n - $this.ColY} -PassThru |
    Add-Member -MemberType ScriptProperty -Name 'N' -Value {($script:counter++)} -PassThru |
    Sort-Object -Property 'N' -Descending |
    Select-Object -Property 'ColX', 'ColN-Y' |
    Export-Csv -LiteralPath 'c:\test\output.csv' -NoTypeInformation
But the output will have CSV headers and double-quoted values.
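If you need the output without the header row and without the quoting, one option (my addition, not part of the answer above) is to convert to CSV text and post-process it before writing; sample objects stand in for the real pipeline here:
# Drop the header line, then strip the quotes that ConvertTo-Csv adds.
$sample = [pscustomobject]@{ ColX = 'X1'; 'ColN-Y' = 65535 },
          [pscustomobject]@{ ColX = 'X2'; 'ColN-Y' = 65443 }

$sample |
    ConvertTo-Csv -NoTypeInformation |
    Select-Object -Skip 1 |
    ForEach-Object { $_ -replace '"' } |
    Set-Content -LiteralPath 'c:\test\output.csv' -Encoding Ascii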
I would try something like the following, extending the original table with a calculated script property as a new column:
#Your N number
$N = 65536
# Import CSV file without header columns
$table = Import-Csv -Header @("colX","colY") `
                    -Delimiter ',' `
                    -Path './numbers.csv'
Write-Host "Original table"
$table | Format-Table
# Manipulate table
$newtable = $table |
    Add-Member -MemberType ScriptProperty -Name colNX -Value { $N - $this.colX } -PassThru
Write-Host "New table"
$newtable | Format-Table