Clean up huge text file containing domains - powershell

I have a database that contains a log of domains listed in the following manner:
.youtube.com
.ziprecruiter.com
0.etsystatic.com
0.sparkpost.com
00.mail.ne1.yahoo.com
00072e01.pphosted.com
00111b01.pphosted.com
001d4f01.pphosted.com
011.mail.bf1.yahoo.com
1.amazonaws.com
How would I go about cleaning them up using PowerShell or grep (though I'd rather use PowerShell), so that they contain only the root domain with the .com extension, removing whatever word and dot come before that?
I'm thinking the best way to do this is a query that looks for dots from right to left and removes the second dot and whatever comes before it. For example, for 1.amazonaws.com we would remove the second dot from the right and whatever is before it.
i.e.
youtube.com
ziprecruiter.com
etsystatic.com
yahoo.com
pphosted.com
amazonaws.com

You can read each line into an array of strings with Get-Content, split on "." using Split(), get the last two items with [-2,-1], then join them back up using -join. We can then retrieve the unique items using Select-Object -Unique.
Get-Content -Path .\database_export.txt | ForEach-Object {
    $_.Split('.')[-2,-1] -join '.'
} | Select-Object -Unique
Or, using Select-Object -Last 2 to fetch the last two items, then piping to Join-String (note that Join-String requires PowerShell 6.2 or later):
Get-Content -Path .\database_export.txt | ForEach-Object {
    $_.Split('.') | Select-Object -Last 2 | Join-String -Separator '.'
} | Select-Object -Unique
Output:
youtube.com
ziprecruiter.com
etsystatic.com
sparkpost.com
yahoo.com
pphosted.com
amazonaws.com

You can use the String.Trim() method to clean leading and trailing dots, then use the regex -replace operator to remove everything but the top- and second-level domain name:
$strings = Get-Content database_export.txt
@($strings | ForEach-Object Trim '.') -replace '.*?(\w+\.\w+)$', '$1' | Sort-Object -Unique
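For the sample data above, this yields the sorted unique parent domains:
amazonaws.com
etsystatic.com
pphosted.com
sparkpost.com
yahoo.com
youtube.com
ziprecruiter.com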

here is yet another method. [grin]
what it does ...
creates an array of strings to work with
when ready to do this for real, remove the entire #region/#endregion section and use Get-Content to load the file.
iterates thru the $InStuff collection of strings
splits the current item on the dots
grabs the last two items in the resulting array
joins them with a dot
outputs the new string to the $Results collection
shows that on screen
the code ...
#region >>> fake reading in a text file
# in real life, use Get-Content
$InStuff = @'
.youtube.com
.ziprecruiter.com
0.etsystatic.com
0.sparkpost.com
00.mail.ne1.yahoo.com
00072e01.pphosted.com
00111b01.pphosted.com
001d4f01.pphosted.com
011.mail.bf1.yahoo.com
1.amazonaws.com
'@ -split [System.Environment]::NewLine
#endregion >>> fake reading in a text file
$Results = foreach ($IS_Item in $InStuff)
{
    $IS_Item.Split('.')[-2, -1] -join '.'
}
$Results
output ...
youtube.com
ziprecruiter.com
etsystatic.com
sparkpost.com
yahoo.com
pphosted.com
pphosted.com
pphosted.com
yahoo.com
amazonaws.com
please note that this code expects the strings to be more-or-less-valid URLs. i can think of invalid ones that end with a dot ... and those would fail. if you need to deal with such, add the needed validation code.
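For instance, a minimal guard (an editorial sketch, not part of the original answer) could skip blank lines and anything that does not end in two word-character labels:
# validation sketch [assumption: blanks, entries with a trailing dot,
#     and entries with fewer than two labels should simply be skipped]
$Results = foreach ($IS_Item in $InStuff)
{
    if ($IS_Item -match '\w+\.\w+$')
    {
        $IS_Item.Split('.')[-2, -1] -join '.'
    }
}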
another idea ... if the file is large [tens of thousands of strings], you may want to use the ForEach-Object pipeline cmdlet [as shown by RoadRunner] to save RAM at the expense of speed.

Related

Powershell: Import-csv, rename all headers

In our company there are many users and many applications with restricted access, and a database with evidence of those accesses. I don't have access to that database, but what I do have is an automatically generated (once a day) CSV file with all accesses of all my users. I want them to have a chance to check their access situation, so I am writing a simple PowerShell script for this purpose.
CSV:
user;database1_dat;database2_dat;database3_dat
john;0;0;1
peter;1;0;1
I can do:
import-csv foo.csv | where {$_.user -eq $user}
But this will show me the original ugly headers (with the "_dat" suffix). Can I delete the last four characters from every header that ends with "_dat", when I can't predict how many headers there will be tomorrow?
I am aware of calculated property like:
Select-Object @{ expression={$_.database1_dat}; label='database1' }
but i have to know all column names for that, as far as I know.
Am I condemned to "over-engineer" it with a separate function that builds the whole "calculated property expression" from scratch dynamically, or is there a simple way I am missing?
Thanks :-)
Assuming that file foo.csv fits into memory as a whole, the following solution performs well:
If you need a memory-throttled - but invariably much slower - solution, see Santiago Squarzon's helpful answer or the alternative approach in the bottom section.
$headerRow, $dataRows = (Get-Content -Raw foo.csv) -split '\r?\n', 2
# You can pipe the result to `where {$_.user -eq $user}`
ConvertFrom-Csv ($headerRow -replace '_dat(?=;|$)'), $dataRows -Delimiter ';'
Get-Content -Raw reads the entire file into memory, which is much faster than reading it line by line (the default).
-split '\r?\n', 2 splits the resulting multi-line string into two: the header line and all remaining lines.
Regex \r?\n matches a newline (both a CRLF (\r\n) and an LF-only newline (\n)).
, 2 limits the number of tokens to return to 2, meaning that splitting stops once the 1st token (the header row) has been found, and the remainder of the input string (comprising all data rows) is returned as-is as the last token.
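As a quick illustration, using a hypothetical two-row sample string:
$headerRow, $dataRows = "user;db1_dat`r`njohn;0`r`npeter;1" -split '\r?\n', 2
$headerRow # -> 'user;db1_dat'
$dataRows  # -> "john;0`r`npeter;1" (all data rows, still a single string)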
$headerRow -replace '_dat(?=;|$)'
-replace '_dat(?=;|$)' uses a regex to remove any _dat column-name suffixes (followed by a ; or the end of the string); if the substring _dat only ever occurs as a name suffix (not also inside names), you can simplify this to -replace '_dat'.
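For example, applied to the sample header row:
'user;database1_dat;database2_dat;database3_dat' -replace '_dat(?=;|$)'
# -> user;database1;database2;database3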
ConvertFrom-Csv directly accepts arrays of strings, so the cleaned-up header row and the string with all data rows can be passed as-is.
Alternative solution: algorithmic renaming of an object's properties:
Note: This solution is slow, but may be an option if you only extract a few objects from the CSV file.
As you note in the question, use of Select-Object with calculated properties is not an option in your case, because you neither know the column names nor their number in advance.
However, you can use a ForEach-Object command in which you use .psobject.Properties, an intrinsic member, for reflection on the input objects:
Import-Csv -Delimiter ';' foo.csv | where { $_.user -eq $user } | ForEach-Object {
    # Initialize an aux. ordered hashtable to store the renamed
    # property name-value pairs.
    $renamedProperties = [ordered] @{}
    # Process all properties of the input object and
    # add them with cleaned-up names to the hashtable.
    foreach ($prop in $_.psobject.Properties) {
        $renamedProperties[($prop.Name -replace '_dat$')] = $prop.Value
    }
    # Convert the aux. hashtable to a custom object and output it.
    [pscustomobject] $renamedProperties
}
You can do something like this:
$textInfo = (Get-Culture).TextInfo
$headers = (Get-Content .\test.csv | Select-Object -First 1).Split(';') |
    ForEach-Object {
        $textInfo.ToTitleCase($_) -replace '_dat'
    }
$user = 'peter'
Get-Content .\test.csv | Select-Object -Skip 1 |
    ConvertFrom-Csv -Delimiter ';' -Header $headers |
    Where-Object User -EQ $user
User Database1 Database2 Database3
---- --------- --------- ---------
peter 1 0 1
Not super efficient but does the trick.

Powershell Select Specific Text

Hello and thank you for your time. Here is what I am looking to do. I have several log files that I need to search through. I do this by using Get-ChildItem -Path C:\mylogfiles\*.log | Select-String -Pattern 'MyTextHere'. However, now I want to complicate my life and select only the text that is between single quotes in the log files.
Here is a sample of my log file:
This is some sample text in my log file. It has a lot of garbage that I don't want to see. However, it has text that I want to find, and if found I would like it to save just the selected text to a CSV file. I want to copy everything that is between single quotes. Here comes the text 'Please copy this text that is between the single quotes'
Any idea how I would go about doing this?
The following combines Select-String with ForEach-Object to extract only the phrases of interest (the parts of the line that matched the regex), wraps each in a [pscustomobject] instance with a .Phrase property, and exports the results with Export-Csv:
Select-String -Path C:\mylogfiles\*.log -AllMatches -Pattern "'.*?'" |
    ForEach-Object {
        foreach ($phrase in $_.Matches.Value) {
            [pscustomobject] @{ Phrase = $phrase.Trim("'") }
        }
    } |
    Export-Csv -NoTypeInformation -Encoding utf8 result.csv
Note: If there can only ever be at most one phrase of interest per line, you can omit the -AllMatches switch and replace the ForEach-Object call with the following Select-Object call, which uses a calculated property:
# ... |
Select-Object -Property @{ Name='Phrase'; Expression={ $_.Matches.Value.Trim("'") } } |
# ...

Powershell - Finding the output of get-contents and searching for all occurrences in another file using wild cards

I'm trying to combine the output of two separate files, but I'm stuck on the wildcard/contains Select-String search for the names from file A (names) in file B (name-rank).
The contents of file A is:
adam
george
william
assa
kate
mark
The contents of file B is:
12-march-2020,Mark-1
12-march-2020,Mark-2
12-march-2020,Mark-3
12-march-2020,william-4
12-march-2020,william-2
12-march-2020,william-7
12-march-2020,kate-54
12-march-2020,kate-12
12-march-2020,kate-44
And I need to match on every occurrence of the names, together with whatever follows the '-', so my ordered output (a combination of both files) should look like this:
mark
Mark-1
Mark-2
Mark-3
william
william-2
william-4
william-7
Kate
kate-12
kate-44
kate-54
So far I only have the following and I'd be grateful for any pointers or assistance please.
import-csv (c:\temp\names.csv) |
select-string -simplematch (import-csv c:\temp\names-rank.csv -header "Date", "RankedName" | select RankedName) |
set-content c:\temp\names-and-ranks.csv
I imagine the select-string isn't going to be enough and I need to write a loop instead.
The data you give in the example does not give you much to work with, and the desired output is not that intuitive; most of the time with PowerShell you would want to combine the data into much richer output at the end.
But anyway, with what is given here and what you want, the code below will get what you need. I have left comments in the code for you.
$pathDir = 'C:\Users\myUser\Downloads\trash'
$names = "$pathDir\names.csv"
$namesRank = "$pathDir\names-rank.csv"
$nameImport = Import-Csv -Path $names -Header names
$nameRankImport = Import-Csv -Path $namesRank -Header date,rankName
# create an empty array to collect the result
$list = @()
foreach ($name in $nameImport) {
    # get all the matching names
    $match = $nameRankImport.rankName -like "$($name.names)*"
    # add the name from the first list
    $list += ($name.names)
    # if there are any matches, add them too
    if ($match) {
        $list += $match
    }
}
# because the result is a single column of strings, Set-Content (rather than Export-Csv) gives us what we want
$list | Set-Content -Path "$pathDir\names-and-ranks.csv" -Force
For this I would use a combination of Group-Object and Where-Object to first group all "RankedName" items by the name before the dash, then filter on those names to be part of the names we got from the 'names.csv' file and output the properties you need.
# read the names from the file as string array
$names = Get-Content -Path 'c:\temp\names.csv'  # just a list of names, so really not a CSV
# import the CSV file and loop through
Import-Csv -Path 'c:\temp\names-rank.csv' -Header "Date", "RankedName" |
    Group-Object { ($_.RankedName -split '-')[0] } |  # group on the name before the dash in the 'RankedName' property
    Where-Object { $_.Name -in $names } |             # use only the groups that have a name that can be found in the $names array
    ForEach-Object {
        $_.Name                                           # output the group name (which is one of the $names)
        $_.Group.RankedName -join [environment]::NewLine  # output the group's 'RankedName' property joined with a newline
    } |
    Set-Content -Path 'c:\temp\names-and-ranks.csv'
Output:
Mark
Mark-1
Mark-2
Mark-3
william
william-4
william-2
william-7
kate
kate-54
kate-12
kate-44

How to find unique words in a text file and then store the unique words in a text file using powershell

I am using PowerShell. I want to remove duplicate words from a text file and then store the unique words in a text file. Here is what I do:
$A = $( foreach ($line in Get-Content C:\Test1\File1.txt) {
$line.tolower().split(" ")
}) | Sort-Object | Get-Unique
$A | export-csv "somefile.csv"
Here is my file.
PowerShell can use a dotnet type called a hashset, which is perfect for doing exactly this, and at the figurative speed of light too!
First we read the file into memory in PowerShell and assign it to a variable called $lines.
Next, we split the lines into the individual $words.
Finally, we create a hashset, which will only keep unique words or items.
$lines = get-content "C:\Users\Stephen\OneDrive\Documents\quotes.txt"
[string[]]$words = $lines.Split()
$uniqueWords = [System.Collections.Generic.HashSet[string]]::new($words)
Here's some info on how this works: we're using the hashset constructor overload that accepts a collection of input values.
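Since the question also asks to store the unique words back into a text file, a short follow-up might look like this (the output path is a placeholder):
# persist the de-duplicated words (path is hypothetical)
$uniqueWords | Sort-Object | Set-Content 'C:\Test1\UniqueWords.txt'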
But it's FAST!
It is amazingly fast to use a hashset too! I measured the performance on a reasonably sized file of 10 MB of text from samplefile.com, with a number of famous quotes and other info.
Method        TotalMs
------        -------
Get-Unique    21484.4956
Using Hashset  1840.7407
The hashset is dramatically faster: an order of magnitude faster in the worst case, and I've seen it be two orders of magnitude or more before.
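A rough way to reproduce such a comparison yourself is with Measure-Command (an illustrative sketch; exact numbers will vary with your file and hardware):
$lines = Get-Content 'C:\Test1\File1.txt'
$t1 = Measure-Command { $lines.Split() | Sort-Object | Get-Unique }
$t2 = Measure-Command { [System.Collections.Generic.HashSet[string]]::new([string[]]$lines.Split()) }
'Get-Unique    {0:n1} ms' -f $t1.TotalMilliseconds
'Using Hashset {0:n1} ms' -f $t2.TotalMilliseconds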
Or simply as one-liner
(Get-Content 'C:\Test1\File1.txt' -Raw) -split '\W' | Sort-Object -Unique | Set-Content -Path 'C:\Test1\File2.txt'
\W is regex for a Non-Word character like space, comma, etc.
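One caveat (my note, not part of the original answer): consecutive non-word characters produce empty tokens, and Sort-Object -Unique will keep one of them; a Where-Object filter drops it:
(Get-Content 'C:\Test1\File1.txt' -Raw) -split '\W' | Where-Object { $_ } | Sort-Object -Unique | Set-Content -Path 'C:\Test1\File2.txt'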
The main issue with your example was that you weren't processing the array returned by the split function:
Get-Content hello.txt | ForEach-Object { $wrds=$_.Split(" "); foreach ($i in $wrds) { Write-Output $i } } | Sort-Object | Get-Unique
Split each line into the array wrds, then loop over its contents to write the output, before sorting and removing the duplicates.

How to read a CSV file but exclude certain columns containing blanks using Get-Content

I want to read a CSV file and exclude rows in which the dynamically selected columns contain blanks, but only when all of those selected columns are blank.
Trying to use the Where clause in the statement below (but it is not working):
Get-Content $Source -ReadCount 1000 |
Where {
ForEach($NotEqualBlankCol in $BlankColumns)
{
$NotEqualBlankCol -ne $null -and $NotEqualBlankCol -ne ''}
} |
ConvertFrom-Csv |
Sort-Object -Property $SortByColNames.Replace('"', '') -Unique |
.
.
.
| Out-File $Destination
$BlankColumns is my dynamic string array, which I would like to loop through, containing the names of the CSV columns that are blank. It can be 1 column or more. When more than one, all of the selected columns need to be blank for a row to qualify for exclusion from the final CSV file output.
How do I do it using Get-Content? Any help would be appreciated.
Using Get-Content
Ok. So what this will do is read in the contents of a file X lines at a time. It will parse each line into its individual columns, then check the specified columns for blanks. If any of the flagged columns contains a blank, the line gets filtered out. Consider the test data I used for this:
id,first_name,last_name,email,gender,ip_address
1,Christina,Tucker,ctucker0@bbc.co.uk,Female,91.33.192.187
2,Jacqueline,Torres,jtorres1@shop-pro.jp,Female,205.70.183.107
3,Kathy,Perez,kperez2@hugedomains.com,Female,35.175.154.127
4,"",Holmes,eholmes3@canalblog.com,,
5,Ernest,Walker,ewalker4@marketwatch.com,Male,140.110.129.21
6,,Garza,cgarza5@jugem.jp,,
7,,Cunningham,jcunningham6@ox.ac.uk,Female,
8,,Clark,lclark7@posterous.com,,
9,,Ortiz,lortiz8@shareasale.com,,
Notice that the first_name and gender are blank for some of these folks. Ids 1, 2, 3 and 5 have complete data. The rest should be filtered.
$BlankColumns = "first_name","gender"
$headers = (Get-Content $path -TotalCount 1).Split(",")
$potentialBlankHeaderIndecies = 0..($headers.Count - 1) | Where-Object{$BlankColumns -contains $headers[$_]}
$potentialBlankHeaderIndecies
Get-Content $path -ReadCount 3 | Foreach-Object{
# Check to see if any of the indexes from a split are empty
$_ | Where-Object{
[bool[]](($_.Split(","))[$potentialBlankHeaderIndecies] | ForEach-Object{
![string]::IsNullOrEmpty($_.Trim('"'))
}) -notcontains $false
}
}
The output of this code is the file, as strings, with the offending entries removed. You can pipe this into a variable, a file, or whatever you need.
To go into a little more detail: we take the header names we want to check, and read in the first line of the CSV file, which should contain the column names. Using that, we determine the indexes of the columns we want to scrutinize. Then we read in the whole file and parse it line by line. For each line we split on the comma and check the elements matching the identified headers, testing whether each of those elements is blank or null. We trim quotes in case the value is an empty quoted string "", which I will assume you would count as blank. Each element is evaluated as a Boolean for whether it is empty; if at least one is, the line fails the Where-Object clause and gets omitted.
Using Import-CSV
$BlankColumns = "first_name","gender"
Import-CSV $path | Where-Object{
$line = $_
($BlankColumns | ForEach-Object{
![string]::IsNullOrEmpty(($line.$_.Trim('"')))
}) -notcontains $false
}
Very similar approach, just a lot less overhead since we are dealing with objects now instead of strings.
Now you could use Export-CSV or ConvertFrom-CSV depending on your needs in the rest of the project.
Changing the filter criteria:
Both examples above filter out rows where any of the selected columns contain blanks. If you want to omit rows only where all of them are blank, change the line }) -notcontains $false to }) -contains $true, as in the sketch below.
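Putting the object-based variant together with the all-blank criterion and writing the result back out might look like this (a sketch; $path and $Destination are placeholders):
$BlankColumns = "first_name","gender"
Import-Csv $path | Where-Object{
    $line = $_
    ($BlankColumns | ForEach-Object{
        ![string]::IsNullOrEmpty(($line.$_.Trim('"')))
    }) -contains $true
} | Export-Csv $Destination -NoTypeInformation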