Getting domain name from a URL using PowerShell?

Basically, I have a huge CSV of phishing links and I'm trying to trim off https://www. and anything after .com, .edu, etc. The ideal output of the PowerShell script would be a long list of URLs that all look something like google.com or microsoft.com. So far I have imported the CSV, but everything I have tried either doesn't work or leaves the www on the beginning. Any help would be great. The CSV I'm using is this: http://data.phishtank.com/data/online-valid.csv
$urls = Import-Csv -Path .\online-valid.csv | select -ExpandProperty "url"

The below will take your CSV and do magic for you. Have a play around with [Uri], it is very useful when parsing web links.
$csv = import-csv C:\temp\verified_online.csv
Foreach($Site in $csv) {
$site | Add-Member -MemberType NoteProperty -Name "Host" -Value $(([Uri]$Site.url).Host -replace '^www\.')
}
$csv | Export-Csv C:\temp\verified_online2.csv -NoTypeInformation
Adjusted based on recommendation from Mklement0.

A concise and fast alternative to Drew's helpful answer based on casting the URL strings directly to an array of [uri] (System.Uri) instances, and then trimming prefix www., if present, from their .Host (server name) property:
([uri[]] (Import-Csv .\online-valid.csv).url).Host -replace '^www\.'
Note that the -replace operator is regex-based, and regex ^www\. makes sure that www is only replaced at the start (^) of the string, and only if followed by a literal . (\.), in which case this prefix is removed (replaced with the implied empty string); if no such prefix is present, the input string is passed through as-is.
The solution reads the entire CSV file into memory at once, for convenience and speed, and outputs just the trimmed server names, as an array of strings.
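As a quick check of the approach above, here is a minimal sketch that runs the same cast against a few hard-coded URLs instead of the CSV. The sample URLs and the variable names $sampleUrls, $hosts and $uniqueHosts are made up for illustration, and the Sort-Object -Unique step is an optional extra, on the assumption that a de-duplicated list is wanted.

```powershell
# Hypothetical sample data standing in for the "url" column of the CSV
$sampleUrls = 'https://www.google.com/search?q=test',
              'http://login.microsoft.com/some/path',
              'https://www.google.com/other',
              'https://example.edu'

# Cast to [uri[]], take each .Host, and strip a leading "www." if present
$hosts = ([uri[]] $sampleUrls).Host -replace '^www\.'

# Optional: collapse repeated hosts into a sorted, unique list
$uniqueHosts = $hosts | Sort-Object -Unique
$uniqueHosts
```

Note that .Host keeps any other subdomain (login.microsoft.com stays as it is); only the www. prefix is removed by the regex.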

Related

Script returning error: "Get-Content : An object at the specified path ... does not exist, or has been filtered by the -Include or -Exclude parameter

EDIT
I think I now know what the issue is - The copy numbers are not REALLY part of the filename. Therefore, when the array pulls it and then is used to get the match info, the file as it is in the array does not exist, only the file name with no copy number.
I tried writing a rename script but the same issue exists... only the few files I manually renamed (so they don't contain copy numbers) were renamed (successfully) by the script. All others are shown not to exist.
How can I get around this? I really do not want to manually work with 23000+ files. I am drawing a blank..
HELP PLEASE
I am trying to narrow down a folder full of emails (copies) with the same name "SCADA Alert.eml", "SCADA Alert[1].eml"...[23110], based on contents. And delete the emails from the folder that meet specific content criteria.
When I run it I keep getting the error in the subject line above. It only sees the first file and the rest it says do not exist...
The script reads through the folder, creates an array of names (does this correctly).
Then it creates a variable, $email, and assigns it the content of the file, for each $filename in the array.
(this is where it breaks)
Then it should match the specific string I am looking for against the content of the $email var and return true or false. If true, I want it to remove the email, $filename, from the folder.
Thus narrowing down the email I have to review.
Any help here would be greatly appreciated.
This is what I have so far... (Folder is in the root of C:)
$array = Get-ChildItem -name -Path $FolderToRead #| Get-Content | Tee C:\Users\baudet\desktop\TargetFile.txt
Foreach ($FileName in $array){
$FileName # Check File
$email = Get-Content $FolderToRead\$FileName
$email # Check Content
$ContainsString = "False" # Set Var
$ContainsString # Verify Var
$ContainsString = %{$email -match "SYS$,ROC"} # Look for String
$ContainsString # Verify result of match
#if ($ContainsString -eq "True") {
#Remove-Item $FolderToRead\$element
#}
}
Here's a PowerShell-idiomatic solution that also resolves your original problems:
Get-ChildItem -File -LiteralPath $FolderToRead | Where-Object {
(Get-Content -Raw -LiteralPath $_.FullName) -match 'SYS\$,ROC'
} | Remove-Item -WhatIf
Note: The -WhatIf common parameter in the command above previews the operation. Remove -WhatIf once you're sure the operation will do what you want.
Note how the $ character in the RHS regex of the -match operator is \-escaped in order to use it verbatim (rather than as metacharacter $, the end-of-input anchor).
Also, given that $ is also used in PowerShell's string interpolation, it's better to use '...' strings (single-quoted, verbatim strings) to represent regexes, assuming no actual up-front string expansion is needed before the regex engine sees the resulting string - see this answer for more information.
As for what you tried:
The error message stemmed from the fact that Get-Content $FolderToRead\$FileName binds the file-name argument, $FolderToRead\$FileName, implicitly (positionally) to Get-Content's -Path parameter, which expects PowerShell wildcard patterns.
Since your file names literally contain [ and ] characters, they are misinterpreted by the (implied) -Path parameter, which can be avoided by using the -LiteralPath parameter instead (which must be specified explicitly, as a named argument).
%{$email -match "SYS$,ROC"} is unnecessarily wrapped in a ForEach-Object call (% is a built-in alias); while that doesn't do any harm in this case, it adds unnecessary overhead;
$email -match "SYS$,ROC" is enough, though it needs to be corrected to
$email -match 'SYS\$,ROC', as explained above.
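Both pitfalls described above can be reproduced without touching the file system. The following is a small sketch; the sample text in $text and the variable names are invented for illustration. It uses the -like operator to show how [1] behaves as a wildcard expression, and -match to show the effect of an unescaped $ in a regex.

```powershell
$name = 'SCADA Alert[1].eml'   # a file name that literally contains [ and ]

# As a wildcard pattern, [1] means "the single character 1",
# so the name does not match itself ...
$matchesItself  = $name -like $name
# ... but the pattern does match a name without the brackets.
$matchesPlain   = 'SCADA Alert1.eml' -like $name
# Escaping the wildcard metacharacters makes the brackets literal again.
$escapedPattern = [System.Management.Automation.WildcardPattern]::Escape($name)
$matchesEscaped = $name -like $escapedPattern

$text = 'Alarm source: SYS$,ROC unit 4'   # hypothetical email content

# Unescaped, $ is the end-of-input anchor, so nothing can follow it.
$unescapedMatch = $text -match 'SYS$,ROC'
# Escaped by hand, or via [regex]::Escape(), $ is matched verbatim.
$escapedMatch   = $text -match 'SYS\$,ROC'
$autoEscaped    = $text -match [regex]::Escape('SYS$,ROC')

$matchesItself, $matchesPlain, $matchesEscaped, $unescapedMatch, $escapedMatch, $autoEscaped
```

With file cmdlets, passing the name to -LiteralPath avoids the wildcard interpretation altogether, so no escaping is needed there.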
[System.IO.Directory]::EnumerateFiles($Folder) |
Where-Object {$true -eq [System.IO.File]::ReadAllText($_, [System.Text.Encoding]::UTF8).Contains('SYS$,ROC') } |
ForEach-Object {
Write-Host "Removing $($_)"
#[System.IO.File]::Delete($_)
}
Your mistakes:
%{$email -match "SYS$,ROC"} - What is % intended to do here? It is an alias for ForEach-Object.
%{$email -match "SYS$,ROC"} - Why use -match? It is much slower than -like or String.Contains().
%{$email -match "SYS$,ROC"} - When using $ inside double quotes, you should escape it with a single backtick (I have `$100). Otherwise, what follows $ is treated as a variable name or subexpression: Hello, $username; It's $($weather.ToString()) today!
Write debug output the right way: use Write-Debug, Write-Verbose, Write-Host, Write-Warning, Write-Error, Write-Information.
Can be better:
Avoid using Get-ChildItem, because it returns files with attributes (like mtime, atime, ctime, etc). This additional info costs an additional request per file. When you only need a list of files, use the native .NET EnumerateFiles from System.IO.Directory. This is a significant performance boost on huge numbers of files.
Use ReadAllText, ReadAllLines or ReadAllBytes from the System.IO.File static class to be more specific instead of using the universal Get-Content.
Use pipelines ;-)

Get Substring of value when using import-Csv in PowerShell

I have a PowerShell script that imports a CSV file, filters out rows from two columns and then concatenates a string and exports to a new CSV file.
Import-Csv "redirect_and_canonical_chains.csv" |
Where { $_."Number of Redirects" -gt 1} |
Select {"Redirect 301 ",$_.Address, $_."Final Address"} |
Export-Csv "testing-export.csv" -NoTypeInformation
This all works fine however for the $_.Address value I want to strip the domain, sub-domain and protocol etc using the following regex
^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/\n]+)
This individually works and matches as I want but I am not sure of the best way to implement when selecting the data (should I use $match, -replace etc) or whether I should do it after importing?
Any advice greatly appreciated!
Many thanks
Mike
The best place to do it would be in the select clause, as in:
select Property1,Property2,@{name='NewProperty';expression={$_.Property3 -replace '<regex>',''}}
That's what a calculated property is: you give the name, and the way to create it. Your regex might need revision to work with PowerShell, though.
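To make the calculated-property idea concrete, here is a minimal sketch run against one in-memory object rather than the real CSV. The sample row, the property name Path and the simplified regex ^https?://[^/]+ are assumptions chosen for illustration; they are not the regex from the question.

```powershell
# Hypothetical row standing in for one line of the imported CSV
$rows = [pscustomobject] @{
    Address         = 'http://www.example.org/more/stuff'
    'Final Address' = 'more/stuff2'
}

# Keep 'Final Address' as-is and add a calculated property 'Path'
# that strips the protocol and host from Address.
$result = $rows | Select-Object 'Final Address', @{
    Name       = 'Path'
    Expression = { $_.Address -replace '^https?://[^/]+', '' }
}

$result.Path
```

The resulting objects can be piped to Export-Csv like any other, with Path appearing as an ordinary column.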
I've realized now that I can just use .Replace in the following way :)
Select {"Redirect 301 ",$_.Address.Replace('http://', 'testing'), $_."Final Address"}
Based on follow-up comments, the intent behind your Select[-Object] call was to create a single string with space-separated entries from each input object.
Note that use of Export-Csv then makes no sense, because it will create a single Length column with the input strings' length rather than output the strings themselves.
In a follow-up comment you posted a solution that used Write-Host to produce the output string, but Write-Host is generally the wrong tool to use, unless the intent is explicitly to write to the display only, thereby bypassing PowerShell's output streams and thus the ability to send the output to other commands, capture it in a variable or redirect it to a file.
Here's a fixed version of your command, which uses the -join operator to join the elements of a string array to output a single, space-separated string:
$sampleCsvInput = [pscustomobject] @{
Address = 'http://www.example.org/more/stuff';
'Final Address' = 'more/stuff2'
}
$sampleCsvInput | ForEach-Object {
"Redirect 301 ",
($_.Address -replace '^(?:https?://)?(?:[^@/\n]+@)?(?:www\.)?([^:/\n]+)', ''),
$_.'Final Address' -join ' '
}
Note that , - PowerShell's array-construction operator - has higher precedence than the -join operator, so the -join operation indeed joins all 3 preceding array elements.
The above yields the following string:
Redirect 301  /more/stuff more/stuff2

Is there a way to loop through the publicsuffix list within Powershell?

I'm trying to test out a web filtering solution, so I have a PowerShell script which loops through a list of URLs and returns the web response. The problem is that you often hit CDNs or other sites that may return a 403 or a 404 Not Found, and you need to find the root domain.
The only logical solution from what I've found is to cross-reference it against the public suffix list. The only language it doesn't seem to work well with, from what I've seen, is PowerShell. I'm wondering if anyone has come across this or has a solution.
While your solution works, there is an alternative that is both more concise and much faster:
$url = 'https://publicsuffix.org/list/public_suffix_list.dat'
(Invoke-RestMethod $url) -split "`n" -match '^[^/\s]' |
Set-Content .\public_suffix_list.dat
Invoke-RestMethod $url returns the text file at the specified URL as a single string.
-split "`n" splits the string into an array of lines
-match '^[^/\s]' matches those lines that start with (^) a character (from the set enclosed in [...]) that is not (^) a literal / and not a whitespace character (\s), which effectively filters out comment / (hypothetical) non-data lines.
The above saves the data-lines-only array to a file, as in your solution.
Note that determining whether a given URL has a public suffix involves more than just suffix matching against the data lines, because the latter have wildcard labels (*) and involve exceptions (lines starting with !) - see https://publicsuffix.org/list/
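To illustrate why plain suffix matching is not enough, here is a simplified sketch of a lookup that takes wildcard (*) and exception (!) rules into account. The function name Get-RegistrableDomain, its parameters and the tiny $sampleRules list are hypothetical; the sketch assumes lowercase ASCII rules with comments already removed, does no IDN/punycode handling, and relies on the ::new() syntax of PowerShell 5 or later. Treat it as a starting point, not a complete implementation of the algorithm described on publicsuffix.org.

```powershell
# Hypothetical helper: returns the registrable domain (public suffix plus one
# label) for a host name, or $null if the host itself is a public suffix.
function Get-RegistrableDomain {
    param(
        [Parameter(Mandatory)] [string]   $HostName,
        [Parameter(Mandatory)] [string[]] $Rules
    )
    $ruleSet = [System.Collections.Generic.HashSet[string]]::new([string[]] $Rules)
    $labels  = $HostName.ToLowerInvariant().Trim('.').Split('.')
    $suffixLength = 1   # default rule "*": the last label alone is the suffix

    # Walk from the longest candidate suffix to the shortest.
    for ($i = 0; $i -lt $labels.Count; $i++) {
        $remaining = $labels.Count - $i
        $candidate = $labels[$i..($labels.Count - 1)] -join '.'
        if ($ruleSet.Contains("!$candidate")) {
            # Exception rule: the suffix is the rule minus its leftmost label.
            $suffixLength = $remaining - 1
            break
        }
        # Wildcard form of the candidate, e.g. "bar.ck" -> "*.ck"
        $wildcard = if ($remaining -gt 1) {
            '*.' + ($labels[($i + 1)..($labels.Count - 1)] -join '.')
        }
        if ($ruleSet.Contains($candidate) -or ($wildcard -and $ruleSet.Contains($wildcard))) {
            $suffixLength = $remaining   # first hit is the longest match
            break
        }
    }
    if ($labels.Count -le $suffixLength) { return $null }
    $labels[($labels.Count - $suffixLength - 1)..($labels.Count - 1)] -join '.'
}

# Tiny stand-in for the real list (assumption, for illustration only)
$sampleRules = 'com', 'uk', 'co.uk', '*.ck', '!www.ck'
Get-RegistrableDomain -HostName 'a.b.example.co.uk' -Rules $sampleRules
```

In practice the -Rules argument would be the lines saved to public_suffix_list.dat, for example via Get-Content.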
# You can use whatever directory
$workingdirectory = "C:\"
# Downloads the public suffix list
Invoke-WebRequest -Uri "https://publicsuffix.org/list/public_suffix_list.dat" -OutFile "$workingdirectory\public_suffix_list.dat"
# Gets the content of the file, removes the empty spaces, removes all the
# comments that has // and outputs it to a file
(gc $workingdirectory\public_suffix_list.dat) |
? { $_.Trim() -ne "" } |
Select-String -Pattern "//" -NotMatch |
Set-Content "$workingdirectory\public_suffix_list.dat"

Dynamic replacement of strings from import-csv

I have a csv file that contains fields with values that begin with "$" that are representative of variables in different powershell scripts. I am attempting to import the csv and then replace the string version of the variable (ex. '$var1') with the actual variable in the script. I have been able to isolate the appropriate strings from the input but I'm having difficulty turning the corner on modifying the value.
Example:
CSV input file -
In Server,Out Server
\\$var1\in_Company1,\\$var2\in_Company1
\\$var1\in_Company2,\\$var2\in_Company2
Script (so far) -
$Import=import-csv "C:\temp\test1.csv"
$Var1="\\server1"
$Var2="\\server2"
$matchstring="(?=\$)(.*?)(?=\\)"
$Import| %{$_ | gm -MemberType NoteProperty |
%{[regex]::matches($Import.$($_.name),"$matchstring")[0].value}}
Any thoughts as to how to accomplish this?
The simplest way to address this that I could think of was with variable expansion as opposed to replacement. You have the variables set in the CSV, so let's just expand them to their respective values in the script.
# If you don't have PowerShell 3.0 or later, -Raw will not work; use this instead:
# $Import=Get-Content $path | Out-String
$path = "C:\temp\test1.csv"
$Import = Get-Content $path -Raw
$Var1="\\server1"
$Var2="\\server2"
$ExecutionContext.InvokeCommand.ExpandString($Import) | Set-Content $path
This will net the following output in $path
In Server,Out Server
\\\\server1\in_Company1,\\\\server2\in_Company1
\\\\server1\in_Company2,\\\\server2\in_Company2
If the slashes are doubled up here and you do not want them to be, then just change the respective variable values in your script or the data in the CSV.
Caveat
This has the potential to execute malicious code you did not mean to. Have a look at this thread for more on Expanding variables in file contents. If you are comfortable with the risks then this solution as presented should be fine.
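If evaluating arbitrary file content is a concern, one alternative (a different technique from ExpandString, sketched here under assumptions) is to substitute only placeholder names found in a lookup table and leave everything else untouched. The $values table, the sample $line and the \$(\w+) pattern are made up for illustration; the server names are given without leading backslashes so that the result does not end up with doubled slashes.

```powershell
# Known placeholders and their values (hypothetical); @{} keys are case-insensitive
$values = @{ var1 = 'server1'; var2 = 'server2' }

# One sample line; $unknown has no entry in $values
$line = '\\$var1\in_Company1,\\$var2\in_Company1,$unknown'

# Replace each $name with its table value; unknown names are kept verbatim.
# Nothing from the input is ever evaluated as PowerShell code.
$expanded = [regex]::Replace($line, '\$(\w+)', {
    param($match)
    $name = $match.Groups[1].Value
    if ($values.ContainsKey($name)) { $values[$name] } else { $match.Value }
})

$expanded
```

The same call could be applied to the whole file content read with Get-Content -Raw before writing it back with Set-Content.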
Maybe I misunderstood, but do you need this?
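$path = "C:\temp\test1.csv"
$Import = Import-Csv $path
$Var1="\\server1"
$Var2="\\server2"
Foreach ($row in $Import )
{
# Replace the literal text \\$var1 / \\$var2 with the values of the script variables
$row.'In Server' = ($row.'In Server' -replace '\\\\\$var1', "$var1")
$row.'Out Server' = ($row.'Out Server' -replace '\\\\\$var2', "$var2")
}
# Export-Csv writes the rows back as CSV (Set-Content would only write each object's string form)
$Import | Export-Csv $path -NoTypeInformation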

Reformat column names in a csv with PowerShell

Question
How do I reformat an unknown CSV column name according to a formula or subroutine (e.g. rename column " Arbitrary Column Name " to "Arbitrary Column Name" by running a trim or regex or something) while maintaining data?
Goal
I'm trying to more or less sanitize columns (the names) in a hand-produced (or at least hand-edited) csv file that needs to be processed by an existing PowerShell script. In this specific case, the columns have spaces that would be removed by a call to [String]::Trim(), or which could be ignored with an appropriate regex, but I can't figure a way to call or use those techniques when importing or processing a CSV.
Short Background
Most files and columns have historically been entered into the CSV properly, but recently a few columns were being dropped during processing; I determined it was because the files contained a space (e.g., Select-Object was being told to get "RFC", but Import-CSV retrieved "RFC ", so no matchy-matchy). Telling the customer to enter it correctly by hand (though preferred and much simpler) is not an option in this case.
Options considered
I could manually process the text of the file, but that is a messy and error prone way to re-invent the wheel. I wonder if there's a syntax with Select-Object that would allow a softer match for column names, but I can't find that info.
The closest I have come conceptually is using a calculated property in the call to Select-Object to rename the column, but I can only find ways to rename a known column to another known column. So, this would require enumerating the columns and matching them exactly (preferred) or a softer match (like comparing after trimming or matching via regex as a fallback) with expected column names, then creating a collection of name mappings to use in constructing calculated properties from that information to select into a new object.
That seems like it would work, but it's more work than I'd prefer, and I can't help but hope that there's a simpler way I haven't been able to find via Google. Maybe I should try Bing?
Sample File
Let's say you have a file.csv like this:
" RFC "
"1"
"2"
"3"
Code
Now try to run the following:
$CSV = Get-Content file.csv -First 2 | ConvertFrom-Csv
$FixedHeaders = $CSV.PSObject.Properties.Name.Trim(' ')
Import-Csv file.csv -Header $FixedHeaders |
Select-Object -Skip 1 -Property RFC
Output
You will get this output:
RFC
---
1
2
3
Explanation
First we use Get-Content with parameter -First 2 to get the first two lines. Piping to ConvertFrom-Csv will allow us to access the headers with PSObject.Properties.Name. Use Import-Csv with the -Header parameter to use the trimmed headers. Pipe to Select-Object and use -Skip 1 to skip the original headers.
I'm not sure about comparisons in terms of efficiency, but I think this is a little more hardened, and imports the CSV only once. You might be able to use @lahell's approach and Get-Content -raw, but this was done and it works, so I'm gonna leave it to the community to determine which is better...
#import the CSV
$rawCSV = Import-Csv $Path
#get actual header names and map to their reformatted versions
$CSVColumns = @{}
$rawCSV |
Get-Member |
Where-Object {$_.MemberType -eq "NoteProperty"} |
Select-Object -ExpandProperty Name |
Foreach-Object {
#add a mapping to the original from a trimmed and whitespace-reduced version of the original
$CSVColumns.Add(($_.Trim() -replace '(\s)\s+', '$1'), "$_")
}
#Create the array of names and calculated properties to pass to Select-Object
$SelectColumns = @()
$CSVColumns.GetEnumerator() |
Foreach-Object {
# Evaluate the if/else immediately so that its result (not a script block) is appended
$SelectColumns += $(
if ($CSVColumns.values -contains $_.key) {$_.key}
else { @{Name = $_.key; Expression = $CSVColumns[$_.key]} }
)
}
$FormattedCSV = $rawCSV |
Select-Object $SelectColumns
This was hand-copied to a computer where I don't have the rights to run it, so there might be an error - I tried to copy it correctly
You can use gocsv https://github.com/DataFoxCo/gocsv to see the headers of the csv, you can then rename the headers, behead the file, swap columns, join, merge, any number of transformations you want