web scraping using powershell - powershell

I am trying to scrape pages of the website https://www.enghindi.com/ .
The URLs are saved in a CSV file, for example:
URL     Hindi meaning
url1    hindi meaning
url2    hindi meaning
Every time I run the following script, it shows the result for URL1 only, and that result is spread across multiple cells. I want all results for URL1 to be in one cell (in the "Hindi meaning" column), and similarly for URL2.
url1 : https://www.enghindi.com/index.php?q=close
url2 : https://www.enghindi.com/index.php?q=compose
$URLs = import-csv -path C:\Scripts\PS\urls.csv | select -expandproperty urls
foreach ($url in $urls)
{
    $web = Invoke-WebRequest $url
    $data = $web.AllElements | Where{$_.TagName -eq "BIG"} | Select-Object -Expand InnerText
    $datafinal = $data.where({$_ -like "*which*"},'until')
}
foreach ($item in $datafinal) {
    [pscustomobject]@{ Url = $url; Data = $item } | Export-Csv -Path C:\Scripts\PS\output.csv -NoTypeInformation -Encoding unicode -Append
}
Are there other ways to get English-to-Hindi word meanings using web scraping instead of copying and pasting? I would prefer Google Translate, but I think that is difficult, which is why I am trying enghindi.com.
Thanks a lot.

Web scraping, due to its inherent unreliability, should only be a last resort.
You can make it work in Windows PowerShell, but note that HTML DOM parsing is no longer available in PowerShell (Core) 7+.
Your code has two basic problems:
It operates on $datafinal after the foreach loop, at which point you only see the results of the last Invoke-WebRequest call.
You loop over each element of array $datafinal and create an output object for each, instead of creating an output object per input URL.
The following reformulation fixes these problems:
# Sample input URLs
$URLs = @(
    'https://www.enghindi.com/index.php?q=close',
    'https://www.enghindi.com/index.php?q=compose'
)
$URLs |
    ForEach-Object {
        $web = Invoke-WebRequest $_
        $data = $web.AllElements | Where { $_.TagName -eq "BIG" } | Select-Object -Expand InnerText
        $datafinal = $data.where({ $_ -like "*which*" }, 'until')
        # Create the output object for the URL at hand and implicitly output it.
        # Join the $datafinal elements with newlines to form a single value.
        [pscustomobject] @{
            Url   = $_
            Hindi = $datafinal -join "`n"
        }
    } |
    ConvertTo-Csv -NoTypeInformation
Note that, for demonstration purposes, ConvertTo-Csv is used in lieu of Export-Csv, which allows you to see the results instantly.
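To see the per-URL joining in isolation, here is a minimal offline sketch. The sample strings are placeholders standing in for the scraped BIG element texts (they are not real output from enghindi.com), and $scraped and $rows are names made up for this illustration. It also shows what the 'Until' mode of .Where() does: it returns the elements that come before the first one matching the filter.

```powershell
# Placeholder data standing in for the InnerText values scraped per URL
# (hypothetical; not real output from the site).
$scraped = [ordered]@{
    'https://www.enghindi.com/index.php?q=close'   = @('meaning 1', 'meaning 2', 'text which ends the list', 'footer')
    'https://www.enghindi.com/index.php?q=compose' = @('meaning 3', 'text which ends the list')
}

$rows = foreach ($url in $scraped.Keys) {
    # 'Until' keeps the elements that precede the first match of the filter.
    $datafinal = $scraped[$url].Where({ $_ -like '*which*' }, 'Until')

    # One output object per URL; all values joined into a single field.
    [pscustomobject]@{
        Url   = $url
        Hindi = $datafinal -join "`n"
    }
}

$rows | ConvertTo-Csv -NoTypeInformation
```

To write a file instead, the last line can be swapped for Export-Csv with the same -Path, -NoTypeInformation and -Encoding arguments used in the question.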

Related

How to get value of a link from a website with powershell?

I want to get the download URL of the latest version of GIMP from its site. I wrote a script, but it returns only the link name, and I do not know how to get the full URL.
$web = Invoke-WebRequest -Uri "https://download.gimp.org/pub/gimp/v2.10/windows/"
$web.Links | Where-Object href -like '*exe' | select -Last 1 | select -expand href
The above code returns the link name (gimp-2.10.32-setup.exe),
but I need the full URL ("https://download.gimp.org/pub/gimp/v2.10/windows/gimp-2.10.32-setup.exe").
Can someone guide me on how to do it?
You know that the url presented is relative.
Just append the root part of the URL yourself.
$Uri = 'https://download.gimp.org/pub/gimp/v2.10/windows/'
$web = Invoke-WebRequest -Uri $uri
$ExeRelLink = $web.Links | Where-Object href -like '*exe' | select -Last 1 -expand href
# Here is your download link.
$DownloadLink = $Uri + $ExeRelLink
Additional Note
You can combine the -Last and -Expand from your 2 select statements into 1.
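As an alternative to string concatenation, the relative link can be resolved against the page address with the .NET System.Uri class. This is only a sketch: the file name is the one quoted in the question and is hard-coded here rather than fetched, and $base and $relative are illustrative names.

```powershell
$base     = [uri]'https://download.gimp.org/pub/gimp/v2.10/windows/'
$relative = 'gimp-2.10.32-setup.exe'   # value taken from the question, not fetched

# The Uri(Uri, String) constructor resolves a relative reference against a base.
$DownloadLink = [uri]::new($base, $relative).AbsoluteUri
$DownloadLink

# If a page happens to use absolute hrefs, the same call leaves them unchanged.
$alreadyAbsolute = [uri]::new($base, 'https://example.org/file.exe').AbsoluteUri
$alreadyAbsolute
```

The trailing slash on the base still matters: without it, the last path segment of the base is replaced rather than extended.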
There are several download sites with exactly the same or a very similar layout to this GIMP page, including many Apache projects like Tomcat and ActiveMQ. I had written a little function to parse these and other pages in the past, and interestingly it also worked for this GIMP page, so I thought it was worth sharing.
Function Extract-FilenameFromWebsite {
    [cmdletbinding()]
    Param(
        [parameter(Position=0,ValueFromPipeline)]
        $Url
    )
    begin{
        $pattern = '<a href.+">(?<FileName>.+?\..+?)</a>\s+(?<Date>\d+-.+?)\s{2,}(?<Size>\d+\w)?'
    }
    process{
        $website = Invoke-WebRequest $Url -UseBasicParsing
        switch -Regex ($website.Content -split '\r?\n'){
            $pattern {
                [PSCustomObject]@{
                    FileName     = $matches.FileName
                    URL          = '{0}{1}' -f $Url,$matches.FileName
                    LastModified = [datetime]$matches.Date
                    Size         = $matches.Size
                }
            }
        }
    }
}
It's assumed the site passed in has a trailing slash. If you want to account for either, you can add this simple line to the process block.
if($Url -notmatch '/$'){$Url = "$Url/"}
To get the latest version, call the function like this
$url = 'https://download.gimp.org/pub/gimp/v2.10/windows/'
$latest = Extract-FilenameFromWebsite -Url $Url | Where-Object filename -like '*exe' |
Sort-Object LastModified | Select-Object -Last 1
$latest.url
Or you could expand the property while retrieving
$url = 'https://download.gimp.org/pub/gimp/v2.10/windows/'
$latesturl = Extract-FilenameFromWebsite -Url $Url | Where-Object filename -like '*exe' |
Sort-Object LastModified | Select-Object -Last 1 -ExpandProperty URL
$latesturl
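To check what the pattern captures without hitting the network, it can be run against a single listing line. The line below is a made-up example in the style of such index pages (the date and size are invented for illustration); the real page markup may differ.

```powershell
$pattern = '<a href.+">(?<FileName>.+?\..+?)</a>\s+(?<Date>\d+-.+?)\s{2,}(?<Size>\d+\w)?'

# Hypothetical listing line; date and size are invented for this sketch.
$line = '<a href="gimp-2.10.32-setup.exe">gimp-2.10.32-setup.exe</a>     2022-06-14 18:20  254M'

$entry = if ($line -match $pattern) {
    # The named groups are exposed as keys of the automatic $Matches table.
    [pscustomobject]@{
        FileName     = $Matches.FileName
        LastModified = [datetime]$Matches.Date
        Size         = $Matches.Size
    }
}
$entry
```

The Date group stops at the first run of two or more whitespace characters, which is why the single space between the date and the time stays inside the captured value.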

Parsing HTML with <DIV> class to variable

I am trying to parse a server monitoring page which doesn't have any class names. The HTML file looks like this:
<div style="float:left;margin-right:50px"><div>Server:VIP Owner</div><div>Server Role:ACTIVE</div><div>Server State:AVAILABLE</div><div>Network State:GY</div>
How do I parse this HTML content into variables like
$Server VIP Owner
$Server_Role Active
$Server_State Available
Since there is no class name, I am struggling to get this extracted.
$htmlcontent.ParsedHtml.getElementsByTagName('div') | ForEach-Object {
    New-Variable -Name $_.className -Value $_.textContent
}
While you are only showing us a very small part of the HTML, it is very likely there are more <div> tags in there.
Without an id property or anything else that uniquely identifies the div you are after, you can use a Where-Object clause to find the part you are looking for.
Try
$div = ($htmlcontent.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.InnerHTML -like '<div>Server Name:*' }).outerText
# if you're on PowerShell version < 7.1, you need to replace the (first) colons into equal signs
$result = $div -replace '(?<!:.*):', '=' | ConvertFrom-StringData
# for PowerShell 7.1, you can use the `-Delimiter` parameter
#$result = $div | ConvertFrom-StringData -Delimiter ':'
The result is a Hashtable like this:
Name                           Value
----                           -----
Server Name                    VIP Owner
Server State                   AVAILABLE
Server Role                    ACTIVE
Network State                  GY
Of course, if there are more of these in the report, you'll have to loop over divs with something like this:
$result = ($htmlcontent.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.InnerHTML -like '<div>Server Name:*' }) | Foreach-Object {
$_.outerText -replace '(?<!:.*):', '=' | ConvertFrom-StringData
}
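The colon replacement can be tried on a plain string without any HTML involved. In this sketch the text is a hypothetical stand-in for the outerText of the matching div, with a "Remarks" line added to show that a second colon on the same line is left alone.

```powershell
# Hypothetical text, shaped like the outerText of the matching <div>.
$div = "Server Name:VIP Owner`nServer Role:ACTIVE`nRemarks:checked at 10:30"

# The negative lookbehind rejects any colon that already has a colon before it
# on the same line, so only the first colon per line becomes '='.
$result = $div -replace '(?<!:.*):', '=' | ConvertFrom-StringData

$result['Server Name']
$result['Remarks']
```

ConvertFrom-StringData treats backslashes in values as escape sequences, so values containing paths would need extra care.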
Ok, so the original question did not show what we are dealing with..
Apparently, your HTML contains divs like this:
<div>=======================================</div>
<div>Service Name:MysqlReplica</div>
<div>Service Status:RUNNING</div>
<div>Remarks:Change role completed in 1 ms</div>
<div>=======================================</div>
<div>Service Name:OCCAS</div>
<div>Service Status:RUNNING</div>
<div>Remarks:Change role completed in 30280 ms</div>
To deal with blocks like that, you need a whole different approach:
# create a List object to store the results
$result = [System.Collections.Generic.List[object]]::new()
# create a temporary ordered dictionary to build the resulting items
$svcHash = [ordered]@{}
foreach ($div in $htmlcontent.ParsedHtml.getElementsByTagName('div')) {
    switch -Regex ($div.InnerText) {
        '^=+' {
            if ($svcHash.Count) {
                # add the completed object to the list
                $result.Add([PsCustomObject]$svcHash)
                $svcHash = [ordered]@{}
            }
        }
        '^(Service .+|Remarks):' {
            # split into the property Name and its value
            $name, $value = ($_ -split ':',2).Trim()
            $svcHash[$name] = $value
        }
    }
}
if ($svcHash.Count) {
    # A final service block is still pending. This happens when no closing
    # <div>=======================================</div>
    # was found in the HTML, so we need to add it to the list ourselves.
    $result.Add([PsCustomObject]$svcHash)
}
# output on screen
$result | Format-Table -AutoSize
# output to CSV file
$result | Export-Csv -Path 'X:\services.csv' -NoTypeInformation
Output on screen using the above example:
Service Name Service Status Remarks
------------ -------------- -------
MysqlReplica RUNNING        Change role completed in 1 ms
OCCAS        RUNNING        Change role completed in 30280 ms
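ParsedHtml is only available in Windows PowerShell, so where it is missing the same block logic can be fed from a regex over the raw HTML instead. This is a rough sketch that assumes every div of interest is a flat div with plain text and no attributes or nested tags, as in the sample above; regex-based HTML handling is fragile and would need adjusting for anything more complex. $html and $texts are illustrative names.

```powershell
$html = @'
<div>=======================================</div>
<div>Service Name:MysqlReplica</div>
<div>Service Status:RUNNING</div>
<div>Remarks:Change role completed in 1 ms</div>
<div>=======================================</div>
<div>Service Name:OCCAS</div>
<div>Service Status:RUNNING</div>
<div>Remarks:Change role completed in 30280 ms</div>
'@

# Text of every <div> that contains no nested tags (assumption: flat markup).
$texts = [regex]::Matches($html, '<div>([^<]*)</div>') |
    ForEach-Object { $_.Groups[1].Value }

$result  = [System.Collections.Generic.List[object]]::new()
$svcHash = [ordered]@{}
foreach ($text in $texts) {
    switch -Regex ($text) {
        '^=+' {
            # A separator line closes the block collected so far.
            if ($svcHash.Count) {
                $result.Add([pscustomobject]$svcHash)
                $svcHash = [ordered]@{}
            }
        }
        '^(Service .+|Remarks):' {
            $name, $value = ($_ -split ':', 2).Trim()
            $svcHash[$name] = $value
        }
    }
}
# Add the last block, which has no closing separator in this sample.
if ($svcHash.Count) { $result.Add([pscustomobject]$svcHash) }

$result | Format-Table -AutoSize
```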

Powershell function returning an array instead of string

I'm importing a CSV and I would like to add a column to it (with the result based on the previous columns).
My data looks like this:
host address,host prefix,site
10.1.1.0,24,400-01
I would like to add a column called "sub site",
so I wrote this function, but the problem is that the resulting value is an array instead of a string:
function site {
Param($s)
$s -match '(\d\d\d)'
return $Matches[0]
}
$csv = import-csv $file | select-object *,@{Name='Sub Site';expression= {site $_.site}}
if I run the command
PS C:\>$csv[0]
Host Address :10.1.1.0
host prefix :24
site :400-01
sub site : {True,400}
when it should look like
PS C:\>$csv[0]
Host Address :10.1.1.0
host prefix :24
site :400-01
sub site : 400
EDIT: I found the solution but the question is now WHY.
If I change my function to $s -match "\d\d\d" |out-null I get back the expected 400
Good that you found the answer; I was typing this up as you found it. The reason is that -match returns a value, and because that value is not captured it is written to the output stream. Everything a function writes to the output stream is "returned" from it, not just the argument of return.
For example, run this one line and see what it does:
"Hello" -match 'h'
It prints True.
Since I had this typed up, here is another way to phrase your question with the fix...
function site {
Param($s)
$null = $s -match '(\d\d\d)'
$ret = $Matches[0]
return $ret
}
$csv = @"
host address,host prefix,site
10.1.1.1,24,400-01
10.1.1.2,24,500-02
10.1.1.3,24,600-03
"@
$data = $csv | ConvertFrom-Csv
'1 =============='
$data | ft -AutoSize
$data2 = $data | select-object *,@{Name='Sub Site';expression= {site $_.site}}
'2 =============='
$data2 | ft -AutoSize
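The difference is easy to observe side by side. In this sketch, Get-SubSiteNoisy and Get-SubSite are made-up names; the second variant is one possible alternative to $null = or Out-Null, using the match itself as the condition.

```powershell
function Get-SubSiteNoisy {
    param($s)
    $s -match '\d\d\d'      # Boolean result is not captured, so it becomes output
    return $Matches[0]
}

function Get-SubSite {
    param($s)
    # Using -match as the condition consumes the Boolean, and also avoids
    # reading a stale $Matches value when the input does not match.
    if ($s -match '\d\d\d') { $Matches[0] }
}

$noisy = Get-SubSiteNoisy '400-01'   # array: $true, '400'
$clean = Get-SubSite '400-01'        # string: '400'

$noisy.GetType().Name
$clean.GetType().Name
```

The stale-value point matters because a failed -match leaves $Matches untouched, so returning $Matches[0] unconditionally can hand back the capture from an earlier input.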

Extracting a portion of a string then using it to match with other strings in Powershell

I previously asked for assistance parsing a text file and have been using this code for my script:
import-csv $File -header Tag,Date,Value |
    Where {$_.Tag -notmatch '(_His_|_Manual$)'} |
    Select-Object *,@{Name='Building';Expression={"{0} {1}" -f $($_.Tag -split '_')[1..2]}} |
    Format-table -Groupby Building -Property Tag,Date,Value
I've realized since then that, while the code filters out any tags containing _His or _Manual, I need to also filter any tags associated with _Manual. For example, the following tags are present in my text file:
L01_B111_BuildingName1_MainElectric_111A01ME_ALC,13-Apr-17 08:45,64075
L01_B111_BuildingName1_MainElectric_111A01ME_Cleansed,13-Apr-17 08:45,64075
L01_B111_BuildingName1_MainElectric_111A01ME_Consumption,13-Apr-17 08:45,10.4
L01_B333_BuildingName3_MainWater_333E02MW_Manual,1-Dec-16 18:00:00,4.380384E+07
L01_B333_BuildingName3_MainWater_333E02MW_Cleansed,1-Dec-16 18:00:00,4.380384E+07
L01_B333_BuildingName3_MainWater_333E02MW_Consumption,1-Dec-16 18:00:00,25.36
The 333E02MW_Manual string would be excluded using my current code, but how could I also exclude 333E02MW_Cleansed and 333E02MW_Consumption? I feel I would need something that will allow me to extract the 8-digit code before each _Manual instance and then use it to find any other strings with a {MatchingCode}
xxx_xxxx_xxxxxxxxxxx_xxxxxxxxxx_MatchingCode_Cleansed
xxx_xxxx_xxxxxxxxxxx_xxxxxxxxxx_MatchingCode_Consumption
I know there are the -like -contains and -match operators and I've seen these posts on using substrings and regex, but how could I extract the MatchingCode to actually have something to match to? This post seems to come closest to my goal, but I'm not sure how to apply it to PowerShell.
You can find every tag that ends with _Manual and create a regex pattern that matches any of the parts before _Manual. Ex.
$Data = Import-Csv -Path $File -Header Tag,Date,Value
# Create a regex that matches any prefix that has a manual row (matches using the value before _Manual)
$ExcludeManualPattern = ($Data | Foreach-Object { if($_.Tag -match '^(.*?)_Manual$') { [regex]::Escape($Matches[1]) } }) -join '|'
$Data | Where-Object { $_.Tag -notmatch '_His_' -and $_.Tag -notmatch $ExcludeManualPattern } |
    Select-Object -Property *,@{Name='Building';Expression={"{0} {1}" -f $($_.Tag -split '_')[1..2]}} |
    Format-table -GroupBy Building -Property Tag,Date,Value
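The filtering step can be checked offline with a few of the sample lines from the question. This sketch also adds a guard that is not in the answer above: if the file contains no _Manual rows, the joined pattern is an empty string, and an empty regex matches every tag, so -notmatch would drop everything. $lines, $prefixes and $kept are illustrative names.

```powershell
$lines = @(
    'L01_B111_BuildingName1_MainElectric_111A01ME_ALC,13-Apr-17 08:45,64075'
    'L01_B111_BuildingName1_MainElectric_111A01ME_Consumption,13-Apr-17 08:45,10.4'
    'L01_B333_BuildingName3_MainWater_333E02MW_Manual,1-Dec-16 18:00:00,4.380384E+07'
    'L01_B333_BuildingName3_MainWater_333E02MW_Cleansed,1-Dec-16 18:00:00,4.380384E+07'
    'L01_B333_BuildingName3_MainWater_333E02MW_Consumption,1-Dec-16 18:00:00,25.36'
)
$Data = $lines | ConvertFrom-Csv -Header Tag, Date, Value

# Collect the escaped part before _Manual for every manual row.
$prefixes = foreach ($row in $Data) {
    if ($row.Tag -match '^(.*?)_Manual$') { [regex]::Escape($Matches[1]) }
}

$kept = if ($prefixes) {
    $ExcludeManualPattern = $prefixes -join '|'
    $Data | Where-Object { $_.Tag -notmatch '_His_' -and $_.Tag -notmatch $ExcludeManualPattern }
} else {
    # No _Manual rows: an empty pattern would match every tag, so skip that test.
    $Data | Where-Object { $_.Tag -notmatch '_His_' }
}

$kept.Tag
```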

Convert a list of URLs into clickable HTML links using ConvertTo-HTML

I have the function below that creates an HTML table; however, I want the values in my table to be hyperlinks.
Function MyFunction{
    clear-host
    $item=$null
    $hash = $null #hashTable
    $hash =@{"Google.com" = "www.google.com"; "Yahoo" = "www.yahoo.com";"My Directory" ="C:\Users\Public\Favorites" ;"MSN" = "www.msn.com"}
    ForEach($item in @($hash.KEYS.GetEnumerator())){
        $hash | Sort-Object -Property name
    }
    $hash.GetEnumerator() | sort -Property Name |Select-Object Name, Value | ConvertTo-HTML | Out-File .\Test.html
}
MyFunction
Small side note before we start: The ForEach seems pointless as it just outputs the sorted table for each element in the table. So that gets removed.
ConvertTo-HTML on its own is great for making tables from PowerShell objects, but what we need is hyperlinks instead. ConvertTo-HTML does support calculated properties, so at first it would seem easy to just make a formatted hyperlink.
ConvertTo-HTML -Property *,@{Label="Link";Expression={"<a href='$($_.Value)'>$($_.Name)</a>"}}
The small issue with that is ConvertTo-Html does some conversion on the string being sent to it which obfuscates our intention. Looking in the output file that is created we see the following for the Yahoo link:
<td>&lt;a href=&#39;www.yahoo.com&#39;&gt;Yahoo&lt;/a&gt;</td>
ConvertTo-Html made our text browser-friendly, which would normally be nice but hinders us now. A blog post from PowerShell Magazine covers this issue by decoding the HTML before it is sent to file.
Function MyFunction{
    clear-host
    $hash = @{"Google.com" = "www.google.com"; "Yahoo" = "www.yahoo.com";"My Directory" ="C:\Users\Public\Favorites" ;"MSN" = "www.msn.com"}
    $hash.GetEnumerator() | sort -Property Name | Select-Object Name, Value |
        ConvertTo-HTML -Property *,@{Label="Link";Expression={"<a href='$($_.Value)'>$($_.Name)</a>"}}
}
$html = MyFunction
Add-Type -AssemblyName System.Web
[System.Web.HttpUtility]::HtmlDecode($html) | Out-File C:\temp\test.html
Using [System.Web.HttpUtility]::HtmlDecode converts values like &lt; back to what we want them to be. Have a look at the output.
I went looking to see if this has been asked before, and there is a similar answer: How to generate hyperlinks to .htm files in a directory in Powershell?. It is handled in a different way, so I'm on the fence about marking this as a duplicate.
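The encode-then-decode round trip can be observed without writing any file. This sketch swaps in [System.Net.WebUtility]::HtmlDecode, which needs no Add-Type call, in place of the System.Web.HttpUtility method used above; the two sample rows and the variable names are made up for the illustration.

```powershell
$rows = @(
    [pscustomobject]@{ Name = 'MSN';   Value = 'www.msn.com' }
    [pscustomobject]@{ Name = 'Yahoo'; Value = 'www.yahoo.com' }
)

$html = $rows | ConvertTo-Html -Property Name, @{
    Label      = 'Link'
    Expression = { "<a href='$($_.Value)'>$($_.Name)</a>" }
}

# ConvertTo-Html emits an array of lines; join them before decoding.
$encoded = $html -join "`n"
$decoded = [System.Net.WebUtility]::HtmlDecode($encoded)

$decoded
```

Decoding the whole document also decodes any entities that were escaped on purpose in other cells, so this approach suits tables whose remaining values contain no markup characters.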