Parsing HTML with PowerShell on dynamic sites - powershell

Is there a way to parse HTML from the http://www.pgatour.com site using the Invoke-WebRequest cmdlet? When I try, ParsedHtml does not contain the elements I need (because the cmdlet parses the page incorrectly).
I tried getting the data from this page by creating an IE COM object in PowerShell. That works, but it is very slow, so I'm wondering whether there is another approach using Invoke-WebRequest (or even external parsers).
Thanks!

You could give the HtmlAgilityPack a try to parse the content returned by Invoke-WebRequest. In this scenario, I would use the -UseBasicParsing parameter.

Windows 10 64-bit. PowerShell 5.1
You can parse HTML with PowerShell 5.1 using Invoke-WebRequest and a regex that returns everything between un-nested tags such as <html>, <title>, <head>, and <body>. Nested tags will take some tweaking.
Invoke-WebRequest -Uri http://www.pgatour.com | sc golf.html
# (?s) lets . match newlines; .*? is lazy and stops at the first </body>
(gc -raw golf.html) -match '(?s)<body[^>]*>.*?</body>'
$matches[0]
Everything between <div class="success-message"> and the next </div>:
Invoke-WebRequest -Uri http://www.pgatour.com | sc golf.html
# the lazy .*? stops at the first </div> after the opening tag
(gc -raw golf.html) -match '(?s)<div class="success-message">.*?</div>'
$matches[0]
Greedy and lazy quantifiers explained
regex101.com is your friend.
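To make the greedy/lazy difference concrete, here is a small offline sketch. The markup in $html is invented for illustration and is not taken from pgatour.com. The (?s) inline option is what lets . match across line breaks.

```powershell
# Hypothetical sample markup (not from the real site).
$html = @'
<body>
<div class="success-message">
  Saved.
</div>
<div class="other">Ignored</div>
</body>
'@

# Lazy: '.*?' stops at the first </div> after the opening tag.
$lazy = [regex]::Match($html, '(?s)<div class="success-message">.*?</div>').Value

# Greedy: '.*' runs on to the last </div> in the input.
$greedy = [regex]::Match($html, '(?s)<div class="success-message">.*</div>').Value

$lazy
$greedy
```

With this sample, only the greedy match swallows the second div. Neither pattern copes with a div nested inside the target div, which is where a real HTML parser becomes the safer choice.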

Related

How to improve or optimize a particular Invoke-WebRequest in Powershell?

I have this Powershell command:
((Invoke-WebRequest https://www.intel.com/content/www/us/en/download/19344/intel-graphics-windows-10-windows-11-dch-drivers.html).AllElements | Where-Object -Property TagName -eq "META" | where -Property name -eq RecommendedDownloadUrl).content
I know that this can probably be done better. It's a specific question, but I think I can learn a lot from your answers.
I just want to get the recommended URL from the META tag so I can download the latest graphics driver from Intel's website.
I ran one round of improvement, reducing Where-Object to just one command:
((Invoke-WebRequest https://www.intel.com/content/www/us/en/download/19344/intel-graphics-windows-10-windows-11-dch-drivers.html).AllElements | Where-Object {$_.TagName -eq "META" -and $_.name -eq "RecommendedDownloadUrl"}).content
Thanks!
You can at least return only the elements that match the name you want with:
(Invoke-WebRequest https://www.intel.com/content/www/us/en/download/19344/intel-graphics-windows-10-windows-11-dch-drivers.html).ParsedHtml.getElementsByName('RecommendedDownloadUrl')
If you are ok with not specifying that it is META you can just take that result and get the content of it:
$((Invoke-WebRequest https://www.intel.com/content/www/us/en/download/19344/intel-graphics-windows-10-windows-11-dch-drivers.html).ParsedHtml.getElementsByName('RecommendedDownloadUrl')).content
Assuming that they only have one element named RecommendedDownloadUrl that should work fine. It still parses the page, so it probably isn't much faster, but it works with the object's inherent methods rather than pumping tons of objects through a Where-Object filter.
This is probably the best way to do it - but if you want to optimize for speed, you could use the -UseBasicParsing switch to prevent PowerShell from spinning up a headless instance of Internet Explorer to parse the html (which is why it's slow):
$content = Invoke-WebRequest 'https://www.intel.com/content/www/us/en/download/19344/intel-graphics-windows-10-windows-11-dch-drivers.html' -UseBasicParsing |% Content
Now that the raw HTML is in a variable, we just need to parse the link out of the relevant meta tag manually:
if($content -match '<meta\ name="RecommendedDownloadUrl"\ content="([^"]+)'){
# grab link from capture group
$Matches[1]
}
The reason I wouldn't consider this "better" than your current solution is that your script might break if Intel makes changes to the HTML that wouldn't otherwise affect the DOM. If they suddenly switch the order of the attributes around, for example, it won't work anymore.
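One way to soften the attribute-order problem, as a rough sketch: isolate each meta tag first, then read name and content from it independently. The function name Get-MetaContent and the sample markup are made up for this example, and the real page's markup may differ.

```powershell
# Hypothetical sample with the attributes in the "wrong" order.
$content = @'
<meta content="https://example.com/driver.exe" name="RecommendedDownloadUrl"/>
<meta name="Other" content="x"/>
'@

function Get-MetaContent {
    param([string]$Html, [string]$Name)
    # Look at each <meta ...> tag separately, then read its attributes
    # one at a time so their order does not matter.
    foreach ($tag in [regex]::Matches($Html, '(?i)<meta\b[^>]*>')) {
        if ($tag.Value -match ('\bname="' + [regex]::Escape($Name) + '"') -and
            $tag.Value -match '\bcontent="([^"]*)"') {
            return $Matches[1]
        }
    }
}

Get-MetaContent -Html $content -Name 'RecommendedDownloadUrl'
```

This is still string matching, so it assumes double-quoted attribute values and no > character inside them.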

Compare-Object API response with text file shows everything is different

I have a Web API that returns an HTML string which I want to compare with a html file on my local machine.
To do so, I have the following code
$Result = (Invoke-WebRequest `
-Uri "{uri}" `
-Headers @{"some-header" = "some-value"}).Content
$TestContent = Get-Content -Path ($RepositoryLocation + "index.html") -Raw
$Equal = Compare-Object -ReferenceObject $TestContent -DifferenceObject $Result
When I now run Write-Host $Equal, it shows that the whole content is different.
When I run Write-Host $Equal.SideIndicator, it shows => <=, which also indicates that the complete file is different.
Furthermore, running the command with -IncludeEqual -ExcludeDifferent returns an empty result, so as I said, no lines are the same.
So what I tried next was to save the Content of $Result into a text file and compare them then, but still, it told me, that the whole file is different.
I then used diffchecker.com as well as the JetBrains IDE's integrated comparison tool to check for differences. Both tools told me that the content is identical. I'm losing my mind: why does PowerShell tell me they have completely different content?
Sadly, I cannot post the content of the API response as well as the content of the index.html
What I thought could be the reason:
Encoding, however both are UTF-8
Line endings, however there is no difference whether I use CR, CRLF or LF on the file
Some hidden characters (tabs instead of spaces), but I enabled showing those in the JetBrains IDE and the files still look identical.
How do I know what's causing this issue here?
Not sure if Compare-Object is the right choice here. As the name says, it's for objects. Why not use the equals-operator -eq or string.Equals?
$equal = $testContent -eq $result
# or
$equal = $testContent.Equals($result)
Compare-Object does not do line-by-line-comparison if both are just single strings. So your strings are not equal, as long as there's any tiny difference, as much as just an extra line-feed at the end, etc.
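A quick way to see this behaviour, using two made-up strings that differ only by a trailing line feed:

```powershell
# Sample data: identical text, except $b ends with an extra newline.
$a = "line one`nline two"
$b = "line one`nline two`n"

# Each argument is a single object, so any difference at all is reported
# as one entry per side rather than as a per-line diff.
$whole = @(Compare-Object -ReferenceObject $a -DifferenceObject $b)
$whole.Count
$whole.SideIndicator
```

That matches the => <= output from the question: one entry for each whole string.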
You could try several things:
# trim
$equal = $testContent.Trim() -eq $result.Trim()
# case-insensitive comparison
$equal = $testContent.Equals($result, 'OrdinalIgnoreCase')
# or both
$equal = $testContent.Trim().Equals($result.Trim(), 'OrdinalIgnoreCase')
Btw: If you do want a line-by-line comparison, you have to split the strings into lines first. The pattern \r?\n handles both CRLF and LF line endings, e.g.:
Compare-Object ($testContent -split '\r?\n') ($result -split '\r?\n')
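To find out what is actually different, it can help to locate the first character where the two strings diverge and print its code point. The helper below is only a sketch, and Get-FirstDifference is a name invented for this example. It is meant for spotting invisible differences such as a stray carriage return or a byte-order mark.

```powershell
function Get-FirstDifference {
    param([string]$Left, [string]$Right)
    $limit = [Math]::Min($Left.Length, $Right.Length)
    for ($i = 0; $i -lt $limit; $i++) {
        # Compare code points so the check is exact (case- and culture-independent).
        if ([int]$Left[$i] -ne [int]$Right[$i]) {
            return [pscustomobject]@{
                Index = $i
                Left  = 'U+{0:X4}' -f [int]$Left[$i]
                Right = 'U+{0:X4}' -f [int]$Right[$i]
            }
        }
    }
    if ($Left.Length -ne $Right.Length) {
        # One string is a prefix of the other; report where the shorter one ends.
        return [pscustomobject]@{ Index = $limit; Left = $null; Right = $null }
    }
}

# Sample data: same markup, CRLF on one side and LF on the other.
$diff = Get-FirstDifference -Left "<p>Hello</p>`r`n" -Right "<p>Hello</p>`n"
$diff
```

In your case you would call it as Get-FirstDifference -Left $TestContent -Right $Result. No output means the strings are identical.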

Output in a different format

I'm hitting an API endpoint for the UniFi Controller.
I run a command to give me devices that do NOT have a specific MAC address.
The command returns 3 mac addresses.
When working with PowerShell, how can you use the 3 mac addresses returned as 3 different items so that I can do a 'foreach' statement on them?
$returnedDevices = $response.data | select mac | where-object {$_.mac -notlike "18:e8:29:4f:0b:33"}
This is what is returned:
mac
---
80:2a:a8:c9:c3:c3
18:e8:29:93:00:d1
18:e8:29:93:6c:85
When I run: $returnedDevices.Count
I get '3'.
So it looks like 3 different values, but if I try to use this in a ForEach statement:
ForEach ($item in $returnedDevices) {
$command = "`{`"cmd`":`"restart`",`"reboot_type`":`"soft`"`,`"mac`":`"$item.mac`"`}"
$response = Invoke-RestMethod -Uri "$devURI" -Method Post -Body $command -ContentType "application/json" -WebSession $session
}
What happens is that I get an error, because this is what is being sent to the API:
{"cmd":"restart","reboot_type":"soft","mac":"@{mac=80:2a:a8:c9:c3:c3}.mac"}
I would think it should be:
{"cmd":"restart","reboot_type":"soft","mac":"80:2a:a8:c9:c3:c3"}
How can I either trim the beginning or, better yet, get the right format?
Also, if I do the following, the output looks like what I would expect:
$item.mac
Output:
80:2a:a8:c9:c3:c3
Thank you,
When you write a variable and its property inside a double-quoted string, only the variable itself is expanded; the property name becomes part of the literal string.
$command = "`{`"cmd`":`"restart`",`"reboot_type`":`"soft`"`,`"mac`":`"$item.mac`"`}"
# vs
$command = "`{`"cmd`":`"restart`",`"reboot_type`":`"soft`"`,`"mac`":`"$($item.mac)`"`}"
The first command expands $item but treats .mac as part of the regular string. If you need to read the mac property of $item inside a string, you have to make the whole expression evaluate as a unit, and that's done by wrapping it in $().
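Here is a small sketch of both behaviours, plus an alternative that avoids hand-escaping the JSON by building the body from a hashtable and passing it to ConvertTo-Json. The $item object below is a stand-in for one entry of $returnedDevices.

```powershell
# Sample object standing in for one entry of $returnedDevices.
$item = [pscustomobject]@{ mac = '80:2a:a8:c9:c3:c3' }

# Without $(), only $item is expanded and '.mac' stays literal text.
$wrong = "mac=$item.mac"

# With $(), the whole expression is evaluated before it is inserted.
$right = "mac=$($item.mac)"

# Alternative: let ConvertTo-Json handle the quoting and escaping.
$command = [ordered]@{
    cmd         = 'restart'
    reboot_type = 'soft'
    mac         = $item.mac
} | ConvertTo-Json -Compress

$wrong
$right
$command
```

The resulting $command can be passed to -Body exactly as in the original loop.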

Invoke-WebRequest and Hebrew characters

I already tried the registry hack for PowerShell to support Hebrew characters. I can type Hebrew without problems, but for some reason text containing Hebrew that is returned from Invoke-WebRequest comes out as gibberish (see the following screenshot).
Here's the site URL I'm attempting to query:
https://www.hometheater.co.il/vt278553.html
Update:
It looks like the content-type being returned is ALWAYS of charset Windows-1255 which is probably the issue.
This seems to be not only an issue of having to specify the encoding, but also that the shell cannot display the encoding correctly. If you write the output to a file with the encoding specified and open it in a decent text editor (not Notepad but e.g. Notepad++), you will see that it has been parsed correctly.
Invoke-WebRequest -Uri "https://www.hometheater.co.il/vt278553.html" -ContentType "text/plain; charset=Windows-1255" -OutFile content.txt
We can also test that the in-memory presentation is correct by reading it and writing it to another file:
Get-Content .\content.txt | Set-Content test.txt
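If the in-memory string itself turns out to be garbled, another option is to decode the raw response bytes with the charset the server declares. The sketch below works on a hard-coded byte array so it runs offline. With a real response in PowerShell 5.1, the bytes would come from something like $response.RawContentStream.ToArray(), which is an assumption about your setup rather than something tested against this site.

```powershell
# Bytes for a short Hebrew word as a Windows-1255 server would send them (sample data).
[byte[]]$bytes = 0xF9, 0xEC, 0xE5, 0xED

# The same word built from Unicode code points, for comparison.
$expected = -join ([char[]](0x05E9, 0x05DC, 0x05D5, 0x05DD))

# Decode explicitly with the charset from the Content-Type header.
$hebrew  = [System.Text.Encoding]::GetEncoding('windows-1255')
$decoded = $hebrew.GetString($bytes)

# Decoding the same bytes as UTF-8 does not give the original text back.
$misread = [System.Text.Encoding]::UTF8.GetString($bytes)

$decoded -eq $expected
$misread -eq $expected
```

Whether the console then shows the decoded text properly still depends on its font and output encoding, so writing the string to a UTF-8 file remains a good check.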

Select-String -Pattern Output when piping into

I'm trying to get PowerShell to output a semantic version into a variable, but I get stuck: either the ISE just displays the command I entered, or I get one of two errors ('missing argument' or 'doesn't accept piped input'). When I try to resolve those errors, I end up with the command simply being displayed again.
How do I get this:
(Invoke-WebRequest -Uri http://someplace).Links.href | Out-String -Stream |
Select-String -Pattern [regex]$someGoodRegex -OutVariable $NodeVersion_target
to simply store the matched term in the variable named by -OutVariable, assuming the regex and the web request are both sound?
On a more general note, is there a good way to display the properties of the objects in the pipe? After a great deal of digging, I stumbled upon {$_} but can't get it to display anything but the command again if the command gets a little more complex than just a simple cmdlet.
Remove the [regex]. Select-String already treats the argument to the parameter -Pattern as a regular expression.
Remove the $ from the variable name. You need it to use a variable directly, but the -OutVariable parameter expects the bare variable name without the leading $.
You can also remove the Out-String -Stream.
Something like this should work:
$uri = 'http://www.example.com/'
$re = 'v\d+\.\d+\.\d+/s'
(Invoke-WebRequest -Uri $uri).Links.href |
Select-String -Pattern $re -OutVariable NodeVersion_target
Alternatively you can assign the output of the pipeline to a variable instead of using -OutVariable:
$uri = 'http://www.example.com/'
$re = 'v\d+\.\d+\.\d+/s'
$NodeVersion_target = (Invoke-WebRequest -Uri $uri).Links.href |
Select-String -Pattern $re
The latter is actually more PoSh.
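Note that Select-String returns MatchInfo objects rather than plain strings. The offline sketch below uses a made-up list of hrefs in place of the web request and a simplified pattern to show how to get at the matched text itself.

```powershell
# Sample hrefs standing in for (Invoke-WebRequest ...).Links.href.
$hrefs = 'latest/', 'v6.9.1/', 'v7.0.0/', 'npm/'
$re = 'v\d+\.\d+\.\d+'

# -OutVariable takes the variable name without the leading '$'.
$found = $hrefs | Select-String -Pattern $re -OutVariable NodeVersion_target

# Line holds the whole input string; Matches holds what the regex captured.
$versions = $found | ForEach-Object { $_.Matches[0].Value }
$versions
```

Both $found and $NodeVersion_target hold the same MatchInfo objects here, so either can be used for the follow-up step.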
As for inspecting the current object in a pipeline: pipe into Get-Member to get a list of the properties/methods of the pipelined objects, and pipe into Format-List * to get a list of the values of the pipelined objects.
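For instance, with a throwaway object standing in for whatever is flowing through your pipeline:

```powershell
# Sample object; in practice this would be the output of an earlier pipeline stage.
$sample = [pscustomobject]@{ Href = 'v7.0.0/'; Length = 7 }

# Get-Member lists the members; here it is filtered down to the property names.
$propertyNames = ($sample | Get-Member -MemberType NoteProperty).Name
$propertyNames

# Format-List * shows every property together with its current value.
$sample | Format-List *
```

Real cmdlet output usually exposes Property members rather than NoteProperty, so dropping the -MemberType filter is the safer default when exploring.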