Using querySelectorAll on an mshtml.HTMLDocumentClass object in PowerShell causes a crash - powershell

I'm trying to do some web-scraping via PowerShell, as I've recently discovered it is possible to do so without too much trouble.
A good starting point is to just fetch the HTML, use Get-Member, and see what I can do from there, like so:
$html = Invoke-WebRequest "https://www.google.com"
$html.ParsedHtml | Get-Member
The methods available to me for fetching specific elements appear to be the following:
getElementById()
getElementsByName()
getElementsByTagName()
For example I can get the first IMG tag in the document like so:
$html.ParsedHtml.getElementsByTagName("img")[0]
However, after doing some more research into whether I could use CSS selectors or XPath, I discovered that there are unlisted methods available, since we are just using the HTML Document object documented here:
querySelector()
querySelectorAll()
So instead of doing:
$html.ParsedHtml.getElementsByTagName("img")[0]
I can do:
$html.ParsedHtml.querySelector("img")
So I was expecting to be able to do:
$html.ParsedHtml.querySelectorAll("img")
...in order to get all of the IMG elements. All the documentation I've found and googling I've done supports this. However, in all my testing this function crashes the calling process and reports a heap corruption exception code in the Event Log (0xc0000374).
I'm using PowerShell 5 on Windows 10 x64. I've tried it in a Win10 x64 VM that is a clean build and fully patched. I've also tried it in Win7 x64 upgraded to PowerShell 5. I haven't tried it on anything prior to PowerShell 5, as all our systems here are upgraded, but I probably will once I have time to spin up a new vanilla VM for testing.
Has anyone run into this issue before? All my research so far is a dead end. Are there alternatives to querySelectorAll? I need to scrape pages that will have predictable sets of tags inside unpredictable layouts and potentially no IDs or classes assigned to the tags, so I want to be able to use selectors that allow structure/nesting/wildcards.
P.S. I've also tried using the InternetExplorer.Application COM object in PowerShell. The result is the same, except that Internet Explorer crashes instead of PowerShell. This was actually my original approach; here's the code:
# create browser object
$ie = New-Object -ComObject InternetExplorer.Application
# make browser visible for debugging, otherwise this isn't necessary for function
$ie.Visible = $true
# browse to page
$ie.Navigate("https://www.google.com")
# wait till browser is not busy
Do { Start-Sleep -m 100 } Until (!$ie.Busy)
# this works
$ie.document.getElementsByTagName("img")[0]
# this works as well
$ie.document.querySelector("img")
# blow it up
$ie.document.querySelectorAll("img")
# we wanna quit the process, but since we blew it up we don't really make it here
$ie.Quit()
Hope I'm not breaking any rules and this post makes sense and is relevant, thanks.
UPDATE
I tested earlier PowerShell versions. v2-v4 crash using the InternetExplorer.Application COM method. v3-4 crash using the Invoke-WebRequest method, v2 doesn't support it.

I ran into this problem, too, and posted about it on reddit. I believe the problem happens when PowerShell tries to enumerate the HTML DOM NodeList object returned by querySelectorAll(). The same object is returned by childNodes(), which can be enumerated by PS, so I'm guessing there's some glue code written for .ParsedHtml.childNodes but not .ParsedHtml.querySelectorAll(). The crash can be triggered by Intellisense trying to get tab-complete help for the object, too.
I found a way around it, though! Just access the native DOM methods .item() and .length directly and emit the node objects into a PowerShell array. The following code pulls the newest page of posts from /r/PowerShell, gets the post list anchors via querySelectorAll(), then manually enumerates them using the native DOM methods into a PowerShell-native array.
$Result = Invoke-WebRequest -Uri "https://www.reddit.com/r/PowerShell/new/"
$NodeList = $Result.ParsedHtml.querySelectorAll("#siteTable div div p.title a")
# Copy each node into a PowerShell array by index instead of enumerating $NodeList
$PsNodeList = @()
for ($i = 0; $i -lt $NodeList.Length; $i++) {
    $PsNodeList += $NodeList.item($i)
}
$PsNodeList | ForEach-Object {
    $_.InnerHtml
}
Edit: .Length seems to work capitalized or lower-case. I would have expected the DOM to be case-sensitive, so either something is going on to help translate or I'm misunderstanding something. Also, the CSS selector is grabbing the source links (self.PowerShell mostly), but that is my CSS selector logic error, not a problem with querySelectorAll(). Note that the results of querySelectorAll() are not live, so modifying them won't modify the original DOM. I haven't tried modifying them or using their methods yet, but clearly we can grab .InnerHtml at the very least.
Edit 2: Here is a more-generalized wrapper function:
function Get-FixedQuerySelectorAll {
    param (
        $HtmlWro,
        $CssSelector
    )
    # After assignment, $NodeList will crash powershell if enumerated in any way including Intellisense-completion while coding!
    $NodeList = $HtmlWro.ParsedHtml.querySelectorAll($CssSelector)
    for ($i = 0; $i -lt $NodeList.length; $i++) {
        Write-Output $NodeList.item($i)
    }
}
$HtmlWro is an HTML Web Response Object, the output of Invoke-WebRequest. I originally tried to pass .ParsedHtml, but then it would crash on assignment. Doing it this way returns the nodes in a PowerShell array.

@midnightfreddie's solution worked fine for me before, but now it throws Exception from HRESULT: 0x80020101 when calling $NodeList.item($i).
I found the following workaround:
function Invoke-QuerySelectorAll($node, [string] $selector)
{
    $nodeList = $node.querySelectorAll($selector)
    $nodeListType = $nodeList.GetType()
    $result = @()
    for ($i = 0; $i -lt $nodeList.length; $i++)
    {
        # Call item() through reflection instead of PowerShell's COM binding
        $result += $nodeListType.InvokeMember("item", [System.Reflection.BindingFlags]::InvokeMethod, $null, $nodeList, $i)
    }
    return $result
}
This one works for New-Object -ComObject InternetExplorer.Application as well.

Related

PowerShell Selenium WebDriver and ChromeDriver - FindElementBy() syntax differences

I've followed the steps in this article https://adamtheautomator.com/selenium-powershell/, and a couple of others, to install Selenium WebDriver and ChromeDriver for use with PowerShell. Things are fine; however, I did struggle to use FindElementByXPath() or FindElementByCssSelector(). I didn't find any master documentation for using Selenium WebDriver and ChromeDriver from PowerShell. There seems to be some issue with the FindElementXXXXX() methods, as they are not working in my case, although they are mentioned in the article above.
After a lot of trial and error, I was able to implement FindElementBy() as follows:
$scriptPath = "C:\Projects\Selenium\Setup"
Add-Type -path "$scriptPath\webdriver.dll"
$ChromeOptions = $null
#$ChromeOptions = New-Object OpenQA.Selenium.Chrome.ChromeOptions
#$ChromeOptions.AddArgument("--user-data-dir=C:\Users\user_dir\AppData\Local\Google\Chrome\User Data\")
$chrome = New-Object OpenQA.Selenium.Chrome.ChromeDriver($ChromeOptions)
#$chrome = New-Object OpenQA.Selenium.Chrome.ChromeDriver
$usernameVal = "user-name"
$pwordVal = "password432"
$chrome.Navigate().GoToUrl("http://your-website-with-username-pwd.com")
Start-Sleep -s 6
$loginName = [OpenQA.Selenium.By]::CssSelector("#loginName-id")
$loginNameElm = $chrome.FindElement($loginName)
$loginNameElm.SendKeys($usernameVal)
$pword = [OpenQA.Selenium.By]::CssSelector("[a-id=password-input-xyz]")
$pwordElm = $chrome.FindElement($pword)
$pwordElm.SendKeys($pwordVal)
$btnLogin = [OpenQA.Selenium.By]::CssSelector("#ng-app > div.app.ng-scope > div > div:nth-child(2) > div > form > div > div.ng-isolate-scope > div > div > div > div > div.fis-action-bar.clearfix.ng-scope > div.fis-block > div.fis-primary-block > button")
#The CSS Selector below didn't work
#$btnLogin = [OpenQA.Selenium.By]::CssSelector("button:contains('Sign in')")
$btnLoginElm = $chrome.FindElement($btnLogin)
$btnLoginElm.click()
So, as you can see, the trick is to reference the [OpenQA.Selenium.By] namespace or class by its full name.
What really puzzled me is that all the references I found use the syntax By.CssSelector(), as on this page: https://www.toolsqa.com/selenium-webdriver/find-element-selenium/.
Also, I tried the CSS selector format button:contains('Sign in'), but it threw an error, although it works using $() notation. The other, ordinary selector (copied from Chrome DevTools) works fine. I was trying to avoid a super long CSS selector that looks funny.
So the question is:
Where is the main documentation for using Selenium from PowerShell?
Why are there different variations in the use of a method, such as FindElementBy in some places and FindElementByXPath() in others?
How can we import the [OpenQA.Selenium.By] namespace to avoid typing the whole name, so that we can use the syntax By.CssSelector() directly in PowerShell?
How can we locate a button by the text used for its caption, to avoid a very long CSS selector?
I don't like using sleep in PowerShell to wait until Chrome is ready. Is there a better way?

Better custom error handling for powershell

So I have a PowerShell script that integrates with several external third-party EXE utilities. Each one returns its own kind of errors, and some also write non-error output to stderr (yes, badly designed, I know; I didn't write these utilities). What I'm currently doing is parsing the output of each utility and doing some keyword matching. This approach does work, but I feel that as I use these scripts and utilities I'll have to keep adding exceptions to what counts as an error. So I need to create something that is expandable, possibly a kind of structure I can add to an external file, like a module.
I was thinking of leveraging the features of a custom PSObject to get this done but I am struggling with the details. Currently my parsing routine for each utility is:
# Check the captured output against each error keyword
foreach ($errtype in @('error','fail','exception'))
{
    if ($JobOut -match $errtype) { $Status = 'Failure' }
    elseif ($JobOut -match 'Warning') { $Status = 'Warning' }
    else { $Status = 'Success' }
}
So this looks pretty straightforward until I run into a utility whose $JobOut contains one of the keywords in $errtype without it being an error. So now I have to add some exceptions to the logic:
foreach ($errtype in @('error','fail','exception'))
{
    # Treat 'error' as a failure only when it is not part of the phrase 'Error Log'
    if ($JobOut -match 'error' -and -not ($JobOut -match 'Error Log')) { $Status = 'Failure' }
    elseif ($JobOut -match $errtype) { $Status = 'Failure' }
    elseif ($JobOut -match 'Warning') { $Status = 'Warning' }
    else { $Status = 'Success' }
}
So, as you can see, this method has the potential to get out of control quickly, and I would rather not edit core code to add a new error rule every time I come across a new error.
Is there a way to create a structure of errors for each utility that contains the logic for what counts as an error? Something that would be easy to add new rules to?
Any help with this is really appreciated.
I would think a switch would do nicely here.
It's very basic, but can be modified very easily and is highly expandable and I like that you can have an action based on the input to the switch, which could be used for logging or remediation.
Create a function that allows you to easily provide input to the switch and then maintain that function with all your error codes, or words, etc. then simply use the function where required.
TechNet Tips on Switches
TechNet Tips on Functions
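A minimal sketch of the switch-in-a-function idea described above. The function name Get-JobStatus, the -replace step for the 'Error Log' exception, and the exact patterns are assumptions made for illustration, not part of the original scripts:

```powershell
# Hypothetical helper: classify the captured output of one external utility.
function Get-JobStatus {
    param(
        [string] $JobOut
    )

    # Strip phrases known to be harmless before matching, so that
    # 'Error Log' does not trip the generic 'error' rule.
    $text = $JobOut -replace 'Error Log', ''

    # switch -Regex tests the patterns in order (case-insensitively);
    # 'break' stops at the first rule that matches.
    switch -Regex ($text) {
        'error|fail|exception' { 'Failure'; break }
        'warning'              { 'Warning'; break }
        default                { 'Success' }
    }
}

# Example call with made-up utility output
Get-JobStatus -JobOut 'Writing results to Error Log: done'
```

Adding a rule then means adding one line to the switch (or one more -replace for a harmless phrase), and each branch could also log or trigger remediation before returning its status.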

creating GUI forms without variables

I am trying to find a way to create a form in PowerShell without using any variables unless they are temporarily or virtually assigned. I want to be able to run a command similar to this:
(New-Object System.Windows.Forms.Form).ShowDialog()
where I can enter code into an event that is triggered once the form is created. That event will then be responsible for creating all the objects and other events inside the form. Once the form is launched, I will not need any variables except for the ones that are virtually assigned within the events.
This is to avoid using too many system resources by assigning an endless number of variables for each object in the form. The script that I am currently working on in PowerShell is very possibly going to be really big, and even if it is not a very large script, efficient and clean code is always the key to writing a good program or script.
add-type -ass System.Windows.Forms
$x = (New-Object System.Windows.Forms.Form)
$x.Text = 'Message Box'
$x.Size = '300,150'
$x.Font = $x.Font.Name + ',12'
$x.Controls.Add((New-Object System.Windows.Forms.Label))
$x.Controls[-1].Size = $x.Size
$x.Controls[-1].Text = 'Here is a message for you'
$x.ShowDialog()
Remove-Variable x
It is still possible to access these objects in exactly the same way as when each object is assigned to its own variable. It cost me many hours of research and trial and error with random commands to find out how. Here are all the commands you may need to relearn if you are interested in my solution:
# create item in form:
$x.Controls.Add((New-Object System.Windows.Forms.Button))
# access the last created item in the form:
$x.Controls[-1]
# change its name to identify it more easily
$x.Controls[-1].Name = 'button1'
# access the item by its new name:
$x.Controls['button1']
# delete the item by its name:
$x.Controls.Remove($x.Controls['button1'])
If you're familiar with form creation in PowerShell, then this should all make sense to you, and you should know how the rest of it works. Also, for those interested in what I am trying to do, note that any of these commands can be run within an event by replacing $x with $this. Inside an event of an object in the "controls" section of the form, you would use $this.parent instead.
This is exactly what I mean by having the ability to create a form with virtually no variables. The only problem I am having with this is that I am unsure how to assign an event and call the method ShowDialog() at the same time.
I found a very interesting solution to this; however, I am not sure what the limits of this solution are, and it does not quite work the way I would personally like it to.
file.ps1:
add-type -ass System.Windows.Forms
$x = (New-Object System.Windows.Forms.Form)
$x.Text = 'Message Box'
$x.Size = '300,150'
$x.Font = $x.Font.Name + ',12'
$x.Controls.Add((New-Object System.Windows.Forms.Label))
$x.Controls[-1].Size = $x.Size
$x.Controls[-1].Text = 'Here is a message for you'
$x
remove-variable x
command to execute the code:
(iex(Get-Content 'file.ps1'|out-string)).ShowDialog()
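One possible way to combine event assignment with ShowDialog(), sketched under assumptions: the function name New-MessageForm is hypothetical, and the Shown event is used as the "once the form is created" trigger. The form is built inside a function, so its variables are local and disappear when the function returns:

```powershell
Add-Type -AssemblyName System.Windows.Forms

# Hypothetical helper: builds the form, wires its events, and returns it.
function New-MessageForm {
    $form = New-Object System.Windows.Forms.Form
    $form.Text = 'Message Box'
    $form.Size = '300,150'

    # Runs when the form is first displayed; $this is the form raising the event.
    $form.Add_Shown({
        $label = New-Object System.Windows.Forms.Label
        $label.Size = $this.Size
        $label.Text = 'Here is a message for you'
        $this.Controls.Add($label)
    })

    # Emit the form; $form itself is local to the function.
    $form
}

# Usage (blocks until the window is closed):
# (New-MessageForm).ShowDialog()
```

Because the handler is attached before the form is returned, the call site needs only the single expression shown in the usage comment.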

Getting result of .Net object asynchronous method in powershell

I'm trying to call an async method on a .Net object instantiated in Powershell :
Add-Type -Path 'my.dll'
$myobj = new-object mynamespace.MyObj()
$res = $myobj.MyAsyncMethod("arg").Result
Write-Host "Result : " $res
When executing the script, the shell doesn't seem to wait for MyAsyncMethod().Result and displays nothing, although inspecting the return value indicates it is the correct type (Task<T>). Various other attempts, such as intermediary variables, Wait(), etc. gave no results.
Most of the stuff I found on the web is about asynchronously calling a Powershell script from C#. I want the reverse, but nobody seems to be interested in doing that. Is that even possible and if not, why ?
I know this is a very old thread, but it might be that you were actually getting an error from the async method but it was being swallowed because you were using .Result.
Try using .GetAwaiter().GetResult() instead of .Result and that will cause any exceptions to be bubbled up.
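As a rough illustration of that suggestion, the sketch below uses a TaskCompletionSource to stand in for MyAsyncMethod("arg") and completes it with a made-up InvalidOperationException (both are assumptions for the demo, since my.dll is not available here). Calling .GetAwaiter().GetResult() inside try/catch makes the failure visible:

```powershell
# Stand-in for $myobj.MyAsyncMethod("arg"): a task that has already failed.
$tcs = New-Object 'System.Threading.Tasks.TaskCompletionSource[int]'
$tcs.SetException((New-Object System.InvalidOperationException 'boom'))
$task = $tcs.Task

try {
    # GetResult() waits for the task and rethrows the exception it failed with.
    $res = $task.GetAwaiter().GetResult()
    Write-Host "Result : " $res
} catch {
    # PowerShell wraps exceptions from method calls; the task's own exception
    # is available as the InnerException of the wrapper.
    $original = $_.Exception.InnerException
    Write-Host "Task failed:" $original.GetType().FullName $original.Message
}
```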
For long running methods, use the PSRunspacedDelegate module, which will enable you to run the task asynchronously:
$task = $myobj.MyAsyncMethod("arg");
$continuation = New-RunspacedDelegate ( [Action[System.Threading.Tasks.Task[object]]] {
param($t)
# do something with $t.Result here
} );
$task.ContinueWith($continuation);
See documentation on GitHub. (Disclaimer: I wrote it).
This works for me.
Add-Type -AssemblyName 'System.Net.Http'
$myobj = new-object System.Net.Http.HttpClient
$res = $myobj.GetStringAsync("https://google.com").Result
Write-Host "Result : " $res
Perhaps check that PowerShell is configured to use .NET 4:
How can I run PowerShell with the .NET 4 runtime?

powershell com object windows installer

I would like to leverage the following
$wi=new-object -com WindowsInstaller.Installer
If I run $wi | gm, I do not see the method I want, "Products".
I would like to iterate over Products and show a list of all items installed on the system.
So I thought I would try $wi.gettype().invokemember.
I'm not really sure what to do: $wi.gettype().invokemember("Products","InvokeMethod")
or something similar yields "cannot find an overload"...
But now I am lost. I have looked elsewhere, but I don't want to create a whole XML file. I should be able to access the COM object's methods.
If you are trying to get a list of installed programs in Windows, there is a native Powershell way, which is actually using WMI behind the scenes:
Get-WmiObject Win32_Product
Here's a related article from Microsoft Scripting Guys.
It appears that this approach has some issues, so it is better avoided:
When you query this class, the way the provider works is that it
actually performs a Windows Installer “reconfiguration” on every MSI
package on the system as its performing the query!
I tried my best to find a solution that involves the WindowsInstaller COM object, but all of them point to an article that no longer exists. Here is one on stackoverflow.
An alternative solution is to try psmsi on CodePlex.
Here is my code snippet for this:
cls
$msi = New-Object -ComObject WindowsInstaller.Installer
$prodList = foreach ($p in $msi.Products()) {
try {
$name = $msi.ProductInfo($p, 'ProductName')
$ver = $msi.ProductInfo($p, 'VersionString')
$guid = $p
[tuple]::Create($name, $ver, $guid)
} catch {}
}
$prodlist