Web scraping with PowerShell - an empty href

I am trying to get download links from web pages, but on some pages some of the <a> tags look like this: <a href="#" ....
For example, this page
https://www.google.com/chrome/browser/thankyou.html?standalone=1&system=true&platform=win
has a link with the inner text "click here to retry", whose href shows up as
https://dl.google.com/tag/s/appguid%3D%7B8A69D345-D564-463C-AFF1-A69D9E530F96%7D%26iid%3D%7B3218A264-6918-BBE2-56B1-1CEFAB4A9C43%7D%26lang%3Den%26browser%3D3%26usagestats%3D0%26appname%3DGoogle%2520Chrome%26needsadmin%3Dtrue%26installdataindex%3Ddefaultbrowser/update2/installers/ChromeStandaloneSetup.exe
in any web browser, but not in PowerShell.
What doesn't work
I tried Invoke-WebRequest:
PS> $url="https://www.google.com/chrome/browser/thankyou.html?standalone=1&system=true&platform=win"
PS> (Invoke-WebRequest $url).Links | ? {$_.InnerText -eq "click here to retry"} | select outerHTML, href | fl
outerHTML : <A class=retry-link href="#" data-g-label="retry-dl"
data-g-event="retry-download">click here to retry</A>
href : #
Other methods (System.Net.WebClient, System.Net.WebRequest + StreamReader) return the same result. I guess that's because they don't run the JavaScript in the page, but correct me if I'm wrong.
Works
Using IE works, all right. But it's SO SLOW and... feels like overkill. This code does the job:
$url="https://www.google.com/chrome/browser/thankyou.html?standalone=1&system=true&platform=win"
$ie = New-Object -comobject InternetExplorer.Application
$ie.visible = $true
$ie.silent = $true
$ie.Navigate2($url)
while ($ie.busy) { Start-Sleep -m 100 }
while ($ie.Document.readyState -ne 'Complete') { Start-Sleep -m 100 }
$link = $ie.Document.getElementsByTagName("a") | ? { $_.InnerText -eq "click here to retry" } | select -First 1 -expandProperty href
Write-Host $link
Any alternatives?
If you know how to get the proper href value from such links without using IE, please let me know.
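One thing that may be worth trying before falling back to IE: if the real URL is embedded somewhere in the downloaded source (for example in an inline script), a regex over the raw content can pull it out. This is only a sketch under that assumption. I have not verified that the Chrome page carries the URL in its source, and Get-InstallerUrl is a made-up helper name.

```powershell
# Hypothetical helper: pull absolute http(s) URLs ending in .exe out of raw text.
function Get-InstallerUrl {
    param(
        [Parameter(Mandatory)][AllowEmptyString()][string]$Text
    )
    # Match from http(s):// up to the first ".exe", stopping at whitespace, quotes or angle brackets
    $pattern = 'https?://[^\s"''<>]+?\.exe'
    [regex]::Matches($Text, $pattern) |
        ForEach-Object { $_.Value } |
        Select-Object -Unique
}

# Usage (assumes the URL is actually present in the page source):
# $page = Invoke-WebRequest -Uri $url -UseBasicParsing
# Get-InstallerUrl -Text $page.Content
```

If this returns nothing, the link is most likely assembled at run time by script, and something that executes JavaScript (such as the IE approach above) is still needed.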

Related

How to access a Neteller account with Invoke-WebRequest

I would like to access my Neteller account with Invoke-WebRequest.
I have found the field that needs to be filled in:
<input type="password" name="password" id="form-login-password" maxlength="32" placeholder="Password" autocomplete="off" pattern=".*" required="" value="Welcome 123">
I have tried this code:
$content=Get-Content "$home\Downloads\test log\NETELLER » Signin.html"
$qw=Invoke-WebRequest https://member.neteller.com -Method Post -Body $content -SessionVariable test
with the login and password inside the file, but I get the same issue.
I would like to get the page that is shown after the login is done.
Moving from the comment section to be more specific for the OP.
As for your comments...
How can I get the button event, and how do I use it?
I have found this
input type="submit" name="btn-login" class="button radius"
value="Sign in" id="btn-login"
There are several ways, but again, the way the site is coded actually prevents some automation actions. So my point is that it appears they do not want folks using automation on their site.
Clicking buttons, links, and the like requires the browser UI to be visible. You seem to want to do this without a visible window, but I could be wrong.
All things considered, there are a few ways to get to a click. It depends entirely on the site designers and what they made available to act on. Here are a few examples from sites I've had to deal with, single- and multi-page/form sites, that allowed automation.
# Starting here...
$ie.Document.getElementsByName('commit').Item().Click();
# Or
$ie.document.IHTMLDocument3_getElementsByTagName("button") |
ForEach-Object { $_.Click() }
# Or
($ie.Document.IHTMLDocument3_getElementsByTagName('button') |
Where-Object innerText -eq 'SIGN IN').Click()
# Or
($ie.document.getElementById('submitButton') |
select -first 1).click()
# Or ...
$Link=$ie.Document.getElementsByTagName("input") |
where-object {$_.type -eq "button"}
$Link.click()
# Or ...
$Submit = $ie.document.getElementsByTagName('INPUT') |
Where-Object {$($_.Value) -match 'Zaloguj'}
$Submit.click()
# Or
$ie.Document.getElementById("next").Click()
# Or
$SubmitButton=$Doc.IHTMLDocument3_getElementById('button') |
Where-Object {$_.class -eq 'btn btn-primary'}
$SubmitButton.Click()
# Or
Invoke-WebRequest ("https://portal.concordfax.com/Account/LogOn" +
$R.ParsedHtml.getElementsByClassName("blueButton login").click)
Here is a full example of what I mean.
Scrape the site for object info.
$url = 'https://pwpush.com'
($FormElements = Invoke-WebRequest -Uri $url -SessionVariable fe)
($Form = $FormElements.Forms[0]) | Format-List -Force
$Form | Get-Member
$Form.Fields
Use the info on the site
$IE = New-Object -ComObject "InternetExplorer.Application"
$RequestURI = "https://pwpush.com"
# Substrings of the element ids to look for
$PasswordId = "password_payload"
$SubmitId = "submit"
$IE.Visible = $true
$IE.Silent = $true
$IE.Navigate($RequestURI)
While ($IE.Busy) {
    Start-Sleep -Milliseconds 100
}
$Doc = $IE.Document
# Keep the id strings and the matched elements in separate variables,
# so later iterations still compare against the strings
$Doc.getElementsByTagName("input") | ForEach-Object {
    if ($null -ne $_.id) {
        if ($_.id.contains($SubmitId)) { $SubmitButton = $_ }
        if ($_.id.contains($PasswordId)) { $Password = $_ }
    }
}
$Password.value = "1234"
$SubmitButton.click()

InternetExplorer.Application, click checkbox and PowerShell

I am having an issue with selecting a checkbox. I have been able to log in and navigate successfully via the ID attributes, but this element doesn't have one. The only attributes it has are type and class.
I can find this section using $ie.document.IHTMLDocument3_getElementsByTagName("input"), but I can't find any way to utilize it.
Here's the html I'm working with:
<th class="cText sorting_disabled cHeader" rowspan="1" colspan="1" style="width: 5px;" aria-label="">
<input type="checkbox" class="selectall">
</th>
What I have thus far:
$ie = New-Object -ComObject "InternetExplorer.Application"
$ie.visible = $true
$ie.navigate("https://some.site.com")
while($ie.Busy) { Start-Sleep -Milliseconds 100 }
# login
$usernameField = $ie.document.IHTMLDocument3_getElementByID("userid")
$passwordField = $ie.document.IHTMLDocument3_getElementByID("Password")
$usernameField.value = "email@domain.com"
$passwordField.value = "supercoolpassword"
$btn_Submit = $ie.document.IHTMLDocument3_getElementByID("btn_signIn")
$btn_Submit.click()
# go to downloads page
$ie.navigate("https://some.site.com/pages/mydownloads.aspx")
# selectall packages to download has me clueless
When the checkbox is clicked, all the other checkboxes should be ticked.
This is commonplace, so you have to approach it differently.
$SiteSource.AllElements | Where{$_.TagName -eq 'input'}
or, if it were a button:
($ie.Document.IHTMLDocument3_getElementsByTagName('button') |
Where-Object innerText -eq 'SIGN IN').Click()
But how are you walking the site to get the needed objects?
An approach I regularly use before making coding decisions.
$SiteSource = Invoke-WebRequest -Uri 'SomeUrl'
# single form site
$SiteSource.Forms | Format-List -Force
# multi-form sites
$SiteSource.Forms[0] | Format-List -Force
$SiteSource.Forms[1] | Format-List -Force
# Check for what can be used.
$SiteSource.Forms.Fields
$SiteSource.InputFields
Then, once you have gathered the above, code against what can be found and used.
$ie = New-Object -com InternetExplorer.Application
$ie.visible=$true
$ie.navigate('SomeUrl')
while($ie.ReadyState -ne 4) {start-sleep -m 100}
$UserID = $ie.document.getElementsByTagName('INPUT') |
Where-Object {$($_.Name) -match 'userid'}
$UserId.value = 'UserID'
$UserPassword = $ie.document.getElementsByTagName('INPUT') |
Where-Object {$($_.Name) -match 'password'}
$UserPassword.value = 'password'
$Submit = $ie.document.getElementsByTagName('INPUT') |
Where-Object {$($_.Value) -match 'SomeString'}
$Submit.click()
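For the checkbox in the question, which has only a type and a class, the same filtering idea can be applied to className. Here is a minimal sketch, assuming the IE DOM elements expose type and className as usual. Select-InputByClass is just a name I made up for illustration.

```powershell
# Hypothetical helper: return input elements of a given type whose
# class list contains the given class name.
function Select-InputByClass {
    param(
        [Parameter(Mandatory)][object[]]$Elements,
        [Parameter(Mandatory)][string]$ClassName,
        [string]$Type = 'checkbox'
    )
    $Elements | Where-Object {
        $_.type -eq $Type -and
        (($_.className -split '\s+') -contains $ClassName)
    }
}

# Usage against IE (assumes $ie already shows the downloads page):
# $inputs = @($ie.Document.IHTMLDocument3_getElementsByTagName('input'))
# $box = Select-InputByClass -Elements $inputs -ClassName 'selectall' |
#     Select-Object -First 1
# if ($box) { $box.click() }
```

Splitting className on whitespace means an element with several classes, such as "cBox selectall", still matches.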

PowerShell: click a Google search result

I have a PowerShell script that opens a Google search, finds a specific website in the results, clicks the link, and opens the site:
$i=0
While ($true) {
if ($StatusCode = Test-Connection -ComputerName 8.8.8.8 -Quiet) {
$IE = new-object -com internetexplorer.application
$IE.navigate('<google search address>')
$IE.visible=$true
while ($IE.busy) {sleep 10}
$Link = @($IE.Document.getElementsByTagName("span") | ? {$_.InnerHTML -eq "www.iranelectronic.com/"})[0]
if ($null -eq $Link){ $Link = $IE.Document.getElementsByTagName("cite") | ? {$_.InnerHTML -eq "www.iranelectronic.com/"} }
if ($null -eq $Link){$ie.quit();Break}
$Link.click()
start-sleep 10
$ie.quit()
Break
}
else {
foreach ($_ in 1..10){
$i++
start-sleep 900
$StatusCode = Test-Connection -ComputerName 8.8.8.8 -Quiet
}
Break
}
}
But it does not work on Windows 7 with IE 10: $Link is null!
On Windows 10 with IE 11 it works great.
I checked with Inspect Element in IE 10 on Windows 7 and got this result:
<div class="rc"><div class="r"><a onmousedown="return rwt(this,'','','','1','AOvVaw2SZfaltNouKdqFDN6YQ_p3','','2ahUKEwiJjurWwLrfAhVELFAKHaPRCyMQFjAAegQIABAC','','',event)" href="http://www.iranelectronic.co/"><h3 class="LC20lb">ایران الکترونیک بزرگترین مرکز پخش انواع قطعات ...</h3><br><div class="TbwUpd" style="display: inline-block;"><cite class="iUh30"><span dir="ltr">www.iranelectronic.co/</span></cite></div></a><span><div class="action-menu ab_ctl"><a class="GHDvEf ab_button" id="am-b0" role="button" aria-expanded="false" aria-haspopup="true" aria-label="گزینه‌های نتیجه" href="#" data-ved="2ahUKEwiJjurWwLrfAhVELFAKHaPRCyMQ7B0wAHoECAAQAw" jsaction="m.tdd;keydown:m.hbke;keypress:m.mskpe"><span class="mn-dwn-arw"></span></a><div tabindex="-1" class="action-menu-panel ab_dropdown" role="menu" data-ved="2ahUKEwiJjurWwLrfAhVELFAKHaPRCyMQqR8wAHoECAAQBA" jsaction="keydown:m.hdke;mouseover:m.hdhne;mouseout:m.hdhue"><ol><li class="action-menu-item ab_dropdownitem" role="menuitem"><a class="fl" onmousedown="return rwt(this,'','','','1','AOvVaw0bYKQCsuGdo8sgrLerKTQW','','2ahUKEwiJjurWwLrfAhVELFAKHaPRCyMQIDAAegQIABAF','','',event)" href="http://webcache.googleusercontent.com/search?q=cache:qlEq1WcGd3sJ:www.iranelectronic.co/+&cd=1&hl=fa&ct=clnk&gl=ir">ذخیره شده</a></li><li class="action-menu-item ab_dropdownitem" role="menuitem"><a class="fl" href="/search?q=related:www.iranelectronic.co/+iranelectronic&tbo=1&sa=X&ved=2ahUKEwiJjurWwLrfAhVELFAKHaPRCyMQHzAAegQIABAG">مشابه</a></li></ol></div></div></span></div><div class="s"><div><span class="st">پست الکترونیک: info#<em>iranelectronic</em>.co تلفن: 07132358510 - 07132338668 - 07132333600 فکس: 07132348649 همراه: 09171166771 آدرس: شیراز، خیابان زند، ...</span></div></div></div>
Please help me resolve this problem.
Thanks.
Regards
Untested, but if you are searching for a link, why try and find the <span> or <cite> element?
I think this should get you the link to perform the click() on:
$link = $ie.Document.getElementsByTagName('A') | Where-Object {$_.href -like 'http://www.iranelectronic.co*'}
As a side note, I would advise not just quitting IE but also releasing it from memory when done:
$ie.Quit()
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($ie) | Out-Null
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()

PowerShell System.Windows.Forms.SendKeys doesn't send keys to IE form

I am working on a PowerShell script that should fill in an IE form, but it doesn't send the value to the form. Instead, it just writes the value ("search text") into the PowerShell script.
I have copied my code from this post:
Filling web form via PowerShell does not recognize the values entered
But I don't need to open a new IE window, I need to use an existing IE window, which is already open, before I start the ps script.
My code (for your info, "lst-ib" is the id of google.at's search field):
[void] [System.Reflection.Assembly]::LoadWithPartialName("System.Windows.Forms")
[void] [System.Reflection.Assembly]::LoadWithPartialName("Microsoft.VisualBasic")
$app = new-object -com shell.application
$ie = $app.windows() | where {$_.Type -eq "HTML Document"}
$ie.visible = $true;
while($ie.Busy) { Start-Sleep -Milliseconds 100 }
$ie.document.IHTMLDocument3_getElementById("lst-ib").focus()
Start-Sleep -Milliseconds 1000;
[System.Windows.Forms.SendKeys]::Sendwait("test searchtext");
As for this...
But I don't need to open a new IE window, I need to use an existing IE
window, which is already open, before I start the ps script.
What / who opens this?
You have to discover which window this is, in case more than one IE session / tab is open.
You have to call the form object explicitly to act on it.
You are also using elements that don't really exist for what you are doing.
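To make the "which window" point concrete, here is a rough sketch of narrowing the Shell.Application windows down to the IE one you want. It assumes the window objects expose FullName and LocationURL, which the IE COM object does. Select-IEWindow and the URL pattern are placeholders of mine, not anything from your script.

```powershell
# Hypothetical helper: keep only Internet Explorer windows whose current
# URL matches a wildcard pattern (File Explorer windows are filtered out).
function Select-IEWindow {
    param(
        [Parameter(Mandatory)][object[]]$Windows,
        [Parameter(Mandatory)][string]$UrlPattern
    )
    $Windows | Where-Object {
        $_.FullName -like '*\iexplore.exe' -and
        $_.LocationURL -like $UrlPattern
    }
}

# Usage (assumes an IE window showing Google is already open):
# $shell = New-Object -ComObject Shell.Application
# $ie = Select-IEWindow -Windows @($shell.Windows()) -UrlPattern 'https://www.google.*' |
#     Select-Object -First 1
```

Taking the first match avoids ending up with an array in $ie when several tabs match, which would make calls like $ie.document fail in confusing ways.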
Since we cannot access your site, here is an example I provided a while back to someone having issues interacting with the Air Canada site.
Now, that site is a bit funky, as it is a multi-page site with recently added language popups. I don't deal with the language popup in this example, so just click past it manually, since I've not updated the code to deal with it. Once you click past that, the form will be filled with the values passed in.
You will note I am not using SendKeys for any of the page interactions. I am just using the form elements and passing in the values I want. Now, this does open a new window because of the URL, whereas your code needs to look for the existing window first.
# Check form elements for the multi-page site
$url = 'https://www.aircanada.com/ca/en/ado/profile/sign-in.html'
($FormElements = Invoke-WebRequest -Uri $url -SessionVariable fe)
($FormElements = Invoke-WebRequest -Uri $url -SessionVariable fe).Forms
($Form0 = $FormElements.Forms[0]) | Format-List -Force
$Form0.Fields
($Form1 = $FormElements.Forms[1]) | Format-List -Force
$Form1.Fields
($Form2 = $FormElements.Forms[2]) | Format-List -Force
$Form2.Fields
$SiteSource = Invoke-WebRequest -Uri $url
$SiteSource.AllElements | Where{$_.TagName -eq "Button"} `
| Select-Object -Expand InnerText
$SiteSource.AllElements | Where{$_.TagName -eq "Button"} `
| Select-Object -Property * | Where innerText -eq 'SIGN IN'
So armed with the above, you get to do this.
# Navigate to aircanada
$url = 'https://www.aircanada.com/ca/en/ado/profile/sign-in.html'
# Instantiate IE
$ie = New-Object -com internetexplorer.application
$ie.visible = $true
$ie.navigate($url)
# Wait for IE to load
while ($ie.Busy -eq $true) { Start-Sleep -Seconds 1 }
# Take needed actions on the air canada site. Fill out the site page
($ie.document.getElementById('agencyIATA') | Select-Object -first 1).value = '1234567'
($ie.document.getElementById('agencyID') | Select-Object -first 1).value = '789'
($ie.document.getElementById('bookingAgent') | Select-Object -first 1).value = '159'
($ie.document.getElementById('password') | Select-Object -first 1).value = '1234'
($ie.document.getElementById('rememberAgencyInfo') | Select-Object -first 1).value = $true
# Start-Sleep -Seconds 2
# ($ie.Document.IHTMLDocument3_getElementsByTagName('button') `
# | Where-Object innerText -eq 'SIGN IN').Click()
All this being said, see also:
Controlling Internet Explorer object from PowerShell
https://blogs.msdn.microsoft.com/powershell/2006/09/10/controlling-internet-explorer-object-from-powershell

Download an Excel document from website with Powershell

Recently I began to learn PowerShell to automate my job tasks.
I want to access a web page and click a button that automatically downloads an Excel file. This is the button that I want to click:
<div class="NormalButton">
<a class="ActiveLink" title="Excel" alt="Excel" onclick="$find('ctl32').exportReport('EXCELOPENXML');" href="javascript:void(0)" style="padding:3px 8px 3px 8px;display:block;white-space:nowrap;text-decoration:none;">Excel</a>
</div>
This would be my PowerShell script:
$ie = New-Object -com "InternetExplorer.Application"
$ie.Navigate("http://test.test/")
$ie.Visible = $true
$link = $ie.Document.getElementsByTagName('a') | Where-Object {$_.onclick -eq "$find('ctl32').exportReport('EXCELOPENXML');"}
$link.click()
If I try to run it from the console I receive the error "You cannot call a method on a null-valued expression."
If it's useful: I'm using PowerShell 4.0, and the web page has a delay until the report is loaded.
I solved it with the following code:
[void][System.Reflection.Assembly]::LoadWithPartialName("System.Windows.Forms")
[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.VisualBasic")
$url = "https://webpage.com"
$ie = New-Object -com internetexplorer.application
$ie.navigate($url)
$ie.StatusBar = $false
$ie.ToolBar = $false
$ie.visible = $true
#Get Excel
Start-Sleep -s 40
$btnExcel = $ie.Document.links | where-object { $_.outerText -eq 'Excel' -and $_.innerText -eq 'Excel' }
$btnExcel.click()
# Get Internet Explorer Focus
Start-Sleep -s 5
[Microsoft.VisualBasic.Interaction]::AppActivate("internet explorer")
[System.Windows.Forms.SendKeys]::SendWait("{F6}");
[System.Windows.Forms.SendKeys]::SendWait("{TAB}");
[System.Windows.Forms.SendKeys]::SendWait(" ");
Start-Sleep 1
[System.Windows.Forms.SendKeys]::SendWait("$file");
[System.Windows.Forms.SendKeys]::SendWait("{ENTER}");
# Get Internet Explorer Focus
Start-Sleep -s 1
[Microsoft.VisualBasic.Interaction]::AppActivate("internet explorer")
[System.Windows.Forms.SendKeys]::SendWait("^{F4}");
Thank you all for your time :)
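A possible refinement of the fixed 40-second sleep, sketched here rather than tested against the real report page: poll until the link exists, with a timeout. Wait-Until is a hypothetical helper name, and the defaults are arbitrary.

```powershell
# Hypothetical helper: run a script block repeatedly until it returns
# something truthy or the timeout expires. Returns $null on timeout.
function Wait-Until {
    param(
        [Parameter(Mandatory)][scriptblock]$Condition,
        [int]$TimeoutSeconds = 60,
        [int]$PollMilliseconds = 250
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    do {
        $result = & $Condition
        if ($result) { return $result }
        Start-Sleep -Milliseconds $PollMilliseconds
    } while ((Get-Date) -lt $deadline)
    return $null
}

# Usage (assumes $ie has already navigated to the report page):
# $btnExcel = Wait-Until -TimeoutSeconds 60 -Condition {
#     $ie.Document.links |
#         Where-Object { $_.innerText -eq 'Excel' } |
#         Select-Object -First 1
# }
# if ($btnExcel) { $btnExcel.click() }
```

This also guards against the "You cannot call a method on a null-valued expression" error from the question, because click() is only called once the element has actually been found.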
I think your problem is that you are using double quotes, so PowerShell treats $find as a variable and tries to expand it. Try changing to single quotes (doubling the single quotes inside the string), from:
$link = $ie.Document.getElementsByTagName('a') | Where-Object {$_.onclick -eq "$find('ctl32').exportReport('EXCELOPENXML');"}
to:
$link = $ie.Document.getElementsByTagName('a') | Where-Object {$_.onclick -eq '$find(''ctl32'').exportReport(''EXCELOPENXML'');'}
Also, you can change -eq to -match and test just a portion of it. Since -match takes a regular expression, escape the literal text first:
$_.onclick -match [regex]::Escape('$find(''ctl32'').exportReport')
For more information see: About Quoting Rules
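A quick way to see the quoting difference in a plain console, without IE involved. The $onclick string below is only a stand-in for the attribute value from the question.

```powershell
# $find is deliberately left undefined, as it would be in the question's session.
Remove-Variable -Name find -ErrorAction SilentlyContinue

# Double quotes: $find is expanded (to an empty string here).
$double = "$find('ctl32').exportReport('EXCELOPENXML');"
# Single quotes: literal text; single quotes inside are doubled.
$single = '$find(''ctl32'').exportReport(''EXCELOPENXML'');'

# Stand-in for the onclick attribute text from the page.
$onclick = '$find(''ctl32'').exportReport(''EXCELOPENXML'');'

$double -eq $onclick   # False: the "$find" part is gone
$single -eq $onclick   # True

# -match takes a regex, so escape the literal text before matching.
$onclick -match [regex]::Escape('$find(''ctl32'').exportReport')   # True
# -like takes a wildcard pattern; parentheses need no escaping there.
$onclick -like '*exportReport(''EXCELOPENXML'')*'                  # True
```

Note that this only covers the string comparison. Whether the onclick property you get from IE is a plain string is a separate question I have not checked.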