Scrape webpage using PowerShell

I need to scrape a webpage.
I specifically need to extract the section "USDT Funding Market" on the page.
$WebExtract = Invoke-WebRequest -Uri "https://www.kucoin.com/margin/lend/USDT"
However, neither $WebExtract.Content nor $WebExtract.AllElements contains the "USDT Funding Market" section.

The reason for this is that your initial web request only returns the static HTML of the page. A lot of modern web applications like this one have JavaScript under the hood that reaches out to their APIs and fills the page with data after the initial HTML is loaded.
The good news is that with public websites like this you can usually hit the underlying APIs without worrying about authentication, which actually makes parsing the data easier, as it's typically in JSON format.
Try opening the debug console in your browser (usually F12), open the Network tab and refresh the page; you can typically find these API requests/endpoints in there.
Simply copy that URL out and invoke the web request against the API instead, for example:
# Hit the underlying API endpoint directly and parse the JSON payload
$request = Invoke-WebRequest -Uri 'https://www.kucoin.com/_api/currency/prices?base=USD&targets='
$json = $request.Content | ConvertFrom-Json
$json.data
This might not be the specific API endpoint/data you're after, but hopefully this points you in the right direction. If you poke around in the console you should be able to find the one you're looking for.
EDIT:
Apologies for the misleading information here; I didn't notice that these numbers were updating in real time on the site. The above is still pretty common for a lot of websites, so I'll leave it, but on sites that are this responsive/live, another common technology is WebSockets.
If you have a look at the last lines of the console, you should be able to see a request ending with 'transport=websocket' to a URL like this:
wss://ws-web.kucoin.com/socket.io/?token=...%3D%3D&format=json&acceptUserMessage=false&connectId=connect_welcome&EIO=3&transport=websocket
If you select this line and head over to the Messages tab in your browser console/debugger, you'll be able to see the WebSocket messages being returned.
This looks like the data you're after, but I'm not hugely familiar with querying WebSockets through PowerShell.
Simply invoking a web request won't work here, as the communication uses the WebSocket protocol. You would also need to find a way to get a valid token for opening a WebSocket connection (likely just a web request for that).
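As a rough, untested sketch (the token in the URL is a placeholder you would need to obtain first, and the socket.io handshake may require more than this), .NET's ClientWebSocket class can open the connection from PowerShell:
# Minimal sketch using .NET's ClientWebSocket. YOUR_TOKEN is a placeholder;
# you would need to request a real one first (likely via Invoke-WebRequest).
$uri = [Uri]'wss://ws-web.kucoin.com/socket.io/?token=YOUR_TOKEN&EIO=3&transport=websocket'
$ws  = [System.Net.WebSockets.ClientWebSocket]::new()
$ct  = [System.Threading.CancellationToken]::None
$ws.ConnectAsync($uri, $ct).Wait()
# Receive one message into a byte buffer and decode it as UTF-8 text
$buffer = [ArraySegment[byte]]::new([byte[]]::new(16384))
$result = $ws.ReceiveAsync($buffer, $ct).GetAwaiter().GetResult()
[System.Text.Encoding]::UTF8.GetString($buffer.Array, 0, $result.Count)
$ws.Dispose()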
Perhaps these posts will help:
http://wragg.io/powershell-slack-bot-using-the-real-time-messaging-api/
How to use a Websocket client to open a long-lived connection to a URL, using PowerShell V2?

Related

Authentication With Invoke-WebRequest

The login page is this: https://login.procore.com/
I feel like I'm close to getting it to work, but have hit a brick wall due to a lack of understanding of login procedures. Here is the code so far, without the actual sign in information.
$r = Invoke-WebRequest https://login.procore.com/ -SessionVariable fb    # capture cookies in $fb
$form = $r.Forms[0]                                                      # the login form
$form.Fields["session_email"] = "xxxxxxxxx"
$form.Fields["session_password"] = "xxxxxxxx"
$r = Invoke-WebRequest ('https://login.procore.com/' + $form.Action) -WebSession $fb -Method $form.Method -Body $form.Fields
Could someone help me understand what is missing? I did notice that $form.Fields contains an empty field named session_sso_target_url, but I honestly have no clue what it means or how to use it.
You've given me insufficient info to provide a complete answer, because I don't have a login and don't see a way to sign up for a free trial, and you haven't stated what kind of error you are getting.
I hazard a guess that session_sso_target_url relates to federation, which is semantically related to single sign-on (SSO). In federation, an application is configured to accept logins from another login domain. The obvious example in corporateland is ADFS, but any time you see an app that says Login with Facebook or Login with Google, that's the same thing. Federation is a big topic. The meaning of having a target URL is that the browser is often redirected to the identity provider (ADFS / FB / GOOG etc) with the callback URL that the browser should come back to once it is authenticated.
Suffice it to say that I suspect that you need do nothing with this field! And the reason I say this is because I hit it with Fiddler.
You should know about Fiddler. It is a cost-free debugging proxy from Telerik. I am not affiliated with Telerik, but I owe them hours of saved time when web scraping. (It is not the only tool for the job, and if any moderator deems that I am violating site rules, I will be happy to sanitise this post.)
Do this:
Install Fiddler
Set it up to listen on 127.0.0.1:whatever and to be your system proxy
In Tools > Options > HTTPS, set it to decrypt HTTPS (this will replace all certs with auto-generated self-signed ones, so do not leave this running while you perform other tasks)
Set your filters to only include traffic to *.procore.com
Log in through your browser - you should now see web traffic in the left-hand pane. This captured traffic is your baseline.
Select any one web request and look at the Inspectors tab in the right-hand pane. You can look at Raw, Forms, Cookies, etc. This gives you a low-level view of what your client is doing.
Run your code snippet. You can now compare the differences between the baseline and your code, and adjust accordingly.
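As a small aside, you can make your PowerShell traffic show up in that same capture by routing it through Fiddler with the -Proxy parameter (assuming Fiddler's default listening port of 8888):
# Route the request through Fiddler so it appears next to the browser baseline
$r = Invoke-WebRequest https://login.procore.com/ -SessionVariable fb -Proxy 'http://127.0.0.1:8888'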

WWW::Mechanize::Chrome capture XHR response

I am using Perl WWW::Mechanize::Chrome to automate a JS heavy website.
In response to a user click, the page (among many other requests) requests and loads a JSON file using XHR.
Is there some way to save this particular JSON data to a file?
To intercept requests like that, you generally need to use the webRequest API to filter and retrieve specific responses. I do not think you can do that via WWW::Mechanize::Chrome.
WWW::Mechanize::Chrome tries to give you the content of all requests, but Chrome itself does not make the content of XHR requests available (https://bugs.chromium.org/p/chromium/issues/detail?id=457484). So the approach I take in (for example) Net::Google::Keep is to replay the XHR requests using plain Perl LWP requests, copying the cookies and parameters from the Chrome requests.
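For readers following the rest of this page in PowerShell, a hypothetical sketch of the same replay idea (the URL, cookie name and value are placeholders you would copy out of the DevTools Network tab):
# Replay an XHR spotted in DevTools by reusing its cookies in a fresh session.
# 'session', 'PASTE_VALUE_HERE' and the URL/domain are placeholders.
$session = New-Object Microsoft.PowerShell.Commands.WebRequestSession
$cookie  = New-Object System.Net.Cookie('session', 'PASTE_VALUE_HERE', '/', 'example.com')
$session.Cookies.Add($cookie)
$json = Invoke-RestMethod -Uri 'https://example.com/data.json' -WebSession $session
$json | ConvertTo-Json -Depth 10 | Set-Content -Path 'data.json'   # save to file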
Please note that the official support forum for WWW::Mechanize::Chrome is https://perlmonks.org , not StackOverflow.

Invoke-WebRequest is unable to POST data on Single Page App (1 URI) and KeepAlive is disabled

TL;DR
I need to submit multiple forms on a single-page app that has only one URL, but the server closes the session after every request, and I think this is preventing my POST method from going through. The main problem is that every request uses the first state of the page, so I can't get to states farther along in the process.
What I'm trying to accomplish
I am trying to automate a process that requires me to take the following steps in order:
Navigate to webpage. (GET)
Click on a button that reloads the page with new data but uses same URL. (POST)
Enter text into a field on new page.
Click on the form to submit the text. (POST)
... perform unrelated admin tasks....
I'm trying to automate this process using the Invoke-WebRequest cmdlet in PowerShell, following steps similar to those found in the PowerShell Cookbook regarding Facebook login.
When I run an Invoke-WebRequest from a completely fresh PowerShell session I get a response, but I can't reuse that session ever again. To make another request I need to create a new -SessionVariable or use -DisableKeepAlive.
The server always returns Connection: close in the response, even though it is using HTTP/1.1, and since it's not my site I can't change this.
So how can I go about establishing a connection to the server that I can reuse to POST the form data? I feel like it should be doable, because it is clearly happening on the web page itself.
When I go to the web page, open the Developer Tools in Chrome, and step through the process, the header contains this in the Form Data field:
RAW
ggt_textbox%2810007%29=&action=adminloginbtn&ggt_hidden%2810008%29=2
PARSED and DECODED
ggt_textbox(10007):
action:adminloginbtn
ggt_hidden(10008):2
If I try to do something like this:
Invoke-WebRequest $uri -SessionVariable session -Verbose -Method POST -Body "ggt_textbox%2810007%29=&action=adminloginbtn" -DisableKeepAlive
It returns the page I'm expecting in step 2. So I performed steps 3 and 4 in Chrome to try and do the same thing. I get the following Form Data in Chrome Dev Tools:
RAW
ggt_textbox%2810006%29=textIentered&action=loginbtn&ggt_hidden%2810008%29=3
PARSED
ggt_textbox(10006):textIentered
action:loginbtn
ggt_hidden(10008):3
So that made me think I could do something like this:
Invoke-WebRequest $uri -WebSession $session -Verbose -Method POST -Body "ggt_textbox%2810006%29=textIentered&action=loginbtn&ggt_hidden%2810008%29=3"# -DisableKeepAlive
But since the main page and the login page use the same URI it tries to POST to a form that doesn't exist because it's looking at the very first page.
I did some more digging and found that when I perform this same action from the web page itself, it returns a 302 Moved Temporarily status code, and the response header actually has a cookie in it (though it still closes the connection), which is a first. The browser then appears to do a GET request using the new cookie, and I'm now logged into the admin page.
So I think I have two problems I need to get around:
How can I get to the form that exists after I click the first button since they use the same URI?
How can I get around the 302 status, since I'm only getting back a header and nothing else? I think I need to do a GET request using the cookie from the header, but I'm not sure how to specify a cookie with Invoke-WebRequest. I think I would need to use the -Headers parameter and specify Cookie: COOKIENAME=CookieID (see the sketch after the next paragraph).
I think most of all I need to get through my first question and then from there I can start working towards my second.
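For what it's worth, a hedged sketch of that cookie idea: rather than hand-building a Cookie header (which can conflict with the session's cookie container), you can add the cookie from the 302 to the existing -WebSession. COOKIENAME, CookieID and the domain are placeholders for whatever the Set-Cookie header actually contains:
# Hedged sketch: add the 302's cookie to the session, then GET the admin page.
# COOKIENAME, CookieID and the domain below are placeholders.
$cookie = New-Object System.Net.Cookie('COOKIENAME', 'CookieID', '/', 'www.example.com')
$session.Cookies.Add($cookie)
Invoke-WebRequest $uri -WebSession $session -Method GET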
All help is appreciated, and I can provide any header/source needed, but the web page is super simple, so there is not a whole lot going on in the front end other than a couple of buttons and a logo with a little bit of inline JavaScript.
EDIT
After doing some additional reading about 302s and redirects, I found out that shouldn't be a problem. The reason for this is explained in this question.
I figured out my problem. The inline JavaScript validates that the length of the string is greater than 0 before submitting the login. I don't think there is a way to bypass the client-side validation easily in a script.

Getting hold of Amazon Fiona (Kindle) CSRF token

Amazon has an administration page for content sent to your Kindle. This page uses an undocumented HTTP API that sends requests like this:
{
  "csrfToken": "gEABCzVR2QsRk3F2QVkLcdKuQzYCPcpGkFNte0SAAAAAJAAAAAFkUgW5yYXcAAAAA",
  "data": {
    "param": {
      "DeleteContent": {
        "asinDetails": {
          "3RSCWFGCUIZ3LD2EEROJUI6M5X63RAE2": { "category": "KindlePDoc" },
          "375SVWE22FINQY3FZNGIIDRBZISBGJTD": { "category": "KindlePDoc" },
          "4KMPV2CIWUACT4QHQPETLHCVTWEJIM4N": { "category": "KindlePDoc" }
        }
      }
    }
  }
}
I made a wrapper library for the previous API they used, but this time they have added CSRF tokens, making each session unique. That is a bit of a show-stopper, and I was wondering how I can get hold of these tokens. I did not find them in the cookies. This is for use in a Chrome extension, so CORS is not an issue.
Actually, after manually searching the Response tab of each request under the "XHR" and "Doc" tabs, I was able to find out that this token is set in an inline script in myx.html (the main page):
var csrfToken = "gPNABCIemSqEWBeXae3l1CqMPESRa4bXBq0W7rCIAAAAJAAAAAFkUlo1yYXcAAAAA";
This means it is set on the window object, making it available to everything there. I guess this means a Chrome extension would need to fetch this page and manually parse the HTML to retrieve this token. Sad, but doable, although highly fragile :-(
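As a hedged illustration of that parsing step (shown in PowerShell rather than extension JavaScript; $myxUrl is a placeholder for the real page address, and an authenticated session would be required), a regex over the fetched HTML is probably enough:
# Hypothetical sketch: fetch the page and regex out the inline csrfToken.
# $myxUrl and $session (an authenticated web session) are placeholders.
$page = Invoke-WebRequest -Uri $myxUrl -WebSession $session
if ($page.Content -match 'var\s+csrfToken\s*=\s*"([^"]+)"') {
    $csrfToken = $Matches[1]   # the captured token value
}
$csrfToken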

Why does Fiddler break my site's redirects?

Why does using Fiddler sometimes break my site on page transitions?
After a server-side redirect, the HTTP response (as found in Fiddler) contains this:
Object moved
Object moved to here.
The site is an ASP.NET 1.1 / VB.NET 1.1 [sic] site.
Why doesn't Fiddler just go there for me? I don't get it.
I'm fine with this issue when developing, but I'm worried that other proxy servers might cause this issue for real customers. I'm not even clear exactly what is going on.
That's actually what Response.Redirect does. It sends a 302 - Object moved response to the user-agent. The user-agent then automatically goes to the URL specified in the 302 response. If you need a real server-side redirect without round-tripping to the client, try Server.Transfer.
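To see that round trip for yourself from PowerShell (a hedged sketch assuming Windows PowerShell semantics; the URL is a placeholder), tell Invoke-WebRequest not to follow the redirect:
# Hedged sketch: surface the 302 instead of following it (URL is a placeholder)
try {
    Invoke-WebRequest 'http://example.com/old-page.aspx' -MaximumRedirection 0 -ErrorAction Stop
} catch {
    $resp = $_.Exception.Response      # the raw 302 that Response.Redirect sent
    [int]$resp.StatusCode              # 302
    $resp.Headers['Location']          # where the user-agent is told to go next
}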
If you merely constructed the request using the request builder, you're not going to see Fiddler automatically follow the returned redirect.
In contrast, if you are using IE or another browser, it will generally check the redirect header and follow it.
For IE specifically, I believe there's a timing corner case where the browser will fail to follow the redirect in obscure situations. You can often fix this by clicking Tools / Fiddler Options, and enabling both the "Server" and "Client" socket reuse settings.
Thanks user15310, it works with Server.Transfer:
Server.Transfer("newpage.aspx", true);
Firstly, transferring to another page using Server.Transfer conserves server resources. Instead of telling the browser to redirect, it simply changes the "focus" on the Web server and transfers the request. This means you don't get quite as many HTTP requests coming through, which therefore eases the pressure on your Web server and makes your applications run faster.
But watch out: because the "transfer" process can work on only those sites running on the server, you can't use Server.Transfer to send the user to an external site. Only Response.Redirect can do that.
Secondly, Server.Transfer maintains the original URL in the browser. This can really help streamline data entry techniques, although it may make for confusion when debugging.
That's not all: The Server.Transfer method also has a second parameter—"preserveForm". If you set this to True, using a statement such as Server.Transfer("WebForm2.aspx", True), the existing query string and any form variables will still be available to the page you are transferring to.
Read more here:
http://www.developer.com/net/asp/article.php/3299641/ServerTransfer-Vs-ResponseRedirect.htm