The login page is this: https://login.procore.com/
I feel like I'm close to getting it to work, but I've hit a brick wall due to a lack of understanding of login procedures. Here is the code so far, without the actual sign-in information.
# Fetch the login page and keep the session (cookies) in $fb
$r = Invoke-WebRequest 'https://login.procore.com/' -SessionVariable fb

# Grab the first form on the page and fill in the credential fields
$form = $r.Forms[0]
$form.Fields["session_email"] = "xxxxxxxxx"
$form.Fields["session_password"] = "xxxxxxxx"

# Submit the form to its action URL, reusing the same session
$r = Invoke-WebRequest ('https://login.procore.com/' + $form.Action) -WebSession $fb -Method $form.Method -Body $form.Fields
Could someone help me understand what is missing? I did notice that $form.Fields contains an empty field named: session_sso_target_url, but honestly have no clue what it means, or how to use it.
You've given me insufficient info to provide a complete answer: I don't have a login, I don't see a way to sign up for a free trial, and you haven't stated what kind of error you are getting.
I'd hazard a guess that session_sso_target_url relates to federation, which is semantically related to single sign-on (SSO). In federation, an application is configured to accept logins from another login domain. The obvious example in corporateland is ADFS, but any time you see an app that offers "Login with Facebook" or "Login with Google", that's the same thing. Federation is a big topic. The point of the target URL is that the browser is often redirected to the identity provider (ADFS / FB / GOOG etc.) along with the callback URL that the browser should come back to once it is authenticated.
Suffice it to say that I suspect that you need do nothing with this field! And the reason I say this is because I hit it with Fiddler.
You should know about Fiddler. It is a cost-free debugging proxy from Telerik. I am not affiliated with Telerik, but I owe them hours of saved time when web scraping. (It is not the only tool for the job, and if any moderator deems that I am violating site rules, I will be happy to sanitise this post.)
Do this:
Install Fiddler
Set it up to listen on 127.0.0.1 (any port you like) and to be your system proxy
In Tools > Options > HTTPS, set it to decrypt HTTPS (this will replace all certs with auto-generated self-signed ones, so do not leave this running while you perform other tasks)
Set your filters to only include traffic to *.procore.com
Log in through your browser - you should now see web traffic in the left-hand pane. This captured traffic is your baseline.
Select any one web request and look at the Inspectors tab in the right-hand pane. You can look at Raw, Forms, Cookies, etc. This gives you a low-level view of what your client is doing.
Run your code snippet. You can now compare the differences between the baseline and your code, and adjust accordingly.
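Once you see where your snippet diverges from the browser baseline, a typical first adjustment is to mirror the browser's User-Agent and then check what the server handed back. A minimal sketch, reusing the question's variables (the UA string is an illustrative assumption; copy the real one from your capture):
$r = Invoke-WebRequest ('https://login.procore.com/' + $form.Action) `
    -WebSession $fb -Method $form.Method -Body $form.Fields `
    -UserAgent 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'   # mirror the UA seen in Fiddler

$r.StatusCode                                          # compare against the baseline response
$fb.Cookies.GetCookies('https://login.procore.com/')   # did the server hand back session cookies?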
Related
I need to scrape a webpage.
I specifically need to extract the section "USDT Funding Market" on the page.
$WebExtract = Invoke-WebRequest -Uri "https://www.kucoin.com/margin/lend/USDT"
However, neither $WebExtract.Content nor $WebExtract.AllElements contains the "USDT Funding Market" section.
The reason for this is that your initial web request only returns the HTML of the page. A lot of modern web applications like this one have JavaScript under the hood that reaches out to their APIs and fills the page with data after the initial HTML is loaded.
The good news is that with public websites like this you can usually hit the underlying APIs without worrying about authentication, which actually makes parsing the data easier, as the responses are typically JSON.
Try opening the debug console in your browser (usually F12), open the Network tab and refresh the page; you can typically find these API requests/endpoints in there.
Simply copy that URL out and invoke the web request against the API instead, for example:
$request = Invoke-WebRequest -Uri 'https://www.kucoin.com/_api/currency/prices?base=USD&targets='
$json = $request.Content | ConvertFrom-Json   # the API responds with JSON, so parse it
$json.data                                    # the payload lives under the 'data' property
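If the shape of the response isn't obvious, a quick, assumption-free way to inspect what came back:
$json.data | Get-Member -MemberType NoteProperty | Select-Object -First 10   # list the first few properties of the payload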
This might not be the specific API endpoint/data you're after, but hopefully this points you in the right direction. If you poke around in the console you should be able to find the one you're looking for.
EDIT:
Apologies for the misleading information here; I didn't notice that these numbers were updating in real time on the site. The above is still pretty common for a lot of websites, so I'll leave it, but on sites that are this responsive/live, another common technology is WebSockets.
If you have a look at the last lines of the console, you should be able to see a request ending with 'transport=websocket' to a URL like this:
wss://ws-web.kucoin.com/socket.io/?token=...%3D%3D&format=json&acceptUserMessage=false&connectId=connect_welcome&EIO=3&transport=websocket
If you select this line and head over to the Messages tab in your browser console/debugger, you'll be able to see the WebSocket messages being returned.
This looks like the data you're after, but I'm not hugely familiar with querying WebSockets through PowerShell.
Simply invoking a web request won't work here, as the communication happens over the WebSocket protocol. You would also need to find a way to get a valid token for opening a WebSocket connection (likely just a web request for that).
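For what it's worth, here is a minimal sketch of opening a WebSocket from PowerShell via the .NET ClientWebSocket class (Windows 8+ / .NET 4.5). The token value is a placeholder, and KuCoin's real socket.io handshake may need more than a single raw receive:
$ws  = New-Object System.Net.WebSockets.ClientWebSocket
$ct  = [System.Threading.CancellationToken]::None
$uri = [Uri]'wss://ws-web.kucoin.com/socket.io/?token=YOUR_TOKEN&EIO=3&transport=websocket'   # YOUR_TOKEN is a placeholder
$ws.ConnectAsync($uri, $ct).Wait()

# Receive one frame into a buffer and decode it as UTF-8 text
$buffer  = New-Object 'byte[]' 8192
$segment = New-Object 'System.ArraySegment[byte]' -ArgumentList @(,$buffer)
$result  = $ws.ReceiveAsync($segment, $ct).Result
[System.Text.Encoding]::UTF8.GetString($buffer, 0, $result.Count)

$ws.CloseAsync([System.Net.WebSockets.WebSocketCloseStatus]::NormalClosure, 'done', $ct).Wait()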
Perhaps these posts will help:
http://wragg.io/powershell-slack-bot-using-the-real-time-messaging-api/
How to use a Websocket client to open a long-lived connection to a URL, using PowerShell V2?
PowerShell: Invoke-WebRequest is not working for me for a particular site. I've tried many things, like setting TLS options, but it still fails.
One odd thing: it works for the first request, but then fails for about 15 minutes, which is not the case when browsing the site in a browser. Can anyone help me, please?
Invoke-WebRequest 'https://www1.nseindia.com/products/content/equities/equities/eq_turnapr2020.htm'
For the next request it waits forever and then times out, but in a browser (IE, Chrome) that is not the case.
[Screenshot: timeout error on the second request; the first request works.]
It was because of the website's anti-scraping mechanism. To bypass it, we need to make the request look like a user request from a browser. For this particular problem with this website, the solution was to include headers in the Invoke-WebRequest call.
You can grab a copy of the request headers via Chrome Developer Tools > Network tab > (right-click the URL) > Copy > Copy as PowerShell.
Then paste that and keep the hashtable of headers (you can remove the cookies section); see the sketch below.
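What you end up with is something like this sketch. The header values are illustrative assumptions; use the ones Chrome exported for you. Note that the User-Agent goes in -UserAgent rather than -Headers, since Windows PowerShell treats it as a restricted header:
$headers = @{
    'Accept-Language' = 'en-US,en;q=0.9'   # placeholder value; copy yours from Chrome
}
$response = Invoke-WebRequest 'https://www1.nseindia.com/products/content/equities/equities/eq_turnapr2020.htm' `
    -UserAgent 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' `
    -Headers $headers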
TL;DR
I need to submit multiple forms on a site that reloads pages, but the server closes the session after every request. It's a single-page app with only one URL, and I think this is preventing my POST from going through. The main problem is that every request uses the first state of the page, so I can't get to states further along in the process.
What I'm trying to accomplish
I am trying to automate a process that requires me to take the following steps in order:
Navigate to webpage. (GET)
Click on a button that reloads the page with new data but uses same URL. (POST)
Enter text into a field on new page.
Click on the form to submit the text. (POST)
... perform unrelated admin tasks....
I'm trying to automate this process using the Invoke-WebRequest cmdlet in PowerShell, following steps similar to those found in the PowerShell Cookbook for the Facebook login example.
When I run an Invoke-WebRequest from a completely fresh PowerShell session I get a response, but I can't reuse that session ever again. To make another request I need to create a new -SessionVariable or use -DisableKeepAlive.
The server always returns Connection: close in the response, no matter what, even though it is using HTTP 1.1, and it's not my site, so I can't change this.
So how can I establish a connection to the server that I can reuse to POST the form data? I feel like it should be doable, because it is clearly happening on the web page itself.
When I go to the web page, open the Developer Tools in Chrome and step through the process, the request contains this in the Form Data field:
RAW
ggt_textbox%2810007%29=&action=adminloginbtn&ggt_hidden%2810008%29=2
PARSED and DECODED
ggt_textbox(10007):
action:adminloginbtn
ggt_hidden(10008):2
If I try to do something like this:
Invoke-WebRequest $uri -SessionVariable session -Verbose -Method POST -Body "ggt_textbox%2810007%29=&action=adminloginbtn" -DisableKeepAlive
It returns the page I'm expecting in step 2. So I performed steps 3 and 4 in Chrome to try to do the same thing. I get the following Form Data in Chrome Dev Tools:
RAW
ggt_textbox%2810006%29=textIentered&action=loginbtn&ggt_hidden%2810008%29=3
PARSED
ggt_textbox(10006):textIentered
action:loginbtn
ggt_hidden(10008):3
So that made me think I could do something like this:
Invoke-WebRequest $uri -WebSession $session -Verbose -Method POST -Body "ggt_textbox%2810006%29=textIentered&action=loginbtn&ggt_hidden%2810008%29=3"# -DisableKeepAlive
But since the main page and the login page use the same URI, it tries to POST to a form that doesn't exist, because it's looking at the very first page.
I did some more digging and found that when I perform this same action from the web page itself, it returns a 302 Moved Temporarily status code, and the response header actually has a cookie in it (the connection still closes), which is a first. The browser then appears to do a GET request using the new cookie, and I'm logged into the admin page.
So I think I have two problems I need to get around:
How can I get to the form that exists after I click the first button since they use the same URI?
How can I get around the 302 status, since I'm only getting back a header and nothing else? I think I need to do a GET request using the cookie from the header, but I'm not sure how to specify a cookie with Invoke-WebRequest. I think I would need to use the -Headers parameter and specify Cookie: COOKIENAME=CookieID (see the sketch below).
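For the cookie part of question 2, a minimal sketch, assuming the COOKIENAME/CookieID placeholders from above:
# 1) Put the cookie in the WebSession's cookie container; every request on
#    that session will then send it automatically.
$cookie = New-Object System.Net.Cookie('COOKIENAME', 'CookieID', '/', ([Uri]$uri).Host)
$session.Cookies.Add($cookie)
Invoke-WebRequest $uri -WebSession $session

# 2) Or send it as a raw header on a single request (works in many
#    PowerShell versions; if it gets dropped, prefer option 1).
Invoke-WebRequest $uri -Headers @{ Cookie = 'COOKIENAME=CookieID' }

# Tip: if the 302 response arrived on a -WebSession request, the Set-Cookie
# value may already be in the container; check with:
$session.Cookies.GetCookies([Uri]$uri)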
I think most of all I need to get through my first question, and from there I can start working towards the second.
All help is appreciated, and I can provide any header/source needed, but the web page is super simple, so there is not a whole lot going on in the front end other than a couple of buttons and a logo with a little bit of inline JavaScript.
EDIT
After doing some additional reading about 302s and redirects, I found out that shouldn't be a problem; the reason is explained in this question.
I figured out my problem. The inline JavaScript validates that the string length is greater than 0 before submitting the login. I don't think there is a way to bypass that client-side validation easily in a script.
I am trying to query the Workfront REST Services from PowerShell
I am using a URL like this
https://ourcompany.attask-ondemand.com/attask/api/v4.0/project/search?apiKey=XYZetc
This returns JSON in both IE and Chrome and works in my Web Service tester.
All this runs behind a corporate proxy obviously.
The PowerShell I am using is
$postResult = Invoke-RestMethod -Uri $URI -Method "GET" -Proxy http://internalproxyname:80 -ProxyUseDefaultCredentials
This fails with an Error
{"error":{"class":"com.attask.common.AuthenticationException","message":"You
are not currently logged in"}}
This looks like an error at the attask end, not at the proxy at our end (I get different errors running this as a non-authenticated user, or with mangled credentials passed to the proxy).
The docs suggest I don't need to be logged in if I'm using an apiKey, and I am not logged in in the browsers I am using (I don't even have a user account on the Workfront instance).
I have trawled various blogs and Stack answers to no avail. Can anyone point me in the right direction for figuring out what is going on, or what I might be doing wrong?
I have enabled a trust-all-certs policy and set the certificate validation callback to ignore errors within PowerShell, but I've equally tried with these turned off, and I've also investigated various properties on ServicePointManager. I can produce any number of different errors/issues, but the closest I get seems to be the above.
Oh and the Workfront API docs and examples being wrong didn't help me when I was getting started :-)
many thanks
Steve
OK, this was me being stupid. There was a bug in the code generating the URI (an extra slash), and the attask default error response is an authentication error, not a mangled-request error.
For reference, the URL needs to be in the form shown in my original post. Don't leave off the API version number, and don't use a port number as the code samples show.
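To spell that out, a minimal sketch of the corrected call (host, key, and proxy values are the placeholders from the question):
$apiKey = 'XYZetc'   # placeholder API key
# Note the API version segment (v4.0), no port number, and no doubled slash
$URI = "https://ourcompany.attask-ondemand.com/attask/api/v4.0/project/search?apiKey=$apiKey"
$postResult = Invoke-RestMethod -Uri $URI -Method Get -Proxy 'http://internalproxyname:80' -ProxyUseDefaultCredentials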
Always look for the simple things first (I should remember that)
Doh!
Amazon has an administration page for content sent to your Kindle. This page uses an undocumented HTTP API that sends requests like this:
{
  "csrfToken": "gEABCzVR2QsRk3F2QVkLcdKuQzYCPcpGkFNte0SAAAAAJAAAAAFkUgW5yYXcAAAAA",
  "data": {
    "param": {
      "DeleteContent": {
        "asinDetails": {
          "3RSCWFGCUIZ3LD2EEROJUI6M5X63RAE2": { "category": "KindlePDoc" },
          "375SVWE22FINQY3FZNGIIDRBZISBGJTD": { "category": "KindlePDoc" },
          "4KMPV2CIWUACT4QHQPETLHCVTWEJIM4N": { "category": "KindlePDoc" }
        }
      }
    }
  }
}
I made a wrapper library for the previous API they used, but this time they have added CSRF tokens, making each session unique. That is a bit of a show-stopper, and I was wondering how I can get hold of these tokens. I did not find them in the cookies. This is for use in a Chrome extension, so things like CORS are not an issue.
Actually, after manually searching through the Response tab of each request under the "XHR" and "Doc" tabs, I found that this token is set in an inline script in myx.html (the main page):
var csrfToken = "gPNABCIemSqEWBeXae3l1CqMPESRa4bXBq0W7rCIAAAAJAAAAAFkUlo1yYXcAAAAA";
This means it is set on the window object, making it available to everything running there. I guess this means a Chrome extension would need to fetch this page and manually parse the HTML to retrieve the token. Sad, but doable, although highly fragile :-(
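For illustration, a sketch of that scrape in PowerShell (a Chrome extension would do the same with fetch() and a regex). The URL variable and session are placeholders; the pattern matches the inline script shown above:
# $myxUrl stands in for the real myx.html address; $amazonSession would be a
# WebRequestSession already carrying your Amazon login cookies.
$page = Invoke-WebRequest $myxUrl -WebSession $amazonSession
if ($page.Content -match 'var csrfToken = "([^"]+)"') {
    $csrfToken = $Matches[1]   # token to include in subsequent API requests
}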