How do I access the data shown on the sources tab in Chrome? - dom

So I am a little bit stuck here.
I am doing some scraping with Puppeteer, but at some point I have to download a file. The problem is that the file is "generated" after a click on a button. I know how to do that in Puppeteer, and also how to catch requests and responses from the page; however, none of that has been of use.
So I have a button on a previous page, this is the button after inspecting it.
<button
id="ReporteOpinionForm:botonConsultar"
name="ReporteOpinionForm:botonConsultar"
class="ui-button ui-widget ui-state-default ui-corner-all ui-button-text-only ui-button"
type="submit">
<span class="ui-button-text ui-c">
Consultar
</span>
</button>
By inspecting the whole page I can see that it uses PrimeFaces and JSF.
Once I click it, an XMLHttpRequest is sent to an XHTML endpoint.
The response is nothing more than an ID of some sort. The method is POST and the body is FormData, which I think contains nothing important.
After a few seconds, a new page is loaded with the PDF embedded.
But after inspecting the page, it only has an empty body.
However, if I go to the Sources tab in DevTools, I can see the content of the page.
The content is a base64-encoded string which, when decoded to a PDF file, results in the file shown in the viewer. So the primary objective here is to download the file. I have tried several things:
• Catch the request and the response and copy the response, but the response of the XMLHttpRequest is different and is not the base64 string.
• Move the mouse to the PDF bar that appears at the top of the page and click the download button, but that doesn't work either.
• Try to print the page to a PDF; however, the script breaks in headless mode when the button to generate the PDF is clicked.
I am lost and I don't know what to do, or what I am missing or not seeing.
Any help would be appreciated, thanks.
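Since the question says the base64 string decodes to the PDF, the decoding step itself can be sketched in Node.js as below. This is only a minimal sketch, not a full answer: the function names `decodeBase64Pdf` and `saveBase64Pdf` are hypothetical, and how the base64 text is obtained from the page is left open.

```javascript
const fs = require("node:fs");

// Decode a base64 string (for example, one copied from the Sources tab)
// into raw bytes. Throws if the result lacks the "%PDF-" file signature.
function decodeBase64Pdf(base64Text) {
  // Remove whitespace and line breaks that copy-paste may introduce.
  const cleaned = base64Text.replace(/\s+/g, "");
  const bytes = Buffer.from(cleaned, "base64");
  if (bytes.subarray(0, 5).toString("latin1") !== "%PDF-") {
    throw new Error("Decoded data does not look like a PDF");
  }
  return bytes;
}

// Hypothetical helper: write the decoded bytes to a file on disk.
function saveBase64Pdf(base64Text, outputPath) {
  fs.writeFileSync(outputPath, decodeBase64Pdf(base64Text));
}
```

Calling `saveBase64Pdf(text, "report.pdf")` would write the file; the signature check is there so that a response that is not the PDF (such as the ID returned by the XMLHttpRequest) fails loudly instead of producing a broken file.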

Related

Is it possible to prevent Google Web Light from removing multipart attributes from form tag?

I have a form that contains a file input and the form is configured correctly to handle this normally. The form tag contains the following attribute:
enctype="multipart/form-data"
This works fine in all normal browsers, but not in Google Web Light. In Google Web Light I noticed that the file name is being passed, but not the associated file data.
Upon inspecting the page as modified by Google Web Light, I noticed that the enctype attribute was removed, and I believe this is why the file data is not getting passed to the form's action.
Is there any way to prevent Google Web Light from doing this?
I'm also looking for a way of having Google Web Light preserve:
enctype="multipart/form-data"
As a workaround, have you tried 'Opting out of transcoding' from https://support.google.com/webmasters/answer/6211428?hl=en ?
If you do not want your pages to be transcoded, set the HTTP header "Cache-Control: no-transform" in your page response. If Googlebot sees this header, your page will not be transcoded.
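As an illustration of the opt-out header quoted above, here is a minimal Node.js sketch that sets it on every response. The question does not say which server stack is in use, so the handler name, the port, and the sample form markup are all assumptions; the same header can be set in any server or framework.

```javascript
const http = require("node:http");

// Hypothetical request handler: every response carries the
// "Cache-Control: no-transform" header described in the answer.
function handleRequest(req, res) {
  res.setHeader("Cache-Control", "no-transform");
  res.setHeader("Content-Type", "text/html; charset=utf-8");
  // Placeholder page containing a multipart upload form.
  res.end(
    '<form method="post" enctype="multipart/form-data">' +
      '<input type="file" name="upload"><input type="submit">' +
      "</form>"
  );
}

// Not called here; start it with startServer(8080) to try it out.
function startServer(port) {
  return http.createServer(handleRequest).listen(port);
}
```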

How can I render a completed CGI form as a PDF?

I have an HTML form which a user may have filled in or partially filled in. I want to snapshot that state and render it as a PDF document. I've been using wkhtmltopdf.
I've tried this from both the client side and the server side, and the rendered result is always the original form, never the filled-in one.
I notice if I reload the filled-in form page I get back the filled-in form, but if I cut and paste the form's URL into a new window, I get the initial, non-filled-in form.
So I've convinced myself that, if I could use CGI::Session properly, I could successfully open a session identical to the filled-in session. I tried using CGI::Session::Plugin::Redirect with no joy. I think the key is that window.open() has to use the SID of the filled-in form window.
I don't have a lot of experience with CGI session management, so this has been a four-day quest to nowhere. Any advice is appreciated, even if it's to abandon this approach and go back to the more common post->render a new form in a new window, and generate the PDF from that. I'd like to avoid all of that if I can.
Say you have the following HTML document on your web server:
/var/www/html/index.html
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<form action="/process.cgi">
<input type="text" name="foo">
</form>
</body>
</html>
When you navigate to http://hostname/index.html in your browser, the webserver returns this document and the browser displays it.
When you fill in the text field in your browser, the document on the webserver doesn't change. So anybody who navigates to http://hostname/index.html will get the original, unmodified form. This is why you can't simply copy and paste the URL into another browser tab and get the filled-in form.
Most browsers use caching by default. When you fill in some fields in a form, the browser caches what you entered. When you reload the page, the webserver sends the exact same document as before* (i.e. the unmodified form), but the browser uses the cached data to fill in the form fields the way you had them. If you override the cache when you reload the page (Ctrl+F5 in Firefox), the form fields will not be filled in. Note that neither the URL nor the document on the server has changed. This is why you can't copy and paste the URL into another browser tab after reloading the page and get the filled-in form.
wkhtmltopdf takes a URL, renders the corresponding page, and generates a PDF based on what is rendered. Based on the explanation above, it should now be clear why wkhtmltopdf always generates a PDF of the unmodified form.
The solution
If filling in form fields doesn't change anything on the webserver, what does it change? It changes the DOM, a structure describing the document in your browser that you can access using JavaScript.
One approach would be to use a client-side JavaScript PDF generator like jsPDF; since it runs on the client, it has access to the DOM that the user is interacting with, so it can "see" the values the user enters into the form fields.
* Actually, the webserver will typically send a 304 Not Modified response to save bandwidth, but form caching works the same either way.
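To make the client-side idea concrete, here is a rough sketch of reading the values a user has typed, which live only in the DOM. The function names are hypothetical, and the code only assumes that `form` behaves like an HTMLFormElement (it has an `elements` collection of controls with `name`, `type`, `value`, and `checked`).

```javascript
// Collect the current values of a form's named controls.
function collectFormValues(form) {
  const values = {};
  for (const control of Array.from(form.elements)) {
    // Skip unnamed controls, such as a bare submit button.
    if (!control.name) continue;
    // Skip checkboxes and radio buttons that are not ticked.
    const isToggle = control.type === "checkbox" || control.type === "radio";
    if (isToggle && !control.checked) continue;
    values[control.name] = control.value;
  }
  return values;
}

// Turn the collected values into plain text lines, one per field.
function formatAsLines(values) {
  return Object.entries(values).map(([name, value]) => `${name}: ${value}`);
}
```

In the browser this could be called as `formatAsLines(collectFormValues(document.querySelector("form")))`, and the resulting lines handed to a client-side PDF generator such as jsPDF.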
The explanation from ThisSuitIsBlackNot is accurate about why your design is failing. Typing characters into form fields in a browser changes only your screen and the data in the memory allocated to the browser.
I suggest a different solution. The WWW::Mechanize::Firefox module is a variant of WWW::Mechanize that uses a real browser application to retrieve and render web pages. It is mostly chosen when a site requires JavaScript support, but it is useful here because it has a content_as_png method which returns a PNG image of the current page. Hopefully that is enough for you to build a PDF file with the required content.

Facebook "Like" button encoding

I've put a Like button on a Joomla site with Cyrillic text:
http://womanew.kz/index.php/2011-10-22-01-42-12/2011-11-01-05-34-04/147-2011-11-07-06-22-02
When I push the Like button, Facebook's scraper encodes the page incorrectly.
I've also created a static test HTML page identical to this URL (http://womanew.kz/test.html), and it is scraped correctly.
Both pages have the same content, but the problem page is generated by a Joomla PHP script.
I've also noticed that the scraper does not re-encode the whole document (it keeps the head block unchanged); please see its "See exactly what our scraper sees for your URL" debugging page on Facebook.
What can I do to solve this problem?
When I did this I used the following instructions and didn't have this problem. This is for a Joomla website:
1. Go to the Facebook developer page here: http://developers.facebook.com/docs/reference/plugins/like-box
2. Enter the page number (found in your Facebook page URL) of your Facebook page.
3. If you don't have one, go to Facebook Create Page wizard and create a Brand >> Website page for your site.
4. Enter the height (typically 400 pixels) that you want for the Like widget
5. Click the Get Code button to generate the Facebook Code.
6. Open the Joomla Administrator on your site
7. Navigate to the Module Manager
8. Click the New button
9. Select Custom HTML
10. Paste the code into the HTML box (make sure the editor is turned off).
11. Give the module a name
12. Click the Hide Title radio button
13. Select the Module Position where you want the widget to appear
14. Click Save
Also, take a look at http://www.abolkog.com/portal/tutorials-how-to/100-how-to-add-a-facebook-like-box-to-your-joomla-site
Hope it helps!
I've found the bug in the site. One of Joomla's modules was truncating UTF-8 strings on byte boundaries, which left a stray single-byte character in the page; after this character the scraper starts using the wrong encoding.
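The kind of bug described above can be illustrated with a small sketch. Joomla itself is PHP, so this JavaScript version is only an illustration of the technique, and `truncateUtf8` is a hypothetical name: it cuts a string to a byte limit while backing up to a character boundary, so that no multi-byte character (such as a Cyrillic letter, which takes two bytes in UTF-8) is split in half.

```javascript
// Truncate text to at most maxBytes of UTF-8 without splitting a character.
function truncateUtf8(text, maxBytes) {
  const bytes = Buffer.from(text, "utf8");
  if (bytes.length <= maxBytes) return text;
  let end = maxBytes;
  // UTF-8 continuation bytes have the bit pattern 10xxxxxx. If the cut
  // would land on one, move back until it sits on the start of a character.
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) end--;
  return bytes.subarray(0, end).toString("utf8");
}

// Naive version for comparison: cuts on a raw byte boundary, which can
// leave half of a multi-byte character at the end of the output.
function truncateBytesNaive(text, maxBytes) {
  return Buffer.from(text, "utf8").subarray(0, maxBytes).toString("utf8");
}
```

With the input "Привет" and a limit of 5 bytes, the naive version ends in a replacement character because the third letter is cut in half, while `truncateUtf8` stops after the second letter.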

Facebook Send button does not show up on Gwt popup panel?

I'm trying to get the Facebook Send or Like button to display on a GWT popup panel, without success. When inspecting the generated HTML, the Facebook button's HTML looks properly inserted, but it just does not show up. The same Facebook button works well in a plain HTML page (not on a GWT panel).
Have you ever had success displaying the Facebook Send, Like, or Share button in GWT? I know I can implement the FB Like button myself using the API or other libraries, but I don't need that. I need the standard Facebook Send button to be used in my GWT application.
Please advise me if you have experience and were successful.
Thank you very much!
I had a similar problem, and it was because, in the HTML file's script tag, I had changed things and not updated the src="path" to a valid path. I had the wrong directory listed in the path. For me, it's correct now with:
<script type="text/javascript" src="clienttest/clienttest.nocache.js"></script>
I rebuilt, and the image appeared on the page. This problem was keeping the entry point module from loading.

How do I see what jQuery Mobile loaded in the background?

Using jQuery Mobile, I have seen how it adds to the existing DOM when links are clicked and a related page is served. Then, when ready, it switches to that new data-role="page". But when I do a "view source" in the browser (Google Chrome or Mozilla Firefox), I see the original page, as delivered, without the additional things loaded later (DOM injections). How can I see what the browser really has (post-render HTML)? If it happens to be a JavaScript solution, please don't presume I know where to put it and how to trigger it to show the content.
In Chrome: Wrench > Tools > Developer Tools would be a way to see the 'live' DOM.
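If a JavaScript route is preferred, the live markup can also be read as a string. The sketch below wraps that in a hypothetical helper, `liveHtml`, which only assumes it is given an object shaped like the browser's `document`.

```javascript
// Return the current (post-render) markup of a document, including
// anything scripts have injected since the page was first delivered.
function liveHtml(doc) {
  return doc.documentElement.outerHTML;
}
```

To use it without putting code in the page: open the Developer Tools, switch to the Console tab, type `document.documentElement.outerHTML`, and press Enter. The console prints the markup the browser currently holds, rather than the original source shown by "view source".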