Puppeteer doesn't load the page when headless is false - google-chrome-devtools

I have a script that parses a specific webpage.
When I set headless to false, Puppeteer doesn't load the page:
await page.goto('https://www.google.com', {
  waitUntil: 'load',
  // Remove the timeout
  timeout: 0
});
I've tried a lot of configurations, like:
const args = [
  '--no-sandbox',
  '--enable-logging',
  '--v=1',
  '--disable-gpu',
  '--disable-extensions',
  '--disable-setuid-sandbox',
  '--disable-infobars',
  '--window-position=0,0',
  '--ignore-certificate-errors',
  '--ignore-certificate-errors-spki-list',
  '--user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3312.0 Safari/537.36"'
];
const options = {
  args,
  headless: false, // default is true
  userDataDir: "./user_data",
  defaultViewport: null,
  devtools: true,
  ignoreHTTPSErrors: true,
};
But the script hangs at await page.goto until it times out.

As I understand it, you're not actually parsing the google.com page.
The first thing to consider is waitUntil: 'load'. With this option, navigation is considered finished when the load event is fired.
The load event is fired when the whole webpage (HTML) has fully loaded, including all dependent resources such as JavaScript files, CSS files, and images.
There is a good chance that this event never fires within a reasonable timeout in your case, so I would suggest not relying on this waitUntil but using another wait, such as the presence of some selector, for example:
await page.goto('https://www.google.com');
await page.waitForSelector('[name="q"]');
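If a navigation-level wait is still wanted, here is a minimal sketch (assuming, as above, that [name="q"] is the element you actually need) that resolves navigation on DOMContentLoaded and puts an explicit bound on the selector wait:
// A minimal sketch: resolve navigation on DOMContentLoaded instead of the full
// 'load' event, then wait only for the element that is actually needed.
// The 15000 ms value is an arbitrary example, not a recommendation.
await page.goto('https://www.google.com', { waitUntil: 'domcontentloaded' });
await page.waitForSelector('[name="q"]', { timeout: 15000 });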

Related

Companion Uppy always fails with > 5GB uploads when getting the /complete request

companion: 2022-09-22T23:31:07.088Z [error] a62da431-f9ce-4fae-b18d-dc59189a53ea root.error PayloadTooLargeError: request entity too large
at readStream (/usr/local/share/.config/yarn/global/node_modules/raw-body/index.js:155:17)
at getRawBody (/usr/local/share/.config/yarn/global/node_modules/raw-body/index.js:108:12)
at read (/usr/local/share/.config/yarn/global/node_modules/body-parser/lib/read.js:77:3)
at jsonParser (/usr/local/share/.config/yarn/global/node_modules/body-parser/lib/types/json.js:135:5)
at Layer.handle [as handle_request] (/usr/local/share/.config/yarn/global/node_modules/express/lib/router/layer.js:95:5)
at trim_prefix (/usr/local/share/.config/yarn/global/node_modules/express/lib/router/index.js:317:13)
at /usr/local/share/.config/yarn/global/node_modules/express/lib/router/index.js:284:7
at Function.process_params (/usr/local/share/.config/yarn/global/node_modules/express/lib/router/index.js:335:12)
at next (/usr/local/share/.config/yarn/global/node_modules/express/lib/router/index.js:275:10)
at middleware (/usr/local/share/.config/yarn/global/node_modules/express-prom-bundle/src/index.js:174:5)
::ffff:172.29.0.1 - - [22/Sep/2022:23:31:07 +0000] "POST /s3/multipart/FqHx7wOxKS8ASbAWYK7ZtEfpWFOT2h9KIX2uHTPm2EZ.k1INl8vxfdpH7KBXhLTii1WL7GeDLzLcAKOW0vmxKhfCrcUCRMgHGdxEd5Nwxr._omBrtqOQFuY.Fl9nX.Vy/complete?key=videos%2Fuploads%2Ff86367432cef879b-4d84eb44-thewoods_weddingfilm_1.mp4 HTTP/1.1" 413 211 "http://localhost:8080/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
It always returns a 413 on the last request, which has about 188 KB of payload describing all the parts.
I've tried everything, including:
var bodyParser = require("body-parser");
app.use(bodyParser.json({ limit: "50mb" }));
app.use(bodyParser.urlencoded({limit: "50mb", extended: false}));
but it has no effect. I've spent months on this problem, read every article, complaint, and issue on the internet about it, and still no one knows why it's happening or how to resolve it.
Can anyone help?
This is a known issue with the S3 plugin. It is fixed in the latest version of Uppy, but Companion is still on an older version. You can use the S3 Multipart plugin directly, which is what Companion uses under the hood.
const Uppy = require('@uppy/core')
const AwsS3Multipart = require('@uppy/aws-s3-multipart')
const uppy = new Uppy()
uppy.use(AwsS3Multipart, {
  companionUrl: 'https://companion.uppy.io',
  companionCookiesRule: 'same-origin',
  limit: 5,
  getUploadParameters (file) {
    return {
      method: 'post',
      endpoint: 'https://companion.uppy.io/s3/multipart',
      fields: {
        filename: file.name,
        size: file.size,
        contentType: file.type
      }
    }
  }
})
uppy.on('upload-success', (file, data) => {
  console.log('Upload successful', file, data)
})
uppy.on('upload-error', (file, error) => {
  console.log('Upload error', file, error)
})
uppy.addFile({
  name: 'test.txt',
  type: 'text/plain',
  data: new Blob(['hello world'], { type: 'text/plain' })
})
uppy.upload()
The S3 Multipart plugin is a bit more complicated to use than the S3 plugin, but it is more flexible: it lets you upload files larger than 5GB, upload in parallel, and target S3-compatible services like Minio. The S3 plugin is simpler to use, but it has none of those capabilities: no files larger than 5GB, no parallel uploads, and no S3-compatible services like Minio.

Flutter: How to extract user account name and video id from a shortened tiktok url?

I wanted to extract the real TikTok video link, but my code doesn't seem to be working. I want to get
https://www.tiktok.com/#lilymaycreative/video/6911015584570395906?sender_device=pc&sender_web_id=6894321561748211206&is_from_webapp=v
from the shortened link, which is
https://vm.tiktok.com/ZSTjLwCK/
var dio = Dio(BaseOptions(connectTimeout: 10000, receiveTimeout: 10000, headers: {
  'User-Agent':
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}));
try {
  Response response = await dio.get('https://vm.tiktok.com/ZSTjLwCK/');
  _document = parse(response.data);
  if (_document != null) print(_document);
  print(jsonData);
} catch (e) {
  print(e);
}
I can tell you a way that may help solve your problem.
If you press F12 (developer tools) on your example page, you can find the video tag in the HTML and access the video link from it.
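As a complement, the question's own Dio call already requests the short link, so another route is simply to follow its redirect to the canonical URL and read the user name and video id from that. Here is an illustrative sketch in Node.js rather than Dart; the regex pattern is an assumption about TikTok's current URL format:
// Sketch in Node.js (18+, which has a global fetch). Assumes the short link
// redirects to a canonical https://www.tiktok.com/@user/video/<id> URL.
const resolveTikTokUrl = async (shortUrl) => {
  // fetch follows redirects by default; response.url holds the final URL.
  const response = await fetch(shortUrl, {
    headers: { 'User-Agent': 'Mozilla/5.0' } // some endpoints expect a browser-like UA
  });
  // Hypothetical pattern; adjust if TikTok changes its URL scheme.
  const match = response.url.match(/tiktok\.com\/@([^/]+)\/video\/(\d+)/);
  if (!match) return null;
  return { username: match[1], videoId: match[2], url: response.url };
};

resolveTikTokUrl('https://vm.tiktok.com/ZSTjLwCK/').then(console.log);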

Next.js dynamic api pages fail to respond to post requests with Content-Type=application/json headers

I've got a next.js react app running on a custom Express server with custom routes. I'm working on this project by myself, but I'm hoping I might have a collaborator at some point, and so my main goal is really just to clean things up and make everything more legible.
As such, I've been trying to move as much of the Express routing logic as possible to the built-in Next.js api routes. I'm also trying to replace all the fetch calls I have with axios requests, since they look less verbose.
// current code
const data = await fetch("/api/endpoint", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ foo: "bar" })
}).then(x => x.json());
// what I'd like
const data = await axios.post("/api/endpoint", { foo: "bar" });
The problem I've been having is that the dynamic Next.js api routes stall as soon as there's JSON data in the body. I'm not even getting an error; the request just gets stuck as "pending" and the awaited promise never resolves.
I get responses from these calls, but I can't pass in the data I need:
// obviously no data passed
const data = await axios.post("/api/endpoint");

// req.body = {"{ foo: 'bar' }":""}, which is weird
const data = await axios.post("/api/endpoint", JSON.stringify({ foo: "bar" }));

// req.body = "{ foo: 'bar' }" if headers omitted from fetch, so I could just JSON.parse here, but I'm trying to get away from fetch and possible parse errors
const data = await fetch("/api/endpoint", {
  method: "POST",
  // headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ foo: "bar" })
}).then(x => x.json());
If I try to call axios.post("api/auth/token", { token: "foo" }), the request just gets stuck as pending and never resolves.
The Chrome Network panel gives me the following info for the stalled request:
General
Request URL: http://localhost:3000/api/auth/token
Referrer Policy: no-referrer-when-downgrade
Request Headers
Accept: application/json, text/plain, */*
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,es;q=0.8
Connection: keep-alive
Content-Length: 26
Content-Type: application/json;charset=UTF-8
Cookie: token=xxxxxxxxxxxxxxxxxxxx; session=xxxxxxxxxxxxxxxxxxxxxxx
Host: localhost:3000
Origin: http://localhost:3000
Referer: http://localhost:3000/dumbtest
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36
Request Payload
{token: "foo"}
I've tried looking into what might be causing this, and everything seems to point towards there being an issue with preflight requests, but, since those are related to CORS policies, I don't understand why I'd be encountering those. I'm making a request from http://localhost:3000 to http://localhost:3000/api/auth/token.
Even so, I did try to add cors middleware as shown in the next.js example, but that didn't make a difference. As far as I can tell, the request never even hits the server - I've got a console.log call as the first line in the handler, but it's never triggered by these requests.
Is there something obvious I'm missing? This feels like it should be a simple switch to make, but I've spent the last day and a half trying to figure it out, and I keep reaching the same point with every solution I try: staring at a gray pending request in my Network tab and a console reporting no errors at all.
After a few more hours of searching, I found my answer here.
It turns out that since I was using bodyParser middleware in my Express server, I had to disable the Next.js body parsing by adding this at the top of my file:
export const config = {
  api: {
    bodyParser: false,
  },
}
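For context, a minimal sketch of what the whole API route file might then look like; the file name and handler body are assumptions, only the config export comes from the answer above:
// pages/api/auth/token.js (hypothetical file name, matching the route in the question)

// Disable Next.js's built-in body parsing for this route, because the custom
// Express server already runs its own bodyParser middleware.
export const config = {
  api: {
    bodyParser: false,
  },
}

export default function handler(req, res) {
  // req.body is assumed to be populated by the Express bodyParser middleware
  // on the custom server, not by Next.js.
  console.log('handler hit', req.body)
  res.status(200).json({ received: req.body })
}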

Mapping font url to font name

How can I get the font name based on the font URL using Puppeteer?
I am using Network.requestIntercepted to get the list of fonts that are being used on a given website. However, the response does not contain any information about the font family that is being used in the CSS.
Is there a way to get the font-family name and the corresponding font URL that is being used on the page?
await client.on('Network.requestIntercepted', async e => {
  if (e.resourceType == "Font") {
    console.log(e)
    fontCollection.add(e.request.url)
  }
})
While the response contains font details, it does not contain the font-family name
{
  interceptionId: 'interception-job-14.0',
  request: {
    url: 'https://fonts.gstatic.com/s/lato/v15/S6uyw4BMUTPHjx4wWyWtFCc.ttf',
    method: 'GET',
    headers: {
      Origin: 'https://goldrate.com',
      'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/73.0.3679.0 Safari/537.36',
      Accept: '*/*',
      Referer: 'https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i'
    },
    initialPriority: 'VeryHigh',
    referrerPolicy: 'no-referrer-when-downgrade'
  },
  frameId: '4127ABB5A3E704843D0AB4756C7507E4',
  resourceType: 'Font',
  isNavigationRequest: false
}
You have two options:
Guess the font from URL and/or HTTP headers
Download the font file and inspect it
Option 1: Guess the font from URL and HTTP headers
By looking at the request information, you can see the font name at two positions. First, in the URL and second in the Referer:
URL
fonts.gstatic.com/s/lato/v15/S6uyw4BMUTPHjx4wWyWtFCc.ttf
Referer:
fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i
From that information you can therefore find out which font is being used.
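For instance, a minimal sketch of pulling the family name out of the Referer value with Node's URL API (assuming a Google Fonts-style family query parameter, as in the dump above):
// Sketch: extract the font family from a Google Fonts CSS Referer URL.
const referer = 'https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i';
const familyParam = new URL(referer).searchParams.get('family'); // "Lato:100,100i,..."
const familyName = familyParam ? familyParam.split(':')[0] : null;
console.log(familyName); // "Lato"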
Option 2: Download the font file and inspect it
If the first option is not reliable enough (maybe you want to crawl other pages, too?), you can always download the file by using a tool like node-fetch when intercepting the request
and then parse the meta information of the font file.
The library fontkit is able to parse a TTF file and read its metadata, like familyName or fullName:
Code sample
const fetch = require('node-fetch');
const fontkit = require('fontkit');

(async () => {
  const response = await fetch('https://fonts.gstatic.com/s/lato/v15/S6uyw4BMUTPHjx4wWyWtFCc.ttf');
  const buffer = await response.buffer();
  const font = fontkit.create(buffer);
  console.log(font.familyName); // "Lato"
  console.log(font.fullName); // "Lato Regular"
})();
You could then do this inside your Network.requestIntercepted block to find out which font is being used.
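A hedged sketch of that combination, reusing the client and fontCollection from the question; whether the intercepted request also needs to be continued depends on how interception was enabled, so the continueInterceptedRequest call below is an assumption:
const fetch = require('node-fetch');
const fontkit = require('fontkit');

client.on('Network.requestIntercepted', async e => {
  if (e.resourceType === 'Font') {
    try {
      // Download and parse the font file to read its family name.
      const response = await fetch(e.request.url);
      const buffer = await response.buffer();
      const font = fontkit.create(buffer);
      fontCollection.add(`${font.familyName} -> ${e.request.url}`);
    } catch (err) {
      console.error('Could not parse font', e.request.url, err);
    }
  }
  // Assumption: interception was enabled via Network.setRequestInterception,
  // so every intercepted request has to be explicitly continued.
  await client.send('Network.continueInterceptedRequest', { interceptionId: e.interceptionId });
});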

Puppeteer's page.cookies() not retrieving all cookies shown in the Chrome dev tools

Using puppeteer, I am trying to retrieve all cookies for a specific web site (i.e. https://google.com) from Node.js.
My code is:
// Launch browser and open a new page
const browser = await puppeteer.launch({ headless: true, args: ['--disable-dev-shm-usage'] });
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'networkidle2' });
var cookies = await page.cookies();
console.log(cookies);
await browser.close();
It only retrieves 2 cookies, named 1P_JAR and NID. However, when I open the Chrome Dev tools, it shows a lot more.
I tried using the Chrome Dev Tools directly instead of puppeteer but I am getting the same results.
Is there another function I should call? Am I doing it correctly?
The page.cookies() call only gets cookies that are available to JavaScript applications inside the browser, and not the ones marked httpOnly, which you see in the Chrome DevTools. The solution is to ask for all available cookies through the Devtools protocol and then filter for the site you're interested in.
var data = await page._client.send('Network.getAllCookies');
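To then narrow that down to the site you're interested in, a small sketch (the domain check below is a simplistic assumption):
// Sketch: keep only cookies whose domain matches the site of interest.
const siteCookies = data.cookies.filter(cookie => cookie.domain.includes('google.com'));
console.log(siteCookies);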
You can utilise the Chrome DevTools Protocol Network.getAllCookies method to get all browser cookies, regardless of any flags:
const client = await page.target().createCDPSession();
const cookies = (await client.send('Network.getAllCookies')).cookies;
This also plays nicely with TypeScript and TSLint, since something like
const cookies = await page._client.send('Network.getAllCookies');
will raise the error TS2341: Property '_client' is private and only accessible within class 'Page'.
Thanks @try-catch-finally. I got it resolved and it was a simple rookie mistake.
I was comparing cookies in my own Google Chrome instance with the Puppeteer instance. However, in my instance, I was logged in to my Google account and Puppeteer (obviously) was not.
Google uses 2 cookies when you are NOT logged in and 12 when you are logged in.
If you use Playwright in place of Puppeteer, httpOnly cookies are readily accessible:
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto('https://google.com', { waitUntil: 'networkidle' });
  const allCookies = await context.cookies();
  console.log(allCookies);
  await browser.close();
})();
returns:
[
{
sameSite: 'None',
name: '1P_JAR',
value: '2021-01-27-19',
domain: '.google.com',
path: '/',
expires: 1614369040.389115,
httpOnly: false,
secure: true
},
{
sameSite: 'None',
name: 'NID',
value: '208=VXtmbaUL...',
domain: '.google.com',
path: '/',
expires: 1627588239.572781,
httpOnly: true,
secure: false
}
]
Just use await page.goto('https://google.com', { waitUntil: 'networkidle2' }), and you can get all the related cookies.