How to retrieve the HTML of a page from CommonCrawl?

Assuming I have:
the link of the CC*.warc file (and the file itself, if it helps);
offset; and
length
How can I get the HTML content of that page?
Thanks for your time and attention.

Using warcio it would be simply:
warcio extract --payload <file.warc.gz> <offset>
Alternatively, fetch the WARC record with an HTTP range request, then extract the payload at offset 0 of the downloaded file:
curl -s -r331727487-$((331727487+6613-1)) \
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-40/segments/1600400203096.42/warc/CC-MAIN-20200922031902-20200922061902-00310.warc.gz \
>warc_temp.warc.gz
warcio extract --payload warc_temp.warc.gz 0
The range is inclusive: it starts at offset and ends at offset+length-1. See also getting WARC file.
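For readers who prefer to do this without the warcio command line, here is a minimal Python sketch of the same two steps using only the standard library. The function names (range_header, fetch_record, extract_payload) are made up for this illustration. It assumes the record is a single gzip member holding a response record, and it does not undo chunked or compressed HTTP bodies the way warcio does.

```python
import gzip
import urllib.request


def range_header(offset: int, length: int) -> str:
    """Build an inclusive HTTP Range header value for one WARC record."""
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(warc_url: str, offset: int, length: int) -> bytes:
    """Download only the bytes of one gzipped WARC record (needs network access)."""
    request = urllib.request.Request(
        warc_url, headers={"Range": range_header(offset, length)}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def extract_payload(gz_record: bytes) -> bytes:
    """Return the HTTP body (e.g. the HTML) stored in one gzipped WARC record."""
    raw = gzip.decompress(gz_record)
    # A WARC record is: WARC headers, blank line, content block, two CRLFs.
    warc_headers, _, rest = raw.partition(b"\r\n\r\n")
    block_length = None
    for line in warc_headers.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            block_length = int(value.strip())
    block = rest if block_length is None else rest[:block_length]
    # For a response record the content block is the HTTP message:
    # status line and headers, blank line, then the body.
    _http_headers, _, body = block.partition(b"\r\n\r\n")
    return body
```

With the values from the curl example, extract_payload(fetch_record(url, 331727487, 6613)) would return the page body as bytes, where url is the link to the .warc.gz file.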

Related

How do I translate the following POST request into ESP8266 AT-command format?

I've got a working local website that takes in HTML form data.
The fields are:
Temperature
Humidity
The server successfully receives the data and spits out a graph updated with the new entries.
Using a browser tool, I was able to capture the actual POST request as follows:
http://127.0.0.1:5000/add_data
Temperature=25.4&Humidity=52.2
Content-Length:30
Now, I want to migrate from using the human interface browser with manual entries to an ESP01 device using AT commands.
According to the ESP AT-commands documentation, a POST request is performed using the following command:
AT+HTTPCPOST=
Find the link below for the full description of the command.
I cannot seem to get this POST request working. The ESP01 device immediately returns an "ERROR" message without any delay, as though it did not even try to send the request, which suggests that the syntax might be wrong.
Among many variations, the following is my best attempt:
AT+HTTPCPOST="http://MYIPADDR:5000/add_data",30,2,"Temperature: 25.4","Humidity: 52.2"
With MYIPADDR above replaced with my IP address.
How do I translate a POST request into ESP01 AT-command format, and are there any prerequisites that need to be in place to perform such a request?
I did connect the ESP01 device to the WiFi network.
Here's the link to the POST AT command description:
https://docs.espressif.com/projects/esp-at/en/release-v2.2.0.0_esp8266/AT_Command_Set/HTTP_AT_Commands.html#cmd-httpcpost

The documentation says:
AT+HTTPCPOST=<url>,<length>[,<http_req_header_cnt>][,<http_req_header>..<http_req_header>]
Response:
OK
The symbol > indicates that AT is ready for receiving serial data, and you can enter the data now. When the requirement of message length determined by the parameter <length> is met, the transmission starts.
...
Parameters
<url>: HTTP URL.
<length>: HTTP data length to POST. The maximum length is equal to the system allocable heap size.
<http_req_header_cnt>: the number of <http_req_header> parameters.
[<http_req_header>]: you can send more than one request header to the server.
You're sending:
AT+HTTPCPOST="http://MYIPADDR:5000/add_data",30,2,"Temperature: 25.4","Humidity: 52.2"
The length is 30, which is correct. The problem is that everything after the length is treated as HTTP header fields, whereas the variables need to go in the body. So the command is:
AT+HTTPCPOST="http://MYIPADDR:5000/add_data",30
followed, once the ESP-01 has sent the > character, by the body:
Temperature=25.4&Humidity=52.2
Because you passed 30 as the body length, the ESP-01 will read exactly 30 characters after the end of the AT command and send that data as the POST body. If the size of that data changes (for instance, if the temperature is 2.2, which is one digit shorter), you'll need to send the new length rather than 30.
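Since the length has to match the body exactly, it can help to compute both together on the host side. The following Python sketch only builds the two strings and does not talk to the serial port. The helper name build_httpcpost is hypothetical, and MYIPADDR is the placeholder from the question.

```python
from urllib.parse import urlencode


def build_httpcpost(url: str, fields: dict) -> tuple:
    """Return (AT command line, body) with the length derived from the body."""
    # urlencode produces the same key=value&key=value form the browser sent.
    body = urlencode(fields)
    # The output of urlencode is pure ASCII, so bytes and characters agree.
    length = len(body.encode("ascii"))
    command = f'AT+HTTPCPOST="{url}",{length}'
    return command, body


command, body = build_httpcpost(
    "http://MYIPADDR:5000/add_data",
    {"Temperature": "25.4", "Humidity": "52.2"},
)
print(command)  # send this line first, then wait for the > prompt
print(body)     # then send exactly these characters
```

Changing the temperature to "2.2" makes the computed length 29 instead of 30, which is the adjustment described above.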

wget works fine for some .jpgs but downloads an .html file instead for others

I want to download web images from the command line.
This works fine sometimes, other times it doesn't and I can't figure out why.
Here's an example (Wikimedia Commons picture of the day):
wget https://commons.wikimedia.org/wiki/Main_Page#/media/File:01_Calanche_Piana.jpg
This somehow gets me an .html
HTTP request sent, awaiting response... 200 OK
Length: 185986 (182K) [text/html]
Saving to: 'Main_Page'
The following however (it's the same picture but with explicitly selected resolution) gets me a .jpg (which is what I want)
wget https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/01_Calanche_Piana.jpg/640px-01_Calanche_Piana.jpg
...
HTTP request sent, awaiting response... 200 OK
Length: 118796 (116K) [image/jpeg]
Saving to: '640px-01_Calanche_Piana.jpg'
I tried adding -O test.jpg to the first example, but the result is still an .html file.
Does anyone know why the command works in one case but not in the other?

why the command works in one case but not in the other?
This one
https://commons.wikimedia.org/wiki/Main_Page#/media/File:01_Calanche_Piana.jpg
is, despite what its ending might suggest, a link to an HTML page. Note the #, which marks the start of a URI fragment. The fragment is never sent to the server, so wget actually requests https://commons.wikimedia.org/wiki/Main_Page (hence "Saving to: 'Main_Page'"). This one, by contrast,
https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/01_Calanche_Piana.jpg/640px-01_Calanche_Piana.jpg
is the URL of the actual image. If you are wondering what type of file is behind a given URL but do not want to download it, you can run
wget -S --spider https://www.example.com
It will print the response headers. There may be many of them, but Content-Type should suffice for determining the type of resource.
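The same check can be sketched in Python with the standard library. The function names requested_url and content_type are invented for this example, and content_type needs network access and a server that answers HEAD requests.

```python
import urllib.request
from urllib.parse import urldefrag


def requested_url(url: str) -> str:
    """Return the URL without its #fragment, i.e. what the server is asked for."""
    return urldefrag(url).url


def content_type(url: str):
    """Ask for the headers only and return the Content-Type (or None if absent)."""
    request = urllib.request.Request(requested_url(url), method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Content-Type")


page = "https://commons.wikimedia.org/wiki/Main_Page#/media/File:01_Calanche_Piana.jpg"
print(requested_url(page))
```

The printed URL ends in /wiki/Main_Page, which shows that the File:...jpg part never reaches the server.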

How to send multiple json body using jmeter?

I have written a REST API, and my requirement is to send multiple JSON bodies to it with the POST method from JMeter. I have a CSV file with four values (1, 2, 3, 4), and four files named after those values, each containing a JSON body. I used:
Step-1) added the CSV file to JMeter, created a reference and named it JSON_FILE
Step-2) ${__FileToString(C:Path_to_csv_file/${__eval(${JSON_FILE})}.txt,,)}
But this way I can access only the first file, i.e. the one named after the first value. How do I send the bodies of all the files to the API?
Help is highly appreciated.

You won't be able to use CSV Data Set Config as it will read the next value for each thread (virtual user) and/or Thread Group iteration.
If your requirement is to send the bodies of all the files at once, you can use an alternative approach:
Add a JSR223 PreProcessor as a child of the HTTP Request sampler that you use for sending the JSON payload.
Put the following code into the "Script" area:
// Collect the contents of every file listed in plans.csv (one path per line)
def builder = new StringBuilder()
new File('/path/to/plans.csv').readLines().each { line ->
    builder.append(new File(line).text).append(System.getProperty('line.separator'))
}
// Replace any existing parameters with the concatenated text as the raw request body
sampler.getArguments().removeAllArguments()
sampler.addNonEncodedArgument('', builder.toString(), '')
sampler.setPostBodyRaw(true)
The above code iterates through the entries in the plans.csv file, reads each listed file's contents into a string and concatenates them. Once done, it sets the HTTP Request sampler's body data to the resulting string.
Check out The Groovy Templates Cheat Sheet for JMeter to learn more and what else could be achieved using Groovy scripting in JMeter.
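To preview outside JMeter what body the Groovy script would build, the same concatenation can be sketched in Python. The function name concat_bodies is made up for this illustration. Unlike the Groovy version, this sketch skips blank lines in the list file rather than failing on them.

```python
import os
from pathlib import Path


def concat_bodies(list_file: str) -> str:
    """Concatenate the contents of every file listed in list_file (one path per line)."""
    parts = []
    for line in Path(list_file).read_text().splitlines():
        path = line.strip()
        if not path:
            continue  # assumption: ignore blank lines in the list
        # Append the file text followed by the platform line separator,
        # mirroring System.getProperty('line.separator') in the script.
        parts.append(Path(path).read_text() + os.linesep)
    return "".join(parts)
```

Calling concat_bodies('/path/to/plans.csv') returns the string that would become the request body.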

Use Body Data as follows in the HTTP sampler:
${__FileToString(${JSON_FILE},,)}
You have to put all the file paths in your plan.csv file, one file path per line.
Example:
Suppose you have 4 files with JSON bodies that you want to use in your HTTP sampler.
List the paths of these 4 files in your CSV file, plan.csv. Each line contains one file path, like this:
/User/file/file1.json
/User/file/file2.json
/User/file/file3.json
/User/file/file4.json
Now, in your CSV Data Set Config, enter the name of the CSV file that contains all the file paths and give it a variable name such as JSON_FILE.
Then use the line ${__FileToString(${JSON_FILE},,)} in your Body Data. Also set the loop count accordingly (for example, 4 for four files).

TYPO3 7.6: 404 error page: HTML wrapped in numbers

I created my own “404 Page not found” error page on a TYPO3 website and implemented it via the /typo3conf/LocalConfiguration.php as follows, using the page’s Speaking URL path:
return [
...
'FE' => [
...
'pageNotFound_handling' => '/page-not-found/',
]
]
Now when I call a non-existing page, the error page gets displayed, but there is a 4-digit hexadecimal number (as far as I can tell) BEFORE the HTML source code and a “0” AFTER it. Example (the number at the beginning changes on most reloads):
37b3
<!DOCTYPE html>
...
</html>
0
When calling the error page URL itself the page is returned correctly without those numbers.
Having the RealURL extension activated or deactivated does not make a difference.
Thanks a lot in advance!

I added the full description from the install tool and I guess we might find the solution there.
How TYPO3 should handle requests for non-existing/accessible pages.
empty (default)
The next visible page upwards in the page tree is shown.
'true' or '1'
An error message is shown.
String
Static HTML file to show (reads content and outputs with correct headers), e.g. notfound.html or http://www.example.org/errors/notfound.html.
Prefix "REDIRECT:"
If prefixed with "REDIRECT:" it will redirect to the URL/script after the prefix.
Prefix "READFILE:"
If prefixed with "READFILE" then it will expect the remaining string to be a HTML file which will be read and outputted directly after having the marker "###CURRENT_URL###" substituted with REQUEST_URI and ###REASON### with reason text, for example: READFILE:fileadmin/notfound.html.
Prefix "USER_FUNCTION:"
If prefixed with "USER_FUNCTION:" a user function is called, e.g. USER_FUNCTION:fileadmin/class.user_notfound.php:user_notFound->pageNotFound where the file must contain a class user_notFound with a method pageNotFound() inside with two parameters $param and $ref.
What you configured:
You're passing a string, so TYPO3 expects to find a file, which you don't have, because what you passed is more like a URL.
Given what you are trying to achieve, I'd go with REDIRECT:/page-not-found/.
Thanks for pointing this one out btw, I will remove the string configuration from the core since it does not make sense to have more people trip into this pitfall.

In short: change the following line in the FE section of your LocalConfiguration.php:
'pageNotFound_handling' => '/your404page.html',
to
'pageNotFound_handling' => 'REDIRECT:/your404page.html',

Cause
The actual cause is a combination of chunked Transfer-Encoding and TYPO3 not being able to decode it in some cases. In your case the page-not-found handler eventually uses GeneralUtility::getUrl() to retrieve the error page.
If you have [SYS][curlUse] enabled, it will use cURL to retrieve the page and there is no problem.
If you don't have [SYS][curlUse] enabled, it will open a socket, read the headers and then read the rest of the body. If the web server uses "chunked" Transfer-Encoding, the body consists of blocks of data, and each block starts with a line giving its length in hexadecimal. The content ends with an empty block (whose length line is, of course, "0").
cURL apparently knows how to decode chunked data.
getUrl() itself does not know how to handle chunked data and uses the raw content as is as the page content. That is why you see a hexadecimal number before the HTML and a "0" after it.
In TYPO3 8 LTS the Guzzle library is used to handle HTTP requests. In the Guzzle code I can't find anything about handling chunked data. Guzzle will check whether the cURL PHP extension is present and use that as the preferred transport. In most installations cURL is present, and since it decodes chunked data automatically, no problem is visible. I have to test Guzzle with a PHP build that has cURL disabled to see if the issue is also present in v8/master.
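To make the chunk framing concrete, here is a small Python sketch of what a client has to do to turn a chunked body back into the page content. It illustrates the wire format only and is not TYPO3 or cURL code. The function name decode_chunked is made up, and trailers after the last chunk are ignored.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP body sent with Transfer-Encoding: chunked."""
    decoded = bytearray()
    position = 0
    while True:
        # Each chunk starts with a line holding its size in hexadecimal,
        # optionally followed by ";extension" data that is ignored here.
        line_end = body.index(b"\r\n", position)
        size_text = body[position:line_end].split(b";", 1)[0].strip()
        size = int(size_text.decode("ascii"), 16)
        if size == 0:
            break  # the zero-length chunk marks the end of the content
        start = line_end + 2
        decoded += body[start:start + size]
        position = start + size + 2  # skip the CRLF that follows the chunk data
    return bytes(decoded)


raw = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
print(decode_chunked(raw))
```

A leading line such as 37b3 in the question is simply the size of the first chunk, int("37b3", 16), which is 14259 bytes.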
Workaround/solution
If the PHP extension cURL is enabled in your installation, you can simply enable [SYS][curlUse] in the Install Tool. The numbers around the 404 page content will disappear.

wget --spider doesn't get file size for some links

I want to get the size of some files. Some people recommend wget --spider. However, when I run it on some links, like http://autos.cn.yahoo.com/, it says Length: unspecified [text/html]. Is there a way to solve this, or is there another way to get the file size without actually downloading the file? Thank you!

This happens because the server doesn't send the Content-Length header, or sends it malformed. In that case the size is not known before the body has been downloaded. You can make wget disregard the header by using the --ignore-length option:
$ wget --ignore-length http://autos.cn.yahoo.com/
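As a rough illustration of why the length can come back as "unspecified", the Python sketch below reads the size from the response headers of a HEAD request and returns None when the header is missing or malformed. The function names size_from_headers and remote_size are hypothetical, and remote_size needs network access.

```python
import urllib.request


def size_from_headers(headers):
    """Return Content-Length as an int, or None if it is absent or malformed."""
    value = headers.get("Content-Length")
    if value is None:
        return None
    try:
        size = int(value.strip())
    except ValueError:
        return None
    return size if size >= 0 else None


def remote_size(url: str):
    """Issue a HEAD request and report the advertised size, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return size_from_headers(response.headers)
```

When remote_size returns None, the only reliable way to learn the size is to download the body and count the bytes.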
