Common crawl - getting WARC file

I would like to retrieve a web page using Common Crawl but am getting lost.
I would like to get the WARC file for www.example.com. I see that this link (http://index.commoncrawl.org/CC-MAIN-2017-34-index?url=https%3A%2F%2Fwww.example.com&output=json) produces the following JSON.
{"urlkey": "com,example)/", "timestamp": "20170820000102", "mime": "text/html", "digest": "B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A", "filename": "crawl-data/CC-MAIN-2017-34/segments/1502886105955.66/robotstxt/CC-MAIN-20170819235943-20170820015943-00613.warc.gz", "mime-detected": "text/html", "status": "200", "offset": "1109728", "length": "1166", "url": "http://www.example.com"}
Can someone please point me in the right direction on how I can use these JSON elements to retrieve the HTML?
Thanks for helping a noob!

Take filename, offset and length from the JSON result and use them in an HTTP range request from $offset to ($offset+$length-1). Prepend https://commoncrawl.s3.amazonaws.com/ to filename and decompress the result with gzip, e.g.
curl -s -r1109728-$((1109728+1166-1)) \
"https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/segments/1502886105955.66/robotstxt/CC-MAIN-20170819235943-20170820015943-00613.warc.gz" \
| gzip -dc
Of course, on AWS this can be done using Boto3 or the AWS-CLI:
aws --no-sign-request s3api get-object \
--bucket commoncrawl \
--key crawl-data/CC-MAIN-2017-34/segments/1502886105955.66/robotstxt/CC-MAIN-20170819235943-20170820015943-00613.warc.gz \
--range bytes=1109728-$((1109728+1166-1)) response.gz
If it's only for a few documents and it doesn't matter that the documents are modified, you could use the index server directly: http://index.commoncrawl.org/CC-MAIN-2017-34/http://www.example.com
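As a rough Python sketch of the range request described in this answer, using only the standard library: the helper names `byte_range` and `fetch_record` are made up for illustration, and the URL prefix is simply the one quoted above.

```python
import gzip
import urllib.request

# URL prefix quoted in the answer above.
PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def byte_range(record):
    """Build the Range header value from an index record (offset/length are strings)."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # HTTP byte ranges are inclusive
    return f"bytes={start}-{end}"


def fetch_record(record):
    """Download one gzip member and return the decompressed WARC record as bytes."""
    request = urllib.request.Request(
        PREFIX + record["filename"],
        headers={"Range": byte_range(record)},
    )
    with urllib.request.urlopen(request) as response:
        return gzip.decompress(response.read())


# Fields copied from the JSON index result in the question.
record = {
    "filename": "crawl-data/CC-MAIN-2017-34/segments/1502886105955.66/robotstxt/CC-MAIN-20170819235943-20170820015943-00613.warc.gz",
    "offset": "1109728",
    "length": "1166",
}
print(byte_range(record))  # bytes=1109728-1110893
```

Calling `fetch_record(record)` performs the actual download; the script above only prints the computed range so it can be checked against the curl command.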

Related

How to change confluent rest proxy body data schema for producing messages

The Confluent REST Proxy documentation suggests that data must be posted like this:
$ curl -X POST -H "Content-Type: application/vnd.kafka.json.v2+json" \
--data '{"records":[{"value":{"name": "testUser"}}]}' \
"http://localhost:8082/topics/jsontest"
That is, all posted data must be wrapped in the following schema:
{"records":[
{"value":{<DATA>}}
]}
I was wondering if it's possible to change this schema. For instance, I might want to rename records to log and pass my data as an array that is the value of log, as follows:
{"log": [<my_data>, <my_data>] }
How can I go about this?

The API is defined in the documentation and states the format that your payload must take.
If you want to batch your records together you need to do so within the defined schema, e.g.
curl -X POST -H "Content-Type: application/vnd.kafka.json.v2+json" \
--data '{ "records": [ { "value": { "name": "testUser1" } }, { "value": { "name": "testUser2" } } ] }' \
"http://localhost:8082/topics/jsontest"
AFAIK the only way to support the schema you're talking about would be to modify the source code yourself.
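If the data already exists in the {"log": [...]} shape from the question, one client-side workaround (not a change to the REST Proxy itself) is to rewrap it into the required envelope before posting. A minimal Python sketch, where `to_records_payload` is a hypothetical helper name:

```python
import json


def to_records_payload(log_payload):
    """Rewrap {"log": [a, b, ...]} as {"records": [{"value": a}, {"value": b}, ...]}."""
    return {"records": [{"value": item} for item in log_payload["log"]]}


# Custom shape from the question, filled with the sample values from the answer.
custom = {"log": [{"name": "testUser1"}, {"name": "testUser2"}]}
body = json.dumps(to_records_payload(custom))
print(body)
```

The resulting string can be sent as the --data argument of the curl command shown above.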

How to get raw content directly from api.github.com (or raw.githubusercontent.com)

First of all, please take note of the new API changes:
https://developer.github.com/changes/2020-02-10-deprecating-auth-through-query-param/
The problem seems to be that I have to exchange a github personal access token for a temporary token, in order to read from raw.githubusercontent.com.
I have this request info:
set -e
export github_personal_access_token=a8f464fdxxxxxxxxxxxxxxxxxxxxxxfb89e6be
export file_url="https://api.github.com/repos/oresoftware/live-mutex/contents/package.json?ref=master"
mkdir tmp && cd tmp
curl -H "Authorization: token $github_personal_access_token" "$file_url" 2> err.log > output.json
the output.json looks like:
{
"name": "package.json",
"path": "package.json",
"sha": "6a2d55983bb641ff217d822d8e60dbb6c8f85ea3",
"size": 1343,
"url": "https://api.github.com/repos/ORESoftware/live-mutex/contents/package.json?ref=master",
"html_url": "https://github.com/ORESoftware/live-mutex/blob/master/package.json",
"git_url": "https://api.github.com/repos/ORESoftware/live-mutex/git/blobs/6a2d55983bb641ff217d822d8e60dbb6c8f85ea3",
"download_url": "https://raw.githubusercontent.com/ORESoftware/live-mutex/master/package.json",
"type": "file",
"content": "ewogICJuYW1lIjogImxpdmUtbXV0ZXgiLAogICJ2ZXJzaW9uIjogIjAuMi4y\nNCIsCiAgImRlc2NyaXB0aW9uIjogIlNpbXBsZSBtdXRleCB0aGF0IHVzZXMg\nYSBUQ1Agc2VydmVyOyB1c2VmdWwgaWYgeW91IGNhbm5vdCBpbnN0YWxsIFJl\nZGlzLCBldGMuIiwKICAibWFpbiI6ICJkaXN0L21haW4uanMiLAogICJ0eXBp\nbmdzIjogImRpc3QvbWFpbi5kLnRzIiwKICAidHlwZXMiOiAiZGlzdC9tYWlu\nLmQudHMiLAogICJiaW4iOiB7CiAgICAibG14X2FjcXVpcmVfbG9jayI6ICJh\nc3NldHMvY2xpL2FjcXVpcmUuanMiLAogICAgImxteF9yZWxlYXNlX2xvY2si\nOiAiYXNzZXRzL2NsaS9yZWxlYXNlLmpzIiwKICAgICJsbXhfaW5zcGVjdF9i\ncm9rZXIiOiAiYXNzZXRzL2NsaS9pbnNwZWN0LmpzIiwKICAgICJsbXhfbGF1\nbmNoX2Jyb2tlciI6ICJhc3NldHMvY2xpL3N0YXJ0LXNlcnZlci5qcyIsCiAg\nICAibG14X3N0YXJ0X3NlcnZlciI6ICJhc3NldHMvY2xpL3N0YXJ0LXNlcnZl\nci5qcyIsCiAgICAibG14X2xzIjogImFzc2V0cy9jbGkvbHMuanMiLAogICAg\nImxteCI6ICJhc3NldHMvbG14LnNoIgogIH0sCiAgInNjcmlwdHMiOiB7CiAg\nICAidGVzdCI6ICIuL3NjcmlwdHMvdGVzdC5zaCIsCiAgICAicG9zdGluc3Rh\nbGwiOiAiLi9hc3NldHMvcG9zdGluc3RhbGwuc2giCiAgfSwKICAicjJnIjog\newogICAgInRlc3QiOiAiLi90ZXN0L3NldHVwLXRlc3Quc2ggJiYgc3VtYW4g\nLS1kZWZhdWx0IgogIH0sCiAgInJlcG9zaXRvcnkiOiB7CiAgICAidHlwZSI6\nICJnaXQiLAogICAgInVybCI6ICJnaXQraHR0cHM6Ly9naXRodWIuY29tL09S\nRVNvZnR3YXJlL2xpdmUtbXV0ZXguZ2l0IgogIH0sCiAgImF1dGhvciI6ICJP\nbGVnemFuZHIgVkQiLAogICJsaWNlbnNlIjogIk1JVCIsCiAgImJ1Z3MiOiB7\nCiAgICAidXJsIjogImh0dHBzOi8vZ2l0aHViLmNvbS9PUkVTb2Z0d2FyZS9s\naXZlLW11dGV4L2lzc3VlcyIKICB9LAogICJob21lcGFnZSI6ICJodHRwczov\nL2dpdGh1Yi5jb20vT1JFU29mdHdhcmUvbGl2ZS1tdXRleCNyZWFkbWUiLAog\nICJkZXBlbmRlbmNpZXMiOiB7CiAgICAiQG9yZXNvZnR3YXJlL2pzb24tc3Ry\nZWFtLXBhcnNlciI6ICIwLjAuMTI0IiwKICAgICJAb3Jlc29mdHdhcmUvbGlu\na2VkLXF1ZXVlIjogIjAuMS4xMDYiLAogICAgImNoYWxrIjogIl4yLjQuMiIs\nCiAgICAidGNwLXBpbmciOiAiXjAuMS4xIiwKICAgICJ1dWlkIjogIl4zLjMu\nMiIKICB9LAogICJkZXZEZXBlbmRlbmNpZXMiOiB7CiAgICAiQHR5cGVzL25v\nZGUiOiAiXjEwLjEuMiIsCiAgICAiQHR5cGVzL3RjcC1waW5nIjogIl4wLjEu\nMCIsCiAgICAiQHR5cGVzL3V1aWQiOiAiXjMuNC4zIgogIH0KfQo=\n",
"encoding": "base64",
"_links": {
"self": "https://api.github.com/repos/ORESoftware/live-mutex/contents/package.json?ref=master",
"git": "https://api.github.com/repos/ORESoftware/live-mutex/git/blobs/6a2d55983bb641ff217d822d8e60dbb6c8f85ea3",
"html": "https://github.com/ORESoftware/live-mutex/blob/master/package.json"
}
}
but I just want the raw file content, not the metadata. The metadata does give me a link to the raw content:
https://raw.githubusercontent.com/ORESoftware/live-mutex/master/package.json
but for private repos, it requires an access token. So is there an easier way to do this than the following?
curl -H "Authorization: token $github_personal_access_token" "$file_url" |
jq -r '.content' | base64 -d > output.json
Like I said, the biggest problem is that I don't have a valid access_token in hand. I can get an access token to download the file from the download_url, but that requires extra scripting steps. I'm looking for a single command; in other words, I don't want to have to install jq in a Docker image if possible.

GitHub supports different media types to indicate what the client wishes to accept. In your case, you can get the raw file like this:
curl -H "Accept: application/vnd.github.v3.raw" \
-H "Authorization: token $github_personal_access_token" \
"$file_url" 2> err.log > output.json
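The same request can be sketched in Python with the standard library. `build_raw_request` and `decode_content` are hypothetical helper names; the second one is only a fallback that decodes the base64 "content" field of the metadata response without needing jq.

```python
import base64
import urllib.request


def build_raw_request(file_url, token):
    """Prepare a request that asks GitHub for the raw file instead of JSON metadata."""
    request = urllib.request.Request(file_url)
    request.add_header("Accept", "application/vnd.github.v3.raw")
    request.add_header("Authorization", f"token {token}")
    return request


def decode_content(metadata):
    """Fallback: decode the base64 'content' field of the JSON metadata response."""
    # b64decode ignores the embedded newlines by default.
    return base64.b64decode(metadata["content"])


file_url = "https://api.github.com/repos/oresoftware/live-mutex/contents/package.json?ref=master"
request = build_raw_request(file_url, "<personal-access-token>")  # placeholder token
print(request.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen(request)` would perform the download; that call is left out here so the sketch runs without network access or a real token.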

Why do i get 400 using AutoML Rest API?

I trained a custom model using Google Cloud AutoML.
Now I am trying to access it, using the script provided by Google.
I tried varying "content" in all kinds of ways.
I also had a look at the information provided here. I did provide the correct path to the key file. I also checked the project ID and model ID.
I do have a service account. Billing is enabled too.
export GOOGLE_APPLICATION_CREDENTIALS={key-file-path}
curl -X POST \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json" \
https://automl.googleapis.com/v1beta1/projects/{my-project-ID}/locations/us-central1/models/{my-model-id}:predict \
-d '{
"payload" : {
"textSnippet": {
"content": "happy",
"mime_type": "text/plain"
},
}
}'
I expect the result to be the prediction.
My result looks like this:
"error": {
"code": 400,
"message": "Invalid JSON payload received. Expected a value.\n“happy”,\n \n^",
"status": "INVALID_ARGUMENT"}

Thanks for your comments, John Hanley and Phillipp Möhler!
Actually, I can't really tell what happened here.
I followed John's advice to use a whole sentence. Doing that enabled me to successfully predict something!
Afterwards I was able to use both single words and sentences.
Deleting the commas seems to have no effect at all.
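The cause was never pinned down in this thread. One way to rule out hand-typed JSON problems (such as a trailing comma, or typographic quotes like the ones echoed in the error message) is to let a JSON library produce the request body. A minimal Python sketch; `build_predict_body` is a hypothetical helper name, and the field names are copied from the request in the question:

```python
import json


def build_predict_body(text):
    """Serialize the predict payload so quoting and commas are always valid JSON."""
    return json.dumps(
        {"payload": {"textSnippet": {"content": text, "mime_type": "text/plain"}}}
    )


body = build_predict_body("happy")
print(body)
```

The printed string can then be passed to curl with -d in place of the inline JSON.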

Trying to find Column value based filtering in HBase REST API

Hi, I am trying to build REST APIs for exposing data that resides in HBase. For simplicity I am using the built-in HBase REST API, following the documentation at https://www.cloudera.com/documentation/enterprise/latest/topics/admin_hbase_rest_api.html . I have created one API for a search utility that uses the row key, but I am stuck because my remaining APIs require search based on column values. The documentation suggests steps, but I am not able to use them, and there are no other sources available. I found that the HBase Java client has filtering options such as SingleColumnValueFilter, with utilities like SubstringComparator. Is there any way I can apply filters like these in the HBase REST API?

The link you pasted shows how to use a scanner:
curl -vi -X PUT \
-H "Accept: text/xml" \
-H "Content-Type:text/xml" \
-d @filter.txt \
"http://example.com:20550/users/scanner/"
@stelcheck collected usage of some filters here. So if you want to use SingleColumnValueFilter with the HBase REST API, your filter.txt will look like:
<Scanner batch="100">
<filter>
{
"type": "SingleColumnValueFilter",
"op": "EQUAL",
"family": "Y2Y=",
"qualifier": "cQ==",
"latestVersion": true,
"comparator": {
"type": "BinaryComparator",
"value": "dmFsdWU5"
}
}
</filter>
</Scanner>
This example finds the cells with value value9 in column cf:q. The family, qualifier and comparator value are base64-encoded.
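The family, qualifier and comparator value in the filter are the base64 encodings of cf, q and value9. As a small Python sketch of how such a scanner body could be generated (`b64` and `single_column_value_scanner` are hypothetical helper names, and the filter fields are copied from the example above):

```python
import base64
import json


def b64(text):
    """Base64-encode a string the way the filter JSON expects."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def single_column_value_scanner(family, qualifier, value, batch=100):
    """Build the <Scanner> body for an EQUAL SingleColumnValueFilter."""
    filter_spec = {
        "type": "SingleColumnValueFilter",
        "op": "EQUAL",
        "family": b64(family),
        "qualifier": b64(qualifier),
        "latestVersion": True,
        "comparator": {"type": "BinaryComparator", "value": b64(value)},
    }
    return f'<Scanner batch="{batch}"><filter>{json.dumps(filter_spec)}</filter></Scanner>'


print(b64("cf"), b64("q"), b64("value9"))  # Y2Y= cQ== dmFsdWU5
print(single_column_value_scanner("cf", "q", "value9"))
```

The second printed line can be saved as filter.txt or passed directly to curl with -d.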
Update:
Here is an example that passes the filter content on the command line.
If you don't want to use a file as the data, pass the content directly on the command line.
For the example above, it will be:
curl -s -i -X PUT -H "Accept: text/xml" -H "Content-Type: text/xml" -d '<Scanner batch="100"><filter>{"type": "SingleColumnValueFilter", "op": "EQUAL", "family": "Y2Y=", "qualifier": "cQ==", "latestVersion": true, "comparator": { "type": "BinaryComparator", "value": "dmFsdWU5" } }</filter></Scanner>' "http://example.com:20550/users/scanner/"

Create GitHub.com Hook

I am attempting to create a hook using the create hook api found on
http://developer.github.com/v3/repos/hooks/#create-a-hook
but I am getting a 301 when I attempt to post, so I am sure I am doing it wrong...
A couple of questions...
1) How does github know that I can create a hook for that repo if it is private? I am sure I need to authenticate with the POST, but how?
2) Is the following curl statement a valid example of how to create a hook?
curl -v -H "Content-Type: application/json" -X POST -d "{ "name": "cia",
"active": true, "events": [ "push" ], "config": {
"url": "http://requestb.in/######", "content_type": "json" } }"
http://github.com/repos/#####/#####/hooks
I have replaced certain elements with ##### for security's sake...
3) If the above is incorrect, may I please have a snippet of a valid example to create a hook for the webhook named "cia"?

curl -usigmavirus24 -v -H "Content-Type: application/json" -X POST -d '{"name": "cia", "active": true, "events": ["push"], "config": {"url": "...", "content_type": "json"}}' https://api.github.com/repos/sigmavirus24/reponame/hooks
The command above is the correct curl command. The URL you're posting to has to be https://api.github.com/:endpoint, where :endpoint in this case is repos/username/reponame/hooks. You also need to put single quotes (') around the JSON body, because otherwise the shell ends the string at each inner double quote and strips those quotes, so curl sends something that is no longer valid JSON.
The -u username option is also necessary: it tells curl that it MUST authenticate, and curl will ask you for the password.
If you don't mind your password being in your bash history (WHICH YOU SHOULD), you can also use -u username:password. Or, even better, you can base64-encode your credentials in the form username:password and send them as a header like so: Authorization: Basic <base64-encoded-credentials>.
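To illustrate the base64 step, here is a minimal Python sketch that builds the Basic credentials header value and the JSON hook body. The helper names `basic_auth_header` and `hook_body` are made up for illustration, and the username, password and hook URL are placeholders:

```python
import base64
import json


def basic_auth_header(username, password):
    """Return the header value for HTTP Basic auth: 'Basic ' + base64('username:password')."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def hook_body(url):
    """Serialize the hook definition from the curl command above as valid JSON."""
    return json.dumps({
        "name": "cia",
        "active": True,
        "events": ["push"],
        "config": {"url": url, "content_type": "json"},
    })


print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz
print(hook_body("http://example.invalid/hook"))  # placeholder hook URL
```

Generating the body with a JSON library also sidesteps the shell quoting problem described above.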
