Get offset and length of a subset of a WAT archive from Common Crawl index server - common-crawl

I would like to download a subset of a WAT archive segment from Amazon S3.
Background:
Searching the Common Crawl index at http://index.commoncrawl.org yields results with information about the location of WARC files on AWS S3. For example, searching for url=www.celebuzz.com/2017-01-04/*&output=json yields JSON-formatted results, one of which is
{
"urlkey":"com,celebuzz)/2017-01-04/watch-james-corden-george-michael-tribute",
...
"filename":"crawl-data/CC-MAIN-2017-34/segments/1502886104631.25/warc/CC-MAIN-20170818082911-20170818102911-00023.warc.gz",
...
"offset":"504411150",
"length":"14169",
...
}
The filename entry indicates which archive segment contains the WARC file for this particular page. This archive file is huge, but fortunately the entry also contains offset and length fields, which can be used to request just the range of bytes containing the relevant subset of the archive segment (see, e.g., lines 22-30 in this gist).
My question:
Given the location of a WARC file segment, I know how to construct the name of the corresponding WAT archive segment (see, e.g., this tutorial). I only need a subset of the WAT file, so I would like to request a range of bytes. But how do I find the corresponding offset and length for the WAT archive segment?
I have checked the API documentation for the Common Crawl index server, and it isn't clear to me that this is even possible. But in case it is, I'm posting this question.

The Common Crawl index does not contain offsets into WAT and WET files, so the only way is to scan the whole WAT/WET file for the desired record/URL. That said, it may be possible to estimate the offset, because the record order in WARC and WAT/WET files is the same.
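If you only need one record, here is a minimal sketch of that scan, assuming the warcio and requests packages and that the WAT path is the WARC path with /warc/ replaced by /wat/ and .warc.gz by .warc.wat.gz (the download host and the target URI below are my assumptions, not something the index returned):
import requests
from warcio.archiveiterator import ArchiveIterator
# Assumed WAT path, derived from the WARC filename in the index result above
wat_path = ('crawl-data/CC-MAIN-2017-34/segments/1502886104631.25/wat/'
            'CC-MAIN-20170818082911-20170818102911-00023.warc.wat.gz')
# Assumed target URI; adjust to the exact WARC-Target-URI of the record you want
target = 'http://www.celebuzz.com/2017-01-04/watch-james-corden-george-michael-tribute'
with requests.get('https://data.commoncrawl.org/' + wat_path, stream=True) as resp:
    for record in ArchiveIterator(resp.raw):
        if record.rec_headers.get_header('WARC-Target-URI') == target:
            wat_json = record.content_stream().read()  # JSON metadata for this page
            break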

After much trial and error, I managed to fetch a byte range from a WARC file with Python and boto3 the following way:
# You have this from the index
offset, length, filename = 2161478, 12350, "crawl-data/[...].warc.gz"
import boto3
from botocore import UNSIGNED
from botocore.client import Config
# Boto3 anonymous login to Common Crawl
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
# Compute the byte range (HTTP ranges are inclusive)
offset_end = offset + length - 1
byte_range = 'bytes={offset}-{end}'.format(offset=offset, end=offset_end)
gzipped_text = s3.get_object(Bucket='commoncrawl', Key=filename, Range=byte_range)['Body'].read()
# The requested record, still gzip-compressed; write it as binary
with open("file.gz", 'wb') as f:
    f.write(gzipped_text)
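If you would rather keep the record in memory, each record in the Common Crawl archives is, as far as I know, a self-contained gzip member, so the fetched range can be decompressed directly (a minimal sketch reusing gzipped_text from above):
import gzip
# Decompress the fetched range in memory instead of writing file.gz to disk
warc_record = gzip.decompress(gzipped_text).decode('utf-8', errors='replace')
print(warc_record[:500])  # WARC headers followed by the captured HTTP response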
The rest is optimisation... Hope it helps! :)

Related

What are missing attributes as defined in the hdf5 specification and metadata in group h5md?

I have an HDF5-format file (Data File) containing molecular dynamics simulation data. For quick inspection, the h5ls tool is handy. For example:
h5ls -d xaa.h5/particles/lipids/positions/time | less
Now my question is based on a comment I received about the data format: what attributes are missing according to the HDF5 specification and the metadata in the group?
Are you trying to get the value of the Time attribute from a dataset? If so, you need to use h5dump, not h5ls. Also, the attributes are attached to each dataset, so you have to include the dataset name in the path. Finally, attribute names are case-sensitive: Time != time. Here is the required command for dataset_0000 (repeat for 0001 through 0074):
h5dump -d /particles/lipids/positions/dataset_0000/Time xaa.h5
You can also get attributes with Python code. Simple example below:
import h5py
with h5py.File('xaa.h5', 'r') as h5f:
    for ds, h5obj in h5f['/particles/lipids/positions'].items():
        print(f'For dataset={ds}; Time={h5obj.attrs["Time"]}')
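If you only need the attribute from one dataset rather than the whole loop, a minimal variant (same file and paths as above):
import h5py
# Read the Time attribute of a single dataset directly
with h5py.File('xaa.h5', 'r') as h5f:
    time_value = h5f['/particles/lipids/positions/dataset_0000'].attrs['Time']
    print(time_value)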

How could I get offset of a field in flatbuffers binary file?

I am using a library, and the library requires me to provide the offset of the desired data in the file, so it can use mmap to read the data (I can't edit the code of this library but only provide the offset).
So I want to use FlatBuffers to serialize my data, because (I think) there is no packing and unpacking step in FlatBuffers, which means it should be easy to get the offset of the desired part of the binary file.
But I don't know how to get the offset. I have tried loading the binary file and calculating the offset of the pointer to the desired field; for example, if the address of the root is 1111 and the address of the desired field is 1222, then the offset of the field in the binary file is 1222 - 1111 = 111 (because there is no unpacking step). But in fact, the offset of the pointer is a huge negative number.
Could someone help me with this problem? Thanks in advance!
FlatBuffers is indeed very suitable for mmap. There are no offsets to be computed, since the generated code does that all for you. You should simply mmap the whole FlatBuffers file, and then use the field accessors as normal, starting from auto root = GetRoot<MyRootType>(my_mmapped_buffer). If you want to get a direct pointer to the data in a larger field such as a string or a vector, again simply use the provided API: root->my_string_field()->c_str() for example (which will point to inside your mmapped buffer).
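For illustration, here is a hedged sketch of that pattern in Python, where MyRootType and MyStringField stand in for whatever flatc generates from your own schema (the accessor names follow flatc's generated Python code, but treat them as assumptions):
import mmap
from my_schema.MyRootType import MyRootType  # hypothetical flatc-generated module
with open('data.bin', 'rb') as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # No offsets needed: the generated accessors navigate the mapped buffer
    root = MyRootType.GetRootAsMyRootType(buf, 0)
    print(root.MyStringField())  # read straight out of the mmapped data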

How can I merge multiple tfrecords file into one file?

My question is this: if I want to create one tfrecords file for my data, it will take approximately 15 days to finish; there are 500,000 pairs of templates, and each template is 32 frames (images). To save time, I have 3 GPUs, so I thought I could create three tfrecords files, one file per GPU, and finish creating the tfrecords in 5 days. But then I searched for a way to merge these three files into one file and couldn't find a proper solution.
So is there any way to merge these three files into one file? Or is there any way to train my network by feeding it batches of examples drawn from the three tfrecords files, given that I am using the Dataset API?
As the question was asked two months ago, I thought you might have already found the solution. If not: the answer is NO, you do not need to create a single HUGE tfrecord file. Just use the tf.data Dataset API:
import os
import tensorflow as tf

dataset = tf.data.TFRecordDataset(
    filenames_to_read,
    compression_type=None,              # or 'GZIP', 'ZLIB' if you compress your data
    buffer_size=10240,                  # any buffer size you want, or 0 for no buffering
    num_parallel_reads=os.cpu_count())  # or 0 to read the files sequentially
# Maybe you want to prefetch some data first.
dataset = dataset.prefetch(buffer_size=batch_size)
# Decode the examples
dataset = dataset.map(single_example_parser, num_parallel_calls=os.cpu_count())
dataset = dataset.shuffle(buffer_size=number_larger_than_batch_size)
dataset = dataset.batch(batch_size).repeat(num_epochs)
...
For details, check the documentation.
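The single_example_parser passed to map() above is your own code; a hedged sketch of what it might look like (the feature names and dtypes are placeholders for whatever you actually wrote into the records; on older TF 1.x use tf.parse_single_example instead of tf.io.parse_single_example):
def single_example_parser(serialized_example):
    # Placeholder feature spec; adjust to match what the records contain
    feature_spec = {
        'frames': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized_example, feature_spec)
    return parsed['frames'], parsed['label']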
Addressing the question title directly for anyone looking to merge multiple .tfrecord files:
The most convenient approach would be to use the tf.data API:
(adapting an example from the docs)
# Create dataset from multiple .tfrecord files
list_of_tfrecord_files = [dir1, dir2, dir3, dir4]
dataset = tf.data.TFRecordDataset(list_of_tfrecord_files)
# Save dataset to .tfrecord file
filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(dataset)
However, as pointed out by holmescn, you'd likely be better off leaving the .tfrecord files as separate files and reading them together as a single tensorflow dataset.
You may also refer to a longer discussion regarding multiple .tfrecord files on Data Science Stackexchange
The answer by MoltenMuffins works for higher versions of tensorflow. However, if you are using lower versions, you have to iterate through the three tfrecords and save them into a new record file as follows. This works for tf versions 1.0 and above.
import tensorflow as tf

def comb_tfrecord(tfrecords_path, save_path, batch_size=128):
    with tf.Graph().as_default(), tf.Session() as sess:
        ds = tf.data.TFRecordDataset(tfrecords_path).batch(batch_size)
        batch = ds.make_one_shot_iterator().get_next()
        writer = tf.python_io.TFRecordWriter(save_path)
        while True:
            try:
                records = sess.run(batch)
                for record in records:
                    writer.write(record)
            except tf.errors.OutOfRangeError:
                break
        writer.close()
Customizing the above script for better tfrecords listing:
import os
import glob
import tensorflow as tf
save_path = 'data/tf_serving_warmup_requests'
tfrecords_path = glob.glob('data/*.tfrecords')
dataset = tf.data.TFRecordDataset(tfrecords_path)
writer = tf.data.experimental.TFRecordWriter(save_path)
writer.write(dataset)

Upload TAR contained file to REST PUT API using PHP CURL without extracting

Basically, I am facing the following issue:
I have a TAR container (no compression) of a large size (4GB). This TAR encloses several files:
file 1
file 2
file 3 - the one THAT I NEED (also very large, 3 GB)
other files (it doesn't matter how many).
I should mention that I do know where file 3 starts (start index) and how large it is (file length), because the TAR format is relatively easy to parse.
What I need to do is upload file 3 to a REST API using PHP cURL. The API endpoint is HTTP PUT and the headers are correctly set (it works if I'm uploading the entire TAR file).
So, INFILE = the TAR container.
File 3 starts at the Xth byte and has a length of Y bytes. I already know the values of X and Y.
I need curl to start sending data at X and stop after Y bytes.
What I have done so far:
$fileHandle = fopen($filePath, "rb"); //File path is the one of the TAR archive
fseek($fileHandle, $fileStartIndex, SEEK_CUR);
And the cURL options are:
curl_setopt($curlHandle, CURLOPT_PUT, 1);
curl_setopt($curlHandle, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($curlHandle, CURLOPT_INFILE, $fileHandle);
curl_setopt($curlHandle, CURLOPT_INFILESIZE, $fileSize);
I must mention that extracting file 3 to disk is not an option at the moment, as saving disk space is the main point of the task.
My first idea was to look at CURLOPT_READFUNCTION, but the callback for this option must return a string (in my case a very large one, 3 GB, which breaks PHP's variable size limit).
Has anyone succeeded in handling this kind of upload? Any other tips and tricks about CURLOPT_READFUNCTION are also much appreciated.
Thank you!
According to the PHP curl doc:
CURLOPT_READFUNCTION
A callback accepting three parameters. The first is the cURL resource, the second is a stream resource provided to cURL through the option CURLOPT_INFILE, and the third is the maximum amount of data to be read. The callback must return a string with a length equal or smaller than the amount of data requested, typically by reading it from the passed stream resource. It should return an empty string to signal EOF.
So a combination of CURLOPT_INFILE to give curl the file handle, CURLOPT_INFILESIZE to tell curl how big the final file will be and CURLOPT_READFUNCTION to allow curl to read from the file looks like it should do what you need.
Although curl will call your CURLOPT_READFUNCTION with a $length parameter, you're free to return what you want, within the rules:
The callback must return a string with a length equal or smaller than the amount of data requested
so if you return less than $length, curl will keep calling your CURLOPT_READFUNCTION until it returns EOF (an empty string). So you need to keep track of where you are in your file when reading in CURLOPT_READFUNCTION and start from the last read position on each call.

md5 checksum of pdf file

Please have a look at the issue below.
1 - Applying MD5 to a .txt file containing "Hello" (without quotes, length = 5) gives some hash value (say h1).
2 - Now the file content is changed to "Hello " (without quotes, length = 6). It gives some hash value (say h2).
3 - Now the file is changed back to "Hello" (exactly as in step 1). Now the hash is h1 again, which makes sense.
The problem comes when the procedure is applied to a .pdf file. Here, rather than changing the file content, I am changing the colour of the text and then reverting back to the original file. This way I get three different hash values.
So, is the hash different because of the way the PDF software encodes the text and metadata, or is the analogy itself wrong?
Info: I am using a freeware tool on Windows to calculate the hash.
So, is the hash different because of the way the PDF software encodes the text and metadata, or is the analogy itself wrong?
Correct. If you need to test this on your own data, open any PDF in a text editor (I use Notepad++) and scroll to the bottom (where metadata is stored). You'll see something akin to:
<</Subject (Shipping Documents)
/CreationDate (D:20150630070941-06'00')
/Title (Shipping Documents)
/Author (SomeAuthor)
/Producer (iText by lowagie.com \(r0.99 - paulo118\))
/ModDate (D:20150630070941-06'00')
>>
Obviously, /CreationDate and /ModDate at the very least will continue to change. Even if you regenerate a PDF from some source with identical source data, those timestamps meaningfully change the checksum of the target PDF.
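You can confirm this by hashing the raw bytes yourself; editing any of those metadata strings changes the digest (a minimal sketch using only the standard library; the filename is a placeholder):
import hashlib
# MD5 of the raw file bytes; any change to /ModDate etc. changes this digest
with open('document.pdf', 'rb') as f:
    print(hashlib.md5(f.read()).hexdigest())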
Correct. PDFs which look exactly the same can have different checksums because of metadata stored in the file, like /ModDate. I needed to detect PDFs which look the same, so I wrote a kinda-hacky Javascript function. This isn't guaranteed to work, but at least it detects duplicates some of the time (normal checksums will rarely detect duplicate PDFs).
You can read more about the PDF format here: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf, and see some similar solutions in this related SO question: Why does repeated bursting of a multi-page PDF into individual pages via pdftk change the md5 checksum of those pages?
/**
 * The PDF format is weird, and contains various header information and other metadata.
 * Most (all?) actual pdf contents appear between keywords `stream` and `endstream`.
 * So, to ignore metadata, this function just extracts any contents between "stream" and "endstream".
 * This is not guaranteed to find _all_ contents, but it _should_ ignore all metadata.
 * Useful for generating checksums.
 */
private getRawContent(buffer: Buffer): string {
  const str = buffer.toString();
  // FIXME: If the binary stream itself happens to contain "endstream" or "ModDate", this won't work.
  const streamParts = str.split('endstream').filter(x => !x.includes('ModDate'));
  if (streamParts.length === 0) {
    return str;
  }
  const rawContent: string[] = [];
  for (const streamPart of streamParts) {
    // Ignore everything before the first `stream`
    const streamMatchIndex = streamPart.indexOf('stream');
    if (streamMatchIndex >= 0) {
      const contentStartIndex = streamMatchIndex + 'stream'.length;
      const rawPartContent = streamPart.substring(contentStartIndex);
      rawContent.push(rawPartContent);
    }
  }
  return rawContent.join('\n');
}
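The same idea translates to Python if that is more convenient; a heuristic sketch that hashes only the stream bodies and so ignores /ModDate and the rest of the metadata (like the function above, it is not guaranteed to catch every duplicate):
import hashlib
import re

def content_md5(pdf_path):
    # MD5 over the stream...endstream bodies only, skipping headers and metadata
    data = open(pdf_path, 'rb').read()
    digest = hashlib.md5()
    for body in re.findall(rb'stream\r?\n(.*?)endstream', data, flags=re.DOTALL):
        digest.update(body)
    return digest.hexdigest()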