Upload a TAR-contained file to a REST PUT API using PHP cURL without extracting

Basically, I am facing the following issue:
I have a TAR container (no compression) of a large size (4GB). This TAR encloses several files:
file 1
file 2
file 3 - the one that I need (also very large: 3 GB)
other files (it doesn't matter how many).
I should mention that I do know where file 3 starts (start index) and how large it is (file length), because the TAR format is relatively easy to parse.
What I need to do is upload file 3 to a REST API using PHP cURL. The API endpoint is HTTP PUT and the headers are correctly set (it works if I upload the entire TAR file).
So, INFILE = TAR Container.
File 3 starts at byte X and has a length of Y bytes; I already know the values of X and Y.
I need cURL to send only those Y bytes, starting at offset X.
What I have done so far is:
$fileHandle = fopen($filePath, "rb"); // $filePath is the path of the TAR archive
fseek($fileHandle, $fileStartIndex, SEEK_CUR);
And the cURL options are:
curl_setopt($curlHandle, CURLOPT_PUT, 1);
curl_setopt($curlHandle, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($curlHandle, CURLOPT_INFILE, $fileHandle);
curl_setopt($curlHandle, CURLOPT_INFILESIZE, $fileSize);
I must mention that extracting file 3 to disk is not an option at this point, since saving disk space is the whole point of the task.
My first idea was to look at CURLOPT_READFUNCTION, but the callback for this option has to return a string (in my case a very large one: 3 GB, which would break PHP's memory limit).
Has anyone succeeded in handling this kind of upload? Any other tips and tricks about CURLOPT_READFUNCTION are also much appreciated.
Thank you!

According to the PHP curl doc:
CURLOPT_READFUNCTION
A callback accepting three parameters. The
first is the cURL resource, the second is a stream resource provided
to cURL through the option CURLOPT_INFILE, and the third is the
maximum amount of data to be read. The callback must return a string
with a length equal or smaller than the amount of data requested,
typically by reading it from the passed stream resource. It should
return an empty string to signal EOF.
So a combination of CURLOPT_INFILE to give curl the file handle, CURLOPT_INFILESIZE to tell curl how big the uploaded file will be, and CURLOPT_READFUNCTION to let curl read from the file looks like it should do what you need.
Although curl will call your CURLOPT_READFUNCTION with a $length parameter, you're free to return what you want, within the rules:
The callback must return a string with a length equal or smaller than
the amount of data requested
so if you return less than $length, curl will keep calling your CURLOPT_READFUNCTION until it returns EOF (an empty string). So you need to keep track of how much of the embedded file you have already sent while reading in CURLOPT_READFUNCTION, and return an empty string once you have delivered all Y bytes.
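A minimal sketch of that combination, assuming $filePath, $fileStartIndex (X) and $fileLength (Y) come from your own TAR header parsing; the endpoint URL is a placeholder:
$fileHandle = fopen($filePath, 'rb');
fseek($fileHandle, $fileStartIndex);                        // jump to the start of file 3
$bytesLeft = $fileLength;                                   // bytes of file 3 still to send

$curlHandle = curl_init('https://api.example.com/upload');  // placeholder endpoint
curl_setopt($curlHandle, CURLOPT_PUT, 1);
curl_setopt($curlHandle, CURLOPT_INFILE, $fileHandle);
curl_setopt($curlHandle, CURLOPT_INFILESIZE, $fileLength);  // size of file 3, not of the whole TAR
curl_setopt($curlHandle, CURLOPT_READFUNCTION, function ($ch, $fh, $length) use (&$bytesLeft) {
    if ($bytesLeft <= 0) {
        return '';                                          // EOF: the whole range has been sent
    }
    $chunk = fread($fh, min($length, $bytesLeft));          // never read past the end of file 3
    if ($chunk === false || $chunk === '') {
        return '';                                          // unexpected EOF or read error
    }
    $bytesLeft -= strlen($chunk);
    return $chunk;
});
curl_exec($curlHandle);
curl_close($curlHandle);
fclose($fileHandle);
Because curl only ever asks for a small chunk at a time ($length is typically on the order of 64 KB), the 3 GB payload never has to fit into PHP's memory.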

Related

How could I get offset of a field in flatbuffers binary file?

I am using a library that requires me to provide the offset of the desired data in the file, so it can use mmap to read the data (I can't edit the library's code; I can only provide the offset).
So I want to use FlatBuffers to serialize my data, because (I think) there is no packing/unpacking step in FlatBuffers, which should make it easy to get the offset of the desired part in the binary file.
But I don't know how to get the offset. I have tried loading the binary file and calculating the offset from the pointer to the desired field: for example, if the address of the root is 1111 and the address of the desired field is 1222, then the offset of the field in the binary file should be 1222 - 1111 = 111 (because there is no unpacking step). But in fact, the offset computed from the pointers is a huge negative number.
Could someone help me with this problem? Thanks in advance!
FlatBuffers is indeed very suitable for mmap. There are no offsets to be computed, since the generated code does that all for you. You should simply mmap the whole FlatBuffers file, and then use the field accessors as normal, starting from auto root = GetRoot<MyRootType>(my_mmapped_buffer). If you want to get a direct pointer to the data in a larger field such as a string or a vector, again simply use the provided API: root->my_string_field()->c_str() for example (which will point to inside your mmapped buffer).

Get offset and length of a subset of a WAT archive from Common Crawl index server

I would like to download a subset of a WAT archive segment from Amazon S3.
Background:
Searching the Common Crawl index at http://index.commoncrawl.org yields results with information about the location of WARC files on AWS S3. For example, searching for url=www.celebuzz.com/2017-01-04/*&output=json yields JSON-formatted results, one of which is
{
  "urlkey": "com,celebuzz)/2017-01-04/watch-james-corden-george-michael-tribute",
  ...
  "filename": "crawl-data/CC-MAIN-2017-34/segments/1502886104631.25/warc/CC-MAIN-20170818082911-20170818102911-00023.warc.gz",
  ...
  "offset": "504411150",
  "length": "14169",
  ...
}
The filename entry indicates which archive segment contains the WARC file for this particular page. This archive file is huge; but fortunately the entry also contains offset and length fields, which can be used to request the range of bytes containing the relevant subset of the archive segment (see, e.g., lines 22-30 in this gist).
My question:
Given the location of a WARC file segment, I know how to construct the name of the corresponding WAT archive segment (see, e.g., this tutorial). I only need a subset of the WAT file, so I would like to request a range of bytes. But how do I find the corresponding offset and length for the WAT archive segment?
I have checked the API documentation for the Common Crawl index server, and it isn't clear to me that this is even possible. But in case it is, I'm posting this question.
The Common Crawl index does not contain offsets into WAT and WET files, so the only way is to search the whole WAT/WET file for the desired record/URL. It may be possible to estimate the offset, though, because the record order in WARC and WAT/WET files is the same.
After much trial and error I managed to fetch a range from a WARC file with Python and boto3 in the following way:
# You have this from the index
offset, length, filename = 2161478, 12350, "crawl-data/[...].warc.gz"
import boto3
from botocore import UNSIGNED
from botocore.client import Config
# Boto3 anonymous login to Common Crawl
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
# Compute the (inclusive) byte range
offset_end = offset + length - 1
byte_range = 'bytes={offset}-{end}'.format(offset=offset, end=offset_end)
gzipped_text = s3.get_object(Bucket='commoncrawl', Key=filename, Range=byte_range)['Body'].read()
# The requested record, still gzip-compressed (write in binary mode)
with open("file.gz", 'wb') as f:
    f.write(gzipped_text)
The rest is optimisation... Hope it helps! :)
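The same ranged read also works over plain HTTP with a Range header. A minimal PHP cURL sketch, assuming the record is reachable under the public https://data.commoncrawl.org/ prefix (host name and output file name are assumptions):
// Offset, length and key taken from the index entry shown above
$offset = 504411150;
$length = 14169;
$key    = 'crawl-data/CC-MAIN-2017-34/segments/1502886104631.25/warc/CC-MAIN-20170818082911-20170818102911-00023.warc.gz';

$ch = curl_init('https://data.commoncrawl.org/' . $key);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_RANGE, sprintf('%d-%d', $offset, $offset + $length - 1));
$gzippedRecord = curl_exec($ch);
curl_close($ch);

file_put_contents('record.warc.gz', $gzippedRecord);
Each record in a Common Crawl WARC/WAT/WET file is its own gzip member, so the slice you get back is a valid .gz file on its own.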

md5 checksum of pdf file

Please have a look at the issue below.
1 - Apply MD5 to a .txt file containing "Hello" (without quotes, length = 5). It gives some hash value (say h1).
2 - Now the file content is changed to "Hello " (without quotes, length = 6). It gives some hash value (say h2).
3 - Now the file is changed back to "Hello" (exactly as in step 1). Now the hash is h1 again, which makes sense.
The problem comes when the same procedure is applied to a .pdf file. Here, rather than changing the file content, I am changing the colour of the text and then reverting back to the original file. This way I am getting three different hash values.
So, is the hash different because of the way the PDF software encodes the text and metadata, or is the analogy itself wrong?
Info: I am using a freeware tool on Windows to calculate the hash.
So, is the hash different because of the way the PDF software encodes the text and metadata, or is the analogy itself wrong?
Correct. If you need to test this on your own data, open any PDF in a text editor (I use Notepad++) and scroll to the bottom (where metadata is stored). You'll see something akin to:
<</Subject (Shipping Documents)
/CreationDate (D:20150630070941-06'00')
/Title (Shipping Documents)
/Author (SomeAuthor)
/Producer (iText by lowagie.com \(r0.99 - paulo118\))
/ModDate (D:20150630070941-06'00')
>>
Obviously, /CreationDate and /ModDate at the very least will keep changing. Even if you regenerate a PDF from the same source, with identical source data, those timestamps alone will change the checksum of the resulting PDF.
Correct. PDFs which look exactly the same can have different checksums because of metadata stored in the file, like /ModDate. I needed to detect PDFs which look the same, so I wrote a kinda-hacky TypeScript function. It isn't guaranteed to work, but at least it detects duplicates some of the time (plain checksums will rarely detect duplicate PDFs).
You can read more about the PDF format here: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf, and see some similar solutions in this related SO question: Why does repeated bursting of a multi-page PDF into individual pages via pdftk change the md5 checksum of those pages?
/**
 * The PDF format is weird, and contains various header information and other metadata.
 * Most (all?) actual pdf contents appear between keywords `stream` and `endstream`.
 * So, to ignore metadata, this function just extracts any contents between "stream" and "endstream".
 * This is not guaranteed to find _all_ contents, but it _should_ ignore all metadata.
 * Useful for generating checksums.
 */
private getRawContent(buffer: Buffer): string {
  const str = buffer.toString();
  // FIXME: If the binary stream itself happens to contain "endstream" or "ModDate", this won't work.
  const streamParts = str.split('endstream').filter(x => !x.includes('ModDate'));
  if (streamParts.length === 0) {
    return str;
  }
  const rawContent: string[] = [];
  for (const streamPart of streamParts) {
    // Ignore everything before the first `stream`
    const streamMatchIndex = streamPart.indexOf('stream');
    if (streamMatchIndex >= 0) {
      const contentStartIndex = streamMatchIndex + 'stream'.length;
      const rawPartContent = streamPart.substring(contentStartIndex);
      rawContent.push(rawPartContent);
    }
  }
  return rawContent.join('\n');
}

bash/curl: two-step web form submission

I'd like to submit two forms on the same page in sequence with curl in bash. http://en.wikipedia.org/w/index.php?title=Special:Export contains two forms: one to populate a list of pages given a Wikipedia category, and another to fetch XML data for that list.
Using curl in bash, I can submit the first form independently, returning an html file with the pages field populated (though I can't use it, as it's local instead of on the wikipedia server):
curl -d "addcat=1&catname=Works_by_Leonardo_da_Vinci&curonly=1&action=submit" http://en.wikipedia.org/w/index.php?title=Special:Export -o "somefile.html"
And I can submit the second form while specifying a page, to get the XML:
curl -d "pages=Mona_Lisa&curonly=1&action=submit" http://en.wikipedia.org/w/index.php?title=Special:Export -o "output.xml"
...but I can't figure out how to combine the two steps, or pipe the one into the other, to return XML for all the pages in a category, like I get when I perform the two steps manually. http://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export seems to suggest that this is possible; any ideas? I don't have to use curl or bash.
Special:Export is not meant for fully automatic retrieval. The API is. For example, to get the current text of all pages in Category:Works by Leonardo da Vinci in XML format, you can use this URL:
http://en.wikipedia.org/w/api.php?format=xml&action=query&generator=categorymembers&gcmtitle=Category:Works_by_Leonardo_da_Vinci&prop=revisions&rvprop=content&gcmlimit=max
This won't return pages in subcategories and is limited to the first 500 pages (although that's not a problem in this case, and there is a way to access the rest).
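If you would rather script the download than use the shell, here is a minimal PHP cURL sketch that fetches that URL and saves the XML (the output file name is a placeholder):
$url = 'http://en.wikipedia.org/w/api.php?format=xml&action=query'
     . '&generator=categorymembers&gcmtitle=Category:Works_by_Leonardo_da_Vinci'
     . '&prop=revisions&rvprop=content&gcmlimit=max';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // Wikipedia may redirect to HTTPS
$xml = curl_exec($ch);
curl_close($ch);

file_put_contents('category_pages.xml', $xml);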
Assuming you can parse the output from the first HTML file and generate a list of pages, e.g.
Mona Lisa
The Last Supper
you can pipe the output to a bash loop using read. As a simple example:
$ seq 1 5 | while read x; do echo "I read $x"; done
I read 1
I read 2
I read 3
I read 4
I read 5

Compare then download

I have a plist in the Documents folder of my app containing one string, an int value. Another plist on my server also contains a string, an int value.
How can I compare the two int values and then do something if one is bigger than the other? Thanks to all.
How about sending along that int value as part of your download URL (in a query parameter), and then only downloading the file if the number is different? Otherwise return an HTTP 304 (content unmodified) response. This is pretty simple to do in PHP at least...
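A minimal server-side sketch of that idea in PHP, assuming the client appends its local int as ?version=<n>; the file name and the plist element path are placeholders:
$serverPlist = __DIR__ . '/data.plist';

// Plists are XML, so pull the integer out with SimpleXML.
// Adjust the element path to match your actual plist structure.
$xml = simplexml_load_file($serverPlist);
$serverValue = (int) $xml->dict->integer;

$clientValue = isset($_GET['version']) ? (int) $_GET['version'] : -1;

if ($clientValue === $serverValue) {
    http_response_code(304);   // content unmodified, nothing to download
    exit;
}

// The values differ: send the plist so the app can replace its local copy.
header('Content-Type: application/xml');
readfile($serverPlist);
On the device you then only need to check the HTTP status code before deciding whether to overwrite the local plist.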
You need to download the contents of the remote plist and compare it to the local one. Then you can 'do something' if it meets the criteria (e.g. if it's bigger).
Use NSURLConnection to download the data.
You can't compare without downloading.