bash/curl: two-step web form submission

I'd like to submit two forms on the same page in sequence with curl in bash. http://en.wikipedia.org/w/index.php?title=Special:Export contains two forms: one to populate a list of pages given a Wikipedia category, and another to fetch XML data for that list.
Using curl in bash, I can submit the first form on its own, which returns an HTML file with the pages field populated (though I can't use that form directly, since the saved copy is local rather than on the Wikipedia server):
curl -d "addcat=1&catname=Works_by_Leonardo_da_Vinci&curonly=1&action=submit" "http://en.wikipedia.org/w/index.php?title=Special:Export" -o "somefile.html"
And I can submit the second form while specifying a page, to get the XML:
curl -d "pages=Mona_Lisa&curonly=1&action=submit" "http://en.wikipedia.org/w/index.php?title=Special:Export" -o "output.xml"
...but I can't figure out how to combine the two steps, or pipe one into the other, to return XML for all the pages in a category, as I get when I perform the two steps manually. http://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export seems to suggest that this is possible; any ideas? I don't have to use curl or bash.

Special:Export is not meant for fully automatic retrieval. The API is. For example, to get the current text of all pages in Category:Works by Leonardo da Vinci in XML format, you can use this URL:
http://en.wikipedia.org/w/api.php?format=xml&action=query&generator=categorymembers&gcmtitle=Category:Works_by_Leonardo_da_Vinci&prop=revisions&rvprop=content&gcmlimit=max
This won't return pages in subcategories and is limited to the first 500 pages (although that's not a problem in this case, and there is a way to access the rest).
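If you want to script that, here is a rough Perl sketch that fetches the URL above and tries to follow the continuation when a category has more than 500 pages. Treat the gcmcontinue handling as an assumption: the exact continuation parameters depend on the MediaWiki API version, so check the query-continue element you actually get back.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape;

# Sketch: fetch the category's page content via the API and follow the
# continuation when the category holds more than 500 pages. The
# gcmcontinue handling is an assumption; check the query-continue
# element your MediaWiki version actually returns.
my $base = 'http://en.wikipedia.org/w/api.php?format=xml&action=query'
         . '&generator=categorymembers&gcmtitle=Category:Works_by_Leonardo_da_Vinci'
         . '&prop=revisions&rvprop=content&gcmlimit=max';

my $ua       = LWP::UserAgent->new(agent => 'category-export-sketch/0.1');
my $continue = '';
my $batch    = 0;

while (1) {
    my $url = $base;
    $url .= '&gcmcontinue=' . uri_escape($continue) if length $continue;

    my $res = $ua->get($url);
    die 'API request failed: ' . $res->status_line unless $res->is_success;

    my $xml = $res->decoded_content;
    open my $out, '>:encoding(UTF-8)', sprintf('batch_%02d.xml', ++$batch) or die $!;
    print {$out} $xml;
    close $out;

    # Crude continuation check; a real script would use an XML parser.
    last unless $xml =~ /gcmcontinue="([^"]+)"/;
    $continue = $1;
}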

Assuming you can parse the output from the first HTML file and generate a list of pages, e.g.
Mona Lisa
The Last Supper
you can pipe that list into a bash loop using read. As a simple example:
$ seq 1 5 | while read x; do echo "I read $x"; done
I read 1
I read 2
I read 3
I read 4
I read 5
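Since the question mentions that curl and bash aren't required, here is a rough Perl equivalent of that loop. It reads one page title per line on STDIN and posts the whole list to Special:Export in a single request; according to the Manual:Parameters_to_Special:Export page linked in the question, the pages field accepts newline-separated titles (untested sketch):
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Sketch: read page titles (one per line) from STDIN and fetch the
# export XML for all of them with a single POST to Special:Export.
chomp(my @titles = <STDIN>);
@titles = grep { length } @titles;
die "no titles given\n" unless @titles;

my $ua  = LWP::UserAgent->new(agent => 'export-sketch/0.1');
my $res = $ua->post(
    'http://en.wikipedia.org/w/index.php?title=Special:Export',
    {
        pages   => join("\n", @titles),   # newline-separated list of titles
        curonly => 1,                     # current revision only
        action  => 'submit',
    },
);
die 'export failed: ' . $res->status_line unless $res->is_success;
print $res->decoded_content;
You would feed it the page list extracted in the first step, e.g. perl export.pl < pagelist.txt > category.xml.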


How do I reduce the number of 301 redirect entries using wildcards and variables in Squarespace?

I recently renamed all of the URLs that make up my blog and have written redirects for almost every page, using wildcards where I can. Keep in mind that the * wildcard is all I know at this time.
Here is an example of what I have...
/season-1/2017/1/1/snl-s01e01-host-george-carlin -> /season-1/snl-s01e01-george-carlin 301
I want to write a catch-all that will redirect all 38 seasons of reviews with one redirect entry, but I can't figure out how to get rid of just the word "host" between s01e01- and -george-carlin. I was thinking it would work something like this:
/season-*/*/*/*/snl-s*e*-host-*-* -> /season-*/snl-s*e*[code to remove the word "host"]-*-* 301
Is that even close to being correct? Do I need that many *s?
Thanks in advance for any help...
Unfortunately, you won't be able to reduce the number of individual redirect entries using the redirect features that Squarespace has to offer, namely the wildcard (*) and a single variable ([name]). Multiple variables would be needed, but only [name] is supported.
The closest you can get is:
/season-1/*/*/*/snl-s01e01-host-[name] -> /season-1/snl-s01e01-[name] 301
But, if I'm understanding things, while the above redirect appears more general, it would still need to be copy/pasted for each post individually. So although it demonstrates the best that could be achieved, it is not a technical improvement.
Therefore, there are only two alternatives:
Create a Google Sheet (or other spreadsheet) where the old URLs are copy/pasted into column one, a formula using arrayformula and regular expressions to parse the old URL and generate the new URL is added in column two, and in column three a formula is written to join the two cells with -> and 301. With that done, you could click, drag and highlight all cells in column three, copy, and paste them into the "URL Shortcuts" text area in Squarespace. It can be quite time consuming to figure out, write and test the correct formula, but it does avoid having to manually type out every redirect. Whether it is less time/effort in total depends on the number of redirects and one's proficiency with writing spreadsheet formulas. It could be that using the redirect code above would simplify the formula that'd need to be written in the spreadsheet, which may save some time. (The same transformation can also be done with a short script; see the sketch after these two alternatives.)
Another alternative would be to remove your redirects and instead handle the redirect via JavaScript added to the 404/Page-not-found page. Because it sounds like you already have all of the redirects in place but are simply trying to reduce the overall number, I wouldn't recommend changing to a JavaScript-based approach. There are other drawbacks to using JavaScript, in any case.
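For what it's worth, the spreadsheet in the first alternative can also be replaced by a short script that generates the "URL Shortcuts" lines. A rough Perl sketch, assuming the URL pattern from the example above (a /year/month/day/ segment followed by snl-sXXeYY-host-title):
#!/usr/bin/perl
use strict;
use warnings;

# Reads one old URL per line on STDIN and prints a ready-to-paste
# Squarespace "URL Shortcuts" line: old-url -> new-url 301
while (my $old = <STDIN>) {
    chomp $old;
    # assumed pattern: /season-N/<year>/<month>/<day>/snl-sXXeYY-host-<title>
    (my $new = $old) =~ s{^(/season-\d+)/\d+/\d+/\d+/(snl-s\d+e\d+)-host-}{$1/$2-};
    print "$old -> $new 301\n";
}
Run against the example above, /season-1/2017/1/1/snl-s01e01-host-george-carlin comes out as /season-1/2017/1/1/snl-s01e01-host-george-carlin -> /season-1/snl-s01e01-george-carlin 301.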

Handle POST data sent as array

I have an html form which sends a hidden field and a radio button with the same name.
This allows people to submit the form without picking from the list (but records a zero answer).
When the user does select a radio button, the form posts BOTH the hidden value and the selected value.
I'd like to write a Perl function to convert the POST data to a hash. The following works for standard text boxes, etc.:
#!/usr/bin/perl
use strict;
use warnings;
use CGI qw(:standard);

sub GetForm {
    my %form;
    foreach my $p (param()) {
        $form{$p} = param($p);
    }
    return %form;
}
However, when faced with two form inputs with the same name, it just returns the first one (i.e. the hidden one).
I can see that the inputs are included in the POST data as an array, but I don't know how to process them.
I'm working with legacy code so I can't change the form unfortunately!
Is there a way to do this?
I have an html form which sends a hidden field and a radio button with
the same name.
This allows people to submit the form without picking from the list
(but records a zero answer).
That's an odd approach. It would be easier to leave the hidden input out and treat the absence of the data as a zero answer.
However, if you want to stick to your approach, read the documentation for the CGI module.
Specifically, the documentation for param:
If the parameter is multivalued (e.g. from multiple selections in a scrolling list), you can ask to receive an array. Otherwise the method will return the first value.
Thus:
$form{$p} = [ param($p) ];
However, you do seem to be reinventing the wheel. There is a built-in method to get a hash of all parameters:
my $form = CGI->new->Vars;
That said, the documentation also says:
CGI.pm is no longer considered good practice for developing web applications, including quick prototyping and small web scripts. There are far better, cleaner, quicker, easier, safer, more scalable, more extensible, more modern alternatives available at this point in time. These will be documented with CGI::Alternatives.
So you should migrate away from this anyway.
Replace
$form{$p} = param($p); # Value of first field named $p
with
$form{$p} = ( multi_param($p) )[-1]; # Value of last field named $p
or
$form{$p} = ( grep length, multi_param($p) )[-1]; # Value of last field named $p
# that has a non-blank value
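Putting those pieces together, a minimal sketch of the whole GetForm sub (assuming a CGI.pm recent enough to export multi_param), which keeps the last non-blank value so the selected radio button wins over the hidden field:
#!/usr/bin/perl
use strict;
use warnings;
use CGI qw(:standard);

sub GetForm {
    my %form;
    foreach my $p (param()) {               # parameter names
        my @values = multi_param($p);       # all values for this name, in order
        # keep the last non-blank value; fall back to '' if every value is blank
        my $picked = ( grep { defined($_) && length($_) } @values )[-1];
        $form{$p} = defined $picked ? $picked : '';
    }
    return %form;
}

my %form = GetForm();
If only the hidden field is submitted, its "zero answer" value is kept; when a radio button is also submitted, its value overrides the hidden one.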

Upload a file contained in a TAR to a REST PUT API using PHP cURL, without extracting

Basically, I am facing the following issue:
I have a TAR container (no compression) of a large size (4GB). This TAR encloses several files:
file 1
file 2
file 3 - the one THAT I NEED (also very large 3 GB)
other files (it doesn't matter how many).
I should mention that I do know where file 3 starts (start index) and how large it is (file length), because the TAR format is relatively easy to parse.
What I need to do is upload file 3 to a REST API by using PHP cURL. The API endpoint is an HTTP PUT and the headers are correctly set (it works if I'm uploading the entire TAR file).
So, INFILE = TAR Container.
File 3 starts at the Xth Byte and has a length of Y bytes. I already know the X and Y value.
I need cURL to start sending data at byte X and send Y bytes in total.
What I did until now was:
$fileHandle = fopen($filePath, "rb"); // $filePath is the path of the TAR archive
fseek($fileHandle, $fileStartIndex, SEEK_CUR); // position the handle at the start of file 3
And the cURL options are:
curl_setopt($curlHandle, CURLOPT_PUT, 1);
curl_setopt($curlHandle, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($curlHandle, CURLOPT_INFILE, $fileHandle);
curl_setopt($curlHandle, CURLOPT_INFILESIZE, $fileSize);
I must mention that extracting file 3 to disk is not an option at this moment, as saving disk space is the main point of the task.
My first idea was to look at CURLOPT_READFUNCTION, but the callback for this option should return a string (in my case a very large one, 3 GB, which breaks PHP's variable size limit).
Has anyone succeeded in handling this kind of upload? Any other tips and tricks about CURLOPT_READFUNCTION are also much appreciated.
Thank you!
According to the PHP curl doc:
CURLOPT_READFUNCTION
A callback accepting three parameters. The
first is the cURL resource, the second is a stream resource provided
to cURL through the option CURLOPT_INFILE, and the third is the
maximum amount of data to be read. The callback must return a string
with a length equal or smaller than the amount of data requested,
typically by reading it from the passed stream resource. It should
return an empty string to signal EOF.
So a combination of CURLOPT_INFILE to give curl the file handle, CURLOPT_INFILESIZE to tell curl how big the final file will be and CURLOPT_READFUNCTION to allow curl to read from the file looks like it should do what you need.
Although curl will call your CURLOPT_READFUNCTION with a $length parameter, you're free to return what you want, within the rules:
The callback must return a string with a length equal or smaller than
the amount of data requested
so if you return less than $length, curl will keep calling your CURLOPT_READFUNCTION until it returns EOF (an empty string). So you need to keep track of where you are in your file when reading in CURLOPT_READFUNCTION and start from the last read position on each call.
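The question is about PHP, but the read-callback pattern may be easier to see as a sketch. Here is the same idea in Perl with LWP (an untested sketch, not the PHP API; $url, $tar_path, $start and $length are placeholders): the callback returns at most one chunk per call, keeps track of its own position, and returns an empty string to signal EOF, exactly as described above.
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request;
use LWP::UserAgent;

# Rough sketch: stream bytes $start .. $start+$length-1 of the TAR
# straight into an HTTP PUT, one chunk per callback invocation.
# $url, $tar_path, $start and $length are placeholders.
my ($url, $tar_path, $start, $length) = @ARGV;

open my $fh, '<:raw', $tar_path or die "open $tar_path: $!";
seek $fh, $start, 0 or die "seek: $!";

my $remaining = $length;
my $req = HTTP::Request->new(PUT => $url);
$req->header('Content-Length' => $length);

# LWP calls this repeatedly; returning '' signals the end of the body.
$req->content(sub {
    return '' if $remaining <= 0;
    my $want = $remaining < 65536 ? $remaining : 65536;
    my $got  = read($fh, my $buf, $want);
    die "read: $!" unless defined $got;
    $remaining -= $got;
    return $got ? $buf : '';
});

my $res = LWP::UserAgent->new->request($req);
print $res->status_line, "\n";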

Treeline.io sanitize inputs

I have just started investigating the treeline.io beta, and I could not find any existing machinepack that would do the job (sanitizing user inputs). I'm wondering if I can do it in any way, ideally within Treeline.
Treeline automatically does type-checking on all incoming request parameters. If you create a route POST /foo with parameter age and give it 123 as an example, it will automatically display an error message if you try to post to /foo with age set to abc, because it's not a number.
As far as more complex validation, you can certainly do it in Treeline--just add more machines to the beginning of your route. The if machine works well for simple tasks; for example, to ensure that age is < 150, you can use if and set the left-hand value to the age parameter, the right-hand value to 150, and the comparison to "<". For more custom validations you can create your own machine using the built-in editor and add pass and fail exits like the if machine has!
The schema-inspector machinepack allows you to sanitize and validate the inputs in Treeline: machinepack-schemainspector
Here is a screenshot of how I'm using it in my Treeline project:
The content of the Sanitize element:
The content of the Validate element (using the Sanitize output):
For the next parts, I'm always using the Sanitize output (email trimmed and in lowercase for this example).

How to process a simple loop in Perl's WWW::Mechanize?

Especially interesting for me as a PHP/Perl beginner is this site in Switzerland:
http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de&webgrab_path=http://esv2000.edi.admin.ch/d/entry.asp?Id=1308
Which has a dataset of 2700 foundations. All the data are free to use with no limitations copyrights on it.
What we have so far: the harvesting task should be no problem if I use WWW::Mechanize, particularly for doing the form-based search and selecting the individual entries. I guess that the algorithm would basically be two nested loops: the outer loop runs the form-based search, the inner loop processes the search results.
The outer loop would use the select() and submit_form() functions on the second search form on the page. Can we use DOM processing here? And how can we get the selection values?
The inner loop through the results would use the follow_link function to get to the actual entries, using the following call:
$mech->follow_link(url_regex => qr/webgrab_path=http:\/\/esv2000.*\?Id=\d+$/, n => $result_nbr);
This would forward our Mechanize browser to the entry page. Basically, the url_regex looks for links that have the webgrab_path-to-Id pattern, which is unique for each database entry. The $result_nbr variable tells Mechanize which one of the results it should follow next.
If we have several result pages, we would also use the same trick to traverse through them. For the semantic extraction of the entry information, we could parse the content of the actual entries with XML::LibXML's HTML parser (which works fine on this page), because it gives you some powerful DOM selection (XPath) methods.
Well, the actual looping through the pages should be doable in a few lines of Perl, 20 at most, likely fewer.
But wait: the processing of the entry pages will then be the most complex part of the script.
Approaches: in principle we could do the same algorithm with a single while loop if we use the back() function smartly.
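Here is a rough, untested skeleton of what I have in mind so far; the form number and the select field name ('suchbegriff') are placeholders I still need to check against the page:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $start_url = 'http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de';
my $mech      = WWW::Mechanize->new();

$mech->get($start_url);

# Outer loop: one pass per value of the search form's select field.
# form_number(2) and the field name 'suchbegriff' are placeholders.
my $form   = $mech->form_number(2);
my @values = $form->find_input('suchbegriff')->possible_values;

for my $value (grep { defined($_) && length($_) } @values) {
    $mech->get($start_url);
    $mech->submit_form(
        form_number => 2,
        fields      => { suchbegriff => $value },
    );

    # Inner loop: follow every result link that matches the entry pattern.
    my @links = $mech->find_all_links(
        url_regex => qr/webgrab_path=http:\/\/esv2000.*\?Id=\d+$/,
    );
    for my $link (@links) {
        $mech->get($link->url);
        my $html = $mech->content;
        # ... hand $html to XML::LibXML's HTML parser here ...
        $mech->back();
    }
}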
Can you give me a hint on how to begin with the processing of the entry pages in WWW::Mechanize?
"Which has a dataset of 2700 foundations. All the data are free to use with no limitations copyrights on it."
Not true. See http://perlmonks.org/?node_id=905767
"The data is copyrighted even though it is made available freely: "Downloading or copying of texts, illustrations, photos or any other data does not entail any transfer of rights on the content." (and again, in German, as you've been scraping some other German list to spam before)."