Curl command uploading document fails when run from Perl

I've got a Perl script that uploads documents into Alfresco using curl.
Some of the documents have an ampersand in the file name, and initially this caused curl to fail. I fixed this by placing a caret symbol in front of the ampersand. But now I'm finding that some documents fail to upload when they don't have a space on either side of the ampersand. Other documents with spaces in the file name and an ampersand do load successfully.
The snippet of Perl code that is running is:
# Escape & for curl in file name with a ^
my $downloadFileNameEsc = ${downloadfile};
$downloadFileNameEsc =~ s/&/^&/g;
$command = "curl -u admin:admin -F file=\@${downloadFileNameEsc} -F id=\"${docId}\" -F title=\"${docTitle}\" -F tags=\"$catTagStr\" -F abstract=\"${abstract}\" -F published=\"${publishedDate}\" -F pubId=\"${pubId}\" -F pubName=\"${pubName}\" -F modified=\"${modifiedDate}\" -F archived=\"${archived}\" -F expiry=\"${expiryDate}\" -F groupIds=\"${groupIdStr}\" -F groupNames=\"${groupNameStr}\" ${docLoadUrl}";
logmsg(4, $command);
my @cmdOutput = `$command`;
$exitStatus = $?;
my $upload = 0;
logmsg(4, "Alfresco upload status $exitStatus");
if ($exitStatus != 0) {
You can see that I am using backticks to execute the curl command so that I can read the response. The Perl script is being run under Windows.
What this effectively tries to run is:
curl -u admin:admin -F file=@tmp-download/Multiple%20Trusts%20Gift%20^&%20Loan.pdf -F id="e2ef104d-b4be-4896-8360-7d6f2e7c7b72" ....
This works.
curl -u admin:admin -F file=@tmp-download/Quarterly_Buys^&sells_Q1_2006.doc -F id="78d18634-ee93-4c29-b01d-270aeee3219a" ....
This fails!!
The only difference I can see is that in the one that works, the file name has spaces (%20) somewhere around the ampersand, though not necessarily next to it.
I can't see why one runs successfully and the other doesn't. I think it must be to do with the backticks and the ampersand in the file name. I haven't tried using system because I wanted to capture the response.
Any thoughts? I've exhausted all options.

You should learn to use Perl modules. Perl has some great modules for handling web requests. If you depend upon operating system commands, you end up depending not only on those commands, but also on shell interactions and on whether you need to quote special characters.
Perl modules remove a lot of the issues you can run into. You are no longer dependent upon particular commands or even a particular implementation of those commands. (The curl command can vary from system to system, and may not even be on the system you're on.) Plus, most of these modules handle the piddling details for you (such as URI-escaping strings).
LWP is the standard Perl library for making these requests. Take a look at the LWP Cookbook, which is a tutorial on the whole HTTP process. Basically, you create an agent, which is really just a virtual web browser for you to use. Then you can configure it as needed (for example, setting the browser identification string).
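For instance, a minimal sketch of creating and configuring such an agent (the agent string and timeout shown are just illustrative values):
use LWP::UserAgent;

# The "virtual web browser": configure it as needed.
my $ua = LWP::UserAgent->new(
    agent   => 'AlfrescoUploader/1.0',   # hypothetical browser identification string
    timeout => 30,
);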
What is really nice is HTTP::Request::Common, which provides a simple interface for building HTTP form requests.
my $request = POST $docLoadUrl,
    Content_Type => 'form-data',
    Content      => [
        file      => [ $downloadFileName ],   # attach the file's contents
        id        => $docId,
        title     => $docTitle,
        tags      => $catTagStr,
        abstract  => $abstract,
        published => $publishedDate,
        pubId     => $pubId,
        pubName   => $pubName,
        ...
    ];
This is a lot easier to read and maintain. Plus, it will handle URI encoding for you.
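Putting it together, here is a rough sketch of sending the request built above with the agent and reading the response; this replaces the backticks and exit-status handling (logmsg is assumed to be the question's own logging routine):
# Send the multipart request built above and capture the response.
my $response = $ua->request($request);

if ($response->is_success) {
    logmsg(4, "Alfresco upload response: " . $response->decoded_content);
}
else {
    logmsg(4, "Alfresco upload failed: " . $response->status_line);
}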

Related

Delete multiple pages in AEM 6.3 following a naming pattern

In the AEM content hierarchy, I have a folder containing 4000 pages. Out of these, 3700 pages follow the naming convention xyz-1, xyz-2, xyz-3, up to xyz-3700. I have a requirement to delete the pages starting with xyz but not the other 300 pages, which have different names. I have tried the command below with *, but it doesn't work. Can anybody help me get this resolved?
curl  -F":operation=delete" -F":applyTo=/content/foo/bar/xyz*" http://localhost:4502 -u admin:admin
curl needs the full path to execute the statement. You can pass individual paths as below, but that will not solve your issue since you have a lot of pages.
curl -u admin:admin -F":operation=delete" -F":applyTo=/content/aem-flash/en/xyz-1" -F":applyTo=/content/aem-flash/en/xyz-2" http://localhost:4502
You have to write a script to delete all of them. There are multiple options: you can either write standalone code and deploy it as a bundle, or connect from Eclipse.
If you do not want to deploy a bundle, you can use a Groovy script to execute your code.
The Groovy script below should work for your requirement if all the pages are under the same parent node. If you would like to query the whole site, update the query accordingly.
import javax.jcr.Node

def buildQuery(page) {
    def queryManager = session.workspace.queryManager
    def statement = 'select * from cq:Page where jcr:path like \'' + page.path + '/xyz-%\''
    queryManager.createQuery(statement, 'sql')
}

final def page = getPage('/content/aem-flash/en/')
final def query = buildQuery(page)
final def result = query.execute()

result.nodes.each { node ->
    println 'node path is: ' + node.path
    node.remove()
    session.save()
}
Solution with curl
I found another way to do it with the curl command alone, but it sends multiple requests to the server and might not be ideal.
The command below makes 4000 requests even if a page is not present in the environment; it is effectively a loop.
As this is not supported by default in the Windows command prompt, it will not work there, but it should work fine in a Linux environment. If you want to run it on Windows, you can use the Git Bash console or another such tool. I tried it in Git Bash and it worked fine.
curl -X DELETE "http://localhost:4502/content/aem-flash/en/xyz-[0-4000]" -u admin:admin

Is it possible to list all tags across all behat tests?

I have several hundred behat tests created by many people who used different tags. I want to clean this up, and to start with I want to list out all the tags which have been used so far.
I wanted to answer my own question as it was something I could not find an answer to elsewhere.
I tried initially to use a custom formatter but that did not work.
https://gist.github.com/paulmozo/fb23d8fb436700381a06
Eventually I crafted a Bash command to suit my purposes:
bin/behat --dry-run 2>&1 | tr ' ' '\n' | grep -w @.* | sort -u
This runs the behat command with --dry-run, which does not execute the tests but merely outputs the steps, so I can pipe them to another tool. The 2>&1 redirects standard error to standard output so it is included in the pipe (this is shell dependent). The tr tool breaks every word in the stream onto a separate line. The grep searches for lines starting with the @ symbol. Finally, sort -u sorts the list and returns the unique entries.
This command takes about 15 seconds to run and did the job perfectly for me.

Checking if a file is a text file without using -T?

The title is pretty self-explanatory: are there file-testing functions in Perl, or is there a built-in module that provides file-test operations?
This is a non-issue, as -T, like all of the file test operators, is a Perl builtin.
They are documented here: perldoc -X
-X FILEHANDLE
-X EXPR
-X DIRHANDLE
-X
A file test, where X is one of the letters listed below. This unary operator takes one argument, either a filename, a filehandle, or a dirhandle, and tests the associated file to see if something is true about it. If the argument is omitted, tests $_, except for -t, which tests STDIN. Unless otherwise documented, it returns 1 for true and '' for false, or the undefined value if the file doesn't exist. Despite the funny names, precedence is the same as any other named unary operator.
...
-T File is an ASCII text file (heuristic guess).
-B File is a "binary" file (opposite of -T).
The "file test" functions available in Perl are part of the programming language itself. Based on what you're saying and from the comments on this page, it may be that you have been "asked not to use external commands" because someone thinks that the -T flag is relying on something that belongs to the underlying environment and not the Perl language.
-T is part of the -X file test unary operators which are inherent to Perl:
http://perldoc.perl.org/functions/-X.html
Underlying the -T operator (specifically) is the function pp_fttext, which lives in pp_sys.c. These are part of the underlying code that comprises Perl, and you can verify this by looking in the root directory of the Perl source distribution:
http://www.perl.org/get.html
It may be that the only way to do what you were originally asking (how to do this without -T) is to do what you were asked not to do (use something external to Perl to perform the test).
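For reference, a rough sketch of the builtin file test operators in use (the file name here is just a hypothetical example):
use strict;
use warnings;

my $path = 'example.txt';    # hypothetical file to test

if (!-e $path) {
    print "$path does not exist\n";
}
elsif (-T $path) {
    print "$path looks like a text file\n";
}
elsif (-B $path) {
    print "$path looks like a binary file\n";
}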

wget appends query string to resulting file

I'm trying to retrieve working webpages with wget and this goes well for most sites with the following command:
wget -p -k http://www.example.com
In these cases I will end up with index.html and the needed CSS/JS etc.
However, in certain situations the URL will have a query string, and in those cases I get an index.html with the query string appended.
Example
www.onlinetechvision.com/?p=566
Combined with the above wget command, this results in:
index.html?page=566
I have tried using the --restrict-file-names=windows option, but that only gets me to
index.html#page=566
Can anyone explain why this is needed and how I can end up with a regular index.html file?
UPDATE: I'm sort of on the fence about taking a different approach. I found out I can take the first filename that wget saves by parsing the output, so the name that appears after Saving to: is the one I need.
However, this name is wrapped in a strange character, â. Rather than just removing that character in a hardcoded way, where does it come from?
If you try the --adjust-extension parameter
wget -p -k --adjust-extension www.onlinetechvision.com/?p=566
you come closer. In the www.onlinetechvision.com folder there will be a file with a corrected extension: index.html#p=566.html, or index.html?p=566.html on *nix systems. It is now simple to rename that file to index.html, even with a script.
If you are on a Microsoft OS, make sure you have a recent version of wget - it is also available here: https://eternallybored.org/misc/wget/
To answer your question about why this is needed, remember that the web server is likely to return different results based on the parameters in the query string. If a query for index.html?page=52 returns different results from index.html?page=53, you probably wouldn't want both pages to be saved in the same file.
Each HTTP request that uses a different set of query parameters is quite literally a request for a distinct resource. wget can't predict which of these changes is and isn't going to be significant, so it's doing the conservative thing and preserving the query parameter URLs in the filename of the local document.
My solution is to do the recursive crawling outside wget:
get the directory structure with wget (no files)
loop over each directory to fetch its main entry file (index.html)
This works well with WordPress sites, though it could miss some pages.
#!/bin/bash
#
# get directory structure
#
wget --spider -r --no-parent http://<site>/
#
# loop through each dir
#
find . -mindepth 1 -maxdepth 10 -type d | cut -c 3- > ./dir_list.txt
while read line; do
    wget --wait=5 --tries=20 --page-requisites --html-extension --convert-links --execute=robots=off --domain=<domain> --strict-comments "http://${line}/"
done < ./dir_list.txt
The query string is required because of the website's design: the site uses the same standard index.html for all content and then uses the query string to pull in the content from another page, for example with a script on the server side (it may be client side if you look in the JavaScript).
Have you tried using --no-cookies? It could be storing this information via a cookie and pulling it in when you hit the page. Also, this could be caused by URL rewrite logic, which you will have little control over from the client side.
Use the -O or --output-document option. See http://www.electrictoolbox.com/wget-save-different-filename/

Perl: can not get correct exit code from external program

I've searched everywhere, but I can't seem to find a solution for my issue. Probably it is code related.
I'm trying to catch the exit code from a Novell program called DXCMD, to check whether certain "drivers" are running. This is no problem in bash, but I need to write a more complex Perl script (it is easier to work with arrays, for example).
This is the code:
# Fill the driver array with the results from ldapsearch (in LDAP syntax)
@driverarray = `ldapsearch -x -Z -D "$username" -w "$password" -b "$IDM" -s sub "ObjectClass=DirXML-Driver" dn | grep ^dn:* | sed 's/^....//' | sed 's/cn=//g;s/dc=//g;s/ou=//;s/,/./g'`;
# Iterate through the drivers and get the exit code:
foreach $driverdn (@driverarray)
{
    my $cmd = `/opt/novell/eDirectory/bin/dxcmd -user $username -password $password -getstate "$driverdn"`;
    my $driverstatus = $? >> 8;
}
I've come this far; the rest of the code is written (getting the states).
But $? >> 8 always returns 60. When I copy the command directly into the shell and echo $?, the return code is always 2 (which means the driver is running fine). In bash, the code also works (but without the >>8, obviously).
I've looked into the error code 60, but I cannot find anything, so I think it is due to my code.
How can I rectify this error? Or how can I track the error? Anyone? :)
The wrong value is passed to -getstate: you didn't remove the newline. You're missing
chomp(@driverarray);
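For illustration, a minimal sketch of the loop with the missing chomp in place (variable names follow the question; the print line is only an example):
chomp(@driverarray);    # strip the trailing newlines left by the backticks

foreach my $driverdn (@driverarray) {
    my $output       = `/opt/novell/eDirectory/bin/dxcmd -user $username -password $password -getstate "$driverdn"`;
    my $driverstatus = $? >> 8;    # dxcmd's exit code (2 means the driver is running, per the question)
    print "$driverdn: state $driverstatus\n";
}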