How to perl convert xml (name with pattern) to json? - perl

The next convert test.xml to json:
perl -MJSON::Any -MXML::Simple -le'print JSON::Any->new()->objToJson(XMLin("/tmp/test.xml "))'
but I need convert any xml (example test-1.xml test-2.xml test-3.xml test-4.xml etc) with pattern name /tmp/test-*.xml, but if I use:
perl -MJSON::Any -MXML::Simple -le'print JSON::Any->new()->objToJson(XMLin("/tmp/test-*.xml "))'
I have the next messages:
File does not exist: /tmp/test-*.xml at -e line 1
How I do it?

There's problems with what you're trying to do:
XML::Simple isn't simple. It's for simple XML. It'll mangle your XML and give inconsistent results. See: Why is XML::Simple "Discouraged"?
XML is fundamentally more complicated than JSON, so there's no linear transformation. You need to figure out what'd you'd do with attributes and duplicate elements for a start.
File does not exist: /tmp/test-*.xml at -e line 1 - means the file doesn't exist. So you're not going to get very far. But XMLin doesn't accept wildcards. You'll have to process one file at a time.
The first two points are solvable, provided you accept that this cannot be a generic solution - to give a moderately general solution, we'll need an example of your source XML. But it won't be a one liner.

You seem to be asking how to find files matching a file glob.
You could use
my #qfns = glob("/tmp/test-*.xml");
If you just want the first matching file, use
my ($qfn) = glob("/tmp/test-*.xml");
Do not use the following since glob acts an iterator in scalar context.
my $qfn = glob("/tmp/test-*.xml"); # XXX

You can try this using glob and map functions.
perl -MJSON::Any -MXML::Simple -le'local $,="\n"; print map { JSON::Any->new()->objToJson(XMLin($_)) } glob "/path/to/my/test*.xml"'

Related

use a variable with whitespace Perl

I am currently working on a project but I have one big problem. I have some picture with a whitespace in the name and I want to do a montage. The problem is that I can't rename my picture and my code is like that :
$pic1 = qq(picture one.png);
$pic2 = qq(picture two.png);
my $cmd = "C:\...\montage.exe $pic1 $pic2 output.png";
system($cmd);
but because of the whitespace montage.exe doesn't work. How can I execute my code without renaming all my pictures?
Thanks a lot for your answer!
You can properly quote the filenames within the string you pass to system, as #Borodin shows in his answer. Something like: system("montage.exe '$pic1' '$pic2'")
However, A more reliable and safer solution is to pass the arguments to montage.exe as extra parameters in the system call:
system('montage.exe', $pic2, $pic2, 'output.png')
Now you don't have to worry about nesting the correct quotes, or worry about files with unexpected characters. Not only is this simpler code, but it avoids malicious injection issues, should those file names ever come from a tainted source. Someone could enter | rm *, but your system call will not remove all your files for you.
Further, in real life, you probably are not going to have a separate scalar variable for each file name. You'll have them in an array. This makes your system call even easier:
system('montage.exe', #filenames, 'output.png')
Not only is that super easy, but it avoids the pitfall of having a command line too long. If your filenames have nice long paths (maybe 50-100 characters), a Windows command line will exceed the max command length after around 100 files. Passing the arguments through system() instead of in one big string avoids that limitation.
Alternatively, you can pass the arguments to montage.exe as a list (instead of concatenating them all into a string):
use strict;
use warnings;
my $pic1 = qq(picture one.png);
my $pic2 = qq(picture two.png);
my #cmd = ("C:\...\montage.exe", $pic1, $pic2, "output.png");
system(#cmd);
You need to put quotes around the file names that have spaces. You also need to escape the backslashes
my $cmd = qq{C:\\...\\montage.exe "$pic1" "$pic2" output.png};
In unix systems, the best approach is the multi-argument form of system because 1) it avoids invoking a shell, and 2) that's the format accepted by the OS call. Neither of those are true in Windows. The OS call to spawn a program expects a command line, and system's attempt to form this command line is sometimes incorrect. The safest approach is to use Win32::ShellQuote.
use Win32::ShellQuote qw( quote_system );
system quote_system("C:\\...\\montage.exe", $pic1, $pic2, "output.png");

perl sequence extraction loop

I have an existing perl one-liner (from the Edwards lab) that works wonderfully to read a text file (named ids.file) that contains one column of IDs and searches a second, specially formatted text file (named fasta.file in this example - in "fasta" format for those who know bioinformatics) and returns sequences that match the ID from the first file. I was hoping to expand this script to do two additional things:
The current perl one-liner only seems to work if the ids.file contains one column of data. I would like it to work on a file that contains two columns (separated by spaces), and act on the second column of data (well, really any column of data, but I assume that it will be obvious enough to adapt it if someone can give an example using a second column)
I would like to append the any results returned from the output of the search to a third column, instead of just to a new file.
If someone is kind enough to offer an example but only has time or inclination to work on one of these, I would prefer that you try to solve #2 - I have come close to solving #1 with a for loop that uses awk to only use the Perl code on the second column - I haven't gotten it yet, but am close, so #2 seems like the harder one to me.
The perl one liner is as follows:
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if #ARGV' ids.file fasta.file
I appreciate any help you can give!
Not quite sure but will this do?
perl -ne 'chomp; s/^>(\S+).*/$c=$i{$1}/e; print if $c;
$i{(/^\S*\s(\S*)$/)[0]}="$_ " if #ARGV'
ids.file fasta.file

perl quoting in ftp->ls with wildcard

contents of remote directory mydir :
blah.myname.1.txt
blah.myname.somethingelse.txt
blah.myname.randomcharacters.txt
blah.notmyname.1.txt
blah.notmyname.2.txt
...
in perl, I want to download all of this stuff with myname
I am failing really hard with the appropriate quoting. please help.
failed code
my #files;
#files = $ftp->ls( '*.myname.*.txt' ); # finds nothing
#files = $ftp->ls( '.*.myname.*.txt' ); # finds nothing
etc..
How do I put the wildcards so that they are interpreted by the ls, but not by perl? What is going wrong here?
I will assume that you are using the Net::FTP package. Then this part of the docs is interesting:
ls ( [ DIR ] )
Get a directory listing of DIR, or the current directory.
In an array context, returns a list of lines returned from the server. In a scalar context, returns a reference to a list.
This means that if you call this method with no arguments, you get a list of all files from the current directory, else from the directory specified.
There is no word about any patterns, which is not suprising: FTP is just a protocol to transfer files, and this module only a wrapper around that protocoll.
You can do the filtering easily with grep:
my #interesting = grep /pattern/, $ftp->ls();
To select all files that contain the character sequence myname, use grep /myname/, LIST.
To select all files that contain the character sequence .myname., use grep /\.myname\./, LIST.
To select all files that end with the character sequence .txt, use grep /\.txt$/, LIST.
The LIST is either the $ftp->ls or another grep, so you can easily chain multiple filtering steps.
Of course, Perl Regexes are more powerful than that, and we could do all the filtering in a single /\.myname\.[^.]+\.txt$/ or something, depending on your exact requirements. If you are desperate for a globbing syntax, there are tools available to convert glob patterns to regex objects, like Text::Glob, or even to do direct glob matching:
use Text::Glob qw(match_glob);
my #interesting = match_glob ".*.myname.*.txt", $ftp->ls;
However, that is inelegant, to say the least, as regexes are far more powerful and absolutely worth learning.

Error when using complex file names for tar -> write in perl

While using tar->write() I am getting errors while using complex file names.
The code is:
my $filename= $archive_type."_".$from_date_time."_".$to_date_time."tar";
$tar->write($filename);
The error i get is:
Could not create filehandle for 'postProcessProbe_2010/6/23/3_2010/6/23/7.tar':
No such file or directory at test.pl line 24
If I change the $filename to a simple string like out.tar everything works.
Well, / is the directory separator on *nix systems (and, internally Windows treats / and \ interchangeably) and I believe tar files, regardless of platform use it internally as the directory separator.
I do not think you can create file names containing / on either *nix or Windows based systems. Even if you could, that would probably create a whole bunch of headaches down the road.
It would be better in my humble opinion to switch to a saner date format such as YYYYMMDD.
Also, you are using string concatenation when sprintf would have been much clearer:
my $filename= sprintf '%s_%s_%s.tar', $archive_type, $from_date_time, , $to_date_time;

How does this Perl one liner to check if a directory is empty work?

I got this strange line of code today, it tells me 'empty' or 'not empty' depending on whether the CWD has any items (other than . and ..) in it.
I want to know how it works because it makes no sense to me.
perl -le 'print+(q=not =)[2==(()=<.* *>)].empty'
The bit I am interested in is <.* *>. I don't understand how it gets the names of all the files in the directory.
It's a golfed one-liner. The -e flag means to execute the rest of the command line as the program. The -l enables automatic line-end processing.
The <.* *> portion is a glob containing two patterns to expand: .* and *.
This portion
(q=not =)
is a list containing a single value -- the string "not". The q=...= is an alternate string delimiter, apparently used because the single-quote is being used to quote the one-liner.
The [...] portion is the subscript into that list. The value of the subscript will be either 0 (the value "not ") or 1 (nothing, which prints as the empty string) depending on the result of this comparison:
2 == (()=<.* *>)
There's a lot happening here. The comparison tests whether or not the glob returned a list of exactly two items (assumed to be . and ..) but how it does that is tricky. The inner parentheses denote an empty list. Assigning to this list puts the glob in list context so that it returns all the files in the directory. (In scalar context it would behave like an iterator and return only one at a time.) The assignment itself is evaluated in scalar context (being on the right hand side of the comparison) and therefore returns the number of elements assigned.
The leading + is to prevent Perl from parsing the list as arguments to print. The trailing .empty concatenates the string "empty" to whatever came out of the list (i.e. either "not " or the empty string).
<.* *>
is a glob consisting of two patterns: .* are all file names that start with . and * corresponds to all files (this is different than the usual DOS/Windows conventions).
(()=<.* *>)
evaluates the glob in list context, returning all the file names that match.
Then, the comparison with 2 puts it into scalar context so 2 is compared to the number of files returned. If that number is 2, then the only directory entries are . and .., period. ;-)
<.* *> means (glob(".*"), glob("*")). glob expands file patterns the same way the shell does.
I find that the B::Deparse module helps quite a bit in deciphering some stuff that throws off most programmers' eyes, such as the q=...= construct:
$ perl -MO=Deparse,-p,-q,-sC 2>/dev/null << EOF
> print+(q=not =)[2==(()=<.* *>)].empty
> EOF
use File::Glob ();
print((('not ')[(2 == (() = glob('.* *')))] . 'empty'));
Of course, this doesn't instantly produce "readable" code, but it surely converts some of the stumbling blocks.
The documentation for that feature is here. (Scroll near the end of the section)