Random element order in XML document using XML::LibXML - perl

I have a Perl script that reads a simple .csv file like below-
"header1","header2","header3","header4"
"12","12-JUL-2012","Active","Processed"
"13","11-JUL-2012","In Process","Pending"
"32","10-JUL-2012","Active","Processed"
"24","08-JUL-2012","Active","Processed"
.....
The aim is to convert this .csv to an .xml file something like below-
<ORDERS>
<LIST_G_ROWS>
<G_ROWS>
<header1>12</header1>
<header2>12-JUL-2012</header2>
<header3>Active</header3>
<header4>Processed</header4>
</G_ROWS>
<G_ROWS>
<header1>13</header1>
<header2>11-JUL-2012</header2>
<header3>In Process</header3>
<header4>Pending</header4>
</G_ROWS>
....
....
</LIST_G_ROWS>
</ORDERS>
I know that XML::CSV is available on CPAN and would make my life easier, but I want to use the already installed XML::LibXML to create the XML instead of installing XML::CSV. I was able to read the CSV and create the XML file as above without any issues, but the elements come out in a random order, something like below. I need the order of the elements (child nodes) to match the .csv file as shown above, but I am not sure how to go about that. I am using a hash, and sort()ing the hash didn't solve the problem either.
<ORDERS>
<LIST_G_ROWS>
<G_ROWS>
<header3>Active</header3>
<header1>12</header1>
<header4>Processed</header4>
<header2>12-JUL-2012</header2>
</G_ROWS>
......
and so on. Below is the snippet from my perl code
use XML::LibXML;
use strict;
my $outcsv="/path/to/data.csv";
my $xmlFile="/path/to/data.xml";
my $headers = 0;
my $doc = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $doc->createElement("ORDERS");
my $list = $doc->createElement("LIST_G_ROWS");
$root->appendChild($list);
open(IN,"$outcsv") || die "can not open $outcsv: $!\n";
while(<IN>){
chomp($_);
if ($headers == 0)
{
$_ =~ s/^\"//g; #remove starting (")
$_ =~ s/\"$//g; #remove trailing (")
@keys = split(/\",\"/,$_); #split per ","
s{^\s+|\s+$}{}g foreach @keys; #remove leading and trailing spaces from each field
$headers = 1;
}
else{
$_ =~ s/^\"//g; #remove starting (")
$_ =~ s/\"$//g; #remove trailing (")
@vals = split(/\",\"/,$_); #split per ","
s{^\s+|\s+$}{}g foreach @vals; #remove leading and trailing spaces from each field
my %tags = map {$keys[$_] => $vals[$_]} (0..@keys-1);
my $row = $doc->createElement("G_ROWS");
$list->appendChild($row);
for my $name (keys %tags) {
my $tag = $doc->createElement($name);
my $value = $tags{$name};
$tag->appendTextNode($value);
$row->appendChild($tag);
}
}
}
close(IN);
$doc->setDocumentElement($root);
open(OUT,">$xmlFile") || die "can not open $xmlFile: $!\n";
print OUT $doc->toString();
close(OUT);

You could forget the %tags hash entirely. Instead, loop over the indices of @keys:
for my $i (0 .. @keys - 1) {
my $key = $keys[$i];
my $value = $vals[$i];
my $tag = $doc->createElement($key);
$tag->appendTextNode($value);
$row->appendChild($tag);
}
That way, the ordering of your keys is preserved. When a hash is used, the ordering is indeterminate.
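The index-based loop can be tried in isolation, without XML::LibXML. The following is only a sketch: the helper name ordered_pairs and the sample row are assumptions made for illustration, and it prints plain strings rather than building DOM nodes.

```perl
use strict;
use warnings;

# Hypothetical helper: pair each header with its value, keeping column order.
sub ordered_pairs {
    my ($headers, $values) = @_;
    return map { [ $headers->[$_], $values->[$_] ] } 0 .. $#{$headers};
}

# Sample row copied from the question's CSV.
my @keys = qw(header1 header2 header3 header4);
my @vals = ('12', '12-JUL-2012', 'Active', 'Processed');

# Prints the four fields in the same order as the CSV columns.
for my $pair (ordered_pairs(\@keys, \@vals)) {
    my ($name, $value) = @$pair;
    print "<$name>$value</$name>\n";
}
```

Because the arrays are walked by index, the output order depends only on the order of the header line, never on hash internals.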

Your program is far more involved than it needs to be. For convenience and reliability you should use Text::CSV to parse your CSV file.
The program below does what you need.
use strict;
use warnings;
use Text::CSV;
use XML::LibXML;
open my $csv_fh, '<', '/path/to/data.csv' or die $!;
my $csv = Text::CSV->new;
my $headers = $csv->getline($csv_fh);
my $doc = XML::LibXML::Document->new('1.0', 'UTF-8');
my $orders = $doc->createElement('ORDERS');
$doc->setDocumentElement($orders);
my $list = $orders->appendChild($doc->createElement('LIST_G_ROWS'));
while ( my $data = $csv->getline($csv_fh) ) {
my $rows = $list->appendChild($doc->createElement('G_ROWS'));
for my $i (0 .. $#$data) {
$rows->appendTextChild($headers->[$i], $data->[$i]);
}
}
$doc->toFile('/path/to/data.xml', 1);
output
<?xml version="1.0" encoding="UTF-8"?>
<ORDERS>
<LIST_G_ROWS>
<G_ROWS>
<header1>12</header1>
<header2>12-JUL-2012</header2>
<header3>Active</header3>
<header4>Processed</header4>
</G_ROWS>
<G_ROWS>
<header1>13</header1>
<header2>11-JUL-2012</header2>
<header3>In Process</header3>
<header4>Pending</header4>
</G_ROWS>
<G_ROWS>
<header1>32</header1>
<header2>10-JUL-2012</header2>
<header3>Active</header3>
<header4>Processed</header4>
</G_ROWS>
<G_ROWS>
<header1>24</header1>
<header2>08-JUL-2012</header2>
<header3>Active</header3>
<header4>Processed</header4>
</G_ROWS>
</LIST_G_ROWS>
</ORDERS>
Update
Without the exotic options that Text::CSV provides, its functionality is fairly simple if its options are fixed. This alternative provides a subroutine csv_getline to replace the Text::CSV method getline. It works in mostly the same way as the module.
The output of this program is identical to that above.
use strict;
use warnings;
use XML::LibXML;
open my $csv_fh, '<', '/path/to/data.csv' or die $!;
my $doc = XML::LibXML::Document->new('1.0', 'UTF-8');
my $orders = $doc->createElement('ORDERS');
$doc->setDocumentElement($orders);
my $list = $orders->appendChild($doc->createElement('LIST_G_ROWS'));
my $headers = csv_getline($csv_fh);
while ( my $data = csv_getline($csv_fh) ) {
my $rows = $list->appendChild($doc->createElement('G_ROWS'));
for my $i (0 .. $#$data) {
$rows->appendTextChild($headers->[$i], $data->[$i]);
}
}
$doc->toFile('/path/to/data.xml', 1);
sub csv_getline {
my $fh = shift;
defined (my $line = <$fh>) or return;
$line =~ s/\s*\z/,/;
[ map { /"(.*)"/ ? $1 : $_ } $line =~ /( " [^"]* " | [^,]* ) , /gx ];
}

This seems like something that XML::LibXML is overkill for. Just use XML::Simple: build a hash that describes the XML structure, then dump it to an XML file with XMLout.

Related

How to check whether one file's value contains in another text file? (perl script)

I would like to check whether the values in one file also appear in another file. If a value is present, the script should report that there is an existing bin for that entry; if not, it should report that there is no existing bin limit. The problem is that I am not sure how to check all the values at once.
The first text file (DID1) contains:
L84A:D:O:M:
L84C:B:E:D:
The second text file (DID) contains:
L84A:B:E:Q:X:F:i:M:Y:
L84C:B:E:Q:X:F:i:M:Y:
L83A:B:E:Q:X:F:i:M:Y:
If the first four characters match, all the values on that line need to be checked.
For example, L84A appears in both text files and both lines contain M, so it should print that there is an existing M bin.
Below is my code:
use strict;
use warnings;
my $filename = 'DID.txt';
my $filename1 = 'DID1.txt';
my $count = 0;
open( FILE2, "<$filename1" )
or die("Could not open log file. $!\n");
while (<FILE2>) {
my ($number) = $_;
chomp($number);
my @values1 = split( ':', $number );
open( FILE, "<$filename" )
or die("Could not open log file. $!\n");
while (<FILE>) {
my ($line) = $_;
chomp($line);
my @values = split( ':', $line );
foreach my $val (@values) {
if ( $val =~ /$values1[0]/ ) {
$count++;
if ( $values[$count] =~ /$values1[$count]/ ) {
print
"Yes ,There is an existing bin & DID\n @values1\n";
}
else {
print "No, There is an existing bin & DID\n";
}
}
}
}
}
I cannot check all the values. Please give me any advice you can, since this is my first time learning Perl. Thanks a lot :)
Based on my understanding, I wrote this code:
use strict;
use warnings;
#use ReadWrite;
use Array::Utils qw(:all);
use vars qw($my1file $myfile1cnt $my2file $myfile2cnt @output);
$my1file = "did1.txt"; $my2file = "did2.txt";
We read both the first and second files (DID1 and DID2).
readFileinString($my1file, \$myfile1cnt); readFileinString($my2file, \$myfile2cnt);
As the OP requested, the first four characters of each line in the first file are matched against the second file. If they match, the remaining characters of that line in the first file are checked against the matching line in the second.
while($myfile1cnt=~m/^((\w){4})\:([^\n]+)$/mig)
{
print "<LineStart>";
my $lineChk = $1; my $full_Line = $3; #print ": $full_Line\n";
my @First_values = split /\:/, $full_Line; #print join "\n", @First_values;
If the first four characters match, then:
if($myfile2cnt=~m/^$lineChk\:([^\n]+)$/m)
{
Store the rest of that line, split it on colons, and get the characters to be compared with the first file's contents.
my $FullLine = $1; my @second_values = split /:/, $FullLine;
Then check each letter from the first file against the matched line of the second file...
foreach my $sngletter(#First_values)
{
If a letter appears in both files, it is printed.
if( grep {$_ eq "$sngletter"} @second_values)
{
print "Matched: $sngletter\t";
}
}
}
else { print "Not Matched..."; }
This just marks the end of the line.
print "<LineEnd>\n"
}
#------------------>Reading a file
sub readFileinString
#------------------>
{
my $File = shift;
my $string = shift;
use File::Basename;
my $filenames = basename($File);
open(FILE1, "<$File") or die "\nFailed Reading File: [$File]\n\tReason: $!";
read(FILE1, $$string, -s $File, 0);
close(FILE1);
}
Read the search patterns and the data into hashes (the first field is the key), then go through the data and select only the fields that are included in the pattern for that key.
use strict;
use warnings;
use feature 'say';
my $input1 = 'DID1.txt'; # look for key,pattern(array)
my $input2 = 'DID.txt'; # data - key,elements(array)
my $pattern;
my $data;
my %result;
$pattern = file2hash($input1); # read pattern into hash
$data = file2hash($input2); # read data into hash
while( my($k,$v) = each %{$data} ) { # walk through data
next unless defined $pattern->{$k}; # skip those which is not in pattern hash
my $find = join '|', @{ $pattern->{$k} }; # form search pattern for grep
my @found = grep {/$find/} @{ $v }; # extract only those of interest
$result{$k} = \@found; # store in result hash
}
while( my($k,$v) = each %result ) { # walk through result hash
say "$k has " . join ':', @{ $v }; # output final result
}
sub file2hash {
my $filename = shift;
my %hash;
my $fh;
open $fh, '<', $filename
or die "Couldn't open $filename";
while(<$fh>) {
chomp;
next if /^\s*$/; # skip empty lines
my($key,@data) = split ':';
$hash{$key} = \@data;
}
close $fh;
return \%hash;
}
Output
L84C has B:E
L84A has M

Parsing string in multiline data with positive lookbehind

I am trying to parse data like:
header1
-------
var1 0
var2 5
var3 9
var6 1
header2
-------
var1 -3
var3 5
var5 0
Now I want to get e.g. var3 for header2. What's the best way to do this?
So far I was parsing my files line-by-line via
open(FILE,"< $file");
while (my $line = <FILE>){
# do stuff
}
but I guess it's not possible to handle multiline parsing properly.
Now I am thinking to parse the file at once but wasn't successful so far...
my @Input;
open(FILE,"< $file");
while (<FILE>){ @Input = <FILE>; }
if (@Input =~ /header2/){
#...
}
The easier way to handle this is "paragraph mode".
local $/ = "";
while (<>) {
my ($header, $body) = /^([^\n]*)\n-+\n(.*)/s
or die("Bad data");
my @data = map [ split ], split /\n/, $body;
# ... Do something with $header and @data ...
}
The same can be achieved without messing with $/ as follows:
my #buf;
while (1) {
my $line = <>;
$line =~ s/\s+\z// if defined($line);
if (!length($line)) {
if (@buf) {
my $header = shift(@buf);
shift(@buf);
my @data = map [ split ], splice(@buf);
# ... Do something with $header and @data ...
}
last if !defined($line);
next;
}
push @buf, $line;
}
(In fact, the second snippet includes a couple of small improvements over the first.)
Quick comments on your attempt:
The while loop is useless because @Input = <FILE> places the remaining lines of the file in @Input.
@Input =~ /header2/ matches header2 against the stringification of the array, which is the stringification of the number of elements in @Input. If you want to check whether an element of @Input contains header2, you will need to loop over the elements of @Input and check them individually.
while (<FILE>){ @Input = <FILE>; }
This doesn't make much sense. "While you can read a record from FILE, read all of the data on FILE into @Input". I think what you actually want is just:
my @Input = <FILE>;
if (@Input =~ /header2/){
This is quite strange too. The binding operator (=~) expects scalar operands, so it evaluates both operands in scalar context. That means @Input will be evaluated as the number of elements in @Input. That's an integer and will never match "header2".
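To test whether any element of the array contains header2, each element has to be checked on its own, and grep is one way to do that. This is a minimal sketch: the sample lines and the helper name any_line_matches are made up for illustration.

```perl
use strict;
use warnings;

# Stand-in for lines read from a file; the contents are made up for illustration.
my @Input = ("header1\n", "var1 0\n", "header2\n", "var3 5\n");

# Hypothetical helper: returns the number of elements that match the pattern,
# which is true if at least one element matches.
sub any_line_matches {
    my ($pattern, @lines) = @_;
    return scalar grep { /$pattern/ } @lines;
}

print "found header2\n" if any_line_matches(qr/header2/, @Input);
```

In scalar context grep returns the number of matching elements, so the result can be used directly as a condition.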
A couple of approaches. Firstly a regex approach.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $file = 'file';
open my $fh, '<', $file or die $!;
my $data = join '', <$fh>;
if ($data =~ /header2.+var3 (.+?)\n/s) {
say $1;
} else {
say 'Not found';
}
The key to this is the /s on the m// operator. Without it, the two dots in the regex won't match newlines.
The other approach is more of a line by line parser.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $file = 'file';
open my $fh, '<', $file or die $!;
my $section = '';
while (<$fh>) {
chomp;
# if the line is all word characters,
# then we've got a section header.
if ($_ !~ /\W/) {
$section = $_;
next;
}
my ($key, $val) = split;
if ($section eq 'header2' and $key eq 'var3') {
say $val;
last;
}
}
We read the file a line at a time and make a note of the section headers. For data lines, we split on whitespace and check to see if we're in the right section and have the right key.
In both cases, I've switched to using a more standard approach (lexical filehandles, 3-arg open(), or die $!) for opening the file.
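If more than one value is needed, the same line-by-line idea can fill a hash of hashes so that any section/key pair can be looked up afterwards. This is only a sketch under assumptions: the helper name parse_sections is invented here, the input is read from an in-memory filehandle instead of a real file, and a line with a single word is assumed to be a section header.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "header / ------- / key value" blocks into
# a hash of hashes keyed by section name.
sub parse_sections {
    my ($fh) = @_;
    my (%data, $section);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/ or $line =~ /^-+$/;   # skip blanks and underlines
        my ($key, $val) = split ' ', $line;
        if (defined $val) {
            $data{$section}{$key} = $val if defined $section;
        }
        else {
            $section = $key;                           # single word: new section
        }
    }
    return \%data;
}

# Sample input based on the question, read through an in-memory filehandle.
my $text = "header1\n-------\nvar1 0\nvar3 9\n\nheader2\n-------\nvar1 -3\nvar3 5\n";
open my $fh, '<', \$text or die $!;
my $data = parse_sections($fh);
print $data->{header2}{var3}, "\n";   # prints 5
```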

Get image src links from urls using perl and store in DB

I am trying to extract image src links using the following Perl code, but I don't see where I am making a mistake.
1. Open a file and read the URLs in it
My text file looks like this
https://zzzzzz.com/
https://yyyyyyy.com/
https://xxxxxxxxxx.com/
https://stackoverflow.com/
https://www.google.com/
https://www.yahoo.com/
2. For each URL in the text file, extract the img src
3. Print the retrieved data into another file
4. Open that file again using a new file handle and read it into an array
5. While dereferencing the array, it shows the error ARRAY(0x2e14a48) ARRAY(0x3125528) ARRAY(0x312e170).
Perl code is
#!/usr/bin/perl
print "Content-type: text/html\n\n";
use strict;
use warnings;
use DBI;
use LWP::Simple;
use HTML::LinkExtor;
my $filename = "/path/to/file";
open FILE, '<', $filename or print "cant open file: $!";
my @data = <FILE>;
close(FILE);
my $image = "/path/to/file";
open FILES, '>', $image or print "cant write to file: $!";
foreach my $urls (@data) {
my $url = get("$urls");
my $linkextor = HTML::LinkExtor->new( \&links );
$linkextor->parse($url);
my $key;
sub links {
( my $tag, my %links ) = @_;
if ( $tag eq "img" ) {
foreach my $key ( keys %links ) {
if ( $key eq "src" ) {
foreach my $da ( @{$links{$key}} ) {
if ( $da =~ /^[a-zA-Z]/ ) {
print FILES "$da;\n";
} #if
} #foreach
} #if
} #foreach
} #if
} #sub
print FILES "\n";
} #foreach
close(FILES);
Up to this point there is no problem; I got all the src links, like
https://zzzzzz.com/;https://yyyyyyy.com/;https://xxxxxxxxxx.com/;
https://zzzzzz.com/;https://yyyyyyy.com/;https://xxxxxxxxxx.com/;
https://zzzzzz.com/;https://yyyyyyy.com/;https://xxxxxxxxxx.com/;
https://zzzzzz.com/;https://yyyyyyy.com/;https://xxxxxxxxxx.com/;
This is the format of the output in the text file. All I need is to insert these URLs, in order, as $image1, $image2, $image3 into the image column.
my $platform = "mysql";
my $database = "xxx";
my $host = "xxxxx";
my $port = "xxxx";
my $user = "xxxxx";
my $pw = "xxxxxxxxx";
my $dbh = DBI->connect( "DBI:$platform:$database:$host:$port", $user, $pw );
open FILED, '<', $image or die "cannot open file: $!";
my @img = <FILED>;
close(FILED);
foreach my $lin (#img) {
chomp $lin;
my @in = split ';', $lin;
my $image1 = $in[0];
my $image2 = $in[1];
my $image3 = $in[2];
print "$image1 $image2 $image3 \n";
$sth->execute( $li, $val, $parsed, $htmls, $image1, $image2, $image3 );
}
exit;
I think I am making a mistake in the foreach loop. Am I right? Thanks in advance.
Your problem is likely here:
foreach my $da ( $links{$key} ) {
Because it looks like you're assuming that $links{$key} is an array, when it cannot be - it can only be an array reference. And this will have the problem you described if you print it - it'll output something in the ARRAY(0xDEADBEEF) format, because that's how an array ref stringifies.
So you might find that changing it to:
foreach my $da ( @{$links{$key}} ) {
Will do the trick.
But I'd also suggest:
Embedding a sub within a foreach loop is bad style.
Use 3 argument open with lexical file handles - e.g. open my $input, '<', 'file.name' or die $!.
Iterate over that with a while loop, rather than reading it into an array that you don't then reuse.
You declare my $key twice - the first instance isn't used, and is misleading.
You write your output to $image as FILES and then you open the same file and read it back in again. You don't seem to need the intermediate file though, so why not just stash it in the @img array in the first place?
Your problem lies here.
my @in = split ';', $lin;
my $image1 = [0];
my $image2 = [1];
my $image3 = [2];
You are assigning an anonymous array reference to each variable. The lines above should look like this.
my $image1 = $in[0];
my $image2 = $in[1];
my $image3 = $in[2];
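The difference between [0] and $in[0] is easy to see in a few lines. This is just an illustrative sketch: the sample line and the variable names $wrong and $right are made up.

```perl
use strict;
use warnings;

my $lin = 'a.png;b.png;c.png';     # made-up sample line
my @in  = split /;/, $lin;

my $wrong = [0];       # reference to a new one-element array containing 0
my $right = $in[0];    # first field of the split line

print ref($wrong), "\n";   # ARRAY
print "$right\n";          # a.png
```

Printing $wrong directly would show something of the form ARRAY(0x...), which is the symptom described in the question.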

Parsing a GenBank file

I am trying to parse a GenBank file so that I can get the accession number, the definition, the size of the file and the DNA sequence.
Is there a way to modify my code and make it shorter and just declare all the variables at once like they do in the book and parse the file in one or two blocks of code?
If you have access to BioPerl, you might find a solution such as the following useful.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $in = Bio::SeqIO->new( -file => "input.txt",
-format => 'GenBank');
while ( my $seq = $in->next_seq ) {
my $acc = $seq->accession;
my $length = $seq->length;
my $definition = $seq->desc;
my $type = $seq->molecule;
my $organism = $seq->species->binomial;
if ($type eq 'mRNA' &&
$organism =~ /homo sapiens/i &&
$acc =~ /[A-Za-z]{2}_[0-9]{6,}/ )
{
print "$acc | $definition | $length\n";
print $seq->seq, "\n";
print "\n";
}
}
I was able to capture the 5 variables from a sample GenBank file I have (input.txt). It should simplify your code.

Reading Data from a file in Perl

I have a file abc.txt that has data of the form
sHost = "Arun";
sUid ="Abc";
I want to get Arun for sHost and so forth using Perl. My code:
my $filename = "abc.txt";
use strict;
use warnings;
open(my $fh, '<:encoding(UTF-8)', $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>)
{
chomp $row;
if ($row=~m/sHost/)
{
print $row;
}
}
The output I am getting sHost = Arun;
But I want only 'Arun'. What logic should I apply here? I am very new to Perl and Linux.
After the chomp, change the test to this and the variable $host (declared with my before the loop) will contain the value:
if ($row=~m/sHost = "(.*)"/) {
$host=$1;
}
In simple terms, whatever the ( ) group matches is stored in $1 if there is a match. See man perlre for the details.
To generalise this to read any key and any value, do something like this:
my %value;
while (my $row = <$fh>) {
if ($row =~ /^(\w+)\s*=\s*"([^"]+)"/) {
$value{$1} = $2;
}
}
Then $value{'sHost'} will be "Arun" etc
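The same idea can be checked against the two sample lines from the question without opening a file. This is a sketch only: the helper name parse_settings is made up, and the pattern allows optional whitespace around the = sign so that both sHost = "Arun"; and sUid ="Abc"; are accepted.

```perl
use strict;
use warnings;

# Hypothetical helper: collect key = "value"; lines into a hash.
sub parse_settings {
    my (@lines) = @_;
    my %value;
    for my $row (@lines) {
        # \s* around "=" tolerates both 'sHost = "Arun";' and 'sUid ="Abc";'
        $value{$1} = $2 if $row =~ /^\s*(\w+)\s*=\s*"([^"]*)"/;
    }
    return %value;
}

my %value = parse_settings('sHost = "Arun";', 'sUid ="Abc";');
print "$value{sHost}\n";   # Arun
```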
For universal config file parsing you can use the following piece of code:
my %config;
if ($row =~ m/^\s*(["'`])?(\S+)\1?\s*=\s*(["'`])?(\S+?)\3?;?$/) {
my $key = $2;
my $value = $4;
$config{$key} = $value;
}
This regexp lets you process key-value lines in which the key and value are either plain or surrounded by quotes (" ' `; you can add your own symbols if you like), with optional leading and/or trailing whitespace. The semicolon is not obligatory. You can also change (\S+) to match whatever your keys and values may contain (\S matches anything except whitespace).
Use m/.*=\s*([^\s]*)/g instead of m/sHost/
Use print $1 instead of print $row
Replace
if ($row=~m/sHost/)
with
if ($row=~s/sHost//)