Parsing data using Perl

I was not able to parse the XML data properly. I need your help.
**Code**
#!usr/bin/perl
use strict;
use warnings;
open(FILEHANDLE, "data.xml")|| die "Can't open";
my @line;
my @affi;
my @lines;
my $ct =1 ;
print "Enter the start position:-";
my $start= <STDIN>;
print "Enter the end position:-";
my $end = <STDIN>;
print "Processing your data...\n";
my $i =0;
my $t =0;
while(<FILEHANDLE>)
{
if($ct>$end)
{
close(FILEHANDLE);
exit;
}
if($ct>=$start)
{
$lines[$t] = $_;
$t++;
}
if($ct == $end)
{
my $i = 0;
my $j = 0;
my @last;
my @first;
my $l = @lines;
my $s = 0;
while($j<$l)
{
if ($lines[$j] =~m/@/)
{
$line[$i] = $lines[$j];
$u = $j-3;
$first[$i]=$lines[$s];
$s--;
$last[$i] = $lines[$u];
#$j = $j+3;
#$last[$i]= $lines[$j];
#$j++;
#$first[$i] = $lines[$j];
$i++;
}
$j++;
}
my $k = 0;
foreach(@line)
{
$line[$k] =~ s/<.*>(.* )(.*@.*)<.*>/$2/;
$affi[$k] = $1;
$line[$k] = $2;
$line[$k] =~ s/\.$//;
$k++;
}
my $u = 0;
foreach(@first)
{
$first[$u] =~s/<.*>(.*)<.*>/$1/;
$first[$u]=$1;
$u++;
}
my $m = 0;
foreach(@last)
{
$last[$m] =~s/<.*>(.*)<.*>/$1/;
$last[$m] = $1;
$m++;
}
my $q=@line;
open(FILE,">Hayathi.txt")|| die "can't open";
my $p;
for($p =0; $p<$q; $p++)
{
print FILE "$line[$p] $last[$p],$first[$p] $affi[$p]\n";
}
close(FILE);
}
$ct++;
}
This code should extract the lastName, firstName, and affiliation from the data and save them to a text file.
I have tried the code above, but the firstName is missing from the output.
Please help me correct the code.
Thank you in advance.

You can take the following code sample as a basis for your code.
As no XML sample data was provided as text, the help is limited to what can be inferred from the image of the data.
Documentation: XML::LibXML
use strict;
use warnings;
use feature 'say';
use XML::LibXML;
my $file = 'europepmc.xml';
my $dom = XML::LibXML->load_xml(location => $file);
foreach my $node ($dom->findnodes('//result')) {
say 'NodeID: ', $node->{id};
say 'FirstName: ', $node->findvalue('./firstName');
say 'LastName: ', $node->findvalue('./lastName');
say '';
}
exit 0;

Related

How to grep string and assign it to a variable in Perl

I have a file with the below details;
file name: allappsclus
cont:i-02dd208bf1d81c254
rs:i-0098ad0b59b7fe7cf
I want to take the i-XXX value associated with the cont name and assign it to another variable.
When I run my code, the output is:
test-1.1.0.0
1insideif-CNS
use strict;
use warnings;
use Data::Dumper;
my $jupyter = 0;
my $controller = 0;
my $rstudio = 0;
my $zeppelin = 0;
my $fh= '/tmp/allappsclus';
open my $fh2, '<', $fh or die "Cannot open file: $!\n";
while ( <$fh2> ) {
if ( $_ =~ /jup/ ) {
$jupyter = 1;
}
elsif ( $_ =~ /con/ ) {
$controller = 1;
}
elsif ( $_ =~ /rs/ ) {
$rstudio = 1;
}
elsif ( $_ =~ /zep/ ) {
$zeppelin = 1;
}
}
print "test-$rstudio.$controller.$jupyter.$zeppelin\n";
if ( $zeppelin eq '0' && $jupyter eq '0' && $controller eq '1' && $rstudio eq '1' ) {
print "insideif-CNS";
}
else {
print "do nothing";
}
close $fh2;
Now I want to print the value i-02dd208bf1d81c254 instead of CNS in the output.
my $myStr = "cont:i-02dd208bf1d81c254 rs:i-0098ad0b59b7fe7cf";
my $controller = 0;
my $cont;
if ($myStr =~ /cont:([^ ]+)/) # match the full pattern and capture the required value
{
$controller=1;
$cont = $1; # assign the captured value to a variable
}
print "$cont\n" if defined $cont; # print the variable
This might work for you.
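To pick the value up straight from the file rather than from a hard-coded string, the same capture can be applied line by line. Here is a minimal sketch, assuming every line has the form name:value as in the sample above. The helper name parse_ids and the in-memory sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read "name:value" lines from a filehandle and
# return a hash reference mapping each name to its value.
sub parse_ids {
    my ($fh) = @_;
    my %id_for;
    while ( my $line = <$fh> ) {
        chomp $line;
        # Capture the text before the first colon and the non-space run after it.
        if ( $line =~ /^\s*([^:\s]+):\s*(\S+)/ ) {
            $id_for{$1} = $2;
        }
    }
    return \%id_for;
}

# Sample data standing in for /tmp/allappsclus (assumed format).
my $sample = "cont:i-02dd208bf1d81c254\nrs:i-0098ad0b59b7fe7cf\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $ids = parse_ids($fh);
close $fh;

# Print the controller's instance ID instead of the fixed "CNS" label.
if ( exists $ids->{cont} && exists $ids->{rs} ) {
    print "insideif-$ids->{cont}\n";
}
```

With a real file you would open '/tmp/allappsclus' instead of the in-memory string. Keeping the values in a hash also replaces the four separate flag variables, since exists tells you whether a given name was present.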

Reading from a textfile and putting information into an array

I'm trying to calculate the GCD of true random numbers using random.org and grabbing those numbers from a text file. Here is a program to do the above with a PRNG that I created earlier.
#!/usr/bin/perl
use strict;
use warnings;
my $range = 100;
my $gcdcount = 0;
sub gcd_iter($$) {
my ($u, $v) = @_;
while ($v) {
($u, $v) = ($v, $u % $v);
}
return abs($u);
}
for (my $count=0; $count<=5000; $count++) {
my $random_numx = int(rand($range));
my $random_numy = int(rand($range));
my @pair = ($random_numx, $random_numy);
if (gcd_iter($random_numx, $random_numy) == 1) {
$gcdcount++;
}
}
print "The GCD Count for PRNG #1 is $gcdcount\n";
I'm pretty much doing the same exact thing, but grabbing the numbers from the textfile. How do I get those number pairs into a format where I can assign them variables in order to put them through the formula after I split the lines? Here is what I have so far:
my $filename = 'xxxxx';
open(my $fh, $filename)
or die "Could not open file '$filename' $!";
sub gcd_iter($$) {
my ($u, $v) = @_;
while ($v) {
($u, $v) = ($v, $u % $v);
}
return abs($u);
}
for (my $count=0; $count<=5000; $count++) {
if (gcd_iter($) == 1) {
$gcdcount++;
}
}
while (my $row = <$fh>) {
chomp $row;
foreach ($row) {
my @pair = split('s+', $_);
}
}
You are on the right track but may benefit by typing
perldoc -f split
Perl has excellent built-in documentation.
I would change your lines:
foreach ($row) {
my @pair = split('s+', $_);
}
to:
my @pair = split(/\s+/, $row);
and set $num1 and $num2 to $pair[0] and $pair[1] ... or simply use:
my ($num1, $num2) = split(/\s+/, $row);
to get the two scalars you are interested in directly.
In other words, you do not need the foreach $row line (you already have $row) and it is \s that is the shorthand pattern match for whitespace.
Good luck and have fun!
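Putting the pieces together, the whole job could look roughly like the sketch below. It assumes the file holds two whitespace-separated integers per line. The helper name count_coprime_pairs and the sample numbers are made up for illustration, and the in-memory filehandle stands in for the real download.

```perl
use strict;
use warnings;

# Iterative Euclidean algorithm, as in the question.
sub gcd_iter {
    my ( $u, $v ) = @_;
    while ($v) {
        ( $u, $v ) = ( $v, $u % $v );
    }
    return abs($u);
}

# Hypothetical helper: count the lines whose two numbers are coprime.
sub count_coprime_pairs {
    my ($fh) = @_;
    my $count = 0;
    while ( my $row = <$fh> ) {
        chomp $row;
        # split ' ' skips leading whitespace and splits on runs of whitespace.
        my ( $num1, $num2 ) = split ' ', $row;
        next unless defined $num2;    # skip blank or incomplete lines
        $count++ if gcd_iter( $num1, $num2 ) == 1;
    }
    return $count;
}

# In-memory sample standing in for the random.org file (assumed layout).
my $sample = "12 35\n8 20\n7 9\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "The GCD count is ", count_coprime_pairs($fh), "\n";
close $fh;
```

Note that the loop over the file replaces the for loop with the counter: the number of pairs is simply the number of usable lines.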

Perl output format

I'm reading a log file and grouping it based on the 'Program' name and in turn its ID.
LOG FILE
------------------------------------------
DEV: COM-1258
Program:Testing
Reviewer:Jackie
Description:New Entries
rev:r145201
------------------------------------------
QA: COM-9696
Program:Testing
Reviewer:Poikla
Description:Some random changes
rev:r112356
------------------------------------------
JIRA: COM-1234
Program:Development
Reviewer:John Wick
Description:Genral fix
rev:r345676
------------------------------------------
JIRA:COM-1234
Program:Development
Reviewer:None
Description:Updating Received
rev:r909276
------------------------------------------
JIRA: COM-6789
Program:Testing
Reviewer:Balise Mat
Description:Audited
rev:r876391
------------------------------------------
JIRA: COM-8585
Program:Testing
Reviewer:Gold frt
Description: yet to be reviewed
rev:r565639
The code I have,
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Terse = 1;
my $file = "log.txt";
open FH, $file or die "Couldn't open file: [$!]\n";
my $data = {};
my $hash = {};
while (<FH>)
{
my $line = $_;
chomp $line;
if ($line =~ m/(-){2,}/)
{
my $program = $hash->{Program} || '';
my $jira = $hash->{JIRA} || $hash->{QA} || $hash->{DEV} || '';
if ($program && $jira)
{
push @{$data->{$program}{$jira}}, $hash;
$hash = {};
}
}
else
{
if ($line =~ m/:/)
{
my ($key, $value) = split /:\s*/, $line;
$hash->{$key} = $value;
}
elsif ($line =~ m#/# && exists $hash->{Files})
{
$hash->{Files} .= "\n$line";
}
}
}
print 'data = ' . Dumper($data);
foreach my $prg (sort keys %{$data})
{
print "============================================================\n";
print " PROGRAM : $prg \n";
print "============================================================\n";
foreach my $jira (sort keys %{$data->{$prg}})
{
print "******************\n";
print "JIRA ID : $jira\n";
print "******************\n";
foreach my $hash (@{$data->{$prg}{$jira}})
{
foreach my $key (keys %{$hash})
{
# print the data except Program and JIRA
next if $key =~ m/(Program|JIRA|DEV|QA)/;
print " $key => $hash->{$key}\n";
}
print "\n";
}
}
}
I need to print the output in the format below, but I am currently unable to do so with my logic. Any ideas would be really helpful.
PROGRAM: Development
Change IDs:
1.JIRA
a.COM-1234
PROGRAM: Testing
Change IDs:
1.JIRA
a.COM-6789
b.COM-8585
2.QA
a.COM-9696
3.DEV
a.COM-1258
I would write this
use strict;
use warnings 'all';
use List::Util 'uniq';
my $file = 'log.txt';
open my $fh, $file or die "Couldn't open file: [$!]\n";
my @data;
{
my %item;
while ( <$fh> ) {
chomp;
if ( eof or /\-{2,}/ ) {
push @data, { %item } if keys %item;
%item = ();
}
else {
my ( $key, $value ) = split /\s*:\s*/;
next unless $value;
$item{$key} = $value;
$item{jira} = $key if grep { $key eq $_ } qw/ JIRA DEV QA /;
}
}
}
my %data;
{
for my $item ( @data ) {
my ($prog, $jira) = @{$item}{qw/ Program jira /};
push @{ $data{$prog}{$jira} }, $item->{$jira};
}
}
for my $prog ( sort keys %data ) {
printf "PROGRAM: %s\n", $prog;
print "Change IDs:\n";
my $n = 1;
for my $jira ( qw/ JIRA QA DEV / ) {
next unless my $codes = $data{$prog}{$jira};
printf "%d.%s\n", $n++, $jira;
my $l = 'a';
printf " %s.%s\n", $l++, $_ for sort(uniq(@$codes));
}
print "\n";
}
output
PROGRAM: Development
Change IDs:
1.JIRA
a.COM-1234
PROGRAM: Testing
Change IDs:
1.JIRA
a.COM-6789
b.COM-8585
2.QA
a.COM-9696
3.DEV
a.COM-1258
#!/usr/bin/perl -w
use strict;
use warnings;
use Data::Dumper;
my $file = 'test';
my $hash;
my $id_hash = ();
my $line_found = 0;
my $line_count = 1;
my $ID;
my $ID_num;
open (my $FH, '<', "$file") or warn $!;
while (my $line = <$FH> ) {
chomp($line);
if ( $line =~ m/------------------------------------------/){
$line_found = 1;
$line_count++;
next;
}
if ( $line_found ) {
$line =~ m/(.*?):(.*)/;
$ID = $1;
$ID_num = $2;
$line_found = 0;
}
if ( $line =~ m/Program:(.*)/ ) {
my $pro = $1;
push @{$hash->{$pro}->{$ID}}, ($ID_num) ;
}
$line_count++;
}
close $FH;
foreach my $pro (keys %$hash){
# print Dumper($pro);
print "PROGRAM:\t$pro\nChange IDs:\n";
foreach my $ids (keys %{ $hash->{$pro} }){
print "\t1. $ids\n";
foreach my $id (@{ $hash->{$pro}->{$ids} }){
print "\t\ta. $id\n";
}
}
}
OUTPUT
PROGRAM: Testing
Change IDs:
1. QA
a. COM-9696
1. DEV
a. COM-1258
1. JIRA
a. COM-6789
a. COM-8585
PROGRAM: Development
Change IDs:
1. JIRA
a. COM-1234
a. COM-1234
Just adjust the output to your needs.

Running a nested while loop inside a foreach loop in Perl

I'm trying to use a foreach loop to loop through an array and then use a nested while loop to loop through each line of a text file to see if the array element matches a line of text; if so then I push data from that line into a new array to perform calculations.
The outer foreach loop appears to be working correctly (based on printed results with each array element) but the inner while loop is not looping (same data pushed into array each time).
Any advice?
The code is below
#! /usr/bin/perl -T
use CGI qw(:cgi-lib :standard);
print "Content-type: text/html\n\n";
my $input = param('sequence');
my $meanexpfile = "final_expression_complete.txt";
open(FILE, $meanexpfile) or print "unable to open file";
my @meanmatches;
@regex = (split /\s/, $input);
foreach $regex (@regex) {
while (my $line = <FILE>) {
if ( $line =~ m/$regex\s(.+\n)/i ) {
push(@meanmatches, $1);
}
}
my $average = average(@meanmatches);
my $std_dev = std_dev($average, @meanmatches);
my $average_round = sprintf("%0.4f", $average);
my $stdev_round = sprintf("%0.4f", $std_dev);
my $coefficient_of_variation = $stdev_round / $average_round;
my $cv_round = sprintf("%0.4f", $coefficient_of_variation);
print font(
{ color => "blue" }, "<br><B>$regex average: $average_round
&nbspStandard deviation: $stdev_round&nbspCoefficient of
variation(Cv): $cv_round</B>"
);
}
sub average {
my (@values) = @_;
my $count = scalar @values;
my $total = 0;
$total += $_ for @values;
return $count ? $total / $count : 0;
}
sub std_dev {
my ($average, @values) = @_;
my $count = scalar @values;
my $std_dev_sum = 0;
$std_dev_sum += ($_ - $average)**2 for @values;
return $count ? sqrt($std_dev_sum / $count) : 0;
}
Yes, my advice would be:
Turn on strict and warnings.
Run perltidy on your code.
Use the three-argument form of open: open ( my $inputfile, "<", 'final_expression.txt' );
die if the file doesn't open; the rest of your program is irrelevant without it.
chomp $line.
You are iterating over your filehandle, but once the first pass finishes you are at end of file, so on every later iteration of the foreach loop the while loop becomes a null operation. Simplistically, reading the file into an array with my @lines = <FILE>; would fix this.
So with that in mind:
#!/usr/bin/perl -T
use strict;
use warnings;
use CGI qw(:cgi-lib :standard);
print "Content-type: text/html\n\n";
my $input = param('sequence');
my $meanexpfile = "final_expression_complete.txt";
open( my $input_file, "<", $meanexpfile ) or die "unable to open file";
my @meanmatches;
my @regex = ( split /\s/, $input );
my @lines = <$input_file>;
chomp (@lines);
close($input_file) or warn $!;
foreach my $regex (@regex) {
foreach my $line (@lines) {
# the lines are chomped, so the pattern must not require a trailing newline
if ( $line =~ m/$regex\s(.+)/i ) {
push( @meanmatches, $1 );
}
}
my $average = average(@meanmatches);
my $std_dev = std_dev( $average, @meanmatches );
my $average_round = sprintf( "%0.4f", $average );
my $stdev_round = sprintf( "%0.4f", $std_dev );
my $coefficient_of_variation = $stdev_round / $average_round;
my $cv_round = sprintf( "%0.4f", $coefficient_of_variation );
print font(
{ color => "blue" }, "<br><B>$regex average: $average_round
&nbspStandard deviation: $stdev_round&nbspCoefficient of
variation(Cv): $cv_round</B>"
);
}
sub average {
my (@values) = @_;
my $count = scalar @values;
my $total = 0;
$total += $_ for @values;
return $count ? $total / $count : 0;
}
sub std_dev {
my ( $average, @values ) = @_;
my $count = scalar @values;
my $std_dev_sum = 0;
$std_dev_sum += ( $_ - $average )**2 for @values;
return $count ? sqrt( $std_dev_sum / $count ) : 0;
}
The problem here is that, starting from the second iteration of foreach, you are trying to read from a filehandle that has already been read to the end. You need to rewind to the beginning to read it again:
foreach $regex (@regex) {
seek FILE, 0, 0;
while ( my $line = <FILE> ) {
However, that is not very efficient. Why read the file several times at all when you can read it once before the foreach starts and then iterate through the list?
my @lines;
while (<FILE>) {
push (@lines, $_);
}
foreach $regex (@regex) {
foreach $line (@lines) {
Having the latter, you might also want to consider using grep instead of the inner loop.
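The grep idea could look something like the sketch below. It is only an illustration: the helper name matches_for and the sample lines are made up, and the search term is quoted with \Q...\E so that it is treated as literal text, which is an assumption (the original code treats the input as a regex).

```perl
use strict;
use warnings;

# Hypothetical helper: return the captured value from every line that
# matches "<name><whitespace><rest of line>", case-insensitively.
sub matches_for {
    my ( $name, @lines ) = @_;
    # A successful match yields $1, a failed one yields the empty list,
    # so map collects values from the matching lines only.
    return map { /\Q$name\E\s(.+)/i ? $1 : () } @lines;
}

# Made-up sample lines standing in for the expression file.
my @lines = ( "geneA 1.5", "geneB 2.0", "geneA 2.5" );

# grep keeps the matching lines themselves.
my @hits = grep { /^geneA\s/i } @lines;
print scalar(@hits), " lines mention geneA\n";

# map with a capture keeps only the part you want to average.
my @values = matches_for( 'geneA', @lines );
print "values: @values\n";
```

Because matches_for returns a fresh list on every call, each search term gets its own set of matches instead of accumulating into one shared array.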

stockholm to fasta format - include accession id in every header

Hello, I have multiple sequences in Stockholm format. At the top of every alignment there is an accession ID, for example '#=GF AC PF00406', and '//' marks the end of the alignment. When converting from Stockholm to FASTA format, I need PF00406 in the header of every sequence of that particular alignment. Sometimes there are multiple Stockholm alignments in one file. I tried to modify the following Perl script, but it gave me bizarre results. Any help will be greatly appreciated.
my $columns = 60;
my $gapped = 0;
my $progname = $0;
$progname =~ s/^.*?([^\/]+)$/$1/;
my $usage = "Usage: $progname [<Stockholm file(s)>]\n";
$usage .= " [-h] print this help message\n";
$usage .= " [-g] write gapped FASTA output\n";
$usage .= " [-s] sort sequences by name\n";
$usage .= " [-c <cols>] number of columns for FASTA output (default is $columns)\n";
# parse cmd-line opts
my @argv;
while (@ARGV) {
my $arg = shift;
if ($arg eq "-h") {
die $usage;
} elsif ($arg eq "-g") {
$gapped = 1;
} elsif ($arg eq "-s"){
$sorted = 1;
} elsif ($arg eq "-c") {
defined ($columns = shift) or die $usage;
} else {
push @argv, $arg;
}
}
@ARGV = @argv;
my %seq;
while (<>) {
next unless /\S/;
next if /^\s*\#/;
if (/^\s*\/\//) { printseq() }
else {
chomp;
my ($name, $seq) = split;
#seq =~ s/[\.\-]//g unless $gapped;
$seq{$name} .= $seq;
}
}
printseq();
sub printseq {
if($sorted){
foreach $key (sort keys %seq){
print ">$key\n";
for (my $i = 0; $i < length $seq{$key}; $i += $columns){
print substr($seq{$key}, $i, $columns), "\n";
}
}
} else{
while (my ($name, $seq) = each %seq) {
print ">$name\n";
for (my $i = 0; $i < length $seq; $i += $columns) {
print substr ($seq, $i, $columns), "\n";
}
}
}
%seq = ();
}
Depending on how much variation there is in the line with the accession ID, you might need to modify the regex, but this works for your example file:
my %seq;
my $aln;
while (<>) {
if ($_ =~ /#=GF AC (\w+)/) {
$aln = $1;
}
elsif ($_ =~ /^\s*\/\/\s*$/){
$aln = '';
}
next unless /\S/;
next if /^\s*\#/;
if (/^\s*\/\//) { printseq() }
else {
chomp;
my ($name, $seq) = split;
$name = $name . ' ' . $aln;
$seq{$name} .= $seq;
}
}
printseq();
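For reference, here is a self-contained sketch of the same idea that does not depend on the rest of the original script. It is not the original converter: the sub name stockholm_to_fasta and the sample alignments are made up for illustration, and sequences are written in the order they first appear rather than in hash order.

```perl
use strict;
use warnings;

# Hypothetical helper: convert Stockholm text from a filehandle to FASTA,
# appending the "#=GF AC" accession of each alignment to its headers.
sub stockholm_to_fasta {
    my ( $fh, $columns ) = @_;
    my $acc = '';
    my $out = '';
    my ( %seq, @order );

    # Emit the sequences collected so far, then reset for the next alignment.
    my $flush = sub {
        for my $name (@order) {
            my $header = length $acc ? "$name $acc" : $name;
            $out .= ">$header\n";
            for ( my $i = 0 ; $i < length $seq{$name} ; $i += $columns ) {
                $out .= substr( $seq{$name}, $i, $columns ) . "\n";
            }
        }
        %seq   = ();
        @order = ();
        $acc   = '';
    };

    while ( my $line = <$fh> ) {
        if ( $line =~ /^#=GF\s+AC\s+(\w+)/ ) { $acc = $1; next }
        if ( $line =~ m{^\s*//} )            { $flush->(); next }
        next if $line !~ /\S/ or $line =~ /^\s*#/;    # blank and markup lines
        my ( $name, $residues ) = split ' ', $line;
        next unless defined $residues;
        push @order, $name unless exists $seq{$name};
        $seq{$name} .= $residues;    # interleaved blocks are concatenated
    }
    $flush->();    # handle a final alignment without a closing "//"
    return $out;
}

# Made-up sample with two alignments in one file.
my $sample = "# STOCKHOLM 1.0\n#=GF AC PF00406\nseq1 ACDE\nseq2 AC-E\n//\n"
           . "# STOCKHOLM 1.0\n#=GF AC PF00002\nseq3 GHIK\n//\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print stockholm_to_fasta( $fh, 60 );
close $fh;
```

Resetting the accession at every '//' means a sequence can never pick up the ID of the previous alignment, which is the usual cause of odd headers when several alignments share one file.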