How to read a token from Input file? - perl

I have a input , it looks like this
Input File
ID Score1 Score2
ABC 1 2
DEF 30 50
I want to get the ID and two scores in each lines , and I succeed , But I want to ask does there exist any function can get a word ? that means I call this function three times then I can get the ID and two scores ....
while( eof(Input) !=1)
{
$C = getc(Input);
if($C eq "\n")
{
$Signal = 0;
print Output #Elements;
print Output "\n";
#Elements = ();
}
elsif($C ne " ")
{
if($Signal == 1)
{
push(#Elements,"-");
$Signal = 0;
}
push(#Elements,$C);
}
else
{
$Signal = 1;
}
}
I found that
$Line = readline(*Input);
#Line_elements = split(" ",$Line);
can work .....
thanks

I'm not sure what exactly should be output (better is to provide exact sample input and exact sample output but you can use something like this if you want only lines which starts witch string id and then two numeric scores:
while ( $line = <STDIN> ) {
my ($id, $score1, $score2) = ( $line =~ /^([A-Z]+)\s+([0-9]+)\s+([0-9]+)$/ );
print $id;
}

You can also do:
my ($id, $score1, $score2) = split / +/, $Line;
/ +/ is a regular expression; it takes care about one or more blank spaces

Related

returning matched element from array

I would like to return an element of an array if it matches a pattern represented by a scalar. I would like to avoid looping so I've tried the following method:
use List::Util 'first';
my $match = first { /$pattern/ } #list_of_strings;
$match variable is not returning a value, even though, I know it matches exactly one element in array. What am I doing wrong?
my #amplicon_exon = ();
open(TXT5, "$amplicon_exon");
while (<TXT5>){
my $file_line = $_;
$file_line =~ s/\s+\z//g;
push (#amplicon_exon, $file_line);
}
close(TXT5);
foreach (#amplicon_exon){
chop($_);
}
my #matrix_lines = ();
open(TXT5, "$matrixfile");
while (<TXT5>){
my $matrixlineentry = $_;
$matrixlineentry =~ s/\s+\z//g;
push (#matrix_lines, $matrixlineentry);
}
close(TXT5);
foreach (#matrix_lines){
chop($_);
}
my #failedamplicons = ();
for my $vcf_file ( #vcf_files ) {
$vcf_file =~ m|([^/]+)_annotated.vcf$| or die "Can't extract Sample ID";
my $sample_id = $1;
my #myuniquearray = ();
my #amplicon_array = ();
my $entire_matrix_header = $matrix_lines[0]; print "The matrix header is ".$entire_matrix_header."\n";
$entire_matrix_header =~ s/\s+\z//g;
my #matrix_headers = split (/\t/, $entire_matrix_header);
push #matrix_headers, "endoffile";
my $matrix_column_number = "";
for (0..#matrix_headers){
my $onco_matrix_header = $matrix_headers[$_];
$onco_matrix_header =~ s/\s+\z//g;
if ((length $onco_matrix_header > 0) && (index($sample_id, $onco_matrix_header) != -1)) {
$matrix_column_number = $_;
print "The matrixcolumnnumber is ".$matrix_column_number."\n";
}
}
#print " The matrixcolumnnumber is ".$matrixcolumnnumber."\n";
for (1..#matrix_lines-1){
my #matrix_values = split (/\t/, $matrix_lines[$_]);
if ( ($matrix_values[$matrix_column_number]<201) && ($matrix_values[$matrix_column_number]>-1) ) {
my $f_amp = $matrix_values[1];#if ( grep( /^$value$/, #array ) )
print Dumper($f_amp, \#amplicon_exon);
my $match = first { /$f_amp/ } #amplicon_exon; print "#############the match is $match\n";
my #parts = split /:/, $match;
my $exon_amp = $parts[1]; my $gene_res = $parts[2];
print "less than 200 exists";
my #total_amps = ($run_folder, $sample_id, $gene_res, $exon_amp, $matrix_values[$matrix_column_number], $f_amp);
my $failedamplicon = join "\t", #total_amps;
push (#failedamplicons, $failedamplicon);
}
}
for (0..#failedamplicons-1){
my $failedamplicons = $failedamplicons[$_];
open (MYFILE, ">>$failed_amps_output");##opens files with header and adds the rest of the lines.
print MYFILE $failedamplicons."\n";
close (MYFILE);
}
}
Sample data:
#amplicon_exon lines:
ON_MAP2K1_1:2:MAP2K1
OCP1_MAP2K1_1:3:MAP2K1
OCP1_MAP2K1_2:6:MAP2K1
$f_amp examples:
ON_MAP2K1_1
OCP1_MAP2K1_1
OCP1_MAP2K1_2
matrixfile
Gene Target barcode-1 barcode-2 barcode-3 barcode-4 barcode-5
MOTOR Focus_MTOR_1 22 786 123 456 456
JAK1 OCP1_JAK1_2 345 45 342 23 432
ALT OCP1_ALK_3 43 456 23 3 56
There is something wrong with part of the program that you haven't shown
Here is a short program that uses exactly the #amplicon_exon values that you have given as examples, and dumps the result of the first call for every value of $f_cmp. It works
Perhaps this will shake you from your insistence that you have done nothing wrong?
use strict;
use warnings 'all';
use List::Util 'first';
use Data::Dump;
# Using sample #amplicon_exon lines
my #amplicon_exon = qw/
ON_MAP2K1_1:2:MAP2K1
OCP1_MAP2K1_1:3:MAP2K1
OCP1_MAP2K1_2:6:MAP2K1
/;
# Using $f_amp examples:
for my $f_amp (qw/
ON_MAP2K1_1
OCP1_MAP2K1_1
OCP1_MAP2K1_2 / ) {
my $match = first { /$f_amp/ } #amplicon_exon;
dd $match;
}
output
"ON_MAP2K1_1:2:MAP2K1"
"OCP1_MAP2K1_1:3:MAP2K1"
"OCP1_MAP2K1_2:6:MAP2K1"
I'd risk £5 that your #matrix_values contains newlines

Clearing list in a loop

I have a list of lines (about 2K) that I need to compare to much bigger list on certain criteria and save the results. So what I'm doing is:
foreach( keys %lines1 )
{
($orig1,$orig2,$orig3) = (split( /,/, $lines1{$_}))[1,2,3]
push( #result, grep{ ($data1,$data2,$data3) = (split( /,/, $lines2{$_})[1,2,3];$orig1 == $data1 && $orig2 == $data2 && $orig3 == $data3 } keys %lines2 );
$hash_result{$count} = #result;
}
Problem is #result accumulates data.
So on the first run it's size is 1, on the second the data is pushed and the size is 2: old line matched and new line matched.
I feel I'm missing something obvious, but don't remember what.
Both lists comes from the same file as CSV. They are different by one field.
Input:
data1,data2,data3,data4,data5,data6,data7,data8,0 - $line1
data1,data2,data3,data4,data5,data6,data7,data8,1 - $line2
There is couple of lines of type $line1.
Output:
In the output it probably should be a hash of list of list, The task is: for every $line1 that have matches calculate some statistics.
Could someone please help?
Or maybe I'm doing it completely wrong?
[EDIT]
What I'm looking for here is something like this:
"$hash_result{$count} = [[1,2,3,10,4,6][1,2,3,5,3,11][1,2,3,100,60,20]]"
so that I can calculate some statistics on the $count line.
[/EDIT]
Adding calculation of average
foreach( keys %lines1 ) {
($orig1,$orig2,$orig3) = (split( /,/, $lines1{$_}))[1,2,3]
#result = grep{ ($data1,$data2,$data3) = (split( /,/, $lines2{$_})[1,2,3];$orig1 == $data1 && $orig2 == $data2 && $orig3 == $data3 } keys %lines2 );
push #{$hash_result{$count}}, [ #result ];
}
my $average= 0;
foreach my $list (#{$hash_result{$count}}) {
$average+= $list->[3]; # 10 + 5 + 100
}
$average/= scalar(#{$hash_result{$count}}); # 10+5+100 / 3

How do I compare two lines (if they are equal or not equal) of a file read inside a while loop?

I have a file like this one below, where the line starting with a number is an ID for my sample and the following lines are the data.
10001;02/07/98;;PI;M^12/12/59^F^^SP^09/12/55
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D16S539
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D7S820
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D13S317
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D5S818
10002;02/07/98;;RJ;F^20/04/86^SP^
;;;;;F1|SP1;;;12;10;12;11;;D10S212
;;;;;F1|SP1;;;8;8;10;8;;D7S820
;;;;;F1|SP1;;;12;11;14;11;;D13S317
;;;;;F1|SP1;;;13;12;13;8;;D5S818
For the lines containing the data, I would like to test whether or not the fields 6-11 are the same because I want the data only if they are not equal to each other (in the first case they are all '9').
So I thought about splitting the lines and store them as an array, and then compare the arrays with the ~~ operator. But how do I do that if I'm reading the file inside a while loop and the array is redefined each line?
Or maybe there is better ways to do that.
Thanks in advance!
This is a pseudocode to illustrate what I want to do:
open FILE, $ARGV[0] or die $!;
while (<FILE>) {
chomp;
my #field = split /;/;
if ($field[0] eq '') {
if #fields[6 .. 11] is not equal to #fields[6 .. 11] in all the next lines {
do my calculation;
}
}
}
Am I correct in saying that data really represents two records? If so, you want to accumulate the lines for for the full record.
my #super_rec;
while (<>) {
chomp;
my #fields = split /;/;
if ($fields[0] ne '') {
process_rec(\#super_rec) if #super_rec;
#super_rec = \#fields;
} else {
push #super_rec, \#fields;
}
}
process_rec(\#super_rec) if #super_rec;
Then, your question can be answered.
sub process_rec {
my ($super_rec) = #_;
my ($rec, #subrecs) = #$super_rec;
my $do_calc = 0;
for my $i (1..$#subrecs) {
if ( $subrecs[0][ 6] ne $subrecs[$i][ 6]
|| $subrecs[0][ 7] ne $subrecs[$i][ 7]
|| $subrecs[0][ 8] ne $subrecs[$i][ 8]
|| $subrecs[0][ 9] ne $subrecs[$i][ 9]
|| $subrecs[0][10] ne $subrecs[$i][10]
|| $subrecs[0][11] ne $subrecs[$i][11]
) {
$do_calc = 1;
last;
}
}
if ($do_calc) {
...
}
}
I assume you're looking to compare data across lines, not within a single line. If I've got that wrong, ignore the rest of my answer.
The way I would do it is to re-join fields 6 through 11 as a string. Keep the data from the first line as $firstdata, and compare data from each successive line as $nextdata. Each time the data don't match, you up the $differences counter. When you get an ID line, check to see if the previous $differences was greater than zero and if so do your calculation (you may need to save the ID line and other fields in some other variables). Then re-initialize the $differences and $firstdata variable.
my $firstdata = "";
my $nextdata = "";
my $differences = 0;
open FILE, $ARGV[0] or die $!;
while (<FILE>) {
chomp;
my #field = split /;/;
if ($field[0] eq '') {
$nextdata = join(';', #fields[6..11]);
if ($firstdata && ($nextdata ne $firstdata)) {
$differences++;
} else {
$firstdata = $nextdata;
}
} else {
if ($differences) {
# do your calculation for previous ID
}
$firstdata = "";
$differences = 0;
}
}
if ($differences) {
# do your calculation one last time for the last ID
}
Here's a way to do it with Regex. This might be inefficient than other methods, if the indices are fixed from 6 to 11, and are known to be those only, because it will traverse entire String: -
open FILE, $ARGV[0] or die $!;
while (<FILE>) {
chomp;
my $num = 0;
my $same = 1;
while (/;(\d+);/) {
if ($num == 0) { $num = $1; }
elsif ($1 != $num) { $same = 0; last; }
# Substitute current digit matched with x (or any char)
# to avoid infinite loop
s/$1/x/;
}
if ($same) {
print "All digits same";
}
}
Using the Text::CSV_XS module you can do something like this:
use strict;
use warnings;
use Text::CSV_XS;
use feature 'say';
my $csv = Text::CSV_XS->new({
sep_char => ";",
binary => 1,
});
my %data;
my #hdrs; # store initial order of headers
my $hdr;
while (my $row = $csv->getline(*DATA)) {
if ($row->[0] =~ /^\d+$/) {
$csv->combine(#$row) or die "Cannot combine: " .
$csv->error_diag();
$hdr = $csv->string(); # recreate the header
push #hdrs, $hdr; # save list of headers
} else {
push #{ $data{$hdr} }, [ #{$row}[6..11] ];
}
}
for (#hdrs) {
say "$_\n arrays are: " . (is_identical($data{$_}) ? "same":"diff");
}
sub is_identical {
my $last;
for (#{$_[0]}) { # argument is two-dimensional array
$last //= $_;
return 0 unless ( #$_ ~~ #$last );
}
return 1; # default = all arrays were identical
}
__DATA__
10001;02/07/98;;PI;M^12/12/59^F^^SP^09/12/55
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D16S539
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D7S820
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D13S317
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D5S818
10002;02/07/98;;RJ;F^20/04/86^SP^
;;;;;F1|SP1;;;12;10;12;11;;D10S212
;;;;;F1|SP1;;;8;8;10;8;;D7S820
;;;;;F1|SP1;;;12;11;14;11;;D13S317
;;;;;F1|SP1;;;13;12;13;8;;D5S818
Output:
10001;02/07/98;;PI;M^12/12/59^F^^SP^09/12/55
arrays are: same
10002;02/07/98;;RJ;F^20/04/86^SP^
arrays are: diff

Add a string to the end of hash values

I'm new in perl and have a question concerning the use of hashes. Althought similar questions were posted before none of them was related to my problem.
I have a fasta file with several sequences of different length and want to add a string (N in this case) to the end of each fasta entry until the length of all sequences are the same. At this point I'm being able to read the fasta file and return each sequence as a string (but can also be done as arrays) to the value of an hash. The key elements are the corresponding headers of the fasta file.
My code is the following:
###### calculate the length of each hash value and store the highest value in $max
my $length;
my $max = 0;
my $addN = "N";
foreach $name ( keys %seq ) {
$length = length($seq{$name});
if ($max < $length) {
$max = $length;
} else { next }
print $max,"\n";
while (length ($seq{$name}) < $max) {
$seq{$name} .= $addN;
}
foreach $name (keys %seq) {
print $seq{$name};
print "\n";
}
}
The problem here is that the output of this code is exactly the same as the input, e.g.,
INPUT:
>fasta1
AAAAAAAAA
>fasta2
AA
OUTPUT
>fasta1
AAAAAAAAA
>fasta2
AA
where I wanted an output like this:
>fasta1
AAAAAAAAA
>fasta2
AANNNNNNN
Can you please help me to accomplish this task?
use 5.014;
my %seq = ( fasta1 => 'AAA',
fasta2 => 'AAAAAA',
fasta3 => 'AAAAAAAAA',
);
my $length = length((sort { length($a) < length($b) } values %seq)[0]);
for my $name ( keys %seq ) {
$seq{$name} = $seq{$name} . ('N' x ($length - length($seq{$name})));
}
while (my($name, $val) = each %seq ) {
say "$name: $val";
}
fasta2: AAAAAANNN
fasta3: AAAAAAAAA
fasta1: AAANNNNNN
Your sample code is wrong. However, it sounds like you have a way to populate a map based on a "fasta" file. Assuming this is true, I think the following code solves your problem.
# Populate %seq from fasta file
%seq = (
"fasta1"=> "AAAAAAAAA",
"fasta2" => "AAAA",
"fasta3" => "AA"
);
my $FILL = "N";
my $normalized_length = 0;
# If the normalized length = longest value
while( my ($k,$v) = each %seq) {
my $len = length($v);
$normalized_length = $len if $len > $normalized_length;
}
while( my ($k,$v) = each %seq) {
print $v, $FILL x ($normalized_length - length($v)), "\n";
}
OUTPUT
AAAANNNNN
AANNNNNNN
AAAAAAAAA
If you need to normalize to a fixed length, then just set $normalized_length to that value and skip the first while loop.

How can I skip some block content while reading in Perl

I plan to skip the block content which include the start line of "MaterializeU4()" with the subroutin() read_block below. But failed.
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Otherwise return an array ref containing lines to in the block.
sub read_block {
my $fh = shift;
my #lines;
my $block_started = 0;
while( my $line = <$fh> ) {
# how to correct my code below? I don't need the 2nd block content.
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 0) ) ;
if( $block_started ) {
last if $line =~ /^\s*$/;
push #lines, $line;
}
}
return \#lines if #lines;
return;
}
Data as below:
__DATA__
status DynTest = <dynamic 100>
vid = 10002
name = "DynTest"
units = ""
status VIDNAME9000 = <U4 MaterializeU4()>
vid = 9000
name = "VIDNAME9000"
units = "degC"
status DynTest = <U1 100>
vid = 100
name = "Hello"
units = ""
Output:
<StatusVariables>
<SVID logicalName="DynTest" type="L" value="100" vid="10002" name="DynTest" units=""></SVID>
<SVID logicalName="DynTest" type="L" value="100" vid="100" name="Hello" units=""></SVID>
</StatusVariables>
[Updated]
I print the value of index($line, "MaterializeU4"), it output 25.
Then I updated the code as below
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 25)
Now it works.
Any comments are welcome about my practice.
Perl already has an operator to keep track of blocks. It's called the "flip-flop" operator:
Try this out:
while ( <DATA> ) {
next if /\Q<U4 MaterializeU4()>\E/../^\s*$/;
push #lines, $_;
}
The value of /\Q<U4 MaterializeU4()>\E/../^\s*$/ will be true when it sees a line that matches the starting regex and it will stop being true after it sees a line matching the second expression.
First, using a regex instead of index is probably better since you can tune it to the exact format of status string if you may decide to be stricter than just "substring exists"
I would suggest as one solution adding a second flag to skip the block contents if it's a MaterializeU4 block, as follows:
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Empty return for skippable (Materialize4U) block!!!
# Otherwise return an array ref containing lines to in the block.
sub read_block {
my $fh = shift;
my #lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = <$fh> ) {
if ($line =~ /^status.*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push #lines, $line unless $block_ignore;
}
return \#lines if #lines || $block_started;
return;
}
Here's a slightly modified sample I tested using codepad.org:
Code:
use Data::Dumper;
my #all_lines = (
"s 1" ,"b 1" ,""
, "s MaterializeU4" ,"b 2" ,""
, "s 3" ,"b 3" ,""
);
while (#all_lines) {
my $block = read_block();
print Data::Dumper->Dump([$block]);
}
exit 0;
sub read_block {
my #lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = shift #all_lines) {
if ($line =~ /^s .*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push #lines, $line unless $block_ignore;
}
return \#lines if #lines || $block_started;
return;
}
Output:
$VAR1 = [
's 1',
'b 1'
];
$VAR1 = [];
$VAR1 = [
's 3',
'b 3'
];
On successful match of a substring, index returns the position of the substring, which could be any value >= 0. On "failure", index returns -1.
The way you are using index
index($line, "MaterializeU4") != 0
will be true for all lines except for a line that begins with the string "MaterializeU4".
It looks like you already know a little bit about Perl regular expressions. Why not use one in this case, too?
++$block_started if $line =~ /status/ && $line =~ /MaterializeU4/;
Another issue I see is that you set $block_started to begin capturing lines, but you never set it to zero at the end of the "block", say, when $line is empty. I'm not sure if that's what you wanted to do.