Please don't comment to say I already asked this, It's a logic question, I know it's mostly similar code but there are underlying syntax problems that I cannot decipher and have spent hours debugging this with no hope and I just really need this answered. And that other account was deleted so I did post this half an hour ago but can't view it. Please only comment if you want to help.
It should work everything is in data and it should be turning up results, i've had it working before so it must just be so syntax thing I'm not noticing. I can't get this work. I'm almost certain it's the grep statement.
#!/usr/bin/perl
use warnings;
use strict;
open ("data", "<text.txt") or die "Can't open"; #
my #data = <data>; #file looking into
close "data"; #
while(<>){
chomp;
my $temp = $_;
my ($name, $number, $expression) = split("\t", $temp);
my $pattern = "\t";
my #found = grep ( /(^$name$pattern\|$pattern$number$)/, #data );
if(defined($found[0])){
print $_;
my ($what, $start, $stop, $chr, $who) = split("\t", $found[0]);
print "\t", $chr, $start, $stop;
#found = ();
}
}
print "\n";
Input is of the format
A1B 1 68
A1C 299 0
A2B 547 0
A2L 877 30
A2M 2 7944
And this is the format of the data file
CLDN8 30214006 30216073 21 68
A1C 20808776 20811809 Y
UBE2Q2P5Y 25431156 25437315 Y
OR5M9 56462469 56463401 11 390162
I want to search for the instances of items in the first or second column of the input file in the data file which should match up with the first and 5th column(which may not exist) respectively
Expected output should be for this example
A1B 1 68 21 30214006 30216073
A1C 299 0 Y 20808776 20811809
But I'm getting nothing
I think what you're looking for is this, but it's really very hard to tell because you have described your problem so poorly
I've had to make a lot of assumptions, but at least the output matches what you say you're expecting
use strict;
use warnings 'all';
my $data_file = 'text.txt';
my #data;
{
open my $fh, '<', $data_file or die qq{Unable to open "$data_file" for input: $1};
while ( <$fh> ) {
next unless /\S/;
push #data, [ split ];
}
}
while ( <> ) {
next unless /\S/;
my ($name, $number, $expression) = split;
for my $item ( #data ) {
my ($what, $start, $stop, $chr, $who) = #$item;
if ( $what eq $name or defined $who and $who eq $expression ) {
print join("\t", $name, $number, $expression, $chr, $start, $stop), "\n";
}
}
}
output
A1B 1 68 21 30214006 30216073
A1C 299 0 Y 20808776 20811809
Related
I want to join the first to 16th word and 17th to 31st, etc in an array with space to one line but do not know why the code does not work. Hope to get help here.Thanks
my #file = <FILE>;
for ( $i=0; $i<=$#file; $i+=16 ){
my $string = join ( " ", #file[$i..$i+15] );
print FILE1 "$string\n";
}
Below is part of my file.
1
2
3
...
What i wan to print is
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
17 18 19 20 21....
I wouldn't do it the way you've done it.
Instead I would:
open ( my $input, '<', "your_file_name" ) or die $!;
chomp ( my #file = <$input> );
print join (" ",splice (#file, 0, 15)),"\n" while #file;
Note - I've used a lexical file handle with a 3 argument open, because that's better style.
splice removes the first 16 elements from #file each iteration, and continues until #file is empty.
Your lines have newlines attached to them. Remove them with chomp. Then loop over the array, remove 16 items and print them.
my #file = <FILE>;
chomp #file;
while (#file) {
my #temp;
INNER: for ( 0 .. 15 ) {
push #temp, shift #file || last INNER; # not or
}
print join( q{ }, #temp ), "\n";
}
This is the long implementation of the splice solution Sobrique suggested in the comments. It's does the same thing, just way more verbose.
This is the old answer before the edit:
If you only want the first 16, this is way more effective.
my $string = join q{ }, map { <FILE>; chomp; $_ } 1 .. 16;
This reads 16 lines and chomp each of them, then joins.
You might also want to use lexical file handles $fh instead of the GLOB FILE.
open my $fh, '<', $path_to_file or die $!;
Suppose if you want to read it from file, don't store the whole file into an array. Instead loop through line by line. And check the line number with $. special variable.
use warnings;
use strict;
open my $fh,"<","input.txt";
my $line;
while (<$fh>)
{
chomp;
$line.= $_." ";
print "$line\n" and $line="" if($. % 16 == 0);
END{ print "$line\n";};
}
Or this also will work
use warnings;
use strict;
open my $wh,"<","input.txt";
my $line;
foreach (;;)
{
my $data = join " ",(map { my $m=<$wh> || ""; chomp($m); $m} (0..15));
last if ($data =~m/^\s+$/);
print $data,"\n";
}
Assuming that you have FILE and FILE1 descriptors open, try:
$.%16?s!\s+! !:1 and print FILE1 $_ while <FILE>;
I have two files. One file has a list of values like so
NC_SNPStest.txt
250
275
375
The other file has space delimited information. Column one is the first value of a range, Column two has the second value of a range, Column 5 has the name of the range, and Column eight has what acts on that range.
promoterstest.txt
20 100 yaaX F yaaX 5147 5.34 Sigma70 99
200 300 yaaA R yaaAp1 6482 6.54 Sigma70 35
350 400 yaaA R yaaAp2 6498 2.86 Sigma70 51
I am trying to write a script that takes the first line from file 1 and then parses file 2 line by line to see if that value falls in the range is between the first two columns.
When the first match is found, I want to print the value from file 1 and then the values in file 2 for columns 5 and 8 from the line with the match. If no match is found in File 2 then just print the value from File 1 and move on.
It seems like it should be a simple enough task but I'm having an issue cycling though both files.
This is what I have written:
#!/usr/bin/perl
use warnings;
use strict;
open my $PromoterFile, '<', 'promoterstest.txt' or die $!;
open my $SNPSFile, '<', 'NC_SNPtest.txt' or die $!;
open (FILE, ">PromoterMatchtest.txt");
while (my $SNPS = <$SNPSFile>) {
chomp ($SNPS);
while (my $Cord = <$PromoterFile>) {
chomp ($Cord);
my #CordFile =split(/\s/, $Cord);
my $Lend = $CordFile[0];
my $Rend = $CordFile[1];
my $Promoter = $CordFile[4];
my $SigmaFactor = $CordFile[7];
foreach $a ($SNPS)
{
if ($a >= $Lend && $a <= $Rend)
{
print FILE "$a\t$CordFile[4]\t$CordFile[7]\n";
}
else
{
print FILE "$a\n";
}
}
}
}
close FILE;
close $PromoterFile;
close $SNPSFile;
exit;
So far my output looks like so:
250
250 yaaAp1 Sigma70
250
Where the first line of file 1 is being called and file 2 is being cycled through. But the else command is being used on each line of file 2 and the script never cycles through the other lines of file 1.
Your problem is you're not resetting your progress through the second file. You read one line from $SNPSFile, check that against ever line in the second file.
But when you start over, you're already at the end of file, so:
while (my $Cord = <$PromoterFile>) {
Doesn't have anything to read.
A quick fix for this would be to add a seek command in there, but that'll make inefficient code. I'd suggest instead reading file 1 into a array, and referencing that instead.
Here's a first draft rewrite that may help.
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
open my $PromoterFile, '<', 'promoterstest.txt' or die $!;
open my $SNPSFile, '<', 'NC_SNPtest.txt' or die $!;
open my $output, ">", "PromoterMatchtest.txt" or die $!;
my #data;
while (<$PromoterFile>) {
chomp;
my #CordFile = split;
my $Lend = $CordFile[0];
my $Rend = $CordFile[1];
my $Promoter = $CordFile[4];
my $SigmaFactor = $CordFile[7];
push(
#data,
{ lend => $CordFile[0],
rend => $CordFile[1],
promoter => $CordFile[4],
sigmafactor => $CordFile[7]
}
);
}
print Dumper \#data;
foreach my $value (<$SNPSFile>) {
chomp $value;
my $found = 0;
foreach my $element (#data) {
if ( $value >= $element->{lend}
and $value <= $element->{rend} )
{
#print "Found $value\n";
print {$output} join( "\t",
$value, $element->{promoter}, $element->{sigmafactor} ),
"\n";
$found++;
last;
}
}
if ( not $found ) {
print {$output} $value,"\n";
}
}
close $output;
close $PromoterFile;
close $SNPSFile;
First - we open file2, read in the stuff in it to an array of hashes. (If any of the elements there are unique, we could key off that instead.)
Then we read through SNPSfile one line at a time, looking for each key - printing it if it exists (at least once, on the first hit) and printing just the key if it doesn't.
This generates the output:
250 yaaAp1 Sigma70
275 yaaAp1 Sigma70
375 yaaAp2 Sigma70
Was that what you were aiming for?
Aside from that 'Dumper' statement which outputs the content of #data as thus:
$VAR1 = [
{
'sigmafactor' => 'Sigma70',
'promoter' => 'yaaX',
'lend' => '20',
'rend' => '100'
},
{
'sigmafactor' => 'Sigma70',
'promoter' => 'yaaAp1',
'rend' => '300',
'lend' => '200'
},
{
'promoter' => 'yaaAp2',
'sigmafactor' => 'Sigma70',
'rend' => '400',
'lend' => '350'
}
];
Here's my take on a programming solution. It's important to
Use lexical file handles and the three-paremeter form of open
Keep to lower-case letters, digits and underscores for local variables
I have also used the autodie pragma to remove the need to test the status of open explicitly, and the first function from the core library List::Util to make the code clearer and more concise
use strict;
use warnings;
use 5.010;
use autodie;
use List::Util 'first';
my #promoters;
{
open my $fh, '<', 'promoterstest.txt';
while ( <$fh> ) {
my #fields = split;
push #promoters, [ #fields[0,1,4,7] ];
}
}
open my $fh, '<', 'NC_SNPStest.txt';
open my $out_fh, '>', 'PromoterMatchtest.txt';
select $out_fh;
while ( <$fh> ) {
my ($num) = split;
my $match = first { $num >= $_->[0] and $num <= $_->[1] } #promoters;
if ( $match ) {
print join("\t", $num, #{$match}[2,3]), "\n";
}
else {
print $num, "\n";
}
}
output
250 yaaAp1 Sigma70
275 yaaAp1 Sigma70
375 yaaAp2 Sigma70
I have a tab-delimited file1:
20 50 80 110
520 590 700 770
410 440 20 50
300 340 410 440
read and put them into an array:
while(<INPUT>)
{
chomp;
push #inputarray, $_;
}
Now I'm looping through another file2:
20, 410, 700
80, 520
300
foreach number of each line in file2, I want to search the #inputarray for the number. If it exists, I want to grab the corresponding number that follows. For instance, for number 20, I want to grab the number 50. I assume that they are still separated by a tab in the string that exists as an array element in #inputarray.
while(my $line = <INPUT2>)
{
chomp $line;
my #linearray = split("\t", $line);
foreach my $start (#linearray)
{
if (grep ($start, #inputarray))
{
#want to grab the corresponding number
}
}
}
Once grep finds it, i don't know how to grab that array element to find the position of the number to extract the corresponding number using perhaps the substr function. How do i grab the array element that grep found?
A desired output would be:
line1:
20 50
410 440
700 770
line2:
80 110
520 590
line3:
300 340
IMHO, it would be best to store the numbers from file1 in a hash. Referring to the example clontent of file1 as you provided above you can have something like below
{
'20' => '50',
'80' => '110',
'520'=> '590',
'700'=> '770',
'410'=> '440',
'20' => '50',
'300'=> '340',
'410' => '440'
}
A sample piece of code will be like
my %inputarray;
while(<INPUT>)
{
my #numbers = split $_;
my $length = scalar $numbers;
# For $i = 0 to $i < $length;
# $inputarray{$numbers[$i]} = $numbers[$i+1];
# $i+=2;
}
An demonstration of the above loop
index: 0 1 2 3
numbers: 20 50 80 110
first iteration: $i=0
$inputarray{$numbers[0]} = $numbers[1];
$i = 2; #$i += 2;
second iteration: $i=2
$inputarray{$numbers[2]} = $numbers[3];
And then while parsing file2, you just need to treat the number as the key of %inputarray.
I believe this gets you close to what you want.
#!/usr/bin/perl -w
my %follows;
open my $file1, "<", $ARGV[0] or die "could not open $ARGV[0]: $!\n";
while (<$file1>)
{
chomp;
my $prev = undef;
foreach my $curr ( split /\s+/ )
{
$follows{$prev} = $curr if ($prev);
$prev = $curr;
}
}
close $file1;
open my $file2, "<", $ARGV[1] or die "could not open $ARGV[1]: $!\n";
my $lineno = 1;
while (<$file2>)
{
chomp;
print "line $lineno\n";
$lineno++;
foreach my $val ( split /,\s+/, $_ )
{
print $val, " ", ($follows{$val} // "no match"), "\n";
}
print "\n";
}
If you only want to consider numbers from file1 in pairs, as opposed to seeing which numbers follow what other numbers without taking pair boundaries into account, then you need to change the logic in the first while loop slightly.
#!/usr/bin/perl -w
my %follows;
open my $file1, "<", $ARGV[0] or die "could not open $ARGV[0]: $!\n";
while (<$file1>)
{
chomp;
my $line = $_;
while ( $line =~ s/(\S+)\s+(\S+)\s*// )
{
$follows{$1} = $2;
}
}
close $file1;
open my $file2, "<", $ARGV[1] or die "could not open $ARGV[1]: $!\n";
my $lineno = 1;
while (<$file2>)
{
chomp;
print "line $lineno\n";
$lineno++;
foreach my $val ( split /,\s+/, $_ )
{
print $val, " ", ($follows{$val} // "no match"), "\n";
}
print "\n";
}
If you want to read the input once but check for numbers a lot, you might be better off to split the input line into individual numbers. Then add each each number as key into a hash with the following number as value. That makes reading slow and takes more memory but the second part, where you want to check for following numbers will be a breeze thanks to exist and the nature of hashes.
If i understood your question correct, you could use just one big hash. That is of course assuming that every number is always followed by the same number.
I have a tab delimited data. I want to process that data using perl. I am a newbie to perl and could not figure out how to solve .
This is sample table: My original file is almost a GB
gi|306963568|gb|GL429799.1|_1316857_1453052 13 1
gi|306963568|gb|GL429799.1|_1316857_1453052 14 1
gi|306963568|gb|GL429799.1|_1316857_1453052 15 1
gi|306963568|gb|GL429799.1|_1316857_1453052 16 1
gi|306963568|gb|GL429799.1|_1316857_1453052 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 1
gi|306963568|gb|GL429799.1|_1316857_1453052 361 1
gi|306963568|gb|GL429799.1|_1316857_1453052 362 1
gi|306963568|gb|GL429799.1|_1316857_1453052 363 1
gi|306963568|gb|GL429799.1|_1316857_1453052 364 1
gi|306963568|gb|GL429799.1|_1316857_1453052 365 1
gi|306963568|gb|GL429799.1|_1316857_1453052 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38641 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38642 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38643 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38644 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38645 1
I would like to get the output as
Name, start value, end value, average
gi|306963568|gb|GL429799.1|_1316857_1453052 13 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 38645 1
it will be great if someone could share their wisdom.
The general pattern is
use strict;
use warnings;
open my $fh, '<', 'myfile' or die $!;
while (<$fh>) {
chomp;
my #fields = split /\t/;
...
}
Within the loop the fields can be accessed as $fields[0] through $fields[2].
Update
I have understood your question better, and I think this solution will work for you. Note that it assumes the input data is sorted, as you have shown in your question.
It accumulates the start and end values, the total and the count in hash %data, and keeps a list of all the names encountered in #names so that the data can be displayed in the order it was read.
The program expects the input file name as a parameter on the command line.
You need to consider the formatting of the average because it is a floating point value. As it stands it will display the value to sixteen significant figures, and you may want to curtail that using sprintf.
use strict;
use warnings;
my ($filename) = #ARGV;
open my $fh, '<', $filename or die qq{Unable to open "$filename": $!};
my #names;
my %data;
my $current_name = '';
my $last_index;
while (<$fh>) {
chomp;
my ($name, $index, $value) = split /\t/;
if ( $current_name ne $name or $index > $last_index + 1 ) {
push #names, $name unless $data{$name};
push #{ $data{$name} }, {
start => $index,
count => 0,
total => 0,
};
$current_name = $name;
}
my $entry = $data{$name}[-1];
$entry->{end} = $index;
$entry->{count} += 1;
$entry->{total} += $value;
$last_index = $index;
}
for my $name (#names) {
for my $entry (#{ $data{$name} }) {
my ($start, $end, $total, $count) = #{$entry}{qw/ start end total count /};
print join("\t", $name, $start, $end, $total / $count), "\n";
}
}
output
gi|306963568|gb|GL429799.1|_1316857_1453052 13 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 38645 1
This will produce the same output for the sample in your question:
#!/usr/bin/env perl -n
#
my ($name, $i, $value) = split(/\t/);
sub print_stats {
print join("\t", $prev_name, $start, $prev_i, $sum / ($prev_i - $start + 1)), "\n";
}
if ($prev_name eq $name && $i == $prev_i + 1) {
$sum += $value;
$prev_i = $i;
}
else {
if ($prev_name) {
&print_stats();
}
$start = $i;
$prev_name = $name;
$sum = $value;
$prev_i = $i;
}
END {
&print_stats();
}
Use it as:
./parser.pl < sample.txt
UPDATE: answers to the questions in comments:
To print output to a file, run like this: ./parser.pl < sample.txt > output.txt
$prev_name and $prev_i are NOT initialized, so they are undef at first (= NULL)
You could do something like this....
open (FILE, 'data.txt');
while (<FILE>) {
chomp;
($name, $start_value, $end_value, $average) = split("\t");
print "Name: $name\n";
print "Start Value: $start_value\n";
print "End Value: $End_Value\n";
print "Average: %average
print "---------\n";
}
close (FILE);
exit;
Those look like GenBank files...so I'm unsure where you are getting the start, end values, average.
Here's an example using Text::CSV:
use Text::CSV; # This will implicitly use Text::CSV_XS if it's installed
my $parser = Text::CSV->new( { sep_char => '|' } );
open my $fh, '<', 'myfile' or die $!;
while (my $row = $parser->getline($fh)) {
# $row references an array of field values from the line just read
}
Also, as a minor side detail, your sample data is delimited by pipe characters, not tabs, although that may just be to avoid copy/paste errors for those answering your question. If the actual data is tab-delimited, set sep_char to "\t" instead of '|'.
Code :
#!/usr/bin/perl
my $file = $ARGV[0];
my $position = $ARGV[1]; # POSITION OF THE RESIDUE
open (FILE, $file);
while (<FILE>) {
my #f = split;
if (($f[0] == "ANNOT_RESID_NO") && ($f[1] == $position)){
push #line, $_;
}
}
print #line;
close(FILE);
INPUT :
ANNOT_TYPE[1] 0
ANNOT_TYPE_NAME[1] CATRES
ANNOT_NUMBER[1][1] 1
ANNOT_NAME[1][1] 3.1.3.16
ANNOT_DESC[1][1] Phosphoprotein phosphatase.
ANNOT_RESID_NO[1][1][1] 91
ANNOT_RESID_NAME[1][1][1] ASP
ANNOT_RESID_NUM[1][1][1] 95
ANNOT_RESID_NO[1][1][2] 92
ANNOT_RESID_NAME[1][1][2] ARG
ANNOT_NRESID[1][1] 6
ANNOT_NUMBER[1][2] 2
ANNOT_NAME[1][2] 3.1.3.53
ANNOT_DESC[1][2] [Myosin-light-chain] phosphatase.
ANNOT_RESID_NO[1][2][1] 91
ANNOT_RESID_NAME[1][2][1] ASP
ANNOT_RESID_NUM[1][2][1] 95
ANNOT_RESID_NO[1][2][2] 92
ANNOT_RESID_NAME[1][2][2] ARG
Question :
I am printing the line with has $position(for example 91) for the line starting with "ANNOT_RESID_NO". Along with this line, I also want to print, every time, in #line is the first line above this match containing "ANNOT_DESC". This "ANNOT_DESC" line is not necessarily always the line just above my matched line.
Try (complete code):
#!/usr/bin/perl
use strict;
use warnings;
my $file = $ARGV[0];
my $position = $ARGV[1];
open (FILE, $file) or die $!;
my $desc;
my #line;
while (<FILE>) {
my #f = split " ";
if ( $f[0] =~ /^ANNOT_DESC/ ) {
$desc = $_;
next;
}
if ( $f[0] =~ /^ANNOT_RESID_NO/ and $f[1] == $position ) {
push #line, $desc, $_;
}
}
output:
ANNOT_DESC[1][1] Phosphoprotein phosphatase.
ANNOT_RESID_NO[1][1][1] 91
ANNOT_DESC[1][2] [Myosin-light-chain] phosphatase.
ANNOT_RESID_NO[1][2][1] 91
With a data set that small you can push the lines from the file to an array(e.g. #file_data) , iterate the #file_data array and push the values you want into your #line array.