I have tab-delimited data that I want to process using Perl. I am new to Perl and could not figure out how to solve this.
This is a sample table (my original file is almost a GB):
gi|306963568|gb|GL429799.1|_1316857_1453052 13 1
gi|306963568|gb|GL429799.1|_1316857_1453052 14 1
gi|306963568|gb|GL429799.1|_1316857_1453052 15 1
gi|306963568|gb|GL429799.1|_1316857_1453052 16 1
gi|306963568|gb|GL429799.1|_1316857_1453052 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 1
gi|306963568|gb|GL429799.1|_1316857_1453052 361 1
gi|306963568|gb|GL429799.1|_1316857_1453052 362 1
gi|306963568|gb|GL429799.1|_1316857_1453052 363 1
gi|306963568|gb|GL429799.1|_1316857_1453052 364 1
gi|306963568|gb|GL429799.1|_1316857_1453052 365 1
gi|306963568|gb|GL429799.1|_1316857_1453052 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38641 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38642 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38643 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38644 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38645 1
I would like to get the output as
Name, start value, end value, average
gi|306963568|gb|GL429799.1|_1316857_1453052 13 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 38645 1
It would be great if someone could share their wisdom.
The general pattern is
use strict;
use warnings;
open my $fh, '<', 'myfile' or die $!;
while (<$fh>) {
chomp;
my @fields = split /\t/;
...
}
Within the loop the fields can be accessed as $fields[0] through $fields[2].
Update
I have understood your question better, and I think this solution will work for you. Note that it assumes the input data is sorted, as you have shown in your question.
It accumulates the start and end values, the total and the count in hash %data, and keeps a list of all the names encountered in @names so that the data can be displayed in the order it was read.
The program expects the input file name as a parameter on the command line.
You need to consider the formatting of the average because it is a floating point value. As it stands it will display the value to up to fifteen significant figures, and you may want to curtail that using sprintf.
use strict;
use warnings;
my ($filename) = @ARGV;
open my $fh, '<', $filename or die qq{Unable to open "$filename": $!};
my @names;
my %data;
my $current_name = '';
my $last_index;
while (<$fh>) {
chomp;
my ($name, $index, $value) = split /\t/;
if ( $current_name ne $name or $index > $last_index + 1 ) {
push @names, $name unless $data{$name};
push @{ $data{$name} }, {
start => $index,
count => 0,
total => 0,
};
$current_name = $name;
}
my $entry = $data{$name}[-1];
$entry->{end} = $index;
$entry->{count} += 1;
$entry->{total} += $value;
$last_index = $index;
}
for my $name (@names) {
for my $entry (@{ $data{$name} }) {
my ($start, $end, $total, $count) = @{$entry}{qw/ start end total count /};
print join("\t", $name, $start, $end, $total / $count), "\n";
}
}
output
gi|306963568|gb|GL429799.1|_1316857_1453052 13 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 38645 1
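If you do want to curtail the average, sprintf can round it before it is printed. This is only a minimal sketch: the helper name format_row and the choice of three decimal places are my own assumptions for illustration, not part of the program above.

```perl
use strict;
use warnings;

# Hypothetical helper: build one tab-separated output row,
# rounding the average to three decimal places with sprintf.
sub format_row {
    my ($name, $start, $end, $total, $count) = @_;
    my $average = sprintf '%.3f', $total / $count;
    return join "\t", $name, $start, $end, $average;
}

# Made-up sample values, just to show the formatting.
print format_row('gene_a', 13,  17,  5,  5), "\n";
print format_row('gene_b', 360, 366, 10, 3), "\n";
```

Note that a fixed format such as '%.3f' also pads whole-number averages, so an average of 1 is printed as 1.000.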
This will produce the same output for the sample in your question:
#!/usr/bin/perl -n
#
my ($name, $i, $value) = split(/\t/);
sub print_stats {
print join("\t", $prev_name, $start, $prev_i, $sum / ($prev_i - $start + 1)), "\n";
}
if ($prev_name eq $name && $i == $prev_i + 1) {
$sum += $value;
$prev_i = $i;
}
else {
if ($prev_name) {
&print_stats();
}
$start = $i;
$prev_name = $name;
$sum = $value;
$prev_i = $i;
}
END {
&print_stats();
}
Use it as:
./parser.pl < sample.txt
UPDATE: answers to the questions in comments:
To print output to a file, run like this: ./parser.pl < sample.txt > output.txt
$prev_name and $prev_i are NOT initialized, so they are undef at first (= NULL)
You could do something like this....
open (my $fh, '<', 'data.txt') or die "Unable to open data.txt: $!";
while (<$fh>) {
chomp;
# Split each line on tabs into the four expected fields
my ($name, $start_value, $end_value, $average) = split /\t/;
print "Name: $name\n";
print "Start Value: $start_value\n";
print "End Value: $end_value\n";
print "Average: $average\n";
print "---------\n";
}
close ($fh);
exit;
Those look like GenBank files...so I'm unsure where you are getting the start, end values, average.
Here's an example using Text::CSV:
use Text::CSV; # This will implicitly use Text::CSV_XS if it's installed
my $parser = Text::CSV->new( { sep_char => '|' } );
open my $fh, '<', 'myfile' or die $!;
while (my $row = $parser->getline($fh)) {
# $row references an array of field values from the line just read
}
Also, as a minor side detail, your sample data is delimited by pipe characters, not tabs, although that may just be to avoid copy/paste errors for those answering your question. If the actual data is tab-delimited, set sep_char to "\t" instead of '|'.
Related
Pardon me for asking a question without any coding effort, but it seems too difficult for me.
I have a data file with three tab-separated data columns (and some repeated header lines):
Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr
From the second data column, for those value(s) which are more than 0.5, I want to extract the corresponding first column number (or range).
For the above Input, the output would be:
NP_416485.4: 1, 3-5
YP_986467.7: 2-4
Here, "NP_416485.4" and "YP_986467.7" are taken from the header descriptor (after \Programs). (Note that the actual value for "NP_416485.4", for example, should be "NP_416485.4: 0, 2-4", but I increased all of them by 1 because I don't want to start at 0.)
Thanks for your consideration. I would appreciate any help. Thank you
Here is one approach. In case you have a DOS data file on a Unix machine, I used \r?\n to match a newline, so it will work in both cases:
use feature qw(say);
use strict;
use warnings;
my $file_name = 'input.txt';
open ( my $fh, '<', $file_name ) or die "Could not open file '$file_name': $!";
my $str = do { local $/; <$fh> };
close $fh;
my @chunks = $str =~ /(Sequence(?:.(?!Sequence))*)/sg;
my %ids;
for my $cstr ( @chunks ) {
my ( $id, $data ) = $cstr
=~/Split_Seq\/(\S+)\.fasta.*?\r?\n\r?\n(.*)$/s;
my @lines = split /\n/, $data;
my @vals;
for my $line ( @lines ) {
my @fields = split " ", $line;
push ( @vals, $fields[0] + 1 ) if $fields[1] > 0.5;
}
$ids{$id} = \@vals;
}
for my $id ( keys %ids ) {
my @tmp = sort { $a <=> $b } @{ $ids{$id} };
my ( $first, $last );
my @rr;
for my $i (0..$#tmp) {
if ( $i == 0 ) {
$first = $tmp[0];
$last = undef;
}
if ( $i < $#tmp && ($tmp[$i] == ($tmp[$i+1] - 1 )) ) {
$last = $tmp[$i+1];
next;
}
if ( defined $last ) {
push @rr, "$first-$last";
$last = undef;
}
else {
push @rr, $tmp[$i];
}
$first = ( $i < $#tmp ) ? $tmp[$i+1] : undef;
}
say "$id: ", join ",", @rr;
}
Output:
NP_416485.4: 1,3-5
YP_986467.7: 2-4
You don't really give a good description of your problem, and you haven't made any effort to solve it yourself, but here's a solution to the first part of your problem (parsing the file into a data structure). You'll need to walk the %results hash and produce the output that you want.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
my %results;
my $section;
while (<DATA>) {
# Look for a new section
if (/\\Programs\\(\S+)\s/) {
$section = $1;
}
# Look for data lines
if (/^\d+\b/) { # \d+ so that indexes of 10 and above are matched too
my @data = split;
if ($data[1] > 0.5) {
push @{$results{$section}}, $data[0] + 1;
}
}
}
say Dumper \%results;
__DATA__
Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr
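For the second part (walking %results and printing ranges), one possible sketch is shown here. The helper name collapse_ranges is made up for illustration, and the hash is filled with sample values in the shape that %results would have after parsing, rather than being read from the file.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of integers into a string such as "1, 3-5".
sub collapse_ranges {
    my @nums = sort { $a <=> $b } @_;
    my @ranges;
    while (@nums) {
        my $first = shift @nums;
        my $last  = $first;
        # Extend the current run while the next number is consecutive.
        $last = shift @nums while @nums && $nums[0] == $last + 1;
        push @ranges, $first == $last ? $first : "$first-$last";
    }
    return join ', ', @ranges;
}

# Sample data in the shape that %results has after parsing (assumed values).
my %results = (
    'NP_416485.4' => [ 1, 3, 4, 5 ],
    'YP_986467.7' => [ 2, 3, 4 ],
);

for my $section (sort keys %results) {
    print "$section: ", collapse_ranges(@{ $results{$section} }), "\n";
}
```

Sorting the keys is only there to make the output order predictable, since hash order is not guaranteed in Perl.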
Please don't comment to say I already asked this. It's a logic question. I know it's mostly similar code, but there are underlying syntax problems that I cannot decipher; I have spent hours debugging this with no luck and I really need this answered. The other account was deleted, so although I posted this half an hour ago I can't view it. Please only comment if you want to help.
It should work: everything is in the data file and it should be turning up results. I've had it working before, so it must be some syntax thing I'm not noticing. I can't get this to work, and I'm almost certain the problem is the grep statement.
#!/usr/bin/perl
use warnings;
use strict;
open ("data", "<text.txt") or die "Can't open"; #
my @data = <data>; #file looking into
close "data"; #
while(<>){
chomp;
my $temp = $_;
my ($name, $number, $expression) = split("\t", $temp);
my $pattern = "\t";
my @found = grep ( /(^$name$pattern\|$pattern$number$)/, @data );
if(defined($found[0])){
print $_;
my ($what, $start, $stop, $chr, $who) = split("\t", $found[0]);
print "\t", $chr, $start, $stop;
@found = ();
}
}
print "\n";
Input is of the format
A1B 1 68
A1C 299 0
A2B 547 0
A2L 877 30
A2M 2 7944
And this is the format of the data file
CLDN8 30214006 30216073 21 68
A1C 20808776 20811809 Y
UBE2Q2P5Y 25431156 25437315 Y
OR5M9 56462469 56463401 11 390162
I want to search the data file for instances of the items in the first or second column of the input file. They should match the first and fifth columns of the data file respectively (the fifth column may not exist).
Expected output should be for this example
A1B 1 68 21 30214006 30216073
A1C 299 0 Y 20808776 20811809
But I'm getting nothing
I think what you're looking for is this, but it's really very hard to tell because you have described your problem so poorly.
I've had to make a lot of assumptions, but at least the output matches what you say you're expecting.
use strict;
use warnings 'all';
my $data_file = 'text.txt';
my @data;
{
open my $fh, '<', $data_file or die qq{Unable to open "$data_file" for input: $!};
while ( <$fh> ) {
next unless /\S/;
push @data, [ split ];
}
}
while ( <> ) {
next unless /\S/;
my ($name, $number, $expression) = split;
for my $item ( @data ) {
my ($what, $start, $stop, $chr, $who) = @$item;
if ( $what eq $name or defined $who and $who eq $expression ) {
print join("\t", $name, $number, $expression, $chr, $start, $stop), "\n";
}
}
}
output
A1B 1 68 21 30214006 30216073
A1C 299 0 Y 20808776 20811809
I have been trying to find values that match between two columns (column a and column b) of a large file and print the common values plus the corresponding column d. I have been doing this by iterating through hashes; however, because the file is so large, there is not enough memory to produce the output file. Is there any other way to do the same thing using less memory?
Any help is much appreciated.
The script I have written thus far is below:
#!usr/bin/perl
use warnings;
use strict;
open (FILE1, "<input.txt") || die "$!\n Couldn't open input.txt\n";
open (Output, ">output.txt")||die "Can't Open output.txt ";
my $hash1={};
my $hash2={};
while (<FILE1>) {
chomp (my $line=$_);
my ($a, $b, $c, $d) = split (/\t/, $line);
if ($a) {
$hash1->{$a}{info1} = "$d"; #original_ID-> YOB
}
if ($b) {
$hash2->{$b}{info2} = "$a"; #original_ID-> sire
}
foreach my $key (keys %$hash2) {
if (exists $hash1{$a}) {
$info1 = $hash1->{$a}->{info1};
print "$a\t$info1\n";
}
}
}
close FILE1;
close Output;
print "Done\n";
To clarify, the input file is a large pedigree file. An example is:
1 2 3 1977
2 4 5 1944
3 4 5 1950
4 5 6 1930
5 7 6 1928
An example of the output file is:
2 1944
4 1950
5 1928
Does the below work for you ?
#!/usr/local/bin/perl
use strict;
use warnings;
use DBM::Deep;
use List::MoreUtils qw(uniq);
my @seen;
my $db = DBM::Deep->new(
file => "foo.db",
autoflush => 1
);
while (<>) {
chomp;
my @fields = split /\s+/;
$$db{$fields[0]} = $fields[3];
push @seen, $fields[1];
}
for (uniq @seen) {
print $_ . " " . $$db{$_} . "\n" if exists $$db{$_};
}
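If the set of distinct values in column b fits in memory, another option that needs no external database is to read the file twice: once to collect the column b values, and once to report the rows whose column a value is in that set. The sketch below is only an illustration under that assumption. The helper name common_rows and the temporary sample file are made up, and it collects the matches in a list for convenience, whereas for a really large file you would print inside the second loop instead.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return "a\td" for each row whose column-a value
# also occurs in column b somewhere in the file. The file is read twice,
# and only the column-b values (plus the matches) are kept in memory.
sub common_rows {
    my ($path) = @_;
    my %in_b;

    # Pass 1: remember every value seen in column b.
    open my $fh, '<', $path or die qq{Unable to open "$path": $!};
    while (<$fh>) {
        my (undef, $b_val) = split;
        $in_b{$b_val} = 1 if defined $b_val;
    }

    # Pass 2: rewind and keep the rows whose column a is in that set.
    seek $fh, 0, 0;
    my @rows;
    while (<$fh>) {
        my ($a_val, undef, undef, $d_val) = split;
        push @rows, "$a_val\t$d_val" if defined $d_val && $in_b{$a_val};
    }
    close $fh;
    return @rows;
}

# Small sample file with the layout from the question.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "1\t2\t3\t1977\n2\t4\t5\t1944\n3\t4\t5\t1950\n4\t5\t6\t1930\n5\t7\t6\t1928\n";
close $out;

print "$_\n" for common_rows($path);
```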
I have a tab-delimited file1:
20 50 80 110
520 590 700 770
410 440 20 50
300 340 410 440
read and put them into an array:
while(<INPUT>)
{
chomp;
push @inputarray, $_;
}
Now I'm looping through another file2:
20, 410, 700
80, 520
300
For each number on each line of file2, I want to search @inputarray for the number. If it exists, I want to grab the corresponding number that follows. For instance, for number 20, I want to grab the number 50. I assume that they are still separated by a tab in the string that exists as an array element in @inputarray.
while(my $line = <INPUT2>)
{
chomp $line;
my @linearray = split("\t", $line);
foreach my $start (@linearray)
{
if (grep ($start, @inputarray))
{
#want to grab the corresponding number
}
}
}
Once grep finds it, I don't know how to grab that array element to find the position of the number and extract the corresponding number, perhaps using the substr function. How do I grab the array element that grep found?
A desired output would be:
line1:
20 50
410 440
700 770
line2:
80 110
520 590
line3:
300 340
IMHO, it would be best to store the numbers from file1 in a hash. Referring to the example content of file1 that you provided above, you could have something like this:
{
'20' => '50',
'80' => '110',
'520'=> '590',
'700'=> '770',
'410'=> '440',
'20' => '50',
'300'=> '340',
'410' => '440'
}
A sample piece of code will be like
my %inputarray;
while(<INPUT>)
{
chomp;
my @numbers = split /\t/;
my $length = scalar @numbers;
# Walk the line two numbers at a time: key is the first, value is the second
for (my $i = 0; $i + 1 < $length; $i += 2)
{
$inputarray{$numbers[$i]} = $numbers[$i+1];
}
}
A demonstration of the above loop
index: 0 1 2 3
numbers: 20 50 80 110
first iteration: $i=0
$inputarray{$numbers[0]} = $numbers[1];
$i = 2; #$i += 2;
second iteration: $i=2
$inputarray{$numbers[2]} = $numbers[3];
And then while parsing file2, you just need to treat the number as the key of %inputarray.
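The lookup step could be sketched roughly as follows, with in-memory sample lines standing in for the two files. The helper names build_pairs and lookup_line are made up for illustration, and file2 is assumed to be comma-separated as shown in the question.

```perl
use strict;
use warnings;

# Build a hash that maps the first number of each pair to the second.
sub build_pairs {
    my %pairs;
    for my $line (@_) {
        my @numbers = split ' ', $line;
        for (my $i = 0; $i + 1 < @numbers; $i += 2) {
            $pairs{ $numbers[$i] } = $numbers[ $i + 1 ];
        }
    }
    return \%pairs;
}

# For one line of file2, return a "start end" string for every number found.
sub lookup_line {
    my ($pairs, $line) = @_;
    my @found;
    for my $start (split /\s*,\s*/, $line) {
        push @found, "$start $pairs->{$start}" if exists $pairs->{$start};
    }
    return @found;
}

# Sample lines copied from file1 and file2 in the question.
my $pairs = build_pairs("20\t50\t80\t110", "520\t590\t700\t770",
                        "410\t440\t20\t50", "300\t340\t410\t440");

my $lineno = 0;
for my $line ('20, 410, 700', '80, 520', '300') {
    $lineno++;
    print "line$lineno:\n";
    print "$_\n" for lookup_line($pairs, $line);
}
```

Using exists for the check means a number that is missing from file1 is simply skipped instead of producing a warning about an undefined value.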
I believe this gets you close to what you want.
#!/usr/bin/perl -w
my %follows;
open my $file1, "<", $ARGV[0] or die "could not open $ARGV[0]: $!\n";
while (<$file1>)
{
chomp;
my $prev = undef;
foreach my $curr ( split /\s+/ )
{
$follows{$prev} = $curr if defined $prev; # defined, so a value of 0 still counts
$prev = $curr;
}
}
close $file1;
open my $file2, "<", $ARGV[1] or die "could not open $ARGV[1]: $!\n";
my $lineno = 1;
while (<$file2>)
{
chomp;
print "line $lineno\n";
$lineno++;
foreach my $val ( split /,\s+/, $_ )
{
print $val, " ", ($follows{$val} // "no match"), "\n";
}
print "\n";
}
If you only want to consider numbers from file1 in pairs, as opposed to seeing which numbers follow what other numbers without taking pair boundaries into account, then you need to change the logic in the first while loop slightly.
#!/usr/bin/perl -w
my %follows;
open my $file1, "<", $ARGV[0] or die "could not open $ARGV[0]: $!\n";
while (<$file1>)
{
chomp;
my $line = $_;
while ( $line =~ s/(\S+)\s+(\S+)\s*// )
{
$follows{$1} = $2;
}
}
close $file1;
open my $file2, "<", $ARGV[1] or die "could not open $ARGV[1]: $!\n";
my $lineno = 1;
while (<$file2>)
{
chomp;
print "line $lineno\n";
$lineno++;
foreach my $val ( split /,\s+/, $_ )
{
print $val, " ", ($follows{$val} // "no match"), "\n";
}
print "\n";
}
If you want to read the input once but check for numbers a lot, you might be better off splitting each input line into individual numbers and adding each number as a key in a hash, with the following number as its value. That makes reading slower and takes more memory, but the second part, where you want to check for following numbers, will be a breeze thanks to exists and the nature of hashes.
If I understood your question correctly, you could use just one big hash. That is of course assuming that every number is always followed by the same number.
This is a description of my problem: I have two text files (here $variants and $annotation). I want to check if the value from column 2 in $variants lies between the values from column 2 and 3 in $annotation. If this is true then the value from column 1 in $annotation should be added to a new column in $variants.
This is what my sample input files look like:
$annotation represents a tab-delimited text file.
These values can overlap and cannot be perfectly sorted, since I'm working with a circular genome.
C0 C1 C2
gene1 0 100
gene2 500 1000
gene3 980 1200
gene4 1500 5
$variants represents a tab-delimited text file
C0 C1
... 5
... 10
... 100
... 540
... 990
The output should look like this ($variants with two other columns added)
C0 C1 C2 C3
... 5 gene1 gene4
... 10 gene1
... 100 gene1
... 540 gene2
... 990 gene2 gene3
This is what my script looks like at the moment:
my %hash1=();
while(<$annotation>){
my @column = split(/\t/); #split on tabs
my $keyfield = $column[1] && $column[2]; # I need to remember values from two columns here. How do I do that?
}
while(<$variants>){
my @column=split(/\t/); # split on tabs
my $keyfield = $column[1];
if ($hash1{$keyfield} >= # so the value in column[1] should be between the values from column[1] & [2] in $annotation
push # if true then add values from column[0] in $annotation to new column in $variants
}
So my biggest problems are how to remember two values in a file using hashes and how to put a value from one file to a column in another file. Could someone help me with this?
If the input files are not large and the positions are not too high, you can use arrays to represent all positions:
#!/usr/bin/perl
use warnings;
use strict;
sub skip_header {
my $FH = shift;
<$FH>;
}
open my $ANN, '<', 'annotation' or die $!;
my $max = 0;
while (<$ANN>) {
$_ > $max and $max = $_ for (split)[1, 2];
}
seek $ANN, 0, 0; # Rewind the file back.
my $circular;
my #genes;
while (<$ANN>) {
my ($gene, $from, $to) = split;
if ($from <= $to) {
$genes[$_] .= "$gene " for $from .. $to;
} else {
$circular = 1;
$genes[$_] .= "$gene " for 0 .. $to, $from .. $max + 1;
}
}
chop @genes;
open my $VAR, '<', 'variants' or die $!;
skip_header($VAR);
while (<$VAR>) {
next if /^\s*#/;
chomp;
my ($str, $pos) = split;
$pos = $#genes if $circular and $pos > $#genes;
print "$_ ", $genes[$pos] // q(), "\n";
}
No hashing is needed at all. This example expects the annotations to be sorted and not overlapping. It also works only if all the values from variants should be printed.
#!/usr/bin/perl
use warnings;
use strict;
open my $VAR, '<', 'variants' or die $!;
<$VAR>; # skip header
my ($str, $pos) = split ' ', <$VAR>;
open my $ANN, '<', 'annotation' or die $!;
<$ANN>; # skip header
while (<$ANN>) {
my ($gene, $from, $to) = split;
while ($from <= $pos and $pos <= $to) {
print "$str $pos $gene\n";
($str, $pos) = split ' ', <$VAR> or last;
}
}
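When the annotations overlap or wrap around the origin, as in the question's sample, a plain scan over all annotations for each variant position is a simple alternative, provided the annotation list is small enough for that to be fast. This is just a sketch: the helper names in_interval and genes_at are made up, and the data is held in memory instead of being read from the two files.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $pos lies inside [$from, $to]. An interval
# whose start is greater than its end is treated as wrapping past the origin.
sub in_interval {
    my ($pos, $from, $to) = @_;
    return $from <= $to
        ? ($pos >= $from && $pos <= $to)
        : ($pos >= $from || $pos <= $to);
}

# Return the names of all annotations that contain $pos.
sub genes_at {
    my ($pos, @annotations) = @_;
    return map { $_->[0] } grep { in_interval($pos, @{$_}[1, 2]) } @annotations;
}

# Annotation rows from the question: name, start, end.
my @annotations = (
    [ gene1 => 0,    100  ],
    [ gene2 => 500,  1000 ],
    [ gene3 => 980,  1200 ],
    [ gene4 => 1500, 5    ],
);

# Variant positions from the question.
for my $pos (5, 10, 100, 540, 990) {
    print join("\t", $pos, genes_at($pos, @annotations)), "\n";
}
```

Each position is checked against every annotation, so the cost grows with the product of the two file sizes. For large inputs a sorted structure or an interval tree would be the usual next step.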