Split data separated by pipe from a file - perl

I have a file of pipe-separated data where the values are 0, 1, or some text. I have to read the file line by line, split each line on the pipe character, and, whenever a field's value is 1, increment that field's counter in a hash, so that at the end I have a hash telling me how many true values there are in the file.
Below is an example of the data and the name of each column; if a value is 1, the hash value for that key gets incremented by 1.
A B C D E F G H I
===========================
0|0|0|1|0|0|ABC|15:44:07|0
0|0|0|1|0|0|ABC|15:44:07|0
0|0|0|0|0|0|ABC|15:44:07|0
0|1|0|0|0|1|ABC|15:44:07|0
So the final output will be like below:
$finalHash->{D} = 2;
$finalHash->{F} = 1;
$finalHash->{B} = 1;
I am able to read the file line by line:
my $final_hash;
for my $line ( <$ifh> ) {
    my @el = split( /\|/, $line );
    # Now here I have to set the value in the hash key where we get the 1 value
}

Don't use for to iterate over file lines; use while. The former might exhaust memory if the file is very large, as it needs to create a list of all the lines; the latter reads one line at a time.
#! /usr/bin/perl
use warnings;
use strict;

my @column_names = qw( A B C D E F );
my %counts;
while (<DATA>) {
    my @columns = split /\|/;
    for my $i (0 .. $#column_names) {
        ++$counts{ $column_names[$i] } if $columns[$i];
    }
}

use Data::Dumper; print Dumper \%counts;
__DATA__
0|0|0|1|0|0|ABC|15:44:07|0
0|0|0|1|0|0|ABC|15:44:07|0
0|0|0|0|0|0|ABC|15:44:07|0
0|1|0|0|0|1|ABC|15:44:07|0
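Run against the sample data, the Dumper output is (hash key order may vary):
$VAR1 = {
          'B' => 1,
          'D' => 2,
          'F' => 1
        };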

Related

Perl: Matching columns from two different files

(Note: Column headers are there for readability and are not in the actual files)
File 1
COLUMN1 COLUMN2 COLUMN3
AG_446337835.1 example1 grgsdt
AG_448352465.1 example2 190197
AG_449465753.1 example3 h837h8
AG_449366462.1 example4 d34tw4
AG_444725037.1 example5 f45ge4
AG_441227463.1 example6 f3fw4t
AG_449986090.1 example7 gft7r4
AG_445666926.1 example8 4vsr55
AG_441004541.1 example9 fh893b
AG_444837264.1 example0 k3883d
File 2
COLUMN1 COLUMN2
grgsdt AAHG
h837h8 JUJN
190197 POKJ
f45ge4 DFRF
gft7r4 NNHN
d34tw4
fh893b YUNIP
k3883d YUNIP
f3fw4t YUNIP
190197 YUNIP
4vsr55 GHGF
Desired Output file
COLUMN1 COLUMN2 COLUMN3 COLUMN4 (formerly column2 from file2)
AG_446337835.1 example1 grgsdt AAHG
AG_448352465.1 example2 190197 POKJ YUNIP
AG_449465753.1 example3 h837h8 JUJN
AG_449366462.1 example4 d34tw4
AG_444725037.1 example5 f45ge4 DFRF
AG_441227463.1 example6 f3fw4t YUNIP
AG_449986090.1 example7 gft7r4 NNHN
AG_445666926.1 example8 4vsr55 GHGF
AG_441004541.1 example9 fh893b YUNIP
AG_444837264.1 example0 k3883d YUNIP
I am barely familiar with Perl (or programming in general) and I was wondering if you would mind advising me on this problem.
Essentially, Column 3 in file1 corresponds to Column 1 in File2.
I want to take each line in file1, read column 3 of that line, and search file2 for a matching entry; if a matching entry exists, print the line from file1 to a new file with an extra column taken from file2 (as seen in the example output).
The file sizes are
File1: 2GB
File2: 718MB
This script will be run on a machine with 250GB of RAM, so memory is not an issue.
This is what I have so far
#!/usr/bin/perl ;
#use warnings;
use Getopt::Long qw(GetOptions);
use experimental 'smartmatch';

#Variable to store inputted text file data
my $db ;
my $db2 ;

#Open and read File one into memory
open FPIN, "file1.txt" or die "Could not open";
my @file1 = <FPIN> ;
close FPIN;

#Open and read file two into memory
open FPIN, "file2.tab" or die "Could not open";
my @file2 = <FPIN> ;
close FPIN ;

foreach (@file2)
{
    if (/(^\w+)\t(.+)/)
    {
        split /\t/, $2;
        $db2->{$1}->{"geneName"} = $1 ;
        $db2->{$1}->{"protein"} = $2 ;
    }
}

foreach (@file1)
{
    #if line begins with any word character tab and anything
    if (/(^\w+.\d+)\t(.+)/)
    {
        my @fields = split /\t/, $2;
        my $refSeqID = $1;
        #assign the data in the array to variables
        my ($geneSymbol, $geneName) = @fields[0, 1];
        #Create database data structure and fill it with the info
        $db->{$2}->{"refSeqID"} = $refSeqID ;
        $db->{$2}->{"geneSymbol"} = $geneSymbol ;
        $db->{$2}->{"geneName"} = $geneName ;
    }
}

foreach my $id (sort keys %{$db2})
{
    if ( exists $db->{$id} )
    {
        print $db2->{$id}."\t".$db->{$id}->{$geneSymbol}."\t".$db->{$id}->
        {$refSeqID}."\t".$db2->{$id}->{$protein}."\n";
    }
}
I seem to be able to read both files into memory correctly.
However, I have been completely unable to compare the files to each other, and I am dumbstruck on how to approach it.
Actually printing it will be another issue I need to tackle.
This will do as you ask
It starts by reading file2.txt and building a hash %f2 that relates the value of the first column to the value of the second
Thereafter it's just a matter of reading through file1.txt, splitting it into fields, and adding a further field obtained by accessing the hash using the value of the third field
I've used autodie to save the trouble of handling errors in the open calls. Otherwise everything is standard
Update
I've just noticed that a column 1 value may be repeated in file2.txt, so I've changed the code to make each key of the hash correspond to an array of values. All the values in the array appear, space-separated, in column 4 of the output
use strict;
use warnings 'all';
use autodie;

my %f2;

{
    open my $fh, '<', 'file2.txt';
    while ( <$fh> ) {
        my ($key, $val) = split;
        $f2{$key} //= [];
        push @{ $f2{$key} }, $val if $val;
    }
}

open my $fh, '<', 'file1.txt';
while ( <$fh> ) {
    my @line = split;
    my $c4 = $f2{$line[2]};
    push @line, $c4 ? join(' ', @$c4) : '';
    local $" = "\t";
    print "@line\n";
}
output
AG_446337835.1 example1 grgsdt AAHG
AG_448352465.1 example2 190197 POKJ YUNIP
AG_449465753.1 example3 h837h8 JUJN
AG_449366462.1 example4 d34tw4
AG_444725037.1 example5 f45ge4 DFRF
AG_441227463.1 example6 f3fw4t YUNIP
AG_449986090.1 example7 gft7r4 NNHN
AG_445666926.1 example8 4vsr55 GHGF
AG_441004541.1 example9 fh893b YUNIP
AG_444837264.1 example0 k3883d YUNIP
This one makes a left join. The key idea is to use geneName as a key in a hash.
#! /usr/bin/perl
use strict;
use warnings;

my %data = ();

open $a, "file1";
while (<$a>) {
    chomp;
    my @c = split;
    $data{$c[2]} = [$c[0], $c[1], $c[2]];
}

open $b, "file2";
while (<$b>) {
    chomp;
    my @c = split;
    push @{$data{$c[0]}}, exists $c[1] ? $c[1] : "";
}

print map { "@{$_}\n" } values %data;
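Note that values %data returns rows in an unpredictable hash order, so the output will generally not follow file1's order. A minimal sketch of one fix, assuming file1's third column is unique: record the keys as they are read and print in that order.
my @order;                                # keys in file1 order
# in the file1 loop, after building the row: push @order, $c[2];
print map { "@{ $data{$_} }\n" } @order;  # instead of values %data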

Efficient way to read columns in a file using Perl

I have an input file like so, separated by newline characters.
AAA
BBB
BBA
What would be the most efficient way to combine the columns (vertically), first with first, second with second, etc.?
Sample OUTPUT:
ABB
ABB
ABA
I have been using the following, but am unable to figure out how to remove the scalar context from it. Any hints are appreciated:
while (<@seq_prot>) {
    chomp;
    my @sequence = map substr (@seq_prot, 1, 1), $start .. $end;
    @sequence = split;
}
My idea was to use substr to get the first letter of the input (A in this case), and it would cycle through all the other letters (the second A and B). Then I would increment the cycle number by 1 so as to get the next line, until I reached the end. Of course, I can't seem to get the first part going, so any help is greatly appreciated; I am stumped on this one.
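(For clarity, a minimal sketch of what that substr-per-column approach could look like, assuming every line has the same length; the file name is made up for illustration:
#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', 'seqs.txt' or die $!;   # hypothetical input file
chomp( my @seq_prot = <$fh> );
close $fh;

my $width = length $seq_prot[0];          # assumes equal-length lines
for my $col ( 0 .. $width - 1 ) {
    # take the $col-th character of every line and join them vertically
    print join( '', map { substr $_, $col, 1 } @seq_prot ), "\n";
}
)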
Basically, you're trying to transpose an array.
This can be done easily using Array::Transpose
use warnings;
use strict;
use Array::Transpose;

die "Usage: $0 filename\n" if @ARGV != 1;

for (transpose([map {chomp; [split //]} <>])) {
    print join("", map {$_ // " "} @$_), "\n";
}
For an input file:
ABCDEFGHIJKLMNOPQRS
12345678901234
abcdefghijklmnopq
ZYX
Will output:
A1aZ
B2bY
C3cX
D4d
E5e
F6f
G7g
H8h
I9i
J0j
K1k
L2l
M3m
N4n
O o
P p
Q q
R
S
You'll have to read in the file once for each column, or store the information and go through the data structure later.
I was originally thinking in terms of arrays of arrays, but I don't want to get into References.
I'm going to make the assumption that each line is the same length. Makes it simpler that way. We can use split to split your line into individual letters:
my $line = "ABC";
my @split_line = split //, $line;
This will give us:
$split_line[0] = "A";
$split_line[1] = "B";
$split_line[2] = "C";
What if we now took each letter, and placed it into @vertical_array?
my @vertical_array;
for my $index ( 0 .. $#split_line ) {
    $vertical_array[$index] .= $split_line[$index];
}
Now let's do this with the next line:
$line = "123";
#split_line = split //, $line;
for my $index ( 0..##split_line ) {
$vertical_array[$index] .= "$split_line[$index];
}
This will give us:
$vertical_array[0] = "A1";
$vertical_array[1] = "B2";
$vertical_array[2] = "C3";
As you can see, I'm building the @vertical_array with each iteration:
use strict;
use warnings;
use autodie;
use feature qw(say);

my @vertical_array;
while ( my $line = <DATA> ) {
    chomp $line;
    my @split_line = split //, $line;
    for my $index ( 0 .. $#split_line ) {
        $vertical_array[$index] .= $split_line[$index];
    }
}

#
# Print out your vertical lines
#
for my $line ( @vertical_array ) {
    say $line;
}

__DATA__
ABC
123
XYZ
BOY
FOO
BAR
This prints out:
A1XBFB
B2YOOA
C3ZYOR
If I had used references, I could probably have built an array of arrays and then flipped it. That's probably more efficient, but more complex. However, that may be better at handling lines of different lengths.
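For completeness, here is a sketch of that array-of-arrays idea, reading from STDIN and padding short lines with spaces (an illustration, not drop-in code):
use strict;
use warnings;

# Read all lines into an array of arrays of characters
my @rows = map { chomp; [ split // ] } <STDIN>;
die "no input\n" unless @rows;

# Width of the output is the length of the longest input line
my ($max) = sort { $b <=> $a } map { scalar @$_ } @rows;

# Flip rows and columns, padding missing cells with a space
for my $col ( 0 .. $max - 1 ) {
    print join( '', map { defined $_->[$col] ? $_->[$col] : ' ' } @rows ), "\n";
}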

When trying to print an array from sub only the first element prints

I'm writing a Perl script that requires me to pull out a whole column from a file and manipulate it. For example take out column A and compare it to another column in another file
A B C
A B C
A B C
So far I have:
sub routine1
{
    ( $_ = <FILE> )
    {
        next if $. < 2; # to skip header of file
        my @array1 = split(/\t/, $_);
        my $file1 = $array1[@_];
        return $file1;
    }
}
I have most of it done. The only problem is that when I print the subroutine's return value, it only prints the first element of the array (i.e. it will only print one A).
I am sure that what you actually have is this
sub routine1
{
    while ( $_ = <FILE> )
    {
        next if $. < 2; # to skip header of file
        my @array1 = split(/\t/, $_);
        my $file1 = $array1[@_];
        return $file1;
    }
}
which does compile, and reads the file one line at a time in a loop.
There are two problems here. First of all, as soon as your loop has read the first line of the file (after the header) the return statement exits the subroutine, returning the only field it has read. That is why you get only a single value.
Secondly, you have indexed your @array1 with @_. What that does is take the number of elements in @_ (usually one) and use that to index @array1. You will therefore always get the second element of the array.
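To see that point concretely: an array used as an index is evaluated in scalar context, yielding its element count. A tiny, hypothetical demonstration:
my @array1 = ( 'A', 'B', 'C' );
my @args   = ( 'anything' );   # one element, like a typical @_
print $array1[@args], "\n";    # prints "B": @args in scalar context is 1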
I'm not clear what you expect as a result, but you should write something like this. It accumulates all the values from the specified column into the array @retval, and passes the file handle into the subroutine instead of just using a global, which is poor programming practice.
use strict;
use warnings;

open my $fh, '<', 'myfile.txt' or die $!;
my @column2 = routine1($fh, 1);
print "@column2\n";

sub routine1 {
    my ($fh, $index) = @_;
    my @retval;
    while ($_ = <$fh>) {
        next if $. < 2; # to skip header of file
        my @fields = split /\t/;
        my $field = $fields[$index];
        push @retval, $field;
    }
    return @retval;
}
output
B B
Try replacing most of your sub with something like this:
my @aColumn = ();
while (<FILE>)
{
    chomp;
    ($Acol, $Bcol, $Ccol) = split("\t");
    push(@aColumn, $Acol);
}
return @aColumn;
Jumping to the end, the following will pull out the first column in your file blah.txt and put it in an array for you to manipulate later:
use strict;
use warnings;
use autodie;

my $file = 'blah.txt';
open my $fh, '<', $file;

my @firstcol;
while (<$fh>) {
    chomp;
    my @cols = split;
    push @firstcol, $cols[0];
}

use Data::Dump;
dd \@firstcol;
What you have right now isn't actually looping on the contents of the file, so you aren't going to be building an array.
Here are a few items for you to consider when crafting a subroutine solution for obtaining an array of column values from a file:
Skip the file header before entering the while loop to avoid a line-number comparison for each file line.
split only the number of columns you need by using split's LIMIT; this can significantly speed up the process (see the short example after this list).
Optionally, initialize a local copy of Perl's @ARGV with the file name, and let Perl handle the file I/O.
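As a quick illustration of the LIMIT argument mentioned above:
my @fields = split ' ', 'a b c d e', 3;   # LIMIT of 3
# @fields is now ('a', 'b', 'c d e') - only two splits are performed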
Borodin's solution of creating a subroutine that takes both the file name and the column number is excellent, so it's implemented below, too:
use strict;
use warnings;

my @colVals = getFileCol( 'File.txt', 0 );
print "@colVals\n";

sub getFileCol {
    local @ARGV = (shift);
    my ( $col, @arr ) = shift;
    <>; # skip file header
    while (<>) {
        my $val = ( split ' ', $_, $col + 2 )[$col] or next;
        push @arr, $val;
    }
    return @arr;
}
Output on your dataset:
A A
Hope this helps!

How to find text in data file and calculate average using perl

I would like to replace a grep | awk | perl command with a pure perl solution to make it quicker and simpler to run.
I want to match each line in input.txt against a data.txt file and calculate the average of the values associated with the matched ID names and numbers.
The input.txt contains 1 column of ID numbers:
FBgn0260798
FBgn0040007
FBgn0046692
I would like to match each ID number with its corresponding ID names and associated value. Here's an example of data.txt, where column 1 is the ID number, columns 2 and 3 are ID name1 and ID name2, and column 4 contains the values I want to average:
FBgn0260798 CG17665 CG17665 21.4497
FBgn0040007 Gprk1 CG40129 22.4236
FBgn0046692 RpL38 CG18001 1182.88
So far I have used grep and awk to produce an output file containing the values for the matched ID numbers, and then used that output file to calculate the counts and averages with the following commands:
# First part using grep | awk
exec < input.txt
while read line
do
    grep -w $line data.txt | cut -f1,2,3,4 | awk '{print $1,$2,$3,$4}' >> output.txt
done
# Second part with perl
open my $input, '<', "output_1.txt" or die; ## the output file is from the first part and has the same layout as the data.txt file
my $total = 0;
my $count = 0;
while (<$input>) {
    my ($name, $id1, $id2, $value) = split;
    $total += $value;
    $count += 1;
}
print "The total is $total\n";
print "The count is $count\n";
print "The average is ", $total / $count, "\n";
Both parts work OK, but I would like to simplify things by running just one script. I've been trying to find a quicker way of running the whole lot together in Perl, but after several hours of reading I am totally stuck on how to do it. I've been playing around with hashes, arrays, and if/elsif statements with zero success. If anyone has suggestions, that would be great.
Thanks,
Harriet
If I understand you, you have a data file that contains the name of each line and the value for that line. The other two IDs are not important.
You will use another file, the input file, which contains names matching those in the data file. These identify the values you want to average.
The fastest way is to create a hash keyed by name, whose values are the corresponding values from the data file. Because this is a hash, you can quickly locate the corresponding value. This is much faster than grepping the same array over and over again.
This first part will read in the data.txt file and store the name and value in a hash keyed by the name.
use strict;
use warnings;
use autodie; # This way, you don't have to check if you can't open the file
use feature qw(say);

use constant {
    INPUT_FILE => "input.txt",
    DATA_FILE  => "data.txt",
};

#
# Read in data.txt and get the values and keys
#
open my $data_fh, "<", DATA_FILE;
my %ids;
while ( my $line = <$data_fh> ) {
    chomp $line;
    my ($name, $id1, $id2, $value) = split /\s+/, $line;
    $ids{$name} = $value;
}
close $data_fh;
Now that you have this hash, it's easy to read through the input.txt file and locate the matching name from the data.txt file:
open my $input_fh, "<", INPUT_FILE;
my $count = 0;
my $total = 0;
while ( my $name = <$input_fh> ) {
    chomp $name;
    if ( not defined $ids{$name} ) {
        die qq(Cannot find matching id "$name" in data file\n);
    }
    $total += $ids{$name};
    $count += 1;
}
close $input_fh;

say "Average = ", $total / $count;
You read through each file once. I am assuming that you only have a single instance of each name in each file.
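For the three sample IDs above, all of which appear in data.txt, this prints something close to:
Average = 408.917766666667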

Iterate a simple program

I have the following file:
a b 5
c d 6
e f 7
g h 4
i j 3
k l 10
and I want to find which line holds the minimum value in the third column and erase it from the initial file. After this, I want to repeat the process, finding and removing the new minimum, two more times.
The output file should be
c d 6
e f 7
k l 10
I tried to write the following code:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $file1 = "try.dat";
open my $fg, "<", $file1 or die "Error during $file1 opening: $!";

my @vector;
while (<$fg>) {
    push @vector, [ split ];
}

my $minimum = $vector[0][2];
my @blacklist;
for my $i (0 .. $#vector) {
    if ($minimum > $vector[$i][2]) {
        $minimum = $vector[$i][2];
        push @blacklist, @vector[$i+1];
    }
}

#say "my minimum is $minimum";
#say "the blacklist is composed by @blacklist";
I don't know how to erase the elements contained in @blacklist (which in the first case should be i j 3) or how to iterate everything.
Any help for the iteration?
This sort of thing is what Tie::File was made for. It allows you to modify the file in place by modifying a tied array.
This program does what you want. The helper function minidx returns the first index of the element of the passed array that holds the smallest value.
The program works by copying the third field of the file's records into array @field3, and finding the index of the smallest value in there. The element at that index is then deleted from both the file and @field3 using splice.
use strict;
use warnings;
use Tie::File;

tie my @file, 'Tie::File', 'file.txt' or die $!;
my @field3 = map { (split)[2] } @file;

for (1 .. 3) {
    my $i = minidx(\@field3);
    splice @file, $i, 1;
    splice @field3, $i, 1;
}

sub minidx {
    my ($arr) = @_;
    my ($i, $v);
    for (0 .. $#$arr) {
        ($i, $v) = ($_, $arr->[$_]) unless defined $v and $arr->[$_] >= $v;
    }
    return $i;
}
output
c d 6
e f 7
k l 10
When you say @blacklist = @vector you are adding the entire @vector array to the blacklist. You probably want to do a push @blacklist, $vector[$i]. That will push the array reference onto the blacklist.
Now, @blacklist has an array ref in it, so you have to dereference it to print it:
say "the blacklist is composed by @{$blacklist[0]}";
Edit: For iterating and writing:
I would skip the @blacklist array (unless you need it for something else) and remove the minimum values from @vector. Then you can write @vector to some file.
my $num_elts_to_remove = 3;
for (my $j = 0; $j < $num_elts_to_remove; $j++) {
    my $minimum = $vector[0][2];
    my $min_idx = 0;
    for my $i (0 .. $#vector) {
        if ($minimum > $vector[$i][2]) {
            $minimum = $vector[$i][2];
            $min_idx = $i;
        }
    }
    push @blacklist, $vector[$min_idx];
    splice @vector, $min_idx, 1; # remove the row with the smallest value
}
Now write the array to a file
open OUT, ">", $outfile or die "Error: $!";
foreach (@vector) {
    print OUT join " ", @$_;
    print OUT "\n";
}
close(OUT);
Prints:
c d 6
e f 7
k l 10
Taking Borodin's Tie::File suggestion even further: I have written a cute module called Tie::Array::CSV which allows you to treat a delimited file as an array (and because it uses Tie::File underneath, it is both readable and writable). Because of this I can use Perlish operations like map and sort (and a Schwartzian transform!) to perform this task:
#!/usr/bin/env perl
use strict;
use warnings;
use Tie::Array::CSV;

tie my @data, 'Tie::Array::CSV', 'data', sep_char => ' ';

# get a list of row ids sorted by last value (increasing)
my $i = 0;
my @sorted =
    map  { $_->[0] }
    sort { $a->[1] <=> $b->[1] }
    map  { [$i++, $_->[-1]] }
    @data;

# splice out the rows with the three lowest values (from bottom to top)
splice @data, $_, 1 for reverse sort { $a <=> $b } @sorted[0..2];
Note that at the end you want to remove rows from the bottom up so that you don't have to reindex after every deletion.
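The reverse ordering matters because each splice shifts all later indices down by one; deleting from the highest index first keeps the remaining indices valid. A standalone illustration:
my @list = qw( a b c d e );
# delete indices 1 and 3, highest first
splice @list, $_, 1 for reverse sort { $a <=> $b } ( 1, 3 );
print "@list\n";   # prints "a c e"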