How can I find the elements appearing in two columns of a tab-delimited file? - perl

I have a file which is tab delimited, and has two columns, A and B.
I want to count the number of times an element in B is repeated in A. I could have done it in Excel, but since the two columns contain more than 200k elements, it hangs.
I tried with this code but it counts elements in itself:
my %counts = ();
for (#A) {
$count{$_}++;
}
foreach my $k(keys %counts) {
print "$k\t$count{$k}\n";
}

Try this solution:
use strict;
use warnings;
my %countx;
my #y;
my $file = 'ab.txt';
open my $fh, '<', $file or die "Couldn't open $file";
while (my $line = <$fh>) {
chomp $line; # remove newline
# I've avoided using $a and $b because they are special variables in perl
my ( $x, $y ) = split /\t/, $line;
$countx{ $x }++;
push #y, $y;
}
close $fh;
foreach my $y (#y) {
my $count = $countx{ $y } || 0;
print "$y\t$count\n";
}

Related

I need the output in following way in perl

# File-
# a,b,c,d,e,f
# 1,2,3,4,3,2
# 9,8,7,6,5,0
# 2,3,4,6,7,8
# i need output like this:-
# a=1,d=4,c=3,a=9,d=6,c=7,a=2,d=6,c=4
# but my program gives this:-
# a=1,d=4,c=3a=9,d=6,c=7a=2,d=6,c=4 (there is no , after c and a)
my script is :-
open ($fh, 'parse.txt');
my #arr;
my $dummy=<$fh>;
while (<$fh>) {
chomp;
$a = substr $_, 0,1;
$b = substr $_, 6,1;
$c = substr $_, 4,1;
print "a=$a,d=$b,c=$c";
}
close (IN);
my $prefix = '';
while (<$fh>) {
chomp;
my #fields = split /,/;
print $prefix."a=$fields[0],d=$fields[3],c=$fields[2]";
$prefix = ',';
}
print("\n");
or
my #recs;
while (<$fh>) {
chomp;
my #fields = split /,/;
push #recs, "a=$fields[0],d=$fields[3],c=$fields[2]";
}
print(join(',', #recs), "\n");
Instead of printing out the values you could append them to a string and include a comma after the "c" value. Then at the end of the loop, erase the final comma from the string and print it out. There are some scalability problems if your input file is too large. But if it's a reasonable size there shouldn't be any substantial issue.
my $output;
my $dummy=<$fh>;
while (<$fh>) {
chomp;
$a = substr $_, 0,1;
$b = substr $_, 6,1;
$c = substr $_, 4,1;
$output .= "a=$a,d=$b,c=$c,";
}
chop $output;
print $output;
If you have fields with separators split the line and collect needed elements
use warnings;
use strict;
use feature 'say';
my $file = 'parse.txt';
open my $fh, '<', $file or die "Can't open $file: $!";
my $dummy = <$fh>;
my #res;
while (<$fh>)
{
my ($a, $d, $c) = (split /,/)[0,3,2];
push #res, "a=$a,d=$d,c=$c";
}
say join ',', #res;
or pick the order in the assignment
my ($a, $c, $d) = (split /,/)[0,2,3];

Print a variable which is inside two loops

I couldn't figure it out how to escape this.
I would like to print the variable $rfam_column, which is inside two loops. But I cannot just write the print command right after the place where $rfam_column appears, because I would like to print other things which will be outside the loop and combine them to the printed content.
I would appreciate any advice as to what I'm doing wrong here.
use warnings;
use strict;
my $in;
GetOptions('input' => \$in) or die;
if ( $in ) {
my $input = $ARGV[0] or die;
open (my $fh, '<', $input) or die "Can't open $input $!\n";
chomp (my #db_file = <$fh>);
close $fh;
my #list = grep /RNA/, #db_file;
my $column;
my #column = ();
foreach $column ( #list ) {
my #all_columns = split (/\t/, $column);
my $rfam_column = $all_columns[0];
# insert "|" between RFs
foreach $_ ( $rfam_column ) {
s/^/|/;
}
}
}
print "$rfam_column";
Global symbol "$rfam_column" requires explicit package name at script_vbeta.pl line 90.
Execution of script_vbeta.pl aborted due to compilation errors.
EDITED to include all the code and information of the input--output as suggested:
Input file is a table with n lines vs n columns like this (I extracted a few columns otherwise it would be much long to represent in a line):
RF00001 1302 5S ribosomal RNA
RF00006 1307 Vault RNA
RF00007 1308 U12 minor spliceosomal RNA
RF00008 1309 Hammerhead ribozyme (type III)
Output should be like this:
|RF00001|RF00006|RF00007
And the code (usage: script.pl -i input_file):
use warnings;
use strict;
use Getopt::Long;
Getopt::Long::Configure("pass_through");
my $in;
GetOptions('input' => \$in) or die;
if ( $in ) {
my $input = $ARGV[0] or die;
open (my $fh, '<', $input) or die "Can't open $input $!\n";
chomp (my #db_file = <$fh>);
close $fh;
my #list = grep /RNA/, #db_file;
my $column;
my #column = ();
foreach $column ( #list ) {
my #all_columns = split (/\t/, $column);
my $rfam_column = $all_columns[0];
# insert "|" between RFs
foreach $_ ( $rfam_column ) {
s/^/|/;
}
}
}
print "$rfam_column";
I think you want
if ($in) {
...
my #rfams;
for my $row (#list) {
my #fields = split(/\t/, $row);
my $rfam = $fields[0];
push #rfams, $rfam;
}
my $rfams = join('|', #rfams);
print("$rfams\n");
}
I would like to print other things which will be outside the loop and combine them to the $rfam_column content
You can include anything that is in an outer scope in print. You can just put your print statement inside the inner loop
By the way, I don't know what you mean by
# insert "|" between RFs
foreach $_ ($rfam_column) {
s/^/|/;
}
That is the same as
$rfam_column =~ s/^/|/;
which just adds a pipe | character to the beginning of the string
What is an RF?

Perl : Need to append two columns if the ID's are repeating

If id gets repeated I am appending app1, app2 and printing it once.
Input:
id|Name|app1|app2
1|abc|234|231|
2|xyz|123|215|
1|abc|265|321|
3|asd|213|235|
Output:
id|Name|app1|app2
1|abc|234,265|231,321|
2|xyz|123|215|
3|asd|213|235|
Output I'm getting:
id|Name|app1|app2
1|abc|234,231|
2|xyz|123,215|
1|abc|265,321|
3|asd|213,235|
My Code:
#! usr/bin/perl
use strict;
use warnings;
my $basedir = 'E:\Perl\Input\\';
my $file ='doctor.txt';
my $counter = 0;
my %RepeatNumber;
my $pos=0;
open(OUTFILE, '>', 'E:\Perl\Output\DoctorOpFile.csv') || die $!;
open(FH, '<', join('', $basedir, $file)) || die $!;
my $line = readline(FH);
unless ($counter) {
chomp $line;
print OUTFILE $line;
print OUTFILE "\n";
}
while ($line = readline(FH)) {
chomp $line;
my #obj = split('\|',$line);
if($RepeatNumber{$obj[0]}++) {
my $str1= join("|",$obj[0]);
my $str2=join(",",$obj[2],$obj[3]);
print OUTFILE join("|",$str1,$str2);
print OUTFILE "\n";
}
}
This should do the trick:
use strict;
use warnings;
my $file_in = "doctor.txt";
open (FF, "<$file_in");
my $temp = <FF>; # remove first line
my %out;
while (<FF>)
{
my ($id, $Name, $app1, $app2) = split /\|/, $_;
$out{$id}[0] = $Name;
push #{$out{$id}[1]}, $app1;
push #{$out{$id}[2]}, $app2;
}
foreach my $key (keys %out)
{
print $key, "|", $out{$key}[0], "|", join (",", #{$out{$key}[1]}), "|", join (",", #{$out{$key}[2]}), "\n";
}
EDIT
To see what the %out contains (in case it's not clear), you can use
use Data::Dumper;
and print it via
print Dumper(%out);
I'd tackle it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use 5.14.0;
my %stuff;
#extract the header row.
#use the regex to remove the linefeed, because
#we can't chomp it inline like this.
#works since perl 5.14
#otherwise we could just chomp (#header) later.
my ( $id, #header ) = split( /\|/, <DATA> =~ s/\n//r );
while (<DATA>) {
#turn this row into a hash of key-values.
my %row;
( $id, #row{#header} ) = split(/\|/);
#print for diag
print Dumper \%row;
#iterate each key, and insert into $row.
foreach my $key ( keys %row ) {
push( #{ $stuff{$id}{$key} }, $row{$key} );
}
}
#print for diag
print Dumper \%stuff;
print join ("|", "id", #header ),"\n";
#iterate ids in the hash
foreach my $id ( sort keys %stuff ) {
#join this record by '|'.
print join('|',
$id,
#turn inner arrays into comma separated via map.
map {
my %seen;
#use grep to remove dupes - e.g. "abc,abc" -> "abc"
join( ",", grep !$seen{$_}++, #$_ )
} #{ $stuff{$id} }{#header}
),
"\n";
}
__DATA__
id|Name|app1|app2
1|abc|234|231|
2|xyz|123|215|
1|abc|265|321|
3|asd|213|235|
This is perhaps a bit overkill for your application, but it should handle arbitrary column headings and arbitary numbers of duplicates. I'll coalesce them though - so the two abc entries don't end up abc,abc.
Output is:
id|Name|app1|app2
1|abc|234,265|231,321
2|xyz|123|215
3|asd|213|235
Another way of doing it which doesn't use a hash (in case you want to be more memory efficient), my contribution lies under the opens:
#!/usr/bin/perl
use strict;
use warnings;
my $basedir = 'E:\Perl\Input\\';
my $file ='doctor.txt';
open(OUTFILE, '>', 'E:\Perl\Output\DoctorOpFile.csv') || die $!;
select(OUTFILE);
open(FH, '<', join('', $basedir, $file)) || die $!;
print(scalar(<FH>));
my #lastobj = (undef);
foreach my $obj (sort {$a->[0] <=> $b->[0]}
map {chomp;[split('|')]} <FH>) {
if(defined($lastobj[0]) &&
$obj[0] eq $lastobj[0])
{#lastobj = (#obj[0..1],
$lastobj[2].','.$obj[2],
$lastobj[3].','.$obj[3])}
else
{
if($lastobj[0] ne '')
{print(join('|',#lastobj),"|\n")}
#lastobj = #obj[0..3];
}
}
print(join('|',#lastobj),"|\n");
Note that split, without it's third argument ignores empty elements, which is why you have to add the last bar. If you don't do a chomp, you won't need to supply the bar or the trailing hard return, but you would have to record $obj[4].

how to write my results to external file in perl

I am trying to read some particular columns from myu data into my output file, i succeed in this reading one cloumn at a time but i want to read some more columns of my interest at a time (i have list of column i want to extract in a separate tex file) because extract individual column and joining them to make one separate file will become hectic to me, here is the code i tried to extract single coulmn,
#!/usr/bin/perl
use strict;
use warnings;
open (DATA, "<file.txt") or die ("Unable to open file");
my $search_string = "IADC512444";
my $header = <DATA>;
my #header_titles = split /\t/, $header;
my $extract_col = 0;
for my $header_line (#header_titles) {
last if $header_line =~ m/$search_string/;
$extract_col++;
}
print "Extracting column $extract_col\n";
while ( my $row = <DATA> ) {
last unless $row =~ /\S/;
chomp $row;
my #cells = split /\t/, $row;
print "$cells[$extract_col] ";
}
is there any possibility to extract all columns at a time instead of only IADC512444 i want from my textfile into outfile on to my harddisc? please help me in solving this problem,
Thanks
If you need to print the contents to a file on disk then you should open a file in write mode and write to it. Also if you want more columns you can do that by accessing corresponding element in the array cells. In this example i am printing the column you are printing plus column 1 and 2
open(OUT_FILE,">path_to_out_file") || die "cant open file...";
while ( my $row = <DATA> ) {
last unless $row =~ /\S/;
chomp $row;
my #cells = split /\t/, $row;
#print "$cells[$extract_col] ";
print OUT_FILE "$cells[$extract_col],$cells[1],$cells[2]\n";
}
close(OUT_FILE)
I have tweaked the code little bit to suit your requirement.
In the variable req_hdr_string you should say the column names which you require separated by ,
So it will be splitted and stored in a hash.
Then from the header i get the position of the column and print only those
#!/usr/bin/perl
use strict;
use warnings;
open (DATA, "<h11.txt") or die ("Unable to open file");
my $req_hdr_string = "abc,ghi,mno,";
my %req_hdrs = ();
my %extract_col = ();
foreach(split /,/, $req_hdr_string)
{
print "req hdr is:$_\n";
$req_hdrs{$_} = $_;
}
my $index = 0;
my $header = <DATA>;
chomp $header;
foreach (split /\t/, $header)
{
print "input is:|$_|\n";
if(exists $req_hdrs{$_})
{
print "\treq index is:$index\n";
$extract_col{$index} = 1;
}
$index++;
}
open(OUT_FILE,">out_file") || die "cant open file...";
while ( my $row = <DATA> )
{
last unless $row =~ /\S/;
chomp $row;
my #cells = split /\t/, $row;
foreach $index (sort keys%extract_col)
{
print OUT_FILE "$cells[$index],";
}
print OUT_FILE "\n";
}
close(OUT_FILE);
close(DATA);

How do I input file line results into an array?

My code so far only reads lines 1 to 4 and prints them. What I want to do instead of printing them is putting them into an array. So any help would be greatly appreciated. And hopefully just the code since it should be short. I learn much faster looking at full code instead of opening another 50 tabs trying to put multiple concepts together. Hopefully I'll learn this at some point and won't require help.
my $x = 1;
my $y = 4;
open FILE, "file.txt" or die "can not open file";
while (<FILE>) {
print if $. == $x .. $. == $y;
}
You should just put each line in an array with push :
my $x = 1;
my $y = 4;
my #array;
open FILE, "file.txt" or die "can not open file";
while (<FILE>) {
push (#array, $_) if ($. >= $x || $. <= $y);
}
foreach at the end is just proof it works - note it doesn't ignore blank lines - figured you may want to keep them.
#!/usr/bin/perl
use warnings;
use strict;
my $fi;
my $line;
my $i = 0;
my #array;
open($fi, "< file.txt");
while ($line = <$fi>) {
$array[$i] = $line;
if ($i == 3)
{
last;
}
$i++;
}
foreach(#array)
{
print $_;
}
you know, you don't need to keep iterating through the file once you've got all the data you need.
my $x = 1;
my $y = 4;
my #array;
my $file = 'file.txt';
# Lexical filehandle, three-argument open, meaningful error message
open my $file_h, '<', $file or die "cannot open $file: $!";
while (<$file_h>) {
push #array $_ if $_ >= $x; # This condition is unnecessary when $x is 1
last if $. == $y;
}