merge multiple files with similar column - perl

I have 30 files where column 1 is similar in each file. I would like to join the files based on column 1 so that the output file contains column 2 from each of the input files. I know how to join two files, but struggle with multiple files.
join -1 1 -2 1 File1 File2
The files are tab-separated with no header like this
File1
5S_rRNA 1324
5_8S_rRNA 32
7SK 15
ACA59 0
ACA64 0
BC040587 0
CDKN2B-AS 0
CDKN2B-AS_2 0
CDKN2B-AS_3 0
CLRN1-AS1 0
File2
5S_rRNA 571
5_8S_rRNA 11
7SK 5
ACA59 0
ACA64 0
BC040587 0
CDKN2B-AS 0
CDKN2B-AS_2 0
CDKN2B-AS_3 0
CLRN1-AS1 0
Output
5S_rRNA 1324 571
5_8S_rRNA 32 11
7SK 15 5
ACA59 0 0
ACA64 0 0
BC040587 0 0
CDKN2B-AS 0 0
CDKN2B-AS_2 0 0
CDKN2B-AS_3 0 0
CLRN1-AS1 0 0

First memory is the problem as the file size increases.Second if the ordering of the content is not important this will work good.
#!/usr/bin/perl
use strict;
use warnings;
my %hash;
my ($key,$value);
my #files=<files/*>;
foreach(#files){
open my $fh, '<', $_ or die "unable to open file: $! \n";
while(<$fh>){
chomp;
($key,$value)=split;
push(#{$hash{$key}},$value);
}
close($fh);
}
for(keys %hash){
print "$_ #{$hash{$_}} \n";
}

Below code will give your desire output but it will take more memory when number of files will increase (as you said there are 30 files). By using sort it sort the hash in alphabetical order of its keys (will give the output in same order as you mentioned in question).
#!/usr/bin/perl
use strict;
use warnings;
my #files = qw| input.log input1.log |; #you can give here path of files, or use #ARGV if you wish to pass files from command line
my %data;
foreach my $filename (#files)
{
open my $fh, '<', $filename or die "Cannot open $filename for reading: $!";
while (my $line = <$fh>)
{
chomp $line;
my ($col1, $col2) = split /\s+/, $line;
push #{ $data{$col1} }, $col2; #create an hash of array
}
}
foreach my $col1 (sort keys %data)
{
print join("\t", $col1, #{ $data{$col1} }), "\n";
}
Output:
5S_rRNA 1324 571
5_8S_rRNA 32 11
7SK 15 5
ACA59 0 0
ACA64 0 0
BC040587 0 0
CDKN2B-AS 0 0
CDKN2B-AS_2 0 0
CDKN2B-AS_3 0 0
CLRN1-AS1 0 0

Related

File parsing using perl

I am stuck in middle , i need help .
i have two files :
file1:
Total X :
Total y :
Total z :
Total t :
file 2:
4790351 4786929 3422 0
84860 84860 0 0
206626 206626 0 0
93902 93823 79 0
now i want output like this in third file
Total X : 4790351 4786929 3422 0
Total y : 84860 84860 0 0
Total z : 206626 206626 0 0
Total t : 93902 93823 79 0
This is my code below to try the parsing :Please help me getting the required output
while ( not eof $tata and not eof $outfh )
{
my #vals1 = split /":"/,<$tata>;
my #vals2 = split /\s+/, <$outfh>;
my #sum = join "\t", map { $vals1,$vals2[$_]} 0 .. $#vals2;
printf $_ for #sum,"\n";
}
use strict;
use warnings;
use 5.020;
use autodie;
use Data::Dumper;
open my $FILE1, "<", "file1.txt";
open my $FILE2, "<", "file2.txt";
open my $OUTFILE, ">", "results.txt";
my $first_line = <$FILE1>;
close $FILE1;
my #line_prefixes = split /\s*:\s*/, $first_line;
while (my $line = <$FILE2>) {
print {$OUTFILE} "$line_prefixes[$. - 1]: $line";
}
close $FILE2;
close $OUTFILE;
$. is the current line number in the file ($. equals 1 for the first line).
A sample run:
/pperl_programs$ cat file1.txt
Total X : Total y : Total z : Total t :
~/pperl_programs$ cat file2.txt
4790351 4786929 3422 0
84860 84860 0 0
206626 206626 0 0
93902 93823 79 0
~/pperl_programs$ cat results.txt
~/pperl_programs$ perl myprog.pl
~/pperl_programs$ cat results.txt
Total X: 4790351 4786929 3422 0
Total y: 84860 84860 0 0
Total z: 206626 206626 0 0
Total t: 93902 93823 79 0
~/pperl_programs$
For your altered files:
use strict;
use warnings;
use 5.020;
use autodie;
use Data::Dumper;
open my $FILE1, "<", "file1.txt";
open my $FILE2, "<", "file2.txt";
open my $OUTFILE, ">", "results.txt";
chomp(my #line_prefixes = <$FILE1>);
close $FILE1;
while (my $line = <$FILE2>) {
print {$OUTFILE} "$line_prefixes[$.-1] $line";
}
close $FILE2;
close $OUTFILE;
Sample output:
~/pperl_programs$ cat file1.txt
Total X :
Total y :
Total z :
Total t :
~/pperl_programs$ cat file2.txt
4790351 4786929 3422 0
84860 84860 0 0
206626 206626 0 0
93902 93823 79 0
~/pperl_programs$ cat results.txt
~/pperl_programs$ perl 1.pl
~/pperl_programs$ cat results.txt
Total X : 4790351 4786929 3422 0
Total y : 84860 84860 0 0
Total z : 206626 206626 0 0
Total t : 93902 93823 79 0
If your files are big, you probably don't want to read the whole first file into memory. If that's the case, you can read each file line by line:
use strict;
use warnings;
use 5.020;
use autodie;
use Data::Dumper;
open my $FILE1, "<", "file1.txt";
open my $FILE2, "<", "file2.txt";
open my $OUTFILE, ">", "results.txt";
while (!eof($FILE1) and !eof($FILE2) ) {
my $line_prefix = <$FILE1>;
chomp $line_prefix;
my $numbers_line = <$FILE2>;
chomp $numbers_line;
my #numbers = split /\s+/, $numbers_line;
my $fifth_column = $numbers[1] / $numbers[0];
say {$OUTFILE} "$line_prefix $numbers_line $fifth_column";
}
close $FILE1;
close $FILE2;
close $OUTFILE;
Your specification has a few loose ends; for instance - what if there are more lines in file2 then there are total labels in file1? Do you want the spaces in the first input file ignored? Do you want the spaces in the output file specifically as shown? ... and you really don't want any totals calculated??
I've presumed "yes" to most of these questions. My solution is driven by the second data file - which means that if there are more total labels then there are lines of data, they are going to be ignored. It also means that if there are more data lines in file2 then there are labels in file1, the program will simply make up the label - No Label?.
Finally, just in case you want to add these numbers up at some point, I've included but commented the sum function from List::Util.
use v5.12;
use File::Slurp;
# use File::Util qw( sum );
my $file1 = "file1.txt";
my $file2 = "file2.txt";
my $file3 = "file3.txt";
open(my $outfh, '>', $file3) or die "$file3: $!";
my #vals1 = split /\s*:\s*\n/ , read_file($file1);
my #vals2 = read_file($file2);
while (my $line = shift #vals2) {
chomp $line;
# my $total = sum split(" ", $line);
printf $outfh "%s : %s\n" , shift #vals1 // "No Label?" , $line ;
}
#
# $ cat file3.txt
Total X : 4790351 4786929 3422 0
Total y : 84860 84860 0 0
Total z : 206626 206626 0 0
Total t : 93902 93823 79 0

i have csv file and how to create new file for every same first five digit phone number and file name also shoud be first 5 digit

I have csv file like below.
Service Area Code Phone Numbers Preferences
17 9861511646 0 D 2
17 9861310397 0 D 2
13 9827035035 0 A 2
13 9827304969 0 D 2
13 9827355786 0 A 2
13 9827702373 0 A 2
17 9861424414 0 D 2
13 9827702806 0 A 2
23 9832380279 0 D 2
13 9827231370 0 D 2
13 9827163453 0 D 2
and i want to create new file according to first 4 digit like 9861.csv, 9827.csv etc
and data should be like this in 9861.csv:
Service Area Code Phone Numbers Preferences
17 9861511646 0 D 2
17 9861310397 0 D 2
17 9861424414 0 D 2
in 9827.csv data:
Service Area Code Phone Numbers Preferences
13 9827035035 0 A 2
13 9827304969 0 D 2
13 9827355786 0 A 2
13 9827702373 0 A 2
13 9827702806 0 A 2
13 9827231370 0 D 2
13 9827163453 0 D 2
here my code
my $file = "mycsvfile.csv";
open(my $data, '<', $file) or die "Could not open '$file' $!\n";
while (my $line = <$data>) {
my #fields = split "," , $line;
my $first_four = substr ($fields[1], -10, 4,);
open $line{$first_four}, '>', "$first_four.csv";
print { $line{$first_four} } $line;
close OUT;
}
Use Text::CSV it will take some of the hassle (like the header line) from you.
I don't understand why you use $first_five instead of $first_four...
my $file = "mycsvfile.csv";
open(my $data, '<', $file) or die "Could not open '$file' $!\n";
while (my $line = <$data>) {
my #fields = split "," , $line;
my $first_four = substr($fields[1], 0, 4);
open my $fh, '>>', "$first_four.csv" or die $!;
print {$fh} $line;
close $fh;
}
close $data;

Combining duplicated lines in txt file with perl

I am trying to combine duplicate lines using Perl with little luck. My tab-delimited text file is structured as follows (spaces added for readability):
Pentamer Probability Observed Length
ATGCA 0.008 1 16
TGTAC 0.021 1 16
GGCAT 0.008 1 16
CAGTG 0.004 1 16
ATGCA 0.016 2 23
TGTAC 0.007 1 23
I would like to be combine duplicated lines by adding the three numeric columns, therefor the line containing "ATGCA" would now look like this:
ATGCA 0.024 3 39
Any ideas/help/suggestions would be greatly appreciated! Thanks!
#!/usr/bin/perl
use warnings;
use strict;
my %hash;
while(<>) {
my #v = split(/\s+/);
if (defined $hash{$v[0]}) {
my $arr = $hash{$v[0]};
$hash{$v[0]} = [$v[0], $arr->[1] + $v[1],
$arr->[2] + $v[2], $arr->[3] + $v[3]];
} else {
$hash{$v[0]} = [#v];
}
}
foreach my $key (keys %hash) {
print join(" ", #{$hash{$key}}), "\n";
}
Here's another option:
use Modern::Perl;
my %hash;
while ( my $line = <DATA> ) {
my #vals = split /\s+/, $line;
$hash{ $vals[0] }->[$_] += $vals[ $_ + 1 ] for 0 .. 2;
}
say join "\t", $_, #{ $hash{$_} } for sort keys %hash;
__DATA__
ATGCA 0.008 1 16
TGTAC 0.021 1 16
GGCAT 0.008 1 16
CAGTG 0.004 1 16
ATGCA 0.016 2 23
TGTAC 0.007 1 23
Output:
ATGCA 0.024 3 39
CAGTG 0.004 1 16
GGCAT 0.008 1 16
TGTAC 0.028 2 39

Perl: Matching four different files and obtaining particular Information in output file

I have four files. File 1 (named as inupt_22.txt) is an input file containing two columns (space delimited). First column is the alphabetically sorted list of ligandcode (three letter/number code for a particular ligand). Second column is a list of PDBcodes (Protein Data Bank code) respective of each ligandcode (unsorted list though).
File 1 (input_22.txt):
803 1cqp
AMH 1b2i
ASC 1f9g
ETS 1cil
MIT 1dwc
TFP 1ctr
VDX 1db1
ZMR 1a4g
File 2(named as SD_2.txt) is a SDF (Structure Data file) for fragments of each ligand. A ligand can contain one or more than one fragments. For instance, here 803 is the ligandcode and it has two fragments. So the file will look like: four dollar sign ($$$$) followed by ligandcode (i.e 803 in this example) in next line. every fragment follows the same thing. Next, in the 5th line of each fragment (third line from $$$$.\n803), there is a number that represents number of rows in next block of rows, like 7 in first fragment and 10 in next fragment of 803 ligand. Now, next block of rows contains a column (61-62) which contains specific number that refers to atoms in fragments. For example in first fragment of 803, these numbers are 15,16,17,19,20,21,22. These numbers need to be matched in file 3.
File 2 (SD_2.txt) looks like:
$$$$
803
SciTegic05101215222D
7 7 0 0 0 0 999 V2000
3.0215 -0.5775 0.0000 C 0 0 0 0 0 0 0 0 0 15 0 0
2.3070 -0.9900 0.0000 C 0 0 0 0 0 0 0 0 0 16 0 0
1.5926 -0.5775 0.0000 C 0 0 0 0 0 0 0 0 0 17 0 0
1.5926 0.2475 0.0000 C 0 0 0 0 0 0 0 0 0 19 0 0
2.3070 0.6600 0.0000 C 0 0 0 0 0 0 0 0 0 20 0 0
2.3070 1.4850 0.0000 O 0 0 0 0 0 0 0 0 0 21 0 0
3.0215 0.2475 0.0000 O 0 0 0 0 0 0 0 0 0 22 0 0
1 2 1 0
1 7 1 0
2 3 1 0
3 4 1 0
4 5 1 0
5 6 2 0
5 7 1 0
M END
> <Name>
803
> <Num_Rings>
1
> <Num_CSP3>
4
> <Fsp3>
0.8
> <Fstereo>
0
$$$$
803
SciTegic05101215222D
10 11 0 0 0 0 999 V2000
-1.7992 -1.7457 0.0000 C 0 0 0 0 0 0 0 0 0 1 0 0
-2.5137 -1.3332 0.0000 C 0 0 0 0 0 0 0 0 0 2 0 0
-2.5137 -0.5082 0.0000 C 0 0 0 0 0 0 0 0 0 3 0 0
-1.7992 -0.0957 0.0000 C 0 0 0 0 0 0 0 0 0 5 0 0
-1.0847 -0.5082 0.0000 C 0 0 0 0 0 0 0 0 0 6 0 0
-0.3702 -0.0957 0.0000 C 0 0 0 0 0 0 0 0 0 7 0 0
0.3442 -0.5082 0.0000 C 0 0 0 0 0 0 0 0 0 8 0 0
0.3442 -1.3332 0.0000 C 0 0 0 0 0 0 0 0 0 9 0 0
-0.3702 -1.7457 0.0000 C 0 0 0 0 0 0 0 0 0 11 0 0
-1.0847 -1.3332 0.0000 C 0 0 0 0 0 0 0 0 0 12 0 0
1 2 1 0
1 10 1 0
2 3 1 0
3 4 1 0
4 5 2 0
5 6 1 0
5 10 1 0
6 7 2 0
7 8 1 0
8 9 1 0
10 9 1 0
M END
> <Name>
803
> <Num_Rings>
2
> <Num_CSP3>
6
> <Fsp3>
0.6
> <Fstereo>
0.1
File 3 is CIF (Crystallographic Information file). This file can be obtained from following link: File_3
This file is a collection of individual cif files for several ligand molecules. Each part in file starts with data_ligandcode. For our example it will be data_803. After 46 lines from the start of each small file in collection, there is a block that gives structural information about the molecule. The number of rows in this block is not fixed. However, this block ends with an Hash sign (#). In this block two columns are important which are 53-56 and 62-63. 62-63 column contains numbers that can be matched from numbers obtained from file 2. And, 53-56 contains atom names like C1 (Carbon 1) etc. This column can be used to match with file 4.
File 4 is a Grow.out file that contains information about interaction of each ligand with their target protein. The file name is the PDBcode given in file 1 against each ligand. For example for ligand 803 the PDBcode is 1cqp. So, the grow.out file will be having name of 1cqp. 1cqp
In this file those rows are important those contain ligandcode (for example 803) and and the atom name obtained from 53-56 column of file three.
Task: I need a script that reads ligandcode from File 1, goes to file 2 search for $$$$ . \nLigandcode and then obtain numbers from column 61-62 for each fragment. Then in next step my script should pass these number to file 3 and match the rows containing these number in column 62-63 of file 3 and then pull out the information in column 53-56 (atom names). And last step will be opening of file 4 with the name of PDBcode and then printing the rows containing ligandcode and the atom names obtained from file 3. The printing should be done in an output file.
I am a Biomedical Research student. I don't have computer science background. However, I have to use Perl programming for some task. For the above mentioned task I wrote a script, but it is not working properly and I can not find the reason behind it. The script I wrote is :
#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;
use Carp qw(croak);
{
my $a;
my $b;
my $input_file = "input_22.txt";
my #lines = slurp($input_file);
for my $line (#lines){
my ($ligandcode, $pdbcode) = split(/\t/, $line);
my $i=0;
my $k=0;
my #array;
my #array1;
open (FILE, '<', "SD_2.txt");
while (<FILE>) {
my $i=0;
my $k=0;
my #array;
my #array1;
if ( $_=~/\x24\x24\x24\x24/ . /\n$ligandcode/) {
my $nextline1 = <FILE>;
my $nextline2 = <FILE>;
my $nextline3 = <FILE>;
my $nextline4= <FILE>;
my $totalatoms= substr( $nextline4, 1,2);
print $totalatoms,"\n";
while ($i<$totalatoms)
{
my $nextlines= <FILE>;
my $sub= substr($nextlines, 61, 2);
print $sub;
$array[$i] = $sub;
open (FH, '<', "components.txt");
while (my $ship=<FH>) {
my $var="data_$ligandcode";
if ($ship=~/$var/)
{
while ($k<=44)
{
$k++;
my $nextline = <FH>;
}
my $j=0;
my $nextline3;
do
{
$nextline3=<FH>;
print $nextline3;
my $part= substr($nextline3, 62, 2);
my $part2= substr($nextline3, 53, 4);
$array1[$j] = $part;
if ($array1[$j] eq $array[$i])
{
print $part2, "\n";
open (GH, '<', "$pdbcode");
open (OH, ">>out_grow.txt");
while (my $grow = <GH>)
{
if ( $grow=~/$ligandcode/){
print OH $grow if $grow=~/$part2/;
}}
close (GH);
close (OH);
}
$j++;
} while $nextline3 !~/\x23/;
}
}
$i++;
close (FH);
}
}}
close (FILE);
}
}
##Slurps a file into a list
sub slurp {
my ($file) = #_;
my (#data, #data_chomped);
open IN, "<", $file or croak "can't open $file\n";
#data = <IN>;
for my $line (#data){
chomp($line);
push (#data_chomped, $line);
}
close IN;
return (#data_chomped);
}
I want to make it a script that works fast and works for 1000 fragments altogether, if I make a list of 400 molecules in file 1. Kindly help me to make this script working. I ll be grateful.
You need to break your code into manageable steps.
Create data-structures from the files
use Slurp;
my #input = map{
[ split /\s+/, $_, 2 ]
} slurp $input_filename;
# etc
Process each element of input_22.txt, using those data structures.
I really think you should look into PerlMol. After all, half the reason to use Perl is CPAN.
Things you did well
Using 3-arg open
use strict;
use warnings;
Things you shouldn't have done
(Re)defined $a and $b
They are already defined for you.
Reimplemented slurp (poorly)
Read the same file in multiple times.
You opened SD_2.txt once for every line of input_22.txt.
Defined symbols outside of the scope where you use them.
$j, $k, #array and #array1 are defined twice, but only one of the definitions is being used.
Used open and close without some sort of error checking.
Either open ... or die; or use autodie;
You used bareword filehandles. IN, FILE etc
Instead use open my $FH, ...
Most of those aren't that big of a deal though, for a one-off program.

Truth Table Generation for the given input

I want to generate a truth table for the given input.Suppose if i give input 2 the output will be
10 01 11 00
if the input is 3 the output will be
111 000 110 101 011 100 001 010
i have a code snippet
#!/usr/bin/perl
#print "a|b|c\n";
for $a (1, 0){
for $b (1, 0){
for $c (1,0) {
for $d ( 1,0)
{
print "$a $b $c $d";
#print $x = ($a & $b & $c);
print "\n";
}
}
}
}
print "\n";
above code is for 4.
i don't know how to do this without writing multiple for loops. Here for value 2 i need to write two for loops and so on.
can any body tell me how to tweak this code for several input values.
Any help would be great appreciated
Recursion
Here is a simple solution using recursion:
#!/usr/bin/perl -w
my $variables=$ARGV[0]||0;
show_combinations($variables);
sub show_combinations { my($n,#prefix)=#_;
if($n > 0) {
show_combinations( $n-1, #prefix, 0);
show_combinations( $n-1, #prefix, 1);
} else {
print "#prefix\n";
}
}
Here are some sample cases:
> script.pl 1
0
1
> script.pl 2
0 0
0 1
1 0
1 1
> script.pl 3
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
I'm no Perl expert, so you might need to clean this up, but if forced to use Perl I'd probably do something like this:
#!/usr/bin/perl
my ($n) = #ARGV;
printf("%0*b\n", $n, $_) for 0 .. (1 << $n) - 1;
This is simple one line Perl code using module Math::Cartesian::Product.
use Math::Cartesian::Product;
cartesian {print "#_\n"} ([0..1]) x $ARGV[0];
Output
./sample.pl 2
0 0
0 1
1 0
1 1
./sample.pl 3
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
I don't know Perl so this may not work:
-- loop from 0 to (2^n)-1, where n is the number of digits in your cases;
-- convert each number to its n-digit binary representation;
Here is a generalization of my previous solution using Math::BigInt. This is an iterative solution:
#!/usr/bin/perl
use strict;
use warnings;
use Math::BigInt try => 'GMP';
my $n_bits = $ARGV[0] || 0;
my $it = make_it($n_bits);
while ( defined(my $bits = $it->()) ) {
print "$bits\n";
}
sub make_it {
my ($n_bits) = #_;
my $limit = Math::BigInt->new('2');
$limit->blsft($n_bits - 1);
my $next = Math::BigInt->new('-1');
return sub {
$next->binc;
return unless $next->bcmp($limit) < 0;
my $bits = $next->as_bin;
$bits =~ s/^0b//;
if ( (my $x = length $bits) < $n_bits ) {
$bits = '0' x ($n_bits - $x) . $bits;
}
return $bits;
}
}
You can use the %b format specifier for printf:
use strict;
use warnings;
my ($count) = #ARGV;
my $fmt = "%0${count}b";
my $n = 2**$count - 1;
for my $c (0 .. $n) {
my #bits = split //, sprintf $fmt, $c;
print "#bits\n";
}
This will only work for $count values less than 32.
Output:
C:\Temp> y 3
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
I am surprised no one has mentioned glob as a solution here:
perl -e 'print join "\n", glob("{0,1}" x shift || 1 )' -- 3
This prints:
000
001
010
011
100
101
110
111
glob is very handy for computing string permutations.
Here is the above, in a cleaner, non-one-liner form:
use strict;
use warnings;
my $symbol_count = shift || 1;
my #permutations = glob( '{0,1}' x $symbol_count );
print join "\n", #permutations;