count no.of occurrences per unique id - perl

I am new to command line. I have long text file (samp.txt) with following columns with space delimited. Awk/sed/perl help appreciated.
Id Pos Re Va Cn SF:R1 SR He Ho NC
c|371443199 22 G A R Pass:8 0 1 0 0
c|371443199 25 C A M Pass:13 0 0 1 0
c|371443199 22 G A R Pass:8 0 1 0 0
c|367079424 17 C G S Pass:19 0 0 1 0
c|371443198 17 G A R Pass:18 0 1 0 0
c|367079424 17 G A R Pass:18 0 0 1 0
I want count for each unique id (count unique id how many occurrences), count 6th column (6th column =pass), count how many He (from 8th column) and how many Ho (9 th column). I would like to get result like this
Id CountId Countpass CountHe CountHO
cm|371443199 3 3 2 1
cm|367079424 2 2 0 2

awk '{ids[$1]++; pass[$1] = "?"; he[$1] += $8; ho[$1] += $9} END {OFS = "\t"; print "Id", "CountId", "Countpass", "CountHe", "CountHO"; for (id in ids) {print id, ids[id], pass[id], he[id], ho[id]}' inputfile
Broken out onto multiple lines:
awk '{
ids[$1]++;
pass[$1] = "?"; # I'm not sure what you want here
he[$1] += $8;
ho[$1] += $9
}
END {
OFS = "\t";
print "Id", "CountId", "Countpass", "CountHe", "CountHO";
for (id in ids) {
print id, ids[id], pass[id], he[id], ho[id]
}' inputfile

You seem to have a typo in your input, where you put ...98 instead of ...99. Assuming this is the case, your other information and expected output makes sense.
Using an array to store the ids to preserve the original order of the ids.
use strict;
use warnings;
use feature 'say'; # to enable say()
my $hdr = <DATA>; # remove header
my %hash;
my #keys;
while (<DATA>) {
my ($id,$pos,$re,$va,$cn,$sf,$sr,$he,$ho,$nc) = split;
$id =~ s/^c\K/m/;
$hash{$id}{he} += $he;
$hash{$id}{ho} += $ho;
$hash{$id}{pass}{$sf}++;
$hash{$id}{count}++;
push #keys, $id if $hash{$id}{count} == 1;
}
say join "\t", qw(Id CountId Countpass CountHe CountHO);
for my $id (#keys) {
say join "\t", $id,
$hash{$id}{count}, # occurences of id
scalar keys $hash{$id}{pass}, # the number of unique passes
#{$hash{$id}}{qw(he ho)};
}
__DATA__
Id Pos Re Va Cn SF:R1 SR He Ho NC
c|371443199 22 G A R Pass:8 0 1 0 0
c|371443199 25 C A M Pass:13 0 0 1 0
c|371443199 22 G A R Pass:8 0 1 0 0
c|367079424 17 C G S Pass:19 0 0 1 0
c|371443198 17 G A R Pass:18 0 1 0 0
c|367079424 17 G A R Pass:18 0 0 1 0
Output:
Id CountId Countpass CountHe CountHO
cm|371443199 3 2 2 1
cm|367079424 2 2 0 2
cm|371443198 1 1 1 0
Note: I made the output tab-delimited for easier post-processing. If you want it pretty instead, use printf to get some fixed width fields.

Related

I want to have a output with 85 characters in each line, could you please say how I have to use print in this field?

I used following command to get a specific format that the output of it is in one line:
MASH P 0 3.64 NAMD P 0 3.79 AGHA P 0 4.50 SARG P 0 4.71 BENG P 0 5.47 BANR P 0 6.75 ABZA P 0 6.25 KALI P 0 6.91
I want to have a output with 85 characters in each line, could someone explain how I have to use print in this field?
You can use a regular expression with a quantifier:
$_ = 'MASH P 0 3.64 NAMD P 0 3.79 AGHA P 0 4.50 SARG P 0 4.71 BENG P 0 5.47 BANR P 0 6.75 ABZA P 0 6.25 KALI P 0 6.91';
print $&, "\n" while /.{1,85}/g;
or, if it's a part of a larger program and you don't want to suffer the performance penalty, use ${^MATCH} instead of $&:
use Syntax::Construct qw{ /p };
print ${^MATCH}, "\n" while /.{1,85}/gp;
You can also use the four argument substr:
print substr($_, 0, 85, q()), "\n" while $_;

How to count number of 1 and 0 in the matrix?

I have an image of which I cut out only one column. After that I made it to be logical so there are be only 0 and 1 in this column.
Suppose my values in this column are
1111000110000000000000011111111
I want to count the length of each block of ones or each block of zeros.
The result would be
1 - 4 (first 1)
0 - 3 (first 0)
1 - 2
and so on...
I know only count for the entire column but I can't do it for each distinct block. Anyone please help me.
Let vec be a row vector (1-by-n) of zeros and ones, then you can use the following code
rl = ( find( vec ~= [vec(2:end), vec(end)+1] ) );
data = vec( rl );
rl(2:end) = rl(2:end) - rl(1:end-1);
rl will give you the number of consecutive zeros and ones, while data will tell you for each block if it is zero or one.
This question is closely related to run length coding.
Demo:
vec = [1 1 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1];
rl = ( find( vec ~= [vec(2:end), vec(end)+1] ) );
data = vec( rl ),
rl(2:end) = rl(2:end) - rl(1:end-1),
data =
1 0 1 0 1
rl =
4 3 2 14 8

Perl: Matching four different files and obtaining particular Information in output file

I have four files. File 1 (named as inupt_22.txt) is an input file containing two columns (space delimited). First column is the alphabetically sorted list of ligandcode (three letter/number code for a particular ligand). Second column is a list of PDBcodes (Protein Data Bank code) respective of each ligandcode (unsorted list though).
File 1 (input_22.txt):
803 1cqp
AMH 1b2i
ASC 1f9g
ETS 1cil
MIT 1dwc
TFP 1ctr
VDX 1db1
ZMR 1a4g
File 2(named as SD_2.txt) is a SDF (Structure Data file) for fragments of each ligand. A ligand can contain one or more than one fragments. For instance, here 803 is the ligandcode and it has two fragments. So the file will look like: four dollar sign ($$$$) followed by ligandcode (i.e 803 in this example) in next line. every fragment follows the same thing. Next, in the 5th line of each fragment (third line from $$$$.\n803), there is a number that represents number of rows in next block of rows, like 7 in first fragment and 10 in next fragment of 803 ligand. Now, next block of rows contains a column (61-62) which contains specific number that refers to atoms in fragments. For example in first fragment of 803, these numbers are 15,16,17,19,20,21,22. These numbers need to be matched in file 3.
File 2 (SD_2.txt) looks like:
$$$$
803
SciTegic05101215222D
7 7 0 0 0 0 999 V2000
3.0215 -0.5775 0.0000 C 0 0 0 0 0 0 0 0 0 15 0 0
2.3070 -0.9900 0.0000 C 0 0 0 0 0 0 0 0 0 16 0 0
1.5926 -0.5775 0.0000 C 0 0 0 0 0 0 0 0 0 17 0 0
1.5926 0.2475 0.0000 C 0 0 0 0 0 0 0 0 0 19 0 0
2.3070 0.6600 0.0000 C 0 0 0 0 0 0 0 0 0 20 0 0
2.3070 1.4850 0.0000 O 0 0 0 0 0 0 0 0 0 21 0 0
3.0215 0.2475 0.0000 O 0 0 0 0 0 0 0 0 0 22 0 0
1 2 1 0
1 7 1 0
2 3 1 0
3 4 1 0
4 5 1 0
5 6 2 0
5 7 1 0
M END
> <Name>
803
> <Num_Rings>
1
> <Num_CSP3>
4
> <Fsp3>
0.8
> <Fstereo>
0
$$$$
803
SciTegic05101215222D
10 11 0 0 0 0 999 V2000
-1.7992 -1.7457 0.0000 C 0 0 0 0 0 0 0 0 0 1 0 0
-2.5137 -1.3332 0.0000 C 0 0 0 0 0 0 0 0 0 2 0 0
-2.5137 -0.5082 0.0000 C 0 0 0 0 0 0 0 0 0 3 0 0
-1.7992 -0.0957 0.0000 C 0 0 0 0 0 0 0 0 0 5 0 0
-1.0847 -0.5082 0.0000 C 0 0 0 0 0 0 0 0 0 6 0 0
-0.3702 -0.0957 0.0000 C 0 0 0 0 0 0 0 0 0 7 0 0
0.3442 -0.5082 0.0000 C 0 0 0 0 0 0 0 0 0 8 0 0
0.3442 -1.3332 0.0000 C 0 0 0 0 0 0 0 0 0 9 0 0
-0.3702 -1.7457 0.0000 C 0 0 0 0 0 0 0 0 0 11 0 0
-1.0847 -1.3332 0.0000 C 0 0 0 0 0 0 0 0 0 12 0 0
1 2 1 0
1 10 1 0
2 3 1 0
3 4 1 0
4 5 2 0
5 6 1 0
5 10 1 0
6 7 2 0
7 8 1 0
8 9 1 0
10 9 1 0
M END
> <Name>
803
> <Num_Rings>
2
> <Num_CSP3>
6
> <Fsp3>
0.6
> <Fstereo>
0.1
File 3 is CIF (Crystallographic Information file). This file can be obtained from following link: File_3
This file is a collection of individual cif files for several ligand molecules. Each part in file starts with data_ligandcode. For our example it will be data_803. After 46 lines from the start of each small file in collection, there is a block that gives structural information about the molecule. The number of rows in this block is not fixed. However, this block ends with an Hash sign (#). In this block two columns are important which are 53-56 and 62-63. 62-63 column contains numbers that can be matched from numbers obtained from file 2. And, 53-56 contains atom names like C1 (Carbon 1) etc. This column can be used to match with file 4.
File 4 is a Grow.out file that contains information about interaction of each ligand with their target protein. The file name is the PDBcode given in file 1 against each ligand. For example for ligand 803 the PDBcode is 1cqp. So, the grow.out file will be having name of 1cqp. 1cqp
In this file those rows are important those contain ligandcode (for example 803) and and the atom name obtained from 53-56 column of file three.
Task: I need a script that reads ligandcode from File 1, goes to file 2 search for $$$$ . \nLigandcode and then obtain numbers from column 61-62 for each fragment. Then in next step my script should pass these number to file 3 and match the rows containing these number in column 62-63 of file 3 and then pull out the information in column 53-56 (atom names). And last step will be opening of file 4 with the name of PDBcode and then printing the rows containing ligandcode and the atom names obtained from file 3. The printing should be done in an output file.
I am a Biomedical Research student. I don't have computer science background. However, I have to use Perl programming for some task. For the above mentioned task I wrote a script, but it is not working properly and I can not find the reason behind it. The script I wrote is :
#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;
use Carp qw(croak);
{
my $a;
my $b;
my $input_file = "input_22.txt";
my #lines = slurp($input_file);
for my $line (#lines){
my ($ligandcode, $pdbcode) = split(/\t/, $line);
my $i=0;
my $k=0;
my #array;
my #array1;
open (FILE, '<', "SD_2.txt");
while (<FILE>) {
my $i=0;
my $k=0;
my #array;
my #array1;
if ( $_=~/\x24\x24\x24\x24/ . /\n$ligandcode/) {
my $nextline1 = <FILE>;
my $nextline2 = <FILE>;
my $nextline3 = <FILE>;
my $nextline4= <FILE>;
my $totalatoms= substr( $nextline4, 1,2);
print $totalatoms,"\n";
while ($i<$totalatoms)
{
my $nextlines= <FILE>;
my $sub= substr($nextlines, 61, 2);
print $sub;
$array[$i] = $sub;
open (FH, '<', "components.txt");
while (my $ship=<FH>) {
my $var="data_$ligandcode";
if ($ship=~/$var/)
{
while ($k<=44)
{
$k++;
my $nextline = <FH>;
}
my $j=0;
my $nextline3;
do
{
$nextline3=<FH>;
print $nextline3;
my $part= substr($nextline3, 62, 2);
my $part2= substr($nextline3, 53, 4);
$array1[$j] = $part;
if ($array1[$j] eq $array[$i])
{
print $part2, "\n";
open (GH, '<', "$pdbcode");
open (OH, ">>out_grow.txt");
while (my $grow = <GH>)
{
if ( $grow=~/$ligandcode/){
print OH $grow if $grow=~/$part2/;
}}
close (GH);
close (OH);
}
$j++;
} while $nextline3 !~/\x23/;
}
}
$i++;
close (FH);
}
}}
close (FILE);
}
}
##Slurps a file into a list
sub slurp {
my ($file) = #_;
my (#data, #data_chomped);
open IN, "<", $file or croak "can't open $file\n";
#data = <IN>;
for my $line (#data){
chomp($line);
push (#data_chomped, $line);
}
close IN;
return (#data_chomped);
}
I want to make it a script that works fast and works for 1000 fragments altogether, if I make a list of 400 molecules in file 1. Kindly help me to make this script working. I ll be grateful.
You need to break your code into manageable steps.
Create data-structures from the files
use Slurp;
my #input = map{
[ split /\s+/, $_, 2 ]
} slurp $input_filename;
# etc
Process each element of input_22.txt, using those data structures.
I really think you should look into PerlMol. After all, half the reason to use Perl is CPAN.
Things you did well
Using 3-arg open
use strict;
use warnings;
Things you shouldn't have done
(Re)defined $a and $b
They are already defined for you.
Reimplemented slurp (poorly)
Read the same file in multiple times.
You opened SD_2.txt once for every line of input_22.txt.
Defined symbols outside of the scope where you use them.
$j, $k, #array and #array1 are defined twice, but only one of the definitions is being used.
Used open and close without some sort of error checking.
Either open ... or die; or use autodie;
You used bareword filehandles. IN, FILE etc
Instead use open my $FH, ...
Most of those aren't that big of a deal though, for a one-off program.

Use perl script to fill in values in data set

I have a series of numbers (in a text file) from 90,000 to 1,000,000 that correspond to files that I am missing. I would like to take this set and mark these files as "1" as a second column next to a complete series of numbers 90,000 to 1,000,000. For example for series 1to13 (which would correspond to the "missing" files):
3
7
10
12
I would like to create a data set:
1 0
2 0
3 1
4 0
5 0
6 0
7 1
8 0
9 0
10 1
11 0
12 1
13 0
I would like to be able to execute this in perl.
Assuming they are sorted:
use strict;
use warnings;
my $last = 89999;
while (my $next = <>) {
chomp($next);
print $last, " 0\n" while ++$last < $next;
print "$next 1\n";
}
print $last, " 0\n" while ++$last <= 1000000;

Truth Table Generation for the given input

I want to generate a truth table for the given input.Suppose if i give input 2 the output will be
10 01 11 00
if the input is 3 the output will be
111 000 110 101 011 100 001 010
i have a code snippet
#!/usr/bin/perl
#print "a|b|c\n";
for $a (1, 0){
for $b (1, 0){
for $c (1,0) {
for $d ( 1,0)
{
print "$a $b $c $d";
#print $x = ($a & $b & $c);
print "\n";
}
}
}
}
print "\n";
above code is for 4.
i don't know how to do this without writing multiple for loops. Here for value 2 i need to write two for loops and so on.
can any body tell me how to tweak this code for several input values.
Any help would be great appreciated
Recursion
Here is a simple solution using recursion:
#!/usr/bin/perl -w
my $variables=$ARGV[0]||0;
show_combinations($variables);
sub show_combinations { my($n,#prefix)=#_;
if($n > 0) {
show_combinations( $n-1, #prefix, 0);
show_combinations( $n-1, #prefix, 1);
} else {
print "#prefix\n";
}
}
Here are some sample cases:
> script.pl 1
0
1
> script.pl 2
0 0
0 1
1 0
1 1
> script.pl 3
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
I'm no Perl expert, so you might need to clean this up, but if forced to use Perl I'd probably do something like this:
#!/usr/bin/perl
my ($n) = #ARGV;
printf("%0*b\n", $n, $_) for 0 .. (1 << $n) - 1;
This is simple one line Perl code using module Math::Cartesian::Product.
use Math::Cartesian::Product;
cartesian {print "#_\n"} ([0..1]) x $ARGV[0];
Output
./sample.pl 2
0 0
0 1
1 0
1 1
./sample.pl 3
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
I don't know Perl so this may not work:
-- loop from 0 to (2^n)-1, where n is the number of digits in your cases;
-- convert each number to its n-digit binary representation;
Here is a generalization of my previous solution using Math::BigInt. This is an iterative solution:
#!/usr/bin/perl
use strict;
use warnings;
use Math::BigInt try => 'GMP';
my $n_bits = $ARGV[0] || 0;
my $it = make_it($n_bits);
while ( defined(my $bits = $it->()) ) {
print "$bits\n";
}
sub make_it {
my ($n_bits) = #_;
my $limit = Math::BigInt->new('2');
$limit->blsft($n_bits - 1);
my $next = Math::BigInt->new('-1');
return sub {
$next->binc;
return unless $next->bcmp($limit) < 0;
my $bits = $next->as_bin;
$bits =~ s/^0b//;
if ( (my $x = length $bits) < $n_bits ) {
$bits = '0' x ($n_bits - $x) . $bits;
}
return $bits;
}
}
You can use the %b format specifier for printf:
use strict;
use warnings;
my ($count) = #ARGV;
my $fmt = "%0${count}b";
my $n = 2**$count - 1;
for my $c (0 .. $n) {
my #bits = split //, sprintf $fmt, $c;
print "#bits\n";
}
This will only work for $count values less than 32.
Output:
C:\Temp> y 3
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
I am surprised no one has mentioned glob as a solution here:
perl -e 'print join "\n", glob("{0,1}" x shift || 1 )' -- 3
This prints:
000
001
010
011
100
101
110
111
glob is very handy for computing string permutations.
Here is the above, in a cleaner, non-one-liner form:
use strict;
use warnings;
my $symbol_count = shift || 1;
my #permutations = glob( '{0,1}' x $symbol_count );
print join "\n", #permutations;