I have large matrix files consisting of only "0" and "a" in columns, and I want to do what this does:
perl -pe 'BEGIN { our $i = 1; } s/a/($i++)/ge;' < FILE > NEW_FILE
but only increment once for each line instead of every instance on each line.
So if my first line in the file is:
0 0 a a a
The perl command gives me:
0 0 1 2 3
While I would want
0 0 1 1 1
and on the next line for instance 2 0 2 0 2 and so on...
This should be possible to do with awk, but using:
'{ i=1; gsub(/a/,(i+1));print}' tmp2
just gives me 0's and 2's for all lines...
Just increment before, not on every substitution:
awk '{i++; gsub(/a/,i)}1' file
This way, the variable gets updated once per line, not once per match.
The same applies to the Perl script:
perl -pe 'BEGIN { our $i = 0; } $i++; s/a/$i/ge;' file
Test
$ cat a
0 0 a a a
2 3 a a a
$ awk '{i++; gsub(/a/,i)}1' a
0 0 1 1 1
2 3 2 2 2
$ perl -pe 'BEGIN { our $i = 0; } $i++; s/a/$i/ge;' a
0 0 1 1 1
2 3 2 2 2
You can simply replace every occurrence of a with the current line number
perl -pe 's/a/$./g' FILE > NEW_FILE
perl -pe'$i++;s/a/$i/g'
or, if you would like to increment only on lines where a substitution actually happens:
perl -pe'/a/&&$i++;s/a/$i/g'
In action:
$ cat a
0 0 a a a
1 2 0 0 0
2 3 a a a
$ perl -pe'$i++;s/a/$i/g' a
0 0 1 1 1
1 2 0 0 0
2 3 3 3 3
$ perl -pe'/a/&&$i++;s/a/$i/g' a
0 0 1 1 1
1 2 0 0 0
2 3 2 2 2
Related
I have a file which has rows like this:
004662484 4 0 0 0 0
The second column is the number 4, and I want to use this number to expand this line into 4 lines like this:
004662484 0 0 0 0 0
004662484 1 0 0 0 0
004662484 2 0 0 0 0
004662484 3 0 0 0 0
How can I do that using either awk or sed (or both)? Thanks!
{ reps = $2; for (i = 0; i < reps; i++) { $2 = i; print $0; } }
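Saved as, say, expand.awk (the file name is arbitrary), it can be run like this and produces exactly the four lines asked for:
$ awk -f expand.awk file
004662484 0 0 0 0 0
004662484 1 0 0 0 0
004662484 2 0 0 0 0
004662484 3 0 0 0 0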
I have a huge file of genetic markers for 2890 individuals. I would like to transpose this file. The format of my data is as follows (only 6 markers are shown here):
ID rs4477212 kgp15297216 rs3131972 kgp6703048 kgp15557302 kgp12112772 .....
BV04976 0 0 1 0 0 0
BV76296 0 0 1 0 0 0
BV02803 0 0 0 0 0 0
BV09710 0 0 1 0 0 0
BV17599 0 0 0 0 0 0
BV29503 0 0 1 1 0 1
BV52203 0 0 0 0 0 0
BV61727 0 0 1 0 0 0
BV05952 0 0 0 0 0 0
In fact, I have 1,743,680 columns and 2890 rows in my text file. How can I transpose it?
I would like the output to look like this:
ID BV04976 BV76296 BV02803 BV09710 BV17599 BV29503 BV52203 BV61727 BV05952
rs4477212 0 0 0 0 0 0 0 0 0
kgp15297216 0 0 0 0 0 0 0 0 0
rs3131972 1 1 0 1 0 1 0 1 0
kgp6703048 0 0 0 0 0 1 0 0 0
kgp15557302 0 0 0 0 0 0 0 0 0
kgp12112772 0 0 0 0 0 1 0 0 0
I would make multiple passes over the file, perhaps 100, each pass handling 1743680/passes columns and writing them out (as rows) at the end of the pass.
Assemble the data into strings in an array, not an array of arrays, for lower memory usage and fewer passes.
Preallocating the space for each string at the beginning of each pass (e.g. $new_row[13] = ' ' x 6000; $new_row[13] = '';) might or might not help.
(See: An efficient way to transpose a file in Bash )
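A rough Perl sketch of that multi-pass idea (untested; the file name, pass count and column count are placeholders, and it assumes whitespace-separated fields and that one pass's worth of columns fits in memory):

#!/usr/bin/perl
# Rough sketch of the multi-pass transpose; the numbers are illustrative.
use strict;
use warnings;

my $file     = 'input.txt';       # original matrix
my $ncols    = 1_743_680;         # columns in the original file
my $passes   = 100;               # how many times we are willing to reread it
my $per_pass = int($ncols / $passes) + 1;

for my $pass (0 .. $passes - 1) {
    my $first = $pass * $per_pass;                # first column handled in this pass
    my $last  = $first + $per_pass - 1;
    $last = $ncols - 1 if $last > $ncols - 1;
    last if $first > $last;

    my @new_row = ('') x ($last - $first + 1);    # one output string per column

    open my $in, '<', $file or die "$file: $!";
    while (<$in>) {
        my @f = split;                            # whitespace-separated fields
        for my $c ($first .. $last) {
            $new_row[$c - $first] .= ' ' if length $new_row[$c - $first];
            $new_row[$c - $first] .= $f[$c] // '';
        }
    }
    close $in;

    print "$_\n" for @new_row;                    # this pass's transposed rows
}

Redirecting the whole run to a file collects the passes in order, e.g. perl transpose.pl > transposed.txt (transpose.pl being whatever the script is saved as).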
Have you tried
awk -f tr.awk input.txt > out.txt
where tr.awk is
{
    for (i=1; i<=NF; i++) a[NR,i] = $i
}

END {
    for (i=1; i<=NF; i++) {
        for (j=1; j<=NR; j++) {
            printf "%s", a[j,i]
            if (j<NR) printf "%s", OFS
        }
        printf "%s", ORS
    }
}
Probably your file is too big for the above procedure.
Then you could try splitting it up first. For example:
#! /bin/bash
numrows=2890
echo "Splitting file.."
split -d -a4 -l1 input.txt
arg=""
outfile="out.txt"
tempfile="temp.txt"
if [ -e $outfile ] ; then
    rm -i $outfile
fi
for (( i=0; i<$numrows; i++ )) ; do
    echo "Processing file: "$(expr $i + 1)"/"$numrows
    file=$(printf "x%04d\n" $i)
    tfile=${file}.tr
    cat $file | tr -s ' ' '\n' > $tfile
    rm $file
    if [ $i -gt 0 ] ; then
        paste -d' ' $outfile $tfile > $tempfile
        rm $outfile
        mv $tempfile $outfile
        rm $tfile
    else
        mv $tfile $outfile
    fi
done
Note that split will generate 2890 temporary files (!)
I have four files. File 1 (named input_22.txt) is an input file containing two columns (space delimited). The first column is an alphabetically sorted list of ligandcodes (a three letter/number code for a particular ligand). The second column is a list of PDBcodes (Protein Data Bank codes) corresponding to each ligandcode (this list is unsorted, though).
File 1 (input_22.txt):
803 1cqp
AMH 1b2i
ASC 1f9g
ETS 1cil
MIT 1dwc
TFP 1ctr
VDX 1db1
ZMR 1a4g
File 2 (named SD_2.txt) is an SDF (Structure Data File) containing the fragments of each ligand. A ligand can contain one or more fragments; for instance, 803 is a ligandcode and it has two fragments. Each fragment record starts with four dollar signs ($$$$) followed by the ligandcode (803 in this example) on the next line, and every fragment follows the same pattern. In the 5th line of each fragment (the third line after the $$$$/803 pair) there is a number giving the number of rows in the next block, like 7 in the first fragment and 10 in the second fragment of ligand 803. That block of rows has a column (characters 61-62) containing numbers that refer to atoms in the fragment. For example, in the first fragment of 803 these numbers are 15, 16, 17, 19, 20, 21, 22. These numbers need to be matched in file 3.
File 2 (SD_2.txt) looks like:
$$$$
803
SciTegic05101215222D
7 7 0 0 0 0 999 V2000
3.0215 -0.5775 0.0000 C 0 0 0 0 0 0 0 0 0 15 0 0
2.3070 -0.9900 0.0000 C 0 0 0 0 0 0 0 0 0 16 0 0
1.5926 -0.5775 0.0000 C 0 0 0 0 0 0 0 0 0 17 0 0
1.5926 0.2475 0.0000 C 0 0 0 0 0 0 0 0 0 19 0 0
2.3070 0.6600 0.0000 C 0 0 0 0 0 0 0 0 0 20 0 0
2.3070 1.4850 0.0000 O 0 0 0 0 0 0 0 0 0 21 0 0
3.0215 0.2475 0.0000 O 0 0 0 0 0 0 0 0 0 22 0 0
1 2 1 0
1 7 1 0
2 3 1 0
3 4 1 0
4 5 1 0
5 6 2 0
5 7 1 0
M END
> <Name>
803
> <Num_Rings>
1
> <Num_CSP3>
4
> <Fsp3>
0.8
> <Fstereo>
0
$$$$
803
SciTegic05101215222D
10 11 0 0 0 0 999 V2000
-1.7992 -1.7457 0.0000 C 0 0 0 0 0 0 0 0 0 1 0 0
-2.5137 -1.3332 0.0000 C 0 0 0 0 0 0 0 0 0 2 0 0
-2.5137 -0.5082 0.0000 C 0 0 0 0 0 0 0 0 0 3 0 0
-1.7992 -0.0957 0.0000 C 0 0 0 0 0 0 0 0 0 5 0 0
-1.0847 -0.5082 0.0000 C 0 0 0 0 0 0 0 0 0 6 0 0
-0.3702 -0.0957 0.0000 C 0 0 0 0 0 0 0 0 0 7 0 0
0.3442 -0.5082 0.0000 C 0 0 0 0 0 0 0 0 0 8 0 0
0.3442 -1.3332 0.0000 C 0 0 0 0 0 0 0 0 0 9 0 0
-0.3702 -1.7457 0.0000 C 0 0 0 0 0 0 0 0 0 11 0 0
-1.0847 -1.3332 0.0000 C 0 0 0 0 0 0 0 0 0 12 0 0
1 2 1 0
1 10 1 0
2 3 1 0
3 4 1 0
4 5 2 0
5 6 1 0
5 10 1 0
6 7 2 0
7 8 1 0
8 9 1 0
10 9 1 0
M END
> <Name>
803
> <Num_Rings>
2
> <Num_CSP3>
6
> <Fsp3>
0.6
> <Fstereo>
0.1
File 3 is a CIF (Crystallographic Information File). This file can be obtained from the following link: File_3
This file is a collection of individual CIF files for several ligand molecules. Each part of the file starts with data_ligandcode; for our example it will be data_803. 46 lines after the start of each small file in the collection, there is a block that gives structural information about the molecule. The number of rows in this block is not fixed, but the block ends with a hash sign (#). In this block two column ranges are important: characters 53-56 and 62-63. Columns 62-63 contain numbers that can be matched against the numbers obtained from file 2, and columns 53-56 contain atom names like C1 (carbon 1), which can be used to match against file 4.
File 4 is a grow.out file that contains information about the interaction of each ligand with its target protein. The file name is the PDBcode given in file 1 for each ligand; for example, for ligand 803 the PDBcode is 1cqp, so the grow.out file will be named 1cqp. 1cqp
In this file, the important rows are those that contain the ligandcode (for example 803) and the atom names obtained from columns 53-56 of file 3.
Task: I need a script that reads each ligandcode from file 1, searches file 2 for $$$$ followed by the ligandcode on the next line, and obtains the numbers in columns 61-62 for each fragment. In the next step the script should take these numbers to file 3, match the rows containing them in columns 62-63, and pull out the atom names in columns 53-56. The last step is to open file 4, whose name is the PDBcode, and print the rows containing the ligandcode and the atom names obtained from file 3. The printing should go to an output file.
I am a biomedical research student and I don't have a computer science background, but I have to use Perl for this task. For the above mentioned task I wrote a script, but it is not working properly and I cannot find the reason. The script I wrote is:
#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;
use Carp qw(croak);

{
    my $a;
    my $b;
    my $input_file = "input_22.txt";
    my @lines = slurp($input_file);
    for my $line (@lines){
        my ($ligandcode, $pdbcode) = split(/\t/, $line);
        my $i=0;
        my $k=0;
        my @array;
        my @array1;
        open (FILE, '<', "SD_2.txt");
        while (<FILE>) {
            my $i=0;
            my $k=0;
            my @array;
            my @array1;
            if ( $_=~/\x24\x24\x24\x24/ . /\n$ligandcode/) {
                my $nextline1 = <FILE>;
                my $nextline2 = <FILE>;
                my $nextline3 = <FILE>;
                my $nextline4 = <FILE>;
                my $totalatoms = substr( $nextline4, 1, 2);
                print $totalatoms,"\n";
                while ($i<$totalatoms)
                {
                    my $nextlines = <FILE>;
                    my $sub = substr($nextlines, 61, 2);
                    print $sub;
                    $array[$i] = $sub;
                    open (FH, '<', "components.txt");
                    while (my $ship=<FH>) {
                        my $var="data_$ligandcode";
                        if ($ship=~/$var/)
                        {
                            while ($k<=44)
                            {
                                $k++;
                                my $nextline = <FH>;
                            }
                            my $j=0;
                            my $nextline3;
                            do
                            {
                                $nextline3=<FH>;
                                print $nextline3;
                                my $part  = substr($nextline3, 62, 2);
                                my $part2 = substr($nextline3, 53, 4);
                                $array1[$j] = $part;
                                if ($array1[$j] eq $array[$i])
                                {
                                    print $part2, "\n";
                                    open (GH, '<', "$pdbcode");
                                    open (OH, ">>out_grow.txt");
                                    while (my $grow = <GH>)
                                    {
                                        if ( $grow=~/$ligandcode/){
                                            print OH $grow if $grow=~/$part2/;
                                        }
                                    }
                                    close (GH);
                                    close (OH);
                                }
                                $j++;
                            } while $nextline3 !~/\x23/;
                        }
                    }
                    $i++;
                    close (FH);
                }
            }
        }
        close (FILE);
    }
}

## Slurps a file into a list
sub slurp {
    my ($file) = @_;
    my (@data, @data_chomped);
    open IN, "<", $file or croak "can't open $file\n";
    @data = <IN>;
    for my $line (@data){
        chomp($line);
        push (@data_chomped, $line);
    }
    close IN;
    return (@data_chomped);
}
I want the script to run fast and to work for 1000 fragments altogether if I list 400 molecules in file 1. Kindly help me get this script working. I'll be grateful.
You need to break your code into manageable steps.
Create data-structures from the files
use Slurp;
my @input = map {
    [ split /\s+/, $_, 2 ]
} slurp $input_filename;
# etc
Process each element of input_22.txt, using those data structures.
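For example, the pass over SD_2.txt from the first step could look something like this (a sketch only; it assumes the $$$$-then-ligandcode record layout and the 61-62 column positions described in the question):

#!/usr/bin/perl
# Sketch of the first step only: read SD_2.txt once and remember, for each
# ligandcode, the atom numbers found in columns 61-62 of every fragment.
# The record layout ($$$$ line, ligandcode, program line, comment line,
# counts line) is taken from the question and may need adjusting.
use strict;
use warnings;

my %atoms_for;                                   # ligandcode => [ atom numbers ]

open my $sdf, '<', 'SD_2.txt' or die "SD_2.txt: $!";
while (my $line = <$sdf>) {
    next unless $line =~ /^\$\$\$\$/;            # a fragment record starts here
    chomp( my $ligand = <$sdf> );                # next line holds the ligandcode
    <$sdf>; <$sdf>;                              # skip the program and comment lines
    my $counts = <$sdf>;                         # counts line of this fragment
    my ($natoms) = $counts =~ /^\s*(\d+)/;       # first number = rows in the atom block
    for (1 .. $natoms) {
        my $atom_row = <$sdf>;
        push @{ $atoms_for{$ligand} }, substr($atom_row, 61, 2);
    }
}
close $sdf;

# Later steps can look up $atoms_for{$ligandcode} without re-reading SD_2.txt.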
I really think you should look into PerlMol. After all, half the reason to use Perl is CPAN.
Things you did well
Using 3-arg open
use strict;
use warnings;
Things you shouldn't have done
(Re)defined $a and $b
They are already defined for you.
Reimplemented slurp (poorly)
Read the same file multiple times.
You opened SD_2.txt once for every line of input_22.txt.
Defined symbols outside of the scope where you use them.
$i, $k, @array and @array1 are defined twice, but only one of the definitions is being used.
Used open and close without some sort of error checking.
Either open ... or die; or use autodie;
You used bareword filehandles (IN, FILE, etc.)
Instead use open my $FH, ...
Most of those aren't that big of a deal though, for a one-off program.
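To illustrate the last two points, a lexical filehandle together with autodie looks like this (a generic snippet, not tied to the script above):

use strict;
use warnings;
use autodie;                       # open/close now die with a useful message on failure

open my $fh, '<', 'input_22.txt';  # 3-arg open with a lexical filehandle
while (my $line = <$fh>) {
    chomp $line;
    # ... process $line ...
}
close $fh;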
I have these kinds of rows:
0 1 1
I would like to multiply it by, let's say, 2 or 4 to get this pattern:
0 0 0 0 1 1 1 1 1 1 1 1
Now, I have an old piece of code which basically does this for the case of multiplying by 5,
but I cannot convert this script to multiply by, for example, 2 or 4...
Can anyone help me to figure it out?
Here is the code:
sed -e 's/\([01]\)/\1\1\1\1/7g ; s/\([01]\{2,\}\)/\1\1\1/g ; s/\b\([01]\)\b/\1\1\1\1\1/g ; s/\([01]\)\B/\1 /g'
$ echo '0 1 1' | sed -r 's/\S/& & & & &/g'
0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
Using sed, repeating 4 times:
kent$ echo "0 1 1
1 1 0"|sed 's/[01]/& & & &/g'
0 0 0 0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 0 0 0 0
With awk, you can give the number of times to repeat as a parameter, e.g. to repeat 5 times:
kent$ echo "0 1 1
dquote> 1 1 0"|awk -v t=5 '{f=1;while(f<=NF){ n=1;while(n<=t){printf "%s ",$f;n++;}f++;} print "";}'
0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
This might work for you:
echo -e '0 1 1\n1 1 1 0 0' | sed "s/\S/$(echo {1..4}| sed 's/\S*/\&/g')/g"
0 0 0 0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
This is an OTT solution, but it is programmable, i.e. change 4 to any value you wish to multiply by.
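For comparison, the same "repeat every field N times" idea as a Perl one-liner, with the count passed as a switch (a sketch; -t is an arbitrary switch name enabled by -s):
$ echo "0 1 1" | perl -slane 'print join " ", map { ($_) x $t } @F' -- -t=4
0 0 0 0 1 1 1 1 1 1 1 1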
I'm trying to add a bunch of 0s at the end of a line. The line is identified by the fact that it is followed by a line which starts with "expr1".
In Vim, what I do is:
s/\nexpr1/ 0 0 0 0 0 0\rexpr1/
and it works fine. I know that on Ubuntu \n is what is normally used to terminate a line, but whenever I use \n I get a ^@ symbol, so \r works fine for me. I thought I'd use the same substitution with sed, but it hasn't really worked. Here is what I normally write:
sed "s/\nexpr1/ 0 0 0 0 0 0\rexpr1/" infile > outfile
The end-of-line marker is $. Try this:
s/$/ 0 0 0 0 0 0/
Depending on your environment, you might need to escape the $.
awk '{$0=$0" 0 0 0 0 0 "}1' file > tmp && mv tmp file
ruby -i.bak -ne '$_=$_.chomp!+" 0 0 0 0 0\n";print' file
awk '$(NF + 1) = " 0 0 0 0 0 0"' infile > outfile
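Note that sed reads input line by line, so a \n in the pattern never matches the newline between two lines. If, as in the original question, the zeros should only be appended to lines that are followed by a line starting with expr1, one workaround (a sketch, assuming the file fits in memory) is to let Perl slurp the whole file and do the same substitution as the Vim command:
perl -0777 -pe 's/\nexpr1/ 0 0 0 0 0 0\nexpr1/g' infile > outfile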