How to repeat a sequence of numbers to the end of a column? - perl

I have a data file that needs a new column of identifiers from 1 to 5. The final purpose is to split the data into five separate files with no leftover file (the split utility leaves a leftover file).
Data:
aa
bb
cc
dd
ff
nn
ww
tt
pp
with identifier column:
aa 1
bb 2
cc 3
dd 4
ff 5
nn 1
ww 2
tt 3
pp 4
Not sure if this can be done with seq? Afterwards it will be split with:
awk '$2 == 1 {print $0}'
awk '$2 == 2 {print $0}'
awk '$2 == 3 {print $0}'
awk '$2 == 4 {print $0}'
awk '$2 == 5 {print $0}'

Perl to the rescue:
perl -pe 's/$/" " . $. % 5/e' < input > output
It uses 0 instead of 5 on every fifth line.
$. is the line number.
% is the modulo operator.
The /e modifier tells the substitution to evaluate the replacement part as code.
That is, the end of line ($) is replaced with a space concatenated (.) with the line number modulo 5.
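If you want the identifiers to run 1 through 5 instead of 1-4 and then 0, a small variant of the same one-liner should do it (a sketch along the same lines, untested):
perl -pe 's/$/" " . (($. - 1) % 5 + 1)/e' < input > output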

$ awk '{print $0, ((NR-1)%5)+1}' file
aa 1
bb 2
cc 3
dd 4
ff 5
nn 1
ww 2
tt 3
pp 4
No need for that to create 5 separate files of course. All you need is:
awk '{print > ("file_" ((NR-1)%5)+1)}' file
Looks like you're happy with a perl solution that outputs 1-4 then 0 instead of 1-5, so FYI here's the equivalent in awk:
$ awk '{print $0, NR%5}' file
aa 1
bb 2
cc 3
dd 4
ff 0
nn 1
ww 2
tt 3
pp 4
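If the five files are the real goal, the same round-robin idea also works directly in Perl; here's a sketch (untested) that writes to file_1 through file_5:
perl -ne 'BEGIN { open $fh[$_], ">", "file_" . ($_ + 1) for 0 .. 4 } print { $fh[($. - 1) % 5] } $_' file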

I am going to offer a Perl solution even though it wasn't tagged because Perl is well suited to solve this problem.
If I understand what you want to do, you have a single file that you want to split into 5 separate files based on the position of a line in the data file:
the first line in the data file goes to file 1
the second line in the data file goes to file 2
the third line in the data file goes to file 3
...
Since you already have each line's position in the file, you don't really need the identifier column (though you could pursue that solution if you wanted).
Instead you can open 5 filehandles and simply alternate which handle you write to:
use strict;
use warnings;

my $datafilename = shift @ARGV;

# open filehandles and store them in an array
my @fhs;
foreach my $i ( 0 .. 4 ) {
    open my $fh, '>', "${datafilename}_$i"
        or die "$!";
    $fhs[$i] = $fh;
}

# open the datafile
open my $datafile_fh, '<', $datafilename
    or die "$!";

my $row_number = 0;
while ( my $datarow = <$datafile_fh> ) {
    print { $fhs[$row_number++ % @fhs] } $datarow;
}

# close resources
foreach my $fh ( @fhs ) {
    close $fh;
}
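Assuming the script above is saved as split5.pl (a hypothetical name), a run would look like:
perl split5.pl input
This creates input_0 through input_4, with every fifth line going to each file.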

Related

How to calculate inverse log2 ratio of a UCSC wiggle file using perl?

I have 2 separate files, namely A and B, containing the same header lines but 2 and 1 data columns respectively. I want to take the inverse log2 of the 2nd column (file A) or the 1st column (file B) but keep the other description lines intact. I have something like this; in file A, the values $1 and $2 are separated by a tab delimiter.
file A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 0.781985
16 0.810993
20 0.769601
24 0.733831
file B
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
0.721985
0.610993
0.760123
0.573831
I expect an output like this. File A:
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 1.7194950944
16 1.754418585
20 1.7047982296
24 1.6630493726
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr2
For file B (in this file the values are just copied from file A):
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
1.7194950944
1.754418585
1.7047982296
1.6630493726
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr2
This awk script does the calculation that you want:
awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' file
This matches lines that contain only digits, periods, and whitespace characters, replacing the value of the last field $NF with 2 raised to the power of $NF. The format specifier %.12f can be modified to give you the required number of decimal places. The 1 at the end is shorthand for {print}.
Testing it out on your new files:
$ awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 1.719495094445
16 1.754418584953
20 1.704798229573
24 1.663049372620
$ awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' B
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
1.649449947457
1.527310087388
1.693635012985
1.488470882686
So here's the Perl version:
use strict;
open IN, $ARGV[0];
while (<IN>) {
    chomp;
    if (/^(.*)[\t ]*(-?\d\.\d*)/) { # format "nn m.mmmmm"
        my $power = 2 ** $2;
        print("$1\t" . $power . "\n");
    } elsif (/^(-?\d\.\d*)/) { # format "m.mmmmm"
        my $power = 2 ** $1;
        print($power . "\n");
    } else { # echo all other stuff
        print;
        print("\n");
    }
}
close IN;
If you run <file>.pl <datafile> (replace with appropriate names), it converts one file so that the numeric lines carry 2**<value> in place of the original value. It simply echoes the lines that do not match the number pattern.
This is a slightly modified version of @ThomasKilian's script.
Thanks to him for providing the framework.
use strict;
open IN, $ARGV[0];
while (<IN>) {
    chomp;
    if (/^(\d*)[\t ]*(-?\d\.\d*)/) { # format "nn m.mmmmm"
        my $power = 2 ** $2;
        $power = sprintf("%.12f", $power);
        print("$1\t" . $power . "\n");
    } elsif (/^(-?\d\.\d*)/) { # format "m.mmmmm"
        my $power = 2 ** $1;
        $power = sprintf("%.12f", $power);
        print($power . "\n");
    } else { # echo all other stuff
        print;
        print("\n");
    }
}
close IN;
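Assuming the script is saved as inverse_log2.pl (a hypothetical name), you would run it once per data file:
perl inverse_log2.pl A > A_converted
perl inverse_log2.pl B > B_converted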

grep variables and give informative output

I want to see how many times a specific word was mentioned in the file/lines.
My dummy example looks like this:
cat words
blue
red
green
yellow
cat text
TEXTTEXTblueTEXTTEXTblue
TEXTTEXTgreenblueTEXTTEXT
TEXTTEXyeowTTEXTTEXTTEXT
I am doing this:
for i in $(cat words); do grep "$i" text | wc >> output; done
cat output
2 2 51
0 0 0
1 1 26
0 0 0
But what I actually want to get is:
1. The word that was used as a variable;
2. In how many lines (in addition to the total hits) the word was found.
The preferred output looks like this:
blue 3 2
red 0 0
green 1 1
yellow 0 0
$1 - variable that was grep'ed
$2 - how many times variable was found in the text
$3 - in how many lines variable was found
I hope someone could help me do this with grep, awk, or sed, as they are fast enough for the large data set, but a Perl one-liner would help me too.
Edit:
I tried this:
for i in $(cat words); do grep "$i" text > out_${i}; done && wc out*
and it kinda looks nice, but some of the words are longer than 300 letters, so I can't create a file named after the word.
You can use the grep option -o, which prints only the matched parts of a matching line, with each match on a separate output line.
while IFS= read -r line; do
    wordcount=$(grep -o "$line" text | wc -l)
    linecount=$(grep -c "$line" text)
    echo $line $wordcount $linecount
done < words | column -t
You can put it all in one line to make it a one liner.
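For example, collapsed into a single line (the same commands as above, untested):
while IFS= read -r line; do echo "$line" $(grep -o "$line" text | wc -l) $(grep -c "$line" text); done < words | column -t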
If column gives the "column too long" error, you can use printf provided you know the maximum number of characters. Use the below instead of echo and remove the pipe to column:
printf "%-20s %-2s %-2s\n" "$line" $wordcount $linecount
Replace the 20 with your max word length and the other numbers as well if you need to.
Here is a similar Perl solution, but written as a complete script.
#!/usr/bin/perl
use 5.012;

die "USAGE: $0 wordlist.txt [text-to-search.txt]\n" unless @ARGV;

my $wordsfile = shift @ARGV;
my @wordlist = do {
    open my $words_fh, "<", $wordsfile or die "Can't open $wordsfile: $!";
    map { chomp; length() ? $_ : () } <$words_fh>;
};

my %words;
while (<>) {
    for my $word (@wordlist) {
        my $cnt = 0;
        $cnt++ for /\Q$word\E/g;
        $words{$word}[0] += $cnt;
        $words{$word}[1] += 1 & !!$cnt; # trick to force 1 or 0
    }
}

# sorts output by frequency; remove `sort {...}` to get unsorted output
for my $key (sort { $words{$b}->[0] <=> $words{$a}->[0] or $a cmp $b } keys %words) {
    say join "\t", $key, @{ $words{$key} };
}
Example output:
blue 3 2
green 1 1
red 0 0
yellow 0 0
Advantage over bash script: every file is only read once.
This gets pretty ugly as a Perl one-liner (partly because it needs to get data from two files and only one can be sent on stdin, partly because of the requirement to count both the number of lines matched and the total number of matches), but here you go:
perl -E 'undef $|; open $w, "<", "words"; @w=<$w>; chomp @w; $r{$_}=[0,{}] for @w; my $re = join "|", @w; while(<>) { $l++; while (/($re)/g) { $r{$1}[0]++; $r{$1}[1]{$l}++; } }; say "$_\t$r{$_}[0]\t" . scalar keys %{$r{$_}[1]} for @w' < text
This requires perl 5.10 or later, but changing it to support 5.8 and earlier is trivial. (Change the -E to -e, change say to print, and add a \n at the end of each line of output.)
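Spelled out, the pre-5.10 form would be (the same code with those three changes applied, untested):
perl -e 'undef $|; open $w, "<", "words"; @w=<$w>; chomp @w; $r{$_}=[0,{}] for @w; my $re = join "|", @w; while(<>) { $l++; while (/($re)/g) { $r{$1}[0]++; $r{$1}[1]{$l}++; } }; print "$_\t$r{$_}[0]\t" . scalar(keys %{$r{$_}[1]}) . "\n" for @w' < text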
Output:
blue 3 2
red 0 0
green 1 1
yellow 0 0
An awk (gawk) one-liner could save you from the grep puzzle:
awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
Formatting the code a bit:
awk 'NR==FNR { n[$0]; l[$0]; next }
    {
        for (w in n) {
            s = $0
            t = gsub(w, "#", s)
            n[w] += t
            l[w] += t > 0 ? 1 : 0
        }
    }
    END { for (x in n) print x, n[x], l[x] }' words text
test with your example:
kent$ awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
yellow 0 0
red 0 0
green 1 1
blue 3 2
If you want to format your output, you could just pipe the awk output to column -t,
so it looks like:
yellow 0 0
red 0 0
green 1 1
blue 3 2
awk '
NR==FNR { words[$0]; next }
{
for (word in words) {
count = gsub(word,word)
if (count) {
counts[word] += count
lines[word]++
}
}
}
END { for (word in words) printf "%s %d %d\n", word, counts[word], lines[word] }
' words text

Reformatting separated char to couples

Input:
rs001 A C T G C G T T
rs002 C C T T G G A A
out1:
rs001 AC TG CG TT
rs002 CC TT GG AA
out2 :
rs001 1 1 1 2
rs002 2 2 2 2
OK, so basically I want to replace any two identical nucleotides (like AA, CC, TT, or GG) with 2 and any two different ones (like AT, TA, CG, etc.) with 1, taking into account that the input should first be converted to out1 and then to out2. Also, we have many fields (like 200 columns) in each row, so loops are needed here.
This is what I tried:
cat input | awk '{ for (x = 2; x <= NF; x = x+2) print $x$(x+1) }'
The results are weird, so can anyone please tell me why I can't get out1? What mistakes did I make in the awk loops?
Thanks in advance
For the first,
sed 's/ \([ACGT]\) / \1/g' input >out1
This will remove the space after every other nucleotide. It matches a nucleotide with a space on both sides; the next match will pick up where the previous one ended.
For the second,
sed 's/\([ACGT]\)\1/2/g;s/[ACGT][ACGT]/1/g' out1 >out2
This replaces two adjacent identical letters with 2, then any remaining adjacent two letters with 1.
This assumes you have Linux; other sed dialects may require minor modifications.
awk '{
    out1 = out2 = $1
    for (i=2; i<=NF; i+=2) {
        out1 = out1 FS $i $(i+1)
        out2 = out2 FS ($i == $(i+1) ? 2 : 1)
    }
    print out1 > "out1"
    print out2 > "out2"
}' input
Here's how you fix your awk script to get output 1:
awk '{ printf "%s ", $1; for (x = 2; x <= NF; x = x + 2) {printf "%s%s ", $x, $(x+1)} printf "\n"}' input
print adds a newline at the end by default, so you'll have to use formatted strings (printf) to specify exactly where you want the newlines.
(I also added printf "%s ", $1; at the start to print the row label at the start of each line.)
Edit: Triplee's solution looks much more elegant than mine - you should ditch awk and go with his =)
This might work for you (GNU sed):
sed -re 's/ (.) / \1/g;w out1' -e 's/([ACTG])\1/2/g;s/[ACTG]./1/g' file >out2

delete lines from multiple files using gawk / awk / sed

I have two sets of text files. The first set is in the AA folder; the second set is in the BB folder. The content of the ff.txt file from the first set (AA folder) is shown below.
Name number marks
john 1 60
maria 2 54
samuel 3 62
ben 4 63
I would like to print the second column (number) from this file if marks > 60. The output would be 3, 4. Next, read the ff.txt file in the BB folder and delete the lines containing the numbers 3 and 4.
Files in the BB folder look like this; the second column is the number.
marks 1 11.824 24.015 41.220 1.00 13.65
marks 1 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
I used the following code. This code works perfectly for one file.
gawk 'BEGIN {getline} $3>60{print $2}' AA/ff.txt | while read number; do gawk -v number=$number '$2 != number' BB/ff.txt > /tmp/ff.txt; mv /tmp/ff.txt BB/ff.txt; done
But when I run this code with multiple files, I get an error.
gawk 'BEGIN {getline} $3>60{print $2}' AA/*.txt | while read number; do gawk -v number=$number '$2 != number' BB/*.txt > /tmp/*.txt; mv /tmp/*.txt BB/*.txt; done
The error:
mv: target `BB/kk.txt' is not a directory
I asked this question two days ago. Please help me to solve this error.
This creates an index of all files in folder AA and checks against all files in folder BB:
cat AA/*.txt | awk 'FNR==NR { if ($3 > 60) array[$2]; next } !($2 in array)' - BB/*.txt
This compares two individual files, assuming they have the same name in folders AA and BB:
ls AA/*.txt | sed "s%AA/\(.*\)%awk 'FNR==NR { if (\$3 > 60) array[\$2]; next } !(\$2 in array)' & BB/\1 %" | sh
HTH
EDIT
This should help :-)
ls AA/*.txt | sed "s%AA/\(.*\)%awk 'FNR==NR { if (\$3 > 60) array[\$2]; next } !(\$2 in array)' & BB/\1 > \1_tmp \&\& mv \1_tmp BB/\1 %" | sh
> /tmp/*.txt and mv /tmp/*.txt BB/*.txt are wrong: the shell expands the globs before the commands run, so you cannot redirect output to a wildcard or move multiple files onto a wildcard target.
For single file
awk 'NR>1 && $3>60{print $2}' AA/ff.txt > idx.txt
awk 'NR==FNR{a[$0]; next}; !($2 in a)' idx.txt BB/ff.txt
For multiple files
awk 'FNR>1 && $3>60{print $2}' AA/*.txt >idx.txt
cat BB/*.txt | awk 'NR==FNR{a[$0]; next}; !($2 in a)' idx.txt -
One perl solution:
use warnings;
use strict;
use File::Spec;

## Hash to save data to delete from files of the BB folder.
## key   -> file name.
## value -> string with the numbers of the second column, joined and
##          wrapped with '-', like: -2--3--1-, so they are easier to
##          search for with a regexp.
my %delete;

## Check arguments:
## 1.- There are two.
## 2.- Both are directories.
## 3.- Both have the same number of regular files, with identical names.
die qq[Usage: perl $0 <dir_AA> <dir_BB>\n] if
    @ARGV != 2 ||
    grep { ! -d } @ARGV;
{
    my %h;
    for ( glob join q[ ], map { qq[$_/*] } @ARGV ) {
        next unless -f;
        my $file = ( File::Spec->splitpath( $_ ) )[2] or next;
        $h{ $file }++;
    }
    for ( values %h ) {
        if ( $_ != 2 ) {
            die qq[Different files in both directories\n];
        }
    }
}

## Get files from dir 'AA'. Process them, print to output the lines which
## match the condition and save the information in the %delete hash.
for my $file ( glob( shift . qq[/*] ) ) {
    open my $fh, q[<], $file or do { warn qq[Couldn't open file $file\n]; next };
    $file = ( File::Spec->splitpath( $file ) )[2] or do {
        warn qq[Couldn't get file name from path\n]; next };
    while ( <$fh> ) {
        next if $. == 1;
        chomp;
        my @f = split;
        next unless @f >= 3;
        if ( $f[ $#f ] > 60 ) {
            $delete{ $file } .= qq/-$f[1]-/;
            printf qq[%s\n], $_;
        }
    }
}

## Process files found in dir 'BB'. For each line, print it if not found in
## the file from dir 'AA'.
{
    @ARGV = glob( shift . qq[/*] );
    $^I = q[.bak];
    while ( <> ) {
        ## Sanity check. Shouldn't occur.
        my $filename = ( File::Spec->splitpath( $ARGV ) )[2];
        if ( ! exists $delete{ $filename } ) {
            close ARGV;
            next;
        }
        chomp;
        my @f = split;
        if ( $delete{ $filename } =~ m/-$f[1]-/ ) {
            next;
        }
        printf qq[%s\n], $_;
    }
}
exit 0;
A test:
Assume the following tree of files. Command:
ls -R1
Output:
.:
AA
BB
script.pl
./AA:
ff.txt
gg.txt
./BB:
ff.txt
gg.txt
And the following content of files. Command:
head AA/*
Output:
==> AA/ff.txt <==
Name number marks
john 1 60
maria 2 54
samuel 3 62
ben 4 63
==> AA/gg.txt <==
Name number marks
john 1 70
maria 2 54
samuel 3 42
ben 4 33
Command:
head BB/*
Output:
==> BB/ff.txt <==
marks 1 11.824 24.015 41.220 1.00 13.65
marks 1 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
==> BB/gg.txt <==
marks 1 11.824 24.015 41.220 1.00 13.65
marks 2 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
Run the script like:
perl script.pl AA/ BB
With the following output to the screen:
samuel 3 62
ben 4 63
john 1 70
And the files of the BB directory are modified like this:
head BB/*
Output:
==> BB/ff.txt <==
marks 1 11.824 24.015 41.220 1.00 13.65
marks 1 13.058 24.521 40.718 1.00 11.82
==> BB/gg.txt <==
marks 2 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
So, from ff.txt the lines with numbers 3 and 4 have been deleted, and from gg.txt the lines with number 1; all of these numbers had marks bigger than 60 in the last column of the AA files. I think this is what you wanted to achieve. I hope it helps, although it's not awk.

How to extract a particular column of data in Perl?

I have some data from a unix commandline call
1 ab 45 1234
2 abc 5
4 yy 999 2
3 987 11
I'll use the system() function for the call.
How can I extract the second column of data into an array in Perl? Also, the array size has to be dependent on the number of rows that I have (it will not necessarily be 4).
I want the array to have ("ab", "abc", "yy", 987).
use strict;
use warnings;

my $data = "1 ab 45 1234
2 abc 5
2 abc 5
2 abc 5
4 yy 999 2
3 987 11";

my @second_col = map { (split)[1] } split /\n/, $data;
To get unique values, see perlfaq4. Here's part of the answer provided there:
my %seen;
my @unique = grep { ! $seen{ $_ }++ } @second_col;
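Since the data comes from a command, note that system() only returns the exit status, not the output; to capture output you can use qx// (backticks) instead. A minimal sketch, where your_command is a placeholder for your actual command line:
use strict;
use warnings;

# qx// runs the command and, in list context, returns one element per output line
my @lines = qx(your_command);    # your_command is a placeholder
my @second_col = map { (split)[1] } @lines;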
You can chain a Perl cmd-line call (aka: one-liner) to your unix script:
perl -lane 'print $F[1]' data.dat
Instead of data.dat, use a pipe from your command-line tool:
cat data.dat | perl -lane 'print $F[1]'
Addendum:
The extension for uniqueness of the resulting column is straightforward:
cat data.dat | perl -lane 'print $F[1] unless $seen{$F[1]}++'
or, if you are lazy (employing %_):
cat data.dat | perl -lane 'print unless $_{$_=$F[1]}++'