list the files with minimum sequence

list the files with minimum sequence - perl

I have some files in a directory as below (not necessarily sorted):
A_10
A_20
A_30
B_10
B_30
C_10
C_20
D_20
D_30
E_10
E_20
E_30
10, 20 and 30 are the sequence numbers of the files for each of A, B, C, D and E.
For each of A, B, C, D and E, I want to select only the file with the lowest sequence number.
The output should be:
A_10
B_10
C_10
D_20
E_10
could anybody help me?

perl -le '
print join $/,
grep !$_{( split "_" )[0]}++,
sort glob "*_*"
'
or:
printf '%s\n' *_* | sort | awk -F_ '!_[$1]++'
or:
printf '%s\n' *_* | sort -t_ -uk1,1
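One caveat with the one-liners above: a plain sort compares names as text, which works here only because every sequence number has the same width (A_100 would sort before A_20). A minimal sketch of a numeric-aware variant, assuming names of the form PREFIX_NUMBER with exactly one underscore; the function name min_seq is made up for illustration:

```shell
# List, for each prefix, the file with the smallest numeric sequence.
# Assumes names look like PREFIX_NUMBER with exactly one underscore and
# no newlines; min_seq is a hypothetical helper name.
min_seq() {
    dir=${1:-.}
    ( cd "$dir" && printf '%s\n' *_* ) |
        sort -t_ -k1,1 -k2,2n |    # by prefix, then numerically by sequence
        awk -F_ '!seen[$1]++'      # keep the first line seen for each prefix
}

# Usage: min_seq /path/to/dir
```

The second sort key carries the n flag, so only the part after the underscore is compared as a number.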

In bash:
for x in A B C D E; do
ls -1 ${x}_* | sort | head -n1
done

Related

Changing sign of numbers +/- multiple files

I want to reverse the sign of numbers in column x (2) in multiple files. For example:
From
1 | 2.0
2 | -3.0
3 | 1.0
To
1 |-2.0
2 |3.0
3 |-1.0
I am using sed '/^-/ {s/.//;b};s/^/-/' file command, but it does not work. Any suggestion?

A more "proper" way using actual math is easy with awk. For example if you want to negate columns 2 and 3:
awk '{print $1, -$2, -$3}'
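Note that with the sample data, default whitespace splitting makes the | itself a field. A rough sketch of the same idea with | as the explicit delimiter, assuming column 2 is always numeric; the name negate_col2 and the one-decimal output format are my own choices, not something from the question:

```shell
# Negate the second |-separated column of each input file (or stdin).
# negate_col2 is a hypothetical name; the " %.1f" output format (leading
# space, one decimal place) is an assumption chosen to match the sample.
negate_col2() {
    awk -F'|' -v OFS='|' '{ $2 = sprintf(" %.1f", -$2) } 1' "$@"
}

# Usage: negate_col2 ip.txt
```

This prints to standard output; POSIX awk has no in-place option, so to change files you would redirect to a temporary file and move it over the original.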

$ cat ip.txt
1 | 2.0
2 | -3.0
3 | 1.0
Modifying the sed command from the OP (this is not easy to adapt to a different column or a different delimiter):
$ sed -E '/^(.*\|\s*)-[0-9]/ {s/^(.*\|\s*)-/\1/;b}; s/^(.*\|\s*)/&-/' ip.txt
1 | -2.0
2 | 3.0
3 | -1.0
With perl, where it is easier to specify the delimiter and modify a specific column:
$ perl -F'\|' -lane '$F[1] =~ m/-/ ? $F[1] =~ s/-// : $F[1] =~ s/\d/-$&/; print join "|", @F' ip.txt
1 | -2.0
2 | 3.0
3 | -1.0
To modify multiple files in place within a folder, use the -i option:
sed -i -E '/^(.*\|\s*)-[0-9]/ {s/^(.*\|\s*)-/\1/;b}; s/^(.*\|\s*)/&-/' *
and
perl -i -F'\|' -lane '$F[1] =~ m/-/ ? $F[1] =~ s/-// : $F[1] =~ s/\d/-$&/; print join "|", @F' *
If number format is not an issue:
$ perl -F'\|' -lane '$F[1] = -$F[1]; print join "|", @F' ip.txt
1 |-2
2 |3
3 |-1

how to put | between content lines of a text file?

I have a file containing:
L1
L2
L3
.
.
.
L512
I want to change its content to :
L1 | L2 | L3 | ... | L512
It seems so easy, but I have now been sitting here for an hour trying to make it work. I tried to do it with sed, but didn't get what I wanted: it seems that sed just inserts empty lines between the content. Any suggestions, please?

With sed this requires reading the whole input into a buffer and then replacing all newlines with " | ", like this:
sed ':a;N;$!ba;s/\n/ | /g' input.txt
Part 1 - buffering input
:a defines a label called 'a'
N gets the next line from input and appends it to the pattern buffer
$!ba jumps to a unless the end of input is reached
Part 2 - replacing newlines by |
s/\n/ | /g executes the substitute command on the pattern buffer, replacing every newline with " | "
As you can see, this is inefficient, since it requires sed to:
read the complete input into memory
operate three times on the input: 1. reading, 2. substituting, 3. printing
Therefore I would suggest using awk, which can do it in one loop:
awk 'NR==1{printf "%s", $0;next}{printf " | %s", $0}END{print ""}' input.txt

Here is one sed
sed ':a;N;s/\n/ | /g;ta' file
L1 | L2 | L3 | ... | L512
And one awk
awk '{printf("%s%s",sep,$0);sep=" | "} END {print ""}' file
L1 | L2 | L3 | ... | L512

perl -pe 's/\n/ | / unless eof' file

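If the space around | is not mandatory (tr reads standard input only, so the file has to be redirected):
tr "\n" '|' < YourFile
Note that this also turns the final newline into a trailing |.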

Several options, including those mentioned here:
paste -sd'|' file
sed ':a;N;s/\n/ | /g;ta' file
sed ':a;N;$!ba;s/\n/ | /g' file
perl -0pe 's/\n/ | /g;s/ \| $/\n/' file
perl -0nE 'say join " | ", split /\n/' file
perl -E 'chomp(@x=<>); say join " | ", @x' file
mapfile -t ary < file; (IFS="|"; echo "${ary[*]}")
awk '{printf("%s%s",sep,$0);sep=" | "} END {print ""}' file
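Of the options above, paste -sd'|' is the shortest but joins with a bare |. A small sketch that pads the separator afterwards, assuming no input line already contains a literal | (the name join_lines is hypothetical):

```shell
# Join all lines of a file (or stdin) with " | ".
# join_lines is a hypothetical name; assumes no input line already
# contains a literal "|", since sed pads every "|" it sees.
join_lines() {
    paste -sd'|' "${1:--}" | sed 's/|/ | /g'
}

# Usage: join_lines file
```

If the data may contain | characters, the awk version with the sep variable is the safer choice, since it never rewrites the line contents.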

Printing reverse complement of DNA in single-line Perl

I want to write a quick single-line perl script to produce the reverse complement of a sequence of DNA. The following isn't working for me, however:
$ cat sample.dna.sequence.txt | perl -ne '{while (<>) {$seq = $_; $seq =~ tr /atcgATCG/tagcTAGC/; $revComp = reverse($seq); print $revComp;}}'
Any suggestions? I'm aware that
tr -d "\n " < input.txt | tr "[ATGCatgcNn]" "[TACGtacgNn]" | rev
works in bash, but I want to do it with perl for the practice.

Your problem is that you're using both -n and while (<>) { }, so you end up with while (<>) { while (<>) { } }.
If you know how to do <file.txt, why did you switch to cat file.txt|?!
perl -0777ne's/[\n ]//g; tr/ATGCatgcNn/TACGtacgNn/; print scalar reverse $_;' input.txt
or
perl -0777pe's/[\n ]//g; tr/ATGCatgcNn/TACGtacgNn/; $_ = reverse $_;' input.txt
Or if you don't need to remove the newlines:
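perl -lpe'tr/ATGCatgcNn/TACGtacgNn/; $_ = reverse $_;' input.txt
(The -l strips each line's newline before the reversal and adds it back when printing; without it, the newline ends up at the start of every reversed line.)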

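If you need to use cat, the following one-liner should work for you. Note the scalar: print supplies list context, in which reverse would reverse the (one-element) argument list instead of the string.
ewolf@~ $ cat foo.txt
atNgNt
gatcGn
ewolf@~ $ cat foo.txt | perl -lne '$seq = $_; $seq =~ tr/atcgATCG/tagcTAGC/; print scalar reverse( $seq )'
aNcNat
nCgatc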

For DNA sequences in single-line format in a multi-FASTA file:
cat multifasta_file.txt | while IFS= read -r L; do if [[ $L == ">"* ]]; then echo "$L"; else echo "$L" | rev | tr "ATGCatgc" "TACGtacg"; fi; done > output_file.txt
If your multi-FASTA file is not in single-line format, you can convert it to single-line format before using the command above, like this:
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' <multifasta_file.txt >multifasta_file_singleline.txt
Then,
cat multifasta_file_singleline.txt | while IFS= read -r L; do if [[ $L == ">"* ]]; then echo "$L"; else echo "$L" | rev | tr "ATGCatgc" "TACGtacg"; fi; done > output_file.txt
Hope it is useful for someone. It took me some time to build it.

The problem is that you're using -n in the perl flag, yet you've written your own loop. -n wraps your supplied code in a while loop like while(<STDIN>){...}. So the STDIN file handle has already been read from and your code does it again, getting EOF (end of file) or rather 'undefined'. You either need to remove the n from -ne or remove the while loop from your code.
Incidentally, a complete complement tr pattern, including ambiguous bases, is:
tr/ATGCBVDHRYKMatgcbvdhrykm/TACGVBHDYRMKtacgvbhdyrmk/
Ambiguous bases have complements too. For example, a V stands for an A, C, or G. Their complements are T, G, and C, which is represented by the ambiguous base B. Thus, V and B are complementary.
You don't need to include any N's or n's in your tr pattern (as was demonstrated in another answer) because the complement is the same and leaving them out will leave them untouched. It's just extra processing to put them in the pattern.
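For completeness, here is a rough sketch that plugs the full ambiguity-code table into the tr pipeline from the question. The function name revcomp is made up, and awk is used for the reversal only as a stand-in for rev, in case rev is not installed:

```shell
# Reverse-complement a DNA sequence read from stdin, including the
# ambiguity codes listed above. revcomp is a hypothetical name; newlines
# and spaces are stripped first, so the whole input is treated as one
# sequence. The awk step rebuilds the line back to front (a stand-in for rev).
revcomp() {
    { tr -d '\n '; echo; } |
        tr 'ATGCBVDHRYKMatgcbvdhrykm' 'TACGVBHDYRMKtacgvbhdyrmk' |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }'
}

# Usage: revcomp < input.txt
```

It does not treat FASTA header lines specially, so it assumes the input holds nothing but sequence data.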

How do I find the largest 10 files in a given directory?

How do I find the largest 10 files in a given directory, with Perl or Bash?
EDIT:
I need this to be recursive.
I only want to see large files, no large directories.
I need this to work on Mac OS X 10.6 ('s version of find).

This prints the 10 largest files recursively from current directory.
find . -type f -printf "%s %p\n" | sort -nr | awk '{print $2}' | head -10
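Note that -printf is an extension of GNU find, and as far as I know the find shipped with OS X does not support it. A sketch that sticks to POSIX find, wc, sort and head instead (the name largest_files is hypothetical, and file names are assumed not to contain newlines):

```shell
# Print the N largest regular files under a directory as "SIZE<TAB>PATH".
# largest_files is a hypothetical name. Uses only POSIX find, wc, sort and
# head; assumes file names contain no newlines.
largest_files() {
    dir=${1:-.}
    n=${2:-10}
    find "$dir" -type f -exec sh -c '
        for f do
            # $(( ... )) strips the padding some wc implementations add
            printf "%d\t%s\n" "$(( $(wc -c < "$f") ))" "$f"
        done' sh {} + |
        sort -rn |
        head -n "$n"
}

# Usage: largest_files . 10
```

Keeping the size and the path on one tab-separated line avoids the awk '{print $2}' step, which would cut off paths that contain spaces. Depending on the wc implementation this may read every file, so it can be slower than a stat-based approach.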

$ alias ducks
alias ducks='du -cs * |sort -rn |head -11'

This is a way to do it in perl. (Note: Non-recursive version, according to earlier version of the question)
perl -wE 'say for ((sort { -s $b <=> -s $a } </given/dir/*>)[0..9]);'
However, I'm sure there are better tools for the job.
ETA: Recursive version, using File::Find:
perl -MFile::Find -wE '
sub wanted { -f && push @files, $File::Find::name };
find(\&wanted, "/given/dir");
@files = sort { -s $b <=> -s $a } @files;
say for @files[0..9];'
To check file sizes, use e.g. printf("%-10s : %s\n", -s, $_) for @files[0..9]; instead.

How about this -
find . -type f -exec ls -l {} + | awk '{print $5,$NF}' | sort -nr | head -n 10
Test:
[jaypal:~/Temp] find . -type f -exec ls -l {} + | awk '{print $5,$NF}' | sort -nr | head -n 10
8887 ./backup/GTP/GTP_Parser.sh
8879 ./backup/Backup/GTP_Parser.sh
6791 ./backup/Delete_HIST_US.sh
6785 ./backup/Delete_NORM_US.sh
6725 ./backup/Delete_HIST_NET.sh
6711 ./backup/Delete_NORM_NET.sh
5339 ./backup/GTP/gtpparser.sh
5055 ./backup/GTP/gtpparser3.sh
4830 ./backup/GTP/gtpparser2.sh
3955 ./backup/GTP/temp1.file

Counting lines ignored by grep

Let me try to explain this as clearly as I can...
I have a script that at some point does this:
grep -vf ignore.txt input.txt
This ignore.txt has a bunch of lines with things I want my grep to ignore, hence the -v (meaning I don't want to see them in the output of grep).
Now, what I want to do is I want to be able to know how many lines of input.txt have been ignored by each line of ignore.txt.
For example, if ignore.txt had these lines:
line1
line2
line3
I would like to know how many lines of input.txt were ignored by ignoring line1, how many by ignoring line2, and so on.
Any ideas on how I can do this?
I hope that made sense... Thanks!

Note that the sum of the ignored lines plus the shown lines may NOT add up to the total number of lines... "line1 and line2 are here" will be counted twice.
#!/usr/bin/perl
use warnings;
use strict;
local @ARGV = 'ignore.txt';
chomp(my @pats = <>);
foreach my $pat (@pats) {
print "$pat: ", qx/grep -c $pat input.txt/;
}

According to unix.stackexchange
grep -o pattern file | wc -l
counts the total number of occurrences of a given pattern in the file. Given this, and since you already use a script, one solution is to run a separate grep for each pattern you want to ignore and count its matches.
However, I'd try to build a more convenient solution using a scripting language such as Python.

This script will count the matched lines by hash lookup and save the lines to be printed in @result, where you may process them as you will. To emulate grep, just print them.
I made the script so it can print out an example. To use with the files, uncomment the code in the script, and comment the ones marked # example line.
Code:
use strict;
use warnings;
use v5.10;
use Data::Dumper; # example line
# Example data.
my @ignore = ('line1' .. 'line9'); # example line
my @input = ('line2' .. 'line9', 'fo' .. 'fx', 'line2', 'line3'); # example line
#my $ignore = shift; # first argument is ignore.txt
#open my $fh, '<', $ignore or die $!;
#chomp(my @ignore = <$fh>);
#close $fh;
my @result;
my %lookup = map { $_ => 0 } @ignore;
my $rx = join '|', map quotemeta, @ignore;
#while (<>) { # This processes the remaining arguments, input.txt etc
for (@input) { # example line
chomp; # Required to avoid bugs due to missing newline at eof
if (/($rx)/) {
$lookup{$1}++;
} else {
push @result, $_;
}
}
#say for @result; # This will emulate grep
print Dumper \%lookup; # example line
Output:
$VAR1 = {
'line6' => 1,
'line1' => 0,
'line5' => 1,
'line2' => 2,
'line9' => 1,
'line3' => 2,
'line8' => 1,
'line4' => 1,
'line7' => 1
};

while IFS= read -r pattern ; do
printf '%s:' "$pattern"
grep -c -v "$pattern" input.txt
done < ignore.txt
grep with -c counts matching lines, but with -v added it counts non-matching lines. So, simply loop over the patterns and count once for each pattern.
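If the number wanted is how many lines each pattern removes, the matching count (grep -c without -v) may be closer to what the question asks. A minimal sketch along those lines; the function name count_ignored is hypothetical, and patterns are treated as basic regular expressions, the same way grep -f would read them:

```shell
# For each line of a pattern file, print "PATTERN:COUNT", where COUNT is
# the number of lines in the input file that match that pattern.
# count_ignored is a hypothetical name; patterns are treated as basic
# regular expressions.
count_ignored() {
    patterns=$1
    input=$2
    while IFS= read -r pattern; do
        [ -n "$pattern" ] || continue   # an empty pattern would match every line
        printf '%s:%s\n' "$pattern" "$(grep -c -e "$pattern" -- "$input")"
    done < "$patterns"
}

# Usage: count_ignored ignore.txt input.txt
```

A line that matches two patterns is counted under both, so the per-pattern counts can add up to more than the number of lines actually dropped.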

This will print the number of ignored matches along with the matching pattern:
grep -of ignore.txt input.txt | sort | uniq -c
For example:
$ perl -le 'print "Coroline" . ++$s for 1 .. 21' > input.txt
$ perl -le 'print "line2\nline14"' > ignore.txt
$ grep -of ignore.txt input.txt | sort | uniq -c
1 line14
3 line2
I.e., A line matching "line14" was ignored once. A line matching "line2" was ignored 3 times.
If you just wanted to count the total ignored lines this would work:
grep -cof ignore.txt input.txt
Update: modified the example above to use strings so that the output is a little clearer.

This might work for you:
# seq 1 15 | sed '/^1/!d' | sed -n '$='
7
Explanation:
Delete all lines except those that match. Pipe these matching (ignored) lines to another sed command. Delete all these lines but show the line number only of the last line. So in this example 1 thru 15, lines 1,10 thru 15 are ignored - a total of 7 lines.
EDIT:
Sorry misread the question (still a little confused!):
sed 's,.*,sed "/&/!d;s/.*/matched &/" input.txt| uniq -c,' ignore.txt | sh
This shows the number of matches for each pattern in ignore.txt.
sed 's,.*,sed "/&/d;s/.*/non-matched &/" input.txt | uniq -c,' ignore.txt | sh
This shows the number of non-matches for each pattern in ignore.txt.
If using GNU sed, these should work too:
sed 's,.*,sed "/&/!d;s/.*/matched &/" input.txt | uniq -c,;e' ignore.txt
or
sed 's,.*,sed "/&/d;s/.*/non-matched &/" input.txt | uniq -c,;e' ignore.txt
N.B. Your success with patterns may vary i.e. check for meta characters beforehand.
On reflection I thought this can be improved to:
sed 's,.*,/&/i\\matched &,;$a\\d' ignore.txt | sed -f - input.txt | sort -k2n | uniq -c
or
sed 's,.*,/&/!i\\non-matched &,;$a\\d' ignore.txt | sed -f - input.txt | sort -k2n | uniq -c
But NO, on large files this is actually slower.

Are both ignore.txt and input.txt sorted?
If so, you can use the comm command!
$ comm -12 ignore.txt input.txt
How many lines are ignored?
$ comm -12 ignore.txt input.txt | wc -l
Or, if you want to do more processing, combine comm with awk:
$ comm ignore.txt input.txt | awk '
END {print "Ignored lines = " igtotal " Lines not ignored = "commtotal " Lines unique to Ignore file = " uniqtotal}
{
if ($0 !~ /^\t/) {uniqtotal+=1}
if ($0 ~ /^\t[^\t]/) {commtotal+=1}
if ($0 ~ /^\t\t/) {igtotal+=1}
}'
Here I'm taking advantage of the tabs that the comm command places in its output:
* If there are no tabs, the line is in ignore.txt only.
* If there is a single tab, it is in input.txt only
* If there are two tabs, the line is in both files.
By the way, not all the lines in ignore.txt are ignored. If the line isn't also in input.txt, the line can't really be said to be ignored.
With Dennis Williamson's Suggestion
comm ignore.txt input.txt | awk '
!/^\t/ {uniqtotal++}
/^\t[^\t]/ {commtotal++}
/^\t\t/ {igtotal++}
END {print "Ignored lines = " igtotal " Lines not ignored = "commtotal " Lines unique to Ignore file = " uniqtotal}'
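If the files are not sorted, a rough sketch that sorts temporary copies first (the name count_common is made up; sort -u also drops duplicate lines, so this counts distinct common lines):

```shell
# Count the distinct lines that appear verbatim in both files, sorting
# copies first because comm expects sorted input.
# count_common is a hypothetical name.
count_common() {
    tmp=$(mktemp -d) || return 1
    sort -u "$1" > "$tmp/a"
    sort -u "$2" > "$tmp/b"
    comm -12 "$tmp/a" "$tmp/b" | wc -l | tr -d ' '
    rm -rf "$tmp"
}

# Usage: count_common ignore.txt input.txt
```

Keep in mind that comm compares whole lines, so this only helps when ignore.txt holds complete lines rather than patterns that match part of a line.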