Using the bash sort command within variable-length filenames - perl

I am trying to numerically sort a series of files output by the ls command which match the pattern either ABCDE1234A1789.RST.txt or ABCDE12345A1789.RST.txt by the '789' field.
In the example patterns above, ABCDE is the same for all files, 1234 or 12345 are digits that vary but are always either 4 or 5 digits in length. A1 is the same length for all files, but value can vary so unfortunately it can't be used as a delimiter. Everything after the first . is the same for all files. Something like:
ls -l *.RST.txt | sort -k +9.13 | awk '{print $9} ' > file-list.txt
will match the shorter filenames but not the longer ones because of the variable length of characters before the field I want to sort by.
Is there a way to accomplish sorting all files without first padding the shorter-length files to make them all the same length?

Perl to the rescue!
perl -e 'print "$_\n" for sort { substr($a, -11, 3) cmp substr($b, -11, 3) } glob "*.RST.txt"'
If your perl is more recent (5.10 or newer), you can shorten it to
perl -E 'say for sort { substr($a, -11, 3) cmp substr($b, -11, 3) } glob "*.RST.txt"'
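For comparison, here is a minimal bash sketch of the same idea: take the three characters just before the constant .RST.txt suffix as the sort key, counting from the end of the name so the variable-length prefix does not matter. The function name sort_by_trailing_field is hypothetical, and the sketch assumes filenames without tabs or newlines.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sort names by the three characters before ".RST.txt".
# Assumes names contain no tab or newline characters.
sort_by_trailing_field() {
    local f stem
    for f in "$@"; do
        stem=${f%.RST.txt}                      # drop the constant suffix
        printf '%s\t%s\n' "${stem: -3}" "$f"    # key = last three characters
    done | sort -t $'\t' -k1,1n | cut -f2-      # sort on the key, then drop it
}

# Possible usage (not run here):
#   sort_by_trailing_field *.RST.txt > file-list.txt
```

The decorate, sort, undecorate pattern keeps the key extraction in the shell and leaves sort with a fixed-position numeric field.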

Because of the parts of the filename which you've identified as unchanging, you can actually build a key which sort will use:
$ echo ABCDE{99999,8765,9876,345,654,23,21,2,3}A1789.RST.txt \
| fmt -w1 \
| sort -tE -k2,2n --debug
ABCDE2A1789.RST.txt
_
___________________
ABCDE3A1789.RST.txt
_
___________________
ABCDE21A1789.RST.txt
__
etc.
What this does is tell sort to separate the fields on character E, then use the 2nd field numerically. --debug arrived in coreutils 8.6, and can be very helpful in seeing exactly what sort is doing.
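The same key definition without --debug can be sketched as below. Note that the key is the run of digits right after the E (the 4- or 5-digit field), not the trailing 789 field. The wrapper name sort_by_digits_after_E is made up for illustration, and the names are assumed to contain only one upper-case E.

```bash
#!/usr/bin/env bash
# Sketch: sort names on stdin by the digits that follow the "E".
# sort_by_digits_after_E is a hypothetical name used only for this example.
sort_by_digits_after_E() {
    # -tE splits each line on "E"; -k2,2n compares field 2 numerically,
    # so only the leading digits of that field take part in the comparison.
    LC_ALL=C sort -tE -k2,2n
}

printf '%s\n' ABCDE8765A1789.RST.txt ABCDE21A1789.RST.txt ABCDE3A1789.RST.txt |
    sort_by_digits_after_E
```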

The conventional way to do this in bash is to extract your sort field. Except for the sort command, the following is implemented in pure bash alone:
sort_names_by_first_num() {
  shopt -s extglob
  for f; do
    first_num="${f##+([^0-9])}"       # strip the leading non-digits
    first_num=${first_num%%[^0-9]*}   # strip everything from the first non-digit on
    [[ $first_num ]] && printf '%s\t%s\n' "$first_num" "$f"
  done | sort -n | while IFS='' read -r name; do name=${name#*$'\t'}; printf '%s\n' "$name"; done
}
sort_names_by_first_num *.RST.txt
That said, newline-delimiting filenames (as this question seems to call for) is a bad practice: Filenames on UNIX filesystems are allowed to contain newlines within their names, so separating them by newlines within a list means your list is unable to contain a substantial subset of the range of valid names. It's much better practice to NUL-delimit your lists. Doing that would look like so:
sort_names_by_first_num() {
  shopt -s extglob
  for f; do
    first_num="${f##+([^0-9])}"       # strip the leading non-digits
    first_num=${first_num%%[^0-9]*}   # strip everything from the first non-digit on
    [[ $first_num ]] && printf '%s\t%s\0' "$first_num" "$f"
  done | sort -n -z | while IFS='' read -r -d '' name; do name=${name#*$'\t'}; printf '%s\0' "$name"; done
}
sort_names_by_first_num *.RST.txt
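A NUL-delimited list also needs a NUL-aware consumer. The sketch below shows one way to read such a stream back into a bash array. emit_names is a stand-in producer and read_nul_list a hypothetical helper; neither is part of the answer above.

```bash
#!/usr/bin/env bash
# Stand-in producer (hypothetical): emits its arguments NUL-delimited.
emit_names() { printf '%s\0' "$@"; }

# Reads NUL-delimited names from stdin into the global array NAMES.
read_nul_list() {
    NAMES=()
    local name
    while IFS= read -r -d '' name; do
        NAMES+=("$name")
    done
}

# A name containing a newline survives the round trip intact.
read_nul_list < <(emit_names $'a\nb.RST.txt' 'c.RST.txt')
printf 'read %d names\n' "${#NAMES[@]}"
```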

Sed - replace words

I have a problem with replacing string.
|Stm=2|Seq=2|Num=2|Svc=101|MsgSize(514)=514|MsgType=556|SymbolIndex=16631
I want to find occurrence of Svc till | appears and swap place with Stm till | appears.
My attempts went to replacing characters and this is not my goal.
awk -F'|' -v OFS='|' '{a=b=0;
for(i=1;i<=NF;i++){a=$i~/^Stm=/?i:a;b=$i~/^Svc=/?i:b}
t=$a;$a=$b;$b=t}7' file
outputs:
|Svc=101|Seq=2|Num=2|Stm=2|MsgSize(514)=514|MsgType=556|SymbolIndex=16631
The code exchanges the Stm... and Svc... columns, no matter which one comes first.
If a Perl solution is okay (this assumes exactly one column matches each search term):
$ cat ip.txt
|Stm=2|Seq=2|Num=2|Svc=101|MsgSize(514)=514|MsgType=556|SymbolIndex=16631
$ perl -F'\|' -lane '
@i = grep { $F[$_] =~ /Svc|Stm/ } 0..$#F;
$t=$F[$i[0]]; $F[$i[0]]=$F[$i[1]]; $F[$i[1]]=$t;
print join "|", @F;
' ip.txt
|Svc=101|Seq=2|Num=2|Stm=2|MsgSize(514)=514|MsgType=556|SymbolIndex=16631
-F'\|' -lane split input line on |, see also Perl flags -pe, -pi, -p, -w, -d, -i, -t?
@i = grep { $F[$_] =~ /Svc|Stm/ } 0..$#F get index of columns matching Svc and Stm
$t=$F[$i[0]]; $F[$i[0]]=$F[$i[1]]; $F[$i[1]]=$t swap the two columns
Or use ($F[$i[0]], $F[$i[1]]) = ($F[$i[1]], $F[$i[0]]); courtesy How can I swap two Perl variables
print join "|", @F print the modified array
You need to use capture groups and backreferences in a string substitution.
The below will swap the two fields; [^|]* keeps each capture inside its own field, whereas a greedy .*| would also swallow the fields that follow:
echo '|Stm=2|Seq=2|Num=2|Svc=101|MsgSize(514)=514|MsgType=556|SymbolIndex=16631' | sed 's/\(Stm=[^|]*\)\(.*\)\(Svc=[^|]*\)/\3\2\1/'
As pointed out in the comment from @Kent, this will not work if the strings are not in that order.
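An order-independent swap can also be sketched in plain bash by splitting the line on |, locating the two fields, and exchanging them. The function name swap_fields and its argument order are made up for this example. It assumes at most one field starts with each key and that the line does not end in | (read -a would drop a trailing empty field).

```bash
#!/usr/bin/env bash
# Hypothetical helper: swap_fields LINE KEY1 KEY2
# Swaps the "KEY1=..." and "KEY2=..." fields of a |-separated line.
swap_fields() {
    local line=$1 k1=$2 k2=$3 i a=-1 b=-1 tmp
    local -a f
    IFS='|' read -r -a f <<< "$line"          # split on | (leading empty field kept)
    for i in "${!f[@]}"; do
        [[ ${f[i]} == "$k1="* ]] && a=$i      # remember where each key lives
        [[ ${f[i]} == "$k2="* ]] && b=$i
    done
    if (( a >= 0 && b >= 0 )); then           # swap only when both were found
        tmp=${f[a]}; f[a]=${f[b]}; f[b]=$tmp
    fi
    ( IFS='|'; printf '%s\n' "${f[*]}" )      # re-join with |
}

swap_fields '|Stm=2|Seq=2|Num=2|Svc=101|MsgSize(514)=514|MsgType=556|SymbolIndex=16631' Stm Svc
```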

Conditional substitution of patterns in bash strings depending on the beginning of a string

I am new to bash, so excuse me if I do not use the right terms.
I need to substitute certain patterns of six characters in a set of files. The order in which the patterns are substituted depends on the beginning of each string of text.
This is an example of input:
chr1:123-123 5GGGTTAGGGTTAGGGTTAGGGTTAGGGTTA3
chr1:456-456 5TTAGGGTTAGGGTTAGGGTTAGGGTTAGGG3
chr1:789-789 5GGGCTAGGGTTAGGGTTAGGGTTA3
chr1:123-123 etc is the name of the string, they are separated from the string I need to work with by a tab. The string I need to work with is delimited by characters 5 and 3, but I can change them.
I want every pattern containing T, A, G in any one of these orders to be substituted with X: TTAGGG, TAGGGT, AGGGTT, GGGTTA, GGTTAG, GTTAGG.
Similarly, patterns containing CTAGGG, like row 3, in orders similar to the previous one will be substituted with a different character.
The game is repeated with some specific differences for all the 6 characters composing each pattern.
I started writing something like this:
#!/bin/bash
NORMAL=`echo "\033[m"`
RED=`echo "\033[31m"` #red
#read filename for the input file and create a copy and a folder for the output
read -p "Insert name for INPUT file: " INPUT
echo "Creating OUTPUT file " "${RED}"$INPUT"_sub.txt${NORMAL}"
mkdir -p ./"$INPUT"_OUTPUT
cp $INPUT.txt ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
echo
#start the first set of instructions
perfrep
#starting a second set of instructions to substitute pattern with one difference from TTAGGG
onemism
Instructions are
perfrep() {
sed -i -e 's/TTAGGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/TAGGGT/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/AGGGTT/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGGTTA/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGTTAG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GTTAGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
}
# starting a second set of instructions to substitute pattern with one difference from TTAGGG
onemism(){
sed -i -e 's/[GCA]TAGGG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/G[GCA]TAGG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GG[GCA]TAG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGG[GCA]TA/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/AGGG[GCA]T/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/TAGGG[GCA]/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
}
I will need to repeat also with T[GCA]AGGG, TT[TCG]GGG, TTA[ACT]GG, TTAG[ACT]G and TTAGG[ACT].
Using this procedure, I get these results for the inputs shown:
5GGGXXXXTTA3
5XXXXX3
5GGGLXXTTA3
From my point of view, for my job, the first and second strings both consist of X repeated five times, and the order of characters is just slightly different. On the other hand, the third one could be masked like this:
5LXXX3
How do I tell the script that if the string starts with 5GGGTTA instead of 5TTAGGG, it must start substituting with
sed -i -e 's/GGGTTA/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
instead of
sed -i -e 's/TTAGGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
?
I will need to repeat with all cases; for instance, if the string starts with GTTAGG I will need to start with
sed -i -e 's/GTTAGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
and so on, and add a couple of variations of my pattern.
I need to repeat the substitution with TTAGGG and the variations for all the rows of my input file.
Sorry for the very long question. Thank you all.
Adding information asked by Varun.
Patterns of 6 characters would be TTAGGG , [GCA]TAGGG , T[GCA]AGGG , TT[TCG]GGG , TTA[ACT]GG , TTAG[ACT]G , TTAGG[ACT].
Each one must be checked for a different frame, for instance for TTAGGG we have 6 frames TTAGGG , GTTAGG , GGTTAG, GGGTTA , AGGGTT , TAGGGT.
The same frames must be applied to the pattern containing a variable position.
I will have a total of 42 patterns to check, divided in 7 groups: one containing TTAGGG and derivative frames, 6 with the patterns with a variable position and their derivatives.
TTAGGG and derivatives are the most important and need to be checked first.
#! /usr/bin/awk -f
# generate a "frame" by moving the first char to the end
function rotate(base){ return substr(base,2) substr(base,1,1) }
# Unfortunately awk arrays do not store regexps
# so I am generating the list of derivative strings to match
function generate_derivative(frame,arr, i,j,k,head,read,tail) {
    # i is unset here, so this stores the frame under the empty index
    arr[i]=frame;
    for(j=1; j<=length(frame); j++) {
        head=substr(frame,1,j-1);
        read=substr(frame,j,1);
        tail=substr(frame,j+1);
        for( k=1; k<=3; k++) {
            # use a global index to simplify
            arr[++Z]= head substr(snp[read],k,1) tail
        }
    }
}
BEGIN{
    # split and re-join on tabs (the variable names must be upper case)
    FS=OFS="\t";
    # alternatives to a base
    snp["A"]="TCG"; snp["T"]="ACG"; snp["G"]="ATC"; snp["C"]="ATG";
    # the primary target
    frame="TTAGGG";
    Z=1; # warning GLOBAL
    X[Z] = frame;
    # primary derivatives
    generate_derivative(frame, X);
    xn = Z;
    # secondary shifted targets and their derivatives
    for(i=1; i<length(frame); i++){
        frame = rotate(frame);
        L[++Z] = frame;
        generate_derivative(frame, L);
    }
}
/^chr[0-9:-]*\t5[ACTG]*3$/ {
    # because we care about the order of the primary matches
    for (i=1; i<=xn; i++) {gsub(X[i],"X",$2)}
    # since we don't care about the order of the secondary matches
    for (hit in L) {gsub(L[hit],"L",$2)}
    print
}
END{
    # print the matches in the order they are generated
    #for (i=1; i<=xn; i++) {print X[i]};
    #print ""
    #for (i=1+xn; i<=Z; i++) {print L[i]};
}
IFF you can generate a static matching order you can live with, then something like the above Awk script could work. But you say the primary patterns should take precedence and that a secondary rule would be better applied first in some cases (no can do).
If you need a more flexible matching pattern I would suggest looking at "recursive descent parsing with backtracking" or "parsing expression grammars".
But then you are not in a bash shell anymore.
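For the narrower question of choosing the first substitution from the way the string starts, a minimal bash sketch is shown below. It tries each of the six frames of TTAGGG against the start of the sequence and substitutes that frame; if none matches it falls back to TTAGGG itself. The function name mask_primary is hypothetical, the sequence is assumed to be passed without the 5 and 3 delimiters, and the one-mismatch patterns are not handled at all.

```bash
#!/usr/bin/env bash
# Hypothetical helper: mask_primary SEQUENCE
# Replaces the frame of TTAGGG that the sequence starts with by X.
mask_primary() {
    local seq=$1 motif=TTAGGG frame i
    frame=$motif
    for (( i = 0; i < ${#motif}; i++ )); do
        if [[ $seq == "$frame"* ]]; then          # sequence starts with this frame
            printf '%s\n' "${seq//$frame/X}"
            return 0
        fi
        frame=${frame:1}${frame:0:1}              # rotate: first char moves to the end
    done
    printf '%s\n' "${seq//$motif/X}"              # no frame at the start: use the motif
}

mask_primary GGGTTAGGGTTAGGGTTAGGGTTAGGGTTA
mask_primary TTAGGGTTAGGGTTAGGGTTAGGGTTAGGG
```

With this approach both example strings reduce to five X characters, because the substitution starts in the frame the string itself begins with.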

printf zero padded string

The format of MAC addresses varies with the platform.
E.g. on HPUX I could get something like:
0:0:c:7:ac:1e
While Linux gives me
00:00:0c:07:ac:1e
I used to use awk in a kornshell script on CentOS5 to format this to 00000c07ac1e like shown below.
MAC="0:0:c:7:ac:1e"
echo $MAC | awk -F: '{printf( "%02s%02s%02s%02s%02s%02s\n", $1,$2,$3,$4,$5,$6)}'
Unfortunately our admin server now is Ubuntu 14LTS with a newer version of awk which doesn't support the zero padding in the %s format anymore and I get an undesired 0 0 c 7ac1e
So I now switched to perl and do:
echo $MAC | perl -ne '{@A=split(":"); printf( "%02s%02s%02s%02s%02s%02s", @A)}'
As this may break too in upcoming releases I am looking for a more robust but still compact way to format the string.
Your Perl snippet will not break in future releases. This is basic functionality. Changing it would break many, many programs. (Plus, Perl has a mechanism for introducing backwards-incompatible changes without breaking existing programs.)
Cleaned up:
echo "$MAC" | perl -ne'@F=split(/:/); printf("%02s%02s%02s%02s%02s%02s\n", @F)'
Shorter:
echo "$MAC" | perl -ne'printf "%02s%02s%02s%02s%02s%02s\n", split /:/'
Without the repetition:
echo "$MAC" | perl -ple'$_ = join "", map sprintf("%02s", $_), split /:/'
There's -a if you want something more awkish:
echo "$MAC" | perl -F: -aple'$_ = join "", map sprintf("%02s", $_), @F'
Bit long but should be pretty robust
awk -F: '{for(i=1;i<=NF;i++){while(length($i)<2)$i=0$i;printf "%s",$i;}print ""}'
How it works
1. Loop through the fields.
2. While the field is less than 2 characters long, add zeros to the front.
3. Print the field.
4. Print a newline character at the end.
If you were dealing with a number rather than hex, you could use %.Xd to indicate you want at least X digits.
$ awk -F: '{printf( "%.2d%.2d\n", $1, $2)}' <<< "0:23"
0023
^^
two digits
From The GNU Awk User’s Guide #5.5.3 Modifiers for printf Formats:
.prec
A period followed by an integer constant specifies the precision to
use when printing. The meaning of the precision varies by control
letter:
%d, %i, %o, %u, %x, %X
Minimum number of digits to print.
In this case, you need a more general approach to deal with each one of the blocks of the MAC address. You can loop through the elements and add a 0 in case their length is just 1:
awk -F: '{for (i=1;i<=NF;i++) #loop through the elements
{
if (length($i)==1) #if length is 1
printf("0") #add a 0
printf ("%s", $i) #print the rest
}
print "" #print a new line at the end
}' <<< "0:0:c:7:ac:1e"
This returns:
00000c07ac1e
^^ ^^ ^^
^^ ^^ ^^
Note awk '...' <<< "$MAC" is the same as echo "$MAC" | awk '...'.
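If depending on awk or perl is the concern, the padding can also be sketched with bash's own printf, treating each block as a hexadecimal number. The function name normalize_mac is hypothetical. The sketch assumes every block is valid hex, and it lower-cases the result as a side effect.

```bash
#!/usr/bin/env bash
# Hypothetical helper: normalize_mac MAC
# Prints the MAC as 12 lower-case hex digits without separators.
normalize_mac() {
    local -a parts
    local p out=
    IFS=: read -r -a parts <<< "$1"      # split on colons
    for p in "${parts[@]}"; do
        printf -v p '%02x' "0x$p"        # parse as hex, print zero-padded to 2 digits
        out+=$p
    done
    printf '%s\n' "$out"
}

normalize_mac "0:0:c:7:ac:1e"
```

Using %02x on a number rather than %02s on a string sidesteps the zero-padding of %s, which is the behaviour that differed between the awk versions.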

Finding multiple strings on multiple lines in file and manipulating output with bash/perl

I am trying to get the version numbers for content management systems being hosted on my server. I can do this fairly simply if the version number is stored on one line with something like this:
grep -r "\$wp_version = '" /home/
Which returns exactly what I want to stdout:
/home/$RANDOMDOMAIN/wp-includes/version.php:$wp_version = '3.7.1';
The issue I run into is when I start looking for version numbers that are stored on two or more lines, like Joomla! or Magento which use the following formats respectively:
Joomla!:
/** @var string Release version. */
public $RELEASE = '3.2';
/** @var string Maintenance version. */
public $DEV_LEVEL = '3';
Magento:
'major' => '1',
'minor' => '8',
'revision' => '1',
'patch' => '0',
I have gotten it to 'work', in a way, using the following (With this method if, for whatever reason, one of the strings I am looking for is missing the whole command becomes useless since xargs -l3 is expecting 2 rows above the path provided by -print):
find /home/ -type f -name version.php -exec grep " \$RELEASE " '{}' \; -exec grep " \$DEV_LEVEL " '{}' \; -print | xargs -l3 | sed 's/\<var\>\s//g;s/\<public\>\s//g' | awk -F\; '{print $3":"$1""$2}' | sed 's/ $DEV_LEVEL = /./g'
Which gets me output like this:
/home/$RANDOMDOMAIN/version.php:$RELEASE = 3.2.3
/home/$RANDOMDOMAIN/anotherfolder/version.php:$RELEASE = 1.5.0
I also have a working for loop that WILL exclude any file that does not contain both strings, but depending how much it has to sift through, can take significantly longer than the find one liner above:
for path in $(grep -rl " \$RELEASE " /home/ 2> /dev/null | xargs grep -rl " \$DEV_LEVEL ")
do
joomlaver="$path"
joomlaver+=$(grep " \$RELEASE " $path)
joomlaver+=$(echo " \$DEV_LEVEL = '$(grep " \$DEV_LEVEL " $path | cut -d\' -f2)';")
echo "$joomlaver" | sed 's/\<var\>\s//g;s/\<public\>\s//g;s/;//g' | awk -F\' '{ print $1""$2"."$4 }' | sed 's/\s\+//g'
unset joomlaver
done
Which gets me output like this:
/home/$RANDOMDOMAIN/version.php$RELEASE=3.2.3
/home/$RANDOMDOMAIN/anotherfolder/version.php$RELEASE=1.5.0
But I have to believe there is a simpler, shorter, more elegant way. Bash is preferred or if it can somehow be done with a perl one liner, that would work as well. Any and all help would be much appreciated. Thanks in advance. (Sorry for all the edits, but I am trying to figure this out myself as well!)
Here is a perl one-liner that will extract the $RELEASE and $DEV_LEVEL from the php file format you showed:
perl -ne '$v=$1 if /\$RELEASE\s*=\s*\047([0-9.]+)\047/; $devlevel=$1 if /\$DEV_LEVEL\s*=\s*\047([0-9.]+)\047/; if (defined $v && defined $devlevel) { print "$ARGV: Release=$v Devlevel=$devlevel\n"; last; }'
The -n makes perl effectively wrap the whole thing inside a while (<>) { } loop. Each line is checked against two regexes. Once both of them have matched, it prints the result and exits.
The \047 is used to match single quotes, otherwise the shell would get confused.
If it does not find a match, it does not print anything. Otherwise it prints something like this:
sample.php: Release=3.2 Devlevel=3
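You would use it in combination with find and xargs to traverse down a directory structure. Because last leaves the implicit loop entirely, it would stop after the first matching file when several files are given; the variant below instead resets its state and moves on to the next file, perhaps like this:
find . -name "*.php" | xargs perl -ne '$v=$1 if /\$RELEASE\s*=\s*\047([0-9.]+)\047/; $devlevel=$1 if /\$DEV_LEVEL\s*=\s*\047([0-9.]+)\047/; $done = defined $v && defined $devlevel; print "$ARGV: Release=$v Devlevel=$devlevel\n" if $done; if ($done || eof) { undef $v; undef $devlevel; close ARGV; }'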
You could make a similar version for the other file format (Magento?) you mentioned.
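Since bash was the preferred tool, the same per-file logic can be sketched with bash's regex matching. The function name joomla_version and the output format are assumptions made for this example. It reads one file, remembers the first $RELEASE and $DEV_LEVEL values it sees, and prints nothing unless both are present.

```bash
#!/usr/bin/env bash
# Hypothetical helper: joomla_version FILE
# Prints "FILE: RELEASE.DEV_LEVEL" when both values are found in FILE.
joomla_version() {
    local file=$1 line release= dev=
    # Inside double quotes, \\\$ yields \$ so the regex matches a literal dollar sign.
    local re_release="\\\$RELEASE[[:space:]]*=[[:space:]]*'([0-9.]+)'"
    local re_dev="\\\$DEV_LEVEL[[:space:]]*=[[:space:]]*'([0-9.]+)'"
    while IFS= read -r line || [[ -n $line ]]; do
        [[ $line =~ $re_release ]] && release=${BASH_REMATCH[1]}
        [[ $line =~ $re_dev ]] && dev=${BASH_REMATCH[1]}
        if [[ -n $release && -n $dev ]]; then
            printf '%s: %s.%s\n' "$file" "$release" "$dev"
            return 0                       # stop reading once both are known
        fi
    done < "$file"
    return 1                               # one or both values missing
}

# Possible usage (not run here):
#   find /home -type f -name version.php -print0 |
#       while IFS= read -r -d '' f; do joomla_version "$f"; done
```

Files that lack either string simply produce no output, so the result does not depend on a fixed number of lines per file the way xargs -l3 does.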

How can I let perl interpret a string variable that represents an address

I want to feed input to a C program with a perl script like this:
./cprogram $(perl -e 'print "\xab\xcd\xef";')
However, the string must be read from a file. So I get something like this:
./cprogram $(perl -e 'open FILE, "<myfile.txt"; $file_contents = do { local $/; <FILE> }; print $file_contents')
However, now perl interprets the string as the string "\xab\xcd\xef", and I want it to interpret it as the byte sequence as in the first example.
How can this be achieved? It has to be run on a server without File::Slurp.
In the first case, you pass the three bytes AB CD EF (produced by the string literal "\xAB\xCD\xEF") to print.
In the second case, you must be passing something other than those three bytes to print. I suspect you are passing the twelve-character string \xAB\xCD\xEF to print.
So your question becomes: how does one convert the twelve-character string \xAB\xCD\xEF into the three bytes AB CD EF? Well, you'd require some kind of parser such as
s/\\x([0-9a-fA-F][0-9a-fA-F])|\\([^x])|([^\\]+)/
$1 ? chr(hex($1)) : $2 ? $2 : $3
/eg
And here it is at work:
$ perl -e'print "\\xAB\\xCD\\xEF";' >file
$ echo -n "$( perl -0777pe'
s{\\x([0-9a-fA-F][0-9a-fA-F])|\\([^x])|([^\\]+)}{
$1 ? chr(hex($1)) : $2 // $3
}eg;
' file )" | od -t x1
0000000 ab cd ef
0000003
Is Perl's eval too evil? If not, end with print eval("\"$file_contents\"");
Or can you prepare the file in advance using Perl? E.g. print FILE "\xAB\xCD\xEF"; then read the resulting file with your existing code.
Using a bash trick:
perl -e "$(echo "print \"$(cat input)"\")"
which for your example becomes:
./cprogram "$(perl -e "$(echo "print \"$(cat myfile.txt)"\")")"
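If Perl is not strictly required, bash's printf with %b expands backslash escapes such as \xHH found in its argument, without evaluating the file as code. The sketch below assumes the file holds only such escapes and contains no NUL escapes, since a command substitution cannot carry NUL bytes. The function name decode_escapes is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decode_escapes FILE
# Prints the bytes described by the \xHH escapes stored in FILE.
decode_escapes() {
    # $(< file) reads the file (trailing newlines are stripped);
    # %b makes printf interpret the backslash escapes in that text.
    printf '%b' "$(< "$1")"
}

# Possible usage (not run here):
#   ./cprogram "$(decode_escapes myfile.txt)"
```

Unlike the eval and nested-quoting approaches, nothing from the file is executed here, so stray quotes or Perl code in the file cannot run.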