assign actions to one sed address simultaneously for match and non-match - sed

My sed command line script looks like
echo "a,b,c,d" | sed -ne 's/[^a-zA-Z0-9]//g; /^...$/ p; /^...$/! q1'
I want the script to succeed (return-code 0) if there are exactly 3 letters left, and to fail otherwise.
The slightly nagging part is that I have to duplicate the address /^...$/.
I was hoping for something like
echo "a,b,c,d" | sed -ne 's/[^a-zA-Z0-9]//g; /^...$/ p ! q1'
but that doesn't work, at least not with that syntax.

You can use an empty regex, //, which reuses the last regex that was used:
$ echo "a,b,c,d" | sed -ne 's/[^a-zA-Z0-9]//g; /^...$/ p; //! q1'
$ echo $?
1
$ echo "b,c,d" | sed -ne 's/[^a-zA-Z0-9]//g; /^...$/ p; //! q1'
bcd
$ echo $?
0
Alternatively, you can use the b command without a label to branch to the end of the script and start the next cycle:
$ echo "b,c,d" | sed -ne 's/[^a-zA-Z0-9]//g; /^...$/{p;b}; q1'
bcd
$ echo $?
0
$ echo "a,b,c,d" | sed -ne 's/[^a-zA-Z0-9]//g; /^...$/{p;b}; q1'
$ echo $?
1
The q command with an exit code (q1) is a GNU extension, so this syntax will probably work with GNU sed only. See the manual for details.
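If GNU sed is not available, one possible workaround is to let grep supply both the output and the exit status, so the address is written only once. This is a sketch, not what the answer above does: the function name three_alnum is made up for illustration, and it assumes a single line of input.

```shell
# three_alnum STRING
# Strips everything except letters and digits, prints the result and
# returns 0 if exactly three characters are left; otherwise prints
# nothing and returns non-zero (hypothetical helper name).
three_alnum() {
    printf '%s\n' "$1" |
        sed 's/[^a-zA-Z0-9]//g' |
        grep -x '...'          # -x: the whole line must match the pattern
}

three_alnum 'b,c,d'            # prints bcd, exit status 0
three_alnum 'a,b,c,d'          # prints nothing, exit status 1
```

The exit status of a pipeline is that of its last command, so the status of grep becomes the status of the function.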


Bash or Python efficient substring matching and filtering

I have a set of filenames in a directory, some of which are likely to have identical substrings but not known in advance. This is a sorting exercise. I want to move the files with the maximum substring ordered letter match together in a subdirectory named with that number of letters and progress to the minimum match until no matches of 2 or more letters remain. Ignore extensions. Case insensitive. Ignore special characters.
Example.
AfricanElephant.jpg
elephant.jpg
grant.png
ant.png
el_gordo.tif
snowbell.png
Starting from maximum length matches to minimum length matches will result in:
./8/AfricanElephant.jpg and ./8/elephant.jpg
./3/grant.png and ./3/ant.png
./2/snowbell.png and ./2/el_gordo.tif
I am completely lost on an efficient bash or python way to do what seems to be a complex sort.
I found some awk code which is almost there:
{
count=0
while ( match($0,/elephant/) ) {
count++
$0=substr($0,RSTART+1)
}
print count
}
where temp.txt contains the list of files, and the script is invoked as, e.g.,
awk -f test_match.awk temp.txt
The drawbacks are that
a) this is hardwired to look for "elephant" as a string (I don't know how to make it take an input string, rather than a file, and an input test string to count against), and
b) I really just want to call a bash function to do the sort as specified.
If I had this, I could wrap some bash script around this core awk to make it work.
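On drawback a), one way to pass the test string in from the shell is awk's -v option. The following is only a sketch: the function name count_matches is made up, the pattern is treated as a fixed string rather than a regex, and matching is made case insensitive with tolower.

```shell
# count_matches PATTERN [FILE...]
# For every input line, prints how many (possibly overlapping) times
# PATTERN occurs in it, ignoring case. Reads stdin if no FILE is given.
# Note: awk -v interprets backslash escapes in the value.
count_matches() {
    cm_pat=$1
    [ -n "$cm_pat" ] || return 1      # an empty pattern makes no sense here
    shift
    awk -v pat="$cm_pat" '
    BEGIN { pat = tolower(pat) }
    {
        s = tolower($0)
        count = 0
        # index() returns the position of pat in s, or 0 if absent
        while ((i = index(s, pat)) > 0) {
            count++
            s = substr(s, i + 1)      # continue one character past the hit
        }
        print count
    }' "$@"
}
```

It could be called as count_matches elephant temp.txt, or fed a single string with printf '%s\n' "AfricanElephant.jpg" | count_matches elephant.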
function longest_common_substrings () {
shopt -s nocasematch
for file1 in * ; do for file in * ; do \
if [[ -f "$file1" ]]; then
if [[ -f "$file" ]]; then
base1=$(basename "$file" | cut -d. -f1)
base2=$(basename "$file1" | cut -d. -f1)
if [[ "$file" == "$file1" ]]; then
echo -n ""
else
echo -n "$file $file1 " ; $HOME/Scripts/longest_common_substring.sh "$base1" "$base2" | tr -d '\n' | wc -c | awk '{$1=$1;print}' ;
fi
fi
fi
done ;
done | sort -r -k3 | awk '{ print $1, $3 }' > /tmp/filesort_substring.txt
while IFS= read -r line; do \
file_to_move=$(echo "$line" | awk '{ print $1 }') ;
directory_to_move_to=$(echo "$line" | awk '{ print $2 }') ;
if [[ -f "$file_to_move" ]]; then
mkdir -p "$directory_to_move_to"
\gmv -b "$file_to_move" "$directory_to_move_to"
fi
done < /tmp/filesort_substring.txt
shopt -u nocasematch
}
where $HOME/Scripts/longest_common_substring.sh is
#!/bin/bash
shopt -s nocasematch
if ((${#1}>${#2})); then
long=$1 short=$2
else
long=$2 short=$1
fi
lshort=${#short}
score=0
for ((i=0;i<lshort-score;++i)); do
for ((l=score+1;l<=lshort-i;++l)); do
sub=${short:i:l}
[[ $long != *$sub* ]] && break
subfound=$sub score=$l
done
done
if ((score)); then
echo "$subfound"
fi
shopt -u nocasematch
Kudos to the original solution for computing the match in the script, which I found elsewhere on this site.

wc -c gives one more than I expected, why is that?

echo '2003'| wc -c
I thought it would give me 4, but it turned out to be 5. What is that additional byte?
Because echo appends a newline to its output.
echo "2014" | wc -c
gives 5, while
printf "2014" | wc -c
gives 4, because printf does not add a newline.
echo has a built-in switch, -n, that suppresses the trailing newline. So running:
echo -n "2021" | wc -c
will output the expected 4.
echo adds a newline, which is causing the issue.
As mentioned by "KyChen", you can use printf or:
a="2014"
echo "$a" | awk '{print length}'
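The answers above can be folded into a small helper. This is a sketch with a made-up function name (byte_count); the tr call only strips the padding that some wc implementations put in front of the number.

```shell
# byte_count STRING
# Prints the number of bytes in STRING, without counting a newline,
# because printf '%s' writes the argument exactly as given.
byte_count() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

byte_count "2003"     # prints 4

# For the length in characters, the shell can do it without wc:
a="2003"
echo "${#a}"          # prints 4
```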

Solaris sed label too long

I am trying to execute a shell file, in which there is a line:
sed -ne ':1;/PinnInstitutionPath/{n;p;b1}' Institution | sed -e s/\ //g | sed -e s/\=//g | sed -e s/\;//g | sed -e s/\"//g | sed -e s/\Name//g
And an error message appears: "Label too long: :1;/PinnInstitutionPath/{n;p;b1}"
I am a noob at linux, so can anyone help me to solve this problem, thank you!
Try changing
sed -ne ':1;/PinnInstitutionPath/{n;p;b1}'
to
sed -ne ':1' -e '/PinnInstitutionPath/{n;p;b1}'
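Also, you don't need to call sed so many times. The substitutions can be applied to the line read by n, just before it is printed:
sed -n -e ':1' -e '/PinnInstitutionPath/{' -e 'n' -e 's/[ =;"]//g' -e 's/Name//g' -e 'p' -e 'b1' -e '}' Institution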
Concerning 'sed: Label too long' in Solaris (SunOS) - you will need to split your command into several lines, if you use labels.
In your case:
sed -ne ':1
/PinnInstitutionPath/{
n
p
b 1
}' Institution | sed -e s/\ //g -e s/\=//g -e s/\;//g -e s/\"//g -e s/\Name//g
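If labels keep causing trouble, the same job (print the line that follows each line containing PinnInstitutionPath, with the unwanted characters removed) can probably be done in awk, which needs no labels at all. This is only a sketch: the function name institution_names is made up, and since the layout of the Institution file is not shown in the question, the sample line in the comment is an assumption.

```shell
# institution_names [FILE...]
# Prints every line that directly follows a line containing
# PinnInstitutionPath, after deleting spaces, =, ;, " and the word Name.
institution_names() {
    awk '
    {
        hit  = prev                            # did the previous line match?
        prev = ($0 ~ /PinnInstitutionPath/)    # remember for the next line
    }
    hit {
        gsub(/[ =;"]/, "")
        gsub(/Name/, "")
        print
    }' "$@"
}

# Assumed sample: a line such as   Name = "Institution_0";
# after a PinnInstitutionPath line would be printed as Institution_0
```

It would be called as institution_names Institution.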

Pattern extraction using SED or AWK

How do I extract 68 from v1+r0.68?
Using awk to return everything after the last '.':
echo "v1+r0.68" | awk -F. '{print $NF}'
Using sed to get the number after the last dot:
echo 'v1+r0.68' | sed 's/.*[.]\([0-9][0-9]*\)$/\1/'
grep is good at extracting things:
kent$ echo " v1+r0.68"|grep -oE "[0-9]+$"
68
Match the digit string before the end of the line using grep:
$ echo 'v1+r0.68' | grep -Eo '[0-9]+$'
68
Or match any digits after a .
$ echo 'v1+r0.68' | grep -Po '(?<=\.)\d+'
68
Print everything after the . with awk:
echo "v1+r0.68" | awk -F. '{print $NF}'
68
Substitute everything before the . with sed:
echo "v1+r0.68" | sed 's/.*\.//'
68
Type man grep and you will see:
...
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
Then type echo 'v1+r0.68' | grep -o '68'
If you want to save the result somewhere, redirect it to a file:
echo 'v1+r0.68' | grep -o '68' > anyWhereSpecial.file_ending
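When the string is already in a shell variable, parameter expansion does the same thing as the awk and sed answers above without starting another process. A minimal sketch, with a made-up function name (after_last_dot):

```shell
# after_last_dot STRING
# Prints whatever follows the last "." in STRING.
# ${1##*.} removes the longest prefix that matches "*." ;
# if STRING has no dot, it is printed unchanged.
after_last_dot() {
    printf '%s\n' "${1##*.}"
}

after_last_dot 'v1+r0.68'    # prints 68
```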

extract number from string

I have a string ABCD20110420.txt and I want to extract the date out of it. Expected 2011-04-20
I can use replace to remove the text part, but how do I insert the "-" ?
# echo "ABCD20110420.txt" | replace 'ABCD' '' | replace '.txt' ''
20110420
echo "ABCD20110420.txt" | sed -e 's/ABCD//' -e 's/.txt//' -e 's/\(....\)\(..\)\(..\)/\1-\2-\3/'
Read: sed FAQ
Just use the shell (bash)
$> file=ABCD20110420.txt
$> echo "${file//[^0-9]/}"
20110420
$> file="${file//[^0-9]/}"
$> echo $file
20110420
$> echo ${file:0:4}-${file:4:2}-${file:6:2}
2011-04-20
The above is applicable to files like your sample. If you have files like A1BCD20110420.txt, then this will not work, because the extra digit ends up in the result.
For that case,
$> file=A1BCD20110420.txt
$> echo ${file%.*} #get rid of .txt
A1BCD20110420
$> file=${file%.*}
$> echo "2011${file#*2011}"
20110420
Or you can use a regular expression (Bash 3.2+)
$> file=ABCD20110420.txt
$> [[ $file =~ ^.*(2011)([0-9][0-9])([0-9][0-9])\..*$ ]]
$> echo ${BASH_REMATCH[1]}
2011
$> echo ${BASH_REMATCH[2]}
04
$> echo ${BASH_REMATCH[3]}
20
echo "ABCD20110420.txt" | sed -r 's/.+([0-9]{4})([0-9]{2})([0-9]{2}).+/\1-\2-\3/'
$ file=ABCD20110420.txt
$ echo "$file" | sed -e 's/^[A-Za-z]*\([0-9][0-9][0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)\.txt$/\1-\2-\3/'
This only requires a single call to sed.
echo "ABCD20110420.txt" | sed -r 's/.{4}(.{4})(.{2})(.{2}).txt/\1-\2-\3/'
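The sed answers above can be wrapped in a function that does not depend on the ABCD prefix or on the .txt suffix. This is a sketch under assumptions: the function name extract_date is made up, and it picks the last run of eight digits in the name, which is fine for the samples here but may not be what you want for names with longer digit runs.

```shell
# extract_date NAME
# Prints the date embedded in NAME as YYYY-MM-DD.
# Prints nothing if NAME contains no run of eight digits
# (-n plus the p flag print only when the substitution succeeded).
extract_date() {
    printf '%s\n' "$1" |
        sed -n 's/.*\([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\).*/\1-\2-\3/p'
}

extract_date ABCD20110420.txt     # prints 2011-04-20
extract_date A1BCD20110420.txt    # prints 2011-04-20
```

Only basic regular expressions are used, so it does not need the -r option.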