Uppercase to Lowercase with Sed and character classes

I'd like to convert a string from upper to lower case. I know there are different ways of solving this problem, but I'd like to understand why this command doesn't work:
echo "aa" | sed 's/'[:upper:]'/'[:lower:]'/g'
Is it a wrong way to use the classes of characters?

To convert from lowercase to uppercase, you can use \U in the replacement (a GNU sed extension; \L is the lowercase counterpart):
echo "aW123bR" | sed -r 's/[a-z]+/\U&/g'
The tr command is an interesting alternative:
echo "aW123bR" | tr '[:lower:]' '[:upper:]'

In sed, the y command is used for mapping sets of characters:
sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/'
It requires a literal list of characters, not character classes.
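As for why the command in the question fails: once the shell has removed the inner quotes, sed receives s/[:upper:]/[:lower:]/g. Class names such as [:upper:] are only recognised inside a bracket expression (written [[:upper:]]), and the replacement side of s is literal text, so [:lower:] there cannot stand for "the lowercase counterpart". The following is a minimal sketch of two portable alternatives; the function names are hypothetical and exist only for this illustration.

```shell
# Hypothetical helper names, used only for this sketch.

# Lowercase with tr, which accepts character classes on both sides.
to_lower_tr() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]'
}

# Lowercase with sed's y command, which needs both lists spelled out.
to_lower_sed() {
  printf '%s\n' "$1" |
    sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/'
}

to_lower_tr 'aW123bR'    # prints aw123br
to_lower_sed 'aW123bR'   # prints aw123br
```

The y variant only maps the 26 ASCII letters listed, whereas tr with classes follows the current locale.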

Another possible solution with gawk:
$ echo "HELLO" | awk '{print tolower($0)}'
hello

Related

Using sed to eliminate a specific string

I would appreciate your help with this problem. I would like to eliminate everything that is not a specific pattern from a string.
For example, below I would like to eliminate everything that is not "5TTGTC".
But as seen here, ^5TTGTC is not right. I tried different combinations of ^(), ^{}, and ^[], but none gave me what I am looking for. I would appreciate your feedback!
echo ".,..,...+5TTGTC...+5TTGCC.+5TTGTC,,.,.,,.,+5ttgtc,.,,.,.+5TTGTC.+5TTGTC,..+5TTGTC" | sed 's/^5TTGTC//g'
Thanks in advance
You may use the following command if you want case sensitivity:
echo ".,..,...+5TTGTC...+5TTGCC.+5TTGTC,,.,.,,.,+5ttgtc,.,,.,.+5TTGTC.+5TTGTC,..+5TTGTC" | sed -r 's/(5TTGTC)|[,.A-Za-z+0-9]/\1/g'
The code above prints:
5TTGTC5TTGTC5TTGTC5TTGTC5TTGTC
The regular expression used above uses alternation to capture what you are interested in.
We match and capture what we are interested in (5TTGTC), and we match everything that is not that substring, in this case the characters ,.A-Za-z+0-9, without capturing it.
You can check the behaviour of the regex here.
As pointed out by @EdMorton, the command can be simplified to:
echo ".,..,...+5TTGTC...+5TTGCC.+5TTGTC,,.,.,,.,+5ttgtc,.,,.,.+5TTGTC.+5TTGTC,..+5TTGTC" | sed -r 's/(5TTGTC)|./\1/g'
You can try this here.
For compatibility across sed versions the -r flag can be replaced by the -E flag.
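The capture-or-discard idea above can be wrapped in a small function. This is a sketch only: keep_only is a hypothetical name, and it assumes the token passed in contains no regex metacharacters and no "/" (it is pasted straight into the sed script).

```shell
# Hypothetical helper: keep every occurrence of the token given as $1
# and delete all other characters on each input line.
# Assumption: $1 has no regex metacharacters and no '/'.
keep_only() {
  # Alternation: either capture the token, or match one character
  # without capturing it. \1 is empty whenever the second branch matched.
  sed -E "s/($1)|./\1/g"
}

printf '%s\n' '.,+5TTGTC..+5TTGCC.+5ttgtc,+5TTGTC' | keep_only 5TTGTC
# prints 5TTGTC5TTGTC
```

The near misses 5TTGCC and 5ttgtc are removed character by character because the first branch does not match at those positions.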
You don't make it very clear what you are trying to achieve.
One way to get where you are trying to go could be the -o option in grep.
echo ".,..,...+5TTGTC...+5TTGCC.+5TTGTC,,.,.,,.,+5ttgtc,.,,.,.+5TTGTC.+5TTGTC,..+5TTGTC" | grep -o '5TTGTC'
Output:
5TTGTC
5TTGTC
5TTGTC
5TTGTC
5TTGTC
You can then change 5TTGTC into a pattern, e.g. grep -o '[0-9]TT[AG]GTC'
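Because grep -o prints one match per line, its output is easy to post-process. The sketch below counts the matches or joins them back into a single line. The function names are hypothetical, and -o is assumed to be available (it is in GNU and BSD grep, but POSIX does not require it).

```shell
# Hypothetical helpers built on grep -o (one match per output line).

# Count how many times the pattern in $1 occurs on standard input.
count_matches() {
  grep -o -- "$1" | wc -l | tr -d ' '   # tr strips the padding some wc versions add
}

# Join all matches of the pattern in $1 into one line.
join_matches() {
  grep -o -- "$1" | tr -d '\n'
  echo
}

line='.,+5TTGTC..+5TTGCC.+5ttgtc,+5TTGTC'
printf '%s\n' "$line" | count_matches '5TTGTC'        # prints 2
printf '%s\n' "$line" | join_matches '5TTGTC'         # prints 5TTGTC5TTGTC
printf '%s\n' "$line" | count_matches '[0-9]TTG[TC]C' # prints 3
```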
With any sed:
$ echo ".,..,...+5TTGTC...+5TTGCC.+5TTGTC,,.,.,,.,+5ttgtc,.,,.,.+5TTGTC.+5TTGTC,..+5TTGTC" |
sed 's/#//g; s/5TTGTC/#/g; s/[^#]//g; s/#/5TTGTC/g'
5TTGTC5TTGTC5TTGTC5TTGTC5TTGTC
With any awk:
$ echo ".,..,...+5TTGTC...+5TTGCC.+5TTGTC,,.,.,,.,+5ttgtc,.,,.,.+5TTGTC.+5TTGTC,..+5TTGTC" |
awk -v str='5TTGTC' '{gsub(str,"\n"); gsub(/[^\n]/,""); gsub(/\n/,str)}1'
5TTGTC5TTGTC5TTGTC5TTGTC5TTGTC

Escape line beginning and end in bracket expressions in sed

How do you escape line beginning and line end in bracket expressions in sed?
For example, let's say I want to replace each comma, the line beginning, and the line end in each line with a pipe:
echo "a,b,c" | sed 's/,/|/g'
# a|b|c
echo "a,b,c" | sed 's/^/|/g'
# |a,b,c
echo "a,b,c" | sed 's/$/|/g'
# a,b,c|
echo "a,b,c" | sed 's/[,^$]/|/g'
# a|b|c
I would expect the last command to produce |a|b|c|. I also tried escaping the line beginning and line end via backslash, with no change.
With GNU sed with extended regular expressions, you can do:
$ echo "a,b,c" | /opt/gnu/bin/sed -E 's/^|,|$/|/g'
|a|b|c|
$
The -E option enables the extended regular expressions, as does -r, but -E is also used by other sed variants for the same purpose, unlike -r.
However, for reasons which elude me, the BSD (macOS) variant of sed produces:
$ echo "a,b,c" | sed -E 's/^|,|$/|/g'
|a|b|c
$
I can't think why.
If this variability is unacceptable, go with the three-substitution solution:
$ echo "a,b,c" | sed -e "s/^/|/" -e "s/$/|/" -e "s/,/|/g"
|a|b|c|
$
which should work with any variant of sed. However, note that echo "" | sed …3 subs… produces || whereas the -E variant produces |. I'm not sure if there's an easy fix for that.
You tried this, but it didn't do what you wanted:
$ echo "a,b,c" | sed 's/[,^$]/|/g'
a|b|c
$
This is what should be expected. Inside character classes, most special characters lose their special-ness. There is nothing special about $ (or , but it isn't a metacharacter anyway) in a character class; ^ is only special at the start of the class and it negates the character class. That means that what follows shows the correct, expected behaviour from this permutation of the contents of your character class:
$ echo "a,b\$\$b,c" | sed 's/[^,$]/|/g'
|,|$$|,|
$
It mapped all the non-comma, non-dollar characters to pipes. I should be using single quotes around the echo; then the backslashes wouldn't be necessary. I just followed the question's code quietly.
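A quick way to see that ^ and $ are ordinary characters inside the brackets is to feed sed a line that actually contains them. This is a small sketch; pipe_literals is a hypothetical name.

```shell
# Hypothetical helper: replace literal ',', '^' and '$' with '|'.
# Inside the brackets, ^ (not in first position) and $ are plain characters,
# not anchors.
pipe_literals() {
  sed 's/[,^$]/|/g'
}

printf '%s\n' 'a^b$c,d' | pipe_literals   # prints a|b|c|d
printf '%s\n' 'a,b,c'   | pipe_literals   # prints a|b|c (no pipes at the ends)
```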
The following sed command may help:
echo "a,b,c" | sed 's/^/|/;s/,/|/g;s/$/|/'
Output will be as follows.
|a|b|c|

Skip/remove non-ascii character with sed

Chip,Dirkland,DrobæSphere Inc,cdirkland#hotmail.com,usa
I've been trying to use sed to modify email addresses in a .csv but the line above keeps tripping me up, using commands like:
sed -i 's/[\d128-\d255]//' FILENAME
from this stackoverflow question
doesn't seem to work as I get an 'invalid collation character' error.
Ideally I don't want to change that combined AE character at all, I'd rather sed just skip right over it as I'm not trying to manipulate that text but rather the email addresses. As long as that AE is in there though it causes my sed substitution to fail after one line, delete the character and it processes the whole file fine.
Any ideas?
This might work for you (GNU sed):
echo "Chip,Dirkland,DrobæSphere Inc,cdirkland#hotmail.com,usa" |
sed 's/\o346/a+e/g'
Chip,Dirkland,Droba+eSphere Inc,cdirkland#hotmail.com,usa
Then do what you have to do and after to revert do:
echo "Chip,Dirkland,Droba+eSphere Inc,cdirkland#hotmail.com,usa" |
sed 's/a+e/\o346/g'
Chip,Dirkland,DrobæSphere Inc,cdirkland#hotmail.com,usa
If you have tricky characters in strings and want to understand how sed sees them use the l0 command (see here). Also very useful for debugging difficult regexps.
echo "Chip,Dirkland,DrobæSphere Inc,cdirkland#hotmail.com,usa" |
sed -n 'l0'
Chip,Dirkland,Drob\346Sphere Inc,cdirkland#hotmail.com,usa$
sed -i 's/[^[:print:]]//' FILENAME
Also, this acts like dos2unix
The issue you are having is the locale.
If you want to use a range like that, you need to change the character type and the collation type.
The range fails because the bytes \x80 through \xff are invalid on their own in a UTF-8 string.
Note that \u0080 != \x80 in UTF-8.
To get this to work, just run:
LC_ALL=C sed -i 's/[\d128-\d255]//' FILENAME
This overrides LC_CTYPE and LC_COLLATE for that one command and does what you want.
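The effect of forcing the C locale can be checked on a single line without touching any file. The sketch below is an assumption-laden variant rather than the command above: it swaps the GNU-only \d escapes for POSIX character classes, and strip_non_ascii is a hypothetical name.

```shell
# Hypothetical helper: delete every byte that is neither printable ASCII
# nor whitespace. LC_ALL=C makes sed treat the input as single bytes, so
# the two bytes of a UTF-8 "æ" are matched and removed individually.
strip_non_ascii() {
  LC_ALL=C sed 's/[^[:print:][:space:]]//g'
}

# \303\246 is the UTF-8 encoding of "æ", written as octal escapes.
printf 'Drob\303\246Sphere Inc\n' | strip_non_ascii   # prints DrobSphere Inc
```

Unlike the byte range, this variant also removes ASCII control characters other than whitespace.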
I came here after trying the sed command s/[\x00-\x1F]/ /g;, which gave me the same error message.
In this case it suffices to remove \x00 from the range, yielding s/[\x01-\x1F]/ /g;
Unfortunately, it seems that all characters from \x7F upward, and some others, are disallowed as range endpoints, as can be seen with this short script:
for (( i=0; i<=255; i++ )); do
printf "== $i - \x$(echo "ibase=10;obase=16;$i" | bc) =="
echo '' | sed -E "s/[\d$i-\d$((i+1))]//g"
done
Note that the problem is only the use of those characters to specify a range. You can still list them all manually or per script. E.g. to come back to your example:
sed -i 's/[\d128-\d255]//' FILENAME
would become
c=; for (( i=128; i<=255; i++ )); do c="$c\d$i"; done
sed -i 's/['"$c"']//' FILENAME
which would translate to:
sed -i 's/[\d128\d129\d130\d131\d132\d133\d134\d135\d136\d137\d138\d139\d140\d141\d142\d143\d144\d145\d146\d147\d148\d149\d150\d151\d152\d153\d154\d155\d156\d157\d158\d159\d160\d161\d162\d163\d164\d165\d166\d167\d168\d169\d170\d171\d172\d173\d174\d175\d176\d177\d178\d179\d180\d181\d182\d183\d184\d185\d186\d187\d188\d189\d190\d191\d192\d193\d194\d195\d196\d197\d198\d199\d200\d201\d202\d203\d204\d205\d206\d207\d208\d209\d210\d211\d212\d213\d214\d215\d216\d217\d218\d219\d220\d221\d222\d223\d224\d225\d226\d227\d228\d229\d230\d231\d232\d233\d234\d235\d236\d237\d238\d239\d240\d241\d242\d243\d244\d245\d246\d247\d248\d249\d250\d251\d252\d253\d254\d255]//' FILENAME
In this case there is a way to skip non-ASCII characters without bothering to remove them:
LANG=C sed /someemailpattern/
See https://bugzilla.redhat.com/show_bug.cgi?id=440419 and Will sed (and others) corrupt non-ASCII files?.
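As a rough illustration of the "skip over it" approach: in the C locale sed treats the line as plain bytes, so a substitution on the email part leaves the non-ASCII character untouched. The function name and the replacement domain below are hypothetical. LC_ALL is used instead of LANG because LANG has no effect when LC_ALL or LC_CTYPE is already set in the environment.

```shell
# Hypothetical helper: rewrite the domain part of the address while
# leaving every other byte of the line, including "æ", exactly as it was.
# The "#" separator mirrors the sample line in the question.
rewrite_domain() {
  LC_ALL=C sed 's/#hotmail\.com/#example.com/'
}

printf '%s\n' 'Chip,Dirkland,DrobæSphere Inc,cdirkland#hotmail.com,usa' |
  rewrite_domain
# prints Chip,Dirkland,DrobæSphere Inc,cdirkland#example.com,usa
```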
How about using awk for this? We set the field separator to nothing, so that each character becomes its own field (this works in gawk and most current awks, although POSIX leaves an empty FS unspecified). Then we loop over each character and use an if statement to check whether it matches our character class. If it does, we print it; otherwise we ignore it.
awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.# ]/) printf "%s", $i}'
Test:
[jaypal:~/Temp] echo "Chip,Dirkland,DrobæSphere Inc,cdirkland#hotmail.com,usa" |
awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.# ]/) printf "%s", $i}'
Chip,Dirkland,DrobSphere Inc,cdirkland#hotmail.com,usa
Update:
awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.# ]/) printf "%s", $i; printf "\n"}' < datafile.csv > asciidata.csv
I have added printf "\n" after the loop to keep the lines separate.

Replacing the last word of a path using sed

I have the following: param="/var/tmp/test"
I need to replace the word test with another word, such as new_test.
I need a smart way to replace the last word after "/" with sed.
echo 'param="/var/tmp/test"' | sed 's/\/[^\/]*"/\/REPLACEMENT"/'
param="/var/tmp/REPLACEMENT"
echo '/var/tmp/test' | sed 's/\/[^\/]*$/\/REPLACEMENT/'
/var/tmp/REPLACEMENT
Extracting bits and pieces with sed is a bit messy (as Jim Lewis says, use basename and dirname if you can) but at least you don't need a plethora of backslashes to do it if you are going the sed route since you can use the fact that the delimiter character is selectable (I like to use ! when / is too awkward, but it's arbitrary):
$ echo 'param="/var/tmp/test"' | sed ' s!/[^/"]*"!/new_test"! '
param="/var/tmp/new_test"
We can also extract just the part that was substituted, though this is easier with two substitutions in the sed control script:
$ echo 'param="/var/tmp/test"' | sed ' s!.*/!! ; s/"$// '
test
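When the replacement word comes from a variable, the same alternate-delimiter trick still applies. The sketch below uses | rather than ! as the delimiter, only so that the double-quoted script is also safe to paste into an interactive bash session with history expansion on. set_last_component is a hypothetical name, and the new word is assumed to contain no |, & or backslash.

```shell
# Hypothetical helper: replace the last path component inside a quoted
# assignment such as param="/var/tmp/test" with the word given as $1.
# Assumption: $1 contains no '|', '&' or backslash.
set_last_component() {
  # After shell quoting, sed sees:  s|/[^/"]*"|/WORD"|
  sed "s|/[^/\"]*\"|/$1\"|"
}

printf '%s\n' 'param="/var/tmp/test"' | set_last_component new_test
# prints param="/var/tmp/new_test"
```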
You don't need sed for this...basename and dirname are a better choice for assembling or disassembling pathnames. All those escape characters give me a headache....
param="/var/tmp/test"
param_repl="$(dirname "$param")/newtest"
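A minimal sketch of the dirname approach as a reusable function, next to an equivalent that uses only POSIX parameter expansion. Both function names are hypothetical.

```shell
# Hypothetical helpers: replace the last component of the path in $1
# with the word in $2.

# Using dirname, as suggested above.
with_dirname() {
  printf '%s/%s\n' "$(dirname -- "$1")" "$2"
}

# Using parameter expansion: ${1%/*} strips the shortest trailing "/...".
with_expansion() {
  printf '%s/%s\n' "${1%/*}" "$2"
}

with_dirname   /var/tmp/test new_test   # prints /var/tmp/new_test
with_expansion /var/tmp/test new_test   # prints /var/tmp/new_test
```

The two differ on edge cases: for a bare name with no slash, such as test, dirname yields "." while the expansion leaves the name unchanged, so the dirname version is the safer default.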
It's not clear whether param is part of the string that you need processed or it's the variable that holds the string. Assuming the latter, you can do this using only Bash (you don't say which shell you're using):
shopt -s extglob
param="/var/tmp/test"
param="${param/%\/*([^\/])//new_test}"
If param= is part of the string:
shopt -s extglob
string='param="/var/tmp/test"'
string="${string/%\/*([^\/])\"//new}"
This might work for you:
echo 'param="/var/tmp/test"' | sed -r 's#(/(([^/]*/)*))[^"]*#\1newtest#'
param="/var/tmp/newtest"

sed Removing whitespace around certain character

What would be the best way to remove whitespace only around a certain character? Let's say a dash: Some- String- 12345- Here would become Some-String-12345-Here. Something like sed 's/\ -/-/g;s/-\ /-/g' works, but I am sure there must be a better way.
Thanks!
If you mean all whitespace, not just spaces, then you could try \s:
echo 'Some- String- 12345- Here' | sed 's/\s*-\s*/-/g'
Output:
Some-String-12345-Here
Or use the [:space:] character class:
echo 'Some- String- 12345- Here' | sed 's/[[:space:]]*-[[:space:]]*/-/g'
Different versions of sed may or may not support these, but GNU sed does.
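A small sketch of the [[:space:]] form wrapped in a function, tried on input with whitespace on either or both sides of the dash. tighten_dashes is a hypothetical name.

```shell
# Hypothetical helper: remove any run of whitespace directly before or
# after each dash, leaving other whitespace alone.
tighten_dashes() {
  sed 's/[[:space:]]*-[[:space:]]*/-/g'
}

printf '%s\n' 'Some - String-  12345 -Here' | tighten_dashes
# prints Some-String-12345-Here
printf '%s\n' 'no dash here' | tighten_dashes
# prints no dash here
```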
Try:
's/ *- */-/g'
You can use awk as well:
$ echo 'Some - String- 12345-' | awk -F" *- *" '{$1=$1}1' OFS="-"
Some-String-12345-
If it's just "- ", as in your example:
$ s="Some- String- 12345-"
$ echo "${s//- /-}"
Some-String-12345-