How to handle UTF-8 emoji in sed on Cygwin? - sed

I've seen many topics about escaping and replacing a special character in SED, but none of them helped me.
I have this sed command I need to use on a file:
sed -i "s/This[^\|]\+/& (cool) /g" "file.txt"
For a reason I don't understand, it applies to this test case:
This is my funny 🎺 char and this | char is the char after which i want to stop my job.
... and transforms it to :
This is my funny 🎠(cool) ڠchar and this | char is the char after which i want to stop my job.
... instead of :
This is my funny 🎺 char and this (cool) | char is the char after which i want to stop my job.
Can anybody tell me how to handle this kind of case ?
Note : the file is UTF-8 encoded, I use Cygwin that is UTF-8 encoded and my SED command is in a ".sh" file that is UTF-8 encoded too.

It seems that the bug was due to the fact that I used SED on CYGWIN, as it worked well on GNU Linux.
Thank you for your attention.
I hope this thread could help another Cygwin user.

I can confirm this issue on Cygwin (v3.3.4) with sed (v4.8). Although I'm not quite sure what happened here, I found that by setting a UTF-16 compatible system locale makes it works:
$ python3 -c "print('This is my funny {} char and this | char is the char after which i want to stop my job.'.format(chr(0x1f3ba)))" | \
sed 's/This[^|]*/&(cool) /g'
This is my funny (cool) char and this | char is the char after which i want to stop my job.
$ python3 -c "print('This is my funny {} char and this | char is the char after which i want to stop my job.'.format(chr(0x1f3ba)))" | \
LANG=C.UTF16 sed 's/This[^|]*/&(cool) /g'
This is my funny 🎺 char and this (cool) | char is the char after which i want to stop my job.
By the way, always prefer single quotes when using sed in case $ and \ be expanded.

Related

Replacing = with in using sed

I have a string like below
abc="where session = '001122' and indicator = 'X'"
I want to convert it to
eng="where session in ('001122') and indicator in ('X')"
I have tried like below using sed in bash
eng=$(echo $abc | sed -r "s/=\s+('[^']+')/in (\1)/g")
I am still get the input itself. What am I doing wrong.
You can use unadorned sed with escaped to escape the capture group parentheses (\( and \)), as well as one-or-more quantifiers (\+):
$ eng=$(echo "$abc" | sed "s/=\s\+'\([^']\+\)'/in ('\1')/g"
$ echo "$eng"
where session in ('001122') and indicator in ('X')
It is also probably a good idea to quote your expansion of abc, since it has spaces in it, but not strictly necessary in this context.
Your original code may not have worked because -r is a GNU extension. The synonym -E used to be as well, but is now part of the POSIX standard, and should therefore be relatively portable. The following version should therefore have no problems either:
$ eng=$(echo "$abc" | sed -E "s/=\s+'([^']+)'/in ('\1')/g"

Need to parse the following sed command: sed -e 's/ /\'$'\n/g'

I stumble upon the command sed -e 's/ /\'$'\n/g'that supposedly takes an input and split all spaces into new lines. Still, I don't quite get how the '$' works in the command. I know that s stands for substitute, / / stands for the blank spac, \n stands for new line and /g is for global replacement, but not sure how \'$' fits in the picture. Anybody who can shed some light here will be much appreciated.
Basically it's meant for platform portability. With GNU sed it would be just
sed -e 's/ /\n/g'
because GNU sed is able to interpret \n as new line.
However, other versions of sed, like the BSD version (that comes with MacOS) do not interprete \n as newline.
That's why the command is build out of two parts
sed -e 's/ /\' part2: $'\n/g'
The $'\n/g' is an ANSI C string parsed by the shell before executing sed. Escape sequences like \n will get expanded in such strings. Doing so, the author of the command passed a literal new line (0xa) to the sed command rather than passing the escape sequence \n. (0x5c 0x6e).
One more thing, since the newline (0xa) is a command separator in sed, it needs to get escaped. That's why the \ at the end of the first part.
Alternatively you could just use a multiline version:
sed -e 's/ /\
/g'
Btw, I would have written the command like
sed -e 's/ /\'$'\n''/g'
meaning just putting the $'\n' into the ANSI C string. Imo that's better to understand.

How to use sed in order to search ^A and replace it

I would like to use (GNU) sed to do a simple search and replace. The issue is that I'm searching for a special character and it might be the reason it failed for me.
The input is:
^A9=139^A35=V^A34=9^A49=xxxx^A52=20140527-06:18:43.759^A5
and I want to replace the ^A with ;. I used:
sed -i '/s/^A/;/g' file.log
but I didn't get anything.
Your command should be,
sed -i 's/\^A/;/g' file
Command you tried,
sed -i '/s/^A/;/g' file.log
| |
| |______________You have to escape this special character. Because in general(regex) it means the starting point.
[No need to use `/` before s]
Example:
$ sed 's/\^A/;/g' file
;9=139;35=V;34=9;49=xxxx;52=20140527-06:18:43.759;5
^ has a special meaning with regular expressions. Use \^ (or potentially \\^, depending on how bash escapes things, I never quite remember it).

How to insert strings containing slashes with sed? [duplicate]

This question already has answers here:
Using different delimiters in sed commands and range addresses
(3 answers)
Closed 1 year ago.
I have a Visual Studio project, which is developed locally. Code files have to be deployed to a remote server. The only problem is the URLs they contain, which are hard-coded.
The project contains URLs such as ?page=one. For the link to be valid on the server, it must be /page/one .
I've decided to replace all URLs in my code files with sed before deployment, but I'm stuck on slashes.
I know this is not a pretty solution, but it's simple and would save me a lot of time. The total number of strings I have to replace is fewer than 10. A total number of files which have to be checked is ~30.
An example describing my situation is below:
The command I'm using:
sed -f replace.txt < a.txt > b.txt
replace.txt which contains all the strings:
s/?page=one&/pageone/g
s/?page=two&/pagetwo/g
s/?page=three&/pagethree/g
a.txt:
?page=one&
?page=two&
?page=three&
Content of b.txt after I run my sed command:
pageone
pagetwo
pagethree
What I want b.txt to contain:
/page/one
/page/two
/page/three
The easiest way would be to use a different delimiter in your search/replace lines, e.g.:
s:?page=one&:pageone:g
You can use any character as a delimiter that's not part of either string. Or, you could escape it with a backslash:
s/\//foo/
Which would replace / with foo. You'd want to use the escaped backslash in cases where you don't know what characters might occur in the replacement strings (if they are shell variables, for example).
The s command can use any character as a delimiter; whatever character comes after the s is used. I was brought up to use a #. Like so:
s#?page=one&#/page/one#g
A very useful but lesser-known fact about sed is that the familiar s/foo/bar/ command can use any punctuation, not only slashes. A common alternative is s#foo#bar#, from which it becomes obvious how to solve your problem.
add \ before special characters:
s/\?page=one&/page\/one\//g
etc.
In a system I am developing, the string to be replaced by sed is input text from a user which is stored in a variable and passed to sed.
As noted earlier on this post, if the string contained within the sed command block contains the actual delimiter used by sed - then sed terminates on syntax error. Consider the following example:
This works:
$ VALUE=12345
$ echo "MyVar=%DEF_VALUE%" | sed -e s/%DEF_VALUE%/${VALUE}/g
MyVar=12345
This breaks:
$ VALUE=12345/6
$ echo "MyVar=%DEF_VALUE%" | sed -e s/%DEF_VALUE%/${VALUE}/g
sed: -e expression #1, char 21: unknown option to `s'
Replacing the default delimiter is not a robust solution in my case as I did not want to limit the user from entering specific characters used by sed as the delimiter (e.g. "/").
However, escaping any occurrences of the delimiter in the input string would solve the problem.
Consider the below solution of systematically escaping the delimiter character in the input string before having it parsed by sed.
Such escaping can be implemented as a replacement using sed itself, this replacement is safe even if the input string contains the delimiter - this is since the input string is not part of the sed command block:
$ VALUE=$(echo ${VALUE} | sed -e "s#/#\\\/#g")
$ echo "MyVar=%DEF_VALUE%" | sed -e s/%DEF_VALUE%/${VALUE}/g
MyVar=12345/6
I have converted this to a function to be used by various scripts:
escapeForwardSlashes() {
# Validate parameters
if [ -z "$1" ]
then
echo -e "Error - no parameter specified!"
return 1
fi
# Perform replacement
echo ${1} | sed -e "s#/#\\\/#g"
return 0
}
this line should work for your 3 examples:
sed -r 's#\?(page)=([^&]*)&#/\1/\2#g' a.txt
I used -r to save some escaping .
the line should be generic for your one, two three case. you don't have to do the sub 3 times
test with your example (a.txt):
kent$ echo "?page=one&
?page=two&
?page=three&"|sed -r 's#\?(page)=([^&]*)&#/\1/\2#g'
/page/one
/page/two
/page/three
replace.txt should be
s/?page=/\/page\//g
s/&//g
please see this article
http://netjunky.net/sed-replace-path-with-slash-separators/
Just using | instead of /
Great answer from Anonymous. \ solved my problem when I tried to escape quotes in HTML strings.
So if you use sed to return some HTML templates (on a server), use double backslash instead of single:
var htmlTemplate = "<div style=\\"color:green;\\"></div>";
A simplier alternative is using AWK as on this answer:
awk '$0="prefix"$0' file > new_file
You may use an alternative regex delimiter as a search pattern by backs lashing it:
sed '\,{some_path},d'
For the s command:
sed 's,{some_path},{other_path},'

Skip/remove non-ascii character with sed

Chip,Dirkland,DrobæSphere Inc,cdirkland#hotmail.com,usa
I've been trying to use sed to modify email addresses in a .csv but the line above keeps tripping me up, using commands like:
sed -i 's/[\d128-\d255]//' FILENAME
from this stackoverflow question
doesn't seem to work as I get an 'invalid collation character' error.
Ideally I don't want to change that combined AE character at all, I'd rather sed just skip right over it as I'm not trying to manipulate that text but rather the email addresses. As long as that AE is in there though it causes my sed substitution to fail after one line, delete the character and it processes the whole file fine.
Any ideas?
This might work for you (GNU sed):
echo "Chip,Dirkland,DrobæSphere Inc,cdirkland#hotmail.com,usa" |
sed 's/\o346/a+e/g'
Chip,Dirkland,Droba+eSphere Inc,cdirkland#hotmail.com,usa
Then do what you have to do and after to revert do:
echo "Chip,Dirkland,Droba+eSphere Inc,cdirkland#hotmail.com,usa" |
sed 's/a+e/\o346/g'
Chip,Dirkland,DrobæSphere Inc,cdirkland#hotmail.com,usa
If you have tricky characters in strings and want to understand how sed sees them use the l0 command (see here). Also very useful for debugging difficult regexps.
echo "Chip,Dirkland,DrobæSphere Inc,cdirkland#hotmail.com,usa" |
sed -n 'l0'
Chip,Dirkland,Drob\346Sphere Inc,cdirkland#hotmail.com,usa$
sed -i 's/[^[:print:]]//' FILENAME
Also, this acts like dos2unix
The issue you are having is the local.
if you want to use a collation range like that you need to change the character type and the collation type.
This fails as \x80 -> \xff are invalid in a utf-8 string.
note \u0080 != \x80 for utf8.
anyway to get this to work just do
LC_ALL=C sed -i 's/[\d128-\d255]//' FILENAME
this will override LC_CTYPE and LC_COLLATE for the one command and do what you want.
I came here trying this sed command s/[\x00-\x1F]/ /g;, which gave me the same error message.
in this case it simply suffices to remove the \x00 from the collation, yielding s/[\x01-\x1F]/ /g;
Unfortunately it seems like all characters above and including \x7F and some others are disallowed, as can be seen with this short script:
for (( i=0; i<=255; i++ )); do
printf "== $i - \x$(echo "ibase=10;obase=16;$i" | bc) =="
echo '' | sed -E "s/[\d$i-\d$((i+1))]]//g"
done
Note that the problem is only the use of those characters to specify a range. You can still list them all manually or per script. E.g. to come back to your example:
sed -i 's/[\d128-\d255]//' FILENAME
would become
c=; for (( i=128; i<255; i++ )); do c="$c\d$i"; done
sed -i 's/['"$c"']//' FILENAME
which would translate to:
sed -i 's/[\d128\d129\d130\d131\d132\d133\d134\d135\d136\d137\d138\d139\d140\d141\d142\d143\d144\d145\d146\d147\d148\d149\d150\d151\d152\d153\d154\d155\d156\d157\d158\d159\d160\d161\d162\d163\d164\d165\d166\d167\d168\d169\d170\d171\d172\d173\d174\d175\d176\d177\d178\d179\d180\d181\d182\d183\d184\d185\d186\d187\d188\d189\d190\d191\d192\d193\d194\d195\d196\d197\d198\d199\d200\d201\d202\d203\d204\d205\d206\d207\d208\d209\d210\d211\d212\d213\d214\d215\d216\d217\d218\d219\d220\d221\d222\d223\d224\d225\d226\d227\d228\d229\d230\d231\d232\d233\d234\d235\d236\d237\d238\d239\d240\d241\d242\d243\d244\d245\d246\d247\d248\d249\d250\d251\d252\d253\d254\d255]//' FILENAME
In this case there is a way to just skip non-ASCII chars, not bothering with removing.
LANG=C sed /someemailpattern/
See https://bugzilla.redhat.com/show_bug.cgi?id=440419 and Will sed (and others) corrupt non-ASCII files?.
How about using awk for this. We setup the Field Separator to nothing. Then loop over each character. Use an if loop to check if it matches our character class. If it does we print it else we ignore it.
awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.# ]/) printf $i}'
Test:
[jaypal:~/Temp] echo "Chip,Dirkland,DrobæSphere Inc,cdirkland#hotmail.com,usa" |
awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.# ]/) printf $i}'
Chip,Dirkland,DrobSphere Inc,cdirkland#hotmail.com,usa
Update:
awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.# ]/) printf $i; printf "\n"}' < datafile.csv > asciidata.csv
I have added printf "\n" after the loop to keep the lines separate.