Unable to delete whitespace from string with tr, sed

I have a file that contains a whitespace character that I'm not able to successfully remove with command-line tools such as tr or sed. Here's the input:
2,  78 ,, 1
6, 74, ,1
and I want the output to look like:
2,78,,1
6,74,,1
Attempts
If I try tr -d "[[:space:]]", the result is 2, 78,,16,74,,1, which leaves a space character and removes the newline.
If I try sed 's/[[:space:]]//g' the result is
2, 78,,1
6,74,,1
which still leaves the space.
I converted the string to hex, and it seems the offending character is a0, but even then the results are not what I'd expect:
sed 's/\xa0//g' yields
2, �78 ,, 1
6, 74, ,1
Question
What is that whitespace character that is not getting caught by the [[:space:]] character class? How can I delete it?

The offending character is a UTF-8-encoded non-breaking space, with hex representation \xc2\xa0. You can remove all spaces, including non-breaking spaces, with
sed -E 's/[[:space:]]|\xc2\xa0//g'
Explanation
-E turns on extended regex to allow the | to represent logical OR
's/pattern/replacement/' substitutes pattern matches with the replacement text (in this case, an empty string), with /g repeating the pattern substitution multiple times per line
[[:space:]] matches most whitespace characters, including spaces and tabs
\xc2\xa0 is the hex code for the UTF-8 non-breaking space
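To try the command above without touching a real file, here is a minimal sketch. The function name strip_spaces is made up for illustration, and LC_ALL=C is an added assumption so that sed compares raw bytes regardless of the current locale. The sample input is written with octal escapes (\302\240 is the same byte pair as \xc2\xa0) because not every printf understands \x.

```shell
# strip_spaces is a hypothetical helper name, not part of the answer above.
# LC_ALL=C is an added assumption: it makes sed match raw bytes in any locale.
strip_spaces() {
  LC_ALL=C sed -E 's/[[:space:]]|\xc2\xa0//g'
}

# \302\240 is the octal spelling of the UTF-8 non-breaking space (c2 a0).
printf '2,\302\240 78 ,, 1\n6, 74, ,1\n' | strip_spaces
```

The \x escapes inside the sed expression are a GNU sed feature, so this assumes GNU sed.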

The characters you want to remove are the non-printable ones (i.e. the ones not in the [:print:] character class) rather than just the ones in the [:space:] character class:
$ printf 'foo\xc2\xa0bar\n' > file
$ cat file
foo bar
$ tr -dc '[:print:]' < file
foobar$
but I notice the equivalent doesn't work in GNU sed or GNU awk, and I don't know why.
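Two things to watch with the tr approach: it also deletes the newline (note the prompt right after foobar), and an ordinary space is printable, so it stays. A possible variation, sketched under the assumption that the data is plain ASCII apart from the non-breaking space, is to keep only [:graph:] characters plus the newline. The name keep_graph is hypothetical, and LC_ALL=C is added so the classes cover ASCII only.

```shell
# keep_graph is a hypothetical helper name.
# -d -c deletes every byte that is NOT in the set: visible ASCII plus newline.
keep_graph() {
  LC_ALL=C tr -dc '[:graph:]\n'
}

# \302\240 is the UTF-8 non-breaking space written in octal.
printf '2,\302\240 78 ,, 1\n6, 74, ,1\n' | keep_graph
```

Be aware that this throws away every non-ASCII character, not just the non-breaking space, so it is only suitable when that is acceptable.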

Related

Append specific character at the end of each line

I have a file and I want to append a specific text, \0A, to the end of each of its lines.
I used this command,
sed -i s/$/\0A/ file.txt
but that didn't work with backslash \0A.
In its default operation, sed cyclically reads a line from the input, less its terminating <newline> character, into the pattern space.
The OP wants to use sed to append the character \0A at the end of a line. This is the hexadecimal representation of the <newline> character (cf. http://www.asciitable.com/). So from this perspective, the OP is attempting to double-space a file. This can easily be done using:
sed G file
The G command appends a newline followed by the content of the hold space to the pattern space. Since nothing has been stored in the hold space, it is empty, so G just appends a newline character to the pattern space. The default action of sed is to print the pattern space, so this just double-spaces a file.
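A quick way to see the effect of G on a small input, as a sketch (the function name double_space is only for illustration):

```shell
# double_space is a hypothetical wrapper around the command above.
double_space() {
  sed G
}

# Each input line is followed by one empty line in the output.
printf 'a\nb\n' | double_space
```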
Your command should be fixed by simply enclosing s/$/\0A/ in single quotes (') and escaping the backslash (with another backslash):
sed -i 's/$/\\0A/' file.txt
Notice that the surrounding 's protect that string from being processed by the shell, but the backslash still needs escaping in order to protect it from sed itself.
Obviously, it's still possible to avoid the single quotes if you escape enough:
sed -i s/$/\\\\0A/ file.txt
In this case there are no single quotes to protect the string, so we need to write \\ in the shell to get sed fed with \, but we need two of those \\, i.e. \\\\, so that sed is fed with \\, which is an escaped \.
More obviously, I'd never ever suggest the second alternative.
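To check the quoted form without modifying a file in place, here is a small sketch that runs the same substitution on piped input (append_marker is a made-up name, and -i is dropped on purpose). For context: in the unquoted original the shell strips the backslash, so sed only sees s/$/0A/.

```shell
# append_marker is a hypothetical helper; same substitution as above, minus -i.
append_marker() {
  # Inside single quotes, \\ reaches sed unchanged and means one literal backslash.
  sed 's/$/\\0A/'
}

# Every output line should end in the four characters: backslash, 0, A, newline.
printf 'a\nb\n' | append_marker
```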

gnu sed remove portion of line after pattern match with special characters

The goal is to use sed to return only the url from each line of FF extension Mining Blocker which uses this format for its regex lines:
{"baseurl":"*://002.0x1f4b0.com/*", "suburl":"*://*/002.0x1f4b0.com/*"},
{"baseurl":"*://003.0x1f4b0.com/*", "suburl":"*://*/003.0x1f4b0.com/*"},
the result should be:
002.0x1f4b0.com
003.0x1f4b0.com
One way would be to keep everything after suburl":"*://*/ then remove each occurrence of /*"},
I found https://unix.stackexchange.com/questions/24140/return-only-the-portion-of-a-line-after-a-matching-pattern but the special characters are a problem.
this won't work:
sed -n -e s#^.*suburl":"*://*/##g hosts
Would someone please show me how to mark the 2 asterisks in the string so they are seen by regex as literal characters, not wildcards?
edit:
sed -n 's#.*://\*/\([^/]\+\)/.*#\1#p' hosts
doesn't work, unfortunately.
regarding character substitution, thanks for directing me to the references.
I reduced the searched-for string to //*/ and used ASCII character codes like this:
sed -n -e s#^.*\d047\d047\d042\d047##g hosts
Unfortunately, that didn't output any changes to the lines.
My assumptions are:
^.*something specifies everything up to and including the last occurrence of "something" in a line
sed -n -e s#search##g deletes (replace with nothing) "search" within a line
So, this line:
sed -n -e s#^.*\d047\d047\d042\d047##g hosts
Should output everything after //*/ in each line...except it doesn't.
What is incorrect with that line?
Regarding deleting everything including and after the first / AFTER that first operation, yes, that's wanted too.
This might work for you (GNU sed):
sed -n 's#.*://\*/\([^/]\+\)/.*#\1#p' file
Match greedily (the longest string that matches) all characters up to ://*/, followed by a group of characters (which will be referred to as \1) that do not match a /, followed by the rest of the line and replace it by the group \1.
N.B. The sed substitution delimiters are arbitrary; here # is chosen to make matching / easier. Also, the character * on the left-hand side of the substitution command would otherwise be interpreted as a metacharacter meaning zero or more of the previous character/group, so it is escaped as \* to match a literal asterisk. Finally, the -n option turns off the usual printing of the pattern space after all the sed commands have been executed. The p flag on the substitution command prints the pattern space following a successful substitution, so the output contains only the URLs, or nothing for lines that do not match.
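Here is a sketch that feeds the two sample lines from the question through that command. The function names are made up. The second variant uses -E (extended regex), which is my assumption of what to try if a given sed does not accept \+ in basic regex mode.

```shell
# extract_host: the command from the answer (GNU sed, basic regex with \+).
extract_host() {
  sed -n 's#.*://\*/\([^/]\+\)/.*#\1#p'
}

# extract_host_ere: same idea with extended regex, so + and ( ) need no backslashes.
extract_host_ere() {
  sed -nE 's#.*://\*/([^/]+)/.*#\1#p'
}

# The two sample lines from the question.
printf '%s\n' \
  '{"baseurl":"*://002.0x1f4b0.com/*", "suburl":"*://*/002.0x1f4b0.com/*"},' \
  '{"baseurl":"*://003.0x1f4b0.com/*", "suburl":"*://*/003.0x1f4b0.com/*"},' |
  extract_host
```

Single-quoting the whole sed script matters here: unquoted, the shell would treat # and * specially before sed ever sees them.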

How to use sed to change the first and second columns in a text file to uppercase

I have text file input.txt which has
april,december,month.gmail.com
lion,tiger,animal.gmail.com
Is there a way to change the first and second columns to uppercase using sed?
With GNU sed:
sed 's/^[a-z]*,[a-z]*,/\U&/' file
s: substitute command
[a-z]*,: search for zero or more lowercase letters followed by a ,. The pattern is repeated for the second field
the \U sequence turns the replacement to uppercase
\U is applied to &, which references the matched string
or, if there are only three comma-separated fields:
sed 's/^[a-z].*,/\U&/' file
output:
APRIL,DECEMBER,month.gmail.com
LION,TIGER,animal.gmail.com
As @Sundeep suggests, the second sed command can be shortened to:
s/^.*,/\U&/
which converts all characters until last , is found
For more on GNU sed substitution command, see this article
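A short sketch to try this on the sample data. \U is a GNU sed extension, so for systems without GNU sed I've added an awk variant as an alternative. That variant is my own assumption and not part of the answer; both function names are hypothetical.

```shell
# upper_two_gnu: the GNU sed command from the answer.
upper_two_gnu() {
  sed 's/^[a-z]*,[a-z]*,/\U&/'
}

# upper_two_awk: assumed portable alternative using awk's toupper().
upper_two_awk() {
  awk 'BEGIN { FS = OFS = "," } { $1 = toupper($1); $2 = toupper($2); print }'
}

printf 'april,december,month.gmail.com\nlion,tiger,animal.gmail.com\n' | upper_two_gnu
```

The awk version uppercases fields one and two regardless of what characters they contain, whereas the sed pattern only matches when both fields consist of lowercase a-z.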

Sed - Printing a pattern in a line matched more than once

Input-
X's Score 1725 and Y's Score 6248 in the match number 576
I want sed to output-
1725
6248
My code-
sed 's/Score[[:space:]]\([0-9]+\)/\1/g'
The above code outputs -
1725 and Y's 6248 in the match
You could try the following sed commands
#!/bin/sed -f
s/Score\s*/\
/g
s/\n\([0-9]\+\)[^\n]*/\
\1/g
s/^[^\n]*\n//
The first command replaces all "Score"s with newlines, so now all numbers are at the beginning of a line. To insert a newline character, we must write a backslash followed by an actual line break. That's why the command spans two lines.
The second command will remove everything after the numbers that are at the beginning of a line. It will match a newline character followed by a number (this is how we know that this number was prefixed by a "Score" string). The number will be captured into variable \1. Then it will skip all characters up to the next newline character. When writing the replacement, we must restore the newline character and the number that was captured into \1.
Because the first line contains text before the first "Score", we must remove it. That's what the last command does, it matches all characters up to the first newline, starting from the beginning of the contents of the pattern space (ie. our working buffer).
In a single command:
sed -e 's/Score\s*/\
/g;s/\n\([0-9]\+\)[^\n]*/\
\1/g;s/^[^\n]*\n//'
Hope this helps =)
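For anyone who wants to run it right away, here is the single command wrapped in a function and applied to the sample line from the question. The name extract_scores is made up, and \s, \+ and \n inside brackets assume GNU sed.

```shell
# extract_scores is a hypothetical wrapper around the single command above (GNU sed).
extract_scores() {
  sed -e 's/Score\s*/\
/g;s/\n\([0-9]\+\)[^\n]*/\
\1/g;s/^[^\n]*\n//'
}

printf '%s\n' "X's Score 1725 and Y's Score 6248 in the match number 576" | extract_scores
```

Lines that contain no "Score" at all are printed unchanged by this script, so filter them first if that matters.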
One way using GNU sed because \b that matches a word boundary is an extension.
echo "X's Score 1725 and Y's Score 6248 in the match number 576" | sed -e '
## Surround searched numbers (preceded by "Score") with newline characters.
s/\bScore \([0-9]\+\)\b/\n\1\n/g;
## Delete all numbers not preceded by a newline character.
s/\([^\n0-9]\)[0-9]\+/\1/g;
## Remove all other characters but numbers and newlines.
s/[^0-9\n]\+//g;
## Remove extra newlines.
s/\n\([0-9]\)/\1/g;
s/\n$//
' infile
It yields:
1725
6248
You could AND two egreps:
<infile egrep -o 'Score [0-9]+' | egrep -o '[0-9]+$'
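The same pipeline can be written with grep -E, which may be preferable because recent GNU grep releases print a deprecation warning for egrep. A sketch reading from standard input (scores_grep is a made-up name; -o is supported by GNU and BSD grep but is not in POSIX):

```shell
# scores_grep is a hypothetical helper name.
scores_grep() {
  # First grep keeps only "Score <digits>", second keeps the trailing digits.
  grep -Eo 'Score [0-9]+' | grep -Eo '[0-9]+$'
}

printf '%s\n' "X's Score 1725 and Y's Score 6248 in the match number 576" | scores_grep
```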

sed: change word order and replace

I'm trying to replace;
randomtext{{XX icon}}
by
randomtext{{ref-XX}}
..in a file, where XX could be any sequence of 2 or 3 lowercase letters.
I attempted rearranging the word order with awk before replacing "icon" with "ref-" with sed;
awk '{print $2, $1}'
..but since there is no space before the first word nor after the second one, it messed up the curly brackets;
icon}} {{XX
What is the simplest way to achieve this using sed?
sed 's/{{\([a-z]\{2,3\}\)\sicon/{{ref-\1/'
This one liner uses the substitute command s/PATTERN/REPLACE/. {{ matches two brackets. \([a-z]\{2,3\}\) captures the pattern that matches 2 or 3 lowercase letters. \s matches a white space. icon matches the literal string "icon". Then we replace the match, that is, {{....icon with the literal string {{ref- and the captured 2 or 3 letter word.
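A quick sketch to check the behaviour on a few inputs. The function name and the sample strings other than the {{.. icon}} shape are my own, and \s assumes GNU sed.

```shell
# icon_to_ref is a hypothetical wrapper around the one-liner above.
icon_to_ref() {
  sed 's/{{\([a-z]\{2,3\}\)\sicon/{{ref-\1/'
}

# A 2-letter and a 3-letter code are rewritten; a 4-letter code is left alone.
printf 'randomtext{{fr icon}}\nother{{deu icon}}\nskip{{abcd icon}}\n' | icon_to_ref
```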
Here's a more generic version using the hash sign (#) as the regex delimiter:
sed 's#{{\([^ ]*\) [^}]*#{{ref-\1#'
{{ anchors the regex at the double open curly braces.
\([^ ]*\) captures everything up until a space.
[^}]* eats everything up until a closing curly brace.
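To see what "more generic" means in practice, here is a sketch with a made-up function name. Unlike the stricter pattern, this one does not check the letters or the word icon, so it rewrites any {{word something}} on the line.

```shell
# icon_to_ref_generic is a hypothetical wrapper around the generic command above.
icon_to_ref_generic() {
  sed 's#{{\([^ ]*\) [^}]*#{{ref-\1#'
}

# Works for the literal XX from the question as well as lowercase codes.
printf 'randomtext{{XX icon}}\nrandomtext{{fr icon}}\n' | icon_to_ref_generic
```

If the file contains other double-brace templates with a space in them, the stricter [a-z]\{2,3\} version is the safer choice.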