Verbatim Match with sed

I have a list of pairs of URLs - I want to find all occurrences of the first element of the pair and replace them with the second. I'm trying to use sed for this but sed escapes characters in my URL. Is there a way to make sed find these URLs (without changing my pairs)?
Here's my code:
while read -r NAME
do
ARG1=`echo "$NAME" | awk '{print $1}'`
ARG2=`echo "$NAME" | awk '{print $2}'`
echo "$ARG1"
echo "$ARG2"
sed -i "s#$ARG1#$ARG2#g" file
done < pagetable
pagetable has the pairs of URLs, and I'm doing the find and replace in 'file'. Since my URLs have special characters, sed isn't interpreting them verbatim.

Escape the metacharacters in the search pattern (\ * ^ $ . /) and in the replacement string (\ & /) before invoking sed. This assumes that the script is run by Bash. Note that [ is also special in a sed pattern and is not handled by the lines below.
ARG1="${ARG1//\\/\\\\}"
ARG1="${ARG1//\*/\\\*}"
ARG1="${ARG1//\//\\/}"
for mc in \^ \$ \.; do ARG1="${ARG1//$mc/\\$mc}"; done
ARG2="${ARG2//\\/\\\\}"
ARG2="${ARG2//\//\\/}"
ARG2="${ARG2//&/\\&}"
sed -i "s/$ARG1/$ARG2/g" file
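The same idea can be sketched with sed doing the escaping itself. The helper names escape_pattern, escape_replacement and replace_literal and the sample URLs are made up for illustration, and basic regular expressions are assumed. Unlike the list above, the pattern helper also escapes [ and ].

```shell
#!/bin/bash
# Hypothetical helpers (not from the original answer).

# Escape BRE metacharacters and the / delimiter in a search string.
escape_pattern() {
    printf '%s\n' "$1" | sed 's#[]\\/$*.^[]#\\&#g'
}

# Escape the characters that are special in the replacement part of s///.
escape_replacement() {
    printf '%s\n' "$1" | sed 's#[\\/&]#\\&#g'
}

# Replace every literal occurrence of $1 with $2 on stdin.
replace_literal() {
    local pat rep
    pat=$(escape_pattern "$1")
    rep=$(escape_replacement "$2")
    sed "s/$pat/$rep/g"
}

printf '%s\n' 'see http://a.example/x.html?y=1' |
    replace_literal 'http://a.example/x.html?y=1' 'https://b.example/p&q'
```

The helpers use # as their own delimiter so that / can sit inside the bracket expression without ending the command early. Strings containing newlines are not handled.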

Related

Regex: how to match up to a character or the end of a line?

I am trying to separate out parts of a path as follows. My input path takes the following possible forms:
bucket
bucket/dir1
bucket/dir1/dir2
bucket/dir1/dir2/dir3
...
I want to separate the first part of the path (bucket) from the rest of the string if present (dir1/dir2/dir3/...), and store both in separate variables.
The following gives me something close to what I want:
❯ BUCKET=$(echo "bucket/dir1/dir2" | sed 's#\(^[^\/]*\)[\/]\(.*\)#\1#')
❯ EXTENS=$(echo "bucket/dir1/dir2" | sed 's#\(^[^\/]*\)[\/]\(.*\)#\2#')
echo $BUCKET $EXTENS
❯ bucket dir1/dir2
HOWEVER, it fails if I only have bucket as input (without a slash):
❯ BUCKET=$(echo "bucket" | sed 's#\(^[^\/]*\)[\/]\(.*\)#\1#')
❯ EXTENS=$(echo "bucket" | sed 's#\(^[^\/]*\)[\/]\(.*\)#\2#')
echo $BUCKET $EXTENS
❯ bucket bucket
... because, in the absence of the first '/', no capture happens, so no substitution takes place. When the input is just 'bucket' I would like $EXTENS to be set to the empty string "".
Thanks!
For something so simple you could use bash built-in instead of launching sed:
$ path="bucket/dir1/dir2"
$ bucket="${path%%/*}"
$ extens="${path#$bucket}"
$ printf '|%s|%s|\n' "$bucket" "$extens"
|bucket|/dir1/dir2|
$ path="bucket"
$ bucket="${path%%/*}"
$ extens="${path#$bucket}"
$ printf '|%s|%s|\n' "$bucket" "$extens"
|bucket||
But if you really want to use sed and capture groups:
$ declare -a bucket_extens
$ mapfile -td '' bucket_extens < <(printf '%s' "bucket/dir1/dir2" | sed -E 's!([^/]*)(.*)!\1\x00\2!')
$ printf '|%s|%s|\n' "${bucket_extens[@]}"
|bucket|/dir1/dir2|
$ mapfile -td '' bucket_extens < <(printf '%s' "bucket" | sed -E 's!([^/]*)(.*)!\1\x00\2!')
$ printf '|%s|%s|\n' "${bucket_extens[@]}"
|bucket||
We use the extended regex (-E) to simplify a bit, and ! as separator of the substitute command. The first capture group is simply anything not containing a slash and the second is everything else, including nothing if there's nothing else.
In the replacement string we separate the two capture groups with a NUL character (\x00). We then use mapfile to assign the result to bash array bucket_extens.
The NUL trick is a way to deal with file names containing spaces, newlines, and so on: NUL is the only character that cannot appear in a path. The -d '' option of mapfile indicates that the records to map are separated by NUL instead of the default newline.
Don't capture anything. Instead, just match what you don't want and replace it with nothing:
BUCKET=$(echo "bucket" | sed 's#/.*##') # bucket
BUCKET=$(echo "bucket/dir1/dir2" | sed 's#/.*##') # bucket
EXTENS=$(echo "bucket" | sed 's#[^/]*##') # blank
EXTENS=$(echo "bucket/dir1/dir2" | sed 's#[^/]*##') # /dir1/dir2
Because your regex requires a slash, a string with no slash does not
match. Make the slash optional with /\?. (The backslash before ? is
required in sed's BRE syntax; \? itself is a GNU extension.) Then try:
#!/bin/bash
#path="bucket/dir1/dir2"
path="bucket"
bucket=$(echo "$path" | sed 's#\(^[^/]*\)/\?\(.*\)#\1#')
extens=$(echo "$path" | sed 's#\(^[^/]*\)/\?\(.*\)#\2#')
echo "$bucket" "$extens"
You don't need to prepend a backslash to a slash.
By convention, lowercase names are recommended for user variables.
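The answers above leave the leading slash in the second part or rely on sed. As a minimal sketch, assuming Bash or any POSIX shell, parameter expansion alone can produce exactly the two values asked for in the question. The function name split_path is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: sets the variables bucket and extens from a path.
split_path() {
    bucket=${1%%/*}              # everything before the first slash
    case $1 in
        */*) extens=${1#*/} ;;   # everything after the first slash
        *)   extens= ;;          # no slash: nothing follows the bucket
    esac
}

split_path "bucket/dir1/dir2"
printf '|%s|%s|\n' "$bucket" "$extens"   # |bucket|dir1/dir2|
split_path "bucket"
printf '|%s|%s|\n' "$bucket" "$extens"   # |bucket||
```

The case statement is needed because ${1#*/} leaves the string unchanged when there is no slash to strip.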

Using a single sed call to split and grep

This is mostly by curiosity, I am trying to have the same behavior as:
echo -e "test1:test2:test3"| sed 's/:/\n/g' | grep 1
in a single sed command.
I already tried
echo -e "test1:test2:test3"| sed -e "s/:/\n/g" -n "/1/p"
But I get the following error:
sed: can't read /1/p: No such file or directory
Any idea on how to fix this and combine different types of commands into a single sed call?
Of course this is overly simplified compared to the real usecase, and I know I can get around by using multiple calls, again this is just out of curiosity.
EDIT: I am mostly interested in the sed tool, I already know how to do it using other tools, or even combinations of those.
EDIT2: Here is a more realistic script, closer to what I am trying to achieve:
arch=linux64
base=https://chromedriver.storage.googleapis.com
split="<Contents>"
curl $base \
| sed -e 's/<Contents>/<Contents>\n/g' \
| grep $arch \
| sed -e 's/^<Key>\(.*\)\/chromedriver.*/\1/' \
| sort -V > out
What I would like to simplify is the curl line, turning it into something like:
curl $base \
| sed 's/<Contents>/<Contents>\n/g' -n '/1/p' -e 's/^<Key>\(.*\)\/chromedriver.*/\1/' \
| sort -V > out
Here are some alternatives, awk and sed based:
sed -E "s/(.*:)?([^:]*1[^:]*).*/\2/" <<< "test1:test2:test3"
awk -v RS=":" '/1/' <<< "test1:test2:test3"
# or also
awk 'BEGIN{RS=":"} /1/' <<< "test1:test2:test3"
Or, using your logic, you would need to pipe a second sed command:
sed "s/:/\n/g" <<< "test1:test2:test3" | sed -n "/1/p"
The awk solution looks cleanest.
Details
In the sed solution, the (.*:)?([^:]*1[^:]*).* pattern matches an optional sequence of any 0+ chars followed by a :, then captures into Group 2 any 0 or more chars other than :, a 1, and again 0 or more chars other than :, and then just matches the rest of the line. The replacement keeps only the Group 2 contents.
In the awk solution, the record separator is set to : and then the /1/ regex is used to return only the records having 1 in them.
This might work for you (GNU sed):
sed 's/:/\n/;/^[^\n]*1/P;D' file
Replace the first : with a newline and, if the first line in the pattern space contains 1, print it. Delete that first line and
repeat with what is left of the pattern space.
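To see the P/D loop at work, here is a small sketch that wraps the command in a function. The name split_grep is hypothetical, GNU sed is assumed (for \n in the replacement and in the bracket expression), and the argument must be a regex containing no / or newline.

```shell
#!/bin/bash
# Hypothetical wrapper around the P/D loop shown above.
# $1 is a regex without '/' or newlines; input is read from stdin.
split_grep() {
    # s replaces only the first colon; P prints the first line of the
    # pattern space if it matches; D drops that line and restarts the
    # script on the remainder without reading new input.
    sed "s/:/\n/;/^[^\n]*$1/P;D"
}

printf 'test1:test2:test3\n' | split_grep 1   # prints test1
printf 'foo1:bar2:baz1\n' | split_grep 1      # prints foo1, then baz1
```

No -n is needed because D never triggers the automatic print at the end of the cycle.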
An alternative:
sed -Ez 's/:/\n/g;s/^[^1]*$//mg;s/\n+/\n/;s/^\n//' file
This slurps the whole file into memory and replaces all colons by newlines. All lines that do not contain 1 are removed and surplus newlines deleted.
An alternative to the really ugly sed is: grep -o '\w*2\w*'
$ printf "test1:test2:test3\nbob3:bob2:fred2\n" | grep -o '\w*2\w*'
test2
bob2
fred2
grep -o: only matching
Or: grep -o '[^:]*2[^:]*'
echo -e "test1:test2:test3" | sed -En 's/:/\n/g;/^[^\n]*2[^\n]*(\n|$)/P;//!D'
sed -n doesn't print unless told to
sed -E allows using parens to match (\n|$) which is newline or the end of the pattern space
P prints the pattern buffer up to the first newline.
D trims the pattern buffer up to the first newline
[^\n] is a character class that matches anything except a newline
// is sed shorthand for repeating a match
//! is then matching everything that didn't match previously
So, after you split into newlines, you want to make sure the 2 character is between the start of the pattern buffer ^ and the first newline.
And, if there is not the character you are looking for, you want to D delete up to the first newline.
At that point, it works for one line of input, with one string containing the character you're looking for.
To expand to several matches within a line, you have to ta, conditionally branch back to label :a:
$ printf "test1:test2:test3\nbob3:bob2:fred2\n" | \
sed -En ':a s/:/\n/g;/^[^\n]*2[^\n]*(\n|$)/P;D;ta'
test2
bob2
fred2
This is simply NOT a job for sed. With GNU awk for multi-char RS:
$ echo "test1:test2:test3:test4:test5:test6"| awk -v RS='[:\n]' '/1/'
test1
$ echo "test1:test2:test3:test4:test5:test6"| awk -v RS='[:\n]' 'NR%2'
test1
test3
test5
$ echo "test1:test2:test3:test4:test5:test6"| awk -v RS='[:\n]' '!(NR%2)'
test2
test4
test6
$ echo "foo1:bar1:foo2:bar2:foo3:bar3" | awk -v RS='[:\n]' '/foo/ || /2/'
foo1
foo2
bar2
foo3
With any awk you'd just have to strip the \n from the final record before operating on it:
$ echo "test1:test2:test3:test4:test5:test6"| awk -v RS=':' '{sub(/\n$/,"")} /1/'
test1

How to replace \n by space using sed command?

I have to collect a select query data to a CSV file. I want to use a sed command to replace \n from the data by a space.
I'm using this:
query | sed "s/\n/ /g" > file.csv .......
But it is not working. Only \ is getting removed, while it should also remove n and add a space. Please suggest something.
You want to replace newline with space, not necessarily using sed.
Use tr:
tr '\n' ' '
\n is special to sed: it stands for the newline character. To replace a literal \n, you have to escape the backslash:
sed 's/\\n/ /g'
Notice that I've used single quotes. If you use double quotes, the backslash has a special meaning if followed by any of $, `, ", \, or newline, i.e., "\n" is still \n, but "\\n" would become \n.
Since we want sed to see \\n, we'd have to use one of these:
sed "s/\\\n/ /g" – the first \\ becomes \, and \n doesn't change, resulting in \\n
sed "s/\\\\n/ /g" – both pairs of \\ are reduced to \ and sed gets \\n as well
but single quotes are much simpler:
$ sed 's/\\n/ /g' <<< 'my\nname\nis\nrohinee'
my name is rohinee
From comments on the question, it became apparent that sed had nothing to do with removing the backslashes; the OP tried
echo my\nname\nis | sed 's/\n/ /g'
but the backslashes are removed by the shell:
$ echo my\nname\nis
mynnamenis
so even if the correct \\n were used, sed wouldn't find any matches. The correct way is
$ echo 'my\nname\nis' | sed 's/\\n/ /g'
my name is
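The two situations discussed in this thread can be put side by side. This is only a sketch with hypothetical function names; paste is swapped in for tr in the second case so that the final newline is not turned into a trailing space.

```shell
#!/bin/bash
# Hypothetical helpers illustrating the two cases.

# Case 1: the input contains the two characters backslash and n.
literal_to_space() {
    sed 's/\\n/ /g'
}

# Case 2: the input contains real newline characters; join the lines.
newlines_to_space() {
    paste -sd ' ' -
}

printf '%s\n' 'my\nname\nis' | literal_to_space   # my name is
printf 'my\nname\nis\n' | newlines_to_space       # my name is
```

In the first printf the text is passed as an argument to %s, so the backslashes reach sed untouched; in the second, printf itself turns \n into newlines.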

search a string which contains "/" and replace using sed

How to search a pattern and remove the line using sed which contains special characters like "ranasnfs2:/SA_kits/prod"
I tried using a variable to hold the complete string and then recall the variable in sed command but it is not working.
echo $a
ranasnfs2:/SA_kits/prod
sed -i '/"$a"/d' test.txt
cat test.txt | grep -i SA
/SA_kits -rw,suid,soft,retry=4 ranasnfs2:/SA_kits/prod
You need to escape the slash character.
Use this for deleting lines which contain a /:
sed '/\//d' file
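When the pattern comes from a variable, as in the question, two more things matter: single quotes stop the shell from expanding $a, and the slashes inside the value clash with the / address delimiter. A minimal sketch, assuming the value contains no # and no regex metacharacters that need escaping, uses double quotes and sed's \cREGEXc address form with # as the delimiter. The function name delete_matching and the second sample line are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: delete lines from stdin that match the regex in $1.
# Assumes $1 contains no '#'.
delete_matching() {
    # "\\#" reaches sed as \#, which starts an address delimited by #.
    sed "\\#$1#d"
}

a='ranasnfs2:/SA_kits/prod'
printf '%s\n' \
    '/SA_kits -rw,suid,soft,retry=4 ranasnfs2:/SA_kits/prod' \
    '/home -rw,soft server:/export/home' |
    delete_matching "$a"
# prints only the /home line
```

The same address works with sed -i on test.txt once the output looks right.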

How to extract URL from html source with sed/awk or cut?

I am writing a script that will download an html page source as a file and then read the file and extract a specific URL that is located after a specific code. (it only has 1 occurrence)
Here is a sample that I need matched:
<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"
The code preceding the URL will always be the same so I need to extract the part between:
<img id="sample-image" class="photo" src="
and the " after the URL.
I tried something with sed like this:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
But it does not work. I would appreciate your suggestions, thanks a lot !
You can use grep like this :
grep -oP '<img\s+id="sample-image"\s+class="photo"\s+src="\K[^"]+' test.txt
or with sed :
sed -rn 's/.*<img\s+id="sample-image"\s+class="photo"\s+src="([^"]+)".*/\1/p' test.txt
or with awk :
awk -F'src="' -F'"' '/<img\s+id="sample-image"/{print $6}' test.txt
If you have GNU grep then you can do something like:
grep -oP "(?<=src=\")[^\"]+(?=\")" test.txt
If you wish to use awk then the following would work:
awk -F\" '{print $(NF-1)}' test.txt
With sed as
echo "$string" | sed 's/.*<img.*src="\([^"]*\)".*/\1/'
A few things about the sed command you are using:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
You don't need to escape the <, " or space. The single quotes prevent the shell from doing word splitting and other processing on your sed expression.
You are essentially doing sed -n '/pattern/p' test.txt (except you seem to be missing the opening slash), which says "match this pattern, then print the line which contains the match"; you are not really extracting the URL.
This is minor, but you don't need to match class="photo" since the id already makes the HTML element unique (no two elements share the same id w/in the same HTML).
Here's what I would do
sed -n 's/.*<img id="sample-image".*src="\([^"]*\)".*/\1/p' test.txt
The p flag tells sed to print the line where substitution (s) was performed.
\(pattern\) captures a subexpression which can be accessed via \1, \2, etc. on the right side of s///
The .* at the start of regex is in case there is something else preceding the <img> element on the line (you did mention you are parsing a HTML file)
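A self-contained way to try this out, as a sketch only: the function name extract_src and the surrounding HTML lines are invented for illustration, and [^>]* is used between the id and src so the match stays inside the one tag.

```shell
#!/bin/bash
# Hypothetical helper: print the src URL of the <img id="sample-image">
# element found in the file named by $1.
extract_src() {
    sed -n 's/.*<img id="sample-image"[^>]*src="\([^"]*\)".*/\1/p' "$1"
}

# Build a small sample file (made-up markup around the line from the question).
sample=$(mktemp)
printf '%s\n' '<html><body>' \
    '<div><img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg" alt=""></div>' \
    '</body></html>' > "$sample"

extract_src "$sample"   # http://xxxx.com/some/ic/pic_1asda963_16x9.jpg
rm -f "$sample"
```

This only works while the whole tag sits on one line; for arbitrary HTML a real parser is the safer choice.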