I've been trying to extract the bold portion from the following:
<a href="/torrent/4384536/HorribleSubs-Black-Clover-128-720p-mkv/">[HorribleSubs] Black Clover - 128 [720p].mkv</a>
But for whatever reason, this sed expression-
sed --regexp-extended 's#.*#\1#'
-is returning the entire file, when of course I only want the \1 capture group to be returned.
The weird thing is that this expression worked just fine when I tried debugging it with desed, with the capture group and primary match both showing up as expected.
I'm using GNU sed 4.8-1.
You can use
sed -n -E '/.*<a href="(\/torrent\/[^"]*)\/">[^<]*<\/a>.*/{s//\1/p;q}'
Details:
-n - suppresses default line output
-E - enables POSIX ERE regex syntax
/.*<a href="(\/torrent\/[^"]*)\/">[^<]*<\/a>.*/ - finds a line containing an <a href="/torrent/.../">...</a> substring, capturing the part between href=" and /"
{s//\1/p;q} - replaces the string matched above (the empty pattern in s// reuses the last regex) with the value of the captured substring, prints it and quits.
See the online demo:
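s='blah
<a href="/torrent/4384536/HorribleSubs-Black-Clover-128-720p-mkv/">[HorribleSubs] Black Clover - 128 [720p].mkv</a>
blah
<a href="/torrent/4384536/HorribleSubs-Black-Clover-128-720p-mkv/">[HorribleSubs] Black Clover - 128 [720p].mkv</a>
blah'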
sed -n -E '/.*<a href="(\/torrent\/[^"]*)\/">[^<]*<\/a>.*/{s//\1/p;q}' <<< "$s"
# => /torrent/4384536/HorribleSubs-Black-Clover-128-720p-mkv
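A likely reason the original attempt echoed the whole file is that, without -n, sed prints every line whether or not the substitution matched. The sketch below illustrates this on a hypothetical two-line input (the link and the variable names are made up for the example):

```shell
# Hypothetical input: only the second line contains a torrent link.
input='plain line
<a href="/torrent/123/some-name/">some name</a>'

# Without -n, lines that do not match are printed unchanged.
without_n=$(printf '%s\n' "$input" | sed -E 's#.*<a href="(/torrent/[^"]*)/">.*#\1#')

# With -n and the p flag, only lines where the substitution succeeded are printed.
with_n=$(printf '%s\n' "$input" | sed -n -E 's#.*<a href="(/torrent/[^"]*)/">.*#\1#p')

printf 'without -n:\n%s\nwith -n:\n%s\n' "$without_n" "$with_n"
```

The second form prints only the captured path, which matches the behaviour of the -n ... {s//\1/p;q} command above apart from not quitting after the first hit.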
I'm trying to run the command below to replace every char in DECEMBER by itself followed by $n question marks. I tried both escaping {$n} like so \{$n\} and leaving it as is. Yet my output just keeps being D?{$n}E?{$n}... Is it just not possible to do this with sed?
How should I go about this?
echo 'DECEMBER' > a.txt
sed -i "s%\(.\)%\1\(?\){$n}%g" a.txt
cat a.txt
This might work for you (GNU sed):
n=5
sed -E ':a;s/[^\n]/&\n/g;x;s/^/x/;/x{'"$n"'}/{z;x;y/\n/?/;b};x;ba' file
Append a newline to each non-newline character in a line, $n times, then replace all newlines by the intended character ?.
N.B. The newline is chosen as the initial substitute character because it cannot occur within a line (sed uses newlines to separate lines), so the substitutions are still correct if the final substitution character already exists within the current line.
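As a quick check of the loop described above, the same GNU sed script can be fed a single word on stdin instead of a file. This sketch assumes GNU sed (it relies on z and on \n inside a bracket expression) and uses n=2 purely for illustration:

```shell
n=2   # assumed repeat count for the example

# Same script as above, reading from stdin instead of a file.
result=$(printf '%s\n' 'DECEMBER' |
  sed -E ':a;s/[^\n]/&\n/g;x;s/^/x/;/x{'"$n"'}/{z;x;y/\n/?/;b};x;ba')

printf '%s\n' "$result"
```

The hold space acts as a counter: one x is added per pass, and the loop stops once it holds $n of them.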
Range quantifiers (also called interval or limiting quantifiers), like {3} / {3,} / {3,6}, are part of regex syntax, not of replacement patterns.
You can use
sed -i "s/./&$(for i in {1..7}; do echo -n '?'; done)/g" a.txt
See the online demo:
#!/bin/bash
sed "s/./&$(for i in {1..7}; do echo -n '?'; done)/g" <<< "DECEMBER"
# => D???????E???????C???????E???????M???????B???????E???????R???????
Here, . matches any char, and & in the replacement pattern puts it back and $(for i in {1..7}; do echo -n '?'; done) adds seven question marks right after it.
This one-liner should do the trick:
sed 's/./&'$(printf '%*s' "$n" '' | tr ' ' '?')'/g' a.txt
with the assumption that $n expands to a positive integer and the command is executed in a POSIX shell.
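A variant sketch of the same idea, with the question marks built in a separate variable and expanded inside double quotes so the shell cannot treat them as glob characters. The variable names marks and result are just for illustration, and n=3 is an assumed value:

```shell
n=3   # assumed repeat count for the example

# Build a string of $n question marks: pad an empty string to width n, then map spaces to ?.
marks=$(printf '%*s' "$n" '' | tr ' ' '?')

# Expand the variable inside double quotes so the ? characters are never glob-expanded.
result=$(printf '%s\n' 'DECEMBER' | sed "s/./&$marks/g")

printf '%s\n' "$result"
```

This still assumes the fill character has no special meaning in a sed replacement (so not &, / or a backslash).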
Efficiently using any awk in any shell on every Unix box after setting n=2:
$ awk -v n="$n" '
BEGIN {
new = sprintf("%*s",n,"")
gsub(/./,"?",new)
}
{
gsub(/./,"&"new)
print
}
' a.txt
D??E??C??E??M??B??E??R??
To make the changes "inplace" use GNU awk with -i inplace just like GNU sed has -i.
Caveat: if the character you want to use in the replacement text is &, then you'd need to use gsub(/./,"\\\\\\&",new) in the BEGIN section to make sure it is treated as a literal instead of a backreference metacharacter. You'd have that issue and more (e.g. handling \1 or /) with any sed solution. Any solution that uses double quotes around the script would additionally have issues with handling $s, and the solutions that leave a shell expansion unquoted would have even more issues with globbing characters.
The original text is:
apr_array_pstrcat(anythingbutalwayshereincludingspaces,anythingbutalwayshereincludingspaces, ',')
I want to change it to:
apr_array_pstrcat(samethingasabove,samethingasabove, ", ")
I got the following sed command, but it is not working:
find . -type f -exec sed -i "s/apr_array_pstrcat\((.*),(.*),(.*)','\)/apr_array_pstrcat\($1,$2,$3\", \"\)/g" {} +
How can I do this? I am able to understand PCRE regex, but I am not sure about this sed one.
Issues with OP's attempts:
-E is needed to enable ERE, otherwise \( and ( need to be reversed with default BRE
$1, $2, etc should be \1, \2, etc
there should be only two capture groups as per given sample
also, g flag isn't needed if there can be only one match per line
sed -E "s/apr_array_pstrcat\((.*),(.*)','\)/apr_array_pstrcat\(\1,\2\", \"\)/g"
This can be simplified to:
sed -E "s/(apr_array_pstrcat\(.*),(.*)','\)/\1,\2\", \"\)/g"
# or this one, since using double quotes for entire expression can lead to
# conflict with shell double quote interpretation
sed -E 's/(apr_array_pstrcat\(.*),(.*)\x27,\x27\)/\1,\2", "\)/g'
This can be further simplified depending on what kind of data is present in the input:
# change ',' to ", " if a line contains apr_array_pstrcat(
sed '/apr_array_pstrcat(/ s/\x27,\x27/", "/'
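To see the single-quoted ERE form working on one line, here is a minimal sketch. The argument names in the sample line are invented, \x27 is a GNU sed escape for the single quote, and the replacement uses a plain ) since no escaping is needed there:

```shell
# Hypothetical source line in the shape described in the question.
line="apr_array_pstrcat(arr one,pool two, ',')"

# \1 keeps everything up to the last argument separator, \2 keeps the text before the quoted comma.
fixed=$(printf '%s\n' "$line" |
  sed -E 's/(apr_array_pstrcat\(.*),(.*)\x27,\x27\)/\1,\2", ")/')

printf '%s\n' "$fixed"
```

The same expression can then be passed to find ... -exec sed -i -E '...' {} + once the output looks right on a sample.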
sed has the -E flag for "use extended regular expressions in the script".
I'd also match the arguments with 'anything that's not a comma': "[^,]+"
So this works for me:
sed -E "s/(apr_array_pstrcat\([^,]+, [^,]+,) ','\)/\1 \", \")/"
I am trying to extract the version information from a string using sed as follows
echo "A10.1.1-Vers8" | sed -n "s/^A\([0-9]+\)\.\([0-9]\)\.[0-9]+-.*/\1/p"
I want to extract '10' after 'A', but the above expression doesn't give the expected result. Could someone please explain why this statement doesn't work?
I tried the above command and changed the options of sed, but nothing works. I think this is some syntax error.
echo "A10.1.1-Vers10" | sed -n "s/^X\([0-9]+\)\.\([0-9]\)\.[0-9]+-.*/\1/p"
Expected result is '10'
Actual result is none (no output)
$ echo "A10.1.1-Vers8" | sed -r 's/^A([[:digit:]]+)\.(.*)$/\1/g'
10
Search for string starting with A (^A), followed by multiple digits (I am using POSIX character class [[:digit:]]+) which is captured in a group (), followed by a literal dot \., followed by everything else (.*)$.
Finally, replace the whole thing with the Captured Group content \1.
In GNU sed, -r enables extended regular expressions, so +, ( and ) work without backslashes; the man page lists the long form as --regexp-extended.
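The reason the question's command prints nothing is that, in the default basic syntax (BRE), an unescaped + is a literal plus sign rather than a quantifier. A small sketch comparing the three spellings (variable names are illustrative only):

```shell
version='A10.1.1-Vers8'

# BRE with a bare +: the + is taken literally, so nothing matches and nothing is printed.
broken=$(printf '%s\n' "$version" | sed -n 's/^A\([0-9]+\)\..*/\1/p')

# BRE with a POSIX interval \{1,\} meaning "one or more".
bre=$(printf '%s\n' "$version" | sed -n 's/^A\([0-9]\{1,\}\)\..*/\1/p')

# ERE (-E, or -r in GNU sed): + is a quantifier and groups need no backslashes.
ere=$(printf '%s\n' "$version" | sed -n -E 's/^A([0-9]+)\..*/\1/p')

printf 'broken=[%s] bre=[%s] ere=[%s]\n' "$broken" "$bre" "$ere"
```

GNU sed also accepts \+ in BRE as an extension, which would be the smallest change to the original command.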
GNU grep is an alternative to sed:
$ echo "A10.1.1-Vers10" | grep -oP '(?<=^A)[0-9]+'
10
The -o option tells grep to print only the matched characters.
The -P option tells grep to match Perl regular expressions, which enables the (?<= lookbehind zero-length assertion.
The lookbehind assertion (?<=^A) ensures there is an A at the beginning of the line, but doesn't include it as part of the match for output.
If you need to match more of the version string, you can use a lookahead assertion:
$ echo "A10.1.1-Vers10" | grep -oP '(?<=^A)[0-9]+(?=\.[0-9]+\.[0-9]+-.*)'
10
Using sed I want to parse Heroku's log-runtime-metrics like this one:
2016-01-29T00:38:43.662697+00:00 heroku[worker.2]: source=worker.2 dyno=heroku.17664470.d3f28df1-e15f-3452-1234-5fd0e244d46f sample#memory_total=54.01MB sample#memory_rss=54.01MB sample#memory_cache=0.00MB sample#memory_swap=0.00MB sample#memory_pgpgin=17492pages sample#memory_pgpgout=3666pages
the desired output is:
worker.2: 54.01MB (54.01MB being the memory_total value)
I could not get it to work, although I tried several alternatives, including:
sed -E 's/.+source=(.+) .+memory_total=(.+) .+/\1: \2/g'
What is wrong with my command? How can it be corrected?
The .+ after source= and memory_total= are both greedy, so they accept as much of the line as possible. Use [^ ] to mean "anything except a space" so that it knows where to stop.
sed -E 's/.+source=([^ ]+) .+memory_total=([^ ]+) .+/\1: \2/g'
Putting your content into https://regex101.com/ makes it really obvious what's going on.
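The difference between the greedy and the bounded capture can also be checked locally. The sketch below uses a shortened, made-up log line with the same field layout as the one in the question:

```shell
# Shortened hypothetical log line; field names follow the question's sample.
log='t heroku[worker.2]: source=worker.2 dyno=heroku.1 sample#memory_total=54.01MB sample#memory_rss=54.01MB sample#memory_pgpgout=3666pages'

# Greedy (.+): each capture runs on past the first space and swallows later fields.
greedy=$(printf '%s\n' "$log" | sed -E 's/.+source=(.+) .+memory_total=(.+) .+/\1: \2/')

# Bounded ([^ ]+): each capture stops at the first space after the value.
bounded=$(printf '%s\n' "$log" | sed -E 's/.+source=([^ ]+) .+memory_total=([^ ]+) .+/\1: \2/')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

Only the bounded form yields the short "name: value" pair that the question asks for.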
I'd go for the old-fashioned, reliable, non-extended sed expressions and make sure that the patterns are not too greedy:
sed -e 's/.*source=\([^ ]*\) .*memory_total=\([^ ]*\) .*/\1: \2/'
The -e is not the opposite of -E. The -E option, which enables extended regular expressions, originated as a Mac OS X (BSD) sed option; the traditional option for GNU sed is -r, although GNU sed accepts -E as well. The -e simply means that the next argument is an expression in the script.
This produces your desired output from the given line of data:
worker.2: 54.01MB
Bonus question: There are some odd lines within the stream, I can usually filter them out using a grep pipe like | grep memory_total. However if I try to use it along with the sed command, it does not work. No output is produced with this:
heroku logs -t -s heroku | grep memory_total | sed.......
Sometimes grep | sed is necessary, but it is often redundant (unless you are using a grep feature that isn't readily supported by sed, such as Perl regular expressions).
You should be able to use:
sed -n -e '/memory_total=/ s/.*source=\([^ ]*\) .*memory_total=\([^ ]*\) .*/\1: \2/p'
The -n means "don't print by default". The /memory_total=/ matches the lines you're after; the s/// content is the same as before. I removed the g suffix that was there previously; the regex would never match multiple times anyway. I added the p to print the line when the substitution occurs.
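A small sketch of the address-plus-p form replacing the grep stage, run on two made-up log lines of which only one carries memory_total:

```shell
# Two hypothetical log lines; only the second has the memory metrics.
logs='2016 heroku[router]: at=info method=GET path=/
2016 heroku[worker.2]: source=worker.2 dyno=x sample#memory_total=54.01MB sample#memory_rss=54.01MB'

# -n suppresses default output; the /memory_total=/ address selects lines; p prints after a successful substitution.
filtered=$(printf '%s\n' "$logs" |
  sed -n -e '/memory_total=/ s/.*source=\([^ ]*\) .*memory_total=\([^ ]*\) .*/\1: \2/p')

printf '%s\n' "$filtered"
```

The router line is dropped without a separate grep, and the worker line is reduced to its name and memory_total value.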
I am trying to write a sed expression that can remove URLs from a file.
Example:
http://samgovephotography.blogspot.com/ updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)
Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB Contributor #kdpartak :)
But I don't get it:
sed 's/[\w \W \s]*http[s]*:\/\/\([\w \W]\)\+[\w \W \s]*/ /g' posFile
FIXED!!!!!
handles almost all cases, even malformed URLs
sed 's/[\w \W \s]*http[s]*[a-zA-Z0-9 : \. \/ ; % " \W]*/ /g' positiveTweets | grep "http" | more
The following removes http:// or https:// and everything up until the next space:
sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g' posFile
updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)
Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N Thx to HMB Contributor #kdpartak :)
Edit:
I should have used:
sed -e 's!http[s]\?://\S*!!g' posFile
"[s]\?" is a far more readable way of writing "an optional s" compared to "\(s\)\{0,1\}"
"\S*" a more readable version of "any non-space characters" than "[^[:space:]]*"
I must have been using the sed that came installed with my Mac at the time I wrote this answer (brew install gnu-sed FTW).
There are better URL regular expressions out there (those that take into account schemes other than HTTP(S), for instance), but this will work for you, given the examples you give. Why complicate things?
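For comparison, here is a sketch of the two spellings side by side on a shortened version of the second sample line. The first uses only POSIX BRE constructs, the second relies on GNU extensions (\S) together with -E; the variable names are illustrative:

```shell
# Shortened sample tweet from the question.
tweet='Meet B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB'

# POSIX BRE only: s\{0,1\} is an optional s, [^[:space:]]* runs up to the next whitespace.
portable=$(printf '%s\n' "$tweet" | sed -e 's!https\{0,1\}://[^[:space:]]*!!g')

# GNU sed: ERE with ? and the \S shorthand for a non-space character.
gnu=$(printf '%s\n' "$tweet" | sed -E 's!https?://\S*!!g')

printf '%s\n%s\n' "$portable" "$gnu"
```

Both leave the two spaces that surrounded the URL in place, so a follow-up substitution would be needed if single spacing matters.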
The accepted answer provides the approach that I used to remove URLs, etc. from my files. However it left "blank" lines. Here is a solution.
sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' input_file
perl -i -pe 's/^'`echo "\012"`'${2,}//g' input_file
The GNU sed flags, expressions used are:
-i Edit in-place
-e [-e script] --expression=script : basically, add the commands in script
(expression) to the set of commands to be run while processing the input
^ Match start of line
$ Match end of line
? Match zero or one of preceding regular expression (written \? in the sed command above)
{2,} Match 2 or more of preceding regular expression
\S* Zero or more non-space characters; alternative to: [^[:space:]]*
However,
sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g'
leaves empty lines behind wherever a URL was the only content of a line. Standard sed-based approaches that trim tabs and spaces, e.g.
sed -i 's/^[ \t]*//; s/[ \t]*$//'
do not work here: sed reads input one line at a time and removes the trailing newline before running the script, so a substitution never sees it. Unless you join lines first (for example with a "branch label" loop), you cannot replace newlines using sed.
The solution is to use the following perl expression:
perl -i -pe 's/^'`echo "\012"`'${2,}//g'
which uses a shell substitution,
'`echo "\012"`'
to replace an octal value
\012
(i.e., a newline, \n), that occurs 2 or more times,
{2,}
(otherwise we would unwrap all lines), with something else; here:
//
i.e., nothing.
[The second reference below provides a wonderful table of these values!]
The perl flags used are:
-p Places a printing loop around your command,
so that it acts on each line of standard input
-i Edit in-place
-e Allows you to provide the program as an argument,
rather than in a file
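An alternative sketch, not the method of the answer above: instead of editing newlines, lines that end up empty can be dropped in the same sed call with the d command, which discards the whole line. This assumes GNU sed (for \? and \S) and a shortened, partly made-up input; note that it also removes lines that were blank to begin with:

```shell
# Shortened input in the style of the example file.
input='Some text ...
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file
www.google.ca/?q=halifax
ftp://ftp.ncbi.nlm.nih.gov/
Some more text.'

# First expression strips the URLs; second deletes any line left empty or whitespace-only.
cleaned=$(printf '%s\n' "$input" |
  sed -e 's/http[s]\?:\/\/\S*//g; s/www\.\S*//g; s/ftp:\S*//g' \
      -e '/^[[:space:]]*$/d')

printf '%s\n' "$cleaned"
```

Because d removes the line together with its newline, no second pass over the file is required.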
References:
perl flags: Perl flags -pe, -pi, -p, -w, -d, -i, -t?
ASCII control codes: https://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/
remove URLs: sed to remove URLs from a file
branch labels: How can I replace a newline (\n) using sed?
GNU sed manual: https://www.gnu.org/software/sed/manual/sed.html
quick regex guide: https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html
Example:
$ cat url_test_input.txt
Some text ...
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file
https://www.google.ca/search?dcr=0&ei=QCsyWtbYF43YjwPpzKyQAQ&q=python+remove++citations&oq=python+remove++citations&gs_l=psy-ab.3...1806.1806.0.2004.1.1.0.0.0.0.61.61.1.1.0....0...1c.1.64.psy-ab..0.0.0....0.-cxpNc6youY
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
http://datasynce.org/2017/05/sentiment-analysis-on-python-through-textblob/
https://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
http://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/alignment_indices/20100804.alignment.index
Some more text.
$ sed -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' url_test_input.txt > a
$ cat a
Some text ...
Some more text.
$ perl -i -pe 's/^'`echo "\012"`'${2,}//g' a
Some text ...
Some more text.
$