sed equivalent of perl -pe - perl

I'm looking for an equivalent of perl -pe. Ideally, it would be replace with sed if it's possible. Any help is highly appreciated.
The code is:
perl -pe 's/^\[([^\]]+)\].*$/$1/g'

$ echo '[foo] 123' | perl -pe 's/^\[([^\]]+)\].*$/$1/g'
foo
$ echo '[foo] 123' | sed -E 's/^\[([^]]+)\].*$/\1/'
foo
sed by default accepts code from command line, so -e isn't needed (though it can be used)
printing the pattern space is default, so -p isn't needed and sed -n is similar to perl -n
-E is used here to be as close as possible to Perl regex. sed supports BRE and ERE (not as feature rich as Perl) and even that differs from implementation to implementation.
with BRE, the command for this example would be: sed 's/^\[\([^]]*\)\].*$/\1/'
\ isn't special inside character class unless it is an escape sequence like \t, \x27 etc
backreferences use \N format (and limited to maximum 9)
Also note that g flag isn't needed in either case, as you are using line anchors

Related

How do I correctly use variables with sed?

I need to use a string variable in a sed command. My attempt is given in script.sh, it does not do what I want, I assume because my variable contains characters that I need sed to evaluate. I am working in linux bash.
input.txt
delicious.banana
gross.apple
script.sh
adjectives="delicious\|gross\|bearable\|yummy"
sed "s/\($adjectives\)\.//g" input.txt > output.txt
output.txt desired
banana
apple
output.txt current
deliciousbanana
grossdapple
Non-gnu sed don't work with \| in BRE (basic regex mode). I suggest using ERE (Extended regex mode) using -E and as a bonus you can eliminate all the escaping:
adjectives="delicious|gross|bearable|yummy"
sed -E "s/($adjectives)\.//g" input.txt
banana
apple

Parsing a line with sed using regular expression

Using sed I want to parse Heroku's log-runtime-metrics like this one:
2016-01-29T00:38:43.662697+00:00 heroku[worker.2]: source=worker.2 dyno=heroku.17664470.d3f28df1-e15f-3452-1234-5fd0e244d46f sample#memory_total=54.01MB sample#memory_rss=54.01MB sample#memory_cache=0.00MB sample#memory_swap=0.00MB sample#memory_pgpgin=17492pages sample#memory_pgpgout=3666pages
the desired output is:
worker.2: 54.01MB (54.01MB is being memory_total)
I could not manage although I tried several alternatives including:
sed -E 's/.+source=(.+) .+memory_total=(.+) .+/\1: \2/g'
What is wrong with my command? How can it be corrected?
The .+ after source= and memory_total= are both greedy, so they accept as much of the line as possible. Use [^ ] to mean "anything except a space" so that it knows where to stop.
sed -E 's/.+source=([^ ]+) .+memory_total=([^ ]+) .+/\1: \2/g'
Putting your content into https://regex101.com/ makes it really obvious what's going on.
I'd go for the old-fashioned, reliable, non-extended sed expressions and make sure that the patterns are not too greedy:
sed -e 's/.*source=\([^ ]*\) .*memory_total=\([^ ]*\) .*/\1: \2/'
The -e is not the opposite of -E, which is primarily a Mac OS X (BSD) sed option; the normal option for GNU sed is -r instead. The -e simply means that the next argument is an expression in the script.
This produces your desired output from the given line of data:
worker.2: 54.01MB
Bonus question: There are some odd lines within the stream, I can usually filter them out using a grep pipe like | grep memory_total. However if I try to use it along with the sed command, it does not work. No output is produced with this:
heroku logs -t -s heroku | grep memory_total | sed.......
Sometimes grep | sed is necessary, but it is often redundant (unless you are using a grep feature that isn't readily supported by sed, such as Perl regular expressions).
You should be able to use:
sed -n -e '/memory_total=/ s/.*source=\([^ ]*\) .*memory_total=\([^ ]*\) .*/\1: \2/p'
The -n means "don't print by default". The /memory_total=/ matches the lines you're after; the s/// content is the same as before. I removed the g suffix that was there previously; the regex would never match multiple times anyway. I added the p to print the line when the substitution occurs.

sed to replace non-printable character with printable character

Am running BASH and UNIX utilities on Windows 7.
Have a file that contains a vertical tab. The binary symbol is 0x0B. The octal symbol is 013. I need to replace the symbol with a blank space.
Have tried this sed approach but it fails:
sed -e 's/'$(echo "octal-value")'/replace-word/g'
Specifically:
sed -e 's/'$(echo "\013")'/ /g'
Update:
Following this advice I use GNU sed and this approach:
sed -i 's:\0x0B: :g' file
but the stubborn vertical tab is still in the file.
What is the correct way to replace a non-printable character with a printable character?
Sed should recognise special characters:
sed -e 's/\x0b/ /g'
In answer to why the -e? If you use more than one sed expression, then each one must be preceded by the -e. So, for example:
echo foo bar bas zer | sed -e 's/zer/oh my/g' -e 's/bas/baz/'
would result in:
foo bar baz oh my
thus performing 2 different sed changes ('scripts) with only a single invocation. See sed man pages for more details.
(the above example is, obviously, contrived. I, however, have seen a sed command in a script with 78 individual -e 'scripts'!)
If you only have one 'script', then the -e is optional, obviously.

Is there a way to run a tr command (or something like it) in Windows Perl?

I have a line of shell script that I need to recreate the functionality of in Windows Perl?
tr -cd '[:print:]\n\r' | tr -s ' ' | sed -e 's/ $//'
The sed part is easy but not being a shell scripting expert (or perl expert for that matter) I'm having trouble recreating the functionality of the tr command in Perl.
Break down the bits to what each does and convert that to Perl:
tr -cd '[:print:]\n\r'
Take the complement (-c) of printable characters, newlines, and carriage returns and delete them (-d). The tr manpage tells you which each of those do.
tr -s ' '
Collapse multiple same characters (-s) to one character.
sed -e 's/ $//'
Get rid of a trailing space.
Now just do that in whatever language you want to use. Putting it together, you might like something like:
perl -pe 's/\P{PosixPrint}//g; tr/ //s; s/ \z//;'
Note that Perl's tr does not do character classes, but I can use a complemented (\P{...} with the uppercase P) Unicode character class to do the same thing. Perl character classes also understand the regular POSIX character classes.

sed to remove URLs from a file

I am trying to write a sed expression that can remove urls from a file
example
http://samgovephotography.blogspot.com/ updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)
Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB Contributor #kdpartak :)
But I dont get it:
sed 's/[\w \W \s]*http[s]*:\/\/\([\w \W]\)\+[\w \W \s]*/ /g' posFile
FIXED!!!!!
handles almost all cases, even malformed URLs
sed 's/[\w \W \s]*http[s]*[a-zA-Z0-9 : \. \/ ; % " \W]*/ /g' positiveTweets | grep "http" | more
The following removes http:// or https:// and everything up until the next space:
sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g' posFile
updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)
Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N Thx to HMB Contributor #kdpartak :)
Edit:
I should have used:
sed -e 's!http[s]\?://\S*!!g' posFile
"[s]\?" is a far more readable way of writing "an optional s" compared to "\(s\)\{0,1\}"
"\S*" a more readable version of "any non-space characters" than "[^[:space:]]*"
I must have been using the sed that came installed with my Mac at the time I wrote this answer (brew install gnu-sed FTW).
There are better URL regular expressions out there (those that take into account schemes other than HTTP(S), for instance), but this will work for you, given the examples you give. Why complicate things?
The accepted answer provides the approach that I used to remove URLs, etc. from my files. However it left "blank" lines. Here is a solution.
sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' input_file
perl -i -pe 's/^'`echo "\012"`'${2,}//g' input_file
The GNU sed flags, expressions used are:
-i Edit in-place
-e [-e script] --expression=script : basically, add the commands in script
(expression) to the set of commands to be run while processing the input
^ Match start of line
$ Match end of line
? Match one or more of preceding regular expression
{2,} Match 2 or more of preceding regular expression
\S* Any non-space character; alternative to: [^[:space:]]*
However,
sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g'
leaves nonprinting character(s), presumably \n (newlines). Standard sed-based approaches to remove "blank" lines, tabs and spaces, e.g.
sed -i 's/^[ \t]*//; s/[ \t]*$//'
do not work, here: if you do not use a "branch label" to process newlines, you cannot replace them using sed (which reads input one line at a time).
The solution is to use the following perl expression:
perl -i -pe 's/^'`echo "\012"`'${2,}//g'
which uses a shell substitution,
'`echo "\012"`'
to replace an octal value
\012
(i.e., a newline, \n), that occurs 2 or more times,
{2,}
(otherwise we would unwrap all lines), with something else; here:
//
i.e., nothing.
[The second reference below provides a wonderful table of these values!]
The perl flags used are:
-p Places a printing loop around your command,
so that it acts on each line of standard input
-i Edit in-place
-e Allows you to provide the program as an argument,
rather than in a file
References:
perl flags: Perl flags -pe, -pi, -p, -w, -d, -i, -t?
ASCII control codes: https://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/
remove URLs: sed to remove URLs from a file
branch labels: How can I replace a newline (\n) using sed?
GNU sed manual: https://www.gnu.org/software/sed/manual/sed.html
quick regex guide: https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html
Example:
$ cat url_test_input.txt
Some text ...
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file
https://www.google.ca/search?dcr=0&ei=QCsyWtbYF43YjwPpzKyQAQ&q=python+remove++citations&oq=python+remove++citations&gs_l=psy-ab.3...1806.1806.0.2004.1.1.0.0.0.0.61.61.1.1.0....0...1c.1.64.psy-ab..0.0.0....0.-cxpNc6youY
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
http://datasynce.org/2017/05/sentiment-analysis-on-python-through-textblob/
https://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
http://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/alignment_indices/20100804.alignment.index
Some more text.
$ sed -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' url_test_input.txt > a
$ cat a
Some text ...
Some more text.
$ perl -i -pe 's/^'`echo "\012"`'${2,}//g' a
Some text ...
Some more text.
$