I am trying to create a filter command to reduce the lines in a log file. Assume each line contains a partition made of a date:
/iamthepath01/20200301/file01.txt
/iamthepath02/20200302/file02.txt
....
/iamthepathxx/20210619/filexx.txt
Then, from thousands of lines, I only want to keep the ones with either of these two strings in the path:
/202106
/202105
and remove all other lines.
I have tried the following command:
sed -i -e '\(/202105\|/202106\)!d' ~/log.txt
The above command threw:
sed: -e expression #1, char 24: unterminated address regex
You can use
sed -i '/\/20210[56]/!d' ~/log.txt
Or, if you need to use more specific alternatives and further enhance the pattern:
sed -i -E '/\/(202105|202106)/!d' ~/log.txt
Details:
-i - GNU sed option for in-place file editing
-E - option enabling POSIX ERE regex syntax
/\/20210[56]/ - regex that matches /20210 and then either 5 or 6
\/(202105|202106) - the POSIX ERE pattern that matches / and then either 202105 or 202106
!d - removes the lines not matching the pattern.
See the online demo:
#!/bin/bash
s='/iamthepath01/20200301/file01.txt
/iamthepath02/20200302/file02.txt
/iamthepathxx/20210619/filexx.txt'
sed '/\/20210[56]/!d' <<< "$s"
Output:
/iamthepathxx/20210619/filexx.txt
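As for why the original command failed: in sed, a backslash followed by a character at the start of an address selects that character as the regex delimiter (the \cREGEXc form). So \( makes sed look for a closing ( that never appears, which would explain the "unterminated address regex" error. The sketch below uses made-up sample lines (an assumption, modelled on the question) and shows two working spellings side by side:

```shell
# Hypothetical sample lines modelled on the question's log format
log='/iamthepath01/20200301/file01.txt
/iamthepath02/20210512/file02.txt
/iamthepathxx/20210619/filexx.txt'

# ERE alternation inside an ordinary /.../ address
kept_ere=$(printf '%s\n' "$log" | sed -E '/\/(202105|202106)/!d')

# Custom delimiter done properly: \# opens the address and # closes it
kept_custom=$(printf '%s\n' "$log" | sed '\#/20210[56]#!d')

printf '%s\n' "$kept_ere"
```

Both forms keep only the two 2021 lines; the custom delimiter avoids escaping the slash.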
sed is the wrong tool for this. If you want a script that's as fragile as the sed one then use grep as it's the tool that exists solely to do a simple g/re/p (hence the name) like you're doing:
$ grep '/20210[56]' file
/iamthepathxx/20210619/filexx.txt
or if you want a more robust solution that focuses just on the part of the line you want to match and so will avoid false matches, then use awk:
$ awk -F '/' '$3 ~ /^20210[56]/' file
/iamthepathxx/20210619/filexx.txt
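To illustrate the false-match point, here is a small sketch. The first sample line is invented for the purpose (an assumption, not from the question): its date directory is from 2020, but the file name happens to start with 202105.

```shell
# Hypothetical lines: the first has /202105 in the file name, not the date directory
lines='/iamthepath01/20200301/202105_report.txt
/iamthepathxx/20210619/filexx.txt'

# grep matches anywhere in the line, so it counts both lines
grep_hits=$(printf '%s\n' "$lines" | grep -c '/20210[56]')

# awk only looks at the third /-separated field (the date directory)
awk_hits=$(printf '%s\n' "$lines" | awk -F '/' '$3 ~ /^20210[56]/ {n++} END {print n+0}')

printf 'grep: %s, awk: %s\n' "$grep_hits" "$awk_hits"
```

The field-based test assumes the date is always the second path component, as in the question's examples.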
This might work for you (GNU sed):
sed -ni '\#/20210[56]#p' file
This uses sed's grep-like -n option to turn off implicit printing and the -i option to edit the file in place.
Normally sed uses the /.../ to match but other delimiters may be used if the first is escaped e.g. \#...#.
So the above solution will filter the existing file down to lines that contain either /202105 or /202106.
N.B. grep will almost certainly be faster in finding the above lines however the use of the -i option may be the ultimate reason for choosing sed (although the same outcome can be achieved by tacking on the > tmpFile && mv tmpFile file to a grep solution).
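The grep-plus-temporary-file approach mentioned above could look roughly like this. The scratch files created with mktemp are stand-ins for the real log (an assumption for the sake of a self-contained example):

```shell
# Hypothetical scratch file standing in for the real log file
file=$(mktemp)
printf '%s\n' '/iamthepath01/20200301/file01.txt' \
              '/iamthepathxx/20210619/filexx.txt' > "$file"

tmp=$(mktemp)
# mv runs only when grep exits 0, i.e. when at least one line matched
grep '/20210[56]' "$file" > "$tmp" && mv "$tmp" "$file"

result=$(cat "$file")
printf '%s\n' "$result"
rm -f "$file" "$tmp"
```

One design consequence worth knowing: if nothing matches, grep exits non-zero and the original file is left untouched rather than emptied.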
Is there a way in (Gnu) sed to replace all characters in a matching part of a string? For example I might have a list of file paths with several (arbitrary number of) paths in each line, e.g.:
/a/b/c/d/e /f/g/XXX/h/i /j/k/l/m
/n/o/p /q/r/s/t/u /v/x/x/y
/z/XXX/a/b /c/d/e/f
I would like to replace all the slashes in paths containing XXX, keeping all the others untouched, e.g.:
/a/b/c/d/e #f#g#XXX#h#i /j/k/l/m
/n/o/p /q/r/s/t/u /v/x/x/y
#z#XXX#a#b /c/d/e/f
Unfortunately I cannot come up with a solution. Maybe it's even impossible with sed. But I'm curious whether somebody can find a way to solve the problem.
We can replace any / preceding XXX with no intervening spaces like this:
# Using extended regex syntax
s!/([^ ]*XXX)!#\1!
It's a very similar substitution for those that follow XXX.
Putting them together in a loop makes this program:
#!/bin/sed -rf
:loop
s!/([^ ]*XXX)!#\1!
s!(XXX[^ ]*)/!\1#!
tloop
Output:
/a/b/c/d/e #f#g#XXX#h#i /j/k/l/m
/n/o/p /q/r/s/t/u /v/x/x/y
#z#XXX#a#b /c/d/e/f
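For trying the loop without a script file, the same commands can be passed as separate -e expressions. This is a sketch that assumes a sed accepting -E for extended regexes (GNU sed also accepts -r, as in the shebang above):

```shell
input='/a/b/c/d/e /f/g/XXX/h/i /j/k/l/m
/n/o/p /q/r/s/t/u /v/x/x/y
/z/XXX/a/b /c/d/e/f'

# Each pass converts one slash before XXX and one after it;
# t jumps back to :loop for as long as either substitution succeeded.
output=$(printf '%s\n' "$input" | sed -E \
    -e ':loop' \
    -e 's!/([^ ]*XXX)!#\1!' \
    -e 's!(XXX[^ ]*)/!\1#!' \
    -e 'tloop')

printf '%s\n' "$output"
```

Splitting the script across -e arguments keeps the label on its own "line", which avoids relying on how a given sed parses a semicolon after a label.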
That said, it might be simpler to use a pipeline, to break the file paths into individual lines and then reassemble them after the substitution:
sed -e 's/ *$//;s/ \+/&\n/g' \
| sed -e '/XXX/y,/,#,' \
| sed -e ':a;/ $/{N;s/\n//;ba}'
Using sed I want to parse Heroku's log-runtime-metrics like this one:
2016-01-29T00:38:43.662697+00:00 heroku[worker.2]: source=worker.2 dyno=heroku.17664470.d3f28df1-e15f-3452-1234-5fd0e244d46f sample#memory_total=54.01MB sample#memory_rss=54.01MB sample#memory_cache=0.00MB sample#memory_swap=0.00MB sample#memory_pgpgin=17492pages sample#memory_pgpgout=3666pages
the desired output is:
worker.2: 54.01MB (54.01MB is being memory_total)
I could not manage although I tried several alternatives including:
sed -E 's/.+source=(.+) .+memory_total=(.+) .+/\1: \2/g'
What is wrong with my command? How can it be corrected?
The .+ after source= and memory_total= are both greedy, so they accept as much of the line as possible. Use [^ ] to mean "anything except a space" so that it knows where to stop.
sed -E 's/.+source=([^ ]+) .+memory_total=([^ ]+) .+/\1: \2/g'
Putting your content into https://regex101.com/ makes it really obvious what's going on.
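The difference is easy to see locally as well. The sketch below uses a shortened version of the log line (trimmed by hand, so treat it as illustrative data) and runs both patterns:

```shell
# Shortened, hand-trimmed variant of the log line from the question
line='2016-01-29T00:38:43.662697+00:00 heroku[worker.2]: source=worker.2 dyno=heroku.17664470 sample#memory_total=54.01MB sample#memory_rss=54.01MB sample#memory_pgpgout=3666pages'

# Greedy groups: (.+) swallows extra space-separated fields
greedy=$(printf '%s\n' "$line" | sed -E 's/.+source=(.+) .+memory_total=(.+) .+/\1: \2/')

# Bounded groups: [^ ]+ stops at the first space
bounded=$(printf '%s\n' "$line" | sed -E 's/.+source=([^ ]+) .+memory_total=([^ ]+) .+/\1: \2/')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

Only the bounded version produces the short "worker.2: 54.01MB" form.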
I'd go for the old-fashioned, reliable, non-extended sed expressions and make sure that the patterns are not too greedy:
sed -e 's/.*source=\([^ ]*\) .*memory_total=\([^ ]*\) .*/\1: \2/'
The -e is not the opposite of -E, which is primarily a Mac OS X (BSD) sed option; the normal option for GNU sed is -r instead. The -e simply means that the next argument is an expression in the script.
This produces your desired output from the given line of data:
worker.2: 54.01MB
Bonus question: There are some odd lines within the stream, I can usually filter them out using a grep pipe like | grep memory_total. However if I try to use it along with the sed command, it does not work. No output is produced with this:
heroku logs -t -s heroku | grep memory_total | sed.......
Sometimes grep | sed is necessary, but it is often redundant (unless you are using a grep feature that isn't readily supported by sed, such as Perl regular expressions).
You should be able to use:
sed -n -e '/memory_total=/ s/.*source=\([^ ]*\) .*memory_total=\([^ ]*\) .*/\1: \2/p'
The -n means "don't print by default". The /memory_total=/ matches the lines you're after; the s/// content is the same as before. I removed the g suffix that was there previously; the regex would never match multiple times anyway. I added the p to print the line when the substitution occurs.
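As a quick check of the address-plus-p form, here is a sketch with a two-line stand-in for the log stream (both lines are invented sample data, an assumption for illustration):

```shell
# Hypothetical stream: one odd line and one metrics line
stream='2016-01-29T00:38:40 heroku[router]: some unrelated message
2016-01-29T00:38:43 heroku[worker.2]: source=worker.2 dyno=heroku.1 sample#memory_total=54.01MB sample#memory_rss=54.01MB'

# -n suppresses default output; the /memory_total=/ address selects lines,
# and the trailing p prints only lines where the substitution happened.
summary=$(printf '%s\n' "$stream" | sed -n -e '/memory_total=/ s/.*source=\([^ ]*\) .*memory_total=\([^ ]*\) .*/\1: \2/p')

printf '%s\n' "$summary"
```

The unrelated line is dropped without any grep stage in the pipeline.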
I am trying to write a sed expression that can remove urls from a file
example
http://samgovephotography.blogspot.com/ updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)
Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB Contributor #kdpartak :)
But I don't get it:
sed 's/[\w \W \s]*http[s]*:\/\/\([\w \W]\)\+[\w \W \s]*/ /g' posFile
FIXED!!!!!
handles almost all cases, even malformed URLs
sed 's/[\w \W \s]*http[s]*[a-zA-Z0-9 : \. \/ ; % " \W]*/ /g' positiveTweets | grep "http" | more
The following removes http:// or https:// and everything up until the next space:
sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g' posFile
updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)
Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N Thx to HMB Contributor #kdpartak :)
Edit:
I should have used:
sed -e 's!http[s]\?://\S*!!g' posFile
"[s]\?" is a far more readable way of writing "an optional s" compared to "\(s\)\{0,1\}"
"\S*" a more readable version of "any non-space characters" than "[^[:space:]]*"
I must have been using the sed that came installed with my Mac at the time I wrote this answer (brew install gnu-sed FTW).
There are better URL regular expressions out there (those that take into account schemes other than HTTP(S), for instance), but this will work for you, given the examples you give. Why complicate things?
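For reference, a sketch of the portable spelling applied to one sample line. The text is a shortened variant of the second example, and https\{0,1\} is used here as an equivalent way to write "http with an optional s" in basic regex syntax:

```shell
# Shortened variant of the question's second example line
tweet='Meet the author at B&N https://hollywoodmomblog.com/?p=2442 Thx :)'

# Remove http:// or https:// plus everything up to the next whitespace
stripped=$(printf '%s\n' "$tweet" | sed -e 's!https\{0,1\}://[^[:space:]]*!!g')

printf '%s\n' "$stripped"
```

Note that the spaces on either side of the URL both survive, so the result contains a double space where the URL used to be.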
The accepted answer provides the approach that I used to remove URLs, etc. from my files. However it left "blank" lines. Here is a solution.
sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' input_file
perl -i -pe 's/^'`echo "\012"`'${2,}//g' input_file
The GNU sed flags, expressions used are:
-i Edit in-place
-e [-e script] --expression=script : basically, add the commands in script
(expression) to the set of commands to be run while processing the input
^ Match start of line
$ Match end of line
\? Match zero or one of the preceding regular expression
{2,} Match 2 or more of preceding regular expression
\S* Zero or more non-space characters; alternative to: [^[:space:]]*
However,
sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g'
leaves nonprinting character(s), presumably \n (newlines). Standard sed-based approaches to remove "blank" lines, tabs and spaces, e.g.
sed -i 's/^[ \t]*//; s/[ \t]*$//'
do not work, here: if you do not use a "branch label" to process newlines, you cannot replace them using sed (which reads input one line at a time).
The solution is to use the following perl expression:
perl -i -pe 's/^'`echo "\012"`'${2,}//g'
which uses a shell substitution,
'`echo "\012"`'
to replace an octal value
\012
(i.e., a newline, \n), that occurs 2 or more times,
{2,}
(otherwise we would unwrap all lines), with something else; here:
//
i.e., nothing.
[The second reference below provides a wonderful table of these values!]
The perl flags used are:
-p Places a printing loop around your command,
so that it acts on each line of standard input
-i Edit in-place
-e Allows you to provide the program as an argument,
rather than in a file
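As a possible alternative to the perl step, sed itself can delete lines that end up empty, because deleting a whole line with d does not require matching the newline character. This is a sketch under the assumption that dropping every blank line (including ones that were blank in the original input) is acceptable:

```shell
input='Some text ...
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file
ftp://ftp.ncbi.nlm.nih.gov/
Some more text.'

# Strip URLs first, then delete any line left empty or whitespace-only
cleaned=$(printf '%s\n' "$input" | sed \
    -e 's!https\{0,1\}://[^[:space:]]*!!g' \
    -e 's!ftp:[^[:space:]]*!!g' \
    -e '/^[[:space:]]*$/d')

printf '%s\n' "$cleaned"
```

The POSIX classes are used instead of \S and \s so the sketch does not depend on GNU extensions.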
References:
perl flags: Perl flags -pe, -pi, -p, -w, -d, -i, -t?
ASCII control codes: https://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/
remove URLs: sed to remove URLs from a file
branch labels: How can I replace a newline (\n) using sed?
GNU sed manual: https://www.gnu.org/software/sed/manual/sed.html
quick regex guide: https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html
Example:
$ cat url_test_input.txt
Some text ...
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file
https://www.google.ca/search?dcr=0&ei=QCsyWtbYF43YjwPpzKyQAQ&q=python+remove++citations&oq=python+remove++citations&gs_l=psy-ab.3...1806.1806.0.2004.1.1.0.0.0.0.61.61.1.1.0....0...1c.1.64.psy-ab..0.0.0....0.-cxpNc6youY
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
http://datasynce.org/2017/05/sentiment-analysis-on-python-through-textblob/
https://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
http://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/alignment_indices/20100804.alignment.index
Some more text.
$ sed -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' url_test_input.txt > a
$ cat a
Some text ...
Some more text.
$ perl -i -pe 's/^'`echo "\012"`'${2,}//g' a
Some text ...
Some more text.
$
How can I remove comment lines (such as # bla bla) and empty lines (lines without characters) from a file with one sed command?
THX
lidia
If you're worried about starting two sed processes in a pipeline for performance reasons, you probably shouldn't be, it's still very efficient. But based on your comment that you want to do in-place editing, you can still do that with distinct commands (sed commands rather than invocations of sed itself).
You can either use multiple -e arguments or separate commands with a semicolon, something like (just one of these, not both):
sed -i -e 's/#.*$//' -e '/^$/d' fileName
sed -i 's/#.*$//;/^$/d' fileName
The following transcript shows this in action:
pax> printf 'Line # with a comment\n\n# Line with only a comment\n' >file
pax> cat file
Line # with a comment
# Line with only a comment
pax> cp file filex ; sed -i 's/#.*$//;/^$/d' filex ; cat filex
Line
pax> cp file filex ; sed -i -e 's/#.*$//' -e '/^$/d' filex ; cat filex
Line
Note how the file is modified in-place even with two -e options. You can see that both commands are executed on each line. The line with a comment first has the comment removed then all is removed because it's empty.
In addition, the original empty line is also removed.
@paxdiablo has a good answer but it can be improved.
(1) The '/^$/d' clause only matches 100% blank lines.
If you want to also match lines that are entirely whitespace (spaces, tabs etc.) use this instead:
'/^\s*$/d'
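(2) The 's/#.*$//' clause strips everything from the first # to the end of the line, wherever the # appears, so a comment line with whitespace before the # is left as a whitespace-only line rather than an empty one.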
If you want to also match lines that have only whitespace before the first # use this instead:
'/^\s*#.*$/d'
The above criteria may not be universal (e.g. within a HEREDOC block, or in a Python multi-line string the different approaches could be significant), but in many cases the conventional definition of "blank" lines include whitespace-only, and "comment" lines include whitespace-then-#.
(3) Lastly, on OSX at least, the @paxdiablo solution, in which the first clause turns comment lines into blank lines and the second clause strips blank lines (including what were originally comments), doesn't work. It seems to be more portable to make both clauses /d delete actions, as I've done.
The revised command incorporating the above is:
sed -e '/^\s*#.*$/d' -e '/^\s*$/d' inputFile
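Since \s is a GNU extension, a variant using the POSIX class [[:space:]] may travel better between sed implementations. The sketch below uses an invented config snippet (an assumption) to show which lines survive:

```shell
# Hypothetical config: an indented comment, a tab-only line, an empty line,
# a top-level comment, and a setting with a trailing comment
conf=$(printf 'key=value\n   # indented comment\n\t\n\n# top comment\nother=1 # trailing\n')

# Delete whole-line comments and blank/whitespace-only lines
cleaned=$(printf '%s\n' "$conf" | sed -e '/^[[:space:]]*#/d' -e '/^[[:space:]]*$/d')

printf '%s\n' "$cleaned"
```

Trailing comments are deliberately left alone here, since only whole-line comments are deleted.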
This tiny jewel removes all # comments, no matter where they begin in a line (see caution below):
sed -e 's/\s*#.*$//'
Example:
text="
this is a # test
#this is a test
#this is a #test
this is # another #test
"
$ echo "$text" | sed -e 's/\s*#.*$//'
this is a
this is
Next this removes any resulting blank lines:
$ echo "$text" | sed -e 's/\s*#.*$//' | sed -e '/^\s*$/d'
Caution: Depending on the syntax and/or interpretation of the lines you're processing, this might not be an appropriate solution, as it blindly removes everything from the first '#' to the end of the line, even if the '#' is part of your data or code. However, for use cases where a hash only ever appears as the start of an end-of-line comment, it works fine. So, just as with all coding, context must be taken into consideration.
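A small sketch of the failure mode described in the caution, using an invented line where # is part of the data:

```shell
# Hypothetical line: the first # belongs to a quoted colour value
setting='color="#fff"  # white'

# The substitution cuts at the first #, not at the real comment
truncated=$(printf '%s\n' "$setting" | sed -e 's/[[:space:]]*#.*$//')

printf '%s\n' "$truncated"
```

The quoted value is cut in half, which is why this approach only suits files where # never appears outside comments.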
Alternative variant, using grep:
cat file.txt | grep -Ev '(#.*$)|(^$)'
You can use awk to skip empty lines and lines whose first non-blank character is #:
awk 'NF && !/^[ \t]*#/' file
The first example (paxdiablo's) is very good, except that it does not change the file; it just prints the result. If you want to change the file in place:
sudo sed -i 's/#.*$//;/^$/d' inputFile
On (one of) my linux boxes, sed understands extended regular expressions with the -r option, so:
sed -r '/(^\s*#)|(^\s*$)/d' squid.conf.installed
is very useful for showing all non-blank, non comment lines.
The regex matches either start of line followed by zero or more spaces or tabs followed by either a hash or end of line, and deletes those matching lines from the input.
How can I make sed filter matching lines according to some expression, but ignore non-matching lines, instead of letting them print?
As a real example, I want to run scalac (the Scala compiler) on a set of files, and read from its -verbose output the .class files created. scalac -verbose outputs a bunch of messages, but we're only interested in those of the form [wrote some-class-name.class].
What I'm currently doing is this (|& is bash 4.0's shorthand for 2>&1 |, which pipes both stdout and stderr to the next program):
$ scalac -verbose some-file.scala ... |& sed 's/^\[wrote \(.*\.class\)\]$/\1/'
This will extract the file names from the messages we're interested in, but will also let all other messages pass through unchanged! Of course we could instead do this:
$ scalac -verbose some-file.scala ... |& grep '^\[wrote .*\.class\]$' |
sed 's/^\[wrote \(.*\.class\)\]$/\1/'
which works but looks very much like going around the real problem, which is how to instruct sed to ignore non-matching lines from the input. So how do we do that?
If you don't want to print lines that don't match, you can use the combination of
-n option which tells sed not to print
p flag which tells sed to print what is matched
This gives:
sed -n 's/.../.../p'
Another way with plain sed:
sed -e 's/.../.../;t;d'
s/// is a substitution, t without any label conditionally skips all following commands (when the substitution succeeded), and d deletes the line.
No need for perl or grep.
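Applied to the scalac case from the question, the two approaches could be sketched as follows. The compiler messages are simulated (an assumption, not real scalac output), and the t/d variant is split into separate -e expressions so it does not depend on how a particular sed parses t followed by a semicolon:

```shell
# Simulated -verbose messages; only the [wrote ...] lines are of interest
messages='[parsing some-file.scala]
[wrote Foo.class]
[total in 120ms]
[wrote Foo$.class]'

# Variant 1: suppress default output, print only substituted lines
with_n=$(printf '%s\n' "$messages" | sed -n 's/^\[wrote \(.*\.class\)\]$/\1/p')

# Variant 2: t skips the d when the substitution succeeded
with_t=$(printf '%s\n' "$messages" | sed -e 's/^\[wrote \(.*\.class\)\]$/\1/' -e t -e d)

printf '%s\n' "$with_n"
```

Both variants should yield the same two class file names.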
(edited after Nicholas Riley's suggestion)
Rapsey raised a relevant point about multiple substitutions expressions.
First, quoting an Unix SE answer, you can "prefix most sed commands with an address to limit the lines to which they apply".
Second, you can group commands within curly braces {} (separated with a semi-colon ; or a new line)
Third, add the print flag p on the last substitution
Syntax:
sed -n -e '/^given_regexp/ {s/regexp1/replacement1/flags1;[...];s/regexpN/replacementN/flagsNp}'
Example (see Here document for more details):
Code:
sed -n -e '/^ha/ {s/h/k/g;s/a/e/gp}' <<SAMPLE
haha
hihi
SAMPLE
Result:
keke
sed -n '/.../!p'
There is no need for a substitution.