grep summarize lines by strings - sed

I have a grep'ed string from curl, and I want to summarize its lines by their content.
e.g.
input
SomethingA v2.3
SomethingA v2.4
SomethingElse v1.1
SomethingElse v1.2
output
SomethingA 2
SomethingElse 2
The numbers in the output are not a must, but would be very nice if they are easy to achieve. The "v" after the space is a fixed prefix for the version number, which does not have to contain a dot.
I tried echo "$str" | grep -Po '(.*(?<=))v[0-9]' but the output still contains the "v1", and I don't know how to collapse lines that share the same leading string.

You can use
$ awk '{print $1}' filename | sort | uniq -c | awk '{print $2,$1}'
Note: This also counts blank lines. To exclude them, use:
$ grep -v '^$' filename | awk '{print $1}' | sort | uniq -c | awk '{print $2,$1}'
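The same count can also be produced in a single awk pass. This is only a sketch: count_names is a hypothetical helper name, the input is assumed to arrive on standard input, and the trailing sort is there because awk's for-in loop has no defined order.

```shell
# Hypothetical helper: count lines by their first field, skipping blank lines.
count_names() {
  awk 'NF { count[$1]++ } END { for (name in count) print name, count[name] }' | sort
}

printf 'SomethingA v2.3\nSomethingA v2.4\nSomethingElse v1.1\nSomethingElse v1.2\n' | count_names
# SomethingA 2
# SomethingElse 2
```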

Using a single sed call to split and grep

This is mostly out of curiosity. I am trying to get the same behavior as:
echo -e "test1:test2:test3"| sed 's/:/\n/g' | grep 1
in a single sed command.
I already tried
echo -e "test1:test2:test3"| sed -e "s/:/\n/g" -n "/1/p"
But I get the following error:
sed: can't read /1/p: No such file or directory
Any idea on how to fix this and combine different types of commands into a single sed call?
Of course, this is oversimplified compared to the real use case, and I know I can work around it with multiple calls; again, this is just out of curiosity.
EDIT: I am mostly interested in the sed tool, I already know how to do it using other tools, or even combinations of those.
EDIT2: Here is a more realistic script, closer to what I am trying to achieve:
arch=linux64
base=https://chromedriver.storage.googleapis.com
split="<Contents>"
curl $base \
| sed -e 's/<Contents>/<Contents>\n/g' \
| grep $arch \
| sed -e 's/^<Key>\(.*\)\/chromedriver.*/\1/' \
| sort -V > out
What I would like to simplify is the curl line, turning it into something like:
curl $base \
| sed 's/<Contents>/<Contents>\n/g' -n '/1/p' -e 's/^<Key>\(.*\)\/chromedriver.*/\1/' \
| sort -V > out
Here are some alternatives, awk and sed based:
sed -E "s/(.*:)?([^:]*1[^:]*).*/\2/" <<< "test1:test2:test3"
awk -v RS=":" '/1/' <<< "test1:test2:test3"
# or also
awk 'BEGIN{RS=":"} /1/' <<< "test1:test2:test3"
Or, using your logic, you would need to pipe a second sed command:
sed "s/:/\n/g" <<< "test1:test2:test3" | sed -n "/1/p"
The awk solution looks cleanest.
Details
In the sed solution, the (.*:)?([^:]*1[^:]*).* pattern matches an optional sequence of any 0+ chars followed by a :, then captures into Group 2 any 0 or more chars other than :, then 1, then again 0 or more chars other than :, and finally matches the rest of the line. The replacement keeps only the Group 2 contents.
In the awk solution, the record separator is set to : and the /1/ regex is then used to return only the records that contain 1.
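As for the error in the question: -n takes no argument, so sed treated /1/p as a file name. A small sketch, assuming GNU sed (where \n in the replacement becomes a newline), with options first and one -e per script fragment. It runs without error, but p prints the whole pattern space, so all three pieces come out rather than only the matching one:

```shell
# GNU sed assumed. -n suppresses auto-print; each -e adds one script fragment.
out=$(printf 'test1:test2:test3\n' | sed -n -e 's/:/\n/g' -e '/1/p')
printf '%s\n' "$out"
# test1
# test2
# test3
```

This is why a pure sed version has to test the pieces of the pattern space one at a time instead of relying on p.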
This might work for you (GNU sed):
sed 's/:/\n/;/^[^\n]*1/P;D' file
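Replace the first remaining : with a newline and, if the first line in the pattern space contains 1, print that line (P).
Then delete it (D) and repeat on what is left, without reading new input, until the pattern space is empty.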
An alternative:
sed -Ez 's/:/\n/g;s/^[^1]*$//mg;s/\n+/\n/g;s/^\n//' file
This slurps the whole file into memory and replaces all colons with newlines. All lines that do not contain 1 are emptied, and the surplus newlines are then squeezed out.
An alternative to the really ugly sed is: grep -o '\w*2\w*'
$ printf "test1:test2:test3\nbob3:bob2:fred2\n" | grep -o '\w*2\w*'
test2
bob2
fred2
grep -o: only matching
Or: grep -o '[^:]*2[^:]*'
echo -e "test1:test2:test3" | sed -En 's/:/\n/g;/^[^\n]*2[^\n]*(\n|$)/P;//!D'
sed -n doesn't print unless told to
sed -E allows using parens to match (\n|$) which is newline or the end of the pattern space
P prints the pattern buffer up to the first newline.
D trims the pattern buffer up to the first newline
[^\n] is a character class that matches anything except a newline
// is the empty regular expression, which in sed reuses the last regular expression that was applied
//!D therefore runs D only when that last regular expression did not match
So, after you split into newlines, you want to make sure the 2 character is between the start of the pattern buffer ^ and the first newline.
And, if there is not the character you are looking for, you want to D delete up to the first newline.
At that point, it works for one line of input, with one string containing the character you're looking for.
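To expand to several matches within a line, run D unconditionally. While the pattern space still contains a newline, D restarts the script without reading new input, so every segment is tested in turn and no label or branch is needed:
$ printf "test1:test2:test3\nbob3:bob2:fred2\n" | \
sed -En 's/:/\n/g;/^[^\n]*2[^\n]*(\n|$)/P;D'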
test2
bob2
fred2
This is simply NOT a job for sed. With GNU awk for multi-char RS:
$ echo "test1:test2:test3:test4:test5:test6"| awk -v RS='[:\n]' '/1/'
test1
$ echo "test1:test2:test3:test4:test5:test6"| awk -v RS='[:\n]' 'NR%2'
test1
test3
test5
$ echo "test1:test2:test3:test4:test5:test6"| awk -v RS='[:\n]' '!(NR%2)'
test2
test4
test6
$ echo "foo1:bar1:foo2:bar2:foo3:bar3" | awk -v RS='[:\n]' '/foo/ || /2/'
foo1
foo2
bar2
foo3
With any awk you'd just have to strip the \n from the final record before operating on it:
$ echo "test1:test2:test3:test4:test5:test6"| awk -v RS=':' '{sub(/\n$/,"")} /1/'
test1

Why is sed returning more characters than requested

In a part of my script I am trying to generate a list of the year and month that a file was submitted. Since the file contains the timestamp, I should be able to cut the filenames to the month position, and then do a sort+uniq filtering. However sed is generating an outlier for one of the files.
I am using this command sequence
ls -1 service*json | sed -e "s|\(.*201...\).*json$|\1|g" | sort |uniq
This works most of the time, but in some cases it outputs more of the timestamp:
$ ls
service-parent-20181119092630.json service-parent-20181123134132.json service-parent-20181202124532.json service-parent-20190121091830.json service-parent-20190125124209.json
service-parent-20181119101003.json service-parent-20181126104300.json service-parent-20181211095939.json service-parent-20190121092453.json service-parent-20190128163539.json
service-parent-20181120095850.json service-parent-20181127083441.json service-parent-20190107035508.json service-parent-20190122093608.json
service-parent-20181120104838.json service-parent-20181129155835.json service-parent-20190107042234.json service-parent-20190122115053.json
$ ls -1 service*json | sed -e "s|\(.*201...\).*json$|\1|g" | sort |uniq
service-parent-201811
service-parent-201811201048
service-parent-201812
service-parent-201901
I have also tried this variation but the second output line is still returned:
ls -1 service*json | sed -e "s|\(.*201.\{3\}\).*json$|\1|g" | sort |uniq
Can somebody explain why service-parent-201811201048 is returned past the requested 3 characters?
Thanks.
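Because .* is greedy, it matches as far to the right as it can. In service-parent-20181120104838.json the last 201 followed by three characters is 201048, so the capture extends to service-parent-201811201048.
You could try ls -1 service*json | sed -e "s|\(.*-201...\).*json$|\1|g" | sort | uniq, which requires a dash - before 201....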
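The difference can be reproduced on the single problematic file name. This is a minimal sketch; the variable names are only illustrative.

```shell
name=service-parent-20181120104838.json

# Greedy .* backs off only as far as the LAST "201" followed by three characters.
greedy=$(printf '%s\n' "$name" | sed 's|\(.*201...\).*json$|\1|')

# Requiring the dash leaves only one place where "-201" can match.
anchored=$(printf '%s\n' "$name" | sed 's|\(.*-201...\).*json$|\1|')

printf '%s\n%s\n' "$greedy" "$anchored"
# service-parent-201811201048
# service-parent-201811
```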
It is not recommended to parse the output of ls. Please try instead:
for i in service*json; do
sed -e "s|^\(service-.*-201[0-9]\{3\}\).*json$|\1|g" <<< "$i"
done | sort | uniq
Your problem is explained at https://stackoverflow.com/a/54565973/1745001 (i.e. .* is greedy) but try this:
$ ls | sed -E 's/(-[0-9]{6}).*/\1/' | sort -u
service-parent-201811
service-parent-201812
service-parent-201901
The above requires a sed that supports EREs via -E, e.g. GNU sed and OSX/BSD sed.

Get a column using sed and modify it

I need to modify columns 5 to 9 directly in each line of a file.
Currently I'm doing this in a while loop, extracting each column line by line.
For example a line looks like:
echo "m.mustermann#muster.com;surnanme;givenname;displayname;1111;2222;3333;44(#44;(5555"
line_9=$(echo $line | awk -F "[;]" '{print $9}' | sed 's/[^0-9+*,]*//g')
Is it possible to do that with "sed -i" instead of awk?
Thanks for any help.
I'm not sure it can be done generally in sed, but you could definitely do it in awk:
… | awk -F";" '{ gsub("[^0-9]*","",$9); print $9 }'
If you really want to do it with sed, the expression will look something like:
… | sed -e 's,\(^[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)\(.*\),\1\2\3\4\5\6\7\8\9,'
For a version with sed (posix) only
line_9="$(echo $line | sed 'H;x;s/^\(.\)\(\([^;]*;\)\{8\}\)\([^;]*\)/\2\1\4\1/;h;s/\(\n\).*\1/\1/;x;s/.*\(\n\)\(.*\)\1.*/\2/;s/[^0-9+*,]*//g;G;s/\(.*\)\(\n\)\(.*\)\2/\3\1/;h;s/.*//;x' )"
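If the goal is to clean columns 5 to 9 while keeping the rest of each line, awk can loop over those fields and print the rebuilt record. This is a sketch under assumptions: clean_columns is a hypothetical helper name, and the character class is the one from the question's sed call. To change a file with it, the output would have to be written to a temporary file and moved over the original.

```shell
# Hypothetical helper: strip everything except digits, +, * and , from fields 5..9.
clean_columns() {
  awk 'BEGIN { FS = OFS = ";" }
       {
         for (i = 5; i <= 9 && i <= NF; i++)
           gsub(/[^0-9+*,]/, "", $i)
         print
       }'
}

printf '%s\n' 'm.mustermann#muster.com;surnanme;givenname;displayname;1111;2222;3333;44(#44;(5555' | clean_columns
# m.mustermann#muster.com;surnanme;givenname;displayname;1111;2222;3333;4444;5555
```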

sed extract digits

I am trying to extract digits with sed:
echo hgdfjg678gfdg kjg45nn | sed 's/.*\([0-9]\+\).*/\1/g'
but result is:
5
How can I extract 678 and 45?
Thanks in advance!
The problem is that the . in .* will match digits as well as non-digits, and it keeps on matching as long as it can -- that is as long as there's one digit left unconsumed that can match the [0-9].
Instead of extracting digits, just delete non-digits:
echo hgdfjg678gfdg kjg45nn | sed 's/[^0-9]//g'
or even
echo hgdfjg678gfdg kjg45nn | tr -d -c 0-9
You may use grep with option -o for this:
$ echo hgdfjg678gfdg kjg45nn | grep -E -o "[0-9]+"
678
45
Or use tr:
$ echo hgdfjg678gfdg kjg45nn | tr -d 'a-z'
678 45
.* in sed is greedy, and there is no non-greedy option AFAIK. (You can use [^0-9]* in place of .* so that the match stops at the first digit. But this works only once, so you will get only 678 and not 45.)
If you must use only sed, it is not easy to get the result.
I recommend using GNU grep:
$ echo hgdfjg678gfdg kjg45nn | grep -oP '\d+'
678
45
If you really want to stick to sed, this would be one of many possible answers.
$ echo hgdfjg678gfdg kjg45nn | \
sed -e 's/\([0-9]\)\([^0-9]\)/\1\n\2/g' | \
sed -n 's/[^0-9]*\([0-9]\+\).*/\1/p'
678
45
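The greedy behaviour discussed in the answers can be checked side by side. In this small sketch the variable names are illustrative, and [0-9][0-9]* is used in place of [0-9]\+ only so that it does not depend on a GNU extension.

```shell
input='hgdfjg678gfdg kjg45nn'

# Leading .* swallows as much as it can and leaves a single digit for the group.
greedy=$(printf '%s\n' "$input" | sed 's/.*\([0-9][0-9]*\).*/\1/')

# Leading [^0-9]* has to stop at the first digit, so the group gets the first number.
first=$(printf '%s\n' "$input" | sed 's/[^0-9]*\([0-9][0-9]*\).*/\1/')

printf '%s\n%s\n' "$greedy" "$first"
# 5
# 678
```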

Match escape sequence for "bold" in console output with grep

Hi, I have lots of logfiles with ^[[1m (as vim displays it) in them. I want to watch a logfile live via
tail -n 1000 -f logfile.log | grep <expression-for-escape-sequence>
and only get lines that have bold in them.
I am not sure which grep options I should use and have tried the following already:
tail -n 1000 -f logfile.log | grep "\033\0133\061\0155"
tail -n 1000 -f logfile.log | grep "\033\01331m"
tail -n 1000 -f logfile.log | grep "\033\[1m"
It does not work though... And yes there are bold lines in the last 1000 lines of logfile.log, testing with
echo -e "\033\01331mTest\033\01330m" | grep ...
same results... ;)
Appreciate any help!
Use single quotes with a dollar sign in front—as in $'...'—to have the shell convert the \033 escape sequence into an ESC character:
tail -n 1000 -f logfile.log | grep $'\033\[1m'
From man bash:
Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard.
This works in any POSIX shell, not only bash (the octal escape \033 is portable in printf, whereas \x1b and echo -e are not):
printf '\033[1mTest\033[0m\n' | grep "$(printf '\033\\[1m')"
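To check the pattern without a real logfile, the ESC byte can be stored in a variable and the filter run on generated input. This is only a sketch: bold_lines and esc are hypothetical names, and the sample input stands in for the logfile.

```shell
# ESC byte produced with a portable octal escape.
esc=$(printf '\033')

# Hypothetical helper: keep only lines containing the "bold" sequence ESC [ 1 m.
bold_lines() {
  grep "${esc}\[1m"
}

# One plain line and one bold line; only the bold one should be counted.
printf 'plain\n\033[1mbold\033[0m\n' | bold_lines | wc -l
```

With the real file, the same filter would sit after tail -n 1000 -f logfile.log.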