Using sed to extract a number value from json - sed

I have a file in which some lines contain a json object on a single line, and I want to extract the value of the window_indicator property.
A normal regular expression is: "window_indicator":\s*([\-\d\.]+) in which I want the value of the first match group.
Here it is working perfectly well: https://regex101.com/r/w9Iuch/1
I've settled on sed because it seems that grep has to print the whole line and can't limit to the match group value, and perl is overkill.
Unfortunately, sed isn't actually capable of doing this, is it?
# sed 's/("window_indicator:)/\1/' in.txt
sed: -e expression #1, char 26: invalid reference \1 on `s' command's RHS
# sed -E 's/("window_indicator":)/\1/p' in.txt
prints out every line of the file
# sed -rn 's/("window_indicator":)/\1/p' in.txt
prints the whole line
# sed -rn 's/("window_indicator":)/\1/' in.txt
nothing

With sed, you need to match the whole line, capture what you need, replace the whole match with Group 1 placeholder, and make sure you suppress the default line output and only print the new text after successful substitution:
sed -nE 's/.*"window_indicator":[[:space:]]*([-0-9.]+).*/\1/p' in.txt
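If only the first match is to be retrieved, quit once a matching line has been printed. The q must sit inside a block guarded by the pattern; a bare ;q at the end of the script would quit after the first input line whether or not it matched:
sed -nE '/"window_indicator":/{s/.*"window_indicator":[[:space:]]*([-0-9.]+).*/\1/p;q;}' in.txt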
Note that \d is not supported in POSIX regex, so it is replaced with the 0-9 range in the bracket expression here.
Details
n - suppress default line output
E - enables POSIX ERE flavor
.*"window_indicator":[[:space:]]*([-0-9.]+).* - finds
.* - any text
"window_indicator": - a fixed string
[[:space:]]* - zero or more whitespaces (GNU sed supports \s, too)
([-0-9.]+) - Group 1: one or more digits, - or .
.* - any text
\1 - replaces with Group 1 value
p - prints the result upon successful replacement
q - quits processing the stream.
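As a rough illustration of the pieces listed above, here is a small sketch run against a made-up three-line file (the sample contents and the variable names all and first are assumptions, not the asker's data). The first-match variant guards q with an address so that it only fires on a matching line.

```shell
# Hypothetical sample input: one non-JSON line and two JSON lines.
tmp=$(mktemp)
printf '%s\n' \
  'plain text line' \
  '{"id": 1, "window_indicator": -0.25}' \
  '{"id": 2, "window_indicator":3}' > "$tmp"

# Every match: replace the whole line with Group 1 and print only on success.
all=$(sed -nE 's/.*"window_indicator":[[:space:]]*([-0-9.]+).*/\1/p' "$tmp")

# First match only: on the first line containing the key, substitute, print, quit.
first=$(sed -nE '/"window_indicator":/{s/.*"window_indicator":[[:space:]]*([-0-9.]+).*/\1/p;q;}' "$tmp")

printf 'all:\n%s\nfirst: %s\n' "$all" "$first"
rm -f "$tmp"
```

The non-JSON line produces no output at all because -n suppresses it and p only fires after a successful substitution.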
With GNU grep, it is even easier:
grep -oP '"window_indicator":\s*\K[-\d.]+' in.txt
To get the first match,
grep -oP '"window_indicator":\s*\K[-\d.]+' in.txt | head -1
Here,
o - outputs matched texts only
P - enables the PCRE regex engine
"window_indicator":\s*\K[-\d.]+ - matches
"window_indicator": - a fixed string
\s* - zero or more whitespaces
\K - removes the text matched so far from the match value
[-\d.]+ - matches one or more -, . or digits.

1st solution: With your shown samples, please try the following awk code, though it is always advisable to use a JSON parser such as jq. In short, awk's match function is used with the regex "window_indicator":[0-9]+} to find the needed value. If the regex matches, the variable val is set to the matched substring of the current line. Then "window_indicator": and } are replaced with an empty string in val, and val is printed, which gives the needed value.
awk '
match($0,/"window_indicator":[0-9]+}/){
val=substr($0,RSTART,RLENGTH)
gsub(/"window_indicator":|}/,"",val)
print val
}
' Input_file
2nd solution: Using GNU grep with a positive lookbehind and a positive lookahead to get the expected output.
grep -oP '(?<="window_indicator":)\d+(?=})' Input_file
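Both solutions above assume an unsigned integer immediately followed by }. If the value may be negative, fractional, or preceded by whitespace (as in the question's own regex), the match/substr idea can be adapted along these lines. This is only a sketch using POSIX awk features; the sample line and the name val are made up for illustration.

```shell
# Hypothetical one-line sample: space after the colon, negative fractional value.
line='{"a": 1, "window_indicator": -12.5, "b": 2}'

val=$(printf '%s\n' "$line" | awk '
  match($0, /"window_indicator":[ \t]*[-0-9.]+/) {
    v = substr($0, RSTART, RLENGTH)            # the key plus its value
    sub(/^"window_indicator":[ \t]*/, "", v)   # drop the key, keep the number
    print v
  }')

echo "$val"
```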

Using sed
$ sed -E 's/.*window_indicator":([0-9]+).*/\1/' input_file
0
Using grep
$ grep -Po '.*window_indicator":\K\d+' input_file
0
Using awk (GNU awk is needed for the three-argument form of match)
$ awk '{match($0,/.*window_indicator":([0-9]+)/,arr);print arr[1]}' input_file
0

Related

How to replace next word after a match in perl or sed expression?

I'm trying to replace doxygen comment from a file with swift comments.
e.g: /// \param foo should become /// - Parameter foo: with foo
So far I have
gsed -i 's/\\param/\- Parameter/g' my_file or perl -pe 's/\\param/\- Parameter/g'
I'd like to replace the following word (foo) after my expression with word: (foo:)
I didn't manage to find a good expression for that. Ideally, something that works on Linux and macOS
In perl you can capture and put back the word with $1 (first parentheses).
s/\\param\s+(.+)/- Parameter $1:/g
.+ will capture the rest of that line. If that is something you don't want, and just want to capture the first word, you can use \S+ or \w+ or whatever other character class that matches the characters you want to capture, e.g. [a-z_-]+.
In sed it is probably \1.
Using sed
$ sed -E 's/\\(param)( [^ ]*)/- \u\1eter\2:/' input_file
/// - Parameter foo:
You could use a pattern with a single capture group, and use that group with \1 in the replacement.
sed -E 's/\\param[[:space:]]+([^[:space:]]+)/- Parameter \1:/g' my_file
The pattern matches:
\\param Match \param
[[:space:]]+ Match 1+ spaces
([^[:space:]]+) Capture group 1, match 1+ non whitespace chars
Output
- Parameter foo:
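Once you have validated the output, you can change sed -E to sed -Ei to do the replacement in place (with BSD/macOS sed, use sed -E -i '' instead, since its -i requires an explicit suffix argument).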
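To see the single-capture-group command on more than one line, here is a small sketch with a made-up comment file (the three sample lines and the name out are assumptions). It prints to stdout and leaves the file untouched.

```shell
# Hypothetical sample: two doxygen-style parameter lines and one ordinary comment.
tmp=$(mktemp)
printf '%s\n' \
  '/// \param foo the first value' \
  '/// \param bar_baz another value' \
  '/// plain comment' > "$tmp"

# \\param matches a literal \param; Group 1 is the word that follows it.
out=$(sed -E 's/\\param[[:space:]]+([^[:space:]]+)/- Parameter \1:/g' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

Lines without \param pass through unchanged, which is what you want when rewriting a whole source file.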

GREP Print Blank Lines For Non-Matches

I want to extract strings between two patterns with GREP, but when no match is found, I would like to print a blank line instead.
Input
This is very new
This is quite old
This is not so new
Desired Output
is very
is not so
I've attempted:
grep -o -P '(?<=This).*?(?=new)'
But this does not preserve the second blank line in the above example. Have searched for over an hour, tried a few things but nothing's worked out.
Will happily use a solution in SED if that's easier!
You can use
#!/bin/bash
s='This is very new
This is quite old
This is not so new'
sed -En 's/.*This(.*)new.*|.*/\1/p' <<< "$s"
See the online demo yielding
is very
is not so
Details:
E - enables POSIX ERE regex syntax
n - suppresses default line output
s/.*This(.*)new.*|.*/\1/ - finds any text, This, any text (captured into Group 1, \1), new, and then any text again, or else the whole string (in sed, the line), and replaces the match with the Group 1 value.
p - prints the result of the substitution.
And this is what you need for your actual data:
sed -En 's/.*"user_ip":"([^"]*).*|.*/\1/p'
See this online demo. The [^"]* matches zero or more chars other than a " char.
With your shown samples, please try following awk code.
awk -F'This\\s+|\\s+new' 'NF==3{print $2;next} NF!=3{print ""}' Input_file
OR
awk -F'This\\s+|\\s+new' 'NF==3{print $2;next} {print ""}' Input_file
Explanation: This\\s+ or \\s+new is set as the field separator for every line of Input_file. In the main program, if NF (the number of fields) is 3, the 2nd field is printed and next skips to the next line. Otherwise (NF is not 3), a blank line is printed.
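A quick way to check the field-separator idea is to feed it the three sample lines. The sketch below swaps \\s for the bracket expression [ \t] on the assumption that not every awk implementation understands \s; the name out is made up for illustration.

```shell
input='This is very new
This is quite old
This is not so new'

# Separators are "This<blanks>" and "<blanks>new", so a matching line splits
# into three fields: empty, the middle text, empty.
out=$(printf '%s\n' "$input" |
  awk -F'This[ \t]+|[ \t]+new' 'NF==3{print $2; next} {print ""}')

printf '%s\n' "$out"
```

Unlike the sed answers, this variant also trims the blanks around the middle text, because they are consumed by the separators.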
sed:
sed -E '
/This.*new/! s/.*//
s/.*This(.*)new.*/\1/
' file
first line: lines not matching "This.*new", remove all characters leaving a blank line
second line: lines matching the pattern, keep only the "middle" text
this is not the pcre non-greedy match: the line
This is new but that is not new
will produce the output
is new but that is not
To continue to use PCRE, use perl:
perl -lpe '$_ = /This(.*?)new/ ? $1 : ""' file
This might work for you:
sed -E 's/.*This(.*)new.*|.*/\1/' file
If the first match is made, the line is replaced by everything between This and new.
Otherwise the second match will remove everything.
N.B. The substitution will always match one of the conditions. The solution was suggested by Wiktor Stribiżew.
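For reference, here is a small sketch that runs the two-command sed form over the sample lines plus the greedy edge case mentioned above (the variable names are made up). Note that the blanks around the middle text are kept, since they sit between This and new.

```shell
input='This is very new
This is quite old
This is not so new
This is new but that is not new'

# Non-matching lines are emptied; matching lines keep only the text between
# the first "This" and the LAST "new" (sed's .* is greedy).
out=$(printf '%s\n' "$input" |
  sed -E '/This.*new/!s/.*//; s/.*This(.*)new.*/\1/')

# Wrap each line in brackets so the blank line and the kept spaces are visible.
printf '%s\n' "$out" | sed 's/.*/[&]/'
```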

replace an exact match for a double underscore with a string using sed

I am trying to replace an exact match for a double underscore with a string.
sed -i 's/\<__\>/.abc.def__/g' file
But this leaves the file unchanged. Grateful for any pointers.
follow up from Sed match exact
If you have no overlapping matches (and your provided input has none), a sed like this will do:
sed -E 's/([^_]|^)__([^_]|$)/\1.abc.def__\2/g' file > newfile
Here, ([^_]|^)__([^_]|$) matches and captures into Group 1 (\1) any char other than _ or start of string (([^_]|^)), then matches __, and then captures into Group 2 (\2) any char other than _ or end of string (([^_]|$)).
If there can be overlapping matches, sed becomes rather difficult to use here. A perfect alternative would be using
perl -pe 's/(?<!_)__(?!_)/.abc.def__/g' file > newfile
perl -i -pe 's/(?<!_)__(?!_)/.abc.def__/g' file
The (?<!_)__(?!_) regex contains two lookarounds, (?<!_) negative lookbehind that makes sure there is no _ char immediately to the left of the current location, and (?!_) negative lookahead makes sure there is no _ char immediately to the right of the current location.
See the online demo:
#!/bin/bash
s='AFM_7499_190512_110136_001_p_EQ4H_1_s60_0012__386___Day_'
sed -E 's/([^_]|^)__([^_]|$)/\1.abc.def__\2/g' <<< "$s"
# => AFM_7499_190512_110136_001_p_EQ4H_1_s60_0012.abc.def__386___Day_
perl -pe 's/(?<!_)__(?!_)/.abc.def__/g' <<< "$s"
# => AFM_7499_190512_110136_001_p_EQ4H_1_s60_0012.abc.def__386___Day_
This might work for you (GNU sed):
sed -E 's/^/\n/;:a;ta;s/\n__($|[^_])/.abc.def__\1\n/;ta;s/\n(_+|[^_]+)/\1\n/;ta;s/\n//' file
Prepend a newline to the current line.
Pattern match through the line using the newline as a delimiter.
If the pattern is matched, replace with the required string and step the delimiter over the replacement.
Otherwise, shift the delimiter along the line and repeat.
At the end of line, remove the introduced newline.
Alternative:
sed -E 's/(^|[^_])__($|[^_])/\1\n\2/g;s//\1\n\2/g;s/\n/.abc.def__/g' file
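To make the overlapping-match problem concrete, here is a sketch on a made-up string where two double underscores share the single character between them. It assumes GNU sed, because \n is used in the replacement; the names naive and two_pass are only for illustration.

```shell
# Hypothetical input: "b" is both the right context of the first __ and the
# left context of the second one; ___ must stay untouched.
s='a__b__c___d'

# Single /g pass: "b" is consumed by the first match, so the second __ is missed.
naive=$(printf '%s\n' "$s" |
  sed -E 's/([^_]|^)__([^_]|$)/\1.abc.def__\2/g')

# Placeholder variant: mark each __ with a newline in two passes (the empty
# regex reuses the previous one), then expand the markers at the end.
two_pass=$(printf '%s\n' "$s" |
  sed -E 's/(^|[^_])__($|[^_])/\1\n\2/g;s//\1\n\2/g;s/\n/.abc.def__/g')

printf 'naive:    %s\ntwo-pass: %s\n' "$naive" "$two_pass"
```

The newline works as a marker because it cannot otherwise occur inside sed's pattern space when lines are read one at a time.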

Substitution of characters limited to part of each input line

Have a file eg. Inventory.conf with lines like:
Int/domain—home.dir=/etc/int
I need to replace / and — before the = but not after.
Result should be:
Int_domain_home_dir=/etc/int
I have tried several sed commands but none seem to fit my need.
Sed with a t loop (BRE):
$ sed ':a;s/[-/—.]\(.*=\)/_\1/;ta;' <<< "Int/domain—home.dir=/etc/int"
Int_domain_home_dir=/etc/int
When one of the -/—. character is found, it's replaced with a _. Following text up to = is captured and output using backreference. If the previous substitution succeeds, the t command loops to label :a to check for further replacements.
Edit:
If you're under BSD/Mac OSX (thanks @mklement0):
sed -e ':a' -e 's/[-/—.]\(.*=\)/_\1/;ta'
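Here is a sketch of the same t loop on an ASCII-only variant of the sample, with a value that also contains the characters being replaced (the input line and the name out are made up). It uses | as the s delimiter so the / inside the bracket expression cannot be mistaken for a delimiter, and separate -e arguments so it should work with both GNU and BSD sed.

```shell
# Hypothetical input: the value after "=" also contains ".", "/" and "-".
line='Int/domain-home.dir=/etc/int.d/x-y'

# Each pass replaces the leftmost of - / . that still has an "=" somewhere
# to its right; "ta" jumps back to :a for as long as a substitution succeeded.
out=$(printf '%s\n' "$line" |
  sed -e ':a' -e 's|[-/.]\(.*=\)|_\1|' -e 'ta')

echo "$out"
```

Characters after the = are never touched, because nothing to their right can satisfy the \(.*=\) part.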
You're asking for a sed solution, but an awk solution is simpler and performs better in this case, because you can easily split the line into 2 fields by = and then selectively apply gsub() to only the 1st field in order to replace the characters of interest:
$ awk -F= '{ gsub("[./-]", "_", $1); print $1 FS $2 }' <<< 'Int/domain-home.dir=/etc/int'
Int_domain_home_dir=/etc/int
-F= tells awk to split the input into fields by =, which with the input at hand results in $1 (1st field) containing the first half of the line, before the =, and $2 (2nd field) the 2nd half, after the =; using the -F option sets variable FS, the input field separator.
gsub("[./-]", "_", $1) globally replaces all characters in set [./-] with _ in $1 - i.e., all occurrences of either ., / or - in the 1st field are replaced with a _ each.
print $1 FS $2 prints the result: the modified 1st field ($1), followed by FS (which is =), followed by the (unmodified) 2nd field ($2).
Note that I've used ASCII char. - (HYPHEN-MINUS, codepoint 0x2d) in the awk script, even though your sample input contains the Unicode char. — (EM DASH, U+2014, UTF-8 encoding 0xe2 0x80 0x94).
If you really want to match that, simply substitute it in the command above, but note that the awk version on macOS won't handle that properly.
Another option is to use iconv with ASCII transliteration, which translates the em dash into a regular ASCII -:
iconv -f utf-8 -t ascii//translit <<< 'Int/domain—home.dir=/etc/int' |
awk -F= '{ gsub("[./-]", "_", $1); print $1 FS $2 }'
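One case the field-splitting commands above do not cover is a value that itself contains =, since only $1 and $2 are printed. If that can happen in your file, a possible variation is to split at the first = by hand. This is a sketch under that assumption; the input line and the names key and out are hypothetical.

```shell
# Hypothetical input whose value contains a second "=".
line='Int/domain-home.dir=/etc/int?a=b.c'

out=$(printf '%s\n' "$line" | awk '{
  i = index($0, "=")              # position of the first "="
  if (i == 0) { print; next }     # no "=": leave the line untouched
  key = substr($0, 1, i - 1)      # everything before the first "="
  gsub("[./-]", "_", key)         # rewrite only that part
  print key substr($0, i)         # append "=" and the unmodified rest
}')

echo "$out"
```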
perl allows for an elegant solution too:
$ perl -F= -ane '$F[0] =~ tr|-/.|_|; print join("=", @F)' <<<'Int/domain-home.dir=/etc/int'
Int_domain_home_dir=/etc/int
-F=, just like with Awk, tells Perl to use = as the separator when splitting lines into fields
-ane activates field splitting (a), turns off implicit output (n), and e tells Perl that the next argument is an expression (command string) to execute.
The fields that each line is split into are stored in array @F, where $F[0] refers to the 1st field.
$F[0] =~ tr|-/.|_| translates (replaces) all occurrences of -, /, and . to _.
print join("=", @F) rebuilds the input line from the fields - with the 1st field now modified - and prints the result.
Depending on the Awk implementation used, this may actually be faster (see below).
That sed isn't the best tool for this job is also reflected in the relative performance of the solutions:
Sample timings from my macOS 10.12 machine (GNU sed 4.2.2, Mawk awk 1.3.4, perl v5.18.2, using input file file, which contains 1 million copies of the sample input line) - take them with a grain of salt, but the ratios of the numbers are of interest; fastest solutions first:
# This answer's awk answer.
# Note: Mawk is much faster here than GNU Awk and BSD Awk.
$ time awk -F= '{ gsub("[./-]", "_", $1); print $1 FS $2 }' file >/dev/null
real 0m0.657s
# This answer's perl solution:
# Note: On macOS, this outperforms the Awk solution when using either
# GNU Awk or BSD Awk.
$ time perl -F= -ane '$F[0] =~ tr|-/.|_|; print join("=", @F)' file >/dev/null
real 0m1.656s
# Sundeep's perl solution with tr///
$ time perl -pe 's#^[^=]+#$&=~tr|/.-|_|r#e' file >/dev/null
real 0m2.370s
# Sundeep's perl solution with s///
$ time perl -pe 's#^[^=]+#$&=~s|[/.-]|_|gr#e' file >/dev/null
real 0m3.540s
# Cyrus' solution.
$ time sed 'h;s/[^=]*//;x;s/=.*//;s/[/.-]/_/g;G;s/\n//' file >/dev/null
real 0m4.090s
# Kenavoz' solution.
# Note: The 3-byte UTF-8 em dash is NOT included in the char. set,
# for consistency of comparison with the other solutions.
# Interestingly, adding the em dash adds another 2 seconds or so.
$ time sed ':a;s/[-/.]\(.*=\)/_\1/;ta' file >/dev/null
real 0m9.036s
As you can see, the awk solution is fastest by far, with the line-internal-loop sed solution predictably performing worst, by a factor of about 14 (9.036s vs. 0.657s).
With GNU sed:
echo 'Int/domain—home.dir=/etc/int' | sed 'h;s/[^=]*//;x;s/=.*//;s/[/—.]/_/g;G;s/\n//'
Output:
Int_domain_home_dir=/etc/int
See: man sed. I assume you want to replace dots too.
If perl solution is okay:
$ echo 'Int/domain-home.dir=/etc/int' | perl -pe 's#^[^=]+#$&=~s|[/.-]|_|gr#e'
Int_domain_home_dir=/etc/int
^[^=]+ string matching from start of line up to but not including the first occurrence of =
$&=~s|[/.-]|_|gr perform another substitution on matched string
replace all / or . or - characters with _
the r modifier would return the modified string
the e modifier allows to use expression instead of string in replacement section
# is used as delimiter to avoid having to escape / inside the character class [/.-]
Also, as suggested by @mklement0, we can use translate instead of inner substitute
$ echo 'Int/domain-home.dir=/etc/int' | perl -pe 's#^[^=]+#$&=~tr|/.-|_|r#e'
Int_domain_home_dir=/etc/int
Note that I've changed sample input, - is used instead of — which is what OP seems to want based on comments

How to use sed-awk-gawk to display a matched string

I've got a file called 'res' that's 29374 characters of http data in a one-line string. Inside it, there are several http links, but I only want to display those that end in '/idNNNNNNNNN' where N is a digit. In fact I'm only interested in the string 'idNNNNNNNNN'.
I've tried with:
cat res | sed -n '0,/.*\(id[0-9]*\).*/s//\1/p'
but I get the whole file.
Do you know a way to do it?
perl -n -E 'say $1 while m!/id(\d{9})!g' input-file
should work. That assumes exactly 9 digits; that's the {9} in the above. You can match 8 or 9 ({8,9}), 8 or more ({8,}), up to 9 ({0,9}), etc.
Example of this working:
$ echo -n 'junk jumk http://foo/id231313 junk lalala http://bar/id23123 asda' | perl -n -E 'say $1 while m!id(\d{0,9})!g'
231313
23123
That's with the 0 to 9 variant, of course.
If you're stuck with a pre-5.10 perl, use -e instead of -E and print "$1\n" instead of say $1.
How it works
First are the two command-line arguments to Perl. -n tells Perl to read input from standard input or files given on the command line, line by line, setting $_ to each line. $_ is perl's default target for a lot of things, including regular expression matches. -E merely tells Perl that the next argument is a Perl one-liner, using the new language features (vs. -e which does not use the 5.10 extensions).
So, looking at the one liner: say means to print out some value, followed by a newline. $1 is the first regular expression capture (captures are made by parentheses in regular expressions). while is a looping construct, which you're probably familiar with. m is the match operator, the ! after it is the regular expression delimiter (normally, you see / here, but since the pattern contains / it's easier to use something else, so you don't have to escape the / as \/). /id(\d{9}) is the regular expression to match. Keep in mind that the delimiter is !, so the / is not special, it just matches a literal /. The parentheses form a capture group, so $1 will be the number. The ! is the delimiter, followed by g which means to match as many times as possible (as opposed to once). This is what makes it pick up all the URLs in the line, not just the first. As long as there is a match, the m operator will return a true value, so the loop will continue (and run that say $1, printing out the match).
Two-sed solution
I think this is one way to do this with only sed. Much more complicated!
echo 'junk jumk http://foo/id231313 junk lalala http://bar/id23123 asda' | \
sed 's!http://!\nhttp://!g' | \
sed 's!^.*/id\([0-9]*\).*$!\1!'
cat res | perl -ne 'chomp; print "$1\n" if m/\/(id\d*)/'
The trouble is that sed and grep and awk work on lines, and you've only got one line. So, you probably need to split things up so you have more than one line -- then you can make the normal tools work.
tr ':' '\012' < res |
sed -n 's%.*/\(id[0-9][0-9]*\).*%\1%p'
This takes advantage of URLs containing colons and maps colons to newlines with tr, then uses sed to pick up anything up to a slash, followed by id and one or more digits, followed by anything, and prints out the id and digit string (only). Since these only occur in URLs, they will only appear one per line and relatively near the start of the line too.
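A small sketch of the split-then-match idea on a made-up one-line string (the contents of res and the name ids are assumptions, and '\n' is used in place of '\012' for readability):

```shell
# Hypothetical single-line input with three URLs; only two end in /id<digits>.
res='junk http://foo/id231313 junk lalala http://bar/id23123 asda http://baz/other'

# tr turns every ":" into a newline, so each URL tail starts its own line;
# sed then prints only the id<digits> part of lines that contain one.
ids=$(printf '%s\n' "$res" |
  tr ':' '\n' |
  sed -n 's%.*/\(id[0-9][0-9]*\).*%\1%p')

printf '%s\n' "$ids"
```

The URL without an id segment produces no output because of -n combined with the p flag.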
Here's a solution using only one invocation of sed:
sed -n 's| |\n|g;/^http/{s|http://[^/]*/id\([0-9]*\)|\1|;P};D' inputfile
Explanation:
s| |\n|g; - Divide and conquer
/^http/{ - If pattern space begins with "http"
s|http://[^/]*/id\([0-9]*\)|\1|; - capture the id
P - Print the string preceding the first newline
}; - end if
D - Delete the string preceding the first newline regardless of whether it contains "http"
Edit:
This version uses the same technique but is more selective.
sed -n 's|http://|\n&|g;/^\n*http/{s|\n*http://[^/]*/id\([0-9]*\)|\1\n|;P};D' inputfile