Parsing curl output on the command line

I have this:
curl -H "api_key:{key}" http://api.wordnik.com/api/word.xml/dog/definitions
How do I parse this (on the command line) so that it returns whatever is between <text> and </text> in this page?

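$ curl ....... | awk -v RS="</text>" '/<text>/{ gsub(/.*<text>/,""); print "->"$0 }'
$ curl ....... | awk 'BEGIN{RS="</text>"} /<text>/{ gsub(/.*<text>/,""); print "->"$0 }'
Note: use GNU awk (gawk). It treats a multi-character RS as a regular expression, whereas POSIX leaves the behavior unspecified.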
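If your awk does not support a multi-character RS, a portable variant might look like the sketch below. It joins the input into one line and splits on </text> with -F, which POSIX awk treats as a regular expression when it is longer than one character. The function name extract_text and the sample XML are made up for illustration; the real response may be shaped differently.

```shell
# extract_text: print whatever sits between <text> and </text> on stdin.
extract_text() {
  tr '\n' ' ' | awk -F'</text>' '{
    # Every field except the last one ended at a </text> tag.
    for (i = 1; i < NF; i++) {
      s = $i
      if (sub(/.*<text>/, "", s)) print s   # drop everything up to <text>
    }
  }'
}

# Hypothetical sample input, standing in for the curl output.
printf '%s\n' '<definitions><definition><text>first meaning</text></definition>' \
  '<definition><text>second meaning</text></definition></definitions>' | extract_text
```

Usage would be curl ....... | extract_text. Because of the tr step, a newline inside a <text> element comes out as a space.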

Related

Change a variable's value in a shell script with sed: syntax error

sed -i 's|from_infura_hex=?|from_infura_hex=$(curl -s -X POST --connect-timeout 5 -H "Content-Type: application/json" --data \'{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}\' https://ropsten.infura.io/X/X | jq .result | xargs)|' /home/ec2-user/LastBlockNode.sh
I tried to execute this command but I always get this error:
-bash: syntax error near unexpected token `)'
The purpose of this command is to replace the placeholder from_infura_hex=? in the script LastBlockNode.sh with the output of the curl command.
Can anyone help with this sed command?
If you choose a pipe character | as the delimiter for the s command,
it must not appear unescaped in the pattern or the replacement. Since your command also uses | for pipelines, it is better to pick another character such as #.
You cannot nest single quotes, even if you escape them with a backslash.
In order to use a command substitution within the replacement,
you need to say sed -i 's/pattern/'"$(command)"'/', not
sed -i 's/pattern/$(command)/', because nothing is expanded inside single quotes.
Then would you please try something like:
sed -i 's#from_infura_hex=?#from_infura_hex='"$(curl -s -X POST --connect-timeout 5 -H "Content-Type: application/json" --data "{\"jsonrpc\":\"2.0\",\"method\":\"eth_blockNumber\",\"params\":[],\"id\":1}" https://ropsten.infura.io/X/X | jq .result | xargs)"'#' /home/ec2-user/LastBlockNode.sh
But it will be safer and more readable to split the command into
multiple lines:
replacement="$(curl -s -X POST --connect-timeout 5 -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' https://ropsten.infura.io/X/X | jq .result | xargs)"
sed -i 's#from_infura_hex=?#from_infura_hex='"$replacement"'#' /home/ec2-user/LastBlockNode.sh
Please note I have not tested the commands above with the actual data.
If either of them still does not work, please let me know and include the error message.
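The quoting pattern can be tried offline with the sketch below. It is only an illustration: fetch_block_number is a made-up stand-in for the curl | jq | xargs pipeline, the value 0x4b7 is invented, and a temporary file plays the role of LastBlockNode.sh. It writes to a temporary file and moves it back instead of using sed -i, just to keep the sketch portable.

```shell
# Stand-in for the curl | jq | xargs pipeline; the value is made up.
fetch_block_number() {
  printf '%s\n' '0x4b7'
}

# set_hex FILE VALUE: replace the placeholder from_infura_hex=? in FILE.
set_hex() {
  # Close the single quotes, splice in the double-quoted value, reopen them.
  sed 's#from_infura_hex=?#from_infura_hex='"$2"'#' "$1" > "$1.tmp" &&
    mv "$1.tmp" "$1"
}

script=$(mktemp)                 # throwaway file instead of the real script
printf 'from_infura_hex=?\n' > "$script"
replacement=$(fetch_block_number)
set_hex "$script" "$replacement"
cat "$script"                    # prints: from_infura_hex=0x4b7
rm -f "$script"
```

A value containing #, & or a backslash would need escaping before being spliced into the replacement.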

search and select decimal numbers in a text file line

I have XML text files which contain lines of multiple numbers (3) separated by tabs/spaces, and I would like to select each number separately.
From:
<tagname1> 110.0912 99.1234 55.1326 </tagname1>
Result:
110.0912
and:
99.1234
and:
55.1326
I would like to use sed, awk, grep, etc. perl is fine too. Seems simple, but can't figure out a cleaner line. I've tried:
more FILENAME | grep tagname1 | grep -E -o "[0-9]+*\.[0-9]+" | head -n 1
perl -MRegexp::Common -nE 's/<.*?>//g; say for /($RE{num}{real})/g' file
You can use grep -o option.
$ cat file
<tagname1> 110.0912 99.1234 55.1326 </tagname1>
$ grep -oE '\b[0-9.]+\b' file
110.0912
99.1234
55.1326
\b matches a word boundary
[0-9.]+ is a bracket expression that matches one or more digits or dots
-o prints only the matched part of each line
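To pick just one of the numbers, the grep -o output can be piped through sed -n, roughly as in the sketch below. The function name nth_number is made up here, and the pattern assumes every number has a decimal point, as in the sample line.

```shell
# nth_number TAG N: print the N-th decimal number on lines containing <TAG>.
nth_number() {
  grep "<$1>" | grep -oE '[0-9]+\.[0-9]+' | sed -n "${2}p"
}

line='<tagname1> 110.0912 99.1234 55.1326 </tagname1>'
printf '%s\n' "$line" | nth_number tagname1 1   # prints 110.0912
printf '%s\n' "$line" | nth_number tagname1 3   # prints 55.1326
```

If several lines contain the tag, the numbering simply continues across lines, so this is only suitable when the tag occurs once.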
awk -v which=2 '/<tagname1>(([0-9]*(\.[0-9]*)?)|[ \t])*<\/tagname1>/ {print $(which+1)}' input.txt
Select which number to print with the variable which; in this example, which=2 prints the second number.
input.txt:
<tagname1> 110.0912 99.1234 55.1326 </tagname1>
You can use awk
awk '{print $2,$3,$4}' OFS="\n" file
110.0912
99.1234
55.1326
$ cat file
<tagname1> 110.0912 99.1234 55.1326 </tagname1>
$ awk -v tag="tagname1" -v nr=1 '$0~"<"tag">"{print $(nr+1)}' file
110.0912
$ awk -v tag="tagname1" -v nr=2 '$0~"<"tag">"{print $(nr+1)}' file
99.1234
$ awk -v tag="tagname1" -v nr=3 '$0~"<"tag">"{print $(nr+1)}' file
55.1326

Parsing HTML on the command line; How to capture text in <strong></strong>?

I'm trying to grab data from HTML output that looks like this:
<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....
I'm using a pipe train to whittle down the data to the targets I'm trying to hit. Here's my approach so far:
grep "/strong" output.html | awk '{print $1}'
Grep on "/strong" to get the lines with the targets; that works fine.
Pipe to awk '{print $1}'. That works in case #1, where the target has no spaces, but fails in case #2, where the target contains spaces: only the first word is preserved, as below:
<strong>Target1NoSpaces</strong><span
<strong>Target2
Do you have any tips on hitting the target properly, either in my awk or in different command? Anything quick and dirty (grep, awk, sed, perl) would be appreciated.
Try pup, a command line tool for processing HTML. For example:
$ pup 'strong text{}' < file.html
Target1NoSpaces
Target2 With Spaces
To search via XPath, try xpup.
Alternatively, for a well-formed HTML/XML document, try html-xml-utils.
One way using mojolicious and its DOM parser:
perl -Mojo -E '
g("http://your.web")
->dom
->find("strong")
->each( sub { if ( $t = shift->text ) { say $t } } )'
You can use Perl-style look-behind and look-ahead in grep (the -P option of GNU grep). It should be simpler than using awk.
grep -oP "(?<=<strong>).*?(?=</strong>)" file
Output:
Target1NoSpaces
Target2 With Spaces
Addendum:
The same look-around regex in Ruby, with the /m flag so that . also matches newlines, can match values that span multiple lines:
ruby -e 'File.read(ARGV.shift).scan(/(?<=<strong>).*?(?=<\/strong>)/m).each{|e| puts "----------"; puts e;}' file
Input:
<strong>Target
A
B
C
</strong><strong>Target D</strong><strong>Target E</strong>
Output:
----------
Target
A
B
C
----------
Target D
----------
Target E
Here's a solution using xmlstarlet
xml sel -t -v //strong input.html
Trying to parse HTML without a real HTML parser is a bad idea. Having said that, here is a very quick and dirty solution to the specific example you provided. It will not work when there is more than one <strong> tag on a line, when the tag runs over more than one line, etc.
awk -F '<strong>|</strong>' '/<strong>/ {print $2}' filename
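For the case with more than one <strong> tag on a line, a still quick and dirty sketch is to let grep -o emit each element on its own line and then strip the tags with sed. The function name strong_text is made up; it assumes the tag has no attributes and that its content has no nested tags and does not span lines.

```shell
# strong_text: print the content of every <strong>...</strong> on stdin.
strong_text() {
  grep -o '<strong>[^<]*</strong>' | sed 's/<[^>]*>//g'
}

printf '%s\n' \
  '<strong>Target1NoSpaces</strong><span class="creator"> ....' \
  '<strong>Target D</strong><strong>Target E</strong>' | strong_text
```

This prints Target1NoSpaces, Target D and Target E on separate lines.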
You never need grep with awk and the field separator doesn't have to be whitespace:
$ awk -F'<|>' '/strong/{print $3}' file
Target1NoSpaces
Target2 With Spaces
You should really use a proper parser for this however.
Since you tagged perl
perl -ne 'if(/(?:<strong>)(.*)(?:<\/strong>)/){print $1."\n";}' input.html
I am surprised no one mentions the W3C HTML-XML-utils:
curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
hxnormalize -x |
hxselect -s '\n' strong
output:
<strong class="fc-black-750 mb6">Stack Overflow
for Teams</strong>
<strong>Teams</strong>
To capture only content:
curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
hxnormalize -x |
hxselect -s '\n' -c strong
Stack Overflow
for Teams
Teams

sed replace text in a XML file

I have huge XML file with data like this:
<amount quantity="1">12.00</amount>
How can I replace the 12.00 with something else using sed?
Not really enough information in your question but to replace all values of 12.00 with say 24.00 you could do:
$ sed 's/>12\.00</>24.00</g' file.xml
If you are happy with the results you can store them back using the -i option:
$ sed -i 's/>12\.00</>24.00</g' file.xml
A more robust match would be:
$ sed -r 's_(<amount quantity="[0-9]+">)12\.00(</amount>)_\124.00\2_g' file.xml
But you should really parse the XML properly and not force regexp to do something it wasn't designed for.
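If the old value is not known in advance, the same idea can be parameterized, roughly as sketched below. The function name set_amount is made up; it assumes each amount element sits on one line and has exactly the quantity attribute shown in the question.

```shell
# set_amount NEW: replace the content of every <amount quantity="N"> element on stdin.
set_amount() {
  # \1 and \2 keep the opening and closing tags; only the text between them changes.
  sed -E 's#(<amount quantity="[0-9]+">)[^<]*(</amount>)#\1'"$1"'\2#g'
}

printf '%s\n' '<amount quantity="1">12.00</amount>' | set_amount 24.00
# prints: <amount quantity="1">24.00</amount>
```

A new value containing #, & or a backslash would have to be escaped first.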
script.sh:
#!/bin/bash
# single quotes keep the inner double quotes intact
xml='<amount quantity="1">12.00</amount>'
newxml=$(echo "$xml" | sed -n "s/\(<amount[^>]*>\)\([^<]*\)\(<\/amount>\)/\113.37\3/gp")
echo "$newxml"
result:
$ ./script.sh
<amount quantity="1">13.37</amount>

Use curl to parse XML, get an image's URL and download it

I want to write a shell script to get an image from an rss feed.
Right now I have:
curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/ height="400" \/>//' | sed 's/ //g'
I use this to grab the first occurrence of an image URL in the file.
Now I want to put this URL in a variable so I can use cURL again to download the image.
Any help appreciated! (Also, you might give tips on how to better remove everything except the URL from the line. This is the line:
<img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" />
There's probably some better regex to remove everything except the URL than my solution.)
Thanks in advance!
Using a regexp to parse HTML/XML is a Bad Idea in general. Therefore I'd recommend that you use a proper parser.
If you don't object to using Perl, let Perl do the proper XML or HTML parsing for you using appropriate parser libraries:
HTML
curl -s http://BOGUS.com | perl -e '{use HTML::TokeParser;
$parser = HTML::TokeParser->new(\*STDIN);
$img = $parser->get_tag('img') ;
print "$img->[1]->{src}\n";
}'
/content02/groups/intranetcommon/documents/image/blk_logo.gif
XML
curl http://BOGUS.com/whdata0.xml | perl -e '{use XML::Twig;
$twig=XML::Twig->new(twig_handlers =>{img => sub {
print $_[1]->att("src")."\n"; exit 0;}});
open(my $fh, "-");
$twig->parse($fh);
}'
/content02/groups/intranetcommon/documents/image/blk_logo.gif
I used wget instead of curl, but it's just the same:
#!/bin/bash
url='http://www.nichtlustig.de/rss/nichtrss.rss'
wget -O- -q "$url" | awk 'BEGIN{ RS="</a>" }
/<img src=/{
gsub(/.*<img src=\"/,"")
gsub(/\".[^>]*>/,"")
print
}' | xargs -I{} wget "{}"
Use a DOM parser and extract all img elements using getElementsByTagName. Then add them to a list/array, loop through and separately fetch them.
I would suggest using Python, but any language would have a DOM library.
#!/bin/sh
URL=$(curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/ height="400" \/>//' | sed 's/ //g')
curl -C - -O "$URL"
This totally does the job!
Any idea on the regex?
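On the regex question, one possible simplification is a single sed substitution that captures whatever is inside src="..." instead of deleting the other attributes one by one. This is only a sketch: the function name first_img_src is made up, and it assumes src is the first attribute of the img tag, as in the line from the question.

```shell
# first_img_src: print the src value of the first <img src="..."> line on stdin.
first_img_src() {
  sed -n 's/.*<img src="\([^"]*\)".*/\1/p' | head -n 1
}

line='<img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" />'
url=$(printf '%s\n' "$line" | first_img_src)
printf '%s\n' "$url"    # prints: http://www.nichtlustig.de/comics/full/100802.jpg
# curl -O "$url"        # uncomment to download; needs network access
```

If a line holds several img tags, the greedy .* makes this pick the last one on that line, which is one more reason to prefer a real parser.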
Here's a quick Python solution:
import sys
from bs4 import BeautifulSoup  # Python 3 with the beautifulsoup4 package
soup = BeautifulSoup(sys.stdin.read(), "html.parser")
print(soup.find_all('img')[0]['src'])
Usage:
$ curl http://www.google.com/`curl http://www.google.com | python get_img_src.py`
This works like a charm and will not leave you trying to find the magical regex that will parse random HTML (Hint: there is no such expression, especially not if you have a greedy matcher like sed.)