Xidel: get JSON from HTML tag attribute

I am trying to extract an image URL from a div, where the link to the file is stored as a JSON object in the data-settings attribute:
<div class="c-offerBox_galleryItem">
<div data-component="magnifier" data-component-on="#load" data-settings="{
image: '/media/cache/gallery/rc/p2vgiqwd/images/42/42542/KRHE7Z29X19.jpg',
ratio: 1.5,
outside: 0
}"></div>
</div>
Currently I can access data-settings with:
xidel "https://example.com" -e "//div[#class='c-offerBox_galleryItem']/div/#data-setting
The output is the JSON object. How can I access the image value inside it?
I thought something like:
xidel "https://example.com" -e "//div[#class='c-offerBox_galleryItem']/div/#data-setting/$json/image
would work, but it doesn't.

No, you can only use the global default variable $json when "https://example.com" itself returns JSON.
To parse a string as JSON use parse-json(). In this case you'll need the "liberal" option as well, because the attribute value isn't strictly valid JSON (unquoted keys, single-quoted strings).
xidel -s "https://example.com" -e "//div[@class='c-offerBox_galleryItem']/div/parse-json(@data-settings,{'liberal':true()})"
For Xidel 0.9.8 use json() instead (deprecated in newer builds).
xidel -s "https://example.com" -e "//div[@class='c-offerBox_galleryItem']/div/json(@data-settings)"
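From there, reaching the image value should just be a map lookup on the parsed object (a sketch, assuming a Xidel build with XPath 3.1 lookup support; function-call style ("image") should work as well):
xidel -s "https://example.com" -e "//div[@class='c-offerBox_galleryItem']/div/parse-json(@data-settings,{'liberal':true()})?image"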

Related

How to perform UTF-8 encoding using xmlstarlet fo --encode option?

The synopsis for xmlstarlet fo says
XMLStarlet Toolkit: Format XML document
Usage: xmlstarlet fo [<options>] <xml-file>
where <options> are
-n or --noindent - do not indent
-t or --indent-tab - indent output with tabulation
-s or --indent-spaces <num> - indent output with <num> spaces
-o or --omit-decl - omit xml declaration <?xml version="1.0"?>
--net - allow network access
-R or --recover - try to recover what is parsable
-D or --dropdtd - remove the DOCTYPE of the input docs
-C or --nocdata - replace cdata section with text nodes
-N or --nsclean - remove redundant namespace declarations
-e or --encode <encoding> - output in the given encoding (utf-8, unicode...)
-H or --html - input is HTML
-h or --help - print help
When I run
cat unformatted.html | xmlstarlet fo -H -R --encode utf-8
I am returned the error message
failed to load external entity "utf-8"
In my limited experience, xmlstarlet fo in particular needs the explicit stdin dash (-) to work properly.
In your example, the 'unformatted.html' contents are piped to xmlstarlet.
But xmlstarlet fo doesn't 'see' the piped input if you don't use a - (dash).
It assumes that the last argument (utf-8) is the filename ("external entity") whose contents you're trying to format. Obviously, there's no such file. Just to be on the safe side, I'd also enclose the encoding argument with double quotes, like so: "utf-8".
Altering your statement to
xmlstarlet fo -H -R --encode "utf-8" unformatted.html
should do the trick.
The cat is unnecessary, I'd think.
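That said, if you'd rather keep the pipe, adding the dash should make your original invocation work too:
cat unformatted.html | xmlstarlet fo -H -R --encode "utf-8" -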

Extract data using grep/sed from html tag with special class/id

I need to grep info from a website, where it is stored like:
<div class="name">Mark</div>
<div class="surname">John</div>
<div class="phone">8434</div>
etc.
I tried to grep it and parse it later with sed:
grep -o '<div class="name">.*</div>' | sed -e 's?<div class="name">?|?g'
but when I try to replace with sed -e 's?<\/div><div class="phone">?|?g', there is no result,
and I'd have to do the same thing for every class. I cannot simply delete all HTML tags (sed 's/<[^>]\+>//g'); I need to do it only for divs with these classes.
The output format should be like
|Mark|John|8434|
I need to do it with grep/sed.
Using awk should do the job:
awk -F"[<>]" '{printf "%s|",$3}' file
Mark|John|8434|
If you need a new line at the end:
awk -F"[<>]" '{printf "%s|",$3} END {print ""}' file
It splits each line into fields separated by < or >, then prints the third field with | as the separator.
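Note that your expected output also has a leading |; a small variant of the same command should produce that:
awk -F"[<>]" '{printf "|%s",$3} END {print "|"}' file
|Mark|John|8434|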

Wget: Filenames without the query string

I want to download a list of webpages from a file. How can I stop Wget from appending the query strings to the saved file names?
wget http://www.example.com/index.html?querystring
I need this to be downloaded as index.html, not index.html?querystring
There is the -O option:
wget -O file.html http://www.example.com/index.html?querystring
so you can alter your script a little to pass the right file name to the -O argument.
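Since your URLs come from a file, here is a minimal sketch of that approach (urls.txt is a hypothetical file with one URL per line; ${url%%\?*} drops the query string and basename drops the path):
while IFS= read -r url; do
    # strip the query string, then keep only the file name
    wget -O "$(basename "${url%%\?*}")" "$url"
done < urls.txt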
I've finally resigned to using the -O and just wrapped it in a bash function to make it easier. I put this in my ~/.bashrc file:
wget-rmq ()
{
    [ -z "$1" ] && { echo 'error: wget-rmq requires a URL to retrieve as the first arg'; return 1; }
    # strip the query string, then everything up to the last slash
    local output_filename="$(echo "$1" | sed -e 's/?.*//' -e 's|.*/||')"
    wget -O "${output_filename}" "${1}"
}
Then when I want to download a file:
wget-rmq http://www.example.com/index.html?querystring
The replacement regex is fairly simple. If any ? appears in the URL before the query string begins, it will break. In practice that hasn't happened, though, since URL encoding requires a literal ? elsewhere in a URL to be escaped as %3F, but I wanted to note the possibility.
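If you'd rather avoid the sed subshell entirely, bash parameter expansion can do the same stripping (a sketch of the equivalent logic):
url='http://www.example.com/index.html?querystring'
name="${url%%\?*}"   # drop the query string
name="${name##*/}"   # drop everything up to the last slash
echo "$name"         # index.html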

bash cgi won't return image

We have a monitoring system producing RRD databases. I am looking for the most lightweight way of creating graphs from these RRD files for our HTML pages, so I don't want to store them in files. I am trying to create a simple bash CGI script that will output the image data, so I can do something like this:
<img src="/cgi-bin/graph.cgi?param1=abc">
First of all, I am trying to create a simple CGI script that will send me a PNG image. This doesn't work:
#!/bin/bash
echo -e "Content-type: image/png\n\n"
cat image.png
But when I rewrite this to PERL, it does work:
#!/usr/bin/perl
print "Content-type: image/png\n\n";
open(IMG, "image.png");
print while <IMG>;
close(IMG);
exit 0;
What is the difference? I would really like to do this in BASH. Thank you.
The absence of the -n switch outputs a third newline (echo adds its own trailing newline after your two \n), and that extra byte ends up prepended to the PNG data, corrupting it. It should be
echo -ne "Content-type: image/png\n\n"
or
echo -e "Content-type: image/png\n"
from man echo
-n do not output the trailing newline
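Alternatively, printf sidesteps echo's flag handling entirely and emits exactly what you write (a minimal sketch of the same script):
#!/bin/bash
# one header line, then the blank line that terminates the header block
printf 'Content-type: image/png\n\n'
cat image.png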

Use curl to parse XML, get an image's URL and download it

I want to write a shell script to get an image from an RSS feed.
Right now I have:
curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/ height="400" \/>//' | sed 's/ //g'
I use this to grab the first occurrence of an image URL in the file.
Now I want to put this URL in a variable to use cURL again to download the image.
Any help appreciated! (Also, you might give tips on how to better remove everything from the line with the URL. This is the line:
<img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" />
There's probably some better regex to remove everything except the URL than my solution.)
Thanks in advance!
Using a regexp to parse HTML/XML is a Bad Idea in general. Therefore I'd recommend that you use a proper parser.
If you don't object to using Perl, let Perl do the proper XML or HTML parsing for you using appropriate parser libraries:
HTML
curl -s http://BOGUS.com | perl -e '{use HTML::TokeParser;
$parser = HTML::TokeParser->new(\*STDIN);
$img = $parser->get_tag("img") ;
print "$img->[1]->{src}\n";
}'
/content02/groups/intranetcommon/documents/image/blk_logo.gif
XML
curl -s http://BOGUS.com/whdata0.xml | perl -e '{use XML::Twig;
$twig=XML::Twig->new(twig_handlers =>{img => sub {
print $_[1]->att("src")."\n"; exit 0;}});
open(my $fh, "-");
$twig->parse($fh);
}'
/content02/groups/intranetcommon/documents/image/blk_logo.gif
I used wget instead of curl, but it's just the same:
#!/bin/bash
url='http://www.nichtlustig.de/rss/nichtrss.rss'
wget -O- -q "$url" | awk 'BEGIN{ RS="</a>" }
/<img src=/{
gsub(/.*<img src=\"/,"")
gsub(/\".[^>]*>/,"")
print
}' | xargs -i wget "{}"
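A small aside: GNU xargs documents -i as deprecated in favor of -I, so on newer systems the last stage of that pipeline is better written as:
... | xargs -I{} wget '{}'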
Use a DOM parser and extract all img elements using getElementsByTagName, then add them to a list/array, loop through, and fetch each one separately.
I would suggest using Python, but any language would have a DOM library.
#!/bin/sh
URL=$(curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/ height="400" \/>//' | sed 's/ //g')
curl -C - -O "$URL"
This totally does the job!
Any idea on the regex?
Here's a quick Python solution:
from BeautifulSoup import BeautifulSoup
import sys
soup = BeautifulSoup(sys.stdin.read())
print soup.findAll('img')[0]['src']
Usage:
$ curl http://www.google.com/`curl http://www.google.com | python get_img_src.py`
This works like a charm and will not leave you trying to find the magical regex that will parse random HTML. (Hint: there is no such expression, especially not with a greedy matcher like sed's.)