I need to extract all the text from a website, grouped by div and class.
I'd like to keep this tool generic so I can use it with different websites.
The piece of code below works fine, but I don't know how to get into the children elements.
from bs4 import BeautifulSoup
import requests
url = 'xxx'
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
div = soup.find_all("div")
classes = [value
           for element in soup.find_all(class_=True)
           for value in element["class"]]

for class_el in classes:
    try:
        div = soup.find('div', {"class": class_el})
        text = div.text
        print("")
        print("=============================")
        print(class_el)
        print("")
        print(text)
    except:
        print("error")
If I understand you correctly, this should get you the text, if any, from each <div> element in the soup whose class appears in the classes list.
As an aside, it's not a good idea to name your variables div, etc., so I changed that part a bit:
for class_el in classes:
    target = soup.find('div', {"class": class_el})
    if target is not None and len(target.text.strip()) > 0:
        print(target.text.strip())
        print('=============')
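If you want every matching <div>, not just the first per class, here is a minimal sketch along the same lines (assuming the soup and classes objects from above; the inner loop shows one way to get into the children elements):

for class_el in dict.fromkeys(classes):  # deduplicate while keeping order
    # find_all returns every <div> with this class, not just the first
    for target in soup.find_all('div', class_=class_el):
        text = target.text.strip()
        if text:
            print(class_el)
            print(text)
            print('=============')
        # descend into the direct child elements if you need their tags too
        for child in target.find_all(recursive=False):
            print('  child tag:', child.name)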
Situation description:
Python 3.7, GTK 3.0, PyGObject 3.34.0, WebKit2 4.0
I have a dialog window with a GtkNotebook containing two tabs.
The first tab contains an editable WebKit webview, the second tab contains a textview. One of the arguments provided to the class constructor is a valid HTML snippet as a string variable.
What I would like as a result is that any change made in one view is automatically reflected in the other.
Current problem:
Using the solution provided here, any previous changes made in the webview are discarded upon switching the notebook tabs. Debugging shows that the HTML obtained with the aforementioned call does not contain the changes.
Any ideas what might be missing in the logic or handling itself?
For reference, the code for the dialog is as follows:
# -*- coding: utf-8 -*-
import gi
gi.require_version('Gtk', '3.0')
gi.require_version('WebKit2', '4.0')
from gi.repository import Gtk, WebKit2


class DescriptionDialog:
    def __init__(self, *args):
        # GTK Builder
        self._builder = args[0]
        self._builder.add_from_file("UI/GUI/description.glade")
        self.dialog = self._builder.get_object("descriptionDialog")
        self._textView = self._builder.get_object("textview1")
        self.webViewContainer = self._builder.get_object("WebViewContainer")
        self.browserHolder = WebKit2.WebView()
        self.browserHolder.set_editable(True)
        self.webViewContainer.add(self.browserHolder)
        self.browserHolder.show()
        # valid html snippet, held as string
        self.__buffer_orig__ = args[2]
        self.buffer = args[2]
        self.browserHolder.load_html(self.buffer)
        self._builder.connect_signals(
            {
                "onDialogClose": self.onDialogClose,
                "pageChangeNotebook": self.onPageChange
            })
        self.dialog.set_transient_for(self._builder.get_object("MainWindow"))
        self.textBuffer = self._builder.get_object("textbuffer1")
        self.textBuffer.set_text(self.buffer)
        self.dialog.show()

    def onDialogClose(self, handler):
        self.dialog.hide()

    def onPageChange(self, notebook=None, scrolledWindow=None, pageNumber=0):
        if pageNumber == 0:
            self.buffer = self.textBuffer.get_text(
                self.textBuffer.get_start_iter(),
                self.textBuffer.get_end_iter(), True)
            self.browserHolder.load_html(self.buffer)
        if pageNumber == 1:
            self.browserHolder.get_main_resource().get_data(
                None, self.getDataFromResource, None)

    def getDataFromResource(self, resource, result, userData=None):
        # Changed html is not returned here
        self.buffer = str(resource.get_data_finish(result).decode("utf-8"))
        self.textBuffer.set_text(self.buffer)
For other internet users finding this thread: currently, at the given versions, this is the working solution I have come up with.
The main idea of this implementation is as follows: use the WebView's JavaScript engine to obtain the contents of the <body> tag, then parse the JavaScript result to use this value later on.
def onPageChange(self, notebook=None, scrolledWindow=None, pageNumber=0):
    if pageNumber == 0:
        self.buffer = self.textBuffer.get_text(
            self.textBuffer.get_start_iter(),
            self.textBuffer.get_end_iter(), True)
        self.browserHolder.load_html(self.buffer)
    if pageNumber == 1:
        # use JavaScript to get the html contained in the rendered body tag
        script = "document.body.innerHTML;"
        # Execute JavaScript via the WebKit2.WebView bindings.
        # The result is obtained asynchronously, via a callback method.
        self.browserHolder.run_javascript(script, None, self.getJSStatus, None)

def getJSStatus(self, resource, result, userData=None):
    # Sample adapted and simplified from here:
    # https://lazka.github.io/pgi-docs/#WebKit2-4.0/classes/WebView.html#WebKit2.WebView.run_javascript_finish
    # Get the JavaScript result
    data = self.browserHolder.run_javascript_finish(result)
    # Get the value from the result and convert it to string
    self.buffer = data.get_js_value().to_string()
    self.textBuffer.set_text(self.buffer)
I'm using metadata-extractor to retrieve metadata from video files. I have it successfully retrieving the directories. Now I need to query the directories for specific info -- duration, height, etc.
The metadata-extractor docs give this example of how to query for a specific tag value:
// obtain the Exif directory
ExifSubIFDDirectory directory
    = metadata.getFirstDirectoryOfType(ExifSubIFDDirectory.class);

// query the tag's value
Date date
    = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
So it appears I need to get a list of the relevant tags, such as TAG_DATETIME_ORIGINAL, for duration, height, etc.
This page in the metadata-extractor docs contains a link titled "the various tag values", but the page it goes to lists tags for still images only, not for video files.
Googling for Metadata-Extractor -- Complete List of All Tags does not seem to bring up a list of all tags.
Are the metadata-extractor docs really missing a list of tags, or am I approaching this the wrong way somehow?
I found a list of tags at:
https://developer.tizen.org/dev-guide/2.3.1/org.tizen.guides/html/native/multimedia/metadata_extractor_n.htm
However, those constants don't seem to be what's needed in actual code. Here's Java code that works:
import com.drew.imaging.ImageMetadataReader;
import com.drew.metadata.Directory;
import com.drew.metadata.Metadata;
import com.drew.metadata.Tag;
import com.drew.metadata.file.FileTypeDirectory;
import com.drew.metadata.mp4.Mp4Directory;
import com.drew.metadata.mp4.media.Mp4SoundDirectory;
import com.drew.metadata.mp4.media.Mp4VideoDirectory;
[.....]
Metadata theMetadata = null;
try {
    InputStream stream = new URL(theVideoInfo.getLinkToVideo()).openStream();
    theMetadata = ImageMetadataReader.readMetadata(stream);
} catch (java.lang.Exception exception) {
    exception.printStackTrace();
}
Mp4SoundDirectory soundDirectory
    = theMetadata.getFirstDirectoryOfType(Mp4SoundDirectory.class);
Mp4VideoDirectory videoDirectory
    = theMetadata.getFirstDirectoryOfType(Mp4VideoDirectory.class);
Mp4Directory mp4Directory
    = theMetadata.getFirstDirectoryOfType(Mp4Directory.class);
FileTypeDirectory fileTypeDirectory
    = theMetadata.getFirstDirectoryOfType(FileTypeDirectory.class);

String numberOfAudioChannels
    = soundDirectory.getString(Mp4SoundDirectory.TAG_NUMBER_OF_CHANNELS);
String duration = mp4Directory.getString(Mp4Directory.TAG_DURATION);
String frameRate = videoDirectory.getString(Mp4VideoDirectory.TAG_FRAME_RATE);
String height = videoDirectory.getString(Mp4VideoDirectory.TAG_HEIGHT);
String width = videoDirectory.getString(Mp4VideoDirectory.TAG_WIDTH);
String type = fileTypeDirectory.getString(FileTypeDirectory.TAG_DETECTED_FILE_MIME_TYPE);
I found the constants (TAG_HEIGHT, TAG_WIDTH, etc.) by directly examining the metadata-extractor objects in the debugger. For example, I'd type:
Mp4VideoDirectory.WIDTH
...and the debugger (IntelliJ) would auto-complete the available constants that had the text "WIDTH" in them.
I am using the textimage component and have added "paraformat" to rtePlugins.
When I provide the values in the dialog, they get saved in the "text" property: <h1>heading1</h1><h2>Heading2</h2><h3>Heading3</h3><p>Paragraph</p>
Now I want to get the individual values, e.g. heading1, Heading2, etc.
I can apply some string functions, but does anyone know a direct method to get these values?
Here is a text component example; I hope it helps. (It is not recommended to parse HTML with regex, but to give you some idea I am posting the answer below.)
Created a component with rtePlugins as shown below
text.jsp code
<%@page session="false"%>
<%@ page import="com.day.cq.wcm.foundation.Placeholder,java.util.regex.Matcher,java.util.regex.Pattern" %>
<%@include file="/libs/foundation/global.jsp"%><%
%><cq:text property="text" escapeXml="true"
    placeholder="<%= Placeholder.getDefaultPlaceholder(slingRequest, component, null)%>"/>
<!-- Below code demonstrates the retrieval of paraformat values -->
<% String defaultMessage = properties.get("text", "");
// the patterns we want to search for
Pattern p = Pattern.compile("<p>(.*?)</p>");
Pattern h1 = Pattern.compile("<h1>(.*?)</h1>");
Pattern h2 = Pattern.compile("<h2>(.*?)</h2>");
Pattern h3 = Pattern.compile("<h3>(.*?)</h3>");
Matcher mp = p.matcher(defaultMessage);
Matcher mh1 = h1.matcher(defaultMessage);
Matcher mh2 = h2.matcher(defaultMessage);
Matcher mh3 = h3.matcher(defaultMessage);
// if we find a match, get the matching group and print it
if (mp.find()) {
    String parastring = mp.group(1);
    out.println("tag P : " + parastring + "<br/>");
}
if (mh1.find()) {
    String h1string = mh1.group(1);
    out.println("tag h1 : " + h1string + "<br/>");
}
if (mh2.find()) {
    String h2string = mh2.group(1);
    out.println("tag h2 : " + h2string + "<br/>");
}
if (mh3.find()) {
    String h3string = mh3.group(1);
    out.println("tag h3 : " + h3string + "<br/>");
}
%>
Drag and drop the component onto a page and edit the RTE.
Finally, in the page you can retrieve the text using the java.util.regex package, as shown in the JSP code (which is not optimized; it is just to give some understanding of how to satisfy the requirement you are looking for).
I've been trying to scrape different pages. First, I scrape a URL from the first page using xpath('@href') in the parse function. Then I try to scrape the article at that URL in the request callback of parse. But it doesn't work.
How can I solve this issue? Here is my code:
import scrapy
from string import join
from article.items import ArticleItem

class ArticleSpider(scrapy.Spider):
    name = "article"
    allowed_domains = ["http://joongang.joins.com"]
    j_classifications = ['politics', 'money', 'society', 'culture']
    start_urls = ["http://news.joins.com/politics",
                  "http://news.joins.com/society",
                  "http://news.joins.com/money"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        urls = sel.xpath('//div[@class="bd"]/ul/li/strong[@class="headline mg"]')
        items = []
        for url in urls:
            item = ArticleItem()
            item['url'] = url.xpath('a/@href').extract()
            item['url'] = "http://news.joins.com" + join(item['url'])
            items.append(item['url'])
        for itm in items:
            yield scrapy.Request(itm, callback=self.parse2, meta={'item': item})

    def parse2(self, response):
        item = response.meta['item']
        sel = scrapy.Selector(response)
        articles = sel.xpath('//div[@id="article_body"]')
        for article in articles:
            item['article'] = article.xpath('text()').extract()
            items.append(item['article'])
        return items
The problem here is that you restrict the domains: allowed_domains = ["http://joongang.joins.com"]
If I change this to allowed_domains = ["joins.com"], I get results in parse2 and the article text is extracted -- as unicode, but this is OK since the site is not written in Latin characters.
And by the way: you can use response.xpath() instead of creating a Selector over the response object. This requires less code and makes coding easier, as in the sketch below.
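For instance, the parse method could look something like this (a minimal sketch, reusing the same XPath and item fields as the spider above):

def parse(self, response):
    # response.xpath() works directly; no explicit Selector is needed
    for link in response.xpath('//div[@class="bd"]/ul/li/strong[@class="headline mg"]'):
        item = ArticleItem()
        item['url'] = "http://news.joins.com" + "".join(link.xpath('a/@href').extract())
        yield scrapy.Request(item['url'], callback=self.parse2, meta={'item': item})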
I've searched everywhere, and what I mostly found was doc.xpath('//element[@class="classname"]'), but it does not work no matter what I try.
The code I'm using:
import lxml.html
from urllib.request import urlopen

def check():
    data = urlopen('url').read()
    return str(data)

doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='test']")
print(el)
It simply prints an empty list.
Edit:
How odd. I used Google as a test page and it works fine there, but it doesn't work on the page I was using (YouTube).
Here's the exact code I'm using.
import lxml.html
from urllib.request import urlopen
import sys

def check():
    data = urlopen('http://www.youtube.com/user/TopGear').read()  # TopGear as a test
    return data.decode('utf-8', 'ignore')

doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='channel']")
print(el)
The TopGear page that you use for testing doesn't have any <div class="channel"> elements. But this works (for example):
el = doc.xpath("//div[@class='channel-title-container']")
Or this:
el = doc.xpath("//div[@class='a yb xr']")
To find <div> elements with a class attribute that contains the string channel, you could use
el = doc.xpath("//div[contains(@class, 'channel')]")
You can use lxml.cssselect to simplify class and id requests: http://lxml.de/dev/cssselect.html
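For instance (a sketch, assuming the cssselect package is installed; lxml.html elements then expose a cssselect() method):

# 'div.channel' selects <div> elements whose class list contains 'channel'
els = doc.cssselect('div.channel')
print(els)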
HTML uses classes (a lot), which makes them convenient hooks for XPath queries. However, XPath has no knowledge of CSS classes (or even of space-separated lists), which makes classes a pain in the ass to check: the canonically correct way to look for elements having a specific class is:
//*[contains(concat(' ', normalize-space(@class), ' '), ' $className ')]
In your case this is
el = doc.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' channel ')]"
)
# print(el)
# [<Element div at 0x7fa44e31ccc8>, <Element div at 0x7fa44e31c278>, <Element div at 0x7fa44e31cdb8>]
or register your own XPath extension function hasaclass(*classes):

from lxml import etree

def _hasaclass(context, *cls):
    # context.context_node is the element currently being tested
    classes = (context.context_node.get('class') or '').split()
    return all(c in classes for c in cls)

xpath_utils = etree.FunctionNamespace(None)
xpath_utils['hasaclass'] = _hasaclass

el = doc.xpath("//div[hasaclass('channel')]")