Finding html element with class using lxml

I've searched everywhere, and what I mostly found was doc.xpath('//element[@class="classname"]'), but this does not work no matter what I try.
Here is the code I'm using:
import lxml.html
from urllib.request import urlopen
def check():
    data = urlopen('url').read()
    return str(data)
doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='test']")
print(el)
It simply prints an empty list.
Edit:
How odd. I used Google as a test page and it works fine there, but it doesn't work on the page I was actually using (YouTube).
Here's the exact code I'm using.
import lxml.html
from urllib.request import urlopen
import sys
def check():
    data = urlopen('http://www.youtube.com/user/TopGear').read()  # TopGear as a test
    return data.decode('utf-8', 'ignore')
doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='channel']")
print(el)

The TopGear page that you use for testing doesn't have any <div class="channel"> elements. But this works (for example):
el = doc.xpath("//div[@class='channel-title-container']")
Or this:
el = doc.xpath("//div[@class='a yb xr']")
To find <div> elements with a class attribute that contains the string channel, you could use
el = doc.xpath("//div[contains(@class, 'channel')]")
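
To make the difference between the two predicates concrete, here is a minimal sketch run against a small made-up HTML snippet rather than the live YouTube page. The markup and the variable names are assumptions chosen for illustration only.

```python
import lxml.html

# Hypothetical markup, not taken from the real page.
html = """
<html><body>
  <div class="channel">exact</div>
  <div class="channel-title-container">prefix only</div>
  <div class="big channel">one of several</div>
  <div class="other">unrelated</div>
</body></html>
"""
doc = lxml.html.document_fromstring(html)

# @class='channel' compares the whole attribute value as one string.
exact = doc.xpath("//div[@class='channel']")

# contains(@class, 'channel') is a plain substring test on the attribute.
substring = doc.xpath("//div[contains(@class, 'channel')]")

exact_texts = [d.text for d in exact]
substring_texts = [d.text for d in substring]
print(exact_texts)
print(substring_texts)
```

The exact comparison misses the element whose class attribute lists several classes, while the substring test also picks up channel-title-container, which may or may not be what you want.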

You can use lxml.cssselect to simplify class and id request: http://lxml.de/dev/cssselect.html

HTML uses classes (a lot), which makes them convenient hooks for XPath queries. However, XPath has no knowledge of CSS classes (or even of space-separated lists), which makes classes a pain in the ass to check. The canonically correct way to look for elements having a specific class is:
//*[contains(concat(' ', normalize-space(@class), ' '), ' $className ')]
The spaces around $className matter: they make the test match a whole class token. Without them the expression is only a substring test. In your case, using the looser substring form so that any class containing "channel" matches, this is
el = doc.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), 'channel')]"
)
# print(el)
# [<Element div at 0x7fa44e31ccc8>, <Element div at 0x7fa44e31c278>, <Element div at 0x7fa44e31cdb8>]
Or register your own XPath function, hasaclass(*classes):
from lxml import etree
def _hasaclass(context, *cls):
    return "your implementation ..."
xpath_utils = etree.FunctionNamespace(None)
xpath_utils['hasaclass'] = _hasaclass
el = doc.xpath("//div[hasaclass('channel')]")
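
One possible way to fill in the placeholder body is sketched below. It is not the answerer's implementation: it assumes the function should return true only when the element carries every class name passed in, and it reads the element through context.context_node, which lxml hands to extension functions. The sample markup is made up.

```python
import lxml.html
from lxml import etree

def _hasaclass(context, *names):
    # Split the class attribute into tokens and require every requested name.
    tokens = (context.context_node.get("class") or "").split()
    return all(name in tokens for name in names)

# Register in the default (empty) function namespace, as in the answer above.
xpath_utils = etree.FunctionNamespace(None)
xpath_utils["hasaclass"] = _hasaclass

# Hypothetical markup for illustration.
doc = lxml.html.document_fromstring("""
<html><body>
  <div class="channel">exact</div>
  <div class="channel-title-container">prefix only</div>
  <div class="big channel">one of several</div>
</body></html>
""")

by_function = [d.text for d in doc.xpath("//div[hasaclass('channel')]")]

# The padded concat/normalize-space idiom should select the same elements.
by_idiom = [d.text for d in doc.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' channel ')]"
)]
print(by_function, by_idiom)
```

Both queries skip channel-title-container because they compare whole class tokens rather than substrings.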

Related

BeautifulSoup extract text from all div class including children elements

I need to extract from a website all the text divided by div and class.
I'd like to keep this tool generic to use it with different websites.
The piece of code below is working fine. But I don't know how to get into the children elements.
from bs4 import BeautifulSoup
import requests
url = 'xxx'
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
div = soup.find_all("div")
classes = [value
           for element in soup.find_all(class_=True)
           for value in element["class"]]
for class_el in classes:
    try:
        div = soup.find('div', {"class" : class_el})
        text = div.text
        print("")
        print("=============================")
        print(class_el)
        print("")
        print(text)
    except:
        print("error")
If I understand you correctly, this should get you the text, if any, from each <div> element in the soup, if that <div> element has one of the classes which is in the classes list.
As an aside, it's not a good idea to name your variables div, etc., so I changed that part a bit:
for class_el in classes:
    target = soup.find('div', {"class" : class_el})
    if target is not None and len(target.text.strip())>0:
        print(target.text.strip())
        print('=============')
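
On the question of child elements: .text and get_text() already include the text of all descendants, so no explicit recursion is needed. The sketch below is one possible generic variant that visits every <div> carrying a class (not just the first match per class) and groups the text by class name. The sample markup and the helper name text_by_class are made up for illustration, and it uses the built-in html.parser so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<div class="post">
  <h2>Title</h2>
  <div class="body"><p>First <b>bold</b> paragraph.</p></div>
</div>
<div class="post"><p>Second post</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

def text_by_class(soup):
    """Map each class name to the text of every <div> that carries it."""
    result = {}
    for div in soup.find_all("div", class_=True):
        # get_text() walks all descendants; strip=True drops blank strings.
        text = div.get_text(" ", strip=True)
        for name in div["class"]:
            result.setdefault(name, []).append(text)
    return result

texts = text_by_class(soup)
for name, chunks in texts.items():
    print(name, chunks)
```

Note that the text of a nested <div> shows up both under its own class and inside its parent's text, which follows directly from how get_text() works.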

How to save edited webview result?

Situation description:
Python 3.7, GTK 3.0, PyGObjects 3.34.0 Webkit2 4.0
I have a dialog window, with GtkNotebook containing 2 tabs.
The first tab contains an editable WebKit webview; the second tab contains a textview. One of the arguments passed to the class constructor is a valid HTML snippet, held as a string.
What I would like is for any change made in either tab to be automatically reflected in the other.
Current problem:
Using the solution provided here, any changes previously made in the webview are discarded when switching notebook tabs. Debugging shows that the HTML obtained with the aforementioned call does not contain the changes.
Any ideas what might be missing in the logic or handling itself?
For reference, the code for the dialog is as follows:
# -*- coding: utf-8 -*-
import gi
gi.require_version('Gtk', '3.0')
gi.require_version('WebKit2', '4.0')
from gi.repository import Gtk, WebKit2
class DescriptionDialog:
    def __init__(self, *args):
        # GTK Builder
        self._builder = args[0]
        self._builder.add_from_file("UI/GUI/description.glade")
        self.dialog = self._builder.get_object("descriptionDialog")
        self._textView = self._builder.get_object("textview1")
        self.webViewContainer = self._builder.get_object("WebViewContainer")
        self.browserHolder = WebKit2.WebView()
        self.browserHolder.set_editable(True)
        self.webViewContainer.add(self.browserHolder)
        self.browserHolder.show()
        # valid html snippet, held as string
        self.__buffer_orig__ = args[2]
        self.buffer = args[2]
        self.browserHolder.load_html(self.buffer)
        self._builder.connect_signals(
            {
                "onDialogClose": self.onDialogClose,
                "pageChangeNotebook": self.onPageChange
            })
        self.dialog.set_transient_for(self._builder.get_object("MainWindow"))
        self.textBuffer = self._builder.get_object("textbuffer1")
        self.textBuffer.set_text(self.buffer)
        self.dialog.show()
    def onDialogClose(self, handler):
        self.dialog.hide()
    def onPageChange(self, notebook=None, scrolledWindow=None, pageNumber=0):
        if pageNumber == 0:
            self.buffer = self.textBuffer.get_text(self.textBuffer.get_start_iter(), self.textBuffer.get_end_iter(), True)
            self.browserHolder.load_html(self.buffer)
        if pageNumber == 1:
            self.browserHolder.get_main_resource().get_data(None, self.getDataFromResource, None)
    def getDataFromResource(self, resource, result, userData=None):
        # Changed html is not returned here
        self.buffer = str(resource.get_data_finish(result).decode("utf-8"))
        self.textBuffer.set_text(self.buffer)
For other internet users finding this thread.
With the versions given above, this is the working result that I have come up with.
The main idea of this implementation is to use the WebView's JavaScript engine to obtain the contents of the <body> tag, then convert the JavaScript result to a string so the value can be used later on.
def onPageChange(self, notebook=None, scrolledWindow=None, pageNumber=0):
    if pageNumber == 0:
        self.buffer = self.textBuffer.get_text(self.textBuffer.get_start_iter(), self.textBuffer.get_end_iter(), True)
        self.browserHolder.load_html(self.buffer)
    if pageNumber == 1:
        # use JavaScript to get the html contained in rendered body tag
        script = "document.body.innerHTML;"
        # Execute JavaScript via WebKit2.WebView bindings.
        # Result can be obtained asynchronously, via callback method
        self.browserHolder.run_javascript(script, None, self.getJSStatus, None)
def getJSStatus(self, resource, result, userData=None):
    # Sample adapted and simplified from here:
    # https://lazka.github.io/pgi-docs/#WebKit2-4.0/classes/WebView.html#WebKit2.WebView.run_javascript_finish
    # Get the JavaScript result
    data = self.browserHolder.run_javascript_finish(result)
    # Get value from result, and convert it to string
    self.buffer = data.get_js_value().to_string()
    self.textBuffer.set_text(self.buffer)

Can I get html string from IPython.display.Markdown?

I want to get the HTML from Markdown in a Jupyter Notebook,
like this:
from IPython import display
display.Code("import this")._repr_html_()
But I get:
IPython.core.display.Markdown object has no attribute '_repr_html_'.
Any idea?
Here is an example of using the markdown package to convert Markdown to HTML and combine it with other HTML. Install the package first:
pip install markdown
Then, in a notebook cell:
import ipywidgets as widgets
import markdown
#Convert markdown to html
html = markdown.markdown("""# Pandas and Plotly guide
Here we have [Pandas](https://pandas.pydata.org/) and [Plotly Express library](https://plotly.com/python/plotly-express/) used in combination with:
* Ipyvuetify (ipywidgets Vuetify UI framework)
* Ipymonaco ( a text editor widget)
Available plot types:
""")
# copy some html from the plotly express website
html += """<ul>
<li><strong>Basics</strong>: <code>scatter</code>, <code>line</code>, <code>area</code>, <code>bar</code>, <code>funnel</code>, <code>timeline</code></li>
</ul>"""
help_links = widgets.HTML(value = html)
help_links
I don't think it is (directly) possible; the Markdown -> HTML conversion for, say, mdout = md("## string ..."); display(mdout) seems to happen in JavaScript, in the append_markdown function, defined here:
https://github.com/jupyter/notebook/blob/238828e/notebook/static/notebook/js/outputarea.js#L730
Of course, if someone can come up with a way, to do a JavaScript call from Jupyter Python cell to perform this conversion, and then get the results back in Python before doing the display(...), then it would be possible :)
For more discussion, see:
https://discourse.jupyter.org/t/how-to-obtain-html-string-from-markdown-as-jupyter-does-it/10589
EDIT: However, I just found a method to cheat through this (see also IPython: Adding Javascript scripts to IPython notebook) - you don't get the HTML string directly back in Python, but you can send the markdown string from Python, and control the display() of the converted string; the trick is to write in a separate <div>, and have JavaScript store the result of the conversion there.
So you can put this in a code (Python) cell in Jupyter:
from IPython.display import HTML, display
def js_convert_md_html(instring_md):
    js_convert = """
    <div id="_my_special_div"></div>
    <script>
    //import * as markdown from "base/js/markdown"; //import declarations may only appear at top level of a module
    //define(['base/js/markdown'], function ttttest(markdown) {{ // Mismatched anonymous define() module:
    //  console.log(markdown);
    //}});
    //const markdown = require('base/js/markdown'); // redeclaration of const markdown
    //console.log(markdown); // is there!
    function do_convert_md_html(instr) {{
      //return instr.toUpperCase();
      markdown.render(instr, {{
        with_math: true,
        clean_tables: true,
        sanitize: true,
      }}, function (err, html) {{
        //console.log(html); //ok
        $("#_my_special_div").html(html);
      }});
    }}
    var myinputstring = '{0}';
    do_convert_md_html(myinputstring);
    </script>
    """.format(instring_md)
    return HTML(js_convert)
jsobj = js_convert_md_html("*hello* **world** $$x_2 = e^{x}$$")
display(jsobj)
This results in the converted markdown being rendered inside the <div> in the cell output.

Is there a way to include math formulae in Scaladoc?

I would like to enter math formulae in Scaladoc documentation of mathematical Scala code. In Java, I found a library called LatexTaglet that can do exactly this for Javadoc, by writing formulae in Latex:
http://latextaglet.sourceforge.net/
And it seems to integrate well with Maven (reporting/plugins section of a POM). Is there an equivalent library for Scaladoc? If not, how could I integrate this library with SBT?
I also considered using MathML (http://www.w3.org/Math/), but it looks too verbose. Is there an editor you would recommend? Does MathML integrate well with Scaladoc?
Thank you for your help!
To follow up on @mergeconflict's answer, here is how I did it.
As there is no proper solution, I implemented a crawler that parses all generated html files and replaces any "import tag" it finds (see code below) with the import of the MathJax script:
import java.io.PrintWriter
import scala.io.Source
lazy val mathFormulaInDoc = taskKey[Unit]("add MathJax script import in doc html to display nice latex formula")
mathFormulaInDoc := {
  val apiDir = (doc in Compile).value
  val docDir = apiDir    // /"some"/"subfolder"  // in my case, only api/some/subfolder is parsed
  // will replace this "importTag" by "scriptLine"
  val importTag = "##import MathJax"
  val scriptLine = "<script type=\"text/javascript\" src=\"https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML\"> </script>"
  // find all html files and apply patch
  if(docDir.isDirectory)
    listHtmlFile(docDir).foreach { f =>
      val content = Source.fromFile(f).getLines().mkString("\n")
      if(content.contains(importTag)) {
        val writer = new PrintWriter(f)
        writer.write(content.replace(importTag, scriptLine))
        writer.close()
      }
    }
}
// attach this task to doc task (sbt 0.13 syntax; the <<= operator was removed in sbt 1.x)
mathFormulaInDoc <<= mathFormulaInDoc triggeredBy (doc in Compile)
// function that finds html files recursively
def listHtmlFile(dir: java.io.File): List[java.io.File] = {
  dir.listFiles.toList.flatMap { f =>
    if(f.getName.endsWith(".html")) List(f)
    else if(f.isDirectory) listHtmlFile(f)
    else Nil
  }
}
As you can see, this crawler task is attached to the doc task, so it is run automatically by sbt doc.
Here is an example of a doc comment that will be rendered with a formula:
/**
 * Compute the energy using formula:
 *
 * ##import MathJax
 *
 * $$e = m\times c^2$$
 */
def energy(m: Double, c: Double) = m*c*c
Now, it would be possible to improve this code. For example:
- add the script import in the html head section
- avoid reading the whole files (maybe add a rule that the import tag should be in the first few lines)
- add the script to the sbt package, and add it to the target/api folder using some suitable task
The short answer is: no. LaTeXTaglet is made possible by the JavaDoc Taglet API. There is no equivalent in Scaladoc, therefore no clean solution.
However, I can think of a hack that might be easy enough to do:
There's a library called MathJax, which looks for LaTeX-style math formulae in an HTML page and dynamically renders them in place. I've used it before, and it's pretty nice; all you have to do is include the script. So you could do one of two things:
1. Edit and rebuild the Scaladoc source to include MathJax, or...
2. Write a little post-processor to crawl all of Scaladoc's HTML output after it runs, and inject MathJax into each file.
That way, you could just write LaTeX formulae directly in your Scala comments and they should be rendered in the browser. Of course if you wanted a non-hacky solution, I'd suggest you create a taglet-like API for Scaladoc ;)
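
The post-processor idea could be sketched as a standalone script, shown here in Python rather than as an sbt task. Everything in it is an assumption for illustration: the function name inject_script is made up, the script URL is a placeholder that you would replace with the MathJax build you actually want to load, and the demo runs on a throwaway file instead of real Scaladoc output.

```python
import tempfile
from pathlib import Path

SCRIPT_TAG = '<script type="text/javascript" src="{src}"></script>'

def inject_script(doc_dir, script_src):
    """Insert a <script> tag before </head> in every .html file under doc_dir.

    Returns the files that were changed. Files that already contain the tag,
    or that have no </head>, are left untouched.
    """
    tag = SCRIPT_TAG.format(src=script_src)
    changed = []
    for path in sorted(Path(doc_dir).rglob("*.html")):
        content = path.read_text(encoding="utf-8")
        if tag in content or "</head>" not in content:
            continue
        path.write_text(content.replace("</head>", tag + "</head>", 1),
                        encoding="utf-8")
        changed.append(path)
    return changed

# Demo on a temporary directory; the URL below is a placeholder, not a real CDN.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "api" / "index.html"
    page.parent.mkdir()
    page.write_text(
        "<html><head><title>t</title></head><body>$$e = mc^2$$</body></html>",
        encoding="utf-8",
    )
    src = "https://example.invalid/MathJax.js"
    first_run = [p.name for p in inject_script(tmp, src)]
    second_run = [p.name for p in inject_script(tmp, src)]  # nothing left to patch
    patched = page.read_text(encoding="utf-8")

print(first_run, second_run)
```

Checking for the tag before writing keeps the script safe to run repeatedly after each documentation build.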
The forthcoming Scala 3, aka Dotty, has built-in support for Markdown, which allows rendering simple math formulas using a subset of LaTeX.
I solved this by using the same approach as Spark did.
Put this JavaScript in a file somewhere in your project:
// From Spark, licensed APL2
// https://github.com/apache/spark/commit/36827ddafeaa7a683362eb8da31065aaff9676d5
function injectMathJax() {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.async = true;
  script.onload = function(){
    MathJax.Hub.Config({
      displayAlign: "left",
      tex2jax: {
        inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ],
        displayMath: [ ["$$","$$"], ["\\[", "\\]"] ],
        processEscapes: true,
        skipTags: ['script', 'noscript', 'style', 'textarea', 'pre', 'a']
      }
    });
  };
  script.src = ('https:' == document.location.protocol ? 'https://' : 'http://') +
    'cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-MML-AM_CHTML';
  document.getElementsByTagName('head')[0].appendChild(script);
}
document.addEventListener('DOMContentLoaded', injectMathJax)
and this little bit into your build.sbt:
lazy val injectMathJax = taskKey[Unit]("Injects MathJax Javascript into Scaladoc template.js")
injectMathJax := {
  val docPath = (Compile / doc).value
  val templateJsOutput = docPath / "lib" / "template.js"
  streams.value.log.info(s"Adding MathJax initialization to $templateJsOutput")
  // change this path, obviously
  IO.append(templateJsOutput, IO.readBytes(file("doc/static/js/mathjax_init.js")))
},
injectMathJax := (injectMathJax triggeredBy (Compile / doc)).value
I'll eventually get around to building and publicly releasing a plugin for this, as I'm likely going to be using Scala 2.x for a very long time.
Caveats to this approach:
- Formulae must be enclosed in $ or $$ in Scaladoc comments.
- It's best to further enclose them in the comment with another element. I've been using <blockquote>.
- For at least the Scaladoc included with Scala 2.11.x, a formula will only show on class, object, and trait top-level symbols. Something in the toggle to show the full comment breaks when MathJax-injected elements are present. I've not figured it out yet, but if I do, I'll submit a patch to Scaladoc directly.
Example:
/**
  * A Mean Absolute Scaled Error implementation
  *
  * Non-seasonal MASE formula:
  * <blockquote>
  * $$
  * \mathrm{MASE} = \mathrm{mean}\left( \frac{\left| e_j \right|}{\frac{1}{T-1}\sum_{t=2}^T \left| Y_t-Y_{t-1}\right|} \right) = \frac{\frac{1}{J}\sum_{j}\left| e_j \right|}{\frac{1}{T-1}\sum_{t=2}^T \left| Y_t-Y_{t-1}\right|}
  * $$
  * </blockquote>
  **/
object MeanAbsoluteScaledError {

I get a 404 error in web.py when using /(.+) (Newbie Q)

Here is my code and my issue.
import web
render = web.template.render('templates/')
urls = (
    '/(.+)', 'index'
)
class index:
    def GET(self, lang):
        return render.index(lang)
if __name__=="__main__":
    app = web.application(urls, globals())
    app.run()
and my index.html is this one:
$def with (lang)
$if lang == 'en':
    I just wanted to say <em>hello</em>
$elif lang =='es' or lang == '':
    <em>Hola</em>, mundo!
$else:
    página no disponible en este idioma
The problem is that when I run this code I get a 404 error. I think the issue might be the urls part, specifically the /(.+). I think I'm not using it right, and I want to make it work so I can use more than one parameter. When I use /(.*) it works, but not for more than one parameter, and the doc says that for more than one parameter I have to use + instead of *.
Thanks beforehand.
You should study regular expressions: web.py only matches the pattern against the path and passes the captured groups to the controller method. You can mark a group as optional with ?, so that when it is empty nothing is captured and lang falls back to its default of None.
Also, . in a regexp matches any character; to capture a language code you'd better use \w, which matches any word character.
urls = (
    r'/(\w+)?', 'index'
)
class index:
    def GET(self, lang=None):
        return render.index(lang)
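
To see why the root URL gives a 404 with /(.+), the three patterns can be compared using only the re module. This sketch assumes web.py requires the pattern to match the whole path, which is modelled here with re.fullmatch. The helper name captured is made up for illustration.

```python
import re

patterns = [r"/(.+)", r"/(.*)", r"/(\w+)?"]
paths = ["/", "/en", "/es"]

def captured(pattern, path):
    """Return the first captured group, or "no match" if the path does not match."""
    m = re.fullmatch(pattern, path)
    return "no match" if m is None else m.group(1)

results = {(p, path): captured(p, path) for p in patterns for path in paths}
for (p, path), value in results.items():
    print(f"{p!r:12} {path!r:6} -> {value!r}")
```

With /(.+) the bare path "/" does not match at all because + needs at least one character, /(.*) matches it with an empty string, and /(\w+)? matches it with the group left as None, which is why the handler needs the lang=None default.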