Scrape different pages using Scrapy - callback

I've been trying to scrape different pages. First, I scrape a URL from the first page using xpath(@href) in the parse function. Then I try to scrape the article at that URL in the callback of the request yielded from parse. But it doesn't work.
How can I solve this issue? Here is my code:
import scrapy
from string import join
from article.items import ArticleItem
class ArticleSpider(scrapy.Spider):
    name = "article"
    allowed_domains = ["http://joongang.joins.com"]
    j_classifications = ['politics','money','society','culture']
    start_urls = ["http://news.joins.com/politics",
                  "http://news.joins.com/society",
                  "http://news.joins.com/money",]
    def parse(self, response):
        sel = scrapy.Selector(response)
        urls = sel.xpath('//div[@class="bd"]/ul/li/strong[@class="headline mg"]')
        items = []
        for url in urls:
            item = ArticleItem()
            item['url'] = url.xpath('a/@href').extract()
            item['url'] = "http://news.joins.com"+join(item['url'])
            items.append(item['url'])
        for itm in items:
            yield scrapy.Request(itm,callback=self.parse2,meta={'item':item})
    def parse2(self, response):
        item = response.meta['item']
        sel = scrapy.Selector(response)
        articles = sel.xpath('//div[@id="article_body"]')
        for article in articles:
            item['article'] = article.xpath('text()').extract()
            items.append(item['article'])
        return items

The problem is the allowed_domains setting: allowed_domains = ["http://joongang.joins.com"]. Entries there should be bare domain names, not URLs, and the articles are served from news.joins.com, so the requests yielded from parse are filtered out as offsite and parse2 is never called.
If I change this to allowed_domains = ["joins.com"], parse2 is called and the article text is extracted. It comes back as unicode strings, which is fine since the site is not written in Latin characters.
By the way, you can call response.xpath() directly instead of creating a Selector over the response object. That takes slightly less code and is easier to read.
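To see why the original setting blocks every article request, here is a minimal sketch of host matching against a list of allowed domains. It only approximates what Scrapy's offsite filtering does (the real implementation is more involved), and is_allowed and the article URL are hypothetical names made up for illustration:

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain or is a subdomain of it.

    Simplified approximation of offsite filtering, not Scrapy's actual code.
    """
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


article_url = "http://news.joins.com/article/123"  # hypothetical article URL

# A full URL never matches a hostname, so every request is treated as offsite.
print(is_allowed(article_url, ["http://joongang.joins.com"]))  # False
# A bare domain still fails here because the host is news.joins.com.
print(is_allowed(article_url, ["joongang.joins.com"]))  # False
# The parent domain covers all of its subdomains.
print(is_allowed(article_url, ["joins.com"]))  # True
```

Under these assumptions, only the parent domain "joins.com" lets the news.joins.com requests through.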

Email Pdf attachments to Google sheet

I am looking for something that lets me get the data from a PDF email attachment into a Google Sheet.
We all often get PDF attachments in our email, and it would be great to get all of that data into a Google Sheet.
Do let me know if there is anything like this.
Explanation:
Your question is very broad, and it is impossible to give a specific answer: that answer would depend on the PDF, on the data you want to fetch from it, and on the other details you did not mention.
Here I will provide a general code snippet that gets the PDF from the Gmail attachments and converts it to text (a string). On this text you can then use regular expressions (which have to be very specific to your use case) to extract the desired information and put it in your sheet.
Code snippet:
This is the main function, and the only part you should need to modify:
function myFunction() {
  const queryString = "label:unread from example@gmail.com"; // use your own Gmail search query
  const threads = GmailApp.search(queryString);
  const threadsMessages = GmailApp.getMessagesForThreads(threads);
  const ss = SpreadsheetApp.getActive(); // spreadsheet that will receive the extracted data
  // loop counters are declared with let so they do not leak as implicit globals
  for (let thr = 0, thrs = threads.length; thr < thrs; ++thr) {
    let messages = threads[thr].getMessages();
    for (let msg = 0, msgs = messages.length; msg < msgs; ++msg) {
      let attachments = messages[msg].getAttachments();
      for (let att = 0, atts = attachments.length; att < atts; ++att) {
        let attachment_Name = attachments[att].getName();
        let filetext = pdfToText(attachments[att], {keepTextfile: false});
        Logger.log(filetext);
        // do something with filetext
        // build some regular expression that fetches the desired data from filetext
        // put this data to the sheet
      }
    }
  }
}
pdfToText is a function implemented by Mogsdad, which you can find here. Copy the code snippet provided in that link into your script alongside the myFunction posted in this answer. It also has some options, which are explained well in the link. One important note: to use this function you need to enable the Drive API from the Resources menu.
This will get you started. If you run into issues down the road that you can't find a solution for, create a new post here with the specific details of the problem.

How to save edited webview result?

Situation description:
Python 3.7, GTK 3.0, PyGObject 3.34.0, WebKit2 4.0
I have a dialog window with a GtkNotebook containing two tabs.
The first tab contains an editable WebKit WebView; the second tab contains a TextView. One of the arguments passed to the class constructor is a valid HTML snippet held as a string.
What I would like is for any change made in either tab to be reflected automatically in the other.
Current problem:
Using the solution provided here, any changes made in the WebView are discarded when switching notebook tabs. Debugging shows that the HTML obtained with the aforementioned call does not contain the changes.
Any ideas what might be missing in the logic or the handling itself?
For reference, the code for the dialog is as follows:
# -*- coding: utf-8 -*-
import gi
gi.require_version('Gtk', '3.0')
gi.require_version('WebKit2', '4.0')
from gi.repository import Gtk, WebKit2
class DescriptionDialog:
    def __init__(self, *args):
        # GTK Builder
        self._builder = args[0]
        self._builder.add_from_file("UI/GUI/description.glade")
        self.dialog = self._builder.get_object("descriptionDialog")
        self._textView = self._builder.get_object("textview1")
        self.webViewContainer = self._builder.get_object("WebViewContainer")
        self.browserHolder = WebKit2.WebView()
        self.browserHolder.set_editable(True)
        self.webViewContainer.add(self.browserHolder)
        self.browserHolder.show()
        # valid html snippet, held as string
        self.__buffer_orig__ = args[2]
        self.buffer = args[2]
        self.browserHolder.load_html(self.buffer)
        self._builder.connect_signals(
            {
                "onDialogClose": self.onDialogClose,
                "pageChangeNotebook": self.onPageChange
            })
        self.dialog.set_transient_for(self._builder.get_object("MainWindow"))
        self.textBuffer = self._builder.get_object("textbuffer1")
        self.textBuffer.set_text(self.buffer)
        self.dialog.show()
    def onDialogClose(self, handler):
        self.dialog.hide()
    def onPageChange(self, notebook=None, scrolledWindow=None, pageNumber=0):
        if pageNumber == 0:
            self.buffer = self.textBuffer.get_text(self.textBuffer.get_start_iter(), self.textBuffer.get_end_iter(), True)
            self.browserHolder.load_html(self.buffer)
        if pageNumber == 1:
            self.browserHolder.get_main_resource().get_data(None, self.getDataFromResource, None)
    def getDataFromResource(self, resource, result, userData=None):
        # Changed html is not returned here
        self.buffer = str(resource.get_data_finish(result).decode("utf-8"))
        self.textBuffer.set_text(self.buffer)
For other internet users finding this thread:
With the versions listed above, this is the working solution I came up with.
The main idea is to use the WebView's JavaScript engine to obtain the contents of the <body> tag, then convert the JavaScript result to a string so the value can be used later.
def onPageChange(self, notebook=None, scrolledWindow=None, pageNumber=0):
    if pageNumber == 0:
        self.buffer = self.textBuffer.get_text(self.textBuffer.get_start_iter(), self.textBuffer.get_end_iter(), True)
        self.browserHolder.load_html(self.buffer)
    if pageNumber == 1:
        # use JavaScript to get the html contained in rendered body tag
        script = "document.body.innerHTML;"
        # Execute JavaScript via WebKit2.WebView bindings.
        # Result can be obtained asynchronously, via callback method
        self.browserHolder.run_javascript(script, None, self.getJSStatus, None)
def getJSStatus(self, resource, result, userData=None):
    # Sample adapted and simplified from here:
    # https://lazka.github.io/pgi-docs/#WebKit2-4.0/classes/WebView.html#WebKit2.WebView.run_javascript_finish
    # Get the JavaScript result
    data = self.browserHolder.run_javascript_finish(result)
    # Get value from result, and convert it to string
    self.buffer = data.get_js_value().to_string()
    self.textBuffer.set_text(self.buffer)

Metadata-Extractor -- Missing List of Tags?

I'm using metadata-extractor to retrieve metadata from video files. I have it successfully retrieving the directories. Now I need to query the directories for specific info -- duration, height, etc.
The metadata-extractor docs give this example of how to query for a specific tag value:
// obtain the Exif directory
ExifSubIFDDirectory directory
    = metadata.getFirstDirectoryOfType(ExifSubIFDDirectory.class);
// query the tag's value
Date date
    = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
So it appears I need to get a list of the relevant tags, such as TAG_DATETIME_ORIGINAL, for duration, height, etc.
This page in the metadata-extractor docs contains a link titled "the various tag values", but the page it goes to lists tags for still images only, not for video files.
Googling for Metadata-Extractor -- Complete List of All Tags does not seem to bring up a list of all tags.
Are the metadata-extractor docs really missing a list of tags, or am I approaching this the wrong way somehow?
I found a list of tags at:
https://developer.tizen.org/dev-guide/2.3.1/org.tizen.guides/html/native/multimedia/metadata_extractor_n.htm
However, those constants don't seem to be what's needed in actual code. Here's Java code that works:
import java.io.InputStream;
import java.net.URL;
import com.drew.imaging.ImageMetadataReader;
import com.drew.metadata.Directory;
import com.drew.metadata.Metadata;
import com.drew.metadata.Tag;
import com.drew.metadata.file.FileTypeDirectory;
import com.drew.metadata.mp4.Mp4Directory;
import com.drew.metadata.mp4.media.Mp4SoundDirectory;
import com.drew.metadata.mp4.media.Mp4VideoDirectory;
[.....]
Metadata theMetadata = null;
try {
    InputStream stream = new URL(theVideoInfo.getLinkToVideo()).openStream();
    theMetadata = ImageMetadataReader.readMetadata(stream);
} catch (java.lang.Exception exception) {
    exception.printStackTrace();
}
// each directory may be null if the file has no matching metadata
Mp4SoundDirectory soundDirectory
    = theMetadata.getFirstDirectoryOfType(Mp4SoundDirectory.class);
Mp4VideoDirectory videoDirectory
    = theMetadata.getFirstDirectoryOfType(Mp4VideoDirectory.class);
Mp4Directory mp4Directory
    = theMetadata.getFirstDirectoryOfType(Mp4Directory.class);
FileTypeDirectory fileTypeDirectory
    = theMetadata.getFirstDirectoryOfType(FileTypeDirectory.class);
String numberOfAudioChannels
    = soundDirectory.getString(Mp4SoundDirectory.TAG_NUMBER_OF_CHANNELS);
String duration = mp4Directory.getString(Mp4Directory.TAG_DURATION);
String frameRate = videoDirectory.getString(Mp4VideoDirectory.TAG_FRAME_RATE);
String height = videoDirectory.getString(Mp4VideoDirectory.TAG_HEIGHT);
String width = videoDirectory.getString(Mp4VideoDirectory.TAG_WIDTH);
String type = fileTypeDirectory.getString(FileTypeDirectory.TAG_DETECTED_FILE_MIME_TYPE);
I found the constants (TAG_HEIGHT, TAG_WIDTH, etc.) by directly examining the metadata-extractor objects in the debugger. For example, I'd type:
Mp4VideoDirectory.WIDTH
...and the debugger (IntelliJ) would auto-complete the available constants that had the text "WIDTH" in them.

Django send welcome email after User created using signals

I have a create_user_profile signal and I'd like to use the same signal to send a welcome email to the user.
This is what I have written so far in my signals.py:
@receiver(post_save, sender=User)
def update_user_profile(sender, instance, created, **kwargs):
    if created:
        UserProfile.objects.create(user=instance)
        instance.profile.save()
        subject = 'Welcome to MyApp!'
        from_email = 'no-reply@myapp.com'
        to = instance.email
        plaintext = get_template('email/welcome.txt')
        html = get_template('email/welcome.html')
        d = Context({'username': instance.username})
        text_content = plaintext.render(d)
        html_content = html.render(d)
        try:
            msg = EmailMultiAlternatives(subject, text_content, from_email, [to])
            msg.attach_alternative(html_content, "text/html")
            msg.send()
        except BadHeaderError:
            return HttpResponse('Invalid header found.')
This is failing with this error:
TypeError at /signup/
context must be a dict rather than Context.
pointing to the forms.save in my views.py file.
Can you help me to understand what's wrong here?
Just pass a dict to render() instead of a Context object:
d = {'username': instance.username}
text_content = plaintext.render(d)
In Django 1.11, the template context must be a dict:
https://docs.djangoproject.com/en/1.11/topics/templates/#django.template.backends.base.Template.render
Try removing the Context object creation:
d = {'username': instance.username}

Query String Parameter 'URL' Containing URL with a Query String

With reference to the answer by 'Alo Sarv' to this question:
Why is Facebook share button pulling parameters from the meta tags instead of my specified ones?
I'm facing a problem.
How can I achieve this:
http://siteA.com?siteA_URL=http://siteB.com?siteB_title=abc&siteB_description=def&siteA_title=ghi&siteA_description=jkl
Such that siteA_URL is
siteA_URL = http://siteB.com?siteB_title=abc&siteB_description=def
and not:-
siteA_URL = http://siteB.com?siteB_title=abc
You need to correctly encode the URL parameter:
var siteA_URL = encodeURIComponent("http://siteB.com?siteB_title=abc&siteB_description=def");
var url = "http://siteA.com?siteA_URL=" + siteA_URL;
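The same idea can be checked end to end on the server side. The sketch below is in Python rather than JavaScript and assumes the standard urllib.parse module, where quote(..., safe="") plays a role similar to encodeURIComponent. It builds the outer URL and then parses it back to confirm that siteA_URL keeps its whole inner query string:

```python
from urllib.parse import parse_qs, quote, urlsplit

inner = "http://siteB.com?siteB_title=abc&siteB_description=def"

# Percent-encode every reserved character (":", "/", "?", "=", "&") in the inner URL
# so it travels as a single parameter value.
encoded_inner = quote(inner, safe="")
outer = ("http://siteA.com?siteA_URL=" + encoded_inner
         + "&siteA_title=ghi&siteA_description=jkl")

# Parsing the outer query string decodes the value back to the full inner URL.
params = parse_qs(urlsplit(outer).query)
print(params["siteA_URL"][0])
print(params["siteA_title"][0], params["siteA_description"][0])
```

Because the inner "&" is sent as %26, the parser no longer treats siteB_description as a separate parameter of siteA.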