I am trying to use Scrapy to gather Google search results and put them into MongoDB. However, I don't get any response... what am I missing?
It seems very simple.
# -*- coding: utf-8 -*-
import scrapy


class GoogleSpider(scrapy.Spider):
    name = "google"
    allowed_domains = ["google.com"]
    start_urls = (
        'https://www.google.com/#q=site:www.linkedin.com%2Fpub+intext:(security+or+jsp)+and+(power+or+utility)',
    )

    def parse(self, response):
        for sel in response.xpath('//*[@id="rso"]/div/div[1]/div/h3'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc
You are missing that the response does not contain the elements you are requesting with your XPath.
That's because Scrapy sees a different page than your browser does: when your start_url is requested, Google returns a mostly empty page and then fires an XHR request to fetch the actual search results.
Scrapy never sends that XHR call, because it is triggered by JavaScript, which Scrapy does not execute.
To see what Scrapy actually gets when calling this URL, and whether it matches your expectations, use the Scrapy shell:
scrapy shell "https://www.google.com/#q=site:www.linkedin.com%2Fpub+intext:(security+or+jsp)+and+(power+or+utility)"
Then, when the prompt appears, you can see why you do not get any results:
>>> response.xpath('//*[@id="rso"]/div/div[1]/div/h3')
[]
>>>
So Scrapy finds nothing for your XPath because the content is simply not there.
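If you still want to go down this road, note that everything after the # in your start URL is a fragment, which is never sent to the server at all. A hedged sketch of one possible workaround is to query Google's plain-HTML endpoint /search?q=... instead; the selector below is only illustrative, Google's markup changes frequently, and automated querying may be blocked or disallowed by Google's terms of service:
import scrapy


class GoogleHtmlSpider(scrapy.Spider):
    # Sketch only: request the server-rendered results page instead of the
    # JavaScript-driven #q= page. The XPath is an illustrative guess.
    name = "google_html"
    start_urls = [
        'https://www.google.com/search?q=site:www.linkedin.com%2Fpub'
        '+intext:(security+or+jsp)+and+(power+or+utility)',
    ]

    def parse(self, response):
        # Result titles have historically been wrapped in <h3> elements;
        # adjust this to whatever the live markup actually contains.
        for sel in response.xpath('//h3'):
            yield {'title': sel.xpath('string(.)').extract_first()}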
I've been facing some problems getting URL parameters using the Tornado web framework.
This is my code:
def main_app():
    return tornado.web.Application([
        (r"/reg", register),
        (r"/account", account),
    ])


class account(tornado.web.RequestHandler):
    def get(self):
        name = self.get_argument('name')
        depo = self.get_argument('depo')
        respone = {'name': name, 'depo': depo}
        self.write(respone)
I've tried using a REST client to test this web service.
I curl a URL like curl localhost:8000/account?name = "parsa" & depo = "10",
but I always get an error saying it doesn't recognize depo. From some testing it looks like the second parameter (or even the third) never comes through correctly.
I tried several things, but nothing worked.
This is not a problem with your Tornado code; you are making the curl request incorrectly. You can verify that by visiting the URL from your browser.
With curl you have to wrap the whole URL in quotes (and drop the spaces around =), otherwise the shell treats the & as "run in background" and splits the command, so depo never reaches the server:
curl "localhost:8000/account?name=parsa&depo=10"
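If you want to rule out the server side entirely, here is a minimal, self-contained sketch of the handler above (the handler name, the dropped /reg route, and port 8000 are assumptions); with it running, the quoted curl command returns both parameters:
import tornado.ioloop
import tornado.web


class AccountHandler(tornado.web.RequestHandler):
    def get(self):
        # get_argument raises a 400 error if the parameter is missing,
        # which is what happens when the shell swallows everything after &.
        name = self.get_argument('name')
        depo = self.get_argument('depo')
        self.write({'name': name, 'depo': depo})   # serialized as JSON


if __name__ == '__main__':
    app = tornado.web.Application([(r"/account", AccountHandler)])
    app.listen(8000)
    tornado.ioloop.IOLoop.current().start()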
I'm load testing a local API that redirects users based on a few conditions. Locust is not following redirects for the simulated users hitting the endpoints; I know this because the app logs all redirects. If I hit the endpoints manually with curl, I can see that the status is 302 and the Location header is set.
According to the embedded clients.HttpSession.request object, the allow_redirects option is set to True by default.
Any ideas?
We use redirection in our locust tests, especially during the login phase, and the redirects are handled for us without a hitch. Print the status_code of the response you get back: is it 200, a 3xx, or something worse?
Another suggestion: Don't throw your entire testing workflow into the locust file. That makes it too difficult to debug problems. Instead, create a standalone python script that uses the python requests library directly to simulate your workflow. Iron out any kinks, like redirection problems, in that simple, non-locust test script. Once you have that working, extract what you did into a file or class and have the locust task use the class.
Here is an example of what I mean. FooApplication does the real work; it is consumed by both the locust file and a simple test script.
foo_app.py
class FooApplication():

    def __init__(self, client):
        self.client = client
        self.is_logged_in = False

    def login(self):
        self.client.cookies.clear()
        self.is_logged_in = False
        name = '/login'
        response = self.client.post('/login', {
            'user': 'testuser',
            'password': '12345'
        }, allow_redirects=True, name=name)
        if not response.ok:
            self.log_failure('Login failed', name, response)
        else:
            self.is_logged_in = True  # remember success so the tasks can branch on it

    def load_foo(self):
        name = '/foo'
        response = self.client.get('/foo', name=name)
        if not response.ok:
            self.log_failure('Foo request failed', name, response)

    def log_failure(self, message, name, response):
        pass  # add some logging
foo_test_client.py
# Use this test file to iron out kinks in your request workflow
import requests
from locust.clients import HttpSession
from foo_app import FooApplication
client = HttpSession('http://dev.foo.com')
app = FooApplication(client)
app.login()
app.load_foo()
locustfile.py
from locust import TaskSet, task

from foo_app import FooApplication


class FooTaskSet(TaskSet):

    def on_start(self):
        self.foo = FooApplication(self.client)

    @task(1)
    def login(self):
        if not self.foo.is_logged_in:
            self.foo.login()

    @task(5)  # 5x more likely to load a foo vs logging in again
    def load_foo(self):
        if self.foo.is_logged_in:
            self.foo.load_foo()
        else:
            self.login()
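One note: for the locustfile above to run on its own it also needs a Locust class tying the TaskSet to a host. A minimal sketch using the pre-1.0 HttpLocust API that matches the code above (the host is copied from foo_test_client.py and the wait times are arbitrary):
from locust import HttpLocust


class FooLocust(HttpLocust):
    task_set = FooTaskSet            # the TaskSet defined above
    host = 'http://dev.foo.com'      # same host as foo_test_client.py
    min_wait = 1000                  # ms each simulated user waits between tasks
    max_wait = 3000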
Since Locust uses the Requests HTTP library for Python, you might find your answer there.
The Response object can be used to check whether a redirect happened and what the redirect history contains.
is_redirect:
True if this Response is a well-formed HTTP redirect that could have been
processed automatically (by Session.resolve_redirects).
If is_redirect is False, that may be an indication that the redirect is not well-formed.
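Since the Locust client is a thin wrapper around a requests Session, a quick way to inspect this outside Locust is something like the following (the URL is a placeholder for one of your redirecting endpoints):
import requests

# allow_redirects defaults to True for GET, so requests follows the 302
# automatically and keeps the intermediate responses in .history.
response = requests.get('http://localhost:8000/some/redirecting/endpoint')

print(response.status_code)                        # status of the final page
print([r.status_code for r in response.history])   # e.g. [302]
if response.history:
    print(response.history[0].is_redirect)         # True for a well-formed redirect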
I'm trying to use Scrapy to collect professors' contact information from a university directory. Since I can't post more than 2 links, I put all the links in the following picture.
I set "last name equals" in the drop-down menu, as shown in the picture, and then search all professors by last name.
Usually the URL follows some recognizable pattern on other universities' websites. For this one, however, the original URL is (1), and it becomes (2) when I search 'An' as the last name. It seems like 'An' is replaced by something like 529385FD5FF90A198625819E002B8B41? I'm not sure. Is there any way I can get the URL that I need to send as a request? I mean, this time I searched 'An'; if I search another last name like Lee, it will be another request. They look irregular and I can't find a pattern.
The scraper is not as complex as you think. The site just submits the search form as a POST request, and the results come back in the response. Something like the following should work:
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser


class univSpider(scrapy.Spider):
    name = "univ"
    start_urls = ["http://appl103.lsu.edu/dir003.nsf/(NoteID)/5903C096337C2AA28625819E0038E3E4?OpenDocument"]

    def parse(self, response):
        yield FormRequest.from_response(response, formname="_DIRVNAM",
                                        formdata={"LastName": "Lalwani"},
                                        callback=self.search_result)

    def search_result(self, response):
        open_in_browser(response)
        print(response.body)
I found a nice script for importing XML into MediaWiki using PowerShell:
http://slash4.de/tutorials/Automatic_mediawiki_page_import_powershell_script
Currently I can't get it to run; I'm fairly sure this is a problem with the permissions.
First I set the wiki to allow anybody to upload an import:
$wgGroupPermissions['*']['import'] = true;
$wgGroupPermissions['*']['importupload'] = true;
Then I get this error: Import failed: Loss of session data.
I tried to figure out how to pass the user and password to this line in PowerShell:
$req.Credentials = [System.Net.CredentialCache]::DefaultCredentials
and changed it to
$req.Credentials = [System.Net.CredentialCache]::("user", "pass")
Again I get: Import failed: Loss of session data.
How can I pass the user/password to the website?
The Loss of session data error is generated when the edit token sent with the request does not have the expected value.
In the script you linked to, the $wikiURL string contains editToken=12345. That does not look like a valid MediaWiki edit token, so it's not surprising that it will fail.
In current versions of MediaWiki, the edit token for non-logged-in users is always +\. You could try replacing 12345 in the script with that (or, rather, with its URL-encoded version %2B%5C) and see if it helps.
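If hard-coding the token feels fragile, reasonably recent MediaWiki versions also expose tokens through the web API, so the script could fetch a real one instead. A small Python sketch (the wiki URL is a placeholder and this assumes api.php is reachable):
import requests

API_URL = 'https://wiki.example.org/w/api.php'  # placeholder; point this at your wiki

session = requests.Session()

# Ask MediaWiki for a CSRF (edit) token; for anonymous users this is
# typically the literal '+\' mentioned above.
reply = session.get(API_URL, params={
    'action': 'query',
    'meta': 'tokens',
    'type': 'csrf',
    'format': 'json',
}).json()

token = reply['query']['tokens']['csrftoken']
print(token)  # URL-encode this and use it in place of the hard-coded 12345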
I am using Scrapy to fetch some data from iTunes' AppStore database. I start with this list of apps: http://itunes.apple.com/us/genre/mobile-software-applications/id36?mt=8
In the following code I have used the simplest regex which targets all apps in the US store.
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class AppStoreSpider(CrawlSpider):
    domain_name = 'itunes.apple.com'
    start_urls = ['http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8']

    rules = (
        Rule(SgmlLinkExtractor(allow='itunes\.apple\.com/us/app'),
             'parse_app', follow=True),
    )

    def parse_app(self, response):
        ....

SPIDER = AppStoreSpider()
When I run it I receive the following:
[itunes.apple.com] DEBUG: Crawled (200) <GET http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8> (referer: None)
[itunes.apple.com] DEBUG: Filtered offsite request to 'itunes.apple.com': <GET http://itunes.apple.com/us/app/bloomberg/id281941097?mt=8>
As you can see, when it starts crawling the first page it says "Filtered offsite request to 'itunes.apple.com'", and then the spider stops.
It also returns this message:
[ScrapyHTTPPageGetter,client] /usr/lib/python2.5/cookielib.py:1577: exceptions.UserWarning: cookielib bug!
Traceback (most recent call last):
File "/usr/lib/python2.5/cookielib.py", line 1575, in make_cookies
parse_ns_headers(ns_hdrs), request)
File "/usr/lib/python2.5/cookielib.py", line 1532, in _cookies_from_attrs_set
cookie = self._cookie_from_cookie_tuple(tup, request)
File "/usr/lib/python2.5/cookielib.py", line 1451, in _cookie_from_cookie_tuple
if version is not None: version = int(version)
ValueError: invalid literal for int() with base 10: '"1"'
I have used the same script for other websites and didn't have this problem.
Any suggestions?
When I hit that link in a browser, it automatically tries to open iTunes locally. That could be the "offsite request" mentioned in the error.
I would try:
1) Remove "?mt=8" from the end of the URL. It looks like it's not needed anyway and it could have something to do with the request.
2) Try the same request in the Scrapy Shell. It's a much easier way to debug your code and try new things. More details here: http://doc.scrapy.org/topics/shell.html?highlight=interactive
I see this post is pretty old; if you haven't figured out the cause yet, here it is.
I ran into a similar issue working with iTunes Connect using mechanize. After much frustration I found that there's a bug in cookielib that doesn't handle some cookies correctly. It's discussed here: http://bugs.python.org/issue3924
The fix at the bottom of that post worked for me. I'll repost it here for convenience.
Basically, you create a custom subclass of cookielib.CookieJar, override _cookie_from_cookie_tuple, and use this CustomCookieJar in place of the standard cookielib jar:
import cookielib


class CustomCookieJar(cookielib.CookieJar):

    def _cookie_from_cookie_tuple(self, tup, request):
        name, value, standard, rest = tup
        version = standard.get("version", None)
        if version is not None:
            # Some servers add quotes around the version number, but this
            # module expects a plain int.
            standard["version"] = version.strip('"')
        return cookielib.CookieJar._cookie_from_cookie_tuple(self, tup, request)
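For example, with mechanize (mentioned above) you would hand the custom jar to the browser before opening any pages; a short usage sketch (the URL is just the site discussed in this answer):
import mechanize

jar = CustomCookieJar()          # the subclass defined above
browser = mechanize.Browser()
browser.set_cookiejar(jar)       # cookies now go through the patched parsing
browser.open('https://itunesconnect.apple.com/')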