Scrapy issue with iTunes' AppStore

I am using Scrapy to fetch some data from iTunes' AppStore database. I start with this list of apps: http://itunes.apple.com/us/genre/mobile-software-applications/id36?mt=8
In the following code I have used the simplest regex which targets all apps in the US store.
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class AppStoreSpider(CrawlSpider):
    domain_name = 'itunes.apple.com'
    start_urls = ['http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8']

    rules = (
        Rule(SgmlLinkExtractor(allow='itunes\.apple\.com/us/app'),
             'parse_app', follow=True,
        ),
    )

    def parse_app(self, response):
        ....

SPIDER = AppStoreSpider()
When I run it I receive the following:
[itunes.apple.com] DEBUG: Crawled (200) <GET http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8> (referer: None)
[itunes.apple.com] DEBUG: Filtered offsite request to 'itunes.apple.com': <GET http://itunes.apple.com/us/app/bloomberg/id281941097?mt=8>
As you can see, when it starts crawling the first page it says "Filtered offsite request to 'itunes.apple.com'", and then the spider stops.
It also returns this message:
[ScrapyHTTPPageGetter,client] /usr/lib/python2.5/cookielib.py:1577: exceptions.UserWarning: cookielib bug!
Traceback (most recent call last):
  File "/usr/lib/python2.5/cookielib.py", line 1575, in make_cookies
    parse_ns_headers(ns_hdrs), request)
  File "/usr/lib/python2.5/cookielib.py", line 1532, in _cookies_from_attrs_set
    cookie = self._cookie_from_cookie_tuple(tup, request)
  File "/usr/lib/python2.5/cookielib.py", line 1451, in _cookie_from_cookie_tuple
    if version is not None: version = int(version)
ValueError: invalid literal for int() with base 10: '"1"'
I have used the same script for other websites and I didn't have this problem.
Any suggestions?

When I hit that link in a browser, it automatically tries to open iTunes locally. That could be the "offsite request" mentioned in the error.
I would try:
1) Remove "?mt=8" from the end of the URL. It looks like it's not needed anyway and it could have something to do with the request.
2) Try the same request in the Scrapy Shell. It's a much easier way to debug your code and try new things. More details here: http://doc.scrapy.org/topics/shell.html?highlight=interactive
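For example, a quick shell session along these lines (just a sketch; it reuses the SgmlLinkExtractor import from your spider) would show whether the link extractor even picks up the app links from the genre page:

scrapy shell 'http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8'
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> # check which links the rule would actually extract from this page
>>> links = SgmlLinkExtractor(allow='itunes\.apple\.com/us/app').extract_links(response)
>>> len(links), links[:3]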

I see this post is pretty old; if you haven't figured out the cause yet, here it is.
I ran into a similar issue working with iTunes Connect using mechanize. After much frustration I found that there's a bug in cookielib that doesn't handle some cookies correctly. It's discussed here: http://bugs.python.org/issue3924
The fix at the bottom of that post worked for me. I'll repost it here for convenience.
Basically you create a custom subclass of cookielib.CookieJar, override _cookie_from_cookie_tuple, and use this CustomCookieJar in place of the standard cookielib jar:
import cookielib

class CustomCookieJar(cookielib.CookieJar):
    def _cookie_from_cookie_tuple(self, tup, request):
        name, value, standard, rest = tup
        version = standard.get("version", None)
        if version is not None:
            # some servers wrap the version number in quotes,
            # but this module expects a plain int, so strip them
            standard["version"] = version.strip('"')
        return cookielib.CookieJar._cookie_from_cookie_tuple(self, tup, request)
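How you wire the jar in depends on your HTTP stack. As a rough sketch with plain urllib2 (Scrapy routes cookies through its own downloader middleware, so treat this only as an illustration of swapping the jar in):

import urllib2

jar = CustomCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
# cookies in the response now go through the patched _cookie_from_cookie_tuple
response = opener.open('http://itunes.apple.com/us/app/bloomberg/id281941097?mt=8')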

Related

Scrapy and Google web scraping

I am trying to use Scrapy to gather Google search results and put them into MongoDB. However, I don't get any response... what am I missing?
It seems very simple.
# -*- coding: utf-8 -*-
import scrapy

class GoogleSpider(scrapy.Spider):
    name = "google"
    allowed_domains = ["google.com"]
    start_urls = (
        'https://www.google.com/#q=site:www.linkedin.com%2Fpub+intext:(security+or+jsp)+and+(power+or+utility)',
    )

    def parse(self, response):
        for sel in response.xpath('//*[@id="rso"]/div/div[1]/div/h3'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc
        pass
What you are missing is that the response does not contain the elements your XPath is looking for.
That's because Scrapy and your browser see different pages: when your start_url is requested it only loads the Google page itself, and the search results are then fetched by an XHR request.
Scrapy never sends that XHR call, because it is triggered by JavaScript, which Scrapy does not execute.
To see what Scrapy actually gets for this URL, and whether it matches your expectations, use the Scrapy shell:
scrapy shell "https://www.google.com/#q=site:www.linkedin.com%2Fpub+intext:(security+or+jsp)+and+(power+or+utility)"
Then when the command prompt appears you can see why you do not get results:
>>> response.xpath('//*[@id="rso"]/div/div[1]/div/h3')
[]
>>>
So Scrapy finds nothing for your XPath because of the missing content.
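A quick way to convince yourself of this (an illustrative check, not part of the fix): everything after the # in your start URL is a URL fragment, which is never sent to the server at all, so the request Scrapy makes carries no query for Google to answer:

>>> import urlparse
>>> url = 'https://www.google.com/#q=site:www.linkedin.com%2Fpub+intext:(security+or+jsp)+and+(power+or+utility)'
>>> urlparse.urlsplit(url).path      # all the server sees of the path
'/'
>>> urlparse.urlsplit(url).fragment  # the query lives in the fragment, handled by JavaScript in the browser
'q=site:www.linkedin.com%2Fpub+intext:(security+or+jsp)+and+(power+or+utility)'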

Error when uploading file with OpenMeetings' importFile REST method

I installed OpenMeetings 3.0.3 to test access via the REST interface, and everything worked fine for the methods in UserService and RoomService. But when I try to upload a PDF file with the importFile method (FileService), OpenMeetings returns a FileImportError object stating that the file is damaged and that this may have occurred during the file transfer via HTTP.
When I import the same file using OpenMeetings' Flex application, everything works. I'm using Ruby to call OpenMeetings' importFile method, and to check whether my application was at fault I also called the method from Firefox and got the same error.
I am using the following method call (sample only, not the real ruby code):
importFile(SID, externalUserId, externalFileId, externalType, room_id, isOwner, path,
           parentFolderId, fileSystemName)
SID = one string with the ID of the session
externalUserId = 'extuser' (string)
externalType = 'exttype' (string)
room_id = 2 (existing room in OpenMeetings)
isOwner = false
path = 'http://10.1.1.25/default.pdf' (The path to the file on an Apache server)
parentFolderId = 0
fileSystemName = 'default.pdf'
I also used Eclipse remote debugging to see what was happening and realized that the problem occurs during the conversion of the received file.
I would appreciate some help to solve the problem.
Thanks,
Fernando

Box API 2.0: Unable to Download

I'm testing out the new API, but having no luck downloading a test image file. The file exists, is accessible through the web UI, and is retrievable using the v1.0 API.
I'm able to access the metadata ("https://api.box.com/2.0/files/{fileid}") using both commandline curl and pycurl. However, calls to "https://api.box.com/2.0/files/{fileid}/data" bring back nothing. An earlier post (5/1) received the answer that the download feature had a bug and that "https://www.box.com" should be used as the base URL in the interim. That, however, just provokes a 404.
Please advise.
You should be able to download via http://api.box.com/2.0/files/<fileID>/content ... Looks like we have a bug somewhere in the backend. Hope to have it fixed soon.
Update 11/13/2012 -- This got fixed at least a month ago. Just updated the URL to our newer format
For me it works when it's /content instead of /data... Python code below:
import requests

# get_file_id, infoprint, uni_get_id, <HEADERS> and <PROXIES> are my own helpers/placeholders
fileid = str(get_file_id(filenumber))
url = "https://api.box.com/2.0/files/1790744170/content"
r = requests.get(url=url, headers=<HEADERS>, proxies=<PROXIES>)
infoprint("Downloading...")
file_received = r.content
filename = uni_get_id(fileid, "name", "file")
f = open(filename, 'wb')  # binary mode, since the content is an image
infoprint("Writing...")
f.write(file_received)
f.close()

How to post a file in grails

I am trying to use HTTP to POST a file to an outside API from within a grails service. I've installed the rest plugin and I'm using code like the following:
def theFile = new File("/tmp/blah.txt")
def postBody = [myFile: theFile, foo: 'bar']

withHttp(uri: "http://picard:8080/breeze/project/acceptFile") {
    def html = post(body: postBody, requestContentType: URLENC)
}
The post works, however, the 'myFile' param appears to be a string rather than an actual file. I have not had any success trying to google for things like "how to post a file in grails" since most of the results end up dealing with handling an uploaded file from a form.
I think I'm using the right requestContentType, but I might have missed something in the documentation.
POSTing a file is not as simple as what you have included in your question (sadly). It also depends on what the API you are calling expects; e.g., some APIs expect files as base64-encoded text, while others accept them as MIME multipart.
Since you are using the rest plugin, which as far as I can recall uses Apache HttpClient, I think this link should provide enough info to get you started (assuming you are dealing with MIME multipart). It shouldn't be too hard to adapt it to your API and perhaps make it a bit more 'Groovy'.
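Just to illustrate the difference between the two shapes (a sketch in Python with the requests library rather than Grails, and assuming your acceptFile endpoint takes multipart form data, which I can't verify):

import base64
import requests

# MIME multipart: the file travels as a named form part with a filename
with open('/tmp/blah.txt', 'rb') as f:
    requests.post('http://picard:8080/breeze/project/acceptFile',
                  files={'myFile': f}, data={'foo': 'bar'})

# base64: the file content is encoded and sent as an ordinary text field
with open('/tmp/blah.txt', 'rb') as f:
    encoded = base64.b64encode(f.read())
requests.post('http://picard:8080/breeze/project/acceptFile',
              data={'myFile': encoded, 'foo': 'bar'})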

Nothing except "None" returned for my Python web.py Facebook app when I turn on "OAuth 2.0 for Canvas"

I am a beginning Facebook app developer, but I'm an experienced developer. I'm using web.py as my web framework, and to make matters a bit worse, I'm new to Python.
I'm running into an issue, where when I try to switch over to using the newer "OAuth 2.0 for Canvas", I simply can't get anything to work. The only thing being returned in my Facebook app is "None".
My motivation for turning on OAuth 2.0 is that it sounds like Facebook is going to force it by July, and I might as well learn it now and not have to rewrite it in a few weeks.
I turned on "OAuth 2.0 for Canvas" in the Advanced Settings, and rewrote my code to look for "signed_request" that is POSTed to my server whenever my test user tries to access my app.
My code is the following (I've removed debugging statements and error checking for brevity):
#!/usr/bin/env python
import base64
import web
import minifb
import urllib
import json

FbApiKey = "AAAAAA"
FbActualSecret = "BBBBBB"
CanvasURL = "http://1.2.3.4/fb/"
RedirectURL = "http://apps.facebook.com/CCCCCCCC/"
RegURL = 'https://graph.facebook.com/oauth/authorize?client_id=%s&redirect_uri=%s&type=user_agent&display=page' % (FbApiKey, RedirectURL)

urls = (
    '/fb/', 'index',
)
app = web.application(urls, locals())

def authorize():
    args = web.input()
    signed_request = args['signed_request']
    # split the signed_request at the .
    strings = signed_request.split('.')
    hmac = strings[0]
    encoded = strings[1]
    # since urlsafe_b64decode requires padding, add the proper padding
    numPads = len(encoded) % 4
    encoded = encoded + "=" * numPads
    unencoded = base64.urlsafe_b64decode(str(encoded))
    # convert the signed request into a dictionary
    signedRequest = json.loads(unencoded)
    try:
        # try to find the oauth_token; if it's not there,
        # redirect to the login page
        access_token = signedRequest['oauth_token']
        print(access_token)
    except:
        print("Access token not found, redirect user to login")
        redirect = "<script type=\"text/javascript\">\ntop.location.href=\"" + RegURL + "\";\n</script>"
        print(redirect)
        return redirect
    # Do something on the canvas page
    returnString = "<html><body>Hello</body></html>"
    print(returnString)

class index:
    def GET(self):
        authorize()
    def POST(self):
        authorize()

if __name__ == "__main__":
    app.run()
For the time being, I want to concentrate on the case where the user is already logged in, so assume that oauth_token is found.
My question is: Why is my "Hello" not being outputted, and instead all I see is "None"?
It appears that I'm missing something very fundamental, because I swear to you, I've scoured the Internet for solutions, and I've read the Facebook pages on this many times. Similarly, I've found many good blogs and stackoverflow questions that document precisely how to use OAuth 2.0 and signed_request. But the fact that I am getting a proper oauth_token, but my only output is "None" makes me think there is something fundamental that I'm doing incorrectly. I realize that "None" is a special word in python, so maybe that's the cause, but I can't pin down exactly what I'm doing wrong.
When I turn off OAuth 2.0, and revert my code to look for the older POST data, I'm able to easily print stuff to the screen.
Any help on this would be greatly appreciated!
How embarrassing!
In my authorize function I return a string, but since class index is what calls authorize, the string also has to be returned from the handler, not just from authorize. If I return authorize()'s return value from GET and POST, it works.
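In other words (a minimal sketch of the fix just described), the handlers only need to pass the string through:

class index:
    def GET(self):
        return authorize()

    def POST(self):
        return authorize()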