Scraping data out of Facebook using Scrapy

The new Graph Search on Facebook lets you search for the current employees of a company using a query like "Current Google employees" (for example).
I want to scrape the results page (http://www.facebook.com/search/104958162837/employees/present) via Scrapy.
The initial problem was that Facebook only allows a logged-in user to access the information, so it redirected me to login.php. So, before scraping this URL, I logged in via Scrapy and then requested the results page. But even though the HTTP response for this page is 200, it does not scrape any data. The code is as follows:
import sys
from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest, Request
from scrapy.selector import HtmlXPathSelector

# the graph search results page to scrape
query = 'http://www.facebook.com/search/104958162837/employees/present'

class DmozSpider(BaseSpider):
    name = "test"
    start_urls = ['https://www.facebook.com/login.php']
    task_urls = [query]

    def parse(self, response):
        return [FormRequest.from_response(response,
                                          formname='login_form',
                                          formdata={'email': 'myemailid', 'pass': 'myfbpassword'},
                                          callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        return Request(query, callback=self.page_parse)

    def page_parse(self, response):
        hxs = HtmlXPathSelector(response)
        print hxs
        items = hxs.select('//div[@class="_4_yl"]')  # note @class, not #class
        count = 0
        print items
What could I have missed or done incorrectly?

The problem is that the search results (specifically the div initial_browse_result) are loaded dynamically via JavaScript. Scrapy receives the page before those requests fire, so the results are not there yet.
Basically, you have two options here:
try to simulate these JS (XHR) requests in Scrapy, see:
Scraping ajax pages using python
Can scrapy be used to scrape dynamic content from websites that are using AJAX?
use a combination of Scrapy and Selenium, or Scrapy and mechanize, to load the whole page including the JS-generated content, see:
Executing Javascript Submit form functions using scrapy in python
If you go with the first option, you should analyze all the requests made during the page load and figure out which one is responsible for fetching the data you want to scrape.
The second option is pretty straightforward and will definitely work - you use another tool to fetch the page with its JS-loaded data, then parse it into Scrapy items.
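To illustrate the parsing half of the second option: once a browser tool like Selenium hands you the fully rendered HTML (e.g. via driver.page_source), extraction is ordinary parsing. Here is a minimal Python 3 stdlib sketch; the HTML sample is invented for illustration, and the class name "_4_yl" is taken from the question and may change at any time:

```python
from html.parser import HTMLParser

class ResultDivParser(HTMLParser):
    """Collects the text inside <div class="_4_yl"> elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0       # > 0 while inside a matching div
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # track nested divs so we know when the matching one closes
            self.depth += 1 if tag == 'div' else 0
        elif tag == 'div' and dict(attrs).get('class') == '_4_yl':
            self.depth = 1
            self.results.append('')

    def handle_endtag(self, tag):
        if self.depth and tag == 'div':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data

# Invented stand-in for the HTML a browser tool would hand back
rendered = '<div class="_4_yl">Alice</div><div class="_4_yl">Bob</div>'
parser = ResultDivParser()
parser.feed(rendered)
print(parser.results)  # -> ['Alice', 'Bob']
```

In a real spider you would feed the selector the same rendered HTML instead of the raw Scrapy response, then build your items from the extracted strings.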
Hope that helps.

Related

Login to Roblox using username only, like on reward sites

Hi, I know this is probably not the place to ask this, but I'm stumped at the moment, as I can't seem to find any reference or docs relating to working with Roblox. Sure, they have an auth route etc., but nothing detailed. I want to log users in using just a username and give them Robux based on different actions they take on the site, like completing surveys. Can anyone please give me links to some resources that would come in handy for this particular purpose? Thank you.
Roblox does not support any OAuth system, but you can still use the HttpService:GetAsync() function to get strings/data from a web site (if the page displays that text). The way to keep the data you receive from the URL safe is to store the script with the HttpService:GetAsync() call on the server side (for example, in ServerScriptService). You also need to allow HTTP requests under Game Settings -> Security in Roblox Studio. Script example:
local HttpService = game:GetService("HttpService")
local stringg = HttpService:GetAsync("https://pastebin.com/raw/k7S6Ln9R")
print(stringg)
-- Should output the data written on the web page; you can use any web page to store data, even your own
The only two things left are to make your web server rewrite the page, or to use a database on your web site by passing its URL to GetAsync(). Now you just need to parse the string returned from the URL to use its data.
The pastebin URL that I passed to GetAsync() is just an example; you can use whatever you want, but again, you need to parse the data you get from the URL, or just build the string in the same format as the page and check whether it appears in the response. Example:
local writtenpass = game.Players["anyplayer"].PlayerGui.TestGui.Frame.PasswordTextBox.Text
local writtenlogin = game.Players["anyplayer"].PlayerGui.TestGui.Frame.LoginTextBox.Text
local HttpService = game:GetService("HttpService")
local response = HttpService:GetAsync("https://pastebin.com/raw/k7S6Ln9R")
-- string.find returns the match position, or nil if the pair is not on the page
local istrue = string.find(response, "{ login = " .. writtenlogin .. " pass = " .. writtenpass .. " }")
print(istrue)
if istrue then
print("exist!")
-- whatever actions to take if the login and pass exist
end
You can view the page here: https://pastebin.com/raw/k7S6Ln9R

Import API not working in Sisense

I was trying to use the dashboard import API from v1.0, which can be found in the REST API reference. I logged in to http://localhost:8083/dev/api/docs/#/ , gave the correct authorization token and a .dash file in the body, plus a 24-character importFolder, and hit the Run button to fire the API. It returns 201 as the HTTP response, which means the request was successful. However, when I go back to the homepage, I don't see any new dashboard imported into the said folder. I have tried both cases: where the importFolder already exists (created manually by me), and where it does not, in which case I expect the API to create it for me. Neither of these, however, creates/imports the dashboard.
A few comments that should help you resolve this:
When running the command from the interactive API reference (Swagger), you don't need the authentication token, because you're already logged in with an active session.
Make sure the JSON of your dashboard is valid by saving it as a .dash file and importing it via the UI.
The folder field is optional - if you leave it blank, the dashboard is imported to the root of your navigation/folders panel.
If you'd like to import to a specific folder, you'll need to provide the folder ID, not its name. The ID can be found in several ways, such as by using the /api/v1/folders endpoint, where you can provide a name filtering field and use the oid property of the returned object as the value for the folder field in the import endpoint.
If you still can't get this to work, use Chrome's developer tools to look at the outgoing request when you import from the UI, and compare that request (headers, body and path) to what you're doing via Swagger in order to find the issue.
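As a sketch of the folder-ID lookup step described above: once you have the JSON body returned by the /api/v1/folders endpoint, you pick the oid of the folder whose name matches. The sample response below is invented (the real endpoint returns more fields), but the name/oid shape follows the answer:

```python
import json

def resolve_folder_oid(folders_json, folder_name):
    """Given the JSON body returned by /api/v1/folders, return the oid
    of the folder whose name matches, or None if no folder matches."""
    for folder in json.loads(folders_json):
        if folder.get("name") == folder_name:
            return folder.get("oid")
    return None

# Invented sample shaped like the folders endpoint's response
sample = '[{"oid": "5a1b2c3d4e5f6a7b8c9d0e1f", "name": "Sales"}]'
print(resolve_folder_oid(sample, "Sales"))  # -> 5a1b2c3d4e5f6a7b8c9d0e1f
```

The returned oid is the value you would put in the import endpoint's folder field instead of the folder's display name.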

parsing facebook html with python 2.7

I am using python 2.7 and I am trying to extract the personal ids of the people who liked my facebook page photos. My code is:
import urllib2
from bs4 import BeautifulSoup
import mechanize
br = mechanize.Browser()
htmltext = br.open("url").read()
soup = BeautifulSoup(htmltext)
search = soup.findAll('div',attrs={'class':'_5j0e fsl fwb fcb'})
print search
But when I run this code I get empty brackets [ ]. Also, when I run the same code with "print soup" instead of "print search", I get the HTML, but the ids are not there - I even used Ctrl+F to look for them, but they aren't in the output - so it seems my code didn't fetch those parts at all.
Thank you!
Scraping is not allowed on Facebook; you MUST use the Graph API to get the data. It is a pretty simple API call: /post-id/likes
More information: https://developers.facebook.com/docs/graph-api/reference/v2.5/object/likes
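Extracting the IDs from that call's response is then straightforward. A minimal sketch, assuming the documented response shape (a "data" list of objects with "id" and "name"); the response body below is invented for illustration:

```python
import json

# Invented sample shaped like a Graph API /post-id/likes response
response_body = '''
{
  "data": [
    {"id": "1234567890", "name": "Alice Example"},
    {"id": "9876543210", "name": "Bob Example"}
  ]
}
'''

liker_ids = [entry["id"] for entry in json.loads(response_body)["data"]]
print(liker_ids)  # -> ['1234567890', '9876543210']
```

In a real call you would fetch the body with an HTTP client and an access token, and follow the paging cursors for posts with many likes.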

Facebook Javascript API "This authorization code has been used" on quick screen refresh (F5 or COMMAND+R)

I'm using the Facebook Javascript API for login in conjunction with the official Facebook PHP SDK on my server to execute the two following lines of code:
$helper = $fb->getJavaScriptHelper();
$accessToken = $helper->getAccessToken();
With the token, I'm further able to execute this code which gets the necessary details I need on the server:
$fb->setDefaultAccessToken($accessToken);
$response = $fb->get('/me?locale=en_US&fields=name,first_name,last_name,email,gender');
If I refresh the webpage I'm working with and let it fully load everything works correctly and I'm able to print to screen all of the details I get back in $response.
The problem I'm having, however, is that if I quickly refresh the screen (either by hitting F5 on Windows machines or COMMAND+R on Macs) before the Facebook javascript code executes I get the following thrown error from the Facebook API:
"This authorization code has been used"
How do I avoid this? Do I wrap the Facebook code on the client side in a jQuery document ready function? I hesitate to do that because I've been told that the Facebook Javascript code is good to go as a stand-alone script that is intelligent enough to know when the document is loaded.
I'm about ready to throw in the towel and just code a manual login process that totally bypasses the Facebook Javascript API. Thanks for your help.
Put the access token into $_SESSION['access_token'] and then redirect to another page to get the data.
Login Page
Callback Page (save into the session variable)
Working Page (work with the session variable)
See more: https://benmarshall.me/facebook-php-sdk/#example-login

How to get user information of the user while running load tests in locust

I provide number of users=12 and hatch rate=2.
How can I get the user id(s) of all users hitting my web page? I would like to do some customizations based on the names of the objects being created (say, an article title).
How do I pass user information (say, a user id) while creating new articles, so that if I run a test with 12 users, I know which user created which article?
from locust import HttpLocust, TaskSet, task

class UserBehavior(TaskSet):
    @task
    def create_new_article(self):
        with self.client.request('post', "/articles", data={"article[title]": "computer", "article[content]": "pc"}, catch_response=True) as response:
            print response
How can I get user id(s) of all users hitting my web page?
This depends on how your web server is set up. What exactly is a user ID in your application's context?
I'll proceed with the assumption that you have some mechanism by which you can generate a user ID.
You could have your client side get the user ID(s) (using Javascript, for example) and then pass each ID along to the server in an HTTP request, where you could define a custom header to carry the user ID for that request.
For example, if you're using Flask/Python to handle all the business logic of your web application, then you might have some code like:
from flask import Flask, request

app = Flask(__name__)

@app.route("/articles", methods=["POST"])
def do_something_with_user_id():
    do_something(request.headers.get("user-id"))
    return "OK"

if __name__ == "__main__":
    app.run()
How to pass user information (say user id) while creating new articles?
You could change your POST request line in your Locust script to something like:
with self.client.request('post',"/articles",headers=header_with_user_id,data={"article[title]":"computer","article[content]":"pc"},catch_response=True) as response:
where header_with_user_id could be defined as follows:
header_with_user_id = { "user-id": <some user ID>}
where <some user ID> would be a stringified version of whatever your mechanism for obtaining the user ID returns.
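One simple way to generate such an ID per simulated user, sketched with the standard library (the "user-id" header name follows the answer above; the choice of a UUID is an assumption - any unique string works):

```python
import uuid

def make_user_header():
    """Build the custom header for one simulated user.
    Using uuid4 here is just one convenient way to get a distinct
    ID per simulated user; swap in your own mechanism as needed."""
    return {"user-id": str(uuid.uuid4())}

header_with_user_id = make_user_header()
print(header_with_user_id)
```

Each simulated user would typically call make_user_header() once (for example, in the TaskSet's on_start) and reuse the same header for all of its requests, so that every article it creates is traceable back to it.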