I am using Python 2.7 and I am trying to extract the personal IDs of the people who liked my Facebook page photos. My code is:
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
htmltext = br.open("url").read()  # "url" is a placeholder for the photo page URL
soup = BeautifulSoup(htmltext, "html.parser")
search = soup.findAll('div', attrs={'class': '_5j0e fsl fwb fcb'})
print search
But when I run this code I get an empty list [ ]. Also, when I run the same code with print soup instead of print search, I get the HTML, but the IDs are not there; I even used Ctrl+F to look for them, but they are not in the output, so it seems my code didn't fetch those parts at all.
Thank you!
Scraping is not allowed on Facebook; you MUST use the Graph API to get data. It is a pretty simple API call: /post-id/likes
More information: https://developers.facebook.com/docs/graph-api/reference/v2.5/object/likes
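For example, a minimal sketch in Python with the requests library; the post ID and access token below are placeholders you would replace with your own:

import requests

POST_ID = "123_456"              # placeholder post ID
ACCESS_TOKEN = "<ACCESS_TOKEN>"  # placeholder token

resp = requests.get(
    "https://graph.facebook.com/v2.5/%s/likes" % POST_ID,
    params={"access_token": ACCESS_TOKEN},
)
resp.raise_for_status()

# Each entry in "data" carries the liker's id and name.
for like in resp.json().get("data", []):
    print(like["id"], like["name"])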
I have since figured out how to use cURL in Terminal. When I enter a cURL query with parameters in Terminal, I receive a huge block of unindented text containing the ad ID, ad snapshot URL, and ad delivery start time. Unfortunately, none of the ad snapshot URLs work.
Facebook says they're supposed to look something like this:
https://www.facebook.com/ads/archive/render_ad/?id=123&access_token=<ACCESS_TOKEN>
Mine look like this (I replaced my access token with Xs):
www.facebook.com/ads/archive/render_ad/?id=215976033116557&access_token=XXXX
Even when I reformat the URL to match Facebook's example, the page reads: "This page isn't available. The link you followed may be broken, or the page may have been removed."
Anyone know how to solve this by chance?
Update: I exported the JSON result to R using rjson, and the links now work:
install.packages("rjson")
library("rjson")

# Read in the JSON results saved from the cURL call.
result <- fromJSON(file = "myresults.json")
print(result)

# Flatten the parsed list into a data frame for easier inspection.
json_data_frame <- as.data.frame(result)
print(json_data_frame)
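For completeness, the original cURL step can also be reproduced in Python, where json.dumps pretty-prints the otherwise unindented response. The endpoint and field names below follow the Ad Library API as described above; the search term, country filter, and token are placeholder assumptions:

import json
import requests

ACCESS_TOKEN = "XXXX"  # placeholder -- substitute a real Ad Library token

resp = requests.get(
    "https://graph.facebook.com/ads_archive",
    params={
        "access_token": ACCESS_TOKEN,
        "search_terms": "example",          # assumed search term
        "ad_reached_countries": "['US']",   # assumed country filter
        "fields": "id,ad_snapshot_url,ad_delivery_start_time",
    },
)
# Indented output instead of one huge block of text.
print(json.dumps(resp.json(), indent=2))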
Right now I am trying to access Yahoo with Python, and I am not sure why I can't seem to log in.
My intended flow is:
go to Yahoo -> go to the login page -> enter username -> press the submit button -> enter password -> press the submit button.
Please let me know where I have made a mistake and why my code doesn't seem to work. Any Python alternatives for logging in to Yahoo that are not Selenium-based would also be appreciated.
"""Example app to login to Yahoo using the StatefulBrowser class."""
from __future__ import print_function
import argparse
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser(
soup_config={'features': 'lxml'},
raise_on_404=True,
user_agent='MyBot/0.1: mysite.example.com/bot_info',
)
# Uncomment for a more verbose output:
browser.set_verbose(2)
browser.session.cookies.keys()
browser.open("https://login.yahoo.com/config/login?.src=fpctx&.intl=ca&.lang=en-CA&.done=https%3A%2F%2Fca.yahoo.com")
form1 = browser.select_form(nr=0)
browser['username'] = 'beta#gmail.com'
response = browser.submit_selected()
print(response.content)
browser.select_form(nr=0)
browser['passwd'] = 'badPass'
response = browser.submit_selected()
print(response)
page = browser.get_current_page()
A quick look at the login page source shows that it uses JavaScript quite extensively, so it is very likely that the form submission is handled by JavaScript rather than a plain HTML POST, though I can't point to a single line of code that proves it.
Since MechanicalSoup does not support JavaScript, you may need to find an alternate tool that does, such as Selenium. See this FAQ for more information.
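If you do move to Selenium, a minimal sketch of the same two-step flow might look like the following; the element IDs (login-username, login-signin, login-passwd) are assumptions based on Yahoo's login page at the time of writing and may have changed:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("https://login.yahoo.com/")
wait = WebDriverWait(driver, 10)

# Step 1: username, then submit (IDs are assumptions; inspect the page to confirm).
wait.until(EC.presence_of_element_located((By.ID, "login-username"))).send_keys("beta@gmail.com")
driver.find_element(By.ID, "login-signin").click()

# Step 2: password, then submit.
wait.until(EC.presence_of_element_located((By.ID, "login-passwd"))).send_keys("badPass")
driver.find_element(By.ID, "login-signin").click()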
I am trying to get the Facebook name of a user from their Facebook ID.
For example, if my Facebook name is: Mor Amit
and my Facebook ID is: 875810135770071
I want to get the Facebook username, which is: mor.amit.3
Do you know how I can do it?
Thank you
*I am using Java
Since you didn't mention which language you are using, I'll use PHP, since it's easier and quicker to learn (my opinion).
If you have the ID, then this link will take you to the user profile:
www.facebook.com/profile.php?id=<user id digits only>
If you have the username then you can use:
www.facebook.com/<user name>
Once you have loaded the page, let's say using PHP (cURL, file_get_contents), you can search inside the content (DOM) using regex - dirty parsing:
FB user id regex pattern: /data-profileid="([0-9]{0,})"/i
FB user real name regex pattern: /fb-timeline-cover-name">(.{0,})<\/span>/i
FB user name regex pattern: /URL=\/(.{0,})\?{1}/i
It's better to properly parse the content (you can use Simple HTML DOM) and then traverse the DOM, but that's more complex.
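For illustration, a rough sketch of the same dirty-parsing idea in Python (chosen for consistency with the rest of this thread); the regex patterns are translations of the ones above and assume Facebook's markup at the time of this answer, so they may no longer match, and Facebook may serve a login wall instead of the profile:

import re
import requests

USER_ID = "875810135770071"  # the ID from the question

html = requests.get("https://www.facebook.com/profile.php?id=" + USER_ID).text

# Translations of the PHP patterns above; they target old Facebook markup.
for label, pattern in [
    ("profile id", r'data-profileid="([0-9]*)"'),
    ("real name", r'fb-timeline-cover-name">(.*?)</span>'),
    ("username", r'URL=/(.*?)\?'),
]:
    match = re.search(pattern, html, re.IGNORECASE)
    print(label, "->", match.group(1) if match else "not found")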
Have fun exploring and learning.
I'm not very familiar with APIs, SSL certs, etc., but I would like to access the underlying HTML code of Facebook pages. I'm having trouble using getURL() with Facebook, and ssl.verifypeer = F doesn't work. Here's an example:
library(RCurl)
txt<-getURL("https://www.facebook.com/nytimes/", ssl.verifypeer = FALSE)
This only returns an empty string:
txt = ""
Does this mean I need to use the Graph API? Can you access the underlying HTML code through the Graph API? Using the Firebug extension for Firefox I can see the HTML code, but I can't access it through R. I'm not interested in specific data like likes or posts, just the HTML code. Any suggestions on how I can access the HTML code of a Facebook page? Thanks in advance.
Use:
txt <- getURLContent("https://www.facebook.com/nytimes/", ssl.verifypeer = FALSE, followlocation = TRUE)
getURL may also work with followlocation = TRUE. It worked on my Linux box but not on a Windows machine.
The new graph search on Facebook lets you search for the current employees of a company using a query token - Current Google employees (for example).
I want to scrape the results page (http://www.facebook.com/search/104958162837/employees/present) via Scrapy.
The initial problem was that Facebook only lets a logged-in Facebook user access the information, so it redirected me to login.php. So, before scraping this URL, I log in via Scrapy and then request the results page. But even though the HTTP response for this page is 200, it does not scrape any data. The code is as follows:
from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest, Request
from scrapy.selector import HtmlXPathSelector

# The graph search results URL from the question.
query = "http://www.facebook.com/search/104958162837/employees/present"

class DmozSpider(BaseSpider):
    name = "test"
    start_urls = ['https://www.facebook.com/login.php']
    task_urls = [query]

    def parse(self, response):
        # Fill in and submit Facebook's login form.
        return [FormRequest.from_response(
            response,
            formname='login_form',
            formdata={'email': 'myemailid', 'pass': 'myfbpassword'},
            callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        return Request(query, callback=self.page_parse)

    def page_parse(self, response):
        hxs = HtmlXPathSelector(response)
        print hxs
        # Note: XPath attribute selectors use @, not # (typo in the original).
        items = hxs.select('//div[@class="_4_yl"]')
        count = 0
        print items
What could I have missed or done incorrectly?
The problem is that the search results (specifically the div initial_browse_result) are loaded dynamically via JavaScript. Scrapy receives the page before those requests have run, so there are no results in it yet.
Basically, you have two options here:
1. Try to simulate these JS (XHR) requests in Scrapy; see:
Scraping ajax pages using python
Can scrapy be used to scrape dynamic content from websites that are using AJAX?
2. Use a combination of Scrapy and Selenium, or Scrapy and mechanize, to load the whole page with its content; see:
Executing Javascript Submit form functions using scrapy in python
this answer
If you go with the first option, you should analyze all the requests made during the page load and figure out which one is responsible for fetching the data you want to scrape.
The second option is pretty straightforward and will definitely work: you use another tool to load the page with its JS-rendered data, then parse it into Scrapy items (a minimal sketch follows below).
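Here is a minimal sketch of that Scrapy-plus-Selenium combination; it assumes a browser session that is already logged in to Facebook, and reuses the XPath from the question:

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector
from selenium import webdriver

driver = webdriver.Firefox()
# Assumes this browser session is already logged in to Facebook.
driver.get("http://www.facebook.com/search/104958162837/employees/present")

# Wrap the JS-rendered source in a response object so Scrapy selectors work on it.
response = HtmlResponse(url=driver.current_url,
                        body=driver.page_source.encode('utf-8'),
                        encoding='utf-8')
hxs = HtmlXPathSelector(response)
items = hxs.select('//div[@class="_4_yl"]')
print(items)
driver.quit()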
Hope that helps.