OK, I will keep it simple.
I ran the following query in the YQL console:
select * from html where url="https://twitter.com/laurenlemon/status/470403949980549121"
The Twitter page in the query is a list of tweets, which I want to pull using YQL.
The response in the console contained only the HTML tags and the contents of a few of them; not a single tweet from any user was visible in any HTML element of the response.
I don't know what I am doing wrong.
OK guys, I figured it out. YQL can only scrape content that is present in the initially loaded HTML, not content that is loaded afterwards by AJAX-like requests, so I had to go with Selenium together with PhantomJS, which can emulate a real browser, navigate to sites, and scrape them.
Anyone looking to scrape AJAX-loaded content can refer to the Selenium docs here; it's very easy to use, with a step-by-step guide to scraping AJAX content.
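For anyone who wants a starting point, here is a minimal sketch of the headless-browser approach in Python. It is not the exact setup described above: it assumes the Selenium 4 Python bindings and a local Chrome install, and it uses headless Chrome instead of PhantomJS, since newer Selenium releases no longer ship a PhantomJS driver. The function name `fetch_rendered_html` and the `wait_css` selector are made up for illustration; you would pass whatever CSS selector matches the AJAX-loaded element on your page.

```python
def fetch_rendered_html(url, wait_css, timeout=10):
    """Load `url` in headless Chrome and return the DOM once `wait_css` matches."""
    # Imported lazily so this file can be loaded even where Selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # assumes a recent Chrome build
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the AJAX-loaded element exists, or raise TimeoutException.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        # page_source reflects the DOM after scripts have run.
        return driver.page_source
    finally:
        driver.quit()
```

The explicit wait is the important part: without it, `page_source` may be read before the AJAX requests have finished, which gives the same empty markup as a plain HTTP fetch.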
Related
I'm working on a URL preview widget, so I'd like to extract the meta tags from the HTML of a given URL.
However, the problem is that websites like Twitter don't return the entire HTML when they detect that no JavaScript engine is available (e.g. when doing a plain GET request with the http package).
So, I'd like to know if there's any workaround for these cases, for example, using some kind of headless browser to get the entire HTML.
Thanks!
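Once the full HTML is available (from a headless browser or otherwise), the extraction step itself is small. A rough sketch in Python using only the standard library; `MetaTagParser` and `extract_meta` are names invented for this example, and it keeps the first value it sees for each `name` or `property`:

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta> tags into a dict keyed by `property` or `name`."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing <meta ... /> tags.
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        if key and content is not None:
            # Keep the first occurrence of each key.
            self.meta.setdefault(key, content)


def extract_meta(html_text):
    """Return a dict of meta tag values found in `html_text`."""
    parser = MetaTagParser()
    parser.feed(html_text)
    return parser.meta
```

This only parses whatever markup it is given, so it does not solve the JavaScript problem on its own; it just shows that the preview data is easy to pull out once the rendered HTML is in hand.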
I've got an old ASPX+XML website created by an external agency here. I only have access to sections of the XML as the web.config is locked.
I want to crawl this site to scrape the pages and capture the relational data. I can do a blank search which returns all the data, and from there a web crawler would be fine. However, I cannot find a web crawler that will submit the search. I've tried a script that submits the form on page load, but this still does not work (I guess it's not fast enough).
The URL does not contain the query string (so I can't just do a blank search and copy the results URL, for example).
Any ideas?
Hi, I have a URL that opens a page in XML format. There is no style information for the page, so when I try to use
assertTextPresent();
to verify some text on the page, it shows the error
Couldn't access document.body. Is this HTML page fully loaded?
Is there any other way to verify that the page is fully loaded?
I am using Perl with Selenium RC.
I am using the open function to launch the page.
This is not really a Perl question. Checking whether a page has loaded completely is a client-side concern, and you can use JavaScript to handle it. With AJAX, check the XMLHttpRequest object for a readyState of 4 and an HTTP status of 200, and you are pretty much done.
Is it possible to have multiple og:title (etc.) tags on a page?
I have 4 different products on the same page (URL: site/#theblock).
I want my users to share each product with the correct link on Facebook.
I tried jQuery, but it's not working.
I'm using WordPress as the CMS.
Any idea how to do it?
Thanks
You need to use some server-side technology to dynamically change the og tags based on query string parameters. Hash fragments (the part of the URL after #) will not work, since they are only used by the browser and are never sent to the server.
The reason to set the og tags server side is that Facebook's scraper does not run JavaScript, so your server needs to produce the correct tags when it serves the page.
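To illustrate the idea, here is a small sketch in Python (on WordPress you would do the equivalent in PHP). Everything in it is hypothetical: the `product` query parameter, the `PRODUCTS` table, the example.com URLs, and the `render_og_tags` helper are made up to show the shape of the logic, which is to read the query string on the server and emit the matching tags.

```python
from html import escape
from urllib.parse import parse_qs

# Hypothetical product data; a real site would load this from its database.
PRODUCTS = {
    "1": {"title": "Red Mug", "url": "https://example.com/?product=1"},
    "2": {"title": "Blue Mug", "url": "https://example.com/?product=2"},
}


def render_og_tags(query_string, default_title="Our products"):
    """Return og meta tags for the product named in the query string."""
    params = parse_qs(query_string)
    product_id = params.get("product", [""])[0]
    product = PRODUCTS.get(product_id)

    title = product["title"] if product else default_title
    safe_title = escape(title, quote=True)
    tags = ['<meta property="og:title" content="' + safe_title + '">']

    if product:
        # Each product gets its own shareable URL with a query parameter,
        # because the scraper never sees anything after '#'.
        safe_url = escape(product["url"], quote=True)
        tags.append('<meta property="og:url" content="' + safe_url + '">')
    return "\n".join(tags)
```

With this in place, the share link for each product would be something like `site/?product=1` rather than `site/#theblock`, so the scraper requests a URL that the server can tell apart.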
I've been reading guides and examples for hours, but I can't get this to work. I tried using all the HTML meta tags, such as title, description, and the og: properties. I also tried the link sharer, and I created a new blank page containing just the info I want to share on Facebook in order to test. I also tried generating a random URL in PHP so that the URL variable is always different (both the URL to share and the URL of the main page containing the script). I also ran the URL through the URL linter many times to clear Facebook's cache. It always gives me the site's domain title, or the URL itself, as the shared title and description. I don't know what to do.
The main website is built with Joomla. In Joomla's index file I added a PHP include that runs if the URL has the variable "articolo" id. The included PHP page has the regular head, body, etc. So maybe Facebook checks Joomla's main meta tags first? So now I tried opening a popup with just the page for sharing. Look here: link
It's possible that the title is locked in, meaning that after X number of likes Facebook doesn't allow you to change it anymore. Can you give us an example URL you're having issues with?
EDIT
Ok, now the link you provided shows some very interesting output. http://modernolatina.it/wjs/index.php?option=com_content&view=article&id=96&Itemid=258&autore=6&articolo=6
First, your web server, instead of sending back a 200 code, is sending back a 500 code.
Second, the HTML your web server is sending back has two HTML tags (do a View Source on the returned content).
Fix up those two issues and I think the linter will be happier with your page.
Test your page here:
http://developers.facebook.com/tools/debug
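If you want to check for the two problems above yourself before re-running the debugger, something like the following Python sketch would do it. The helper names `count_open_tags` and `check_page` are invented for this example, and it only looks at the status code and the number of opening `<html>` tags, nothing else.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags with a given name (names are lowercased)."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.count += 1


def count_open_tags(html_text, tag_name="html"):
    """Return how many opening `tag_name` tags appear in `html_text`."""
    counter = TagCounter(tag_name)
    counter.feed(html_text)
    return counter.count


def check_page(status_code, html_text):
    """Return a list of problems a scraper is likely to trip over."""
    problems = []
    if status_code != 200:
        problems.append(f"HTTP status is {status_code}, expected 200")
    html_tags = count_open_tags(html_text)
    if html_tags != 1:
        problems.append(f"found {html_tags} <html> opening tags, expected 1")
    return problems
```

To feed it a live page you could fetch the URL with `urllib.request.urlopen`; note that a 500 response is raised as `urllib.error.HTTPError`, whose `code` attribute and `read()` method give you the status and body to pass in.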