Best way to store data for a Greasemonkey-based crawler? - persistence

I want to crawl a site with Greasemonkey and wonder if there is a better way to temporarily store values than with GM_setValue.
What I want to do is crawl my contacts in a social network and extract the Twitter URLs from their profile pages.
My current plan is to open each profile in its own tab, so that it looks more like a normal browsing person (i.e. CSS, scripts and images will be loaded by the browser), then store the Twitter URL with GM_setValue. Once all profile pages have been crawled, create a page using the stored values.
I am not so happy with the storage option, though. Maybe there is a better way?
I have considered inserting the user profiles into the current page so that I could process them all with the same script instance, but I am not sure whether XMLHttpRequest looks indistinguishable from normal user-initiated requests.
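For reference, a minimal sketch of what that GM_setValue step could look like; the @include pattern and the link selector are placeholders, not taken from any particular social network:

// ==UserScript==
// @name     Collect Twitter URLs
// @include  http://socialnetwork.example.com/profile/*
// ==/UserScript==

// Find the Twitter link on the profile page (the selector is a guess).
var link = document.querySelector('a[href*="twitter.com"]');
if (link) {
    // Keep all collected URLs under one key as a JSON-encoded array.
    var urls = JSON.parse(GM_getValue('twitterUrls', '[]'));
    if (urls.indexOf(link.href) === -1) {
        urls.push(link.href);
        GM_setValue('twitterUrls', JSON.stringify(urls));
    }
}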

I've had a similar project where I needed to get a whole lot of data (invoice line items) from a website and export it into an accounting database.
You could create a .aspx (or PHP etc) back end, which processes POST data and stores it in a database.
Any data you want from a single page can be stored in a form (hidden using style properties if you want), using field names or IDs to identify the data. Then all you need to do is make the form action an .aspx page and submit the form using JavaScript.
(Alternatively you could add a submit button to the page, allowing you to check the form values before submitting to the database).
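A rough sketch of that approach; the collector endpoint, field name and selector are placeholders:

// Build a hidden form, fill it with data scraped from the page, and POST it
// to your own back end for storage.
var form = document.createElement('form');
form.method = 'POST';
form.action = 'https://example.com/collect.aspx'; // your .aspx/PHP back end
form.style.display = 'none';

var link = document.querySelector('a[href*="twitter.com"]');
if (link) {
    var field = document.createElement('input');
    field.type = 'hidden';
    field.name = 'twitterUrl';
    field.value = link.href;
    form.appendChild(field);

    document.body.appendChild(form);
    form.submit();
}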

I think you should first ask yourself why you want to use Greasemonkey for your particular problem. Greasemonkey was developed as a way to modify one's browsing experience -- not as a web spider. While you might be able to get Greasemonkey to do this using GM_setValue, I think you will find your solution to be kludgy and hard to develop. That, and it will require many manual steps (like opening all of those tabs, clearing the Greasemonkey variables between runs of your script, etc).
Does anything you are doing require the JavaScript on the page to be executed? If so, you may want to consider using Perl and WWW::Mechanize::Plugin::JavaScript. Otherwise, I would recommend that you do all of this in a simple Python script. You will want to take a look at the urllib2 module. For example, take a look at the following code (note that it uses cookielib to support cookies, which you will most likely need if your script requires you to be logged into a site):
import urllib2
import cookielib

# Build an opener whose cookie jar persists across requests, so a logged-in
# session carries over from page to page.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
response = opener.open("http://twitter.com/someguy")
responseText = response.read()
Then you can do all of the processing you want using regular expressions.

Have you considered Google Gears? That would give you access to a local SQLite database which you can store large amounts of information in.
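A rough sketch of what that could look like with the Gears database API; it assumes gears_init.js is loaded so google.gears is available, and that twitterUrl holds the value scraped from the current profile:

// Open (or create) a local SQLite database and record one row per profile.
var db = google.gears.factory.create('beta.database');
db.open('crawler-store');
db.execute('CREATE TABLE IF NOT EXISTS profiles (profile TEXT, twitter TEXT)');
db.execute('INSERT INTO profiles VALUES (?, ?)', [location.href, twitterUrl]);

// Later, read everything back to build the results page.
var rs = db.execute('SELECT twitter FROM profiles');
while (rs.isValidRow()) {
    console.log(rs.field(0));
    rs.next();
}
rs.close();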

The reason for wanting Greasemonkey is that the page to be crawled does not really approve of robots. Greasemonkey seemed like the easiest way to make the crawler look legitimate.
Actually, running your crawler through the browser does not make it any more legitimate. You are still breaking the site's terms of use! WWW::Mechanize, for example, is equally well suited to spoof your User-Agent string, but if the site does not allow spiders/crawlers, crawling it that way is just as illegal.

The reason for wanting Greasemonkey is that the page to be crawled does not really approve of robots. Greasemonkey seemed like the easiest way to make the crawler look legitimate.
I think this is the hardest way imaginable to make a crawler look legitimate. Spoofing a web browser is trivially easy with some basic understanding of HTTP headers.
Also, some sites have heuristics that look for clients that behave like spiders, so simply making requests look like they come from a browser doesn't mean they won't know what you are doing.

Related

How to secure querystring/POST details to a third party

I'm basically looking at a security problem between a parent page and an iframe with links to a third party.
I want to send a POST or a GET (doesn't matter which, as I can control the other side) to the third party, but not expose any details within it (say a SID or a user token), and have its HTML content (JS/HTML/images) loaded into the iframe.
I've looked at server-side redirects and creating a proxy using WebClient/WebResponse, and am curious whether there is a good way to do it.
Has anyone ever done this before, or do you think the security is not possible? Hell, tell me even if I'm barking up the wrong tree on how to solve this.
If anybody has any examples on this it would be greatly appreciated.
Cheers,
Jamie
[Edit] Was thinking I might need to add some more details.
Say I have a parent page: https://mycompany.com/ShowThirdParty.
At the moment this has an iframe in it which will hold the content of another component (also owned by me, or more specifically by another team).
Basically I'd like to send some credentials to the content in the iframe in such a way that the external pages can't read it; the iframe is put into a modal (I've done that) and the iframe has the restricted content, with the authentication almost seamless and invisible.
I currently have it working as a GET url generated dynamically via JS and then passed into the iframe src parameter, obviously that isn't secure.
I kind of want some kind of server side redirect across a full url, but I don't even think that's possible.
You could try using AJAX and loading a PHP script (with any parameters to the script encoded/encrypted) to query the 3rd party page and load the response into the iframe. Not really sure how your code is set up, but there should be a way.
It can also be done with the POST method (submit the data to the iframe using POST); as it is HTTPS, the data you send to the iframe is encrypted.
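A rough sketch of that POST-into-the-iframe approach; the endpoint, field name and sessionToken variable are placeholders, and it assumes the modal already contains an iframe named thirdPartyFrame:

// Build a hidden form targeting the iframe, so the credential travels in the
// POST body (over HTTPS) instead of the visible URL.
var form = document.createElement('form');
form.method = 'POST';
form.action = 'https://thirdparty.example.com/content';
form.target = 'thirdPartyFrame'; // the response renders inside the iframe
form.style.display = 'none';

var token = document.createElement('input');
token.type = 'hidden';
token.name = 'sid';
token.value = sessionToken; // credential obtained server-side, never put in the src URL

form.appendChild(token);
document.body.appendChild(form);
form.submit();
document.body.removeChild(form);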

SEO and Javascript Data Load

These days modern sites like Facebook and Gmail are becoming more and more service-oriented.
A main page is loaded, and then AJAX requests fetch all sorts of data and add it to the page. This is also something that is promoted in ASP.NET MVC 4 with the Web API.
So now let's say we want to create a product category page for an e-shop. My understanding is that the way to go with this implementation is to create a nice layout and a Web API that will retrieve all data on request.
So we'll have a url like
/api/Products
that will return JSON with all of our products. Then we can build on this API by adding filters/paging (maybe /api/Products?sort-by=name) or anything else that returns the filtered JSON, passing it back and forth with AJAX requests and offering the user an excellent experience.
My question with this now is what happens with SEO.
So a few years ago, without single-page AJAX/service-oriented sites, we would have
http://website.com/apples/
http://website.com/apples/2/
that would load the list of the apples with pagination.
Now the site would be
http://website.com/apples/
however, it wouldn't load the apples directly; it would load a blank page and call the service
/api/apples
that would return a json and then load the data on the site.
I read this article from Google, https://developers.google.com/webmasters/ajax-crawling/docs/html-snapshot, which didn't convince me. I really don't want to call the service behind the scenes and then do string replacement.
Is it possible to have
http://website.com/apples/
call the service
/api/apples
and load the data while being Google-friendly at the same time?
You have a couple of options. You can use HTML5 pushState to update the URL, but then you will also need to create a version of your site that works with JavaScript turned off.
Another option is to use Google's AJAX crawling specification. I don't know which search providers currently support it, but it should be a good way to at least get into Google's search results.
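A rough sketch of the pushState option; the /api/apples endpoint follows the question, while renderApples() is a placeholder for whatever builds the markup:

// Fetch a page of products from the API, render it, and update the address
// bar so the list keeps a bookmarkable URL.
function loadApples(page) {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/api/apples?page=' + page);
    xhr.onload = function () {
        renderApples(JSON.parse(xhr.responseText)); // placeholder rendering function
        history.pushState({ page: page }, '', '/apples/' + page + '/');
    };
    xhr.send();
}

// Keep the back/forward buttons working.
window.onpopstate = function (e) {
    if (e.state) loadApples(e.state.page);
};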

Send the user to a page along with a error message

I want to set up a login page that I can send a user to from anywhere on the site, and have it display a custom message. I could use a redirect and a msg query param, but is this the best way to do it?
I'm working with node.js but I'm interested in a universal solution.
If you are going for easy, you can just put the data in the URL as GET parameters. But that doesn't look very nice if you want a rather long message, and GET has size restrictions where POST (virtually) hasn't.
For POST data you could use the solution from the question "JavaScript post request like a form submit", but that gives rather messy source code (if you want a somewhat longer text).
You could keep the messages in a database and only send the ID of the message to a PHP page, then fetch it from the database (that's what I would do, but that doesn't mean it's a good idea; I'm just an amateur here!).
You can use jQuery or simply plain JavaScript to extract your message from the URL; the relevant question that links to detailed code: "jquery get querystring from URL".
Then depending on how you want it displayed, apply the extracted string to your situation.
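A rough sketch of the redirect-plus-query-parameter option; the Express route and the element ID on the login page are placeholders, not from the question:

// Server side: bounce the user to the login page with a message in the query string.
app.get('/members-only', function (req, res) {
    res.redirect('/login?msg=' + encodeURIComponent('Please log in to see that page'));
});

// Client side, on the login page: pull the message out of the URL and show it.
var match = /[?&]msg=([^&]*)/.exec(location.search);
if (match) {
    document.getElementById('message').textContent =
        decodeURIComponent(match[1].replace(/\+/g, ' '));
}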

Codeigniter form action with slashes instead of normal GETs?

Hey, so this is one of those questions that seems obvious, and I'm probably going to feel stupid, but here goes:
I'm doing a CodeIgniter site with a search. Think of a Google-type input, where you'd search for "white huskies." I have a search results page that takes a URI (MySite.com/dogs/white huskies), takes the third segment, and performs the search on that term. I'd like this to be done in the URI, not by POST, so my users can bookmark results.
The problem I'm having is how to get that search button directed to MySite.com/dogs/WHATEVER IS IN THE INPUT. How do I get what is in the input into the anchor href? I know I could do this with JavaScript, but I've heard it's bad practice to force people to have JavaScript for things this small.
Thanks for the help!
Read: Form redirect to URL containing query term? - pure HTML or Django
(asked for Django, but answer fits here too)
You could have an intermediate POST page that collects the form inputs and concatenates them into a valid URL which you can then redirect to. I'm not sure whether this is good or bad SEO practice, but I can't see another way of doing this without some JavaScript intervention.
Perhaps you could look at doing the intermediate POST page which takes the values and redirects you to /search/dog/white/huskies, but also have a JavaScript equivalent that does this on the fly on the form submit and does a window.location refresh to the same /search/dog/white/huskies?
Just my 2 pennies worth ;)
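A rough JavaScript sketch of that on-the-fly variant; the form and input IDs are placeholders:

// Build the segment-style URL on submit and navigate to it, so results stay bookmarkable.
document.getElementById('searchForm').onsubmit = function () {
    var term = document.getElementById('searchInput').value;
    window.location = '/dogs/' + encodeURIComponent(term);
    return false; // cancel the normal form submission; we've already navigated
};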
It is possible to have CodeIgniter work with $_GET variables and URI segments securely.
A workaround I have used in the past is to collect the search term using POST, build the required URL for use with URI segments, and then redirect your user to that page.
// redirect() comes from CodeIgniter's URL helper; pass it a site-relative URI
// segment with the term URL-encoded so it survives as part of the path.
$url = 'search/' . urlencode($_POST['query']);
redirect($url);
This shouldn't affect SEO, but something like the URL of a search result is unlikely to have any effect on SEO anyway. Clean URLs are only really meant to be used for permanent content. If you're going to be displaying the search term on the page, remember to use xss_clean(); I've seen a few people make this fatal mistake before.

How does facebook's Share a link feature work?

I'm trying to implement a feature like that: a user inputs a URL, and when displaying that URL I want a custom display (an embedded object if it's a video from YouTube, a thumbnail if it's an image link, and the title and an excerpt of the body if it's a normal link).
How can such a feature be realized?
There is a new idea called oEmbed that a few sites support (Flickr, Vimeo and a few others) that addresses this problem; see the oEmbed site.
Otherwise, just check the site against a list of ones you pick and then pull out the relevant bits to construct an embed link.
I liked the idea of oEmbed a lot, but unfortunately it doesn't have that much adoption yet.
oohEmbed tries to solve this issue by building oEmbed for many websites.
For the feature to work, it needs server interaction; I believe the following scenario is how it works.
Assume that we have the site humanzz.com and that it provides such a feature.
A user enters a URL on humanzz.com's webpage and presses a button like Facebook's preview button.
An AJAX call is made to a dedicated page on humanzz.com.
humanzz.com calls the remote website and gets its data.
The AJAX call now returns the page's data (an oEmbed JSON object).
This involves a lot of server overhead.
I really wanted to do it using JavaScript, as the server's only role is to bypass the Same Origin Policy's restrictions.
oohEmbed allows bypassing the server step: by specifying a callback parameter to oohEmbed, the returned JSON object is passed to a callback function on your page.
An example illustrating this is as follows
Add a script tag dynamically to your page
<script type="text/javascript" src="http://oohembed.com/oohembed/?url=http%3A//www.amazon.com/Myths-Innovation-Scott-Berkun/dp/0596527055/&callback=myCallBack"></script>
This would result in executing myCallBack(oEmbedJSONObject), which is great.
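A rough sketch of what that callback side could look like; the field names follow the oEmbed spec, while the preview container and the userUrl variable holding the user's input are placeholders:

function myCallBack(oembed) {
    var container = document.getElementById('preview');
    if (oembed.type === 'photo') {
        container.innerHTML = '<img src="' + oembed.url + '">';
    } else if (oembed.html) {
        // 'video' and 'rich' responses carry ready-made embed markup
        container.innerHTML = oembed.html;
    } else {
        container.innerHTML = '<a href="' + userUrl + '">' + (oembed.title || userUrl) + '</a>';
    }
}

// Inject the script tag dynamically once the user has entered a URL.
var s = document.createElement('script');
s.src = 'http://oohembed.com/oohembed/?url=' + encodeURIComponent(userUrl) + '&callback=myCallBack';
document.body.appendChild(s);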
The problem with that solution is you still have to have a fallback for websites that don't have oEmbed representations.
For the embedded things, I have been using auto_html (https://github.com/dejan/auto_html) with great success (Vimeo, YouTube, images) and even added SoundCloud support myself. But I am still looking for Facebook-like "thumbnail" generation with an image and text.
I guess you have to construct it by yourself by manually parsing the kind of URL you get.
If it is an image URL, you just have to rescale it, and if the user clicks on it, handle that by opening the original one somehow.
If it is a link to some YouTube video, then take a look at how embedding YouTube videos works. You can just copy the embed code provided by YouTube itself and swap in the URL (or ID) of the video you got from your user.
I never implemented something like that, but I assume it should work somehow like this.
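For instance, a rough sketch of that kind of manual dispatch; the regexes and sizes are illustrative only, not taken from any of the libraries mentioned above:

function renderLink(url) {
    // YouTube watch URL: swap the video id into YouTube's own embed markup.
    var yt = url.match(/youtube\.com\/watch\?.*v=([\w-]+)/);
    if (yt) {
        return '<iframe width="560" height="315" src="http://www.youtube.com/embed/' + yt[1] + '" frameborder="0" allowfullscreen></iframe>';
    }
    // Direct image link: show a scaled-down preview.
    if (/\.(png|jpe?g|gif)(\?.*)?$/i.test(url)) {
        return '<img src="' + url + '" style="max-width: 200px;">';
    }
    // Anything else: fall back to a plain link.
    return '<a href="' + url + '">' + url + '</a>';
}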