I would like to fetch some data from Google Scholar automatically via a Matlab script. I am mostly interested in data like Google Scholar's BibTeX entries and the forward citation feature. However, it seems that there is no API for Google Scholar. Is there a way to automatically fetch bibliographic data from Google Scholar using Matlab? Are there any tools or code already available for this?
A word of caution I found while working further on this project.
There is a reason why Google Scholar does not have an API: using bots to collect data from Google Scholar is against the EULA. The basic idea is that any program interfacing with Google Scholar cannot do so in a qualitatively different way than an end user would. In other words, you cannot automatically fetch large amounts of data. Although the script in @JustinPeel's answer does not necessarily violate the terms, putting it in a massive loop would.
Some specific points from this EULA:
You shall not, and shall not allow any third party to: ...
(i) directly or indirectly generate queries, or impressions of or clicks on Results, through any automated, deceptive, fraudulent or other invalid means (including, but not limited to, click spam, robots, macro programs, and Internet agents);
...
(l) "crawl", "spider", index or in any non-transitory manner store or cache information obtained from the Service (including, but not limited to, Results, or any part, copy or derivative thereof);
If you look at the Google Scholar robots.txt then you can also see that no bots of any kind are allowed.
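If you want to check that for yourself programmatically, here is a minimal Python sketch (Python rather than Matlab, purely for illustration) that reads Google Scholar's robots.txt with the standard library before attempting any fetch:

```python
# Minimal sketch: check what Google Scholar's robots.txt allows before fetching anything.
# Uses only the Python standard library; the query URL below is just an example.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://scholar.google.com/robots.txt")
rp.read()

example_url = "https://scholar.google.com/scholar?q=some+paper+title"
if rp.can_fetch("*", example_url):
    print("robots.txt would allow this fetch for generic bots")
else:
    print("robots.txt disallows this fetch for generic bots")
```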
I have heard from some colleagues that you will get in trouble if you try to circumvent this policy, which can result in your lab losing access to Google Scholar.
If you really want to use Matlab for this (which I don't really advise), then you can look at various web scraping examples, and there is this code that actually already gets some info from Google Scholar. Basically, just google 'matlab web scraping' and off you go.
I personally would recommend using Python for this because Python is better for general programming, IMHO. For instance, this guy has already done something similar to what you want with Python. However, if you know Matlab and don't have any interest in or time for Python, then follow the links in the first paragraph.
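For completeness, the generic scraping pattern in Python looks roughly like the sketch below (requests and BeautifulSoup are assumed as installed dependencies, and the target URL is a placeholder). Keep in mind that pointing something like this at Google Scholar in a loop runs straight into the EULA and robots.txt issues discussed above.

```python
# Generic web-scraping sketch (requests + BeautifulSoup assumed installed).
# The URL is a placeholder; applying this to Google Scholar at scale would
# violate the terms quoted above.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.org/some-page", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for link in soup.find_all("a"):
    href = link.get("href", "")
    if href.endswith(".bib"):  # e.g. look for BibTeX download links
        print(link.get_text(strip=True), href)
```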
Related
I have hundreds of thousands of PDFs that are presently stored in the filesystem. I have a custom application that, as an afterthought to its actual purpose, provides access to these PDFs. I would like to take the "storage & retrieval" part out of the custom application and use an OpenSource document storage backend.
Access to the PDF Store should be via a REST API, so that users would not need a custom client for basic document browsing and viewing. Programs that store PDFs should also be able to work via the REST API. They would provide the actual binary or ASCII data plus structured meta data, which could later be used in retrieval.
A typical query for retrieval would be "give me all documents that were created between days X and Y with document types A or B".
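For illustration, the kind of REST call I have in mind might look roughly like this; the host, path, and parameter names are all hypothetical, I'm only sketching the shape of the query:

```python
# Purely hypothetical sketch of the retrieval query described above;
# the host, path, and parameter names are invented for illustration.
import requests

resp = requests.get(
    "https://pdfstore.example.com/api/documents",
    params={
        "created_from": "2011-01-01",   # day X
        "created_to": "2011-03-31",     # day Y
        "type": ["A", "B"],             # document types A or B
    },
    timeout=10,
)
resp.raise_for_status()
for doc in resp.json():
    print(doc["id"], doc["title"])
```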
My research into whether such a storage backend exists has come up empty. Do any of you know a system that provides these features? OpenSource preferred, reasonably priced systems considered.
I am not looking for advice on how to "roll my own" using available technologies. Rather, I'm trying to find out whether that can be avoided. Many thanks in advance.
What you describe sounds like a document management or asset management system, of which there are many, and many work with PDF files. I have some fleeting experience with commercial offerings such as Xinet (http://www.northplains.com/xinet - now apparently acquired) or Elvis (http://www.elvisdam.com). Both might fit your requirements, but they're probably too big and likely too expensive.
Have you looked at Alfresco? This is an open source alternative I came into contact with years ago while serving on a selection committee. As far as I remember, it definitely goes in the direction of what you are looking for, and it is open source, so it might fit that angle as well: http://www.alfresco.com.
While brainstorming about six years ago, I had what I thought was a great idea: in the future there could be webservice standards and DTDs that effectively turn the web into a decentralized knowledgebase. I listed several areas where I thought this could be applied, one of which was:
For making data available directly from a business's website: open hours, locations, and contact phone numbers. Suggest a web service standard by which businesses have a standard URL extended off the main (base) URL for their website, at which is located a webservice. That webservice in turn has a standardized set of services for downloading a list of their locations, contact telephone numbers, and business hours.
It's interesting looking back at these notes now, since this is not how things have evolved. Instead of businesses putting this information only on their website and then letting any search engine or other data aggregator crawl it, they are updating it separately on their website, their Facebook page, and Google Maps. Facebook and Google Maps, due to their popularity, have become the solution to the problem I thought my idea would solve.
Is the way things are better than the way I thought they could be? If so then why doesn't my idea fit the reality? If not then what's holding my idea back from being realized?
A lot of this information is available via APIs, but that doesn't mean it doesn't get put in other places as well, through a variety of means. For example, a company may expose information via an API, and their Facebook app might use that API to populate a Facebook page.
Also, various microformats are in use that encapsulate some of this information.
The biggest obstacle is agreeing on what meta-information should be exposed, how it should be exposed, and how it should be accessed.
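As a small illustration of the microformats angle, an off-the-shelf parser such as mf2py (assumed here as an installed third-party dependency; check its docs for the exact call) can pull that embedded contact metadata out of a business's page:

```python
# Sketch of extracting microformat metadata from a business's page.
# mf2py is assumed as an installed parser; the URL is a placeholder.
import mf2py

parsed = mf2py.parse(url="https://some-business.example.com/")
for item in parsed.get("items", []):
    if "h-card" in item.get("type", []):      # hCard-style contact info
        props = item.get("properties", {})
        print(props.get("name"), props.get("tel"), props.get("adr"))
```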
We need to display meta information (e.g., address, name) on our site for various venues like bars, restaurants, and theaters.
Ideally, users would type in the name of a venue, along with zip code, and we present the closest matches.
Which APIs have people used for similar geolocation purposes? What are the pros and cons of each?
Our basic research yielded a few options (listed in the title and below). We're curious to hear how others have deployed these APIs and which ones are ultimately in use.
Fwix API: http://developers.fwix.com/
Zumigo
Does Facebook plan on offering a Places API eventually that could accomplish this?
Thanks!
Facebook Places is based on Factual. You can use Factual's API which is pretty good (and still free, I think?)
http://www.factual.com/topic/local
You can also use unauthenticated Foursquare as a straight places database. The data is of uneven quality since it's crowdsourced, but I find it generally good. It's free up to a certain API limit, but I think the paid tier is negotiated.
https://developer.foursquare.com/
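For example, a userless venue search against the v2 API looks roughly like the sketch below; the parameter names are from my reading of the Foursquare docs at the time, and the client ID/secret are placeholders, so double-check against the current documentation.

```python
# Sketch of a userless Foursquare v2 venue search. client_id/client_secret are
# placeholders; "v" is the API version date Foursquare expects.
import requests

resp = requests.get(
    "https://api.foursquare.com/v2/venues/search",
    params={
        "near": "10001",               # zip code or city the user typed
        "query": "pizza",              # venue name the user typed
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "v": "20120229",
    },
    timeout=10,
)
resp.raise_for_status()
for venue in resp.json()["response"].get("venues", []):
    print(venue["name"], venue.get("location", {}).get("address"))
```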
I briefly looked at Google Places but didn't like it because of all the restrictions on how you have to display results (Google wants their ad revenue).
It's been a long time since this question was asked but a quick update on answers for other people.
This post, right now at least, will not go into great detail about each service but merely lists them:
http://wiki.developer.factual.com/w/page/12298852/start
http://developer.yp.com
http://www.yelp.com/developers/documentation
https://developer.foursquare.com/
http://code.google.com/apis/maps/documentation/places/
http://developers.facebook.com/docs/reference/api/
https://simplegeo.com/docs/api-endpoints/simplegeo-context
http://www.citygridmedia.com/developer/
http://fwix.com/developer_tools
http://localeze.com/
They each have their pros and cons (e.g. Google Places only allows 20 results per query, while Foursquare and Facebook Places have semi-unreliable results), which are explained in a bit more detail, although not entirely, at the following link: http://www.quora.com/What-are-the-pros-and-cons-of-each-Places-API
For my own project I ended up deciding to go with Factual's API, since there are no restrictions on what you do with the data (one of the only ToS documents that I've read in its entirety). Factual has a pretty reliable API, and as a user of the API you may update, modify, or flag rows of the data. Facebook Places bases its data on Factual's, which is just another fact to add some perspective.
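For reference, a read against Factual's places table looked roughly like this when I used it. The endpoint and parameter names are from my memory of the v3 API and may well have changed, and the API key is a placeholder, so treat this as a sketch and check Factual's docs.

```python
# Rough sketch of a Factual v3 "places" table read. Endpoint and parameter
# names are approximate (from memory); the API key is a placeholder.
import requests

resp = requests.get(
    "https://api.v3.factual.com/t/places",
    params={
        "q": "pizza",                        # full-text search term
        "filters": '{"postcode":"10001"}',   # row filters as a JSON string
        "KEY": "YOUR_FACTUAL_API_KEY",
    },
    timeout=10,
)
resp.raise_for_status()
for row in resp.json()["response"]["data"]:
    print(row.get("name"), row.get("address"))
```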
Hope I can be of help to any future searchers.
This is not a complete answer, because I haven't compared the geolocation APIs given above, but there is also the Google Places API, which solves a similar problem to the other APIs.
One thing about SimpleGeo: the Location API of SimpleGeo mainly supports US (and Canada?) based locations. The last time I checked, my home country, Germany, didn't have many known locations.
A comparison between places data APIs is tough to keep up to date, given the fast pace of the space and with acquisitions like SimpleGeo and HyperPublic changing the landscape quickly.
So I'll just throw in CityGrid's perspective as of February 2012. CityGrid provides 18M US places, allowing up to 10M requests per month for developers (publishers) at no charge.
You can search using a wide range of "what" and "where" terms (cities, neighborhoods, zip codes, metro areas, addresses, intersections), including lat/long. We have rich data for each place, including images, videos, reviews, offers, etc.
CityGrid also has a developer revenue sharing program where we'll pay you to display some places, as well as a large mobile and web advertising network.
You can also query Places via the CityGrid API using place and venue IDs from Factual, Foursquare, and other places providers. We aggregate data from several places data providers through our system.
Website: http://developer.citygridmedia.com/
I've searched the web for this, but to no avail. I hope someone can point me in the right direction. I'm happy to look things up, but it's knowing where to start.
I am creating an iPhone app which takes content updates from a webserver and will also push feedback there. Whilst the content is obviously available via the app, I don't want the source address to be discovered and published by some unhelpful person so that it all becomes freely available.
I'm therefore looking at placing it in a MySQL database and possibly writing some PHP routines to serve it in response to my HTTP(S) requests. That's all pretty new to me, but I can probably do it. However, I'm not sure where to start with the security question. Something simple and straightforward would be great. Also, any guidance on whether to stick with the XML parser I currently have or to switch to JSON would be much appreciated.
The content consists of straightforward data but also html and images.
Doing exactly what you want (preventing users of 'unauthorized' apps from getting access to this data) is rather difficult because, at the end of the day, any access codes and/or URLs will be stored in your app for someone to dig up and exploit.
If you can, consider authenticating against the USER, not the app. That way, even if a 3rd-party app is created that can access this data from wherever you store it, you can still disable it on a per-user basis.
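To make that concrete, here is a minimal sketch of per-user token checking. It uses Python/Flask purely for illustration (your plan is PHP and MySQL, but the idea is the same), and all user names, tokens, and routes are hypothetical.

```python
# Minimal sketch of authenticating the USER rather than the app.
# Flask is used purely for illustration; user names, tokens, and the
# revocation list are hypothetical placeholders.
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

# In a real system these would live in the MySQL database.
USER_TOKENS = {"alice": "token-for-alice", "bob": "token-for-bob"}
REVOKED_USERS = {"bob"}  # a single user can be cut off without breaking the app

@app.route("/content/<item_id>")
def get_content(item_id):
    user = request.headers.get("X-User")
    token = request.headers.get("X-Token")
    if not user or USER_TOKENS.get(user) != token or user in REVOKED_USERS:
        abort(403)  # unknown user, wrong token, or access revoked per-user
    # Look up the item in the database here; placeholder payload for the sketch.
    return jsonify({"id": item_id, "html": "<p>content goes here</p>"})

if __name__ == "__main__":
    app.run()
```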
Like everything in the field of information security, you have to consider the cost-benefit trade-off. You need to weigh up the value of your data against the cost of your security, both in terms of actual development cost and the cost of protecting it, as well as the cost of inconveniencing users to the point that you can't sell your data at all.
Good luck!
I'm looking for a presentation, PDF, blog post, or whitepaper discussing the technical details of how to filter down and display massive amounts of information for individual users in an intelligent (possibly machine learning) kind of way. I've had coworkers hear presentations on the Facebook news feed but I can't find anything published anywhere that goes into the dirty details. Searches seem to just turn up the controversy of the system. Maybe I'm not searching for the right keywords...
@AlexCuse I'm trying to build something similar to Facebook's system. I have large amounts of data and I need to filter it down to something manageable to present to the user. I cannot use another website due to the scale of what I've got to work with. Also, I just want a technical discussion of how to implement it, not examples of people who have an implementation.
Are you looking for something along the lines of distributed pub/sub with content-based filtering? If so, you may want to look into Siena and some of the associated papers, such as "Design and Evaluation of a Wide-Area Event Notification Service".
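As a toy illustration of what content-based filtering means in that setting: each subscription is a predicate over an event's attributes, and an event is delivered only to the subscribers whose predicates match it. The in-process sketch below shows the idea, not Siena's actual implementation, and all names are hypothetical.

```python
# Toy in-process sketch of content-based pub/sub filtering (the idea behind
# systems like Siena, not their implementation). All names are hypothetical.
from typing import Callable, Dict, List

Event = Dict[str, object]
Predicate = Callable[[Event], bool]

class ContentBasedBroker:
    def __init__(self) -> None:
        self._subscriptions: List[tuple] = []  # (user_id, predicate)

    def subscribe(self, user_id: str, predicate: Predicate) -> None:
        self._subscriptions.append((user_id, predicate))

    def publish(self, event: Event) -> Dict[str, List[Event]]:
        # Deliver the event only to users whose predicate matches its content.
        delivered: Dict[str, List[Event]] = {}
        for user_id, predicate in self._subscriptions:
            if predicate(event):
                delivered.setdefault(user_id, []).append(event)
        return delivered

broker = ContentBasedBroker()
broker.subscribe("alice", lambda e: e.get("topic") == "photos" and e.get("friend_degree", 99) <= 1)
broker.subscribe("bob", lambda e: e.get("score", 0) > 0.8)

print(broker.publish({"topic": "photos", "friend_degree": 1, "score": 0.4}))
```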