How can I fetch more than 1000 Google results with the Perl Google API?

Using the regular search engine as a human, you can get at most 1,000 results, which is far more than a regular person needs.
But what if I do want to get 2,000? Is it possible? I read that it is possible using App Engine or something like that (over here...), but is it possible, somehow, to do it through Perl?

I don't know a way around this limit, other than to run a series of refined searches instead of one general search (see the sketch after the examples below).
For example, instead of just "Tim Medora", I might search for myself by:
Search #1: "Tim Medora Phoenix"
Search #2: "Tim Medora Boston"
Search #3: "Tim Medora Canada"
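If you are scripting this, the same refined-search idea works against the Custom Search JSON API. A minimal Python sketch, assuming placeholder credentials and the standard https://www.googleapis.com/customsearch/v1 endpoint (each query is still capped at 10 results per page, and quotas apply):

```python
import requests

API_KEY = "YOUR_API_KEY"       # placeholder credentials (assumed)
ENGINE_ID = "YOUR_ENGINE_ID"

def search(query):
    """Run one Custom Search query; each call returns at most 10 items."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": query},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

# A series of refined searches, deduplicated by URL.
seen = set()
for refinement in ("Phoenix", "Boston", "Canada"):
    for item in search('"Tim Medora" ' + refinement):
        if item["link"] not in seen:
            seen.add(item["link"])
            print(item["link"])
```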
However, if you are trying to use Google to search a particular site, you may be able to read that site's Google sitemaps.
For example, www.linkedin.com exposes all 80 million+ users/businesses via a series of nested sitemap XML files: http://www.linkedin.com/sitemap.xml.
Using this method, you can crawl a specific site quite easily with your own search algorithm, if it has good Google sitemaps.
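As a rough illustration, nested sitemap indexes can be walked with a few lines of Python. This is a minimal sketch: large sites often serve the nested files gzipped (.xml.gz), may require a polite user agent and rate limiting, and the LinkedIn URL is only the example from above:

```python
import xml.etree.ElementTree as ET
import requests

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(url):
    """Yield page URLs from a sitemap, recursing into nested sitemap indexes."""
    root = ET.fromstring(requests.get(url, timeout=10).content)
    if root.tag.endswith("sitemapindex"):            # index of further sitemaps
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            yield from sitemap_urls(loc.text.strip())
    else:                                            # leaf sitemap of page URLs
        for loc in root.findall("sm:url/sm:loc", NS):
            yield loc.text.strip()

for page in sitemap_urls("http://www.linkedin.com/sitemap.xml"):
    print(page)
```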
Of course, I am in no way suggesting that you exploit a sitemap for illegal/unfriendly purposes.

Related

How to use the Overpass API?

I am a total newbie, so this may be a silly question, but I can't find any tutorials on how to query the Overpass API to display things on my own website. Do I install it on my server, or is there code to query it in a script?
What I want to achieve is to have a search bar on one page to search for tags, which would display one random point with that tag on another page with a Leaflet map.
But I am struggling to even display any points on it. Would it actually be better to have a local GeoJSON file with a set list of points in one town, if I want to limit them to just this town anyway?
I will be grateful for any help; it's the first time I am doing something like this and it horribly stresses me out.
You can visually run and try overpass queries using http://overpass-turbo.eu/.
To reduce load on the Overpass server, it would be a good idea to fetch the data once (and update it regularly) and host it on your own server (also pay attention to the terms of use of the specific APIs; they might limit the number of requests per hour or prohibit using them for autocomplete).
To query the server from an application, send a GET request to https://overpass-api.de/api/interpreter?data=, followed by your request (the same thing you would type into overpass turbo, just without line breaks).
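For example, a minimal Python sketch against the public endpoint (the bounding box and the cafe tag are just illustrative):

```python
import requests

# Fetch all cafe nodes in an example bounding box (south, west, north, east)
# as JSON from the public Overpass endpoint.
query = """
[out:json];
node["amenity"="cafe"](51.0,6.9,51.1,7.0);
out;
"""
resp = requests.get(
    "https://overpass-api.de/api/interpreter",
    params={"data": query},
    timeout=30,
)
resp.raise_for_status()
for node in resp.json()["elements"]:
    print(node["lat"], node["lon"], node.get("tags", {}).get("name"))
```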
It is also possible to host an overpass instance on your own.
If you need to learn the Overpass Query Syntax first, you can read the docs.

Google manual search results do not match Custom Search API results

I am trying to obtain the total result count of a manual Google web search via the CustomSearch API. I am searching the entire web based on a keyword and an associated site, so the search query is "<keyword> site:<site>". Judging by my research, it is a known issue that manual Google search results tend to differ from CustomSearch API results obtained from searching the entire web, as cited here and elsewhere.
Is there really no way to exactly reproduce manual search results with the API? If that is the case, the API is rather limited, and this should be explained up-front to developers in the documentation, or fixed.
I believe my custom search engine is already set up to search the whole web.
It looks like PDF files returned in a web search may not be returned by the API. I have already tried specifying the fileType parameter to include PDFs, to no avail.
3 results were returned via the manual web search; 0 results were returned by the API.
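For reference, a minimal Python sketch of the kind of call in question (the key, engine ID, and query are placeholders, not the actual values used):

```python
import requests

params = {
    "key": "YOUR_API_KEY",                 # placeholder credentials
    "cx": "YOUR_ENGINE_ID",
    "q": "somekeyword site:example.com",   # hypothetical query
    "fileType": "pdf",                     # restrict to PDFs; did not help here
}
resp = requests.get("https://www.googleapis.com/customsearch/v1",
                    params=params, timeout=10)
resp.raise_for_status()
print(resp.json().get("searchInformation", {}).get("totalResults"))
```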
If anyone has lessons to share, I will be thankful!

Nutch Parser Plugin to Collect Contact Information

I am working on a project that needs to identify contact points on companies' websites, for the purpose of enhancing security.
Right now, I have managed to use Apache Nutch to crawl several rounds of sites. The next step will be to parse the HTML pages and locate where the contact information is. In this case, I am only interested in email addresses and phone numbers...
Here is what I am planning to do: write a MapReduce job that parses the HTML files and uses regular expressions in combination with an HTML parser like Jsoup/BeautifulSoup to find the contact information.
However, I am wondering: is there any parser plugin that has already been implemented, and perhaps tested, for this purpose?
You should not need to write a custom MapReduce job. Just implement a bespoke HtmlParseFilter, which gives you a DOM to run XPath expressions on, plus the text of the document if you want regular expressions.
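Not a Nutch plugin, but as a rough sketch of the regex approach (the patterns below are simplistic and would need tuning per locale; BeautifulSoup stands in here for whatever text/DOM the parse filter hands you):

```python
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")  # very loose; tune per locale

def extract_contacts(html):
    """Pull candidate email addresses and phone numbers out of one page."""
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    return EMAIL_RE.findall(text), PHONE_RE.findall(text)

emails, phones = extract_contacts(
    "<p>Call +1 555 123 4567 or mail info@example.com</p>")
print(emails, phones)
```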
Having worked on something similar for a customer a few years ago, I found that there were many pages implementing schema.org. You could write a custom HtmlParseFilter with XPath to extract normalised info from the microdata. You can look at the microdata parser for StormCrawler as an example of how to leverage Apache Any23 to extract microdata.
If you want a more NLP-intensive method, you could use Behemoth to process Nutch segments with tools such as Apache UIMA or GATE.
HTH

async autocomplete service

Call me crazy, but I'm looking for a service that will deliver autocomplete functionality similar to Google, Twitter, etc. After searching around for 20 minutes, I thought to ask the geniuses here. Ideas?
I don't mind paying, but it would be great if it were free. Also, is there a top-notch NLP service that I can submit strings to and get back states, cities, currencies, company names, establishments, etc.? Basically, I need to take unstructured data (a generic search string) and pull out key information with relevant metadata.
Big challenge, I know.
Sharing solutions I found after further research.
https://github.com/haochi/jquery.googleSuggest
http://shreyaschand.com/blog/2013/01/03/google-autocomplete-api/
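The second link describes Google's unofficial suggest endpoint; a minimal Python sketch of querying it (unofficial and unsupported, so it may change or be rate-limited at any time):

```python
import requests

def google_suggest(prefix):
    """Query the unofficial suggest endpoint (may change without notice)."""
    resp = requests.get(
        "https://suggestqueries.google.com/complete/search",
        params={"client": "firefox", "q": prefix},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()[1]  # response shape: [query, [suggestion, ...]]

print(google_suggest("stack over"))
```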
If you don't want to implement it yourself, you can use a service called 'Autocomplete as a Service', which is specifically written for these purposes. You can access it at www.aaas.io.
You can add metadata with each record, and it returns the metadata along with the matching results. Do check out the demo on the home page. It has a very simple API written specifically for autocomplete search.
It supports large datasets, and you can apply filters while searching.
Its usage is simple: add your data and use the API URL as the autocomplete data source.
Disclaimer: I am the founder of it. I will be happy to provide this service to you.

StockTwits API Streaming and Search Used Like Twitter Streaming

The StockTwits API documentation describes streams in a way that sounds like static search results, for example streams/symbol:
This allows an API application to search for a symbol or user. 30 Results will be a combined list of symbols and users.
This seems similar to search/symbols:
This allows an API application to search for a symbol directly. 30 Results will return only ticker symbols.
Other than the fact that search excludes users, I don't see the difference.
In contrast, the Twitter API provides methods to request a continuous stream of tweets, which I have used to collect tens of thousands of tweets in a few days.
Is it possible to have StockTwits pump messages continuously, similar to Twitter?
If so, what is required? Since StockTwits streaming looks like searching to me, the only option I have seen is to submit repeated search requests, but that would exhaust the rate limit.
I prefer C#, but I am glad to study answers in other languages, such as PHP.
This is a static search, for symbols alone or for symbols and users combined. It isn't a streaming search endpoint for filtering content. It is strictly for finding a symbol or a user in order to go directly to their stream.
We are looking into offering streaming endpoints and search would be part of this offering.
You may be interested in using streamdata.io, which allows you to stream any API. We have already implemented a StockTwits demo, which can be found here, and explanations can be found in this blog post.
I think it's quite easy to transpose what has been done with Android to the C# world. All you need is an EventSource (Server-Sent Events) library and a JSON-Patch library.
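For illustration, here is a minimal Python sketch of the same pattern. The proxy URL format and the 'data'/'patch' event names are assumptions based on how streamdata.io generally works (an initial snapshot followed by JSON-Patch deltas); check their docs for the exact contract:

```python
import json

import jsonpatch                 # pip install jsonpatch
from sseclient import SSEClient  # pip install sseclient

# Hypothetical proxy URL wrapping a StockTwits endpoint; the exact URL
# format and the token come from a streamdata.io account.
URL = ("https://streamdata.motwin.net/"
       "https://api.stocktwits.com/api/2/streams/symbol/AAPL.json"
       "?X-Sd-Token=YOUR_TOKEN")

snapshot = None
for event in SSEClient(URL):
    if event.event == "data":      # initial full snapshot (assumed event name)
        snapshot = json.loads(event.data)
    elif event.event == "patch":   # incremental JSON-Patch update (assumed)
        snapshot = jsonpatch.apply_patch(snapshot, json.loads(event.data))
    if snapshot is not None:
        print(len(snapshot.get("messages", [])), "messages cached")
```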