Which whois servers does this library use and are they rate limited? - whois-ruby

I can't seem to locate which WHOIS servers this library uses. Where in the code does it have the URL of the server(s) it connects to?
I'm not sure I understand how WHOIS actually works, but I am assuming it connects to popular registrars' APIs, correct?
Does this mean that if I run WHOIS lookups for thousands of entries I will be rate limited and the responses will begin to fail?

The WHOIS server definition list is available at
https://github.com/weppos/whois/blob/master/data/tld.json
It doesn't connect to popular registrars' APIs. WHOIS is itself a protocol, and the library connects to the WHOIS interfaces provided by the registries, wherever applicable.
Each registry has its own rate-limiting rules. However, if you perform a large number of requests, there is a high chance you will start being rate limited.
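For reference, this is not the gem's code, just a minimal sketch of what a WHOIS lookup looks like at the protocol level (RFC 3912): open a TCP connection to the registry's server on port 43, send the query followed by CRLF, and read until the server closes the connection. The server name and the pause between queries below are illustrative, not the library's actual behaviour.

import socket
import time

def whois_query(server, query):
    # RFC 3912: connect to port 43, send the query plus CRLF, read until EOF
    with socket.create_connection((server, 43), timeout=10) as sock:
        sock.sendall((query + "\r\n").encode("utf-8"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

# Illustrative only: the correct server per TLD is what files like tld.json map out.
for domain in ["example.com", "example.net"]:
    print(whois_query("whois.verisign-grs.com", domain))
    time.sleep(2)  # crude pacing; real rate limits vary per registry and are not published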

Related

In network programming there is a limit to the number of sockets/connections; how do web servers exceed this limit?

I have started exploring network programming in Linux using sockets. I am wondering how web servers like Yahoo, Google, etc. are able to establish millions or billions of connections. I believe the core is just socket programming to access the remote server. If that is the case, how are billions or millions of people able to connect to the server? That would mean billions or millions of socket connections, which is not possible, right? The spec says a maximum of 5 socket connections only. What is the mystery behind it?
Can you also explain it in terms of this API?
listen(sock,5);
To get an idea of tuning an individual server you may want to start with Apache Performance Tuning and maybe Linux Tuning Parameters, though it is somewhat outdated. Also see Upper limit of file descriptor in Linux
When you have a number of finely tuned servers, a network balancer is used; it typically distributes IP traffic across a cluster of such hosts. Sometimes DNS load balancing is used in addition, to split traffic further between the IP balancers.
If you are interested, you can follow Google's Compute Engine Load Balancing, which provides a single IP address and does away with the need for additional DNS balancing, and reproduce their results:
The following instructions walk you step-by-step through setting up a Google Compute Load Balancer benchmark that achieves 1,000,000 Requests Per Second. It is the code and steps that were used when writing a blog post for the Google Cloud Platform blog. You can find the Google Cloud Platform blog at http://googlecloudplatform.blogspot.com/ This GIST is a combination of instructions and scripts from Eric Hankland and Anthony F. Voellm. You are free to reuse the code snippets.
https://gist.github.com/voellm/1370e09f7f394e3be724
It doesn't say "maximum 5 connections only". The argument to listen() that you refer to is the backlog, not the total number of connections. It refers to the number of incoming connections that TCP will accept and hold on the backlog queue before the application picks them up via accept().
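To make the distinction concrete, here is a small sketch (in Python rather than C, but the semantics are the same): the backlog only bounds connections queued waiting for accept(), not the total number the process can hold open.

import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 8080))
srv.listen(5)   # backlog: at most ~5 connections queued *before* accept(), not 5 in total

accepted = []
while True:
    conn, addr = srv.accept()   # each accept() drains one entry from the backlog queue
    accepted.append(conn)       # nothing here stops us holding far more than 5 at once;
                                # the real ceilings are file descriptors, memory and tuning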

How can I recognize different applications in NetFlow dumps?

I am trying to discover what kinds of applications are used in my network (e.g. Facebook, YouTube, Twitter, etc.). Unfortunately I can't do Deep Packet Inspection; all I have are NetFlow traces. I was thinking about resolving IP addresses using a DNS server and checking the domain names of flows, but what if an application uses a domain that doesn't contain the app name? Is there any way to find all IP addresses that belong to a specific app/website?
Outside deep packet inspection (in which I include tech like Cisco NBAR) your main tools are probably going to be whois and port/protocol pair. Some commercial NetFlow collectors will do some of the legwork for you, for example by doing autonomous system lookup on incoming IP addresses, or providing the IANA protocol list.
The term "application" is a bit overloaded in this domain, by the way: often it's used to mean HTTP, SSH, POP3 and similar protocols in the OSI Application Layer, which are generally guessed from the port/protocol pair. For Facebook, Hotmail, etc, the whois protocol is probably your best bet. It's a bit better than reverse DNS, but the return formats aren't standardized among the Regional Internet Registries, so your parser is going to need to have some smarts. Get the IP addresses for a few of the major sites and use the command line whois utility with them to get a feel for the output before scripting anything.
Fortunately, most of the big ones are handled by ARIN. Look for "NetName" and "OrgName" in the results, and watch for the RIR names (RIPE, APNIC, etc.) indicating that an IP address isn't handled by ARIN. For example, I see www.stackoverflow.com as 198.252.206.16. whois 198.252.206.16 returns, among other things:
NetName: SE-NET01
OrgName: Stack Exchange, Inc.
You didn't specify whether you were shell scripting or programming; if the latter, the WHOIS protocol is standard and has a number of implementations in most languages.
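As a rough sketch of scripting this against ARIN's port-43 interface (the NetName/OrgName field names match ARIN's output; other RIRs format their replies differently, so treat the parsing as a starting point, not a complete solution):

import socket

def whois_ip(ip, server="whois.arin.net"):
    # Query a WHOIS server for an IP and pull out a couple of ARIN-style fields.
    with socket.create_connection((server, 43), timeout=10) as sock:
        sock.sendall((ip + "\r\n").encode())
        raw = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            raw += chunk
    fields = {}
    for line in raw.decode(errors="replace").splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("NetName", "OrgName"):
            fields[key] = value.strip()
    return fields

print(whois_ip("198.252.206.16"))  # expect something like {'NetName': 'SE-NET01', 'OrgName': 'Stack Exchange, Inc.'}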

Technically what is an "application" when referred to in the Google Static Maps Rate Limits docs

From https://developers.google.com/maps/documentation/staticmaps/#Limits I read:
The Google Static Maps API has the following usage limits:
25 000 free static map requests per application per day.
If I'm not providing an API key in the URL, how does it determine the limit? IP of the referring page? domain of the referring URL? IP of the client?
We used static maps on our website and discovered it uses the IP address of the client. So someone who looked at our website a lot would find that the big "your quota has been exceeded" image appeared for them but not for me.
I start by declaring that I do not know this, but the logical choice is the domain.
With IP restriction, multiple clients on the same web server would consume each other's quotas, which they should have thought of.
* Client IP would be useless by any metric.
* Server IP would mean multiple clients on one host consume each other's quota.
What's left is the domain. With that said, Google is known to use their brains, and I would not be surprised if they run a combination of checks to find abuse. Like so:
If domainA.com uses up 25 000 in one day and then immediately domainB.com comes online and starts asking for images from the same IP, that might ring some bells. Of course the same would be true even for different IPs if they all request the same location.
So in summary, I think if you randomize which domain asks for the map on any given client request and only locally mark a domain as spent (for the day) when you get an error back, you can request an infinite amount (if you have infinite domains), with the possible caveat of detection if they all request the same location.
Of course, spreading the different domains over different servers/IPs would make detection impossible, however unlikely it is that that's needed.
There is no clarity in the pricing model and the usage limits that Google has posted for their web service APIs, but I believe the accepted answer is wrong and misleading; refer to "Two conflicting statements for google static map usage".
The 25k usage limit is per application, not per client.
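For context, this is roughly what a Static Maps request looks like; center, zoom and size are documented URL parameters, and the optional key parameter, when present, is what ties a request to a specific application. The coordinates below are just an example.

from urllib.parse import urlencode

BASE = "https://maps.googleapis.com/maps/api/staticmap"

params = {
    "center": "52.2297,21.0122",  # illustrative coordinates
    "zoom": 12,
    "size": "640x400",
}
# Without a key the request is anonymous, and quota attribution is exactly what the
# thread above is debating; with a key, usage is counted against that application.
# params["key"] = "YOUR_API_KEY"

print(BASE + "?" + urlencode(params))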

Search engine to check if a particular IP is a web server

I have to automatically find web servers in a certain IP range.
It should not look like an attack, so I cannot use ping, curl, or lynx. I also cannot use reverse DNS.
Another approach is to use a search engine like Google or Bing: I can search for the IP, then check whether any result's address contains that IP; if so, I know it is a web server.
But Google does not return useful data. For example, for the IP 212.77.100.101 (which is a web server), none of the results on the page contain 212.77.100.101 in the address (https://www.google.pl/search?q=212.77.100.101).
Is there any other solution to this problem, or another search engine I could use?
This would really depend on a lot of factors. You're going to need some scripting heft to search through plain Google results for the information you want. Plus, what do you mean by server? Just a regular old website server? You could probably use ARIN whois in some way to query IP addresses, and any belonging to Google, Yahoo, etc. you could identify as a LIKELY server IP address. If you're trying to tell whether it's a server based on more technical information like OS, ports, etc., there isn't much you'll find on Google.
For instance, an ARIN WHOIS of a Google IP comes to this: http://whois.arin.net/rest/net/NET-74-125-0-0-1/pft . Using your language of choice you could request that page, load the response into a variable, and then look for the element with identifying information, such as a Google designation under the name or something to that effect.
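A rough sketch of that idea, assuming ARIN's Whois-RWS REST endpoint (whois.arin.net/rest/ip/...) returns JSON when asked via the Accept header; the field path used below is an assumption about the payload layout and may need adjusting.

import json
import urllib.request

def arin_org(ip):
    # Ask ARIN's Whois-RWS about an IP and try to return the owning organisation's name.
    req = urllib.request.Request(
        "https://whois.arin.net/rest/ip/" + ip,
        headers={"Accept": "application/json"},  # XML comes back without this header
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        payload = json.load(resp)
    net = payload.get("net", {})
    # Assumed structure: attribute values under "@name", text values under "$".
    org_ref = net.get("orgRef", {})
    return org_ref.get("@name") or net.get("name", {}).get("$", "")

print(arin_org("74.125.0.1"))  # a Google-owned address should come back with a Google org name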
The best way to really tell, AFAIK, is to check ports and use other techniques, which you cannot do by the sound of it. I'm not aware of a database you can access that would have that information by IP address either...
What do you mean by server? That would help narrow down what you're looking to accomplish. Just any IP serving up some sort of data? Or anything that comes back as a Linux box or something?
More detail! :D

server push for millions of concurrent connections

I am building a distributed system that consists of potentially millions of clients, all of which need to keep an open (preferably HTTP) connection to wait for a command from the server (which is running somewhere else). The load of messages/commands will not be very high, maybe one message per second per 1000 clients, which means about 1000 msg/sec at 1 million clients. So it's basically about the concurrent connections.
The requirements are simple too: one-way messaging (server -> client), only 1 client per "channel".
I am pretty open in terms of technology (XMPP / WebSockets / Comet / ...). I am using Google App Engine as the server, but their "channels" unfortunately won't work for me (quotas are too low and there is no Java client). XMPP was an option but is quite expensive. So far I was using URL Fetch & PubNub, but they just started charging for connections (big time).
So:
Does anyone know of a service out there that can do that for me in an affordable way? Most I have found restrict or heavily charge for connections.
Any experience with implementing such a server yourself? I have actually done that already and it works pretty well (based on Tomcat & NIO), but I haven't had the time yet to set up a large load-test environment (partially because this is still a fallback solution; I'd prefer a battle-hardened msg server). Any experience of how many users you get per GB? Any hard limits?
My architecture also allows me to split the msg servers into fragments, but I'd like to maximize the concurrent connections per server because the msg-processing CPU overhead is minimal.
I have meanwhile implemented my own message server using netty.io. Netty makes use of Java NIO and scales extremely well. For idle connections I get a memory footprint of 500 bytes per connection. I am doing only very simple message forwarding (no caching, storage or other fancy stuff), but with that I am easily getting 1000-1500 msg/sec (each half a KB) on a small Amazon instance (1 ECU / 1.6 GB).
Otherwise, if you are looking for a (paid) service, I can recommend spire.io (they do not charge for connections but have a higher price per message) or PubNub (they do charge for connections but are cheaper per message).
You have to look more at the architecture of such an environment.
First of all, if you write the socket management yourself, don't use a thread per client socket. Use asynchronous methods for receiving and sending data.
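As an illustration of the asynchronous approach (sketched with Python's asyncio rather than Java NIO or netty, but the idea is the same): one event loop parks all the idle connections and pushes to a channel only when a command arrives. The channel naming and framing here are made up for the example.

import asyncio

channels = {}  # one client per "channel": channel id -> StreamWriter

async def handle_client(reader, writer):
    channel = (await reader.readline()).decode().strip()  # client announces its channel id
    channels[channel] = writer
    try:
        await reader.read()  # park here until the client disconnects
    finally:
        channels.pop(channel, None)
        writer.close()

async def push(channel, message):
    # Called by whatever produces commands; one-way server -> client messaging.
    writer = channels.get(channel)
    if writer is not None:
        writer.write(message + b"\n")
        await writer.drain()

async def main():
    server = await asyncio.start_server(handle_client, "0.0.0.0", 9000)
    async with server:
        await server.serve_forever()

asyncio.run(main())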
WebSockets might be too heavy if your messages are small, because the protocol applies framing to each message on each socket individually (caching can be used across different WebSocket protocol versions), which makes both receiving and sending slower, especially because of data masking.
It is possible to create millions of sockets, but only the most advanced technologies are capable of doing so. Erlang can handle millions of connections and is pretty scalable.
If you would like to have millions of connections using other, higher-level technologies, then you need to think about clustering what you are trying to accomplish.
For example, use a gateway server that keeps track of all the processing servers and holds data about them (IP, ports, load); if it is all one internal network, firewalling and port forwarding might be handy here.
Client software connects to that gateway server; the gateway picks the least loaded worker and sends its IP and port to the client. The client then creates a connection directly to that worker using the provided address.
That way you have a gateway that can also handle authorization and won't hold connections for long, so one of them might be enough, plus many workers that publish the data and keep the connections.
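A toy sketch of that gateway step, with a hypothetical load table kept by the gateway; the worker addresses and the load metric are placeholders, and a real gateway would get load reports from the workers themselves.

import asyncio
import json

# Hypothetical registry the gateway maintains: worker address -> current connection count.
workers = {
    ("10.0.0.11", 9000): 120000,
    ("10.0.0.12", 9000): 80000,
}

async def handle_lookup(reader, writer):
    # Client connects briefly, learns the least loaded worker, then connects there directly.
    host, port = min(workers, key=workers.get)
    workers[(host, port)] += 1  # optimistic bump until the worker reports back
    writer.write(json.dumps({"host": host, "port": port}).encode() + b"\n")
    await writer.drain()
    writer.close()  # the gateway does not hold the connection

async def main():
    server = await asyncio.start_server(handle_lookup, "0.0.0.0", 8000)
    async with server:
        await server.serve_forever()

asyncio.run(main())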
This depends heavily on your needs and might not be suitable for your situation.