How do I gather geolocation data on GitHub users in a week's time, given their API rate limiting constraints? - github

To make a long story short, I need to gather geolocation data and signup date on ~23 million GitHub users for a historical data visualization project. I need to do this within a week's time.
I'm rate-limited to 5000 API calls/hour while authenticated. This is a very generous rate limit, but unfortunately, I've run into a major issue.
The GitHub API's "get all users" feature is great (it gives me around 30-50 users per API call, paginated), but it doesn't return the complete user information I need.
If you look at the following call (https://api.github.com/users?since=0), and compare it to a call to retrieve one user's information (https://api.github.com/users/mojombo), you'll notice that only the user-specific call retrieves the information I need such as "created_at": "2007-10-20T05:24:19Z" and "location": "San Francisco".
This means that I would have to make 1 API call per user to get the data I need. Processing 23,000,000 users then requires 4,600 hours or a little over half a year. I don't have that much time!
Are there any workarounds to this, or any other ways to get geolocation and sign up timestamp data of all GitHub users? If the paginated get all users API route returned full user info, it'd be smooth sailing.

I don't see anything obvious in the API; it's probably designed specifically to make mass data collection slow. There are a few things you could do to try to speed this up.
Increase the page size
In your https://api.github.com/users?since=0 call, add per_page=100. That cuts the number of calls needed to page through the whole user list to roughly a third of what you'd need at the default page size.
Randomly sample the user list.
With a set of 23 million people you don't need to poll every single one to get good data about GitHub signup and location patterns. Since you're sampling the entire population randomly, there's no sampling bias to account for.
Signup dates come almost for free because user IDs are assigned sequentially: if user 1200 signed up on 2015-10-10 and user 1235 also signed up on 2015-10-10, then you know users 1201 to 1234 signed up on 2015-10-10 as well. I'm going to assume you don't need any more granularity than that.
Location can also be randomly sampled: pick 1 in 10 users, or even 1 in 100 (one per page). 230,000 out of 23 million is a great polling sample; professional national polls in the US use sample sizes of a few thousand people for an even larger population.
How Much Sampling Can You Do In A Week?
A week gives you 168 hours or 840,000 requests.
You can get 1 user or 1 page of 100 users per request. Getting all the users, at 100 users per request, is 230,000 requests. That leaves you with 610,000 requests for individual users or about 1 in 37 users.
I'd go with 1 in 50 to account for download and debugging time. So you can poll two random users per page or about 460,000 out of 23 million. This is an excellent sample size.
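A minimal sketch of that sampling loop, assuming the standard REST endpoints and a personal access token (the token value, sample rate, and helper name are illustrative, not part of GitHub's API):

import random
import time
import requests

TOKEN = "ghp_..."  # hypothetical personal access token
HEADERS = {"Authorization": f"token {TOKEN}"}
SAMPLE_PER_PAGE = 2  # 2 of every 100 users, roughly 1 in 50

def fetch_json(url, params=None):
    """GET a GitHub API URL, sleeping once the hourly quota is exhausted."""
    resp = requests.get(url, headers=HEADERS, params=params)
    if int(resp.headers.get("X-RateLimit-Remaining", 1)) == 0:
        reset = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
        time.sleep(max(reset - time.time(), 0) + 1)
    resp.raise_for_status()
    return resp.json()

since = 0
samples = []
while True:
    page = fetch_json("https://api.github.com/users",
                      params={"since": since, "per_page": 100})
    if not page:
        break
    # One extra call per sampled user to pick up created_at and location.
    for user in random.sample(page, min(SAMPLE_PER_PAGE, len(page))):
        detail = fetch_json(user["url"])
        samples.append((detail["id"], detail.get("created_at"),
                        detail.get("location")))
    since = page[-1]["id"]  # the next page starts after the last id seen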

Related

XRPL: How to get the history of the balance of an account?

I would like to query the history of the balance of an XRPL account with the new WebSocket API.
For example, how do I check the balance of an account on a particular day?
I know with the v2 api, there was a possibility to query balance_changes. But this doesn't seem to be part of the new version.
For example:
https://data.ripple.com/v2/accounts/rf1BiGeXwwQoi8Z2ueFYTEXSwuJYfV2Jpn/balance_changes?start=2018-01-01T00:00:00Z
How is this done with the new WebSocket APIs?
There's no single convenient call in the WebSocket API that gets you this. I assume you want the XRP balance, not token/issued-currency balances, which are tracked in a different place.
One way to go about it is to make an account_tx call and then iterate through the metadata. Many, but not all, transactions will have a ModifiedNode entry of type AccountRoot—if that transaction changed the account's XRP balance, you can see the difference in the PreviousFields vs. FinalFields for that entry. The Look Up Transaction Results tutorial has some details on how to parse out metadata this way. There are some kind of tricky edge cases here: for example, if you send a transaction that buys 10 drops of XRP in the exchange but burns 10 drops of XRP as a transaction cost, then the metadata won't show a balance change because the net change was zero (+10, -10).
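A rough sketch of that metadata walk over a raw WebSocket connection (it uses the websocket-client package; the server URL is just a public example, and the field names assume the v1-style account_tx response shape):

import json
import websocket  # pip install websocket-client

ACCOUNT = "rf1BiGeXwwQoi8Z2ueFYTEXSwuJYfV2Jpn"
ws = websocket.create_connection("wss://s1.ripple.com/")
ws.send(json.dumps({
    "command": "account_tx",
    "account": ACCOUNT,
    "ledger_index_min": -1,   # earliest available on this server
    "ledger_index_max": -1,   # latest validated
    "limit": 200,             # page with the returned "marker" for more
}))
result = json.loads(ws.recv())["result"]

for tx in result["transactions"]:
    for node in tx["meta"]["AffectedNodes"]:
        mod = node.get("ModifiedNode")
        if not mod or mod.get("LedgerEntryType") != "AccountRoot":
            continue
        final = mod.get("FinalFields", {})
        prev = mod.get("PreviousFields", {})
        # Only AccountRoot entries for our account with a changed Balance matter.
        if final.get("Account") == ACCOUNT and "Balance" in prev:
            change = int(final["Balance"]) - int(prev["Balance"])
            print(tx["tx"]["hash"], f"{change:+d} drops")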
Another approach could be to estimate what ledger_index was most recently closed at a given time, then use account_info to look up the account's balance as of that time. The hard part there is figuring out what the latest ledger index was at a given time. This is one of the places where the Data API was just more convenient than the WebSocket API—there's no way to look up by date in WebSocket so you have to try a ledger index, see what the close time of the ledger was, try another ledger index, see what the date is, etc.
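And a sketch of that second approach: binary-search the ledger index by close time, then ask account_info for the balance as of that ledger. The Ripple-epoch offset (946684800 seconds) converts close times to Unix time; the server URL, date, and bounds are illustrative, and old ledgers require a server that keeps enough history.

import json
import websocket  # pip install websocket-client

RIPPLE_EPOCH = 946684800  # seconds between the Unix epoch and 2000-01-01

def request(ws, payload):
    ws.send(json.dumps(payload))
    return json.loads(ws.recv())["result"]

def close_time(ws, ledger_index):
    """Unix close time of a given ledger."""
    res = request(ws, {"command": "ledger", "ledger_index": ledger_index})
    return res["ledger"]["close_time"] + RIPPLE_EPOCH

def ledger_at(ws, unix_time, lo, hi):
    """Largest ledger index closed at or before unix_time, searched in [lo, hi]."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if close_time(ws, mid) <= unix_time:
            lo = mid
        else:
            hi = mid - 1
    return lo

ws = websocket.create_connection("wss://s2.ripple.com/")  # public full-history cluster
latest = request(ws, {"command": "ledger", "ledger_index": "validated"})["ledger_index"]
idx = ledger_at(ws, 1514764800, lo=32570, hi=latest)  # 2018-01-01T00:00:00Z
info = request(ws, {"command": "account_info",
                    "account": "rf1BiGeXwwQoi8Z2ueFYTEXSwuJYfV2Jpn",
                    "ledger_index": idx})
print(info["account_data"]["Balance"], "drops at ledger", idx)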

Does it make sense to implement an API call for stats on a dashboard page?

I'm adding a page to my site that is kind of an overview of all the other data. It will include things like total X (rows), number of registered users, number of unregistered users, total number of users with property x, etc. I was thinking of adding an API route that returns an object of key/value pairs that are exactly the stats I need (rather than, say, fetching all the data and aggregating it on the front end, since I assume doing a COUNT in SQL is going to be a lot faster). Does it make sense to implement such a route, or is this poor design? The route would likely be called on an interval, every 30 seconds or so.
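For reference, a route like that usually boils down to a handful of aggregate queries; here is a minimal sketch in Flask with SQLite, where the table and column names are made up for illustration:

import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/stats")
def stats():
    # Let the database do the counting; only a few small numbers cross the wire.
    db = sqlite3.connect("app.db")
    try:
        cur = db.cursor()
        totals = {
            "total_rows": cur.execute(
                "SELECT COUNT(*) FROM items").fetchone()[0],
            "registered_users": cur.execute(
                "SELECT COUNT(*) FROM users WHERE registered = 1").fetchone()[0],
            "unregistered_users": cur.execute(
                "SELECT COUNT(*) FROM users WHERE registered = 0").fetchone()[0],
        }
        return jsonify(totals)
    finally:
        db.close()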

Extract significant near future local events exact location and time using Facebook Graph Search api

I am looking for a way to extract the exact location and time of significant (number of attendees > threshold), near-future (within the next week) local events using the Facebook Graph Search API.
If local cannot be done, I could just specify a city (Athens, GR, for example) instead.
It would be absolutely great if the info could be extracted with one query, but I think this is too much to hope for.
What I have tried so far is:
search?fields=location,events,name&limit=300&q=athens&type=place
This produces a set of results with names matching "athens", as well as their exact locations, but not the event time, number of attendees, or event name.
{event_ID}?fields=attending.limit(1).summary(true)
This produces the number of attendees for a specific event_ID.
The total number of significant (let's assume more than 300 attendees) events over a week's span in Athens, GR should not be very high, so I could manually query the API as a last-resort solution.
Does anyone have any idea if/how what I am asking can be achieved?
Thank you very much in advance.
You can't do this in just one query, although you can probably batch some requests (https://developers.facebook.com/docs/graph-api/making-multiple-requests).
What I would do is:
Geo query to place: https://developers.facebook.com/docs/graph-api/using-graph-api/v2.3#search
GET graph.facebook.com
/search?
q=coffee&
type=place&
center=37.76,-122.427&
distance=1000
Get the page_id and query for public events (batch): https://developers.facebook.com/docs/graph-api/reference/page/events
Get attendants for those events (batch): https://developers.facebook.com/docs/graph-api/reference/v2.3/event/attending
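Putting those steps together, a rough sketch using the Graph API batch endpoint (the access token, API version, coordinates, and field selections here are illustrative and may differ for your app):

import json
import requests

TOKEN = "YOUR_ACCESS_TOKEN"  # placeholder
GRAPH = "https://graph.facebook.com/v2.3"

# 1. Geo query for places around Athens, GR.
places = requests.get(f"{GRAPH}/search", params={
    "q": "athens", "type": "place",
    "center": "37.98,23.73", "distance": 5000,
    "access_token": TOKEN,
}).json()["data"]

# 2. Batch: one sub-request per place, asking for its events.
batch = [{"method": "GET",
          "relative_url": f"{p['id']}/events?fields=name,start_time,place"}
         for p in places[:50]]  # a batch is capped at 50 requests
events_resp = requests.post(f"{GRAPH}/", data={
    "access_token": TOKEN,
    "batch": json.dumps(batch),
}).json()

# 3. For each event found, a further batched call to
#    {event_id}?fields=attending.limit(1).summary(true) gives the attendee count.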
I hope it helps.

Facebook Request(s): what counts as 1 request?

I am currently creating an application that polls Facebook for data. First, I request a page in this fashion...
pageID/posts?fields=id,message,created_time,type&limit=250
This returns the top 250 posts from a page. I then check whether the paging next value is set and, if it is, make another request for the next 250 posts. I continue this recursively until there are no more posts.
With each post that is returned, I go out and fetch the post details from the Graph API as well.
My question: if I had 500 posts on a page, would that equate to 502 requests (500 requests, one for each post's details, plus 2 for paging through the page data to get the posts), or am I incorrect in my understanding of a "request"? I know that when batching calls, each query included in the batch actually counts as 1 request. The goal is to avoid the 600 calls / 600 seconds rate limiting. Thanks!
Every API call is...well, 1 request. So every time you use the /posts endpoint with whatever limit, it will be 1 request. For example, if you do that call you posted, it will be one request that returns 250 elements.
Batch requests are just faster, but each call in the batch counts as a request. So if you combine 10 calls in a batch, it will be 10 requests. The benefit of batch calls is really just that they are a lot faster: as fast as the slowest call in the batch.
If you want to get 500 posts with that example of yours, you would only need 2 calls: the first returns 250 elements, and the second uses the API call defined in the "next" value to get another 250. Just keep in mind that the default is usually 25 elements, and you can't use any limit you want. There is a maximum limit per call, and it gets changed from time to time AFAIK, so don't count on getting the same result every time.
Btw, don't be too fixated on that 600 calls / 600 seconds limit; it's just a general guideline. The real limit is dynamic and depends on many factors. It's not public, of course. But if you really hit the limit, you are doing something wrong anyway.
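To make the request math concrete, here is a minimal sketch of walking a page's posts via the paging "next" link, counting each HTTP round trip as one call (the page ID and token are placeholders):

import requests

TOKEN = "YOUR_ACCESS_TOKEN"  # placeholder
url = "https://graph.facebook.com/PAGE_ID/posts"
params = {"fields": "id,message,created_time,type", "limit": 250,
          "access_token": TOKEN}

posts, calls = [], 0
while url:
    resp = requests.get(url, params=params).json()
    calls += 1
    posts.extend(resp.get("data", []))
    # The "next" URL already carries all query params, including the token.
    url = resp.get("paging", {}).get("next")
    params = None

print(f"{len(posts)} posts fetched in {calls} calls")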

Maximum number of network updates retrieved per API call

Is there any restriction on the number of entries that are retrieved using a single call to the Network Updates API? I found this forum comment "The per-user limit is per call, so 300 requests with however many updates they have." on the thread
http://developer.linkedin.com/forum/increase-search-api-throttle-limit
I want to confirm that indeed there is no limit. I have received as many as 106 entries in a single call.
Thanks in advance.
The maximum number of updates returned from the Network Updates API appears to be 250. Performing the following query as an example:
http://api.linkedin.com/v1/people/~/network/updates?count=500
Even if I try to specify the start parameter at, say, 250, I can't get the next 250 updates from the API:
http://api.linkedin.com/v1/people/~/network/updates?count=250&start=250
So it looks like 250 is the max, with no ability to page beyond that.
UPDATE:
Have verified that 250 is the maximum number returned, either in a single call or via the paging parameters. Looks like the documentation has been updated to reflect this.