I would like to stream all Twitter tweets (yes, I am crazy) in order to compute some stats.
I have no special permissions; I am just a regular Twitter user.
To begin, I am just testing whether it is possible: I log on to my server, which has 100 Mb/s of bandwidth (I checked that this is true),
and I run this command:
curl -d 'track=http' http://stream.twitter.com/1/statuses/filter.json -umyuser:mypasswd | grep 'xxxxxx'
I added the 'grep' just so the tweets are not displayed, to avoid being limited by the speed of printing text to the screen.
Then I used 'dstat' to check the bandwidth used: it is always capped at 128 KB/s (that is only 1 Mb/s). As a tweet weighs about 2 KB, it seems I can stream only about 64 tweets per second... much less than the real volume (more than 1000 tweets/s, I believe).
Even if I add some frequent terms to the track list, the bandwidth stays stuck at 128 KB/s.
Do you have any idea how to get the full stream?
Unless you have been granted the privilege, you can't access the fully unmetered firehose stream. With just basic privileges, you are relegated to what you are doing now: using filter, or using sample (which delivers approximately 1% of Twitter's tweet volume).
You can try to gain elevated access by asking Twitter; however, they don't seem to hand out elevated privileges unless you can justify your use case pretty well.
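For what it's worth, here is a minimal Node sketch of measuring the sample stream's throughput the same way you did with dstat. It reuses the v1 endpoint and basic auth from your curl command; both are assumptions about your setup (and long since deprecated), so treat it purely as illustration.
var https = require('https');
var options = {
  hostname: 'stream.twitter.com',
  path: '/1/statuses/sample.json',   // the ~1% sample stream mentioned above
  auth: 'myuser:mypasswd'            // same basic-auth style as the curl command
};
https.get(options, function (res) {
  var bytes = 0;
  res.on('data', function (chunk) {
    bytes += chunk.length;           // count traffic instead of printing tweets
  });
  setInterval(function () {
    console.log((bytes / 1024).toFixed(1) + ' KB received in the last second');
    bytes = 0;
  }, 1000);
});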
I'm trying to list all the users in Jira using the REST API. I'm currently using the user search feature via GET: https://docs.atlassian.com/jira/REST/server/#api/2/user-findUsers
The thing is, the documentation says that by default the result contains the first 50 users and that this can be expanded up to 1000. Unlike other features available in the REST API, pagination is not specified here.
An example is the group member feature: https://docs.atlassian.com/jira/REST/server/#api/2/group-getUsersFromGroup
So I ran a test: with my test Jira containing 2 members, I asked for only one result to see whether there was some sort of indication referring to the rest of the results.
The response only gives the results themselves, with no way to know whether there were more than 1000 (or more than 1 in my example). That may be logical, but for an organization with more than 1000 members, listing all the users with http://jira/rest/api/2/user/search?username=.&maxResults=1000&includeInactive=true will give at most 1000 results.
I'm matching all users regardless of their name by using . as the search character.
Thanks for your help!
What you can do is calculate the number of users manually.
Let's say you have 98 users in your system.
The first search will give you 50 users. Now you have an array, and its length is 50.
Since you do not know whether there are 50 or 51 users, you execute another search with the parameter &startAt=50.
This time the array length is 48 instead of 50, so you know you've reached all the users in the system.
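A rough sketch of that loop (assuming Node 18+ for the built-in fetch; jiraBase and the credentials are placeholders, and any server-side cap such as the 1,000-result limit may still apply):
async function fetchAllUsers(jiraBase, username, password) {
  const auth = 'Basic ' + Buffer.from(username + ':' + password).toString('base64');
  const pageSize = 50;
  let users = [];
  let startAt = 0;
  while (true) {
    const url = jiraBase + '/rest/api/2/user/search'
      + '?username=.&includeInactive=true'
      + '&startAt=' + startAt + '&maxResults=' + pageSize;
    const res = await fetch(url, { headers: { Authorization: auth, Accept: 'application/json' } });
    const page = await res.json();
    users = users.concat(page);
    if (page.length < pageSize) break;   // a short page means there are no more users
    startAt += pageSize;
  }
  return users;
}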
From speaking to Atlassian support, it seems like the user/search endpoint has a bug where it will only ever return the first 1,000 results at most.
Another possible way to get all of the users in your JIRA instance is to use the Crowd API's /rest/usermanagement/1/search endpoint:
curl -X GET \
'https://jira.url/rest/usermanagement/1/search?entity-type=user&start-index=0&max-results=1000&expand=user' \
-H 'Accept: application/json' -u username:password
You'll need to create a new JIRA User Server entry to create Crowd credentials (the username:password parameter above) for your application to use in its REST API calls:
Go to User Management.
Select JIRA User Server.
Add an application.
Enter the application name and password that the application will use when accessing your JIRA server application.
Enter the IP address, addresses, or IP CIDR block of the application, and click Save.
Imagine a request that starts a long running process whose output is a large set of records.
We could start the process with a POST request:
POST /api/v1/long-computation
The output consists of a large sequence of numbered records that must be sent to the client. Since the output is large, the server does not store everything; it maintains a window of records with an upper limit on its size. Let's say it stores up to 1000 records (and pauses the computation whenever that many records are available). When the client fetches records, the server may subsequently delete those records and so continue generating more (as slots in the 1000-record window are freed).
Let's say we fetch records with:
GET /api/v1/long-computation?ack=213
We can take this to mean that the server should return records starting from index 214. When the server receives this request, it can assume that the (well-behaved) client is acknowledging that records up to number 213 have been received, so it deletes them and then returns records starting from number 214 up to whatever is available at that time.
Next if the client requests:
GET /api/v1/long-computation?ack=214
the server would delete record 214 and return records starting from 215.
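To make the mechanics concrete, here is a rough in-memory sketch of that window (the class and method names are mine, purely illustrative):
// Illustrative only: the bounded window of records described above.
class RecordWindow {
  constructor(limit = 1000) {
    this.limit = limit;
    this.records = new Map();                        // record index -> record
    this.paused = false;
  }
  add(index, record) {
    this.records.set(index, record);
    this.paused = this.records.size >= this.limit;   // pause the computation when full
  }
  ackAndFetch(ack) {
    for (const index of this.records.keys()) {
      if (index <= ack) this.records.delete(index);  // the client has these; free the slots
    }
    this.paused = this.records.size >= this.limit;   // computation may resume if slots freed
    return [...this.records.entries()]
      .filter(([index]) => index > ack)
      .map(([, record]) => record);
  }
}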
This seems like a reasonable design until it is noticed that GET requests need to be safe and idempotent (see section 9.1 in the HTTP RFC).
Questions:
Is there a better way to design this API?
Is it OK to keep it as GET even though it appears to violate the standard?
Would it be reasonable to make it a POST request such as:
POST /api/v1/long-computation/truncate-and-fetch?ack=213
One question I always feel needs to be asked is: are you sure REST is the right approach for this problem? I'm a big fan and proponent of REST, but I try to apply it only to situations where it's applicable.
That being said, I don't think there's anything necessarily wrong with expiring resources after they have been used, but I think it's bad design to re-use the same URL over and over again.
Instead, when I call the first set of results (maybe with):
GET /api/v1/long-computation
I'd expect that resource to give me a next link with the next set of results.
That particular URL design does sort of tell me there's only one long-computation going on in the entire system at a time, though. If that's not the case, I would also expect a bit more uniqueness in the URL design.
The best solution here is to buy a bigger hard drive. I'm assuming you've pushed back and that's not in the cards.
I would consider your operation to be "unsafe" as defined by RFC 7231, so I would suggest not using GET. I would also strongly advise you to not delete records from the server without the client explicitly requesting it. One of the principles REST is built around is that the web is unreliable. Under your design, what happens if a response doesn't make it to the client for whatever reason? If they make another request, any records from the lost response will be destroyed.
I'm going to second @Evert's suggestion that, if you absolutely must keep this design, you instead pick a technology that's built around reliable delivery of information, such as a message queue. If you're going to stick with REST, you need to allow clients to tell you when it's safe to delete records.
For instance, is it possible to page records? You could do something like:
POST /long-running-operations?recordsPerPage=10
202 Accepted
Location: "/long-running-operations/12"
{
"status": "building next page",
"retry-after-seconds": 120
}
GET /long-running-operations/12
200 OK
{
"status": "next page available",
"current-page": "/pages/123"
}
-- or --
GET /long-running-operations/12
200 OK
{
"status": "building next page",
"retry-after-seconds": 120
}
-- or --
GET /long-running-operations/12
200 OK
{
"status": "complete"
}
GET /pages/123
{
// a page of records
}
DELETE /pages/123
// remove this page so new records can be made
You'll need to cap the page size at the number of records you can buffer. If the client requests fewer than that limit, you can generate more records in the background while they process the first page.
That's just spitballing, but maybe you can start there. No promises on quality - this is totally off the top of my head. This approach is a little chatty, but it saves you from returning a 404 if the new page isn't ready yet.
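If it helps, here is roughly how a client might drive that exchange. It is a sketch only: the endpoints are the hypothetical ones above, handleRecords is a placeholder, and Node 18+ is assumed for the built-in fetch.
function handleRecords(records) {
  // placeholder: process one page of records
}

async function consumeAllPages(base) {
  // start the operation and remember where to poll
  const start = await fetch(base + '/long-running-operations?recordsPerPage=10', { method: 'POST' });
  const operationUrl = base + start.headers.get('Location');

  while (true) {
    const status = await (await fetch(operationUrl)).json();
    if (status.status === 'complete') break;
    if (status.status === 'building next page') {
      // wait the suggested time before polling again
      await new Promise(resolve => setTimeout(resolve, (status['retry-after-seconds'] || 60) * 1000));
      continue;
    }
    // "next page available": fetch it, process it, then explicitly free it
    const pageUrl = base + status['current-page'];
    handleRecords(await (await fetch(pageUrl)).json());
    await fetch(pageUrl, { method: 'DELETE' });      // tell the server it may reclaim the slot
  }
}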
To make a long story short, I need to gather geolocation data and signup date on ~23 million GitHub users for a historical data visualization project. I need to do this within a week's time.
I'm rate-limited to 5000 API calls/hour while authenticated. This is a very generous rate limit, but unfortunately, I've run into a major issue.
The GitHub API's "get all users" feature is great (it gives me around 30-50 users per API call, paginated), but it doesn't return the complete user information I need.
If you look at the following call (https://api.github.com/users?since=0), and compare it to a call to retrieve one user's information (https://api.github.com/users/mojombo), you'll notice that only the user-specific call retrieves the information I need such as "created_at": "2007-10-20T05:24:19Z" and "location": "San Francisco".
This means that I would have to make 1 API call per user to get the data I need. Processing 23,000,000 users then requires 4,600 hours or a little over half a year. I don't have that much time!
Are there any workarounds to this, or any other ways to get geolocation and sign up timestamp data of all GitHub users? If the paginated get all users API route returned full user info, it'd be smooth sailing.
I don't see anything obvious in the API. It's probably specifically designed to make mass data collection very slow. There are a few things you could do to try to speed this up.
Increase the page size
In your https://api.github.com/users?since=0 call, add per_page=100. That cuts the number of calls needed to walk the whole user list to roughly a third of what it would be with the default page size of 30.
Randomly sample the user list.
With a set of 23 million people, you don't need to poll every single one to get good data about GitHub signup and location patterns. Since you're sampling the entire population randomly, there's no polling bias to account for.
If user 1200 signed up on 2015-10-10 and user 1235 also signed up on 2015-10-10 then you know users 1201 to 1234 also signed up on 2015-10-10. I'm going to assume you don't need any more granularity than that.
Location can also be randomly sampled: you could sample 1 in 10 users, or even 1 in 100 (one per page). 230,000 out of 23 million is a great polling sample. Professional national polls in the US use sample sizes of a few thousand people for an even bigger population.
How Much Sampling Can You Do In A Week?
A week gives you 168 hours or 840,000 requests.
You can get 1 user or 1 page of 100 users per request. Getting all the users, at 100 users per request, is 230,000 requests. That leaves you with 610,000 requests for individual users or about 1 in 37 users.
I'd go with 1 in 50 to account for download and debugging time. So you can poll two random users per page or about 460,000 out of 23 million. This is an excellent sample size.
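A rough sketch of that sampling loop (Node 18+ fetch assumed; GITHUB_TOKEN is a placeholder personal access token, and rate-limit handling, i.e. watching the X-RateLimit-Remaining header, is left out for brevity):
const TOKEN = process.env.GITHUB_TOKEN;
const headers = { Authorization: 'token ' + TOKEN, 'User-Agent': 'sampling-script' };

async function sampleUsers() {
  let since = 0;
  const samples = [];
  while (true) {
    const page = await (await fetch(
      'https://api.github.com/users?per_page=100&since=' + since, { headers })).json();
    if (!Array.isArray(page) || page.length === 0) break;

    // pick two random users from this page of up to 100
    for (let i = 0; i < 2; i++) {
      const login = page[Math.floor(Math.random() * page.length)].login;
      const user = await (await fetch('https://api.github.com/users/' + login, { headers })).json();
      samples.push({ login: user.login, location: user.location, created_at: user.created_at });
    }
    since = page[page.length - 1].id;   // next page starts after the last user id seen
  }
  return samples;
}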
I'm working on a filtered live search module with Meteor.js.
Use case & problem:
A user wants to search through all the users to find friends, but I cannot afford to send each user the complete users collection. The user filters the search using checkboxes. I'd like to subscribe only to the matched users. What is the best way to do it?
I guess it would be better to build the query client-side and then send it to the method to get back the desired set of users. But I wonder: when the filtering criteria change, does the new subscription erase all of the old one? Because if a first search returns [usr1, usr3, usr5], and a later search returns [usr2, usr4], the best outcome would be to keep the first set and simply add the new one to the client-side subscribed collection.
In addition, if I then do a third search which should return [usr1, usr3, usr2, usr4], the autorun subscription would not need to send me anything, as I already have the whole result set in my collection.
The goal is to save processing and data transfer on the server.
I have some ideas, but I haven't coded enough of it yet to share it in an easily comprehensible way.
What would you advise in order to be as efficient as possible in terms of time and performance?
Thank you all.
David
It depends on your application, but you'll probably send a non-empty string to a publisher which uses that string to search the users collection for matching names. For example:
Meteor.publish('usersByName', function(search) {
check(search, String);
// make sure the user is logged in and that search is sufficiently long
if (!(this.userId && search.length > 2))
return [];
// search by case insensitive regular expression
var selector = {username: new RegExp(search, 'i')};
// only publish the necessary fields
var options = {fields: {username: 1}};
return Meteor.users.find(selector, options);
});
Also see common mistakes for why we limit the fields.
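On the client, the matching piece might look like this (a sketch; 'searchText' is just a placeholder Session key wired to your search input):
Tracker.autorun(function () {
  var search = Session.get('searchText') || '';
  if (search.length > 2) {
    // resubscribes reactively whenever the search string changes
    Meteor.subscribe('usersByName', search);
  }
});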
performance
Meteor is clever enough to keep track of the current document set that each client has for each publisher. When the publisher reruns, it knows to only send the difference between the sets. So the situation you described above is already taken care of for you.
If you were subscribed for users: 1,2,3
Then you restarted the subscription for users 2,3,4
The server would send a removed message for 1 and an added message for 4.
Note this will not happen if you stopped the subscription prior to rerunning it.
To my knowledge, there isn't a way to avoid removed messages when modifying the parameters for a single subscription. I can think of two possible (but tricky) alternatives:
Accumulate the intersection of all prior search queries and use that when subscribing. For example, if a user searched for {height: 5} and then searched for {eyes: 'blue'} you could subscribe with {height: 5, eyes: 'blue'}. This may be hard to implement on the client, but it should accomplish what you want with the minimum network traffic.
Accumulate active subscriptions. Rather than modifying the existing subscription each time the user modifies the search, start a new subscription for the new set of documents, and push the subscription handle to an array. When the template is destroyed, you'll need to iterate through all of the handles and call stop() on them. This should work, but it will consume more resources (both network and server memory + CPU).
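A rough sketch of this second alternative ('usersByFilter', the template name, and the event selector are all placeholders for your own code):
var searchHandles = [];

Template.friendSearch.events({
  'submit .search-form': function (event) {
    event.preventDefault();
    var filters = { /* built from the current checkbox state */ };
    // start an additional subscription instead of replacing the previous one
    searchHandles.push(Meteor.subscribe('usersByFilter', filters));
  }
});

Template.friendSearch.onDestroyed(function () {
  // stop every accumulated subscription when the template goes away
  searchHandles.forEach(function (handle) { handle.stop(); });
  searchHandles = [];
});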
Before attempting either of these solutions, I'd recommend benchmarking the worst case scenario without using them. My main concern is that without fairly tight controls, you could end up publishing the entire users collection after successive searches.
If you want to go easy on your server, you'll want to send as little data to the client as possible. That means every document you send to the client that is NOT a friend is waste. So let's eliminate all that waste.
Collect your filters (e.g. filters = {sex: 'Male', state: 'Oregon'}). Then call a method to search based on your filters (e.g. Users.find(filters)). Additionally, you can run your own proprietary ranking algorithm to determine the % chance that a person is a friend. Maybe base it on distance from IP address (or from phone GPS history), mutual friends, etc. This will pay dividends in efficiency in a bit. Index things like GPS coordinates or other highly selective attributes, and maybe try out composite indexes. But remember, more indexes mean slower writes.
Now you've got a cursor with all possible friends, ranked from most likely to least likely.
Next, change your subscription to match those friends, but put a limit:20 on there. Also, only send over the fields you need. That way, if a user wants to skip this step, you only wasted sending 20 partial docs over the wire. Then, have an infinite scroll or 'load more' button the user can click. When they load more, it's an additive subscription, so it's not resending duplicate info. Discover Meteor describes this pattern in great detail, so I won't.
After a few clicks/scrolls, the user won't find any more friends (because you were smart & sorted them) so they will stop trying & move on to the next step. If you returned 200 possible friends & they stop trying after 60, you just saved 140 docs from going through the pipeline. There's your efficiency.
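Here is a sketch of that limit-based pattern: the publisher caps what a single client can pull, and the client bumps the limit on each 'load more'. The names ('possibleFriends', 'friendFilters', 'friendLimit') and the score field are placeholders, not anything from your code.
// Server
Meteor.publish('possibleFriends', function (filters, limit) {
  check(filters, Object);
  check(limit, Number);
  return Meteor.users.find(filters, {
    sort: { score: -1 },                     // assuming you stored your ranking score on the doc
    limit: Math.min(limit, 200),             // hard ceiling on what one client can pull
    fields: { username: 1, profile: 1 }      // only the fields the UI needs
  });
});

// Client
Session.setDefault('friendLimit', 20);
Tracker.autorun(function () {
  Meteor.subscribe('possibleFriends', Session.get('friendFilters') || {}, Session.get('friendLimit'));
});
// on the 'load more' click or infinite-scroll trigger:
// Session.set('friendLimit', Session.get('friendLimit') + 20);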
For example, I want to parse all the tweets of Microsoft.
Twitter has an API for this: GET statuses/user_timeline. But as far as I can see, this method can only return up to 3,200 of a user's most recent Tweets.
So, can I parse all tweets of some screen_name?
You might do a bit better with GET search/tweets, using the query q=#Microsoft.
However, you will have problems as well:
You'll get all tweets mentioning #Microsoft, not only the ones in the user_timeline. You will have to filter afterwards.
Although in theory there is no limit like the 3,200 of GET statuses/user_timeline, you probably won't be able to get all tweets from a user (the standard Search API only indexes roughly the last week of tweets). By design, the Twitter API does not provide that kind of service. If you want all tweets you'll need to use a service like Topsy (not free).
You'll have to use pagination, since every query to GET search/tweets gives you a maximum of 100 tweets. If you need more than 450*100 tweets (remember you'll get a mixture of tweets, as pointed out in 1. above), you'll have to handle Twitter rate limits and spread your queries over 15-minute windows (see the sketch after this list).
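A rough sketch of that pagination loop, assuming the v1.1 REST endpoint and application-only (bearer-token) auth; TWITTER_BEARER is a placeholder, and Node 18+ is assumed for fetch:
const BEARER = process.env.TWITTER_BEARER;

async function searchAll(query) {
  let maxId = null;
  const tweets = [];
  while (true) {
    let url = 'https://api.twitter.com/1.1/search/tweets.json'
      + '?q=' + encodeURIComponent(query) + '&count=100';
    if (maxId) url += '&max_id=' + maxId;

    const res = await fetch(url, { headers: { Authorization: 'Bearer ' + BEARER } });
    if (res.status === 429) {
      // rate limited: wait out the 15-minute window and retry
      await new Promise(resolve => setTimeout(resolve, 15 * 60 * 1000));
      continue;
    }
    const statuses = (await res.json()).statuses || [];
    if (statuses.length === 0) break;
    tweets.push(...statuses);
    // next page: everything strictly older than the oldest tweet seen so far
    maxId = (BigInt(statuses[statuses.length - 1].id_str) - 1n).toString();
  }
  return tweets;
}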
Sorry there is not a simpler answer to your question... Hope it helps anyway.
(EDIT: 450*100 is assuming you use application-only authentication. If not, it is 180*100)