What is the best way to process 25k API calls in 1 hour? - queue

I'm working on a tracking site that that tracks a player's levels for a game from day to day.
It's going to have to process around 25,000 API calls once a day. I'd like to be able to get this done in 1 hour but I would be okay with processing them all in 2 hours for now.
This is the API I would need to call for each player in my database to get their information: http://hiscore.runescape.com/index_lite.ws?player=Zezima
My site and database are hosted on a VPS.
My thought on how to achieve this is to spin up a handful of Digital Ocean VPS instances when the time comes to make the API calls and have my main VPS distribute the API calls across the DO instances which will make the API calls and insert the results back into my database.

Parallelization is your friend here. Pool your queue listeners and have them run on a machine with adequate CPU and memory.
How fast is your process? Completing 25,000 transactions in one hour means 7 per second. Do you have timing data to help guide the number of instances you'll need?
I'm assuming that your database will allow simultaneous INSERTs. You don't want those getting in each other's way.

Related

How does Fabric Answers send data to the server, should events be submitted periodically or immediately?

I've used Fabric for quite a few applications, however I was curious about the performance when a single application submits potentially hundreds of events per minute.
For this example I'm going to be using a Pedometer application, in which I would want to keep track of the amount of steps users are taking in my application. Considering the average user walks 100 steps per minute, I wouldn't want the application to be sending several dozen updates to the server.
How would Fabric handle this, would it just tell the server "Hey, there were 273 step events in the last 5 minutes with this meta deta" or would it sent 273 step events.
Pedometer applications typically run in the background so how would we get data to Fabric without the user opening the application
Great question! Todd from Fabric. These get batched and sent at time intervals and also certain events (like installs) trigger an upload of the queued events data. You can watch our traffic in Xcode debugger if you are curious about the specifics for your app.

How to ensure that parallel queries to ext. system are executed only once and then cached

Server frameworks: Scala, Play 2.2, ReactiveMongo, Heroku
I think I have quite interesting brain teaser for you:
In my trip-planning application I want to display weather forecast on a map(similar to this). I'm using a paid REST service to query weather data. To speed up user experience and reduce costs I plan to cache weather data for each location for one hour.
There are a few not-so obvious things to consider:
It might require to query up to 100 location for weather to display one weather map
Weather must be queried in parallel because it would take too long to query it in serial fashion considering network latency
However launching 100 threads for each user request is not an option as well (imagine just 5 users looking at a map at one time)
The solution is to have let's say 50 workers that query weather for user requests
Multiple users might be viewing the same portion of map
There is a possible racing condition where one location is queried multiple times.
However it should be queried only once and then cached.
The application is running in clustered environment meaning there will be several play instances.
Coming from a Java EE background I can come up with a pretty good solution using the Java EE stack.
However I wonder how to do this using something more natural to Scala/Play stack: Akka. There is an example (google "heroku scala akka") for similar problem but it doesn't solve one issue: Racing condition when multiple users query the same data at once.
How would you implement this?
EDIT: I have decided that the requirement to ensure that weather data is updated only once is not necessary. The situation would happen far too infrequently to be a real problem and all proposed solutions would bring too much overhead and complexity to the system to be viable.
Thanks everyone for your time and effort. I hope answers to this question will help someone in the future with similar problem.
In Akka you can choose from multiple routing strategies. ConsistentHashingRoutingLogic could serve you well in this situation. Since actors are single-threaded you can easily maintain a cache in each actor. This routing logic will assure that two equal messages will always hit the same actor.
Each actor can work in the following way:
1. check local cache (for example apache commons LRUMap)
- if found, return
2. check global cache (distributed memcache or any other key-value store)
- if found, store the result in the local cache and return
3. query the REST service
4. store the result in the global and local caches
You can have a look at this question, which I based my answer on.
I decided that I'll post my JMS solution as well.
Controller that processes the request for weather does following:
Query the DB for weather data. If there are NO locations with out-of-date data reply immediately. Otherwise continue:
Start listening on a topic (explained later).
For each location: Check whether the weather for the location isn't being updated.
If not send a weather update request message to queue.
Certain amount of workers (50?) listen to that queue.
Worker first marks the location weather as being updated
Worker retrieves updated weather and updates the DB.
Worker sends a message to a topic with weather data for that location.
When controller receives (via topic) weather updates for all out-of-date locations, combine it with up-to-date locations and reply.

Need inputs on optimising web service calls from Perl

Current implementation-
Divide the original file into files equal to the number of servers.
Ensure each server picks one file for processing.
Each server splits the file into 90 buckets.
Use ForkManager to fork 90 processes, each operating on a bucket.
The child processes will make the API calls.
Merge the output of child processes.
Merge the output of each server.
Stats-
The size of the content downloaded using the API call is 40KB.
On 2 servers, the above process for a 225k user file runs in 15 minutes. My aim is to finish a 10 million file in 30 minutes. (Hope this doesn't sound absurd!)
I contemplated using BerkeleyDB but, couldn't find how do I convert the BerkeleyDB file into normal ASCII file.
This sounds like a one-time operation to me. Although I don't understand the 30 minute limit, I have a few suggestions I know from experience.
First of all, as I said in my comment, your bottleneck will not be reading the data from your files. It will also not be writing the results back to a harddrive. The bottleneck will be in the transfer between your machines and the remote machines. Your setup sounds sophisticated, but that might not help you in this situation.
If you are hitting a webservice, someone is running that service. There are servers that can only handle a certain ammount of load. I have brought down the dev environment servers of a big logistics company with a very small load test I ran at night. Often, these things are equipped for long-term load, but not short, heavy load.
Since IT is all about talking to each other through various protocols, like web services or other APIs, you should also consider just talking to the people who run this service. If you have a business-relationship, that is easy. If not, try to find a way to reach them and to ask if their service is able to handle so many requests at all. You could end up with them excluding you permanently because to their admins it looks like you tried to DDOS them.
I'd ask them if you could send them the files (or an excerpt of the data, cut down to what is relevant for processing) so they can do the operations in batch on their side. That way, you remove the load for processing everything as web requests, and the time it takes to do these requests.

Facebook Graph Latency

The following code fragment
for($i=0;$i<60;$i++){
$u[$i]=$_REQUEST["u".$i];
$pic[$i] =imagecreatefromjpeg("http://graph.facebook.com/".$u[$i]."/picture");
}
is taking more than 90 seconds to execute on my new server. It was taking less than 15 seconds on my shared hosting server. However, on dedicated server it is taking more than 90 seconds.
The data center of my new server is Asia Pacific.
Please advice on how I can reduce this time of fetching images on the graph.
thanks and regards
Why not just request all the pictures' URLs in a single call?
https://graph.facebook.com/?fields=picture&ids=[CSV LIST OF IDS]&access_token=ACCESS_TOKEN
You'll then have a list of all the images and can fetch them all however you so wish
is taking more than 90 seconds to execute on my new server.
Well, for 60 HTTP requests that’s not too bad, I’d say.
It was taking less than 15 seconds on my shared hosting server. However, on dedicated server it is taking more than 90 seconds.
Maybe the connection of your old server was just faster …?
The data center of my new server is Asia Pacific.
Do you know by any chance, which one it was before?
Please advice on how I can reduce this time of fetching images on the graph.
Do you have to request all these images in one go?
Maybe your app’s workflow (which we don’t know anything about yet) would allow for other approaches, like getting user images at a previous time (f.e. when a user starts using your app) and cache them locally, so that you don’t have to do 60+ HTTP requests in one go.

MS CRM recursive workflow and performance

I’m about to write a workflow in CRM that calls itself every day. This is a recursive workflow.
It will run on half a million entities each day and deactive the record if it was not been upodated in the past 3 days.
I’m worried about performance has anyone else done this.
I haven't personally implemented anything like this, but that's 500,000 records that are floating around in the DB that the async service has to keep track of, which is going to tax your hardware. In addition, CRM keeps track of recursive workflow instances. I don't have the exact specs in front of me, but if a workflow calls itself a set number of times within a certain timeframe, CRM will kill the workflow.
Could you just write a console app that asks the Crm Service for records that haven't been updated in three days, and then deactivate them? Run it as a scheduled task once a day, and then your CRM system doesn't have the burden of keeping track of all those running workflow instances.
EDIT: Ah, I see now you might have been thinking of one workflow that runs on all the records as opposed to workflows running on each record. benjynito's advice makes sense if you go this route, although I still think a scheduled task would be more appropriate than using workflow.
You'll want to make sure your workflow is running in non-peak hours. Assuming you have an on-premise installation you should be able to get away with that. If you're using a hosted instance, you might be worried about one organization running the workflow while another organization is using the system. Use the timeout and maybe a custom workflow activity, if necessary, to force the start time to a certain period.
I'm assuming you'll be as efficient as possible in figuring out which records to deactivate. (i.e. Query Expression would only bring back the records you'll be deactivating).
The built-in infinite loop-protection offered by CRM shouldn't kill your workflow instances. It stops after a call depth of 8, but it resets to 1 if no calls are made for an hour. So the fact that you're doing this once a day should make you OK on the recursive workflow front.