I am creating an API Limiter, and I am having issues deciding on what system to use for data storage.
It is really clear that I am going to need both volatile storage and persistent storage.
On the volatile I want to store a key-value like this:
read:14522145 100
read:99885669 16
read:78951585 100
This is a key composed of: {action}:{client} and an integer value (available credits).
On the persistent, I want to keep a record of all resource outages.
The algorithm (pseudo-code) is pretty simple:
MAX_AMOUNT = 100

call(action, client, cost) {
    key = action + ":" + client
    if (volatileStorage.hasKey(key)) {
        value = volatileStorage.getValue(key)
        if (value >= cost) {
            volatileStorage.setValue(key, value - cost)
            return true
        } else {
            persistentStorage.logOutage(action, client, cost)
            return false
        }
    } else {
        volatileStorage.setValue(key, MAX_AMOUNT)
        return call(action, client, cost)
    }
}
There is a parallel process that runs every N seconds for each action, increasing all keys matching {action}:* by M, up to a cap of O.
Additionally, I want to remove from the volatile store all items older (not modified since) than P seconds.
So basically every action is action<N, M, O, P>. For instance, reading users is increased every 1 second, by 5 points, up to 100, and removed after 60 seconds of inactivity: read_users<1, 5, 100, 60>.
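The scheme above (minus the TTL eviction of idle keys) can be sketched in a few lines of single-process Python; the class and method names are mine, and a plain dict stands in for the volatile store while a list stands in for the persistent outage log:

```python
import time

class CreditLimiter:
    """Credit-based limiter sketch; a dict stands in for the volatile store."""

    def __init__(self, max_amount=100):
        self.max_amount = max_amount
        self.store = {}    # "{action}:{client}" -> (credits, last_modified)
        self.outages = []  # stand-in for the persistent outage log

    def call(self, action, client, cost):
        key = f"{action}:{client}"
        # an unseen key starts with the full credit amount
        credits, _ = self.store.get(key, (self.max_amount, None))
        if credits >= cost:
            self.store[key] = (credits - cost, time.time())
            return True
        self.outages.append((action, client, cost))
        return False

    def refill(self, action, m, o):
        """The periodic process: raise every '{action}:*' key by M, capped at O."""
        prefix = action + ":"
        for key, (credits, ts) in self.store.items():
            if key.startswith(prefix):
                self.store[key] = (min(credits + m, o), ts)
```

The `last_modified` timestamp kept alongside each value is what the P-second idle eviction would scan.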
So I need volatile storage that:
Reads really quickly, without consuming too many resources (there's no point rejecting a call if doing so is more expensive than the call itself).
Allows TTL on items.
Can, with good performance, increase all keys matching a pattern (read_users:*) without exceeding a defined cap.
and persistent storage that:
Is also quick.
Can handle a large number of records.
Any advice is welcome.
This isn't an answer but an opinion: there are existing rate limiters that you would be better off using instead of making your own. Getting it right is tricky, so adopting a production-proven implementation is not only easier but also safer.
For example, the Generic cell rate algorithm is nothing short of plain magic and has several Redis implementations, including:
As a Ruby gem (that uses server-side Lua): https://github.com/rwz/redis-gcra
As a (v4) module: https://github.com/brandur/redis-cell/
Of course, there are many more Redis-based rate limiters - I used Google to find them ;)
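For a flavor of why GCRA is attractive: it needs only one stored timestamp per key (the "theoretical arrival time") and no background refill process at all. A rough single-process sketch of the algorithm - parameter names are mine, not taken from either library above:

```python
import time

class GCRA:
    """Sketch of the Generic Cell Rate Algorithm: one request per `period`
    seconds on average, with bursts of up to `burst` back-to-back requests."""

    def __init__(self, period, burst, clock=time.monotonic):
        self.period = period
        self.tau = (burst - 1) * period  # tolerated burst window
        self.tat = None                  # theoretical arrival time
        self.clock = clock

    def allow(self):
        now = self.clock()
        tat = now if self.tat is None else self.tat
        if tat - now > self.tau:         # arrived too far ahead of schedule
            return False
        self.tat = max(tat, now) + self.period
        return True
```

The Redis implementations keep `tat` in a key per client and do this check/update atomically server-side (e.g. in Lua).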
I am evaluating the performance of Service Fabric with a Reliable Dictionary of ~1 million keys. I'm getting fairly disappointing results, so I wanted to check if either my code or my expectations are wrong.
I have a dictionary initialized with
dict = await _stateManager.GetOrAddAsync<IReliableDictionary2<string, string>>("test_"+id);
id is unique for each test run.
I populate it with a list of strings, like
"1-1-1-1-1-1-1-1-1",
"1-1-1-1-1-1-1-1-2",
"1-1-1-1-1-1-1-1-3".... up to 576,000 items. The value in the dictionary is not used, I'm currently just using "1".
It takes about 3 minutes to add all the items to the dictionary. I have to split the transaction to 100,000 at a time, otherwise it seems to hang forever (is there a limit to the number of operations in a transaction before you need to CommitAsync()?)
//take100_000 is the next 100_000 in the original list of 576,000
using (var tx = _stateManager.CreateTransaction())
{
foreach (var tick in take100_000) {
await dict.AddAsync(tx, tick, "1");
}
await tx.CommitAsync();
}
After that, I need to iterate through the dictionary to visit each item:
using (var tx = _stateManager.CreateTransaction())
{
    var enumerator = (await dict.CreateEnumerableAsync(tx)).GetAsyncEnumerator();
    while (await enumerator.MoveNextAsync(ct))
    {
        var tick = enumerator.Current.Key;
        // do something with tick
    }
}
This takes 16 seconds.
I'm not so concerned about the write time, I know it has to be replicated and persisted. But why does it take so long to read? 576,000 17-character string keys should be no more than 11.5 MB in memory, and the values are only a single character and are ignored. Aren't Reliable Collections cached in RAM? Iterating through a regular Dictionary of the same values takes 13 ms.
I then called ContainsKeyAsync 576,000 times on an empty dictionary (in 1 transaction). This took 112 seconds. Trying this on probably any other data structure would take ~0 ms.
This is on a local 1 node cluster. I got similar results when deployed to Azure.
Are these results plausible? Any configuration I should check? Am I doing something wrong, or are my expectations wildly inaccurate? If so, is there something better suited to these requirements? (~1 million tiny keys, no values, persistent transactional updates)
Ok, for what it's worth:
Not everything is stored in memory. To support large Reliable Collections, some values are cached and some reside on disk, which can lead to extra I/O while retrieving the data you request. I've heard a rumor that at some point we may get a chance to adjust the caching policy, but I don't think it has been implemented yet.
You iterate through the data reading records one by one. IMHO, if you issue half a million separate sequential queries against any data source, the outcome won't be great. I'm not saying that every single MoveNext() results in a separate I/O operation, but overall it doesn't behave like a single fetch.
It depends on the resources you have. For instance, trying to reproduce your case on my local machine with a single partition and three replicas, I get the records in 5 seconds on average.
Thinking about a workaround, here is what comes to mind:
Chunking - I tried the same scenario with the records split into string arrays capped at 10 elements (IReliableDictionary<string, string[]>). Essentially it was the same amount of data, but the time went from 5 s down to 7 ms. I guess if you keep your items below 80 KB, thus reducing the number of round trips and keeping the LOH small, you should see performance improve.
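The chunking idea is just packing flat keys into fixed-size groups so that one read returns many items. A trivial, language-neutral sketch of the packing step (the group size of 10 matches the experiment above; everything else is illustrative):

```python
def chunk(items, size):
    """Pack a flat key list into fixed-size groups; one stored entry
    (e.g. one dictionary value) then holds `size` former keys."""
    return [items[i:i + size] for i in range(0, len(items), size)]

keys = ["1-1-1-1-1-1-1-1-%d" % i for i in range(1, 26)]
groups = chunk(keys, 10)
# 25 keys -> groups of 10, 10 and 5
```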
Filtering - CreateEnumerableAsync has an overload that allows you to specify a delegate, so values are not retrieved from disk for keys that do not match the filter.
State Serializer - in case you go beyond simple strings, you could develop your own serializer and try to reduce the I/O incurred for your type.
Hopefully it makes sense.
I am trying to figure out an efficient algorithm for processing Documents in a distributed (FaaS, to be more precise) environment.
A brute-force approach would be O(D * F * R), where:
D is the number of Documents to process
F is the number of Filters
R is the highest number of Rules in a single Filter
I can assume that:
a single Filter has no more than 10 Rules
some Filters may share Rules (so it's an N-to-N relation)
Rules are boolean functions (predicates), so I can take advantage of early cutting: if I have f() && g() && h() and f() evaluates to false, then I do not have to evaluate g() and h() at all and can return false immediately
the number of Fields in a single Document is always the same (around 5-10)
Filters, Rules and Documents are already in the database
every Filter has at least one Rule
Using sharing (the second assumption), my first idea was to evaluate the Document against every Rule, and then (after finishing) compute each Filter's result from the already-computed Rules. This way a shared Rule is computed only once. However, it doesn't take advantage of early cutting (the third assumption).
The second idea is to use early cutting as a slightly optimized brute force, but then Rule sharing goes unused.
Rule sharing looks like subproblem sharing, so memoization and dynamic programming will probably help.
I have noticed that the Filter-Rule relation forms a bipartite graph. I'm not sure whether that helps me, though. I have also noticed that I could use reverse sets, storing in every Rule the set of Filters that use it. However, this creates a circular dependency and may cause desynchronization problems in the database.
The default idea is that Documents are streamed, and every single one is an event that spins up a FaaS instance to process it. However, this would probably force every FaaS instance to query for all Filters, which leaves me at O(F * D) queries because of the shared-nothing architecture.
Sample Filter:
{
    'normalForm': 'CONJUNCTIVE',
    'rules':
    [
        {
            'isNegated': true,
            'field': 'X',
            'relation': 'STARTS_WITH',
            'value': 'G',
        },
        {
            'isNegated': false,
            'field': 'Y',
            'relation': 'CONTAINS',
            'value': 'KEY',
        },
    ],
}
or in a more condensed form:
document -> !document.x.startsWith("G") && document.y.contains("KEY")
for Document:
{
    'x': 'CAR',
    'y': 'KEYBOARD',
    'z': 'PAPER',
}
evaluates to true.
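The sample above also shows how the memoization and early-cutting ideas can coexist within one process: cache Rule results per Document, and let each Filter's conjunction short-circuit over the cache. A sketch (the rule and filter names are made up; the first filter mirrors the sample):

```python
def evaluate(document, rules, filters):
    """Evaluate all filters against one document, computing each shared
    rule at most once while keeping per-filter short-circuiting."""
    cache = {}

    def rule(name):
        if name not in cache:
            cache[name] = rules[name](document)
        return cache[name]

    # all() is lazy, so a False rule stops the rest of that filter,
    # while the cache preserves sharing across filters
    return {f: all(rule(r) for r in names) for f, names in filters.items()}

rules = {
    "not_starts_G": lambda d: not d["x"].startswith("G"),
    "y_has_KEY":    lambda d: "KEY" in d["y"],
}
filters = {
    "f1": ["not_starts_G", "y_has_KEY"],  # the sample filter above
    "f2": ["y_has_KEY"],                  # shares a rule with f1
}

doc = {"x": "CAR", "y": "KEYBOARD", "z": "PAPER"}
# evaluate(doc, rules, filters) -> {"f1": True, "f2": True}
```

This only helps when the evaluating instance holds several Filters, which is exactly what the F/N distribution in the edit below this question arranges.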
I can slightly change the data model, stream something else instead of Documents (e.g. Filters), and use any NoSQL database and tools to help. Apache Flink (event processing) and MongoDB (a single query to retrieve a Filter with its Rules), or maybe Neo4j (as the model looks like a bipartite graph), look like they could help, but I'm not sure.
Can this be processed efficiently (with regard to, probably, database queries)? What tools would be appropriate?
I have also been wondering whether I am trying to solve a special case of some more general (math) problem that may have useful theorems and algorithms.
EDIT: My newest idea: gather all Documents in a cache like Redis. Then a single event starts up and publishes N functions (as in Function as a Service), and every function selects F/N Filters (the number of Filters divided by the number of instances, i.e. evenly distributing the Filters across instances). This way every Filter is fetched from the database only once.
Now every instance streams all Documents from the cache (one Document should be less than 1 MB, and at the same time I should have 1-10k of them, so they should fit in the cache). This way every Document is selected from the database only once (into the cache).
I have reduced database read operations (some Rules are still selected multiple times), but I am still not taking advantage of Rule sharing across Filters. I could intentionally ignore it by using a document database: by selecting a Filter I would also get its Rules. Still, I would have to recompute their values.
I guess that's what I get for using a shared-nothing scalable architecture?
I realized that although my graph is indeed (in theory) bipartite, in practice it's going to be a set of disjoint bipartite graphs (as not all Rules are going to be shared). This means that I can process those disjoint parts independently on different FaaS instances without recalculating the same Rules.
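Finding those disjoint parts is just a connected-components pass over the Filter-Rule graph; a sketch with illustrative edge data:

```python
from collections import defaultdict

def components(filter_rules):
    """Split the bipartite Filter-Rule graph into connected components;
    each component can go to a separate FaaS instance without any
    shared Rule being recomputed across instances."""
    adj = defaultdict(set)
    for f, rules in filter_rules.items():
        for r in rules:
            adj[f].add(r)
            adj[r].add(f)
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # iterative DFS
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.add(node)
            stack.extend(adj[node] - seen)
        comps.append(comp)
    return comps

edges = {"f1": ["r1", "r2"], "f2": ["r2"], "f3": ["r3"]}
# components(edges) -> [{'f1', 'f2', 'r1', 'r2'}, {'f3', 'r3'}]
```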
This reduces my problem to processing a single connected bipartite graph. Now, I can only get the benefit of dynamic programming and share the result of a Rule computation if memory is shared, so I cannot divide (and distribute) this problem further without sacrificing that benefit. So I thought: if I have already decided that I will have to recompute some Rules, then let that number be low compared to the disjoint parts I get.
This is actually the minimum cut problem, which (fortunately) has a known polynomial-time algorithm.
However, this may not be ideal in my case, because I don't want to cut off just any part of the graph - I would like to cut the graph ideally in half (a divide-and-conquer strategy that could be reapplied recursively until the graph is small enough to be processed within the time bound of a FaaS instance).
This means that I am looking for a cut that creates two disjoint bipartite graphs, each with roughly the same number of vertices (or at least similar).
This is the sparsest cut problem, which is NP-hard but has an O(sqrt(log N))-approximation algorithm that also favors fewer cut edges.
Currently this does look like a solution to my problem; however, I would love to hear any suggestions, improvements and other answers.
Maybe it can be done better with another data model or algorithm? Maybe I can reduce it further with some theorem? Maybe I could transform it into another (simpler) problem, or at least one that is easier to divide and distribute across nodes?
This idea and analysis strongly suggest using a graph database.
Is there a way to increment a counter and then get the value back in a single call?
Or is the only way to make 2 calls?
This is not a phantom limitation but a Cassandra API one; there is nothing in CQL that allows you to retrieve and update a value in the same API call, and there's a very, very good reason for that.
When you update the value of a counter the CQL looks like this:
UPDATE keyspace.table SET counter = counter + 1 WHERE a = b;
However, this masks the true distributed-locking complexity Cassandra must undergo to perform a seemingly simple increment. It's genuinely hard to ensure that every increment is applied to the latest value and that concurrent increments of the same counter converge to the correct result.
So you need a guarantee that the write is acknowledged by enough replicas, after which you can perform a safe read - and even that makes it sound easier than it is. In reality there's an incremental merge/compare-and-set process inside a single counter increment, detailed better here.
The read operation is simply:
SELECT * FROM keyspace.table WHERE a = b;
If you think you are saving much, network-wise or complexity-wise, by combining the two, you probably aren't, unless the volume of reads/writes is immense. In short, it's a happy thought, but I wouldn't bother. If you do want the value back after an increment, simply chain the two calls:
for {
  inc <- database.table.update.where(_.a = b).modify(_.counter increment 1).future()
  value <- database.table.select(_.counter).where(_.a = b).one()
} yield value
While Scala actors are described as light-weight, and Akka actors even more so, there is obviously some overhead to using them.
So my question is: what is the smallest unit of work that is worth parallelising with actors (assuming it can be parallelized)? Is it only worth it if there is some potential latency, or if there are a lot of heavy calculations?
I'm looking for a general rule of thumb that I can easily apply in my everyday work.
EDIT: The answers so far have made me realise that what I'm interested in is perhaps actually the inverse of the question that I originally asked. So:
Assuming that structuring my program with actors is a very good fit, and therefore incurs no extra development overhead (or even incurs less development overhead than a non-actor implementation would), but the units of work it performs are quite small - is there a point at which using actors would be damaging in terms of performance and should be avoided?
Whether to use actors is not primarily a question of the unit of work, its main benefit is to make concurrent programs easier to get right. In exchange for this, you need to model your solution according to a different paradigm.
So, you need to decide first whether to use concurrency at all (which may be due to performance or correctness) and then whether to use actors. The latter is very much a matter of taste, although with Akka 2.0 I would need good reasons not to, since you get distributability (up & out) essentially for free with very little overhead.
If you still want to decide the other way around, a rule of thumb from our performance tests might be that the target message processing rate should not be higher than a few million per second.
My rule of thumb - for everyday work - is that if a task takes milliseconds then it's potentially worth parallelizing. Although the transaction rates are higher than that (the overhead is usually no more than a few tens of microseconds), I like to stay well away from overhead-dominated cases. Of course, a task may need to take much longer than a few milliseconds to actually be worth parallelizing. You always have to balance the time taken writing more code against the time saved running it.
If no side effects are expected in the work units, then it is better to make the work-splitting decision at run time:
protected T compute() {
    if (r - l <= T1 || getSurplusQueuedTaskCount() >= T2)
        return problem.solve(l, r);
    // otherwise decompose into subtasks and fork them
}
Where:
T1 = N / (L * Runtime.getRuntime().availableProcessors())
N - size of the work in units
L = 8..16 - load factor, configured manually
T2 = 1..3 - max length of the work queue after all stealings
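To make the threshold formula concrete, here is a worked example under assumed numbers (N, L and the processor count are all made up for illustration):

```python
# Sequential-cutoff threshold: T1 = N / (L * processors)
N = 1_000_000   # total units of work (assumed)
L = 8           # load factor, tuned manually (range 8..16)
processors = 4  # stand-in for Runtime.getRuntime().availableProcessors()

T1 = N // (L * processors)
# T1 = 31250: a subrange at or below ~31k units is solved sequentially;
# anything larger keeps being decomposed into forked subtasks
```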
Here is a presentation with many more details and figures:
http://shipilev.net/pub/talks/jeeconf-May2012-forkjoin.pdf
Right now we are storing some query results in Memcache. After investigating a bit more, I've seen that many people save each individual item in Memcache separately. The benefit of doing this is that those items can then be fetched from Memcache on any other request.
Store an array
$key = 'page.items.20';
if ( !( $results = $memcache->get($key) ) )
{
    $results = $con->execute('SELECT * FROM table LEFT JOIN .... LIMIT 0,20')->fetchAll();
    $memcache->set($key, $results, 3600);
}
...
PROS:
Easier
CONS:
If I change an individual item, I have to invalidate all the cached queries that contain it (which can be a pain)
I can have duplicated data (the same item stored in different query results)
vs
Store each item
$key = 'page.items.20';
if ( !( $results_ids = $memcache->get($key) ) )
{
    $results = $con->execute('SELECT * FROM table LEFT JOIN .... LIMIT 0,20')->fetchAll();
    $results_ids = array();
    foreach ( $results as $result )
    {
        $results_ids[] = $result['id'];
        // if it isn't cached yet, save the individual item
        $memcache->add('item'.$result['id'], $result, 3600);
    }
    // save the list of ids
    $memcache->set($key, $results_ids, 3600);
}
else
{
    $item_keys = array_map(function ($id) { return 'item'.$id; }, $results_ids);
    $results = $memcache->getMulti($item_keys);
    // fetch the elements that are not cached
    ...
}
...
PROS:
I don't have the same item stored twice in Memcache
Easier to invalidate results across several queries (just the item we change)
CONS:
More complicated business logic.
What do you think? Any other PROS or CONS of each approach?
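Language aside, the second pattern boils down to "cache an id list per query, plus one entry per item, and backfill only the misses". A compact sketch with a plain dict standing in for Memcache and a fake DB fetch that logs its calls (all names here are mine):

```python
cache = {}      # stand-in for Memcache
db_calls = []   # records what actually hits the database

def fetch_from_db(ids):
    """Fake DB fetch; logs each call so cache hits are visible."""
    db_calls.append(sorted(ids))
    return {i: {"id": i, "name": "item%d" % i} for i in ids}

def get_items(ids):
    """Return rows for `ids`, reading cached items and backfilling misses."""
    found = {i: cache["item:%d" % i] for i in ids if "item:%d" % i in cache}
    missing = [i for i in ids if i not in found]
    if missing:
        for i, row in fetch_from_db(missing).items():
            cache["item:%d" % i] = row
            found[i] = row
    return [found[i] for i in ids]
```

After `get_items([1, 2, 3])`, a later `get_items([2, 3, 4])` only touches the database for id 4 - which is exactly the pay-off the second approach buys at the cost of the extra bookkeeping.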
Some links
Post explaining the second method in Memcached list
Thread in Memcached Group
Grab stats and try to calculate a hit ratio, or the possible improvement from caching the full query vs doing individual item grabs in MC. Profiling this kind of code is also very helpful to see how your theory actually applies.
It depends on what the query does. If you have a set of users and then want to grab the "top 10 music affinity" among some of those friends, it is worth having both cached:
- each friend (in fact, each user of the site)
- the top-10 query for each user (space is cheaper than CPU time)
But in general it is worth storing in MC all individual entities that are going to be used frequently (either in the same code execution, in subsequent requests, or by other users). Then, for CPU- or resource-heavy queries and data processing, either MC them or delegate them to async jobs instead of computing them in real time (e.g. a "Top 10 site users" list doesn't need to be real-time; it can be updated hourly or daily).
And of course take into account that if you store individual entities in MC, you have to stop relying on the DB's referential integrity in order to reuse them either individually or in groups.
This depends on your usage pattern. If you're constantly pulling individual nodes by ID, store each one separately.
Also, note that in either case, storing the list isn't all that useful except for the top 20. If you insert/update/delete a node in such a way that the top-20 is no longer valid, you may end up needing to flush the next 20, and so on.
Lastly, keep in mind that it's a cache. If you're using a cache, you're making the underlying statement that it's no big deal if the data you're outputting is slightly stale.
Memcached stores data in chunks of specific sizes, as explained in the link below.
http://code.google.com/p/memcached/wiki/NewUserInternals
If your items are large, there will be fewer chunks of the larger sizes, so the least-recently-used algorithm may push data out even when there is space available in other chunk sizes; the LRU operates per chunk size.
You can decide which implementation to choose based on the data-size distribution in memcached.