Increment and get value of counter - scala

Is there a way to increment a counter and then get the value back in a single call?
Or is the only way to make 2 calls?

This is not a phantom limitation but a Cassandra one: there is nothing in CQL that lets you update a counter and read its value back in the same API call, and there's a very good reason for that.
When you update the value of a counter the CQL looks like this:
UPDATE keyspace.table SET counter = counter + 1 WHERE a = b;
However, this masks the true distributed coordination complexity Cassandra must go through to perform a seemingly simple increment. It is genuinely hard to make sure that every increment is applied to the latest value and that concurrent increments of the same counter converge to the correct total.
So you need the write to be acknowledged by enough replicas before you can perform a safe read, and even that makes it sound easier than it is. In reality there's an incremental merge/compare-and-set process going on inside a single counter increment, better detailed here.
The read operation is simply:
SELECT * FROM keyspace.table WHERE a = b;
If you think you would save much, network-wise or complexity-wise, by combining the two, you probably wouldn't unless the volume of reads/writes is immense. In short, it's a nice thought, but I wouldn't bother.
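With phantom, the two calls can simply be sequenced in a for-comprehension: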
for {
  _ <- database.table.update.where(_.a eqs b).modify(_.counter increment 1).future()
  value <- database.table.select(_.counter).where(_.a eqs b).one()
} yield value

Related

Why is no global time such a big issue in distributed systems? When would a global time be useful?

It's not hard for me to understand why a global time cannot really exist, or at least be measured, in a distributed system. However, I don't really understand why this is such a big issue. I mean most code is executed sequentially anyway, or in a causal relation (so something like A, then we can use it in B, then execute C). I've never seen code that was like "it's critical that these two threads execute something at the exact same time". In what scenario would a global time be useful?
I mean most code is executed sequentially anyway
I disagree. That's true almost by definition for a single process on a single thread. But if Taylor Swift drops a new album on Twitter and 1M of her 88.7M followers like it, they don't have to wait in a queue for the other 999,999 users to finish their "like" update. (Or at least, the data structure is much faster than taking out a heavyweight lock to guarantee a sequence.) There's a lot of non-sequential code there.
or in a causal relation (so something like A, then we can use it in B, then execute C).
Right, and a naive clock-based implementation does not preserve causality. Say the true order of events is this:
Initially: x = 1
Process 1: set x = 2
Process 2: read x
Process 2: if (x == 2) set x = 4
But if we rely on their clocks, it might look like this:
Process 1: set x = 2 (t = 4:00:00)
Process 2: read x (t = 3:59:59.000)
Process 2: if (x == 2) set x = 4 (t = 3:59:59.100)
Replicas might rely on these timestamps to replay the operations, sorting them by timestamp. Relying on skewed clocks in that way would violate causality.
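To make that concrete, here is a small sketch of my own (not from the original answer) that replays the three operations above in timestamp order instead of causal order:

// Each operation carries the (skewed) wall-clock timestamp it was tagged with,
// expressed here as milliseconds since midnight.
final case class Op(timestampMs: Long, run: Int => Int)

val ops = Seq(
  Op(14400000L, _ => 2),                    // Process 1: set x = 2        (4:00:00)
  Op(14399000L, x => x),                    // Process 2: read x           (3:59:59.000)
  Op(14399100L, x => if (x == 2) 4 else x)  // Process 2: conditional set  (3:59:59.100)
)

// Replaying in timestamp order from x = 1 ends with x = 2,
// while the true causal order ends with x = 4.
val replayed = ops.sortBy(_.timestampMs).foldLeft(1)((x, op) => op.run(x))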

Firestore Increment - Cloud Function Invoked Twice

With Firestore Increment, what happens if you're using it in a Cloud Function and the Cloud Function is accidentally invoked twice?
To make sure that your function behaves correctly on retried execution attempts, you should make it idempotent: implement it so that an event produces the desired results (and side effects) even if it is delivered multiple times.
E.g. the function is trying to increment a document field by 1:
document("post/Post_ID_1").
updateData(["likes" : FieldValue.increment(1)])
So while Increment may be atomic, it's not idempotent? If we want to make our counters idempotent, we still need to use a transaction and keep track of who was the last person to like the post?
It will increment once for each invocation of the function. If that's not acceptable, you will need to write some code to figure out if any subsequent invocations are valid for your case.
There are many strategies to implement this, and it's up to you to choose one that suits your needs. The usual strategy is to use the event ID in the context object passed to your function to determine if that event has been successfully processed in the past. Maybe this involves storing that record in another document, in Redis, or somewhere that persists long enough for duplicates to be prevented (an hour should be OK).
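To illustrate the shape of that event-ID check, here is a rough sketch in Scala (names like processedEvents and onPostLiked are hypothetical; this is not the Firebase API, just the pattern):

import scala.collection.concurrent.TrieMap

// Stand-in for a "processed events" record; in practice this would live in
// another document, Redis, or anywhere with a TTL of about an hour.
val processedEvents = TrieMap.empty[String, Long]

def onPostLiked(eventId: String, incrementLikes: () => Unit): Unit =
  // putIfAbsent returns None only for the first delivery of this event ID,
  // so a retried delivery of the same event becomes a no-op.
  if (processedEvents.putIfAbsent(eventId, System.currentTimeMillis()).isEmpty)
    incrementLikes()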

Database technology choice based on a use case

I am creating an API limiter, and I am having trouble deciding which system to use for data storage.
It is clear that I am going to need volatile storage plus persistent storage.
In the volatile storage I want to keep key-value pairs like this:
read:14522145 100
read:99885669 16
read:78951585 100
The key is composed as {action}:{client}, and the integer value is the client's available credits.
In the persistent storage I want to keep a record of all resource outages.
The algorithm (pseudo-code) is pretty simple:
MAX_AMOUNT = 100
call(action, client, cost) {
    key = action + ":" + client
    if (volatileStorage.hasKey(key)) {
        value = volatileStorage.getValue(key)
        if (value >= cost) {
            volatileStorage.setValue(key, value - cost)
            return true
        } else {
            persistentStorage.logOutage(action, client, cost)
            return false
        }
    } else {
        volatileStorage.setValue(key, MAX_AMOUNT)
        return call(action, client, cost)
    }
}
There is a parallel process that runs every N seconds for each action, increasing all keys matching {action}:* by M, up to a maximum of O.
Additionally, I want to remove from the volatile store all items older (not modified since) than P seconds.
So basically every action is action<N, M, O, P>. For instance, reading users is refilled every second by 5 points, up to 100, and removed after 60 seconds of inactivity: read_users<1, 5, 100, 60>.
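To make the refill/expiry process concrete, here is a rough in-memory sketch (names like ActionConfig and CreditEntry are mine, not part of the question; a real deployment would hit the volatile store instead):

import scala.collection.concurrent.TrieMap

// <N, M, O, P> from above: refill period, refill amount, cap, idle TTL.
final case class ActionConfig(periodSeconds: Int, refill: Int, max: Int, ttlSeconds: Int)
final case class CreditEntry(credits: Int, lastModifiedMs: Long)

// Run every cfg.periodSeconds for a given action, e.g. read_users<1, 5, 100, 60>.
def sweep(store: TrieMap[String, CreditEntry], action: String, cfg: ActionConfig, nowMs: Long): Unit =
  store.foreach { case (key, entry) =>
    if (key.startsWith(action + ":")) {
      if (nowMs - entry.lastModifiedMs > cfg.ttlSeconds * 1000L)
        store.remove(key, entry)  // drop entries not modified for P seconds
      else
        store.replace(key, entry, entry.copy(credits = math.min(cfg.max, entry.credits + cfg.refill)))  // +M, capped at O
    }
  }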
So I need volatile storage that:
Reads really quickly, without consuming too many resources (there is no point in rejecting a call if doing so is more expensive than the call itself).
Allows TTL on items.
Can, with good performance, increase all keys matching a pattern (read_users:*) without exceeding a defined limit.
and persistent storage that:
Is also quick.
Can handle a large volume of records.
Any advice is welcome.
This isn't an answer but an opinion: there are existing rate limiters that you would be better off using instead of making your own. Getting it right is tricky, so adopting a production-proven implementation is not only easier but also safer.
For example, the Generic cell rate algorithm is nothing short of plain magic and has several Redis implementations, including:
As a Ruby gem (that uses server-side Lua): https://github.com/rwz/redis-gcra
As a (v4) module: https://github.com/brandur/redis-cell/
Of course, there are many more Redis-based rate limiters - use Google to find them ;)

RavenDB - querying issue - Stale results/indexes

While querying RavenDB I am noticing that I do not get the expected results immediately. Maybe it has to do with indexing, I don't know.
For example:
int ACount = session.Query<Patron>()
    .Count();
int BCount = session.Query<Theaters>()
    .Count();
int CCount = session.Query<Movies>()
    .Where(x => x.Status == "Released")
    .Count();
int DCount = session.Query<Promotions>()
    .Count();
When I execute this, ACount and BCount get their values immediately (on the first run). However, CCount and DCount do not get their values until after three or four runs; they show 0 (zero) in the first few runs.
Why does this happen for the bottom two queries and not the top two? If it's because of stale results (or indexes), how can I modify my queries to get accurate results the first time I run them? Thank you for the help.
If you haven't defined an index for the Movies query, Raven will create a dynamic index. If you use the query repeatedly, the index will be persisted automatically; otherwise Raven will discard it, and that may explain why you're getting 0 results during the first few runs.
You can also instruct Raven to wait for the indexing process to ensure that you'll always get the most accurate results (even though this might not be a good idea as it will slow your queries) by using the WaitForNonStaleResults instruction:
session.Query<Movies>()
    .Customize(x => x.WaitForNonStaleResults())
    .Where(x => x.Status == "Released")
    .Count();
Needing to put WaitForNonStaleResults in each query feels like a massive "code smell" (as much as I normally hate the term, it seems completely appropriate here).
The only real solution I've found so far is:
var store = new DocumentStore(); // do whatever
store.DatabaseCommands.DisableAllCaching();
Performance suffers accordingly, but I think slower performance is far less of a sin than unreliable if not outright inaccurate results.
You have the following options according to the official documentation (the most preferable first):
Setting a cut-off point.
WaitForNonStaleResultsAsOfLastWrite(TimeSpan.FromSeconds(10))
or
WaitForNonStaleResultsAsOfNow()
This will make sure that you get the latest results up to that point in time (or up to the last write). You can put a cap on it (e.g. 10 seconds) if you want to trade some freshness of the results for a faster response.
Explicitly waiting for non-stale results
WaitForNonStaleResultsAsOfNow(TimeSpan.FromSeconds(10))
Again, specifying a time-out would be a good practice.
Setting querying conventions to apply the same rule to all requests
store.Conventions.DefaultQueryingConsistency = ConsistencyOptions.AlwaysWaitForNonStaleResultsAsOfLastWrite;

Streams vs. tail recursion for iterative processes

This is a follow-up to my previous question.
I understand that we can use streams to generate an approximation of 'pi' (and other numbers), the n-th Fibonacci number, etc. However, I doubt that streams are the right approach for that.
The main drawback (as I see it) is memory consumption: e.g. the stream retains all Fibonacci numbers for i < n while I only need the n-th. Of course, I can use drop, but that makes the solution a bit more complicated. Tail recursion looks like a more suitable approach to tasks like that.
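For concreteness, the kind of tail-recursive solution I have in mind looks roughly like this (just a sketch):

import scala.annotation.tailrec

// n-th Fibonacci number; carries only two accumulators, so it runs in constant space.
def fib(n: Int): BigInt = {
  @tailrec
  def loop(i: Int, a: BigInt, b: BigInt): BigInt =
    if (i == 0) a else loop(i - 1, b, a + b)
  loop(n, 0, 1)
}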
What do you think?
If you need to go fast, travel light. That means: avoid allocating any unnecessary memory. If you need memory, use the fastest collections available. If you know how much memory you need, preallocate. Allocation is the absolute performance killer... for computation. Your code may not look nice anymore, but it will go fast.
However, if you're working with IO (disk, network) or any user interaction, then the cost of allocation pales. It's then better to shift priority from code performance to maintainability.
Use Iterator. It does not retain intermediate values.
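For example (my sketch of the Iterator approach, not the answerer's code), each step keeps only the current pair of values:

def fib(n: Int): BigInt =
  Iterator
    .iterate((BigInt(0), BigInt(1))) { case (a, b) => (b, a + b) }
    .map(_._1)   // keep only the first element of each pair
    .drop(n)     // nothing dropped here is retained
    .next()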
If you want the n-th Fibonacci number and use a stream just as a temporary data structure (i.e. you do not hold references to previously computed elements of the stream), then your algorithm will run in constant space.
Previously computed elements of a Stream (which are not used anymore) will be garbage collected. And since they are allocated in the youngest generation and collected almost immediately, nearly all allocations may stay in cache.
Update:
It seems that the current implementation of Stream is not as space-efficient as it could be, mainly because it inherits the implementation of the apply method from the LinearSeqOptimized trait, where it is defined as:
def apply(n: Int): A = {
  val rest = drop(n)
  if (n < 0 || rest.isEmpty) throw new IndexOutOfBoundsException("" + n)
  rest.head
}
A reference to the head of the stream is held here by this and prevents the stream from being GC'ed. So a combination of the drop and head methods (as in f.drop(100).head) may be better in situations where dropping intermediate results is feasible. (Thanks to Sebastien Bocq for explaining this on scala-user.)