How to handle concurrent adds on the same key in Last Write Wins map? - distributed-computing

I am implementing an LWW map and in my design, all added key-value pairs have timestamps as is expected from LWW. That works for me until the same key is added in two replicas with different values at the same time. I can't understand how to make the merge operation commutative in this scenario.
Example:
Replica1 => add("key1", "value1", "time1")
Replica2 => add("key1", "value2", "time1")
Merge(Replica1, Replica2) # What should be the value of key1 in the resulting map?

Let's see what last write wins means in terms of causality. Say two clients C1 and C2 changed the same data D (same key) to values D1 and D2 respectively. If they never talk to each other directly, you can pick either D1 or D2 as the last value and both choices are OK.
But suppose they do talk to each other: C1 changed the value to D1 and informed C2, which as a result changed it to D2. Now D1 and D2 have a causal dependency (D1 happens before D2), so if your system picks D1 as the last value in the merge, you have broken the last-write-wins guarantee.
Coming to your question: when two clients make two requests in parallel, those requests cannot have a causal dependency, since both were in flight together, so any value you pick is fine, as long as every replica breaks the timestamp tie the same deterministic way (for example by comparing the values or the replica IDs), so that the merge stays commutative and all replicas converge.
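For completeness, a minimal sketch of such a merge, assuming ties on equal timestamps are broken by comparing the values lexicographically; the Entry and LwwMap names are illustrative, not from any particular library:

import java.util.HashMap;
import java.util.Map;

// Illustrative LWW entry: the written value plus its write timestamp.
record Entry(String value, long timestamp) {}

class LwwMap {
    final Map<String, Entry> entries = new HashMap<>();

    void put(String key, String value, long timestamp) {
        merge(key, new Entry(value, timestamp));
    }

    // Merge another replica's state into this one. The higher timestamp wins;
    // equal timestamps fall back to comparing values, so merging R1 into R2
    // and merging R2 into R1 produce the same state (the merge is commutative).
    void mergeFrom(LwwMap other) {
        other.entries.forEach(this::merge);
    }

    private void merge(String key, Entry incoming) {
        entries.merge(key, incoming, (ours, theirs) -> {
            if (theirs.timestamp() != ours.timestamp()) {
                return theirs.timestamp() > ours.timestamp() ? theirs : ours;
            }
            return theirs.value().compareTo(ours.value()) > 0 ? theirs : ours;
        });
    }
}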

Related

Postgres extended statistics with partitioning

I am using Postgres 13 and have created a table with columns A, B and C. The table is partitioned by A with 2 possible values. Partition 1 contains 100 possible values each for B and C, whereas partition 2 has 100 completely different values for B, and 1 different value for C. I have set the statistics target for both columns to the maximum so that this definitely doesn't cause any issue.
If I group by B and C on either partition, Postgres estimates the number of groups correctly. However, if I run the query against the base table, where I really want it, it estimates what I assume reflects no functional dependency between A, B and C, i.e. (p1B + p2B) * (p1C + p2C) = 200 * 101, as opposed to the reality of p1B * p1C + p2B * p2C = 10000 + 100.
I guess I was half expecting it to sum the underlying partitions rather than use the full count of 200 B's and 101 C's that the base table can see. Moreover, if I also add A into the group by then the estimate erroneously doubles further still, as it then thinks that this set will also be duplicated for each value of A.
This all made me think that I need an extended statistic to tell it that A influences either B or C or both. However, if I create one on the partitioned base table and analyze, the value in pg_statistic_ext_data.stxdndistinct is null. Whereas if I set it on the partitions themselves, this does appear to work, though it isn't particularly useful because the estimate is already correct at that level. How do I go about having Postgres estimate against the base table correctly without having to run the query against all of the partitions and unioning them together?
You can define extended statistics on a partitioned table, but PostgreSQL doesn't collect any data in that case. You'll have to create extended statistics on all partitions individually.
You can confirm that by querying the collected data after an ANALYZE:
SELECT s.stxrelid::regclass AS table_name,
       s.stxname            AS statistics_name,
       d.stxdndistinct      AS ndistinct,
       d.stxddependencies   AS dependencies
FROM pg_statistic_ext AS s
JOIN pg_statistic_ext_data AS d
  ON d.stxoid = s.oid;
There is certainly room for improvement here; perhaps don't allow defining extended statistics on a partitioned table in the first place.
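For example, a sketch of doing that on each partition; the partition names t_p1 and t_p2 and the lower-case column names are placeholders for your schema:

-- one statistics object per partition, then re-analyze so stxdndistinct gets populated
CREATE STATISTICS t_p1_b_c (ndistinct, dependencies) ON b, c FROM t_p1;
CREATE STATISTICS t_p2_b_c (ndistinct, dependencies) ON b, c FROM t_p2;
ANALYZE t_p1, t_p2;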
I found that I just had to turn enable_partitionwise_aggregate on to get this to estimate correctly.
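That setting can be turned on per session (or in postgresql.conf), for example:

-- let the planner aggregate each partition separately, then combine the results
SET enable_partitionwise_aggregate = on;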

Database Multiple Optional Filters with Null values in AnyLogic

I am trying to find the best way to do conditional filtering of a database in AnyLogic. The database looks as follows: columns P1, P2, P3, ..., T5, T6 whose cells contain either "x" or null, plus a column (Col1) holding the entries to display.
What I am trying to do is add 12 check boxes in the AnyLogic Main window: P1, P2, P3, ..., T5, T6.
If the user checks P1, with reference to the above table, A and B need to be displayed.
If the user checks P2, C needs to be displayed.
If the user checks P1 and T2, A needs to be displayed.
In summary, it is like filtering any set of columns by "x" in Excel, noting that the other cells have a null value.
To start, I used the following code to filter for P1 only (adding the entries to a collection of type String):
collectionDescription.addAll(
selectFrom( data_base ).
where( data_base.P1.eq("x")).
list( data_base.Col1));
Now to filter for P2 as well, the following can be done:
collectionDescription.addAll(
selectFrom( data_base ).
where( data_base.P1.eq("x")).
where( data_base.P2.eq("x")).
list( data_base.Col1));
Following the above logic, as many "where" conditions as needed can be added. However, there are far too many possible combinations to write out by hand (e.g. P1/P2, P1/P2/P3, P1/P3, etc.).
So, I thought of a potential solution where I would add as many "where" conditions as there are columns, but instead of adding ".eq("x")", I would add ".eq(varP1)". If a checkbox (e.g. P1) is ticked, varP1 would be equal to "x" (varP1 being a variable of type String). But if the box is unticked, ideally varP1 should take a value that means "where anything" so that the condition would not make any impact. I did not manage to find a way to mimic this behavior.
Sorry for the long post, but I wanted to make sure the situation is as clear as possible. So am I on the right track? Any suggestions on how to fix this? Your suggestions would be highly appreciated.
Let's say you have a check box associated with a variable varP1 that defines what you want... then you can do:
selectFrom(db_table)
.where(varP1 ? db_table.db_column.eq("x") : db_table.db_column.isNotNull().or(db_table.db_column.isNull()))
.list();
"Where anything" is what I implemented here as "any value which is either null or not null"... so if the box is checked you will find "x", otherwise you will find anything.
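Extending that idea to several check boxes (a sketch only: varP1 and varP2 are assumed to be boolean variables tied to the check boxes, the column names follow the question's code, and each unchecked box degenerates into a condition that matches any value):

collectionDescription.addAll(
    selectFrom( data_base )
        .where( varP1 ? data_base.P1.eq("x") : data_base.P1.isNotNull().or(data_base.P1.isNull()) )
        .where( varP2 ? data_base.P2.eq("x") : data_base.P2.isNotNull().or(data_base.P2.isNull()) )
        // ...repeat the same pattern for the remaining check box columns...
        .list( data_base.Col1 ));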

How to implement application level pagination over ScalarDB

This question is part Cassandra and part ScalarDB. I am using ScalarDB, which provides ACID support on top of Cassandra. The library seems to be working well! Unfortunately, ScalarDB doesn't support pagination, so I have to implement it in the application.
Consider this scenario in which P is primary key, C is clustering key and E is other data within the partition
Partition => { P,C1,E1
P,C2,E1
P,C2,E2
P,C2,E3
P,C2,E4
P,C3,E1
...
P,Cm,En
}
In ScalarDB, I can specify start and end values for the clustering keys, so I suppose ScalarDB will read data only from the specified rows. I can also limit the number of entries fetched.
https://scalar-labs.github.io/scalardb/javadoc/com/scalar/db/api/Scan.html
Say I want to get entries E3 and E4 from P,C2. For smaller values, I can specify start and end clustering keys as C2 and set fetch limit to say 4 and ignore E1 and E2. But if there are several hundred records then this method will not scale.
For example say P,C1 has 10 records, P,C2 has 100 records and I want to implement pagination of 20 records per query. Then to implement this, I'll have to
Query 1 – Scan – primary key will be P, clustering start will be C1, clustering end will be Cn as I don’t know how many records are there.
get P,C1. This will give 10 records
get P,C2. This will give me 20 records. I'll ignore last 10 and combine P,C1's 10 with P,C2's first 10 and return the result.
I'll also have to maintain that the last cluster key queried was C2 and also that 10 records were fetched from it.
Query 2 (for next pagination request) - Scan – primary key will be P, clustering start will be C2, clustering end will be Cn as I don’t know how many records are there.
Now I'll fetch P,C2 and get 20, ignore 1st 10 (as they were sent last time), take the remaining 10, do another fetch using same Scan and take first 10 from that.
Is this how it should be done, or is there a better way? My concern with the above implementation is that every time I'll have to fetch loads of records and discard them. For example, say I want to get records 70-90 from P,C2: then I'll still have to read the first 60 records and throw them away!
Partition keys and clustering keys together compose the primary key, so your example above doesn't look right (there can't be multiple records with the same P and C2).
Let's assume the following data structure.
P, C1, ...
P, C2, ...
P, C3, ...
...
Anyway, I think one way to do it could be as follows, assuming the page size is 2.
Scan with start (P, C1) inclusive, ascending, and with limit 2; store the results in R1.
Get the last record of R1 -> (P, C2).
Scan with start set to the previous last record (P, C2), not inclusive, ascending, with limit 2.
...
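Translating that into code, a rough sketch for fetching the next page, assuming the older ScalarDB API from the javadoc linked above; the namespace, table, and column names ("ns", "tbl", "p", "c") and the nextPage helper are placeholders, not part of ScalarDB:

import com.scalar.db.api.DistributedTransaction;
import com.scalar.db.api.Result;
import com.scalar.db.api.Scan;
import com.scalar.db.exception.transaction.CrudException;
import com.scalar.db.io.Key;
import com.scalar.db.io.TextValue;
import java.util.List;

// Fetch the page that starts just after the last clustering key already returned.
List<Result> nextPage(DistributedTransaction tx, Key partitionKey,
                      Key lastSeenClusteringKey, int pageSize) throws CrudException {
  Scan scan = new Scan(partitionKey)
      .withStart(lastSeenClusteringKey, false)  // exclusive: resume right after the previous page
      .withOrdering(new Scan.Ordering("c", Scan.Ordering.Order.ASC))
      .withLimit(pageSize)
      .forNamespace("ns")
      .forTable("tbl");
  return tx.scan(scan);  // the caller remembers the last clustering key of this page
}

// For the first page, use withStart(new Key(new TextValue("c", "C1")), true) instead.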

Using merge in SAS

I am a little bit confused about merging in SAS. For example, when people use the merge statement, it is sometimes followed by (in=a) or (in=b). What does that do exactly?
To elaborate more on vknowles' answer, in=a and in=b are useful for different types of merges. Let's say we have the following data step:
data inner left right outer;
  merge have1(in=a) have2(in=b);
  by ...;
  if a and b then output inner;
  else if a and not b then output left;
  else if not a and b then output right;
  else if not (a or b) then output outer;
run;
The data step will create 4 different datasets, which are the basis of an inner join, a left join, a right join, and an outer join.
The statement if a and b then output inner; will output only records whose key is found in both datasets have1 and have2, which is the equivalent of a SQL inner join.
else if a and not b then output left; will output only the records that occur in the have1 dataset and not in have2, i.e. the rows a left outer join would add beyond the inner join. If you wanted a full left join you could either append the left dataset to the inner dataset or just change the statement to if (a and b) or (a and not b) then output left, which simplifies to if a then output left (a sketch follows below).
The third else if is just the opposite of the previous. Here you can perform a right join on the data.
The last else if will output to the outer dataset which is the equivalent of an outer join. This is useful for debugging purposes as the key is unique to each dataset. Thanks to Robert for this addition.
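For example, a sketch of the full left join variant mentioned above (only the dataset list and the output condition change; the by variables are elided as in the step above):

data left_full;
  merge have1(in=a) have2(in=b);
  by ...;
  if a then output left_full;  /* every have1 record, whether or not it matched have2 */
run;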
When you see a dataset referenced in SAS with () following it, the items inside are called dataset options. The IN= dataset option is valid when reading datasets with statements like SET, MERGE and UPDATE. The word following IN= names a variable that SAS will create; it will be true (1) when that dataset contributes to the current observation and false (0) otherwise.
A good example would be if you want to use one data set to subset another. For example, if you wanted to merge on data from a master lookup table to add an extra variable like an address or account type, but did not want to add in every id from the lookup table.
data want;
  merge my_data(in=in1) master_lookup(in=in2);
  by id;
  if in1;
run;
Or if you are stacking or interleaving data from more than one table and wanted to take action depending on which table this record is from.
data want;
  set one(in=in1) two(in=in2);
  by id;
  if in1 then source='ONE';
  if in2 then source='TWO';
run;
Let's say you have MERGE A (in=INA) B (in=INB);
When merging two datasets with a BY statement, you can have the following situations:
Dataset A has an observation with a given by-value and dataset B does not. In this case, INA will be true and INB will be false.
Dataset A and B both have an observation with a given by-value. In this case, INA and INB will be true.
Dataset A and B have different numbers of observations with a given by-value. Let's say A has more than B. Even after B runs out of observations for that by-group, both INA and INB will remain true. If you want to know whether B actually contributed a new observation, you need something like the following code. As @Tom pointed out, you often want to know which dataset is contributing the current observation.
data want;
  ina=0;
  inb=0;
  merge a (in=ina) b (in=inb);
  by mybyvariable;
  ...
run;
The above code takes advantage of the fact that SAS retains the variables from the last observation contributed to the by-group by the dataset with the smaller number of observations. If you reinitialize the IN= variables before the MERGE statement and a dataset does not contribute a new observation, they keep their reinitialized value (0); but if it does contribute a new observation, SAS sets them back to 1. I hope that makes sense.
Dataset A does not have an observation with a given by-value and B does. In this case, INA will be false and INB will be true.
Note that this does not cover the situation of merging without a BY statement.
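As an illustration of that reset trick (the new_from_b flag is a hypothetical name; datasets and by variable follow the code above), resetting the IN= variables each iteration lets you see whether b supplied a fresh observation for the current row:

data want;
  ina = 0;
  inb = 0;
  merge a (in=ina) b (in=inb);
  by mybyvariable;
  new_from_b = inb;  /* 1 only if b contributed a new observation on this iteration */
run;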

Merge Neo4j relationships into one while returning the result if a certain condition is satisfied

My use case is:
I have to return the whole graph in the result, but the condition is:
If there is more than one relationship between two particular nodes in the same direction, then I have to merge them into one relationship. For example: let's say there are two nodes 'm' and 'n' with 3 relationships between them, say r1, r2, r3 (all in the same direction); then when I get the result after running the Cypher query, I should get only one relationship between 'n' and 'm'.
I need to perform some operations on top of it: the resulting relationship obtained by merging all the relationships should contain the properties and values that I want to retain. In practice I will retain all the properties of whichever one of the merged relationships is selected based on the timestamp field, which is one of the properties on every relationship.
Note: all my relationships have the same properties (the number and names of the properties are the same across all relationships; the values may of course differ).
Any help would be appreciated. Thanks in advance.
You mean something like this?
Delete all except the first
MATCH (a)-[r]->(b)
WITH a,b,type(r) as type, collect(r) as rels
FOREACH (r in rels[1..] | DELETE r)
Ordering by timestamp first, so the relationship with the latest timestamp is the one that is kept:
MATCH (a)-[r]->(b)
WITH a,r,b
ORDER BY r.timestamp DESC
WITH a,b,type(r) as type, collect(r) as rels
FOREACH (r in rels[1..] | DELETE r)
If you want to do all those operations virtually, just on the query results, you'd do them in your programming language of choice.
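Or, staying in Cypher, a rough sketch of building the merged view purely in the query result (nothing is deleted); per node pair and direction it keeps the type and properties of the relationship with the latest timestamp:

MATCH (a)-[r]->(b)
WITH a, b, r ORDER BY r.timestamp DESC
WITH a, b, collect(r) AS rels          // rels[0] now holds the latest relationship
RETURN a, b, type(rels[0]) AS relType, properties(rels[0]) AS props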