Why is this output wrong? - Sequential Consistency (distributed-computing)

The way I understand the sequential consistency model, the output marked as wrong should be valid. What am I missing?

If we look at the definition of sequential consistency on Wikipedia, we see:
the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program
In your example, the program order of process P2 is violated, since the print(x,z) operation precedes y=1.
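For concreteness, here is a minimal sketch of the constraint being applied. Only P2's program order can be inferred from the question (y=1 followed by print(x,z)); everything else is assumed:

P2's program:          y = 1 ; print(x, z)
valid interleaving:    ... y = 1 ... print(x, z) ...      (P2's program order preserved)
invalid interleaving:  ... print(x, z) ... y = 1 ...      (P2's program order violated)

Whatever the other processes do, a sequentially consistent execution may interleave operations across processes, but it may never reorder the operations within a single process.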

YCSB differences between Workload F and Workload A

I don't understand the difference between Workload A (50% read, 50% update) and Workload F (50% read, 50% read-modify-write).
Isn't an update a read-modify-write?
What is the difference between a read operation and a scan operation?
I also don't understand exactly what a thread signifies (is it the number of requests or the number of clients?).
Please help, and thanks.
You have multiple questions. I am answering the first one, which is the one described in the title.
Update modifies a record without reading it first (though the record may happen to have been read earlier, as part of the read operations in the workload). Read-modify-write, as the name indicates, reads the value of a record, modifies it, and writes the new value back. Both end up updating the record, but the access pattern is different.
Ref: https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads
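To make the access-pattern difference concrete, here is a hedged Python-style sketch; db, read and update are hypothetical stand-ins for whatever client the YCSB binding drives, not the actual YCSB API:

def workload_a_update(db, key, new_value):
    # blind update: overwrite the record without reading it first
    db.update(key, new_value)

def workload_f_read_modify_write(db, key, field, transform):
    # read the record, modify one field, then write the whole record back
    record = db.read(key)
    record[field] = transform(record[field])
    db.update(key, record)

The end state of the record can be the same in both cases; what differs is that the read-modify-write issues a dependent read before its write.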

Efficient way to check if there are NA's in pyspark

I have a pyspark dataframe, named df. I want to know if its columns contain NAs; I don't care whether it is just one row or all of them. The problem is that my current way of checking for NAs is this:
from pyspark.sql import functions as F

if (df.where(F.isnull('column_name')).count() >= 1):
    print("There are nulls")
else:
    print("Yey! No nulls")
The issue I see here is that this computes the number of nulls in the whole column, which is a huge waste of time, because I want the process to stop as soon as it finds the first null.
I thought about the following solution, but I am not sure it works (I work on a cluster shared with a lot of other people, so execution time depends on the other jobs running at the same time, and I can't compare the two approaches under equal conditions):
(df.where(F.isnull('column_name')).limit(1).count() == 1)
Does adding the limit help? Are there more efficient ways to achieve this?
There is no non-exhaustive search for something that isn't there.
We can probably squeeze a lot more performance out of your query for the case where a record with a null value exists (see below), but what about when it doesn't? If you plan to run this query repeatedly, with the answer changing each time, be aware (I don't mean to imply that you aren't) that when the answer is "there are no null values in the entire dataframe", you will have to scan the entire dataframe to know it, and there is no fast way to do that. If you need this kind of information frequently and the answer is often "no", you will almost certainly want to persist it somewhere, and update it whenever you insert a record that might contain null values by checking just that record.
Don't use count().
count() is probably making things worse.
In the count case, Spark uses a wide transformation: it applies LocalLimit on each partition and shuffles the partial results to perform a GlobalLimit.
In the take case, Spark uses a narrow transformation and evaluates LocalLimit only on the first partition.
In other words, .limit(1).count() is likely to select one example from each partition of your dataset, before selecting one example from that list of examples. Your intent is to abort as soon as a single example is found, but unfortunately, count() doesn't seem smart enough to achieve that on its own.
As alluded to by the same example, though, you can use take(), first(), or head() to achieve the use case you want. This will more effectively limit the number of partitions that are examined:
If no shuffle is required (no aggregations, joins, or sorts), these operations will be optimized to inspect enough partitions to satisfy the operation - likely a much smaller subset of the overall partitions of the dataset.
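For example, a sketch of that approach, assuming the same df and column name as in the question:

from pyspark.sql import functions as F

# take(1) returns a list with at most one Row; an empty list means no null was found
has_nulls = len(df.where(F.isnull('column_name')).take(1)) > 0
if has_nulls:
    print("There are nulls")
else:
    print("Yey! No nulls")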
Please note, count() can be more performant in other cases. As the other SO question rightly pointed out,
neither guarantees better performance in general.
There may be more you can do.
Depending on your storage method and schema, you might be able to squeeze more performance out of your query.
Since you aren't even interested in the value of the row that was chosen in this case, you can throw a select(F.lit(True)) between your isnull and your take. This should in theory reduce the amount of information the workers in the cluster need to transfer. This is unlikely to matter if you have only a few columns of simple types, but if you have complex data structures, this can help and is very unlikely to hurt.
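Continuing the sketch above (same assumptions), that would look something like this:

# every matching row is replaced by a constant literal, so the workers
# don't have to ship any real column data back to the driver
has_nulls = len(
    df.where(F.isnull('column_name'))
      .select(F.lit(True))
      .take(1)
) > 0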
If you know how your data is partitioned and you know which partition(s) you're interested in or have a very good guess about which partition(s) (if any) are likely to contain null values, you should definitely filter your dataframe by that partition to speed up your query.
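For instance, if the table were partitioned by a date column (load_date here is purely hypothetical), something like this prunes the scan down to the partitions of interest before looking for a null:

# filter on the (hypothetical) partition column first, then look for a single null
has_nulls = len(
    df.where(F.col('load_date') == '2021-01-01')
      .where(F.isnull('column_name'))
      .take(1)
) > 0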

PostgreSQL Serialized Inserts Interleaving Sequence Numbers

I have multiple processes inserting into a Postgres (10.3) table using the SERIALIZABLE isolation level.
Another part of our system needs to read these records and be guaranteed that it receives all of them in sequence. For example, in the picture below, the consumer would need to
select * from table where sequenceNum > 2309 limit 5
and then receive sequence numbers 2310, 2311, 2312, 2313 and 2314.
The reading query uses the READ COMMITTED isolation level.
What I'm seeing, though, is that the reading query only receives the rows I've highlighted in yellow. Looking at the xmin, I'm guessing that transaction 334250 had begun but not finished, and then transactions 334251, 334252 et al. started and finished before my reading query began.
My question is: how did they get sequence numbers interleaved with those of 334250? Why weren't those transactions blocked by virtue of all the writing transactions being serialized?
Any suggestions on how to achieve what I'm after? That is, a guarantee that different transactions don't generate interleaving sequence numbers. (It's OK if there are gaps, but they can't interleave.)
Thanks very much for your help. I'm losing hair over this one!
PS - I just noticed that 334250 has a non-zero xmax. Is that a clue that I'm missing, perhaps?
The SQL standard in its usual brevity defines SERIALIZABLE as:
The execution of concurrent SQL-transactions at isolation level SERIALIZABLE is guaranteed to be serializable. A serializable execution is defined to be an execution of the operations of concurrently executing SQL-transactions that produces the same effect as some serial execution of those same SQL-transactions. A serial execution is one in which each SQL-transaction executes to completion before the next SQL-transaction begins.
In the light of this definition, I understand that your wish is that the sequence numbers be in the same order as the “serial execution” that “produces the same effect”.
Unfortunately, the equivalent serial ordering is not known at the time the transactions begin, because statements later in a transaction can determine the "logical" order of the transactions.
Sequence numbers, on the other hand, are handed out in the order of the wall-clock time at which each number was requested.
In a way, you would need sequence numbers that are determined by something that is not certain until the transactions commit, and that is a contradiction in terms.
So I think that it is not possible to get what you want, unless you actually serialize the execution, e.g. by locking the table in SHARE ROW EXCLUSIVE mode before you insert the data.
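A sketch of that workaround (your_table and the payload column are placeholders, not from the question):

BEGIN;
-- SHARE ROW EXCLUSIVE conflicts with itself and with the ROW EXCLUSIVE lock
-- taken by ordinary INSERTs, so only one writing transaction proceeds at a
-- time and sequence values are drawn and committed in a single serial order
-- (at the cost of write concurrency)
LOCK TABLE your_table IN SHARE ROW EXCLUSIVE MODE;
INSERT INTO your_table (payload) VALUES ('...');
COMMIT;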
My question is why you have that unusual demand. I cannot think of a good reason.

PostgreSQL Results Same Explanation on Different Queries

I have some complex queries that produce the same result. The only difference is the execution order: one query performs the selection before the join, while the other performs the join first and then the selection. However, when I read the explanation (on the Explain tab in pgAdmin III), both queries have the same diagram.
Why?
I'm not a pro at explaining this with all the correct terminology, but essentially the query planner attempts to find the most efficient way to execute the statement. It does this by breaking the query down into simpler sub-steps; just because you write it one way doesn't mean that is the order in which it will be executed. It is a bit like precedence in arithmetic (brackets, multiply, divide, etc.).
Certain operations will influence the order of execution, enabling you to "tune" your queries to make them more efficient: http://www.postgresql.org/docs/current/interactive/performance-tips.html
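As an illustration, the following two formulations (hypothetical tables a and b, not taken from the question) typically produce the identical plan, because the planner pushes the filter down and chooses the join order itself:

-- selection written before the join
SELECT *
FROM (SELECT * FROM a WHERE a.x = 1) AS fa
JOIN b ON b.a_id = fa.id;

-- selection written after the join
SELECT *
FROM a
JOIN b ON b.a_id = a.id
WHERE a.x = 1;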

How to avoid race conditions with Scala parallel collections

Are parallel collections intended to do operations with side effects? If so, how can you avoid race conditions?
For example:
var sum=0
(1 to 10000).foreach(n=>sum+=n); println(sum)
50005000
no problem with this.
But if I try to parallelize it, race conditions happen:
var sum=0
(1 to 10000).par.foreach(n=>sum+=n);println(sum)
49980037
Quick answer: don't do that. Parallel code should be parallel, not concurrent.
Better answer:
val sum = (1 to 10000).par.reduce(_+_) // depends on commutativity and associativity
See also aggregate.
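For reference, a minimal sketch of the aggregate variant:

// seqop folds elements within each chunk, combop merges the partial sums
val sum = (1 to 10000).par.aggregate(0)(_ + _, _ + _)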
The parallel case doesn't work because you don't use volatile variables (so visibility of your writes is not ensured) and because you have multiple threads that each do the following:
read sum into a register
add the element's value to the register
write the updated value back to memory
If two threads both perform step 1 before either has written back, and then proceed with the remaining steps in any order, one of the updates will be overwritten.
Use the @volatile annotation to ensure visibility of sum when doing something like this.
Even with @volatile, you will lose some increments because the increment itself is not atomic. You should use an AtomicInteger and its addAndGet/incrementAndGet methods.
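A sketch of that approach (correct, but slow for the reasons given below):

import java.util.concurrent.atomic.AtomicInteger

val sum = new AtomicInteger(0)
(1 to 10000).par.foreach(n => sum.addAndGet(n))  // each addition is atomic, so no update is lost
println(sum.get)  // 50005000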
Although using atomic counters will ensure correctness, having shared variables here hinders performance greatly - your shared variable is now a performance bottleneck because every thread will try to atomically write to the same cache line. If you wrote to this variable infrequently, it wouldn't be a problem, but since you do it in every iteration, there will be no speedup here - in fact, due to cache-line ownership transfer between processors, it will probably be slower.
So, as Daniel suggested - use reduce for this.