How is PutIf different from PutIfExists and PutIfNotExists - scalardb

I am using ScalarDB, which adds ACID support on top of Cassandra. Referring to the documentation, how is PutIf different from PutIfExists and PutIfNotExists?
I suppose
PutIfExists is like an update
PutIfNotExists is like a new addition
What is PutIf, and when should I use it?

PutIf lets you attach arbitrary conditions to an update of a record. Use PutIfExists when you just want to update an existing record without any other condition, and use PutIfNotExists when you want to insert a new record without unexpectedly overwriting an existing one (an example of both follows the PutIf example below).
For example, suppose a table has the records below.
We want to set pass to true for a record if its score is greater than or equal to 60.
(We assume the records have been inserted beforehand and that we have calculated a threshold (mean, median, etc.); now we decide whether each ID passes that threshold.)
|pID|cID|score| pass|
| 0| 0| 80|false|
| 1| 0| 45|false|
The following put will update the first record to set pass to true.
Key pk = new Key(new IntValue("pID", 0));   // partition key
Key ck = new Key(new IntValue("cID", 0));   // clustering key
PutIf condition = new PutIf(
    new ConditionalExpression("score", new IntValue(60), Operator.GTE));  // apply only if score >= 60
Put put = new Put(pk, ck)
    .withValue(new BooleanValue("pass", true))
    .withCondition(condition);
storage.put(put);
If we try to update the second record with the same condition, it won't be updated because the score is less than 60.
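For comparison, here is a minimal sketch of the other two conditions, reusing the keys from the example above (untested; the score written with PutIfNotExists is just an illustrative value):
// Plain update: apply only if the record already exists
Put update = new Put(pk, ck)
    .withValue(new BooleanValue("pass", true))
    .withCondition(new PutIfExists());
storage.put(update);

// Insert only if the record does not exist yet, so nothing gets overwritten
Put insert = new Put(pk, ck)
    .withValue(new IntValue("score", 70))
    .withCondition(new PutIfNotExists());
storage.put(insert);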

Related

Calculate hash over a whole column from pyspark dataframe

I have a big data frame (approximately 40 million rows) which looks like this:
|col A | col B |
|------|-------|
|valA1 | valB1 |
|valA2 | valB2 |
I want to compare two columns that live in different data frames, in different workspaces. I am not allowed to bring both of them into the same environment. What I want is to create a hash value for each column, so I can compare it with the corresponding column from the other data frame.
The easy approach would be to concatenate all values of a column and then hash the resulting string, but because of the size of the data frame I cannot do this.
So far I tried this version, but it takes too long:
hashlib.sha256(''.join(map(str,df.agg(collect_list(col("colName"))).first()[0])).encode('utf-8')).hexdigest()
and also this, which takes just as long:
def compute_hash(df):
    hasher = hashlib.sha256()
    # stream rows to the driver one partition at a time
    dataCollect = df.rdd.toLocalIterator()
    for row in dataCollect:
        hasher.update(row['colName'].encode('utf-8'))
    return hasher.hexdigest()
Is this achievable in spark in a reasonable time?
You don't need to hash the whole string at once.
Example using sha256 from the hashlib library:
import hashlib

column = ['valA1', 'valA2', 'valA3']
hasher = hashlib.sha256()
for row in column:
    hasher.update(row.encode('utf-8'))
print(hasher.hexdigest())
# >>> 68f900960718b4881107929da0918e0e9f50599b12ebed3ec70066e55c3ec5f4
The update method lets you hash the data incrementally, as you process it.
The solution was to group by a column that contains evenly distributed data. This way, Spark triggers parallel execution for every value of "columnToGroupBy" and generates a dataframe whose first column contains all the values of "columnToGroupBy" and whose second column contains a hash over the concatenated values of "colToHash" for that value of "columnToGroupBy". For example, if we had this table:
|columnToGroupBy|colToHash|
|---------------|---------|
|ValueA1        |ValB1    |
|ValueA2        |ValB2    |
|ValueA1        |ValB3    |
|ValueA2        |ValB4    |
After applying this function:
df.groupby("columnToGroupBy").agg(md5(concat_ws(",", array_sort(collect_set(col("colToHash"))))))
We would get the following dataframe:
|columnToGroupBy|hash            |
|---------------|----------------|
|ValueA1        |md5(ValB1,ValB3)|
|ValueA2        |md5(ValB2,ValB4)|
This new dataframe has a small number of rows (equal to the number of distinct values in "columnToGroupBy"), so you can easily generate a hash for the whole column by collecting all the values from the "hash" column, concatenating them, and hashing the result.
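A rough sketch of that final step (the .alias("hash") and the ordering by "columnToGroupBy" are my own additions to keep the result deterministic; untested):
from pyspark.sql.functions import md5, concat_ws, array_sort, collect_set, col
import hashlib

# per-group hashes as above, with the aggregated column aliased to "hash"
grouped = df.groupby("columnToGroupBy").agg(
    md5(concat_ws(",", array_sort(collect_set(col("colToHash"))))).alias("hash"))

# the grouped result is small, so it can be collected; order it deterministically,
# then hash the concatenation to get a single hash for the whole column
rows = grouped.orderBy("columnToGroupBy").select("hash").collect()
column_hash = hashlib.sha256(",".join(r["hash"] for r in rows).encode("utf-8")).hexdigest()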

PostgreSQL: How to get ADD UNIQUE to work

I've truncated my table and then added a unique constraint:
ALTER TABLE mytable
ADD UNIQUE (id,loan_type,term,oldestyear)
After inserting data, it still allows duplicates. Did I do something wrong?
id |loan_type |apr |term|oldestyear|valid_period
---------|---------------|------|----|----------|---------------------------------
8333|auto new |0.0249| 36| |["2019-02-26 22:48:07.305304-08",
8333|auto new |0.0249| 36| |["2019-02-26 22:47:38.421624-08",
I want id, loan_type, term, and oldestyear to be a unique set. oldestyear will sometimes be NULL, but the set should still be treated as unique in that case.
A unique constraint never treats two NULL values as equal, so rows that differ only by a NULL in oldestyear do not count as duplicates. If you can find at least one invalid value (a non-null value that can never occur in real data), you can create a unique index on an expression that converts NULL to that value, so the comparison treats such rows as identical:
create unique index on mytable (id,loan_type,term,coalesce(oldestyear, -42));

TopologyTestDriver sending incorrect message on KTable aggregations

I have a topology that aggregates on a KTable.
This is a generic method I created to build this topology on different topics I have.
public static <A, B, C> KTable<C, Set<B>> groupTable(KTable<A, B> table, Function<B, C> getKeyFunction,
        Serde<C> keySerde, Serde<B> valueSerde, Serde<Set<B>> aggregatedSerde) {
    return table
        .groupBy((key, value) -> KeyValue.pair(getKeyFunction.apply(value), value),
                Serialized.with(keySerde, valueSerde))
        .aggregate(() -> new HashSet<>(), (key, newValue, agg) -> {
            // adder: replace any previous copy of the value with the new one
            agg.remove(newValue);
            agg.add(newValue);
            return agg;
        }, (key, oldValue, agg) -> {
            // subtractor: drop the old value from the set
            agg.remove(oldValue);
            return agg;
        }, Materialized.with(keySerde, aggregatedSerde));
}
This works pretty well when using Kafka, but not when testing via `TopologyTestDriver`.
In both scenarios, when I get an update, the subtractor is called first and then the adder is called. The problem is that when using the TopologyTestDriver, two messages are sent out for updates: one after the subtractor call, and another one after the adder call. Not to mention that the message sent after the subtractor and before the adder reflects an incorrect intermediate state.
Can anyone else confirm whether this is a bug? I've tested this with both Kafka versions 2.0.1 and 2.1.0.
EDIT:
I created a test case on GitHub to illustrate the issue: https://github.com/mulho/topology-testcase
It is expected behavior that there are two output records (one "minus" record, and one "plus" record). It's a little tricky to understand how it works, so let me try to explain.
Assume you have the following input table:
key | value
-----+---------
A | <10,2>
B | <10,3>
C | <11,4>
In KTable#groupBy() you extract the first part of the value as the new key (i.e., 10 or 11) and later sum up the second part (i.e., 2, 3, 4) in the aggregation. Because records A and B both have 10 as the new key, you would sum 2+3, and you would sum 4 for the new key 11. The result table would be:
key | value
-----+---------
10 | 5
11 | 4
Now assume that an update record <B,<11,5>> changes the original input KTable to:
key | value
-----+---------
A | <10,2>
B | <11,5>
C | <11,4>
Thus, the new result table should sum up 5+4 for 11 and 2 for 10:
key | value
-----+---------
10 | 2
11 | 9
If you compare the first result table with the second, you might notice that both rows got updated. The old B|<10,3> record is subtracted from 10|5, resulting in 10|2, and the new B|<11,5> record is added to 11|4, resulting in 11|9.
These are exactly the two output records you see. The first output record (emitted after the subtractor is executed) updates the first row: it subtracts the old value that is no longer part of the aggregation result. The second record adds the new value to the aggregation result. In our example, the subtract record would be <10,<null,<10,3>>> and the add record would be <11,<<11,5>,null>>; the format of those records is <key,<plus,minus>>, and note that the subtract record only sets the minus part while the add record only sets the plus part.
Final remark: it is not possible to combine the plus and minus records into one, because the keys of the plus and minus records can be different (in our example, 11 and 10) and thus might go to different partitions. This implies that the plus and minus operations might be executed by different machines, so it is not possible to emit a single record that contains both the plus and the minus part.
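For reference, the sum aggregation used in this explanation would look roughly like the sketch below (Pair here is a hypothetical value class holding the two parts of the value, and it uses the same deprecated Serialized API as the question; untested):
KTable<Integer, Integer> sums = table
    .groupBy((key, value) -> KeyValue.pair(value.getFirst(), value.getSecond()),
             Serialized.with(Serdes.Integer(), Serdes.Integer()))
    .aggregate(() -> 0,
        (newKey, newValue, agg) -> agg + newValue,   // adder: add the re-keyed value
        (newKey, oldValue, agg) -> agg - oldValue,   // subtractor: remove the old value
        Materialized.with(Serdes.Integer(), Serdes.Integer()));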

Iterate through a dataframe and dynamically assign ID to records based on substring [Spark][Scala]

Currently I have an input file (millions of records) where every record contains a 2-character Identifier. Multiple lines in this input file will be concatenated into only one record in the output file, and how this is determined is based solely on the sequential order of the Identifier.
For example, the records would begin as below:
1A
1B
1C
2A
2B
2C
1A
1C
2B
2C
1A
1B
1C
1A marks the beginning of a new record, so the output file would have 3 records in this case. Everything between the "1A"s will be combined into one record
1A+1B+1C+2A+2B+2C
1A+1C+2B+2C
1A+1B+1C
The number of records between the "1A"s varies, so I have to iterate through and check the Identifier.
I am unsure how to approach this situation using scala/spark.
My strategy is to:
Load the input file into a dataframe.
Create an Identifier column based on a substring of each record.
Create a new column, TempID, and a variable, x, initialized to 0.
Iterate through the dataframe:
if Identifier = 1A, x = x + 1
TempID = x
Then create a UDF to concatenate records with the same TempID.
To summarize my question:
How would I iterate through the dataframe, check the value of the Identifier column, then assign a tempID (whose value increases by 1 whenever the value of the Identifier column is 1A)?
This is dangerous. The issue is that Spark is not guaranteed to keep the same order among elements, especially since they might cross partition boundaries. So when you iterate over them you could get a different order back. This also has to happen entirely sequentially, so at that point why not just skip Spark entirely and run it as regular Scala code as a preprocessing step before getting to Spark?
My recommendation would be to either look into writing a custom input format/data source, or perhaps to use "1A" as a record delimiter, similar to this question.
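If you go the record-delimiter route, a rough sketch could look like this (it assumes "1A" only ever appears at the start of a logical record, and input.txt is a hypothetical path; untested):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// make the underlying TextInputFormat split records on a newline followed by "1A"
// instead of on single newlines
spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "\n1A")

val records = spark.sparkContext
  .textFile("input.txt")                                 // hypothetical path
  .map(_.trim)
  .filter(_.nonEmpty)
  .map(r => if (r.startsWith("1A")) r else "1A\n" + r)   // restore the delimiter consumed by the split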
First - usually "iterating" over a DataFrame (or Spark's other distributed collection abstractions like RDD and Dataset) is either wrong or impossible. The term simply does not apply. You should transform these collections using Spark's functions instead of trying to iterate over them.
You can achieve your goal (or almost - details to follow) using Window Functions. The idea here would be to (1) add an "id" column to sort by, (2) use a Window function (based on that ordering) to count the number of previous instances of "1A", and then (3) use these counts as the "group id" that ties all records of each group together, and group by it:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
// sample data:
val df = Seq("1A", "1B", "1C", "2A", "2B", "2C", "1A", "1C", "2B", "2C", "1A", "1B", "1C").toDF("val")
val result = df.withColumn("id", monotonically_increasing_id()) // add row ID
.withColumn("isDelimiter", when($"val" === "1A", 1).otherwise(0)) // add group "delimiter" indicator
.withColumn("groupId", sum("isDelimiter").over(Window.orderBy($"id"))) // add groupId using Window function
.groupBy($"groupId").agg(collect_list($"val") as "list") // NOTE: order of list might not be guaranteed!
.orderBy($"groupId").drop("groupId") // removing groupId
result.show(false)
// +------------------------+
// |list |
// +------------------------+
// |[1A, 1B, 1C, 2A, 2B, 2C]|
// |[1A, 1C, 2B, 2C] |
// |[1A, 1B, 1C] |
// +------------------------+
(if having the result as a list does not fit your needs, I'll leave it to you to transform this column to whatever you need)
The major caveat here is that collect_list does not necessarily guarantee preserving order - once you use groupBy, the order is potentially lost. So - the order within each resulting list might be wrong (the separation to groups, however, is necessarily correct). If that's important to you, it can be worked around by collecting a list of a column that also contains the "id" column and using it later to sort these lists.
EDIT: realizing this answer isn't complete without solving this caveat, and realizing it's not trivial - here's how you can solve it:
Define the following UDF:
import scala.collection.mutable
import org.apache.spark.sql.Row

val getSortedValues = udf { (input: mutable.Seq[Row]) => input
  .map { case Row(id: Long, v: String) => (id, v) }  // unpack the (id, value) structs
  .sortBy(_._1)                                      // sort by the original row id
  .map(_._2)                                         // keep only the values, now in order
}
Then, replace the line .groupBy($"groupId").agg(collect_list($"val") as "list") in the suggested solution above with these lines:
.groupBy($"groupId")
.agg(collect_list(struct($"id" as "_1", $"val" as "_2")) as "list")
.withColumn("list", getSortedValues($"list"))
This way we necessarily preserve the order (at the price of sorting these small lists).

How can I count multiple foreign keys and group them in a row?

I have one table with the following structure:
ID|FK1|FK2|FK3|FK4
ID|FK1|FK2|FK3|FK4
ID|FK1|FK2|FK3|FK4
And another table that holds:
FK|DATA
FK|DATA
FK|DATA
FK|DATA
The FKn columns in the first table reference the FK field in the second one. There can be more than one record linked between the first table and the second one.
What I want to achieve is to create another table with the total number of linked records for every FKn column. For example:
ID|FK1|FK2|FK3|FK4
A|0 |23 |9 |3
B|4 |0 |2 |0
I know how to transform the row flow and iterate over every FKn field. I also know how to count. What I'm not able to do is group every FKn count from the same ID into one row, because after I use a tLoop component, every count operation is turned into a new row like:
FK|count
FK|count
FK|count
FK|count
...
Any idea about how to join rows by packing N of them into one single row each time? Or is there another way to do it?
NOTE: I'm using text data as input
If I understood your issue correctly, I would suggest a different approach, as below
(provided you have a fixed set of columns FK1, FK2, FK3, FK4).
tFileInput --> tMap (left join to the lookup tAnotherTable(FK, DATA) on FK1) --> output-1, which will have the columns ID, FK1 = (0 if no matching row is found, 1 if a matching row is found), FK2 = 0, FK3 = 0, FK4 = 0. For the same ID we can get many FK1 rows, since, as you mentioned, there can be more than one linked record.
Similarly I will have:
tFileInput --> tMap (left join to the lookup tAnotherTable(FK, DATA) on FK2) --> output-2, which will have the columns ID, FK1 = 0, FK2 = (0 if no matching row is found, 1 if a matching row is found), FK3 = 0, FK4 = 0.
tFileInput --> tMap (left join to the lookup tAnotherTable(FK, DATA) on FK3) --> output-3, which will have the columns ID, FK1 = 0, FK2 = 0, FK3 = (0 if no matching row is found, 1 if a matching row is found), FK4 = 0.
... (and the same again for FK4, producing output-4)
Next I will union all of these (output-1, output-2, output-3, output-4) into a final result set, say ResultUnionAll, move it to a tAgg (aggregate) component, group by the ID column, and take SUM(FK1), SUM(FK2), SUM(FK3), SUM(FK4).
To summarize, your job will look something like this:
tFileInput --> tMap (with lookup) --> tHash1/tFileOutput1
tFileInput --> tMap (with lookup) --> tHash2/tFileOutput2
tFileInput --> tMap (with lookup) --> tHash3/tFileOutput3
tFileInput --> tMap (with lookup) --> tHash4/tFileOutput4
tHashInput1/tFileInput1 ---
tHashInput2/tFileInput2 ---
tHashInput3/tFileInput3 ---
tHashInput4/tFileInput4 --- tUnite --> tAgg --> finalOutput