I have a topology that aggregates on a KTable.
This is a generic method I created to build this topology on different topics I have.
public static <A, B, C> KTable<C, Set<B>> groupTable(KTable<A, B> table, Function<B, C> getKeyFunction,
                                                     Serde<C> keySerde, Serde<B> valueSerde, Serde<Set<B>> aggregatedSerde) {
    return table
        .groupBy((key, value) -> KeyValue.pair(getKeyFunction.apply(value), value),
                 Serialized.with(keySerde, valueSerde))
        .aggregate(() -> new HashSet<>(), (key, newValue, agg) -> {
            // adder: replace any equal element with the new value
            agg.remove(newValue);
            agg.add(newValue);
            return agg;
        }, (key, oldValue, agg) -> {
            // subtractor: drop the old value from the set
            agg.remove(oldValue);
            return agg;
        }, Materialized.with(keySerde, aggregatedSerde));
}
This works pretty well when using Kafka, but not when testing via `TopologyTestDriver`.
In both scenarios, when I get an update, the subtractor is called first and then the adder. The problem is that, when using the TopologyTestDriver, two messages are sent out for each update: one after the subtractor call and another one after the adder call. Not to mention that the message sent after the subtractor and before the adder represents an intermediate, incorrect state.
Can anyone else confirm this is a bug? I've tested this with both Kafka versions 2.0.1 and 2.1.0.
EDIT:
I created a test case on GitHub to illustrate the issue: https://github.com/mulho/topology-testcase
It is expected behavior that there are two output records (one "minus" record, and one "plus" record). It's a little tricky to understand how it works, so let me try to explain.
Assume you have the following input table:
key | value
-----+---------
A | <10,2>
B | <10,3>
C | <11,4>
In KTable#groupBy() you extract the first part of the value as the new key (i.e., 10 or 11) and later sum the second part (i.e., 2, 3, 4) in the aggregation. Because records A and B both have 10 as the new key, you sum 2+3 for key 10, and 4 for the new key 11. The result table would be:
key | value
-----+---------
10 | 5
11 | 4
Now assume that an update record <B,<11,5>> changes the original input KTable to:
key | value
-----+---------
A | <10,2>
B | <11,5>
C | <11,4>
Thus, the new result table should sum up 5+4 for 11 and 2 for 10:
key | value
-----+---------
10 | 2
11 | 9
If you compare the first result table with the second, you might notice that both rows got updated. The old B|<10,3> record is subtracted from 10|5, resulting in 10|2, and the new B|<11,5> record is added to 11|4, resulting in 11|9.
These are exactly the two output records you see. The first output record (emitted after the subtractor runs) updates the first row: it subtracts the old value that is no longer part of the aggregation result. The second record adds the new value to the aggregation result. In our example, the subtract record would be <10,<null,<10,3>>> and the add record would be <11,<<11,5>,null>>. The format of those records is <key,<plus,minus>>; note that the subtract record only sets the minus part, while the add record only sets the plus part.
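To make the mechanics concrete, here is a tiny plain-Scala walk-through (illustration only, not Kafka Streams code) of the update <B,<11,5>> from the tables above; it mirrors how the subtractor fires against the old key's aggregate and the adder against the new key's:
var agg = Map(10 -> 5, 11 -> 4)  // state after records A, B, C
agg += 10 -> (agg(10) - 3)       // subtractor removes the old value <10,3>  => emits <10, 2>
agg += 11 -> (agg(11) + 5)       // adder applies the new value <11,5>       => emits <11, 9>
// agg == Map(10 -> 2, 11 -> 9): one output record per step, hence two records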
Final remark: it is not possible to put the plus and minus records together, because the keys of the plus and minus records can be different (in our example 11 and 10) and thus might go into different partitions. This implies that the plus and minus operations might be executed by different machines, and thus it's not possible to emit only one record that contains both the plus and minus parts.
I'm having some issues with my Kafka Streams implementation in production. I've implemented a function that takes a KTable and a KStream and yields another KTable with aggregated results based on the join of these two inputs. The idea is to iterate over a list in the KStream input and, for each element, join it with the KTable, aggregate the joined events into a list, and sink to a topic an event containing the original event and the list of joined events (a 1-to-N join).
Context
This is how my component interacts with its context: MovementEvent contains a list of transaction_ids, each of which should match the transaction_id of a TransactionEvent; the joiner should match them and generate a new event (SinkedEvent) with the original MovementEvent and a list of the matched TransactionEvents.
For reference, the Movement topic has 12 million records, while the Transaction topic has 21 million.
Implementation
public class SinkEventProcessor implements BiFunction<
        KTable<TransactionKey, Transaction>,
        KStream<String, Movement>,
        KTable<SinkedEventKey, SinkedEvent>> {

    @Override
    public KTable<SinkedEventKey, SinkedEvent> apply(final KTable<TransactionKey, Transaction> transactionTable,
                                                     final KStream<String, Movement> movementStream) {
        return movementStream
            // [A]
            .flatMap((movementKey, movement) -> movement
                .getTransactionIds()
                .stream()
                .distinct()
                .map(transactionId -> new KeyValue<>(
                    TransactionKey.newBuilder()
                        .setTransactionId(transactionId)
                        .build(),
                    movement))
                .toList())
            // [B]
            .join(transactionTable, (movement, transaction) -> Pair.newBuilder()
                .setMovement(movement)
                .setTransaction(transaction)
                .build())
            // [C]
            .groupBy((transactionKey, pair) -> SinkedEventKey.newBuilder()
                .setMovementId(pair.getMovement().getMovementId())
                .build())
            // [D]
            .aggregate(SinkedEvent::new, (key, pair, collectable) ->
                collectable.setMovement(pair.getMovement())
                    .addTransaction(pair.getTransaction()));
    }
}
[A] I have started the implementation by iterating the Movement KStream, extracting the transactionId and creating a TransactionKey to use as the new key for the following operation, to facilitate the join with each transactionId present in the Movement entity. This operation returns a KStream<TransactionKey, Movement>
[B] Joins the formerly transformed KStream and adds each value to an intermediate pair. Returns a `KStream<TransactionKey, Pair>`.
[C] Groups the pairs by movementId and constructs the new key (SinkedEventKey) for the sink operation.
[D] Aggregates into the result object (SinkedEvent) by adding the transaction to the list. This operation will also sink to the topic as a KTable<SinkedEventKey, SinkedEvent>
Problem
The problem appears once we start processing the stream: the sink operation of the processor generates more records than it should. For instance, for a Movement with 4 transaction_ids, the output topic ends up looking like this:
partition | offset | count of [TransactionEvent] | expected count
----------+--------+-----------------------------+---------------
0         | 1      | 1                           | 4
0         | 2      | 2                           | 4
0         | 3      | 4                           | 4
0         | 4      | 4                           | 4
And the same happens for other records (e.g. a Movement with 13 transaction_ids will yield 13 messages). So for some reason that I can't comprehend, the aggregate operation is sinking on each update, instead of collecting into the list and sinking only once.
I've tried to reproduce it in a development cluster, with exactly the same settings, to no avail. Everything seems to work properly when I try to reproduce it (a Movement with 8 transactions produces only 1 record), but whenever I bring it to production it doesn't work as intended. I'm not sure what I'm missing; any help?
I'm trying to insert some data into a MariaDB database. I got two tables and I have to insert the rows (using a batch insert) into the first table and use the IDs of the newly-inserted rows to perform a second batch insert into the second table.
I'm doing so in Scala using Alpakka Slick. For the purpose of this question, let's call tests the main table and dependent the second one.
At the moment, my algorithm is as follows:
1. Insert the rows into tests
2. Fetch the ID of the first row in the batch using SELECT LAST_INSERT_ID();
3. Knowing the ID of the first row and the number of rows in the batch, compute the other IDs by hand and use them for the insertion into the second table
This works pretty well with only one connection at a time. However, I'm trying to simulate a scenario with multiple attempts to write simultaneously. To do that, I'm using Scala parallel collections and Akka Stream Source as follows:
// three sources of 10 random Strings each
val sources = Seq.fill(3)(Source(Seq.fill(10)(Random.alphanumeric.take(3).mkString))).zipWithIndex
val parallelSources: ParSeq[(Source[String, NotUsed], Int)] = sources.par
parallelSources.map { case (source, i) =>
  source
    .grouped(ChunkSize) // performs batch inserts of a given size
    .via(insert(i))
    .zipWithIndex
    .runWith(Sink.foreach { case (_, chunkIndex) => println(s"Chunk $chunkIndex of source $i done") })
}
I'm adding an index to each Source just to use it as a prefix in the data I write to the DB.
Here's the code of the insert Flow I've written so far:
def insert(srcIndex: Int): Flow[Seq[String], Unit, NotUsed] = {
  implicit val insertSession: SlickSession = slickSession
  system.registerOnTermination(() => insertSession.close())
  Flow[Seq[String]]
    .via(Slick.flowWithPassThrough { chunk =>
      (for {
        // insert data into `tests`
        _ <- InsTests ++= chunk.map(v => TestProj(s"source$srcIndex-$v"))
        // fetch last insert ID and connection ID
        queryResult <- sql"SELECT CONNECTION_ID(), LAST_INSERT_ID();".as[(Long, Long)].headOption
        _ <- queryResult match {
          case Some((connId, firstIdInChunk)) =>
            println(s"Source $srcIndex, last insert ID $firstIdInChunk, connection $connId")
            // compute IDs by hand and write to `dependent`
            val depValues = Seq.fill(ChunkSize)(s"source$srcIndex-${Random.alphanumeric.take(6).mkString}")
            val depRows =
              (firstIdInChunk to (firstIdInChunk + ChunkSize))
                .zip(depValues)
                .map { case (index, value) => DependentProj(index, value) }
            InsDependent ++= depRows
          case None => DBIO.failed(new Exception("..."))
        }
      } yield ()).transactionally
    })
}
Where InsTests and InsDependent are Slick's TableQuery objects. slickSession creates a new session for each different insert and is defined as follows:
private def slickSession = {
  val db = Database.forURL(
    url = "jdbc:mariadb://localhost:3306/test",
    user = "root",
    password = "password",
    executor = AsyncExecutor(
      name = "executor",
      minThreads = 20,
      maxThreads = 20,
      queueSize = 1000,
      maxConnections = 20
    )
  )
  val profile = slick.jdbc.MySQLProfile
  SlickSession.forDbAndProfile(db, profile)
}
The problem is that the last insert IDs returned by the second step of the algorithm overlap. Every run of this app would print something like:
Source 2, last insert ID 6, connection 66
Source 1, last insert ID 5, connection 68
Source 0, last insert ID 7, connection 67
Chunk 0 of source 0 done
Chunk 0 of source 2 done
Chunk 0 of source 1 done
Source 2, last insert ID 40, connection 70
Source 0, last insert ID 26, connection 69
Source 1, last insert ID 27, connection 71
Chunk 1 of source 2 done
Chunk 1 of source 1 done
Chunk 1 of source 0 done
Where it looks like the connection is a different one for each Source, but the IDs overlap (Source 0 sees 7, Source 1 sees 5, Source 2 sees 6). It is correct that IDs start from 5, as I'm adding 4 dummy rows right after creating the tables (not shown in this question's code). Obviously, I see multiple rows in dependent with the same tests.id, which shouldn't happen.
It's my understanding that last insert IDs refer to a single connection. How is it possible that three different connections see overlapping IDs, considering that the entire flow is wrapped in a transaction (via Slick's transactionally)?
This happens with innodb_autoinc_lock_mode=1. As far as I've seen so far, it doesn't with innodb_autoinc_lock_mode=0, which makes sense, since InnoDB would lock tests until the whole batch insert terminates.
UPDATE after Georg's answer: For some other constraints in the project, I'd like the solution to be compatible with MariaDB 10.4, which, as far as I understand, doesn't feature INSERT...RETURNING. Additionally, Slick's ++= operator's support for returning is quite bad, as also reported here. I tested it on both MariaDB 10.4 and 10.5, and, according to the query logs, Slick does execute single INSERT INTO statements instead of a batch one. In my case, this is not quite acceptable, as I'm planning on writing several chunks of rows in a streaming fashion.
While I also understand that making assumptions about the auto-increment value being 1 is not ideal, we do have control over the Production setup and do not have multi-master replication.
You cannot generate subsequent values based on LAST_INSERT_ID():
There might be a second transaction running at the same time which was rolled back, so there will be a gap in your auto-incremented IDs.
Iterating over the number of rows by incrementing the LAST_INSERT_ID value will not work, since it depends on the value of the session variable @@auto_increment_increment (which, especially with multi-master replication, is not 1).
Instead, you should use RETURNING to get the IDs of the inserted rows:
MariaDB [test]> create table t1 (a int not null auto_increment primary key);
Query OK, 0 rows affected (0,022 sec)
MariaDB [test]> insert into t1 (a) values (1),(3),(NULL), (NULL) returning a;
+---+
| a |
+---+
| 1 |
| 3 |
| 4 |
| 5 |
+---+
4 rows in set (0,006 sec)
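For completeness, here is a minimal sketch (not from the question's codebase) of how the RETURNING approach could be wired into Slick via plain SQL, assuming MariaDB 10.5+ and a tests(name) table with an auto-increment id; the table/column names and the manual escaping are illustrative only:
import slick.jdbc.MySQLProfile.api._

def insertChunkReturningIds(chunk: Seq[String]): DBIO[Vector[Long]] = {
  // Build the multi-row VALUES clause; in real code, prefer bound parameters over
  // string concatenation to avoid SQL injection.
  val values = chunk.map(v => s"('${v.replace("'", "''")}')").mkString(",")
  sql"""INSERT INTO tests (name) VALUES #$values RETURNING id""".as[Long]
}
Inside the existing transactional for-comprehension, the returned IDs could then replace the hand-computed range, e.g. ids <- insertChunkReturningIds(...) followed by InsDependent ++= ids.zip(depValues).map { case (id, value) => DependentProj(id, value) }.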
Currently I have an input file (millions of records) where all the records contain a 2-character Identifier. Multiple lines in this input file will be concatenated into only one record in the output file, and how this is determined is SOLELY based on the sequential order of the Identifier.
For example, the records would begin as below
1A
1B
1C
2A
2B
2C
1A
1C
2B
2C
1A
1B
1C
1A marks the beginning of a new record, so the output file would have 3 records in this case. Everything between the "1A"s will be combined into one record
1A+1B+1C+2A+2B+2C
1A+1C+2B+2C
1A+1B+1C
The number of records between the "1A"s varies, so I have to iterate through and check the Identifier.
I am unsure how to approach this situation using Scala/Spark.
My strategy is to:
1. Load the input file into a DataFrame.
2. Create an Identifier column based on a substring of the record.
3. Create a new column, TempID, and a variable x set to 0.
4. Iterate through the DataFrame: if Identifier = 1A, then x = x + 1; set TempID = x.
5. Then create a UDF to concat records with the same TempID.
To summarize my question:
How would I iterate through the DataFrame, check the value of the Identifier column, then assign a TempID (whose value increases by 1 if the value of the Identifier column is 1A)?
This is dangerous. The issue is that Spark is not guaranteed to keep the same order among elements, especially since they might cross partition boundaries. So when you iterate over them you could get a different order back. This also has to happen entirely sequentially, so at that point why not just skip Spark entirely and run it as regular Scala code, as a preprocessing step before getting to Spark?
My recommendation would be to either look into writing a custom data inputformat/data source, or perhaps you could use "1A" as a record delimiter similar to this question.
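As a rough illustration of the record-delimiter idea, here is a minimal sketch (assuming plain-text input where a line starting with "1A" begins each record; the path and names are illustrative, not from the question):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("delimiter-demo").master("local[*]").getOrCreate()

val hadoopConf = new Configuration(spark.sparkContext.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "\n1A") // split whenever a new "1A" line starts

val records = spark.sparkContext
  .newAPIHadoopFile("input.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (_, text) => text.toString.trim }
  .map(r => if (r.startsWith("1A")) r else "1A\n" + r)  // re-attach the "1A" consumed by the delimiter
  .map(_.split("\n").map(_.trim).mkString("+"))         // 1A+1B+1C+...

records.take(3).foreach(println)
This keeps the grouping out of Spark's shuffle machinery entirely, at the cost of tying the logic to the physical file layout.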
First - usually "iterating" over a DataFrame (or Spark's other distributed collection abstractions like RDD and Dataset) is either wrong or impossible. The term simply does not apply. You should transform these collections using Spark's functions instead of trying to iterate over them.
You can achieve your goal (or - almost, details to follow) using Window Functions. The idea here would be to (1) add an "id" column to sort by, (2) use a Window function (based on that ordering) to count the number of previous instances of "1A", and then (3) using these "counts" as the "group id" that ties all records of each group together, and group by it:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
// sample data:
val df = Seq("1A", "1B", "1C", "2A", "2B", "2C", "1A", "1C", "2B", "2C", "1A", "1B", "1C").toDF("val")
val result = df.withColumn("id", monotonically_increasing_id()) // add row ID
  .withColumn("isDelimiter", when($"val" === "1A", 1).otherwise(0)) // add group "delimiter" indicator
  .withColumn("groupId", sum("isDelimiter").over(Window.orderBy($"id"))) // add groupId using Window function
  .groupBy($"groupId").agg(collect_list($"val") as "list") // NOTE: order of list might not be guaranteed!
  .orderBy($"groupId").drop("groupId") // removing groupId
result.show(false)
// +------------------------+
// |list |
// +------------------------+
// |[1A, 1B, 1C, 2A, 2B, 2C]|
// |[1A, 1C, 2B, 2C] |
// |[1A, 1B, 1C] |
// +------------------------+
(if having the result as a list does not fit your needs, I'll leave it to you to transform this column to whatever you need)
The major caveat here is that collect_list does not necessarily guarantee preserving order - once you use groupBy, the order is potentially lost. So - the order within each resulting list might be wrong (the separation to groups, however, is necessarily correct). If that's important to you, it can be worked around by collecting a list of a column that also contains the "id" column and using it later to sort these lists.
EDIT: realizing this answer isn't complete without solving this caveat, and realizing it's not trivial - here's how you can solve it:
Define the following UDF:
import scala.collection.mutable
import org.apache.spark.sql.Row

val getSortedValues = udf { (input: mutable.Seq[Row]) => input
  .map { case Row(id: Long, v: String) => (id, v) }
  .sortBy(_._1)
  .map(_._2)
}
Then, replace the line .groupBy($"groupId").agg(collect_list($"val") as "list") in the suggested solution above with these lines:
.groupBy($"groupId")
.agg(collect_list(struct($"id" as "_1", $"val" as "_2")) as "list")
.withColumn("list", getSortedValues($"list"))
This way we necessarily preserve the order (with the price of sorting these small lists).
If this is the dictionary of constraints:
dictName:`region`Code;
dictValue:(`NJ`NY;`EEE213);
dict:dictName!dictValue;
I would like to pass the dict to a function and, depending on how many keys there are, let the query react accordingly. If there is one key, region, then I would like the query to be:
select from table where region in dict`region;
The same goes for Code. But if I pass two keys, I would like the query to know that and become:
select from table where region in dict`region,Code in dict`code;
Is there any way to do this?
I came up with this code:
funcForOne:{[constraint]?[`bce;enlist(in;constraint;(`dict;enlist constraint));0b;()]};
funcForAll:{[dict]$[(null dict)~1;select from bce;($[(count key dict)=1;($[`region in (key dict);funcForOne[`region];funcForOne[`Code]]);select from bce where region in dict`region,rxmCode in dict`Code])]};
It works for one and two constraints, but when I call funcForAll[] it gives a type error. How should I change it? I think it comes from null dict~1.
I tried count too, but that doesn't work well either.
Update
So I did this, but I get an error:
tab:([]code:`B90056`B90057`B90058`B90059;region:`CA`NY`NJ`CA);
dictKey:`region`Code;dictValue:(`NJ`NY;`B90057);
dict:dictKey!dictValue;
?[tab;f dict;0b;()];
and I got a 'NY error. Do you know why? Also, if I pass a null dictionary it doesn't seem to work.
As I said, functional form would be the better approach, but if your requirement is as limited as you say, then you can consider another solution like the one below.
Note: Assuming all dictionary keys will be in the table's column list.
q) f:{[dict] if[0=count dict;:select from t];
select from t where (#[key dict;t]) in {$[any 0<=type each value x;flip ;enlist ]x}[dict] }
Explanation:
1. Convert dict to a table, depending on the value types: flip if any value is a general list, else enlist.
$[any 0<=type each value dict;flip ;enlist ]dict
2. Get the subset of table t which consists only of the dictionary keys as columns.
#[key dict;t]
3. Get rows where (2) is in (1).
Basically we are using the below form of querying and matching:
q)t1:([]id:1 2;s:`a`b);
q)t2:([]id:1 3 ;s:`a`b);
q)select from t1 where ([]id;s) in t2
If you're just using in, you can do something like:
f:{{[x;y](in),'key[y],'(),x}[;x]enlist each value[x]}
So that:
q)d
a| 10 1
b| ,`a
q)f d
in `a 10 1
in `b ,`a
q)t
a b c
------
1 a 10
2 b 20
3 c 30
q)?[t;f d;0b;()]
a b c
------
1 a 10
Note that because of the enlist each the resulting list is enlisted so that singletons work too:
q)d:enlist[`a]!enlist 1
q)d
a| 1
q)?[t;f d;0b;()]
a b c
------
1 a 10
Update to secondary question
This still works with empty dict, i.e. ()!(). I'm passing in the dictionary variable.
In your 2nd question your dictionary is not constructed correctly (also remember q is case sensitive). Also, your values need to be enlisted. Look up functional select in the reference pages on the kx site; you'll see that you need to enlist the symbol lists to differentiate them from column name declarations:
`region`code!(enlist `NY`NJ;enlist `B90057)
I have one table with the following structure:
ID|FK1|FK2|FK3|FK4
ID|FK1|FK2|FK3|FK4
ID|FK1|FK2|FK3|FK4
And another table that holds:
FK|DATA
FK|DATA
FK|DATA
FK|DATA
The FKn columns in the first table reference the FK field in the second one. There can be more than one record linked between the first table and the second one.
What I want to achieve is to create another table with the total number of records of every FKn linked. For example:
ID|FK1|FK2|FK3|FK4
A|0 |23 |9 |3
B|4 |0 |2 |0
I know how to transform the row flow and iterate over every FKn field. I also know how to count. What Im not able to do is to group every FKn count from the same ID into one row, because after I use a tLoop component, every count operation is transformed into a new row like:
FK|count
FK|count
FK|count
FK|count
...
Any idea about how to join rows by packing N of them into one single row each time? Or is there another way to do it?
NOTE: I'm using text data as input
If I understood your issue, then I would suggest a different approach, as below
(provided you have a fixed number of FK columns: FK1, FK2, FK3, FK4).
tFileInput --> tMap (left join to lookup tAnotherTable(FK, DATA) on FK1) --> output-1 will have the columns ID, FK1 = (0 if no matching row is found, 1 if a matching row is found), FK2=0, FK3=0, FK4=0. (For the same ID we can get many FK1 rows, since, as you mentioned, there can be more than one matching row.)
Similarly, I will have:
tFileInput --> tMap (left join to lookup tAnotherTable(FK, DATA) on FK2) --> output-2 will have the columns ID, FK1=0, FK2 = (0 if no matching row is found, 1 if a matching row is found), FK3=0, FK4=0.
tFileInput --> tMap (left join to lookup tAnotherTable(FK, DATA) on FK3) --> output-3 will have the columns ID, FK1=0, FK2=0, FK3 = (0 if no matching row is found, 1 if a matching row is found), FK4=0.
.....
...
Next I will union all of output-1, output-2, output-3, output-4 into a final result set (say ResultUnionAll), move it to a tAgg, group by the ID column, and take SUM(FK1), SUM(FK2), SUM(FK3)...
To summarize, your job will look something like below:
tFileInput--->tMap(with lookup)-->tHash1/tFileOutput1
tFileInput--->tMap(with lookup)-->tHash2/tFileOutput2
tFileInput--->tMap(with lookup)-->tHash3/tFileOutput3
tFileInput--->tMap(with lookup)-->tHash4/tFileOutput4
tHashInput1/tFileInput1---
tHashInput2/tFileInput2---
tHashInput3/tFileInput3---
tHashInput4/tFileInput4--- tUnite--->tAgg-->finaloutput.
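To make the union + group-by-ID + SUM step concrete, here is a tiny plain-Scala illustration (not Talend; the sample values are made up): each per-FK lookup branch emits a row with a 1 in "its" FK column when a match is found, and the final aggregation sums those flags per ID.
case class CountRow(id: String, fk1: Int, fk2: Int, fk3: Int, fk4: Int)

val unioned = Seq(
  CountRow("A", 0, 1, 0, 0), CountRow("A", 0, 1, 0, 0),  // two matches found via FK2
  CountRow("A", 0, 0, 1, 0),                             // one match found via FK3
  CountRow("B", 1, 0, 0, 0)                              // one match found via FK1
)

val totals = unioned.groupBy(_.id).map { case (id, rows) =>
  CountRow(id, rows.map(_.fk1).sum, rows.map(_.fk2).sum, rows.map(_.fk3).sum, rows.map(_.fk4).sum)
}
// totals: CountRow(A,0,2,1,0), CountRow(B,1,0,0,0)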