Clickhouse not consuming Kafka messages via complex Materialized View - apache-kafka

TL;DR summary: ClickHouse Kafka engine; the materialized view won't work with a complex SELECT statement.
Longer Version:
I am trying to send a large number of JSON data points to ClickHouse via its Kafka engine using JSONEachRow, but the materialized view won't consume the stream correctly.
I have a Kafka producer written in Go which takes data from multiple TCP streams and asynchronously writes to the Kafka queue.
Data flows thus:
TCP Sources -> Producer -> Kafka -> ClickHouse (Kafka Engine) -> Materialized View -> Destination Table
All this works, so far so good.
I first hit a bottleneck when I ramped up the speed of the input data (400,000 points/sec): my producer was not able to write to Kafka fast enough and the connections piled up. So I hoped to batch the data, but it seems ClickHouse cannot take an array of JSON objects as input (https://clickhouse.yandex/docs/en/interfaces/formats/).
So I hit on the idea of batching the data points at their source and transforming the messages in the materialized view. Where before I had lots of individual messages:
{ "t": 1547457441651445401,"i": "device_2","c": 20001,"v": 56454654}" }
I now have a message which contains multiples of the above, stringified, with newline delimiters between the points:
{"realtimes":"{\"t\":1547458266855015791,\"i\":\"device_2\",\"c\":20001,\"v\":56454654}\n{\"t\":1547458266855015791,\"i\":\"device_2\",\"c\":20001,\"v\":56454654}"}
The intention here is to parse and transform the string into multiple values using visitParamExtract in the SELECT statement of the materialized view.
Materialized View:
CREATE MATERIALIZED VIEW ltdb_mat_view TO default.ltdb AS
SELECT
    visitParamExtractInt(x, 't') AS timestamp,
    visitParamExtractString(x, 'i') AS device_id,
    visitParamExtractInt(x, 'v') AS value
FROM
(
    SELECT arrayJoin(*) AS x
    FROM
    (
        SELECT splitByChar('\n', realtimes)
        FROM kafka_stream_realtimes
    )
)
It seems to be doing something: while the view is running, kafka_stream_realtimes gets cleared, and querying it manually fails with "DB::Exception: Failed to claim consumer: .", but the data never hits the final table.
Summary:
- The data reaches ClickHouse; it just disappears and never seems to arrive at the final table.
- If I drop the materialized view, I can see data build up in kafka_stream_realtimes.
- If I run the materialized view query as an INSERT INTO statement followed by the SELECT, it will take data from the stream to the final table.
- I realize I may just be pushing the bottleneck down into ClickHouse and this may never work, but I want to take this through for completeness.
For Completeness:
kafka_stream_realtimes:
CREATE TABLE IF NOT EXISTS kafka_stream_realtimes(realtimes String)
ENGINE = Kafka('kafka:9092', 'realtimes', 'groupTest', 'JSONEachRow');
ltdb:
CREATE TABLE default.ltdb (timestamp Int64,device_id String,value Int64) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(toDateTime(round(timestamp/1000000000)))
ORDER BY (device_id, value)
SETTINGS index_granularity=8192;

"but it seems Clickhouse cannot take an array of json as input"
It seems the motivation is to do a batch commit on the producer side. Why not just group a bunch of JSON rows and commit them in one go? ClickHouse will receive those multi-row messages and parse them for you. You may also need to provide the kafka_row_delimiter setting to the Kafka engine, as most Kafka producers don't append a row delimiter at the end of each message.
So one message becomes
{ "t": 1547457441651445401,"i": "device_2","c": 20001,"v": 56454654}
{ "t": 1547457441651445402,"i": "device_2","c": 20001,"v": 56454654}
{ "t": 1547457441651445403,"i": "device_2","c": 20001,"v": 56454654}
...
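To make the suggestion concrete, here is a minimal Java sketch of producer-side batching (the asker's producer is written in Go; the broker address and topic name are taken from the question, and the two sample rows are made up). It simply joins several JSONEachRow lines with '\n' and sends them as a single Kafka message:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            List<String> rows = List.of(
                "{\"t\":1547457441651445401,\"i\":\"device_2\",\"c\":20001,\"v\":56454654}",
                "{\"t\":1547457441651445402,\"i\":\"device_2\",\"c\":20001,\"v\":56454654}");
            // One Kafka message containing several JSONEachRow lines, newline-delimited.
            String batch = String.join("\n", rows);
            producer.send(new ProducerRecord<>("realtimes", batch));
        }
    }
}

With messages shaped like this, the Kafka engine table can keep using JSONEachRow directly, and the splitByChar/arrayJoin materialized view should no longer be needed.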

Related

How do you get the latest offset from a remote query to a Table in ksqlDB?

I have an architecture where I would like to query a ksqlDB table from a Kafka stream A (created by ksqlDB). On startup, Service A will load all the data from this table into a hashmap, and afterward it will start consuming from Kafka Stream A and act on any events to update this hashmap. I want to avoid any race condition in which I would miss events that were propagated to Kafka Stream A between the time I queried the table and the time I started consuming from Kafka Stream A. Is there a way to retrieve the latest offset that my query to the table is populated by, so that I can use that offset to start consuming from Kafka Stream A?
Another thing to mention is that we have hundreds of instances of our app going up and down, so reading directly off the Kafka stream is not an option. Reading an entire stream's worth of data every time our apps come up is not a scalable solution. Reading the event stream's data into a hashmap on the service is a hard requirement. This is why the ksqlDB table seems like a good option, since we can get the latest state of the data in the format needed and then just update it based on events from the stream. Kafka Stream A is essentially a CDC stream off of a MySQL table that has been enriched with other data.
You used "materialized view" but I'm going to pretend I
heard "table". I have often used materialized views
in a historical reporting context, but not with live updates.
I assume that yours will behave similarly to a "table".
I assume that all events, and DB rows, have timestamps.
Hopefully they are "mostly monotonic", so applying a
small safety window lets us efficiently process just
the relevant recent ones.
The crux of the matter is racing updates.
We need to prohibit races.
Each time an instance of a writer, such as your app,
comes up, assign it a new name.
Rolling a guid is often the most convenient way to do that,
or perhaps prepend it with a timestamp if sort order matters.
Ensure that each DB row mentions that "owning" name.
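For example, a tiny Java sketch of generating such a name, prefixed with a timestamp so the names sort in startup order (the exact format is just an illustration):

import java.util.UUID;

public class WriterName {
    // e.g. "1616258354000-3f1c2e7a-..."; the timestamp prefix makes names sort by startup time.
    public static String next() {
        return System.currentTimeMillis() + "-" + UUID.randomUUID();
    }
}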
"...want to avoid any race condition in which I would miss any events that were propagated to Kafka Stream A in the time between I queried the materialized view, and when I started consuming off Kafka Stream A."
We will need a guaranteed monotonic column with an integer ID
or a timestamp. Let's call it ts.
Query m = max(ts).
Do a big query of records < m, slowly filling your hashmap.
Start consuming Stream A.
Do a small query of records >= m, updating the hashmap.
Continue to loop through subsequently arriving Stream A records.
Now you're caught up, and can maintain the hashmap in sync with DB.
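A minimal Java sketch of that sequencing follows; the query* and consume* methods are hypothetical placeholders for your actual ksqlDB/table access and Kafka consumption, and only the ordering of the calls matters here:

import java.util.HashMap;
import java.util.Map;

public class CatchUpLoader {
    private final Map<String, String> state = new HashMap<>();

    public void start() {
        long m = queryMaxTs();                    // 1. m = max(ts)
        state.putAll(queryRowsBefore(m));         // 2. big query of records with ts < m
        startConsumingStreamA();                  // 3. begin consuming Stream A
        state.putAll(queryRowsFrom(m));           // 4. small query of records with ts >= m
        // 5. from here on, every arriving Stream A record is applied to `state`
    }

    // Hypothetical helpers -- replace with real table queries and a real consumer loop.
    private long queryMaxTs() { return 0L; }
    private Map<String, String> queryRowsBefore(long ts) { return Map.of(); }
    private Map<String, String> queryRowsFrom(long ts) { return Map.of(); }
    private void startConsumingStreamA() { }
}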
Your business logic probably requires that you
treat DB rows mentioning the "self" guid
in a different way from rows that existed
prior to startup.
Think of it as de-dup, or ignoring replayed rows.
You may find offsetsForTimes() useful.
There's also listOffsets().
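For instance, here is a minimal sketch of seeking a consumer with offsetsForTimes(); the broker address, group id, topic, and partition are placeholders, and the snapshot timestamp would be recorded just before you query the table:

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekByTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "stream-a-loader");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        long snapshotMillis = System.currentTimeMillis(); // recorded just before the table query

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("stream-a", 0);
            consumer.assign(List.of(tp));

            // For each partition, offsetsForTimes returns the earliest offset whose
            // timestamp is >= the requested timestamp (or null if there is none yet).
            Map<TopicPartition, OffsetAndTimestamp> offsets =
                consumer.offsetsForTimes(Map.of(tp, snapshotMillis));
            OffsetAndTimestamp oat = offsets.get(tp);
            consumer.seek(tp, oat != null ? oat.offset() : 0L);

            consumer.poll(Duration.ofMillis(500)).forEach(record ->
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value()));
        }
    }
}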

Is it possible to detect and drop duplicate data using ksql

I have a simple question: can we detect and drop duplicates in streaming data on a Kafka topic using KSQL?
By default, tables are de-duplicated on keys: a new record for the same key will overwrite the old event. If you need to "detect" and "process" the old data as new events come in, then KSQL cannot do this.
If you need distinct values rather than distinct keys, you can create a table against some stream of events and filter on HAVING COUNT(field) = 1 over a time window, which is the best you can do there. Ref: https://kafka-tutorials.confluent.io/finding-distinct-events/ksql.html
If you need indefinite time windows to ensure you only process a certain field once, then you'll want to use an external database, and optionally an internal cache, to perform lookups against. This would need to be done with a regular consumer, or Kafka Streams.
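As a rough illustration of the regular-consumer route, this sketch keeps a set of already-seen values and drops repeats. The in-memory HashSet stands in for whatever external store (Redis, a database table) you would actually use, and the broker address, topic name, and the choice of the record key as the dedup field are assumptions:

import java.time.Duration;
import java.util.HashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DedupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "dedup-consumer");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        Set<String> seen = new HashSet<>(); // stand-in for an external lookup store

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("input-topic"));
            while (true) {
                consumer.poll(Duration.ofSeconds(1)).forEach(record -> {
                    String dedupField = record.key(); // or a field parsed from the value
                    if (seen.add(dedupField)) {
                        System.out.println("processing " + record.value()); // first occurrence
                    }
                    // else: already seen within this store's lifetime, drop it
                });
            }
        }
    }
}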

Druid how to drop duplicates in Kafka indexing service

I am using Druid with the Kafka indexing service. I am trying to understand how it handles duplicate messages.
Example
Consider that I have the following message in a Kafka topic [1 partition only]:
[Offset=100]
{
"ID":4,
"POINTS":1005,
"CREATED_AT":1616258354000000,
"UPDATED_AT":1616304119000000
}
Now consider that, after 24 hours, somehow the same message is pushed to the topic again.
[Offset=101]
{
"ID":4,
"POINTS":1005,
"CREATED_AT":1616258354000000,
"UPDATED_AT":1616304119000000
}
Note: the payload has not changed.
Actual: In Druid, I now see the same message again.
Expected: Since the payload has not changed, the message should be ignored.
My timestamp column is CREATED_AT.
Can you be sure that there will never be two unique events with the same timestamp other than duplicates? If so, you can try using rollup to eliminate the duplicates.
You can set that in the granularitySpec, and the queryGranularity will basically truncate all timestamps based on that granularity; if ALL dimensions are identical, the rows get combined using the aggregation functions you set in the spec.
For the aggregation functions, you will want to use something like MAX or MIN, because SUM will add them up.
This will fail if you have multiple Kafka partitions, but it could be fixed with reindexing.

Initial load of Kafka stream data with windowed join

I am using a Windowed Join between two streams, let's say a 7 day window.
On the initial load, all records in the DB (via a Kafka Connect source connector) are loaded into the streams. It seems that ALL records then end up in the window state store for those first 7 days, because the producer/ingestion timestamps are all in current time rather than a field (like create_time) that might be in the message value.
Is there a recommended way to balance the initial load against the windows of the join?
Well, the question is what records you want to join to each other, and what timestamp the source connector sets as the record timestamp (this might also depend on the topic configuration, [log.]message.timestamp.type).
The join is executed based on whatever the TimestampExtractor returns. By default, that is the record timestamp. If you want to base the join on some other timestamp, a custom TimestampExtractor is the way to go.
If you want to get processing time semantics, you may want to use the WallclockTimestampExtractor though.
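For example, here is a minimal sketch of a custom TimestampExtractor that pulls a create_time field out of the message value; the Map-shaped value and the field name are assumptions, so adapt it to your actual deserialized record:

import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class CreateTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Object value = record.value();
        if (value instanceof Map) {
            Object createTime = ((Map<?, ?>) value).get("create_time"); // hypothetical field name
            if (createTime instanceof Number) {
                return ((Number) createTime).longValue();
            }
        }
        return partitionTime; // fall back to the partition's previous highest timestamp
    }
}

It can be registered globally via default.timestamp.extractor (StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG) or per source with Consumed.with(...).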

Data streamed from Kafka to Postgres and missing seconds later

I am trying to save data from a local Kafka instance to local Postgres with Spark Streaming. I have configured all connections and parameters, and data actually gets to the database. However, it is only there for a couple of seconds. After that, the table simply becomes empty. If I stop the app as soon as there is some data in Postgres, the data persists, so I suppose I have missed some parameter for streaming in Spark or something in the Kafka configuration files. The code is in Java, not Scala, so there is a Dataset instead of a DataFrame.
I tried setting spark.driver.allowMultipleContexts to true, but this has nothing to do with contexts. When I run a count on the database while the complete data set is streaming in the background, there are always about 1,700 records, which suggests there might be some parameter for batch size.
censusRecordJavaDStream.map(e -> {
    Row row = RowFactory.create(e.getAllValues());
    return row;
}).foreachRDD(rdd -> {
    Dataset<Row> censusDataSet = spark.createDataFrame(rdd, CensusRecord.getStructType());
    censusDataSet
        .write()
        // SaveMode.Overwrite replaces any existing data in the target table on each write,
        // i.e. every micro-batch rewrites census.census from scratch.
        .mode(SaveMode.Overwrite)
        .jdbc("jdbc:postgresql:postgres", "census.census", connectionProperties);
});
My goal is to stream data from Kafka and save it to Postgres. Each record has a unique ID, which is used as the key in Kafka, so there should be no conflicts regarding the primary key or double entries. For current testing purposes, I am using a small subset of about 100 records; the complete dataset is over 300 MB.