ClickHouse materialized view is not triggered - apache-kafka

I created a table that reads a Kafka topic using the JSONAsString format:
CREATE TABLE tracking_log_kafka_raw
(
    jsonString String
) ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'tracking_log_new',
    kafka_group_name = 'test_1',
    kafka_format = 'JSONAsString';
The final table:
CREATE TABLE k_t_res
(
    jsonString String
) ENGINE = MergeTree()
ORDER BY jsonString
SETTINGS index_granularity = 8192;
And the materialized view:
CREATE MATERIALIZED VIEW test_c TO k_t_res
AS
SELECT *
FROM tracking_log_kafka_raw;
But when I write to Kafka, the messages reach the tracking_log_kafka_raw table, yet the materialized view is not triggered, so nothing gets into the final k_t_res table.
I tried the JSONEachRow format and everything worked, but the message format in Kafka doesn't allow it to be used.

The problem was the ClickHouse version. I initially used 21.9.4.35; after switching to 20.10.6.27 everything worked.
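Independent of the version issue, one general ClickHouse/Kafka pitfall worth checking (a hedged side note, not part of the original fix): selecting directly from the Kafka engine table consumes the messages for that consumer group, so rows read that way never reach the materialized view, and a materialized view only receives messages consumed after it was created. A quick sanity check is therefore to (re)create the view, produce fresh messages, and query only the target table:
-- recreate the view so it is attached before new messages arrive
DROP TABLE IF EXISTS test_c;
CREATE MATERIALIZED VIEW test_c TO k_t_res AS
SELECT jsonString FROM tracking_log_kafka_raw;

-- produce new messages to tracking_log_new, then check the MergeTree target only
SELECT count() FROM k_t_res;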

AWS Glue add new partitions and overwrite existing partitions

I'm attempting to write PySpark code in Glue that lets me update the Glue Catalog by adding new partitions and overwriting existing partitions in the same call.
I read that there is no way to overwrite partitions in Glue, so we must use PySpark code similar to this:
final_df.withColumn('year', date_format('date', 'yyyy'))\
    .withColumn('month', date_format('date', 'MM'))\
    .withColumn('day', date_format('date', 'dd'))\
    .write.mode('overwrite')\
    .format('parquet')\
    .partitionBy('year', 'month', 'day')\
    .save('s3://my_bucket/')
However, with this method the Glue Catalog does not get updated automatically, so an MSCK REPAIR TABLE call is needed after each write. Recently AWS released a new feature, enableUpdateCatalog, where newly created partitions are immediately added to the Glue Catalog. The code looks like this:
additionalOptions = {"enableUpdateCatalog": True}
additionalOptions["partitionKeys"] = ["year", "month", "day"]

dyn_frame_catalog = glueContext.write_dynamic_frame_from_catalog(
    frame=partition_dyf,
    database="my_db",
    table_name="my_table",
    format="parquet",
    additional_options=additionalOptions,
    transformation_ctx="my_ctx"
)
Is there a way to combine these two approaches, or will I need to use the PySpark method with write.mode('overwrite') and run MSCK REPAIR TABLE my_table on every run of the Glue job?
If you have not already found your answer, I believe the following will work:
DataSink5 = glueContext.getSink(
    path="s3://...",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["year", "month", "day"],
    enableUpdateCatalog=True,
    transformation_ctx="DataSink5")
DataSink5.setCatalogInfo(
    catalogDatabase="my_db",
    catalogTableName="my_table")
DataSink5.setFormat("glueparquet")
DataSink5.writeFrame(partition_dyf)
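As a side note, if the requirement is to overwrite only the partitions present in the incoming DataFrame rather than the whole S3 path, Spark's dynamic partition overwrite mode can be combined with the plain write from the question (this is standard Spark behaviour, not part of the Glue sink API, so treat it as a sketch):
# 'spark' is the SparkSession (glueContext.spark_session in a Glue job).
# Only the year/month/day partitions that appear in final_df are replaced;
# all other partitions under s3://my_bucket/ are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

final_df.withColumn('year', date_format('date', 'yyyy'))\
    .withColumn('month', date_format('date', 'MM'))\
    .withColumn('day', date_format('date', 'dd'))\
    .write.mode('overwrite')\
    .format('parquet')\
    .partitionBy('year', 'month', 'day')\
    .save('s3://my_bucket/')
This path still does not update the Glue Catalog by itself, so new partitions still need either the getSink/enableUpdateCatalog write above or an MSCK REPAIR TABLE.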

Flink Table API: GROUP BY in SQL Execution throws org.apache.flink.table.api.TableException

I have this very simplified use case: I want to use Apache Flink (1.11) to read data from a Kafka topic (let's call it source_topic), count an attribute in it (called b) and write the result into another Kafka topic (result_topic).
I have the following code so far:
from pyflink.datastream import StreamExecutionEnvironment, TimeCharacteristic
from pyflink.table import StreamTableEnvironment, EnvironmentSettings


def log_processing():
    env = StreamExecutionEnvironment.get_execution_environment()
    env_settings = EnvironmentSettings.new_instance().use_blink_planner().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(stream_execution_environment=env, environment_settings=env_settings)
    t_env.get_config().get_configuration().set_boolean("python.fn-execution.memory.managed", True)
    t_env.get_config().get_configuration().set_string("pipeline.jars", "file:///opt/flink-1.11.2/lib/flink-sql-connector-kafka_2.11-1.11.2.jar")

    source_ddl = """
        CREATE TABLE source_table(
            a STRING,
            b INT
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'source_topic',
            'properties.bootstrap.servers' = 'node-1:9092',
            'scan.startup.mode' = 'earliest-offset',
            'format' = 'csv',
            'csv.ignore-parse-errors' = 'true'
        )
    """

    sink_ddl = """
        CREATE TABLE result_table(
            b INT,
            result BIGINT
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'result_topic',
            'properties.bootstrap.servers' = 'node-1:9092',
            'format' = 'csv'
        )
    """

    t_env.execute_sql(source_ddl)
    t_env.execute_sql(sink_ddl)
    t_env.execute_sql("INSERT INTO result_table SELECT b, COUNT(b) FROM source_table GROUP BY b")
    t_env.execute("Kafka_Flink_job")


if __name__ == '__main__':
    log_processing()
But when I execute it, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o5.executeSql.
: org.apache.flink.table.api.TableException: Table sink 'default_catalog.default_database.result_table' doesn't support consuming update changes which is produced by node GroupAggregate(groupBy=[b], select=[b, COUNT(b) AS EXPR$1])
I am able to write data into a Kafka topic with a simple SELECT statement. But as soon as I add the GROUP BY clause, the exception above is thrown. I followed Flink's documentation on the use of the Table API with SQL for Python: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/common.html#sql
Any help is highly appreciated, I am very new to Stream Processing and Flink. Thank you!
Using a GROUP BY clause will generate an updating stream, which is not supported by the Kafka connector as of Flink 1.11. On the other hand, when you use a simple SELECT statement without any aggregation, the result stream is append-only (this is why you're able to consume it without issues).
Flink 1.12 is very close to being released, and it includes a new upsert Kafka connector (FLIP-149, if you're curious) that will allow you to do this type of operation also in PyFlink (i.e. the Python Table API).
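For reference, on Flink 1.12+ the sink DDL could look roughly like this (a sketch; the key/value formats are assumptions, adjust them to your actual serialization). Declaring a PRIMARY KEY lets the upsert-kafka connector consume the updating stream produced by the GROUP BY:
CREATE TABLE result_table (
    b INT,
    result BIGINT,
    PRIMARY KEY (b) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',
    'topic' = 'result_topic',
    'properties.bootstrap.servers' = 'node-1:9092',
    'key.format' = 'csv',
    'value.format' = 'csv'
)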

How to persist streaming data to disk in DolphinDB?

By default, the streaming table keeps all streaming data in memory. How can I persist streaming data to disk in DolphinDB? For example, I have a stream table like the following:
n=20000000
colNames = `time`sym`qty`price
colTypes = [TIME,SYMBOL,INT,DOUBLE]
t=streamTable(n:0, colNames, colTypes)
share t as trades_stream
You can call enableTablePersistence or enableTableShareAndPersistence to persist data to disk. An example is as follows:
n=20000000
colNames = `time`sym`qty`price
colTypes = [TIME,SYMBOL,INT,DOUBLE]
t=streamTable(n:0, colNames, colTypes)
enableTableShareAndPersistence(t,`trades_stream,true,true,1200000)
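Note that, to the best of my knowledge, persistence also requires the persistenceDir configuration parameter to be set in the node's configuration file before these functions can write to disk (please check the manual for your version), for example:
persistenceDir=/home/dolphindb/persistence
The path above is only an illustration; point it at the disk location where the stream data should be stored.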

How to consume all valid data in one kafka topic

In our project, we expect all the data that exists in a specific topic to be consumed by the Kafka engine, but we tried two ways and neither of them works.
First, we tried to pass the keyword auto_offset_reset when creating the Kafka engine table, as shown below. No error is returned when the table is created, but only incremental data in the topic is consumed this way.
CREATE TABLE xx.yyy (
    `shop_id` String,
    `last_updated_at` String
) ENGINE = Kafka('XXX', 'shop_price_center.t_sku_shop_price', 'xxx', 'JSONEachRow', '', '', 1, 0, 0, 20000,
                 auto_offset_reset='earliest')
Second, we changed the configuration in the XML file, but it still does not work:
<kafka>
    <debug>cgrp</debug>
    <auto_offset_reset>earliest</auto_offset_reset>
</kafka>
Can any guru show me a complete example of the solution? Much appreciated.
Data can be read from Kafka just once (after each read the consumer's offset moves forward, so repeatedly reading the same data is not possible), so you need a materialized view that listens to the Kafka topic and puts the data into an ordinary table:
/* ordinary table */
CREATE TABLE xx.yyy (
    shop_id String,
    last_updated_at String
)
ENGINE = MergeTree()
PARTITION BY tuple()                  /* just demo settings */
ORDER BY (last_updated_at, shop_id);  /* just demo settings */

/* Kafka 'queue' */
CREATE TABLE xx.yyy_queue (
    shop_id String,
    last_updated_at String
)
ENGINE = Kafka SETTINGS
    kafka_broker_list = '..',
    kafka_topic_list = 'topic',
    kafka_group_name = '..',
    kafka_format = 'JSONEachRow',
    kafka_row_delimiter = '\n',
    kafka_skip_broken_messages = 1,
    kafka_num_consumers = 1,
    kafka_max_block_size = 1000;

/* materialized view: it transfers data from the topic to the ordinary table */
CREATE MATERIALIZED VIEW xx.yyy_consumer TO xx.yyy AS
SELECT
    shop_id,
    last_updated_at
FROM xx.yyy_queue;
Kafka-specific settings such as auto_offset_reset should be defined in an XML file located in the /etc/clickhouse-server/config.d directory, not in the table definition:
<?xml version="1.0"?>
<yandex>
    <kafka>
        <auto_offset_reset>earliest</auto_offset_reset>
    </kafka>
</yandex>
Select data only from the ordinary table xx.yyy, not from xx.yyy_queue:
SELECT *
FROM xx.yyy;
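One more detail, based on standard Kafka consumer-group semantics rather than anything ClickHouse-specific: auto_offset_reset = earliest only takes effect when the consumer group has no committed offsets. If the group named in kafka_group_name has already consumed part of the topic, either reset its offsets with the Kafka tooling (e.g. kafka-consumer-groups.sh) or simply point the queue table at a fresh group (the group name below is just an example) so the whole topic is re-read:
CREATE TABLE xx.yyy_queue (
    shop_id String,
    last_updated_at String
)
ENGINE = Kafka SETTINGS
    kafka_broker_list = '..',
    kafka_topic_list = 'topic',
    kafka_group_name = 'new_group_for_full_reload', /* fresh group, so 'earliest' applies */
    kafka_format = 'JSONEachRow';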

Flink: join file with kafka stream

I have a problem I can't really figure out.
I have a Kafka stream that contains some data like this:
{"adId":"9001", "eventAction":"start", "eventType":"track", "eventValue":"", "timestamp":"1498118549550"}
And I want to replace 'adId' with another value, 'bookingId'.
This value is located in a CSV file, but I can't really figure out how to get it working.
Here is my mapping CSV file:
9001;8
9002;10
So my output would ideally be something like:
{"bookingId":"8", "eventAction":"start", "eventType":"track", "eventValue":"", "timestamp":"1498118549550"}
This file is refreshed at least once every hour, so the job should pick up changes to it.
I currently have this code, which doesn't work for me:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(30000); // create a checkpoint every 30 seconds
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

DataStream<String> adToBookingMapping = env.readTextFile(parameters.get("adToBookingMapping"));
DataStream<Tuple2<Integer, Integer>> input = adToBookingMapping.flatMap(new Tokenizer());

// Kafka consumer
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", parameters.get("bootstrap.servers"));
properties.setProperty("group.id", parameters.get("group.id"));

FlinkKafkaConsumer010<ObjectNode> consumer = new FlinkKafkaConsumer010<>(parameters.get("inbound_topic"), new JSONDeserializationSchema(), properties);
consumer.setStartFromGroupOffsets();
consumer.setCommitOffsetsOnCheckpoints(true);

DataStream<ObjectNode> logs = env.addSource(consumer);
DataStream<Tuple4<Integer, String, Integer, Float>> parsed = logs.flatMap(new Parser());

// output -> bookingId, action, impressions, sum
DataStream<Tuple4<Integer, String, Integer, Float>> joined = runWindowJoin(parsed, input, 3);

public static DataStream<Tuple4<Integer, String, Integer, Float>> runWindowJoin(
        DataStream<Tuple4<Integer, String, Integer, Float>> parsed,
        DataStream<Tuple2<Integer, Integer>> input, long windowSize) {
    return parsed.join(input)
            .where(new ParsedKey())
            .equalTo(new InputKey())
            .window(TumblingProcessingTimeWindows.of(Time.of(windowSize, TimeUnit.SECONDS)))
            //.window(TumblingEventTimeWindows.of(Time.milliseconds(30000)))
            .apply(new JoinFunction<Tuple4<Integer, String, Integer, Float>, Tuple2<Integer, Integer>, Tuple4<Integer, String, Integer, Float>>() {
                private static final long serialVersionUID = 4874139139788915879L;

                @Override
                public Tuple4<Integer, String, Integer, Float> join(
                        Tuple4<Integer, String, Integer, Float> first,
                        Tuple2<Integer, Integer> second) {
                    return new Tuple4<Integer, String, Integer, Float>(second.f1, first.f1, first.f2, first.f3);
                }
            });
}
The code only runs once and then stops, so it doesn't convert new entries in Kafka using the CSV file. Any ideas on how I could process the stream from Kafka with the latest values from my CSV file?
Kind regards,
darkownage
Your goal appears to be to join streaming data with a slow-changing catalog (i.e. a side input). I don't think the join operation is useful here because it doesn't store the catalog entries across windows. Also, the text file is a bounded input whose lines are read only once.
Consider using connect to create a connected stream, and store the catalog data as managed state that you can perform lookups into. The operator's parallelism would need to be 1, so that a single instance holds the complete mapping.
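A minimal sketch of that pattern, reusing the stream names from the question (for brevity the mapping lives in a plain in-memory map, so it is lost on failure; a production job would back it with operator state via CheckpointedFunction or use broadcast state, and readTextFile still reads the CSV only once, so hourly refreshes need a source that re-reads the file):
DataStream<ObjectNode> enriched = input          // Tuple2<adId, bookingId> from the CSV
        .connect(logs)                           // ObjectNode events from Kafka
        .flatMap(new RichCoFlatMapFunction<Tuple2<Integer, Integer>, ObjectNode, ObjectNode>() {

            // adId -> bookingId lookup table held by this single operator instance
            private transient Map<Integer, Integer> adToBooking;

            @Override
            public void open(Configuration parameters) {
                adToBooking = new HashMap<>();
            }

            @Override
            public void flatMap1(Tuple2<Integer, Integer> mapping, Collector<ObjectNode> out) {
                // catalog side: just update the lookup table, emit nothing
                adToBooking.put(mapping.f0, mapping.f1);
            }

            @Override
            public void flatMap2(ObjectNode event, Collector<ObjectNode> out) {
                // event side: swap adId for bookingId when a mapping is known
                Integer adId = Integer.valueOf(event.get("adId").asText());
                Integer bookingId = adToBooking.get(adId);
                if (bookingId != null) {
                    event.remove("adId");
                    event.put("bookingId", bookingId.toString());
                    out.collect(event);
                }
            }
        })
        .setParallelism(1); // a single instance must hold the whole mapping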
You may find a better solution by researching 'side inputs', looking at the solutions that people use today. See FLIP-17 and Dean Wampler's talk at Flink Forward.