When I use PARTITION BY in KSQL, the field used in PARTITION BY goes missing from the value

I have a topic test_partition_key_stream whose records look like this:
key: null, value:
{ "id": 1, "age": 18, "name": "lisa" }
Then I did this:
CREATE STREAM TEST_STREAM_JSON (id INT ,age INT ,name VARCHAR) WITH (KAFKA_TOPIC = 'test_partition_key_stream', VALUE_FORMAT = 'JSON');
CREATE STREAM TEST_STREAM_AVRO WITH (PARTITIONS=3, VALUE_FORMAT='AVRO') AS SELECT * FROM TEST_STREAM_JSON PARTITION BY ID;
But when I use PARTITION BY, the ID field goes missing from the value side of the topic.
The value schema generated for the new topic is:
{ "fields": [ { "default": null, "name": "AGE", "type": [ "null", "int" ] }, { "default": null, "name": "NAME", "type": [ "null", "string" ] } ], "name": "KsqlDataSourceSchema", "namespace": "io.confluent.ksql.avro_schemas", "type": "record" }
I want the new topic to be partitioned by ID, but I don't want to lose ID from the value.

Resolved.
The PARTITION BY clause moves the columns into the key. If you want them in the value also, you must copy them by using the AS_VALUE function.
See the docs: https://docs.ksqldb.io/en/latest/developer-guide/joins/partition-data/
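For example, a sketch based on the statements above (ID_VALUE is just an illustrative alias for the copied column):
CREATE STREAM TEST_STREAM_AVRO WITH (PARTITIONS = 3, VALUE_FORMAT = 'AVRO') AS
  SELECT
    ID,                        -- becomes the record key via PARTITION BY
    AS_VALUE(ID) AS ID_VALUE,  -- keeps a copy of ID in the value
    AGE,
    NAME
  FROM TEST_STREAM_JSON
  PARTITION BY ID;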

Related

ksqldb keeps saying - VALUE_FORMAT should support schema inference when VALUE_SCHEMA_ID is provided. Current format is JSON

I'm trying to create a stream in ksqlDB over a topic in Kafka using an Avro schema.
The command looks like this:
CREATE STREAM customer_stream WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON', VALUE_SCHEMA_ID=1);
The topic customers looks like this (output of print 'customers';):
Key format: ¯_(ツ)_/¯ - no data processed
Value format: JSON or KAFKA_STRING
rowtime: 2022/09/29 12:34:53.440 Z, key: , value: {"Name":"John Smith","PhoneNumbers":["212 555-1111","212 555-2222"],"Remote":false,"Height":"62.4","FicoScore":" > 640"}, partition: 0
rowtime: 2022/09/29 12:34:53.440 Z, key: , value: {"Name":"Jane Smith","PhoneNumbers":["269 xxx-1111","269 xxx-2222"],"Remote":false,"Height":"69.9","FicoScore":" > 690"}, partition: 0
To this topic an avro schema has been added.
{
  "type": "record",
  "name": "Customer",
  "namespace": "com.acme.avro",
  "fields": [
    { "name": "ficoScore", "type": ["null", "string"], "default": null },
    { "name": "height", "type": ["null", "double"], "default": null },
    { "name": "name", "type": ["null", "string"], "default": null },
    {
      "name": "phoneNumbers",
      "type": ["null", { "type": "array", "items": ["null", "string"] }],
      "default": null
    },
    { "name": "remote", "type": ["null", "boolean"], "default": null }
  ]
}
When I run the command below, I get this reply:
CREATE STREAM customer_stream WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON', VALUE_SCHEMA_ID=1);
VALUE_FORMAT should support schema inference when VALUE_SCHEMA_ID is provided. Current format is JSON.
Any suggestion?
JSON doesn't use schema IDs. JSON_SR format does, but if you want Avro, then you need to use AVRO as the format.
You dont "add schemas" to topics. You can only register them in the registry.
Example of converting JSON to Avro with kSQL:
CREATE STREAM sensor_events_json (sensor_id VARCHAR, temperature INTEGER, ...)
WITH (KAFKA_TOPIC='events-topic', VALUE_FORMAT='JSON');
CREATE STREAM sensor_events_avro WITH (VALUE_FORMAT='AVRO') AS SELECT * FROM sensor_events_json;
Notice that you don't need to refer to any ID, as the serializer will auto-register the necessary schema.
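Applied to the topic in the question, a sketch could look like this (the column names and types are inferred from the sample records printed above, where Height and FicoScore arrive as strings, so treat the declarations as assumptions rather than the registered schema):
CREATE STREAM customers_json (
  Name VARCHAR,
  PhoneNumbers ARRAY<VARCHAR>,
  Remote BOOLEAN,
  Height VARCHAR,
  FicoScore VARCHAR
) WITH (KAFKA_TOPIC = 'customers', VALUE_FORMAT = 'JSON');

CREATE STREAM customers_avro WITH (VALUE_FORMAT = 'AVRO') AS
  SELECT * FROM customers_json;
The Avro serializer then auto-registers a value schema for the new stream's topic, so VALUE_SCHEMA_ID isn't needed at all.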

How to implement a KStream-KTable leftJoin, and how to get the join field from the Envelope object?

I have one topic (consumed as a KStream) with this schema:
{
  "type": "record",
  "name": "Value",
  "namespace": "test1",
  "fields": [
    {
      "name": "id",
      "type": { "type": "long", "connect.default": 0 },
      "default": 0
    },
    {
      "name": "createdAt",
      "type": [
        "null",
        {
          "type": "string",
          "connect.version": 1,
          "connect.name": "io.debezium.time.ZonedTimestamp"
        }
      ],
      "default": null
    }
  ],
  "connect.name": "test1.Value"
}
Schema for the other topic (consumed as a KTable):
{
  "type": "record",
  "name": "Envelope",
  "namespace": "test2",
  "fields": [
    {
      "name": "before",
      "type": [
        "null",
        {
          "type": "record",
          "name": "Value",
          "fields": [
            {
              "name": "id",
              "type": { "type": "long", "connect.default": 0 },
              "default": 0
            },
            {
              "name": "createdAt",
              "type": [
                "null",
                {
                  "type": "string",
                  "connect.version": 1,
                  "connect.name": "io.debezium.time.ZonedTimestamp"
                }
              ],
              "default": null
            }
          ],
          "connect.name": "test2.Value"
        }
      ],
      "default": null
    },
    {
      "name": "after",
      "type": ["null", "Value"],
      "default": null
    }
  ],
  "connect.name": "test2.Envelope"
}
I want to implement a join between these two topics as a KStream and a KTable.
How can I join on the test1 topic's id and the test2 topic's id (which is inside the after object)? How do I extract the id from the after object of the Envelope schema to implement the join?
Left Join (KStream, KTable) → KStream
It performs a LEFT JOIN of a stream with the table, effectively doing a table lookup.
The input data for both sides must be co-partitioned.
It causes data re-partitioning of the stream if and only if the stream was marked for re-partitioning
KStream<String, Long> left = ...;
KTable<String, Double> right = ...;
KStream<String, String> joined = left.leftJoin(right,
    (leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue, /* ValueJoiner */
    Joined.keySerde(Serdes.String()) /* key */
        .withValueSerde(Serdes.Long()) /* left value */
);
Detailed behaviour
The join is key-based, i.e. with the join predicate leftRecord.key == rightRecord.key.
The join will be triggered under the conditions listed below whenever new input is received. When it is triggered, the user-supplied ValueJoiner will be called to produce join output records.
Only input records for the left side (stream) trigger the join. Input records for the right side (table) update only the internal right-side join state.
Input records for the stream with a null key or a null value are ignored and do not trigger the join.
Input records for the table with a null value are interpreted as tombstones for the corresponding key, which indicate the deletion of the key from the table. Tombstones do not trigger the join.
For each input record on the left side that does not have any match on the right side, the ValueJoiner will be called with ValueJoiner#apply(leftRecord.value, null).
A very detailed, low-level walkthrough is here: https://developer.confluent.io/learn-kafka/kafka-streams/joins/
Also refer to section 2.7 in https://mydeveloperplanet.com/2019/10/30/kafka-streams-joins-explored/
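To the "how do I extract the id from the after object" part of the question: one option is to re-key the Envelope stream by after.id before materializing it as a KTable, then do the key-based join described above. The sketch below is only illustrative; it assumes Avro-generated classes test1.Value and test2.Envelope (with getAfter() returning the nested test2.Value), Confluent's SpecificAvroSerde already configured with your Schema Registry URL, and topic names test1 and test2, none of which come from the question.
// Assumes the Kafka Streams DSL (org.apache.kafka.streams.*) and
// io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde are on the classpath.
SpecificAvroSerde<test1.Value> leftSerde = new SpecificAvroSerde<>();        // configure() with schema.registry.url omitted
SpecificAvroSerde<test2.Envelope> envelopeSerde = new SpecificAvroSerde<>();
StreamsBuilder builder = new StreamsBuilder();

// Stream side: re-key test1 records by their id field.
KStream<Long, test1.Value> leftStream = builder
    .stream("test1", Consumed.with(Serdes.String(), leftSerde))
    .selectKey((k, v) -> v.getId());

// Table side: pull the id out of Envelope.after, re-key, and materialize as a KTable.
KTable<Long, test2.Envelope> rightTable = builder
    .stream("test2", Consumed.with(Serdes.String(), envelopeSerde))
    .filter((k, env) -> env.getAfter() != null)      // skip delete records in this sketch
    .selectKey((k, env) -> env.getAfter().getId())   // the id inside the "after" record
    .toTable(Materialized.with(Serdes.Long(), envelopeSerde));

// Key-based left join: leftRecord.key == rightRecord.key (both are the id).
KStream<Long, String> joined = leftStream.leftJoin(
    rightTable,
    (leftValue, envelope) -> "left=" + leftValue
        + ", right=" + (envelope == null ? null : envelope.getAfter()),
    Joined.with(Serdes.Long(), leftSerde, envelopeSerde));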

PutDatabaseRecord NumberFormatException: For input string 'yyyy-MM-dd h:m:s.S'

I have two similar PutDatabaseRecord processors which write to two tables in a Postgres DB: src.sales and src.task_data.
One of them writes the data successfully, but the other fails with this error:
2021-07-19 01:56:24,316 ERROR [Timer-Driven Process Thread-2] o.a.n.p.standard.PutDatabaseRecord PutDatabaseRecord[id=325364c0-0064-3127-92f0-6c1b83b076aa] Failed to put Records to database for StandardFlowFileRecord[uuid=a2265064-29fc-4472-832d-b3f8edf5f826,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1626647841973-1, container=default, section=1], offset=11792, length=1995],offset=0,name=3f1d0a15-dc53-4177-917b-e07d55ce6437,size=1995]. Routing to failure.: java.lang.NumberFormatException: For input string: "2021-06-30 00:00:00.0"
java.lang.NumberFormatException: For input string: "2021-06-30 00:00:00.0"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at org.apache.nifi.serialization.record.util.DataTypeUtils.toDouble(DataTypeUtils.java:1402)
at org.apache.nifi.serialization.record.util.DataTypeUtils.convertType(DataTypeUtils.java:196)
at org.apache.nifi.serialization.record.util.DataTypeUtils.convertType(DataTypeUtils.java:153)
at org.apache.nifi.serialization.record.util.DataTypeUtils.convertType(DataTypeUtils.java:149)
at org.apache.nifi.processors.standard.PutDatabaseRecord.executeDML(PutDatabaseRecord.java:709)
at org.apache.nifi.processors.standard.PutDatabaseRecord.putToDatabase(PutDatabaseRecord.java:841)
at org.apache.nifi.processors.standard.PutDatabaseRecord.onTrigger(PutDatabaseRecord.java:487)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1173)
at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:214)
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2021-07-19 01:56:24,475 INFO [Flow Service Tasks Thread-2] o.a.nifi.controller.StandardFlowService Saved flow controller org.apache.nifi.controller.FlowController#4f899a03 // Another save pending = false
To be clear, the configuration is the same for both, except for the table names, fields, etc.
This is the configuration of the processor:
This is the configuration of the reader:
This is the schema for src.sales, which works:
{
  "name": "sales",
  "type": "record",
  "namespace": "maxi",
  "fields": [
    { "name": "nmcl_id", "type": "int" },
    { "name": "assort_id", "type": "int" },
    { "name": "rtt_id", "type": "int" },
    { "name": "rep_date", "type": "string" },
    { "name": "out_items", "type": "float" }
  ]
}
This is the schema for src.task_data, which doesn't work:
{
  "name": "task_data",
  "type": "record",
  "namespace": "maxi",
  "fields": [
    { "name": "doc_id", "type": "int" },
    { "name": "line_id", "type": "int" },
    { "name": "nmcl_id", "type": "int" },
    { "name": "assortment_id", "type": "int" },
    { "name": "org_id", "type": "int" },
    { "name": "items_qnt", "type": "float" },
    { "name": "start_date", "type": "string" },
    { "name": "end_date", "type": "string" }
  ]
}
Sales table:
CREATE TABLE src.sales (
  nmcl_id int NOT NULL,
  assort_id int NOT NULL,
  rtt_id int NOT NULL,
  rep_date timestamp NOT NULL,
  out_items float NOT NULL
);
ALTER TABLE ONLY src.sales
ADD CONSTRAINT pk_sales PRIMARY KEY (nmcl_id, assort_id, rtt_id, rep_date);
Tasks table:
CREATE TABLE src.task_data (
  doc_id int NOT NULL,
  line_id int NOT NULL,
  nmcl_id int NOT NULL,
  assortment_id int NOT NULL,
  org_id int NOT NULL,
  start_date timestamp NOT NULL,
  end_date timestamp NOT NULL,
  items_qnt float NOT NULL,
  load_date timestamp NULL
);
ALTER TABLE ONLY src.task_data
ADD CONSTRAINT pk_task_data PRIMARY KEY (doc_id, line_id);
Sales JSON:
[{"rep_date": "2021-06-25 00:00:00.0", "nmcl_id": "494031", "assort_id": "7", "rtt_id": "100", "out_items": "2"} ... ]
Tasks JSON:
[{"doc_id": "1797690451", "line_id": "691950586", "org_id": "5", "nmcl_id": "349589", "assortment_id": "7", "items_qnt": "1.67", "start_date": "2021-06-29 00:00:00.0", "end_date": "2021-06-30 00:00:00.0", "load_date": null} ... ]
Sales fetch schema:
02:21:05 MSK DEBUG
PutDatabaseRecord[id=ee811cfb-3db0-1d69-8c8a-772e960df75f] Fetched Table Schema TableSchema[columns=[
Column[name=nmcl_id, dataType=4, required=true, columnSize=10],
Column[name=assort_id, dataType=4, required=true, columnSize=10],
Column[name=rtt_id, dataType=4, required=true, columnSize=10],
Column[name=rep_date, dataType=93, required=true, columnSize=29],
Column[name=out_items, dataType=8, required=true, columnSize=17]]] for table name sales
Tasks fetch schema:
02:25:43 MSK DEBUG
PutDatabaseRecord[id=325364c0-0064-3127-92f0-6c1b83b076aa] Fetched Table Schema TableSchema[columns=[
Column[name=doc_id, dataType=4, required=true, columnSize=10],
Column[name=line_id, dataType=4, required=true, columnSize=10]
Column[name=nmcl_id, dataType=4, required=true, columnSize=10], Column[name=assortment_id, dataType=4, required=true, columnSize=10],
Column[name=org_id, dataType=4, required=true, columnSize=10],
Column[name=start_date, dataType=93, required=true, columnSize=29],
Column[name=end_date, dataType=93, required=true, columnSize=29],
Column[name=items_qnt, dataType=8, required=true, columnSize=17],
Column[name=load_date, dataType=93, required=false, columnSize=13]]] for table name task_data
I tried putting a date format string in the appropriate fields of the JsonPathReader's config, and I tried making the date field type: "int" with logicalType: "date" in the schema.
I tried these changes together and separately... the result is the same.
So what is the difference between the two? Where is the problem?
As Adrian Klever said, the issue is with the float!
I had run into this problem before but had forgotten about it.
Here is a solution.

PostgreSQL nested jsonb: update value of complex key/value pairs

I'm starting out with the JSONB data type and hoping someone can help me out.
I have a table (properties) with two columns (id as primary key and data as jsonb).
The data structure is:
{
  "ProductType": "ABC",
  "ProductName": "XYZ",
  "attributes": [
    { "name": "Color", "type": "STRING", "value": "Silver" },
    { "name": "Case", "type": "STRING", "value": "Shells" },
    ...
  ]
}
I would like to update the value of a specific attributes element by name for a row with a given id. For example, for the element with "name"="Case" change the value to "Glass". So it ends up like
{
  "ProductType": "ABC",
  "ProductName": "XYZ",
  "attributes": [
    { "name": "Color", "type": "STRING", "value": "Silver" },
    { "name": "Case", "type": "STRING", "value": "Glass" },
    ...
  ]
}
Is this possible with this structure using SQL?
I have created the table structure if any of you would like to give it a shot:
dbfiddle
Use the jsonb concatenation operator, ||, to replace keys on the fly:
WITH properties (id, data) AS (
  values
    (1, '{"ProductType": "ABC","ProductName": "XYZ","attributes": [{"name": "Color","type": "STRING","value": "Silver"},{"name": "Case","type": "STRING","value": "Shells"}]}'::jsonb),
    (2, '{"ProductType": "ABC","ProductName": "XYZ","attributes": [{"name": "Color","type": "STRING","value": "Red"},{"name": "Case","type": "STRING","value": "Shells"}]}'::jsonb)
)
SELECT id,
       data ||
       jsonb_build_object(
         'attributes',
         jsonb_agg(
           case
             when attribs->>'name' = 'Case' then attribs || '{"value": "Glass"}'::jsonb
             else attribs
           end
         )
       ) as data
FROM properties m
CROSS JOIN LATERAL JSONB_ARRAY_ELEMENTS(data->'attributes') as a(attribs)
GROUP BY id, data
Updated fiddle
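If you need to persist the change for a single row rather than just select it, the same expression can drive an UPDATE. A sketch against the properties table from the question, targeting the row with id = 1:
UPDATE properties p
SET data = p.data || jsonb_build_object(
  'attributes',
  (SELECT jsonb_agg(
            case
              when attribs->>'name' = 'Case' then attribs || '{"value": "Glass"}'::jsonb
              else attribs
            end)
     FROM jsonb_array_elements(p.data->'attributes') AS a(attribs))
)
WHERE p.id = 1;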

Copying a 6 column table to a 7 column table

I'm porting SQL Server Integration Services packages to Azure Data Factory.
I have two tables (Table 1 and Table 2) which live on different servers. One has seven columns, the other six. I followed the example at https://learn.microsoft.com/en-us/azure/data-factory/data-factory-map-columns
Table 1 DDL:
CREATE TABLE dbo.Table1
(
  zonename nvarchar(max),
  propertyname nvarchar(max),
  basePropertyid int,
  dfp_ad_unit_id bigint,
  MomentType nvarchar(200),
  OperatingSystemName nvarchar(50)
)
Table 2 DDL
CREATE TABLE dbo.Table2
(
  ZoneID int IDENTITY,
  ZoneName nvarchar(max),
  propertyName nvarchar(max),
  BasePropertyID int,
  dfp_ad_unit_id bigint,
  MomentType nvarchar(200),
  OperatingSystemName nvarchar(50)
)
In ADF, I define Table 1 as:
{
  "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Table.json",
  "name": "Table1",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "PlatformX",
    "structure": [
      { "name": "zonename" },
      { "name": "propertyname" },
      { "name": "basePropertyid" },
      { "name": "dfp_ad_unit_id" },
      { "name": "MomentType" },
      { "name": "OperatingSystemName" }
    ],
    "external": true,
    "typeProperties": {
      "tableName": "Platform.Zone"
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
In ADF I define Table 2 as:
{
  "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Table.json",
  "name": "Table2",
  "properties": {
    "type": "SqlServerTable",
    "linkedServiceName": "BrixDW",
    "structure": [
      { "name": "ZoneID" },
      { "name": "ZoneName" },
      { "name": "propertyName" },
      { "name": "BasePropertyID" },
      { "name": "dfp_ad_unit_id" },
      { "name": "MomentType" },
      { "name": "OperatingSystemName" }
    ],
    "external": true,
    "typeProperties": {
      "tableName": "staging.DimZone"
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
As you can see, Table2 has an identity column, which will be populated automatically.
This should be a simple Copy activity:
{
  "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Pipeline.json",
  "name": "Copy_Table1_to_Table2",
  "properties": {
    "description": "Copy_Table1_to_Table2",
    "activities": [
      {
        "name": "Copy_Table1_to_Table2",
        "type": "Copy",
        "inputs": [
          { "name": "Table1" }
        ],
        "outputs": [
          { "name": "Table2" }
        ],
        "typeProperties": {
          "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "select * from dbo.Table1"
          },
          "sink": {
            "type": "SqlSink"
          },
          "translator": {
            "type": "TabularTranslator",
            "columnMappings": "zonename: ZoneName, propertyname: propertyName, basePropertyid: BasePropertyID, dfp_ad_unit_id: dfp_ad_unit_id, MomentType: MomentType, OperatingSystemName: OperatingSystemName"
          }
        },
        "policy": {
          "concurrency": 1,
          "executionPriorityOrder": "OldestFirst",
          "retry": 3,
          "timeout": "01:00:00"
        },
        "scheduler": {
          "frequency": "Day",
          "interval": 1
        }
      }
    ],
    "start": "2017-07-23T00:00:00Z",
    "end": "2020-07-19T00:00:00Z"
  }
}
I figured that by not mapping ZoneID, it would just be ignored, but ADF gives me the following error:
Copy activity encountered a user error: GatewayNodeName=APP1250S,ErrorCode=UserErrorInvalidColumnMappingColumnCountMismatch,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Invalid column mapping provided to copy activity: 'zonename: ZoneName, propertyname: propertyName, basePropertyid: BasePropertyID, dfp_ad_unit_id: dfp_ad_unit_id, MomentType: MomentType, OperatingSystemName: OperatingSystemName', Detailed message: Different column count between target structure and column mapping. Target column count:7, Column mapping count:6. Check column mapping in table definition.,Source=Microsoft.DataTransfer.Common,'
In a nutshell, I'm trying to copy a 6-column table into a 7-column table and Data Factory doesn't like it. How can I accomplish this task?
I realize this is an old question, but I ran into this issue just now. My problem was that I initially generated the destination/sink table, created a pipeline, and then added a column.
Despite clearing and re-importing the schemas, the pipeline would throw the above error whenever it was triggered. I made sure the new column (which has a default on it) was deselected in the mappings so it would only use the default value; the error was still thrown.
The only way I managed to get things to work was by completely recreating the pipeline from scratch. It's almost as if the old mappings are retained somewhere in the metadata.
I had the exact same issue, and I solved it by going into the Azure dataset and removing the identity column, then making sure I had the same number of columns in my source and target (sink). After doing this, the copy adds the records and the identity column in the table works as expected. I did not have to modify the physical table in SQL, only the dataset for the table in Azure.
One option would be to create a view over the 7-column table which does not include the identity column and insert into that view.
CREATE VIEW bulkLoad.Table2
AS
SELECT
    ZoneName,
    propertyName,
    BasePropertyID,
    dfp_ad_unit_id,
    MomentType,
    OperatingSystemName
FROM dbo.Table2;
GO
I can do some digging and see if some trick is possible with the column mapping but that should unblock you.
HTH
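If you go the view route, the sink dataset then points at the view instead of the base table. A sketch of the Table2 dataset above with only the structure and tableName changed (bulkLoad.Table2 is the view name from the example; everything else is copied from the question):
{
  "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Table.json",
  "name": "Table2",
  "properties": {
    "type": "SqlServerTable",
    "linkedServiceName": "BrixDW",
    "structure": [
      { "name": "ZoneName" },
      { "name": "propertyName" },
      { "name": "BasePropertyID" },
      { "name": "dfp_ad_unit_id" },
      { "name": "MomentType" },
      { "name": "OperatingSystemName" }
    ],
    "external": true,
    "typeProperties": {
      "tableName": "bulkLoad.Table2"
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}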
I was told by MSFT support to just remove the identity column from the table definition. It seems to have worked.