ksql create stream from json key/value - apache-kafka

I have a problem with Apache Kafka and the output of a connector.
When I try to create a stream from the topic I get some errors.
The data in the topic looks like this (without a schema, in JSON format):
key:
{
"payload": {
"sourceName": "HotPump",
"jobName": "pollingHotPump"
}
}
value:
{
"payload": {
"fields": {
"cw": [
4657,
0,
0,
0,
0,
0,
0,
0,
0,
0,
13108,
16637,
0,
0,
0
]
},
"timestamp": 1638540457655,
"expires": null,
"connection-name": "Condensator"
}
}
The ksql query to create the stream is this one:
CREATE STREAM s_devices
(
key struct<payload struct<sourceName string>> ,
value struct<payload struct<fields struct<cw array>>>,
ts struct<payload struct<timestamp bigint>>
)
WITH (KAFKA_TOPIC='devices',
VALUE_FORMAT='JSON', KEY_FORMAT='JSON');
The result from the ksql client is: "Failed to prepare statement: Cannot resolve unknown type: ARRAY"
When I try to create the stream with only key struct<payload struct<sourceName string>>,
the query select key->payload->sourceName, value->payload->timestamp from s_devices;
works correctly and the value is shown.
When I try with only ts struct<payload struct<timestamp bigint>>, the stream is created, but the value is null when I run select value->payload->timestamp from s_devices;
Where is the error?
Thanks
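For reference, the "Cannot resolve unknown type: ARRAY" error comes from the unparameterised ARRAY: ksqlDB requires an element type, e.g. ARRAY<INTEGER>. A minimal sketch of the same DDL with only that changed, assuming the cw entries are integers:
CREATE STREAM s_devices
(
key struct<payload struct<sourceName string>>,
value struct<payload struct<fields struct<cw array<integer>>>>,
ts struct<payload struct<timestamp bigint>>
)
WITH (KAFKA_TOPIC='devices',
VALUE_FORMAT='JSON', KEY_FORMAT='JSON');
Note also that ksqlDB matches value columns to top-level JSON field names, so a value column named ts can only be populated if the message really has a top-level ts field; with the payload shown above it will read as NULL, which may explain the last observation.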

Related

How to use result of Lookup Activity in next Lookup of Azure Data Factory?

I have a Lookup "Fetch Customers" with this SQL statement:
Select Count(CustomerId) As 'Row_count', Min(sales_amount) as 'Min_Sales' From [sales].[Customers]
It returns the values
10, 5000
Next I have a Lookup "Update Min Sales" with this SQL statement, but I am getting an error:
Update Sales_Min_Sales
SET Row_Count = @activity('Fetch Customers').output.Row_count,
Min_Sales = @activity('Fetch Customers').output.Min_Sales
Select 1
The same error occurs even if I set the Lookup to
Select @activity('Fetch Customers').output.Row_count
Error:
A database operation failed with the following error: 'Must declare the scalar variable
"@activity".',Source=,''Type=System.Data.SqlClient.SqlException,Message=Must declare the
scalar variable "@activity".,Source=.Net SqlClient Data
Provider,SqlErrorNumber=137,Class=15,ErrorCode=-2146232060,State=2,Errors=
[{Class=15,Number=137,State=2,Message=Must declare the scalar variable "@activity".,},],'
I have a similar set-up to yours: two Lookup activities.
The first Lookup returns the min ID and max ID, as shown:
{
"count": 1,
"value": [
{
"Min": 1,
"Max": 30118
}
],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)",
"billingReference": {
"activityType": "PipelineActivity",
"billableDuration": [
{
"meterType": "AzureIR",
"duration": 0.016666666666666666,
"unit": "DIUHours"
}
]
},
"durationInQueue": {
"integrationRuntimeQueue": 22
}
}
In my second Lookup I am using the expression below:
Update saleslt.customer set somecol=someval where CustomerID=@{activity('Lookup1').output.Value[0]['Min']}
Select 1 as dummy
Note that you have to access the Lookup output using indices as shown, and place the activity output inside @{ }.
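Applied back to the original question, and assuming the "Fetch Customers" Lookup has "First row only" unchecked so that its result is exposed under output.value, the "Update Min Sales" query would look something like:
Update Sales_Min_Sales
SET Row_Count = @{activity('Fetch Customers').output.value[0]['Row_count']},
Min_Sales = @{activity('Fetch Customers').output.value[0]['Min_Sales']}
Select 1 as dummy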

Inserting nested json file in postgreSQL

I am trying to insert a nested JSON file into a PostgreSQL DB. Below is sample data from the JSON file.
[
{
"location_id": 11111,
"recipe_id": "LLLL324",
"serving_size_number": 1,
"recipe_fraction_description": null,
"description": "1/2 gallon",
"recipe_name": "DREXEL ALMOND MILK 32 OZ",
"marketing_name": "Almond Milk",
"marketing_description": null,
"ingredient_statement": "Almond Milk (ALMOND MILK (FILTERED WATER, ALMONDS), CANE SUGAR, CONTAINS 2% OR LESS OF: VITAMIN AND MINERAL BLEND (CALCIUM CARBONATE, VITAMIN E ACETATE, VITAMIN A PALMITATE, VITAMIN D2), SEA SALT, SUNFLOWER LECITHIN, LOCUST BEAN GUM, GELLAN GUM.)",
"allergen_attributes": {
"allergen_statement_not_available": null,
"contains_shellfish": "NO",
"contains_peanut": "NO",
"contains_tree_nuts": "YES",
"contains_milk": "NO",
"contains_wheat": "NO",
"contains_soy": "NO",
"contains_eggs": "NO",
"contains_fish": "NO",
"contains_added_msg": "UNKNOWN",
"contains_hfcs": "UNKNOWN",
"contains_mustard": "UNKNOWN",
"contains_celery": "UNKNOWN",
"contains_sesame": "UNKNOWN",
"contains_red_yellow_blue_dye": "UNKNOWN",
"gluten_free_per_fda": "UNKNOWN",
"non_gmo_claim": "UNKNOWN",
"contains_gluten": "NO"
},
"dietary_attributes": {
"vegan": "YES",
"vegetarian": "YES",
"kosher": "YES",
"halal": "UNKNOWN"
},
"primary_attributes": {
"protein": 7.543,
"total_fat": 19.022,
"carbohydrate": 69.196,
"calories": 463.227,
"total_sugars": 61.285,
"fiber": 5.81,
"calcium": 3840.228,
"iron": 3.955,
"potassium": 270.768,
"sodium": 1351.208,
"cholesterol": 0.0,
"trans_fat": 0.0,
"saturated_fat": 1.488,
"monounsaturated_fat": 11.743,
"polyunsaturated_fat": 4.832,
"calories_from_fat": 171.195,
"pct_calories_from_fat": 36.957,
"pct_calories_from_saturated_fat": 2.892,
"added_sugars": null,
"vitamin_d_(mcg)": null
},
"secondary_attributes": {
"ash": null,
"water": null,
"magnesium": 120.654,
"phosphorous": 171.215,
"zinc": 1.019,
"copper": 0.183,
"manganese": null,
"selenium": 1.325,
"vitamin_a_(IU)": 5331.357,
"vitamin_a_(RAE)": null,
"beta_carotene": null,
"alpha_carotene": null,
"vitamin_e_(A-tocopherol)": 49.909,
"vitamin_d_(IU)": null,
"vitamin_c": 0.0,
"thiamin_(B1)": 0.0,
"riboflavin_(B2)": 0.449,
"niacin": 0.979,
"pantothenic_acid": 0.061,
"vitamin_b6": 0.0,
"folacin_(folic_acid)": null,
"vitamin_b12": 0.0,
"vitamin_k": null,
"folic_acid": null,
"folate_food": null,
"folate_DFE": null,
"vitamin_a_(RE)": null,
"pct_calories_from_protein": 6.514,
"pct_calories_from_carbohydrates": 59.751,
"biotin": null,
"niacin_(mg_NE)": null,
"vitamin_e_(IU)": null
}
}
]
When I tried to copy the data using the Postgres command below
\copy table_name 'location of the file'
I got the error below:
ERROR: invalid input syntax for type integer: "["
CONTEXT: COPY table_name, line 1, column location_id: "["
I tried the approach below as well, but with no luck:
INSERT INTO json_table
SELECT [all key fields]
FROM json_populate_record (NULL::json_table,
'{
sample data
}'
);
What is the simplest way to insert this type of nested JSON file into PostgreSQL tables? Is there a query we can use to insert any nested JSON file?
Here is how to insert JSON into a table; in fact, I am not sure exactly what output you expect.
yimo=# create table if not exists foo(a int,b text);
CREATE TABLE
yimo=# insert into foo select * from json_populate_record(null::foo, ('[{"a":1,"b":"3"}]'::jsonb->>0)::json);
INSERT 0 1
yimo=# select * from foo;
a | b
---+---
1 | 3
(1 row)
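For the nested file in the question, one simple pattern is to load each array element into a jsonb column and then pull nested values out with the JSON operators. A minimal sketch (the table and column names are made up, and only a trimmed-down sample of one document is inlined):
create table if not exists recipe_docs (doc jsonb);
insert into recipe_docs (doc)
select jsonb_array_elements('[
  {"location_id": 11111,
   "recipe_name": "DREXEL ALMOND MILK 32 OZ",
   "allergen_attributes": {"contains_tree_nuts": "YES"}}
]'::jsonb);
select (doc->>'location_id')::int as location_id,
       doc->>'recipe_name' as recipe_name,
       doc->'allergen_attributes'->>'contains_tree_nuts' as contains_tree_nuts
from recipe_docs;
From there you can either keep querying the jsonb column directly or insert the extracted columns into a typed table.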

Kafka jdbc connect sink: Is it possible to use pk.fields for fields in value and key?

The issue I'm having is that when the JDBC sink connector consumes a Kafka message, the key variables are null when writing to the DB.
However, when I consume directly through kafka-avro-console-consumer, I can see the key and value variables with their values, because I use this config: --property print.key=true.
ASK: is there a way to make sure that the JDBC connector is processing the message key variable values?
kafka-avro-console-consumer config:
/opt/confluent-5.4.1/bin/kafka-avro-console-consumer \
--bootstrap-server "localhost:9092" \
--topic equipmentidentifier.persist \
--property parse.key=true \
--property key.separator=~ \
--property print.key=true \
--property schema.registry.url="http://localhost:8081" \
--property key.schema=[$KEY_SCHEMA] \
--property value.schema=[$IDENTIFIER_SCHEMA,$VALUE_SCHEMA]
error:
org.apache.kafka.connect.errors.RetriableException: java.sql.SQLException: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO "assignment_table" ("created_date","customer","id_type","id_value") VALUES('1970-01-01 03:25:44.567+00'::timestamp,123,'BILL_OF_LADING','BOL-123') was aborted: ERROR: null value in column "equipment_identifier_type" violates not-null constraint
Detail: Failing row contains (null, null, null, null, 1970-01-01 03:25:44.567, 123, id, 56). Call getNextException to see other errors in the batch.
org.postgresql.util.PSQLException: ERROR: null value in column "equipment_identifier_type" violates not-null constraint
Sink config:
task.max=1
topic=assignment
connect.class=io.confluet.connect.jdbc.JdbcSinkConnector
connection.url=jdbc:postgresql://localhost:5432/db
connection.user=test
connection.password=test
table.name.format=assignment_table
auto.create=false
insert.mode=insert
pk.fields=customer,equip_Type,equip_Value,id_Type,id_Value,cpId
transforms=flatten
transforms.flattenKey.type=org.apache.kafka.connect.transforms.Flatten$Key
transforms.flattenKey.delimiter=_
transforms.flattenKey.type=org.apache.kafka.connect.transforms.Flatten$Value
transforms.flattenKey.delimiter=_
Kafka key:
{
"assignmentKey": {
"cpId": {
"long": 1001
},
"equip": {
"Identifier": {
"type": "eq",
"value": "eq_45"
}
},
"vendorId": {
"string": "vendor"
}
}
}
Kafka value:
{
"assigmentValue": {
"id": {
"Identifier": {
"type": "id",
"value": "56"
}
},
"timestamp": {
"long": 1234456756
},
"customer": {
"long": 123
}
}
}
You need to tell the connector to use fields from the key, because by default it won't.
pk.mode=record_key
However you need to use fields from either the Key or the Value, not both as you have in your config currently:
pk.fields=customer,equip_Type,equip_Value,id_Type,id_Value,cpId
If you set pk.mode=record_key then pk.fields will refer to the fields in the message key.
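To make that concrete, a minimal sketch of the relevant sink settings (the flattened field names below are only a guess at what the Flatten transform would produce from the key shown, so verify the actual names before using them):
pk.mode=record_key
pk.fields=assignmentKey_cpId,assignmentKey_equip_Identifier_type,assignmentKey_equip_Identifier_value
transforms=flattenKey,flattenValue
transforms.flattenKey.type=org.apache.kafka.connect.transforms.Flatten$Key
transforms.flattenKey.delimiter=_
transforms.flattenValue.type=org.apache.kafka.connect.transforms.Flatten$Value
transforms.flattenValue.delimiter=_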
Ref: https://docs.confluent.io/current/connect/kafka-connect-jdbc/sink-connector/sink_config_options.html#sink-pk-config-options
See also https://rmoff.dev/kafka-jdbc-video

Kafka: All messages failing in stream while data in topics

I have a topic post_users_t and when I use the PRINT command on it I get:
rowtime: 4/2/20 2:03:48 PM UTC, key: <null>, value: {"userid": 6, "id": 8, "title": "testest", "body": "Testingmoreand more"}
rowtime: 4/2/20 2:03:48 PM UTC, key: <null>, value: {"userid": 7, "id": 11, "title": "testest", "body": "Testingmoreand more"}
So then I create a stream out of this with:
CREATE STREAM userstream (userid INT, id INT, title VARCHAR, body VARCHAR)
WITH (KAFKA_TOPIC='post_users_t',
VALUE_FORMAT='JSON');
But I can't select anything from it, and when I DESCRIBE EXTENDED it, all the messages have failed:
consumer-messages-per-sec: 1.06 consumer-total-bytes: 116643 consumer-total-messages: 3417 last-message: 2020-04-02T14:08:08.546Z
consumer-failed-messages: 3417 consumer-failed-messages-per-sec: 1.06 last-failed: 2020-04-02T14:08:08.56Z
What am I doing wrong here?
Extra info below!
Printing the topic from the beginning:
ksql> print 'post_users_t' from beginning limit 2;
Key format: SESSION(AVRO) or HOPPING(AVRO) or TUMBLING(AVRO) or AVRO or SESSION(PROTOBUF) or HOPPING(PROTOBUF) or TUMBLING(PROTOBUF) or PROTOBUF or SESSION(JSON) or HOPPING(JSON) or TUMBLING(JSON) or JSON or SESSION(JSON_SR) or HOPPING(JSON_SR) or TUMBLING(JSON_SR) or JSON_SR or SESSION(KAFKA_INT) or HOPPING(KAFKA_INT) or TUMBLING(KAFKA_INT) or KAFKA_INT or SESSION(KAFKA_BIGINT) or HOPPING(KAFKA_BIGINT) or TUMBLING(KAFKA_BIGINT) or KAFKA_BIGINT or SESSION(KAFKA_DOUBLE) or HOPPING(KAFKA_DOUBLE) or TUMBLING(KAFKA_DOUBLE) or KAFKA_DOUBLE or SESSION(KAFKA_STRING) or HOPPING(KAFKA_STRING) or TUMBLING(KAFKA_STRING) or KAFKA_STRING
Value format: AVRO or KAFKA_STRING
rowtime: 4/2/20 1:04:08 PM UTC, key: <null>, value: {"userid": 1, "id": 1, "title": "loremit", "body": "loremit heiluu ja paukkuu"}
rowtime: 4/2/20 1:04:08 PM UTC, key: <null>, value: {"userid": 2, "id": 2, "title": "lorbe", "body": "larboloilllaaa"}
Per the output from ksqlDB's inspection of the topic, your data is serialised in Avro:
Value format: AVRO or KAFKA_STRING
but you have created the STREAM specifying VALUE_FORMAT='JSON'. This will result in deserialisation errors, which you'll see being written out if you run docker-compose logs -f ksqldb-server while you try to query the stream.
Since you're using Avro, you don't need to specify the schema. Try this instead:
CREATE STREAM userstream
WITH (KAFKA_TOPIC='post_users_t',
VALUE_FORMAT='AVRO');
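Once the stream is recreated with the Avro value format, a quick sanity check is a push query (a sketch; the LIMIT just stops it from running indefinitely):
SELECT * FROM userstream EMIT CHANGES LIMIT 2;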

Extract value from cloudant IBM Bluemix NoSQL Database

How do I extract a value from the Cloudant IBM Bluemix NoSQL database, where it is stored in JSON format?
I tried this code:
def readDataFrameFromCloudant(host,user,pw,database):
    cloudantdata = spark.read.format("com.cloudant.spark"). \
        option("cloudant.host", host). \
        option("cloudant.username", user). \
        option("cloudant.password", pw). \
        load(database)
    cloudantdata.createOrReplaceTempView("washing")
    spark.sql("SELECT * from washing").show()
    return cloudantdata
hostname = ""
user = ""
pw = ""
database = "database"
cloudantdata=readDataFrameFromCloudant(hostname, user, pw, database)
It is stored in this format
{
"_id": "31c24a382f3e4d333421fc89ada5361e",
"_rev": "1-8ba1be454fed5b48fa493e9fe97bedae",
"d": {
"count": 9,
"hardness": 72,
"temperature": 85,
"flowrate": 11,
"fluidlevel": "acceptable",
"ts": 1502677759234
}
}
I want this result (the expected and actual outcomes were shown as screenshots in the original post).
Create a dummy dataset for reproducing the issue:
cloudantdata = spark.read.json(sc.parallelize(["""
{
"_id": "31c24a382f3e4d333421fc89ada5361e",
"_rev": "1-8ba1be454fed5b48fa493e9fe97bedae",
"d": {
"count": 9,
"hardness": 72,
"temperature": 85,
"flowrate": 11,
"fluidlevel": "acceptable",
"ts": 1502677759234
}
}
"""]))
cloudantdata.take(1)
Returns:
[Row(_id='31c24a382f3e4d333421fc89ada5361e', _rev='1-8ba1be454fed5b48fa493e9fe97bedae', d=Row(count=9, flowrate=11, fluidlevel='acceptable', hardness=72, temperature=85, ts=1502677759234))]
Now flatten:
flat_df = cloudantdata.select("_id", "_rev", "d.*")
flat_df.take(1)
Returns:
[Row(_id='31c24a382f3e4d333421fc89ada5361e', _rev='1-8ba1be454fed5b48fa493e9fe97bedae', count=9, flowrate=11, fluidlevel='acceptable', hardness=72, temperature=85, ts=1502677759234)]
I tested this code with an IBM Data Science Experience notebook using Python 3.5 (Experimental) with Spark 2.0
This answer is based on: https://stackoverflow.com/a/45694796/1033422
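The same flattening applies directly to the DataFrame returned by readDataFrameFromCloudant in the question (a sketch, assuming every document carries the d struct shown above):
flat_washing = cloudantdata.select("_id", "_rev", "d.*")
flat_washing.createOrReplaceTempView("washing_flat")
spark.sql("SELECT hardness, temperature, flowrate, fluidlevel, ts FROM washing_flat").show()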