I have a problem with Apache Kafka and the output of a connector.
When I try to create a stream from the topic, I get some errors.
The data in the topic looks like this (no schema, JSON format):
key:
{
"payload": {
"sourceName": "HotPump",
"jobName": "pollingHotPump"
}
}
value:
{
"payload": {
"fields": {
"cw": [
4657,
0,
0,
0,
0,
0,
0,
0,
0,
0,
13108,
16637,
0,
0,
0
]
},
"timestamp": 1638540457655,
"expires": null,
"connection-name": "Condensator"
}
}
The ksqlDB query to create the stream is this one:
CREATE STREAM s_devices
(
key struct<payload struct<sourceName string>> ,
value struct<payload struct<fields struct<cw array>>>,
ts struct<payload struct<timestamp bigint>>
)
WITH (KAFKA_TOPIC='devices',
VALUE_FORMAT='JSON', KEY_FORMAT='JSON');
The result from the ksqlDB client is: "Failed to prepare statement: Cannot resolve unknown type: ARRAY"
When I create the stream with only key struct<payload struct<sourceName string>>,
the query select key->payload->sourceName, value->payload->timestamp from s_devices;
works correctly and the value is shown.
When I try only ts struct<payload struct<timestamp bigint>>, the stream is created, but when I select, the value is null: select value->payload->timestamp from s_devices;
Where is the error?
Thanks
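(For reference, the "Cannot resolve unknown type: ARRAY" error comes from the bare ARRAY keyword: ksqlDB requires arrays to declare an element type, for example ARRAY<INTEGER>. A minimal sketch of the same statement with only that change applied; it addresses the type error, not the null-value behaviour described above:)
CREATE STREAM s_devices
(
  key struct<payload struct<sourceName string>>,
  -- bare ARRAY replaced with ARRAY<INTEGER>; everything else left as in the question
  value struct<payload struct<fields struct<cw array<integer>>>>,
  ts struct<payload struct<timestamp bigint>>
)
WITH (KAFKA_TOPIC='devices',
      VALUE_FORMAT='JSON', KEY_FORMAT='JSON');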
I have a Lookup activity "Fetch Customers" with this SQL statement:
Select Count(CustomerId) As 'Row_count', Min(sales_amount) as 'Min_Sales'
From [sales].[Customers]
It returns the values:
10, 5000
Next I have a Lookup "Update Min Sales" with this SQL statement, but I am getting an error:
Update Sales_Min_Sales
SET Row_Count = @activity('Fetch Customers').output.Row_count,
Min_Sales = @activity('Fetch Customers').output.Min_Sales
Select 1
The same error occurs even if I set the Lookup to:
Select @activity('Fetch Fetch Customers').output.Row_count
Error:
A database operation failed with the following error: 'Must declare the scalar variable "@activity".',Source=,''Type=System.Data.SqlClient.SqlException,Message=Must declare the scalar variable "@activity".,Source=.Net SqlClient Data Provider,SqlErrorNumber=137,Class=15,ErrorCode=-2146232060,State=2,Errors=[{Class=15,Number=137,State=2,Message=Must declare the scalar variable "@activity".,},],'
I have a similar setup to yours, with two Lookup activities.
The first lookup returns the min ID and max ID, as shown:
{
"count": 1,
"value": [
{
"Min": 1,
"Max": 30118
}
],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)",
"billingReference": {
"activityType": "PipelineActivity",
"billableDuration": [
{
"meterType": "AzureIR",
"duration": 0.016666666666666666,
"unit": "DIUHours"
}
]
},
"durationInQueue": {
"integrationRuntimeQueue": 22
}
}
In my second lookup I am using the expression below:
Update saleslt.customer set somecol=someval where CustomerID=@{activity('Lookup1').output.Value[0]['Min']}
Select 1 as dummy
The key point is that you have to access the lookup output using indices, as shown, and place the activity output inside @{ }. Applied to the question's query, it would look like the sketch below.
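(A sketch only, reusing the table and column names from the question; whether you index with output.value[0] or use output.firstRow depends on whether the first Lookup has "First row only" enabled:)
Update Sales_Min_Sales
SET Row_Count = @{activity('Fetch Customers').output.value[0]['Row_count']},
    Min_Sales = @{activity('Fetch Customers').output.value[0]['Min_Sales']}
Select 1 as dummy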
I am trying to insert a nested JSON file into a PostgreSQL database. Below is sample data from the JSON file:
[
{
"location_id": 11111,
"recipe_id": "LLLL324",
"serving_size_number": 1,
"recipe_fraction_description": null,
"description": "1/2 gallon",
"recipe_name": "DREXEL ALMOND MILK 32 OZ",
"marketing_name": "Almond Milk",
"marketing_description": null,
"ingredient_statement": "Almond Milk (ALMOND MILK (FILTERED WATER, ALMONDS), CANE SUGAR, CONTAINS 2% OR LESS OF: VITAMIN AND MINERAL BLEND (CALCIUM CARBONATE, VITAMIN E ACETATE, VITAMIN A PALMITATE, VITAMIN D2), SEA SALT, SUNFLOWER LECITHIN, LOCUST BEAN GUM, GELLAN GUM.)",
"allergen_attributes": {
"allergen_statement_not_available": null,
"contains_shellfish": "NO",
"contains_peanut": "NO",
"contains_tree_nuts": "YES",
"contains_milk": "NO",
"contains_wheat": "NO",
"contains_soy": "NO",
"contains_eggs": "NO",
"contains_fish": "NO",
"contains_added_msg": "UNKNOWN",
"contains_hfcs": "UNKNOWN",
"contains_mustard": "UNKNOWN",
"contains_celery": "UNKNOWN",
"contains_sesame": "UNKNOWN",
"contains_red_yellow_blue_dye": "UNKNOWN",
"gluten_free_per_fda": "UNKNOWN",
"non_gmo_claim": "UNKNOWN",
"contains_gluten": "NO"
},
"dietary_attributes": {
"vegan": "YES",
"vegetarian": "YES",
"kosher": "YES",
"halal": "UNKNOWN"
},
"primary_attributes": {
"protein": 7.543,
"total_fat": 19.022,
"carbohydrate": 69.196,
"calories": 463.227,
"total_sugars": 61.285,
"fiber": 5.81,
"calcium": 3840.228,
"iron": 3.955,
"potassium": 270.768,
"sodium": 1351.208,
"cholesterol": 0.0,
"trans_fat": 0.0,
"saturated_fat": 1.488,
"monounsaturated_fat": 11.743,
"polyunsaturated_fat": 4.832,
"calories_from_fat": 171.195,
"pct_calories_from_fat": 36.957,
"pct_calories_from_saturated_fat": 2.892,
"added_sugars": null,
"vitamin_d_(mcg)": null
},
"secondary_attributes": {
"ash": null,
"water": null,
"magnesium": 120.654,
"phosphorous": 171.215,
"zinc": 1.019,
"copper": 0.183,
"manganese": null,
"selenium": 1.325,
"vitamin_a_(IU)": 5331.357,
"vitamin_a_(RAE)": null,
"beta_carotene": null,
"alpha_carotene": null,
"vitamin_e_(A-tocopherol)": 49.909,
"vitamin_d_(IU)": null,
"vitamin_c": 0.0,
"thiamin_(B1)": 0.0,
"riboflavin_(B2)": 0.449,
"niacin": 0.979,
"pantothenic_acid": 0.061,
"vitamin_b6": 0.0,
"folacin_(folic_acid)": null,
"vitamin_b12": 0.0,
"vitamin_k": null,
"folic_acid": null,
"folate_food": null,
"folate_DFE": null,
"vitamin_a_(RE)": null,
"pct_calories_from_protein": 6.514,
"pct_calories_from_carbohydrates": 59.751,
"biotin": null,
"niacin_(mg_NE)": null,
"vitamin_e_(IU)": null
}
}
]
When I tried to copy the data using the Postgres command below
\copy table_name 'location of the file'
I got this error:
ERROR: invalid input syntax for type integer: "["
CONTEXT: COPY table_name, line 1, column location_id: "["
I tried the approach below as well, but had no luck:
INSERT INTO json_table
SELECT [all key fields]
FROM json_populate_record (NULL::json_table,
'{
sample data
}'
);
What is the simplest way to insert this kind of nested JSON file into PostgreSQL tables? Is there a query that can be used to insert any nested JSON file?
Insert the JSON into a table. In fact, I am not sure what result you expect.
yimo=# create table if not exists foo(a int,b text);
CREATE TABLE
yimo=# insert into foo select * from json_populate_record(null::foo, ('[{"a":1,"b":"3"}]'::jsonb->>0)::json);
INSERT 0 1
yimo=# select * from foo;
a | b
---+---
1 | 3
(1 row)
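(A follow-up sketch applying the same idea to the nested sample above. The target table recipes and its columns are hypothetical, and the file contents are assumed to be available as a jsonb value, e.g. read by a client and pasted in as a literal or bound as a parameter; nested objects are reached with -> before extracting text with ->>.)
-- hypothetical target table
create table if not exists recipes (
  location_id int,
  recipe_id text,
  recipe_name text,
  contains_tree_nuts text,
  protein numeric
);
-- '...file contents...' stands for the JSON document shown in the question
insert into recipes (location_id, recipe_id, recipe_name, contains_tree_nuts, protein)
select (elem->>'location_id')::int,
       elem->>'recipe_id',
       elem->>'recipe_name',
       elem->'allergen_attributes'->>'contains_tree_nuts',
       (elem->'primary_attributes'->>'protein')::numeric
from jsonb_array_elements('...file contents...'::jsonb) as t(elem);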
The issue I'm having is that when the JDBC sink connector consumes a Kafka message, the key fields are null when it writes to the database.
However, when I consume the topic directly with kafka-avro-console-consumer, I can see the key and value fields and their values, because I use this config: --property print.key=true.
Question: is there a way to make sure the JDBC sink connector processes the message key fields?
kafka-avro-console-consumer config:
/opt/confluent-5.4.1/bin/kafka-avro-console-consumer \
--bootstrap-server "localhost:9092" \
--topic equipmentidentifier.persist \
--property parse.key=true \
--property key.separator=~ \
--property print.key=true \
--property schema.registry.url="http://localhost:8081" \
--property key.schema=[$KEY_SCHEMA] \
--property value.schema=[$IDENTIFIER_SCHEMA,$VALUE_SCHEMA]
Error:
org.apache.kafka.connect.errors.RetriableException: java.sql.SQLException: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO "assignment_table" ("created_date","customer","id_type","id_value") VALUES('1970-01-01 03:25:44.567+00'::timestamp,123,'BILL_OF_LADING','BOL-123') was aborted: ERROR: null value in column "equipment_identifier_type" violates not-null constraint
Detail: Failing row contains (null, null, null, null, 1970-01-01 03:25:44.567, 123, id, 56). Call getNextException to see other errors in the batch.
org.postgresql.util.PSQLException: ERROR: null value in column "equipment_identifier_type" violates not-null constraint
Sink config:
task.max=1
topic=assignment
connect.class=io.confluet.connect.jdbc.JdbcSinkConnector
connection.url=jdbc:postgresql://localhost:5432/db
connection.user=test
connection.password=test
table.name.format=assignment_table
auto.create=false
insert.mode=insert
pk.fields=customer,equip_Type,equip_Value,id_Type,id_Value,cpId
transforms=flatten
transforms.flattenKey.type=org.apache.kafka.connect.transforms.Flatten$Key
transforms.flattenKey.delimiter=_
transforms.flattenKey.type=org.apache.kafka.connect.transforms.Flatten$Value
transforms.flattenKey.delimiter=_
Kafka key:
{
"assignmentKey": {
"cpId": {
"long": 1001
},
"equip": {
"Identifier": {
"type": "eq",
"value": "eq_45"
}
},
"vendorId": {
"string": "vendor"
}
}
}
Kafka value:
{
"assigmentValue": {
"id": {
"Identifier": {
"type": "id",
"value": "56"
}
},
"timestamp": {
"long": 1234456756
},
"customer": {
"long": 123
}
}
}
You need to tell the connector to use fields from the key, because by default it won't.
pk.mode=record_key
However, you need to use fields from either the key or the value, not both, as you currently have in your config:
pk.fields=customer,equip_Type,equip_Value,id_Type,id_Value,cpId
If you set pk.mode=record_key then pk.fields will refer to the fields in the message key.
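(A sketch of the relevant sink properties under that setting; the field names are illustrative only, since the exact names depend on how the Flatten transform renames the nested key fields:)
pk.mode=record_key
# illustrative flattened names - check what your Flatten$Key transform actually produces
pk.fields=assignmentKey_cpId,assignmentKey_equip_Identifier_type,assignmentKey_equip_Identifier_value,assignmentKey_vendorId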
Ref: https://docs.confluent.io/current/connect/kafka-connect-jdbc/sink-connector/sink_config_options.html#sink-pk-config-options
See also https://rmoff.dev/kafka-jdbc-video
I have a topic post_users_t and when I use the PRINT command on it I get:
rowtime: 4/2/20 2:03:48 PM UTC, key: <null>, value: {"userid": 6, "id": 8, "title": "testest", "body": "Testingmoreand more"}
rowtime: 4/2/20 2:03:48 PM UTC, key: <null>, value: {"userid": 7, "id": 11, "title": "testest", "body": "Testingmoreand more"}
So then I create a stream out of this with:
CREATE STREAM userstream (userid INT, id INT, title VARCHAR, body VARCHAR)
WITH (KAFKA_TOPIC='post_users_t',
VALUE_FORMAT='JSON');
But I can't select anything from it, and when I run DESCRIBE EXTENDED on it, all the messages have failed:
consumer-messages-per-sec: 1.06 consumer-total-bytes: 116643 consumer-total-messages: 3417 last-message: 2020-04-02T14:08:08.546Z
consumer-failed-messages: 3417 consumer-failed-messages-per-sec: 1.06 last-failed: 2020-04-02T14:08:08.56Z
What am I doing wrong here?
Extra info below.
Printing the topic from the beginning:
ksql> print 'post_users_t' from beginning limit 2;
Key format: SESSION(AVRO) or HOPPING(AVRO) or TUMBLING(AVRO) or AVRO or SESSION(PROTOBUF) or HOPPING(PROTOBUF) or TUMBLING(PROTOBUF) or PROTOBUF or SESSION(JSON) or HOPPING(JSON) or TUMBLING(JSON) or JSON or SESSION(JSON_SR) or HOPPING(JSON_SR) or TUMBLING(JSON_SR) or JSON_SR or SESSION(KAFKA_INT) or HOPPING(KAFKA_INT) or TUMBLING(KAFKA_INT) or KAFKA_INT or SESSION(KAFKA_BIGINT) or HOPPING(KAFKA_BIGINT) or TUMBLING(KAFKA_BIGINT) or KAFKA_BIGINT or SESSION(KAFKA_DOUBLE) or HOPPING(KAFKA_DOUBLE) or TUMBLING(KAFKA_DOUBLE) or KAFKA_DOUBLE or SESSION(KAFKA_STRING) or HOPPING(KAFKA_STRING) or TUMBLING(KAFKA_STRING) or KAFKA_STRING
Value format: AVRO or KAFKA_STRING
rowtime: 4/2/20 1:04:08 PM UTC, key: <null>, value: {"userid": 1, "id": 1, "title": "loremit", "body": "loremit heiluu ja paukkuu"}
rowtime: 4/2/20 1:04:08 PM UTC, key: <null>, value: {"userid": 2, "id": 2, "title": "lorbe", "body": "larboloilllaaa"}
Per the output from ksqlDB's inspection of the topic, your data is serialised in Avro:
Value format: AVRO or KAFKA_STRING
but you have created the STREAM specifying VALUE_FORMAT='JSON'. This will result in deserialisation errors which if you run docker-compose logs -f ksqldb-server you'll see being written out when you try to query the stream.
Since you're using Avro, you don't need to specify the schema. Try this instead:
CREATE STREAM userstream
WITH (KAFKA_TOPIC='post_users_t',
VALUE_FORMAT='AVRO');
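(As a quick check once the stream is created against the Avro data, a sketch of a push query to confirm that rows now deserialise:)
SELECT * FROM userstream EMIT CHANGES LIMIT 2;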
How do I extract values from a Cloudant IBM Bluemix NoSQL database where the data is stored in JSON format?
I tried this code
def readDataFrameFromCloudant(host,user,pw,database):
cloudantdata=spark.read.format("com.cloudant.spark"). \
option("cloudant.host",host). \
option("cloudant.username", user). \
option("cloudant.password", pw). \
load(database)
cloudantdata.createOrReplaceTempView("washing")
spark.sql("SELECT * from washing").show()
return cloudantdata
hostname = ""
user = ""
pw = ""
database = "database"
cloudantdata=readDataFrameFromCloudant(hostname, user, pw, database)
It is stored in this format
{
"_id": "31c24a382f3e4d333421fc89ada5361e",
"_rev": "1-8ba1be454fed5b48fa493e9fe97bedae",
"d": {
"count": 9,
"hardness": 72,
"temperature": 85,
"flowrate": 11,
"fluidlevel": "acceptable",
"ts": 1502677759234
}
}
I want the nested fields under d returned as top-level columns (the expected and actual outputs were attached as screenshots).
Create a dummy dataset for reproducing the issue:
cloudantdata = spark.read.json(sc.parallelize(["""
{
"_id": "31c24a382f3e4d333421fc89ada5361e",
"_rev": "1-8ba1be454fed5b48fa493e9fe97bedae",
"d": {
"count": 9,
"hardness": 72,
"temperature": 85,
"flowrate": 11,
"fluidlevel": "acceptable",
"ts": 1502677759234
}
}
"""]))
cloudantdata.take(1)
Returns:
[Row(_id='31c24a382f3e4d333421fc89ada5361e', _rev='1-8ba1be454fed5b48fa493e9fe97bedae', d=Row(count=9, flowrate=11, fluidlevel='acceptable', hardness=72, temperature=85, ts=1502677759234))]
Now flatten:
flat_df = cloudantdata.select("_id", "_rev", "d.*")
flat_df.take(1)
Returns:
[Row(_id='31c24a382f3e4d333421fc89ada5361e', _rev='1-8ba1be454fed5b48fa493e9fe97bedae', count=9, flowrate=11, fluidlevel='acceptable', hardness=72, temperature=85, ts=1502677759234)]
I tested this code with an IBM Data Science Experience notebook using Python 3.5 (Experimental) with Spark 2.0
This answer is based on: https://stackoverflow.com/a/45694796/1033422
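(A follow-up sketch: since the question's code already registers the temp view washing, the same flattening can also be written in Spark SQL; this assumes that view has been created as in the question.)
flat_df = spark.sql("SELECT _id, _rev, d.* FROM washing")
flat_df.show()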