ksqlDB COALESCE always returns null despite non-null parameters

I have the following ksql query:
SELECT
event->acceptedevent->id as id1,
event->refundedevent->id as id2,
coalesce(event->acceptedevent->id, event->refundedevent->id) as coalesce_col
FROM events
EMIT CHANGES;
Based on the documentation (https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/scalar-functions/#coalesce), COALESCE returns the first non-null parameter.
The query returns the following:
+-----------------------------------------------+-----------------------------------------------+-----------------------------------------------+
|ID1 |ID2 |COALESCE_COL |
+-----------------------------------------------+-----------------------------------------------+-----------------------------------------------+
|1 |null |null |
|2 |null |null |
|3 |null |null |
Since ID1 is clearly not null and is the first argument of the call, I expected COALESCE to return the same value as ID1, but it returns null. What am I missing?
I am using confluentinc/cp-ksqldb-server:6.1.1 with Avro as the value serde.
EventMessage.avsc:
{
  "type": "record",
  "name": "EventMessage",
  "namespace": "com.example.poc.processor2.avro",
  "fields": [
    {
      "name": "event",
      "type": [
        "com.example.poc.processor2.avro.AcceptedEvent",
        "com.example.poc.processor2.avro.RefundedEvent"
      ]
    }
  ]
}

This is probably a bug in how the data is deserialized, or in the COALESCE function itself.
What ksqlDB version are you running?
How is your data serialized in the topic?
I tried with the JSON format and it worked:
ksql> describe events;
Name : EVENTS
Field | Type
------------------------------------------------------------------------------------
EVENT | STRUCT<ACCEPTEDEVENT STRUCT<ID INTEGER>, REFUNDEDEVENT STRUCT<ID INTEGER>>
------------------------------------------------------------------------------------
ksql> print 'events' from BEGINNING;
Key format: ¯\_(ツ)_/¯ - no data processed
Value format: JSON or KAFKA_STRING
rowtime: 2021/03/24 13:57:27.403 Z, key: <null>, value: {"event":{"acceptedevent":{"id":1}, "refundedevent":{}}}, partition:
ksql> select event->acceptedevent->id, event->refundedevent->id, coalesce(event->acceptedevent->id, event->refundedevent->id) from events emit changes;
+----------------------------------------------------------+----------------------------------------------------------+----------------------------------------------------------+
|ID |ID_1 |KSQL_COL_0 |
+----------------------------------------------------------+----------------------------------------------------------+----------------------------------------------------------+
|1 |null |1 |
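For reference, the behaviour the documentation describes (and that the JSON test above shows) can be written down as a tiny Python sketch. It only illustrates the expected semantics; the helper name `coalesce` is chosen for illustration and this is not how ksqlDB implements the function.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    return next((v for v in values if v is not None), None)


# Rows shaped like the query output: (id1, id2)
rows = [(1, None), (2, None), (3, None)]
coalesced = [coalesce(id1, id2) for id1, id2 in rows]
print(coalesced)  # [1, 2, 3]
```

With the Avro-backed stream in the question, the third column should match this output; the nulls suggest the problem lies elsewhere than in the function's documented contract.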


How to get the two nearest values in spark scala DataFrame

Hi everyone, I'm new to Spark with Scala. I want to find the nearest values within each partition.
For example, in the first group (id 1) value1 is 3, which lies between 2 and 7 in the value2 column. My input looks something like this:
+--------+----------+----------+
|id |value1 |value2 |
+--------+----------+----------+
|1 |3 |1 |
|1 |3 |2 |
|1 |3 |7 |
|2 |4 |2 |
|2 |4 |3 |
|2 |4 |8 |
|3 |5 |3 |
|3 |5 |6 |
|3 |5 |7 |
|3 |5 |8 |
My output should look like this:
+--------+----------+----------+
|id |value1 |value2 |
+--------+----------+----------+
|1 |3 |2 |
|1 |3 |7 |
|2 |4 |3 |
|2 |4 |8 |
|3 |5 |3 |
|3 |5 |6 |
Can someone guide me on how to solve this, please?
Since you appear to want to learn, I've provided pseudocode and references instead of a code answer, so that you can find the answers yourself.
1. Group the rows by (id, value1) and aggregate value2 with collect_list, so that all value2 entries of a group are collected into one array.
2. Select id together with that array, with value1 added to it (concat), and sort the array.
3. Find the position of value1 in the sorted array (array_position).
4. Slice the array, keeping the value just before and the value just after the position found in step 3.
5. If the array has fewer than 3 elements, do error handling.
6. The first and last values of the sliced array are now your 'closest numbers'.
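The steps above can be sketched in plain Python, outside Spark, to check the logic on the sample data. This is a minimal sketch under assumptions: the function name `nearest_neighbours` is made up for illustration, and it uses a binary search on the sorted value2 list instead of inserting value1 into the array, which gives the same neighbours.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def nearest_neighbours(rows):
    """rows: iterable of (id, value1, value2).

    Returns the rows whose value2 is the closest value below or above
    value1 within each (id, value1) group.
    """
    groups = defaultdict(list)
    for id_, value1, value2 in rows:
        groups[(id_, value1)].append(value2)

    result = []
    for (id_, value1), values in groups.items():
        values.sort()
        below = bisect_left(values, value1)   # index of first element >= value1
        above = bisect_right(values, value1)  # index of first element > value1
        if below > 0:
            result.append((id_, value1, values[below - 1]))
        if above < len(values):
            result.append((id_, value1, values[above]))
    return result


rows = [
    (1, 3, 1), (1, 3, 2), (1, 3, 7),
    (2, 4, 2), (2, 4, 3), (2, 4, 8),
    (3, 5, 3), (3, 5, 6), (3, 5, 7), (3, 5, 8),
]
print(nearest_neighbours(rows))
# [(1, 3, 2), (1, 3, 7), (2, 4, 3), (2, 4, 8), (3, 5, 3), (3, 5, 6)]
```

Groups that have no value on one side of value1 simply yield one row here; whether that should be an error is the case step 5 leaves open.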
You will need window functions for this.
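import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{asc, col, lag, lead, min}

// Ordered window: lets lag/lead look at the neighbouring rows within a group.
val window = Window
  .partitionBy("id", "value1")
  .orderBy(asc("value2"))
// Unordered window over the same group: with an ORDER BY the default frame only
// runs up to the current row, so min() would be a running minimum.
val groupWindow = Window.partitionBy("id", "value1")
val result = df
  .withColumn("prev", lag("value2", 1).over(window))
  .withColumn("next", lead("value2", 1).over(window))
  .withColumn("dist_prev", col("value2").minus(col("prev")))
  .withColumn("dist_next", col("next").minus(col("value2")))
  .withColumn("min", min(col("dist_prev")).over(groupWindow))
  .filter(col("dist_prev") === col("min") || col("dist_next") === col("min"))
  .drop("prev", "next", "dist_prev", "dist_next", "min")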
I haven't tested it, so think about it more as an illustration of the concept than a working ready-to-use example.
Here is what's going on here:
First, create a window that describes your grouping rule: we want the rows grouped by the first two columns, and sorted by the third one within each group.
Next, add prev and next columns to the dataframe that contain the value of value2 column from previous and next row within the group respectively. (prev will be null for the first row in the group, and next will be null for the last row – that is ok).
Add dist_prev and dist_next to contain the distance between value2 and prev and next value respectively. (Note that dist_prev for each row will have the same value as dist_next for the previous row).
Find the minimum value for dist_prev within each group, and add it as min column (note, that the minimum value for dist_next is the same by construction, so we only need one column here).
Filter the rows, selecting those that have the minimum value in either dist_next or dist_prev. This finds the tightest pair unless there are multiple rows with the same distance from each other – this case was not accounted for in your question, so we don't know what kind of behavior you want in this case. This implementation will simply return all of these rows.
Finally, drop all extra columns that were added to the dataframe to return it to its original shape.
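To make the steps in this explanation concrete, here is the same lag/lead idea as a plain-Python sketch (the function name `tightest_pairs` is hypothetical, and this is not Spark code). Note that, as the explanation says, it selects the tightest pair of adjacent value2 entries per group, which is not necessarily the pair that surrounds value1.

```python
from collections import defaultdict


def tightest_pairs(rows):
    """rows: iterable of (id, value1, value2). Keeps rows that belong to the
    closest pair(s) of adjacent value2 entries within each (id, value1) group."""
    groups = defaultdict(list)
    for id_, value1, value2 in rows:
        groups[(id_, value1)].append(value2)

    result = []
    for (id_, value1), values in groups.items():
        values.sort()
        prev = [None] + values[:-1]  # like lag(value2, 1)
        nxt = values[1:] + [None]    # like lead(value2, 1)
        dist_prev = [v - p if p is not None else None for v, p in zip(values, prev)]
        dist_next = [n - v if n is not None else None for v, n in zip(values, nxt)]
        known = [d for d in dist_prev if d is not None]
        if not known:
            continue  # single-row group: nothing to compare against
        smallest = min(known)
        for v, dp, dn in zip(values, dist_prev, dist_next):
            if dp == smallest or dn == smallest:
                result.append((id_, value1, v))
    return result


print(tightest_pairs([(1, 3, 1), (1, 3, 2), (1, 3, 7)]))
# [(1, 3, 1), (1, 3, 2)]
```

For the first group of the question this keeps 1 and 2 (distance 1), whereas the expected output lists 2 and 7, so the filter condition would need to take value1 into account to match the question exactly.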

The stream created in ksqlDB shows NULL value

I am trying to create a stream in ksqlDB to read data from a Kafka topic and run queries on it.
CREATE STREAM test_location (
id VARCHAR,
name VARCHAR,
location VARCHAR
)
WITH (KAFKA_TOPIC='public.location',
VALUE_FORMAT='JSON',
PARTITIONS=10);
The data in the topic public.location is in JSON format.
Updated topic message:
print 'public.location' from beginning limit 1;
Key format: ¯\_(ツ)_/¯ - no data processed
Value format: JSON or KAFKA_STRING
rowtime: 2021/05/23 11:27:39.429 Z, key: <null>, value: {"sourceTable":{"id":"1","name":Sam,"location":Manchester,"ConnectorVersion":null,"connectorId":null,"ConnectorName":null,"DbName":null,"DbSchema":null,"TableName":null,"payload":null,"schema":null},"ConnectorVersion":null,"connectorId":null,"ConnectorName":null,"DbName":null,"DbSchema":null,"TableName":null,"payload":null,"schema":null}, partition: 3
After creating the stream, a SELECT on it returns NULL in every column, although the topic contains data.
select * from test_location
>EMIT CHANGES limit 5;
+-----------------------------------------------------------------+-----------------------------------------------------------------+-----------------------------------------------------------------+
|ID |NAME |LOCATION |
+-----------------------------------------------------------------+-----------------------------------------------------------------+-----------------------------------------------------------------+
|null |null |null |
|null |null |null |
|null |null |null |
|null |null |null |
|null |null |null |
Limit Reached
Query terminated
Here are the details from the Docker Compose file:
version: '2'
services:
  ksqldb-server:
    image: confluentinc/ksqldb-server:0.18.0
    hostname: ksqldb-server
    container_name: ksqldb-server
    depends_on:
      - schema-registry
    ports:
      - "8088:8088"
    environment:
      KSQL_LISTENERS: "http://0.0.0.0:8088"
      KSQL_BOOTSTRAP_SERVERS: "broker:29092"
      KSQL_KSQL_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
      KSQL_KSQL_LOGGING_PROCESSING_STREAM_AUTO_CREATE: "true"
      KSQL_KSQL_LOGGING_PROCESSING_TOPIC_AUTO_CREATE: "true"
      # Configuration to embed Kafka Connect support.
      KSQL_CONNECT_GROUP_ID: "ksql-connect-01"
      KSQL_CONNECT_BOOTSTRAP_SERVERS: "broker:29092"
      KSQL_CONNECT_KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      KSQL_CONNECT_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      KSQL_CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
      KSQL_CONNECT_CONFIG_STORAGE_TOPIC: "_ksql-connect-01-configs"
      KSQL_CONNECT_OFFSET_STORAGE_TOPIC: "_ksql-connect-01-offsets"
      KSQL_CONNECT_STATUS_STORAGE_TOPIC: "_ksql-connect-01-statuses"
      KSQL_CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
      KSQL_CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
      KSQL_CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
      KSQL_CONNECT_PLUGIN_PATH: "/usr/share/kafka/plugins"
Update:
Here is a message from the topic, as I see it in Kafka:
{
  "sourceTable": {
    "id": "1",
    "name": Sam,
    "location": Manchester,
    "ConnectorVersion": null,
    "connectorId": null,
    "ConnectorName": null,
    "DbName": null,
    "DbSchema": null,
    "TableName": null,
    "payload": null,
    "schema": null
  },
  "ConnectorVersion": null,
  "connectorId": null,
  "ConnectorName": null,
  "DbName": null,
  "DbSchema": null,
  "TableName": null,
  "payload": null,
  "schema": null
}
Which step or configuration am I missing?
Given your payload, you need to declare the schema as nested, because id, name, and location are not top-level fields in the JSON; they are nested within sourceTable.
CREATE STREAM est_location (
  sourceTable STRUCT<id VARCHAR, name VARCHAR, location VARCHAR>
) WITH (KAFKA_TOPIC='public.location',
  VALUE_FORMAT='JSON');
It's not possible to "unwrap" the data when defining the schema; the schema must match what is in the topic. In addition to sourceTable, you could also add ConnectorVersion etc. to the schema, as they are also top-level fields in your JSON. The bottom line is that a column in ksqlDB can only be declared for a top-level field. Everything else is nested data that you access through the STRUCT type.
Later, when you query est_location, you can refer to individual fields via sourceTable->id etc.
It would also be possible to declare a derived STREAM if you want to unnest the schema:
CREATE STREAM unnested_est_location AS
SELECT sourceTable->id AS id,
sourceTable->name AS name,
sourceTable->location AS location
FROM est_location;
Of course, this would write the data into a new topic.
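Why the flat schema produced only nulls can be illustrated with a small Python sketch. It is a loose analogy rather than ksqlDB's actual deserializer: the helper names `top_level` and `nested` are made up, the string values are quoted so that the sample is valid JSON, and most of the null metadata fields are left out for brevity.

```python
import json

# Message shaped like the one in the question.
message = json.loads("""
{"sourceTable": {"id": "1", "name": "Sam", "location": "Manchester"},
 "ConnectorVersion": null}
""")


def top_level(record, column):
    """What a flat schema (id, name, location) effectively looks up."""
    return record.get(column)


def nested(record, struct, column):
    """What struct->column looks up when the schema declares a STRUCT."""
    return (record.get(struct) or {}).get(column)


columns = ("id", "name", "location")
flat_row = [top_level(message, c) for c in columns]
nested_row = [nested(message, "sourceTable", c) for c in columns]
print(flat_row)    # [None, None, None]
print(nested_row)  # ['1', 'Sam', 'Manchester']
```

The flat lookup finds no top-level key named id, name, or location, which corresponds to the all-null rows in the question.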

Talend : Transform JSON lines to columns, extracting column names from JSON

I have a JSON REST response with a structure somewhat like this one:
{
  "data" : [
    {
      "fields" : [
        { "label" : "John", "value" : "John" },
        { "label" : "Smith", "value" : "/person/4315" },
        { "label" : "43", "value" : "43" },
        { "label" : "London", "value" : "/city/54" }
      ]
    },
    {
      "fields" : [
        { "label" : "Albert", "value" : "Albert" },
        { "label" : "Einstein", "value" : "/person/154" },
        { "label" : "141", "value" : "141" },
        { "label" : "Princeton", "value" : "/city/9541" }
      ]
    }
  ],
  "columns" : ["firstname", "lastname", "age", "city"]
}
I'm looking for a way to transform this data into rows like:
| firstname_label | firstname_value | lastname_label | lastname_value | age_label | age_value | city_label | city_value |
-------------------------------------------------------------------------------------------------------------------------
| John            | John            | Smith          | /person/4315   | 43        | 43        | London     | /city/54   |
| Albert          | Albert          | Einstein       | /person/154    | 141       | 141       | Princeton  | /city/9541 |
Of course the number of columns and their names may change, so I don't know the schema before runtime.
I could probably write Java to handle this, but I'd like to know if there's a more standard way.
I'm new to Talend and spent hours trying, but since my attempts were probably totally wrong I won't describe them here.
Thanks for your help.
Here's a completely dynamic solution I put together.
First, you need to read the json in order to get the column list. Here's what tExtractJSONFields_2 looks like:
Then you store the columns and their positions in a tHashOutput (you need to unhide it in File > Project properties > Designer > Palette settings). In tMap_2, you get the position of the column using a sequence:
Numeric.sequence("s", 1, 1)
The output of this subjob is:
|=-------+--------=|
|position|column |
|=-------+--------=|
|1 |firstname|
|2 |lastname |
|3 |age |
|4 |city |
'--------+---------'
The 2nd step is to read the json again, in order to parse the fields property.
Like in step 1, you need to add a position to each field, relative to the columns. Here's the expression I used to get the sequence:
(Numeric.sequence("s1", 0, 1) % ((Integer)globalMap.get("tHashOutput_1_NB_LINE"))) + 1
Note that I'm using a different sequence name, because sequences keep their value throughout the job. I'm using the number of columns from tHashOutput_1 in order to keep things dynamic.
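The effect of the modulo expression can be checked with a short Python sketch. The function name `positions` is hypothetical, and `itertools.count` merely stands in for Talend's `Numeric.sequence("s1", 0, 1)`; the number of columns plays the role of `tHashOutput_1_NB_LINE`.

```python
from itertools import count


def positions(n_fields, n_columns):
    """Mimic (sequence starting at 0) % n_columns + 1 for n_fields incoming rows."""
    seq = count(0)  # stands in for Numeric.sequence("s1", 0, 1)
    return [(next(seq) % n_columns) + 1 for _ in range(n_fields)]


print(positions(8, 4))  # [1, 2, 3, 4, 1, 2, 3, 4]
```

The position restarts at 1 for every record, so each field lines up with the column stored at the same position.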
Here's the output from this subjob:
|=-------+---------+---------------=|
|position|label |value |
|=-------+---------+---------------=|
|1 |John |John |
|2 |Smith |/person/4315 |
|3 |43 |43 |
|4 |London |/city/54 |
|1 |Albert |Albert |
|2 |Einstein |/person/154 |
|3 |141 |141 |
|4 |Princeton|/city/9541 |
'--------+---------+----------------'
In the last subjob, you need to join the fields data with the columns, using the column position we stored with either one.
In tSplitRow_1 I generate 2 rows for each incoming row. Each row is a key value pair. The first row is <columnName>_label (like firstname_label, lastname_label) its value being the label from the fields. The 2nd row's key is <columnName>_value, and its value is the value from the fields.
Once again, we need to add a position to our data in tMap_4, using this expression:
(Numeric.sequence("s2", 0, 1) / ((Integer)globalMap.get("tHashOutput_1_NB_LINE") * 2)) + 1
Note that since we have twice as many rows coming out of tSplitRow, I multiply the number of columns by 2.
This will attribute the same ID for the data that needs to be on the same row in the output file.
The output of this tMap will be like:
|=-+---------------+-----------=|
|id|col_label |col_value |
|=-+---------------+-----------=|
|1 |firstname_label|John |
|1 |firstname_value|John |
|1 |lastname_label |Smith |
|1 |lastname_value |/person/4315|
|1 |age_label |43 |
|1 |age_value |43 |
|1 |city_label |London |
|1 |city_value |/city/54 |
|2 |firstname_label|Albert |
|2 |firstname_value|Albert |
|2 |lastname_label |Einstein |
|2 |lastname_value |/person/154 |
|2 |age_label |141 |
|2 |age_value |141 |
|2 |city_label |Princeton |
|2 |city_value |/city/9541 |
'--+---------------+------------'
This leads us to the last component tPivotToColumnsDelimited which will pivot our rows to columns using the unique ID.
And the final result is a csv file like:
id;firstname_label;firstname_value;lastname_label;lastname_value;age_label;age_value;city_label;city_value
1;John;John;Smith;/person/4315;43;43;London;/city/54
2;Albert;Albert;Einstein;/person/154;141;141;Princeton;/city/9541
Note that you end up with an extraneous column at the beginning, the row id, which can easily be removed by reading the file back and dropping that column.
I tried adding a new column along with the corresponding fields in the input json, and it works as expected.
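For comparison, the whole transformation can be sketched outside Talend in a few lines of Python. This is only a sketch under assumptions: the function names `pivot` and `to_csv` are made up, and like the job above it relies on each record listing its fields in the same order as the columns array. It omits the row id column.

```python
import csv
import io
import json


def pivot(response):
    """Build the header and one output row per record from the REST response."""
    header = []
    for name in response["columns"]:
        header += [f"{name}_label", f"{name}_value"]
    rows = []
    for record in response["data"]:
        row = []
        for field in record["fields"]:
            row += [field["label"], field["value"]]
        rows.append(row)
    return header, rows


def to_csv(header, rows):
    """Serialise to semicolon-delimited text, like the final file of the job."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=";", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


response = json.loads("""
{"data": [
  {"fields": [{"label": "John", "value": "John"},
              {"label": "Smith", "value": "/person/4315"}]},
  {"fields": [{"label": "Albert", "value": "Albert"},
              {"label": "Einstein", "value": "/person/154"}]}],
 "columns": ["firstname", "lastname"]}
""")
header, rows = pivot(response)
print(to_csv(header, rows))
```

Because the header is derived from the columns array at runtime, adding a column to the input needs no change to the code, which is the same property the dynamic Talend job aims for.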

Casting the Dataframe columns with validation in spark

I need to cast the columns of a DataFrame, whose values are all strings, to the data types of a defined schema.
While casting, we need to put the corrupt records (values of the wrong data type) into a separate column.
Example of Dataframe
+---+----------+-----+
|id |name |class|
+---+----------+-----+
|1 |abc |21 |
|2 |bca |32 |
|3 |abab | 4 |
|4 |baba |5a |
|5 |cccca | |
+---+----------+-----+
Json Schema of the file:
{"definitions":{},"$schema":"http://json-schema.org/draft-07/schema#","$id":"http://example.com/root.json","type":["object","null"],"required":["id","name","class"],"properties":{"id":{"$id":"#/properties/id","type":["integer","null"]},"name":{"$id":"#/properties/name","type":["string","null"]},"class":{"$id":"#/properties/class","type":["integer","null"]}}}
Here row 4 is a corrupt record, because the class column is supposed to be of type integer.
Only this record should end up in the corrupt records, not the 5th row.
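Just check whether the value is NOT NULL before casting and NULL after casting:
import org.apache.spark.sql.functions.when
import spark.implicits._ // enables the $"column" syntax; assumes a SparkSession named spark

df
  .withColumn("class_integer", $"class".cast("integer"))
  .withColumn(
    "class_corrupted",
    // keep the raw value only when it was present but could not be cast
    when($"class".isNotNull and $"class_integer".isNull, $"class"))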
Repeat for each column / cast you need.
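The rule "not null before the cast, null after it" can be tried out on the sample column with a plain-Python sketch. The function name `validate_int` is hypothetical, and treating a blank cell as missing rather than corrupt is an assumption made here to match the requirement that row 5 must not be flagged; in Spark that depends on whether the cell is a real null or an empty string.

```python
def validate_int(raw):
    """Return (cast_value, corrupt_value) for one string cell."""
    if raw is None or raw.strip() == "":
        return None, None  # missing value: not treated as corrupt (assumption)
    try:
        return int(raw.strip()), None
    except ValueError:
        return None, raw   # present but not an integer: corrupt


classes = ["21", "32", " 4 ", "5a", ""]
checked = [validate_int(c) for c in classes]
print(checked)
# [(21, None), (32, None), (4, None), (None, '5a'), (None, None)]
```

Only the fourth value ends up in the corrupt slot, which is the outcome the question asks for.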

SQL Joining two table

I am struggling with what may be the simplest problem ever, but my SQL knowledge is too limited to solve it. I am trying to build an SQL query that shows JobTitle, Note and NoteType. The catch is that the first job doesn't have any note, but it should still appear in the results. System notes should never be displayed. The expected result looks like this:
Result:
--------------------------------------------
|ID |Title |Note |NoteType |
--------------------------------------------
|1 |FirstJob |NULL |NULL |
|2 |SecondJob |CustomNot1|1 |
|2 |SecondJob |CustomNot2|1 |
|3 |ThirdJob |NULL |NULL |
--------------------------------------------
My query (it doesn't work: the third job is not displayed):
SELECT J.ID, J.Title, N.Note, N.NoteType
FROM JOB J
LEFT OUTER JOIN NOTE N ON N.JobId = J.ID
WHERE N.NoteType IS NULL OR N.NoteType = 1
My Tables:
My JOB Table
----------------------
|ID |Title |
----------------------
|1 |FirstJob |
|2 |SecondJob |
|3 |ThirdJob |
----------------------
My NOTE Table
--------------------------------------------
|ID |JobId |Note |NoteType |
--------------------------------------------
|1 |2 |CustomNot1|1 |
|2 |2 |CustomNot2|1 |
|3 |2 |SystemNot1|2 |
|4 |2 |SystemNot3|2 |
|5 |3 |SystemNot1|2 |
--------------------------------------------
This can't be true together (NoteType can't be NULL as well as 1 at the same time):
WHERE N.NoteType IS NULL AND N.NoteType = 1
You may want to use OR instead to check if NoteType is either NULL or 1.
WHERE N.NoteType IS NULL OR N.NoteType = 1
EDIT: With the corrected query, your third job is still not retrieved: its JobId matches a note row, but that row is then filtered out by the WHERE condition.
Try the query below as a workaround to get the third job with NULL values:
SELECT J.ID, J.Title, N.Note, N.NoteType
FROM JOB J
LEFT OUTER JOIN
  -- Filter inside the sub-select, so that jobs without a matching note are still kept
  ( SELECT JobId, Note, NoteType FROM NOTE
    WHERE NoteType IS NULL OR NoteType = 1 ) N
ON N.JobId = J.ID
Just exclude the system notes and use a sub-select:
select * from job j
left outer join (
select * from note where notetype!=2
) n
on j.id=n.jobid;
If you reference the joined table in the WHERE clause, the left outer join may behave like an inner join.
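The difference can be reproduced with Python's built-in sqlite3 module and the tables from the question. This is a sketch, not one of the answers' queries verbatim: the second query moves the filter into the ON clause, which for a left join has the same effect as the sub-select approach.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (id INTEGER, title TEXT);
CREATE TABLE note (id INTEGER, jobid INTEGER, note TEXT, notetype INTEGER);
INSERT INTO job VALUES (1, 'FirstJob'), (2, 'SecondJob'), (3, 'ThirdJob');
INSERT INTO note VALUES
  (1, 2, 'CustomNot1', 1), (2, 2, 'CustomNot2', 1),
  (3, 2, 'SystemNot1', 2), (4, 2, 'SystemNot3', 2), (5, 3, 'SystemNot1', 2);
""")

# Filter applied after the join: ThirdJob matched a system note, so it is dropped.
filter_in_where = conn.execute("""
SELECT j.id, j.title, n.note, n.notetype
FROM job j LEFT OUTER JOIN note n ON n.jobid = j.id
WHERE n.notetype IS NULL OR n.notetype = 1
ORDER BY j.id, n.id
""").fetchall()

# Filter applied as part of the join: ThirdJob finds no match and is kept with NULLs.
filter_in_on = conn.execute("""
SELECT j.id, j.title, n.note, n.notetype
FROM job j LEFT OUTER JOIN note n ON n.jobid = j.id AND n.notetype = 1
ORDER BY j.id, n.id
""").fetchall()

for row in filter_in_on:
    print(row)
```

The first query returns three rows (FirstJob and the two custom notes of SecondJob), while the second returns the four rows of the expected result.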