I'm learning KSQL/ksqlDB and am currently exploring joins. Below is the issue where I'm stuck.
I have one stream, 'DRIVERSTREAMREPARTITIONEDKEYED', and one table, 'COUNTRIES'; their descriptions are below.
ksql> describe DRIVERSTREAMREPARTITIONEDKEYED;
Name: DRIVERSTREAMREPARTITIONEDKEYED
Field | Type
--------------------------------------
COUNTRYCODE | VARCHAR(STRING) (key)
NAME | VARCHAR(STRING)
RATING | DOUBLE
--------------------------------------
ksql> describe countries;
Name : COUNTRIES
Field | Type
----------------------------------------------
COUNTRYCODE | VARCHAR(STRING) (primary key)
COUNTRYNAME | VARCHAR(STRING)
----------------------------------------------
This is the sample data that they have,
ksql> select * from DRIVERSTREAMREPARTITIONEDKEYED emit changes;
+---------------------------------------------+---------------------------------------------+---------------------------------------------+
|COUNTRYCODE |NAME |RATING |
+---------------------------------------------+---------------------------------------------+---------------------------------------------+
|SGP |Suresh |3.5 |
|IND |Mahesh |2.4 |
ksql> select * from countries emit changes;
+---------------------------------------------------------------------+---------------------------------------------------------------------+
|COUNTRYCODE |COUNTRYNAME |
+---------------------------------------------------------------------+---------------------------------------------------------------------+
|IND |INDIA |
|SGP |SINGAPORE |
I'm trying to do a 'left outer' join on them with the stream being on the left side, but below is the output I get,
select d.name,d.rating,c.COUNTRYNAME from DRIVERSTREAMREPARTITIONEDKEYED d left join countries c on d.COUNTRYCODE=c.COUNTRYCODE emit changes;
+---------------------------------------------+---------------------------------------------+---------------------------------------------+
|NAME |RATING |COUNTRYNAME |
+---------------------------------------------+---------------------------------------------+---------------------------------------------+
|Suresh |3.5 |null |
|Mahesh |2.4 |null |
In the ideal scenario I should get data in the 'COUNTRYNAME' column, since the 'COUNTRYCODE' column in both the stream and the table has matching values.
I tried searching a lot but to no avail.
I'm using 'Confluent Platform: 6.1.1'
For a join to work, it is our responsibility to ensure that the keys of both entities being joined lie in the same partitions; ksqlDB can't verify whether the partitioning strategies are the same for both join inputs.
In my case, my 'Drivers' topic had 2 partitions, on which I had created a stream 'DriversStream' that in turn also had 2 partitions, but the table 'Countries' I wanted to join it with had only 1 partition. Because of this I re-keyed 'DriversStream' and created another stream, 'DRIVERSTREAMREPARTITIONEDKEYED', shown in the question.
But the data of the table and the stream still did not end up in the same partitions, hence the join was failing.
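The co-partitioning requirement can be sketched in a few lines of Python. This is only an illustration: Kafka's default partitioner actually uses murmur2, and here a CRC32 hash stands in for it, but the arithmetic is the same shape, so it shows why differing partition counts break key-based joins.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for Kafka's default partitioner (Kafka itself uses
    # murmur2); any deterministic hash illustrates the point.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# With 1 partition everything lands in partition 0, but with 2
# partitions the same key may hash to partition 1, so records for
# the same key can sit in different partition numbers across topics.
for key in ["SGP", "IND"]:
    print(key, partition_for(key, 2), partition_for(key, 1))
```

Because the partition number depends on the total partition count, two topics with different counts cannot be assumed co-partitioned, which is why ksqlDB requires matching partition counts for joins.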
I created another topic with 1 partition 'DRIVERINFO'.
kafka-topics --bootstrap-server localhost:9092 --create --partitions 1 --replication-factor 1 --topic DRIVERINFO
Then created a stream over it 'DRIVERINFOSTREAM'.
CREATE STREAM DRIVERINFOSTREAM (NAME STRING, RATING DOUBLE, COUNTRYCODE STRING) WITH (KAFKA_TOPIC='DRIVERINFO', VALUE_FORMAT='JSON');
Finally, I joined it with the 'COUNTRIES' table, and the join worked.
ksql> select d.name,d.rating,c.COUNTRYNAME from DRIVERINFOSTREAM d left join countries c on d.COUNTRYCODE=c.COUNTRYCODE EMIT CHANGES;
+-------------------------------------------+-------------------------------------------+-------------------------------------------+
|NAME |RATING |COUNTRYNAME |
+-------------------------------------------+-------------------------------------------+-------------------------------------------+
|Suresh |2.4 |SINGAPORE |
|Mahesh |3.6 |INDIA |
Refer to below links for details,
KSQL join
Partitioning data for Joins
How to create a KSQL stream listening on the TOPIC T where the JSON structure of the message is:
{"k":"1","a":{"b":1,"c":{"d":10}}}
I tried the following and it does not work. Getting a syntax error.
create stream s (k VARCHAR,a STRUCT <b INT ,c <STRUCT d INT >> )
with (KAFKA_TOPIC='T',VALUE_FORMAT='JSON',KEY='k')
Here's an example of how to do it. I've got the test data in a topic:
ksql> PRINT test FROM BEGINNING;
Format:JSON
{"ROWTIME":1578571011016,"ROWKEY":"null","k":"1","a":{"b":1,"c":{"d":10}}}
^CTopic printing ceased
Declare the stream:
ksql> CREATE STREAM TEST (K VARCHAR,
A STRUCT<B INT,
C STRUCT<D INT>>
) WITH (KAFKA_TOPIC='test',
VALUE_FORMAT='JSON');
Message
----------------
Stream created
----------------
Query the stream:
ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' to 'earliest'. Use the UNSET command to revert your change.
ksql> SELECT * FROM TEST EMIT CHANGES LIMIT 1;
+---------------------+----------+-----+---------------------+
|ROWTIME |ROWKEY |K |A |
+---------------------+----------+-----+---------------------+
|1578571011016 |null |1 |{B=1, C={D=10}} |
Limit Reached
Query terminated
ksql> SELECT K, A, A->B, A->C, A->C->D FROM TEST EMIT CHANGES LIMIT 1;
+----+-----------------+-------+---------+----------+
|K |A |A__B |A__C |A__C__D |
+----+-----------------+-------+---------+----------+
|1 |{B=1, C={D=10}} |1 |{D=10} |10 |
Limit Reached
Query terminated
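As a side note, KSQL's `->` operator on STRUCT columns maps directly onto nested access in the source JSON. A quick Python sketch of the same lookups on the sample message:

```python
import json

# The same nested message as in the topic above.
msg = json.loads('{"k":"1","a":{"b":1,"c":{"d":10}}}')

# KSQL's A->B, A->C, and A->C->D are chained dict lookups:
print(msg["a"]["b"])       # A->B    -> 1
print(msg["a"]["c"])       # A->C    -> {'d': 10}
print(msg["a"]["c"]["d"])  # A->C->D -> 10
```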
I configured a connection to the database, and all the data is transferred over the topic, because when I run the consumer it returns data.
How can I transform this topic into a table and persist the data inside KSQL?
thanks very much
You don't persist data in KSQL. KSQL is simply an engine for querying and transforming data in Kafka. The source for KSQL queries is one or more Kafka topics, and the output of KSQL queries is either interactive, or back out to another Kafka topic.
If you have data in your Kafka topics (which it sounds like you do) then in KSQL run LIST TOPICS;:
ksql> LIST TOPICS;
Kafka Topic | Registered | Partitions | Partition Replicas | Consumers | ConsumerGroups
---------------------------------------------------------------------------------------------------------
_confluent-metrics | false | 12 | 1 | 0 | 0
asgard.demo.accounts | false | 1 | 1 | 0 | 0
This lists your Kafka topics. From there, pick your topic, and you can run PRINT 'my-topic' FROM BEGINNING;
ksql> PRINT 'asgard.demo.accounts' FROM BEGINNING;
Format:AVRO
10/11/18 9:24:45 AM UTC, null, {"account_id": "a42", "first_name": "Robin", "last_name": "Moffatt", "email": "robin@confluent.io", "phone": "+44 123 456 789", "address": "22 Acacia Avenue", "country": "United Kingdom", "create_ts": "2018-10-11T09:23:22Z", "update_ts": "2018-10-11T09:23:22Z", "messagetopic": "asgard.demo.accounts", "messagesource": "Debezium CDC from MySQL on asgard"}
10/11/18 9:24:45 AM UTC, null, {"account_id": "a081", "first_name": "Sidoney", "last_name": "Lafranconi", "email": "slafranconi0@cbc.ca", "phone": "+44 908 687 6649", "address": "40 Kensington Pass", "country": "United Kingdom", "create_ts": "2018-10-11T09:23:22Z", "update_ts": "2018-10-11T09:23:22Z", "messagetopic": "asgard.demo.accounts", "messagesource": "Debezium CDC from MySQL on asgard"}
10/11/18 9:24:45 AM UTC, null, {"account_id": "a135", "first_name": "Mick", "last_name": "Edinburgh", "email": "medinburgh1@eepurl.com", "phone": "+44 301 837 6535", "address": "27 Blackbird Lane", "country": "United Kingdom", "create_ts": "2018-10-11T09:23:22Z", "update_ts": "2018-10-11T09:23:22Z", "messagetopic": "asgard.demo.accounts", "messagesource": "Debezium CDC from MySQL on asgard"}
to see the contents of it. Press Ctrl-C to cancel the PRINT statement and return to the command line.
Note the Format on the output of the PRINT statement. This is the serialisation format of your data.
If the data's serialised in Avro then you can run:
CREATE STREAM mydata WITH (KAFKA_TOPIC='asgard.demo.accounts', VALUE_FORMAT='AVRO');
If it's in JSON, you'll need to also specify the column names and datatypes:
CREATE STREAM mydata (col1 INT, col2 VARCHAR) WITH (KAFKA_TOPIC='asgard.demo.accounts', VALUE_FORMAT='JSON');
Now that you've 'registered' this topic with KSQL, you can view its schema with DESCRIBE:
ksql> DESCRIBE mydata;
Name : MYDATA
Field | Type
-------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
ACCOUNT_ID | VARCHAR(STRING)
FIRST_NAME | VARCHAR(STRING)
LAST_NAME | VARCHAR(STRING)
EMAIL | VARCHAR(STRING)
PHONE | VARCHAR(STRING)
ADDRESS | VARCHAR(STRING)
COUNTRY | VARCHAR(STRING)
CREATE_TS | VARCHAR(STRING)
UPDATE_TS | VARCHAR(STRING)
MESSAGETOPIC | VARCHAR(STRING)
MESSAGESOURCE | VARCHAR(STRING)
-------------------------------------------
and then use KSQL to query and manipulate the data:
ksql> SET 'auto.offset.reset'='earliest';
ksql> SELECT FIRST_NAME + ' ' + LAST_NAME AS FULL_NAME, EMAIL FROM mydata WHERE COUNTRY='United Kingdom';
Robin Moffatt | robin@confluent.io
Sidoney Lafranconi | slafranconi0@cbc.ca
Mick Edinburgh | medinburgh1@eepurl.com
Merrill Stroobant | mstroobant2#china.com.cn
Press Ctrl-C to cancel a SELECT query.
KSQL can persist this to a new Kafka topic:
CREATE STREAM UK_USERS AS SELECT FIRST_NAME + ' ' + LAST_NAME AS FULL_NAME, EMAIL FROM mydata WHERE COUNTRY='United Kingdom';
If you list your KSQL topics again, you'll see the new one created and populated:
ksql> LIST TOPICS;
Kafka Topic | Registered | Partitions | Partition Replicas | Consumers | ConsumerGroups
---------------------------------------------------------------------------------------------------------
_confluent-metrics | false | 12 | 1 | 0 | 0
asgard.demo.accounts | true | 1 | 1 | 2 | 2
UK_USERS | true | 4 | 1 | 0 | 0
---------------------------------------------------------------------------------------------------------
ksql>
Every event coming into the source topic (asgard.demo.accounts) gets read and filtered by KSQL and written to the target topic (UK_USERS) based on the SQL you've executed.
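Conceptually (a plain-Python sketch with made-up sample events, not anything ksqlDB itself runs), that continuous query applies a projection and a predicate to every event as it arrives:

```python
# Hypothetical events shaped like the accounts topic above.
events = [
    {"first_name": "Robin", "last_name": "Moffatt",
     "email": "robin@confluent.io", "country": "United Kingdom"},
    {"first_name": "Ana", "last_name": "Silva",
     "email": "ana@example.com", "country": "Brazil"},
]

# SELECT FIRST_NAME + ' ' + LAST_NAME AS FULL_NAME, EMAIL
# ...  WHERE COUNTRY = 'United Kingdom'
uk_users = [
    {"full_name": e["first_name"] + " " + e["last_name"],
     "email": e["email"]}
    for e in events
    if e["country"] == "United Kingdom"
]
print(uk_users)
```

The difference is that the KSQL version never terminates: it keeps applying the same filter to each new event and writes each surviving row to the output topic.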
For more info see the KSQL syntax docs, and tutorials.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project.
I am working with the following two tables;
Table 1
Key |Clicks |Impressions
-------------+-------+-----------
USA-SIM-CARDS|55667 |544343
DE-SIM-CARDS |4563 |234829
AU-SIM-CARDS |3213 |232242
UK-SIM-CARDS |3213 |1333223
CA-SIM-CARDS |4321 |8883111
MX-SIM-CARDS |3193 |3291023
Table 2
Key |Conversions |Final Conversions|Active Sims
-----------------+------------+-----------------+-----------
USA-SIM-CARDS |456 |43 |4
USA-SIM-CARDS |65 |2 |1
UK-SIM-CARDS |123 |4 |3
UK-SIM-CARDS |145 |34 |5
The goal is to get the following output;
Key |Clicks |Impressions|Conversions|Final Conversions|Active Sims
-------------+-------+-----------+-----------+-----------------+-----------
USA-SIM-CARDS|55667 |544343 |521 |45 |5
DE-SIM-CARDS |4563 |234829 | | |
AU-SIM-CARDS |3213 |232242 | | |
UK-SIM-CARDS |3213 |1333223 |268 |38 |8
CA-SIM-CARDS |4321 |8883111 | | |
MX-SIM-CARDS |3193 |3291023 | | |
The most crucial part of this involves aggregating the second table on the conversion columns; I imagine I would then execute this with a join.
Thank you.
Take this in two steps then:
1) Aggregate the second table:
SELECT Key, sum(Conversions) as Conversions, sum("Final Conversions") as FinalConversions, Sum("Active Sims") as ActiveSims FROM Table2 GROUP BY key
2) Use that as a subquery/derived table joining to your first table:
SELECT
t1.key,
t1.clicks,
t1.impressions,
t2.conversions,
t2.finalConversions,
t2.ActiveSims
From Table1 t1
LEFT OUTER JOIN (SELECT Key, sum(Conversions) as Conversions, sum("Final Conversions") as FinalConversions, sum("Active Sims") as ActiveSims FROM Table2 GROUP BY Key) t2
ON t1.key = t2.key;
As an alternative, you could join and then group by as well since there isn't any need to aggregate twice or anything:
SELECT
t1.key,
t1.clicks,
t1.impressions,
sum(Conversions) as Conversions,
sum("Final Conversions") as FinalConversions,
Sum("Active Sims") as ActiveSims
From Table1 t1
LEFT OUTER JOIN table2 t2
ON t1.key = t2.key
GROUP BY t1.key, t1.clicks, t1.impressions
The only other important thing here is that we are using a LEFT OUTER JOIN, since we want all records from Table1 and any records from Table2 that match on the key.
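To sanity-check the aggregate-then-join shape, here is a small SQLite reproduction of the sample data (column names adapted to avoid spaces; the database engine in the question is unstated, so this is just a portable illustration of the query structure):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (key TEXT, clicks INT, impressions INT);
CREATE TABLE table2 (key TEXT, conversions INT,
                     final_conversions INT, active_sims INT);
INSERT INTO table1 VALUES
  ('USA-SIM-CARDS', 55667, 544343),
  ('DE-SIM-CARDS',  4563,  234829),
  ('UK-SIM-CARDS',  3213,  1333223);
INSERT INTO table2 VALUES
  ('USA-SIM-CARDS', 456, 43, 4),
  ('USA-SIM-CARDS', 65,  2,  1),
  ('UK-SIM-CARDS',  123, 4,  3),
  ('UK-SIM-CARDS',  145, 34, 5);
""")

# Step 1 (the subquery) aggregates table2 per key; step 2 left-joins
# the aggregate onto table1, leaving NULLs for keys with no match.
rows = conn.execute("""
SELECT t1.key, t1.clicks, t1.impressions,
       t2.conversions, t2.final_conversions, t2.active_sims
FROM table1 t1
LEFT JOIN (
    SELECT key,
           SUM(conversions)       AS conversions,
           SUM(final_conversions) AS final_conversions,
           SUM(active_sims)       AS active_sims
    FROM table2
    GROUP BY key
) t2 ON t1.key = t2.key
ORDER BY t1.key
""").fetchall()
for r in rows:
    print(r)
```

The USA row comes out as (55667, 544343, 521, 45, 5) and the DE row keeps NULLs in the aggregated columns, matching the desired output.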
I have two tables, table A has ID column whose values are comma separated, each of those ID value has a representation in table B.
Table A
+------+-------+
| Name | ID    |
+------+-------+
| A1   | 1,2,3 |
| A2   | 2     |
| A3   | 3,2   |
+------+-------+
Table B
+-------------------+
| ID | Value |
+-------------------+
| 1 | Apple |
| 2 | Orange |
| 3 | Mango |
+-------------------+
I was wondering if there is an efficient way to do a select where the result would be as below:
Name, Value
A1 Apple, Orange, Mango
A2 Orange
A3 Mango, Orange
Any suggestions would be welcome. Thanks.
You need to first "normalize" table_a into a new table using the following:
select name, regexp_split_to_table(id, ',') id
from table_a;
The result of this can be joined to table_b and the result of the join then needs to be grouped in order to get the comma separated list of the names:
select a.name, string_agg(b.value, ',')
from (
select name, regexp_split_to_table(id, ',') id
from table_a
) a
JOIN table_b b on b.id = a.id
group by a.name;
SQLFiddle: http://sqlfiddle.com/#!12/77fdf/1
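The same normalize-join-aggregate pipeline can be sketched in plain Python against the sample data, with `re.split` standing in for `regexp_split_to_table` and the final string join standing in for `string_agg`:

```python
import re

table_a = [("A1", "1,2,3"), ("A2", "2"), ("A3", "3,2")]
table_b = {1: "Apple", 2: "Orange", 3: "Mango"}

# "Normalize": split each comma-separated id list into one row per id,
# join each id against table_b, and re-aggregate the values per name.
result = {}
for name, ids in table_a:
    for raw_id in re.split(r"\s*,\s*", ids):
        result.setdefault(name, []).append(table_b[int(raw_id)])

for name, values in result.items():
    print(name, ",".join(values))  # e.g. A1 Apple,Orange,Mango
```

Note that, as in the SQL version, the output order of values per name follows the order of the ids in the original list (so A3 yields "Mango,Orange").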
There are two regex related functions that can be useful:
http://www.postgresql.org/docs/current/static/functions-string.html
regexp_split_to_table()
regexp_split_to_array()
Code below is untested, but you'd use something like it to match A and B (note that regexp_split_to_array has no 'g' flag; splitting is already global):
select name, value
from A
join B on B.id = ANY(regexp_split_to_array(A.id, E'\\s*,\\s*')::int[]);
You can then use array_agg(value), grouping by name, and format using array_to_string().
Two notes, though:
It won't be as efficient as normalizing things.
The formatting itself ought to be done further down, in your views.
I have a table as follows
|GroupID | UserID |
--------------------
|1 | 1 |
|1 | 2 |
|1 | 3 |
|2 | 1 |
|2 | 2 |
|3 | 20 |
|3 | 30 |
|5 | 200 |
|5 | 100 |
Basically what this does is create a "group" which user IDs get associated with, so when I wish to request members of a group I can call on the table.
Users have the option of leaving a group, and creating a new one.
When all users have left a group, there's no longer that groupID in my table.
Let's pretend this is for a chat application where users might close and open chats constantly; the group IDs will add up very quickly, but the number of chats will realistically not reach millions of chats with hundreds of users.
I'd like to recycle the group ID numbers, such that when I go to insert a new record, if group 4 is unused (as is the case above), it gets assigned.
There are good reasons not to do this, but it's pretty straightforward in PostgreSQL. The technique--using generate_series() to find gaps in a sequence--is useful in other contexts, too.
WITH group_id_range AS (
SELECT generate_series((SELECT MIN(group_id) FROM groups),
(SELECT MAX(group_id) FROM groups)) group_id
)
SELECT min(gir.group_id)
FROM group_id_range gir
LEFT JOIN groups g ON (gir.group_id = g.group_id)
WHERE g.group_id IS NULL;
That query will return NULL if there are no gaps or if there are no rows at all in the table "groups". If you want to use this to return the next group id number regardless of the state of the table "groups", use this instead.
WITH group_id_range AS (
SELECT generate_series(
(COALESCE((SELECT MIN(group_id) FROM groups), 1)),
(COALESCE((SELECT MAX(group_id) FROM groups), 1))
) group_id
)
SELECT COALESCE(min(gir.group_id), (SELECT MAX(group_id)+1 FROM groups))
FROM group_id_range gir
LEFT JOIN groups g ON (gir.group_id = g.group_id)
WHERE g.group_id IS NULL;
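For clarity, here is the gap-finding logic the two queries implement, mirrored in plain Python: scan the occupied id range for the first unused id, falling back to max+1 when there is no gap (and to 1 for an empty table).

```python
def next_group_id(existing):
    # Mirrors the generate_series query: scan min..max for the first
    # unused id; fall back to max+1 (or 1 for an empty table).
    if not existing:
        return 1
    used = set(existing)
    for candidate in range(min(used), max(used) + 1):
        if candidate not in used:
            return candidate
    return max(used) + 1

print(next_group_id([1, 2, 3, 5]))  # 4 is the gap
print(next_group_id([1, 2, 3]))     # no gap, so max+1
```

Like the SQL version, this scans from the minimum existing id, so it will not hand out ids below the current minimum.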