How can there be duplicate IDs in a KSQL table? - apache-kafka

I'm playing around with Confluent Community and a Postgres database and am running into the following issue.
The events flow well into kafka and the topics are created. I created a stream out of a topic and rekeyed it because the key was null.
Out of that new topic underlying the rekeyed stream, I created a table. The goal is to have a constantly up to date table of objects (here categories).
The thing is that the table never gets updated with the new data when I do a manual UPDATE in the database. The rows just keep being added like it's a stream. Of course, I did the select again because I know that the 'update' rows show up when we're still running the query.
ksql> select * from categories;
1568287458487 | 1 | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales
1568287458487 | 2 | 2 | Condiments | Sweet and savory sauces, relishes, spreads, and seasonings
1568287458488 | 3 | 3 | Confections | Desserts, candies, and sweet breads
1568287458488 | 4 | 4 | Dairy Products | Cheeses
1568287458488 | 5 | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal
1568287458488 | 6 | 6 | Meat/Poultry | Prepared meats
1568287458489 | 7 | 7 | Produce | Dried fruit and bean curd
1568287458489 | 8 | 8 | Seafood | Seaweed and fish
1568288647248 | 8 | 8 | Seafood2 | Seaweed and fish
1568290562250 | 1 | 1 | asdf | Soft drinks, coffees, teas, beers, and ales
1568296165250 | 8 | 8 | Seafood3 | Seaweed and fish
1568296704747 | 8 | 8 | Seafood4 | Seaweed and fish
^CQuery terminated
ksql> select * from categories;
1568287458487 | 1 | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales
1568287458487 | 2 | 2 | Condiments | Sweet and savory sauces, relishes, spreads, and seasonings
1568287458488 | 3 | 3 | Confections | Desserts, candies, and sweet breads
1568287458488 | 4 | 4 | Dairy Products | Cheeses
1568287458488 | 5 | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal
1568287458488 | 6 | 6 | Meat/Poultry | Prepared meats
1568287458489 | 7 | 7 | Produce | Dried fruit and bean curd
1568287458489 | 8 | 8 | Seafood | Seaweed and fish
1568288647248 | 8 | 8 | Seafood2 | Seaweed and fish
1568290562250 | 1 | 1 | asdf | Soft drinks, coffees, teas, beers, and ales
1568296165250 | 8 | 8 | Seafood3 | Seaweed and fish
1568296704747 | 8 | 8 | Seafood4 | Seaweed and fish
^CQuery terminated
ksql>
Categories table in postgres:
CREATE TABLE categories (
category_id smallint NOT NULL,
category_name character varying(15) NOT NULL,
description text
);
categories table in KSQL:
ksql> describe extended categories;
Name : CATEGORIES
Type : TABLE
Key field : CATEGORY_ID_ST
Key format : STRING
Timestamp field : Not set - using <ROWTIME>
Value format : AVRO
Kafka topic : categories_rk (partitions: 1, replication: 1)
Field | Type
--------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
CATEGORY_ID_ST | VARCHAR(STRING)
CATEGORY_NAME | VARCHAR(STRING)
DESCRIPTION | VARCHAR(STRING)
MESSAGETOPIC | VARCHAR(STRING)
MESSAGESOURCE | VARCHAR(STRING)
--------------------------------------------
How is it possible that a table that is supposed to have a unique ROWKEY keeps adding more 'update' rows with the same ROWKEY?
I'm actually expecting the table to display an always up-to-date list of categories, as stated in https://www.youtube.com/watch?v=DPGn-j7yD68&list=PLa7VYi0yPIH2eX8q3mPpZAn3qCS1eDX8W&index=9:
"A TABLE is a materialized view of events with only the latest values for each key". But maybe I misunderstood that?

A table in KSQL is constantly being updated as new data arrives. The output topic that the table's rows are written to is known as a changelog: it is an immutable log of changes to the table. If a specific key is updated multiple times, then the output topic will contain multiple messages for the same key. Each new value replaces the last.
When you run a query such as:
select * from categories;
in the version of ksql you're using, you're not running traditional query, like you'd expect in a traditional RDBS. Such a query would give you the current set of rows in the table. In ksql the above query will stream all the updates to the rows as they're occuring. Hence, if the same key is updated multiple times; you'll see the same key output by the query multiple times.
In more recent versions of ksqlDB the above query would not be written:
select * from categories emit changes;
Inside ksql each key is only stored in the materialized table only once, and it's always the most recent version seen.

Related

Insert a record for evey row from one table into another using one field in postesql

I'm trying to fill a table with data to test a system.
I have two tables
User
+----+----------+
| id | name |
+----+----------+
| 1 | Majikaja |
| 2 | User 2 |
| 3 | Markus |
+----+----------+
Goal
+----+----------+---------+
| id | goal | user_id |
+----+----------+---------+
I want to insert into goal one record for every user only using their IDs (they have to exists) and some fixed or random value.
I was thinking in something like this:
INSERT INTO Goal (goal, user_id) values ('Fixed value', select u.id from user u)
So it will generate:
Goal
+----+-------------+---------+
| id | goal | user_id |
+----+-------------+---------+
| 1 | Fixed value | 1 |
| 2 | Fixed value | 2 |
| 3 | Fixed value | 3 |
+----+-------------+---------+
I could just write a simple PHP script to achieve it but I wonder if is it possible to do using raw SQL only.

KSQL Table which shows last recent non-null value

At the moment I have a stream with several sensor data, which send their status code once when they update themselves.
This is a one-time value, then the sensor value is zero again until something changes again. So in my table the last value should replace the zero values until a new value is delivered. Currently i create my table like this:
CREATE TABLE LRS WITH
(KAFKA_TOPIC='lrs', KEY_FORMAT='DELIMITED', PARTITIONS=6, REPLICAS=3)
AS SELECT
Device,
LATEST_BY_OFFSET(CAST(Sensor1 AS DOUBLE)),
LATEST_BY_OFFSET(CAST(Sensor2 AS DOUBLE))
FROM RELEVANT_VALUES RELEVANT_VALUES
WINDOW TUMBLING ( SIZE 10 SECONDS )
GROUP BY Device
So instead of behaving like this:
Device | Sensor1 | Sensor2 | Timestamp
1 | null | null | 05:00am
1 | 3 | 2 | 05:01am
1 | null | null | 05:02am
1 | null | null | 05:03am
1 | 2 | 1 | 05:04am
1 | null | null | 05:05am
it should look like this while updating the values:
Device | Sensor1 | Sensor2 | window
1 | null | null | 05:00-01
1 | 3 | 2 | 05:01-02
1 | 3 | 2 | 05:02-03
1 | 3 | 2 | 05:03-04
1 | 2 | 1 | 05:04-05
1 | 2 | 1 | 05:05-06
I basically want to create a Table which always show the latest sent value, which is not null.
Is there a way to achieve this using KSQL ?
You can always add a filter before if you are using streams or with ksql you can do something like WHERE Sensor1 IS NOT NULL

Thread safety in relational database triggers

I have a question regarding the thread-safety of trigger operations in relational databases like mariadb or mysql
Imagine a table structure like
+----+-------+----------+--------+
| ID | NAME | CATEGORY | OFFSET |
+----+-------+----------+--------+
| 1 | name1 | CAT_1 | 0 |
+----+-------+----------+--------+
| 2 | name2 | CAT_1 | 1 |
+----+-------+----------+--------+
| 3 | name3 | CAT_2 | 0 |
+----+-------+----------+--------+
| 4 | name4 | CAT_1 | 2 |
+----+-------+----------+--------+
| 5 | name5 | CAT_2 | 1 |
+----+-------+----------+--------+
Please note the value of column OFFSET in relation to CATEGORY. The offset increases by 1 everytime a record of a particular type is inserted.
For example the next record with id = 6 of type CAT_1 will have the value 3 for offset
and a record with id = 7 of type CAT_2 will have offset = 2
New records will be inserted via a rest API and the id and offset needs to returned in the response.
Now this process needs to have thread safety i.e no two records (even if invoked concurrently via HTTP request to the API) of the same category should have the same offset value.
One way I thought of doing this is via a before_insert trigger where it would read the last offset value of the to_be_inserted category and insert the new record with a +1.
What I am unsure about is if this process is thread-safe.
Can it result in a situation where two simultaneous inserts of same category will execute triggers that will read the same previous offset value and calculate the same current offset ?
If yes then what would be a thread-safe way of doing it ?
Any help would be greatly appreciated

What exactly is a wide column store?

Googling for a definition either returns results for a column oriented DB or gives very vague definitions.
My understanding is that wide column stores consist of column families which consist of rows and columns. Each row within said family is stored together on disk. This sounds like how row oriented databases store their data. Which brings me to my first question:
How are wide column stores different from a regular relational DB table? This is the way I see it:
* column family -> table
* column family column -> table column
* column family row -> table row
This image from Database Internals simply looks like two regular tables:
The guess I have as to what is different comes from the fact that "multi-dimensional map" is mentioned along side wide column stores. So here is my second question:
Are wide column stores sorted from left to right? Meaning, in the above example, are the rows sorted first by Row Key, then by Timestamp, and finally by Qualifier?
Let's start with the definition of a wide column database.
Its architecture uses (a) persistent, sparse matrix, multi-dimensional
mapping (row-value, column-value, and timestamp) in a tabular format
meant for massive scalability (over and above the petabyte scale).
A relational database is designed to maintain the relationship between the entity and the columns that describe the entity. A good example is a Customer table. The columns hold values describing the Customer's name, address, and contact information. All of this information is the same for each and every customer.
A wide column database is one type of NoSQL database.
Maybe this is a better image of four wide column databases.
My understanding is that the first image at the top, the Column model, is what we called an entity/attribute/value table. It's an attribute/value table within a particular entity (column).
For Customer information, the first wide-area database example might look like this.
Customer ID Attribute Value
----------- --------- ---------------
100001 name John Smith
100001 address 1 10 Victory Lane
100001 address 3 Pittsburgh, PA 15120
Yes, we could have modeled this for a relational database. The power of the attribute/value table comes with the more unusual attributes.
Customer ID Attribute Value
----------- --------- ---------------
100001 fav color blue
100001 fav shirt golf shirt
Any attribute that a marketer can dream up can be captured and stored in an attribute/value table. Different customers can have different attributes.
The Super Column model keeps the same information in a different format.
Customer ID: 100001
Attribute Value
--------- --------------
fav color blue
fav shirt golf shirt
You can have as many Super Column models as you have entities. They can be in separate NoSQL tables or put together as a Super Column family.
The Column Family and Super Column family simply gives a row id to the first two models in the picture for quicker retrieval of information.
Most (if not all) Wide-column stores are indeed row-oriented stores in that every parts of a record are stored together. You can see that as a 2-dimensional key-value store. The first part of the key is used to distribute the data across servers, the second part of the key lets you quickly find the data on the target server.
Wide-column stores will have different features and behaviors. However, Apache Cassandra, for example, allows you to define how the data will be sorted. Take this table for example:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | US | 2020-10-01 | "a..." |
| 1 | JP | 2020-11-01 | "b..." |
| 1 | US | 2020-09-01 | "c..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
If your partitioning key is (id) and your clustering key is (country, timestamp), the data will be stored like this:
[Key 1]
1:JP,2020-11-01,"b..." | 1:US,2020-09-01,"c..." | 1:US,2020-10-01,"a..."
[Key2]
2:CA,2019-10-01,"e..." | 2:CA,2020-10-01,"d..." | 2:CA,2020-11-01,"f..."
[Key3]
3:GB,2020-09-01,"g..." | 3:GB,2020-09-02,"h..."
Or in table form:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | JP | 2020-11-01 | "b..." |
| 1 | US | 2020-09-01 | "c..." |
| 1 | US | 2020-10-01 | "a..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
If you change the primary key (composite of partitioning and clustering key) to (id, timestamp) WITH CLUSTERING ORDER BY (timestamp DESC) (id is the partitioning key, timestamp is the clustering key in descending order), the result would be:
[Key 1]
1:US,2020-09-01,"c..." | 1:US,2020-10-01,"a..." | 1:JP,2020-11-01,"b..."
[Key2]
2:CA,2019-10-01,"e..." | 2:CA,2020-10-01,"d..." | 2:CA,2020-11-01,"f..."
[Key3]
3:GB,2020-09-01,"g..." | 3:GB,2020-09-02,"h..."
Or in table form:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | US | 2020-09-01 | "c..." |
| 1 | US | 2020-10-01 | "a..." |
| 1 | JP | 2020-11-01 | "b..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|

PostgreSQL simple count query

Trying to scale this down so the answer is simple. I can probably extrapolate the answers here to apply to a bigger data set.
Given the following table:
+------+-----+
| name | age |
+------+-----+
| a | 5 |
| b | 7 |
| c | 8 |
| d | 8 |
| e | 10 |
+------+-----+
I want to make a table that shows the count of people where their age is equal to or greater than x. For instance, the table about would produce:
+--------------+-------+
| at least age | count |
+--------------+-------+
| 5 | 5 |
| 6 | 4 |
| 7 | 4 |
| 8 | 3 |
| 9 | 1 |
| 10 | 1 |
+--------------+-------+
Is there a single query that can accomplish this task? Obviously, it is easy to write a simple function for it, but I'm hoping to be able to do this quickly with one query.
Thanks!
Yes, what you're looking for is a window function.
with cte_age_count as (
select age,
count(*) c_star
from people
group by age)
select age,
sum(c_star) over (order by age
range between unbounded preceding
and current row)
from cte_age_count
Not syntax checked ... let me know if it works!