I have a stream of data in a topic that should be treated as a ksqlDB table (only the last value for a given key matters). This data describes updates to specific fields of records in another topic. Is there any way in ksqlDB to process a stream that updates values in another stream/table/topic? The target topic holds entities with, say, 20 fields, but each update carries only 3 of them, so I want to update just those 3 fields while the other 17 remain unchanged in the target topic (treated as a table).
You can solve this with a LEFT JOIN plus a small adjustment. The sample below builds an employee table in which only the skill and level fields need to be updated from another table.
1. Create the table from the source topic:
CREATE TABLE TBL_EMPLOYEE( `employee_id` VARCHAR, `name` varchar, `lastName` varchar, `age` INT, `skill` VARCHAR, `level` VARCHAR ) WITH ( KAFKA_TOPIC = 'employee-topic-input', PARTITIONS = 3, VALUE_FORMAT = 'JSON', KEY = '`employee_id`');
2. Create the table that holds the desired updates (it can be a stream or a table, possibly the result of another query):
CREATE TABLE TBL_EMPLOYEE_DESIRED_UPDATES (`employee_id` VARCHAR, `skill` VARCHAR, `level` VARCHAR) WITH( KAFKA_TOPIC = 'employee-desired-updates-topic', PARTITIONS = 3, VALUE_FORMAT ='JSON', KEY = '`employee_id`');
3. Create the final table that applies the updates. The LEFT JOIN keeps every row from the first table; if there is no matching update in the second table, the skill and level fields keep their original values.
SET 'auto.offset.reset' = 'earliest';
CREATE TABLE TBL_EMPLOYEE_FINAL AS
SELECT
EMP.`employee_id` AS `employee_id`,
EMP.`name` AS `name`,
EMP.`lastName` AS `lastName`,
IFNULL(UPD.`skill`, EMP.`skill`) as `skill`,
IFNULL(UPD.`level`, EMP.`level`) as `level`
FROM TBL_EMPLOYEE AS EMP
LEFT JOIN TBL_EMPLOYEE_DESIRED_UPDATES UPD ON EMP.ROWKEY = UPD.ROWKEY EMIT CHANGES;
Example:
INSERT INTO TBL_EMPLOYEE (`employee_id`, `name`, `lastName`, `age`, `skill`, `level`) VALUES ('117', 'John', 'Constantine', 30, 'java', 'jr');
INSERT INTO TBL_EMPLOYEE (`employee_id`, `name`, `lastName`, `age`, `skill`, `level`) VALUES ('118', 'Anthony', 'Stark', 40, 'AWS', 'architect');
INSERT INTO TBL_EMPLOYEE (`employee_id`, `name`, `lastName`, `age`, `skill`, `level`) VALUES ('119', 'Clark', 'Kent', 35, 'python', 'senior');
ksql> SELECT * FROM TBL_EMPLOYEE_FINAL EMIT CHANGES;
+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
|ROWTIME |ROWKEY |employee_id |name |lastName |skill |level |
+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
|1611440363833 |119 |119 |Clark |Kent |python |senior |
|1611440361284 |117 |117 |John |Constantine |java |jr |
|1611440361408 |118 |118 |Anthony |Stark |AWS |architect |
Next, send an update:
INSERT INTO TBL_EMPLOYEE_DESIRED_UPDATES (`employee_id`, `skill`, `level` ) VALUES ('118', 'mongo', 'senior');
The result:
ksql> SELECT * from TBL_EMPLOYEE_FINAL EMIT CHANGES;
+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
|ROWTIME |ROWKEY |employee_id |name |lastName |skill |level |
+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
|1611440363833 |119 |119 |Clark |Kent |python |senior |
|1611440361284 |117 |117 |John |Constantine |java |jr |
|1611440361408 |118 |118 |Anthony |Stark |AWS |architect |
|1611440585726 |118 |118 |Anthony |Stark |mongo |senior |
Treat the latest row for a key as the current one, with the two fields modified. The earlier row for the same key is part of the table's changelog: records are immutable, so an update appears as a new record rather than an in-place change.
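The merge rule behind the IFNULL columns can be sketched outside ksqlDB. This is only an illustration in plain Python, not ksqlDB code: the function name merge_update and the dictionaries are hypothetical, and unlike the query above it applies every non-null field of the update rather than just skill and level.

```python
def merge_update(entity, update):
    """Return a copy of entity in which non-null update fields win,
    mirroring IFNULL(UPD.field, EMP.field) in the join above."""
    merged = dict(entity)
    if update:
        for field, value in update.items():
            if value is not None:
                merged[field] = value
    return merged


# Latest value per key, as a table would hold it (sample data from the example).
employees = {
    "117": {"employee_id": "117", "name": "John", "lastName": "Constantine",
            "skill": "java", "level": "jr"},
    "118": {"employee_id": "118", "name": "Anthony", "lastName": "Stark",
            "skill": "AWS", "level": "architect"},
}
updates = {
    "118": {"employee_id": "118", "skill": "mongo", "level": "senior"},
}

# LEFT JOIN semantics: every employee is kept, with or without an update.
final = {key: merge_update(row, updates.get(key)) for key, row in employees.items()}
```

Employee 117 has no update and is returned unchanged, while 118 keeps name and lastName but takes the new skill and level.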
I am trying to copy JSON data from Kafka to Vertica. I am using the following query:
COPY public.from_kafka
SOURCE KafkaSource(stream='example_data|0|-2, example_data|1|-2',
brokers='kafka01.example.com:9092',
duration=interval '10000 milliseconds') PARSER KafkaJSONParser()
REJECTED DATA AS TABLE public.rejections;
Each message in the topic looks like this:
{"location_id":30277, "start_date":1667911800000}
When I run the query, no new rows are created. When I check the rejections table, I see the following rejected_reason:
Missing or null value for column with NOT NULL constraint [start_date]
However, the rejected_data is {"location_id":30277, "start_date":1667911800000}
Why does Vertica not recognize the start_date field, and how can I solve it?
Vertica table:
CREATE TABLE public.from_kafka
(
location_id int NOT NULL,
start_date timestamp NOT NULL
);
CREATE PROJECTION public.from_kafka /*+createtype(L)*/
(
location_id ENCODING RLE,
start_date ENCODING GCDDELTA
)
AS
SELECT from_kafka.location_id,
from_kafka.start_date
FROM public.from_kafka
ORDER BY from_kafka.start_date,
from_kafka.location_id
SEGMENTED BY hash(from_kafka.location_id, from_kafka.start_date) ALL NODES KSAFE 1;
EDIT - SOLUTION
KafkaJSONParser() does not know how to convert a long into a timestamp. Because of that, I had to convert the JSON message with Java, write the updated JSON to a new topic, and then load that topic with KafkaJSONParser().
A timestamp, in any SQL database, is a timestamp, not an integer.
To load your JSON format and have a timestamp, redefine your table to receive an integer and convert it to a timestamp on the fly.
I load from a file here, but the same approach works with a Kafka stream, too.
-- create your table like so:
DROP TABLE IF EXISTS public.from_kafka;
CREATE TABLE public.from_kafka
(
location_id int NOT NULL,
start_date int NOT NULL,
start_date_ts timestamp DEFAULT TO_TIMESTAMP(start_date//1000)
);
This is the JSON file I use:
$ cat kafka.json
{"location_id":30277, "start_date":1667911800000},
{"location_id":30278, "start_date":1667911900000},
{"location_id":30279, "start_date":1667912000000},
{"location_id":30280, "start_date":1667912100000},
{"location_id":30281, "start_date":1667912200000},
{"location_id":30282, "start_date":1667912300000}
And this is the copy command I use:
COPY public.from_kafka (
location_id
, start_date
)
FROM LOCAL 'kafka.json' PARSER FJsonParser(record_terminator=E'\n')
EXCEPTIONS 'kafka.log';
And this, finally, is what from_kafka will contain:
SELECT * FROM public.from_kafka;
-- out location_id | start_date | start_date_ts
-- out -------------+------------+---------------------
-- out 30277 | 1667911800 | 2022-11-08 13:50:00
-- out 30278 | 1667911900 | 2022-11-08 13:51:40
-- out 30279 | 1667912000 | 2022-11-08 13:53:20
-- out 30280 | 1667912100 | 2022-11-08 13:55:00
-- out 30281 | 1667912200 | 2022-11-08 13:56:40
-- out 30282 | 1667912300 | 2022-11-08 13:58:20
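Both approaches above come down to the same conversion: divide the epoch-milliseconds value by 1000 and interpret the result as a timestamp. A minimal Python sketch of that step, applied to one message before it would be re-published, is shown here. The function name convert_message and the output string format are assumptions, and the result is in UTC, so it may differ from a session time zone.

```python
import json
from datetime import datetime, timezone


def convert_message(raw):
    """Replace the epoch-milliseconds start_date with a timestamp string (UTC)."""
    record = json.loads(raw)
    seconds = record["start_date"] // 1000  # same integer division as start_date//1000
    record["start_date"] = datetime.fromtimestamp(
        seconds, tz=timezone.utc
    ).strftime("%Y-%m-%d %H:%M:%S")
    return json.dumps(record)


converted = convert_message('{"location_id":30277, "start_date":1667911800000}')
```

The converted message carries start_date as a string that a timestamp column can parse, which is what the pre-processing step in the question's solution has to produce.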
I managed to create tables in Postgres but ran into issues when trying to insert values.
comands = (
CREATE TYPE student AS (
name TEXT,
id INTEGER
)
CREATE TABLE studentclass(
date DATE NOT NULL,
time TIMESTAMPTZ NOT NULL,
PRIMARY KEY (date, time),
class student
)
)
And in psycopg2:
command = (
INSERT INTO studentclass (date, time, student) VALUES (%s,%s, ROW(%s,%s)::student)
)
student_rec = ("John", 1)
record_to_insert = ("2020-05-21", "2020-05-21 08:10:00", student_rec)
cursor.execute(commands, record_to_insert)
When executed, I get an incorrect-argument error, and if I hard-code the student value inside the INSERT statement, it complains about an unrecognized column student.
Please advise.
One issue is that the column name is class, not student. The second is that psycopg2 already adapts a Python tuple to a composite type, so the tuple can be passed as a single parameter.
So you can do:
insert_sql = "INSERT INTO studentclass (date, time, class) VALUES (%s,%s,%s)"
student_rec = ("John", 1)
record_to_insert = ("2020-05-21", "2020-05-21 08:10:00", student_rec)
cur.execute(insert_sql, record_to_insert)
con.commit()
select * from studentclass ;
date | time | class
------------+-------------------------+----------
05/21/2020 | 05/21/2020 08:10:00 PDT | (John,1)
I recently started porting a SQLite database over to PostgreSQL for a Flask site built with SQLAlchemy. I have my schemas in PostgreSQL and have even inserted the data into the database. However, I am unable to run my usual INSERT commands to add information to the database. Normally, I insert new records using SQLAlchemy by leaving the ID column NULL and setting only the other columns. However, that results in the following error:
sqlalchemy.exc.IntegrityError: (psycopg2.IntegrityError) null value in column "id" violates not-null constraint
DETAIL: Failing row contains (null, 2017-07-24 20:40:37.787393+00, 2017-07-24 20:40:37.787393+00, episode_length_list = [52, 51, 49, 50, 83]
sum_length = 0
for ..., 0, f, 101, 1, 0, 0, , null).
[SQL: 'INSERT INTO submission (date_created, date_modified, code, status, correct, assignment_id, course_id, user_id, assignment_version, version, url) VALUES (CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, %(code)s, %(status)s, %(correct)s, %(assignment_id)s, %(course_id)s, %(user_id)s, %(assignment_version)s, %(version)s, %(url)s) RETURNING submission.id'] [parameters: {'code': 'episode_length_list = [52, 51, 49, 50, 83]\n\nsum_length = 0\n\nfor episode_length in episode_length_list:\n pass\n\nsum_length = sum_length + episode_length\n\nprint(sum_length)\n', 'status': 0, 'correct': False, 'assignment_id': 101, 'course_id': None, 'user_id': 1, 'assignment_version': 0, 'version': 0, 'url': ''}]
Here are my SQLAlchemy table declarations:
class Base(Model):
    __abstract__ = True
    @declared_attr
    def __tablename__(cls):
        return cls.__name__.lower()
    def __repr__(self):
        return str(self)
    id = Column(Integer(), primary_key=True)
    date_created = Column(DateTime, default=func.current_timestamp())
    date_modified = Column(DateTime, default=func.current_timestamp(),
                           onupdate=func.current_timestamp())
class Submission(Base):
    code = Column(Text(), default="")
    status = Column(Integer(), default=0)
    correct = Column(Boolean(), default=False)
    assignment_id = Column(Integer(), ForeignKey('assignment.id'))
    course_id = Column(Integer(), ForeignKey('course.id'))
    user_id = Column(Integer(), ForeignKey('user.id'))
    assignment_version = Column(Integer(), default=0)
    version = Column(Integer(), default=0)
    url = Column(Text(), default="")
I created the schema by calling db.create_all() in a script.
Checking the PostgreSQL side, we can see the constructed table:
Table "public.submission"
Column | Type | Modifiers | Storage | Stats target | Description
--------------------+--------------------------+-----------+----------+--------------+-------------
id | bigint | not null | plain | |
date_created | timestamp with time zone | | plain | |
date_modified | timestamp with time zone | | plain | |
code | text | | extended | |
status | bigint | | plain | |
correct | boolean | | plain | |
assignment_id | bigint | | plain | |
user_id | bigint | | plain | |
assignment_version | bigint | | plain | |
version | bigint | | plain | |
url | text | | extended | |
course_id | bigint | | plain | |
Indexes:
"idx_16881_submission_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"submission_course_id_fkey" FOREIGN KEY (course_id) REFERENCES course(id)
"submission_user_id_fkey" FOREIGN KEY (user_id) REFERENCES "user"(id)
Has OIDs: no
I'm still new to this, but shouldn't there be a sequence?
Any insight or suggestions on what to look for next would be super appreciated.
It is standard SQL that a PRIMARY KEY is UNIQUE and NOT NULL. PostgreSQL enforces the standard and does not allow any NULL (not a single one) in a primary key column. Some other databases are more lenient; SQLite, for example, assigns a value automatically when NULL is inserted into an INTEGER PRIMARY KEY column, hence the different behaviour.
PostgreSQL current documentation on Primary Keys clearly states it:
5.3.4. Primary Keys
A primary key constraint indicates that a column, or group of columns, can be used as a unique identifier for rows in the table. This requires that the values be both unique and not null.
If you want your PRIMARY KEY to be a synthetic (i.e.: not natural) sequence number, you should define it with type BIGSERIAL instead of BIGINT. I don't know the details on how this is achieved using SQLAlchemy, but look at the references.
When you then INSERT into your table, the id should NOT be in the INSERT column list (it should not be set to null, just not be there). I.e.:
This will work and generate a new id:
INSERT INTO public.submission (code) VALUES ('Some code') ;
This won't:
INSERT INTO public.submission (id, code) VALUES (NULL, 'Some code') ;
I guess SQLAlchemy should be smart enough to generate the proper SQL INSERT statements, once properly configured.
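For contrast, here is a minimal sketch of why the same insert pattern worked before the port, using Python's built-in sqlite3 module. In SQLite, an INTEGER PRIMARY KEY column receives an automatically assigned value even when NULL is inserted explicitly. The two-column table is a hypothetical stand-in for submission, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, trimmed-down version of the submission table.
conn.execute("CREATE TABLE submission (id INTEGER PRIMARY KEY, code TEXT)")

# Omitting id lets SQLite assign it.
conn.execute("INSERT INTO submission (code) VALUES ('Some code')")
# An explicit NULL id is also accepted in SQLite and replaced by the next id,
# whereas PostgreSQL rejects it with a not-null violation.
conn.execute("INSERT INTO submission (id, code) VALUES (NULL, 'More code')")

ids = [row[0] for row in conn.execute("SELECT id FROM submission ORDER BY id")]
```

Both rows end up with generated ids in SQLite, which is the behaviour the application relied on before moving to PostgreSQL.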
Reference:
Why isn't SQLAlchemy creating serial columns?
Ultimately, I discovered what went wrong, and it was definitely my fault. The process I used to load the old data into the database (pgloader) was doing more than just loading data - it was somehow overwriting parts of the table definitions! I was able to pg_dump the data out, reset the tables, and then load it back in - everything works as expected. Thanks for sanity checks!
I want to insert a new row which copies the two fields from the original row, and changes the last field to a new value. This is all done on one table.
Please excuse the table names/fields, they are very long.
Table 1 - alert_template_allocations
alert_template_allocation_id - pkey (ignored)
alert_template_allocation_io_id - (copy)
alert_template_allocation_alert_template_id - (copy)
alert_template_allocation_user_group_id - (change to a static value)
Table 2 - io
io_id - copy io_ids that belong to station 222
io_station_id - want to only copy rows where the station id = 222
My Attempt
insert into alert_template_allocations
(alert_template_allocation_io_id,
alert_template_allocation_alert_template_id,
alert_template_allocation_user_group_id)
values
(
(Select at.alert_template_allocation_io_id,
at.alert_template_allocation_alert_template_id
from alert_template_allocations at join io i on
i.io_id = at.alert_template_allocation_io_id
and i.io_station_id = 222)
, 4);
Use the INSERT INTO ... SELECT syntax and put the static value in the select list:
INSERT INTO alert_template_allocations (alert_template_allocation_io_id,
alert_template_allocation_alert_template_id,
alert_template_allocation_user_group_id)
SELECT at.alert_template_allocation_io_id,
at.alert_template_allocation_alert_template_id,
4
FROM alert_template_allocations at
JOIN io i
ON i.io_id = at.alert_template_allocation_io_id
AND i.io_station_id = 222;
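The statement can be tried end to end with a small sketch using Python's built-in sqlite3 module. The column types and sample rows below are assumptions made for the demonstration, and the alias a replaces at purely as a precaution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE io (io_id INTEGER PRIMARY KEY, io_station_id INTEGER);
CREATE TABLE alert_template_allocations (
    alert_template_allocation_id INTEGER PRIMARY KEY,
    alert_template_allocation_io_id INTEGER,
    alert_template_allocation_alert_template_id INTEGER,
    alert_template_allocation_user_group_id INTEGER
);
-- Assumed sample data: io 1 and 2 belong to station 222, io 3 does not.
INSERT INTO io VALUES (1, 222), (2, 222), (3, 999);
INSERT INTO alert_template_allocations
    (alert_template_allocation_io_id,
     alert_template_allocation_alert_template_id,
     alert_template_allocation_user_group_id)
VALUES (1, 10, 1), (2, 20, 1), (3, 30, 1);
""")

# Copy two columns from matching rows and set the third to the static value 4.
conn.execute("""
INSERT INTO alert_template_allocations
    (alert_template_allocation_io_id,
     alert_template_allocation_alert_template_id,
     alert_template_allocation_user_group_id)
SELECT a.alert_template_allocation_io_id,
       a.alert_template_allocation_alert_template_id,
       4
FROM alert_template_allocations a
JOIN io i
  ON i.io_id = a.alert_template_allocation_io_id
 AND i.io_station_id = 222
""")

copied = conn.execute("""
SELECT alert_template_allocation_io_id,
       alert_template_allocation_alert_template_id
FROM alert_template_allocations
WHERE alert_template_allocation_user_group_id = 4
ORDER BY alert_template_allocation_io_id
""").fetchall()
```

Only the allocations whose io belongs to station 222 are copied, and the primary key of each new row is generated because it is left out of the column list.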
I have the following heap of text:
"BundleSize,155648,DynamicSize,204800,Identifier,com.URLConnectionSample,Name,
URLConnectionSample,ShortVersion,1.0,Version,1.0,BundleSize,155648,DynamicSize,
16384,Identifier,com.IdentifierForVendor3,Name,IdentifierForVendor3,ShortVersion,
1.0,Version,1.0,".
What I'd like to do is extract data from this in the following manner:
BundleSize:155648
DynamicSize:204800
Identifier:com.URLConnectionSample
Name:URLConnectionSample
ShortVersion:1.0
Version:1.0
BundleSize:155648
DynamicSize:16384
Identifier:com.IdentifierForVendor3
Name:IdentifierForVendor3
ShortVersion:1.0
Version:1.0
All tips and suggestions are welcome.
It isn't quite clear what you need to do with this data. If you really need to process it entirely in the database (it looks more like a task for your favorite scripting language), one option is to use hstore.
Converting records one by one is easy:
Assuming
%s =
BundleSize,155648,DynamicSize,204800,Identifier,com.URLConnectionSample,Name,URLConnectionSample,ShortVersion,1.0,Version,1.0
SELECT * FROM each(hstore(string_to_array(%s, ',')));
Output:
key | value
--------------+-------------------------
Name | URLConnectionSample
Version | 1.0
BundleSize | 155648
Identifier | com.URLConnectionSample
DynamicSize | 204800
ShortVersion | 1.0
If you have a table with columns exactly matching the field names (note the quotes; populate_record is case-sensitive to key names):
CREATE TABLE data (
"BundleSize" integer, "DynamicSize" integer, "Identifier" text,
"Name" text, "ShortVersion" text, "Version" text);
You can insert hstore records into it like this:
INSERT INTO data SELECT * FROM
populate_record(NULL::data, hstore(string_to_array(%s, ',')));
Things get more complicated if you have comma-separated values for more than one record.
%s = BundleSize,155648,DynamicSize,204800,Identifier,com.URLConnectionSample,Name,URLConnectionSample,ShortVersion,1.0,Version,1.0,BundleSize,155648,DynamicSize,16384,Identifier,com.IdentifierForVendor3,Name,IdentifierForVendor3,ShortVersion,1.0,Version,1.0,
You first need to break the array up into chunks of number_of_fields * 2 = 12 elements.
SELECT hstore(row) FROM (
SELECT array_agg(str) AS row FROM (
SELECT str, row_number() OVER () AS i FROM
unnest(string_to_array(%s, ',')) AS str
) AS str_sub
GROUP BY (i - 1) / 12) AS row_sub
WHERE array_length(row, 1) = 12;
Output:
"Name"=>"URLConnectionSample", "Version"=>"1.0", "BundleSize"=>"155648", "Identifier"=>"com.URLConnectionSample", "DynamicSize"=>"204800", "ShortVersion"=>"1.0"
"Name"=>"IdentifierForVendor3", "Version"=>"1.0", "BundleSize"=>"155648", "Identifier"=>"com.IdentifierForVendor3", "DynamicSize"=>"16384", "ShortVersion"=>"1.0"
And inserting this into the aforementioned table:
INSERT INTO data SELECT (populate_record(NULL::data, hstore(row))).* FROM ...
The rest of the query is the same as above.
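Since the answer suggests a scripting language may suit this task better, here is a minimal Python sketch of the same chunk-and-pair idea. The function name parse_records and its fields_per_record parameter are hypothetical; like the SQL version, it drops a trailing incomplete chunk, such as the empty token left by the final comma.

```python
def parse_records(text, fields_per_record=6):
    """Split 'key,value,key,value,...' text into one dict per record."""
    tokens = [token.strip() for token in text.split(",")]
    chunk_size = fields_per_record * 2  # each field contributes a key and a value
    records = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        if len(chunk) == chunk_size:  # skip the incomplete tail
            # Even positions are keys, odd positions are values.
            records.append(dict(zip(chunk[0::2], chunk[1::2])))
    return records


text = ("BundleSize,155648,DynamicSize,204800,Identifier,com.URLConnectionSample,"
        "Name,URLConnectionSample,ShortVersion,1.0,Version,1.0,"
        "BundleSize,155648,DynamicSize,16384,Identifier,com.IdentifierForVendor3,"
        "Name,IdentifierForVendor3,ShortVersion,1.0,Version,1.0,")
records = parse_records(text)
```

Each resulting dict maps field names to string values, so any numeric conversion (for example of BundleSize) would still need to be done by the caller.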