cannot recognize epoch time when trying to copy from kafka to vertica

I am trying to copy JSON data from Kafka to Vertica. I am using the following query:
COPY public.from_kafka
SOURCE KafkaSource(stream='example_data|0|-2, example_data|1|-2',
brokers='kafka01.example.com:9092',
duration=interval '10000 milliseconds') PARSER KafkaJSONParser()
REJECTED DATA AS TABLE public.rejections;
Each message in the topic looks like this:
{"location_id":30277, "start_date":1667911800000}
When I run the query, no new rows are created. When I check the rejections table I see the following rejected_reason:
Missing or null value for column with NOT NULL constraint [start_date]
However, the rejected_data is {"location_id":30277, "start_date":1667911800000}.
Why does Vertica not recognize the start_date field, and how can I solve it?
Vertica table:
CREATE TABLE public.from_kafka
(
location_id int NOT NULL,
start_date timestamp NOT NULL
);
CREATE PROJECTION public.from_kafka /*+createtype(L)*/
(
location_id ENCODING RLE,
start_date ENCODING GCDDELTA
)
AS
SELECT from_kafka.location_id,
from_kafka.start_date
FROM public.from_kafka
ORDER BY from_kafka.start_date,
from_kafka.location_id
SEGMENTED BY hash(from_kafka.location_id, from_kafka.start_date) ALL NODES KSAFE 1;
EDIT - SOLUTION
PARSER KafkaJSONParser() does not know how to convert a long into a timestamp. Because of that, I had to convert the JSON message with Java, write the updated JSON to a new topic, and then use the KafkaJSONParser() function on that topic.
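For illustration (the topic name example_data_ts is made up), assuming the Java step republishes each message with start_date already formatted as a timestamp string, e.g. {"location_id":30277, "start_date":"2022-11-08 13:50:00"}, the original COPY then works unchanged against the new topic:
COPY public.from_kafka
SOURCE KafkaSource(stream='example_data_ts|0|-2, example_data_ts|1|-2',
                   brokers='kafka01.example.com:9092',
                   duration=interval '10000 milliseconds')
PARSER KafkaJSONParser()
REJECTED DATA AS TABLE public.rejections;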

A timestamp, in any SQL database, is a timestamp, not an integer.
To load your JSON format and have a timestamp, redefine your table to receive an integer and convert it to a timestamp on the fly.
I do it from a file here, but it will work with a Kafka stream, too.
-- create your table like so:
DROP TABLE IF EXISTS public.from_kafka;
CREATE TABLE public.from_kafka
(
location_id int NOT NULL,
start_date int NOT NULL,
start_date_ts timestamp DEFAULT TO_TIMESTAMP(start_date//1000)
);
This is the JSON file I use:
$ cat kafka.json
{"location_id":30277, "start_date":1667911800000},
{"location_id":30278, "start_date":1667911900000},
{"location_id":30279, "start_date":1667912000000},
{"location_id":30280, "start_date":1667912100000},
{"location_id":30281, "start_date":1667912200000},
{"location_id":30282, "start_date":1667912300000}
And this is the copy command I use:
COPY public.from_kafka (
location_id
, start_date
)
FROM LOCAL 'kafka.json' PARSER FJsonParser(record_terminator=E'\n')
EXCEPTIONS 'kafka.log';
And this, finally, is what from_kafka will contain:
SELECT * FROM public.from_kafka;
-- out location_id | start_date | start_date_ts
-- out -------------+------------+---------------------
-- out 30277 | 1667911800 | 2022-11-08 13:50:00
-- out 30278 | 1667911900 | 2022-11-08 13:51:40
-- out 30279 | 1667912000 | 2022-11-08 13:53:20
-- out 30280 | 1667912100 | 2022-11-08 13:55:00
-- out 30281 | 1667912200 | 2022-11-08 13:56:40
-- out 30282 | 1667912300 | 2022-11-08 13:58:20
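To apply the same idea directly to the Kafka stream, a sketch (untested, reusing the KafkaSource settings from the question): list only the two loaded columns so that start_date_ts gets filled in by its DEFAULT expression:
COPY public.from_kafka (location_id, start_date)
SOURCE KafkaSource(stream='example_data|0|-2, example_data|1|-2',
                   brokers='kafka01.example.com:9092',
                   duration=interval '10000 milliseconds')
PARSER KafkaJSONParser()
REJECTED DATA AS TABLE public.rejections;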

Related

PostgreSQL Stored procedure performance

I use PostgreSQL and I have a table 'PERSON' in schema 'public' that looks like this:
+----+-------------+-------+----------------------------+
| id | internal_id | name | created |
+----+-------------+-------+----------------------------+
| 1 | P0001-XX00 | Bob | 2021-05-24 22:10:01.93025 |
+----+-------------+-------+----------------------------+
| 2 | P0001-CX00 | Tom | 2021-06-27 22:10:01.93025 |
+----+-------------+-------+----------------------------+
| 3 | P0002-XX00 | Anna | 2021-05-24 22:10:01.93025 |
+----+-------------+-------+----------------------------+
id -> bigint; internal_id -> character varying; name -> character varying; created -> timestamp without time zone
I need to write a procedure that deletes records older than a fixed timestamp, for example now(). But as soon as such an old record has been found, I need to check whether there are other records in the table that are not old yet and have the same first 5 characters of internal_id as the found old record. If there are such records, then I should not delete the old record.
So I wrote the following procedure with plpgsql and it seems to work:
BEGIN
DELETE FROM public."PERSON" AS t1
WHERE t1.created < now()
AND NOT EXISTS
(
SELECT
FROM public."PERSON" AS t2
WHERE left(t2.internal_id, 5) = left(t1.internal_id, 5)
AND t2.created >= now()
);
COMMIT;
END;
Questions:
Could it be made more correct, prettier, or cleaner? Perhaps, instead of using the left() function, it would have been better to use LIKE, or to approach it differently altogether?
Do you think this procedure performs reasonably, or can it be improved?
Thank you in advance!
That should work just fine.
For good performance, create an index:
CREATE INDEX ON public."PERSON" (left(internal_id, 5), created);
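If you want the DELETE packaged as a callable routine, here is a minimal sketch (the procedure name is made up; CREATE PROCEDURE requires PostgreSQL 11+):
CREATE OR REPLACE PROCEDURE public.purge_old_persons()
LANGUAGE plpgsql
AS $$
BEGIN
    DELETE FROM public."PERSON" AS t1
    WHERE t1.created < now()
      AND NOT EXISTS
      (
          SELECT
          FROM public."PERSON" AS t2
          WHERE left(t2.internal_id, 5) = left(t1.internal_id, 5)
            AND t2.created >= now()
      );
END;
$$;
-- invoke it:
CALL public.purge_old_persons();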

Is there a KSQL statement to update values in a table?

I have a stream of data in a topic that should be treated as a ksqlDB table (only the last value for a given key matters), and this data describes updates to specific fields of data in another topic. Is there any way in ksqlDB to process a stream that updates values in another stream/table/topic? The target topic has entities with, say, 20 fields, but my update stream carries updates for only 3 of those fields, so I want to update only these 3 fields while the other 17 fields remain the same in the target topic (treated as a table).
You can solve your problem using a JOIN statement with a little adjustment. Follow the sample: it creates a table with 5 fields, of which only the skill and level fields need to be updated from another table.
1. Create the table from the source topic:
CREATE TABLE TBL_EMPLOYEE( `employee_id` VARCHAR, `name` varchar, `lastName` varchar, `age` INT, `skill` VARCHAR, `level` VARCHAR ) WITH ( KAFKA_TOPIC = 'employee-topic-input', PARTITIONS = 3, VALUE_FORMAT = 'JSON', KEY = '`employee_id`');
2. Create the table to handle the desired updates (it can be a stream or a table, resulting from another query):
CREATE TABLE TBL_EMPLOYEE_DESIRED_UPDATES (`employee_id` VARCHAR, `skill` VARCHAR, `level` VARCHAR) WITH( KAFKA_TOPIC = 'employee-desired-updates-topic', PARTITIONS = 3, VALUE_FORMAT ='JSON', KEY = '`employee_id`');
3. Create the final table to update the required fields. The LEFT JOIN keeps all the elements of the first table; if there is no update in the second table, the skill and level fields stay the same.
SET 'auto.offset.reset' = 'earliest';
CREATE TABLE TBL_EMPLOYEE_FINAL AS
SELECT
EMP.`employee_id` AS `employee_id`,
EMP.`name` AS `name`,
EMP.`lastName` AS `lastName`,
IFNULL(UPD.`skill`, EMP.`skill`) as `skill`,
IFNULL(UPD.`level`, EMP.`level`) as `level`
FROM TBL_EMPLOYEE AS EMP
LEFT JOIN TBL_EMPLOYEE_DESIRED_UPDATES UPD ON EMP.ROWKEY = UPD.ROWKEY EMIT CHANGES;
Example:
INSERT INTO TBL_EMPLOYEE (`employee_id`, `name`, `lastName`, `age`, `skill`, `level`) VALUES ('117', 'John', 'Constantine', 30, 'java', 'jr');
INSERT INTO TBL_EMPLOYEE (`employee_id`, `name`, `lastName`, `age`, `skill`, `level`) VALUES ('118', 'Anthony', 'Stark', 40, 'AWS', 'architect');
INSERT INTO TBL_EMPLOYEE (`employee_id`, `name`, `lastName`, `age`, `skill`, `level`) VALUES ('119', 'Clark', 'Kent', 35, 'python', 'senior');
ksql> SELECT * FROM TBL_EMPLOYEE_FINAL EMIT CHANGES;
+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
|ROWTIME |ROWKEY |employee_id |name |lastName |skill |level |
+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
|1611440363833 |119 |119 |Clark |Kent |python |senior |
|1611440361284 |117 |117 |John |Constantine |java |jr |
|1611440361408 |118 |118 |Anthony |Stark |AWS |architect |
The second step is to send an update:
INSERT INTO TBL_EMPLOYEE_DESIRED_UPDATES (`employee_id`, `skill`, `level` ) VALUES ('118', 'mongo', 'senior');
The result:
ksql> SELECT * from TBL_EMPLOYEE_FINAL EMIT CHANGES;
+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
|ROWTIME |ROWKEY |employee_id |name |lastName |skill |level |
+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
|1611440363833 |119 |119 |Clark |Kent |python |senior |
|1611440361284 |117 |117 |John |Constantine |java |jr |
|1611440361408 |118 |118 |Anthony |Stark |AWS |architect |
|1611440585726 |118 |118 |Anthony |Stark |mongo |senior |
You have to consider the latest element in the table as the current one, carrying the two modifications. The earlier row is part of the table's changelog; the records are immutable.
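To follow a single key instead of the whole changelog, a push query filtered on the id should work against the result table (a sketch; exact behaviour depends on your ksqlDB version):
SELECT * FROM TBL_EMPLOYEE_FINAL
WHERE `employee_id` = '118'
EMIT CHANGES;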

Unable to insert nested record in postgres

I managed to create the tables in Postgres but encountered issues when trying to insert values.
commands = (
    """
    CREATE TYPE student AS (
        name TEXT,
        id INTEGER
    )
    """,
    """
    CREATE TABLE studentclass(
        date DATE NOT NULL,
        time TIMESTAMPTZ NOT NULL,
        PRIMARY KEY (date, time),
        class student
    )
    """,
)
And in psycopg2:
command = (
    "INSERT INTO studentclass (date, time, student) VALUES (%s, %s, ROW(%s,%s)::student)"
)
student_rec = ("John", 1)
record_to_insert = ("2020-05-21", "2020-05-21 08:10:00", student_rec)
cursor.execute(command, record_to_insert)
When executed, I get an error about an incorrect argument, and if I hard-code the student value inside the INSERT statement, it tells me about an unrecognized column, student.
Please advise.
One issue is that the column name is class, not student. The second is that psycopg2 does tuple adaptation to a composite type.
So you can do:
insert_sql = "INSERT INTO studentclass (date, time, class) VALUES (%s,%s,%s)"
student_rec = ("John", 1)
record_to_insert = ("2020-05-21", "2020-05-21 08:10:00", student_rec)
cur.execute(insert_sql, record_to_insert)
con.commit()
select * from studentclass ;
date | time | class
------------+-------------------------+----------
05/21/2020 | 05/21/2020 08:10:00 PDT | (John,1)
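For reference, the equivalent hand-written SQL (the hard-coded variant attempted in the question) also works once it targets the class column and casts the ROW to the composite type:
INSERT INTO studentclass (date, time, class)
VALUES ('2020-05-21', '2020-05-21 08:10:00', ROW('John', 1)::student);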

Epoch timestamp to 'yyyy-MM-dd HH:mm:ss.SSS' format conversion while executing AWS Redshift COPY command

With reference to the other related question: with the following configuration I am able to insert data into Redshift -
COPY "hits" FROM 's3://your-bucket/your_folder/'
CREDENTIALS 'aws_access_key_id=<AWS_ACCESS_KEY_ID>;aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>'
FORMAT as JSON 's3://your-bucket/config/jsonpaths'
TIMEFORMAT as 'epochmillisecs';
It converts '1528207694599' into '2018-06-05 14:08:14', but I'm expecting '2018-06-05 14:08:14.599'.
Any luck? Thanks in advance.
It seems like you may be doing something wrong somewhere. I would like to show you systematic steps to prove that COPY populates the data properly, including milliseconds.
create table sales(
salesid integer not null,
category varchar(10),
update_at timestamp);
Data file data.json's content:
[1,"Sports1","1528207694599"]
[2,"Sports2","1528207694456"]
Mapping file json_path.json's content:
{
"jsonpaths": [
"$[0]",
"$[1]",
"$[2]"
]
}
Then the copy command:
COPY sales FROM 's3://s3-path/to/data/data.json' CREDENTIALS 'aws_access_key_id=**********;aws_secret_access_key=*******' FORMAT as JSON 's3://s3-path/to/mapping/json_path.json' TIMEFORMAT as 'epochmillisecs';
Output:
COPY sales FROM .............
INFO: Load into table 'sales' completed, 2 record(s) loaded successfully.
COPY;
Select * from Sales;
salesid | category | update_at
---------+----------+-------------------------
2 | Sports2 | 2018-06-05 14:08:14.456
1 | Sports1 | 2018-06-05 14:08:14.599
(2 rows)
As you can see, update_at has its value including milliseconds.
Hope it helps.

Error when querying PostgreSQL using range operators

I am trying to query a PostgreSQL (v 9.3.6) table with a tstzrange column to determine whether a given timestamp falls within a range in the table, which is defined as:
CREATE TABLE sensor(
id serial,
hostname varchar(64) NOT NULL,
ip varchar(15) NOT NULL,
period tstzrange NOT NULL,
PRIMARY KEY(id),
EXCLUDE USING gist (hostname WITH =, period with &&)
);
I am using psycopg2 and when I try the query:
sql = "SELECT id FROM sensor WHERE %s <# period;"
cursor.execute(sql,(isotimestamp,))
I get the error
psycopg2.DataError: malformed range literal:
...
DETAIL: Missing left parenthesis or bracket.
I've tried various type castings to no avail.
I've managed a workaround using the following query:
sql = "SELECT * FROM sensor WHERE %s BETWEEN lower(period) AND upper(period);"
but would like to know why I am having problems with the range operators. Is it my code, psycopg2, or something else?
Any help is appreciated.
EDIT 1:
In response to the comments, I have attempted the same query on a simple 1-row table in PostgreSQL, as below:
=> select * from sensor;
session_id | hostname | ip | period
------------+----------+-----------+-------------------------------------------------------------------
1 | bob | 127.0.0.1 | ["2015-02-08 19:26:42.032637+00","2015-02-08 19:27:28.562341+00")
(1 row)
Now, using the "@>" operator, I get the following error:
=> select * from sensor where period @> '2015-02-08 19:26:43.04+00';
ERROR: malformed range literal: "2015-02-08 19:26:43.04+00"
LINE 1: select * from sensor where period #> '2015-02-08 19:26:42.03...
This appears to be the same as the psycopg2 error, a malformed range literal, so I thought I would try casting to timestamptz, as below:
=> select * from sensor where sensor.period @> '2015-02-08 19:26:42.032637+00'::timestamptz;
session_id | hostname | ip | period
------------+----------+-----------+-------------------------------------------------------------------
1 | feral | 127.0.0.1 | ["2015-02-08 19:26:42.032637+00","2015-02-08 19:27:28.562341+00")
So it appears it was my mistake: the literal has to be cast, or it is assumed to be a range. Using psycopg2, the query can be executed with:
sql = "select * from sensor where period @> %s::timestamptz"