Table-Table Join duplicate entries - apache-kafka

We are using Kafka in production and I am trying to push adoption of KSQL in the same direction. But I already failed with one simple table-table join. I tried it with our production data first and ran into an issue. So I thought I had missed something and went back to the example from the Confluent docs, where I ran into the same problem.
I will explain my issue with the example data from https://docs.confluent.io/current/ksql/docs/tutorials/basics-docker.html#table-table-join
Once I have created both tables, joining the data works, but as soon as I alter or add something I get new entries in my table. Judging by every example I found at Confluent, and even in the YouTube videos, this is not supposed to happen.
Creating records
docker run --interactive --rm --network tutorials_default \
confluentinc/cp-kafkacat \
kafkacat -b kafka:39092 \
-t warehouse_location \
-K: \
-P <<EOF
1:{"warehouse_id":1,"city":"Leeds","country":"UK"}
2:{"warehouse_id":2,"city":"Sheffield","country":"UK"}
3:{"warehouse_id":3,"city":"Berlin","country":"Germany"}
EOF
docker run --interactive --rm --network tutorials_default \
confluentinc/cp-kafkacat \
kafkacat -b kafka:39092 \
-t warehouse_size \
-K: \
-P <<EOF
1:{"warehouse_id":1,"square_footage":16000}
2:{"warehouse_id":2,"square_footage":42000}
3:{"warehouse_id":3,"square_footage":94000}
EOF
Creating tables
CREATE TABLE WAREHOUSE_LOCATION (WAREHOUSE_ID INT, CITY VARCHAR, COUNTRY VARCHAR)
WITH (KAFKA_TOPIC='warehouse_location',
VALUE_FORMAT='JSON',
KEY='WAREHOUSE_ID');
CREATE TABLE WAREHOUSE_SIZE (WAREHOUSE_ID INT, SQUARE_FOOTAGE DOUBLE)
WITH (KAFKA_TOPIC='warehouse_size',
VALUE_FORMAT='JSON',
KEY='WAREHOUSE_ID');
Creating a joined table:
CREATE TABLE WH_U AS SELECT WL.WAREHOUSE_ID, WL.CITY, WL.COUNTRY, WS.SQUARE_FOOTAGE
FROM WAREHOUSE_LOCATION WL
LEFT JOIN WAREHOUSE_SIZE WS
ON WL.WAREHOUSE_ID=WS.WAREHOUSE_ID;
With this I get the expected results:
1 | Leeds | UK | 16000.0
2 | Sheffield | UK | 42000.0
3 | Berlin | Germany | 94000.0
But when I add or change records, this happens:
1566375174496 | 1 | 1 | Leeds | UK | 16000.0
1566375174496 | 2 | 2 | Sheffield | UK | 42000.0
1566375174496 | 3 | 3 | Berlin | Germany | 94000.0
1566375595372 | 4 | 4 | London | UK | null
1566375641291 | 4 | 4 | London | UK | 94000.0
1566375641291 | 1 | 1 | Leeds | UK | 1.0
I expected:
1566375174496 | 1 | 1 | Leeds | UK | 1.0
1566375174496 | 2 | 2 | Sheffield | UK | 42000.0
1566375174496 | 3 | 3 | Berlin | Germany | 94000.0
1566375641291 | 4 | 4 | London | UK | 94000.0
What am I missing?
SOLVED
The reason for this behaviour was a simple environment variable on the KSQL server: KSQL_CACHE_MAX_BYTES_BUFFERING was set to 0.
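For anyone hitting the same thing: record caching is what suppresses the intermediate table updates, so with the buffer set to 0 every single change is emitted. A minimal sketch for turning caching back on, either via the server environment variable named above or per session in the KSQL CLI (10485760 bytes is just the Kafka Streams default; adjust to your setup):
-- assumption: your KSQL version allows setting Kafka Streams properties from the CLI
SET 'cache.max.bytes.buffering' = '10485760';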

Related

TimescaleDB: retention policy isn't removing data from hypertable

(note: I've also posted this as a github issue https://github.com/timescale/timescaledb/issues/3653)
I have a hypertable request_logs configured with a 24 hour retention policy. The retention policy is marked as running successfully; however, no old data is being removed from the table, and it continues to grow day by day.
I checked and don't see any errors in the PostgreSQL log files.
I could use additional guidance on where to look for information to troubleshoot this issue.
request_logs table structure
\d+ request_logs;
Table "public.request_logs"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-----------+--------------------------+-----------+----------+---------+---------+--------------+-------------
time | timestamp with time zone | | not null | | plain | |
referer | bigint | | | | plain | |
useragent | bigint | | | | plain | |
Indexes:
"request_logs_time_idx" btree ("time" DESC)
Triggers:
ts_insert_blocker BEFORE INSERT ON request_logs FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Child tables: _timescaledb_internal._hyper_1_37_chunk,
_timescaledb_internal._hyper_1_38_chunk,
_timescaledb_internal._hyper_1_40_chunk
Access method: heap
This is the hypertable description retrieved by running select * from _timescaledb_catalog.hypertable;
id | schema_name | table_name | associated_schema_name | associated_table_prefix | num_dimensions | chunk_sizing_func_schema | chunk_sizing_func_name | chunk_target_size | compression_state | compressed_hypertable_id | replication_factor
----+-------------+--------------+------------------------+-------------------------+----------------+--------------------------+--------------------------+-------------------+-------------------+--------------------------+--------------------
1 | public | request_logs | _timescaledb_internal | _hyper_1 | 1 | _timescaledb_internal | calculate_chunk_interval | 0 | 0 | |
(1 row)
This is the retention_policy retrieved by running SELECT * FROM timescaledb_information.job_stats;.
hypertable_schema | hypertable_name | job_id | last_run_started_at | last_successful_finish | last_run_status | job_status | last_run_duration | next_start | total_runs | total_successes | total_failures
-------------------+-----------------+--------+-------------------------------+-------------------------------+-----------------+------------+-------------------+-------------------------------+------------+-----------------+----------------
public | request_logs | 1002 | 2021-10-05 23:59:01.601404+00 | 2021-10-05 23:59:01.638441+00 | Success | Scheduled | 00:00:00.037037 | 2021-10-06 23:59:01.638441+00 | 8 | 8 | 0
| | 1 | 2021-10-05 08:38:20.473945+00 | 2021-10-05 08:38:21.153468+00 | Success | Scheduled | 00:00:00.679523 | 2021-10-06 08:38:21.153468+00 | 45 | 45 | 0
(2 rows)
Relevant system information:
OS: Ubuntu 20.04.3 LTS
PostgreSQL version (output of postgres --version): 12
TimescaleDB version (output of \dx in psql): 2.4.1
Installation method: apt install, following the process described at https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/ubuntu/installation-apt-ubuntu/#installation-apt-ubuntu
It looks as though this might be related to a bug that has been fixed in version 2.4.2 of TimescaleDB. The GitHub report has been updated; if you find that the issue remains after upgrading, please update the issue on GitHub with your example. Thanks for reporting!
Transparency: I work for Timescale.
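After upgrading, one way to sanity-check the policy is to list the chunks that the 24 hour retention window should allow to be dropped, and then trigger the job by hand instead of waiting for its next scheduled run. A sketch, using the job_id 1002 shown in your job_stats output:
-- chunks older than the retention window (candidates for dropping)
SELECT show_chunks('request_logs', older_than => INTERVAL '24 hours');
-- run the retention policy immediately
CALL run_job(1002);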

Add row number within PostgreSQL COPY command

When I run COPY my_table FROM 'my_table_source.csv' WITH CSV HEADER; is it possible to record each row's line number within the CSV file and add that information to my target table? I have some flat files coming from external sources and going to multiple databases, and this would be useful for tracing backwards during occasional audits done down the road. Thanks.
Add the target column names to COPY and omit the serial column, so it is filled from its sequence in insertion order:
CREATE TEMP TABLE passwords (
seq serial not null PRIMARY KEY
, name text
, passwd text
, uid integer not null
, gid integer not null
, gcos text
, home text
, shell text
);
COPY passwords(name,passwd,uid,gid,gcos,home,shell)
FROM '/etc/passwd' WITH csv DELIMITER ':' ;
SELECT * FROM passwords
WHERE seq < 10
;
Output:
CREATE TABLE
COPY 48
seq | name | passwd | uid | gid | gcos | home | shell
-----+--------+--------+-----+-------+--------+----------------+-------------------
1 | root | x | 0 | 0 | root | /root | /bin/bash
2 | daemon | x | 1 | 1 | daemon | /usr/sbin | /usr/sbin/nologin
3 | bin | x | 2 | 2 | bin | /bin | /usr/sbin/nologin
4 | sys | x | 3 | 3 | sys | /dev | /usr/sbin/nologin
5 | sync | x | 4 | 65534 | sync | /bin | /bin/sync
6 | games | x | 5 | 60 | games | /usr/games | /usr/sbin/nologin
7 | man | x | 6 | 12 | man | /var/cache/man | /usr/sbin/nologin
8 | lp | x | 7 | 7 | lp | /var/spool/lpd | /usr/sbin/nologin
9 | mail | x | 8 | 8 | mail | /var/mail | /usr/sbin/nologin
(9 rows)
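Applied to the COPY from the question, the same pattern looks roughly like this. A sketch only, since my_table's real columns are unknown (col_a and col_b are placeholders):
CREATE TABLE my_table (
    row_num serial PRIMARY KEY  -- filled from the sequence in insertion order, i.e. CSV row order
  , col_a   text
  , col_b   text
);
COPY my_table (col_a, col_b)
FROM 'my_table_source.csv' WITH (FORMAT csv, HEADER);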

Unable to create new database

When I create a new MySQL database, SlashDB's test connection fails.
Here is how I log into mysql:
$ mysql -u 7stud -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 15
Server version: 5.5.5-10.4.13-MariaDB Homebrew
Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| chat |
| ectoing_repo |
| ejabberd |
| information_schema |
| mydb |
| mysql |
| performance_schema |
| test |
+--------------------+
8 rows in set (0.00 sec)
mysql> use mydb;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
+----------------+
| Tables_in_mydb |
+----------------+
| cheetos |
| greetings |
| mody |
| people |
+----------------+
4 rows in set (0.00 sec)
mysql> select * from people;
+----+--------+------+
| id | name | info |
+----+--------+------+
| 1 | 7stud | abc |
| 2 | Beth | xxx |
| 3 | Diane | xyz |
| 4 | Kathy | xyz |
| 5 | Kathy | xyz |
| 6 | Dave | efg |
| 7 | Tom | zzz |
| 8 | David | abc |
| 9 | Eloise | abc |
| 10 | Jess | xyz |
| 11 | Jeffsy | 2.0 |
| 12 | XXX | xxx |
| 13 | XXX | xxx |
+----+--------+------+
13 rows in set (0.00 sec)
In the slashdb form for creating a new database, here is the info I entered:
Hostname: 127.0.0.1
Port: 80
Database Login: 7stud
Database Password: **
Database Name: mydb
Then I hit the "Test Connection" button, whereupon I get a spinning wheel, which disappears after a few minutes, but no "Connection Successful" message. What am I doing wrong?
Now, I'm using port 3306:
mysql> SHOW GLOBAL VARIABLES LIKE 'PORT';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| port | 3306 |
+---------------+-------+
1 row in set (0.00 sec)
but when slashdb tries to connect, I get the error:
Host localhost:3306 is not accessible
Your port is wrong in the database connection. You said that your MySQL is configured on port 3306, but you also posted your SlashDB config for the database with port 80. Please change that to 3306.
Also, check whether you need to enable remote access to MySQL. Even if your SlashDB is running on the same machine as the MySQL database, it uses TCP/IP to connect.
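If it turns out the login is only set up for the local socket, a minimal sketch for allowing the same user over TCP from the local host (host and password are placeholders, adjust to your setup):
CREATE USER '7stud'@'127.0.0.1' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON mydb.* TO '7stud'@'127.0.0.1';
FLUSH PRIVILEGES;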

Filter duplicate output over consecutive rolling period on Postgres

I have a simple table with daily regression test results and would like to get an output of tests that fail consecutively in the last 3 days.
The table looks something like this.
| ID | rule   | status | environment | date       | note      |
|----|--------|--------|-------------|------------|-----------|
| 1  | Test01 | pass   | dev         | 2018-05-23 |           |
| 2  | Test02 | pass   | dev         | 2018-05-23 |           |
| 3  | Test03 | pass   | dev         | 2018-05-23 |           |
| 4  | Test01 | pass   | staging     | 2018-05-23 |           |
| 5  | Test02 | pass   | staging     | 2018-05-23 |           |
| 6  | Test03 | pass   | staging     | 2018-05-23 |           |
| 7  | Test01 | pass   | dev         | 2018-05-24 |           |
| 8  | Test02 | fail   | dev         | 2018-05-24 | fail note |
| 9  | Test03 | pass   | dev         | 2018-05-24 |           |
| 10 | Test01 | fail   | dev         | 2018-05-24 | fail note |
| 11 | Test02 | fail   | dev         | 2018-05-24 | fail note |
| 12 | Test03 | pass   | dev         | 2018-05-24 |           |
| 13 | Test01 | pass   | dev         | 2018-05-25 |           |
| 14 | Test02 | fail   | dev         | 2018-05-25 | fail note |
| 15 | Test03 | fail   | dev         | 2018-05-25 | fail note |
| 16 | Test01 | pass   | dev         | 2018-05-26 |           |
| 17 | Test02 | fail   | dev         | 2018-05-26 | fail note |
| 18 | Test03 | pass   | dev         | 2018-05-26 |           |
So, assuming today is 2018-05-26, how do I output a result that shows that Test02 has been failing over the last 3 days (a rolling period) in Postgres?
The reason I want it consecutive is that there may be tests that fail one day, pass the next and fail the day after due to network issues, so the consecutive requirement eliminates those. Additionally, there are also duplicate test runs during the same day (so Test01 can potentially run multiple times during the same day).
With respect to the current date, you want to look through records dating at most 2 days back from it and show only those rules for which the test failed every time it ran (for each environment separately):
select rule, environment
from t
where date > current_date - 3 -- for your sample data it's '2018-05-26'
group by rule, environment
having sum(case when status = 'fail' then 1 end) = count(*)
Output for your sample data and where date > '2018-05-26'::date - 3:
rule | environment
--------+-------------
Test02 | dev
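If you also want to require that the rule produced results on each of the 3 days in the window (so a test that only ran once and failed is not reported), one variant, as an untested sketch, adds a distinct-day check:
select rule, environment
from t
where date > current_date - 3
group by rule, environment
having sum(case when status = 'fail' then 1 end) = count(*)
   and count(distinct date) = 3;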

How to flatten rows to columns in PostgreSQL

Using PostgreSQL 9.3, I have a table that shows individual permits issued across a single year, below:
 permit_typ      | zipcode | address          | name
-----------------+---------+------------------+--------------
 CONSTRUCTION    | 20004   | 124 fake streeet | billy joe
 SUPPLEMENTAL    | 20005   | 124 fake streeet | james oswald
 POST CARD       | 20005   | 124 fake streeet | who cares
 HOME OCCUPATION | 20007   | 124 fake streeet | who cares
 SHOP DRAWING    | 20009   | 124 fake streeet | who cares
I am trying to flatten this so it looks like
CONSTRUCTION | SUPPLEMENTAL | POST CARD| HOME OCCUPATION | SHOP DRAWING | zipcode
-------------+--------------+-----------+----------------+--------------+--------
1 | 2 | 3 | 5 | 6 | 20004
1 | 2 | 3 | 5 | 6 | 20005
1 | 2 | 3 | 5 | 6 | 20006
1 | 2 | 3 | 5 | 6 | 20007
1 | 2 | 3 | 5 | 6 | 20008
I have been trying to use crosstab, but it's a bit above my rusty SQL experience. Anybody have any ideas?
I usually approach this type of query using conditional aggregation. In Postgres, you can do:
select zipcode,
sum( (permit_typ = 'CONSTRUCTION')::int) as Construction,
sum( (permit_typ = 'SUPPLEMENTAL')::int) as SUPPLEMENTAL,
. . .
from t
group by zipcode;
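Filled in with the permit types from the sample data, the full query would look something like this (a sketch, untested against your real table):
select zipcode,
       sum( (permit_typ = 'CONSTRUCTION')::int)    as construction,
       sum( (permit_typ = 'SUPPLEMENTAL')::int)    as supplemental,
       sum( (permit_typ = 'POST CARD')::int)       as post_card,
       sum( (permit_typ = 'HOME OCCUPATION')::int) as home_occupation,
       sum( (permit_typ = 'SHOP DRAWING')::int)    as shop_drawing
from t
group by zipcode;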