Cannot create stream in KSQL - apache-kafka

I have the stream below and I want to create another stream from it. I am running the command below and getting the following error. Am I missing something?
ksql> create stream down_devices_stream as select * from fakedata119 where deviceProperties['status']='false';
Failed to generate code for SqlPredicate.filterExpression: (FAKEDATA119.DEVICEPROPERTIES['status'] = 'false')schema:org.apache.kafka.connect.data.SchemaBuilder#6e18dbbfisWindowedKey:false
Caused by: Line 1, Column 180: Operator "<=" not allowed on reference operands
ksql> select * from fakedata119;
1529505497087 | null | 19 | visibility sensors | Wed Jun 20 16:38:17 CEST 2018 | {visibility=74, status=true}
1529505498087 | null | 7 | fans | Wed Jun 20 16:38:18 CEST 2018 | {temperature=44, rotationSense=1, status=false, frequency=49}
1529505499088 | null | 28 | air quality monitors | Wed Jun 20 16:38:19 CEST 2018 | {coPpm=257, status=false, Co2Ppm=134}
1529505500089 | null | 4 | fans | Wed Jun 20 16:38:20 CEST 2018 | {temperature=42, rotationSense=1, status=true, frequency=51}
1529505501089 | null | 23 | air quality monitors | Wed Jun 20 16:38:21 CEST 2018 | {coPpm=158, status=true, Co2Ppm=215}
ksql> describe fakedata119;
Field | Type
---------------------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
DEVICEID | INTEGER
CATEGORY | VARCHAR(STRING)
TIMESTAMP | VARCHAR(STRING)
DEVICEPROPERTIES | MAP[VARCHAR(STRING),VARCHAR(STRING)]

Without seeing your input data, I have guessed that it looks something like this:
{
  "id": "a42",
  "category": "foo",
  "timestamp": "2018-06-21 10:04:57 BST",
  "deviceID": 42,
  "deviceProperties": {
    "status": "false",
    "foo": "bar"
  }
}
If so, you are better off using EXTRACTJSONFIELD to access the nested values and build predicates.
CREATE STREAM test (Id VARCHAR, category VARCHAR, timeStamp VARCHAR, \
deviceID INTEGER, deviceProperties VARCHAR) \
WITH (KAFKA_TOPIC='test_map2', VALUE_FORMAT='JSON');
ksql> SELECT EXTRACTJSONFIELD(DEVICEPROPERTIES,'$.status') AS STATUS FROM fakeData223;
false
ksql> SELECT * FROM fakeData223 \
WHERE EXTRACTJSONFIELD(DEVICEPROPERTIES,'$.status')='false';
1529572405759 | null | a42 | foo | 2018-06-21 10:04:57 BST | 42 | {"status":"false","foo":"bar"}
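Putting those pieces together, a persistent version of your original statement could look something like this (a sketch only, using the test stream declared above, where deviceProperties is a VARCHAR):
ksql> CREATE STREAM down_devices_stream AS \
        SELECT * FROM test \
        WHERE EXTRACTJSONFIELD(DEVICEPROPERTIES,'$.status')='false';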
I've logged the error you've found as a bug to track here: https://github.com/confluentinc/ksql/issues/1474

I've added a test to cover this use case:
https://github.com/confluentinc/ksql/pull/1476/files
Interestingly, this passes on our master and upcoming 5.0 branches, but fails on 4.1.
So... it looks like this is an issue in the version you're using, but the good news is that it's fixed in the upcoming release. Plus you can use Robin's workaround above for now.
Happy querying!
Andy

Related

slow running postgres11 logical replication from EC2 to RDS

I'm trying to move a Postgres (11.6) database on EC2 to RDS (Postgres 11.6). I started replication a couple of nights ago and have now noticed that replication has slowed down considerably, judging by how fast the database size is increasing on the subscriber (SELECT pg_database_size('db_name')/1024/1024). Here are some stats of the environment:
Publisher Node:
Instance type: r5.24xlarge
Disk: 5Tb GP2 with 16,000 PIOPs
Database size w/ pg_database_size/1024/1024: 2,295,955 MB
Subscriber Node:
Instance type: RDS r5.24xlarge
Disk: 3Tb GP2
Here is the current DB size for the subscriber and publisher:
Publisher:
SELECT pg_database_size('db_name')/1024/1024 db_size_publisher;
db_size_publisher
-------------------
2295971
(1 row)
Subscriber:
SELECT pg_database_size('db_name')/1024/1024 as db_size_subscriber;
db_size_subscriber
--------------------
1506348
(1 row)
The difference is still about 789 GB left to replicate, it seems, and I've noticed that the subscriber DB is increasing at a rate of about 250 KB/sec:
db_name=> SELECT pg_database_size('db_name')/1024/1024, current_timestamp;
?column? | current_timestamp
----------+-------------------------------
1506394 | 2020-05-21 06:27:46.028805-07
(1 row)
db_name=> SELECT pg_database_size('db_name')/1024/1024, current_timestamp;
?column? | current_timestamp
----------+-------------------------------
1506396 | 2020-05-21 06:27:53.542946-07
(1 row)
At this rate, it would take another 30 days to finish replication, which makes me think I've set something up wrong.
Here are also some other stats from the publisher and subscriber:
Subscriber pg_stat_subscription:
db_name=> select * from pg_stat_subscription;
subid | subname | pid | relid | received_lsn | last_msg_send_time | last_msg_receipt_time | latest_end_lsn | latest_end_time
-------+----------------+-------+-------+---------------+-------------------------------+-------------------------------+----------------+-------------------------------
21562 | rds_subscriber | 2373 | 18411 | | 2020-05-20 18:41:54.132407-07 | 2020-05-20 18:41:54.132407-07 | | 2020-05-20 18:41:54.132407-07
21562 | rds_subscriber | 43275 | | 4811/530587E0 | 2020-05-21 06:15:55.160231-07 | 2020-05-21 06:15:55.16003-07 | 4811/5304BD10 | 2020-05-21 06:15:54.931053-07
(2 rows)
At this rate it would take weeks to complete. What am I doing wrong here?
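For reference, one way to measure the remaining lag directly, rather than inferring it from database size, is to query the replication slot on the publisher; a sketch, assuming the slot kept the default name of the subscription:
-- On the publisher: bytes of WAL the slot still has to send to the subscriber
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM   pg_replication_slots
WHERE  slot_name = 'rds_subscriber';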

SQL Query to display Calculated fields on a year, monthly basis

I need help writing this SQL query (PostgreSQL) to display results in the form below:
--------------------------------------------------------------------------------
State | Jan '17 | Feb '17 | Mar '17 | Apr '17 | May '17 ... Dec '18
--------------------------------------------------------------------------------
Principal Outs. |700,839 |923,000 |953,000 |6532,293 | 789,000 ... 913,212
Disbursal Amount |23,000 |25,000 |23,992 | 23,627 | 25,374 ... 23,209
Interest |113,000 |235,000 |293,992 |322,627 |323,374 ... 267,209
There are multiple tables but I would be okay joining them.
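Since the underlying tables aren't shown, here is only a rough sketch of the conditional-aggregation shape such a query usually takes; the table name loans and the columns txn_date, principal, disbursal and interest are placeholders, not taken from the question:
-- Sketch only: loans, txn_date, principal, disbursal and interest are assumed names.
SELECT metric
     , SUM(amount) FILTER (WHERE date_trunc('month', txn_date) = DATE '2017-01-01') AS "Jan '17"
     , SUM(amount) FILTER (WHERE date_trunc('month', txn_date) = DATE '2017-02-01') AS "Feb '17"
     -- ... one FILTER clause per month through Dec '18
FROM (
   SELECT 'Principal Outs.'  AS metric, txn_date, principal AS amount FROM loans
   UNION ALL
   SELECT 'Disbursal Amount', txn_date, disbursal FROM loans
   UNION ALL
   SELECT 'Interest', txn_date, interest FROM loans
   ) t
GROUP BY metric;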

How can i query a map field in ksql?

I have created a stream from a topic in KSQL. The stream has the fields below. I can query different fields, for example: select category from fakedata119. I would like to know how I can get a single item from the map field, for example: status?
The data that are coming from the source are:
ProducerRecord(topic=fake-data-119, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value={"deviceId": 16, "category": "visibility sensors", "timeStamp": "Tue Jun 19 10:11:10 CEST 2018", "deviceProperties": {"visibility": "72", "status": "true"}}, timestamp=null)
ProducerRecord(topic=fake-data-119, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value={"deviceId": 6, "category": "fans", "timeStamp": "Tue Jun 19 10:11:11 CEST 2018", "deviceProperties": {"temperature": "22", "rotationSense": "1", "status": "false", "frequency": "56"}}, timestamp=null)
ProducerRecord(topic=fake-data-119, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value={"deviceId": 23, "category": "air quality monitors", "timeStamp": "Tue Jun 19 10:11:12 CEST 2018", "deviceProperties": {"coPpm": "136", "status": "false", "Co2Ppm": "450"}}, timestamp=null)
I am using the statement below to create the stream:
CREATE STREAM fakeData119 WITH (KAFKA_TOPIC='fake-data-119', VALUE_FORMAT='AVRO');
Field | Type
---------------------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
DEVICEID | INTEGER
CATEGORY | VARCHAR(STRING)
TIMESTAMP | VARCHAR(STRING)
DEVICEPROPERTIES | MAP[VARCHAR(STRING),VARCHAR(STRING)]
---------------------------------------------------------
ksql> select * from fakeData119;
1529394182864 | null | 6 | fans | Tue Jun 19 09:43:02 CEST 2018 | {temperature=36, rotationSense=1, status=false, frequency=72}
1529394183869 | null | 5 | fans | Tue Jun 19 09:43:03 CEST 2018 | {temperature=23, rotationSense=1, status=true, frequency=76}
1529394184872 | null | 16 | visibility sensors | Tue Jun 19 09:43:04 CEST 2018 | {visibility=14, status=true}
1529394185875 | null | 25 | air quality monitors | Tue Jun 19 09:43:05 CEST 2018 | {coPpm=280, status=false, Co2Ppm=170}
You can get items in the map in the following way:
select deviceproperties['status'] from fakedata119
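For example, to pull individual map entries out as named columns (a sketch; the column aliases are just illustrative):
ksql> SELECT deviceid, \
             deviceproperties['status'] AS status, \
             deviceproperties['temperature'] AS temperature \
      FROM fakedata119;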

How to get back aggregate values across 2 dimensions using Python Cubes?

Situation
Using Python 3, Django 1.9, Cubes 1.1, and Postgres 9.5.
These are my data tables, in text format:
Store table
------------------------------
| id | code | address |
|-----|------|---------------|
| 1 | S1 | Kings Row |
| 2 | S2 | Queens Street |
| 3 | S3 | Jacks Place |
| 4 | S4 | Diamonds Alley|
| 5 | S5 | Hearts Road |
------------------------------
Product table
------------------------------
| id | code | name |
|-----|------|---------------|
| 1 | P1 | Saucer 12 |
| 2 | P2 | Plate 15 |
| 3 | P3 | Saucer 13 |
| 4 | P4 | Saucer 14 |
| 5 | P5 | Plate 16 |
| and many more .... |
|1000 |P1000 | Bowl 25 |
|----------------------------|
Sales table
----------------------------------------
| id | product_id | store_id | amount |
|-----|------------|----------|--------|
| 1 | 1 | 1 |7.05 |
| 2 | 1 | 2 |9.00 |
| 3 | 2 | 3 |1.00 |
| 4 | 2 | 3 |1.00 |
| 5 | 2 | 5 |1.00 |
| and many more .... |
| 1000| 20 | 4 |1.00 |
|--------------------------------------|
The relationships are:
Sales belongs to Store
Sales belongs to Product
Store has many Sales
Product has many Sales
What I want to achieve
I want to use Cubes to display results, with pagination, in the following manner:
Given the stores S1-S3:
-------------------------
| product | S1 | S2 | S3 |
|---------|----|----|----|
|Saucer 12|7.05|9 | 0 |
|Plate 15 |0 |0 | 2 |
| and many more .... |
|------------------------|
Note the following:
Even though there were no records in sales for Saucer 12 under Store S3, I displayed 0 instead of null or none.
I want to be able to do sort by store, say descending order for, S3.
The cells indicate the SUM total of that particular product spent in that particular store.
I also want to have pagination.
What I tried
This is the configuration I used:
"cubes": [
{
"name": "sales",
"dimensions": ["product", "store"],
"joins": [
{"master":"product_id", "detail":"product.id"},
{"master":"store_id", "detail":"store.id"}
]
}
],
"dimensions": [
{ "name": "product", "attributes": ["code", "name"] },
{ "name": "store", "attributes": ["code", "address"] }
]
This is the code I used:
result = browser.aggregate(drilldown=['Store','Product'],
order=[("Product.name","asc"), ("Store.name","desc"), ("total_products_sale", "desc")])
I didn't get what I wanted. Instead, I got this:
----------------------------------------------
| product_id | store_id | total_products_sale |
|------------|----------|---------------------|
| 1 | 1 | 7.05 |
| 1 | 2 | 9 |
| 2 | 3 | 2.00 |
| and many more .... |
|---------------------------------------------|
which is the whole table with no pagination, and products that were not sold in a store don't show up as zero.
My question
How do I get what I want?
Do I need to create another data table that aggregates everything by store and product before I use cubes to run the query?
Update
I have read more and realised that what I want is called dicing, since I need to go across 2 dimensions. See: https://en.wikipedia.org/wiki/OLAP_cube#Operations
Cross-posted at Cubes GitHub issues to get more attention.
This is a pure SQL solution using crosstab() from the additional tablefunc module to pivot the aggregated data. It typically performs better than any client-side alternative. If you are not familiar with crosstab(), read this first:
PostgreSQL Crosstab Query
And this about the "extra" column in the crosstab() output:
Pivot on Multiple Columns using Tablefunc
SELECT product_id, product
     , COALESCE(s1, 0) AS s1  -- 1. ... displayed 0 instead of null
     , COALESCE(s2, 0) AS s2
     , COALESCE(s3, 0) AS s3
     , COALESCE(s4, 0) AS s4
     , COALESCE(s5, 0) AS s5
FROM   crosstab(
   'SELECT s.product_id, p.name, s.store_id, s.sum_amount
    FROM   product p
    JOIN  (
       SELECT product_id, store_id
            , sum(amount) AS sum_amount  -- 3. SUM total of product spent in store
       FROM   sales
       GROUP  BY product_id, store_id
       ) s ON p.id = s.product_id
    ORDER  BY s.product_id, s.store_id;'
 , 'VALUES (1),(2),(3),(4),(5)'          -- desired store_id's
   ) AS ct (product_id int, product text -- "extra" column
          , s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric)
ORDER  BY s3 DESC;  -- 2. ... descending order for S3
Produces your desired result exactly (plus product_id).
To include products that have never been sold replace [INNER] JOIN with LEFT [OUTER] JOIN.
SQL Fiddle with base query.
The tablefunc module is not installed on sqlfiddle.
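For illustration, the source query inside crosstab() would then look roughly like this (a sketch of the change described above; note p.id replaces s.product_id so that unsold products keep their id):
SELECT p.id AS product_id, p.name, s.store_id, s.sum_amount
FROM   product p
LEFT   JOIN (
   SELECT product_id, store_id
        , sum(amount) AS sum_amount
   FROM   sales
   GROUP  BY product_id, store_id
   ) s ON p.id = s.product_id
ORDER  BY p.id, s.store_id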
Major points
Read the basic explanation in the reference answer for crosstab().
I am including product_id because product.name is hardly unique. This might otherwise lead to sneaky errors conflating two different products.
You don't need the store table in the query if referential integrity is guaranteed.
ORDER BY s3 DESC works, because s3 references the output column where NULL values have been replaced with COALESCE. Else we would need DESC NULLS LAST to sort NULL values last:
PostgreSQL sort by datetime asc, null first?
For building crosstab() queries dynamically consider:
Dynamic alternative to pivot with CASE and GROUP BY
I also want to have pagination.
That last item is fuzzy. Simple pagination can be had with LIMIT and OFFSET:
Displaying data in grid view page by page
I would consider a MATERIALIZED VIEW to materialize results before pagination. If you have a stable page size I would add page numbers to the MV for easy and fast results.
To optimize performance for big result sets, consider:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Optimize query with OFFSET on large table
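For example, a materialized view plus LIMIT/OFFSET pagination could look roughly like this (a sketch; product_sales_pivot is a made-up name, and the SELECT is essentially the crosstab() query from above, materialized once so that pages are cheap to serve):
CREATE MATERIALIZED VIEW product_sales_pivot AS
SELECT product_id, product
     , COALESCE(s1, 0) AS s1, COALESCE(s2, 0) AS s2, COALESCE(s3, 0) AS s3
     , COALESCE(s4, 0) AS s4, COALESCE(s5, 0) AS s5
FROM   crosstab(
   'SELECT s.product_id, p.name, s.store_id, s.sum_amount
    FROM   product p
    JOIN  (SELECT product_id, store_id, sum(amount) AS sum_amount
           FROM   sales GROUP BY product_id, store_id) s ON p.id = s.product_id
    ORDER  BY s.product_id, s.store_id'
 , 'VALUES (1),(2),(3),(4),(5)'
   ) AS ct (product_id int, product text
          , s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric);

-- Page 2 with 20 rows per page:
SELECT *
FROM   product_sales_pivot
ORDER  BY s3 DESC
LIMIT  20 OFFSET 20;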

Postgres placeholders for 0 data

I have some Postgres data like this:
date | count
2015-01-01 | 20
2015-01-02 | 15
2015-01-05 | 30
I want to run a query that pulls this data with 0s in place for the dates that are missing, like this:
date | count
2015-01-01 | 20
2015-01-02 | 15
2015-01-03 | 0
2015-01-04 | 0
2015-01-05 | 30
This is for a very large range of dates, and I need it to fill in all the gaps. How can I accomplish this with just SQL?
Given a table junk of:
d | c
------------+----
2015-01-01 | 20
2015-01-02 | 15
2015-01-05 | 30
Running
select fake.d, coalesce(j.c, 0) as c
from (select min(d) + generate_series(0,7,1) as d from junk) fake
left outer join junk j on fake.d=j.d;
gets us:
d | c
------------+----------
2015-01-01 | 20
2015-01-02 | 15
2015-01-03 | 0
2015-01-04 | 0
2015-01-05 | 30
2015-01-06 | 0
2015-01-07 | 0
2015-01-08 | 0
You could of course adjust the start date for the series, the length it runs for, etc.
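For instance, to cover an explicit date range instead of anchoring the series at min(d), a variant like this should work (a sketch; adjust the bounds to your data):
select fake.d::date as d, coalesce(j.c, 0) as c
from generate_series('2015-01-01'::date, '2015-01-10'::date, interval '1 day') as fake(d)
left outer join junk j on j.d = fake.d::date
order by fake.d;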
Where is this data going? To an outside source or another table or view?
There's probably a better solution, but you could create a new table (or do it in Excel, wherever the data is going) that has the entire date range you want, with another integer column of null values. Then update that table with your current dataset and replace all nulls with zero.
It's a really roundabout way to do things, but it'll work.
I don't have enough rep to comment :(
This is also a good reference
Using COALESCE to handle NULL values in PostgreSQL