Can I convert from Table to Stream in KSQL? - apache-kafka

I am working with Kafka and KSQL. I would like to find the last row within each 5-minute period for each DEV_NAME (the ROWKEY). To do that, I created a stream and an aggregated table to join against.
With the KSQL below, I created the table that finds the last row within 5 minutes for each DEV_NAME:
CREATE TABLE TESTING_TABLE AS
SELECT ROWKEY AS DEV_NAME, max(ROWTIME) as LAST_TIME
FROM TESTING_STREAM WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY ROWKEY;
Then, I would like to join together:
CREATE STREAM TESTING_S_2 AS
SELECT *
FROM TESTING_S S
INNER JOIN TESTING_T T
ON S.ROWKEY = T.ROWKEY
WHERE
S.ROWTIME = T.LAST_TIME;
However, this fails with the error:
Caused by: org.apache.kafka.streams.errors.StreamsException: A serializer (org.apache.kafka.streams.kstream.TimeWindowedSerializer) is not compatible to the actual key type (key type: org.apache.kafka.connect.data.Struct). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters.
It seems that the WINDOW TUMBLING function changed the format of my ROWKEY
(e.g. DEV_NAME_11508 -> DEV_NAME_11508 : Window{start=157888092000 end=-}).
So, without setting the Serdes, can I convert the table to a stream and set PARTITION BY DEV_NAME?

As you've identified, the issue is that your table is a windowed table, meaning the key of the table is windowed, and you can not look up into a windowed table with a non-windowed key.
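One rough way to picture the mismatch, as a plain Python sketch rather than anything KSQL actually runs: a windowed aggregate is keyed by the pair (key, window start), so a lookup with the bare key finds nothing. The names window_start and build_windowed_table and the sample timestamps are made up for illustration.

```python
WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling window, in milliseconds


def window_start(ts_ms, size_ms=WINDOW_MS):
    """Start of the tumbling window that contains ts_ms."""
    return ts_ms - ts_ms % size_ms


def build_windowed_table(events):
    """Max timestamp per (key, window start), similar in spirit to the windowed aggregate."""
    table = {}
    for key, ts in events:
        wkey = (key, window_start(ts))
        table[wkey] = max(table.get(wkey, ts), ts)
    return table


# Two events for the same device that fall into different windows.
events = [("DEV_NAME_11508", 10_000), ("DEV_NAME_11508", 310_000)]
table = build_windowed_table(events)

# The bare key is not a key of the table; only (key, window start) pairs are.
print("DEV_NAME_11508" in table)
print(sorted(table))
```

The stream side of the join only carries the bare key, which is why the two sides cannot be matched up without changing how one of them is keyed.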
Your table, as it stands, will generate one unique row per ROWKEY for each 5-minute window. Yet it seems like you don't care about anything but the most recent window. It may be that you don't need the windowing in the table, e.g.
CREATE TABLE TESTING_TABLE AS
SELECT
ROWKEY AS DEV_NAME,
max(ROWTIME) as LAST_TIME
FROM TESTING_STREAM
WHERE ROWTIME > (UNIX_TIMESTAMP() - 300000)
GROUP BY ROWKEY;
This will track the max timestamp per key, ignoring any timestamp that is over 5 minutes old. (Of course, this check is only done at the time the event is received; the row isn't removed after 5 minutes.)
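To make the "checked only on arrival" behaviour concrete, here is a small Python sketch of the same idea. It is only a model of the filter-then-aggregate logic, not KSQL itself; update_last_time and the timestamps are invented for the example.

```python
FIVE_MIN_MS = 300_000


def update_last_time(table, key, event_ts, now_ms):
    """Apply one event: it counts only if it is less than 5 minutes old on arrival."""
    if event_ts > now_ms - FIVE_MIN_MS:
        table[key] = max(table.get(key, event_ts), event_ts)
    return table


table = {}
# Fresh event: accepted and stored.
update_last_time(table, "dev1", event_ts=1_000, now_ms=2_000)
# Event that is already stale when it arrives: ignored.
update_last_time(table, "dev1", event_ts=1_500, now_ms=400_000)

# The old row is still there even though "now" has moved far past it.
print(table)
```

Nothing ever evicts the stored row, so the table keeps the last accepted timestamp for a key until a newer event replaces it.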
Also, this join:
CREATE STREAM TESTING_S_2 AS
SELECT *
FROM TESTING_S S
INNER JOIN TESTING_T T
ON S.ROWKEY = T.ROWKEY
WHERE
S.ROWTIME = T.LAST_TIME;
almost certainly isn't doing what you think, and it wouldn't work the way you want because of race conditions.
It's not clear what you're trying to achieve. Adding more information about your source data and required output may help people to provide you with a solution.

Related

Why is counting by i not working but selecting from hdb works just fine?

I have a partitioned hdb and the following query works fine :
select from tableName where date within(.z.d-8;.z.d)
but the following query breaks :
select count i by date from tableName where date within(.z.d-8;.z.d)
with the following error :
"./2017.10.14/:./2017.10.15/tableName. OS reports: No such file or directory"
Any idea why this might happen?
As the error indicates, there's no table called tableName in the partition for 2017.10.15. For partitioned databases, kdb caches table counts; this happens when it runs the first query with both of the following properties:
the "select" part of the query is either count i or the partition field itself (in your example that would be date)
the where clause is either empty or constrains the partition field only.
(.Q.ps -- partitioned select -- is where all this magic happens, see the definition of it if you need all the details.)
You have several options to avoid the error you're getting.
Amend the query to avoid having either count i on its own or the empty where.
Any of the following will work; the first is the simplest while the others are useful if you're writing a query for the general case and don't know field names in advance.
select count sym by date from tableName where date within (.z.d-8;.z.d) / any field except i
select count i by date from tableName where date within (.z.d-8;.z.d),i>=0
select `dummy, count i by date from tableName where date within (.z.d-8;.z.d)
select {count x}i by date from tableName where date within (.z.d-8;.z.d)
Use .Q.view to define a sub-view to exclude partitions with missing tables; kdb won't cache or otherwise access them.
The previous solutions will not work if the date range in your select includes partitions with missing tables. In this case you can either
Run .Q.chk to create empty tables where they are missing; or
Run .Q.bv to construct the dictionary of table schemas for tables with missing partitions.
You probably need to create the missing tables. I believe that when you do a 'count i' on a partitioned table as you have done, it counts every single partition (not just the ones in your query) and caches these counts in .Q.pn.
If you run .Q.chk[HDB root location], it should create the missing tables and your query should work:
https://code.kx.com/q/ref/dotq/#qchk-fill-hdb
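Before running .Q.chk, it can help to see which partitions are actually missing the table. The following is a minimal Python sketch that only inspects the directory tree from outside kdb. It assumes the usual date-partitioned layout (root/2017.10.14/tableName/), which matches the path in the error message; partitions_missing_table is a hypothetical helper, not part of any kdb tooling.

```python
import re
from pathlib import Path

# Date partitions are directories named like 2017.10.14.
DATE_DIR = re.compile(r"^\d{4}\.\d{2}\.\d{2}$")


def partitions_missing_table(hdb_root, table):
    """Return the names of date partitions under hdb_root that have no <table> directory."""
    root = Path(hdb_root)
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_dir() and DATE_DIR.match(p.name) and not (p / table).is_dir()
    )
```

Calling partitions_missing_table(".", "tableName") from the HDB root would list the dates that need attention; non-partition entries such as the sym file are skipped by the name pattern.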
'count i' will scan each partition regardless of what is specified in the where clause. So it's likely those two partitions are incomplete.
Better to pick an actual column for things like that or else something like
select count i>0 by date from tableName where date within(.z.d-8;.z.d)
will prevent the scanning of all partitions.
Jason

Move partition from one table to another table PostgreSQL 10.11

I am new to PostgreSQL and am working on a project where I have been asked to move all partitions older than 6 months to a legacy table so that queries on the main table are faster. I have a partitioned table with 10 years of data.
Let's assume myTable is the table holding the current 6 months of data and myTable_legacy will hold all data older than 6 months, going back up to 10 years. The table is range-partitioned by month.
I researched my questions online but could not reach a conclusion; they are listed after the table definitions.
I am currently testing in a lab before finalizing the steps and performing the actual migration, using the question below as a reference.
How to migrate an existing Postgres Table to partitioned table as transparently as possible?
create table myTable(
forDate date not null,
key2 int not null,
value int not null
) partition by range (forDate);
create table myTable_legacy(
forDate date not null,
key2 int not null,
value int not null
) partition by range (forDate);
1) The daily application queries only touch the current 6 months of data. Is it necessary to move data older than 6 months to a new partition to get better query response times? I researched online but wasn't able to find any solid evidence on this.
2) If performance is going to be better, how do I move older partitions from myTable to myTable_legacy? Based on my research, PostgreSQL has no exchange partition option.
Any help or guidance would help me proceed with the requirement.
When I try to attach the partition to mytable_legacy, I get an error:
alter table mytable detach partition mytable_200003;
alter table mytable_legacy attach partition mytable_200003
for values from ('2003-03-01') to ('2003-03-30');
results in:
ERROR: partition constraint is violated by some row
SQL state: 23514
The contents of the partition:
select * from mytable_200003;
"2000-03-02" 1 19
"2000-03-30" 15 8
It's always better to keep the production table light. One practice I use is to have a timestamp column and write a trigger function that inserts the row into the other table if the timestamp is older than now() minus 6 months (i.e. 6-month-old data).
Quote from the manual
When creating a range partition, the lower bound specified with FROM is an inclusive bound, whereas the upper bound specified with TO is an exclusive bound
(emphasis mine)
So the expression to ('2003-03-30') does not allow March 30th to be inserted into the partition.
Additionally, your data in mytable_200003 is for the year 2000, not for the year 2003 (which you use in your partition definition). To cover the whole of March, simply use April 1st as the upper bound.
So you need to change the partition definition to cover March 2000, not March 2003:
alter table mytable_legacy
attach partition mytable_200003
for values from ('2000-03-01') to ('2000-04-01');
                  ^ here            ^ here
Online example
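The inclusive/exclusive rule is easy to check outside the database. This is just a Python sketch of the bound check using the two rows shown in the question; fits_partition is a made-up helper and not something PostgreSQL exposes.

```python
from datetime import date


def fits_partition(rows, lower, upper):
    """True if every date satisfies lower <= d < upper (FROM is inclusive, TO is exclusive)."""
    return all(lower <= d < upper for d in rows)


# The two forDate values in mytable_200003.
rows = [date(2000, 3, 2), date(2000, 3, 30)]

# Bounds from the failing ATTACH: wrong year, and the upper bound excludes the 30th.
print(fits_partition(rows, date(2003, 3, 1), date(2003, 3, 30)))
# Corrected bounds covering all of March 2000.
print(fits_partition(rows, date(2000, 3, 1), date(2000, 4, 1)))
```

Note that fixing only the year would still fail, because an upper bound of 2000-03-30 excludes the row dated 2000-03-30.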

Ksql: Left Join Displays columns from stream but not tables

I have one stream and one table in KSQL, as mentioned below:
Stream name: DEAL_STREAM
Table name: EXPENSE_TABLE
When I run the query below, it displays only the columns from the stream; no table columns are displayed.
Is this the expected output? If not, am I doing something wrong?
SELECT TD.EXPENSE_CODE, TD.BRANCH_CODE, TE.EXPENSE_DESC
FROM DEAL_STREAM TD
LEFT JOIN EXPENSE_TABLE TE ON TD.EXPENSE_CODE = TE.EXPENSE_CODE
WHERE TD.EXPENSE_CODE LIKE '%NL%' AND TD.BRANCH_CODE LIKE '%AM%';
An output of the query is as shown below.
NL8232##0 | AM | null
NL0232##0 | AM | null
NL6232#!0 | AM | null
NL5232^%0 | AM | null
"When I run the below queries it displays only columns from the stream but no table columns are being displayed."
In a stream-table (left) join, the output records will contain null columns (for table-side columns) if there is no matching record in the table at the time of the join/lookup.
"Is this the expected output? If not, am I doing something wrong?"
Is it possible that, for example, you wrote (1) the input data into the stream before you wrote (2) the input data into the table? If so, the stream-table join query would have attempted to perform table lookups at time (1), when no such lookup data was available in the table yet (because that happened later, at time (2)). Because there was no such table data available, the join wrote output records where the table-side columns were null.
Note: This stream-table join behavior in KSQL (and, by extension, Apache Kafka's Streams API, on which KSQL is built) is pretty much the norm for joins in the streaming world. Here, only the stream side of the stream-table join will trigger downstream join outputs, and if there's no match for a stream record on the table side at the time when a new input record is being joined, then the table-side columns will be null. Since this is, however, a common cause of user confusion, we are currently working on adding table-side triggering of join output to Apache Kafka's Streams API and KSQL. When such a feature is available, your issue above would not happen anymore.
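The ordering effect can be reproduced with a toy model. The Python sketch below is not how KSQL is implemented; it only mimics the rule that stream records trigger output and table records merely update lookup state. The function name, keys and values are invented for the example.

```python
def left_join_stream_table(events):
    """Replay ('table', key, value) and ('stream', key, value) events in arrival order.

    Only stream events produce output. Each one looks up the table as it is
    at that moment, so a key that is not there yet yields None on the table side.
    """
    table, output = {}, []
    for kind, key, value in events:
        if kind == "table":
            table[key] = value  # update lookup state, emit nothing
        else:
            output.append((key, value, table.get(key)))
    return output


events = [
    ("stream", "NL8232", "AM"),               # arrives before the table row exists
    ("table", "NL8232", "some description"),  # table row arrives later
    ("stream", "NL8232", "AM"),               # same stream record, now it matches
]
result = left_join_stream_table(events)
for row in result:
    print(row)
```

The first output row has None for the table column and the second does not, even though both stream records are identical. Loading the table topic before the stream data is the usual way to avoid the nulls.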

Cassandra efficient table walk

I'm currently working on a benchmark (part of my bachelor thesis) that compares SQL and NoSQL databases based on an abstract data model and abstract queries, to achieve a fair implementation on all systems.
Right now I'm implementing one of those queries.
I have a table in Cassandra that is specified as follows:
CREATE TABLE allocated(
partition_key int,
financial_institution varchar,
primary_uuid uuid,
report_name varchar,
view_name varchar,
row_name varchar,
col_name varchar,
amount float,
PRIMARY KEY (partition_key, report_name, primary_uuid));
This table contains about 100,000,000 records (~300GB).
We now need to calculate the sum for the field "amount" for every possible combination of report_name, view_name, col_name and row_name.
In SQL this would be quite easy, just select sum (amount) and group it by the fields you want.
However, since Cassandra does not support these operations (which is perfectly fine), I need to achieve this in another way.
Currently I achieve this by doing a full-table walk, processing each record and storing the sum in a HashMap in Java for each combination.
The prepared statement I use is as follows:
SELECT
partition_key,
financial_institution,
report_name,
view_name,
col_name,
row_name,
amount
FROM allocated;
That works partially on machines with lots of RAM for both Cassandra and the Java app, but crashes on smaller machines.
Now I'm wondering whether it's possible to do this in a faster way.
I could imagine using partition_key, which also serves as the Cassandra partition key, and doing this for every partition (I have 5 of them).
I also thought of doing this multithreaded by assigning every partition and report to a separate thread and running them in parallel. But I guess this would cause a lot of overhead on the application side.
Now to the actual question: Would you recommend another execution strategy to achieve this?
Maybe I still think too much in a SQL-like way.
Thank you for your support.
Here are two ideas that may help you.
1) You can efficiently scan rows in any table using the following approach. Consider a table with PRIMARY KEY (pk, sk, tk). Let's use a fetch size of 1000, but you can try other values.
First query (Q1):
select whatever_columns from allocated limit 1000;
Process these and then record the value of the three columns that form the primary key. Let's say these values are pk_val, sk_val, and tk_val. Here is your next query (Q2):
select whatever_columns from allocated where token(pk) = token(pk_val) and sk = sk_val and tk > tk_val limit 1000;
The above query will look for records with the same pk and sk, but for the next values of tk. Keep repeating as long as you keep getting 1000 records. When you get fewer, ignore tk and do a greater-than on sk. Here is the query (Q3):
select whatever_columns from allocated where token(pk) = token(pk_val) and sk > sk_val limit 1000;
Again, keep doing this as long as you get 1000 rows. Once you are done, you run the following query (Q4):
select whatever_columns from allocated where token(pk) > token(pk_val) limit 1000;
Now, you again use the pk_val, sk_val, tk_val from the last record, and run Q2 with these values, then Q3, then Q4.....
You are done when Q4 returns less than 1000.
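Here is a small in-memory Python sketch of that Q1 to Q4 loop, so the control flow can be checked without a cluster. Everything in it is a stand-in: token is a fake hash, run_query emulates a CQL SELECT with LIMIT over a list of (pk, sk, tk) tuples, and no driver API is used. One deliberate simplification: after every non-empty page it goes back to Q2 with the last row's key, which avoids skipping rows that share the last sk when Q3 returns a full page.

```python
def token(pk):
    # Stand-in for the partitioner hash; any deterministic function works for the demo.
    return (pk * 2654435761) % 2**32


def run_query(rows, predicate, limit):
    """Emulate a SELECT: rows come back in (token, sk, tk) order, capped by LIMIT."""
    ordered = sorted(rows, key=lambda r: (token(r[0]), r[1], r[2]))
    return [r for r in ordered if predicate(r)][:limit]


def scan_all(rows, page_size):
    """Walk the whole table page by page using the Q1..Q4 pattern."""
    seen = []
    page = run_query(rows, lambda r: True, page_size)  # Q1
    while page:
        seen.extend(page)
        pk, sk, tk = page[-1][:3]  # key of the last row processed
        t = token(pk)
        # Q2: same partition, same sk, later tk
        page = run_query(
            rows, lambda r: token(r[0]) == t and r[1] == sk and r[2] > tk, page_size
        )
        if not page:
            # Q3: same partition, later sk
            page = run_query(rows, lambda r: token(r[0]) == t and r[1] > sk, page_size)
        if not page:
            # Q4: move on to the next partition(s) in token order
            page = run_query(rows, lambda r: token(r[0]) > t, page_size)
    return seen


# 3 partitions x 2 sk values x 4 tk values = 24 rows, read 5 at a time.
rows = [(pk, sk, tk) for pk in range(3) for sk in "ab" for tk in range(4)]
scanned = scan_all(rows, page_size=5)
print(len(scanned), len(set(scanned)))
```

Because only one page is held at a time, memory use on the client stays bounded by the page size plus whatever the aggregation itself needs.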
2) I am assuming that the combinations of report_name, view_name, col_name and row_name are not unique, and that's why you maintain a hashmap to keep track of the total amount whenever you see the same combination again. Here is something that may work better. Create a table in Cassandra whose key is a combination of these four values (maybe delimited). If there were three, you could have simply used a composite key for those three. You also need a column called amounts, which is a list. As you scan the allocated table (using the approach above), do the following for each row:
update amounts_table set amounts = amounts + [whatever_amount] where my_primary_key = four_col_values_delimited;
Once you are done, you can scan this table and compute the sum of the list for each row you see and dump it wherever you want. Note that since there is only one key, you can scan using only token(primary_key) > token(last_value_of_primary_key).
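In Python terms, the two passes look roughly like this. The dict stands in for amounts_table, the "|" delimiter is an arbitrary choice, and collect_amounts and sum_amounts are illustrative names only.

```python
from collections import defaultdict


def collect_amounts(rows):
    """First pass: append each amount to the list stored under its four-column key."""
    amounts_table = defaultdict(list)
    for report, view, col, row, amount in rows:
        key = "|".join((report, view, col, row))  # delimited key built from the four values
        amounts_table[key].append(amount)
    return amounts_table


def sum_amounts(amounts_table):
    """Second pass: scan the helper table and sum the list for each key."""
    return {key: sum(values) for key, values in amounts_table.items()}


rows = [
    ("r1", "v1", "c1", "x1", 10.0),
    ("r1", "v1", "c1", "x1", 2.5),
    ("r2", "v1", "c1", "x1", 4.0),
]
totals = sum_amounts(collect_amounts(rows))
print(totals)
```

If the delimiter can occur inside any of the four values, the keys become ambiguous, so pick one that cannot appear in the data.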
Sorry if my description is confusing. Please let me know if this helps.

How to optimize a table for queries sorted by insertion order in Postgres

I have a table of time series data where for almost all queries, I wish to select data ordered by collection time. I do have a timestamp column, but I do not want to use actual Timestamps for this, because if two entries have the same timestamp it is crucial that I be able to sort them in the order they were collected, which is information I have at Insert time.
My current schema just has a timestamp column. How would I alter my schema to make sure I can sort based on collection/insertion time, and make sure querying in collection/insertion order is efficient?
Add a column based on a sequence (i.e. serial), and create an index on (timestamp_column, serial_column). Then you can get insertion order (more or less) by doing:
ORDER BY timestamp_column, serial_column;
You could use a SERIAL column called insert_order. This way no two rows will have the same value. However, I am not sure that your requirement of being in absolute time order is possible to achieve.
For example, suppose there are two transactions, T1 and T2, that happen at the same time, and you are running on a machine with multiple processors, so in fact both T1 and T2 did the insert at exactly the same instant. Is this a case that you are concerned about? There was not enough info in your question to know exactly.
Also, with a serial column you have the issue of gaps. For example, T1 could grab serial value 14 and T2 could grab value 15; then T1 rolls back and T2 does not, so you have to expect that the insert_order column might have gaps in it.
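The gap behaviour can be mimicked in a few lines of Python. This is only a model of a counter that is never rewound on rollback, not PostgreSQL itself; simulate_inserts and the T1/T2/T3 labels are invented for illustration.

```python
import itertools


def simulate_inserts(attempts, start=14):
    """attempts is a list of (label, commits) pairs.

    Each attempt consumes one sequence value even if its transaction
    rolls back, which is what leaves gaps in the committed rows.
    """
    seq = itertools.count(start)
    committed = []
    for label, commits in attempts:
        value = next(seq)  # consumed regardless of the outcome
        if commits:
            committed.append((value, label))
    return committed


# T1 takes 14 and rolls back; T2 and T3 commit with 15 and 16.
rows = simulate_inserts([("T1", False), ("T2", True), ("T3", True)])
print(rows)
```

The committed values are still strictly increasing, so ORDER BY on the column keeps working; only code that expects consecutive numbers would be affected by the missing 14.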