What is the execution path for the following query in Cassandra:
- 5 rows on one Cassandra node with token 1 (Node1)
- 5 rows on one Cassandra node with token 2 (Node2)
- 5 rows on one Cassandra node with token 3 (Node3)
A client sends a query to Node1.
- In what sequence is this query executed across the 3 nodes?
- How does Node1 propagate this query to Node2 and Node3?
- Does Node1 merge the rows from Node2 and Node3 to serve the complete query result?
There are two types of queries you can issue to retrieve data from multiple partitions (I'll use CQL terminology - a partition is what used to be called a row). Which one you use depends on whether you know the partition keys.
I'll assume a simple schema that doesn't have any clustering keys:
CREATE TABLE mytable (key text PRIMARY KEY, field text);
If you don't know the partition keys, you can issue
SELECT * FROM mytable LIMIT 15;
This will return the first 15 rows, ordered by the hash of the partition key. Because it is ordered by hash, such queries are normally only useful if you want to page through all your data.
The node that receives the query (the coordinator for this query) first forwards it to the node with the lowest token, plus its replicas. They return up to 15 rows. If fewer are found, the coordinator forwards the query on to the node with the second-lowest token, plus replicas. This continues until 15 rows are found or all nodes have been contacted, so this query could potentially touch every node in the cluster.
With replication factor greater than 1, conflicting results could be returned. The coordinator looks at the timestamps to merge the results and returns only the latest to the client.
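If you need the next page after those 15 rows, you can restart the scan from the last partition you saw using the token function. A minimal sketch against the mytable schema above (the key value is just a placeholder for the last key returned):
SELECT * FROM mytable WHERE token(key) > token('key15') LIMIT 15;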
If you do know the partition keys, you can use
SELECT * FROM mytable WHERE key in ('key1', 'key2');
The coordinator treats this the same as receiving the separate queries:
SELECT * FROM mytable WHERE key = 'key1';
SELECT * FROM mytable WHERE key = 'key2';
It forwards the messages to the node(s) that hold the data for each key. There is one query per key, so they are executed in parallel. The responses are gathered on the coordinator, merged so only the latest versions remain, and sent to the client.
Related
Running the command below:
SELECT * FROM MY_STREAM WHERE speed != 0 GROUP BY name LIMIT 10;
results in an error:
Pull queries don't support GROUP BY clauses.
Is there a way to query 10 records with the name value being different and unique across all 10 records returned?
You can try something like this. You can also include a WINDOW clause for a specific duration.
SELECT name, count(*) FROM MY_STREAM
WHERE speed != 0
GROUP BY name
HAVING count(*) = 10
EMIT CHANGES
LIMIT 10;
PULL QUERY
Pulls the current value from the materialized table and terminates. The result of this statement is not persisted in a Kafka topic and is only printed out in the console.
The WHERE clause must contain a single value of ROWKEY to retrieve and may optionally include bounds on WINDOWSTART if the materialized table is windowed.
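For example, a pull query against a windowed, materialized table looks something like this (the table name, key and window bound are illustrative, not from your setup):
SELECT * FROM MY_AGG_TABLE WHERE ROWKEY = 'some-key' AND WINDOWSTART >= '2020-01-01T00:00:00';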
How can I (quickly) test if a postgres partition has any rows in it?
I have a partitioned postgres table 'TABLE_A', partitioned by date range. The name of each individual partition indicates the date range, e.g. TABLE_A_20220101 (1st Jan 2022), TABLE_A_20220102 (2nd Jan 2022).
The table includes many years of data, so it includes several thousand individual partitions, each partition contains many millions of rows.
Is there a quick way of testing if a partition has any data in it? There are several solutions I've found, but they all involve count(*) and all take ages.
Please note - I'm NOT trying to accurately determine the row-count, just determine if each partition has any rows in it.
You can use an exists condition:
select exists (select * from partition_name limit 1)
That will return true if partition_name contains at least one row.
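If you want to probe every partition in one go, one option (a sketch, assuming the parent table is called table_a) is to generate one EXISTS probe per partition from the system catalogs:
SELECT format('select %L as partition_name, exists (select * from %I.%I) as has_rows',
              c.relname, n.nspname, c.relname) AS probe_sql
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE i.inhparent = 'table_a'::regclass;
Running the generated statements (for example with \gexec in psql) gives you a true/false flag per partition without counting any rows.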
I have one range partition covering 2020/01/01 to 2020/06/01.
I want to split this into two partitions, i.e. the first from 2020/01/01 to 2020/03/31 and the second from 2020/04/01 to 2020/06/01.
How can I achieve this? It does not allow me to create two partitions over the same date range and then move the data; it says "violates partition rule".
We can detach the partition like this:
ALTER TABLE "eMAR" Detach PARTITION "eMAR_default";
then create the new partitions like this:
CREATE TABLE "eMAR_inBw" PARTITION OF public."eMAR"
FOR VALUES FROM ('2016-04-17 20:00:00') TO ('2016-06-17 20:00:00');
and then move data from the detached partition like this:
insert into "eMAR_inBw"
select * from "eMAR_default" where "MedDateTime" >= '2016-04-17 20:00:00' and "MedDateTime" <= '2016-06-17 20:00:00'
and then you can drop the old (detached) partition as well if it is no longer required.
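If you still need a default partition afterwards, a sketch using the same names as above would be to delete the moved rows from the detached table and re-attach it:
delete from "eMAR_default"
where "MedDateTime" >= '2016-04-17 20:00:00' and "MedDateTime" < '2016-06-17 20:00:00';
ALTER TABLE "eMAR" ATTACH PARTITION "eMAR_default" DEFAULT;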
I am working with Kafka and KSQL. I would like to find the last row within 5 minutes for each DEV_NAME (the ROWKEY). Therefore, I have created a stream and an aggregated table for further joining.
With the KSQL below, I have created the table that finds the last row within 5 minutes per DEV_NAME:
CREATE TABLE TESTING_TABLE AS
SELECT ROWKEY AS DEV_NAME, max(ROWTIME) as LAST_TIME
FROM TESTING_STREAM WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY ROWKEY;
Then, I would like to join together:
CREATE STREAM TESTING_S_2 AS
SELECT *
FROM TESTING_S S
INNER JOIN TESTING_T T
ON S.ROWKEY = T.ROWKEY
WHERE
S.ROWTIME = T.LAST_TIME;
However, it produced this error:
Caused by: org.apache.kafka.streams.errors.StreamsException: A serializer (org.apache.kafka.streams.kstream.TimeWindowedSerializer) is not compatible to the actual key type (key type: org.apache.kafka.connect.data.Struct). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters.
It seems the WINDOW TUMBLING clause changed the format of my ROWKEY
(e.g. DEV_NAME_11508 -> DEV_NAME_11508 : Window{start=157888092000 end=-}).
Therefore, without setting the Serdes, could I convert the table to a stream and PARTITION BY DEV_NAME?
As you've identified, the issue is that your table is a windowed table, meaning the key of the table is windowed, and you cannot look up into a windowed table with a non-windowed key.
Your table, as it stands, will generate one unique row per ROWKEY for each 5 minute window. Yet it seems you don't care about anything but the most recent window. It may be that you don't need the windowing in the table, e.g.
CREATE TABLE TESTING_TABLE AS
SELECT
ROWKEY AS DEV_NAME,
max(ROWTIME) as LAST_TIME
FROM TESTING_STREAM
WHERE ROWTIME > (UNIX_TIMESTAMP() - 300000)
GROUP BY ROWKEY;
This will track the max timestamp per key, ignoring any timestamp that is more than 5 minutes old. (Of course, this check is only done at the time the event is received; the row isn't removed after 5 minutes.)
Also, this join:
CREATE STREAM TESTING_S_2 AS
SELECT *
FROM TESTING_S S
INNER JOIN TESTING_T T
ON S.ROWKEY = T.ROWKEY
WHERE
S.ROWTIME = T.LAST_TIME;
Almost certainly isn't doing what you think and wouldn't work in the way you want due to race conditions.
It's not clear what you're trying to achieve. Adding more information about your source data and required output may help people to provide you with a solution.
I have built a system where data is loaded from S3 into Redshift every few minutes (from a Kinesis Firehose). I then grab data from that main table and split it into a table per customer.
The main table has a few hundred million rows.
Creating the subtable is done with a query like this:
create table {$table} as select * from {$source_table} where customer_id = '{$customer_id}' and time between '{$start}' and '{$end}'
I have keys defined as:
SORTKEY (customer_id, time)
DISTKEY (customer_id)
Everything I have read suggests this would be the optimal way to structure my tables/queries, but the performance is absolutely awful. Building the sub-tables takes over a minute even with only a few rows to select.
Am I missing something or do I just need to scale the cluster?
If you do not have a better key, you may have to consider using DISTSTYLE EVEN, keeping the same sort key.
Ideally the distribution key should be a value that is used in joins and spreads your data evenly across the cluster. By using customer_id as the distribution key and then filtering on that key you're forcing all work to be done on just one slice.
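As a sketch of what that could look like when building the per-customer table (table and column names are taken from the question, the literals remain placeholders):
CREATE TABLE {$table}
DISTSTYLE EVEN
SORTKEY (customer_id, time)
AS
SELECT * FROM {$source_table}
WHERE customer_id = '{$customer_id}' AND time BETWEEN '{$start}' AND '{$end}';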
To see this in action look in the system tables. First, find an example query:
SELECT *
FROM stl_query
WHERE userid > 1
ORDER BY starttime DESC
LIMIT 10;
Then, look at the bytes per slice for each step of your query in svl_query_report:
SELECT *
FROM svl_query_report
WHERE query = <your query id>
ORDER BY query,segment,step,slice;
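For example, to check whether the work is skewed towards a single slice (a sketch; plug in your own query id):
SELECT slice, SUM(bytes) AS total_bytes
FROM svl_query_report
WHERE query = <your query id>
GROUP BY slice
ORDER BY total_bytes DESC;
If one slice shows nearly all the bytes while the others show almost none, the distribution key is concentrating the work on that slice.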
For a very detailed guide on designing the best table structure have a look at our "Amazon Redshift Engineering’s Advanced Table Design Playbook"