I have a Cassandra column family where I am storing a large number (hundreds of thousands) of events per month, with a timestamp (“Ymdhisu”) as the row key. It has multiple columns capturing some data for each event. I tried retrieving event data for a specific time range. For example, for the month of January I used the following CQL queries:
a) Query between range Jan 1 - Jan 15, 2013:
select count(*) from Test where Key > 20130101070100000000 and Key < 20130115070100000000 limit 100000;
Bad Request: Start key's md5 sorts after end key's md5. This is not allowed; you probably should not specify end key at all, under RandomPartitioner
b) Query between range Jan 1 - Jan 10, 2013:
select count(*) from Test where Key > 20130101070100000000 and Key < 20130110070100000000 limit 100000;
count - 73264
c) Query between range Jan 1 - Jan 2, 2013:
select count(*) from Test where Key > 20130101070100000000 and Key < 20130102070100000000 limit 100000;
count - 78328
It appears as though the range search simply is not working! The schema of my column family is:
Create column family Test with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type AND compression_options={sstable_compression:SnappyCompressor, chunk_length_kb:64};
What are the suggestions for extracting this data? Do I need to redefine my schema with the key validation class as TimeUUID? Is there any other way to query efficiently without changing the schema?
I am dealing with at least 100-200K rows of data monthly in this column family. If this schema does not work for this purpose, what would be an appropriate Cassandra schema to store and retrieve the kind of data described here?
You can add "Date" and "Month" columns to each event (alongside the other data) and create secondary indexes on them. When querying, you can then fetch all rows for a specified month or day.
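A minimal CQL sketch of that idea, assuming you add month and event_date columns to each row (the column and index names here are illustrative, not from the original schema):

-- Secondary indexes on the extra event columns
CREATE INDEX test_month_idx ON Test (month);
CREATE INDEX test_date_idx  ON Test (event_date);

-- Fetch (or count) all events for a given month or day via the index
SELECT count(*) FROM Test WHERE month = '201301';
SELECT * FROM Test WHERE event_date = '20130115';

With RandomPartitioner this gives you equality look-ups on a month or day, which is what the indexes support; it still will not give you ordered range scans over the row key.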
I don't think a range query on keys will work with RandomPartitioner. Perhaps it would if you changed your partitioner from RandomPartitioner to ByteOrderedPartitioner?
Here's a simple example of what I'm trying to do:
CREATE TABLE daily_factors (
factor_date date,
factor_value numeric(3,1));
CREATE TABLE customer_date_ranges (
customer_id int,
date_from date,
date_to date);
INSERT INTO
daily_factors
SELECT
t.factor_date,
(random() * 10 + 30)::numeric(3,1)
FROM
generate_series(timestamp '20170101', timestamp '20210211', interval '1 day') AS t(factor_date);
WITH customer_id AS (
SELECT generate_series(1, 100000) AS customer_id),
date_from AS (
SELECT
customer_id,
(timestamp '20170101' + random() * (timestamp '20201231' - timestamp '20170101'))::date AS date_from
FROM
customer_id)
INSERT INTO
customer_date_ranges
SELECT
d.customer_id,
d.date_from,
(d.date_from::timestamp + random() * (timestamp '20210211' - d.date_from::timestamp))::date AS date_to
FROM
date_from d;
So I'm basically making two tables:
a list of daily factors, one for every day from 1st Jan 2017 until today's date;
a list of 100,000 "customers" all who have a date range between 1st Jan 2017 and today, some long, some short, basically random.
Then I want to add up the factors for each customer in their date range, and take the average value.
SELECT
cd.customer_id,
AVG(df.factor_value) AS average_value
FROM
customer_date_ranges cd
INNER JOIN daily_factors df ON df.factor_date BETWEEN cd.date_from AND cd.date_to
GROUP BY
cd.customer_id;
Having a non-equi join on a date range is never going to be pretty, but is there any way to speed this up?
The only index I could think of was this one:
CREATE INDEX performance_idx ON daily_factors (factor_date);
It makes a tiny difference to the execution time. When I run this locally I'm seeing around 32 seconds with no index and around 28 seconds with it.
I can see that this is a massive bottleneck in the system I'm building, but I can't think of any way to make things faster. The ideas I did have were:
instead of using daily factors I could largely get away with monthly ones, but then I have the added complexity of "whole months and partial months" to work with, e.g. "take 7 whole months for Feb to Aug 2020, then 10/31 of Jan 2020 and 15/30 of September 2020"; it doesn't seem like that's going to be worth it;
I could pre-calculate every average I will ever need, but with 1,503 factors (and that count growing by one each day), that's already 1,128,753 numbers to store (1,503 × 1,502 / 2 date-range pairs, assuming we ignore zero-length ranges and that my maths is right). Also, my real-world system has an extra level of complexity, a second identifier with 20 possible values, so this would mean having c.20 million numbers to pre-calculate. And the number of values to store grows quadratically as each new day is added;
I could take this work out of the database, and do it in code (in memory), as it seems like a relational database might not be the best solution here?
Any other suggestions?
The classic way to deal with this is to store running sums of factor_value, not (or in addition to) individual values. Then you just look up the running sum at the two end points (actually at the end, and one before the start), and take the difference. And of course divide by the count, to turn it into an average. I've never done this inside a database, but there is no reason it can't be done there.
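For what it's worth, here's a minimal sketch of that idea in PostgreSQL, using a hypothetical daily_factor_sums table built once from daily_factors:

-- Precompute cumulative sums and counts of factor_value, ordered by date.
CREATE TABLE daily_factor_sums AS
SELECT
    factor_date,
    SUM(factor_value) OVER (ORDER BY factor_date) AS running_sum,
    COUNT(*)          OVER (ORDER BY factor_date) AS running_count
FROM daily_factors;

-- Average per customer: running sum at date_to minus running sum just before
-- date_from, divided by the matching difference in counts.
SELECT
    cd.customer_id,
    (s_to.running_sum - COALESCE(s_from.running_sum, 0))
        / (s_to.running_count - COALESCE(s_from.running_count, 0)) AS average_value
FROM customer_date_ranges cd
JOIN daily_factor_sums s_to
    ON s_to.factor_date = cd.date_to
LEFT JOIN daily_factor_sums s_from
    ON s_from.factor_date = cd.date_from - 1;   -- NULL (treated as 0) if the range starts on the first day

With factor_date as the primary key (or indexed) on the sums table, each customer needs just two cheap look-ups instead of a range join.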
I have a T1 table with user_id and the date they created their account, either in 2019 or 2018. I want to assign a random date to each user. For users who registered in 2018 the random date can be anything in 2019, but for users who registered in 2019 their random date must lie between the day after account creation and 31st Dec 2019.
Despite using ORDER BY and setting the seed (Redshift only allows random() to give a number between 0 and 1), I get a new random date for the same user_id every time I re-run the code below. This is on Redshift and I'm using Alation (SQL) for querying.
set seed to 0.25;
select *,
    case
        -- created in 2018: random date anywhere in 2019
        when created_date < '2019-01-01'
            then date(DATEADD(day, cast(random() * DATEDIFF(day, '2019-01-01', '2019-12-31') as int), '2019-01-01'))
        -- created in 2019: random date between the creation date and 31 Dec 2019
        when created_date < '2020-01-01' and created_date >= '2019-01-01'
            then date(DATEADD(day, cast(random() * DATEDIFF(day, created_date, '2019-12-31') as int), created_date))
    end as random_date
from scratchdb.tmp_table1
order by 1, 2
I'm designing a data warehouse using dimensional modeling. I've read most of The Data Warehouse Toolkit by Kimball & Ross. My question is regarding the columns in a dimension table that hold dates. For example, here is a table for Users of the application:
CREATE TABLE user_dim (
user_key BIGINT, -- surrogate key
user_id BIGINT, -- natural key
user_name VARCHAR(100),
...
user_added_date DATE, -- type 0, date user added to the system
...
-- Type-2 SCD administrative columns
row_start_date DATE, -- first effective date for this row
row_end_date DATE, -- last effective date for this row, 9999-12-31 if current
row_current_flag VARCHAR(10) -- current or expired
);
The last three attributes are for implementing type 2 slowly-changing dimensions. See Kimball, pages 150-151.
Question 1: Is there a best practice for the data type of the row_start_date and row_end_date columns? The type could be DATE (as shown), STRING/VARCHAR/CHAR ("YYYY-MM-DD"), or even BIGINT (a foreign key to the Date Dimension). I don't think there would be much filtering on the row start/end dates, so a key to the Date Dimension is not required.
Question 2: Is there a best practice for the data type of dimension attributes such as "user_added_date"? I can see someone wanting reports on users added per fiscal quarter, so using a foreign key to the Date Dimension would be helpful. Any downsides to this, besides having to join from the User Dimension to the Date Dimension to display the attribute?
If it matters, I'm using Amazon Redshift.
Question 1: For the SCD from and to dates I suggest you use a timestamp. My preference is TIMESTAMP WITHOUT TIME ZONE, with all of your timestamps stored as UTC.
Question 2: I always set up a date dimension table whose logical key is the actual date. That way you can join any date (e.g. the date the user was added) to the date dimension to pick up the "fiscal month" or whatever else you need. And because the attribute is stored as a plain date, you can also read it directly without joining to the date dimension.
With Redshift (or any columnar MPP DBMS) it is good practice to denormalise a little, e.g. use a star schema rather than a snowflake schema. This plays to the efficiencies a columnar store brings and avoids inefficient joins (there are no indexes).
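A hedged sketch of both suggestions, reusing the question's column names (trimmed) plus a hypothetical date_dim whose fiscal columns are illustrative:

-- Question 1: SCD columns as UTC timestamps rather than dates
CREATE TABLE user_dim (
    user_key         BIGINT,
    user_id          BIGINT,
    user_name        VARCHAR(100),
    user_added_date  DATE,
    row_start_date   TIMESTAMP,   -- UTC, without time zone
    row_end_date     TIMESTAMP,   -- '9999-12-31' if current
    row_current_flag VARCHAR(10)
);

-- Question 2: date dimension keyed on the calendar date itself
CREATE TABLE date_dim (
    calendar_date  DATE,          -- logical key
    fiscal_month   VARCHAR(10),
    fiscal_quarter VARCHAR(10)
);

-- Users added per fiscal quarter: join the plain date straight to the dimension
SELECT d.fiscal_quarter, COUNT(*) AS users_added
FROM user_dim u
JOIN date_dim d ON d.calendar_date = u.user_added_date
WHERE u.row_current_flag = 'current'
GROUP BY d.fiscal_quarter;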
For Question 1: row_start_date and row_end_date are not part of the incoming data. As you mentioned, they are created artificially for SCD Type 2 purposes, so they should not be keys to the Date dimension; the user dimension has no reason to carry such keys for them. A plain date (YYYY-MM-DD) should be fine as the data type.
For Question 2: If you have a requirement like this I would suggest creating a derived fact table (often called an accumulating snapshot fact table) to hold derived date measures like user_added_date.
For more info see https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/accumulating-snapshot-fact-table/
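As a purely illustrative sketch (the table and milestone names below are invented, not from the question), such a fact table might look like:

-- One row per user; milestone date keys are filled in as they become known
CREATE TABLE user_lifecycle_fact (
    user_key           BIGINT,   -- FK to user_dim
    added_date_key     BIGINT,   -- FK to date_dim: date the user was added
    activated_date_key BIGINT,   -- FK to date_dim: hypothetical later milestone
    days_to_activation INT       -- derived lag measure
);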
I have partitioned data on S3 that I would like to access via Spectrum. The current file structure is similar to: s3://bucket/dir/year=2018/month=11/day=19/hour=12/file.parquet
I partitioned the data using Glue, by parsing a field I use for timestamps, ts. Most queries I run will be on the ts field: timestamp range queries that are more granular than daily (they may span multiple days, or less than one day, but the time of day is usually involved).
How would I go about creating hourly (preferred; daily would work if needed) partitions on my data so that when I query the ts (or another timestamp) field, the right partitions are accessed? If needed I can recreate my data with different partitions. Most examples/docs just bucket data daily and use the date field in the query.
I would be happy to provide more information if needed.
Thank you!
An example query would be something like:
SELECT * FROM spectrum.data
WHERE ts between '2018-11-19 17:30:00' AND '2018-11-20 04:45:00'
Spectrum is not so intuitive here. You will probably need to convert the timestamp into year, month, day ... partition columns,
and then do something like WHERE (year > x AND year < y) AND (month > x1 AND month < x2) AND ...
That looks ugly.
You could consider doing something else instead:
s3://bucket/dir/date=2018-11-19/time=17:30:00/file.parquet
In that case your query will be simpler:
WHERE ( date < '2018-11-19' AND date > '2018-11-17') AND ( time < '17:30:00' AND time > '17:20:00')
or use BETWEEN:
https://docs.aws.amazon.com/redshift/latest/dg/r_range_condition.html
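A rough sketch of that suggested layout as a Spectrum external table. The column names follow the date=/time= keys above, payload stands in for your real data columns, and the external schema spectrum is assumed to exist already; if date/time clash with reserved words in your setup, quote or rename them:

-- Hypothetical external table over the date=/time= layout
CREATE EXTERNAL TABLE spectrum.data (
    ts      TIMESTAMP,
    payload VARCHAR(1000)
)
PARTITIONED BY (date VARCHAR(10), time VARCHAR(8))
STORED AS PARQUET
LOCATION 's3://bucket/dir/';
-- each partition still has to be registered, e.g. via ALTER TABLE ... ADD PARTITION or a Glue crawler

-- Prune partitions on the date column, then filter precisely on ts
SELECT *
FROM spectrum.data
WHERE date BETWEEN '2018-11-19' AND '2018-11-20'
  AND ts BETWEEN '2018-11-19 17:30:00' AND '2018-11-20 04:45:00';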
If the partitions are created as shown below, it will cater to the query @Eumcoz asked about:
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-19 17:30:00')
LOCATION 's3path/ts=2018-11-19 17:30:00/';
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-19 17:40:00')
LOCATION 's3path/ts=2018-11-19 17:40:00/';
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-19 17:50:00')
LOCATION 's3path/ts=2018-11-19 17:50:00/';
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-20 07:30:00')
LOCATION 's3path/ts=2018-11-20 07:30:00/';
Then if you fire this query, it will return the data in all the above partitions:
select * from spectrum.data where ts between '2018-11-19 17:30:00' and '2018-11-20 07:50:00'
How do I set up an index on a DynamoDB table to compare dates? For example, I have an attribute in my table called synchronizedAt, and I want my query to fetch all the rows that were never synchronized (i.e. 0) or weren't synchronized in the past 2 weeks, i.e. since (new Date().getTime()) - (1000 * 60 * 60 * 24 * 7 * 4).
It depends on the other attributes of your table.
You could use a Hash and Range primary key if the set of hash values is relatively small and stable; in that case you could filter on the dates by putting them in the range key. The queries still have to specify the hash key, though, so it may or may not make sense to first collect all the hash values and then loop over them, asking for the interesting range for each one.
An alternative could be a Hash and Range GSI. In that case you might put a fixed dummy value in the hash key, so you can query the range across all items at once.
Lastly, there is the less efficient Scan; with large tables it becomes a problem (the larger the table, the longer a Scan takes to complete).
I had a similar requirement to query on a date range. In my case the date range was the only criterion. The issue with DynamoDB is that you cannot create an index with just a range key: it always requires a hash key, and a Query on such an index always expects an equality condition on the hash key.
So I tricked the DB. I created a key called Century and populated it with the century digits of the date (the first two digits of the year). For example, for 1 Jan 2019 the century key value is 20; for 1 Jan 2020 it is also 20. It is very easy to derive from any date. Then I created a GSI with Century as the hash key and the date as the range key. When querying, it is very easy to derive the century from the date range and build a query condition with the century as the hash key and the date range on the range key. Since I am dealing with data spanning no more than 5 years, the trick won't fail for the next 75 years. :)
It is not so "nice to have" workaround but it work for me quite well. May be it will help someone else as well.