I have partitioned data on S3 that I would like to access via Redshift Spectrum. The current file structure is similar to: s3://bucket/dir/year=2018/month=11/day=19/hour=12/file.parquet
I partitioned the data using Glue by parsing the field I use for timestamps, ts. Most queries I run will be on the ts field, as they are timestamp range queries that are more granular than daily (they may span multiple days, or less than one day, but the time of day is usually involved).
How would I go about creating hourly (preferred; daily would work if needed) partitions on my data so that when I query the ts (or another timestamp) field, it will access the partitions correctly? If needed, I can recreate my data with different partitions. Most examples/docs just bucket data daily and use the date field in the query.
I would be happy to provide more information if needed.
Thank you!
Example query would be something like:
SELECT * FROM spectrum.data
WHERE ts between '2018-11-19 17:30:00' AND '2018-11-20 04:45:00'
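For reference, a rough sketch of the external table DDL for this layout (only ts is shown; the remaining columns are placeholders, not the actual definition):
CREATE EXTERNAL TABLE spectrum.data (
  ts TIMESTAMP
  -- ... other columns from the parquet files ...
)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
STORED AS PARQUET
LOCATION 's3://bucket/dir/';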
Spectrum is not so intuitive. You will probably need to convert the timestamp to year, month, day, ...
and then do something like WHERE (year > x AND year < y) AND (month > x1 AND month < x2) AND ...
It looks ugly.
You can consider doing something else instead:
s3://bucket/dir/date=2018-11-19/time=17:30:00/file.parquet
In that case your query will be simpler:
WHERE ( date < '2018-11-19' AND date > '2018-11-17') AND ( time < '17:30:00' AND time > '17:20:00')
or use BETWEEN:
https://docs.aws.amazon.com/redshift/latest/dg/r_range_condition.html
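For example, something like this (a sketch, assuming date and time are the partition column names; since both are keywords, they may need quoting or renaming):
WHERE "date" BETWEEN '2018-11-17' AND '2018-11-19'
  AND "time" BETWEEN '17:20:00' AND '17:30:00'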
If the partitions are created as shown below, it will cater to the query asked by #Eumcoz:
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-19 17:30:00')
LOCATION 's3path/ts=2018-11-19 17:30:00/';
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-19 17:40:00')
LOCATION 's3path/ts=2018-11-19 17:40:00/';
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-19 17:50:00')
LOCATION 's3path/ts=2018-11-19 17:50:00/';
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-20 07:30:00')
LOCATION 's3path/ts=2018-11-20 07:30:00/';
Then if you fire this query, it will return the data in all the above partitions:
select * from spectrum.data where ts between '2018-11-19 17:30:00' and '2018-11-20 07:50:00'
Related
I want to get truncated data over the last month. My times are Unix timestamps, and I need to get data for each specific day over the last 30 days.
The data is in the following form:
{
"id":"648637",
"exchange_name":"BYBIT",
"exchange_icon_url":"https://cryptowisdom.com.au/wp-content/uploads/2022/01/Bybit-colored-logo.png",
"trade_time":"1675262081986",
"price_in_quote_asset":23057.5,
"price_in_usd":1,
"trade_value":60180.075,
"base_asset_icon":"https://assets.coingecko.com/coins/images/1/large/bitcoin.png?1547033579",
"qty":2.61,
"quoteqty":60180.075,
"is_buyer_maker":true,
"pair":"BTCUSDT",
"base_asset_trade":"BTC",
"quote_asset_trade":"USDT"
}
I need to truncate data based on trade_time
How do I write the query?
The secret sauce is the date_trunc function, which takes a timestamp with time zone and truncates it to a specific precision (hour, day, week, etc). You can then group based on this value.
In your case we need to convert these JavaScript-style Unix timestamps (milliseconds) to timestamp with time zone first, which we can do with to_timestamp, but it's still a fairly simple query.
SELECT
date_trunc('day', to_timestamp(trade_time / 1000.0)),
COUNT(1)
FROM pings_raw
GROUP BY date_trunc('day', to_timestamp(trade_time / 1000.0))
Another approach would be to leave everything as numbers, which might be marginally faster, though I find it less readable
SELECT
(trade_time/(1000*60*60*24))::int * (1000*60*60*24),
COUNT(1)
FROM pings_raw
GROUP BY (trade_time/(1000*60*60*24))::int
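If only the last 30 days are needed, a filter on trade_time can be added before grouping; a sketch, assuming trade_time is stored as a number of milliseconds as in the sample document:
SELECT
    date_trunc('day', to_timestamp(trade_time / 1000.0)) AS trade_day,
    COUNT(1)
FROM pings_raw
WHERE to_timestamp(trade_time / 1000.0) >= now() - interval '30 days'
GROUP BY trade_day
ORDER BY trade_day;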
I'm designing a data warehouse using dimensional modeling. I've read most of The Data Warehouse Toolkit by Kimball & Ross. My question is regarding the columns in a dimension table that hold dates. For example, here is a table for Users of the application:
CREATE TABLE user_dim (
user_key BIGINT, -- surrogate key
user_id BIGINT, -- natural key
user_name VARCHAR(100),
...
user_added_date DATE, -- type 0, date user added to the system
...
-- Type-2 SCD administrative columns
row_start_date DATE, -- first effective date for this row
row_end_date DATE, -- last effective date for this row, 9999-12-31 if current
row_current_flag VARCHAR(10) -- current or expired
);
The last three attributes are for implementing type 2 slowly-changing dimensions. See Kimball, pages 150-151.
Question 1: Is there a best practice for the data type of the row_start_date and row_end_date columns? The type could be DATE (as shown), STRING/VARCHAR/CHAR ("YYYY-MM-DD"), or even BIGINT (a foreign key to the Date Dimension). I don't think there would be much filtering on the row start/end dates, so a key to the Date Dimension is not required.
Question 2: Is there a best practice for the data type of dimension attributes such as "user_added_date"? I can see someone wanting reports on users added per fiscal quarter, so using a foreign key to the Date Dimension would be helpful. Are there any downsides to this, besides having to join from the User Dimension to the Date Dimension to display the attribute?
If it matters, I'm using Amazon Redshift.
Question 1: For the SCD from and to dates I suggest you use timestamp. My preference is WITHOUT time zone, and to ensure all of your timestamps are UTC.
Question 2: I always set up a date dimension table with a logical key of the actual date. That way you can join any date (e.g. the start date of the user) to the date dimension to find, say, the "fiscal month" off the date dimension. But you can also see the date without joining to the date dimension, since it is plain to see (stored as a date).
With Redshift (or any columnar MPP DBMS) it is good practice to denormalise a little, e.g. use a star schema rather than a snowflake schema. This is because of the efficiencies that columnar storage brings, and it works around inefficient joins (because there are no indexes).
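A sketch of how these suggestions could look applied to the question's tables (column and table names are illustrative, not a definitive design):
CREATE TABLE date_dim (
  date_actual    DATE PRIMARY KEY,  -- logical key = the actual date
  fiscal_month   VARCHAR(20),
  fiscal_quarter VARCHAR(20)
);

CREATE TABLE user_dim (
  user_key         BIGINT,
  user_id          BIGINT,
  user_added_date  DATE,            -- plain date: readable on its own
  row_start_ts     TIMESTAMP,       -- SCD2 from/to as UTC timestamps (without time zone)
  row_end_ts       TIMESTAMP,
  row_current_flag VARCHAR(10)
);

-- any date attribute can still be joined to the date dimension when needed,
-- e.g. users added per fiscal quarter:
SELECT d.fiscal_quarter, COUNT(*)
FROM user_dim u
JOIN date_dim d ON d.date_actual = u.user_added_date
GROUP BY d.fiscal_quarter;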
For Question 1: row_start_date and row_end_date are not part of the incoming data. As you mentioned, they are created artificially for SCD Type 2 purposes, so they should not have a key to the Date dimension; the User dim has no reason to reference the Date dimension there. For the data type, YYYY-MM-DD should be fine.
For Question 2: If you have a requirement like this, I would suggest creating a derived fact table (often called an accumulating snapshot fact table) to keep derived measures like user_added_date.
For more info see https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/accumulating-snapshot-fact-table/
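A very rough sketch of such an accumulating snapshot (the milestone names here are hypothetical, not from the question):
CREATE TABLE user_lifecycle_fact (
  user_key                BIGINT,  -- FK to user_dim
  user_added_date_key     BIGINT,  -- FK to date_dim
  user_activated_date_key BIGINT,  -- FK to date_dim, filled in when the milestone occurs
  user_closed_date_key    BIGINT   -- FK to date_dim, NULL until it happens
);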
We're collecting lots of sensor data and logging them to a postgres DB.
Basic schema - cut down:
id | BIGINT PK
sensor-id| INT FK
location-id | INT FK
sensor-value | NUMERIC(0,2)
last-updated | TIMESTAMP_WITH_TIMEZONE
I'm trying to get the biggest change in sensor data in the last day. By that I mean, out of all the sensors, sensor ids 4, 5, 6, 7 changed the most compared to the previous day. Before that, I'm trying to get a SQL query to figure out the delta between the previous reading and the latest reading.
I thought maybe the lead and lag functions would help, but my query doesn't quite give me the result I was after:
SELECT
srd.last_updated,
spi.title,
lead(srd.value) OVER (ORDER BY srd.sensor_id DESC) as prev,
lag(srd.value) OVER (ORDER BY srd.sensor_id DESC) as next
FROM
sensor_rt_data srd
join sensor_prod_info spi on srd.sensor_id = spi.id
where srd.last_updated >= NOW() - '1 day'::INTERVAL -- current_date - 1
ORDER BY
srd.last_updated DESC
Simple dataset - making this up now because I can't log in to the DB right now:
id|sensor,location,value,updated
1|1,1,24,'2017-04-28 19:30'
2|1,1,22,'2017-04-27 19:30'
3|2,1,35,'2017-04-28 19:30'
4|2,1,33,'2017-04-28 08:30'
5|2,1,31,'2017-04-27 19:30'
6|1,1,25,'2017-04-26 19:30'
Forgetting the join (that's for the user-friendly sensor tag name field that staff need, and the location), how do I work out which sensor has reported the biggest change in temperature over a time series when they're grouped by sensor-id?
I'd be expecting:
updated,sensor,prev,next
'2017-04-28 19:30',1,24,22
'2017-04-28 19:30',2,33,31
(then from that, I can subtract and order to work out the top 10 sensors that have changed)
I noticed that Postgres 9.6 has some other functions too, but I want to try to get lead/lag working first.
Window functions aren't the best fit for this kind of task. Try this:
select sensor, max(value)-min(value) as value_change
from sensordata
where updated>=?-'1 day'::interval
group by sensor
order by value_change desc
limit 1;
There is not much use for indexes besides one on updated for this kind of query. It would probably be possible to use a specially crafted index if you were looking for the largest change for a calendar day instead of the last 24 hours.
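If an index is wanted, a minimal sketch, assuming the table and column names used in the query above:
-- the one index this query can use: the range filter on updated
CREATE INDEX ON sensordata (updated);
And changing LIMIT 1 to LIMIT 10 in the query above gives the top-ten list the question asks for.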
I have a table with a created (timestamptz) column. Now I need to implement pagination based on the timestamp, because while the user is viewing the first page, new items could be inserted into the table, which would make the data inconsistent if I used OFFSET for pagination.
So the question is: should I keep created as timestamptz, or is it better to convert it to an integer (Unix time, e.g. 1472031802812)? If so, are there any disadvantages? Also, at the moment I have now() as the default value of created - is there an alternative function to create a Unix timestamp?
Let me rewrite things from the comments into my answer. You want to use the timestamp type instead of an integer simply because that's exactly what it was designed for. Doing manual conversions between timestamp integers and timestamp objects is just a pain and you gain nothing. And you will eventually need it for more complex datetime-based queries.
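For example, a simple range filter stays a one-liner with timestamptz but turns into manual arithmetic with integers (a sketch; created_ms is a hypothetical bigint column holding milliseconds):
-- timestamptz: date/time operators work directly
SELECT * FROM table_name WHERE created >= now() - interval '7 days';

-- integer milliseconds: convert by hand on every query
SELECT * FROM table_name
WHERE created_ms >= (extract(epoch from now()) * 1000)::bigint - 7 * 24 * 60 * 60 * 1000;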
To answer the question about pagination: you simply do a query like
SELECT *
FROM table_name
WHERE created < lastTimestamp
ORDER BY created DESC
LIMIT 30
If it is the first query, then you set, say, lastTimestamp = '3000-01-01'. Otherwise you set lastTimestamp = last_query.last_row.created.
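In other words (a sketch, with the last seen value passed in as a bind parameter from the application):
-- first page
SELECT * FROM table_name
WHERE created < '3000-01-01'
ORDER BY created DESC
LIMIT 30;

-- later pages: bind the created value of the last row from the previous page
SELECT * FROM table_name
WHERE created < :last_seen_created
ORDER BY created DESC
LIMIT 30;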
Optimization
Note that if the table is big then ORDER BY created DESC might not be efficient (especially if called in parallel with different ranges). In this case you can use moving "time windows", for example:
SELECT *
FROM table_name
WHERE
created < lastTimestamp
AND created >= lastTimestamp - interval '1 day'
The 1 day interval is picked arbitrarily (tune it to your needs). You can also sort the results in the app.
If the result is not empty, then you update (in your app)
lastTimestamp = last_query.last_row.created
(assuming you've done sorting, otherwise you take min(last_query.row.created))
If the result is empty, then you repeat the query with lastTimestamp = lastTimestamp - interval '1 day' until you fetch something. Also, you have to stop if lastTimestamp becomes too low, i.e. when it is lower than any other timestamp in the table (which has to be prefetched).
All of that is under some assumptions for inserts:
1. new_row.created >= any_row.created,
2. new_row.created ~ current_time,
3. the distribution of new_row.created is more or less uniform.
Assumption 1 ensures that pagination results in consistent data while assumption 2 is only needed for the default 3000-01-01 date. Assumption 3 is to make sure that you don't have big empty gaps when you have to issue many empty queries.
You mean something like this?
select extract(epoch from now())::integer as unix_time
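If you do decide to store integers, the same expression works as a column default too; a sketch with a hypothetical created_ms column (multiplied by 1000 to get milliseconds like 1472031802812):
ALTER TABLE table_name
  ADD COLUMN created_ms BIGINT
  DEFAULT (extract(epoch from now()) * 1000)::bigint;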
I have a Cassandra column family where I am storing a large number (hundreds of thousands) of events per month with timestamp (“Ymdhisu”) as the row key. It has multiple columns capturing some data for each event. I tried retrieving events data for a specific time range. For example for the month of Jan, I used the following CQL query:
a) Query between range Jan 1- Jan 15, 2013
select count(*) from Test where Key > 20130101070100000000 and Key < 20130115070100000000 limit 100000;

Bad Request: Start key's md5 sorts after end key's md5. This is not allowed; you probably should not specify end key at all, under RandomPartitioner
b) Query between range Jan 1- Jan 10, 2013
select count(*) from Test where Key > 20130101070100000000 and Key < 20130110070100000000 limit 100000;

count: 73264
c) Query between range Jan 1- Jan 2, 2013
select count(*) from Test where Key > 20130101070100000000 and Key < 20130102070100000000 limit 100000;

count: 78328
It appears as though the range search simply is not working! The schema of my column family is:
Create column family Test with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type AND compression_options={sstable_compression:SnappyCompressor, chunk_length_kb:64};
To extract data, what are the suggestions? Do I need to redefine my schema with key validation class as TimeUUID type? Is there any other way to query efficiently without changing the schema?
I am dealing with at least 100-200K rows of data monthly in this column family. If this schema does not work for this purpose, what would be an appropriate Cassandra schema to store and retrieve the kind of data described here?
You can add columns such as "Date" and "Month" with secondary indexes on them, storing each event's date and month in those columns along with the other data. When querying, you can then fetch all rows for a specified month or day.
I don't think range queries on keys will work. Perhaps they would if you change your partitioner from RandomPartitioner to ByteOrderedPartitioner?
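If changing the schema is acceptable, another option (not covered by either answer above, so treat it as an assumption) is a CQL3 table that buckets events by day and uses the event timestamp as a clustering column; range slices within a bucket then work under RandomPartitioner, and a multi-day range is just one query per day bucket (or a WHERE day IN (...) list). A rough sketch with illustrative names:
CREATE TABLE events (
  day        text,        -- daily bucket, e.g. '20130101' (partition key)
  event_time timestamp,   -- clustering column, sorted within the bucket
  payload    text,        -- the per-event columns would go here
  PRIMARY KEY ((day), event_time)
);

-- range query inside one day's bucket
SELECT COUNT(*) FROM events
WHERE day = '20130101'
  AND event_time >= '2013-01-01 07:01:00'
  AND event_time < '2013-01-02 07:01:00';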