Data partitioning in S3 - Scala

We have our data in a relational database, in a single table with product id and date columns, like this:
productid  date        value1  value2
1          2005-10-26  24      27
1          2005-10-27  22      28
2          2005-10-26  12      18
We are trying to load the data into S3 as Parquet and create metadata in Hive so we can query it with Athena and Redshift. Our most frequent queries will filter on product id, day, month, and year, so we want to lay out the partitions in a way that gives better query performance.
From what I understood, I can create the partitions like this:
s3://my-bucket/my-dataset/dt=2017-07-01/
...
s3://my-bucket/my-dataset/dt=2017-07-09/
s3://my-bucket/my-dataset/dt=2017-07-10/
or like this:
s3://mybucket/year=2017/month=06/day=01/
s3://mybucket/year=2017/month=06/day=02/
...
s3://mybucket/year=2017/month=08/day=31/
Which layout will be faster to query, given that I have 7 years of data?
Also, how can I add partitioning on product id here, so that those queries are faster as well?
And how can I create these key=value folder structures (e.g. s3://mybucket/year=2017/month=06/day=01/) using Spark and Scala? Any examples?

We partitioned like this,
s3://bucket/year/month/day/hour/minute/product/region/availabilityzone/
s3://bucketname/2018/03/01/11/30/nest/e1/e1a
The minute is rounded to 30-minute intervals. If your traffic is high, you can go for a finer minute resolution; otherwise you can coarsen it to hour or even day.
This has helped a lot for the data we want to query (using Athena or Redshift Spectrum) and the time ranges we query over.
Hope it helps.
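To the Spark/Scala part of the question: a minimal sketch of producing this kind of key=value layout is below. It assumes the source table from the question is read over JDBC (the connection options and table name are placeholders) and that the partition columns are derived from the date column; partitionBy then creates the year=/month=/day=/productid= folders on write.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format}

val spark = SparkSession.builder().appName("partitioned-load").getOrCreate()

// Hypothetical JDBC source holding the productid/date/value1/value2 table.
val src = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb") // placeholder connection
  .option("dbtable", "product_values")                  // placeholder table name
  .option("user", "user")
  .option("password", "password")
  .load()

// Derive zero-padded partition columns from the date column and write Parquet.
// partitionBy creates the key=value folders, e.g. .../year=2017/month=06/day=01/productid=1/
src
  .withColumn("year", date_format(col("date"), "yyyy"))
  .withColumn("month", date_format(col("date"), "MM"))
  .withColumn("day", date_format(col("date"), "dd"))
  .write
  .mode("overwrite")
  .partitionBy("year", "month", "day", "productid")
  .parquet("s3://my-bucket/my-dataset/")
Whether productid belongs in the path at all depends on its cardinality: many products times many days means many small files, so some setups keep productid as an ordinary column and rely on Parquet predicate pushdown for it instead.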

Related

ReplaceWhere greater than year partition in delta table

This is how my data is stored in the 'year' partition of a Delta table.
This is the query I want to write. df_data_model only has data for years 2020 and above. After executing this query I want only years 2020 and above to be present in the Delta table, and the rest deleted. Can this be achieved with a query like this? If yes, what should I write in <?>? If not, what additional script do I need to write if I am to automate this?
The gist of the question is "delete data that does not exist in the DF, replace data that does exist, and create new folders for new data".
(df_data_model
.write
.partitionBy("Year")
.mode('overwrite')
.option("replaceWhere", "<?>")
.format('delta')
.save(path_delta)
)
The replaceWhere works, and there won't be any data left in the old folders. The folders themselves remain for 7 days and are then deleted automatically.
If you load from the Delta table and check the years, it will only have 2020 and 2021, but the folders for 2017-2019 will remain for 7 days.
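For completeness, a hedged Scala sketch of a predicate consistent with the behaviour described above (the question's snippet is PySpark, and the <?> is the asker's placeholder). Assuming the existing table starts at 2017, as the folders mentioned above suggest, a replaceWhere spanning the whole range being rewritten logically deletes the years absent from df_data_model and rewrites the rest; the old files only disappear after the 7-day retention period when VACUUM runs.
// Sketch only; "Year >= 2017" is an assumption (use the lowest year that should be cleared out).
df_data_model.write
  .format("delta")
  .mode("overwrite")
  .partitionBy("Year")
  .option("replaceWhere", "Year >= 2017")
  .save(path_delta)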

Get data from multiple data sources irrespective of data available in the primary data source

I have one database with 2 months of data (e.g. Jan and Feb) and another database with 12 months of data (Jan to Dec).
My task is to show data from both databases, but only for the months present in the primary database; in other words, my report should have data only for Jan and Feb, since only those 2 months exist in the primary database.
What I have tried:
As option 1, I tried to blend the data, but the problem is that for some IDs there is no data for Jan, so in that case data from the secondary database is also not shown, since there is no data in the primary database.
As option 2, I tried to link the databases with a left join on ID and used a Fixed LOD to pick up the full data. Now I get data, but the problem is that if I use the date field from the secondary database I get 0 (though my data is 1000), and if I take the date field from the primary database I get data for the full year, because of the Fixed LOD.
How can I get data from the secondary database for only those months (irrespective of whether the primary database has data for those months)?

Partitioning data for a timestamp query

I have partitioned data on S3 that I would like to access via Spectrum. The current file structure is similar to: s3://bucket/dir/year=2018/month=11/day=19/hour=12/file.parquet
I partitioned the data using Glue, by parsing the field I use for timestamps, ts. Most of my queries will be on the ts field: timestamp range queries that are more granular than daily (they may span multiple days, or less than one day, but the time of day is usually involved).
How would I go about creating hourly (preferred; daily would work if needed) partitions on my data so that when I query the ts (or another timestamp) field, it will prune the partitions correctly? If needed, I can recreate my data with different partitions. Most examples/docs just bucket data daily and use the date field in the query.
I would be happy to provide more information if needed.
Thank you!
Example query would be something like:
SELECT * FROM spectrum.data
WHERE ts between '2018-11-19 17:30:00' AND '2018-11-20 04:45:00'
Spectrum is not so intuitive. You probably will need to convert timestamp to year, month, day ...
And then do something like WHERE (year > x AND year < y) AND (month > x1 AND month < x2) AND ...
Looks ugly.
You can consider doing something else:
s3://bucket/dir/date=2018-11-19/time=17:30:00/file.parquet
In that case your query will be simpler:
WHERE (date < '2018-11-19' AND date > '2018-11-17') AND (time < '17:30:00' AND time > '17:20:00')
Or use BETWEEN:
https://docs.aws.amazon.com/redshift/latest/dg/r_range_condition.html
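Since the question allows recreating the data, here is a hedged Spark Scala sketch of writing hourly partitions derived from ts; events, dt, and hh are illustrative names, not anything from the original post. Spectrum prunes only on partition columns, so a range query would add predicates on dt (and hh for the boundary days) alongside the ts filter.
import org.apache.spark.sql.functions.{col, date_format}

// Assumes a DataFrame `events` that has the timestamp column `ts`.
events
  .withColumn("dt", date_format(col("ts"), "yyyy-MM-dd")) // daily partition key
  .withColumn("hh", date_format(col("ts"), "HH"))         // hourly partition key
  .write
  .mode("overwrite")
  .partitionBy("dt", "hh")
  .parquet("s3://bucket/dir/") // yields .../dt=2018-11-19/hh=12/part-....parquet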
If the partitions are created as mentioned below, it will cater to the query asked by @Eumcoz:
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-19 17:30:00')
LOCATION 's3path/ts=2018-11-19 17:30:00/';
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-19 17:40:00')
LOCATION 's3path/ts=2018-11-19 17:40:00/';
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-19 17:50:00')
LOCATION 's3path/ts=2018-11-19 17:50:00/';
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-20 07:30:00')
LOCATION 's3path/ts=2018-11-20 07:30:00/';
Then if you fire this query, it will return the data in all the above partitions:
select * from spectrum.data where ts between '2018-11-19 17:30:00' and '2018-11-20 07:50:00'

How do I divide my minute data into tables containing each month in TimescaleDB (PostgreSQL extension)?

I am new to TimescaleDB and I want to store one-minute OHLCV ticks for each stock in the table. There are 1440 ticks generated daily for one stock and 43200 ticks a month. I have 100 assets whose ticks I would like to store every month, and I basically want the tables to divide every 30 days or so, so that I don't have to build complex logic for this division. Any suggestions on how this can be done with TimescaleDB?
Currently, the way I am doing it is:
Take the incoming tick (e.g. timestamp 1535090420).
Divide its timestamp by the 30-day period: 1535090420 / (86400 * 30) = 592.241674383.
Round this number down to 592 and multiply by the interval to get 1534464000, which is the start of the nearest 30-day bucket inside which the tick should be stored. I then create a table called OHLC_1534464000 if it does not exist and add the tick there.
Is there a better way to do this?
Unless I'm missing something, it sounds like you just want a table partitioned by 30-day intervals that will automatically know which table to put incoming data into based on the timestamp. This is exactly the functionality TimescaleDB's hypertables provide.
You can create your table using the following syntax:
SELECT create_hypertable('ohlc', 'timestamp', chunk_time_interval => interval '30 days');
This will create 30-day chunks (just Postgres tables under the hood) and transparently place each tick in the appropriate chunk based on its timestamp. It will automatically create a new chunk once a tick past the current 30-day period comes in. You can just insert into and query the main table (ohlc) as though it were a single Postgres table.

Web analytics schema with Postgres

I am building a web analytics tool and use PostgreSQL as the database. I will not insert a row into Postgres for each user visit, only aggregated data every 5 seconds:
time  country  browser  num_visits
==================================
0     USA      Chrome   12
0     USA      IE       7
5     France   IE       5
As you can see, every 5 seconds I insert multiple rows (one per dimension combination).
In order to reduce the number of rows that need to be scanned by queries, I am thinking of having multiple tables with the above schema, one per resolution: 5SecondResolution, 30SecondResolution, 5MinResolution, ..., 1HourResolution. Now when the user asks about the last day I will go to the hour-resolution table, which is smaller than the 5-second-resolution table (although I could have used that one too; it's just more rows to scan).
Now what if the hour-resolution table has data on hours 0, 1, 2, 3, ... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could do multiple queries against the different resolution tables, so I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, and so on. As far as I understand, I have traded one query against a huge table (with many relevant rows to scan) for multiple queries against medium tables plus combining the results on the client side.
Does this sound like a good optimization?
Any other considerations on this?
Now what if the hour-resolution table has data on hours 0, 1, 2, 3, ... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could do multiple queries against the different resolution tables, so I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, and so on.
You can't do that if you want your results to be accurate. Imagine if they're asking for one hour resolution from 01:30 to 04:30. You're imagining that you'd get the first and last half hour from the 5 second (or 1 minute) res table, then the rest from the one hour table.
The problem is that the one-hour table is offset by half an hour, so the answers won't actually be correct; each hour will be from 2:00 to 3:00, etc, when the user wants 2:30 to 3:30. It's an even more serious problem as you move to coarser resolutions.
So: This is a perfectly reasonable optimisation technique, but only if you limit your users' search start precision to the resolution of the aggregated table. If they want one hour resolution, force them to pick 1:00, 2:00, etc and disallow setting minutes. If they want 5 min resolution, make them pick 1:00, 1:05, 1:10, ... and so on. You don't have to limit the end precision the same way, since an incomplete ending interval won't affect data prior to the end and can easily be marked as incomplete when displayed. "Current day to date", "Hour so far", etc.
If you limit the start precision you not only give them correct results but greatly simplify the query. If you limit the end precision too then your query is purely against the aggregated table, but if you want "to date" data it's easy enough to write something like:
SELECT blah, mytimestamp
FROM mydata_1hour
WHERE mytimestamp BETWEEN current_date + INTERVAL '1' HOUR AND current_date + INTERVAL '4' HOUR
UNION ALL
SELECT sum(blah), current_date + INTERVAL '5' HOUR
FROM mydata_5second
WHERE mytimestamp BETWEEN current_date + INTERVAL '4' HOUR AND current_date + INTERVAL '5' HOUR;
... or even use several levels of union to satisfy requests for coarser resolutions.
You could use inheritance/partitioning: one master table and many hourly-resolution child tables (and, perhaps, many minute- and second-resolution child tables).
Thus you only have to select from the master table, and let the constraint on each child table decide which is which.
Of course, you have to add a trigger function to route inserts into the appropriate child tables.
Complexity on insert versus complexity on display.
PostgreSQL - View or Partitioning?