How do I set up an index on a DynamoDB table to compare dates? For example, I have an attribute called synchronizedAt, and I want my query to fetch all the items that were never synchronized (i.e. 0) or weren't synchronized in the past 2 weeks, i.e. (new Date().getTime()) - (1000 * 60 * 60 * 24 * 7 * 4)
It depends on the other attributes of your table.
You may use a hash-and-range primary key if the set of hash values is relatively small and stable; in that case you could filter the dates by putting them in the range key. Queries must still specify the hash key, though, so it may or may not make sense to pre-query all the hash values first and then loop over them, asking for the interesting range for each hash value.
An alternative could be a hash-and-range GSI. In this case you might put a fixed dummy value in the hash key, so that you can query the date range across all items at once.
Lastly, there is the less efficient Scan, which becomes a problem with large tables (the larger the table, the longer the Scan takes to complete).
I had a similar requirement to query on a date range; in my case the date range was the only criterion. The issue with DynamoDB is that you cannot create an index with just a range key: an index always requires a hash key, and a Query on such an index always expects an equality condition on that hash key.
So I tricked the DB. I created an attribute called century and populated it with the century of the date: for 1 Jan 2019 the century value is 20, and for 1 Jan 2020 it is also 20, so it is very easy to derive from any date. Then I created a GSI with century as the hash key and the date as the range key. When querying, it is just as easy to derive the century from the date range and build the condition: hash key equal to the century, range key between the two dates. Since I am dealing with data spanning no more than 5 years, the trick won't fail for the next 75 years. :)
It is not the nicest workaround, but it works quite well for me. Maybe it will help someone else as well.
If I have a Cloudant MapReduce view with year/month/day as the key array, can I query the dataset just by month or just by day?
No. You can query by y/m/d or by y/m or by y.
In other words, you are allowed to omit fields, but you cannot have gaps, so you have to start omitting from the right.
Examples:
Querying by y/m/d -- key=[2022,5,20] finds everything for one day
Querying by y/m -- startkey=[2022,1]&endkey=[2022,2] finds everything in January 2022
Querying by y -- startkey=[2021]&endkey=[2022] finds everything in 2021
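In practice such a view is queried over Cloudant's HTTP API; the year/month case might look like this (a sketch; the account, database, design document, and view names are placeholders):

import json
import requests

# Placeholder account, database, design document and view names.
view_url = "https://ACCOUNT.cloudant.com/mydb/_design/dates/_view/by_ymd"

# Everything in January 2022: keep year and month, omit the day.
params = {
    "startkey": json.dumps([2022, 1]),
    "endkey": json.dumps([2022, 2]),
}
rows = requests.get(view_url, params=params, auth=("USER", "PASSWORD")).json()["rows"]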
Here's a simple example of what I'm trying to do:
CREATE TABLE daily_factors (
    factor_date date,
    factor_value numeric(3,1));

CREATE TABLE customer_date_ranges (
    customer_id int,
    date_from date,
    date_to date);

INSERT INTO daily_factors
SELECT
    t.factor_date,
    (random() * 10 + 30)::numeric(3,1)
FROM
    generate_series(timestamp '20170101', timestamp '20210211', interval '1 day') AS t(factor_date);

WITH customer_id AS (
    SELECT generate_series(1, 100000) AS customer_id),
date_from AS (
    SELECT
        customer_id,
        (timestamp '20170101' + random() * (timestamp '20201231' - timestamp '20170101'))::date AS date_from
    FROM
        customer_id)
INSERT INTO customer_date_ranges
SELECT
    d.customer_id,
    d.date_from,
    (d.date_from::timestamp + random() * (timestamp '20210211' - d.date_from::timestamp))::date AS date_to
FROM
    date_from d;
So I'm basically making two tables:
a list of daily factors, one for every day from 1st Jan 2017 until today's date;
a list of 100,000 "customers" all who have a date range between 1st Jan 2017 and today, some long, some short, basically random.
Then I want to add up the factors for each customer in their date range, and take the average value.
SELECT
    cd.customer_id,
    AVG(df.factor_value) AS average_value
FROM
    customer_date_ranges cd
    INNER JOIN daily_factors df ON df.factor_date BETWEEN cd.date_from AND cd.date_to
GROUP BY
    cd.customer_id;
Having a non-equi join on a date range is never going to be pretty, but is there any way to speed this up?
The only index I could think of was this one:
CREATE INDEX performance_idx ON daily_factors (factor_date);
It makes a tiny difference to the execution time: when I run this locally I see around 32 seconds with no index and around 28 seconds with it.
I can see that this is a massive bottleneck in the system I'm building, but I can't think of any way to make things faster. The ideas I did have were:
instead of using daily factors I could largely get away with monthly ones, but then I have to deal with whole months and partial months, e.g. "take 7 whole months for Feb to Aug 2020, then 10/31 of Jan 2020 and 15/30 of September 2020"; it doesn't seem worth the added complexity;
I could pre-calculate every average I will ever need, but with 1,503 factors (and that will increase with each new day), that's already 1,128,753 numbers to store (assuming we ignore zero-length date ranges and that my maths is right). My real-world system also has an extra level of complexity, a second identifier with 20 possible values, so this would mean having c.20 million numbers to pre-calculate. And the number of values to store grows quadratically: each new day adds roughly another 1,500 date ranges;
I could take this work out of the database, and do it in code (in memory), as it seems like a relational database might not be the best solution here?
Any other suggestions?
The classic way to deal with this is to store running sums of factor_value, not (or in addition to) the individual values. Then you just look up the running sum at the two end points (actually at the end, and one day before the start) and take the difference; divide by the number of days to turn it into an average. I've never done this inside a database, but there is no reason it can't be done there.
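A minimal sketch of that lookup in code, assuming the daily factors are available as a date-sorted list of (factor_date, factor_value) pairs; the running sums could just as well be kept in a side table in the database:

from datetime import timedelta

def build_running_sums(daily_factors):
    """Map each date to the cumulative sum of factor_value up to and including it."""
    running = {}
    total = 0.0
    for factor_date, factor_value in daily_factors:   # must be sorted by date
        total += factor_value
        running[factor_date] = total
    return running

def average_factor(running, date_from, date_to):
    """Average of the daily factors in [date_from, date_to], inclusive."""
    day_before = date_from - timedelta(days=1)
    window_sum = running[date_to] - running.get(day_before, 0.0)
    n_days = (date_to - date_from).days + 1
    return window_sum / n_days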
Are you able to use MongoDB to combine rows of data into one row?
I'm using dates with year, month, day and hour. The data is shown per hour. Is there a way to combine the data for the hours into just one day of data? I would basically remove the hour column and sum the hourly data into per-day data.
I'm not sure what you mean by "the data is shown per hour" - do you mean it's stored in the database that way?
MongoDB doesn't have rows and columns - the equivalent of a row is a document, and the column equivalent is a field. Unlike in traditional SQL, a field isn't just one piece of information (a string, number/date, boolean, null, etc). It can be more than one piece of data - it can be an array, or a document, or an array of documents, etc.
Anyway, based on the small amount of information I have on your situation, I'd absolutely design the data with the bucket pattern. https://www.mongodb.com/blog/post/building-with-patterns-the-bucket-pattern
You could $unset the 'measurements' array and just keep the sum/count fields if that's what you want.
If your data is already set in stone, then I'd use an aggregation pipeline to group all the documents ('rows') together - the group _id would be year, month, day, and you could sum/count/min/max/etc the data in the group too.
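For instance, with pymongo such a pipeline might look like this (the collection and field names, hourly_readings, timestamp and value, are assumptions about the data model):

from pymongo import MongoClient

coll = MongoClient()["mydb"]["hourly_readings"]   # placeholder database/collection

pipeline = [
    {"$group": {
        "_id": {
            "year":  {"$year": "$timestamp"},
            "month": {"$month": "$timestamp"},
            "day":   {"$dayOfMonth": "$timestamp"},
        },
        "total": {"$sum": "$value"},   # daily sum of the hourly values
        "count": {"$sum": 1},          # number of hourly documents per day
    }},
    {"$sort": {"_id": 1}},
]
daily = list(coll.aggregate(pipeline))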
I have a list of tasks, each with a certain date range, and each task is broken into weekly hour chunks of work (i.e. 30 hours from 2018-12-31 to 2019-01-06, and so on, starting from Monday).
The kinds of operations I would like to do are:
Display all the weekly hours of all the tasks for a list of users
Sum the weekly hours for a user for all his tasks for the week
When the duration of the task is modified, create/destroy the weekly hour chunks.
Would it be more efficient to store these weekly records as
start date/end date/hours,
year/week number/hours
Storing start/end date probably gives more flexibility to the table, as it could potentially store non-week-aligned hours.
Storing the week number means that, given a date range, creating the weekly chunks is as simple as finding the week number of the start date and the week number of the end date, and populating the weeks in between (without converting to date ranges), as sketched below. It also makes validating an update of a week's hours easier, since the week number just has to be 1-53.
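A small sketch of that chunking logic (assuming the chunks are aligned to ISO weeks starting on Monday):

from datetime import date, timedelta

def week_chunks(date_from, date_to):
    """Yield the (iso_year, iso_week) pairs covered by [date_from, date_to]."""
    current = date_from - timedelta(days=date_from.weekday())   # Monday of the first week
    while current <= date_to:
        iso_year, iso_week, _ = current.isocalendar()
        yield iso_year, iso_week
        current += timedelta(days=7)

# e.g. list(week_chunks(date(2018, 12, 31), date(2019, 1, 20)))
# -> [(2019, 1), (2019, 2), (2019, 3)]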
Wondering if anyone has tried out either option and can give any pointers on their preferred option.
I would probably go for a daterange column.
That gives you the flexibility to have differently sized chunks and allows you to define an exclusion constraint to prevent overlapping ranges.
Finding the row for a given week is still quite simple using the "contains" operator @>, e.g. where the_column @> to_date('2019-24', 'iyyy-iw') finds the row(s) that contain week number 24 in 2019.
The expression to_date('2019-24', 'iyyy-iw') returns the first day (Monday) of the specified week.
Finding all rows that fall between two weeks can also be done; however, constructing the corresponding date range looks a bit ugly. You can either construct an inclusive range with the first and last day: daterange(to_date('2019-24', 'iyyy-iw'), to_date('2019-24', 'iyyy-iw') + 6, '[]')
Or you can create a range with an exclusive upper bound using the next week's first day: daterange(to_date('2019-24', 'iyyy-iw'), to_date('2019-25', 'iyyy-iw'), '[)')
While ranges can be indexed quite efficiently, the required GiST indexes are a bit more expensive to maintain than a B-tree index on two integer columns.
Another downside of using ranges (if you don't really need the flexibility) is that they take up more space than two integer columns (14 bytes instead of 8, or even 4 with two smallint columns). So if the size of the table is of any concern, your current solution with the year/week columns is more efficient.
"Storing week number means given a date range, creating the weekly chunks is as simple as finding the week number of the start date and the week number of the end date"
If your input is a start and end date to begin with (rather than a "week number"), then I would definitely go for a daterange column. If that start and end date cover more than one week, then you store only one row, rather than multiple rows.
Good day,
I wish to merge two datasets on dates, matching each date to the next closest one.
The datasets are huge, 500 MB to 1 GB, so PROC SQL is out of the question.
I have two data sets. The first (Fleet) has the observations; the second has a date and which generation number to use for further processing. Like this:
data Fleet
CreatedPortalDate
2013/2/19
2013/8/22
2013/8/25
2013/10/01
2013/10/07
data gennum_list
date
01/12/2014
08/12/2014
15/12/2014
22/12/2014
29/12/2014
...
What I'd like to have is a link-table like this:
data link_table
CreatedPortalDate date
14-12-03 01/12/2014
14-12-06 01/12/2014
14-12-09 08/12/2014
14-12-11 08/12/2014
14-12-14 08/12/2014
With the logic that
Date < CreatedPortalDate and (CreatedPortalDate - date) = min(CreatedPortalDate - date)
What I came up with is a bit clunky and I'm looking for an efficient/better way to accomplish this.
data all_comb;
    set devFleet(keep=createdportaldate);
    do i=1 to n;
        set gennum_list(keep=date) point=i nobs=n;
        if createdportaldate > date
           and createdportaldate - 15 < date then do; /* Assumption: the generations are created weekly. */
            distance = createdportaldate - date;
            output;
        end;
    end;
run;
proc sort data=all_comb; by createdportaldate distance; run;

data link_table;
    set all_comb(drop=distance);   /* sorted dataset from the previous step */
    by createdportaldate;
    if first.createdportaldate;    /* keep the gennum date with the smallest distance */
run;
Any ideas how to improve this or approach the issue differently?
An ignorant idea: could I create a hash table where the distance would be stored?
Or arrays, maybe? Somehow.
EDIT (answers to questions from the comments):
Common format? Done.
Where do the billion rows come from? There are other data involved, but the date is the only linking variable.
Sorted? Yes, the data is sorted and can be sorted again.
Are the gennum dates always seven days apart? No, that's the tricky part; otherwise I could use week and year (or some other binning) as a unique identifier.
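To make the matching rule concrete, here is a small sketch of the nearest-earlier-date lookup in Python (an illustration only; it is independent of the SAS solutions that follow, and the names are taken from the question):

from bisect import bisect_left

def link_dates(created_portal_dates, gennum_dates):
    """Pair each CreatedPortalDate with the closest gennum date that is strictly earlier."""
    gennum_dates = sorted(gennum_dates)
    links = []
    for cpd in created_portal_dates:
        i = bisect_left(gennum_dates, cpd)              # first gennum date >= cpd
        if i > 0:
            links.append((cpd, gennum_dates[i - 1]))    # closest strictly earlier gennum date
    return links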
Huge is a relative term, today's huge is tomorrow's speck.
Key data features indicate a direct-addressing lookup scheme is possible:
Date values are integers.
Date value ranges are limited.
A date value, or any of the next 14 days, will be used as a lookup verifier.
The key is a date value, which can be used as an array index.
Load the gennum lookup once, as follows:
array gennum_of ( %sysfunc(today()) ) _temporary_;   /* one slot per SAS date value up to today */
if last_date then
    do index = last_date to date-1;
        gennum_of(index) = last_date;   /* dates from the previous gennum date up to the day before the new one map to the previous gennum date */
    end;
last_date = date;   /* remember the gennum date just read */
And fetch a gennum as:
if portaldate > last_date
    then portal_gennum = last_date;                /* portal dates after the newest gennum date get the newest one */
    else portal_gennum = gennum_of ( portaldate );
If you have many rows due to grouping by account ids, you will have to clear and load up the gennum array per group.
This is a typical application of a SAS BY statement.
The BY statement in a data step is meant to read two or more data sets at once, sorted by a common variable.
The common variable is the date, but it is named differently in the two datasets. In SQL you would solve that by requiring equality of the one variable to the other, Fleet.CreatedPortalDate = gennum_list.date, but the BY statement does not allow such a construction, so we have to rename (at least) one of them while reading the datasets. That is what we do in the rename clause within the options of gennum_list.
data all_comb;
    merge gennum_list (in = in_gennum rename = (date = CreatedPortalDate))
          Fleet       (in = in_fleet);
    by CreatedPortalDate;
I chose to combine the BY statement with a MERGE statement; a SET statement would have done the job too, but then the order of the two input datasets makes a difference.
Also note that I asked SAS to create the indicator variables in_gennum and in_fleet, which show in which input dataset a value was present. It is handy to know that this type of variable is not written to the result data set.
However, we have to recover the date from the CreatedPortalDate, of course:
if in_gennum then date = CreatedPortalDate;
If you are new to SAS, you will be surprised that the above statement does not work unless you explicitly instruct SAS to retain the value of date from one observation to the next. (Observation is SAS jargon for row.)
retain date;
And here we write out one observation for each observation read from the Fleet dataset.
if in_fleet then output;
run;
The advantages of this approach are
you need much less logic to correctly combine the observations from both input datasets (and that is what the data step is invented for)
you never have to retain an array of values in memory, so you cannot have overflow problems
this solution is of order 1 (O(1)) per observation, i.e. linear in the size of the datasets (apart from the sorting), so we know up front that doubling the amount of data will only double the run time.
Disclaimer: this answer is under construction.
It will be tested later this week