I have a table column with a tilde-separated value like below:
vendorAndDate - Column name
Chipotle~08-26-2020 - column value
I want to query by month, as in "vendorAndPurchaseDate like '%~08%2020'", and by year ending with 2020, as in "vendorAndPurchaseDate like '%2020'". I am using Spring Data JPA to query the values. I have not worked with tilde-separated column values before. Please point me in the right direction or to some examples.
You cannot.
If vendorAndPurchaseDate is your partition key, you need to pass the whole value.
If vendorAndPurchaseDate is your range key, you can only perform the
=, <, <=, >, >=, between, and begins_with operations along with a partition key.
Reference: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html
DynamoDB does not support this type of wildcard query.
Let's consider a more DynamoDB way of handling this type of query. It sounds like you want to support 2 access patterns:
Get Item by month
Get Item by year
You don't describe your Primary Keys (Partition Key/Sort Key), so I'm going to make some assumptions to illustrate one way to address these access patterns.
Your attribute appears to be a composite key, consisting of <vendor>~<date>, where the date is expressed as MM-DD-YYYY. I would recommend storing your date fields in YYYY-MM-DD format, which would allow you to exploit the sortability of the date field. An example will make this much clearer. Imagine your table looked like this:
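PK      | SK
--------+---------------------
Vendors | Chipotle~2020-08-04
Vendors | Chipotle~2020-08-26
Vendors | Chipotle~2020-09-12
Vendors | Subway~2020-08-15
(The rows above are illustrative values only.)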
I'm calling your vendorAndDate attribute SK, since I'm using it as a Sort Key in this example. This table structure allows me to implement your two access patterns by executing the following queries (in pseudocode to remain language agnostic):
Access Pattern 1: Fetch all Chipotle records for August 2020
query from MyTable where PK = "Vendors" and SK between Chipotle~2020-08-00 and Chipotle~2020-08-31
Access Pattern 2: Fetch all Chipotle records for 2020
query from MyTable where PK = "Vendors" and SK between Chipotle~2020-01-01 and Chipotle~2020-12-31
Because dates stored in ISO8601 format (e.g. YYYY-MM-DD...) are lexicographically sortable, you can perform range queries in DynamoDB in this way.
Again, I've made some assumptions about your data and access patterns for the purpose of illustrating the technique of using lexicographically sortable timestamps to implement range queries.
I am not a coder, and very much flying blind, so please excuse the simplicity of this query.
I am streaming Firebase Firestore updates to a BigQuery table using the Firebase extension "Stream collection to BigQuery" which I am then linking as a DataSource in Google Data Studio. This is currently working as intended.
I have 2 questions:
Is there a more efficient way to convert a Firebase timestamp into a BigQuery Date/Time value? The Firebase Timestamp shows in JSON format in the BigQuery table as follows:
{"created_time":{"_seconds":1647554254,"_nanoseconds":234000000}}
My BigQuery SQL code to convert it (which works) is:
DATETIME(TIMESTAMP_SECONDS(CAST(JSON_VALUE(DATA,'$.created_time._seconds') AS int64)),"Africa/Johannesburg") AS createDate
Is there a more efficient way to do this, or is this reasonable?
How do I reference the createDate computed field (above) in another computed field ageDays within this same query? I haven't found an answer on Google or Stack Overflow, either because of poor phrasing or because it's just too basic a question. I tried using a table alias to reference the createDate computed field (e.g. T.createDate), but no dice. My very ugly workaround was therefore just to repeat the createDate calculation in its entirety (which feels wrong) in my new computed column ageDays, as follows:
DATE_DIFF(current_date("Africa/Johannesburg"),DATETIME(TIMESTAMP_SECONDS(CAST(JSON_VALUE(DATA,'$.created_time._seconds') AS int64)),"Africa/Johannesburg"), DAY) AS ageDays
Would be sincerely grateful for any insights - many thanks.
For your requirement, JSON_EXTRACT can also be used instead of JSON_VALUE. You can use the query below to get the expected output.
select
  date(timestamp_seconds(cast(json_extract(data, '$.created_time._seconds') as int64))) AS Date_Created
from `project.dataset.timetable`
A table alias cannot be used to reference another computed field in the same SELECT list, because a column alias has limited visibility there. An alias can be referenced in the ORDER BY, GROUP BY, or HAVING clauses of a SELECT statement. The best way to get ageDays in the same query is to compute the whole createDate expression again.
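As a rough sketch (reusing the table name and JSON path from your question, so adjust names as needed), both computed fields would look like this, with the createDate expression simply repeated, reduced to a DATE, inside ageDays:

select
  datetime(timestamp_seconds(cast(json_value(data, '$.created_time._seconds') as int64)), "Africa/Johannesburg") as createDate,
  -- ageDays repeats the same expression, cast to a DATE for the day difference
  date_diff(
    current_date("Africa/Johannesburg"),
    date(timestamp_seconds(cast(json_value(data, '$.created_time._seconds') as int64)), "Africa/Johannesburg"),
    day
  ) as ageDays
from `project.dataset.timetable`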
I’m working with identifiers in a rather unusual format: every single ID has the same prefix and the prefix consists of as many as 25 characters. The only thing that is unique is the last part of the ID string and it has a variable length of up to ten characters:
ID
----------------------------------
lorem:ipsum:dolor:sit:amet:12345
lorem:ipsum:dolor:sit:amet:abcd123
lorem:ipsum:dolor:sit:amet:efg1
I’m looking for advice on the best strategy around indexing and matching this kind of ID string in PostgreSQL.
One approach I have considered is basically cutting these long prefixes out and only storing the unique suffix in the table column.
Another option that comes to mind is only indexing the suffix:
CREATE INDEX ON books (substring(book_id FROM 26));
I don’t think this is the best idea though as you would need to remember to always strip out the prefix when querying the table. If you forgot to do it and had a WHERE book_id = '<full ID here>' filter, the index would basically be ignored by the planner.
In most cases I create an integer ID for my tables even when I already have a unique string field. To give a concrete recommendation I would need to see all the queries you run against the database. If you regularly use substring(book_id FROM 26) in your WHERE clauses, then the best option is to create an expression index (function-based index) on that expression. In general, check your join conditions, which fields are used in joins, and which fields appear after WHERE in your queries, and then plan your indexes from there. If your joins match on the unique trailing part of the ID, then it is best either to extract those trailing characters into an additional column or to create an expression index on the function that extracts them.
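As a minimal sketch of the expression-index option, using the books table and 25-character prefix from your question (the index name is made up):

-- Index only the unique suffix of the ID.
CREATE INDEX books_id_suffix_idx ON books (substring(book_id FROM 26));

-- The planner can only use this index when the query repeats the same expression:
SELECT *
FROM books
WHERE substring(book_id FROM 26) = '12345';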
Context
I'm trying to find the best way to represent and aggregate a high-cardinality column in Redshift. The source is event-based and looks something like this:
user | timestamp           | event_type
-----+---------------------+-----------
1    | 2021-01-01 12:00:00 | foo
1    | 2021-01-01 15:00:00 | bar
2    | 2021-01-01 16:00:00 | foo
2    | 2021-01-01 19:00:00 | foo
Where:
the number of users is very large
a single user can have very large numbers of events, but is unlikely to have many different event types
the number of different event_type values is very large, and constantly growing
I want to aggregate this data into a much smaller dataset with a single record (document) per user. These documents will then be exported. The aggregations of interest are things like:
Number of events
Most recent event time
But also:
Number of events for each event_type
It is this latter case that I am finding difficult.
Solutions I've considered
The simple "columnar-DB-friendly" approach to this problem would simply be to have an aggregate column for each event type:
user | nb_events | ... | nb_foo | nb_bar
-----+-----------+-----+--------+-------
1    | 2         | ... | 1      | 1
2    | 2         | ... | 2      | 0
But I don't think this is an appropriate solution here, since the event_type field is dynamic and may have hundreds or thousands of values (and Redshift has an upper limit of 1600 columns). Moreover, there may be multiple types of aggregations on this event_type field (not just count).
A second approach would be to keep the data in its vertical form, where there is not one row per user but rather one row per (user, event_type). However, this really just postpones the issue - at some point the data still needs to be aggregated into a single record per user to achieve the target document structure, and the problem of column explosion still exists.
A much more natural (I think) representation of this data is as a sparse array/document/SUPER:
user | nb_events | ... | count_by_event_type (SUPER)
-----+-----------+-----+----------------------------
1    | 2         | ... | {"foo": 1, "bar": 1}
2    | 2         | ... | {"foo": 2}
This also pretty much exactly matches the intended SUPER use case described by the AWS docs:
When you need to store a relatively small set of key-value pairs, you might save space by storing the data in JSON format. Because JSON strings can be stored in a single column, using JSON might be more efficient than storing your data in tabular format. For example, suppose you have a sparse table, where you need to have many columns to fully represent all possible attributes, but most of the column values are NULL for any given row or any given column. By using JSON for storage, you might be able to store the data for a row in key:value pairs in a single JSON string and eliminate the sparsely-populated table columns.
So this is the approach I've been trying to implement. But I haven't quite been able to achieve what I'm hoping to, mostly due to difficulties populating and aggregating the SUPER column. These are described below:
Questions
Q1:
How can I insert into this kind of SUPER column from another SELECT query? All Redshift docs only really discuss SUPER columns in the context of initial data load (e.g. by using json_parse), but never discuss the case where this data is generated from another Redshift query. I understand that this is because the preferred approach is to load SUPER data but convert it to columnar data as soon as possible.
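The closest I can picture is something like the sketch below, which builds the object as a JSON string with listagg and converts it with json_parse, but I don't know whether constructing SUPER this way is the intended approach (the source table my_events and its columns are assumptions matching the example table above; event_type values would also need to be safe to embed in JSON, and listagg output is limited to 64K):

select
    "user",
    sum(cnt) as nb_events,
    -- assemble {"type": count, ...} as text, then convert to SUPER
    json_parse(
        '{' || listagg('"' || event_type || '":' || cnt::varchar, ',')
               within group (order by event_type) || '}'
    ) as count_by_event_type
from (
    select "user", event_type, count(*) as cnt
    from my_events
    group by "user", event_type
) per_type
group by "user";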
Q2:
How can I re-aggregate this kind of SUPER column, while retaining the SUPER structure? Until now, I've discussed a simplified example which only aggregates by user. In reality, there are other dimensions of aggregation, and some analyses of this table will need to re-aggregate the values shown in the table above. By analogy, the desired output might look something like (aggregating over all users):
nb_events | ... | count_by_event_type (SUPER)
----------+-----+----------------------------
4         | ... | {"foo": 3, "bar": 1}
I can get close to achieving this re-aggregation with a query like (where the listagg of key-value string pairs is a stand-in for the SUPER type construction that I don't know how to do):
select
    sum(nb_events) nb_events,
    (
        select listagg(s)
        from (
            select
                k::text || ':' || sum(v)::text as s
            from my_aggregated_table inner_query,
                unpivot inner_query.count_by_event_type as v at k
            group by k
        ) a
    ) count_by_event_type
from my_aggregated_table outer_query
But Redshift doesn't support this kind of correlated query:
[0A000] ERROR: This type of correlated subquery pattern is not supported yet
Q3:
Are there any alternative approaches to consider? Normally I'd handle this kind of problem with Spark, which I find much more flexible for these kinds of problems. But if possible it would be great to stick with Redshift, since that's where the source data is.
I am new to Postgres and am experimenting with the hstore extension. Looking for some guidance. I need to support basic reporting on timeseries data for various products that we sell. I have a large amount of data in the format "Timestamp, Value" for each product. This data is available in a csv file for each product.
I am thinking of using hstore to store this data in key-value format, assuming that all the timeseries data for a single product can be stored in a single hstore object. I need to be able to query this data at specific times, e.g. what was the value of a product at a given time? I also need to run simple queries like retrieving the times when the product cost more than $100.
I'm planning to have a table with a product id column and an hstore column. But I am not very clear on how to make this work:
The hstore column needs to be loaded from the thousands of timestamp,value records that exist in a csv, and appended to whenever we get a new csv.
The table needs to store the productId and corresponding Timeseries data.
Can you please advise whether using hstore would be helpful? If yes, how can I load the data from csv as explained above? Also, please share your experience of any impact on insert/update performance in the hstore as the data grows.
I do think you should start with a simple, normalised schema first, especially since you are new to PostgreSQL. Something like:
CREATE TABLE product_data
(
    product TEXT, -- I'm making an assumption about the types of your columns
    time TIMESTAMP,
    value DOUBLE PRECISION,
    PRIMARY KEY (product, time)
);
I would definitely keep hstore and similar options in mind, if and when your data becomes large enough that efficiency is more important than simplicity. But note that all options have an efficiency tradeoff.
Do you know how much data you're going to support? Number of products, number of distinct timestamps for each product?
What other queries do you want to run? A query for the times where a single product cost more than $100 would benefit from an index on (product, value), if the product has many distinct timestamps.
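For example (a sketch against the table above; the index name is made up):

CREATE INDEX product_data_product_value_idx ON product_data (product, value);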
Other options
hstore is most useful if you want to store a set of arbitrary key-value pairs in a row. You could use it here, with a row for each product, and each distinct timestamp for that product being a key in the product's hstore. The downsides are that keys and values in hstore are text, whereas your keys are timestamps and your values are numbers of some kind. So there will be a certain reduction in type checking, and a certain increase in type-casting cost. Another possible downside is that some queries on the hstore might not use indexes very efficiently. The table above can use simple btree indexes for range queries (say you want to pull out the values between two dates for a product). But hstore indexes are much more limited; you can use a gist or gin index on an hstore column to find all the rows that feature a certain key.
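A minimal sketch of that layout, with assumed table and column names (both keys and values are stored as text and cast back when reading):

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE product_series
(
    product TEXT PRIMARY KEY,
    series HSTORE -- timestamp text -> value text
);

-- A GIN index can find rows that contain a given key, but not ranges of values:
CREATE INDEX product_series_series_idx ON product_series USING gin (series);

-- Fetch one value and cast it back to a number:
SELECT (series -> '2021-01-01 12:00:00')::double precision AS value
FROM product_series
WHERE product = 'widget';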
Another option (which I've played with and use experimentally for some of my databases) is arrays. Basically, each product will have an array of values, and each timestamp maps to an index in the array. This is easy if the timestamps are perfectly regular. For example, if all your products had a value every hour for every day, you could use a table like this:
CREATE TABLE product_data
(
    product TEXT,
    day DATE,
    "values" DOUBLE PRECISION[], -- one entry per hour, 0 to 23 (quoted because VALUES is a reserved word)
    PRIMARY KEY (product, day)
);
You can construct views and indexes to make querying this table moderately easy. (I wrote a blog post on this technique at http://ejrh.wordpress.com/2011/03/20/vector-denormalisation-in-postgresql/.)
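For instance, a point lookup against that layout might look like this (a sketch; the product and timestamp are made up, and hour 0 is assumed to live in array position 1 since PostgreSQL arrays are 1-based by default):

SELECT "values"[extract(hour FROM TIMESTAMP '2021-01-01 15:00:00')::int + 1] AS value
FROM product_data
WHERE product = 'widget'
  AND day = DATE '2021-01-01';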
But my advice is still: start with a simple table, then explore ways to improve efficiency when you know you're going to need them.
I want to store an ID and a date, and I want to retrieve all entries from dateA up to dateB. What exactly do I need to be able to perform select from my_column_family where date >= dateA and date < dateB;?
The guys at #cassandra (IRC) helped me find a way; there are many subtle details, so I'd like to document them here.
First you need to declare a column family similar to this (examples from cassandra-cli):
create column family users with comparator=UTF8Type and key_validation_class=UTF8Type and column_metadata=[
  {column_name: id, validation_class: LongType},
  {column_name: name, validation_class: UTF8Type, index_type: KEYS},
  {column_name: age, validation_class: LongType}
];
A few important things about this declaration:
the comparator and key_validation_class are there to be able to use strings as key names
the first declared column is special: it's the "row key", which is used to address each row and therefore cannot contain duplicate values (an INSERT is really an UPSERT, so when there are duplicates the new values overwrite the old ones)
the second column declares a "secondary index" on its values (more on that below)
the dates are stored as Long datatypes; interpretation is up to the client
Now let's add some values:
set users[1][name] = john;
set users[1][age] = 19;
set users[2][name] = jane;
set users[2][age] = 21;
set users[3][name] = john;
set users[3][age] = 32;
According to this: http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/, Cassandra does not support the < operators directly. What it does is manually exclude the rows that don't match, but it does that AFTER there's a result set, and it also refuses to do so unless an actual filtering has taken place.
What that means is that a query like get users where age > 20; will return null, but if we add a predicate that includes = it'll magically work.
Here's where the secondary index is important: without it you can't use =, so in this example I can do get users where name = jane; but I cannot ask for get users where age = 21;
The funny thing is that, once a = predicate is present, the < and > operators work, so having a secondary index allows you to ask for get users where name = john and age > 20; and it'll filter correctly.
There are a few ways to solve this. The simplest is probably the secondary index solution with the equality limitation mentioned in your own answer. I've used this method, adding an additional column called 'valid', setting the value to 1. Then the queries can become where valid=1 and date>nnnn
The other solutions require additional column families and additional queries.
When loading the data, also create and populate a column family which contains the timestamps as keys, where each entry lists all the user ids as column names.
If the partitioning strategy is ordered, then a single RangeSliceQuery can specify the date range as a key range and get all the columns for each key. Then iterate through the result keys, using the column values for each user id and, if needed, query the original column family for the data associated with each id. Cassandra always stores column names sorted, and they can be read in reverse order.
But, as documented, the ordered partitioner is not ideal, leading to hot spots and difficulty in load balancing the nodes.
Without the ordered partitioner, still keeping the timestamp column family, you would have to create another column family while loading data where you can store all the timestamps as the columns under one or more known keys (e.g. 'created' or 'updated'). The first query would be a SliceQuery for a known key, and then the column names (as timestamps) would provide the keys for the MultigetSliceQuery to the timestamp column family.
I've used variations on this, usually adding Composite keys or columns for additional flexibility.