DynamoDB model that supports queries on any given attribute - nosql

The application we're designing has a function where users can dynamically add new elements to an entity that then need to be efficiently searched. The number of these elements is essentially unlimited. Our team has been looking at DynamoDB as a data store option, and we've been wrestling with the key/value model and how to get this dynamic data under an index for efficient querying.
I think I have a single-table solution that handles the problem elegantly and also allows for querying on any given attribute in the data store, but am disturbed that I can't find an example of it anywhere else. Hopefully it's not fundamentally flawed in some way - I would appreciate any critique!
The model is essentially the Entity-Attribute-Value approach used for adding dynamic or sparse data to RDBMSs. So instead of storing different entities/objects in a DynamoDB table like so:
PK      SK   SK-1   SK-2   SK-3   SK-N...         PK      SK   SK-1   SK-N...
             Key    Key    Key    Key       -->                Name   Money
Entity  Id   Value  Value  Value  Value           Person  22   Fred   30000
... which lets me query things like "all persons where name = Fred" but where you would eventually run out of sort key indexes and you would need to know which index goes with which key before you query, the data could be stored in EAV format like so:
PK   SK & GSI-PK   GSI-SK          PK   SK & GSI-PK    GSI-SK
Id   Entity#Key    Value           22   Person#Name    Fred
Id   Entity#Key    Value    -->    22   Person#Money   30000
Id   Entity#Key    Value           22   Person#Sex     M
Id   Entity#Key    Value           22   Person#DOB     09/00
Now, with one global secondary index (GSI-1 PK over Entity#Key and GSI-1 SK over Value) I can do a range search on any value for any key and get a list of Ids that match. Users can add their attributes or even entirely new entities and have them persisted in a way that's instantly indexed without us having to revamp the DynamoDB schema.
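To make the access pattern concrete, here is a minimal in-memory sketch of the EAV layout and the index described above. It does not call DynamoDB; the names `ITEMS`, `build_gsi` and `range_query` are hypothetical, and the index is simulated by grouping items on Entity#Key and sorting each group by value. In DynamoDB itself a key attribute has one declared type, so numeric values would likely need a sortable string encoding such as zero-padding.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Base-table items in EAV form: pk = entity Id, sk = "Entity#Key", plus a value.
# The sample rows are illustrative, not taken from a real table.
ITEMS = [
    {"pk": "22", "sk": "Person#Name", "value": "Fred"},
    {"pk": "22", "sk": "Person#Money", "value": 30000},
    {"pk": "23", "sk": "Person#Name", "value": "Wilma"},
    {"pk": "23", "sk": "Person#Money", "value": 45000},
    {"pk": "24", "sk": "Person#Money", "value": 12000},
]


def build_gsi(items):
    """Group items by Entity#Key and sort each group by value (GSI-PK / GSI-SK)."""
    index = defaultdict(list)
    for item in items:
        index[item["sk"]].append((item["value"], item["pk"]))
    for entries in index.values():
        entries.sort()
    return index


def range_query(index, entity_key, low, high):
    """Return the Ids whose value for entity_key lies in [low, high]."""
    entries = index.get(entity_key, [])
    values = [value for value, _ in entries]
    start = bisect_left(values, low)
    end = bisect_right(values, high)
    return [pk for _, pk in entries[start:end]]


gsi = build_gsi(ITEMS)
matching_ids = range_query(gsi, "Person#Money", 20000, 50000)
print(matching_ids)
```

Note that the query returns only Ids; fetching the full entities is a second step.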
The one major downside to this approach that I can think of is that data returned from a query on an Entity#Key-Value only contains values for that key and the entity Id, not the entire entity. That's fine for charts and graphs but a problem if you want to get a grid-type result with one query. I also worry about hot partition keys on the index, but hopefully we could solve that with intelligent write sharding.
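One possible shape for the write sharding mentioned above, as a sketch only: append a deterministic shard suffix to the index partition key, so writes for a popular Entity#Key spread over several partitions. The shard count, the `#<n>` suffix format and both function names are assumptions, not something DynamoDB prescribes.

```python
import hashlib

SHARD_COUNT = 4  # hypothetical; would be tuned to the write volume of the hottest key


def sharded_gsi_pk(entity_key, entity_id, shards=SHARD_COUNT):
    """Derive the index partition key for one item, e.g. 'Person#Money#2'."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shards
    return f"{entity_key}#{shard}"


def all_shard_keys(entity_key, shards=SHARD_COUNT):
    """List every partition key a reader has to query for one Entity#Key."""
    return [f"{entity_key}#{n}" for n in range(shards)]


print(sharded_gsi_pk("Person#Money", "22"))
print(all_shard_keys("Person#Money"))
```

The trade-off is on the read side: a range search now has to query every shard key and merge the sorted results.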
That's pretty much it. With a few tweaks the model can be extended to support the logging of all changes on each key and allow some nice time series queries against those changes, but my question is if anyone has found it useful to take an EAV type approach to a KV store like DynamoDB, or if there's another way to handle querying a dynamic schema?

You can have pk as the id of the entity. And then a sort key of {attributeName}. You may still want to have the base entity with fields like createdAt, etc.
So you might have:
PK          SORT              Attributes
#Entity#22  #Entity#Details   createdAt=2020
#Entity#22  #Attribute#name   key=name value=Fred
#Entity#22  #Attribute#money  key=money value=30000
To get all the attributes of an entity, you simply query on pk={id} with no filter. You cannot dynamically sort by every given attribute, though. That is exactly what DynamoDB, and NoSQL stores in general, perform poorly at.
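A small in-memory sketch of that read path: every item sharing one pk is collected and folded back into a single entity. `ITEMS`, `query_partition` and `assemble_entity` are hypothetical names, and the list filter stands in for a DynamoDB Query on the partition key.

```python
# Items laid out as in the table above; the second entity is illustrative.
ITEMS = [
    {"pk": "#Entity#22", "sk": "#Entity#Details", "createdAt": 2020},
    {"pk": "#Entity#22", "sk": "#Attribute#name", "key": "name", "value": "Fred"},
    {"pk": "#Entity#22", "sk": "#Attribute#money", "key": "money", "value": 30000},
    {"pk": "#Entity#52", "sk": "#Attribute#name", "key": "name", "value": "Wilma"},
]


def query_partition(items, pk):
    """Stand-in for a Query with only a partition key condition."""
    return sorted((item for item in items if item["pk"] == pk), key=lambda i: i["sk"])


def assemble_entity(items, pk):
    """Merge the details item and all attribute items into one dict."""
    entity = {"id": pk}
    for item in query_partition(items, pk):
        if item["sk"].startswith("#Attribute#"):
            entity[item["key"]] = item["value"]
        else:
            entity.update({k: v for k, v in item.items() if k not in ("pk", "sk")})
    return entity


print(assemble_entity(ITEMS, "#Entity#22"))
```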
What you can do is use streaming to do aggregation. So you can for instance store the top 10 wealthiest people:
PK              SORT  Attributes
#Money#Highest  #1    id=#Entity#22 value=30000
#Money#Highest  #2    id=#Entity#52 value=30000
You would calculate this in a DynamoDB Streams consumer. But you can't dynamically index values: DynamoDB works by effectively copying data from one form to another so that it can be efficiently retrieved. So you would be copying your entire database for each new attribute you wanted to search by. Otherwise you would have to use Scans, which wouldn't make sense, because you get no benefit from DynamoDB if all you ever do is Scan.
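As a rough sketch of the aggregation idea, the function below folds one change event at a time into a precomputed "highest money" list. It is plain Python rather than a real stream handler; `apply_change`, `to_items` and the event shape (entity id plus new value) are assumptions made for illustration.

```python
TOP_N = 10  # size of the precomputed list, as in the "top 10" example


def apply_change(leaderboard, entity_id, new_value, top_n=TOP_N):
    """Fold one change into a list of (value, entity_id) pairs, highest first.

    new_value=None means the attribute was removed from the entity.
    """
    # Drop any stale entry for this entity before re-inserting it.
    entries = [(value, eid) for value, eid in leaderboard if eid != entity_id]
    if new_value is not None:
        entries.append((new_value, entity_id))
    entries.sort(key=lambda pair: pair[0], reverse=True)
    return entries[:top_n]


def to_items(leaderboard):
    """Render the list in the PK / SORT layout shown above."""
    return [
        {"pk": "#Money#Highest", "sk": f"#{rank}", "id": eid, "value": value}
        for rank, (value, eid) in enumerate(leaderboard, start=1)
    ]


board = []
for eid, value in [("#Entity#22", 30000), ("#Entity#52", 45000), ("#Entity#7", 100)]:
    board = apply_change(board, eid, value, top_n=2)
print(to_items(board))
```

One limitation of this sketch: when an entity already in the list drops in value or is removed, the list alone cannot tell which entity should take its place, so a real consumer would need extra state or a fallback query.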
Your processes need to be very well understood to make good use of DynamoDB. If you want to index data at will and run all sorts of different queries, you probably want an SQL database or Elasticsearch.

Related

Feedback about my database design (multi tenancy)

The idea of the SaaS tool is to have dynamic tables with dynamic custom fields and values of different types. We were thinking of following the "force.com/salesforce.com" example, but it seems too complicated to maintain going forward, and its high level of abstraction also makes some reports hard to create. So we came up with a simpler idea, but we want to be sure that it is a good approach.
This is the architecture we have today (in a few steps).
Each tenant has its own separate database on the cluster (Postgres 12).
The TABLE table keeps all of those tables as a reference. This entity has a ManyToOne relation to the META table and a OneToMany relation to the DATA table.
The META table is used for metadata configuration and has a OneToMany relation to FIELDS (which holds the name of each field, its type, e.g. TEXT/INTEGER/BOOLEAN/DATETIME etc., and the attribute value as a string, only as a reference).
The DATA table has a ManyToOne relation to TABLES and 50 nullable character varying columns named attribute1...50.
Example flow today:
When a user wants to open a TABLE's DATA, e.g. "CARS", we load the META table with all the FIELDS (to get the fields for this query). Say the user wants to query against the Brand, Class, Year and Price columns.
We look up the references for Brand, Class, Year and Price in the META>FIELDS table, so we know that Brand = attribute2, Class = attribute5, Year = attribute6 and Price = attribute7.
We parse the request into a query, e.g. SELECT [attr...2,5,6,7] FROM DATA, and show the results to the user. If the user then filters on this data, e.g. Year > 2017 AND Class = 'A', we use SQL's CAST() functionality, for example SELECT CAST(attribute6 AS int), attribute5 FROM DATA WHERE CAST(attribute6 AS int) > 2017 AND attribute5 = 'A';, so we can actually support most principles of SQL.
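The flow above can be sketched as a small query builder. This is an illustration under assumptions, not the poster's implementation: `FIELDS`, `OPERATORS`, `column_expr` and `build_query` are hypothetical names, the sample rows are made up, and SQLite stands in for Postgres so the example runs on its own.

```python
import sqlite3

# Hypothetical META > FIELDS lookup: field name -> (physical column, declared type).
FIELDS = {
    "Brand": ("attribute2", "TEXT"),
    "Class": ("attribute5", "TEXT"),
    "Year": ("attribute6", "INTEGER"),
    "Price": ("attribute7", "INTEGER"),
}
OPERATORS = {"EQ": "=", "MT": ">"}  # the two filter operators named in the question


def column_expr(field):
    """Map a user-facing field to its column, adding CAST for integer fields."""
    column, field_type = FIELDS[field]  # KeyError rejects unknown field names
    return f"CAST({column} AS INTEGER)" if field_type == "INTEGER" else column


def build_query(select_fields, filters):
    """Build a parameterized SELECT; filters are (operator, field, value) triples."""
    select_sql = ", ".join(f'{column_expr(f)} AS "{f}"' for f in select_fields)
    where_parts, params = [], []
    for op, field, value in filters:
        where_parts.append(f"{column_expr(field)} {OPERATORS[op]} ?")
        params.append(value)
    sql = f"SELECT {select_sql} FROM DATA"
    if where_parts:
        sql += " WHERE " + " AND ".join(where_parts)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DATA (id INTEGER, attribute2 TEXT, attribute5 TEXT,"
    " attribute6 TEXT, attribute7 TEXT)"
)
conn.executemany(
    "INSERT INTO DATA VALUES (?, ?, ?, ?, ?)",
    [(1, "Audi", "A", "2019", "41000"),
     (2, "Audi", "B", "2020", "52000"),
     (3, "Fiat", "A", "2015", "9000")],
)
sql, params = build_query(["Brand", "Year"], [("MT", "Year", 2017), ("EQ", "Class", "A")])
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Only column names that come from the metadata lookup reach the SQL text; user-supplied values are passed as parameters.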
However, going forward we are a bit worried about:
Managing such an environment for more tenants as the number of tables grows (e.g. 50 per customer, with roughly 1-5 mil rows per TABLE (5 mil is the maximum we allow; for bigger data we have BigQuery), which gives us 50-250 mil rows in a single DATA_X table). This might affect query performance, especially since we let users write simple WHERE statements (less, equal, null etc.) in an abstraction language, e.g. GET CARS [BRAND,CLASS,PRICE...] FILTER [EQ(CLASS,A),MT(YEAR,2017)], developed to be similar to JQL (Jira Query Language).
Transaction locks: we allow batch uploads of CSV into DATA_X, so when a customer loads e.g. 1GB of data, it more or less locks the DATA table for other systems.
Keeping multiple NULL columns, which can affect space a bit. For now we are not that worried: at TABLE creation the customer decides how many columns they want, and based on that we assign the TABLE to one of the hardcoded entities DATA_5, DATA_10, DATA_15, DATA_20, DATA_30, DATA_50, where the number is the limit on attribute columns. Those entities are distinct, and we also support a migration option if they decide to switch from 5 to 10 attributes etc.
We are at a super early stage, so we can and should make those changes before we scale. We knew this is most likely not the best approach, but we kept it to run the project for small customers, for which it is working just fine for now.
We also considered JSONB objects, but that is not an option, as we want to keep getting the data simple.
What do you think about this solution? (FYI, DATA has a composite PRIMARY key over two columns, (ID, TABLEID), and a built-in CreatedAt column which is used for most of the queries, so there will be a maximum of 3 indexes.)
If it seems bad, what would you recommend as an alternative based on the details I shared (basically a schema-less RDBMS)?
IMHO, I anticipate issues when you want to join tables, and also with using CAST etc.
We followed the approach below, which may be of help to you.
We have a table called Cars and a couple of companion tables, CarsMeta and CarsExtension. The underlying Cars table has all the fields common to all tenants. The CarsMeta table describes the types of columns available for extending the Cars entity. The CarsExtension table has columns like StringCol1...5, IntCol1....5, LongCol1...10.
This way, you can easily filter the data:
If the filter is on the base table, perform the search; if results are found, match the ids against the CarsExtension table to get the extended rows for each entity.
If the filter is on the extended fields, search the extension table and match against the base entity ids.
As we will have the extension table organized like below
id - UniqueId
entityid - uniqueid (points to the primary key of the entity)
StringCol1 - string,
...
IntCol1 - int,
...
In this case, it will be easy to do a join for entity and then get the data along with the extension fields.
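A minimal sketch of that join, using SQLite in place of Postgres so it runs standalone. The table and column names follow the answer; the `brand` column, the sample rows and the function `cars_with_extension` are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cars (id INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE CarsExtension (
    id INTEGER PRIMARY KEY,
    entityid INTEGER REFERENCES Cars(id),  -- points to the base entity
    StringCol1 TEXT,                       -- typed extension columns, so no CAST is needed
    IntCol1 INTEGER
);
INSERT INTO Cars VALUES (1, 'Audi'), (2, 'Fiat');
INSERT INTO CarsExtension VALUES (10, 1, 'A', 2019), (11, 2, 'A', 2015);
""")


def cars_with_extension(min_int_col1):
    """Filter on an extended field, then join back to the base entity."""
    return conn.execute(
        """
        SELECT c.id, c.brand, e.StringCol1, e.IntCol1
        FROM CarsExtension AS e
        JOIN Cars AS c ON c.id = e.entityid
        WHERE e.IntCol1 > ?
        ORDER BY c.id
        """,
        (min_int_col1,),
    ).fetchall()


print(cars_with_extension(2017))
```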
If the table metadata and the data have to be inferred from separate tables, this will be difficult to maintain over a long period of time and with a huge volume of data.
HTH

PostgreSQL array of elements that each are a foreign key

I am attempting to create a DB for my app and one thing I'd like to find the best way of doing is creating a one-to-many relationship between my Users and Items tables.
I know I can make a third table, ReviewedItems, and have the columns be a User id and an Item id, but I'd like to know if it's possible to make a column in Users, let's say reviewedItems, which is an integer array containing foreign keys to Items that the User has reviewed.
If PostgreSQL can do this, please let me know! If not, I'll just go down my third table route.
It may soon be possible to do this: https://commitfest.postgresql.org/17/1252/ - Mark Rofail has been doing some excellent work on this patch!
The patch will (once complete) allow
CREATE TABLE PKTABLEFORARRAY (
ptest1 float8 PRIMARY KEY,
ptest2 text
);
CREATE TABLE FKTABLEFORARRAY (
ftest1 int[],
FOREIGN KEY (EACH ELEMENT OF ftest1) REFERENCES PKTABLEFORARRAY,
ftest2 int
);
However, the author currently needs help to rebase the patch (beyond my own ability), so anyone reading this who knows Postgres internals, please help if you can.
No, this is not possible.
PostgreSQL is a relational DBMS, operating most efficiently on properly normalized data models. Arrays are not relational data structures - by definition they are sets - and while the SQL standard supports defining foreign keys on array elements, PostgreSQL currently does not support it. There is a (dormant? no activity on commitfest since February 2021) effort to implement this - see this answer to this same question - so the functionality might one day be supported.
For the time being you can, however, build a perfectly fine database with array elements linking to primary keys in other tables. Those array elements, however, can not be declared to be foreign keys and the DBMS will therefore not maintain referential integrity. Using an appropriate set of triggers (both on the referenced and referencing tables, as a change in either would have to trigger a check and possible update on the other) one would in principle be able to implement referential integrity over the array elements but the performance is unlikely to be stellar (because indexes would not be used, for instance).
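For contrast, here is a sketch of the third-table route from the question, where the database itself enforces referential integrity. SQLite stands in for PostgreSQL so the example is self-contained; the column names, sample rows and the helper `try_review` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE Users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Items (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE ReviewedItems (
    user_id INTEGER NOT NULL REFERENCES Users(id),
    item_id INTEGER NOT NULL REFERENCES Items(id),
    PRIMARY KEY (user_id, item_id)
);
INSERT INTO Users VALUES (1, 'alice');
INSERT INTO Items VALUES (100, 'kettle'), (101, 'toaster');
INSERT INTO ReviewedItems VALUES (1, 100), (1, 101);
""")


def try_review(user_id, item_id):
    """Insert a review link; return False if a constraint rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO ReviewedItems VALUES (?, ?)", (user_id, item_id))
        return True
    except sqlite3.IntegrityError:
        return False


reviewed = [row[0] for row in conn.execute(
    "SELECT item_id FROM ReviewedItems WHERE user_id = ? ORDER BY item_id", (1,))]
print(reviewed)
print(try_review(1, 999))  # item 999 does not exist
```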

NOSQL Table Schema

I'm trying to plan a NOSQL table schema. There are relationships in my data, but they are mostly what would be N:N in a relational db; there are very few normal 1:N relationships.
So in this case, I'm trying to create implicit relationships that will allow me to browse from both ends of the relationship. I'm using Azure Table Storage, so I understand that full-text searching isn't available; I can only retrieve an "object" by its Partition Key + Row Key combination.
So imagine I have a table called "People" and a table called "Hamburgers" and each object in the tables can be related to multiple objects in the other table. Hamburgers are eaten by many people, people each eat many hamburgers.
Since the relationship is probably weighted to the people side - i.e. there are more people per hamburger than vice-versa, I would handle this in the tables like this:
Hamburger Table
Partition Key: Only 1 partition
Row Key: Unique ID
People Table
Partition Key: Only 1 partition
Row Key: Unique ID
"Columns": an extra value for every hamburger the person eats
Hamburger-People Table
Partition Key: Hamburger Row Key
Row Key: People Row Key
This way, if I'm looking at a hamburger and want to see all the people that eat it, I can go to the Hamburger-People table and use my Hamburger's Row Key to get the partition of all the people that eat the hamburger.
If I'm at a person and want to see all the hamburgers he/she eats, I have the extra values with the Row Keys of the hamburgers the person eats.
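The two lookup directions described above can be sketched with plain dictionaries standing in for the tables. This does not use the Azure Table Storage API; `people`, `hamburger_people` and the three functions are hypothetical names for illustration.

```python
from collections import defaultdict

people = {}                          # People table: row key -> entity properties
hamburger_people = defaultdict(set)  # Hamburger-People: partition (hamburger key) -> person row keys


def record_meal(person_id, hamburger_id):
    """Write both sides of the relationship, as the insert flow requires."""
    person = people.setdefault(person_id, {"hamburgers": set()})
    person["hamburgers"].add(hamburger_id)          # extra value on the person
    hamburger_people[hamburger_id].add(person_id)   # row in the hamburger's partition


def people_who_eat(hamburger_id):
    """Read one partition of the Hamburger-People table."""
    return sorted(hamburger_people.get(hamburger_id, set()))


def hamburgers_eaten_by(person_id):
    """Read the extra values stored on the person entity."""
    return sorted(people.get(person_id, {"hamburgers": set()})["hamburgers"])


record_meal("p1", "whopper")
record_meal("p2", "whopper")
record_meal("p1", "tasty")
print(people_who_eat("whopper"))
print(hamburgers_eaten_by("p1"))
```

Every relationship is written twice, so the two copies can drift apart if one write fails.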
When inserting data into the tables, if the data involves a hamburger/person relationship, I would insert both values in the proper tables, then create the Hamburger-People table. If I was trying to keep a duplicate-free list of hamburgers, I would need to search the Hamburger table first to make sure the hamburger wasn't already in there (like "Whopper" - if it's in there, I wouldn't insert it again). Then, I would need to go insert a row in the hamburger's existing partition in Hamburger-People table.
But for the most part, the no-duplicate requirement doesn't exist.
Is this a good best-practices approach to NOSQL schema, or am I going to run into problems later?
UPDATE
Also, I would like to be able to partition the data tables later, but I'm not sure how to do so with this structure; adding a 2nd partition to the hamburger table would require me to store an extra value in the hamburger-People table, and I'm not sure if that would start to be too complex.
OK, nice questions, and I think most of them are the ones every RDBMS developer faces as soon as they hit the NoSQL world:
1. How to group the partitions?
To get the best out of partitions, keep in mind that the load on your database should be distributed across your servers. Let's see what happens with your approach.
A person with key "A" enters the restaurant, and you save the person and their burger, a Classic Tasty (key "T"). The person record goes to server X and the burger goes to server Y. Now a new customer with key "B" comes in and wants something different, a burger "W". Again the person goes to server X, and this time the burger goes to server X too, so server X is getting all the load. If you repeat this, you'll see that server X becomes a bottleneck, because 75% of the records go there (all the people and 50% of the burgers), and that will cause problems with your load. But the problem gets worse when you query, because all the queries will hit server X.
To solve this you could use the person's key as part of the partition for the relationship, so the person is partitioned onto the same server as their burger relationships. This way your workload is balanced, and you won't have problems if one of the servers goes down (the person and their hamburgers will be "lost" together); this is a consistent "inconsistency".
2. Should I use a "relationship" in a NoSQL database?
Remember that NoSQL means you are free to duplicate information any time your problem requires it to avoid "overqueries". If you store the information that is commonly queried together, you avoid a round trip to the database. So, if you store a "transaction" instead of "person and burgers", you get better performance and avoid some hits to the database. Let's work through an example with real data using your approach and compare it with "my" approach:
Joe Black comes to the restaurant and asks for a Tasty. Here you do the following writes:
Create a Joe Black record
Create a burger transaction record
If you want to list your daily transactions you will need to:
Get all the day's records from the person-burger "table", then go to the person "table" and retrieve the customers' names, and then go to the hamburger records and retrieve their names. (You won't be able to do cross-table queries, because some records could be on one server and others on the second server.)
Ok, what if you create a table "transactions" and store in there the following json:
{
  "custid": "AAABCCC",
  "name": "Joe",
  "lastName": "Black",
  "date": "2012/07/07",
  "order": {
    "code": "Burger0001",
    "name": "Tasty",
    "price": 3.5
  }
}
I know you will have several records with the same "Tasty" description. That's denormalization, which is very useful when you apply NoSQL solutions to this type of problem. Now, how many writes did it take to store the information in the database? Just one! And how many queries will you need to retrieve the information at the end of the day? Again, just one. It will create some problems, but it will save you a lot of work too. For example: can you reprint the order easily? (Yes, you can!) What if the customer's name changes? Is that even possible?
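A tiny sketch of the "one write, one read" point, with a dictionary standing in for the transactions table. It is not tied to djondb or Azure; `transactions_by_day`, `record_order` and `daily_report` are hypothetical names, and grouping by date is an assumption about how the daily listing would be keyed.

```python
import json
from collections import defaultdict

transactions_by_day = defaultdict(list)  # stand-in for the "transactions" table


def record_order(custid, name, last_name, date, code, burger_name, price):
    """Store customer and burger details together: a single write."""
    doc = {
        "custid": custid,
        "name": name,
        "lastName": last_name,
        "date": date,
        "order": {"code": code, "name": burger_name, "price": price},
    }
    transactions_by_day[date].append(json.dumps(doc))
    return doc


def daily_report(date):
    """List the day's transactions with names included: a single read."""
    return [json.loads(raw) for raw in transactions_by_day.get(date, [])]


record_order("AAABCCC", "Joe", "Black", "2012/07/07", "Burger0001", "Tasty", 3.5)
report = daily_report("2012/07/07")
print([(t["name"], t["order"]["name"]) for t in report])
```

Because each stored document is a snapshot, reprinting an old order needs no lookups, while a later name change would not propagate to past transactions.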
I hope this helps you in some way,
I'm the creator of http://djondb.com so I think that having inside knowledge gives me a different approach to the problems according to what the database will be able to do, but I'm not aware of how azure will handle the queries if you are not able to query the document values and just the row keys, but anyway I hope this gives you an insight.

Structure a dynamoDB table to enable ASC or DESC ordered pagination on * items in a table

I want to ORDER_BY by time/date, and paginate through all items in a table. Scan seems designed to paginate through everything, but does not seem to have a "ASC/DESC" equiv. Query has ScanIndexForward but requires specific primary keys. (no way to SELECT * ?)
Based on the first comment of this question I'm thinking the only way to achieve this is to use a common primary key (!?) and then Query based on that, focusing on the Range key. Is this really how it's supposed to work? I'd have to make a whole separate table with mirrored attributes if I wanted to Query an individual item based on a unique primary key.
Please excuse my NoSQL noobness. I'm a front-end dev who's only dabbled in MySQL and SimpleDB.
Yes, this is what Query is for. The hash key identifies the list of things to page over, and the range key indicates the position within the list. If you can tolerate the latency hit, all you need to store in that table is the primary keys of the items where the data being paged over lives; you can then issue a BatchGetItem to read a pageful of data in parallel.
Duplicate data isn't the sin in NoSQL that it is in the relational model; you're essentially crafting a MySQL-style index by hand.
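The paging behaviour can be sketched in memory as follows. This is not a DynamoDB call: `query_page` is a hypothetical function whose `forward` and `exclusive_start` arguments mimic the roles of ScanIndexForward and ExclusiveStartKey, and the returned cursor plays the role of LastEvaluatedKey. Range keys are assumed to be unique within the partition.

```python
from bisect import bisect_left, bisect_right


def query_page(partition, limit, forward=True, exclusive_start=None):
    """Return (page, last_key) from one partition.

    partition is a list of (range_key, item_id) pairs sorted by range_key.
    last_key is None once there is nothing left to page through.
    """
    keys = [key for key, _ in partition]
    if forward:
        start = 0 if exclusive_start is None else bisect_right(keys, exclusive_start)
        page = partition[start:start + limit]
        more = start + limit < len(partition)
    else:
        end = len(partition) if exclusive_start is None else bisect_left(keys, exclusive_start)
        low = max(0, end - limit)
        page = partition[low:end][::-1]  # newest first
        more = low > 0
    last_key = page[-1][0] if page and more else None
    return page, last_key


# One shared hash key holds every item; the range key is the date.
all_items = [("2012-07-01", "a"), ("2012-07-02", "b"), ("2012-07-03", "c")]

first_page, cursor = query_page(all_items, limit=2, forward=False)
second_page, end_cursor = query_page(all_items, limit=2, forward=False, exclusive_start=cursor)
print(first_page, cursor)
print(second_page, end_cursor)
```

The item ids in each page are what would then be passed to a batch read to fetch the full records.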

Postgres full text search across multiple related tables

This may be a very simplistic question, so apologies in advance, but I am very new to database usage.
I'd like to have Postgres run its full text search across multiple joined tables. Imagine something like a model User, with related models UserProfile and UserInfo. The search would only be for Users, but would include information from UserProfile and UserInfo.
I'm planning on using a gin index for the search. I'm unclear, however, on whether I'm going to need a separate tsvector column in the User table to hold the aggregated tsvectors from across the tables, and to setup triggers to keep it up to date. Or if it's possible to create an index without a tsvector column that'll keep itself up to date whenever any of the relevant fields in any of the relevant tables change. Also, any tips on the syntax of the command to create all this would be much appreciated as well.
Your best answer is probably to have a separate tsvector column in each table (with an index on it, of course). If you aggregate the data up to a shared tsvector, that'll create a lot of updates on that shared one whenever the individual ones update.
You will need one index per table. Then when you query it, obviously you need multiple WHERE clauses, one for each field. PostgreSQL will then automatically figure out which combination of indexes to use to give you the quickest results - likely using bitmap scanning. It will make your queries a little more complex to write (since you need multiple column matching clauses), but that keeps the flexibility to only query some of the fields in the cases where you want.
You cannot create one index that tracks multiple tables. To do that you need the separate tsvector column and triggers on each table to update it.