Suppose I need to design a table for Spotify where I can quickly retrieve the items (songs or albums) a user has already purchased, so they can be played for the user. The scenario is straightforward: when a user clicks to buy a song, the database needs to quickly record that the song was purchased under that user's account.
Since it requires near real-time responses and the table could grow enormously, while the access pattern is simple and fixed, a non-relational database seems designed for this use case. That's why I am thinking about using HBase, Cassandra, or MongoDB.
I would like to use UserId as the primary key for this Purchase table. Would wide-column stores (like HBase or Cassandra) or document databases (like MongoDB) work better for this scenario?
The input is just a user_id, and the database responds with all of that user's purchased items. What is the best table design strategy?
{ user_id: int,
  purchased_item: [ item1, item2, item3 ]
}
The second table will be used for searching for specific artists, albums, genres, and songs that are available for purchase.
I would appreciate any examples of best practices from real-world applications, or any good articles/documents/blogs I can read.
If you need near real-time responses I would definitely consider Cassandra, especially for detailed purchase-history storage!
What I would do using Cassandra is the following:
CREATE TABLE purchases (
    user_id uuid,
    purchase_id uuid,
    item_id uuid,
    item_details text,
    item_name text,
    time_of_purchase timestamp,
    PRIMARY KEY ((user_id), purchase_id, item_id)
);
This lets you cluster the data in several ways: first by user_id, then by purchase_id, keeping all items recorded per purchase!
With the primary key formed of the partition key (user_id) and the clustering keys (purchase_id and item_id), all of a user's rows live in one partition, grouped by purchase_id and then by item_id.
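To illustrate the read path, here is a minimal sketch of the queries this schema supports (the uuid literals are placeholders):

-- All purchased items for one user: a single-partition read.
SELECT item_id, item_name, time_of_purchase
FROM purchases
WHERE user_id = 123e4567-e89b-12d3-a456-426655440000;

-- All items within one specific purchase, using the clustering key.
SELECT item_id, item_name
FROM purchases
WHERE user_id = 123e4567-e89b-12d3-a456-426655440000
  AND purchase_id = 123e4567-e89b-12d3-a456-426655440001;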
https://cassandra.apache.org/doc/latest/data_modeling/intro.html
https://docs.datastax.com/en/landing_page/doc/landing_page/current.html
It may be a silly, basic question, but as described in the title, I am wondering how PostgreSQL deals with performance when a table has millions of entries (with the possibility of reaching a billion).
To put it more concretely, I want to store data (audio, photos and videos) in my database (I'm only storing their paths; the files are organised in the file system), but I have to decide whether I use a single table "data" to store all the different types of data, or multiple tables ("data_audio", "data_photos", "data_videos") to separate those types.
The reason I am asking is that I have something like 95% photos and 5% audio and videos, and if I query my database for an audio entry, I don't want it to be slowed down by all the photo entries (searching for a row among a thousand must be different from searching among a million). So I would like to know how PostgreSQL deals with this and whether there is some way to get the best optimisation.
I have read this topic that is really interesting and seems relevant:
How does database indexing work?
Is it the way I should do?
Recap of the core information I will have in my core tables:
1st option:
DATA TABLE (containing audio, photos and videos):
id of type bigserial
_timestamp of type timestamp
path_file of type text
USERS TABLE:
id of type serial
forename of type varchar(255)
surname of type varchar(255)
birthdate of type date
email_address of type varchar(255)
DATA USERS RELATION TABLE:
id_data of type bigserial
id_user of type serial
ACTIVITIES TABLE:
id of type serial
name of type varchar(255)
description of type text
DATA ACTIVITIES RELATION TABLE:
id_data of type bigserial
id_activity of type serial
(SEARCH queries are mainly done on DATA._timestamp and ACTIVITIES.name fields after filtering data by USERS.id)
2nd option (only replacing the previous DATA TABLE with the following three tables and keeping all the other tables):
DATA_AUDIO TABLE
DATA_PHOTOS TABLE
DATA_VIDEOS TABLE
Additional question:
Is it a good idea to have a database per user? (In this scenario, being able to query data depends on whether you have permission; if you want to retrieve data from two different users, you have to ask permission from both. The permission process is a process in its own right and is not handled here, so let's say that queries always target a single user.)
I hope I have been clear; thanks in advance for any help or advice!
Cyrille
Answers:
PostgreSQL is cool with millions and billions of rows.
If the different types of data all have the same attributes and are the same from the database perspective (have the same relationships to other tables etc.), then keep them in one table. If not, use different tables.
The speed of index access to a table does not depend on the size of the table.
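To make the one-table option concrete against the 95%/5% skew in the question, here is a sketch using a partial index; the discriminator column "kind" is an assumption, since a combined table would need one:

-- Hypothetical single "data" table with a discriminator column "kind".
-- The partial index contains only the ~5% audio rows:
CREATE INDEX data_audio_ts_idx ON data (_timestamp) WHERE kind = 'audio';

-- This query can use the small partial index and never
-- touches the photo entries:
SELECT * FROM data
WHERE kind = 'audio' AND _timestamp >= '2020-01-01';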
If the data of different users have connections, like they use common base tables or you want to be able to join tables for different users, it is best to keep them in different schemas in one database. If it is important that they be separated no matter what, keep them in different databases.
It is also an option to keep data for different users in one table, if you use Row Level Security or let your application take care of it.
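As a minimal sketch of the Row Level Security option (the table name data_users and the app.user_id setting are assumptions for illustration):

-- Restrict each user to their own rows in the relation table:
ALTER TABLE data_users ENABLE ROW LEVEL SECURITY;

CREATE POLICY per_user ON data_users
    USING (id_user = current_setting('app.user_id')::int);

-- The application sets the current user before querying:
SET app.user_id = '42';
SELECT * FROM data_users;  -- only rows with id_user = 42 are visible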
This decision depends strongly on your use case and architecture.
Warning: don't create clusters with thousands of databases and databases with thousands of schemas. That causes performance problems in the catalogs.
I am new to NoSQL and DynamoDB; I come from an RDBMS background.
My tables are being moved from MySQL to DynamoDB. I have these tables:
customer (columns: cid [PK], name, contact)
Hardware (columns: hid [PK], name, type)
Rent (columns: rid [PK], cid, hid, time) => this is the association of a customer and a hardware item.
One customer can have many hardware items, and one hardware item can be shared among many customers.
Requirements: it should be possible to retrieve separate lists of customers and hardware items.
Rent details: which customer borrowed which hardware item.
I referred to this: secondary index table. It is about keeping all columns in one table.
I thought of having 2 DynamoDB tables:
Customer - this has all the attributes (similar to the columns) AND a set of hardware item hash keys. (My issue is that when the customer table is queried to retrieve only customers, all the hardware keys are also loaded.)
Any guidance on the table structure, please? How do I save, load, and even update?
Any Java samples, please? (I couldn't find any useful resource similar to my scenario.)
Have a look at DynamoDB's Adjacency List Design Pattern:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-adjacency-graphs.html
In your case, based on the Adjacency List Design Pattern, your schema can be designed as follows.
The prefixes of the partition key and sort key indicate the type of record:
If the record is a customer, both the partition key and the sort key have the prefix 'customer-'.
If the record says that a customer rents a hardware item, the partition key's prefix is 'customer-' and the sort key's prefix is 'hardware-'.
base table
+------------+------------+-------------+
|PK |SK |Attributes |
|------------|------------|-------------|
|customer-cid|customer-cid|name, contact|
|hardware-hid|hardware-hid|name, type |
|customer-cid|hardware-hid|time |
+------------+------------+-------------+
Global Secondary Index Table
+------------+------------+----------+
|GSI-1-PK |GSI-1-SK |Attributes|
|------------|------------|----------|
|hardware-hid|customer-cid|time |
+------------+------------+----------+
Customer and hardware records should be stored in the same table. A customer can refer to its hardware items using:
SELECT * FROM base_table WHERE PK=customer-123 AND SK.startsWith('hardware-')
If you want a hardware item to refer back to its customers, you should use the GSI table:
SELECT * FROM GSI_table WHERE PK=hardware-333 AND SK.startsWith('customer-')
Notice: the SQL I wrote is just pseudocode, to give you the idea.
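For what it's worth, DynamoDB's PartiQL support gets you very close to that pseudocode: begins_with is an actual PartiQL condition function, and a GSI is addressed as "table"."index" (the table and index names below are the placeholders from above):

-- All hardware rented by one customer (base table):
SELECT * FROM "base_table"
WHERE "PK" = 'customer-123' AND begins_with("SK", 'hardware-')

-- All customers renting one hardware item (via the GSI):
SELECT * FROM "base_table"."GSI-1"
WHERE "GSI-1-PK" = 'hardware-333' AND begins_with("GSI-1-SK", 'customer-')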
Take a look at this answer, as it covers many of the basics which are relevant to you.
DynamoDB does not support foreign keys as such. Each table is independent and there are no special tools for keeping two tables synchronised.
You would probably have an attribute in your customers table called hardwares. The attribute would be a list of the hardware ids the customer has. If you wanted to see all hardware items belonging to a customer, you would (see the sketch after these steps):
Perform GetItem on the customer id, or use Query, depending on how you are looking the customer up.
For each hardware id in the customer's hardwares attribute, perform a GetItem on the Hardware table.
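A sketch of that two-step fetch, written in DynamoDB's PartiQL dialect for brevity (the ids are placeholders; in Java these would be GetItem/BatchGetItem calls):

-- Step 1: fetch the customer, which carries the list of hardware ids.
SELECT * FROM "customer" WHERE "cid" = 'c-001'

-- Step 2: fetch each hardware item from that list,
-- one lookup per id (or one BatchGetItem for all of them).
SELECT * FROM "Hardware" WHERE "hid" = 'h-001'
SELECT * FROM "Hardware" WHERE "hid" = 'h-002'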
With DynamoDB you generally end up doing more in the client application relative to an RDBMS solution. The benefits are that it's fast and simple. But you will find that you move a lot of work from the database server to your client.
I have tables like this in SQL Server
Users
UserId (Unique)
Name
Age
Friends
UserId
FriendId
Topics
UserId
Subject
There can be several thousand users, and there are several other properties in the tables.
I can query to get the following answers (the third one is sketched in SQL after this list):
Give me all the friends of user "Tom".
Give me all the topics created by "Tom".
Give me all the topics created by Tom's friends that contains "abc" in the subject.
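For reference, the third query is straightforward in SQL Server; here is a sketch under the schema above (looking Tom up by name for brevity):

SELECT t.Subject
FROM Users u
JOIN Friends f ON f.UserId = u.UserId
JOIN Topics t ON t.UserId = f.FriendId
WHERE u.Name = 'Tom'
  AND t.Subject LIKE '%abc%';

Note that the LIKE '%abc%' part is exactly what has no efficient equivalent in Azure Table storage.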
If I were to do this in Azure Table storage, how would I structure my tables?
I have gone through this and this. I would like someone with more experience modeling Azure Table storage to give some insights.
1 and 2 are pretty easy. You create two Azure tables - Friends and Topics indexed by user id (with user id in the key).
The 3rd one is much more difficult with Azure tables, especially the "contains 'abc' in the subject" part.
Azure tables don't support full-text search. Basically, it is only possible to efficiently retrieve values (or ranges of values) either by exact key or with a 'startswith' operator: "give me all records where the key is equal to 'key value'", or "give me all records where the key is greater than 'key lower bound' and less than 'key upper bound'".
It is also possible to filter with 'startswith' on any non-key field of a record, but this involves a table scan and is not efficient. It's not possible to do similar filtering with 'contains'.
So I think you need something with full text search support here.
I'm trying to plan a NOSQL table schema. There are relationships in my data, but they are mostly what would be N:N in a relational db; there are very few normal 1:N relationships.
So in this case, I'm trying to create implicit relationships that will allow me to browse from both ends of the relationship. I'm using Azure Table Storage, so I understand that full-text searching isn't available; I can only retrieve an "object" by its Partition Key + Row Key combination.
So imagine I have a table called "People" and a table called "Hamburgers" and each object in the tables can be related to multiple objects in the other table. Hamburgers are eaten by many people, people each eat many hamburgers.
Since the relationship is probably weighted to the people side - i.e. there are more people per hamburger than vice-versa, I would handle this in the tables like this:
Hamburger Table
Partition Key: Only 1 partition
Row Key: Unique ID
People Table
Partition Key: Only 1 partition
Row Key: Unique ID
"Columns": an extra value for every hamburger the person eats
Hamburger-People Table
Partition Key: Hamburger Row Key
Row Key: People Row Key
This way, if I'm looking at a hamburger and want to see all the people that eat it, I can go to the Hamburger-People table and use my Hamburger's Row Key to get the partition of all the people that eat the hamburger.
If I'm at a person and want to see all the hamburgers he/she eats, I have the extra values with the Row Keys of the hamburgers the person eats.
When inserting data into the tables, if the data involves a hamburger/person relationship, I would insert both values in the proper tables, then create the row in the Hamburger-People table. If I wanted to keep a duplicate-free list of hamburgers, I would need to search the Hamburger table first to make sure the hamburger wasn't already there (like "Whopper": if it's there, I wouldn't insert it again). Then I would insert a row in the hamburger's existing partition in the Hamburger-People table.
But for the most part, the no-duplicate requirement doesn't exist.
Is this a good best-practices approach to NOSQL schema, or am I going to run into problems later?
UPDATE
Also, I would like to be able to partition the data tables later, but I'm not sure how to do so with this structure; adding a 2nd partition to the hamburger table would require me to store an extra value in the hamburger-People table, and I'm not sure if that would start to be too complex.
OK, nice questions, and I think most of them are the ones every RDBMS developer faces as soon as they hit the NoSQL world:
1. How to group the partitions?
To get the best out of partitioning, the load of your database should be distributed across your servers. Let's see what would happen with your approach.
A person with key "A" enters the restaurant; you save the person and their burger, a Classic Tasty (key "T"). The person record goes to server X and the burger goes to server Y. Now a new customer enters with key "B" and wants something different, a burger "W"; again the person goes to server X, but this time the burger also goes to server X. Server X is now taking most of the load; if you repeat this you'll see that server X becomes a bottleneck, because 75% of the records go there (all of the people and 50% of the burgers), and that will create problems with your load. And the problem gets worse when you query, because all of the people queries will hit server X.
To solve this you could use the person's key as part of the partition key for the relationship, so the person is partitioned on the same server as their burger relationships. This way your workload is balanced, and if one of the servers goes down, the person and their hamburgers are "lost" together: a consistent "inconsistency".
2. Should I use a "relationship" in a NoSQL database?
Remember that NoSQL grants you permission to duplicate information whenever your problem calls for avoiding "over-querying": if you store together the information that is commonly queried together, you avoid a round trip to the database. So if you store a "transaction" instead of separate "person" and "burgers" records, you will get better performance and avoid some hits to the database. Let's walk through an example with real data and compare your approach with "mine":
Joe Black comes to the restaurant and asks for a Tasty; here you will do the following transactions:
Create a Joe Black record
Create a burger transaction record
If you want to list your daily transactions you will need to:
Get all the records for the day from the person-burger "table", then go to the person "table" and retrieve the names of the customers, and then go to the hamburger records and retrieve their names. (You won't be able to do cross-table queries, because some records could be on one server and others on the second server.)
OK, what if you create a table "transactions" and store in it the following JSON:
{
  "custid": "AAABCCC",
  "name": "Joe",
  "lastName": "Black",
  "date": "2012/07/07",
  "order": {
    "code": "Burger0001",
    "name": "Tasty",
    "price": 3.5
  }
}
I know you will have several records with the same "Tasty" description; that's denormalization, which is very useful when you apply NoSQL solutions to this type of problem. Now, how many writes did it take to store the information? Just one! Wow... and how many queries will you need to retrieve the information at the end of the day? Again, just one. This will create some problems, but it will save you a lot of work too. Could you reprint the order easily? (Yes you can!) What if the name of the customer changes? Is that even possible?
I hope this helps you in some way.
I'm the creator of http://djondb.com, so I think having insider knowledge gives me a different approach to these problems, based on what the database is able to do. I'm not aware of how Azure will handle the queries if you can only query the row keys and not the document values, but anyway, I hope this gives you an insight.
Can you share your thoughts on how you would implement data versioning in PostgreSQL? (I've asked similar questions regarding Cassandra and MongoDB. If you have any thoughts on which DB is better for this, please share.)
Suppose that I need to version records in a simple address book. Address book records are stored in one table without relations for simplicity. I expect that the history:
will be used infrequently
will be used all at once to present it in a "time machine" fashion
there won't be more than a few hundred versions of a single record.
history won't expire.
I'm considering the following approaches:
Create a new table to store the history of records, with a copy of the address book table's schema, plus a timestamp column and a foreign key to the address book table.
Create a kind of schema-less table to store changes to address book records. Such a table would consist of: AddressBookId, TimeStamp, FieldName, Value. This way I would store only changes to the records, and I wouldn't have to keep the history table and the address book table in sync.
Create a table to store serialized (JSON) address book records, or changes to address book records. Such a table would look as follows: AddressBookId, TimeStamp, Object (varchar).
Again, this is schema-less, so I wouldn't have to keep the history table in sync with the address book table.
(This is modelled after Simple Document Versioning with CouchDB)
I do something like your second approach: a table with the actual working set and a history table with changes (timestamp, record_id, property_id, property_value). This includes the creation of records. A third table describes the properties (id, property_name, property_type), which helps with data conversion higher up in the application. This also makes it very easy to track changes to single properties.
Instead of a timestamp you could also use an int-like column which you increment on every change per record_id, so you have an actual version number.
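A minimal DDL sketch of that layout; the table names and column types are assumptions:

-- The working set:
CREATE TABLE address_book (
    id   serial PRIMARY KEY,
    name text,
    city text
);

-- Property catalogue, used for data conversion in the application:
CREATE TABLE properties (
    id            serial PRIMARY KEY,
    property_name text,
    property_type text
);

-- One row per changed property, including the initial creation:
CREATE TABLE history (
    "timestamp"    timestamptz NOT NULL DEFAULT now(),
    record_id      integer REFERENCES address_book (id),
    property_id    integer REFERENCES properties (id),
    property_value text
);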
You could have start_date and end_date.
When end_date is NULL, it's the current record.
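A sketch of that start_date/end_date variant; the column names besides the two dates are assumptions:

CREATE TABLE address_book (
    id         integer NOT NULL,  -- one id, many versions
    name       text,
    address    text,
    start_date timestamptz NOT NULL DEFAULT now(),
    end_date   timestamptz        -- NULL marks the current version
);

-- Current version of record 42:
SELECT * FROM address_book WHERE id = 42 AND end_date IS NULL;

-- An update closes the old version and inserts the new one:
BEGIN;
UPDATE address_book SET end_date = now() WHERE id = 42 AND end_date IS NULL;
INSERT INTO address_book (id, name, address) VALUES (42, 'Joe Black', 'New Street 1');
COMMIT;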
I'm versioning glossary data, and my approach was pretty successful for my needs. Basically, for records that need versioning, you divide the field set into persistent fields and version-dependent fields, thus creating two tables. Some of the persistent fields should also form the unique key for the first table.
Address
id [pk]
fullname [uk]
birthday [uk]
Version
id [pk]
address_id [uk]
timestamp [uk]
address
In this fashion, you get address subjects determined by fullname and birthday (which should not change across versions) and versioned records containing the addresses. address_id should be related to Address:id through a foreign key. With each entry in the Version table you get a new version of the subject Address:id=address_id with a specific timestamp, which gives you a history reference.
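A rough DDL rendering of that layout; the column types are assumptions, and each [uk] group becomes a unique constraint:

CREATE TABLE address (
    id       serial PRIMARY KEY,
    fullname text,
    birthday date,
    UNIQUE (fullname, birthday)
);

CREATE TABLE version (
    id          serial PRIMARY KEY,
    address_id  integer REFERENCES address (id),
    "timestamp" timestamptz NOT NULL DEFAULT now(),
    address     text,
    UNIQUE (address_id, "timestamp")
);

-- Full history for one subject, oldest version first:
SELECT "timestamp", address
FROM version
WHERE address_id = 42
ORDER BY "timestamp";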