So in a traditional database I might have 2 tables like users, company
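users
id | username | companyid | email
----+----------+-----------+---------------------
1  | j23      | 1         | something#gmail.com
2  | fj222    | 1         | james#aol.com
company
id | ownerid | company_name
----+---------+-------------------------
1  | 1       | A Really boring company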
This is to say that users 1 and 2 are a part of company 1 (a really boring company) and user 1 is the owner of this company.
I could easily issue an update statement in MySQL or PostgreSQL to update the company name.
But how could I model the same data from a NoSQL perspective, in something like DynamoDB or MongoDB?
Would each user record (document in NoSQL) contain the same company table data (id, ownerid or an "is owner" true/false flag, and company name)? If so, I'm unclear how to update the record for all users containing this data when the company name needs to be updated.
If you want to save the company object as JSON in each user record (for performance reasons), then indeed, you have to update a lot of rows.
But the best way to achieve this is to keep a structure similar to the one you have above in MySQL. A NoSQL schema depends a lot on the queries you will be making.
For example, the schema above is great for:
Find a particular user by username, along with his company name. First you need to query User by username (you can add an index), get the companyId and do another query on Company to fetch the name.
Let's assume company name changes often
In this case company name update is easy. To execute the read query, you need 2 queries to get your result (but they should execute fast)
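A minimal sketch of this "relational" layout, using plain Python dictionaries as stand-ins for the two collections. The field names come from the question; the helper function names are made up for illustration, and none of this is a real database driver API.

```python
# In-memory stand-ins for two NoSQL collections (not a real driver API).
# Users are keyed by username, as an index on username would allow.
users_by_name = {
    "j23": {"id": 1, "companyid": 1, "email": "something#gmail.com"},
    "fj222": {"id": 2, "companyid": 1, "email": "james#aol.com"},
}
companies_by_id = {
    1: {"ownerid": 1, "company_name": "A Really boring company"},
}


def find_user_with_company(username):
    """Two lookups: the user first, then the company it references."""
    user = users_by_name[username]                 # query 1: user by username
    company = companies_by_id[user["companyid"]]   # query 2: company by id
    return {"username": username, "company_name": company["company_name"]}


def rename_company(company_id, new_name):
    """The name is stored in exactly one record, so one write is enough."""
    companies_by_id[company_id]["company_name"] = new_name


rename_company(1, "A less boring company")
print(find_user_with_company("fj222"))
```

Both users see the new name on their next read, because neither of them stores a copy of it.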
Embedded company JSON would work better for:
Find all users from a specific city and show their company name
Let's assume company name changes very rarely
In this case, we can't use the "relational" approach, because we will do 1 query to fetch Users by city and then another query for all users found to fetch the company name
Using embedded approach, we need only 1 query
To update a company name, a full (expensive) scan is needed, but should be ok if done rarely
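The embedded variant can be sketched the same way, again with plain Python data rather than a real database. The "city" field and its values are assumptions taken from the example query, not from the original tables, and the function names are hypothetical.

```python
# Each user document carries its own copy of the company data.
embedded_users = [
    {"username": "j23", "city": "Berlin",
     "company": {"id": 1, "is_owner": True, "name": "A Really boring company"}},
    {"username": "fj222", "city": "Berlin",
     "company": {"id": 1, "is_owner": False, "name": "A Really boring company"}},
]


def users_in_city(city):
    """One pass over users returns the company name too; no second lookup."""
    return [(u["username"], u["company"]["name"])
            for u in embedded_users if u["city"] == city]


def rename_company_everywhere(company_id, new_name):
    """Full scan: every user document holding a copy must be rewritten."""
    updated = 0
    for u in embedded_users:
        if u["company"]["id"] == company_id:
            u["company"]["name"] = new_name
            updated += 1
    return updated


print(users_in_city("Berlin"))
print(rename_company_everywhere(1, "A less boring company"))
```

The read is a single query, while the rename touches as many documents as the company has users, which is the trade-off described above.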
What if the company name changes often and I want to get users by city?
This becomes tricky. NoSQL is not a replacement for SQL; it has its shortcomings. The solution may be a platform-dependent feature (from Mongo, DynamoDB, Firestore, etc.), an additional layer on top (Elasticsearch), or no solution at all (consider not using key-value NoSQL).
Depending on the programming language used to handle NoSQL objects/documents, you have a variety of ORM libraries to model your schema. E.g. for MongoDB plus JS/TypeScript I recommend Mongoose and its subdocuments. Here is more about it:
https://mongoosejs.com/docs/subdocs.html
Related
I am currently trying to model a MongoDB database structure where the entities are very complex in relation to each other.
In my current collections, MongoDB queries are difficult or impossible to put into a single aggregation. Incidentally, I'm not a database specialist and have been working with MongoDB for only about half a year.
To keep it as simple as possible but necessary, this is my challenge:
I have newspaper articles that contain simple keywords, works (oeuvres, books, movies), persons, and linked combinations of works and persons. In addition, the same people appear under different names in different articles.
Later, on the person view I want to show the following:
the links of the person with name and work and the respective articles
the articles in which the person appears without a work (by name)
the other keywords that are still in the article
In my structure I want to avoid that entities such as people occur multiple times. So these are my current collections:
Article
------------
id
title
keywordRelations
KeywordRelation
------------
id
type (single or combination)
simpleKeywordId (optional)
personNameConnectionIds (optional)
workIds (optional)
SimpleKeyword
------------
id
value
PersonNameConnection
------------
id
personId
nameInArticleId
Person
------------
id
firstname
lastname
NameInArticle
------------
id
name
type (e.g. abbreviation, synonym)
Work
------------
id
title
To meet the requirements, I would always have to create queries that range over 3 to 4 tables. Is that possible and useful with MongoDB?
Or is there an easier way and structure to achieve that?
I am trying to come up with a MongoDB document model and would like others' opinions. I want to have a document that represents an Employee. This collection will contain all attributes of an employee (e.g. firstName, lastName). Where I am stuck, coming from the relational realm, is the need to store a list of employees that an employee can access. In other words, let's say Employee A is a manager. I need to store the direct reports that he manages, in order to use this in various applications. In a relational database I would have a mapping table that ties an employee to many employees. Since Mongo cannot join documents, do you think I should use an embedded sub-document to store the list of accessible employees as part of the Employee document? Any other ideas?
Unless you're using employee groups (Accounting, HR, etc.), you'll probably be fine adding the employee name, Mongo ObjectId, and any other information unique to that manager/employee relationship as a sub-document to the manager's document.
With that in place you could probably do your reporting on these relationships through a simple aggregation.
This is all IMHO, and begs the question; Is simple aggregation another oxymoron like military intelligence?
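As a rough illustration of that suggestion, here is what a manager document with embedded direct reports might look like, written as a plain Python dictionary. All ids, names, and the "directReports" and "since" fields are hypothetical placeholders, not a schema from the question.

```python
# A manager document with the direct reports embedded as sub-documents.
manager_doc = {
    "_id": "emp-a",
    "firstName": "Alice",
    "lastName": "Adams",
    "directReports": [
        # Only data specific to the relationship, plus a reference and name.
        {"employeeId": "emp-b", "name": "Bob Brown", "since": "2015-01-01"},
        {"employeeId": "emp-c", "name": "Carol Clark", "since": "2016-06-15"},
    ],
}


def accessible_employee_ids(doc):
    """Ids of the employees this employee is allowed to access."""
    return [report["employeeId"] for report in doc.get("directReports", [])]


def can_access(doc, employee_id):
    """True if employee_id is one of the document owner's direct reports."""
    return employee_id in accessible_employee_ids(doc)


print(accessible_employee_ids(manager_doc))
print(can_access(manager_doc, "emp-z"))
```

An employee with no reports simply has no "directReports" field, so the access list comes back empty without any schema change.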
I have read Eric Evans' Domain Driven Design book and I have been trying to apply some of the concepts.
In his book, Eric talks about aggregates and how aggregate roots should have a unique global id whereas aggregate members should have a unique local id. I have been trying to apply that concept to my database tables and I'm running into some issues.
I have two tables in my PostgreSQL database: facilities and employees where employees can be assigned to a single facility.
In the past, I would lay out the employees table as follows:
CREATE TABLE "employees" (
    "employeeid" serial NOT NULL PRIMARY KEY,
    "facilityid" integer NOT NULL,
    ...
    FOREIGN KEY ("facilityid") REFERENCES "facilities" ("facilityid")
);
where employeeid is a globally unique id. I would then add code in the backend for access control validation, preventing users of one facility from accessing rows pertaining to other facilities. I have a feeling this might not be the safest way to do it.
What I am now considering is this layout:
CREATE TABLE "employees" (
    "employeeid" integer NOT NULL,
    "facilityid" integer NOT NULL,
    ...
    PRIMARY KEY ("employeeid", "facilityid"),
    FOREIGN KEY ("facilityid") REFERENCES "facilities" ("facilityid")
);
where employeeid is unique (locally) for a given facilityid but needs to be paired with a facilityid to be unique globally.
Concretely, this is what I am looking for:
Employee A (employeeid: 1, facilityid: 1)
Employee B (employeeid: 2, facilityid: 1)
Employee C (employeeid: 1, facilityid: 2)
where A, B and C are 3 distinct employees and...
adding an employee D to facility 1 would give him the keys (employeeid : 3, facilityid: 1)
adding an employee E to facility 2 would give him the keys (employeeid : 2, facilityid: 2)
I see two ways of achieving this:
I could use triggers or stored procedures to automatically generate new employeeids and store the last ids for every facility in another table for quicker access but I am concerned about concurrency issues and ending up with 2 employees from the same facility with the same id.
I could possibly create a new sequence for each facility to manage the employeeids but I fear ending up with thousands of sequences to manage and with procedures to delete those sequences in case a facility is deleted. Is there anything wrong with this? It seems heavy to me.
Which approach should I take? Is there anything I'm missing out on?
I am inferring from your question that you will be running a single database for all facilities, or at least that if you have a local database as the "master" for each facility that the data will need to be combined in a central database without collisions.
I would make the facilityid the high order part of the primary key. You could probably assign new employee numbers using a simple SELECT max(employeeid) + 1 ... WHERE facilityid = n approach, since adding employees to any one facility is presumably not something that happens hundreds of times per second from multiple concurrent sources. There is some chance that this could generate an occasional serialization failure, but it is my opinion that any database access should be through a framework which recognizes those and automatically retries the transaction.
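A minimal sketch of the max(employeeid) + 1 approach with a retry loop. To stay self-contained it uses SQLite through Python's sqlite3 module rather than PostgreSQL, leaves out the facilities table, and adds a hypothetical "name" column; the add_employee helper and its retry count are assumptions for illustration. In PostgreSQL the conflict would surface as a unique violation or a serialization failure instead of sqlite3.IntegrityError.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employees (
        employeeid INTEGER NOT NULL,
        facilityid INTEGER NOT NULL,
        name       TEXT NOT NULL,
        PRIMARY KEY (facilityid, employeeid)  -- facilityid is the high-order part
    )
    """
)


def add_employee(conn, facility_id, name, max_retries=3):
    """Insert an employee using the next free local id for the facility."""
    for _ in range(max_retries):
        try:
            with conn:  # commits on success, rolls back on an exception
                next_id = conn.execute(
                    "SELECT COALESCE(MAX(employeeid), 0) + 1 "
                    "FROM employees WHERE facilityid = ?",
                    (facility_id,),
                ).fetchone()[0]
                conn.execute(
                    "INSERT INTO employees (employeeid, facilityid, name) "
                    "VALUES (?, ?, ?)",
                    (next_id, facility_id, name),
                )
                return next_id
        except sqlite3.IntegrityError:
            continue  # another writer took this id first; try again
    raise RuntimeError("could not allocate an employee id")


allocated = [
    add_employee(conn, 1, "A"),
    add_employee(conn, 1, "B"),
    add_employee(conn, 2, "C"),
    add_employee(conn, 1, "D"),
    add_employee(conn, 2, "E"),
]
print(allocated)
```

The composite primary key is what makes the retry safe: two concurrent writers can compute the same next id, but only one insert can succeed.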
I guess you have overstressed the aggregate root concept here. In my understanding of modelling an employee (which depends on your context), an employee is almost always an aggregate root, possibly referenced by another aggregate root, the facility.
Both employee and facility almost always have natural keys. For the employee this is typically some employee id (printed on employee identification badges, or at least maintained in the human resources software system). Facilities have natural keys too, almost always containing a location part and a number, like "MUC-1" for facility 1 located in Munich. But that all depends on your context. If employee and facility have these natural keys, your database model should be quite clear.
I've been wondering how Facebook manages the database design for all the different things that you can "like". If there is only one thing to like, this is simple: just a foreign key to what you like and a foreign key to who you are.
But there must be hundreds of different tables that you can "like" on Facebook. How do they store the likes?
If you want to represent this sort of structure in a relational database, then you need to use a hierarchy normally referred to as table inheritance. In table inheritance, you have a single table that defines a parent type, then child tables whose primary keys are also foreign keys back to the parent.
Using the Facebook example, you might have something like this:
User
------------
UserId (PK)
Item
-------------
ItemId (PK)
ItemType (discriminator column)
OwnerId (FK to User)
Status
------------
ItemId (PK, FK to Item)
StatusText
RelationshipUpdate
------------------
ItemId (PK, FK to Item)
RelationshipStatus
RelationTo (FK to User)
Like
------------
OwnerId (FK to User)
ItemId (FK to Item)
Compound PK of OwnerId, ItemId
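The layout above can be tried out with a small sketch. It uses SQLite via Python's sqlite3 module purely for convenience; the table names are pluralised (users, items, statuses, likes) to avoid the SQL keyword LIKE, the RelationshipUpdate table is left out for brevity, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE users (userid INTEGER PRIMARY KEY);
    CREATE TABLE items (                 -- parent type
        itemid   INTEGER PRIMARY KEY,
        itemtype TEXT NOT NULL,          -- discriminator column
        ownerid  INTEGER NOT NULL REFERENCES users (userid)
    );
    CREATE TABLE statuses (              -- child type: PK is also FK to items
        itemid     INTEGER PRIMARY KEY REFERENCES items (itemid),
        statustext TEXT NOT NULL
    );
    CREATE TABLE likes (
        ownerid INTEGER NOT NULL REFERENCES users (userid),
        itemid  INTEGER NOT NULL REFERENCES items (itemid),
        PRIMARY KEY (ownerid, itemid)
    );
    INSERT INTO users VALUES (1), (2);
    INSERT INTO items VALUES (10, 'status', 1);
    INSERT INTO statuses VALUES (10, 'Hello');
    INSERT INTO likes VALUES (2, 10);
    """
)

# Likes point at the parent table, so one likes table covers every item type.
liked = conn.execute(
    """
    SELECT l.ownerid, i.itemtype, s.statustext
    FROM likes AS l
    JOIN items AS i ON i.itemid = l.itemid
    LEFT JOIN statuses AS s ON s.itemid = i.itemid
    """
).fetchall()
print(liked)
```

Adding a new likeable type means adding one more child table keyed by itemid; the likes table itself does not change.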
In the interest of completeness, it's worth noting that Facebook doesn't use an RDBMS for this sort of thing. They have opted for a NoSQL solution for this sort of storage. However, this is one way of storing such loosely coupled information within an RDBMS.
Facebook does not have traditional foreign keys and such, as they don't use relational databases for most of their data storage. Simply, they don't cut it for that.
However, they use several NoSQL-type data stores. The "Like" is most likely attributed by a service, probably set up in an SOA style throughout their infrastructure. This way the "Like" can be attributed to anything they want it to be associated with, with vast scalability and no tightly coupled relational issues to deal with, which is something Facebook can't really afford at the volume they operate.
They could also be using an AOP (Aspect Oriented Programming) style processing mechanism to "attach" a "Like" to anything that may need one at page rendering time, but I get the impression that it is asynchronous processing via JavaScript against an SOA-style web service or other delivery mechanism.
Either way, I'd love to hear how they have this set up from an architecture perspective myself. Considering their volume, even the simple "Like" button becomes a significant implementation of technology.
You can have a table with Id, ForeignId and Type. Type can be anything like Photo, Status, Event, etc. ForeignId would be the id of the record in the table named by Type. This makes it possible to handle both comments and likes. You only need one table for all likes, one for all comments, and the one I described.
Example:
Items
Id | Foreign Id | Type
----+-------------+--------
1 | 322 | Photo
4 | 346 | Status
Likes
Id | User Id | Item Id
----+-------------+--------
1 | 111 | 1
Here, user with Id 111 likes the photo with Id 322.
Note: I assume you are using an RDBMS, but see Adron's answer. Facebook does not use an RDBMS for most of their data.
I'm pretty sure Facebook does not store "like" information in an RDBMS the way some others have suggested. With millions of users and possibly thousands of likes, we're looking at thousands of rows to join here, which would impact performance.
The best approach here is to append all "likes" in a single row. For example, use a table with a user_like_id column of text datatype, to which the ids of all users who liked the post are appended. In this case, you query only one row and you get everything. This will be a lot faster than joining tables and getting counts.
EDIT: I haven't been here on this site lately and I just discovered this answer has been downvoted. Well, here's an example post with like count and their avatars. This is my design where I just implemented what I'm talking about.
The two components here are 1.) XREF table and 2.) JSON object.
The likes are still stored on a XREF table. But at the same time, data is appended on JSON object and stored on a text column on the post table.
Why did I store the likes info in a text column as JSON? So that there's no need to do a db lookup/join for the likes. If someone unlikes the post, the JSON object is just updated.
Now I don't know why this answer is downvoted by some users here. This answer provides quick data retrieval. This is close to the NoSQL approach, which is how FB accesses data. In this case, there's no need for extra joins/lookups to get likes info.
And here's the table that holds the likes. It's just a simple XREF mapping between user and item table.
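Since the original screenshots are not available here, the following is only a guess at the shape of that design: an XREF table as the source of truth plus a JSON text column on the post that is rewritten on every like or unlike. It uses SQLite via Python's sqlite3 module, and all table, column, and function names are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (
        postid     INTEGER PRIMARY KEY,
        likes_json TEXT NOT NULL DEFAULT '[]'   -- denormalised copy for reads
    );
    CREATE TABLE post_likes (                   -- XREF between user and post
        userid INTEGER NOT NULL,
        postid INTEGER NOT NULL,
        PRIMARY KEY (userid, postid)
    );
    INSERT INTO posts (postid) VALUES (1);
    """
)


def _refresh_cache(postid):
    """Rebuild the JSON column from the XREF table."""
    ids = [row[0] for row in conn.execute(
        "SELECT userid FROM post_likes WHERE postid = ? ORDER BY userid",
        (postid,))]
    conn.execute("UPDATE posts SET likes_json = ? WHERE postid = ?",
                 (json.dumps(ids), postid))


def like(userid, postid):
    with conn:  # XREF row and JSON copy change in one transaction
        conn.execute("INSERT OR IGNORE INTO post_likes VALUES (?, ?)",
                     (userid, postid))
        _refresh_cache(postid)


def unlike(userid, postid):
    with conn:
        conn.execute("DELETE FROM post_likes WHERE userid = ? AND postid = ?",
                     (userid, postid))
        _refresh_cache(postid)


def likes_for(postid):
    """Single-row read: no join against the XREF table."""
    row = conn.execute("SELECT likes_json FROM posts WHERE postid = ?",
                       (postid,)).fetchone()
    return json.loads(row[0])


like(111, 1)
like(222, 1)
unlike(111, 1)
print(likes_for(1))
```

The read path touches one row, at the cost of a heavier write path and a JSON column that grows with the number of likes.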
I'm willing to give MongoDB and CouchDB a serious try. So far I've worked a bit with Mongo, but I'm also intrigued by Couch's RESTful approach.
Having worked for years with relational DBs, I still don't get what is the best way to get some things done with non relational databases.
For example, if I have 1000 car shops and 1000 car types, I want to specify what kind of cars each shop sells. Each car has 100 features. Within a relational database I'd make a middle table to link each car shop with the car types it sells via IDs. What is the NoSQL approach? If every car shop sells 50 car types, it means replicating a huge amount of data if I have to store all the features of all the car types it sells within the car shop!
Any help appreciated.
I can only speak to CouchDB.
The best way to stick your data in the db is to not normalize it at all beyond converting it to JSON. If that data is "cars" then stick all the data about every car in the database.
You then use map/reduce to create a normalized index of the data. So, if you want an index of every car, sorted first by shop, then by car-type you would emit each car with an index of [shop, car-type].
Map reduce seems a little scary at first, but you don't need to understand all the complicated stuff or even btrees, all you need to understand is how the key sorting works.
http://wiki.apache.org/couchdb/View_collation
With that alone you can create amazing normalized indexes over differing documents with the map reduce system in CouchDB.
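To make the key-sorting idea concrete, here is a small simulation in plain Python. It is not CouchDB code (real views are written as JavaScript map functions and sorted by CouchDB's own collation rules); it only mimics emitting a [shop, car-type] key per document and reading the rows back in key order. The documents and field names are invented.

```python
# Denormalised "car" documents, one per car offered by a shop.
car_docs = [
    {"_id": "c1", "shop": "shop-b", "car_type": "sedan"},
    {"_id": "c2", "shop": "shop-a", "car_type": "wagon"},
    {"_id": "c3", "shop": "shop-a", "car_type": "coupe"},
]


def map_car(doc):
    """Stand-in for a map function: emit a compound key per document."""
    yield ([doc["shop"], doc["car_type"]], doc["_id"])


# The "view": all emitted rows, sorted by key (shop first, then car type).
view_rows = sorted(
    (row for doc in car_docs for row in map_car(doc)),
    key=lambda row: row[0],
)


def rows_for_shop(shop):
    """Rows whose key starts with the given shop, like a key-range query."""
    return [(key, value) for key, value in view_rows if key[0] == shop]


print(view_rows)
print(rows_for_shop("shop-a"))
```

Because the rows are sorted by the compound key, everything for one shop sits next to each other, already ordered by car type.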
In MongoDB, an often used approach would be to store a list of _ids of car types in each car shop. So there is no separate join table, but you are still basically doing a client-side join.
Embedded documents become more relevant for cases that aren't many-to-many like this.
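A sketch of that client-side join with plain Python dictionaries standing in for the two collections. The ids, names, and the "car_type_ids" field are hypothetical, and no MongoDB driver is involved.

```python
# Car types are stored once, with all their features.
car_types_by_id = {
    "t1": {"name": "Hatchback", "features": {"doors": 5}},
    "t2": {"name": "Roadster", "features": {"doors": 2}},
    "t3": {"name": "Pickup", "features": {"doors": 4}},
}

# Each shop only stores the ids of the car types it sells.
car_shops = [
    {"_id": "s1", "name": "City Cars", "car_type_ids": ["t1", "t2"]},
    {"_id": "s2", "name": "Farm Motors", "car_type_ids": ["t3"]},
]


def car_types_for_shop(shop):
    """Client-side join: resolve the referenced car types one by one."""
    return [car_types_by_id[type_id] for type_id in shop["car_type_ids"]]


def shops_selling(type_id):
    """Reverse lookup: shops whose id list contains the car type."""
    return [shop["name"] for shop in car_shops
            if type_id in shop["car_type_ids"]]


print([t["name"] for t in car_types_for_shop(car_shops[0])])
print(shops_selling("t3"))
```

The 100 features of a car type are stored once, and each shop carries only a short list of ids.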
Coming from a HBase/BigTable point of view, typically you would completely denormalize your data, and use a "list" field, or multidimensional map column (see this link for a better description).
The word "column" is another loaded word like "table" and "base" which carries the emotional baggage of years of RDBMS experience.
Instead, I find it easier to think about this like a multidimensional map - a map of maps if you will.
For your example of a many-to-many relationship, you can still create two tables, and use your multidimensional map column to hold the relationship between the tables.
See the FAQ question 20 in the Hadoop/HBase FAQ:
Q: [Michael Dagaev] How would you design an Hbase table for many-to-many association between two entities, for example Student and Course?
I would define two tables:
Student: student id, student data (name, address, ...), courses (use course ids as column qualifiers here)
Course: course id, course data (name, syllabus, ...), students (use student ids as column qualifiers here)
Does it make sense?
A: [Jonathan Gray] Your design does make sense. As you said, you'd probably have two column-families in each of the Student and Course tables. One for the data, another with a column per student or course. For example, a student row might look like:
Student:
id/row/key = 1001
data:name = Student Name
data:address = 123 ABC St
courses:2001 = (If you need more information about this association, for example, if they are on the waiting list)
courses:2002 = ...
This schema gives you fast access to the queries, show all classes for a student (student table, courses family), or all students for a class (courses table, students family).
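The "map of maps" view of the Student and Course tables quoted above can be mimicked with nested Python dictionaries: row key, then column family, then column qualifier. This is only a mental model, not HBase client code; the course row, the sample values, and the enroll helper are invented for illustration.

```python
# row key -> column family -> column qualifier -> value
student_table = {
    "1001": {
        "data": {"name": "Student Name", "address": "123 ABC St"},
        "courses": {},
    },
}
course_table = {
    "2001": {
        "data": {"name": "Course Name", "syllabus": "..."},
        "students": {},
    },
}


def enroll(student_id, course_id, note=""):
    """Record the association on both rows; the value can carry extra info."""
    student_table[student_id]["courses"][course_id] = note
    course_table[course_id]["students"][student_id] = note


def courses_of(student_id):
    """All classes for a student: read one row, one column family."""
    return sorted(student_table[student_id]["courses"])


def students_of(course_id):
    """All students for a class: read one row, one column family."""
    return sorted(course_table[course_id]["students"])


enroll("1001", "2001", note="waiting list")
print(courses_of("1001"))
print(students_of("2001"))
```

Both directions of the many-to-many relationship are answered by reading a single row, at the price of writing the association twice.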
In a relational database, the concept is very clear: one table for cars with columns like "car_id, car_type, car_name, car_price", and another table for shops with columns "shop_id, car_id, shop_name, sale_count". The "car_id" column links the two tables together for data operations. All the columns must be well defined when creating the database.
NoSQL database systems do not require you to pre-define these columns and tables. You just construct your records in a certain format, say JSON, like:
{"car": {"id": 1, "type": "auto", "name": "ford"}, "shop": {"id": 100, "name": "some_shop"}},
{"car": {"id": 2, "type": "auto", "name": "benz"}, "shop": {"id": 105, "name": "my_shop"}},
.....
After your system is online and in service, you may find flaws in your database design and want to add an "employee" field to "shop" for future records. Your new records then look like:
{"car": {"id": 3, "type": "auto", "name": "RR"}, "shop": {"id": 108, "name": "other_shop", "employee": "Bill"}},
NoSQL systems allow this without any schema change, whereas a relational database would first need a schema migration (for example, ALTER TABLE ... ADD COLUMN).