Thinking NoSQL on reference data - MongoDB

I'm trying out NoSQL, and while exploring it I can't get my head around how to deal with reference data. (I'm used to traditional, tabular databases.) Say I have a School entity which has Students and Requirements. A Student can be enrolled in a School and may comply with the Requirements later. So the School would look up a Student and check which Requirements he has complied with.
On a traditional database, I would do something like this:
+---------+  +---------------+  +--------------------+  +---------+
| School  |  | Requirement   |  | StudentRequirement |  | Student |
+---------+  +---------------+  +--------------------+  +---------+
| Id (PK) |  | Id (PK)       |  | Id (PK)            |  | Id (PK) |
| Name    |  | Name          |  | StudentId (FK)     |  | Name    |
+---------+  | SchoolId (FK) |  | RequirementId (FK) |  +---------+
             +---------------+  | HasComply          |
                                +--------------------+
I would create 4 entities, and the Requirement would have a many-to-many relationship with the Student. So whether I edit or remove a Requirement, I can just look at the intermediary table.
A flow would look something like:
// EnrollStudentToASchool
// AssignAllRequirementsToNewStudent
Then somewhere in my code, if a new Requirement is created:
// IfNewRequirement
// AddNewRequirementToStudent
Now to NoSQL; in my case I'm using MongoDB, a document data store. I read somewhere that data should be embedded inline. Something like:
{
    Id: 1,
    School: 'Top1 Foo School',
    Requirements: [
        { Id: 1, Name: 'Req1' },
        { Id: 2, Name: 'Req2' }
    ],
    Students: [
        {
            Id: 1,
            Name: 'Student1',
            Requirements: [
                { Id: 1, Name: 'Req1', HasComply: false },
                { Id: 2, Name: 'Req2', HasComply: true }
            ]
        }
    ]
},
{
    Id: 2,
    School: 'Top1 Bar School',
    Requirements: [],
    Students: []
}
The root of my document is the School, with the same flow as above:
// EnrollStudentToASchool
// AssignAllRequirementsToNewStudent
// IfNewRequirement
// AddNewRequirementToStudent
But what if, say, the School decides to edit the name of a Requirement, or to remove one?
How should that be done? Should I loop over all my Students and edit/remove the Requirements? Or maybe I'm doing it all wrong.
Please advise.

This is a nice use case.
Your example brings up most of the relevant pros and cons of converting from SQL to NoSQL.
First, please see the proposed collection design:
We have two collections, school and student. Why? We need to keep the BSON document size limit (16 MB) in mind, and if a school has a good number of students, the document could grow past that size.
So why do we duplicate data in every student record? Because if we want a student's details, we don't need to go to the school document (no extra round trip).
We have an array of requirements to fulfil in the school (a kind of master copy), and then every student has their own array with the results.
Adding/removing such data requires iterating over all students and the school.
In simple words: no joins on daily display operations, hence efficient reads, but an update generates a bit more load than in SQL.
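To make that update cost concrete, here is a minimal in-memory sketch (plain JavaScript objects, not MongoDB driver calls) of what renaming a requirement involves with this design: the master entry on the school document AND the copy embedded in every student must be updated.

```javascript
// Rename a requirement everywhere it is duplicated.
function renameRequirement(school, students, reqId, newName) {
  // 1. Update the master copy held on the school document.
  for (const req of school.Requirements) {
    if (req.Id === reqId) req.Name = newName;
  }
  // 2. Iterate over all students and update the duplicated entries.
  for (const student of students) {
    for (const req of student.Requirements) {
      if (req.Id === reqId) req.Name = newName;
    }
  }
}

const school = { Id: 1, School: 'Top1 Foo School', Requirements: [{ Id: 1, Name: 'Req1' }] };
const students = [
  { Id: 1, Name: 'Student1', Requirements: [{ Id: 1, Name: 'Req1', HasComply: false }] }
];

renameRequirement(school, students, 1, 'Req1 v2');
console.log(students[0].Requirements[0].Name); // "Req1 v2"
```

With the real driver this becomes two statements, e.g. one update on the school collection and one `updateMany` with the `$` positional operator on the student collection; removal works the same way with `$pull`. The exact syntax depends on your schema and driver version.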
Any comments welcome!

Related

Postgres SQL query, that will group fields in nested JSON objects

I need a SQL query in Postgres that produces JSON with grouped/inherited data; see the example below.
Given a table "issues" with the following example data:
+--------------------------------------+-------+------------+-----------------------+
| product_id                           | level | typology   | comment               |
+--------------------------------------+-------+------------+-----------------------+
| e1227f18-0c1f-4ebb-8cbf-a09c74ba14f5 |     1 | electronic | LED broken            |
| e1227f18-0c1f-4ebb-8cbf-a09c74ba14f5 |     1 | mechanical | missing gear          |
| e1227f18-0c1f-4ebb-8cbf-a09c74ba14f5 |     1 | mechanical | cover damaged         |
| e1227f18-0c1f-4ebb-8cbf-a09c74ba14f5 |     2 | electric   | switch wrong color    |
| e1227f18-0c1f-4ebb-8cbf-a09c74ba14f5 |     2 | mechanical | missing o-ring        |
| e1227f18-0c1f-4ebb-8cbf-a09c74ba14f5 |     2 | electric   | plug wrong type       |
| 3567ae01-c7b3-4cd7-9e4f-85730aab89ee |     1 | mechanical | gear wrong dimensions |
+--------------------------------------+-------+------------+-----------------------+
product_id, typology and comment are strings; level is an integer.
I want to obtain this JSON:
{
    "e1227f18-0c1f-4ebb-8cbf-a09c74ba14f5": {
        "1": {
            "electronic": [ "LED broken" ],
            "mechanical": [ "missing gear", "cover damaged" ]
        },
        "2": {
            "electronic": [ "switch wrong color", "plug wrong type" ],
            "mechanical": [ "missing o-ring" ]
        }
    },
    "3567ae01-c7b3-4cd7-9e4f-85730aab89ee": {
        "1": {
            "mechanical": [ "gear wrong dimensions" ]
        }
    }
}
So I began to write a query like this:
SELECT array_to_json(array_agg(json_build_object(
    product_id, json_build_object(
        level, json_build_object(
            typology, comment
        )
    )
))) FROM issues
but I couldn't figure out how to group/aggregate to obtain the wanted JSON.
Step-by-step demo: db<>fiddle
SELECT
    jsonb_object_agg(key, value)
FROM (
    SELECT
        jsonb_build_object(product_id, jsonb_object_agg(key, value)) AS products
    FROM (
        SELECT
            product_id,
            jsonb_build_object(level, jsonb_object_agg(key, value)) AS level
        FROM (
            SELECT
                product_id,
                level,
                jsonb_build_object(typology, jsonb_agg(comment)) AS typology
            FROM
                issues
            GROUP BY product_id, level, typology
        ) s,
        jsonb_each(typology)
        GROUP BY product_id, level
    ) s,
    jsonb_each(level)
    GROUP BY product_id
) s,
jsonb_each(products)
jsonb_agg() aggregates values into one JSON array. This has been done with the comments.
After that comes a more complicated step: aggregating two different JSON objects into one object (simplified demo: db<>fiddle).
First you need to expand the elements into a key and a value column using jsonb_each(). Then you are able to aggregate these two columns using the aggregate function jsonb_object_agg().
This is why the following steps look somewhat difficult: every level of aggregation (level and product_id) needs these steps, because you want to merge the elements into single non-array JSON objects.
Because every single aggregation needs its own GROUP BY clause, every step is done in its own subquery.
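To make the target structure concrete, here is a sketch in JavaScript of the same nesting the query builds: group the flat rows by product_id, then by level, then by typology, collecting the comments into arrays. The field names follow the example table above; the sample rows are a subset of it.

```javascript
// Build the nested {product_id: {level: {typology: [comments]}}} object.
function nestIssues(rows) {
  const result = {};
  for (const { product_id, level, typology, comment } of rows) {
    const byLevel = result[product_id] ??= {};     // per-product object
    const byTypology = byLevel[level] ??= {};      // per-level object
    (byTypology[typology] ??= []).push(comment);   // per-typology array
  }
  return result;
}

const rows = [
  { product_id: 'e1227f18-0c1f-4ebb-8cbf-a09c74ba14f5', level: 1, typology: 'electronic', comment: 'LED broken' },
  { product_id: 'e1227f18-0c1f-4ebb-8cbf-a09c74ba14f5', level: 1, typology: 'mechanical', comment: 'missing gear' }
];

console.log(JSON.stringify(nestIssues(rows)));
// {"e1227f18-0c1f-4ebb-8cbf-a09c74ba14f5":{"1":{"electronic":["LED broken"],"mechanical":["missing gear"]}}}
```

In SQL each of those three nesting steps corresponds to one subquery with its own GROUP BY, which is exactly why the query above has three layers.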

Two concurrent statements writing to database

+----+----------+-------------+
| id | title    | description |
+----+----------+-------------+
|  1 | The King | Jonh X      |
+----+----------+-------------+
Two concurrent statements:
update book set title = 'aaa', description = 'aaa' where id = 1
update book set title = 'bbb', description = 'bbb' where id = 1
Is the following result theoretically possible?
+----+-------+-------------+
| id | title | description |
+----+-------+-------------+
|  1 | aaa   | bbb         |
+----+-------+-------------+
Or could a concurrent read observe a half-applied update?
update book set title = 'aaa', description = 'aaa' where id = 1
select title, description from book -> (The King, aaa)?
These statements are not wrapped in a transaction.
What about popular database systems like SQL Server, Postgres?
Generally impossible in any ACID-compliant database.
ACID stands for atomicity, consistency, isolation, durability.
In particular, Postgres takes a write-lock on affected rows before the UPDATE and does not release it until the end of the transaction. (And every UPDATE runs inside a transaction, implicitly or explicitly.) Concurrent transactions trying to write to the same row must wait and re-evaluate filters once the lock is released. They may then change the row once more - or come up empty if the filters do not apply any more.
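The locking behaviour can be sketched in plain JavaScript (this is a toy model, not a real database): each UPDATE holds the row lock for the whole statement, so both columns are written together before the next writer gets the lock, and a mixed row like (aaa, bbb) can never survive.

```javascript
// The example row from the question.
function makeRow() {
  return { id: 1, title: 'The King', description: 'Jonh X' };
}

// An atomic "UPDATE ... SET title = v, description = v": while the lock is
// held, no other writer can interleave, so both columns change together.
function atomicUpdate(row, value) {
  row.title = value;
  row.description = value;
}

// Whichever order the two statements are serialized in, the surviving row
// always has matching columns, never a mix of the two updates.
for (const order of [['aaa', 'bbb'], ['bbb', 'aaa']]) {
  const row = makeRow();
  for (const value of order) atomicUpdate(row, value);
  console.log(row.title === row.description); // true
}
```

The serialization order is not guaranteed (either update may win), but each outcome is one complete statement's result.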

How do I add additional information to a Pivot (using Fluent)?

In Vapor, we can create many-to-many relationships by creating a Pivot<U, T> object, where U and T are the models that we want to link together. So if I want to create a system where Users can have many Files and many Files can belong to many Users, I'd associate them like this:
var alice = User(name: "Alice")
try! alice.save()
var sales = File(name: "sales.xclx")
try! sales.save()
var pivot = Pivot<User, File>(alice, sales)
try! pivot.save()
What I can't figure out for the life of me is how to make a Pivot<User, File> contain additional information. For example, I'd like to know when this file was associated with Alice, or what permissions she has over it.
On a relational database, Fluent creates this table for the Pivot<User, File> type:
+---------+---------+------+-----+---------+----------------+
| Field   | Type    | Null | Key | Default | Extra          |
+---------+---------+------+-----+---------+----------------+
| id      | int(11) | NO   | PRI | NULL    | auto_increment |
| file_id | int(11) | NO   |     | NULL    |                |
| user_id | int(11) | NO   |     | NULL    |                |
+---------+---------+------+-----+---------+----------------+
But I'd like the ability to represent something like this:
+---------+---------+------+-----+---------+----------------+
| Field   | Type    | Null | Key | Default | Extra          |
+---------+---------+------+-----+---------+----------------+
| id      | int(11) | NO   | PRI | NULL    | auto_increment |
| file_id | int(11) | NO   |     | NULL    |                |
| user_id | int(11) | NO   |     | NULL    |                |
| date    | date    | NO   |     | NULL    |                |
| perms   | varchar | NO   |     | READ    |                |
+---------+---------+------+-----+---------+----------------+
The Pivot<U, T> object can be thought of as the "bare minimum" required fields for a pivoted relation like siblings.
If you want to add custom fields to this table, you can create your own class to act as the pivot as long as it has the required elements:
Table name for Foo and Bar is bar_foo (lowercase, alphabetically ordered)
There exist at least the three columns: id, bar_id, foo_id
In other words, the table created by your pivot class must have at least the elements a Pivot<Foo, Bar> preparation would have created.
With this done, you can create new pivot relations by creating and saving instances of your pivot class.
When .siblings() relations are called on your models that use this pivot table, the default Pivot<U, T> will still be created to perform the fetch. But, this won't create any issues since the required fields are present on the pivot table.
So, after having the same problem described by Andy and asking for a solution on the Vapor Slack, I was redirected here.
My implementation (using PostgreSQL) of the solution proposed by Tanner can be found here
The key is the Rating model:
it’s a plain Model subclass
it has an entity name of movie_user (as described by Tanner the names of the relating models in alphabetical order)
it has the fields userId (mapping to "user_id") and movieId (mapping to "movie_id"), both are of type Node.
in prepare(Database) it again uses the name "movie_user" and defines the Id fields as Ints.
With that set up you can define the following relationship convenience methods:
On Movie: all raters
extension Movie {
func raters() throws -> Siblings<User> {
return try siblings()
}
}
On User: all rated movies
extension User {
func ratedMovies() throws -> Siblings<Movie> {
return try siblings()
}
}
A new rating for a movie (for a user) can be added like this:
ratings.post() { request in
    var rating = try Rating(node: request.json)
    try rating.save()
    return rating
}
As Rating is a Model subclass, we can create it directly from the request's JSON. This requires the client to send a JSON document that conforms to the node structure of the Rating class:
{
    "user_id": <the rating user's id>,
    "movie_id": <the id of the movie to be rated>,
    "stars": <1-5 or whatever makes sense for your use case>
}
To get all actual Ratings for a given movie, you seem to have to resolve the relationship manually (at least I think so, maybe somebody can give me a hint on how to do this better):
let ratings = try Rating.query().filter("movie_id", movieId).all()
Also, there seems to be no way of calculating an average on the database right now. I would have loved something like this to work:
// this is NOT how it works
let averageRating = try Rating.query().filter("movie_id", movieId).average("stars")
So I hope this helps anybody coming across this problem. And thanks to all the wonderful people who contribute to the Vapor project!
Thanks to #WERUreo for pointing out that the part where a rating is created was missing.

Forum like data structure: NoSQL appropriate?

I'm trying to save data which has a "forum like" structure:
This is the simplified data model:
+---------------+
| Forum         |
|               |
| Name          |
| Category      |
| URL           |
|               |
+---------------+
        |1
        |n
+---------------+
|               |
| Thread        |
|               |
| ID            |
| Name          |
| Author        |
| Creation Date |
| URL           |
|               |
+---------------+
        |1
        |n
+---------------+
|               |
| Post          |
|               |
| Creation Date |
| Links         |
| Images        |
|               |
+---------------+
I have multiple forums/boards. Each can have several threads, and a thread can contain n posts (I'm only interested in the links, images and creation dates a thread contains, for data analysis purposes).
I'm looking for the right technology for saving and reading data in a structure like this.
While I was using SQL databases heavily in the past, I also had some NoSQL projects (primarily document based with MongoDB).
I'm sure MongoDB is excellent for STORING data in such a structure (Forum is a document, Threads are subdocuments, and Posts are subdocuments of Threads). But what about reading it? I have the following use cases:
List all posts from a forum with a specific Category
Find a specific link in a Post in all datasets/documents
Which technology is best for those use cases?
Please find my draft solution below. I have considered MongoDB for this design.
Post collection:
"image" should be stored separately in GridFS, as MongoDB documents have a maximum size of 16 MB. You can store the ObjectId of the image in the Post collection.
{
    "_id" : ObjectId("57b6f7d78f19ac1e1fcec7b5"),
    "createdate" : ISODate("2013-03-16T02:50:27.877Z"),
    "links" : "google.com",
    "image" : ObjectId("5143ddf3bcf1bf4ab37d9c6e"),
    "thread" : [
        {
            "id" : ObjectId("5143ddf3bcf1bf4ab37d9c6e"),
            "name" : "Sam",
            "author" : "Sam",
            "createdate" : ISODate("2013-03-16T02:50:27.877Z"),
            "url" : "https://www.wikipedia.org/"
        }
    ],
    "forum" : [
        {
            "name" : "Andy",
            "category" : "technology",
            "url" : "https://www.infoq.com/"
        }
    ]
}
In order to access the data by category, you can create an index on "forum.category" field.
db.post.createIndex( { "forum.category": 1 } )
In order to access the data by links, you can create an index on the "links" field.
db.post.createIndex( { "links": 1 } )
Please note that the indexes are not mandatory; you can query the data without them as well. Create the indexes if you need better read performance.
I have seen applications using MongoDB for similar use case as yours. You can go ahead with MongoDB for the above mentioned use cases (or access patterns).
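The two use cases can be sketched in plain JavaScript against documents shaped like the draft Post collection above (an in-memory sketch, not driver code; the second sample document and its values are made up for illustration):

```javascript
const posts = [
  { links: 'google.com',
    forum: [{ name: 'Andy', category: 'technology', url: 'https://www.infoq.com/' }] },
  { links: 'example.org',
    forum: [{ name: 'Bea', category: 'cooking', url: 'https://example.org/' }] }
];

// Use case 1: all posts from a forum with a specific category. In MongoDB a
// query on "forum.category" matches inside the embedded array, i.e. roughly
// db.post.find({ "forum.category": "technology" }), served by the first index.
const techPosts = posts.filter(p => p.forum.some(f => f.category === 'technology'));

// Use case 2: find a specific link across all documents, i.e. roughly
// db.post.find({ links: 'google.com' }), served by the index on "links".
const withLink = posts.filter(p => p.links === 'google.com');

console.log(techPosts.length, withLink.length); // 1 1
```

Both reads touch a single collection, which is the main payoff of embedding the forum and thread details in each post.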

Zend DB inserting relational data

I've been using the Zend Framework database relationships for a couple of weeks now. My first impression is pretty good, but I do have a question about inserting related data into multiple tables. For a little test application, I've related two tables to each other using a fuse table.
+---------------+  +---------------+  +---------------+
| Pages         |  | Fuse          |  | Prints        |
+---------------+  +---------------+  +---------------+
| pageid        |  | fuseid        |  | printid       |
| page_active   |  | fuse_page     |  | print_title   |
| page_author   |  | fuse_print    |  | print_content |
| page_created  |  | fuse_locale   |  | ...           |
| ...           |  | ...           |  +---------------+
+---------------+  +---------------+
Above is an example of my DB architecture.
Now, my problem is how to insert related data into two separate tables and insert the two newly created IDs into the fuse table at the same time. If someone could explain it or point me to a related tutorial, I would appreciate it!
I assume you have separate models for each table. Then simply insert a row into the Prints table and store the returned ID in a variable. Then insert into the Pages table and store the returned ID in another variable. Finally, insert the data into your Fuse table. You do not need any "at the same time" (atomic) operation here. The ID of a newly inserted row is returned by insert() (I assume you use auto-increment fields for the primary keys).
$printsModel = new Application_Model_Prints();
$pagesModel  = new Application_Model_Pages();
$fuseModel   = new Application_Model_Fuse();

$printData = array('print_title' => 'foo',
                   ...);
$printId = $printsModel->insert($printData);

$pagesData = array('page_author' => 'bar',
                   ...);
$pageId = $pagesModel->insert($pagesData);

$fuseData = array('fuse_page'  => $pageId,
                  'fuse_print' => $printId,
                  ...);
$fuseId = $fuseModel->insert($fuseData);
This is pseudo code, so you may want to move the inserts into your models, do some normalisation, etc.
I also suggest paying more attention to the field naming convention. It usually helps; right now you have fuseid but also fuse_page, so it should be either fuse_id or fusepage (not to mention that I suspect this field stores an id, so it would be fuse_page_id or fusepageid).
Prints and Pages are two entities. Create row classes for each:
class Model_Page extends Zend_Db_Table_Row_Abstract
{
    public function addPrint($print)
    {
        $fuseTb = new Table_Fuse();
        $fuse = $fuseTb->createRow();
        $fuse->fuse_page = $this->pageid;
        $fuse->fuse_print = $print->printid;
        $fuse->save();
        return $fuse;
    }
}
Now, when you create a page:
$page = $pageTb->createRow(); // instance of Model_Page is returned
$page->addPrint($printTb->find(1)->current());