How to decide whether to use an RDBMS, Doc/Obj ODBMS or Graph?

What I intend to design basically boils down to a list of users, organisations, events, addresses and comments, which could quite easily be maintained in an RDBMS such as MySQL. However, if the project takes off I want to add another aspect: resources, i.e. files, videos, images etc., which can belong to a user, an organisation or an event. This immediately raises the question of whether to use an RDBMS and store a reference to an external file in a table related to each of the categories previously mentioned, or whether to use a Doc/Obj ODBMS such as MongoDB to store these items.
But I also want to be able to link users, organisations and events. For example: User A owns Org 1 and Org 2. User B owns Org 3 and Org 4. User C owns Org 5. Org 1 has an Event X, held at Addr M on Date R, which Org 3 will also be at. User C intends to attend Event X. Org 2 also has an Event Y at Addr M but on Date T, and so on. As such, I suspect that a Graph DBMS such as OrientDB would be the best solution. Either that, or I would have a lot of tables in an RDBMS with a lot of joins, and potentially a lot of queries, or a very strange structure in a Doc/Obj DBMS.
I've looked at InfoGrid, which is a Graph database that can connect to MySQL, which could be a potential way to skin this cat. Has anybody else attempted anything like this? What are your thoughts on how to implement such a system, which needs to be scalable? Suggestions are greatly appreciated.

Your description lends itself to a relational model. For this particular setup, an RDBMS is the proper way to go.
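To make that concrete, here is a minimal sketch of how the relationships from the question could be laid out relationally. All table and column names are assumptions made for illustration, addresses are kept as plain text for brevity, and SQLite (via Python's sqlite3) merely stands in for MySQL. Resources are stored as a path to an external file with exactly one owner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE organisations (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    owner_id INTEGER NOT NULL REFERENCES users(id)
);
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    address TEXT NOT NULL,
    held_on TEXT NOT NULL,
    host_org_id INTEGER NOT NULL REFERENCES organisations(id)
);
-- m:n: other organisations that will be present at an event
CREATE TABLE event_organisations (
    event_id INTEGER NOT NULL REFERENCES events(id),
    org_id INTEGER NOT NULL REFERENCES organisations(id),
    PRIMARY KEY (event_id, org_id)
);
-- m:n: users who intend to attend an event
CREATE TABLE event_attendees (
    event_id INTEGER NOT NULL REFERENCES events(id),
    user_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (event_id, user_id)
);
-- a resource is a reference to an external file with exactly one owner
CREATE TABLE resources (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    user_id INTEGER REFERENCES users(id),
    org_id INTEGER REFERENCES organisations(id),
    event_id INTEGER REFERENCES events(id),
    CHECK ((user_id IS NOT NULL) + (org_id IS NOT NULL) + (event_id IS NOT NULL) = 1)
);
""")

# The example data from the question.
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "User A"), (2, "User B"), (3, "User C")])
conn.executemany("INSERT INTO organisations VALUES (?, ?, ?)",
                 [(1, "Org 1", 1), (2, "Org 2", 1), (3, "Org 3", 2),
                  (4, "Org 4", 2), (5, "Org 5", 3)])
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)",
                 [(1, "Event X", "Addr M", "Date R", 1),
                  (2, "Event Y", "Addr M", "Date T", 2)])
conn.execute("INSERT INTO event_organisations VALUES (1, 3)")  # Org 3 at Event X
conn.execute("INSERT INTO event_attendees VALUES (1, 3)")      # User C attends Event X


def orgs_at(event_name):
    """Host organisation plus every other organisation present at the event."""
    rows = conn.execute("""
        SELECT o.name FROM events e
        JOIN organisations o ON o.id = e.host_org_id
        WHERE e.name = ?
        UNION
        SELECT o.name FROM events e
        JOIN event_organisations eo ON eo.event_id = e.id
        JOIN organisations o ON o.id = eo.org_id
        WHERE e.name = ?
        ORDER BY 1
    """, (event_name, event_name)).fetchall()
    return [r[0] for r in rows]


def attendees_of(event_name):
    """Users who intend to attend the event."""
    rows = conn.execute("""
        SELECT u.name FROM events e
        JOIN event_attendees ea ON ea.event_id = e.id
        JOIN users u ON u.id = ea.user_id
        WHERE e.name = ?
        ORDER BY u.name
    """, (event_name,)).fetchall()
    return [r[0] for r in rows]
```

Two junction tables and a couple of joins cover the linking described in the question, so the number of tables stays small.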

Related

TYPO3 backend workflow when avoiding the storage of data in intermediate table

I have a situation as described in the ExtbaseFluid book:
I would like to store information in the intermediate table which is not recommended at all.
Here is a quote from the warning box of the above linked book chapter:
Do not store data in the Intermediate Table that concern the Domain. Though TYPO3 supports this (especially in combination with Inline Relational Record Editing (IRRE)), it is always a sign that further improvements can be made to your Domain Model. Intermediate Tables are and should always be tools for storing relationships and nothing else.
Let’s say you want to store a CD with its containing music tracks: CD -- m:n (Intermediate Table) -- Song. The track number may be stored in a field of the Intermediate Table. However, the track should be stored as a separate domain object, and the connection be realized as CD -- 1:n -- Track -- n:1 -- Song.
So I would rather not do what is not recommended. But thinking about the editor's workflow that results from the recommended solution raises a few questions for me.
To stay with this example I would need the following tables:
tx_extname_domain_model_cd
tx_extname_domain_model_cd_track_mm
tx_extname_domain_model_track (which holds the track number)
tx_extname_domain_model_track_song_mm
tx_extname_domain_model_song
From what I know, the editor would then need to create the following records:
one record for the cd
one record for the song
now the editor can create one record for the track.
There the track number is added.
Furthermore the cd record needs to be assigned as well as the song.
So here are my questions:
I guess this workflow cannot be improved with some (to me unknown) TCA setup?
An editor cannot directly reach the song when the cd record is opened?
Instead first she / he has to open the track record and can from there navigate to the song?
Is it really that bad to store data in the intermediate table? The TYPO3 table sys_file_reference does the same! But I wonder how that data could be shown, because IRRE is not an option: it should only be used for 1:n relations (source).
The question you have to ask yourself is: do I want to code by the book, or do I want a pragmatic approach to solving a customer's problem?
In this specific case, the additional problem is that the people who originally invented Extbase took a quite sophisticated and academic approach, but when it came to pragmatic use and performance, they were blocked by their own rules and stuck with coding by the book.
This example and its warning message in particular show a way of thinking that was one of the reasons why I never actually used Extbase but went for Core API methods to create performant and pragmatic queries that return the desired result sets. Now that we've got Doctrine under the hood, this works like a charm even with quite exotic DB flavors.
Of course intermediate tables are a good idea, and of course those intermediate tables can and should be enriched with additional data fields that do not require a 3rd, 4th or nth table to store, e.g. a simple set of dropdown options, since this can easily be handled with attributes configured in TCA, as shown here: https://docs.typo3.org/m/typo3/reference-tca/master/en-us/ColumnsConfig/Type/Inline/Examples.html
sys_file_reference is the most prominent example since it provides exactly that kind of additional information that should not be pumped into additional tables - and guess what, the TYPO3 core does not make use of a single line of Extbase code to deal with that data or almost any other data of the core tables.
To answer your last question: Take a look at the good old IRRE Tutorial to get a clue how to do m:n connections with intermediate inline tables.
https://docs.typo3.org/typo3cms/extensions/irre_tutorial/0.4.0/Manual/Index.html#intermediate-tables-for-m-n-relations
It depends on the issue: sometimes the intermediate table is an entity, sometimes not. In this example the intermediate table is the track, which would contain: [uid, cd, song, track_no, ... (whatever else is needed to define the track)]
Be careful when you define your data that you do not make it too advanced.
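As a database-level illustration only (this is not TCA or Extbase configuration), the track-as-entity layout could look roughly like the sketch below. The short table names are hypothetical stand-ins for the tx_extname_domain_model_* tables, and SQLite is used just to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cd   (uid INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE song (uid INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- The "intermediate table" is the track itself: one row per (cd, song)
-- pairing, carrying the track number as its own data.
CREATE TABLE track (
    uid      INTEGER PRIMARY KEY,
    cd       INTEGER NOT NULL REFERENCES cd(uid),
    song     INTEGER NOT NULL REFERENCES song(uid),
    track_no INTEGER NOT NULL,
    UNIQUE (cd, track_no)
);
INSERT INTO cd VALUES (1, 'Example CD');
INSERT INTO song VALUES (1, 'Song A'), (2, 'Song B');
INSERT INTO track (cd, song, track_no) VALUES (1, 2, 1), (1, 1, 2);
""")


def track_list(cd_uid):
    """Return (track_no, song title) pairs for one CD, in playing order."""
    return conn.execute(
        "SELECT t.track_no, s.title FROM track t "
        "JOIN song s ON s.uid = t.song "
        "WHERE t.cd = ? ORDER BY t.track_no",
        (cd_uid,),
    ).fetchall()
```

With this layout a single join through track gets from the CD to its songs, so no further mm tables are needed between cd, track and song.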

Which method of storing USERS, ROLES & TEAMS in my relational DB is most efficient

I'm working on developing an app as part of my college assignment. It's a project management app, and I'm having trouble deciding the best way to store users and teams in my Postgres DB. Basically, users can sign up and create/join teams. A user can be a part of multiple teams (each working on multiple projects). Users also have roles in teams (with varying permissions according to the role) and while they have only one role in a given team, they may have a different role in another one. In addition, users can mark some of their teams as favorites for easy access through the front-end.
I've come up with 3 ERDs to solve this.
First, store all users in one table and all teams in another. The users table has all the data pertaining to a user, while the team table has the team data along with the members, roles and whether or not a user has marked this team as a favorite - like below.
This will have a lot of data duplication - if a team has a hundred members, there will be 100 entries where teamid, name, description are the same.
So, in v2 I separated them and added a members table. Now, each team is saved once, and so is each user. A reference to the team and user is made each time a user joins/creates a team and is stored in the members table along with the user's role and whether or not they have favorited the team.
But, I thought it might be bad to save roles as a string. If roles ever need to be changed/updated or I need to add new roles/rename roles, it would be easier with an ID rather than a string (I think).
So, then I came up with this.
Now all roles, users and teams are stored once (it's possible that I've made the roles table into something like a lookup table, which I've heard is a bad practice). All these can be referenced in the members table.
My DBMS concepts are a little weak, though I have tried my best to follow the steps to normalize it and bring it into BCNF. But I'm still unsure if I've done this right, or what to fix if something is wrong.
So essentially, I would like to know:
Is my table structure correct or incorrect?
Should everything be split into multiple tables, or is some data duplication okay (since I can use multiple or creative queries to get whatever I need)?
I like your ERD3 best. I don't think it is overkill, I think it looks fine. Having a "members" table be mostly foreign keys into other tables is a common thing.
It is not necessary to eliminate every trace of commonality in every table - sometimes it is more efficient to put up with a small amount of duplication - but in your example I think your ERD3 looks good.
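For reference, a rough sketch of what an ERD3-style layout could look like follows. The diagrams are not reproduced here, so every column name is an assumption, and SQLite stands in for Postgres to keep the example self-contained. The composite primary key on members is what enforces "only one role in a given team".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT NOT NULL, description TEXT);
CREATE TABLE roles (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Mostly foreign keys: one row per (user, team) membership.
CREATE TABLE members (
    user_id     INTEGER NOT NULL REFERENCES users(id),
    team_id     INTEGER NOT NULL REFERENCES teams(id),
    role_id     INTEGER NOT NULL REFERENCES roles(id),
    is_favorite INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id, team_id)
);
INSERT INTO users VALUES (1, 'alice');
INSERT INTO teams VALUES (1, 'Alpha', 'first team'), (2, 'Beta', 'second team');
INSERT INTO roles VALUES (1, 'admin'), (2, 'viewer');
INSERT INTO members VALUES (1, 1, 1, 1), (1, 2, 2, 0);
""")


def memberships(user_id):
    """Return (team name, role name, is_favorite) for every team of a user."""
    return conn.execute(
        "SELECT t.name, r.name, m.is_favorite FROM members m "
        "JOIN teams t ON t.id = m.team_id "
        "JOIN roles r ON r.id = m.role_id "
        "WHERE m.user_id = ? ORDER BY t.name",
        (user_id,),
    ).fetchall()
```

Renaming a role is then a single UPDATE on roles, and every membership picks up the new name through the join.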

CQRS (event sourcing): Projections with multiple aggregates

I have a question regarding projections involving multiple aggregates on a CQRS architecture.
For the sake of example, suppose I have two aggregates, WorkItem and Developer, and that the following events happen sequentially (but not immediately):
WorkItemCreated (workItemId)
WorkItemTitleChanged (workItemId, title)
DeveloperCreated (developerId)
DeveloperNameChanged (developerId, name)
WorkItemAssigned (workItemId, developerId)
I wish to create a projection which is an "inner join" of developer-workitem:
| WorkItemId | DeveloperId | Title | DeveloperName | ... |
|------------|-------------|--------|---------------|-----|
| 1 | 1 | FixBug | John Doe | ... |
I build my projections incrementally, meaning I load the saved projections from the database and apply the remaining events as they come.
My problem is, the event responsible for creating a row on the projection table is WorkItemAssigned. However, that event does not carry required information from previous events (workitem title, developer name, etc.)
In order to have the required information by the time WorkItemAssigned, I have to load all events from the eventstore, keep states in-memory for all WorkItems and Developers so I have the required information by the time a WorkItemAssigned event arrives.
Sure, I could have a projection for Workitem, another for Developer and query them to retrieve their last states. But it seems like a lot of work, if I am to create projections for each aggregate separately, I might as well create a database view to inner-join them (In fact, that is what I am doing.)
I am not doing all this by hand. I am currently using a good framework called EventFlow, but it doesn't point me to an answer to this question.
This is a question on the fundamentals of CQRS, and I feel I am missing something here.
I don't think you are missing anything. Projecting read models in an event-sourced system presents a different set of problems than querying from a relational model. The problems are not necessarily easier or harder to solve; they are just different.
The good news is that you have a lot of choices. Event Sourcing allows you to project data in any imaginable way, so you can decide on a solution that is most suitable for each individual projection. I guess the "bad" news (I would argue it's not bad news) is that the solution to the problem is not the same every time as it is with a relational system, which is to construct a query using JOINs.
You've already identified a few possible solutions:
Use a relational model as one of your read models
When a certain type of event comes in, re-query the streams that hold the data you need and use them to project on demand
You could also simply hold some data in an interim state (in memory, a document database, the file system, etc.) that allows you to look up the data and project it when needed. So keep lists of updated WorkItems and Developers where they can be read and used whenever a WorkItemAssigned event comes in.
I would say creating a relational database as an interim or permanent read model is a perfectly viable way of solving the problem, assuming you are not trying to achieve massive scalability.
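The interim-state option can be sketched as follows. This is plain Python and not EventFlow's API: the class name and the dict-based event payloads are assumptions for illustration, and the two lookup dictionaries stand in for wherever the interim state is actually kept (memory, a document database, and so on).

```python
class AssignmentProjector:
    """Builds the WorkItem/Developer 'inner join' read model from events."""

    def __init__(self):
        self.titles = {}  # workItemId -> latest title (interim state)
        self.names = {}   # developerId -> latest name (interim state)
        self.rows = {}    # workItemId -> projected row

    def apply(self, event_type, data):
        if event_type == "WorkItemTitleChanged":
            self.titles[data["workItemId"]] = data["title"]
            row = self.rows.get(data["workItemId"])
            if row is not None:  # keep an already projected row current
                row["Title"] = data["title"]
        elif event_type == "DeveloperNameChanged":
            self.names[data["developerId"]] = data["name"]
            for row in self.rows.values():
                if row["DeveloperId"] == data["developerId"]:
                    row["DeveloperName"] = data["name"]
        elif event_type == "WorkItemAssigned":
            wid, did = data["workItemId"], data["developerId"]
            # The event carries only ids; the rest comes from interim state.
            self.rows[wid] = {
                "WorkItemId": wid,
                "DeveloperId": did,
                "Title": self.titles.get(wid),
                "DeveloperName": self.names.get(did),
            }
        # WorkItemCreated / DeveloperCreated carry nothing the row needs.


events = [
    ("WorkItemCreated", {"workItemId": 1}),
    ("WorkItemTitleChanged", {"workItemId": 1, "title": "FixBug"}),
    ("DeveloperCreated", {"developerId": 1}),
    ("DeveloperNameChanged", {"developerId": 1, "name": "John Doe"}),
    ("WorkItemAssigned", {"workItemId": 1, "developerId": 1}),
]

projector = AssignmentProjector()
for event_type, data in events:
    projector.apply(event_type, data)
```

For an incremental projection, the lookup state would have to be saved alongside the projected rows, so that applying the remaining events does not require replaying the whole event store.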

How to implement Associative Rules Analysis or Market Basket Analysis from scratch?

I have gone through numerous articles trying to understand what my first step should be to incorporate associative analysis (maybe Market Basket Analysis) into my system. They all go deep into the implementation of the algorithms, but none of them talk about how to store the data in the first place.
I would really appreciate it if someone could give me some starting pointers or article links to begin with.
The first thing I want to implement is to track user clicks and provide suggestions based on tracked data.
E.g. User clicked on link A and subsequently on link B and link C. I can track this activity with some metadata associated (user, user organization, user role etc.)
I do not want it to be limited to links only. In future, I want to add a number of similar use cases to the system and make it smart. E.g. if a user sets specific values for fields A and B, he/she will most likely set value <bla> for field C.
My system may generate several thousand such data points in a day (E.g. user clicks, field selection etc.).
Below are my questions:
How should I store my data? Should I go SQL or NoSQL? (I briefly looked into MongoDB and it looked promising.)
What tool should I use to perform the associative analysis? Are there any open source tools I can use?
It depends. Is your data suitable for NoSQL databases? To answer this question, it's best to read about the CAP theorem and its case studies: https://en.wikipedia.org/wiki/CAP_theorem or http://robertgreiner.com/2014/06/cap-theorem-explained/
Sometimes you want consistency (depending on your data) and availability, in which case it's better to use a relational database like MySQL. (Try to read case studies and analyse your data to pick the best tools.)
There are a large number of open source libraries, but in my opinion it's better to first read up on some concepts and algorithms. Try searching for the Apriori, ECLAT and FP-Growth algorithms and get the concepts behind them. Then you can pick a tool or write the code yourself. Some useful tools (depending on your programming language):
Python: https://github.com/asaini/Apriori, https://github.com/enaeseth/python-fp-growth, https://github.com/enaeseth/python-fp-growth/blob/master/fp_growth.py
PHP: https://github.com/sigidhanafi/fp-growth-php
JAVA: https://github.com/goodinges/FP-Growth-Java, http://www.philippe-fournier-viger.com/spmf/
Also you can use Spark: https://spark.apache.org/docs/1.1.1/mllib-guide.html
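To get a feel for the concepts before picking a library, the two basic measures (support and confidence) can be computed by brute force. The sketch below is not Apriori, ECLAT or FP-Growth; it simply counts item pairs, which is only practical for small data. It assumes each tracked session has already been reduced to a list of items (clicked links, field values); the function name and the sample sessions are made up.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for item pairs.

    support    = share of baskets containing both items
    confidence = share of baskets with the antecedent that also
                 contain the consequent
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore repeats within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Most confident rules first; ties broken alphabetically.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


sessions = [
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "B", "D"],
    ["C", "D"],
]
rules = pair_rules(sessions, min_support=0.5)
```

On the storage question, input of this shape only needs one record per (session, item) with the metadata attached, which either a relational table or a document collection can hold.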

Some questions about designing on OrientDB

We were looking for the most suitable database for our innovative “collaboration application”. Sorry, we don’t know how to name it in a way generally understood. In fact, highly complicated relationships among tenants, roles, users, tasks and bills need to be handled effectively.
After reading about 5 DBs (PostgreSQL, Mongo, Couch, Arango and Neo4J), when the words “… relationships among things are more important than things themselves” caught my eye, I made up my mind to dig into OrientDB. Both the design philosophy and the innovative features of OrientDB (multi-model, clusters, OO, native graph, full graph API, SQL-like, LiveQuery, multi-master, auditing, simple RID and version number ...) keep intensifying my enthusiasm.
OrientDB enlightens me to re-think and try to model from a totally different viewpoint!
We are now designing the data structure based on OrientDB. However, there are some questions puzzling me.
LINK vs. EDGE
Take the case where a CLIENT may place thousands of ORDERs: how should I choose between LINKs and EDGEs to store the relationships? I prefer EDGEs, but they seem to store thousands of RIDs of ORDERs in the CLIENT record.
Embedded records’ Security
Can an embedded record be authorized independently from its container record?
Record-level Security
How does activating Record-level Security affect the query performance?
I hope I have expressed myself clearly. Any words will be truly appreciated.
LINK vs EDGE
If you don't have properties on your relationship, you can use a link; if you do, use edges. You really need edges if you need to traverse the relationship in both directions, whereas a linklist can only be followed in one direction (just like a hyperlink on the web), but without the overhead of edges. Edges are the right choice if you need to walk through a graph. Edges require more storage space than a linklist. Another difference is that if two vertices are connected through a link, A --> (link) B, and you delete B, the link doesn't disappear: it remains, but points to nothing. It is designed this way because, when you delete a document, finding all the other documents that link to it would mean doing a full scan of the database, which typically takes ages to complete. The Graph API, with its bi-directional links, is specifically designed to resolve this problem, so in general we suggest that customers use it, or else be careful and manage link consistency at the application level.
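The difference in delete behaviour can be pictured with a toy in-memory model. This is not OrientDB code or its API: ToyGraph and all of its methods are invented purely to illustrate why a one-directional link can dangle while a bi-directional edge can be cleaned up cheaply.

```python
class ToyGraph:
    """Toy model: records with one-way links and two-way edges."""

    def __init__(self):
        self.records = {}  # rid -> {"links": set, "out": set, "in": set}

    def add(self, rid):
        self.records[rid] = {"links": set(), "out": set(), "in": set()}

    def add_link(self, src, dst):
        # One direction only: dst knows nothing about src.
        self.records[src]["links"].add(dst)

    def add_edge(self, src, dst):
        # Both ends store a pointer, which costs extra space.
        self.records[src]["out"].add(dst)
        self.records[dst]["in"].add(src)

    def delete(self, rid):
        rec = self.records.pop(rid)
        # Edges: the back-pointers tell us exactly which records to fix.
        for dst in rec["out"]:
            if dst in self.records:
                self.records[dst]["in"].discard(rid)
        for src in rec["in"]:
            if src in self.records:
                self.records[src]["out"].discard(rid)
        # Links pointing at rid are left alone: finding them would mean
        # scanning every record.

    def dangling_links(self, rid):
        """Links of rid whose target record no longer exists."""
        return {t for t in self.records[rid]["links"] if t not in self.records}


g = ToyGraph()
for rid in ("A", "B", "C"):
    g.add(rid)
g.add_link("A", "B")  # A --> (link) B
g.add_edge("A", "C")  # A --(edge)--> C
g.delete("B")
g.delete("C")
```

After both deletes, A still holds a link to the missing B, while its edge to C has been removed.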
RECORD - LEVEL SECURITY
With 1 million vertices and an admin user called Luke, a query like: select from where title = ? with a NOTUNIQUE_HASH_INDEX had an execution time of 0.027 sec.
OrientDB has the concept of users and roles, as well as Record Level Security. It also supports token based authentication, so it's possible to use OrientDB as your primary means of authorizing/authenticating users.
EMBEDDED RECORD'S SECURITY
I've made this example to try to answer your question.
I have this structure:
If I want to access the embedded data, I have to run this command: select prop from User
because if I try to access it through the class that defines the car type, I won't get any result:
select from Car
UPDATE
OrientDB supports that kind of authorization/authentication, but it's a little bit different from your example. For example: if a user A, without admin permission, inserts a record, another user B without admin permission can't see the record inserted by user A. A user can see only the records that he or she has inserted.
Hope it helps