NoSql self join like - mongodb

I want to understand whether the following is possible with NoSQL in any way. I will take flights as an example.
Let's say I have a table or collection of flights with the following info:
...
{ from:XXX, to:YYY, date:01-01-2016 }
{ from:YYY, to:XXX, date:02-02-2016 }
...
I need to be able to perform something like a self join to find the full route:
{from:XXX, to:YYY, outbound:01-01-2016, inbound:02-02-2016}
The table will have a lot of from and to locations.
Is it possible to do it with no relational DB?

Is it possible to do it with no relational DB?
That's the wrong question to ask. The idea of NoSQL is to use specialized data stores for specific problems, instead of attempting to solve every problem with the same tool.
Your use case is unclear, however. Depending on what data you use to query, you could simply do two queries and merge the result(s), or use a simple $or query (in MongoDB) to query for paths back and forth. There are dozens of ways to solve this with all kinds of tools, but it depends on the exact problem you want to solve.
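The "two queries and merge" idea can be sketched in plain Python. This is only an illustration: the `flights` list stands in for the collection, and the names `round_trips`, `outbound` and `inbound` are hypothetical. In a real store, each list comprehension would be one query.

```python
from datetime import date

# Hypothetical in-memory stand-in for the flights collection.
flights = [
    {"from": "XXX", "to": "YYY", "date": date(2016, 1, 1)},
    {"from": "YYY", "to": "XXX", "date": date(2016, 2, 2)},
    {"from": "XXX", "to": "ZZZ", "date": date(2016, 1, 5)},
]


def round_trips(flights, origin, destination):
    """Pair each outbound flight with every later flight in the opposite direction."""
    # "Query 1": flights origin -> destination.
    outbound = [f for f in flights if f["from"] == origin and f["to"] == destination]
    # "Query 2": flights destination -> origin.
    inbound = [f for f in flights if f["from"] == destination and f["to"] == origin]
    # Merge step: keep only return flights that depart after the outbound one.
    return [
        {"from": origin, "to": destination, "outbound": o["date"], "inbound": i["date"]}
        for o in outbound
        for i in inbound
        if i["date"] > o["date"]
    ]


trips = round_trips(flights, "XXX", "YYY")
```

The merge happens in application code, so its cost grows with the number of matching flights in each direction rather than with the size of the whole collection.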
The example of flights is not even a good fit for RDBMSs, because this is usually a routing problem where it might be allowed (or necessary) to combine two or more flights for each direction - neo4j might be the simpler tool for graph problems (note that I'm not saying 'better', because that can mean many things...)

Here is an efficient way to do this in AWS DynamoDB.
Create a table with the following schema:
HashKey: From_City-To_City
RangeKey: Time
So in your case your table would look like:
HashKey RangeKey
XXX-YYY 01-01-2016
YYY-XXX 02-02-2016
Now given a flight from XXX to YYY on 01-01-2016, you can find the return flight by doing a DynamoDB query like this:
HashKey=="YYY-XXX" and RangeKey > "01-01-2016".
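This query should be very efficient because the hash key "YYY-XXX" is looked up directly and the range key is sorted/indexed. Note that the range comparison is only chronological if the date is stored in a sortable format such as ISO 8601 (2016-01-01); DD-MM-YYYY strings do not sort by date. So you can have tons of flight info in your table, but your query execution time should stay (mostly) the same regardless of the growth in table size.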
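To illustrate why this key layout stays fast, here is a minimal Python simulation, not DynamoDB code. The `table` dict and the `return_flights` function are hypothetical, and the dates are assumed to be ISO 8601 strings so that string order matches date order.

```python
from bisect import bisect_right

# Hypothetical stand-in for the table: hash key -> sorted list of range keys.
# Dates are ISO 8601 strings (an assumption) so string order equals date order.
table = {
    "XXX-YYY": ["2016-01-01", "2016-01-20"],
    "YYY-XXX": ["2016-01-15", "2016-02-02", "2016-04-01"],
}


def return_flights(table, origin, destination, outbound_date):
    """Return dates of destination -> origin flights strictly after outbound_date."""
    # Hash-key lookup: jump straight to the reverse route.
    dates = table.get(f"{destination}-{origin}", [])
    # Range-key condition: binary search on the sorted dates.
    start = bisect_right(dates, outbound_date)
    return dates[start:]


result = return_flights(table, "XXX", "YYY", "2016-01-20")
```

The lookup touches only the one route and a binary search within it, which is the property the answer relies on.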


nosql wishlist models - Struggle between reference and embedded document

I have a question about modeling wishlists using mongodb and mongoose. The idea is that a user needs to be able to have many different wishlists, each containing many wishes, with each wish referencing a single article.
Because a wishlist belongs to only a single user, I thought of using an embedded document for that.
The same goes for the wish being embedded in a wishlist.
So I got something like that
var UserSchema = new Schema({
...
wishlists: [wishlistSchema]
...
})
var WishlistSchema = new Schema({
...
wishes: [wishSchema]
...
})
But my question is what to do with the article: should I use a reference, or should I copy the article's data into an embedded document?
If I use an embedded document, I have an update problem. When the article's price changes, updating every wish that references this article becomes a struggle. But accessing those wishes' articles is a piece of cake.
If I use a reference, the update is not a problem anymore, but I have a problem when I filter the wishes on their article's criteria (when I filter the wishes by price, category, etc.).
I think the second way is probably the best, but I don't know if it's possible to build a query that filters the wishes on the article's fields. I tried a lot of things using population, but nothing works very well when you need to populate depending on a nested object field (for example, getting wishes whose article meets certain conditions).
Is this kind of query doable?
Sorry for the long question and for my bad English :/ but any advice would be great!
In my experience in dealing with NoSQL database (mongo, mainly), when designing a collection, do not think of the relations. Instead, think of how you would display, page, and retrieve the documents.
I would prefer embedding and updating multiple schema when there's a change, as opposed to doing a ref, for multiple reasons.
Get would be fast and easy and filter is not a problem (like you've said)
Retrieve operations usually happen a lot more often than updates and with proper indexing, you wouldn't really have to bother about performance.
It leverages NoSQL's schema-less nature, and you'll be less prone to restructuring due to requirement changes (new sorting, new filters, etc.)
Paging would be a lot less of a hassle, and the UI design would not be restricted by paging and limits.
Joining could become expensive. Redundant data might be a hassle to update but it's always better than not being able to display a data in a particular way because your schema is normalized and joining is difficult.
I'd say that the rule of thumb is that only split them when you do not need to display them together. It is not impossible to join them back if you do, but definitely more troublesome.
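As a rough sketch of this trade-off, the following plain-Python example copies article data into each wish. Filtering then needs no join, while a price change has to be fanned out to every copy. The document shapes and the names `update_article_price` and `wishes_under` are assumptions for illustration, not mongoose code.

```python
# Hypothetical user documents with article data copied into each wish.
users = [
    {"name": "ann", "wishlists": [
        {"title": "birthday", "wishes": [
            {"article": {"id": 1, "name": "lamp", "price": 30}},
            {"article": {"id": 2, "name": "book", "price": 12}},
        ]},
    ]},
    {"name": "bob", "wishlists": [
        {"title": "xmas", "wishes": [
            {"article": {"id": 1, "name": "lamp", "price": 30}},
        ]},
    ]},
]


def update_article_price(users, article_id, new_price):
    """Fan the new price out to every embedded copy; return how many changed."""
    changed = 0
    for user in users:
        for wishlist in user["wishlists"]:
            for wish in wishlist["wishes"]:
                if wish["article"]["id"] == article_id:
                    wish["article"]["price"] = new_price
                    changed += 1
    return changed


def wishes_under(user, max_price):
    """Filter on article fields directly, since the data is embedded."""
    return [
        wish["article"]["name"]
        for wishlist in user["wishlists"]
        for wish in wishlist["wishes"]
        if wish["article"]["price"] <= max_price
    ]


changed = update_article_price(users, 1, 25)
cheap = wishes_under(users[0], 25)
```

The read path is a single pass over one document, while the write path has to visit every document that holds a copy. That is the cost this answer accepts in exchange for simpler reads.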

Introduction to object databases

I'm trying to understand the idea of NoSQL databases, or more precisely, the concept behind the neo4j graph database. I have experience with SQL databases (MySQL, MS SQL), but the limitations of managing hierarchical data made me want to expand my knowledge. But now I have some questions and I can't find their answers (maybe I don't know what to search for).
Imagine we have a list of the countries in the world. Each country has its GDP for every year. Each country has its GDP calculated by different sources: the World Bank, its government, the CIA, etc. What's the best way to organise the data in this case?
The simplest thing which came in mind is to have the node (the values are imaginary):
China:
GDPByWorldBank2012: 999,
GDPByCIA2011: 994,
GDPByGovernment2012: 1102,
In relational database, I would split the data in three tables: Countries, Sources and Values, where in Values I would have value of GDP, year, id of the country and id of the source.
Another thing which came to mind is to create nodes for the CIA and the World Bank, but a Government node looks really weird. Even so, the idea is to have relationships (valueOfGDP):
CIA -> valueOfGDP - {year: 2011, value: 994} -> China
World Bank -> valueOfGDP - {year: 2012, value: 999} -> China
This looks pretty weird to me. What is more, what happens when we add the values for all the years from one source? Would we have multiple relationships, or what?
I'm sorry if my questions are too dumb, and I would be happy if someone could explain this to me or show me what book/article to read.
Thanks in advance. :)
Your questions are very legit and you're not the only one having difficulties to grasp graph modelling at first ;)
It is always easier to start thinking about the questions you wanna answer with your data before modelling it up front.
Let's imagine you wanna retrieve the GDP of year 2012 computed by CIA of all countries.
A simple way to achieve this is to label country nodes uniformly and give each one a name attribute holding the country name.
Moreover, CIA/WorldBank/Government in this domain are all "sources", let's label them uniformly as well.
For instance, that could give something like:
(ORGANIZATION {name: CIA})-[:HAS_COMPUTED_GDP {year:2011, value:994}]->(COUNTRY {name:China})
With Cypher Query Language, following this model, you would execute the following query:
START cia = node:nodes(name = "CIA")
MATCH cia-[gdp:HAS_COMPUTED_GDP]->(country)
WHERE gdp.year = 2012
RETURN cia, country, gdp
In this query, I used an index lookup as a starting point (rather than IDs, which are an internal technical notion that shouldn't be used) to retrieve CIA by name and match the relevant subgraph, to finally return CIA, the GDP relationships and their linked countries matching the input constraints.
Although Neo4J is totally schemaless, this does not mean you should necessarily have a totally flexible data model. Having a little structure will always help to make your queries or traversals easier to read.
If you're not familiar with Cypher Query Language (which is not the only way to read or write data into the graph), have a look at the excellent documentation of Neo4J (Cypher: http://docs.neo4j.org/chunked/stable/cypher-query-lang.html, complete: http://docs.neo4j.org/chunked/stable/index.html) and try some queries there: http://console.neo4j.org/!
And to answer your second question, if you wanna add another year of GDP computations, this will just boil down to adding new relationship "HAS_COMPUTED_GDP" between the organizations and the countries, no more no less.
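To make the "one relationship per source, country and year" idea concrete, here is a small plain-Python sketch, not Neo4j code. Each tuple stands for one HAS_COMPUTED_GDP relationship; the `relationships` list and the `gdp_by` function are hypothetical, and all values are imaginary, as in the question.

```python
# Each tuple is one HAS_COMPUTED_GDP relationship: (source, country, year, value).
# All figures are imaginary.
relationships = [
    ("CIA", "China", 2011, 994),
    ("World Bank", "China", 2012, 999),
    ("CIA", "China", 2012, 1001),
    ("CIA", "India", 2012, 500),
]


def gdp_by(source, year):
    """Return {country: value} for every relationship from source in year."""
    return {
        country: value
        for (src, country, yr, value) in relationships
        if src == source and yr == year
    }


cia_2012 = gdp_by("CIA", 2012)
```

Adding another year of data is just one more tuple per source and country, which mirrors adding one more relationship in the graph.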
Hope it helps :)

What is the best approach for showing large results from Amazon DynamoDB?

I have around 150.000 rows of subscribers' answers in a table, and I need to provide a way to let a user select a winner.
I have all this implemented in MS SQL, but because we're having a bit more traffic than expected, I thought it was a good idea to move to an Amazon DynamoDB environment for this particular part (handling subscribers).
in MS SQL I have a SP that is something like:
SELECT s.name, s.guid
FROM calendars c
INNER JOIN challenges cl on cl.calendar_id = c.calendar_id
INNER JOIN challenge_answers ca on ca.calendar_id = c.calendar_id
INNER JOIN subscribers s on s.calendar_id = c.calendar_id
WHERE
c.calendar_id = 9 and
cl.day = 15 and
ca.correct = 1 and
s.email not like '%@mydomain.com'
ORDER BY s.name DESC;
and using LINQ I end up with .Skip(page).Take(25);
I understand that INNER JOINs in Amazon DynamoDB are not a viable option, so I added more fields to the subscribers table, which include all the other fields, so I can simply have only one table where each item contains everything for the query.
What would be the best approach using Amazon DynamoDB to retrieve only a partial group and safely skip "pages"?
DynamoDB is not really designed to skip pages, but rather to get items based on their keys. The query-like features are fairly limited and are applicable only to keys (hash or range) with basic operations. You can also use the keys of the defined indexes, but the same limitations apply there. Here is some additional information about Query.
Regarding pagination, DynamoDB doesn't offer a cursor, so if you start iterating over a set of keys you need to read all items until the LastEvaluatedKey value returned in a response is null. At the moment, there is no built-in support for skipping to a specific page. You can emulate this by building index tables for pages, so that you can fetch the items of a page directly from that index. Chris Moyer suggests a solution here.
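The forward-only iteration described above can be sketched as follows. `query_page` is a hypothetical callable that wraps one Query request; in a real call, the previous response's LastEvaluatedKey would be passed back as ExclusiveStartKey. The `fake_query` backend exists only so the sketch runs without AWS.

```python
def fetch_page(query_page, page_number):
    """Walk forward page by page until the requested 0-based page is reached.

    query_page is a hypothetical callable: it takes the previous
    LastEvaluatedKey (or None) and returns a dict with "Items" and,
    when more data remains, "LastEvaluatedKey".
    """
    start_key = None
    for current in range(page_number + 1):
        response = query_page(start_key)
        if current == page_number:
            return response["Items"]
        start_key = response.get("LastEvaluatedKey")
        if start_key is None:  # ran out of data before reaching the page
            return []
    return []


# Fake backend for illustration only: 7 items, 3 per page.
ITEMS = [f"subscriber-{n}" for n in range(7)]
PAGE_SIZE = 3


def fake_query(start_key):
    offset = 0 if start_key is None else start_key
    response = {"Items": ITEMS[offset:offset + PAGE_SIZE]}
    if offset + PAGE_SIZE < len(ITEMS):
        response["LastEvaluatedKey"] = offset + PAGE_SIZE
    return response


page_two = fetch_page(fake_query, 2)
```

Reaching page N costs N + 1 requests, which is why the index-table approach is suggested for direct page access.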
You could try to use LINQ2DynamoDB DataContext. It caches DynamoDB query results in an in-memory cache (MemcacheD), which greatly improves performance for sorting/paging scenarios in ASP.Net. There's also a custom ASP.Net DataSource implementation there, so you can turn paging on without writing a single line of code.

Is there a way to link two Meteor.Collections() together?

I have finally got around to trying out Meteor and I think it's really cool so far. I have been trying to link two Meteor.Collections() together like in a relational database.
For example, let's say that the user enters an animal type like "Dog", and then other users could input types of dogs like "Doberman", "Labrador", etc.
Thanks in advance
Basically, the idea behind a document-based database like MongoDB is not to try to imitate relational DBs. Try to see if you can add (embed) the types as children of the animal type in the same collection instead of creating a link between two collections.
With that said, there is still a way to link two collections. The way to do that is outside of queries, meaning you get the results from one query and then pass them to the other query as parameters (as you can see, it's not an efficient way).
More background info can be found at - http://www.mongodb.org/display/DOCS/Schema+Design#SchemaDesign-EmbeddingandLinking
or - MongoDB and "joins"
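Both options from this answer can be sketched in plain Python. The lists below are hypothetical stand-ins for two collections, and `breeds_of` is an illustrative name; this is not Meteor or MongoDB code.

```python
# Hypothetical stand-ins for two separate collections.
animal_types = [{"_id": 1, "name": "Dog"}, {"_id": 2, "name": "Cat"}]
breeds = [
    {"name": "Doberman", "type_id": 1},
    {"name": "Labrador", "type_id": 1},
    {"name": "Siamese", "type_id": 2},
]


def breeds_of(type_name):
    """Link two collections in application code with two queries."""
    # Query 1: find the ids of the matching animal types.
    type_ids = {t["_id"] for t in animal_types if t["name"] == type_name}
    # Query 2: pass those ids as parameters to the second query.
    return [b["name"] for b in breeds if b["type_id"] in type_ids]


# Embedded alternative: the breeds live inside the animal type document,
# so a single read returns everything.
embedded = {"name": "Dog", "breeds": ["Doberman", "Labrador"]}

dog_breeds = breeds_of("Dog")
```

The linked version needs two round trips, while the embedded document answers the same question with one read.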

How do you track record relations in NoSQL?

I am trying to figure out the equivalent of foreign keys and indexes in NoSQL KVP or Document databases. Since there are no pivotal tables (to add keys marking a relation between two objects) I am really stumped as to how you would be able to retrieve data in a way that would be useful for normal web pages.
Say I have a user, and this user leaves many comments all over the site. The only ways I can think of to keep track of that user's comments are to
Embed them in the user object (which seems quite useless)
Create and maintain a user_id:comments value that contains a list of each comment's key [comment:34, comment:197, etc...] so that I can fetch them as needed.
However, taking the second example you will soon hit a brick wall when you use it for tracking other things like a key called "active_comments" which might contain 30 million ids in it making it cost a TON to query each page just to know some recent active comments. It also would be very prone to race-conditions as many pages might try to update it at the same time.
How can I track relations like the following in a NoSQL database?
All of a user's comments
All active comments
All posts tagged with [keyword]
All students in a club - or all clubs a student is in
Or am I thinking about this incorrectly?
All the answers for how to store many-to-many associations in the "NoSQL way" reduce to the same thing: storing data redundantly.
In NoSQL, you don't design your database based on the relationships between data entities. You design your database based on the queries you will run against it. Use the same criteria you would use to denormalize a relational database: if it's more important for data to have cohesion (think of values in a comma-separated list instead of a normalized table), then do it that way.
But this inevitably optimizes for one type of query (e.g. comments by any user for a given article) at the expense of other types of queries (comments for any article by a given user). If your application has the need for both types of queries to be equally optimized, you should not denormalize. And likewise, you should not use a NoSQL solution if you need to use the data in a relational way.
There is a risk with denormalization and redundancy that redundant sets of data will get out of sync with one another. This is called an anomaly. When you use a normalized relational database, the RDBMS can prevent anomalies. In a denormalized database or in NoSQL, it becomes your responsibility to write application code to prevent anomalies.
One might think that it'd be great for a NoSQL database to do the hard work of preventing anomalies for you. There is a paradigm that can do this -- the relational paradigm.
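As an illustration of the anomaly risk described above, the sketch below keeps two redundant copies of the same relation, one optimized for each query direction. All names (`comments_by_user`, `comments_by_article`, `add_comment`, `find_anomalies`) are hypothetical, and the dictionaries stand in for whatever store is used.

```python
from collections import defaultdict

# Two redundant copies of the same relation, one per query direction.
comments_by_user = defaultdict(list)
comments_by_article = defaultdict(list)


def add_comment(comment_id, user_id, article_id):
    """Application code must update both copies; nothing enforces it."""
    comments_by_user[user_id].append(comment_id)
    comments_by_article[article_id].append(comment_id)


def find_anomalies():
    """Return comment ids present in one copy but missing from the other."""
    in_user = {c for ids in comments_by_user.values() for c in ids}
    in_article = {c for ids in comments_by_article.values() for c in ids}
    return in_user ^ in_article


add_comment("c1", "alice", "a1")
add_comment("c2", "bob", "a1")
consistent = find_anomalies()

# Simulate a write that updated only one of the two copies.
comments_by_user["alice"].append("c3")
broken = find_anomalies()
```

A check like `find_anomalies` is the kind of code that becomes your responsibility once the database no longer enforces consistency.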
The CouchDB approach suggests emitting the proper classes of stuff in the map phase and summarizing them in reduce. So you could map all comments and emit 1 for the given user, and later print out only the ones. However, it would require lots of disk storage to build persistent views of all trackable data in CouchDB. By the way, they also have this wiki page about relationships: http://wiki.apache.org/couchdb/EntityRelationship.
Riak, on the other hand, has a tool for building relations: links. You can store the address of a linked document (here, a comment) in the 'root' document (here, the user document). There is one catch: because Riak is distributed, the document may be modified in many locations at the same time. That causes conflicts and, as a result, a huge vector clock tree :/ ..not so bad, not so good.
Riak also has yet another 'mechanism': a 2-layer key namespace, the so-called bucket and key. So, for the student example, if we have clubs A, B and C and students StudentX and StudentY, you could maintain the following convention:
{ Key = {ClubA, StudentX}, Value = true },
{ Key = {ClubB, StudentX}, Value = true },
{ Key = {ClubA, StudentY}, Value = true }
To read a relation, just list the keys in the given buckets. What's wrong with that? It is damn slow. Listing buckets was never a priority for Riak, though it is getting better and better. By the way, you do not waste memory, because in this example {true} can be linked to the single full profile of StudentX or StudentY (here conflicts are not possible).
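The key convention above can be mimicked with a plain Python dict. This is not Riak's API; `memberships`, `students_in` and `clubs_of` are hypothetical names used only to show how the relation is read back by listing keys.

```python
# Each key is a (club, student) pair; the value only marks that the relation exists.
memberships = {
    ("ClubA", "StudentX"): True,
    ("ClubB", "StudentX"): True,
    ("ClubA", "StudentY"): True,
}


def students_in(club):
    """List every key and keep the students of one club."""
    # The scan over all keys mirrors why key listing is slow on a large store.
    return sorted(s for (c, s) in memberships if c == club)


def clubs_of(student):
    """Read the same relation in the other direction."""
    return sorted(c for (c, s) in memberships if s == student)


club_a = students_in("ClubA")
student_x = clubs_of("StudentX")
```

Both directions are answered from the same set of keys, at the cost of scanning all of them.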
As you see it NoSQL != NoSQL. You need to look at specific implementation and test it for yourself.
The column stores mentioned before look like a good fit for relations, but it all depends on your A, C and P needs ;) If you do not need A and you have less than petabytes of data, just leave it and go ahead with MySQL or Postgres.
good luck
user:userid:comments is a reasonable approach - think of it as the equivalent of a column index in SQL, with the added requirement that you cannot query on unindexed columns.
This is where you need to think about your requirements. A list with 30 million items is not unreasonable because it is slow, but because it is impractical to ever do anything with it. If your real requirement is to display some recent comments you are better off keeping a very short list that gets updated whenever a comment is added - remember that NoSQL has no normalization requirement. Race conditions are an issue with lists in a basic key value store but generally either your platform supports lists properly, you can do something with locks, or you don't actually care about failed updates.
For posts tagged with a keyword, do the same as for user comments: create an index keyword:posts.
For students and clubs, more of the same: probably a list of clubs as a property of the student, and an index on that field to get all members of a club.
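The "very short list of recent comments" suggested above can be sketched with a bounded deque. The cap `RECENT_LIMIT` and the name `record_comment` are assumptions for illustration. A real key-value store would need its own list type or locking to handle concurrent writers.

```python
from collections import deque

RECENT_LIMIT = 3  # hypothetical cap on how many recent comments to keep

# A bounded list: once full, adding a new id drops the oldest one.
recent_active = deque(maxlen=RECENT_LIMIT)


def record_comment(comment_id):
    """Called whenever a comment is added; newest ids sit at the front."""
    recent_active.appendleft(comment_id)


for n in range(1, 6):
    record_comment(f"comment:{n}")

latest = list(recent_active)
```

The list never grows past the cap, so reading "some recent active comments" stays cheap no matter how many comments exist in total.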
You have
"user": {
    "userid": "unique value",
    "category": "student",
    "metainfo": "yada yada yada",
    "clubs": ["archery", "kendo"]
}
"comments": {
    "commentid": "unique value",
    "pageid": "unique value",
    "post-time": "ISO Date",
    "userid": "OP id -> THIS IS IMPORTANT"
}
"page": {
    "pageid": "unique value",
    "post-time": "ISO Date",
    "op-id": "user id",
    "tag": ["abc", "zxcv", "qwer"]
}
In a relational database, the normal thing to do in a one-to-many relation is to normalize the data. You would do the same thing in a NoSQL database as well. Simply index the fields by which you will be fetching the information.
For example, the important indexes for you are
Comment.UserID
Comment.PageID
Comment.PostTime
Page.Tag[]
If you are using NosDB (A .NET based NoSQL Database with SQL support) your queries will be like
SELECT * FROM Comments WHERE userid = 'That user';
SELECT * FROM Comments WHERE pageid = 'That page';
SELECT * FROM Comments WHERE post-time > DateTime('2016, 1, 1');
SELECT * FROM Page WHERE tag = 'kendo';
Check all the supported query types from their SQL cheat sheet or documentation.
Although it is best to use an RDBMS in such cases instead of NoSQL, one possible solution is to maintain additional nodes or collections to manage mappings and indexes. This may have an additional cost in the form of extra collections/nodes and processing, but it gives a solution that is easy to maintain and avoids data redundancy.