Canonical way to represent and query many to one relationships in Solr - postgresql

I have PostgreSQL and Solr running in the backend, and am using EmberJS with ember-data on the frontend.
In PostgreSQL I have a many to one relation where one article can have many video links. Ember-data requires that this be returned in the format:
{
articles: {
//Some number of articles
},
videos: {
//Videos that belong to those articles
}
}
Now what I intend to do is import from the database into Solr to allow searching on the data. This is where I begin to be confused.
I decided that the best way to go about doing this would be to separately make a core for the videos and make a core for the articles, and then import into both cores separately as well. However, I'm new to Solr and am unsure if having multiple cores is the canonical way to handle a many to one relationship.
Moreover, as far as I know, you can't do a join on two cores when querying which means to construct the JSON that ember requires it'll take multiple queries, which seems wrong to me.
How can I properly represent a many to one relationship in Solr?

It depends on what you want to search on. But when importing data from relational DBs into a document-based store like Solr, the most canonical way is to denormalize or flatten the data into a single document collection. In this case, assuming you want to search on the articles (article title, for instance), you would want to have a collection in which the documents looking like this
{
article_title: "<title_string>",
article_link: "<article_link1>",
videos: ["<video_link1>","<video_link2>",...]
}

Related

Amazon DynamoDB Json support while querying and scanning

After a painful day trying to figure out if we should go with DynamoDB to store Json documents (vs mongo) and reading through almost all the AWS documentations and online examples, I have decided to ask my questions here.
Ours is a Spring Boot Java application and we are using the aws-dynamodb sdk plugin. Our application has to manage a couple of thousands of Json documents and be able to retrieve based on various conditions.
For example, imagine this is the JSon document -
{
"serial":"123123",
"feed":{
"ABC":{
"queue": "ABC",
"active": true
},
"XYZ" : {
"queue":"XYZ",
"active": false
}
}
These are the questions I have
Can I store this whole Json document as a String attribute in Dynamo table and still be able to retrieve the records based on the value of certain attributes inside the Json and how?
For example I would like to get all the items that has the feed ABC active.
How scalable is this solution?
I know I can do this very easily in Mongo but just couldn't get it working in dynamo.
First, if you aren't using DynamoDBMapper for talking to DynamoDB, you should consider using it, instead of low-level APIs, as it provides a more convenient higher-level abstraction.
Now, answers to your questions:
Instead of storing it as a String, consider using Map. More information on supported data types can be found here. As for searching, there are two ways: Query (in which you need to provide primary keys of the records you need) and Scan. For your example (i.e. 'all the items that has the feed ABC active'), you'd have to do a Scan, as you don't know the primary keys.
DynamoDB is highly scalable. Querying is efficient, but looks like you'll be Scaning more. The latter has its limitations, as it literally goes through each record, but should work fine for you as you'll only have couple of thousand records. Do performance testing first though.

MongoDB and one-to-many relation

I am trying to come up with a rough design for an application we're working on. What I'd like to know is, if there is a way to directly map a one to many relation in mongo.
My schema is like this:
There are a bunch of Devices.
Each device is known by it's name/ID uniquely.
Each device, can have multiple interfaces.
These interfaces can be added by a user in the front end at any given
time.
An interface is known uniquely by it's ID, and can be associated with
only one Device.
A device can contain at least an order of 100 interfaces.
I was going through MongoDB documentation wherein they mention things relating to Embedded document vs. multiple collections. By no means am I having a detailed clarity over this as I've just started with Mongo and meteor.
Question is, what could seemingly be a better approach? Having multiple small collections or having one big embedded collection. I know this question is somewhat subjective, I just need some clarity from folks who have more expertise in this field.
Another question is, suppose I go with the embedded model, is there a way to update only a part of the document (specific to the interface alone) so that as and when itf is added, it can be inserted into the same device document?
It depends on the purpose of the application.
Big document
A good example on where you'd want a big embedded collection would be if you are not going to modify (normally) the data but you're going to query them a lot. In my application I use this for storing pre-processed trips with all the information. Therefore when someone wants to consult this trip, all the information is located in a single document. However if your query is based on a value that is embedded in a trip, inside a list this would be very slow. If that's the case I'd recommend creating another collection with a relation between both collections. Also for updating part of a document it would be slow since it would require you to fetch the whole document and then update it.
Small documents with relations
If you plan on modify the data a lot, I'd recommend you to stick to a reference to another collection. With small documents, this will allow you to update any collection quicker. If you want to model a unique relation you may consider using a unique index in mongo. This can be done using: db.members.createIndex( { "user_id": 1 }, { unique: true } ).
Therefore:
Big object: Great for querying data but slow for complex queries.
Small related collections: Great for updating but requires several queries on distinct collections.

Efficient way to creating relationships with NoSQL database

I am currently trying to implement Tumblr-like user interactions like reblog, following, followers, commenting, blog posts of people who I currently following etc.
Also there is a requirement to display activity for each blog post.
I am stuck with creating proper schema for database. There are several way to achieve this kind of functionality (defining data structures embedded like blog posts and comments, creating an activity document for each action etc.) but I couldn't currently decide which way is the best in terms of performance and scalability.
For instance let's look at implementation of people who I follow. Here is sample User document.
User = { id: Integer,
username: String,
following: Array of Users,
followers: Array of Users,
}
This seems trivial. I can manage following field per user action (follow/unfollow) but what if an user who I currently follow is deleted. Is it effective to update all User records who follows deleted user.
Another problem is creating a view of blog post from people who I follow.
Post = { id: Integer,
author: User,
body: Text,
}
So is it effective query latest posts like;
db.posts.find( { author: { $in : me.followers} } )
It seems (to me) that you are trying to use a single data store (in this case a document-oriented NoSQL database) to fulfill (at least) two different requirements. The first thing you seem to be trying to do is store data in a document-oriented store. I am going to assume that you have legitimate reasons for doing this.
The second thing you seem to be trying to do is establish relationship(s) between the documents you are storing. Your example shows a FOLLOWS relationship. I would recommend treating this as a different requirement from storing data in a document-oriented NoSQL database and look at storing the relationships in a graph-oriented NoSQL database such as Neo4j. This way, your entities can be stored in the document store and relationships in the graph store using just the document IDs.
My experience has been that it will be difficult (if not impossible) to get a single NoSQL database to meet all functional and non-functional needs of a medium to large sized application. For example, the latest application I am working on uses MongoDB, Redis and Neo4j besides an RDBMS. I spent a lot of time experimenting with technologies and settled on this combination. I have committed myself to using Spring 3, along with the Spring Data project and so far my experience has been great.
One approach that works is called "Star Schema". If you search the web or wikipedia then you'll find lots of information.

MongoDB: Should you still provide IDs linking to other collections to or just include collections?

I'm pretty new to MongoDB and NoSQL in general. I have a collection Topics, where each topics can have many comments. Each comment will have metadata and whatnot making a Comments collection useful.
In MySQL I would use foreign keys to link to the Comments table, but in NoSQL should I just include the a Comments collection within the Topics collection or have it be in a separate collection and link by ids?
Thanks!
Matt
It depends.
It depends on how many of each of these type of objects you expect to have. Can you fit them all into a single MongoDB document for a given Topic? Probably not.
It depends on the relationships - do you have one-to-many or many-to-many relationships? If it's one-to-many and the number of related entities is small you might chose to put embed them in an IList on a document. If it's many-to-many you might chose to use a more traditional relationship or you might chose to embed both sides as ILists.
You can still model relationships in MongoDB with separate collections BUT there are no joins in the database so you have to do that in code. Loading a Topic and then loading the Comments for it might be just fine from a performance perspective.
Other tips:
With MongoDB you can index INTO arrays on documents. So don't think of an Index as just being an index on a simple field on a document (like SQL). You can use, say, a Tag collection on a Topic and index into the tags. (See http://www.mongodb.org/display/DOCS/Indexes#Indexes-Arrays)
When you retrieve or write data you can do a partial read and a partial write of any document. (see http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields)
And, finally, when you can't see how to get what you want using collections and indexes, you might be able to achieve it using map reduce. For example, to find all the tags currently in use sorted by their frequency of use you would map each Topic emitting the tags used in it, and then you would reduce that set to get the result you want. You might then store the result of that map reduce permanently and only up date it when you need to.
It's a fairly significant mind-shift from relational thinking but it's worth it if you need the scalability and flexibility a NOSQL approach brings.
Also look at the Schema Design docs (http://www.mongodb.org/display/DOCS/Schema+Design). There are also some videos/slides of several 10Gen presentations on schema design linked on the Mongo site. See http://www.mongodb.org/pages/viewpage.action?pageId=17137769 for an overview.

How would you architect a blog using a document store (such as CouchDB, Redis, MongoDB, Riak, etc)

I'm slightly embarrassed to admit it, but I'm having trouble conceptualizing how to architect data in a non-relational world. Especially given that most document/KV stores have slightly different features.
I'd like to learn from a concrete example, but I haven't been able to find anyone discussing how you would architect, for example, a blog using CouchDB/Redis/MongoDB/Riak/etc.
There are a number of questions which I think are important:
Which bits of data should be denormalised (e.g. tags probably live with the document, but what about users)
How do you link between documents?
What's the best way to create aggregate views, especially ones which require sorting (such as a blog index)
First of all I think you would want to remove redis from the list as it is a key-value store instead of a document store. Riak is also a key-value store, but you it can be a document store with library like Ripple.
In brief, to model an application with document store is to figure out:
What data you would store in its own document and have another document relate to it. If that document is going to be used by many other documents, then it would make sense to model it in its own document. You also must consider about querying the documents. If you are going to query it often, it might be a good idea to store it in its own document as you would find it hard to query over embedded document.
For example, assuming you have multiple Blog instance, a Blog and Article should be in its own document eventhough an Article may be embedded inside Blog document.
Another example is User and Role. It makes make sense to have a separate document for these. In my case I often query over user and it would be easier if it is separated as its own document.
What data you would want to store (embed) inside another document. If that document only solely belongs to one document, then it 'might' be a good option to store it inside another document.
Comments sometimes would make more sense to be embedded inside another document
{ article : { comments : [{ content: 'yada yada', timestamp: '20/11/2010' }] } }
Another caveat you would want to consider is how big the size of the embedded document will be because in mongodb, the maximum size of embedded document is 5MB.
What data should be a plain Array. e.g:
Tags would make sense to be stored as an array. { article: { tags: ['news','bar'] } }
Or if you want to store multiple ids, i.e User with multiple roles { user: { role_ids: [1,2,3]}}
This is a brief overview about modelling with document store. Good luck.
Deciding which objects should be independent and which should be embedded as part of other objects is mostly a matter of balancing read/write performance/effort - If a child object is independent, updating it means changing only one document but when reading the parent object you have only ids and need additional queries to get the data. If the child object is embedded, all the data is right there when you read the parent document, but making a change requires finding all the documents that use that object.
Linking between documents isn't much different from SQL - you store an ID which is used to find the appropriate record. The key difference is that instead of filtering the child table to find records by parent id, you have a list of child ids in the parent document. For many-many relationships you would have a list of ids on both sides rather than a table in the middle.
Query capabilities vary a lot between platforms so there isn't a clear answer for how to approach this. However as a general rule you will usually be setting up views/indexes when the document is written rather than just storing the document and running ad-hoc queries later as you would with SQL.
Ryan Bates made a screencast a couple of weeks ago about mongoid and he uses the example of a blog application: http://railscasts.com/episodes/238-mongoid this might be a good place for you to get started.