Partition Lucene Index by ID across multiple indexes - lucene.net

I am trying to put together my Lucene search solution, and I'm having trouble figuring out how to start.
On my site, I want one search to span 5 different types of objects in my model.
I want my results to come back as one list, ordered by best match first, with a way to differentiate the type so I can show the data appropriately.
Our system is split out into what we call sites. I want to index the 5 different model objects by site. Searching will always be done by site.
I'm not sure where to begin indexing this system for optimal performance, and I'm also not sure how best to implement the search for this setup. Any advice, articles, and examples are greatly appreciated.
EDIT:
Since it has been said this is too broad,
Let's say I have 3 sites: Site 1, Site 2, and Site 3.
Let's say I am indexing Dogs, Cats, and Hamsters. A record of each of these types is linked to a site.
So, for instance, my data might be (Type, Name, SiteId)
Dog, "Fido" 1
Cat, "Sprinkles", 2
Hamster, "Sprinkles", 2
Cat, "Mr. Pretty", 3
Cat, "Mr. Pretty 2", 3
So, when I do a search for "Mr. Pretty", I want to target a specific site id. If I go against site id 1, I'll get 0 results. If I search against site id 3, I'll get:
Mr. Pretty
Mr. Pretty 2
And if I search for "Sprinkles" on Site 2, I will know that one result is a cat and the other result is a hamster.
What is the best way I can go about achieving this sort of search index?

As goalie7960 suggested, you can add a "SiteID" field to each document and add a query term like siteid:3 to your query in order to retrieve documents only from that site. You can also improve performance by creating and storing a Filter for each site, so you can apply it to the corresponding queries.
Regarding different types in the same index, you can use the same strategy: create a "type" field for each document with the corresponding type (maybe just an ID). Elasticsearch uses the same strategy to keep different, distinguishable types in the same index. Again, you can use Filters on the types to speed up queries (Elasticsearch does the same).

Related

Why are there two refs when declaring a one-to-many association in mongoose?

I'm very new to MongoDB; see this one-to-many example.
As per my understanding, this example says that a person can write many stories, or that a story belongs_to a person. I think storing person._id in the stories collection would have been enough,
so why does the person collection also have a stories field?
Cases for fetching data:
Case 1
Fetch all stories of a person whose id is, let us say, x.
Solution: for this, just fire a query on the story collection where author = x.
Case 2
Fetch the author name of a particular story.
Solution: for this, we have the author field in the story collection.
TL;DR
Put simply: because there is no such notion as explicit relations in MongoDB.
Mongoose cannot know how you want to resolve the relationship. Will the search start from a given story object, with the author to be found? Or will the search find all stories for a given author object? So it makes sure that it can resolve the relation in either direction.
Note that there is a problem with that approach, and a big one. Say we are not talking of a one-to-few relation as in this example, but a "One-To-A-Shitload"™ relation. Since BSON documents have a size limit of 16MB, there is a limit on how many relations you can manage this way. Quite a lot, but it is still an artificial limit.
How to solve this: instead of relying on an ODM, do the modelling yourself, since you know your use cases. I will give you an example below.
Detailed
Let us first elaborate your cases a bit:
1. For a given user (aka "we already have all the data of that user document"), what are his or her stories?
2. List all stories together with the user name on an overview page.
3. For a selected ("given") story, what are the author's details?
4. And just for demonstration purposes: a given user wants to change the name under which a story is displayed, be it his user name, natural name (it happens!) or even a pseudonym.
OK, now let us put mongoose aside and think about how we could implement this ourselves, keeping in mind that
data modelling in MongoDB means deriving your model from the questions that come from your use cases, so that the most common use cases are covered in the most efficient way.
This is as opposed to RDBMS modelling, where you identify your entities, their properties and relations, and then jump through some hoops to get your questions answered somehow.
So, looking at our user stories, I guess we can agree that 2 is the most common use case, 3 and 1 come next, and 4 is rather rare compared to the others.
So now we can start
Modelling
We model the data involved in our most common use cases first.
So, we want to make the query for stories the most efficient one. And we want to sort the stories in descending order of submission. Simple enough:
{
  _id: new ObjectId(),
  user: "Name to Display",
  story: "long story cut short"
}
Now let's say you want to display your stories, 10 of them:
db.stories.find({}).sort({_id:-1}).limit(10)
No relation, all the data we need, a single query, using the default index on _id for sorting. Since a timestamp is part of the ObjectId and is its most significant part, we can use it to sort the stories by time. The question "Hey, but what if somebody changes his or her user name?" usually comes up now. Simple:
db.stories.update({"user":oldname},{$set:{"user":newname}},{multi:true})
Since this is a rare use case, it only has to be doable and does not have to be extremely efficient. However, as we will see later, we have to put an index on user anyway.
Talking of authors: Here it really depends on how you want to do it. But I will show you how I tend to model something like that:
{
  _id: "username",
  info1: "foo",
  info2: "bar",
  active: true,
  ...
}
We make use of some properties of _id here: It is a required index with a unique constraint. Just what we want for usernames.
However, it comes with a caveat: _id is immutable. So if somebody wants to change his or her username, we need to copy the original user to a document whose _id is the new user name and change the user property of our stories accordingly. The advantage of doing it this way is that even if the update for changing usernames (see above) fails mid-run, each and every story can still be related to a user. If the update is successful, I tend to log the user out and have him log in with the new username again.
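For illustration, that rename procedure might look roughly like this in the shell (just a sketch; oldname and newname are placeholders):
// Copy the author document to a new _id holding the new username
var author = db.authors.findOne({"_id": oldname})
author._id = newname
db.authors.insert(author)
// Point all stories at the new name, then remove the old author document
db.stories.update({"user": oldname}, {$set: {"user": newname}}, {multi: true})
db.authors.remove({"_id": oldname})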
In case you want to have a distinction between username and displayed name, it all becomes even easier:
{
  _id: "username",
  displayNames: ["Foo B. Baz", "P.S. Eudonym"],
  ...
}
Then you use the display name in your stories, of course.
Now let us see how we can get the user details of a given story. We know the author's name so it is as easy as:
db.authors.find({"_id":authorNameOfStory})
or
db.authors.find({"displayNames": authorNameOfStory})
Finding all stories for a given user is quite simple, too. It is either:
db.stories.find({"user": idFieldOfUser})
or
db.stories.find({"user": {$in: displayNamesOfUser}})
Now that we have all of your use cases covered, we can make them even more efficient with
Indexing
An obvious index is on the story model's user field, so we create it:
db.stories.ensureIndex({"user": 1})
If you are good with the "username as _id" way only, you are done with indexing. Using display names, you obviously need to index them. Since you most likely want display names and pseudonyms to be unique, it is a bit more complicated:
db.authors.ensureIndex({"displayNames":1},{sparse:true, unique:true})
Note: We need to make this a sparse index in order to prevent unnecessary errors when somebody has not decided on a display name or pseudonym yet. Make sure you explicitly add this field to an author document only when a user decides on a display name. Otherwise, it would evaluate to null server-side, which is a valid value, and you would get a constraint violation error, namely "E11000 duplicate key".
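For example, adding the field only at that point could be as simple as (a sketch):
db.authors.update({"_id": "username"}, {$addToSet: {"displayNames": "P.S. Eudonym"}})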
Conclusion
We have covered all your use cases with relations handled by the application, thereby simplifying our data model a great deal, and we have the most efficient queries for the most common use cases. Every use case is covered by a single query, taking into account the information we already have at the time we run the query.
Note that there is no artificial limit on how many stories a user can publish since we use implicit relations to our advantage.
As for more complicated queries ("How many stories does each user submit per month?"), use the aggregation framework. That is what it is there for.
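A rough sketch of such an aggregation (assuming MongoDB 4.0+ so $toDate can derive the month from the ObjectId's timestamp):
db.stories.aggregate([
  {$group: {
    // group by user plus the year/month the story was submitted
    _id: {user: "$user", year: {$year: {$toDate: "$_id"}}, month: {$month: {$toDate: "$_id"}}},
    stories: {$sum: 1}
  }}
])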

MongoDB model design for meteorjs app

I'm more used to relational databases and am having a hard time thinking about how to design my database in MongoDB. I am even more unclear when taking into account some of the special considerations of database design for meteorjs, where I understand you often prefer separate collections over embedded documents/data in order to make better use of some of the benefits you get from collections.
Let's say I want to track students progress in high school. They need to complete certain required classes each school year in order to progress to the next year (freshman, sophomore, junior, senior), and they can also complete some electives. I need to track when the students complete each requirement or elective. And the requirements may change slightly from year to year, but I need to remember for example that Johnny completed all of the freshman requirements as they existed two years ago.
So I have:
Students
Requirements
Electives
Grades (frosh, etc.)
Years
Mostly, I'm trying to think about how to set up the requirements. In a relational DB, I'd have a table of requirements, with className, grade, and year, and a table of student_requirements, that tracks the students as they complete each requirement. But I'm thinking in MongoDB/meteorjs, I'd have a model for each grade/level that gets stored with a studentID and initially instantiates with false values for each requirement, like:
{
  student: [studentID],
  class: 'freshman',
  year: 2014,
  requirements: {
    class1: false,
    class2: false
  }
}
and as the student completes a requirement, it updates like:
{
  student: [studentID],
  class: 'freshman',
  year: 2014,
  requirements: {
    class1: false,
    class2: [completionDateTime]
  }
}
So in this way, each student will collect four Requirements documents, which are somewhat dictated by their initial instantiation values. And instead of the actual requirements for each grade/year living in the database, they would essentially live in the code itself.
Some of the actions I would like to be able to support are marking off requirements across a set of students at one time, and showing a grid of users/requirements to see who needs what.
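For instance, I imagine marking class1 complete for a whole group of students would be a bulk update along these lines (just a sketch, assuming these documents live in a progress collection and listOfStudentIds is a placeholder):
db.progress.update(
  {student: {$in: listOfStudentIds}, class: 'freshman', year: 2014},
  {$set: {'requirements.class1': new Date()}},
  {multi: true}
)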
Does this sound reasonable? Or is there a better way to approach this? I'm pretty early in this application and am hoping to avoid painting myself into a corner. Any help or suggestions are appreciated. Thanks! :-)
Currently I'm thinking about my application data design too. I've read the examples in the MongoDB manual
look up MongoDB manual data model design - docs.mongodb.org/manual/core/data-model-design/
and here -> MongoDB manual one to one relationship - docs.mongodb.org/manual/tutorial/model-embedded-one-to-one-relationships-between-documents/
(sorry I can't post more than one link at the moment in an answer)
They say:
In general, use embedded data models when:
you have “contains” relationships between entities.
you have one-to-many relationships between entities. In these relationships the “many” or child documents always appear with or are viewed in the context of the “one” or parent documents.
The normalized approach uses a reference in one document to another document, just like in the Meteor.js book. They create a web app which shows posts, and each post has a set of comments. They use two collections, posts and comments. When adding a comment, it's submitted together with the post_id.
So in your example you have a students collection, and each student has to fulfill requirements? And each student has his own requirements, like a post has its own comments?
Then I would handle it like they did in the book. With two collections. I think that should be the normalized approach, not the embedded.
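Roughly, that two-collection setup might be seeded like this (only a sketch; the collection and field names are made up, and someStudentId/someRequirementId are placeholders):
// one document per requirement, per grade and year
db.requirements.insert({className: "class1", grade: "freshman", year: 2014})
// one document per student per fulfilled requirement
db.completions.insert({studentId: someStudentId, requirementId: someRequirementId, completedAt: new Date()})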
I'm a little confused myself, so maybe you can tell me, if my answer makes sense.
Maybe you can help me too? I'm trying to make a app that manages a flea market.
Users of the app create events.
The creator of the event invites users to be cashiers for that event.
Users create lists of stuff they want to sell. There is a maximum number of lists/sellers per event and a maximum number of positions on a list (25/50).
Cashiers type in the positions of those lists at the event, to track what is sold.
Event creators make billings for the sold stuff of each list, to hand out the money afterwards.
I'm confused how to set up the data design. I need Events and Lists. Do I use the normalized approach, or the embedded one?
Edit:
After reading percona.com/blog/2013/08/01/schema-design-in-mongodb-vs-schema-design-in-mysql/ I found the following advice:
If you read people information 99% of the time, having 2 separate collections can be a good solution: it avoids keeping in memory data that is almost never used (passport information), and when you need to have all information for a given person, it may be acceptable to do the join in the application.
Same thing if you want to display the name of people on one screen and the passport information on another screen.
But if you want to display all information for a given person, storing everything in the same collection (with embedding or with a flat structure) is likely to be the best solution.
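In shell terms, the "join in the application" from that quote is simply two queries (a sketch with made-up collection names; personId is a placeholder):
var person = db.people.findOne({"_id": personId})
var passport = db.passports.findOne({"personId": person._id})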

Blogs and Blog Comments Relationship in NoSQL

Going off an example in the accepted answer here:
Mongo DB relations between objects
For a blogging system, "Posts should be a collection. post author might be a separate collection, or simply a field within posts if only an email address. comments should be embedded objects within a post for performance."
If this is the case, does that mean that every time my app displays a blog post, I'm loading every single comment that was ever made on that post? What if there are 3,729 comments? Wouldn't this brutalize the database connection, SQL or NoSQL? Also there's the obvious scenario in which when I load a blog post, I want to show only the first 10 comments initially.
Document databases are not relational databases. You CANNOT first build the database model and then later on decide on various interesting ways of querying it. Instead, you should first determine what access patterns you want to support, and then design the document schemas accordingly.
So in order to answer your question, what we really need to know is how you intend to use the data. Displaying comments associated with a post is a distinctly different scenario than displaying all comments from a particular author. Each one of those requirements will dictate a different design, as will supporting them both.
This in itself may be useful information to you (?), but I suspect you want more concrete answers :) So please add some additional details on your intended usage.
Adding more info:
There are a few "do" and "don'ts" when deciding on a strategy:
DO: Optimize for the common use-cases. There is often a 20/80 breakdown where 20% of the UX drives 80% of the load - the homepage/landing page is a classic example. First priority is to make sure that these are as efficient as possible. Make sure that your data model allows either A) loading those in either a single IO request or B) is cache-friendly
DONT: don't fall into the dreaded "N+1" trap. This pattern occurs when you data model forces you to make N calls in order to load N entities, often preceded by an additional call to get the list of the N IDs. This is a killer, especially together with #3...
DO: Always cap (via the UX) the amount of data which you are willing to fetch. If the user has 3729 comments you obviously aren't going to fetch them all at once. Even it it was feasible from a database perspective, the user experience would be horrible. Thats why search engines use the "next 20 results" paradigm. So you can (for example) align the database structure to the UX and save the comments in blocks of 20. Then each page refresh involves a single DB get.
DO: Balance the Read and Write requirements. Some types of systems are read-heavy and you can assume that for each write there will be many reads (StackOverflow is a good example). So there it makes sense to make writes more expensive in order to gain benefits in read performance. For example, data denormalization and duplication. Other systems are evenly balanced or even write heavy and require other approaches
DO: Use the dimension of TIME to your advantage. Twitter is a classic example: 99.99% of tweets will never be accessed after the first hour/day/week/whatever. That opens all kinds of interesting optimization possibilities in the your data schema.
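Here is the comment-bucketing idea from the third point sketched out in mongo shell terms (postId, newComment and n are placeholders, and the collection name is made up):
// Append to the newest non-full bucket; the upsert starts a fresh bucket
// once every existing bucket for this post already holds 20 comments
db.commentBuckets.update(
  {postId: postId, count: {$lt: 20}},
  {$push: {comments: newComment}, $inc: {count: 1}},
  {upsert: true}
)
// Reading "page n" of comments is then a single fetch of one bucket document
db.commentBuckets.find({postId: postId}).sort({_id: 1}).skip(n).limit(1)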
This is just the tip of the iceberg. I suggest reading up a little on column-based NoSQL systems (such as Cassandra)
Not sure if this answers your question, but anyhow, you can throttle the amount of blog comments in two ways:
Load only the last 10, or a range of blog comments, using the $slice operator:
db.blogs.find( {_id : someValue}, { comments: { $slice: -10 } } )
will return the last 10 comments
db.blogs.find( {_id : someValue}, { comments: { $slice: [-20, 10] } } )
will return the 10 comments before those (the next page back)
Use a capped array to keep only the last n blog comments.
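With MongoDB, for example, this can be done with $push combined with $each and $slice, which keeps only the newest n elements (a sketch; newComment is a placeholder):
db.blogs.update(
  {_id: someValue},
  {$push: {comments: {$each: [newComment], $slice: -100}}}
)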

Introduction to object databases

I'm trying to understand the idea of NoSQL databases, or to be more precise, the concept behind the neo4j graph database. I have experience with SQL databases (MySQL, MS SQL), but the limitations of managing hierarchical data made me expand my knowledge. Now I have some questions and I can't find their answers (maybe I don't know what to search for).
Imagine we have a list of countries in the world. Each country has its GDP every year, and each country's GDP is calculated by different sources - the World Bank, its government, the CIA etc. What's the best way to organise the data in this case?
The simplest thing which came to mind is to have the node (the values are imaginary):
China:
  GDPByWorldBank2012: 999,
  GDPByCIA2011: 994,
  GDPByGovernment2012: 1102
In a relational database, I would split the data into three tables: Countries, Sources and Values, where in Values I would have the value of GDP, the year, the id of the country and the id of the source.
Another thing that came to mind is to create nodes CIA and World Bank, though a node Government looks really weird. Anyway, the idea is to have relationships (valueOfGDP):
CIA -> valueOfGDP - {year: 2011, value: 994} -> China
World Bank -> valueOfGDP - {year: 2012, value: 999} -> China
This looks pretty weird to me. What's more, what happens when we add the values for all the years from one source? Would we have multiple relationships, or what?
I'm sorry if my questions are too dumb, and I would be happy if someone could explain or show me which book/article to read.
Thanks in advance. :)
Your questions are very legit, and you're not the only one having difficulty grasping graph modelling at first ;)
It is always easier to start by thinking about the questions you wanna answer with your data before modelling it up front.
Let's imagine you wanna retrieve the GDP of year 2012 computed by the CIA for all countries.
A simple way to achieve this is to label country nodes uniformly and set a name attribute, which obviously holds the country name.
Moreover, CIA/WorldBank/Government in this domain are all "sources", let's label them uniformly as well.
For instance, that could give something like:
(ORGANIZATION {name: "CIA"})-[:HAS_COMPUTED_GDP {year: 2011, value: 994}]->(COUNTRY {name: "China"})
With Cypher Query Language, following this model, you would execute the following query:
START cia = node:nodes(name = "CIA")
MATCH cia-[gdp:HAS_COMPUTED_GDP]->(country)
WHERE gdp.year = 2012
RETURN cia, country, gdp
In this query, I used an index lookup as a starting point (rather than IDs, which are an internal technical notion that shouldn't be used) to retrieve CIA by name, and matched the relevant subgraph to finally return CIA, the GDP relationships and their linked countries matching the input constraints.
Although Neo4J is totally schemaless, this does not mean you should necessarily have a totally flexible data model. Having a little structure will always help to make your queries or traversals easier to read.
If you're not familiar with Cypher Query Language (which is not the only way to read or write data into the graph), have a look at the excellent documentation of Neo4J (Cypher: http://docs.neo4j.org/chunked/stable/cypher-query-lang.html, complete: http://docs.neo4j.org/chunked/stable/index.html) and try some queries there: http://console.neo4j.org/!
And to answer your second question: if you wanna add another year of GDP computations, it just boils down to adding new "HAS_COMPUTED_GDP" relationships between the organizations and the countries, no more, no less.
Hope it helps :)

How do you track record relations in NoSQL?

I am trying to figure out the equivalent of foreign keys and indexes in NoSQL KVP or document databases. Since there are no pivot tables (to add keys marking a relation between two objects), I am really stumped as to how you would be able to retrieve data in a way that would be useful for normal web pages.
Say I have a user, and this user leaves many comments all over the site. The only ways I can think of to keep track of that user's comments are to:
1. Embed them in the user object (which seems quite useless)
2. Create and maintain a user_id:comments value that contains a list of each comment's key [comment:34, comment:197, etc...] so that I can fetch them as needed.
However, taking the second approach, you will soon hit a brick wall when you use it for tracking other things, like a key called "active_comments" which might contain 30 million ids, making it cost a TON to query each page just to know some recent active comments. It would also be very prone to race conditions, as many pages might try to update it at the same time.
How can I track relations like the following in a NoSQL database?
All of a user's comments
All active comments
All posts tagged with [keyword]
All students in a club - or all clubs a student is in
Or am I thinking about this incorrectly?
All the answers for how to store many-to-many associations in the "NoSQL way" reduce to the same thing: storing data redundantly.
In NoSQL, you don't design your database based on the relationships between data entities. You design your database based on the queries you will run against it. Use the same criteria you would use to denormalize a relational database: if it's more important for data to have cohesion (think of values in a comma-separated list instead of a normalized table), then do it that way.
But this inevitably optimizes for one type of query (e.g. comments by any user for a given article) at the expense of other types of queries (comments for any article by a given user). If your application has the need for both types of queries to be equally optimized, you should not denormalize. And likewise, you should not use a NoSQL solution if you need to use the data in a relational way.
There is a risk with denormalization and redundancy that redundant sets of data will get out of sync with one another. This is called an anomaly. When you use a normalized relational database, the RDBMS can prevent anomalies. In a denormalized database or in NoSQL, it becomes your responsibility to write application code to prevent anomalies.
One might think that it'd be great for a NoSQL database to do the hard work of preventing anomalies for you. There is a paradigm that can do this -- the relational paradigm.
The CouchDB approach suggests emitting the proper classes of things in the map phase and summarizing them in reduce. So you could map all comments, emit 1 for the given user, and later read back only that user's entries. It would, however, require lots of disk storage to build persistent views of all trackable data in CouchDB. By the way, they also have a wiki page about relationships: http://wiki.apache.org/couchdb/EntityRelationship.
Riak, on the other hand, has a tool to build relations: links. You can put the address of a linked document (here, a comment) into the 'root' document (here, the user document). There is one catch: if it is distributed, it may be modified at the same time in many locations. This will cause conflicts and, as a result, a huge vector clock tree :/ ..not so bad, not so good.
Riak also has yet another 'mechanism': a 2-layer key namespace, the so-called bucket and key. So, for the student example, if we have clubs A, B and C and students StudentX and StudentY, you could maintain the following convention:
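A minimal CouchDB map function along those lines might look like this (a sketch; the doc.type and doc.userid field names are assumptions):
function (doc) {
  // one view row per comment, keyed by its author
  if (doc.type === "comment") {
    emit(doc.userid, 1);
  }
}
Querying the view with ?key="<some userid>" then returns only that user's comments, and adding the built-in _count reduce gives a per-user total.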
{ Key = {ClubA, StudentX}, Value = true },
{ Key = {ClubB, StudentX}, Value = true },
{ Key = {ClubA, StudentY}, Value = true }
and to read a relation you just list the keys in the given bucket. What's wrong with that? It is damn slow; listing keys was never a priority for Riak. It is getting better and better, though. By the way, you do not waste memory, because in this example {true} can be linked to the single full profile of StudentX or Y (here conflicts are not possible).
As you can see, NoSQL != NoSQL. You need to look at the specific implementation and test it for yourself.
The column stores mentioned before look like a good fit for relations.. but it all depends on your A, C and P needs ;) If you do not need A and you have less than petabytes of data, just leave it and go ahead with MySQL or Postgres.
Good luck
user:userid:comments is a reasonable approach - think of it as the equivalent of a column index in SQL, with the added requirement that you cannot query on unindexed columns.
This is where you need to think about your requirements. A list with 30 million items is unreasonable not because it is slow, but because it is impractical to ever do anything with it. If your real requirement is to display some recent comments, you are better off keeping a very short list that gets updated whenever a comment is added - remember that NoSQL has no normalization requirement. Race conditions are an issue with lists in a basic key-value store, but generally either your platform supports lists properly, you can do something with locks, or you don't actually care about failed updates.
Same as for user comments - create an index keyword:posts
More of the same - probably a list of clubs as a property of the student, and an index on that field to get all members of a club.
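If the store happens to be MongoDB, for example, that could look like this (a sketch; someId is a placeholder):
db.students.ensureIndex({"clubs": 1})              // multikey index over the clubs array
db.students.find({"clubs": "archery"})             // all students in a club
db.students.find({"userid": someId}, {"clubs": 1}) // all clubs one student is in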
You have
"user": {
  "userid": "unique value",
  "category": "student",
  "metainfo": "yada yada yada",
  "clubs": ["archery", "kendo"]
}
"comments": {
  "commentid": "unique value",
  "pageid": "unique value",
  "post-time": "ISO Date",
  "userid": "OP id -> THIS IS IMPORTANT"
}
"page": {
  "pageid": "unique value",
  "post-time": "ISO Date",
  "op-id": "user id",
  "tag": ["abc", "zxcv", "qwer"]
}
Well, in a relational database the normal thing to do with a one-to-many relation is to normalize the data. That is the same thing you would do in a NoSQL database as well: simply index the fields by which you will be fetching the information.
For example, the important indexes for you are:
Comment.UserID
Comment.PageID
Comment.PostTime
Page.Tag[]
If you are using NosDB (a .NET-based NoSQL database with SQL support), your queries will look like:
SELECT * FROM Comments WHERE userid = 'That user';
SELECT * FROM Comments WHERE pageid = 'That page';
SELECT * FROM Comments WHERE post-time > DateTime('2016, 1, 1');
SELECT * FROM Page WHERE tag = 'kendo';
Check all the supported query types from their SQL cheat sheet or documentation.
Although it is best to use an RDBMS in such cases instead of NoSQL, one possible solution is to maintain additional nodes or collections to manage mappings and indexes. It may have an additional cost in the form of extra collections/nodes and processing, but it will give a solution that is easy to maintain and avoids data redundancy.
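For example, a small mapping collection for "all of a user's comments" might be maintained like this (a sketch with made-up names; userId and commentId are placeholders):
// keep the mapping up to date whenever a comment is written
db.userComments.update(
  {"_id": userId},
  {$push: {commentIds: commentId}},
  {upsert: true}
)
// and resolve it with one extra query when rendering the page
var mapping = db.userComments.findOne({"_id": userId})
db.comments.find({"_id": {$in: mapping.commentIds}})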