I have some simple transaction-style data in a flat format like the following:
{
'giver': 'Alexandra',
'receiver': 'Julie',
'amount': 20,
'year_given': 2015
}
There can be multiple entries for any 'giver' or 'receiver'.
I mostly query this data by the 'giver' field, then split the results up by year, so I would like to speed up these queries.
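For concreteness, a typical query I run (assuming the collection is called transactions; the name is only illustrative) looks like:
db.transactions.find({ giver: 'Alexandra', year_given: 2015 })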
I'm fairly new to Mongo so I'm not sure which of the following methods would be the best course of action:
1. Restructure the data into the format:
{
'giver': 'Alexandra',
'transactions': {
'2015': [
{
'receiver': 'Julie',
'amount': 20
},
...
],
'2014': ...,
...
}
}
This makes the most sense to me. We place all of a person's transactions into subdocuments rather than having transactions scattered all over the collection. It stores the data in the form I query it by most, so it should be fast to query by 'giver' and then 'transactions.year'.
I'm unsure whether restructuring the data like this is possible inside of Mongo (a sketch of one way follows), or whether I should export it and reshape it outside of Mongo with some programming language.
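For what it's worth, this kind of reshaping can be done inside Mongo with the aggregation framework. A minimal sketch, assuming the collection is named transactions and keeping the years in an array rather than as object keys (which is easier to express in a pipeline):
db.transactions.aggregate([
    // collect transactions per giver and year
    { $group: {
        _id: { giver: '$giver', year: '$year_given' },
        transactions: { $push: { receiver: '$receiver', amount: '$amount' } }
    }},
    // then collect each giver's years into a single document
    { $group: {
        _id: '$_id.giver',
        years: { $push: { year: '$_id.year', transactions: '$transactions' } }
    }},
    // write the reshaped documents out to a new collection
    { $out: 'transactions_by_giver' }
])
Note that $out requires MongoDB 2.6 or later.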
2. Simply index by 'giver'
This doesn't quite match the way I'm querying the data (by 'giver', then 'year'), but it might be fast enough for what I'm looking for. It's simple to do within Mongo and doesn't require restructuring the data.
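For reference, creating the index is a one-liner; a compound index covering both fields matches the giver-then-year pattern more closely than an index on 'giver' alone (collection name assumed, as above):
db.transactions.createIndex({ giver: 1, year_given: 1 })
Queries that filter on giver alone, or on giver plus year_given, can both use this index.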
How should I go about adjusting my database to make my queries faster? And which way is the 'Mongo way'?
Related
I have a collection subscribers.
I want to get a segment of subscribers by applying sometimes complex filters in the query db.subscribers.find({ age: { $gt: 20 }, ...etc }), but I don't want to save the result, since that would be inefficient.
Instead, I would like to save only the filters applied in the query as a set of rules in the segments collection.
Is that a good approach and what would be an efficient way to do that?
Should I just save the query object itself as a document or define a more restrictive schema before saving?
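As a sketch of the first idea (saving the raw query object; the collection layout and names here are assumptions), storing the filter and replaying it later could look like:
// store the filter itself as a rule document in 'segments'
db.segments.insertOne({
    name: 'adults',
    filter: { age: { $gt: 20 } },
    createdAt: new Date()
})
// later, load the rule and re-run it against 'subscribers'
var segment = db.segments.findOne({ name: 'adults' });
db.subscribers.find(segment.filter)
One caveat: older MongoDB versions reject $-prefixed field names in stored documents, in which case you would need to encode the operators (for example, store 'gt' and rebuild '$gt' before querying). A more restrictive schema also makes the rules easier to validate, at the cost of re-implementing part of the query language.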
Indexing helps when fetching particular records, sorting, ordering, and so on. But suppose a collection contains a very large number of documents and fetching and displaying all of them is slow. How can such a query be made faster with indexing? Is indexing even applicable here? If it is, is it the best approach, or is there another way?
EDIT 1
Since indexing can't be used in my case, what is the most efficient way to write a query that fetches millions of records?
EDIT 2
This is my Mongoose query function for fetching data from a collection. If the collection holds millions of documents, performance will obviously suffer, so how would you use indexing in this case to get good performance?
Info.statics.findAllInfo = function (callback) {
    this.aggregate([
        // reshape each document, renaming fields and formatting the timestamp
        { $project: {
            name: '$some_name',
            age: '$some_age',
            city: '$some_city',
            state: '$some_state',
            country: '$some_country',
            zipcode: '$some_zipcode',
            time: { $dateToString: { format: '%Y-%m-%d %H:%M:%S', date: '$some_time' } }
        }},
        // newest documents first (_id embeds the insertion timestamp)
        { $sort: { _id: -1 } }
    ], callback);
};
I haven't tried the lean() method yet due to a temporary issue, but I would still like to know whether it will help.
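One thing worth trying regardless of lean(): a $sort stage at the start of a pipeline can use an index, while a $sort placed after a $project generally cannot (unless the optimizer happens to move it). A sketch of the same function with the stages swapped, so the sort can lean on the default { _id: 1 } index:
Info.statics.findAllInfo = function (callback) {
    this.aggregate([
        // sort first, so the built-in _id index can satisfy it
        { $sort: { _id: -1 } },
        { $project: {
            name: '$some_name',
            age: '$some_age',
            city: '$some_city',
            state: '$some_state',
            country: '$some_country',
            zipcode: '$some_zipcode',
            time: { $dateToString: { format: '%Y-%m-%d %H:%M:%S', date: '$some_time' } }
        }}
    ], callback);
};
As for lean(): aggregation results are already plain JavaScript objects, so it mostly helps with find() queries, where skipping Mongoose document hydration saves time and memory.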
I am new to MongoDB and am trying to count distinct login users per day from an existing collection. The data in the collection looks like the following:
[{
_id: xxxxxx,
properties: {
uuid: '4b5b5c2e208811e3b5a722000a97015e',
time: ISODate("2014-12-13T00:00:00Z"),
type: 'login'
}
}]
Due to my limited knowledge, what I have figured out so far is to group by day first, output the data to a temporary collection, and then run another map-reduce on that temporary collection to produce the final collection. This solution makes my collections bigger, which I don't really like. Can anyone help me out, or point me to good/more advanced tutorials I can follow? Thanks.
Rather than a map-reduce, I would suggest an aggregation. You can think of an aggregation as somewhat like a Linux pipe, in that you can pass the results of one operation to the next. With this strategy, you can perform two consecutive groups and never have to write anything to the database.
Take a look at this question for more details on the specifics.
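Against the documents shown above, the two consecutive groups could look like this sketch (the first pass keeps one document per day/uuid pair, the second counts them; $dateToString needs MongoDB 3.0+, and on older versions the $year/$month/$dayOfMonth operators work too):
db.collection.aggregate([
    // only consider login events
    { $match: { 'properties.type': 'login' } },
    // first group: one document per (day, uuid) pair
    { $group: {
        _id: {
            day: { $dateToString: { format: '%Y-%m-%d', date: '$properties.time' } },
            uuid: '$properties.uuid'
        }
    }},
    // second group: count the distinct uuids within each day
    { $group: { _id: '$_id.day', distinctLogins: { $sum: 1 } } },
    { $sort: { _id: 1 } }
])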
Background
I am storing table rows as MongoDB documents, with each column having a name. Let's say the table has these columns of interest: Identifier, Person, Date, Count. The MongoDB document also has some extra fields separate from the table data, represented by a timestamp. Columns are not fixed (which is why I use a schema-free database to store them in the first place).
There will be a need to run various complex, but so far unspecified, queries. I am not very concerned about performance, though query performance may conceivably become a bottleneck. Once inserted, documents will not be modified (a new document with the same Identifier will be created instead), and insertions are not very frequent (let's say 1000 new MongoDB documents per day), so the amount of data will steadily grow over time.
Example
The straightforward approach is having a collection of MongoDB documents like:
{
_id: XXXX,
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
data: {
Identifier: "AB002",
Person: "John002",
Date: ISODate("2013-11-16T21:26:17Z"),
Count: 1
}
}
Now I have seen an alternative approach (for example, in the accepted answer of this question), using an array with two fields per object:
{
_id: XXXX,
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
data: [
{ field: "Identifier", value: "AB002" },
{ field: "Person", value: "John001" },
{ field: "Date", value: ISODate("2013-11-16T21:26:17Z") },
{ field: "Count", value: 1 }
]
}
Questions
Does the 2nd approach make any sense at all?
If yes, then how do I choose which to use? In particular, are there specific kinds of queries that are easy/cheap with one approach but hard/costly with the other? Any rules of thumb on which way to go, or pro/con lists for both? Real-life examples of one approach being inconvenient would be especially valuable.
In your specific example the first version is a lot more appropriate and simpler. You have to think in terms of how you would query your documents.
It is a lot simpler to query your database like this: db.collection.find({"data.Identifier": "AB002"})
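With the second (field/value array) layout, the same lookup needs $elemMatch so that both conditions apply to the same array element:
db.collection.find({
    data: { $elemMatch: { field: 'Identifier', value: 'AB002' } }
})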
Although I'm not 100% sure why you even need the inner document. Why can't you structure your document like this:
{
_id: "AB002",
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
Person: "John002",
Date: ISODate("2013-11-16T21:26:17Z"),
Count: 1
}
Pros of first example:
Simple to query
Enforces unique keys, but your data won't have two columns with the same name anyway
I would assume MongoDB would generate better query plans because the structure is a lot simpler (I haven't tested this)
Pros of second example:
Allows multiple entries with the same key/field, but I don't feel that is useful in your case
A single index on the array can be used for all of its entries regardless of their field name (see the index sketch below)
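That single index would look like the following; one compound index then serves equality queries on any field/value pair:
db.collection.createIndex({ 'data.field': 1, 'data.value': 1 })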
I don't think the situation in the other example and yours are the same. In the other example, they're creating a list of items, each with one of two answers, which is more appropriately stored in an array, and the goal is to return the list of subdocuments that match the criteria. In your example, you're really just describing an object, since the fields all hold different types of information, and you won't need to retrieve searchable bits of the subdocuments.
I am relatively new to MongoDB, and so far I am really impressed. I am struggling with the best way to set up my document stores, though. I am trying to do some summary analytics using Twitter data, and I am not sure whether to put the tweets into the user document or to keep them as a separate collection. It seems like putting the tweets inside the user model would quickly hit the limit with regard to size. If that is the case, what is a good way to be able to run MapReduce across a group of users' tweets?
I hope I am not being too vague but I don't want to get too specific and too far down the wrong path as far as setting up my domain model.
As I am sure you are all bored of hearing, I am used to RDB land, where I would lay out my schema like:
| USER   |
----------
| ID     |
| Name   |
| Etc.   |

| TWEET  |
----------
| ID     |
| UserID |
| Etc.   |
It seems like the logical schema in Mongo would be
User
  |- Tweet (0..3000)
  |    |- Entities
  |    |    |- Hashtags (0..10+)
  |    |    |- urls (0..5)
  |    |    |- user_mentions (0..12)
  |    |- GeoData (0..20)
  |- somegroupID
but wouldn't that quickly bloat the User document beyond capacity? I would like to run analysis on tweets belonging to users with a similar somegroupID, and it conceptually makes sense to lay the model out as above, but at what point does that become too unwieldy? And what are viable alternatives?
You're right that you'll probably run into the 16MB MongoDB document limit here. You are not saying what sort of analysis you'd like to run, so it is difficult to recommend a schema. MongoDB schemas are designed with the data-query (and insertion) patterns in mind.
Instead of putting your tweets in a user document, you can of course quite easily do the opposite: add a user-id and group-id to the tweet documents themselves. Then, if you need additional fields from the user, you can always pull them in with a second query upon display.
I mean a tweet document designed like this:
{
'hashtags': [ '#foo', '#bar' ],
'urls': [ 'http://url1.example.com', 'http://url2.example.com' ],
'user_mentions' : [ 'queen_uk' ],
'geodata': { ... },
'userid': 'derickr',
'somegroupid' : 40
}
And then for a user collection, the documents could look like:
{
'userid' : 'derickr',
'realname': 'Derick Rethans',
...
}
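The "second query upon display" would then look something like this (collection names are assumptions):
// fetch a group's tweets, then pull the matching user document
var tweets = db.tweets.find({ somegroupid: 40 }).toArray();
var user = db.users.findOne({ userid: 'derickr' });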
All credit to the fine folks at MongoHQ.com. My question was answered over on https://groups.google.com/d/msg/mongodb-user/OtEOD5Kt4sI/qQg68aJH4VIJ
Chris Winslett # MongoHQ
You will find this video interesting:
http://www.10gen.com/presentations/mongosv-2011/schema-design-at-scale
Essentially, in one document, store one day's tweets for one person. The reasoning:
Querying typically consists of days and users
Therefore, you can have the following index:
{ user_id: 1, date: 1 } # date needs to be last because you will range and sort on it
Have fun!
Chris MongoHQ
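A sketch of that one-user-one-day bucket (the collection name and document shape are my guesses, not from the thread): an upsert appends each tweet to the right bucket, creating the bucket if it doesn't exist yet.
// append a tweet to this user's bucket for the day, creating it if needed
db.tweet_buckets.update(
    { user_id: 'derickr', date: ISODate('2012-02-20T00:00:00Z') },
    { $push: { tweets: { text: 'MongoDB is pretty sweet' } } },
    { upsert: true }
)
// the supporting index, with date last so you can range and sort on it
db.tweet_buckets.ensureIndex({ user_id: 1, date: 1 })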
I think it makes the most sense to implement the following:
user
{ user_id: 123123,
screen_name: 'cledwyn',
misc_bits: {...},
groups: ['123123_group_tall_people', '123123_group_techies'],
groups_in: ['123123_group_tall_people']
}
tweet
{ tweet_id: '98798798798987987987987',
user_id: 123123,
tweet_date: 20120220,
text: 'MongoDB is pretty sweet',
misc_bits: {...},
groups_in: ['123123_group_tall_people']
}