Let me explain my problem, and hopefully someone can offer some good advice.
I am currently working on a web app that stores information and metadata for a large number of applications. Each application can have anywhere from tens to hundreds of comments, each tied to the application and an application version id. I am using MongoDB because of a need for easy future scalability and speed. I have read that comments should be embedded in the parent document for read performance reasons, but I'm not sure that works in my case. I read on another post:
In general, if you need to work with a given data set on its own, make it a collection.
By: #kb
In my case, however, I don't need to work with the comments by themselves. Let me explain further. I will have a table of apps (that can be filtered) and will dynamically load entries as you scroll or filter through the list of apps. If I embed the comments within each application document, I am sending ALL of the comments every time I dynamically load an application entry into the table. Instead, I would like to do "lazy loading": only load the comments when the user asks to see them (by clicking on the entry in the table).
As an example, my table might look like the following
| app name | version | rating | etc. | view comments |
------------------------------------------------------
| app1     | v.1.0   | 4 star | etc. | click me!     |
| app2     | v.2.4.5 | 3 star | etc. | click me!     |
| ...
My question is: which would be more efficient? Are reads fast enough in MongoDB that it really doesn't matter that I am pulling all the comments with each application? If a user did not filter the applications at all and scrolled all the way to the bottom, they might load somewhere between 125k and 250k entries/applications.
I would suggest thinking more specifically about your query: you can specify which parts of a document you'd like returned (a projection). This lets you avoid the overhead of pulling a bunch of embedded comments when you're only interested in displaying some specific bits of information about the application.
You can do something like db.collection.find({ appName : 'Foo' }, { comments : 0 }); to retrieve the application document with appName Foo while specifically excluding the comments field (most likely an array of subdocuments) embedded within it.
From the MongoDB docs
Retrieving a Subset of Fields
By default on a find operation, the entire document/object is returned. However we may also request that only certain fields are returned. Note that the _id field is always returned automatically.
// select z from things where x=3
db.things.find( { x : 3 }, { z : 1 } );
You can also remove specific fields that you know will be large:
// get all posts about mongodb without comments
db.posts.find( { tags : 'mongodb' }, { comments : 0 } );
EDIT
Also remember the limit(n) function to retrieve only n apps at a time. For instance, getting n=50 apps without their comments would be:
db.collection.find({}, {comments : 0 }).limit(50);
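Putting the pieces together, here is a minimal sketch of the lazy-loading flow with the Node.js driver; the database name ("appdb"), collection name ("apps") and field names are assumptions, not anything from your schema:

// Minimal sketch of the lazy-loading pattern with the Node.js MongoDB driver.
// Database name ("appdb"), collection name ("apps") and field names are assumptions.
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const apps = client.db('appdb').collection('apps');

  // 1) Populate the table: fetch a page of apps, excluding the embedded comments.
  const page = await apps
    .find({}, { projection: { comments: 0 } })
    .limit(50)
    .toArray();

  // 2) When the user clicks "view comments", fetch only that field for one app.
  const app = await apps.findOne(
    { appName: 'app1', version: 'v.1.0' },
    { projection: { comments: 1 } }
  );

  console.log(page.length, app.comments);
  await client.close();
}

main().catch(console.error);

Whether the second query is cheap enough in practice depends on your comment volume, but it keeps the list query small regardless.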
Related
Let's say I have an app where users can make posts. I store these in a single DynamoDB table using the following design:
+--------+--------+---------------------------+
| PK     | SK     | (Attributes)              |
+--------+--------+---------------------------+
| UserId | UserId | username, profile, etc... | <-- user item
| UserId | PostId | body, timestamp, etc...   | <-- post item
+--------+--------+---------------------------+
When a user makes a post, my Lambda function receives the following data:
{
"userId": <UserId>",
"body": <Body>,
etc...
}
My question is: should I first verify that the user exists before adding the post to the table, using dynamodb.get({PK: userId, SK: userId})? This would make sure there won't be any orphaned posts, but the function would then require both a read unit and a write unit.
One idea I have is to just write the post, potentially allowing orphaned posts. Then, I could have another Lambda function that runs periodically to find and remove any orphans.
This is obviously a simple case, but imagine a more complex system where objects have multiple relationships. It seems it could easily get very costly to check for relationship existence in these cases.
"Then, I could have another Lambda function that runs periodically to find and remove any orphans." <-- This could get very expensive over time, especially if you plan to do this by scanning the table.
I develop a system built on DynamoDB that has similar relationships, and I validate relationships before saving data because I do not want to have garbage data in my tables.
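One way to do that validation and the write in a single request is a DynamoDB transaction: a ConditionCheck on the user item plus a Put for the post. A rough sketch with the Node.js SDK follows; the table name "AppTable" and the timestamp attribute are assumptions:

// Rough sketch: verify the user exists and write the post in one transaction.
// The table name "AppTable" is an assumption.
const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

async function createPost(userId, postId, body) {
  await ddb.transactWrite({
    TransactItems: [
      {
        // Fails the whole transaction if the user item is missing,
        // so an orphaned post is never written.
        ConditionCheck: {
          TableName: 'AppTable',
          Key: { PK: userId, SK: userId },
          ConditionExpression: 'attribute_exists(PK)'
        }
      },
      {
        Put: {
          TableName: 'AppTable',
          Item: { PK: userId, SK: postId, body: body, timestamp: Date.now() }
        }
      }
    ]
  }).promise();
}

The trade-off is cost: transactional writes consume roughly twice the capacity of plain writes, which is the same read-plus-write overhead the question is trying to avoid, just packaged atomically.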
One option to consider is implicitly testing for the existence of a valid user via authentication and authorization. If a user has passed your auth checks, then you know that they exist, so you can add their posts with confidence.
I'm trying to figure out how to query with filter with Geofire.
Suppose I have restaurants in different categories, and I want to add that category to my query. How do I go about this?
One way I have now is to query the keys with Geofire, loop through each key to fetch the restaurant, and insert the appropriate restaurants into an array.
This seems so inefficient. Is there any other way to go about this?
Ideally I will have the filtered results, and only load each item when they're about to be shown.
Cheers!
Firebase queries can only filter by one condition. Geofire already does quite some "magic" to allow it to filter on both longitude and latitude. Adding another property to that equation might be possible, but is well beyond what Geofire handles by default. See GeoFire: How to add extra conditions within the query?
If you only ever want to access one category at a time, you can put the restaurants in a top-level node per category and point Geofire to one category.
/category1
item1
g: "pns0h0mf2u"
l: [-53.435719, 140.808716]
item2
g: "u417k3dwub"
l: [56.83069, 1.94822]
/category2
item3
g: "8m3rz3s480"
l: [30.902225, -166.66809]
/items
item1: ...
item2: ...
item3: ...
In the above example, we have two categories: category1 with 2 items and category2 with just 1 item. For each item, we store the data that Geofire uses: a geohash and the latitude/longitude pair. We also keep a single /items list with the other properties of these 3 items.
But more commonly, you simply do the extra filtering in client-side code. If you're worried about the performance of that: measure it, share the code, JSON data and measurements.
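For reference, a rough sketch of that client-side filtering with the GeoFire JS library; the /geofire and /items paths, the "category" field, and the center/radius values are all assumptions, and the exact import form depends on your GeoFire version:

// Rough sketch of client-side category filtering on top of a GeoFire query.
// The /geofire and /items paths, the "category" field and the center/radius
// values are assumptions; adjust them to your own data model.
const firebase = require('firebase');
require('firebase/database');
const GeoFire = require('geofire'); // import form varies by GeoFire version

firebase.initializeApp({ databaseURL: 'https://<your-db>.firebaseio.com' });
const db = firebase.database();
const geoFire = new GeoFire(db.ref('geofire'));

const geoQuery = geoFire.query({
  center: [37.7749, -122.4194], // [latitude, longitude]
  radius: 10                    // kilometers
});

geoQuery.on('key_entered', function(key, location, distance) {
  // Fetch the restaurant and keep it only if it matches the category filter.
  db.ref('items').child(key).once('value', function(snapshot) {
    const restaurant = snapshot.val();
    if (restaurant && restaurant.category === 'sushi') {
      console.log(key, restaurant.name, distance);
    }
  });
});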
This is an old question, but I've seen it in a few places on the web, so I thought I might share one trick I've used.
The Problem
If you have a large collection in your database, maybe containing hundreds of thousands of keys, for example, it might not be feasible to grab them all. If you're trying to filter results based on location in addition to other criteria, you're stuck with something like:
Execute the location query
Loop through each returned geofire key and grab the corresponding data in the database
Check each returned piece of data to see if it matches the other criteria
Unfortunately, that's a lot of network requests, which is quite slow.
More concretely, let's say we want to get all users within e.g. 100 miles of a particular location that are male and between ages 20 and 25. If there are 10,000 users within 100 miles, that means 10,000 network requests to grab the user data and compare their gender and age.
The Workaround:
You can store the data you need for your comparisons in the geofire key itself, separated by a delimiter. Then, you can just split the keys returned by the geofire query to get access to the data. You still have to filter through them, but it's much faster than sending hundreds or thousands of requests.
For instance, you could use the format:
UserID*gender*age, which might look something like facebook:1234567*male*24. The important points are
Separate data points by a delimiter
Use a valid character for the delimiter -- keys "can include any unicode characters except for . $ # [ ] / and ASCII control characters 0-31 and 127."
Use a character that is not going to be found elsewhere in your database - I used *, but that might not work for you. Do not use any characters from -0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz, since those are fair-game for keys generated by firebase's push()
Choose a consistent order for the data - in this case, UserID first, then gender, then age.
You can store up to 768 bytes of data in firebase keys, which goes a long way.
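A small sketch of the decoding side of this trick, using the UserID*gender*age format and the male / ages 20-25 filter from the example above:

// Sketch: split and filter the composite keys returned by the GeoFire query.
// Uses the UserID*gender*age format and '*' delimiter from the example above.
function parseKey(key) {
  const parts = key.split('*');
  return { userId: parts[0], gender: parts[1], age: parseInt(parts[2], 10) };
}

function filterKeys(keys, wantedGender, minAge, maxAge) {
  return keys
    .map(parseKey)
    .filter(u => u.gender === wantedGender && u.age >= minAge && u.age <= maxAge);
}

// Example: keys collected from a geoQuery's key_entered events.
const keys = ['facebook:1234567*male*24', 'facebook:7654321*female*22'];
console.log(filterKeys(keys, 'male', 20, 25));
// -> [ { userId: 'facebook:1234567', gender: 'male', age: 24 } ]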
Hope this helps!
So I need a DB that can store info for about 300 million users. Each user will have two vectors: their 5 favorite items, and their 5 most similar users (these users also contained in the user set)
ex:
preferences             users
user  | item            user  | user
-------------           -------------
user1 | item1           user1 | user2
user1 | item2           user1 | user4
user1 | item3           user2 | user8
user2 | item3           ...
user2 | item4
...
So basically I need two tables, both many-many relationships, and both relatively big.
I've been exploring Cassandra (but I'm open to other solutions), and I was wondering how I would define the schema, and what type of indexing I need, for this to be optimized and working properly.
I will need to query in two fashions:
1. by user, of course, and
2. by whatever item is in their list
(so I can get a list of users who share a favorite item).
I've already set up Cassandra and started messing with it, but I can't even get lists to work because I apparently need 'composite' primary keys? I don't understand why.
Any help/a push in the right direction is greatly appreciated.
Thanks!
I am not sure you've adequately described your use case. It is the access patterns that first and foremost define your key design, which in turn defines your workload characteristics with NoSQL databases. For example, will you have to search for users based on a certain geography or something along those lines, or is it just a simple "grab 1 user and his favorite items and/or his similar users"?
Based on what you've described, you should probably just create a keyspace for user_ids and then your value can be the denormalized copies of "favorite items" and a list of "similar user id's". Assuming your next action is to do something with those similar users, you can quickly get them from the list of id's.
The important point is how big your keys are (in characters/bytes) and whether you will be able to fit them into memory so you get really fast performance. If your machines have limited memory for your key size, then you need to plan for a number of nodes that can accommodate a given number of keys and let those nodes run on separate servers. At least that is the most important part for Oracle NoSQL Database (ONDB)... I am part of that team. The good news is that 300M is still very small.
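To make the Cassandra side of the question concrete, here is a sketch of two denormalized tables, one per access pattern, created through the Node.js driver; the keyspace and table names are my own assumptions, not anything from the question:

// Sketch: two denormalized Cassandra tables, one per access pattern.
// Keyspace/table names are assumptions; assumes the "recs" keyspace already
// exists and requires the "cassandra-driver" package.
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'recs'
});

async function setup() {
  // Access pattern 1: all favorite items for a given user.
  await client.execute(
    'CREATE TABLE IF NOT EXISTS favorites_by_user (' +
    '  user_id text, item text, PRIMARY KEY (user_id, item))');

  // Access pattern 2: all users who share a given favorite item.
  await client.execute(
    'CREATE TABLE IF NOT EXISTS users_by_item (' +
    '  item text, user_id text, PRIMARY KEY (item, user_id))');
}

async function addFavorite(userId, item) {
  // Write to both tables so each query stays a single-partition read.
  await client.execute(
    'INSERT INTO favorites_by_user (user_id, item) VALUES (?, ?)',
    [userId, item], { prepare: true });
  await client.execute(
    'INSERT INTO users_by_item (item, user_id) VALUES (?, ?)',
    [item, userId], { prepare: true });
}

The (partition key, clustering column) pairs above are the 'composite' primary keys the question ran into; a similar pair of tables would cover the similar-users relationship.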
Hope it helps,
-Robert
I am using CouchDB 1.1.1 for my web app -- everything has worked great so far (saving/retrieving documents, saving/querying views, etc.), but I am stuck on querying a view for a particular key at a particular group level.
The map function in my view emits keys with the following format: ["Thing 1", "Thing 2"]. I have a reduce function which works fine and outputs correct values for group level 1 (i.e. by "Thing 1") and for group level 2 (i.e. by "Thing 2").
Now -- when I query CouchDB I CAN grab just one particular key when I set reduce=true (the default), group_level=2 (or group=true, which is the same in this case since I only have 2 levels), and key="desiredkeyhere". I can also query multiple keys with keys=["key1", "key2"].
HOWEVER -- I really want to be able to grab a particular key at group_level=1, and I cannot get that to work. It seems to return nothing, or, if I use a POST request, it returns everything. Never just the one key that I need.
Here's a link to the CouchDB HTTP view API (querying options) that I've been using:
http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
It contains the following sentence:
"Note: Multiple keys request to a reduce function only supports group=true and NO group_level (identical to group_level=exact). The resulting error is "Multi-key fetchs for reduce view must include group=true""
I'm not sure if this means that I cannot do what I described above (grab a particular key at a particular group_level). That would seem like a huge limitation in CouchDB, so I'm assuming I'm doing something wrong.
Any ideas? Thanks
I have hit this too. I am not sure if it is a bug, though.
Try using your startkey and endkey in the normal (2-item) format. You want a result for ["Thing 1", *] (obviously pseudocode, the star represents anything). Reducing with group_level=1 will boil all of that down to one row.
So, query basically everything in the Thing 1 "namespace," so to speak. Since the "smallest" value to collate is null and the "greatest" value is the object {}, those make good bookends for your range.
?group_level=1&startkey=["Thing 1",null]&endkey=["Thing 1",{}]
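In case the URL encoding is what's tripping things up, here is a small sketch of the same request made from code; the database name ("mydb") and the design document/view names are placeholders:

// Sketch: the same group_level=1 range query over HTTP.
// The database ("mydb") and design doc/view names are placeholders.
const startkey = encodeURIComponent(JSON.stringify(['Thing 1', null]));
const endkey = encodeURIComponent(JSON.stringify(['Thing 1', {}]));

const url = 'http://localhost:5984/mydb/_design/myapp/_view/myview' +
            '?group_level=1&startkey=' + startkey + '&endkey=' + endkey;

// Node 18+ has fetch built in; older versions can use any HTTP library instead.
fetch(url)
  .then(function(res) { return res.json(); })
  .then(function(body) { console.log(body.rows); }); // one reduced row for "Thing 1"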
Does that give you the result you need?
I have a User model object with quite a few fields (properties, if you wish) in it. Say "firstname", "lastname", "city", and "year-of-birth". Each user also gets a "unique id".
I want to be able to search by them. How do I do that properly? How to do that at all?
My understanding (this will work for pretty much any key-value storage -- key first, then value):
u:123456789 = serialized_json_object
("u" as a simple prefix for user's keys, 123456789 is "unique id").
Now, thinking that I want to be able to search by firstname and lastname, I can save in:
f:Steve = u:384734807,u:2398248764,u:23276263
f:Alex = u:12324355,u:121324334
so the key is "f" -- the prefix for firstnames -- plus "Steve", the actual firstname.
For "f:Steve" we save as the value all user ids of users named "Steve".
That makes every search very, very easy. Querying by a few fields (properties) -- say by firstname (i.e. "Steve") and lastname (i.e. "l:Anything") -- is still easy: first get the list of user ids from "f:Steve", then the list from "l:Anything", intersect the two lists of user ids, and there you go.
Problems (and there are quite a few):
Saving, updating, and deleting a user is a pain. It has to be an atomic and consistent operation. Also, if the value size is limited to some maximum, we are in (potential) trouble -- and I really don't have an answer here. Only compress the list of user ids? Not too cool, though.
What if we want to add a new field to search by, eventually? Say by "city". We can certainly do it the same way -- "c:Los Angeles" = ..., "c:Chicago" = ... -- but if we didn't foresee all those "search choices" from the very beginning, then we will have to create some nightly job or something to go over all existing User records and populate those "c:CITY" keys for them... Quite a big job!
Problems with locking. User "u:123" updates his name to "Alex", and user "u:456" updates his name to "Alex". They both have to update "f:Alex" with their ids. That means either we run into an overwrite problem, or one update has to wait for the other (and imagine if there are many of them?!).
What's the best way of doing that? Keeping in mind that I want to search by many fields?
P.S. Please, the question is about HBase/Cassandra/NoSQL/key-value storages. Please, please -- no advice to use MySQL and "read about" SELECTs, and to worry about scaling problems "later". There is a reason why I asked MY question exactly the way I did. :-)
Being able to query properties directly is one of the features you lose when moving away from SQL, so you need a way to maintain your own index to let you find records.
If your datastore does not have built in indexing or atomic list operations, you will need to deal with the locking issues you mention. However, indexing doesn't necessarily need to be synchronous - maintain a queue of updated records to be reindexed and you have a solution for 3 that can be reused to solve 2 also.
If the index list for a particular value becomes too large for the system to handle in a single list, you can replace the list of users with a list of lists. However, if you have that many records with the same value it probably isn't a particularly useful search criteria anyway.
Another option that is useful in some cases is to use a separate system for the indexing -- for example, you could set up Lucene to index the records in your main datastore.
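As an illustration of the queue-based reindexing idea mentioned above, here is a datastore-agnostic sketch; the "store" object stands in for whatever key-value client you use and is purely an assumption:

// Datastore-agnostic sketch of asynchronous index maintenance via a queue.
// The "store" object stands in for your real key-value client (assumption).
const reindexQueue = [];

// On every create/update of a user: write the record itself, then enqueue it
// for indexing instead of updating the index entries synchronously.
async function saveUser(store, user) {
  await store.put('u:' + user.id, JSON.stringify(user));
  reindexQueue.push(user.id);
}

// Background worker: drains the queue and rebuilds the index entries
// (f:<firstname>, l:<lastname>, c:<city>) for each changed user.
async function reindex(store) {
  while (reindexQueue.length > 0) {
    const id = reindexQueue.shift();
    const user = JSON.parse(await store.get('u:' + id));
    const entries = [['f', user.firstname], ['l', user.lastname], ['c', user.city]];
    for (const [prefix, value] of entries) {
      const key = prefix + ':' + value;
      const ids = new Set(JSON.parse((await store.get(key)) || '[]'));
      ids.add(id);
      await store.put(key, JSON.stringify([...ids]));
    }
  }
}

// Trivial in-memory stand-in for the real store, just to show the flow.
const store = {
  data: new Map(),
  get: async function(k) { return this.data.get(k); },
  put: async function(k, v) { this.data.set(k, v); }
};

(async () => {
  await saveUser(store, { id: '123', firstname: 'Steve', lastname: 'Smith', city: 'Chicago' });
  await reindex(store);
  console.log(await store.get('f:Steve')); // ["123"]
})();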
I guess I would have implemented this as a MapReduce job that runs on a schedule.
Each search word would be a row key with a lookup to the UID.
Rowkey: uid1
profile:firstName: Joe
profile:lastName: Doe
profile:nick: DoeMaster
Rowkey: uid2
profile:firstName: Jane
profile:lastName: Doe
profile:nick: SuperBabe
MapReduce indexes all searchable properties and adds them, with the search word as the row key:
Rowkey: Jane
lookup:uid: uid2
Rowkey: Doe
lookup:uid: uid2, uid1
Rowkey: DoeMaster
lookup:uid: uid1
..etc
Now, if you need to update the index on the fly as a user changes, you would write the change directly to the index table, removing the uid value from one row key and adding it to another. In case this happens concurrently, temporary locking could be implemented.
For users being removed, an additional attribute describing the state of the user could be used to filter them out of searches.
Adding an additional search word isn't very hard, since it's just a matter of which name:value pairs you want to index. You could also narrow searches further by adding a type attribute to your row key/keyword, e.g. boston - lookup:type: city.
The idea is to maintain your own row-key-based search index inside HBase.
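A tiny sketch of that key inversion, kept client-agnostic since how you apply the puts depends on your HBase client; the field names mirror the example rows above:

// Sketch of the index inversion described above: given a user row, produce the
// (rowKey, column, value) puts for the search-index table. Applying them is
// left to whatever HBase client you use.
function buildIndexPuts(uid, profile) {
  const puts = [];
  for (const field of Object.keys(profile)) {
    const value = profile[field];
    if (!value) continue;
    puts.push({
      rowKey: String(value),   // e.g. "Jane", "Doe", "SuperBabe"
      column: 'lookup:uid',
      value: uid               // appended to any uids already stored in the row
    });
  }
  return puts;
}

// Example, mirroring the rows shown above:
console.log(buildIndexPuts('uid2', {
  firstName: 'Jane',
  lastName: 'Doe',
  nick: 'SuperBabe'
}));
// -> puts for row keys "Jane", "Doe" and "SuperBabe", each with lookup:uid = uid2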