cassandra schema data design for many-to-many array relationship - nosql

So I need a DB that can store info for about 300 million users. Each user will have two vectors: their 5 favorite items, and their 5 most similar users (these users also contained in the user set)
ex:
preferences users
user | item user | user
-------------- --------------
user1 | item1 user1 | user2
user1 | item2 user1 | user4
user1 | item3 user2 | user8
user2 | item3 . . .
user2 | item4
. . .
So basically I need two tables, both many-many relationships, and both relatively big.
Ive been exploring cassandra (but im open to other solutions) and I was wondering how I would define the schema, and what type of indexing I need for this to be optimized and working properly.
I will need to query in two fashions:
1.By user of course, and
2. by whatever item is in their list.
(so i can get a list of users with the same favorite item)
Ive already set up cassandra and started messing with it but I cant even get lists to work because i need 'composite' primary keys? I dont understand why.
Any help/a push in the right direction is greatly appreciated.
Thanks!

I am not sure you've adequately described your use case. It is the access patterns that first and foremost define your key design, which is ultimately what defines your workload characteristics with NoSQL databases. For example, are you going to have to do searches for users based on a certain geography or something along those lines or is this just simple , grab 1 user and his favorite items and/or his similar users.
Based on what you've described, you should probably just create a keyspace for user_ids and then your value can be the denormalized copies of "favorite items" and a list of "similar user id's". Assuming your next action is to do something with those similar users, you can quickly get them from the list of id's.
The important point is how big is your key ( i mean in characters / bytes ) and will you be able to fit them into memory so you get really fast performance. If your machines have limited memory for your key size, then you need to plan for a number of nodes which can accommodate a given number of keys and let those nodes run on separate servers. At least that is the most important part for Oracle NoSQL Database (ONDB) .... I am part of that team. Good news is that 300M is still very small.
Hope it helps,
-Robert

Related

Nicely managed lookup tables

We have a people table, each person has a gender defined by a gender_id to a genders table,
| people |
|-----------|
| id |
| name |
| gender_id |
| genders |
|---------|
| id |
| name |
Now, we want to allow people to create forms by themselves using a nice form builder. One of the elements we want to add is a select list with user defined options,
| lists |
|-------|
| id |
| name |
| list_options |
|--------------|
| id |
| list_id |
| label |
| value |
However, they can't use the genders as a dropdown list because it's in a different table. They could create a new list with the same options as genders but this isn't very nice and if a new gender is added they'd need to add it in multiple places.
So we want to move the gender options into a list that the user can edit at will and will be reflected when a new person is created too.
What's the best way to move the genders into a list and list_options while still having a gender_id (or similar) column in the people table? Thoughts I've had so far include;
Create a 'magic' list with a known id and always assume that this contains the gender options.
Not a great fan of this because it sounds like using 'magic' numbers. The code will need some kind of 'map' between system level select boxes and what they mean
Instead of having a 'magic' list, move it out into an option that the user can choose so they have a choice which list contains the genders.
This isn't really much different, but the ID wouldn't be hardcoded. It would require more work looking through DB tables though
Have some kind of column(s) on the lists table that would mark it as pulling its options from another table.
Would likely require a lot more (and more complex) code to make this work.
Some kind of polymorphic table that I'm not sure how would work but I've just thought about and wanted to write down before I forget.
No idea how this would work because I've only just had the idea
The easiest solution would change your list_options table to a view. If you have multiple tables you need have a list drop down for to pull from this table, just UNION result sets together.
SELECT
(your list id here) -- make this a part primary key
id, -- and this a part primary key
Name,
FROM dbo.Genders
UNION
SELECT
(your list id here) -- make this a part primary key
id, -- and this a part primary key
Name,
FROM dbo.SomeOtherTable
This way it's automatically updated anytime the data changes. Now you are going to want to test this, as if this gets big it might get slow, you can get around this by only pulling all this information once in your application (or say cache it for 30 minutes and then refresh just in case).
Your second option is to create a table list_options and then create a procedure (etc.) which goes through all the other lookup tables and pulls the information to compile it. This will be faster for application performance, but it will require you to keep it all in sync. The easiest way to handle this one is to create a series of triggers which will rebuild portions (or the entire) list_options table when something in the look up tables is changed. In this one, I would suggest moving away from creating a automatically generated primary key and move to a composite key, like I mentioned with the views. Since this is going to be rebuilt, the id will change, so it's best to not having anything think that value is at all stable. With the composite (list_id,lookup_Id) it should always be the same no matter how many times that row is inserted into the table.

Use form to create multiple fields in Access 2010

So I have a form I have Vendors fill out when they want to ship to us. It's an excel form that I then import into Access so I can run reports. Sometimes when they send the form back it's in a format in which I have to manually enter the data into our database.
The form looks like this:
The middle section is just for example purposes so it's a rectangle with text in it.
So everything seemed simple enough until I got to the middle section. See in my excel form I have a section for multiple PO's and units. So essentially each shipment can have one to many PO's and Units. Currently I can approach this task with the redundant method of reentering information per PO on the form. But I want to make this simple.
So the task at hand is that I want to have a form field for PO's and Units where I can input multiple lines of information so that when I hit a submit button. It appears in the database on separate lines with the same vendor information.
So if I filled out my form had this in the middle section:
PO | Units
111111 22
222222 33
333333 44
When I hit submit I want it to attach the rest of the forms information to each PO on separate lines so it'd be like:
Vendor | City | State | PO | Units
Nike Memphis TN 111111 22
Nike Memphis TN 222222 33
Nike Memphis TN 333333 44
So how would I go about accomplishing this task?
From your description of the problem and your example of how the data appears to ultimately be stored in Access it looks to me like you are using Access as a spreadsheet and not as a database. This is ok, but you might want to consider normalizing the data to take advantage of the power of databases in general.
For example:
Create a Vendors table whose sole purpose is to keep details about each Vendor you work with. A very basic implementation would have an ID field to uniquely identify each vendor and a Name field for the vendor name.
If Vendors will only ever have a single location you could also store City, State, ZipCode and Email in this same Vendor table, but I suspect having a separate VendorLocation or VendorAddress table would be a better fit long term.
Create a VendorShipment table that tracks the higher level information on your mockup, such as:
ShipmentID (primary key of this table)
VendorID (foreign key back to Vendor table)
Ready Date
Carrier
Estimated Cost
FreightClass
Tracking #
Estimated Transit Time
Finally, create a VendorShipmentDetail table that tracks the information of each shipment, including:
ShipmentDetailID (primary key of this table)
ShipmentID (foreign key back to VendorShipment table)
PO
Units
Any other details that you want to or need to track
Organizing and storing the data in a normalized fashion would ultimately help simplify your data entry \ data management process and potentially make for a better user experience.
For example, rather than having to enter the Vendor Name, Address information, etc. each time you could instead use a combo box control that is tied to the Vendor table. If the Vendor exists in the table you select it from the list and you already have the Address information, no need to re-enter it each time. If the Vendor did not already exist you enter it once (probably on a Vendor screen where you maintain the details for each Vendor) and draw upon the information in the future.
You would then use queries to tie the information back together for reporting purposes (de-normalize the information).
The art of database design can take a while to pick up, but a good starting point might be to check out the Northwind database that Microsoft has maintained over the years. It has some examples you could draw from immediately to get a practical understanding of how to use normalization within Access. You can find more information here: http://office.microsoft.com/en-us/templates/northwind-sales-web-database-TC101114818.aspx

Need a good database design for this situation

I am making an application for a restaurant.
For some food items, there are some add-ons available - e.g. Toppings for Pizza.
My current design for Order Table-
FoodId || AddOnId
If a customer opts for multiple addons for a single food item (say Topping and Cheese Dip for a Pizza), how am I gonna manage?
Solutions I thought of -
Ids separated by commas in AddOnId column (Bad idea i guess)
Saving Combinations of all addon as a different addon in Addon Master Table.
Making another Trans table for only Addon for ordered food item.
Please suggest.
PS - I searched a lot for a similar question but cudnt find one.
Your relationship works like this:
(1 Order) has (1 or more Food Items) which have (0 or more toppings).
The most detailed structure for this will be 3 tables (in addition to Food Item and Topping):
Order
Order to Food Item
Order to Food Item to Topping
Now, for some additional details. Let's start flushing out the tables with some fields...
Order
OrderId
Cashier
Server
OrderTime
Order to Food Item
OrderToFoodItemId
OrderId
FoodItemId
Size
BaseCost
Order to Food Item to Topping
OrderToFoodItemId
ToppingId
LeftRightOrWhole
Notice how much information you can now store about an order that is not dependent on anything except that particular order?
While it may appear to be more work to maintain more tables, the truth is that it structures your data, allowing you many added advantages... not the least of which is being able to more easily compose sophisticated reports.
You want to model two many-to-many realtionships by the sound of it.
i.e. Many products (food items) can belong to many orders, and many addons can belong to many products:
Orders
Id
Products
Id
OrderLines
Id
OrderId
ProductId
Addons
Id
ProductAddons
Id
ProductId
AddonId
Option 1 is certainly a bad idea as it breaks even first normal form.
why dont you go for many-to-many relationship.
situation: one food can have many toppings, and one toppings can be in many food.
you have a food table and a toppings table and another FoodToppings bridge table.
this is just a brief idea. expand the database with your requirement
You're right, first one is a bad idea, because it is not compliant with normal form of tables and it would be hard to maintain it (e.g. if you remove some addon you would need to parse strings to remove ids from each row - really slow).
Having table you have already there is nothing wrong, but the primary key of that table will be (foodId, addonId) and not foodId itself.
Alternatively you can add another "id" not to use compound primary key.

Efficient way to model azure table storage for social networking

I have tables like this in SQL Server
Users
UserId (Unique)
Name
Age
Friends
UserId
FriendId
Topics
UserId
Subject
There can be several thousands of users. and there are several other properties in the table.
I can query to get following answers.
Give me all the friends of user "Tom".
Give me all the topics created by "Tom".
Give me all the topics created by Tom's friends that contains "abc" in the subject.
If I were to do it in Azure table storage, how do I structure my tables?
I have gone through this and this I would like someone who had more experience on modeling Azure Table storage to give some insights..
1 and 2 are pretty easy. You create two Azure tables - Friends and Topics indexed by user id (with user id in the key).
3rd one is much more difficult with Azure tables, especially "that contains 'abc' in the subject" part.
Azure tables don't support full text search. Basically it is only possible to efficiently retrieve values (or range of values) either using exact keys or using 'startswith' operator. Like "Give me all records where key is equal to 'key value'". Or "give me all records where key is greated than 'key lower bound' and is less than 'key upper bound'".
It is also possible to filter using 'startswith' by any non-key field of a record, but this will involve table scan and is not efficient. It's not possible to do similar filtering with 'contains'.
So I think you need something with full text search support here.

Storing Embedded Comments vs. Avoiding overhead in MongoDB

Let me explain my problem, and hopefully someone can offer some good advice.
I am currently working on a web-app that stores information and meta-data for a large amount of applications. For each application there could be anywhere from 10 to 100's of comments that are tied to the application and an application version id. I am using MongoDB because of a need for easy future scalability and speed. I have read that comments should be embedded in a collection for read performance reasons, but I'm not sure that this works in my case. I read on another post:
In general, if you need to work with a given data set on its own, make it a collection.
By: #kb
In my case however I don't need to work on the collection by themselves. Let me explain further. I will have a table of apps (that can be filtered) and will dynamically load entries as you scroll, or filter, through the list of apps. If I embed the comments within the application collection, I am sending ALL the comments when I dynamically load the application entry into the table. However, I would like to do "lazy loading" in that I only want to load the comments when the user requests to see them (by clicking on the entry in the table).
As an example, my table might look like the following
| app name | version | rating | etc. | view comments |
------------------------------------------------------
| app1 | v.1.0 | 4 star | etc. | click me! |
| app2 | v.2.4.5 | 3 star | etc. | click me! |
| ...
My question is what would be more efficient? Are reads fast enough on MongoDB that it really doesn't matter that I am pulling all the comments with each application? If a user did not filter any of the applications and scrolled all the way to the bottom, they might load somewhere between 125k to 250k entries/applications.
I would suggest thinking more specifically about your query - you specify which parts of an object you'd like to return. This should allow you to avoid the overhead of getting a bunch of embedded comments when you're only interested in displaying some specific bits of information about the application.
You can do something like: db.collection.find({ appName : 'Foo'}, {comments : 0 }); to retrieve the application object with appName Foo, but specifically exclude the comments object (more likely array of objects) embedded within it.
From the MongoDB docs
Retrieving a Subset of Fields
By default on a find operation, the entire document/object is returned. However we may also request that only certain fields are returned. Note that the _id field is always returned automatically.
// select z from things where x=3
db.things.find( { x : 3 }, { z : 1 } );
You can also remove specific fields that you know will be large:
// get all posts about mongodb without comments
db.posts.find( { tags : 'mongodb' }, { comments : 0 } );
EDIT
Also remember the limit(n) function to retrieve only n apps at a time. For instance, getting n=50 apps without their comments would be:
db.collection.find({}, {comments : 0 }).limit(50);