In the Hierarchical Dirichlet Process paper, the authors give an interpretation of the HDP in terms of the Chinese Restaurant Franchise. There, each restaurant has many tables, and different tables within one restaurant may share a common dish. If the dish is a topic in a document, how should we understand the tables in each document? I would think different tables should order different dishes; if two tables serve the same dish, why not merge them into one? Thanks a lot.
In the Chinese Restaurant Franchise (CRF), each document is a restaurant, each word is a customer, and the cluster parameters are dishes served to tables from a global menu. A customer enters a restaurant and sits at an existing table with probability proportional to the number of customers already at that table, or sits at a new table with probability proportional to alpha. A new table is then assigned a particular dish with probability proportional to the number of tables (across all restaurants) already serving that dish, or a new dish with probability proportional to gamma.
Thus, for every customer we have an index that maps the customer to a table, and for every table an index that maps the table to one of the dishes. Because each new table draws its dish independently from the global menu, two tables in the same restaurant can end up serving the same dish; they are kept separate rather than merged because the number of tables serving a dish is exactly what drives the franchise-level dish counts. A Gibbs sampling algorithm first samples the tables associated with the data, and then the dishes associated with each table. For more details see Yee Whye Teh's implementation.
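The two-level seating scheme described above can be sketched in a few lines of Python. This is a toy illustration, not Teh's implementation; `alpha` and `gamma` are the concentration parameters named above, and all variable and function names are mine:

```python
import random

def seat_customer(table_counts, dish_of_table, global_dish_counts,
                  alpha=1.0, gamma=1.0, rng=random):
    """Seat one customer in one restaurant of the franchise.

    table_counts[t]        -- customers already at table t (this restaurant)
    dish_of_table[t]       -- dish served at table t (this restaurant)
    global_dish_counts[d]  -- tables serving dish d across ALL restaurants
    """
    # Existing table w.p. proportional to occupancy, new table w.p. prop. alpha.
    weights = table_counts + [alpha]
    t = rng.choices(range(len(weights)), weights=weights)[0]
    if t < len(table_counts):            # joined an existing table
        table_counts[t] += 1
        return t, dish_of_table[t]
    # New table: existing dish w.p. proportional to its global table count,
    # or a brand-new dish w.p. proportional to gamma.
    dishes = list(global_dish_counts)
    dish_weights = [global_dish_counts[d] for d in dishes] + [gamma]
    k = rng.choices(range(len(dish_weights)), weights=dish_weights)[0]
    dish = dishes[k] if k < len(dishes) else max(dishes, default=-1) + 1
    table_counts.append(1)
    dish_of_table.append(dish)
    global_dish_counts[dish] = global_dish_counts.get(dish, 0) + 1
    return len(table_counts) - 1, dish
```

The very first customer in an empty restaurant necessarily opens a new table and, if the menu is empty, a new dish, which is how new topics are born.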
As the DynamoDB documentation says, it is recommended that we use only one table to model all our entities.
You should maintain as few tables as possible in a DynamoDB application. Most well-designed applications require only one table.
Now suppose that we have a product and a user entity; using only one table, we have a schema like this:
In DynamoDB, it's recommended that we keep related data together; that's why the user data is "duplicated" on the product entry.
My question is: if one day I update the user's name, will DynamoDB update the copy of that user on my product entry automatically, or does this kind of update have to be made manually?
In DynamoDB, it is recommended to keep items denormalized to get the benefits of DynamoDB. That said, the table should be designed around the application's access patterns: the goal is to fetch, from the single table, everything needed to assemble an entity with all its relationships satisfied. So the table's attributes can hold values from related entities, or just the key values that keep the connection to those related entities.
In the above scenario, we can keep the user details in one item and, when creating the product item, store the user's primary key in it. Then, if the username or other user details change in the future, there won't be any problem.
In DynamoDB, using a sort key keeps related items together. There is also a provision for composite sort keys to handle one-to-many relations.
Sharing the best practices for using sort keys:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html
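To make the one-table idea concrete, here is a toy sketch in plain Python of a common single-table layout. The `PK`/`SK` attribute names and the `USER#`/`PRODUCT#` key prefixes are a popular convention, not anything DynamoDB requires. Note `owner_name`: it is a denormalized copy, and DynamoDB will not keep such copies in sync for you; the application (or a DynamoDB Streams consumer) has to update them, which is why the answers above suggest storing only the user's key.

```python
# A user item and that user's product items share one partition key,
# so a single Query can fetch the user together with its products.
table = [
    {"PK": "USER#42", "SK": "PROFILE", "name": "Alice"},
    {"PK": "USER#42", "SK": "PRODUCT#p-001", "title": "Hard Drive",
     "owner_name": "Alice"},  # denormalized copy of the user's name
    {"PK": "USER#42", "SK": "PRODUCT#p-002", "title": "Monitor",
     "owner_name": "Alice"},
]

def query(items, pk, sk_prefix=""):
    """Mimics a Query with 'PK = :pk AND begins_with(SK, :prefix)'."""
    return [i for i in items if i["PK"] == pk and i["SK"].startswith(sk_prefix)]

user_and_products = query(table, "USER#42")          # profile + 2 products
products_only = query(table, "USER#42", "PRODUCT#")  # just the products
```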
While reading the article http://blog.mongodb.org/post/88473035333/6-rules-of-thumb-for-mongodb-schema-design-part-3 (chapter "Rules of Thumb: Your Guide Through the Rainbow"), I came across the words embedding and denormalizing.
One: favor embedding unless there is a compelling reason not to
Five: Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization: if you denormalize a field that is updated frequently then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.
I know embedding is nesting documents instead of writing separate tables/collections.
But I have no clue what denormalizing means.
Denormalization is the opposite of normalization, a good practice for designing relational databases like MySQL, where you split one logical dataset into separate tables to reduce redundancy.
Because MongoDB does not support joins across tables, you may prefer to duplicate an attribute into two collections if the attribute is read often and updated rarely.
E.g., you want to save data about a person, including the person's gender:
In a SQL database you would create a person table and a gender table, store a foreign key to the gender's ID in the person table, and perform a join to get the full information. This way, "male" and "female" each exist only once as a value.
In MongoDB you would save the gender in every person document, so the values "male" and "female" can exist many times, but the lookup is faster because the data is held together in one object of a single collection.
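A minimal sketch of the two approaches, using plain Python structures to stand in for the tables and collections (all names are illustrative):

```python
# Normalized (SQL-style): gender values live once in their own table;
# reading a person's gender goes through the foreign key (the "join").
genders = {1: "male", 2: "female"}
persons_sql = [{"id": 10, "name": "Ann", "gender_id": 2}]

def person_gender(person):
    return genders[person["gender_id"]]  # the extra lookup = the join

# Denormalized (MongoDB-style): the value is embedded in every document,
# so one read returns everything, at the cost of duplicated strings.
persons_mongo = [{"_id": 10, "name": "Ann", "gender": "female"}]
```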
I am new to MongoDB and I have difficulties implementing a solution in it.
Consider a case where I have two collections, client and sales, with the following designs:
Client
==========
id
full name
mobile
gender
region
emp_status
occupation
religion
Sales
===========
id
client_id //this would be a DBRef
trans_date //date time value
products //an array of collections of product sold in the form {product_code, description, units, unit price, amount}
total sales
Now there is a requirement to develop another collection for analytical queries, where the following questions can be answered:
What is the distribution of sales by gender, region and emp_status?
What are the most purchased products for clients in a particular region?
I considered implementing a very denormalized, flat and wide collection combining the properties of the sales and client collections, so that I could use map-reduce to answer the questions.
In an RDBMS, an aggregation backed by a join would answer these questions, but I am at a loss as to how to make map-reduce or aggregation help here.
Questions:
How do I implement Map-Reduce to map across 2 collections?
Is it possible to chain MapReduce operations?
Regards.
MongoDB does not do JOINs - period!
MapReduce always runs on a single collection. You cannot have a single MapReduce job which selects from more than one collection. The same applies to aggregation.
When you want to do some data mining (not MongoDB's strongest suit), you could create a denormalized collection of all Sales with the corresponding Client object embedded. You will have to write a little program or script which iterates over all clients and
finds all Sales documents for the client
merges the relevant fields from Client into each document
inserts the resulting document into the new collection
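The steps above can be sketched like this; plain Python lists stand in for the clients, sales and new collections, and in a real script you would use the driver's find() and insert() calls instead:

```python
clients = [
    {"_id": 1, "full_name": "A. Mensah", "gender": "F", "region": "North"},
]
sales = [
    {"_id": 100, "client_id": 1, "total_sales": 250},
    {"_id": 101, "client_id": 1, "total_sales": 75},
]

denormalized = []
for client in clients:
    # 1. find all Sales documents for the client
    for sale in (s for s in sales if s["client_id"] == client["_id"]):
        # 2. merge the relevant Client fields into the document
        flat = dict(sale)
        flat["client"] = {k: client[k] for k in ("gender", "region")}
        # 3. insert the result into the new collection
        denormalized.append(flat)
```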
When your Client document is small and doesn't change often, you might consider always embedding it into each Sale. This means you will have redundant data, which looks evil from the viewpoint of a seasoned RDBMS veteran. But remember that MongoDB is not a relational database, so you should not apply RDBMS dogmas uncritically. The "no redundancy" rule of database normalization is only practicable when JOINs are relatively cheap and painless, which isn't the case with MongoDB.

Besides, sometimes you might want redundancy to preserve historical data. When you want to know the historical development of sales by region, you want the region where the customer resided when they bought the product, not where they reside now. When each Sale only references the current Client document, that information is lost. Sure, you can solve this with separate Address documents that have date ranges, but that would make things even more complicated.
Another option would be to embed an array of Sales in each Client. However, MongoDB doesn't like documents that grow over time, so if your clients tend to return often, this might result in sub-par write performance.
I have the following objects: Company, User and Order (which contains order lines). Users place orders with one or more order lines, and these relate to a Company. The time period during which orders can be placed for a Company is only a week.
What I'm not sure about is where to place the orders array: should it be a collection of its own, containing a link to the User and a link to the Company, or should it sit under the Company, or, finally, should the orders sit under the User?
Numbers-wise, I need to plan for 50k+ orders.
Query-wise, I'll mostly be looking at Orders by Company, but I would also need to find an Order by Company for a specific user.
1) For folks coming from the SQL world (such as myself), one of the hardest things to learn about MongoDB is the new style of schema design. In the SQL world, everything goes into third normal form. Folks come to think that there is a single right way to design their schema, because there typically is one.
In the MongoDB world, there is no one best schema design. More accurately, in MongoDB schema design depends on how the application is going to access the data.
2) Here are the key questions that you need to have answered in order to design a good schema for MongoDB:
How much data do you have?
What are your most common operations? Will you be mostly inserting new data, updating existing data, or doing queries?
What are your most common queries?
How many I/O operations do you expect per second?
What you're talking about here is modeling Many-to-One relationships:
Company -> User
User -> Order
Order -> Order Lines
Company -> Order
Using SQL you would create a pair of master/detail tables with a primary key/foreign key relationship. In MongoDB, you have a number of choices: you can embed the data, you can create a linked relationship, you can duplicate and denormalize the data, or you can use a hybrid approach.
The correct approach would depend on a lot of details about the use case of your application, many of which you haven't provided.
3) This is my best guess - and it's only a guess - as to a good schema for you.
a) Have separate collections for Users, Companies, and Orders
If you're looking at 50k+ orders, there are too many to embed in a single document. Having them as a separate collection will allow you to reference them from both the Company and the User documents.
b) Have an array of references to the Order documents in both the Company and the User documents. This makes the query "Find all Orders for this Company" a single-document query
c) If your query pattern supports it, you might also have a duplicate link from Orders back to the owning Company and/or User.
d) Assuming that the order lines are unique to the individual Order, you would embed the Order Lines in an array within the Order documents.
e) If your order lines refer back to individual Products, you might want to have a separate Product collection, and include a reference to the Product document in the order line sub-document
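A rough sketch of what documents shaped along (a)-(e) could look like; all ids and field names here are illustrative (in MongoDB they would typically be ObjectIds):

```python
company = {"_id": "co1", "name": "Acme",
           "order_ids": ["o1"]}                  # (b) refs to Orders
user = {"_id": "u1", "company_id": "co1",
        "order_ids": ["o1"]}                     # (b) refs to Orders
order = {"_id": "o1",
         "company_id": "co1", "user_id": "u1",   # (c) links back to owners
         "lines": [                              # (d) embedded order lines
             {"product_id": "p1",                # (e) ref to a Product doc
              "qty": 2, "unit_price": 9.99},
         ]}

# "All Orders for this Company" is just company["order_ids"]; an order
# total is a simple fold over the embedded lines:
order_total = sum(l["qty"] * l["unit_price"] for l in order["lines"])
```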
4) Here are some good general references on MongoDB schema design.
MongoDB presentations:
http://www.10gen.com/presentations/mongosf2011/schemabasics
http://www.10gen.com/presentations/mongosv-2011/schema-design-by-example
http://www.10gen.com/presentations/mongosf2011/schemascale
Here are a couple of books about MongoDB schema design that I think you would find useful:
http://www.manning.com/banker/ (MongoDB in Action)
http://shop.oreilly.com/product/0636920018391.do
Here are some sample schema designs:
http://docs.mongodb.org/manual/use-cases/
Note that the "MongoDB in Action" book includes a sample schema for an e-commerce application, which is very similar to what you're trying to build -- I recommend you check it out.
We want to develop an application which needs to support custom attributes on different entities (like user, project, folder, document, etc.) in our application.
I googled, and prima facie it looks like a NoSQL database could suit our requirement. Do you see any limitation? What are the pros/cons of using NoSQL instead of an RDBMS?
There are many NoSQL databases available - http://nosql-database.org/ - but we don't have any experience using a NoSQL database, and I can't find any good article which compares them. Any suggestion which NoSQL data store we can use to achieve the custom-attributes functionality?
One big advantage of a NoSQL database is its schema freedom: you never have to specify columns like "user, project, folder" before you insert your real data. Columns can be added at any time.
In an RDBMS, by contrast, the table schema is strictly defined up front and cannot easily be modified at run time.
Another advantage is query performance. It is quite efficient to query all the records of a user, say "Michael", since the data is stored following the Bigtable principle, named by Google.
There are two ways to address your question: a column database such as Cassandra, or a name-value pair (also called attribute-value pair) design in a relational database.
First, Cassandra is a structured key-value store. A key can contain multiple and variable attributes and values. Values, or columns, are grouped into column families. The column families are fixed when a Cassandra database is created; a family is analogous to an entity in a logical data model or to a table in relational. Columns, however, can be added to a family at any time, so different instances of the column family can have different columns, which is what you need. Furthermore, columns are assigned to specified keys, so different keys can have different numbers of columns in any given family.
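A toy picture of that wide-row idea, with a plain Python dict standing in for a column family (purely illustrative, not the Cassandra API):

```python
# Each row key maps to its own set of columns; different keys can hold
# different columns within the same family, and columns can be added
# to any row at any time.
customer_family = {
    "cust-1": {"name": "Ann", "net_worth": "2000000.00"},
    "cust-2": {"name": "Bob", "favorite_color": "green"},  # extra column
}

def add_column(family, key, column, value):
    """Attach a new column to one row without touching any schema."""
    family.setdefault(key, {})[column] = value

add_column(customer_family, "cust-1", "religion", "none")
```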
A name value pair, also called an attribute value pair, can be created in logical data modeling and in relational. This can be done with three related entities or tables:
The base entity (such as customer), which is analogous to a column family.
A "type" entity, which describes the attribute and its characteristics, such as Net Worth Amount.
A "value" entity, which assigns the attribute to an instance of a base entity and assigns it a value.
The "type" entity is simply a code table identified by a type code and containing a description and other domain characteristics. Domain refers to data type, length, meaning, and units of measure. It describes the attribute out of context (i.e., unassigned). An example could be Net Worth Amount, which is an 8-digit number with 2 decimal places, right justified, whose description is "a value representing the total financial value of a customer, including liquid and non-liquid amounts".
The "value" entity is an associative entity or table that is identified by the customer ID and the attribute type code, and has a value attribute that assigns the Net Worth Amount type to the customer and gives it a value, such as "$2,000,000".
However, in a relational database name-value pairs are somewhat difficult to query in SQL and generally do not perform well. This can be addressed by denormalizing the "type" and "value" entities into one: instead of three tables you have two, related one-to-many. Actually, that is essentially how Cassandra does it - a column family is a fully flattened attribute-value pair.
I hope this helps. If you are going to use NoSQL, I'd use something like Cassandra. If you use relational, I'd denormalize (i.e., collapse into one) the type and value entities. The advantage of relational is that you already have it. The disadvantage of Cassandra is that you have to learn it, but it is built to do what you want.
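Here is a minimal runnable sketch of that denormalized two-table variant, using SQLite for illustration; the table and column names are mine:

```python
import sqlite3

# "Type" and "value" collapsed into one attribute table, one row per
# (customer, attribute) pair.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_attr (
        customer_id INTEGER REFERENCES customer(id),
        attr_name   TEXT,
        attr_value  TEXT,
        PRIMARY KEY (customer_id, attr_name)
    );
""")
con.execute("INSERT INTO customer VALUES (1, 'Michael')")
con.executemany("INSERT INTO customer_attr VALUES (1, ?, ?)",
                [("net_worth_amount", "2000000.00"), ("religion", "none")])

# The pain point mentioned above: every attribute read is a join.
row = con.execute("""
    SELECT a.attr_value
    FROM customer c JOIN customer_attr a ON a.customer_id = c.id
    WHERE c.name = 'Michael' AND a.attr_name = 'net_worth_amount'
""").fetchone()
```

Note that all values are stored as text, so the domain characteristics (data type, length, units) the "type" entity would normally carry have to be enforced by the application.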
Couchbase would be a great answer for you: if you can encapsulate your model into JSON, then you are already halfway there. You can have any number of properties for your object:
product::001
{
"name": "Hard Drive",
"brand": "Toshiba",
...
...
}
To learn some simple patterns moving from RDBMS to Couchbase, check out their webinars at http://www.couchbase.com/webinars or some simple design patterns at http://CouchbaseModels.com (examples are in Ruby though)
The real advantages of Couchbase are schema flexibility, horizontal scalability on commodity hardware, and speed. After learning the basics, it fits well into Agile processes, with almost no need for migrations. This is very effective in enterprise organizations, where with an RDBMS every column modification requires business processes and approvals with the DBA; Couchbase's schema flexibility circumvents a lot of these issues.