Sorting Cassandra using individual components of Composite Keys - nosql

I want to store a list of users in a Cassandra Column Family(Wide rows).
The columns in the CF will have Composite Keys of pattern id:updated_time:name:score
After inserting all the users, i need to query users in a different sorted order each time.
For example, if i specify updated_time, i could be able to fetch the recent 10 users.
And, if i specify score, then i could be able to fetch the top 10 users based on score.
Does Cassandra supports this?
Kindly help me in this regard...

i need to query users in a different sorted order each time...
Does Cassandra supports this
It does not. Unlike a RDBMS, you can not make arbitrary queries and expect reasonable performance. Instead you must design you data model so the queries you anticipate will be made will be efficient:
The best way to approach data modeling for Cassandra is to start with your queries and work backwards from there. Think about the actions your application needs to perform, how you want to access the data, and then design column families to support those access patterns.
So rather than having one column family (table) for your data, you might want several with cross references between them. That is, you might have to denormalise your data.

Related

Single big collection for all products vs Separate collections for each Product category

I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a seperate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it dependes".
So my question is: In this specific use case which option would be most optimal, multiple collections vs single collection + sharding, in terms of performance and speed ?
Any help would be appreciated.
As you mentioned, you need to play with your data and use-case. You will have better picture.
Some decisions required as below.
Decide the number of documents you will have in near future. If you will have 1m documents in an year, then try with at least 3m data
Decide the number of indices required.
Decide the number of writes, reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on the requirements
If you have more writes with more indices, then single monolithic collection will be slower as multiple indices needs to be updated.
As you have different set of fields per category, you could try with multiple collections.
There is $unionWith to combine data from multiple collections. But do check the performance it purely depends on the above decisions. Note this open issue also.
If you decide to go with monolithic collection, defer the sharding. Implement this once you found that queries are slower.
If you have more writes on the same document, writes will be executed sequentially. It will slow down your read also.
Think of reclaiming the disk space when more data is cleared from the collections. Multiple collections do good here.
The point which forces me to suggest monolithic collections is that I'd need to query all the categories at the same time. You may need to add more categories, but combining all of them in single response would not be better in terms of performance.
As you don't really have a join use case like in RDBMS, you can go with single monolithic collection from model point of view. I doubt you could have a join key.
If any of my points are incorrect, please let me know.
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL but some data is definitely a better fit to that model than others.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read this through the correct lens. The message you should get from this article is that MongoDB is not a good fit for every data type.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only ok if you have few updates or few documents, but not both.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls)
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
Decisions
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres

Single Collection Inheritance or multiple collections?

Assuming I have data of high school students across the country. Each high school data are not related each other and also never needed to be related to each other (compartmentalized). Which one is recommended if I use mongoDB:
1) Create single collection inheritance with the following attributes:
high_school_id, student_id, name, address
2) Create multiple collections (possibly thousands) with the following attributes:
student_id, name, address
The name of collection will follow school_data_<X> format, where X is the high_school_id. So, to query, my program can dynamically construct the collection name.
I came from MySQL, PostgreSQL background where having thousands tables are not common (So, option (1) is far more makes sense). How is it in MongoDB?
I recommend you use the first option, because MongoDB has a limit on the number of collections. More about this read docs.
You may want to consider a third option: create a collection with the students, where each student's record will include a high school data. There is nothing wrong in the duplication of data, you should not thinking about this in MongoDB, but you should thinking about more convenient way working with data.

Cassandra DB Design

I come from RDBMS background and designing an app with Cassandra as backend and I am unsure of the validity and scalability of my design.
I am working on some sort of rating/feedback app of books/movies/etc. Since Cassandra has the concept of flexible column families (sparse structure), I thought of using the following schema:
user-id (row key): book-id/movie-id (dynamic column name) - rating (column value)
If I do it this way, I would end up having millions of columns (which would have been rows in RDBMS) though not essentially associated with a row-key, for instance:
user1: {book1:Rating-Ok; book1023:good; book982821:good}
user2: {book75:Ok;book1023:good;book44511:Awesome}
Since all column families are stored in a single file, I am not sure if this is a scalable design (or a design at all!). Furthermore there might be queries like "pick all 'good' reviews of 'book125'".
What approach should I use?
This design is perfectly scalable. Cassandra stores data in sparse form, so empty cells don't consume disk space.
The drawback is that cassandra isn't very good in indexing by value. There are secondary indexes, but they should be used only to index a column or two, not each of million of columns.
There are two options to address this issue:
Materialised views (described, for example, here: http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/). This allows to build some set of predefined queries, probably quite complex ones.
Ad-hoc querying is possible with some sort of map/reduce job, that effectively iterates over the whole dataset. This might sound scary, but still it's pretty fast: Cassandra stores all data in SSTables, and this iterating might be implemented to scan data files sequentially.
Start from a desired set of queries and structure your column families to support those views. Especially with so few fields involved, each CF can act cheaply as its own indexed view of your data. During a fetch, the key will partition the data ultimately to one specific Cassandra node that can rapidly stream a set of wide rows to your app server in a pre-determined order. This plays to one of Cassandra's strengths, as the fragmentation of that read on physical media (when not cached) is extremely low compared to bouncing around the various tracks and sectors on an indexed search of an RDBMS table.
One useful approach when available is to select your key to segment the data such that a full scan of all columns in that segment is a reasonable proposition, and a good rough fit for your query. Then, you filter what you don't need, even if that filtering is performed in your client (app server). All reviews for a movie is a good example. Even if you filter the positive reviews or provide only recent reviews or a summary, you might still reasonably fetch all rows for that key and then toss what you don't need.
Another option is if you can figure out how to partition data(by time, by category), playOrm offers a solution of doing S-SQL into a partition which is very fast. It is very much like an RDBMS EXCEPT that you partition the data to stay scalable and can have as many partitions as you want. partitions can contain millions of rows(I would not exceed 10 million rows though in a partition).
later,
Dean

What is the fundmental difference between MongoDB / NoSQL which allows faster aggregation (MapReduce) compared to MySQL

Greeting!
I have the following problem. I have a table with huge number of rows which I need to search and then group search results by many parameters. Let's say the table is
id, big_text, price, country, field1, field2, ..., fieldX
And we run a request like this
SELECT .... WHERE
[use FULLTEXT index to MATCH() big_text] AND
[use some random clauses that anyway render indexes useless,
like: country IN (1,2,65,69) and price<100]
This we be displayed as search results and then we need to take these search results and group them by a number of fields to generate search filters
(results) GROUP BY field1
(results) GROUP BY field2
(results) GROUP BY field3
(results) GROUP BY field4
This is a simplified case of what I need, the actual task at hand is even more problematic, for example sometimes the first results query does also its own GROUP BY. And example of such functionality would be this site
http://www.indeed.com/q-sales-jobs.html
(search results plus filters on the left)
I've done and still doing a deep research on how MySQL functions and at this point I totally don't see this possible in MySQL. Roughly speaking MySQL table is just a heap of rows lying on HDD and indexes are tiny versions of these tables sorted by the index field(s) and pointing to the actual rows. That's a super oversimplification of course but the point is I don't see how it is possible to fix this at all, i.e. how to use more than one index, be able to do fast GROUP BY-s (by the time query reaches GROUP BY index is completely useless because of range searches and other things). I know that MySQL (or similar databases) have various helpful things such index merges, loose index scans and so on but this is simply not adequate - the queries above will still take forever to execute.
I was told that the problem can be solved by NoSQL which makes use of some radically new ways of storing and dealing with data, including aggregation tasks. What I want to know is some quick schematic explanation of how it does this. I mean I just want to have a quick glimpse at it so that I could really see that it does that because at the moment I can't understand how it is possible to do that at all. I mean data is still data and has to be placed in memory and indexes are still indexes with all their limitation. If this is indeed possible, I'll then start studying NoSQL in detail.
PS. Please don't tell me to go and read a big book on NoSQL. I've already done this for MySQL only to find out that it is not usable in my case :) So I wanted to have some preliminary understanding of the technology before getting a big book.
Thanks!
There are essentially 4 types of "NoSQL", but three of the four are actually similar enough that an SQL syntax could be written on top of it (including MongoDB and it's crazy query syntax [and I say that even though Javascript is one of my favorite languages]).
Key-Value Storage
These are simple NoSQL systems like Redis, that are basically a really fancy hash table. You have a value you want to get later, so you assign it a key and stuff it into the database, you can only query a single object at a time and only by a single key.
You definitely don't want this.
Document Storage
This is one step up above Key-Value Storage and is what most people talk about when they say NoSQL (such as MongoDB).
Basically, these are objects with a hierarchical structure (like XML files, JSON files, and any other sort of tree structure in computer science), but the values of different nodes on the tree can be indexed. They have a higher "speed" relative to traditional row-based SQL databases on lookup because they sacrifice performance on joining.
If you're looking up data in your MySQL database from a single table with tons of columns (assuming it's not a view/virtual table), and assuming you have it indexed properly for your query (that may be you real problem, here), Document Databases like MongoDB won't give you any Big-O benefit over MySQL, so you probably don't want to migrate over for just this reason.
Columnar Storage
These are the most like SQL databases. In fact, some (like Sybase) implement an SQL syntax while others (Cassandra) do not. They store the data in columns rather than rows, so adding and updating are expensive, but most queries are cheap because each column is essentially implicitly indexed.
But, if your query can't use an index, you're in no better shape with a Columnar Store than a regular SQL database.
Graph Storage
Graph Databases expand beyond SQL. Anything that can be represented by Graph theory, including Key-Value, Document Database, and SQL database can be represented by a Graph Database, like neo4j.
Graph Databases make joins as cheap as possible (as opposed to Document Databases) to do this, but they have to, because even a simple "row" query would require many joins to retrieve.
A table-scan type query would probably be slower than a standard SQL database because of all of the extra joins to retrieve the data (which is stored in a disjointed fashion).
So what's the solution?
You've probably noticed that I haven't answered your question, exactly. I'm not saying "you're finished," but the real problem is how the query is being performed.
Are you absolutely sure you can't better index your data? There are things such as Multiple Column Keys that could improve the performance of your particular query. Microsoft's SQL Server has a full text key type that would be applicable to the example you provided, and PostgreSQL can emulate it.
The real advantage most NoSQL databases have over SQL databases is Map-Reduce -- specifically, the integration of a full Turing-complete language that runs at high speed that query constraints can be written in. The querying function can be written to quickly "fail out" of non-matching queries or quickly return with a success on records that meet "priority" requirements, while doing the same in SQL is a bit more cumbersome.
Finally, however, the exact problem you're trying to solve: text search with optional filtering parameters, is more generally known as a search engine, and there are very specialized engines to handle this particular problem. I'd recommend Apache Solr to perform these queries.
Basically, dump the text field, the "filter" fields, and the primary key of the table into Solr, let it index the text field, run the queries through it, and if you need the full record after that, query your SQL database for the specific index you got from Solr. It uses some more memory and requires a second process, but will probably best suite your needs, here.
Why all of this text to get to this answer?
Because the title of your question doesn't really have anything to do with the content of your question, so I answered both. :)

Non Relational Database , Key Value or flat table

My application needs configurable columns , and titles of these columns get configured in the begining, If relation database I would have created generic columns in table like CodeA, CodeB etc for this need because it helps queering on these columns (Code A = 11 ) it also helps in displaying the values (if that columns stores code and value) but now I am using Non Relational database Datastore (and I am new to it), should I follow the same old approach or I should use collection (Key Value pair) type of structure .
There will be lot of filters on these columns. Please suggest
What you've just described is one of the classic scenarios for a Key-Value database. The limitation here is that you will not have many of the set-based tools you're used to.
Most of the K-V databases are really good at loading one "record" or small set thereof. However, they don't tend to be any good at loading anything that may require a join. Given that you're using AppEngine, you probably appreciate this limitation. But it's worth stating.
As an important note, not all K-V database will allow you to "select by any column". Many K-V stores actually only allow for selection by a primary key. If you take a look at MongoDB, you'll find that you can query any column which sounds like a necessary feature.
I would suggest using key/value pairs where keys will act as your column names and value will be their data.