Cassandra column key auto increment - nosql

I am trying to understand Cassandra and how to structure my column families (CF) but it's quite hard since I am used to relational databases.
For example if I create simple users CF and I try to insert new row, how can I make an incremental key like in MySQL?
I saw a lot of examples where you would just put the username instead of unique ID and that would make a little sense, but what if I want users to have duplicated usernames?
Also, how can I run searches when, from what I understand, Cassandra does not support > operators, so something like select * from users where something > something2 would not work?
And probably the most important question what about grouping? Would I need to retrieve all data and then filter it with whatever language I am using? I think that would slow down my system a lot.
So basically I need a brief explanation of how to get started with Cassandra.

Your questions are quite general, but let me take a stab at it. First, you need to model your data in terms of your queries. With an RDBMS, you model your data in some normalized form, then optimize later for your specific queries. You cannot do this with Cassandra; you must write your data the way you intend to read it. Often this means writing it more than one way. In general, it helps to completely shed your RDBMS thinking if you want to work effectively with Cassandra.
Regarding keys:
They are used in Cassandra as the unit of distribution across the ring. So your key will get hashed and assigned an "owner" in the ring. Use the RandomPartitioner to guarantee even distribution.
Presuming you use RandomPartitioner (you should), keys are not sorted. This means you cannot ask for a range of keys. You can, however, ask for a list of keys in a single query.
Keys are relevant in some models and not in others. If your model requires query-by-key, you can use any unique value that your application is aware of (such as a UUID). Sometimes keys are sentinel values, such as a Unix epoch representing the start of the day. This allows you to hand Cassandra a bunch of known keys, then get a range of data sorted by column (see below).
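As a rough illustration of the UUID suggestion above, here is a minimal Python sketch using only the standard library (no Cassandra driver). The `users` dict and the `add_user` helper are hypothetical stand-ins for a column family and an insert; the point is only that the application generates the key, so two users can share a username.

```python
import uuid

# Hypothetical stand-in for a "users" column family: row key -> columns.
users = {}

def add_user(username, email):
    """Insert a user under an application-generated key and return the key."""
    # uuid4() is random, so no coordination between clients or nodes is needed.
    row_key = uuid.uuid4()
    users[row_key] = {"username": username, "email": email}
    return row_key

first = add_user("alice", "alice@example.com")
second = add_user("alice", "other.alice@example.com")  # duplicate username is fine
```

If the key should also carry a creation time, `uuid.uuid1()` produces a time-based UUID instead.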
Regarding query predicates:
You can get ranges of data presuming you model it correctly to answer your queries.
Since columns are written in sorted order, you can query a range from column A to column n with a slice query (which is very fast). You can also use composite columns to abstract this mechanism a bit.
You can use secondary indexes on columns where you have low cardinality--this gives you query-by-value functionality.
You can create your own indexes where the data is sorted the way you need it.
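To make the slice idea above more concrete, here is a toy in-memory model in Python. It is not how Cassandra is implemented; `WideRow` and its methods are made-up names that only mimic a single row whose columns are stored sorted by name, so that a range of columns can be read without scanning the whole row.

```python
import bisect

class WideRow:
    """Toy model of one row whose columns are kept sorted by column name."""

    def __init__(self):
        self._names = []   # sorted column names
        self._values = {}  # column name -> value

    def insert(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, finish):
        """Return (name, value) pairs with start <= name <= finish, in order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]

# The row key would be a sentinel (e.g. start of day); column names are timestamps.
day = WideRow()
for ts, event in [(1700000300, "logout"), (1700000100, "login"), (1700000200, "click")]:
    day.insert(ts, event)

morning = day.slice(1700000100, 1700000200)
```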
Regarding grouping:
I presume you're referring to creating aggregates. If you need your data in real-time, you'll want to use some external mechanism (like Storm) to track data and constantly update your relevant aggregates into a CF. If you are creating aggregates as part of a batch process, Cassandra has excellent integration with Hadoop, allowing you to write map/reduce jobs in Pig, Hive, or directly in your language of choice.

To your first question:
can i make incremental key like in mysql
No, not really -- not native to Cassandra. How to create auto increment IDs in Cassandra -- You could check here for more information: http://srinathsview.blogspot.ch/2012/04/generating-distributed-sequence-number.html
Your second question is more about how you store and model your Cassandra data.
Check out stackoverflow's search option. Lots of interesting questions!
Switching from MySQL to Cassandra - Pros/Cons?
Cassandra Data Model
Cassandra/NoSQL newbie: the right way to model?
Apache Cassandra schema design
Knowledge sources for Apache Cassandra
Most importantly, When NOT to use Cassandra?

You may want to check out PlayOrm. While I agree you need to break out of RDBMS thinking, sometimes having your primary key as userid is just the wrong choice. Sometimes it is the right choice (depends on your requirements).
PlayOrm is a mix of noSQL and relational concepts as you need both and you can do Scalable-SQL with joins and everything. You just need to partition the tables you believe will grow into the billions/trillions of rows and you can query into those partitions. Even with CQL, you need to partition your tables. What can you partition by? time is good for some use-cases. Others can be partitioned by clients as each client is really a mini-database in your noSQL cluster.
As far as keys go, PlayOrm generates unique "cluster" keys which is hostname-uniqueidinThatHost, basically like a TimeUUID except quite a bit shorter and more readable as we use hostnames in our cluster of a1, a2, a3, etc. etc.
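A host-prefixed key of the kind described above could look roughly like the following Python sketch. This is not PlayOrm's implementation; `HostKeyGenerator` is a hypothetical class, and it assumes each host has a unique name and keeps its own counter (a real generator would also need to persist the counter across restarts).

```python
import itertools
import threading

class HostKeyGenerator:
    """Generates keys of the form '<hostname>-<counter>' (illustrative only)."""

    def __init__(self, hostname):
        self._hostname = hostname
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def next_key(self):
        # The lock keeps the counter safe if several threads share a generator.
        with self._lock:
            return f"{self._hostname}-{next(self._counter)}"

a1 = HostKeyGenerator("a1")
a2 = HostKeyGenerator("a2")
keys = [a1.next_key(), a1.next_key(), a2.next_key()]
```

Because the hostname is part of the key, two hosts never need to coordinate to avoid collisions.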

Related

Search Engine Database (Cassandra) & Best Practise

I'm currently storing rankings in MongoDB (+ nodejs as API) . It's now at 10 million records, so it's okay for now but the dataset will be growing drastically in the near future.
At this point I see two options:
MongoDB Sharding
Change Database
The queries performed on the database will not be text searches, but for example:
domain, keyword, language, start date, end date
keyword, language, start date, end date
A rank contains a:
1. domain
2. url
3. keyword
4. keyword language
5. position
6. date (unix)
Requirement is to be able to query and analyze the data without caching. For example get all data for domain x, between dates y, z and analyze the data.
I'm noticing a performance decrease lately and I'm looking into other databases. The one that seems to fit the job best is Cassandra. I did some testing and it looked promising; performance is good. Using Amazon EC2 + Cassandra seems a good solution, since it's easily scalable.
Since I'm no expert on Cassandra I would like to know if Cassandra is the way to go. Secondly, what would be the best practice / database model.
Make a collection for (simplified):
domains (domain_id, name)
keywords (keyword_id, name, language)
rank (domain_id, keyword_id, position, url, unix)
Or put all in one row:
domain, keyword, language, position, url, unix
Any tips, insights would be greatly appreciated.
Cassandra relies heavily on query driven modelling. It's very restrictive in how you can query, but it is possible to fit an awful lot of requirements within those capabilities. For any large scale database, knowing your queries is important, but in terms of cassandra, it's almost vital.
Cassandra has the notion of primary keys. Each primary key consists of one or more keys (read columns). The first column (which may be a composite) is referred to as the partition key. Cassandra keeps all "rows" for a partition in the same place (on disk, in mem, etc.), and a partition is the unit of replication, etc.
Additional keys in the primary key are called clustering keys. Data within a partition are ordered according to successive clustering keys. For instance, if your primary key is (a, b, c, d) then data will be partitioned by hashing a, and within a partition, data will be ordered by b, c and d.
For efficient querying, you must hit one (or very few) partitions. So your query must have a partition key. This MUST be exact equality (no starts with, contains, etc.). Then you need to filter down to your targets. This can get interesting too:
Your query can specify exact equality conditions for successive clustering keys, and a range (or equality) for the last key in your query. So, in the previous example, this is allowed:
select * from tbl where a=a1 and b=b1 and c > c1;
This is not:
select * from tbl where a=a1 and b>20 and c=c1;
[You can use ALLOW FILTERING for this]
or
select * from tbl where a=a1 and c > 20;
Once you understand the data storage model, this makes sense. One of the reasons Cassandra is so fast for queries is that it pinpoints data in a range and splats it out. If it needed to pick and choose, it'd be slower. You can always grab data and filter client side.
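The restriction rules above can be written down as a small checker. This Python sketch is only a model of the rules as stated in this answer, not of any driver or of Cassandra's actual query planner; `is_efficient_query` and its arguments are hypothetical names, and the restrictions are given as a dict from column name to "eq" or "range".

```python
def is_efficient_query(partition_keys, clustering_keys, restrictions):
    """Check a WHERE clause against the rules described above.

    restrictions maps column name -> "eq" or "range".
    """
    # Every partition key column needs an exact equality.
    if any(restrictions.get(col) != "eq" for col in partition_keys):
        return False

    # Clustering keys must be restricted in order, with no gaps,
    # and a range may only appear on the last restricted one.
    seen_gap_or_range = False
    for col in clustering_keys:
        op = restrictions.get(col)
        if op is None:
            seen_gap_or_range = True
        elif seen_gap_or_range:
            return False
        elif op == "range":
            seen_gap_or_range = True

    # Columns outside the primary key are not covered by these rules.
    known = set(partition_keys) | set(clustering_keys)
    return all(col in known for col in restrictions)

# PRIMARY KEY (a, b, c, d): a is the partition key; b, c, d are clustering keys.
pk, ck = ["a"], ["b", "c", "d"]
allowed = is_efficient_query(pk, ck, {"a": "eq", "b": "eq", "c": "range"})
range_then_eq = is_efficient_query(pk, ck, {"a": "eq", "b": "range", "c": "eq"})
skipped_key = is_efficient_query(pk, ck, {"a": "eq", "c": "range"})
```

The three calls correspond to the three example queries: the first is accepted, the other two are rejected.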
You can also have secondary indexes on columns. These would allow you to filter on exact equality on non-key columns. Be warned, never use a query with a secondary index without specifying a partition key. You'll be doing a cluster query which will time out in real usage. (The exception is if you're using Spark and locality is being honoured, but that's a different thing altogether).
In general, it's good to limit partition sizes to less than 100 MB, or at most a few hundred MB. Any larger and you'll have problems. Usually, a need for larger partitions suggests a bad data model.
Quite often, you'll need to denormalise data into multiple tables to satisfy all your queries in a fast manner. If your model allows you to query for all your needs with the fewest possible tables, that's a really good model. Often that might not be possible though, and denormalisation will be necessary. For your question, the answer to whether or not all of it goes in one row depends on whether you can still query it and keep partition sizes less than 100 meg or not if everything is in one row.
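For the rank data in the question, denormalising could mean writing every rank twice, once per query pattern. The Python sketch below imitates that with two in-memory dicts; the table names, the `write_rank` and `query_range` helpers, and the choice of partition keys are assumptions made for illustration, and dates are assumed to be integer Unix timestamps.

```python
import bisect
from collections import defaultdict

# Two hypothetical "tables", one per query pattern. Each maps a partition key
# to a list of rows kept sorted by date (playing the role of a clustering key).
ranks_by_domain_keyword = defaultdict(list)  # key: (domain, keyword, language)
ranks_by_keyword = defaultdict(list)         # key: (keyword, language)

def write_rank(domain, url, keyword, language, position, date):
    """Write the same rank to both tables so each query hits one partition."""
    bisect.insort(ranks_by_domain_keyword[(domain, keyword, language)],
                  (date, position, url))
    bisect.insort(ranks_by_keyword[(keyword, language)],
                  (date, domain, position, url))

def query_range(table, partition_key, start_date, end_date):
    """Return rows in one partition with start_date <= date <= end_date."""
    rows = table.get(partition_key, [])
    lo = bisect.bisect_left(rows, (start_date,))
    hi = bisect.bisect_left(rows, (end_date + 1,))  # integer dates assumed
    return rows[lo:hi]

write_rank("example.com", "http://example.com/a", "cassandra", "en", 3, 100)
write_rank("example.com", "http://example.com/a", "cassandra", "en", 2, 200)
write_rank("other.org", "http://other.org/b", "cassandra", "en", 5, 150)

per_domain = query_range(ranks_by_domain_keyword,
                         ("example.com", "cassandra", "en"), 100, 150)
per_keyword = query_range(ranks_by_keyword, ("cassandra", "en"), 100, 150)
```

Whether partitions keyed like this stay within a sensible size depends on how many ranks accumulate per key, which the sketch does not address.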
For OLTP, cassandra will be awesome IF you can build the data model that works the way Cassandra does. Quite often OLAP requirements won't be satisfied by this. The current tool of choice for OLAP with Cassandra data is the DataStax Spark connector + Apache Spark. It's quite simple to use, and is really powerful.
That's quite a brain dump. But it should give you some idea of the things you might need to learn if you intend to use Cassandra for a real world project. I'm not trying to put you off Cassandra or anything. It's an awesome data store. But you have to learn what it's doing to harness its power. It works very differently from Mongo, and you should expect a mind shift when switching. It's most definitely NOT like switching from MySQL to SQL Server.

Why would using a nosql/document/MongoDB as a relational database be inferior?

I have recently been introduced to MongoDB and I've come to like a lot (compared to MySQL i used for all projects).
However in some certain situations, storing my data with documents "linking" to each other with simple IDs makes more sense (to reduce duplicated data).
For example, I may have Country and User documents, where a user's location is actually an ID to a Country (since a Country document includes more data, hence duplicating Country data in each user makes no sense).
What I am curious about is.. why would MongoDB be inferior compared to using a proper relationship database?
Is it because I can save transactions by doing joins (as opposed to doing two transactions with MongoDB)?
That's a good question!
I would say there is definitely nothing wrong in using nosql db for the type of data you have described. For simple usecases it will work perfectly well.
The only point is that relational databases were designed long ago to store and query WELL STRUCTURED DATA with properly defined relations. Hence, for a large amount of well-structured data, their performance and features will exceed those of a NoSQL database. They are more mature, so it's their ball game.
On the other hand, NoSQL databases have been designed to handle very large amounts of unstructured data and have out-of-the-box support for scaling in distributed environments. So it's a completely different ball game.
They basically treat data differently and hence have different strategies / execution plans for fetching a given piece of data.
MongoDB was designed from the ground up to be scalable over multiple servers. When a MongoDB database gets too slow or too big for a single server, you can add additional servers by making the larger collections "sharded". That means that the collection is divided between different servers and each one is responsible for managing a different part of the collection.
The reason why MongoDB doesn't do JOINs is that it is impossible to have JOINs perform well when one or both collections are sharded over multiple nodes. A JOIN requires comparing each entry of table/collection A with each one of table/collection B. There are shortcuts for this when all the data is on one server. But when the data is distributed over multiple servers, large amounts of data need to be compared and synchronized between them. This would require a lot of network traffic and make the operation very slow and expensive.
Is it correct that you have only two tables, country and user? If so, it seems to me the only data duplicated is a foreign key, which is not a big deal. If there is more duplicated, then I question the DB design itself.
In concept, you can do it in NOSQL but why? Just because NOSQL is new? OK, then do it to learn but remember, "if it ain't broke, don't fix it." Apparently the application is already running on relational. If the data is stored in separate documents in MongoDB and you want to interrelate them, you will need to use a link, which will be more work than a join and be slower. You will have to store a link, which would be no better than storing the foreign key. Alternatively, you can embed one document in another in MongoDB, which might even increase duplication.
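The difference between linking and embedding can be sketched with plain Python dicts standing in for collections. Nothing here uses a MongoDB driver; the collection names and the `user_with_country` helper are hypothetical, and each dict lookup stands for what would be a separate query against a real store.

```python
# Linked documents: the user stores only the country's ID.
countries = {"CH": {"name": "Switzerland", "currency": "CHF"}}
users_linked = {1: {"name": "Ann", "country_id": "CH"}}

def user_with_country(user_id):
    """Application-side 'join': one lookup per collection."""
    user = users_linked[user_id]
    country = countries[user["country_id"]]  # second round trip in a real store
    return {**user, "country": country}

# Embedded documents: one lookup, but country data is copied into every user.
users_embedded = {
    1: {"name": "Ann", "country": {"name": "Switzerland", "currency": "CHF"}},
}

linked = user_with_country(1)
embedded = users_embedded[1]
```

Both shapes return the same country data; the linked form pays with an extra lookup, the embedded form with duplication.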
If it is currently running on MySQL then it is not running on distributed servers, so Mongo's use of distributed servers is irrelevant. You would have to add servers to take advantage of that. If the tables are properly indexed in relational, it will not have to search through large amounts of data.
However, this is not a complex application and you can use either. If the data is stored on an MPP environment with relational, it will run very well and will not need to search through large amounts of data at all. There are two requirements, however, in choosing a partitioning key in MPP: 1. pick one that will achieve an even distribution of data; and 2. pick a key that can allow collocation of data. I recommend you use the same key as the partitioning key (shard key) in both files.
As much as I love MongoDB, I don't see the value in moving your app.

Looking for an architecture that supports streaming counting, sketching and large set intersections

I wonder if the Stackoverflow community could help me by suggesting a technology (i.e. HBase, Riak, Cassandra, etc.) that would solve my problem. I have a large dataset, on the order of tens of terabytes, which we would like to update and query in real time. Our dataset is a pixel stream which contains a user ID and one or more features (usually around 10). The total possible features number in the millions.
We are imagining our data model would look like:
FEATUREID_TO_USER_TABLE:
Feature id -> {UserID Hash, UserID Hash, ...}
FEATUREID_TO_COUNTER_TABLE:
Feature id -> { Hours since epoch -> HyperLogLog byte blob }
We would like to keep a sorted set of User IDs sorted by the hash of the User ID. We also like to keep at most ~200k for each FEATUREID_TO_USER_TABLE entry evicting old IDs if a new ID has a lower hash value.
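The bounded, hash-sorted set described above resembles what is often called a bottom-k sketch. Here is a minimal Python version under stated assumptions: the `BottomK` class is a made-up name, SHA-1 truncated to 64 bits stands in for whatever hash is actually used, and the set lives in memory rather than in a datastore.

```python
import hashlib
import heapq

class BottomK:
    """Keeps the k smallest user-ID hashes seen so far."""

    def __init__(self, k):
        self.k = k
        self._heap = []        # max-heap of hashes, stored negated
        self._members = set()  # hashes currently kept

    @staticmethod
    def hash_user(user_id):
        # 64-bit value taken from SHA-1; any well-mixed hash would do.
        digest = hashlib.sha1(user_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, user_id):
        h = self.hash_user(user_id)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is lower than the current largest: evict the largest.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def sorted_hashes(self):
        return sorted(self._members)

sketch = BottomK(3)
for i in range(100):
    sketch.add(f"user-{i}")
```

With k around 200k per feature, each update costs a logarithmic heap operation, and two features' sets can be intersected by walking their sorted hashes.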
We would like the store to support the following operations (not necessarily expressed in SQL):
select FeatureID, count(FeatureID) from FEATUREID_TO_USER_TABLE where UserID in
(select UserID from FEATUREID_TO_USER_TABLE where FeatureID = 1234)
group by FeatureID;
And
update FEATUREID_TO_COUNTER_TABLE set HyperLogLog = NewBinaryValue where FEATUREID_TO_COUNTER_TABLE.id = 567
We believe the easiest way to shard this data across machines is by User ID.
Thanks for any ideas,
Mark
Cassandra is a great choice for persisting the data, but you'll want something else for processing it in real-time. I recommend you check out Storm, as it gives you real-time streaming data processing with relative ease. It's an open source framework that handles concurrency and parallelization for you. It's written on the JVM, but has language bindings for a variety of non-JVM languages as well.
I am not sure I understand your whole description though so I am shooting in the dark a bit on context.
Is there any way to partition your data so you can query into a partition? This helps a lot with scalability and querying as you scale. You typically don't want to query into too large a table, so instead query into a partition.
ie. PlayOrm has partitioning capabilities on cassandra so you can query one partition.
While PlayOrm does also have join queries, it does not do subselects at this time but typically clients just do a first call into the nosql store and then aggregate results and do a second query and it is still very very fast(probably as fast as if you made one call as even cassandra would have to make two calls internally to the other servers anyways).
hmmm, the more I read your post, I am not sure you should write SQL there as you may be able to do everything by primary key but I am not 100% sure. That SQL is confusing as it grabs all the user ids in the row it seems and then just counts them???? as it is the same table in both select and subselect?
As far as sharding your data, you don't need to do anything since cassandra does that automatically.

What db fits me?

I am currently using mysql. I am finding that my schema is getting incredibly complicated. I seek to find a new db that will suit my needs:
Let's assume I am building a news aggregator (which collects news from multiple websites). I then run algorithms to determine whether two news items from different sites are actually referring to the same topic. I run this algorithm to cluster news together. The relationship is depicted below:
cluster
\--news1
\--word1
\--word2
\--news2
\--word3
\--news3
\--word1
\--word3
And then I will apply some magic and determine the importance of each word. Summing all the importance of each word gives me the importance of a news article. Summing the importance of each news article gives me the importance of a cluster.
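The roll-up described above (word scores summed per article, article scores summed per cluster) can be sketched in a few lines of Python. The scores are invented for illustration, and the nested dict simply mirrors the cluster/news/word tree shown earlier.

```python
# Hypothetical per-article word scores: news id -> {word: importance}.
cluster = {
    "news1": {"word1": 0.5, "word2": 0.25},
    "news2": {"word3": 1.0},
    "news3": {"word1": 0.25, "word3": 0.5},
}

def news_importance(words):
    """Importance of one article: the sum of its word scores."""
    return sum(words.values())

def cluster_importance(news_items):
    """Importance of a cluster: the sum of its articles' importance."""
    return sum(news_importance(words) for words in news_items.values())

per_news = {news_id: news_importance(words) for news_id, words in cluster.items()}
total = cluster_importance(cluster)
```

Note that the same word ("word1") carries a different score in different articles, which is why the score is stored per article rather than per word.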
Note that above cluster there are also subgroups( like split by region etc), and categories (like sports, etc) which I have to determine the importance of that in a particular day per se.
I have used views in the past to do so, but I realized that views are very slow. So i will normally do an insert into an actual table and index them for better performance. As you can see this leads to multiple tables derived like (cluster, importance), (news, importance), (words, importance) etc which can get pretty messy.
Also the "importance" metric will change. It has become increasingly difficult to alter tables, update data (for which I am using TRUNCATE TABLE) and then re-insert everything from scratch.
I am currently looking into something schemaless like MongoDB. I do not need distributedness. I would very much want something that is reasonably fast (which can be indexed) and something that is a lot more flexible than a traditional RDBMS.
NEW
As requested by various people, I will post my usage to this database (they are not actual SQL queries since I hope everyone here could understand)
TABLE word ( word_id, news_id, word )
TABLE news ( news_id, date, site .. )
TABLE clusters ( cluster_id, cluster_leader, cluster_name, ... )
TABLE mapping_clusters_news( cluster_id, news_id)
TABLE word_importance (word_id, score)
TABLE news_importance (news_id, score)
TABLE cluster_importance( cluster_id, score)
TABLE group_importance( cluster_id, score)
You might notice that TABLE_word has an extra news_id column. This is to correspond to TABLE_word_importance column because the same word can have different importance in different articles (if you are familiar with tfidf, this is basically something like that).
All the "importance" table now calculates the importance of each entity by averaging the importance of all the sub-entities below it. This means that Each cluster's importance is determined by all the news inside it, each news's importance is determined by all the words inside it etc.
TYPICAL USAGE:
1) SELECT clusters FROM db THAT HAS word1, word2, word3, .. ORDER BY cluster_importance_score
2) SELECT words FROM db BELONGING TO THE CLUSTER cluster_id=5 ORDER BY word_importance score.
3) SELECT groups ordered by importance score.
As you can see, I am deriving a lot of scores from each layer, and people have been telling me to use a materialized view for this purpose (which PostgreSQL supports). However, as you can see, this simple schema already consists of 8 tables (my actual db consists of 26 tables of crap like that, which adds many additional layers of complexity for maintenance).
NOTE THIS IS NOT ABOUT FULL-TEXT SEARCH.
When the schema is getting complicated, a graph database can be a good alternative. As I understand your domain, you have lots of entities related to other entities in different ways. Would it make sense to you to model this as a graph/network of entities? As food for thought I whipped up an example using Neo4j:
news-analysis-example http://github.com/neo4j-examples/domain-models/raw/master/news-analysis.png
In a graphdb you can set properties on both nodes and relationships, which could be useful in your case (for instance the number of times a word is used in a news entry could be added to the relationship to that word). BTW, I added an extra is_related relationship between two news items, as I thought that could be interesting as well.
How about db4o?
ORM means "Object-relational mapper". Not using a relational database wouldn't make much sense. I'll pretend you meant "I want to be able to serialize objects".
I don't understand why distributedness is not required. Could you elaborate on that?
Personally, I would recommend Cassandra. It still has reasonably close ties to (by which I mean easy to integrate with) Hadoop, which you will probably eventually want for your processing. As an added bonus, there's Telephus, so Cassandra supports Twisted beautifully. Cassandra's method of conflict resolution (currently timestamps, soon-ish vector clocks) might work for your changing metric as long as you don't mind getting the old value for as long as the metric hasn't been recalculated. Otherwise, you might move up a level and simply store multiple versions of the data with different versions of the metric. That way, if you decide a metric is a bad idea, you don't have to recompute.
Cassandra, unfortunately, does not have something that serializes/deserializes objects very well yet. However, for the thin wrappers you would be writing (essentially structs with a few methods), would writing a fromCassandra @classmethod really be that big a deal?
Postgresql may be "schema based" but it kind of feels like you're throwing the baby out with the bathwater. If you don't need a distributed db or a particularly schema-less design (which it doesn't sound like offhand you do, but you appear to think you do) then I'm not sure why you would want MongoDB. Postgres has lots of indexing options and it sounds like its built-in full text searching would be good for you. If you're used to MySQL, where altering tables (you mentioned issues there) can be a nightmare, it's mostly better in Postgres. I'm a fan of Postgres and MongoDB; it just doesn't sound like there's a good reason to move away from a relational db for data that certainly sounds relational in nature.
In a word, YES, you should probably be looking at something else: Cassandra, Hadoop, MongoDB, something.
MongoDB is basically going to reduce your sample schema to "clusters" and "news", with everything else basically being contained in those two.
The good news:
This will make it easy to modify fields.
Map-reduce operations are a natural fit for the type of work that you're doing. You perform a map-reduce and then save the data back to the "news" item and all will be well.
The bad news:
It's easy to lose track of the structure of data with something like Mongo. Hadoop and Hive typically enforce your schema a little more. But in any case, you'll need to write down some form of schema or just drown.
If you plan to do this for some non-trivial amount of data, then you're going to want "horizontal" scalability. MongoDB is "ok" for this, Hadoop is definitely a "leader" for this.

Database Optimization techniques for amateurs

Can we get a list of basic optimization techniques going (anything from modeling to querying, creating indexes, views to query optimization). It would be nice to have a list of these, one technique per answer. As a hobbyist I would find this to be very useful, thanks.
And for the sake of not being too vague, let's say we are using a mainstream DB such as MySQL or Oracle, and that the DB will contain 500,000-1m or so records across ~10 tables, some with foreign key constraints, all using the most typical storage engines (e.g. InnoDB for MySQL). And of course, the basics such as PKs are defined as well as FK constraints.
Learn about indexes, and use them properly. Generally speaking*, follow these guidelines:
Every table should have a clustered index
Fields used for filters and sorts are good candidates for indexing
More selective fields are better candidates for indexing
For best performance on crucial queries, design "covering indexes" for those queries
Make sure your indexes are actually being used, and remove those that aren't
If your table has 15 fields, and you make 15 indexes, each with only a single field, you're doing it wrong :)
*There are some exceptions to these rules if you know what you're doing. My experience is Microsoft SQL Server, but I would presume most of this advice would still apply to a different RDBMS.
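To see a covering index in action without installing a server, the following Python sketch uses the standard library's `sqlite3` module. SQLite is swapped in here purely for convenience (the answer's experience is with SQL Server), and the `orders` table and index name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL, note TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total, note) VALUES (?, ?, ?)",
    [(i % 50, i * 1.5, "n/a") for i in range(1000)],
)
# The index contains every column the query below needs.
conn.execute("CREATE INDEX idx_orders_customer_total ON orders (customer_id, total)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT customer_id, total FROM orders WHERE customer_id = ?",
    (7,),
).fetchall()
# The human-readable description is the last column of each plan row.
plan_details = [row[-1] for row in plan]
```

Inspecting `plan_details` is also a simple way to check that an index is actually being used; other databases expose the same information through their own EXPLAIN variants.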
IMO, by far the best optimization is to have the data model fit the problem domain for which it was built. When it does not, the resulting symptom is difficult-to-write or convoluted queries in order to get the information desired and that typically rears itself when reports are built against the database. Thus, in designing a database it helps to have an idea as to the types and nature of the information, such as reports, that the users will want from the system.
When talking database design, check out the database normalization, e.g. the wikipedia article: Normal forms.
If you have a good design and still you need to optimize for performance, try Denormalisation.
If you have specific needs which are not covered by relational model efficiently, look at other models covered by the term NoSQL.
Some query/schema optimizations:
Be mindful when using DISTINCT or GROUP BY. I find that many new developers will use DISTINCT in places where it really is not needed or could be rewritten more efficiently using an Exists statement or a derived query.
Be mindful of Left Joins. All too often I find new SQL developers will ignore the schema in place and use Left Joins where they really are not necessary. For example:
Select
From Orders
Left Join Customers
On Customers.Id = Orders.CustomerId
If Orders.CustomerId is a required column, then it is not necessary to use a left join.
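A quick way to convince yourself of this is to run both join types against a schema where the column is required. The sketch below uses Python's built-in `sqlite3` with made-up rows; the table and column names follow the example above, and the NOT NULL foreign key is the assumption that makes the two queries equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Customers (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE Orders ("
    "Id INTEGER PRIMARY KEY, "
    "CustomerId INTEGER NOT NULL REFERENCES Customers(Id))"
)
conn.executemany("INSERT INTO Customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO Orders VALUES (?, ?)", [(10, 1), (11, 1), (12, 2)])

query = (
    "SELECT Orders.Id, Customers.Name FROM Orders "
    "{join} Customers ON Customers.Id = Orders.CustomerId "
    "ORDER BY Orders.Id"
)
# Every order has a matching customer, so both joins return the same rows.
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()
inner_rows = conn.execute(query.format(join="INNER JOIN")).fetchall()
```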
Be a student of new features. Currently, MySQL does not support common-table expressions which means that some types of queries are cumbersome and probably slower to write than they would be if CTEs were supported. However, that will not be true forever. Keep up on new syntax features in MySQL which might be used to make existing queries more efficient.
You do not have to use surrogate keys everywhere. There might be tables better suited to an intelligent key (e.g. US State abbreviations, Currency Codes etc) which would enable developers to avoid additional joins in many cases.
If possible, find ways of archiving data to an OLAP or reporting server. The smaller you can make the production data, the faster it will run.
A design that concisely models your problem is always a good start. Overgeneralizing the data model can lead to performance problems. For example, I've heard reports of projects striving for uber-flexibility that use the RDBMS as a dumb "name/value" store - and resulting performance was appalling.
Once a good design is in place, then use the tools provided by the RDBMS to help it achieve good performance. Single field PKs (no composites), but composite business keys as an index with unique constraint, use of appropriate data types, e.g. using appropriate numeric types for numeric values rather than char or similar. Physical attributes of the hardware the RDBMS is running on should also be considered, since the bulk of query time is often disk I/O - but of course don't take this for granted - use a profiler to find out where the time is going.
Depending upon the update/query ratio, materialized views/indexed views can be useful in improving performance for slow running queries. A poor-man's alternative is to use triggers to invoke a procedure that populates the table with a result of a slow-running, infrequently-changed view.
Query optimization is a bit of a black art since it is often database-dependent, but some rules of thumb are given here - Optimizing SQL.
Finally, although possibly outside the intended scope of your question, use a good data access layer in your application, and avoid the temptation to roll your own - there are surely tested and performant implementations available for all major languages. Use of caching at the data access layer, middle tier and application layer can help improve performance considerably.
Use fewer queries whenever possible. Use "JOIN", and group your tables so that a single query gives you your results.
A good example is the Modified Preorder Tree Traversal (MPTT) to get all of a tree node's parents, ordered, in a single query.
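As a small demonstration of the MPTT idea, the Python sketch below stores a tree with left/right numbers in SQLite (via the standard `sqlite3` module) and fetches all ancestors of a node with one query. The table layout, node names, and numbering are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tree (name TEXT PRIMARY KEY, lft INTEGER, rgt INTEGER)")
# Food -> Fruit -> (Red -> Cherry, Yellow), Food -> Meat
conn.executemany(
    "INSERT INTO tree VALUES (?, ?, ?)",
    [
        ("Food", 1, 12),
        ("Fruit", 2, 9),
        ("Red", 3, 6),
        ("Cherry", 4, 5),
        ("Yellow", 7, 8),
        ("Meat", 10, 11),
    ],
)

# An ancestor's (lft, rgt) interval strictly encloses the node's interval.
rows = conn.execute(
    "SELECT parent.name FROM tree AS node "
    "JOIN tree AS parent ON node.lft > parent.lft AND node.rgt < parent.rgt "
    "WHERE node.name = ? ORDER BY parent.lft",
    ("Cherry",),
).fetchall()
ancestors = [name for (name,) in rows]
```

The trade-off is that inserts and moves have to renumber part of the tree, so the technique suits data that is read far more often than it is changed.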
Take a holistic approach to optimization.
Consider the impact of slow disks, network latency, lack of memory, and server load.