Store/query semi-structured data [closed] - perl

We have an application written in Perl that creates a complex data structure for our subscribers (we have more than 4M subscribers). Each subscriber has some common fields that are present in all of them, while some other fields are missing for some subscribers.
The data looks like this:
%subscribers = (
    "user_001" => {
        "name"  => "sam",
        "age"   => "13",
        "color" => ['red', 'blue'],
        "item"  => {
            "old" => ['PC', 'pen'],
            "new" => ['tap', 'car'],
        },
    },
    "user_002" => {
        "name"  => "ali",
        "age"   => "54",
        "color" => ['red', 'null', 'green'],
        "item"  => {
            "old" => ['phone', 'TV'],
        },
    },
    "user_003" => {
        "name" => "foo",
        "age"  => "02",
        "item" => {
            "old" => [''],
        },
    },
    # ...
);
Our real data are messier and more complex than this.
Now we are trying to store these data in a DB and then run queries against them, e.g. get the users that have 'tap' among their new items, or whose age is greater than 30 years.
What we need to know is:
What is the best way to store the data (MySQL or Oracle DBs are not an option); we need something for semi-structured data. And how do we run these queries with performance in mind?
We just need a headline to start our search (and yes, we did our homework using Google ^_^).
BR,
Hosen

It sounds like your dataset is still small and manageable, so you need to be very careful about dismissing traditional database solutions at this early point. You haven't really offered any hard reasons why SQL solutions have been dismissed (new features in recent years are targeted squarely at NoSQL use-cases), so as someone who has trawled through this issue in the past (in a large Perl project), I will offer some questions you should ask yourself:
Will the new technology choice become the authoritative data store, or just something you want to bolt-on with minimum changes to help you service queries?
If you just want to quickly bolt on a new API to service queries, NoSQL technologies such as MongoDB (with an excellent Perl driver) become a viable option, and you can slurp in a Perl hash like the one you've described with very little code. If you only use it as a (possibly read-only) cache, you mitigate the durability concerns and avoid a lot of expensive data cleaning/validation/normalization effort, which gets you to an 80% solution very quickly.
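For example, here is a minimal sketch using the CPAN MongoDB driver against a local mongod; the database/collection names and the query are illustrative, and it assumes the %subscribers hash from the question:

use MongoDB;

my $client = MongoDB->connect('mongodb://localhost');
my $subs   = $client->ns('crm.subscribers');   # example database.collection

# Load the existing hash more or less as-is, one document per subscriber.
for my $id (keys %subscribers) {
    my %doc = %{ $subscribers{$id} };
    $doc{age} += 0 if defined $doc{age};   # store age as a number so numeric queries work
    $subs->insert_one({ _id => $id, %doc });
}

# "Users that have 'tap' in their new items, or whose age is greater than 30."
my $cursor = $subs->find({
    '$or' => [
        { 'item.new' => 'tap' },
        { 'age'      => { '$gt' => 30 } },
    ],
});
while ( my $user = $cursor->next ) {
    print "$user->{name}\n";
}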
If you want something durable to replace your current data storage, it's true that there are options other than a SQL RDBMS. XML stores like eXist-db are very powerful if you already work with XML ecosystems and your data fits the document-object paradigm where XQuery/XPath makes sense (there's even a Perl RPC interface for it). It's worth taking a look at commercial vendors like MarkLogic or EnterpriseDB if you have time pressures and a decent budget. If your data is truly messy and can be efficiently modeled as a graph of entities and relationships, it's tempting to consider things like SparkleDB, Neo4j or Virtuoso; however, in my limited exposure to these, whilst they have a lot of potential for servicing otherwise impossible or difficult queries/analyses, they make a terrible place to curate and manage your core business data.
What kinds of queries, reports/analyses do you hope to do? This will determine how much data cleaning and normalization effort will be required. Answering this question will help you focus your choice:
If you think you'll end up doing data cleaning/validation/transformation in order to implement your final choice and make the data queryable, you might as well use a traditional SQL database but explore using it in a "NoSQL" way (there's lots of advice/comparison out there).
If you are hoping to avoid doing a lot of data cleaning/validation/normalization due to lack of time or budget, I'm afraid that the more mature XML/RDF/SPARQL solutions will require 10x more engineering effort to design and establish a working system built around the messy data than simply cleaning it properly in the first place.
If you have truly messy, heterogeneous data (especially when you need to continuously import from 3rd parties over which you have no control and you want to avoid constant data cleaning effort), then leaving your messy data "as-is" lands you in a spectrum of hurt. At one extreme (in terms of cost, but also query power/expressiveness and accuracy) you have the XML/RDF/SPARQL solutions mentioned before. At the cheaper/quicker/simpler end (perhaps too simple in many cases) you have contenders such as MongoDB, Cassandra and CouchDB (this is by no means an exhaustive list, and they have differing levels of Perl support or quality of Perl clients).

Related

DynamoDB vs MongoDB NoSQL [closed]

I'm trying to figure out what I can use for a future project. We plan to store about 500k records per month in the first year, and maybe more in the following years. This is a vertical application, so there's no need to use a relational database for this; that's the reason I decided to choose a NoSQL data store.
The first option that came to mind was MongoDB, since it is a very mature product with a lot of support from the community. On the other hand, we have a brand new product that offers a managed service at top performance. I'll develop this application, but there's no maintenance plan (at least for now), so I think the managed service will be a huge advantage, since Amazon provides an elastic way to scale.
My major concern is the query structure. I haven't looked at DynamoDB's query capabilities yet, but since it is a key/value data store I feel it could be more limited than MongoDB.
If anyone has experience moving a project from MongoDB to DynamoDB, any advice would be greatly appreciated.
I know this is old, but it still comes up when you search for the comparison. We were using Mongo, have moved almost entirely to Dynamo, which is our first choice now. Not because it has more features, it doesn't. Mongo has a better query language, you can index within a structure, there's lots of little things. The superiority of Dynamo is in what the OP stated in his comment: it's easy. You don't have to take care of any servers. When you start to set up a Mongo sharded solution, it gets complicated. You can go to one of the hosting companies, but that's not cheap either. With Dynamo, if you need more throughput, you just click a button. You can write scripts to scale automatically. When it's time to upgrade Dynamo, it's done for you. That is all a lot of precious stress and time not spent. If you don't have dedicated ops people, Dynamo is excellent.
So we are now going with Dynamo by default. Mongo maybe, if the data structure is complicated enough to warrant it, but then we'd probably go back to a SQL database. Dynamo is obtuse; you really need to think about how you're going to build it, and likely you'll use Redis in ElastiCache to make it work for complex stuff. But it sure is nice to not have to take care of it. You code. That's it.
With 500k documents, there is no reason to scale whatsoever. A typical laptop with an SSD and 8GB of RAM can easily handle tens of millions of records, so if you are trying to pick because of scaling, your choice doesn't really matter. I would suggest you pick what you like the most, and perhaps the one with the most online support.
For quick overview comparisons, I really like this website, which has many comparison pages, e.g. AWS DynamoDB vs. MongoDB: http://db-engines.com/en/system/Amazon+DynamoDB%3BMongoDB
Short answer: Start with SQL and add NoSQL only when/if needed. (unless you don't need anything beyond very simple queries)
My personal experience: I haven't used MongoDB for queries but as of April 2015 DynamoDB is still very crippled when it comes to anything beyond the most basic key/value queries. I love it for the basic stuff but if you want query language then look to a real SQL database solution.
In DynamoDB you can query on a hash or on a hash and range key, and you can have multiple global secondary indexes. I'm doing queries on a single table with 4 possible filter parameters and sorting the results; this is supported (barely) through the use of global secondary indexes with filter expressions. The problem comes when you try to get the total results matching the filter: you can't just fetch the first 10 items matching the filter; rather, DynamoDB checks 10 items and you may get 0 valid results, forcing you to keep re-scanning from the continuation key - a pain in the neck, and it consumes too much of your table read quota for a simple scenario.
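To make the shape of such a query concrete, here is a rough sketch using Paws (a community AWS SDK for Perl, chosen to stay consistent with the Perl examples earlier in this document); the table, index and attribute names are made up:

use Paws;

my $ddb = Paws->service('DynamoDB', region => 'us-east-1');

# Query a global secondary index, then narrow further with a filter expression.
my $res = $ddb->Query(
    TableName                 => 'Events',                  # hypothetical table
    IndexName                 => 'status-created-index',    # hypothetical GSI
    KeyConditionExpression    => '#s = :status AND #c > :since',
    FilterExpression          => '#cat = :cat',
    ExpressionAttributeNames  => { '#s' => 'status', '#c' => 'created', '#cat' => 'category' },
    ExpressionAttributeValues => {
        ':status' => { S => 'active' },
        ':since'  => { S => '2015-01-01' },
        ':cat'    => { S => 'news' },
    },
    Limit => 10,
);

# The filter is applied AFTER the 10 items are read: you may get back fewer
# (or zero) matches and then have to continue from LastEvaluatedKey.
printf "matched %d of %d scanned\n", $res->Count, $res->ScannedCount;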
To be specific about the limit problem with filters in the query, this is from the docs (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#ScanQueryLimit):
In a response, DynamoDB returns all the matching results within the scope of the Limit value. For example, if you issue a Query or a Scan request with a Limit value of 6 and without a filter expression, the operation returns the first six items in the table that match the request parameters. If you also supply a FilterExpression, the operation returns the items within the first six items in the table that match the filter requirements.
My conclusion is that queries involving FilterExpressions are only usable on very rare occasions and are not scalable, because each query can easily read most or all of your table, which consumes far too many DynamoDB read units. Once you use too many read units you'll get throttled and see poor performance.
Expert opinion: In the AWS summit on Apr 9, 2015, Brett Hollman, Manager, Solutions Architecture, AWS, in his talk on scaling to your first 10 million users, advocates starting with a SQL database and then using NoSQL only when and if it makes sense, because sooner or later you'll probably need a SQL server somewhere in your stack. His slides are here: http://www.slideshare.net/AmazonWebServices/deep-dive-scaling-up-to-your-first-10-million-users
See slide 28.
We chose a combination of Mongo/Dynamo for a healthcare product. Basically, Mongo allows better searching, but the hosted Dynamo is great because it's HIPAA compliant without any extra work. So we host the Mongo portion, with no personal data, on a standard setup and let Amazon deal with the HIPAA portion in terms of infrastructure. We can query certain items from Mongo which bring up documents with pointers (IDs) to the related Dynamo documents.
There were two main reasons we chose to do this with Mongo instead of hosting the entire application on Dynamo. First, we needed to perform location-based searches, which Mongo is great at; at the time Dynamo was not, although it does have an option for that now.
Second, some documents were unstructured and we did not know ahead of time what the data would be. For example, let's say user A inputs a document into the "form" collection like this: {"username": "user1", "email": "me#me.com"}, and another user puts this in the same collection: {"phone": "813-555-3333", "location": [28.1234,-83.2342]}. With Mongo we can search any of these dynamic and unknown fields at any time; with Dynamo you could do this, but you would have to create an index every time a new field you wanted searchable was added. So if you have never had a phone field in your Dynamo document before and then, all of a sudden, someone adds one, it's completely unsearchable.
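A quick sketch of that with the Perl MongoDB driver (the collection name is illustrative; the values are taken from the example above):

use MongoDB;

my $client = MongoDB->connect('mongodb://localhost');
my $forms  = $client->ns('app.form');   # example database.collection

# Two documents in the same collection with completely different fields.
$forms->insert_one({ username => 'user1', email => 'me#me.com' });
$forms->insert_one({ phone => '813-555-3333', location => [ 28.1234, -83.2342 ] });

# The phone field was never declared anywhere, yet it is immediately queryable.
my $doc = $forms->find_one({ phone => '813-555-3333' });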
Now this brings up another point, which you have mentioned: choosing the right solution for the job does not always mean choosing the best product for the job. For example, you may have a client who needs, and will use, the system you created for 10+ years. Going with a SaaS/IaaS solution that is good enough to get the job done may be a better option, as you can rely on Amazon to have kept up and maintained their systems over the long haul.
I have worked with both and am kind of a fan of both.
But you need to understand when to use what and for what purpose.
I don't think it's a great idea to move your entire database to DynamoDB. The reason is that querying is difficult except on primary and secondary keys, indexing is limited, and scanning in DynamoDB is painful.
I would go for a hybrid sort of setup, where the extensively queryable data lives in MongoDB; with all its features you would never feel constrained when providing enhancements or modifications.
DynamoDB is lightning fast (faster than MongoDB), so it is often used as an alternative session store in scalable applications. DynamoDB best practices also suggest that if there is plenty of data that is rarely used, you move it to another table.
So suppose you have articles or feeds. People are more likely to look for last week's or this month's content; it is really rare for people to visit two-year-old data. For these purposes, DynamoDB prefers to have data stored by month or year in different tables.
DynamoDB is seamlessly scalable, something you will have to do manually in MongoDB. However, you will lose out on DynamoDB's performance if you don't understand throughput partitioning and how scaling works behind the scenes.
DynamoDB should be used where speed is critical; MongoDB, on the other hand, has many more tools and features, something DynamoDB lacks.
For example, you can configure a MongoDB replica set so that one of the replicas holds a data instance from 8 (or however many) hours ago. Really useful if you messed something up big time in your DB and want to get the data back as it was before.
That's my opinion though.
Bear in mind, I've only experimented with MongoDB...
From what I've read, DynamoDB has come a long way in terms of features. It used to be a super-basic key-value store with extremely limited storage and querying capabilities. It has since grown, now supporting bigger document sizes, JSON support and global secondary indexes. The gap between what DynamoDB and MongoDB offer in terms of features grows smaller with every month. The new features of DynamoDB are expanded on here.
Many of the MongoDB vs. DynamoDB comparisons are out of date due to the recent addition of DynamoDB features. However, this post offers some other convincing points in favor of DynamoDB, namely that it's simple, low maintenance, and often low cost. Another discussion here of database choices was interesting to read, though slightly old.
My takeaway: if you're doing serious database queries or working in languages not supported by DynamoDB, use MongoDB. Otherwise, stick with DynamoDB.

Product Catalog - Document Store or Column Family Store [closed]

I'm wondering which technology would do better for a typical product catalog of a webshop. I'm writing my master's thesis about NoSQL in the enterprise environment, and I think I have focused on document stores for too long now.
I've read a lot of articles that recommend document stores because of the flexibility needed to model thousands of different products. But as far as I know now, column-family stores like Cassandra offer the same flexibility.
What I like most about the idea of using Cassandra is what nosql-database.org says about it (I've marked the most interesting features):
massively scalable, partitioned row store, masterless architecture, linear scale performance, no single points of failure, read/write support across multiple data centers & cloud availability zones. API / Query Method: CQL and Thrift, replication: peer-to-peer, written in: Java, Concurrency: tunable consistency, Misc: built-in data compression, MapReduce support, primary/secondary indexes, security features.
In the end I will focus on building a prototype of a highly available and scalable multishop system which makes use of polyglot persistence, say K/V stores for sessions, a document store or column-family store for the product catalog, and maybe an RDBMS for inventory/pricing, as Sadalage and Fowler mention in their book "NoSQL Distilled".
If possible, provide scientific papers or other reliable sources for your answers.
Thanks!
Document Store's Achilles Heel
Stuart Halloway mentioned that a document store is the biggest schema-lock solution and way too inflexible, which I agree with. Couch/Mongo and others try to mitigate that by providing workarounds to create secondary indices, the ability (and necessity) to be aware of plain object IDs, etc. And of course, if you think about versioning (i.e. adding a "time" variable to your system), document stores quickly fail to provide smooth support for it, or for time travel.
Column Store: Problem Relevance
Cassandra is a really compelling solution for building "scalable"/"distributed" systems, with real examples such as Netflix, where 500 Cassandra nodes can be brought up in AWS in several minutes and all the requests hit a Cassandra ring.
However, given the problem as it is stated in your question, Cassandra would be unnecessary overkill. Not just because it is a bit more complex than "others", or because it is mentally harder to create a solid data model on top of column-oriented stores, but also because a "product catalog" problem is not exactly rocket science. It can be, if you want to add machine learning later to predict/recognize/etc., but a catalog itself is not, and simpler stores such as PostgreSQL, for example, would solve it easily.
Simple Desire to NoSQL
If you really want to use NoSQL for a product catalog, I would definitely consider 3 solutions to fit your prototype:
Riak as a "K/V for Sessions"
Datomic to solve "Product Catalog, Inventory and Pricing"
Depending on the size and nature of the problem and the final solution, I would consider Redis to cache those sessions, while having Datomic comfortably sit on top of Riak as its storage service.
Practice vs. Theory
Two classical NoSQL papers that made NoSQL sound real in practice for the first time are Dynamo and BigTable. I consider Datomic to be the next evolutionary step in the DB universe by introducing a hybrid data model with true indices and relations without a schema lock, and immutability from which everything follows: safe time travel, caching, local db values, etc.
Practically, if it wasn't a master's thesis, depending on the real problem scale and definition, I would be choosing between Datomic and PostgreSQL to solve catalog, inventory, pricing, etc.
A big advantage of Datomic here is time travel. In practice it is very important to be able to safely and easily do that in a "Shopping System".
A big advantage of PostgreSQL is its familiarity and SQL tools availability for analytics and reporting.
By now I think that column-family stores are not well suited for product catalogs.
This is because products often contain collections of some kind, like tags, track lists for music records, different sizes for clothes and so on.
Cassandra supports collections by now, BUT they are not searchable! This is a must-have feature for tags, for example.
In contrast, MongoDB, for example, offers the $in operator to search in nested arrays...
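For instance, a sketch with the Perl MongoDB driver (the names are illustrative):

use MongoDB;

my $client   = MongoDB->connect('mongodb://localhost');
my $products = $client->ns('shop.products');   # example database.collection

$products->insert_one({
    sku  => 'CD-0042',
    name => 'Some Record',
    tags => [ 'rock', 'vinyl', 'reissue' ],
});

# Find every product carrying at least one of the given tags.
my $cursor = $products->find({ tags => { '$in' => [ 'rock', 'jazz' ] } });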
I don't want to say it is not possible to model a product catalog in Cassandra, but I think it is much more straightforward to do it in a document store.

Are document databases good for storing large amounts of Stock Tick data? [closed]

I was thinking of using a database like MongoDB or RavenDB to store a lot of stock tick data and wanted to know if this would be viable compared to a standard relational database such as SQL Server.
The data would not really be relational and would be a couple of huge tables. I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.
Example data:
500 symbols * 60 min * 60 sec * 300 days... (per record we store: date, open, high, low, close, volume, openint - all decimal/float)
So what do you guys think?
Since this question was asked in 2010, several database engines have been released or have developed features that specifically handle time series such as stock tick data:
InfluxDB - see my other answer
Cassandra
With MongoDB or other document-oriented databases, if you target performance, the advice is to contort your schema to organize ticks in an object keyed by seconds (or an object of minutes, each minute being another object with 60 seconds). With a specialized time-series database, you can query the data simply with:
SELECT open, close FROM market_data
WHERE symbol = 'AAPL' AND time > '2016-09-14' AND time < '2016-09-21'
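For contrast, the pre-aggregated per-second bucket document that the MongoDB workaround above implies might look roughly like this (a sketch with the Perl driver; the collection and field names are made up):

use MongoDB;

my $client = MongoDB->connect('mongodb://localhost');
my $ticks  = $client->ns('market.ticks_by_minute');   # example database.collection

# One document per symbol per minute; each second is a key inside "ticks".
$ticks->insert_one({
    symbol => 'AAPL',
    minute => '2016-09-14T14:30:00Z',
    ticks  => {
        '00' => { open => 107.10, high => 107.20, low => 107.00, close => 107.10, volume => 1200 },
        '01' => { open => 107.10, high => 107.30, low => 107.10, close => 107.30, volume => 800 },
        # ... one entry per second, up to '59'
    },
});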
I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.
With InfluxDB, this is very straightforward. Here's how to get the daily minimums and maximums:
SELECT MIN("close"), MAX("close") FROM "market_data" WHERE symbol = 'AAPL'
GROUP BY time(1d)
You can group by time intervals which can be in microseconds (u), seconds (s), minutes (m), hours (h), days (d) or weeks (w).
TL;DR
Time-series databases are better choices than document-oriented databases for storing and querying large amounts of stock tick data.
The answer here will depend on scope.
MongoDB is great way to get the data "in" and it's really fast at querying individual pieces. It's also nice as it is built to scale horizontally.
However, what you'll have to remember is that all of your significant "queries" are actually going to result from "batch job output".
As an example, Gilt Groupe has created a system called Hummingbird that they use for real-time analytics on their web site. Presentation here. They're basically dynamically rendering pages based on collected performance data in tight intervals (15 minutes).
In their case, they have a simple cycle: post data to mongo -> run map-reduce -> push data to webs for real-time optimization -> rinse / repeat.
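As a rough sketch of the "run map-reduce" step (using the Perl MongoDB driver for consistency with the earlier examples; it assumes a raw "ticks" collection with time, symbol and volume fields, all of which are made up for illustration):

use MongoDB;

my $client = MongoDB->connect('mongodb://localhost');
my $db     = $client->get_database('market');   # example database

# Roll raw ticks up into per-minute volume totals, materialized into a new collection.
my $map = q{
    function () {
        var minute = new Date(Math.floor(this.time.getTime() / 60000) * 60000);
        emit({ symbol: this.symbol, minute: minute }, this.volume);
    }
};
my $reduce = q{
    function (key, values) { return Array.sum(values); }
};

$db->run_command([
    mapReduce => 'ticks',
    map       => $map,
    reduce    => $reduce,
    out       => 'volume_per_minute',
]);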
This is honestly pretty close to what you probably want to do. However, there are some limitations here:
Map-reduce is new to many people. If you're familiar with SQL, you'll have to accept the learning curve of Map-reduce.
If you're pumping in lots of data, your map-reduces are going to be slower on those boxes. You'll probably want to look at slaving / replica pairs if response times are a big deal.
On the other hand, you'll run into different variants of these problems with SQL.
Of course there are some benefits here:
Horizontal scalability. If you have lots of boxes then you can shard across them and get somewhat linear performance increases on Map/Reduce jobs (that's how they work). Building such a "cluster" with SQL databases is a lot more costly and complicated.
Really fast speed, and as with point #1, you get the ability to add RAM horizontally to keep up the speed.
As mentioned by others though, you're going to lose access to ETL and other common analysis tools. You'll definitely be on the hook to write a lot of your own analysis tools.
Here's my reservation with the idea - and I'm going to openly acknowledge that my working knowledge of document databases is weak. I’m assuming you want all of this data stored so that you can perform some aggregation or trend-based analysis on it.
If you use a document based db to act as your source, the loading and manipulation of each row of data (CRUD operations) is very simple. Very efficient, very straight forward, basically lovely.
What sucks is that there are very few, if any, options for extracting this data and cramming it into a structure more suitable for statistical analysis, e.g. a columnar database or cube. If you load it into a basic relational database, there are a host of tools, both commercial and open source (such as Pentaho), that will accommodate the ETL and analysis very nicely.
Ultimately though, what you want to keep in mind is that every financial firm in the world has a stock analysis/ auto-trader application; they just caused a major U.S. stock market tumble and they are not toys. :)
A simple datastore such as a key-value or document database is also beneficial in cases where performing analytics reasonably exceeds a single system's capacity (or would require an exceptionally large machine to handle the load). In these cases, it makes sense to use a simple store since the analytics require batch processing anyway. I would personally look at finding a horizontally scaling processing method for coming up with the unit/time analytics required.
I would investigate using something built on Hadoop for parallel processing. Either use the framework natively in Java/C++ or some higher level abstraction: Pig, Wukong, binary executables through the streaming interface, etc. Amazon offers reasonably cheap processing time and storage if that route is of interest. (I have no personal experience but many do and depend on it for their businesses.)

Why exactly do we use NoSQL? [closed]

Having understood some of the advantages that NoSQL offers (scalability, availability, etc.), I am still not clear why a website would want to use a non-relational database.
Can I get some help on this, preferably with an example?
Better performance
NoSQL databases sometimes have better performance, although this depends on the situation and is disputed.
Adaptability
You can add and remove "columns" without downtime. In most SQL servers, this takes a long time and puts a heavy load on the database.
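For instance, with a document store the "new column" is just a new field in the next write; a sketch with the Perl MongoDB driver (names are illustrative):

use MongoDB;

my $client = MongoDB->connect('mongodb://localhost');
my $users  = $client->ns('app.users');   # example database.collection

# No ALTER TABLE and no migration window: just start writing the new field.
$users->update_one(
    { _id => 42 },
    { '$set' => { loyalty_tier => 'gold' } },
);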
Application design
It is desirable to separate the data storage from the logic. If you join and select things in SQL queries, you are mixing business logic with storage.
NoSQL databases are there to solve several things, mainly:
(buzz) BigData => think TB, PB, etc..
Working with distributed systems / datasets => say you have 42 products; 13 of them will live in the Chicago datacenter, 21 in NY's and another 8 somewhere in Japan, but once you query against all 42 products, you do not need to know where they are located: the NoSQL DB will. This also allows you to engage a lot more brain power (servers) to solve hard computational problems [does not seem like it fits your use case, but it is an interesting thing to note].
Partitioning => having your DB be easily distributed, besides those cool 8 products in Japan, also allows for easy data replication, so those 42 products can be replicated with a factor of 3, for example, which means your DB would have 3 copies of every product. Hence if something goes down, no problem => here is a replica available. This is where NoSQL databases actually shine vs. RDBMS. Granted, you can shard, partition and cluster Oracle / MySQL / PostgreSQL / etc., BUT it is a process several orders of magnitude more complicated, and usually a maintenance headache for most people you'd employ.
BUT to your question:
why a website would want to use a non-relational database
When most of the people I've worked with / met / chatted with choose NoSQL for their "website", it is unfortunately NOT for the reasons above, but simply because it is COOLER to do so. And in fact many projects FAIL / have extreme difficulties due to this reason.
If most NoSQL gurus took their masks off, they would all agree that MOST of the problems (or, as people call them, websites) that developers solve day to day can, and rather should, be solved with a SQL solution such as PostgreSQL, MySQL, etc., with some cool Redis cache layer on top of it. Only a small subset of problems would REALLY benefit from NoSQL.
I personally love Riak, as I am a firm believer that a NoSQL, fault tolerant DB should have an extremely strong, flexible and naturally distributed foundation => such as Erlang OTP. Plus I am a fan of simplicity. But again, given the problem, I would choose whatever works best, and most of the time I will NEED that consistency ( especially if we are talking about money / financial world / mission critical / etc.. ).
The main reason not to use an SQL database is scalability. The transactional guarantees and the relational model make it almost impossible to scale a database usefully across more than a few machines, especially given the write-heavy workloads generated by modern web applications.
An app like Facebook can't be made to work on a straightforward SQL database, except by massive partitioning and sharding, which requires significant adjustments to the app logic as well. That's why Facebook developed Cassandra.
NoSQL basically means you make do without some SQL-typical features like immediate consistency or easy joins, in exchange for being able to use a database that scales much better.
Conversely, there is no point in using NoSQL if your website never has more than a dozen concurrent users (which is true for the vast majority of all sites).
We need to understand what the problem is in your current application:
Transactions
Amount of data
Data structure
NoSQL solves the problems of scalability and availability at the expense of atomicity and consistency.
Basically this drives us to the CAP theorem. Eric Brewer noted that of the three properties of shared-data systems - consistency, availability and tolerance to network partitions - only two can be achieved at any given moment in time (the CAP theorem).
NOSQL Approach
Schemaless data representation:
Most of them offer schemaless data representation & allow storing semi-structured data.
The data can continue to evolve over time - including adding new fields or even nesting the data, for example in the case of a JSON representation.
Development time:
No complex SQL queries.
No JOIN statements.
Speed:
Very high-speed delivery & mostly built-in entity-level caching
Plan ahead for scalability:
Avoiding rework
There are many types of NoSQL databases; web applications typically use document-based databases. A document DB allows us to store JSON, XML, YAML and even Word documents and manipulate them. So NoSQL is an obvious choice; in particular MongoDB, a document database which supports the JSON format by default, is the most preferred choice of developers and designers.

Key-Value Stores vs. RDBMs vs. "Cloud" DBs (SDB) [closed]

I'm comfortable in the MySQL space having designed several apps over the past few years, and then continuously refining performance and scalability aspects. I also have some experience working with memcached to provide application side speed-ups on frequently queried result sets. And recently I implemented the Amazon SDB as my primary "database" for an ecommerce experiment.
To oversimplify, a quick justification I went through in my mind for using the SDB service was that a schema-less database structure would allow me to focus on the logical problem of my project and rapidly accumulate content in my data store. That is, don't worry about setting up and normalizing all possible permutations of a product's attributes beforehand; simply start loading in the products and SDB will simply remember everything that is available.
Now that I have managed to get through the first few iterations of my project and need to set up simple interfaces to the data, I am running into issues with things I had taken for granted when working with MySQL, e.g. grouping in select statements and LIMIT syntax to query "items 50 to 100". The ease advantage I gained using the schema-free architecture of SDB, I lost to the performance hit of querying/looping over a result set with just over 1800 items.
Now I'm reading about projects like Tokyo Cabinet that are extending the concept of in-memory key-value stores to provide pseudo-relational functionality at ridiculously faster speeds (14x, I read somewhere).
My question:
Are there some rudimentary guidelines or heuristics that I, as an application designer/developer, can go through to evaluate which DB tech is the most appropriate at each stage of my project?
Ex: At a prototyping stage where logical/technical unknowns of the application make data structure fluid: use SDB.
At a more mature stage where user deliverables are a priority, use traditional tools where you don't have to spend dev time writing sorting, grouping or pagination logic.
Practical experience with these tools would be very much appreciated.
Thanks SO!
Shaheeb R.
The problems you are finding are why RDBMS specialists view some of the alternative systems with a jaundiced eye. Yes, the alternative systems handle certain specific requirements extremely fast, but as soon as you want to do something else with the same data, the fleetest suddenly becomes the laggard. By contrast, an RDBMS typically manages the variations with greater aplomb; it may not be quite as fast as the fleetest for the specialized workload which the fleetest is micro-optimized to handle, but it seldom deteriorates as fast when called upon to deal with other queries.
The new solutions are not silver bullets.
Compared to traditional RDBMS, these systems make improvements in some aspect (scalability, availability or simplicity) by trading-off other aspects (reduced query capability, eventual consistency, horrible performance for certain operations).
Think of these not as replacements of the traditional database, but they are specialized tools for a known, specific need.
Take Amazon SimpleDB, for example: SDB is basically a huge spreadsheet. If that is what your data looks like, then it probably works well, and the superb scalability and simplicity will save you a lot of time and money.
If your system requires very structured and complex queries but you insist on one of these cool new solutions, you will soon find yourself in the middle of re-implementing an amateurish, ill-designed RDBMS, with all of its inherent problems.
In this respect, if you do not know whether these will suit your needs, I think it is actually better to do your first few iterations in a traditional RDBMS, because they give you the best flexibility and capability, especially in a single-server deployment and under modest load (see the CAP theorem).
Once you have a better idea about what your data will look like and how it will be used, then you can match your needs with an alternative solution.
If you want the simplicity of a cloud hosted solution, but needs a relational database, you can check out: Amazon Relational Database Service