Is nosql a right tool for multy level forum like comment system? - mongodb

I want to build a web app similar to Reddit.com, where you have multy level of comments, lots of reads and writes. I was wondering if nosql and mongoDB in particular is the right tool for this?

Comments -- it's really thing for nosql database, no doubt. You avoiding multiple joins to itself. And it's means that your system can scale out!
With mongodb you can store all hierarchy within one document. Some peoples can say that here will be problems with atomic updates, but i guess that it's not a problem because of you can load and save back entire comments tree. In any way you can easy redesign your system later to support atomic updates and avoid issues with concurrency.

Reddit itself uses Cassandra. If you want something "similar to reddit.com," maybe you should look at their source -- https://github.com/reddit/reddit/wiki.
Here's what David King (ketralnis) said earlier this year about the Cassandra 0.7 release: "Running any large website is a constant race between scaling your user base and scaling your infrastructure to support it. Our traffic more than tripled this year, and the transparent scalability afforded to us by Apache Cassandra is in large part what allowed us to do it on our limited resources. Cassandra v0.7 represents the real-life operations lessons learned from installations like ours and provides further features like column expiration that allow us to scale even more of our infrastructure."
However, Rick Branson notes that Reddit doesn't take full advantage of Cassandra's features, so if you were to start from scratch, you'd want to do some things differently.

Related

What are the disadvantages of using Event sourcing and CQRS?

Event sourcing and CQRS is great because it gets rids developers being stuck with one pre-modelled database which the developer has to work with for the lifetime of the application unless there is a big data migration project. CQRS and ES also has other benefits like scaling eventstore, audit log etc. that are already all over the internet.
But what are the disadvantages ?
Here are some disadvantages that I can think of after researching and writing small demo apps
Complex: Some people say ES is complex. But I'd say having a complex application is better than a complex database model on which you can only run very restricted queries using a query language (multiple joins, indexes etc). I mean some programming languages like Scala have very rich collection library that is very flexible to produce some seriously complex aggregations and also there is Apache Spark which makes it easy query distributed collections. But databases will always be restricted to it's query language capabilities and distributing databases are harder then distributed application code (Just deploy another instance on another machine!).
High disk space usage: Event store might end up using a lot of disk space to store events. But we can schedule a clean up every few weeks and creating snapshot and may be we can store historical events locally on an external HD just incase we need old events in the future ?
High memory usage: State of every domain object is stored in memory which might increase RAM usage and we all how expensive RAM is. BIG PROBLEM!! because I'm poor! any solution to this ? May be use Sqlite instead of storing state in memory ? Am I making things more complex by introducing multiple Sqlite instances in my application ?
Longer bootup time: On failure or software upgrade bootup is slow depending on the number of events. But we can use snapshots to solve this ?
Eventual consistency: Problem for some applications. Imagine if Facebook used Event sourcing with CQRS for storing posts and considering how busy facebook's system is and if I posted a post I would see my fb post the next day :)
Serialized events in Event store: Event stores store events as serialized objects which means we can't query the content of events in the event store which is discouraged anyway. And we won't be able to add another attribute to the event in the future. Solution would be to store events as JSON objects instead of Serialized events ? But is that a good idea ? Or add more Events to support the change to orignal event object ?
Can someone please comment on the disadvantages I brought up here and correct me if I am wrong and suggest any other I may have missed out ?
Here is my take on this.
CQRS + ES can make things a lot simpler in complex software systems by having rich domain objects, simple data models, history tracking, more visibility into concurrency problems, scalability and much more. It does require a different way thinking about the systems so it could be difficult to find qualified developers. But CQRS makes it simpler to separate responsibilities across developers. For example, a junior developer can work purely with the read side without having to touch business logic.
Copies of data will require more disk space for sure. But storage is relatively cheap these days. It may require the IT support team to do more backups and planning how to restore the system in a case in things go wrong. However, server virtualization these days makes it a more streamlined workflow. Also, it is much easier to create redundancy in the system without a monolithic database.
I do not consider higher memory usage a problem. Business object hydration should be done on demand. Objects should not keep references to events that have already been persisted. And event hydration should happen only when persisting data. On the read side you do not have Entity -> DTO -> ViewModel conversions that usually happened in tiered systems, and you would not have any kind of object change tracking that full featured ORMs usually do. Most systems perform significantly more reads than writes.
Longer boot up time can be a slight problem if you are using multiple heterogeneous databases due to initialization of various data contexts. However, if you are using something simple like ADO .NET to interact with the event store and a micro-ORM for the read side, the system will "cold start" faster than any full featured ORM. The important thing here is not to over-complicate how you access the data. That is actually a problem CQRS is supposed to solve. And as I said before, the read side should be modeled for the views and not have any overhead of re-mapping data.
Two-phase commit can work well for systems that do not need to scale for thousands of users in my experience. You would need to choose databases that would work well with the distributed transaction coordinator. PostgreSQL can work well for read and write separate models, for example. If the system needs to scale for a high number of concurrent users, it would have to be designed with eventual consistency in mind. There are cases where you would have aggregate roots or context boundaries that do not use CQRS to avoid eventual consistency. It makes sense for non-collaborative parts of the domain.
You can query events in serialized a format like JSON or XML, if you choose the right database for the event store. And that should be only done for purposes of analytics. Nothing inside the system should query event store by anything other than the aggregate root id and the event type. That data would be indexed and live outside the serialized event.
Just to comment on point 5. I've been told that Facebook does use ES with Eventual Consistency, which is why you can sometimes see a post disappear and reappear after you've posted it.
Usually the read-model your browser is accessing is located 'close' to you, but after you make a post the SPA switches over to a read-model that is close to your write-model. The close proximity between the write-model (events) and the read-model mean you get to see your own post.
However, 15 minutes later your SPA switches back to the first, closer, read-model. If the event containing your post hasn't yet propagated to that read-model you'll see your own post disappear only to reappear sometime later.
I know it's been almost 3 years since this question was asked, but still this article may be useful for someone. Key points are
Scaling with snapshots
Visibility of data
Schema changing
Dealing with complex domains
Need to explain it to most new team members
Event sourcing and CQRS is great because it gets rids developers being stuck with one pre-modeled database which the developer has to work with for the lifetime of the application unless there is a big data migration project.
This is a big misconception. The relational databases were invented exactly for the evolution of the model (thanks to simple two-dimensional tables as opposed to pre-defined hierarchical structures). With views and procedures ensuring the encapsulation of data access, the logical and physical model can evolve independently. This is also why SQL defines DDL and DML in the same language. Some RDBMS also allow all those evolutions to be versioned and deployed online (continuous delivery) as Oracle Edition Based Redefinition.
Big data structures are predefined and can be read only with the code developed for this structure. Ok when consumed immediately but you will have hard time to read it 10 years later without the exact version, and language compiler or interpreter.
I hope to not be late to try to give an answer. In these months I've done a lot of research on that argument with the goal of implementing a production-grade solution for some parts of my architecture where ES can make sense
Complex: Actually, it should not be complex, its mission is to be deadly simple. How? pushing all the complexity from business logic code to infrastructure code. The data access should be done by frameworks that are not enough mature yet. Still, there is no clear winner in the ES/CQRS race, maybe because is still a niche/hipster approach (?) So some team is rolling its own solution or adopting some ready-made technology such as Axon
High disk space usage: I would say more, I would say * potentially infinite* Disk Usage. But if you go towards ES, you also have a very good reason to tolerate this apparent drawback. Let's give some of them:
Audit Logs : The datastore is an event log, we already know it. Financial apps or every mission/safety critical could need a centralized audit log that enables to state Who made What in Which moment. ES provides this capability of the box...you can also decorate your event entries with some business meaningful metadata (eg. a transaction Id correlated with some API consumer identity, A severity level of the operation...)
High Concurrency: there are systems where logical resource states are mutated by many clients in a concurrent way. These are games, IoT platforms, and so on. Logging events instead of change a state representation could be a smart way to provide a total order of events. The other way is to delegate to DB the synchronization stuff. But this is not what you want if you're into ES
Analytics Let's say you have a lot of data with a lot of business value, but you still don't know which. For years we extracted knowledge from applications information by translating data organization with different information models (OLAP cubes). The event store provides something similar out of the box again. Event logs is the rawest form of representation of information And you can have many ways to process them, in batch or reacting to events stored
High memory usage: I think it should be the same once you have built your projection
Longer bootup time: If the read side caches its projections and "remembers" the last update event, it should not re-apply the entire event sequence. Snapshots are mitigation but if you do a lot of snapshots maybe you made a bad choice with ES. I think that this problem is minor in microservices ecosystems, where the boot time can be masked without service interruption. In fact, you get the most out of ES/CQRS when you apply it so microservices
Eventual consistency: Blame CAP theorem for this, not ES. Many non ES/CQRS have to deal with this, but there are a lot of scenarios where it is not a real problem. These are the scenarios where ES fits well. And you can mix ES and non ES services into the same platform
Serialized events in Event store: if it's important to have a non-serialized event representation, you could use a document-oriented DB, but if you do this to make queries over events payload, you are missing the point of ES/CQRS. ES means to move all data manipulation from the DB side to the application tier, where every piece changes fastly, and all are stateless. This enhances scalability and fault tolerance and provides means to shape the organization of your team, doing things like let the frontend guy/girl write his/her BFF in javascript easily.
I hope to put into practices this principles with good results and draw on the benefits of this exciting approach

Best practise for modeling analytics data

I am working on a product using which a user can create his/her mobile site. Now, as this is a mobile site creation platform, there are lots of site created in the application. I need to keep all the visitor data in the database so that product can show the analytics to the user of his/her site.
When there was less site, all was working fine. But now the data is growing fast as there are lots of requests on the server. I make use of mongo as NoSQL DBMS to keep all the data. In a collection named "analytics", I usually insert row with site id so that it can be shown to the user. As the data is large, performance to show user analytics is also slow. Also disk space is growing gradually.
What should be best modeling to keep this type of BIG data.
Should i create collection per site and store data in separate collection per site ?
Should I also separate collection date wise ?
What should be the cleaning procedure of the data. What is the best practices adopted by other leader in the industry ?
Please help
I would strongly suggest reading through MongoDB Optimization strategies at http://docs.mongodb.org/manual/administration/optimization/ . You will find various ways to identify slow performing queries / ops and suggestions to improve them at the mentioned page. Hopefully that should help you resolve slow queries / performance issues.
If you haven't already seen, I would also suggest taking a look at various use cases at http://docs.mongodb.org/ecosystem/use-cases/ , how they are modeled for those scenarios and if there is any that resembles what you are trying to achieve.
After following through the optimization strategies and making appropriate changes, if you still have performance issues, I would suggest posting following information for further suggestions:
What is your current state in terms of performance and what is the planned target state?
How does your system look i.e. hardware / software characteristics?
Once you have needed performance characteristics, following questions may help you achieve your targets:
What are the common query patterns and which ones are slow?
Potentially look for adding indexes that can enhance query performance
Potentially look for schema refactoring based on access patterns
Potentially look for schema refactoring for rolling-up / aggregating analytics data based on how it will be used.
Are writes also slow and is that a concern as well?
Potentially plan for Sharding which would provide write as well as read scaling. Sharding is entirely a topic in itself and I would suggest reading about it at http://docs.mongodb.org/manual/sharding/
How big the data is and how is it growing or intended to grow
Potentially this would give further insights into what could be suggested

NoSQL Database for ECommerce

I will be constructing an ecommerce site, and would like to use a no-sql database, which will fit well with the plans for the app. But when it comes to which database would fit the job, im not sure. After comparing various DB's, the ones that seem best might be either mongo, couch, or even orientdb. I have seen arguments for all of them to be used or not used compared to something like MySQL. But between themselves (nosql databases), which one would fit well with an ecommerce solution?
Note, for the use case, i wont be having thousands of transactions a second. Or similarly high write rates. they will be moderate sure, but at a level that any established database could handle.
CouchDB: Has master to master replication, which I could really use. If not, I will still have to implement the same functionality in code anyways. I need to be able to have a users database, sync with the mothership. (users will have their own, potentially localhost database, that could sync with the main domains server). Couch is also fast, once your queries have been stored in the db.As i will probably have a higher need for read performance. Though not by a lot.
MongoDB: queries are very easy and user friendly. Also, with the fact that end users may need to query for certain things at a given time that I may not be able to account for ahead of time, this seems like it may be a better fit. I dont have to pre-store my queries in the db. Does support atomic transactions, though only when writing to a single document at a time.
OrientDB: A graph database. much different that most people are used to, but with the needs, it could fit very well too. Orient has the benefits of being schemaless, as well as having support for ACID transactions. There is a lot of customer, and product relationships that a graph database could be great with. Orient also support master to master replication, similar to couchdb.
Dont get me wrong, I can see how to build this traditionally with something like MySQL, but the ease and simplicity of a nosql solution, is very attractive. Although, in my case, needing a schemaless solution, would be much easier in nosql rather than mysql. a given product may have more or less items, than another. and avoiding recreating a table whenever a new field is added, is preferrable.
So between these 3 (or even others you think may be better), what features in each could potentially work for, or against me in regards to an ecommerce based site, when dealing with customer transactions?
Edit: The reason I am not using an existing solution, is because with the integrated features I need, there are no solutions available out there. We are also aiming to use this as a full product for our company. There will be a handful of other integrations than just sales. It is also going to be working with a store's POS system.
Since e-commerce can encompass everything from shopping carts through to membership and recurring subscriptions, it is hard to guess exactly what requirements and complexity you are envisioning.
When constructing an e-commerce site, one of the early considerations should be investigating whether there is already an established e-commerce product or toolkit that could meet your requirements. There are many subtleties to processes like ordering, invoicing, payments, products, and customer relationships even when your use case appears to be straightforward. It may also be possible to separate your application into the catalog management aspects (possibly more custom) versus the billing (potentially third party, perhaps even via a hosted billing/payment API).
Another consideration should be who you are developing the e-commerce site for: is this to scratch your own itch, or for a client? Time, budget, and features for a custom build can be difficult to estimate and schedule .. and a niche choice of technology may make it difficult to find/hire additional development expertise.
A third consideration is what your language(s) of choice are for developing your application. Some languages will have more complete/mature/documented drivers and/or framework abstractions for the different databases.
That said, writing an e-commerce system appears to be a rite of passage for many developers ;-).
Edit: a lot has changed since this answer was originally posted in 2012 and you should definitely refer to current product information. For example, MongoDB has had support for Decimal128 values since MongoDB 3.4 (2016) and multi-document transactions since MongoDB 3.6 (2017).
Check the comparison of different available NoSql databases here. Suit your requirement as per that.
MongoDB 4 now multi-document ACID transactions! That makes it suitable for e-Commerce!
Check out: https://www.mongodb.com/transactions

What are the pros and cons of DynamoDB with respect to other NoSQL databases?

We use MongoDB database add-on on Heroku for our SaaS product. Now that Amazon launched DynamoDB, a cloud database service, I was wondering how that changes the NoSQL offerings landscape?
Specifically for cloud based services or SaaS vendors, how will using DynamoDB be better or worse as compared to say MongoDB? Are there any cost, performance, scalability, reliability, drivers, community etc. benefits of using one versus the other?
For starters, it will be fully managed by Amazon's expert team, so you can bet that it will scale very well with virtually no input from the end user (developer).
Also, since its built and managed by Amazon, you can assume that they have designed it to work very well with their infrastructure so you can can assume that performance will be top notch. In addition to being specifically built for their infrastructure, they have chosen to use SSD's as storage so right from the start, disk throughput will be significantly higher than other data stores on AWS that are HDD backed.
I havent seen any drivers yet and I think its too early to tell how the community will react to this, but I suspect that Amazon will have drivers for all of the most popular languages and the community will likely receive this well - and in turn create additional drivers and tools.
Using MongoDB through an add-on for Heroku effectively turns MongoDB into a SaaS product as well.
In reality one would be comparing whatever service a chosen provider has compared to what Amazon can offer instead of comparing one persistance solution to another.
This is very hard to do. Each provider will have varying levels of service at different price points and one could consider the option of running it on their own hardware locally for development purposes a welcome option.
I think the key difference to consider is MongoDB is a software that you can install anywhere (including at AWS or at other cloud service or in-house) where as DynamoDB is a SaaS available exclusively as hosted service from Amazon (AWS). If you want to retain the option of hosting your application in-house, DynamoDB is not an option. If hosting outside of AWS is not a consideration, then, DynamoDB should be your default choice unless very specific features are of higher consideration.
There's a table in the following link that summarizes the attributes of DynamoDB and Cassandra:
http://www.datastax.com/dev/blog/amazon-dynamodb
Something that needs improvement on DynamoDB in order to become more usable is the possibility to index columns other than the primary key.
UPDATE 1 (06/04/2013)
On 04/18/2013, Amazon announced support for Local Secondary Indexes, which made DynamoDB f***ing great:
http://aws.amazon.com/about-aws/whats-new/2013/04/18/amazon-dynamodb-announces-local-secondary-indexes/
I have to be honest; I was very excited when I heard about the new DynamoDB and did attend the webinar yesterday. However it's so difficult to make a decision right now as everything they said was still very vague; I have no idea the functions that are going to be allowed / used through their service.
The one thing I do know is that scaling is automatically handled; which is pretty awesome, yet there are still so many unknowns that it's tough to really make a great analysis until all the facts are in and we can start using it.
Thus far I still see mongo as working much better for me (personally) in the project undertaking that I've been working on.
Like most DB decisions, it's really going to come down to a project by project decision of what's best for your need.
I anxiously await more information on the product, as for now though it is in beta and I wouldn't jump ship to adopt the latest and greatest only to be a tester :)
I think one of the key differences between DynamoDB and other NoSQL offerings is the provisioned throughput - you pay for a specific throughput level on a table and provided you keep your data well-partitioned you can always expect that throughput to be met. So as your application load grows you can scale up and keep you performance more-or-less constant.
Amazon DynamoDB seems like a pretty decent NoSQL solution. It is fast, and it is pretty easy to use. Other than having an AWS account, there really isn't any setup or maintenance required. The feature set and API is fairly small right now compared to MongoDB/CouchDB/Cassandra, but I would probably expect that to grow over time as feedback from the developer community is received. Right now, all of the official AWS SDKs include a DynamoDB client.
Pros
Lightning Fast (uses SSDs internally)
Really (really) reliable. (chances of write failures are lower)
Seamless scaling (no need to do manual sharding)
Works as webservices (no server, no configuration, no installation)
Easily integrated with other AWS features (can store the whole table into S3 or use EMR etc)
Replication is managed internally, so chances of accidental loss of data is negligible.
Cons
Very (very) limited querying.
Scanning is painful (I remember once a scanning through Java ran for 6 hours)
pre-defined throughput, which means sudden increase beyond the set throughput will be throttled.
throughput is partitioned as table is sharded internally. (which means if you had a throughput for 1000 and its partitioned in two and if you are reading only the latest data(from one part) then your throughput of reading is 500 only)
No joins, Limited indexing allowed (basically 2).
No views, triggers, scripts or stored procedure.
It's really good as an alternative to session storage in scalable application. Another good use would be logging/auditing in extensive system. NOT preferable for feature rich application with frequent enhancement or changes.

Non-relational databases (NoSQL) for small to medium sized applications

The benefits of a non-relational database (such as a key-value pair storage) are evident when used in large scale datasets (google, facebook, linkedin). How do you think small to medium sized applications can benefit from using non-relational databases?
IBM Mainframes have had "non-relational" databases since the 60s (hierarchial databases such as IMS + variants). These databases are still in use because they are extremely fast and handle huge scale well.
The point of relational databases was to provide a regular, relatively abstract method for storing and retrieving data in which the tuning can be done relatively independently of the data model (not true for IMS). They were designed rather in reaction to the inability to reorganize hiearchical databases easily. The upside is nice organization; the downside is medium, not high performance.
Google provides scalable storage and MapReduce to handle scale. It isn't relational.
There was a huge push early in the last decade to store data in XML, in essentially hiearchical form because XML is implicitly hierarchical. That was a huge mistake IMHO, because it repeated the inconvenience of heirarchical databases, but had none of the performance. I'm not very surprised this movement seems to have pretty much died.
Most of the practical push to non-relational seems to me to be towards performance and scale. I don't see how this helps "small" applications much.
People have proposed, but not done a lot of practical data management using knowledge-based schemes. Doug Lenat's CYC comes to mind here. The ability of the database
to help an application draw non-obvious conclusions strikes me a very interesting for "small" applications that are trying be "smart". But there aren't a lot of these yet.
The sweet spot of using a NoSQL database at that scale is when the database model (key-value, document, etc.) is a good match to the application's needs and the advanced relational functionality is not needed.
At the small end of the spectrum, performance is a non issue because just about everything is fast. Storage engines are a non issue, if you don't need a sophisticated query engine, the lack of SQL support is a non issue.
You are left with how well it fits and how easy it is to use. Honestly though, tooling does become an issue. Relational database tooling is mature, NoSQL tooling is less feature rich and less battle hardened. Too often it is roll-your-own tooling. Definitely consider what tools you'd be giving up and how much you need them.
There is an additional slate of advantages for smaller projects when considering a NoSQL service (like Amazon SimpleDB and Microsoft Azure) as compared to a product. If you only have to pay for what you use and you don't use much, it can be cheaper than running a dedicated server, going all the way down to free for something like the SimpleDB free usage tier.
You also avoid some of the server and database maintenance costs. This can be a big win if you don't have a DBA, or when your DBAs are already over worked. Of course you'll still have admin work to do, but it is significantly reduced, and typically simpler.
When it comes to graph databases (like Neo4j - a project I'm involved in) they excel at scaling to complexity. This means, they provide "better substrates for modeling business domains" (see The State of NoSQL, also by Ben Scofield, too). As I see it, this is very important in small to medium sized apps.
This may be better explained through examples, so here's some links to example apps/domain modeling:
Access control lists the graph
database way
Social networks in the database: using a graph database
Domain modeling gallery
The question perhaps requires a bit more context... assuming a Python environment, consider the tutorial at the y_serial project: http://yserial.sourceforge.net/
NoSQL is not merely adopted for reasons of scalability. Serialization (of any arbitrary Python object) and persistence are very convenient at any scale -- so consider the key-value system as one approach.
Well one of the problems with a RDBMS is that you need to spend effort mapping your programming languages domain models to the relational schema of your RDBMS. This effort is usually spent configuring your ORM layer.
With NoSQL databases you are not forced to map your objects to a relational model and in most cases your objects are serialized as-is. Because of the lack of an intermediary schema, data migrations and versioning become easier.
Another benefit is scalability and performance. Since most of the time your data is received by 'keys' effectively everything uses and index. Trivial sharding is possible by doing a % (MOD) on the key against the number of your available NoSQL instances providing natural data partitioning which is crucial for sharding.
If you're interested in seeing how developing with a NoSQL differs from a RDBMS, I have a tutorial where I show how to go about designing a simple blog application using Redis.
If you match up a few common PaaS cloud services like a Key-Value store, a BLOB store, and a Message Queue store you have some handy tools that can free small application developers from the tyranny of the DBA and the infrastructure folks.
Today small developers often resort to Jet MDBs. Why? Easy, shared access is as easy as storing the MDB file on a file share visible to the entire application community. When they can get away with it (i.e. get the necessary support from the gatekeepers) they might use SQL Server Express, MySQL, etc.
Sadly those gatekeepers can be pretty hostile to deal with in a large organization. Mention a "database" and suddenly you face the DBA gang and associated delays, application reviews, prioritization, etc. Mention needing a server and you face that other firing squad.
Using a NoSQL solution and related cloud services can eliminate a ton of this if you don't need an RDBMS.
For one thing, all that's really required is an account with a public cloud provider. This is something that becomes fairly easy once the concept has been approved. And easier for you as a developer once you've been approved and assigned an account, though of course there are the usual bookkeeping issues.
But let's even set that aside. What if your organization implemented a private cloud for such uses? Lots of the issues of outside billing go away, data insecurity worries go away, etc.
Such a thing could be implemented and provisioned in a semi-anonymous fashion, almost as easily as administering file shares. The anonymity comes in because once you've been approved to develop on the in-house cloud nobody needs to nitpick the details of your activities using it any more than they need to examine a request before you can create a file on an existing file share.
Obviously there would be storage and CPU quotas to manage. Nobody can afford to just keep scaling up indefinately. Rogue applications might consume vast quantities of resources. So what you need is some sort of quota system to cap usage. Whether this is monitored by infrastructure folks is an implementation decision, or it might be treated just like file share use: run out and somebody yells at the programmer who in turn looks into it and requests more if appropriate (or fixes his bugs).
But you end up with "utility computing" and by "using no SQL" you don't incur the cost (and issues) of dealing with DBAs. They can still sit quietly surfing the Web in their big offices while you get some work done.
Amazon SimpleDB can be useful for those who need a non-relational database for storage of smaller, non-structural data. Amazon SimpleDB has restricted storage size to 10GB per domain. Amazon SimpleDB offers simplicity and flexibility. SimpleDB automatically indexes all data. Amazon SimpleDB pricing is based on your actual box usage. You can store any UTF-8 string data in Amazon SimpleDB.