Modern hierarchical database [closed] - hierarchical-data

I'm working on the architecture of a new system to replace an ancient mainframe app. The mainframe uses IBM IMS and is surprisingly fast with large amounts of data. We've tried 3 DBs so far (MongoDB, SQL Server and Oracle), but they all performed poorly under load. We hired an Oracle consultant and a 128-core server, and Oracle still gives us 4x the response time of the old system (same with SQL Server).
Are there any modern hierarchical DBs that can efficiently support billions of records?

Mainframes have been and remain very fast for certain use cases, so part one is not to assume that mainframe = bad. Having said that, they can be very expensive to maintain, and particularly with legacy apps the skills are starting to evaporate.
If you really wanted a hierarchical database, one valid option would be to modernise your application but retain IMS at the core. IMS is a great hierarchical database, and I don't think IBM are going to EOL IMS any time soon, so is there a real reason to go to a hierarchical database that isn't IMS? A quick visit to their website gave me the impression that they'd discount the product if they thought you were going to migrate to a competing product, so if money is the problem then perhaps the answer is to just ask IBM to discount the product you're already happy with. This white paper (ftp://public.dhe.ibm.com/software/data/ims/pdf/TCG2013015LI.pdf) suggests they're pushing that as an option, and no doubt the later versions of IMS have a bunch of features that might not be available in the version you're running (assuming you've not upgraded to the latest).
I'm surprised you can't get the performance you want out of Oracle though, the system I'm currently working on has a couple of tables at the billion mark and we definitely don't have 128 cores, but we get reasonable performance.
My first question is whether your Oracle consultant really knew their stuff. I've had mixed results; as with any skill set, ability varies from person to person. I often find that performance problems arise because people have over-normalised or over-generalised the database schema: you've moved from a highly optimised hierarchical structure in IMS that flies to a very abstracted structure in 3NF, and that dies. But sometimes, if you put that same hierarchical structure in Oracle and only allow the same sort of access patterns that were possible in IMS, you get all the performance you want.
By that, I mean if in IMS you had clients, clients had orders, and orders had order lines, then I think that means it's pretty hard to do any accesses without starting at the client. It also often means you have large batch processes that process all the clients every day to find out which have orders that you need to do something with.
So, some things here. Firstly, suppose that, in Oracle, you were to build that structure: I have a client id, the client id is the first element in the primary key of orders, the client id and then the order id are the first two elements in the primary key of order lines, and I use client id as my clustering key and put client id into every index. Then probably all my client-based access paths will be really fast. You can also partition by client id and, if needed, run an Oracle RAC cluster with each of those partitions/client ranges effectively running as a separate database on a separate, more commodity-class machine (say, a dual-socket machine with about 20 cores).
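To make that key layout concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for Oracle, and every table and column name is made up for illustration. The only point is that client_id leads every primary key, so all access starts from the client, just as it would in the hierarchy.

```python
import sqlite3

# SQLite stands in for Oracle; all table and column names are hypothetical.
# WITHOUT ROWID makes SQLite store each table clustered on its primary key,
# so the rows belonging to one client sit next to each other.
SCHEMA = """
CREATE TABLE clients (
    client_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE orders (
    client_id INTEGER NOT NULL,
    order_id  INTEGER NOT NULL,
    status    TEXT NOT NULL,
    PRIMARY KEY (client_id, order_id)
) WITHOUT ROWID;
CREATE TABLE order_lines (
    client_id INTEGER NOT NULL,
    order_id  INTEGER NOT NULL,
    line_no   INTEGER NOT NULL,
    item      TEXT NOT NULL,
    PRIMARY KEY (client_id, order_id, line_no)
) WITHOUT ROWID;
"""


def build_db():
    """Create an in-memory database with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO clients VALUES (?, ?)",
                     [(1, "Acme"), (2, "Globex")])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, 10, "pending"), (1, 11, "shipped"), (2, 20, "pending")])
    conn.executemany("INSERT INTO order_lines VALUES (?, ?, ?, ?)",
                     [(1, 10, 1, "widget"), (1, 10, 2, "gadget"),
                      (1, 11, 1, "sprocket"), (2, 20, 1, "widget")])
    return conn


def lines_for_client(conn, client_id):
    """Fetch all order lines of one client via the leading key column."""
    return conn.execute(
        "SELECT order_id, line_no, item FROM order_lines "
        "WHERE client_id = ? ORDER BY order_id, line_no",
        (client_id,),
    ).fetchall()
```

Calling lines_for_client(build_db(), 1) returns the order lines of client 1 in order id and line number order. In a real Oracle schema the same idea would be expressed with its own clustering and partitioning features, which this sketch does not attempt to reproduce.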
Secondly, if I used to have to process all my records once a night to find the orders that needed someone to work on them, then in the new relational world I don't need to do that any more, I just need to find the orders with a status of "pending" or whatever. So maybe Oracle isn't as fast for that batch oriented workload, but if I change my logic and do an indexed query for pending orders, then again I can get all the performance I want. Even more so, perhaps I make order_status into a partitioning key, so my "active" records are all in one partition, and all the older orders are in other partitions - and then I put that partition on an SSD-backed array.
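As a rough illustration of replacing the nightly full scan with an indexed lookup, here is a small sketch. Again sqlite3 is only a stand-in for Oracle, and the table, column and index names are assumptions made for the example.

```python
import sqlite3


def build_orders(n_orders=1000, pending_every=100):
    """Create a hypothetical orders table where a small fraction is pending."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " client_id INTEGER NOT NULL,"
        " order_id  INTEGER NOT NULL,"
        " status    TEXT NOT NULL,"
        " PRIMARY KEY (client_id, order_id))"
    )
    rows = [
        (i % 7, i, "pending" if i % pending_every == 0 else "shipped")
        for i in range(n_orders)
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    # The index lets the database jump straight to the pending rows
    # instead of walking every client and every order.
    conn.execute("CREATE INDEX idx_orders_status ON orders (status)")
    return conn


def pending_orders(conn):
    """Return (client_id, order_id) for orders that still need work."""
    return conn.execute(
        "SELECT client_id, order_id FROM orders "
        "WHERE status = 'pending' ORDER BY order_id"
    ).fetchall()
```

With the defaults, pending_orders(build_orders()) returns only the 10 pending rows out of 1000. The batch job that visited every client each night becomes one small query.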
Thirdly, take a look at your storage devices. Performance problems in databases are invariably IO problems - either you're doing too much IO (poorly optimised queries), or your IO subsystem can't keep up with the IO that you need to do. 128 cores is an awful lot of compute, and I've rarely seen a database that is compute bound. Maybe look at a big SSD array, some of them can give you enormous IO throughput. Certainly if you were running Oracle on a RAID 5 spinning disk array your performance is likely to suck.
The last random comment here - a lot of people are getting good results with SAP HANA - a fully in-memory database. That really flies, and is specifically designed for workloads that just won't run fast enough in other databases. I bet SAP would come demo it to you for free if you wanted it.

Related

Choosing the right NoSQL storage for highly connected and flexible domain [closed]

We're starting a new project and looking for an appropriate storage solution for our case.
Main requirements for the storage are as follows:
Ability to support highly flexible and connected domain
Ability to support queries like "give all children of that item and items linked to those children" in ms
Full text search
Ad hoc analytics
Solid read and write performance
Scalability (as we want to offer a Saas version of our product)
First of all, we eliminated all RDBMSs, since we have a really flexible schema which can also be changed by the customer (adding new fields etc.), so supporting such a solution in any RDBMS can become a nightmare...
And we came to NoSQL. We evaluated several NoSQL storage engines and chose the 3 most appropriate (as we think).
MongoDB
Pros:
Appropriate to store aggregates with flexible structure (as we have them)
Scalability/Maturity/Support/Community
Experience with MongoDB on previous project
Drivers, cloud support
Analytics
Price (it's free)
Cons:
No support for relationships (really important for us as we have a lot of connected items)
Slow retrieval of connected data (all joins happen in app)
Neo4j:
Pros:
Support of connected data in modeling, flexibility
Fast retrieval of interconnected data
Drivers, cloud support
Maturity/Support/Community (if we compare with other graph DBs)
Cons:
No support for aggregate storage (we would like to have aggregates in one vertex rather than in several)
Scalability (as far as I know, now all data is duplicated on other servers)
Analytics?
Write performance? (I read several blogs where customers complained about its write performance)
Price (it is not free for commercial software)
OrientDB
Pros:
It seems that OrientDB has all the features that we need (aggregates and graphdb in one solution)
Price (looks like it's free)
Cons:
Immaturity (comparing with others)
Really small company behind the technology (In particular one main contributor), so questions about support, known issues etc.
A lot of features, but do they all work well?
So now the main dilemma for us is between Neo4j and OrientDB (MongoDB is a third option because of its lack of relationships, which are really important in our case - this post explains the pitfalls). I've searched for benchmarks/comparisons of these DBs, but all of them are old. Here is a comparison by features: http://vschart.com/compare/neo4j/vs/orientdb. So now we need advice from people who have already used these DBs on what to choose. Thanks in advance.
I think there are interesting trade-offs with each of these:
MongoDB doesn't do graphs;
Neo4j's nodes are flat key-value properties;
OrientDB forces you to choose between graphs and documents (can't do both simultaneously).
So your choice is between a graph store (neo4j or orient) and a document store (mongo or orient). My sense is that MongoDB is the leading document store and Neo4j is the leading graph database, which would lead me to pick one of these. But since connectivity is important, I'd lean towards the graph database and take Neo4j.
Neo4j's scalability is proven: it's in use for graphs larger than Facebook's and by enormous companies like Walmart and EBay. So if your problem is anywhere between 0-120% of Facebook's social graph, Neo4j has you covered. Write throughput is fine with Neo4j - I get in excess of 2,000 proper ACID Transactions per second on a laptop and I can easily queue writes to multiply that out.
Everything else is pretty equal: you can choose to pay for any of these or use them freely under their open source licenses (including Neo4j if you can work with GPL/AGPL). Neo4j's paid licenses have great support (up to 24x7x365, 1 hour turnaround worldwide) versus OrientDB's rather lacklustre support (4 hour turnaround in the EU daytime only), and I imagine MongoDB has good support too (though I have not checked up on it).
In short, there's a reason Neo4j is the top database for connected data: it kicks ass!
Jim
To correct some misconceptions regarding mongoDB
Relations are supported, by either linking to other documents or embedding them. Please see the Data Modeling Introduction in the mongoDB docs for details. It may be that you are forced to trade normalization against speed, though. However, there are use cases in which embedding is the better solution compared to relations. Think of orders: When embedding order items and their price, you do not need to have a price history table for each and every product ever sold.
What is not supported are JOINs, which you can circumvent by embedding documents, as mentioned above.
MongoDB can be used for tree structures. Please see Model Tree Structures with Materialized Paths for details. This approach seems to be the most appropriate way to implement a tree structure for the mentioned use case. An alternative may be an array of ancestors, depending on your needs.
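To give a feel for the materialized-path idea, here is a small in-memory sketch in Python. It does not talk to MongoDB at all: plain dicts stand in for documents, the field names (_id, path) and the sample category tree are assumptions chosen for illustration, and the regular expressions play the role that a regex query on the path field would play in a real collection.

```python
import re

# Each document stores the chain of its ancestors as one delimited string.
# The root has no path. Names and layout are illustrative only.
CATEGORIES = [
    {"_id": "Books", "path": None},
    {"_id": "Programming", "path": ",Books,"},
    {"_id": "Databases", "path": ",Books,Programming,"},
    {"_id": "Languages", "path": ",Books,Programming,"},
    {"_id": "MongoDB", "path": ",Books,Programming,Databases,"},
]


def descendants(docs, node_id):
    """All nodes that have node_id anywhere in their ancestor path."""
    pattern = re.compile("," + re.escape(node_id) + ",")
    return sorted(
        d["_id"] for d in docs if d["path"] and pattern.search(d["path"])
    )


def children(docs, node_id):
    """Only direct children: node_id is the last element of the path."""
    pattern = re.compile("," + re.escape(node_id) + ",$")
    return sorted(
        d["_id"] for d in docs if d["path"] and pattern.search(d["path"])
    )
```

Here descendants(CATEGORIES, "Programming") finds the whole subtree with a single pattern match per document, while children(CATEGORIES, "Programming") restricts the match to the end of the path. The trade-off is that moving a subtree means rewriting the path of every node below it.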
That being said, mongoDB may fail in one of the basic requirements, though this really depends on how you define it: ad hoc analysis. My suggestion would be to model the intended data structure using a document-oriented approach (as opposed to imposing a relational approach on a document-oriented database) and prototype one of the possible analysis use cases with dummy data.

Redis, CouchDB or Cassandra? [closed]

What are the strengths and weaknesses of the various NoSQL databases available?
In particular, it seems like Redis is weak when it comes to distributing write load over multiple servers. Is that the case? Is it a big problem? How big does a service have to grow before that could be a significant problem?
The strengths and weaknesses of the NoSQL databases (and also SQL databases) are highly dependent on your use case. For very large projects, performance is king; but for brand new projects, or projects where time and money are limited, simplicity and time-to-market are probably the most important. For teaching yourself (broadening your perspective, becoming a better, more valuable programmer), perhaps the most important thing is simple, solid fundamental concepts.
What kind of project do you have in mind?
Some strengths and weaknesses, off the top of my head:
Redis
Very simple key-value "global variable server"
Very simple (some would say "non-existent") query system
Easily the fastest in this list
Transactions
Data set must fit in memory
Immature clustering, with unclear future (I'm sure it'll be great, but it's not yet decided.)
Cassandra
Arguably the most community momentum of the BigTable-like databases
Probably the easiest of this list to manage in big/growing clusters
Support for map/reduce, good for analytics, data warehousing
Multi-datacenter replication
Tunable consistency/availability
No single point of failure
You must know what queries you will run early in the project, to prepare the data shape and indexes
CouchDB
Hands-down the best sync (replication) support, supporting master/slave, master/master, and more exotic architectures
HTTP protocol, browsers/apps can interact directly with the DB partially or entirely. (Sync is also done over HTTP)
After a brief learning curve, pretty sophisticated query system using Javascript and map/reduce
Clustered operation (no SPOF, tunable consistency/availability) is currently a significant fork (BigCouch). It will probably merge into Couch but there is no roadmap.
Similarly, clustering and multi-datacenter are theoretically possible (the "exotic" thing I mentioned) however you must write all that tooling yourself at this time.
Append only file format (both databases and indexes) consumes disk surprisingly quickly, and you must manually run compaction (vacuuming) which makes a full copy of all records in the database. The same is required for each index file. Again, you have to be your own toolsmith.
Take a look at http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis He does a good job summing up why you would use one over the other.

Why exactly do we use NoSQL? [closed]

Having understood some of the advantages that NoSQL offers (scalability, availability, etc.), I am still not clear why a website would want to use a non-relational database.
Can I get some help on this, preferably with an example?
Better performance
NoSQL databases sometimes have better performance, although this depends on the situation and is disputed.
Adaptability
You can add and remove "columns" without downtime. In most SQL servers, this takes a long time and puts a heavy load on the server.
Application design
It is desirable to separate the data storage from the logic. If you join and select things in SQL queries, you are mixing business logic with storage.
NoSQL databases are there to solve several things, mainly:
(buzz) BigData => think TB, PB, etc..
Working with Distributed Systems / datasets => say you have 42 products, so 13 of them will live in the Chicago datacenter, 21 in NY's and another 8 somewhere in Japan, but when you query against all 42 products, you do not need to know where they are located: the NoSQL DB will. This also allows you to engage a lot more brain power ( servers ) to solve hard computational problems [ this does not seem to fit your use case, but it is an interesting thing to note ]
Partitioning => having your DB easily distributed, besides those cool 8 products in Japan, also allows for easy data replication, so those 42 products will be replicated with a factor of 3, for example, which would mean your DB would have 3 copies of every product. Hence if something goes down, no problem => there is a replica available. This is where NoSQL databases actually shine vs. RDBMS. Granted, you can shard, partition and cluster Oracle / MySQL / PostgreSQL / etc., BUT it is a several orders of magnitude more complicated process and usually a maintenance headache for most people you'd employ.
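The placement-plus-replication idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any particular NoSQL product does it: the node names are invented, and the "hash the key, then take the next few nodes" rule is just one simple way to pick replicas.

```python
import hashlib

# Hypothetical node names; a real cluster would discover these dynamically.
NODES = ["chicago-1", "chicago-2", "new-york-1", "new-york-2", "japan-1"]
PRODUCTS = ["product-%d" % i for i in range(1, 43)]  # the 42 products


def replicas_for(key, nodes, replication_factor=3):
    """Pick replication_factor distinct nodes for a key, deterministically."""
    if replication_factor > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Walk the node list from the hashed position, wrapping around.
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def readable_from(key, nodes, down, replication_factor=3):
    """Replicas of a key that are still reachable when some nodes are down."""
    return [n for n in replicas_for(key, nodes, replication_factor) if n not in down]
```

The caller only asks for a key; the function decides where the copies live. With a replication factor of 3, losing any single node still leaves at least two reachable copies of every product, which is the "if something goes down, no problem" property described above.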
BUT to your question:
why a website would want to use a non-relational database
When most of the people, I worked with / met / chatted with, choose NoSQL for their "website", it is unfortunately NOT for the reasons above, but simply because it is COOLER to do so. And in fact many projects FAIL / have extreme difficulties due to this reason.
If most NoSQL gurus took their masks off, they would all agree that MOST of the problems ( or, as people call them, websites ) that developers solve day to day can, and rather should, be solved with a SQL solution such as PostgreSQL, MySQL, etc., with some cool Redis cache layer on top of it. Only a small subset of problems would REALLY benefit from NoSQL.
I personally love Riak, as I am a firm believer that a NoSQL, fault tolerant DB should have an extremely strong, flexible and naturally distributed foundation => such as Erlang OTP. Plus I am a fan of simplicity. But again, given the problem, I would choose whatever works best, and most of the time I will NEED that consistency ( especially if we are talking about money / financial world / mission critical / etc.. ).
The main reason not to use an SQL database is scalability. The transactional guarantees and the relational model make it almost impossible to scale a database usefully across more than a few machines, especially given the write-heavy workloads generated by modern web applications.
An app like Facebook can't be made to work on a straightforward SQL database, except by massive partitioning and sharding, which requires significant adjustments to the app logic as well. That's why Facebook developed Cassandra.
NoSQL basically means you make do without some SQL-typical features like immediate consistency or easy joins, in exchange for being able to use a database that scales much better.
Conversely, there is no point in using NoSQL if your website never has more than a dozen concurrent users (which is true for the vast majority of all sites).
We need to understand what your problem is in the current application:
Transactions
Amount of data
Data structure
NoSQL addresses the problems of scalability and availability at the expense of atomicity or consistency.
This leads us to the CAP theorem. Eric Brewer noted that of the three properties of shared-data systems (consistency, availability and tolerance to network partitions), only two can be achieved at any given moment in time.
NOSQL Approach
Schemaless data representation:
Most of them offer schemaless data representation & allow storing semi-structured data.
Can continue to evolve over time— including adding new fields or even nesting the data, for example, in case of JSON representation.
Development time:
No complex SQL queries.
No JOIN statements.
Speed:
Very High speed delivery & Mostly in-built entity-level caching
Plan ahead for scalability:
Avoiding rework
There are many types of NoSQL databases. Web applications typically use document-based databases. A document DB allows us to store JSON, XML, YAML and even Word documents and manipulate them. So NoSQL is the obvious choice here, and MongoDB, a document database which supports the JSON format by default, is the choice most preferred by developers and designers.

MongoDB vs. Cassandra [closed]

I am evaluating what might be the best migration option.
Currently, I am on a sharded MySQL (horizontal partition), with most of my data stored in JSON blobs. I do not have any complex SQL queries (I already migrated away from them when I partitioned my db).
Right now, it seems like both MongoDB and Cassandra would be likely options. My situation:
Lots of reads in every query, less regular writes
Not worried about "massive" scalability
More concerned about simple setup, maintenance and code
Minimize hardware/server cost
Lots of reads in every query, fewer regular writes
Both databases perform well on reads where the hot data set fits in memory. Both also emphasize join-less data models (and encourage denormalization instead), and both provide indexes on documents or rows, although MongoDB's indexes are currently more flexible.
Cassandra's storage engine provides constant-time writes no matter how big your data set grows. Writes are more problematic in MongoDB, partly because of the b-tree based storage engine, but more because of the multi-granularity locking it does.
For analytics, MongoDB provides a custom map/reduce implementation; Cassandra provides native Hadoop support, including for Hive (a SQL data warehouse built on Hadoop map/reduce) and Pig (a Hadoop-specific analysis language that many think is a better fit for map/reduce workloads than SQL). Cassandra also supports use of Spark.
Not worried about "massive" scalability
If you're looking at a single server, MongoDB is probably a better fit. For those more concerned about scaling, Cassandra's no-single-point-of-failure architecture will be easier to set up and more reliable. (MongoDB's global write lock tends to become more painful, too.) Cassandra also gives a lot more control over how your replication works, including support for multiple data centers.
More concerned about simple setup, maintenance and code
Both are trivial to set up, with reasonable out-of-the-box defaults for a single server. Cassandra is simpler to set up in a multi-server configuration since there are no special-role nodes to worry about.
If you're presently using JSON blobs, MongoDB is an insanely good match for your use case, given that it uses BSON to store the data. You'll be able to have richer and more queryable data than you would in your present database. This would be the most significant win for Mongo.
I've used MongoDB extensively (for the past 6 months), building a hierarchical data management system, and I can vouch for both the ease of setup (install it, run it, use it!) and the speed. As long as you think about indexes carefully, it can absolutely scream along, speed-wise.
I gather that Cassandra, due to its use with large-scale projects like Twitter, has better scaling functionality, although the MongoDB team is working on parity there. I should point out that I've not used Cassandra beyond the trial-run stage, so I can't speak for the detail.
The real swinger for me, when we were assessing NoSQL databases, was the querying - Cassandra is basically just a giant key/value store, and querying is a bit fiddly (at least compared to MongoDB), so for performance you'd have to duplicate quite a lot of data as a sort of manual index. MongoDB, on the other hand, uses a "query by example" model.
For example, say you've got a Collection (MongoDB parlance for the equivalent to a RDMS table) containing Users. MongoDB stores records as Documents, which are basically binary JSON objects. e.g:
{
FirstName: "John",
LastName: "Smith",
Email: "john@smith.com",
Groups: ["Admin", "User", "SuperUser"]
}
If you wanted to find all of the users called Smith who have Admin rights, you'd just create a new document (at the admin console using Javascript, or in production using the language of your choice):
{
LastName: "Smith",
Groups: "Admin"
}
...and then run the query. That's it. There are added operators for comparisons, RegEx filtering etc, but it's all pretty simple, and the Wiki-based documentation is pretty good.
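For anyone who wants to see what "query by example" means mechanically, here is a tiny pure-Python imitation. It is not the MongoDB driver: find and matches are hypothetical helper names, and the sketch only covers plain equality plus the rule that a scalar in the query matches an array field if the array contains it.

```python
def matches(document, query):
    """True if every field in the query document is satisfied by document."""
    for field, wanted in query.items():
        if field not in document:
            return False
        actual = document[field]
        if isinstance(actual, list) and not isinstance(wanted, list):
            # A scalar matches an array field if the array contains it.
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


def find(collection, query):
    """Return all documents in the collection that match the example."""
    return [doc for doc in collection if matches(doc, query)]


USERS = [
    {"FirstName": "John", "LastName": "Smith", "Groups": ["Admin", "User", "SuperUser"]},
    {"FirstName": "Jane", "LastName": "Smith", "Groups": ["User"]},
    {"FirstName": "Bob", "LastName": "Jones", "Groups": ["Admin"]},
]
```

Running find(USERS, {"LastName": "Smith", "Groups": "Admin"}) returns only John Smith's document. The real database does the same kind of matching, with indexes and the extra comparison operators on top.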
Why choose between a traditional database and a NoSQL data store? Use both! The problem with NoSQL solutions (beyond the initial learning curve) is the lack of transactions -- you do all updates to MySQL and have MySQL populate a NoSQL data store for reads -- you then benefit from each technology's strengths. This does add more complexity, but you already have the MySQL side -- just add MongoDB, Cassandra, etc to the mix.
NoSQL datastores generally scale way better than a traditional DB for the same otherwise specs -- there is a reason why Facebook, Twitter, Google, and most start-ups are using NoSQL solutions. It's not just geeks getting high on new tech.
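The "use both" setup described above can be sketched roughly as follows. Everything here is a stand-in chosen so the example runs on its own: sqlite3 plays the transactional MySQL side, a plain dict plays the NoSQL read store, and the class and method names are hypothetical.

```python
import json
import sqlite3


class DualStore:
    """Writes go to the transactional store; reads come from the read store."""

    def __init__(self):
        self.sql = sqlite3.connect(":memory:")  # stand-in for MySQL
        self.sql.execute(
            "CREATE TABLE users (id INTEGER PRIMARY KEY, doc TEXT NOT NULL)"
        )
        self.read_store = {}  # stand-in for MongoDB, Cassandra, etc.

    def save_user(self, user_id, doc):
        # The with-block commits on success and rolls back on an exception.
        with self.sql:
            self.sql.execute(
                "INSERT OR REPLACE INTO users (id, doc) VALUES (?, ?)",
                (user_id, json.dumps(doc)),
            )
        # Only populate the read side once the transaction has committed.
        self.read_store[user_id] = dict(doc)

    def get_user(self, user_id):
        return self.read_store.get(user_id)

    def rebuild_read_store(self):
        """Repopulate the read side from the system of record."""
        self.read_store = {
            user_id: json.loads(doc)
            for user_id, doc in self.sql.execute("SELECT id, doc FROM users")
        }
```

Because the relational side remains the system of record, the read store can be thrown away and rebuilt at any time. That is the main safety net of this arrangement, at the cost of the extra moving part.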
I'm probably going to be an odd man out, but I think you need to stay with MySQL. You haven't described a real problem you need to solve, and MySQL/InnoDB is an excellent storage back-end even for blob/json data.
There is a common trick among Web engineers to try to use more NoSQL as soon as realization comes that not all features of an RDBMS are used. This alone is not a good reason, since most often NoSQL databases have rather poor data engines (what MySQL calls a storage engine).
Now, if you're not of that kind, then please specify what is missing in MySQL and you're looking for in a different database (like, auto-sharding, automatic failover, multi-master replication, a weaker data consistency guarantee in cluster paying off in higher write throughput, etc).
I haven't used Cassandra, but I have used MongoDB and think it's awesome.
If you're after simple setup, this is it: You simply untar MongoDB and run the mongod daemon and that's it ... it's running.
Obviously that's only a starter, but to get you started it's easy.
I saw a presentation on mongodb yesterday. I can definitely say that setup was "simple", as simple as unpacking it and firing it up. Done.
I believe that both mongodb and cassandra will run on virtually any regular linux hardware, so you should not find too much of a barrier in that area.
I think in this case, at the end of the day, it will come down to which one you personally feel more comfortable with and which has a toolset that you prefer. As for the presentation on mongodb, the presenter indicated that the toolset for mongodb was pretty light and that there weren't many (they said any, really) tools similar to what's available for MySQL. This was of course their experience, so YMMV. One thing that I did like about mongodb was that there seemed to be lots of language support for it (Python and .NET being the two that I primarily use).
The list of sites using mongodb is pretty impressive, and I know that twitter just switched to using cassandra.

Key-Value Stores vs. RDBMs vs. "Cloud" DBs (SDB) [closed]

I'm comfortable in the MySQL space having designed several apps over the past few years, and then continuously refining performance and scalability aspects. I also have some experience working with memcached to provide application side speed-ups on frequently queried result sets. And recently I implemented the Amazon SDB as my primary "database" for an ecommerce experiment.
To oversimplify, a quick justification I went through in my mind for using the SDB service was that using a schema-less database structure would allow me to focus on the logical problem of my project and rapidly accumulate content in my data-store. That is, don't worry about setting up and normalizing all possible permutations of a product's attributes beforehand; simply start loading in the products and the SDB will simply remember everything that is available.
Now that I have managed to get through the first few iterations of my project and I need to set up simple interfaces to the data, I am running into issues with things I had taken for granted working with MySQL. Ex: grouping in select statements and limit syntax to query "items 50 to 100". The ease advantage I gained using the schema-free architecture of SDB, I lost to a performance hit of querying/looping a result set with just over 1800 items.
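For reference, this is roughly what the two features mentioned above look like when the database does the work. The sketch uses Python's sqlite3 as a stand-in for MySQL, with a made-up items table of 1800 rows; the table and column names are assumptions.

```python
import sqlite3


def build_items(n=1800):
    """Create a hypothetical items table with n rows in three categories."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)",
        [(i, "cat%d" % (i % 3)) for i in range(1, n + 1)],
    )
    return conn


def page(conn, first, last):
    """Return items numbered first..last (1-based, inclusive) in id order."""
    return conn.execute(
        "SELECT id, category FROM items ORDER BY id LIMIT ? OFFSET ?",
        (last - first + 1, first - 1),
    ).fetchall()


def counts_by_category(conn):
    """Let the database do the grouping instead of looping in the app."""
    return dict(
        conn.execute("SELECT category, COUNT(*) FROM items GROUP BY category")
    )
```

page(conn, 50, 100) fetches just the requested slice, and counts_by_category(conn) returns one row per group. Without these, the application has to pull the whole result set and loop over it, which is the performance hit described above.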
Now I'm reading about projects like Tokyo Cabinet that are extending the concept of in-memory key-value stores to provide pseudo-relational functionality at ridiculously faster speeds (14x, I read somewhere).
My question:
Are there some rudimentary guidelines or heuristics that I as an application designer/developer can go through to evaluate which DB tech is the most appropriate at each stage of my project.
Ex: At a prototyping stage where logical/technical unknowns of the application make data structure fluid: use SDB.
At a more mature stage where user deliverables are a priority, use traditional tools where you don't have to spend dev time writing sorting, grouping or pagination logic.
Practical experience with these tools would be very much appreciated.
Thanks SO!
Shaheeb R.
The problems you are finding are why RDBMS specialists view some of the alternative systems with a jaundiced eye. Yes, the alternative systems handle certain specific requirements extremely fast, but as soon as you want to do something else with the same data, the fleetest suddenly becomes the laggard. By contrast, an RDBMS typically manages the variations with greater aplomb; it may not be quite as fast as the fleetest for the specialized workload which the fleetest is micro-optimized to handle, but it seldom deteriorates as fast when called upon to deal with other queries.
The new solutions are not silver bullets.
Compared to traditional RDBMS, these systems make improvements in some aspect (scalability, availability or simplicity) by trading-off other aspects (reduced query capability, eventual consistency, horrible performance for certain operations).
Think of these not as replacements for the traditional database, but as specialized tools for a known, specific need.
Take Amazon Simple DB, for example: SDB is basically a huge spreadsheet. If that is what your data looks like, then it probably works well, and the superb scalability and simplicity will save you a lot of time and money.
If your system requires very structured and complex queries but you insist on one of these cool new solutions, you will soon find yourself in the middle of re-implementing an amateurish, ill-designed RDBMS, with all of its inherent problems.
In this respect, if you do not know whether these will suit your need, I think it is actually better to do your first few iterations in a traditional RDBMS because they give you the best flexibility and capability especially in a single server deployment and under modest load. (see CAP Theorem).
Once you have a better idea about what your data will look like and how it will be used, then you can match your need with an alternative solution.
If you want the simplicity of a cloud-hosted solution but need a relational database, you can check out Amazon Relational Database Service.