Which is the best free data warehouse product? [closed]

I am developing a system that involves a lot of OLAP work. According to my research, a column-based data warehouse is the best choice, but I am puzzled about how to choose a good data warehouse product.
All the data warehouse comparison articles I can find are from before 2012, and there seem to be few recent ones. Are data warehouses out of date? Is Hadoop HBase better?
As far as I know, InfiniDB is a high-performance open source data warehouse product, but it has not been maintained for two years (https://github.com/infinidb/infinidb), and there is little documentation about it. Has InfiniDB been abandoned by its developers?
Which is the best data warehouse product right now?
How do I incrementally move my business data stored in a MySQL database into the data warehouse?
Thank you for your answer!

Data warehousing is still a hot topic. HBase is not the fastest option, but it is a very well-known and compatible one (many applications build on it).
I went on the journey for a good column store some years ago and finally settled on InfiniDB because of the easy migration from plain MySQL. It is a nice piece of software, but it still has bugs, so I cannot fully recommend it for production use (not without a second failover instance).
However, MariaDB has picked up the InfiniDB technology and is porting it over to their MariaDB Database Server. The new product is called MariaDB ColumnStore [1], of which a testing build is available. They have already put a lot of effort into it, so I think ColumnStore will become a major MariaDB product within the next two years.
I can't answer that. I'm still with InfiniDB and am also helping others with their projects.
This totally depends on your data structure and usage.
InfiniDB is great at querying; in my tests it had roughly 8% better performance than Impala. However, while InfiniDB supports INSERT, UPDATE, DELETE and transactions, it is not great for transactional workloads. Simply moving a community-driven website, where visitors constantly manipulate data, onto InfiniDB will NOT work well: one insert with 10,000 rows works fine, but 10,000 inserts with one row each will kill it.
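As a rough sketch of the difference (the page_views table and its columns are made up for illustration): the batched statement pays the per-statement and per-transaction overhead once, while the row-at-a-time loop pays it 10,000 times, which is what hurts a column store.

```sql
-- Batched: one statement, one commit - fine for a column store.
INSERT INTO page_views (view_id, page_id, viewed_at) VALUES
    (1, 101, '2016-05-01 10:00:00'),
    (2, 102, '2016-05-01 10:00:01'),
    (3, 103, '2016-05-01 10:00:02');
    -- ... thousands more rows in the same statement, or a bulk load

-- Row at a time: each statement pays the full per-statement overhead again.
INSERT INTO page_views (view_id, page_id, viewed_at) VALUES (1, 101, '2016-05-01 10:00:00');
INSERT INTO page_views (view_id, page_id, viewed_at) VALUES (2, 102, '2016-05-01 10:00:01');
-- ... repeated 10,000 times - this is the pattern that kills it
```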
We deployed InfiniDB for our customers to 'aid' the query performance of a regular MariaDB installation: we created a tool that imports and updates MariaDB tables into InfiniDB for faster querying. Manipulations on those tables are still done in MariaDB, and the changes get batch-imported into InfiniDB with a delay of about 30 seconds. Because the original and InfiniDB tables have the same structure and are accessible through the MySQL API, we can simply switch the database connection and get super-fast SELECT queries. This works well for our use case.
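To the question about incrementally moving MySQL data into the warehouse: our tool effectively does a high-water-mark sync. A minimal sketch, assuming hypothetical table and column names (orders, updated_at, sync_state) and the MySQL/MariaDB dialect; deleted rows would need separate handling:

```sql
-- 1. Read the high-water mark left by the previous sync run.
SELECT last_synced_at
FROM   sync_state
WHERE  table_name = 'orders';              -- suppose it returns '2016-05-01 12:00:00'

-- 2. Copy only the rows changed since then into the column-store copy.
--    REPLACE INTO (MySQL/MariaDB syntax) overwrites rows that share the same key.
REPLACE INTO orders_columnstore
SELECT *
FROM   orders
WHERE  updated_at > '2016-05-01 12:00:00';

-- 3. Advance the high-water mark so the next run continues from here.
UPDATE sync_state
SET    last_synced_at = (SELECT MAX(updated_at) FROM orders)
WHERE  table_name = 'orders';
```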
We also built new statistics/analytics applications from the ground up to work with InfiniDB and replace an older MySQL-based system, which also works great and exceeds all performance expectations. (We now have 15x the data we had in MariaDB, and it is still easier to maintain and much faster to query.)
[1] https://mariadb.com/products/mariadb-columnstore

I would give Splice Machine a shot (open source). It stores data on HBase and provides the core data management features that a warehouse provides (primary keys, constraints, foreign keys, etc.).
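To illustrate what those relational guarantees look like, here is a minimal, generic ANSI-style DDL sketch (table and column names are made up, and this is not claimed to be vendor-specific syntax):

```sql
-- A tiny star-schema fragment with the constraints mentioned above.
CREATE TABLE dim_customer (
    customer_id  BIGINT        NOT NULL PRIMARY KEY,
    name         VARCHAR(200)  NOT NULL
);

CREATE TABLE fact_sales (
    sale_id      BIGINT        NOT NULL PRIMARY KEY,
    customer_id  BIGINT        NOT NULL,
    amount       DECIMAL(12,2) NOT NULL CHECK (amount >= 0),
    sold_at      TIMESTAMP     NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id)
);
```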

Related

Modern hierarchical database [closed]

I'm working on the architecture of a new system to replace an ancient mainframe app. The mainframe uses IBM IMS and is surprisingly fast with large amounts of data. We've tried three databases so far - MongoDB, SQL Server and Oracle - but they performed poorly under load. We hired an Oracle consultant and a 128-core server, and Oracle still gives us 4x the response time of the old system (same with SQL Server).
Are there any modern hierarchical DBs, that can efficiently support billions of records?
Mainframes have been and remain very fast for certain use cases, so part one is not to assume that mainframe = bad. Having said that, they can be very expensive to maintain, and particularly with legacy apps the skills are starting to evaporate.
If you really wanted a hierarchical database, one valid option would be to modernise your application but retain IMS at the core. IMS is a great hierarchical database, and I don't think IBM are going to EOL IMS any time soon, so is there a real reason to go to a hierarchical database that isn't IMS? A quick visit to their website gave me the impression that they'd discount the product if they thought you were going to migrate to a competing product, so if money is the problem then perhaps the answer is to just ask IBM to discount the product you're already happy with. This white paper (ftp://public.dhe.ibm.com/software/data/ims/pdf/TCG2013015LI.pdf) suggests they're pushing that as an option, and no doubt the later versions of IMS have a bunch of features that might not be available in the version you're running (assuming you've not upgraded to the latest).
I'm surprised you can't get the performance you want out of Oracle, though; the system I'm currently working on has a couple of tables at the billion-row mark, and we definitely don't have 128 cores, but we get reasonable performance.
My first question is whether your Oracle consultant really knew their stuff. I've had mixed results; I guess, as with any skill set, people's abilities vary. I often find that performance problems arise because people have over-normalised or over-generalised the database schema - so you've moved from a highly optimised hierarchical structure in IMS that flies to a very abstracted structure in 3NF, and that dies. But sometimes, if you put that same hierarchical structure in Oracle and only allow the same sort of access patterns that were possible in IMS, you'd get all the performance you want.
By that, I mean if in IMS you had clients, clients had orders, and orders had order lines, then I think that means it's pretty hard to do any accesses without starting at the client. It also often means you have large batch processes that process all the clients every day to find out which have orders that you need to do something with.
So, some things here. Firstly, if, in Oracle, you were to build that structure - I have a client id, the client id is the first element in the primary key of orders, the client id and then the order id are the first two elements in the primary key of order lines, and I use client id as my clustering key and put client id into every index - then probably all my client-based access paths will be really fast. You can also partition by client id and, if needed, run an Oracle RAC cluster with each of those partitions/client ranges effectively running as a separate database on a separate, more commodity-class machine (say, a dual-socket machine = about 20 cores).
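A hedged sketch, in Oracle-flavoured DDL with made-up names, of the kind of keying and hash partitioning described above (not a recommendation for exact partition counts or types):

```sql
-- Mirror the IMS hierarchy: clients -> orders -> order lines,
-- with client_id leading every primary key and index.
CREATE TABLE clients (
    client_id  NUMBER        NOT NULL,
    name       VARCHAR2(200),
    CONSTRAINT pk_clients PRIMARY KEY (client_id)
);

CREATE TABLE orders (
    client_id  NUMBER       NOT NULL,
    order_id   NUMBER       NOT NULL,
    status     VARCHAR2(20),
    CONSTRAINT pk_orders PRIMARY KEY (client_id, order_id),
    CONSTRAINT fk_orders_client FOREIGN KEY (client_id) REFERENCES clients (client_id)
)
PARTITION BY HASH (client_id) PARTITIONS 16;

CREATE TABLE order_lines (
    client_id  NUMBER        NOT NULL,
    order_id   NUMBER        NOT NULL,
    line_no    NUMBER        NOT NULL,
    amount     NUMBER(12,2),
    CONSTRAINT pk_order_lines PRIMARY KEY (client_id, order_id, line_no),
    CONSTRAINT fk_lines_order FOREIGN KEY (client_id, order_id)
        REFERENCES orders (client_id, order_id)
)
PARTITION BY HASH (client_id) PARTITIONS 16;

-- Secondary indexes also lead with client_id, so client-based access
-- paths stay within one partition.
CREATE INDEX ix_orders_client_status ON orders (client_id, status) LOCAL;
```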
Secondly, if I used to have to process all my records once a night to find the orders that needed someone to work on them, then in the new relational world I don't need to do that any more; I just need to find the orders with a status of "pending" or whatever. So maybe Oracle isn't as fast for that batch-oriented workload, but if I change my logic and do an indexed query for pending orders, then again I can get all the performance I want. Even more so, perhaps I make order_status a partitioning key, so my "active" records are all in one partition and all the older orders are in other partitions - and then I put that partition on an SSD-backed array.
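Again a hedged, made-up sketch of that idea in Oracle-flavoured SQL: list-partition on the status column and query the hot partition through an index instead of scanning everything nightly.

```sql
CREATE TABLE orders_by_status (
    order_id   NUMBER       NOT NULL,
    client_id  NUMBER       NOT NULL,
    status     VARCHAR2(20) NOT NULL,
    created_at TIMESTAMP    NOT NULL,
    CONSTRAINT pk_orders_by_status PRIMARY KEY (order_id)
)
PARTITION BY LIST (status) (
    PARTITION p_pending VALUES ('PENDING'),              -- small, hot partition (e.g. on SSD)
    PARTITION p_closed  VALUES ('COMPLETE', 'CANCELLED'),
    PARTITION p_other   VALUES (DEFAULT)
);

CREATE INDEX ix_obs_status_created ON orders_by_status (status, created_at) LOCAL;

-- The nightly "scan every client" batch becomes a cheap, partition-pruned query:
SELECT order_id, client_id
FROM   orders_by_status
WHERE  status = 'PENDING';
```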
Thirdly, take a look at your storage devices. Performance problems in databases are invariably IO problems - either you're doing too much IO (poorly optimised queries), or your IO subsystem can't keep up with the IO that you need to do. 128 cores is an awful lot of compute, and I've rarely seen a database that is compute bound. Maybe look at a big SSD array, some of them can give you enormous IO throughput. Certainly if you were running Oracle on a RAID 5 spinning disk array your performance is likely to suck.
The last random comment here - a lot of people are getting good results with SAP HANA - a fully in-memory database. That really flies, and is specifically designed for workloads that just won't run fast enough in other databases. I bet SAP would come demo it to you for free if you wanted it.

Is a NoSQL database good for online money transaction management? [closed]

I am planning to use a NoSQL database as the back end for my web product. I have a few very basic doubts.
I have read in a blog that NoSQL databases are not so good for online money transactions, i.e. where data integrity is of the highest importance (my product has online money transactions).
There will be a minimum of around 1,000 users daily.
Will availability be a problem?
Can you please state any more pros and cons of NoSQL databases? I am planning to use MongoDB. Can it satisfy the above requirements?
NoSQL databases are there to solve several things, mainly:
(buzz) BigData => think TB, PB, etc.
Working with distributed systems / datasets => say you have 42 products: 13 of them live in the Chicago datacenter, 21 in NY's and another 8 somewhere in Japan, but when you query across all 42 products you do not need to know where they are located - the NoSQL DB does. This also lets you engage a lot more brain power (servers) to solve hard computational problems [it does not seem like this fits your use case, but it is an interesting thing to note].
Partitioning => having your DB be easily distributed, besides those cool 8 products in Japan, also allows for easy data replication, so those 42 products might be replicated with a factor of 3, for example, which would mean your DB has 3 copies of every product. Hence if something goes down, no problem => there is a replica available. This is where NoSQL databases actually shine vs. RDBMS. Granted, you can shard, partition and cluster Oracle / MySQL / PostgreSQL / etc., BUT it is a process several magnitudes more complicated and usually a maintenance headache for most people you'd employ.
(to your questions)
There will be a minimum of around 1,000 users daily
1,000 users daily is an extremely low volume; unless you choose a NoSQL solution that was written yesterday at 3 a.m. as a proof of concept, there should be no worries here. But if you are successful and end up with 100,000,000 users in a couple of months, NoSQL would be simpler to scale.
Will availability be a problem?
Solid NoSQL solutions allow you to specify something that is called a quorum: "The quantity of replicas that must respond to a read or write request before it is considered successful". Some solutions also do something called a hinted handoff: "neighboring nodes temporarily take over storage operations for the failed node". In general, you should be able to control availability depending on your requirements.
(from your comments)
We are planning to expand and don't want the database to be a constraint
Expanding is a very relative term. The financial industry is pretty "expanded", and it still mostly uses RDBMS for day-to-day operations. Facebook uses MySQL. The major banks I have worked for use Oracle / MySQL / PostgreSQL / DB2 / etc., and only some of them use NoSQL, but NOT for data that requires 100% consistency all the time. Even Facebook uses Cassandra only for things like "inbox search". But if by expand you mean more data and more users (requests, connections, etc.), NoSQL will be a lot easier to scale. Again, that does not mean you can't scale an RDBMS; it is just more tedious/complicated.
From what I have read, with a NoSQL database we don't have to think about the schema a lot
In my experience, if I build a system that is any good, I ALWAYS have to think about the schema. NoSQL databases allow you to be a bit more flexible with the data you persist, but it does not mean you should think about the schema any less. Think of indexing the data for example, or sharding it over multiple clusters, or even contracts/interfaces that you may expose to clients, etc..
The maintenance required for a NoSQL database is much less
I would not say this is true in general, unless we are talking about BigData. Take PostgreSQL, for example: it is an extremely awesome piece of software that is quite easy to work with and maintain. Another plus of the RDBMS world => people feel A LOT more comfortable with SQL. For that reason, for example, the Cassandra team released CQL in 0.8, which is a very limited subset of SQL. Terms like maintenance should also stand shoulder to shoulder with terms like talent, knowledge and expertise: Cassandra, for instance, is very "high maintenance", though not for the people at DataStax, who do have the expertise - but you'd have to pay for that.
Your Main Question
I have read in a blog that NoSQL databases are not so good for online money transactions, i.e. where data integrity is of the highest importance. (My product has online money transactions.)
Without truly knowing what your product is, it is hard to say whether a NoSQL database would or would not be a good fit. If the primary goal of the product is "online money transaction", then I would suggest against a NoSQL database (at least today, in the year 2011). If "online money transaction" is just one of the requirements but not "the core" of your product, then, depending on what "the core" is, you can definitely give a NoSQL database a try and, for example, use an external service (e.g. Google Checkout) to process your transactions with guaranteed consistency.
As a technical note, if the problem you are trying to solve benefits from being solved with distribution, I would recommend databases that are written in Erlang (e.g. Riak, CouchDB, etc.), since Erlang as a language has been successfully solving most of these distributed problems for decades.
MarkLogic is a NoSQL database with ACID transactions that is used to manage both virtual currency in games and real-life banking trades.

Redis, CouchDB or Cassandra? [closed]

What are the strengths and weaknesses of the various NoSQL databases available?
In particular, it seems like Redis is weak when it comes to distributing write load over multiple servers. Is that the case? Is it a big problem? How big does a service have to grow before that could be a significant problem?
The strengths and weaknesses of the NoSQL databases (and also SQL databases) are highly dependent on your use case. For very large projects, performance is king; but for brand new projects, or projects where time and money are limited, simplicity and time-to-market are probably the most important. For teaching yourself (broadening your perspective, becoming a better, more valuable programmer), perhaps the most important thing is simple, solid fundamental concepts.
What kind of project do you have in mind?
Some strengths and weaknesses, off the top of my head:
Redis
Very simple key-value "global variable server"
Very simple (some would say "non-existent") query system
Easily the fastest in this list
Transactions
Data set must fit in memory
Immature clustering, with unclear future (I'm sure it'll be great, but it's not yet decided.)
Cassandra
Arguably the most community momentum of the BigTable-like databases
Probably the easiest of this list to manage in big/growing clusters
Support for map/reduce, good for analytics, data warehousing
Multi-datacenter replication
Tunable consistency/availability
No single point of failure
You must know what queries you will run early in the project, to prepare the data shape and indexes
CouchDB
Hands-down the best sync (replication) support, supporting master/slave, master/master, and more exotic architectures
HTTP protocol, browsers/apps can interact directly with the DB partially or entirely. (Sync is also done over HTTP)
After a brief learning curve, a pretty sophisticated query system using JavaScript and map/reduce
Clustered operation (no SPOF, tunable consistency/availability) is currently a significant fork (BigCouch). It will probably merge into Couch but there is no roadmap.
Similarly, clustering and multi-datacenter setups are theoretically possible (the "exotic" thing I mentioned); however, you must write all that tooling yourself at this time.
The append-only file format (for both databases and indexes) consumes disk surprisingly quickly, and you must manually run compaction (vacuuming), which makes a full copy of all records in the database. The same is required for each index file. Again, you have to be your own toolsmith.
Take a look at http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis. He does a good job of summing up why you would use one over the other.

Why exactly do we use NoSQL? [closed]

Having understood some of the advantages that NoSQL offers (scalability, availability, etc.), I am still not clear why a website would want to use a non-relational database.
Can I get some help on this, preferably with an example?
Better performance
NoSQL databases sometimes have better performance, although this depends on the situation and is disputed.
Adaptability
You can add and remove "columns" without downtime. In most SQL servers this takes a long time and puts a heavy load on the server.
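As a rough, hedged illustration (the table and column names are made up), the relational-side equivalent is a DDL change that can be expensive on a large table, whereas a schemaless store just starts accepting documents with the new field:

```sql
-- On a big table these statements may lock and/or rewrite the whole table,
-- depending on the engine and version.
ALTER TABLE orders ADD COLUMN loyalty_points INT NULL;
ALTER TABLE orders DROP COLUMN legacy_flag;

-- In a document store there is no DDL step: new documents simply carry the
-- extra field, and old documents are read as before.
```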
Application design
It is desirable to separate the data storage from the logic. If you join and select things in SQL queries, you are mixing business logic with storage.
NoSQL databases are there to solve several things, mainly:
(buzz) BigData => think TB, PB, etc.
Working with distributed systems / datasets => say you have 42 products: 13 of them live in the Chicago datacenter, 21 in NY's and another 8 somewhere in Japan, but when you query across all 42 products you do not need to know where they are located - the NoSQL DB does. This also lets you engage a lot more brain power (servers) to solve hard computational problems [it does not seem like this fits your use case, but it is an interesting thing to note].
Partitioning => having your DB be easily distributed, besides those cool 8 products in Japan, also allows for easy data replication, so those 42 products might be replicated with a factor of 3, for example, which would mean your DB has 3 copies of every product. Hence if something goes down, no problem => there is a replica available. This is where NoSQL databases actually shine vs. RDBMS. Granted, you can shard, partition and cluster Oracle / MySQL / PostgreSQL / etc., BUT it is a process several magnitudes more complicated and usually a maintenance headache for most people you'd employ.
BUT to your question:
why a website would want to use a non-relational database
When most of the people I have worked with / met / chatted with choose NoSQL for their "website", it is unfortunately NOT for the reasons above, but simply because it is COOLER to do so. And in fact many projects FAIL or run into extreme difficulties for exactly this reason.
If most NoSQL gurus took their masks off, they would all agree that MOST of the problems (or, as people call them, websites) that developers solve day to day can, and rather should, be solved with a SQL solution such as PostgreSQL or MySQL, with some cool Redis cache layer on top of it. Only a small subset of problems would REALLY benefit from NoSQL.
I personally love Riak, as I am a firm believer that a NoSQL, fault-tolerant DB should have an extremely strong, flexible and naturally distributed foundation => such as Erlang OTP. Plus I am a fan of simplicity. But again, given the problem, I would choose whatever works best, and most of the time I will NEED that consistency (especially if we are talking about money / the financial world / mission-critical systems / etc.).
The main reason not to use an SQL database is scalability. The transactional guarantees and the relational model make it almost impossible to scale a database usefully across more than a few machines, especially given the write-heavy workloads generated by modern web applications.
An app like Facebook can't be made to work on a straightforward SQL database, except by massive partitioning and sharding, which requires significant adjustments to the app logic as well. That's why Facebook developed Cassandra.
NoSQL basically means you make do without some SQL-typical features like immediate consistency or easy joins, in exchange for being able to use a database that scales much better.
Conversely, there is no point in using NoSQL if your website never has more than a dozen concurrent users (which is true for the vast majority of all sites).
We need to understand what the problem is in your current application:
Transactions
Amount of data
Data structure
NoSQL solves the problems of scalability and availability by trading off atomicity and consistency.
This basically leads us to the CAP theorem. Eric Brewer noted that of the three properties of shared-data systems – consistency, availability and tolerance to network partitions – only two can be achieved at any given moment in time (the CAP theorem).
The NoSQL approach
Schemaless data representation:
Most of them offer schemaless data representation & allow storing semi-structured data.
The data can continue to evolve over time, including adding new fields or even nesting the data, for example in the case of a JSON representation.
Development time:
No complex SQL queries.
No JOIN statements.
Speed:
Very high-speed delivery, and mostly built-in entity-level caching
Plan ahead for scalability:
Avoiding rework
There are many types of NoSQL databases. Web applications typically use document-based databases. A document DB allows us to store and manipulate JSON, XML, YAML and even Word documents. So NoSQL is the obvious choice; in particular MongoDB, a document database that supports JSON by default, is the most popular choice among developers and designers.

Key-Value Stores vs. RDBMSs vs. "Cloud" DBs (SDB) [closed]

I'm comfortable in the MySQL space having designed several apps over the past few years, and then continuously refining performance and scalability aspects. I also have some experience working with memcached to provide application side speed-ups on frequently queried result sets. And recently I implemented the Amazon SDB as my primary "database" for an ecommerce experiment.
To oversimplify, the quick justification I went through in my mind for using the SDB service was that a schema-less database structure would allow me to focus on the logical problem of my project and rapidly accumulate content in my data store. That is, don't worry about setting up and normalizing all possible permutations of a product's attributes beforehand; simply start loading in the products and SDB will remember everything that is available.
Now that I have managed to get through the first few iterations of my project and need to set up simple interfaces to the data, I am running into issues with things I had taken for granted when working with MySQL, e.g. grouping in SELECT statements and LIMIT syntax to query "items 50 to 100". The ease advantage I gained from SDB's schema-free architecture I lost to the performance hit of querying and looping over a result set of just over 1,800 items.
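For concreteness, these are the kinds of MySQL-style queries meant above (the products table and its columns are made up); SimpleDB at the time had no direct equivalent, so the grouping and pagination had to be done in application code:

```sql
-- Grouping in a SELECT: count products per category.
SELECT category, COUNT(*) AS product_count
FROM   products
GROUP  BY category;

-- Pagination, roughly "items 50 to 100":
-- skip the first 50 rows of the ordered set and return the next 50.
SELECT product_id, name, price
FROM   products
ORDER  BY name
LIMIT  50 OFFSET 50;
```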
Now I'm reading about projects like Tokyo Cabinet that extend the concept of in-memory key-value stores to provide pseudo-relational functionality at ridiculously faster speeds (14x, I read somewhere).
My question:
Are there some rudimentary guidelines or heuristics that I, as an application designer/developer, can go through to evaluate which DB technology is the most appropriate at each stage of my project?
For example: at a prototyping stage, where the logical/technical unknowns of the application keep the data structure fluid, use SDB.
At a more mature stage, where user deliverables are a priority, use traditional tools where you don't have to spend dev time writing sorting, grouping or pagination logic.
Practical experience with these tools would be very much appreciated.
Thanks SO!
Shaheeb R.
The problems you are finding are why RDBMS specialists view some of the alternative systems with a jaundiced eye. Yes, the alternative systems handle certain specific requirements extremely fast, but as soon as you want to do something else with the same data, the fleetest suddenly becomes the laggard. By contrast, an RDBMS typically manages the variations with greater aplomb; it may not be quite as fast as the fleetest for the specialized workload which the fleetest is micro-optimized to handle, but it seldom deteriorates as fast when called upon to deal with other queries.
The new solutions are not silver bullets.
Compared to traditional RDBMS, these systems make improvements in some aspect (scalability, availability or simplicity) by trading-off other aspects (reduced query capability, eventual consistency, horrible performance for certain operations).
Think of these not as replacements for the traditional database, but as specialized tools for a known, specific need.
Take Amazon SimpleDB, for example: SDB is basically a huge spreadsheet. If that is what your data looks like, then it will probably work well, and the superb scalability and simplicity will save you a lot of time and money.
If your system requires very structured and complex queries but you insist on one of these cool new solutions, you will soon find yourself in the middle of re-implementing an amateurish, ill-designed RDBMS, with all of its inherent problems.
In this respect, if you do not know whether these will suit your needs, I think it is actually better to do your first few iterations in a traditional RDBMS, because it gives you the best flexibility and capability, especially in a single-server deployment and under modest load (see the CAP theorem).
Once you have a better idea of what your data will look like and how it will be used, you can then match your needs with an alternative solution.
If you want the simplicity of a cloud-hosted solution but need a relational database, you can check out Amazon Relational Database Service.