My projects are currently SQL Server focused (with a little Postgres and MongoDB thrown in for fun). A recent project involving some configuration on Oracle reminded me of the complexity of implementing and managing Oracle RDBMS instances compared to the above.
Having dealt with DB2 on OS/2 many years ago, I decided to download a trial and install it on CentOS for comparison. It was a fairly quick and easy implementation, including docs and sample data.
Noting that DB2 LUW seems to get relatively little attention, I am wondering why. In certain editions it is price competitive, and by many measures it is highly capable and scalable.
So I am interested in knowing: if you use DB2 Express(-C), WSE, or EE on Linux or Windows, could you share why (if it is your database of choice)?
I work with DB2 for LUW across the spectrum: from high-end Enterprise Server Edition in large enterprises to DB2 Express-C in SMBs.
In my opinion DB2 Express-C is absolutely brilliant for the SMB market. There is virtually no functionality an SMB would need that does not exist in Express-C, and all the major technology from the more expensive DB2 editions is there, including pureXML (which I use extensively) and, with the paid Express-C support at $3k per server, full HADR support.
Things which are not in Express-C are:
Oracle compatibility support (the ability to run Oracle PL/SQL rather than DB2's own SQL PL): not an issue unless you plan to migrate an existing Oracle application. Note that many of the features underpinning this are available, including such things as the associative arrays you mentioned.
Deep Compression: the DB2 compression which I've found can save up to 70% of the disk space on DB2 ESE. But SMBs don't tend to have the amount of data that would justify the extra cost of the compression license even if you could buy it (you are talking about many terabytes of storage before it becomes worthwhile at the current price point). The only thing stopping you from using this compression on Express-C is that you can't buy a license for it. (A DDL sketch follows after this list.)
Some of the partitioning capabilities are also not available in Express-C: but again, partitioning is something that really is only needed by the highest-end customers. In fact, at least one type of partitioning (DPF) is not even available with ESE: you have to buy InfoSphere Warehouse (what used to be called DB2 Data Warehousing Edition) to get it these days.
If you want those you'd have to buy DB2 ESE (at a major price premium).
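For what it's worth, the DDL for both of those ESE-only features is unremarkable – roughly like the sketch below (made-up table; row compression and range partitioning syntax from memory, so check the manual for your release):

CREATE TABLE sales_history (
    sale_id    BIGINT         NOT NULL,
    sale_date  DATE           NOT NULL,
    amount     DECIMAL(12,2)
)
PARTITION BY RANGE (sale_date)
    (STARTING FROM ('2010-01-01') ENDING AT ('2010-12-31') EVERY 1 MONTH)
COMPRESS YES;   -- classic row compression; rejected on editions without the license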
There is one other time when I'd recommend something other than Express-C these days, and that is if you want the ultra-scalability of pureScale. This is an extra-cost option on DB2 ESE, but on WSE it actually comes included (limited only by the total number of processors you can have in the cluster).
Anyway, the bottom line is that I would recommend DB2 (and especially Express-C) to almost anyone these days. I think the reason you don't hear more of it is because IBM just doesn't do a good job of marketing it.
HTH
Phil Nelson
(teamdba#scotdb.com)
We use DB2 LUW at work (though, I speak only for me, not for work). I like that:
It's fast, and has neat tools that help you make your queries faster.
It has facilities for high availability (HADR).
It has XML support, which may or may not be useful to you (but we don't currently use that at work).
Its procedural language is easy to use (if rather lacking in features, especially in versions prior to 9.7) – a small sketch follows at the end of this answer.
It has excellent documentation.
(The decision to use DB2 at work was made long before I started there, so I can't comment on work's rationale for choosing it.)
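To give a flavour of the procedural language point above, a minimal SQL PL sketch (hypothetical table and procedure names; CREATE OR REPLACE needs 9.7 or later, and in the CLP you would run this with a non-default statement terminator such as @):

CREATE OR REPLACE PROCEDURE transfer_points (IN p_from INT, IN p_to INT, IN p_amount INT)
LANGUAGE SQL
BEGIN
    -- two updates in one unit of work; the caller commits or rolls back
    UPDATE member SET points = points - p_amount WHERE member_id = p_from;
    UPDATE member SET points = points + p_amount WHERE member_id = p_to;
END@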
The only thing I would add to Phil Nelson's excellent answer is that DB2 Express-C is currently unique among the no-cost commercial DBMS products in that it does not limit the size of the database. The newest versions of Microsoft's and Oracle's no-cost database engines top out at around 10-11GB of data.
I was trying to see if MongoDB can be used as the database for accounting/bookkeeping software.
I tried to search for answers in existing questions on Stack Overflow, but all of the links I found were very old (pre-2017), and all of them said not to use MongoDB as it is not ACID compliant.
However, since version 4.0 which was released in 2018, MongoDB is now multi-document ACID compliant. So, given that it now supports ACID, is it OK to build an accounting application using MongoDB, or is there any other reason why one should still use RDBMS only?
PS: I'm very new to programming, so please forgive me if this question seems very novice.
Accounting can likely be done with MongoDB, as we can see from Stripe using MongoDB for the majority of their operations, which I believe falls heavily into the accounting area.
The main things you should be careful of:
Scalability (SQL databases can be scaled vertically very easily, but NoSQL databases are usually scaled horizontally, which in my experience has been a little painful).
Dynamic schema (from what I can tell, a dynamic schema might not be the best fit for an accounting requirement).
Complex queries (NoSQL databases can have a hard time executing complex queries, as compared to SQL databases – the sketch below shows the kind of query I mean).
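For example, a routine accounting query such as a trial balance is a join plus aggregation over a double-entry schema – exactly what a relational engine is built for (table and column names below are made up for illustration):

SELECT a.account_code,
       SUM(CASE WHEN l.side = 'D' THEN l.amount ELSE 0 END) AS total_debits,
       SUM(CASE WHEN l.side = 'C' THEN l.amount ELSE 0 END) AS total_credits
FROM journal_line l
JOIN account a       ON a.account_id = l.account_id
JOIN journal_entry e ON e.entry_id   = l.entry_id
WHERE e.posted_at >= DATE '2023-01-01'
GROUP BY a.account_code
ORDER BY a.account_code;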
You should probably give this article from MongoDB about Stripe and their MongoDB usage a read:
https://www.mongodb.com/blog/post/mongodb-powering-the-magic-and-the-monsters-at#:~:text=Stripe%20offers%20a%20simple%20platform,enabling%20transactions%20on%20the%20web.
My client has an existing PostgreSQL database with around 100 tables, and almost every table has one or more relationships to other tables. He's got around a thousand customers who use an app that hits that database.
Recently he hired a new frontend web developer, and that person is trying to tell him that we should throw out the PostgreSQL database and replace it with a MongoDB solution. That seems odd to me, but I don't have experience with MongoDB.
Are there any clear reasons why he should, or should not, make the change? Obviously I'm arguing against it and the other guy for it, but I would like to remove the "I like this one better" from the argument and really hear from the community on their experience with such things.
1) Performance
Over the last few years, there have been several benchmarks comparing Postgres and Mongo.
Here you can find the most recent performance benchmark (Yahoo): https://www.slideshare.net/profyclub_ru/postgres-vs-mongo-postgres-professional (start at slide #58, which gives an overview of the earlier benchmarks).
Notice that, traditionally, MongoDB published benchmarks in which write-ahead logging was not turned on, or fsync was even turned off, so their benchmarks were unfair – in such configurations the database system doesn't wait for the filesystem, so TPS is high but the probability of losing data is also very high.
2) Flexibility – JSON
Postgres has had unstructured and semi-structured data types since 2003 (hstore, XML, array data types). It now has very strong JSON support with indexing (the jsonb data type): you can create partial indexes, functional indexes, index only part of your JSON documents, or index whole documents in different manners (you can tweak an index to trade off its size against speed).
More interestingly, with Postgres you can combine the relational approach and non-relational JSON data – see the talk above https://www.slideshare.net/profyclub_ru/postgres-vs-mongo-postgres-professional for details. This gives you a lot of flexibility and power (I wouldn't keep money-related or basic account-related data in JSON format).
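A minimal sketch of what that mix can look like (made-up table and key names; standard Postgres syntax):

-- Relational columns for the data you must account for, jsonb for the rest.
CREATE TABLE orders (
    order_id    bigserial PRIMARY KEY,
    customer_id bigint NOT NULL,
    total       numeric(12,2) NOT NULL,   -- money stays relational
    details     jsonb                     -- flexible attributes go here
);

-- Index the whole document, or just one key with a partial expression index.
CREATE INDEX orders_details_gin ON orders USING gin (details jsonb_path_ops);
CREATE INDEX orders_rush_idx ON orders ((details->>'priority'))
    WHERE details->>'priority' = 'rush';

-- Containment query served by the GIN index.
SELECT order_id, total FROM orders WHERE details @> '{"channel": "web"}';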
3) Standards and costs of support
SQL is experiencing a rebirth now – NoSQL products have started to add SQL dialects, a lot of people do big data analysis with SQL, and you can even run machine learning algorithms inside an RDBMS (see the MADlib project http://madlib.incubator.apache.org).
When you need to work with data, SQL was, is, and will for a long time be the best language – so much is included in it that all other languages lag far behind. I recommend http://modern-sql.com/ to learn modern SQL features and https://use-the-index-luke.com (from the same author) to learn how to reach the best performance using SQL.
When Mongo needed to create a "BI connector", it also needed to speak SQL, so guess what they chose? https://www.linkedin.com/pulse/mongodb-32-now-powered-postgresql-john-de-goes
SQL is not going anywhere; it is now being extended with SQL/JSON, and this means that for the future, Postgres is an excellent choice.
4) Scalability
If your data size is up to several terabytes, it's easy to live on a "single master – multiple replicas" architecture, either on your own installation or in the cloud (Amazon RDS, Heroku, Google Cloud Platform, and more recently Azure all support Postgres). There is an increasing number of solutions that help you work with a microservice architecture, get automatic failover, and/or shard your data. Here are just a few of them, actively developed and supported, in no particular order:
https://wiki.postgresql.org/wiki/PL/Proxy
https://github.com/zalando/spilo and https://github.com/zalando/patroni
https://github.com/dalibo/PAF
https://github.com/postgrespro/postgres_cluster
https://www.2ndquadrant.com/en/resources/bdr/
https://www.postgresql.org/docs/10/static/postgres-fdw.html
5) Extensibility
There are many more additional projects built to work with Postgres than with Mongo. You can work with literally any data type (including but not limited to time ranges, geospatial data, JSON, XML, and arrays), have index support for it, get ACID guarantees, and manipulate it using standard SQL. You can develop your own functions, data types, operators, index structures, and much more!
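As a small illustration (made-up table, and just one of many available extensions): a few lines of plain SQL give you a range type, a GiST index, and a constraint that no two bookings of the same room may overlap.

-- btree_gist lets plain scalars take part in GiST exclusion constraints.
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE room_booking (
    room   int,
    during tsrange,
    -- reject any two rows with the same room and overlapping time ranges
    EXCLUDE USING gist (room WITH =, during WITH &&)
);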
If your data is relational (and it appears that it is), it makes no sense whatsoever to use a non-relational DB (like MongoDB). You shouldn't underestimate the power and expressiveness of standard SQL queries.
On top of that, Postgres is fully ACID. And it can handle free-form JSON reasonably well, if that is the other developer's primary motivation.
I have a fairly busy DB2 on Windows server - 9.7, fix pack 11.
About 60% of the CPU time used by all queries in the package cache is being used by the following two statements:
CALL SYSIBM.SQLSTATISTICS(?,?,?,?,?,?)
CALL SYSIBM.SQLPRIMARYKEYS(?,?,?,?)
I'm fairly decent with physical tuning and have spent a lot of time on SQL tuning on this system as well. The applications are all custom, and educating developers is something I also spend time on.
I get the impression that these two stored procedures are something that ODBC perhaps calls? Reading their descriptions, they also seem like things that are completely unnecessary for the work being done. The application doesn't need to know the primary key of a table to be able to query it!
Is there anything I can tell my developers to do that will either eliminate/reduce the execution of these or cache the information so that they're not executing against the database millions of times and eating up so much CPU? Or alternately anything I can do at the database level to reduce their impact?
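For context, CPU consumers in the package cache can be listed with something like the following (9.7's MON_GET_PKG_CACHE_STMT table function; column names from memory, so worth double-checking against the docs for your fix pack):

SELECT NUM_EXECUTIONS,
       TOTAL_CPU_TIME,
       SUBSTR(STMT_TEXT, 1, 60) AS STMT
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
ORDER BY TOTAL_CPU_TIME DESC
FETCH FIRST 10 ROWS ONLY;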
6.5 years later, and I have the answer to my own question. This is a side effect of using an ORM. Part of what it does is to discover the database schema. Rails also has a similar workload. In Rails, you can avoid this by using the schema cache. This becomes particularly important at scale. Not sure if there are equivalencies for other ORMs, but I hope so!
I am trying to find out whether there are any differences between SQL Server 2014 Enterprise and Standard editions in the context of T-SQL itself. I am aware that there are tooling and hardware limitations, for example.
I need to know if there are any T-SQL limitations like:
query to run faster on Enterprise
index seek/scan to run faster on Enterprise
updatable clustered columnstore indexes being available in Enterprise only
According to this article comparing Programmability between the editions, there are none. Anyway, I want to double-check and be sure that performance will be the same (on the same hardware) and that I will not need to change anything in the T-SQL code.
An example of such a difference is direct querying of indexed views (using the NOEXPAND hint):
In SQL Server Enterprise, the query optimizer automatically considers the indexed view. To use an indexed view in the Standard edition or the Datacenter edition, the NOEXPAND table hint must be used.
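In other words, on Standard the same indexed view has to be referenced explicitly with the hint – a minimal sketch with a hypothetical view name:

-- dbo.vSalesSummary is an indexed view; on Standard, ask for its index directly.
SELECT ProductID, TotalQty
FROM dbo.vSalesSummary WITH (NOEXPAND);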
A number of features that are categorized under "Data Warehouse" and "Scalability and Performance" affect Programmability.
This comes into play when a developer modifies their T-SQL syntax to take advantage of a particular feature. The syntax will not necessarily produce an error, but it may be less efficient. Partitioned tables, indexed views (as you mention), compression, and star joins, for example, will all affect the execution plan. The query optimizer is usually smart enough to find the best execution plan for a given edition, but that is not always the case.
It is also likely, if you're dealing with a large database, that the optimal indexing strategy may differ between Enterprise and Standard, which in turn might affect the query.
To the degree that different editions suggest different T-SQL syntax, the Standard Edition syntax is usually the more intuitive. I would also say that in most modern environments one is much more likely to be affected by the resource limitations than by query optimizer differences.
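If in doubt about which edition a given environment is actually running before comparing plans, a quick check with standard metadata functions:

SELECT SERVERPROPERTY('Edition')        AS Edition,
       SERVERPROPERTY('ProductVersion') AS ProductVersion;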
I am currently working on a few projects with MongoDB and Apache Cassandra respectively. I am also using Solr a lot, and I am handling "lots" of data with them (approx. 1-2 TB). I heard of Greenplum and Vertica for the first time in the last week, and I am not really sure where to put them in my brain. They seem to me like data warehouse (DWH) solutions, and I haven't really worked with a DWH. They also seem to cost a lot of money (e.g. $60k for 1 TB of storage in Greenplum). I am currently not handling petabytes of data, and I don't think I will, but products like Cassandra also seem to be able to handle this:
Cassandra is the acknowledged NoSQL leader when it comes to comfortably scaling to terabytes or petabytes of data.
via http://www.datastax.com/why-cassandra
So my question: Why should people use Greenplum & Co? Is there a huge advantage in comparison to these other products?
Thanks.
Cassandra, Greenplum and Vertica all handle huge amounts of data but in very different ways.
Some made-up use cases where each database has its strengths:
Use Cassandra for:
tweets.insert(key:user, data:blob);
tweets.get(key:user)
Use Greenplum for:
begin;
update account set balance = balance - 10 where account_id = 1;
update account set balance = balance + 10 where account_id = 2;
commit;
Use Vertica for:
select sum(balance)
over (partition by region order by account rows unbounded preceding)
from transactions;
I work in the telecom industry. We deal with large data sets and complex EDW (enterprise data warehouse) models. We started with Teradata and it was good for a few years. Then the data increased exponentially, and as you know, expansion in Teradata is expensive. So we evaluated EMC's Greenplum, Oracle Exadata, HP Vertica, and IBM Netezza.
In speed, the generation of 20 reports went like this: 1. Vertica, 2. Netezza, 3. Greenplum, 4. Oracle.
In compression ratio, Vertica had a natural advantage. Among the others, IBM is good too.
The worst, as per the benchmarks, were EMC and Oracle. As expected, since both want to sell tons of storage and hardware.
Scalability: all of them scale well.
Loading time: EMC is the best here; the others (Teradata, Vertica, Oracle, IBM) are good too.
Concurrent user queries: Vertica, EMC, Greenplum, then IBM. Oracle Exadata is comparatively slow for any type of query, but much better than its old-school 10g.
Price: Teradata > Oracle > IBM > HP > EMC
Note: you need to compare apples to apples – the same number of cores, RAM, data volume, and reports.
We chose Vertica for its hardware-independent pricing model, lower price, and good performance. Now all 40+ users are happy to generate reports without waiting, and it all fits on low-cost HP DL380 servers. It is great for the OLAP/EDW use case.
All this analysis is only for the EDW/analytics/OLAP case. I am still an Oracle fanboy for OLTP, rich PL/SQL, connectivity, etc. on any hardware or system. Exadata handles mixed workloads decently, but its price/performance ratio is unreasonable, and you still need to migrate 10g code to Exadata best practices (sort of MPP-like, bulk processing, etc.), which is more time-consuming than they claim.
We've been working in Hadoop for 4 years, and Vertica for 2. We had massive loading and indexing problems with our tables in MySQL. We were running on fumes with our home-grown sharding solution. We could have invested heavily in developing a more sophisticated sharding solution, which would have been quite painful, imo. We could have thought harder about what data we absolutely needed to keep in a SQL database.
But at the end of the day, switching from MySQL to Vertica was what we chose. Vertica performance patterns are quite different from MySQL's, which comes with its own headaches. But it can load a lot of data very quickly, and it is good at heavy duty queries that would make MySQL's head spin.
The way I see it, Vertica is a solution when you are already invested in SQL and need a heavier-duty SQL database. I'm not an expert, so I couldn't tell you what a transition to Oracle or DB2 would have been like compared to Vertica, either in terms of integration effort or monetary cost.
Vertica offers a lot of features we've barely looked into. Those might be very attractive to others with use cases different to ours.
I'm a Vertica DBA and prior to that was a developer with Vertica. Michael Stonebraker (the guy behind Ingres, Vertica, and other databases) has some critiques of NoSQL that are worth listening to.
Basically, here are the advantages of Vertica as I see them:
it's rather fast on large amounts of data
its performance is (as far as I can gather) similar to other data warehousing solutions, but its advantage is clustering on commodity hardware. So you can scale by adding more commodity hardware. It looks cheap in terms of overall cost per TB. (Going from memory, not an exact quote.)
Again, it's for data warehousing.
You get to use traditional SQL and tables. It's under the hood that's different.
I can't speak to the other products, but I'm sure a lot of them are fine too.
Edit: Here's a talk from Stonebraker: http://www.slideshare.net/Dataversity/newsql-vs-nosql-for-new-oltp-michael-stonebraker-voltdb
Pivotal, formerly Greenplum, is the well-funded spinoff from EMC, VMware and GE. Pivotal's market is enterprises (and Homeland Cybersecurity agencies) with multi-petabyte databases needing complex analytics and high-speed ETL. Greenplum's origin is a PostgreSQL DB redesigned for MapReduce-style MPP, with later additions for columnar support and HDFS. It marries the best of SQL + NoSQL, making NewSQL.
Features:
In 2015H1 most of their code, including Greenplum DB & HAWQ, will go Open Source. Some advanced management & performance features at the top of the stack will remain proprietary.
MPP (Massively Parallel Processing) shared-nothing RDBMS designed for multi-terabyte to multi-petabyte environments.
Full SQL Compliance - supporting all versions of SQL: ‘92, ‘99, 2003 OLAP, etc. 100% compatible with PostgreSQL 8.2.
The only SQL over Hadoop capable of handling all 99 queries used by the TPC-DS benchmark standard without rewriting. The competition cannot do many of them and is significantly slower. SIGMOD whitepaper.
ACID compliance.
Supports data stored in HDFS, Hive, HBase, Avro, ProtoBuf, Delimited Text and Sequence Files.
Solr/Lucene integration for multi-lingual full-text search embedded in the SQL.
Incorporates Open Source Software: Spring, Cloud Foundry, Redis.io, RabbitMQ, Grails, Groovy, Open Chorus, Pig, ZooKeeper, Mahout, MADlib, MapR. Some of these are used at EBSCO.
Native connectivity to HBase, which is a popular column-store-like technology for Hadoop.
VMware's participation in $150m investment in MongoDB will likely lead to integration of petabyte-scale XML files.
Table-by-table specification of distribution keys allows you to design your table schemas to take advantage of node-local joins and group-bys, but it will perform well even without this (see the sketch after this list).
Row- and/or column-oriented data storage. It is the only database where a table can be polymorphic, with both columnar and row-based partitions as defined by the DBA.
A column-store table can have a different compression algorithm per column because different datatypes have different compression characteristics to optimize their storage.
Advanced Map-Reduce-like CBO Query Optimizer – queries can be run on hundreds of thousands of nodes.
It is the only database with a dynamic distributed pipeline execution model for query processing. While older databases rely on materialized execution Greenplum doesn't have to write data to disk with every intermediate query step. It streams data to the next stage of a query plan in memory, and never has to materialize the data to disk, so it's much faster than what anybody has demonstrated on Hadoop.
Complex queries on large data sets are solved in seconds or even sub-seconds.
Data management – provides table statistics, table security.
Deep analytics – including data mining or machine learning algorithms using MADlib. Deep Semantic Textual Analytics using GPText.
Graphical Analysis - billion edge distributed in-memory graph database and algorithms using GraphLab.
Integration of SQL, Solr indexes, GPText, MADlib and GraphLab in a single query for massive syntactical parsing and graph/matrix affinity analysis for deep search analytics.
Fully ODBC/JDBC compliant.
Distributed ETL rate of 16 TB/hr!! Integration with Talend available.
Cloud support: Pivotal plans to package its Cloud Foundry software so that it can be used to host Pivotal atop other clouds as well, including Amazon Web Services' EC2. Pivotal data management will be available for use in a variety of cloud settings and will not be dependent on a proprietary VMware system. Will target OpenStack, vSphere, vCloud Director, or private brands. IBM announced it has standardized on Cloud Foundry for its PaaS. Confluence page.
Two hardware "appliance" offerings: Isilon NAS & Greenplum DCA.
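To make the storage and distribution points above concrete, a small sketch in Greenplum DDL (illustrative table; the exact storage options vary by version):

CREATE TABLE page_views (
    view_id   bigint,
    user_id   bigint,
    viewed_at timestamp,
    url       text
)
WITH (appendonly=true, orientation=column, compresstype=zlib)  -- column-oriented, compressed storage
DISTRIBUTED BY (user_id);   -- distribution key chosen so per-user joins stay node-local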
There is a lot of confusion about when to use a row database like MySQL or Oracle, a columnar DB like Infobright or Vertica, a NoSQL variant, or Hadoop. We wrote a white paper to try to help sort out which technologies are best suited for which use cases – you can download Emerging Database Landscape (scroll halfway down) or watch an on-demand webinar on the same topic.
Hope either is useful for you.