I am currently testing Greenplum with a small amount of data, around 1 GB.
As Greenplum is said to be "petabyte-scale", I was wondering whether a data volume of one to ten terabytes is worth moving to this MPP architecture instead of a normal PostgreSQL database.
All my network interfaces, on both the master and the segment hosts, are 10 Mb/s.
The published best practices don't address these considerations. My concern is that a relatively "little database" may perform poorly because of the network overhead.
Has anyone already implemented a database at this scale?
The workloads for PostgreSQL and Greenplum are different. PostgreSQL is great for OLTP, queries with index lookups, referential integrity, etc. You typically know the query patterns in an OLTP database too. It can certainly take on some data warehouse or analytical needs but it scales by buying a bigger machine with more RAM and more cores with faster disks.
Greenplum, on the other hand, is designed for data warehousing and analytics. You design the database without knowing how the users will query the data. This means sequential reads, no indexes, full table scans, etc. It can do some OLTP work, but it isn't designed for it. You scale Greenplum by adding more nodes to your cluster. This gives you more CPU, RAM, and disk throughput.
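For context, here is a minimal sketch of what that looks like on the Greenplum side, assuming a hypothetical page_views table and placeholder connection details. Greenplum speaks the PostgreSQL wire protocol, so a standard driver such as psycopg2 works; the DISTRIBUTED BY clause is what spreads rows across the segment hosts.

```python
# Minimal sketch: creating a distributed analytics table in Greenplum.
# Host, database, credentials, table name, and columns are assumptions.
import psycopg2

conn = psycopg2.connect(host="gp-master", dbname="analytics",
                        user="gpadmin", password="secret")
cur = conn.cursor()

# DISTRIBUTED BY spreads rows across segments; queries then run as
# parallel sequential scans on each segment rather than index lookups.
cur.execute("""
    CREATE TABLE page_views (
        view_ts  timestamp,
        user_id  bigint,
        url      text
    )
    DISTRIBUTED BY (user_id);
""")
conn.commit()
cur.close()
conn.close()
```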
What is your use case? That is the biggest determinant in picking Greenplum vs PostgreSQL.
We are currently hosting MongoDB using its official Docker image on EC2 for our production environment; it is a 32 GB memory server dedicated to just this service.
How can using replica sets help us improve the performance of our MongoDB? We are currently seeing query responses getting slower day by day.
Are there any measures by which we can determine whether investing in a replica set will provide worthwhile benefits and not be premature optimization?
MongoDB replication is a high availability solution (see note at the end of the post for more details on Replication). Replication is not a performance improvement solution.
MongoDB query performance depends upon various factors: the size of the collection, the size of the documents, the database design, the query definition, and indexes. Inadequate hardware (memory, disk, CPU, and network) can also affect query performance, as can the number of concurrent operations at a given time.
For faster query performance the main consideration is using indexes. Indexes directly affect the query's filter and sort operations. To find out whether your query is performing optimally and using the proper indexes, generate a query plan using explain with the "executionStats" mode and study the plan. Explain can be run on MongoDB find, update, delete, and aggregation queries, and all of these can benefit from indexes. See Query Optimization.
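As a rough sketch of what that looks like with the Python driver (the connection string, database, collection, and field names here are assumptions, not anything from your setup):

```python
# Minimal sketch: inspect a query plan in "executionStats" mode with pymongo,
# then add a supporting index if the scan looks bad.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

# Run the explain command for a hypothetical find query.
plan = db.command({
    "explain": {"find": "orders", "filter": {"customer_id": 12345}},
    "verbosity": "executionStats",
})

stats = plan["executionStats"]
print("docs examined:", stats["totalDocsExamined"])
print("docs returned:", stats["nReturned"])

# A collection scan that examines far more documents than it returns
# usually means a supporting index is missing, e.g.:
db["orders"].create_index([("customer_id", ASCENDING)])
```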
Adding capacity to the existing hardware is known as vertical scaling; replication is not vertical scaling.
Replication:
This is configured as a replica set: a primary node and multiple secondary nodes. The primary is the main point of contact for the application; all writes happen on the primary (and reads too, by default). The data written to the primary is replicated to the secondaries, which is how data redundancy is accomplished. When the primary goes down, one of the secondaries takes over as primary and keeps the system running via a failover process. Data durability, high availability, redundancy, and failover are the main concepts behind replication. In MongoDB a replica-set cluster can have up to fifty nodes.
It is recommended to use a replica set in production because of this HA functionality.
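For illustration, this is roughly how an application points at a replica set with pymongo; the host names and the replica-set name "rs0" are assumptions.

```python
# Minimal sketch: connect an application to a replica set.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0"
)

# The driver discovers the current primary and, on failover, automatically
# redirects writes to the newly elected primary.
db = client["appdb"]
db["sessions"].insert_one({"user": "alice", "state": "active"})
```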
As a result of resource limits on one hand and the need for HA in production on the other, I would suggest creating a minimal replica set consisting of a primary, a secondary, and an arbiter (an arbiter does not hold any data and consumes very little memory).
Also, writes typically affect your memory performance much more than reads. To achieve better write performance I would advise you to create more shards (the more primaries you have, the more writes you can handle at the same time).
However, I'm not sure what is causing your MongoDB's performance to degrade so quickly. I think you should:
Check what affects your production performance the most (complicated queries or heavy writes).
Change your read preference to "nearest" (see the sketch after this list).
Consider disabling the "majority" read concern (remember that by default there is a "majority" write concern, so members should be up to date).
Check for better indexes.
And of course, create a replica set!
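Here is a small sketch of the read-preference and read-concern items above with pymongo; the hosts, database, and collection names are placeholders.

```python
# Minimal sketch: read from the nearest member and use a "local"
# (non-majority) read concern on a specific collection.
from pymongo import MongoClient, ReadPreference
from pymongo.read_concern import ReadConcern

client = MongoClient(
    "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0"
)
db = client["appdb"]

sessions = db.get_collection(
    "sessions",
    read_preference=ReadPreference.NEAREST,  # read from the closest member
    read_concern=ReadConcern("local"),       # don't wait for majority-committed data
)
doc = sessions.find_one({"user": "alice"})
```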
Good Luck! :P
OK, these systems are scalable with respect to the number of nodes and large amounts of data.
But what about the overhead involved if I use these systems on a small cluster (5-10 nodes) and on a small amount of data, processing/storing on the scale of a couple of gigabytes? Or on even smaller data, like hundreds of MB?
Are there better database systems to use for my cluster and my amount of data?
A scalable solution usually pays a penalty for being able to scale over large data. That penalty is small compared to the large volumes of data you get to process. If you do not envisage processing data in the terabytes, you could do with a more responsive system that does not pay that penalty.
Use an SQLite database for smaller data. Frankly, it depends on what other requirements/constraints you have.
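To give a feel for how little ceremony the SQLite option needs (file name and schema here are made up):

```python
# Minimal sketch: an embedded SQLite database is often enough for data
# in the hundreds of MB to a few GB.
import sqlite3

conn = sqlite3.connect("metrics.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (ts TEXT, name TEXT, value REAL)")
conn.execute("INSERT INTO events VALUES (?, ?, ?)",
             ("2024-01-01T00:00:00", "cpu", 0.42))
conn.commit()

for row in conn.execute("SELECT name, AVG(value) FROM events GROUP BY name"):
    print(row)
conn.close()
```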
You can probably just use a single-node MySQL server for this kind of data, with the advantage of full SQL capabilities, full ACID guarantees, mature tools, etc.
Suppose I need to do the following operations intensively:
put(key, value)
where value is a map of <column name, column value>.
I haven't known NoSQL for long. What I know is that both a Cassandra insert (which conforms to the API defined in the Bigtable paper) and the Redis "HSET" command could do that. But what are the pros and cons of each approach? Are there any performance or scalability differences?
EDIT :
My requirement is something like an IM server: I need to store session data, and I want all of it to be in memory so that low latency can be easily achieved. A session lasts for at most 2 hours. No consistency requirements to consider yet, and disk is only for fail-over; loss of data is not terrible. All I need is low latency. Operations per second: the more, the better.
Both Redis and Cassandra can be used as a key-value store. The difference is in speed, scale, and reliability.
Redis works best as a single server, where the entire data set resides in memory.
Cassandra can handle data sets that don't fit in memory, and data sets that don't fit on a single machine. As part of distributing over multiple machines, Cassandra is much more reliable: it can handle machine failures, rebuild failed machines, and add capacity to the cluster when needed.
Because Redis is entirely in memory, and reads/writes are served by a single machine (a single Cassandra write will typically talk to multiple machines), Redis will most likely be faster.
If your primary goal is speed, you don't need to store data reliably, and your data set fits in memory, then Redis would probably be the better solution.
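A minimal sketch of the Redis approach for the session use case described in the question (in-memory hash per session, expiring after 2 hours); the host, key naming, and field names are assumptions, and it uses the redis-py package.

```python
# Minimal sketch: put(key, value) where value is a map of field -> value,
# stored as a Redis hash with a 2-hour TTL.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def put(key, value):
    """Store a map of <column name, column value> under one session key."""
    r.hset(key, mapping=value)
    r.expire(key, 2 * 60 * 60)  # sessions live for at most 2 hours

put("session:42", {"user": "alice", "ip": "10.0.0.5", "state": "chatting"})
print(r.hgetall("session:42"))
```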
I want to brainstorm with you all about the scaling options DB2 has. I hope you can help me resolve the problem.
I need to scale my DB2 database to handle flash-crowd transactions hitting the database server. My database can only serve around 200+ transactions per second (in application terms, not database TPS) before it completely stalls and runs out of CPU.
What do you think: if I want to increase this to 2000+, about 10 times what I have now, what options do I have to scale my database?
Recently I read about the pureScale feature. It looks promising, but it is not a flexible solution for us, meaning it can only be deployed on IBM System x and ours is not. Are there other solutions like pureScale that take a shared-everything approach?
The second option may be database partitioning. Can database partitioning, i.e. a shared-nothing approach, help resolve my problem? Can it add processing power to my system?
Thanks and regards,
Fritz
Before you worry about how to scale up (more hardware in 1 server) or out (more servers), look at how to tune your database. Buying your way out of a performance problem is almost always more expensive than spending time to find and fix the performance problem.
Assuming that the process(es) consuming CPU on your database server are the database engine, then high CPU activity and low I/O activity indicates that you're doing a LOT of reads, but they are all in memory. Scanning a huge table is still inefficient, even if that table is stored completely in memory (buffer pools).
Find the SQL statements that are using the most CPU (see the sketch below for one way to do this). Look at their explain plans and figure out how to make them more efficient. There are LOTS of resources on the web for database performance tuning.
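As one hedged way to find those statements, the sketch below queries the MON_GET_PKG_CACHE_STMT monitoring function (available in DB2 LUW 9.7 and later; exact column names can vary by version) through the ibm_db driver. The connection string is a placeholder.

```python
# Sketch: list the top CPU-consuming statements from the DB2 package cache.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=mydb;HOSTNAME=dbhost;PORT=50000;PROTOCOL=TCPIP;"
    "UID=db2inst1;PWD=secret;",
    "", ""
)

sql = """
SELECT NUM_EXECUTIONS,
       TOTAL_CPU_TIME,
       SUBSTR(STMT_TEXT, 1, 200) AS STMT_TEXT
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2))
ORDER BY TOTAL_CPU_TIME DESC
FETCH FIRST 10 ROWS ONLY
"""

stmt = ibm_db.exec_immediate(conn, sql)
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row["TOTAL_CPU_TIME"], row["NUM_EXECUTIONS"], row["STMT_TEXT"])
    row = ibm_db.fetch_assoc(stmt)
ibm_db.close(conn)
```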
Can it be said what is best in the general case (where the database is really big): a MongoDB cluster consisting of a larger number of smaller blade servers, or a few really fat servers?
It is a given that the shard key has quite fine granularity, so splitting should not be a problem.
If there is no "golden bullet", what are the pros and cons of each setup?
Best in what aspect? From a financial point of view, I'd go for lots of cheap hardware :)
MongoDB has been built to easily scale across nodes, so why not take advantage of this? The reason you'd want just one or a few beefy servers for a SQL server is to minimize the spread of relational data across physical nodes. But since MongoDB uses documents, most of your related data is stored in a single document. This means that it's all stored at the same physical location and you don't have to do costly lookups on other nodes to reconstruct the 'complete picture' of your data.
Another thing to keep in mind is that map-reduce jobs can only run in parallel in a sharded environment. So if you plan to do a lot of map-reducing, more shards/servers will result in better performance.
What if your database outgrows your beefy servers? Are you going to invest in another beefy server that handles that small amount of extra growth? Or what if one of them crashes? With smaller and cheaper servers, you can scale up (or down) more gradually if the need arises. Also, the impact of a server crash is much smaller, as it will affect only a small portion of your data.
To summarize: a large cluster of smaller servers isn't a silver bullet, as managing such a cluster has its own challenges, but it is significantly cheaper and possibly faster as well if you're doing map-reduce.
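For reference, spreading a collection over many smaller servers comes down to sharding it on that fine-grained key. A minimal sketch with pymongo, where the mongos host, database, collection, and shard key are all assumptions:

```python
# Minimal sketch: enable sharding and shard a collection on a hashed key
# so writes and data are spread evenly across the smaller servers.
from pymongo import MongoClient

# Connect to a mongos query router, not directly to a shard.
client = MongoClient("mongodb://mongos1:27017")

client.admin.command("enableSharding", "mydb")
client.admin.command(
    "shardCollection", "mydb.events",
    key={"user_id": "hashed"},  # hashed key gives fine-grained, even splits
)
```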