Multiple connection pool on same PostgreSQL - postgresql

My application has 2 sections mainly,
User interface written in angular which uses a Django python back end.
Heavy map reduce kind of process.
Both uses postgres for look up, so my doubt is if I use same connection pool for both, at the time when my map reduce is runnning due to heavy lookup my other application won't work because of no connection available. Is there any work around this.(Avoiding the postgres itself is in the backlog)
PS: I am using pgbouncer for pooling

Simplest approach would be separating the two sections.
At least with respect to the connection resources.
(Whether e.g. memory consumption and gc would benefit from restructuring is not asked for)
You may achieve this using one of the following approaches:
use two separate pools, one for each section.
This way, you may setup the pools according to the connection requirements per section.
change your code to maintain sufficient "free" resources for the other section.
This is quite tedious and only useful as soon as the resource requirements
need fine grain control depending on internal state of the algorithms.
Usually you'd want to go with suggestion 1.

Related

Parallel processing input/output, queries, and indexes AS400

IBM V6.1
When using the I system navigator and when you click System values the following display.
By default the Do not allow parallel processing is selected.
What will the impact be on processing in programs when you choose multiple processes, we have allot of rpgiv programs and sql queries being executed and I think it will increase performance?
Basically I want to turn this on in production environment but not sure if I will break anything by doing this for example input or output of different programs running parallel or data getting out of sequence?
I did do some research :
https://publib.boulder.ibm.com/iseries/v5r2/ic2924/index.htm?info/rzakz/rzakzqqrydegree.htm
And understand each option but I do not know the risk of changing it from default to multiple.
First off, in order get the most out of *MAX and *OPTIMIZE, you'd need a system with more than one core (enabled for IBM i / DB2) along with the DB2 Symmetric Multiprocessing (SMP) (57xx-SS1 option 26) license program installed; thus allowing the system to use SMP for queries and index builds.
For *IO, the system can use multiple tasks via simultaneous multithreading (SMT) even on a single core POWER 5 or higher box. SMT is enabled via the Processor multi tasking (QPRCMLTTSK) system value
You're unlikely to "break" anything by changing the value. As long as your applications don't make bad assumptions about result set ordering. For example, CPYxxxIMPF makes use of SQL behind the scenes; with anything but *NONE you might end up with the rows in your DB2 table in different order from the rows in the import file.
You will most certainly increase the CPU usage. This is not a bad thing; unless you're currently pushing 90% + CPU usage regularly. If you're only using 50% of your CPU, it's probably a good thing to make use of SMT/SMP to provide better response time even if it increases the CPU utilization to 60%.
Having said that, here's a story of it being a problem... http://archive.midrange.com/midrange-l/200304/msg01338.html
Note that in the above case, the OP was pre-building work tables at sign on in order to minimize the wait when it was time to use them. Great idea 20 years ago with single threaded systems. Today, the alternative would be to take advantage of SMP/SMT and build only what's needed when needed.
As you note in a comment, this kind of change is difficult to test in non-production environments since workloads in DEV & TEST are different. So it's important to collect good performance data before & after the change. You might also consider moving it stages *NONE --> *IO --> *OPTIMIZE and then *MAX if you wish. I'd spend at least a month at each level, if you have periodic month end jobs.

G-WAN Key-Value Store

What would you consider the best solution to store via G-WAN Key-Value Store my values in RAM and multi-threaded, and able to be used by all my scripts (from other virtual servers or not) ?
Thank you in advance.
I wish to store different values in different "storages" so as to be able to recover each one via a "key" (type char).
The G-WAN KV store does that (for any type of data: binary too).
Once your application will have millions of concurrent users, one way to speed-up lookups will be to use different G-WAN servers to host either a partitioned data set or a redundant data set (it all depends on the type of your application).
The G-WAN reverse-proxy featuring an elastic load-balancer makes such things this almost transparent for developers.
I do not care that the data is lost when you restart g-wan.
Then you won't have to use a persistant layer like mySQL, etc.
So it would be fine (I think) to have a persistent pointer but I'm not sure that this is the most suitable solution
Look at the persistence.c example for about how to share common data among all worker threads in G-WAN.
But you can avoid that if you are using G-WAN with one single worker thread (./gwan -w 1). One thread is more than enough to start developing and even to operate your application until the point you will need to process more requests.
With one single thread, you can just use a static pointer to your G-WAN KV store (unless different scripts need to access it).

Is it better to use multiple databases when you are managing independent sets of things in MongoDB?

If, as an example, you have a blogging website done with MongoDB to store data
Is it better to have a database per blogger? given that their blogs and comments are completely independent from other bloggers. Or just lump everything together? or it doesn't make too much difference?
I'm imagining the same web app (not independent webs/urls per blogger) is used by all bloggers. So when someone logs in / accesses the blog the code would find the right database to use and haul data out it.
Does this have any downsides? is this normal for handling these kinds of things?
I am making plenty of assumptions about your needs. But, generally, there are 3 paths to multi-tenant apps in MongoDB:
Single collection per customer; never, ever do this.
Single database per customer. Good. You will trade off free space if your product is on the freemium model. Either way, you will want to run with "smallfiles" option. As stated, you will build the routing system for your environment. Thus, you will want to connect to the proper database for the proper customer.
customer_id key per document + path slug. Good. The trade off here is recovery of free space. Traditionally, MongoDB does not recover space used by deleted documents. Thus customers creating and deleting blog posts would create unused space. By using 'usePowerOf2Sizes' collections, you will recover disk space of deleted documents. However, 'usePowerOf2Sizes' creates bloated padding space.
To get over the disk space padding, take a look at the compression used here: http://blog.appsignal.com/blog/2013/07/30/taming-mongodb-disk-usage.html
Recap, I would recommend using customer_id plus the compression. It gives you the best of both worlds.
As stated in the comments under the original question, there's really no performance benefit to splitting up your MongoDB store into separate databases per blogger, due to the overhead of having each database and minimum storage.
On the flipside: You are going to make some cross-user analysis more difficult for yourself. As a very simple example, based on your blogging example: Imagine you want to look at average post count per user. This is pretty simple if your users (and posts) are in the same database (typically in the same collections), and you can likely use the aggregation framework for this task. This task will not be so straightforward with an unbounded number of databases, where you'll need to first enumerate all databases, then perform your aggregations/averaging once per database. This could end up being a slower operation than within a single-database architecture.
Having said all that: You still might have some reason to split data across databases. Maybe you have to separate data due to legal reasons, or to ensure customers that their sensitive data won't be commingled with other companies' data. Maybe your customer needs full read/write access to their database, and so you use per-database configuration as a security boundary. I'm sure there are other reasons as well...
It is perfectly normal to allocate 100's of databases if that is all you will see.
Database separation can have many benefits. They can be sharded independantly, since sharding occurs on database level. Databases also have the upside of being completely isolated instances (including locks) of the data within them (good example: space allocation occurs on database level).
This means they can be moved around the network as users data is accessed more and since a single users data might not be that big it would be easier than moving all of your users data to a more powerful node.
However, you must consider the problematic sides in the application of managing the connections to each database. There will be over head on it and you will need to have far more complex coding than what is considered standard.
Considering space, you will not see a drastic usage of space. The most problematic part of using separate databases is the journal allocation. Every collection you use in separate databases will also, of course, pre-allocate itself but this is actually considered one of the upsides to using database separation (movement of databases between nodes, isolation).
So the space problem is really only a problem if your scenario makes it one.
is this normal for handling these kinds of things?
For a normal blogger site, no, and I do not know enough about the complexities of your scenario to say any different. Normal operation would be to lump everything together, since you could see into the region of 1,000's maybe 1,000,000's of users and database separation just won't scale over that very well.

Data Synchronization in a Distributed system

We have an REST-based application built on the Restlet framework which supports CRUD operations. It uses a local-file to store the data.
Now the requirement is to deploy this application on multiple VMs and any update operation in one VM needs to be propagated other application instances running on other VMs.
Our idea to solve this was to send multiple POST msgs (to all other applications) when a update operation happens in a given VM.
The assumption here is that each application has a list/URLs of all other applications.
Is there a better way to solve this?
Consistency is a deep topic, and a hard thing to get right. The trouble comes when two nearly-simultaneous changes occur to the same data: conflicting updates can arrive in one order on one server, and in another order on another. This is a problem, since the two servers no longer agree on what the data is, and it isn't clear who is "right".
The short-story: get your favorite RDBMS (for example, mysql is popular) and have your app servers connect to in what is called the three-tier model. Be sure to perform complex updates in transactions, which will provide an acceptable consistency model.
The long-story: The three-tier model serves well for small-to-medium scale web sites/services. You will eventually find that the single database becomes the bottleneck. For services whose read traffic is substantially larger than write traffic, a common optimization is to create a single-master, many-slave database replication arrangement, where all writes go to the single master (required for consistency with non-distributed transactions), but the more-common reads could go to any of the read slaves.
For services with evenly-mixed read/write traffic, you may be better served by dropped some of the conveniences (and accompanying restrictions) that formal SQL provides and instead use of one of the various "nosql" data stores that have recently emerged. Their relative merits and fitness for various problems is a deep topic in itself.
I can see 7 major options for now. You should find out more details and decide whether the facilities / trade-offs are appropriate for your purpose
Perform the CRUD operation on a common RDBMS. Simplest and most consistent
Perform the CRUD operations on a common RDBMS which runs as fast in-memory RDBMS. eg TimesTen from Oracle etc
Perform the CRUD on a distributed cache or your own home cooked distributed hash table which can guarantee synchronization eg Hazelcast/ehcache and others
Use a fast common state server like REDIS/memcached and perform your updates
in a synchronized manner on it and write out the successfull operations to a DB in a lazy manner if required.
Distribute your REST servers such that the CRUD operations on a single entity are only performed by a single master. Once this is done, the details about the changes can be communicated to everyone else using a reliable message bus or a distributed database (eg postgres) that runs underneath and syncs all of your updates fairly fast.
Target eventual consistency and use a distributed data store like Cassandra which lets you target the consistency you require
Use distributed consensus algorithms like Paxos or RAFT or an implementation of the same(recommended) like zookeeper or etcd respectively and take ownership of the item you want to change from each REST server before you perform the CRUD operation - might be a bit slow though and same stuff is what Cassandra might give you.

One big call vs. multiple smaller TSQL calls

I have a ADO.NET/TSQL performance question. We have two options in our application:
1) One big database call with multiple result sets, then in code step through each result set and populate my objects. This results in one round trip to the database.
2) Multiple small database calls.
There is much more code reuse with Option 2 which is an advantage of that option. But I would like to get some input on what the performance cost is. Are two small round trips twice as slow as one big round trip to the database, or is it just a small, say 10% performance loss? We are using C# 3.5 and Sql Server 2008 with stored procedures and ADO.NET.
I would think it in part would depend on when you need the data. For instance if you return ten datasets in one large process, and see all ten on the screen at once, then go for it. But if you return ten datasets and the user may only click through the pages to see three of them then sending the others was a waste of server and network resources. If you return ten datasets but the user really needs to see sets seven and eight only after making changes to sets 5 and 6, then the user would see the wrong info if you returned it too soon.
If you use separate stored procs for each data set called in one master stored proc, there is no reason at all why you can't reuse the code elsewhere, so code reuse is not really an issue in my mind.
It sounds a wee bit obvious, but only send what you need in one call.
For example, we have a "getStuff" stored proc for presentation. The "updateStuff" proc calls "getStuff" proc and the client wrapper method for "updateStuff" expects type "Thing". So one round trip.
Chatty servers are one thing you prevent up front with minimal effort. Then, you can tune the DB or client code as needed... but it's hard to factor out the roundtrips later no matter how fast your code runs. In the extreme, what if your web server is in a different country to your DB server...?
Edit: it's interesting to note the SQL guys (HLGEM, astander, me) saying "one trip" and the client guys saying "multiple, code reuse"...
I am struggling with this problem myself. And I don't have an answer yet, but I do have some thoughts.
Having reviewed the answers given by others to this point, there is still a third option.
In my appllication, around ten or twelve calls are made to the server to get the data I need. Some of the datafields are varchar max and varbinary max fields (pictures, large documents, videos and sound files). All of my calls are synchronous - i.e., while the data is being requested, the user (and the client side program) has no choice but to wait. He may only want to read or view the data which only makes total sense when it is ALL there, not just partially there. The process, I believe, is slower this way and I am in the process of developing an alternative approach which is based on asynchronous calls to the server from a DLL libaray which raises events to the client to announce the progress to the client. The client is programmed to handle the DLL events and set a variable on the client side indicating chich calls have been completed. The client program can then do what it must do to prepare the data received in call #1 while the DLL is proceeding asynchronously to get the data of call #2. When the client is ready to process the data of call #2, it must check the status and wait to proceed if necessary (I am hoping this will be a short or no wait at all). In this manner, both server and client side software are getting the job done in a more efficient manner.
If you're that concerned with performance, try a test of both and see which performs better.
Personally, I prefer the second method. It makes life easier for the developers, makes code more re-usable, and modularizes things so changes down the road are easier.
I personally like option two for the reason you stated: code reuse
But consider this: for small requests the latency might be longer than what you do with the request. You have to find that right balance.
As the ADO.Net developer, your job is to make the code as correct, clear, and maintainable as possible. This means that you must separate your concerns.
It's the job of the SQL Server connection technology to make it fast.
If you implement a correct, clear, maintainable application that solves the business problems, and it turns out that the database access is the major bottleneck that prevents the system from operating within acceptable limits, then, and only then, should you start persuing ways to fix the problem. This may or may not include consolidating database queries.
Don't optimize for performance until a need arisess to do so. This means that you should analyze your anticipated use patterns and determine what the typical frequency of use for this process will be, and what user interface latency will result from the present design. If the user will receive feedback from the app is less than a few (2-3) seconds, and the application load from this process is not an inordinate load on server capacity, then don't worry about it. If otoh the user is waiting an unacceptable amount of time for a response (subjectve but definitiely measurable) or if the server is being overloaded, then it's time to begin optimization. And then, which optimization techniques will make the most sense, or be the most cost effective, depend on what your analysis of the issue tells you.
So, in the meantime, focus on maintainability. That means, in your case, code reuse
Personally I would go with 1 larger round trip.
This will definately be influenced by the exact reusability of the calling code, and how it might be refactored.
But as mentioned, this will depend on your exact situation, where maintainability vs performance could be a factor.