Are "best practices" regarding connection handle re-use and database user design mutually exclusive? - postgresql

SO says this may be subjective. I'm hoping not--I just can't seem to understand how this works in practice, and it seems like a specific enough technical question with, I hope, a definitive answer.
Context: LAPP stack.
1. I've read that using a single database user as the login for all connections to the database, and handling security yourself from there, is a bad idea. Databases have sufficient security models and it makes sense to use them.
2. Database handles have some resource cost associated with them, hence the existence of Apache::DBI, DBIx::Connector, and DBI::connect_cached(), to re-use a recent connection to a database. Making use of them should make a web app faster by avoiding the cost of connecting to a database.
The reason these seem to be mutually exclusive best practices is that, in my understanding, #1 implies that any database connection will be made with separate per-user credentials, which implies (as Apache::DBI documents) that re-using such connections will likely quickly cause your database backend to run out of connections.
The default maximum number of connections for PostgreSQL is 100.
The default number of servers multiplied by the subprocesses allowed for each, for Apache 2 running with the prefork MPM, far exceeds that, so it seems Apache::DBI's docs are right.
Thus the question: What do people do then, in practice?
Does this mean people using a LAPP stack generally connect using a single database user, and implement their own security/permissions model? Or does it mean they don't pool connections? Or do they choose between these two strategies based on speed vs security needs if they go with a LAPP stack, and if they need both, go with a desktop app or some other connection model?
Or if these are not, in fact, mutually exclusive strategies, what am I missing in my understanding here?
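To make the arithmetic behind this concern concrete, here is a rough Python sketch. It assumes each Apache child ends up caching one handle per distinct set of credentials it has served, which is the behaviour the Apache::DBI docs warn about. The 256-worker figure and the user counts are illustrative assumptions, not measured values; only the 100-connection PostgreSQL default comes from the question.

```python
def worst_case_connections(worker_processes: int, distinct_db_users: int) -> int:
    """Upper bound on open handles if every worker caches one handle per DB user."""
    return worker_processes * distinct_db_users


PG_MAX_CONNECTIONS = 100  # PostgreSQL default, as stated in the question
WORKERS = 256             # assumed prefork worker limit; check your own Apache config

for users in (1, 2, 50):
    needed = worst_case_connections(WORKERS, users)
    status = "fits" if needed <= PG_MAX_CONNECTIONS else "exceeds the limit"
    print(f"{users:>3} DB user(s): up to {needed} connections ({status})")
```

The bound grows linearly with the number of distinct database logins, which is why per-user credentials and per-process connection caching pull in opposite directions.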

I've read that using a single database user as the login for all connections to the database, and handling security yourself from there, is a bad idea. Databases have sufficient security models and it makes sense to use them.
You probably misread this, or read it in a highly biased location. A more balanced view is (hopefully) this:
Managing perms (ACL, RBAC, or other) in your own tables within the database is a bloody mess and hard to get right. It can cripple performance, too, if done improperly (think: "select * from table join perms where convoluted_permission_scenario"). Depending on who you ask, you'll get more or less extreme viewpoints, e.g. here's (the very controversial) Zed Shaw: http://vimeo.com/2723800.
Managing perms at the DB level is just as much of a bloody mess. Not all engines implement row-level permissions, and even then there occasionally are leaks. For instance, calling a function in a where clause could (can?) leak rows in Postgres (until a recent version?) if raise gets called. And frankly, if you go past a superficial analysis of what is going on, it basically amounts to the former — just standardized and (usually) in C.
Managing perms at the app level without a database is also a bloody mess. It'll cripple performance no matter what you do from the moment where you need to join outside of SQL, unless you're dealing with trivial amounts of data. If you try it, you'll do fine… until your database grows too large and you basically don't.
So, in short: it's a bloody mess no matter where you manage it. Because permissions are a mess. In addition to the casual and idealistic "Joe needs write access to this set of nodes", you also need to cope with more down to earth scenarios such as "John is going off on vacation for Christmas and needs to temporarily delegate his write permissions on this set of nodes to his assistant Jane". Moreover, whichever scenario you do pick, you need to manage read access (which is usually the most frequent) in such a way that it's fast so you can scale. There's no silver bullet.
Moreover, even in the first and last of the above scenarios, it's ideal to have three DB users. One for reads, one for read/writes, and one for schema changes. Most apps don't, because it's yet another bloody mess to configure your ORM that way, hence the typical one DB user per app.
Anyway, getting back to your question: what people do in practice is one or two database users (read vs read/write/modify), implement RBAC or ACL within the database itself, and avoid access restriction logic like the plague on public-facing pages for performance reasons.

Multiple Siddhi apps or one big one

We are building an application on top of Siddhi (using the Java library) that allows users to dynamically add rules and have all incoming information going forward be run against those rules. My question is whether it's better to have one large app with many queries, streams, windows, and partitions, or to break up each query into its own application.
We have been including everything in one single Siddhi app (SiddhiAppRuntime), but this is starting to become large and I fear things may start interacting with each other in unintended ways. We are also snapshotting the SiddhiAppRuntime and restoring state whenever our application gets restarted. This could likely lead to massive restores if we have hundreds of pattern queries to re-run.
I am considering making a separate SiddhiAppRuntime from a single SiddhiManager for each query. The benefits (as I see them) would be to reduce the risk of unintentional interactions, to make each query able to function on its own, and to simplify restoring after a shutdown, since each restore would only need to cover a single query. A potential downside could be increased overhead from having potentially hundreds of SiddhiAppRuntimes.
What is considered best practice for our scenario? What will offer better performance, both for running data through the rules and for restoring the rules in the case we have to restart?
(If this is too broad or any clarification is needed I will do my best to update this question accordingly)
From the lengthy description you have given, I assume the rules that users add do not interact with each other, meaning rules added by user1 will not interact with rules added by user2.
In such a case it is recommended to use a different Siddhi app (SiddhiAppRuntime) for each user. This won't add much performance overhead, as the apps won't be interacting with each other. It will also improve the snapshotting process, since a separate snapshot is taken for each app.
This will also give you a clear separation between each collection of rules and make them easier to manage.

Unprotected MongoDB server?

I apologize beforehand if this is a stupid or a silly question in any way. Let's just say that I stumbled upon an unprotected MongoDB server belonging to a big company. I tried using a client to connect to the server, without entering a username and password and it connected successfully. Now, I'm not sure if I have access to the data inside the databases, but I can see that there are a few databases on it, and I believe that it's possible for me to create and drop databases on it (haven't tried). How big of a security flaw does this constitute? Please note that I haven't tampered or messed around with anything, I'm just asking so I can discern if this is indeed a security flaw that I should report, or a false positive. Shouldn't such access be limited to database administrators?
I see where this is going. There may be several cases:
It might be a development server and the data is fake.
It may be abandoned.
They may be running some maintenance during which some lazy devs opened up the ports and security.
Most production databases are locked down well enough; since you call it a "BIG" company, they have most probably done so.
Whatever the case, depending on the company you could even be served with criminal notices; not every company takes bug reports from third parties well. If they have a proper bug bounty program, though, they may offer you a reward. Tread with caution.

How do we develop an application on the mainframe to access DB2/LUW without DB2/z?

We have developed an application which runs on the mainframe (z/OS), and it uses CAF, the Call Attach Facility, to talk to DB2/z for storing its data.
Those customers which already have DB2/z (and hence have to pay for it regardless) are not concerned, but there are others who want to use our application without incurring the expense of the database as well.
They have expressed a desire to have our product not use DB2/z, due to the expense. Under z/OS, the licence fees for DB2 are rather high and our application doesn't really need the insane levels of reliability that it provides.
So what they'd like us to do is to run DB2 under either zLinux (SLES/RHEL), or DB2/LUW on a machine totally separate from the mainframe. Or even, though this will probably be harder, in a non-IBM database.
We're looking for a hopefully-minimal-change solution to our code in achieving this. DB2 has all its federated stuff which will allow a program using DB2/z to seamlessly access data on an instance running elsewhere, but this still requires DB2/z and hence won't result in a cost reduction.
What would be the easiest way to shift all the data off the mainframe and allow us to remove the DB2/z dependency completely from our application?
Building on @NealB's answer, another way to create the layers would be to have no SQL in your application layer, but to call subroutines to accomplish your I/O. You indicate you would be willing to create custom builds, so you could create a set of routines for commonly-asked-for persistence layers.
Call the "database connect" module, which for DB2 on z/OS would do the CAF calls, for DB2 on z/Linux would (say) establish an SSL connection to the DBMS. Maintain a structure in memory with a union of pointers to the necessary data structures to communicate with your DBMS of choice.
FWIW I've seen vendor code that does this, allowing the business logic to be independent of the DBMS implementation. Some shops use VSAM, others DB2, others IMS. The data model is messy, but, sometimes them's the breaks.
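As a rough illustration of that layering, here is a minimal Python sketch (Python only to keep it short; the real thing would be in whatever language the application is written in). Every name in it is hypothetical, and the in-memory backend is just a stand-in for a module that would do the CAF calls, open a network connection to a remote DBMS, or talk to VSAM.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class PersistenceLayer(ABC):
    """Hypothetical I/O interface; the business logic only ever sees this."""

    @abstractmethod
    def connect(self) -> None: ...

    @abstractmethod
    def store(self, key: str, record: dict) -> None: ...

    @abstractmethod
    def fetch(self, key: str) -> Optional[dict]: ...


class InMemoryBackend(PersistenceLayer):
    """Stand-in backend; a real one would wrap the DBMS-specific calls."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}
        self._connected = False

    def connect(self) -> None:
        self._connected = True

    def store(self, key: str, record: dict) -> None:
        if not self._connected:
            raise RuntimeError("connect() must be called first")
        self._rows[key] = dict(record)

    def fetch(self, key: str) -> Optional[dict]:
        row = self._rows.get(key)
        return dict(row) if row is not None else None


# One entry per supported persistence layer, chosen per custom build or config.
BACKENDS = {"memory": InMemoryBackend}


def open_persistence(name: str) -> PersistenceLayer:
    """The "database connect" module: pick a backend and connect it."""
    backend = BACKENDS[name]()
    backend.connect()
    return backend


def record_order(db: PersistenceLayer, order_id: str, amount: int) -> None:
    """Business logic: contains no SQL and no DBMS-specific calls."""
    db.store(order_id, {"amount": amount})
```

Swapping DB2 on z/OS for something else then means adding one backend and one registry entry, while routines like record_order stay untouched.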
This isn't an answer, just a couple of ideas and observations.
One approach I can think of would be to tier your application into an I/O layer and an
application layer. The application would run on z/OS and the I/O layer would run on
whatever machine hosts the database. All data access would then be via remote procedure calls
over TCP/IP or UDP. This would be a lot of work to set up and configure. Worse yet it may only be
appropriate for read-only type operations because managing transaction ACID (Atomicity, Consistency, Isolation, Durability)
properties becomes a real nightmare in the face of update operations.
As cschneid pointed out, you could try "rolling your own" database management system using
open source; but that too would probably lead to more problems than it solves.
I think your observation about "pushing a big rock uphill" sums it up.

How to limit the effect of client modifications to production systems

Our shop has developed a few WEB/SMS/DB solutions for a dozen client installations. The applications have some real-time performance requirements, and are just good enough to function properly. The problem is that the clients (owners of the production servers) are using the same server/database for customizations that are causing problems with the performance of the applications that we created and deployed.
A few examples of clients' customizations:
Adding large tables with many text datatypes for the columns that get cast to other data types in the queries
No primary keys, indexes, or FK constraints
Use of external scripts that use count(*) from table where id = x, in a loop from the script, to determine how to construct more queries later in the same script. (no bulk actions that the planner can optimize or just do everything in a single pass)
All new code files on the server are created/owned by root, with 0777 permissions
The clients don't take suggestions/criticism well. If we just go ahead and try to port/change the scripts ourselves, the old code can come back, clobbering any changes that we make! Or, with our limited knowledge of their use cases, we break functionality while trying to optimize their changes.
My question is this: how can we limit the resources available to queries/applications other than the ones we create and deploy? Are there any pragmatic options in scenarios like this? We prided ourselves on having an OSS solution, but it seems that it's become a liability.
We use PG 8.3 running on a range of Linux distros. The clients prefer php, but shell scripts, perl, python, and plpgsql are all used on the system in one form or another.
This problem started about two minutes after the first client was given full access to the first computer, and it hasn't gone away since. Anytime someone's priority is getting business-oriented work done quickly, they will be sloppy about it and screw things up for everyone. That's just how things work, because proper design and implementation are harder than cheap hacks. You're not going to solve this problem; all you can do is figure out how to make it easier for the client to work with you than against you. If you do it right, it will look like excellent service rather than nagging.
First off, the database side. There's no way to control query resources in PostgreSQL. The main difficulty is that tools like "nice" control CPU usage, but if the database doesn't fit in RAM it may very well be I/O usage that is killing you. See this developer message summarizing the issues here.
Now, if in fact it's CPU the clients are burning through, you can use two techniques to improve that situation:
Install a C function that changes the process priority (example 1, example 2) and make sure whenever they run something it gets called first (maybe put it into their psql config file, there are other ways).
Write a script that looks for postmaster processes spawned by their userid and renice them, make it run often in cron or as a daemon.
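The second technique could be sketched roughly as follows in Python. It assumes a Linux `ps`, that it runs as root or as the OS user that owns the PostgreSQL processes, and that backend process titles look like "postgres: dbuser dbname host activity" (which depends on the server's process-title settings). The function names and the niceness value are illustrative only.

```python
import os
import re
import subprocess
from typing import List

# Matches a backend title such as "postgres: alice salesdb 10.0.0.5(4242) idle".
# Note: helper processes ("postgres: writer process") share the prefix, so this
# only works if no database role is named like one of them.
_BACKEND_TITLE = re.compile(r"^postgres:\s+(\S+)\s+\S+")


def backend_pids_for(ps_output: str, db_user: str) -> List[int]:
    """Return PIDs whose process title names db_user as the connected role."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # header row or malformed line
        match = _BACKEND_TITLE.match(parts[1])
        if match and match.group(1) == db_user:
            pids.append(int(parts[0]))
    return pids


def renice_backends(db_user: str, niceness: int = 10) -> None:
    """Lower the CPU priority of every backend currently serving db_user."""
    ps_output = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    ).stdout
    for pid in backend_pids_for(ps_output, db_user):
        try:
            os.setpriority(os.PRIO_PROCESS, pid, niceness)
        except ProcessLookupError:
            pass  # backend exited between listing and renicing
```

Calling renice_backends("client_role") from cron every minute or so would approximate the daemon described above. As noted, this only helps when CPU rather than I/O is the bottleneck.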
It sounds like your problem isn't the particular query processes they're running, but rather other modifications they're making to the larger structure. There's only one way to cope with that: you have to treat the client like they're an intruder and use the approaches of that portion of the computer security field to detect when they screw things up. Seriously! Install an intrusion detection system like Tripwire on the server (there are better tools, that's just the classic example), and have it alert you when they touch anything. New file that's 0777? Should jump right out of a proper IDS report.
On the database side, you can't directly detect the database being modified usefully. You should do a pg_dump of the schema every day into a file (pg_dumpall -g and pg_dump -s), then diff that against the last one you delivered and again alert you when it's changed. If you manage this well, the contact with the client turns into "we noticed you changed something on the server...what is it you're trying to accomplish with that?" which makes you look like you're really paying attention to them. That can turn into a sales opportunity, and they may stop fiddling with things as much just knowing you're going to catch it immediately.
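A minimal sketch of the daily schema check, in Python. It assumes pg_dump is on the PATH and can connect to the database without prompting. The function names are made up for illustration, and how the baseline is stored and how the alert gets sent are left out.

```python
import difflib
import subprocess
from typing import List


def dump_schema(dbname: str) -> str:
    """Schema-only dump via "pg_dump -s"; raises if pg_dump fails."""
    return subprocess.run(
        ["pg_dump", "-s", dbname], capture_output=True, text=True, check=True
    ).stdout


def schema_changes(baseline: str, current: str) -> List[str]:
    """Return unified-diff lines; an empty list means nothing changed."""
    return list(
        difflib.unified_diff(
            baseline.splitlines(),
            current.splitlines(),
            fromfile="delivered",
            tofile="today",
            lineterm="",
        )
    )
```

A cron job could read the last delivered dump from a file, call schema_changes(delivered, dump_schema("clientdb")), and mail the result whenever the list is non-empty. The same comparison works for the output of pg_dumpall -g.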
The other thing you should start doing immediately is install as much version control software as you can on each client box. You should be able to login to each system, run the appropriate status/diff tool for the install, and see what's changed. Get that mailed to you regularly too. Again, this works best if combined with something that dumps the schema as a component to what it manages. Not enough people use serious version control approaches on the code that lives in the database.
That's the main set of technical approaches useful here. The rest of what you've got is a classic consulting client management problem that's far more of a people problem than a computer one. Cheer up, it could be worse--FSM help you if you give them ODBC access and they discover they can write their own queries in Access or something simple like that.

What technical considerations must a system/network administrator worry about when a site gets onto social bookmarking/sharing sites?

The reason I ask is that Stack Overflow has been Slashdotted, and Redditted.
First, what kinds of effect does this have on the servers that power a website? Second, what can be done by system administrators to ensure that their sites remain up and running as best as possible?
Unfortunately, if you haven't planned for this before it happens, it's probably too late and your users will have a poor experience.
Scalability is your first immediate concern. You may start getting more hits per second than you were getting per month. Your first line of defense is good programming and design. Make sure you're not doing anything stupid like reloading data from a database multiple times per request instead of caching it. Before the spike happens, you need to do some fairly realistic load tests to see where the bottlenecks are.
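As a small illustration of the "load once per request" point, here is a Python sketch. The loader is a hypothetical stand-in for a real database query, and the cache is assumed to live only as long as a single request.

```python
from typing import Any, Callable, Dict, List


class RequestCache:
    """Memoizes loader results for the lifetime of one request."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader
        self._values: Dict[str, Any] = {}

    def get(self, key: str) -> Any:
        if key not in self._values:
            # Only the first lookup for a key reaches the database.
            self._values[key] = self._loader(key)
        return self._values[key]


calls: List[str] = []


def load_from_db(key: str) -> str:
    """Hypothetical stand-in for a real query; records each call."""
    calls.append(key)
    return f"row for {key}"


cache = RequestCache(load_from_db)
for _ in range(3):
    cache.get("site_settings")
print(len(calls))  # 1: the loader ran once despite three lookups
```

Caching across requests (in memory or in something like memcached) goes further, but it also raises invalidation questions that a per-request cache avoids.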
For absurdly high traffic, consider the ability to switch some dynamic pages over to static pages.
Having a server architecture that can scale also helps. Shared hosts generally don't scale. A single dedicated machine generally doesn't scale. Using something like Amazon's EC2 to host can help, especially if you plan for a cluster of servers from the beginning (even if your cluster is a single computer).
Your next major concern is security. You're suddenly a much bigger target for the bad guys. Make sure you have a good security plan in place. This is something you should always have, but it becomes more important with high usage.
Firstly, ask if you really want to spend weeks and thousands of $ on planning for something that might not even happen, and if it does happen, lasts about 5 hours.
Easiest solution is to have a good way to switch to a page simply allowing a signup. People will sign up and you can email them when the storm has passed.
More elaborate solutions rely on being able to scale quickly. That's firstly a software issue (can you connect to a db on another server, can you do load balancing). Secondly, your hosting solution needs to support fast expansion. Amazon EC2 comes to mind, or maybe slicehost. With both services you can easily start new instances ("Let's move the database to a different server") and expand your instances ("Let's upgrade the db server to 4GB RAM").
If you keep all data in the db (including sessions), you can easily have multiple front-end servers. For the database I'd usually try a single server with the highest resources available, but only because I haven't worked with db replication and it used to be quite hard to do, at least with mysql. Things might have improved.
The app designer needs to think about scaling up (larger machines with more cores and higher performance) and/or scaling out (distributing workload across multiple systems). The IT guy needs to work out how to best support that. The network is what you look at first, because obviously everything rides on top of it. Starting at the border, that usually means network load balancers and redundant routers being served by multiple providers. You can also look at geographic caching services and apps such as cachefly.
You want to reduce your bottlenecks as much as possible. You also want to design the environment such that it can be scaled out as needed without much work. Do the design work up front and it'll mean fewer headaches when you do get dugg.
Some ideas (of what I used in the past and current projects):
For boosting performance (if needed) you can put a reverse-proxying, caching squid in front of your server. Of course that only works if you don't have session keys and if the pages are somewhat static (means: they change only once an hour or so) and not personalised.
With the squid you can boost a bloated and slow CMS like typo3, thus having the performance of static websites with the comfort of a CMS.
You can outsource large files to external services like Amazon S3, saving your server's bandwidth.
And if you are able to spend some (three-figures per month) bucks, you can as well use a Content Delivery Network. With that in place you automatically have scaling, high availability and low latencies for your users. Of course, your pages must be cacheable, so session keys and personalised pages are a no-no. If designed carefully and with CDNs in mind, you can at least cache SOME content, like pics and videos and static stuff.
The load goes up, as other answers have mentioned.
You'll also get an influx of new users/blog comments/votes from bored folks who are only really interested in vandalism. This is mostly a problem for blogs which allow completely anonymous commenting, where some dreadful stuff will be entered. The blog platform might have spam filters sufficient to block it, but manual intervention is frequently required to clean up remaining drivel.
Even a little barrier to entry, like requiring a user name or email address even if no verification is done, will dramatically reduce the volume of the vandalism.