PostgreSQL availability and merges - postgresql

Is there a PostgreSQL HA solution that can handle a splitbrain situation gracefully. To elaborate, the system i'm working on is expected to run in several areas with users close to the servers there and connectivity between the zones is known to be questionable. I'd like for the users to be able to continue using the system in a degraded state (without updates from disconnected zones) and for a sensible merge once they come back online.

If you're prepared to live with a time delay, there should be some log shipping solutions that you could implement with a scheduled job. Basically, you send slices of the transaction log to the backup server. Here's some links with a better description:
http://developer.postgresql.org/pgdocs/postgres/warm-standby.html
http://developer.postgresql.org/~wieck/slony1/adminguide-1.1.rc1/logshipping.html
http://www.network-theory.co.uk/docs/postgresql/vol3/RecordbasedLogShipping.html
Note that a full implementation of Slony-I may be clunky (at least I found it that way a couple years ago, it may have drastically improved).

Related

Sync two offline masters when network available

I have a use case where I need to set up two physical stations at a venue. Each station will be running a couple of app servers and a mongodb server.
I can't rely on the venue's internet access so I need my app to be able to work offline and "sync" the dbs every once in a while.
I initially thought about having two masters that would somehow sync with a remote one but TIL that master-master replication is not possible with mongodb.
I've read about the active-active approach, however, that won't let me write to a different shard when offline.
I'm running out of ideas, any recommendation would be greatly appreciated.
------ Update on what I'm trying to achieve:
I'm working with a venue that has two entrances. The idea is to be able to capture some information from people attending the events (name, email, etc). After getting registered we will print a name tag with some of the info.
Everything sounds pretty easy, however, if possible, I would like to not rely on the venue's network (internet). So that's where I started struggling figuring out whats the best approach. I guess what I want is being able to have a remote mongo but if the network goes down somehow keep saving records locally and send them to the remote mongo instance when network is available again.
Extra considerations:
- Events last a couple of days, some people lose their name tag overnight, they should be able to go to either of the entrances and get it reprinted. So we should be able to find their info even if they registered in entrance A but they are asking for a reprint in entrance B.
More questions:
- Am I overthinking it? Maybe venue's network + a 4G/LTE modem as a backup should be enough? I would prefer not relying on it tho.
I believe you're overthinking things. Here's what I would do if faced with a similar situation:
From the description, it doesn't sound like the two sites need to be connected in real time at all. I would create a server on Entry A, another in Entry B, and consolidate their data each day after the day ended if required. This is because:
It's unlikely that one person will register in both sites within a single day. If they lost their tag on that day, I'll just tell them to go back to where they registered earlier and get it reprinted there. Worst case, you'll create a duplicate entry (should be obvious which is the duplicate since no one would lose their tag within seconds) but I would not anticipate hundreds of people all lost their tags within a day.
If the attendee lost their tag overnight, both servers will have synced data and should be able to reprint.
If you're concerned about the venue's Wifi access, just run cables from the server to the printing stations.
Personally, I would argue that the overnight sync is not really needed at all (see the likelihood of people registering twice). I would just collect the data from both servers after the event ended. That is, unless you have specific needs for the combined data from both entries during the 2nd day.
Note: please make sure you're running a minimum of 3-node replica set. Running a standalone instance for prod environment is not recommended. Hardware/disk corruption is a common event.

Google Cloud SQL very slow from time to time

It's been almost 3 months I have switched my platform to Google Cloud (Compute Engine + Cloud SQL + Cloud Storage).
I am very happy with it but from time to time I noticed big latency on the Cloud SQL server. My VMs from Compute Engine and my Cloud SQL instance are all on the same location (us-1) datacenter.
Since my Java backend makes a lot of SQL queries to generate a server response, the response times may vary from 250-300ms (normal) up to 2s!
In the console, I notice absolutely nothing: no CPU peaks, no read/write peaks, no backup running, nothing. No alert. Last time it happened, it lasted for a few days and then the response times went suddenly better than ever.
I am pretty sure Google works on the infrastructure behind the scenes... But no way to point that out.
So here's my questions:
Has anybody else ever had noticed the same kind of problem?
It is really annoying for me because my web pages get very slow and I have absolutely no control over it. Plus I loose a lot of time because I generally never first suspect a hardware problem / maintenance but instead something that we introduced in our app. Is it normal or do I have a problem on my SQL instance?
Is there anywhere I can have visibility over what's Google doing on the hardware? I know there are maintenance alerts, but for my zone it seems always empty when it happen.
The only option I have for now is to wait and that is really not acceptable.
I suspect that Google does some sort of IO throttling and their algorithm is not very sophisticated. We have a build server which slows down to a crawl if we do more than two builds within an hour. The build that normally takes 15 minutes will run for more than an hour and we usually terminate it and re-run manually later. This question describes a similar problem and the recommended solution is to use larger volumes as they come with more IO allowance.

How to handle large amounts of scheduled tasks on a web server?

I'm developing a website (using a LAMP stack) which must handle many user-made scheduling tasks. It works as following: an user creates an event and sets a date, and others users (as many as 63) may join. A few hours before the set date, the system must email each user subscribed to that event. And that's it.
However, I have never handled scheduling, and the only tools I know (poorly) are cron and at. My plan is to create an at job for each event, which will call a script that gets all subscribers emails and mails them.
My question is: is my plan/design good? Is it scalable? Are there better options that I should be aware of?
Why a separate cron job for each event? I've done something similar thing for a newsletter with a cron job just running once per hour and if there are any newsletters to be sent it just handles them. In your case you'd have a script that runs once every hour and gets a list of users for events that happen in the desired time interval since.
It will work. As far as scalability, at the minimum make sure that the script runs in it's own process so it doesn't bog down the server unnecessarily.
Create a php-cli script perhaps?
I'm doing most of my work in Rails nowadays, and there's a wealth of background processing libraries one of them is Resque it uses the redis server to keep track of the jobs
I found a PHP clone https://github.com/chrisboulton/php-resque
Might be overkill for your use case, but give it a shot perhaps
If you would consider a proper framework that uses an application server (and not a simple webserver), Spring has a task scheduling layer that's simple to use. Scheduling jobs on the server really requires more than what a simple LAMP install can do, but I haven't used PHP in a while so maybe there's an equivalent.
Here's an article that compares some of your options.

How to limit the effect of client modifications to production systems

Our shop has developed a few WEB/SMS/DB solution for a dozen client installations. The applications have some real-time performance requirements, and are just good enough to function properly. The problem is that the clients (owners of the production servers) are using the same server/database for customizations that are causing problems with the performance of the applications that we created and deployed.
A few examples of clients' customizations:
Adding large tables with many text datatypes for the columns that get cast to other data types in the queries
No primary keys, indexes, or FK constraints
Use of external scripts that use count(*) from table where id = x, in a loop from the script, to determine how to construct more queries later in the same script. (no bulk actions that the planner can optimize or just do everything in a single pass)
All new code files on the server are created/owned by root, with 0777 permissions
The clients don't take suggestions/criticism well. If we just go ahead and try to port/change the scripts ourselves, the old code can come back, clobbering any changes that we make! Or with out limited knowledge of their use cases, we break functionality while trying to optimize their changes.
My question is this: how can we limit the resources to queries/applications other that what we create and deploy? Are there any pragmatic options in scenarios like this? We prided ourselves in having an OSS solution, but it seems that it's become a liability.
We use PG 8.3 running on a range on Linux Distos. The clients prefer php, but shell scripts, perl, python, and plpgsql are all used on the system in one form or another.
This problem started about two minutes after the first client was given full access to the first computer, and it hasn't gone away since. Anytime someone whose priorities are getting business oriented work done quickly they will be sloppy about it and screw up things for everyone. That's just how things work, because proper design and implementation are harder than cheap hacks. You're not going to solve this problem, all you can do is figure out how to make it easier for the client to work with you than against you. If you do it right, it will look like excellent service rather than nagging.
First off, the database side. There's now way to control query resources in PostgreSQL. The main difficulty is that tools like "nice" control CPU usage, but if the database doesn't fit in RAM it may very well be I/O usage that is killing you. See this developer message summarizing the issues here.
Now, if in fact it's CPU the clients are burning through, you can use two techniques to improve that situation:
Install a C function that changes the process priority (example 1, example 2) and make sure whenever they run something it gets called first (maybe put it into their psql config file, there are other ways).
Write a script that looks for postmaster processes spawned by their userid and renice them, make it run often in cron or as a daemon.
It sounds like your problem isn't the particular query processes they're running, but rather other modifications they're making to the larger structure. There's only one way to cope with that: you have to treat the client like they're an intruder and use the approaches of that portion of the computer security field to detect when they screw things up. Seriously! Install an intrusion detection system like Tripwire on the server (there are better tools, that's just the classic example), and have it alert you when they touch anything. New file that's 0777? Should jump right out of a proper IDS report.
On the database side, you can't directly detect the database being modified usefully. You should do a pg_dump of the schema every day into a file (pg_dumpall -g and pg_dump -s, then diff that against the last one you delivered and again alert you when it's changed. If you manage that this well, the contact with the client turns into "we noticed you changed on the server...what is it you're trying to accomplish with that?" which makes you look like you're really paying attention to them. That can turn into a sales opportunity, and they may stop fiddling with things as much just knowing you're going to catch it immediately.
The other thing you should start doing immediately is install as much version control software as you can on each client box. You should be able to login to each system, run the appropriate status/diff tool for the install, and see what's changed. Get that mailed to you regularly too. Again, this works best if combined with something that dumps the schema as a component to what it manages. Not enough people use serious version control approaches on the code that lives in the database.
That's the main set of technical approaches useful here. The rest of what you've got is a classic consulting client management problem that's far more of a people problem than a computer one. Cheer up, it could be worse--FSM help you if you give them ODBC access and they discover they can write their own queries in Access or something simple like that.

What technical considerations must a system/network administrator worry about when a site gets onto social bookmarking/sharing sites?

The reason I ask is that Stack Overflow has been Slashdotted, and Redditted.
First, what kinds of effect does this have on the servers that power a website? Second, what can be done by system administrators to ensure that their sites remain up and running as best as possible?
Unfortunately, if you haven't planned for this before it happens, it's probably too late and your users will have a poor experience.
Scalability is your first immediate concern. You may start getting more hits per second than you were getting per month. Your first line of defense is good programming and design. Make sure you're not doing anything stupid like reloading data from a database multiple times per request instead of caching it. Before the spike happens, you need to do some fairly realistic load tests to see where the bottlenecks are.
For absurdly high traffic, consider the ability to switch some dynamic pages over to static pages.
Having a server architecture that can scale also helps. Shared hosts generally don't scale. A single dedicated machine generally doesn't scale. Using something like Amazon's EC2 to host can help, especially if you plan for a cluster of servers from the beginning (even if your cluster is a single computer).
You're next major concern is security. You're suddenly a much bigger target for the bad guys. Make sure you have a good security plan in place. This is something you should always have, but it become more important with high usage.
Firstly, ask if you really want to spend weeks and thousands of $ on planning for something that might not even happen, and if it does happen, lasts about 5 hours.
Easiest solution is to have a good way to switch to a page simply allowing a signup. People will sign up and you can email them when the storm has passed.
More elaborate solutions rely on being able to scale quickly. That's firstly a software issue (can you connect to a db on another server, can you do load balancing). Secondly, your hosting solution needs to support fast expansion. Amazon EC2 comes to mind, or maybe slicehost. With both services you can easily start new instances ("Let's move the database to a different server") and expand your instances ("Let's upgrade the db server to 4GB RAM").
If you keep all data in the db (including sessions), you can easily have multiple front-end servers. For the database I'd usually try a single server with the highest resources available, but only because I haven't worked with db replication and it used to be quite hard to do, at least with mysql. Things might have improved.
The app designer needs to think about scaling up (larger machines with more cores and higher performance) and/or scaling out (distributing workload across multiple systems). The IT guy needs to work out how to best support that. The network is what you look at first, because obviously everything rides on top of it. Starting at the border, that usually means network load balancers and redundant routers being served by multiple providers. You can also look at geographic caching services and apps such as cachefly.
You want to reduce your bottlenecks as much as possible. You also want to design the environment such that it can be scaled out as needed without much work. Do the design work up front and it'll mean less headaches when you do get dugg.
Some ideas (of what I used in the past and current projects):
For boosting performance (if needed) you can put a reverse-proxying, caching squid in front of your server. Of course that only works if you don't have session keys and if the pages are somewhat static (means: they change only once an hour or so) and not personalised.
With the squid you can boost a bloated and slow CMS like typo3, thus having the performance of static websites with the comfort of a CMS.
You can outsource large files to external services like Amazon S3, saving your server's bandwidth.
And if you are able to spend some (three-figures per month) bucks, you can as well use a Content Delivery Network. Whith that in place you automatically have scaling, high-availability and low latencys for your users. Of course, your pages must be cachable, so session keys and personalised pages are a no-no. If designed carefully and with CDNs in mind, you can at least cache SOME content, like pics and videos and static stuff.
The load goes up, as other answers have mentioned.
You'll also get an influx of new users/blog comments/votes from bored folks who are only really interested in vandalism. This is mostly a problem for blogs which allow completely anonymous commenting, where some dreadful stuff will be entered. The blog platform might have spam filters sufficient to block it, but manual intervention is frequently required to clean up remaining drivel.
Even a little barrier to entry, like requiring a user name or email address even if no verification is done, will dramatically reduce the volume of the vandalism.