How does an encrypted AWS PostgreSQL RDS instance work under the hood? - postgresql

I read that when creating an encrypted AWS Postgres RDS, it encrypts the underlying EBS volume created for it, all read replicas, backups, and snapshots as well.
Also, when I queried and inserted data into the DB, it worked the same as an unencrypted DB would and returned results in plain text.
I have some questions regarding how exactly it is working under the hood.
Here they are:
How does search work?
A simple value-based search could be performed by encrypting the search term and then searching for it on the encrypted RDS. But my search in PostgreSQL also worked on nested JSONB objects. How is that achieved?
How does a partial search work?
I was able to do a partial search (a LIKE query) on a name and address inside a JSONB object. How can a partial search be done on an encrypted DB?
How does insertion work?
I have some JSONB columns in my PostgreSQL database and I was able to do partial inserts on my JSONB objects. It's a cool feature of PostgreSQL, but how is it achieved when the whole JSONB is encrypted?
PS: I have some knowledge of how a DB works under the hood for storing and retrieving data, but I can't get my head around how it works if everything is encrypted. Pardon me if I am completely wrong on some concepts.
I would really appreciate it if someone could shed light on this, as I was not able to find it on the internet.
Thanks

You are overthinking this. From the perspective of the PostgreSQL process running on the RDS server, the DB files on the EBS volume do not appear to be encrypted. The encryption/decryption happens at the hypervisor layer and is transparent to any software running on the VM. Everything you are asking about works exactly as it would on an unencrypted EBS volume, because when the DB service requests data from disk it receives unencrypted data.
Think of it like going into the BIOS on your laptop and enabling some sort of encryption of your disk volumes. Would you expect that to break every database engine you tried to run on your computer? Would you expect all software to somehow know how to deal with your laptop's BIOS disk encryption? When you do that, the software you run on your computer (like PostgreSQL) doesn't suddenly need to know how to decrypt data on your disk. The encryption is transparent to that software, so the disk appears unencrypted to it.
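To make the layering concrete, here is a minimal sketch of the idea in Python. It is not how EBS or RDS is implemented: `EncryptedDisk`, `TinyDatabase`, and the hash-based XOR keystream are hypothetical stand-ins (and the cipher is a toy, not real cryptography). The point is only that the "database" layer reads and writes plaintext and can do a partial match on a nested field, while the bytes at rest are ciphertext.

```python
import hashlib
import json


class EncryptedDisk:
    """Toy stand-in for an encrypted volume: ciphertext at rest, plaintext to callers."""

    def __init__(self, key):
        self._key = key
        self._blocks = {}  # block number -> ciphertext bytes

    def _keystream(self, block_no, length):
        # Toy keystream derived from the key and block number (illustration only).
        out = b""
        counter = 0
        while len(out) < length:
            seed = self._key + block_no.to_bytes(8, "big") + counter.to_bytes(8, "big")
            out += hashlib.sha256(seed).digest()
            counter += 1
        return out[:length]

    def write(self, block_no, plaintext):
        stream = self._keystream(block_no, len(plaintext))
        self._blocks[block_no] = bytes(a ^ b for a, b in zip(plaintext, stream))

    def read(self, block_no):
        ciphertext = self._blocks[block_no]
        stream = self._keystream(block_no, len(ciphertext))
        return bytes(a ^ b for a, b in zip(ciphertext, stream))

    def raw(self, block_no):
        # What someone would see if they got hold of the physical storage.
        return self._blocks[block_no]


class TinyDatabase:
    """Toy 'database engine' that knows nothing about encryption."""

    def __init__(self, disk):
        self._disk = disk
        self._next_block = 0

    def insert(self, doc):
        self._disk.write(self._next_block, json.dumps(doc).encode("utf-8"))
        self._next_block += 1

    def search_name_like(self, fragment):
        # Partial match on a nested field, done on the plaintext the disk hands back.
        matches = []
        for block_no in range(self._next_block):
            doc = json.loads(self._disk.read(block_no))
            if fragment in doc["profile"]["name"]:
                matches.append(doc)
        return matches


disk = EncryptedDisk(key=b"example-key")
db = TinyDatabase(disk)
record = {"id": 1, "profile": {"name": "Alice Smith", "address": "1 Main St"}}
db.insert(record)
found = db.search_name_like("lice")
```

The search code never touches a key or a cipher; all of that lives one layer below it, which is the same separation the answer describes between PostgreSQL and the encrypted volume.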

Related

Moodle asynchronous replication

I'm looking to deploy Moodle in the cloud; however, I have some 50-odd sites that require access to this Moodle instance, possibly even while temporarily offline. So I'm looking into replicating Moodle down onto each site. From what I understand, there are two data stores that require replication: moodledata and the database (PostgreSQL in our case). If I'm not mistaken, moodledata contains the multimedia data, and the database contains, among other things, all the user records. Luckily the multimedia data will be centralized and is thus synced only one way, down to the nodes, which seems doable. Where I'm stuck is how to handle the Postgres database, where the sync will need to be bidirectional.

Data at rest encryption for remote unattended Ubuntu & PostgreSQL machine

I'm looking for a data at rest solution for our setup.
Our application runs on our clients' machines, set up by their IT guys (however, they don't possess any credentials), and located on-premise. We log in via SSH. The machines are meant to stay up. We're storing sensitive information and would need to encrypt it to meet data-at-rest requirements. We're using Ubuntu 18.04+ and PostgreSQL.
I've looked into different solutions and gathered some information from several related previously asked questions:
Full disk encryption - since their IT is not really available to us, going in that direction might be problematic, as it would require performing more steps on their side. Also, if (when) the server ever gets rebooted, we would need to log in via SSH to enter the passphrase or use some kind of network-bound encryption, which again requires additional set up and the additional resources might not even be available to us.
File-based encryption - use something like eCryptfs and store the PostgreSQL data directory in an encrypted file system. This is currently the only solution I found that solves most of the issues; however, there might be other directories that would need encryption, and I'm not certain they can be encrypted with that method (like /tmp). Once rebooted, the file system wouldn't be mounted automatically, and we would need to mount it manually. I don't see how we can solve this without, again, network-bound encryption. eCryptfs also lets the user enter whatever configuration and passphrase they want on every mount, even if they don't match the previously used settings, which I think makes file corruption likely. Writing a program that intercepts the mounting and validates the passphrase might be a possible solution to this. Handling problems like hanging processes when the FS isn't mounted is also okay for now, but overall this solution doesn't scale nicely.
Column-based encryption, client-side encryption, etc - doesn't work for our setup. We want to be able to query the data over SSH. The client is stored on the same machine with the data. Using something like PGP keys would mean the data is effectively unencrypted.
We don't use cloud services of sorts.
Maybe we need a different setup, or there are other solutions I'm not aware of. I'm really new to this subject and to the Stack Overflow community. The solutions I've found on the internet seem sparse and dated, and I'm not sure they're still relevant.

How do I move data from RDS of one AWS account to another account

We set up our web services and database on AWS a while back, and the application is now in production. For some reason, we need to terminate the old AWS account and move everything under a newly created AWS account. The application and all the infrastructure are pretty straightforward. The data is trickier, though. The current database is still receiving lots of data on a daily basis, so it is best to migrate the data after we turn off the old application and switch on the new platform.
Both source RDS and target RDS are Postgres. We have about 40GB data to transfer. There are three approaches I could think of and they all have drawbacks.
1. Take a snapshot of the first RDS and restore it in the second one. The problem is that I don't need to transfer all the data from source to destination. Probably just the records after 10/01 are enough. Also, a snapshot works best when restored into an empty RDS that was just created. In our case, the new RDS will already be receiving data after the cutoff. Only after that will the data be transferred from the old account to the new account; otherwise we will lose data.
2. Dump data from tables in the old RDS and restore it in the new RDS. This has the same problem as #1. Also, if I dump data to a local machine and then restore from there, the network speed is the bottleneck.
3. Export table data to CSV files and import them into the new RDS. The advantage is that this method allows picking and choosing, and some data cleaning as well. But it takes forever to export a big fact table to a local CSV file. Another problem is that some of the tables have surrogate row IDs which are serial (auto-incremented). The row IDs in the exported CSV may conflict with existing data in the new RDS tables.
I wonder if there is a better way to do it. Maybe AWS has some ETL tool that does a direct point-to-point transfer without using a local computer as the middle point.
In 2022 the simplest way to achieve this task is using AWS Database Migration Services (AWS DMS).
You can create a migration task, and set the original database as the source endpoint, and the new database as a destination endpoint.
Next create a task with "Full load, ongoing replication" settings.
More details here: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html
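As a rough sketch of what that task definition looks like when created programmatically rather than in the console, the snippet below only builds the parameters for a "full load plus ongoing replication" task. The helper name `build_dms_task_params`, the task identifier, and all ARNs are placeholders invented for illustration; the `public` schema is an assumption about the database layout. The dictionary keys are intended to match the keyword arguments of boto3's `create_replication_task` for the DMS client, but check them against the current AWS documentation before relying on them.

```python
import json


def build_dms_task_params(source_endpoint_arn, target_endpoint_arn,
                          replication_instance_arn, schema="public"):
    """Build keyword arguments for a DMS full-load-and-CDC replication task."""
    # Selection rule: include every table in the given schema.
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-all-tables",
                "object-locator": {"schema-name": schema, "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }
    return {
        "ReplicationTaskIdentifier": "old-to-new-account",  # hypothetical name
        "SourceEndpointArn": source_endpoint_arn,
        "TargetEndpointArn": target_endpoint_arn,
        "ReplicationInstanceArn": replication_instance_arn,
        # "Full load, ongoing replication" in the console.
        "MigrationType": "full-load-and-cdc",
        "TableMappings": json.dumps(table_mappings),
    }


params = build_dms_task_params(
    "arn:aws:dms:placeholder:source",
    "arn:aws:dms:placeholder:target",
    "arn:aws:dms:placeholder:instance",
)
# With boto3 installed and credentials configured, the task would be created
# roughly like this (not executed here):
#   import boto3
#   boto3.client("dms").create_replication_task(**params)
```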
I recently moved RDS data from one account to another using Bucardo (https://bucardo.org/). Please refer to the following posts:
https://www.compose.com/articles/using-bucardo-5-3-to-migrate-a-live-postgresql-database/
https://bucardo.org/pipermail/bucardo-general/2017-February/002875.html
Although these do not specifically cover migration between two RDS accounts, they could help with setting things up. We still need an intermediate point, such as an EC2 instance, on which to configure Bucardo and migrate the data between accounts. If you are looking for more information, I am happy to help.
In short, we need to take a manual snapshot of the source DB and restore it in the other account (https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ShareSnapshot.html). With Bucardo set up on the EC2 instance, we can then start syncing the data using triggers, which update the destination DB as new data comes into the source DB.

Store files on disk or MongoDB

I am creating a mongodb/nodejs blogging system (similar to wordpress).
I currently have the images saved on disk and a pointer placed in Mongo. Since I already store all sessions in MongoDB to enable easy load balancing across servers, I was wondering whether storing the actual files in Mongo would also be a smart idea for easy multi-server setups and/or performance gains.
If everything is stored in a DB, you can simply spawn more web servers and/or Mongo replicas to scale horizontally.
Opinions?
MongoDB is a good option for storing your files (I'm talking about GridFS), especially for the use case you described above.
When you store files in MongoDB (GridFS, not documents), you get all the replication and sharding capability for free, which is awesome.
If you have to spawn a new server and you already have the files in MongoDB, all you have to do is enable replication (and thus scale horizontally). I'm sure this can save you a lot of headaches.
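For readers wondering why GridFS is distinguished from storing a file as a single document: GridFS splits each file into chunks (255 KiB by default) and stores one metadata document plus one document per chunk, linked by `files_id` and ordered by `n`. The sketch below imitates that layout with plain Python dictionaries, without a MongoDB server. The function names are hypothetical, and real GridFS metadata documents carry more fields than shown here.

```python
CHUNK_SIZE = 255 * 1024  # GridFS default chunk size in bytes


def split_into_gridfs_style_docs(file_id, filename, data, chunk_size=CHUNK_SIZE):
    """Return (files_doc, chunk_docs) shaped like the two GridFS collections."""
    files_doc = {
        "_id": file_id,
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
    }
    chunk_docs = [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]
    return files_doc, chunk_docs


def reassemble(chunk_docs):
    """Rebuild the file by concatenating chunks in order of their sequence number."""
    ordered = sorted(chunk_docs, key=lambda chunk: chunk["n"])
    return b"".join(chunk["data"] for chunk in ordered)


image_bytes = b"x" * 600_000  # stand-in for an uploaded image
meta, chunks = split_into_gridfs_style_docs("img-1", "photo.jpg", image_bytes)
restored = reassemble(chunks)
```

Because each chunk is an ordinary document, the usual replication and sharding machinery applies to file data in the same way it applies to the rest of the database.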
Resources:
Is GridFS fast and reliable enough for production?
http://www.mongodb.org/display/DOCS/GridFS
http://www.coffeepowered.net/2010/02/17/serving-files-out-of-gridfs/
Aside from GridFS, you might be considering a cloud-based deployment. In that case, you might consider storing files in cloud-specific storage (Windows Azure has Blob Storage, for example). Sticking with Windows Azure for this example (since that's what I work with), you'd reference a file by its storage account URI. For example:
https://mystorageacct.blob.core.windows.net/mycontainer/myvideo.wmv
Since you'd be storing the MongoDB database itself in its own blob (mounted as a disk volume on your Linux or Windows VM), you could then choose to store your files in either the same storage account or a completely different storage account (with each storage account providing 200TB of storage).
Storing the image as a document in MongoDB would be a bad idea, as resources that could have been used to send a large amount of informational data would instead be used for sending files.
Have a look at MongoDB's file storage, GridFS; that might solve your problem of storing images while providing horizontal scalability as well.
http://www.mongodb.org/display/DOCS/GridFS
http://www.mongodb.org/display/DOCS/GridFS+Specification

How to link MemCached server together?

I'm looking into using MemCached for a web application I am developing and after researching MemCached over the past few days, I have come across a question I could not find the answer to.
How do you link Memcached servers together, or how do you replicate data between Memcached servers?
Additionally: Is this functionality controlled by the servers or the clients and how?
when you configure several servers, the client libraries hash each key to pick the one server that stores that key/data pair. that means that there's no replication, and also that every client has to use the same set of servers.
pros:
almost zero overhead, storage and bandwidth grow linearly.
server code is kept simple and reliable.
cons:
any change in the set of servers (one goes down, or you add a new one) suddenly invalidates (almost) the whole cache.
you have to be sure to use the same algorithm on every client.
if you have control over the client code, you can simply store each key/data pair twice, on two servers. just be sure to look in the same places when reading from a different client.
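A minimal sketch of the client-side selection described above, assuming a plain "hash the key, take it modulo the number of servers" scheme. Real client libraries use their own hash functions, and some use consistent hashing instead, so treat `pick_server`, `pick_replicas`, and the server names as illustrative only. It also shows why changing the server list invalidates most of the cache under this scheme.

```python
import hashlib


def pick_server(key, servers):
    """Deterministically map a key to one server from the list."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def pick_replicas(key, servers, copies=2):
    """Primary server plus the next ones in the list, for storing a key more than once."""
    start = servers.index(pick_server(key, servers))
    count = min(copies, len(servers))
    return [servers[(start + i) % len(servers)] for i in range(count)]


def fraction_remapped(keys, old_servers, new_servers):
    """Share of keys that land on a different server after the list changes."""
    moved = sum(
        1 for key in keys
        if pick_server(key, old_servers) != pick_server(key, new_servers)
    )
    return moved / len(keys)


three = ["cache1:11211", "cache2:11211", "cache3:11211"]
four = three + ["cache4:11211"]
keys = ["key%d" % i for i in range(2000)]
moved = fraction_remapped(keys, three, four)
replicas = pick_replicas("user:42", three)
```

Every client that uses the same function and the same ordered server list computes the same placement, which is why no server-to-server coordination is needed. With modulo hashing, going from three to four servers is expected to move roughly three quarters of the keys.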
I've used BeITMemcached and in that you create an instance of MemcacheClient and set the servers you want to use, just as strings.
At that point the client itself determines which of its available servers to put each item on. You never know which server an item will be on.
Check here to see how the servers handle failover.
The easiest thing is to have a repopulate mechanism. In my case, I store several hundred objects in memcache which come out of a database. I can just call repopulate and put them all back in there. Whenever I add, update or delete them in the database, I make those same calls to memcache.
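The pattern described here can be sketched as follows. Plain dictionaries stand in for both the database and the memcache client, and the class and method names are hypothetical; with a real client the `dict` operations would become the client's set, get, and delete calls.

```python
class WriteThroughStore:
    """Keeps a cache in step with a database and can rebuild the cache on demand."""

    def __init__(self, database, cache):
        self._db = database    # dict standing in for the database
        self._cache = cache    # dict standing in for memcache

    def save(self, key, value):
        # Every add or update goes to the database and to the cache.
        self._db[key] = value
        self._cache[key] = value

    def delete(self, key):
        self._db.pop(key, None)
        self._cache.pop(key, None)

    def get(self, key):
        # Serve from the cache, falling back to the database on a miss.
        if key in self._cache:
            return self._cache[key]
        value = self._db.get(key)
        if value is not None:
            self._cache[key] = value
        return value

    def repopulate(self):
        # After a cache server restart, reload everything from the database.
        self._cache.clear()
        self._cache.update(self._db)


database, cache = {}, {}
store = WriteThroughStore(database, cache)
store.save("product:1", {"name": "widget"})
cache.clear()          # simulate the cache server losing its contents
store.repopulate()
```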
http://repcached.lab.klab.org/
Also, the PHP PECL memcache client can replicate data to multiple servers, see memcache.redundancy.
It sounds like you wish to have caches that can cope with machines rebooting, etc. If so…
In a lot of cases (assuming you are not writing Facebook) an RDBMS is fast enough for caching. Just create a table that has a key column and a blob column. If the RDBMS server has enough RAM, all the data will be served from RAM and only written to disk to allow recovery.
Remember this could be a separate server (or servers) from your main database server.
If you wish to get fancier and are using a high-end RDBMS, you may be able to set up change notifications on the queries that are used to build the “cached data”, so that out-of-date rows are deleted from the cache.
Sometimes you can set up triggers to clear invalid rows from the cache; however, this can get very complex very quickly.
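A minimal sketch of the key/blob cache table, using Python's built-in `sqlite3` module as a stand-in for the RDBMS. The table name, column names, and helper functions are made up for illustration; on a server database the upsert statement would use that database's own syntax.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a cache database with a single key/blob table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache ("
        " key TEXT PRIMARY KEY,"
        " value BLOB NOT NULL)"
    )
    return conn


def cache_set(conn, key, value):
    # Insert the row, or overwrite it if the key already exists.
    conn.execute(
        "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)", (key, value)
    )
    conn.commit()


def cache_get(conn, key):
    row = conn.execute("SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
    return None if row is None else row[0]


def cache_delete(conn, key):
    conn.execute("DELETE FROM cache WHERE key = ?", (key,))
    conn.commit()


conn = open_cache()
cache_set(conn, "page:/home", b"<html>cached page</html>")
cache_set(conn, "page:/home", b"<html>newer page</html>")
cached = cache_get(conn, "page:/home")
```

Pointing `open_cache` at a file path instead of `:memory:` gives a cache that survives a restart, which is the property this answer is after.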
Memcached does not provide replication. Instead, you add the server to the memcached client's server list and then hit the DB for the data to be stored on that particular server.
You should seriously consider CouchBase. It uses the memcached protocol, provides nearly the same speed, and delivers the automatic replication you're looking for. It also persists to disk so your cache will never be cold.