How should column-level PostgreSQL encryption be implemented in SQLAlchemy using pgcrypto? [closed]

For example, there is a repo for doing this in Django: https://sourcegraph.com/github.com/dcwatson/django-pgcrypto.
There is some discussion in the SQLAlchemy manual, but I am using non-byte columns: http://docs.sqlalchemy.org/en/rel_0_9/core/types.html
I am running Flask on Heroku using SQLAlchemy.
A code example and/or some discussion would be most appreciated.

There are a number of stages to this kind of decision making; it's not just "shove a plugin into the stack and that encryption thing is taken care of".
First, you really need to classify each column for its attractiveness to attackers and for what searches/queries need to use it, whether it's a join column or index candidate, etc. Some data needs much stronger protection than other data.
Consider who you're trying to protect against:
Casual attacker (e.g. SQL injection holes used for remote table copies)
Stolen database backup (tip: Encrypt these too)
Stolen/leaked log files, possibly including queries and parameters
Attacker with direct non-superuser SQL level access
Attacker with direct superuser SQL-level access
Attacker who gains direct access to the "postgres" OS user, so they can modify configuration, copy/edit logs, install malicious extensions, alter function definitions, etc
Attacker who gains root on the DB server
Of course, there's also the app server, upstream compromise of trusted sources for programming languages and toolkits, etc. Eventually you reach a point where you have to say "I can't realistically defend against this". You can't protect against somebody coming in, saying "I'm from the Government and I'll do x/y/z to you unless you allow me to install a rootkit on this customer's server". The point is that you've got to decide what you do have to protect against, and make your security decisions based on that.
A good compromise can be to do as much of the crypto as possible in the app, so PostgreSQL never sees the encryption/decryption keys. Use one-way hashing whenever possible, rather than using reversible encryption, and when you hash, properly salt your hashes.
That means pgcrypto doesn't actually do you much good, because you're never sending plaintext to the server, and you're not sending key material to the server either.
It also means that two people with the same plaintext for column SecretValue have totally different values for SecretValueSalt, SecretValueHashedBytes in the database. So you can't join on it, use it in a WHERE clause usefully, index it usefully, etc.
For that reason, you'll often have to compromise on security. You might store an unsalted hash of part of the datum, so you get a partial match, then fetch all the candidate rows to your application and filter them on the application side, where you have the full information required. So your storage for SecretValue now looks like SecretValueFirst10DigitsUnsaltedHash, SecretValueHashSalt, SecretValueHashBytes. But with better column names.
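To make that concrete, here is a minimal Python sketch of such a scheme, using only the standard library; the column names and the 10-character prefix are illustrative assumptions, not a recommendation:

import hashlib
import os

def derive_columns(secret_value: str) -> dict:
    # Unsalted hash of the first 10 characters: indexable and join-able,
    # at the cost of leaking prefix equality between rows.
    prefix_hash = hashlib.sha256(secret_value[:10].encode()).digest()
    # Per-row salt plus salted hash of the full value, for exact checks.
    salt = os.urandom(16)
    full_hash = hashlib.sha256(salt + secret_value.encode()).digest()
    return {"secret_prefix_hash": prefix_hash,
            "secret_hash_salt": salt,
            "secret_hash_bytes": full_hash}

def is_exact_match(row: dict, plaintext: str) -> bool:
    # Application-side filter over the superset fetched via secret_prefix_hash.
    candidate = hashlib.sha256(row["secret_hash_salt"] + plaintext.encode()).digest()
    return candidate == row["secret_hash_bytes"]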
If in doubt, just don't send plaintext of anything sensitive to the database. That means pgcrypto isn't much use to you, and you'll be doing mostly application-side crypto. The #1 reason for that is that if you send plaintext (or worse, key material) to the DB, it might get exposed in log files, pg_stat_activity, etc.
You'll pretty much always want to store encrypted data in bytea columns. If you really insist, you can hex- or base64-encode it and shove it in a text column, but developers and DBAs who have to use your system later will cry.
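Tying this back to the original SQLAlchemy question, a hedged sketch of application-side encryption into a bytea column might look like the following; it assumes the cryptography package's Fernet, and the model, column, and key handling are purely illustrative:

from cryptography.fernet import Fernet
from sqlalchemy import Column, Integer, LargeBinary
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
fernet = Fernet(Fernet.generate_key())  # illustrative; load the key from outside the DB

class Patient(Base):
    __tablename__ = "patients"
    id = Column(Integer, primary_key=True)
    # LargeBinary maps to bytea on PostgreSQL, so only ciphertext reaches the server
    ssn_encrypted = Column(LargeBinary, nullable=False)

    @property
    def ssn(self):
        return fernet.decrypt(self.ssn_encrypted).decode()

    @ssn.setter
    def ssn(self, value):
        self.ssn_encrypted = fernet.encrypt(value.encode())

With this approach the server never sees plaintext or key material, which is exactly the property argued for above; the trade-off is that ssn_encrypted cannot be usefully indexed, joined on, or searched.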


Using Postgres' external procedural languages over application code [closed]

I am trying to figure out the advantages and disadvantages of using non-plpgsql procedural languages (PL/Python, PL/Perl, PL/v8, etc.) to implement data manipulation logic on the database level instead of going up to model level/ORM of the application framework that interacts with the database (Rails, Entity Framework, Django, etc.) and implementing it there.
To give a concrete example, say, I have a table that contains Mustache templates, and I want to have them "rendered" somehow.
Table definition:
create table templates (
    id serial primary key,
    content text not null,
    data jsonb not null
);
Usually I would go to the model code and add an extra method to render the template. Example in Rails:
class Template < ApplicationRecord
  def rendered
    Mustache.render(content, data)
  end
end
However, I could also write a PL/Python function that would do just that but on the database level:
create or replace function fn_mustache(template text, data jsonb)
returns text
language plpython3u
as $$
import chevron
import json
return chevron.render(template, json.loads(data))
$$;
create view v_templates as
select id, content, data, fn_mustache(content, data) as rendered
from templates;
This yields virtually the same result functionality-wise. This example is very basic, yet the idea is to use PL/Python (or another procedural language) to manipulate the data in a more advanced manner than PL/pgSQL allows. That is, PL/pgSQL does not offer the wealth of libraries that any generic programming language provides today (in the example I am relying on an implementation of the Mustache templating system, which would not be practical to reimplement in PL/pgSQL). I obviously would not use PL/Python for any sort of networking or other OS-level features, but for operations exclusively on data this seems like a decent approach (change my mind).
Points that I can observe so far:
PL/Python is an "untrusted" language, which I guess makes it by definition more dangerous to write a function in, since you have access to syscalls; at least it feels like the cost of messing up a PL/Python function is higher than that of a mistake in the application layer, since the former is executed in the context of the database
Database approach is more extensible since I am working on the level that is the closest to the data, i.e. I am not scattering the presentation logic across multiple "tiers" (ORM and DB in this case). This means that if I need some other external service interested in interacting with the data, I can plug it directly into the database, bypassing the application layer.
Implementing this on model level just seems much simpler in execution
Supporting the application-code variant seems easier as well, since there are fewer concepts to keep in mind
What are the other advantages and disadvantages of these two approaches? (e.g. performance, maintainability)
You are wondering whether to have application logic inside the database or not. This is to a great extent a matter of taste. In the days of yore, the approach to implement application logic in database functions was more popular, but today it is usually frowned upon.
Extreme positions in this debate are
The application is implemented in the database to the extent that the database functions produce the HTML code that is sent to the client.
The database is just a dumb collection of tables with no triggers or constraints beyond a primary key, and the application tries to maintain data integrity.
The best solution is typically somewhere in the middle, but where is largely a matter of taste. You see that this is a typical opinion-based question. However, let me supply some arguments that help you make a decision.
Points speaking against application logic in the database:
It makes it more difficult to port to another database.
It is more complicated to develop and debug database functions than client code. For example, you won't have as advanced debugging tools.
The database machine has to perform not only the normal database workload but also the application-code workload, and databases are harder to scale than application servers (you can't just spin up a second database to handle part of the workload).
PostgreSQL-specific: all database functions run inside a single database transaction, so you cannot implement functionality that requires more complicated transaction management.
Points speaking for application logic in the database:
It becomes easier to port to another application server or client programming language.
Less data has to be transferred between client and server, which can make processing more efficient.
The software stack becomes shorter and the overall software architecture simpler.
My personal opinion is that anything that has to do with basic data integrity should be implemented in the database:
Have foreign keys and check constraints in the database. The application will of course also respect these rules (no point in triggering a database error), but it is good for data integrity to have a safety net (see the sketch after this list).
If you have to keep redundant information in the database, use triggers to make sure that all copies of a datum are kept synchronized. This implicitly makes use of transactional atomicity.
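As a minimal sketch of declaring that safety net from the application stack (SQLAlchemy here; the tables and the rule are invented for illustration):

from sqlalchemy import CheckConstraint, Column, ForeignKey, Integer, Numeric, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)

class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    # The foreign key lives in the database, so even raw SQL cannot orphan an order
    customer_id = Column(Integer, ForeignKey("customers.id"), nullable=False)
    # The check constraint rejects negative totals no matter which client writes
    total = Column(Numeric, CheckConstraint("total >= 0"), nullable=False)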
Anything that is more complicated is best done in the application. Be wary of database functions that are very long or complicated. Exceptions can be made for performance reasons: perhaps some complicated report could not easily be written in pure SQL, and shipping all the raw data to the client is prohibitively expensive.

Modern hierarchical database [closed]

I'm working on the architecture of a new system to replace an ancient mainframe app. The mainframe uses IBM IMS and is surprisingly fast with large amounts of data. We've tried three databases so far - MongoDB, SQL Server and Oracle - but they performed poorly under load. We hired an Oracle consultant and a 128-core server, and Oracle still gives us 4x the response time of the old system (same with SQL Server).
Are there any modern hierarchical DBs, that can efficiently support billions of records?
Mainframes have been and remain very fast for certain use cases, so part one is not to assume that mainframe = bad. Having said that, they can be very expensive to maintain, and particularly with legacy apps the skills are starting to evaporate.
If you really wanted a hierarchical database, one valid option would be to modernise your application but retain IMS at the core. IMS is a great hierarchical database, and I don't think IBM are going to EOL IMS any time soon, so is there a real reason to go to a hierarchical database that isn't IMS? A quick visit to their website gave me the impression that they'd discount the product if they thought you were going to migrate to a competing product, so if money is the problem then perhaps the answer is to just ask IBM to discount the product you're already happy with. This white paper (ftp://public.dhe.ibm.com/software/data/ims/pdf/TCG2013015LI.pdf) suggests they're pushing that as an option, and no doubt the later versions of IMS have a bunch of features that might not be available in the version you're running (assuming you've not upgraded to the latest).
I'm surprised you can't get the performance you want out of Oracle though, the system I'm currently working on has a couple of tables at the billion mark and we definitely don't have 128 cores, but we get reasonable performance.
My first question is whether your Oracle consultant really knew their stuff. I've had mixed results; I guess, as with any skill set, ability varies from person to person. I often find that when you get performance problems it's because people have over-normalised or over-generalised the database schema - so you've moved from a highly optimised hierarchical structure in IMS that flies to a very abstracted structure in 3NF, and that dies. But sometimes if you put that same hierarchical structure in Oracle, and only allow the same sort of access patterns that were possible in IMS, you'd get all the performance you want.
By that, I mean if in IMS you had clients, clients had orders, and orders had order lines, then I think that means it's pretty hard to do any accesses without starting at the client. It also often means you have large batch processes that process all the clients every day to find out which have orders that you need to do something with.
So, some things here. Firstly, if, in Oracle, you were to build that structure - so I have a client id, the client id is the first element in the primary key of orders, and the client id and then the order id are the first two elements in the primary key of order lines, and then I use client id as my clustering key and put client id into every index - then probably all my client-based access paths will be really fast. You can also partition by client id and, if needed, run an Oracle RAC cluster with each of those partitions/client ranges effectively running as a separate database on a separate, more commodity-class machine (say, a dual-socket machine with about 20 cores).
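As a hedged illustration of that keying scheme, expressed with SQLAlchemy since ORMs came up earlier (the names are invented, and the clustering/partitioning itself would still be Oracle DDL):

from sqlalchemy import Column, ForeignKeyConstraint, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Order(Base):
    __tablename__ = "orders"
    # client_id leads the composite key, mirroring the IMS client->order hierarchy
    client_id = Column(Integer, primary_key=True)
    order_id = Column(Integer, primary_key=True)
    status = Column(String)

class OrderLine(Base):
    __tablename__ = "order_lines"
    client_id = Column(Integer, primary_key=True)
    order_id = Column(Integer, primary_key=True)
    line_no = Column(Integer, primary_key=True)
    __table_args__ = (ForeignKeyConstraint(
        ["client_id", "order_id"],
        ["orders.client_id", "orders.order_id"]),)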
Secondly, if I used to have to process all my records once a night to find the orders that needed someone to work on them, then in the new relational world I don't need to do that any more, I just need to find the orders with a status of "pending" or whatever. So maybe Oracle isn't as fast for that batch oriented workload, but if I change my logic and do an indexed query for pending orders, then again I can get all the performance I want. Even more so, perhaps I make order_status into a partitioning key, so my "active" records are all in one partition, and all the older orders are in other partitions - and then I put that partition on an SSD-backed array.
Thirdly, take a look at your storage devices. Performance problems in databases are invariably IO problems - either you're doing too much IO (poorly optimised queries), or your IO subsystem can't keep up with the IO that you need to do. 128 cores is an awful lot of compute, and I've rarely seen a database that is compute bound. Maybe look at a big SSD array, some of them can give you enormous IO throughput. Certainly if you were running Oracle on a RAID 5 spinning disk array your performance is likely to suck.
The last random comment here - a lot of people are getting good results with SAP HANA - a fully in-memory database. That really flies, and is specifically designed for workloads that just won't run fast enough in other databases. I bet SAP would come demo it to you for free if you wanted it.

How to search an encrypted attribute?

I have a sensitive attribute that must be encrypted at all times except during display (not my rule and I think it's overkill, but I must follow this rule). Additionally, the secret used to encrypt/decrypt this data must not be on or accessible through the database. So currently I have a session for the user that stores their encrypted password and decrypts this data when needed. However, now I need to find records by the encrypted attribute. I currently utilize ActiveSupport::MessageEncryptor for encryption/decryption of the attribute. Here's the direction I think I should go to accomplish this:
decryptor = ActiveSupport::MessageEncryptor.new(encrypted_password)
Family.where("decryptor.decrypt_and_verify(name) == ?", some_search_name)
Obviously the first side of that condition does not work as-is, but I need some way to do that. Any ideas?
A Quick Primer on Passwords in the DB
This goes to show that encryption in the database is hard, and that you shouldn't do it unless you have thought carefully through your threat model and understand what all the tradeoffs are. To be honest, I have serious doubts that an ORM can ever give you the security you need where you need encryption (for key-management reasons, among others), and on PostgreSQL it is particularly hard because of the possibility of key disclosure in the log files. In general you really need to properly protect both the encrypted and the plain text where passwords are concerned, so you really don't want a relational interface here but a functional one, with the query running under a totally different set of permissions.
Now, I can't tell from your example whether you are trying to protect passwords, but if you are, reversible encryption is entirely the wrong way to go about it. My example below is going to use MD5. I am aware that MD5 is frowned upon by the crypto community because of its relatively short output, but it has the advantage in this case of not requiring pgcrypto, and it is likely stronger than attacking the password directly (in the context of short password strings it is likely "good enough", particularly when combined with other measures).
Now what you want to do is this: salt the password, then hash it, and then search on the hashed value. The most performant way to do this would be to have a users table which does not include the password but does include the salt, and a shadow table which includes the hashed password but no user-accessible data. The shadow table would be restricted to its owner, and that owner would have access to the users table too.
Then you could write a function like this:
CREATE OR REPLACE FUNCTION get_userid_by_password(in_username text, in_password text)
RETURNS INT LANGUAGE SQL AS
$$
SELECT user_id
FROM shadow
JOIN users ON users.id = shadow.user_id
WHERE users.username = $1 AND shadow.hashed_password = md5(users.salt || $2);
$$ SECURITY DEFINER;
ALTER FUNCTION get_userid_by_password(text, text) OWNER TO shadow_owner;
You would then have to drop to SQL to run this function (don't go through your ORM). However, you could index shadow.hashed_password and have it work with an index here (because the matching hash can be generated before the table is scanned), and you are reasonably well protected against SQL injections giving away the password hashes. You still have to make sure that these queries will not be logged, and there are a host of other things to consider, but it gives you an idea of how best to manage passwords per se. Alternatively, in your ORM you could do something that results in a SQL query like:
SELECT * FROM users WHERE id = get_userid_by_password($username, $password)
(The above is pseudocode and intended only for illustration purposes. If you use a raw query like that assembled as a text string you are asking for SQL injection.)
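For illustration, a hedged Python sketch of calling that function with proper parameter binding (psycopg2 here; connection details and the example values are assumptions):

import psycopg2

conn = psycopg2.connect("dbname=app")  # a role with EXECUTE on the function only
username, password = "alice", "s3cret"  # illustrative values
with conn.cursor() as cur:
    # The driver binds the parameters; nothing is interpolated into the SQL text
    cur.execute("SELECT get_userid_by_password(%s, %s)", (username, password))
    user_id = cur.fetchone()[0]  # None (SQL NULL) if no match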
What if it isn't a password?
If you need reversible encryption, then you need to go further. Note that in the example above the index could be used because I was searching merely for an equality match on the stored hash. Searching on the decrypted value means that indexes are not usable, and if you index the unencrypted data, why are you encrypting it in the first place? Decryption also places a burden on the processor, so it will be slow.
In all cases you need to carefully think through your threat model and ask how other vulnerabilities could make your passwords less secure.

Why use Blowfish for passwords?

I'm a little confused about password safekeeping.
Let's say I've got a database with a user-account table,
and this is the place where I keep passwords.
At the moment I'm using salted SHA-1.
I read that Blowfish-based functions are better than SHA-1 because they need more time to process a request.
Is there any reason not to use salted SHA-1 and just limit the login attempt count to some reasonable number (for example, 50 times per hour) as a 'firewall' against brute-force attacks?
The person who works with this database has no need to brute-force anything,
because he can change records with queries.
By "Blowfish-based function" you surely mean the BCrypt hash function. As you already stated, BCrypt is designed to be slow (to need some computing time); that is its only advantage over other, fast hash functions, but it is crucial.
With an off-the-shelf GPU, you are able to calculate about 3 giga hash values per second, so you can brute-force a whole English dictionary of 5,000,000 words in less than 2 milliseconds. Even though SHA-1 is a safe hash function in itself, that speed makes it inappropriate for hashing passwords.
BCrypt has a cost factor, which can be adapted to future, and therefore faster, hardware. The cost factor determines how many iterations of hashing are performed. Recently I wrote a tutorial about hashing passwords; I would invite you to have a look at it.
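A minimal sketch of that cost factor in Python, assuming the bcrypt package (the rounds value is illustrative):

import bcrypt

password = b"correct horse battery staple"
# gensalt() embeds the cost factor; each increment roughly doubles the work
hashed = bcrypt.hashpw(password, bcrypt.gensalt(rounds=12))
# checkpw() re-derives the hash from the salt and cost stored inside `hashed`
assert bcrypt.checkpw(password, hashed)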
Your point about restricting login attempts makes sense, but the hashing should protect the passwords in case an attacker gets hold of the database (e.g. via SQL injection). Of course you can limit the login attempts, but that has nothing to do with hashing; by that logic you could even store the passwords in plaintext.
Storing passwords with Blowfish is more secure than with SHA-1 because, as of now, no practical method of recovering the plaintext behind a Blowfish-protected string has been reported, whereas SHA-1 does have published attacks against it. You cannot trust SHA-1 to keep its data out of an attacker's hands.
If you are open to suggestions, I don't see a need for two-way encryption at all, since you are storing passwords. Hashing your users' passwords with a salted SHA-256 method may be an option. Allowing your users to reset their own passwords via email is generally considered good policy, and it results in a data set that cannot be easily cracked.
If you do require two-way encryption for any reason, aside from Blowfish, AES-256 (Rijndael) or Twofish are also currently secure enough to handle sensitive data. Don't forget that you are free to use multiple algorithms to store encrypted data.
On the note of brute forcing, it has little to do with encrypted database storage. You are looking at a full security model when you refer to methods of attack. Using a deprecated algorithm and "making up for it" by implementing policies to prevent ease of attack is not considered a mature approach to security.
In Short
Use one way hashing for storing passwords, allow users to reset via email
Don't be afraid to use multiple methods to store encrypted data
If you must use an encryption/decryption scheme, keep your keys safe and only use proven algorithms
Preventing brute force attacks is a good mindset, but it will only slow someone down or encourage them to search for other points of entry
Don't take this as gospel: when it comes to security everyone has different requirements, the more research you do the better your methods will become. If you don't completely encapsulate your sensitive data with a full-on security policy, you may get a nasty surprise down the track.
Source: Wikipedia, http://eprint.iacr.org/2005/010
Is there any reason not to use salted SHA-1 and just limit the login attempt count to some reasonable number (for example, 50 times per hour) as a 'firewall' against brute-force attacks?
If you don't hash your passwords with a decent algorithm, you are failing basic security precautions.
Why isn't 'just' blocking login attempts safe?
Well, besides the fact that you would need to block EVERY possible entrance, e.g.:
ssh
webservices (your webapp, phpMyAdmin, OpenPanel, etcetera)
ftp
lots more
You would also need to trust every user that has access to the database and server. I wouldn't like people to be able to read my password, but what I dislike even more is you deciding that for me, metaphorically speaking :-)
Maybe someone else can shed light on the Blowfish vs. SHA discussion, although I doubt that part is a well-formed Stack Overflow question.

How to search a value that is stored encrypted

In my database I store student information in encrypted form.
Now I want to perform a search that lists all students whose name starts with "something" or contains "something".
Does anybody have an idea how to perform this type of query?
Please suggest.
Any decent encryption algorithm has as one of its core features the fact that it's impossible to deduce anything about the plaintext just by looking at the encrypted text. If you were able to tell, just by looking at the encrypted text, that the plaintext contained the string william, any attacker would be able to get that information just as easily, and you might as well not be encrypting at all.
The only way to perform this kind of operation on the data is to have access to the decrypted data. Using the model you've described - where the database only ever sees the encrypted data - it's not possible for the database to do this work, as the database has no access to the data it needs.
You need to have the data you're wanting to search on decrypted. The only complete way to do this is to have the application pull all the data out of the database, decrypt it, then do the filtering/sorting/whatever in your application. Obviously this is not going to scale well - but that's surely something you took into consideration when you decided to encrypt the data before putting it in the database.
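A hedged sketch of that fetch-everything-and-filter approach in Python (assuming the ciphertext was produced with the cryptography package's Fernet; the table, column, and key handling are invented):

import psycopg2
from cryptography.fernet import Fernet

fernet = Fernet(Fernet.generate_key())  # illustrative; the real key lives with the app

def students_starting_with(prefix):
    conn = psycopg2.connect("dbname=school")
    with conn.cursor() as cur:
        # The database cannot filter for us, so every row must come back...
        cur.execute("SELECT name_encrypted FROM students")
        names = (fernet.decrypt(bytes(row[0])).decode() for row in cur)
        # ...and the filtering happens after decryption, which will not scale
        return [name for name in names if name.startswith(prefix)]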
Another option would be to store fragments of the data unencrypted. For example, if you have a first_name field and you want to be able to retrieve all records where first_name begins with a, have a first_name_first_letter field. Obviously this isn't going to scale well either - if you want to search for all records where first_name contains ill, you're going to have to store the complete first_name unencrypted.
There's a more serious problem with this solution though: by storing unencrypted data, you're leaking information about the encrypted data. The more unencrypted data you store, the more you leak. The more you leak, the more clues you're leaving for an attacker to defeat your encryption - plus, if you've stored the bit they were interested in unencrypted, they've already won.
Another question points to SQLCipher - it's an implementation of SQLite that does the encryption in the database. It seems to be targeted at your use case - it's even already used in a couple of iPhone apps.
However, it has the database doing the encryption, not the application. This lets the database also handle the decryption, and hence the database is able to inspect the contents of the fields and do the searching you're looking for.
If you're still insisting on not doing the encryption in the database, this won't work for you.
If all you want is the equivalent of "starts with" and "contains", you might be able to do something with a bit field and the bitwise logical operators.
I'm not sure of the exact syntax you'd use (I'm a bit rusty on SQL), but the idea would be to create an additional field for each record which has a bit set for each letter that occurs in the name, then do something like:
SELECT * FROM someTable WHERE (searchValue & bitField) > 0
You then need to iterate over those records, decrypt them, and determine whether they actually meet the criteria you really wanted to search on (since you'd get a superset of the desired records back from the search).
You'd obviously be leaking some information about the contents of the field by doing this, but you can probably reduce that by encrypting the bitfields as well, or by turning on a few extra bits in each bitfield, so you can't tell "bob" from "bobby", for example.
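A hedged Python sketch of that bitfield idea; the a-z mask layout is an illustrative choice, and the subset test shown is a tighter variant of the > 0 test above (both return a superset of the true matches):

def letter_mask(s):
    # One bit per letter a-z that occurs anywhere in the string
    mask = 0
    for ch in s.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask

bit_field = letter_mask("William")   # stored alongside the encrypted name
search_value = letter_mask("ill")    # derived from the search term

# Candidate test: every letter of the search term appears in the record.
# The application then decrypts each candidate and applies the real
# "contains" check to discard the false positives.
is_candidate = (search_value & bit_field) == search_value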
I'm curious about what sort of security goal you're trying to meet with this encryption, though. If you describe the model a bit more, you might get better answers.