In my database I store student information in encrypted form.
Now I want to perform a search that lists all students whose name starts with "something" or contains "something".
Does anybody have an idea how to perform this type of query?
Please suggest.
Any decent encryption algorithm has as one of its core features the fact that it's impossible to deduce anything about the plaintext just by looking at the encrypted text. If you were able to tell, just by looking at the encrypted text, that the plaintext contained the string william, any attackers would be able to get that information just as easily, and you may as well not be encrypting at all.
The only way to perform this kind of operation on the data is to have access to the decrypted data. Using the model you've described - where the database only ever sees the encrypted data - it's not possible for the database to do this work, as the database has no access to the data it needs.
You need to have the data you're wanting to search on decrypted. The only complete way to do this is to have the application pull all the data out of the database, decrypt it, then do the filtering/sorting/whatever in your application. Obviously this is not going to scale well - but that's surely something you took into consideration when you decided to encrypt the data before putting it in the database.
Another option would be to store fragments of the data unencrypted. For example, if you have a first_name field and you want to be able to retrieve all records where first_name begins with a, have a first_name_first_letter field. Obviously this isn't going to scale well either - if you want to search for all records where first_name contains ill, you're going to have to store the complete first_name unencrypted.
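A minimal sketch of that fragment approach, assuming Python with the cryptography package's Fernet cipher; the column names and helper functions are purely illustrative:

from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice, load this from your key management rather than generating it here
cipher = Fernet(key)

def row_for_insert(first_name):
    # Encrypted name plus a plaintext fragment (the first letter, which deliberately leaks one character).
    return {
        "first_name_encrypted": cipher.encrypt(first_name.encode("utf-8")),
        "first_name_first_letter": first_name[:1].lower(),
    }

def starts_with(rows, prefix):
    # Prefilter on the unencrypted letter, then decrypt and apply the real test in the application.
    for row in rows:
        if row["first_name_first_letter"] != prefix[:1].lower():
            continue
        name = cipher.decrypt(row["first_name_encrypted"]).decode("utf-8")
        if name.lower().startswith(prefix.lower()):
            yield name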
There's a more serious problem with this solution though: by storing unencrypted data, you're leaking information about the encrypted data. The more unencrypted data you store, the more you leak. The more you leak, the more clues you're leaving for an attacker to defeat your encryption - plus, if you've stored the bit they were interested in unencrypted, they've already won.
Another question points to SQLCipher - it's an implementation of SQLite that does the encryption in the database. It seems to be targeted towards your use case - it's even already used in a couple of iPhone apps.
However, it has the database doing the encryption, not the application. This lets the database also handle the decryption, and hence the database is able to inspect the contents of the fields and do the searching you're looking for.
If you're still insisting on not doing the encryption in the database, this won't work for you.
If all you want is the equivalent of "starts with" and "contains", you might be able to do something with a bit field and the bitwise logical operators.
Not sure on the syntax you'd use, exactly (I'm a bit rusty on SQL) but the idea would be to create an additional field for each record which has a bit set for each letter that occurs in the name, then do something like:
SELECT * FROM someTable WHERE (searchValue & bitField) > 0
You then need to iterate over those records, decrypt them, and determine whether they actually meet the criteria you really wanted to search on (since you'd get a superset of the desired records back from the search).
You'd obviously be leaking some information about the contents of the field by doing this, but you can probably reduce that by encrypting the bitfields as well, or by turning on a few extra bits in each bitfield, so you can't tell "bob" from "bobby", for example.
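As a rough illustration of the bitfield idea (Python; the names are hypothetical), the application would compute one bit per letter for each name and for the search term, and rows whose stored mask contains every bit of the search mask form the superset you then decrypt and re-check:

def letter_mask(s):
    # One bit per letter a-z that occurs anywhere in the string.
    mask = 0
    for ch in s.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask

bit_field = letter_mask("william")   # stored alongside the encrypted name
search_mask = letter_mask("ill")     # computed at query time

# SQL equivalent of the candidate test, e.g. WHERE (bitField & :searchMask) = :searchMask
is_candidate = (bit_field & search_mask) == search_mask   # True: every letter of "ill" occurs in "william"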
I'm curious about what sort of security goal you're trying to meet with this encryption, though. If you describe the model a bit more, you might get better answers.
I am doing homework on a REST API using Go and MongoDB, but I'm still wondering:
I am considering creating a dictionary at the model level to cache data; it would let me retrieve data much faster without going to MongoDB. But the big problem here is keeping the data in MongoDB and the data in the dictionary I created in sync.
In the file models/account.go I have a struct Account, and in MongoDB I also have a collection Account that stores all account information for the website. Should I create an AccountList to hold all the data from the database to increase performance?
Source as below:
var AccountList map[int]*Account

type Account struct {
    ID       int
    UserName string
    Password string
    Email    string
    Role     string
}
As with many things in software, "It Depends".
There's not enough information about the systems involved, how often the data is being queried, mutated, and so on to give a concrete answer. But because this is for homework, we can give scenarios.
The root of your question is this: should you cache results from the database?
Is it really needed?
Academically, it's OK to over-optimize. You get to play with technologies and understand how they work. In the real world, we should understand where the need for something is before implementing it. The more complex a solution is, the more important making a correct trade-off becomes.
Caching is best when you're going to use the results more often than they're going to change, and fetching from storage is expensive.
"Expensive" can vary. One operation measured in seconds can be expensive. But so can tens, hundreds, or thousands of operations close together measured in 100ms.
How should you do it?
You called out a couple drawbacks. Most importantly:
But the big problem here is keeping the data in MongoDB and the data in the dictionary I created in sync.
Synchronization is the most important thing for any distributed system.
It doesn't matter how you cache values if you have one server instance. But once you start adding instances, things get complex.
A common pattern for caching is to use a distributed key-value store. They allow you to store results which can be shared across applications, and to invalidate them; a minimal sketch of the flow follows the steps below.
Application checks to see if the key exists in the store.
If so, use it.
If not, fetch from origin and update the cache for next time.
Separately, invalidate the key any time data needs updating.
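The pattern itself is language-agnostic; here is a minimal sketch, shown in Python for brevity, with a plain dict standing in for the distributed store and hypothetical fetch_account_from_db / write_account_to_db helpers in place of the real MongoDB calls. The same structure maps directly onto Go with groupcache or a Redis client:

cache = {}   # stand-in for Redis, memcached, groupcache, etc.

def get_account(account_id):
    # fetch_account_from_db is a hypothetical helper wrapping the real MongoDB query.
    key = "account:%d" % account_id
    value = cache.get(key)                      # 1. check whether the key exists in the store
    if value is not None:
        return value                            # 2. if so, use it
    value = fetch_account_from_db(account_id)   # 3. if not, fetch from origin (MongoDB)...
    cache[key] = value                          #    ...and update the cache for next time
    return value

def update_account(account_id, fields):
    write_account_to_db(account_id, fields)     # write to the source of truth first (hypothetical helper)
    cache.pop("account:%d" % account_id, None)  # 4. invalidate the key so readers refetch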
There are a bunch of products you can use. Redis is popular; memcached works too. But since you're using Go, check out groupcache: https://github.com/mailgun/groupcache. It was written at Google to simplify dl.google.com, and extended by Mailgun to support TTLs.
I want to develop an application, and I'm worried about its performance once the number of users and the amount of stored data grow.
Actually, I don't know the best way to implement a program that works with really large data and does things like searching it, finding and retrieving user information, full-text search and so on, in real time and without any delay!
Let me explain the problem in more detail.
For example, suppose I have chosen MongoDB as the database and we have at least five million users, and a user wants to log in to the system, sending a username and password.
The first thing we should do is find the user with that username and then check the password. In MongoDB we would use something like the find method to get the user's information, like this:
Users.find({ username: entered_username })
then get the user's information and check the password.
But the find method has to search for that username among millions of users, it runs for every authentication request, and that causes heavy processing on the system.
Unfortunately, this problem is not limited to finding a user: if we want to search text when there are a lot of posts in the database, the problem is even bigger.
I don't know how big companies like Facebook and LinkedIn search through millions of records in such a short span of time. I don't want to build something like Facebook, but I do have a large amount of data and I'm looking for a good way to handle it.
Is there a framework or anything else that helps with handling large data in databases, or a way to organize the data so that searching and finding it is fast? Should I use a particular data structure?
I found the open-source project Elasticsearch, which helps with faster searching, but I don't know how to keep it and MongoDB in sync for things like updating data. And if I use Elasticsearch, do I still need MongoDB, or can I use Elasticsearch as both the database and the search engine at the same time?
If I use Elasticsearch and MongoDB together, then I have two copies of my data, one in MongoDB and one in Elasticsearch, and those two copies are separate :( I wish Elasticsearch could search inside MongoDB so I didn't have to keep two copies of the data.
Thank you for helping me find a good approach and understand what I should do.
When you talk about performance, it usually boils down to three things:
Your design
Your definition of "quick", and
How much you're willing to pay
Your design
MongoDB is great if you want to iterate on your data model, it can scale horizontally, and it is very quick when used properly. Elasticsearch, on the other hand, is not a database; however, it is very quick for searching. A traditional relational database is useful if you know exactly what your data looks like and don't expect it to change much, or if the data is relational by nature.
You can, for example, use a relational database for user login, use MongoDB for everything else, and use Elastic for textual, searchable data. There is no rule that tells you to keep everything within a single database.
Make sure you understand indexing, and know how to utilize it to its fullest potential. The fastest hardware will not help you if you don't design your database properly.
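For the login example above, an index on username is the difference between scanning five million documents and a single index lookup. A short pymongo sketch (database, collection, and field names are assumed):

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
users = client["mydb"]["users"]

# One-time setup: a unique index so lookups by username don't scan the collection.
users.create_index([("username", ASCENDING)], unique=True)

# The login lookup from the question now uses the index instead of examining every document.
entered_username = "alice"   # value from the login request
user = users.find_one({"username": entered_username})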
Conclusion: use any tool you need, combine if necessary, but understand their strengths and weaknesses.
Your definition of "quick"
How "quick" is quick enough for your application? Is 100ms quick enough? Is 10ms quick enough? Remember that more performance you ask of the machine, more expensive it will be. You can get more performance with a better design, but design can only go so far.
Usually this boils down to what is acceptable for you and your client. Not every application needs a sub-10ms response time. There's plenty of applications that can tolerate queries that return in seconds.
Conclusion: determine what is acceptable, and design accordingly.
How much you're willing to pay
Of course, it all depends on how much you're willing to pay for the hardware that needs to host all of that. MongoDB may be open source, but you still need somewhere to host it. Also, you cannot expect magic: you can't throw thousands of queries and updates per second at it and expect it to be blazing fast when you only give it 1 GB of RAM.
Conclusion: never under-provision to save money if you want your application to be successful.
I have sometimes seen, and have been recommended, storing strings and associative-array keys as MD5 hash values. I have now learnt about hashing from MIT OCW 6.046J, and it seems to be a scheme for storing data in a format that is efficient for fast lookup, and for preventing anyone from recovering the original.
But don't languages that support associative arrays / dictionaries do this internally? What additional advantage does the MD5 hash give?
Most languages do support this internally; for example, see Java's hashCode(), which is used when storing keys in a HashMap:
Returns a hash code value for the object. This method is supported for the benefit of hash tables such as those provided by HashMap.
But there are scenarios where you want to do it yourself.
Scenario 1 - using as key in a database:
Let's suppose you have a big no-sql-ish database of letters and metadata of these letters. You want to be able to find a letter's metadata quickly without searching. What would your index be?
One option is to use a running index that's unrelated to the letter's content, but then you have to search the database before you can find a document's metadata. Another option is to create a signature for the document composed of its prefix (just one example out of many), but several documents may share that property ("Dear John,").
So how about taking into account the entire document? That's where you can use md5 as the row-key for your documents.
In this scenario you're relying on there being no collisions, and the argument usually made in favour of that assumption is that your chances of running into a demented gorilla are (usually) greater than your chances of hitting a collision. The Secure Hash Algorithm family produces even fewer collisions.
I mention this since databases normally do not do this out of the box (frameworks may...).
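A small sketch of deriving such a content-based row key in Python (hashlib is in the standard library; the metadata store is a hypothetical stand-in for your no-sql-ish database):

import hashlib

metadata_store = {}   # stand-in for the document/metadata database

def row_key(document_text):
    # The same content always maps to the same key; distinct contents collide only with negligible probability.
    return hashlib.md5(document_text.encode("utf-8")).hexdigest()

key = row_key("Dear John, ...")                          # 32-character hex string
metadata_store[key] = {"author": "Jane", "year": 1863}   # illustrative metadata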
Scenario 2 - One-way hash for password storage:
Note: this may no longer apply to MD5, but it does hold for the SHA-family variants.
In this scenario, you want to store passwords in your database, but storing plain-text passwords has drawbacks if the database is compromised (users often share passwords across sites, which may lead to accounts on other sites being compromised as well). The idea of hashing here is to store the hashed password, and when a user attempts to log in, you compare the hashes rather than the passwords themselves. This way you don't need the password stored locally, and it is a lot harder to crack.
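A minimal sketch of that scheme in Python using only the standard library (a per-user salt plus a SHA-256 digest; production systems should prefer a deliberately slow KDF such as bcrypt, scrypt, or PBKDF2):

import hashlib, hmac, os

def hash_password(password, salt=None):
    salt = salt if salt is not None else os.urandom(16)     # unique salt per user
    digest = hashlib.sha256(salt + password.encode("utf-8")).hexdigest()
    return salt, digest                                      # store both; never store the password itself

def verify_password(password, salt, stored_digest):
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)        # constant-time comparison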
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
For example, there is a repo for doing this (encrypting database columns) in Django: https://sourcegraph.com/github.com/dcwatson/django-pgcrypto.
There is some discussion in the SQLAlchemy manual, but I am using non-byte columns: http://docs.sqlalchemy.org/en/rel_0_9/core/types.html
I am running Flask on Heroku using SQLAlchemy.
A code example and/or some discussion would be most appreciated.
There are a bunch of stages to this kind of decision making; it's not just "shove a plugin into the stack and that encryption thing is taken care of".
First, you really need to classify each column for its attractiveness to attackers & what searches/queries need to use it, whether it's a join column / index candidate, etc. Some data needs much stronger protection than other data.
Consider who you're trying to protect against:
Casual attacker (e.g. SQL injection holes used for remote table copies)
Stolen database backup (tip: Encrypt these too)
Stolen/leaked log files, possibly including queries and parameters
Attacker with direct non-superuser SQL level access
Attacker with direct superuser SQL-level access
Attacker who gains direct access to the "postgres" OS user, so they can modify configuration, copy/edit logs, install malicious extensions, alter function definitions, etc
Attacker who gains root on the DB server
Of course, there's also the app server, upstream compromise of trusted sources for programming languages and toolkits, etc. Eventually you reach a point where you have to say "I can't realistically defend against this". You can't protect against somebody coming in, saying "I'm from the Government and I'll do x/y/z to you unless you allow me to install a rootkit on this customer's server". The point is that you've got to decide what you do have to protect against, and make your security decisions based on that.
A good compromise can be to do as much of the crypto as possible in the app, so PostgreSQL never sees the encryption/decryption keys. Use one-way hashing whenever possible, rather than using reversible encryption, and when you hash, properly salt your hashes.
That means pgcrypto doesn't actually do you much good, because you're never sending plaintext to the server, and you're not sending key material to the server either.
It also means that two people with the same plaintext for column SecretValue have totally different values for SecretValueSalt, SecretValueHashedBytes in the database. So you can't join on it, use it in a WHERE clause usefully, index it usefully, etc.
For that reason, you'll often compromise with security. You might do an unsalted hash of part of the datum, so you get a partial match, then fetch all the results to your application and filter them on the application side where you have the full information required. So your storage for SecretValue now looks like SecretValueFirst10DigitsUnsaltedHash, SecretValueHashSalt, SecretValueHashBytes. But with better column names.
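Since you asked for a code example, here is a sketch in Python of deriving those three values application-side before anything reaches PostgreSQL (standard-library hashlib/os; the 10-character prefix and the column names follow the example above and are otherwise arbitrary):

import hashlib, os

def secret_value_columns(plaintext):
    # Unsalted hash of the first 10 characters: coarse, indexable, supports the partial match.
    partial = hashlib.sha256(plaintext[:10].encode("utf-8")).hexdigest()
    # Salted hash of the full value: what the application actually verifies against.
    salt = os.urandom(16)
    full = hashlib.sha256(salt + plaintext.encode("utf-8")).hexdigest()
    return {
        "secret_value_first10_unsalted_hash": partial,
        "secret_value_hash_salt": salt,
        "secret_value_hash_bytes": full,
    }

# Query flow: SELECT ... WHERE secret_value_first10_unsalted_hash = :partial to get candidates,
# then re-hash each candidate with its own salt in the application to confirm the real match.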
If in doubt, just don't send plaintext of anything sensitive to the database. That means pgcrypto isn't much use to you, and you'll be doing mostly application-side crypto. The #1 reason for that is that if you send plaintext (or worse, key material) to the DB, it might get exposed in log files, pg_stat_activity, etc.
You'll pretty much always want to store encrypted data in bytea columns. If you really insist you can hex- or base64 encode it and shove it in a text column, but developers and DBAs who have to use your system later will cry.
I have a sensitive attribute that must be encrypted at all times except during display (not my rule and I think it's overkill, but I must follow this rule). Additionally, the secret used to encrypt/decrypt this data must not be on or accessible through the database. So currently I have a session for the user that stores their encrypted password and decrypts this data when needed. However, now I need to find records by the encrypted attribute. I currently utilize ActiveSupport::MessageEncryptor for encryption/decryption of the attribute. Here's the direction I think I should go to accomplish this:
decryptor = ActiveSupport::MessageEncryptor.new(encrypted_password)
Family.where("decryptor.decrypt_and_verify(name) == ?", some_search_name)
Obviously the first side of that condition does not work as-is, but I need some way to do that. Any ideas?
Quick Primer on Passwords in the DB
This goes to show that encryption in the database is hard, and that you shouldn't do it unless you have thought carefully through your threat model and understand what all the tradeoffs are. To be honest, I have serious doubts that an ORM can ever give you the security you need where you need encryption (for important knowledge reasons), and on PostgreSQL, it is particularly hard because of the possibility of key disclosure in the log files. In general you really need to properly protect both encrypted and plain text with regard to passwords, so you really don't want a relational interface here but a functional one, with a query running under a totally different set of permissions.
Now, I can't tell from your example whether you are trying to protect passwords, but if you are, that's entirely the wrong way to go about it. My example below is going to use MD5. I am aware that MD5 is frowned upon by the crypto community because of its relatively short output, but it has the advantage in this case of not requiring pgcrypto, and it is likely stronger than attacking the password directly (in the context of short password strings, it is likely "good enough", particularly when combined with other measures).
Now what you want to do is this: you want to salt the password, then hash it, and then search the hashed value. The most performant way to do this would be to have a users table which does not include the password, but does include the salt, and a shadow table which includes the hashed password but not the user-accessible data. The shadow table would be restricted to its owner and that owner would have access to the user table too.
Then you could write a function like this:
CREATE OR REPLACE FUNCTION get_userid_by_password(in_username text, in_password text)
RETURNS INT LANGUAGE SQL AS
$$
SELECT user_id
FROM shadow
JOIN users ON users.id = shadow.user_id
WHERE users.username = $1 AND shadow.hashed_password = md5(users.salt || $2);
$$ SECURITY DEFINER;
ALTER FUNCTION get_userid_by_password(text, text) OWNER TO shadow_owner;
You would then have to drop to SQL to run this function (don't go through your ORM). However, you could index shadow.hashed_password and have it work with an index here (because the matching hash can be generated before scanning the table), and you are reasonably protected against SQL injection giving away the password hashes. You still have to make sure that logging of these queries is not generally enabled, and there are a host of other things to consider, but it gives you an idea of how best to manage passwords per se. Alternatively, in your ORM you could do something that results in a SQL query like:
SELECT * FROM users WHERE id = get_userid_by_password($username, $password)
(The above is pseudocode and intended only for illustration purposes. If you use a raw query like that assembled as a text string you are asking for SQL injection.)
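For illustration, here is how application code might call that function with bind parameters rather than string assembly (psycopg2 is used purely as an example driver; any driver with parameter placeholders gives the same protection, and the connection details are assumptions):

import psycopg2

username, password = "alice", "s3cret"               # values from the login request (illustrative)
conn = psycopg2.connect("dbname=app user=app_role")  # connection details are assumptions
with conn.cursor() as cur:
    # Values are passed separately from the SQL text, so they are never interpolated into it.
    cur.execute("SELECT get_userid_by_password(%s, %s)", (username, password))
    user_id = cur.fetchone()[0]                       # None when no user/password matched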
What if it isn't a password?
If you need reversible encryption, then you need to go further. Note that in the example above the index could be used because I was searching merely for equality on the stored hash. Searching by the unencrypted value means that indexes are not usable. And if you index the unencrypted data, why are you encrypting it in the first place? Decryption also places a burden on the processor, so it will be slow.
In all cases you need to carefully think through your threat model and ask how other vulnerabilities could make your passwords less secure.