How can I protect Amazon SimpleDB from SQL Injection?

Under the principle of "if it walks like a duck and it sounds like a duck," it sure seems like the SQL-flavored queries that Amazon's SimpleDB supports should be susceptible to SQL injection-type attacks. Here's a simple example that assumes the attacker's input is going into the variable $category, and that he can guess a column name:
$category = "Clothes' OR Category LIKE '%";
$results = $sdb->select("SELECT * FROM `{$domain}` WHERE Category = '$category'");
If you're playing the home game, these lines can be an in-place replacement for line 119 in the file html-sdb_create_domain_data.php in the sample code in Amazon's PHP SDK (1.2).
Amazon publishes quoting rules, and I suppose I could write something that ensures that any " or ' in user input gets doubled up... but I've always understood that escaping is basically an arms race, which makes parametrization my weapon of choice when using, for example, MySQL.
What are other people using to defend SimpleDB queries?

The SimpleDB Select operation is non-destructive, so the only thing to protect against is extra query data going out to the attacker.
Sanitizing user input for the query is pretty easy with SimpleDB, since sub-selects and compound statements are not allowed. So it's not really an arms race: a run of quote characters in the input can only break out of the quoted string when its length is odd, and doubling every quote character in the input removes that possibility.
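As an illustration, here is a minimal sketch of that doubling rule in Python; quote_sdb_value is a hypothetical helper, not part of the AWS SDK:

def quote_sdb_value(value):
    # Double every embedded quote so it stays a literal, then wrap
    # the whole value in single quotes per SimpleDB's quoting rules.
    return "'" + value.replace("'", "''") + "'"

category = "Clothes' OR Category LIKE '%"
expression = "SELECT * FROM mydomain WHERE Category = " + quote_sdb_value(category)
# -> SELECT * FROM mydomain WHERE Category = 'Clothes'' OR Category LIKE ''%'
# The injected quote is now data, not query structure.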

Is storing a delimited list in a database column really that bad?

Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code was worth it in my situation. Is this a defensible design choice, or should I have normalized it from the start?
Some more context: this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and making it more maintainable. There are some things in there I'm not entirely happy with, and one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values.
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
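For contrast, here is a minimal sketch of the normalized design, using SQLite from Python so it is self-contained (table and column names are illustrative):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entity (id INTEGER PRIMARY KEY);
CREATE TABLE choice (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
-- One row per selected check box; this junction table replaces the CSV column.
CREATE TABLE entity_choice (
    entity_id INTEGER NOT NULL REFERENCES entity(id),
    choice_id INTEGER NOT NULL REFERENCES choice(id),
    PRIMARY KEY (entity_id, choice_id)  -- duplicates are impossible by design
);
""")
# Searching, counting, joining, and sorting are now ordinary indexed SQL:
rows = con.execute(
    "SELECT entity_id FROM entity_choice WHERE choice_id = ?", (2,)
).fetchall()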
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma-separated list
how to get records whose comma-separated list contains exactly the given 2/3/etc. specific values
Another problem with the comma-separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
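As a rough sketch of that bit-flags middle ground in Python (the flag names are invented for illustration; note the query-side drawbacks discussed in the bitwise-attributes question later on this page):

# Hypothetical flags, one bit per check box.
EMAIL, PHONE, FAX, MAIL = 1, 2, 4, 8

def pack(selected):
    """Combine a set of flags into a single integer for storage."""
    value = 0
    for flag in selected:
        value |= flag
    return value

def is_set(value, flag):
    """Test whether one check box was selected."""
    return value & flag != 0

stored = pack({EMAIL, FAX})  # 5
assert is_set(stored, FAX) and not is_set(stored, PHONE)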
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
If you need a multi-value column, it can be implemented as an XML field instead. It can be converted to a comma-delimited list as necessary, and you can query an XML list in SQL Server using XQuery. By being an XML field, some of the concerns can be addressed:
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with a delimited list and can be converted to a delimited list as needed.
Yes, it is that bad. My view is that if you don't like using relational databases then look for an alternative that suits you better, there are lots of interesting "NOSQL" projects out there with some really advanced features.
Well, I've been using a tab-separated key/value pair list in an NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries, but on the other hand, if you have a library that persists/de-persists the key/value pairs, then it's not that bad an idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use an INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR(0) (nullable) for each. You could also use a SET (I forget the exact syntax).

Is it OK to use emojis/symbols in DynamoDB keys?

I'm getting into single-table ddb design and I'm discovering the need for delimiters and other significant characters in the keys themselves.
In order to avoid the possibility of having the delimiter symbol show up in the key value itself, I'm thinking of using emojis/symbols as delimiters:
'parent➡️childType≔{childId}➡️grandchildType≔{grandchildId}'
I read here that dynamo accepts UTF-8, and I read here that emojis can be UTF-8 encoded. But I'm far from expert on the matter, so, an authoritative answer would be well appreciated : )
I tested your text as is in a real DynamoDB table and it works just fine as a key and a value, but personally I would use double colons. So it looks like this:
parent::childType=123::grandchildType=456
IMO, they are easier to read, which is why I use them, and nothing else in my keys uses that sequence.
Whatever you choose, just a small tip. Remember that these characters count as part of the overall size of the item. When it comes to GetItem, Query, and Scan operations the size of the names matters. So, do not go wild here unless it really makes sense.
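For what that looks like in code, here is a minimal boto3 sketch (the table name and the PK/SK attribute names are made up for illustration):

import boto3

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table

child_id, grandchild_id = "123", "456"
sort_key = f"childType={child_id}::grandchildType={grandchild_id}"

# The delimiter characters count toward the item's stored size,
# so short separators like '::' keep keys cheap and readable.
table.put_item(Item={"PK": "parent", "SK": sort_key})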

How to search an encrypted attribute?

I have a sensitive attribute that must be encrypted at all times except during display (not my rule and I think it's overkill, but I must follow this rule). Additionally, the secret used to encrypt/decrypt this data must not be on or accessible through the database. So currently I have a session for the user that stores their encrypted password and decrypts this data when needed. However, now I need to find records by the encrypted attribute. I currently utilize ActiveSupport::MessageEncryptor for encryption/decryption of the attribute. Here's the direction I think I should go to accomplish this:
decryptor = ActiveSupport::MessageEncryptor.new(encrypted_password)
Family.where("decryptor.decrypt_and_verify(name) == ?", some_search_name)
Obviously the first side of that condition does not work as-is, but I need some way to do that. Any ideas?
Quick Primer to Passwords in the DB
This goes to show that encryption in the database is hard, and that you shouldn't do it unless you have thought carefully through your threat model and understand what all the tradeoffs are. To be honest, I have serious doubts that an ORM can ever give you the security you need where you need encryption (for need-to-know reasons), and on PostgreSQL it is particularly hard because of the possibility of key disclosure in the log files. In general you really need to properly protect both the encrypted and the plain text where passwords are concerned, so you really don't want a relational interface here but a functional one, with the query running under a totally different set of permissions.
Now, I can't tell from your example whether you are trying to protect passwords, but if you are, that's entirely the wrong way to go about it. My example below is going to use MD5. I am aware that MD5 is frowned upon by the crypto community because of its relatively short output, but it has the advantage here of not requiring the pgcrypto extension, and it is likely stronger than attacking the password directly (in the context of short password strings it is likely "good enough", particularly when combined with other measures).
Now what you want to do is this: you want to salt the password, then hash it, and then search the hashed value. The most performant way to do this would be to have a users table which does not include the password, but does include the salt, and a shadow table which includes the hashed password but not the user-accessible data. The shadow table would be restricted to its owner and that owner would have access to the user table too.
Then you could write a function like this:
CREATE OR REPLACE FUNCTION get_userid_by_password(in_username text, in_password text)
RETURNS INT LANGUAGE SQL AS
$$
SELECT user_id
FROM shadow
JOIN users ON users.id = shadow.user_id
WHERE users.username = $1 AND shadow.hashed_password = md5(users.salt || $2);
$$ SECURITY DEFINER;
ALTER FUNCTION get_userid_by_password(text, text) OWNER TO shadow_owner;
You would then have to drop to SQL to run this function (don't go through your ORM). However, you could index shadow.hashed_password and have the query use that index (because the matching hash can be generated before scanning the table), and you are reasonably protected against SQL injections giving away the password hashes. You still have to make sure that logging of these queries will not be generally enabled, and there are a host of other things to consider, but it gives you an idea of how best to manage passwords per se. Alternatively, in your ORM you could do something that results in a SQL query like:
SELECT * FROM users WHERE id = get_userid_by_password($username, $password)
(The above is pseudocode and intended only for illustration purposes. If you use a raw query like that assembled as a text string you are asking for SQL injection.)
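For instance, a minimal psycopg2 sketch with bound parameters (the connection string and variable values are illustrative):

import psycopg2

conn = psycopg2.connect("dbname=app")   # illustrative connection string
username, password = "alice", "s3cret"  # would come from the login form

with conn.cursor() as cur:
    # The driver sends the values separately from the SQL text,
    # so neither input can alter the query structure.
    cur.execute("SELECT get_userid_by_password(%s, %s)", (username, password))
    user_id = cur.fetchone()[0]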
What if it isn't a password?
If you need reversible encryption, then you need to go further. Note that in the example above, the index could be used because I was searching merely for an equality on the encrypted data. Searching by the unencrypted value means that indexes are not usable, and if you index the unencrypted data, then why are you encrypting it in the first place? Decryption also places a burden on the processor, so it will be slow.
In all cases you need to carefully think through your threat model and ask how other vulnerabilities could make your passwords less secure.
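If the attribute is not a password but you only ever need equality search, one way to extend the hashing idea above is a keyed "blind index" stored next to the ciphertext. A sketch, assuming the key lives outside the database per the question's rule, with name_index as a hypothetical column:

import hashlib
import hmac

INDEX_KEY = b"kept-outside-the-database"  # illustrative; never store it in the DB

def blind_index(value):
    # Deterministic keyed digest: equal plaintexts give equal digests,
    # so an ordinary index on this column supports equality lookups
    # without ever decrypting anything.
    return hmac.new(INDEX_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Store blind_index(name) alongside the ciphertext, then search with:
#   Family.where(name_index: blind_index(some_search_name))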

Is it a good idea to store attributes in an integer column and perform bitwise operations to retrieve them?

In a recent CODE Magazine article, John Petersen shows how to use bitwise operators in TSQL in order to store a list of attributes in one column of a db table.
Article here.
In his example he's using one integer column to hold how a customer wants to be contacted (email,phone,fax,mail). The query for pulling out customers that want to be contacted by email would look like this:
SELECT C.*
FROM dbo.Customers C
,(SELECT 1 AS donotcontact
,2 AS email
,4 AS phone
,8 AS fax
,16 AS mail) AS contacttypes
WHERE ( C.contactmethods & contacttypes.email <> 0 )
AND ( C.contactmethods & contacttypes.donotcontact = 0 )
Afterwards he shows how to encapsulate this into a table function.
My questions are these:
1. Is this a good idea? Any drawbacks? What problems might I run into using this approach of storing attributes, versus storing them in two extra tables (Customer_ContactType, ContactType) and doing a join with the Customer table? I guess one problem might be if my attribute list gets too long: if the column is an integer, then my attribute list can hold at most 32 attributes.
2. What is the performance of doing these bitwise operations in queries as you move into the tens of thousands of records? I'm guessing that it would not be any more expensive than any other comparison operation.
If you wish to filter your query based on the value of any of those bit values, then yes this is a very bad idea, and is likely to cause performance problems.
Besides, there simply isn't any need - just use the bit data type.
The reason why using bitwise operators in this way is a bad idea is that SQL Server maintains statistics on various columns in order to improve query performance - for example, if you have an email column, SQL Server can tell you roughly what percentage of values in that email column are true and select an appropriate execution plan based on that knowledge.
If however you have a flags column, SQL Server will have absolutely no idea how many records in a table match flags & 2 (email) - it doesn't maintain those sorts of statistics. Without this sort of information available to it, SQL Server is far more likely to choose a poor execution plan.
And don't forget the maintenance problems this technique would cause. Since it is not standard, new developers will probably be confused by the code and not know how to adjust it properly. Errors will abound and be hard to find. It is also hard to write reporting-type queries against. This sort of trick is almost never a good idea from a maintenance perspective: it might look cool and elegant, but all it really is is clunky and hard to work with over time.
One major performance implication is that there is no index lookup operator that works this way. If you said WHERE contact_email = 1 there might be an index on that column and the query would use it; if you said WHERE (contact_flags & 1) = 1 then it wouldn't.
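To make that concrete, here is a small sketch using SQLite from Python (SQLite rather than SQL Server purely so the example is self-contained; the table and index names are made up):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, contact_flags INTEGER, contact_email INTEGER)")
con.execute("CREATE INDEX idx_email ON customers (contact_email)")

# The dedicated column is satisfied by an index search...
print(con.execute("EXPLAIN QUERY PLAN SELECT id FROM customers WHERE contact_email = 1").fetchall())
# ...while the bitwise test on the packed column forces a full table scan.
print(con.execute("EXPLAIN QUERY PLAN SELECT id FROM customers WHERE (contact_flags & 1) = 1").fetchall())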
One column stores one piece of information only - it's the database way.
(Didn't see it earlier - Kragen's answer also states this point, way before mine.)
In opposite order: The best way to know what your performance is going to be is to profile.
This is, most definitely, an "It Depends" question. I personally would never store such things as integers. For one thing, as you mention, there's the conversion factor. For another, at some point you, or some other DBA, or someone, is going to have to type:
Select CustomerName, CustomerAddress, ContactMethods, [etc]
From Customer
Where CustomerId = xxxxx
because some data has become corrupt, or because someone entered the wrong data, or something. Having to do a join and/or a function call just to get at that basic information is way more trouble than it's worth, IMO.
Others, however, will probably point to the diversity of your options, or the ability to store multiple value types (email, vs phone, vs fax, whatever) all in the same column, or some other advantage to this approach. So you would really need to look at the problem you're attempting to solve and determine which approach is the best fit.

Storing parts of user data in files for preventing SQL injection

I am new to web programming and have been exploring issues related to web security.
I have a form where the user can post two types of data - let's call them "safe" and "unsafe" (from the point of view of SQL).
Most places recommend storing both parts of the data in database after sanitizing the "unsafe" part (to make it "safe").
I am wondering about a different approach - to store the "safe" data in the database and the "unsafe" data in files (outside the database). Of course, this approach creates its own set of problems related to maintaining the association between files and DB entries. But are there any other major issues with this approach, especially related to security?
UPDATE: Thanks for the responses! Apologies for not being clear regarding what I am considering "safe", so some clarification is in order. I am using Django, and the form data that I am considering "safe" is accessed through the form's "cleaned_data" dictionary, which does all the necessary escaping.
For the purpose of this question, let us consider a wiki page. The title of a wiki page does not need to have any styling attached to it, so it can be accessed through the form's "cleaned_data" dictionary, which will convert the user input to a "safe" format. But since I wish to provide users the ability to arbitrarily style their content, I can't access the content part through the "cleaned_data" dictionary. Does the file approach solve the security aspects of this problem? Or are there other security issues that I am overlooking?
You know the "safe" data you're talking about? It isn't. It's all unsafe, and you should treat it as such - not by storing it all in files, but by properly constructing your SQL statements.
As others have mentioned, using prepared statements, or a library which simulates them, is the way to go, e.g.
$db->Execute("insert into foo(x,y,z) values (?,?,?)", array($one, $two, $three));
What do you consider "safe" and "unsafe"? Are you considering data with the slashes escaped to be "safe"? If so, please don't.
Use bound variables with SQL placeholders. It is the only sensible way to protect against SQL injection.
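Since the question mentions Django, here is a minimal sketch of bound placeholders through Django's raw cursor (the table and column names are invented; the ORM's normal query methods parameterize for you as well):

from django.db import connection

def find_pages(title):
    with connection.cursor() as cursor:
        # The %s placeholders are filled in by the database driver,
        # never by string interpolation into the SQL text.
        cursor.execute("SELECT id, title FROM wiki_page WHERE title = %s", [title])
        return cursor.fetchall()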
Splitting your data will not protect you from SQL injection; it'll just limit the data that can be exposed through it. And exposure is not the only risk of the attack: they can also delete data, add bogus data, and so on.
I see no justification for your approach, especially given that prepared statements are supported by many, if not all, development platforms and databases.
And that's without even getting into the nightmare that your approach will end up being to maintain.
In the end, why would you use a database if you don't trust it? Just use plain files if you wish; a mix is a no-no.
SQL injection can target the whole database, not just one user's data, and it is a matter of the query itself being poisoned. So for me the best way (if not the only way) to avoid a SQL injection attack is to control your queries and protect them from being injected with malicious input, rather than splitting up the storage.