Best practices for sanitizing Unicode input

I'm working on a web application at the moment (using Ruby) that I would ultimately like to be usable by people from anywhere in the world. With that in mind, support for non-ASCII characters is essential. However, I don't want the database to be full of "noise" characters in fields such as username etc.
Are there any accepted best practices for dealing with Unicode input under these circumstances without alienating users? Any thoughts on dealing with homographs in usernames to make impersonation harder?
Some of my thoughts so far -
normalizing text before storing or using it in queries
filtering non-printable characters
limiting the number of sequential combining diacritics allowed in input
Any further thoughts, or am I making unnecessary work for myself?
Thanks.

RFC 3454 ("stringprep", http://www.ietf.org/rfc/rfc3454.txt) will tell you what you should be doing, which is essentially worrying about normalization and security issues.
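To make the question's own list concrete, here is a minimal sketch in Python of the normalize/filter/limit steps (the same ideas carry over to Ruby); it assumes NFKC is an acceptable normalization form for identifiers such as usernames, which you should verify against your own requirements:

import unicodedata

MAX_COMBINING_RUN = 2  # assumed limit on sequential diacritics; tune for your user base

def sanitize_identifier(raw):
    # Normalize first so visually equivalent sequences compare equal
    # (NFKC also folds compatibility characters; use NFC to preserve them).
    text = unicodedata.normalize("NFKC", raw)

    cleaned = []
    combining_run = 0
    for ch in text:
        # Filter non-printable characters: controls, format chars, surrogates, unassigned.
        if unicodedata.category(ch) in ("Cc", "Cf", "Cs", "Cn"):
            continue
        # Limit the number of sequential combining diacritics.
        if unicodedata.combining(ch):
            combining_run += 1
            if combining_run > MAX_COMBINING_RUN:
                continue
        else:
            combining_run = 0
        cleaned.append(ch)
    return "".join(cleaned)

This does not address homographs by itself; comparing NFKC-normalized, case-folded forms of usernames (or a confusables "skeleton" per Unicode TR39) is the usual next step for making impersonation harder.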

Related

Is it OK to use emojis/symbols in DynamoDB keys?

I'm getting into single-table ddb design and I'm discovering the need for delimiters and other significant characters in the keys themselves.
In order to avoid the possibility of having the delimiter symbol show up in the key value itself, I'm thinking of using emojis/symbols as delimiters:
'parent➡️childType≔{childId}➡️grandchildType≔{grandchildId}'
I read here that DynamoDB accepts UTF-8, and I read here that emojis can be UTF-8 encoded. But I'm far from an expert on the matter, so an authoritative answer would be well appreciated : )
I tested your text as is in a real DynamoDB table and it works just fine as a key and a value, but personally I would use double colons. So it looks like this:
parent::childType=123::grandchildType=456
IMO, they are easier to read, which is why I use them, and nothing else in my data uses that sequence.
Whatever you choose, just a small tip: remember that these characters count toward the overall size of the item. For GetItem, Query, and Scan operations the size of attribute names and values matters, so do not go wild here unless it really makes sense.
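For illustration, a minimal boto3 sketch of writing such a composite key (the table name, key attribute names, and IDs here are invented for the example):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-single-table")  # hypothetical table name

def make_sort_key(child_id, grandchild_id):
    # Composite sort key using "::" as the delimiter, as suggested above.
    return "parent::childType={}::grandchildType={}".format(child_id, grandchild_id)

table.put_item(
    Item={
        "pk": "parent",                 # assumed partition key attribute
        "sk": make_sort_key(123, 456),  # assumed sort key attribute
        "payload": "example item data",
    }
)

Keeping the delimiter to plain ASCII also means the prefix you pass to begins_with() key conditions stays easy to type and log.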

Teaching OCR to understand NSA and FISC redactions

I'm maintaining an archive of the heavily redacted documents coming out of the Foreign Intelligence Surveillance Court.
They come with big sections of text that look like this:
And when the OCR tries to work with this, you get text like:
production of this data on a daily basis for a period of 90 days. The sole purpose of this
production is to obtain foreign intelligence information in support of
individual authorized investigations to protect against international terrorism and
So in the OCRed version, where there are blacked out spots, there are just missing words. Sometimes, the missing words create a grammatically correct sentence with a different/weird meaning (like above). Other times, the resulting sentences make no sense, but either way it's a problem. It would be much better if the OCR engine could return X's for these spots or Unicode squares like ▮▮▮▮ instead.
The result I'd like is something like:
production of this data on a daily basis for a period of 90 days. The sole purpose of this
production is to obtain foreign intelligence information in support of XXXXXXXXXXX
individual authorized investigations to protect against international terrorism and
My question is how to go about getting these X's. Is there a way to analyze the images to identify the black spots? Is there a way to replace them with X's or some better unicode character? I'm open to any ideas to make this look right, but image editing is not a strong suit for me nor is hacking deep within the OCR engine.
You may want to train Tesseract for those long blobs. Depending on the length of the blob, you would assign a different number of 'X' characters. Read the TrainingTesseract3 wiki page for the training process.
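If training Tesseract is more than you want to take on, another option is to pre-process the page images: detect the solid black bars, record their positions, and either white them out before OCR or use the coordinates to splice X's (or ▮ characters) into the output text. A rough sketch with OpenCV (the threshold and size limits are guesses you would need to tune for these scans, and OpenCV 4.x is assumed for the findContours signature):

import cv2

def find_redaction_boxes(image_path, min_width=40, min_height=10):
    # Load the scan in grayscale and isolate very dark regions.
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, dark = cv2.threshold(gray, 60, 255, cv2.THRESH_BINARY_INV)

    # Find connected dark blobs and keep the ones shaped like redaction bars.
    contours, _ = cv2.findContours(dark, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w >= min_width and h >= min_height and w > 2 * h:
            boxes.append((x, y, w, h))
    return boxes

def whiteout_redactions(image_path, out_path):
    # Paint the detected bars white so the OCR engine simply skips them;
    # the returned coordinates are what you would use to insert placeholders later.
    boxes = find_redaction_boxes(image_path)
    img = cv2.imread(image_path)
    for x, y, w, h in boxes:
        cv2.rectangle(img, (x, y), (x + w, y + h), (255, 255, 255), -1)
    cv2.imwrite(out_path, img)
    return boxes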

To use unicode or not in web development project using flask and sqlalchemy

I am working on a web development project using Flask and the SQLAlchemy ORM. My post is about the use of unicode in developing this app.
What I have understood so far about unicode:
If I want my webapp to handle data in languages other than English, I need to use the unicode type for my variables, because plain byte strings can't handle unicode data.
I use a database which either stores unicode data or takes responsibility for converting unicode to raw bytes while saving and back while retrieving. SQLAlchemy gives me the option to set up automatic conversion both ways, so that I don't have to worry about it.
I am using Python 2.7, so I have to be careful to process unicode data properly; normal string operations on unicode data may be buggy.
Correct me if any of the above assumptions are wrong.
Now my doubts/questions:
If I don't use unicode now, will I have problems if I (or the Flask people) decide to port to Python 3?
I don't want to hassle with the thought of my webapp catering to different languages right now; I just want to concentrate on creating the app first. Can I add that later, without using unicode right now?
If I use unicode now, how does it affect my code? Do I replace every string input and output with unicode, or what?
Can the conversion of unicode when saving to the database be a source of performance problems?
Basically I am asking whether to use unicode or not, having explained my needs and requirements for the project.
No, but make sure you separate binary data from text data. That makes it easier to port.
It's easier to use Unicode from the start, but of course you can postpone it. But it's really not very difficult.
You replace everything that should be text data with Unicode, yes.
Only if you make loads of conversions of really massive amounts of text.
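To illustrate "replace everything that should be text data with Unicode" on the SQLAlchemy side, here is a minimal sketch using the Unicode/UnicodeText column types, which let SQLAlchemy handle encoding to and from the database for you (the model and column names are invented for the example):

# -*- coding: utf-8 -*-
from sqlalchemy import Column, Integer, Unicode, UnicodeText, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Post(Base):  # hypothetical model
    __tablename__ = "posts"
    id = Column(Integer, primary_key=True)
    title = Column(Unicode(200))   # short text, stored and returned as unicode
    body = Column(UnicodeText)     # long text, stored and returned as unicode

engine = create_engine("sqlite://")  # in-memory database, just for the example
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(Post(title=u"Grüße aus Köln", body=u"Unicode body text"))
session.commit()

In your own code the rule of thumb is the same: text values are unicode objects, and anything that is genuinely binary (file uploads, hashes) stays as bytes.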

Normalizing Unicode data for indexing (for Multi-byte languages): What products do this? Does Lucene/Hadoop/Solr?

I have several (1 million+) documents, email messages, etc., that I need to index and search through. Each document potentially has a different encoding.
What products (or configuration for the products) do I need to learn and understand to do this properly?
My first guess is something Lucene-based, but this is something I'm just learning as I go. My main desire is to start the time-consuming encoding process ASAP so that we can concurrently build the search front end. This may require some sort of normalisation of double-byte characters.
Any help is appreciated.
Convert everything to UTF-8 and run it through Normalization Form D, too. That will help for your searches.
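A rough sketch of that pre-processing step in Python, assuming the third-party chardet library for encoding detection (one option among several) and falling back to UTF-8 with replacement when detection fails:

# -*- coding: utf-8 -*-
import io
import unicodedata
import chardet  # assumed third-party dependency for encoding detection

def to_normalized_utf8(in_path, out_path):
    with open(in_path, "rb") as f:
        raw = f.read()

    # Guess the source encoding; fall back to UTF-8 and replace undecodable bytes.
    guess = chardet.detect(raw).get("encoding") or "utf-8"
    text = raw.decode(guess, errors="replace")

    # Normalization Form D, as suggested above, so canonically equivalent
    # character sequences end up identical before they reach the indexer.
    text = unicodedata.normalize("NFD", text)

    with io.open(out_path, "w", encoding="utf-8") as f:
        f.write(text)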
You could try Tika.
Are you implying you need to transform the documents themselves? This sounds like a bad idea, especially on a large, heterogeneous collection.
A good search engine will have robust encoding detection. Lucene does, and Solr uses it (Hadoop isn't a search engine). And I don't think it's possible to have a search engine that doesn't use a normalised encoding in its internal index format. So normalisation won't be a selection criterion, though trying out the encoding detection would be.
I suggest you use Solr. The ExtractingRequestHandler handles encodings and document formats. It is relatively easy to get a working prototype using Solr. DataImportHandler enables importing a document repository into Solr.
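For a sense of how little code that takes, here is a hedged sketch of pushing one document through Solr's ExtractingRequestHandler from Python; the URL, core name, and document id are placeholders, and the handler has to be enabled in your solrconfig.xml:

import requests  # assumed third-party dependency

SOLR_EXTRACT_URL = "http://localhost:8983/solr/mycore/update/extract"  # placeholder core

def index_document(path, doc_id):
    # Solr Cell (the ExtractingRequestHandler) detects the document format and
    # encoding, extracts the text, and indexes it under the given literal id.
    with open(path, "rb") as f:
        resp = requests.post(
            SOLR_EXTRACT_URL,
            params={"literal.id": doc_id, "commit": "true"},
            files={"file": (path, f)},
        )
    resp.raise_for_status()

index_document("message0001.eml", "message0001")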

Storing parts of user data in files for preventing SQL injection

I am new to web programming and have been exploring issues related to web security.
I have a form where the user can post two types of data - lets call them "safe" and "unsafe" (from the point of view of sql).
Most places recommend storing both parts of the data in database after sanitizing the "unsafe" part (to make it "safe").
I am wondering about a different approach - to store the "safe" data in the database and the "unsafe" data in files (outside the database). Of course this approach creates its own set of problems related to maintaining the association between files and DB entries. But are there any other major issues with this approach, especially related to security?
UPDATE: Thanks for the responses! Apologies for not being clear regarding what I am considering "safe", so some clarification is in order. I am using Django, and the form data that I am considering "safe" is accessed through the form's "cleaned_data" dictionary, which does all the necessary escaping.
For the purpose of this question, let us consider a wiki page. The title of a wiki page does not need to have any styling attached to it, so it can be accessed through the form's "cleaned_data" dictionary, which will convert the user input to a "safe" format. But since I wish to provide users the ability to arbitrarily style their content, I can't perhaps access the content part using the "cleaned_data" dictionary. Does the file approach solve the security aspects of this problem? Or are there other security issues that I am overlooking?
You know the "safe" data you're talking about? It isn't. It's all unsafe and you should treat it as such. Not by storing it all in files, but by properly constructing your SQL statements.
As others have mentioned, using prepared statements, or a library which simulates them, is the way to go, e.g.
$db->Execute("insert into foo(x,y,z) values (?,?,?)", array($one, $two, $three));
What do you consider "safe" and "unsafe"? Are you considering data with the slashes escaped to be "safe"? If so, please don't.
Use bound variables with SQL placeholders. It is the only sensible way to protect against SQL injection.
Splitting your data will not protect you from SQL injection, it'll just limit the data which can be exposed through it, but that's not the only risk of the attack. They can also delete data, add bogus data and so on.
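Since the question mentions Django/Python, here is a minimal sketch of what bound parameters look like with Python's DB-API (sqlite3 is used only to keep the example self-contained; Django's ORM does the same binding for you automatically):

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, just for the example
conn.execute("CREATE TABLE pages (title TEXT, body TEXT)")

title = "My wiki page"
body = "'; DROP TABLE pages; --"  # hostile-looking input stays inert

# Placeholders: the driver binds the values, so they are never parsed as SQL.
conn.execute("INSERT INTO pages (title, body) VALUES (?, ?)", (title, body))
conn.commit()

print(conn.execute("SELECT title, body FROM pages").fetchall())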
I see no justification for your approach, especially given that prepared statements are supported in many, if not all, development platforms and databases.
And that's without even getting into the maintenance nightmare your approach will end up being.
In the end, why would you use a database if you don't trust it? Just use plain files if you wish; a mix is a no-no.
SQL injection can target the whole database, not just one user's data, and it is a matter of the query itself (a poisoned query). So for me the best way (if not the only one) to avoid SQL injection attacks is to control your queries and protect them from being injected with malicious input, rather than splitting the storage.