Should I support Unicode in passwords? - unicode

I would like to allow my users to use Unicode for their passwords.
However I see a lot of sites don't support that (e.g. Gmail, Hotmail).
So I'm wondering if there's some technical or usability issue that I'm overlooking.
I'm thinking if anything it must be a usability issue since by default .NET accepts Unicode and if Hotmail--er, the new Live mail--is built on that, I don't see why they would restrict it.
Has anyone encountered similar issues?

I am sure there is no technical problem, but maybe Gmail and Hotmail deliberately don't support it. Websites like these have a wide audience and should be accessible from everywhere.
Imagine a user whose password is in Japanese. If they are travelling and go to a cyber cafe with no Japanese input support, they won't be able to log in.
Another problem is analyzing password complexity. It's not so difficult to make sure the user didn't type a common English word, but what about Chinese, Russian or Thai? Analyzing the complexity of a password gets much harder as you add more languages.
So if you want your system to be accessible, it's better to ensure that users can type their password on every kind of device, OS and environment. An alphanumeric password with the most common symbols (!<>"#$%& etc.) is a good set of characters that is available everywhere.

Generally I am strongly in favor of not restricting what kinds of characters are allowed in passwords. However, remember that you have to compare something to something stored which may be the password or a hash. In the former case you have to make sure that comparison is done correctly which is much more complex with Unicode than with ASCII alone; in the latter case you would have to ensure that you are hashing exactly the same whenever it is entered. Normalization forms may help here or be a curse, depending on who applies them.
For example, in an application I'm working on I am using a hash over a UTF-8 conversion of the password which was normalized beforehand to weed out potential problems with combining characters and such.
The biggest problem the user may face is that they can't enter it in some places, like on another keyboard layout. This is already the case for one of my passwords but never was a problem so far. And after all, that's a decision the user has to make in choosing their password and not one the application should make on behalf of the user. I doubt there are users who happily use arbitrary Unicode in their passwords and not think of the problems that may arise when using another keyboard layout. (This may be an issue for web-based services more than anything else, though.)
There are instances where Unicode is rightly forbidden, though. One such example is TrueCrypt which forces the use of the US keyboard layout for boot-time passwords (for full-volume encryption). There is no other layout there and therefore Unicode or any other keyboard layout only produces problems.
However, that doesn't explain why they forbid Unicode in normal passwords. A warning might be nice but outright forbidding is wrong in my eyes.

So I'm wondering if there's some technical or usability issue that I'm overlooking.
There's a technical issue with non-ASCII passwords (and usernames, for that matter) with HTTP Basic Authentication. As far as I know the sites you mentioned don't generally use Basic Authentication, but it might be a hangover from systems that do.
The HTTP Basic Authentication standard defines a base64-encoded username:password token. The token is split at the first colon, so a colon in the username makes the result ambiguous. Also, base64-decoding the token gives you only bytes, with no indication of how to convert those bytes to characters. And guess what? Different browsers use different encodings to do it.
Opera and Chrome use UTF-8.
IE uses the client system's default code page (which is of course never UTF-8) and mangles characters that don't fit in it using the Windows standard Try To Find A Character That Looks A Bit Like It, Or Maybe Just Not (Who Cares) algorithm.
Safari uses ISO-8859-1, and silently refuses to send any auth token at all when the username or password has characters that don't fit.
Mozilla takes the lowest 8 bits of the code point (similar to ISO-8859-1, but more broken). See bug 41489 for tortuous discussion with no outcome or progress.
So if you allow non-ASCII usernames or passwords then the Basic Authentication process will be at best complicated and inconsistent, with users wondering why it randomly works or fails when they use different computers or browsers.
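To make the encoding problem concrete, here is a small Python sketch. The credentials and the basic_token helper are made up for illustration. It builds the Basic token twice from the same username and password, once as UTF-8 and once as ISO-8859-1, and then shows what a server sees if it guesses the wrong charset.

```python
import base64

def basic_token(username: str, password: str, encoding: str) -> str:
    """Build the value that follows 'Basic ' in the Authorization header."""
    raw = f"{username}:{password}".encode(encoding)
    return base64.b64encode(raw).decode("ascii")

# Hypothetical credentials with one non-ASCII character.
user, password = "alice", "p\u00e4ssword"

utf8_token = basic_token(user, password, "utf-8")
latin1_token = basic_token(user, password, "iso-8859-1")

# Same credentials, different bytes on the wire.
print(utf8_token == latin1_token)   # False

# A server that assumes ISO-8859-1 but receives UTF-8 recovers another password.
decoded = base64.b64decode(utf8_token).decode("iso-8859-1")
recovered_password = decoded.split(":", 1)[1]   # split at the first colon only
print(recovered_password == password)           # False
```

With a password outside ISO-8859-1 (Japanese, say) the second encode call would raise UnicodeEncodeError, which is roughly the situation where a browser has to either mangle the value or refuse to send it.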

No. Restrict passwords to ASCII characters.
When you input a password, bullets are displayed to conceal the password.
But when you input Japanese and other languages, you must go through an input method, converting the keystrokes into the desired characters. This requires you to see what the characters are.

I support Unicode passwords in all of my web applications. If using a recent browser the visitor can use any code point in their preferred or native scripts.
For enhanced security I store a salted hash rather than using reversible encryption.
The important thing is to correctly normalize and encode the password string before adding the byte sequence to the hash (I prefer UTF-8 for endian independence).
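A minimal Python sketch of that normalize, encode, then hash order. The choice of NFC, PBKDF2-SHA256, the iteration count and the function names are all my assumptions for illustration, not something prescribed above. Pick the normalization form and work factor that suit your system.

```python
import hashlib
import hmac
import os
import unicodedata
from typing import Optional, Tuple

ITERATIONS = 200_000  # assumed work factor; tune for your hardware

def prepare(password: str) -> bytes:
    """Normalize first, then encode, so equal-looking input gives equal bytes."""
    return unicodedata.normalize("NFC", password).encode("utf-8")

def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is generated when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", prepare(password), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the candidate with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)

# Stored with a precomposed n-tilde, entered later as n + combining tilde.
salt, stored = hash_password("pa\u00f1uelo")
print(verify_password("pan\u0303uelo", salt, stored))   # True
```

Without the normalize step the two spellings would produce different UTF-8 bytes and the login would fail even though the user typed what looks like the same password.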

Unicode sucks if you have to do programmatic matching. The "minus sign" and "dash" look the same, but might be separate codes. "n with a funny tilde over it" might be one letter, or a diacritic and a letter.
If people use different encoding methods, then their passwords might not match, even though the passwords look the same. See omg-ponies aka humanity=epic fail.
You can normalize, but what happens when:
the normalization rules change
you have some users with diacritics in their password
you have some users with combined letters in their password
the passwords are hashed, so you can't change the passwords
Guess what - you need to force a password reset on some of your users.
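The look-alike problem is easy to reproduce. This Python sketch uses only the standard unicodedata module and compares the two spellings of the n with a tilde, and then the ASCII hyphen-minus against the MINUS SIGN character. It only illustrates the matching issue; it is not a fix for it.

```python
import unicodedata

composed = "\u00f1"       # n with tilde as a single code point
decomposed = "n\u0303"    # plain n followed by COMBINING TILDE

# They render the same but compare unequal as raw strings.
print(composed == decomposed)                                  # False
# After normalization to NFC they match.
print(unicodedata.normalize("NFC", decomposed) == composed)    # True

hyphen_minus = "\u002d"   # the "-" key on most keyboards
minus_sign = "\u2212"     # MINUS SIGN, visually almost identical

# Normalization does not merge these two; they remain distinct characters.
same_after_nfkc = (
    unicodedata.normalize("NFKC", minus_sign)
    == unicodedata.normalize("NFKC", hyphen_minus)
)
print(same_after_nfkc)    # False
```

So normalization takes care of the diacritic case but does nothing for characters that merely look alike.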

Good idea.
Makes the password stronger, gives more freedom to the users.
And it is already done by Windows (since at least Win 2000), Active Directory and LDAP, Novell (since at least 2004).
Some customers want it (http://mailman.mit.edu/pipermail/kerberos/2008-July/013923.html) and there is even a standard on how to do it right (https://www.rfc-editor.org/rfc/rfc8265, which supersedes https://www.rfc-editor.org/rfc/rfc4013, thanks John).

I'm sure that the multilingual counterparts of those sites do support Unicode. It sounds like a user requirements issue rather than a technical challenge.

I would not be surprised if there is a technical issue with the server not being certain of the encoding the client is sending the password in.
However, I would guess that, say, sites with mainly native-speaking Japanese, Chinese or Russian audiences would use the commonly used respective non-ASCII character set (Big5, EUC-KR, koi8, etc.) for passwords. Maybe you can research what they are doing to cope with older web clients using any of the non-Unicode stuff.

With HTML5 you can send your users a font, so you can integrate a visual keyboard into your system and let users type in your language.
Hint: use the DejaVu font and trim it down with FontForge to make it smaller. Combined with a visual JavaScript keyboard, that makes it possible ;)
Look here, it is a project where I did the trick.

Related

Is it still necessary to support plain-text emails?

We're creating a web application that sends emails for different purposes. Since we need to embed images and links in some of the messages, HTML is a must.
Most of the email messages can be customized by our users. We provide a web-based editor to do that. Requiring our users to always enter two message variants, one in HTML and one in plain text, is not an option; that's just annoying for the users. So our current approach is to specify a plain-text part with something like "please use an HTML-capable mail client".
Is this a valid approach, or do I break certain clients that could still be relevant? I know that this question depends on our user base, but I'd like to get a general suggestion for "most cases" in the year 2015.
If this is not an option, are there any sensible ways to automatically construct a plain-text message out of the HTML message?
IMO it's a valid approach. I can't think of any commonly-used mail clients that are still plain-text only. Can you?
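On the second part of the question, deriving the plain-text part automatically is doable. Below is a rough Python sketch using only the standard library. The TextExtractor class, the list of block tags and the way links are rendered are my own assumptions, and it does not try to skip script or style content. It strips the tags, keeps link targets and image alt text, and packs both variants into one multipart/alternative message.

```python
from email.message import EmailMessage
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, adding line breaks, link targets and alt text."""

    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a":
            self._href = attr_map.get("href")
        elif tag == "img" and attr_map.get("alt"):
            self.parts.append(f"[{attr_map['alt']}]")
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.parts.append(f" <{self._href}>")  # keep the link target readable
            self._href = None
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html_body: str) -> str:
    """Return a plain-text rendering with whitespace collapsed per line."""
    parser = TextExtractor()
    parser.feed(html_body)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)

def build_message(subject: str, html_body: str) -> EmailMessage:
    """Build a message carrying a generated text part and the HTML part."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg.set_content(html_to_text(html_body))        # text/plain part
    msg.add_alternative(html_body, subtype="html")  # text/html part
    return msg

sample = '<p>Hello <b>world</b></p><p>See <a href="https://example.com/x">our site</a>.</p>'
print(html_to_text(sample))
```

The generated text will never be as nice as a hand-written variant, but it is usually more useful to a text-only reader than a "please use an HTML-capable mail client" placeholder.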

Do ESPs penalize accented vowels for spam?

I'm using a mailing platform which provides a spam score/meter.
When testing a subject line with an accented vowel such as "envía", the spam meter alerts that I'm using special characters which increases the chances my mail goes to the spam folder.
This platform has awful support for any language other than English, and I'm wondering if it's just that or if accented characters are really penalized.
It depends on the particular spam filter and how it's configured. Some filters do, or can be configured to, penalize accented characters (non-ASCII characters in general), but not so much so that it will automatically hit the spam threshold.
Since the filters are on the clients' end and you have no control over it, I would recommend not worrying about it too much. If your emails genuinely are not spam, they should get through fine. If they don't, chances are the client has an overly aggressive filter and there's not a lot you can do about it.
The reason non-ASCII characters might be penalized is that spammers often use them to disguise keywords. For example, "viagra" could be spelled with an accented í as "víagra"; this would circumvent a naive filter programmed to penalize emails containing the word "viagra." I don't know this for certain, but I would imagine the more advanced filters are smart enough to heuristically distinguish this type of usage from genuine human language using accented characters.
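I don't know how any real filter is implemented, but the idea of seeing through accent tricks without penalizing accents as such can be sketched in a few lines of Python. The keyword list and the function names are hypothetical. The sketch folds accents away before matching, so "víagra" is caught while "envía" passes.

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

BLOCKED = {"viagra"}   # hypothetical keyword list

def mentions_blocked_word(subject: str) -> bool:
    """Match keywords on the accent-folded, lower-cased subject line."""
    words = fold_accents(subject).lower().split()
    return any(word.strip(".,!?") in BLOCKED for word in words)

print(mentions_blocked_word("Cheap v\u00edagra!"))       # True
print(mentions_blocked_word("Env\u00eda tu pedido hoy"))  # False
```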

Clickable email-links encryption? How to do them?

I would like to know if and how it is possible to create clickable email links for websites that are "encrypted" in such a way that email spiders can't collect them, while living users can still click them to open their email client, or even copy them.
I saw some links that were done in JavaScript, but I don't know how they did this or how "safe" they are.
Thank you in advance for any reply.
Most approaches to this are splitting the address across multiple elements and inserting extra formatting; then for JS-enabled browsers, they use JavaScript to turn it back into an e-mail address.
The poster example for this is SpamSpan, which even has several "levels" of obfuscation: at each level the source code resembles an e-mail address less and less, yet JS still manages to piece it back together. Although some spambots today are supposedly capable of executing JavaScript, the vast majority aren't, and the e-mails are still human-readable with JS off. An advantage of JS-assisted de/obfuscation is that it doesn't rely on external servers; you just need to integrate the JS library, which is simple.
Another approach is taken by reCAPTCHA Mailhide - the e-mail is revealed only after solving a CAPTCHA (same type as for normal reCAPTCHA). This is less convenient for the user, but practically safe against robots. A disadvantage of this is that it depends on reCAPTCHA's servers (in essence, on Google) - some people are dead-set against any external dependencies.
This would be a very simple and effective way:
Scramble email addresses
All it does is convert it into ASCII, and all you need to do is insert it where your email address would go!
Although there are more (crazily) secure ways you can choose, this would be the simple option. You can also try this solution, it uses JavaScript to protect your email.
Hope this helps!
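I can't see what the linked scrambler does internally, so the following Python sketch assumes it emits decimal HTML character references, which is what such tools typically do. The to_entities helper and the address are placeholders. Browsers render the result as a normal, clickable address.

```python
import html

def to_entities(address: str) -> str:
    """Replace every character with its decimal HTML character reference."""
    return "".join(f"&#{ord(ch)};" for ch in address)

address = "user@example.com"   # placeholder address
encoded = to_entities(address)

# Paste this where the address would normally go in the page.
link = f'<a href="mailto:{encoded}">{encoded}</a>'
print(link)

# Decoding is a single standard-library call, so this only stops naive bots.
print(html.unescape(encoded) == address)   # True
```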

How to overcome fear of user-input (web development)

I'm writing a web application for public consumption... How do you get over or deal with the fear of user input? As a web developer, you know the tricks and holes that can be exploited, particularly on the web, and add-ons like Firebug make them all the easier to exploit.
Sometimes it's so overwhelming you just want to forget the whole deal (does make you appreciate Intranet Development though!)
Sorry if this isn't a question that can be answered simply, but perhaps ideas or strategies that are helpful...Thanks!
One word: server-side validation (ok, that may have been three words).
There's lots of sound advice in other answers, but I'll add a less "programming" answer:
Have a plan for dealing with it.
Be ready for the contingency that malicious users do manage to sneak something past you. Have plans in place to mitigate damage, restore clean and complete data, and communicate with users (and potentially other interested parties such as the issuers of any credit card details you hold) to tell them what's going on. Know how you will detect the breach and close it. Know that key operational and development personnel are reachable, so that a bad guy striking at 5:01pm on the Friday before a public holiday won't get 72+ clear hours before you can go offline let alone start fixing things.
Having plans in place won't help you stop bad user input, but it should help a bit with overcoming your fears.
If it's "security"-related concerns, you just need to push through. Security and exploits are a fact of life in software, and they need to be addressed head-on as part of the development process.
Here are some suggestions:
Keep it in perspective - Security, Exploits and compromises are going to happen to any application which is popular or useful, be prepared for them and expect them to occur
Test it, then test it again - QA, Acceptance testing and sign off should be first class parts of your design and production process, even if you are a one-man shop. Enlist users to test as a dedicated (and vocal) user will be your most useful tool in finding problems
Know your platform - Make sure you know the technology, and hardware you are deploying on. Ensure that relevant patches and security updates are applied
Research - look at applications similar to your own and see what issues they experience, surf their forums, read their bug logs etc.
Be realistic - You are not going to be able to fix every bug and close every hole. Pick the most impactful ones and address those
Lots of eyes - Enlist as many people to review your designs and code as possible. This should be in addition to your QA resources
You don't get over it.
Check everything at server side - validate input again, check permissions, etc.
Sanitize all data.
That's very easy to write in bold letters and a little harder to do in practice.
Something I always did was wrap all user strings in an object, something like StringWrapper, which forces you to call an encoding method to get the string. In other words, just provide access to s.htmlEncode(), s.urlEncode().htmlEncode(), etc. Of course you sometimes need the raw string, so you can have a s.rawString() method, but now you have something you can grep for to review all uses of raw strings.
So when you come to 'echo userString' you will get a type error, and you are then reminded to encode/escape the string through the public methods.
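A rough Python take on the wrapper idea, for illustration only. The class and method names are mine, and unlike the chained s.urlEncode().htmlEncode() above, these methods return plain strings to keep the sketch short. Python has no compile-time type error here, so the wrapper raises TypeError whenever it is formatted or printed directly.

```python
import html
from urllib.parse import quote

class UserString:
    """Holds untrusted text; callers must pick an encoding to read it."""

    __slots__ = ("_raw",)

    def __init__(self, raw: str):
        self._raw = raw

    def html_encode(self) -> str:
        """Safe for HTML text and quoted attribute values."""
        return html.escape(self._raw, quote=True)

    def url_encode(self) -> str:
        """Safe as a single URL path segment or query value."""
        return quote(self._raw, safe="")

    def raw_string(self) -> str:
        """Escape hatch; grep for raw_string( during code review."""
        return self._raw

    def __str__(self) -> str:
        # Printing or formatting the wrapper directly is treated as a bug.
        raise TypeError("choose html_encode(), url_encode() or raw_string()")

comment = UserString("<script>alert('x')</script>")
print(comment.html_encode())
```

Concatenating the wrapper with a str fails as well, because str + UserString is not defined, which gives the same reminder as the 'echo userString' type error.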
Some other general things:
Prefer white-lists over black lists
Don't go overboard with stripping out bad input. I want to be able to use the < character in posts/comments/etc! Just make sure you encode data correctly
Use parameterized SQL queries. If you are SQL escaping user input yourself, you are doing it wrong.
First, I'll try to comfort you a bit by pointing out that it's good to be paranoid. Just as it's good to be a little scared while driving, it's good to be afraid of user input. Assume the worst as much as you can, and you won't be disappointed.
Second, program defensively. Assume any communication you have with the outside world is entirely compromised. Take in only parameters that the user should be able to control. Expose only that data that the user should be able to see.
Sanitize input. Sanitize sanitize sanitize. If it's input that will be displayed on the site (nicknames for a leaderboard, messages on a forum, anything), sanitize it appropriately. If it's input that might be sent to SQL, sanitize that too. In fact, don't even write SQL directly, use an intermediary of some sort.
There's really only one thing you can't defend against if you're using HTTP. If you use a cookie to identify somebody, there's nothing you can do to prevent somebody else in a coffeehouse from sniffing that cookie when they're both on the same wireless connection. As long as they're not using a secure connection, nothing can save you from that. Even Gmail isn't safe from that attack. The only thing you can do is make sure an authorization cookie can't last forever, and consider making users log in again before they do something big like changing a password or buying something.
But don't sweat it. A lot of the security details have been taken care of by whatever system you're building on top of (you ARE building on top of SOMETHING, aren't you? Spring MVC? Rails? Struts? ). It's really not that tough. If there's big money at stake, you can pay a security auditing company to try and break it. If there's not, just try to think of everything reasonable and fix holes when they're found.
But don't stop being paranoid. They're always out to get you. That's just part of being popular.
P.S. One more hint. If you have javascript like this:
if( document.forms["myForm"]["payment"].value < 0 ) {
    alert("You must enter a positive number!");
    return false;
}
Then you'd better sure as hell have code in the backend that goes:
verify( input.payment >= 0 )
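For what it's worth, here is how that backend check might look in Python. The parse_payment function and the ValidationError class are hypothetical names, and I'm assuming the field arrives as a string. The point is that the server re-parses and re-checks the value instead of trusting the browser script.

```python
from decimal import Decimal, InvalidOperation

class ValidationError(ValueError):
    """Raised when submitted data fails a server-side check."""

def parse_payment(raw: str) -> Decimal:
    """Re-check on the server what the browser script claims to have checked."""
    try:
        amount = Decimal(raw.strip())
    except InvalidOperation:
        raise ValidationError("payment must be a number") from None
    # Decimal also parses "NaN" and "Infinity", so reject those explicitly.
    if not amount.is_finite() or amount < 0:
        raise ValidationError("payment must be a non-negative number")
    return amount

print(parse_payment("12.50"))   # 12.50
```

Anything sent with a hand-crafted request, such as "-5" or "abc", ends up as a ValidationError regardless of what the JavaScript did.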
"Quote" everything so that it can not have any meaning in the 'target' language: SQL, HTML, JavaScript, etc.
This will get in the way of course, so you have to be careful to identify when this needs special handling, like through administrative privileges to deal with some of the data.
There are multiple types of injection and cross-site scripting (see this earlier answer), but there are defenses against all of them. You'll clearly want to look at stored procedures, white-listing (e.g. for HTML input), and validation, to start.
Beyond that, it's hard to give general advice. Other people have given some good tips, such as always doing server-side validation and researching past attacks.
Be vigilant, but not afraid.
No validation in web-application layer.
All validations and security checks should be done by the domain layer or business layer.
Throw exceptions with valid error messages and let these exceptions be caught and processed at the presentation layer or web application.
You can use a validation framework to automate validations with the help of custom validation attributes.
http://imar.spaanjaars.com/QuickDocId.aspx?quickdoc=477
There should be some documentation of known exploits for the language/system you're using. I know the Zend PHP Certification covers that issue a bit and you can read the study guide.
Why not hire an expert to audit your applications from time to time? It's a worthwhile investment considering your level of concern.
Our client always says: "Deal with my users as if they don't differentiate between the date and text fields!!"
I code in Java, and my code is full of asserts. I assume everything from the client is wrong and check it all on the server.
#1 thing for me is to always construct static SQL queries and pass your data as parameters. This limits the quoting issues you have to deal with enormously. See also http://xkcd.com/327/
This also has performance benefits, as you can re-use the prepared queries.
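A small self-contained illustration in Python with the standard sqlite3 module. The table and the hostile name are borrowed from the xkcd strip; the rest is my own sketch. The SQL text is a constant and the user value is passed separately, so it is never parsed as SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")

# Static SQL text; the ? placeholder is filled in by the driver.
INSERT_STUDENT = "INSERT INTO students (name) VALUES (?)"
FIND_STUDENT = "SELECT name FROM students WHERE name = ?"

hostile = "Robert'); DROP TABLE students;--"
conn.execute(INSERT_STUDENT, (hostile,))   # the value travels as a parameter
conn.commit()

# The table is still there and the name was stored verbatim.
rows = conn.execute(FIND_STUDENT, (hostile,)).fetchall()
print(rows)
```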
There are actually only 2 things you need to take care with:
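Avoid SQL injection. Use parameterized queries to save user-controlled input in the database. In Java terms: use PreparedStatement. In PHP terms: use prepared statements via PDO or mysqli (mysql_real_escape_string() only escapes, and the mysql_* extension it belongs to was removed in PHP 7).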
Avoid XSS. Escape user-controlled input during display. In Java/JSP terms: use JSTL <c:out>. In PHP terms: use htmlspecialchars().
That's all. You don't need to worry about the format of the data, just about how you handle it.

Email obfuscation question

Yes, I realize this question was asked and answered, but I have specific questions about this that I feel were not clear on that thread and I'd prefer not to get lost in the shuffle on another thread as well.
Previous threads said that rendering the email address to an image the way Facebook does is overkill and unprofessional user experience for business/professional websites. And it seems that the general consensus is to use a JavaScript document.write solution using html entities or some other method that breaks up and/or makes the string unreadable by a simple bot. The application I'm building doesn't even need the "mailto:" functionality, I just need to display the email address. Also, this is a business web application, so it needs to look/act as professional as possible. Here are my questions:
If I go the document.write route and pass the html entity version of each character, are there no web crawlers sophisticated enough to execute the javascript and pull the rendered text anyway? Or is this considered best practice and completely (or almost completely) spammer proof?
What's so unprofessional about the image solution? If Facebook is one of the highest trafficked applications in the world and not at all run by amateurs, why is their method completely dismissed in the other thread about this subject?
If your answer (as in the other thread) is to not bother myself with this issue and let the users' spam filters do all the work, please explain why you feel this way. We are displaying our users' email addresses that they have given us, and I feel responsible to protect them as much as I can. If you feel this is unnecessary, please explain why.
Thanks.
It is not spammer proof. If someone looks at the code for your site and determines the pattern that you are using for your email addresses, then specific code can be written to try and decipher that.
I don't know that I would say it is unprofessional, but it prevents copy-and-paste functionality, which is quite a big deal. With images, you simply don't get that functionality. What if you want to copy a relatively complex email address to your address book in Outlook? You have to resort to typing it out which is prone to error.
Moving the responsibility to the users' spam filters is really a poor response. While I believe that users should be diligent in guarding against spam, that doesn't absolve the person publishing the address from responsibility.
To that end, trying to do this in an absolutely secure manner is nearly impossible. The only way to do that is to have a shared secret which the code uses to decipher the encoded email address. The problem with this is that because the javascript is interpreted on the client side, there isn't anything that you can keep a secret from scrapers.
Encoders for email addresses nowadays generally work because most email harvesting bots aren't going to concern themselves with coding specifically for every site. They are going to try to have a minimal algorithm which will get maximum results (the payoff isn't worth it otherwise). Because of this, simple encoders will defeat most bots. But if someone REALLY wants to get at the emails on your site, then they can, and probably easily as well, since the code that writes the addresses is publicly available.
Taking all this into consideration, it makes sense that Facebook went the image route. Because they can alter the image to make OCR all but impossible, they can virtually guarantee that email addresses won't be harvested. Given that they are probably one of the largest email address repositories in the world, it could be argued that they carry a heavier burden than any of us, and while inconvenient, are forced down that route to ensure security and privacy for their vast user base.
Quite a few reasons Javascript is a good solution for now (that may change as the landscape evolves).
Javascript obfuscation is a better mouse trap for now
You just need to outrun the others. As long as there is low-hanging fruit, spammers will go for that. So unless everyone starts moving to JavaScript, you're okay for now at least.
Most spammers use HTTP-based scripts which GET a page and parse it using regex. Using a JavaScript engine to parse is certainly possible but will slow things down.
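To illustrate the GET-and-regex point, here is a toy harvester in Python. The regular expression and the sample pages are made up and much simpler than anything a real bot would use. It finds an address written in plain HTML but gets nothing from a page where the address is only assembled by a script.

```python
import re

# Deliberately simple address pattern, like a low-effort harvester might use.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def harvest(page_source: str) -> list:
    """Return everything in the raw page source that looks like an address."""
    return EMAIL_RE.findall(page_source)

plain_page = "<p>Contact: user@example.com</p>"
js_page = (
    '<script>var u = "user", d = "example.com";'
    ' document.write(u + "@" + d);</script>'
)

print(harvest(plain_page))   # ['user@example.com']
print(harvest(js_page))      # []
```

A bot that actually executes the script would of course see the assembled address, which is why this only raises the cost and is not a guarantee.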
Regarding the facebook solution, I don't consider it unprofessional but I can clearly see why purists may disagree.
It breaks accessibility standards (it cannot be parsed by browsers or voice readers, and cannot be clicked).
It breaks semantic construct (it's an image, not a mailto link anymore)
It breaks the presentational layer. If you increase browser default font size or use high contrast custom CSS, it won't apply to the email.
Here is a nice blog post comparing a few methods, with benchmarks.
http://techblog.tilllate.com/2008/07/20/ten-methods-to-obfuscate-e-mail-addresses-compared/