I'm using a mailing platform which provides a spam score/meter.
When I test a subject line with an accented vowel, such as "envía", the spam meter warns that I'm using special characters, which increases the chance my mail goes to the spam folder.
This platform has awful support for any language other than English, and I'm wondering if it's just that, or if accented characters really are penalized.
It depends on the particular spam filter and how it's configured. Some filters do, or can be configured to, penalize accented characters (non-ASCII characters in general), but usually not so heavily that this alone pushes a message over the spam threshold.
Since the filters are on the clients' end and you have no control over them, I would recommend not worrying about it too much. If your emails genuinely are not spam, they should get through fine. If they don't, chances are the client has an overly aggressive filter and there's not a lot you can do about it.
The reason non-ASCII characters might be penalized is that spammers often use them to disguise keywords. For example, "viagra" could be spelled with an accented í as "víagra"; this would circumvent a naive filter programmed to penalize emails containing the word "viagra." I don't know this for certain, but I would imagine the more advanced filters are smart enough to heuristically distinguish this type of usage from genuine human language using accented characters.
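To illustrate the idea (this is only a sketch, not how any particular filter is implemented; the blocklist here is made up):

```python
import unicodedata

# Hypothetical blocklist of keywords a naive filter might look for.
BLOCKED_KEYWORDS = {"viagra"}

def fold_accents(text):
    """Strip combining marks so 'víagra' folds to 'viagra'."""
    # NFKD decomposes accented characters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def contains_disguised_keyword(subject):
    folded = fold_accents(subject).lower()
    return any(word in folded for word in BLOCKED_KEYWORDS)

print(contains_disguised_keyword("Buy víagra now!"))     # True
print(contains_disguised_keyword("Te envía un saludo"))  # False
```

A filter that folds accents this way can catch disguised keywords without wholesale penalizing every accented subject line, which is presumably what the smarter filters do.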
Related
I have lots of emails of my users that I am sending email to, but I don't know their language preference. Is there a way that I can code my email so that it can detect the language preference when a user opens the email, and it will pull the text from a set of translations that I have already made?
I currently send emails in English, but I know that I have users from S. America and Europe. But, all of these emails are .com domains.
Any thoughts?
I do not think there is a way to detect this in your email code. Even if there were, there would likely be problems:
The user may not have their client set to their preferred language. This is very common with people who:
Are comfortable in many languages, including the default language of the client.
Are unwilling or don't know how to set a preferred language.
Are using a basic client which only offers one or two languages.
Are learning other languages and have changed their language settings for learning purposes.
The preferred language may not be the one they use (shared computers), which risks an unreadable email.
Instead, you could provide links in your emails to prepared translations, or to an automated translator that translates your emails.
We're developing an educational multiplayer game for kids and want to allow players to chat with each other using a whitelist system. When using whitelist chat, players will be able to type only words which appear in the whitelist.
We're aware of the limitations of whitelists in general, but we think a whitelist chat system is something that would allow our players to express themselves better in the game, while allowing a higher level of security than moderated or blacklist chat.
While the system is easy enough to implement, we haven't been able to find a sample whitelist of "safe" words online. Does anyone know of where we can find such a list, preferably with a license that allows us to use it in a commercial project?
Thanks.
I do not believe that a simple whitelist of words will cut it. There are quite a few euphemisms out there that a whitelist would never block (e.g. "he is growing like a weed" is fine; "he is growing weed" is NOT). And let's not mention the basic "would you like to meet?", which would be fine if the meeting were to happen in-game, but very dangerous if it were to happen outside it. Then there is also the issue of blocking rare, foreign, or mistyped words, which might make your chat system frustrating enough that it would not be used.
In my opinion, there is absolutely no way you could ever match the security offered by an active and competent human moderator. Of course, depending on the volume of chat traffic and any real-time requirements, there are quite a few practical issues with using humans for this. Considering that your application is targeted at children, however, human moderation might be quite acceptable, despite its much higher cost.
A second choice, but one very far from the abilities of human moderation, is to use a statistical filter such as Bogofilter, which will happily sort arbitrary text if you train it well. A blacklist would also help to immediately cut messages containing words that little kids should not (but usually do) know. You would also need a bunch of filters that cut messages containing things like telephone numbers, email and street addresses, and web links, as sketched below.
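For that last part, even a few regular expressions go a long way. A rough sketch (the patterns are illustrative only; a production filter would need many more variants):

```python
import re

# Illustrative patterns only; real-world filters need far more variants.
CONTACT_PATTERNS = [
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),  # US-style phone numbers
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),       # email addresses
    re.compile(r"\b(?:https?://|www\.)\S+", re.I),     # web links
]

def blocks_contact_info(message):
    """Return True if the message appears to contain contact details."""
    return any(p.search(message) for p in CONTACT_PATTERNS)

print(blocks_contact_info("call me at 555-123-4567"))  # True
print(blocks_contact_info("nice move!"))               # False
```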
Perhaps the method with the best effectiveness/cost ratio would be to use human moderators assisted by multiple statistical filters to make better use of their time. Keep in mind, however, that if there are malicious users (i.e. anyone other than same-age kids in a classroom), there is no way to make sure that nothing questionable or dangerous ever gets through.
You can try the standard Unix dictionary, /usr/share/dict/words, but you'll have to modify it to remove the naughty words (see the sketch after the links below).
http://en.wikipedia.org/wiki/Words_%28Unix%29
http://www.openwall.com/wordlists/
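A rough sketch of building the whitelist that way, assuming you keep your own blacklist.txt of words to exclude (the file name is made up):

```python
def load_whitelist(dict_path="/usr/share/dict/words",
                   blacklist_path="blacklist.txt"):
    """Build a whitelist: the system dictionary minus your blacklist."""
    with open(dict_path, encoding="utf-8") as f:
        words = {line.strip().lower() for line in f}
    with open(blacklist_path, encoding="utf-8") as f:
        blacklist = {line.strip().lower() for line in f}
    return words - blacklist

whitelist = load_whitelist()
print("weed" in whitelist)  # False, assuming "weed" is in your blacklist
```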
While this doesn't exactly answer your question, RuneScape uses a whitelist of phrases, rather than words.
The implementation in RuneScape is awkward, because there are so many phrases to choose from. You sometimes have to go through 3 or 4 menus to get to the phrase you want.
If you can come up with a better organization of phrases, then this might work for you.
I would like to allow my users to use Unicode for their passwords.
However I see a lot of sites don't support that (e.g. Gmail, Hotmail).
So I'm wondering if there's some technical or usability issue that I'm overlooking.
I'm thinking if anything it must be a usability issue, since .NET accepts Unicode by default, and if Hotmail (er, the new Live Mail) is built on that, I don't see why they would restrict it.
Has anyone encountered similar issues?
I am sure there is no technical problem, but maybe Gmail and Hotmail don't support it on purpose. These kinds of websites have a wide audience and should be accessible from everywhere.
Imagine a user who has a password in Japanese but is traveling and visits a cybercafe with no Japanese input support: that user won't be able to log in.
Another problem is analyzing password complexity: it's not so difficult to make sure the user didn't type a common English word, but what about Chinese, Russian, or Thai? Analyzing the complexity of a password becomes much harder as you add more languages.
So if you want your system to be accessible, it's better to ensure that users can type their password on every kind of device, OS, and environment. Alphanumeric passwords with the most common symbols (!<>"#$%& etc.) use a set of characters available everywhere.
Generally I am strongly in favor of not restricting what kinds of characters are allowed in passwords. However, remember that you have to compare what the user enters against something stored, which may be the password itself or a hash. In the former case you have to make sure the comparison is done correctly, which is much more complex with Unicode than with ASCII alone; in the latter case you have to ensure that you hash exactly the same bytes every time the password is entered. Normalization forms may help here or be a curse, depending on who applies them.
For example, in an application I'm working on I am using a hash over a UTF-8 conversion of the password which was normalized beforehand to weed out potential problems with combining characters and such.
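In outline it looks something like this (a simplified sketch, not the actual code from my application; the choice of NFC and the PBKDF2 parameters are placeholders):

```python
import hashlib
import os
import unicodedata

def hash_password(password, salt=None):
    """Normalize, encode as UTF-8, then hash with a per-user salt."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per user
    # NFC ensures a character entered precomposed or as base letter +
    # combining mark hashes to the same bytes.
    normalized = unicodedata.normalize("NFC", password)
    digest = hashlib.pbkdf2_hmac("sha256", normalized.encode("utf-8"),
                                 salt, 100_000)
    return salt, digest

# At login, re-run with the stored salt and compare digests.
salt, stored = hash_password("pässword")
_, attempt = hash_password("pa\u0308ssword", salt)  # a + combining diaeresis
print(attempt == stored)  # True, thanks to NFC normalization
```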
The biggest problem the user may face is that they can't enter it in some places, like on another keyboard layout. This is already the case for one of my passwords but never was a problem so far. And after all, that's a decision the user has to make in choosing their password and not one the application should make on behalf of the user. I doubt there are users who happily use arbitrary Unicode in their passwords and not think of the problems that may arise when using another keyboard layout. (This may be an issue for web-based services more than anything else, though.)
There are instances where Unicode is rightly forbidden, though. One such example is TrueCrypt which forces the use of the US keyboard layout for boot-time passwords (for full-volume encryption). There is no other layout there and therefore Unicode or any other keyboard layout only produces problems.
However, that doesn't explain why they forbid Unicode in normal passwords. A warning might be nice but outright forbidding is wrong in my eyes.
So I'm wondering if there's some technical or usability issue that I'm overlooking.
There's a technical issue with non-ASCII passwords (and usernames, for that matter) with HTTP Basic Authentication. As far as I know the sites you mentioned don't generally use Basic Authentication, but it might be a hangover from systems that do.
The HTTP Basic Authentication standard defines a base64-encoded username:password token. This means that if you have a colon in the username or password, the result is ambiguous. Also, base64-decoding the token gives you only bytes, with no indication of how to convert those bytes to characters. And guess what? Different browsers use different encodings to do it.
Opera and Chrome use UTF-8.
IE uses the client system's default code page (which is of course never UTF-8) and mangles characters that don't fit in it using the Windows standard Try To Find A Character That Looks A Bit Like It, Or Maybe Just Not (Who Cares) algorithm.
Safari uses ISO-8859-1, and silently refuses to send any auth token at all when the username or password has characters that don't fit.
Mozilla takes the lowest 8 bits of the code point (similar to ISO-8859-1, but more broken). See bug 41489 for a tortuous discussion with no outcome or progress.
So if you allow non-ASCII usernames or passwords, then the Basic Authentication process will be at best complicated and inconsistent, with users wondering why it randomly works or fails when they use different computers or browsers.
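You can see the ambiguity for yourself (a quick sketch with a made-up password):

```python
import base64

# The same credentials produce different Basic Auth tokens depending on
# which character encoding the browser happens to pick.
username, password = "user", "pässword"

for encoding in ("utf-8", "iso-8859-1"):
    raw = f"{username}:{password}".encode(encoding)
    token = base64.b64encode(raw).decode("ascii")
    print(encoding, token)

# The two printed tokens differ, so the server cannot recover the
# password without guessing which encoding the client used.
```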
No. Restrict passwords to ASCII characters.
When you input a password, bullets are displayed to conceal the password.
But when you input Japanese and other languages, you must go through an input method, converting the keystrokes into the desired characters. This requires you to see what the characters are.
I support Unicode passwords in all of my web applications. Visitors using a recent browser can use any code point in their preferred or native scripts.
For enhanced security I store a salted hash rather than using reversible encryption.
The important thing is to correctly normalize and encode the password string before adding the byte sequence to the hash (I prefer UTF-8 for endian independence).
Unicode sucks if you have to do programmatic matching. The "minus sign" and the "dash" look the same, but may be separate code points. "n with a funny tilde over it" might be one code point, or a base letter plus a combining diacritic.
If people use different encoding methods, then their passwords might not match, even though they look the same. See omg-ponies aka humanity=epic fail.
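A quick demonstration of the mismatch:

```python
import unicodedata

precomposed = "se\u00f1or"   # "señor" with ñ as a single code point
decomposed = "sen\u0303or"   # "señor" as n + U+0303 COMBINING TILDE

print(precomposed == decomposed)  # False: same-looking, different code points
print(unicodedata.normalize("NFC", precomposed) ==
      unicodedata.normalize("NFC", decomposed))  # True after normalizing both
```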
You can normalize, but what happens when:
the normalization rules change
you have some users with diacritics in their password
you have some users with combined letters in their password
the passwords are hashed, so you can't re-normalize the stored values
Guess what - you need to force a password reset on some of your users.
Good idea.
Makes the password stronger, gives more freedom to the users.
And it is already done by Windows (since at least Windows 2000), Active Directory and LDAP, and Novell (since at least 2004).
Some customers want it (http://mailman.mit.edu/pipermail/kerberos/2008-July/013923.html) and there is even a standard on how to do it right (https://www.rfc-editor.org/rfc/rfc8265, which obsoletes https://www.rfc-editor.org/rfc/rfc4013, thanks John).
I'm sure that the multilingual counterparts of those sites do support Unicode. It sounds like a user-requirements issue rather than a technical challenge.
I would not be surprised if there is a technical issue with the server not being certain of the encoding the client is sending the password in.
However, I would guess that, say, sites with mainly native-speaking Japanese, Chinese, or Russian audiences use the commonly used respective non-ASCII character sets (Big5, EUC-KR, KOI8, etc.) for passwords. Maybe you can research what they are doing to cope with older web clients that use non-Unicode encodings.
With HTML5 and the ability to send your users a font, you can integrate a visual keyboard into your system, so users will be able to type in their own language.
Hint: use the DejaVu font and modify it using FontForge to make it smaller; then, with a visual JavaScript keyboard, you can make it work ;)
Look here; it is a project where I did the trick.
Yes, I realize this question was asked and answered, but I have specific questions about this that I feel were not clear on that thread and I'd prefer not to get lost in the shuffle on another thread as well.
Previous threads said that rendering the email address to an image the way Facebook does is overkill and an unprofessional user experience for business/professional websites. And it seems that the general consensus is to use a JavaScript document.write solution using HTML entities, or some other method that breaks up and/or makes the string unreadable by a simple bot. The application I'm building doesn't even need the "mailto:" functionality; I just need to display the email address. Also, this is a business web application, so it needs to look and act as professional as possible. Here are my questions:
If I go the document.write route and pass the html entity version of each character, are there no web crawlers sophisticated enough to execute the javascript and pull the rendered text anyway? Or is this considered best practice and completely (or almost completely) spammer proof?
What's so unprofessional about the image solution? If Facebook is one of the highest trafficked applications in the world and not at all run by amateurs, why is their method completely dismissed in the other thread about this subject?
If your answer (as in the other thread) is to not bother myself with this issue and let the users' spam filters do all the work, please explain why you feel this way. We are displaying our users' email addresses that they have given us, and I feel responsible to protect them as much as I can. If you feel this is unnecessary, please explain why.
Thanks.
It is not spammer proof. If someone looks at the code for your site and determines the pattern that you are using for your email addresses, then specific code can be written to try and decipher that.
I don't know that I would say it is unprofessional, but it prevents copy-and-paste functionality, which is quite a big deal. With images, you simply don't get that functionality. What if you want to copy a relatively complex email address to your address book in Outlook? You have to resort to typing it out which is prone to error.
Moving the responsibility to the users' spam filters is really a poor response. While I believe that users should be diligent in guarding against spam, that doesn't absolve the person publishing the address of responsibility.
To that end, trying to do this in an absolutely secure manner is nearly impossible. The only way to do that is to have a shared secret which the code uses to decipher the encoded email address. The problem with this is that because the javascript is interpreted on the client side, there isn't anything that you can keep a secret from scrapers.
Encoders for email addresses generally work nowadays because most email-harvesting bots won't bother coding specifically for every site. They aim for a minimal algorithm that gets maximum results (the payoff isn't worth it otherwise). Because of this, simple encoders will defeat most bots. But if someone REALLY wants the emails on your site, they can get them, and probably easily, since the code that writes the addresses is publicly available.
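For what it's worth, a typical entity-encoder is only a few lines (a sketch with a made-up address):

```python
# Sketch of the entity-encoding approach: every character of the address
# becomes a numeric HTML entity, so the page source never contains the
# plain-text string a naive regex would match. Any bot that renders
# entities (or executes JavaScript) can still recover it.
def entity_encode(address):
    return "".join(f"&#{ord(ch)};" for ch in address)

encoded = entity_encode("jane@example.com")
print(encoded)  # &#106;&#97;&#110;&#101;...
snippet = f"<script>document.write('{encoded}');</script>"
```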
Taking all this into consideration, it makes sense that Facebook went the image route. Because they can alter the image to make OCR all but impossible, they can virtually guarantee that email addresses won't be harvested. Given that they are probably one of the largest email address repositories in the world, it could be argued that they carry a heavier burden than any of us, and while inconvenient, are forced down that route to ensure security and privacy for their vast user base.
There are quite a few reasons JavaScript is a good solution for now (that may change as the landscape evolves):
JavaScript obfuscation is a better mousetrap for now.
You just need to outrun the others. As long as there is low-hanging fruit, spammers will go for that. So unless everyone starts moving to JavaScript, you're okay, for now at least.
Most spammers use HTTP-based scripts which GET the page and parse it using regexes. Using a JavaScript engine to parse is certainly possible, but it would slow things down.
Regarding the Facebook solution, I don't consider it unprofessional, but I can clearly see why purists may disagree:
It breaks accessibility standards (it cannot be parsed by browsers or screen readers, or be clicked).
It breaks the semantic structure (it's an image, not a mailto link anymore).
It breaks the presentational layer. If you increase the browser's default font size or use high-contrast custom CSS, it won't apply to the email address.
Here is a nice blog post comparing a few methods, with benchmarks.
http://techblog.tilllate.com/2008/07/20/ten-methods-to-obfuscate-e-mail-addresses-compared/
What are the latest figures on people viewing their emails in text-only mode vs. HTML?
Wikipedia and its source both seem to reference this research from 2006, which is an eternity ago in internet terms.
An issue with combining both HTML and text-based emails is taking a disproportionate amount of time to resolve, given the likely number of users it affects.
As with web browser usage statistics, it depends entirely on the audience.
I have access to a bit of data on this subject and it seems that text-only email use is very low (for non-technical audiences, at least). <0.1% up to ~6% depending on demographic.
It's not that much effort to do both (especially if you can find something to help you do the heavy lifting when creating multipart MIME containers), and you can always write a script to generate text from your HTML or something.
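For example, Python's standard library does most of the heavy lifting here (a minimal sketch with placeholder addresses and content):

```python
from email.message import EmailMessage

# A multipart/alternative message: clients that prefer text-only render
# the plain part; everyone else gets the HTML part.
msg = EmailMessage()
msg["Subject"] = "Monthly update"
msg["From"] = "news@example.com"   # placeholder addresses
msg["To"] = "user@example.com"
msg.set_content("Hello,\n\nHere is our monthly update...")        # text/plain
msg.add_alternative("<p>Hello,</p><p>Here is our monthly update...</p>",
                    subtype="html")                               # text/html

print(msg.get_content_type())  # multipart/alternative
```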