I need an elisp function that guesses the charset of some HTML, and since Emacs already does that when opening a file, I wonder if I can reuse it somehow, perhaps by writing the string to a temporary buffer, setting the correct charset, and reading it back. Are there such functions?
Thanks!
See detect-coding-string.
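A minimal sketch of how that might look, assuming the raw HTML bytes are already in a unibyte string (the function name my-guess-html-charset is hypothetical):

(defun my-guess-html-charset (html)
  "Return the most likely coding system for HTML, a unibyte string."
  ;; With a non-nil second argument, `detect-coding-string' returns the
  ;; single highest-priority candidate instead of a list of candidates.
  (detect-coding-string html t))

;; Usage: decode the bytes once the coding system is known.
;; (decode-coding-string html (my-guess-html-charset html))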
I don't think that Emacs has something built-in to guess a character encoding, but it can read character-encoding hints in files, like -*- coding: utf-8 -*-. You can take a look at this external library, though. I guess you're using some web browser for Emacs like W3M, and it probably has something to deal with character encodings based on the HTTP meta-information it receives. This article might also be of some help.
Well, this is strange and hard to explain, but I'll try to do my best.
For some reason, values passed to the template change their encoding (I'm pretty sure they do).
Controller file (encoded in UTF-8):
print STDERR "ąęść";
$c->stash->{some_variable} = "ąęść"; # some unicode chars
Template file (encoded in UTF-8):
[% some_variable %]<br>
test: ąęść
As output in the browser I'm getting:
ÄÄÅÄ
test: ąęść
Output on console (with UTF-8 encoding enabled):
ąęść
Please take a look at the good documentation provided by the Catalyst Wiki at Using Unicode, and also Catalyst::View::TT. The Perl Unicode Cookbook may help you get a better understanding of Perl's Unicode support, which is usually better than in most other languages available today.
You may need to save your templates with a UTF-8 BOM so that your editor encodes the template file properly when saving; if you don't set a BOM, then at least make sure the file encoding is UTF-8 every time you save it.
There have been a ton of fixes to Unicode support and UTF-8 in general in the most recent stable release of Catalyst (5.90084). Catalyst is now UTF-8 by default, but of course there are always some rough edges. You should review the most recent docs on the subject to see what is going wrong.
If your template contains multibyte characters, then you do indeed need to set the BOM or review the documentation for your template view of choice; a sketch of the relevant configuration follows.
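For reference, a minimal sketch of that configuration, assuming a recent Catalyst (5.90080 or later) and Catalyst::View::TT; MyApp is a placeholder for your application class:

# In your application class:
package MyApp;
use utf8;                                   # this source file itself is UTF-8
use Catalyst;
__PACKAGE__->config(encoding => 'UTF-8');   # decode requests, encode responses

# In your TT view, tell Template Toolkit how to read template files:
package MyApp::View::TT;
use base 'Catalyst::View::TT';
__PACKAGE__->config(ENCODING => 'utf-8');   # read templates as UTF-8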
Can scalaz be used without a keyboard containing the appropriate Unicode characters or does every Unicode identifier also have an "ASCII" equivalent (and if yes, is there any guarantee that it stays that way)? Are there special keyboard layouts for usage with scalaz?
What's the best practice? Inputting the Unicode identifiers directly or using the ASCII substitutes and using a script to replace them with the Unicode ones before commit?
No, you don't need anything besides ASCII to use Scalaz.
However, most editors and IDEs have some way of automatically or semi-automatically (like Ctrl+Space) converting a sequence of characters into something else. That takes care of it if you want to keep your source code in Unicode.
Now, the problem with keeping stuff in Unicode is that you might have trouble with some fonts when displaying it in web pages, etc. Hell, you might even be forced to convert the code to ASCII for some reason. Yes, it is unlikely, but it is an issue you should be aware of.
This post from Superuser has some information about this.
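To illustrate why the ASCII spellings always suffice, here is a minimal sketch of the aliasing pattern, in the spirit of what scalaz does (illustrative code, not scalaz's actual source): the Unicode name simply delegates to the ASCII one, so both spellings compile to the same call.

object UnicodeAliasDemo extends App {
  trait Semigroup[A] { def append(a1: A, a2: A): A }

  implicit class SemigroupOps[A](self: A)(implicit S: Semigroup[A]) {
    def |+|(other: A): A = S.append(self, other) // ASCII spelling
    def ⊹(other: A): A = self |+| other          // Unicode alias, same call
  }

  // Ad-hoc instance just for this demo:
  implicit val intSemigroup: Semigroup[Int] = (a, b) => a + b

  println(1 |+| 2) // 3
  println(1 ⊹ 2)   // 3
}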
This Wikipedia article on Unicode input might be helpful.
No. Yes. Yes. No. Benign guarantees are for sissies. Write code. I use an appropriate development environment that allows me to type whatever I like.
I have a UILabel whose text I change through code. However, when I create an NSString with the characters æ, ø, å (Danish), I get an input conversion warning. The code looks like this:
NSString *label = [[NSString alloc] initWithFormat:@"Prøv igen"];
And the warning I get is this: "warning: input conversion stopped due to an input byte that does not belong to the input codeset UTF-8". I can understand that ø is probably not valid UTF-8, but what should I do? Can anyone give me a hint about how to solve this?
Regards
Bjarke
Your source code is not saved as UTF-8, but most likely as something like ISO-8859-1.
Just open the file and re-save it as UTF-8 - and while you're at it, you should probably also make that the default. Exactly how to do that depends on what editor you're using.
Make sure your file text encoding is set to UTF-8, not Western (ISO) or something else. You can use the Xcode file info inspector to do this.
http://developer.apple.com/library/mac/#documentation/DeveloperTools/Conceptual/XcodeWorkspace/050-File_Management/file_management.html#//apple_ref/doc/uid/TP40002677-BABICEHI
Make sure it says Unicode (UTF-8) for the File Encoding. If it asks you, tell it to reinterpret your file with the new encoding. Also, you may want to delete the problematic text and reinput it to get it to work.
I had the same problem, but my source code files were already UTF-8 encoded, so I fixed it in a different way.
In your case, it would have been something like
NSString *label = [NSString stringWithUTF8String:"Prøv igen"];
I hope this will be helpful for others who stumble on this question.
I'm using an API that processes my files and presents optimized output, but some special characters are not preserved, for example:
Input: äöü
Output: äöü
How do I fix this? What encoding should I use?
Many thanks for your help!
It really depends on what processing you are doing to your data. But in general, one powerful technique is to convert it to UTF-8 (with iconv, for example) and pass it through the ASCII-capable API or functions. If those functions don't mess with bytes they don't understand as ASCII, then the UTF-8 is preserved; that's a nice property of UTF-8.
I am not sure what language you're using, but things like this occur when there is a mismatch between the encoding of the content when it was entered and the encoding used when it is read back.
So you might want to specify exactly which encoding to read the data with. You may have to play with the actual encoding you need to use:
string.getBytes("UTF-8")
string.getBytes("UTF-16")
string.getBytes("UTF-16LE")
string.getBytes("UTF-16BE")
etc...
Also, do some research about the system where this data is coming from. For example, web services from ASP.NET deliver content as UTF-16LE, but Java uses UTF-16BE encoding. When these two systems talk to each other with extended characters, they might not understand each other exactly the same way.
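For a concrete illustration, here is a self-contained sketch of exactly the mismatch in the question: the UTF-8 bytes for "äöü", decoded again as ISO-8859-1, produce the "Ã¤Ã¶Ã¼" garbage shown above.

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        byte[] utf8 = "äöü".getBytes(StandardCharsets.UTF_8);
        // Decoding with the wrong charset reproduces the garbage:
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // Ã¤Ã¶Ã¼
        // Decoding with the charset that produced the bytes round-trips cleanly:
        System.out.println(new String(utf8, StandardCharsets.UTF_8));      // äöü
    }
}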
I support a website written in Tcl which displays data in Traditional Chinese (big5). We then have a Java servlet, using the translation code from mandarintools.com, to translate a page request into Simplified Chinese. The conversion as specified to the translation code is from UTF-8 to UTF-8S; Java is apparently correctly translating the data to UTF-8 as it comes in.
The Java translation code works but is slow, and since the website is written in Tcl, someone on another list suggested I try using that. Unfortunately, Tcl doesn't support UTF-8S, and I have been unable to figure out what translation to use in its place. I've tried gb2312, gb2312-raw, gb1988, euc-cn... all result in gibberish. My assumption is that Tcl is also translating to UTF-8 as it comes in, though I have tried converting from big5 first and it doesn't help.
My test code looks like this:
set page_body [ns_httpget http://www.mysite.com]
set translated_page_body [encoding convertto gb2312 $page_body]
ns_write $translated_page_body
I have also tried
set page_body [ns_httpget http://www.mysite.com]
set translated_page_body [encoding convertto gb2312 [encoding convertfrom big5 $page_body]]
ns_write $translated_page_body
But it didn't change anything.
Does anyone out there have enough experience with this to help me figure it out?
FYI, for completeness' sake: I've been told by Tcl experts that you can't do the conversion this way; it has to be done via character replacement.
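If you do go the character-replacement route, a minimal sketch might look like this; the mapping file t2s.txt and its one traditional/simplified pair per line format are hypothetical:

proc trad_to_simp {text mapfile} {
    set fh [open $mapfile r]
    fconfigure $fh -encoding utf-8
    set map {}
    while {[gets $fh line] >= 0} {
        lassign $line trad simp
        lappend map $trad $simp
    }
    close $fh
    return [string map $map $text]  ;# all replacements in a single pass
}

# Usage: decode from big5 first, replace characters, then emit as gb2312:
# set simplified [trad_to_simp [encoding convertfrom big5 $page_body] t2s.txt]
# ns_write [encoding convertto gb2312 $simplified]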
By any chance, are you grabbing your data from Oracle?
If so, see if you can use the CONVERT function to convert from "utf8" to "al32utf8", which is the true UTF-8 standard and which Tcl should work with seamlessly.
If not, well, I guess I'll wait for your comment(s).
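In case it helps, the Oracle side would look something like this (the pages table and page_body column are hypothetical; CONVERT takes the destination character set before the source):

SELECT CONVERT(page_body, 'AL32UTF8', 'UTF8') FROM pages;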