Lectora: Chinese characters that are manipulated using variables are not output as UTF-8

In Lectora Inspire V11, while building a Chinese course, all Traditional Chinese characters that are assigned directly come out in UTF-8 properly. Only the variables that are manipulated do not come out in UTF-8, which results in junk characters. Does anybody have a workaround?

You are probably better off asking for help at Lectora's community website, https://community.trivantis.com/connect/

Related

Chinese in Japanese encoding

This may sound like a stupid question. I typed some Chinese characters into an empty text file in the VS Code text editor (default UTF-8). Then I saved the file in an encoding for Japanese, Shift JIS, which apparently doesn't cover all the characters I typed in.
However, before I closed the file, all Chinese characters were displayed properly in VS Code. After I closed the file and reopened it using the Shift JIS encoding, several characters were displayed as a question mark (?). I guess these are the Chinese characters not covered by the Japanese encoding?
What happened in the process? Is there any way I can get back the Chinese characters that are now shown as ?. I don't really understand how encoding works in this scenario...
Not all encodings cover all characters. (Unicode encodings, in principle, do, but even they don't have quite everything yet.) If you save some text in an encoding which does not include all characters in that text, something has to give.
Options:
you get an error message,
nothing saves at all,
the characters which cannot be included are silently dropped,
the characters which cannot be included are converted to some other character (such as the question mark).
Once that conversion is done, the data is lost, and cannot be recovered. Why not use UTF-8 or another Unicode encoding? (GB 18030 might be the best for large amounts of Chinese text.)
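The options above can be reproduced with Python's built-in codecs. This is only a sketch: it assumes that "们" (U+4EEC) is absent from Shift JIS while "中" is present, and it says nothing about which option a particular editor actually picks.

```python
# Sketch: what saving in Shift JIS can do to a character it cannot represent.
# Assumption: "中" is covered by Shift JIS, "们" (a simplified form) is not.
text = "中们"

# Option: refuse and report an error (errors="strict" is the default).
try:
    text.encode("shift_jis")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Option: silently drop the characters that do not fit.
dropped = text.encode("shift_jis", errors="ignore").decode("shift_jis")

# Option: substitute a question mark for each character that does not fit.
replaced = text.encode("shift_jis", errors="replace").decode("shift_jis")

print(strict_failed, dropped, replaced)
```

After the "replace" variant, the bytes on disk contain a real question mark, which is why reopening the file cannot bring the original character back.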

Find non-ASCII characters in a text file and convert them to their Unicode equivalent

I am importing a .txt file from a remote server and saving it to a database. I use a .NET script for this purpose. I sometimes notice garbled words or characters (Ullerهkersvنgen) inside the files, which causes a problem when saving to the database.
I want to filter all such characters and convert them to Unicode before saving to the database.
Note: I have been through many similar posts but had no luck.
Your help in this context will be highly appreciated.
Thanks.
Assuming your script knows the correct encoding of your text snippet, this regular expression should find all runs of non-ASCII characters:
[^\x00-\x7F]+
see here: https://stackoverflow.com/a/20890052/1144966 and https://stackoverflow.com/a/8845398/1144966
Also, the base-R tools package provides two functions to detect non-ASCII characters:
tools::showNonASCII()
tools::showNonASCIIfile()
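As an illustration, the same regular expression can be tried in Python. The helper name `find_non_ascii` and the sample string are made up for this sketch; the pattern only locates non-ASCII runs and does not tell you whether they were decoded correctly.

```python
import re

# Same pattern as above: one or more characters outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def find_non_ascii(text):
    """Return (offset, run) pairs for each run of non-ASCII characters."""
    return [(m.start(), m.group()) for m in NON_ASCII.finditer(text)]

# Hypothetical sample line; \u00e5 is "å" and \u00e4 is "ä".
sample = "Uller\u00e5kersv\u00e4gen 12"
hits = find_non_ascii(sample)
print(hits)
```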
You need to know or at least guess the character encoding of the data in order to be able to convert it properly. So you should try and find information about the origin and format of the text file and make sure that you read the file properly in your software.
For example, “Ullerهkersvنgen” looks like a Scandinavian name, with Scandinavian letters in it, misinterpreted according to a wrong character encoding assumption or as munged by an incorrect character code conversion. The first Arabic letter in it, “ه”, is U+0647 ARABIC LETTER HEH. In the ISO-8859-6 encoding, it is E7 (hex.); in windows-1256, it is E5. Since Scandinavian texts are normally represented in ISO-8859-1 or windows-1252 (when Unicode encodings are not used), it is natural to check what E7 and E5 mean in them: “ç” and “å”. For linguistic reasons, the latter is much more probable here. The second Arabic letter is “ن” U+0646 ARABIC LETTER NOON, which is E4 in windows-1256. And in ISO-8859-1, E4 is “ä”. This makes perfect sense: the word is “Ulleråkersvägen”, a real Swedish street name (in Uppsala, at least).
Thus, the data is probably ISO-8859-1 or windows-1252 (Windows Latin 1) encoded text, incorrectly interpreted as windows-1256 (Windows Arabic). No conversion is needed; you just need to read the data as windows-1252 encoded. (After reading, it can of course be converted to another encoding.)
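The reasoning above can be checked with a short Python sketch. The garbled string is written with escapes (U+0647 and U+0646) to avoid bidirectional display issues. Re-encoding it as windows-1256 recovers the original bytes, and decoding those bytes as windows-1252 gives the Swedish name.

```python
# The garbled text as it appears after being wrongly decoded as windows-1256.
garbled = "Uller\u0647kersv\u0646gen"

# Undo the wrong decoding to get back the original bytes (E5 and E4).
raw = garbled.encode("cp1256")

# Decode the same bytes with the encoding the data was most likely written in.
repaired = raw.decode("cp1252")

print(raw)
print(repaired)
```

This only demonstrates the diagnosis. In the importing script, the cleaner fix is to read the file as windows-1252 in the first place (in Python that would be `open(path, encoding="cp1252")`, with `path` being a placeholder) rather than repairing text afterwards.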

Hebrew characters processed by HTML Tidy turn into gibberish

I'm using HTML Tidy Online (http://infohound.net/tidy/) to tidy up a very old and messy HTML file which contains some Hebrew characters. Whenever the page is processed by Tidy, the output turns the Hebrew characters into gibberish, even after changing the encoding in the settings. Using different settings, I do manage to get the same output with the Hebrew characters as Unicode entities.
I Googled around for a possible solution but found none.
I had a couple ideas in mind, but I'm not sure exactly how to approach them, if at all (maybe someone has a better solution).
I thought maybe I could (after processing the page) scan the page for unicode entities and replace them with the corresponding Hebrew characters (in a systematic way, of course).
Maybe I could take the HTML Tidy source code and modify it to output Hebrew characters appropriately. The problem with this is that I doubt I am knowledgeable enough to even get started on something like this.
I had a similar problem. Document in UTF-8, containing unicode characters. HTML Tidy turned them into HTML entities. This in HTMLTIDY.CFG fixed it:
char-encoding: utf8
input-encoding: utf8
output-encoding: utf8
Hope it helps.
The website http://infohound.net/tidy/ that you are using has a "Char encoding" option at the bottom right. You need to choose utf-8, but first you need to make sure that the page is encoded in UTF-8 in your text editor. In Notepad++, for example, you can go to Encoding > Convert to UTF-8 without BOM.
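If the encoding settings cannot be changed, the asker's first idea (replacing the numeric entities after processing) could be sketched in Python as follows. The function name `decode_non_ascii_entities` is hypothetical, and the sketch deliberately leaves ASCII-range entities such as `&#60;` and named entities such as `&amp;` alone so that the markup stays valid.

```python
import re

# Matches decimal (&#1513;) and hexadecimal (&#x5E9;) numeric entities.
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

def decode_non_ascii_entities(html_text):
    """Replace numeric entities above the ASCII range with real characters."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep ASCII entities (they may protect markup), surrogates,
        # and out-of-range values exactly as they were.
        if code_point < 128 or code_point > 0x10FFFF:
            return match.group(0)
        if 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)
    return NUMERIC_ENTITY.sub(replace, html_text)

tidied = "&#1513;&#x5DC;&#1493;&#1501; &amp; &#60;"
print(decode_non_ascii_entities(tidied))
```

The result still has to be saved as UTF-8, and the page has to declare UTF-8, for the Hebrew characters to display correctly.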

Should I use hex ascii accented character code in HTML or use the actual character?

I have several huge CSVs with lots of accented characters in HTML hex code: &#xE9; for é and lots of others, even &#x2013; for –, etc.
My site is a wiki for people to update listings. So when they are presented a textarea for update, the existing content is filled in, and obviously those hex codes will be shown.
Should I bother replacing those codes with actual accented characters, or just leave them as they are? I wrote a script to replace the characters, but somehow the output contains weird characters. Probably the file saved from Ruby isn't in UTF-8.
By default my site is in UTF-8, and the accented characters are displayed properly with some html coding in the view.
Please advise. Thanks.
Could you clarify what the problem is?
If your data (CSV) is in UTF-8, and the default encoding of your site is UTF-8, then all you would need to do is make sure that when users are editing content, that content is properly treated as UTF-8.
You may not need to display the markup to the users. Perhaps you could leverage a WYSIWYG editor package like TinyMCE?
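If the codes are replaced with real characters, the conversion itself is small. The sketch below uses Python rather than the asker's Ruby, and the CSV content is invented for illustration. `html.unescape` from the standard library turns both hex and decimal entities into characters.

```python
import csv
import html
import io

# Invented sample data standing in for one of the CSV files.
raw_csv = "name,note\nCaf&#xE9; Rex,open 9&#x2013;5\n"

# Unescape every cell; in-memory text is used here instead of a real file.
rows = [
    [html.unescape(cell) for cell in row]
    for row in csv.reader(io.StringIO(raw_csv))
]

print(rows)
```

Two caveats: `html.unescape` also converts `&amp;` and `&lt;`, so the text must be escaped again when it is written into HTML, and the output file should be opened with an explicit UTF-8 encoding so the characters are not written in a platform default encoding.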

Codepages and encodings

Before anyone recommends that I do a Google search on this, I have. I just need a bit more clarity around what code pages and encodings are.
If I use UTF-8 encoding, and use an Italian code page and then a French code page, does this mean I'll get different characters even though the bytes haven't changed?
Joel has a nice summary of this:
http://www.joelonsoftware.com/articles/Unicode.html
And no, if I understand your question correctly, it doesn't mean that.
When you're converting UTF-8 to a specific code page, it is possible that only some of the characters are going to be converted. What happens to the ones that don't get converted depends on how you call the conversion. A possible result is that the characters which could not be mapped to the code page would be converted to question mark characters.
An encoding is simply a mapping between numerical values and "characters".
US-ASCII maps the number 65 to the letter A, 32 to a space and 49 to the digit "1". (How these things are rendered is another matter.) In fact, UTF-8 does the same! But UTF-8 also assigns meaning to byte values above 127, which ASCII does not define. It is a variable-length encoding, i.e. a character may be encoded with 1, 2, 3, or 4 bytes; common characters generally consume fewer bytes.
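A quick way to see the variable length is to encode a few characters and count the bytes. The sample characters are arbitrary choices for this sketch: "A", "é" (U+00E9), "€" (U+20AC) and an emoji (U+1F600).

```python
# Characters chosen to need 1, 2, 3 and 4 bytes in UTF-8 respectively.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each character to the number of bytes its UTF-8 encoding uses.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# For ASCII characters, UTF-8 produces the same single byte as US-ASCII.
ascii_same = "A".encode("ascii") == "A".encode("utf-8") == bytes([65])

print(lengths, ascii_same)
```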
Plain text files, including web pages, are stored and transmitted as sequences of bytes. These bytes are supposed to represent something textual. Software applications (like text editors and web browsers) are responsible for rendering the information within these files on the screen. Usually they make use of library or OS functions.
If the software assumes a different encoding to the software that created the file, the wrong characters may be displayed!
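The effect is easy to reproduce: the bytes stay the same and only the assumed encoding changes. The code pages here (windows-1252 and windows-1251) are picked purely for illustration, since they differ visibly in the upper half.

```python
# "é" saved as UTF-8 occupies two bytes: C3 A9.
data = "\u00e9".encode("utf-8")

as_utf8 = data.decode("utf-8")      # read with the right encoding
as_cp1252 = data.decode("cp1252")   # read as Windows Latin 1: two wrong characters

# A single byte E9 means different letters in different code pages.
single = bytes([0xE9])
latin = single.decode("cp1252")     # Western European code page
cyrillic = single.decode("cp1251")  # Cyrillic code page

print(as_utf8, as_cp1252, latin, cyrillic)
```

So for the question as asked: bytes written as UTF-8 and read as UTF-8 give the same characters regardless of locale. Different characters appear only when the reader decodes the bytes with a code page instead.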
Note that it is possible to convert between different encodings; however if you convert to an encoding that does not contain a certain character, the software must make a choice as to what to use instead. This conversion often happens transparently (when you save a file with a certain encoding, whatever you've typed must be changed into that encoding).
UTF-8 includes all characters from your French and Italian code pages, but the language-specific code pages do not include all of each other's characters.
So you can take input from each language and convert it to UTF-8 for storage, but you cannot be certain that you will get the right characters if you take Italian input and show it as French.
Use UTF-8 all the way if you can.