I am looking for a way to encode and decode a string in a q script.
.Q.x10, .Q.j10, .Q.x12 and .Q.j12 do not seem to meet the requirement.
e.g.
I want to encode "Hello world" and be able to decode it again later.
Your issue is that the default .Q.j10 and .Q.x10 don't allow for the space character " " since the space character is not in the default alphabet used:
q).Q.j10
64/:?["ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"]
If you look at the "tip" comment in the official documentation: https://code.kx.com/q/ref/dotq/#qj10-encode-binhex
you'll see that they suggest creating your own .Q.j10/.Q.x10 functions where the space character is the first character in your custom alphabet. Your alphabet still has to be only 64 characters though, so you would have to get rid of either the + or / (or replace them with another character of your choosing).
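As a rough illustration of that tip (in Python rather than q), the sketch below packs characters into an integer at six bits per character, using a custom 64-character alphabet whose first character is a space. The names ALPHABET, j10 and x10 are borrowed only for illustration and this is not the kdb+ implementation; the ten-character limit and the left-padding with the first alphabet character are assumptions about how the originals behave.

```python
# Custom 64-character alphabet: the space comes first and "/" has been dropped.
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+"
assert len(ALPHABET) == 64

def j10(text):
    """Pack up to 10 characters into one integer, 6 bits per character."""
    if len(text) > 10:
        raise ValueError("at most 10 characters fit into 60 bits")
    value = 0
    for ch in text:
        value = value * 64 + ALPHABET.index(ch)
    return value

def x10(value):
    """Unpack an integer into exactly 10 characters (left-padded)."""
    chars = []
    for _ in range(10):
        value, digit = divmod(value, 64)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))

encoded = j10("Hello wor")
print(repr(x10(encoded)))
```

With the space in first position the padding comes back as leading spaces, which can simply be stripped. Under the ten-character assumption, an eleven-character string such as "Hello world" would have to be split into chunks first.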
A similar question came up on the k4 topicbox on August 16th 2019 (subject "b64 encode") where Geo Carncross came up with this solution for b64 decoding:
q).Q.btoa"Hello World!"
"SGVsbG8gV29ybGQh"
q)
q){c:sum x="=";neg[c]_"c"$raze 256 vs'64 sv'0N 4#.Q.b6?x}"SGVsbG8gV29ybGQh"
"Hello World!"
I haven't tested the latter though.
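Since the decoder above was posted untested, one way to cross-check its logic is to repeat the same steps in another language. The Python sketch below follows the same idea (look up each character's 6-bit index, combine groups of four into 24-bit numbers, split each into three bytes, then drop one byte per "=" of padding) and compares it against the standard base64 module. The function name b64_decode is made up for this example.

```python
import base64

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64_decode(text):
    pad = text.count("=")
    # Treat padding as index 0; the bytes it produces are dropped at the end.
    indices = [0 if ch == "=" else B64.index(ch) for ch in text]
    out = bytearray()
    for i in range(0, len(indices), 4):
        a, b, c, d = indices[i:i + 4]
        group = (a << 18) | (b << 12) | (c << 6) | d
        out += group.to_bytes(3, "big")
    return bytes(out[:len(out) - pad])

encoded = base64.b64encode(b"Hello World!").decode("ascii")
print(encoded)
print(b64_decode(encoded))
```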
This is a follow-up of this question. I'm interested in different glyphs for the same character, also known as "Unicode Compatibility Characters".
Let's take the following two Arabic "reversed-character" words: كلمة ةملك
First word is:
كلمة
in hex code:
0643 0644 0645 0629
Second word is:
ةملك
in hex code:
0629 0645 0644 0643
If I paste those two words in Microsoft Word using Deja Vu Sans, I get this:
With the following pseudo-code using FreeType2, I get:
FT_Face face;
FT_New_Face(library, "DejaVuSans.ttf", 0, &face);
FT_GlyphSlot slot;
FT_Load_Char(face, each_character, FT_LOAD_RENDER);
slot = face->glyph;
//Use slot->bitmap.buffer
FT_Done_Face(face);
What am I missing? How can I get the right glyphs depending on the context?
My key issue is that I store each "character" (I should say glyph, but for me character was equivalent to glyph) in a table, so it's going to be complicated. I'm limited in speed, not in space. Can I have two different Unicode characters for the same logical character?
libraqm is a solution to get the glyph for each character depending on its position in the sentence. But I'm still interested in getting the character corresponding to the glyph (I know it's not a 1-to-1 relation). For instance, there are 4 characters for the 4 glyphs of the letter Kaf, as stated in the comment above.
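On that last point, Unicode does encode the four positional shapes of Kaf as separate compatibility characters in the Arabic Presentation Forms-B block, and each one carries a compatibility decomposition back to the base letter U+0643. A small Python sketch using only the standard unicodedata module shows that mapping; it only covers going from a presentation form back to the logical character, not the shaping itself.

```python
import unicodedata

# U+FED9..U+FEDC are the isolated, final, initial and medial forms of Kaf.
for cp in range(0xFED9, 0xFEDD):
    form = chr(cp)
    base = unicodedata.normalize("NFKC", form)  # fold back to the base letter
    print("U+%04X %s -> U+%04X (%s)" % (
        cp, unicodedata.name(form), ord(base), unicodedata.decomposition(form)))
```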
I have a file, file1.txt, containing text in English, Chinese, Japanese, and Korean. For use in ConTeXt, I need to mark each region of text within the file according to language, except for English, and output a new file, e.g., here is a sample line:
The 恐龙 ate 鱼.
As this contains text in Chinese characters, this will get marked like this:
The \language[cn]{恐龙} ate \language[cn]{鱼}.
The document is saved as UTF-8.
Text in Chinese should be marked \language[cn]{*}.
Text in Japanese should be marked \language[ja]{*}.
Text in Korean should be marked \language[ko]{*}.
The content never continues from one line to the next.
If the code is ever in doubt about whether something is Chinese, Japanese, or Korean, it is best if it defaults to Chinese.
How can I mark the text according to the language present?
A crude algorithm:
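use 5.014;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';  # avoid "Wide character" warnings on output

while (<DATA>) {
    chomp;  # say adds the newline back
    s
        {(\p{Hangul}+)}
        {\\language[ko]{$1}}g;
    s
        {(\p{Hani}+)}
        {\\language[cn]{$1}}g;
    s
        {(\p{Hiragana}+|\p{Katakana}+)}
        {\\language[ja]{$1}}g;
    say;
}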
__DATA__
The 恐龙 ate 鱼.
The 恐竜 ate 魚.
The キョウリュウ ate うお.
The 공룡 ate 물고기.
(Also see Detect chinese character using perl?)
There are problems with that. Daenyth comments that e.g. 恐竜 is misidentified as Chinese. I find it unlikely that you are really working with mixed English-CJK, and are just giving bad example text. Perform a lexical analysis first to differentiate Chinese from Japanese.
I'd like to provide a Python solution. Whichever language is used, it is based on Unicode Script information (from the Unicode Character Database, aka UCD). Perl exposes the UCD in much more detail than Python does.
Python does not expose Script information in its "unicodedata" module, but someone has added it here: https://gist.github.com/2204527 (tiny and useful). My implementation is based on it. BTW, it is not sensitive to spaces (no lexical analysis is needed).
# coding=utf8
# unicodedata2 is assumed to be the gist linked above, saved as unicodedata2.py
import unicodedata2

text = u"""The恐龙ate鱼.
The 恐竜ate 魚.
Theキョウリュウ ate うお.
The공룡 ate 물고기. """

langs = {
    'Han': 'cn',
    'Katakana': 'ja',
    'Hiragana': 'ja',
    'Hangul': 'ko'
}

# Pair each character with its script name
alist = [(x, unicodedata2.script_cat(x)[0]) for x in text]
# Sentinel entry so that the final run gets flushed
alist.append(("", ""))

newlist = []
langlist = []
prevlang = ""
for raw, lang in alist:
    # The script changed: close the run collected so far
    if prevlang in langs and prevlang != lang:
        newlist.append("\\language[%s]{" % langs[prevlang] + "".join(langlist) + "}")
        langlist = []
    if lang not in langs:
        newlist.append(raw)
    else:
        langlist.append(raw)
    prevlang = lang

newtext = "".join(newlist)
print(newtext)
The output is:
$ python test.py
The\language[cn]{恐龙}ate\language[cn]{鱼}.
The \language[cn]{恐竜}ate \language[cn]{魚}.
The\language[ja]{キョウリュウ} ate \language[ja]{うお}.
The\language[ko]{공룡} ate \language[ko]{물고기}.
While Korean doesn't use many sinograms [漢字/Kanji] anymore, they still pop up sometimes. Some Japanese sinograms are solely Japanese, like 竜, but many are identical to either the Simplified or the Traditional Chinese ones, so you're kind of stuck. You need to look at a full sentence if it contains some "Han" chars. If it has some hiragana/katakana + kanji, the probability is very high that it's Japanese. Likewise, a bunch of hangul syllables and a couple of sinograms will tell you the sentence is in Korean.
Then, if it's all Han characters, ie Chinese, you can look at whether some of the chars are simplified: kZVariant denotes a Simplified Chinese char. Oh, and kSpecializedSemanticVariant is very often used for Japanese specific simplified chars. 内 and 內 may look the same to you, but the first is Japanese, the second Traditional Chinese and Korean (Korean uses Traditional Chinese as a standard).
I have code somewhere that returns, for one codepoint, the script name. That could help. You go through a sentence, and see what's left at the end. I'll put up the code somewhere.
EDIT: the code
http://pastebin.com/e276zn6y
In response to the comment below:
The function above is built from data provided by Unicode.org... While not being an expert per se, I contributed quite a bit to the Unihan database, and I happen to speak CJK. Yes, all 3. I do have some code that takes advantage of the kXXX properties in the Unihan database, but A/ I wasn't aware we were supposed to write code for the OP, and B/ it would require logistics that might go beyond what the OP is ready to implement. My advice stands. With the function above, loop through one full sentence. If all codepoints are "Han" (or "Han"+"Latin"), chances are high it's Chinese. If on the other hand the result is a mix of "Han"+"Hangul" (possibly +"Latin"), you can't go wrong with Korean. Likewise, with a mix of "Han" and "Katakana"/"Hiragana" you have Japanese.
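The rule of thumb described here can be sketched in a few lines of Python. Because the standard unicodedata module has no script property, this sketch guesses the script from the start of each character's Unicode name, which is a simplifying assumption and not how the linked function works; the names SCRIPT_PREFIXES, scripts_in and guess_language are invented for the example. Pure Han text falls back to Chinese.

```python
import unicodedata

# Assumption: the start of the character name is a good enough script hint.
SCRIPT_PREFIXES = {
    "CJK UNIFIED IDEOGRAPH": "Han",
    "HIRAGANA": "Kana",
    "KATAKANA": "Kana",
    "HANGUL": "Hangul",
}

def scripts_in(sentence):
    """Return the set of CJK scripts that occur in the sentence."""
    found = set()
    for ch in sentence:
        name = unicodedata.name(ch, "")  # "" for unnamed characters
        for prefix, script in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                found.add(script)
    return found

def guess_language(sentence):
    found = scripts_in(sentence)
    if "Kana" in found:
        return "ja"
    if "Hangul" in found:
        return "ko"
    if "Han" in found:
        return "cn"  # all-Han text defaults to Chinese
    return None

for s in ["外人だけど、日本語をペラペラしゃべるよ!",
          "我唔知道,佢講乜話.",
          "中國이 韓國보다 훨씬 크지만, 꼭 아름다운 나라가 아니다..."]:
    print(guess_language(s), s)
```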
A QUICK TEST
Some code to be used with the function I linked to before.
function guessLanguage(x) {
  var results = {};
  var s = '';
  var i, j = x.length;
  // Count how many characters belong to each script
  for (i = 0; i < j; i++) {
    s = scriptName(x.substr(i, 1));
    if (results.hasOwnProperty(s)) {
      results[s] += 1;
    } else {
      results[s] = 1;
    }
  }
  console.log(results);
  // Pick the script with the highest count
  var mostCount = 0;
  var mostName = '';
  for (var key in results) {
    if (results.hasOwnProperty(key)) {
      if (results[key] > mostCount) {
        mostCount = results[key];
        mostName = key;
      }
    }
  }
  return mostName;
}
Some tests:
r=guessLanguage("外人だけど、日本語をペラペラしゃべるよ!");
Object
Common: 2
Han: 5
Hiragana: 9
Katakana: 4
__proto__: Object
"Hiragana"
The r object contains the number of occurrences of each script. Hiragana is the most frequent, and Hiragana+Katakana --> 2/3 of the sentence.
r=guessLanguage("我唔知道,佢講乜話.")
Object
Common: 2
Han: 8
__proto__: Object
"Han"
An obvious case of Chinese (Cantonese in this case).
r=guessLanguage("中國이 韓國보다 훨씬 크지만, 꼭 아름다운 나라가 아니다...");
Object
Common: 11
Han: 4
Hangul: 19
__proto__: Object
"Hangul"
Some Han characters, and a whole lot of Hangul. A Korean sentence, assuredly.
I am working on decoding text. I am trying to find the character code for the — character, not to be mistaken for -, in ASCII. I have tried unsuccessfully. Does anybody know how to convert it?
Quotation from wiki (Em dash)
When an actual em dash is unavailable—as in the ASCII character set—a double ("--") or triple hyphen-minus ("---") is used. In Unicode, the em dash is U+2014 (decimal 8212).
Em dash character is not a part of ASCII character set.
— is known as an em dash. Its character code is \u2014. It is not an ASCII character, so you cannot decode it with the ASCII character set: it simply is not in the ASCII character table. You would probably want to use UTF-8 instead.
Windows
For Windows on a keyboard with a Numeric keypad:
Use Alt+0150 (en dash), Alt+0151 (em dash), or Alt+8722 (minus sign) using the numeric keypad.
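The numbers in the first two Alt codes are not arbitrary: with a leading zero, the code is normally looked up in the system's ANSI code page, which is Windows-1252 on many Western installations (an assumption about the machine's configuration). A quick Python check shows what bytes 150 and 151 stand for in that code page:

```python
import unicodedata

for code in (150, 151):
    ch = bytes([code]).decode("cp1252")
    print(code, "U+%04X" % ord(ch), unicodedata.name(ch))
```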
This character does not exist in ASCII, but only in Unicode, usually encoded by UTF-8.
In UTF-8, every non-ASCII character is encoded as a sequence of two to four bytes, and none of those bytes is a valid ASCII code: all of them lie outside the ASCII range of 0 through 127.
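For the em dash specifically, this can be checked with a short Python snippet: the character cannot be encoded as ASCII at all, while its UTF-8 form is three bytes, each above 127.

```python
em_dash = "\u2014"

print(ord(em_dash))                  # code point in decimal
utf8 = em_dash.encode("utf-8")
print([hex(b) for b in utf8])        # the UTF-8 bytes
print(all(b > 127 for b in utf8))    # is every byte outside ASCII?

try:
    em_dash.encode("ascii")
except UnicodeEncodeError as err:
    print("not ASCII:", err.reason)
```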
One suspects that the foregoing only partly answers your question, but if so then this is probably because your question is, inadvertently, only partly asked. For further details, you can extend your question with more specifics.
The character — is not part of the ASCII set.
But if you are looking to convert it to some other format (like U+hex), you can use this online tool. Put your character into the first green box and click "Convert" (above the box)
further below you'll find a number of different codes, including U+hex:
U+2014
Feel free to edit this answer if the link breaks or leave a comment so I can find a replacement.
Alt + 0151 seems to do the trick—perhaps it doesn't work on all keyboards.
alt-196 - while holding down the 'Alt' key, type 196 on the numeric keypad, then release the 'Alt' key
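One caveat about Alt codes typed without a leading zero: they are usually interpreted in the legacy OEM code page rather than the ANSI one. Assuming that code page is CP437 (common on US-English systems, but an assumption here), a Python lookup shows which character 196 actually maps to, so it is worth comparing it with the real em dash:

```python
import unicodedata

ch = bytes([196]).decode("cp437")
print("U+%04X" % ord(ch), unicodedata.name(ch))
print(ch == "\u2014")   # is it the em dash?
```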
As you know, the print function on the 8086 stores each character in 8 bits (db) and shows it on the screen. Now I want to print Unicode characters, not just ASCII, in the 8086emu environment. So my question is: how can I use Unicode characters in my program? Does the 8086 support Unicode characters?
Thanks in advance :)
If you mean printing in text mode, via interrupt 10h: you can't, as you only have a character map with just 256 characters available. You can redefine what these characters look like (load your custom font), but that still gives you only 256 characters. So you would need to identify the ones you need, then somehow "render" them into the character table, and for printing you would need to map each Unicode character to its index in your character table.
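The mapping step at the end could look roughly like the following Python sketch, which would run on the development machine to prepare data for the 8086 program. It assumes that slots 128 to 255 of the character table are free to be redefined with a custom font; the function names build_slot_table and to_screen_bytes are hypothetical.

```python
def build_slot_table(text, first_free=128, last_free=255):
    """Assign each distinct non-ASCII character a slot in the 256-entry table."""
    table = {}
    next_slot = first_free
    for ch in text:
        if ord(ch) < 128 or ch in table:
            continue
        if next_slot > last_free:
            raise ValueError("more distinct characters than free slots")
        table[ch] = next_slot
        next_slot += 1
    return table

def to_screen_bytes(text, table):
    """Translate text into the byte values the text-mode routine would print."""
    return bytes(ord(ch) if ord(ch) < 128 else table[ch] for ch in text)

message = "π ≈ 3.14"
slots = build_slot_table(message)
print(slots)
print(list(to_screen_bytes(message, slots)))
```

The glyph bitmaps for the slots in use would still have to be drawn and loaded as a custom font before printing.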
See also my answer to a similar question for more details.
I want to detect and replace malformed UTF-8 characters with blank space using a Perl script while loading the data using SQL*Loader. How can I do this?
Consider Python. It lets you extend codecs with user-defined error handlers, so you can replace undecodable bytes with anything you want.
import codecs

# On a decoding error, emit one space and resume right after the offending byte
codecs.register_error('spacer', lambda ex: (u' ', ex.start + 1))
s = b'spam\xb0\xc0eggs\xd0bacon'.decode('utf8', 'spacer')
print(s)
This prints:
spam  eggs bacon
EDIT: (Removed bit about SQL Loader as it seems to no longer be relevant.)
One problem is going to be working out what counts as the "end" of a malformed UTF-8 character. It's easy to say what's illegal, but it may not be obvious where the next legal character starts.
RFC 3629 describes the structure of UTF-8 characters. If you take a look at that, you'll see that it's pretty straightforward to find invalid characters, AND that the next character boundary is always easy to find (it's a character < 128, or one of the "long character" start markers, with leading bits of 110, 1110, or 11110).
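That structure can be turned into a small scanner. The following Python sketch derives the expected sequence length from the leading bits of each byte, lets Python's strict decoder validate the candidate sequence, and on failure emits a space and moves on by one byte so that scanning resumes at the next possible boundary. The function names seq_len and replace_malformed are invented for this example, and it is only one possible reading of "replace malformed characters with a blank".

```python
def seq_len(lead):
    """Expected UTF-8 sequence length for a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 0  # continuation byte (10xxxxxx) or invalid lead byte

def replace_malformed(data, filler=" "):
    out = []
    i = 0
    while i < len(data):
        n = seq_len(data[i])
        chunk = data[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("stray or truncated sequence")
            out.append(chunk.decode("utf-8"))  # strict validation
            i += n
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            out.append(filler)
            i += 1
    return "".join(out)

print(replace_malformed(b"spam\xb0\xc0eggs\xd0bacon"))
```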
But BKB is probably correct: the easiest answer is to let Perl do it for you, although I'm not sure what Perl does when it detects incorrect UTF-8 with that filter in effect.