Erlang identify umlauts

How can I identify German umlauts in Erlang? I have been trying for days: when I read a text as a list, my code just doesn't match them. For example, I tried this:
change_umlaut(Word) -> change_umlaut(lists:reverse(Word), []).
change_umlaut([],Acc) -> Acc;
change_umlaut([H|T],Acc) ->
    if
        %extended ascii characters
        H =:= 129 -> change_umlaut(T, ["ue"|Acc]);
        H =:= 132 -> change_umlaut(T, ["ae"|Acc]);
        H =:= 148 -> change_umlaut(T, ["oe"|Acc]);
        %extended ascii characters
        H == 129 -> change_umlaut(T, ["ue"|Acc]);
        H == 132 -> change_umlaut(T, ["ae"|Acc]);
        H == 148 -> change_umlaut(T, ["oe"|Acc]);
        %literals
        H == "ü" -> change_umlaut(T, ["ue"|Acc]);
        H == "ä" -> change_umlaut(T, ["ae"|Acc]);
        H == "ö" -> change_umlaut(T, ["oe"|Acc]);
        %else
        true -> change_umlaut(T, [H|Acc])
    end;
It just falls through all the clauses without matching until it reaches true...
Thank you for your help.

In Erlang, strings usually contain Latin-1 or Unicode codepoints, so you should be looking for 228 for "ä", 246 for "ö" and 252 for "ü".
Your literals section should have made this work transparently, except for the fact that H is a single character, and you're comparing it to strings ("ü", "ä" and "ö"). The corresponding character literals are $ü, $ä and $ö. For this to work, make sure your source file's encoding matches what the compiler expects: releases before Erlang/OTP 17 assume Latin-1, while OTP 17 and later default to UTF-8.
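To see where the numbers come from, here is a minimal sketch in Python rather than Erlang (the names `UMLAUT_REPLACEMENTS` and `change_umlaut` are made up for illustration). It prints each umlaut's code point, which is what an element of an Erlang string list holds, next to its byte value in the old DOS code page 437. The 129/132/148 values from the question are the CP437 ones, which is why they never match.

```python
# Illustrative sketch only; names are hypothetical, not from any library.
UMLAUT_REPLACEMENTS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def change_umlaut(word: str) -> str:
    """Replace each lowercase umlaut with its two-letter transliteration."""
    return "".join(UMLAUT_REPLACEMENTS.get(ch, ch) for ch in word)


# Code point (Latin-1 / Unicode) versus the CP437 "extended ASCII" byte.
for ch in UMLAUT_REPLACEMENTS:
    print(ch, ord(ch), ch.encode("cp437")[0])
# ä 228 132
# ö 246 148
# ü 252 129

print(change_umlaut("Müller"))  # Mueller
```

The same idea carries over to Erlang: compare each list element against the integer code point (or the `$ü`-style character literal), not against a one-character string.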

Related

What does the \u{...} notation mean in UNICODE and why are only some characters displayed like this in the CLDR project?

In this link you will find the most used characters for each language. Why are some characters in some languages displayed under the \u{...} notation?
I think that what is in the brackets is the hexadecimal code of the character, but I can't understand why they would only do it with some characters.

The character sequences enclosed in curly brackets {} are digraphs (trigraphs, …) that count as a distinct letter in the given language (supposedly with its own place in the alphabet), for instance:
digraph {ch} in cs (Czech language);
trigraph {dzs} in hu (Hungarian alphabet);
more complex digraph examples in kkj (Kako language), as the following Python code snippet shows:
>>> kkj='[a á à â {a\u0327} b ɓ c d ɗ {ɗy} e é è ê ɛ {ɛ\u0301} {ɛ\u0300} {ɛ\u0302} {ɛ\u0327} f g {gb} {gw} h i í ì î {i\u0327} j k {kp} {kw} l m {mb} n {nd} ǌ {ny} ŋ {ŋg} {ŋgb} {ŋgw} o ó ò ô ɔ {ɔ\u0301} {ɔ\u0300} {ɔ\u0302} {ɔ\u0327} p r s t u ú ù û {u\u0327} v w y]'
>>> print( kkj)
[a á à â {a̧} b ɓ c d ɗ {ɗy} e é è ê ɛ {ɛ́} {ɛ̀} {ɛ̂} {ɛ̧} f g {gb} {gw} h i í ì î {i̧} j k {kp} {kw} l m {mb} n {nd} ǌ {ny} ŋ {ŋg} {ŋgb} {ŋgw} o ó ò ô ɔ {ɔ́} {ɔ̀} {ɔ̂} {ɔ̧} p r s t u ú ù û {u̧} v w y]
>>>
For instance, {a\u0327} renders as {a̧}, i.e. Latin Small Letter A followed by Combining Cedilla, a combination that has no precomposed Unicode equivalent. A counterexample:
ņ (U+0146) Latin Small Letter N With Cedilla, with decomposition 006E 0327:
>>> import unicodedata
>>> print( 'ņ', unicodedata.normalize('NFC','{n\u0327}'))
ņ {ņ}
Edit:
Characters presented as Unicode escapes (\uxxxx = the character with 16-bit hex value xxxx) are ones that cannot be rendered on their own (or are hard to render, at least). The following Python script shows some of them (Bidi_Class Values L-Left_To_Right, R-Right_To_Left, NSM-Nonspacing_Mark, BN-Boundary_Neutral):
# -*- coding: utf-8 -*-
import unicodedata
pa = 'ੱੰ਼੍ੁੂੇੈੋੌ'
pa = '\u0327 \u0A71 \u0A70 \u0A3C ੦ ੧ ੨ ੩ ੪ ੫ ੬ ੭ ੮ ੯ ੴ ੳ ਉ ਊ ਓ ਅ ਆ ਐ ਔ ੲ ਇ ਈ ਏ ਸ {ਸ\u0A3C} ਹ ਕ ਖ {ਖ\u0A3C} ਗ {ਗ\u0A3C} ਘ ਙ ਚ ਛ ਜ {ਜ\u0A3C} ਝ ਞ ਟ ਠ ਡ ਢ ਣ ਤ ਥ ਦ ਧ ਨ ਪ ਫ {ਫ\u0A3C} ਬ ਭ ਮ ਯ ਰ ਲ ਵ ੜ \u0A4D ਾ ਿ ੀ \u0A41 \u0A42 \u0A47 \u0A48 \u0A4B \u0A4C'
pa = '\u0300 \u0301 \u0302 \u1DC6 \u1DC7 \u0A71 \u0A70 \u0A3C \u0A4D \u0A41 \u0A42 \u0A47 \u0A48 \u0A4B \u0A4C \u05B7 \u05B8 \u05BF \u200C \u200D \u200E \u200F \u064B \u064C \u064E \u064F \u0650'
# above examples from ·kkj· ·bas· ·pa· ·yi· ·kn· ·ur· ·mzn·
print( pa )
# 'ch' rather than 'chr', so the built-in chr() is not shadowed
for ch in pa:
    if ch != ' ':
        if ch == '{' or ch == '}':
            print( ch )
        else:
            print( '\\u%04x' % ord(ch), ch,
                   unicodedata.category(ch),
                   unicodedata.bidirectional(ch) + '\t',
                   str( unicodedata.combining(ch)) + '\t',
                   unicodedata.name(ch, '?') )
Result: .\SO\63659122.py
̀ ́ ̂ ᷆ ᷇ ੱ ੰ ਼ ੍ ੁ ੂ ੇ ੈ ੋ ੌ ַ ָ ֿ ‌ ‍ ‎ ‏ ً ٌ َ ُ ِ
\u0300 ̀ Mn NSM 230 COMBINING GRAVE ACCENT
\u0301 ́ Mn NSM 230 COMBINING ACUTE ACCENT
\u0302 ̂ Mn NSM 230 COMBINING CIRCUMFLEX ACCENT
\u1dc6 ᷆ Mn NSM 230 COMBINING MACRON-GRAVE
\u1dc7 ᷇ Mn NSM 230 COMBINING ACUTE-MACRON
\u0a71 ੱ Mn NSM 0 GURMUKHI ADDAK
\u0a70 ੰ Mn NSM 0 GURMUKHI TIPPI
\u0a3c ਼ Mn NSM 7 GURMUKHI SIGN NUKTA
\u0a4d ੍ Mn NSM 9 GURMUKHI SIGN VIRAMA
\u0a41 ੁ Mn NSM 0 GURMUKHI VOWEL SIGN U
\u0a42 ੂ Mn NSM 0 GURMUKHI VOWEL SIGN UU
\u0a47 ੇ Mn NSM 0 GURMUKHI VOWEL SIGN EE
\u0a48 ੈ Mn NSM 0 GURMUKHI VOWEL SIGN AI
\u0a4b ੋ Mn NSM 0 GURMUKHI VOWEL SIGN OO
\u0a4c ੌ Mn NSM 0 GURMUKHI VOWEL SIGN AU
\u05b7 ַ Mn NSM 17 HEBREW POINT PATAH
\u05b8 ָ Mn NSM 18 HEBREW POINT QAMATS
\u05bf ֿ Mn NSM 23 HEBREW POINT RAFE
\u200c ‌ Cf BN 0 ZERO WIDTH NON-JOINER
\u200d ‍ Cf BN 0 ZERO WIDTH JOINER
\u200e ‎ Cf L 0 LEFT-TO-RIGHT MARK
\u200f ‏ Cf R 0 RIGHT-TO-LEFT MARK
\u064b ً Mn NSM 27 ARABIC FATHATAN
\u064c ٌ Mn NSM 28 ARABIC DAMMATAN
\u064e َ Mn NSM 30 ARABIC FATHA
\u064f ُ Mn NSM 31 ARABIC DAMMA
\u0650 ِ Mn NSM 32 ARABIC KASRA

It seems like all codepoints that don't have a well-defined stand-alone look (or are not meant to be used as stand-alone characters) are represented with this notation.
For example U+0A3C is present in the "character" {ਫ\u0A3C}. U+0A3C is a combining codepoint that modifies the one that is before it.
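A quick way to check both points, as a rough sketch using only Python's standard `unicodedata` module (the helper name `has_precomposed_form` is made up here): U+0A3C is classified as a nonspacing mark, and NFC normalization folds a base letter plus combining mark into one code point only when a precomposed character exists.

```python
import unicodedata


def has_precomposed_form(base: str, mark: str) -> bool:
    """True if NFC folds base + mark into a single code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1


# U+0A3C (GURMUKHI SIGN NUKTA) is a nonspacing combining mark.
print(unicodedata.category("\u0A3C"))        # Mn

# n + combining cedilla has a precomposed form (U+0146); a + cedilla does not.
print(has_precomposed_form("n", "\u0327"))   # True
print(has_precomposed_form("a", "\u0327"))   # False
```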

Utf8 encoding makes me confused

let buf1 = Buffer.from("3", "utf8");
let buf2 = Buffer.from("Здравствуйте", "utf8");
// <Buffer 33>
// <Buffer d0 97 d0 b4 d1 80 d0 b0 d0 b2 d1 81 d1 82 d0 b2 d1 83 d0 b9 d1 82 d0 b5>
Why does char '3' encode to '33' in buf1 but 'd0 97' in buf2?

Because 3 is not З, despite the similarity to the untrained eye. Look closer and you'll see the difference, however subtle.
The former is Unicode code point U+0033 - DIGIT THREE (see here), while the latter is U+0417 - CYRILLIC CAPITAL LETTER ZE (see here), encoded in UTF-8 as d0 97.
The Russian word is actually hello, pronounced (very roughly, since I only know hello and goodbye, taught by a Russian girlfriend many decades ago) "Strasvoytza", with no "three" anywhere in the concept.

The first character of the second buffer is the Cyrillic character "Ze" https://en.m.wikipedia.org/wiki/Ze_(Cyrillic) and not the Arabic numeral 3 https://en.m.wikipedia.org/wiki/3
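The difference is easy to confirm programmatically. The following is a small Python sketch (the helper `utf8_two_byte` is hypothetical, written only for this illustration) that prints the code point, name and UTF-8 bytes of both characters, then hand-encodes U+0417 to show where d0 97 comes from.

```python
import unicodedata


def utf8_two_byte(code_point: int) -> bytes:
    """Hand-encode a code point in the two-byte UTF-8 range U+0080..U+07FF."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("not in the two-byte UTF-8 range")
    lead = 0b11000000 | (code_point >> 6)            # 110xxxxx: top 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)   # 10xxxxxx: low 6 bits
    return bytes([lead, trail])


# "\u0417" is the Cyrillic letter, written as an escape to avoid confusion.
for ch in ("3", "\u0417"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex())
# U+0033 DIGIT THREE 33
# U+0417 CYRILLIC CAPITAL LETTER ZE d097

print(utf8_two_byte(0x0417).hex())  # d097
```

Code points below U+0080 are stored as a single byte, which is why the digit stays 33, while U+0417 needs the two-byte form.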

How do I print a string in one line in MARIE?

I want to print a set of letters in one line in MARIE. I modified the code to print Hello World and came up with:
ORG 0 / implemented using "do while" loop
WHILE, LOAD STR_BASE / load str_base into ac
ADD ITR / add index to str_base
STORE INDEX / store (str_base + index) into ac
CLEAR / set ac to zero
ADDI INDEX / get the value at ADDR
SKIPCOND 400 / SKIP if ADDR = 0 (or null char)
JUMP DO / jump to DO
JUMP PRINT / JUMP to END
DO, STORE TEMP / output value at ADDR
LOAD ITR / load iterator into ac
ADD ONE / increment iterator by one
STORE ITR / store ac in iterator
JUMP WHILE / jump to while
PRINT, SUBT ONE
SKIPCOND 000
JUMP PR
HALT
PR, OUTPUT
JUMP WHILE
ONE, DEC 1
ITR, DEC 0
INDEX, HEX 0
STR_BASE, HEX 12 / memory location of str
STR, HEX 48 / H
HEX 65 / E
HEX 6C / L
HEX 6C / L
HEX 6F / O
HEX 0 / carriage return
HEX 57 / W
HEX 6F / O
HEX 72 / R
HEX 6C / L
HEX 64 / D
HEX 0 / NULL char
My program ends up halting past two iterations. I can't seem to figure out how to print a set of characters in one line. Thanks.

Your value of STR_BASE is almost certainly incorrect. Based on what is here, I would say it needs to be 18 instead of 12. Also, you would want to either replace the null char that currently sits between "HELLO" and "WORLD" with a space or simply remove that line, depending on your intended output.

program giving different output than what i expected

#include <iostream>

int main()
{
    std::cout << ('a' ^ 'b');
}
When I ran this simple C++ program, it printed 3, but I expected 1.
Do you know why? Is there a problem with the XOR operator?

There is no problem with XOR and the result of 3 is correct.
'a' XOR 'b'
-> 0x61 XOR 0x62 (hex, per ASCII)
-> 01100001 XOR 01100010 (binary)
-> 00000011 (only these bits differ)
-> 3 (decimal)
Consider the following, which is 1 - why?
'`' ^ 'a'
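For anyone who wants to check the bit patterns, here is a short Python sketch of the same arithmetic (`xor_chars` is just an illustrative helper name). It prints both operands and the result in binary for the two pairs above.

```python
def xor_chars(a: str, b: str) -> int:
    """XOR the code points of two single characters."""
    return ord(a) ^ ord(b)


for a, b in [("a", "b"), ("`", "a")]:
    result = xor_chars(a, b)
    print(f"{ord(a):08b} ^ {ord(b):08b} = {result:08b} ({result})")
# 01100001 ^ 01100010 = 00000011 (3)
# 01100000 ^ 01100001 = 00000001 (1)
```

The second pair differs only in the lowest bit (0x60 versus 0x61), so the result is 1, whereas 'a' and 'b' differ in the two lowest bits.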

AutoHotKey: Making (caps+key) -> special character?

In Linux, I use xmodmap with the following configuration:
clear lock
keycode 66 = Mode_switch
keycode 34 = bracketleft braceleft aring Aring
keycode 47 = semicolon colon oslash Ooblique
keycode 48 = apostrophe quotedbl ae AE
keycode 21 = equal plus
keycode 35 = bracketright braceright
How can I do the same in AutoHotKey?
In other words, how can I make (Caps+[certain key]) -> [certain character]?

Christian,
Try this as an example for CapsLock + F1:
CapsLock & F1::Send, abcdefg
CapsLock & a::Send, Æ ; to send lowercase Æ, CapsLock+Shift+a is uppercase
Hope this is what you were looking for.
