Is there an uppercase letter that has two lowercase alternatives? - unicode

Are there two distinct lowercase letters ch1 and ch2 (ch1 <> ch2) such that uppercase(ch1) == uppercase(ch2)? Does such a pair actually exist in Unicode?
A follow-up question is: For any ch that's a lowercase letter, is the following expression always true?
ch == lowercase(uppercase(ch))

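I did a quick test in Java:
import static java.lang.Character.isLetter;
import static java.lang.Character.isLowerCase;

import java.util.Locale;

public class SameUpperCase {
    public static void main(String[] args) {
        // Compare every pair of lowercase letters in the BMP, with ch1 < ch2
        for (char ch1 = 0; ch1 < 65534; ch1++) {
            if (!isLetter(ch1) || !isLowerCase(ch1)) {
                continue;
            }
            String s1 = "" + ch1;
            for (char ch2 = (char) (ch1 + 1); ch2 < 65535; ch2++) {
                if (!isLetter(ch2) || !isLowerCase(ch2)) {
                    continue;
                }
                String s2 = "" + ch2;
                // Report the pair if both letters have the same uppercase form
                if (s1.toUpperCase(Locale.US).equals(s2.toUpperCase(Locale.US))) {
                    System.out.println("ch1=" + ch1 + " (" + (int) ch1 + "), ch2=" + ch2 + " (" + (int) ch2 + ")");
                }
            }
        }
    }
}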
It prints:
ch1=i (105), ch2=ı (305)
ch1=s (115), ch2=ſ (383)
ch1=µ (181), ch2=μ (956)
ch1=ΐ (912), ch2=ΐ (8147)
ch1=ΰ (944), ch2=ΰ (8163)
ch1=β (946), ch2=ϐ (976)
ch1=ε (949), ch2=ϵ (1013)
ch1=θ (952), ch2=ϑ (977)
ch1=ι (953), ch2=ι (8126)
ch1=κ (954), ch2=ϰ (1008)
ch1=π (960), ch2=ϖ (982)
ch1=ρ (961), ch2=ϱ (1009)
ch1=ς (962), ch2=σ (963)
ch1=φ (966), ch2=ϕ (981)
ch1=в (1074), ch2=ᲀ (7296)
ch1=д (1076), ch2=ᲁ (7297)
ch1=о (1086), ch2=ᲂ (7298)
ch1=с (1089), ch2=ᲃ (7299)
ch1=т (1090), ch2=ᲄ (7300)
ch1=т (1090), ch2=ᲅ (7301)
ch1=ъ (1098), ch2=ᲆ (7302)
ch1=ѣ (1123), ch2=ᲇ (7303)
ch1=ᲄ (7300), ch2=ᲅ (7301)
ch1=ᲈ (7304), ch2=ꙋ (42571)
ch1=ṡ (7777), ch2=ẛ (7835)
ch1=ſt (64261), ch2=st (64262)
So the answer is: yes, there are distinct characters that have the same upper-case representation.
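A rough Python equivalent of the test above, this time covering every Unicode code point rather than only the BMP. It is a sketch, not a port of the Java code: it assumes that general category Ll is an acceptable definition of "lowercase letter" and relies on Python's locale-independent str.upper(). The function name lowercase_collisions is made up for illustration, and the exact groups depend on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata
from collections import defaultdict


def lowercase_collisions():
    """Group lowercase letters (category Ll) that share one uppercase form."""
    groups = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Ll":
            groups[ch.upper()].append(ch)
    # Keep only uppercase forms reached from more than one lowercase letter.
    return {up: chars for up, chars in groups.items() if len(chars) > 1}


if __name__ == "__main__":
    for up, chars in sorted(lowercase_collisions().items()):
        print(up, "<-", " ".join(f"{c} (U+{ord(c):04X})" for c in chars))
```

Grouping by uppercase form avoids the quadratic pair loop, so each code point is visited only once.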

Upper, title, and lower case mappings can be locale-specific, so the same letter may map differently in different locales (e.g. French uppercase letters are sometimes written without their accents).
Unicode also defines a default, locale-independent way to convert to uppercase or lowercase. The exception is the Turkic languages, whose dotted and dotless i follow different rules (marked with T in the CaseFolding.txt file of the Unicode Character Database). Further special cases, including ones for Turkish, Greek, and Lithuanian, are listed in SpecialCasing.txt.
In most cases there is a unique way to convert lowercase to uppercase (and back), but there are exceptions. KELVIN SIGN (U+212A) lowercases to the ordinary k, and other signs share a glyph with an ordinary letter in the same way. These go away if you first normalize the text to remove compatibility characters.
Another case is the Greek letter sigma. There is only one uppercase form (Σ), but two lowercase forms: σ, and ς, which is used at the end of a word.
You will find more information in the Unicode report on the Unicode Character Database, http://www.unicode.org/reports/tr44/#Casemapping, and in the Unicode Standard (linked from that document, as are the two files named above).
Note: some mappings increase the number of code points (for example, ß uppercases to SS), so when converting back, one should check for the longest match.
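The follow-up question (is ch == lowercase(uppercase(ch)) always true?) can be checked directly. The sketch below uses Python's built-in, locale-independent str.upper() and str.lower(); the helper name round_trips and the choice of sample characters are illustrative only.

```python
def round_trips(ch: str) -> bool:
    """Return True if lowercasing the uppercase form gives back ch."""
    return ch.upper().lower() == ch


# Lowercase characters whose uppercase form does not map back to them.
samples = ["\u017f", "\u0131", "\u03c2", "\u00b5", "\u00df", "\ufb06"]

for ch in samples:
    up = ch.upper()
    print(
        f"U+{ord(ch):04X} {ch!r}: upper={up!r} "
        f"({len(up)} code point(s)), back to lower={up.lower()!r}, "
        f"round trip={round_trips(ch)}"
    )

# The reverse situation: KELVIN SIGN is a second uppercase form of k.
print(repr("\u212a".lower()))
```

None of the sample characters survives the round trip, so the expression in the question is not always true. The last two samples also show the length change: one code point becomes two after uppercasing.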

Related

How can I display "«" in Japanese? Shift JIS encoding

I have text and I want to add '«' at the start and '»' at the end of the string.
for example: «MyText».
#define START_MARKER 0x00AB // '«'
#define END_MARKER 0x00BB // '»'
char startMarker[2], endMarker[2];
sprintf(startMarker, "%c",START_MARKER);
sprintf(endMarker, "%c",END_MARKER);
const UTF16String unicodeStartMarker = ai::UnicodeString(startMarker).as_ASUnicode();
const UTF16String unicodeEndMarker = ai::UnicodeString(endMarker).as_ASUnicode();
sUnicodePlaceholder = unicodeStartMarker + UTF8String(sFieldName).GetUnicodeString() + unicodeEndMarker;
It works for all languages except Japanese.
In Japanese I get a different character.
The character "«" is also defined as U+00AB, but with the suffix SJIS (Shift JIS, a character encoding for Japanese).
Do you have any idea how I can display "«" in Japanese?
thanks!!
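No answer is recorded for this question. One plausible cause, which is an assumption and not confirmed anywhere in the thread, is that sprintf with "%c" writes the single byte 0xAB and that byte is later interpreted in the platform's encoding. The Python sketch below only illustrates that byte-level difference between a Latin-1 style encoding and Shift JIS; it does not use the ai::UnicodeString API.

```python
# The single byte that sprintf(buf, "%c", 0x00AB) would write.
marker = bytes([0x00AB])

as_latin1 = marker.decode("latin-1")    # Western single-byte interpretation
as_sjis = marker.decode("shift_jis")    # Japanese interpretation

print(f"latin-1:   {as_latin1!r} U+{ord(as_latin1):04X}")
print(f"shift_jis: {as_sjis!r} U+{ord(as_sjis):04X}")
```

If that is what happens, building the marker from the code point U+00AB itself, rather than from a byte in the platform encoding, should avoid the problem. How to do that with ai::UnicodeString depends on constructors that are not shown in the question.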

Why is my code returning 0? And not the numbers of Upper and Lower characters?

I'm trying to write code that counts how many uppercase and lowercase characters are in a string. Here's my code.
I've tried converting the result to a string, but that isn't working.
def up_low(string):
    result1 = 0
    result2 = 0
    for x in string:
        if x == x.upper():
            result1 + 1
        elif x == x.lower():
            result2 + 1
    print('You have ' + str(result1) + ' upper characters and ' +
          str(result2) + ' lower characters!')

up_low('Hello Mr. Rogers, how are you this fine Tuesday?')
I expect my outcome to calculate the upper and lower characters. Right now I'm getting "You have 0 upper characters and 0 lower characters!".
It's not adding up to result1 and result2.
It seems your error is in the assignment: you are missing an '=' symbol (e.g. result1 += 1).
for x in string:
    if x == x.upper():
        result1 += 1
    elif x == x.lower():
        result2 += 1
The problem is in the lines result1 + 1 and result2 + 1. Each is an expression, not an assignment. In other words, you compute the incremented value, and then that value goes nowhere.
The solution is to assign the result back to the counter, for example with result1 += 1.
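Putting the fix together, here is one possible complete version. It departs from the original in two ways, both my own choices rather than anything the answers say: it returns the counts instead of printing inside the function, and it tests with str.isupper() and str.islower(). The second change matters because x == x.upper() is also true for spaces and punctuation, so they would be counted as uppercase.

```python
def up_low(string):
    """Return (uppercase_count, lowercase_count) for the letters in string."""
    upper = 0
    lower = 0
    for ch in string:
        if ch.isupper():
            upper += 1      # augmented assignment stores the new value
        elif ch.islower():
            lower += 1
    return upper, lower


upper, lower = up_low('Hello Mr. Rogers, how are you this fine Tuesday?')
print(f'You have {upper} upper characters and {lower} lower characters!')
# prints: You have 4 upper characters and 33 lower characters!
```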

Unicode characters in ggplot labels

I can get ggplot to print Japanese Unicode characters in axis labels and legends, but not in labels. Is this a bug?
library(extrafont)
library(ggplot2)
data_frame <- cbind.data.frame("number"=c(1:3), "kana"=c("い","ろ","は"))
ggplot(data=data_frame, aes(kana, number)) +
geom_point() + theme_gray(base_family = "Meiryo") ##works great
ggplot(data=data_frame, aes(kana, number, label=kana)) +
geom_point() + geom_label() + theme_gray(base_family = "Meiryo") ##no such luck

Using big unicode signs in java

I'm writing a program for school that should compress text. First, I want to build a kind of dictionary from a large number of texts, to use for compression later.
My idea is that whenever I have 2 signs, I replace them with only 1. So first I build a TreeMap with all the pairs in my String.
So for example: String s = "Hello";
He -> 1
el -> 1
ll -> 1
lo -> 1
At the end, the counts in my TreeMap differ, and at a given point I want to write a rule into my dictionary. For example:
He -> x
el -> y
lo -> z
So here is the point: I want to start the "new signs" at Unicode number 65536 and increase it by 1 for every rule.
When I reanalyze my text into pairs, I think I get an error, but I am not sure about this.
TreeMap<String, Integer> map = new TreeMap<String, Integer>();
char[] text = s.toCharArray();
String signPair = "";
// search sign in map
for (int i = 0; i < s.length()-1; i++) {
    // check whether the 1st sign is > 65535 -> 2 chars
    if (Character.codePointAt(text, i) > 65535) {
        // check whether the 2nd sign is > 65535 -> 2 chars
        if (Character.codePointAt(text, i + 2) > 65535) {
            signPair = s.substring(i, i + 4);
            // compensate additional chars
            i += 2;
            // if not there
            if (!map.containsKey(signPair)) {
                // create the key, set the value to 1
                map.put(signPair, 1);
            } else {
                // key exists -> increase the value by 1
                int value = map.get(signPair);
                value++;
                map.put(signPair, value);
            }
In the end, when I print my map to the console, I only get � signs followed by a second one, and later I also get a lot of signs like 𐃰 that I can't interpret. In my output text most signs are between 5000 and 60000; none is higher than 65535.
Is it wrong to look at the chars and take substrings like this, or is it a mistake to get the code point from them?
Thanks for help!
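The scheme described in the question can be sketched in Python, where a string is a sequence of code points, so a sign above 65535 is still a single element. This is an illustration of the idea, not a fix for the Java code. The names count_pairs, build_rules, and utf16_units are hypothetical, and treating "the most frequent pairs" as the rule criterion is an assumption.

```python
from collections import Counter

FIRST_NEW_SIGN = 0x10000  # 65536, the starting code point named in the question


def count_pairs(text: str) -> Counter:
    """Count every pair of adjacent code points in text."""
    return Counter(a + b for a, b in zip(text, text[1:]))


def build_rules(text: str, max_rules: int) -> dict:
    """Map the most frequent pairs to new single code points (assumed criterion)."""
    frequent = count_pairs(text).most_common(max_rules)
    return {pair: chr(FIRST_NEW_SIGN + k) for k, (pair, _) in enumerate(frequent)}


def utf16_units(s: str) -> int:
    """Number of 16-bit UTF-16 code units needed for s (what Java's length() counts)."""
    return len(s.encode("utf-16-le")) // 2


print(dict(count_pairs("Hello")))
print(utf16_units(chr(FIRST_NEW_SIGN)))
```

The last line shows why the Java version needs care: a sign at or above 65536 occupies two char values (a surrogate pair) in a Java String, so indexing and substring arithmetic must advance by two code units for such a sign and by one for everything else.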

Renaming a Word document and saving its filename with its first 10 letters

I have recovered some Word documents from a corrupted hard drive using a piece of software called photorec. The problem is that the documents' names can't be recovered; they are all renamed by a sequence of numbers. There are over 2000 documents to sort through and I was wondering if I could rename them using some automated process.
Is there a script I could use to find the first 10 letters in the document and rename it with that? It would have to be able to cope with multiple documents having the same first 10 letters and so not write over documents with the same name. Also, it would have to avoid renaming the document with illegal characters (such as '?', '*', '/', etc.)
I only have a little bit of experience with Python, C, and even less with bash programming in Linux, so bear with me if I don't know exactly what I'm doing if I have to write a new script.
How about VBScript? Here is a sketch:
FolderName = "C:\Docs\"
Set fs = CreateObject("Scripting.FileSystemObject")
Set fldr = fs.GetFolder(FolderName)
Set ws = CreateObject("Word.Application")
For Each f In fldr.Files
    '' Skip Word's temporary lock files
    If Left(f.Name, 2) <> "~$" Then
        If InStr(f.Type, "Microsoft Word") Then
            MsgBox f.Name
            Set doc = ws.Documents.Open(FolderName & f.Name)
            s = vbNullString
            i = 1
            '' Use the first paragraph that still has text after cleaning
            Do While Trim(s) = vbNullString And i <= doc.Paragraphs.Count
                s = doc.Paragraphs(i).Range.Text
                s = CleanString(Left(s, 10))
                i = i + 1
            Loop
            doc.Close False
            If s = "" Then s = "NoParas"
            '' Extension including the dot, e.g. ".doc"
            ext = Mid(f.Name, InStrRev(f.Name, "."))
            s1 = s
            i = 1
            '' Append a counter until the target name is free
            Do While fs.FileExists(FolderName & s1 & ext)
                s1 = s & i
                i = i + 1
            Loop
            MsgBox "Name " & FolderName & f.Name & " As " & FolderName & s1 & ext
            '' This uses copy, because it seems safer
            f.Copy FolderName & s1 & ext, False
            '' MoveFile would rename the file instead of copying it:
            '' fs.MoveFile FolderName & f.Name, FolderName & s1 & ext
        End If
    End If
Next
MsgBox "Done"
ws.Quit
Set ws = Nothing
Set fs = Nothing

Function CleanString(StringToClean)
    ''http://msdn.microsoft.com/en-us/library/ms974570.aspx
    Dim objRegEx
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.IgnoreCase = True
    objRegEx.Global = True
    ''Find anything not a-z, 0-9
    objRegEx.Pattern = "[^a-z0-9]"
    CleanString = objRegEx.Replace(StringToClean, "")
End Function
Word documents are stored in a custom format which places a load of binary cruft on the beginning of the file.
The simplest thing would be to knock something up in Python that searched for the first line beginning with ASCII chars. Here you go:
#!/usr/bin/python3
import glob
import os

for file in glob.glob("*.doc"):
    new_name = ""
    run = ""  # current run of consecutive ASCII characters
    # Word files are binary, so read them byte by byte
    with open(file, "rb") as f:
        byte = f.read(1)
        while byte:
            code = byte[0]
            if 0 < code < 128:
                char = chr(code)
                # keep a-z, A-Z and 0-9; replace anything else with "_"
                run += char if char.isalnum() else "_"
                if len(run) == 100:
                    # 100 ASCII chars in a row: name the file after the first 20
                    new_name = run[:20] + ".doc"
                    break
            else:
                # a non-ASCII byte ends the run, so start over
                run = ""
            byte = f.read(1)
    if new_name:
        print("renaming " + file + " to " + new_name)
        os.rename(file, new_name)
NOTE: if you want to glob multiple directories you'll need to change the glob line accordingly. Also this takes no account of whether the file you're trying to rename to already exists, so if you have multiple docs with the same first few chars then you'll need to handle that.
I found the first chunk of 100 ASCII chars in a row (if you look for less than that you end up picking up doc keywords and such) and then used the first 20 of these to make the new name, replacing anything that's not a-z A-Z or 0-9 with underscores to avoid file name issues.
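The note above leaves name clashes to the reader. One way to handle them, sketched here under the assumption that appending a numeric suffix is acceptable, is to probe for a free name before renaming. The function unique_name and its exists parameter are hypothetical helpers, not part of the script above.

```python
import os


def unique_name(stem, ext=".doc", exists=os.path.exists):
    """Return stem + ext, or stem_1 + ext, stem_2 + ext, ... if that name is taken.

    `exists` is injectable so the logic can be tried without touching the disk.
    """
    candidate = stem + ext
    counter = 1
    while exists(candidate):
        candidate = f"{stem}_{counter}{ext}"
        counter += 1
    return candidate


# Try it against an in-memory set of names instead of the real file system.
taken = {"report.doc", "report_1.doc"}
print(unique_name("report", exists=taken.__contains__))
```

In the renaming script, this would be called with the 20-character stem just before os.rename, using the default os.path.exists check.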