Microsoft Keyboard Layout Creator: diacritics shown wrong after install

I'm trying to make a keyboard for myself in Microsoft Keyboard Layout Creator based on German with two additional diacritics: a caron (wedge above the character) and a macron (dash above the character). I'm defining these as dead keys. Everything works fine in "Test Keyboard Layout".
Once I build and install, it works fine for the macron (dash) but the carons are displayed as breves, i.e. instead of a little wedge over the vowels, I get a little semi-circle.
I do get errors like the one below in the verification log, but a) in the actual log the character displays fine (i.e. the issue isn't encoding, which is set to UTF-8), and b) I also get the same error for the macron, but that one works.
Here's the error:
"The dead key ̌ (U+030c) when combined with I (U+0049) returns Ǐ (U+01cf), but Ǐ (U+01cf) is not on the default system code page (1252) of the German (Germany) language you specified. This may cause compatibility problems in non-Unicode applications."
I also tried setting the language of my custom keyboard to US and a bunch of other languages in case that made a difference, but it looks like US and DE use the same code page, so it shouldn't matter.
My suspicion is that it has something to do with the limits of the language code page or Unicode range or something, but then again someone seems to have managed to get it working with a US base (same code page), so... I don't know. (And then why can I copy/paste the character from other texts and insert it from the Character Map? Is the issue maybe directly in the keyboard driver?)
All thoughts are appreciated!
EDIT: It's now the next day, and even the macron (dash) characters don't work anymore! Re-installed the keyboard; nothing ò_ó
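One way to pin down what the installed layout actually emits is to paste the typed output into a small program that dumps its code points; a minimal Java sketch (the sample string is a placeholder - paste what the layout produces):

    public class DumpCodePoints {
        public static void main(String[] args) {
            // Placeholder: paste the character the installed layout produces.
            String typed = "Ǐ";
            // A caron form shows U+01CF (Ǐ) or U+030C (combining caron);
            // a breve form would show U+012C (Ĭ) or U+0306 instead.
            typed.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        }
    }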

Related

Visually changing how a string of characters is represented in VS Code

Prettify Symbols is an extension for VS Code that changes a sequence of characters visually without affecting what the code does, for example visually changing --> to ⟶ while the coding language still uses -->. However, this extension creates seemingly random symbols throughout the file and is no longer maintained, so it is hardly usable at the moment.
Fira Code uses ligatures to do something similar (or the same, I'm not sure).
What other ways are there to visually change a string of letters? I am mostly interested in solutions for VS Code. As an example, I would like to change
~[\Omega]
to
Ω
visually for the user while the code uses the original ~[\Omega]
[EDIT: I found this GitHub page that adds ligatures to a font. Unfortunately, when one creates a ligature where the "hidden" symbol contains many characters and the visible symbol contains only a few, a long trail of spaces replacing the missing characters is left behind. The Prettify Symbols extension mentioned before does not have these spaces. For those who are still interested in making ligatures with the second link, this Fira Code font page shows the names of the symbols in Fira Code. That might be helpful when making a new font from Fira Code using the first link of this edit (which is the second link of the question).]
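For the Fira Code route mentioned above, the ligatures only take effect once they are enabled in the editor; a minimal settings.json sketch, assuming the Fira Code font is installed (both keys are standard VS Code settings):

    {
        // Render supported character sequences as ligatures.
        "editor.fontFamily": "Fira Code",
        "editor.fontLigatures": true
    }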

Japanese characters unreadable

I am working on my thesis and got access to a database that was used by Japanese scientists. They included some readme files, but the text that was supposed to be in Japanese is displayed in characters like these:
ÉRÅ[ÉqÅ[Ç…É~ÉãÉNÇì¸ÇÍÇ‹Ç∑Ç©ÅB
I've tried everything to convert them to Japanese characters, but I can't get it right. The database is from 1999; maybe that makes it harder to convert?
Does anybody know how to fix this?
So you have a text file, but with these strange characters? Does your text editor allow you to change the page encoding?
For example, in Atom, once your text file is open, you can switch the page encoding using the status bar: Atom knows (though perhaps this is inherited from the host system) Shift JIS, CP 932 and EUC-JP, which all seem to be related to Japanese character encoding.
Maybe you can find helpful details on this page?
But even once that's done, I guess you'll have to find a native speaker to tell you whether the results make sense...
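If switching the editor's encoding doesn't work, the bytes can also be re-decoded programmatically. The É/Ç/Å pattern in the sample looks like Shift JIS bytes mis-decoded as Mac OS Roman, though that pairing is a guess; a minimal Java sketch:

    public class FixMojibake {
        public static void main(String[] args) throws Exception {
            // The garbled line as it currently displays.
            String garbled = "ÉRÅ[ÉqÅ[Ç…É~ÉãÉNÇì¸ÇÍÇ‹Ç∑Ç©ÅB";
            // Undo the wrong decoding (assumed to be Mac OS Roman here),
            // then decode the recovered bytes as Shift JIS.
            byte[] raw = garbled.getBytes("x-MacRoman");
            System.out.println(new String(raw, "Shift_JIS"));
        }
    }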

Unicode characters aren't combined properly

I am working with some Devanagari text data I want to display in the browser. Unfortunately, there's one combination of nonspacing combining characters that doesn't get rendered as a properly combined character.
The problem occurs every time a base character is combined with the Devanagari Stress Sign Udatta ॑ (U+0951) and the Devanagari Sign Visarga ः (U+0903).
An example for this would be र॑ः, which is र (U+0930) + ॑ + ः and should be rendered as one character. But the stress sign and the other one don't seem to like each other (as you can see above!).
It's no problem to combine the base char with each of the other two signs alone, btw: र॑ / रः
I already tried to use several fonts which should be able to render Devanagari characters (some Noto fonts, Siddhanta, GentiumPlus) and tested it with different browsers, but the problem seems to be something else.
Does anyone have an idea? Is this not a valid combination of symbols?
EDIT: I just tried switching the two marks around just to see what would happen - it renders as रः॑, so U+0951 and U+0903 don't seem to have the same function, as the stress sign gets rendered on top of the other mark.
It looks like I don't understand Unicode well enough yet.
This is NOT a solution for your problem, but might be useful information:
I am working with some Devanagari text data I want to display in the browser.
Like you, I couldn't get this to work in any browser despite trying several fonts, including Arial Unicode MS:
The browser was simply rendering the text Devanagari Test: रः॑ from within the <body> of a JSP. The stress sign is clearly appearing above the Sign Visarga instead of the base character.
Is this not a valid combination of symbols?
It is a valid combination. I don't know Devanagari, so I don't know whether it is semantically "valid", but it is trivial to generate exactly the character you want from a Java application:
System.out.println("Devanagari test: \u0930\u0903\u0951");
The output from executing the println() call showed the stress sign above the base character, as intended.
That output was from NetBeans 8.2 on Windows 10, but the rendering also worked fine using the latest releases of Eclipse and IntelliJ IDEA. The constraints are:
The three characters must be specified in that order in println() for the rendering to work.
The Sign Visarga and the Stress Sign Udatta must be presented in their Unicode form. Pasting their glyph representations into the source code won't work, although this can be done for the base character.
An appropriate font must be used for the display. I used Arial Unicode MS for the screen shot above, but other fonts such as Serif, SansSerif and Monospaced also worked.
Does anyone have an idea?
Unfortunately not, although it is clear that:
The grapheme you want to render exists, and is valid.
Although it won't render in a browser, it can be written to the console by a Java application.
The problem seems to be that all browsers apply the diacritic (Stress Sign Udatta) to the immediately preceding character rather than the base character.
See Why are some combining diacritics shifted to the right in some programs? for more information on this.
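One more data point: Unicode normalization confirms the two orders are genuinely different sequences, not variants of one canonical form. A small check with java.text.Normalizer - the Visarga (U+0903) has combining class 0, so canonical reordering never swaps it with the Udatta (class 230):

    import java.text.Normalizer;

    public class MarkOrder {
        public static void main(String[] args) {
            String udattaFirst = "\u0930\u0951\u0903";  // र + udatta + visarga
            String visargaFirst = "\u0930\u0903\u0951"; // र + visarga + udatta
            // NFC leaves both strings unchanged, so the two orders are not
            // canonically equivalent and renderers may treat them differently.
            System.out.println(Normalizer.normalize(udattaFirst, Normalizer.Form.NFC)
                    .equals(visargaFirst)); // false
        }
    }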

How do I add a new Arabic vowel-sign in the PUA area of a font?

I am using Ubuntu 14.04, with FontForge compiled from the Git repo as of 31 July.
I'm trying to add a vowel-sign to an Arabic font, Graph, by Future Soft Egypt: http://openfontlibrary.org/en/font/graph
I have added glyphs where the Unicode code-point already exists (e.g. peh, U+067E), and that works fine. I am now trying to add a vowel sign where no Unicode code-point exists - it is a "damma with tail", used by some writers in Swahili to mean "o".
I decided to put it in the PUA at U+E909, and copied the font's damma (U+064F) and added a tail: http://kevindonnelly.org.uk/swahili/images/dammas.png
I generated the font, and set up the keyboard to emit that character. The glyph comes up OK, but there are two problems, as can be seen here: http://kevindonnelly.org.uk/swahili/images/output.png
showing at top "bubu", using the original damma, and at bottom "bobo", using the new damma-with-tail.
(1) The damma-with-tail is too far to the left, even though the anchor points in FontForge have not been moved.
(2) Worse, the damma-with-tail means that only the isolated versions of the consonant glyphs get used - in the second line the two bs should be joined, as in the first line.
I'm not sure whether this is a function of using the PUA, or whether it's due to my missing some step in FontForge (e.g. the Encoding -> Add Encoding Slots step that is needed for the consonants), but if anyone could shed some light on how to fix these two problems, I'd be very grateful.
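A hedged guess on problem (2): shaping engines decide joining behaviour from the code point's Arabic joining type. A real vowel sign such as U+064F is Transparent, but PUA code points default to Non_Joining, so the new sign itself breaks the join between the consonants. If ICU4J is on the classpath, this is easy to confirm:

    import com.ibm.icu.lang.UCharacter;
    import com.ibm.icu.lang.UProperty;

    public class JoiningTypeCheck {
        public static void main(String[] args) {
            for (int cp : new int[] {0x064F, 0xE909}) {
                int jt = UCharacter.getIntPropertyValue(cp, UProperty.JOINING_TYPE);
                // U+064F reports Transparent; the PUA code point reports
                // Non_Joining, which forces the isolated consonant forms.
                System.out.printf("U+%04X: %s%n", cp,
                        UCharacter.getPropertyValueName(UProperty.JOINING_TYPE,
                                jt, UProperty.NameChoice.LONG));
            }
        }
    }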

JMeter CSV Data Set is corrupting Japanese strings stored as proper UTF-8, I get Question Marks instead

I read in search terms from a simple text file to send to a search engine.
It works fine in English, but gives me ???? for any Japanese text.
Text with mixed English and Japanese does show the English text, so I know it's reading it.
What I'm seeing:
Input text:
Snow Leopard をインストールする場合、新しい
Turns into:
Snow Leopard ???????????????
This is in the POST field of an HTTP Request.
If I set JMeter to encode the data, it just puts in the percent sequence for question marks.
About the Data:
- The CSV file is very simple in structure: there's only one field / one column, which I name TERM and later use as ${TERM}.
- I don't really need full CSV because it's only one string per line; there are no commas or quotes.
- It's UTF-8, and when I run the Unix "file" command on the file, it says "UTF-8 text". I've also verified UTF-8 in command line and graphical mode on two machines.
Interesting note:
An interesting coincidence that I noticed: if there are 15 Japanese characters then I get 15 question marks, so at some point it's being seen as full characters and not just bytes.
JMeter CSV Dataset Config:
Filename: japanese-searches.csv
File encoding: UTF-8 (also tried without)
Variable names: TERM
Delimiter: ,
Allow Quoted Data: False (I also tried True, different, but still wrong)
Recycle at EOF: True
Stop at EOF: False
Sharing mode: All threads
A few things I've tried:
- Tried Allow quoted Data. It changed to other strange characters.
- Added -Dfile.encoding=UTF-8
- Tried encoding the POST stage, but it just turned into a bunch of %nn for question marks
And I'm not sure how to debug just after each line of the CSV is read in. I think it's corrupted right away, but I'm not sure.
If it's only mangled when I reference it, then instead of ${TERM} perhaps there's some other "to bytes" function call. I'll start checking into that. I haven't done anything with the JMeter functions yet.
Edited Dec 24:
Tweaks:
Changed formatting and added bullet points for more clarity.
Clarified that the file is UTF-8, and have verified that.
A new theory:
Is it possible that the Japanese characters are making it through, and the issue is that EVERY SINGLE place that shows them maps them to a "?" at DISPLAY TIME only? So even though I've checked in a bunch of places, they all have a display issue just in the UI?
Is there a way in JMeter to see the numeric value of a character or string? Actually, to tell JMeter to display the list of Unicode code points?
I'll look at my last log files... although I suppose even the server logs could have mis-mapped the characters.
Also, perhaps when doing variable expansion inside the text field that I POST, where I reference ${TERM}, it maps to question marks at that point, so the corruption happens later than I thought. If that happened, AND it was mis-displayed in the UI, it might lead to a false conclusion.
What I'd really like to do is pause JMeter after the first CSV record, just after that line is loaded, and look at it with a "data scope" or byte editor or something. Not sure if this is possible.
Found the issue: there was another place where the UTF-8 had to be specified.
In the HTTP Request, to the right of the Method, you also have to set Content Encoding to UTF-8.
Yes, in hindsight, this seems obvious, but there were a number of reasons I didn't think this was needed. Some of my incorrect assumptions might be helpful for others who are debugging, so here goes - I would have thought that:
1: Once text has made it into Java as Unicode, it stays as Unicode, and goes in and out by UTF-8. Obviously not in this case.
2: I sort of thought HTTP defaulted to UTF-8 unless you say otherwise, but maybe I'm just used to XML. It's probably not a good practice to assume that; HTTP may default to ISO-Latin-1 or something, and even if there's a spec, maybe folks don't follow it.
3: And if I don't specify it, I'd think the "do no harm" approach would be to pass the characters on and let the receiver on the other end deal with it. Wrong again!
(OK, so points 1, 2 and 3 overlap a bit)
4: Even though my HTTP Request was a POST, I did still try the Encode checkbox. I certainly thought that would have encoded it, but all I got was the repeating % hex for question marks, so it seemed to me that the data was already corrupted at that point. Wrong again. I suspect that WITHIN the HTTP phase there are TWO character transitions: first from Unicode to whatever encoding it thinks you have, and THEN a second encoding into the % signs, and my data was mis-encoded at the first step.
5: And I would have thought JMeter would say something or warn, but from my reading, apparently it's not helpful in that respect. You can do logging or whatever.
And the "?" is Java's way of reporting a problem BY default, this started in the Java 1.4x timeframe. In my Java code I prefer to set encoding errors to report as an exception, but again, not the default, and not what JMeter does.
So I learned my lesson.
The HINT that the Unicode was at least starting out OK was that the number of question marks equaled the number of Japanese characters, instead of having 2 or 3 times as many question marks. If the length of "???" matches your Japanese (or Chinese) string, then Java DID see actual Unicode characters at some point along the journey. Whereas if you see 3 times as many ?'s as input text, then Java always saw them as bytes or ints or whatever, and NEVER as valid codepoints.
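A small illustration of that hint, using ISO-8859-1 as a stand-in for whatever default encoding intervened:

    import java.nio.charset.StandardCharsets;

    public class QuestionMarkCount {
        public static void main(String[] args) {
            String term = "新しい"; // 3 Japanese characters
            // Real Unicode characters pushed through a charset that cannot
            // represent them: exactly one '?' per character.
            System.out.println(new String(
                    term.getBytes(StandardCharsets.ISO_8859_1),
                    StandardCharsets.ISO_8859_1)); // ???
            // UTF-8 bytes mis-read as Latin-1: three junk characters per
            // Japanese character, not '?'.
            System.out.println(new String(
                    term.getBytes(StandardCharsets.UTF_8),
                    StandardCharsets.ISO_8859_1));
        }
    }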
Came across this topic when searching for a solution for using parameters from a CSV file that contained some columns written in Hebrew.
I used Excel 2007 to create 1000 lines of data for user registrations. The first and the last names had to be in Hebrew.
I exported the file to "Unicode text" file. It became tab delimited.
"Unicode Text" saves in UTF-16 LE (Little Endian), not in UTF-8. That is important.
I opened the result in Notepad++ and could see the Hebrew letters properly. Notepad++ has an "Encoding" menu item, where you can check the encoding or change it, so I changed the Little Endian to UTF-8.
Then I replaced tabs with commas (I just selected a tab and pasted it into the Find box).
The parameters were substituted ok, but after running the script I saw the following:
In the "View Results Tree" listener I opened the "Result" tab of the "Http Request".
The parameters were substituted, but the HTTP view tab (on the bottom) of the Request showed me some gibberish.
But when I looked at the Raw view, I saw that the request parameters actually contained strings like %D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94 that, when taken in pairs (%D7 %A9), corresponded properly to Hebrew letters.
To my mind, JMeter has a bug and cannot properly display the Unicode chars. But it sends (POSTs) them out OK.
Hope I am right and hope it will help someone.
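To double-check that theory, the percent sequences from the Raw view can be decoded back; java.net.URLDecoder handles the %XX pairs:

    import java.net.URLDecoder;

    public class DecodeRawParam {
        public static void main(String[] args) throws Exception {
            String raw = "%D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94";
            // Decodes the percent-encoded bytes as UTF-8; if Hebrew letters
            // come out, the POSTed data was fine and only the listener's
            // display was at fault.
            System.out.println(URLDecoder.decode(raw, "UTF-8"));
        }
    }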
You can try using "SHIFT-JIS" as the Content encoding (it's near the Method selection). Then you should uncheck "Encode?" for the parameter that includes Japanese.
Hope it works for you.