Find characters that are glyphically similar in Unicode?

Let's say I have the characters Ú, Ù, Ü. All of them are glyphically similar to the English U.
Is there some list or algorithm to do this:
Given Ú, Ù, or Ü, return the English U
Given an English U, return the list of all U-similar characters
I'm not sure whether the code point of a Unicode character is the same across all fonts.
If it is, I suppose there could be some easy and efficient way to do this?
UPDATE
If you're using Ruby, there is a gem available, unicode-confusable, that may help in some cases.

It is very unclear what you are asking to do here.
There are characters whose canonical decompositions all start with the same base character: e, é, ê, ë, ē, ĕ, ė, ę, ě, ȅ, ȇ, ȩ, ḕ, ḗ, ḙ, ḛ, ḝ, ẹ, ẻ, ẽ, ế, ề, ể, ễ, ệ, e̳, … or s, ś, ŝ, ş, š, ș, ṡ, ṣ, ṥ, ṧ, ṩ, ….
There are characters whose compatibility decompositions all include a particular character: ᵉ, ₑ, ℯ, ⅇ, ⒠, ⓔ, ㋍, ㋎, e, … or s, ſ, ˢ, ẛ, ₨, ℁, ⒮, ⓢ, ㎧, ㎨, ㎮, ㎯, ㎰, ㎱, ㎲, ㎳, ㏛, ſt, st, s, … or R, ᴿ, ₨, ℛ, ℜ, ℝ, Ⓡ, ㏚, R, ….
There are characters that just happen to look alike in some fonts: ß and β and ϐ, or 3 and Ʒ and Ȝ and ȝ and ʒ and ӡ and ᴣ, or ɣ and ɤ and γ, or F and Ϝ and ϝ, or B and Β and В, or ∅ and ○ and 0 and O and ০ and ੦ and ౦ and ૦, or 1 and l and I and Ⅰ and ᛁ and | and ǀ and ∣, ….
Characters that are the same case-insensitively, like s and S and ſ, or ss and Ss and SS and ß and ẞ, ….
Characters that all have the same numeric value, like all these for the value 1: 1¹١۱߁१১੧૧୧௧౧౹౼೧൧๑໑༡၁႑፩១៱᠑᥇᧑᧚᪁᪑᭑᮱᱁᱑₁⅟ ① ⑴ ⒈ ⓵ ❶➀➊꘡꣑꤁꧑꩑꯱𐄇𐅂𐅘𐅙𐅚𐌠𐏑𐒡𐡘𐤖𐩀𐩽𐭘𐭸𐹠𒐕𒐞𒐬𒐴𒑏𒑘𝍠𝟏𝟙𝟣𝟭𝟷 🄂 Ⅰⅰꛦ㆒㈠㊀𑁒𑁧.
Characters that all have the same primary collation strength, like all these that are the same as d: DdÐðĎďĐđ◌ͩᴰᵈᶞ◌ᷘ◌ᷙḊḋḌḍḎḏḐḑḒḓⅅⅆⅮⅾ Ⓓ ⓓ ꝹꝺDd𝐃𝐝𝐷𝑑𝑫𝒅𝒟𝒹𝓓𝓭𝔇𝔡𝔻𝕕𝕯𝖉𝖣𝖽𝗗𝗱𝘋𝘥𝘿𝙙𝙳𝚍 🄳 🅓 🅳 🇩 . Note that some of those are not accessible through any kind of decomposition, but only through the DUCET/UCA values; for example, the fairly common ð or the newish ꝺ can be equated to d only through a primary UCA strength comparison; same with ƶ and z, ȼ and c, etc.
Characters that are the same in certain locales, like æ and ae, or ä and ae, or ä and aa, or MacKinley and McKinley, …. Note that locale can make a really big difference, since in some locales both c and ç are the same character, while in others they are not; similarly for n and ñ, or a and á and ã, ….
Some of these can be handled. Some cannot. All require different approaches depending on different needs.
What is your real goal?
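For what it's worth, the first two categories above can be probed directly, since canonical (NFD) and compatibility (NFKD) decompositions come straight from the Unicode character data. Here is a rough ICU4C sketch (assuming ICU is installed; link with -licuuc; the two test characters are just illustrative):
#include <stdio.h>
#include <unicode/unorm2.h>

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    const UNormalizer2 *nfd  = unorm2_getNFDInstance(&status);   /* canonical */
    const UNormalizer2 *nfkd = unorm2_getNFKDInstance(&status);  /* compatibility */
    UChar src[] = { 0x00E9, 0x1D49, 0 };   /* "é" U+00E9 and superscript "ᵉ" U+1D49 */
    UChar out[16];
    int32_t n, i;

    /* NFD splits é into e + combining acute, but leaves ᵉ alone. */
    n = unorm2_normalize(nfd, src, -1, out, 16, &status);
    printf("NFD :");
    for (i = 0; i < n; i++) printf(" U+%04X", (unsigned) out[i]);
    printf("\n");

    /* NFKD additionally folds the superscript ᵉ down to a plain e. */
    n = unorm2_normalize(nfkd, src, -1, out, 16, &status);
    printf("NFKD:");
    for (i = 0; i < n; i++) printf(" U+%04X", (unsigned) out[i]);
    printf("\n");
    return U_FAILURE(status) ? 1 : 0;
}
The confusable, collation-strength, and locale cases are not reachable this way; those need the UCA/DUCET data (for example via ICU collators) or the confusables.txt table from UTS #39.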

This won't work for all conditions, but one way to get rid of most accents is to convert the characters to their decomposed form, then throw away the combining accents:
# coding: utf8
import unicodedata as ud
s=u'U, Ù, Ú, Û, Ü, Ũ, Ū, Ŭ, Ů, Ű, Ų, Ư, Ǔ, Ǖ, Ǘ, Ǚ, Ǜ, Ụ, Ủ, Ứ, Ừ, Ử, Ữ, Ự'
print ud.normalize('NFD',s).encode('ascii','ignore')
Output
U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U
To find accent characters, use something like:
import unicodedata as ud
import string

def asc(unichr):
    return ud.normalize('NFD',unichr).encode('ascii','ignore')

U = u''.join(unichr(i) for i in xrange(65536))
for c in string.letters:
    print u''.join(u for u in U if asc(u) == c)
Output
aàáâãäåāăąǎǟǡǻȁȃȧḁạảấầẩẫậắằẳẵặ
bḃḅḇ
cçćĉċčḉ
dďḋḍḏḑḓ
eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ
fḟ
:
etc.

Why not just compare glyphs with something like this?
package similarglyphcharacterdetector;
import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.font.FontRenderContext;
import java.awt.image.BufferedImage;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
public class SimilarGlyphCharacterDetector {
    static char[] TEST_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890".toCharArray();
    static BufferedImage[] SAMPLES = null;

    // Render the string into a grayscale image sized to the font's maximum character bounds.
    public static BufferedImage drawGlyph(Font font, String string) {
        FontRenderContext frc = ((Graphics2D) new BufferedImage(1, 1, BufferedImage.TYPE_BYTE_GRAY).getGraphics()).getFontRenderContext();
        Rectangle r = font.getMaxCharBounds(frc).getBounds();
        BufferedImage res = new BufferedImage(r.width, r.height, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = (Graphics2D) res.getGraphics();
        g.setBackground(Color.WHITE);
        g.fillRect(0, 0, r.width, r.height);
        g.setPaint(Color.BLACK);
        g.setFont(font);
        g.drawString(string, 0, r.height - font.getLineMetrics(string, g.getFontRenderContext()).getDescent());
        return res;
    }

    private static void drawSamples(Font f) {
        SAMPLES = new BufferedImage[TEST_CHARS.length];
        for (int i = 0; i < TEST_CHARS.length; i++)
            SAMPLES[i] = drawGlyph(f, String.valueOf(TEST_CHARS[i]));
    }

    // Distance between two glyph images = number of differing pixels.
    private static int compareImages(BufferedImage img1, BufferedImage img2) {
        if (img1.getWidth() != img2.getWidth() || img1.getHeight() != img2.getHeight())
            throw new IllegalArgumentException();
        int d = 0;
        for (int y = 0; y < img1.getHeight(); y++) {
            for (int x = 0; x < img1.getWidth(); x++) {
                if (img1.getRGB(x, y) != img2.getRGB(x, y))
                    d++;
            }
        }
        return d;
    }

    // Index of the closest sample glyph, or -1 if even the best match differs by more than maxDistance pixels.
    private static int nearestSampleIndex(BufferedImage image, int maxDistance) {
        int best = Integer.MAX_VALUE;
        int bestIdx = -1;
        for (int i = 0; i < SAMPLES.length; i++) {
            int diff = compareImages(image, SAMPLES[i]);
            if (diff < best) {
                best = diff;
                bestIdx = i;
            }
        }
        if (best > maxDistance)
            return -1;
        return bestIdx;
    }

    public static void main(String[] args) throws Exception {
        Font f = new Font("FreeMono", Font.PLAIN, 13);
        drawSamples(f);
        HashMap<Character, StringBuilder> res = new LinkedHashMap<Character, StringBuilder>();
        for (char c : TEST_CHARS)
            res.put(c, new StringBuilder(String.valueOf(c)));
        int maxDistance = 5;
        // Group every displayable character from U+0080 to U+FFFF under the ASCII letter or digit it most resembles.
        for (int i = 0x80; i <= 0xFFFF; i++) {
            char c = (char)i;
            if (f.canDisplay(c)) {
                int n = nearestSampleIndex(drawGlyph(f, String.valueOf(c)), maxDistance);
                if (n != -1) {
                    char nc = TEST_CHARS[n];
                    res.get(nc).append(c);
                }
            }
        }
        for (Map.Entry<Character, StringBuilder> entry : res.entrySet())
            if (entry.getValue().length() > 1)
                System.out.println(entry.getValue());
    }
}
Output:
AÀÁÂÃÄÅĀĂĄǍǞȀȦΆΑΛАѦӒẠẢἈἉᾸᾹᾺᾼ₳Å
BƁƂΒБВЬḂḄḆ
CĆĈĊČƇΓЄГСὉℂⅭ
...

Related

How to emulate *really simple* variable bit shifts with SSE?

I have two variable bit-shifting code fragments that I want to SSE-vectorize by some means:
1) a = 1 << b (where b = 0..7 exactly), i.e. 0/1/2/3/4/5/6/7 -> 1/2/4/8/16/32/64/128
2) a = 1 << (8 * b) (where b = 0..7 exactly), i.e. 0/1/2/3/4/5/6/7 -> 1/0x100/0x10000/etc
OK, I know that AMD's XOP VPSHLQ would do this, as would AVX2's VPSLLVQ. But my challenge here is whether this can be achieved on 'normal' (i.e. up to SSE4.2) SSE.
So, is there some funky SSE-family opcode sequence that will achieve the effect of either of these code fragments? These only need yield the listed output values for the specific input values (0-7).
Update: here's my attempt at 1), based on Peter Cordes' suggestion of using the floating point exponent to do simple variable bitshifting:
#include <stdint.h>

typedef union
{
    int32_t i;
    float f;
} uSpec;

void do_pow2(uint64_t *in_array, uint64_t *out_array, int num_loops)
{
    uSpec u;
    for (int i=0; i<num_loops; i++)
    {
        int32_t x = *(int32_t *)&in_array[i];
        u.i = (127 + x) << 23;
        int32_t r = (int32_t) u.f;
        out_array[i] = r;
    }
}
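For what it's worth, the same exponent trick vectorizes directly with plain SSE2 intrinsics. A minimal sketch (the helper name is mine, not from the question; loading four b values from the uint64_t array and storing the results back is left out):
#include <emmintrin.h>  /* SSE2 */

/* Compute 1 << b for four 32-bit lanes, assuming each b is in 0..7:
   build the IEEE-754 single-precision bit pattern of 2^b from its biased
   exponent, reinterpret as float, then convert back to integer (exact,
   since every 2^b in this range is exactly representable). */
static inline __m128i pow2_epi32(__m128i b)
{
    __m128i exp  = _mm_add_epi32(b, _mm_set1_epi32(127)); /* biased exponent */
    __m128i bits = _mm_slli_epi32(exp, 23);               /* move into the exponent field */
    return _mm_cvttps_epi32(_mm_castsi128_ps(bits));      /* 2^b as an int32 */
}
Case 2) arguably needs no shift at all: since 1 << (8 * b) is just a 0x01 byte placed at byte index b of a 64-bit lane, it can be built by comparing a per-lane byte-index constant {0,1,…,7} against b broadcast to every byte (pcmpeqb) and masking the result with 0x01.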

Eigen: how can I substitute matrix positive values with 1 and 0 otherwise?

I want to write the following MATLAB code in Eigen (where K is p×p and W is p×b):
H = (K*W)>0;
However the only thing that I came up so far is:
H = ((K*W.array() > 0).select(1,0));
This code doesn't work as explained here, but replacing 0 with VectorXd::Constant(p,0) (as suggested in the linked question) generates a runtime error:
Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 1]: Assertion `v == T(Value)' failed.
How can I solve this?
You don't need .select(). You just need to cast an array of bool to an array of H's component type.
H = ((K * W).array() > 0.0).cast<double>();
Your original attempt failed because the size of your constant 1/0 array does not match the size of H. Using VectorXd::Constant is not a good choice when H is a MatrixXd. You also have a problem with parentheses; I think you want * rather than .* in MATLAB notation.
#include <iostream>
#include <Eigen/Eigen>

using namespace Eigen;

int main() {
    const int p = 5;
    const int b = 10;
    MatrixXd H(p, b), K(p, p), W(p, b);
    K.setRandom();
    W.setRandom();

    H = ((K * W).array() > 0.0).cast<double>();
    std::cout << H << std::endl << std::endl;

    H = ((K * W).array() > 0).select(MatrixXd::Constant(p, b, 1),
                                     MatrixXd::Constant(p, b, 0));
    std::cout << H << std::endl;
    return 0;
}
When calling a template member function in a template, you need to use the template keyword.
#include <iostream>
#include <Eigen/Eigen>

using namespace Eigen;

template<typename Mat, typename Vec>
void createHashTable(const Mat &K, Eigen::MatrixXi &H, Mat &W, int b) {
    Mat CK = K;
    H = ((CK * W).array() > 0.0).template cast<int>();
}

int main() {
    const int p = 5;
    const int b = 10;
    Eigen::MatrixXi H(p, b);
    Eigen::MatrixXf W(p, b), K(p, p);
    K.setRandom();
    W.setRandom();

    createHashTable<Eigen::MatrixXf, Eigen::VectorXf>(K, H, W, b);
    std::cout << H << std::endl;
    return 0;
}
See this for some explanation.
Issue casting C++ Eigen::Matrix types via templates

cula use of culaSgels - wrong argument?

I am trying to use the culaSgels function in order to solve Ax=B.
I modified the systemSolve example of the cula package.
void culaFloatExample()
{
    int N=2;
    int NRHS = 2;
    int i,j;
    double cula_time,start_time,end_time;
    culaStatus status;

    culaFloat* A = NULL;
    culaFloat* B = NULL;
    culaFloat* X = NULL;
    culaFloat one = 1.0f;
    culaFloat thresh = 1e-6f;
    culaFloat diff;

    printf("Allocating Matrices\n");
    A = (culaFloat*)malloc(N*N*sizeof(culaFloat));
    B = (culaFloat*)malloc(N*N*sizeof(culaFloat));
    X = (culaFloat*)malloc(N*N*sizeof(culaFloat));
    if(!A || !B )
        exit(EXIT_FAILURE);

    printf("Initializing CULA\n");
    status = culaInitialize();
    checkStatus(status);

    // Set A
    A[0]=1;
    A[1]=2;
    A[2]=3;
    A[3]=4;

    // Set B
    B[0]=5;
    B[1]=6;
    B[2]=2;
    B[3]=3;

    printf("Calling culaSgels\n");
    // Run CULA's version
    start_time = getHighResolutionTime();
    status = culaSgels('N',N,N, NRHS, A, N, A, N);
    end_time = getHighResolutionTime();
    cula_time = end_time - start_time;
    checkStatus(status);

    printf("Verifying Result\n");
    for(i = 0; i < N; ++i){
        for (j=0;j<N;j++)
        {
            diff = X[i+j*N] - B[i+j*N];
            if(diff < 0.0f)
                diff = -diff;
            if(diff > thresh)
                printf("\nResult check failed: X[%d]=%f B[%d]=%f\n", i, X[i+j*N],i, B[i+j*N]);
            printf("\nResults:X= %f \t B= %f:\n",X[i+j*N],B[i+j*N]);
        }
    }

    printRuntime(cula_time);
    printf("Shutting down CULA\n\n");
    culaShutdown();

    free(A);
    free(B);
}
I am using culaSgels('N',N,N, NRHS, A, N, A, N); to solve the system, but:
1) The results show me that every element of X is 0, but B is right.
Also, it shows me the
Result check failed message
2) Studying the reference manual, it says that the argument before the last one (the A I have) should be the matrix B stored column-wise, but if I use "B" instead of "A" as that parameter, then I am not getting the correct B matrix.
OK, the code needs 3 things to work.
1) Change A to B, so: culaSgels('N',N,N, NRHS, A, N, B, N);
(I had misunderstood this; on exit, B contains the solution.)
2) Because CULA uses column-major storage, change the A and B matrices accordingly.
3) Change to:
B = (culaFloat*)malloc(N*NRHS*sizeof(culaFloat));
X = (culaFloat*)malloc(N*NRHS*sizeof(culaFloat));
(use NRHS and not N, which happen to be the same in this example)
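Putting the three changes together, the relevant pieces look roughly like this (a sketch, assuming the intended matrices are A = [1 2; 3 4] and B = [5 6; 2 3]; the rest of the example is unchanged):
B = (culaFloat*)malloc(N*NRHS*sizeof(culaFloat));   /* N x NRHS, not N x N */
X = (culaFloat*)malloc(N*NRHS*sizeof(culaFloat));

/* CULA, like LAPACK, is column-major: element (i,j) is stored at i + j*ld. */
A[0] = 1; A[1] = 3;   /* first column of A  */
A[2] = 2; A[3] = 4;   /* second column of A */
B[0] = 5; B[1] = 2;   /* first column of B  */
B[2] = 6; B[3] = 3;   /* second column of B */

/* B holds the right-hand sides on entry and the solution on exit. */
status = culaSgels('N', N, N, NRHS, A, N, B, N);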
Thanks!

reading files using fgetc

I have a question about using fgetc to count characters in a specified file.
How do you use it when you have to count character types separately? For example, I only want to count the number of lowercase characters, or the number of spaces, or punctuation, etc. Can someone show a brief example? Thank you.
I tried to write this program, which would hopefully count the total number of characters; how do you squeeze in the counts for the separate character types, though? I'm not exactly sure whether this program is correct.
#include <stdio.h>

int main (void)
{
    //Local declarations
    int a;
    int count = 0;
    FILE* fp;

    //Statements
    if (!(fp = fopen("piFile.c", "r")))
    {
        printf("Error opening file.\n");
        return (1);
    }//if open error

    while ((a = fgetc (fp)) != EOF)
    {
        if (a != '\n')
        {
            count++;
            printf("Number of characters: %d \n", count);
        }
        else
            printf("There are no characters to count.\n");
    }
    fclose(fp);
    return 0;
}
Read up on these functions:
int isalnum(int c);
int isalpha(int c);
int isascii(int c);
int isblank(int c);
int iscntrl(int c);
int isdigit(int c);
int isgraph(int c);
int islower(int c);
int isprint(int c);
int ispunct(int c);
int isspace(int c);
int isupper(int c);
int isxdigit(int c);
and you'll see right away how to do it.
In your while loop, you can use an if statement for each character class you want to check:
if (isalnum(a)) {
    counta++;
}
else if (isalpha(a)) {
    countb++;
}
else if (isascii(a)) {
    countc++;
}
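For example, a complete version of the idea might look like this (a sketch reusing the file name from the question; count whichever classes you actually need):
#include <stdio.h>
#include <ctype.h>

int main(void)
{
    FILE *fp = fopen("piFile.c", "r");
    if (!fp) {
        printf("Error opening file.\n");
        return 1;
    }

    int c, total = 0, lower = 0, spaces = 0, punct = 0;
    while ((c = fgetc(fp)) != EOF) {
        total++;
        if (islower(c))
            lower++;
        else if (isspace(c))
            spaces++;
        else if (ispunct(c))
            punct++;
    }
    fclose(fp);

    printf("total: %d, lowercase: %d, whitespace: %d, punctuation: %d\n",
           total, lower, spaces, punct);
    return 0;
}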

Windows C API for UTF8 to 1252

I'm familiar with WideCharToMultiByte and MultiByteToWideChar conversions and could use these to do something like:
UTF8 -> UTF16 -> 1252
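Roughly like this, I mean (an untested sketch; real code would query the required lengths instead of using a fixed buffer):
#include <windows.h>

int utf8_to_1252(const char *utf8, char *out, int outSize)
{
    wchar_t wide[1024];   /* illustrative fixed-size intermediate buffer */
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, 1024);
    if (n == 0)
        return 0;
    /* 1252 is the Windows-1252 code page identifier */
    return WideCharToMultiByte(1252, 0, wide, -1, out, outSize, NULL, NULL);
}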
I know that iconv will do what I need, but does anybody know of any MS libs that will allow this in a single call?
I should probably just pull in the iconv library, but am feeling lazy.
Thanks
Windows 1252 is mostly equivalent to latin-1, aka ISO-8859-1: Windows-1252 just has some additional characters allocated in the latin-1 reserved range 128-159. If you are willing to ignore those extra characters and stick to latin-1, then conversion is rather easy. Try this:
#include <stddef.h>
/*
* Convert from UTF-8 to latin-1. Invalid encodings, and encodings of
* code points beyond 255, are replaced by question marks. No more than
* dst_max_len bytes are stored in the destination array. Returned value
* is the length that the latin-1 string would have had, assuming a big
* enough destination buffer.
*/
size_t
utf8_to_latin1(char *src, size_t src_len,
               char *dst, size_t dst_max_len)
{
    unsigned char *sb;
    size_t u, v;

    u = v = 0;
    sb = (unsigned char *)src;
    while (u < src_len) {
        int c = sb[u ++];

        if (c >= 0x80) {
            if (c >= 0xC0 && c < 0xE0) {
                /* two-byte sequence: read the continuation byte */
                if (u == src_len) {
                    c = '?';
                } else {
                    int w = sb[u];

                    if (w >= 0x80 && w < 0xC0) {
                        u ++;
                        c = ((c & 0x1F) << 6)
                            + (w & 0x3F);
                        if (c > 0xFF)
                            c = '?';   /* code point beyond latin-1 */
                    } else {
                        c = '?';
                    }
                }
            } else {
                /* longer (or invalid) sequence: count the leading 1 bits
                   to know how many continuation bytes to skip */
                int i;

                for (i = 6; i >= 0; i --)
                    if (!(c & (1 << i)))
                        break;
                c = '?';
                u += 6 - i;
            }
        }
        if (v < dst_max_len)
            dst[v] = (char)c;
        v ++;
    }
    return v;
}
/*
* Convert from latin-1 to UTF-8. No more than dst_max_len bytes are
* stored in the destination array. Returned value is the length that
* the UTF-8 string would have had, assuming a big enough destination
* buffer.
*/
size_t
latin1_to_utf8(char *src, size_t src_len,
               char *dst, size_t dst_max_len)
{
    unsigned char *sb;
    size_t u, v;

    u = v = 0;
    sb = (unsigned char *)src;
    while (u < src_len) {
        int c = sb[u ++];

        if (c < 0x80) {
            if (v < dst_max_len)
                dst[v] = (char)c;
            v ++;
        } else {
            int h = 0xC0 + (c >> 6);
            int l = 0x80 + (c & 0x3F);

            if (v < dst_max_len) {
                dst[v] = (char)h;
                if ((v + 1) < dst_max_len)
                    dst[v + 1] = (char)l;
            }
            v += 2;
        }
    }
    return v;
}
Note that I make no guarantee about this code. This is completely untested.
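A typical way to drive these, given that they return the full required length, is a two-pass call (my sketch, not part of the answer above; it assumes utf8_to_latin1 as defined earlier):
#include <stdlib.h>
#include <string.h>

char *to_latin1(char *utf8)
{
    size_t n = strlen(utf8);
    size_t need = utf8_to_latin1(utf8, n, NULL, 0);   /* first pass: length only */
    char *out = malloc(need + 1);
    if (!out)
        return NULL;
    utf8_to_latin1(utf8, n, out, need);               /* second pass: convert */
    out[need] = '\0';
    return out;
}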