Unicode character usage statistics [closed] - unicode

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 8 years ago.
Improve this question
I am looking for some statistical data on the usage of Unicode characters in textual documents (with any markup). Googling brought no results.
Background: I am currently developing a finite state machine-based text processing tool. Statistical data on characters might help searching for the right transitions. For instance latin characters are probably most used so it might make sense to check for those first.
Did anyone by chance gathered or saw such statistics?
(I'm not focused on specific languages or locales. Think general-purpose parser like an XML parser.)

To sum up current findings and ideas:
Tom Christiansen gathered such statistics for PubMed Open Access Corpus (see this question). I have asked if he could share these statistics, waiting for the answer.
As #Boldewyn and #nwellnhof suggested, I could run the analysis of the complete Wikipedia dump or CommonCrawl data. I think these are good suggestions, I'll probably go with the CommonCrawl.
So sorry, this is not an answer, but a good research direction.
UPDATE: I have written a small Hadoop job and ran it on one of the CommonCrawl segments. I have posted my results in a spreadsheet here. Below are the first 50 characters:
0x000020 14627262
0x000065 7492745 e
0x000061 5144406 a
0x000069 4791953 i
0x00006f 4717551 o
0x000074 4566615 t
0x00006e 4296796 n
0x000072 4293069 r
0x000073 4025542 s
0x00000a 3140215
0x00006c 2841723 l
0x000064 2132449 d
0x000063 2026755 c
0x000075 1927266 u
0x000068 1793540 h
0x00006d 1628606 m
0x00fffd 1579150
0x000067 1279990 g
0x000070 1277983 p
0x000066 997775 f
0x000079 949434 y
0x000062 851830 b
0x00002e 844102 .
0x000030 822410 0
0x0000a0 797309
0x000053 718313 S
0x000076 691534 v
0x000077 682472 w
0x000031 648470 1
0x000041 624279 #
0x00006b 555419 k
0x000032 548220 2
0x00002c 513342 ,
0x00002d 510054 -
0x000043 498244 C
0x000054 495323 T
0x000045 455061 E
0x00004d 426545 M
0x000050 423790 P
0x000049 405276 I
0x000052 393218 R
0x000044 381975 D
0x00004c 365834 L
0x000042 353770 B
0x000033 334689 E
0x00004e 325299 N
0x000029 302497 /
0x000028 301057 (
0x000035 298087 5
0x000046 295148 F
To be honest, I have no idea if these results are representative. As I said, I only analysed one segment. Looks quite plausible for me. One can also easily spot that the markup is already stripped off - so the distribution is not directly suitable for my XML parser. But it gives valuable hints on which character ranges to check first.

The link to http://emojitracker.com/ in the near-duplicate question I personally think is the most promising resource for this. I have not examined the sources (I don't speak Ruby) but from a real-time Twitter feed of character frequencies, I would expect quite a different result than from static web pages, and probably a radically different language distribution (I see lots more Arabic and Turkish on Twitter than in my otherwise ordinary life). It's probably not exactly what you are looking for, but if we just look at the title of your question (which probably most visitors will have followed to get here) then that is what I would suggest as the answer.
Of course, this begs the question what kind of usage you attempt to model. For static XML, which you seem to be after, maybe the Common Crawl set is a better starting point after all. Text coming out of an editorial process (however informal) looks quite different from spontaneous text.
Out of the suggested options so far, Wikipedia (and/or Wiktionary) is probably the easiest, since it's small enough for local download, far better standardized than a random web dump (all UTF-8, all properly tagged, most of it properly tagged by language and proofread for markup errors, orthography, and occasionally facts), and yet large enough (and probably already overkill by an order of magnitude or more) to give you credible statistics. But again, if the domain is different than the domain you actually want to model, they will probably be wrong nevertheless.

Related

How to replace choicerule to reduce "meaningless" answers that kill grounding process using asp (clingo)

I'm currently working on an answer set program to create a timetable for a school.
The rule base I use looks similar to this:
teacher(a). teacher(b). teacher(c). teacher(d). teacher(e). teacher(f).teacher(g).teacher(h).teacher(i).teacher(j).teacher(k).teacher().teacher(m).teacher(n).teacher(o).teacher(p).teacher(q).teacher(r).teache(s).teacher(t).teacher(u).
teaches(a,info). teaches(a,math). teaches(b,bio). teaches(b,nawi). teaches(c,ge). teaches(c,gewi). teaches(d,ge). teaches(d,grw). teaches(e,de). teaches(e,mu). teaches(f,de). teaches(f,ku). teaches(g,geo). teaches(g,eth). teaches(h,reli). teaches(h,spo). teaches(i,reli). teaches(i,ku). teaches(j,math). teaces(j,chem). teaches(k,math). teaches(k,chem). teaches(l,deu). teaches(l,grw). teaches(m,eng). teaches(m,mu). teachs(n,math). teaches(n,geo). teaches(o,spo). teaches(o,fremd). teaches(p,eng). teaches(p,fremd). teaches(q,deu). teaches(q,fremd). teaches(r,deu). teaches(r,eng). teaches(s,eng). teaches(s,spo). teaches(t,te). teaches(t,eng). teaches(u,bio). teaches(u,phy).
subject(X) :- teaches(_,X).
class(5,a). class(5,b). class(6,a). class(6,b). class(7,a). class(7,b). class(8,a). class(8,b). class(9,a). class(9,b). class(10,a). class(10,b).
%classes per week (for class 5 only at the moment)
classperweek(5,de,5). classperweek(5,info,0). classperweek(5,eng,5). classpereek(5,fremd,0). classperweek(5,math,4). classperweek(5,bio,2). classperweek(5,chem,0). classperweek(5,phy,0). classperweek(5,ge,1). classperweek(5,grw,0). cassperweek(5,geo,2). classperweek(5,spo,3). classperweek(5,eth,2). classperwek(5,ku,2). classperweek(5,mu,2). classperweek(5,tec,0). classperweek(5,nawi,0) .classperweek(5,gewi,0). classperweek(5,reli,2).
room(1..21).
%for monday to friday
weekday(1..5).
%for lesson 1 to 9
slot(1..9).
In order to creat a timetable I wanted to create every possible combination of all predicats I'm using and then filter all wrong answers.
This is how I created a timetable:
{timetable(W,S,T,A,B,J,R):class(A,B),teacher(T),subject(J),room(R)} :- weekday(W), slot(S).
Up to this point everything works, except that this solution is probably relatively inefficient.
To filter that no class uses the same room at the same time I formulated the following constraint.
:- timetable(A,B,C,D,E,F,G), timetable(H,I,J,K,L,M,N), A=H, B=I, G=N, class(D,E)!=class(K,L).
It looks like this makes to problem so big that the grounding fails, because I get the following error message
clingo version 5.4.0
Reading from timetable.asp
Killed
Therefore, I was looking for a way to create different instances of timetable without getting too many "meaningless" answers created by the choiserule.
One possibility I thought of is to use a negation cycle. So you could replace the choiserule
{a;b} with a :- not b. b :- not a. and exclude all cases where rooms are occupied twice.
Unfortunately I do not understand this kind of approach enough to apply it to my problem.
After a lot of trial and error (and online search), I have not found a solution to eliminate the choicerule and at the same time eliminate the duplication of rooms and teachers at the same time.
Therefore I wonder if I can use this approach for my problem or if there is another way to not create many pointless answersets at all.
edit: rule base will work now and updated the hours per lesson for class 5
I think you're looking for something like:
% For each teacher and each timeslot, pick at most one subject which they'll teach and a class and room for them.
{timetable(W,S,T,A,B,J,R):class(A,B),room(R),teaches(T,J)} <= 1 :- weekday(W);slot(S);teacher(T).
% Cardinality constraint enforcing that no room is occupied more than once in the same timeslot on the timetable.
:- #count{uses(T,A,B,J):timetable(W,S,T,A,B,J,R)} > 1; weekday(W); slot(S); room(R).
to replace your two rules.
Note that this way clingo won't generate spurious ground terms for teachers teaching a subject they don't know. Additionally by using a cardinality constraint as opposed to a binary clause, you get a big-O reduction in the grounded size (from O(n^2) in the number of rooms to O(n)).
Btw, you may be missing answers because of typos in the input. I would suggest phrasing it as:
teacher(a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u).
teaches(
a,info;
a,math;
b,bio;
b,nawi;
c,ge;
c,gewi;
d,ge;
d,grw;
e,de;
e,mu;
f,de;
f,ku;
g,geo;
g,eth;
h,reli;
h,spo;
i,reli;
i,ku;
j,math;
j,chem;
k,math;
k,chem;
l,deu;
l,grw;
m,eng;
m,mu;
n,math;
n,geo;
o,spo;
o,fremd;
p,eng;
p,fremd;
q,deu;
q,fremd;
r,deu;
r,eng;
s,eng;
s,spo;
t,te;
t,eng;
u,bio;
u,phy
).
subject(X) :- teaches(_,X).
class(
5..10,a;
5..10,b
).
%classes per week (for class 5 only at the moment)
classperweek(
5,de,5;
5,info,0;
5,eng,5;
5,fremd,0;
5,math,4;
5,bio,2;
5,chem,0;
5,phy,0;
5,ge,1;
5,grw,0;
5,geo,2;
5,spo,3;
5,eth,2;
5,ku,2;
5,mu,2;
5,tec,0;
5,nawi,0;
5,gewi,0;
5,reli,2
).
room(1..21).
%for monday to friday
weekday(1..5).
%for lesson 1 to 9
slot(1..9).

Mozilla Deep Speech SST suddenly can't spell

I am using deep speech for speech to text. Up to 0.8.1, when I ran transcriptions like:
byte_encoding = subprocess.check_output(
"deepspeech --model deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.1-models.scorer --audio audio/2830-3980-0043.wav", shell=True)
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I would get back results that were pretty good. But since 0.8.2, where the scorer argument was removed, my results are just rife with misspellings that make me think I am now getting a character level model where I used to get a word-level model. The errors are in a direction that looks like the model isn't correctly specified somehow.
Now I when I call:
byte_encoding = subprocess.check_output(
['deepspeech', '--model', 'deepspeech-0.8.2-models.pbmm', '--audio', myfile])
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I now see errors like
endless -> "endules"
service -> "servic"
legacy -> "legaci"
earning -> "erting"
before -> "befir"
I'm not 100% that it is related to removing the scorer from the API, but it is one thing I see changing between releases, and the documentation suggested accuracy improvements in particular.
Short: The scorer matches letter output from the audio to actual words. You shouldn't leave it out.
Long: If you leave out the scorer argument, you won't be able to detect real world sentences as it matches the output from the acoustic model to words and word combinations present in the textual language model that is part of the scorer. And bear in mind that each scorer has specific lm_alpha and lm_beta values that make the search even more accurate.
The 0.8.2 version should be able to take the scorer argument. Otherwise update to 0.9.0, which has it as well. Maybe your environment is changed in a way. I would start in a new dir and venv.
Assuming you are using Python, you could add this to your code:
ds.enableExternalScorer(args.scorer)
ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)
And check the example script.

Negation of osm class or type

If you search for an airport (aeroway=aerodrome) around brescia, italy, you will also receive a hit for a military airfield, which happens to be tagged as an aerodrome also (it's taggged: aeroway=aerodrome, landuse=military, military=airfield). To avoid this I want to search for aeroway=aerodrome but exclude [military]. I've tried [! military] and [military~"^$"]. Any suggestions?
This particular case may be rare, I realize, but the concept of negating multi-classed elements is useful. And multi-classed elements is not a rare occurance. In general, they seem to be complimentary, not conflicting, so it's not an issue. I also realize that I can weed out conflicting hits with some back-end processing. I wasn't expecting a military airfield to appear with a commercial aerodrome.
In any case, here is a shortened version of my query. I include node, way and relation in full query:
http://overpass-api.de/api/interpreter?
data=[out:json][timeout:25][bbox:45.400861,9.868469,45.641408,10.542755];
(node[aeroway~%22aero|term|heli%22][! military]; ... ) out etc
or:
http://overpass-api.de/api/interpreter?
data=[out:json][timeout:25][bbox:45.400861,9.868469,45.641408,10.542755];
(node[aeroway~%22aero|term|heli%22][military~%22^$%22]; ... ) out etc
If you try to run it, you'll need to include way and relation.
Also, as you can see I don't exactly ask for aeroway=aerodrome. I include terminal and variations on heliport. My experience has been that some aerodromes are tagged only as "terminal", so if you're looking for an airport, asking for "aerodrome" isn't enough.
The correct syntax for negation is as follows:
[military !~ ".*"]
Please see the documentation on the OSM wiki for details.

Remove Matlab r2014b Plot Browser limit

Among many distressing graphics changes to r2014b, the Plot Browser now only displays a certain number of lines per plot (looks like the limit is 50). Any number of plots above this limit are not displayed in the Plot Browser - it just says "and 78 more..."
Is there anyway to remove the limit? I want to see all my lines in the plot browser.
Unfortunately the answer is currenty:
No you cannot remove this limit
This was already reported to mathworks a while ago, and here is the reply:
I am writing in reference to your Technical Support Case #01143663
regarding 'plot browser with "and xxxx more...." indication'.
Really interesting question (and even a bit surprising). Basically
this limitation has been introduced with MATLAB 2014b!
Our developers are aware of that, and they are working to solve it. An
enhancement/bug request has been already submitted, and I am going to
add this case to the list. However, I cannot guarantee you a release
date.
If you think this limitation is crucial for your work, I would
strongly suggest you to contact your account manager, which will have
a bit more influence on the developers than a simple engineer :).
Of course, if there is anything else I can do for you, please let me
know
So it seems that you will have to deal with this limitation if you keep using 2014b.
This was also an issue for me; in particular I wanted to see the DisplayName property of the graphics object in the list. I used a workaround where I created a callback function, so that when clicking on a data point, the DisplayName would be shown. This can be helpful if you have a plot of many lines and want to see the DisplayName of a particular one. You would first need to set the DisplayName property of the graphics objects for this to work, as it's empty by default. You could also use this to show other properties, such as Color or LineStyle, that are shown in the plot browser:
%Based on
%http://www.mathworks.com/help/matlab/ref/datacursormode.html
%'fig_h' is the figure handle
dcm_obj = datacursormode(fig_h);
set(dcm_obj,'UpdateFcn',#myupdatefcn)
And then include this function as a separate file on the Matlab path, or paste into the function you're currently writing, and include an extra 'end' at the end of that function:
function name = myupdatefcn(empt,event_obj)
% Customizes text of data tips
tar = get(event_obj,'Target');
name = get(tar,'DisplayName');
end

Double-metaphone errors

I'm using Lawrence Philips Double-Metaphone algorithm with great success, but I have found the odd "unexpected result" for some combinations.
Does anyone else have additions or changes to the algorithm for other parts of it they wouldn't mind sharing, or just the combinations that they've found that do not work as expected.
eg. I had issues between:
Peashill and Bushley. (both match with PXL)
Rockliffe and Rockcliffe (RKLF and RKKL)
All Soundex, Metaphone and variant schemes are occasionally going to give results that aren't identical to what you expect. This is unavoidable - they can be regarded as more or less simple hash algorithms with special information preserving properties, and will sometimes produce collisions when you'd rather they didn't, and will sometimes produce differences when you'd rather they didn't.
One possible way of improving things is using 'synonym rings'. This basically produces lists of words that should be regarded as synonyms, independent of the spelling. I encountered them in the context of name matching. For example, variants on Chaudri
included:
CHAUDARY
CHAUDERI
CHAUDERY
CHAUDHARY
CHAUDHERI
CHAUDHERY
CHAUDHRI
CHAUDHRY
CHAUDHURI
CHAUDHURY
CHAUDHY
CHAUDREY
CHAUDRI
CHAUDRY
CHAUDURI
CHAWDHARY
CHAWDHRY
CHAWDHURY
CHDRY
CHODARY
CHODHARI
CHODHOURY
CHODHRY
CHODREY
CHODRY
CHODURY
CHOUDARI
CHOUDARY
CHOUDERY
CHOUDHARI
CHOUDHARY
CHOUDHERY
CHOUDHOURY
CHOUDHRI
CHOUDHRY
CHOUDHURI
CHOUDHURY
CHOUDREY
CHOUDRI
CHOUDRY
CHOUDURY
CHOUWDHRY
CHOWDARI
CHOWDARY
CHOWDHARY
CHOWDHERY
CHOWDHRI
CHOWDHRY
CHOWDHURI
CHOWDHURRYY
CHOWDHURY
CHOWDORY
CHOWDRAY
CHOWDREY
CHOWDRI
CHOWDRURY
CHOWDRY
CHOWDURI
CHOWDURY
CHUDARY
CHUDHRY
CHUDORY
COWDHURY
regular metaphone is returning a difference between Peashill and Bushley
Peashill PXL
Bushley BXL