Sinhala language model issue for pocketsphinx - unicode

I am trying to create a speech recognition system for the Sinhalese language. I tried to create a language model by following the answer in "Build NEW Acoustic model, Dictionary, Language model for uncommon language speech recognition". I used both the online lmtool and cmuclmtk-0.7-win32 on Windows. My input file is as follows:
එක eka
දෙක de ka
තුන thu na
හතර ha tha ra
පහ pa ha
හය ha iya
හත ha tha
අට ah ta
නවය na wa ya
After submitting it to lmtool and cmuclmtk, I got the following output:
AHTA AE T AH
DEKA D AH K AA
EKA EH K AH
HAIYA HH EY AY AH
HATHA HH AE TH AH
HATHARA HH AE TH AH R AH
NAWAYA N AO EY AH
PAHA P AE HH AH
THUNA TH UW N AH
අට
තුන
දෙක
නවය
පහ
හත
හතර
හය
එක
Both the .dic and .lm files contain the characters above. I feel these are garbage characters. What did I do wrong to get this?

You did everything wrong.
For corpus construction you need a text file, not a dictionary file. You create the dictionary separately.
You should not use the online lmtool for your language. It works for English only.
To train a language model from texts you should use SRILM.
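To illustrate the first point, here is a minimal Python sketch (Python is an assumption here; the thread itself uses no scripting language) that splits a mixed file like the one in the question into a text-only corpus and a separate dictionary. The function name split_corpus_and_dict is hypothetical, and the romanized syllables are copied through unchanged; they are not a validated phone set for Sinhala.

```python
def split_corpus_and_dict(lines):
    """Split 'word phone phone ...' lines into corpus lines and dictionary lines."""
    corpus, dictionary = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, phones = parts[0], parts[1:]
        corpus.append(word)  # corpus: text only, no pronunciations
        dictionary.append(word + " " + " ".join(phones))  # dictionary: word + phones
    return corpus, dictionary


# Three lines taken from the input file shown in the question.
mixed = ["එක eka", "දෙක de ka", "තුන thu na"]
corpus, dictionary = split_corpus_and_dict(mixed)
print(corpus)
print(dictionary)
```

A real corpus would contain whole sentences or phrases rather than single words; the word list here only shows that the two files are kept separate.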


tesseract parameter to reduce noise in the image

I have been using tesseract for 2 months now, with opencv for reducing the dots/noise in the images. But I am trying to solve this issue at the tesseract level.
Is there any tesseract parameter to remove the background dots?
Or can I tell tesseract not to recognize the dots (depending on their size)?
I would be very thankful if anyone could guide me on this issue.
For the image below:
https://i.stack.imgur.com/9TjN6.png
I am getting output like this:
lb ane a a a ee ee
Ee ah Tani ANOTES tsi Ca Ee RR
RAT TE CORRE NE Re ele TTR a ee Tol a te es
see © Students should schedile 21 points of work sech years...
fen Es ee EE i ea
| fdvenced Coreral Sciemes ©. |. eroral Home Feonomits (limited to.
Co mlgebras i ULE LE cl BE unions andi sentors) Dh
7od 1 Art’ SpeelaliAvt [for those tC meman Ta GET
Lhd recommended by Art Supervisor. ii Industrial Arts hal
I am using the command below to run tesseract:
tesseract --psm 6 --oem 1 image.png output_text_file
There won't be any noise removal option at the tesseract level, as the preprocessing methods can't be generalized for all images. You can use denoising methods in opencv like fastNlMeansDenoising, dilation, erosion, etc.
tesseract is an OCR engine and not an image manipulation tool.
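Since the question asks about dropping dots by size, here is a minimal sketch of that idea as a preprocessing step: erase every connected foreground blob smaller than a threshold before handing the image to tesseract. It is written in plain Python on a small 0/1 grid so it runs without extra libraries; the function name remove_small_blobs, the sample grid and the threshold are all made up for illustration. On real images, OpenCV's cv2.connectedComponentsWithStats reports blob areas and can serve the same purpose.

```python
from collections import deque


def remove_small_blobs(grid, min_size):
    """Return a copy of a 0/1 grid with 4-connected blobs smaller than min_size erased."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Flood-fill one blob and collect its pixels.
            blob, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Erase the blob if it is too small to be part of a character.
            if len(blob) < min_size:
                for y, x in blob:
                    out[y][x] = 0
    return out


image = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
cleaned = remove_small_blobs(image, min_size=3)
for row in cleaned:
    print(row)
```

The two isolated pixels are removed while the 2x2 block survives. The right threshold depends on the scan resolution and font size, so it has to be tuned per image set.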

Matlab function parameter includes h. I want plot title to include same h. How to do?

The function parameter includes h (=10). I want the plot title to include the same h. How do I do this?
function G=graphit(X,Y,ye,h)
plot(X,Y,'-');
grid
title([ 'Approximate and Exact Solution #h= .', num2str(h)])
Thanks.
MM
You can use sprintf to create a formatted string:
title( sprintf( 'Approximate and Exact Solution. h = %.0f', h ) );
title(['Approximate and Exact Solution ',num2str(h),' .'])
title_string = sprintf('Approximate and Exact Solution #h= %d.',h) % change d to f for floats
title(title_string)
I'd use a proper string-formatting tool such as sprintf for building correctly formatted titles.
Despite 3 excellent answers given in less than 5 minutes, none of the suggested code would run properly. Basically, I got almost identical results as when I ran my original code.
It turns out that leading zeroes in a number for h, such as 01 or 05, will cause the system to drop the zero. This was a problem for me since I wanted the h values to appear as they are: .05, .025, .01. Further, Matlab seemed to become confused by a specified decimal point followed by a number with leading zeroes. The way around this was to pass the decimal point with the h value (.10, .05, .025, .01). See the code below.
Input is
X,Y,xe,ye,.01
Working code:
function G=graphit(X,Y,xe,ye,h)
hold on;
plot(X,Y,'-'); plot(X,ye,'-.');
hold off
title([ 'Approximate and Exact Solution #h=', num2str(h)])
Expected and attained output:
Approximate and Exact Solution #h=0.01
Voila! Thanks for those replies...
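One possible reason a sprintf suggestion can look wrong for fractional h is the format specifier: %.0f rounds to zero decimals, so 0.05 prints as 0, while %g keeps the significant digits. The sketch below shows this in Python, whose % operator follows the same C printf conventions for %.0f and %g; it is an analogy, not Matlab code, and make_title is a hypothetical helper.

```python
def make_title(h, spec):
    """Build the plot title using a printf-style specifier such as '%.0f' or '%g'."""
    return ("Approximate and Exact Solution. h = " + spec) % h


for h in (10, 0.05, 0.025, 0.01):
    # Left: zero decimals (fractions collapse to 0). Right: shortest general form.
    print(make_title(h, "%.0f"), "|", make_title(h, "%g"))
```

With %g, the values 0.05, 0.025 and 0.01 appear in the title as written, and 10 still prints as 10.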

Getting communication between a PN532 RFID reader powered by arduino UNO using matlab

I am trying to set up a read and write mode on my PN532 RFID reader, powered by an Arduino UNO, using Matlab code. I am totally new at this. I have coded something, but it is not displaying the information that my RFID reader is reading from the tag.
clear all; clc; close all;
s = serial('COM3');
set(s,'BaudRate', 115200);
set(s,'DataBits', 7);
set(s,'StopBits', 1);
fopen(s);
s.ReadAsyncMode = 'continuous';
readasync(s);
data = fscanf(s, '%x');
char=(dec2bin(data));
sprintf('Read data is %x', char);
The data that I got was 'ypyA r d u i n o S e r v e r I O L i b r a r y w'.
I have made changes by trial and error, but it is not getting any better. I hope someone can help me with this. The information that my tag sends will be 8-bit hexadecimal values that I wish to convert and display in binary. Thank you in advance.
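The conversion step described above (8-bit hexadecimal values shown in binary) can be sketched separately from the serial setup. The Python snippet below assumes the tag bytes arrive as whitespace-separated hex text; the function name hex_tokens_to_binary and the sample bytes are made up, and the sketch does not address why the port returns the Arduino banner text.

```python
def hex_tokens_to_binary(text):
    """Convert whitespace-separated hex byte tokens (e.g. '1F a0') to 8-bit binary strings."""
    result = []
    for token in text.split():
        value = int(token, 16)  # raises ValueError on non-hex input
        if not 0 <= value <= 0xFF:
            raise ValueError("not an 8-bit value: " + token)
        result.append(format(value, "08b"))  # zero-padded to 8 bits
    return result


print(hex_tokens_to_binary("04 A3 1f ff"))
```

Keeping the leading zeros with the 08b format matters here, since a plain binary conversion would print 0x04 as 100 rather than 00000100.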

The Matlab code for the K chart - what does x mean?

I found the paper
"Performance evaluation of one-class classification-based control charts through an industrial application" (Gani, Limam), In:
Quality and Reliability Engineering International, 29, pp. 841–854, 2012
The appendix of this paper contains Matlab code for the K-chart algorithm. I have tried to run this code, but I am facing some trouble.
% show the KD of each training observation
KD_training = W*offs - 2*sum(repmat(W.a',n,1) x K_training,2);
Can anybody explain to me what the x stands for? Is it a cross product or something different? If it represents the cross product, the Matlab function cross cannot be applied to the 60x60 double matrices:
u = repmat(W.a',n,1); %60x60 double
v = K_training; %60x60 double
z = cross(u,v);
A and B must have at least one dimension of length 3.
I can't run this code, I always encounter errors here. Thank you for your help!
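One plausible reading, offered as an assumption because the paper's listing is not reproduced here, is that the x is a multiplication sign standing for the element-wise product (.* in Matlab) rather than cross. Under that reading, summing along dimension 2 gives, for each row j, the sum over i of a(i)*K(j,i), which equals the matrix-vector product K*a. The Python sketch below checks that identity on small made-up data; K and a are hypothetical, not values from the paper.

```python
def repmat_row(a, n):
    """Mimic repmat(a', n, 1): stack the row vector a n times."""
    return [list(a) for _ in range(n)]


def elementwise(U, V):
    """Mimic U .* V for two equally sized matrices."""
    return [[u * v for u, v in zip(ru, rv)] for ru, rv in zip(U, V)]


def row_sums(M):
    """Mimic sum(M, 2): one sum per row."""
    return [sum(row) for row in M]


def matvec(K, a):
    """Plain matrix-vector product K * a, for comparison."""
    return [sum(k * w for k, w in zip(row, a)) for row in K]


K = [[4, 2, 1], [2, 4, 2], [1, 2, 4]]  # made-up symmetric kernel matrix
a = [0.5, 0.25, 0.25]                  # made-up weights
via_elementwise = row_sums(elementwise(repmat_row(a, len(K)), K))
print(via_elementwise, matvec(K, a))
```

Both expressions give the same vector, which has the right shape for a per-observation kernel distance, whereas cross requires a dimension of length 3 and so cannot apply to 60x60 inputs.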
http://onlinelibrary.wiley.com/doi/10.1002/qre.1440/pdf

Problems which matlab is good for

Let me ask whether using Matlab for my particular problem is nonsense, or whether some people do something similar.
I have an initial sequence S(1), where each term is a 2D point.
I create a new sequence S(2) by inserting a new term point p
between each consecutive 2 term points p(i) and p(i+1).
Where p is a function f of 4 term points of nearest indices on S(2).
Namely,
p= f( p(i-1),p(i),p(i+1),p(i+2) )
The function f is written in a C-like style,
not in a pure matrix-language style.
In the same way, I repeatedly generate a new, longer sequence S(i+1), up to S(m).
The above may be vague, but please give some advice.
I am not asking whether Matlab is the best choice for the problem, but whether no expert would use Matlab for such a problem or whether some would.
Thank you in advance.
It heavily depends on f. If f could be coded efficiently in Matlab or you are willing to spend the time to MEX it (Matlab C extension), then Matlab will perform efficiently.
The code could be vectorized like this:
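f = @(x) mean(x,3);   % example f: average the 4 neighbours stacked along dim 3
m = 3;
S{1} = [1,2,3;4,5,6]; % each column is one 2D point
for i = 2:m
    % stack p(i-1), p(i), p(i+1), p(i+2) along dim 3, zero-padding at the ends
    S{i} = cat(3,...
        [[0;0] S{i-1}(:,1:end-2)],...
        S{i-1}(:,1:end-1),...
        S{i-1}(:,2:end),...
        [S{i-1}(:,3:end) [0;0]]);
    S{i} = [f(S{i}) [0;0]];          % new points, padded to the old length
    S{i} = cat(3,S{i-1},S{i});       % pair each old point with its new point
    S{i} = permute(S{i},[1 3 2]);
    S{i} = S{i}(:,:);                % interleave old and new columns
    S{i}(:,end) = [];                % drop the padding column
end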
Yes, Matlab seems to be suitable for such a task. For the data structure of your list of sequences, consider using cell arrays. You could have S as a cell array, and S{1} would correspond to your S(1), and could again be a cell array of points, or a usual matrix if points are just pairs or triples of numbers.
As an alternative, Python in my opinion is particularly strong when it comes to all kinds of sequences.
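For comparison, here is a minimal Python sketch of the refinement described in the question, with f written in a plain loop-friendly style. It assumes the four neighbours are taken from the previous sequence, that indices are clamped at both ends, and that f is the classic four-point rule with weights (-1, 9, 9, -1)/16; all three are assumptions, since the question does not specify them, and the names four_point and refine are hypothetical.

```python
def four_point(p0, p1, p2, p3):
    """Hypothetical f: four-point weights (-1, 9, 9, -1) / 16 applied per coordinate."""
    return tuple((-a + 9 * b + 9 * c - d) / 16.0
                 for a, b, c, d in zip(p0, p1, p2, p3))


def refine(points, f):
    """Insert f(p[i-1], p[i], p[i+1], p[i+2]) between each consecutive pair of points."""
    n = len(points)

    def clamp(k):
        # Reuse the end points when an index falls outside the sequence.
        return points[min(max(k, 0), n - 1)]

    out = []
    for i in range(n - 1):
        out.append(points[i])
        out.append(f(clamp(i - 1), points[i], points[i + 1], clamp(i + 2)))
    out.append(points[-1])
    return out


S = [[(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]]  # S[0] plays the role of S(1)
for _ in range(2):
    S.append(refine(S[-1], four_point))
print([len(s) for s in S])
```

Each pass keeps the old points at the even positions and roughly doubles the length, so the sequence lengths here grow from 3 to 5 to 9.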