How do I train tesseract but not create a new language? - tesseract

So I am trying out tesseract at the moment, and it does work, but it is not accurate enough. I know that the image quality plays a role as well, etc. etc., but some of the documents I am using use a rather unusual font. It still does recognise parts of it though (about 50-60%, which is pretty good), but this is obviously not entirely satisfying.
I would like to know now whether it's possible to train tesseract, but not to create an entirely new language, but to use the data I am already using, and build on this and improve it?
Second, if this is possible, would this even be advisable? Or (2) would it be better to create new languages for every new font I encounter, or (3) create new languages for each new font I encounter, but not from scratch but always built upon the default data I am using right now? What do you think? If you can provide any links on how to train tesseract & make use of the training data already provided, do let me know please.

You can extract the files from .traineddata file as given in documentation :
specify option -u to unpack all the components to the specified path:
combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.
This will create /home/$USER/temp/eng.* files with individual tessdata components from tessdata/eng.traineddata.
There are other options too,please check the documentation on the following link.
https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc
But rather than playing with original files its advisable to train tesseract for a new language.
(2)You dont have to create new language for each font.You have to create image,box and training file for each font .All of these will then be combined into a single language's traineddata file.
(3)This is possible too.PLease visit
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02#bootstrapping-a-new-character-set

Related

Opening Anylogic model in an older version

I have created a model in Anylogic 8.3. Now I want to open this model on a different computer that contains an older version, Anylogic 8.2.3. This, however, does not work, as I am prompted with the fact that the model is created in a newer Anylogic version.
Is there a way to circumvent this issue?
I am not a system admin on the computer with the older Anylogic, nor does our license cover updating to a newer version of Anylogic (expired in december 2018).
You can easily do that by opening the .alp file of your model with Notepad or a similar text editor. Then:
Get your actual AnyLogic build version (open AnyLogic, click "Help" and then "About". You can find your build-version as in the image below
replace AnyLogicVersion and AlpVersion with your required values, e.g. something like AnyLogicVersion="8.2.3.xxxxxxxx" and AlpVersion="8.2.3"
save the file and open with AnyLogic 8.2.3
(Note that if you want to open a model in AnyLogic 7 that was developed in AnyLogic 8, you would also need to remove the entire <RunConfiguration> section. But this is not relevant in your case.)
I think it's only possible to go back to an earlier AnyLogicVersion by hacking the .alp if the AlpVersion is the same, because it denotes the structure of the XML. I don't have an 8.4 file handy, but I have, for example, an 8.5.1 and an 8.2.4, and the AlpVersion is 8.4.9 for AnyLogicVersion 8.5.1, but 8.0.4 for AnyLogicVersion 8.2.4.
If the XML structure is different, the newer version of AnyLogic will likely be unable to load the file. Looking at the two examples of essentially the same model that I've detailed above, there are readily apparent structural differences in the ActiveObjectClass, for example. If there are not too many structural differences, you could try replicating them. I've succeeded in doing that manually at least once, that I can recall.
There are a variety of online tools that allow you to compare the XML schemas of two XML documents, from which you will be able to judge whether a manual hack is feasible.

Convert kaitai-struct .ksy file to "pretty" tree view

I need to write documentation for a parser that was developed in Kaitai. Given a .ksy file, is there any way to produce "pretty" views of the tree?
There is a two year old fork of ksc that supports GraphViz output but the resulting output is pretty hard to work with.
(https://www.reddit.com/r/dataisbeautiful/comments/4zhpvh/binary_data_formats_network_packets_archives/)
I can easily determine what the nodes are but getting their immediate parent would add very useful context.
Thank you.
-David
Please define what exactly do you expect from a "pretty tree".
GraphViz support is available in master and stable releases for a long time (as -t graphviz), and is very well supported — basically every ksy in official repo is accompanied nowadays with a chart: for example, http://formats.kaitai.io/lzh/index.html
If you want to have a tree of values (as opposed to "tree of data types"), we actually have ksdump, which allows you to dump arbitary data file using arbitrary .ksy in a YAML/JSON/XML tree of values. Will it work for you?

Most efficient way to change the value of a specific tag in a DICOM file using GDCM

I have a need to go through a set of DICOM files and modify certain tags to be current with the data maintained in the database of an external system. I am looking to use GDCM. I am new to GDCM. A search through stack overflow posts demonstrates that the anonymizer class can be used to change tag values.
Generating a simple CT DICOM image using GDCM
My question is if this is the best use of the GDCM API or if there is a better approach for changing the values of individual tags such as patient name or accession number. I am unfamiliar with all of the API options but have a link to the API documentation. It looks like the DataElement SetValue member could be used, but it doesn't appear that there is a valid constructor for doing this in the Value class. Any assistance would appreciated. This is my current approach:
Anonymizer anon = new Anonymizer();
anon.SetFile(myFile);
anon.Replace(new Tag(0x0010, 0x0010), "BUGS^BUNNY");
Quite late, but maybe it would be still useful. You have not mention if you write in C++ or C#, but I assume the latter, as you do not use pointers. Generally, your approach is correct (unless you use System.IO.File instead of gdcm.File). The value (second parameter of Replace function) has to be a plain string so no special constructor is needed. You should probably start with doxygen documentation of gdcm, and there is especially one complete example. It is in C++, but there should be no problems with translation.
There are two different ways to pad dicom tags:
Anonymizer
gdcm::Anonymizer anon;
anon.SetFile(file);
anon.Replace(gdcm::Tag(0x0002, 0x0013), "Implementation Version Name");
//Implementation Version Name
DatsElement
gdcm::Attribute<0x0018, 0x0088> ss;
ss.SetValue(10.0);
ds.Insert(ss.GetAsDataElement());

training a new model using pascal kit

need some help on this.
Currently I am doing a project on computer vision that requires me to train a new model to detect a certain object.
In this case, I am using the system provided by P. Felzenszwalb, D. McAllester, D. Ramaman and his team => Discriminatively trained deformable part models which is implemented in Matlab.
Project webpage: http://www.cs.uchicago.edu/~pff/latent/.
However I have no idea how to direct the system to use my dataset(a collection of images and annotation) which is different from the the PASCAL datasets so as to train a new model.
By directing, I meant a line of code that allows me to change the dataset the system reads from, for training a model.
E.g.
% directory for caching models, intermediate data, and results
cachedir = ['/var/tmp/rbg/YOURPATH/' VOCyear '/'];
I tried looking at their Readme and documentation guides but they do not make any mention. Do correct me if I am wrong.
Let me know if I have not made my problem clear enough.
I tried looking at some files such as global.m but no go.
Your help is much appreciated and thanks in advance!
You can try to read pascal.m in the DPM package(voc-release5), there are similar code working on VOC2007/2010 dataset.
There are plenty of parts that need to be adapted to achieve this. For example the voc_config has to be adapted in order to read from your files.
The same with the pascal_train.m function. Depending on the images and the way you parse them, this may require quite some time to adapt this function.
Other functions to consider:
imreadx
pascal_test
pascaleval

ESS workflow for R project/package development

Can anyone share his experience on workflow for R peject development under ESS? I tried several times to learn emacs but I have not get it yet. I can understand ESS as an editor, but is there a project view in ESS? what's the efficient ways to set up/view R project directory, coding, and testing, and how's ESS has an edge to facilitate the whole process?
Do you use ESS as a good R editor only or tend to emulate a R IDE environment within ESS?
Thanks for any advices.
It sounds like you're asking two separate questions.
One question concerns workflow and the other concerns using ESS.
As I use StatET and Eclipse, I'll just share my experience regarding the workflow aspect of your question.
As with Vincent I also follow something like the workflow set out by Josh Reich here (also see Hadley's useful comments):
Workflow for statistical analysis and report writing
Although it can vary between projects, I tend to have a couple of main R files
import.R: this imports data files and does any necessary cleaning and manipulation
analyse.R: This generates the output that I need for any final report
main.R: This calls import.R and analyse.R
The aim is for import.R and analyse.R to represent the complete and final workflow for producing the final results of any analyses.
In terms of a directory structure for an analysis project, I'll often also have the following folders
data: for storing any raw data files
meta: for storing meta data, such as variable labels, scoring systems for tests, recoding information, etc.
output: for storing any graphics, tables, or text generated by my analyses that I might want to incorporate into an external program
temp: When exploring the data and brainstorming analyses, I like to type code into files instead of using the console. I tend to label these temp1.R, temp2.R, temp3.R. I store these in a temp folder. That way I have a permanent record that's easily accessible. If the analyses become final they get incorporated into one of the main R files (i.e., import.R or analysis.R)
functions: If I think that a function will be needed across a couple of projects, I often place it one function per file or a set of related functions in a file in a folder called functions. This makes it relatively easy to reuse functions across projects, when the formal requirements of package development are more than needed.
library: If I want to create some general functions that I think will be project specific, I'll place them in this folder
save: A folder to store any saved R objects
StatET and Eclipse make it easy to interact with such a file system.
Of course, given all the R gurus that use ESS and Emacs, I'm sure it also handles interactions with the file system well.
I'm not exactly sure what you expect as an answer on this one. I, for one, have stolen (and adapted) a system that was suggested here a little while ago (by Josh Reich):
Create a folder for every project, and split up your work in a bunch of different .R files:
Load.R for getting your raw data into R;
Prep.R for cleaning the data, recoding variables, etc.;
Func.R for coding any custom functions you will need for evaluation; and
Eval.R for running your final stuff.
If that doesn't fit your style, just change it.
Then, you can either have a master file to call each of the parts one after each other (good for reproducibility), or save at different stages and have the individual scripts load the appropriate data (good if some of the prep work is very computationally/time intensive).
**
On a different note, the trick that is posted at the link really helped me get into ESS. It turns Shift-Enter into a one-stop-ESS-shop: http://www.kieranhealy.org/blog/archives/2009/10/12/make-shift-enter-do-a-lot-in-ess/
Others have given you some good ideas about how to setup your directory/file structure for a project.
You also asked about "project views," in which case you might want to look into the Emacs Code Browser (ECB).
You can find some screen shots of it in action on its site, here:
http://ecb.sourceforge.net/screenshots/index.html