How do I set up configuration variables in Tesseract to better recognize code? - tesseract

I want to use Tesseract to recognize code. It is said on their website that I can disable dictionaries by setting both of the configuration variables load_system_dawg and load_freq_dawg to false.
However I haven't been able to do it correctly.
$ tesseract img.jpg output.txt --oem 0 -c load_system_dawg=0 load_freq_dawg=0
read_params_file: Can't open load_freq_dawg=0
Error: Tesseract (legacy) engine requested, but components are not present in /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata!!
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
Any ideas on best ways to handle it?

First of all, get an eng.traineddata that includes the legacy engine components, or choose a different OCR engine mode (--oem) value.
Next, read the output of tesseract --help-extra carefully:
-c VAR=VALUE Set value for config variables.
Multiple -c arguments are allowed.
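In other words, each variable needs its own -c flag; in your command the bare load_freq_dawg=0 was treated as a config file name, which is what the read_params_file error is complaining about. A corrected invocation would look something like this (assuming your eng.traineddata actually contains the legacy model required by --oem 0; note also that the second argument is an output base name, so tesseract appends .txt itself):
tesseract img.jpg output --oem 0 -c load_system_dawg=0 -c load_freq_dawg=0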

Related

how to import file name / line numbers when using IDApython?

When I use the IDA UI to load a binary with debug info, a dialog saying "DWARF info found" pops up and I can choose "Import file names/line numbers" to show the corresponding source lines for given addresses.
However, when I run IDA from the command line, e.g. "idat64 -A -S'script' binary_name", idat64 does not load the DWARF info without some specific setting.
I tried to find a setting that would make it load the DWARF info, but failed. Could someone help me solve this problem?
I want to use
ida_nalt.get_source_linnum
and
ida_lines.get_sourcefile
to export the address-to-line mapping information.
However, without the DWARF info loaded, running the script only prints invalid results.
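For reference, this is roughly the export script I have in mind (just a sketch; it only produces meaningful output once the DWARF line information has actually been imported):

import ida_idaapi
import ida_lines
import ida_nalt
import idautils

# Walk every defined head and record its source file / line number, if any.
mapping = []
for ea in idautils.Heads():
    src = ida_lines.get_sourcefile(ea)
    line = ida_nalt.get_source_linnum(ea)
    if src and line not in (None, ida_idaapi.BADADDR):
        mapping.append((ea, src, line))

for ea, src, line in mapping:
    print("0x%x %s:%d" % (ea, src, line))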

Passing a captured binary file through snort

Is it possible to pass pre-captured binary files through snort for analysis (with flagging or detection as the ultimate goal)?
You can use the -r option for that; it makes Snort read packets from a saved pcap file instead of sniffing a live interface.
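For example (the config path is just a common default; adjust it for your installation):
snort -c /etc/snort/snort.conf -r capture.pcap
Snort then runs the capture through the same detection rules it would apply to live traffic and reports alerts through whatever output you have configured.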

Using libav without generating output file

I use libav in an application that needs to extract the transform coefficients of a video. The output file libav produces is not important for this application; there is a memory limitation, and I don't want to produce the output file.
I read some documentation about libav along with its help output, but I couldn't solve this issue. How can I keep libav from producing the output file?
If you want to decode your file without producing output, combine two things:
1: specify the output format as null
2: specify the output target as null
On Windows and on Linux, respectively, you can use these commands:
avconv.exe -i input.mkv -f null null
./avconv -i input.mkv -f null /dev/null
From the ffmpeg description of -f: "Force input or output file format. The format is normally auto detected for input files and guessed from the file extension for output files, so this option is not needed in most cases."
The second null (or /dev/null) specifies a discard destination for the output.

How to split frontend and backend translations?

I have a web project (PHP + JS) translated with gettext. Originally it was only translated on the server side, with translations pushed to JS in various awkward ways. Now I have converted everything to gettext: I convert my .po files with po2json and load them into the Jed library. But this way I load all translations on the client, even ones that are never used there!
What I want to do now:
xgettext -js-options *.js > js-empty.po
xgettext -php-options *.php > php-empty.po
magic both-translated.po js-empty.po > js-translated.po
magic both-translated.po php-empty.po > php-translated.po
What should I use as 'magic'?
P.S. I will be doing the actual translation in one file and then splitting it, just for optimization, on every build.
I have found the solution:
msgcomm both-translated.po js-empty.po -o js-translated.po
Visual inspection confirms which entries are translated, and the following command confirms that the number of msgid lines is the same in the empty and the translated file:
grep msgid $1 | wc -l
To add more value to the answer: there is the well-regarded Python library polib (https://pypi.python.org/pypi/polib), packaged in the main Linux distributions as 'python3-polib' or 'python-polib'. I was considering using it for this task and may use it in the future for other gettext-related work.
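As a rough sketch of what that polib alternative could look like (the file names are just the ones from the question; this mimics msgcomm's intersection of the two catalogs):

import polib

# Load the fully translated catalog and the empty catalog extracted from the JS sources.
translated = polib.pofile('both-translated.po')
js_only = polib.pofile('js-empty.po')

js_ids = {entry.msgid for entry in js_only}

# Keep only the translated entries whose msgid also occurs in the JS catalog.
out = polib.POFile()
out.metadata = translated.metadata
for entry in translated:
    if entry.msgid in js_ids:
        out.append(entry)

out.save('js-translated.po')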

CVS command to get brief history of repository

I am using the following command to get a brief history of the CVS repository.
cvs -d :pserver:*User*:*Password*#*Repo* rlog -N -d "*StartDate* < *EndDate*" *Module*
This works just fine except for one small problem: it lists all tags created on each file in the repository. I want the tag info, but only for tags created within the specified date range. How do I change this command to do that?
I don't see a way to do that natively with the rlog command. Faced with this problem, I would write a Perl script that parses the command's output, correlates the tags with the date range I want, and prints them.
Another solution would be to parse the ,v files directly, but I haven't found any robust libraries for doing that; I prefer Perl for this type of task, and the parsing modules there don't seem to be of very high quality.
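For what it's worth, the parsing approach sketches out roughly like this (shown in Python here purely to illustrate the idea; the same logic ports straight to Perl, and it assumes the usual rlog output layout with a "symbolic names:" block followed by "revision"/"date:" lines):

import re
import sys
from datetime import datetime

# Hypothetical date range; adjust to match the -d range you pass to rlog.
START = datetime(2023, 1, 1)
END = datetime(2023, 12, 31)

tags = {}        # tag name -> revision (from the "symbolic names:" block)
rev_dates = {}   # revision -> commit date (from the revision/date lines)
in_symbolic = False
current_rev = None

# Pipe the `cvs ... rlog ...` output into this script on stdin.
for line in sys.stdin:
    if line.startswith('symbolic names:'):
        in_symbolic = True
        continue
    if in_symbolic:
        m = re.match(r'\s+(\S+): ([\d.]+)', line)
        if m:
            tags[m.group(1)] = m.group(2)
            continue
        in_symbolic = False
    m = re.match(r'revision ([\d.]+)', line)
    if m:
        current_rev = m.group(1)
        continue
    m = re.match(r'date: (\d{4})[/-](\d{2})[/-](\d{2})', line)
    if m and current_rev:
        rev_dates[current_rev] = datetime(*(int(g) for g in m.groups()))

# Report only the tags whose tagged revision falls inside the date range.
# (A real script would track this per file; here tags from later files simply
# overwrite earlier ones, which is fine as a single-module sketch.)
for tag, rev in sorted(tags.items()):
    d = rev_dates.get(rev)
    if d and START <= d <= END:
        print('%s\t%s\t%s' % (tag, rev, d.date()))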