(GATE) How to let Minipar play with special characters like Ö, Ü, Ä?

While learning GATE, I encountered the following problem:
Minipar throws an exception when it sees uncommon characters like Ö, Ü, Ä.
For example, in the sentence "Batten disease (also known as Spielmeyer-Vogt-Sjögren-Batten disease ) is a rare, fatal autosomal recessive neurodegenerative disorder that begins in childhood." (from a wiki article)
The annotation Minipar got before it stopped working is "Batten disease (also known as Spielmeyer-Vogt-Sj", which ends exactly before the character ö. This makes me guess that such characters need special attention when using GATE, because the same pipeline processed several other articles like a breeze.
In the Messages tab, it reports:
gate.util.InvalidOffsetException
at gate.annotation.AnnotationSetImpl.getNodes(AnnotationSetImpl.java:773)
at gate.annotation.AnnotationSetImpl.add(AnnotationSetImpl.java:802)
at minipar.Minipar.runMinipar(Minipar.java:419)
at minipar.Minipar.execute(Minipar.java:527)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:154)
at gate.creole.SerialController.executeImpl(SerialController.java:153)
at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:129)
at gate.creole.AbstractController.execute(AbstractController.java:75)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1619)
at java.lang.Thread.run(Unknown Source)
gate.creole.ExecutionException: gate.util.InvalidOffsetException
at minipar.Minipar.runMinipar(Minipar.java:491)
at minipar.Minipar.execute(Minipar.java:527)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:154)
at gate.creole.SerialController.executeImpl(SerialController.java:153)
at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:129)
at gate.creole.AbstractController.execute(AbstractController.java:75)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1619)
at java.lang.Thread.run(Unknown Source)
Caused by: gate.util.InvalidOffsetException
at gate.annotation.AnnotationSetImpl.getNodes(AnnotationSetImpl.java:773)
at gate.annotation.AnnotationSetImpl.add(AnnotationSetImpl.java:802)
at minipar.Minipar.runMinipar(Minipar.java:419)
... 9 more
gate.creole.ExecutionException: Document doesn't have sentence annotations. please run tokenizer, sentence splitter and then Minipar
at minipar.Minipar.saveGateSentences(Minipar.java:194)
at minipar.Minipar.execute(Minipar.java:525)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:154)
at gate.creole.SerialController.executeImpl(SerialController.java:153)
at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:129)
at gate.creole.AbstractController.execute(AbstractController.java:75)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1619)
at java.lang.Thread.run(Unknown Source)
I'd like to thank Ian for his warm support once again.
Matt

This appears to be an encoding-related issue of some sort, but unfortunately I can't do any debugging myself as the minipar parser binary no longer appears to be available from the usual download page - I get a small (less than 2kB) greyscale JPEG image instead of a multi-MB .tgz.
There are a few things you could try, off the top of my head. The GATE Minipar wrapper writes the input file for the parser and reads the parser's output using whatever is the default encoding on the system where you're running. My speculation is that the parser is producing its output in a different encoding (possibly related to the encoding of the original training data?).
The GATE wrapper writes its input to a temporary file which you should be able to find in your temporary directory as long as you leave GATE Developer running in the background (the temp files are deleted when Developer exits). I would try running minipar-windows.exe on that file from the command line and seeing what the output looks like:
C:\path\to\minipar-windows.exe -p C:\path\to\minipar\data -file GATESentencesNNNNNN.txt
The output may give you a clue as to what's failing. If it looks right and you can determine the encoding it's trying to use you could set your GATE Developer to use that as its default encoding (if you're using gate.exe to start it then you do this by adding a line -Dfile.encoding=ISO-8859-1 or whatever to gate.l4j.ini) and see if that helps. If so we can consider adding a parameter to the PR to specify the encoding to use when exchanging data with the parser executable.
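To illustrate why an encoding mismatch could produce an invalid offset right at the ö, here is a minimal Python sketch. It is not the GATE wrapper's code, and the choice of UTF-8 and ISO-8859-1 as the two sides of the exchange is purely an assumption for illustration. It shows that when text written in one encoding is read back in another, the character count changes from the first non-ASCII character onwards, so offsets computed on one side no longer fit the other.

```python
# Illustration only: the two encodings are assumed, not taken from Minipar.
sentence = "Spielmeyer-Vogt-Sjögren-Batten disease"

# One side writes the text as UTF-8 bytes ...
raw = sentence.encode("utf-8")
# ... and the other side reads the same bytes as ISO-8859-1.
misread = raw.decode("iso-8859-1")

# 'ö' is two bytes in UTF-8, so the misread text is one character longer.
print(len(sentence), len(misread))

# The two views of the text agree only up to the first non-ASCII character.
first_bad = next(
    i for i, (a, b) in enumerate(zip(sentence, misread)) if a != b
)
print(repr(sentence[:first_bad]))
```

Every offset after that point is shifted, which would be consistent with the annotation stopping at "Sj", although only the parser's real output can confirm it.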

Related

How can I avoid package/character errors in (read) in Common Lisp?

I'm getting some surprising errors when I try to input a string using (read). Context: I'm building a mini language, with inputs delimited using characters like {, }, :, etc.
Here's what happens: I run (read) and enter {9.I:{8.II:hello}{8.III:hi}} (an example input string from my mini language).
I then get 2 errors:
1:
too many colons in "{8.II"
2:
Package HELLO}{8.III does not exist.
It seems as though there's something extra going on in the (read) function that's tripping me up. Can someone point me in the right direction?
read is designed to read a valid Lisp S-expression. It's going to use Common Lisp's parser. If your language is sufficiently Lisp-like, you may be able to make it work for you, but given the example input you've shown, I doubt it's what you want.
You're probably looking for read-line, which reads a single line of text as a string and does not perform any additional parsing on it.
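The general approach (take the line as a raw string, then parse it with your own rules) can be sketched as follows. The sketch is in Python rather than Common Lisp, and the grammar is an assumption guessed from the single example input: each node is `{label:body}`, where the body is plain text, nested nodes, or both. The function name `parse_node` is made up for this illustration.

```python
def parse_node(text, pos=0):
    """Parse one '{label:body}' node starting at text[pos].

    Returns (node, next_pos), where node is a dict holding the label,
    any plain text in the body, and a list of nested child nodes.
    The grammar is an assumption based on one sample input.
    """
    if pos >= len(text) or text[pos] != "{":
        raise ValueError(f"expected '{{' at offset {pos}")
    colon = text.find(":", pos)
    if colon == -1:
        raise ValueError(f"missing ':' after offset {pos}")
    label = text[pos + 1:colon]
    pos = colon + 1
    body, children = [], []
    while pos < len(text) and text[pos] != "}":
        if text[pos] == "{":
            # A nested node: recurse and continue after its closing brace.
            child, pos = parse_node(text, pos)
            children.append(child)
        else:
            body.append(text[pos])
            pos += 1
    if pos >= len(text):
        raise ValueError("unbalanced braces: missing '}'")
    return {"label": label, "text": "".join(body), "children": children}, pos + 1


# The raw line, taken as a plain string with no reader involved.
line = "{9.I:{8.II:hello}{8.III:hi}}"
tree, end = parse_node(line)
print(tree["label"], [child["label"] for child in tree["children"]])
```

Because the line is never handed to a language reader, characters such as `:` carry only the meaning your own parser gives them.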

How to display static analysis warnings in MATLAB?

I've noticed the MATLAB editor will often show quite helpful warnings for ".m" files. As I tend to run my MATLAB code remotely I prefer not to use the MATLAB editor, instead keeping open a long running emacs session. It would be great if these warnings could be printed out when running a script, perhaps if some setting was enabled (I could imagine not wanting to do that by default for performance). Is this possible?
I believe you're looking for checkcode. From the documentation:
checkcode(filename) displays messages about filename that report potential problems and opportunities for code improvement. These messages are sometimes referred to as Code Analyzer messages. The line number in the message is a hyperlink that you can click to go directly to that line in the Editor. The exact text of the checkcode messages is subject to some change between versions.
...
info = checkcode(___,'-struct') returns the information as an n-by-1 structure array, where n is the number of messages found.

How does computer display a character on the screen with the correct encoding?

I'm interested in the encoding of the character in the computer.
When I open my xxx.c with Visual Studio Code, how does VS Code detect the encoding of my file and interpret the "01" sequence? Furthermore, how does Visual Studio Code (or even the computer system) display the characters on the screen according to my "01" sequence file and the character encoding?
Thank you!
I also use Chinese in my projects. Sometimes the file encoding really drives me crazy. Sometimes a correct UTF-8 file created by editor A, for example, is destroyed by some text editor B that interprets it as a GBK file, and editor A can never get it back correctly.
I searched a lot, but most of the answers seem too abstract or irrelevant. I want to figure out how the software and the computer system (or operating system) cooperate together to make this simple but important job done!
First things first, "can never get it back": always use source code control.
"How the software and the computer system (or operating system) cooperate together to make this simple but important job done!": They don't; that's the problem!
Short history: Many decades ago people used small character sets. The idea was a system would always use the same one. Simple. Every time a text file was transferred between systems, it would be immediately transcribed to the local character encoding. Then came the globalization of file exchanges and systems needed to hold text files in different encodings. There was no general way of recording what the encoding was. In 1991 came the huge character set Unicode. Languages (VB4, Java), operating system APIs (Win32), file systems (NTFS), … began adopting it. However, its encodings (UTF-8, UTF-16) are just yet more possibilities for which encoding a text file uses. Many programs that read text files either rely on the old system of a system default encoding or guess ("detect").
In the programming world, some languages require source files to use a specific encoding (say UTF-8); in others, tools default to a specific encoding (say UTF-8). In most cases, the toolset provided with a C or C++ implementation will have a consistent set of rules. If you also use an IDE or other form of project system, you can set the encoding for the entire project and in some cases for specific files.
So, the only solution is to only use tools that work for you and to properly configure them. If it hurts, stop doing it.
Aside: On the topic of programming and default character encodings, be careful not to get tricked by various language libraries' use of the system default character encoding, unless that is exactly what's needed. Otherwise, you are giving your users the same problem that you are encountering. (In Java, just avoid it with explicit arguments. In C and C++ libraries, encoding is combined into locales. But note that many systems initialize a program to use the default character encoding.)
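As a rough illustration of both points (passing the encoding explicitly, and how a wrong guess followed by a save destroys a file), here is a small Python sketch. The file name and sample text are made up, and GBK stands in for "editor B's" wrong guess as described in the question.

```python
import os
import tempfile

original = "你好吗"  # sample text; 9 bytes when encoded as UTF-8

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")  # hypothetical file name

    # "Editor A": always state the encoding instead of relying on a default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(original)
    with open(path, encoding="utf-8") as f:
        same = f.read()

    # "Editor B": wrongly treats the bytes as GBK, replacing what it
    # cannot decode, and then saves its view of the text back to disk.
    with open(path, encoding="gbk", errors="replace") as f:
        misread = f.read()
    with open(path, "w", encoding="gbk", errors="replace") as f:
        f.write(misread)

    with open(path, "rb") as f:
        damaged = f.read()

print(same == original)                    # explicit encoding round-trips
print(damaged == original.encode("utf-8"))  # the saved bytes have changed
```

Once the wrongly decoded text has been written back, the bytes that could not be decoded are gone, which is why only an earlier copy (for example from source control) can restore the file.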

Special characters in Sikuli script

I am trying to use some French special characters with Sikuli, when I type this in the Sikuli IDE,
App.open('C:\\à table\\app.exe')
But I get this error :
[log] App.open C:\à table\NDC.exe(0)
[error] App.open failed: C:\à table\NDC.exe not found
It seems that Sikuli doesn't handle UTF-8 properly for the moment. All I could find on Google was the same problem with the type() function, with the advice to use paste() instead, which uses the clipboard.
Is there a workaround in the case of App.open?
Thanks a lot.
You could make a .bat file that contains the path to the .exe, and then call App.open('path/to/bat/file.bat').
The reason for this problem seems to be that Python 2.5.X doesn't support character encodings properly. One has to use tricks like encode('cp1252'), encode('utf8')...
Since Sikuli is based on Jython, which is based on Python 2.5.2, we are stuck!
I wish I lived in a country that only uses the standard ASCII table; I really hate all these problems related to code pages and encodings.

How to work with logfiles

Usually I pipe my log through a lot of greps to remove the "noise" before I open it in an editor.
I think it should be possible to do this filtering inside an editor (especially Emacs).
Is this what chainsaw is doing? For log4j format only or more general?
(It is the only logfile viewer tool I can find)
How do you guys do it?
(I think UNIX grep syntax would be easiest for me)
Chainsaw does support both positive and negative filter matching. You can define positive and negative matches based on the logger tree (right click on nodes for the options), and you can define positive-match expressions in the 'refine focus' field, and negative-match expressions using the 'ignore' option below the logger tree. There is a tutorial available from the help menu which describes the expression syntax.
Chainsaw has had a lot of new features added since the last official release. The developer snapshot (including a reworked configuration screen) is available here:
http://people.apache.org/~sdeboy
Chainsaw doesn't just work with log4j. There are 'receivers' available that make it work with log4net, java.util.logging, log4php and others.
You can also process any regularly formatted text file using a VFSLogFilePatternReceiver (use the 'process a log file' option to configure Chainsaw to define one). There are some pre-defined log formats in the configuration dialog that act as example formats - tweak one to match your format. The JavaDoc provides more information: http://logging.apache.org/chainsaw/apidocs/org/apache/log4j/chainsaw/vfs/VFSLogFilePatternReceiver.html
If the greps are the same, you can simply code a script to do all that work for you (e.g. vimscript for vi). That way, you don't have to repeat all the tasks while leaving the logs intact for further investigation.
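As one possible shape for such a script, here is a minimal Python sketch that mimics a chain like `grep A | grep -v B`: a line is kept only if it matches every "keep" pattern and none of the "drop" patterns. The function name `filter_log`, its parameters, and the sample log lines are all invented for this illustration.

```python
import re


def filter_log(lines, keep=(), drop=()):
    """Yield lines matching all `keep` regexes and none of the `drop` ones."""
    keep_res = [re.compile(p) for p in keep]
    drop_res = [re.compile(p) for p in drop]
    for line in lines:
        if any(r.search(line) for r in drop_res):
            continue  # behaves like `grep -v`
        if not all(r.search(line) for r in keep_res):
            continue  # behaves like a chain of plain greps
        yield line


# Made-up sample data standing in for a real log file.
sample = [
    "10:00:00 INFO  app.Main - started",
    "10:00:01 DEBUG app.Db - heartbeat",
    "10:00:02 ERROR app.Db - connection lost",
    "10:00:03 ERROR app.Ui - heartbeat timeout",
]

filtered = list(filter_log(sample, keep=[r"ERROR"], drop=[r"heartbeat"]))
for line in filtered:
    print(line)
```

The original log is only read, never modified, so it stays intact for later investigation; for a real file, pass the open file object instead of the sample list.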
You're right about Chainsaw and log4j: it's a log viewer with various capabilities, e.g. a filter mechanism. However, I am unsure whether you can have multiple filters active simultaneously.
Yes, you should try Chainsaw first; it supports various ways of getting your logs.
Necroposting: I created a mode for Emacs that is specifically targeted at Log4j-like logs, but supports many more formats, especially if you customize it for yourself.
Features:
Colorization (just customize faces if that's too much)
Interactive filtering in the same buffer:
    by level
    by logger name
    by thread
    by message
easy narrowing
edit all currently set filters at once
A few log-specific movement commands
Copy (with M-w) only visible text, goes well with filtering (this is customizable)