Compare between different NER systems in GATE - annotations

I am new to GATE. I am trying to analyse the performance of different tools on a wide range of corpora.
The problem is that the diff tool and the corpus QA tool require the annotation sets to be identical, down to letter case. However, each system has its own schema and generates different labels. For example, organisation in one system is Org in the other.
Is there a way to normalise these schemas so that the different systems can be compared?

In such cases (renaming, adding empty annotation sets, ...) I recommend working on the exported XML of the corpus:
Right-click on the corpus -> Save as ... -> GATE XML
If you look at the exported files, you will see the annotation sets at the end of each file (after your actual data), like this:
... data ...
</TextWithNodes>
<AnnotationSet Name="myAnnotationSet">
<Annotation Id="1" Type="AnnotationName" StartNode="11" EndNode="111">
<Feature>
<Name className="java.lang.String">feature-key</Name>
<Value className="java.lang.String">feature-value</Value>
</Feature>
...
</Annotation>
...
</AnnotationSet>
...
Simply replace whatever you need, e.g. with
find . -name '*.xml' -exec sed -i 's/>feature-key</>new-key</g' "{}" \;
(assuming that the phrase >feature-key< appears nowhere else in the document), or with your favourite text editor, and then re-import the corpus:
Right-click on an (empty) corpus -> Populate
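For the label renaming asked about in the question, the same approach can be applied to the Type attribute of the annotations. The following is only a sketch: the file doc1.xml, the set name systemB, and the labels Org and Organisation are made-up stand-ins for your own export, and the -i.bak form of sed is used because both GNU and BSD sed accept it.

```shell
# Hypothetical file standing in for one exported GATE XML document.
workdir=$(mktemp -d)
cat > "$workdir/doc1.xml" <<'EOF'
<AnnotationSet Name="systemB">
<Annotation Id="1" Type="Org" StartNode="11" EndNode="111">
</Annotation>
</AnnotationSet>
EOF

# Rename the annotation type Org to Organisation in every exported file.
# -i.bak edits in place and keeps a backup copy next to each file.
find "$workdir" -name '*.xml' \
    -exec sed -i.bak 's/Type="Org"/Type="Organisation"/g' {} \;

# Show the Type attributes after the rewrite.
grep 'Type=' "$workdir/doc1.xml"
```

Matching on the whole attribute, including the quotes, keeps the substitution from touching longer type names that merely start with Org.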

Esttab: Append rtf files with page break?

I use a loop to append each regression table for various dependent variables into one file:
global all_var var1 var2 var3
foreach var of global all_var {
capture noisily : eststo mod0: reg `var' i.female
capture noisily : eststo mod1: reg `var' i.female
capture noisily : eststo mod2: reg `var' i.female
esttab mod0 mod1 mod2 using "file_name.rtf", append
}
However, in the final RTF file some tables stretch across two pages, which does not look good.
Is there any way to avoid that, e.g. by introducing some sort of page break?
The community-contributed package rtfutil provides a solution:
net describe rtfutil, from(http://fmwww.bc.edu/RePEc/bocode/r)
TITLE
'RTFUTIL': module to provide utilities for writing Rich Text Format (RTF) files
DESCRIPTION/AUTHOR(S)
The rtfutil package is a suite of file handling utilities for
producing Rich Text Format (RTF) files in Stata, possibly
containing plots and tables. These RTF files can then be opened
by Microsoft Word, and possibly by alternative free word
processors. The plots can be included by inserting, as linked
objects, graphics files that might be produced by the graph
export command in Stata. The tables can be included by using the
listtex command, downloadable from SSC, with the handle() option.
The exact syntax will depend on your specific use case, for which you do not provide any example data.
After installing rtfutil, you may use rtfappend. Suppose you want a page break between mod1 and mod2.
esttab mod0 mod1 using "file_name.rtf", replace
tempname handle
rtfappend `handle' using "file_name.rtf", replace
file write `handle' "\page" _n
rtfclose `handle'
esttab mod2 using "file_name.rtf", append
If you want a line break, just replace \page with \line.

Merge two PO files and overwrite matching translation rules

I'm attempting to merge two PO files.
I have a base.po file that has general translations.
I have an extra.po with extra translations that I'd like to add to the base file, overwriting the base translations wherever the translation IDs match.
I've tried using msgmerge:
$ msgmerge extra.po base.po -o merge.po
But this comments out any translations with matching IDs.
Looking at the msgmerge documentation, it doesn't look like there is any option to effect this behavior.
I'd like to be able to have multiple extra translation files (extra1.po, extra2.po, etc.) so I can merge them with the base translation file and use them in different contexts.
Does anyone know how to do what I'm attempting?
Turns out I needed to be using msgcat instead.
The command below creates a PO file merge.po that contains all of the translations from extra.po and adds any additional translations from base.po.
The --use-first option tells msgcat to keep the first translation it encounters when the same translation ID appears in both files; because extra.po is listed first, its translation is the one chosen.
$ msgcat extra.po base.po -o merge.po --use-first
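To cover the case of several extra files mentioned in the question, the same command can be run once per file. This is a minimal sketch: the function name merge_contexts and the merged- output prefix are made up for illustration.

```shell
# merge_contexts BASE EXTRA...
# For each EXTRA, writes merged-<extra file name> into the current directory,
# preferring the translations in EXTRA over those in BASE.
merge_contexts() {
    base=$1
    shift
    for extra in "$@"; do
        # The extra file is listed first, so --use-first keeps its translations.
        msgcat --use-first "$extra" "$base" -o "merged-$(basename "$extra")"
    done
}
```

Calling merge_contexts base.po extra1.po extra2.po would then produce merged-extra1.po and merged-extra2.po, one merged file per context.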

How to join files in Linux, for example if I have many files

I have 133 files named Trace1.log, Trace2.log, and so on. How can I merge all of these files and save the result in one file?
To simply concatenate the files in alphabetical order,
cat Trace*.log >combined
Take care to name the destination file so it doesn't match the wildcard, or you will get weird results.
Alphabetical order means Trace10.log sorts before Trace2.log. If you need the files in numeric order, use a more suitable naming convention (e.g. rename Trace1.log to Trace001.log, etc.) or use multiple wildcards:
cat Trace?.log Trace??.log Trace???.log >combined
The locale will affect what exactly "alphabetic order" means; these guidelines apply to the traditional C locale and English-language locales at least (and most other Western languages).
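If renaming the files is not an option, another way to get numeric order is to loop over the numbers explicitly. The sketch below assumes the files are numbered consecutively from 1 and that the seq utility is available; it creates 12 small sample files in a temporary directory purely for demonstration (with the 133 files from the question, the loop would run over seq 1 133).

```shell
# Create hypothetical sample files Trace1.log ... Trace12.log for the demo.
workdir=$(mktemp -d)
for i in $(seq 1 12); do
    echo "line from Trace$i" > "$workdir/Trace$i.log"
done

# Concatenate in numeric order: 1, 2, ..., 12 (not 1, 10, 11, 12, 2, ...).
# The redirection applies to the whole loop, so the output is written once.
for i in $(seq 1 12); do
    cat "$workdir/Trace$i.log"
done > "$workdir/combined"
```

Because the file names are generated rather than globbed, the destination file cannot accidentally match the input pattern.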
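You can try using the cat command.
$ cat Trace[0-9]*.log > TraceFull.log
The pattern Trace[0-9]*.log matches only the numbered logs, so the output file TraceFull.log is not picked up as an input.
Take a look at this site: Joining files together
If you're on a Unix-based system, use the following command:
cat Trace[0-9]*.log > TraceMerged.log
(while in the directory holding the logs)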

Concatenate content of TAGS files from different directories

I'm referring to the TAGS file generated by ctags or etags, which enables code navigation in Emacs via the M-. key binding.
A typical project looks like this:
Large standard library (more than 100 files, but rarely updated).
Project-specific library (updated on a daily basis).
I would like the project to be able to use two (or maybe more) TAGS files, but regenerate only some of them, namely the ones that belong to the particular project. How would I approach this problem?
etags --help:
-i FILE, --include=FILE
Include a note in tag file indicating that, when searching for
a tag, one should also consult the tags file FILE after
checking the current file.
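In practice this means the large library's TAGS file can be built once, and only the project's own TAGS file needs regenerating, with an include entry pointing at the library's file. A minimal sketch, assuming the etags program shipped with Emacs; the function name regen_project_tags and the paths in the usage note are hypothetical.

```shell
# regen_project_tags STDLIB_TAGS SOURCE_FILE...
# Rebuilds ./TAGS from the given project sources and records STDLIB_TAGS
# as an included tags file, so the library itself is not re-scanned.
regen_project_tags() {
    stdlib_tags=$1
    shift
    etags --include="$stdlib_tags" -o TAGS "$@"
}
```

For example, regen_project_tags /path/to/stdlib/TAGS src/*.c would regenerate only the project's tags while leaving the library's TAGS file untouched.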

Can I use `diff -r` to just tell me the files that are in one of the trees that have changed in the other?

I want to generate a summary of the files that exist in both trees and have been modified in the second.
The use case is this: I have a product distribution, which contains web content files. Those files are then imported into a client-specific project, and may be modified from there. I now want to see all the files in the client-specific project that have changed since the product was imported, so I can update the product and keep the client-specific changes.
I'm thinking something like this might work:
diff -r productDistribution/WebContent clientProject/WebContent
However, a number of files exist only in the client-specific project and not in the product distribution, and I am not concerned with those in this process. Essentially, I want an 'outer join', in SQL parlance.
Ideally, I want to be able to create a patch that contains all the client-specific changes. Then, I can just overlay the new product files, and apply the patch, and I should be all set.
Any ideas?
By default diff only prints a single line for each file that is in only one of the trees, so it's easy to filter these out:
diff -r productDistribution/WebContent clientProject/WebContent | \
grep -v 'Only in clientProject'
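To go from the file summary to the patch mentioned in the question, the same filter can be applied to a unified diff. The following sketch builds two tiny throwaway trees in a temporary directory (the file names index.html and extra.html are made up) and assumes a diff that supports -q and -u, as GNU and BSD diff do. Note that the pattern '^Only in ' drops files that are unique to either tree, not just those unique to the client project.

```shell
# Build two hypothetical trees: one shared file that was modified,
# plus one file that exists only in the client project.
workdir=$(mktemp -d)
mkdir -p "$workdir/productDistribution/WebContent" \
         "$workdir/clientProject/WebContent"
echo "original" > "$workdir/productDistribution/WebContent/index.html"
echo "modified" > "$workdir/clientProject/WebContent/index.html"
echo "client only" > "$workdir/clientProject/WebContent/extra.html"

# Summary: files present in both trees whose contents differ.
changed=$(cd "$workdir" &&
    diff -rq productDistribution/WebContent clientProject/WebContent |
    grep -v '^Only in ')
echo "$changed"

# Patch: the same comparison as a unified diff, minus the "Only in" lines.
(cd "$workdir" &&
    diff -ru productDistribution/WebContent clientProject/WebContent |
    grep -v '^Only in ' > client-changes.patch)
```

Filtering on the start of the line is safe for the unified diff because every content line in that format begins with a space, + or -.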