How to store and compare annotation (with Gold Standard) in GATE

I am very comfortable with UIMA, but my new job requires me to use GATE.
So I started learning GATE. My question is about how to calculate the performance of my tagging engines (Java based).
With UIMA, I generally dump all my system annotations into an XMI file and then use Java code to compare them with human-annotated (gold standard) annotations to calculate precision, recall and F-score.
But I am still struggling to find something similar in GATE.
After going through GATE's Annotation Diff and the other information on that page, I feel there must be an easy way to do this in Java, but I have not been able to figure it out. I thought I would ask here in case someone has already solved this.
How can I store system annotations in an XMI file (or any other file format) programmatically?
How can I create one-time gold standard data (i.e. human-annotated data) for the performance calculation?
Let me know if you need more specifics or details.
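
For reference, the comparison step itself (system annotations against gold standard annotations) can be sketched in plain Java, independent of both GATE and UIMA. This is only a minimal sketch under stated assumptions: the Span class is a hypothetical stand-in for an annotation (type plus start and end offsets), and matching is strict, so an annotation counts as correct only if type and both offsets agree with a gold annotation.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class PrfSketch {

    /** Hypothetical stand-in for an annotation: a type plus character offsets. */
    static final class Span {
        final String type;
        final long start;
        final long end;

        Span(String type, long start, long end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Span)) return false;
            Span s = (Span) o;
            return type.equals(s.type) && start == s.start && end == s.end;
        }

        @Override
        public int hashCode() {
            return Objects.hash(type, start, end);
        }
    }

    /** Returns {precision, recall, f1} using strict (exact type and offsets) matching. */
    static double[] score(Set<Span> gold, Set<Span> system) {
        // True positives: system annotations that also appear in the gold set
        Set<Span> correct = new HashSet<>(system);
        correct.retainAll(gold);
        double tp = correct.size();
        double precision = system.isEmpty() ? 0.0 : tp / system.size();
        double recall = gold.isEmpty() ? 0.0 : tp / gold.size();
        double f1 = (precision + recall) == 0.0
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        Set<Span> gold = new HashSet<>();
        gold.add(new Span("Person", 0, 5));
        gold.add(new Span("Location", 10, 16));

        Set<Span> system = new HashSet<>();
        system.add(new Span("Person", 0, 5));     // exact match
        system.add(new Span("Location", 10, 15)); // wrong end offset, not counted as correct
        system.add(new Span("Date", 20, 24));     // not in the gold set

        double[] prf = score(gold, system);
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", prf[0], prf[1], prf[2]);
    }
}
```

A lenient variant would count overlapping spans of the same type as partial matches; that needs a pairwise overlap check instead of set intersection.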

This code seems helpful for writing the annotations to an XML file.
http://gate.ac.uk/wiki/code-repository/src/sheffield/examples/BatchProcessApp.java
String docXMLString = null;
// if we want to just write out specific annotation types, we must
// extract the annotations into a Set
if(annotTypesToWrite != null) {
  // Create a temporary Set to hold the annotations we wish to write out
  Set annotationsToWrite = new HashSet();
  // we only extract annotations from the default (unnamed) AnnotationSet
  // in this example
  AnnotationSet defaultAnnots = doc.getAnnotations();
  Iterator annotTypesIt = annotTypesToWrite.iterator();
  while(annotTypesIt.hasNext()) {
    // extract all the annotations of each requested type and add them to
    // the temporary set
    AnnotationSet annotsOfThisType =
        defaultAnnots.get((String)annotTypesIt.next());
    if(annotsOfThisType != null) {
      annotationsToWrite.addAll(annotsOfThisType);
    }
  }
  // create the XML string using these annotations
  docXMLString = doc.toXml(annotationsToWrite);
}
// otherwise, just write out the whole document as GateXML
else {
  docXMLString = doc.toXml();
}
// Release the document, as it is no longer needed
Factory.deleteResource(doc);
// output the XML to <inputFile>.out.xml
String outputFileName = docFile.getName() + ".out.xml";
File outputFile = new File(docFile.getParentFile(), outputFileName);
// Write output files using the same encoding as the original
FileOutputStream fos = new FileOutputStream(outputFile);
BufferedOutputStream bos = new BufferedOutputStream(fos);
OutputStreamWriter out;
if(encoding == null) {
  out = new OutputStreamWriter(bos);
}
else {
  out = new OutputStreamWriter(bos, encoding);
}
out.write(docXMLString);
out.close();
System.out.println("done");

Salesforce trigger-Not able to understand

Below is code written by a colleague who doesn't work at the firm anymore. I am inserting records into the object with Data Loader and I see a success message, but I do not see any records in my object. I am not able to understand what the trigger below is doing. Could someone please help me understand it, as I am new to Salesforce?
trigger DataLoggingTrigger on QMBDataLogging__c (after insert) {
Map<string,Schema.RecordTypeInfo> recordTypeInfo = Schema.SObjectType.QMB_Initial_Letter__c.getRecordTypeInfosByName();
List<QMBDataLogging__c> logList = (List<QMBDataLogging__c>)Trigger.new;
List<Sobject> sobjList = (List<Sobject>)Type.forName('List<'+'QMB_Initial_Letter__c'+'>').newInstance();
Map<string, QMBLetteTypeToVfPage__c> QMBLetteTypeToVfPage = QMBLetteTypeToVfPage__c.getAll();
Map<String,QMBLetteTypeToVfPage__c> mapofLetterTypeRec = new Map<String,QMBLetteTypeToVfPage__c>();
set<Id>processdIds = new set<Id>();
for(string key : QMBLetteTypeToVfPage.keyset())
{
if(!mapofLetterTypeRec.containsKey(key)) mapofLetterTypeRec.put(QMBLetteTypeToVfPage.get(Key).Letter_Type__c, QMBLetteTypeToVfPage.get(Key));
}
for(QMBDataLogging__c log : logList)
{
Sobject logRecord = (sobject)log;
Sobject QMBLetterRecord = new QMB_Initial_Letter__c();
if(mapofLetterTypeRec.containskey(log.Field1__c))
{
string recordTypeId = recordTypeInfo.get(mapofLetterTypeRec.get(log.Field1__c).RecordType__c).isAvailable() ? recordTypeInfo.get(mapofLetterTypeRec.get(log.Field1__c).RecordType__c).getRecordTypeId() : recordTypeInfo.get('Master').getRecordTypeId();
string fieldApiNames = mapofLetterTypeRec.containskey(log.Field1__c) ? mapofLetterTypeRec.get(log.Field1__c).FieldAPINames__c : '';
//QMBLetterRecord.put('Letter_Type__c',log.Name);
QMBLetterRecord.put('RecordTypeId',recordTypeId);
processdIds.add(log.Id);
if(string.isNotBlank(fieldApiNames) && fieldApiNames.contains(','))
{
Integer i = 1;
for(string fieldApiName : fieldApiNames.split(','))
{
string logFieldApiName = 'Field'+i+'__c';
fieldApiName = fieldApiName.trim();
system.debug('fieldApiName=='+fieldApiName);
Schema.DisplayType fielddataType = getFieldType('QMB_Initial_Letter__c',fieldApiName);
if(fielddataType == Schema.DisplayType.Date)
{
Date dateValue = Date.parse(string.valueof(logRecord.get(logFieldApiName)));
QMBLetterRecord.put(fieldApiName,dateValue);
}
else if(fielddataType == Schema.DisplayType.DOUBLE)
{
string value = (string)logRecord.get(logFieldApiName);
Double dec = Double.valueOf(value.replace(',',''));
QMBLetterRecord.put(fieldApiName,dec);
}
else if(fielddataType == Schema.DisplayType.CURRENCY)
{
Decimal decimalValue = Decimal.valueOf((string)logRecord.get(logFieldApiName));
QMBLetterRecord.put(fieldApiName,decimalValue);
}
else if(fielddataType == Schema.DisplayType.INTEGER)
{
string value = (string)logRecord.get(logFieldApiName);
Integer integerValue = Integer.valueOf(value.replace(',',''));
QMBLetterRecord.put(fieldApiName,integerValue);
}
else if(fielddataType == Schema.DisplayType.DATETIME)
{
DateTime dateTimeValue = DateTime.valueOf(logRecord.get(logFieldApiName));
QMBLetterRecord.put(fieldApiName,dateTimeValue);
}
else
{
QMBLetterRecord.put(fieldApiName,logRecord.get(logFieldApiName));
}
i++;
}
}
}
sobjList.add(QMBLetterRecord);
}
if(!sobjList.isEmpty())
{
insert sobjList;
if(!processdIds.isEmpty()) DeleteDoAsLoggingRecords.deleteTheProcessRecords(processdIds);
}
Public static Schema.DisplayType getFieldType(string objectName,string fieldName)
{
SObjectType r = ((SObject)(Type.forName('Schema.'+objectName).newInstance())).getSObjectType();
DescribeSObjectResult d = r.getDescribe();
return(d.fields.getMap().get(fieldName).getDescribe().getType());
}
}

You might be looking in the wrong place. Check if there's a unit test written for this thing (there should be one, especially if it's deployed to production); it should help you understand how it's supposed to be used.
You're inserting records of QMBDataLogging__c, but it seems they're then immediately deleted in DeleteDoAsLoggingRecords.deleteTheProcessRecords(processdIds), whether or not whatever this thing was supposed to do succeeds.
This seems to be some poor man's CSV parser or generic "upload anything"... that takes data stored in QMBDataLogging__c and creates QMB_Initial_Letter__c out of it.
QMBLetteTypeToVfPage__c.getAll() suggests you could go to Setup -> Custom Settings, try to find this thing and examine. Maybe it has some values in production but in your sandbox it's empty and that's why essentially nothing works? Or maybe some values that are there are outdated?
There's some comparison if what you upload into Field1__c can be matched to what's in that custom setting. I guess you load some kind of subtype of your QMB_Initial_Letter__c in there. Record Type name and list of fields to read from your log record is also fetched from custom setting based on that match.
Then this thing takes what you pasted, looks at the list of fields from the custom setting and parses it.
Let's say the custom setting contains something like
Name = XYZ, FieldAPINames__c = 'Name,SomePicklist__c,SomeDate__c,IsActive__c'
This thing will look at first record you inserted, let's say you have the CSV like that
Field1__c,Field2__c,Field3__c,Field4__c
XYZ,Closed,2022-09-15,true
This thing will try to parse and map it so eventually you create record that a "normal" apex code would express as
new QMB_Initial_Letter__c(
Name = 'XYZ',
SomePicklist__c = 'Closed',
SomeDate__c = Date.parse('2022-09-15'),
IsActive__c = true
);
It's pretty fragile, as you probably already know. And because parsing CSV is an art - I expect it to absolutely crash and burn when text with commas in it shows up (some text,"text, with commas in it, should be quoted",more text).
In theory admin can change mapping in setup - but then they'd need to add new field anyway to the loaded file. Overcomplicated. I guess somebody did it to solve issue with Record Type Ids - but there are better ways to achieve that and still have normal CSV file with normal columns and strong type matching, not just chucking everything in as strings.
In theory this lets you have "jagged" csv files (row 1 having 5 fields, row 2 having different record type and 17 fields? no problem)
Your call whether it's salvageable or you'd rather ditch it and try normal loading of QMB_Initial_Letter__c records. (get back to your business people and ask for requirements?) If you do have variable number of columns at source - you'd need to standardise it or group the data so only 1 "type" of records (well, whatever's in that "Field1__c") goes into each file.
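
The positional mapping described above (the Nth name in FieldAPINames__c receives the value of FieldN__c) can also be sketched outside Salesforce. The following is plain Java rather than Apex, with ordinary maps standing in for the sObjects; the class and method names are hypothetical, and the per-type conversion the trigger performs (dates, numbers and so on) is left out, so every value stays a string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FieldMappingSketch {

    /**
     * Maps Field1__c, Field2__c, ... of a log record onto the target field
     * names listed (comma separated) in fieldApiNames, purely by position.
     */
    static Map<String, String> mapFields(String fieldApiNames, Map<String, String> logRecord) {
        Map<String, String> target = new LinkedHashMap<>();
        int i = 1;
        for (String apiName : fieldApiNames.split(",")) {
            String logFieldName = "Field" + i + "__c";
            target.put(apiName.trim(), logRecord.get(logFieldName));
            i++;
        }
        return target;
    }

    public static void main(String[] args) {
        // One row as it would arrive in the logging object
        Map<String, String> logRecord = new LinkedHashMap<>();
        logRecord.put("Field1__c", "XYZ");
        logRecord.put("Field2__c", "Closed");
        logRecord.put("Field3__c", "2022-09-15");
        logRecord.put("Field4__c", "true");

        // The field list as it would be stored in the custom setting
        String fieldApiNames = "Name,SomePicklist__c,SomeDate__c,IsActive__c";

        System.out.println(mapFields(fieldApiNames, logRecord));
    }
}
```

Because the mapping is positional, adding or reordering a name in the custom setting silently shifts which FieldN__c feeds which target field, which is one reason the design is fragile.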

Store data persistently for a language learning app in Unity

I'm currently working on a language learning app with Unity. I want to implement that when you guess a word (e.g. a number) incorrectly, you need to guess the word again in the next iteration. My idea is to store, for each word in every play mode, a value between -10 and +10; when an item has a large negative value, the word occurs more often than when it has a large positive value.
My problem is that I don't know how to store the data properly. PlayerPrefs are too inconvenient for this, and I don't know how to modify a JSON file properly.
Currently, I store the data for the items in a class.
Maybe you could have a structure like:
Numbers:
  write:
    zero: -5
    one: +3
  match:
    zero: +5
    one: +4
Alphabet:
  write:
    A: -10
  match:
    A: +5
Does anyone have an idea how to solve this problem?

One of the best JSON serialization libraries is Newtonsoft.Json.
You can use your class and serialize an object to JSON object, and then save it as a string to file.
public static string Serialize(object obj)
{
var settings = new JsonSerializerSettings
{
MissingMemberHandling = MissingMemberHandling.Ignore,
NullValueHandling = NullValueHandling.Ignore
};
return JsonConvert.SerializeObject(obj, settings);
}
After that you can save it to file in the Application.persistentDataPath directory.
var text = Serialize(data);
var tmpFilePath = Path.Combine(Application.persistentDataPath, "filename");
Directory.CreateDirectory(Path.GetDirectoryName(tmpFilePath));
if (File.Exists(tmpFilePath))
{
File.Delete(tmpFilePath);
}
File.WriteAllText(tmpFilePath, text);
After that you can read the file at any time using File.ReadAllText and deserialize it to an object.
public static T Deserialize<T>(string text)
{
var settings = new JsonSerializerSettings
{
MissingMemberHandling = MissingMemberHandling.Ignore,
NullValueHandling = NullValueHandling.Ignore
};
try
{
var result = JsonConvert.DeserializeObject<T>(text, settings);
return result ?? default;
}
catch (Exception e)
{
Debug.Log(e);
}
return default;
}
T result = default;
try
{
if (File.Exists(path))
{
var text = File.ReadAllText(path);
result = Deserialize<T>(text);
}
}
catch (Exception e)
{
Debug.LogException(e);
}
return result;

Unfortunately, there is no easy way to store data persistently between play sessions in Unity. PlayerPrefs and creating your own JSON file are the simplest ways of doing this.
The good news is that JSON files are quite easy to make and modify, thanks to the built-in JsonUtility class Unity provides.
Make a separate class or struct to hold your scores, give that class the [Serializable] attribute, and keep a reference to it in your current setup (which is probably a MonoBehaviour).
You can use the File class (specifically File.CreateText() and File.OpenText()) to write to/read from a file. If you do this every time a value changes, you should end up with persistent saved data across multiple play sessions.
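
Once the scores are stored, one way to make badly scored words come back more often is weighted random selection. Below is a minimal sketch of that idea, written in Java (the logic carries over to C# almost line for line). The weight formula 11 - score and all names are assumptions made for illustration, not something taken from the question. The random roll is passed in as a parameter so the behaviour is easy to check.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

class WordPickerSketch {

    /** Assumed weighting: score -10 gives weight 21, score +10 gives weight 1. */
    static int weight(int score) {
        return 11 - score;
    }

    /**
     * Picks a word with probability proportional to its weight.
     * roll is expected to be in [0, 1), e.g. from Random.nextDouble().
     */
    static String pick(Map<String, Integer> scores, double roll) {
        int total = 0;
        for (int score : scores.values()) {
            total += weight(score);
        }
        double target = roll * total;
        double cumulative = 0;
        String last = null;
        for (Map.Entry<String, Integer> entry : scores.entrySet()) {
            cumulative += weight(entry.getValue());
            last = entry.getKey();
            if (target < cumulative) {
                return last;
            }
        }
        return last; // null for an empty map; last word if roll was >= 1.0
    }

    /** Moves a score one step up or down and keeps it within [-10, 10]. */
    static int update(int score, boolean guessedCorrectly) {
        int next = score + (guessedCorrectly ? 1 : -1);
        return Math.max(-10, Math.min(10, next));
    }

    public static void main(String[] args) {
        Map<String, Integer> writeScores = new LinkedHashMap<>();
        writeScores.put("zero", -5);
        writeScores.put("one", 3);

        Random random = new Random();
        String word = pick(writeScores, random.nextDouble());
        System.out.println("Next word: " + word);

        // Suppose the player got it wrong: lower the score so it shows up more often
        writeScores.put(word, update(writeScores.get(word), false));
        System.out.println(writeScores);
    }
}
```

With the scores above, "zero" has weight 16 and "one" has weight 8, so "zero" is drawn about twice as often.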

How to edit pasted content using the Open XML SDK

I have a custom template in which I'd like to control (as best I can) the types of content that can exist in a document. To that end, I disable controls, and I also intercept pastes to remove some of those content types, e.g. charts. I am aware that this content can also be drag-and-dropped, so I also check for it later, but I'd prefer to stop or warn the user as soon as possible.
I have tried a few strategies:
RTF manipulation
Open XML manipulation
RTF manipulation is so far working fairly well, but I'd really prefer to use Open XML as I expect it to be more useful in the future. I just can't get it working.
Open XML Manipulation
The wonderfully-undocumented (as far as I can tell) "Embed Source" appears to contain a compound document object, which I can use to modify the copied content using the Open XML SDK. But I have been unable to put the modified content back into an object that lets it be pasted correctly.
The modification part seems to work fine. I can see, if I save the modified content to a temporary .docx file, that the changes are being made correctly. It's the return to the clipboard that seems to be giving me trouble.
I have tried assigning just the Embed Source object back to the clipboard (so that the other types such as RTF get wiped out), and in this case nothing at all gets pasted. I've also tried re-assigning the Embed Source object back to the clipboard's data object, so that the remaining data types are still there (but with mismatched content, probably), which results in an empty embedded document getting pasted.
Here's a sample of what I'm doing with Open XML:
using OpenMcdf;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
...
Forms.IDataObject dataObj = Forms.Clipboard.GetDataObject();
object embedSrcObj = dataObj.GetData("Embed Source");
if (embedSrcObj is Stream)
{
// read it with OpenMCDF
Stream stream = embedSrcObj as Stream;
CompoundFile cf = new CompoundFile(stream);
CFStream cfs = cf.RootStorage.GetStream("package");
byte[] bytes = cfs.GetData();
string savedDoc = Path.GetTempFileName() + ".docx";
File.WriteAllBytes(savedDoc, bytes);
// And then use the OpenXML SDK to read/edit the document:
using (WordprocessingDocument openDoc = WordprocessingDocument.Open(savedDoc, true))
{
OpenXmlElement body = openDoc.MainDocumentPart.RootElement.ChildElements[0];
foreach (OpenXmlElement ele in body.ChildElements)
{
if (ele is Paragraph)
{
Paragraph para = (Paragraph)ele;
if (para.ParagraphProperties != null && para.ParagraphProperties.ParagraphStyleId != null)
{
string styleName = para.ParagraphProperties.ParagraphStyleId.Val;
Run run = para.LastChild as Run; // I know I'm assuming things here but it's sufficient for a test case
run.RunProperties = new RunProperties();
run.RunProperties.AppendChild(new DocumentFormat.OpenXml.Wordprocessing.Text("test"));
}
}
// etc.
}
openDoc.MainDocumentPart.Document.Save(); // I think this is redundant in later versions than what I'm using
}
// repackage the document
bytes = File.ReadAllBytes(savedDoc);
cf.RootStorage.Delete("Package");
cfs = cf.RootStorage.AddStream("Package");
cfs.Append(bytes);
MemoryStream ms = new MemoryStream();
cf.Save(ms);
ms.Position = 0;
dataObj.SetData("Embed Source", ms);
// or,
// Clipboard.SetData("Embed Source", ms);
}
Question
What am I doing wrong? Is this just a bad/unworkable approach?

How to get the current tool SitePage and/or its Properties?

With the ToolManager I can get the current placement and the context, and of course the Site through the SiteService. But I want to get the properties of the SitePage the user is currently accessing.
The same question extends to the current Tool's properties, with a little more emphasis, since once I have the Tool I could not find any methods that expose its properties.
I was able to get the tool properties (they are per instance) through the Properties returned by sitepage.getTool(TOOLID).getConfig(), and I am using them. To save a property, I use the ToolConfiguration approach and save the data after editing with the ToolConfiguration.save() method. Is this the correct approach?

You can do this by getting the current tool session and then working your way backward from that. Here is a method that should do it.
public SitePage findCurrentPage() {
    SitePage sp = null;
    ToolSession ts = SessionManager.getCurrentToolSession();
    if (ts != null) {
        ToolConfiguration tool = SiteService.findTool(ts.getPlacementId());
        if (tool != null) {
            String sitePageId = tool.getPageId();
            // Resolve the page from its id
            sp = SiteService.findPage(sitePageId);
        }
    }
    return sp;
}
Alternatively, you could use the current tool to work your way to it but I think this method is harder.
String toolId = toolManager.getCurrentTool().getId();
String context = toolManager.getCurrentPlacement().getContext();
Site s = siteService.getSite( context );
ToolConfiguration tc = s.getTool(toolId);
String sitePageId = tc.getPageId();
SitePage sp = s.getPage(sitePageId);
NOTE: I have not tested this code to make sure it works.

Getting line locations with iText

How can one find where lines are located in a document with iText?
Suppose I have a table in a PDF document and want to read its contents; I would like to find exactly where the cells are located. To do that, I thought I might find the intersections of the lines.

I think your only option using iText will be to parse the PDF tokens manually. Before doing that I would have a copy of the PDF spec handy.
(I'm a .Net guy so I use iTextSharp but other than some capitalization differences and property declarations they're almost 100% the same.)
You can get the individual tokens using the PRTokeniser object which you feed bytes into from calling getPageContent(pageNum) on your PdfReader.
//Get bytes for page 1
byte[] pageBytes = reader.GetPageContent(1);
//Get the tokens for page 1
PRTokeniser tokeniser = new PRTokeniser(pageBytes);
Then just loop through the PRTokeniser:
PRTokeniser.TokType tokenType;
string tokenValue;
while (tokeniser.NextToken()) {
    tokenType = tokeniser.TokenType;
    tokenValue = tokeniser.StringValue;
    //...check tokenValue, do something with it
}
As for tokenValue, you'd probably want to look for the re and l operators, for rectangle and line respectively. If you see an re, look at the previous 4 values (x, y, width, height); if you see an l, look at the previous 2 values (the x and y of the line's end point). This also means that you need to store each tokenValue in an array so you can look back later.
Depending on what you used to create the PDF with you might get some interesting results. For instance, I created a 4 cell table with Microsoft Word and saved as a PDF. For some reason there are two sets of 10 rectangles with many duplicates, but the general idea still works.
Below is C# code targeting iTextSharp 5.1.1.0. You should be able to convert it to Java and iText very easily, I noted the one line that has .Net-specific code that needs to be adjusted from a Generic List (List<string>) to a Java equivalent, probably an ArrayList. You'll also need to adjust some casing, .Net uses Object.Method() whereas Java uses Object.method(). Lastly, .Net accesses properties without gets and sets, so Object.Property is both the getter and setter compared to Java's Object.getProperty and Object.setProperty.
Hopefully this gets you started at least!
//Source file to read from
string sourceFile = "c:\\Hello.pdf";
//Bind a reader to our PDF
PdfReader reader = new PdfReader(sourceFile);
//Create our buffer for previous token values. For Java users, List<string> is a generic list, probably most similar to an ArrayList
List<string> buf = new List<string>();
//Get the raw bytes for the page
byte[] pageBytes = reader.GetPageContent(1);
//Get the raw tokens from the bytes
PRTokeniser tokeniser = new PRTokeniser(pageBytes);
//Create some variables to set later
PRTokeniser.TokType tokenType;
string tokenValue;
//Loop through each token
while (tokeniser.NextToken()) {
//Get the types and value
tokenType = tokeniser.TokenType;
tokenValue = tokeniser.StringValue;
//If the type is a numeric type
if (tokenType == PRTokeniser.TokType.NUMBER) {
//Store it in our buffer for later use
buf.Add(tokenValue);
//Otherwise we only care about raw commands which are categorized as "OTHER"
} else if (tokenType == PRTokeniser.TokType.OTHER) {
//Look for a rectangle token
if (tokenValue == "re") {
//Sanity check, make sure we have enough items in the buffer
if (buf.Count < 4) throw new Exception("Not enough elements in buffer for a rectangle");
//Read and convert the values
float x = float.Parse(buf[buf.Count - 4]);
float y = float.Parse(buf[buf.Count - 3]);
float w = float.Parse(buf[buf.Count - 2]);
float h = float.Parse(buf[buf.Count - 1]);
//..do something with them here
}
}
}
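
The listing above only handles re. A straight line is normally built from an m (move to) followed by an l (line to), so the same look-back idea can be extended to collect line segments. Below is a minimal sketch in plain Java that works on an already tokenised content stream (a list of strings) and makes no iText calls; the class and method names are hypothetical. It ignores transformation matrices (cm), so the coordinates are in whatever user space is current at that point, and it does not check whether a path is actually stroked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PathOpsSketch {

    /** Returns segments as {x1, y1, x2, y2}, taken from m/l pairs and re rectangles. */
    static List<float[]> lineSegments(List<String> tokens) {
        List<Float> operands = new ArrayList<>();
        List<float[]> segments = new ArrayList<>();
        float curX = 0, curY = 0;
        boolean hasPoint = false;

        for (String token : tokens) {
            if (isNumber(token)) {
                operands.add(Float.parseFloat(token)); // buffer operands until an operator arrives
                continue;
            }
            int n = operands.size();
            if (token.equals("m") && n >= 2) {
                // Move to: start a new subpath
                curX = operands.get(n - 2);
                curY = operands.get(n - 1);
                hasPoint = true;
            } else if (token.equals("l") && n >= 2 && hasPoint) {
                // Line to: segment from the current point to (x, y)
                float x = operands.get(n - 2);
                float y = operands.get(n - 1);
                segments.add(new float[] {curX, curY, x, y});
                curX = x;
                curY = y;
            } else if (token.equals("re") && n >= 4) {
                // Rectangle: x y width height, expanded into its four edges
                float x = operands.get(n - 4);
                float y = operands.get(n - 3);
                float w = operands.get(n - 2);
                float h = operands.get(n - 1);
                segments.add(new float[] {x, y, x + w, y});
                segments.add(new float[] {x + w, y, x + w, y + h});
                segments.add(new float[] {x + w, y + h, x, y + h});
                segments.add(new float[] {x, y + h, x, y});
                curX = x;
                curY = y;
                hasPoint = true;
            }
            operands.clear(); // operands belong to exactly one operator
        }
        return segments;
    }

    static boolean isNumber(String token) {
        return token.matches("[+-]?(\\d+(\\.\\d*)?|\\.\\d+)");
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "10", "20", "m", "110", "20", "l", "S",
                "0", "0", "50", "30", "re", "S");
        for (float[] s : lineSegments(tokens)) {
            System.out.println(Arrays.toString(s));
        }
    }
}
```

Table cell corners could then be estimated by intersecting the horizontal and vertical segments, keeping in mind the duplicated rectangles mentioned above.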