Apache Tika - Parsing and extracting only metadata without reading content

Apache Tika - Parsing and extracting only metadata without reading content - metadata

Is there a way to configure the Apache Tikka so that it only extracts the metadata properties from the file and does not access the content of the file. ? We need a way to do this so as to avoid reading the entire content in larger files.
The code to extract we are using is as follows:
var tikaConfig = TikaConfig.getDefaultConfig();
var metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser(tikaConfig);
BodyContentHandler handler = new BodyContentHandler();
using (TikaInputStream stream = TikaInputStream.get(new File(filename), metadata))
{
parser.parse(stream, handler, metadata, new ParseContext());
Array metadataKeys = metadata.names();
Array.Sort(metadataKeys);
}
With the above code sample, when we try to extract the metadata even the content is being read. We would need a way to avoid the same.

Related

Mirth - basic HL7 to HL7 transformation question

new to Mirth, not new to engines... finding it a bit challenging to do a basic source to destination HL7v2 transformation.
I've set up my Channel to read from a file as the source, and spit out the destination to a file as well. My output template is ${message.encodedData}. The channel seems to be reading the source correctly, and generating an output. But what I'm struggling with is how cumbersome this is.
I'm playing with an HL7 SIU message, my source has a lot more fields than the destination wants to receive, just need a simple way to map the few fields that are required.
I inserted the source system message template into the Destination Transformer Inbound Message Templates, then I'm doing the following which seems to work:
//MSH Segment
if (msg['MSH'][0]){
var MSH1 = msg['MSH']['MSH.1'];
var MSH2 = msg['MSH']['MSH.2'];
var MSH7 = msg['MSH']['MSH.7'];
var MSH9 = msg['MSH']['MSH.9'];
msg['MSH'] = '';
msg['MSH']['MSH.1']=MSH1;
msg['MSH']['MSH.2']=MSH2;
msg['MSH']['MSH.7']=MSH7;
msg['MSH']['MSH.9']=MSH9;
}
Rinse and repeat for the segments that I need, seems very painful to me.
On a second destination, I'm trying to leverage the Inbound and Outbound Message Template. Inserted the source system template as above, inserted the destination system template in Outbound Message Template.
My Javascript for that one looks something like this:
//MSH Segment
if (msg['MSH'][0]){
tmp['MSH'] = "";
tmp['MSH']['MSH.1'] = msg['MSH']['MSH.1'];
tmp['MSH']['MSH.2'] = msg['MSH']['MSH.2'];
tmp['MSH']['MSH.7'] = msg['MSH']['MSH.7'];
tmp['MSH']['MSH.9'] = msg['MSH']['MSH.9'];
}
It's cleaner, but doesn't seem to work properly, in some messages, my source doesn't have a PV1 segment, but the output contains the sample PV1 segment in the Output Message Template. Do I need to have an initial statement that is tmp = "";
There has to be a easier way to accomplish what I'm trying here, any advise is appreciated!
M

Eventually figured out a different route. Removed the outbound template entirely and built the outbound message from scratch. Here's a snapshot of what it looks like.
var output = <HL7Message/>;
//MSH Segment
createSegment('MSH',output);
output.MSH['MSH.1'] = msg['MSH']['MSH.1'];
output.MSH['MSH.2'] = msg['MSH']['MSH.2'];
output.MSH['MSH.7'] = msg['MSH']['MSH.7'];
output.MSH['MSH.9'] = msg['MSH']['MSH.9'];
//SCH Segment
if (msg['SCH'][0]){
createSegment('SCH',output);
output.SCH['SCH.1'] = msg['SCH']['SCH.1'];
output.SCH['SCH.2'] = msg['SCH']['SCH.2'];
output.SCH['SCH.6'] = msg['SCH']['SCH.6'];
output.SCH['SCH.7'] = msg['SCH']['SCH.7'];
output.SCH['SCH.8'] = msg['SCH']['SCH.8'];
output.SCH['SCH.11'] = msg['SCH']['SCH.11'];
output.SCH['SCH.12'] = msg['SCH']['SCH.12'];
output.SCH['SCH.16'] = msg['SCH']['SCH.16'];
output.SCH['SCH.25'] = msg['SCH']['SCH.25'];
}
var message = SerializerFactory.getSerializer('HL7V2').fromXML(output);
channelMap.put('outmsg',message);
And then in my destination, I use ${outmsg} for the Template.

Word OpenXml Word Found Unreadable Content

We are trying to manipulate a word document to remove a paragraph based on certain conditions. But the word file produced always ends up being corrupted when we try to open it with the error:
Word found unreadable content
The below code corrupts the file but if we remove the line:
Document document = mdp.Document;
The the file is saved and opens without issue. Is there an obvious issue that I am missing?
var readAllBytes = File.ReadAllBytes(#"C:\Original.docx");
using (var stream = new MemoryStream(readAllBytes))
{
using (WordprocessingDocument wpd = WordprocessingDocument.Open(stream, true))
{
MainDocumentPart mdp = wpd.MainDocumentPart;
Document document = mdp.Document;
}
}
File.WriteAllBytes(#"C:\New.docx", readAllBytes);
UPDATE:
using (WordprocessingDocument wpd = WordprocessingDocument.Open(#"C:\Original.docx", true))
{
MainDocumentPart mdp = wpd.MainDocumentPart;
Document document = mdp.Document;
document.Save();
}
Running the code above on a physical file we can still open Original.docx without the error so it seems limited to modifying a stream.

Here's a method that reads a document into a MemoryStream:
public static MemoryStream ReadAllBytesToMemoryStream(string path)
{
byte[] buffer = File.ReadAllBytes(path);
var destStream = new MemoryStream(buffer.Length);
destStream.Write(buffer, 0, buffer.Length);
destStream.Seek(0, SeekOrigin.Begin);
return destStream;
}
Note how the MemoryStream is instantiated. I am passing the capacity rather than the buffer (as in your own code). Why is that?
When using MemoryStream() or MemoryStream(int), you are creating a resizable MemoryStream instance, which you will want in case you make changes to your document. When using MemoryStream(byte[]) (as in your code), the MemoryStream instance is not resizable, which will be problematic unless you don't make any changes to your document or your changes will only ever make it shrink in size.
Now, to read a Word document into a MemoryStream, manipulate that Word document in memory, and end up with a consistent MemoryStream, you will have to do the following:
// Get a MemoryStream.
// In this example, the MemoryStream is created by reading a file stored
// in the file system. Depending on the Stream you "receive", it makes
// sense to copy the Stream to a MemoryStream before processing.
MemoryStream stream = ReadAllBytesToMemoryStream(#"C:\Original.docx");
// Open the Word document on the MemoryStream.
using (WordprocessingDocument wpd = WordprocessingDocument.Open(stream, true)
{
MainDocumentPart mdp = wpd.MainDocumentPart;
Document document = mdp.Document;
// Manipulate document ...
}
// After having closed the WordprocessingDocument (by leaving the using statement),
// you can use the MemoryStream for whatever comes next, e.g., to write it to a
// file stored in the file system.
File.WriteAllBytes(#"C:\New.docx", stream.GetBuffer());
Note that you will have to reset the stream.Position property by calling stream.Seek(0, SeekOrigin.Begin) whenever your next action depends on that MemoryStream.Position property (e.g., CopyTo, CopyToAsync). Right after having left the using statement, the stream's position will be equal to its length.

How to edit pasted content using the Open XML SDK

I have a custom template in which I'd like to control (as best I can) the types of content that can exist in a document. To that end, I disable controls, and I also intercept pastes to remove some of those content types, e.g. charts. I am aware that this content can also be drag-and-dropped, so I also check for it later, but I'd prefer to stop or warn the user as soon as possible.
I have tried a few strategies:
RTF manipulation
Open XML manipulation
RTF manipulation is so far working fairly well, but I'd really prefer to use Open XML as I expect it to be more useful in the future. I just can't get it working.
Open XML Manipulation
The wonderfully-undocumented (as far as I can tell) "Embed Source" appears to contain a compound document object, which I can use to modify the copied content using the Open XML SDK. But I have been unable to put the modified content back into an object that lets it be pasted correctly.
The modification part seems to work fine. I can see, if I save the modified content to a temporary .docx file, that the changes are being made correctly. It's the return to the clipboard that seems to be giving me trouble.
I have tried assigning just the Embed Source object back to the clipboard (so that the other types such as RTF get wiped out), and in this case nothing at all gets pasted. I've also tried re-assigning the Embed Source object back to the clipboard's data object, so that the remaining data types are still there (but with mismatched content, probably), which results in an empty embedded document getting pasted.
Here's a sample of what I'm doing with Open XML:
using OpenMcdf;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
...
object dataObj = Forms.Clipboard.GetDataObject();
object embedSrcObj = dateObj.GetData("Embed Source");
if (embedSrcObj is Stream)
{
// read it with OpenMCDF
Stream stream = embedSrcObj as Stream;
CompoundFile cf = new CompoundFile(stream);
CFStream cfs = cf.RootStorage.GetStream("package");
byte[] bytes = cfs.GetData();
string savedDoc = Path.GetTempFileName() + ".docx";
File.WriteAllBytes(savedDoc, bytes);
// And then use the OpenXML SDK to read/edit the document:
using (WordprocessingDocument openDoc = WordprocessingDocument.Open(savedDoc, true))
{
OpenXmlElement body = openDoc.MainDocumentPart.RootElement.ChildElements[0];
foreach (OpenXmlElement ele in body.ChildElements)
{
if (ele is Paragraph)
{
Paragraph para = (Paragraph)ele;
if (para.ParagraphProperties != null && para.ParagraphProperties.ParagraphStyleId != null)
{
string styleName = para.ParagraphProperties.ParagraphStyleId.Val;
Run run = para.LastChild as Run; // I know I'm assuming things here but it's sufficient for a test case
run.RunProperties = new RunProperties();
run.RunProperties.AppendChild(new DocumentFormat.OpenXml.Wordprocessing.Text("test"));
}
}
// etc.
}
openDoc.MainDocumentPart.Document.Save(); // I think this is redundant in later versions than what I'm using
}
// repackage the document
bytes = File.ReadAllBytes(savedDoc);
cf.RootStorage.Delete("Package");
cfs = cf.RootStorage.AddStream("Package");
cfs.Append(bytes);
MemoryStream ms = new MemoryStream();
cf.Save(ms);
ms.Position = 0;
dataObj.SetData("Embed Source", ms);
// or,
// Clipboard.SetData("Embed Source", ms);
}
Question
What am I doing wrong? Is this just a bad/unworkable approach?

How to store and compare annotation (with Gold Standard) in GATE

I am very comfortable with UIMA, but my new work require me to use GATE
So, I started learning GATE. My question is regarding how to calculate performance of my tagging engines (java based).
With UIMA, I generally dump all my system annotation into a xmi file and, then using a Java code compare that with a human annotated (gold standard) annotations to calculate Precision/Recall and F-score.
But, I am still struggling to find something similar with GATE.
After going through Gate Annotation-Diff and other info on that page, I can feel there has to be an easy way to do it in JAVA. But, I am not able to figure out how to do it using JAVA. Thought to put this question here, someone might have already figured this out.
How to store system annotation into a xmi or any format file programmatically.
How to create one time gold standard data (i.e. human annotated data) for performance calculation.
Let me know if you need more specific or details.

This code seems helpful in writing the annotations to a xml file.
http://gate.ac.uk/wiki/code-repository/src/sheffield/examples/BatchProcessApp.java
String docXMLString = null;
// if we want to just write out specific annotation types, we must
// extract the annotations into a Set
if(annotTypesToWrite != null) {
// Create a temporary Set to hold the annotations we wish to write out
Set annotationsToWrite = new HashSet();
// we only extract annotations from the default (unnamed) AnnotationSet
// in this example
AnnotationSet defaultAnnots = doc.getAnnotations();
Iterator annotTypesIt = annotTypesToWrite.iterator();
while(annotTypesIt.hasNext()) {
// extract all the annotations of each requested type and add them to
// the temporary set
AnnotationSet annotsOfThisType =
defaultAnnots.get((String)annotTypesIt.next());
if(annotsOfThisType != null) {
annotationsToWrite.addAll(annotsOfThisType);
}
}
// create the XML string using these annotations
docXMLString = doc.toXml(annotationsToWrite);
}
// otherwise, just write out the whole document as GateXML
else {
docXMLString = doc.toXml();
}
// Release the document, as it is no longer needed
Factory.deleteResource(doc);
// output the XML to <inputFile>.out.xml
String outputFileName = docFile.getName() + ".out.xml";
File outputFile = new File(docFile.getParentFile(), outputFileName);
// Write output files using the same encoding as the original
FileOutputStream fos = new FileOutputStream(outputFile);
BufferedOutputStream bos = new BufferedOutputStream(fos);
OutputStreamWriter out;
if(encoding == null) {
out = new OutputStreamWriter(bos);
}
else {
out = new OutputStreamWriter(bos, encoding);
}
out.write(docXMLString);
out.close();
System.out.println("done");

Enterprise Library Fluent API and Rolling Log Files Not Rolling

I am using the Fluent API to handle various configuration options for Logging using EntLib.
I am building up the loggingConfiguration section manually in code. It seems to work great except that the RollingFlatFileTraceListener doesn't actually Roll the file. It will respect the size limit and cap the amount of data it writes to the file appropriately, but it doesn't not actually create a new file and continue the logs.
I've tested it with a sample app and the app.config and it seems to work. So I'm guess that I am missing something although every config option that seems like it needs is there.
Here is the basics of the code (with hard-coded values to show a config that doesn't seem to be working):
//Create the config builder for the Fluent API
var configBuilder = new ConfigurationSourceBuilder();
//Start building the logging config section
var logginConfigurationSection = new LoggingSettings("loggingConfiguration", true, "General");
logginConfigurationSection.RevertImpersonation = false;
var _rollingFileListener = new RollingFlatFileTraceListenerData("Rolling Flat File Trace Listener", "C:\\tracelog.log", "----------------------", "",
10, "MM/dd/yyyy", RollFileExistsBehavior.Increment,
RollInterval.Day, TraceOptions.None,
"Text Formatter", SourceLevels.All);
_rollingFileListener.MaxArchivedFiles = 2;
//Add trace listener to current config
logginConfigurationSection.TraceListeners.Add(_rollingFileListener);
//Configure the category source section of config for flat file
var _rollingFileCategorySource = new TraceSourceData("General", SourceLevels.All);
//Must be named exactly the same as the flat file trace listener above.
_rollingFileCategorySource.TraceListeners.Add(new TraceListenerReferenceData("Rolling Flat File Trace Listener"));
//Add category source information to current config
logginConfigurationSection.TraceSources.Add(_rollingFileCategorySource);
//Add the loggingConfiguration section to the config.
configBuilder.AddSection("loggingConfiguration", logginConfigurationSection);
//Required code to update the EntLib Configuration with settings set above.
var configSource = new DictionaryConfigurationSource();
configBuilder.UpdateConfigurationWithReplace(configSource);
//Set the Enterprise Library Container for the inner workings of EntLib to use when logging
EnterpriseLibraryContainer.Current = EnterpriseLibraryContainer.CreateDefaultContainer(configSource);
Any help would be appreciated!

Your timestamp pattern is wrong. It should be yyy-mm-dd instead of MM/dd/yyyy. The ‘/’ character is not supported.
Also, you could accomplish your objective by using the fluent configuration interface much easier. Here's how:
ConfigurationSourceBuilder formatBuilder = new ConfigurationSourceBuilder();
ConfigurationSourceBuilder builder = new ConfigurationSourceBuilder();
builder.ConfigureLogging().LogToCategoryNamed("General").
SendTo.
RollingFile("Rolling Flat File Trace Listener")
.CleanUpArchivedFilesWhenMoreThan(2).WhenRollFileExists(RollFileExistsBehavior.Increment)
.WithTraceOptions(TraceOptions.None)
.RollEvery(RollInterval.Minute)
.RollAfterSize(10)
.UseTimeStampPattern("yyyy-MM-dd")
.ToFile("C:\\logs\\Trace.log")
.FormatWith(new FormatterBuilder().TextFormatterNamed("textFormatter"));
var configSource = new DictionaryConfigurationSource();
builder.UpdateConfigurationWithReplace(configSource);
EnterpriseLibraryContainer.Current = EnterpriseLibraryContainer.CreateDefaultContainer(configSource);
var writer = EnterpriseLibraryContainer.Current.GetInstance<LogWriter>();
DateTime stopWritingTime = DateTime.Now.AddMinutes(10);
while (DateTime.Now < stopWritingTime)
{
writer.Write("test", "General");
}

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Apache Tika - Parsing and extracting only metadata without reading content - metadata

Related

Mirth - basic HL7 to HL7 transformation question

Word OpenXml Word Found Unreadable Content

How to edit pasted content using the Open XML SDK

How to store and compare annotation (with Gold Standard) in GATE

Enterprise Library Fluent API and Rolling Log Files Not Rolling

Categories

Resources