Split TextChunk into words - itext

I've found this example, which splits a PDF document into TextChunks.
Is there either
a) a method to split each TextChunk further into words/characters and still be able to find its location?
or
b) a method to parse a PDF into words/characters instead of chunks and find their locations?

Is there a method to split each TextChunk further into words/characters and still be able to find its location?
You cannot split these TextChunk objects further, because the TextChunk class is merely a helper class that carries very little information; cf. its constructor arguments String str, Vector startLocation, Vector endLocation, float charSpaceWidth. In particular, it holds neither the individual character widths nor the text size and font from which those widths could be derived.
But you can of course change the method RenderText (in which the incoming more complete TextRenderInfo instances are reduced to TextChunk instances):
public virtual void RenderText(TextRenderInfo renderInfo) {
    LineSegment segment = renderInfo.GetBaseline();
    TextChunk location = new TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
    locationalResult.Add(location);
}
In particular you can first split the TextRenderInfo instance using its GetCharacterRenderInfos() method into single character TextRenderInfo instances, loop through these and create individual TextChunk instances for each of them.
You probably don't see that method in the repository where you are looking as iTextSharp has already switched to the new SourceForge versioning infrastructure. Thus, you should switch to the current iTextSharp repository.
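Once there is one chunk per character, getting words is a matter of merging consecutive non-space characters and keeping the outer coordinates. The following is a minimal Python sketch of that grouping step only, not iTextSharp code: CharBox, split_into_words and _merge are hypothetical names, positions are reduced to x-coordinates on a horizontal baseline, and it assumes the text contains explicit space characters (many PDFs only imply spaces through gaps, which would need a distance check instead).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CharBox:
    """Hypothetical stand-in for a single-character chunk."""
    text: str
    start_x: float
    end_x: float


def _merge(boxes: List[CharBox]) -> Tuple[str, float, float]:
    # A word spans from the start of its first character
    # to the end of its last character.
    return ("".join(b.text for b in boxes), boxes[0].start_x, boxes[-1].end_x)


def split_into_words(chars: List[CharBox]) -> List[Tuple[str, float, float]]:
    """Group per-character boxes into (word, start_x, end_x) tuples."""
    words = []
    current: List[CharBox] = []
    for ch in chars:
        if ch.text.isspace():
            # A space closes the word collected so far.
            if current:
                words.append(_merge(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append(_merge(current))
    return words


chars = [
    CharBox("H", 0, 5), CharBox("i", 5, 7), CharBox(" ", 7, 10),
    CharBox("y", 10, 15), CharBox("o", 15, 20),
]
words = split_into_words(chars)
```

In a real extraction strategy the same loop would run over the single-character TextRenderInfo instances, with the baseline start and end points taking the place of start_x and end_x.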
Is there a method to parse a PDF into words/characters instead of chunks and find the location?
Of course you can implement IRenderListener to create an extraction strategy which does exactly what you need. You can find some discussions of that topic on Stack Overflow for iText and iTextSharp, e.g. "ITextSharp Find coordinates of specific text in PDF", "Get the exact Stringposition in PDF", "Retrieve the respective coordinates of all words on the page with itextsharp" and others.

Is it possible to move the location pointer up or down with MS Word Javascript API

I am currently working on a plugin which will take voice commands and, upon receiving a response from the server, act accordingly. For example, if I say "delete word", the last word will be deleted. I want to be able to move the pointer left/right/up one line/down one line. Does the Word JavaScript API provide a way to achieve this?
There are no cursor movement APIs. But there are methods on the Paragraph object for getting the previous paragraph and the next paragraph. There are also ways to move among Ranges, if you can get a collection or an array of the ranges that you want to move around. And you can find the AdjacentAfter and AdjacentBefore ranges using the Range.compareLocationWith method. The Range object also has getNextTextRange and getNextTextRangeOrNullObject methods. Finally, Range.select("Start") and Range.select("End") will put the cursor just before/after the currently selected range.

How to read info on voltage/beam energy, imaging mode, acquisition date/timestamp, etc. from image meta-data? (Tags)

DM scripting beginner here, almost no programming skills.
I would like to know the commands to access all the metadata of DM images/spectra.
I realized that all my STEM images at 80 kV taken between 2 dates (let's say 02.11.2017-05.04.2019) have the scale calibration wrong by the same factor (scale of all such images needs to be multiplied by 1.21).
I would like to write a script which multiplies the scale value by a factor only for images in scanning mode at 80 kV taken during a period for all images in a folder with subfolders or for all images opened in DM and save the new scale value.
I checked this website http://digitalmicrograph-scripting.tavernmaker.de/other%20resources/Old-DMHelp/AllFunctions.html but only found how to call the scale value (ImageGetDimensionCalibration). I have a general idea how to write the script based on other scripts if I find out how to call the metadata.
If anyone can write the whole script for me I would greatly appreciate your effort.
All general meta-data is organized in the image tag-structure
You can see this if you open the Image Display Info of an image (via the menu, or by pressing CTRL + D) and then browse to the "Tags" section.
All entries shown on the right there are image tags, and they are organized in a hierarchical tree.
What this tree looks like, and what information is written where, is totally open and will depend on which GMS version you are using, how the hardware is configured, etc. Custom scripts might also alter this information.
So for a scripting start, open the data you want to modify and have a look in this tree.
Hint: The following mini-script can be useful. It opens a tag-browsing window for the front-most image, but as a modeless dialog (i.e. you can keep it open and interact with other parts):
GetFrontImage().ImageGetTagGroup().TagGroupOpenBrowserWindow(0)
The information you need to check against is most probably found in the Microscope Info sub-tree. Here, usually all information gathered from the microscope during acquisition is stored. What is there, will depend on your system and how it is set up.
The information on the STEM image acquisition, as far as the scanning engine and detector are concerned, is most probably in the DigiScan sub-tree.
The Data Bar sub-tree usually contains date and time of creation etc.
Calibration values are not stored in the image tag-structure
What you will not find in this tag-structure is the image calibration, i.e. the values actually used by DM to display calibrated values. These values are "one level up", so to speak.
This is important for your script, because you will need different commands for the "meta-data" from the tags and for the "calibration" you want to change.
Accessing meta-data by script
The script commands you need to read from the tags are all described in the F1 help documentation.
Essentially, you need a command to get the "root" TagGroup of an image, which is ImageGetTagGroup(), and then you traverse within this tree.
This might seem confusing, because there are a lot of slightly different commands for the different types of stored tags, but the essential bits are easy:
All "paths" through the tree are just the individual names (typed exactly).
Each "branch" is separated by a single colon (:).
The commands to set/get a tag value all require as input the "root" TagGroup object and the "path" as a string. The get commands require a variable of matching type to store the value in; the set commands need the value which should be written.
The get commands themselves return true or false depending on whether or not the tag path could be found and the value could be read.
So the following script would read the "Imaging Mode" from the tags of the image shown as example above:
string mode
GetFrontImage().ImageGetTagGroup().TagGroupGetTagAsString( "Microscope Info:Imaging Mode", mode )
OKDialog( "Mode: " + mode )
and in a little more verbose form:
string mode     // variable to hold the value
image img       // variable for the image
string path     // variable/constant to specify the where
TagGroup tg     // variable to hold the "tagGroup" object
img := GetFrontImage()                  // Use the selected image
tg = img.ImageGetTagGroup()             // From the image get the tags (root)
path = "Microscope Info:Imaging Mode"   // specify the path
if ( tg.TagGroupGetTagAsString( path, mode ) )
    OKDialog( "Mode: " + mode )
else
    Throw( "Tag not found" )
If the tag is a number rather than a string, you will need the corresponding command, e.g. TagGroupGetTagAsNumber().
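The path convention described above (exact names, joined by single colons, with the get call reporting whether the lookup succeeded) can be modeled in a few lines. This is a Python sketch for illustration, not DM script: nested dictionaries stand in for the tag tree, get_tag is a hypothetical helper, and the tag value is made up.

```python
from typing import Any, Tuple


def get_tag(root: dict, path: str) -> Tuple[bool, Any]:
    """Walk a colon-separated path through nested dicts.

    Returns (True, value) if every name along the path exists,
    otherwise (False, None), similar in spirit to the true/false
    result of the get commands.
    """
    node: Any = root
    for name in path.split(":"):
        if not isinstance(node, dict) or name not in node:
            return False, None
        node = node[name]
    return True, node


# Made-up tag tree with a single branch and leaf.
tags = {"Microscope Info": {"Imaging Mode": "STEM"}}

found, mode = get_tag(tags, "Microscope Info:Imaging Mode")
missing, _ = get_tag(tags, "Microscope Info:No Such Tag")
```

Names must match exactly, including spaces and capitalization, which is why browsing the real tag tree first is the safest way to find the right path.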

Write a struct into a DICOM header

I created a private DICOM tag and I would like to know if it is possible to use this tag to store a struct in a DICOM file using dicomwrite (or alike), instead of creating a field inside the DICOM header for each struct field.
(Something like saving a Patient's name, but instead of using a char data, I would use double)
Here is an example:
headerdicom = dicominfo('Test.dcm');
a.a = 1; a.b = 2; a.c = 3;
headerdicom.Private_0011_10xx_Creator = a;
img = dicomread('Test.dcm');
dicomwrite(img, 'test_modif.dcm', 'ObjectType', 'MR Image Storage', 'WritePrivate', true, headerdicom)
Undefined function 'fieldnames' for input arguments of type 'double'.
Thank you all in advance,
Depending on what "struct" means, here are your options. As you want to use a private tag which means no application but yours will be able to interpret it, you can choose the solution which is technically most appropriate. Basically your question is "which Value Representation should I assign to my private attribute using the DICOM toolkit of my choice?":
Sequence:
There is a DICOM Value Representation "Sequence" (VR=SQ) which allows you to store a list of attributes of different types. This VR is closest to a struct. A sequence can contain an arbitrary number of items, each of which has the same attributes in the same order. Each attribute can have its own VR, so if your struct contains different data types (like string, integer, float), this would be my recommendation.
Multi-value attribute:
DICOM supports the concept of "Value Multiplicity". This means that a single attribute can contain multiple values; in text-based VRs they are separated by backslashes, and multi-valued attributes are commonly written in that notation. As the VR is a property of the attribute, all values must have the same type. If I understand you correctly, you have a list of floating point numbers, which could be encoded as an array of doubles in one field with VR=FD (Floating Point Double): 0.001\0.003\1.234...
Most toolkits support an indexed access to the attributes.
"Blob":
You can use an attribute with VR=OB (Other Byte) which is also used for encoding pixel data. It can contain up to 4 GB of binary data. The length of the attribute tells you of how many bytes the attribute's value consists. If you just want to copy the memory from / to the struct, this would be the way to go, but obviously it is the weakest approach in terms of type-safety and correctness of encoding. You are going to lose built in methods of your DICOM toolkit that ensure these properties.
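To make the multi-value option concrete for the struct from the question (a = 1, b = 2, c = 3), the fields can be written as a list of doubles in a fixed order. The sketch below is plain Python, not a DICOM toolkit call: pack_fd and unpack_fd are hypothetical helpers, the field order is a private convention you would have to document yourself, and it assumes the values are stored back to back as 8-byte little-endian IEEE 754 doubles, as in the little-endian transfer syntaxes.

```python
import struct
from typing import List


def pack_fd(values: List[float]) -> bytes:
    """Concatenate the values as little-endian 8-byte doubles."""
    return struct.pack("<%dd" % len(values), *values)


def unpack_fd(raw: bytes) -> List[float]:
    """Inverse of pack_fd: split the bytes into 8-byte doubles."""
    count = len(raw) // 8
    return list(struct.unpack("<%dd" % count, raw))


record = {"a": 1.0, "b": 2.0, "c": 3.0}
field_order = ["a", "b", "c"]  # private convention, not part of DICOM

raw = pack_fd([record[name] for name in field_order])
restored = dict(zip(field_order, unpack_fd(raw)))
```

The field names themselves are not stored, which is the trade-off against a sequence: the reader has to know the order in advance.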
To add a private attribute, you have to
reserve a range for the attribute by specifying an odd group number and a prefix (2 hex digits) for the element numbers. E.g. group = 0x0011 with element prefix 0x10 reserves the range from (0x0011, 0x1000) to (0x0011, 0x10ff). This is done by specifying a Private Creator DICOM tag which holds a manufacturer name. So I suspect that instead of
headerdicom.Private_0011_10xx_Creator = a;
it should read e.g.
headerdicom.Private_0011_10xx_Creator = 'Gabs';
register your private tags in the private dictionary, most of the time by specifying the Private Creator, group, element and VR (one of the options above)
Not sure how this can be done in MATLAB.
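The reservation rule from the first step can be expressed as a small calculation: the low byte of the Private Creator element becomes the high byte of every element in the reserved block. This is a Python sketch with a hypothetical helper name (reserved_block); it only computes tag numbers and does not touch any DICOM file.

```python
from typing import Tuple

Tag = Tuple[int, int]


def reserved_block(group: int, creator_element: int) -> Tuple[Tag, Tag]:
    """Return the first and last tag reserved by a Private Creator element.

    A creator at (gggg,00xx) reserves (gggg,xx00) through (gggg,xxFF).
    """
    if group % 2 == 0:
        raise ValueError("private groups must have an odd group number")
    if not 0x0010 <= creator_element <= 0x00FF:
        raise ValueError("private creator elements lie in 0x0010-0x00FF")
    first = creator_element << 8
    return (group, first), (group, first | 0xFF)


first_tag, last_tag = reserved_block(0x0011, 0x0010)
print("(%04X,%04X) - (%04X,%04X)" % (first_tag + last_tag))
```

For the creator at (0011,0010) this gives the block from (0011,1000) to (0011,10FF), which is what the 10xx in the MATLAB field name stands for.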

Text classification using Weka

I'm a beginner with Weka and I'm trying to use it for text classification. I have seen how to use the StringToWordVector filter for classification. My question is, is there any way to add more features to the text I'm classifying? For example, if I wanted to add POS tags and named entity tags to the text, how would I use these features in a classifier?
It depends on the format of your dataset and the preprocessing steps you perform. For instance, let us suppose that you have pre-POS-tagged your texts, so that they look like:
The_det dog_n barks_v ._p
So you can build a specific tokenizer (see weka.core.tokenizers) to generate two tokens per word: one would be "The" and the other one would be "The_det", so you keep the tag information.
If you want only tagged words, then you can just ensure that "_" is not a delimiter in the weka.core.tokenizers.WordTokenizer.
My advice is to have both the words and the tagged words, so a simpler way would be to write a script that joins the texts and the tagged texts. From a file containing "The dog barks ." and another one containing "The_det dog_n barks_v ._p", it would generate a file with "The The_det dog dog_n barks barks_v . ._p". You may even forget about the order unless you are going to make use of n-grams.
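Such a joining script can be very short. The Python sketch below works from the tagged text alone, since each plain word can be recovered by cutting off the tag; interleave_tags is a hypothetical name, and it assumes the tag is whatever follows the last "_" in a token.

```python
def interleave_tags(tagged_text: str, sep: str = "_") -> str:
    """Emit every word twice: once plain and once with its tag."""
    out = []
    for token in tagged_text.split():
        # Split on the last separator so words containing "_" keep it.
        word, _, _tag = token.rpartition(sep)
        if word:
            out.extend([word, token])
        else:
            # No usable tag suffix: keep the token as it is.
            out.append(token)
    return " ".join(out)


joined = interleave_tags("The_det dog_n barks_v ._p")
```

The result can then go through StringToWordVector, as long as "_" is not among the tokenizer's delimiters.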

Incorrect partition detected by PartitionScanner in custom TextEditor for eclipse

I have a PartitionScanner that extends RuleBasedPartitionScanner in my custom text editor plugin for Eclipse. I am having issues with the partition scanner detecting character sequences within larger strings, resulting in the document being partitioned incorrectly. For example, within the constructor of my partition scanner I have the following rule set up:
public MyPartitionScanner() {
    ...
    rules.add(new MultiLineRule("SET", "ENDSET", mytoken));
    ...
}
However, if I happen to use a token that contains the character sequence "SET", the partition scanner seems to continue searching for the end sequence ("ENDSET") and marks the rest of the document as a single partition of type "mytoken".
var myRESULTSET34 = ...
Is there a way to make the partition scanner ignore the "SET" inside the token above, and only recognize the whole word "SET"?
Thank you.
Using MultiLineRule as is, you won't be able to differentiate. But you can create your own subclass that overrides sequenceDetected and does a lookbehind/lookahead when the super implementation returns true, to make sure that the sequence is preceded/followed by EOF or whitespace. If it isn't, push the characters back onto the scanner and return false.
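The boundary check itself is independent of the Eclipse API. The following Python sketch shows only that logic on a plain string; find_whole_word is a hypothetical helper, and in the actual subclass the same two tests would be done by reading and unreading characters on the scanner.

```python
def find_whole_word(text: str, word: str, start: int = 0) -> int:
    """Return the index of the first occurrence of word that is bounded
    by whitespace or by the start/end of the text, or -1 if none."""
    pos = text.find(word, start)
    while pos != -1:
        end = pos + len(word)
        # Lookbehind: start of text or whitespace before the match.
        before_ok = pos == 0 or text[pos - 1].isspace()
        # Lookahead: end of text or whitespace after the match.
        after_ok = end == len(text) or text[end].isspace()
        if before_ok and after_ok:
            return pos
        pos = text.find(word, pos + 1)
    return -1


inside_token = find_whole_word("var myRESULTSET34 = ...", "SET")
standalone = find_whole_word("x SET a ENDSET", "SET")
```

With this check, the "SET" inside myRESULTSET34 is rejected, while a standalone SET still opens a partition.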
