Lucene.NET highlighter plugin highlighting strangely

I'm trying to add the Lucene.NET Highlighter to my search, but it's doing some really strange highlighting. What am I doing wrong?
Here's the highlighting code:
// stuff here to get scoreDocs
var content = doc.GetField("content").StringValue();
// content = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been"
var highlighter = new Highlighter(new StrongFormatter(), new HtmlEncoder(), new QueryScorer(query.Rewrite(indexSearcher.GetIndexReader())));
highlighter.SetTextFragmenter(new SimpleFragmenter(100));
var tokenStream = analyzer.TokenStream("content", new StringReader(content));
var bestFragment = highlighter.GetBestFragment(tokenStream, content);
Searching for "lorem" gives me this bestFragment value:
<strong>Lorem</strong> <strong>Ipsum</strong> is <strong>simply</strong> <strong>dummy</strong> <strong>text</strong> of the <strong>printing</strong> and <strong>typesetting</strong> <strong>industry</strong>. <strong>Lorem</strong> <strong>Ipsum</strong> <strong>has</strong> <strong>been</strong>
As you can see, it has highlighted much more than just "Lorem". Why?
How do I make this behave sensibly?
I'm using a StandardAnalyzer and my query looks like "content:lorem"
Edit: I'm using Lucene.NET 2.9.2

You haven't posted your implementation of StrongFormatter or HtmlEncoder, but I would guess that the error is in the former. It needs to check the score of the passed TokenGroup to decide whether any formatting is needed.
public class StrongFormatter : Formatter {
    public String HighlightTerm(String originalText, TokenGroup tokenGroup) {
        var score = tokenGroup.GetTotalScore();
        // A score of 0 means the token did not match the query: leave it untouched.
        if (score == 0)
            return originalText;
        return String.Concat("<strong>", originalText, "</strong>");
    }
}
However, you're not the first to want to wrap matches in an HTML element. You could just use the SimpleHTMLFormatter that comes with Highlighter.Net. While you're at it, there's also a SimpleHTMLEncoder, which probably does what your HtmlEncoder does.
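To make the score check concrete outside of Lucene.NET, here is a minimal JavaScript sketch of the same decision. The `pieces` array, its `score` field, and both function names are hypothetical stand-ins for what the Highlighter hands to a formatter; none of this is the library's API.

```javascript
// Hypothetical stand-in for Formatter.HighlightTerm: wrap only tokens whose
// score is greater than zero, i.e. tokens that matched the query.
function highlightTerm(originalText, totalScore) {
  if (totalScore <= 0) {
    return originalText; // non-matching token: return it unchanged
  }
  return "<strong>" + originalText + "</strong>";
}

// Assumed input shape: pieces of text, each with a score from some scorer.
function highlightAll(pieces) {
  return pieces.map((p) => highlightTerm(p.text, p.score)).join("");
}

const pieces = [
  { text: "Lorem", score: 1 },
  { text: " ", score: 0 },
  { text: "Ipsum", score: 0 },
  { text: " is simply dummy text", score: 0 },
];

const highlighted = highlightAll(pieces);
console.log(highlighted);
```

A formatter that skips the score check wraps every token it is given, which is consistent with the output shown in the question.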

Related

I need to extract text from a pdf file using itext7 or itextsharp and put html tag for bold around all the words using bold font

I am using iText7 and I want to extract all the text from a PDF, put the HTML tag for bold (<b>...</b>) around all the words that use bold fonts, and save the result in a text file. Any pointers? I am able to extract the text and to extract all the bold words independently, but I am not able to correlate the two.
Here is the code snippet I am using for extracting the text:
PdfDocument MyDocument = new PdfDocument(new PdfReader("C:\\MyTest.pdf"));
string MyText = PdfTextExtractor.GetTextFromPage(MyDocument.GetPage(1), new SimpleTextExtractionStrategy());
Here is the code I am using for extracting all the words using the bold font:
Rectangle MyRectangle = new Rectangle(0, 0, 50, 100);
CustomFontFilter fontFilter = new CustomFontFilter(MyRectangle);
FilteredEventListener listener = new FilteredEventListener();
LocationTextExtractionStrategy extractionStrategy =
    listener.AttachEventListener(new LocationTextExtractionStrategy(), fontFilter);
PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
parser.ProcessPageContent(MyDocument.GetPage(1));
String MyBoldTextList = extractionStrategy.GetResultantText();
//------
class CustomFontFilter : TextRegionEventFilter
{
    public CustomFontFilter(iText.Kernel.Geom.Rectangle filterRect) : base(filterRect) { }

    public override bool Accept(IEventData data, EventType type)
    {
        if (type == EventType.RENDER_TEXT)
        {
            TextRenderInfo renderInfo = (TextRenderInfo)data;
            PdfFont font = renderInfo.GetFont();
            if (font != null)
                return font.GetFontProgram().GetFontNames().GetFontName().Contains("Bold");
        }
        return false;
    }
}
The problem is that the PDF in question is a multi-column document. SimpleTextExtractionStrategy returns the text in perfect order, but LocationTextExtractionStrategy scrambles it by jumping from one column to the next on each line. I cannot find any way to get the list of bold words using SimpleTextExtractionStrategy, and the list I get from LocationTextExtractionStrategy is not in the right order, so I am unable to correlate the two.
So to summarize:
You want to extract all the text from a pdf and put the html tag for bold (<b>...</b>) around all the text that uses bold fonts.
Your PDFs allow normal text extraction (without those <b> tags) using the SimpleTextExtractionStrategy. The LocationTextExtractionStrategy on the other hand cannot be used as it messes up the order of the multi-column text.
Bold text in your PDFs can properly be recognized by your CustomFontFilter, i.e. by the condition font.GetFontProgram().GetFontNames().GetFontName().Contains("Bold").
Thus, one way to implement your task would be to extend the SimpleTextExtractionStrategy to check every chunk received using the CustomFontFilter condition and insert <b> tags where required.
For example like this:
public class BoldTaggingSimpleTextExtractionStrategy : SimpleTextExtractionStrategy
{
    FieldInfo textField = typeof(TextRenderInfo).GetField("text", BindingFlags.NonPublic | BindingFlags.Instance);
    bool currentlyBold = false;

    public override void EventOccurred(IEventData data, EventType type)
    {
        if (type.Equals(EventType.RENDER_TEXT))
        {
            TextRenderInfo renderInfo = (TextRenderInfo)data;
            string fontName = renderInfo.GetFont()?.GetFontProgram()?.GetFontNames()?.GetFontName();
            if (fontName != null && fontName.Contains("Bold"))
            {
                if (!currentlyBold)
                {
                    textField.SetValue(renderInfo, "<b>" + renderInfo.GetText());
                    currentlyBold = true;
                }
            }
            else if (currentlyBold)
            {
                AppendTextChunk("</b>");
                currentlyBold = false;
            }
        }
        base.EventOccurred(data, type);
    }
}
As you can see, I used reflection here. I did so because (A) TextRenderInfo does not allow the text to be set publicly and (B) AppendTextChunk must not be used before the first chunk is processed by base.EventOccurred: there, the size of a StringBuilder containing the collected text chunks is used to check whether the chunk currently being processed is the first one; if something is in that builder before at least one chunk has been processed, one gets a NullReferenceException. There are other workarounds for that, but reflection here costs just one more line of code.
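The tagging logic itself is a small state machine, and it may be easier to follow without the iText plumbing. The following JavaScript sketch works on a plain array of chunks; the `text` and `fontName` fields and the function name are hypothetical, and the final closing tag for a document that ends in bold is an extra step that the class above does not show.

```javascript
// Assumed input: text chunks in reading order, each with the font name the
// extractor reported. Both field names are hypothetical.
function tagBoldChunks(chunks) {
  let out = "";
  let currentlyBold = false;
  for (const chunk of chunks) {
    const isBold = typeof chunk.fontName === "string" && chunk.fontName.includes("Bold");
    if (isBold && !currentlyBold) {
      out += "<b>"; // entering a bold run
    } else if (!isBold && currentlyBold) {
      out += "</b>"; // leaving a bold run
    }
    currentlyBold = isBold;
    out += chunk.text;
  }
  if (currentlyBold) {
    out += "</b>"; // close a run that is still open at the end of the input
  }
  return out;
}

const tagged = tagBoldChunks([
  { text: "Intro ", fontName: "Helvetica" },
  { text: "Important", fontName: "Helvetica-Bold" },
  { text: " note", fontName: "Helvetica-Bold" },
  { text: " and the rest", fontName: "Helvetica" },
  { text: " END", fontName: "Helvetica-BoldOblique" },
]);
console.log(tagged);
```

Consecutive bold chunks share one pair of tags because a tag is emitted only when the bold state changes.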

What ending marks should be used to extend a range to the end of the paragraph?

I am coding a Word add-in and am not clear how to use the getNextTextRange(endingMarks, trimSpacing) method of the Range class.
Specifically I want to select a new Range starting from the currently selected range and going to the end of the paragraph.
The API documentation for the method states:
endingMarks string[]
Required. The punctuation marks and/or other ending marks as an array of strings.
That's clear enough if you want to select up to the next comma, period or even space. But what ending marks should you use for a paragraph, a line break, or the end of the document?
I have tried using '\n', '^p' and '¶' but none of these seem to work.
try {
    var nr = selection.getNextTextRange(['¶'], true);
    nr.load("isEmpty,text");
    await context.sync();
    console.log('nr=' + nr.text);
} catch (e) {
    console.log("error, soz");
    console.log(e);
}
Given a document consisting of one paragraph of text with a blank paragraph after it, and with the first word of the paragraph highlighted, this add-in throws a RichApi.Error: "We couldn't find the item you requested."
I would expect it to print out the remainder of the paragraph instead.
If I understand your scenario, you can work with the ParagraphCollection.getFirst() method. Please install the Script Lab tool. Open the sample called "Get paragraph from insertion point" for an example.
Let me expand on rick-kirkham's answer in case it helps anyone else in my situation. This is basically the same answer as the one given here: https://stackoverflow.com/a/51160690/4114053
OK, here is my sample Word document:
The rain in Spain falls. Mainly on the plain.
Alice stepped through the looking glass. What did she see?
And there endeth the lesson. Amen.
The user selects "stepped" in the second paragraph and I want to know what the text for the rest of the paragraph, from that word, says. I also want to know what the text up to that point says.
var doc = context.document;
var selection = doc.getSelection();
selection.load("isEmpty,text");
await context.sync();
console.log(selection.text); //prints stepped
var startRange = selection.getRange("start");
var endRange = selection.paragraphs.getLast().getRange("start");
var deltaRange = startRange.expandTo(endRange);
context.load(deltaRange);
await context.sync();
console.log(deltaRange.text); //prints "Alice"
startRange = selection.getRange("end");
endRange = selection.paragraphs.getLast().getRange("end");
deltaRange = startRange.expandTo(endRange);
context.load(deltaRange);
await context.sync();
console.log(deltaRange.text); // prints "through the looking glass. What did she see?"
My mistake was to get too caught up in trying to work out what "ending marks" might mean and how to use them to achieve this. (Although I still would like that spelled out in the API specification.)
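For readers without a Word host at hand, the effect of the two expandTo calls can be modelled on a plain string. This is only an illustration of what each range covers; no Office.js is involved, and the function name and the use of character offsets are assumptions made for the sketch.

```javascript
// Plain-string model: offsets are character positions of the selection
// inside the paragraph text.
function splitAroundSelection(paragraphText, selStart, selEnd) {
  return {
    before: paragraphText.slice(0, selStart).trim(), // paragraph start up to the selection
    selected: paragraphText.slice(selStart, selEnd),
    after: paragraphText.slice(selEnd).trim(), // selection end up to the paragraph end
  };
}

const paragraph = "Alice stepped through the looking glass. What did she see?";
const selStart = paragraph.indexOf("stepped");
const parts = splitAroundSelection(paragraph, selStart, selStart + "stepped".length);
console.log(parts.before);
console.log(parts.selected);
console.log(parts.after);
```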

How to replace text with a MS Word web add-in by preserving the formatting?

I'm working on a simple grammar correction web add-in for MS Word. Basically, I want to get the selected text, make minimal changes and update the document with the corrected text. Currently, if I use 'text' as the coercion type, I lose formatting. If there is a table or image in the selected text, they are also gone!
As I understand from the investigation I've been doing so far, OOXML is the way to go. But I couldn't find any useful example on the web. How can I manipulate text while preserving the original formatting data? How can I ignore non-text paragraphs? I want to be able to do this with the Office JavaScript API.
I would do something like this:
// get data as OOXML
Office.context.document.getSelectedDataAsync(Office.CoercionType.Ooxml, function (result) {
    if (result.status === "succeeded") {
        var selectionAsOOXML = result.value;
        // [\s\S] also matches line breaks, which "." does not
        var bodyContentAsOOXML = selectionAsOOXML.match(/<w:body[^>]*>([\s\S]*?)<\/w:body>/)[1];
        // perform manipulations to the body
        // this can be difficult to do on OOXML, but with some regexes it should be possible
        bodyContentAsOOXML = bodyContentAsOOXML.replace(/error/g, 'rorre'); // reverse the word 'error'
        // insert the body back into the OOXML; a replacer function is used so that
        // "$" sequences in the document text are not treated as replacement patterns
        selectionAsOOXML = selectionAsOOXML.replace(/(<w:body[^>]*>)[\s\S]*?<\/w:body>/, function (match, openTag) {
            return openTag + bodyContentAsOOXML + '</w:body>';
        });
        // replace the selected text with the new OOXML
        Office.context.document.setSelectedDataAsync(selectionAsOOXML, { coercionType: Office.CoercionType.Ooxml }, function (asyncResult) {
            if (asyncResult.status === "failed") {
                console.log("Action failed with error: " + asyncResult.error.message);
            }
        });
    }
});
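One risk with replacing across the whole body is that the pattern can also hit tag names or attribute values. A more conservative variant, sketched below under the assumption that the visible text lives in <w:t> elements, applies the change only to the character data of those elements. The function name and the sample markup are made up for illustration; a word that Word has split across several runs would not be matched, and the text seen by the callback is still XML-escaped.

```javascript
// Apply `transform` only to the character data of <w:t> elements and leave
// every other part of the OOXML string untouched.
function replaceInTextRuns(ooxml, transform) {
  // Matches <w:t> or <w:t ...attributes...>, but not <w:tab/>, <w:tbl> or <w:tc>.
  const textRun = /(<w:t(?:\s[^>]*)?>)([^<]*)(<\/w:t>)/g;
  // A replacer function keeps "$" in the document text literal.
  return ooxml.replace(textRun, (match, open, text, close) => open + transform(text) + close);
}

const sample =
  '<w:body><w:p><w:r><w:rPr><w:b/></w:rPr>' +
  '<w:t xml:space="preserve">An error </w:t></w:r>' +
  '<w:r><w:tab/><w:t>costs $10</w:t></w:r></w:p></w:body>';

const corrected = replaceInTextRuns(sample, (text) => text.replace(/error/g, "rorre"));
console.log(corrected);
```

The bold run property (<w:b/>) and the tab element pass through unchanged, so the formatting of the selection is kept.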

Chapters in iText 7

I'm looking to create a pdf file with chapters and sub chapters with iText 7. I've found examples for previous versions of iText using the Chapter class. However this class does not seem to be included in iText 7.
How is that functionality implemented in iText7?
The Chapter and Section classes in iText 5 were problematic. Even with iText 5, we advised people to use PdfOutline instead.
For an example on how to create chapters, and more specifically, the corresponding outlines in the bookmarks panel, please take a look at the iText 7: Building Blocks tutorial. This tutorial has a recurring theme: the novel "The Strange Case of Dr. Jekyll and Mr. Hyde."
We use that text and a database with movies based on this novel to explain how iText 7 works. If you don't have the time to read it, please jump to Chapter 6.
In this chapter, we create a document whose chapter titles are also listed as outlines in the bookmarks panel. The full sample code is available as TOC_OutlinesDestinations; its main loop is:
BufferedReader br = new BufferedReader(new FileReader(SRC));
String name, line;
Paragraph p;
boolean title = true;
int counter = 0;
PdfOutline outline = null;
while ((line = br.readLine()) != null) {
    p = new Paragraph(line);
    p.setKeepTogether(true);
    if (title) {
        name = String.format("title%02d", counter++);
        outline = createOutline(outline, pdf, line, name);
        p.setFont(bold).setFontSize(12)
            .setKeepWithNext(true)
            .setDestination(name);
        title = false;
        document.add(p);
    }
    else {
        p.setFirstLineIndent(36);
        if (line.isEmpty()) {
            p.setMarginBottom(12);
            title = true;
        }
        else {
            p.setMarginBottom(0);
        }
        document.add(p);
    }
}
In this example, we loop over a text file that contains titles and chapters. Every time we encounter a title, we create a name (title00, title01, and so on, since the counter starts at 0), and we use this name as a named destination for the title paragraph: setDestination(name).
We create the outlines using the PdfOutline object for which we define a named destination like this: PdfDestination.makeDestination(new PdfString(name))
public PdfOutline createOutline(PdfOutline outline, PdfDocument pdf, String title, String name) {
    if (outline == null) {
        outline = pdf.getOutlines(false);
        outline = outline.addOutline(title);
        outline.addDestination(PdfDestination.makeDestination(new PdfString(name)));
        return outline;
    }
    PdfOutline kid = outline.addOutline(title);
    kid.addDestination(PdfDestination.makeDestination(new PdfString(name)));
    return outline;
}
There are other ways to achieve this result, but using named destinations is the simplest. If you try the example, you'll discover that most of its complexity comes from turning a simple text file into a document with chapter titles and chapter content.
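Stripped of the PDF calls, the title detection and naming in the loop reduce to a few lines. The JavaScript sketch below mirrors that logic on an array of lines; the function name and the shape of the returned entries are hypothetical, and no iText API is involved.

```javascript
// Assumed input: lines of a text file where the first line, and every line
// that follows an empty line, is a chapter title.
function buildOutline(lines) {
  const entries = [];
  let expectTitle = true;
  let counter = 0;
  for (const line of lines) {
    if (expectTitle) {
      // Same naming scheme as String.format("title%02d", counter++)
      const name = "title" + String(counter++).padStart(2, "0");
      entries.push({ title: line, destination: name });
      expectTitle = false;
    } else if (line === "") {
      expectTitle = true; // an empty line ends the chapter body
    }
  }
  return entries;
}

const outlineEntries = buildOutline([
  "Story of the Door",
  "Mr. Utterson the lawyer was a man of a rugged countenance.",
  "",
  "Search for Mr. Hyde",
  "That evening Mr. Utterson came home to his bachelor house.",
]);
console.log(outlineEntries);
```

Each entry pairs a bookmark title with the destination name that the title paragraph would carry.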

How to reject numeric values in Lucene.net?

I want to know whether it is possible to reject numeric phrases or numeric values while indexing or searching in Lucene.net.
For example (this is one line),
Hi all my no is 4756396
Now, when I index or search, the numeric value 4756396 should be rejected. I tried making a custom stop word list with 1, 2, 3, 4, 5, 6, etc., but I guess that will only ignore a number that appears on its own as one of those single digits.
You can copy the StandardAnalyzer and customize the grammar (simple JFlex stuff) to reject number tokens. If you do that, you'll need to port the generated analyzer back to C#, since JFlex generates Java code, though you could give it a try with C# Flex.
You could also write a TokenFilter that scans tokens one by one and rejects them if they are numbers. If you want to filter only whole numbers and still retain numbers that are, for example, separated by hyphens, the filter could simply attempt a double.TryParse() and accept the token if parsing fails. A more robust and customizable solution would still use a lexical parser.
Edit:
Here's a quick sample of what I mean, with a little Main method that shows how to use it. Here I used TryParse() to filter out tokens; for a more complex production system I'd use a lexical parser (take a look at C# Flex for that).
public class NumericFilter : TokenFilter
{
    private ITermAttribute termAtt;

    public NumericFilter(TokenStream tokStream)
        : base(tokStream)
    {
        termAtt = AddAttribute<ITermAttribute>();
    }

    public override bool IncrementToken()
    {
        while (base.input.IncrementToken())
        {
            string term = termAtt.Term;
            double res;
            if (double.TryParse(term, out res))
            {
                // skip this token
                continue;
            }
            // accept this token
            return true;
        }
        // no more tokens in the stream
        return false;
    }
}
static void Main(string[] args)
{
    RAMDirectory dir = new RAMDirectory();
    IndexWriter iw = new IndexWriter(dir, new KeywordAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
    Document d = new Document();
    Field f = new Field("text", "", Field.Store.YES, Field.Index.ANALYZED);
    d.Add(f);
    // use our Filter here
    f.SetTokenStream(new NumericFilter(new LowerCaseFilter(new WhitespaceTokenizer(new StringReader("I have 300 dollars")))));
    iw.AddDocument(d);
    iw.Commit();
    IndexReader reader = iw.GetReader();
    // print all terms in the text field
    TermEnum terms = reader.Terms(new Term("text", ""));
    do
    {
        Console.WriteLine(terms.Term.Text);
    }
    while (terms.Next());
    reader.Dispose();
    iw.Dispose();
    Console.ReadLine();
    Environment.Exit(42);
}
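The filtering rule can also be tried without an index. The JavaScript sketch below applies the same idea to whitespace-separated, lower-cased tokens; the function name is hypothetical, and the regular expression is only a rough approximation of double.TryParse, which is culture-sensitive and accepts more forms than this pattern.

```javascript
// Rough approximation of double.TryParse for plain decimal notation.
const NUMERIC = /^[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?$/;

// Lower-case the text, split it on whitespace, and drop every token that
// parses as a number in its entirety.
function rejectNumericTokens(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => token.length > 0 && !NUMERIC.test(token));
}

const fromQuestion = rejectNumericTokens("Hi all my no is 4756396");
const withHyphen = rejectNumericTokens("Call 555-1234 or pay 300 dollars");
console.log(fromQuestion);
console.log(withHyphen);
```

A token such as 555-1234 is kept because it does not parse as a single number, which matches the behaviour described for hyphen-separated numbers.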