English stemming or lemmatization in Lucene.NET without SnowballAnalyzer or a custom analyzer - lucene.net

Is there a non-obsolete Lucene.NET Analyzer that can do English-language stemming or lemmatization, or do I need to write a custom Analyzer?
I can't seem to find an Analyzer that includes PorterStemFilter or EnglishMinimalStemFilter in the source code. I could write my own Analyzer, but it feels like that shouldn't be required, and I'd be reinventing the wrong wheel.
I'm trying to do stemming of English words in Lucene.NET. As far as I can tell, this does not work out of the box. I tried using the EnglishAnalyzer like so:
[TestFixture]
public class TestAnalyzers
{
    private const string FieldName = "CustomFieldName";

    public Directory CreateDirectory(IEnumerable<string> documents, Analyzer analyzer)
    {
        var directory = new RAMDirectory();
        var iwc = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
        {
            OpenMode = OpenMode.CREATE_OR_APPEND,
        };
        var writer = new IndexWriter(directory, iwc);
        writer.Commit();
        foreach (var doc in documents)
        {
            var document = new Document();
            document.AddTextField(FieldName, doc, Field.Store.YES);
            writer.AddDocument(document);
        }
        writer.Flush(true, true);
        writer.Commit();
        return directory;
    }

    private QueryParser CreateQueryParser(Analyzer analyzer)
        => new MultiFieldQueryParser(
            LuceneVersion.LUCENE_48,
            GetSearchFields(),
            analyzer);

    private string[] GetSearchFields() => new[] { FieldName };

    [TestCase("for", "for")]
    [TestCase("for", "forward")]
    [TestCase("forward", "for")]
    //[TestCase("retire", "retirement")]
    [TestCase("retirement", "retire")]
    [Test]
    public void TestPartialWordsStandard(string fieldValue, string query)
    {
        var analyzer = new EnglishAnalyzer(LuceneVersion.LUCENE_48);
        var directory = CreateDirectory(new[] { fieldValue }, analyzer);
        var indexReader = DirectoryReader.Open(directory);
        Assert.AreEqual(1, indexReader.NumDocs);
        var doc = indexReader.Document(0);
        Assert.NotNull(doc);
        Assert.AreEqual(fieldValue, doc.GetField(FieldName).GetStringValue());
        var searcher = new IndexSearcher(indexReader);
        var queryObj = CreateQueryParser(analyzer).Parse(query);
        var results = searcher.Search(queryObj, 2);
        Assert.AreEqual(1, results.TotalHits);
        doc = indexReader.Document(results.ScoreDocs.First().Doc);
        Assert.AreEqual(fieldValue, doc.GetField(FieldName).GetStringValue());
    }
}
It did no stemming. From reading the code, it uses a possessive filter to remove 's, but not the English stemming filter or the PorterStemFilter.
I was able to get some stemming to happen with var analyzer = new SnowballAnalyzer(LuceneVersion.LUCENE_48, "English");. It's an adequate amount of stemming, but the class is obsolete.
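For reference, if a custom Analyzer does turn out to be necessary, a minimal sketch might look like the following (assuming Lucene.NET 4.8; the StandardTokenizer + LowerCaseFilter + PorterStemFilter chain is illustrative, not a prescribed one):

public sealed class PorterAnalyzer : Analyzer
{
    private readonly LuceneVersion _version;

    public PorterAnalyzer(LuceneVersion version)
    {
        _version = version;
    }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // Tokenize, lowercase, then apply Porter stemming.
        var source = new StandardTokenizer(_version, reader);
        TokenStream result = new LowerCaseFilter(_version, source);
        result = new PorterStemFilter(result);
        return new TokenStreamComponents(source, result);
    }
}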

The Lucene.Net EnglishAnalyzer does include Porter stemming. Line 117 of the source code for the class is this line:
result = new PorterStemFilter(result);
I also ran a test in my system using the EnglishAnalyzer and confirmed that it is in fact stemming. For example, my indexed text contained the word "walking", and when I searched on "walked" I got a hit on the record.
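If you want to see the stemming directly, a quick token dump works; this is a minimal sketch assuming Lucene.NET 4.8 (EnglishAnalyzer from Lucene.Net.Analysis.En, ICharTermAttribute from Lucene.Net.Analysis.TokenAttributes):

var analyzer = new EnglishAnalyzer(LuceneVersion.LUCENE_48);
using (var stream = analyzer.GetTokenStream("field", "walking walked walks"))
{
    var termAtt = stream.GetAttribute<ICharTermAttribute>();
    stream.Reset();
    while (stream.IncrementToken())
        Console.WriteLine(termAtt.ToString()); // prints "walk" three times
    stream.End();
}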

Related

Concatenate multiple PDF/A with different conformance levels

Is it possible to concatenate a number of PDF/A files (with possibly different conformance levels: some PDF/A-1b, some PDF/A-3b, etc.) into a single PDF/A?
I was thinking that using the latest level (3a or 3b) would be OK, but I get errors when validating with VeraPDF.
Here is my code:
public static byte[] CreateConformantCopy(List<byte[]> sourcePdfs)
{
    var version = PdfVersion.PDF_1_7;
    var type = PdfAType.PDF_A_3B;
    WriterProperties wp = new WriterProperties();
    wp.UseSmartMode();
    wp.SetPdfVersion(version.ToPdfVersion());
    PdfOutputIntent oi = new PdfOutputIntent(
        "Custom", "", "http://www.color.org", "sRGB IEC61966-2.1",
        Assembly.GetExecutingAssembly().GetManifestResourceStream("xxx.Resources.sRGB_CS_profile.icm"));
    using (var mergedPdf = new MemoryStream())
    {
        var writer = new PdfWriter(mergedPdf, wp);
        using (PdfADocument newDoc = new PdfADocument(writer, type.ToPdfAConformanceLevel(), oi, new DocumentProperties() { }))
        {
            Document document = new Document(newDoc, PageSize.A4.Rotate());
            newDoc.SetTagged();
            newDoc.GetCatalog().SetLang(new PdfString(Thread.CurrentThread.CurrentUICulture.Name));
            newDoc.GetCatalog().SetViewerPreferences(
                new PdfViewerPreferences()
                    .SetDisplayDocTitle(true)
                    .SetCenterWindow(true)
            );
            PdfMerger merger = new PdfMerger(newDoc);
            for (int k = 0; k < sourcePdfs.Count; k++)
            {
                using (var inDoc = PdfHelper.GetDocument(sourcePdfs[k]))
                {
                    var numberOfPages = inDoc.GetNumberOfPages();
                    merger.Merge(inDoc, 1, numberOfPages);
                }
            }
            newDoc.Close();
        }
        return mergedPdf.ToArray();
    }
}
PDF/A-1 and PDF/A-2 have several differences in their requirements, so merging them together might not be possible. Looking at your validation errors, I think this is exactly the case. For example, the very first one is about XMP metadata: PDF/A-2 is stricter here, and you get this error because your first file (which is probably a valid PDF/A-1) does not actually satisfy the PDF/A-2 rules.
What is possible, however, is to attach a PDF/A-1 document to a PDF/A-2 one. This does not even require the use of PDF/A-3, which allows arbitrary attachments; the PDF/A-2 standard does allow attaching valid PDF/A-1 (as well as PDF/A-2) documents.
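For illustration, attaching rather than merging could look roughly like this with iText 7, where newDoc is the open PdfADocument from the code above and pdfA1Bytes and the display name are placeholders (a hedged sketch, not the only way to do it):

PdfFileSpec spec = PdfFileSpec.CreateEmbeddedFileSpec(
    newDoc,                  // the open PdfADocument
    pdfA1Bytes,              // byte[] contents of the valid PDF/A-1 file (placeholder)
    "PDF/A-1 attachment",    // description
    "attachment.pdf",        // display name (placeholder)
    PdfName.ApplicationPdf,  // MIME type
    null,                    // file parameters
    PdfName.Data);           // AFRelationship entry
newDoc.AddFileAttachment("attachment.pdf", spec);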

How to remove the extra page at the end of a Word document created during mail merge

I have written a piece of code to create a Word document by mail merge using Syncfusion (assembly Syncfusion.DocIO.Portable, Version=17.1200.0.50), Angular 7+ and .NET Core. Please see the code below.
private MemoryStream MergePaymentPlanInstalmentsScheduleToPdf(
    List<PaymentPlanInstalmentReportModel> PaymentPlanDetails, byte[] templateFileBytes)
{
    if (templateFileBytes == null || templateFileBytes.Length == 0)
    {
        return null;
    }
    var templateStream = new MemoryStream(templateFileBytes);
    var pdfStream = new MemoryStream();
    WordDocument mergeDocument = null;
    using (mergeDocument = new WordDocument(templateStream, FormatType.Docx))
    {
        if (mergeDocument != null)
        {
            var mergeList = new List<PaymentPlanInstalmentScheduleMailMergeModel>();
            var obj = new PaymentPlanInstalmentScheduleMailMergeModel();
            obj.Applicants = 0;
            if (PaymentPlanDetails != null && PaymentPlanDetails.Any())
            {
                var applicantCount = PaymentPlanDetails.GroupBy(a => a.StudentID)
                    .Select(s => new
                    {
                        StudentID = s.Key,
                        Count = s.Select(a => a.StudentID).Distinct().Count()
                    });
                obj.Applicants = applicantCount?.Count() > 0 ? applicantCount.Count() : 0;
            }
            mergeList.Add(obj);
            var reportDataSource = new MailMergeDataTable("Report", mergeList);
            var tableDataSource = new MailMergeDataTable("PaymentPlanDetails", PaymentPlanDetails);
            List<DictionaryEntry> commands = new List<DictionaryEntry>();
            commands.Add(new DictionaryEntry("Report", ""));
            commands.Add(new DictionaryEntry("PaymentPlanDetails", ""));
            MailMergeDataSet ds = new MailMergeDataSet();
            ds.Add(reportDataSource);
            ds.Add(tableDataSource);
            mergeDocument.MailMerge.ExecuteNestedGroup(ds, commands);
            mergeDocument.UpdateDocumentFields();
            using (var converter = new DocIORenderer())
            {
                using (var pdfDocument = converter.ConvertToPDF(mergeDocument))
                {
                    pdfDocument.Save(pdfStream);
                    pdfDocument.Close();
                }
            }
            mergeDocument.Close();
        }
    }
    return pdfStream;
}
Once the document is generated, I notice there is a blank page (with the footer) at the end. I searched for a solution on the internet over and over again, but was not able to find one. Following expert advice, I have done the initial checks, such as making sure that the initial Word template file has no page breaks, etc.
I am wondering if there is something I can do from my code to remove any extra page breaks or anything else that could cause this.
Any other suggested solutions, including MS Word document modifications, would also be appreciated.
Please refer to the documentation link below to remove the empty page at the end of a Word document using the Syncfusion Word library (Essential DocIO):
https://www.syncfusion.com/kb/10724/how-to-remove-empty-page-at-end-of-word-document
Please apply that code snippet before converting the Word document to PDF in your application.
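The gist of that article is to delete trailing empty paragraphs from the last section before converting to PDF. A rough sketch of the idea (DocIO member names are an assumption here; prefer the KB's exact snippet):

// Remove trailing empty paragraphs from the last section before conversion.
WTextBody body = mergeDocument.LastSection.Body;
while (body.ChildEntities.Count > 0)
{
    var paragraph = body.ChildEntities[body.ChildEntities.Count - 1] as WParagraph;
    // Stop at the first trailing entity that is not an empty paragraph.
    if (paragraph == null || paragraph.Items.Count > 0 || !string.IsNullOrEmpty(paragraph.Text))
        break;
    body.ChildEntities.Remove(paragraph);
}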
Note: I work for Syncfusion.

RavenDb Suggestions - return multi-word, context-based suggestions

I'm trying to get context-based suggestion results from RavenDb. The purpose is a UI dropdown with auto-suggestions over a large amount of data on the server; each keystroke (within 400 ms) sends a request to retrieve suggestions.
The suggestions I need are multi-word and context-based.
Let's say I'm looking for 'Harry Potter'. I have documents with just 'Harry', some docs with only 'Potter', and documents with both.
But if I type 'harre poter' I get single-word suggestions.
I tried searching with multiple words (demonstrated here - suggest.Term = "(word1 word2)";), but the result is a list of single words. I want to type 'harre poter' and get the suggestion 'Harry Potter'.
I even tried querying multiple times, once per word, but the results are not context-based; in other words, there is no connection between them.
var words = text.Split(new String[] { " " }, StringSplitOptions.RemoveEmptyEntries).ToList();
var suggestions = new List<SuggestionQuery>();
foreach (var word in words)
{
    var suggest = new SuggestionQuery();
    suggest.Field = "Body";
    suggest.Term = word;
    suggest.Popularity = true;
    suggest.MaxSuggestions = 5;
    suggest.Distance = StringDistanceTypes.Levenshtein;
    suggestions.Add(suggest);
}
var results = new List<SuggestionQueryResult>();
foreach (var suggest in suggestions)
{
    SuggestionQueryResult result =
        s.Query<Book, Books_ByBody>().Suggest(suggest);
    results.Add(result);
}
I looked at this SO question and tried that too, but the results are the documents, not suggestions.
My index is:
public class Books_ByBody : AbstractIndexCreationTask<Book>
{
    public Books_ByBody()
    {
        Map = books => from book in books
                       select new
                       {
                           book.Body,
                       };
        Indexes.Add(x => x.Body, FieldIndexing.Analyzed);
        Suggestion(x => x.Body);
    }
}

Testing With A Fake DbContext and Autofixture and Moq

So I followed this example on how to make a fake DbContext for testing. My test using just this works fine:
[Test]
public void CiudadIndex()
{
    var ciudades = new FakeDbSet<Ciudad>
    {
        new Ciudad {CiudadId = 1, EmpresaId = 1, Descripcion = "Santa Cruz", FechaProceso = DateTime.Now, MarcaBaja = null, UsuarioId = 1},
        new Ciudad {CiudadId = 2, EmpresaId = 1, Descripcion = "La Paz", FechaProceso = DateTime.Now, MarcaBaja = null, UsuarioId = 1},
        new Ciudad {CiudadId = 3, EmpresaId = 1, Descripcion = "Cochabamba", FechaProceso = DateTime.Now, MarcaBaja = null, UsuarioId = 1}
    };
    // Create mock unit of work
    var mockData = new Mock<IContext>();
    mockData.Setup(m => m.Ciudades).Returns(ciudades);
    // Setup controller
    var homeController = new CiudadController(mockData.Object);
    // Invoke
    var viewResult = homeController.Index();
    var ciudades_de_la_vista = (IEnumerable<Ciudad>)viewResult.Model;
    // Assert..
}
I am now trying to use AutoFixture-Moq to create "ciudades", but I can't. I tried this:
var fixture = new Fixture();
var ciudades = fixture.Build<FakeDbSet<Ciudad>>().CreateMany<FakeDbSet<Ciudad>>();
var mockData = new Mock<IContext>();
mockData.Setup(m => m.Ciudades).Returns(ciudades);
I get this error
Cannot convert 'System.Collections.Generic.IEnumerable<FakeDbSet<Ciudad>>' to 'System.Data.Entity.IDbSet<Ciudad>'
Implementation of IContext and FakeDbSet
public interface IContext
{
    IDbSet<Ciudad> Ciudades { get; }
}
public class FakeDbSet<T> : IDbSet<T> where T : class
How can I make this work?
A minor point... In stuff like:
var ciudades_fixture = fixture.Build<Ciudad>().CreateMany<Ciudad>();
The second type arg is unnecessary and should be:
var ciudades_fixture = fixture.Build<Ciudad>().CreateMany();
I don't really understand why you need a FakeDbSet, and the article is a bit TL;DR... In general, I try to avoid faking and mucking with ORM bits and instead deal with interfaces returning POCOs to the maximum degree possible.
That aside... The reason the normal syntax for initialising the list works is that FakeDbSet has an Add (and implements IEnumerable). AutoFixture doesn't have a story for that pattern directly (after all, it is compiler syntactic sugar, not particularly amenable to reflection or in line with any other conventions), but you can use AddManyTo as long as there is an ICollection in play. Luckily, within the implementation of FakeDbSet as in the article, the following gives us an in:
public ObservableCollection<T> Local
{
    get { return _data; }
}
As ObservableCollection<T> implements ICollection<T>, you should be able to:
var ciudades = new FakeDbSet<Ciudad>();
fixture.AddManyTo(ciudades.Local);
var mockData = new Mock<IContext>();
mockData.Setup(m => m.Ciudades).Returns(ciudades);
It's possible to wire up a customization to make this prettier, but at least you have a way to manage it. The other option is to have something implement ICollection (or add a prop with a setter taking IEnumerable<T>) and have AF generate the parent object, causing said collection to be filled in.
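For example, a customization along these lines should make the population automatic (a sketch against the article's FakeDbSet, untested):

var fixture = new Fixture();
// Whenever a FakeDbSet<Ciudad> is requested, fill its Local collection.
fixture.Customize<FakeDbSet<Ciudad>>(composer => composer
    .OmitAutoProperties()
    .Do(set => fixture.AddManyTo(set.Local)));

var ciudades = fixture.Create<FakeDbSet<Ciudad>>(); // arrives pre-populated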
Long superseded side note: In your initial question, you effectively have:
fixture.Build<FakeDbSet<Ciudad>>().CreateMany()
The problem becomes clearer then - you are asking AF to generate Many FakeDbSet<Ciudad>s, which is not what you want.
I haven't used AutoFixture in a while, but shouldn't it be:
var ciudades = new FakeDbSet<Ciudad>();
fixture.AddManyTo(ciudades);
For the moment I ended up doing this; I will keep reading about how to use AutoMoq, as I'm new to this:
var fixture = new Fixture();
var ciudades_fixture = fixture.Build<Ciudad>().CreateMany<Ciudad>();
var ciudades = new FakeDbSet<Ciudad>();
foreach (var item in ciudades_fixture)
{
    ciudades.Add(item);
}
var mockData = new Mock<IContext>();
fixture.Create<Mock<IContext>>();
mockData.Setup(r => r.Ciudades).Returns(ciudades);

Extending TokenStream

I am trying to index a document with a field containing a single term that has a payload.
Since the only constructor of Field that can work for me takes a TokenStream, I decided to inherit from that class and give the most basic implementation for what I need:
public class MyTokenStream : TokenStream
{
    TermAttribute termAtt;
    PayloadAttribute payloadAtt;
    bool moreTokens = true;

    public MyTokenStream()
    {
        termAtt = (TermAttribute)GetAttribute(typeof(TermAttribute));
        payloadAtt = (PayloadAttribute)GetAttribute(typeof(PayloadAttribute));
    }

    public override bool IncrementToken()
    {
        if (moreTokens)
        {
            termAtt.SetTermBuffer("my_val");
            payloadAtt.SetPayload(new Payload(/*byte[] data*/));
            moreTokens = false;
        }
        return false;
    }
}
The code which was used while indexing:
IndexWriter writer = //init index writer...
Document d = new Document();
d.Add(new Field("field_name", new MyTokenStream()));
writer.AddDocument(d);
writer.Commit();
And the code that was used during the search:
IndexSearcher searcher = //init index searcher
Query query = new TermQuery(new Term("field_name", "my_val"));
TopDocs result = searcher.Search(query, null, 10);
I used the debugger to verify that the call to IncrementToken() actually sets the TermBuffer.
My problem is that the returned TopDocs instance contains no documents, and I can't understand why... Actually I started with TermPositions (which gives me access to the Payload...), but it also gave me no results.
Can someone explain what I am doing wrong?
I am currently using Lucene.NET 2.9.2
After you set the TermBuffer you need to return true from IncrementToken; return false only when you have nothing left to feed the TermBuffer with.
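A minimal sketch of the corrected method (same Lucene.NET 2.9 API as in the question):

public override bool IncrementToken()
{
    if (!moreTokens)
        return false;                 // stream exhausted: nothing left to emit

    termAtt.SetTermBuffer("my_val");  // produce the single token
    payloadAtt.SetPayload(new Payload(/*byte[] data*/));
    moreTokens = false;
    return true;                      // a token was produced by this call
}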