Lucene.Net performance issue after upgrading from 3.0.3 to 4.8 - lucene.net

After migrating from 3.0.3 to 4.8, indexing new documents is slower than with 3.0.3, although the index files are much smaller.
Here is my code:
private IndexReader reader;
private IndexSearcher searcher;

var writeconfig = new IndexWriterConfig(Lucene.Net.Util.LuceneVersion.LUCENE_48, analyzer);
writer = new IndexWriter(_directory, writeconfig);

foreach (var member in list_of_members)
{
    new_(writer, member.name, member.surname, member.location);
}

writer.Dispose();

reader = DirectoryReader.Open(index_location);
searcher = new IndexSearcher(reader);

public void new_(Lucene.Net.Index.IndexWriter writer, string name, string surname, string location)
{
    Document doc = new Document();
    doc.Add(new StringField("name", name, Field.Store.YES));
    doc.Add(new TextField("surname", surname, Field.Store.YES));
    doc.Add(new StringField("location", location, Field.Store.YES));
    writer.AddDocument(doc);
}
Compared with 3.0.3, indexing a new document is almost 2x slower in 4.8.
Edit 1: I found out the performance problem is caused by stored field compression; I found this website about the performance of stored field compression: field compression.
On that site they explain how to disable compression in Java, but I couldn't convert the code to C#.
So my question is: how can I disable stored field compression with Lucene.NET 4.8?

This does look like a compression issue: since version 4.1, stored fields are compressed by default, and in this case the compression penalty is too high.
Add a no-compression codec:
public class NoCompressionCodec : FilterCodec
{
    internal NoCompressionCodec(Codec @delegate) : base(@delegate)
    {
    }

    public override StoredFieldsFormat StoredFieldsFormat => new Lucene40StoredFieldsFormat();
}
Override the default codec factory:
public class CustomCodecFactory : DefaultCodecFactory
{
    private readonly NoCompressionCodec _noCompressionCodec;

    public CustomCodecFactory()
    {
        _noCompressionCodec = new NoCompressionCodec(Codec.Default);
    }

    protected override void Initialize()
    {
        PutCodecType(typeof(NoCompressionCodec));
        base.Initialize();
    }

    protected override Codec GetCodec(Type type)
    {
        if (type == typeof(NoCompressionCodec))
            return _noCompressionCodec;
        return base.GetCodec(type);
    }
}
And run this at startup:
Codec.SetCodecFactory(new CustomCodecFactory());
Then on your index writer config, set the codec:
indexWriterConfig.Codec = new NoCompressionCodec(Codec.Default);
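To put it together with the indexing code from the question, the wiring might look roughly like this (a minimal sketch that assumes the NoCompressionCodec and CustomCodecFactory classes above, plus the analyzer, _directory, list_of_members and new_ pieces from the question):
// Register the custom codec factory once, before any index is opened.
Codec.SetCodecFactory(new CustomCodecFactory());

var writeconfig = new IndexWriterConfig(Lucene.Net.Util.LuceneVersion.LUCENE_48, analyzer)
{
    // Write stored fields without compression.
    Codec = new NoCompressionCodec(Codec.Default)
};

using (var writer = new IndexWriter(_directory, writeconfig))
{
    foreach (var member in list_of_members)
    {
        new_(writer, member.name, member.surname, member.location);
    }
}
Note that the codec only applies to newly written segments; existing segments keep whatever codec they were written with, so you need to re-index to see the full effect.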

Related

Is there a high-performance way to replace the BinaryFormatter in .NET 5?

Before .NET 5 we serialized/deserialized bytes/objects with this code:
private static byte[] StructToBytes<T>(T t)
{
    using (var ms = new MemoryStream())
    {
        var bf = new BinaryFormatter();
        bf.Serialize(ms, t);
        return ms.ToArray();
    }
}

private static T BytesToStruct<T>(byte[] bytes)
{
    using (var memStream = new MemoryStream())
    {
        var binForm = new BinaryFormatter();
        memStream.Write(bytes, 0, bytes.Length);
        memStream.Seek(0, SeekOrigin.Begin);
        var obj = binForm.Deserialize(memStream);
        return (T)obj;
    }
}
But BinaryFormatter is being removed for security reasons:
https://learn.microsoft.com/en-us/dotnet/standard/serialization/binaryformatter-security-guide
So is there some simple but high-performance method to replace BinaryFormatter?
In my project, which we recently migrated from .NET Core 3.1 to .NET 5, I swapped out our BinaryFormatter code with protobuf-net: https://github.com/protobuf-net/protobuf-net
The code was almost exactly the same, and the project is very reputable, with (currently) 22 million downloads and 3.2k stars on GitHub. It is very fast and has none of the security baggage surrounding BinaryFormatter.
Here's my class for byte[] serialization:
using System.IO;
using ProtoBuf;

public static class Binary
{
    /// <summary>
    /// Convert an object to a byte array, using Protobuf.
    /// </summary>
    public static byte[] ObjectToByteArray(object obj)
    {
        if (obj == null)
            return null;

        using var stream = new MemoryStream();
        Serializer.Serialize(stream, obj);
        return stream.ToArray();
    }

    /// <summary>
    /// Convert a byte array to an object of T, using Protobuf.
    /// </summary>
    public static T ByteArrayToObject<T>(byte[] arrBytes)
    {
        using var stream = new MemoryStream();
        // Copy the payload in, then rewind so the deserializer reads from the beginning.
        stream.Write(arrBytes, 0, arrBytes.Length);
        stream.Seek(0, SeekOrigin.Begin);
        return Serializer.Deserialize<T>(stream);
    }
}
I did have to add attributes to the class I serialized. It was decorated with [Serializable] only, and although I understand Protobuf can work with a lot of common decorations, that one didn't work. From the example on GitHub:
[ProtoContract]
class Person {
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public string Name { get; set; }
    [ProtoMember(3)]
    public Address Address { get; set; }
}

[ProtoContract]
class Address {
    [ProtoMember(1)]
    public string Line1 { get; set; }
    [ProtoMember(2)]
    public string Line2 { get; set; }
}
In my case I am caching things in Redis, and it worked great.
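For instance, a round trip through the helper above could look like this (a small sketch reusing the Person and Address classes and the Binary helper from this answer):
var person = new Person
{
    Id = 42,
    Name = "Ada",
    Address = new Address { Line1 = "1 Example Street", Line2 = "Floor 2" }
};

// Serialize to a byte[] (for example before writing it to Redis), then back again.
byte[] payload = Binary.ObjectToByteArray(person);
Person restored = Binary.ByteArrayToObject<Person>(payload);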
It is also possible to re-enable BinaryFormatter in your .csproj file:
<PropertyGroup>
  <TargetFramework>net5.0</TargetFramework>
  <EnableUnsafeBinaryFormatterSerialization>true</EnableUnsafeBinaryFormatterSerialization>
</PropertyGroup>
...But it's a bad idea. BinaryFormatter is responsible for many of .NET's historical vulnerabilities, and it can't be fixed. It will likely become completely unavailable in future versions of .NET, so replacing it is the right move.
If you are using .NET 5 or greater, you can use the new System.Text.Json.JsonSerializer.Serialize and System.Text.Json.JsonSerializer.Deserialize like so:
using System.Linq;
using System.Text;
using System.Text.Json;
using System.Text.Json.Serialization;

public static class Binary
{
    /// <summary>
    /// Convert an object to a byte array.
    /// </summary>
    public static byte[] ObjectToByteArray(object objData)
    {
        if (objData == null)
            return default;

        return Encoding.UTF8.GetBytes(JsonSerializer.Serialize(objData, GetJsonSerializerOptions()));
    }

    /// <summary>
    /// Convert a byte array to an object of T.
    /// </summary>
    public static T ByteArrayToObject<T>(byte[] byteArray)
    {
        if (byteArray == null || !byteArray.Any())
            return default;

        return JsonSerializer.Deserialize<T>(byteArray, GetJsonSerializerOptions());
    }

    private static JsonSerializerOptions GetJsonSerializerOptions()
    {
        return new JsonSerializerOptions()
        {
            PropertyNamingPolicy = null,
            WriteIndented = true,
            AllowTrailingCommas = true,
            DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull,
        };
    }
}
While an old thread, it's still relevant, especially if you find yourself dealing with code storing .NET data in Memcached for example (or Redis, or secondary storage on-prem or in a cloud). BinaryFormatter has the security problems mentioned in the OP, and also has performance and size issues.
A great alternative is the MessagePack format, and more specifically the MessagePack NuGet package for .NET solutions.
It's secure, maintained, faster, and smaller all around. See the benchmarks for details.
ZeroFormatter also appears to be a great alternative.
In today's cloud-centric solutions where sizing and capacity are important for lowering costs, these are extremely helpful.
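For reference, a minimal MessagePack round trip might look like this (a hedged sketch using the MessagePack NuGet package; the Person type here is only an illustration):
using MessagePack;

[MessagePackObject]
public class Person
{
    [Key(0)]
    public int Id { get; set; }

    [Key(1)]
    public string Name { get; set; }
}

// Serialize to a compact binary payload and back.
byte[] bytes = MessagePackSerializer.Serialize(new Person { Id = 1, Name = "Ada" });
Person restored = MessagePackSerializer.Deserialize<Person>(bytes);
If you prefer not to annotate your types, the library also offers a contractless resolver, at the cost of less explicit member versioning.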
There is also an option to keep using it in .NET 5:
Just add
<EnableUnsafeBinaryFormatterSerialization>true</EnableUnsafeBinaryFormatterSerialization>
to the project file, like:
<PropertyGroup>
  <TargetFramework>net5.0</TargetFramework>
  <EnableUnsafeBinaryFormatterSerialization>true</EnableUnsafeBinaryFormatterSerialization>
</PropertyGroup>
I believe it will work.

CacheBuilder using Guava cache for query results

To reduce DB hits when reading data with this query, I am planning to keep the result in a cache. To do this I am using Guava caching.
studentController.java
public Map<String, Object> getSomeMethodName(Number departmentId, String departmentType){
ArrayList<Student> studentList = studentManager.getStudentListByDepartmentType(departmentId, departmentType);
----------
----------
}
StudentHibernateDao.java (criteria query)
@Override
public ArrayList<Student> getStudentListByDepartmentType(Number departmentId, String departmentType) {
Criteria criteria =sessionFactory.getCurrentSession().createCriteria(Student.class);
criteria.add(Restrictions.eq("departmentId", departmentId));
criteria.add(Restrictions.eq("departmentType", departmentType));
ArrayList<Student> studentList = (ArrayList)criteria.list();
return studentList;
}
To cache the criteria query result, I started off by building a CacheBuilder, like below.
private static LoadingCache<Number departmentId, String departmentType, ArrayList<Student>> studentListCache = CacheBuilder
.newBuilder().expireAfterAccess(1, TimeUnit.MINUTES)
.maximumSize(1000)
.build(new CacheLoader<Number departmentId, String departmentType, ArrayList<Student>>() {
public ArrayList<Student> load(String key) throws Exception {
return getStudentListByDepartmentType(departmentId, departmentType);
}
});
Here I don't know where to put the CacheBuilder function, how to pass multiple key parameters (i.e. departmentId and departmentType) to the CacheLoader, or how to call it.
Is this the correct way of caching using Guava? Am I missing anything?
Guava's cache only accepts two type parameters, a key type and a value type. If you want your key to be a compound key then you need to build a new compound type to encapsulate it, and then look entries up with studentListCache.get(new CompoundDepartmentId(departmentId, departmentType)). Effectively it would need to look like this (I apologize for my syntax, I don't use Java that often):
// Compound key type
class CompoundDepartmentId {
    final Long departmentId;
    final String departmentType;

    public CompoundDepartmentId(Long departmentId, String departmentType) {
        this.departmentId = departmentId;
        this.departmentType = departmentType;
    }
    // Also override equals() and hashCode() so the cache can match keys correctly.
}
private static LoadingCache<CompoundDepartmentId, ArrayList<Student>> studentListCache =
    CacheBuilder
        .newBuilder().expireAfterAccess(1, TimeUnit.MINUTES)
        .maximumSize(1000)
        .build(new CacheLoader<CompoundDepartmentId, ArrayList<Student>>() {
            public ArrayList<Student> load(CompoundDepartmentId key) throws Exception {
                return getStudentListByDepartmentType(key.departmentId, key.departmentType);
            }
        });

How can I ignore a "$" in a DocumentContent to save in MongoDB?

My problem is that if I save a document with a $ inside the content, MongoDB gives me an exception:
java.lang.IllegalArgumentException: Invalid BSON field name $ xxx
I would like MongoDB to ignore the $ character in the content.
My application is written in Java. I read the content of the file and put it as a string into an object. After that the object is saved with a MongoRepository class.
Does anyone have any ideas?
Example content
Edit: I heard MongoDB has the same problem with the dot character. Our Spring Boot setup has a workaround for dots, but not for dollars:
How to configure mongo converter in spring to encode all dots in the keys of map being saved in mongo db
If you are using Spring Boot you can extend the MappingMongoConverter class and override the methods that do the escaping/unescaping.
@Component
public class MappingMongoConverterCustom extends MappingMongoConverter {
protected @Nullable String mapKeyDollarReplacemant = "characters_to_replace_dollar";
protected @Nullable String mapKeyDotReplacement = "characters_to_replace_dot";
public MappingMongoConverterCustom(DbRefResolver dbRefResolver, MappingContext<? extends MongoPersistentEntity<?>, MongoPersistentProperty> mappingContext) {
super(dbRefResolver, mappingContext);
}
@Override
protected String potentiallyEscapeMapKey(String source) {
if (!source.contains(".") && !source.contains("$")) {
return source;
}
if (mapKeyDotReplacement == null && mapKeyDollarReplacemant == null) {
throw new MappingException(String.format(
"Map key %s contains dots or dollars but no replacement was configured! Make "
+ "sure map keys don't contain dots or dollars in the first place or configure an appropriate replacement!",
source));
}
String result = source;
if(result.contains(".")) {
result = result.replaceAll("\\.", mapKeyDotReplacement);
}
if(result.contains("$")) {
result = result.replaceAll("\\$", mapKeyDollarReplacemant);
}
//add any other replacements you need
return result;
}
@Override
protected String potentiallyUnescapeMapKey(String source) {
String result = source;
if(mapKeyDotReplacement != null) {
result = result.replaceAll(mapKeyDotReplacement, "\\.");
}
if(mapKeyDollarReplacemant != null) {
result = result.replaceAll(mapKeyDollarReplacemant, "\\$");
}
//add any other replacements you need
return result;
}
}
If you go with this approach, make sure you override the default converter from AbstractMongoConfiguration, like below:
@Configuration
public class MongoConfig extends AbstractMongoConfiguration {

    @Bean
    public DbRefResolver getDbRefResolver() {
        return new DefaultDbRefResolver(mongoDbFactory());
    }

    @Bean
    @Override
    public MappingMongoConverter mappingMongoConverter() throws Exception {
        MappingMongoConverterCustom converter = new MappingMongoConverterCustom(getDbRefResolver(), mongoMappingContext());
        converter.setCustomConversions(customConversions());
        return converter;
    }

    .... whatever you might need extra ...
}

Mongodb scala driver custom conversion to JSON

If I am using the "native" JSON support from the official MongoDB Scala driver:
val jsonText = Document(...).toJson()
it produces JSON text with type prefixes for extended types:
{ "$oid" : "AABBb...." } - for ObjectId,
{ "$numberLong" : 123123 } - for Long, etc.
I want to avoid this type conversion and write just the plain values for each type. Is it possible to override the encoding behaviour for some types?
You can subclass JsonWriter and override writeXXX methods. For example, to customize date serialization you can use:
class CustomJsonWriter extends JsonWriter {
public CustomJsonWriter(Writer writer) {
super(writer);
}
public CustomJsonWriter(Writer writer, JsonWriterSettings settings) {
super(writer, settings);
}
@Override
protected void doWriteDateTime(long value) {
doWriteString(DateTimeFormatter.ISO_DATE_TIME
.withZone(ZoneId.of("Z"))
.format(Instant.ofEpochMilli(value)));
}
}
And then you can use the overridden version that way:
public static String toJson(Document doc) {
CustomJsonWriter writer = new CustomJsonWriter(new StringWriter(), new JsonWriterSettings());
DocumentCodec encoder = new DocumentCodec();
encoder.encode(writer, doc, EncoderContext.builder().isEncodingCollectibleDocument(true).build());
return writer.getWriter().toString();
}

iTextSharp / PDFBox text extraction fails for certain PDFs

The code below extracts the text from a PDF correctly via iTextSharp in many instances.
using (var pdfReader = new PdfReader(filename))
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    var currentText = PdfTextExtractor.GetTextFromPage(
        pdfReader,
        1,
        strategy);

    currentText =
        Encoding.UTF8.GetString(Encoding.Convert(
            Encoding.Default,
            Encoding.UTF8,
            Encoding.Default.GetBytes(currentText)));

    Console.WriteLine(currentText);
}
However, in the case of this PDF I get the following instead of text: "\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\a\u0001\u0002\u0003\u0004\u0005\u0006\u0003"
I have tried different encodings and even PDFBox but still failed to decode the PDF correctly. Any ideas on how to solve the issue?
Extracting the text nonetheless
@Bruno's answer is the answer one should give here: the PDF clearly does not provide the information required to allow proper text extraction according to section 9.10 Extraction of Text Content of the PDF specification ISO 32000-1...
But there actually is a slightly evil way to extract the text from the PDF at hand nonetheless!
Wrapping one's text extraction strategy in an instance of the following class, the garbled text is replaced by the correct text:
public class RemappingExtractionFilter : ITextExtractionStrategy
{
ITextExtractionStrategy strategy;
System.Reflection.FieldInfo stringField;
public RemappingExtractionFilter(ITextExtractionStrategy strategy)
{
this.strategy = strategy;
this.stringField = typeof(TextRenderInfo).GetField("text", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
}
public void RenderText(TextRenderInfo renderInfo)
{
DocumentFont font = renderInfo.GetFont();
PdfDictionary dict = font.FontDictionary;
PdfDictionary encoding = dict.GetAsDict(PdfName.ENCODING);
PdfArray diffs = encoding.GetAsArray(PdfName.DIFFERENCES);
StringBuilder builder = new StringBuilder();
foreach (byte b in renderInfo.PdfString.GetBytes())
{
PdfName name = diffs.GetAsName((char)b);
String s = name.ToString().Substring(2);
int i = Convert.ToInt32(s, 16);
builder.Append((char)i);
}
stringField.SetValue(renderInfo, builder.ToString());
strategy.RenderText(renderInfo);
}
public void BeginTextBlock()
{
strategy.BeginTextBlock();
}
public void EndTextBlock()
{
strategy.EndTextBlock();
}
public void RenderImage(ImageRenderInfo renderInfo)
{
strategy.RenderImage(renderInfo);
}
public String GetResultantText()
{
return strategy.GetResultantText();
}
}
It can be used like this:
ITextExtractionStrategy strategy = new RemappingExtractionFilter(new LocationTextExtractionStrategy());
string text = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
Beware, I had to use System.Reflection to access private members. Some environments may forbid this.
The same in Java
I initially coded this in Java for iText because that's my primary development environment. Thus, here is the initial Java version:
public class RemappingExtractionFilter implements TextExtractionStrategy
{
public RemappingExtractionFilter(TextExtractionStrategy strategy) throws NoSuchFieldException, SecurityException
{
this.strategy = strategy;
this.stringField = TextRenderInfo.class.getDeclaredField("text");
this.stringField.setAccessible(true);
}
@Override
public void renderText(TextRenderInfo renderInfo)
{
DocumentFont font = renderInfo.getFont();
PdfDictionary dict = font.getFontDictionary();
PdfDictionary encoding = dict.getAsDict(PdfName.ENCODING);
PdfArray diffs = encoding.getAsArray(PdfName.DIFFERENCES);
StringBuilder builder = new StringBuilder();
for (byte b : renderInfo.getPdfString().getBytes())
{
PdfName name = diffs.getAsName((char)b);
String s = name.toString().substring(2);
int i = Integer.parseUnsignedInt(s, 16);
builder.append((char)i);
}
try
{
stringField.set(renderInfo, builder.toString());
}
catch (IllegalArgumentException | IllegalAccessException e)
{
e.printStackTrace();
}
strategy.renderText(renderInfo);
}
@Override
public void beginTextBlock()
{
strategy.beginTextBlock();
}
@Override
public void endTextBlock()
{
strategy.endTextBlock();
}
@Override
public void renderImage(ImageRenderInfo renderInfo)
{
strategy.renderImage(renderInfo);
}
@Override
public String getResultantText()
{
return strategy.getResultantText();
}
final TextExtractionStrategy strategy;
final Field stringField;
}
(RemappingExtractionFilter.java)
It can be used like this:
String extractRemapped(PdfReader reader, int pageNo) throws IOException, NoSuchFieldException, SecurityException
{
TextExtractionStrategy strategy = new RemappingExtractionFilter(new LocationTextExtractionStrategy());
return PdfTextExtractor.getTextFromPage(reader, pageNo, strategy);
}
(from RemappedExtraction.java)
Why does this work?
First of all, this is not the solution to all extraction problems, merely for extracting text from PDFs like the OP has presented.
This method works because the names the PDF uses in its fonts' encoding differences arrays can be interpreted even though they are not standard. These names are built as /Gxx where xx is the hexadecimal representation of the ASCII code of the character this name represents.
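To make that concrete, the per-glyph mapping performed inside the RenderText loop boils down to this (a standalone sketch, not part of the iTextSharp API; the glyph name /G41 is just an example):
using System;

// A Differences entry such as /G41 carries the character code in hex after the "G".
string glyphName = "/G41";
string hex = glyphName.Substring(2);           // "41"
char decoded = (char)Convert.ToInt32(hex, 16); // 'A' (0x41)
Console.WriteLine(decoded);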
A good test to find out whether or not a PDF allows text to be extracted correctly, is by opening it in Adobe Reader and to copy and paste the text.
For instance: I copied the word ABSTRACT and I pasted it in Notepad++:
Do you see the word ABSTRACT in Notepad++? No, you see %&SOH'"%GS. The A is represented as %, the B is represented as &, and so on.
This is a clear indication that the content of the PDF isn't accessible: there is no mapping between the encoding that was used (% = A, & = B, ...) and the actual characters that humans can understand.
In short: the PDF doesn't allow you to extract text, not with iText, not with iTextSharp, not with PDFBox. You'll have to find an OCR tool instead and OCR the complete document.
For more info, you may want to watch the following videos:
https://www.youtube.com/watch?v=4ur9WRWVrbM (~5 minutes)
https://www.youtube.com/watch?v=wxGEEv7ibHE (~15 minutes)
https://www.youtube.com/watch?v=g-QcU9B4qMc (~45 minutes)