Hibernate Search: Indexing only a few entities of a huge table without loading all entries - hibernate-search

I have a huge table which I write into a Lucene index using Hibernate Search version 5.11.5. After an (existing) raw SQL import into this table I have to manually update the search index. This SQL import runs multiple times a day, and reindexing should not block incoming search requests for more than a few seconds.
Each entity has a date field "modified", so I already annotated this @Indexed entity with an EntityIndexingInterceptor like the following:
public class CustomEntityIndexingInterceptor implements EntityIndexingInterceptor<HugeTableEntity> {
    public static Date lastModified = //some logic;
    @Override
    public IndexingOverride onAdd(HugeTableEntity entity) {
        return IndexingOverride.APPLY_DEFAULT;
    }
    @Override
    public IndexingOverride onUpdate(HugeTableEntity entity) {
        if (entity.getModified().after(lastModified)) {
            return IndexingOverride.APPLY_DEFAULT;
        }
        return IndexingOverride.SKIP;
    }
}
This code works, but reindexing takes a lot of time because all entities get loaded. Fewer than 0.001% of them match the condition on the modified date.
I saw that deep inside of Hibernate Search there exists a class IdentifierProducer, which loads all IDs in the method loadAllIdentifiers. I would love to add a SQL filter to the inner criteria - something like "where modified > :given_date".
Do you know if I can customize the IdentifierProducer without copying all the code? Do you know another smart way to solve my problem?
Regards

The MassIndexer cannot be used to index just part of your data... yet. IdentifierProducer is an internal class and you should not try to alter it.
What you can do is run a query to list the entities that were affected by your import, and ask Hibernate Search to reindex them, for example in batches of 20 elements. You can take some inspiration from this example from the documentation. In your case, of course, you will add filters to the Hibernate ORM query, so that it only goes through entities that actually changed because of your import.
Don't forget to remove your indexing interceptor, which should no longer be useful.
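A minimal sketch of the batching loop described in this answer, written in plain Java so that it compiles without Hibernate on the classpath. The class BatchReindexer and its parameter names are hypothetical and not part of Hibernate Search. In Hibernate Search 5 the index callback would typically be FullTextSession.index(entity), and the flushAndClear callback would call flushToIndexes() followed by clear(). The changed iterable is assumed to come from an ORM query filtered with something like "where modified > :given_date".

```java
import java.util.function.Consumer;

// Hypothetical helper: walks only the changed entities and flushes every batchSize items.
class BatchReindexer {

    // Returns the number of entities handed to the index callback.
    static <T> int reindexInBatches(Iterable<T> changed,
                                    int batchSize,
                                    Consumer<? super T> index,
                                    Runnable flushAndClear) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int count = 0;
        for (T entity : changed) {
            // With Hibernate Search 5 this would be fullTextSession.index(entity) (assumption).
            index.accept(entity);
            count++;
            if (count % batchSize == 0) {
                // Apply pending index changes and free memory before the next batch.
                flushAndClear.run();
            }
        }
        if (count % batchSize != 0) {
            // Flush the last, partially filled batch.
            flushAndClear.run();
        }
        return count;
    }
}
```

Because only the filtered entities are iterated, the cost scales with the number of rows touched by the import rather than with the size of the table.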

Related

JPA: How to get results by compromised where-clause

I have a table with 30 columns.
I populate the object in my Java code. Now I want to check in my database whether the row has already been inserted. I can do this the primitive way, like:
SELECT *
FROM tablename
WHERE table.name=object.name
AND table.street=object.street
AND ...
AND ...
AND ...
I think you get the idea. It works, but in my opinion this is not the best solution.
Is there any kind of generic solution (e.g. one where I do not need to change the code if the table changes), where I can hand my object to the where-clause and have it match itself? That would also keep the where-clause from becoming so massive.
The closest thing that comes to mind is the Spring Data JPA Specifications.
You can isolate the where clauses in an instance for a particular entity.
Afterwards, you just pass it to any of the @Repository methods:
public interface UserRepository extends CrudRepository<User, Long>,
        JpaSpecificationExecutor<User> {
}
Then in your service:
@Autowired
private UserRepository repo;
public void findMatching() {
    List<User> users = repo.findAll(new MyUserSpecification());
}
Then, whenever the database changes, you only need to alter one place: the Specification implementation.

Entity Framework : map duplicate tables to single entity at runtime?

I have a legacy database with a particular table -- I will call it ItemTable -- that can have billions of rows of data. To overcome database restrictions, we have decided to split the table into "silos" whenever the number of rows reaches 100,000,000. So, ItemTable will exist, then a procedure will run in the middle of the night to check the number of rows. If numberOfRows is > 100,000,000 then silo1_ItemTable will be created. Any Items added to the database from then on will be added to silo1_ItemTable (until it grows too big, then silo2_ItemTable will exist...)
ItemTable and silo1_ItemTable can be mapped to the same Item entity because the table structures are identical, but I am not sure how to set this mapping up at runtime, or how to specify the table name for my queries. All inserts should be added to the latest siloX_ItemTable, and all Reads should be from a specified siloX_ItemTable.
I have a separate siloTracker table that will give me the table name to insert/read the data from, but I am not sure how I can use this with entity framework...
Thoughts?
You could try to use entity inheritance for this. You have a base class with all the fields mapped to ItemTable, and descendant classes that inherit from the Item entity and are mapped to the silo tables in the db. Every time you create a new silo, you create a new entity mapped to that silo table.
[Table("ItemTable")]
public class Item
{
    // All the fields in the table go here
}
[Table("silo1_ItemTable")]
public class Silo1Item : Item
{
}
[Table("silo2_ItemTable")]
public class Silo2Item : Item
{
}
You can find more information on this here
Another option is to create a view that is a union of all those tables and map your entity to that view.
As mentioned in my comment, to solve this problem I am using the SqlQuery method that is exposed by DbSet. Since all my item tables have the exact same schema, I can use SqlQuery to define my own query and pass the name of the table into it. Tested on my system and it is working well.
See this link for an explanation of running raw queries with entity framework:
EF raw query documentation
If anyone has a better way to solve my question, please leave a comment.
[UPDATE]
I agree that stored procedures are also a great option, but for some reason my management is very resistant to making any changes to our database. It is easier for me (and our customers) to put the SQL in code and acknowledge the fact that there is raw SQL. At least I can hide it from the other layers rather easily.
[/UPDATE]
A possible solution for this problem is to initialize the context with a DbCompiledModel parameter:
var builder = new DbModelBuilder(DbModelBuilderVersion.V6_0);
builder.Configurations.Add(new EntityTypeConfiguration<EntityName>());
builder.Entity<EntityName>().ToTable("TableNameDefinedInRuntime");
var dynamicContext = new MyDbContext(builder.Build(context.Database.Connection).Compile());
For some reason, in EF6 this fails on the second table request, although the mapping inside the context looks correct at the moment of execution.

Attempting to use EF/Linq to Entities for dynamic querying and CRUD operations

(as advised re-posting this question here... originally posted in msdn forum)
I am striving to write a "generic" routine for some simple CRUD operations using EF/Linq to Entities. I'm working in ASP.NET (C# or VB).
I have looked at:
Getting a reference to a dynamically selected table with "GetObjectByKey" (But I don't want anything from cache. I want data from database. Seems like not what this function is intended for).
CRM Dynamic Entities (here you can pass a tablename string to query) looked like the approach I am looking for but I don't get the idea that this CRM effort is necessarily staying current (?) and/or has much assurance for the future??
I looked at various ways of drilling thru Namespaces/Objects to get to where I could pass a TableName parameter into the oft used query syntax var query = (from c in context.C_Contacts select c); (for example) where somehow I could swap out the "C_Contacts" TEntity depending on which table I want to work with. But not finding a way to do this ??
Slightly oversimplifying, I just want to be able to pass a table name parameter, and in some cases some associated field names and values (perhaps in a generic object?), to my routine and then let that routine dynamically plug into the LINQ to Entities data context/model and do some standard "select all" operations for the given table, or do a delete on the given table based on a generic record id. I'm trying to avoid calling the various automatically generated L2E methods based on table name etc. Instead, I'm trying to drill into the data context and ultimately the L2E query syntax for dynamically passed table/field names.
Has anyone found any successful/efficient approaches for doing this? Any ideas, links, examples?
The DbContext object has a generic Set() method. This will give you
from c in context.Set<Contact>() select c
Here's a method for when you start from a string:
public void Test()
{
    dynamic entity = null;
    Type type = Type.GetType("Contract");
    entity = Activator.CreateInstance(type);
    ProcessType(entity);
}
public void ProcessType<TEntity>(TEntity instance)
    where TEntity : class
{
    var result =
        from item in this.Set<TEntity>()
        select item;
    //do stuff with the result
    //passing back to the caller can get more complicated
    //but passing it on will be fine ...
}

Using Annotations In JPA can I limit child records with a where clause?

I have an EJB with a @OneToMany relationship like this in my parent class (Timeslot):
@OneToMany(mappedBy = "rsTimeslots")
private List<RsEvents> rsEventsList;
I also have a function to get the rsEventList:
public void setRsEventsList(List<RsEvents> rsEventsList) {
    this.rsEventsList = rsEventsList;
}
This was all auto-generated so far. In my view-layer code I can get a timeslot object and do something like timeslot.getRsEventList() and get all children of this timeslot. Now I need to restrict that list based on a criterion. For example, I only want events that are children of this timeslot and have a status of 1. Is there a way to do this with annotations?
Not in JPA.
Normally you would execute a Query for this, using JPQL or the criteria API.
Some JPA providers do provide ways to restrict relationships, but I think you would be best off with a query, or with a get/filter method on your class that just accesses the list and filters it (e.g. getStatus1Events()).
For an EclipseLink example of putting selection criteria on a mapping, see:
http://wiki.eclipse.org/EclipseLink/Examples/JPA/MappingSelectionCriteria
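A rough sketch of the get/filter method suggested in this answer. The classes below are simplified stand-ins for the entities in the question: the JPA annotations are left out so the code compiles without a persistence provider, and the status field and getStatus() accessor on RsEvents are assumptions, since the question does not show that class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Simplified stand-in for the child entity; the status field is an assumption.
class RsEvents {
    private final int status;

    RsEvents(int status) {
        this.status = status;
    }

    int getStatus() {
        return status;
    }
}

// Simplified stand-in for the parent entity.
class Timeslot {
    // In the real entity this field carries @OneToMany(mappedBy = "rsTimeslots").
    private List<RsEvents> rsEventsList = new ArrayList<>();

    List<RsEvents> getRsEventsList() {
        return rsEventsList;
    }

    void setRsEventsList(List<RsEvents> rsEventsList) {
        this.rsEventsList = rsEventsList;
    }

    // Filters in memory: the whole collection is still loaded, only the view is narrowed.
    List<RsEvents> getStatus1Events() {
        return rsEventsList.stream()
                .filter(event -> event.getStatus() == 1)
                .collect(Collectors.toList());
    }
}
```

This keeps the mapping untouched, at the cost of loading every child row first. For large collections, a JPQL or criteria query that filters in the database is likely the better fit.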

MVC 1.0 + EF: Does db.EntitySet.where(something) still return all rows in table?

In a repository, I do this:
public AgenciesDonor FindPrimary(Guid donorId) {
    return db.AgenciesDonorSet
        .Include("DonorPanels")
        .Include("PriceAdjustments")
        .Include("Donors")
        .First(x => x.Donors.DonorId == donorId && x.IsPrimary);
}
then down in another method in the same repository, this:
AgenciesDonor oldPrimary = this.FindPrimary(donorId);
In the debugger, the Results View shows all records in that table, but:
oldPrimary.Count();
is 1 (which it should be).
Why am I seeing all table entries retrieved, and not just 1? I thought row filtering was done in the DB.
If db.EntitySet really does fetch everything to the client, what's the right way to keep the client data-lite using EF? Fetching all rows won't scale for what I'm doing.
You will see everything if you hover over the AgenciesDonorSet because LINQ to Entities (or SQL) uses delayed execution. When the query is actually executed, it is just retrieving the count.
If you want to view the SQL being generated for any query, you can add this bit of code:
var query = queryObj as ObjectQuery; //assign your query to queryObj rather than returning it immediately
if (query != null)
{
    System.Diagnostics.Trace.WriteLine(context);
    System.Diagnostics.Trace.WriteLine(query.ToTraceString());
}
Entity Set does not implement IQueryable, so the extension methods that you're using are IEnumerable extension methods. See here:
http://social.msdn.microsoft.com/forums/en-US/linqprojectgeneral/thread/121ec4e8-ce40-49e0-b715-75a5bd0063dc/
I agree that this is stupid, and I'm surprised that more people haven't complained about it. The official reason:
"The design reason for not making EntitySet IQueryable is because there's not a clean way to reconcile Add\Remove on EntitySet with IQueryable's filtering and transformation ability."