Hibernate Search indexer process hangs halfway through the work

I have 7 entity classes to index with Hibernate Search. Having tried both MassIndexer and FlushToIndexes, the indexer churned through the smallest entities, but the largest entities/tables never finished, even though a MassIndexerProgressMonitor reported that indexing had finished. The process simply hangs once it reaches 100-200 MB of allocated memory. I want to make sure the indexing process ends properly.
Questions: Is the code correct? Should Hibernate or database settings be tuned?
Environment: 64-bit Windows 7, JBoss, Struts2, Hibernate, Hibernate Search, Lucene, SQL Server. The Hibernate Search index is stored on the filesystem.
MassIndexer code sample:
final Session session = HibernateSessionFactory.getSession();
final FullTextSession fullTextSession = Search.getFullTextSession(session);
MassIndexerProgressMonitor monitor = new IndexProgressMonitor("Kanalregister");
fullTextSession.createIndexer()
        .purgeAllOnStart(true)
        .progressMonitor(monitor)
        .batchSizeToLoadObjects(BATCH_SIZE) // 250000
        .startAndWait();
FlushToIndexes code sample (from the Hibernate reference documentation; seems to index fine, but never ends):
final Session session = HibernateSessionFactory.getSession();
final FullTextSession fullTextSession = Search.getFullTextSession(session);
fullTextSession.setFlushMode(FlushMode.MANUAL);
fullTextSession.setCacheMode(CacheMode.IGNORE);
Transaction t1 = fullTextSession.beginTransaction();
// Scrollable results will avoid loading too many objects in memory
ScrollableResults results = fullTextSession.createCriteria(Land.class)
        .setFetchSize(BATCH_SIZE) // 250000
        .scroll(ScrollMode.FORWARD_ONLY);
int index = 0;
while (results.next()) {
    index++;
    fullTextSession.index(results.get(0)); // index each element
    if (index % BATCH_SIZE == 0) {
        fullTextSession.flushToIndexes(); // apply changes to indexes
        fullTextSession.clear(); // free memory since the queue is processed
    }
}
t1.commit();
The code is verified to end when all indexing work is mocked out, using the following setting in hibernate.cfg.xml:
<property name="hibernate.search.default.worker.backend">blackhole</property>

The code above is validated and correct.
The problem of the console never ending appears to be related to Eclipse, since a printout at the end of main() was in fact displayed.
There were some missing entity classes in my model that were not reported properly. Once I was notified of those and added them to my model, the indexing process ended successfully for the MassIndexer, as evidenced by 3+ files in each directory of the Lucene index.
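On the tuning question: a batch size of 250,000 objects per load is far larger than what the Hibernate Search documentation uses in its MassIndexer examples. A hedged sketch of more conventional settings (the method names come from the MassIndexer API; the concrete numbers are only starting points to experiment with):
fullTextSession.createIndexer()
        .purgeAllOnStart(true)
        .progressMonitor(monitor)
        .batchSizeToLoadObjects(25)   // small batches keep the session light
        .threadsToLoadObjects(4)      // parallelize entity loading
        .idFetchSize(150)             // stream primary keys from the database
        .cacheMode(CacheMode.IGNORE)  // skip the second-level cache during indexing
        .startAndWait();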

Related

How did it come to be that EF is trying to save thousands of entities simultaneously?

TLDR
If I were to call EF like this:
dbContext.Update(newEntityA);
dbContext.SaveChanges();
dbContext.Update(newEntityB);
dbContext.SaveChanges();
and it all completed successfully in < 1 millisecond, how many INSERT statements would it execute? Two inserts with X parameters each, or one insert with 2X parameters and a VALUES clause of the form ((...), (...))?
If the first insert failed because of e.g. a PK violation, does that change the answer?
Long version
I have a very simple app that moves data from RabbitMQ to an Azure SQL database. The app looks something like this (I've stripped out most of the setup/logging calls):
class Program
{
    private static IConfigurationRoot _configuration;
    private static IConsumer _consumer;
    private static ConsumerDbContext _dbContext;
    // _logger initialization omitted along with the other setup code

    static void Main(string[] args)
    {
        var builder = new ConfigurationBuilder()
            .SetBasePath(Directory.GetCurrentDirectory())
            .AddJsonFile("appsettings.json", optional: true, reloadOnChange: true);
        _configuration = builder.Build();
        _dbContext = new ConsumerDbContext(...configblah...);
        _consumer = new RabbitMqConsumer(...configblah...);
        _consumer.Connect();
        _consumer.ReceivedMessage += ConsumerOnReceivedMessage;
    }

    private static void ConsumerOnReceivedMessage(object sender, MessageArgs messageArgs)
    {
        try
        {
            var model = JsonConvert.DeserializeObject<DrawingModel>(messageArgs.Message.ToString());
            if (model != null)
            {
                var drawing = _dbContext.Drawings.FirstOrDefault(x => x.Id == model.DrawingId);
                if (drawing != null)
                {
                    if (model.DrawingCommandModel != null)
                    {
                        var command = new DrawingCommandEntity(model.DrawingCommandModel);
                        command.Id = model.DrawingCommandId;
                        _dbContext.Add(command);
                    }
                    drawing.LastExecutedCommandId = model.DrawingCommandId;
                    drawing.CurrentState = model.DrawingState;
                    _dbContext.Update(drawing);
                    _dbContext.SaveChanges();
                }
                _logger.LogInformation("success ...");
            }
        }
        catch (Exception e)
        {
            _logger.LogInformation("fail .....");
        }
    }
}
Essentially, the RMQ contains JSON that describes the current state of a vector drawing and the command that led to that state. A Drawing has many DrawingCommands, and Drawing.CurrentState is a concatenation/convergence of multiple DrawingCommands. A DrawingCommand is of the ilk "draw a green line, width 5px, from 0,0 to 10,10", and the commands themselves may have a history of changes: if the user changes the green to blue, the change is processed as two drawing commands, first that the command was created as green, and later that it was changed to blue, so a new drawing command like "draw a blue line, width 5px, from 0,0 to 10,10" is written.
This allows undo/redo, so the CurrentState of a Drawing can be thought of as the latest set of unique DrawingCommands. Every change ever performed on a Drawing's DrawingCommand list is stored, whether it is the addition, removal, or modification of DrawingCommand entities.
The JSON is entirely calculated elsewhere: multiple systems submit DrawingCommand JSON to a central aggregator, the aggregator maintains the current state of a Drawing and all the DrawingCommands that compose it, and it emits the results into an RMQ. The RMQ is used to coalesce the work so that a single app (this app) can connect to Azure and load the data into the DB. This means the RMQ already contains all the primary-key GUIDs for a drawing and its entities, and literally the only thing this app does is deserialize the JSON into a DB entity and insert it.
It does it this way because there is a limit to the number of connections we can make to the Azure DB, and coalescing things via a queue is the best way to reduce load and stay within Azure's connection limit.
So, to the problem:
Recently, we ended up in a situation where messages were piling into the RMQ and the consumer wouldn't process them. After lowering the minimum logging level to Info (granted, logging an exception at Info level isn't great) rather than Error, I hit the button in the Azure management portal to restart the service.
Looking at the log (logged to Azure blob storage, with a new file created that day), it became apparent that a duplicate GUID PK had entered the queue and was breaking the INSERT query formed by EF. To my surprise, though, when pulling the logs, the queries EF was forming were enormous:
date,level,message
2019-09-20T11:52:02,Inf,fail: Microsoft.EntityFrameworkCore.Database.Command[20102]
2019-09-20T11:52:02,Inf," Failed executing DbCommand (2,689ms) [Parameters= .... 1.4 megabytes of data snipped ... ]
2019-09-20T11:52:02,Inf, SET NOCOUNT ON;,
2019-09-20T11:52:02,Inf," INSERT INTO [DrawingCommands] ([Id], [CommandType], [CreatedDate], [CurrentState], [CustomerId], [ObjectsIds], [PreviousState], [ToolType], [DrawingId])"
2019-09-20T11:52:02,Inf," VALUES (#p0, #p1, #p2, #p3, #p4, #p5, #p6, #p7, #p8),",
2019-09-20T11:52:02,Inf," (#p9, #p10, #p11, #p12, #p13, #p14, #p15, #p16, #p17),",
2019-09-20T11:52:02,Inf," (#p18, #p19, #p20, #p21, #p22, #p23, #p24, #p25, #p26),",
2019-09-20T11:52:02,Inf," (#p27, #p28, #p29, #p30, #p31, #p32, #p33, #p34, #p35),",
2019-09-20T11:52:02,Inf," (#p36, #p37, #p38, #p39, #p40, #p41, #p42, #p43, #p44),",
2019-09-20T11:52:02,Inf," (#p45, #p46, #p47, #p48, #p49, #p50, #p51, #p52, #p53),",
2019-09-20T11:52:02,Inf," (#p54, #p55, #p56, #p57, #p58, #p59, #p60, #p61, #p62),",
2019-09-20T11:52:02,Inf," (#p63, #p64, #p65, #p66, #p67, #p68, #p69, #p70, #p71),",
2019-09-20T11:52:02,Inf," (#p72, #p73, #p74, #p75, #p76, #p77, #p78, #p79, #p80),",
2019-09-20T11:52:02,Inf," (#p81, #p82, #p83, #p84, #p85, #p86, #p87, #p88, #p89),",
2019-09-20T11:52:02,Inf," (#p90, #p91, #p92, #p93, #p94, #p95, #p96, #p97, #p98),",
2019-09-20T11:52:02,Inf," (#p99, #p100, #p101, #p102, #p103, #p104, #p105, #p106, #p107),",
...
2019-09-20T11:52:02,Inf," (#p2088, #p2089, #p2090, #p2091, #p2092, #p2093, #p2094, #p2095, #p2096);",
The parameters line is 1.4 megabytes of data, covering more than 2,000 parameters (#p0 through #p2096).
I'm curious how it came to be that EF was trying to save approximately 232 drawing commands in one go. I'm also curious how this came to be the very first thing in the log.
I suspect that entities are added to the tracked-entity list by the Update() call and saved by the SaveChanges() call. If SaveChanges() fails, the entity remains unsaved in the tracked list, and when another queue message is processed and added with Update(), SaveChanges() is attempted again; this time there are two entities in the list, so 9 parameters become 18, and so on, until the graph of entities to be saved is huge.
Is this the case?
Any thoughts on why I didn't see a succession of error messages in which the number of parameters repeatedly increased all the way up to 2096? Literally the first entry in the log after reconfiguring the minimum logging level to Information had 2096 parameters.
Is it possible that the log-level change was picked up immediately, but the restart request from the Azure management portal was processed later? Immediately after the final massive query in the logs I see "thread was being aborted", and then the service goes back to normal, inserting one entity at a time. I'm thinking that either EF coalesces DB writes, so it doesn't execute the save immediately if another save follows within a very short timescale, or the log-level change took effect immediately and the first log entries were just from those final seconds while the app was still working through its old object list. Restarting the app discarded the entire set of unsaved entities, which allowed things to proceed normally.
EF couldn't save the entity due to a PK violation. Because the DbContext was a singleton, the list of tracked entities just grew and grew with every attempt; more entities would be added, and still the first would fail to save.
I changed the scope of the DbContext so that it is created anew for every cycle of processing a message from the queue. This way, entities that didn't save correctly are forgotten when the context is renewed (and we don't care about losing them, since they are corrupt anyway), which stopped it from getting stuck on one failed entity with a massive backlog.
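A minimal sketch of that per-message scope, reusing the handler shape from the question (the ConsumerDbContext construction details are elided as in the original):
private static void ConsumerOnReceivedMessage(object sender, MessageArgs messageArgs)
{
    // A fresh context per message: a poisoned entity cannot leak into
    // the change tracker used by the next message.
    using (var dbContext = new ConsumerDbContext(...configblah...))
    {
        try
        {
            var model = JsonConvert.DeserializeObject<DrawingModel>(messageArgs.Message.ToString());
            if (model != null)
            {
                // ... same lookup/insert/update logic as above ...
                dbContext.SaveChanges();
            }
        }
        catch (Exception e)
        {
            // Any failed entities are discarded along with the context.
            _logger.LogInformation("fail .....");
        }
    }
}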

Updating TYPO3 7.6 to 8.7: can't get the frontend to work in a local test environment with XAMPP

I'm working on updating a TYPO3 7.6 site to 8.7. I'm doing this on my local machine with XAMPP on Windows with PHP 7.2.
I got the backend working. It needed some manual work in the DB, like changing the CType in tt_content for my own content elements as well as filling in colPos.
However, when I call a page on the frontend, all I get is a timeout:
Fatal error: Maximum execution time of 60 seconds exceeded in
C:\xampp\htdocs\typo3_src-8.7.19\vendor\doctrine\dbal\lib\Doctrine\DBAL\Driver\Mysqli\MysqliStatement.php on line 92
(This does not change if I set max_execution_time to 300.)
Edit: I added an echo just before line 92 in the file above; this is the function:
public function __construct(\mysqli $conn, $prepareString)
{
    $this->_conn = $conn;
    echo $prepareString."<br />";
    $this->_stmt = $conn->prepare($prepareString);
    if (false === $this->_stmt) {
        throw new MysqliException($this->_conn->error, $this->_conn->sqlstate, $this->_conn->errno);
    }
    $paramCount = $this->_stmt->param_count;
    if (0 < $paramCount) {
        $this->types = str_repeat('s', $paramCount);
        $this->_bindedValues = array_fill(1, $paramCount, null);
    }
}
What I get is the following statement, thousands of times, always exactly the same:
SELECT `tx_fed_page_controller_action_sub`, `t3ver_oid`, `pid`, `uid` FROM `pages` WHERE (uid = 0) AND ((`pages`.`deleted` = 0) AND (`pages`.`hidden` = 0) AND (`pages`.`starttime` <= 1540305000) AND ((`pages`.`endtime` = 0) OR (`pages`.`endtime` > 1540305000)))
Note: I don't have any entry in pages with uid=0, so I am really not sure what this is for. Does there need to be a page with uid=0?
I enabled slow-query logging in MySQL, but nothing gets logged by it. I don't get any additional PHP error, nor a log entry in TYPO3.
So right now I am a bit stuck and don't know how to proceed.
I enabled general logging for MySQL, and when I call a page on the frontend I see this SQL query executed over and over again:
SELECT `tx_fed_page_controller_action_sub`, `t3ver_oid`, `pid`, `uid` FROM `pages` WHERE (uid = 0) AND ((`pages`.`deleted` = 0) AND (`pages`.`hidden` = 0) AND (`pages`.`starttime` <= 1540302600) AND ((`pages`.`endtime` = 0) OR (`pages`.`endtime` > 1540302600)))
Executing this query manually returns an empty result (I don't have any entry in pages with uid=0). I don't know if that means anything.
What options do I have? How can I find what's missing / where the error is?
First: give your PHP more time to run.
In the php.ini configuration, increase the max execution time to 240 seconds.
Be aware that 240 seconds are recommended for TYPO3 even in production mode. If you start the Install Tool, you can run a system check and get information about configuration settings that might need optimization.
Second: avoid Development mode and use Production mode.
Execution is faster, but you lose the option to debug.
Debugging always costs more time and more memory to prepare all that information; maybe 240 seconds are not enough, and you may even need more memory. An example php.ini change is shown below.
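For example, in XAMPP's php.ini (typically C:\xampp\php\php.ini; the path and the memory value are assumptions for a default install):
; recommended minimum execution time for TYPO3
max_execution_time = 240
; give the frontend rendering some memory headroom as well
memory_limit = 256M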
The field tx_fed_page_controller_action_sub comes from an extension; it is not part of the core. Most likely you have flux and fluidpages installed in your system.
Try deactivating those extensions and proceed without them; reintegrate them later if you still need them. A timeout often means that some kind of recursion is going on. From my experience with flux, it is possible for a content element to have itself set as its own flux parent, which creates an infinite rendering loop and causes a fatal error after max_execution_time.
So in your case I'd try to find the record that is causing this (it seems to be a page record) and/or the code that initiates the query. You do not need to debug in Doctrine itself :)
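A quick way to hunt for such self-references, assuming flux stores the parent relation in a tt_content column named tx_flux_parent (verify the column name against your actual schema first):
-- content elements that claim to be their own flux parent
SELECT uid, pid, CType
FROM tt_content
WHERE uid = tx_flux_parent AND deleted = 0;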

Program runs out of memory reading a large TYPO3 Extbase repository

I'm writing an extension function in TYPO3 CMS 6.2 Extbase that must process every object in a large repository. My function works fine if I have about 10,000 objects, but runs out of memory if I have over about 20,000 objects. How can I handle the larger repository?
$importRecordsCount = $this->importRecordRepository->countAll();
for ($id = 1; $id <= $importRecordsCount; $id++) {
    $importRecord = $this->importRecordRepository->findByUid($id);
    /* Do things to other models based on properties of $importRecord */
}
The program exceeds the memory limit near ..\GeneralUtility.php:4427 in TYPO3\CMS\Core\Utility\GeneralUtility::instantiateClass(), after passing through the findByUid() line above. It took 117 seconds to reach this error during my latest test. The error message is:
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 4194304 bytes) in ... \typo3\sysext\core\Classes\Utility\GeneralUtility.php on line 4448
If it matters, I do not use #lazy because of some of the processing I do later in the function.
According to the official TYPO3 website, a 256M memory limit is recommended instead of 128M:
Source
So my first suggestion would be to try that; it might already solve your problem. Also, you should use $this->importRecordRepository->findAll() instead of fetching each record by iterating over uids, since someone might have deleted some records.
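The memory change is a single php.ini line (assuming you can edit the server's PHP configuration):
; the limit recommended by TYPO3, up from the default 128M
memory_limit = 256M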
In general, Extbase is not really suitable for processing such a large amount of data. An alternative would be to use the DataHandler if a correct history etc. is required, though it also has quite some overhead compared to the TYPO3 database API (DatabaseConnection, $GLOBALS['TYPO3_DB']), which would be the best-performing approach; a rough sketch follows. See my comments and tutorial in this answer.
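As a sketch of the direct-API route in TYPO3 6.2 (the table name tx_myext_domain_model_importrecord and the foo column are hypothetical placeholders for your model's table):
// Fetch only the columns you need, straight from the database
$rows = $GLOBALS['TYPO3_DB']->exec_SELECTgetRows(
    'uid, foo',
    'tx_myext_domain_model_importrecord',
    'deleted=0'
);
foreach ((array)$rows as $row) {
    // Write each change back without instantiating Extbase objects
    $GLOBALS['TYPO3_DB']->exec_UPDATEquery(
        'tx_myext_domain_model_importrecord',
        'uid=' . (int)$row['uid'],
        array('foo' => 'Test')
    );
}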
If you decide to stay with the Extbase API, the only approach that could work is to persist after every X items (experiment to find what works in your setup) to free some memory. From your code I cannot really see at which point your manipulation happens, but take this as an example:
$importRecords = $this->importRecordRepository->findAll();
$i = 0;
foreach ($importRecords as $importRecord) {
    /** @var \My\Package\Domain\Model\ImportRecord $importRecord */
    // Manipulate the record
    $importRecord->setFoo('Test');
    $this->importRecordRepository->update($importRecord);
    $i++;
    if ($i % 100 === 0) {
        // Persist after every 100 items
        $this->persistenceManager->persistAll();
    }
}
// Persist the rest
$this->persistenceManager->persistAll();
There is also Extbase's clearState() method to free some memory:
$this->objectManager->get(\TYPO3\CMS\Extbase\Persistence\Generic\PersistenceManager::class)->clearState();
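Combined with the batching above, clearing state after each persist keeps the session from accumulating objects; a sketch, assuming $this->persistenceManager is the Generic\PersistenceManager shown above:
if ($i % 100 === 0) {
    $this->persistenceManager->persistAll();
    // Detach the persisted objects so PHP can garbage-collect them
    $this->persistenceManager->clearState();
}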

Cancelling an Entity Framework Query

I'm in the process of writing a query manager for a WinForms application that, among other things, needs to deliver real-time search results to the user as they're entering a query (think Google's live results, though obviously in a thick-client environment rather than on the web). Since results need to start arriving as the user types, and the search gets more and more specific, I'd like to be able to cancel a query that is still executing when the user enters more specific input (since its results would simply be discarded anyway).
If this were ordinary ADO.NET, I could obviously just use DbCommand.Cancel and be done with it, but we're using EF4 for our data access, and there doesn't appear to be an obvious way to cancel a query. Additionally, opening System.Data.Entity in Reflector and looking at EntityCommand.Cancel shows a discouragingly empty method body, despite the docs claiming that calling it would be passed on to the provider command's corresponding Cancel method.
I have considered simply letting the existing query run and spinning up a new context to execute the new search (and just disposing of the existing query once it finishes), but I don't like the idea of a single client having a multitude of open database connections running parallel queries when I'm only interested in the results of the most recent one.
All of this is leading me to believe that there's simply no way to cancel an EF query once it's been dispatched to the database, but I'm hoping that someone here might be able to point out something I've overlooked.
TL/DR Version: Is it possible to cancel an EF4 query that's currently executing?
It looks like you have found a bug in EF, but when you report it to MS it will be treated as a bug in the documentation. Anyway, I don't like the idea of interacting directly with EntityCommand. Here is my example of how to kill the current query:
var thread = new Thread((param) =>
{
    var currentString = param as string;
    if (currentString == null)
    {
        // TODO OMG exception
        throw new Exception();
    }
    AdventureWorks2008R2Entities entities = null;
    try // Don't use using because it can cause race condition
    {
        entities = new AdventureWorks2008R2Entities();
        ObjectQuery<Person> query = entities.People
            .Include("Password")
            .Include("PersonPhone")
            .Include("EmailAddress")
            .Include("BusinessEntity")
            .Include("BusinessEntityContact");
        // Improves performance of readonly query where
        // objects do not have to be tracked by context
        // Edit: But it doesn't work for this query because of includes
        // query.MergeOption = MergeOption.NoTracking;
        foreach (var record in query
            .Where(p => p.LastName.StartsWith(currentString)))
        {
            // TODO fill some buffer and invoke UI update
        }
    }
    finally
    {
        if (entities != null)
        {
            entities.Dispose();
        }
    }
});
thread.Start("P");
// Just for test
Thread.Sleep(500);
thread.Abort();
This is the result of playing with it for 30 minutes, so it should probably not be considered a final solution; I'm posting it to at least get some feedback on possible problems caused by this approach. The main points are:
The context is handled inside the thread.
The result is not tracked by the context.
If you kill the thread, the query is terminated and the context is disposed (connection released).
If you kill the thread before you start a new one, you should still be using only one connection.
I checked in SQL Profiler that the query is started and terminated.
Edit:
Btw., another approach to simply stop the current query is inside the enumeration:
public IEnumerable<T> ExecuteQuery<T>(IQueryable<T> query)
{
    foreach (T record in query)
    {
        // Handle stop condition somehow
        if (ShouldStop())
        {
            // Once you close enumerator, query is terminated
            yield break;
        }
        yield return record;
    }
}
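A hypothetical usage sketch, where ShouldStop() just reads a volatile flag that the UI sets whenever the user refines the search (the flag and method names are assumptions, not part of the snippet above):
private volatile bool _stopRequested;

private bool ShouldStop()
{
    return _stopRequested;
}

private void RunSearch(IQueryable<Person> query)
{
    _stopRequested = false;
    foreach (var person in ExecuteQuery(query))
    {
        // TODO: buffer the result and marshal it to the UI thread
    }
}

// Called when the user types again: the enumeration stops at the next
// iteration, which closes the enumerator and terminates the query.
private void OnSearchTextChanged()
{
    _stopRequested = true;
}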

Quickly Testing Database Connectivity within the Entity Framework

[I am new to ADO.NET and the Entity Framework, so forgive me if this question seems odd.]
In my WPF application, a user can switch between different databases at run time. When they do this, I want to do a quick check that the database is still available. What I have easily available is the ObjectContext. The test I am performing gets the count of total records in a very small table; if it returns a result, it passes, and if I get an exception, it fails. I don't like this test, but it seemed the easiest to do with the ObjectContext.
I have tried setting the connection timeout in the connection string and on the ObjectContext, and neither seems to change anything in the first scenario, while the second one is already fast, so any change there isn't noticeable.
Scenario One
If the connection was down before the first access, it takes about 30 seconds before I get the exception that the underlying provider failed.
Scenario Two
If the database was up when I started the application and I accessed it, and the connection then drops while in use, the test is quick and returns almost instantly.
I want the first scenario to be as quick as the second one.
Please let me know how best to resolve this, and if there is a better way to quickly test connectivity to a DB, please advise.
There really is no easy or quick way to resolve this. The ConnectionTimeout value is ignored by the Entity Framework. The solution I used is a method that checks whether a context is valid: pass in the location you wish to validate, then get the count from a known very small table. If this throws an exception, the context is not valid; otherwise it is. Here is some sample code showing this:
public bool IsContextValid(SomeDbLocation location)
{
    bool isValid = false;
    try
    {
        context = GetContext(location);
        context.SomeSmallTable.Count();
        isValid = true;
    }
    catch
    {
        isValid = false;
    }
    return isValid;
}
You may need to use context.Database.Connection.Open().
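If the 30-second wait in scenario one is the main pain point, one workaround is to pre-check reachability with a plain connection and a short login timeout before touching the context. A sketch, assuming SQL Server and that you can get at the store connection string (the method and parameter names are illustrative):
// using System.Data.SqlClient;
public bool CanConnect(string storeConnectionString)
{
    // Force a short login timeout so a dead server fails fast
    // instead of waiting the provider's default.
    var builder = new SqlConnectionStringBuilder(storeConnectionString)
    {
        ConnectTimeout = 3
    };
    try
    {
        using (var connection = new SqlConnection(builder.ConnectionString))
        {
            connection.Open();
            return true;
        }
    }
    catch (SqlException)
    {
        return false;
    }
}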