Problems with Chronicle Map on Windows - memory-mapped-files

I am trying to use ChronicleMap for my index structure, this seems to work fine on Linux but when I am running my JUnit test on Windows (which is my development environment), I keep getting an error: java.io.IOException: Unable to wait until the file is ready, likely the process which created the file crashed or hung for more than 1 minute.
Here's the code snippet that is problematic:
File file = new File(idxFullPath);
ChronicleMap<Integer, int[]> idx =
ChronicleMapBuilder.of(Integer.class, int[].class)
.averageValue(getSampleIdxList())
.entries(IDX_MAX_SIZE)
.createPersistedTo(file);
The following exception is thrown:
[2016-06-17 14:32:47.779] ERROR main com.mcm.op.persistence.Persistence ERR java.io.IOException: Unable to wait until the file is ready, likely the process which created the file crashed or hung for more than 1 minute
at net.openhft.chronicle.map.ChronicleMapBuilder.waitUntilReady(ChronicleMapBuilder.java:1520)
at net.openhft.chronicle.map.ChronicleMapBuilder.openWithExistingFile(ChronicleMapBuilder.java:1583)
at net.openhft.chronicle.map.ChronicleMapBuilder.createWithFile(ChronicleMapBuilder.java:1444)
at net.openhft.chronicle.map.ChronicleMapBuilder.createPersistedTo(ChronicleMapBuilder.java:1405)
at com.mcm.op.persistence.Persistence.initIdx(Persistence.java:131)
at com.mcm.op.persistence.Persistence.init(Persistence.java:177)
at com.mcm.op.persistence.PersistenceTest.initPersist(PersistenceTest.java:47)
at com.mcm.op.persistence.PersistenceTest.setUp(PersistenceTest.java:29)

Indeed, it is likely that the process which created the file has crashed, or stopped terminated debugging, or something like that.
If it's ok to have a fresh index from unit test-to-test runs, I recommend to try either delete the file at idxFullPath before creating a Chronicle Map, or randomize the mapping file via something like File.createTempFile(). In either case File.deleteOnExit() could appear to be helpful.
If you want to keep the index between unit test runs and always use the same file at idxFullPath for persistence, you could try to use builder.createOrRecoverPersistedTo() instead of plain createPersistedTo() map creation method. However this might slow down the map creation.

Related

TypeError--using slurm queue to submit pyiron jobs

I'm facing some issues while running pyiron jobs on my HPC via the pysqa adapter. I had accidentally erased the main pyiron directory containing pyiron, projects and resources folders. I had copied all the three from another cluster. The only thing that I think will cause problem is sqlite.db file in the resources folder. Previously, I had no issues running interactive VASP jobs through the adapter. I'm guessing something happened after the deletion incident.
The pyiron version I'm using is: 0.2.17
Here is a minimal example using an Interactive vasp job that I have tried:
from pyiron import Project
pr = Project('Al-test')
structure = pr.create_structure('Al', 'fcc', 4.05)
pr.remove_jobs(recursive=True)
from pysqa import QueueAdapter
sqa = QueueAdapter(directory='~/pyiron/resources/queues/')
sqa.queue_view
pr.job_table()
job = pr.create_job(pr.job_type.Vasp, 'job_int')
job.structure = structure
job.server.run_mode.interactive = True
job.executable.executable_path = '~/pyiron/resources/vasp/bin/run_vasp_5.4.4_std_mpi.sh'
job.input.incar['NCORE']=4
job.server.queue = 'slurm'
job.server.cores=16
job.server.view_queues()
sqa.get_queue_status()
job.run(run_again=True)
end of the error log:
~/pyiron/pyiron/pyiron/base/server/generic.py in queue_id(self, qid)
208 qid (int): queue ID
209 """
--> 210 self._queue_id = int(qid)
211
212 #property
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
Some inputs/feedback on this would be greatly appreciated.
Thanks!
We updated the queuing system interface in pyiron 0.3.X you can read more about this here:
https://pyiron.org/news/releases/2020/09/06/pyiron-0-3-X-HPC-release.html
For pyiron 0.3.X we have a detailed installation guide available on readthedocs.org:
https://pyiron.readthedocs.io/en/latest/source/installation.html#remote-hpc-cluster
So I highly recommend updating to pyiron 0.3.13.
Apart from this the error message basically says that the submission was not successful. If you navigate to the jobs working directory job.working_directory you should find a run_queue.sh script in the working directory. This is the script pyiron is using to submit the job to the queuing system. You can try to submit it manually using sbatch run_queue.sh this should print the queue id if successful and otherwise the error message from your queuing system.

Remote debug an apache beam job on flink cluster

I am running a streaming beam job on a flink cluster where I am getting the following exception.
Caused by: org.apache.beam.sdk.util.UserCodeException: org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator
at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:34)
at org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)
at org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:218)
at org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:183)
at org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:62)
at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:544)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:302)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:596)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:554)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:534)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:718)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:696)
at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:941)
at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.output(DoFnOperator.java:895)
at org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:252)
at org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:74)
at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:576)
at org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:71)
at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:139)
Caused by: org.apache.beam.sdk.util.UserCodeException: java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds have the same scheme, but received alluxio, file.
at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:34)
at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn$DoFnInvoker.invokeProcessElement(Unknown Source)
at org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:218)
at org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:183)
at org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:62)
at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:544)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:579)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:554)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:534)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:718)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:696)
at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:941)
at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.output(DoFnOperator.java:895)
at org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:252)
at org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:74)
at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:576)
at org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:71)
at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:139)
at org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)
at org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:218)
at org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:183)
at org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:62)
at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:544)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:302)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds have the same scheme, but received alluxio, file.
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
at org.apache.beam.sdk.io.FileSystems.validateSrcDestLists(FileSystems.java:428)
at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:308)
at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:755)
at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:850)
The streaming job is getting data from the apache pulsar source and writing output data onto an Alluxio data lake in parquet file format. I am using Spotify's scio for writing this job in Scala. A little code chunk to emphasize what I am trying to achieve:
pulsarSource
.open(sc)
.withFixedWindows(Duration.standardSeconds(windowDuration))
.toSinkTap(sink)
From the exception, I can see that source and output paths should have the same URI scheme but I don't know how it is happening because I am using an alluxio path as an output directory. There are some temp directories that are being created on alluxio output directory but after the WindowDuration, when the output file is being created, this exception happens.
I had a doubt that temp location might be configured by default to the local filesystem, so I did set that to output directory path (alluxio dir path) but it didn't change anything.
sc.options.setTempLocation(outputDir)
I want to do remote debugging in order to figure out the issue. I have tried this document to do remote debugging on the task executor node, but once my IntelliJ IDE connects with the node, I don't get hit on my breakpoint.
Can someone suggest, how can I debug or get more information about this issue.
Thanks
Remote debugging might be quite hard, but let's try this first: Make sure you connect to the task manager and not job manager (easy to verify with thread names). Then make sure to have a high number of retries, such that you don't miss the task execution, as attaching the debugging might take a while.
It's also helpful to double check that line numbers of the stack trace match your code version in the IDE. If Flink/Beam is preinstalled, they might run a slightly different version and your break point is void. Just paste the stack trace in your IDE and check if each line matches the expectation. Finally, add a few more breakpoint at central places like org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202) to make sure if the setup is working at all.
However, remote debugging is usually not the recommended option for big data systems. You'd first ensure locally that most of the things work on their own with some IT tests and local runners. Then, you might want to add e2e tests with docker containers and a local mini cluster. Additionally, you'd add a ton of logging statements, which you can turn on and off with your logging configuration. Similarly, if you set logging level to debug, the existing log statements of the frameworks might already be enough to gain some insights. One important thing that you should always look at is the generated topology that you can see in Web UI. Maybe it already tells you the paths in question.

OrientDB 2.1.9 crashes with OStorageException EOFException when running SQL script in console

I've been using my SQL database initialization script for a while, but it seems that recently the database crashes in the middle of the execution and I don't know why, but here's some details:
I am running OrientDB on Ubuntu 14 Trusty x64 (via Vagrant)
It always seems to crash while the script attempts to create a UNIQUE_HASH_INDEX, but doesn't always crash at the same UNIQUE_HASH_INDEX instruction
The script creates a lot of vertices and edges, but for example, it will crash here (see line with UNIQUE_HASH_INDEX):
CREATE CLASS Channel EXTENDS V;
CREATE PROPERTY Channel.version LONG;
CREATE PROPERTY Channel.channelId STRING;
CREATE INDEX Channel.uq_channelId ON Channel(channelId) UNIQUE_HASH_INDEX;
The database crashes entirely with the following error:
Creating index... Error:
com.orientechnologies.orient.core.exception.OStorageException: Error
on executing command: sql.create INDEX Channel.uq_channelId ON
Channel(channelId) UNIQUE_HASH_INDEX
Error: java.io.EOFException
Looking at the log files, the only hint I get are the last two lines:
2016-01-14 17:17:05:437 INFO Received signal: SIGTERM [OSignalHandler]
2016-01-14 17:17:05:454 INFO Received signal: SIGTERM [OSignalHandler]
How can I resolve this issue, or at least get better hints as to what is making the database crash?
I also test with OrientDB 2.1.6, as I was running the older version initially. Same problem.
Sorry, false alarm -- this is a Vagrant issue, not an OrientDB issue. Running the exact same script on a 32bit instance instead of 64bit solved my problem, and installing the same script on a real 64bit server also works.

Error: No chunks found for a file with Mongo gridFS

An error has crashed my application server and I can't seem to figure out what could be causing the issue. My application is built with Meteor and hosted on modulus.io. Here are my application logs:
Error: no chunks found for file, possibly corrupt
at /mnt/data/2/node_modules/mongodb/lib/mongodb/gridfs/gridstore.js:817:20
at /mnt/data/2/node_modules/mongodb/lib/mongodb/gridfs/gridstore.js:594:7
at /mnt/data/2/node_modules/mongodb/lib/mongodb/cursor.js:758:35
at Cursor.close (/mnt/data/2/node_modules/mongodb/lib/mongodb/cursor.js:989:5)
at Cursor.nextObject (/mnt/data/2/node_modules/mongodb/lib/mongodb/cursor.js:758:17)
at commandHandler (/mnt/data/2/node_modules/mongodb/lib/mongodb/cursor.js:727:14)
at /mnt/data/2/node_modules/mongodb/lib/mongodb/db.js:1916:9
at Server.Base._callHandler (/mnt/data/2/node_modules/mongodb/lib/mongodb/connection/base.js:448:41)
at /mnt/data/2/node_modules/mongodb/lib/mongodb/connection/server.js:481:18
at [object Object].MongoReply.parseBody (/mnt/data/2/node_modules/mongodb/lib/mongodb/responses/mongo_reply.js:68:5)
[2015-03-29T22:05:57.573Z] Application CRASH detected. Exit code 8.
Most probably this is a mongo bug with gridfs (has been fixed)
Writing two or more different files concurrently from different node
processes using the GridStore.writeFile command results in some files
not being correctly written (ending up with a number of corrupt files
in the gridstore). Ending up with corrupt files even with all
writeFile calls being successfull and no indication of error.
writeFile occasionally fails with error "chunks out of order", but
this happens very rarely (something like 1 failed writeFile for 100
corrupt files or more).
Based on the comments with in a discussion, the problem will be fixed if you will update mongo (the gridfs files should be removed, as they are corrupt).
Error: no chunks found for file, possibly corrupt
at /home/developer/rundir/node_modules/mongoose/node_modules/mongodb/lib/mongodb/gridfs/gridstore.js:808:20
at /home/developer/rundir/node_modules/mongoose/node_modules/mongodb/lib/mongodb/gridfs/gridstore.js:586:5
at /home/developer/rundir/node_modules/mongoose/node_modules/mongodb/lib/mongodb/collection/query.js:164:5
at /home/developer/rundir/node_modules/mongoose/node_modules/mongodb/lib/mongodb/cursor.js:778:35
I had a similar occurance, but it ended up the file sought in a GFS read stream had actually been deleted - so in my case it wasn't corrupt, but gone! Above is a log from when that happened.

Prefetching information in windows XP fails and abort the launching of my application

I compile my application on a windows XP SP3 machine. When it compiles, I try to lauch it, and windows replies me back with :
Unable to start program 'xx'. This
application has failed to start
because the application configuration
is incorrect. Reviex the manifest file
for possible errors. Reinstalling the
application may fix this problem. For
more details , please see the
application event log.
Trying to copy DLL files didn't help (see my previous question if you want).
I've launch Process monitor from sysinternals then.
I try here to summarise the report while it is not very long.
The process starts, then its first thread. Following is calls to :
QueryNameInformationFile() of my exe file => SUCCESS
Load Image() of my exe file => SUCCESS
Load Image() of ntdll.dll => SUCCESS
QueryNameInformationFile() if my exe file => SUCCESS
CreateFile() Try to create it un C:\WINDOWS\Prefetch\blahbla.pf => NAME NOT FOUND
then the thread and the process exits.
I've add my users with full control on that folder (C:\WINDOWS\prefetch), but did not help.
How to make it work? I feel if I go through this step, my application will work as expected.
Edit: I add procmon details about the error:
18:13:40,4305346 xxx.exe 3172 CreateFile C:\WINDOWS\Prefetch\XXX.EXE-1FA9609A.pf NAME
NOT FOUND Desired Access: Generic
Read, Disposition: Open, Options:
Synchronous IO Non-Alert, Attributes:
n/a, ShareMode: None, AllocationSize:
n/a
Is Task Scheduler running on the PC? A way to repair Prefetch is detailed here, if that is causing the problem :
http://members.rushmore.com/~jsky/id14.html