How can I implement a job queue with a greedy-worker-pool in Java EE 6 in a correct way? - threadpool

I'm looking for a correct way to do the following in Java EE 6, if possible with vanilla Java EE 6 only.
I want to put a job in a job queue and have a fixed pool of worker objects, each of which pulls a job from the queue whenever it is idle.
The worker objects are in a fixed relation to a legacy system, so it is not possible to use one worker object in multiple threads for all jobs, and it is also not possible to instantiate a new worker object for every job.
The greedy worker pattern looks perfect, but that's only true for Java SE. In EE, I'm not sure what the correct way to implement this is.
Any suggestions?
Thanks in Advance.
M.

The first thing to note is that, according to the spec, you must not create and start your own threads in Java EE.
Concerning your setup, I'm not completely sure how it works in your system: do you have a fixed relation to clients all the time, or are there only jobs to execute from time to time, each of which does work for one client?
In both cases you can use stateful EJBs, so that one EJB serves a specific client system. In the first case, this EJB serves the client for the whole lifecycle; in the second case, you can start asynchronous EJBs to do the work.

Related

Vert.x standard verticle thread safety

I'm just going through vert.x documentation and got confused by the part about standard verticles:
No more worrying about synchronized and volatile any more, and you also avoid many other cases of race conditions and deadlock so prevalent when doing hand-rolled 'traditional' multi-threaded application development.
This is the link to it: https://vertx.io/docs/vertx-core/java/#_standard_verticles
Is this statement true only if I deploy only one instance of standard verticle, and if my vert.x application isn't clustered?
"only if I deploy only one instance of standard verticle, and if my vert.x application isn't clustered?"
Each deployed verticle instance is single threaded. So if you have 3 instances, each of them individually is single threaded.
"vert.x application isn't clustered?"
Not related. Clustering is across processes/machines; here we are talking about threads.

Spring Boot - running scheduled jobs as separate process

I have a Spring Boot application which also has a few scheduled jobs. I don't see any functional issue with the implementation. One of the jobs runs almost every second for real-time updates. There are other jobs as well.
I suspect there is a performance issue, especially when a long-running API call hits the controller.
// Heavy Job
@Scheduled(fixedRate = 10000)
public void processAlerts() {
}
@Scheduled(fixedDelayString = "${process.events.interval}")
public void triggerTaskReadiness() throws IOException {
    log.info("Trigger event processing job");
}
// Heavy Job to process data from different tables.
@Scheduled(fixedDelayString = "${app.status.interval}")
public void triggerUpdateAppHealth() throws IOException {
    log.info("Trigger application health");
}
Is it possible to run the jobs as a separate process? What are the best practices for a Spring Boot application with heavy jobs?
The question is way too general, IMO. It all depends on your resources and what exactly the job does.
Spring Boot provides a general-purpose scheduling mechanism but doesn't make any assumptions about the nature of the job.
All in all, it's true that when you run a heavy job, CPU, network, I/O, and other resources are consumed (again, depending on the actual code of your job).
If you run it externally, another process will basically consume the same resources, assuming it runs on the same server.
From the spring boot standpoint I can say the following:
It looks like the job deals with a database. In this case Spring Boot supports integration with DataSources, connection pooling, transaction management, and higher-level APIs like JPA or even Spring Data; you can also plug in frameworks like jOOQ. Bottom line, it makes the actual work with the database much easier.
You've stated MongoDB in the question tag; Spring Data also has MongoDB integration.
Bottom line: if you're running the job in an external process, you're kind of on your own (which doesn't mean it can't be done, it just means you lose all the goodies Spring has up its sleeve).
AppHealth: Spring Boot already provides an actuator feature that has a DB health endpoint; it also provides a way to create your own endpoints to check the health of any concrete resource (you implement it in code, so you are free to check however you want). Make sure you're using the right tool for the right job.
Regarding the controller API: if you're running traditional Spring MVC, Tomcat has a thread pool to serve the API, so from a thread-management point of view the job threads won't compete with the controller threads. However, they'll likely share the same DB connection, so it can become a bottleneck.
Regarding the implementation of @Scheduled: by default there is one thread to serve all the @Scheduled jobs, which might be insufficient.
You can alter this behavior by creating your own taskScheduler:
@Bean(destroyMethod = "shutdown")
public Executor taskScheduler() {
    return Executors.newScheduledThreadPool(10); // allocate 10 threads to run @Scheduled jobs
}
You might be interested to read this discussion
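To make the effect of the scheduler's pool size concrete, here is a minimal sketch in plain Java, without Spring. It assumes only that the scheduler is backed by a ScheduledExecutorService, as in the taskScheduler bean; the class and method names (SchedulerThreadDemo, threadsUsedForTwoJobs) are hypothetical and exist only for illustration. It schedules two jobs and counts how many distinct threads ran them.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Spring.
class SchedulerThreadDemo {

    // Schedules two jobs and reports how many distinct threads executed them.
    static int threadsUsedForTwoJobs(int poolSize) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(poolSize);
        Set<String> threadNames = ConcurrentHashMap.newKeySet();
        // With two or more threads, each job waits here until both jobs overlap.
        // With one thread the latch count is 1, so jobs simply run one after another.
        CountDownLatch overlap = new CountDownLatch(Math.min(poolSize, 2));
        CountDownLatch done = new CountDownLatch(2);

        Runnable job = () -> {
            threadNames.add(Thread.currentThread().getName());
            overlap.countDown();
            try {
                overlap.await(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            done.countDown();
        };

        scheduler.schedule(job, 0, TimeUnit.MILLISECONDS);
        scheduler.schedule(job, 0, TimeUnit.MILLISECONDS);
        try {
            done.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return threadNames.size();
    }
}
```

With a pool size of 1, both jobs share a single thread, so a long-running job delays every other job. With a larger pool they can run side by side.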
Spring @Scheduled always works "within the boundaries" of one Spring-managed application context. So if you decide to scale out your instances, each and every instance will run the "scheduled" code and will execute the heavy jobs.
It's possible to use Quartz, which Spring can be integrated with. In clustered mode you can configure it to pick one node each time to execute the job, but since you're planning to run every second, I doubt Quartz will work well enough.
A general observation: running a set of "heavy" jobs, as you say, doesn't really fit with "running every second". Heavy jobs tend to last much longer than 1 second, so doing this will eventually occupy all the resources and you won't be able to run more jobs.

why/how deploying multiple instances of a verticle

While reading a document about vert-x mongo client I came across following line:
In most cases you will want to share a pool between different client instances.
E.g. you scale your application by deploying multiple instances of your verticle and you want (...)
It is the last line that caught my attention. I didn't know I should scale my application by deploying multiple instances of the verticle. I plan to make a MongoDbVerticle class that will listen for queries on the event bus.
Questions are:
Am I really supposed to deploy this verticle several times?
How many times? Based on what criteria? Or have I misunderstood some basic concept? I'm new to vert-x, so that might well be.
What happens is that vertx will route your request to one of the verticles that you have defined. Since vertx can be deployed over several machines, you can in practice load balance your verticles that have long-running operations (such as talking to a database or writing to a file).
If I remember correctly, vertx uses round robin to route the requests. That means that if you have two mongo verticles, a and b, it will first select a, then b, then a again, and so on.
To deploy a verticle you just use the command vertx run <verticle>.
Note: This is not as simple if you run your vertx instance as a fat-jar.
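The round-robin selection described above can be illustrated with a small standalone Java sketch. This is not Vert.x code and says nothing about how Vert.x implements routing internally; RoundRobin is a hypothetical class that only shows the selection order for a list of targets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: cycles through a fixed list of targets in order.
class RoundRobin<T> {
    private final List<T> targets;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobin(List<T> targets) {
        if (targets.isEmpty()) {
            throw new IllegalArgumentException("at least one target is required");
        }
        this.targets = new ArrayList<>(targets);
    }

    // Returns the next target; floorMod keeps the index valid after int overflow.
    T pick() {
        int index = Math.floorMod(next.getAndIncrement(), targets.size());
        return targets.get(index);
    }
}
```

With two targets a and b, successive calls to pick() return a, b, a, and so on, which matches the behavior described for two mongo verticles.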

Process work in parallel with non-threadsafe function in scala

I have a lot of work (thousands of jobs) for a Scala application to process. Each piece of work is the file name of a 100 MB file. To process each file, I need to use an extractor object that is not thread safe (I can have multiple copies, but copies are expensive, and I should not make one per job). What is the best way to complete this work in parallel in Scala?
You can wrap your extractor in an Actor and send each file name to the actor as a message. Since an instance of an actor will process only one message at a time, thread safety won't be an issue. If you want to use multiple extractors, just start multiple instances of the actor and balance between them (you could write another actor to act as a load balancer).
The extractor actor(s) can then send extracted files to other actors to do the rest of the processing in parallel.
Don't make 1000 jobs, but make 4x250 jobs (targeting 4 threads) and give one extractor to each batch. Inside each batch, work sequentially. This might not be optimal parallel-wise, since one batch might finish earlier but it is very easy to implement.
Probably the correct (but more complicated) solution would be to make a pool of extractors, where jobs take extractors from and put them back after finishing.
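A pool of extractors along these lines could be sketched as follows, using the plain Java concurrency classes, which are also callable from Scala. Extractor here is a hypothetical stand-in for the real, expensive, non-thread-safe object, and ExtractorPool and withExtractor are illustrative names rather than an existing library API.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Hypothetical stand-in for the expensive, non-thread-safe extractor.
class Extractor {
    String extract(String fileName) {
        return "extracted:" + fileName;
    }
}

// Fixed-size pool: a job borrows an extractor and always gives it back.
class ExtractorPool {
    private final BlockingQueue<Extractor> idle;

    ExtractorPool(int size) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(new Extractor());
        }
    }

    // Blocks until an extractor is free, runs the job, then returns the extractor.
    <T> T withExtractor(Function<Extractor, T> job) {
        Extractor extractor;
        try {
            extractor = idle.take();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for an extractor", ex);
        }
        try {
            return job.apply(extractor);
        } finally {
            idle.add(extractor); // returned even if the job throws
        }
    }

    int idleCount() {
        return idle.size();
    }
}
```

Because an extractor is held by at most one caller at a time, any number of threads can call withExtractor concurrently without sharing an instance.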
I would make a thread pool, where each thread has an instance of the extractor class, and instantiate just as many of these threads as it takes to saturate the system (based on CPU usage, IO bandwidth, memory bandwidth, network bandwidth, contention for other shared resources, etc.). Then use a thread-safe work queue that these threads can pull tasks from, process them, and iterate until the container is empty.
Mind you, there should be one or several libraries in just about any modern language that implements exactly this. In C++, it would be Intel's Threading Building Blocks. In Objective-C, it would be Grand Central Dispatch.
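As a rough sketch of this approach in plain Java (usable from Scala as well), each worker thread below owns one extractor and keeps pulling file names from a shared thread-safe queue until it is empty. FileExtractor and GreedyWorkers are hypothetical names, and the extract method is a placeholder for the real work.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for the non-thread-safe extractor.
class FileExtractor {
    String extract(String fileName) {
        return "done:" + fileName;
    }
}

class GreedyWorkers {

    // Processes all file names with a fixed number of worker threads.
    static List<String> processAll(List<String> fileNames, int workers) {
        Queue<String> jobs = new ConcurrentLinkedQueue<>(fileNames);
        Queue<String> results = new ConcurrentLinkedQueue<>();
        List<Thread> threads = new ArrayList<>();

        for (int i = 0; i < workers; i++) {
            Thread worker = new Thread(() -> {
                // One extractor per thread, never shared between threads.
                FileExtractor extractor = new FileExtractor();
                String job;
                while ((job = jobs.poll()) != null) {
                    results.add(extractor.extract(job));
                }
            });
            threads.add(worker);
            worker.start();
        }

        for (Thread worker : threads) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }

        // Completion order is nondeterministic, so sort for a stable result.
        List<String> sorted = new ArrayList<>(results);
        Collections.sort(sorted);
        return sorted;
    }
}
```

The number of workers is the tuning knob: it bounds both the number of extractor copies and the degree of parallelism.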
It depends: what's the relative amount of CPU consumed by the extractor for each job?
If it is very small, you have a classic single-producer/multiple-consumer problem, for which you can find lots of solutions in different languages. For Scala, if you are reluctant to start using actors, you can still use the Java API (Runnable, Executors and BlockingQueue are quite good).
If it is a substantial amount (more than 10%), your app will never scale with a multithreaded model (see Amdahl's law). You may prefer to run several processes (several JVMs) to obtain thread safety, and thus eliminate the non-sequential part.
First question: how quickly does the work need to be completed?
Second question: would this work be isolated to a single physical box, or what are your upper bounds on computational resources?
Third question: does the work that needs doing to each individual "job" require blocking and is it serialised or could be partitioned into parallel packets of work?
Maybe think about a distributed model, designing from the outset with a mind to scaling out across multiple nodes (actors, remote refs and all that), but first try to keep your logic simple and easy, so serialised. Don't just think in terms of a single box.
Most answers here seem to dwell on the intricacies of spawning thread pools and executors and all that stuff - which is fine, but be sure you have a handle on the real problem first, before you start complicating your life with lots of thinking around how you manage the synchronisation logic.
If a problem can be decomposed, then decompose it. Don't overcomplicate it for the sake of doing so; keeping it simple leads to better-engineered code and fewer sleepless nights.

Jboss Messaging WorkerThread# what are these threads?

I am load testing a JBoss Messaging install with 5 producers producing 100,000 100k messages. I am seeing significant bottlenecking. When I monitor the profiler, I see there are 15 threads named WorkerThread#. These threads are allocated 100% with no waits. I think they may be related. Does anyone know what function these threads serve and whether there is a thread pool setting? I am using a supp
JBoss Enterprise Application Server 4.3 CP08
JBoss Enterprise Service Bus 4.4 CP04
JBoss Transactions 4.2.3._CP07
JBoss Messaging 1.4.0.SP3-CP09
JBoss Rules 4.0.7
JBoss jBPM 3.2.9
JBoss Web Services 2.0.1.SP2_CP07
I've figured it out. It's not a pool of threads. In the jboss-messaging.sar/remoting-bisocket.xml file, which defines the remoting connector for JBoss Messaging, you see a few values, mainly clientMaxPool, maxPoolSize, and numAcceptThreads.
In remoting, when a socket is established, threads are created to monitor that socket, up to the value of "numAcceptThreads". All such a thread does is read data from the socket and hand it off to a thread in the client pool (governed by maxPoolSize).
The threads called workerThread#[] are the accept threads. The reason I see more of them when I create more producers is that the bisocket transport for JBoss Messaging apparently creates three sockets. Initially there are 3, but when I create 5 producers that number increases to 15 (or 5*3 for those not mathematically inclined :)). The reason they are 100% allocated is that, while I am sending all those messages, the threads read from the socket, hand off to a server thread, and go back to reading from the socket (where there is always data).
So the short answer is that there is no pool to govern these threads. You can have more than 1 accept thread, but it would almost never make sense, because its job is so minimal: read the data, hand it off, read the data... Having more threads would just add synchronization overhead.
This is from http://download.oracle.com/javase/tutorial/uiswing/concurrency/worker.html; hope it helps.
When a Swing program needs to execute a long-running task, it usually uses one of the worker threads, also known as the background threads. Each task running on a worker thread is represented by an instance of javax.swing.SwingWorker. SwingWorker itself is an abstract class; you must define a subclass in order to create a SwingWorker object; anonymous inner classes are often useful for creating very simple SwingWorker objects.