Dividing a problem into subproblems for the Akka actor model using Scala - scala

I am working on a project that involves finding a suffix string for a given prefix (for instance, "aaa") so that the resulting hashed value (SHA256) has a certain pattern, for example, starting with "123".
My method of finding the required hash is to generate the suffix strings in order: first try every string with one character, which means going through the printable ASCII characters, 95 trials in total. If the required hash is not found, try every string with two characters (95*95 trials), and so on.
I am also required to use the Akka actor model so that multiple actors work on this problem concurrently. (The number of actors is an input.)
Any ideas on how to divide the whole problem efficiently among multiple actors using this pattern? Or does anyone have a better solution to this problem?

You can group your workers under a BalancingPool, which automatically distributes work to idle actors, with the manager and worker actors using work pulling to prevent the mailbox from growing too large.
The manager accepts two message types: work-request and work-complete. A worker sends work-request when it has completed X tasks (where X is, say, 10), signalling the manager to add X more tasks to the BalancingPool. A worker sends work-complete when it has found an appropriate string prefix, at which point the manager sends a stop command to the BalancingPool, instructing it to terminate immediately and to terminate its workers as well. Aside from this, the manager is responsible for initially filling the BalancingPool with enough prefixes for the workers to test (say, 20 * workerCount), and for refilling the pool every time it receives a work-request message.
The workers accept one message type: test-prefix, which contains a string prefix for the worker to test. Each worker also needs to maintain a count of the messages it has received and processed; when this count reaches X, the worker resets it to zero and sends a work-request message to the manager.
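One way to make the manager's work units concrete, sketched here in plain Java without Akka: number every candidate suffix in the order described in the question (all one-character strings, then all two-character strings, and so on), so that a task is simply a range of indices. The class name SuffixSearch, its methods, and the chunk size of 1000 are hypothetical choices for illustration, not part of Akka, and the sketch assumes the hash is taken over prefix + suffix.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class SuffixSearch {
    static final int ALPHABET = 95;   // printable ASCII: ' ' (32) through '~' (126)
    static final char FIRST = ' ';

    // Maps a 0-based index to a candidate suffix: indices 0..94 are the
    // one-character strings, the next 95*95 are the two-character strings, etc.
    static String suffixAt(long index) {
        int length = 1;
        long block = ALPHABET;        // number of strings of the current length
        while (index >= block) {
            index -= block;
            block *= ALPHABET;
            length++;
        }
        char[] chars = new char[length];
        for (int i = length - 1; i >= 0; i--) {
            chars[i] = (char) (FIRST + index % ALPHABET);
            index /= ALPHABET;
        }
        return new String(chars);
    }

    // Lower-case hex SHA-256 of the UTF-8 bytes of s.
    static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // One work unit: tests the indices in [from, until) and returns the first
    // suffix whose hash starts with the pattern, or null if there is none.
    static String searchRange(String prefix, String pattern, long from, long until) {
        for (long i = from; i < until; i++) {
            String suffix = suffixAt(i);
            if (sha256Hex(prefix + suffix).startsWith(pattern)) {
                return suffix;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // A manager could hand out [0, 1000), [1000, 2000), ... to idle workers;
        // here the ranges are simply tried one after another.
        long chunk = 1000;
        for (long from = 0; from < 1_000_000; from += chunk) {
            String hit = searchRange("aaa", "123", from, from + chunk);
            if (hit != null) {
                System.out.println("suffix: \"" + hit + "\"");
                return;
            }
        }
    }
}
```

Because a range is just two numbers, the messages stay small, and the manager only has to remember the next unassigned index.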

Related

Quarkus Scheduled Records Processing mechanism Best Practice

What is the best practice for processing records from a DB on a schedule?
Situation:
A Microservice based on Quarkus - responsible for sending a communication to customers.
DB Table Having Customers Records (100000 customers)
Microservice is running on multiple nodes (4 nodes)
Expectation:
There should be a scheduler that runs every 5 sec
Fetches the records from DB where employee status = pending
Should be Multithreaded architecture.
Send email to employee email.
Problem 1:
The same scheduler running on multiple nodes picks the same records and processes them. How can we avoid this?
Problem 2:
The scheduler picks 100 records and processes them, but this takes more than 5 seconds, so the scheduler runs again and picks some of the same records. How can we avoid that?
If you are planning to run your microservices on Kubernetes, I would suggest using an external component as the scheduler and letting this component distribute the work over your microservices using messages or HTTP invocations.
In response to your questions:
You can use a locking strategy or "reserve" each row by adding a field that indicates that the record is being processed, and exclude all records that have this field set from your query. This way, when the scheduler fires, it reads a set of unreserved rows and uses a multithreaded approach to process them. By using a locking strategy (pessimistic or optimistic) you prevent other workers from marking the same row as reserved for themselves. After that, the thread that was able to commit the reservation processes the record and either updates its state or releases the "reserve" so that other workers can work on the record if needed.
You can always instruct your scheduler not to execute if a previous execution is still running:
@Scheduled(identity = "ProcessUpdateScheduler", every = "2s", concurrentExecution = Scheduled.ConcurrentExecution.SKIP)
You mainly have two approaches, among other possible ones:
Pulling (distributed mining or work distribution): Each instance of the microservice picks a random pending row, marks it as "processing" and commits the transaction. If the commit succeeds, this instance holds the right to process the record and continues with its execution; if not, it tries to retrieve a different row or simply exits and waits for the next invocation. This approach scales horizontally, because adding more workers increases your processing throughput.
Pushing (central distribution, distributed processing): You have two kinds of components. First, the "Distributor", which is executed by the scheduler and is responsible for picking rows to be processed and marking them as "pending processing"; these rows are forwarded via a messaging system or HTTP call to the "Processor". The Processor component receives a record as input and is responsible for processing this record completely or releasing the hold (the "pending processing" state).
Choose the one best suited to your scenario. If you go for the second option, you can have one or more distributors if necessary, but to increase your processing throughput you only need to scale the "Processor" workers.
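The "reserve" step of the pulling approach can be sketched as an atomic status change that only one worker can win. The sketch below is plain Java with an in-memory map standing in for the database table, which is an assumption made purely for illustration; against a real database the same idea is usually expressed as a conditional update that only succeeds while the row is still pending. The class RowReservation and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RowReservation {
    enum Status { PENDING, PROCESSING, DONE }

    // In-memory stand-in for the customer table: row id -> status.
    private final Map<Long, Status> rows = new ConcurrentHashMap<>();

    void addPending(long id) {
        rows.put(id, Status.PENDING);
    }

    // Atomically moves a row from PENDING to PROCESSING. Only one caller can
    // win for a given row; every other caller gets false and must skip it.
    boolean tryReserve(long id) {
        return rows.replace(id, Status.PENDING, Status.PROCESSING);
    }

    // Called after the email was sent successfully.
    void markDone(long id) {
        rows.replace(id, Status.PROCESSING, Status.DONE);
    }

    // Releases the reservation so another worker can pick the row up again.
    void release(long id) {
        rows.replace(id, Status.PROCESSING, Status.PENDING);
    }

    Status statusOf(long id) {
        return rows.get(id);
    }

    // What one scheduler run would do: reserve up to `limit` pending rows
    // and return only the ids this caller actually won.
    List<Long> reserveBatch(int limit) {
        List<Long> reserved = new ArrayList<>();
        for (Long id : rows.keySet()) {
            if (reserved.size() >= limit) {
                break;
            }
            if (tryReserve(id)) {
                reserved.add(id);
            }
        }
        return reserved;
    }
}
```

Two scheduler runs, on the same node or on different ones, then never receive the same row from reserveBatch, which addresses both problems in the question.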

Kafka Streams: Understanding groupByKey and windowedBy

I have the following code.
My goal is to group messages by a given key and a 10 second window. I would like to count the total amount accumulated for a particular key in the particular window.
I read that I need to have caching enabled and also have a cache size declared. I am also forwarding the wall clock to enforce the windowing to kick in and group the elements in two separate groups. You can see what my expectations are for the given code in the two assertions.
Unfortunately this code fails them and it does so in two ways:
it sends a result of the reduction operation each time it is executed as opposed to utilizing the caching on the store and sending a single total value
windows are not respected as can be seen by the output
Can you please explain how I am misunderstanding the mechanics of Kafka Streams in this case?

Is there a way of assigning an int number to different instances of stateless services?

I'm building a solution where we'll have a (service-fabric) stateless service deployed to K instances. This service is tasked with some workload (like querying) and I want to split the workload between them as evenly as I can - and I want to make this a dynamic solution, which means if I decide to go from K instances to N instances tomorrow, I want the workload splitting to happen in a way that it will automatically distribute the load across N instances now. I don't have any partitions specified for this service.
As an example -
Let's say I'd like to query a database to retrieve a particular chunk of the records. I have 5 nodes. I want these 5 nodes to retrieve different 1/5th of the set of records. This can be achieved through some query logic like (row_id % N == K) where N is the total number of instances and K is the unique instance_number.
I was hoping to leverage FabricRuntime.GetNodeContext().NodeId - but this returns a guid which is not overly useful.
I'm looking for a way where I can deterministically say it's instance number M out of N (I need to be able to name the instances 1..N), so I can set my querying logic accordingly. One of the requirements is that if an instance goes down, crashes, etc., it should still identify as the same instance ID when SF automatically restarts it, so that two or more nodes don't query the same set of results.
What is the best way of solving this problem? Is there a solution that involves pure configuration through ApplicationManifest.xml or ServiceManifest.xml?
There is no out of the box solution for your problem, but it can be easily done in many different ways.
The simplest way is using the Queue-Based Load Leveling pattern in conjunction with Competing Consumers pattern.
It consists of creating a queue and adding the work to it; each instance takes one message and processes that work. If an instance goes down and the message is not processed, the message goes back to the queue and another instance picks it up.
This way you don't have to worry about the number of instances running, failures and so on.
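The core of the competing-consumers idea can be sketched in plain Java, with an in-process queue and threads standing in for a real message broker and service instances (both are assumptions for illustration, as are the class and method names). The sketch only shows that each message is handed to exactly one consumer regardless of how many consumers run; it does not model the redelivery of unprocessed messages after a crash, which a real queue service would provide.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class CompetingConsumers {
    // Lets `instances` consumers drain one shared queue and returns how many
    // times each (distinct) item was processed; every count should be 1.
    static Map<Integer, Integer> process(List<Integer> items, int instances) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(items);
        Map<Integer, Integer> timesProcessed = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(instances);
        for (int n = 0; n < instances; n++) {
            pool.submit(() -> {
                Integer item;
                // poll() removes the head atomically, so no two consumers
                // ever receive the same message.
                while ((item = queue.poll()) != null) {
                    timesProcessed.merge(item, 1, Integer::sum);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("consumers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return timesProcessed;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = process(List.of(1, 2, 3, 4, 5, 6), 3);
        System.out.println(counts.size() + " items processed");
    }
}
```

Changing the number of consumers from K to N needs no change to the producer or to the items themselves.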
Regarding the work being put in the queue, it depends on whether you want to do batch processing or process item by item.
Item by item: you put one message in the queue for each item to be processed. This is a simple way to handle the work, and each instance processes one message at a time, or multiple messages in parallel.
In batch: you put a message that represents a list of items to be processed, and each instance processes that batch until it is completed. This is a bit trickier, because you might have to track the progress of the work so that, in case of failure, the next attempt can continue from where it stopped.
The queue approach is a reactive design: work needs to be put in the queue to trigger the processing. If you want a proactive approach and need to keep track of which work goes to whom, you would probably be better off using some other approach, like a leasing mechanism, where each instance acquires a lease that belongs to it until it releases the lease. This is more suitable when you work with partitioned data or another mechanism where you can easily split the load.
Regarding the issue with the ID, an option would be the InstanceId of the replica you are on, which you can reach via StatelessService.Context.InstanceId. It is not a sequential ID but a random number. It is better than using the node ID, because you might have multiple partitions on the same node and the IDs would conflict with each other.
If you decide to use named partitions, you could encode the order in the partition name instead, so that each partition has a sequential name.
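As a rough illustration of how a sequential partition name could drive the (row_id % N == K) logic from the question, here is a plain Java sketch. The naming scheme "worker-1" to "worker-N" and the class ShardSelector are hypothetical; nothing here is a Service Fabric API.

```java
import java.util.ArrayList;
import java.util.List;

class ShardSelector {
    // Parses the 1-based index from a name such as "worker-3"
    // (a hypothetical naming scheme chosen for this sketch).
    static int indexFromName(String partitionName) {
        return Integer.parseInt(partitionName.substring(partitionName.lastIndexOf('-') + 1));
    }

    // True when the row belongs to the instance with the given 1-based index
    // out of `total` instances.
    static boolean owns(long rowId, int index, int total) {
        return Math.floorMod(rowId, (long) total) == index - 1;
    }

    // Filters the row ids down to the ones owned by the named partition.
    static List<Long> rowsFor(String partitionName, int total, List<Long> rowIds) {
        int index = indexFromName(partitionName);
        List<Long> mine = new ArrayList<>();
        for (long id : rowIds) {
            if (owns(id, index, total)) {
                mine.add(id);
            }
        }
        return mine;
    }
}
```

Since the index comes from the name rather than from the node, a restarted instance of the same partition would select the same rows again.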
It is worth mentioning that Service Fabric has a limitation that doesn't allow services to have multiple replicas on the same node. Because of this limitation you might have to design your services with this in mind, otherwise you won't be able to scale out once the limit is reached. Also, the same thread has some discussion about approaches to processing multiple distributed items that might give you some ideas.

Akka actor pipeline and congested store actor

I am attempting to implement a message processing pipeline using actors. The steps of the pipeline include functions such as reading, filtering, augmentation and, finally, storage into a database.
Something similar to this: http://sujitpal.blogspot.nl/2013/12/akka-content-ingestion-pipeline-part-i.html
The issue is that the reading, filtering and augmentation steps are much faster than the storage step which results in having a congested store actor and an unreliable system.
I am considering the following option: have the store actor pull the processed and ready to store messages. Is this a good option? better suggestions?
Thank you
You may consider several options:
if the order of messages doesn't matter, just execute every storage operation inside a separate actor (or future). All storage operations will then run in parallel; I recommend using a separate thread pool for that. If some messages are amendments to others or participate in the same transaction, you may create separate actors only per messageId/transactionId to avoid pessimistic/optimistic lock problems (don't forget to kill such actors when the transaction ends or on timeout).
use bounded mailboxes (back-pressure): you then block new messages from your input while older ones are still unprocessed (for example, you may block the receiving thread until the message is acknowledged by the last actor in the chain). This moves the responsibility to the source system. It works quite well with JMS durables: messages are stored reliably on the JMS broker side until your system has finally processed them.
combine the previous two
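The effect of a bounded mailbox can be illustrated without Akka by a bounded queue between a fast producer and a slow consumer. The sketch below is plain Java and is only an analogy: ArrayBlockingQueue stands in for the mailbox, a sleeping thread for the store actor, and the class name BoundedStoreQueue is hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedStoreQueue {
    // Pushes `messages` items through a queue of the given capacity into a
    // deliberately slow consumer and returns the largest queue depth seen
    // by the producer, which can never exceed the capacity.
    static int run(int messages, int capacity) {
        BlockingQueue<Integer> mailbox = new ArrayBlockingQueue<>(capacity);
        AtomicInteger stored = new AtomicInteger();
        Thread store = new Thread(() -> {
            try {
                for (int i = 0; i < messages; i++) {
                    mailbox.take();
                    Thread.sleep(1);           // simulate a slow database write
                    stored.incrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        store.start();
        int maxDepth = 0;
        try {
            for (int i = 0; i < messages; i++) {
                mailbox.put(i);                // blocks while the mailbox is full
                maxDepth = Math.max(maxDepth, mailbox.size());
            }
            store.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        if (stored.get() != messages) {
            throw new IllegalStateException("some messages were not stored");
        }
        return maxDepth;
    }

    public static void main(String[] args) {
        System.out.println("max depth: " + run(50, 5));
    }
}
```

The producer is slowed down to the pace of the store step instead of letting the backlog grow without limit, which is the trade-off the bounded-mailbox option makes.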
I am using an approach similar to this: Akka Work Pulling Pattern (source code here: WorkPullingPattern.scala). It has the advantage that it works both locally & with Akka Cluster. Plus the whole approach is fully asynchronous, no blocking at all.
If your processed "objects" won't all fit into memory, or one of the steps is slow, it is an awesome solution. If you spawn N workers, then N "tasks" will be processed at one time. It might be a good idea to put the "steps" into BalancingPools also with parallelism N (or less).
I have no idea if your processing "pipeline" is sequential or not, but if it is, just a couple of hours ago I developed a type-safe abstraction based on the above plus the Shapeless library. A glimpse at the code, before it was merged with WorkPullingPattern, is here: Pipeline.
It takes any pipeline of functions (of properly matching signatures), spawns them in BalancingPools, creates Workers and links them to a master actor which can be used for scheduling the tasks.
The new Akka Streams (still in beta) has back-pressure. It's designed to solve this problem.
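You could also use a receive pipeline on actors:
import akka.actor.Actor
import akka.contrib.pattern.ReceivePipeline
import akka.contrib.pattern.ReceivePipeline.Inner
class PipelinedActor extends Actor with ReceivePipeline {
  // Increment
  pipelineInner { case i: Int ⇒ Inner(i + 1) }
  // Double
  pipelineInner { case i: Int ⇒ Inner(i * 2) }
  // Print whatever comes out of the interceptors
  def receive: Receive = { case any ⇒ println(any) }
}
// `actor` is an ActorRef to a PipelinedActor instance
actor ! 5 // prints 12 = (5 + 1) * 2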
http://doc.akka.io/docs/akka/2.4/contrib/receive-pipeline.html
It suits your needs best, since you have small pipelining tasks before/after the actor processes the message. It is also blocking code, but I believe that is fine for your case.

How to count discarded entities in a FIFO queue using Simulink?

I'm trying to model a single-queue, single-server simulation using Simulink in MATLAB. I installed it recently and I'm pretty new to it.
I've created a Time-Based Entity Generator (with an exponential arrival time), a FIFO queue with capacity of 50 entities and a Single Server with an exponential service time as shown in this image:
I wonder how I can count the number of entities that are generated but can't get into the FIFO because it's full (reached 50 entities already) and discard them.
This will probably not help you anymore, but I found a solution to this problem and thought I would share it for future reference. The way to solve it is using an Output Switch block with 2 ports. Connect the first to your FIFO queue and the second to a sink (or whatever you want your entities to go to) and select "First port that is not blocked" as a switching criterion. Picture here: http://i.imgur.com/qxmQS4s.png. Cheers!