How to implement the job migration fault tolerance technique in CloudSim?

I want to implement job migration for failed tasks in CloudSim. Is there any source code available on GitHub or elsewhere?

CloudSim does not implement Cloudlet migration or fault injection.
However, if you want a modern, full-featured, state-of-the-art and easier-to-use fork of CloudSim, try CloudSim Plus. It has a fault-injection module that doesn't migrate Cloudlets, but when a VM fails due to a failure on its Host, a clone of that VM is created from a snapshot and the Cloudlets are restarted.
To inject the failures and enable VM clones to be created, you can execute the following code:
long seed = System.currentTimeMillis();
//Creates a random number generator following the Poisson distribution
//MEAN_FAILURE_NUMBER_PER_HOUR is a constant you have to create
//to define the mean number of Host failures per hour.
ContinuousDistribution poisson = new PoissonDistr(MEAN_FAILURE_NUMBER_PER_HOUR, seed);
//The object that will inject random failures into Host's PEs.
HostFaultInjection fault = new HostFaultInjection(datacenter0, poisson);
fault.setMaxTimeToFailInHours(800);
//Defines a cloner function (cloneVm method) that will be called
//when a VM from a given broker fails due to a Host failure.
//The call to addVmCloner also defines a method to clone the cloudlets
//running inside the failed VM.
fault.addVmCloner(broker, new VmClonerSimple(this::cloneVm, this::cloneCloudlets));
To check and understand the complete example, follow this link and the faultinjection package documentation in CloudSim Plus.
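The cloneVm and cloneCloudlets methods referenced by the method references above have to be written inside your own simulation class. Below is a minimal sketch of what they might look like; the configuration constants and their values are assumptions for illustration, and the package names follow pre-7.x CloudSim Plus releases (from 7.0 on the prefix changed to org.cloudsimplus), so adapt them to the version you use:
import java.util.ArrayList;
import java.util.List;
import org.cloudbus.cloudsim.cloudlets.Cloudlet;
import org.cloudbus.cloudsim.cloudlets.CloudletSimple;
import org.cloudbus.cloudsim.utilizationmodels.UtilizationModelFull;
import org.cloudbus.cloudsim.vms.Vm;
import org.cloudbus.cloudsim.vms.VmSimple;

//Example configuration constants (assumed values; adjust to your scenario).
private static final double VM_MIPS = 1000;
private static final long VM_PES = 2;
private static final long VM_RAM = 512;
private static final long VM_BW = 1000;
private static final long VM_SIZE = 10000;
private static final int CLOUDLETS_PER_VM = 2;
private static final long CLOUDLET_LENGTH = 10000;
private static final int CLOUDLET_PES = 1;

//Creates a VM with the same configuration as the failed one,
//to be used as its clone.
private Vm cloneVm(Vm vm) {
    final Vm clone = new VmSimple(VM_MIPS, VM_PES);
    clone.setRam(VM_RAM).setBw(VM_BW).setSize(VM_SIZE);
    return clone;
}

//Creates new Cloudlets to replace the ones that were running
//inside the failed VM; they will be submitted to the VM clone.
private List<Cloudlet> cloneCloudlets(Vm vm) {
    final List<Cloudlet> clones = new ArrayList<>();
    for (int i = 0; i < CLOUDLETS_PER_VM; i++) {
        final Cloudlet cloudlet = new CloudletSimple(CLOUDLET_LENGTH, CLOUDLET_PES);
        cloudlet.setUtilizationModel(new UtilizationModelFull());
        clones.add(cloudlet);
    }
    return clones;
}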

Related

Azure Data Factory not Using Data Flow Runtime

I have an Azure Data Factory with a pipeline that I'm using to pick up data from an on-premise database and copy to CosmosDB in the cloud. I'm using a data flow step at the end to delete documents from the sink that don't exist in the source.
I have 3 integration runtimes set up:
AutoResolveIntegrationRuntime (default set up by Azure)
Self hosted integration runtime (I set this up to connect to the on-premise database so it's used by the source dataset)
Data flow integration runtime (I set this up to be used by the data flow step with a TTL setting)
The issue I'm seeing is that when I trigger the pipeline, the AutoResolveIntegrationRuntime is the one being used, so I'm not getting the optimisation that I need from the Data flow integration runtime with the TTL.
Any thoughts on what might be going wrong here?
In my experience, only the AutoResolveIntegrationRuntime (the default set up by Azure) supports the optimization.
When we choose to run the data flow on a non-default integration runtime, the optimization isn't available.
And once the integration runtime is created, we can't change its settings.
The Data Factory documentation doesn't say much about this. When I ran the pipeline, I found that the data flow runtime wasn't used.
That means that no matter which integration runtime you used to connect to the dataset, the data flow will always use the default Azure integration runtime.
A self-hosted integration runtime (SHIR) doesn't support data flow execution.

NServiceBus disposing Autofac Container

Here goes - bear with me:
Two Autofac 4.2.1 Containers:
One in an Asp.NET 4.6.1 WebApi project
One in an NServiceBus 6 host
Both possess an IJobService reference to the JobService (which saves jobs to DynamoDB).
Run the project in Visual Studio...
If I make a WebApi request into the first JobService, it succeeds, inserts a record into DynamoDB, and drops a command on the bus for NServiceBus to pick up.
During the processing of the Saga, NServiceBus makes a call to the JobService again (presumably on the second container) to save progress. This second call fails to insert into DynamoDB because the lifetime scope has been disposed. If I try to resolve anything from IComponentContext I get:
Instances cannot be resolved and nested lifetimes cannot be created from this LifetimeScope as it has already been disposed.
The NServiceBus host is running AsA_Server and I register the container in the Customize method of IConfigureThisEndPoint.
Any pointers on how to see where the lifetime is getting dumped or if it's mysteriously picking the wrong IJobService somehow?
Just to close this one out: we ended up redesigning the solution and moving any web service calls out to their own handlers. That was based on the advice found here: http://docs.particular.net/nservicebus/sagas. That change resolved the issue one way or another.
Specifically, this guidance:
Other than interacting with its own internal state, a saga should not access a database, call out to web services, or access other resources - neither directly nor indirectly by having such dependencies injected into it.

How to cancel or remove Persistent EJBTimers

When we use a persistent EJB timer with @Schedule and persistent=true, deploy it to a cluster, and then change the actual schedule within @Schedule and re-deploy to the cluster, does the original schedule get replaced with the new one (removed and added with the new parameters), or do both schedules remain active (keeping in mind that persistent=true is set)?
This is what I have read so far: each scheduler instance has a unique JNDI name, and @Schedule automatically creates a timer through application deployment, so it would be better to remove the automatically created EJB timer or cancel the original schedule to avoid trouble. But I don't know how to cancel the original schedule programmatically, or whether that would need to be done by the WebSphere admins if both the original and the changed schedules remain active.
Also, from this document, the removeAutomaticEJBTimers command is used to remove timers from a specified scheduler, but that also seems to be in the realm of a WebSphere admin, not a developer.
How can a developer programmatically cancel an automatic EJB timer created by using the @Schedule annotation?
I am using Java EE 6 with WebSphere 8.5 and EJB 3.1.
Do the following to remove persisted EJB timers:
Delete the directory jboss-home\standalone\data\timer-service-data\{yourprojectname}.{servicename}
See the page Creating timers using the EJB timer service:
The application server automatically removes persistent automatic timers from the database when you uninstall the application while the server is running. If the application server is not running, you must manually delete the automatic timers from the database. Additionally, if you add, remove, or change the metadata for automatic timers while the server is not running, you must manually delete the automatic timers.
I have the following class:
@Stateless
@LocalBean
public class HelloBean {
    @Schedule(persistent=true, minute="*", hour="*", info="myTimer")
    public void printHello() {
        System.out.println("### hello");
    }
}
When I install it to the server, I can find related automatic timer:
C:\IBM\WebSphere\AppServer85\profiles\AppSrv02\bin>findEJBTimers.bat server1 -all
ADMU0116I: Tool information is being logged in file C:\IBM\WebSphere\AppServer85\profiles\AppSrv02\logs\server1\EJBTimers.log
ADMU0128I: Starting tool with the AppSrv02 profile
ADMU3100I: Reading configuration for server: server1
EJB timer : 3 Expiration: 2/14/15 12:39 PM Calendar
EJB : ScheduleTestEAR, ScheduleTest.jar, HelloBean
Info : myTimer
Automatic timer with timout method: printHello
Calendar expression: [start=null, end=null, timezone=null, seconds="0",
minutes="*", hours="*", dayOfMonth="*", month="*", dayOfWeek="*", year="*"]
1 EJB timer tasks found
After uninstalling application, the timer is removed:
C:\IBM\WebSphere\AppServer85\profiles\AppSrv02\bin>findEJBTimers.bat server1 -all
ADMU0116I: Tool information is being logged in file
C:\IBM\WebSphere\AppServer85\profiles\AppSrv02\logs\server1\EJBTimers.log
ADMU0128I: Starting tool with the AppSrv02 profile
ADMU3100I: Reading configuration for server: server1
0 EJB timer tasks found
I don't know how you are 'redeploying' your applications, but it looks like your process is incorrect, as with the normal install/uninstall/update process the automatic timers are correctly removed.
UPDATE
On the same page there is info regarding the Network Deployment (ND) environment:
Automatic persistent timers are removed from their persistent store when their containing module or application is uninstalled. Therefore, do not update applications that use automatic persistent timers with the Rollout Update feature. Doing so uninstalls and reinstalls the application while the cluster is still operational, which might cause failure in the following cases:
If a timer running in another cluster member activates after the database entry is removed and before the database entry is recreated, then the timer fails. In this case, a com.ibm.websphere.scheduler.TaskPending exception is written to the First Failure Data Capture (FFDC), along with the SCHD0057W message, indicating that the task information in the database has been changed or canceled.
If the timer activates on a cluster member that has not been updated after the timer data in the database has been updated, then the timer might fail or cause other failures if the new timer information is not compatible with the old application code still running in the cluster member.
In JBoss/WildFly, if you change the timer-service to use a "clustered-store" instead of the "default-file-store", you'll be able to programmatically cancel a Timer. Here is a brief guide explaining how to do it:
Mastertheboss.com: Creating clustered EJB 3 Timers
Published: 08 March 2015
http://www.mastertheboss.com/jboss-server/wildfly-8/creating-clustered-ejb-3-timers
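More generally, the standard EJB 3.1 TimerService can be injected into the bean that owns the @Schedule timer and used to look up and cancel its timers from code. Below is a minimal sketch reusing the HelloBean example from above; note that in EJB 3.1 getTimers() only returns the timers of the bean it is called on, and whether the cancellation of an automatic @Schedule timer survives a redeployment is container-specific, so verify this behaviour on WebSphere:
import java.util.Collection;
import javax.annotation.Resource;
import javax.ejb.LocalBean;
import javax.ejb.Schedule;
import javax.ejb.Stateless;
import javax.ejb.Timer;
import javax.ejb.TimerService;

@Stateless
@LocalBean
public class HelloBean {

    @Resource
    private TimerService timerService;

    @Schedule(persistent = true, minute = "*", hour = "*", info = "myTimer")
    public void printHello() {
        System.out.println("### hello");
    }

    //Cancels every timer of this bean whose info matches the given value,
    //e.g. cancelTimers("myTimer") for the automatic timer declared above.
    public void cancelTimers(String info) {
        Collection<Timer> timers = timerService.getTimers();
        for (Timer timer : timers) {
            if (info.equals(timer.getInfo())) {
                timer.cancel();
            }
        }
    }
}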

Torque pbs_python submit job error (15025 queue already exists)

I am trying to execute this example script (https://oss.trac.surfsara.nl/pbs_python/wiki/TorqueUsage/Scripts/Submit):
#!/usr/bin/env python
import sys
sys.path.append('/usr/local/build_pbs/lib/python2.7/site-packages/pbs/')
import pbs
server_name = pbs.pbs_default()
c = pbs.pbs_connect(server_name)
attropl = pbs.new_attropl(4)
# Set the name of the job
#
attropl[0].name = pbs.ATTR_N
attropl[0].value = "test"
# Job is Rerunable
#
attropl[1].name = pbs.ATTR_r
attropl[1].value = 'y'
# Walltime
#
attropl[2].name = pbs.ATTR_l
attropl[2].resource = 'walltime'
attropl[2].value = '400'
# Nodes
#
attropl[3].name = pbs.ATTR_l
attropl[3].resource = 'nodes'
attropl[3].value = '1:ppn=4'
# A1.tsk is the job script filename
#
job_id = pbs.pbs_submit(c, attropl, "A1.tsk", 'batch', 'NULL')
e, e_txt = pbs.error()
if e:
    print e, e_txt
print job_id
But the shell shows the error "15025 Queue already exists". With qsub the job submits normally. I have one queue, 'batch', on my server. Torque version: 4.2.7. pbs_python version: 4.4.0.
What should I do to start a new job?
There are two things going on here. First, there is an error in pbs_python that maps the 15025 error code to "Queue already exists". Looking at the source of Torque, we see that 15025 actually maps to the error "Bad UID for job execution", which means that the pbs_server daemon cannot determine whether the user you are submitting as is allowed to run jobs. This could be because of several things:
The user you are submitting as doesn't exist on the machine running pbs_server
The host you are submitting from is not in the "submit_hosts" parameter of the pbs_server.
Solution For 1
The remedy for this depends on how you authenticate users across systems. You could use /etc/hosts.equiv to specify the users/hosts allowed to submit; this file would need to be distributed to all the Torque nodes as well as the Torque server machine. Using hosts.equiv is pretty insecure, so I haven't actually used it myself. We use a central LDAP server to authenticate all users on the network and do not have this problem. You could also manually add the user to all the Torque nodes and the Torque server, taking care to make sure the UID is the same on all systems.
Solution For 2
If #1 is not your problem (which I doubt it is), you probably need to add the hostname of the machine you're submitting from to the "submit_hosts" parameter on the torque server. This can be accomplished with qmgr:
[root@torque_server ]# qmgr -c "set server submit_hosts += hostname.example.com"
The pbs_python library that you are using was written for Torque 2.4.x.
The internal APIs of Torque were largely rewritten in Torque 4.0.x, so the library will most likely need to be rewritten for the new API.
Currently the developers of Torque do not test any external libraries, so it is possible that they could break at any time.

MessageQueue.Exists(QueueName) returns false but it exists

The problem I'm having is with this code:
if (!MessageQueue.Exists(QueueName))
{
    MessageQueue.Create(QueueName, true);
}
It will check if a queue exists; if it doesn't I want it to create the queue. This code has been working and hasn't changed for a few months. Today I started receiving this error:
[MessageQueueException (0x80004005): A queue with the same path name
already exists.] System.Messaging.MessageQueue.Create(String path,
Boolean transactional) +239478
The queues are local and if I delete the specific queue it will work once. After the queue is created it starts to fail again with the same error message.
It looks like the issue may be caused by the Network Load Balancing (NLB) configuration. I was unaware of a change that recently put the machine in an NLB environment. The configuration we are using is an unsupported one.
More information is in How Message Queuing can function over Network Load Balancing (NLB).