Quartz with configured schedulers

How can I create a quartz.properties file that declares several schedulers, each with its own properties, and then access them using the StdSchedulerFactory getScheduler("schedulername") method?
org.quartz.scheduler.instanceName = MyScheduler1
org.quartz.threadPool.threadCount = 3
org.quartz.jobStore.class = org.quartz.simpl.RAMJobStore
org.quartz.scheduler.instanceName = MyScheduler2
org.quartz.threadPool.threadCount = 1
org.quartz.jobStore.class = org.quartz.simpl.RAMJobStore

No, each scheduler needs its own configuration properties.
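For illustration, here is a minimal sketch of one way to do this, assuming the properties in the question are split into two separate files (the file names scheduler1.properties and scheduler2.properties are just placeholders): create one StdSchedulerFactory per file, then look each scheduler up by the instance name it was configured with.
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class TwoSchedulers {
    public static void main(String[] args) throws Exception {
        // Each factory loads its own properties file, so each scheduler
        // gets its own instanceName, thread pool size and job store.
        Scheduler scheduler1 = new StdSchedulerFactory("scheduler1.properties").getScheduler();
        Scheduler scheduler2 = new StdSchedulerFactory("scheduler2.properties").getScheduler();

        scheduler1.start();
        scheduler2.start();

        // Once instantiated, a scheduler can be looked up by its
        // org.quartz.scheduler.instanceName through any factory instance.
        Scheduler byName = new StdSchedulerFactory().getScheduler("MyScheduler1");
    }
}
The by-name lookup works because instantiated schedulers are registered in Quartz's SchedulerRepository, which getScheduler(String) consults.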

Related

Celery: How to schedule worker processes/children restart

I am trying to figure out how to set up my Celery workers to restart after living for a day.
Indeed, I configured my worker children/processes to restart after executing a task.
But in some cases there are no tasks to execute for 3-4 days, so I need to restart the long-living children.
Do you know how to do this?
This is my current Celery app setup:
app = Celery(
    "celery",
    broker=f"amqp://bla#blabla/blablabla",
    backend="rpc://",
)
app.conf.task_serializer = "pickle"
app.conf.result_serializer = "pickle"
app.conf.accept_content = ["pickle", "application/json"]
app.conf.broker_connection_max_retries = 5
app.conf.broker_pool_limit = 1
app.conf.worker_max_tasks_per_child = 1  # Ensure 1 task is executed before restarting child
app.conf.worker_max_living_time_before_restart = 60 * 60 * 24  # The conf I want
Thank you :)

Apache Airflow: Multiple Trigger

In Airflow, how can I trigger a task based on the result of another task? If task A's result is true, run Task 4; if false, re-run task A.
You're likely looking for the BranchPythonOperator; you can read about it in the Airflow documentation. Here's some code copied from the documentation that branches based on the return value of a task named start_task:
def branch_func(**kwargs):
    ti = kwargs['ti']
    xcom_value = int(ti.xcom_pull(task_ids='start_task'))
    if xcom_value >= 5:
        return 'continue_task'
    else:
        return 'stop_task'

start_op = BashOperator(
    task_id='start_task',
    bash_command="echo 5",
    xcom_push=True,
    dag=dag)

branch_op = BranchPythonOperator(
    task_id='branch_task',
    provide_context=True,
    python_callable=branch_func,
    dag=dag)

continue_op = DummyOperator(task_id='continue_task', dag=dag)
stop_op = DummyOperator(task_id='stop_task', dag=dag)

start_op >> branch_op >> [continue_op, stop_op]

Configure state.dir to keep state store data in a different directory

I am trying to configure my topology to change the default directory where state stores are kept.
I read in the docs that the state.dir property is what I need to change.
The default is /tmp/kafka-streams and I want to change that to /opt/kafka-streams, because in our production system our /tmp directory gets periodically cleaned.
My application has only one topology, and I have set the state.dir property to the value /opt/kafka-streams.
In the logs I see the following printed first, which confirms that /opt/kafka-streams is being used:
| loglevel="INFO" | thread="main" | logger="org.apache.kafka.streams.StreamsConfig " StreamsConfig values:
application.id = my-application
application.server =
bootstrap.servers = [192.168.92.118:9092]
buffered.records.per.partition = 1000
cache.max.bytes.buffering = 10485760
client.id = my-application-client
commit.interval.ms = 30000
connections.max.idle.ms = 540000
...
...
...
state.dir = /opt/kafka-streams
But soon after the log above, another log line is printed that shows /tmp/kafka-streams is still being used somewhere:
| loglevel="INFO" | thread="main" | logger="org.apache.kafka.streams.StreamsConfig " StreamsConfig values:
application.id = my-application
application.server =
bootstrap.servers = [192.168.92.118:9092]
buffered.records.per.partition = 1000
cache.max.bytes.buffering = 10485760
client.id = my-application-client-StreamThread-1-consumer
...
...
...
state.dir = /tmp/kafka-streams
My application has only one topology, using the Processor API. One processor adds entries to a state store, and a punctuator periodically reads from the state store and outputs messages on the output topic.
When building my topology I used application.id=my-application and client.id=my-application-client. But in the 2nd log line above I see client.id=my-application-client-StreamThread-1-consumer, and that is where state.dir=/tmp/kafka-streams.
Is there another place where I need to configure state.dir when building my topology?
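For reference (this is not an answer from the thread, and the class name below is made up), here is a minimal sketch of where state.dir is normally supplied when building a Processor API application: it has to be in the same Properties object that is passed to the KafkaStreams constructor together with the topology.
import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class StateDirExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-application");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.92.118:9092");
        props.put(StreamsConfig.CLIENT_ID_CONFIG, "my-application-client");
        // The state store directory: it must be set in the Properties object
        // handed to the KafkaStreams constructor below.
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/opt/kafka-streams");

        Topology topology = new Topology();
        // ... addSource / addProcessor (with state stores) / addSink ...

        KafkaStreams streams = new KafkaStreams(topology, props);
        streams.start();
    }
}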

Getting description of a PBS job queue

Is there any command that would allow me to query the description of a running/queued PBS job for its attributes such as RAM, number of processors, GPUs, etc.?
Use the qstat command:
qstat -f job_id
Expanding on the answer posted by dimm.
If a job is registered in a queue, you can query its attributes with the qstat command. If the job has already finished, you can only grep the relevant information from the log files. There is a handy tracejob command to do the grepping for you.
In PBS Pro and Torque each job registered with a queue has two sets of attributes:
Resource_List has resources requested for a running or queued job
resources_used holds actual resource usage for a running job.
For example, in PBS Pro you could get the following attributes for Resource_List:
Resource_List.mem = 2000mb
Resource_List.mpiprocs = 8
Resource_List.ncpus = 8
Resource_List.nodect = 1
Resource_List.place = free
Resource_List.qlist = queue1
Resource_List.select = 1:ncpus=8:mpiprocs=8
Resource_List.walltime = 02:00:00
 
And the following values for resources_used:
resources_used.cpupercent = 800
resources_used.cput = 00:03:31
resources_used.mem = 529992kb
resources_used.ncpus = 8
resources_used.vmem = 3075580kb
resources_used.walltime = 00:00:28
For finished jobs, tracejob can fetch only some of the requested resources:
ncpus=8:mem=2048000kb
and the final values for resources_used:
resources_used.cpupercent=799
resources_used.cput=00:54:29
resources_used.mem=725520kb
resources_used.ncpus=8
resources_used.vmem=3211660kb
resources_used.walltime=00:06:53

Suggestions for a Hadoop project

I am thinking of building something using big data. Ideally what I would like to do is:
take a .csv, put it into Flume, then Kafka, perform n ETL steps and put it back into Kafka, then from Kafka into Flume and then into HDFS. Once the data is in HDFS I would like to run a MapReduce job or some Hive queries and then chart whatever I want.
How can I put the .csv file into Flume and save it to Kafka? I have this piece of code but I am not sure if it works:
myagent.sources = r1
myagent.sinks = k1
myagent.channels = c1
myagent.sources.r1.type = spooldir
myagent.sources.r1.spoolDir = /home/xyz/source
myagent.sources.r1.fileHeader = true
myagent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
myagent.channels.c1.type = memory
myagent.channels.c1.capacity = 1000
myagent.channels.c1.transactionCapacity = 100
myagent.sources.r1.channels = c1
myagent.sinks.k1.channel = c1
Any help or suggestions? And if this piece of code is correct, how do I move on?
Thanks everyone!!
Your Kafka sink config is incomplete. Try the following (this example uses agent name a1, so substitute your agent name myagent):
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = mytopic
a1.sinks.k1.brokerList = localhost:9092
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.batchSize = 20
a1.sinks.k1.channel = c1
https://flume.apache.org/FlumeUserGuide.html#kafka-sink