pyspark crontab triggering duplicates of the same job

I have a pyspark job that I would like to run once a week. It takes ~3 hours for the job to finish.
I added this entry to crontab:
# use /bin/sh to run commands
#
# This crontab entry runs a pyspark job once a week:
CRON_TZ=UTC
* 12 * * 3 nohup sh -x /home/me/run_my_pyspark.sh &
The issue I am running into is that multiple instances of the same job start running. Is there a way to prevent multiple instances from starting up?
Thanks

I would suggest first finding the root cause of the duplicate firing from cron. In your entry the minute field is *, so "* 12 * * 3" fires every minute from 12:00 to 12:59 UTC on Wednesdays; "0 12 * * 3" would fire once at 12:00.
However, to prevent a Spark job from firing multiple times, I have used the file-lock approach below (a sketch follows the steps); I hope it helps you.
1. Refer to a common directory and a lock file from the Spark driver
2. If the lock file is not present
2.1. create the lock file
2.2. continue job execution
3. If the lock file is present
3.1. get the lock file creation time
3.2. if the creation time is recent
3.2.1. then abort the execution
3.2.2. log and alert about the error
3.3. if the lock file is older than a certain interval
3.3.1. overwrite the file with the new version
3.3.2. continue job execution
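Here is a minimal sketch of that scheme in the driver, assuming a local shared directory; the lock path and the staleness threshold are placeholders you would adjust for your job:

import os
import sys
import time

LOCK_FILE = "/home/me/locks/run_my_pyspark.lock"  # placeholder path
MAX_LOCK_AGE_SECONDS = 6 * 3600                   # treat locks older than 6 hours as stale

def acquire_lock():
    os.makedirs(os.path.dirname(LOCK_FILE), exist_ok=True)
    if os.path.exists(LOCK_FILE):
        age = time.time() - os.path.getmtime(LOCK_FILE)
        if age < MAX_LOCK_AGE_SECONDS:
            # Recent lock: another instance is probably still running -> abort and alert.
            print("Recent lock file found; aborting this run.", file=sys.stderr)
            sys.exit(1)
        # Stale lock: remove it and continue with a fresh one.
        os.remove(LOCK_FILE)
    # O_CREAT | O_EXCL makes creation fail if another process creates the file first.
    fd = os.open(LOCK_FILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)

def release_lock():
    if os.path.exists(LOCK_FILE):
        os.remove(LOCK_FILE)

if __name__ == "__main__":
    acquire_lock()
    try:
        # ... run the Spark job here ...
        pass
    finally:
        release_lock()

If the driver can die without cleaning up, the staleness check above is what lets the next weekly run recover.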

Related

Airflow parallelism failure while changing the DB to Postgres

I have installed Airflow locally and I am changing the executor to run parallel tasks.
For that, I changed
1- the Database to Postgres 13.3
2- in the config file
sql_alchemy_conn = postgresql+psycopg2://postgres:postgres#localhost/postgres
3- executor = LocalExecutor
I have checked the DB and there are no errors, using:
airflow db check --> INFO - Connection successful.
airflow db init --> Initialization done
Errors that I receive (and I don't use SQLite at all):
1- {dag_processing.py:515} WARNING - Because we cannot use more than 1 thread (parsing_processes = 2 ) when using SQLite. So we set parallelism to 1.
2- I receive this error from the Airflow web interface:
The scheduler does not appear to be running.
The DAGs list may not update, and new tasks will not be scheduled.
So shall I make any other change?
Did you actually restart your Airflow webserver/scheduler after you changed the config?
The following logging statement:
{dag_processing.py:515} WARNING - Because we cannot use more than 1 thread (parsing_processes = 2 ) when using SQLite. So we set parallelism to 1.
It comes from Airflow 2.0.1, specifically this code fragment:
if 'sqlite' in conf.get('core', 'sql_alchemy_conn') and self._parallelism > 1:
    self.log.warning(
        "Because we cannot use more than 1 thread (parsing_processes = "
        "%d ) when using sqlite. So we set parallelism to 1.",
        self._parallelism,
    )
    self._parallelism = 1
This means that it is somehow still running on SQLite, based on your [core] sql_alchemy_conn setting. If you are certain you changed airflow.cfg and restarted all Airflow services, it might be picking up a different copy of airflow.cfg than you expect. Please inspect the logs to verify it is using the correct one.
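As a quick sanity check (run with the same user and environment variables as the scheduler), you can ask the loaded configuration directly. This is a sketch assuming Airflow 2.0.x, where airflow.configuration exposes AIRFLOW_CONFIG as the path of the loaded config file; if your version doesn't have it, the CLI command "airflow config get-value core sql_alchemy_conn" gives you the resolved value instead.

# Prints which airflow.cfg is actually loaded and what it resolves for the
# settings the warning above depends on.
from airflow.configuration import AIRFLOW_CONFIG, conf

print("config file     :", AIRFLOW_CONFIG)
print("sql_alchemy_conn:", conf.get("core", "sql_alchemy_conn"))
print("executor        :", conf.get("core", "executor"))

If sql_alchemy_conn still shows an sqlite:// URL here, the scheduler is not reading the file you edited.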

What is a restartpoint in postgresql?

In the postgresql.conf file for PostgreSQL version 13, the archive_cleanup_command comment explains the command in the following way:
#archive_cleanup_command = '' # command to execute at every restartpoint.
The documentation (linked here and here) makes no mention of a 'restartpoint'. This raises the following questions:
What is a restartpoint?
For example: is restartpoint just the same word for a checkpoint? Do the two mean the exact same thing?
When is a restartpoint created?
For example: if the restartpoint is just a checkpoint, then the checkpoint will be created every 5 minutes, or whatever checkpoint_timeout is set to in the postgresql.conf file.
When is the archive cleanup command run?
For example: The archive cleanup command is run every time the archive_timeout (set in the postgresql.conf file) is reached. If the archive timeout is set to 1hr, then the archive_cleanup_command runs every 1hr.
A restartpoint is just a checkpoint during recovery, and it is triggered in the same fashion as a checkpoint: either by timeout or by the amount of WAL processed since the last restartpoint. Note also that
Restartpoints can't be performed more frequently than checkpoints in the master because restartpoints can only be performed at checkpoint records.
The reason for restartpoints is “restartable recovery”: if your recovery process is interrupted, the next restart won't start recovering from the beginning of the backup, but from the latest restartpoint.
archive_cleanup_command is run for all completely recovered WAL segments during a restartpoint. Its main use case is log-shipping standby servers: using archive_cleanup_command they can remove all shipped WAL segments they no longer need, so that the directory containing them doesn't grow without bounds.
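For reference, on a log-shipping standby this is typically wired up with the bundled pg_archivecleanup tool; the archive directory below is only a placeholder:

# postgresql.conf on the standby; %r expands to the name of the oldest
# WAL file that must still be kept.
archive_cleanup_command = 'pg_archivecleanup /mnt/standby/archive %r'

With that in place, every restartpoint prunes shipped WAL files that recovery has already consumed.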

GCloud Compute Project-Info Add-Metadata Hangs

The following command doesn't exit on my system:
gcloud compute project-info add-metadata --metadata=itemname=itemvalue
I am using powershell on windows, and I've also tested on a linux container in docker. In both environments, the metadata is updated, but the command never terminates.
If I provide an invalid key, or update to the existing value, I do get the output "No change requested; skipping update for [project]." and the program exits. Performing an actual update produces the hang.
I need this command to terminate so that I can use it in a script. I would like to be able to check the exit code to ensure the update occurred successfully.
You aren't patient enough. In large projects, this operation can take significant time to process. Give the script several minutes to complete.

Scheduled sequential jobs in Ubuntu server

I want to write two scheduled jobs for my Ubuntu 14.04.4 server. The jobs need to be sequential.
The first job should unzip a .gz file (SQL Dump) and then import the table "myTable" into MySQL Database (localhost).
The second job (written using Pentaho Data Integration tool) extracts data from the table "myTable" , transforms it and loads it into a new database.
I could have accomplished the first task using Pentaho PDI Spoon, but it doesn't provide any function to unzip a .gz file. After some research and coming across these posts:
http://forums.pentaho.com/showthread.php?82566-How-to-use-the-content-of-a-tar-gz-file-in-Kettle
How to uncompress and import a .tar.gz file in kettle?
I have gathered that I should manually write a job to accomplish the first task i.e. unzip a .gz file and then import the table "myTable" into MySQL Database.
My question is how to create a cron job that executes the two sequentially, i.e. the first job completes first and then the second is executed.
If there is any better alternative approach to this please suggest.
You can make use of the "Shell" step in a PDI job. Code the unzip portion in the Shell step, followed sequentially by your transformation entry.
Now you can schedule this complete job in CRON or any other scheduler. No need for separate scripts.
Note: this works only in a Linux environment, which I assume you are using.
Hope this helps :)
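If you would rather drive both steps from cron with a single wrapper instead of the PDI Shell step, a minimal sketch could look like this (paths, credentials and the kitchen.sh location are placeholders; kitchen.sh is PDI's command-line job runner):

#!/usr/bin/env python
import subprocess

DUMP_GZ = "/data/dump.sql.gz"  # placeholder

# Step 1: unzip the dump and import it into the local MySQL database.
# gunzip -c streams the decompressed SQL straight into the mysql client.
subprocess.run(
    "gunzip -c {} | mysql -u myuser -p'mypass' mydb".format(DUMP_GZ),
    shell=True,
    check=True,  # stop here if the import fails, so step 2 never starts
)

# Step 2: only reached if step 1 succeeded.
subprocess.run(
    ["/opt/pdi/kitchen.sh", "-file=/jobs/transform_and_load.kjb", "-level=Basic"],
    check=True,
)

A single cron entry pointing at this wrapper then guarantees the sequential order, because each step only starts after the previous one exits successfully.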

Why did my Postgres job stop running?

I created a job to clean the database every day at 01:00.
According to the statistics it ran OK for 3 months.
But today I realized the database size was very big, so I checked the jobs and it hasn't run for one month.
Properties say the last run was '10/27/2014' and the statistics confirm the run was successful.
It also says the next run will be '10/28/2014', but it looks like it never ran and has stayed frozen since then.
(I'm using dd/mm/yyyy format)
So why did it stop running?
Is there a way to restart the job, or should I delete and recreate it?
How can I know a job didn't run?
I guess I can write code for when a job isn't successful, but what about when it never executes at all?
Windows Server 2008
PostgreSQL 9.3.2, compiled by Visual C++ build 1600, 64-bit
The problem was the pgAgent service wasn't running.
When I restart the Postgres service:
the Postgres service stops
pgAgent stops because it is a dependent service
the Postgres service starts
but pgAgent doesn't start again
In the Windows services list you could see that pgAgent had not started.
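To answer the "how can I know a job didn't run?" part, one option is a small external check against pgAgent's own tables. This is a sketch assuming the standard pgagent schema (pgagent.pga_job with jobname, joblastrun, jobnextrun, jobenabled) and psycopg2; the connection details and the two-day threshold are placeholders:

import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres host=localhost")
with conn, conn.cursor() as cur:
    # Jobs that are enabled but have never run, or haven't run recently.
    cur.execute("""
        SELECT jobname, joblastrun, jobnextrun
        FROM pgagent.pga_job
        WHERE jobenabled
          AND (joblastrun IS NULL OR joblastrun < now() - interval '2 days')
    """)
    stale = cur.fetchall()

for jobname, last_run, next_run in stale:
    # A job listed here either never ran or is overdue; before debugging the
    # job itself, check that the pgAgent service is actually running.
    print("Job {!r} looks stale: last run {}, next run {}".format(jobname, last_run, next_run))

Schedule that check from the OS scheduler (not from pgAgent itself) so it still fires when pgAgent is down.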