Is there a way to kill Snowflake queries using the Spark connector? Alternatively, is there a way to grab the last query ID or session ID in Spark so the query can be killed outside of Spark?
The use case is user-controlled, long-running Spark jobs with long-running Snowflake queries. When a user kills the Spark job, the current Snowflake query keeps running (for many hours).
Thank you
Log into the Snowflake UI (or use SnowSQL) with the same user you use for Spark, and run the following:
use database <your_db>;
use warehouse <your_wh>;
select
  query_id, query_text, execution_status, error_message, start_time, end_time
from
  table(information_schema.query_history(RESULT_LIMIT => 10));
This should show your recent queries. Find the one in the RUNNING state, copy its QUERY_ID, and use it to run this:
select system$cancel_query('<your query id here>');
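If you need this regularly, the lookup-and-cancel flow can be automated outside of Spark. Below is a minimal Python sketch: it assumes you already have a connection from the snowflake-connector-python package (the connection itself is not created here), and the `cancel_statements`/`kill_running_queries` helper names are my own, not part of any Snowflake API.

```python
def cancel_statements(history_rows):
    """Given query_history rows as dicts, build a cancel statement
    for every query still in the RUNNING state."""
    return [
        f"select system$cancel_query('{row['QUERY_ID']}')"
        for row in history_rows
        if row.get("EXECUTION_STATUS") == "RUNNING"
    ]

def kill_running_queries(conn):
    """Fetch recent history and cancel anything still running.
    `conn` is an open snowflake.connector connection (assumed, not created here)."""
    cur = conn.cursor()
    cur.execute(
        "select query_id, execution_status "
        "from table(information_schema.query_history(RESULT_LIMIT => 10))"
    )
    rows = [{"QUERY_ID": q, "EXECUTION_STATUS": s} for q, s in cur.fetchall()]
    for stmt in cancel_statements(rows):
        cur.execute(stmt)
```

The same two-step pattern (read query_history, then system$cancel_query) is exactly what the manual steps above do, just scriptable.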
Related
I have deployed a test Trino cluster composed of a coordinator and one worker node.
I have defined several catalogs, all PostgreSQL databases, and I am trying to execute some simple operations such as
describe analysis_n7pt_sarar4.public.tests_summary;
or
show tables from analysis_n7pt_sarar4.public like '%sub_step%'
From the Trino web UI I can see the queries blocked at 9%, and everything seems to hang.
If I execute queries such as:
select * from analysis_n7pt_sarar4.public.bench limit 5
or
select count(*) from analysis_n7pt_sarar4.public.tests_summary;
I obtain results in some seconds.
In http-request.log I found no errors on either the coordinator or the worker.
What should I check?
Thanks
Working on getting Airflow implemented at my company, but I need to perform some safety checks prior to connecting to our prod DBs.
There is concern about stupid SQL being deployed and eating up too many resources. My thought was that an execution_timeout setting on a PostgresOperator task would:
Fail the task
Kill the query process in the db
I have found neither to be true.
Code:
with DAG(
    # Arguments applied to instantiate this DAG. Update your values here
    # All parameters visible in airflow.models.dag
    dag_id=DAG_ID,
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(minutes=20),
    start_date=days_ago(1),
    schedule_interval=None,
    tags=['admin'],
    max_active_runs=1
) as dag:
    kill_test = PostgresOperator(
        task_id="kill_test",
        execution_timeout=timedelta(seconds=10),
        postgres_conn_id="usa_db",
        sql="""
            SET application_name to airflow_test;
            <SELECT... intentionally long running query> ;
        """)
Airflow does not fail the task after the timeout.
Even when I manually fail the task in the UI, it does not kill the query in the Postgres db.
What is the deal here? Is there any way to put in safety measures to hard kill an Airflow initiated Postgres query in the db?
I'm not posting it all here, but I have checked:
Airflow UI shows task instance duration way over execution timeout
pg_stat_activity to confirm the query is running way over the execution timeout
I guess you are looking for this parameter: runtime_parameters={'statement_timeout': '180000ms'} (airflow example).
I don't know in which version this was added, but if you update your apache-airflow-providers-postgres package to the latest version you can use the mentioned parameter.
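For reference, a minimal sketch of how that parameter slots into the task from the question (connection ID and SQL placeholder reused from the question; the 10-second value is illustrative, and this assumes a provider version recent enough to support runtime_parameters):

```python
from airflow.providers.postgres.operators.postgres import PostgresOperator

kill_test = PostgresOperator(
    task_id="kill_test",
    postgres_conn_id="usa_db",
    # server-side limit: Postgres itself aborts the statement after 10 s
    runtime_parameters={"statement_timeout": "10000ms"},
    sql="<SELECT... intentionally long running query> ;",
)
```

The key difference from execution_timeout is that statement_timeout is enforced by the Postgres server for that session, so the query is killed in the database even if Airflow's own timeout handling never fires.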
I want to get the list of jobs which are running on my database server, displaying the name, job timings, etc., using a query. Is this possible in PostgreSQL pgAdmin?
I have a Spark job which I normally run with spark-submit, with the input file name as the argument. Now I want to make the job available to the team, so people can submit an input file (probably through some web API); the Spark job will then be triggered, and the result file will be returned to the user (probably also through the web API). (I am using Java/Scala)
What do I need to build in order to trigger the Spark job in such a scenario? Is there a tutorial somewhere? Should I use Spark Streaming for such a case? Thanks!
One way to go is to have a web server listening for jobs, with each web request potentially triggering an execution of spark-submit.
You can execute this using Java's ProcessBuilder.
To the best of my knowledge, there is no good way of invoking Spark jobs other than through spark-submit.
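The same idea in Python's subprocess, if your server side is Python (in Java, ProcessBuilder works the same way). This is a sketch: the jar path, main class, and master URL are caller-supplied placeholders, and a real web handler would likely hand `run_job` to a worker pool rather than block the request thread.

```python
import subprocess

def build_spark_submit_cmd(jar_path, main_class, master, app_args=()):
    """Assemble the spark-submit command line from its parts."""
    return ["spark-submit", "--class", main_class,
            "--master", master, jar_path, *app_args]

def run_job(jar_path, main_class, master, app_args=()):
    """Run the job and block until it finishes; returns the
    CompletedProcess so the caller can inspect stdout/returncode."""
    cmd = build_spark_submit_cmd(jar_path, main_class, master, app_args)
    return subprocess.run(cmd, capture_output=True, text=True)
```

A web endpoint would save the uploaded input file to disk, call `run_job(...)` with the file path in `app_args`, and return the result file once the process exits.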
You can use Livy.
Livy is an open source REST interface for using Spark from anywhere.
Livy is a new open source Spark REST server for submitting and interacting with your Spark jobs from anywhere. Livy is conceptually based on the incredibly popular IPython/Jupyter, but implemented to better integrate into the Hadoop ecosystem with multiple users. Spark can now be offered as a service to anyone in a simple way: Spark shells in Python or Scala can be run by Livy in the cluster while the end user manipulates them at their own convenience through a REST API. Regular non-interactive applications can also be submitted. The output of the jobs can be introspected and returned in a tabular format, which makes it visualizable in charts. Livy can point to a unique Spark cluster and create several contexts by users. With YARN impersonation, jobs will be executed with the actual permissions of the users submitting them.
Please check this url for info.
https://github.com/cloudera/livy
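For the non-interactive case, Livy's batch endpoint (POST /batches) is roughly a REST equivalent of spark-submit. A minimal stdlib-only sketch, assuming a Livy server is reachable at a URL you supply (the host, jar path, and class name below are placeholders):

```python
import json
from urllib import request

def livy_batch_payload(jar_path, class_name, args=()):
    """Request body for Livy's POST /batches endpoint."""
    return {"file": jar_path, "className": class_name, "args": list(args)}

def submit_batch(livy_url, jar_path, class_name, args=()):
    """Submit a batch job; livy_url is e.g. 'http://livy-host:8998' (placeholder).
    Returns Livy's JSON response, which includes the batch id to poll."""
    body = json.dumps(livy_batch_payload(jar_path, class_name, args)).encode()
    req = request.Request(livy_url + "/batches", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Your web API could call `submit_batch(...)` per upload, then poll GET /batches/{id} until the job finishes and hand the result back to the user.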
You can use the SparkLauncher class to do this. You will need a REST API that takes the file from the user and then triggers the Spark job using SparkLauncher.
Process spark = new SparkLauncher()
    .setAppResource(job.getJarPath())
    .setMainClass(job.getMainClass())
    // the master URL itself, e.g. "spark://host:7077"
    .setMaster("spark://" + this.serverHost + ":" + this.port)
    .launch();
I need to execute a query on a Teradata database on a daily basis (select + insert).
Can this be done within the (Teradata) database, or should I consider external means (e.g. a cron job)?
Teradata doesn't have a built-in scheduler to run jobs. You will need to leverage something like cron or Tivoli Workload Scheduler to manage your job schedule(s).
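With cron, one common setup is a crontab entry that feeds a script to Teradata's BTEQ command-line client. A sketch, where the script path, log path, and schedule are all placeholders:

```shell
# Hypothetical crontab entry: run the daily select + insert at 06:00.
0 6 * * * bteq < /opt/etl/daily_insert.bteq >> /var/log/daily_insert.log 2>&1
```

The referenced .bteq file would contain the .LOGON command, the INSERT ... SELECT statement, and .LOGOFF, so the whole job runs unattended.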