My team is trying to create a "sharded" operator. In its simplest form, it's just a KubernetesPodOperator with a few extra arguments { totalPartitions: number, partitionNumber: number } that the job implementation will use to partition its data source and execute some code.
We would like to auto-generate these partitioned jobs in Airflow rather than in the job implementation, with something like this (in 2.3.x):
from datetime import datetime
from airflow import DAG
from airflow.decorators import task

with DAG(...) as dag:
    @task
    def sharded_job(total_shards, shard_number):
        return Operator(
            ...
            job_id="same_job_id_for_all",
            total_shards=total_shards,
            shard_number=shard_number,
        )

    sharded_job.expand(total_shards=[5], shard_number=[0, 1, 2, 3, 4])
but have it actually work? Additionally, is there a nice way to package the above into a util or ShardedOperator class that can spin up total_shards job instances?
If not the above, does Airflow have other primitives for partitioning jobs or creating a cluster of same-type jobs?
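One possible direction, as a minimal sketch rather than a confirmed solution: build the per-shard arguments in plain Python and hand the resulting list to dynamic task mapping. The helper name shard_arguments and the --total-shards / --shard-number flags are hypothetical; the job would have to parse whatever flags you choose.

```python
def shard_arguments(total_shards):
    """Return one container argument list per shard.

    The flag names are illustrative assumptions, not part of any Airflow API.
    """
    if total_shards < 1:
        raise ValueError("total_shards must be at least 1")
    return [
        ["--total-shards", str(total_shards), "--shard-number", str(shard_number)]
        for shard_number in range(total_shards)
    ]


# Assumed usage with dynamic task mapping (Airflow 2.3+), not executed here:
#   KubernetesPodOperator.partial(task_id="sharded_job", ...).expand(
#       arguments=shard_arguments(5)
#   )
```

Keeping the shard list in one helper means total_shards is stated once, so the shard numbers cannot drift out of sync with it.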
I have a parameterized DAG and I want to programmatically create DAG instances based on it.
In the traditional Airflow model, I can achieve this easily using a loop:
# Code sample from: https://github.com/astronomer/dynamic-dags-tutorial/blob/main/dags/dynamic-dags-loop.py
def create_dag(dag_id,
               schedule,
               dag_number,
               default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag

for n in range(1, 4):
    dag_id = 'loop_hello_world_{}'.format(str(n))
    # ...
    globals()[dag_id] = create_dag(dag_id,
                                   schedule,
                                   dag_number,
                                   default_args)
How can I implement similar behavior in Airflow 2.0 using the TaskFlow model?
I have tried to apply the @dag decorator manually, as in the following code, but it does not work. The dynamically created DAGs do not have any tasks in them:
@dag(schedule_interval=None, start_date=datetime(2021, 1, 1), catchup=False, tags=['test'])
def flow_test():
    @task()
    def sleep():
        time.sleep(5)

    t1 = sleep()

for n in range(1, 3):
    dag_id = 'flow_test_{}'.format(str(n))
    globals()[dag_id] = dag(dag_id=dag_id, tags=['test'])(flow_test)()
I have a scenario where I want to make an integration test for operators called through on_failure_callback in an Airflow DAG.
A minimal example of this DAG is as follows:
def failure_callback(context):
    # CustomOperator in this case links to an external K8s service
    handle_failure = CustomOperator(
        task_id="handle_failure",
        timestamp=context["ts"]
    )
    handle_failure.execute(context=context)

args = {
    "catchup": False,
    "retries": 3,
    "retry_delay": timedelta(seconds=30),
    "start_date": START_DATE,
    "on_failure_callback": failure_callback,
}

with DAG("foo", schedule_interval=None, default_args=args) as dag:
    task_to_fail = SomeOperator()
My first thought for testing would be to run task_to_fail, let it fail, and validate the outcome of failure_callback with some other process. My attempt is below:
import pytest
from airflow.models import DagBag, TaskInstance
from dateutil import parser

@pytest.fixture
def foo_dag():
    dag_id = "foo"
    dag_bag = DagBag("dags")
    return dag_bag.dags[dag_id]

@pytest.mark.integration
def test_task_to_fail(foo_dag):
    execution_date = parser.parse("2000-01-01T00:00+00:00")
    task_id = "task_to_fail"
    task = foo_dag.get_task(task_id=task_id)
    task_instance = TaskInstance(task, execution_date)
    with pytest.raises(Exception):
        task_instance.run(ignore_task_deps=True, ignore_ti_state=True, test_mode=True)
    assert "INSERT DESIRED OUTCOME OF `failure_callback` HERE"
The issue I'm having is that failure_callback does not appear to be called when running pytest. I suspect this is due to how TaskInstance is being run (i.e. it does not trigger on_failure_callback), but I am not sure.
My questions:
1. Is this the correct way to validate the behavior of this callback? If not, how should this be handled?
2. Upstream of the task_to_fail task, there are many expensive operations that I want to avoid running during tests. Is it possible to have a full run of a DAG, executed with pytest, starting from a particular task (in this case, task_to_fail)?
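One option, offered as a sketch rather than an established Airflow testing recipe, is to skip the task run entirely and call failure_callback directly with a hand-built context, replacing the operator with a mock. The CustomOperator class below is a hypothetical stand-in so the example is self-contained, and the context dict contains only the "ts" key the callback reads.

```python
from unittest import mock


class CustomOperator:
    """Hypothetical stand-in for the real operator that calls an external service."""

    def __init__(self, task_id, timestamp):
        self.task_id = task_id
        self.timestamp = timestamp

    def execute(self, context):
        raise RuntimeError("would contact the external K8s service")


def failure_callback(context):
    handle_failure = CustomOperator(task_id="handle_failure", timestamp=context["ts"])
    handle_failure.execute(context=context)


def check_failure_callback():
    """Call the callback with a minimal context and verify how the operator is used."""
    context = {"ts": "2000-01-01T00:00:00+00:00"}
    fake_operator_cls = mock.MagicMock(name="CustomOperator")
    # Swap the name the callback looks up, only for the duration of the call.
    with mock.patch.dict(failure_callback.__globals__, {"CustomOperator": fake_operator_cls}):
        failure_callback(context)
    fake_operator_cls.assert_called_once_with(
        task_id="handle_failure", timestamp=context["ts"]
    )
    fake_operator_cls.return_value.execute.assert_called_once_with(context=context)
    return fake_operator_cls
```

In a real project the callback would live in the DAG module, so mock.patch with that module's import path would replace the globals patch used here. This only checks that the callback wires up the operator correctly; it does not prove that Airflow invokes the callback on failure.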
I have a small class:
import pickle
import shelve

class Base:
    def __init__(self, userid, username, pic_list=[]):
        self.userid=userid
        self.pic_list=pic_list
        self.username_list=[username]
        self.username=self.username_list[-1]

    def pic_add(self, pic):
        self.pic_list.append(pic)

if __name__ == '__main__':
    path="D:/"
    db = shelve.open(path+'test_base')
    db['111']=Base('111','111name',[4,5,6])
    db['111'].pic_add('123dsf')
    print (sorted(db['111'].pic_list))
    db.close()
I want to append 123dsf to pic_list of class instance "111". But the result I get is:
[4, 5, 6]
[Finished in 0.3s]
I want to receive [4, 5, 6, 123dsf]. What am I doing wrong?
Thanx.
P.S. Hint - It is something with shelve module syntax, 'cos adding 'y' works fine:
db['111']=Base('111','111name',[4,5,6])
db['111'].pic_add('123dsf')
Base.pic_add(db['111'],'123dsf')
y=Base('222','222name',[7,8,9])
y.pic_add('pis')
print (y.pic_list)
print (sorted(db['111'].pic_list))
The result is:
[7, 8, 9, 'pis']
[4, 5, 6]
[Finished in 0.4s]
By default, every db['111'] access unpickles a fresh copy of the stored object, so a change made to that copy is never written back. There are two ways to fix it, as proposed in the documentation: https://docs.python.org/2/library/shelve.html#shelve-example
1. Set the writeback flag:
db = shelve.open(path+'test_base', writeback=True)
This allows you to mutate objects in place:
db['111'].pic_add('123dsf')
2. Retrieve a copy of the stored object, mutate the copy, then store the copy back:
cpy = db['111']
cpy.pic_add('123dsf')
db['111'] = cpy
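To see both approaches side by side, here is a small self-contained sketch. It stores a plain dict instead of the Base class and writes to a temporary directory instead of D:/, both of which are simplifications made for the example.

```python
import os
import shelve
import tempfile


def demo(path):
    """Show the lost update, then the copy-back fix, then the writeback fix."""
    results = {}
    with shelve.open(path) as db:
        db["111"] = {"pic_list": [4, 5, 6]}

        # Mutates a temporary unpickled copy; nothing is stored back.
        db["111"]["pic_list"].append("lost")
        results["default"] = db["111"]["pic_list"]

        # Approach 2: fetch a copy, mutate it, assign it back.
        record = db["111"]
        record["pic_list"].append("123dsf")
        db["111"] = record
        results["copy_back"] = db["111"]["pic_list"]

    # Approach 1: writeback=True caches accessed entries and saves them on close.
    with shelve.open(path, writeback=True) as db:
        db["111"]["pic_list"].append("in_place")

    with shelve.open(path) as db:
        results["writeback"] = db["111"]["pic_list"]
    return results


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        return demo(os.path.join(tmp, "test_base"))
```

Note that writeback=True keeps every accessed entry in memory and rewrites all of them on close, so the copy-back approach can be the better fit for large shelves.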
I want to find the parameters of ParamGridBuilder that make the best model in CrossValidator in Spark 1.4.x.
In the Pipeline Example in the Spark documentation, they add different parameters (numFeatures, regParam) by using ParamGridBuilder in the Pipeline. Then, with the following line of code, they fit the best model:
val cvModel = crossval.fit(training.toDF)
Now, I want to know which parameters (numFeatures, regParam) from ParamGridBuilder produce the best model.
I already used the following commands without success:
cvModel.bestModel.extractParamMap().toString()
cvModel.params.toList.mkString("(", ",", ")")
cvModel.estimatorParamMaps.toString()
cvModel.explainParams()
cvModel.getEstimatorParamMaps.mkString("(", ",", ")")
cvModel.toString()
Any help?
Thanks in advance,
One method to get a proper ParamMap object is to use CrossValidatorModel.avgMetrics: Array[Double] to find the argmax ParamMap:
implicit class BestParamMapCrossValidatorModel(cvModel: CrossValidatorModel) {
  def bestEstimatorParamMap: ParamMap = {
    cvModel.getEstimatorParamMaps
      .zip(cvModel.avgMetrics)
      .maxBy(_._2)
      ._1
  }
}
When run on the CrossValidatorModel trained in the Pipeline Example you cited, this gives:
scala> println(cvModel.bestEstimatorParamMap)
{
hashingTF_2b0b8ccaeeec-numFeatures: 100,
logreg_950a13184247-regParam: 0.1
}
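The same argmax idea can be sketched in plain Python, with dicts standing in for ParamMap objects. The metric values below are made up for illustration, and the larger_is_better switch is an addition: maxBy, as used above, assumes a metric where larger is better, so an error metric such as RMSE would need the minimum instead.

```python
def best_param_map(param_maps, avg_metrics, larger_is_better=True):
    """Return the parameter map whose averaged cross-validation metric is best."""
    if len(param_maps) != len(avg_metrics):
        raise ValueError("param_maps and avg_metrics must have the same length")
    if not param_maps:
        raise ValueError("param_maps must not be empty")
    pick = max if larger_is_better else min
    best_map, _best_metric = pick(zip(param_maps, avg_metrics), key=lambda pair: pair[1])
    return best_map


# Illustrative grid and metrics (the numbers are invented, not real results).
example_grid = [
    {"numFeatures": 10, "regParam": 0.1},
    {"numFeatures": 100, "regParam": 0.1},
    {"numFeatures": 100, "regParam": 0.01},
]
example_metrics = [0.71, 0.93, 0.88]
example_best = best_param_map(example_grid, example_metrics)
```

The pairing relies on the metrics being in the same order as the parameter maps, which is how the Scala snippet above zips them.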
val bestPipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val stages = bestPipelineModel.stages
val hashingStage = stages(1).asInstanceOf[HashingTF]
println("numFeatures = " + hashingStage.getNumFeatures)
val lrStage = stages(2).asInstanceOf[LogisticRegressionModel]
println("regParam = " + lrStage.getRegParam)
To print everything in paramMap, you actually don't have to call parent:
cvModel.bestModel.extractParamMap()
To answer OP's question, to get a single best parameter, for example regParam:
cvModel.bestModel.extractParamMap().apply(cvModel.bestModel.getParam("regParam"))
This is how you get the chosen parameters:
println(cvModel.bestModel.getMaxIter)
println(cvModel.bestModel.getRegParam)
This Java code should work:
cvModel.bestModel().parent().extractParamMap()
You can translate it to Scala code. The parent() method returns an estimator, from which you can then get the best params.
This is the ParamGridBuilder():
paraGrid = ParamGridBuilder().addGrid(
    hashingTF.numFeatures, [10, 100, 1000]
).addGrid(
    lr.regParam, [0.1, 0.01, 0.001]
).build()
There are 3 stages in the pipeline. It seems we can access the parameters as follows:
for stage in cv_model.bestModel.stages:
    print('stage: {}'.format(stage))
    print(stage.params)
    print('\n')
stage: Tokenizer_46ffb9fac5968c6c152b
[Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='inputCol', doc='input column name'), Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='outputCol', doc='output column name')]
stage: HashingTF_40e1af3ba73764848d43
[Param(parent='HashingTF_40e1af3ba73764848d43', name='inputCol', doc='input column name'), Param(parent='HashingTF_40e1af3ba73764848d43', name='numFeatures', doc='number of features'), Param(parent='HashingTF_40e1af3ba73764848d43', name='outputCol', doc='output column name')]
stage: LogisticRegression_451b8c8dbef84ecab7a9
[]
However, no parameters are listed for the last stage, LogisticRegression.
We can also get the weights and intercept from the logistic regression stage, as follows:
cv_model.bestModel.stages[1].getNumFeatures()
10
cv_model.bestModel.stages[2].intercept
1.5791827733883774
cv_model.bestModel.stages[2].weights
DenseVector([-2.5361, -0.9541, 0.4124, 4.2108, 4.4707, 4.9451, -0.3045, 5.4348, -0.1977, -1.8361])
Full exploration:
http://kuanliang.github.io/2016-06-07-SparkML-pipeline/
I am working with Spark Scala 1.6.x, and here is a full example of how I set up and fit a CrossValidator and then return the value of the parameter used to get the best model (assuming that training.toDF gives a DataFrame ready to be used):
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Instantiate a LogisticRegression object
val lr = new LogisticRegression()
// Instantiate a ParamGrid with different values for the 'RegParam' parameter of the logistic regression
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.0001, 0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1)).build()
// Setting and fitting the CrossValidator on the training set, using 'MultiClassClassificationEvaluator' as evaluator
val crossVal = new CrossValidator().setEstimator(lr).setEvaluator(new MulticlassClassificationEvaluator).setEstimatorParamMaps(paramGrid)
val cvModel = crossVal.fit(training.toDF)
// Getting the value of the 'RegParam' used to get the best model
val bestModel = cvModel.bestModel // Getting the best model
val paramReference = bestModel.getParam("regParam") // Getting the reference of the parameter you want (only the reference, not the value)
val paramValue = bestModel.get(paramReference) // Getting the value of this parameter
print(paramValue) // In my case : 0.001
You can do the same for any parameter or any other type of model.
If you are using Java, see what the debugger shows:
bestModel.parent().extractParamMap()
Building on the solution of @macfeliga, a one-liner that works for pipelines:
cvModel.bestModel.asInstanceOf[PipelineModel]
  .stages.foreach(stage => println(stage.extractParamMap))
This SO thread kinda answers the question.
In a nutshell, you need to cast each object to its supposed-to-be class.
For the case of CrossValidatorModel, the following is what I did:
import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.RandomForestRegressionModel
// Load CV model from S3
val inputModelPath = "s3://path/to/my/random-forest-regression-cv"
val reloadedCvModel = CrossValidatorModel.load(inputModelPath)
// To get the parameters of the best model
(
  reloadedCvModel.bestModel
    .asInstanceOf[PipelineModel]
    .stages(1)
    .asInstanceOf[RandomForestRegressionModel]
    .extractParamMap()
)
In the example, my pipeline has two stages (a VectorIndexer and a RandomForestRegressor), so the stage index is 1 for my model.
For me, the @orangeHIX solution is perfect:
val cvModel = cv.fit(training)
val cvMejorModelo = cvModel.bestModel.asInstanceOf[ALSModel]
cvMejorModelo.parent.extractParamMap()
res86: org.apache.spark.ml.param.ParamMap =
{
als_08eb64db650d-alpha: 0.05,
als_08eb64db650d-checkpointInterval: 10,
als_08eb64db650d-coldStartStrategy: drop,
als_08eb64db650d-finalStorageLevel: MEMORY_AND_DISK,
als_08eb64db650d-implicitPrefs: false,
als_08eb64db650d-intermediateStorageLevel: MEMORY_AND_DISK,
als_08eb64db650d-itemCol: product,
als_08eb64db650d-maxIter: 10,
als_08eb64db650d-nonnegative: false,
als_08eb64db650d-numItemBlocks: 10,
als_08eb64db650d-numUserBlocks: 10,
als_08eb64db650d-predictionCol: prediction,
als_08eb64db650d-rank: 1,
als_08eb64db650d-ratingCol: rating,
als_08eb64db650d-regParam: 0.1,
als_08eb64db650d-seed: 1994790107,
als_08eb64db650d-userCol: user
}