spark.databricks.queryWatchdog.outputRatioThreshold error for FPGrowth using PySpark on Databricks

I'm working on Market Basket Analysis using PySpark on Databricks.
The transactional dataset consists of a total of 5.4 million transactions with approximately 11,000 distinct items.
I'm able to run FPGrowth on the dataset, but whenever I try to display or count model.freqItemsets & model.associationRules, I get this error every time:
org.apache.spark.SparkException: Job 88 cancelled because Task 8084 in Stage 283 exceeded the maximum allowed ratio of input to output records (1 to 235158, max allowed 1 to 10000); this limit can be modified with configuration parameter spark.databricks.queryWatchdog.outputRatioThreshold
I don't understand why I'm facing this error or how I can resolve it.
Any help would be appreciated. Thanks in advance!
I tried reading the docs provided by Databricks, yet I'm still not able to understand clearly why I'm getting this error.
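For reference, the error message itself names spark.databricks.queryWatchdog.outputRatioThreshold as the parameter to change, and FPGrowth's minSupport controls how many frequent itemsets each transaction can expand into. A minimal sketch of both options in PySpark (the threshold and minSupport values are arbitrary assumptions, and transactions_df is only a placeholder for the 5.4M-row dataset):
from pyspark.ml.fpm import FPGrowth

# Option 1: raise the Query Watchdog output ratio limit for this notebook
# (50000 is an arbitrary example, not a recommended value)
spark.conf.set("spark.databricks.queryWatchdog.outputRatioThreshold", 50000)

# Option 2: raise minSupport so fewer frequent itemsets are generated per
# input transaction (0.05 is an arbitrary example value)
fp = FPGrowth(itemsCol="items", minSupport=0.05, minConfidence=0.2)
model = fp.fit(transactions_df)  # transactions_df stands in for the real dataset

model.freqItemsets.count()
model.associationRules.count()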

Related

Error in dataflow plugins.adfprod.AutoResolveIntegrationRuntime.45

I am getting the error below while running my dataflow. The dataflow was running fine until yesterday; from today onwards I am getting this error:
Operation on target LoadAccount failed:
[plugins.adfprod.AutoResolveIntegrationRuntime.45 WorkspaceType: CCID:<1a11d7e0-b019-4845-ab29-641100c79f04>] The job has surpassed the max number of seconds it can be in ResourceAcquisition state [1000s], so ending the job.
Error Message - The job has surpassed the max number of seconds it can be in ResourceAcquisition state [1000s], so ending the job.
In a lot of cases in Data Factory, the MAX limitations are only soft restrictions that can easily be lifted via a support ticket.
There is no such thing as a limitless cloud platform.
Refer to this article by MRPAULANDREW.

How to speed up downloading a CSV file locally with PySpark (databricks)?

We created an image classifier to predict whether certain Instagram images are of a certain class. Running this model works fine.
from sparkdl import DeepImageFeaturizer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

# create a deep image featurizer using the InceptionV3 model
featurizer = DeepImageFeaturizer(inputCol="image",
                                 outputCol="features",
                                 modelName="InceptionV3")
# use logistic regression for speed and reliability
lr = LogisticRegression(maxIter=5, regParam=0.03,
                        elasticNetParam=0.5, labelCol="label")
# define the pipeline and fit the model
sparkdn = Pipeline(stages=[featurizer, lr])
spark_model = sparkdn.fit(df)
We made this separately from our basetable (which runs on a larger cluster). We need to extract the spark_model predictions as a CSV to import them back into the other notebook and merge them with our basetable.
To do this we have tried the following
image_final_estimation = spark_model.transform(image_final)
display(image_final_estimation)  # this gives an option in Databricks to download the CSV
AND
image_final_estimation.coalesce(1).write.csv(path='imagesPred2.csv')  # and then we would be able to read it back in with spark.read.csv
The thing is, these operations take very long (probably due to the nature of the task) and they crash our cluster. We are able to show our outcome, but only with .show(), not with the display() method.
Is there any other way to save this csv locally? Or how can we improve the speed of these tasks?
Please note that we use the community edition of Databricks.
When storing DataFrames in files, a good way to parallelize the writing is to define an appropriate number of partitions for that DataFrame/RDD.
In the code you shared, you are using the coalesce function, which here reduces the number of partitions to 1 and thereby removes the benefits of parallelism.
On Databricks Community Edition, I tried the following test with a CSV dataset provided by Databricks (https://docs.databricks.com/getting-started/databricks-datasets.html). The idea is to measure the time elapsed when writing the data to a CSV using one partition vs. using many partitions.
carDF = spark.read.option("header", True).csv("dbfs:/databricks-datasets/Rdatasets/data-001/csv/car/*")
print("Total count of Rows {0}".format(carDF.count()))
print("Original Partitions Number: {0}".format(carDF.rdd.getNumPartitions()))
>>Total count of Rows 39005
>>Original Partitions Number: 7
%timeit carDF.write.format("csv").mode("overwrite").save("/tmp/caroriginal")
>>2.79 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, up to now, writing the dataset to a file using 7 partitions took about 2.79 seconds.
newCarDF = carDF.coalesce(1)
print("Total count of Rows {0}".format(newCarDF.count()))
print("New Partitions Number: {0}".format(newCarDF.rdd.getNumPartitions()))
>>Total count of Rows 39005
>>New Partitions Number: 1
%timeit newCarDF.write.format("csv").mode("overwrite").save("/tmp/carmodified")
>>4.13 s ± 172 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, for the same DataFrame, writing to a CSV with one partition took about 4.13 seconds.
In conclusion, in this case the coalesce(1) call is hurting the writing performance.
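As a follow-up sketch (assuming the image_final_estimation DataFrame from the question and an arbitrary DBFS output folder), the part files written in parallel can simply be read back as a folder with spark.read.csv, so coalesce(1) is not needed at all:
# write without coalescing: each partition writes its own part file in parallel
image_final_estimation.write.mode("overwrite").option("header", True).csv("/tmp/imagesPred2")

# spark.read.csv accepts the whole directory of part files, so the other
# notebook can read the predictions back without ever producing a single file
preds = spark.read.option("header", True).csv("/tmp/imagesPred2")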
Hope this helps

HiveQL counter limit exceeded error

I am running a CREATE TABLE query in HiveQL and get the following error when it runs:
Status: Failed
Counters limit exceeded: Too many counters: 2001 max=2000
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Counters limit exceeded: Too many counters: 2001 max=2000
I have attempted to set the counters limit to a greater number, i.e.
set tez.counters.max=16000;
However, it still falls over with the same error.
My query involves 13 left joins, but the data sets are relatively small (thousands of rows). The query did work when there were roughly 10 joins, but since I added the additional joins it has started to fail.
Any suggestions on how I can configure this to work would be greatly appreciated!
You need to find the real initial error log from a failed container. The error you have shown here is not the initial error. 2001 containers (including their restart attempts) failed because of some other error (which you really need to fix); then the whole job was terminated and all the other containers were killed because of the failed-counters limit. Go to the job tracker, find a failed (not killed) container, and read its log. The real problem is not the limit, and changing the counters limit will not help.
Divide your query into multiple steps and then run it.
As you said, your query works with 10 joins. So first create a table that holds the data from the first 10 joins, and then, using that new table, create another table that joins the first table with the three remaining tables.
I faced the same issue when I was applying a UNION ALL statement over 100 tables, but when I started running only 10 tables at a time it worked.
Hope this helps!

Exceeding `spark.driver.maxResultSize` without bringing any data to the driver

I have a Spark application that performs a large join
val joined = uniqueDates.join(df, $"start_date" <= $"date" && $"date" <= $"end_date")
and then aggregates the resulting DataFrame down to one with maybe 13k rows. In the course of the join, the job fails with the following error message:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 78021 tasks is bigger than spark.driver.maxResultSize (2.0 GB)
This was happening before without setting spark.driver.maxResultSize, and so I set spark.driver.maxResultSize=2G. Then, I made a slight change to the join condition, and the error resurfaces.
Edit: In resizing the cluster, I also doubled the number of partitions the DataFrame assumes in a .coalesce(256) to a .coalesce(512), so I can't be sure it's not because of that.
My question is, since I am not collecting anything to the driver, why should spark.driver.maxResultSize matter at all here? Is the driver's memory being used for something in the join that I'm not aware of?
Just because you don't collect anything explicitly doesn't mean that nothing is collected. Since the problem occurs during a join, the most likely explanation is that the execution plan uses a broadcast join. In that case Spark will collect the data first and then broadcast it.
Depending on the configuration and pipeline:
Make sure that spark.sql.autoBroadcastJoinThreshold is smaller than spark.driver.maxResultSize.
Make sure you don't force a broadcast join on data of unknown size.
While nothing indicates it is the problem here, be careful when using Spark ML utilities. Some of these (most notably indexers) can bring significant amounts of data to the driver.
To determine if broadcasting is indeed the problem please check the execution plan, and if needed, remove broadcast hints and disable automatic broadcasts:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
In theory, the exception is not always related to customer data.
Technical information about task execution results is sent to the driver node in serialized form, and this information can take more memory than the threshold.
Proof:
The error message is constructed in org.apache.spark.scheduler.TaskSetManager#canFetchMoreResults:
val msg = s"Total size of serialized results of ${calculatedTasks} tasks " +
The method is called in org.apache.spark.scheduler.TaskResultGetter#enqueueSuccessfulTask:
val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
  case directResult: DirectTaskResult[_] =>
    if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
      return
    }
If the number of tasks is huge, the mentioned exception can occur.
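If that is what is happening, the usual mitigations are to raise the limit in the cluster configuration or to reduce the number of tasks; a rough sketch under those assumptions (the values are arbitrary examples):
# Option 1: raise the limit when the application is launched, e.g.
#   spark-submit --conf spark.driver.maxResultSize=4g ...
# (it is a driver property, so it is normally set before the job starts)

# Option 2: cut down the task count so that less per-task result metadata
# reaches the driver, e.g. by using fewer partitions before the big join
df = df.coalesce(256)  # 256 is an arbitrary example partition count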

java.lang.OutOfMemoryError while using coalesce(1)

I'm trying to save an RDD as below,
data.coalesce(1).saveAsTextFile(outputPath)
but I'm getting java.lang.OutOfMemoryError: Unable to acquire 76 bytes of memory, got 0
Has anyone faced a similar issue? If so, I would like to learn how you fixed it.
Can you please provide more details on whether you are getting the OOM on the driver or on an executor?
From the code you posted, coalesce(1) will force all executors to send data to a single executor, and if your data size is huge you will start seeing failures.
coalesce results in a shuffle (all mapper tasks sending data to a single task).
Follow http://bytepadding.com/big-data/spark/understanding-spark-through-map-reduce/ for an in-depth understanding.
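A rough alternative sketch, assuming the data RDD and outputPath from the question: either skip the coalesce entirely and let each partition write its own part file, or use repartition(1), which keeps the upstream computation parallel and only funnels data into a single task for the final write:
# write with the existing partitions: every executor writes its own part
# file in parallel, avoiding the single-task memory pressure
data.saveAsTextFile(outputPath)

# or, if a single output file is really required, repartition(1) adds a
# shuffle so that only the final write runs as one task
data.repartition(1).saveAsTextFile(outputPath)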