When I run a query in Databricks/PySpark I get the following error:
org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
How do I do this programmatically (Python) in a Databricks notebook? I have tried the below:
>>> spark.sql.autoBroadcastJoinThreshold(-1)
result:
AttributeError: 'function' object has no attribute 'autoBroadcastJoinThreshold'
>>> spark.sql.autoBroadcastJoinThreshold = -1
result:
AttributeError: 'method' object has no attribute 'autoBroadcastJoinThreshold'
Maybe spark.sql.autoBroadcastJoinThreshold is a property key and this property can somehow be set to -1, but I haven't yet found any documentation that describes how to accomplish this using Python.
You can specify this in the Spark settings on the cluster configuration page.
I used this in Databricks before my join command and it worked:
spark.conf.set("spark.sql.broadcastTimeout", "-1")
You can set it in the cluster configuration, as in the accepted answer.
Also, you can set it in a notebook, as shown below:
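A minimal sketch of the notebook route, using the two settings named in the error message (the timeout value here is just an example):

# Increase the broadcast timeout (in seconds); the default is 300.
spark.conf.set("spark.sql.broadcastTimeout", "3600")
# Or disable automatic broadcast joins entirely:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")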
Below is my Python code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('CD6').config('spark.ui.port', '9999').enableHiveSupport().getOrCreate()
I would like to set the following properties to false in the Spark config.
How can I change them using the Spark session?
spark.sql.hive.convertMetastoreOrc=false
spark.sql.hive.convertMetastoreParquet=false
I tried adding the properties via .config(), but it errors out.
You can set Spark config properties like so:
spark.conf.set("spark.sql.<name-of-property>", <value>)
In your case, it would be:
spark.conf.set("spark.sql.hive.convertMetastoreOrc",False)
spark.conf.set("spark.sql.hive.convertMetastoreParquet", False)
Using a Spark notebook in Azure Synapse, I'm processing some data from parquet files and outputting it as different parquet files. I produced a working script and started applying it to different datasets, all working fine until I came across a dataset containing dates older than 1900.
For this issue, I came across this article (which I took to be applicable to my scenario):
Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0
The suggested fix is to add this code chunk to the top of my notebook, which I did:
%%pyspark
from pyspark import SparkContext
from awsglue.context import GlueContext  # Glue-specific import used by the referenced article

sc = SparkContext()
# Get the current SparkConf, which is set by Glue
conf = sc.getConf()
# Add additional Spark configurations
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
# Restart the Spark context
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
# Create a Glue context with the restarted sc
glueContext = GlueContext(sc)
Unfortunately this generated another error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Promise already completed.
  at scala.concurrent.Promise.complete(Promise.scala:53)
  at scala.concurrent.Promise.complete$(Promise.scala:52)
  at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187)
  at scala.concurrent.Promise.success(Promise.scala:86)
  at scala.concurrent.Promise.success$(Promise.scala:86)
  at scala.concurrent.impl.Promise$DefaultPromise.success(Promise.scala:187)
  at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$sparkContextInitialized(ApplicationMaster.scala:408)
  at org.apache.spark.deploy.yarn.ApplicationMaster$.sparkContextInitialized(ApplicationMaster.scala:910)
  at org.apache.spark.scheduler.cluster.YarnClusterScheduler.postStartHook(YarnClusterScheduler.scala:32)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:683)
  at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:238)
  at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
  at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
  at py4j.GatewayConnection.run(GatewayConnection.java:238)
  at java.lang.Thread.run(Thread.java:748)
I've tried looking into resolutions, but this is getting outside my area of expertise. I want my Synapse Spark notebook to run, even on date fields with dates earlier than 1900. Any ideas?
I was able to solve this problem by changing the overall configuration for my Spark pool (which you will probably want to do as well, unless you want to add config code to every notebook you make). To do this, open up Synapse Studio, then go to Manage > Apache Spark pools, click the three dots by your pool (which will be hidden until you mouse over them, great design Microsoft), then select Apache Spark configuration.
From there, create a new configuration and add a configuration property. For the property, enter spark.sql.parquet.int96RebaseModeInRead, and for the value, enter CORRECTED. Note that spark.sql.parquet.int96RebaseModeInRead does NOT show up as a suggested property; you have to enter it yourself.
Apply your changes, save everything, and make sure your new configuration is selected. It might take a bit for the new changes to be reflected in your notebooks, but it should work from there. If you notice some funky date issues with older dates, try changing CORRECTED to LEGACY.
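If you would rather add the config to individual notebooks instead of the pool, a hedged sketch is to set the same rebase modes on the live session; the write/datetime variants beyond the property named above are assumptions, and on older runtimes the spark.sql.legacy.parquet.* names from the question may be needed instead:

# Sketch: per-notebook session settings (property names assume a newer Spark 3.x runtime).
spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")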
I am unable to read the content of a URL via PySpark in Databricks notebooks (version 8.3, Spark 3.1.1). I have tried almost all the possibilities but am unable to find the exact problem. Here is my code.
from pyspark import SparkFiles
url = 'https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps'
spark.sparkContext.addFile(url)
df1 = spark.read.text("file://"+SparkFiles.get('8028d38a.tps'))
df1.show()
Here is the error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 43) (10.139.64.4 executor 0): com.databricks.sql.io.FileReadException: Error while reading file file:/local_disk0/spark-95887d0f-a955-4075-86ac-520a51f0c64e/userFiles-9204e03a-a0fd-4999-9f40-9d9c3cc599a6/8028d38a.tps. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.
I have referred to "reading data from URL using spark databricks platform" as an example. Has anyone faced a similar problem?
This is the best I've found, from the "PySpark for Everyone" YouTube playlist:
!curl "https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps" >> 8028d38a.tps
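A hedged follow-up sketch: the curl call above lands the file in the driver's working directory (commonly /databricks/driver), which executors may not be able to see on a multi-node cluster, so copying it to DBFS first makes it readable with spark.read. The paths here are assumptions for illustration:

# Sketch: copy the downloaded file from the driver's local disk to DBFS, then read it.
dbutils.fs.cp("file:/databricks/driver/8028d38a.tps", "dbfs:/tmp/8028d38a.tps")
df1 = spark.read.text("dbfs:/tmp/8028d38a.tps")
df1.show()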
As a workaround, we can read the URL into a pandas DataFrame and convert it into a PySpark DataFrame for further processing.
url = 'https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps'
import pandas as pd
df = spark.createDataFrame(pd.read_csv(url))
display(df)
If you want to skip the first row (if it is an invalid one), see the sketch below.
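A sketch under the assumption that pandas' skiprows is the intended mechanism (the original snippet here was not shown):

# Sketch: skip the first row of the file before converting to a Spark DataFrame.
import pandas as pd
df = spark.createDataFrame(pd.read_csv(url, skiprows=1))
display(df)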
I have done an incremental data sync with the help of Kafka Connect.
Now I want to achieve the same with a custom query, but I am getting an error.
My config file is:
name=mysql-whitelist-timestamp-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:mysql://127.0.0.1:3306/demo?user=root&password=root
query=select name from students3 where marks = 10
mode=timestamp
table.whitelist=students3
timestamp.column.name=timestamp
topic.prefix=test-mysql-jdbc-
And I am getting the error below:
ERROR WorkerConnector{id=mysql-whitelist-timestamp-source} Error while starting connector (org.apache.kafka.connect.runtime.WorkerConnector:119)
org.apache.kafka.connect.errors.ConnectException: query may not be combined with whole-table copying settings.
We shouldn't use the table.whitelist setting together with a custom query; see the full explanation. A corrected config sketch is shown below.
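A minimal sketch of a corrected config, assuming the goal is a timestamp-mode incremental query: table.whitelist is removed so that query is not combined with whole-table copying settings. Adding the timestamp column to the select and wrapping the query in a subquery are assumptions about what the connector needs in timestamp mode, since it appends its own timestamp predicate to the query.

# Sketch of a corrected config (assumptions noted above).
name=mysql-whitelist-timestamp-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:mysql://127.0.0.1:3306/demo?user=root&password=root
# table.whitelist removed: "query" may not be combined with whole-table settings.
# The original WHERE clause is wrapped in a subquery so the connector can
# append its timestamp predicate; the timestamp column is included in the select.
query=select * from (select name, timestamp from students3 where marks = 10) o
mode=timestamp
timestamp.column.name=timestamp
topic.prefix=test-mysql-jdbc-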
Any Spark job I run that involves HBase access results in the errors below. My own jobs are in Scala, but the supplied Python examples end the same way. The cluster is Cloudera, running CDH 5.4.4. The same jobs run fine on a different cluster with CDH 5.3.1.
Any help is greatly appreciated!
...
15/08/15 21:46:30 WARN TableInputFormatBase: initializeTable called multiple times. Overwriting connection and table reference; TableInputFormatBase will not close these old references when done.
...
15/08/15 21:46:32 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, some.server.name): java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:163)
...
Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:389)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:158)
... 14 more
Run spark-shell with these parameters:
--driver-class-path .../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar --driver-java-options "-Dspark.executor.extraClassPath=.../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar"
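For reference, a sketch of how those flags fit into a full invocation; the truncated parcel path from the answer is kept as-is, so substitute your cluster's actual parcel root:

spark-shell \
  --driver-class-path .../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar \
  --driver-java-options "-Dspark.executor.extraClassPath=.../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar"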
Why it works is described here.