Databricks - overwriteSchema - pyspark

Multiple times I've had an issue while updating a Delta table in Databricks where overwriting the schema fails the first time but then succeeds the second time. The solution to my problem was simply to run it again, and I'm unable to reproduce it at this time. If it happens again I'll come back and post the exact error message, but it was in essence a schema mismatch error. Has anyone else had a similar problem?
overwriteSchema = True
DF.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", overwriteSchema) \
    .partitionBy(datefield) \
    .saveAsTable(deltatable)

The option value should be passed as a string, not a Boolean:
.option("overwriteSchema", "True")
DF.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "True") \
    .partitionBy(datefield) \
    .saveAsTable(deltatable)
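As a quick sanity check, you can read the table back and confirm that the schema overwrite took effect (a sketch, assuming deltatable still holds the table name used above):
spark.table(deltatable).printSchema()
spark.sql("DESCRIBE HISTORY {}".format(deltatable)).show(truncate=False)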

Related

Read Json Kafka message without schema

Currently we are working on real-time data feeds containing JSON data.
While reading the examples from
https://sparkbyexamples.com/spark/spark-streaming-with-kafka/
it looks like we need a schema for the Kafka JSON message.
Is there any other way to process the data without a schema?
Try the code below after starting ZooKeeper, the Kafka server, and the other required services.
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", kafka_topic_name) \
    .option("startingOffsets", "latest") \
    .load()  # use "earliest" instead of "latest" to read the topic from the beginning

print("Printing schema of df:")
df.printSchema()

transaction_detail_df1 = df.selectExpr("CAST(value AS STRING)")

trans_detail_write_stream = transaction_detail_df1 \
    .writeStream \
    .trigger(processingTime='2 seconds') \
    .option("truncate", "false") \
    .format("console") \
    .start()

trans_detail_write_stream.awaitTermination()
Just adjust the basic configuration values and you should be able to see the output on the console.
You can use the get_json_object Spark SQL function to parse fields out of the JSON string data without defining any additional schema.
You can simply use the cast function to deserialize the binary key/value columns, as the example above shows.
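For completeness, a sketch of pulling a single field out of the JSON payload with get_json_object on the casted value column (the field name txn_id is just a placeholder, replace it with a key that exists in your messages):
from pyspark.sql.functions import get_json_object

# Extract one field from the JSON string without defining a schema.
parsed_df = transaction_detail_df1.withColumn(
    "txn_id", get_json_object("value", "$.txn_id"))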

PySpark dataframe show() gives 'Unsupported class file major version 58' Error

I am trying to read from MongoDB using PySpark and the mongo-spark connector. Reading in the data works fine, but when I try to view the DataFrame with df.show() I get the error
IllegalArgumentException: 'Unsupported class file major version 58'
This is what my code looks like right now.
spark = SparkSession.builder.appName("myApp") \
    .config("spark.mongodb.input.uri", "...") \
    .config("spark.mongodb.output.uri", "...") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.4.1') \
    .getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.show()
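Class file major version 58 corresponds to Java 14, while the Spark 2.4 / Scala 2.11 stack implied by mongo-spark-connector_2.11:2.4.1 supports Java 8, so this error usually means PySpark is launching a JVM from a JDK that is too new. A minimal sketch of pointing PySpark at an older JDK before the session is created (the JAVA_HOME path below is an assumption, adjust it for your machine):
import os

# Assumed location of a Java 8 installation; adjust for your system.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("myApp").getOrCreate()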

Azure Databricks to Azure SQL DW: Long text columns

I would like to populate an Azure SQL DW from an Azure Databricks notebook environment. I am using the built-in connector with pyspark:
sdf.write \
    .format("com.databricks.spark.sqldw") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "test_table") \
    .option("url", url) \
    .option("tempDir", temp_dir) \
    .save()
This works fine, but when I include a string column with sufficiently long content, I get the following error:
Py4JJavaError: An error occurred while calling o1252.save.
: com.databricks.spark.sqldw.SqlDWSideException: SQL DW failed to execute the JDBC query produced by the connector.
Underlying SQLException(s):
- com.microsoft.sqlserver.jdbc.SQLServerException: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopSqlException: String or binary data would be truncated. [ErrorCode = 107090] [SQLState = S0001]
As I understand it, this is because the default string type is NVARCHAR(256). It is possible to configure this (reference), but the maximum NVARCHAR length is 4,000 characters. My strings occasionally reach 10,000 characters. Therefore, I am curious how I can export certain columns as text/longtext instead.
I would guess that the following would work, if only the preActions were executed after the table was created. They're not, and therefore it fails.
sdf.write \
    .format("com.databricks.spark.sqldw") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "test_table") \
    .option("url", url) \
    .option("tempDir", temp_dir) \
    .option("preActions", "ALTER TABLE test_table ALTER COLUMN value NVARCHAR(MAX);") \
    .save()
Also, postActions are executed after data is inserted, and therefore this will also fail.
Any ideas?
I had a similar problem and was able to resolve it using the option:
.option("maxStrLength", 4000)
Thus in your example this would be:
sdf.write \
    .format("com.databricks.spark.sqldw") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "test_table") \
    .option("maxStrLength", 4000) \
    .option("url", url) \
    .option("tempDir", temp_dir) \
    .save()
This is documented here:
"StringType in Spark is mapped to the NVARCHAR(maxStrLength) type in Azure Synapse. You can use maxStrLength to set the string length for all NVARCHAR(maxStrLength) type columns that are in the table with name dbTable in Azure Synapse."
If your strings go over 4,000 characters, then you should:
Pre-define your table column as NVARCHAR(MAX) and then write to the table in append mode. In this case you can't use the default columnstore index, so either use a HEAP or set proper indexes. A lazy heap would be:
CREATE TABLE example.table
(
    NormalColumn NVARCHAR(256),
    LongColumn NVARCHAR(4000),
    VeryLongColumn NVARCHAR(MAX)
)
WITH (HEAP)
Then you can write to it as usual, without the maxStrLength option. This also means you don't overspecify all other string columns.
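A minimal sketch of that append-mode write, reusing url and temp_dir from above and pointing at the pre-created table:
sdf.write \
    .format("com.databricks.spark.sqldw") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "example.table") \
    .option("url", url) \
    .option("tempDir", temp_dir) \
    .mode("append") \
    .save()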
Other options are to:
split one long column into several shorter string columns (see the sketch after this list), or
save the data as Parquet and then load it from inside Synapse.
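A sketch of the splitting approach, assuming the long column is the value column from the question and using a 4,000-character chunk size to match NVARCHAR(4000):
from pyspark.sql import functions as F

chunk = 4000  # matches the NVARCHAR(4000) limit used above
sdf_split = (sdf
    .withColumn("value_part1", F.substring("value", 1, chunk))
    .withColumn("value_part2", F.substring("value", chunk + 1, chunk))
    .withColumn("value_part3", F.substring("value", 2 * chunk + 1, chunk))
    .drop("value"))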

Attempting to write csv file in sftp mode with Spark on Yarn in a Kerberos environment

I'm trying to write a DataFrame to a CSV file and put that CSV file on a remote machine. The Spark job runs on YARN in a Kerberized cluster.
Below is the error I get when the job tries to write the CSV file to the remote machine:
diagnostics: User class threw exception:
org.apache.hadoop.security.AccessControlException: Permission denied:
user=dev, access=WRITE,
inode="/data/9/yarn/local/usercache/dev/appcache/application_1532962490515_15862/container_e05_1532962490515_15862_02_000001/tmp/spark_sftp_connection_temp178/_temporary/0":hdfs:hdfs:drwxr-xr-x
To write this CSV file, I'm using the following parameters in a method that writes the file over SFTP:
def writeToSFTP(df: DataFrame, path: String) = {
  df.write
    .format("com.springml.spark.sftp")
    .option("host", "hostname.test.fr")
    .option("username", "test_hostname")
    .option("password", "toto")
    .option("fileType", "csv")
    .option("delimiter", ",")
    .save(path)
}
I'm using the Spark SFTP Connector library described at https://github.com/springml/spark-sftp
The script used to launch the job is:
#!/bin/bash
kinit -kt /home/spark/dev.keytab dev@CLUSTER.HELP.FR
spark-submit --class fr.edf.dsp.launcher.LauncherInsertion \
--master yarn-cluster \
--num-executors 1 \
--driver-memory 5g \
--executor-memory 5g \
--queue dev \
--files /home/spark/dev.keytab#user.keytab,\
/etc/krb5.conf#krb5.conf,\
/home/spark/jar/dev-application-SNAPSHOT.conf#app.conf \
--conf "spark.executor.extraJavaOptions=-Dapp.config.path=./app.conf -Djava.security.auth.login.config=./jaas.conf" \
--conf "spark.driver.extraJavaOptions=-Dapp.config.path=./app.conf -Djava.security.auth.login.config=./jaas.conf" \
/home/spark/jar/dev-SNAPSHOT.jar > /home/spark/out.log 2>&1&
The CSV files are not written into HDFS. Once the DataFrame is built, I try to send it to the remote machine. I suspect a Kerberos issue with the SFTP Spark connector: YARN can't contact the remote machine...
Any help is welcome, thanks.
Add a temporary location where you have write access, and don't worry about cleaning it up, because these files are deleted once the SFTP transfer is done:
def writeToSFTP(df: DataFrame, path: String) = {
  df.write
    .format("com.springml.spark.sftp")
    .option("host", "hostname.test.fr")
    .option("username", "test_hostname")
    .option("password", "toto")
    .option("fileType", "csv")
    // HDFS staging location the connector writes to before the SFTP transfer
    .option("hdfsTempLocation", "/user/currentuser/")
    .option("delimiter", ",")
    .save(path)
}

CDH Spark Streaming consumer for Kerberos Kafka

Has anyone tried to use Spark Streaming (PySpark) as a consumer for Kerberos-secured Kafka in CDH?
I searched the CDH documentation and only found some examples in Scala.
Does that mean CDH does not support this?
Can anyone help with this?
CDH also supports the PySpark-based Structured Streaming API for connecting to a Kerberos-secured Kafka cluster. I too found it hard to find example code. You can refer to the sample code below, which is well tested and has been implemented in a CDH production environment.
Note: points to consider in the sample code below.
Adjust the package versions based on your environment.
Specify the correct JAAS and keytab file locations in the spark-submit command and in the config parameters in the code.
This code is given as an example that reads a topic from a Kerberos-enabled Kafka cluster and writes it to an HDFS location.
spark/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0,com.databricks:spark-avro_2.11:3.2.0 --conf spark.ui.port=4055 --files /home/path/spark_jaas,/home/bdpda/bdpda.headless.keytab --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/bdpda/spark_jaas" --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/bdpda/spark_jaas" pysparkstructurestreaming.py
Pyspark code: pysparkstructurestreaming.py
from pyspark.sql import SparkSession

# Spark session:
spark = SparkSession.builder.appName('PythonStreamingDirectKafkaWordCount').getOrCreate()

# Kafka topic details:
KAFKA_TOPIC_NAME_CONS = "topic_name"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'kafka_server:9093'

# Creating the readStream DataFrame:
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
    .option("subscribe", KAFKA_TOPIC_NAME_CONS) \
    .option("startingOffsets", "earliest") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.client.id", "Client_id") \
    .option("kafka.sasl.kerberos.service.name", "kafka") \
    .option("kafka.ssl.truststore.location", "/home/path/kafka_trust.jks") \
    .option("kafka.ssl.truststore.password", "password_rd") \
    .option("kafka.sasl.kerberos.keytab", "/home/path.keytab") \
    .option("kafka.sasl.kerberos.principal", "path") \
    .load()

df1 = df.selectExpr("CAST(value AS STRING)")

# Creating the writeStream query:
query = df1.writeStream \
    .option("path", "target_directory") \
    .format("csv") \
    .option("checkpointLocation", "chkpint_directory") \
    .outputMode("append") \
    .start()

query.awaitTermination()