Azure Databricks to Azure SQL DW: Long text columns - pyspark

I would like to populate an Azure SQL DW from an Azure Databricks notebook environment. I am using the built-in connector with pyspark:
sdf.write \
.format("com.databricks.spark.sqldw") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "test_table") \
.option("url", url) \
.option("tempDir", temp_dir) \
.save()
This works fine, but it fails when I include a string column with sufficiently long content. I get the following error:
Py4JJavaError: An error occurred while calling o1252.save.
: com.databricks.spark.sqldw.SqlDWSideException: SQL DW failed to execute the JDBC query produced by the connector.
Underlying SQLException(s):
- com.microsoft.sqlserver.jdbc.SQLServerException: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopSqlException: String or binary data would be truncated. [ErrorCode = 107090] [SQLState = S0001]
As I understand it, this is because the default string type is NVARCHAR(256). It is possible to configure this (see the connector documentation), but the maximum NVARCHAR length there is 4,000 characters. My strings occasionally reach 10,000 characters, so I am curious how I can export certain columns as text/longtext instead.
I would guess that the following would work, if only the preActions were executed after the table was created. They are not, and therefore it fails.
sdf.write \
.format("com.databricks.spark.sqldw") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "test_table") \
.option("url", url) \
.option("tempDir", temp_dir) \
.option("preActions", "ALTER TABLE test_table ALTER COLUMN value NVARCHAR(MAX);") \
.save()
Similarly, postActions are executed after the data is inserted, so this will also fail.
Any ideas?

I had a similar problem and was able to resolve it using the option:
.option("maxStrLength",4000)
Thus in your example this would be:
sdf.write \
.format("com.databricks.spark.sqldw") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "test_table") \
.option("maxStrLength",4000)\
.option("url", url) \
.option("tempDir", temp_dir) \
.save()
This is documented here:
"StringType in Spark is mapped to the NVARCHAR(maxStrLength) type in Azure Synapse. You can use maxStrLength to set the string length for all NVARCHAR(maxStrLength) type columns that are in the table with name dbTable in Azure Synapse."
If your strings go over 4,000 characters, then you should:
Pre-define your table column as NVARCHAR(MAX) and then write to the table in append mode. In this case you can't use the default clustered columnstore index, so either use a HEAP or set proper indexes. A lazy heap would be:
CREATE TABLE example.table
(
NormalColumn NVARCHAR(256),
LongColumn NVARCHAR(4000),
VeryLongColumn NVARCHAR(MAX)
)
WITH (HEAP)
Then you can write to it as usual (in append mode), without the maxStrLength option. This also means you don't over-size all of the other string columns.
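A minimal PySpark sketch of that append write, assuming the table above has already been created in the warehouse and that url and temp_dir are the same values used in the question:
# Append into the pre-created table; no maxStrLength is needed because
# the DDL above already declares NVARCHAR(MAX) for the long column.
sdf.write \
.format("com.databricks.spark.sqldw") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "example.table") \
.option("url", url) \
.option("tempDir", temp_dir) \
.mode("append") \
.save()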
Other options are to:
use split to convert one long column into several string columns (see the sketch after this list), or
save the data as Parquet and then load it from inside Synapse.
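As a rough illustration of the split approach, here is a sketch assuming a single long column named value and 4,000-character chunks (the column names and sizes are illustrative):
from pyspark.sql import functions as F

# Break one long string column into 4,000-character pieces so each piece
# fits in NVARCHAR(4000); recombine them on the Synapse side if needed.
chunked = sdf \
.withColumn("value_part1", F.substring("value", 1, 4000)) \
.withColumn("value_part2", F.substring("value", 4001, 4000)) \
.withColumn("value_part3", F.substring("value", 8001, 4000)) \
.drop("value")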

Related

Read Json Kafka message without schema

Currently we are working on real-time data feeds containing JSON data.
While reading the examples from -
https://sparkbyexamples.com/spark/spark-streaming-with-kafka/
It looks like we need a schema for the Kafka JSON message.
Is there any other way to process the data without a schema?
Try the code below after starting ZooKeeper, the Kafka server, and the other required services.
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
.option("subscribe", kafka_topic_name) \
.option("startingOffsets", "latest")\
.load()  # use "earliest" instead of "latest" to read from the beginning of the topic
print("Printing Schema of transaction_detail_df: ")
df.printSchema()
transaction_detail_df1 = df.selectExpr("CAST(value AS STRING)")
trans_detail_write_stream = transaction_detail_df1 \
.writeStream \
.trigger(processingTime='2 seconds') \
.option("truncate", "false") \
.format("console") \
.start()
trans_detail_write_stream.awaitTermination()
Just change the basic configuration and you should be able to see the output.
You can use the get_json_object Spark SQL function to parse fields out of the JSON string without defining any additional schema.
You can simply use the cast function to deserialize the binary key/value, as the example above shows.
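For example, here is a small sketch of pulling individual fields out of the JSON value with get_json_object, reusing transaction_detail_df1 from the answer above (the JSON paths are made-up field names):
from pyspark.sql.functions import get_json_object

# Extract fields from the JSON string column without defining a schema.
# "$.transaction_id" and "$.amount" are hypothetical JSON paths.
parsed_df = transaction_detail_df1 \
.withColumn("transaction_id", get_json_object("value", "$.transaction_id")) \
.withColumn("amount", get_json_object("value", "$.amount"))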

Databricks - overwriteSchema

Multiple times I've had an issue while updating a Delta table in Databricks where overwriting the schema fails the first time but is then successful the second time. The solution to my problem was to simply run it again, and I'm unable to reproduce it at this time. If it happens again I'll come back and post the exact error message, but it was in essence a schema mismatch error. Has anyone else had a similar problem?
overwriteSchema = True
DF.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", overwriteSchema) \
.partitionBy(datefield) \
.saveAsTable(deltatable)
The option value should be a string, not a Boolean:
.option("overwriteSchema", "True")
DF.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "True") \
.partitionBy(datefield) \
.saveAsTable(deltatable)

Create table in kafka using ksqldb-server

I am trying to create a Kafka table using the (Confluent) ksqldb-server via its REST interface, with the following bash script:
KSQLDB_COMMAND="CREATE TABLE sample_table \
(xkey VARCHAR, \
xdata VARCHAR) \
WITH (KAFKA_TOPIC=\'sample-topic\', \
VALUE_FORMAT=\'JSON\', \
KEY=\'xkey\'); "
COMMAND="curl -X 'POST' '$KSQLDB_SERVER' \
-H 'Content-Type: application/vnd.ksql.v1+json; charset=utf-8' \
-d '{ \"ksql\": \"$KSQLDB_COMMAND\" }' "
eval $COMMAND
The following error output message is returned:
{"#type":"statement_error","error_code":40001,"message":"Failed to prepare statement: Invalid config variable(s) in the WITH clause: KEY","statementText":"CREATE TABLE sample_table (xkey VARCHAR, xdata VARCHAR) WITH (KAFKA_TOPIC='sample-topic', VALUE_FORMAT='JSON', KEY='xkey');","entities":[]}%
The error suggests a problem in the statement itself, in particular with the KEY attribute.
I can get basic commands ("LIST STREAMS;", etc.) working through the REST interface but cannot create tables, so I figure this is a problem either in the KSQL statement or in how I am building the bash command (in the "COMMAND" variable).
Any help is appreciated.
I spent a fair bit of time experimenting and got this simple example working (my original attempt required too many bash variable substitutions to be useful or maintainable, so this version is simplified quite a bit). I also found that ksqlDB table names must follow regular SQL naming conventions (i.e. letters, underscores, etc., but no hyphens), which caused a bunch of the errors in my original question... I should have read the documentation more carefully.
The following works (you may need to change your KSQLDB server address)... and with minimal changes, just about any KSQLDB command can be executed:
####
# NOTE: table MUST be alpha (underscores are OK)... hyphens are not allowed
####
KSQLDB_SERVER="http://localhost:8088/ksql"
KSQLDB_TABLE="some_table"
KSQLDB_TOPIC="some_topic"
VALUE_FORMAT="JSON"
FMT="{ \"ksql\": \"CREATE TABLE %s (key VARCHAR PRIMARY KEY, data VARCHAR) WITH (KAFKA_TOPIC='%s', VALUE_FORMAT='%s');\" }"
JSON_DATA=$(printf "$FMT" "$KSQLDB_TABLE" "$KSQLDB_TOPIC" "$VALUE_FORMAT")
curl -X "POST" "$KSQLDB_SERVER" \
-H "Content-Type: application/vnd.ksql.v1+json; charset=utf-8" \
-d "$JSON_DATA"
You can't specify KEY for a table; KEY is used for streams. For a table you should use PRIMARY KEY in the column declaration, like:
CREATE OR REPLACE TABLE TABLE_1
(
ID INT PRIMARY KEY,
EMAILADDRESS VARCHAR,
ISPRIMARY BOOLEAN,
USERID INT,
PARANT INT
) WITH (KAFKA_TOPIC='test_1', VALUE_FORMAT='AVRO', KEY_FORMAT='AVRO');

PySpark dataframe show() gives 'Unsupported class file major version 58' Error

I am trying to read from a MongoDB database using PySpark and the mongo-spark connector. Reading in the data works fine, but when I try to view the DataFrame with df.show() I get the error
IllegalArgumentException: 'Unsupported class file major version 58'
This is what my code looks like right now.
spark = SparkSession.builder.appName("myApp") \
.config("spark.mongodb.input.uri", "...") \
.config("spark.mongodb.output.uri", "...") \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.4.1') \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.show()
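Class file major version 58 corresponds to Java 14, while Spark 2.4.x (the version mongo-spark-connector_2.11:2.4.1 targets) expects Java 8, so one likely fix is to point PySpark at a Java 8 installation before the session is created. A minimal sketch, where the JAVA_HOME path is an assumption for your machine:
import os
from pyspark.sql import SparkSession

# Assumption: a Java 8 JDK is installed at this path; adjust for your setup.
# JAVA_HOME must be set before the SparkSession (and its JVM) is created.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

spark = SparkSession.builder.appName("myApp").getOrCreate()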

Attempting to write csv file in sftp mode with Spark on Yarn in a Kerberos environment

I'm trying to write a DataFrame to a CSV file and put this CSV file on a remote machine. The Spark job runs on YARN in a Kerberized cluster.
Below is the error I get when the job tries to write the CSV file to the remote machine:
diagnostics: User class threw exception:
org.apache.hadoop.security.AccessControlException: Permission denied:
user=dev, access=WRITE,
inode="/data/9/yarn/local/usercache/dev/appcache/application_1532962490515_15862/container_e05_1532962490515_15862_02_000001/tmp/spark_sftp_connection_temp178/_temporary/0":hdfs:hdfs:drwxr-xr-x
In order to write this CSV file, I'm using the following parameters in a method that writes the file in SFTP mode:
def writeToSFTP(df: DataFrame, path: String) = {
df.write
.format("com.springml.spark.sftp")
.option("host", "hostname.test.fr")
.option("username", "test_hostname")
.option("password", "toto")
.option("fileType", "csv")
.option("delimiter", ",")
.save(path)
}
I'm using the Spark SFTP connector library as described here: https://github.com/springml/spark-sftp
The script which is used to launch the job is :
#!/bin/bash
kinit -kt /home/spark/dev.keytab dev@CLUSTER.HELP.FR
spark-submit --class fr.edf.dsp.launcher.LauncherInsertion \
--master yarn-cluster \
--num-executors 1 \
--driver-memory 5g \
--executor-memory 5g \
--queue dev \
--files /home/spark/dev.keytab#user.keytab,\
/etc/krb5.conf#krb5.conf,\
/home/spark/jar/dev-application-SNAPSHOT.conf#app.conf \
--conf "spark.executor.extraJavaOptions=-Dapp.config.path=./app.conf -Djava.security.auth.login.config=./jaas.conf" \
--conf "spark.driver.extraJavaOptions=-Dapp.config.path=./app.conf -Djava.security.auth.login.config=./jaas.conf" \
/home/spark/jar/dev-SNAPSHOT.jar > /home/spark/out.log 2>&1&
The CSV files are not written to HDFS. Once the DataFrame is built, I try to send it to the remote machine. I suspect a Kerberos issue with the SFTP Spark connector: YARN can't contact the remote machine...
Any help is welcome, thanks.
Add a temporary location where you have write access, and don't worry about cleaning it up, because these files are deleted once the SFTP transfer is done:
def writeToSFTP(df: DataFrame, path: String) = {
df.write
.format("com.springml.spark.sftp")
.option("host", "hostname.test.fr")
.option("username", "test_hostname")
.option("password", "toto")
.option("fileType", "csv")
**.option("hdfsTempLocation","/user/currentuser/")**
.option("delimiter", ",")
.save(path)
}