Event Hub to Databricks error: stream is terminated - pyspark

I've been trying to set up a proof of concept where Azure Databricks reads data from my Event Hub using the following code:
connectionString = "Endpoint=sb://mystream.servicebus.windows.net/;EntityPath=theeventhub;SharedAccessKeyName=event_test;SharedAccessKey=mykeygoeshere12345"
ehConf = {
    'eventhubs.connectionString': connectionString
}
df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()
readEventStream = df.withColumn("body", df["body"].cast("string"))
display(readEventStream)
I'm using the azure_eventhubs_spark_2_11_2_3_6.jar package as recommended here, but I've also tried the latest version and keep getting the message
ERROR : Some streams terminated before this command could finish!
I've used Databricks Runtime version 6.1 and rolled it back to 5.3, but can't seem to get it up and running. I have a Python script that sends data to the Event Hub; I just can't see anything coming out of it.
Is it the package, or something else I'm doing wrong?
Update: I was loading the library from a JAR file that I had downloaded. I deleted that and got it from the Maven repo instead. After testing, it worked.

It works perfectly with the below configuration:
Databricks Runtime: 5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11)
Azure Event Hubs library: com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13
With the above configuration, I was able to stream data from Azure Event Hubs.
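As a quick sanity check that events are actually arriving, here is a minimal sketch (assuming the library above is installed from Maven; the connection string is a placeholder) that writes the decoded body to the console sink instead of display() - on Databricks the console output typically shows up in the driver log:
# Minimal sketch, assuming com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 is installed from Maven.
# The connection string below is a placeholder.
connectionString = "Endpoint=sb://<namespace>.servicebus.windows.net/;EntityPath=<eventhub>;SharedAccessKeyName=<name>;SharedAccessKey=<key>"
ehConf = {'eventhubs.connectionString': connectionString}

df = (spark
    .readStream
    .format("eventhubs")
    .options(**ehConf)
    .load())

# The body column arrives as binary; cast it to string before writing it out.
query = (df
    .withColumn("body", df["body"].cast("string"))
    .writeStream
    .outputMode("append")
    .format("console")
    .start())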
Reference: Integrating Apache Spark with Azure Event Hubs
Hope this helps.

Related

Parquet streaming from Azure Blob Storage into Databricks with Unity Catalog

Unity Catalog has recently been set up in my Databricks account, and I am trying to stream from an Azure container containing parquet files into a catalog, using a notebook that ran fine before.
However, I now get the following error:
py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.streaming.DataStreamReader.format(java.lang.String) is not whitelisted on class class org.apache.spark.sql.streaming.DataStreamReader
when trying to run the following spark command from my Notebook:
df = (spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    # useNotifications determines whether we efficiently scan for new files or set up a subscription to listen for new file events
    .option("cloudFiles.useNotifications", "false")
    # schemaEvolutionMode determines what happens when the schema changes
    .option("cloudFiles.schemaEvolutionMode", "failOnNewColumns")
    .option("cloudFiles.schemaLocation", schemaPath)
    .load(dataPath)
)
where schemaPath and dataPath contain the paths to the parquet schema and data files.
The closest related error I have found is the following pre-Unity Catalog error, which suggests disabling table access control on my clusters:
https://kb.databricks.com/en_US/streaming/readstream-is-not-whitelisted
All table access controls are disabled in my Admin Console.
Are there other settings that need to be set to allow streaming from Azure files now that Unity Catalog is set up?
------ Edit -----
Using a Single User cluster on Databricks Runtime version 11.3 beta, I get the following error instead:
com.databricks.sql.cloudfiles.errors.CloudFilesIOException: Failed to write to the schema log at location
followed by the path to the schema location in my Azure storage. I also get this error message when spawning new job clusters from Azure Data Factory.

Read files from a VM from Databricks

I want to read files from a VM from Databricks.
I am able to SFTP to the VM from the Databricks driver. However, I want to read using spark.read.
I have tried:
val read_input_df = spark.read
  .format("com.springml.spark.sftp")
  .option("host", "SFTP_HOST")
  .option("username", "username")
  .option("password", "password")
  .option("fileType", "csv")
  .load("/home/file1.csv")
I'm getting the error:
NoClassDefFoundError: scala/Product$class.
Has anyone done this successfully?
The problem is that you're using a library compiled for Scala 2.11 on a Databricks cluster runtime that uses Scala 2.12 (7.x/8.x/9.x/10.x). As of right now, there is no released version for Spark 3.x/Scala 2.12, but there is a pull request that you can try to compile yourself and use.
Another approach would be to copy the files first via SFTP onto DBFS (for example, like here), and then open them as usual, as in the sketch below.
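A minimal sketch of that copy-then-read approach in Python, assuming the paramiko package is installed on the cluster; the host, credentials, and paths are the placeholders from the question:
# Minimal sketch: copy the file over SFTP to DBFS, then read it with Spark.
# Assumes paramiko is installed; host, credentials and paths are placeholders.
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("SFTP_HOST", username="username", password="password")

# /dbfs/... is the local FUSE mount of DBFS on the driver, so the copied file becomes visible to Spark.
sftp = ssh.open_sftp()
sftp.get("/home/file1.csv", "/dbfs/tmp/file1.csv")
sftp.close()
ssh.close()

# Then read it as usual:
df = spark.read.option("header", "true").csv("dbfs:/tmp/file1.csv")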

Spring Cloud Data Flow Custom Scala Processor unable to send/receive data from Starter Apps (SCDF 2.5.1 & Spring Boot 2.2.6)

I have been working on creating a simple custom processor in Scala for Spring Cloud Data Flow and have been running into issues with sending/receiving data from/to starter applications. I have been unable to see any messages propagating through the stream. The definition of the stream is time --trigger.time-unit=SECONDS | pass-through-log | log where pass-through-log is my custom processor.
I am using Spring Cloud Data Flow 2.5.1 and Spring Boot 2.2.6.
Here is the code used for the processor - I am using the functional model.
import java.util.function.Function

import org.springframework.boot.SpringApplication
import org.springframework.boot.autoconfigure.SpringBootApplication
import org.springframework.context.annotation.Bean

@SpringBootApplication
class PassThroughLog {
  @Bean
  def passthroughlog(): Function[String, String] = {
    input: String => {
      println(s"Received input `$input`")
      input
    }
  }
}

object PassThroughLog {
  def main(args: Array[String]): Unit = SpringApplication.run(classOf[PassThroughLog], args: _*)
}
application.yml
spring:
  cloud:
    stream:
      function:
        bindings:
          passthroughlog-in-0: input
          passthroughlog-out-0: output
build.gradle.kts
// scala
implementation("org.scala-lang:scala-library:2.12.10")
// spring
implementation(platform("org.springframework.cloud:spring-cloud-dependencies:Hoxton.SR5"))
implementation(platform("org.springframework.cloud:spring-cloud-stream-dependencies:Horsham.SR5"))
implementation("org.springframework.boot:spring-boot-starter")
implementation("org.springframework.boot:spring-boot-starter-actuator")
implementation("org.springframework.cloud:spring-cloud-starter-function-web:3.0.7.RELEASE")
implementation("org.springframework.cloud:spring-cloud-starter-stream-kafka:3.0.5.RELEASE")
I have posted the entire project to github if the code samples here are lacking. I also posted the logs there, as they are quite long.
When I bootstrap a local Kafka cluster and push arbitrary data to the input topic, I am able to see data flowing through the processor. However, when I deploy the application on Spring Cloud Data Flow, this is not the case. I am deploying the app via Docker in Kubernetes.
Additionally, when I deploy a stream with the definition time --trigger.time-unit=SECONDS | log, I see messages in the log sink. This has convinced me the problem lies with the custom processor.
Am I missing something simple like a dependency or extra configuration? Any help is greatly appreciated.
When using Spring Cloud Stream 3.x in SCDF, there is an additional set of properties that you will have to set to let SCDF know which function bindings are configured as the input and output channels.
See: Functional Applications
Pay attention specifically to the following properties:
app.time-source.spring.cloud.stream.function.bindings.timeSupplier-out-0=output
app.log-sink.spring.cloud.stream.function.bindings.logConsumer-in-0=input
In your case, you will have to map the passthroughlog-in-0 and passthroughlog-out-0 function bindings to input and output respectively, as sketched below.
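For example, with the processor registered under the name pass-through-log from the stream definition above, the deployment properties would look roughly like the following sketch (the app-name prefix is an assumption about how the app is registered):
app.pass-through-log.spring.cloud.stream.function.bindings.passthroughlog-in-0=input
app.pass-through-log.spring.cloud.stream.function.bindings.passthroughlog-out-0=output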
It turns out the problem was with my Dockerfile. For ease of configuration, I had a build argument to specify the JAR file used in the ENTRYPOINT. To accomplish this, I used the shell form of ENTRYPOINT. Changing my ENTRYPOINT to the exec form solved my issue.
The shell form of ENTRYPOINT does not play well with image arguments (docker run <image> <args>), and hence SCDF could not pass the appropriate arguments to the container.
Changing my Dockerfile from:
FROM openjdk:11.0.5-jdk-slim as build
ARG JAR
ENV JAR $JAR
ADD build/libs/$JAR .
ENTRYPOINT java -jar $JAR
to
FROM openjdk:11.0.5-jdk-slim as build
ARG JAR
ADD build/libs/$JAR program.jar
ENTRYPOINT ["java", "-jar", "program.jar"]
fixed the problem.

Spark Scala S3 storage: permission denied

I've read a lot of topics on the Internet on how to get Spark working with S3, but still nothing works properly.
I've downloaded Spark 2.3.2 with Hadoop 2.7 and later.
I've copied only some libraries from Hadoop 2.7.7 (which matches the Spark/Hadoop version) to the Spark jars folder:
hadoop-aws-2.7.7.jar
hadoop-auth-2.7.7.jar
aws-java-sdk-1.7.4.jar
Still, I can't use either S3N or S3A to get my file read by Spark.
For S3A I get this exception:
sc.hadoopConfiguration.set("fs.s3a.access.key","myaccesskey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecretkey")
val file = sc.textFile("s3a://my.domain:8080/test_bucket/test_file.txt")
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: AE203E7293ZZA3ED, AWS Error Code: null, AWS Error Message: Forbidden
Using this piece of Python, and some more code, I can list my buckets, list my files, download files, read files from my computer, and get a file URL.
This code gives me the following file URL:
https://my.domain:8080/test_bucket/test_file.txt?Signature=%2Fg3jv96Hdmq2450VTrl4M%2Be%2FI%3D&Expires=1539595614&AWSAccessKeyId=myaccesskey
What should I install / set up / download to get Spark able to read and write from my S3 server?
Edit 3:
Using the debug tool from the comments, here's the result.
It seems like the issue is with the signature; I'm not sure what that means.
First you will need to download hadoop-aws.jar and aws-java-sdk.jar versions that match your Spark/Hadoop release and add them to the jars folder inside the Spark folder.
Then you will need to specify the server you will use and enable path-style access if your S3 server does not support dynamic DNS:
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
// I had to change the signature version because I have an old S3 API implementation:
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
Here's my final code:
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val tmp = sc.textFile("s3a://test_bucket/test_file.txt")
sc.hadoopConfiguration.set("fs.s3a.access.key","mykey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecret")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","true")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
tmp.count()
I would recommend putting most of the settings inside spark-defaults.conf:
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.endpoint mydomain:8080
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.signing-algorithm S3SignerType
One of the issues I had was setting spark.hadoop.fs.s3a.connection.timeout to 10; prior to Hadoop 3 this value is in milliseconds, and it gave me a very long timeout: the error message would appear 1.5 minutes after the attempt to read a file.
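For reference, a minimal sketch of an explicit millisecond value in spark-defaults.conf (200000 ms is, as far as I know, the Hadoop 2.x default, so treat the exact number as an assumption):
# prior to Hadoop 3 this value is interpreted as milliseconds
spark.hadoop.fs.s3a.connection.timeout 200000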
PS: Special thanks to Steve Loughran.
Thank you very much for the precious help.

Create Analytics from HTTP using Spark Streaming

Hi, my requirement is to create analytics from http://10.3.9.34:9900/messages, that is, pull data from http://10.3.9.34:9900/messages, put this data in the HDFS location /user/cloudera/flume, and from HDFS create an analytics report using Tableau or the Hue UI. I tried the below code at the Scala console of spark-shell on CDH 5.5 but was unable to fetch data from the HTTP link.
import org.apache.spark.SparkContext
val dataRDD = sc.textFile("http://10.3.9.34:9900/messages")
dataRDD.collect().foreach(println)
dataRDD.count()
dataRDD.saveAsTextFile("/user/cloudera/flume")
I get the below error at the Scala console:
java.io.IOException: No FileSystem for scheme: http
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2623)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2637)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2680)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2662)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:379)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
You can't use an HTTP endpoint as input; the input needs to be a file system such as HDFS, S3, or a local one.
You would need a separate process that pulls data from this endpoint, perhaps using something like Apache NiFi, to land the data on a filesystem where you can then use it as input to Spark. A rough sketch of a simpler approach is below.
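For illustration only, a minimal sketch in Python of such a separate process, pulling the endpoint from the question with the requests library and landing the result on HDFS with the hdfs CLI (availability of requests and the hdfs command, as well as the local staging path, are assumptions):
# Minimal sketch: pull the HTTP endpoint and land the data on HDFS so Spark can read it.
# Assumes the requests package and the hdfs CLI are available on this machine.
import subprocess
import requests

url = "http://10.3.9.34:9900/messages"
local_path = "/tmp/messages.txt"   # hypothetical local staging path
hdfs_dir = "/user/cloudera/flume"  # HDFS location from the question

# 1. Pull the data over HTTP and write it to a local file.
resp = requests.get(url, timeout=30)
resp.raise_for_status()
with open(local_path, "w") as f:
    f.write(resp.text)

# 2. Copy the file onto HDFS.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir], check=True)

# 3. Spark can now use the HDFS path as input, e.g. sc.textFile("/user/cloudera/flume/messages.txt")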