I want to read files from a VM from Databricks.
I am able to SFTP to the VM from the Databricks driver. However, I want to read the files using spark.read.
I have tried:
val read_input_df = spark.read
.format("com.springml.spark.sftp")
.option("host", "SFTP_HOST")
.option("username", "username")
.option("password", "password")
.option("fileType", "csv")
.load("/home/file1.csv")
I am getting the error:
NoClassDefFoundError: scala/Product$class.
Has anyone done this successfully?
The problem is that you're using a library compiled for Scala 2.11 on a Databricks cluster runtime that uses Scala 2.12 (7.x/8.x/9.x/10.x). As of right now, there is no released version for Spark 3.x/Scala 2.12, but there is a pull request that you can try to compile yourself & use.
Another approach would be to copy the files first via SFTP onto DBFS (for example, like here), and then open them as usual.
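A minimal sketch of that second approach, assuming the JSch library (com.jcraft.jsch) is attached to the cluster, reusing the host, credentials, and path from the question; the DBFS target path is just an illustration:
import com.jcraft.jsch.{ChannelSftp, JSch}

// Open an SFTP session from the driver (same placeholder credentials as in the question)
val jsch = new JSch()
val session = jsch.getSession("username", "SFTP_HOST", 22)
session.setPassword("password")
session.setConfig("StrictHostKeyChecking", "no")
session.connect()

// Download the remote file to driver-local storage
val channel = session.openChannel("sftp").asInstanceOf[ChannelSftp]
channel.connect()
channel.get("/home/file1.csv", "/tmp/file1.csv")
channel.disconnect()
session.disconnect()

// Copy from driver-local storage to DBFS so executors can read it, then read as usual
dbutils.fs.cp("file:/tmp/file1.csv", "dbfs:/tmp/file1.csv")
val read_input_df = spark.read.option("header", "true").csv("dbfs:/tmp/file1.csv")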
Related
I am fairly new to PySpark and am trying to load data from a folder which contains multiple JSON files. However, the load fails. Here is the code that I am using:
spark = SparkSession.builder.master("local[1]") \
.appName('SparkByExamples.com') \
.getOrCreate()
spark.read.json('file_directory/*')
I am getting the error:
Exception in thread "globPath-ForkJoinPool-1-worker-57" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
I tried setting the path variables for Hadoop and Spark as well, but still no use.
However, if I load a single file from the directory, it loads perfectly.
Can someone please tell me what is going wrong in this case?
I can successfully read all CSV files under a directory without adding the asterisk.
I think you should try
spark.read.json('file_directory/')
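The same idea in Scala, pointing spark.read.json at the directory itself rather than a glob pattern (the directory name is the placeholder from the question):
val df = spark.read.json("file_directory/")
df.printSchema()
df.show(5)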
I've been trying to set up a proof of concept where Azure Databricks reads data from my Event Hub using the following code:
connectionString = "Endpoint=sb://mystream.servicebus.windows.net/;EntityPath=theeventhub;SharedAccessKeyName=event_test;SharedAccessKey=mykeygoeshere12345"
ehConf = {
'eventhubs.connectionString' : connectionString
}
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
readEventStream = df.withColumn("body", df["body"].cast("string"))
display(readEventStream)
I'm using the azure_eventhubs_spark_2_11_2_3_6.jar package as recommended here, but I've also tried the latest version and keep getting the message:
ERROR : Some streams terminated before this command could finish!
I've used Databricks runtime version 6.1 and rolled it back to 5.3, but I can't seem to get it up and running. I have a Python script that sends data to the event hub, but I just can't see anything coming out of it.
Is it the package, or something else I'm doing wrong?
Update: I was loading the library from a JAR file that I had downloaded. I deleted that and then got it from the Maven repo. After testing, it worked.
It works perfectly with the configuration below:
Databricks Runtime: 5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11)
Azure Event Hubs library: com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13
With the above configuration, I was able to stream the data from Azure Event Hubs.
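For reference, here is a Scala sketch of the same streaming read with that library attached from Maven; the connection string is the placeholder from the question and already contains the EntityPath:
import org.apache.spark.eventhubs.EventHubsConf

val connectionString = "Endpoint=sb://mystream.servicebus.windows.net/;EntityPath=theeventhub;SharedAccessKeyName=event_test;SharedAccessKey=mykeygoeshere12345"
val ehConf = EventHubsConf(connectionString)

val df = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()

// The body column arrives as binary; cast it to string to inspect the payload
val readEventStream = df.withColumn("body", df("body").cast("string"))
display(readEventStream)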
Reference: Integrating Apache Spark with Azure Event Hubs
Hope this helps.
I've read a lot of topics on the Internet about how to get Spark working with S3, but nothing works properly.
I've downloaded Spark 2.3.2 with Hadoop 2.7 and above.
I've copied only some libraries from Hadoop 2.7.7 (which matches the Spark/Hadoop version) to the Spark jars folder:
hadoop-aws-2.7.7.jar
hadoop-auth-2.7.7.jar
aws-java-sdk-1.7.4.jar
Still, I can't use either S3N or S3A to get my file read by Spark:
For S3A I have this exception:
sc.hadoopConfiguration.set("fs.s3a.access.key","myaccesskey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecretkey")
val file = sc.textFile("s3a://my.domain:8080/test_bucket/test_file.txt")
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: AE203E7293ZZA3ED, AWS Error Code: null, AWS Error Message: Forbidden
Using this piece of Python, and some more code, I can list my buckets, list my files, download files, read files from my computer, and get a file URL.
This code gives me the following file URL:
https://my.domain:8080/test_bucket/test_file.txt?Signature=%2Fg3jv96Hdmq2450VTrl4M%2Be%2FI%3D&Expires=1539595614&AWSAccessKeyId=myaccesskey
How should I install / set up / download things so that Spark is able to read and write from my S3 server?
Edit 3:
Using the debug tool mentioned in the comments, here's the result.
It seems like the issue is with the signature; I'm not sure what that means.
First you will need to download the hadoop-aws.jar and aws-java-sdk.jar that match your Spark/Hadoop release and add them to the jars folder inside the Spark folder.
Then you will need to specify the server you will use and enable path-style access if your S3 server does not support dynamic DNS:
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
// I had to change the signature version because I have an old S3 API implementation:
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
Here's my final code:
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val tmp = sc.textFile("s3a://test_bucket/test_file.txt")
sc.hadoopConfiguration.set("fs.s3a.access.key","mykey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecret")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","true")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
tmp.count()
I would recommend putting most of the settings inside spark-defaults.conf:
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.endpoint mydomain:8080
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.signing-algorithm S3SignerType
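If editing spark-defaults.conf is not an option, here is a sketch of the same settings applied programmatically when building the session (the endpoint is the placeholder used above):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-example")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .config("spark.hadoop.fs.s3a.endpoint", "mydomain:8080")
  .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "true")
  .config("spark.hadoop.fs.s3a.signing-algorithm", "S3SignerType")
  .getOrCreate()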
One of the issues I had was setting spark.hadoop.fs.s3a.connection.timeout to 10: prior to Hadoop 3 this value is interpreted as milliseconds, and it gives you a very long wait; the error message would only appear 1.5 minutes after the attempt to read a file.
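If you do need to change that timeout on a pre-Hadoop-3 build, spell the value out in milliseconds; for example, for a 10-second timeout (value illustrative):
spark.hadoop.fs.s3a.connection.timeout 10000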
PS:
Special thanks to Steve Loughran.
Thank you very much for the precious help.
I am using Scala in Apache Spark. I am very new to the platform. I cannot save a collection to a file using the following code:
val x = sc.parallelize(Array(2,4,1))
x.saveAsTextFile("/temp/demo")
This is most likely a permissions problem.
Try writing to a directory for which you have write permissions, e.g. your home directory.
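For example, a quick sketch writing under the current user's home directory instead (the output path is only an illustration):
val x = sc.parallelize(Array(2, 4, 1))
x.saveAsTextFile(s"file://${sys.props("user.home")}/spark-demo-output")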
I have some data in HDFS under /user/Cloudera/Test/*. I can see the records just fine by running hdfs dfs -cat Test/*.
Now I need to read the same files as an RDD in Scala.
I have tried the following in the Scala shell:
val file = sc.textFile("hdfs://quickstart.cloudera:8020/user/Cloudera/Test")
Then I wrote a filter and a for loop to read the words. But when I use println at the end, it says the file is not found.
Can anyone please help me figure out what the HDFS URL should be in this case?
Note: I am using Cloudera CDH5.0 VM
If you are trying to access your file in a Spark job, then you can simply use the URL:
val file = sc.textFile("/user/Cloudera/Test")
Spark will automatically detect this file. You do not need to add the hostname as a prefix, because Spark jobs read from the default HDFS directory.
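A minimal sketch of that, reading the directory and printing a few words (the split and filter are only illustrative):
val file = sc.textFile("/user/Cloudera/Test")
val words = file.flatMap(_.split("\\s+")).filter(_.nonEmpty)
words.take(10).foreach(println)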
Hope this solves your query.
Instead of using "quickstart.cloudera" and the port, use just the IP address:
val file = sc.textFile("hdfs://<ip>/user/Cloudera/Test")