I have an existing EMR cluster running and wish to create a DataFrame from a PostgreSQL database source.
To do this, it seems you need to modify spark-defaults.conf with an updated spark.driver.extraClassPath pointing to the relevant PostgreSQL JAR that has already been downloaded to the master and slave nodes, or pass the equivalent arguments to a spark-submit job.
Since I want to use the existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?
I tried the following:
Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL JAR (postgresql-9.41207.jre6.jar) into it.
Edited spark-defaults.conf to include the wildcard location:
spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
Tried to create a DataFrame in a Jupyter cell using the following code:
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
I get a Java error as per below:
Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver
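A quick sanity check from the notebook, assuming the usual spark session object, is to ask the driver JVM directly whether it can see the driver class; a small sketch:

# Sketch: check whether the PostgreSQL driver class is visible to the driver JVM.
# org.postgresql.Driver is the class name shipped in the PostgreSQL JDBC JAR.
try:
    spark._jvm.java.lang.Class.forName("org.postgresql.Driver")
    print("org.postgresql.Driver is on the driver classpath")
except Exception as err:
    print("driver class not found:", err)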
Help appreciated.
I think you don't need to copy the Postgres JAR to the slaves, as the driver program and the cluster manager take care of everything. I've created a DataFrame from a Postgres external source in the following way:
Download postgres driver jar:
cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
Create dataframe:
attribute = {'url' : 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}'
                     .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
             'database' : <db>,
             'dbtable' : '(select * from table) as tmp'}  # a table name, or a parenthesised subquery with an alias

df = spark.read.format('jdbc').options(**attribute).load()
Submit the Spark job:
Add the downloaded JAR to the driver class path when submitting the Spark job:
--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5
Check the GitHub repo of the driver. The driver class name seems to be org.postgresql.Driver, not com.postgresql.jdbc.Driver. Try using that.
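For completeness, a minimal PySpark sketch of the corrected call from the question, assuming the PostgreSQL JAR is already on the driver classpath (the connection string and table name are the question's placeholders):

SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"

# Same call as in the question, but with the correct driver class name.
df = spark.read.jdbc(
    url=SQL_CONN,
    table="someTable",
    properties={"driver": "org.postgresql.Driver"},
)
df.printSchema()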
I wanted to understand whether there is a way to connect to a MySQL database over SSH, using a private/public key pair, from a Spark notebook in Scala.
I have been trying to modify this code, to no avail:
Connect to MySQL over SSH using Java
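One direction that seems plausible is to open the SSH tunnel with the key pair first and then point Spark's JDBC reader at the forwarded local port. Below is a rough PySpark sketch of that idea (the Scala flow would be analogous); the sshtunnel library, hosts, paths and credentials are illustrative assumptions, and the forwarded port is only reachable from the machine running the tunnel, so this only really works when the driver and executors sit on that machine (e.g. local mode).

from sshtunnel import SSHTunnelForwarder  # assumed tunnelling library, not part of Spark
from pyspark.sql import SparkSession

# Sketch: forward the remote MySQL port over SSH using a private key,
# then read through JDBC against the forwarded local port.
# All hosts, paths and credentials below are placeholders.
tunnel = SSHTunnelForwarder(
    ("ssh.gateway.example.com", 22),
    ssh_username="ssh_user",
    ssh_pkey="/path/to/private_key",
    remote_bind_address=("mysql.internal.example.com", 3306),
    local_bind_address=("127.0.0.1", 3307),
)
tunnel.start()

spark = SparkSession.builder.appName("mysql-over-ssh").getOrCreate()

# Requires the MySQL Connector/J JAR on the classpath.
df = (spark.read.format("jdbc")
      .options(
          url="jdbc:mysql://127.0.0.1:3307/mydb",
          driver="com.mysql.jdbc.Driver",
          dbtable="someTable",
          user="db_user",
          password="db_password")
      .load())
df.show(5)

tunnel.stop()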
I'm trying to read/write data to an Azure SQL Data Warehouse from an on-demand Spark HDInsight cluster.
I can do this from a normal HDInsight Spark cluster by using a script action to install the JDBC driver, but I don't think it's possible to run script actions on the on-demand clusters.
I've tried:
Copying the files from %user%.m2\repository\com\microsoft\sqlserver\mssql-jdbc\6.2.2.jre8 up to blob storage, into a folder called jars next to where the built Spark code is.
Including the driver dependency in the built JAR file.
Both of these led to a java.lang.NoClassDefFoundError.
I'm not too familiar with Scala/Maven/the JVM, etc., so I'm not sure what else to try or include in this SO question.
The Scala code I'm trying to run is:
import org.apache.spark.sql.SparkSession

val sqlContext = SparkSession.builder().appName("GenerateEventsSql").getOrCreate()

val jdbcSqlConnStr = "jdbc:sqlserver://someserver.database.windows.net:1433;databaseName=myDW;user=admin;password=XXXX;"
val tableName = "dbo.SomeTable"

val allTableData = sqlContext.read.format("jdbc")
  .options(Map(
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" -> jdbcSqlConnStr,
    "dbtable" -> tableName))
  .load()
JARs in a folder on blob storage are not accessible on the classpath of an HDInsight Spark job. You need to copy the JAR files to the local host, for example to /tmp/jars/xyz.jar, and reference them in the spark-submit command.
For example:
nohup spark-submit --jars /tmp/jars/xyz.jar
I am trying to implement the following use case:
Spark reads files in Parquet format from HDFS with Kerberos
Spark writes these files in CSV format
If I write to HDFS, it works perfectly. If I try to write to the local filesystem, it doesn't work: "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer
I am using Spark 1.6.2.
To summarize, my code is:
val dfIn = sqc.read.parquet(pathIsilon)
dfIn.coalesce(1).write.format("com.databricks.spark.csv").save(pathFilesystem)
How can I connect to and load files from a remote BigInsights HDFS (with Kerberos authentication enabled) in a local PySpark program for processing?
df = sqlContext.read.parquet("hdfs://<<remote_hdfs_host>>:8020/testDirectory")
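A rough sketch of the local setup this seems to require, under the assumption that the remote cluster's core-site.xml and hdfs-site.xml are available locally via HADOOP_CONF_DIR and that a Kerberos ticket has already been obtained with kinit (the config path below is hypothetical):

import os
from pyspark.sql import SparkSession

# Assumption: the remote BigInsights cluster's core-site.xml / hdfs-site.xml have been
# copied locally, and `kinit` has been run for a principal allowed to read the path.
os.environ.setdefault("HADOOP_CONF_DIR", "/etc/hadoop/conf")  # hypothetical location

spark = (SparkSession.builder
         .appName("remote-kerberized-hdfs-read")
         .config("spark.hadoop.hadoop.security.authentication", "kerberos")
         .getOrCreate())

df = spark.read.parquet("hdfs://<<remote_hdfs_host>>:8020/testDirectory")
df.show(5)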
Help would be much appreciated.