Accessing Azure Data Lake Storage Gen2 from Scala

I can connect to ADLS Gen2 from a notebook running on Azure Databricks, but not from a job that runs a JAR. The Scala code sets the same Spark configuration as the notebook, except that it does not use dbutils.
Notebook:
spark.conf.set(
  "fs.azure.account.key.xxxx.dfs.core.windows.net",
  dbutils.secrets.get(scope = "kv-secrets", key = "xxxxxx"))
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
val rdd = sqlContext.read.format("csv").option("header", "true")
  .load("abfss://catalogs@xxxx.dfs.core.windows.net/test/sample.csv")
// Convert rdd to data frame using toDF; the following import is
// required to use toDF function.
val df: DataFrame = rdd.toDF()
// Write file to parquet
df.write.parquet("abfss://catalogs@xxxx.dfs.core.windows.net/test/Sales.parquet")
Scala code:
val sc = SparkContext.getOrCreate()
val spark = SparkSession.builder().getOrCreate()
sc.getConf.setAppName("Test")
sc.getConf.set("fs.azure.account.key.xxxx.dfs.core.windows.net", "<actual key>")
sc.getConf.set("fs.azure.account.auth.type", "OAuth")
sc.getConf.set("fs.azure.account.oauth.provider.type",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
sc.getConf.set("fs.azure.account.oauth2.client.id", "<app id>")
sc.getConf.set("fs.azure.account.oauth2.client.secret", "<app password>")
sc.getConf.set("fs.azure.account.oauth2.client.endpoint",
  "https://login.microsoftonline.com/<tenant id>/oauth2/token")
sc.getConf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
val sqlContext = spark.sqlContext
val rdd = sqlContext.read.format("csv").option("header", "true")
  .load("abfss://catalogs@xxxx.dfs.core.windows.net/test/sample.csv")
// Convert rdd to data frame using toDF; the following import is
// required to use toDF function.
val df: DataFrame = rdd.toDF()
println(df.count())
// Write file to parquet
df.write.parquet("abfss://catalogs@xxxx.dfs.core.windows.net/test/Sales.parquet")
I expected the parquet file to get written. Instead I get the following error:
19/04/20 13:58:40 ERROR Uncaught throwable from user code: Configuration property xxxx.dfs.core.windows.net not found.
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:385)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:802)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:133)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:103)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)

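Never mind, silly mistake. sc.getConf returns a copy of the SparkContext's configuration, so the sc.getConf.set(...) calls never reach the running session. The settings have to go through spark.conf.set instead. It should be: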
val sc = SparkContext.getOrCreate()
val spark = SparkSession.builder().getOrCreate()
sc.getConf.setAppName("Test")
spark.conf.set("fs.azure.account.key.xxxx.dfs.core.windows.net", "<actual key>")
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<app id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<app password>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint",
  "https://login.microsoftonline.com/<tenant id>/oauth2/token")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
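To keep the notebook and the JAR from drifting apart, the OAuth settings listed in this answer could be collected in one place and applied in a loop. The following is only a sketch in plain Scala: the object AdlsOAuthSettings, its method names and its parameters are hypothetical, and the keys are copied from the answer rather than checked against a particular Databricks runtime. The abfssUri helper assumes the usual abfss://<container>@<account>.dfs.core.windows.net/<path> layout.

```scala
// Hypothetical helper, not part of Spark or Databricks.
object AdlsOAuthSettings {
  // Returns the OAuth-related key/value pairs used in the answer.
  def oauth(clientId: String, clientSecret: String, tenantId: String): Map[String, String] = Map(
    "fs.azure.account.auth.type" -> "OAuth",
    "fs.azure.account.oauth.provider.type" ->
      "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id" -> clientId,
    "fs.azure.account.oauth2.client.secret" -> clientSecret,
    "fs.azure.account.oauth2.client.endpoint" ->
      s"https://login.microsoftonline.com/${tenantId}/oauth2/token",
    "fs.azure.createRemoteFileSystemDuringInitialization" -> "false"
  )

  // Builds an abfss URI; a leading slash on the path is tolerated.
  def abfssUri(container: String, account: String, path: String): String =
    s"abfss://${container}@${account}.dfs.core.windows.net/${path.stripPrefix("/")}"
}
```

In a job, the map would then be applied to the session rather than to sc.getConf, for example AdlsOAuthSettings.oauth("<app id>", "<app password>", "<tenant id>").foreach { case (k, v) => spark.conf.set(k, v) }, where spark is the existing SparkSession.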

Related

CSV format is not loading in spark-shell

Using Spark 1.6, I tried the following code:
val diamonds = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/got_own/com_sep_fil.csv")
which caused the error
error: not found: value spark
In the Spark 1.6 shell you get sc of type SparkContext, not spark of type SparkSession. If you want that functionality, you need to instantiate a SQLContext:
import org.apache.spark.sql._
val spark = new SQLContext(sc)
sqlContext is the implicit SQLContext object in the shell. It can be used to load the CSV file, with com.databricks.spark.csv given as the file format:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("data.csv")
You need to initialize an instance of SQLContext (Spark version < 2.0) or SparkSession (Spark version >= 2.0) to use the methods provided by Spark.
To initialize a spark instance for Spark version < 2.0 use:
import org.apache.spark.sql._
val spark = new SQLContext(sc)
To initialize a spark instance for Spark version >= 2.0 use:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SparkSessionExample").master("local").getOrCreate()
To read the CSV using Spark 1.6 and the Databricks spark-csv package:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("data.csv")
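If the same code has to run on both sides of the 2.0 boundary, the format name could be chosen from the version string. This is a rough sketch under stated assumptions: CsvFormat is a hypothetical helper, the version string is expected to start with a numeric major version such as "1.6.0", and the spark-csv package is assumed to be on the classpath for Spark 1.x.

```scala
// Hypothetical helper: picks the CSV data source name for a Spark version string.
object CsvFormat {
  def forSparkVersion(version: String): String = {
    // Leading digits of the version string, e.g. "1" for "1.6.0".
    // Throws NumberFormatException if the string does not start with a digit.
    val major = version.takeWhile(_.isDigit).toInt
    if (major >= 2) "csv" else "com.databricks.spark.csv"
  }
}
```

In the shell this could be used as sqlContext.read.format(CsvFormat.forSparkVersion(sc.version)), where sc.version is the version string reported by the SparkContext.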

Getting error while converting DynamicFrame to a Spark DataFrame using toDF

I started using AWS Glue to read data through the Data Catalog and GlueContext and to transform it as required.
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
val sparkSession = glueContext.getSparkSession
// Data Catalog: database and table name
val dbName = "abcdb"
val tblName = "xyzdt_2017_12_05"
// S3 location for output
val outputDir = "s3://output/directory/abc"
// Read data into a DynamicFrame using the Data Catalog metadata
val stGBDyf = glueContext.getCatalogSource(database = dbName, tableName = tblName).getDynamicFrame()
val revisedDF = stGBDyf.toDf() // This line raises the error
While executing the above code I got the following error:
Error : Syntax Error: error: value toDf is not a member of com.amazonaws.services.glue.DynamicFrame
val revisedDF = stGBDyf.toDf()
one error found.
I followed this example to convert a DynamicFrame to a Spark DataFrame.
Please suggest the best way to resolve this problem.
There's a typo. It should work fine with capital F in toDF:
val revisedDF = stGBDyf.toDF()

special character in dataframe spark

I have a CSV file with the following content
id,pos_id,supplier_id
5127973,2000,"test
5704355,77,/10122
I want to load it into a DataFrame with the data exactly as it is. The DataFrame will then be loaded into PostgreSQL through JDBC.
Here is what I did:
val conf = new SparkConf().setMaster("local[2]").setAppName("my app")
val sc = new SparkContext(conf)
val sparkSession = SparkSession.builder.config(conf = conf).appName("spark session example").getOrCreate()
val df= sparkSession.sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true").option("escape", "\"").load("C:\\Users\\MHT\\Desktop\\data.csv")
df.show()
+-------+------+--------------------+
| id|pos_id| supplier_id|
+-------+------+--------------------+
|5127973| 2000|test
5704355,77,/...|
+-------+------+--------------------+
What should I do to get the same data in the DataFrame, and then the same data in PostgreSQL?
Write the CSV data to HDFS, then use Sqoop to export it to the destination database, providing the required JDBC JARs in the $SQOOP_HOME/lib directory.
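The df.show() output in the question suggests the unbalanced double quote in "test is read as the start of a quoted field, which then swallows the next line. One way to see what "the data as it is" would look like is to treat the quote as an ordinary character. The sketch below does that in plain Scala on the sample rows. LiteralQuoteCsv is a hypothetical helper, and it only works for data with no commas or line breaks inside fields. For Spark's built-in CSV reader in 2.x, the documented way to turn quoting off is to set the quote option to an empty string. Whether the spark-csv package used in the question behaves the same way is an assumption to verify.

```scala
// Hypothetical helper: parses CSV text with no quote handling at all.
object LiteralQuoteCsv {
  // Splits on line breaks, then on commas; a double quote stays part of the value.
  def parse(text: String): Vector[Vector[String]] =
    text.split("\r?\n").toVector
      .filter(_.nonEmpty)
      .map(_.split(",", -1).toVector)
}
```

Applied to the three lines from the question, this keeps the second data row separate and leaves the supplier_id of the first row as the five characters "test (quote included).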

spark dataframe write to file using scala

I am trying to read a file and add two extra columns: (1) a sequence number and (2) the file name.
When I run the Spark job in Scala IDE, the output is generated correctly, but when I run it from PuTTY in local or cluster mode, the job gets stuck at stage 2 (save at File_Process). There is no progress even if I wait for an hour. I am testing on 1 GB of data.
Below is the code I am using:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object File_Process {
  Logger.getLogger("org").setLevel(Level.ERROR)
  val spark = SparkSession
    .builder()
    .master("yarn")
    .appName("File_Process")
    .getOrCreate()

  // SEED and filename are not defined in this snippet.
  def main(arg: Array[String]): Unit = {
    val FileDF = spark.read
      .csv("/data/sourcefile/")
    // Prepend a sequence number to every row
    val rdd = FileDF.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((indexedRow._2.toLong + SEED + 1) +: indexedRow._1.toSeq))
    // Schema with the extra leading column
    val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier", LongType)).++(FileDF.schema.fields))
    val datasetnew = spark.createDataFrame(rdd, FileDFWithSeqNo)
    val dataframefinal = datasetnew.withColumn("Filetag", lit(filename))
    val query = dataframefinal.write
      .mode("overwrite")
      .format("com.databricks.spark.csv")
      .option("delimiter", "|")
      .save("/data/text_file/")
    spark.stop()
  }
}
If I remove the logic that adds the sequence number, the code works fine.
The code that creates the sequence number is:
val rdd = FileDF.rdd.zipWithIndex().map(indexedRow =>Row.fromSeq((indexedRow._2.toLong+SEED+1)+:indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier",LongType)).++(FileDF.schema.fields))
val datasetnew = spark.createDataFrame(rdd,FileDFWithSeqNo)
Thanks in advance.
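For checking the numbering arithmetic without a cluster, the zipWithIndex/map step can be mirrored on an ordinary Scala collection. This is only a local sketch: SeqNo and withSeqNo are hypothetical names, rows are modelled as Seq[Any] instead of Spark Row objects, and the seed value is made up because SEED is not shown in the question. It says nothing about why the job hangs on YARN.

```scala
// Hypothetical local stand-in for the RDD-based sequence numbering.
object SeqNo {
  // Prepends (index + seed + 1) to every row, with indexes starting at 0.
  def withSeqNo(rows: Seq[Seq[Any]], seed: Long): Seq[Seq[Any]] =
    rows.zipWithIndex.map { case (row, idx) => (idx.toLong + seed + 1) +: row }
}
```

With a seed of 100, the first row receives 101 and the second 102, which matches what the expression indexedRow._2 + SEED + 1 would produce for the first two rows.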

Convert dataframe to hive table in spark scala

I am trying to save a DataFrame as a Hive table in Spark Scala. I have read the DataFrame in from an XML file, using a SQL context to do so. When I save the DataFrame as a Hive table, I get this message:
"WARN HiveContext$$anon$1: Could not persist database_1.test_table in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format."
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object spark_conversion {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: <input file> <output dir>")
      System.exit(1)
    }
    val in_path = args(0)
    val out_path_csv = args(1)
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("conversion")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    // Read the XML file, one row per PolicyPeriod element
    val df = hiveContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "PolicyPeriod")
      .option("attributePrefix", "attr_")
      .load(in_path)
    // Write a CSV copy, then save as a Hive table
    df.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save(out_path_csv)
    df.saveAsTable("database_1.test_table")
    df.printSchema()
    df.show()
  }
}
saveAsTable in Spark is not compatible with Hive. I am on CDH 5.5.2. The workaround from the Cloudera website is:
df.registerTempTable(tempName)
hsc.sql(s"""
  CREATE TABLE $tableName (
    // field definitions
  )
  STORED AS $format """)
hsc.sql(s"INSERT INTO TABLE $tableName SELECT * FROM $tempName")
http://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html
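The "field definitions" placeholder in the workaround has to be filled in by hand. A minimal sketch of building that statement from a list of column names and Hive types is shown here. HiveDdl and createTable are hypothetical names, the column list in the usage note is invented for illustration, and no escaping of identifiers is attempted.

```scala
// Hypothetical helper: renders the CREATE TABLE statement used in the workaround.
object HiveDdl {
  // fields: (column name, Hive type) pairs, in table order.
  def createTable(tableName: String, fields: Seq[(String, String)], format: String): String = {
    val columns = fields.map { case (name, hiveType) => s"$name $hiveType" }.mkString(", ")
    s"CREATE TABLE $tableName ($columns) STORED AS $format"
  }
}
```

For example, hsc.sql(HiveDdl.createTable("database_1.test_table", Seq("id" -> "BIGINT", "name" -> "STRING"), "PARQUET")) would create the table, after which the INSERT INTO ... SELECT from the temporary table copies the rows across.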