How can I read/write data from Azurite using Spark? - scala

I have tried to read/write Parquet files from/to Azurite using Spark like this:
import com.holdenkarau.spark.testing.DatasetSuiteBase
import org.apache.spark.SparkConf
import org.apache.spark.sql.SaveMode
import org.scalatest.WordSpec
class SimpleAzuriteSpec extends WordSpec with DatasetSuiteBase {
val AzuriteHost = "localhost"
val AzuritePort = 10000
val AzuriteAccountName = "devstoreaccount1"
val AzuriteAccountKey = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="
val AzuriteContainer = "container1"
val AzuriteDirectory = "dir1"
val AzuritePath = s"wasb://$AzuriteContainer#$AzuriteAccountName.blob.core.windows.net/$AzuriteDirectory/"
override final def conf: SparkConf = {
val cfg = super.conf
val settings =
Map(
s"spark.hadoop.fs.azure.storage.emulator.account.name" -> AzuriteAccountName,
s"spark.hadoop.fs.azure.account.key.${AzuriteAccountName}.blob.core.windows.net" -> AzuriteAccountKey
)
settings.foreach { case (k, v) =>
cfg.set(k, v)
}
cfg
}
"Spark" must {
"write to/read from Azurite" in {
import spark.implicits._
val xs = List(Rec(1, "Alice"), Rec(2, "Bob"))
val inputDs = spark.createDataset(xs)
inputDs.write
.format("parquet")
.mode(SaveMode.Overwrite)
.save(AzuritePath)
val ds = spark.read
.format("parquet")
.load(AzuritePath)
.as[Rec]
ds.show(truncate = false)
val actual = ds.collect().toList.sortBy(_.id)
assert(actual == xs)
}
}
}
case class Rec(id: Int, name: String)
I have tried both Azurite 3.9.0 and Azurite 2.7.0 (both in Docker). I can transfer files to/from Azurite using az (dockerized as well).
The test above runs on the Docker host. Azurite is reachable from the Docker host.
I am using Spark 2.4.5, Hadoop 2.10.0, and this dependency:
libraryDependencies += "org.apache.hadoop" % "hadoop-azure" % "2.10.0"
When using az, this connection string works:
AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://azurite-3.9.0:10000/devstoreaccount1;QueueEndpoint=http://azurite-3.9.0:10001/devstoreaccount1;"
yet I do not know how to configure this in Spark.
My question: How can I configure the host, the port, credentials etc. (in the path or in SparkConf)?

Yes, that's possible, but azurite should be accessible via 127.0.0.1:10000 for wasb (so if it runs on another machine then port forwarding will help) and then specify following spark args as example:
./pyspark --conf "spark.hadoop.fs.defaultFS=wasb://container#azurite" --conf "spark.hadoop.fs.azure.storage.emulator.account.name=azurite"
Then default file system will be backed up by your instance of azurite.

Don't use Azurite, just add these Jars to your Spark Dockerfile:
# Set JARS env
ENV JARS=${SPARK_HOME}/jars/azure-storage-${AZURE_STORAGE_VER}.jar,${SPARK_HOME}/jars/hadoop-azure-${HADOOP_AZURE_VER}.jar,${SPARK_HOME}/jars/jetty-util-ajax-${JETTY_VER}.jar,${SPARK_HOME}/jars/jetty-util-${JETTY_VER}.jar
RUN echo "spark.jars ${JARS}" >> $SPARK_HOME/conf/spark-defaults.conf
Set your configuration:
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().set(f"fs.azure.account.key.{ os.environ['AZURE_STORAGE_ACCOUNT'] }.blob.core.windows.net", os.environ['AZURE_STORAGE_KEY'])
Then you can read it:
val df = spark.read.parquet("wasbs://<container-name>#<storage-account-name>.blob.core.windows.net/<directory-name>")

Related

How to use properties in spark scala maven project

i want to include properties file explicitly and include it in spark code , instead of hardcoding directly in spark code with all credentials.
i am trying following approach but not able to do, AppContext is not able to be resolved.
please guide me how to achieve this.
Spark_env.properties (under src/main/resourcses in maven project for spark with scala)
CASSANDRA_HOST1=127.0.0.133
CASSANDRA_PORT1=9042
CASSANDRA_USER1=usr1
CASSANDRA_PASS1=pas2
DataMigration.cassandra.keyspace1=demo2
DataMigration.cassandra.table1= data1
CASSANDRA_HOST2=
CASSANDRA_PORT2=9042
CASSANDRA_USER2=usr2
CASSANDRA_PASS2=pas2
D.cassandra.keyspace2=kesp2
D.cassandra.table2= data2
DataMigration.DifferencedRecords.output.path1=C:/spark_windows_proj/File1.csv
DataMigration.DifferencedRecords.output.path2=C:/spark_windows_proj/File1.parquet
----------------------------------------------------------------------------------
DM.scala
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.mapreduce.v2.app.AppContext
object Data_Migration {
def main(args: Array[String]) {
val host1: String = AppContext.getProperties().getProperty("CASSANDRA_HOST1")
val port1 = AppContext.getProperties().getProperty("CASSANDRA_PORT1").toInt
val keySpace1: String = AppContext.getProperties().getProperty("DataMigration.cassandra.keyspace1")
val DataMigrationTableName1: String = AppContext.getProperties().getProperty("DataMigration.cassandra.table1")
val username1: String = AppContext.getProperties().getProperty("CASSANDRA_USER1")
val pass1: String = AppContext.getProperties().getProperty("CASSANDRA_PASS1")
val host2: String = AppContext.getProperties().getProperty("CASSANDRA_HOST2")
val port2 = AppContext.getProperties().getProperty("CASSANDRA_PORT2").toInt
val keySpace2: String = AppContext.getProperties().getProperty("DataMigration.cassandra.keyspace2")
val DataMigrationTableName2: String = AppContext.getProperties().getProperty("DataMigration.cassandra.table2")
val username2: String = AppContext.getProperties().getProperty("CASSANDRA_USER2")
val pass2: String = AppContext.getProperties().getProperty("CASSANDRA_PASS2")
val Result_csv: String = AppContext.getProperties().getProperty("DataMigration.DifferencedRecords.output.path1")
val Result_parquet: String = AppContext.getProperties().getProperty("DataMigration.DifferencedRecords.output.path2")
val sc = AppContext.getSparkContext()
val spark = SparkSession
.builder() .master("local")
.appName("ABC")
.config("spark.some.config.option", "some-value")
.getOrCreate()
val df_read1 = spark.read
.format("org.apache.spark.sql.cassandra")
.option("spark.cassandra.connection.host",host1)
.option("spark.cassandra.connection.port",port1)
.option( "spark.cassandra.auth.username",username1)
.option("spark.cassandra.auth.password",pass1)
.option("keyspace",keySpace1)
.option("table",DataMigrationTableName1)
.load()
I would rather pass the properties explicitly by passing the --properties-file option to the spark-submit when submitting the job.
The AppContext won't necessary work for all submission types, while passing config file should work everywhere.
Edit: For local usage without spark-submit, you can simply use the standard Properties class, loading it from the resources and get access to properties. You only need to put property file into src/main/resources instead of src/test/resources that is included into classpath only for tests. The code is something like:
val props = new Properties
props.load(getClass.getClassLoader.getResourceAsStream("file.props"))

Setting UP Intellij to run apache spark with remote master

I have a project setup with H2o. I am able to run the code with apache toree and I set the spark master as spark://xxxx.yyyy.zzzz:port.
It works fine and I can see the output in spark UI.
I am trying to run the same code as application in intellij with but I get error java.lang.ClassNotFoundException: org.apache.spark.h2o.utils.NodeDesc. but I see the application in Spark UI for a short amount of Time.
I tried running simple application with hello world and that worked as well as I am able to see the application in Spark UI,
import java.io.File
import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters
import org.apache.spark.h2o.{StringHolder, H2OContext}
import org.apache.spark.{SparkFiles, SparkContext, SparkConf}
import water.fvec.H2OFrame
/**
* Example of Sparkling Water based application.
*/
object SparklingWaterDroplet {
def main(args: Array[String]) {
// Create Spark Context
val conf = configure("Sparkling Water Droplet")
val sc = new SparkContext(conf)
// Create H2O Context
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext.implicits._
// Register file to be available on all nodes
sc.addFile(this.getClass.getClassLoader.getResource("iris.csv").getPath)
// Load data and parse it via h2o parser
val irisTable = new H2OFrame(new File(SparkFiles.get("iris.csv")))
// Build GBM model
val gbmParams = new GBMParameters()
gbmParams._train = irisTable
gbmParams._response_column = 'class
gbmParams._ntrees = 5
val gbm = new GBM(gbmParams)
val gbmModel = gbm.trainModel.get
// Make prediction on train data
val predict = gbmModel.score(irisTable)('predict)
// Compute number of mispredictions with help of Spark API
val trainRDD = h2oContext.asRDD[StringHolder](irisTable('class))
val predictRDD = h2oContext.asRDD[StringHolder](predict)
// Make sure that both RDDs has the same number of elements
assert(trainRDD.count() == predictRDD.count)
val numMispredictions = trainRDD.zip(predictRDD).filter( i => {
val act = i._1
val pred = i._2
act.result != pred.result
}).collect()
println(
s"""
|Number of mispredictions: ${numMispredictions.length}
|
|Mispredictions:
|
|actual X predicted
|------------------
|${numMispredictions.map(i => i._1.result.get + " X " + i._2.result.get).mkString("\n")}
""".stripMargin)
// Shutdown application
sc.stop()
}
def configure(appName:String = "Sparkling Water Demo"):SparkConf = {
val conf = new SparkConf().setAppName(appName)
.setMaster("spark://xxx.yyy.zz.aaaa:oooo")
conf
}
}
I also tried exporting jars as compile from the dependencies menu ,
is there anything I am missing from Intellij setup.?
It looks like the external libraries are not getting pushed to the master

Load a file from SFTP server into spark RDD

How can I load a file from SFTP server into spark RDD. After loading this file I need to perform some filtering on the data. Also the file is csv file so could you please help me decide if I should use Dataframes or RDDs.
You can use spark-sftp library in your program in following ways:
For Spark 2.x
Maven Dependency
<dependency>
<groupId>com.springml</groupId>
<artifactId>spark-sftp_2.11</artifactId>
<version>1.1.0</version>
</dependency>
SBT Dependency
libraryDependencies += "com.springml" % "spark-sftp_2.11" % "1.1.0"
Using with Spark shell
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$ bin/spark-shell --packages com.springml:spark-sftp_2.11:1.1.0
Scala API
// Construct Spark dataframe using file in FTP server
val df = spark.read.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
option("inferSchema", "true").
load("/ftp/files/sample.csv")
// Write dataframe as CSV file to FTP server
df.write.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
save("/ftp/files/sample.csv")
For Spark 1.x (1.5+)
Maven Dependency
<dependency>
<groupId>com.springml</groupId>
<artifactId>spark-sftp_2.10</artifactId>
<version>1.0.2</version>
</dependency>
SBT Dependency
libraryDependencies += "com.springml" % "spark-sftp_2.10" % "1.0.2"
Using with Spark shell
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$ bin/spark-shell --packages com.springml:spark-sftp_2.10:1.0.2
Scala API
import org.apache.spark.sql.SQLContext
// Construct Spark dataframe using file in FTP server
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
option("inferSchema", "true").
load("/ftp/files/sample.csv")
// Write dataframe as CSV file to FTP server
df.write().
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
save("/ftp/files/sample.csv")
For more information on spark-sftp you can visit there github page springml/spark-sftp
Loading from SFTP is straight forward using the sftp-connector.
https://github.com/springml/spark-sftp
Remember it is single thread application and lands data into hdfs even you dont specify it. It Streams the data into hdfs and then creates an DataFrame on top of it
While Loading we need to specify couple of more parameters.
Normally with out specifying the location also it may work when your user sudo user of hdfs. It will create the temp file in / of hdfs and will delete it once the process is completed.
val data = sparkSession.read.format("com.springml.spark.sftp").
option("host", "host").
option("username", "user").
option("password", "password").
option("fileType", "json").
option("createDF", "true").
option("hdfsTempLocation","/user/currentuser/").
load("/Home/test_mapping.json");
All the available options are the following, Source code
https://github.com/springml/spark-sftp/blob/master/src/main/scala/com/springml/spark/sftp/DefaultSource.scala
override def createRelation(sqlContext: SQLContext, parameters: Map[String, String], schema: StructType) = {
val username = parameters.get("username")
val password = parameters.get("password")
val pemFileLocation = parameters.get("pem")
val pemPassphrase = parameters.get("pemPassphrase")
val host = parameters.getOrElse("host", sys.error("SFTP Host has to be provided using 'host' option"))
val port = parameters.get("port")
val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
val fileType = parameters.getOrElse("fileType", sys.error("File type has to be provided using 'fileType' option"))
val inferSchema = parameters.get("inferSchema")
val header = parameters.getOrElse("header", "true")
val delimiter = parameters.getOrElse("delimiter", ",")
val createDF = parameters.getOrElse("createDF", "true")
val copyLatest = parameters.getOrElse("copyLatest", "false")
//System.setProperty("java.io.tmpdir","hdfs://devnameservice1/../")
val tempFolder = parameters.getOrElse("tempLocation", System.getProperty("java.io.tmpdir"))
val hdfsTemp = parameters.getOrElse("hdfsTempLocation", tempFolder)
val cryptoKey = parameters.getOrElse("cryptoKey", null)
val cryptoAlgorithm = parameters.getOrElse("cryptoAlgorithm", "AES")
val supportedFileTypes = List("csv", "json", "avro", "parquet")
if (!supportedFileTypes.contains(fileType)) {
sys.error("fileType " + fileType + " not supported. Supported file types are " + supportedFileTypes)
}

How to make Spark slaves use HDFS input files 'local' to them in a Hadoop+Spark cluster?

I have a cluster of 9 computers with Apache Hadoop 2.7.2 and Spark 2.0.0 installed on them. Each computer runs an HDFS datanode and Spark slave. One of these computers also runs an HDFS namenode and Spark master.
I've uploaded a few TBs of gz-archives in HDFS with Replication=2. It turned out that some of the archives are corrupt. I'd want to find them. It looks like 'gunzip -t ' can help. So I'm trying to find a way to run a Spark application on the cluster so that each Spark executor tests archives 'local' (i.e. having one of the replicas located on the same computer where this executor runs) to it as long as it is possible. The following script runs but sometimes Spark executors process 'remote' files in HDFS:
// Usage (after packaging a jar with mainClass set to 'com.qbeats.cortex.CommoncrawlArchivesTester' in spark.pom
// and placing this jar file into Spark's home directory):
// ./bin/spark-submit --master spark://LV-WS10.lviv:7077 spark-cortex-fat.jar spark://LV-WS10.lviv:7077 hdfs://LV-WS10.lviv:9000/commoncrawl 9
// means testing for corruption the gz-archives in the directory hdfs://LV-WS10.lviv:9000/commoncrawl
// using a Spark cluster with the Spark master URL spark://LV-WS10.lviv:7077 and 9 Spark slaves
package com.qbeats.cortex
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.FileSplit
import org.apache.spark.rdd.HadoopRDD
import org.apache.spark.{SparkContext, SparkConf, AccumulatorParam}
import sys.process._
object CommoncrawlArchivesTester extends App {
object LogAccumulator extends AccumulatorParam[String] {
def zero(initialValue: String): String = ""
def addInPlace(log1: String, log2: String) = if (log1.isEmpty) log2 else log1 + "\n" + log2
}
override def main(args: Array[String]): Unit = {
if (args.length >= 3) {
val appName = "CommoncrawlArchivesTester"
val conf = new SparkConf().setAppName(appName).setMaster(args(0))
conf.set("spark.executor.memory", "6g")
conf.set("spark.shuffle.service.enabled", "true")
conf.set("spark.dynamicAllocation.enabled", "true")
conf.set("spark.dynamicAllocation.initialExecutors", args(2))
val sc = new SparkContext(conf)
val log = sc.accumulator(LogAccumulator.zero(""))(LogAccumulator)
val text = sc.hadoopFile(args(1), classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
val hadoopRdd = text.asInstanceOf[HadoopRDD[LongWritable, Text]]
val fileAndLine = hadoopRdd.mapPartitionsWithInputSplit { (inputSplit, iterator) =>
val fileName = inputSplit.asInstanceOf[FileSplit].getPath.toString
class FilePath extends Iterable[String] {
def iterator = List(fileName).iterator
}
val result = (sys.env("HADOOP_PREFIX") + "/bin/hadoop fs -cat " + fileName) #| "gunzip -t" !
println("Processed %s.".format(fileName))
if (result != 0) {
log.add(fileName)
println("Corrupt: %s.".format(fileName))
}
(new FilePath).iterator
}
val result = fileAndLine.collect()
println("Corrupted files:")
println(log.value)
}
}
}
What would you suggest?
ADDED LATER:
I tried another script which gets files from HDFS via textFile(). I looks like a Spark executor doesn't prefer among input files the files which are 'local' to it. Doesn't it contradict to "Spark brings code to data, not data to code"?
// Usage (after packaging a jar with mainClass set to 'com.qbeats.cortex.CommoncrawlArchiveLinesCounter' in spark.pom)
// ./bin/spark-submit --master spark://LV-WS10.lviv:7077 spark-cortex-fat.jar spark://LV-WS10.lviv:7077 hdfs://LV-WS10.lviv:9000/commoncrawl 9
package com.qbeats.cortex
import org.apache.spark.{SparkContext, SparkConf}
object CommoncrawlArchiveLinesCounter extends App {
override def main(args: Array[String]): Unit = {
if (args.length >= 3) {
val appName = "CommoncrawlArchiveLinesCounter"
val conf = new SparkConf().setAppName(appName).setMaster(args(0))
conf.set("spark.executor.memory", "6g")
conf.set("spark.shuffle.service.enabled", "true")
conf.set("spark.dynamicAllocation.enabled", "true")
conf.set("spark.dynamicAllocation.initialExecutors", args(2))
val sc = new SparkContext(conf)
val helper = new Helper
val nLines = sc.
textFile(args(1) + "/*").
mapPartitionsWithIndex( (index, it) => {
println("Processing partition %s".format(index))
it
}).
count
println(nLines)
}
}
}
SAIF C, could you explain in more detail please?
I've solved the problem by switching from Spark’s standalone mode to YARN.
Related topic: How does Apache Spark know about HDFS data nodes?

Running Mlib via Spark Job Server

I was practising developing sample model using online resources provided in spark website. I managed to create the model and run it for sample data using Spark-Shell , But how to do actually run the model in production environment ? Is it via Spark Job server ?
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
The above code works perfect when i run it in spark-shell , But i have no idea how do we actually run model in production environment. I tried to run it via spark jobserver but i get error ,
curl -d "input.string = 1, 2, 3, 4, 5, 6, 7, 8, 9" 'ptfhadoop01v:8090/jobs?appName=SQL&classPath=spark.jobserver.SparkPredict'
I am sure its because am passing a String value whereas the program expects it be vector elements , Can someone guide me on how to achieve this . And also is this how the data being passed to Model in production environment ? Or is it some other way.
Spark Job-server is used in production use-cases, where you want to design pipelines of Spark jobs, and also (optionally) use the SparkContext across jobs, over a REST API. Sparkplug is an alternative to Spark Job-server, providing similar constructs.
However, to answer your question on how to run a (singular) Spark job in production environments, the answer is you do not need a third-party library to do so. You only need to construct a SparkContext object, and use it to trigger Spark jobs. For instance, for your code snippet, all that is needed is;
package runner
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import com.typesafe.config.{ConfigFactory, Config}
import org.apache.spark.{SparkConf, SparkContext}
/**
*
*/
object SparkRunner {
def main (args: Array[String]){
val config: Config = ConfigFactory.load("app-default-config") /*Use a library to read a config file*/
val sc: SparkContext = constructSparkContext(config)
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
}
def constructSparkContext(config: Config): SparkContext = {
val conf = new SparkConf()
conf
.setMaster(config.getString("spark.master"))
.setAppName(config.getString("app.name"))
/*Set more configuration values here*/
new SparkContext(conf)
}
}
Optionally, you can also use the wrapper for spark-submit script, SparkSubmit, provided in the Spark library itself.