Reading CSV using spark-csv package in spark-shell - scala

I am trying to use spark-csv to read a CSV file from AWS S3 in spark-shell.
Below are the steps I followed. I started spark-shell with the command below:
bin/spark-shell --packages com.databricks:spark-csv_2.10:1.2.0
In the shell, I executed the following Scala code:
scala> val hadoopConf = sc.hadoopConfiguration
scala> hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
scala> hadoopConf.set("fs.s3.awsAccessKeyId", "****")
scala> hadoopConf.set("fs.s3.awsSecretAccessKey", "****")
scala> val s3path = "s3n://bucket/sample.csv"
scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(s3path)
I get the following error:
java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
What am I missing here? Please note that I am able to read the CSV using
scala> sc.textFile(s3path)
The same Scala code works fine in a Databricks notebook as well.
I have created an issue in the spark-csv GitHub repository. I'll update here when I get an answer for the issue.

For the URL s3n://bucket/sample.csv, all the s3n properties have to be set. Setting the properties below allowed me to read the CSV using spark-csv:
scala> val hadoopConf = sc.hadoopConfiguration
scala> hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
scala> hadoopConf.set("fs.s3n.awsAccessKeyId", "****")
scala> hadoopConf.set("fs.s3n.awsSecretAccessKey", "****")
See https://github.com/databricks/spark-csv/issues/137
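Putting the fix together, the full sequence in spark-shell looks roughly like this (bucket name and keys are placeholders; inferSchema is optional):
// started with: bin/spark-shell --packages com.databricks:spark-csv_2.10:1.2.0
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "****")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "****")

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // first line of the file is the header
  .option("inferSchema", "true") // let spark-csv infer column types
  .load("s3n://bucket/sample.csv")

df.printSchema()
df.show(5)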

Related

Unable to validate Spark Installation from Shell

I tried to install Spark on a Mac using Homebrew. I followed all the steps described in https://sparkbyexamples.com/spark/install-apache-spark-on-mac/. However, when I try to validate the Spark installation from the shell, I get the following output. How can I fix this? I have already reinstalled, but nothing changed. Thank you.
scala> import spark.implicits._
import spark.implicits._
scala> val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
data: Seq[(String, String)] = List((Java,20000), (Python,100000), (Scala,3000))
scala> val df = data.toDF()
java.lang.NoSuchMethodError: 'boolean org.apache.spark.util.Utils$.isInRunningSparkTask()'
at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:201)
at org.apache.spark.sql.types.DataType.sameType(DataType.scala:99)
at org.apache.spark.sql.catalyst.analysis.TypeCoercionBase.$anonfun$haveSameType$1(TypeCoercion.scala:157)
at org.apache.spark.sql.catalyst.analysis.TypeCoercionBase.$anonfun$haveSameType$1$adapted(TypeCoercion.scala:157)
at scala.collection.LinearSeqOptimized.forall(LinearSeqOptimized.scala:85)
at scala.collection.LinearSeqOptimized.forall$(LinearSeqOptimized.scala:82)
at scala.collection.immutable.List.forall(List.scala:91)
at org.apache.spark.sql.catalyst.analysis.TypeCoercionBase.haveSameType(TypeCoercion.scala:157)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1124)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1119)
at org.apache.spark.sql.catalyst.expressions.If.dataTypeCheck(conditionalExpressions.scala:39)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(Expression.scala:1130)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$(Expression.scala:1129)
at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$lzycompute(conditionalExpressions.scala:39)
at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(conditionalExpressions.scala:39)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType(Expression.scala:1134)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$(Expression.scala:1134)
at org.apache.spark.sql.catalyst.expressions.If.dataType(conditionalExpressions.scala:39)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.isSerializedAsStruct(ExpressionEncoder.scala:306)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.isSerializedAsStructForTopLevel(ExpressionEncoder.scala:316)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.<init>(ExpressionEncoder.scala:245)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:61)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:300)
at org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:261)
at org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:261)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:32)
... 49 elided
scala> df.show()
<console>:26: error: not found: value df
df.show()
^
This is the output I get from df.show().
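A NoSuchMethodError on a Spark-internal method at this point typically indicates mismatched Spark or Scala jars on the classpath (for example, a leftover older installation). A generic sanity check from inside spark-shell (standard APIs only; this is a diagnostic sketch, not a confirmed fix):
// Versions reported here should match the single Spark installation you expect to be running.
println(spark.version)
println(scala.util.Properties.versionString)
println(sys.env.getOrElse("SPARK_HOME", "SPARK_HOME not set"))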

Accessing Azure Data Lake Storage gen2 from Scala

I am able to connect to ADLS Gen2 from a notebook running on Azure Databricks, but I am unable to connect from a job using a jar. I used the same Spark conf settings in the Scala code as in the notebook, save for the use of dbutils.
Notebook:
spark.conf.set(
  "fs.azure.account.key.xxxx.dfs.core.windows.net",
  dbutils.secrets.get(scope = "kv-secrets", key = "xxxxxx"))
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
val rdd = sqlContext.read.format("csv")
  .option("header", "true")
  .load("abfss://catalogs@xxxx.dfs.core.windows.net/test/sample.csv")
// Convert to a data frame using toDF; an import is required to use the toDF function.
val df: DataFrame = rdd.toDF()
// Write file to parquet
df.write.parquet("abfss://catalogs@xxxx.dfs.core.windows.net/test/Sales.parquet")
Scala code:
val sc = SparkContext.getOrCreate()
val spark = SparkSession.builder().getOrCreate()
sc.getConf.setAppName("Test")
sc.getConf.set("fs.azure.account.key.xxxx.dfs.core.windows.net", "<actual key>")
sc.getConf.set("fs.azure.account.auth.type", "OAuth")
sc.getConf.set("fs.azure.account.oauth.provider.type",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
sc.getConf.set("fs.azure.account.oauth2.client.id", "<app id>")
sc.getConf.set("fs.azure.account.oauth2.client.secret", "<app password>")
sc.getConf.set("fs.azure.account.oauth2.client.endpoint",
  "https://login.microsoftonline.com/<tenant id>/oauth2/token")
sc.getConf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
val sqlContext = spark.sqlContext
val rdd = sqlContext.read.format("csv")
  .option("header", "true")
  .load("abfss://catalogs@xxxx.dfs.core.windows.net/test/sample.csv")
// Convert to a data frame using toDF; an import is required to use the toDF function.
val df: DataFrame = rdd.toDF()
println(df.count())
// Write file to parquet
df.write.parquet("abfss://catalogs@xxxx.dfs.core.windows.net/test/Sales.parquet")
I expected the parquet file to get written. Instead I get the following error:
19/04/20 13:58:40 ERROR Uncaught throwable from user code: Configuration property xxxx.dfs.core.windows.net not found.
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:385)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:802)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:133)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:103)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
Never mind, silly mistake. It should be:
val sc = SparkContext.getOrCreate()
val spark = SparkSession.builder().getOrCreate()
sc.getConf.setAppName("Test")
spark.conf.set("fs.azure.account.key.xxxx.dfs.core.windows.net", "<actual key>")
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<app id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<app password>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint",
  "https://login.microsoftonline.com/<tenant id>/oauth2/token")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")

Spark Streaming: Write Data to HDFS by reading from one HDFS dir to another

I am trying to use Spark Streaming to read data from one HDFS location and write it to another.
Below is my code snippet, run in spark-shell.
However, I couldn't see any files created in the HDFS output directory.
Can someone point out how to load the files onto HDFS?
scala> sc.stop()
scala> import org.apache.spark.SparkConf
scala> import org.apache.spark.streaming
scala> import org.apache.spark.streaming.{StreamingContext,Seconds}
scala> val conf = new SparkConf().setMaster("local[2]").setAppName("files_word_count")
scala> val ssc = new StreamingContext(conf,Seconds(10))
scala> val DF = ssc.textFileStream("/user/cloudera/streamingcontext_dir")
scala> val words_freq = DF.flatMap(x=>(x.split(" "))).map(y=>(y,1)).reduceByKey(_+_)
scala> words_freq.saveAsTextFiles("hdfs://localhost:8020/user/cloudera/streamingcontext_dir2")
scala> ssc.start()
I have placed files in HDFS under "/user/cloudera/streamingcontext_dir" and created another directory "/user/cloudera/streamingcontext_dir2" to see the files written,
but I couldn't see any files in the output directory.
Can someone point out what's wrong here?
Thanks
Sumit
Try operating on each RDD here rather than on the entire DStream, perhaps:
words_freq.foreachRDD(rdd =>
  rdd.saveAsTextFile("hdfs://localhost:8020/user/cloudera/streamingcontext_dir2"))
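For completeness, a minimal end-to-end sketch of the same word count (paths as in the question). ssc.awaitTermination() keeps the streaming context alive, textFileStream only picks up files moved into the input directory after the stream has started, and writing each batch to its own directory avoids colliding with an existing output path:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("files_word_count")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.textFileStream("/user/cloudera/streamingcontext_dir")
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// Each non-empty batch is written to its own output directory.
wordCounts.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    rdd.saveAsTextFile(s"hdfs://localhost:8020/user/cloudera/streamingcontext_dir2/batch-${time.milliseconds}")
  }
}

ssc.start()
ssc.awaitTermination()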

Load a file from SFTP server into spark RDD

How can I load a file from an SFTP server into a Spark RDD? After loading this file I need to perform some filtering on the data. The file is a CSV file, so could you please also help me decide whether I should use DataFrames or RDDs?
You can use the spark-sftp library in your program in the following ways:
For Spark 2.x
Maven Dependency
<dependency>
<groupId>com.springml</groupId>
<artifactId>spark-sftp_2.11</artifactId>
<version>1.1.0</version>
</dependency>
SBT Dependency
libraryDependencies += "com.springml" % "spark-sftp_2.11" % "1.1.0"
Using with Spark shell
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$ bin/spark-shell --packages com.springml:spark-sftp_2.11:1.1.0
Scala API
// Construct Spark dataframe using file in FTP server
val df = spark.read.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
option("inferSchema", "true").
load("/ftp/files/sample.csv")
// Write dataframe as CSV file to FTP server
df.write.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
save("/ftp/files/sample.csv")
For Spark 1.x (1.5+)
Maven Dependency
<dependency>
<groupId>com.springml</groupId>
<artifactId>spark-sftp_2.10</artifactId>
<version>1.0.2</version>
</dependency>
SBT Dependency
libraryDependencies += "com.springml" % "spark-sftp_2.10" % "1.0.2"
Using with Spark shell
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$ bin/spark-shell --packages com.springml:spark-sftp_2.10:1.0.2
Scala API
import org.apache.spark.sql.SQLContext
// Construct Spark dataframe using file in FTP server
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
option("inferSchema", "true").
load("/ftp/files/sample.csv")
// Write dataframe as CSV file to FTP server
df.write.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
save("/ftp/files/sample.csv")
For more information on spark-sftp you can visit their GitHub page: springml/spark-sftp
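On the DataFrames-vs-RDDs part of the question: the connector already returns a DataFrame, so the filtering can usually be done directly on it. A small sketch, applied to the df loaded above, with hypothetical column names:
// "status" and "amount" are hypothetical column names; adjust to the actual CSV header.
val filtered = df.filter(df("status") === "ACTIVE" && df("amount") > 100)
filtered.show()

// If an RDD is really needed afterwards, it is one call away:
val rows = filtered.rdd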
Loading from SFTP is straightforward using the sftp-connector:
https://github.com/springml/spark-sftp
Remember that it is a single-threaded application and that it lands the data in HDFS even if you don't specify it: it streams the data into HDFS and then creates a DataFrame on top of it.
While loading we need to specify a couple of extra parameters.
Normally it may also work without specifying the location, when your user is a sudo user of HDFS. It will create a temp file in / of HDFS and delete it once the process is completed.
val data = sparkSession.read.format("com.springml.spark.sftp").
option("host", "host").
option("username", "user").
option("password", "password").
option("fileType", "json").
option("createDF", "true").
option("hdfsTempLocation","/user/currentuser/").
load("/Home/test_mapping.json");
All the available options can be seen in the source code:
https://github.com/springml/spark-sftp/blob/master/src/main/scala/com/springml/spark/sftp/DefaultSource.scala
override def createRelation(sqlContext: SQLContext, parameters: Map[String, String], schema: StructType) = {
val username = parameters.get("username")
val password = parameters.get("password")
val pemFileLocation = parameters.get("pem")
val pemPassphrase = parameters.get("pemPassphrase")
val host = parameters.getOrElse("host", sys.error("SFTP Host has to be provided using 'host' option"))
val port = parameters.get("port")
val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
val fileType = parameters.getOrElse("fileType", sys.error("File type has to be provided using 'fileType' option"))
val inferSchema = parameters.get("inferSchema")
val header = parameters.getOrElse("header", "true")
val delimiter = parameters.getOrElse("delimiter", ",")
val createDF = parameters.getOrElse("createDF", "true")
val copyLatest = parameters.getOrElse("copyLatest", "false")
//System.setProperty("java.io.tmpdir","hdfs://devnameservice1/../")
val tempFolder = parameters.getOrElse("tempLocation", System.getProperty("java.io.tmpdir"))
val hdfsTemp = parameters.getOrElse("hdfsTempLocation", tempFolder)
val cryptoKey = parameters.getOrElse("cryptoKey", null)
val cryptoAlgorithm = parameters.getOrElse("cryptoAlgorithm", "AES")
val supportedFileTypes = List("csv", "json", "avro", "parquet")
if (!supportedFileTypes.contains(fileType)) {
sys.error("fileType " + fileType + " not supported. Supported file types are " + supportedFileTypes)
}
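Putting a few of those options together, a hedged example of a load call using key-based authentication and a custom delimiter (host, paths and values are placeholders):
val df = sparkSession.read
  .format("com.springml.spark.sftp")
  .option("host", "SFTP_HOST")
  .option("username", "SFTP_USER")
  .option("pem", "/path/to/key.pem")                 // private-key auth instead of a password
  .option("fileType", "csv")
  .option("delimiter", ";")                          // CSV delimiter, defaults to ","
  .option("copyLatest", "true")                      // defaults to "false" (see the source above)
  .option("hdfsTempLocation", "/user/currentuser/")  // HDFS staging location
  .load("/ftp/files/sample.csv")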

java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to packagename.MyRecord

I am trying to use Spark 1.5.1 (with Scala 2.10.2) to read some .avro files from HDFS (with spark-avro 1.7.7), in order to do some computation on them.
Now, starting from the assumption that I have already searched the web thoroughly for a solution (the best link so far suggests using a GenericRecord, another reports the same issue, and a third one just does not work for me because it gives almost the same code I have used), I am asking here because someone might have had the same problem. This is the code:
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}
object SparkPOC {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkPOC")
      .set("spark.master", "local[4]")
    val sc = new SparkContext(conf)
    val path = args(0)
    val profiles = sc.hadoopFile(
      path,
      classOf[AvroInputFormat[MyRecord]],
      classOf[AvroWrapper[MyRecord]],
      classOf[NullWritable]
    )
    val timeStamps = profiles.map { p => p._1.datum.getTimeStamp().toString }
    timeStamps.foreach(print)
  }
}
And I get the following message:
java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to packagename.MyRecord
at packagename.SparkPOC$$anonfun$1.apply(SparkPOC.scala:24)
at packagename.SparkPOC$$anonfun$1.apply(SparkPOC.scala:24)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:890)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:890)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Does anybody have a clue? I was also considering the possibility of using spark-avro, but they don't support reading from multiple files at the same time (while .hadoopFile supports wildcards). Otherwise, it seems that I have to go for the GenericRecord and use the .get method, losing the advantage of the coded schema (MyRecord).
Thanks in advance.
I usually read it in as a GenericRecord and explicitly cast as necessary, i.e.
val conf = sc.hadoopConfiguration
sc.newAPIHadoopFile(path, classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]], classOf[NullWritable], conf)
  .map(_._1.datum().asInstanceOf[MyRecord])
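If the cast is not safe for every record, the same read can stay at the GenericRecord level and pull fields out by name, which is the ".get method" route the question mentions (the field name below is a placeholder):
// Stay generic and extract fields by name instead of casting to MyRecord.
val timeStamps = sc.newAPIHadoopFile(path, classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]], classOf[NullWritable], conf)
  .map { case (key, _) => key.datum().get("timeStamp").toString } // "timeStamp" is a placeholder
timeStamps.take(10).foreach(println)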
The problem went away after I set the KryoSerializer and a spark.kryo.registrator class, as follows:
val config = new SparkConf()
.setAppName(appName)
.set("spark.master", master)
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "com.mypackage.AvroKryoRegistrator")
where AvroKryoRegistrator is something like this.
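As a rough sketch of what such a registrator could look like (the class body below is an assumption, not the author's actual code), it registers the Avro-generated class with Kryo:
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Sketch only: register the Avro-generated record class so Kryo can serialize it.
class AvroKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyRecord])
  }
}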