How to read file from Blob storage using scala to spark - scala

I have a piece of Scala code that works locally:
val test = "resources/test.csv"
val trainInput = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .format("com.databricks.spark.csv")
  .load(test)
  .cache
However, when I try to run it on Azure by submitting the job to Spark and adjusting the following line:
val test = "wasb:///tmp/MachineLearningScala/test.csv"
it doesn't work. How do I reference files in Blob storage in Azure using Scala? This should be straightforward.

If you are using sbt, add this dependency to build.sbt:
libraryDependencies += "org.apache.hadoop" % "hadoop-azure" % "2.7.3"
For Maven, add the dependency as:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-azure</artifactId>
  <version>2.7.3</version>
</dependency>
To read the files from blob storage you need to define the file system to be used in the underlying Hadoop configurations.
spark.sparkContext.hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey")
And read the CSV file as:
val path = "wasbs://yourContainer@yourAccount.blob.core.windows.net"
val dataframe = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path + "/tmp/MachineLearningScala/test.csv")
Hope this helped!
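Putting the configuration and the read together, here is a minimal end-to-end sketch. The account name, key, and container are placeholders to substitute with your own values:

```scala
import org.apache.spark.sql.SparkSession

object ReadFromBlob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("blob-csv").getOrCreate()

    // Placeholders: substitute your real storage account name and access key
    spark.sparkContext.hadoopConfiguration
      .set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    spark.sparkContext.hadoopConfiguration
      .set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey")

    // wasbs:// uses TLS; the container name comes before '@', the account after it
    val path = "wasbs://yourContainer@yourAccount.blob.core.windows.net"
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path + "/tmp/MachineLearningScala/test.csv")
    df.show()
  }
}
```

This needs the hadoop-azure dependency on the classpath of both driver and executors when you spark-submit.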

Related

Getting error in spark-sftp, no such file

In a Databricks cluster (Spark 2.4.5, Scala 2.11) I am trying to read a file into a Spark DataFrame using the following code.
val df = spark.read
  .format("com.springml.spark.sftp")
  .option("host", "*")
  .option("username", "*")
  .option("password", "*")
  .option("fileType", "csv")
  .option("delimiter", ";")
  .option("inferSchema", "true")
  .load("/my_file.csv")
However, I get the following error:
org.apache.spark.sql.AnalysisException: Path does not exist: dbfs:/local_disk0/tmp/my_file.csv;
I think I need to specify an option to store that file temporarily, but I can't find a way to do so. How can I solve this?
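If the problem is where the library stages the downloaded file, the spark-sftp README documents a tempLocation option for the local staging directory. A sketch, where the staging path is an assumption; check that it is writable and visible to the driver:

```scala
val df = spark.read
  .format("com.springml.spark.sftp")
  .option("host", "*")
  .option("username", "*")
  .option("password", "*")
  .option("fileType", "csv")
  .option("delimiter", ";")
  .option("inferSchema", "true")
  // Assumption: stage the downloaded file in a directory Spark can read back
  .option("tempLocation", "/tmp/sftp-staging")
  .load("/my_file.csv")
```

On Databricks you may need a DBFS-backed path such as /dbfs/tmp so the cluster can see the staged copy; treat this as a sketch to adapt, not a verified fix.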

Saving output of spark to csv in spark 1.6

Spark 1.6, Scala.
How do I save output to a CSV file in Spark 1.6?
I did something like this:
myCleanData.write.mode(SaveMode.Append).csv(path="file:///filepath")
but it throws an error:
cannot resolve symbol csv
I even tried this, with the following dependency:
<!-- https://mvnrepository.com/artifact/com.databricks/spark-csv -->
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-csv_2.10</artifactId>
  <version>1.5.0</version>
</dependency>
val outputfile = "file:///D:/path/output"
val myCleanData= sqlContext.sql("""SELECT
col1,
col1,
col1
FROM dataframe
WHERE col1 LIKE "^[a-zA-Z0-9]*$"
""" )
myCleanData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .mode("overwrite")
  .save(outputfile)
But this gives the error java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class.
Please help, if it is possible with Spark 1.6.
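For what it's worth, the cannot resolve symbol csv error is expected on 1.6: DataFrameWriter.csv was only added in Spark 2.0, so on 1.6 you go through the spark-csv data source instead. And NoClassDefFoundError on GenTraversableOnce$class is the classic symptom of a Scala binary-version mismatch, e.g. spark-csv_2.10 loaded into a Scala 2.11 build of Spark. A hedged sketch, assuming the Scala version of the artifact is made to match the cluster:

```scala
// In sbt, "%%" appends the Scala binary version (_2.10 or _2.11) automatically,
// which avoids the mismatch that triggers GenTraversableOnce$class errors:
//   libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"

val outputfile = "file:///D:/path/output"

myCleanData.write
  .format("com.databricks.spark.csv") // the 1.6-era way; .csv(...) arrives in 2.0
  .option("header", "true")
  .mode("overwrite")
  .save(outputfile)
```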

How to read a .dat file with delimiter \u0001, where records are separated by newlines, in Spark with Scala

I have a file with the .dat extension which does not have any header:
1. fields are separated by '\u0001'
2. each record is on a new line
How can I read this file in Spark with Scala and convert it to a DataFrame?
Try the code below; I assume you are using Spark 2.x or later. Since the file has no header, the header option is set to false:
val df = spark
  .read
  .option("header", "false")
  .option("inferSchema", "true")
  .option("delimiter", "\u0001")
  .csv("<CSV_FILE_PATH_GOES_HERE>")
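To sanity-check the delimiter outside Spark, plain Scala can split a sample record on the \u0001 control character. A quick sketch with made-up field values:

```scala
object DelimiterCheck {
  def main(args: Array[String]): Unit = {
    // A made-up record: three fields separated by the \u0001 control character
    val record = "alice\u000130\u0001london"
    val fields = record.split('\u0001')
    println(fields.mkString("|")) // alice|30|london
  }
}
```

If this prints the fields you expect, the same delimiter string should work in the Spark reader.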

Reading CSV using Spark from Zeppelin in EMR

I have a simple CSV file in S3 that I have read many times using Spark on EMR.
Now I want to use Zeppelin so I can do some analysis.
My code is very simple:
val path="s3://somewhere/some.csv"
val df = _spark
  .read
  .format("csv")
  .option("delimiter", "\t")
  .option("header", false)
  .option("mode", ParseModes.DROP_MALFORMED_MODE)
  .option("nullValue", "NULL")
  .option("charset", "UTF-8")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .load(path)
But when I try to collect the dataframe
df.collect
I get an error
java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateFormat; local class incompatible: stream classdesc serialVersionUID = 1, local class serialVersionUID = 2
which comes from Zeppelin and Spark using different versions of commons-lang3.
Reference:
http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/InvalidClassException-using-Zeppelin-master-and-spark-2-1-on-a-standalone-spark-cluster-td4900.html
I have used many different EMR versions, from 5.3.1 to 5.7.0, and I have tried adding commons-lang3-3.4.jar via --jars in Spark, but with no luck.
Has anyone had the same error?
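One workaround reported for this family of serialVersionUID conflicts is to pin a single commons-lang3 on both the driver and the executors via the classpath, rather than only through --jars. A sketch for spark-defaults.conf, where the jar path is an assumption for your cluster:

```
spark.driver.extraClassPath    /usr/lib/commons/commons-lang3-3.4.jar
spark.executor.extraClassPath  /usr/lib/commons/commons-lang3-3.4.jar
```

In Zeppelin you can alternatively add the artifact through the interpreter's dependency settings and restart the interpreter, so the notebook side loads the same version Spark does.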

Not able to load file from HDFS in spark Dataframe

I have a CSV file stored in a local Windows HDFS (hdfs://localhost:54310), under the path /tmp/home/.
I would like to load this file from HDFS into a Spark DataFrame. So I tried this
val spark = SparkSession.builder.master(masterName).appName(appName).getOrCreate()
and then
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
import spark.implicits._
spark.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(path)
  .show()
But it fails at runtime with the exception below. Stack trace:
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/test/sampleApp/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
C:/test/sampleApp/ is the path where my sample project lies, but I have specified the HDFS path.
Additionally, this works perfectly fine with a plain RDD:
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
val sc = SparkContext.getOrCreate()
val rdd = sc.textFile(path)
println(rdd.first()) //prints first row of CSV file
I found and tried this as well, but no luck :(
Am I missing something? Why is Spark looking at my local file system and not HDFS?
I am using Spark 2.0 on hadoop-hdfs 2.7.2 with Scala 2.11.
EDIT: One additional piece of info: I tried downgrading to Spark 1.6.2 and was able to make it work. So I think this is a bug in Spark 2.0.
Just to close the loop: this seems to be an issue in Spark 2.0, and a ticket has been raised.
https://issues.apache.org/jira/browse/SPARK-15899
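The workaround discussed around SPARK-15899 is to give spark.sql.warehouse.dir an explicit URI, so the default relative Windows path never reaches the URI parser. A sketch, where the warehouse path is an assumption:

```scala
import org.apache.spark.sql.SparkSession

object ReadCsvFromHdfs {
  def main(args: Array[String]): Unit = {
    // Workaround for SPARK-15899 on Windows: set an explicit file: URI for
    // the warehouse dir before the SessionCatalog is first initialized.
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("csv-from-hdfs")
      .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
      .getOrCreate()

    spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("hdfs://localhost:54310/tmp/home/mycsv.csv")
      .show()
  }
}
```

With the warehouse dir out of the way, the hdfs:// path in load() is resolved by the Hadoop filesystem as usual.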