How to read file from Blob storage using scala to spark - scala

I have a piece of Scala code that works locally:
val test = "resources/test.csv"
val trainInput = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .format("com.databricks.spark.csv")
  .load(test)
  .cache
However, when I try to run it on Azure by submitting the job to Spark and adjusting the following line:
val test = "wasb:///tmp/MachineLearningScala/test.csv"
it doesn't work. How do I reference files in Blob storage in Azure using Scala? This should be straightforward.

If you are using sbt, add this dependency to build.sbt:
libraryDependencies += "org.apache.hadoop" % "hadoop-azure" % "2.7.3"
For Maven, add the dependency as:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-azure</artifactId>
  <version>2.7.3</version>
</dependency>
To read the files from blob storage you need to define the file system to be used in the underlying Hadoop configurations.
spark.sparkContext.hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey")
And read the CSV file as:
val path = "wasbs://yourContainer@yourAccount.blob.core.windows.net"
val dataframe = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path + "/tmp/MachineLearningScala/test.csv")
Hope this helped!
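Putting the configuration and the read together, here is a minimal end-to-end sketch. The account name, key, and container are placeholders to substitute with your own values:

```scala
import org.apache.spark.sql.SparkSession

object ReadFromBlob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("blob-csv").getOrCreate()

    // Placeholders: substitute your real storage account name and access key
    spark.sparkContext.hadoopConfiguration
      .set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    spark.sparkContext.hadoopConfiguration
      .set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey")

    // wasbs:// uses TLS; the container name comes before '@', the account after it
    val path = "wasbs://yourContainer@yourAccount.blob.core.windows.net"
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path + "/tmp/MachineLearningScala/test.csv")
    df.show()
  }
}
```

This needs the hadoop-azure dependency on the classpath of both driver and executors when you spark-submit.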

Related

Getting error in spark-sftp, no such file

In a Databricks cluster (Spark 2.4.5, Scala 2.11) I am trying to read a file into a Spark DataFrame using the following code.
val df = spark.read
  .format("com.springml.spark.sftp")
  .option("host", "*")
  .option("username", "*")
  .option("password", "*")
  .option("fileType", "csv")
  .option("delimiter", ";")
  .option("inferSchema", "true")
  .load("/my_file.csv")
However, I get the following error:
org.apache.spark.sql.AnalysisException: Path does not exist: dbfs:/local_disk0/tmp/my_file.csv;
I think I need to specify an option to store that file temporarily, but I can't find a way to do so. How can I solve this?
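If the problem is where the library stages the downloaded file, the spark-sftp README documents a tempLocation option for the local staging directory. A sketch, where the staging path is an assumption; check that it is writable and visible to the driver:

```scala
val df = spark.read
  .format("com.springml.spark.sftp")
  .option("host", "*")
  .option("username", "*")
  .option("password", "*")
  .option("fileType", "csv")
  .option("delimiter", ";")
  .option("inferSchema", "true")
  // Assumption: stage the downloaded file in a directory Spark can read back
  .option("tempLocation", "/tmp/sftp-staging")
  .load("/my_file.csv")
```

On Databricks you may need a DBFS-backed path such as /dbfs/tmp so the cluster can see the staged copy; treat this as a sketch to adapt, not a verified fix.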

Saving output of spark to csv in spark 1.6

Spark 1.6, Scala.
How do I save output to a CSV file in Spark 1.6?
I did something like this:
myCleanData.write.mode(SaveMode.Append).csv(path="file:///filepath")
but it throws an error:
cannot resolve symbol csv
I even tried this, with the following dependency:
<!-- https://mvnrepository.com/artifact/com.databricks/spark-csv -->
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-csv_2.10</artifactId>
  <version>1.5.0</version>
</dependency>
val outputfile = "file:///D:/path/output"
val myCleanData= sqlContext.sql("""SELECT
col1,
col1,
col1
FROM dataframe
WHERE col1 LIKE "^[a-zA-Z0-9]*$"
""" )
myCleanData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .mode("overwrite")
  .save(outputfile)
But this gives the error java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class.
Please help, if it is possible with Spark 1.6.
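For what it's worth, the cannot resolve symbol csv error is expected on 1.6: DataFrameWriter.csv was only added in Spark 2.0, so on 1.6 you go through the spark-csv data source instead. And NoClassDefFoundError on GenTraversableOnce$class is the classic symptom of a Scala binary-version mismatch, e.g. spark-csv_2.10 loaded into a Scala 2.11 build of Spark. A hedged sketch, assuming the Scala version of the artifact is made to match the cluster:

```scala
// In sbt, "%%" appends the Scala binary version (_2.10 or _2.11) automatically,
// which avoids the mismatch that triggers GenTraversableOnce$class errors:
//   libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"

val outputfile = "file:///D:/path/output"

myCleanData.write
  .format("com.databricks.spark.csv") // the 1.6-era way; .csv(...) arrives in 2.0
  .option("header", "true")
  .mode("overwrite")
  .save(outputfile)
```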

How to read a .dat file with delimiter \u0001, where records are separated by newlines, in Spark with Scala

I have a file with the .dat extension which does not have any header:
1. fields are separated by '\u0001'
2. each record is on a new line
How can I read this file in Spark with Scala and convert it to a DataFrame?
Try the code below; I assume you are using Spark 2.x or later. Since the file has no header, the header option is set to false:
val df = spark
  .read
  .option("header", "false")
  .option("inferSchema", "true")
  .option("delimiter", "\u0001")
  .csv("<CSV_FILE_PATH_GOES_HERE>")
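To sanity-check the delimiter outside Spark, plain Scala can split a sample record on the \u0001 control character. A quick sketch with made-up field values:

```scala
object DelimiterCheck {
  def main(args: Array[String]): Unit = {
    // A made-up record: three fields separated by the \u0001 control character
    val record = "alice\u000130\u0001london"
    val fields = record.split('\u0001')
    println(fields.mkString("|")) // alice|30|london
  }
}
```

If this prints the fields you expect, the same delimiter string should work in the Spark reader.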

Reading CSV using Spark from Zeppelin in EMR

I have a simple CSV file in S3 that I have read many times using Spark on EMR.
Now I want to use Zeppelin so I can do some analysis.
My code is very simple:
val path="s3://somewhere/some.csv"
val df = _spark
  .read
  .format("csv")
  .option("delimiter", "\t")
  .option("header", false)
  .option("mode", ParseModes.DROP_MALFORMED_MODE)
  .option("nullValue", "NULL")
  .option("charset", "UTF-8")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .load(path)
But when I try to collect the dataframe
df.collect
I get an error
java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateFormat; local class incompatible: stream classdesc serialVersionUID = 1, local class serialVersionUID = 2
which comes from Zeppelin and Spark using different versions of commons-lang3.
Reference:
http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/InvalidClassException-using-Zeppelin-master-and-spark-2-1-on-a-standalone-spark-cluster-td4900.html
I have used many different EMR versions, from 5.3.1 to 5.7.0, and I have tried adding commons-lang3-3.4.jar via --jars in Spark, but with no luck.
Has anyone had the same error?
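One workaround reported for this family of serialVersionUID conflicts is to pin a single commons-lang3 on both the driver and the executors via the classpath, rather than only through --jars. A sketch for spark-defaults.conf, where the jar path is an assumption for your cluster:

```
spark.driver.extraClassPath    /usr/lib/commons/commons-lang3-3.4.jar
spark.executor.extraClassPath  /usr/lib/commons/commons-lang3-3.4.jar
```

In Zeppelin you can alternatively add the artifact through the interpreter's dependency settings and restart the interpreter, so the notebook side loads the same version Spark does.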

Not able to load file from HDFS in spark Dataframe

I have a CSV file stored in a local Windows HDFS (hdfs://localhost:54310), under the path /tmp/home/.
I would like to load this file from HDFS into a Spark DataFrame. So I tried this
val spark = SparkSession.builder.master(masterName).appName(appName).getOrCreate()
and then
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
import spark.implicits._
spark.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(path)
  .show()
But it fails at runtime with the exception below. Stack trace:
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/test/sampleApp/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
C:/test/sampleApp/ is the path where my sample project lies, but I have specified the HDFS path.
Additionally, this works perfectly fine with a plain RDD:
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
val sc = SparkContext.getOrCreate()
val rdd = sc.textFile(path)
println(rdd.first()) //prints first row of CSV file
I found and tried this as well, but no luck :(
Am I missing something? Why is Spark looking at my local file system and not HDFS?
I am using Spark 2.0 on hadoop-hdfs 2.7.2 with Scala 2.11.
EDIT: One additional piece of info: I tried downgrading to Spark 1.6.2 and was able to make it work. So I think this is a bug in Spark 2.0.
Just to close the loop: this seems to be an issue in Spark 2.0, and a ticket has been raised.
https://issues.apache.org/jira/browse/SPARK-15899
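The workaround discussed around SPARK-15899 is to give spark.sql.warehouse.dir an explicit URI, so the default relative Windows path never reaches the URI parser. A sketch, where the warehouse path is an assumption:

```scala
import org.apache.spark.sql.SparkSession

object ReadCsvFromHdfs {
  def main(args: Array[String]): Unit = {
    // Workaround for SPARK-15899 on Windows: set an explicit file: URI for
    // the warehouse dir before the SessionCatalog is first initialized.
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("csv-from-hdfs")
      .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
      .getOrCreate()

    spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("hdfs://localhost:54310/tmp/home/mycsv.csv")
      .show()
  }
}
```

With the warehouse dir out of the way, the hdfs:// path in load() is resolved by the Hadoop filesystem as usual.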