Reading CSV using Spark from Zeppelin in EMR - scala

I have a simple CSV file in S3 which I have read many times using Spark in EMR.
Now I want to use Zeppelin so I can do some analysis.
My code is very simple:
val path = "s3://somewhere/some.csv"
val df = _spark
  .read
  .format("csv")
  .option("delimiter", "\t")
  .option("header", false)
  .option("mode", ParseModes.DROP_MALFORMED_MODE)
  .option("nullValue", "NULL")
  .option("charset", "UTF-8")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .load(path)
But when I try to collect the dataframe
df.collect
I get an error
java.io.InvalidClassException:
org.apache.commons.lang3.time.FastDateFormat; local class
incompatible: stream classdesc serialVersionUID = 1, local class
serialVersionUID = 2
which is caused by the different versions of commons-lang3 that Zeppelin and Spark use.
reference:
http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/InvalidClassException-using-Zeppelin-master-and-spark-2-1-on-a-standalone-spark-cluster-td4900.html
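One way to confirm which commons-lang3 jar the Zeppelin interpreter actually loads (a diagnostic sketch using only the class named in the stack trace, not something from the original setup) is to ask the classloader for the jar location in a notebook paragraph:
// Prints the jar that FastDateFormat was loaded from,
// e.g. .../commons-lang3-3.3.2.jar vs .../commons-lang3-3.4.jar
val loadedFrom = classOf[org.apache.commons.lang3.time.FastDateFormat]
  .getProtectionDomain
  .getCodeSource
  .getLocation
println(loadedFrom)
Comparing this with the commons-lang3 jar shipped under Spark's own jars directory shows whether the two sides disagree.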
I have used many different EMR versions, from 5.3.1 to 5.7.0.
I have tried adding
commons-lang3-3.4.jar
via --jars in Spark, but with no luck.
Has anyone had the same error?

Related

Getting error in spark-sftp, no such file

In a Databricks cluster (Spark 2.4.5, Scala 2.11) I am trying to read a file into a Spark data frame using the following code.
val df = spark.read
.format("com.springml.spark.sftp")
.option("host", "*")
.option("username", "*")
.option("password", "*")
.option("fileType", "csv")
.option("delimiter", ";")
.option("inferSchema", "true")
.load("/my_file.csv")
However, I get the following error
org.apache.spark.sql.AnalysisException: Path does not exist: dbfs:/local_disk0/tmp/my_file.csv;
I think I need to specify an option to save that file temporarily, but I can't find a way to do so. How can I solve that?
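If I remember the spark-sftp README correctly, the connector stages the remote file in a temporary directory before Spark reads it, and it exposes a tempLocation option (with an hdfsTempLocation counterpart) to control where that copy lands; treat the option names as an assumption to verify against the library's documentation. A minimal sketch:
// tempLocation is a hypothetical staging directory; the option name
// is taken from the spark-sftp README and should be double-checked.
val df = spark.read
  .format("com.springml.spark.sftp")
  .option("host", "*")
  .option("username", "*")
  .option("password", "*")
  .option("fileType", "csv")
  .option("delimiter", ";")
  .option("inferSchema", "true")
  .option("tempLocation", "/dbfs/tmp/sftp_staging")
  .load("/my_file.csv")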

How to read a .dat file with delimiter \u0001, where each record is separated by a new line, in Spark with Scala

I have a file with a .dat extension which does not have any header.
1. Fields are separated by '\u0001'.
2. The next record will be on a new line.
How can I read this file in Spark with Scala and convert it to a dataframe?
Try the code below; I assume you are using a Spark 2.x (or later) version:
val df = spark
  .read
  .option("header", "false")        // the file has no header row
  .option("inferSchema", "true")
  .option("delimiter", "\u0001")    // fields are separated by the \u0001 character
  .csv("<CSV_FILE_PATH_GOES_HERE>")

How to read file from Blob storage using scala to spark

I have a piece of Scala code that works locally:
val test = "resources/test.csv"
val trainInput = spark.read
.option("header", "true")
.option("inferSchema", "true")
.format("com.databricks.spark.csv")
.load(test)
.cache
However, when I try to run it on Azure Spark by submitting the job and adjusting the following line:
val test = "wasb:///tmp/MachineLearningScala/test.csv"
It doesn't work. How do I reference files in Azure Blob storage using Scala? This should be straightforward.
If you are using sbt, add this dependency to build.sbt:
"org.apache.hadoop" % "hadoop-azure" % "2.7.3"
For maven add the dependency as
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-azure</artifactId>
<version>2.7.0</version>
</dependency>
To read files from Blob storage you need to define the file system to be used in the underlying Hadoop configuration.
spark.sparkContext.hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey")
And read the csv file as
val path = "wasb[s]://BlobStorageContainer#yourUser.blob.core.windows.net"
val dataframe = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(path + "/tmp/MachineLearningScala/test.csv")
here is the example
Hope this helped!

spark error: spark.read.format("org.apache.spark.csv")

I am getting the following error after firing the command from spark-shell
scala> val df1 = spark.read.format("org.apache.spark.csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("/user/mailtosudiptabiswas7917/src_files/movies_data_srcfile_sess06_01.csv")
<console>:21: error: not found: value spark
val df1 = spark.read.format("org.apache.spark.csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("/user/mailtosudiptabiswas7917/src_files/movies_data_srcfile_sess06_01.csv")
Do I need to import something explicitly?
Please help with the complete command set.
Thanks.
It seems like you are using an old version of Spark. You need to use Spark 2.x or higher and import the implicits as
import spark.implicits._
And then
val df1 = spark.read.format("csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("path")
You aren't even getting a SparkSession. It seems you are using an older version of Spark; you should use the SQLContext, and you also need to include the external Databricks CSV library when you start the Spark shell...
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
and then from within the spark shell...
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
You can see more info about it here

Not able to load file from HDFS in spark Dataframe

I have a CSV file stored in a local Windows HDFS (hdfs://localhost:54310), under the path /tmp/home/.
I would like to load this file from HDFS into a Spark DataFrame. So I tried this:
val spark = SparkSession.builder.master(masterName).appName(appName).getOrCreate()
and then
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
import spark.implicits._
spark.sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(path)
.show()
But it fails at runtime with the below exception stack trace:
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/test/sampleApp/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
C:/test/sampleApp/ is the path where my sample project lies, but I have specified the HDFS path.
Additionally, this works perfectly fine with a plain RDD:
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
val sc = SparkContext.getOrCreate()
val rdd = sc.textFile(path)
println(rdd.first()) //prints first row of CSV file
I found and tried this as well, but no luck :(
Am I missing something? Why is Spark looking at my local file system and not HDFS?
I am using Spark 2.0 on hadoop-hdfs 2.7.2 with Scala 2.11.
EDIT: Just one additional piece of info: I tried downgrading to Spark 1.6.2 and was able to make it work. So I think this is a bug in Spark 2.0.
Just to close the loop: this seems to be an issue in Spark 2.0, and a ticket has been raised.
https://issues.apache.org/jira/browse/SPARK-15899
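A commonly cited workaround for that ticket (an assumption based on the JIRA discussion, not part of the original thread) is to point spark.sql.warehouse.dir at a valid file: URI when building the session, so the SessionCatalog does not trip over the Windows-style default path. A minimal sketch, reusing masterName and appName from the snippet above:
// Hypothetical local warehouse directory; any writable path given as a proper file: URI should do.
val spark = SparkSession.builder
  .master(masterName)
  .appName(appName)
  .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
  .getOrCreate()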