Spark 1.6
How can I save output to a CSV file in Spark 1.6?
I did something like this,
but it throws the error:
cannot resolve symbol csv
I even tried it like this,
and for the dependency:
val outputfile = "file:///D:/path/output"
val myCleanData= sqlContext.sql("""SELECT
FROM dataframe
WHERE col1 LIKE "^[a-zA-Z0-9]*$"
""" )
.option("header", "true")
But this gives the error java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class
Please help if this is possible with Spark 1.6.
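For reference, a minimal sketch of one way this is usually done on Spark 1.6, assuming the external spark-csv package is on the classpath. Spark 1.6 has no built-in `csv` writer method, which would be consistent with the "cannot resolve symbol csv" error. The `GenTraversableOnce$class` error typically points to mixed Scala binary versions, so the sketch assumes a Scala 2.10 build to match the default Spark 1.6 binaries. The object name, the sample rows, the column names and the dependency versions are illustrative assumptions, not taken from the question.

```scala
// build.sbt (assumed, shown here as comments):
//   scalaVersion := "2.10.6"
//   libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.3" % "provided"
//   libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"
// The %% operator picks the artifact that matches scalaVersion, which avoids
// mixing _2.10 and _2.11 jars.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SaveCsvSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("save-csv-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical stand-in for the result of the cleaning query.
    val myCleanData = sc.parallelize(Seq(("abc123", 1), ("xyz", 2))).toDF("col1", "col2")

    // The external data source is named explicitly through format(...).
    myCleanData.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("file:///D:/path/output") // produces a directory of part files

    sc.stop()
  }
}
```

Note that the output path becomes a directory containing part files rather than a single CSV file.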
I am trying this basic command to read a CSV file in Scala:
val df =
.option("header", "true")
.option("inferSchema", "true")
And I get:
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.
What could be the solution?
The solution is to rename the file from "_34File.csv" to "34File.csv". It's a peculiar case, but that worked for me.
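A likely reason the rename helps: Hadoop-style input listing, which Spark builds on, conventionally treats names that start with "_" or "." as hidden or metadata files and skips them, so the reader ends up with no data to infer a schema from. The sketch below is a simplified approximation of that convention in plain Scala, not Spark's actual filter (which has a few exceptions); `HiddenFileSketch` and `looksHidden` are hypothetical names.

```scala
object HiddenFileSketch {
  // Simplified rule: a leading "_" or "." marks a file as hidden/metadata,
  // so it is left out when input files are listed.
  def looksHidden(fileName: String): Boolean =
    fileName.startsWith("_") || fileName.startsWith(".")

  def main(args: Array[String]): Unit = {
    Seq("_34File.csv", "34File.csv", ".34File.csv.crc").foreach { name =>
      println(s"$name -> skipped: ${looksHidden(name)}")
    }
  }
}
```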
I have a file with a .dat extension that does not have a header.
Its fields are separated by '\u0001', and each record is on a new line.
How can I read this file in Spark with Scala and convert it to a DataFrame?
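Try the code below. I assume you are using Spark 2.x or later:
val df = spark.read
.option("header", "false") // the file has no header row
.option("inferSchema", "true")
.option("delimiter", "\u0001") // unicode escape; the octal form "\01" is deprecated in Scala
.csv("path/to/file.dat") // placeholder path, substitute your own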
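To make the delimiter itself concrete, here is a small plain-Scala sketch (no Spark involved) that builds and splits one record on the '\u0001' control character. The object name, the helper and the sample values are made up for illustration.

```scala
object CtrlADelimiterSketch {
  // The field separator described in the question (Ctrl-A, code point 1).
  val Delimiter: String = "\u0001"

  // Limit -1 keeps trailing empty fields instead of dropping them.
  def splitRecord(line: String): Array[String] = line.split(Delimiter, -1)

  def main(args: Array[String]): Unit = {
    // Hypothetical record with three fields.
    val line = Seq("1", "Alice", "30").mkString(Delimiter)
    println(splitRecord(line).mkString(" | "))
  }
}
```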
I have a piece of Scala code that works locally:
val test = "resources/test.csv"
val trainInput =
.option("header", "true")
.option("inferSchema", "true")
However, when I try to run it on Azure by submitting a Spark job, after adjusting the following line:
val test = "wasb:///tmp/MachineLearningScala/test.csv"
It doesn't work. How do I reference files in Azure Blob Storage using Scala? This should be straightforward.
If you are using sbt, add this dependency to build.sbt:
"org.apache.hadoop" % "hadoop-azure" % "2.7.3"
For Maven, add the dependency as:
To read the files from blob storage you need to define the file system to be used in the underlying Hadoop configurations.
spark.sparkContext.hadoopConfiguration.set("", "")
spark.sparkContext.hadoopConfiguration.set("", "yourKey ")
And read the CSV file as:
val path = "wasb[s]://"
val dataframe =
.option("header", "true")
.option("inferSchema", "true")
.csv(path + "/tmp/MachineLearningScala/test.csv")
here is the example
Hope this helped!
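As a rough illustration of the naming involved, the sketch below builds the storage-account key property and a fully qualified wasb/wasbs URL in plain Scala. It assumes the usual hadoop-azure conventions (`fs.azure.account.key.<account>.blob.core.windows.net` and `wasb[s]://<container>@<account>.blob.core.windows.net/<path>`). The object, the case class, and the container and account names are hypothetical, and the resulting strings would still have to be passed to the Hadoop configuration and to the reader.

```scala
object WasbPathSketch {
  // Hypothetical holder for the pieces of a blob storage location.
  final case class BlobLocation(container: String, account: String, secure: Boolean = true)

  // Name of the Hadoop property that carries the storage account key.
  def accountKeyProperty(account: String): String =
    s"fs.azure.account.key.$account.blob.core.windows.net"

  // Fully qualified URL for a file inside the container.
  def url(loc: BlobLocation, path: String): String = {
    val scheme = if (loc.secure) "wasbs" else "wasb"
    s"$scheme://${loc.container}@${loc.account}.blob.core.windows.net/${path.stripPrefix("/")}"
  }

  def main(args: Array[String]): Unit = {
    val loc = BlobLocation(container = "mycontainer", account = "myaccount") // made-up names
    println(accountKeyProperty(loc.account))
    println(url(loc, "/tmp/MachineLearningScala/test.csv"))
  }
}
```

A path of the form wasb:///..., as in the question, omits the container and account and so depends on a default file system being configured on the cluster.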
I have a simple CSV file in S3 that I have read many times using Spark on EMR.
Now I want to use Zeppelin so that I can do some analysis.
My code is very simple
val path="s3://somewhere/some.csv"
val df=
.option("delimiter", "\t")
.option("header", false)
.option("mode", ParseModes.DROP_MALFORMED_MODE)
.option("nullValue", "NULL")
.option("charset", "UTF-8")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
But when I try to collect the dataframe
I get an error
org.apache.commons.lang3.time.FastDateFormat; local class
incompatible: stream classdesc serialVersionUID = 1, local class
serialVersionUID = 2
which is caused by Zeppelin and Spark using different versions of commons-lang3.
I have tried many different EMR versions, from 5.3.1 to 5.7.0.
I have tried adding the jar with --jars in Spark,
but with no luck.
Has anyone had the same error?
I have a CSV file stored in a local HDFS instance on Windows (hdfs://localhost:54310), under the path /tmp/home/.
I would like to load this file from HDFS into a Spark DataFrame, so I tried this:
val spark = SparkSession.builder.master(masterName).appName(appName).getOrCreate()
and then
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
import sparkSession.implicits._
.option("header", "true")
.option("inferSchema", "true")
But it fails at runtime with the exception below. Stack trace:
Caused by: java.lang.IllegalArgumentException: Relative path in absolute URI: file:C:/test/sampleApp/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(
at org.apache.hadoop.fs.Path.<init>(
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
C:/test/sampleApp/ is the path where my sample project lies. But I have specified the HDFS path.
Additionally, this works perfectly fine with a plain RDD:
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
val sc = SparkContext.getOrCreate()
val rdd = sc.textFile(path)
println(rdd.first()) //prints first row of CSV file
I found and tried this as well but no luck :(
Am I missing something? Why is Spark looking at my local file system and not HDFS?
I am using Spark 2.0 on hadoop-hdfs 2.7.2 with Scala 2.11.
EDIT: One additional piece of information: I downgraded to Spark 1.6.2 and was able to make it work, so I think this is a bug in Spark 2.0.
Just to close the loop: this seems to be an issue in Spark 2.0, and a ticket has been raised.
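The message in the stack trace can be reproduced without Spark, which may help explain it. `java.net.URI` rejects a URI that has a scheme together with a path that does not start with "/", and "file:C:/test/sampleApp/spark-warehouse" has exactly that shape. The sketch below is plain Scala; `WarehouseUriSketch` and `fileUri` are hypothetical names. A commonly reported workaround, stated here as an assumption rather than a confirmed fix for this case, is to set `spark.sql.warehouse.dir` explicitly to a well-formed value such as file:///C:/test/sampleApp/spark-warehouse when building the session.

```scala
import java.net.URI
import scala.util.Try

object WarehouseUriSketch {
  // Builds a file: URI from a scheme and a path, the same general shape of
  // call that fails inside Path.initialize in the stack trace.
  def fileUri(path: String): Try[URI] = Try(new URI("file", null, path, null))

  def main(args: Array[String]): Unit = {
    // Windows-style path without a leading slash: rejected as a relative path.
    println(fileUri("C:/test/sampleApp/spark-warehouse"))
    // Same location with a leading slash: accepted.
    println(fileUri("/C:/test/sampleApp/spark-warehouse"))
  }
}
```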