spark error: spark.read.format("org.apache.spark.csv") - scala

I am getting the following error after firing the command from spark-shell
scala> val df1 = spark.read.format("org.apache.spark.csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("/user/mailtosudiptabiswa
s7917/src_files/movies_data_srcfile_sess06_01.csv")
<console>:21: error: not found: value spark
val df1 = spark.read.format("org.apache.spark.csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("/user/mailtosudiptabiswas7917/src_files/movies_data_srcfile_sess06_01.csv")
Do I need to import something explicitly.
Please help with the complete command set
Thanks.

It seems like you are using the old version of spark, You need to use the spark2.x or higher and import the implicits as
import spark.implicits._
And then
val df1 = spark.read.format("csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("path")

You aren't even getting a SparkSession. You are using an older version of Spark it seems, and you should use the SQlContext and also you need to include the external databricks csv library when you start spark shell...
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
and then from within the spark shell...
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
You can see more info about it here

Related

How to parallelize spark.read.parquet()?

My Spark job reads a folder with parquet data partitioned by the column partition:
val spark = SparkSession
.builder()
.appName("Prepare Id Mapping")
.getOrCreate()
import spark.implicits._
spark.read
.parquet(sourceDir)
.filter($"field" === "ss_id" and $"int_value".isNotNull)
.select($"int_value".as("ss_id"), $"partition".as("date"), $"ct_id")
.coalesce(1)
.write
.partitionBy("date")
.parquet(idMappingDir)
I've noticed that only one task is created so it's very slow. There is a lot of subfolders like partition=2019-01-07 inside the source folder, and each subfolder contains a lot of files with the extension snappy.parquet. I submit the job --num-executors 2 --executor-cores 4, and RAM is not an issue. I tried reading from both S3 and the local filesystem. I tried adding .repartition(nPartitions), removing .coalesce(1) and .partitionBy("date") but the same.
Could you suggest how I can get Spark read these parquet files in parallel?
Well, I've figured out the correct code:
val spark = SparkSession
.builder()
.appName("Prepare Id Mapping")
.getOrCreate()
import spark.implicits._
spark.read
.option("mergeSchema", "true")
.parquet(sourceDir)
.filter($"field" === "ss_id" and $"int_value".isNotNull)
.select($"int_value".as("ss_id"), $"partition".as("date"), $"ct_id")
.write
.partitionBy("date")
.parquet(idMappingDir)
Hope this will save someone time in future.

ERROR: org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV

I try this basic command to read a CSV in scala:
val df = spark.read
.option("header", "true")
.option("sep","|")
.option("inferSchema", "true")
.csv("path/to/_34File.csv")
And I get:
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.
What could be the solution?
The solution is to rename de file from "_34File.csv" to "34File.csv". It's a peculiar case and that worked for me.

Reading CSV using Spark from Zeppelin in EMR

I have a simple CSV file in S3 where I have read it many times using Spark in EMR.
Now I want to use Zeppelin so, I can do some analysis.
My code is very simple
val path="s3://somewhere/some.csv"
val df=
_spark
.read
.format("csv")
.option("delimiter", "\t")
.option("header", false)
.option("mode", ParseModes.DROP_MALFORMED_MODE)
.option("nullValue", "NULL")
.option("charset", "UTF-8")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.load(path)
But when I try to collect the dataframe
df.collect
I get an error
java.io.InvalidClassException:
org.apache.commons.lang3.time.FastDateFormat; local class
incompatible: stream classdesc serialVersionUID = 1, local class
serialVersionUID = 2
which is the different versions commons-lang3 between Zeppelin and Spark use.
reference:
http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/InvalidClassException-using-Zeppelin-master-and-spark-2-1-on-a-standalone-spark-cluster-td4900.html
I have used many different EMR version from 5.3.1 to 5.7.0
I have tried to add in --jars in spark
commons-lang3-3.4.jar
but with no luck.
Has anyone, had the same error?

Not able to load file from HDFS in spark Dataframe

I have a CSV file stored in local windows HDFS (hdfs://localhost:54310), under path /tmp/home/.
I would like to load this file from HDFS to spark Dataframe. So I tried this
val spark = SparkSession.builder.master(masterName).appName(appName).getOrCreate()
and then
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
import sparkSession.implicits._
spark.sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(path)
.show()
But fails at runtime with below exception Stack trace:
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/test/sampleApp/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
C:/test/sampleApp/ is the path where my sample project lies. But I have specified the HDFS path.
Additionally, this works perfectly fine with plain rdd
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
val sc = SparkContext.getOrCreate()
val rdd = sc.textFile(path)
println(rdd.first()) //prints first row of CSV file
I found and tried this as well but no luck :(
I am missing something? Why spark is looking at my local file system & not the HDFS?
I am using spark 2.0 on hadoop-hdfs 2.7.2 with scala 2.11.
EDIT: Just one additional info I tried to downgrade to spark 1.6.2. I was able to make it work. So I think this is a bug in spark 2.0
Just to close the loop.This seems to be issue in spark 2.0 and a ticket has been raised.
https://issues.apache.org/jira/browse/SPARK-15899

Reading csv files to create dataframe

I am trying to read a csv file to create a dataframe (https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html)
Using:
spark-1.3.1-bin-hadoop2.6
spark-csv_2.11-1.1.0
Code:
import org.apache.spark.sql.SQLContext
object test {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.csvFile("filename.csv")
...
}
}
Error:
value csvFile is not a member of org.apache.spark.sql.SQLContext
I was trying to do as advised here: Spark - load CSV file as DataFrame?
But sqlContext doesn't seem to recognize the csvFile method of CsvContext class.
Any advise would be appreciated!
I am also facing some issues with CSV(without Spark-CSV) but here is somethings that you can look at and check if they are OK.
Build the Spark shell with the spark-csv library using sbt assembly.
Add the spark-csv dependency to POM.XML of you maven project.
use the load/save methods of Dataframe API.
SPARK-CSV GITHUB
refer the spark-csv github readme.md page and you will up and running :)