I am using the following code to load a CSV file that has free-form text/notes in it.
val data = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("parserLib", "UNIVOCITY")
.load(dataPath)
.na.drop()
The notes are not in any specific format. During loading I get this error:
com.univocity.parsers.common.TextParsingException: Error processing input: null
Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'.
I'd appreciate any help. Thanks.
I do not have the privilege to comment on the question, so I'm adding an answer instead.
Since you are already calling na.drop(), you may use option("mode", "DROPMALFORMED") as well:
val data = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("mode", "DROPMALFORMED")
.option("parserLib", "UNIVOCITY")
.load(dataPath)
.na.drop()
BTW, the Databricks spark-csv reader is built into Spark 2.0+.
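For reference, here is a hedged sketch of the equivalent read with the built-in reader in Spark 2.0+. The local session setup and the tiny sample file are illustrative assumptions, not part of the original question:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in a real job one already exists.
val spark = SparkSession.builder().master("local[1]").appName("csv-sketch").getOrCreate()

// A tiny sample file so the sketch is self-contained (contents are invented).
val path = Files.createTempFile("notes", ".csv")
Files.write(path, "id,note\n1,hello\n2,world\n".getBytes("UTF-8"))

// In Spark 2.0+ the csv reader is built in: no external format string needed.
val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mode", "DROPMALFORMED") // silently drop rows the parser cannot handle
  .csv(path.toString)
  .na.drop()

val rowCount = data.count() // 2 for this sample
```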
Related
I have a CSV file with a column of strings in dd-MMM-yyyy format (e.g. 03-APR-2019), which I want to read as a date.
My code to read it is below:
spark.read
.option("header", "true")
.option("inferSchema", "true")
.option("quote", "\"")
.option("escape", "\"")
.option("multiLine", "true")
.option("dateFormat", "dd-MMM-yyyy")
.csv(csvInPath)
However, after my code reads the CSV file, the date column still appears as a string in my DataFrame.
Can anyone advise? Thanks.
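Not part of the original thread, but one plausible direction, sketched under assumptions: inferSchema generally does not turn a non-standard pattern like dd-MMM-yyyy into a date, because dateFormat is only applied when the schema declares a DateType column. A common workaround is to cast the column explicitly with to_date. The column name order_date and the sample file are invented:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_date

val spark = SparkSession.builder().master("local[1]").appName("date-sketch").getOrCreate()

// Invented sample file. Note: the question uses upper-case "APR"; newer Spark
// versions parse month abbreviations case-sensitively unless
// spark.sql.legacy.timeParserPolicy=LEGACY, so this sample uses "Apr".
val path = Files.createTempFile("dates", ".csv")
Files.write(path, "id,order_date\n1,03-Apr-2019\n".getBytes("UTF-8"))

val raw = spark.read.option("header", "true").csv(path.toString)

// Cast the string column explicitly instead of relying on inferSchema.
val parsed = raw.withColumn("order_date", to_date(raw("order_date"), "dd-MMM-yyyy"))

val dtype = parsed.schema("order_date").dataType.typeName // "date"
```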
I'm reading CSV files with Spark and Scala; the files come from another Spark streaming job.
How can I read only the new files?
val df = spark
.read
.schema(test_raw)
.option("header", "true")
.option("sep", ",")
.csv(path).toDF().cache()
df.registerTempTable("test")
I resolved the problem by adding a checkpoint on the DataFrame, like this:
val df = spark
.read
.schema(test_raw)
.option("header", "true")
.option("sep", ",")
.csv(path).toDF().checkpoint().cache()
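One caveat worth a hedged aside: DataFrame.checkpoint() fails unless a checkpoint directory has been set on the SparkContext first. A minimal self-contained sketch, where the temp directory stands in for real reliable storage:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("chk-sketch").getOrCreate()

// checkpoint() throws without this; point it at reliable storage in production.
spark.sparkContext.setCheckpointDir(Files.createTempDirectory("spark-ckpt").toString)

// checkpoint() materializes the DataFrame and truncates its lineage.
val df = spark.range(3).toDF("id").checkpoint()
val n = df.count()
```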
I want to load a text file that uses the delimiter "|" into a DataFrame in Spark.
One way is to create an RDD and use toDF to create the DataFrame. However, I was wondering if I can create the DF directly.
As of now I am using the command below:
val productsDF = sqlContext.read.text("/user/danishdshadab786/paper2/products/")
For Spark 2.x
val df = spark.read.format("csv")
.option("delimiter", "|")
.load("/user/danishdshadab786/paper2/products/")
For Spark < 2.0
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", "|")
.load("/user/danishdshadab786/paper2/products/")
You can add more options like option("header", "true") for reading headers in the same statement.
You can specify the delimiter in the 'read' options:
spark.read
.option("delimiter", "|")
.csv("/user/danishdshadab786/paper2/products/")
I have two files, data.csv and headers.csv. I want to create a DataFrame in Spark/Scala with these headers.
var data = spark.sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(data_path)
Can you help me customize the above lines to do this?
You can read headers.csv using the above method and then use the schema of the headers DataFrame to read data.csv, as below:
val headersDF = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("path to headers.csv")
val schema = headersDF.schema
val dataDF = sqlContext
.read
.format("com.databricks.spark.csv")
.schema(schema)
.load("path to data.csv")
I hope the answer is helpful.
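Put together as a self-contained sketch; the file contents and column names here are invented for illustration:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("hdr-sketch").getOrCreate()

// headers.csv holds only the header row; data.csv holds only data rows.
val headersPath = Files.createTempFile("headers", ".csv")
Files.write(headersPath, "name,age\n".getBytes("UTF-8"))
val dataPath = Files.createTempFile("data", ".csv")
Files.write(dataPath, "alice,30\nbob,25\n".getBytes("UTF-8"))

// Read the header file to obtain the schema, then reuse it for the data file.
val headersDF = spark.read.option("header", "true").csv(headersPath.toString)
val dataDF = spark.read.schema(headersDF.schema).csv(dataPath.toString)

val cols = dataDF.columns.toSeq // Seq("name", "age")
val dataCount = dataDF.count()  // 2
```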
I have a "\u0001"-delimited file that I'm reading with Spark for conversion to Parquet. I don't have any issues with the schema, but the data has quotes (") in between without a closing quote. I tried different solutions but couldn't figure out any that work.
val df = sparkSession.sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", "\u0001")
//.option("quote", "\"")
//.option("quote", null)
//.option("quoteMode", "ALL")
.option("header", "false")
.option("mode","FAILFAST")
.option("treatEmptyValuesAsNulls","true")
.option("nullValue"," ")
.option("ignoreLeadingWhiteSpace", "true")
.option("ignoreTrailingWhiteSpace", "true")
.schema(schema)
.load(fileLocation)
Thanks in advance; I appreciate your help.
You can use sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "\u0001")
and read the data as a text file:
val sentences = sparkContext.textFile(directoryPath)
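Reading the file as plain text sidesteps CSV quote handling entirely; each record can then be split on the "\u0001" delimiter manually, and stray quote characters survive as ordinary data. A pure-Scala sketch of that split step, with an invented sample record:

```scala
// A record with an unmatched quote inside one field; no CSV parser is
// involved, so the quote is just another character.
val record = "alice\u0001say \"hi\u0001bob"
val fields = record.split('\u0001')
// fields(1) is "say \"hi" with the stray quote preserved
```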