spark parquet conversion issue with malformed lines in file - scala

I have a "\u0001" delimited file reading with spark for parquet conversion and I don't have any issues with schema, but, data has quotes(") in between without an end quote. I tried different solutions but couldn't figured out any.
val df = sparkSession.sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", "\u0001")
//.option("quote", "\"")
//.option("quote", null)
//.option("quoteMode", "ALL")
.option("header", "false")
.option("mode","FAILFAST")
.option("treatEmptyValuesAsNulls","true")
.option("nullValue"," ")
.option("ignoreLeadingWhiteSpace", "true")
.option("ignoreTrailingWhiteSpace", "true")
.schema(schema)
.load(fileLocation)
Thanks in advance; I appreciate your help.

You can use
sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "\u0001")
and read the file as text:
val sentences = sparkContext.textFile(directoryPath)
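Alternatively, if the records are newline-separated and \u0001 only separates fields, you can read the lines as plain text, split them yourself, and rebuild the DataFrame against your schema, bypassing the CSV parser's quote handling entirely. A minimal sketch, assuming schema declares every column as StringType (outputLocation is illustrative):
import org.apache.spark.sql.Row
val lines = sparkSession.sparkContext.textFile(fileLocation)
// split each record on the \u0001 field delimiter manually;
// limit -1 keeps trailing empty fields instead of dropping them
val rows = lines.map(line => Row.fromSeq(line.split("\u0001", -1).toSeq))
val df = sparkSession.createDataFrame(rows, schema)
df.write.parquet(outputLocation)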

Related

spark SQL - spark.read.option reading dd-MMM-yyyy from csv into dataFrame

I have a CSV file with a string column in dd-MMM-yyyy format (e.g. 03-APR-2019), which I want to read as a date. My code to read it is below:
spark.read
.option("header", "true")
.option("inferSchema", "true")
.option("quote", "\"")
.option("escape", "\"")
.option("multiLine", "true")
.option("dateFormat", "dd-MMM-yyyy")
.csv(csvInPath)
However, after my code reads the CSV file, the date still appears as a string in my data frame. Can anyone advise? Thanks.
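One hedged workaround, assuming Spark 2.2+ (the column name my_date is illustrative): in many Spark versions, CSV schema inference does not produce a date column even with dateFormat set, so keep the column as a string and cast it explicitly with to_date, which accepts the same dd-MMM-yyyy pattern:
import org.apache.spark.sql.functions.{col, to_date}
val df = spark.read
.option("header", "true")
.csv(csvInPath)
.withColumn("my_date", to_date(col("my_date"), "dd-MMM-yyyy"))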

How to read only new files in spark

I'm reading CSV files with Spark and Scala; the files are produced by another Spark Streaming job. How can I read only the new files?
val df = spark
.read
.schema(test_raw)
.option("header", "true")
.option("sep", ",")
.csv(path).toDF().cache()
df.registerTempTable("test")
I resolved the problem by adding a checkpoint on the DataFrame, like this:
val df = spark
.read
.schema(test_raw)
.option("header", "true")
.option("sep", ",")
.csv(path).toDF().checkpoint().cache()
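Note that checkpoint() requires a checkpoint directory to be configured first, e.g. spark.sparkContext.setCheckpointDir("/tmp/checkpoints") (the path here is illustrative).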

Trying to create Data frame from a file with delimiter '|'

I want to load a text file that uses the delimiter "|" into a DataFrame in Spark. One way is to create an RDD and use toDF to create the DataFrame, but I was wondering if I can create the DF directly. As of now I am using the command below:
val productsDF = sqlContext.read.text("/user/danishdshadab786/paper2/products/")
For Spark 2.x
val df = spark.read.format("csv")
.option("delimiter", "|")
.load("/user/danishdshadab786/paper2/products/")
For Spark < 2.0
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", "|")
.load("/user/danishdshadab786/paper2/products/")
You can add more options like option("header", "true") for reading headers in the same statement.
You can specify the delimiter in the 'read' options:
spark.read
.option("delimiter", "|")
.csv("/user/danishdshadab786/paper2/products/")

Spark-Scala Malformed Line Issue

I have a control-A (\u0001) delimited file which I am trying to convert to Parquet format. However, the file contains a string field with a single " in it. I am reading the data like below:
val dataframe = sparkSession.sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", datasetDelimiter)
.option("header", "false")
.option("mode","FAILFAST")
//.option("mode", "DROPMALFORMED")
.option("treatEmptyValuesAsNulls","true")
.option("nullValue"," ")
.option("ignoreLeadingWhiteSpace", "true")
.option("ignoreTrailingWhiteSpace", "true")
.schema(schema)
.load(fileLocation)
dataframe
As you can see, there is just an opening double quote in the data and no closing double quote. This results in a malformed-line exception. While reading, I have explicitly specified the delimiter as \u0001. Is there any way to convert such data to Parquet without losing any data?
You can set the quote option to an empty string:
.option("quote", "")
// or, equivalently, .option("quote", "\u0000")
That would tell Spark to treat " as any other non-special character.
(tested with Spark 2.1.0)
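Putting that together with the original read, a minimal end-to-end sketch (outputLocation is illustrative):
val dataframe = sparkSession.sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", datasetDelimiter)
.option("quote", "\u0000") // disable quote handling
.option("header", "false")
.schema(schema)
.load(fileLocation)
dataframe.write.parquet(outputLocation)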

Spark - CSV text loading parsing error

I am using the following code to load a CSV file that has text/notes in it.
val data = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("parserLib", "UNIVOCITY")
.load(dataPath)
.na.drop()
The notes are not in any specific format. During loading I get this error:
com.univocity.parsers.common.TextParsingException: Error processing input: null
Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'.
I'd appreciate any help. Thanks.
I don't have the privilege to comment on the question, so I'm adding an answer. Since you are already doing na.drop(), you may use option("mode", "DROPMALFORMED") as well:
val data = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("mode", "DROPMALFORMED")
.option("parserLib", "UNIVOCITY")
.load(dataPath)
.na.drop()
BTW, the Databricks spark-csv reader is built into Spark 2.0+, so you can call spark.read.csv(...) directly.
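A hedged Spark 2.x equivalent of the snippet above using the built-in reader; if the notes contain embedded newlines (which the "Identified line separator characters" message suggests), the multiLine option (Spark 2.2+) may also help:
val data = spark.read
.option("header", "true")
.option("inferSchema", "true")
.option("mode", "DROPMALFORMED")
.option("multiLine", "true") // assumption: notes may embed newlines
.csv(dataPath)
.na.drop()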