spark SQL - spark.read.option reading dd-MMM-yyyy from csv into dataFrame - date

I have a CSV file with a string column in dd-MMM-yyyy format (e.g. 03-APR-2019), which I want to read as a date.
My code to read it is below:
spark.read
.option("header", "true")
.option("inferSchema", "true")
.option("quote", "\"")
.option("escape", "\"")
.option("multiLine", "true")
.option("dateFormat", "dd-MMM-yyyy")
.csv(csvInPath)
However, after reading the CSV file, the date column still appears as a String in my DataFrame.
Can anyone advise? Thanks.
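A possible direction, offered as a sketch rather than a confirmed fix: as far as I know, inferSchema in Spark 2.x never infers DateType, so the dateFormat option only takes effect when the schema already declares the column as a date. The example below passes an explicit schema instead of inferring one. The column names id and trade_date and the object name ReadDateColumn are hypothetical placeholders, not taken from the question.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType}

object ReadDateColumn {
  // Hypothetical schema: replace the names with the real CSV header columns.
  val schema: StructType = StructType(Seq(
    StructField("id", StringType, nullable = true),
    StructField("trade_date", DateType, nullable = true)
  ))

  // Reads the CSV with an explicit schema so that dateFormat is applied
  // to the column declared as DateType.
  def readWithSchema(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("dateFormat", "dd-MMM-yyyy")
      .schema(schema)
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("date-demo")
      .getOrCreate()

    // Small throwaway input file, only for this demonstration.
    val tmp = Files.createTempFile("dates", ".csv")
    Files.write(tmp, "id,trade_date\n1,03-APR-2019\n".getBytes("UTF-8"))

    val df = readWithSchema(spark, tmp.toString)
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```

If declaring the whole schema is impractical, another option is to keep the column as a string and convert it afterwards with to_date(col("trade_date"), "dd-MMM-yyyy") from org.apache.spark.sql.functions (available since Spark 2.2).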

Related

Parse CSV file in Scala

I am trying to load a CSV file that contains Japanese characters into a DataFrame in Scala. When I read a column value such as "セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!", which is supposed to go into a single column, Spark breaks the string at "」" (treating it as a new line) and creates two records.
I have also set the "charset" option to UTF-16, and the quote character is "\"", but it still shows more records than the file contains.
val df = spark.read.option("sep", "\t").option("header", "true").option("charset","UTF-16").option("inferSchema", "true").csv("file.txt")
Any pointer on how to solve this would be very helpful.
Looks like there's a new line character in your Japanese string. Can you try using the multiLine option while reading the file?
val data = spark.read.format("csv")
.option("header","true")
.option("delimiter", "\t") // tab, matching the "sep" used in the question
.option("charset", "utf-16")
.option("inferSchema", "true")
.option("multiLine", true)
.load(filePath)
Note: According to the answer linked below, this approach can be problematic when the input file is very large.
How to handle multi line rows in spark?
The code below should work for UTF-16. I was not able to set the CSV file encoding to UTF-16 in Notepad++, so I tested it with UTF-8. Please make sure that your input file is actually encoded as UTF-16.
Code snippet :
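import java.io.{BufferedReader, FileInputStream, InputStreamReader}

val br = new BufferedReader(
  new InputStreamReader(
    new FileInputStream("C:/Users/../Desktop/csvFile.csv"), "UTF-16"))
try {
  // readLine() returns null at end of file, so keep reading until then
  Iterator.continually(br.readLine()).takeWhile(_ != null).foreach(println)
} finally {
  br.close()
}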
csvFile content used:
【セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!,January, セキュリティ, 開催, 1000.00
Update:
If you want to load using spark then you can load csv file as below.
spark.read
.format("com.databricks.spark.csv")
.option("charset", "UTF-16")
.option("header", "false")
.option("escape", "\\")
.option("delimiter", ",")
.option("inferSchema", "false")
.load(fromPath)
Sample Input file for above code:
"102","03","セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!","カグラアカガワヤツキヨク","セキュリティ","受講登録でス"

Spark-Scala quote issue

My input data is in ISO-8859-1 format. It is a cedilla-delimited file, and the data contains a double quote. I am converting the file to UTF-8. When doing so, Spark inserts an escape character and extra quotes. What can I do to make sure that the extra quotes and the escape character are not added to the output?
Sample Input
XYZÇVIB BROS CRANE AND BIG "TONYÇ1961-02-23Ç00:00:00
Sample Output
XYZÇ"VIB BROS CRANE AND BIG \"TONY"Ç1961-02-23Ç00:00:00
Code
var InputFormatDataFrame = sparkSession.sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", delimiter)
.option("charset", input_format)
.option("header", "false")
.option("treatEmptyValuesAsNulls","true")
.option("nullValue"," ")
.option("quote","")
.option("quoteMode","NONE")
//.option("escape","\"")
.option("ignoreLeadingWhiteSpace", "true")
.option("ignoreTrailingWhiteSpace", "true")
.option("mode","FAILFAST")
.load(input_location)
InputFormatDataFrame.write.mode("overwrite").option("delimiter", delimiter).option("charset", "UTF-8").csv(output_location)

Spark-Scala Malformed Line Issue

I have a control-A delimited file which I am trying to convert to parquet format. However in the file there is a String field with a single " in it.
Reading the data like below:
val dataframe = sparkSession.sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", datasetDelimiter)
.option("header", "false")
.option("mode","FAILFAST")
//.option("mode", "DROPMALFORMED")
.option("treatEmptyValuesAsNulls","true")
.option("nullValue"," ")
.option("ignoreLeadingWhiteSpace", "true")
.option("ignoreTrailingWhiteSpace", "true")
.schema(schema)
.load(fileLocation)
dataframe
As you can see, the data contains an opening double quote with no closing double quote, which results in a Malformed Line exception. When reading, I explicitly set the delimiter to \u0001. Is there any way to convert such data to parquet without losing any data?
You can set the quote option to empty String:
.option("quote", "")
// or, equivalently, .option("quote", "\u0000")
That would tell Spark to treat " as any other non-special character.
(tested with Spark 2.1.0)

spark parquet conversion issue with malformed lines in file

I have a "\u0001"-delimited file that I am reading with Spark for parquet conversion. I don't have any issues with the schema, but the data has quotes (") in the middle of fields without a closing quote. I tried different solutions but couldn't figure it out.
val df = sparkSession.sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", "\u0001")
//.option("quote", "\"")
//.option("quote", null)
//.option("quoteMode", "ALL")
.option("header", "false")
.option("mode","FAILFAST")
.option("treatEmptyValuesAsNulls","true")
.option("nullValue"," ")
.option("ignoreLeadingWhiteSpace", "true")
.option("ignoreTrailingWhiteSpace", "true")
.schema(schema)
.load(fileLocation)
Thanks in advance and appreciate your help
You can use sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter","\u0001")
and read as textFile
val sentences = sparkContext.textFile(directoryPath)
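A related sketch, not taken from the answer above: if the file is read as plain text with the default newline record delimiter, each line can be split on \u0001 by hand, which sidesteps quote handling entirely because a double quote is then just another character. The object name SplitControlA and the function splitFields are hypothetical helpers for illustration.

```scala
object SplitControlA {
  // Splits one line on the Control-A character (\u0001).
  // The limit of -1 keeps trailing empty fields instead of dropping them.
  def splitFields(line: String): Array[String] =
    line.split("\u0001", -1)

  def main(args: Array[String]): Unit = {
    // A line with an unclosed double quote in the second field.
    val line = "a\u0001b \"c\u0001"
    splitFields(line).foreach(field => println(s"[$field]"))
  }
}
```

With Spark this could be applied as sparkContext.textFile(directoryPath).map(SplitControlA.splitFields), after which the resulting arrays still need to be mapped onto the schema.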

Spark - CSV text loading parsing error

I am using the following code to load a CSV file that has text/notes in it.
val data = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("parserLib", "UNIVOCITY")
.load(dataPath)
.na.drop()
Notes are not in any specific format. During loading I am getting this error:
com.univocity.parsers.common.TextParsingException: Error processing input: null
Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'.
I'd appreciate any help. Thanks.
I do not have the privilege to comment on the question, so I'm adding this as an answer.
Since you are already calling na.drop(), you may want to use option("mode", "DROPMALFORMED") as well.
val data = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("mode", "DROPMALFORMED")
.option("parserLib", "UNIVOCITY")
.load(dataPath)
.na.drop()
BTW, the Databricks spark-csv package is built into Spark 2.0+.
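To illustrate that last point, here is a minimal sketch of the same read using the built-in CSV reader of Spark 2.0+, assuming a SparkSession is available. The object name BuiltInCsvRead and the function readNotes are hypothetical; the options are the ones from the snippet above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object BuiltInCsvRead {
  // Same options as in the snippet above. "parserLib" is left out because,
  // as far as I know, the built-in reader always uses univocity.
  val readOptions: Map[String, String] = Map(
    "header" -> "true",
    "inferSchema" -> "true",
    "mode" -> "DROPMALFORMED"
  )

  // dataPath is the same path that was passed to load() before.
  def readNotes(spark: SparkSession, dataPath: String): DataFrame =
    spark.read
      .options(readOptions)
      .csv(dataPath)
      .na.drop()
}
```

No external package is needed on the classpath for this version; spark.read.csv(...) replaces format("com.databricks.spark.csv").load(...).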