Scala Spark: Read CSV with comma as decimal separator

I am trying to read a CSV in Databricks (Spark/Scala) with a comma (,) as the decimal separator.
I use the badRecordsPath option to capture bad records in a specified path, so I can't just read the column as a string and then replace "," with ".".
When I try to read the CSV I get the following error: "java.lang.NumberFormatException: For input string: "0,939"".
This means that numbers with "," as the decimal separator can't be read as Float.
My code is a simple Spark read with a schema:
val schema = new StructType()
  .add("mycolumn1", "float")
  .add("mycolumn2", "int")
  .add("mycolumn3", "timestamp")
And the read:
spark.read
  .options(Map("inferSchema" -> "false", "delimiter" -> ";", "header" -> "true", "timestampFormat" -> "dd/MM/yyyy HH:mm", "badRecordsPath" -> path to save bad records))
  .schema(schema)
  .csv(path to csv)
Does anybody have an idea of how I can read it as a Float?
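For what it's worth, the failure can be reproduced in plain Scala: the JVM's Float parser only accepts "." as the decimal separator, regardless of locale. Below is a minimal sketch of the usual workaround, normalizing the separator before converting (in Spark this would mean reading the column as StringType and applying regexp_replace before a cast, which admittedly conflicts with keeping badRecordsPath validation on that column):

```scala
// "0,939".toFloat throws NumberFormatException; normalizing the
// decimal separator first makes the conversion succeed.
val raw = "0,939"
val parsed = raw.replace(",", ".").toFloat
println(parsed)  // prints 0.939
```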

Related

Spark Dataframe from a different data format

I have this data set, for which I need to create a Spark DataFrame in Scala. The data is a column in a CSV file; the column name is dataheader.
dataheader
"{""date_time"":""1999/05/22 03:03:07.011"",""cust_id"":""cust1"",""timestamp"":944248234000,""msgId"":""113"",""activityTimeWindowMilliseconds"":20000,""ec"":""event1"",""name"":""ABC"",""entityId"":""1001"",""et"":""StateChange"",""logType"":""type123,""lastActivityTS"":944248834000,""sc_id"":""abc1d1c9"",""activityDetectedInLastTimeWindow"":true}"
"{""date_time"":""1999/05/23 03:03:07.011"",""cust_id"":""cust1"",""timestamp"":944248234000,""msgId"":""114"",""activityTimeWindowMilliseconds"":20000,""ec"":""event2"",""name"":""ABC"",""entityId"":""1001"",""et"":""StateChange"",""logType"":""type123,""lastActivityTS"":944248834000,""sc_id"":""abc1d1c9"",""activityDetectedInLastTimeWindow"":true}"
I was able to read the CSV file:
val df_tmp = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("quoteMode", "ALL")
  .option("delimiter", ",")
  .option("escape", "\"")
  // .option("inferSchema", "true")
  .option("multiline", "true")
  .load("D:\\dataFile.csv")
I tried to split the data into separate columns in a DataFrame but did not succeed.
One thing I noticed in the data: both keys and values are enclosed in doubled double quotes, e.g. ""key1"":""value1"".
It's obviously a string in JSON format. If you want to get at the fields inside the dataheader column, you need to parse it and then write the result into a new CSV file.
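As a sketch of that parsing step: the doubled quotes (""…"") are ordinary CSV quote-escaping, which a reader configured with escape "\"" already undoes, so each cell comes back as a plain JSON string. In Spark you would typically hand that string to from_json with a schema; the snippet below just pulls one field (cust_id, a name taken from the sample rows) out of a single cell with a regex to illustrate the shape of the data:

```scala
// One cell after CSV un-escaping: an ordinary JSON string
// (shortened from the sample row in the question).
val cell = """{"date_time":"1999/05/22 03:03:07.011","cust_id":"cust1","msgId":"113"}"""

// Extract a single field; in Spark you would use from_json instead.
val custId = "\"cust_id\":\"([^\"]*)\"".r
  .findFirstMatchIn(cell)
  .map(_.group(1))
println(custId)  // prints Some(cust1)
```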

Parse CSV file in Scala

I am trying to load a CSV file containing Japanese characters into a DataFrame in Scala. When I read a column value such as "セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!", which should go into a single column, the read breaks the string at "」" (treating it as a new line) and creates two records.
I have set the "charset" property to UTF-16 and the quote character is "\"", but it still shows more records than the file contains.
val df = spark.read.option("sep", "\t").option("header", "true").option("charset","UTF-16").option("inferSchema", "true").csv("file.txt")
Any pointers on how to solve this would be very helpful.
Looks like there's a new line character in your Japanese string. Can you try using the multiLine option while reading the file?
var data = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", "\t")   // the file is tab-separated, per the question
  .option("charset", "UTF-16")
  .option("inferSchema", "true")
  .option("multiLine", true)
  .load(filePath)
Note: as per the answer below, there are some concerns with this approach when the input file is very big.
How to handle multi line rows in spark?
The code below should work for UTF-16. I wasn't able to set the CSV file's encoding to UTF-16 in Notepad++, so I tested it with UTF-8; please make sure your input file really is encoded as UTF-16.
Code snippet :
import java.io.{BufferedReader, FileInputStream, InputStreamReader}

val br = new BufferedReader(
  new InputStreamReader(
    new FileInputStream("C:/Users/../Desktop/csvFile.csv"), "UTF-16"))
// readLine() returns one line (or null at end of file), so loop until null
var line = br.readLine()
while (line != null) {
  println(line)
  line = br.readLine()
}
br.close()
csvFile content used:
【セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!,January, セキュリティ, 開催, 1000.00
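A quick way to check the charset handling in isolation is to round-trip a line through a temporary file with an explicit UTF-16 encoder (the sample text is taken from the question; the file is a throwaway temp file):

```scala
import java.io.{BufferedReader, File, FileInputStream, FileOutputStream, InputStreamReader, OutputStreamWriter}

// Write one UTF-16 encoded CSV line, then read it back with the same
// charset; if the charsets match, the line survives intact.
val tmp = File.createTempFile("utf16", ".csv")
val out = new OutputStreamWriter(new FileOutputStream(tmp), "UTF-16")
out.write("【セキュリティ対策ウェビナー開催中】,January,1000.00")
out.close()

val in = new BufferedReader(
  new InputStreamReader(new FileInputStream(tmp), "UTF-16"))
val line = in.readLine()
in.close()
tmp.delete()

println(line.split(",")(1))  // prints January
```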
Update:
If you want to load using spark then you can load csv file as below.
spark.read
  .format("com.databricks.spark.csv")
  .option("charset", "UTF-16")
  .option("header", "false")
  .option("escape", "\\")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .load(fromPath)
Sample Input file for above code:
"102","03","セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!","カグラアカガワヤツキヨク","セキュリティ","受講登録でス"

Spark, Scala not able to create view appropriately after reading from file

I am using Spark and Scala on JDK 1.8. I am new to Scala.
I am reading a text file (pat1.txt) that looks like :
Now I am reading that file from my scala code as :
val sqlContext = SparkSession.builder().getOrCreate()
sqlContext.read
  .format(externalEntity.getExtractfileType)
  .option("compression", externalEntity.getCompressionCodec)
  .option("header", if (externalEntity.getHasHeader.toUpperCase == "Y") "true" else "false")
  .option("inferSchema", "true")
  .option("delimiter", externalEntity.getExtractDelimiter)
  .load(externalEntity.getFilePath)
  .createOrReplaceTempView(externalEntity.getExtractName)
And then I make a query from my Scala code:
val queryResult = sqlContext.sql(myQuery)
and the output is generated as:
queryResult
  .repartition(LteGenericExtractEntity.getNumberOfFiles.toInt)
  .write.format("csv")
  .option("compression", LteGenericExtractEntity.getCompressionCodec)
  .option("delimiter", LteGenericExtractEntity.getExtractDelimiter)
  .option("header", "true")
  .save(s"${outputDirectory}/${extractFileBase}")
Now when the 'myQuery' above is
select * from PAT1
The program generates output with an extra header line containing "value" that was not part of the file. Basically, the program is not able to identify the ","-separated columns in the input file, and the output has a single column under a header named "value". So the output file looks like:
If I change 'myQuery' as :
select p1.FIRST_NAME, p1.LAST_NAME,p1.HOBBY from PAT1 p1
It throws an exception:
My input can be in any format (e.g. text or CSV, possibly compressed) and the output will always be .csv.
I am having a hard time understanding how to change the read part so that the created view gets its columns set up appropriately. Can I get help with that?
This looks like a CSV file, just with a .txt extension.
You could try the following:
Read this file as CSV with extra options like spark.read.option("inferSchema", "true").option("header", "true").csv("path/to/file")
Or read with the csv format (the text format would return everything in a single value column, which is exactly the symptom above) and name the columns of the DataFrame explicitly:
sqlContext.read.format("csv")
  .option("compression", "none")
  .option("delimiter", ",")
  .option("header", "true")
  .load("/tmp/pat1")
  .toDF("first_name", "last_name", "hobby")
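The stray "value" column is the giveaway: Spark's text source always returns each line whole, in a single column named value, and ignores the csv-specific delimiter option, so splitting becomes your job. A Spark-free sketch of the difference (the row content is hypothetical, shaped like pat1.txt):

```scala
// With format("text") every line arrives as one string in a "value"
// column; splitting on the delimiter is then up to you.
// format("csv") does this split for you.
val value = "John,Smith,Chess"   // hypothetical row from pat1.txt
val fields = value.split(",")
println(fields.length)  // prints 3
```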

Spark-Scala Malformed Line Issue

I have a Control-A delimited file which I am trying to convert to Parquet format. However, the file contains a String field with a single " in it.
Reading the data like below:
val dataframe = sparkSession.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", datasetDelimiter)
  .option("header", "false")
  .option("mode", "FAILFAST")
  // .option("mode", "DROPMALFORMED")
  .option("treatEmptyValuesAsNulls", "true")
  .option("nullValue", " ")
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .schema(schema)
  .load(fileLocation)
dataframe
As you can see, there is just an opening double quote in the data and no closing double quote, which results in a malformed-line exception. While reading, I explicitly set the delimiter to U0001. Is there any way to convert such data to Parquet without losing any data?
You can set the quote option to an empty string:
.option("quote", "")
// or, equivalently, .option("quote", "\u0000")
That tells Spark to treat " like any other non-special character.
(tested with Spark 2.1.0)
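A plain-Scala sketch of why this works: once no character is treated as a quote, a lone " is just data, and splitting on Control-A recovers every field intact (the sample line below is made up):

```scala
// A Control-A (U+0001) delimited line containing a single, unclosed
// double quote; with quoting disabled the quote is ordinary data.
val line = "id1" + "\u0001" + "he said \"hello" + "\u0001" + "42"
val fields = line.split('\u0001')
println(fields.length)  // prints 3
```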

Condense several columns read with Spark CSV

I have data like the following in a CSV file:
ColumnA,1,2,3,2,1
"YYY",242,34234,232,322,432
"ZZZ",16,435,363,3453,3434
I want to read it with https://github.com/databricks/spark-csv
I would like to read this into a DataFrame and condense all the columns except the first one into a Seq.
So I would like to obtain something like this from it:
MyCaseClass("YYY", Seq(242,34234,232,322,432))
MyCaseClass("ZZZ", Seq(16,435,363,3453,3434))
I'm not sure how to obtain that.
I tried reading like this, where url is the location of the file:
val rawData = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(url)
Then, I am mapping it into the values that I want.
The problem is that I get the error:
The header contains a duplicate entry: '1'
So how can I condense all the fields except the first into a Seq using spark-csv?
EDIT
I cannot change the format of the input.
You can do this by mapping over each Row. Also, as Pawel's comment says, duplicate column names are not allowed. So you can do something like:
val dataFrame = yourCSV_DataFrame
dataFrame.map { row =>
  Row(row(0), Seq(row(1), row(2), row(3) ...))
}
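Reading with "header" set to "false" sidesteps the duplicate-header error entirely; you can then drop the first row and condense each parsed row yourself. Here is a Spark-free sketch of that condensing step (MyCaseClass mirrors the shape asked for in the question):

```scala
// Condense one parsed CSV row: the first field stays as a key,
// the remaining fields become a Seq[Int].
case class MyCaseClass(key: String, values: Seq[Int])

val row = Seq("YYY", "242", "34234", "232", "322", "432")
val condensed = MyCaseClass(row.head, row.tail.map(_.toInt))
println(condensed)
```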