Spark fails to read CSV when last column name contains spaces - scala

I have a CSV that looks like this:
+-----------------+-----------------+-----------------+
| Column One | Column Two | Column Three |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
In plain text, it actually looks like this:
Column One,Column Two,Column Three
This is a value,This is a value,This is a value
This is a value,This is a value,This is a value
This is a value,This is a value,This is a value
My spark.read method looks like this:
val df = spark.read
.format("csv")
.schema(schema)
.option("quote", "\"")
.option("escape", "\"")
.option("header", "true")
.option("multiLine", "true")
.option("mode", "DROPMALFORMED")
.load(inputFilePath)
When multiLine is set to true, the df loads as empty. It loads fine when multiLine is set to false, but I need multiLine set to true.
If I change the name of Column Three to ColumnThree, and also update that in the schema object, then it works fine. It seems like multiLine is being applied to the header row! I was hoping that wouldn't be the case when header is also set to true.
Any ideas how to get around this? Should I be using the univocity parser instead of the default commons?
UPDATE:
I don't know why that mocked data was working fine. Here's a closer representation of the data:
CSV (Just 1 header and 1 line of data...):
Digital ISBN,Print ISBN,Title,Price,File Name,Description,Book Cover File Name
97803453308,test,This is English,29.99,qwe_1.txt,test,test
Schema & the spark.read method:
val df = spark.read
.format("csv")
.schema(StructType(Array(
StructField("Digital ISBN", StringType, true),
StructField("Print ISBN", StringType, true),
StructField("Title", StringType, true),
StructField("File Name", StringType, true),
StructField("Price", StringType, true),
StructField("Description", StringType, true),
StructField("Book Cover File Name", StringType, true)
)))
.option("quote", "\"")
.option("escape", "\"")
.option("header", "true")
.option("multiLine", "true")
.option("mode", "DROPMALFORMED")
.load(inputFilePath)
df.show() result in spark-shell:
+------------+----------+-----+---------+-----+-----------+--------------------+
|Digital ISBN|Print ISBN|Title|File Name|Price|Description|Book Cover File Name|
+------------+----------+-----+---------+-----+-----------+--------------------+
+------------+----------+-----+---------+-----+-----------+--------------------+
UPDATE 2:
I think I found "what's different". When I copy the data in the CSV and save it to another CSV, it works fine. But that original CSV (which was saved by Excel), fails... The CSV saved by Excel is 1290 bytes, while the CSV I created myself (which works fine) is 1292 bytes....
UPDATE 3:
I opened the two files mentioned in Update2 in vim and noticed that the CSV saved by Excel had ^M instead of new lines. All of my testing prior to this was flawed because it was always comparing a CSV originally saved by Excel vs a CSV created from Sublime... Sublime wasn't showing the difference. I'm sure there's a setting or package I can install to see that, because I use Sublime as my go-to one-off file editor...
Not sure if I should close this question since the title is misleading. Then again, there's gotta be some value to someone out there lol...

Since the question has a few up-votes, here's the resolution to the original problem as an answer...
Newlines in files saved in the Windows world contain both carriage return and line feed. Spark (running on Linux) sees this as a malformed row and drops it, because in its world, newlines are just line feed.
Lessons:
It's important to be familiar with the origin of the file you're working with.
When debugging data processing issues, work with an editor that shows carriage returns.
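A quick way to check a file's line endings without relying on an editor is to count them directly. The following is a minimal sketch in plain Scala and is not part of the original answer. The names LineEndingCheck, count and normalize are made up for illustration, and the byte scan assumes an ASCII-compatible encoding such as UTF-8.

```scala
object LineEndingCheck {
  // How many line endings of each style were found.
  final case class LineEndings(crlf: Int, lf: Int, cr: Int)

  private val CR: Byte = 13 // '\r', displayed as ^M in vim
  private val LF: Byte = 10 // '\n'

  // Scans raw bytes; only valid for ASCII-compatible encodings such as UTF-8.
  def count(bytes: Array[Byte]): LineEndings = {
    var crlf = 0
    var lf = 0
    var cr = 0
    var i = 0
    while (i < bytes.length) {
      if (bytes(i) == CR) {
        if (i + 1 < bytes.length && bytes(i + 1) == LF) {
          crlf += 1
          i += 1 // skip the LF that belongs to this CRLF pair
        } else {
          cr += 1
        }
      } else if (bytes(i) == LF) {
        lf += 1
      }
      i += 1
    }
    LineEndings(crlf, lf, cr)
  }

  // Rewrites CRLF and bare CR as LF so every line ends the Unix way.
  def normalize(text: String): String =
    text.replace("\r\n", "\n").replace('\r', '\n')
}
```

For a real file, passing the result of java.nio.file.Files.readAllBytes to count reports which styles are present. A non-zero crlf or cr count would point at the kind of file described above.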

I was facing the same issue with the option of multiLine being applied to the header. I solved it by adding the additional option for ignoring trailing white space.
.option("header", true)
.option("multiLine", true)
.option("ignoreTrailingWhiteSpace", true)

Related

Spark Dataframe to TXT file without carriage return

I am trying to save the spark dataframe as text file. While doing this, I need to have specific column delimiter and row delimiters. I am unable to get the row delimiter working. Any help would be greatly appreciated.
Below is the sample code for reference.
//option -1
spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "\\§")
df.coalesce(1)
.map(_.mkString("\u00B6"))
.write
.option("encoding", "US-ASCI")
.mode(SaveMode.Overwrite).text(FileName)
//option-2
df.coalesce(1)
.write.mode(SaveMode.Overwrite)
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("encoding", "US-ASCI")
.option("multiLine", false)
.option("delimiter", "\u00B6")
.option("lineSep", "\u00A7")
.csv(FileName1)
Below is my input and output for reference:
Input:
Test1,Test2,Test2
Pqr,Rsu,Lmn
one,two,three
Output:
Test1¶Test2¶Test2§Pqr¶Rsu¶Lmn§one¶two¶three
From Spark 2.4.0, the "lineSep" option can be used to write json and text files with a custom line separator (cf. DataFrameWriter spec). This option is ignored in previous Spark versions and for csv format.
import spark.implicits._ // provides the String encoder that map needs below
val df = spark.createDataFrame(Seq(("Test1","Test2","Test2"), ("one","two","three")))
df.map(_.mkString("\u00B6"))
.coalesce(1)
.write
.option("lineSep", "\u00A7")
.text(FileName)
Output with Spark 2.4.*:
Test1¶Test2¶Test2§one¶two¶three
Output with Spark 2.3.* and lower (the "lineSep" option is ignored):
Test1¶Test2¶Test2
one¶two¶three
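To make the target layout concrete, here is a plain Scala sketch (no Spark involved) that joins fields with ¶ and records with §. It is only an illustration for data small enough to fit in memory, and the names CustomSeparators and render are hypothetical.

```scala
object CustomSeparators {
  val FieldSep = "\u00B6"  // ¶ pilcrow, placed between columns
  val RecordSep = "\u00A7" // § section sign, placed between rows

  // Joins each row's fields, then joins the rows, with no trailing separator.
  def render(rows: Seq[Seq[String]]): String =
    rows.map(_.mkString(FieldSep)).mkString(RecordSep)
}
```

With a small DataFrame, the rows could in principle be collected to the driver and rendered this way before the string is written out, though that gives up distributed writing.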

Parse CSV file in Scala

I am trying to load a CSV file that has Japanese characters into a dataframe in scala. When I read a column value as "セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!" which is supposed to go in one column only, it breaks the string at "」"(considers this as new line) and creates two records.
I have also set the "charset" property to UTF-16, and the quote character is "\"", but it still shows more records than the file contains.
val df = spark.read.option("sep", "\t").option("header", "true").option("charset","UTF-16").option("inferSchema", "true").csv("file.txt")
Any pointer on how to solve this would be very helpful.
Looks like there's a new line character in your Japanese string. Can you try using the multiLine option while reading the file?
var data = spark.read.format("csv")
.option("header","true")
.option("delimiter", "\n")
.option("charset", "utf-16")
.option("inferSchema", "true")
.option("multiLine", true)
.load(filePath)
Note: As per the below answer there are some concerns with this approach when the input file is very big.
How to handle multi line rows in spark?
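One possible explanation for the split at "」", offered as an assumption rather than something confirmed in this thread: in UTF-16 the character 」 (U+300D) is encoded with a 0x0D byte, the same value as a carriage return, so a reader that looks for line breaks byte by byte could mistake it for the end of a line. The sketch below only inspects byte values. The names Utf16ByteCheck, hex and hasLineBreakByte are illustrative.

```scala
import java.nio.charset.Charset

object Utf16ByteCheck {
  private val CR: Byte = 0x0D
  private val LF: Byte = 0x0A

  // Space-separated hex dump of the encoded string.
  def hex(s: String, charset: Charset): String =
    s.getBytes(charset).map(b => f"${b & 0xff}%02X").mkString(" ")

  // True when the encoded form contains a byte equal to CR or LF.
  def hasLineBreakByte(s: String, charset: Charset): Boolean =
    s.getBytes(charset).exists(b => b == CR || b == LF)
}
```

Calling hex("\u300D", StandardCharsets.UTF_16BE) gives "30 0D", while the UTF-8 encoding of the same character contains no CR or LF byte.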
The code below should work for UTF-16. I was not able to set the CSV file encoding to UTF-16 in Notepad++, so I tested it with UTF-8. Please make sure your input file is actually encoded as UTF-16.
Code snippet :
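import java.io.{BufferedReader, FileInputStream, InputStreamReader}

val br = new BufferedReader(
new InputStreamReader(
new FileInputStream("C:/Users/../Desktop/csvFile.csv"), "UTF-16"))
// readLine() returns null at end of file, so keep reading until then
Iterator.continually(br.readLine()).takeWhile(_ != null).foreach(println)
br.close()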
csvFile content used:
【セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!,January, セキュリティ, 開催, 1000.00
Update:
If you want to load using spark then you can load csv file as below.
spark.read
.format("com.databricks.spark.csv")
.option("charset", "UTF-16")
.option("header", "false")
.option("escape", "\\")
.option("delimiter", ",")
.option("inferSchema", "false")
.load(fromPath)
Sample Input file for above code:
"102","03","セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!","カグラアカガワヤツキヨク","セキュリティ","受講登録でス"

How to drop malformed rows while reading csv with schema Spark?

I am using a Spark Dataset to load a CSV file, and I prefer to specify the schema explicitly. However, a few rows do not comply with my schema: a column should be a double, but some rows contain non-numeric values. Is there an easy way to filter out all rows that do not comply with the schema?
val schema = StructType(StructField("col", DataTypes.DoubleType) :: Nil)
val ds = spark.read.format("csv").option("delimiter", "\t").schema(schema).load("f.csv")
f.csv:
a
1.0
I would like "a" to be filtered out of my Dataset. Thanks!
If you are reading a CSV file and want to drop the rows that do not match the schema, you can do so by setting the mode option to DROPMALFORMED.
Input data
a,1.0
b,2.2
c,xyz
d,4.5
e,asfsdfsdf
f,3.1
Schema
val schema = StructType(Seq(
StructField("key", StringType, false),
StructField("value", DoubleType, false)
))
Reading a csv file with schema and option as
val df = spark.read.schema(schema)
.option("mode", "DROPMALFORMED")
.csv("/path to csv file ")
Output:
+---+-----+
|key|value|
+---+-----+
|a  |1.0  |
|b  |2.2  |
|d  |4.5  |
|f  |3.1  |
+---+-----+
You can get more details on spark-csv here
Hope this helps!
.option("mode", "DROPMALFORMED") should do the work.
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
PERMISSIVE : sets other fields to null when it meets a corrupted record, and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When a schema is set by user, it sets null for extra fields.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
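For intuition only, the effect of DROPMALFORMED on a two-column key/value file can be imitated in plain Scala: keep a line when every field converts to the type the schema asks for, and silently drop it otherwise. This is a simplified sketch, not how Spark implements the mode. Record, parseLine and dropMalformed are hypothetical names.

```scala
import scala.util.Try

object DropMalformedSketch {
  final case class Record(key: String, value: Double)

  // Some(record) when the line has exactly two fields and the second one is
  // a valid double; None for anything else.
  def parseLine(line: String): Option[Record] =
    line.split(",", -1) match {
      case Array(k, v) => Try(v.trim.toDouble).toOption.map(Record(k, _))
      case _           => None
    }

  // Keeps only the lines that parse, in their original order.
  def dropMalformed(lines: Seq[String]): Seq[Record] =
    lines.flatMap(parseLine)
}
```

A FAILFAST equivalent would throw on the first None instead of skipping it, and a PERMISSIVE one would keep the row with a null in place of the bad value.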

Scala spark dataframe keep leading zeroes

I'm reading the following csv files :
id,hit,name
0001,00000,foo
0002,00001,bar
0003,00150,toto
I read them as a Spark DataFrame with an SqlContext, which gives this output:
+--+---+----+
|id|hit|name|
+--+---+----+
|1 |0 |foo |
|2 |1 |bar |
|3 |150|toto|
+--+---+----+
I need to keep the leading zeros in the Dataframe.
I tried with the option "allowNumericLeadingZeros" set to true, it doesn't work.
I saw some posts saying it's an Excel issue, but in my case the leading zeros are being removed inside the DataFrame itself.
How can I keep the leading zeros inside the Dataframe ?
Thanks !
public Dataset<Row> csv(String... paths)
Loads a CSV file and returns the result as a DataFrame.
This function will go through the input once to determine the input schema if inferSchema is enabled. To avoid going through the entire data once, disable inferSchema option or specify the schema explicitly using schema.
You can set the following CSV-specific options to deal with CSV files:
sep (default ,): sets the single character as a separator for each field and value.
encoding (default UTF-8): decodes the CSV files by the given encoding type.
quote (default "): sets the single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not null but an empty string. This behaviour is different from com.databricks.spark.csv.
escape (default \): sets the single character used for escaping quotes inside an already quoted value.
comment (default empty string): sets the single character used for skipping lines beginning with this character. By default, it is disabled.
header (default false): uses the first line as names of columns.
inferSchema (default false): infers the input schema automatically from data. It requires one extra pass over the data.
ignoreLeadingWhiteSpace (default false): defines whether or not leading whitespaces from values being read should be skipped.
ignoreTrailingWhiteSpace (default false): defines whether or not trailing whitespaces from values being read should be skipped.
nullValue (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.
nanValue (default NaN): sets the string representation of a "non-number" value.
positiveInf (default Inf): sets the string representation of a positive infinity value.
negativeInf (default -Inf): sets the string representation of a negative infinity value.
dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSZZ): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
java.sql.Timestamp.valueOf() and java.sql.Date.valueOf() or ISO 8601 format.
maxColumns (default 20480): defines a hard limit of how many columns a record can have.
maxCharsPerColumn (default 1000000): defines the maximum number of characters allowed for any given value being read.
maxMalformedLogPerPartition (default 10): sets the maximum number of malformed rows Spark will log for each partition. Malformed records beyond this number will be ignored.
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
PERMISSIVE : sets other fields to null when it meets a corrupted record. When a schema is set by user, it sets null for extra fields.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
Parameters:
paths - (undocumented)
Returns:
(undocumented)
Since:
2.0.0
https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/DataFrameReader.html
Example :
val dataframe= sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("/FileStore/tables/1wmfde6o1508943117023/Book2.csv")
val selectedData = dataframe.select("id","hit","name")
Result :
+---+---+----+
| id|hit|name|
+---+---+----+
| 1| 0| foo|
| 2| 1| bar|
| 3|150|toto|
+---+---+----+
Now change .option("inferSchema", "false")
val dataframe= sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "false") // Do not infer types; every column stays a string
.load("/FileStore/tables/1wmfde6o1508943117023/Book2.csv")
val selectedData = dataframe.select("id","hit","name")
Result :
+----+-----+----+
| id| hit|name|
+----+-----+----+
|0001|00000| foo|
|0002|00001| bar|
|0003|00150|toto|
+----+-----+----+
I would suggest you to create a schema for your dataframe and define the types as Strings.
You can create schema as
import org.apache.spark.sql.types._
val schema = StructType(Seq(StructField("id", StringType, true), StructField("hit", StringType, true), StructField("name", StringType, true)))
and use it in sqlContext as
val df = sqlContext.read.option("header", true).schema(schema).format("com.databricks.spark.csv")
.csv("path to csv file")
As the other answers point out, the likely culprit is .option("inferSchema", true), which makes Spark read the id and hit columns as integers and so strips the leading zeros.
So you can either read the CSV file without .option("inferSchema", true) or supply the schema defined above.
I hope the answer is helpful
You have most likely set inferSchema to true while reading the DataFrame. Remove this option or set it to false:
sparkSession.read.option("header","true").option("inferSchema","false").csv("path")
With this option enabled, Spark infers the schema of the DataFrame and sets each column's data type according to the values found. It infers that the id and hit columns are numeric, and that is why the leading zeros are removed.
For further assistance take a look at this
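The loss of the zeros is not specific to Spark: any conversion of "0001" to an integer keeps only the value 1. Here is a small plain Scala sketch of that effect, and of re-padding when the original width happens to be known. The width is an assumption you would have to supply yourself, and the names are illustrative.

```scala
object LeadingZeros {
  // Parsing as a number keeps the value but not the original formatting.
  def asInt(raw: String): Int = raw.toInt

  // Restores the zeros only if the fixed column width is known in advance.
  def pad(n: Int, width: Int): String = s"%0${width}d".format(n)
}
```

Keeping the column as a string, as suggested above, avoids the round trip entirely and does not depend on knowing the width.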

Read .csv data in european format with Spark

I am currently doing my first attempts with Apache Spark.
I would like to read a .csv File with an SQLContext object, but Spark won't provide the correct results as the File is a european one (comma as decimal separator and semicolon used as value separator).
Is there a way to tell Spark to follow a different .csv syntax?
val conf = new SparkConf()
.setMaster("local[8]")
.setAppName("Foo")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("header","true")
.option("inferSchema","true")
.load("data.csv")
df.show()
A row in the relating .csv looks like this:
04.10.2016;12:51:00;1,1;0,41;0,416
Spark interprets the entire row as a column. df.show() prints:
+--------------------------------+
|Col1;Col2,Col3;Col4;Col5 |
+--------------------------------+
| 04.10.2016;12:51:...|
+--------------------------------+
In previous attempts to get it working df.show() was even printing more row-content where it now says '...' but eventually cutting the row at the comma in the third col.
You can just read the file as text and split each line by ;, or set a custom delimiter for the CSV format with .option("delimiter", ";").
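A minimal sketch of the split-by-semicolon approach in plain Scala, assuming the five columns are a date, a time and three decimal numbers as in the sample row. The case class Reading and its field names are invented for illustration. It splits on ; and parses the decimal commas with a German-locale NumberFormat.

```scala
import java.text.NumberFormat
import java.time.{LocalDate, LocalTime}
import java.time.format.DateTimeFormatter
import java.util.Locale

object EuropeanCsv {
  final case class Reading(date: LocalDate, time: LocalTime, a: Double, b: Double, c: Double)

  private val dateFormat = DateTimeFormatter.ofPattern("dd.MM.yyyy")

  // Locale.GERMANY treats ',' as the decimal separator.
  // NumberFormat is not thread-safe, so a fresh instance is created per call.
  private def parseDecimal(s: String): Double =
    NumberFormat.getInstance(Locale.GERMANY).parse(s.trim).doubleValue()

  // Parses one semicolon-separated row; rejects rows without exactly 5 fields.
  def parseRow(line: String): Reading =
    line.split(";", -1) match {
      case Array(d, t, a, b, c) =>
        Reading(LocalDate.parse(d, dateFormat), LocalTime.parse(t),
          parseDecimal(a), parseDecimal(b), parseDecimal(c))
      case other =>
        throw new IllegalArgumentException(s"expected 5 fields, got ${other.length}")
    }
}
```

Setting the delimiter option alone splits the columns, but the values with decimal commas would presumably still arrive as strings and need a conversion like parseDecimal afterwards.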