How to drop malformed rows while reading csv with schema Spark? - scala

I am using a Spark Dataset to load a CSV file, and I prefer to specify the schema explicitly. However, I find that a few rows do not comply with my schema: a column should be double, but some rows contain non-numeric values. Is it possible to easily filter out all rows that do not comply with my schema from the Dataset?
val schema = StructType(StructField("col", DataTypes.DoubleType) :: Nil)
val ds = spark.read.format("csv").option("delimiter", "\t").schema(schema).load("f.csv")
f.csv:
a
1.0
I would like the row containing "a" to be filtered out of my Dataset easily. Thanks!

If you are reading a CSV file and want to drop the rows that do not match the schema, you can do this by setting the mode option to DROPMALFORMED.
Input data
a,1.0
b,2.2
c,xyz
d,4.5
e,asfsdfsdf
f,3.1
Schema
val schema = StructType(Seq(
StructField("key", StringType, false),
StructField("value", DoubleType, false)
))
Read the CSV file with the schema and the mode option as follows:
val df = spark.read.schema(schema)
.option("mode", "DROPMALFORMED")
.csv("/path to csv file ")
Output:
+---+-----+
|key|value|
+---+-----+
|a  |1.0  |
|b  |2.2  |
|d  |4.5  |
|f  |3.1  |
+---+-----+
You can get more details on spark-csv here
Hope this helps!

.option("mode", "DROPMALFORMED") should do the work.
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
PERMISSIVE : sets other fields to null when it meets a corrupted record, and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When a schema is set by the user, it sets null for extra fields.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
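Applied to the single-column file from the original question, a minimal sketch could look like the following (schema, tab delimiter, and file path taken from the question; the output comment is an assumption):
import org.apache.spark.sql.types._

// The row "a" cannot be parsed as a Double, so it is treated as malformed
// and dropped when mode is set to DROPMALFORMED.
val schema = StructType(StructField("col", DataTypes.DoubleType) :: Nil)
val ds = spark.read
  .format("csv")
  .option("delimiter", "\t")
  .option("mode", "DROPMALFORMED")
  .schema(schema)
  .load("f.csv")
ds.show()  // expected to contain only the 1.0 row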

Related

Spark-shell : The number of columns doesn't match

I have a CSV file separated by the pipe delimiter "|", and the dataset has 2 columns, like below:
Column1|Column2
1|Name_a
2|Name_b
But sometimes we receive only one column value and the other is missing, like below:
Column1|Column2
1|Name_a
2|Name_b
3
4
5|Name_c
6
7|Name_f
Any row with a mismatched number of columns is garbage for us; in the above example those are the rows with Column1 values 3, 4, and 6, and we want to discard them. Is there any direct way to discard those rows, without getting an exception while reading the data from spark-shell like below?
val readFile = spark.read.option("delimiter", "|").csv("File.csv").toDF(Seq("Column1", "Column2"): _*)
When we try to read the file we get the below exception:
java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (1): _c0
New column names (2): Column1, Column2
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.Dataset.toDF(Dataset.scala:435)
... 49 elided
You can specify the schema of your data file and allow some columns to be nullable. In Scala it may look like:
val schm = StructType(
  StructField("Column1", StringType, nullable = true) ::
  StructField("Column2", StringType, nullable = true) :: Nil)
val readFile = spark.read
  .option("delimiter", "|")
  .schema(schm)
  .csv("File.csv")
Then you can filter your dataset where the column is not null.
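For example, the filter could be written as follows (a small sketch, assuming the second column is named Column2 as in the schema above):
import org.apache.spark.sql.functions.col

// Keep only the rows where the second column was actually present.
val onlyComplete = readFile.filter(col("Column2").isNotNull)
onlyComplete.show()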
Just add the DROPMALFORMED mode option while reading, as shown below. Setting this makes Spark drop the corrupted records.
val readFile = spark.read
.option("delimiter", "|")
.option("mode", "DROPMALFORMED") // Option to drop invalid rows.
.csv("File.csv")
.toDF(Seq("Column1", "Column2"): _*)
This is documented here.

Handling schema mismatches in Spark

I am reading a csv file using Spark in Scala.
The schema is predefined and I am using it for reading.
This is the example code:
// create the schema
val schema= StructType(Array(
StructField("col1", IntegerType,false),
StructField("col2", StringType,false),
StructField("col3", StringType,true)))
// Initialize Spark session
val spark: SparkSession = SparkSession.builder
.appName("Parquet Converter")
.getOrCreate
// Create a data frame from a csv file
val dataFrame: DataFrame =
spark.read.format("csv").schema(schema).option("header", false).load(inputCsvPath)
From what I read, when reading CSV with Spark using a schema there are 3 options:
Set mode to DROPMALFORMED --> this will drop the lines that don't match the schema
Set mode to PERMISSIVE --> this will set the whole line to null values
Set mode to FAILFAST --> this will throw an exception when a mismatch is discovered
What is the best way to combine the options? The behaviour I want is to get the schema mismatches, print them as errors, and ignore those lines in my data frame.
Basically, I want a combination of FAILFAST and DROPMALFORMED.
Thanks in advance
This is what I eventually did:
I added the "_corrupt_record" column to the schema, for example:
val schema= StructType(Array(
StructField("col1", IntegerType,true),
StructField("col2", StringType,false),
StructField("col3", StringType,true),
StructField("_corrupt_record", StringType, true)))
Then I read the CSV using PERMISSIVE mode (it is Spark default):
val dataFrame: DataFrame = spark.read.format("csv")
.schema(schema)
.option("header", false)
.option("mode", "PERMISSIVE")
.load(inputCsvPath)
Now my data frame holds an additional column that holds the rows with schema mismatches.
I filtered the rows that have mismatched data and printed them:
val badRows = dataFrame.filter("_corrupt_record is not null")
badRows.cache()
badRows.show()
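To keep only the rows that did parse, a small follow-up sketch filters the other way and drops the helper column (on some Spark versions you may need to cache dataFrame first, as was done above for badRows):
// Rows whose _corrupt_record is null parsed cleanly against the schema.
val goodRows = dataFrame
  .filter("_corrupt_record is null")
  .drop("_corrupt_record")
goodRows.show()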
Just use DROPMALFORMED and follow the log. If malformed records are present, they are dumped to the log, up to the limit set by the maxMalformedLogPerPartition option.
spark.read.format("csv")
.schema(schema)
.option("header", false)
.option("mode", "DROPMALFORMED")
.option("maxMalformedLogPerPartition", 128)
.load(inputCsvPath)

Spark fails to read CSV when last column name contains spaces

I have a CSV that looks like this:
+-----------------+-----------------+-----------------+
| Column One | Column Two | Column Three |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
In plain text, it actually looks like this:
Column One,Column Two,Column Three
This is a value,This is a value,This is a value
This is a value,This is a value,This is a value
This is a value,This is a value,This is a value
My spark.read method looks like this:
val df = spark.read
.format("csv")
.schema(schema)
.option("quote", "\"")
.option("escape", "\"")
.option("header", "true")
.option("multiLine", "true")
.option("mode", "DROPMALFORMED")
.load(inputFilePath)
When multiLine is set to true, the df loads as empty. It loads fine when multiLine is set to false, but I need multiLine set to true.
If I change the name of Column Three to ColumnThree, and also update that in the schema object, then it works fine. It seems like multiLine is being applied to the header row! I was hoping that wouldn't be the case when header is also set to true.
Any ideas how to get around this? Should I be using the univocity parser instead of the default commons?
UPDATE:
I don't know why that mocked data was working fine. Here's a closer representation of the data:
CSV (Just 1 header and 1 line of data...):
Digital ISBN,Print ISBN,Title,Price,File Name,Description,Book Cover File Name
97803453308,test,This is English,29.99,qwe_1.txt,test,test
Schema & the spark.read method:
val df = spark.read
.format("csv")
.schema(StructType(Array(
StructField("Digital ISBN", StringType, true),
StructField("Print ISBN", StringType, true),
StructField("Title", StringType, true),
StructField("File Name", StringType, true),
StructField("Price", StringType, true),
StructField("Description", StringType, true),
StructField("Book Cover File Name", StringType, true)
)))
.option("quote", "\"")
.option("escape", "\"")
.option("header", "true")
.option("multiLine", "true")
.option("mode", "DROPMALFORMED")
.load(inputFilePath)
df.show() result in spark-shell:
+------------+----------+-----+---------+-----+-----------+--------------------+
|Digital ISBN|Print ISBN|Title|File Name|Price|Description|Book Cover File Name|
+------------+----------+-----+---------+-----+-----------+--------------------+
+------------+----------+-----+---------+-----+-----------+--------------------+
UPDATE 2:
I think I found "what's different". When I copy the data in the CSV and save it to another CSV, it works fine. But that original CSV (which was saved by Excel), fails... The CSV saved by Excel is 1290 bytes, while the CSV I created myself (which works fine) is 1292 bytes....
UPDATE 3:
I opened the two files mentioned in Update2 in vim and noticed that the CSV saved by Excel had ^M instead of new lines. All of my testing prior to this was flawed because it was always comparing a CSV originally saved by Excel vs a CSV created from Sublime... Sublime wasn't showing the difference. I'm sure there's a setting or package I can install to see that, because I use Sublime as my go-to one-off file editor...
Not sure if I should close this question since the title is misleading. Then again, there's gotta be some value to someone out there lol...
Since the question has a few up-votes, here's the resolution to the original problem as an answer...
Newlines in files saved in the Windows world contain both carriage return and line feed. Spark (running on Linux) sees this as a malformed row and drops it, because in its world, newlines are just line feed.
Lessons:
It's important to be familiar with the origin of the file you're working with.
When debugging data processing issues, work with an editor that shows carriage returns.
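If rewriting the input is an option, one hedged pre-processing sketch (assuming a small, local file; cleanedFilePath is a hypothetical output path) is to normalize CRLF to LF once and then read the cleaned copy with the same options:
import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

// Hypothetical local paths; adapt for HDFS or cloud storage as needed.
val raw = new String(Files.readAllBytes(Paths.get(inputFilePath)), StandardCharsets.UTF_8)
Files.write(Paths.get(cleanedFilePath), raw.replace("\r\n", "\n").getBytes(StandardCharsets.UTF_8))

val df = spark.read
  .format("csv")
  .schema(schema)
  .option("header", "true")
  .option("multiLine", "true")
  .option("mode", "DROPMALFORMED")
  .load(cleanedFilePath)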
I was facing the same issue, with multiLine being applied to the header. I solved it by adding the additional option to ignore trailing white space.
.option("header", true)
.option("multiLine", true)
.option("ignoreTrailingWhiteSpace", true)

creating dataframe by loading csv file using scala in spark

But the CSV file has extra double quotes added, which collapses all columns into a single column.
There are four columns, a header, and 2 rows:
"""SlNo"",""Name"",""Age"",""contact"""
"1,""Priya"",78,""Phone"""
"2,""Jhon"",20,""mail"""
val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",",").option("inferSchema","true").load ("bank.csv")
df: org.apache.spark.sql.DataFrame = ["SlNo","Name","Age","contact": string]
What you can do is read the file using sparkContext, replace all " characters with an empty string, and use zipWithIndex() to separate the header from the data so that a custom schema and a row RDD can be created. Finally, just use the row RDD and schema in sqlContext's createDataFrame API:
//reading text file, replacing and splitting and finally zipping with index
val rdd = sc.textFile("bank.csv").map(_.replaceAll("\"", "").split(",")).zipWithIndex()
//separating header to form schema
val header = rdd.filter(_._2 == 0).flatMap(_._1).collect()
val schema = StructType(header.map(StructField(_, StringType, true)))
//separating data to form row rdd
val rddData = rdd.filter(_._2 > 0).map(x => Row.fromSeq(x._1))
//creating the dataframe
sqlContext.createDataFrame(rddData, schema).show(false)
You should be getting
+----+-----+---+-------+
|SlNo|Name |Age|contact|
+----+-----+---+-------+
|1 |Priya|78 |Phone |
|2 |Jhon |20 |mail |
+----+-----+---+-------+
I hope the answer is helpful
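On Spark 2.2+ (assuming a SparkSession named spark rather than the sqlContext used above), an alternative sketch is to clean the stray quotes as a Dataset[String] and hand it straight to the CSV reader, which accepts a Dataset[String] source:
import spark.implicits._

// Strip the extra double quotes line by line, then let the CSV reader parse the result.
val cleaned = spark.read.textFile("bank.csv").map(_.replaceAll("\"", ""))
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(cleaned)
df.show(false)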

Scala spark dataframe keep leading zeroes

I'm reading the following CSV file:
id,hit,name
0001,00000,foo
0002,00001,bar
0003,00150,toto
as a Spark DataFrame with an SQLContext, which gives the output:
+--+---+----+
|id|hit|name|
+--+---+----+
|1 |0 |foo |
|2 |1 |bar |
|3 |150|toto|
+--+---+----+
I need to keep the leading zeros in the Dataframe.
I tried with the option "allowNumericLeadingZeros" set to true, but it doesn't work.
I saw some posts saying it's an Excel issue, but my issue is that the leading zeros are being removed inside the DataFrame.
How can I keep the leading zeros inside the DataFrame?
Thanks !
public Dataset csv(String... paths)
Loads a CSV file and returns the result as a DataFrame.
This function will go through the input once to determine the input schema if inferSchema is enabled. To avoid going through the entire data once, disable inferSchema option or specify the schema explicitly using schema.
You can set the following CSV-specific options to deal with CSV files:
sep (default ,): sets the single character as a separator for each field and value.
encoding (default UTF-8): decodes the CSV files by the given encoding type.
quote (default "): sets the single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not null but an empty string. This behaviour is different from com.databricks.spark.csv.
escape (default \): sets the single character used for escaping quotes inside an already quoted value.
comment (default empty string): sets the single character used for skipping lines beginning with this character. By default, it is disabled.
header (default false): uses the first line as names of columns.
inferSchema (default false): infers the input schema automatically from data. It requires one extra pass over the data.
ignoreLeadingWhiteSpace (default false): defines whether or not leading whitespaces from values being read should be skipped.
ignoreTrailingWhiteSpace (default false): defines whether or not trailing whitespaces from values being read should be skipped.
nullValue (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.
nanValue (default NaN): sets the string representation of a non-number value.
positiveInf (default Inf): sets the string representation of a positive infinity value.
negativeInf (default -Inf): sets the string representation of a negative infinity value.
dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSZZ): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
maxColumns (default 20480): defines a hard limit of how many columns a record can have.
maxCharsPerColumn (default 1000000): defines the maximum number of characters allowed for any given value being read.
maxMalformedLogPerPartition (default 10): sets the maximum number of malformed rows Spark will log for each partition. Malformed records beyond this number will be ignored.
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
PERMISSIVE : sets other fields to null when it meets a corrupted record. When a schema is set by user, it sets null for extra fields.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
Since:
2.0.0
https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/DataFrameReader.html
Example :
val dataframe= sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("/FileStore/tables/1wmfde6o1508943117023/Book2.csv")
val selectedData = dataframe.select("id","hit","name")
Result :
+---+---+----+
| id|hit|name|
+---+---+----+
| 1| 0| foo|
| 2| 1| bar|
| 3|150|toto|
+---+---+----+
Now change .option("inferSchema", "false")
val dataframe= sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "false") // Do not infer data types; keep columns as strings
.load("/FileStore/tables/1wmfde6o1508943117023/Book2.csv")
val selectedData = dataframe.select("id","hit","name")
Result :
+----+-----+----+
| id| hit|name|
+----+-----+----+
|0001|00000| foo|
|0002|00001| bar|
|0003|00150|toto|
+----+-----+----+
I would suggest you create a schema for your dataframe and define the types as strings.
You can create schema as
import org.apache.spark.sql.types._
val schema = StructType(Seq(StructField("id", StringType, true), StructField("hit", StringType, true), StructField("name", StringType, true)))
and use it in sqlContext as
val df = sqlContext.read.option("header", true).schema(schema).format("com.databricks.spark.csv")
  .load("path to csv file")
As others have answered, the real culprit is that you might be using .option("inferSchema", true), which reads the id and hit columns as integers, so the leading 0s are removed.
So you can read the CSV file without .option("inferSchema", true), or with the schema defined as above.
I hope the answer is helpful
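If you later need the numeric values as well, one small sketch (hit_num is a hypothetical column name) is to keep the zero-padded strings and derive a numeric column on demand:
import org.apache.spark.sql.functions.col

// "hit" stays a zero-padded string; hit_num is a derived numeric copy.
val withNumeric = df.withColumn("hit_num", col("hit").cast("double"))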
You must have set inferSchema to true while reading the dataframe; remove this option or set it to false:
sparkSession.read.option("header","true").option("inferSchema","false").csv("path")
Through this option, Spark infers the schema of the dataframe and sets the data types according to the values found, so Spark is basically inferring that the id and hit columns are numeric in nature and is therefore removing all the leading zeros.
For further assistance take a look at this