Scala spark dataframe keep leading zeroes - scala

I'm reading the following csv files :
id,hit,name
0001,00000,foo
0002,00001,bar
0003,00150,toto
As a Spark DataFrame with an SQLContext, which gives the output:
+--+---+----+
|id|hit|name|
+--+---+----+
|1 |0 |foo |
|2 |1 |bar |
|3 |150|toto|
+--+---+----+
I need to keep the leading zeros in the DataFrame.
I tried setting the option "allowNumericLeadingZeros" to true, but it doesn't work.
I saw some posts saying it's an Excel issue, but my problem is that the leading zeros are being removed inside the DataFrame.
How can I keep the leading zeros inside the DataFrame?
Thanks !

public Dataset<Row> csv(String... paths)
Loads a CSV file and returns the result as a DataFrame.
This function will go through the input once to determine the input schema if inferSchema is enabled. To avoid going through the entire data once, disable inferSchema option or specify the schema explicitly using schema.
You can set the following CSV-specific options to deal with CSV files:
sep (default ,): sets the single character as a separator for each field and value.
encoding (default UTF-8): decodes the CSV files by the given encoding type.
quote (default "): sets the single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not null but an empty string. This behaviour is different from com.databricks.spark.csv.
escape (default \): sets the single character used for escaping quotes inside an already quoted value.
comment (default empty string): sets the single character used for skipping lines beginning with this character. By default, it is disabled.
header (default false): uses the first line as names of columns.
inferSchema (default false): infers the input schema automatically from data. It requires one extra pass over the data.
ignoreLeadingWhiteSpace (default false): defines whether or not leading whitespaces from values being read should be skipped.
ignoreTrailingWhiteSpace (default false): defines whether or not trailing whitespaces from values being read should be skipped.
nullValue (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.
nanValue (default NaN): sets the string representation of a "non-number" value.
positiveInf (default Inf): sets the string representation of a positive infinity value.
negativeInf (default -Inf): sets the string representation of a negative infinity value.
dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSZZ): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
maxColumns (default 20480): defines a hard limit of how many columns a record can have.
maxCharsPerColumn (default 1000000): defines the maximum number of characters allowed for any given value being read.
maxMalformedLogPerPartition (default 10): sets the maximum number of malformed rows Spark will log for each partition. Malformed records beyond this number will be ignored.
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
PERMISSIVE : sets other fields to null when it meets a corrupted record. When a schema is set by user, it sets null for extra fields.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
Since:
2.0.0
https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/DataFrameReader.html
Example :
val dataframe = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("/FileStore/tables/1wmfde6o1508943117023/Book2.csv")
val selectedData = dataframe.select("id","hit","name")
Result :
+---+---+----+
| id|hit|name|
+---+---+----+
| 1| 0| foo|
| 2| 1| bar|
| 3|150|toto|
+---+---+----+
Now change .option("inferSchema", "false")
val dataframe = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "false") // Do not infer data types; keep every column as a string
.load("/FileStore/tables/1wmfde6o1508943117023/Book2.csv")
val selectedData = dataframe.select("id","hit","name")
Result :
+----+-----+----+
| id| hit|name|
+----+-----+----+
|0001|00000| foo|
|0002|00001| bar|
|0003|00150|toto|
+----+-----+----+
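To confirm that nothing was converted, a quick check of the schema (the printed output below is what you would expect with inferSchema disabled, since every CSV column is then read as a string):
selectedData.printSchema()
// root
//  |-- id: string (nullable = true)
//  |-- hit: string (nullable = true)
//  |-- name: string (nullable = true)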

I would suggest you create a schema for your dataframe and define the types as Strings.
You can create the schema as
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("id", StringType, true),
  StructField("hit", StringType, true),
  StructField("name", StringType, true)
))
and use it in sqlContext as
val df = sqlContext.read.option("header", true).schema(schema)
.format("com.databricks.spark.csv")
.load("path to csv file")
As others have answered, the real culprit is that you are probably using .option("inferSchema", true), which treats the id and hit columns as integers, so the leading 0s are removed.
So you can read the csv file without .option("inferSchema", true), or with the schema defined as above.
I hope the answer is helpful

You must have set inferSchema to true while reading the dataframe; remove this option or set it to false:
sparkSession.read.option("header","true").option("inferSchema","false").csv("path")
With this option, Spark infers the schema of the dataframe and sets each data type according to the values found, so Spark is basically inferring that the id and hit columns are numeric and is therefore removing all the leading zeros.
For further assistance take a look at this
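If you actually need id and hit to stay numeric for arithmetic but still want the zero-padded form for output, a hedged alternative (not from the answers above; the widths 4 and 5 are assumptions taken from the sample file) is to re-pad the inferred integer columns with lpad:
import org.apache.spark.sql.functions.{col, lpad}

// dataframe is the DataFrame read with inferSchema enabled, as in the first example above.
// lpad left-pads the string representation back to a fixed width with "0".
val repadded = dataframe
  .withColumn("id", lpad(col("id").cast("string"), 4, "0"))
  .withColumn("hit", lpad(col("hit").cast("string"), 5, "0"))

repadded.show() // id and hit show as 0001, 00000, ... again, but as strings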

Related

How to remove all characters that start with "_" from a spark string column

I'm trying to modify a column from my dataFrame by removing the suffix from all the rows under that column and I need it in Scala.
The values from the column have different lengths and also the suffix is different.
For example, I have the following values:
09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0
0978C74C69E8D559A62F860EA36ADF5E-28_3_1
0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1
0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1
22AEA8C8D403643111B781FE31B047E3-0_1_0_0
I need to remove everything from the first "_" onward so that I can get the following values:
09E9894DB868B70EC3B55AFB49975390-0
0978C74C69E8D559A62F860EA36ADF5E-28
0C12FA1DAFA8BCD95E34EE70E0D71D10-0
0D075AA40CFC244E4B0846FA53681B4D
22AEA8C8D403643111B781FE31B047E3-0
As @werner pointed out in his comment, substring_index provides a simple solution to this. It is not necessary to wrap this in a call to selectExpr.
Whereas @AminMal has provided a working solution using a UDF, if a native Spark function can be used then this is preferable for performance.[1]
import org.apache.spark.sql.functions.{col, substring_index}
import spark.implicits._ // already in scope in spark-shell; needed for .toDF on a local collection

val df = List(
  "09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0",
  "0978C74C69E8D559A62F860EA36ADF5E-28_3_1",
  "0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1",
  "0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1",
  "22AEA8C8D403643111B781FE31B047E3-0_1_0_0"
).toDF("col0")

df
  .withColumn("col0", substring_index(col("col0"), "_", 1))
  .show(false)
gives:
+-----------------------------------+
|col0 |
+-----------------------------------+
|09E9894DB868B70EC3B55AFB49975390-0 |
|0978C74C69E8D559A62F860EA36ADF5E-28|
|0C12FA1DAFA8BCD95E34EE70E0D71D10-0 |
|0D075AA40CFC244E4B0846FA53681B4D |
|22AEA8C8D403643111B781FE31B047E3-0 |
+-----------------------------------+
[1] Is there a performance penalty when composing spark UDFs

How to Save a file with multiple delimiter in spark

I need to save a file delimited by "|~" characters but I get an error when I execute the command below. Can I save a file using multiple delimiters in Spark?
mydf1.coalesce(1).write.option("compression","none").format("csv").mode("Overwrite").option("delimiter","|~").save("my_hdfs_path")
// Error : pyspark.sql.utils.IllegalArgumentException: u'Delimiter cannot be more than one character: |~'
AFAIK, we are still waiting for an "official" solution, because the issue "support for multiple delimiter in Spark CSV read" is still open, and SparkSQL still relies on univocity-parsers. In univocity's CSV settings, the CSV delimiter can only be a single character, which constrains both the parser (reader) and the generator (writer).
Workarounds
Finding a universally fast and safe way to write as CSV is hard. But depending on your data size and the complexity of the CSV contents (date format? currency? quoting?), we may find a shortcut. The following are just some, hopefully inspiring, thoughts...
Write to CSV with a special character (say ⊢), then substitute it with |~.
(I haven't benchmarked this, but IMO it is likely to be the fastest.)
df.coalesce(1).write.option("compression","none").option("delimiter", "⊢").mode("overwrite").csv("raw-output")
then post-process (ideally locally) with, say, sed
sed -i '.bak' 's/⊢/\|~/g' raw-output/*.csv
Within PySpark, concatenate each row into a string, then write it as a text file.
(This can be flexible enough to deal with locality and special needs -- with a bit more work.)
d = [{'name': 'Alice', 'age': 1},{'name':'Bob', 'age':3}]
df = spark.createDataFrame(d, "name:string, age:int")
df.show()
#+-----+---+
#| name|age|
#+-----+---+
#|Alice| 1|
#| Bob| 3|
#+-----+---+
from pyspark.sql.functions import udf

@udf
def mkstr(name, age):
    """
    for example, the string field {name} should be quoted with `"`
    """
    return '"{name}"|~{age}'.format(name=name, age=age)
# unparse a CSV row back to a string
df_unparsed = df.select(mkstr("name", "age").alias("csv_row"))
df_unparsed.show()
#+----------+
#| csv_row|
#+----------+
#|"Alice"|~1|
#| "Bob"|~3|
#+----------+
df_unparsed.coalesce(1).write.option("compression", "none").mode("overwrite").text("output")
numpy.savetxt allows multiple characters as a delimiter, so ...
(numpy has lots of built-ins if you care about the precision of floating-point numbers)
import pandas as pd
import numpy as np
# convert `Spark.DataFrame` to `Pandas.DataFrame`
df_pd = df.toPandas()
# use `numpy.savetxt` to save `Pandas.DataFrame`
np.savetxt("a-long-day.csv", df_pd, delimiter="|~", fmt="%s")
From Spark 3.0 we don't have this issue, but if you are using a prior version (later than Spark 2.3), this can also be used as a solution: basically, concatenate all the columns, fill nulls with blanks, and write the data with the desired delimiter along with the header. This is more generic than hardcoding, and it also allows the header to be retained.
from pyspark.sql.functions import *
from pyspark.sql import functions as F
target_row_delimited = "|,"
df=df.select([col(c).cast("string") for c in df.columns])
df=df.na.fill("")
headername=target_row_delimited.join(df.columns)
df=df.withColumn(headername, concat_ws(target_row_delimited, *df.columns))
df.select(df[headername]).write.format("csv").mode(modeval).option("quoteAll", "false").option("quote","\u0000").option("header", "true").save(tgt_path + "/")
In case we need to read with multiple delimiters, the following solution can be used:
source_delimiter = "|_|"
headcal = spark.read.text(source_filename)
headers = headcal.take(1)[0]['value']
header_column = headers.split(source_delimiter)
df = sc.textFile(source_filename).filter(lambda x: x != headers).map(lambda x: x.split(source_delimiter)).toDF(header_column)  # filter out the header line before splitting
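As the answer above notes, Spark 3.0+ accepts a multi-character separator directly when reading CSV, so the manual header/split handling is no longer needed there. A minimal Scala sketch (the path is a placeholder):
// Spark 3.0+ only: "sep" may be longer than one character for CSV reads.
val multiDelimDf = spark.read
  .option("header", "true")
  .option("sep", "|~")
  .csv("/path/to/input")   // placeholder path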

Spark-shell : The number of columns doesn't match

I have a CSV-format file separated by the pipe delimiter "|", and the dataset has 2 columns, like below.
Column1|Column2
1|Name_a
2|Name_b
But sometimes we receive only one column value and the other is missing, like below:
Column1|Column2
1|Name_a
2|Name_b
3
4
5|Name_c
6
7|Name_f
So any row with a mismatched column count is a garbage value for us; in the above example those are the rows with column value 3, 4, and 6, and we want to discard these rows. Is there any direct way I can discard those rows without getting an exception while reading the data from spark-shell, like below?
val readFile = spark.read.option("delimiter", "|").csv("File.csv").toDF(Seq("Column1", "Column2"): _*)
When we are trying to read the file we are getting the below exception.
java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (1): _c0
New column names (2): Column1, Column2
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.Dataset.toDF(Dataset.scala:435)
... 49 elided
You can specify the schema of your data file and allow some columns to be nullable. In Scala it may look like this:
import org.apache.spark.sql.types._

val schm = StructType(
  StructField("Column1", StringType, nullable = true) ::
  StructField("Column2", StringType, nullable = true) :: Nil)

val readFile = spark.read
  .option("delimiter", "|")
  .schema(schm)
  .csv("File.csv")
Then you can filter your dataset to keep only the rows where the column is not null, for example as in the sketch below.
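A minimal sketch of that filter, building on the readFile value above:
import org.apache.spark.sql.functions.col

// Keep only the rows where the second column was actually present.
val validRows = readFile.filter(col("Column2").isNotNull)
validRows.show()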
Just add the DROPMALFORMED mode to the options while reading, as below. Setting this makes Spark drop the corrupted records.
val readFile = spark.read
.option("delimiter", "|")
.option("mode", "DROPMALFORMED") // Option to drop invalid rows.
.csv("File.csv")
.toDF(Seq("Column1", "Column2"): _*)
This is documented here.

Spark fails to read CSV when last column name contains spaces

I have a CSV that looks like this:
+-----------------+-----------------+-----------------+
| Column One | Column Two | Column Three |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
In plain text, it actually looks like this:
Column One,Column Two,Column Three
This is a value,This is a value,This is a value
This is a value,This is a value,This is a value
This is a value,This is a value,This is a value
My spark.read method looks like this:
val df = spark.read
.format("csv")
.schema(schema)
.option("quote", "\"")
.option("escape", "\"")
.option("header", "true")
.option("multiLine", "true")
.option("mode", "DROPMALFORMED")
.load(inputFilePath)
When multiLine is set to true, the df loads as empty. It loads fine when multiLine is set to false, but I need multiLine set to true.
If I change the name of Column Three to ColumnThree, and also update that in the schema object, then it works fine. It seems like multiLine is being applied to the header row! I was hoping that wouldn't be the case when header is also set to true.
Any ideas how to get around this? Should I be using the univocity parser instead of the default commons?
UPDATE:
I don't know why that mocked data was working fine. Here's a closer representation of the data:
CSV (Just 1 header and 1 line of data...):
Digital ISBN,Print ISBN,Title,Price,File Name,Description,Book Cover File Name
97803453308,test,This is English,29.99,qwe_1.txt,test,test
Schema & the spark.read method:
val df = spark.read
.format("csv")
.schema(StructType(Array(
StructField("Digital ISBN", StringType, true),
StructField("Print ISBN", StringType, true),
StructField("Title", StringType, true),
StructField("File Name", StringType, true),
StructField("Price", StringType, true),
StructField("Description", StringType, true),
StructField("Book Cover File Name", StringType, true)
)))
.option("quote", "\"")
.option("escape", "\"")
.option("header", "true")
.option("multiLine", "true")
.option("mode", "DROPMALFORMED")
.load(inputFilePath)
df.show() result in spark-shell:
+------------+----------+-----+---------+-----+-----------+--------------------+
|Digital ISBN|Print ISBN|Title|File Name|Price|Description|Book Cover File Name|
+------------+----------+-----+---------+-----+-----------+--------------------+
+------------+----------+-----+---------+-----+-----------+--------------------+
UPDATE 2:
I think I found "what's different". When I copy the data in the CSV and save it to another CSV, it works fine. But that original CSV (which was saved by Excel), fails... The CSV saved by Excel is 1290 bytes, while the CSV I created myself (which works fine) is 1292 bytes....
UPDATE 3:
I opened the two files mentioned in Update2 in vim and noticed that the CSV saved by Excel had ^M instead of new lines. All of my testing prior to this was flawed because it was always comparing a CSV originally saved by Excel vs a CSV created from Sublime... Sublime wasn't showing the difference. I'm sure there's a setting or package I can install to see that, because I use Sublime as my go-to one-off file editor...
Not sure if I should close this question since the title is misleading. Then again, there's gotta be some value to someone out there lol...
Since the question has a few up-votes, here's the resolution to the original problem as an answer...
Newlines in files saved in the Windows world contain both carriage return and line feed. Spark (running on Linux) sees this as a malformed row and drops it, because in its world, newlines are just line feed.
Lessons:
It's important to be familiar with the origin of the file you're working with.
When debugging data processing issues, work with an editor that shows carriage returns.
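As a programmatic alternative to a carriage-return-aware editor, here is a minimal sketch (plain JVM I/O on a local copy of the file; the path is a placeholder) that checks whether a file contains Windows-style line endings:
import java.nio.file.{Files, Paths}

// Read the raw bytes and look for carriage returns ('\r', byte 13);
// their presence indicates CRLF (Windows) line endings.
val bytes = Files.readAllBytes(Paths.get("/path/to/file.csv"))   // placeholder path
val hasCarriageReturns = bytes.contains('\r'.toByte)
println(s"File contains carriage returns: $hasCarriageReturns")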
I was facing the same issue with the option of multiLine being applied to the header. I solved it by adding the additional option for ignoring trailing white space.
.option("header", true)
.option("multiLine", true)
.option("ignoreTrailingWhiteSpace", true)

How to drop malformed rows while reading csv with schema Spark?

I am using a Spark Dataset to load a csv file, and I prefer to specify the schema explicitly. But I find there are a few rows that are not compliant with my schema: a column should be double, but some rows contain non-numeric values. Is it possible to easily filter out all rows that are not compliant with my schema from the Dataset?
val schema = StructType(StructField("col", DataTypes.DoubleType) :: Nil)
val ds = spark.read.format("csv").option("delimiter", "\t").schema(schema).load("f.csv")
f.csv:
a
1.0
I prefer "a" can be filtered from my DataSet easily. Thanks!
If you are reading a CSV file and want to drop the rows that do not match the schema, you can do this by setting the mode option to DROPMALFORMED.
Input data
a,1.0
b,2.2
c,xyz
d,4.5
e,asfsdfsdf
f,3.1
Schema
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("key", StringType, false),
  StructField("value", DoubleType, false)
))
Reading a csv file with schema and option as
val df = spark.read.schema(schema)
.option("mode", "DROPMALFORMED")
.csv("/path to csv file ")
Output:
+---+-----+
|key|value|
+---+-----+
|a  |1.0  |
|b  |2.2  |
|d  |4.5  |
|f  |3.1  |
+---+-----+
You can get more details on spark-csv here
Hope this helps!
.option("mode", "DROPMALFORMED") should do the work.
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
PERMISSIVE : sets other fields to null when it meets a corrupted record, and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When
a schema is set by user, it sets null for extra fields.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
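If you would rather inspect the bad rows than silently drop them, the PERMISSIVE behaviour quoted above can be combined with columnNameOfCorruptRecord. A hedged sketch (the column name _corrupt_record and the path are assumptions; the corrupt-record column must also be declared in the user schema as a string):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Declare the corrupt-record column as part of the schema, as a StringType field.
val schemaWithCorrupt = StructType(Seq(
  StructField("key", StringType, true),
  StructField("value", DoubleType, true),
  StructField("_corrupt_record", StringType, true)   // name is an assumption
))

val parsed = spark.read
  .schema(schemaWithCorrupt)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("/path to csv file")   // placeholder path
  .cache()                    // some Spark versions require caching before querying
                              // only the corrupt-record column of a raw CSV

parsed.filter(col("_corrupt_record").isNotNull).show(false)   // the rows that failed to parse
val clean = parsed.filter(col("_corrupt_record").isNull).drop("_corrupt_record")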