PySpark na.fill not replacing null values with 0 in DF

I am using the following code sample:
paths = ["/FileStore/tables/data.csv"]
infer_schema = "true"
df = sqlContext.read \
    .format("com.databricks.spark.csv") \
    .option("inferSchema", infer_schema) \
    .option("header", "true") \
    .load(paths)
df.printSchema()
root
 |-- key: string (nullable = true)
 |-- dt: string (nullable = true)
 |-- key1: string (nullable = true)
 |-- key2: string (nullable = true)
 |-- sls: string (nullable = true)
 |-- uts: string (nullable = true)
 |-- key3: string (nullable = true)
I did the following to count the null values for the fields sls and uts:
from pyspark.sql.functions import col, count, when
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+---+--+----+----+---+---+----+
|key|dt|key1|key2|sls|uts|key3|
+---+--+----+----+---+---+----+
|  0| 0|   0|   0|616|593|   0|
+---+--+----+----+---+---+----+
I did the following first:
df.na.fill({'sls': 0, 'uts': 0})
Then I realized these are string fields. So, I did:
df.na.fill({'sls': '0', 'uts': '0'})
After doing this, if I run:
df.filter("sls is NULL").show()
I still see null values in the sls field:
+---+---------+----+-----------+----+---+-----------+
|key|       dt|key1|       key2| sls|uts|       key3|
+---+---------+----+-----------+----+---+-----------+
| -1|7/13/2020|8000|41342299215|null|  1|1.70228E+25|
| -1|12/5/2019|8734| 8983349833|null|  1|1.76412E+26|
| -1| 1/7/2020|8822|      1E+15|null|  1|4.69408E+24|
| -1|12/5/2018|6768|      1E+15|null|  1|4.54778E+24|
+---+---------+----+-----------+----+---+-----------+
It's the same thing if I do:
df.filter("uts is NULL").show()
Is there something I am missing? Why am I unable to replace the null values with 0?

.na.fill returns a new DataFrame with the null values replaced; it does not modify the DataFrame it is called on. You just need to assign the result back to the df variable for the replacement to take effect:
df = df.na.fill({'sls': '0', 'uts': '0'})
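
For completeness, a minimal sketch of the full sequence, reusing the null count from the question to verify the fill (the column names are the ones from the question's schema). The commented-out cast is only needed if the columns should really be numeric, since a fill value whose type does not match the column type is ignored:

from pyspark.sql.functions import col, count, when

# na.fill returns a *new* DataFrame, so capture the result
df = df.na.fill({'sls': '0', 'uts': '0'})

# verify: the counts for sls and uts should now be 0
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# optional: cast to a numeric type first, then fill with a numeric 0
# df = df.withColumn('sls', col('sls').cast('double')).withColumn('uts', col('uts').cast('double'))
# df = df.na.fill({'sls': 0, 'uts': 0})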

Related

How to change struct dataType to Integer in pyspark?

I have a dataframe df, and one column has the data type struct<long:bigint, string:string>.
Because of this data type, I cannot perform addition, subtraction, etc.
How do I change struct<long:bigint, string:string> to just IntegerType?
You can use dot syntax to access the fields of a struct column.
For example, if you start with this DataFrame:
df = spark.createDataFrame([(1,(3,'x')),(4,(8, 'y'))]).toDF("col1", "col2")
df.show()
df.printSchema()
+----+------+
|col1| col2|
+----+------+
| 1|[3, x]|
| 4|[8, y]|
+----+------+
root
|-- col1: long (nullable = true)
|-- col2: struct (nullable = true)
| |-- _1: long (nullable = true)
| |-- _2: string (nullable = true)
you can select the first part of the struct column and either create a new column or replace an existing one:
df.withColumn('col2', df['col2._1']).show()
prints
+----+----+
|col1|col2|
+----+----+
| 1| 3|
| 4| 8|
+----+----+
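
Since the question asks for IntegerType specifically, you can also cast the extracted field; a minimal sketch, assuming the numeric part of the struct is the first field (_1) as in the example above:

from pyspark.sql.types import IntegerType

df = df.withColumn('col2', df['col2._1'].cast(IntegerType()))
df.printSchema()  # col2 is now integer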

Read CSV with last column as array of values (and the values are inside parenthesis and separated by comma) in Spark

I have a CSV file where the last column is inside parentheses and its values are separated by commas. The number of values in the last column is variable. When I read the file as a DataFrame with some column names, as follows, I get Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match. My CSV file looks like this:
a1,b1,true,2017-05-16T07:00:41.0000000,2.5,(c1,d1,e1)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,f2,g2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,k2,f2)
What I finally want is something like this:
root
|-- MId: string (nullable = true)
|-- PId: string (nullable = true)
|-- IsTeacher: boolean(nullable = true)
|-- STime: datetype(nullable = true)
|-- TotalMinutes: double(nullable = true)
|-- SomeArrayHeader: array<string>(nullable = true)
I have written the following code till now:
val infoDF = sqlContext.read.format("csv")
  .option("header", "false")
  .load(inputPath)
  .toDF(
    "MId",
    "PId",
    "IsTeacher",
    "STime",
    "TotalMinutes",
    "SomeArrayHeader")
I thought of reading the file without giving column names and then casting the columns after the fifth one to array type, but I run into problems with the parentheses. Is there a way to do this while reading, telling the parser that the fields inside the parentheses are actually one field of type array?
OK. The solution is tactical and specific to your case, but the following worked for me:
import org.apache.spark.sql.functions._

val df = spark.read.option("quote", "(").csv("in/staff.csv").toDF(
  "MId",
  "PId",
  "IsTeacher",
  "STime",
  "TotalMinutes",
  "arr")
df.show()

val df2 = df.withColumn("arr", split(regexp_replace('arr, "[)]", ""), ","))
df2.printSchema()
df2.show()
Output:
+---+---+---------+--------------------+------------+---------------+
|MId|PId|IsTeacher| STime|TotalMinutes| arr|
+---+---+---------+--------------------+------------+---------------+
| a1| b1| true|2017-05-16T07:00:...| 2.5| c1,d1,e1)|
| a2| b2| true|2017-05-26T07:00:...| 0.5|c2,d2,e2,f2,g2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2,d2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2,d2,e2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5|c2,d2,e2,k2,f2)|
+---+---+---------+--------------------+------------+---------------+
root
|-- MId: string (nullable = true)
|-- PId: string (nullable = true)
|-- IsTeacher: string (nullable = true)
|-- STime: string (nullable = true)
|-- TotalMinutes: string (nullable = true)
|-- arr: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---+---------+--------------------+------------+--------------------+
|MId|PId|IsTeacher| STime|TotalMinutes| arr|
+---+---+---------+--------------------+------------+--------------------+
| a1| b1| true|2017-05-16T07:00:...| 2.5| [c1, d1, e1]|
| a2| b2| true|2017-05-26T07:00:...| 0.5|[c2, d2, e2, f2, g2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2, d2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2, d2, e2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5|[c2, d2, e2, k2, f2]|
+---+---+---------+--------------------+------------+--------------------+
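
If you are working from PySpark, here is a minimal sketch of the same trick (treat "(" as the quote character so the whole parenthesised group lands in one column, then strip the trailing ")" and split on commas); the path and column names are the placeholders from the answer above:

from pyspark.sql.functions import col, regexp_replace, split

df = (spark.read.option("quote", "(").csv("in/staff.csv")
      .toDF("MId", "PId", "IsTeacher", "STime", "TotalMinutes", "arr"))
df = df.withColumn("arr", split(regexp_replace(col("arr"), "[)]", ""), ","))
df.printSchema()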

Spark: How to split struct type into multiple columns?

I know this question has been asked many times on Stack Overflow and has been satisfactorily answered in most posts, but I'm not sure if this is the best way in my case.
I have a Dataset that has several struct types embedded in it:
root
|-- STRUCT1: struct (nullable = true)
| |-- FIELD_1: string (nullable = true)
| |-- FIELD_2: long (nullable = true)
| |-- FIELD_3: integer (nullable = true)
|-- STRUCT2: struct (nullable = true)
| |-- FIELD_4: string (nullable = true)
| |-- FIELD_5: long (nullable = true)
| |-- FIELD_6: integer (nullable = true)
|-- STRUCT3: struct (nullable = true)
| |-- FIELD_7: string (nullable = true)
| |-- FIELD_8: long (nullable = true)
| |-- FIELD_9: integer (nullable = true)
|-- ARRAYSTRUCT4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FIELD_10: integer (nullable = true)
| | |-- FIELD_11: integer (nullable = true)
+-------+------------+------------+------------------+
|STRUCT1| STRUCT2 | STRUCT3 | ARRAYSTRUCT4 |
+-------+------------+------------+------------------+
|[1,2,3]|[aa, xx, yy]|[p1, q2, r3]|[[1a, 2b],[3c,4d]]|
+-------+------------+------------+------------------+
I want to convert this into:
1. A dataset where the structs are expanded into columns.
2. A dataset where the array (ARRAYSTRUCT4) is exploded into rows.
root
|-- FIELD_1: string (nullable = true)
|-- FIELD_2: long (nullable = true)
|-- FIELD_3: integer (nullable = true)
|-- FIELD_4: string (nullable = true)
|-- FIELD_5: long (nullable = true)
|-- FIELD_6: integer (nullable = true)
|-- FIELD_7: string (nullable = true)
|-- FIELD_8: long (nullable = true)
|-- FIELD_9: integer (nullable = true)
|-- FIELD_10: integer (nullable = true)
|-- FIELD_11: integer (nullable = true)
+-------+------------+------------+---------+ ---------+----------+
|FIELD_1| FIELD_2 | FIELD_3 | FIELD_4 | |FIELD_10| FIELD_11 |
+-------+------------+------------+---------+ ... ---------+----------+
|1 |2 |3 | aa | | 1a | 2b |
+-------+------------+------------+-----------------------------------+
To achieve this, I could use:
val expanded = df.select("STRUCT1.*", "STRUCT2.*", "STRUCT3.*", "ARRAYSTRUCT4")
followed by an explode:
val exploded = expanded.select(explode(expanded("ARRAYSTRUCT4")))
However, I was wondering if there's a more functional way to do this, especially the select. I could use withColumn as below:
data.withColumn("FIELD_1", $"STRUCT1".getItem(0))
.withColumn("FIELD_2", $"STRUCT1".getItem(1))
.....
But I have 80+ columns. Is there a better way to achieve this?
You can first make all columns struct-type by exploding any Array(struct) columns into struct columns via foldLeft, then use map to turn each struct column name into col.*, as shown below:
import org.apache.spark.sql.functions._
case class S1(FIELD_1: String, FIELD_2: Long, FIELD_3: Int)
case class S2(FIELD_4: String, FIELD_5: Long, FIELD_6: Int)
case class S3(FIELD_7: String, FIELD_8: Long, FIELD_9: Int)
case class S4(FIELD_10: Int, FIELD_11: Int)
val df = Seq(
  (S1("a1", 101, 11), S2("a2", 102, 12), S3("a3", 103, 13), Array(S4(1, 1), S4(3, 3))),
  (S1("b1", 201, 21), S2("b2", 202, 22), S3("b3", 203, 23), Array(S4(2, 2), S4(4, 4)))
).toDF("STRUCT1", "STRUCT2", "STRUCT3", "ARRAYSTRUCT4")
// +-----------+-----------+-----------+--------------+
// | STRUCT1| STRUCT2| STRUCT3| ARRAYSTRUCT4|
// +-----------+-----------+-----------+--------------+
// |[a1,101,11]|[a2,102,12]|[a3,103,13]|[[1,1], [3,3]]|
// |[b1,201,21]|[b2,202,22]|[b3,203,23]|[[2,2], [4,4]]|
// +-----------+-----------+-----------+--------------+
val arrayCols = df.dtypes.filter(t => t._2.startsWith("ArrayType(StructType")).map(_._1)
// arrayCols: Array[String] = Array(ARRAYSTRUCT4)

val expandedDF = arrayCols.foldLeft(df)((accDF, c) =>
  accDF.withColumn(c.replace("ARRAY", ""), explode(col(c))).drop(c)
)

val structCols = expandedDF.columns

expandedDF.select(structCols.map(c => col(s"$c.*")): _*).show
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// |FIELD_1|FIELD_2|FIELD_3|FIELD_4|FIELD_5|FIELD_6|FIELD_7|FIELD_8|FIELD_9|FIELD_10|FIELD_11|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// | a1| 101| 11| a2| 102| 12| a3| 103| 13| 1| 1|
// | a1| 101| 11| a2| 102| 12| a3| 103| 13| 3| 3|
// | b1| 201| 21| b2| 202| 22| b3| 203| 23| 2| 2|
// | b1| 201| 21| b2| 202| 22| b3| 203| 23| 4| 4|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
Note that for simplicity it's assumed that your DataFrame has only struct and Array(struct)-type columns. If there are other data types, just apply filtering conditions to arrayCols and structCols accordingly.
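
For reference, a minimal PySpark sketch of the same idea (explode every array-of-struct column, then flatten every struct column with .*); like the Scala version, it assumes all remaining columns are structs:

from pyspark.sql.functions import col, explode

# columns whose type is array<struct<...>>
array_cols = [name for name, dtype in df.dtypes if dtype.startswith("array<struct")]

expanded = df
for c in array_cols:
    expanded = expanded.withColumn(c.replace("ARRAY", ""), explode(col(c))).drop(c)

# flatten every (now struct-typed) column into its fields
expanded.select([col(c + ".*") for c in expanded.columns]).show()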

Spark GroupBy agg collect_list multiple columns

I have a question similar to this one, but the number of columns to be passed to collect_list is given by a list of names. For example:
scala> w.show
+---+-----+----+-----+
|iid|event|date|place|
+---+-----+----+-----+
| A| D1| T0| P1|
| A| D0| T1| P2|
| B| Y1| T0| P3|
| B| Y2| T2| P3|
| C| H1| T0| P5|
| C| H0| T9| P5|
| B| Y0| T1| P2|
| B| H1| T3| P6|
| D| H1| T2| P4|
+---+-----+----+-----+
scala> val combList = List("event", "date", "place")
combList: List[String] = List(event, date, place)
scala> val v = w.groupBy("iid").agg(collect_list(combList(0)), collect_list(combList(1)), collect_list(combList(2)))
v: org.apache.spark.sql.DataFrame = [iid: string, collect_list(event): array<string> ... 2 more fields]
scala> v.show
+---+-------------------+------------------+-------------------+
|iid|collect_list(event)|collect_list(date)|collect_list(place)|
+---+-------------------+------------------+-------------------+
| B| [Y1, Y2, Y0, H1]| [T0, T2, T1, T3]| [P3, P3, P2, P6]|
| D| [H1]| [T2]| [P4]|
| C| [H1, H0]| [T0, T9]| [P5, P5]|
| A| [D1, D0]| [T0, T1]| [P1, P2]|
+---+-------------------+------------------+-------------------+
Is there any way I can apply collect_list to multiple columns inside agg without knowing the number of elements in combList in advance?
You can use collect_list(struct(col1, col2)) AS elements.
Example:
df.select("cd_issuer", "cd_doc", "cd_item", "nm_item").printSchema
val outputDf = spark.sql(s"SELECT cd_issuer, cd_doc, collect_list(struct(cd_item, nm_item)) AS item FROM teste GROUP BY cd_issuer, cd_doc")
outputDf.printSchema
df
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- cd_item: string (nullable = true)
|-- nm_item: string (nullable = true)
outputDf
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cd_item: string (nullable = true)
| | |-- nm_item: string (nullable = true)
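
To tie this back to the question's combList, here is a minimal PySpark sketch that builds the aggregation from the list without knowing its length in advance (w and the column names are the ones from the question):

from pyspark.sql.functions import collect_list, struct

comb_list = ["event", "date", "place"]

# one collected array per column in the list
v = w.groupBy("iid").agg(*[collect_list(c).alias(c) for c in comb_list])

# or, as in the answer above, a single array of structs
v2 = w.groupBy("iid").agg(collect_list(struct(*comb_list)).alias("elements"))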

Spark 2.0 dataframe: collect multiple rows as an array by column

I have a dataframe like the one below; I want to combine multiple rows into one row when the key column values are the same.
val data = Seq(("a","b","sum",0),("a","b","avg",2)).toDF("id1","id2","type","value2").show
+---+---+----+------+
|id1|id2|type|value2|
+---+---+----+------+
| a| b| sum| 0|
| a| b| avg| 2|
+---+---+----+------+
I want to convert it to the following:
+---+---+----+------+
|id1|id2|agg |value2|
+---+---+----+------+
| a| b| 0,2| 0|
+---+---+----+------+
The printSchema output should look like this:
root
|-- id1: string (nullable = true)
|-- id2: string (nullable = true)
|-- agg: struct (nullable = true)
| |-- sum: int (nullable = true)
| |-- dc: int (nullable = true)
You can:
import org.apache.spark.sql.functions._

val data = Seq(
  ("a", "b", "sum", 0), ("a", "b", "avg", 2)
).toDF("id1", "id2", "type", "value2")

val result = data.groupBy($"id1", $"id2").agg(struct(
  first(when($"type" === "sum", $"value2"), true).alias("sum"),
  first(when($"type" === "avg", $"value2"), true).alias("avg")
).alias("agg"))
result.show
+---+---+-----+
|id1|id2| agg|
+---+---+-----+
| a| b|[0,2]|
+---+---+-----+
result.printSchema
root
|-- id1: string (nullable = true)
|-- id2: string (nullable = true)
|-- agg: struct (nullable = false)
| |-- sum: integer (nullable = true)
| |-- avg: integer (nullable = true)
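
The same pattern translates directly to PySpark; a minimal sketch using first(..., ignorenulls=True) with one when per expected value of type, mirroring the Scala above:

from pyspark.sql.functions import col, first, struct, when

result = data.groupBy("id1", "id2").agg(
    struct(
        first(when(col("type") == "sum", col("value2")), ignorenulls=True).alias("sum"),
        first(when(col("type") == "avg", col("value2")), ignorenulls=True).alias("avg"),
    ).alias("agg")
)
result.show()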