PySpark: replacing column values with the when function gives "'Column' object is not callable"

I have a table like this
name
----
A
B
ccc
D
eee
and a list of valid names
legal_names = ["A", "B", "D"]
And I want to replace all illegal names with another string "INVALID".
I used this script:
(
df.withColumn(
"name",
F.when((F.col("name").isin(legal_names)), F.col("name")).otherwhise(
F.lit("INVALID")
),
)
)
But I get this error
TypeError: 'Column' object is not callable
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File <command-4397929369165676>:4, in <cell line: 2>()
1 (
2 df.withColumn(
3 "name",
----> 4 F.when((F.col("name").isin(legal_names)), F.col("name")).otherwhise(
5 F.lit("INVALID")
6 ),
7 )
8 )
TypeError: 'Column' object is not callable
Dummy data to reproduce:
vals = [("A", ), ("B", ), ("ccc", ), ("D", ), ("EEE", )]
cols = ["name"]
legal_names = ["A", "B", "D"]
df = spark.createDataFrame(vals, cols)

The method name is misspelled: otherwhise should be otherwise. Try the code below:
df1 = df.withColumn(
    "name",
    F.when(F.col("name").isin(*legal_names), F.col("name")).otherwise(F.lit("INVALID")),
)
Output :
+-------+
| name|
+-------+
| A|
| B|
|INVALID|
| D|
|INVALID|
+-------+
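As for why the typo shows up as "not callable" rather than as a missing-attribute error: as far as I can tell, a PySpark Column treats an unknown attribute name as a nested-field lookup and hands back another Column, which then gets called. Below is a minimal plain-Python sketch of that mechanism. ToyColumn is a hypothetical stand-in written for illustration, not the real pyspark.sql.Column.

```python
class ToyColumn:
    """Hypothetical stand-in for a column expression (not pyspark.sql.Column)."""

    def __init__(self, name):
        self.name = name

    def otherwise(self, value):
        # The correctly spelled method exists and is callable.
        return f"ELSE {value!r}"

    def __getattr__(self, item):
        # Python only calls __getattr__ for attributes that were not found.
        # Treat them as nested-field access and return another column object.
        if item.startswith("__"):
            raise AttributeError(item)
        return ToyColumn(f"{self.name}.{item}")


col = ToyColumn("name")
print(col.otherwise("INVALID"))

try:
    # Misspelled on purpose: this resolves to a ToyColumn, which is then called.
    col.otherwhise("INVALID")
except TypeError as exc:
    message = str(exc)
    print(message)
```

So when this error appears on a chained call, checking the spelling of the method name is a good first step.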

Related

Pyspark group by collect list, to_json and pivot

Summary: Combining multiple rows to columns for a user
Input DF:
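+---+--------+----+----+----+----+
| Id|   group|  A1|  A2|  B1|  B2|
+---+--------+----+----+----+----+
|  1|   Alpha|   1|   2|null|null|
|  1|AlphaNew|   6|   8|null|null|
|  2|   Alpha|   7|   4|null|null|
|  2|    Beta|null|null|   3|   9|
+---+--------+----+----+----+----+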
Note: The group values are dynamic
Expected Output DF:
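+---+--------+--------+-----------+-----------+-------+-------+
| Id|Alpha_A1|Alpha_A2|AlphaNew_A1|AlphaNew_A2|Beta_B1|Beta_B2|
+---+--------+--------+-----------+-----------+-------+-------+
|  1|       1|       2|          6|          8|   null|   null|
|  2|       7|       4|       null|       null|      3|      9|
+---+--------+--------+-----------+-----------+-------+-------+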
Attempted Solution:
I thought of making a JSON of the non-null columns for each row, then doing a group by with collect_list of maps. Then I can explode the JSON to get the expected output.
But I am stuck at the stage of a nested JSON. Here is my code:
vcols = df.columns[2:]
df\
.withColumn('json', F.to_json(F.struct(*vcols)))\
.groupby('id')\
.agg(
F.to_json(
F.collect_list(
F.create_map('group', 'json')
)
)
).alias('json')
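+---+-------------------------------------------------+
|Id |json                                             |
+---+-------------------------------------------------+
|1  |[{Alpha: {A1:1, A2:2}}, {AlphaNew: {A1:6, A2:8}}]|
|2  |[{Alpha: {A1:7, A2:4}}, {Beta: {B1:3, B2:9}}]    |
+---+-------------------------------------------------+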
What I am trying to get:
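+---+--------------------------------------------------------+
|Id |json                                                    |
+---+--------------------------------------------------------+
|1  |[{Alpha_A1:1, Alpha_A2:2, AlphaNew_A1:6, AlphaNew_A2:8}]|
|2  |[{Alpha_A1:7, Alpha_A2:4, Beta_B1:3, Beta_B2:9}]        |
+---+--------------------------------------------------------+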
I'd appreciate any help. I'm also trying to avoid UDFs as my true dataframe's shape is quite big
There's definitely a better way to do this, but I continued your to_json experiment.
Using UDFs:
After you get something like [{Alpha: {A1:1, A2:2}}, {AlphaNew: {A1:6, A2:8}}] you could create a UDF to flatten the dict. But since it's a JSON string you'll have to parse it to dict and then back again to JSON.
After that you would like to explode and pivot the table but that's not possible with JSON strings, so you have to use F.from_json with defined schema. That will give you MapType which you can explode and pivot.
Here's an example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
# MutableMapping lives in collections.abc (the old collections alias was removed in Python 3.10)
from collections.abc import MutableMapping
import json
from pyspark.sql.types import (
    ArrayType,
    IntegerType,
    MapType,
    StringType,
)


def flatten_dict(d, parent_key="", sep="_"):
    # Recursively flatten nested dicts, joining keys with sep
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


def flatten_groups(data):
    # Parse the JSON string, flatten each element, and serialize back to JSON
    result = []
    for item in json.loads(data):
        result.append(flatten_dict(item))
    return json.dumps(result)


if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
    data = [
        (1, "Alpha", 1, 2, None, None),
        (1, "AlphaNew", 6, 8, None, None),
        (2, "Alpha", 7, 4, None, None),
        (2, "Beta", None, None, 3, 9),
    ]
    columns = ["Id", "group", "A1", "A2", "B1", "B2"]
    df = spark.createDataFrame(data, columns)
    vcols = df.columns[2:]
    df = (
        df.withColumn("json", F.struct(*vcols))
        .groupby("id")
        .agg(F.to_json(F.collect_list(F.create_map("group", "json"))).alias("json"))
    )
    # Flatten groups
    flatten_groups_udf = F.udf(lambda x: flatten_groups(x))
    schema = ArrayType(MapType(StringType(), IntegerType()))
    df = df.withColumn("json", F.from_json(flatten_groups_udf(F.col("json")), schema))
    # Explode and pivot
    df = df.select(F.col("id"), F.explode(F.col("json")).alias("json"))
    df = (
        df.select("id", F.explode("json"))
        .groupby("id")
        .pivot("key")
        .agg(F.first("value"))
    )
At the end dataframe looks like:
+---+-----------+-----------+--------+--------+-------+-------+
|id |AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
|1 |6 |8 |1 |2 |null |null |
|2 |null |null |7 |4 |3 |9 |
+---+-----------+-----------+--------+--------+-------+-------+
Without UDFs:
vcols = df.columns[2:]
df = (
df.withColumn("json", F.to_json(F.struct(*vcols)))
.groupby("id")
.agg(
F.collect_list(
F.create_map(
"group", F.from_json("json", MapType(StringType(), IntegerType()))
)
).alias("json")
)
)
df = df.withColumn("json", F.explode(F.col("json")).alias("json"))
df = df.select("id", F.explode(F.col("json")).alias("root", "value"))
df = df.select("id", "root", F.explode(F.col("value")).alias("sub", "value"))
df = df.select(
"id", F.concat(F.col("root"), F.lit("_"), F.col("sub")).alias("name"), "value"
)
df = df.groupBy(F.col("id")).pivot("name").agg(F.first("value"))
Result:
+---+-----------+-----------+--------+--------+-------+-------+
|id |AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
|1 |6 |8 |1 |2 |null |null |
|2 |null |null |7 |4 |3 |9 |
+---+-----------+-----------+--------+--------+-------+-------+
I found a slightly better way than the JSON approach:
1. Stack the input dataframe's value columns (A1, A2, B1, B2, ...) as rows, so the structure becomes id, group, sub, value, where sub holds the column name (A1, A2, B1, B2) and value holds the associated value.
2. Filter out the rows whose value is null.
3. Pivot by the group combined with sub (group_sub). Since the null-value rows are removed, we won't have the initial issue of the pivot creating extra columns.
import pyspark.sql.functions as F
data = [
(1, "Alpha", 1, 2, None, None),
(1, "AlphaNew", 6, 8, None, None),
(2, "Alpha", 7, 4, None, None),
(2, "Beta", None, None, 3, 9),
]
columns = ["id", "group", "A1", "A2", "B1", "B2"]
df = spark.createDataFrame(data, columns)
# Value columns that need to be stacked
vcols = df.columns[2:]
expr_str = ', '.join([f"'{i}', {i}" for i in vcols])
expr_str = f"stack({len(vcols)}, {expr_str}) as (sub, value)"
df = df\
.selectExpr("id", "group", expr_str)\
.filter(F.col("value").isNotNull())\
.select("id", F.concat("group", F.lit("_"), "sub").alias("group"), "value")\
.groupBy("id")\
.pivot("group")\
.agg(F.first("value"))
df.show()
Result:
+---+-----------+-----------+--------+--------+-------+-------+
| id|AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
| 1| 6| 8| 1| 2| null| null|
| 2| null| null| 7| 4| 3| 9|
+---+-----------+-----------+--------+--------+-------+-------+
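To make the stack, filter, pivot steps easier to follow, here is a rough plain-Python sketch of the same logic on the sample rows. It does not use Spark at all; the names long_rows, pivoted and columns are made up for this illustration.

```python
# Same sample rows as above, as plain dicts
rows = [
    {"id": 1, "group": "Alpha", "A1": 1, "A2": 2, "B1": None, "B2": None},
    {"id": 1, "group": "AlphaNew", "A1": 6, "A2": 8, "B1": None, "B2": None},
    {"id": 2, "group": "Alpha", "A1": 7, "A2": 4, "B1": None, "B2": None},
    {"id": 2, "group": "Beta", "A1": None, "A2": None, "B1": 3, "B2": 9},
]
value_cols = ["A1", "A2", "B1", "B2"]

# Stack: one (id, group, sub, value) row per value column
long_rows = [(r["id"], r["group"], c, r[c]) for r in rows for c in value_cols]

# Filter: drop the rows whose value is null
long_rows = [t for t in long_rows if t[3] is not None]

# Pivot: one entry per id, keyed by group_sub; setdefault keeps the first value seen
pivoted = {}
for id_, group, sub, value in long_rows:
    pivoted.setdefault(id_, {}).setdefault(f"{group}_{sub}", value)

# Only combinations that actually had a value become columns
columns = sorted({name for cells in pivoted.values() for name in cells})
print(columns)
print(pivoted)
```

Because the null rows are dropped before pivoting, combinations such as Alpha_B1 never appear as columns.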

Validate data from the same column in different rows with pyspark

How can I change the value of a column depending on a validation across several rows? What I need is to compare the kilometraje values of each customer's (id) records, to check whether each record's kilometraje is greater than or equal to that of the previous record.
fecha id estado id_cliente error_code kilometraje error_km
1/1/2019 1 A 1 10
2/1/2019 2 A ERROR 20
3/1/2019 1 D 1 ERROR 30
4/1/2019 2 O ERROR
The error in the error_km column is there because, for customer (id) 2, the kilometraje value is lower than in the same customer's record for 2/1/2019. (As time passes the car is used, so the kilometraje increases; for there to be no error, the mileage has to be greater than or equal to the previous value.)
I know that with withColumn I can overwrite a column or create one that doesn't exist, and that with when I can set conditions. For example, this is the code I use to validate the estado and id_cliente columns and write ERROR to the error_code column where applicable, but I don't understand how to validate across different rows for the same customer.
from pyspark import StorageLevel
from pyspark.sql import functions as F

file_path = 'archive.txt'
error = 'ERROR'
df = spark.read.parquet(file_path)
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df = df.select('estado', 'id_cliente')
df = df.withColumn("error_code", F.lit(''))
# Column names match the select above (estado, id_cliente)
df = df.withColumn(
    'error_code',
    F.when(
        ((F.col('estado') == 'O') & (F.col('id_cliente') != ''))
        | ((F.col('estado') == 'D') & (F.col('id_cliente') != ''))
        | ((F.col('estado') == 'A') & (F.col('id_cliente') == '')),
        F.concat(F.col("error_code"), F.lit(":[{}]".format(error))),
    ).otherwise(F.col('error_code')),
)
You can achieve that with the lag window function. Within a window, lag returns the value from the row before the current row, so you can easily compare the kilometraje values. Have a look at the code below:
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [('1/1/2019' , 1 , 10),
('2/1/2019', 2 , 20 ),
('3/1/2019', 1 , 30 ),
('4/1/2019', 1 , 10 ),
('5/1/2019', 1 , 30 ),
('7/1/2019', 3 , 30 ),
('4/1/2019', 2 , 5)]
columns = ['fecha', 'id', 'kilometraje']
df=spark.createDataFrame(l, columns)
# 'd/M/yyyy' accepts one- or two-digit days and months
df = df.withColumn('fecha',F.to_date(df.fecha, 'd/M/yyyy'))
w = Window.partitionBy('id').orderBy('fecha')
df = df.withColumn('error_km', F.when(F.lag('kilometraje').over(w) > df.kilometraje, F.lit('ERROR') ).otherwise(F.lit('')))
df.show()
Output:
+----------+---+-----------+--------+
| fecha| id|kilometraje|error_km|
+----------+---+-----------+--------+
|2019-01-01| 1| 10| |
|2019-01-03| 1| 30| |
|2019-01-04| 1| 10| ERROR|
|2019-01-05| 1| 30| |
|2019-01-07| 3| 30| |
|2019-01-02| 2| 20| |
|2019-01-04| 2| 5| ERROR|
+----------+---+-----------+--------+
The fourth row doesn't get labeled with 'ERROR' because the previous row had a smaller kilometraje value (10 < 30). If you want to label every id that contains at least one corrupted row with 'ERROR', perform a left join:
df.drop('error_km').join(df.filter(df.error_km == 'ERROR').groupby('id').agg(F.first(df.error_km).alias('error_km')), 'id', 'left').show()
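If it helps to see what the lag comparison does without Spark, here is a rough plain-Python sketch on the same sample data. It only mimics the logic (partition by id, order by date, compare with the previous row); the helper parse and the list flagged are made-up names for this illustration.

```python
from datetime import datetime
from itertools import groupby

# Same sample data as above: (fecha, id, kilometraje)
rows = [
    ("1/1/2019", 1, 10),
    ("2/1/2019", 2, 20),
    ("3/1/2019", 1, 30),
    ("4/1/2019", 1, 10),
    ("5/1/2019", 1, 30),
    ("7/1/2019", 3, 30),
    ("4/1/2019", 2, 5),
]


def parse(fecha):
    # Day/month/year, matching the date format used above
    return datetime.strptime(fecha, "%d/%m/%Y").date()


# partitionBy('id').orderBy('fecha'): sort by id, then by date within each id
ordered = sorted(rows, key=lambda r: (r[1], parse(r[0])))

flagged = []
for _, group in groupby(ordered, key=lambda r: r[1]):
    previous_km = None  # lag has no value for the first row of a partition
    for fecha, id_, km in group:
        if previous_km is not None and previous_km > km:
            flagged.append((fecha, id_, km))
        previous_km = km

print(flagged)
```

Only the rows whose kilometraje dropped compared with the row directly before them are flagged, which is why the 5/1/2019 record of id 1 passes.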
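I use .rangeBetween(Window.unboundedPreceding, 0).
This window frame covers every row from the start of the partition up to the current row, so the current mileage can be compared with the maximum mileage seen so far rather than only with the previous row.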
import pyspark
from pyspark.sql.functions import lit
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql import Window
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.getOrCreate()
error = 'This is error'
l = [('1/1/2019' , 1 , 10),
('2/1/2019', 2 , 20 ),
('3/1/2019', 1 , 30 ),
('4/1/2019', 1 , 10 ),
('5/1/2019', 1 , 22 ),
('7/1/2019', 1 , 23 ),
('22/1/2019', 2 , 5),
('11/1/2019', 2 , 24),
('13/2/2019', 1 , 16),
('14/2/2019', 2 , 18),
('5/2/2019', 1 , 19),
('6/2/2019', 2 , 23),
('7/2/2019', 1 , 14),
('8/3/2019', 1 , 50),
('8/3/2019', 2 , 50)]
columns = ['date', 'vin', 'mileage']
df=spark.createDataFrame(l, columns)
# 'd/M/yyyy' accepts one- or two-digit days and months
df = df.withColumn('date',F.to_date(df.date, 'd/M/yyyy'))
df = df.withColumn("max", lit(0))
df = df.withColumn("error_code", lit(''))
w = Window.partitionBy('vin').orderBy('date').rangeBetween(Window.unboundedPreceding,0)
df = df.withColumn('max',F.max('mileage').over(w))
df = df.withColumn('error_code', F.when(F.col('mileage') < F.col('max'), F.lit('ERROR')).otherwise(F.lit('')))
df.show()
Finally, all that remains is to drop the helper column that holds the running maximum:
df = df.drop('max')
df.show()
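To show how the running maximum differs from comparing only with the previous row, here is a small plain-Python sketch using the vin 1 records from the data above. It is just an illustration of the idea, not Spark code; the names ordered, running_max and flagged are made up.

```python
from datetime import datetime

# (date, mileage) records for vin 1, taken from the sample data above
records = [
    ("1/1/2019", 10),
    ("3/1/2019", 30),
    ("4/1/2019", 10),
    ("5/1/2019", 22),
    ("7/1/2019", 23),
    ("13/2/2019", 16),
    ("5/2/2019", 19),
    ("7/2/2019", 14),
    ("8/3/2019", 50),
]

# orderBy('date') within the partition
ordered = sorted(records, key=lambda r: datetime.strptime(r[0], "%d/%m/%Y"))

running_max = float("-inf")
flagged = []
for date, mileage in ordered:
    # Maximum over all rows from the start of the partition up to this one
    running_max = max(running_max, mileage)
    if mileage < running_max:
        flagged.append(mileage)

print(flagged)
```

A previous-row comparison would let 22 and 23 through (each is higher than the row before it), while the running maximum still flags them because 30 was recorded earlier.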

How to create a data frame in a for loop with the variable that is iterating in loop

So I have a huge data frame which is combination of individual tables, it has an identifier column at the end which specifies the table number as shown below
+----------------------------+
| col1 col2 .... table_num |
+----------------------------+
| x y 1 |
| a b 1 |
| . . . |
| . . . |
| q p 2 |
+----------------------------+
(original table)
I have to split this into multiple little dataframes based on table num. The number of tables combined to create this is pretty large so it's not feasible to individually create the disjoint subset dataframes, so I was thinking if I made a for loop iterating over min to max values of table_num I could achieve this task but I can't seem to do it, any help is appreciated.
This is what I came up with
for (x < min(table_num) to max(table_num)) {
var df(x)= spark.sql("select * from df1 where state = x")
df(x).collect()
but I don't think the declaration is right.
so essentially what I need is df's that look like this
+-----------------------------+
| col1 col2 ... table_num |
+-----------------------------+
| x y 1 |
| a b 1 |
+-----------------------------+
+------------------------------+
| col1 col2 ... table_num |
+------------------------------+
| xx xy 2 |
| aa bb 2 |
+------------------------------+
+-------------------------------+
| col1 col2 ... table_num |
+-------------------------------+
| xxy yyy 3 |
| aaa bbb 3 |
+-------------------------------+
... and so on ...
(how I would like the Dataframes split)
In Spark, Arrays can hold almost any data type. When declared as vars, you can dynamically add and remove elements from them. Below I isolate the table nums into their own array so I can easily iterate through them. Once they are isolated, I go through a while loop to add each table as a unique element to the DF Holder Array. To query the elements of the array, use DFHolderArray(n-1), where n is the position you want to query, with 0 being the first element.
//This will go and turn the distinct row nums in a queriable (this is 100% a word) array
val tableIDArray = inputDF.selectExpr("table_num").distinct.rdd.map(x=>x.mkString.toInt).collect
//Build the iterator
var iterator = 1
//holders for DF and transformation step
var tempDF = spark.sql("select 'foo' as bar")
var interimDF = tempDF
//This will be an array for dataframes
var DFHolderArray : Array[org.apache.spark.sql.DataFrame] = Array(tempDF)
//Loop while you have not reached the end of the array
while(iterator<=tableIDArray.length) {
//Call the table that is stored in that location of the array
tempDF = spark.sql("select * from df1 where state = '" + tableIDArray(iterator-1) + "'")
//Fluff
interimDF = tempDF.withColumn("User_Name", lit("Stack_Overflow"))
//If logic to overwrite or append the DF
DFHolderArray = if (iterator==1) {
Array(interimDF)
} else {
DFHolderArray ++ Array(interimDF)
}
iterator = iterator + 1
}
//To query the data
DFHolderArray(0).show(10,false)
DFHolderArray(1).show(10,false)
DFHolderArray(2).show(10,false)
//....
The approach is to collect all unique keys and build the respective data frames. I added some functional flavor to it.
Sample dataset:
name,year,country,id
Bayern Munich,2014,Germany,7747
Bayern Munich,2014,Germany,7747
Bayern Munich,2014,Germany,7746
Borussia Dortmund,2014,Germany,7746
Borussia Mönchengladbach,2014,Germany,7746
Schalke 04,2014,Germany,7746
Schalke 04,2014,Germany,7753
Lazio,2014,Germany,7753
Code:
val df = spark.read.format(source = "csv")
.option("header", true)
.option("delimiter", ",")
.option("inferSchema", true)
.load("groupby.dat")
import spark.implicits._
import org.apache.spark.sql.functions.col
import scala.collection.mutable.ListBuffer
//collect data for each key into a data frame
val uniqueIds = df.select("id").distinct().map(x => x.mkString.toInt).collect()
// List buffer to hold separate data frames
var dataframeList: ListBuffer[org.apache.spark.sql.DataFrame] = ListBuffer()
println(uniqueIds.toList)
// filter data
uniqueIds.foreach(x => {
val tempDF = df.filter(col("id") === x)
dataframeList += tempDF
})
//show individual data frames
for (tempDF1 <- dataframeList) {
tempDF1.show()
}
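For anyone who wants the idea without the Spark machinery, the "collect the unique keys, then filter once per key" approach boils down to grouping rows by the key. A minimal plain-Python sketch on the sample dataset above, where by_id is a made-up name:

```python
from collections import defaultdict

# Sample dataset from above: (name, year, country, id)
rows = [
    ("Bayern Munich", 2014, "Germany", 7747),
    ("Bayern Munich", 2014, "Germany", 7747),
    ("Bayern Munich", 2014, "Germany", 7746),
    ("Borussia Dortmund", 2014, "Germany", 7746),
    ("Borussia Mönchengladbach", 2014, "Germany", 7746),
    ("Schalke 04", 2014, "Germany", 7746),
    ("Schalke 04", 2014, "Germany", 7753),
    ("Lazio", 2014, "Germany", 7753),
]

# One list of rows per distinct id, playing the role of the separate data frames
by_id = defaultdict(list)
for row in rows:
    by_id[row[3]].append(row)

for key in sorted(by_id):
    print(key, len(by_id[key]))
```

Keeping the pieces in a dictionary keyed by id makes it easy to look up one subset by its key instead of by position.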
One approach would be to write the DataFrame as partitioned Parquet files and read them back into a Map, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("a", "b", 1), ("c", "d", 1), ("e", "f", 1),
("g", "h", 2), ("i", "j", 2)
).toDF("c1", "c2", "table_num")
val filePath = "/path/to/parquet/files"
df.write.partitionBy("table_num").parquet(filePath)
val tableNumList = df.select("table_num").distinct.map(_.getAs[Int](0)).collect
// tableNumList: Array[Int] = Array(1, 2)
val dfMap = ( for { n <- tableNumList } yield
(n, spark.read.parquet(s"$filePath/table_num=$n").withColumn("table_num", lit(n)))
).toMap
To access the individual DataFrames from the Map:
dfMap(1).show
// +---+---+---------+
// | c1| c2|table_num|
// +---+---+---------+
// | a| b| 1|
// | c| d| 1|
// | e| f| 1|
// +---+---+---------+
dfMap(2).show
// +---+---+---------+
// | c1| c2|table_num|
// +---+---+---------+
// | g| h| 2|
// | i| j| 2|
// +---+---+---------+

Doing left outer join on multiple data frames in spark scala

I am a newbie in Spark. I am trying to achieve the use case below using Scala.
-DataFrame 1
| col A | col B |
-----------------
| 1 | a |
| 2 | a |
| 3 | a |
-DataFrame 2
| col A | col B |
-----------------
| 1 | b |
| 3 | b |
-DataFrame 3
| col A | col B |
-----------------
| 2 | c |
| 3 | c |
Final Output frame should be
| col A | col B |
-----------------
| 1 | a,b |
| 2 | a,c |
| 3 | a,b,c |
The number of frames is not limited to 3; it can be any number less than 100. So I am using a for each loop in which I print each of the data frames.
Can someone please help me create the final data frame, with the output in the above format, from N data frames?
I appreciate your help.
I came across this question today. I suggest using Python to solve it, since it's easier to write than Scala. Here is the code:
from pyspark.sql import SQLContext
from pyspark.sql.functions import concat_ws
d1=sc.parallelize([(1, "a"), (2, "a"), (3,"a")]).toDF().toDF("Col_A","Col_B")
d2=sc.parallelize([(1, "b"), (2, "b")]).toDF().toDF("Col_A", "Col_B")
d3=sc.parallelize([(2, "c"), (3, "c")]).toDF().toDF("Col_A", "Col_B")
d4=d1.join(d2,'Col_A','left').join(d3,'Col_A','left').select(d1.Col_A.alias("col A"),concat_ws(',',d1.Col_B,d2.Col_B,d3.Col_B).alias("col B"))
d4.show()
+-----+-----+
|col A|col B|
+-----+-----+
|    1|  a,b|
|    2|a,b,c|
|    3|  a,c|
+-----+-----+
You see the result!
You can use foldLeft to iteratively merge data with outer join
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
val df1 = Seq((1, "a"), (2, "a"), (3, "a")).toDF("Col A", "Col B")
val df2 = Seq((1, "b"), (2, "b")).toDF("Col A", "Col B")
val df3 = Seq((2, "c"), (3, "c")).toDF("Col A", "Col B")
val dfs = Seq(df2, df3)
val bs = (0 to dfs.size).map(i => s"Col B $i")
dfs.foldLeft(df1)(
(acc, df) => acc.join(df, Seq("Col A"), "fullouter")
).toDF("Col A" +: bs: _*).select($"Col A", array(bs map col: _*)).map {
case Row(a: Int, bs: Seq[_]) =>
// Drop nulls and concat
(a, bs.filter(_ != null).map(_.toString).mkString(","))
}.toDF("Col A", "Col B").show
// +-----+-----+
// |Col A|Col B|
// +-----+-----+
// | 1| a,b|
// | 3| a,c|
// | 2|a,b,c|
// +-----+-----+
However, if you really expect that "it can be any number less than 100", this is unrealistic: join is among the most expensive operations in Spark, and even with all the optimizer improvements, chaining that many joins is unlikely to work well.
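The fold idea itself is independent of Spark. As a rough illustration, here is the same "start from the first frame and merge the rest one by one" pattern in plain Python with functools.reduce, using the three frames from the question as dictionaries. The helper left_merge is a made-up name, and it keeps only the keys of the first frame (a left join), as the question asks.

```python
from functools import reduce

# The three frames from the question, as {col A: col B} dictionaries
frames = [
    {1: "a", 2: "a", 3: "a"},
    {1: "b", 3: "b"},
    {2: "c", 3: "c"},
]


def left_merge(acc, frame):
    # Keep every key of the accumulator; append the new value only when present
    return {
        key: values + ([frame[key]] if key in frame else [])
        for key, values in acc.items()
    }


# Start from the first frame, then fold the remaining frames into it
base = {key: [value] for key, value in frames[0].items()}
merged = reduce(left_merge, frames[1:], base)
result = {key: ",".join(values) for key, values in merged.items()}
print(result)
```

Each step of the fold corresponds to one join, which is why the cost grows with the number of frames.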

Transpose Dataframe in Scala

I have dataframe like below.
+---+------+------+
| ID|Field1|Field2|
+---+------+------+
| 1| x| n|
| 2| a| b|
+---+------+------+
And I need the output like below
+---+-------------+------+
| ID| Fields|values|
+---+-------------+------+
| 1|Field1,Field2| x,n|
| 2|Field1,Field2| a,b|
+---+-------------+------+
I am pretty new to Scala. I just need an approach to do this. I have already searched the internet for ways to transpose, but couldn't find a solution.
Since Fields column is going to be the same in every row, you can add it later.
In this example class Thing has 3 fields: id, Field1, Field2.
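// Assumed definition of Thing; the field types are inferred from the usage below
case class Thing( id: Int, Field1: String, Field2: String )
val sqlContext = new org.apache.spark.sql.SQLContext( sc )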
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df =
sc
.parallelize( List( Thing( 1, "a", "b" ), Thing( 2, "x", "y" ) ) )
.toDF( "id", "Field1", "Field2" )
Column names are returned in the same order so we can just take last two for field names
val fieldNames =
df
.columns
.takeRight( 2 )
The functions in org.apache.spark.sql.functions do all the work of combining data from the given columns.
val res =
df
.select( $"id", array( $"Field1", $"Field2" ) as "values" )
.withColumn( "Fields", lit( fieldNames ) )
res.show()
Result:
+---+------+----------------+
| id|values| Fields|
+---+------+----------------+
| 1|[a, b]|[Field1, Field2]|
| 2|[x, y]|[Field1, Field2]|
+---+------+----------------+