I'm new to PySpark and I need your help with creating DataFrame columns. I have a dataframe like this:
+-------------+-----------+----------+--------+
|FROM_CURRENCY|TO_CURRENCY|RATIO_FROM|RATIO_TO|
+-------------+-----------+----------+--------+
|          AED|        EUR|         0|       0|
|          AED|        EUR|         1|       1|
|          GNF|        EUR|         0|       0|
|          DZD|        EUR|         1|       1|
|          GNF|        EUR|      1000|    1000|
+-------------+-----------+----------+--------+
I would like to create two additional columns, RATIO_FROM_BIS and RATIO_TO_BIS, based on the values of FROM_CURRENCY and TO_CURRENCY. As you can see, the 0 values are replaced by the non-zero values from other rows with the same FROM_CURRENCY value.
+-------------+-----------+--------------+------------+
|FROM_CURRENCY|TO_CURRENCY|RATIO_FROM_BIS|RATIO_TO_BIS|
+-------------+-----------+--------------+------------+
|          AED|        EUR|             1|           1|
|          AED|        EUR|             1|           1|
|          GNF|        EUR|          1000|        1000|
|          DZD|        EUR|             1|           1|
|          GNF|        EUR|          1000|        1000|
+-------------+-----------+--------------+------------+
I have tried using .withColumn(field1, F.lit(command)), but it's not working.
Based on your comment, there can be multiple records for a certain currency (from_currency) and all of the non-zero records will have the same ratio values. I've added the last row to denote this scenario.
An approach with the max() window function:
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd
data_ls = [
    ('AED', 'EUR', 0, 0),
    ('AED', 'EUR', 1, 1),
    ('GNF', 'EUR', 0, 0),
    ('DZD', 'EUR', 1, 1),
    ('GNF', 'EUR', 1000, 1000),
    ('GNF', 'EUR', 1000, 1000)
]
data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['from_curr', 'to_curr', 'ratio_to', 'ratio_from'])
# +---------+-------+--------+----------+
# |from_curr|to_curr|ratio_to|ratio_from|
# +---------+-------+--------+----------+
# | AED| EUR| 0| 0|
# | AED| EUR| 1| 1|
# | GNF| EUR| 0| 0|
# | DZD| EUR| 1| 1|
# | GNF| EUR| 1000| 1000|
# | GNF| EUR| 1000| 1000|
# +---------+-------+--------+----------+
data_sdf. \
    withColumn('ratio_to_bis',
               func.when(func.col('ratio_to') > 0, func.col('ratio_to')).
               otherwise(func.max('ratio_to').over(wd.partitionBy('from_curr', 'to_curr')))
               ). \
    withColumn('ratio_from_bis',
               func.when(func.col('ratio_from') > 0, func.col('ratio_from')).
               otherwise(func.max('ratio_from').over(wd.partitionBy('from_curr', 'to_curr')))
               ). \
    show()
# +---------+-------+--------+----------+------------+--------------+
# |from_curr|to_curr|ratio_to|ratio_from|ratio_to_bis|ratio_from_bis|
# +---------+-------+--------+----------+------------+--------------+
# | DZD| EUR| 1| 1| 1| 1|
# | GNF| EUR| 0| 0| 1000| 1000|
# | GNF| EUR| 1000| 1000| 1000| 1000|
# | GNF| EUR| 1000| 1000| 1000| 1000|
# | AED| EUR| 0| 0| 1| 1|
# | AED| EUR| 1| 1| 1| 1|
# +---------+-------+--------+----------+------------+--------------+
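As a follow-up, if more ratio columns need the same treatment, the when/otherwise plus max window pattern above can be applied in a loop. A minimal sketch, assuming the data_sdf built above and that 0 always marks a missing ratio:
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# columns to backfill; adjust to the actual ratio columns
ratio_cols = ['ratio_to', 'ratio_from']
w = wd.partitionBy('from_curr', 'to_curr')

fixed_sdf = data_sdf
for c in ratio_cols:
    # keep non-zero values; otherwise take the max value within the currency pair
    fixed_sdf = fixed_sdf.withColumn(
        c + '_bis',
        func.when(func.col(c) > 0, func.col(c)).otherwise(func.max(c).over(w))
    )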
I have a PySpark dataframe df:
+----------+----+
|    status|Flag|
+----------+----+
|   present|   1|
|   present|   0|
|        na|   1|
|      Void|   0|
|   present|   1|
|notpresent|   0|
|   present|   0|
|   present|   0|
|        ok|   1|
+----------+----+
I want to update Flag to 1 wherever status is present or ok.
Expected:
+----------+----+
|    status|Flag|
+----------+----+
|   present|   1|
|   present|   1|
|        na|   1|
|      Void|   0|
|   present|   1|
|notpresent|   0|
|   present|   1|
|   present|   1|
|        ok|   1|
+----------+----+
You can do so using withColumn and a check using when: recreate the Flag column, setting it to 1 if status is ok or present, and otherwise keeping the existing value.
from pyspark.sql.functions import when, col, lit
data = [
    ('present', 0),
    ('ok', 0),
    ('present', 1),
    ('void', 0),
    ('na', 1),
    ('notpresent', 0)
]
df = spark.createDataFrame(data, ['status', 'Flag'])
df.show()
df.withColumn('Flag', when(col('status').isin(['ok', 'present']), lit(1)).otherwise(col('Flag'))).show()
Output
+----------+----+
| status|Flag|
+----------+----+
| present| 0|
| ok| 0|
| present| 1|
| void| 0|
| na| 1|
|notpresent| 0|
+----------+----+
+----------+----+
| status|Flag|
+----------+----+
| present| 1|
| ok| 1|
| present| 1|
| void| 0|
| na| 1|
|notpresent| 0|
+----------+----+
The simplest way:
df.withColumn('Flag', col('status').isin(['ok', 'present']).astype('int')).show()
+----------+----+
| status|Flag|
+----------+----+
| present| 1|
| ok| 1|
| present| 1|
| void| 0|
| na| 1|
|notpresent| 0|
+----------+----+
I have a dataset that has null values
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 0|null|
| 1|null| 0|
|null| 1| 0|
| 1| 0| 0|
| 1| 0| 0|
|null| 0| 1|
| 1| 1| 0|
| 1| 1|null|
|null| 1| 0|
+----+----+----+
I wrote a function to calculate the percentage of null values in each column of the dataset and to remove those columns from the dataset. Below is the function:
import pyspark.sql.functions as F
def calc_null_percent(df, strength=None):
    if strength is None:
        strength = 80
    total_count = df.count()
    null_cols = []
    df2 = df.select([F.count(F.when(F.col(c).contains('None') |
                                    F.col(c).contains('NULL') |
                                    (F.col(c) == '') |
                                    F.col(c).isNull() |
                                    F.isnan(c), c)).alias(c)
                     for c in df.columns])
    for i in df2.columns:
        get_null_val = df2.first()[i]
        if (get_null_val / total_count) * 100 > strength:
            null_cols.append(i)
    df = df.drop(*null_cols)
    return df
I am using a for loop to get the columns based on the condition. Can we use map, or is there any other way to optimise the for loop in PySpark?
Here's a way to do it with a list comprehension.
import pyspark.sql.functions as func

data_ls = [
    (1, 0, 'blah'),
    (0, None, 'None'),
    (None, 1, 'NULL'),
    (1, None, None)
]
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['id1', 'id2', 'id3'])
# +----+----+----+
# | id1| id2| id3|
# +----+----+----+
# | 1| 0|blah|
# | 0|null|None|
# |null| 1|NULL|
# | 1|null|null|
# +----+----+----+
Now, calculate the percentage of nulls in a dataframe and collect() it for further use.
# total row count
tot_count = data_sdf.count()
# percentage of null records per column
data_null_perc_sdf = data_sdf. \
select(*[(func.sum((func.col(k).isNull() | (func.upper(k).isin(['NONE', 'NULL']))).cast('int')) / tot_count).alias(k+'_nulls_perc') for k in data_sdf.columns])
# +--------------+--------------+--------------+
# |id1_nulls_perc|id2_nulls_perc|id3_nulls_perc|
# +--------------+--------------+--------------+
# | 0.25| 0.5| 0.75|
# +--------------+--------------+--------------+
# collection of the dataframe for list comprehension
data_null_perc = data_null_perc_sdf.collect()
# [Row(id1_nulls_perc=0.25, id2_nulls_perc=0.5, id3_nulls_perc=0.75)]
threshold = 0.5
# columns of `data_sdf` that have more null records than the aforementioned threshold (these will be dropped)
cols2drop = [k for k in data_sdf.columns if data_null_perc[0][k+'_nulls_perc'] >= threshold]
# ['id2', 'id3']
Use the cols2drop variable to drop the columns from data_sdf in the next step:
new_data_sdf = data_sdf.drop(*cols2drop)
# +----+
# | id1|
# +----+
# | 1|
# | 0|
# |null|
# | 1|
# +----+
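Putting the pieces together, the original function could be rewritten without the inner for loop. A rough sketch (the name drop_mostly_null_cols is just illustrative, and the null-like markers are the same 'NONE'/'NULL' strings as above):
import pyspark.sql.functions as func

def drop_mostly_null_cols(df, threshold=0.8):
    # fraction of null-like records per column, computed in a single pass
    tot_count = df.count()
    null_perc = df.select(*[
        (func.sum((func.col(c).isNull() | func.upper(func.col(c).cast('string')).isin(['NONE', 'NULL'])).cast('int')) / tot_count).alias(c)
        for c in df.columns
    ]).collect()[0]
    # drop every column whose null fraction reaches the threshold
    return df.drop(*[c for c in df.columns if null_perc[c] >= threshold])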
I have a column named "Sales" and another column with the salesman, and I want to know how many salesmen account for 80% of the sales for each type of sale (A, B, C).
For this example,
+---------+------+-----+
|salesman |sales |type |
+---------+------+-----+
|        5|     9|    a|
|        8|    12|    b|
|        6|     3|    b|
|        6|     1|    a|
|        1|     3|    a|
|        5|     1|    b|
|        2|    11|    b|
|        4|     3|    a|
|        1|     1|    b|
|        2|     3|    a|
|        3|     4|    a|
+---------+------+-----+
The result should be:
+-----+----------+
|type |Salesman80|
+-----+----------+
|    a|         4|
|    b|         2|
+-----+----------+
We'll find the total sales per type,
join this back to the table,
group by salesman and type to get total sales per salesman per type,
then use a math trick to get the percentage.
You can then chop this table to any percentage you wish with a where clause (a sketch of that step follows the output below).
from pyspark.sql.functions import floor, rand, sum, avg, col, expr

# create some data
data = spark.range(1, 1000)
sales = data.select(data.id,
                    floor(rand() * 13).alias("salesman"),
                    floor((rand() * 26) + 65).alias("type"),
                    floor(rand() * 26).alias("sale"))

totalSales = sales.groupby(sales.type) \
    .agg(
        sum(sales.sale).alias("total_sales")
    ) \
    .select(col("*"), expr("chr(type)").alias("type_"))  # fix int to chr

sales.join(totalSales, ["type"]) \
    .groupby("salesman", "type_") \
    .agg((sum("sale") / avg("total_sales")).alias("percentage"),
         # math trick: the total is the same within a type, so the average = total sales by type
         avg("total_sales").alias("total_sales_by_type")
         ).show()
+--------+-----+--------------------+-------------------+
|salesman|type_| percentage|total_sales_by_type|
+--------+-----+--------------------+-------------------+
| 10| H| 0.04710144927536232| 552.0|
| 9| U| 0.21063394683026584| 489.0|
| 0| I| 0.09266409266409266| 518.0|
| 11| K| 0.09683426443202979| 537.0|
| 0| F|0.027070063694267517| 628.0|
| 11| F|0.054140127388535034| 628.0|
| 1| G| 0.08086253369272237| 371.0|
| 5| N| 0.1693548387096774| 496.0|
| 9| L| 0.05353728489483748| 523.0|
| 7| R|0.003058103975535...| 327.0|
| 0| C| 0.05398457583547558| 389.0|
| 6| G| 0.1105121293800539| 371.0|
| 12| A|0.057007125890736345| 421.0|
| 0| J| 0.09876543209876543| 567.0|
| 11| B| 0.11337209302325581| 344.0|
| 8| K| 0.08007448789571694| 537.0|
| 4| N| 0.06854838709677419| 496.0|
| 11| H| 0.1358695652173913| 552.0|
| 10| W| 0.11617312072892938| 439.0|
| 1| C| 0.06940874035989718| 389.0|
+--------+-----+--------------------+-------------------+
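The where clause mentioned above is not shown in the snippet; a minimal sketch of that last step, assuming the sales and totalSales dataframes from above and an arbitrary 10% cutoff (note this counts salesmen whose individual share passes the cutoff, rather than the cumulative-80% logic of the next answer):
from pyspark.sql.functions import col, sum, avg, countDistinct

per_salesman = sales.join(totalSales, ["type"]) \
    .groupby("salesman", "type_") \
    .agg((sum("sale") / avg("total_sales")).alias("percentage"))

# keep salesmen above the cutoff and count them per type
per_salesman.where(col("percentage") > 0.10) \
    .groupby("type_") \
    .agg(countDistinct("salesman").alias("n_salesmen")) \
    .show()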
I am guessing you mean "how many salesmen contribute to 80% of the total sales per type" and that you want the lowest possible number of salesmen.
If that is what you meant, you can do it in these steps:
Calculate the total sales per group
Get cumulative sum of sales percentage (sales / total sales)
Assign the row number by ordering sales in descending order
Take minimum row number where the cumulative sum of sales percentage >= 80% per group
Note this is probably not an efficient approach, but it produces what you want.
from pyspark.sql import functions as F
from pyspark.sql import Window

part_window = Window.partitionBy('type')
order_window = part_window.orderBy(F.desc('sales'))
cumsum_window = order_window.rowsBetween(Window.unboundedPreceding, 0)

df = (df.withColumn('total_sales', F.sum('sales').over(part_window))  # Step 1
      .select('*',
              F.sum(F.col('sales') / F.col('total_sales')).over(cumsum_window).alias('cumsum_percent'),  # Step 2
              F.row_number().over(order_window).alias('rn'))  # Step 3
      .groupby('type')  # Step 4
      .agg(F.min(F.when(F.col('cumsum_percent') >= 0.8, F.col('rn'))).alias('Salesman80')))
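To check against the sample data in the question, df can be built first and the snippet above run on it; the result matches the expected output (row order may vary):
data = [
    (5, 9, 'a'), (8, 12, 'b'), (6, 3, 'b'), (6, 1, 'a'), (1, 3, 'a'), (5, 1, 'b'),
    (2, 11, 'b'), (4, 3, 'a'), (1, 1, 'b'), (2, 3, 'a'), (3, 4, 'a')
]
df = spark.createDataFrame(data, ['salesman', 'sales', 'type'])
# after running the code above:
# df.show()
# +----+----------+
# |type|Salesman80|
# +----+----------+
# |   a|         4|
# |   b|         2|
# +----+----------+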
Hi, I have 2 different DataFrames:
scala> d1.show()
+--------+-------+
|   fecha|eventos|
+--------+-------+
|20180404|      3|
|20180405|      7|
|20180406|     10|
|20180409|      4|
....

scala> d1.count()
res3: Long = 60

scala> d2.show()
+--------+----------+
|   fecha|TotalEvent|
+--------+----------+
|       0|     23534|
|20180322|        10|
|20180326|        50|
|20180402|         6|
|20180403|       118|
|20180404|      1110|
...

scala> d2.count()
res7: Long = 74
But I would like to join them by fecha without losing data, and then create a new column with the math operation (TotalEvent - eventos) * 100 / TotalEvent.
Something like this:
+---------+-------+----------+--------+
|fecha |eventos|TotalEvent| KPI |
+---------+-------+----------+--------+
| 0| | 23534 | 100.00|
| 20180322| | 10 | 100.00|
| 20180326| | 50 | 100.00|
| 20180402| | 6 | 100.00|
| 20180403| | 118 | 100.00|
| 20180404| 3 | 1110 | 99.73|
| 20180405| 7 | 1204 | 99.42|
| 20180406| 10 | 1526 | 99.34|
| 20180407| | 14 | 100.00|
| 20180409| 4 | 1230 | 99.67|
| 20180410| 11 | 1456 | 99.24|
| 20180411| 6 | 1572 | 99.62|
| 20180412| 5 | 1450 | 99.66|
| 20180413| 7 | 1214 | 99.42|
.....
The problem is that I can't find a way to do it.
When I use:
scala> d1.join(d2,d2("fecha").contains(d1("fecha")), "left").show()
I lose the data that isn't in both tables.
+--------+-------+--------+----------+
| fecha|eventos| fecha|TotalEvent|
+--------+-------+--------+----------+
|20180404| 3|20180404| 1110|
|20180405| 7|20180405| 1204|
|20180406| 10|20180406| 1526|
|20180409| 4|20180409| 1230|
|20180410| 11|20180410| 1456|
....
Additionally, how can I add the new column with the math operation?
Thank you
I would recommend left-joining df2 with df1 and calculating KPI based on whether eventos is null or not in the joined dataset (using when/otherwise):
import org.apache.spark.sql.functions._
val df1 = Seq(
  ("20180404", 3),
  ("20180405", 7),
  ("20180406", 10),
  ("20180409", 4)
).toDF("fecha", "eventos")

val df2 = Seq(
  ("0", 23534),
  ("20180322", 10),
  ("20180326", 50),
  ("20180402", 6),
  ("20180403", 118),
  ("20180404", 1110),
  ("20180405", 100),
  ("20180406", 100)
).toDF("fecha", "TotalEvent")

df2.
  join(df1, Seq("fecha"), "left_outer").
  withColumn("KPI",
    round(when($"eventos".isNull, 100.0).
      otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"),
      2
    )
  ).show
// +--------+----------+-------+-----+
// | fecha|TotalEvent|eventos| KPI|
// +--------+----------+-------+-----+
// | 0| 23534| null|100.0|
// |20180322| 10| null|100.0|
// |20180326| 50| null|100.0|
// |20180402| 6| null|100.0|
// |20180403| 118| null|100.0|
// |20180404| 1110| 3|99.73|
// |20180405| 100| 7| 93.0|
// |20180406| 100| 10| 90.0|
// +--------+----------+-------+-----+
Note that if the more precise raw KPI is wanted instead, just remove the wrapping round(..., 2).
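For reference, the same left-join plus when/otherwise logic in PySpark might look roughly like this sketch (assuming equivalent PySpark dataframes named df1 and df2 with the columns shown above):
from pyspark.sql import functions as F

kpi_df = (
    df2.join(df1, on='fecha', how='left_outer')
       .withColumn('KPI',
                   F.round(F.when(F.col('eventos').isNull(), F.lit(100.0))
                            .otherwise((F.col('TotalEvent') - F.col('eventos')) * 100.0 / F.col('TotalEvent')),
                           2))
)
kpi_df.show()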
I would do this in several steps: first join, then select the calculated column, then fill in the NA values:
val df2a = df2.withColumnRenamed("fecha", "fecha2") // to avoid ambiguous column names after the join
val df3 = df1.join(df2a, df1("fecha") === df2a("fecha2"), "outer")
val kpi = df3.withColumn("KPI", (($"TotalEvent" - $"eventos") / $"TotalEvent" * 100 as "KPI")).na.fill(100, Seq("KPI"))
kpi.show()
+--------+-------+--------+----------+-----------------+
| fecha|eventos| fecha2|TotalEvent| KPI|
+--------+-------+--------+----------+-----------------+
| null| null|20180402| 6| 100.0|
| null| null| 0| 23534| 100.0|
| null| null|20180322| 10| 100.0|
|20180404| 3|20180404| 1110|99.72972972972973|
|20180406| 10| null| null| 100.0|
| null| null|20180403| 118| 100.0|
| null| null|20180326| 50| 100.0|
|20180409| 4| null| null| 100.0|
|20180405| 7| null| null| 100.0|
+--------+-------+--------+----------+-----------------+
I solved the problem by mixing both of the suggestions I received:
val dfKPI = d1.join(d2, Seq("cliente", "fecha"), "outer")
  .orderBy("fecha")
  .withColumn("KPI", round(when($"eventos".isNull, 100.0).otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"), 2))
I'm trying to solve this kind of problem with Spark 2, but I can't find a solution.
I have a dataframe A:
+----+-------+------+
|id |COUNTRY| MONTH|
+----+-------+------+
| 1 | US | 1 |
| 2 | FR | 1 |
| 4 | DE | 1 |
| 5 | DE | 2 |
| 3 | DE | 3 |
+----+-------+------+
And a dataframe B:
+-------+------+------+
|COLUMN |VALUE | PRIO |
+-------+------+------+
|COUNTRY| US | 5 |
|COUNTRY| FR | 15 |
|MONTH | 3 | 2 |
+-------+------+------+
The idea is to apply the "rules" of dataframe B to dataframe A in order to get this result:
dataframe A':
+----+-------+------+------+
|id |COUNTRY| MONTH| PRIO |
+----+-------+------+------+
| 1 | US | 1 | 5 |
| 2 | FR | 1 | 15 |
| 4 | DE | 1 | 20 |
| 5 | DE | 2 | 20 |
| 3 | DE | 3 | 2 |
+----+-------+------+------+
I tried something like this:
dfB.collect.foreach( r =>
var dfAp = dfA.where(r.getAs("COLUMN") == r.getAs("VALUE"))
dfAp.withColumn("PRIO", lit(r.getAs("PRIO")))
)
But I'm sure it's not the right way.
What is the strategy to solve this problem in Spark?
Working under the assumption that the set of rules is reasonably small (possible concerns are the size of the data and the size of the generated expression, which in the worst-case scenario can crash the planner), the simplest solution is to use a local collection and map it to a SQL expression:
import org.apache.spark.sql.functions.{coalesce, col, lit, min, when}
val df = Seq(
  (1, "US", "1"), (2, "FR", "1"), (4, "DE", "1"),
  (5, "DE", "2"), (3, "DE", "3")
).toDF("id", "COUNTRY", "MONTH")

val rules = Seq(
  ("COUNTRY", "US", 5), ("COUNTRY", "FR", 15), ("MONTH", "3", 2)
).toDF("COLUMN", "VALUE", "PRIO")

val prio = coalesce(rules.as[(String, String, Int)].collect.map {
  case (c, v, p) => when(col(c) === v, p)
} :+ lit(20): _*)

df.withColumn("PRIO", prio).show
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
You can replace coalesce with least or greatest to apply the smallest or the largest matching value respectively.
With a larger set of rules you could:
Melt the data to convert it to a long format (using a melt helper, or unpivot in newer Spark versions):
val dfLong = df.melt(Seq("id"), df.columns.tail, "COLUMN", "VALUE")
Join by column and value.
Aggregate PRIO by id with an appropriate aggregation function (for example min):
val priorities = dfLong.join(rules, Seq("COLUMN", "VALUE"))
  .groupBy("id")
  .agg(min("PRIO").alias("PRIO"))
Outer join the output with df by id and fill the missing priorities with the default value:
df.join(priorities, Seq("id"), "leftouter").na.fill(20)
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
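For PySpark users, the same collect-and-map-to-an-expression idea might look roughly like this sketch (assuming dataframes df and rules with the same columns as above; coalesce, least and greatest are available in pyspark.sql.functions as well):
from pyspark.sql import functions as F

# build one WHEN clause per collected rule, with 20 as the default priority
rule_rows = rules.collect()
prio = F.coalesce(
    *[F.when(F.col(r['COLUMN']) == r['VALUE'], F.lit(r['PRIO'])) for r in rule_rows],
    F.lit(20)
)
df.withColumn('PRIO', prio).show()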
Let's assume the rules of dataframe B are limited.
I have created dataframe "df" for the table below:
+---+-------+-----+
| id|COUNTRY|MONTH|
+---+-------+-----+
|  1|     US|    1|
|  2|     FR|    1|
|  4|     DE|    1|
|  5|     DE|    2|
|  3|     DE|    3|
+---+-------+-----+
By using a UDF:
val code = udf { (x: String, y: Int) => if (x == "US") "5" else if (x == "FR") "15" else if (y == 3) "2" else "20" }
df.withColumn("PRIO", code($"COUNTRY", $"MONTH")).show()
Output:
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
|  1|     US|    1|   5|
|  2|     FR|    1|  15|
|  4|     DE|    1|  20|
|  5|     DE|    2|  20|
|  3|     DE|    3|   2|
+---+-------+-----+----+