PySpark join returning no data in output

While performing a simple join on two DataFrames, PySpark returns no output data:
from pyspark.sql import *
import pyspark.sql.functions as F
from pyspark.sql.functions import col
spark = SparkSession.builder.master("local").appName("test").getOrCreate()
file_path="C:\\bigdata\\pipesep_data\\Sales_ny.csv"
df=spark.read.format("csv").option('header','True').option('inferSchema', 'True').option("delimiter", '|').load(file_path)
addData=[(1,"1523 Main St","SFO","CA"),
(2,"3453 Orange St","SFO","NY"),
(3,"34 Warner St","Jersey","NJ"),
(4,"221 Cavalier St","Newark","DE"),
(5,"789 Walnut St","Sandiago","CA")
]
addColumns = ["emp_id","addline1","city","State"]
addDF = spark.createDataFrame(addData,addColumns)
addDF.show()
df.join(addDF,df["State"] == addDF["State"]).show()
[Screenshots in the original post show the Sales_ny schema and a sample of Sales_ny.csv.]
Output: no rows are returned; only the joined column headers appear.
I also tried left, right, and full outer joins, with the same result.

For me it is working fine:
>>> df=spark.read.format("csv").option('header','True').option('inferSchema', 'True').option("delimiter", '|').load("/Path to/sample1.csv")
>>> df.show()
+--------+--------------------+--------+------+-------------------+-----------------+-------------+-----+-----+----+
| OrderID| Product|Quantity| Price| OrderDate| StoreAddres| City|State|Month|Hour|
+--------+--------------------+--------+------+-------------------+-----------------+-------------+-----+-----+----+
|295665.0| Macbook Pro Laptop| 1.0|1700.0|2019-12-30 00:01:00|136 Church St, Ne|New York City| 123| 12.0| 0.0|
|295666.0| LG Washing Machine| 1.0| 600.0|2019-12-29 07:03:00| 562 2nd St, Ne|New York City| NY| 12.0| 7.0|
|295667.0|USB-C Charging Cable| 1.0| 11.95|2019-12-12 18:21:00| 277 Main St, New|New York City| NY| 12.0|18.0|
+--------+--------------------+--------+------+-------------------+-----------------+-------------+-----+-----+----+
>>> addDF.show()
+------+---------------+--------+-----+
|emp_id| addline1| city|State|
+------+---------------+--------+-----+
| 1| 1523 Main St| SFO| CA|
| 2| 3453 Orange St| SFO| NY|
| 3| 34 Warner St| Jersey| NJ|
| 4|221 Cavalier St| Newark| DE|
| 5| 789 Walnut St|Sandiago| CA|
+------+---------------+--------+-----+
>>> df.join(addDF,df["State"] == addDF["State"]).show()
+--------+--------------------+--------+-----+-------------------+----------------+-------------+-----+-----+----+------+--------------+----+-----+
| OrderID| Product|Quantity|Price| OrderDate| StoreAddres| City|State|Month|Hour|emp_id| addline1|city|State|
+--------+--------------------+--------+-----+-------------------+----------------+-------------+-----+-----+----+------+--------------+----+-----+
|295667.0|USB-C Charging Cable| 1.0|11.95|2019-12-12 18:21:00|277 Main St, New|New York City| NY| 12.0|18.0| 2|3453 Orange St| SFO| NY|
|295666.0| LG Washing Machine| 1.0|600.0|2019-12-29 07:03:00| 562 2nd St, Ne|New York City| NY| 12.0| 7.0| 2|3453 Orange St| SFO| NY|
+--------+--------------------+--------+-----+-------------------+----------------+-------------+-----+-----+----+------+--------------+----+-----+
I think your df.State values contain extra spaces.
You can use the code below to trim the spaces and then perform the join:
>>> from pyspark.sql.functions import *
>>> df = df.withColumn('State', trim(df.State))
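For completeness, a minimal sketch (assuming the same df and addDF as above) that trims the join key before joining; it has not been run against the original Sales_ny.csv:
from pyspark.sql.functions import trim, col

# Trim stray whitespace from the join key before joining (sketch only).
df_clean = df.withColumn("State", trim(col("State")))

df_clean.join(addDF, df_clean["State"] == addDF["State"], "inner").show()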

Related

Closest Date looking from One Column to another in PySpark Dataframe

I have a PySpark DataFrame where the buying price of each commodity is listed, but there is no data on when the commodity was bought; I only have a one-year date range.
+---------+------------+----------------+----------------+
|Commodity| BuyingPrice|Date_Upper_limit|Date_lower_limit|
+---------+------------+----------------+----------------+
| Apple| 5| 2020-07-04| 2019-07-03|
| Banana| 3| 2020-07-03| 2019-07-02|
| Banana| 4| 2019-10-02| 2018-10-01|
| Apple| 6| 2020-01-20| 2019-01-19|
| Banana| 3.5| 2019-08-17| 2018-08-16|
+---------+------------+----------------+----------------+
I have another pyspark dataframe where I can see the market price and date of all commodities.
+----------+----------+------------+
| Date| Commodity|Market Price|
+----------+----------+------------+
|2020-07-01| Apple| 3|
|2020-07-01| Banana| 3|
|2020-07-02| Apple| 4|
|2020-07-02| Banana| 2.5|
|2020-07-03| Apple| 7|
|2020-07-03| Banana| 4|
+----------+----------+------------+
I want to find the date closest to the upper limit on which the Market Price (MP) of that commodity was less than or equal to the Buying Price (BP).
Expected output (for the top 2 rows):
+---------+------------+----------------+----------------+--------------------------------+
|Commodity| BuyingPrice|Date_Upper_limit|Date_lower_limit|Closest Date to UL when MP <= BP|
+---------+------------+----------------+----------------+--------------------------------+
| Apple| 5| 2020-07-04| 2019-07-03| 2020-07-02|
| Banana| 3| 2020-07-03| 2019-07-02| 2020-07-02|
+---------+------------+----------------+----------------+--------------------------------+
Even though Apple was much cheaper on 2020-07-01 ($3), 2020-07-02 was the first date, going backwards from the Upper Limit (UL), on which MP <= BP, so I selected 2020-07-02.
How can I look backwards like this to fill in the probable buying date?
Try this with a conditional join and a window function:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("Commodity")

# first DataFrame shown is df1, the second is df2
df1.join(df2.withColumnRenamed("Commodity", "Commodity1"),
         F.expr("""`Market Price`<=BuyingPrice and Date<Date_Upper_limit and Commodity==Commodity1"""))\
   .drop("Market Price", "Commodity1")\
   .withColumn("max", F.max("Date").over(w))\
   .filter('max==Date').drop("max")\
   .withColumnRenamed("Date", "Closest Date to UL when MP <= BP")\
   .show()
#+---------+-----------+----------------+----------------+--------------------------------+
#|Commodity|BuyingPrice|Date_Upper_limit|Date_lower_limit|Closest Date to UL when MP <= BP|
#+---------+-----------+----------------+----------------+--------------------------------+
#| Banana| 3.0| 2020-07-03| 2019-07-02| 2020-07-02|
#| Apple| 5.0| 2020-07-04| 2019-07-03| 2020-07-02|
#+---------+-----------+----------------+----------------+--------------------------------+
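If you want to reproduce this locally, the two input DataFrames from the question can be built roughly as follows. This is only a sketch, with a local SparkSession and simplified types (prices as doubles, dates parsed with to_date):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local").appName("closest-date").getOrCreate()

df1 = spark.createDataFrame(
    [("Apple", 5.0, "2020-07-04", "2019-07-03"),
     ("Banana", 3.0, "2020-07-03", "2019-07-02"),
     ("Banana", 4.0, "2019-10-02", "2018-10-01"),
     ("Apple", 6.0, "2020-01-20", "2019-01-19"),
     ("Banana", 3.5, "2019-08-17", "2018-08-16")],
    ["Commodity", "BuyingPrice", "Date_Upper_limit", "Date_lower_limit"],
).withColumn("Date_Upper_limit", F.to_date("Date_Upper_limit")) \
 .withColumn("Date_lower_limit", F.to_date("Date_lower_limit"))

df2 = spark.createDataFrame(
    [("2020-07-01", "Apple", 3.0), ("2020-07-01", "Banana", 3.0),
     ("2020-07-02", "Apple", 4.0), ("2020-07-02", "Banana", 2.5),
     ("2020-07-03", "Apple", 7.0), ("2020-07-03", "Banana", 4.0)],
    ["Date", "Commodity", "Market Price"],
).withColumn("Date", F.to_date("Date"))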

How to compute cumulative sum on multiple float columns?

I have 100 float columns in a DataFrame that is ordered by date.
ID Date C1 C2 ....... C100
1 02/06/2019 32.09 45.06 99
1 02/04/2019 32.09 45.06 99
2 02/03/2019 32.09 45.06 99
2 05/07/2019 32.09 45.06 99
I need to compute the cumulative sum of C1 through C100, partitioned by ID and ordered by date.
Target dataframe should look like this:
ID Date C1 C2 ....... C100
1 02/04/2019 32.09 45.06 99
1 02/06/2019 64.18 90.12 198
2 02/03/2019 32.09 45.06 99
2 05/07/2019 64.18 90.12 198
I want to achieve this without looping over C1 to C100.
Initial code for one column:
var DF1 = DF.withColumn("CumSum_c1", sum("C1").over(
  Window.partitionBy("ID")
    .orderBy(col("date").asc)))
I found a similar question here, but the author did it manually for two columns: Cumulative sum in Spark
It's a classic use case for foldLeft. Let's generate some data first:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.range(1000)
  .withColumn("c1", 'id + 3)
  .withColumn("c2", 'id % 2 + 1)
  .withColumn("date", monotonically_increasing_id)
  .withColumn("id", 'id % 10 + 1)

// We will select the columns we want to compute the cumulative sum of.
val columns = df.drop("id", "date").columns
val w = Window.partitionBy(col("id")).orderBy(col("date").asc)
val results = columns.foldLeft(df)((tmp_, column) => tmp_.withColumn(s"cum_sum_$column", sum(column).over(w)))
results.orderBy("id", "date").show
// +---+---+---+-----------+----------+----------+
// | id| c1| c2| date|cum_sum_c1|cum_sum_c2|
// +---+---+---+-----------+----------+----------+
// | 1| 3| 1| 0| 3| 1|
// | 1| 13| 1| 10| 16| 2|
// | 1| 23| 1| 20| 39| 3|
// | 1| 33| 1| 30| 72| 4|
// | 1| 43| 1| 40| 115| 5|
// | 1| 53| 1| 8589934592| 168| 6|
// | 1| 63| 1| 8589934602| 231| 7|
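For reference, the same foldLeft pattern can be written in PySpark with functools.reduce. This is a sketch, assuming a DataFrame df with columns ID, Date, C1 ... C100 as in the question:
from functools import reduce
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("ID").orderBy("Date")

# Columns to accumulate: everything except the keys.
columns_to_sum = [c for c in df.columns if c not in ("ID", "Date")]

# reduce plays the role of Scala's foldLeft: each step adds one cumulative column.
result = reduce(
    lambda acc, c: acc.withColumn(f"cum_sum_{c}", F.sum(c).over(w)),
    columns_to_sum,
    df,
)
result.orderBy("ID", "Date").show()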
Here is another way, using a simple select expression:
val w = Window.partitionBy($"id").orderBy($"date".asc).rowsBetween(Window.unboundedPreceding, Window.currentRow)
// get columns you want to sum
val columnsToSum = df.drop("ID", "Date").columns
// map over those columns and create new sum columns
val selectExpr = Seq(col("ID"), col("Date")) ++ columnsToSum.map(c => sum(col(c)).over(w).alias(c)).toSeq
df.select(selectExpr:_*).show()
Gives:
+---+----------+-----+-----+----+
| ID| Date| C1| C2|C100|
+---+----------+-----+-----+----+
| 1|02/04/2019|32.09|45.06| 99|
| 1|02/06/2019|64.18|90.12| 198|
| 2|02/03/2019|32.09|45.06| 99|
| 2|05/07/2019|64.18|90.12| 198|
+---+----------+-----+-----+----+
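And a PySpark version of the select-expression idea, again a sketch under the same assumptions about df:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = (Window.partitionBy("ID").orderBy("Date")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

columns_to_sum = [c for c in df.columns if c not in ("ID", "Date")]

# Build one select expression per column instead of chaining withColumn calls.
select_expr = [F.col("ID"), F.col("Date")] + \
              [F.sum(c).over(w).alias(c) for c in columns_to_sum]

df.select(*select_expr).show()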

How to find quantiles inside agg() function after groupBy in Scala Spark

I have a DataFrame that I want to group by column A and then compute different statistics such as mean, min, max, standard deviation, and quantiles.
I am able to find min, max and mean using the following code:
df.groupBy("A").agg(min("B"), max("B"), mean("B")).show(50, false)
But I am unable to find the quantiles (0.25, 0.5, 0.75). I tried approxQuantile and percentile, but I get the following error:
error: not found: value approxQuantile
If you have Hive on the classpath, you can use many UDAFs such as percentile_approx and stddev_samp; see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
You can call these functions using callUDF:
import ss.implicits._  // ss is your SparkSession
import org.apache.spark.sql.functions.{callUDF, lit}

val df = Seq(1.0, 2.0, 3.0).toDF("x")
df.groupBy()
  .agg(
    callUDF("percentile_approx", $"x", lit(0.5)).as("median"),
    callUDF("stddev_samp", $"x").as("stdev")
  )
  .show()
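A PySpark counterpart, as a sketch assuming an active SparkSession named spark: on recent Spark versions percentile_approx is available as a built-in SQL function, so it can be reached through F.expr without Hive.
from pyspark.sql import functions as F

df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])

df.groupBy().agg(
    F.expr("percentile_approx(x, 0.5)").alias("median"),
    F.stddev_samp("x").alias("stdev"),
).show()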
Here is some code that I have tested on Spark 3.1:
import spark.implicits._
import org.apache.spark.sql.functions.{percentile_approx, lit}

val simpleData = Seq(
  ("James", "Sales", "NY", 90000, 34, 10000),
  ("Michael", "Sales", "NY", 86000, 56, 20000),
  ("Robert", "Sales", "CA", 81000, 30, 23000),
  ("Maria", "Finance", "CA", 90000, 24, 23000),
  ("Raman", "Finance", "CA", 99000, 40, 24000),
  ("Scott", "Finance", "NY", 83000, 36, 19000),
  ("Jen", "Finance", "NY", 79000, 53, 15000),
  ("Jeff", "Marketing", "CA", 80000, 25, 18000),
  ("Kumar", "Marketing", "NY", 91000, 50, 21000)
)
val df = simpleData.toDF("employee_name", "department", "state", "salary", "age", "bonus")
df.show()

df.groupBy($"department")
  .agg(
    percentile_approx($"salary", lit(0.5), lit(10000))
  )
  .show(false)
Output
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
| James| Sales| NY| 90000| 34|10000|
| Michael| Sales| NY| 86000| 56|20000|
| Robert| Sales| CA| 81000| 30|23000|
| Maria| Finance| CA| 90000| 24|23000|
| Raman| Finance| CA| 99000| 40|24000|
| Scott| Finance| NY| 83000| 36|19000|
| Jen| Finance| NY| 79000| 53|15000|
| Jeff| Marketing| CA| 80000| 25|18000|
| Kumar| Marketing| NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+
+----------+-------------------------------------+
|department|percentile_approx(salary, 0.5, 10000)|
+----------+-------------------------------------+
|Sales |86000 |
|Finance |83000 |
|Marketing |80000 |
+----------+-------------------------------------+
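The PySpark equivalent on Spark 3.1+, as a sketch with a few of the same rows and an assumed SparkSession named spark:
from pyspark.sql import functions as F

# A few of the rows from the Scala example above.
data = [("James", "Sales", "NY", 90000, 34, 10000),
        ("Michael", "Sales", "NY", 86000, 56, 20000),
        ("Robert", "Sales", "CA", 81000, 30, 23000),
        ("Maria", "Finance", "CA", 90000, 24, 23000),
        ("Raman", "Finance", "CA", 99000, 40, 24000)]
cols = ["employee_name", "department", "state", "salary", "age", "bonus"]
df = spark.createDataFrame(data, cols)

# percentile_approx(col, percentage, accuracy) is a DataFrame function in PySpark 3.1+.
df.groupBy("department").agg(
    F.percentile_approx("salary", 0.5, 10000).alias("median_salary")
).show(truncate=False)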

PySpark groupBy Pivot Transformation

I'm having a hard time framing the following PySpark DataFrame manipulation.
Essentially I am trying to group by category and then pivot/unmelt the subcategories and add new columns.
I've tried a number of ways, but they are very slow and do not leverage Spark's parallelism.
Here is my existing (slow, verbose) code:
from pyspark.sql.functions import lit
df = sqlContext.table('Table')
#loop over category
listids = [x.asDict().values()[0] for x in df.select("category").distinct().collect()]
dfArray = [df.where(df.category == x) for x in listids]
for d in dfArray:
    #loop over subcategory
    listids_sub = [x.asDict().values()[0] for x in d.select("sub_category").distinct().collect()]
    dfArraySub = [d.where(d.sub_category == x) for x in listids_sub]
    num = 1
    for b in dfArraySub:
        #renames all columns to append a number
        for c in b.columns:
            if c not in ['category','sub_category','date']:
                column_name = str(c)+'_'+str(num)
                b = b.withColumnRenamed(str(c), str(c)+'_'+str(num))
        b = b.drop('sub_category')
        num += 1
        #if no df exists, create one and continually join new columns
        try:
            all_subs = all_subs.drop('sub_category').join(b.drop('sub_category'), on=['cateogry','date'], how='left')
        except:
            all_subs = b
    #Fixes missing columns on union
    try:
        try:
            diff_columns = list(set(all_cats.columns) - set(all_subs.columns))
            for d in diff_columns:
                all_subs = all_subs.withColumn(d, lit(None))
            all_cats = all_cats.union(all_subs)
        except:
            diff_columns = list(set(all_subs.columns) - set(all_cats.columns))
            for d in diff_columns:
                all_cats = all_cats.withColumn(d, lit(None))
            all_cats = all_cats.union(all_subs)
    except Exception as e:
        print e
        all_cats = all_subs
But this is very slow. Any guidance would be greatly appreciated!
Your expected output is not entirely logical, but we can achieve this result using the pivot function. You need to be more precise about your rules; otherwise I can see many cases where it may fail.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df.show()
+----------+---------+------------+------------+------------+
| date| category|sub_category|metric_sales|metric_trans|
+----------+---------+------------+------------+------------+
|2018-01-01|furniture| bed| 100| 75|
|2018-01-01|furniture| chair| 110| 85|
|2018-01-01|furniture| shelf| 35| 30|
|2018-02-01|furniture| bed| 55| 50|
|2018-02-01|furniture| chair| 45| 40|
|2018-02-01|furniture| shelf| 10| 15|
|2018-01-01| rug| circle| 2| 5|
|2018-01-01| rug| square| 3| 6|
|2018-02-01| rug| circle| 3| 3|
|2018-02-01| rug| square| 4| 5|
+----------+---------+------------+------------+------------+
df.withColumn("fg", F.row_number().over(Window().partitionBy('date', 'category').orderBy("sub_category"))).groupBy('date', 'category', ).pivot('fg').sum('metric_sales', 'metric_trans').show()
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| date| category|1_sum(CAST(`metric_sales` AS BIGINT))|1_sum(CAST(`metric_trans` AS BIGINT))|2_sum(CAST(`metric_sales` AS BIGINT))|2_sum(CAST(`metric_trans` AS BIGINT))|3_sum(CAST(`metric_sales` AS BIGINT))|3_sum(CAST(`metric_trans` AS BIGINT))|
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
|2018-02-01| rug| 3| 3| 4| 5| null| null|
|2018-02-01|furniture| 55| 50| 45| 40| 10| 15|
|2018-01-01|furniture| 100| 75| 110| 85| 35| 30|
|2018-01-01| rug| 2| 5| 3| 6| null| null|
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
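If the auto-generated column names such as 1_sum(CAST(`metric_sales` AS BIGINT)) are inconvenient, aliasing the aggregates inside agg gives cleaner names like 1_sales and 1_trans. A sketch on the same df:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("date", "category").orderBy("sub_category")

(df.withColumn("fg", F.row_number().over(w))
   .groupBy("date", "category")
   .pivot("fg")
   .agg(F.sum("metric_sales").alias("sales"),
        F.sum("metric_trans").alias("trans"))
   .show())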

Spark Scala: Count Consecutive Months

I have the following DataFrame example:
Provider Patient Date
Smith John 2016-01-23
Smith John 2016-02-20
Smith John 2016-03-21
Smith John 2016-06-25
Smith Jill 2016-02-01
Smith Jill 2016-03-10
James Jill 2017-04-10
James Jill 2017-05-11
I want to programmatically add a column that indicates how many consecutive months a patient sees a doctor. The new DataFrame would look like this:
Provider Patient Date consecutive_id
Smith John 2016-01-23 3
Smith John 2016-02-20 3
Smith John 2016-03-21 3
Smith John 2016-06-25 1
Smith Jill 2016-02-01 2
Smith Jill 2016-03-10 2
James Jill 2017-04-10 2
James Jill 2017-05-11 2
I'm assuming that there is a way to achieve this with a Window function, but I haven't been able to figure it out yet and I'm looking forward to the insight the community can provide. Thanks.
There are at least three ways to get the result:
1. Implement the logic in SQL.
2. Use the Spark API for window functions: .over(windowSpec).
3. Use .rdd.mapPartitions directly.
See also: Introducing Window Functions in Spark SQL.
For all solutions you can call .toDebugString to see the operations under the hood.
The SQL solution is below:
val my_df = List(
("Smith", "John", "2016-01-23"),
("Smith", "John", "2016-02-20"),
("Smith", "John", "2016-03-21"),
("Smith", "John", "2016-06-25"),
("Smith", "Jill", "2016-02-01"),
("Smith", "Jill", "2016-03-10"),
("James", "Jill", "2017-04-10"),
("James", "Jill", "2017-05-11")
).toDF(Seq("Provider", "Patient", "Date"): _*)
my_df.createOrReplaceTempView("tbl")
val q = """
select t2.*, count(*) over (partition by provider, patient, grp) consecutive_id
from (select t1.*, sum(x) over (partition by provider, patient order by yyyymm) grp
from (select t0.*,
case
when cast(yyyymm as int) -
cast(lag(yyyymm) over (partition by provider, patient order by yyyymm) as int) = 1
then 0
else 1
end x
from (select tbl.*, substr(translate(date, '-', ''), 1, 6) yyyymm from tbl) t0) t1) t2
"""
sql(q).show
sql(q).rdd.toDebugString
Output
scala> sql(q).show
+--------+-------+----------+------+---+---+--------------+
|Provider|Patient| Date|yyyymm| x|grp|consecutive_id|
+--------+-------+----------+------+---+---+--------------+
| Smith| Jill|2016-02-01|201602| 1| 1| 2|
| Smith| Jill|2016-03-10|201603| 0| 1| 2|
| James| Jill|2017-04-10|201704| 1| 1| 2|
| James| Jill|2017-05-11|201705| 0| 1| 2|
| Smith| John|2016-01-23|201601| 1| 1| 3|
| Smith| John|2016-02-20|201602| 0| 1| 3|
| Smith| John|2016-03-21|201603| 0| 1| 3|
| Smith| John|2016-06-25|201606| 1| 2| 1|
+--------+-------+----------+------+---+---+--------------+
Update
Mix of .mapPartitions + .over(windowSpec)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, IntegerType, StructField, StructType}
val schema = new StructType()
  .add(StructField("provider", StringType, true))
  .add(StructField("patient", StringType, true))
  .add(StructField("date", StringType, true))
  .add(StructField("x", IntegerType, true))
  .add(StructField("grp", IntegerType, true))

def f(iter: Iterator[Row]): Iterator[Row] = {
  iter.scanLeft(Row("_", "_", "000000", 0, 0)) {
    case (x1, x2) =>
      val x =
        if (x2.getString(2).replaceAll("-", "").substring(0, 6).toInt ==
            x1.getString(2).replaceAll("-", "").substring(0, 6).toInt + 1) 0
        else 1
      val grp = x1.getInt(4) + x
      Row(x2.getString(0), x2.getString(1), x2.getString(2), x, grp)
  }.drop(1)
}
val df_mod = spark.createDataFrame(my_df.repartition($"provider", $"patient")
.sortWithinPartitions($"date")
.rdd.mapPartitions(f, true), schema)
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy($"provider", $"patient", $"grp")
df_mod.withColumn("consecutive_id", count(lit("1")).over(windowSpec)
).orderBy($"provider", $"patient", $"date").show
Output
scala> df_mod.withColumn("consecutive_id", count(lit("1")).over(windowSpec)
| ).orderBy($"provider", $"patient", $"date").show
+--------+-------+----------+---+---+--------------+
|provider|patient| date| x|grp|consecutive_id|
+--------+-------+----------+---+---+--------------+
| James| Jill|2017-04-10| 1| 1| 2|
| James| Jill|2017-05-11| 0| 1| 2|
| Smith| Jill|2016-02-01| 1| 1| 2|
| Smith| Jill|2016-03-10| 0| 1| 2|
| Smith| John|2016-01-23| 1| 1| 3|
| Smith| John|2016-02-20| 0| 1| 3|
| Smith| John|2016-03-21| 0| 1| 3|
| Smith| John|2016-06-25| 1| 2| 1|
+--------+-------+----------+---+---+--------------+
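For PySpark users, here is a sketch of the same gaps-and-islands logic with the DataFrame API, assuming a DataFrame df with Provider, Patient, and Date columns as in the question:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Month index so that consecutive calendar months differ by exactly 1.
with_ym = df.withColumn(
    "ym", F.year(F.to_date("Date")) * 12 + F.month(F.to_date("Date")))

w_order = Window.partitionBy("Provider", "Patient").orderBy("ym")

# x = 1 whenever a new run of consecutive months starts, 0 otherwise;
# the running sum of x then labels each run (the "island").
flagged = (with_ym
    .withColumn("x", F.when(F.col("ym") - F.lag("ym").over(w_order) == 1, 0).otherwise(1))
    .withColumn("grp", F.sum("x").over(w_order)))

w_grp = Window.partitionBy("Provider", "Patient", "grp")
(flagged
    .withColumn("consecutive_id", F.count(F.lit(1)).over(w_grp))
    .drop("ym", "x", "grp")
    .orderBy("Provider", "Patient", "Date")
    .show())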
You could:
1. Reformat your dates to integers (2016-01 = 1, 2016-02 = 2, 2017-01 = 13, etc.).
2. Combine all the dates into an array with a window and collect_list:
val winSpec = Window.partitionBy("Provider","Patient").orderBy("Date")
df.withColumn("Dates", collect_list("Date").over(winSpec))
3. Pass the array into a modified version of #marios solution as a UDF, registered with spark.udf.register, to get the maximum number of consecutive months (a rough PySpark sketch of these steps follows below).
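A rough PySpark sketch of those steps, assuming the same df; the counting UDF here is only an illustration and returns the maximum run per provider/patient:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Step 1: map each date to a month index; only the differences matter,
# so year * 12 + month is enough.
with_idx = df.withColumn(
    "month_idx", F.year(F.to_date("Date")) * 12 + F.month(F.to_date("Date")))

# Step 2: collect every month index per provider/patient into one ordered array.
w = (Window.partitionBy("Provider", "Patient").orderBy("Date")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
with_idx = with_idx.withColumn("months", F.collect_list("month_idx").over(w))

# Step 3 (illustrative only): a small UDF returning the longest run of
# consecutive months in the collected list.
@F.udf("int")
def max_consecutive(months):
    best = run = 1
    for prev, cur in zip(months, months[1:]):
        run = run + 1 if cur == prev + 1 else 1
        best = max(best, run)
    return best

with_idx.withColumn("max_consecutive_months", max_consecutive("months")).show()
Note that the per-row counts in the question's expected output (where the isolated June visit gets 1) come from the gaps-and-islands approach shown earlier; this sketch only returns the longest run per patient.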