Trying to join two Datasets in Scala Spark, namely input and metric.
They look like this:
input:
+---------+----------+-----------------------------+------------------------+------+-------------------------+--------------------+----------+-----------------------------+------------------------+-----------------+----------+---------+----------+--------------------+---------+--------------------+-------------------+-------------------+--------------------+-------+-------------+------------+-----+------------+-------+-----------+-------------------+--------------------+----------------+--------------------+----------+-------------------+-------+---------+--------------------+---------------------+-------------+------------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------+------------------+---------+----------+-------------+------------------+---------------+----------+----------+-----------------+----------------------------+------------+----------------+---------------+----------------------+---------------+-----------+-----------+---------------+--------------+--------+-----------+----------+
| seller|global_agg|global_agg_visible_impression|global_agg_keyword_click|sl_agg|sl_agg_visible_impression|sl_agg_keyword_click|sl_stg_agg|sl_stg_agg_visible_impression|sl_stg_agg_keyword_click|seller_seller_tag|adblock_id|ad_tag_id|browser_id| canonical_hash|device_id| domain_name| domain_pattern| global_pattern| google_url_category|hour_id|is_valid_suid|mobile_model|os_id|rc_num_calls|referer|screen_size| seasonal_pattern| seller_tag|seller_top_level| slot_id|state_code| url_pattern2| ctr|slideshow|is_commercial_url_nb|is_commercial_url_lda|keyword_click|visible_impression| rpm_url_part0| rpm_url_part1| rpm_url_part2| rpm_url_part3|sub_bidder_id|da_device_name| da_mobile_model|da_is_app|da_os_name|cm_os_version|cm_browser_version|da_browser_name|stats_date|learner_id|mobile_model_misc|rc_num_call_akamai_corrected|referer_misc|screen_size_misc|adblock_id_misc|seller_adblock_id_misc|seller_tag_misc|seller_misc|global_vl2r|global_vl2r_kwc|global_vl2r_vi| sl_vl2r|sl_vl2r_kwc|sl_vl2r_vi|
+---------+----------+-----------------------------+------------------------+------+-------------------------+--------------------+----------+-----------------------------+------------------------+-----------------+----------+---------+----------+--------------------+---------+--------------------+-------------------+-------------------+--------------------+-------+-------------+------------+-----+------------+-------+-----------+-------------------+--------------------+----------------+--------------------+----------+-------------------+-------+---------+--------------------+---------------------+-------------+------------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------+------------------+---------+----------+-------------+------------------+---------------+----------+----------+-----------------+----------------------------+------------+----------------+---------------+----------------------+---------------+-----------+-----------+---------------+--------------+--------+-----------+----------+
|8CUS3H6GJ| 3199| 3199| 81| 139| 139| 5| 139| 139| 5| 8CUS3H6GJ__misc|1293334731|146589733| 5|3f1163154992c27f5...| 2| countryliving.com| .%2anews.%2a| .%2anews.%2a|People & Society ...| 23| 0| iPad| 6| 13| 1| 834x1112| .%2anews.%2a| gpt_lb_11| 8PRVCXX19|2dbb87b0ca5118f9c...| ma| .%2anews.%2a|4.9e-04| 1| NA| NA| 0| 1|countryliving.com...|countryliving.com...|countryliving.com...|countryliving.com...| 176| TABLET| IPAD| false| IOS| 14_8| 98.0.4758.97| CHROME MOBILE|2022030813| -1| iPad| 100| 1| misc| misc| 8CUS3H6GJ__misc| misc| 8CUS3H6GJ| 0.02532| 81| 3199|0.031515| 5| 139|
|8CU8ND892| 3199| 3199| 81| 2| 2| 0| 2| 2| 0| 8CU8ND892__misc|1303809593|816510261| 6|0fadd20698c8eb489...| 4| menupix.com| .%2amen.%2a| .%2amen.%2a|Food & Drink > Be...| 23| 0| Other| 6| 7| 1| NA| .%2amen.%2a| 300-AFR1-Menu| 8PRVCXX19|5e1617c92cc745abe...| mi| .%2amen.%2a|4.0e-04| 0| sensitive| NA| 0| 1|menupix.com_menud...|menupix.com_menu.php| menupix.com_NA| menupix.com_NA| 186| DESKTOP| SAFARI - OS X| false| OS X| 10_15_6| 15.3| SAFARI|2022030813| -1| Other| 7| 1| NA| misc| 8CU8ND892__misc| misc| 8CU8ND892| 0.02532| 81| 3199|0.024824| 0| 2|
|8CUS3H6GJ| 3199| 3199| 81| 139| 139| 5| 139| 139| 5| 8CUS3H6GJ__misc| 133963366|647881048| 16|02117a01ef3c83505...| 3| iseecars.com| .%2asale.%2a| .%2asale.%2a|Autos & Vehicles ...| 23| 0| iPhone| 6| 9| 1| 428x926| .%2asale.%2a|jam_srp-m-breaker-18| 8PRVCXX19| 647881048| fl| .%2asale.%2a|5.1e-04| 0| sensitive| NA| 0| 2|iseecars.com_cars...| iseecars.com_NA| iseecars.com_NA| iseecars.com_NA| 186| MOBILE PHONE|GENERIC_SMARTPHONE| true| IOS| 15_3_1| NA| SAFARI|2022030813| -1| iPhone| 9| 1| 428x926| misc| 8CUS3H6GJ__misc| misc| 8CUS3H6GJ| 0.02532| 81| 3199|0.031515| 5| 139|
|8CU5217S8| 3199| 3199| 81| 224| 224| 11| 224| 224| 11| 8CU5217S8__misc|1470166420|617251957| 6|97ad08b531a77a31a...| 3| usmagazine.com| .%2amagazine.%2a| .%2amagazine.%2a| Shopping| 23| 0| iPhone| 6| 1| 1| 375x667| .%2amagazine.%2a| NA| 8PRVCXX19|1632372f3fe8495e0...| az| .%2amagazine.%2a|3.5e-04| 0| NA| NA| 0| 1|usmagazine.com_ho...|usmagazine.com_gl...| usmagazine.com_NA| usmagazine.com_NA| 128| MOBILE PHONE| IPHONE| false| IOS| 15_1| 15.1| SAFARI|2022030813| -1| iPhone| 1| 1| 375x667| misc| 8CU5217S8__misc| misc| 8CU5217S8| 0.02532| 81| 3199|0.041765| 11| 224|
|8CU4V40B1| 3199| 3199| 81| 456| 456| 7| 456| 456| 7| 8CU4V40B1__misc|1614514466|454554330| 6|dbeed4c1cd0dc94bd...| 3|makingthymeforhea...| .%2amushroom.%2a| .%2amushroom.%2a| Food & Drink| 23| 0| iPhone| 6| 11| 1| 375x812| .%2amushroom.%2a|AdThrive_Content_...| 8PRVCXX19|77eca1e2d050c2dc0...| ga| .%2amushroom.%2a|0.0e+00| 0| NA| NA| 0| 1|makingthymeforhea...|makingthymeforhea...|makingthymeforhea...|makingthymeforhea...| 133| MOBILE PHONE| IPHONE| false| IOS| 14_8_1| 14.1.2| SAFARI|2022030813| -1| iPhone| 100| 1| 375x812| misc| 8CU4V40B1__misc| misc| 8CU4V40B1| 0.02532| 81| 3199|0.017144| 7| 456|
|8CU4V40B1| 3199| 3199| 81| 456| 456| 7| 456| 456| 7| 8CU4V40B1__misc|1645741457|867854626| 6|9a19532f0607f1a67...| 3| fitfoodiefinds.com| .%2achicken.%2a| .%2achicken.%2a|Food & Drink > Co...| 23| 0| iPhone| 6| 19| 1| 414x896| .%2achicken.%2a|AdThrive_Content_...| 8PRVCXX19|6c2626c75d0d61cea...| fl| .%2achicken.%2a|1.1e-04| 0| NA| NA| 0| 1|fitfoodiefinds.co...|fitfoodiefinds.co...|fitfoodiefinds.co...|fitfoodiefinds.co...| 196| MOBILE PHONE| IPHONE| false| IOS| 14_7_1| 14.1.2| SAFARI|2022030813| -1| iPhone| 100| 1| 414x896| misc| 8CU4V40B1__misc| misc| 8CU4V40B1| 0.02532| 81| 3199|0.017144| 7| 456|
|8CU65T3AT| 3199| 3199| 81| 9| 9| 0| 9| 9| 0| 8CU65T3AT__misc|1708947332|642241257| 6|e850b035360db101e...| 3| livestly.com| .%2ahousehold.%2a| .%2ahousehold.%2a|Home & Garden > H...| 23| 0| iPhone| 6| 1| 1| 414x736| .%2ahousehold.%2a|div-gpt-ad-149261...| 8PRVCXX19|3d94a0c408b09fc82...| la| .%2ahousehold.%2a|6.1e-04| 1| NA| NA| 0| 2|livestly.com_hous...| livestly.com_NA| livestly.com_NA| livestly.com_NA| 186| MOBILE PHONE| IPHONE| false| IOS| 14_7_1| 14.1.2| SAFARI|2022030813| -1| iPhone| 1| 1| misc| misc| 8CU65T3AT__misc| misc| 8CU65T3AT| 0.02532| 81| 3199|0.023229| 0| 9|
|8CUS3H6GJ| 3199| 3199| 81| 139| 139| 5| 139| 139| 5| 8CUS3H6GJ__misc|1763942358|647881048| 22|f12d9cd1c53bb4d29...| 3| caranddriver.com|.%2abest.%2asuv.%2a|.%2abest.%2asuv.%2a|Autos & Vehicles ...| 23| 0| iPhone| 6| 1| 1| 375x812|.%2abest.%2asuv.%2a| gpt_gal_a| 8PRVCXX19| 647881048| ma|.%2abest.%2asuv.%2a|9.4e-05| 0| sensitive| NA| 0| 1|caranddriver.com_...|caranddriver.com_...|caranddriver.com_...| caranddriver.com_NA| 186| MOBILE PHONE| IPHONE| true| IOS| 15_2_1| NA| SAFARI|2022030813| -1| iPhone| 1| 1| 375x812| misc| 8CUS3H6GJ__misc| misc| 8CUS3H6GJ| 0.02532| 81| 3199|0.031515| 5| 139|
|8CUXP6AUQ| 3199| 3199| 81| 19| 19| 0| 19| 19| 0| 8CUXP6AUQ__misc|1876213389|492267288| 6|2b2b8ce4d414d87c7...| 3| wunderground.com| .%2aforecast.%2a| .%2aforecast.%2a| News > Weather| 23| 0| Other| 6| 2| 1| NA| .%2aforecast.%2a| WX_Top300Variable| 8PRVCXX19|7d02cb34f1c34d1bf...| co| .%2aforecast.%2a|4.0e-03| 0| NA| NA| 0| 1|wunderground.com_...| wunderground.com_us| wunderground.com_co|wunderground.com_...| 128| DESKTOP| SAFARI - OS X| false| OS X| 10_15_6| 15.3| SAFARI|2022030813| -1| Other| 2| 1| NA| misc| 8CUXP6AUQ__misc| misc| 8CUXP6AUQ| 0.02532| 81| 3199|0.021277| 0| 19|
|8CUK5QD75| 3199| 3199| 81| 224| 224| 12| 224| 224| 12| 8CUK5QD75__misc|2026381580|373321055| 16|fbf8a2526836d2df7...| 3|thebestblogrecipe...| .%2arecipe.%2a| .%2arecipe.%2a|Food & Drink > Co...| 23| 1| iPhone| 6| 1| 1| 414x896| .%2arecipe.%2a| content_mobile| 8PRVCXX19|4ff74ea6b06d13aea...| mn| .%2arecipe.%2a|2.1e-03| 0| NA| NA| 1| 1|thebestblogrecipe...|thebestblogrecipe...|thebestblogrecipe...|thebestblogrecipe...| 196| MOBILE PHONE| IPHONE 11 PRO MAX| true| IOS| 15_2_1| NA| SAFARI|2022030813| -1| iPhone| 1| 1| 414x896| misc| 8CUK5QD75__misc| misc| 8CUK5QD75| 0.02532| 81| 3199|0.044852| 12| 224|
+---------+----------+-----------------------------+------------------------+------+-------------------------+--------------------+----------+-----------------------------+------------------------+-----------------+----------+---------+----------+--------------------+---------+--------------------+-------------------+-------------------+--------------------+-------+-------------+------------+-----+------------+-------+-----------+-------------------+--------------------+----------------+--------------------+----------+-------------------+-------+---------+--------------------+---------------------+-------------+------------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------+------------------+---------+----------+-------------+------------------+---------------+----------+----------+-----------------+----------------------------+------------+----------------+---------------+----------------------+---------------+-----------+-----------+---------------+--------------+--------+-----------+----------+
only showing top 10 rows
metric:
+---------+-----------------+------------------+-------------+--------+
| seller|seller_seller_tag|visible_impression|keyword_click| vl2r|
+---------+-----------------+------------------+-------------+--------+
|8CU5217S8| 8CU5217S8__misc| 224| 11|0.046841|
|8CUK5QD75| 8CUK5QD75__misc| 224| 12| 0.05088|
|8CUQJV5RI| 8CUQJV5RI__misc| 10| 1|0.038281|
|8CU29N1R8| 8CU29N1R8__misc| 6| 0|0.022535|
|8CUS47X5W| 8CUS47X5W__misc| 5| 1| 0.04156|
|8CU81SHO3| 8CU81SHO3__misc| 2| 0|0.024337|
|8CUWMI118| 8CUWMI118__misc| 1| 0|0.024821|
|8CU1NA8RS| 8CU1NA8RS__misc| 1| 0|0.024821|
|8CU6I65Y2| 8CU6I65Y2__misc| 10| 0|0.020925|
|8CUWMQE3H| 8CUWMQE3H__misc| 66| 1|0.018842|
|8CUJKX6Y3| 8CUJKX6Y3__misc| 4| 0| 0.02341|
|8CUYT9A1U| 8CUYT9A1U__misc| 1| 0|0.024821|
|8CU27488H| 8CU27488H__misc| 1| 0|0.024821|
|8CU7O5VP2| 8CU7O5VP2__misc| 4| 0| 0.02341|
|8CUS3H6GJ| 8CUS3H6GJ__misc| 139| 5|0.034107|
|8CUJN5H60| 8CUJN5H60__misc| 72| 0|0.008559|
|8CUHN3BGE| 8CUHN3BGE__misc| 63| 4|0.049125|
|8CUQ5LJ63| 8CUQ5LJ63__misc| 13| 0|0.019829|
|8CUM545EY| 8CUM545EY__misc| 23| 0|0.016736|
|8CU94FM32| 8CU94FM32__misc| 32| 0|0.014532|
+---------+-----------------+------------------+-------------+--------+
What I am trying to execute:
input.as("input").join(broadcast(metric.as("metric")), Seq("seller", "seller_seller_tag"), "left_outer")
This join operation breaks with the error:
Resolved attribute(s) seller_seller_tag#17561 missing from domain_pattern#10807,canonical_hash#10804,seller#10818,visible_impression#10829L,browser_id#10803,da_os_name#10838,domain_name#10806,slideshow#10825,os_id#10813,seasonal_pattern#10817,rpm_url_part1#10831,adblock_id#10801,da_browser_name#10841,da_mobile_model#10836,rpm_url_part3#10833,da_is_app#10837,seller_tag_misc#3625,referer#10815,seller_tag#10819,keyword_click#10828L,learner_id#10843,rpm_url_part2#10832,adblock_id_misc#2850,cm_os_version#10839,seller_misc#4023,global_pattern#10808,rc_num_calls#10814,screen_size#10816,google_url_category#10809,is_commercial_url_lda#10827,is_valid_suid#10811,url_pattern2#10823,stats_date#10842,sub_bidder_id#10834,da_device_name#10835,ad_tag_id#10802,hour_id#10810,seller_top_level#10820,mobile_model#10812,slot_id#10821,state_code#10822,device_id#10805,seller_adblock_id_misc#2899,screen_size_misc#2351,cm_browser_version#10840,rc_num_call_akamai_corrected#1385,is_commercial_url_nb#10826,ctr#10824,seller_seller_tag#3676,mobile_model_misc#1340,rpm_url_part0#10830,referer_misc#1866 in operator !Project [adblock_id#10801, ad_tag_id#10802, browser_id#10803, canonical_hash#10804, device_id#10805, domain_name#10806, domain_pattern#10807, global_pattern#10808, google_url_category#10809, hour_id#10810, is_valid_suid#10811, mobile_model#10812, os_id#10813, rc_num_calls#10814, referer#10815, screen_size#10816, seasonal_pattern#10817, seller#10818, seller_tag#10819, seller_top_level#10820, slot_id#10821, state_code#10822, url_pattern2#10823, ctr#10824, ... 28 more fields]. Attribute(s) with the same name appear in the operation: seller_seller_tag. Please check if the right attribute(s) are used.
Obviously, both the "seller" and "seller_seller_tag" columns are present in the left and right datasets. The error message is ambiguous and I've been scratching my head over this for a couple of days now. I have tried several things, including selecting only a few columns and joining only on the column "seller", all in vain.
I'm unable to reproduce your error. Did you share all the transformations you've applied to the dataframe after the join?
Try to rename the offending column, and subsequently join on the new field name:
val metricRenamed = metric.withColumnRenamed("seller_seller_tag", "metric_seller_seller_tag")
input.join(
  broadcast(metricRenamed),
  input("seller") === metricRenamed("seller") && input("seller_seller_tag") === metricRenamed("metric_seller_seller_tag"),
  "left_outer")
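Since joining on an explicit condition keeps both copies of the key columns, here is a slightly fuller sketch of the same idea (renaming both keys on the metric side, which the snippet above does only for seller_seller_tag) so the duplicates can be dropped afterwards; the metric_seller name is just illustrative:
import org.apache.spark.sql.functions.broadcast

val metricRenamed = metric
  .withColumnRenamed("seller", "metric_seller")
  .withColumnRenamed("seller_seller_tag", "metric_seller_seller_tag")

val joined = input
  .join(
    broadcast(metricRenamed),
    input("seller") === metricRenamed("metric_seller") &&
      input("seller_seller_tag") === metricRenamed("metric_seller_seller_tag"),
    "left_outer")
  .drop("metric_seller", "metric_seller_seller_tag")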
Trying to compute the stddev and the 25th/75th percentile quantiles, but they produce NaN and null values
from functools import partial, reduce

from pyspark.sql import Window
import pyspark.sql.functions as F

# Window Time = 30min
window_time = 1800
# Stats fields for window
stat_fields = ['source_packets', 'destination_packets']
df = sqlContext.createDataFrame([('192.168.1.1','10.0.0.1',22,51000, 17, 1, "2017-03-10T15:27:18+00:00"),
('192.168.1.2','10.0.0.2',51000,22, 1,2, "2017-03-15T12:27:18+00:00"),
('192.168.1.2','10.0.0.2',53,51000, 2,3, "2017-03-15T12:28:18+00:00"),
('192.168.1.2','10.0.0.2',51000,53, 3,4, "2017-03-15T12:29:18+00:00"),
('192.168.1.3','10.0.0.3',80,51000, 4,5, "2017-03-15T12:28:18+00:00"),
('192.168.1.3','10.0.0.3',51000,80, 5,6, "2017-03-15T12:29:18+00:00"),
('192.168.1.3','10.0.0.3',22,51000, 25,7, "2017-03-18T11:27:18+00:00")],
["source_ip","destination_ip","source_port","destination_port", "source_packets", "destination_packets", "timestampGMT"])
def add_stats_column(r_df, field, window):
    '''
    Input:
    r_df: dataframe
    field: field to generate stats with
    window: pyspark window to be used
    '''
    r_df = r_df \
        .withColumn('{}_sum_30m'.format(field), F.sum(field).over(window))\
        .withColumn('{}_avg_30m'.format(field), F.avg(field).over(window))\
        .withColumn('{}_std_30m'.format(field), F.stddev(field).over(window))\
        .withColumn('{}_min_30m'.format(field), F.min(field).over(window))\
        .withColumn('{}_max_30m'.format(field), F.max(field).over(window))\
        .withColumn('{}_q25_30m'.format(field), F.expr("percentile_approx('{}', 0.25)".format(field)).over(window))\
        .withColumn('{}_q75_30m'.format(field), F.expr("percentile_approx('{}', 0.75)".format(field)).over(window))
    return r_df
w_s = (Window()
.partitionBy("ip")
.orderBy(F.col("timestamp"))
.rangeBetween(-window_time, 0))
df2 = df.withColumn("timestamp", F.unix_timestamp(F.to_timestamp("timestampGMT"))) \
.withColumn("arr",F.array(F.col("source_ip"),F.col("destination_ip")))\
.selectExpr("explode(arr) as ip","*")\
.drop(*['arr','source_ip','destination_ip'])
df2 = (reduce(partial(add_stats_column,window=w_s),
stat_fields,
df2
))
#print(df2.explain())
df2.show(100)
output
+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| ip|source_port|destination_port|source_packets|destination_packets| timestampGMT| timestamp|source_packets_sum_30m|source_packets_avg_30m|source_packets_std_30m|source_packets_min_30m|source_packets_max_30m|source_packets_q25_30m|source_packets_q75_30m|destination_packets_sum_30m|destination_packets_avg_30m|destination_packets_std_30m|destination_packets_min_30m|destination_packets_max_30m|destination_packets_q25_30m|destination_packets_q75_30m|
+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
|192.168.1.3| 80| 51000| 4| 5|2017-03-15T12:28:...|1489580898| 4| 4.0| NaN| 4| 4| null| null| 5| 5.0| NaN| 5| 5| null| null|
|192.168.1.3| 51000| 80| 5| 6|2017-03-15T12:29:...|1489580958| 9| 4.5| 0.7071067811865476| 4| 5| null| null| 11| 5.5| 0.7071067811865476| 5| 6| null| null|
|192.168.1.3| 22| 51000| 25| 7|2017-03-18T11:27:...|1489836438| 25| 25.0| NaN| 25| 25| null| null| 7| 7.0| NaN| 7| 7| null| null|
| 10.0.0.1| 22| 51000| 17| 1|2017-03-10T15:27:...|1489159638| 17| 17.0| NaN| 17| 17| null| null| 1| 1.0| NaN| 1| 1| null| null|
| 10.0.0.2| 51000| 22| 1| 2|2017-03-15T12:27:...|1489580838| 1| 1.0| NaN| 1| 1| null| null| 2| 2.0| NaN| 2| 2| null| null|
| 10.0.0.2| 53| 51000| 2| 3|2017-03-15T12:28:...|1489580898| 3| 1.5| 0.7071067811865476| 1| 2| null| null| 5| 2.5| 0.7071067811865476| 2| 3| null| null|
| 10.0.0.2| 51000| 53| 3| 4|2017-03-15T12:29:...|1489580958| 6| 2.0| 1.0| 1| 3| null| null| 9| 3.0| 1.0| 2| 4| null| null|
| 10.0.0.3| 80| 51000| 4| 5|2017-03-15T12:28:...|1489580898| 4| 4.0| NaN| 4| 4| null| null| 5| 5.0| NaN| 5| 5| null| null|
| 10.0.0.3| 51000| 80| 5| 6|2017-03-15T12:29:...|1489580958| 9| 4.5| 0.7071067811865476| 4| 5| null| null| 11| 5.5| 0.7071067811865476| 5| 6| null| null|
| 10.0.0.3| 22| 51000| 25| 7|2017-03-18T11:27:...|1489836438| 25| 25.0| NaN| 25| 25| null| null| 7| 7.0| NaN| 7| 7| null| null|
|192.168.1.2| 51000| 22| 1| 2|2017-03-15T12:27:...|1489580838| 1| 1.0| NaN| 1| 1| null| null| 2| 2.0| NaN| 2| 2| null| null|
|192.168.1.2| 53| 51000| 2| 3|2017-03-15T12:28:...|1489580898| 3| 1.5| 0.7071067811865476| 1| 2| null| null| 5| 2.5| 0.7071067811865476| 2| 3| null| null|
|192.168.1.2| 51000| 53| 3| 4|2017-03-15T12:29:...|1489580958| 6| 2.0| 1.0| 1| 3| null| null| 9| 3.0| 1.0| 2| 4| null| null|
|192.168.1.1| 22| 51000| 17| 1|2017-03-10T15:27:...|1489159638| 17| 17.0| NaN| 17| 17| null| null| 1| 1.0| NaN| 1| 1| null| null|
+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
From the PySpark API docs, we can see that:
pyspark.sql.functions.stddev(col)
Aggregate function: returns the unbiased sample standard deviation of the expression in a group.
New in version 1.6.
pyspark.sql.functions.stddev_pop(col)
Aggregate function: returns population standard deviation of the expression in a group.
New in version 1.6.
pyspark.sql.functions.stddev_samp(col)
Aggregate function: returns the unbiased sample standard deviation of the expression in a group.
New in version 1.6.
So maybe you can try stddev_pop (population standard deviation) instead of stddev (unbiased sample standard deviation).
The unbiased sample standard deviation divides by n - 1, so it is undefined (NaN) when the window contains only one sample.
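A minimal sketch of the suggested change, assuming the same window w_s and the F alias from the question: stddev_pop is defined even for a single row (it returns 0.0), whereas the sample standard deviation is NaN there. (The null quantiles are most likely a separate issue: the column name is quoted as a string literal inside percentile_approx, so the percentile is computed over a constant string rather than over the column's values.)
import pyspark.sql.functions as F

# Population standard deviation instead of sample standard deviation over the same window
df2 = df2 \
    .withColumn('source_packets_std_30m', F.stddev_pop('source_packets').over(w_s)) \
    .withColumn('destination_packets_std_30m', F.stddev_pop('destination_packets').over(w_s))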
I have the following dataframe:
+----+----+-----+
|col1|col2|value|
+----+----+-----+
| 11| a| 1|
| 11| a| 2|
| 11| b| 3|
| 11| a| 4|
| 11| b| 5|
| 22| a| 6|
| 22| b| 7|
+----+----+-----+
I want to calculate the cumulative sum of the 'value' column, partitioned by 'col1' and ordered by 'col2'.
This is the desired output:
+----+----+-----+------+
|col1|col2|value|cumsum|
+----+----+-----+------+
| 11| a| 1| 1|
| 11| a| 2| 3|
| 11| a| 4| 7|
| 11| b| 3| 10|
| 11| b| 5| 15|
| 22| a| 6| 6|
| 22| b| 7| 13|
+----+----+-----+------+
I have used this code which gives me the df shown below. It is not what I wanted. Can someone help me please?
df.withColumn("cumsum", F.sum("value").over(Window.partitionBy("col1").orderBy("col2").rangeBetween(Window.unboundedPreceding, 0)))
+----+----+-----+------+
|col1|col2|value|cumsum|
+----+----+-----+------+
| 11| a| 2| 7|
| 11| a| 1| 7|
| 11| a| 4| 7|
| 11| b| 3| 15|
| 11| b| 5| 15|
| 22| a| 6| 6|
| 22| b| 7| 13|
+----+----+-----+------+
You have to use .rowsBetween instead of .rangeBetween in your window clause. With rangeBetween the frame is defined by the values of the ordering column, so every row that shares the same col2 value falls into the same frame and gets the same cumulative sum; rowsBetween defines the frame by row position instead (see also: rowsBetween vs rangeBetween).
Example:
df.withColumn("cumsum", sum("value").over(Window.partitionBy("col1").orderBy("col2").rowsBetween(Window.unboundedPreceding, 0))).show()
#+----+----+-----+------+
#|col1|col2|value|cumsum|
#+----+----+-----+------+
#| 11| a| 1| 1|
#| 11| a| 2| 3|
#| 11| a| 4| 7|
#| 11| b| 3| 10|
#| 11| b| 5| 15|
#| 22| a| 6| 6|
#| 22| b| 7| 13|
#+----+----+-----+------+
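If the exact order of rows inside each col2 group matters (the desired output lists the values 1, 2, 4 before 3, 5), value can be added as a tiebreaker in the ordering; this is an assumption about the intended order rather than something the question states, so treat it as a sketch:
from pyspark.sql import Window
import pyspark.sql.functions as F

# Order by col2 first, then by value, so rows with equal col2 get a deterministic order
w = Window.partitionBy("col1").orderBy("col2", "value") \
          .rowsBetween(Window.unboundedPreceding, 0)
df.withColumn("cumsum", F.sum("value").over(w)).show()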
I will explain my problem based on the initial dataframe and the one I want to achieve:
val df_997 = Seq [(Int, Int, Int, Int)]((1,1,7,10),(1,10,4,300),(1,3,14,50),(1,20,24,70),(1,30,12,90),(2,10,4,900),(2,25,30,40),(2,15,21,60),(2,5,10,80)).toDF("policyId","FECMVTO","aux","IND_DEF").orderBy(asc("policyId"), asc("FECMVTO"))
df_997.show
+--------+-------+---+-------+
|policyId|FECMVTO|aux|IND_DEF|
+--------+-------+---+-------+
| 1| 1| 7| 10|
| 1| 3| 14| 50|
| 1| 10| 4| 300|
| 1| 20| 24| 70|
| 1| 30| 12| 90|
| 2| 5| 10| 80|
| 2| 10| 4| 900|
| 2| 15| 21| 60|
| 2| 25| 30| 40|
+--------+-------+---+-------+
Imagine I have partitioned this DF by the column policyId and created the column row_num based on it to better see the Windows:
val win = Window.partitionBy("policyId").orderBy("FECMVTO")
val df_998 = df_997.withColumn("row_num",row_number().over(win))
df_998.show
+--------+-------+---+-------+-------+
|policyId|FECMVTO|aux|IND_DEF|row_num|
+--------+-------+---+-------+-------+
| 1| 1| 7| 10| 1|
| 1| 3| 14| 50| 2|
| 1| 10| 4| 300| 3|
| 1| 20| 24| 70| 4|
| 1| 30| 12| 90| 5|
| 2| 5| 10| 80| 1|
| 2| 10| 4| 900| 2|
| 2| 15| 21| 60| 3|
| 2| 25| 30| 40| 4|
+--------+-------+---+-------+-------+
Now, for each window, if the value of aux is 4, I want to set the FECMVTO column to the value of the IND_DEF column for that row and keep that value for every following row until the end of the window.
The resulting DF would be:
+--------+-------+---+-------+-------+
|policyId|FECMVTO|aux|IND_DEF|row_num|
+--------+-------+---+-------+-------+
| 1| 1| 7| 10| 1|
| 1| 3| 14| 50| 2|
| 1| 300| 4| 300| 3|
| 1| 300| 24| 70| 4|
| 1| 300| 12| 90| 5|
| 2| 5| 10| 80| 1|
| 2| 900| 4| 900| 2|
| 2| 900| 21| 60| 3|
| 2| 900| 30| 40| 4|
+--------+-------+---+-------+-------+
Thanks for your suggestions, as I am very stuck here...
Here's one approach: first left-join the DataFrame with its aux === 4 filtered version, then apply the window function first() to backfill the nulls with the wanted IND_DEF values per partition, and finally conditionally recreate column FECMVTO:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1,1,7,10), (1,10,4,300), (1,3,14,50), (1,20,24,70), (1,30,12,90),
(2,10,4,900), (2,25,30,40), (2,15,21,60), (2,5,10,80)
).toDF("policyId","FECMVTO","aux","IND_DEF")
val win = Window.partitionBy("policyId").orderBy("FECMVTO").
rowsBetween(Window.unboundedPreceding, 0)
val df2 = df.
select($"policyId", $"aux", $"IND_DEF".as("IND_DEF2")).
where($"aux" === 4)
df.join(df2, Seq("policyId", "aux"), "left_outer").
withColumn("IND_DEF3", first($"IND_DEF2", ignoreNulls=true).over(win)).
withColumn("FECMVTO", coalesce($"IND_DEF3", $"FECMVTO")).
show
// +--------+---+-------+-------+--------+--------+
// |policyId|aux|FECMVTO|IND_DEF|IND_DEF2|IND_DEF3|
// +--------+---+-------+-------+--------+--------+
// | 1| 7| 1| 10| null| null|
// | 1| 14| 3| 50| null| null|
// | 1| 4| 300| 300| 300| 300|
// | 1| 24| 300| 70| null| 300|
// | 1| 12| 300| 90| null| 300|
// | 2| 10| 5| 80| null| null|
// | 2| 4| 900| 900| 900| 900|
// | 2| 21| 900| 60| null| 900|
// | 2| 30| 900| 40| null| 900|
// +--------+---+-------+-------+--------+--------+
Columns IND_DEF2, IND_DEF3 are kept only for illustration (and can certainly be dropped).
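A join-free variant of the same idea, offered only as a sketch (not part of the answer above): carry the IND_DEF value of the aux === 4 row forward within each partition using last(..., ignoreNulls = true) over the same running window, then coalesce it into FECMVTO.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Same running window as above, repeated so the snippet stands on its own
val runWin = Window.partitionBy("policyId").orderBy("FECMVTO").
  rowsBetween(Window.unboundedPreceding, 0)

// last(..., ignoreNulls = true) carries the most recent IND_DEF seen on an aux === 4 row
df.withColumn(
    "FECMVTO",
    coalesce(last(when($"aux" === 4, $"IND_DEF"), ignoreNulls = true).over(runWin), $"FECMVTO")
  ).show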
I believe the below can be a solution for your issue.
Considering input_df is your input dataframe:
//Step#1 - Filter rows with aux = 4 from input_df
val only_FECMVTO_4_df1 = input_df.filter($"aux" === 4)
//Step#2 - Fill FECMVTO with the IND_DEF value for the above result (overwriting in place keeps the column order for the union below)
val only_FECMVTO_4_df2 = only_FECMVTO_4_df1.withColumn("FECMVTO", $"IND_DEF")
//Step#3 - Remove all the records from Step#1 from input_df
val input_df_without_FECMVTO_4 = input_df.except(only_FECMVTO_4_df1)
//Combine Step#2 output with the output of Step#3
val final_df = input_df_without_FECMVTO_4.union(only_FECMVTO_4_df2)
So I have two data frames.
Data frame 1 looks like this:
+----------+------+---------+--------+------+
| OrgId|ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877341| 136| 4| 1| I|!||
|4295877346| 136| 4| 1| I|!||
|4295877341| 138| 2| 1| I|!||
|4295877341| 141| 4| 1| I|!||
|4295877341| 143| 2| 1| I|!||
|4295877341| 145| 14| 1| I|!||
| 123456789| 145| 14| 1| I|!||
| 809580109| 145| 9| 9| I|!||
+----------+------+---------+--------+------+
Data frame 2 is like below:
+----------+------+-----------+----------+--------+
| OrgId|ItemId|segmentId_1|Sequence_1|Action_1|
+----------+------+-----------+----------+--------+
|4295877343| 149| 15| 2| I|!||
|4295877341| 136| null| null| I|!||
| 123456789| 145| 14| 1| D|!||
|4295877341| 138| 11| 22| I|!||
|4295877341| 141| 10| 1| I|!||
|4295877341| 143| 1| 1| I|!||
| 809580109| 145| NULL| NULL| I|!||
+----------+------+-----------+----------+--------+
Now I have to join both data frames and update data frame 1's columns with the matching records from data frame 2.
The key in both data frames is OrgId and ItemId.
So the expected output should be:
+----------+------+---------+--------+------+
| OrgId|ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877346| 136| 4| 1| I|!||
|4295877341| 145| 14| 1| I|!||
|4295877343| 149| 15| 2| I|!||
|4295877341| 136| null| null| I|!||
|4295877341| 138| 11| 22| I|!||
|4295877341| 141| 10| 1| I|!||
|4295877341| 143| 1| 1| I|!||
| 809580109| 145| 9| 9| I|!||
+----------+------+---------+--------+------+
So I need to update data frame 1 with data frame 2's records.
If a record in data frame 1 is not found in data frame 2, that record needs to be retained as well.
If any new records are found in data frame 2, those records need to be added to the output.
Here is what I am doing:
val df3 = df1.join(df2, Seq("OrgId", "ItemId"), "outer")
.select($"OrgId", $"ItemId",$"segmentId_1",$"Sequence_1",$"Action_1")
.filter(!$"Action_1".contains("D"))
df3.show()
But I am getting the below output:
+----------+------+-----------+----------+--------+
| OrgId|ItemId|segmentId_1|Sequence_1|Action_1|
+----------+------+-----------+----------+--------+
|4295877343| 149| 15| 2| I|!||
|4295877341| 136| null| null| I|!||
|4295877341| 138| 11| 22| I|!||
|4295877341| 141| 10| 1| I|!||
|4295877341| 143| 1| 1| I|!||
+----------+------+-----------+----------+--------+
I am not getting the 4295877346| 136| 4| 1| I|!| record from data frame 1...
left_outer gives me the below output:
+----------+------+-----------+----------+--------+
| OrgId|ItemId|segmentId_1|Sequence_1|Action_1|
+----------+------+-----------+----------+--------+
|4295877341| 136| null| null| I|!||
|4295877341| 138| 11| 22| I|!||
|4295877341| 141| 10| 1| I|!||
|4295877341| 143| 1| 1| I|!||
+----------+------+-----------+----------+--------+
Let me first explain your mistake.
If you only join as below:
val df3 = df1.join(df2, Seq("OrgId", "ItemId"), "outer")
df3.show()
you will get:
+----------+------+---------+--------+------+-----------+----------+--------+
| OrgId|ItemId|segmentId|Sequence|Action|segmentId_1|Sequence_1|Action_1|
+----------+------+---------+--------+------+-----------+----------+--------+
|4295877346| 136| 4| 1| I|!|| null| null| null|
|4295877341| 145| 14| 1| I|!|| null| null| null|
|4295877343| 149| null| null| null| 15| 2| I|!||
|4295877341| 136| 4| 1| I|!|| null| null| I|!||
| 123456789| 145| 14| 1| I|!|| 14| 1| D|!||
|4295877341| 138| 2| 1| I|!|| 11| 22| I|!||
|4295877341| 141| 4| 1| I|!|| 10| 1| I|!||
|4295877341| 143| 2| 1| I|!|| 1| 1| I|!||
+----------+------+---------+--------+------+-----------+----------+--------+
It is fully evident that the filter in your code is also filtering out the nulls in the Action_1 column.
So the working code for you is to replace the null values you get after the join with valid data from the other table where the data is present:
val df3 = df1.join(df2, Seq("OrgId", "ItemId"), "outer")
.withColumn("segmentId_1", when($"segmentId_1".isNotNull, $"segmentId_1").otherwise($"segmentId"))
.withColumn("Sequence_1", when($"Sequence_1".isNotNull, $"Sequence_1").otherwise($"Sequence"))
.withColumn("Action_1", when($"Action_1".isNotNull, $"Action_1").otherwise($"Action"))
.select($"OrgId", $"ItemId",$"segmentId_1",$"Sequence_1",$"Action_1")
.filter(!$"Action_1".contains("D") )
df3.show()
You should be getting the desired output:
+----------+------+-----------+----------+--------+
| OrgId|ItemId|segmentId_1|Sequence_1|Action_1|
+----------+------+-----------+----------+--------+
|4295877346| 136| 4| 1| I|!||
|4295877341| 145| 14| 1| I|!||
|4295877343| 149| 15| 2| I|!||
|4295877341| 136| null| null| I|!||
|4295877341| 138| 11| 22| I|!||
|4295877341| 141| 10| 1| I|!||
|4295877341| 143| 1| 1| I|!||
+----------+------+-----------+----------+--------+
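As a side note, the three when(...).otherwise(...) columns above can be written a bit more compactly with coalesce, which returns its first non-null argument; this is only a stylistic variant of the same logic, not a different fix:
import org.apache.spark.sql.functions.coalesce

val df3 = df1.join(df2, Seq("OrgId", "ItemId"), "outer")
  .select(
    $"OrgId",
    $"ItemId",
    coalesce($"segmentId_1", $"segmentId").as("segmentId_1"),   // take df2's value, else df1's
    coalesce($"Sequence_1", $"Sequence").as("Sequence_1"),
    coalesce($"Action_1", $"Action").as("Action_1"))
  .filter(!$"Action_1".contains("D"))

df3.show()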
Try left_outer instead of outer:
val df3 = df1.join(df2, Seq("OrgId", "ItemId"), "left_outer")
.select($"OrgId", $"ItemId",$"segmentId_1",$"Sequence_1",$"Action_1")
.filter(!$"Action_1".contains("D"))
df3.show()
A left outer join should retain all unmatched rows from the left side.
I'm looking for a way to rank columns of a dataframe preserving ties. Specifically for this example, I have a pyspark dataframe as follows, where I want to generate ranks for colA & colB (though I want to support ranking an arbitrary number of columns):
+--------+----------+-----+----+
| Entity| id| colA|colB|
+--------+----------+-----+----+
| a|8589934652| 21| 50|
| b| 112| 9| 23|
| c|8589934629| 9| 23|
| d|8589934702| 8| 21|
| e| 20| 2| 21|
| f|8589934657| 2| 5|
| g|8589934601| 1| 5|
| h|8589934653| 1| 4|
| i|8589934620| 0| 4|
| j|8589934643| 0| 3|
| k|8589934618| 0| 3|
| l|8589934602| 0| 2|
| m|8589934664| 0| 2|
| n| 25| 0| 1|
| o| 67| 0| 1|
| p|8589934642| 0| 1|
| q|8589934709| 0| 1|
| r|8589934660| 0| 1|
| s| 30| 0| 1|
| t| 55| 0| 1|
+--------+----------+-----+----+
What I'd like is a way to rank this dataframe where tied values receive the same rank such as:
+--------+----------+-----+----+---------+---------+
| Entity| id| colA|colB|colA_rank|colB_rank|
+--------+----------+-----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 2| 2|
| d|8589934702| 8| 21| 3| 3|
| e| 20| 2| 21| 4| 3|
| f|8589934657| 2| 5| 4| 4|
| g|8589934601| 1| 5| 5| 4|
| h|8589934653| 1| 4| 5| 5|
| i|8589934620| 0| 4| 6| 5|
| j|8589934643| 0| 3| 6| 6|
| k|8589934618| 0| 3| 6| 6|
| l|8589934602| 0| 2| 6| 7|
| m|8589934664| 0| 2| 6| 7|
| n| 25| 0| 1| 6| 8|
| o| 67| 0| 1| 6| 8|
| p|8589934642| 0| 1| 6| 8|
| q|8589934709| 0| 1| 6| 8|
| r|8589934660| 0| 1| 6| 8|
| s| 30| 0| 1| 6| 8|
| t| 55| 0| 1| 6| 8|
+--------+----------+-----+----+---------+---------+
My current implementation with the first dataframe looks like:
def getRanks(mydf, cols=None, ascending=False):
    from pyspark import Row
    # This takes a dataframe and a list of columns to rank
    # If no list is provided, it ranks *all* columns
    # returns a new dataframe

    def addRank(ranked_rdd, col, ascending):
        # This assumes an RDD of the form (Row(...), list[...])
        # it orders the rdd by col, finds the order, then adds that to the
        # list
        myrdd = ranked_rdd.sortBy(lambda (row, ranks): row[col],
                                  ascending=ascending).zipWithIndex()
        return myrdd.map(lambda ((row, ranks), index): (row, ranks + [index + 1]))

    myrdd = mydf.rdd
    fields = myrdd.first().__fields__
    ranked_rdd = myrdd.map(lambda x: (x, []))
    if cols is None:
        cols = fields
    for col in cols:
        ranked_rdd = addRank(ranked_rdd, col, ascending)
    rank_names = [x + "_rank" for x in cols]
    # Hack to make sure columns come back in the right order
    ranked_rdd = ranked_rdd.map(lambda (row, ranks): Row(*row.__fields__ + rank_names)(*row + tuple(ranks)))
    return ranked_rdd.toDF()
which produces:
+--------+----------+-----+----+---------+---------+
| Entity| id| colA|colB|colA_rank|colB_rank|
+--------+----------+-----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 3| 3|
| d|8589934702| 8| 21| 4| 4|
| e| 20| 2| 21| 5| 5|
| f|8589934657| 2| 5| 6| 6|
| g|8589934601| 1| 5| 7| 7|
| h|8589934653| 1| 4| 8| 8|
| i|8589934620| 0| 4| 9| 9|
| j|8589934643| 0| 3| 10| 10|
| k|8589934618| 0| 3| 11| 11|
| l|8589934602| 0| 2| 12| 12|
| m|8589934664| 0| 2| 13| 13|
| n| 25| 0| 1| 14| 14|
| o| 67| 0| 1| 15| 15|
| p|8589934642| 0| 1| 16| 16|
| q|8589934709| 0| 1| 17| 17|
| r|8589934660| 0| 1| 18| 18|
| s| 30| 0| 1| 19| 19|
| t| 55| 0| 1| 20| 20|
+--------+----------+-----+----+---------+---------+
As you can see, the function getRanks() takes a dataframe, specifies the columns to be ranked, sorts them, and uses zipWithIndex() to generate an ordering or rank. However, I can't figure out a way to preserve ties.
This Stack Overflow post is the closest solution I've found: rank-users-by-column. But it appears to only handle one column (I think).
Thanks so much for the help in advance!
EDIT: column 'id' is generated from calling monotonically_increasing_id() and in my implementation is cast to a string.
You're looking for dense_rank
First let's create our dataframe:
df = spark.createDataFrame(sc.parallelize([["a",8589934652,21,50],["b",112,9,23],["c",8589934629,9,23],
["d",8589934702,8,21],["e",20,2,21],["f",8589934657,2,5],
["g",8589934601,1,5],["h",8589934653,1,4],["i",8589934620,0,4],
["j",8589934643,0,3],["k",8589934618,0,3],["l",8589934602,0,2],
["m",8589934664,0,2],["n",25,0,1],["o",67,0,1],["p",8589934642,0,1],
["q",8589934709,0,1],["r",8589934660,0,1],["s",30,0,1],["t",55,0,1]]
), ["Entity","id","colA","colB"])
We'll define two window specs:
from pyspark.sql import Window
import pyspark.sql.functions as psf
wA = Window.orderBy(psf.desc("colA"))
wB = Window.orderBy(psf.desc("colB"))
df = df.withColumn(
"colA_rank",
psf.dense_rank().over(wA)
).withColumn(
"colB_rank",
psf.dense_rank().over(wB)
)
+------+----------+----+----+---------+---------+
|Entity| id|colA|colB|colA_rank|colB_rank|
+------+----------+----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 2| 2|
| d|8589934702| 8| 21| 3| 3|
| e| 20| 2| 21| 4| 3|
| f|8589934657| 2| 5| 4| 4|
| g|8589934601| 1| 5| 5| 4|
| h|8589934653| 1| 4| 5| 5|
| i|8589934620| 0| 4| 6| 5|
| j|8589934643| 0| 3| 6| 6|
| k|8589934618| 0| 3| 6| 6|
| l|8589934602| 0| 2| 6| 7|
| m|8589934664| 0| 2| 6| 7|
| n| 25| 0| 1| 6| 8|
| o| 67| 0| 1| 6| 8|
| p|8589934642| 0| 1| 6| 8|
| q|8589934709| 0| 1| 6| 8|
| r|8589934660| 0| 1| 6| 8|
| s| 30| 0| 1| 6| 8|
| t| 55| 0| 1| 6| 8|
+------+----------+----+----+---------+---------+
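Since the question asks for support for an arbitrary number of columns, the same idea extends with a simple loop; a minimal sketch, assuming the columns to rank are everything after Entity and id (i.e. df.columns[2:]). Note that a window without partitionBy moves all rows to a single partition (Spark logs a warning about this), which is fine for data of this size.
from pyspark.sql import Window
import pyspark.sql.functions as psf

# Add a dense rank column for every column after Entity and id
for c in df.columns[2:]:
    w = Window.orderBy(psf.desc(c))
    df = df.withColumn(c + "_rank", psf.dense_rank().over(w))

df.show()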
I'll also propose an alternative:
from pyspark.sql.functions import col

# 'data' here is the original dataframe with columns (Entity, id, colA, colB, ...)
for cols in data.columns[2:]:
    lookup = (data.select(cols)
              .distinct()
              .orderBy(cols, ascending=False)
              .rdd
              .zipWithIndex()
              .map(lambda x: x[0] + (x[1], ))
              .toDF([cols, cols + "_rank_lookup"]))
    name = cols + "_ranks"
    data = (data.join(lookup, [cols])
            .withColumn(name, col(cols + "_rank_lookup") + 1)
            .drop(cols + "_rank_lookup"))
Not as elegant as dense_rank() and I'm uncertain as to performance implications.