How can I bring the months into calendar order (Jan to Dec) in a Scala DataFrame? - scala

+---------+------------------+
| Month|sum(buss_days)|
+---------+------------------+
| April| 83.93|
| August| 94.895|
| December| 53.47|
| February| 22.90|
| January| 97.45|
| July| 95.681|
| June| 23.371|
| March| 35.957|
| May| 4.24|
| November| 1.56|
| October| 1.00|
|September| 93.51|
+---------+------------------+
and I want output like this:
+---------+------------------+
|    Month|sum(avg_buss_days)|
+---------+------------------+
|  January|             97.45|
| February|             22.90|
|    March|            35.957|
|    April|             83.93|
|      May|              4.24|
|     June|            23.371|
|     July|            95.681|
|   August|            94.895|
|September|             93.51|
|  October|              1.00|
| November|              1.56|
| December|             53.47|
+---------+------------------+
This is what I did:
df.groupBy("Month[order(match(month$month, month.abb)), ]")
And I got this:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "Month[order(match(month$month, month.abb)), ]"
(Here, Month is the column name in the dataframe.)

Converting the Month column into date form and sorting on it should do. Please find the snippet: unix_timestamp(col("Month"), "MMMM") parses the full month name ("MMMM" is the full-month-name pattern), so sorting on it gives calendar order.
df.sort(unix_timestamp(col("Month"), "MMMM")).show()
+---------+-------------+
| Month|avg_buss_days|
+---------+-------------+
| January| 97.45|
| February| 22.90|
| March| 35.957|
| April| 83.93|
| May| 4.24|
| June| 23.371|
| July| 95.681|
| August| 94.895|
|September| 93.51|
| October| 1.00|
| November| 1.56|
| December| 53.47|
+---------+-------------+
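For reference, here is a minimal end-to-end sketch of the same idea (assuming the aggregation starts from a raw DataFrame named raw with Month and buss_days columns; those names are illustrative):

import org.apache.spark.sql.functions.{col, sum, unix_timestamp}

// Aggregate per month, then sort by the month name parsed as a timestamp.
// "MMMM" parses the full month name, so the timestamps fall in calendar order.
val monthly = raw
  .groupBy("Month")
  .agg(sum("buss_days"))
  .sort(unix_timestamp(col("Month"), "MMMM"))

monthly.show()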

Related

pyspark dataframe check if string contains substring

I need help implementing the below Python logic in a PySpark dataframe.
Python:
df1['isRT'] = df1['main_string'].str.lower().str.contains('|'.join(df2['sub_string'].str.lower()))
df1.show()
+--------+---------------------------+
|id | main_string |
+--------+---------------------------+
| 1 | i am a boy |
| 2 | i am from london |
| 3 | big data hadoop |
| 4 | always be happy |
| 5 | software and hardware |
+--------+---------------------------+
df2.show()
+--------+---------------------------+
|id | sub_string |
+--------+---------------------------+
| 1 | happy |
| 2 | xxxx |
| 3 | i am a boy |
| 4 | yyyy |
| 5 | from london |
+--------+---------------------------+
Final Output:
df1.show()
+--------+---------------------------+--------+
|id | main_string | isRT |
+--------+---------------------------+--------+
| 1 | i am a boy | True |
| 2 | i am from london | True |
| 3 | big data hadoop | False |
| 4 | always be happy | True |
| 5 | software and hardware | False |
+--------+---------------------------+--------+
First construct the substring list substr_list, and then use the rlike function to generate the isRT column.
from pyspark.sql import functions as F

df3 = df2.select(F.expr('collect_list(lower(sub_string))').alias('substr'))
substr_list = '|'.join(df3.first()[0])
df = df1.withColumn('isRT', F.expr(f'lower(main_string) rlike "{substr_list}"'))
df.show(truncate=False)
For your two dataframes,
df1 = spark.createDataFrame(['i am a boy', 'i am from london', 'big data hadoop', 'always be happy', 'software and hardware'], 'string').toDF('main_string')
df1.show(truncate=False)
df2 = spark.createDataFrame(['happy', 'xxxx', 'i am a boy', 'yyyy', 'from london'], 'string').toDF('sub_string')
df2.show(truncate=False)
+---------------------+
|main_string |
+---------------------+
|i am a boy |
|i am from london |
|big data hadoop |
|always be happy |
|software and hardware|
+---------------------+
+-----------+
|sub_string |
+-----------+
|happy |
|xxxx |
|i am a boy |
|yyyy |
|from london|
+-----------+
You can get the following result with a simple join expression.
from pyspark.sql import functions as f
df1.join(df2, f.col('main_string').contains(f.col('sub_string')), 'left') \
.withColumn('isRT', f.expr('if(sub_string is null, False, True)')) \
.drop('sub_string') \
.show()
+--------------------+-----+
| main_string| isRT|
+--------------------+-----+
| i am a boy| true|
| i am from london| true|
| big data hadoop|false|
| always be happy| true|
|software and hard...|false|
+--------------------+-----+
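Since the rest of this page is Scala-flavoured, here is a minimal sketch of the same join-and-contains idea using the Scala API (assuming DataFrames df1 and df2 with the main_string and sub_string columns shown above):

import org.apache.spark.sql.functions.{col, expr}

// Left-join on a contains() condition, then flag the rows that found a match.
val result = df1
  .join(df2, col("main_string").contains(col("sub_string")), "left")
  .withColumn("isRT", expr("sub_string is not null"))
  .drop("sub_string")

result.show()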

Pivoting a Dataframe column transforming on a User ID Spark [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
I have a DataFrame that looks like:
+------+------------+------------------+
|UserID|Attribute | Value |
+------+------------+------------------+
|123 | City | San Francisco |
|123 | Lang | English |
|111 | Lang | French |
|111 | Age | 23 |
|111 | Gender | Female |
+------+------------+------------------+
So I have a few distinct Attributes that can be null for some users (a limited set of Attributes, say 20 at most).
I want to convert this DF to:
+-----+--------------+---------+-----+--------+
|User |City | Lang | Age | Gender |
+-----+--------------+---------+-----+--------+
|123 |San Francisco | English | NULL| NULL |
|111 | NULL| French | 23 | Female |
+-----+--------------+---------+-----+--------+
I'm quite new to Spark and Scala.
You can use pivot to get the desired output:
import org.apache.spark.sql.functions._
import sparkSession.sqlContext.implicits._
df.groupBy("UserID")
.pivot("Attribute")
.agg(first("Value")).show()
This will give you the desired output:
+------+----+-------------+------+-------+
|UserID| Age| City|Gender| Lang|
+------+----+-------------+------+-------+
| 111| 23| null|Female| French|
| 123|null|San Francisco| null|English|
+------+----+-------------+------+-------+
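As a side note, since the question mentions a bounded set of attributes (about 20 at most), you can also pass the pivot values explicitly; this saves Spark the extra pass it otherwise needs to collect the distinct values. The attribute names below are just the ones from the example:

// Listing the pivot values up front skips the job that discovers them.
df.groupBy("UserID")
  .pivot("Attribute", Seq("City", "Lang", "Age", "Gender"))
  .agg(first("Value"))
  .show()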

Joining data without creating duplicate metric rows from the first table, (second table contains more rows but not metrics)

I have the following two tables that I would like to join for a comprehensive digital marketing report without creating duplicate metric rows. The idea is to take competitor adverts and join them with my existing marketing data, which is as follows:
Campaign|Impressions | Clicks | Conversions | CPC |Key
---------+------------+--------+-------------+-----+----
USA-SIM|53432 | 5001 | 5| 2$ |Hgdy24
DE-SIM |5389 | 4672 | 3| 4$ |dhfg12
The competitor data is as follows:
Key | Ad Copie |
---------+------------+
Hgdy24 |Click here! |
Hgdy24 |Free Trial! |
Hgdy24 |Sign Up now |
dhfg12 |Check it out|
dhfg12 |World known |
dhfg12 |Sign up |
Using conventional join queries produces the following unusable result
Campaign|Impressions | Clicks | Conversions | CPC |Key |Ad Copie
---------+------------+--------+-------------+-----+------+---------
USA-SIM|53432 | 5001 | 5| 2$ |Hgdy24|Click here!
USA-SIM|53432 | 5001 | 5| 2$ |Hgdy24|Free Trial!
USA-SIM|53432 | 5001 | 5| 2$ |Hgdy24|Sign Up now
DE-SIM |5389 | 4672 | 3| 4$ |dhfg12|Check it out
DE-SIM |5389 | 4672 | 3| 4$ |dhfg12|World known
DE-SIM |5389 | 4672 | 3| 4$ |dhfg12|Sign up
Here is the desired output
Campaign|Impressions | Clicks | Conversions | CPC |Key |Ad Copie
---------+------------+--------+-------------+-----+------+---------
USA-SIM|53432 | 5001 | 5| 2$ |Hgdy24|Click here!
USA-SIM| | | | |Hgdy24|Free Trial!
USA-SIM| | | | |Hgdy24|Sign Up now
DE-SIM |5389 | 4672 | 3| 4$ |dhfg12|Check it out
DE-SIM | | | | |dhfg12|World known
DE-SIM | | | | |dhfg12|Sign up
Or, as an alternative that would also work:
Campaign|Impressions | Clicks | Conversions | CPC |Key |Ad Copie
---------+------------+--------+-------------+-----+------+---------
USA-SIM|53432 | 5001 | 5| 2$ |Hgdy24|
USA-SIM| | | | |Hgdy24|Click here!
USA-SIM| | | | |Hgdy24|Free Trial!
USA-SIM| | | | |Hgdy24|Sign Up now
DE-SIM |5389 | 4672 | 3| 4$ |dhfg12|
DE-SIM | | | | |dhfg12|Check it out
DE-SIM | | | | |dhfg12|World known
DE-SIM | | | | |dhfg12|Sign up
I have yet to find a workaround that does not produce the extra metrics as a result.
MOST RECENT RESULT
campaing | impressions | clicks | conversions | cpc | key | ad_copie
----------+-------------+--------+-------------+-----+--------+------------
USA-SIM | 53432 | 5001 | 5 | 2$ | |
USA-SIM | | | | | Hgdy24 | Click here!
USA-SIM | | | | | Hgdy24 | Free Trial!
USA-SIM | | | | | Hgdy24 | Sign Up now
DE-SIM | 5389 | 4672 | 3 | 4$ | |
DE-SIM | | | | | dhfg12 | Check it out
DE-SIM | | | | | dhfg12 | World known
DE-SIM | | | | | dhfg12 | Sign up
You can use the window function lag() to check which key was in the previous row and either display the metrics or null them out.
select campaing,
case when prev_key is null or prev_key != key then impressions end as impressions,
case when prev_key is null or prev_key != key then clicks end as clicks,
case when prev_key is null or prev_key != key then conversions end as conversions,
case when prev_key is null or prev_key != key then cpc end as cpc,
key, ad_copie
from (
select campaing, lag(key) over () AS prev_key, impressions, clicks, conversions, cpc, key, ad_copie
from ad1
join comp1 using(key)
order by campaing desc, key
) sub;
result:
campaing | impressions | clicks | conversions | cpc | key | ad_copie
----------+-------------+--------+-------------+-----+--------+--------------
USA-SIM | 53432 | 5001 | 5 | 2$ | Hgdy24 | Click here!
USA-SIM | | | | | Hgdy24 | Free Trial!
USA-SIM | | | | | Hgdy24 | Sign Up now
DE-SIM | 5389 | 4672 | 3 | 4$ | dhfg12 | Check it out
DE-SIM | | | | | dhfg12 | World known
DE-SIM | | | | | dhfg12 | Sign up
(6 rows)
EDIT: You might need to tinker with which columns you compare before you NULL the metrics, and possibly with which columns you order the data by. If key is unique for each campaing value then I suppose this will suffice.

Postgres select from table and spread evenly

I have 2 tables. The first table contains information about the object; the second table contains related objects. The second table's objects have 4 types (let's call them A, B, C, D).
I need a query that does something like this:
|table1 object id | A |value for A|B | value for B| C | value for C|D | value for D|
| 1 | 12| cat | 13| dog | 2 | house | 43| car |
| 1 | 5 | lion | | | | | | |
The column "table1 object id" in real table is multiple columns of data from table 1(for single object its all the same, just repeated on multiple rows because of table 2).
Where 2nd table is in form
|type|value|table 1 object id| id |
|A |cat | 1 | 12|
|B |dog | 1 | 13|
|C |house| 1 | 2 |
|D |car | 1 | 43 |
|A |lion | 1 | 5 |
I hope this makes clear enough what I want.
I have tried using AND, OR, and JOIN. This does not seem like something that can be done with crosstab.
EDIT
Table 2
|type|value|table 1 object id| id |
|A |cat | 1 | 12|
|B |dog | 1 | 13|
|C |house| 1 | 2 |
|D |car | 1 | 43 |
|A |lion | 1 | 5 |
|C |wolf | 2 | 6 |
Table 1
| id | value1 | value 2|value 3|
| 1 | hello | test | hmmm |
| 2 | bye | test2 | hmm2 |
Result
|value1| value2| value3| A| value| B |value| C|value | D | value|
|hello | test | hmmm |12| cat | 13| dog |2 | house | 23| car |
|hello | test | hmmm |5 | lion | | | | | | |
|bye | test2 | hmm2 | | | | |6 | wolf | | |
I hope this explains a bit better what I want to achieve.

Spark groupby filter sorting with top 3 read articles each city

I have table data like the following:
+-----------+--------+-------------+
| City Name | URL | Read Count |
+-----------+--------+-------------+
| Gurgaon | URL1 | 3 |
| Gurgaon | URL3 | 6 |
| Gurgaon | URL6 | 5 |
| Gurgaon | URL4 | 1 |
| Gurgaon | URL5 | 5 |
| Delhi | URL3 | 4 |
| Delhi | URL7 | 2 |
| Delhi | URL5 | 1 |
| Delhi | URL6 | 6 |
| Punjab | URL6 | 5 |
| Punjab | URL4 | 1 |
| Mumbai | URL5 | 5 |
+-----------+--------+-------------+
I would like to see something like this -> the top 3 read articles (if they exist) for each city
+-----------+--------+--------+
| City Name | URL | Count |
+-----------+--------+--------+
| Gurgaon | URL3 | 6 |
| Gurgaon | URL6 | 5 |
| Gurgaon | URL5 | 5 |
| Delhi | URL6 | 6 |
| Delhi | URL3 | 4 |
| Delhi | URL1 | 3 |
| Punjab | URL6 | 5 |
| Punjab | URL4 | 1 |
| Mumbai | URL5 | 5 |
+-----------+--------+--------+
I am working on Spark 2.0.2, Scala 2.11.8
You can use a window function to get the output.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val df = sc.parallelize(Seq(
  ("Gurgaon","URL1",3), ("Gurgaon","URL3",6), ("Gurgaon","URL6",5), ("Gurgaon","URL4",1), ("Gurgaon","URL5",5),
  ("DELHI","URL3",4), ("DELHI","URL7",2), ("DELHI","URL5",1), ("DELHI","URL6",6), ("Mumbai","URL5",5),
  ("Punjab","URL6",6), ("Punjab","URL4",1))).toDF("City", "URL", "Count")
df.show()
+-------+----+-----+
| City| URL|Count|
+-------+----+-----+
|Gurgaon|URL1| 3|
|Gurgaon|URL3| 6|
|Gurgaon|URL6| 5|
|Gurgaon|URL4| 1|
|Gurgaon|URL5| 5|
| DELHI|URL3| 4|
| DELHI|URL7| 2|
| DELHI|URL5| 1|
| DELHI|URL6| 6|
| Mumbai|URL5| 5|
| Punjab|URL6| 6|
| Punjab|URL4| 1|
+-------+----+-----+
val w = Window.partitionBy($"City").orderBy($"Count".desc)
val dfTop = df.withColumn("row", row_number().over(w)).where($"row" <= 3).drop("row")
dfTop.show
+-------+----+-----+
| City| URL|Count|
+-------+----+-----+
|Gurgaon|URL3| 6|
|Gurgaon|URL6| 5|
|Gurgaon|URL5| 5|
| Mumbai|URL5| 5|
| DELHI|URL6| 6|
| DELHI|URL3| 4|
| DELHI|URL7| 2|
| Punjab|URL6| 6|
| Punjab|URL4| 1|
+-------+----+-----+
Output tested on Spark 1.6.2
Window functions are probably the way to go, and there is a built-in function for this purpose:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, desc}
val window = Window.partitionBy($"City").orderBy(desc("Count"))
val dfTop = df.withColumn("rank", rank.over(window)).where($"rank" <= 3)