PySpark: convert a dataframe to time-series data with a 2-day lag

I have a PySpark dataframe like the following:
+----+-----+----------+----+
|page|group|  utc_date|   t|
+----+-----+----------+----+
|   A|   12|2023-01-02|0.55|
|   A|   12|2023-01-03| 0.6|
|   A|   12|2023-01-04|1.97|
|   A|   12|2023-01-05|1.31|
|   B|   36|2023-01-02|0.09|
|   B|   36|2023-01-03|0.09|
|   B|   36|2023-01-04|0.09|
|   B|   36|2023-01-05|0.02|
|   C|   36|2023-01-02|0.09|
|   C|   36|2023-01-03|0.09|
|   C|   36|2023-01-04|0.09|
|   C|   36|2023-01-05|0.08|
+----+-----+----------+----+
I want to convert the dataframe into a time-series dataset with a 2-day lag (grouped by page and group):
+----+-----+----------+----+----+----+
|page|group|  utc_date|   t| t-1| t-2|
+----+-----+----------+----+----+----+
|   A|   12|2023-01-02|0.55|null|null|
|   A|   12|2023-01-03| 0.6|0.55|null|
|   A|   12|2023-01-04|1.97| 0.6|0.55|
|   A|   12|2023-01-05|1.31|1.97| 0.6|
|   B|   36|2023-01-02|0.09|null|null|
|   B|   36|2023-01-03|0.09|0.09|null|
|   B|   36|2023-01-04|0.09|0.09|0.09|
|   B|   36|2023-01-05|0.02|0.09|0.09|
|   C|   36|2023-01-02|0.09|null|null|
|   C|   36|2023-01-03|0.09|0.09|null|
|   C|   36|2023-01-04|0.09|0.09|0.09|
|   C|   36|2023-01-05|0.08|0.09|0.09|
+----+-----+----------+----+----+----+
How should I do this in PySpark?

You should use the lag() function over a window partitioned by page and group and ordered by utc_date, something like:
from pyspark.sql import Window
from pyspark.sql.functions import lag

# one partition per (page, group) series, ordered chronologically
window = Window.partitionBy("page", "group").orderBy("utc_date")

new_df = (
    df
    .withColumn("t-1", lag("t", 1).over(window))  # previous day's value
    .withColumn("t-2", lag("t", 2).over(window))  # value from two days back
)
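If you later need more lag columns, the same pattern generalizes with a loop; a minimal sketch, where n_lags is a hypothetical parameter (not from the question) for how many lagged days you want:
n_lags = 2  # hypothetical: number of lagged columns to create
lagged_df = df
for i in range(1, n_lags + 1):
    # f"t-{i}" produces column names t-1, t-2, ... matching the desired output
    lagged_df = lagged_df.withColumn(f"t-{i}", lag("t", i).over(window))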

Related

Spark Scala: fill in empty values based on the result of a self-joined dataframe query

I am struggling to write Spark Scala code that fills in rows where coverage is empty, using a self join with conditions.
This is the data:
+----+--------------+----------+--------+
| ID | date_in_days | coverage | values |
+----+--------------+----------+--------+
| 1 | 2020-09-01 | | 0.128 |
| 1 | 2020-09-03 | 0 | 0.358 |
| 1 | 2020-09-04 | 0 | 0.035 |
| 1 | 2020-09-05 | | |
| 1 | 2020-09-06 | | |
| 1 | 2020-09-19 | | |
| 1 | 2020-09-12 | | |
| 1 | 2020-09-18 | | |
| 1 | 2020-09-11 | | |
| 1 | 2020-09-16 | | |
| 1 | 2020-09-21 | 13 | 0.554 |
| 1 | 2020-09-23 | | |
| 1 | 2020-09-30 | | |
+----+--------------+----------+--------+
Expected result:
+----+--------------+----------+--------+
| ID | date_in_days | coverage | values |
+----+--------------+----------+--------+
| 1 | 2020-09-01 | -1 | 0.128 |
| 1 | 2020-09-03 | 0 | 0.358 |
| 1 | 2020-09-04 | 0 | 0.035 |
| 1 | 2020-09-05 | 0 | |
| 1 | 2020-09-06 | 0 | |
| 1 | 2020-09-19 | 0 | |
| 1 | 2020-09-12 | 0 | |
| 1 | 2020-09-18 | 0 | |
| 1 | 2020-09-11 | 0 | |
| 1 | 2020-09-16 | 0 | |
| 1 | 2020-09-21 | 13 | 0.554 |
| 1 | 2020-09-23 | -1 | |
| 1 | 2020-09-30 | -1 | |
+----+--------------+----------+--------+
What I am trying to do:
For each distinct ID (dataframe partitioned by ID), sorted by date:
When a row's coverage column is null (call it rowEmptycoverage):
Find the first row in the dataframe with date_in_days > rowEmptycoverage.date_in_days and coverage >= 0 (call it rowFirstDateGreater).
Then, if rowFirstDateGreater.values > 500, set rowEmptycoverage.coverage to 0; otherwise set it to -1.
I am kind of lost in mixing when, join and where...
I am assuming that you mean values > 0.500 rather than values > 500. The logic also remains somewhat unclear, so I am assuming you want to search in the order of the date_in_days column rather than in the order of the dataframe.
In any case, the solution can be refined to match your exact need. The overall idea is to use a window to fetch the next date for which coverage is not null, check whether values meets the desired criteria, and update coverage accordingly.
It goes as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the ' and $ column syntax; assumes a SparkSession named spark

val win = Window.partitionBy("ID").orderBy("date_in_days")
  .rangeBetween(Window.currentRow, Window.unboundedFollowing)

df
  // create a struct binding coverage and values together
  .withColumn("cov_str", when('coverage.isNull, lit(null))
    .otherwise(struct('coverage, 'values)))
  // find the first row (starting from the current date, in order of
  // date_in_days) for which the coverage is not null
  .withColumn("next_cov_str", first('cov_str, ignoreNulls = true) over win)
  // update coverage: keep the original value if not null, put 0 if values
  // meets the criteria (which you can change) and -1 otherwise
  .withColumn("coverage", coalesce(
    'coverage,
    when($"next_cov_str.values" > 0.500, lit(0)),
    lit(-1)
  ))
  .show(false)
+---+-------------------+--------+------+-----------+------------+
|ID |date_in_days |coverage|values|cov_str |next_cov_str|
+---+-------------------+--------+------+-----------+------------+
|1 |2020-09-01 00:00:00|-1 |0.128 |null |[0, 0.358] |
|1 |2020-09-03 00:00:00|0 |0.358 |[0, 0.358] |[0, 0.358] |
|1 |2020-09-04 00:00:00|0 |0.035 |[0, 0.035] |[0, 0.035] |
|1 |2020-09-05 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-06 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-11 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-12 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-16 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-18 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-19 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-21 00:00:00|13 |0.554 |[13, 0.554]|[13, 0.554] |
|1 |2020-09-23 00:00:00|-1 |null |null |null |
|1 |2020-09-30 00:00:00|-1 |null |null |null |
+---+-------------------+--------+------+-----------+------------+
You can then drop("cov_str", "next_cov_str"); I leave them here for clarity.
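Since the main question on this page is about PySpark, here is a rough PySpark sketch of the same approach (an illustrative translation, not the original answer's code), assuming a dataframe df with the same ID, date_in_days, coverage and values columns:
from pyspark.sql import Window
from pyspark.sql import functions as F

win = (Window.partitionBy("ID").orderBy("date_in_days")
       .rangeBetween(Window.currentRow, Window.unboundedFollowing))

result = (
    df
    # bind coverage and values so first(..., ignorenulls=True) carries both forward
    .withColumn("cov_str", F.when(F.col("coverage").isNull(), F.lit(None))
                            .otherwise(F.struct("coverage", "values")))
    # first non-null (coverage, values) pair at or after the current date
    .withColumn("next_cov_str", F.first("cov_str", ignorenulls=True).over(win))
    # keep coverage if present; else 0 when the next values > 0.5, else -1
    .withColumn("coverage", F.coalesce(
        F.col("coverage"),
        F.when(F.col("next_cov_str.values") > 0.500, F.lit(0)),
        F.lit(-1),
    ))
    .drop("cov_str", "next_cov_str")
)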

Reduce a JSON string column into key/value columns

I have a dataframe with the following structure:
| a | b | c |
-----------------------------------------------------------------------------
|01 |ABC | {"key1":"valueA","key2":"valueC"} |
|02 |ABC | {"key1":"valueA","key2":"valueC"} |
|11 |DEF | {"key1":"valueB","key2":"valueD", "key3":"valueE"} |
|12 |DEF | {"key1":"valueB","key2":"valueD", "key3":"valueE"} |
I would like to turn it into something like:
| a | b | key | value |
--------------------------------------------------------
|01 |ABC | key1 | valueA |
|01 |ABC | key2 | valueC |
|02 |ABC | key1 | valueA |
|02 |ABC | key2 | valueC |
|11 |DEF | key1 | valueB |
|11 |DEF | key2 | valueD |
|11 |DEF | key3 | valueE |
|12 |DEF | key1 | valueB |
|12 |DEF | key2 | valueD |
|12 |DEF | key3 | valueE |
in an efficient way, as the dataset can be quite large.
Try using the from_json function with a map schema, then explode the resulting map column.
Example:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark

val df = Seq(("01", "ABC", """{"key1":"valueA","key2":"valueC"}""")).toDF("a", "b", "c")
val schema = MapType(StringType, StringType)
df.withColumn("d", from_json(col("c"), schema))
  .selectExpr("a", "b", "explode(d)")
  .show(10, false)
//+---+---+----+------+
//|a |b |key |value |
//+---+---+----+------+
//|01 |ABC|key1|valueA|
//|01 |ABC|key2|valueC|
//+---+---+----+------+
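For reference, the equivalent in PySpark (a sketch assuming the same column names a, b and c, and that the JSON objects are flat string-to-string maps):
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

# parse the JSON string into a map<string,string>, then explode it into key/value rows
schema = MapType(StringType(), StringType())

result = (
    df.withColumn("d", F.from_json(F.col("c"), schema))
      .select("a", "b", F.explode("d").alias("key", "value"))
)
result.show(truncate=False)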

How can I bring the months into calendar order (January to December) in a Scala dataframe?

+---------+--------------+
|    Month|sum(buss_days)|
+---------+--------------+
|    April|         83.93|
|   August|        94.895|
| December|         53.47|
| February|         22.90|
|  January|         97.45|
|     July|        95.681|
|     June|        23.371|
|    March|        35.957|
|      May|          4.24|
| November|          1.56|
|  October|          1.00|
|September|         93.51|
+---------+--------------+
and I want the output like this:
+---------+------------------+
|    Month|sum(avg_buss_days)|
+---------+------------------+
|  January|             97.45|
| February|             22.90|
|    March|            35.957|
|    April|             83.93|
|      May|              4.24|
|     June|            23.371|
|     July|            95.681|
|   August|            94.895|
|September|             93.51|
|  October|              1.00|
| November|              1.56|
| December|             53.47|
+---------+------------------+
This is what I did:
df.groupBy("Month[order(match(month$month, month.abb)), ]")
And I got this:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "Month[order(match(month$month, month.abb)), ]"
(Here Month is the column name in the dataframe.)
Converting Month into date form and sorting on it should do. Please find the snippet below:
Df.sort(unix_timestamp(col("Month"), "MMMMM")).show
+---------+-------------+
| Month|avg_buss_days|
+---------+-------------+
| January| 97.45|
| February| 22.90|
| March| 35.957|
| April| 83.93|
| May| 4.24|
| June| 23.371|
| July| 95.681|
| August| 94.895|
|September| 93.51|
| October| 1.00|
| November| 1.56|
| December| 53.47|
+---------+-------------+
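If you are doing this from PySpark, one version-independent alternative is to sort by an explicit month-number lookup instead of parsing the month name; the lookup-join below is my own variation (not the answer above) and assumes a SparkSession named spark:
import calendar

# small lookup table mapping each month name to its calendar position (1..12)
month_lookup = spark.createDataFrame(
    [(name, i) for i, name in enumerate(calendar.month_name[1:], start=1)],
    ["Month", "month_num"],
)

(df.join(month_lookup, "Month")
   .orderBy("month_num")
   .drop("month_num")
   .show())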

Joining data without creating duplicate metric rows from the first table (the second table contains more rows but no metrics)

I have the following two tables that I would like to join for a comprehensive digital marketing report, without creating duplicates with regard to the metrics. The idea is to take competitor adverts and join them with my existing marketing data, which is as follows:
Campaign|Impressions | Clicks | Conversions | CPC |Key
---------+------------+--------+-------------+-----+----
USA-SIM|53432 | 5001 | 5| 2$ |Hgdy24
DE-SIM |5389 | 4672 | 3| 4$ |dhfg12
The competitor data is as follows:
Key | Ad Copie |
---------+------------+
Hgdy24 |Click here! |
Hgdy24 |Free Trial! |
Hgdy24 |Sign Up now |
dhfg12 |Check it out|
dhfg12 |World known |
dhfg12 |Sign up |
Using conventional join queries produces the following unusable result:
Campaign|Impressions | Clicks | Conversions | CPC |Key |Ad Copie
---------+------------+--------+-------------+-----+------+---------
USA-SIM|53432 | 5001 | 5| 2$ |Hgdy24|Click here!
USA-SIM|53432 | 5001 | 5| 2$ |Hgdy24|Free Trial!
USA-SIM|53432 | 5001 | 5| 2$ |Hgdy24|Sign Up now
DE-SIM |5389 | 4672 | 3| 4$ |dhfg12|Check it out
DE-SIM |5389 | 4672 | 3| 4$ |dhfg12|World known
DE-SIM |5389 | 4672 | 3| 4$ |dhfg12|Sign up
Here is the desired output
Campaign|Impressions | Clicks | Conversions | CPC |Key |Ad Copie
---------+------------+--------+-------------+-----+------+---------
USA-SIM|53432 | 5001 | 5| 2$ |Hgdy24|Click here!
USA-SIM| | | | |Hgdy24|Free Trial!
USA-SIM| | | | |Hgdy24|Sign Up now
DE-SIM |5389 | 4672 | 3| 4$ |dhfg12|Check it out
DE-SIM | | | | |dhfg12|World known
DE-SIM | | | | |dhfg12|Sign up
Or, as an alternative that would also work:
Campaign|Impressions | Clicks | Conversions | CPC |Key |Ad Copie
---------+------------+--------+-------------+-----+------+---------
USA-SIM|53432 | 5001 | 5| 2$ |Hgdy24|
USA-SIM| | | | |Hgdy24|Click here!
USA-SIM| | | | |Hgdy24|Free Trial!
USA-SIM| | | | |Hgdy24|Sign Up now
DE-SIM |5389 | 4672 | 3| 4$ |dhfg12|
DE-SIM | | | | |dhfg12|Check it out
DE-SIM | | | | |dhfg12|World known
DE-SIM | | | | |dhfg12|Sign up
I have yet to find a workaround that does not produce the duplicated metrics.
Most recent result:
campaing | impressions | clicks | conversions | cpc | key | ad_copie
----------+-------------+--------+-------------+-----+--------+------------
USA-SIM | 53432 | 5001 | 5 | 2$ | |
USA-SIM | | | | | Hgdy24 | Click here!
USA-SIM | | | | | Hgdy24 | Free Trial!
USA-SIM | | | | | Hgdy24 | Sign Up now
DE-SIM | 5389 | 4672 | 3 | 4$ | |
DE-SIM | | | | | dhfg12 | Check it out
DE-SIM | | | | | dhfg12 | World known
DE-SIM | | | | | dhfg12 | Sign up
You can use the window function lag() to check which key was in the previous row and either display the metrics or null them:
select campaing,
case when prev_key is null or prev_key != key then impressions end as impressions,
case when prev_key is null or prev_key != key then clicks end as clicks,
case when prev_key is null or prev_key != key then conversions end as conversions,
case when prev_key is null or prev_key != key then cpc end as cpc,
key, ad_copie
from (
select campaing, lag(key) over () AS prev_key, impressions, clicks, conversions, cpc, key, ad_copie
from ad1
join comp1 using(key)
order by campaing desc, key
) sub;
result:
campaing | impressions | clicks | conversions | cpc | key | ad_copie
----------+-------------+--------+-------------+-----+--------+--------------
USA-SIM | 53432 | 5001 | 5 | 2$ | Hgdy24 | Click here!
USA-SIM | | | | | Hgdy24 | Free Trial!
USA-SIM | | | | | Hgdy24 | Sign Up now
DE-SIM | 5389 | 4672 | 3 | 4$ | dhfg12 | Check it out
DE-SIM | | | | | dhfg12 | World known
DE-SIM | | | | | dhfg12 | Sign up
(6 rows)
EDIT: You might need to tinker with which columns you compare before you NULL the metrics, and possibly with which columns you order the data by. If key is unique per campaign, then I suppose this will suffice.
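If you need the same result from PySpark rather than plain SQL, here is a sketch using row_number() instead of lag() (my own variation, not the answer above); the table names ad1 and comp1 and the campaing spelling follow the query shown earlier:
from pyspark.sql import Window
from pyspark.sql import functions as F

# the first ad per key keeps the metrics; the remaining rows get nulls
w = Window.partitionBy("key").orderBy("ad_copie")

result = (
    ad1.join(comp1, "key")
       .select(
           "campaing",
           *[F.when(F.row_number().over(w) == 1, F.col(c)).alias(c)
             for c in ["impressions", "clicks", "conversions", "cpc"]],
           "key",
           "ad_copie",
       )
)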

NDepend query methods/types in framework assembly being used by other assemblies/types

I am trying to determine which types or methods in a base framework assembly are being used by other assemblies in the application system. I cannot seem to find a straightforward query to do that.
What I have to do is first determine which assemblies are directly using the framework assembly, then manually list them in a second query:
SELECT TYPES FROM ASSEMBLIES "IBM.Data.DB2"
WHERE IsDirectlyUsedBy "ASSEMBLY:FirstDirectUsedByAssebmly"
OR IsDirectlyUsedBy "ASSEMBLY:SecondDirectUsedByAssebmly"
OR IsDirectlyUsedBy "ASSEMBLY:ThirdDirectUsedByAssebmly"
OR IsDirectlyUsedBy "ASSEMBLY:FourthDirectUsedByAssebmly"
Is there a better/faster way to query for this?
Additionally, the query results are focused on the matched types only; the exported dependency graph or matrix only shows details of those. How can I render a graph that shows those types or methods together with the dependent types/methods from other assemblies that consume them?
UPDATE
I cannot use a query like
SELECT METHODS/TYPES WHERE IsPublic AND !CouldBeInternal
because it returns rather strange results, including obfuscated types within the IBM.Data.DB2 assembly:
SELECT TYPES
FROM ASSEMBLIES "IBM.Data.DB2"
WHERE IsPublic AND !CouldBeInternal
48 items
--------------------------------------------------+-----------------+
types                                              |# IL instructions|
--------------------------------------------------+-----------------+
IBM.Data.DB2.ae+m                                  |0                |
IBM.Data.DB2.ae+x                                  |0                |
IBM.Data.DB2.ae+f                                  |0                |
IBM.Data.DB2.ae+ac                                 |0                |
IBM.Data.DB2.ae+aa                                 |0                |
IBM.Data.DB2.ae+u                                  |0                |
IBM.Data.DB2.ae+z                                  |0                |
IBM.Data.DB2.ae+e                                  |0                |
IBM.Data.DB2.ae+b                                  |0                |
IBM.Data.DB2.ae+g                                  |0                |
IBM.Data.DB2.ae+ab                                 |0                |
IBM.Data.DB2.ae+h                                  |0                |
IBM.Data.DB2.ae+r                                  |0                |
IBM.Data.DB2.ae+p                                  |0                |
IBM.Data.DB2.ae+ad                                 |0                |
IBM.Data.DB2.ae+i                                  |0                |
IBM.Data.DB2.ae+j                                  |0                |
IBM.Data.DB2.ae+t                                  |0                |
IBM.Data.DB2.ae+af                                 |0                |
IBM.Data.DB2.ae+k                                  |0                |
IBM.Data.DB2.ae+l                                  |0                |
IBM.Data.DB2.ae+y                                  |0                |
IBM.Data.DB2.ae+a                                  |0                |
IBM.Data.DB2.ae+q                                  |0                |
IBM.Data.DB2.ae+n                                  |0                |
IBM.Data.DB2.ae+d                                  |0                |
IBM.Data.DB2.ae+c                                  |0                |
IBM.Data.DB2.ae+ae                                 |0                |
IBM.Data.DB2.ae+o                                  |0                |
IBM.Data.DB2.ae+w                                  |0                |
IBM.Data.DB2.ae+s                                  |0                |
IBM.Data.DB2.ae+v                                  |0                |
IBM.Data.DB2.DB2Command                            |2 527            |
IBM.Data.DB2.DB2Connection                         |3 246            |
IBM.Data.DB2.DB2DataAdapter                        |520              |
IBM.Data.DB2.DB2DataReader                         |4 220            |
IBM.Data.DB2.DB2_UDF_PLATFORM                      |0                |
IBM.Data.DB2.DB2Enumerator+DB2EnumInstance         |19               |
IBM.Data.DB2.DB2Enumerator+DB2EnumDatabase         |15               |
IBM.Data.DB2.DB2Error                              |98               |
IBM.Data.DB2.DB2ErrorCollection                    |55               |
IBM.Data.DB2.DB2Exception                          |185              |
IBM.Data.DB2.DB2Parameter                          |1 853            |
IBM.Data.DB2.DB2ParameterCollection                |1 383            |
IBM.Data.DB2.DB2RowUpdatedEventHandler             |0                |
IBM.Data.DB2.DB2RowUpdatedEventArgs                |14               |
IBM.Data.DB2.DB2Type                               |0                |
IBM.Data.DB2.DB2XmlReader                          |500              |
--------------------------------------------------+-----------------+
Sum:                                               |14 635           |
Average:                                           |304.9            |
Minimum:                                           |0                |
Maximum:                                           |4 220            |
Standard deviation:                                |868.22           |
Variance:                                          |753 808          |
--------------------------------------------------+-----------------+
Our code does not use those types and enums directly.
This query returns the methods (respectively, the types) that are public and could not be made internal; hence, it returns the methods/types that are indeed used outside of their declaring assembly:
SELECT METHODS/TYPES WHERE IsPublic AND !CouldBeInternal