group by and pivot without aggregation in pyspark

I am trying to group by a value in PySpark.
Here is my sample data, where I want to group by deviceID and ts (present in the signalId column).
I want to aggregate the data (max) for each session started (ts) on each device (deviceID).
The expected output is one row per deviceID and ts, with a column for each signalId holding the max value for that session.
To achieve this, I need to pivot signalId without aggregation (sum, avg, max, min, etc.).
Then I need to group by deviceID and ts and apply a max aggregation to achieve my result.
But I am not able to apply pivot without aggregation.
sample data:
df = sc.parallelize([['2020-04-07T15:50:36.618Z','A','p','Number of Packets','number','2'],
['2020-04-07T15:50:36.618Z','A','d','Session Duration','s','3'],
['2020-04-07T15:50:36.618Z','A','type','Connection Type','number','0'],
['2020-04-07T15:50:36.618Z','A','ts','Time Session Started','ms','4/7/2020 3:50:36 PM'],
['2020-04-07T15:51:37.127Z','A','p','Number of Packets','number','670810'],
['2020-04-07T15:51:37.127Z','A','d','Session Duration','s','61'],
['2020-04-07T15:51:37.127Z','A','type','Connection Type','number','0'],
['2020-04-07T15:51:37.127Z','A','ts','Time Session Started','ms','4/7/2020 3:50:36 PM'],
['2020-04-07T17:50:29.93Z','B','p','Number of Packets','number','3'],
['2020-04-07T17:50:29.93Z','B','d','Session Duration','s','2'],
['2020-04-07T17:50:29.93Z','B','type','Connection Type','number','1'],
['2020-04-07T17:50:29.93Z','B','ts','Time Session Started','ms','4/7/2020 5:50:29 PM'],
['2020-04-07T17:51:55.675Z','B','p','Number of Packets','number','2'],
['2020-04-07T17:51:55.675Z','B','d','Session Duration','s','5'],
['2020-04-07T17:51:55.675Z','B','type','Connection Type','number','1'],
['2020-04-07T17:51:55.675Z','B','ts','Time Session Started','ms','4/7/2020 5:55:55 PM']
]).toDF(("timestamp","deviceID","signalId","signalname","unit","value"))
df.show()
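Since Spark's pivot always requires an aggregate, one common workaround is to use first() as a pass-through aggregate when each (timestamp, deviceID, signalId) combination carries exactly one value, and then group by deviceID and ts with max. A minimal sketch under that assumption:

from pyspark.sql import functions as F

# Pivot signalId into columns; first() simply passes the single value through.
pivoted = (df.groupBy("timestamp", "deviceID")
             .pivot("signalId", ["p", "d", "type", "ts"])
             .agg(F.first("value")))

# Group by device and session start (ts) and take the max of the numeric signals.
result = (pivoted.groupBy("deviceID", "ts")
                 .agg(F.max(F.col("p").cast("long")).alias("p"),
                      F.max(F.col("d").cast("long")).alias("d"),
                      F.max(F.col("type").cast("long")).alias("type")))
result.show(truncate=False)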

Related

groupBy Id and get multiple records for multiple columns in scala

I have a spark dataframe as below.
val df = Seq(("a",1,1400),("a",1,1250),("a",2,1200),("a",4,1250),("a",4,1200),("a",4,1100),("b",2,2500),("b",2,1250),("b",2,500),("b",4,250),("b",4,200),("b",4,100),("b",4,100),("b",5,800)).
toDF("id","hierarchy","amount")
I am working in Scala and trying to get the result shown below from this data frame.
val df = Seq(("a",1,1400),("a",4,1250),("a",4,1200),("a",4,1100),("b",2,2500),("b",2,1250),("b",4,250),("b",4,200),("b",4,100),("b",5,800)).
toDF("id","hierarchy","amount")
Rules: grouped by id, if min(hierarchy) == 1 then I take the row with the highest amount and then go on to analyze hierarchy >= 4, taking the top 3 of those rows in descending order of amount. On the other hand, if min(hierarchy) == 2 then I take the two rows with the highest amount and then go on to analyze hierarchy >= 4, again taking the top 3 in descending order of amount. And so on for all the ids in the data.
Thanks for the suggestions.
You may use window functions to generate the criteria on which you will filter, e.g.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val results = df
  // minimum hierarchy per id
  .withColumn("minh", min("hierarchy").over(Window.partitionBy("id")))
  // rank rows by amount (descending) within each id
  .withColumn("rnk", rank().over(Window.partitionBy("id").orderBy(col("amount").desc)))
  // row number by amount (descending) among the rows with hierarchy >= 4 within each id
  .withColumn(
    "rn4",
    when(col("hierarchy") >= 4, row_number().over(
      Window.partitionBy(col("id"), when(col("hierarchy") >= 4, 1).otherwise(0)).orderBy(col("amount").desc)
    )).otherwise(5)
  )
  .filter("rnk <= minh or rn4 <= 3")
  .select("id", "hierarchy", "amount")
NB: a more verbose filter would be .filter("(rnk <= minh or rn4 <= 3) and (minh in (1,2))")
The temporary columns generated by the window functions to support the filtering criteria are:
minh: the minimum hierarchy for each id group, used to select the top minh rows from that group.
rnk: ranks the rows in each group by amount, highest first.
rn4: ranks the rows with hierarchy >= 4 in each group by amount, highest first.

Aggregation on updating order data in Druid

I have data streaming from Kafka into Druid. It's de-normalized eCommerce order event data, where the status and a few other fields get updated with every event.
I need to run an aggregate query based on timestamp using only the most recent entry for each order.
For example, if the data sample is:
{"orderId":"123","status":"Initiated","items":"item","qty":1,"paymentId":null,"shipmentId":null,"timestamp":"2021-03-05T01:02:33Z"}
{"orderId":"abc","status":"Initiated","items":"item","qty":1,"paymentId":null,"shipmentId":null,"timestamp":"2021-03-05T01:03:33Z"}
{"orderId":"123","status":"Shipped","items":"item","qty":1,"paymentId":null,"shipmentId":null,"timestamp":"2021-03-07T02:03:33Z"}
Now, if I want to query for all orders stuck in the "Initiated" status for more than 2 days, then for the above data the result should only contain orderId "abc".
But if I query something like
SELECT orderId, qty, paymentId FROM order WHERE status = 'Initiated' AND "timestamp" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
this query returns both orders "123" and "abc", but 123 has a later event received after 2 days, so its earlier events should not be included in the result.
Is there any good, optimized way to handle this kind of scenario in Apache Druid?
One way I was thinking of is to keep a separate lookup table with orderId and the latest status, and join it with the above aggregation query on orderId and status.
EDIT 1:
This query works, but it joins on the whole table, which can hit a resource limit exception for big datasets:
WITH maxOrderTime (orderId, "__time") AS
(
SELECT orderId, max("__time") FROM inline_data
GROUP BY orderId
)
SELECT inline_data.orderId FROM inline_data
JOIN maxOrderTime
ON inline_data.orderId = maxOrderTime.orderId
AND inline_data."__time" = maxOrderTime."__time"
WHERE inline_data.status='Initiated' and inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
EDIT 2:
Tried with:
SELECT
inline_data.orderID,
MAX(LOOKUP(status, 'status_as_number')) as last_status
FROM inline_data
WHERE
inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderID
HAVING last_status = 1
But it gives this error:
Error: Unknown exception
Error while applying rule DruidQueryRule(AGGREGATE), args
[rel#1853:LogicalAggregate.NONE.,
rel#1863:DruidQueryRel.NONE.[](query={"queryType":"scan","dataSource":{"type":"table","name":"inline_data"},"intervals":{"type":"intervals","intervals":["-146136543-09-08T08:23:32.096Z/2021-03-14T09:57:05.000Z"]},"virtualColumns":[{"type":"expression","name":"v0","expression":"lookup("status",'status_as_number')","outputType":"STRING"}],"resultFormat":"compactedList","batchSize":20480,"order":"none","filter":null,"columns":["orderID","v0"],"legacy":false,"context":{"sqlOuterLimit":100,"sqlQueryId":"fbc167be-48fc-4863-b3a8-b8a7c45fb60f"},"descending":false,"granularity":{"type":"all"}},signature={orderID:LONG,
v0:STRING})]
java.lang.RuntimeException
I think this can be done more easily. If you map the status to a numeric representation, it becomes much simpler to work with.
First use an inline lookup to replace the status. See this page for how to define a lookup: https://druid.apache.org/docs/0.20.1/querying/lookups.html
Now, we have for example these values in a lookup named status_as_number:
Initiated = 1
Shipped = 2
Since we now have a numeric representation, you can simply do a group-by query and take the max status number. A query like this should be sufficient:
SELECT
inline_data.orderId,
MAX(LOOKUP(status, 'status_as_number')) as last_status
FROM inline_data
WHERE
inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderId
HAVING last_status = 1
Note: this query is not tested. The HAVING part makes sure that you only see orders whose maximum status is Initiated, i.e. orders that never progressed past it.
I hope this solves your problem.

How to reduce multiple joins in spark

I am using Spark 2.4.1 to compute some ratios on my data frame.
I need to derive different ratio factors from the ratio columns of a given data frame (df_data) by joining it to a meta dataframe (resDs).
I am currently getting these ratio factors (i.e. ratio_1_factor, ratio_2_factor and ratio_3_factor) using three different joins with different join conditions, i.e. joinedDs, joinedDs2 and joinedDs3.
Is there any alternative that reduces the number of joins and makes this work optimally?
You can find the entire sample data in the below public URL.
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1165111237342523/3521103084252405/7035720262824085/latest.html
How to handle multiple steps instead of a single step in the when clause:
.withColumn("step_1_ratio_1", (col("ratio_1").minus(lit(0.00000123))).cast(DataTypes.DoubleType)) // step-2
.withColumn("step_2_ratio_1", (col("step_1_ratio_1").multiply(lit(0.02))).cast(DataTypes.DoubleType)) //step-3
.withColumn("step_3_ratio_1", (col("step_2_ratio_1").divide(col("step_1_ratio_1"))).cast(DataTypes.DoubleType)) //step-4
.withColumn("ratio_1_factor", (col("ratio_1_factor")).cast(DataTypes.DoubleType)) //step-5
i.e. "ratio_1_factor" calucated based on various other columns in the dataframe , df_data
these steps -2,3,4 , are being used in other ratio_factors calculation too. i.e. ratio_2_factor, ratio_2_factor
how this should be handled ?
You can join once and calculate the ratio_1_factor, ratio_2_factor and ratio_3_factor columns using the max and when functions in the aggregation:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DoubleType

val joinedDs = df_data.as("aa")
.join(
broadcast(resDs.as("bb")),
col("aa.g_date").between(col("bb.start_date"), col("bb.end_date"))
)
.groupBy("item_id", "g_date", "ratio_1", "ratio_2", "ratio_3")
.agg(
max(when(
col("aa.ratio_1").between(col("bb.A"), col("bb.A_lead")),
col("ratio_1").multiply(lit(0.1))
)
).cast(DoubleType).as("ratio_1_factor"),
max(when(
col("aa.ratio_2").between(col("bb.A"), col("bb.A_lead")),
col("ratio_2").multiply(lit(0.2))
)
).cast(DoubleType).as("ratio_2_factor"),
max(when(
col("aa.ratio_3").between(col("bb.A"), col("bb.A_lead")),
col("ratio_3").multiply(lit(0.3))
)
).cast(DoubleType).as("ratio_3_factor")
)
joinedDs.show(false)
//+-------+----------+---------+-----------+-----------+---------------------+--------------+--------------+
//|item_id|g_date |ratio_1 |ratio_2 |ratio_3 |ratio_1_factor |ratio_2_factor|ratio_3_factor|
//+-------+----------+---------+-----------+-----------+---------------------+--------------+--------------+
//|50312 |2016-01-04|0.0456646|0.046899415|0.046000415|0.0045664600000000005|0.009379883 |0.0138001245 |
//+-------+----------+---------+-----------+-----------+---------------------+--------------+--------------+

How to group by date and calculate the averages at the same time

I am quite new to this, so here it goes: I am trying to convert from unixtime to a date format and then group by that date while calculating the average of another column. This is in MariaDB.
CREATE OR REPLACE
VIEW `history_uint_view` AS select
`history_uint`.`itemid` AS `itemid`,
date(from_unixtime(`history_uint`.`clock`)) AS `mydate`,
AVG(`history_uint`.`value`) AS `value`
from
`history_uint`
where
((month(from_unixtime(`history_uint`.`clock`)) = month((now() - interval 1 month)))
and (`history_uint`.`value` in (1, 0))
and (`history_uint`.`itemid` in (54799, 54810, 54821, 54832, 54843, 54854, 54865, 54876, 54887, 54898, 54909, 54920, 58165, 58226, 59337, 59500, 59503, 59506, 60621, 60624, 60627, 60630, 60633, 60636, 60639, 60642, 60645, 60648, 60651, 60654, 60657, 60660, 60663, 60666, 60669, 60672, 60675, 60678, 60681, 60684, 60687, 60690, 60693, 60696, 60699, 64610)))
GROUP by 'itemid', 'mydate', 'value'
When you select aggregate functions (like AVG) together with non-aggregated columns, you should list all the non-aggregated columns in the GROUP BY clause.
So your group by should look like:
GROUP by itemid, mydate
If you use single quotes (like 'itemid'), MariaDB treats them as strings, not columns.

sub query in select clause with hive

I am unable to figure out a way to achieve the functionality below through a valid query in Hive. The intention is to get the top rated movies released in a year, based on a weighted average.
To be more clear, this is what I would like to be able to do in Hive in a single query:
var allMoviesRated = select count(movieid) from u_data where year(from_unixtime(unixtime)) = 1997;
select movieid, avg(rating), count(movieid), avg(rating)/allMoviesRated as weighted from
(select movieid, rating from u_data where year(from_unixtime(unixtime)) = 1997) u_data_new group by movieid order by weighted desc limit 10;
Sadly, I don't think there is a way to do this in a single query using a subquery to get the count of all movies rated.
You may write a script which executes two queries:
The first query fetches allMoviesRated, which is stored in a script variable.
The second query is your ranking query, to which this value is passed using hiveconf.
Thus your script can look like
your script (bash or python) ------------ start --------
allMoviesRated = os.cmd(hive -S -e "use db; select count(distinct movieid) from u_data;")
ranking = os.cmd(hive -S -hiveconf NUM_MOVIES=allMoviesRated -f ranking_query.hql)
your script (bash or python) ------------ end --------
ranking_query.hql:
select movieid, avg(rating), count(movieid), avg(rating)/${hiveconf:NUM_MOVIES} as weighted
from (
  select movieid, rating
  from u_data where year(from_unixtime(unixtime)) = 1997) u_data_new
group by movieid order by weighted desc limit 10;
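As a minimal sketch of that driver script in Python (assuming the hive CLI is on the PATH and the ratings table is u_data, as in the question):

import subprocess

# First query: fetch the total number of rated movies and store it in a variable.
count_sql = "use db; select count(distinct movieid) from u_data;"
all_movies_rated = subprocess.check_output(["hive", "-S", "-e", count_sql], text=True).strip()

# Second query: run the ranking query, passing the count via -hiveconf.
subprocess.run(
    ["hive", "-S", "-hiveconf", f"NUM_MOVIES={all_movies_rated}", "-f", "ranking_query.hql"],
    check=True,
)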