How to handle this use case (running-window data) in Spark Scala
I am using Spark SQL 2.4.1 with Java 1.8.
I have source data as below:
val df_data = Seq(
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2020-03-01"),
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-06-01"),
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-03-01"),
("Indus_2","Indus_2_Name","Country1", "State2",21789933,"2020-03-01"),
("Indus_2","Indus_2_Name","Country1", "State2",300789933,"2018-03-01"),
("Indus_3","Indus_3_Name","Country1", "State3",27989978,"2019-03-01"),
("Indus_3","Indus_3_Name","Country1", "State3",30014633,"2017-06-01"),
("Indus_3","Indus_3_Name","Country1", "State3",30014633,"2017-03-01"),
("Indus_4","Indus_4_Name","Country2", "State1",41789978,"2020-03-01"),
("Indus_4","Indus_4_Name","Country2", "State1",41789978,"2018-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2019-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2018-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2017-03-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2020-03-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2020-06-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2018-03-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2020-03-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2020-12-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2019-03-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2018-03-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2018-09-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2016-03-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2020-03-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2019-09-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2016-03-01")
).toDF("industry_id","industry_name","country","state","revenue","generated_date");
Query:
val distinct_gen_date = df_data.select("generated_date").distinct.orderBy(desc("generated_date"))
For each "generated_date" in distinct_gen_date, I need to get all unique industry_ids from the preceding 6 months of data.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, lit, rank}

val ws = Window.partitionBy(col("industry_id")).orderBy(desc("generated_date"))
val newDf = df_data
    .withColumn("rank", rank().over(ws))
    .where(col("rank").equalTo(lit(1)))
    //.drop(col("rank"))
    .select("*")
How do I get a moving aggregate (over the unique industry_ids from the 6 months of data) for each distinct "generated_date"? How can this moving aggregation be achieved?
More details:
For example, assume the given sample data runs from "2020-03-01" back to "2016-03-01". If some industry_x is not present in "2020-03-01", I need to check "2020-02-01", "2020-01-01", "2019-12-01", "2019-11-01", "2019-10-01" and "2019-09-01" sequentially; wherever it is found, its rank-1 row is taken into the data set used for calculating the "2020-03-01" figures. We then move on to "2020-02-01" and do the same, i.e. for each distinct "generated_date" we go back 6 months, collect the unique industries, and pick each one's rank-1 row; that becomes the data set for that date. The next distinct "generated_date" is handled the same way, and so on, so the data set keeps changing per date. I can do this with a for loop, but that gives no parallelism. How can I build the distinct data set for each distinct "generated_date" in parallel?
I don't know how to do this with window functions, but a self-join can solve your problem.
First, you need a DataFrame with distinct dates:
val df_dates = df_data
.select("generated_date")
.withColumnRenamed("generated_date", "distinct_date")
.distinct()
Next, for each row in your industries data you need to calculate up to which date that industry will be included, i.e., add 6 months to generated_date. I think of these as active dates. I've used add_months() to do this, but you could apply different logic.
import org.apache.spark.sql.functions.add_months
val df_active = df_data.withColumn("active_date", add_months(col("generated_date"), 6))
If we start with this data (grouped by date just for readability):
industry_id generated_date
(("Indus_1", ..., "2020-03-01"),
("Indus_1", ..., "2019-12-01"),
("Indus_2", ..., "2019-12-01"),
("Indus_3", ..., "2018-06-01"))
It has now:
industry_id generated_date active_date
(("Indus_1", ..., "2020-03-01", "2020-09-01"),
("Indus_1", ..., "2019-12-01", "2020-06-01"),
("Indus_2", ..., "2019-12-01", "2020-06-01")
("Indus_3", ..., "2018-06-01", "2018-12-01"))
Now proceed with the self-join on the dates, using a join condition that matches your 6-month period:
import org.apache.spark.sql.Column

val condition: Column =
  (col("distinct_date") >= col("generated_date")).and(
    col("distinct_date") <= col("active_date"))
val df_joined = df_dates.join(df_active, condition, "inner")
df_joined has now:
distinct_date industry_id generated_date active_date
(("2020-03-01", "Indus_1", ..., "2020-03-01", "2020-09-01"),
("2020-03-01", "Indus_1", ..., "2019-12-01", "2020-06-01"),
("2020-03-01", "Indus_2", ..., "2019-12-01", "2020-06-01"),
("2019-12-01", "Indus_1", ..., "2019-12-01", "2020-06-01"),
("2019-12-01", "Indus_2", ..., "2019-12-01", "2020-06-01"),
("2018-06-01", "Indus_3", ..., "2018-06-01", "2018-12-01"))
Drop the auxiliary column active_date, or even better, drop duplicates according to your needs:
val df_result = df_joined.dropDuplicates(Seq("distinct_date", "industry_id"))
This drops the duplicated "Indus_1" for "2020-03-01" (it appeared twice because it was retrieved from two different generated_dates):
distinct_date industry_id
(("2020-03-01", "Indus_1"),
("2020-03-01", "Indus_2"),
("2019-12-01", "Indus_1"),
("2019-12-01", "Indus_2"),
("2018-06-01", "Indus_3"))
Related
Tbl_Strata count by distinct individual vs rows
How can I use tbl_strata and get the output to show counts by distinct individual rather than by rows? Also, how can I change the display order of the variable I put in the by= argument of tbl_summary? I have a long table and a wide table with one row per patient. I am not sure how to apply the wide table to this code; I can apply the long table, but I get row counts instead of distinct patient counts. I have included an example of the long table, the wide table, and what I would like the output to look like in the picture. Example code:

# Wide table example
df_Wide <- data.frame(patientICN = c(1, 2, 3, 4, 5),
                      testtype = c("liquid", "tissue", "tissue", "liquid", "liquid"),
                      gene1 = c("unk", "pos", "neg", "neg", "unk"),
                      gene2 = c("pos", "neg", "pos", "unk", "neg"),
                      gene3 = c("neg", "unk", "unk", "pos", "pos"))

# Long table example
df_Long <- data.frame(patientICN = c(1, 1, 2, 2, 3),
                      testtype = c("liquid", "tissue", "tissue", "liquid", "liquid"),
                      gene = c("Gene1", "Gene2", "Gene3", "Gene1", "Gene2"),
                      result = c("Positive", "Negative", "Unknown", "Positive", "Unknown"))

# Table categorized by testtype and result for the long table
df_Long %>%
  select(result, gene, testtype) %>%
  mutate(testcategory = paste("TestType", testtype)) %>%
  tbl_strata(
    strata = testtype,
    .tbl_fun = ~ .x %>%
      tbl_summary(by = result, missing = "no") %>%
      add_n(),
    .header = "**{strata}**, N={n}"
  )
# The above is giving multiple rows per patient counts
Is this what you're after? You can install the bstfun package from my R-universe: https://ddsjoberg.r-universe.dev/ui#packages

library(gtsummary)
library(dplyr)

# Long table example
df_Long <- data.frame(patientICN = c(1, 1, 2, 2, 3),
                      testtype = c("liquid", "tissue", "tissue", "liquid", "liquid"),
                      gene = c("Gene1", "Gene2", "Gene3", "Gene1", "Gene2"),
                      result = c("Positive", "Negative", "Unknown", "Positive", "Unknown"))

tbl <- df_Long %>%
  tidyr::pivot_wider(
    id_cols = c(patientICN, testtype),
    names_from = gene,
    values_from = result,
    values_fill = "Unknown"
  ) %>%
  mutate(across(starts_with("Gene"), ~ factor(.x, levels = c("Positive", "Negative", "Unknown")))) %>%
  tbl_strata(
    strata = testtype,
    ~ .x %>%
      bstfun::tbl_likert(
        include = starts_with("Gene")
      )
  )

Created on 2022-10-06 with reprex v2.0.2
Spark DataFrames Scala - jump to next group during a loop
I have a dataframe as below that records the quarter and date during which the same incident occurs, and to which IDs. I would like to mark an ID and date if the incident happens in two consecutive quarters. This is how I did it:

val arrArray = dtf.collect.map(x => (x(0).toString, x(1).toString, x(2).toString.toInt))
if (arrArray.length > 0) {
  var bufBAQDate = ArrayBuffer[Any]()
  for (a <- 1 to arrArray.length - 1) {
    val (strBAQ1, strDate1, douTime1) = arrArray(a - 1)
    val (strBAQ2, strDate2, douTime2) = arrArray(a)
    if (douTime2 - douTime1 == 15 && strBAQ1 == strBAQ2 && strDate1 == strDate2) {
      bufBAQDate = (strBAQ2, strDate2) +: bufBAQDate
      //println(strBAQ2+" "+strDate2+" "+douTime2)
    }
  }
  val vecBAQDate = bufBAQDate.distinct.toVector.reverse
}

Is there a better way of doing this? As the same incident can happen many times to one ID during a single day, it would be better to jump to the next ID and/or date once an ID and a date have been marked. I don't want to create nested loops to filter the dataframe.
Note that your current solution misses 20210103, since 1400 - 1345 = 55. I think this does the trick:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType
import spark.implicits._

val windowSpec = Window.partitionBy("ID").orderBy("datetime_seconds")

val consecutiveIncidents = dtf
  .withColumn("raw_datetime", concat($"Date", $"Time"))
  .withColumn("datetime", to_timestamp($"raw_datetime", "yyyyMMddHHmm"))
  .withColumn("datetime_seconds", $"datetime".cast(LongType))
  .withColumn("time_delta", ($"datetime_seconds" - lag($"datetime_seconds", 1).over(windowSpec)) / 60)
  .filter($"time_delta" === lit(15))
  .select("ID", "Date")
  .distinct()
  .collect()
  .map { case Row(id, date) => (id, date) }
  .toList

Basically: convert the datetimes to timestamps, then look for records with the same ID and consecutive times separated by 15 minutes. This is done by using lag over a window partitioned by ID and ordered by the time. To calculate the time difference, the timestamp is converted to Unix epoch seconds. If you don't want to count day-crossing incidents, you can add the date to the partitionBy clause of the window, as sketched below.
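A minimal sketch of that last suggestion, reusing the column names above (whether you want to exclude pairs spanning midnight depends on your data):

// Also partition by the calendar date so that two incidents 15 minutes apart
// but on different days are not paired together.
val windowSpecPerDay = Window.partitionBy("ID", "Date").orderBy("datetime_seconds")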
Counting how many times each distinct value occurs in a column in a PySpark SQL join
I have used PySpark SQL to join together two tables, one containing crime location data with longitude and latitude and the other containing postcodes with their corresponding longitude and latitude. What I am trying to work out is how to tally up how many crimes have occurred within each postcode. I am new to PySpark and my SQL is rusty, so I am unsure where I am going wrong. I have tried to use COUNT(DISTINCT) but that is simply giving me the total number of distinct postcodes.

mySchema = StructType([StructField("Longitude", StringType(), True),
                       StructField("Latitude", StringType(), True)])
bgl_df = spark.createDataFrame(burglary_rdd, mySchema)
bgl_df.registerTempTable("bgl")

rdd2 = spark.sparkContext.textFile("posttrans.csv")
mySchema2 = StructType([StructField("Postcode", StringType(), True),
                        StructField("Lon", StringType(), True),
                        StructField("Lat", StringType(), True)])
pcode_df = spark.createDataFrame(pcode_rdd, mySchema2)
pcode_df.registerTempTable("pcode")

count = spark.sql("""
    SELECT COUNT(DISTINCT pcode.Postcode)
    FROM pcode
    RIGHT JOIN bgl ON (bgl.Longitude = pcode.Lon AND bgl.Latitude = pcode.Lat)
""")

+------------------------+
|count(DISTINCT Postcode)|
+------------------------+
|                  523371|
+------------------------+

Instead I want something like:

+--------+---+
|Postcode|Num|
+--------+---+
|LN11 9DA| 2 |
|BN10 8JX| 5 |
| EN9 3YF| 9 |
|EN10 6SS| 1 |
+--------+---+
You can do a groupby count to get a distinct count of values for a column:

group_df = df.groupby("Postcode").count()

You will get the output you want. As an SQL query:

query = """
SELECT pcode.Postcode, COUNT(pcode.Postcode) AS Num
FROM pcode
RIGHT JOIN bgl ON (bgl.Longitude = pcode.Lon AND bgl.Latitude = pcode.Lat)
GROUP BY pcode.Postcode
"""
count = spark.sql(query)

I have also copied in your FROM and JOIN clauses to make the query easier to copy and paste.
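As a side note, to stay consistent with the Scala examples elsewhere on this page, the same join-then-count pattern in Spark Scala might look roughly like this (bgl_df and pcode_df mirror the question's DataFrames; this is an illustrative sketch, not tested against the original data):

// Join crimes to postcodes on coordinates, then count rows per postcode.
val perPostcode = pcode_df
  .join(bgl_df, bgl_df("Longitude") === pcode_df("Lon") && bgl_df("Latitude") === pcode_df("Lat"))
  .groupBy("Postcode")
  .count()
  .withColumnRenamed("count", "Num")

perPostcode.show()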
How to get the value of the previous row in Scala Apache Spark RDD[Row]?
I need to get the value from the previous or next row while iterating through an RDD[Row]:

(10,1,string1)
(11,1,string2)
(21,1,string3)
(22,1,string4)

I need to sum the strings for rows where the difference between the 1st values is not higher than 3. The 2nd value is the ID. So the result should be:

(1, string1string2)
(1, string3string4)

I tried using groupBy, reduce and partitioning, but I still can't achieve what I want. I'm trying to make something like this (I know it's not the proper way):

rows.groupBy(row => {
  row(1)
}).map(rowList => {
  rowList.reduce((acc, next) => {
    diff = next(0) - acc(0)
    if (diff <= 3) {
      val strings = acc(2) + next(2)
      (acc(1), strings)
    } else {
      // create new group to aggregate strings
      (acc(1), acc(2))
    }
  })
})

I wonder if my idea is a proper way to solve this problem. Looking for help!
I think you can use sqlContext to solve your problem by using the lag function.

Create the RDD:

val rdd = sc.parallelize(List(
  (10, 1, "string1"),
  (11, 1, "string2"),
  (21, 1, "string3"),
  (22, 1, "string4"))
)

Create the DataFrame (note: the third field is a string, so it should not be converted with toInt):

val df = rdd.map(rec => (rec._1, rec._2, rec._3)).toDF("a", "b", "c")

Register your DataFrame:

df.registerTempTable("df")

Query the result:

val res = sqlContext.sql("""
  SELECT CASE WHEN l < 3 THEN ROW_NUMBER() OVER (ORDER BY b) - 1
              ELSE ROW_NUMBER() OVER (ORDER BY b)
         END m, b, c
  FROM (
    SELECT b,
           (a - CASE WHEN lag(a, 1) OVER (ORDER BY a) IS NOT NULL
                     THEN lag(a, 1) OVER (ORDER BY a)
                     ELSE 0 END) l,
           c
    FROM df) A
""")

Show the results:

res.show

I hope this will help.
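An alternative sketch that stays in the DataFrame API instead of SQL: use lag plus a running sum to start a new group whenever the gap between consecutive "a" values exceeds 3, then concatenate each group's strings in order. The column names a, b, c are the same placeholders as in the answer above; this is an illustrative outline, not a tested drop-in replacement.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (10, 1, "string1"),
  (11, 1, "string2"),
  (21, 1, "string3"),
  (22, 1, "string4")
).toDF("a", "b", "c")

val w = Window.partitionBy("b").orderBy("a")

// Flag the start of a new group (no previous row, or gap to it greater than 3),
// then turn the flags into group ids with a running sum over the same window.
val withGroups = df
  .withColumn("prev_a", lag(col("a"), 1).over(w))
  .withColumn("new_group", when(col("prev_a").isNull || col("a") - col("prev_a") > 3, 1).otherwise(0))
  .withColumn("group_id", sum(col("new_group")).over(w))

// Collect each group's rows, sort them by "a", and join the strings.
val result = withGroups
  .groupBy("b", "group_id")
  .agg(sort_array(collect_list(struct(col("a"), col("c")))).as("rows"))
  .select(col("b"), array_join(col("rows").getField("c"), "").as("joined"))

result.show(false)
// Expected: (1, string1string2) and (1, string3string4)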
How to match starting date to the closest ending date in order to get the time difference
SELECT OUT.EMP_ID,
       OUT.DT_TM "DateTimeOut",
       IN.DT_TM "DateTimeIn",
       cast(timestampdiff(4, char(timestamp(IN.DT_TM) - timestamp(OUT.DT_TM))) as decimal(30,1)) / 60 "Duration Out"
FROM (
  select e1.EMP_ID, e1.DT_TM
  from hr.timeout e1
  WHERE month(e1.DT_TM) = 09 and year(e1.DT_TM) = 2016
    AND e1.CD = 'OUT'
) OUT
LEFT JOIN (
  select e2.EMP_ID, e2.DT_TM
  from hr.timeout e2
  WHERE month(e2.dt_tm) = 09 and year(e2.dt_tm) = 2016
    AND e2.CD = 'IN'
) IN on out.EMP_ID = in.EMP_ID

Trying to get the closest DateTimeIn match with the DateTimeOut. Currently it repeats the same DateTimeOut and DateTimeIn multiple times.
I think it's normal, because your table doesn't have a unique constraint on emp_id, dt_tm and cd. But if you want a unique result, try this:

With Period as (
  select e1.EMP_ID, e1.DT_TM, e1.CD
  from hr.timeout e1
  WHERE month(e1.DT_TM) = 09 and year(e1.DT_TM) = 2016
)
Select distinct OUT.EMP_ID,
       IN.DT_TM as DateTimeIn,
       OUT.DT_TM as DateTimeOut,
       TIMESTAMPDIFF(2, CAST(timestamp(IN.DT_TM) - timestamp(ifnull(OUT.DT_TM, current date)) AS CHAR(22))) as DurationSecond
from Period in
left outer join Period out
  on out.EMP_ID = in.EMP_ID and out.CD = 'OUT' and in.CD = 'IN'
order by 1, 2, 3

As you can see, I use timestampdiff with 2 as the first parameter to get seconds (you divided by 60), I use ifnull because you do a left outer join (out.DT_TM can be null), and I use distinct to get a unique result.