xbar, last keyword and subtract from table column - kdb

a:([]time:(2021.01.31D22:18:00.000000000;2021.01.31D22:18:27.134000000;2021.01.31D22:18:27.834000000;2021.01.31D22:21:14.284000000);val:(3.2;2.9;3.9;6.8))
time                          val
---------------------------------
2021.01.31D22:18:00.000000000 3.2
2021.01.31D22:18:27.134000000 2.9
2021.01.31D22:18:27.834000000 3.9
2021.01.31D22:21:14.284000000 6.8
a1:select last val by 0D00:01 xbar time from a
time                         | val
-----------------------------| ---
2021.01.31D22:18:00.000000000| 3.9
2021.01.31D22:21:00.000000000| 6.8
a2:update diff:val - last val by 0D00:01 xbar time from a
time                          val diff
--------------------------------------
2021.01.31D22:18:00.000000000 3.2 -0.7
2021.01.31D22:18:27.134000000 2.9 -1
2021.01.31D22:18:27.834000000 3.9 0
2021.01.31D22:21:14.284000000 6.8 0
For the 2nd and 3rd rows in a2, where there are no matching time values in a1, how does the q query work so that each val has the "last" val for its minute subtracted from it? Is there a general rule for understanding the use of xbar here, or a reference with similar examples I could read?
Appreciate your help.

This alternative approach should help you understand how the grouping by xbar works. 0D00:01 xbar time simply rounds each timestamp down to the start of its minute, and those rounded values define the groups. The general rule is that in update ... by g from t an aggregation on the right-hand side is evaluated once per group defined by g, and its result is broadcast back to every row of that group (unlike select ... by, which returns one row per group, as in a1). So no lookup against a1 takes place: each row's diff is its own val minus the last val of its minute bucket. You can see the groups explicitly like this:
q)ungroup{update diff:val-last val from x}each`grouper xgroup update grouper:0D00:01 xbar time from a
grouper                       time                          val diff
--------------------------------------------------------------------
2021.01.31D22:18:00.000000000 2021.01.31D22:18:00.000000000 3.2 -0.7
2021.01.31D22:18:00.000000000 2021.01.31D22:18:27.134000000 2.9 -1
2021.01.31D22:18:00.000000000 2021.01.31D22:18:27.834000000 3.9 0
2021.01.31D22:21:00.000000000 2021.01.31D22:21:14.284000000 6.8 0

Related

How to round timestamp to 10 minutes in Spark 3.0?

I have a timestamp like this in $"my_col":
2022-01-21 22:11:11
With date_trunc("minute", $"my_col") I get
2022-01-21 22:11:00
and with date_trunc("hour", $"my_col") I get
2022-01-21 22:00:00
What is a Spark 3.0 way to get
2022-01-21 22:10:00
?
Convert the timestamp into seconds using the unix_timestamp function, then divide by 600 (10 minutes), round the result of the division, and multiply by 600 again:
val df = Seq(
  ("2022-01-21 22:11:11"),
  ("2022-01-21 22:04:04"),
  ("2022-01-21 22:19:34"),
  ("2022-01-21 22:57:14")
).toDF("my_col").withColumn("my_col", to_timestamp($"my_col"))
df.withColumn(
  "my_col_rounded",
  from_unixtime(round(unix_timestamp($"my_col") / 600) * 600)
).show
//+-------------------+-------------------+
//|my_col             |my_col_rounded     |
//+-------------------+-------------------+
//|2022-01-21 22:11:11|2022-01-21 22:10:00|
//|2022-01-21 22:04:04|2022-01-21 22:00:00|
//|2022-01-21 22:19:34|2022-01-21 22:20:00|
//|2022-01-21 22:57:14|2022-01-21 23:00:00|
//+-------------------+-------------------+
You can also truncate the original timestamp to hours, take the minutes, round them to the nearest 10, and add them back to the truncated timestamp using an interval:
df.withColumn(
  "my_col_rounded",
  date_trunc("hour", $"my_col") + format_string(
    "interval %s minute",
    expr("round(extract(MINUTE FROM my_col)/10.0)*10")
  ).cast("interval")
)

Replace date value in pyspark by maximum of two column

I'm using PySpark 3.0.1. I have a dataframe df with the following details:
ID Class dateEnrolled dateStarted
32 1 2016-01-09 2016-01-26
25 1 2016-01-09 2016-01-10
33 1 2016-01-16 2016-01-05
I need to replace dateEnrolled with the later of the two date fields, and my data should look like:
ID Class dateEnrolled dateStarted
32 1 2016-01-26 2016-01-26
25 1 2016-01-10 2016-01-10
33 1 2016-01-16 2016-01-05
Can you suggest me how to do that?
You can use greatest:
import pyspark.sql.functions as F
df2 = df.withColumn('dateEnrolled', F.greatest('dateEnrolled', 'dateStarted'))
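For reference, a minimal end-to-end sketch on the sample rows from the question; the SparkSession setup and the date literals here are my own assumptions for illustration, and df/df2 reuse the names from the answer above:
import datetime

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question, typed as DateType columns
df = spark.createDataFrame(
    [
        (32, 1, datetime.date(2016, 1, 9), datetime.date(2016, 1, 26)),
        (25, 1, datetime.date(2016, 1, 9), datetime.date(2016, 1, 10)),
        (33, 1, datetime.date(2016, 1, 16), datetime.date(2016, 1, 5)),
    ],
    ['ID', 'Class', 'dateEnrolled', 'dateStarted'],
)

# greatest picks the later of the two dates row by row
# (per the Spark docs it skips nulls, returning null only if all inputs are null)
df2 = df.withColumn('dateEnrolled', F.greatest('dateEnrolled', 'dateStarted'))
df2.show()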

Find minimum and maximum of year and month in spark scala

I would like to find the minimum year and month and the maximum year and month in a Spark dataframe. Below is my dataframe:
code year month
xx 2004 1
xx 2004 2
xxx 2004 3
xx 2004 6
xx 2011 12
xx 2018 10
I want the minimum year and month as 2004-1 and the maximum year and month as 2018-10.
The solution I tried is:
val minAnMaxYearAndMonth = dataSet.agg(min(Year),max(Month)).head()
val minYear = minAnMaxYearAndMonth(0)
val maxYear = minAnMaxYearAndMonth(1)
val minMonth = dataSet.select(Month).where(col(Year) === minYear).take(1)
val maxMonth = dataSet.select(Month).where(col(Year) === maxYear).take(1)
I am getting minYear and maxYear, but not the min and max month. Please help.
You could use struct to make tuples out of the years and months and then rely on tuple ordering: tuples are compared primarily by the leftmost component, with the next component used as a tie-breaker.
df.select(struct("year", "month") as "ym")
  .agg(min("ym") as "min", max("ym") as "max")
  .selectExpr("stack(2, 'min', min.*, 'max', max.*) as (agg, year, month)")
  .show()
Output:
+---+----+-----+
|agg|year|month|
+---+----+-----+
|min|2004|    1|
|max|2018|   10|
+---+----+-----+

Pyspark - How to concatenate columns of multiple dataframes into columns of one dataframe

I have multiple data frames (24 in total), each with one column. I need to combine all of them into a single data frame. I created indexes and joined on them, but joining all of them is quite slow (they all have the same number of rows).
Please note that I'm using PySpark 2.1.
from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number

w = Window().orderBy(lit('A'))
df1 = df1.withColumn('Index', row_number().over(w))
df2 = df2.withColumn('Index', row_number().over(w))
joined_df = df1.join(df2, df1.Index == df2.Index, 'inner').drop(df2.Index)
df3 = df3.withColumn('Index', row_number().over(w))
joined_df = joined_df.join(df3, joined_df.Index == df3.Index).drop(df3.Index)
But as joined_df grows, it keeps getting slower.
DF1:
Col1
2
8
18
12
DF2:
Col2
abc
bcd
def
bbc
DF3:
Col3
1.0
2.2
12.1
1.9
Expected Results:
joined_df:
Col1 Col2 Col3
2 abc 1.0
8 bcd 2.2
18 def 12.1
12 bbc 1.9
You're doing it the correct way. Unfortunately, without a primary key, Spark is not well suited to this type of operation.
Answer by pault, pulled from a comment.
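For what it's worth, here is a sketch, not from the original answer, of the question's own row_number pattern generalized over a list of DataFrames with functools.reduce. The dfs list below just rebuilds the three sample frames from the question, and the constant ordering still forces everything into a single partition, so it carries the same slowness caveat at scale:
from functools import reduce

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import lit, row_number

spark = SparkSession.builder.getOrCreate()

# the sample single-column frames from the question; in practice dfs would hold all 24
dfs = [
    spark.createDataFrame([(2,), (8,), (18,), (12,)], ['Col1']),
    spark.createDataFrame([('abc',), ('bcd',), ('def',), ('bbc',)], ['Col2']),
    spark.createDataFrame([(1.0,), (2.2,), (12.1,), (1.9,)], ['Col3']),
]

# same trick as in the question: row numbers over a constant ordering
w = Window.orderBy(lit('A'))
indexed = [df.withColumn('Index', row_number().over(w)) for df in dfs]

# fold the pairwise index join over the whole list instead of writing each join by hand
joined_df = reduce(lambda left, right: left.join(right, on='Index', how='inner'), indexed)
joined_df = joined_df.drop('Index')
joined_df.show()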

Scala Operation on TimeStamp values

I have the input as a timestamp; based on some condition, I need to subtract 1 second or 3 months using Scala programming.
Input:
val date :String = "2017-10-31T23:59:59.000"
Output:
For Minus 1 sec
val lessOneSec = "2017-10-31T23:59:58.000"
For Minus 3 Months
val less3Mon = "2017-07-31T23:59:58.000"
How do I convert a string value to a Timestamp and perform operations like these subtractions in Scala?
I assume you are working with DataFrames, since you have the spark-dataframe tag.
You can use a SQL INTERVAL to subtract the time, but your column needs to be of timestamp type for that:
df.show(false)
+-----------------------+
|ts                     |
+-----------------------+
|2017-10-31T23:59:59.000|
+-----------------------+
import org.apache.spark.sql.functions._
df.withColumn("minus1Sec", date_format($"ts".cast("timestamp") - expr("interval 1 second"), "yyyy-MM-dd'T'HH:mm:ss.SSS"))
  .withColumn("minus3Mon", date_format($"ts".cast("timestamp") - expr("interval 3 month"), "yyyy-MM-dd'T'HH:mm:ss.SSS"))
  .show(false)
+-----------------------+-----------------------+-----------------------+
|ts                     |minus1Sec              |minus3Mon              |
+-----------------------+-----------------------+-----------------------+
|2017-10-31T23:59:59.000|2017-10-31T23:59:58.000|2017-07-31T23:59:59.000|
+-----------------------+-----------------------+-----------------------+
Or try the code below, using joda-time:
import org.joda.time.LocalDateTime
import org.joda.time.format.DateTimeFormat

val yourDate = "2017-10-31T23:59:59.000"
val formater = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss.SSS")
val date = LocalDateTime.parse(yourDate, formater)
println(date.minusSeconds(1).toString(formater))
println(date.minusMonths(3).toString(formater))
Output
2017-10-31T23:59:58.000
2017-07-31T23:59:59.000
Look at the joda-time library; it has all the APIs you need to subtract seconds or months from a timestamp:
http://www.joda.org/joda-time/
sbt dependency:
"joda-time" % "joda-time" % "2.9.9"