Using an iterator on a table - kdb

I have this table:
A:2.34889 2.484112 1.045939 3.359097 1.642348 1.298948 3.046995 4.077684
B:3.845017 3.762336 3.287893 3.338063 5.861462 5.401914 3.537128 5.27197
t:([] AA:A;BB:B)
-1 + prd select (-1#AA)%(1#AA) from t
-1 + prd select (-1#BB)%(1#BB) from t
which outputs
AA| 0.7360047
BB| 0.3711175
I was wondering how I can modify the last two lines into a single line that iterates over AA and BB? For example, if I had 10 symbols, I would only have to write a single line to output the 10 results.
Also apologies on the question title, I am not sure how to phrase it well but am happy to edit if required.

Iterators on tables can operate either row wise (demonstrated with 0N! below):
0N!/: t
`AA`BB!2.34889 3.845017
`AA`BB!2.484112 3.762336
`AA`BB!1.045939 3.287893
`AA`BB!3.359097 3.338063
`AA`BB!1.642348 5.861462
`AA`BB!1.298948 5.401914
`AA`BB!3.046995 3.537128
`AA`BB!4.077684 5.27197
Or column wise with flip:
0N!/: flip t
2.34889 2.484112 1.045939 3.359097 1.642348 1.298948 3.046995 4.077684
3.845017 3.762336 3.287893 3.338063 5.861462 5.401914 3.537128 5.27197
For this case, you could do the latter and apply your function to all columns with the each iterator:
{-1+prd last[x]%first x} each flip t
AA| 0.7360047
BB| 0.3711175
Use # or select to get the subset of columns you want to apply the function to if needs be:
{-1+prd last[x]%first x} each flip `AA`BB#t
AA| 0.7360047
BB| 0.3711175
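For readers more at home in Python, the same column-wise idea can be sketched outside kdb. This is only an illustration of the logic (one function applied to every column), assuming the table is held as a plain dict of lists; `total_return` is a hypothetical name and nothing here is q or a kdb API.

```python
# Plain-Python sketch of "apply one function to each column".
# Assumption: the table is a dict mapping column name -> list of floats.
table = {
    "AA": [2.34889, 2.484112, 1.045939, 3.359097, 1.642348, 1.298948, 3.046995, 4.077684],
    "BB": [3.845017, 3.762336, 3.287893, 3.338063, 5.861462, 5.401914, 3.537128, 5.27197],
}


def total_return(column):
    """Last value divided by first value, minus 1 (mirrors -1+last[x]%first x)."""
    return column[-1] / column[0] - 1


# One expression covers every column, however many there are.
returns = {name: total_return(values) for name, values in table.items()}

for name, value in returns.items():
    print(f"{name}| {value:.7f}")
```

Adding more columns to `table` needs no change to the comprehension, which is the property the question asks for.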
More generally, when building up similar code to apply to a list of columns, it is useful to be aware of functional form: https://code.kx.com/q/basics/funsql/
parse "exec AA:{-1+prd last[x]%first x} AA from t"
?
`t
()
()
(,`AA)!,({-1+prd last[x]%first x};`AA)
// or cls:cols t
cls:`AA`BB ;
?[t;();();cls!({-1+prd last[x]%first x}),/:cls]
AA| 0.7360047
BB| 0.3711175

Matt's answer is better and more general, but in your particular example the logic can be as simple as:
q)-1+last[t]%first t
AA| 0.7360047
BB| 0.3711175

No need to iterate through the columns.
The best use of iterators here is Each Left to apply both first and last to t.
q)(last;first)@\:t
AA BB
-----------------
4.077684 5.27197
2.34889 3.845017
That table is a 2-list, so you can apply Divide.
q)-1+(%).(last;first)@\:t
AA| 0.7360047
BB| 0.3711175
To define this for re-use, it’s a composition of three unaries, here spaced for clarity:
q)f:-1+ (%). (last;first)@\:
q)f t
AA| 0.7360047
BB| 0.3711175
Works for any number of columns.

Related

PySpark: Group by two columns, count the pairs, and divide the average of two different columns

I have a dataframe with several columns, some of which are labeled PULocationID, DOLocationID, total_amount, and trip_distance. I'm trying to group by both PULocationID and DOLocationID, then count the combination each into a column called "count". I also need to take the average of total_amount and trip_distance and divide them into a column called "trip_rate". The end DF should be:
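+------------+------------+-----+---------+
|PULocationID|DOLocationID|count|trip_rate|
+------------+------------+-----+---------+
|         123|         422|    1|   5.2435|
|           3|          27|    4|   6.6121|
+------------+------------+-----+---------+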
Where (123,422) are paired together once for a trip rate of $5.24 and (3, 27) are paired together 4 times where the trip rate is $6.61.
Through reading some other threads, I'm able to group by the locations and count them using the below:
df.groupBy("PULocationID", 'DOLocationID').agg(count(lit(1)).alias("count")).show()
OR I can group by the locations and get the averages of the two columns I need using the below:
df.groupBy("PULocationID", 'DOLocationID').agg({'total_amount':'avg', 'trip_distance':'avg'}).show()
I tried a couple of things to get the trip_rate, but neither worked:
df.withColumn("trip_rate", (pyspark.sql.functions.col("total_amount") / pyspark.sql.functions.col("trip_distance")))
df.withColumn("trip_rate", df.total_amount/sum(df.trip_distance))
I also can't figure out how to combine the two queries that work (i.e. count of locations + averages).
Using this as an example input DataFrame:
+------------+------------+------------+-------------+
|PULocationID|DOLocationID|total_amount|trip_distance|
+------------+------------+------------+-------------+
| 123| 422| 10.487| 2|
| 3| 27| 19.8363| 3|
| 3| 27| 13.2242| 2|
| 3| 27| 6.6121| 1|
| 3| 27| 26.4484| 4|
+------------+------------+------------+-------------+
You can chain together the groupBy, agg, and select (you could also use withColumn and drop if you only need the 4 columns).
import pyspark.sql.functions as F
new_df = df.groupBy(
    "PULocationID",
    "DOLocationID",
).agg(
    F.count(F.lit(1)).alias("count"),
    F.avg(F.col("total_amount")).alias("avg_amt"),
    F.avg(F.col("trip_distance")).alias("avg_distance"),
).select(
    "PULocationID",
    "DOLocationID",
    "count",
    (F.col("avg_amt") / F.col("avg_distance")).alias("trip_rate"),
)
new_df.show()
+------------+------------+-----+-----------------+
|PULocationID|DOLocationID|count| trip_rate|
+------------+------------+-----+-----------------+
| 123| 422| 1| 5.2435|
| 3| 27| 4|6.612100000000001|
+------------+------------+-----+-----------------+
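To make the arithmetic behind that result easy to check by hand, here is a plain-Python sketch of the same grouping logic. It does not use Spark at all; the `rows` list simply copies the example input, and `totals` and `summary` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Example rows copied from the input DataFrame:
# (PULocationID, DOLocationID, total_amount, trip_distance)
rows = [
    (123, 422, 10.487, 2),
    (3, 27, 19.8363, 3),
    (3, 27, 13.2242, 2),
    (3, 27, 6.6121, 1),
    (3, 27, 26.4484, 4),
]

# Accumulate [count, sum of amounts, sum of distances] per location pair.
totals = defaultdict(lambda: [0, 0.0, 0.0])
for pu, do, amount, distance in rows:
    acc = totals[(pu, do)]
    acc[0] += 1
    acc[1] += amount
    acc[2] += distance

# trip_rate = avg(total_amount) / avg(trip_distance) within each group.
summary = {
    pair: {"count": n, "trip_rate": (amt / n) / (dist / n)}
    for pair, (n, amt, dist) in totals.items()
}

for pair, stats in summary.items():
    print(pair, stats)
```

Note that this is a ratio of averages, as in the Spark query. In general that is not the same as averaging the per-row ratios, although the two happen to agree on this sample data.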

How to select the N highest values for each category in spark scala

Say I have this dataset:
val main_df = Seq(("yankees-mets",8,20),("yankees-redsox",4,14),("yankees-mets",6,17),
("yankees-redsox",2,10),("yankees-mets",5,17),("yankees-redsox",5,10)).toDF("teams","homeruns","hits")
which looks like this:
I want to pivot on the teams' columns, and for all the other columns return the 2 (or N) highest values for that column. So for yankees-mets and homeruns, it would return this,
Since the 2 highest homerun totals for them were 8 and 6.
How would I do this in the general case?
Thanks
Your problem is not really a good fit for pivot, since a pivot means:
A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns.
You could create an additional rank column with a window function and then select only rows with rank 1 or 2:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// Rank rows within each team by homeruns, highest first
main_df.withColumn(
    "rank",
    rank().over(
      Window.partitionBy("teams")
        .orderBy($"homeruns".desc)
    )
  )
  .where($"teams" === "yankees-mets" and ($"rank" === 1 or $"rank" === 2))
  .show
+------------+--------+----+----+
| teams|homeruns|hits|rank|
+------------+--------+----+----+
|yankees-mets| 8| 20| 1|
|yankees-mets| 6| 17| 2|
+------------+--------+----+----+
If you no longer need the rank column afterwards, you can just drop it.
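For the general case the question asks about (every team, every value column, any N), the logic can be sketched in plain Python. This is not Spark code; `top_n_per_group` is a hypothetical helper, and the data is copied from the question's `main_df`.

```python
import heapq
from collections import defaultdict

# Rows copied from main_df: (teams, homeruns, hits)
rows = [
    ("yankees-mets", 8, 20), ("yankees-redsox", 4, 14), ("yankees-mets", 6, 17),
    ("yankees-redsox", 2, 10), ("yankees-mets", 5, 17), ("yankees-redsox", 5, 10),
]
columns = ("homeruns", "hits")  # value columns, in row order after "teams"


def top_n_per_group(rows, n):
    """Return {team: {column: the n largest values, in descending order}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for team, *values in rows:
        for column, value in zip(columns, values):
            grouped[team][column].append(value)
    return {
        team: {column: heapq.nlargest(n, vals) for column, vals in cols.items()}
        for team, cols in grouped.items()
    }


top2 = top_n_per_group(rows, 2)
for team, cols in top2.items():
    print(team, cols)
```

One difference worth keeping in mind: `heapq.nlargest` always returns at most N values, whereas Spark's `rank()` gives tied rows the same rank, so filtering on rank 1 or 2 can return more than two rows when values tie (the hits column for yankees-mets has two 17s). If exactly N rows are wanted in Spark, `row_number()` over the same window is the usual alternative.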

Spark: Flatten simple multi-column DataFrame

How to flatten a simple (i.e. no nested structures) dataframe into a list?
My problem set is detecting all the node pairs that have been changed/added/removed from a table of node pairs.
This means I have a "before" and "after" table to compare. Combining the before and after dataframe yields rows that describe where a pair appears in one dataframe but not the other.
Example:
+-----------+-----------+-----------+-----------+
|before.id1 |before.id2 |after.id1 |after.id2 |
+-----------+-----------+-----------+-----------+
| null| null| E2| E3|
| B3| B1| null| null|
| I1| I2| null| null|
| A2| A3| null| null|
| null| null| G3| G4|
The goal is to get a list of all the (distinct) nodes in the entire dataframe which would look like:
{A2,A3,B1,B3,E2,E3,G3,G4,I1,I2}
Potential approaches:
Union all the columns separately and distinct
flatMap and distinct
map and flatten
Since the structure is well known and simple it seems like there should be an equally straightforward solution. Which approach, or others, would be the simplest approach?
Other notes
Order of the id1-id2 pair is only important for change detection
Order in the resulting list is not important
DataFrame is between 10k and 100k rows
distinct in the resulting list is nice to have, but not required; I assume it is trivial with the distinct operation
Try the following: convert each row to a Seq, collect all the rows, flatten them, and remove the null value:
val df = Seq(("A","B"),(null,"A")).toDF
val result = df.rdd.map(_.toSeq.toList)
.collect().toList.flatten.toSet - null
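The flatten, drop nulls, distinct sequence is easy to check on the question's own example. Below is a plain-Python sketch of that logic (no Spark involved), with `None` standing in for null and the rows copied from the example table; `nodes` is a hypothetical name.

```python
# Rows of the combined before/after table; None stands in for null.
rows = [
    (None, None, "E2", "E3"),
    ("B3", "B1", None, None),
    ("I1", "I2", None, None),
    ("A2", "A3", None, None),
    (None, None, "G3", "G4"),
]

# Flatten every row into one stream of values, drop nulls, keep distinct ones.
nodes = {value for row in rows for value in row if value is not None}

print(sorted(nodes))
```

As in the Scala snippet, everything ends up in local memory, which seems reasonable for the 10k to 100k rows mentioned in the question but would not suit a much larger DataFrame.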

How to compare two dataframe and print columns that are different in scala

We have two data frames here:
the expected dataframe:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sanjose| romin|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
and the actual data frame:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sanjose| romino|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
the difference between the two dataframes now is:
+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
| 4| sanjose| romino|9848022331| 45123|SanRamon|
+------+--------+--------+----------+-------+--------+
We are using the except function df1.except(df2), however the problem with this is, it returns the entire rows that are different. What we want is to see which columns are different within that row (in this case, "romin" and "romino" from "emp_name" are different). We have been having tremendous difficulty with it and any help would be great.
From the scenario described in the question, it looks like the difference has to be found between columns, not rows.
So we need to apply a selective difference, which gives us the columns that have different values, along with those values.
To apply a selective difference, we can write code like this:
First, we find the columns in the expected and actual data frames.
val columns = df1.schema.fields.map(_.name)
Then we have to find the difference columnwise.
val selectiveDifferences = columns.map(col => df1.select(col).except(df2.select(col)))
Finally, we show only the differences that are non-empty.
selectiveDifferences.foreach(diff => if (diff.count > 0) diff.show())
This prints only the columns that contain different values, like this:
+--------+
|emp_name|
+--------+
| romino|
+--------+
I hope this helps!
list_col = []
cols = df1.columns
# Prepare list of dataframes/per column
for col in cols:
    list_col.append(df1.select(col).subtract(df2.select(col)))
# Render/persist
for l in list_col:
    if l.count() > 0:
        l.show()
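A caveat on the column-wise except/subtract approach: it compares the set of values in each column, so it does not say which row changed, and it can miss a change when the new value already occurs in another row of that column. A different technique, a cell-by-cell comparison keyed on emp_id, avoids both issues. The plain-Python sketch below illustrates that idea on the example data; it is not Spark code, it assumes emp_id uniquely identifies a row in both tables, and `index_by_id` and `cell_differences` are hypothetical helper names.

```python
# Keyed, cell-by-cell comparison of two small tables.
# Assumption: emp_id uniquely identifies a row in both tables.
columns = ["emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site"]

expected = [
    (3, "Chennai", "rahman", 9848022330, 45000, "SanRamon"),
    (1, "Hyderabad", "ram", 9848022338, 50000, "SF"),
    (2, "Hyderabad", "robin", 9848022339, 40000, "LA"),
    (4, "sanjose", "romin", 9848022331, 45123, "SanRamon"),
]
actual = [
    (3, "Chennai", "rahman", 9848022330, 45000, "SanRamon"),
    (1, "Hyderabad", "ram", 9848022338, 50000, "SF"),
    (2, "Hyderabad", "robin", 9848022339, 40000, "LA"),
    (4, "sanjose", "romino", 9848022331, 45123, "SanRamon"),
]


def index_by_id(rows):
    """Map emp_id -> {column: value} for quick lookup."""
    return {row[0]: dict(zip(columns, row)) for row in rows}


def cell_differences(expected_rows, actual_rows):
    """Return (emp_id, column, expected_value, actual_value) per mismatching cell."""
    expected_by_id = index_by_id(expected_rows)
    actual_by_id = index_by_id(actual_rows)
    differences = []
    # Only ids present in both tables are compared in this sketch.
    for emp_id in sorted(expected_by_id.keys() & actual_by_id.keys()):
        for column in columns:
            want = expected_by_id[emp_id][column]
            got = actual_by_id[emp_id][column]
            if want != got:
                differences.append((emp_id, column, want, got))
    return differences


differences = cell_differences(expected, actual)
for emp_id, column, want, got in differences:
    print(f"emp_id={emp_id} {column}: expected {want!r}, actual {got!r}")
```

In Spark the same idea would typically be expressed as a join of the two DataFrames on emp_id followed by per-column comparisons.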
The spark-extensions library has an API for this: diff. I believe you can use it like this:
left.diff(right).show()
Or supply emp_id as an id column, like this:
left.diff(right, "emp_id").show()
This API is available for Spark 2.4.x - 3.x.

Scala / DataFrame / Spark: How do I express multiple conditional aggregates?

Let's say I have a table like:
id,date,value
1,2017-02-12,3
2,2017-03-18,2
1,2017-03-20,5
1,2017-04-01,1
3,2017-04-01,3
2,2017-04-10,2
I already have this as a dataframe (it comes from a Hive table)
Now, I want an output that looks like (logically):
id, count($"date">"2017-03"), sum($"value" where $"date">"2017-03"), count($"date">"2017-02"), sum($"value" where $"date">"2017-02")
I've tried to express this in a single agg(), but I just can't figure out how to do the inner conditionals. I know how to filter ahead of the aggregation, but that doesn't do what I need with the two different sub-ranges.
// doesn't do the right thing
myDF.where($"date">"2017-03")
.groupBy("id")
.agg(sum("value") as "value_03", count("value") as "count_03")
.where($"date">"2017-04")
.agg(sum("value") as "value_04", count("value") as "value_04")
In SQL I would have put all the aggregation into a single SELECT statement with conditionals inside the count/sum clauses. How do I do something similar with DataFrames in Spark with Scala?
The closest I can think of is calculating membership for each tuple in each of the windows before the groupBy(), and summing over that membership times value (and straight sum for count.) It seems like there should be a better way to express this with conditionals inside the agg(), but I can't find it.
In SQL I would have put all the aggregation into a single SELECT statement with conditionals inside the count/sum clauses.
You can do exactly the same thing here:
import org.apache.spark.sql.functions.{sum, when}
myDF
  .groupBy($"id")
  .agg(
    sum(when($"date" > "2017-03", $"value")).alias("value3"),
    sum(when($"date" > "2017-04", $"value")).alias("value4")
  )
+---+------+------+
| id|value3|value4|
+---+------+------+
| 1| 6| 1|
| 3| 3| 3|
| 2| 4| 2|
+---+------+------+
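The question also asked for conditional counts, which the answer's output does not show. The plain-Python sketch below reproduces the conditional sums and adds the matching counts, so the numbers can be checked against the table. It is an illustration of the logic only, not Spark code; `windows` and `result` are hypothetical names, and the rows are copied from the question.

```python
from collections import defaultdict

# Rows copied from the question: (id, date, value)
rows = [
    (1, "2017-02-12", 3),
    (2, "2017-03-18", 2),
    (1, "2017-03-20", 5),
    (1, "2017-04-01", 1),
    (3, "2017-04-01", 3),
    (2, "2017-04-10", 2),
]

# Each named window is a predicate on the date string. ISO-formatted dates
# compare correctly as plain strings, which is what the Spark answer relies on.
windows = {
    "03": lambda d: d > "2017-03",
    "04": lambda d: d > "2017-04",
}

# One pass over the data updates every conditional aggregate for the row's id.
result = defaultdict(
    lambda: {f"{kind}_{name}": 0 for name in windows for kind in ("sum", "count")}
)
for id_, date, value in rows:
    agg = result[id_]
    for name, in_window in windows.items():
        if in_window(date):
            agg[f"sum_{name}"] += value
            agg[f"count_{name}"] += 1

for id_, agg in sorted(result.items()):
    print(id_, agg)
```

In Spark, the counts can sit in the same agg() call as the sums, for example as `count(when(cond, col))`, because `when` without `otherwise` yields null for non-matching rows and `count` ignores nulls. One behavioural difference from this sketch: Spark's `sum` over no matching rows returns null rather than 0.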