I have a column in a DataFrame from which I need to select 3 random values in PySpark. Could anyone help me, please?
+---+
| id|
+---+
|123|
|245|
| 12|
|234|
+---+
Desired output: an array with 3 random values taken from that column:
**output**: [123, 12, 234]
You can order the rows randomly using the rand() function and then take the first three:
from pyspark.sql.functions import rand

df.select('id').orderBy(rand()).limit(3).collect()
For more information on rand() function, check out pyspark.sql.functions.rand.
Here's another approach that's probably more performant.
You can fetch three random rows with this code:
df.rdd.takeSample(False, 3)
Here's how to create an array with three integers if you don't want an array of Row objects:
list(map(lambda row: row[0], df.rdd.takeSample(False, 3)))
df.select('id').orderBy(F.rand()).limit(3) will generate this physical plan:
== Physical Plan ==
TakeOrderedAndProject(limit=3, orderBy=[_nondeterministic#38 ASC NULLS FIRST], output=[id#32L])
+- *(1) Project [id#32L, rand(-4436287143488772163) AS _nondeterministic#38]
This post discusses fetching random values from a DataFrame column in more detail.
Let's say I have the following two dataframes:
DF1:
+----------+----------+----------+
| Place|Population| IndexA|
+----------+----------+----------+
| A| Int| X_A|
| B| Int| X_B|
| C| Int| X_C|
+----------+----------+----------+
DF2:
+----------+----------+
| City| IndexB|
+----------+----------+
| D| X_D|
| E| X_E|
| F| X_F|
| ....| ....|
| ZZ| X_ZZ|
+----------+----------+
In reality, the dataframes above are much larger.
For every Place in DF1, I want to determine the City (DF2) with the shortest distance. The distance can be calculated based on the indexes: for every row in DF1, I have to iterate over every row in DF2 and look for the shortest distance based on a calculation with the indexes. For the distance calculation there is a function defined:
val distance = udf(
(indexA: Long, indexB: Long) => {
h3.instance.h3Distance(indexA, indexB)
})
I tried the following:
val output = DF1.agg(functions.min(distance(col("IndexA"), DF2.col("IndexB"))))
The code compiles, but I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s)
H3Index#220L missing from Places#316,Population#330,IndexAx#338L in operator !Aggregate
[min(if ((isnull(IndexA#338L) OR isnull(IndexB#220L))) null else
UDF(knownnotnull(IndexA#338L), knownnotnull(IndexB#220L))) AS min(UDF(IndexA, IndexB))#346].
So I suppose I am doing something wrong when iterating over each row in DF2 for a given row in DF1, but I couldn't find a solution.
What am I doing wrong? And am I in the right direction?
You are getting this error because the index column you are using only exists in DF2, not in DF1, where you are attempting to perform the aggregation.
In order to make this field accessible and determine the distance from all points, you would need to:
1) Cross join DF1 and DF2 so that every index of DF1 is matched with every index of DF2
2) Determine the distance using your udf
3) Find the min over this cross-joined result with the distances
This may look like:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min, udf}
val distance = udf(
(indexA: Long, indexB: Long) => {
h3.instance.h3Distance(indexA, indexB)
})
val resultDF = DF1.crossJoin(DF2)
.withColumn("distance", distance(col("IndexA"), col("IndexB")))
// Instead of using a groupBy and then matching the min distance of the aggregation
// back to the initial df, a window function min is used to determine the
// min_distance of each group (partitioned by Place), and we then filter for the
// city with the min distance to each place.
.withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
.where(col("distance") === col("min_distance"))
.drop("min_distance")
This will result in a dataframe with the columns from both dataframes and an additional column distance.
NB. Your current approach, which compares every item in one df to every item in the other df, is an expensive operation. If you have the opportunity to filter early (e.g. joining on heuristic columns, i.e. other columns which may indicate that a place is closer to a city), this is recommended.
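For illustration only, here is a minimal sketch of such an early filter, assuming both dataframes carried a hypothetical coarse "region" column (not present in your data) that limits which Place/City pairs are considered at all:
// Hedged sketch: "region" is an assumed heuristic column used to pre-filter pairs,
// so the distance udf runs on far fewer rows than a full cross join would produce.
val candidates = DF1.join(DF2, Seq("region"))
  .withColumn("distance", distance(col("IndexA"), col("IndexB")))
  .withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
  .where(col("distance") === col("min_distance"))
  .drop("min_distance")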
Let me know if this works for you.
If you have only a few cities (less than or around 1000), you can avoid the crossJoin and Window shuffle by collecting the cities into an array and then performing the distance computation for each place using this collected array:
import org.apache.spark.sql.functions.{array_min, col, struct, transform, typedLit, udf}
val citiesIndexes = df2.select("City", "IndexB")
.collect()
.map(row => (row.getString(0), row.getLong(1)))
val result = df1.withColumn(
"City",
array_min(
transform(
typedLit(citiesIndexes),
x => struct(distance(col("IndexA"), x.getItem("_2")), x.getItem("_1"))
)
).getItem("col2")
)
This piece of code works for Spark 3 and greater. If you are on a Spark version older than 3.0, you should replace the array_min(...).getItem("col2") part with a user-defined function.
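For illustration, a hedged sketch of such a replacement, reusing the collected citiesIndexes array from above (the closestCity name is made up for this example):
import org.apache.spark.sql.functions.{col, udf}

// Sketch for Spark < 3.0: a udf that scans the collected (City, IndexB) pairs
// and returns the name of the city with the smallest h3 distance.
val closestCity = udf { (indexA: Long) =>
  citiesIndexes.minBy { case (_, indexB) => h3.instance.h3Distance(indexA, indexB) }._1
}

val result = df1.withColumn("City", closestCity(col("IndexA")))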
Say I have this dataframe:
val df = Seq(("Mike",1),("Kevin",2),("Bob",3),("Steve",4)).toDF("name","score")
and I want to filter this dataframe so that it only returns rows where the "score" column is greater than or equal to the 75th percentile. How would I do this?
Thanks so much and have a great day!
What you want to base your filter on is the third quartile.
It is also known as the upper quartile or the 75th empirical quartile, and 75% of the data lies below this point.
Based on the answer here, you can use Spark's approxQuantile to get what you want:
val q = df.stat.approxQuantile("score", Array(.75), 0)
q: Array[Double] = Array(3.0)
This array (q) gives you the boundary between the 3rd and 4th quartiles.
A simple Spark filter then gets you what you want:
df.filter($"score" >= q.head).show
+-----+-----+
| name|score|
+-----+-----+
| Bob| 3|
|Steve| 4|
+-----+-----+
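As a side note, the third argument to approxQuantile is the relative error; 0 asks for an exact quantile, which can be costly on large data. A hedged sketch of trading a little accuracy for speed:
// Sketch only: allow a 1% relative error instead of computing the exact quantile.
val qApprox = df.stat.approxQuantile("score", Array(0.75), 0.01)
df.filter($"score" >= qApprox.head).show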
Say I have this dataset:
val main_df = Seq(("yankees-mets",8,20),("yankees-redsox",4,14),("yankees-mets",6,17),
("yankees-redsox",2,10),("yankees-mets",5,17),("yankees-redsox",5,10)).toDF("teams","homeruns","hits")
which looks like this:
+--------------+--------+----+
|         teams|homeruns|hits|
+--------------+--------+----+
|  yankees-mets|       8|  20|
|yankees-redsox|       4|  14|
|  yankees-mets|       6|  17|
|yankees-redsox|       2|  10|
|  yankees-mets|       5|  17|
|yankees-redsox|       5|  10|
+--------------+--------+----+
I want to pivot on the teams column, and for all the other columns return the 2 (or N) highest values for that column. So for yankees-mets and homeruns, it would return the rows with homeruns 8 and 6, since those were the 2 highest homerun totals for that team.
How would I do this in the general case?
Thanks
Your problem is not really a good fit for a pivot, since a pivot means:
A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns.
You could create an additional rank column with a window function and then select only rows with rank 1 or 2:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
main_df.withColumn(
"rank",
rank()
.over(
Window.partitionBy("teams")
.orderBy($"homeruns".desc)
)
)
.where($"teams" === "yankees-mets" and ($"rank" === 1 or $"rank" === 2))
.show
+------------+--------+----+----+
| teams|homeruns|hits|rank|
+------------+--------+----+----+
|yankees-mets| 8| 20| 1|
|yankees-mets| 6| 17| 2|
+------------+--------+----+----+
Then, if you no longer need the rank column, you can just drop it.
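For the general case, a hedged sketch of the same idea without the team filter and with a configurable N (row_number() could be swapped in for rank() if ties should not produce extra rows):
// Sketch: keep the top n rows per team by homeruns.
val n = 2
main_df
  .withColumn("rank", rank().over(Window.partitionBy("teams").orderBy($"homeruns".desc)))
  .where($"rank" <= n)
  .drop("rank")
  .show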
I have a Spark dataframe with a column of float values. I am trying to find the average of the values between rows 11 and 20. Please note, I am not trying any sort of moving average. I tried using a partition window like so:
var avgClose = avg(priceDF("Close")).over(partitionWindow.rowsBetween(11, 20))
It returns an 'org.apache.spark.sql.Column' result. I don't know how to view avgClose.
I am new to Spark and Scala. Appreciate your help in getting this.
Assign an increasing id to your DataFrame. Then you can take the average over the rows whose ids fall between 11 and 20.
import org.apache.spark.sql.functions.{avg, monotonically_increasing_id}

val df = Seq(20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1).toDF("val1")
val dfWithId = df.withColumn("id", monotonically_increasing_id())
val avgClose= dfWithId.filter($"id" >= 11 && $"id" <= 20).agg(avg("val1"))
avgClose.show()
result:
+---------+
|avg(val1)|
+---------+
| 5.0|
+---------+
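One caveat worth noting: monotonically_increasing_id() is only guaranteed to be increasing, not consecutive, once the data spans more than one partition, so the ids may not match row positions exactly. A hedged sketch using row_number() for consecutive 1-based positions instead (ordering by a literal keeps the incoming order only on a single partition; normally you would order by an explicit column):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, lit, row_number}

// Sketch: assign consecutive row numbers, then average rows 11 through 20.
val withRowNum = df.withColumn("rn", row_number().over(Window.orderBy(lit(1))))
withRowNum.filter($"rn" >= 11 && $"rn" <= 20).agg(avg("val1")).show()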
I have a DataFrame like below with three columns:
+---+-----------+-----------+
| id|visit_class|    in_date|
+---+-----------+-----------+
|  1|     Non Hf|24-SEP-2017|
|  1|     Non Hf|23-SEP-2017|
|  1|         Hf|27-SEP-2017|
|  1|     Non Hf|28-SEP-2017|
|  2|     Non Hf|24-SEP-2017|
|  2|         Hf|25-SEP-2017|
+---+-----------+-----------+
I want to group this data frame on id, sort each group by the in_date column, and keep only the rows from the first occurrence of Hf onwards. The output will be like below, i.e. the first 2 rows are dropped for id = 1 and the first row is dropped for id = 2.
+---+-----------+-----------+
| id|visit_class|    in_date|
+---+-----------+-----------+
|  1|         Hf|27-SEP-2017|
|  1|     Non Hf|28-SEP-2017|
|  2|         Hf|25-SEP-2017|
+---+-----------+-----------+
How can I achieve this in Spark with Scala?
Steps:
1) Create a WindowSpec partitioned by id and ordered by date.
2) Create a cumulative sum as an indicator of whether Hf has appeared yet, and then filter on that condition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, to_date, when}
val w = Window.partitionBy("id").orderBy(to_date($"in_date", "dd-MMM-yyyy"))
(df.withColumn("rn", sum(when($"visit_class" === "Hf", 1).otherwise(0)).over(w))
.filter($"rn" >= 1).drop("rn").show)
+---+-----------+-----------+
| id|visit_class|    in_date|
+---+-----------+-----------+
| 1| Hf|27-SEP-2017|
| 1| Non Hf|28-SEP-2017|
| 2| Hf|25-SEP-2017|
+---+-----------+-----------+
This uses Spark 2.2.0; to_date with a format argument is a new function in 2.2.0.
If you are using Spark < 2.2.0, you can use unix_timestamp in place of to_date:
val w = Window.partitionBy("id").orderBy(unix_timestamp($"in_date", "dd-MMM-yyyy"))