Spark dataframe random sampling based on frequency of occurrence in the dataframe - Scala

Input description
I have a Spark job with an input dataframe that has a column queryId. This queryId is not unique within the dataframe. For example, there are roughly 3M rows in the dataframe with 450k distinct query ids.
Problem
I am trying to implement sampling logic that creates a new column sampledQueryId, which contains a randomly sampled query id for each dataframe row, drawn from the set of query ids in the whole dataframe.
Sampling goal
The restriction is that the sampled query id shouldn't be equal to the input query id.
Sampling should be proportional to the frequency of occurrence of each query id in the incoming dataframe, i.e. given two query ids q1 and q2, if their ratio of occurrence is 10:1 (q1:q2), then q1 should appear approximately 10 times more often in the sampled id column.
Solution tried so far
I have tried to implement this by collecting the query ids into a list and looking up that list with random sampling, but I suspect, based on empirical evidence, that the logic doesn't work as expected: e.g. I see a specific query id getting sampled 200 times while a query id with similar frequency never gets sampled.
Any suggestions on whether this Spark code is expected to work as intended?
val random = new scala.util.Random
val queryIds = data.select($"queryId").map(row => row.getAs[Long](0)).collect()
val sampleQueryId = udf((queryId: Long) => {
  val sampledId = queryIds(random.nextInt(queryIds.length))
  if (sampledId != queryId) sampledId else null
})
val dataWithSampledIds = data.withColumn("sampledQueryId", sampleQueryId($"queryId"))

Received a response on a different forum; documenting it here for posterity's sake. The issue is that one random instance is being passed to all executors through the udf, so the n-th row on every executor is going to give the same output.
scala> val random = new scala.util.Random
scala> val getRandom = udf((data: Long) => random.nextInt(10000))
scala> spark.range(0, 12, 1, 4).withColumn("rnd", getRandom($"id")).orderBy($"id").show
+---+----+
| id| rnd|
+---+----+
|  0|6720|
|  1|7667|
|  2|3344|
|  3|6720|
|  4|7667|
|  5|3344|
|  6|6720|
|  7|7667|
|  8|3344|
|  9|6720|
| 10|7667|
| 11|3344|
+---+----+
This df had 4 partitions. The value of rnd for the n-th row of every partition is the same (e.g. id = 1, 4, 7, 10 all get the same value). The solution is to use the built-in rand() function in Spark, as below.
val queryIds = data.select($"queryId").map(row => row.getAs[Long](0)).collect()
val sampleQueryId = udf((queryId: Long, rand: Double) => {
  val sampledId = queryIds(scala.math.floor(rand * queryIds.length).toInt)
  // Return None (null in the column) when the sampled id equals the input id
  if (sampledId != queryId) Some(sampledId) else None
})
val dataWithSampledIds = data.withColumn("sampledQueryId", sampleQueryId($"queryId", rand()))
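A hedged tweak on the approach above (my own sketch, not from the forum response): instead of returning null when the sampled id collides with the input queryId, fall back to the neighbouring index. The fallback introduces a tiny bias that is usually negligible when there are many distinct ids.
import org.apache.spark.sql.functions.{rand, udf}
val sampleQueryIdNoNull = udf((queryId: Long, rnd: Double) => {
  // Draw an index from the collected queryIds array using the rand() column
  val idx = scala.math.floor(rnd * queryIds.length).toInt
  val candidate = queryIds(idx)
  // On a collision with the input id, take the next index (wrapping around)
  if (candidate != queryId) candidate else queryIds((idx + 1) % queryIds.length)
})
val dataWithSampledIds2 = data.withColumn("sampledQueryId", sampleQueryIdNoNull($"queryId", rand()))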

Related

Check the minimum by iterating one row in a dataframe over all the rows in another dataframe

Let's say I have the following two dataframes:
DF1:
+----------+----------+----------+
| Place|Population| IndexA|
+----------+----------+----------+
| A| Int| X_A|
| B| Int| X_B|
| C| Int| X_C|
+----------+----------+----------+
DF2:
+----------+----------+
| City| IndexB|
+----------+----------+
| D| X_D|
| E| X_E|
| F| X_F|
| ....| ....|
| ZZ| X_ZZ|
+----------+----------+
The dataframes above are normally of much larger size.
I want to determine, for every Place in DF1, which City in DF2 is at the shortest distance. The distance can be calculated based on the indexes. So for every row in DF1, I have to iterate over every row in DF2 and look for the shortest distance based on a calculation with the indexes. For the distance calculation there is a function defined:
val distance = udf((indexA: Long, indexB: Long) => h3.instance.h3Distance(indexA, indexB))
I tried the following:
val output = DF1.agg(functions.min(distance(col("IndexA"), DF2.col("IndexB"))))
The code compiles, but I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s)
H3Index#220L missing from Places#316,Population#330,IndexAx#338L in operator !Aggregate
[min(if ((isnull(IndexA#338L) OR isnull(IndexB#220L))) null else
UDF(knownnotnull(IndexA#338L), knownnotnull(IndexB#220L))) AS min(UDF(IndexA, IndexB))#346].
So I suppose I am doing something wrong when iterating over each row in DF2 while taking one row from DF1, but I couldn't find a solution.
What am I doing wrong? And am I in the right direction?
You are getting this error because the index column you are using only exists in DF2 and not in DF1, where you are attempting to perform the aggregation.
In order to make this field accessible and determine the distance from all points, you would need to:
1. Cross join DF1 and DF2 so that every index of DF1 is matched with every index of DF2
2. Determine the distance using your udf
3. Find the min over this cross-joined dataframe with the distances
This may look like:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min, udf}

val distance = udf((indexA: Long, indexB: Long) => h3.instance.h3Distance(indexA, indexB))

val resultDF = DF1.crossJoin(DF2)
  .withColumn("distance", distance(col("IndexA"), col("IndexB")))
  // Instead of a groupBy followed by matching the min distance of the aggregation with
  // the initial df, a window function computes the min_distance of each group (determined
  // by Place); we then keep only the city with the min distance to each place.
  .withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
  .where(col("distance") === col("min_distance"))
  .drop("min_distance")
This will result in a dataframe with columns from both dataframes and an additional column, distance.
NB: Your current approach, which compares every item in one df to every item in the other, is an expensive operation. If you have the opportunity to filter early (e.g. joining on heuristic columns, i.e. other columns that indicate a place may be closer to a city, as sketched below), this is recommended.
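A purely illustrative sketch of such an early filter, assuming a hypothetical coarse "region" column exists in both dataframes (it is not in the original data): joining on it first shrinks the cross join dramatically.
// Hypothetical: "region" is an assumed coarse-grained column present in both DF1 and DF2
val resultDFFiltered = DF1.join(DF2, Seq("region"))
  .withColumn("distance", distance(col("IndexA"), col("IndexB")))
  .withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
  .where(col("distance") === col("min_distance"))
  .drop("min_distance")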
Let me know if this works for you.
If you have only a few cities (less than or around 1000), you can avoid crossJoin and Window shuffle by collecting cities in an array and then perform distance computation for each place using this collected array:
import org.apache.spark.sql.functions.{array_min, col, struct, transform, typedLit, udf}

val citiesIndexes = df2.select("City", "IndexB")
  .collect()
  .map(row => (row.getString(0), row.getLong(1)))

val result = df1.withColumn(
  "City",
  array_min(
    transform(
      typedLit(citiesIndexes),
      x => struct(distance(col("IndexA"), x.getItem("_2")), x.getItem("_1"))
    )
  ).getItem("col2")
)
This piece of code works on Spark 3.0 and above. If you are on a Spark version earlier than 3.0, you should replace the array_min(...).getItem("col2") part with a user-defined function, for example as sketched below.
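A minimal sketch of such a replacement for Spark versions earlier than 3.0 (not part of the original answer): perform the whole closest-city lookup inside a single UDF over the collected (city, index) array, reusing the h3Distance call from the question.
import org.apache.spark.sql.functions.{col, udf}
// Pick the city whose index is closest to the given IndexA, using the collected citiesIndexes array
val closestCity = udf((indexA: Long) =>
  citiesIndexes
    .map { case (city, indexB) => (h3.instance.h3Distance(indexA, indexB), city) }
    .minBy(_._1)
    ._2
)
val resultPre3 = df1.withColumn("City", closestCity(col("IndexA")))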

How would I create bins of date ranges in spark scala?

Hi how's it going? I'm a Python developer trying to learn Spark Scala. My task is to create date range bins, and count the frequency of occurrences in each bin (histogram).
My input dataframe looks something like this
My bin edges are this (in Python):
bins = ["01-01-1990 - 12-31-1999","01-01-2000 - 12-31-2009"]
and the output dataframe I'm looking for is (counts of how many values in original dataframe per bin):
Is there anyone who can guide me on how to do this in Spark Scala? I'm a bit lost. Thank you.
Are you looking for a result like the following?
+------------------------+------------------------+
|01-01-1990 -- 12-31-1999|01-01-2000 -- 12-31-2009|
+------------------------+------------------------+
|                       3|                    null|
|                    null|                       2|
+------------------------+------------------------+
It can be achieved with a little bit of Spark SQL and the pivot function, as shown below. Check out the left join condition.
val binRangeData = Seq(("01-01-1990","12-31-1999"),
("01-01-2000","12-31-2009"))
val binRangeDf = binRangeData.toDF("start_date","end_date")
// binRangeDf.show
val inputDf = Seq((0,"10-12-1992"),
(1,"11-11-1994"),
(2,"07-15-1999"),
(3,"01-20-2001"),
(4,"02-01-2005")).toDF("id","input_date")
// inputDf.show
binRangeDf.createOrReplaceTempView("bin_range")
inputDf.createOrReplaceTempView("input_table")
val countSql = """
SELECT concat(date_format(c.st_dt,'MM-dd-yyyy'),' -- ',date_format(c.end_dt,'MM-dd-yyyy')) as header, c.bin_count
FROM (
(SELECT
b.st_dt, b.end_dt, count(1) as bin_count
FROM
(select to_date(input_date,'MM-dd-yyyy') as date_input , * from input_table) a
left join
(select to_date(start_date,'MM-dd-yyyy') as st_dt, to_date(end_date,'MM-dd-yyyy') as end_dt from bin_range ) b
on
a.date_input >= b.st_dt and a.date_input < b.end_dt
group by 1,2) ) c"""
val countDf = spark.sql(countSql)
countDf.groupBy("bin_count").pivot("header").sum("bin_count").drop("bin_count").show
Note that since you have 2 bin ranges, 2 rows will be generated.
We can achieve this by looking at the date column and determining within which range each record falls.
// First we set up the problem
// Create a format that looks like yours
val dateFormat = java.time.format.DateTimeFormatter.ofPattern("MM-dd-yyyy")
// Get the current local date
val now = java.time.LocalDate.now
// Create a range of 1-10000 and map each to minusDays
// so we can have range of dates going 10000 days back
val dates = (1 to 10000).map(now.minusDays(_).format(dateFormat))
// Create a DataFrame we can work with.
val df = dates.toDF("date")
So far so good. We have date entries to work with, and they are like your format (MM-dd-yyyy).
Next up, we need a function which returns 1 if the date falls within range, and 0 if not. We create a UserDefinedFunction (UDF) from this function so we can apply it to all rows simultaneously across Spark executors.
// We will process each range one at a time, so we'll take it as a string
// and split it accordingly. Then we perform our tests. Using Dates is
// necessary to cater to your format.
import java.text.SimpleDateFormat

def isWithinRange(date: String, binRange: String): Int = {
  val format = new SimpleDateFormat("MM-dd-yyyy")
  val startDate = format.parse(binRange.split(" - ").head)
  val endDate = format.parse(binRange.split(" - ").last)
  val testDate = format.parse(date)
  if (!(testDate.before(startDate) || testDate.after(endDate))) 1
  else 0
}

// We create a udf which uses an anonymous function taking two args and
// simply pass the values to our prepared function
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

def isWithinRangeUdf: UserDefinedFunction =
  udf((date: String, binRange: String) => isWithinRange(date, binRange))
Now that we have our UDF set up, we create new columns in our DataFrame, then group and sum the values (hence why we made our function evaluate to an Int).
// We define our bins List
val bins = List("01-01-1990 - 12-31-1999",
"01-01-2000 - 12-31-2009",
"01-01-2010 - 12-31-2020")
// We fold through the bins list, creating a column from each bin at a time,
// enriching the DataFrame with more columns as we go
import org.apache.spark.sql.functions.{col, lit, sum}
val withBinsDf = bins.foldLeft(df) { (changingDf, bin) =>
  changingDf.withColumn(bin, isWithinRangeUdf(col("date"), lit(bin)))
}
withBinsDf.show(1)
//+----------+-----------------------+-----------------------+-----------------------+
//| date|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+----------+-----------------------+-----------------------+-----------------------+
//|09-01-2020| 0| 0| 1|
//+----------+-----------------------+-----------------------+-----------------------+
//only showing top 1 row
Finally we select our bin columns and sum them in a single aggregation.
val binsDf = withBinsDf.select(bins.head, bins.tail:_*)
val sums = bins.map(b => sum(b).as(b)) // keep col name as is
val summedBinsDf = binsDf.groupBy().agg(sums.head, sums.tail:_*)
summedBinsDf.show
//+-----------------------+-----------------------+-----------------------+
//|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+-----------------------+-----------------------+-----------------------+
//| 2450| 3653| 3897|
//+-----------------------+-----------------------+-----------------------+
2450 + 3653 + 3897 = 10000, so it seems our work was correct.
Perhaps I overdid it and there is a simpler solution; please let me know if you know a better way (especially for handling MM-dd-yyyy dates).
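One possible simpler variant, offered as a hedged sketch rather than a definitive answer: skip the UDF entirely and build each bin column with to_date and between on the parsed bin edges. This assumes Spark 2.2+ for to_date(col, format) and reuses the df and bins values defined above.
import org.apache.spark.sql.functions.{col, lit, sum, to_date, when}
// Parse the string dates once
val parsed = df.withColumn("d", to_date(col("date"), "MM-dd-yyyy"))
// Add one 0/1 column per bin, exactly as the UDF version does
val withBins = bins.foldLeft(parsed) { (acc, bin) =>
  val Array(start, end) = bin.split(" - ")
  acc.withColumn(bin,
    when(col("d").between(to_date(lit(start), "MM-dd-yyyy"),
                          to_date(lit(end), "MM-dd-yyyy")), 1).otherwise(0))
}
val counts = bins.map(b => sum(col(b)).as(b))
withBins.groupBy().agg(counts.head, counts.tail: _*).show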

How to calculate the average of rows between a start index and end index of a column in a dataframe in Spark using Scala?

I have a Spark dataframe with a column of float values. I am trying to find the average of the values from row 11 to row 20. Please note, I am not trying to do any sort of moving average. I tried using a partition window like so:
var avgClose= avg(priceDF("Close")).over(partitionWindow.rowsBetween(11,20))
It returns an 'org.apache.spark.sql.Column' result. I don't know how to view avgClose.
I am new to Spark and Scala. Appreciate your help in getting this.
Assign an increasing id to your table. Then you can do an average between the ids.
import org.apache.spark.sql.functions.{avg, monotonically_increasing_id}

val df = Seq(20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1).toDF("val1")
val dfWithId = df.withColumn("id", monotonically_increasing_id())
val avgClose = dfWithId.filter($"id" >= 11 && $"id" <= 20).agg(avg("val1"))
avgClose.show()
result:
+---------+
|avg(val1)|
+---------+
|      5.0|
+---------+
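A hedged side note on this answer (my own addition): monotonically_increasing_id() only guarantees increasing, unique ids, not consecutive ones, so on a multi-partition dataframe the id filter above may not select exactly the 11th through 20th rows. A row_number() over an ordering gives consecutive positions, at the cost of a window shuffle.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, row_number}
// Order by the monotonic id to preserve the original row order, then number rows 1..N
val w = Window.orderBy("id")
val avgCloseExact = dfWithId
  .withColumn("rn", row_number().over(w))
  .filter($"rn" >= 11 && $"rn" <= 20)
  .agg(avg("val1"))
avgCloseExact.show()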

Comparing the value of columns in two dataframes

I have two dataframes; one has a unique value of id per row and the other can have multiple rows for each id.
This is dataframe df1:
id              | dt                  | speed | stats
358899055773504 | 2018-07-31 18:38:34 | 0     | [9,-1,-1,13,0,1,0]
358899055773505 | 2018-07-31 18:48:23 | 4     | [8,-1,0,22,1,1,1]
df2:
id              | dt                  | speed | stats
358899055773504 | 2018-07-31 18:38:34 | 0     | [9,-1,-1,13,0,1,0]
358899055773505 | 2018-07-31 18:54:23 | 4     | [9,0,0,22,1,1,1]
358899055773504 | 2018-07-31 18:58:34 | 0     | [9,0,-1,22,0,1,0]
358899055773504 | 2018-07-31 18:28:34 | 0     | [9,0,-1,22,0,1,0]
358899055773505 | 2018-07-31 18:38:23 | 4     | [8,-1,0,22,1,1,1]
I aim to compare the second dataframe with the first and update the values in the first dataframe, but only if the dt value for a particular id in df2 is greater than that in df1; if the greater-than condition is satisfied, the other fields should be compared as well.
You need to join the two dataframes together to make any comparison of their columns.
What you can do is first joining the dataframes and then perform all the filtering to get a new dataframe with all rows that should be updated:
val diffDf = df1.as("a").join(df2.as("b"), Seq("id"))
.filter($"b.dt" > $"a.dt")
.filter(...) // Any other filter required
.select($"id", $"b.dt", $"b.speed", $"b.stats")
Note: In some situations it would be required to do a groupBy(id) or use a window function, since there should only be one final row per id in the diffDf dataframe. This can be done as follows (the example here selects the row with the maximum speed, but it depends on the actual requirements):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.partitionBy($"id").orderBy($"speed".desc)
val diffDf2 = diffDf.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn")
More in-depth information about different approaches can be seen here: How to max value and keep all columns (for max records per group)?.
To replace the old rows with the same id in the df1 dataframe, combine the dataframes with an outer join and coalesce:
val df = df1.as("a").join(diffDf.as("b"), Seq("id"), "outer")
.select(
$"id",
coalesce($"b.dt", $"a.dt").as("dt"),
coalesce($"b.speed", $"a.speed").as("speed"),
coalesce($"b.stats", $"a.stats").as("stats")
)
coalesce works by first trying to take the value from the diffDf (b) dataframe. If that value is null it will take the value from df1 (a).
Result when only using the time filter with the provided example input dataframes:
+---------------+-------------------+-----+-----------------+
| id| dt|speed| stats|
+---------------+-------------------+-----+-----------------+
|358899055773504|2018-07-31 18:58:34| 0|[9,0,-1,22,0,1,0]|
|358899055773505|2018-07-31 18:54:23| 4| [9,0,0,22,1,1,1]|
+---------------+-------------------+-----+-----------------+

Spark: Computing correlations of a DataFrame with missing values

I currently have a DataFrame of doubles with approximately 20% of the data being null values. I want to calculate the Pearson correlation of one column with every other column and return the columnId's of the top 10 columns in the DataFrame.
I want to filter out nulls using pairwise deletion, similar to R's pairwise.complete.obs option in its Pearson correlation function. That is, if one of the two vectors in any correlation calculation has a null at an index, I want to remove that row from both vectors.
I currently do the following:
val df = ... // my DataFrame
val cols = df.columns
df.registerTempTable("dataset")
val target = "Row1"
val mapped = cols.map { colId =>
  val results = sqlContext.sql(s"SELECT ${target}, ${colId} FROM dataset WHERE (${colId} IS NOT NULL AND ${target} IS NOT NULL)")
  (results.stat.corr(colId, target), colId)
}.sortWith(_._1 > _._1).take(11).map(_._2)
This runs very slowly, as every single map iteration is its own job. Is there a way to do this efficiently, perhaps using Statistics.corr in MLlib, as per this SO question (Spark 1.6 Pearson Correlation)?
There are "na" functions on DataFrame: DataFrameNaFunctions API
They work in the same way the DataFrameStatFunctions do.
You can drop the rows containing a null in either of your two dataframe columns with the following syntax:
myDataFrame.na.drop("any", target, colId)
if you want to drop rows containing null any of the columns then it is:
myDataFrame.na.drop("any")
By limiting the dataframe to the two columns you care about first, you can use the second method and avoid verbose!
As such your code would become:
val df = ??? // my DataFrame
val cols = df.columns
val target = "Row1"
val mapped = cols.map { colId =>
  val resultDF = df.select(target, colId).na.drop("any")
  (resultDF.stat.corr(target, colId), colId)
}.sortWith(_._1 > _._1).take(11).map(_._2)
Hope this helps you.
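As a hedged side note beyond the answer above: the per-column jobs remain because pairwise deletion requires a different row subset for each column pair. If listwise deletion (dropping rows with a null in any column) is acceptable instead, the whole correlation matrix can be computed in a single pass with the RDD-based Statistics.corr that the question mentions, and the top columns read off the target's row of the matrix.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Listwise deletion: keep only rows that are complete across all columns
val complete = df.na.drop("any")
val vectors = complete.rdd.map(row => Vectors.dense(cols.map(c => row.getAs[Double](c))))
val corrMatrix = Statistics.corr(vectors, "pearson")

// Rank columns by their correlation with the target (take(11) keeps the target itself plus 10 others)
val targetIdx = cols.indexOf(target)
val topCols = cols.zipWithIndex
  .map { case (c, i) => (corrMatrix(targetIdx, i), c) }
  .sortWith(_._1 > _._1)
  .take(11)
  .map(_._2)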