Scala: fast range lookup on 2 columns

I have a Spark DataFrame that I am broadcasting as Array[Array[String]].
My requirement is to do a range lookup on 2 columns.
Right now I have something like this:
val cols = data.filter(_(0).toLong <= ip).filter(_(1).toLong >= ip).take(1) match {
  case Array(t) => t
  case _ => Array()
}
The following data file is stored as Array[Array[String]] and passed to the filter shown above. The header row is shown below only for reference; it is not part of the data.
Sample data file:
startIPInt | endIPInt | lat | lon
676211200 | 676211455 | 37.33053 | -121.83823
16777216 | 16777342 | -34.9210644736842 | 138.598709868421
17081712 | 17081712 | 0 | 0
Sample value to search:
ip = 676211325
I want to find the row whose startIPInt to endIPInt range contains ip, and return the remaining columns of that row.
This lookup takes 1-2 seconds per value, and I am not even sure the second filter condition is being executed (in debug mode it only ever seems to execute the first condition). Can someone suggest a faster and more reliable lookup?
Thanks!
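One common way to speed up this kind of lookup is to parse and sort the ranges once, then binary-search on the start column instead of scanning and re-parsing every row per query. The sketch below is plain Python rather than Scala, uses only the sample rows from the question, and assumes the ranges never overlap. The names table, starts, and lookup are illustrative and not part of any Spark API.

```python
from bisect import bisect_right

# Rows as broadcast in the question: [startIPInt, endIPInt, lat, lon], all strings.
rows = [
    ["676211200", "676211455", "37.33053", "-121.83823"],
    ["16777216", "16777342", "-34.9210644736842", "138.598709868421"],
    ["17081712", "17081712", "0", "0"],
]

# One-time preparation: parse the bounds once and sort by start address.
# Assumption: ranges do not overlap, so at most one range can contain an ip.
table = sorted((int(r[0]), int(r[1]), r[2:]) for r in rows)
starts = [start for start, _, _ in table]

def lookup(ip):
    """Return the mapping columns of the range containing ip, or [] if none."""
    i = bisect_right(starts, ip) - 1  # index of the last range with start <= ip
    if i >= 0 and ip <= table[i][1]:  # the end bound is checked explicitly
        return table[i][2]
    return []

print(lookup(676211325))
```

Each query is then a binary search plus one comparison against the end bound, so both conditions are always checked. The same idea should carry over to the broadcast Scala array by sorting it once and searching an array of Long start values.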

Related

How to create a range column based on a column value?

I have sample data in a table which contains distance_travelled_in_meter, where the values are of Integer type as follows:
distance_travelled_in_meter |
--------------------------- |
500                         |
1221                        |
990                         |
575                         |
I want to create a range column based on the value of distance_travelled_in_meter. The ranges are 500 wide.
The result dataset is as follows:
distance_travelled_in_meter | range
--------------------------- | ---------
500                         | 1-500
1221                        | 1000-1500
990                         | 500-1000
575                         | 500-1000
For the value 500, the range is 1-500 because it is within 500 meters; 1221 falls in 1000-1500, and so on.
I tried using org.apache.spark.sql.functions.sequence, but it takes start and stop column values and builds an array running from the start value to the stop value, which is not what I want. I want the range labels shown above.
I'm using Spark 2.4.2 with Scala 2.11.12.
Any help is much appreciated.
You can chain multiple when expressions that you generate dynamically using something like this:
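import org.apache.spark.sql.functions.{col, lit, when}

val maxDistance = 1221 // you can get this from the dataframe
val ranges = (0 until maxDistance by 500).map(x => (x, x + 500))
// between is inclusive on both ends, so 500 matches both 0-500 and 500-1000.
// foldRight puts the lowest range outermost, so a boundary value gets the
// lower bucket (500 -> "0-500"; the question labels this bucket "1-500").
val rangeExpr = ranges.foldRight(lit(null)) {
  case ((lowerBound, upperBound), acc) =>
    when(
      col("distance_travelled_in_meter").between(lowerBound, upperBound),
      lit(s"$lowerBound-$upperBound")
    ).otherwise(acc)
}
val df1 = df.withColumn("range", rangeExpr)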
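As an alternative to generating one when branch per bucket, the label can also be computed arithmetically. The sketch below shows only that arithmetic in plain Python, outside Spark. It assumes distances are integers of at least 1 and that the upper bound of each bucket is inclusive, as in the expected output. The function name range_label and the width parameter are illustrative.

```python
def range_label(distance, width=500):
    """Label the width-sized bucket containing distance (upper bound inclusive)."""
    # Subtracting 1 before the integer division keeps exact multiples of
    # width (500, 1000, ...) in the lower bucket.
    lower = ((distance - 1) // width) * width
    upper = lower + width
    # The question labels the first bucket "1-500" rather than "0-500".
    return f"{max(lower, 1)}-{upper}"

for d in (500, 1221, 990, 575):
    print(d, range_label(d))
```

For the four sample values this reproduces the range column from the question.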

PySpark dataframe column to list

I am trying to extract the list of column values from a dataframe into a list
+------+----------+------------+
|sno_id|updt_dt   |process_flag|
+------+----------+------------+
|123   |01-01-2020|Y           |
|234   |01-01-2020|Y           |
|512   |01-01-2020|Y           |
|111   |01-01-2020|Y           |
+------+----------+------------+
The output should be the list of sno_id values: ['123','234','512','111']
Then I need to iterate over the list to run some logic on each of the values. I am currently using HiveWarehouseSession to fetch data from a Hive table into a DataFrame with hive.executeQuery(query).
This is fairly easy. First collect the DataFrame, which returns a list of Row objects:
row_list = df.select('sno_id').collect()
Then iterate over the rows to turn the column into a list:
sno_id_array = [row.sno_id for row in row_list]
sno_id_array
['123','234','512','111']
A more optimized solution using flatMap:
sno_id_array = df.select("sno_id").rdd.flatMap(lambda x: x).collect()
You could use toLocalIterator() to create a generator over the column.
Since you wanted to loop over the results afterwards, this may be more efficient in your case.
With a generator you don't build and store the whole list first; instead, you apply your logic to each row as you iterate:
sno_ids = df.select('sno_id').toLocalIterator()
for row in sno_ids:
    sno_id = row.sno_id
    # continue with your logic
    ...
Alternative one-liner using a generator expression:
sno_ids = (row.sno_id for row in df.select('sno_id').toLocalIterator())
for sno_id in sno_ids:
    ...
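To make the laziness of the generator approach concrete, here is a small sketch that runs without Spark. Row is a namedtuple stand-in for pyspark.sql.Row, and local_iterator is a hypothetical substitute for toLocalIterator() that records which rows have been pulled so far.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row with a single sno_id field.
Row = namedtuple("Row", ["sno_id"])

def local_iterator(ids, fetched):
    """Yield rows one at a time, recording each pull in fetched."""
    for sno_id in ids:
        fetched.append(sno_id)
        yield Row(sno_id)

fetched = []
sno_ids = (row.sno_id for row in local_iterator(["123", "234", "512", "111"], fetched))

first = next(sno_ids)               # pulls exactly one row
fetched_after_first = list(fetched)  # snapshot of what has been pulled so far
remaining = list(sno_ids)           # pulls the rest on demand
```

After the first next call only one row has been pulled, which is the behaviour the generator version relies on.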

Define custom window functions in pySpark

I have a DataFrame in Spark for which I would like to create a column defined recursively like this:
new_column_row = f(last_column_row, other_parameters)
The best way to do that would be to define my own custom window function, but I cannot find a way to do it in PySpark. Has anybody encountered the same problem?
The case I am working on is reconstructing an order book from a list of orders.
I have a DataFrame like this (output is the column I would like to calculate):
size | price | output
1 | 1 | {1:1}
1.2 | 1.1 | {1:1, 1.2:1.1}
1.3 | - 1. | {1.2:1.1}
Output is updated at each row like this (in Python pseudocode):
- if price not in Output:
    Output[price] = size
- if price in Output:
    Output[price] = Output[price] + size
    if Output[price] == 0:
        del Output[price]
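The update rule above is a running fold over the orders, which can be sketched in plain Python with itertools.accumulate (the initial argument needs Python 3.8 or later). This is not a PySpark window function. The orders list is hypothetical sample data given as (price, size) pairs, and apply_order is an illustrative name.

```python
from itertools import accumulate

def apply_order(book, order):
    """Return a new book with one (price, size) order applied, per the rules above."""
    price, size = order
    updated = dict(book)  # copy so earlier snapshots are not mutated
    if price not in updated:
        updated[price] = size
    else:
        updated[price] += size
        if updated[price] == 0:
            del updated[price]
    return updated

# Hypothetical orders as (price, size); the last one cancels the first.
orders = [(1.0, 1.0), (1.2, 1.1), (1.0, -1.0)]

# accumulate yields the book after each order; drop the initial empty book.
snapshots = list(accumulate(orders, apply_order, initial={}))[1:]
for snapshot in snapshots:
    print(snapshot)
```

The exact comparison with 0 only works when the sizes cancel exactly; with arbitrary floating-point sizes a tolerance or decimal.Decimal may be needed.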

How can I fill a 2D ArrayBuffer dynamically in only one dimension?

I want to have (id, ArrayBuffer[ArrayBuffer[Double]]) filled from a DStream batch. My input data is of the form "id, counter, value", and each id has more than one value per counter in each RDD.
So what I want is:
      +------+------+------+----+
      |      |value1|value2|etc |
      +------+------+------+----+
(id,  |count1|0.2   |0.3   |    |)
      |count2|0.1   |      |    |
      |etc   |      |      |    |
      +------+------+------+----+
When I get the RDD input, I know the counter dimension, but I don't know how many values are already stored for each counter, so I cannot tell which cell is the next empty one. I tried creating an Array[Array[Double]] with a findNextEmpty() method, but I get an ArrayIndexOutOfBoundsException, so I am now trying ArrayBuffer. Is there any way to dynamically fill only one of the dimensions of a 2D ArrayBuffer?
My code is:
var example = inputRdd.transform(x =>
  x.groupBy(_(0))
    .mapValues(x => x
      .foldLeft(ArrayBuffer[ArrayBuffer[Double]]) { (a, b) => {
        a(b(1).toInt)(??) += b(2).toDouble // Don't know the second dimension
      }}))
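One way to sidestep the unknown second index is to fix only the counter dimension up front and append to the inner buffer, so the next empty cell never has to be located. The sketch below shows that shape in plain Python, with a list of lists standing in for ArrayBuffer[ArrayBuffer[Double]]. The name group_values, the num_counters parameter, and the zero-based counter index are assumptions made for illustration.

```python
from collections import defaultdict

def group_values(records, num_counters):
    """Group (id, counter, value) records into id -> one growable list per counter."""
    # The outer dimension (counters) is fixed; each inner list grows by append.
    table = defaultdict(lambda: [[] for _ in range(num_counters)])
    for rec_id, counter, value in records:
        table[rec_id][counter].append(value)  # no second index needed
    return dict(table)

# Hypothetical records matching the sketch in the question (counters are 0-based here).
records = [("id", 0, 0.2), ("id", 0, 0.3), ("id", 1, 0.1)]
grouped = group_values(records, num_counters=2)
print(grouped)
```

In Scala the analogous move would be to pre-size the outer buffer with empty inner buffers and then use += on the inner buffer selected by the counter.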

Calculate frequency of column in data frame using spark sql

I'm trying to get the frequency of distinct values in a Spark DataFrame column, something like value_counts in Python pandas. By frequency I mean the number of occurrences of the most frequent values in a table column (the rank 1 value, rank 2, rank 3, etc.). In the expected output, 1 occurs 9 times in column a, so it has the top frequency.
I'm using Spark SQL, but it is not working, maybe because the reduce operation I have written is wrong.
**Pandas Example**
value_counts().index[1]
**Current Code in Spark**
val x= parquetRDD_subset.schema.fieldNames
val dfs = x.map(field => spark.sql
(s"select 'ParquetRDD' as TableName,
'$field' as column,
min($field) as min, max($field) as max,
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field) as frequency from peopleRDDtable"))
val withSum = dfs.reduce((x, y) => x.union(y)).distinct()
withSum.show()
The problem area is the query below.
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field)
**Expected output**
TableName  | column | min | max | frequency1
---------- | ------ | --- | --- | ----------
ParquetRDD | a      | 1   | 30  | 9
ParquetRDD | b      | 2   | 21  | 5
How do I solve this? Please help.
I solved the issue by using count($field) instead of approx_count_distinct($field), and then used the rank analytic function to pick the rank 1 value. It worked.
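The count-then-rank idea can be checked on a small scale in plain Python with collections.Counter, which plays the role of pandas value_counts here. The column values below are hypothetical, chosen so that 1 occurs 9 times as in the expected output, and top_frequencies is an illustrative name.

```python
from collections import Counter

def top_frequencies(values, n=1):
    """Return the n most frequent values as (value, count), highest count first."""
    return Counter(values).most_common(n)

# Hypothetical contents of column a: 1 occurs 9 times, 30 three times, 7 twice.
column_a = [1] * 9 + [30] * 3 + [7] * 2

ranked = top_frequencies(column_a, n=2)
frequency1 = ranked[0][1]  # count of the rank 1 value
print(ranked, frequency1)
```

Counting each value exactly and then ranking by count mirrors the count($field) plus rank approach; approx_count_distinct instead estimates how many distinct values exist, which is a different quantity.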