Scala find missing values in a range

For a given range, for instance
val range = (1 to 5).toArray
val ready = Array(2,4)
the missing values (not ready) are
val missing = range.toSet diff ready.toSet
Set(5, 1, 3)
The real use case includes thousands of range instances with (possibly) thousands of missing or not ready values. Is there a more time-efficient approach in Scala?

The diff operation on immutable sets is implemented in Scala as a foldLeft over the right operand, removing each of its elements from the left collection. Let's assume the left and right operands have m and n elements, respectively.
Calling toSet on an Array or a Range returns a HashSet (a HashTrieSet under the hood), whose remove operation is effectively O(1). Building range.toSet costs O(m) and removing the n ready elements costs O(n); since ready is a subset of range (n <= m), the overall complexity of the diff approach is O(m).
To see that this is already quite good, consider a different approach: sort both collections and then traverse them once in a merge-sort fashion, dropping every element that occurs in both. This gives you a complexity of O(max(m, n) * log(max(m, n))), because you have to sort both collections first.
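For illustration, here is a minimal sketch of that merge-style traversal (my own code, not from the question; it assumes both arrays are sorted in ascending order, which is where the sorting cost for ready comes from):

def sortedDiff(range: Array[Int], ready: Array[Int]): Array[Int] = {
  // Single pass over both sorted arrays, keeping the elements of range
  // that do not occur in ready.
  val out = Array.newBuilder[Int]
  var i = 0
  var j = 0
  while (i < range.length) {
    if (j >= ready.length || range(i) < ready(j)) {
      out += range(i)   // not present in ready
      i += 1
    } else if (range(i) == ready(j)) {
      i += 1            // present in both: drop it
      j += 1
    } else {
      j += 1            // ready(j) < range(i): advance ready
    }
  }
  out.result()
}

// sortedDiff((1 to 5).toArray, Array(2, 4).sorted)   // -> Array(1, 3, 5)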
Update
I ran some experiments to investigate whether the computation can be sped up by using mutable hash sets instead of immutable ones. The result, as shown in the tables below, is that it depends on the size ratio of ready to range.
Using immutable hash sets appears to be more efficient as long as ready.size / range.size < 0.2; above this ratio, the mutable hash sets outperform the immutable ones.
For my experiments I set range = (1 to n), with n being the number of elements in range. For ready I selected a random subset of range with the respective number of elements. I repeated each run 20 times and summed the elapsed times measured with System.currentTimeMillis().
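The two variants compared below are roughly the following (a minimal sketch; the timing harness and the random selection of ready are omitted, and the mutable variant reflects my reading of the setup):

import scala.collection.mutable

val n = 100000
val range = (1 to n).toArray
val ready = Array(2, 4)              // in the experiments: a random subset of range

// Immutable variant: build two immutable HashSets and diff them.
val missingImmutable = range.toSet diff ready.toSet

// Mutable variant: fill a mutable HashSet with range, then remove the ready elements.
val buf = mutable.HashSet.empty[Int]
buf ++= range
buf --= ready
val missingMutable = buf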
range.size == 100000 (total time over 20 runs, in ms)
+-----------+-----------+---------+
| Fraction | Immutable | Mutable |
+-----------+-----------+---------+
| 0.01 | 28 | 111 |
| 0.02 | 23 | 124 |
| 0.05 | 39 | 115 |
| 0.1 | 113 | 129 |
| 0.2 | 174 | 140 |
| 0.5 | 472 | 200 |
| 0.75 | 722 | 203 |
| 0.9 | 786 | 202 |
| 1.0 | 743 | 212 |
+-----------+-----------+---------+
range.size == 500000 (total time over 20 runs, in ms)
+-----------+-----------+---------+
| Fraction | Immutable | Mutable |
+-----------+-----------+---------+
| 0.01 | 73 | 717 |
| 0.02 | 140 | 771 |
| 0.05 | 328 | 722 |
| 0.1 | 538 | 706 |
| 0.2 | 1053 | 836 |
| 0.5 | 2543 | 1149 |
| 0.75 | 3539 | 1260 |
| 0.9 | 4171 | 1305 |
| 1.0 | 4403 | 1522 |
+-----------+-----------+---------+

Related

pyspark - preprocessing with a kind of "product-join"

I have 2 datasets that I can represent as:
The first dataframe is my raw data. It contains millions of rows and around 6000 areas.
+--------+------+------+-----+-----+
| user | area | time | foo | bar |
+--------+------+------+-----+-----+
| Alice | A | 5 | ... | ... |
| Alice | B | 12 | ... | ... |
| Bob | A | 2 | ... | ... |
| Charly | C | 8 | ... | ... |
+--------+------+------+-----+-----+
The second dataframe is a mapping table. It has around 200 areas (far fewer than the raw data's ~6000) for 150 places. Each area can have 1-N places (and a place can have 1-N areas too). It can be represented unpivoted this way:
+------+--------+-------+
| area | place | value |
+------+--------+-------+
| A | placeZ | 0.1 |
| B | placeB | 0.6 |
| B | placeC | 0.4 |
| C | placeA | 0.1 |
| C | placeB | 0.04 |
| D | placeA | 0.4 |
| D | placeC | 0.6 |
| ... | ... | ... |
+------+--------+-------+
or pivoted
+------+--------+--------+--------+-----+
| area | placeA | placeB | placeC | ... |
+------+--------+--------+--------+-----+
| A | 0 | 0 | 0 | ... |
| B | 0 | 0.6 | 0.4 | ... |
| C | 0.1 | 0.04 | 0 | ... |
| D | 0.4 | 0 | 0.6 | ... |
+------+--------+--------+--------+-----+
I would like to create a kind of product-join to have something like:
+--------+--------+--------+--------+-----+--------+
| user | placeA | placeB | placeC | ... | placeZ |
+--------+--------+--------+--------+-----+--------+
| Alice | 0 | 7.2 | 4.8 | 0 | 0.5 | <- 7.2 & 4.8 come from area B and 0.5 from area A
| Bob | 0 | 0 | 0 | 0 | 0.2 |
| Charly | 0.8 | 0.32 | 0 | 0 | 0 |
+--------+--------+--------+--------+-----+--------+
I see 2 options so far.
Option 1:
- Perform a left join between the main table and the pivoted one
- Multiply each place column by the time (around 150 columns)
- Group by user with a sum
Option 2:
- Perform an outer join between the main table and the unpivoted one
- Multiply the time by value
- Pivot place
- Group by user with a sum
I don't like the first option because of the number of multiplications involved (the mapping dataframe is quite sparse).
I prefer the second option, but I see two problems:
If, someday, the dataset does not have a given place represented, the column will not exist and the dataset will have a different shape (hence failing downstream).
Some other features like foo and bar will be duplicated by the outer join, and I'll have to handle them case by case at the grouping stage (sum or average).
I would like to know if there is something more ready-to-use for this kind of product-join in Spark. I have seen the OneHotEncoder, but it only provides a "1" in each column (so it is even worse than the first solution).
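For reference, option 2 roughly corresponds to the sketch below, written in Spark's Scala API since the surrounding posts use Scala (the PySpark calls are analogous); raw, mapping and places are my names for the two dataframes and the fixed list of place columns. Passing an explicit list of values to pivot also addresses the first problem, because absent places still produce (empty) columns:

import org.apache.spark.sql.functions.col

// raw:     user, area, time, foo, bar, ...
// mapping: area, place, value (the unpivoted form)
val places = Seq("placeA", "placeB", "placeC")   // fixed, known-in-advance column set

val result = raw
  .join(mapping, Seq("area"), "outer")           // or "left", keeping every raw row
  .withColumn("weighted", col("time") * col("value"))
  .groupBy("user")
  .pivot("place", places)                        // explicit values keep the shape stable
  .sum("weighted")
  .na.fill(0.0)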
Thanks in advance,
Nicolas

calculate aggregation and percentage simultaneously after groupBy in Scala/Spark Dataset/Dataframe

I am learning to work with Scala and Spark; it's my first time using them. I have a structured Scala Dataset (org.apache.spark.sql.Dataset) in the following format.
Region | Id | RecId | Widget | Views | Clicks | CTR
1 | 1 | 101 | A | 5 | 1 | 0.2
1 | 1 | 101 | B | 10 | 4 | 0.4
1 | 1 | 101 | C | 5 | 1 | 0.2
1 | 2 | 401 | A | 5 | 1 | 0.2
1 | 2 | 401 | D | 10 | 2 | 0.1
NOTE: CTR = Clicks/Views
I want to merge the rows regardless of Widget (i.e. grouping by Region, Id, RecId).
The Expected Output I want is like following:
Region | Id | RecId | Views | Clicks | CTR
1 | 1 | 101 | 20 | 6 | 0.3
1 | 1 | 101 | 15 | 3 | 0.2
What I am getting is like below:
>>> ds.groupBy("Region","Id","RecId").sum().show()
Region | Id | RecId | sum(Views) | sum(Clicks) | sum(CTR)
1 | 1 | 101 | 20 | 6 | 0.8
1 | 2 | 401 | 15 | 3 | 0.3
I understand that it is summing up all the CTR values from the original data, but I want to group as explained and still get the correctly recomputed CTR. I also don't want the column names to change, as they do in my approach.
Is there any possible way of calculating it in such a manner? I also have #Purchases and ConversionRate (#Purchases/Views) and want to do the same thing with those fields. Any leads will be appreciated.
You can calculate the CTR after the aggregation. Try the code below.
import org.apache.spark.sql.functions.{col, sum}

ds.groupBy("Region", "Id", "RecId")
  .agg(sum(col("Views")).as("Views"), sum(col("Clicks")).as("Clicks"))
  .withColumn("CTR", col("Clicks") / col("Views"))   // CTR = Clicks / Views
  .show()
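The same pattern extends to the other measures mentioned in the question; a sketch, assuming the purchase counts live in a column named "Purchases" (that name is a guess, adjust it to your schema):

import org.apache.spark.sql.functions.{col, sum}

ds.groupBy("Region", "Id", "RecId")
  .agg(
    sum(col("Views")).as("Views"),
    sum(col("Clicks")).as("Clicks"),
    sum(col("Purchases")).as("Purchases"))               // hypothetical column for #Purchases
  .withColumn("CTR", col("Clicks") / col("Views"))
  .withColumn("ConversionRate", col("Purchases") / col("Views"))
  .show()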

Split postgres records into groups based on time fields

I have a table with records that look like this:
| id | coord-x | coord-y | time |
---------------------------------
| 1 | 0 | 0 | 123 |
| 1 | 0 | 1 | 124 |
| 1 | 0 | 3 | 125 |
The time column represents a time in milliseconds. What I want to do is find all coord-x, coord-y as a set of points for a given timeframe for a given id. For any given id there is a unique coord-x, coord-y, and time.
What I need to do, however, is group these points as long as consecutive points are no more than n milliseconds apart. So if I have this:
| id | coord-x | coord-y | time |
---------------------------------
| 1 | 0 | 0 | 123 |
| 1 | 0 | 1 | 124 |
| 1 | 0 | 3 | 125 |
| 1 | 0 | 6 | 140 |
| 1 | 0 | 7 | 141 |
I would want a result similar to this:
| id | points | start-time | end-time |
| 1 | (0,0), (0,1), (0,3) | 123 | 125 |
| 1 | (0,6), (0,7) | 140 | 141 |
I do have PostGIS installed on my database. The times I posted above are not representative; I kept them small just as a sample. The time is just a millisecond timestamp.
The tricky part is picking the expression inside your GROUP BY. If n = 5, you can do something like time / 5. To match the example exactly, the query below uses (time - 3) / 5. Once you have the groups, you can aggregate the points into an array with array_agg.
SELECT
  array_agg(("coord-x", "coord-y")) AS points,
  min(time) AS time_start,
  max(time) AS time_end
FROM "<your_table>"
WHERE id = 1
GROUP BY (time - 3) / 5
Here is the output
+---------------------------+--------------+------------+
| points | time_start | time_end |
|---------------------------+--------------+------------|
| {"(0,0)","(0,1)","(0,3)"} | 123 | 125 |
| {"(0,6)","(0,7)"} | 140 | 141 |
+---------------------------+--------------+------------+

SQL Database Design - Combinations of factors affecting a final value

The price of a service may depend on various combinations of factors. Example Tables:
ServiceType_Duration (combination of 2 factors)
+-------------+----------+--------+
| ServiceType | Duration | Amount |
+-------------+----------+--------+
| Massage | 30 | 30 |
| Massage | 60 | 50 |
| Reflexology | 30 | 50 |
| Reflexology | 60 | 70 |
+-------------+----------+--------+
DiscountCode (1 factor)
+--------------+--------+---------+
| DiscountCode | Amount | Percent |
+--------------+--------+---------+
| D1 | -10 | NULL |
| D2 | NULL | -10% |
+--------------+--------+---------+
In this very simple example, a 30 minute massage with discount code D1 would have total price 30 - 10 = 20.
However, there may be many more such tables of the general format:
Factor1
Factor2
...
FactorN
Amount (possibly)
Percent (possibly)
I'm not sure about having lots of 'pricing tables' when the general format is always going to be the same. It could mean having to add/remove the same column from lots of tables.
Are the tables above OK? If not, what's the best practice for storing this sort of data?

scala specialization - using object instead of class causes slowdown?

I've done some benchmarks and have results that I don't know how to explain.
The situation in a nutshell:
I have 2 classes doing the same (computation-heavy) thing with generic arrays; both of them use specialization (@specialized, abbreviated @spec below). One class is defined like:
class A[@spec T] {
  def a(d: Array[T], c: Whatever[T], ...) = ...
  ...
}
The second is a singleton:
object B {
  def a[@spec T](d: Array[T], c: Whatever[T], ...) = ...
  ...
}
And in the second case, I get a huge performance hit. Why can this happen? (Note: at the moment I don't understand Java bytecode or Scala compiler internals very well.)
More details:
Full code is here: https://github.com/magicgoose/trashbox/tree/master/sorting_tests/src/magicgoose/sorting
This is a sorting algorithm ported from Java, (almost) automatically converted to Scala, with the comparison operations changed to generic ones to allow custom comparisons on primitive types without boxing. Plus a simple benchmark (tests on different lengths, with JVM warmup and averaging).
The results look like this (the left column is the original Java Arrays.sort(int[])):
JavaSort | JavaSortGen$mcI$sp | JavaSortGenSingleton$mcI$sp
length 2 | time 0.00003ms | length 2 | time 0.00004ms | length 2 | time 0.00006ms
length 3 | time 0.00003ms | length 3 | time 0.00005ms | length 3 | time 0.00011ms
length 4 | time 0.00005ms | length 4 | time 0.00006ms | length 4 | time 0.00017ms
length 6 | time 0.00008ms | length 6 | time 0.00010ms | length 6 | time 0.00036ms
length 9 | time 0.00013ms | length 9 | time 0.00015ms | length 9 | time 0.00069ms
length 13 | time 0.00022ms | length 13 | time 0.00028ms | length 13 | time 0.00135ms
length 19 | time 0.00037ms | length 19 | time 0.00040ms | length 19 | time 0.00245ms
length 28 | time 0.00072ms | length 28 | time 0.00060ms | length 28 | time 0.00490ms
length 42 | time 0.00127ms | length 42 | time 0.00096ms | length 42 | time 0.01018ms
length 63 | time 0.00173ms | length 63 | time 0.00179ms | length 63 | time 0.01052ms
length 94 | time 0.00280ms | length 94 | time 0.00280ms | length 94 | time 0.01522ms
length 141 | time 0.00458ms | length 141 | time 0.00479ms | length 141 | time 0.02376ms
length 211 | time 0.00731ms | length 211 | time 0.00763ms | length 211 | time 0.03648ms
length 316 | time 0.01310ms | length 316 | time 0.01436ms | length 316 | time 0.06333ms
length 474 | time 0.02116ms | length 474 | time 0.02158ms | length 474 | time 0.09121ms
length 711 | time 0.03250ms | length 711 | time 0.03387ms | length 711 | time 0.14341ms
length 1066 | time 0.05099ms | length 1066 | time 0.05305ms | length 1066 | time 0.21971ms
length 1599 | time 0.08040ms | length 1599 | time 0.08349ms | length 1599 | time 0.33692ms
length 2398 | time 0.12971ms | length 2398 | time 0.13084ms | length 2398 | time 0.51396ms
length 3597 | time 0.20300ms | length 3597 | time 0.20893ms | length 3597 | time 0.79176ms
length 5395 | time 0.32087ms | length 5395 | time 0.32491ms | length 5395 | time 1.30021ms
The latter is the one defined inside an object, and it's awful (about 4 times slower).
Update 1
I've run the benchmark with and without the scalac -optimise option, and there are no noticeable differences (only slower compilation with -optimise).
It's just one of many bugs in specialization; I'm not sure whether this one has been reported on the bug tracker or not. If you throw an exception from your sort, you'll see that it calls the generic version, not the specialized version, of the second sort:
java.lang.Exception: Boom!
at magicgoose.sorting.DualPivotQuicksortGenSingleton$.magicgoose$sorting$DualPivotQuicksortGenSingleton$$sort(DualPivotQuicksortGenSingleton.scala:33)
at magicgoose.sorting.DualPivotQuicksortGenSingleton$.sort$mFc$sp(DualPivotQuicksortGenSingleton.scala:13)
Note how the top thing on the stack is DualPivotQuicksortGenSingleton$$sort(...) instead of ...sort$mFc$sp(...)? Bad compiler, bad!
As a workaround, you can wrap your private methods inside a final helper object, e.g.
def sort[@spec T](a: Array[T]) { Helper.sort(a, 0, a.length) }
private final object Helper {
  def sort[@spec T](a: Array[T], i0: Int, i1: Int) { ... }
}
For whatever reason, the compiler then realizes that it ought to call the specialized variant. I haven't tested whether every specialized method that is called by another needs to be inside its own object; I'll leave that to you via the exception-throwing trick.
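A minimal sketch of that exception-throwing trick (the object and method names here are illustrative, not taken from the code above):

object Probe {
  // Throw from inside the method whose specialization you want to verify,
  // then look at the stack trace produced by a call with a primitive array.
  def work[@specialized(Int) T](a: Array[T]): Unit =
    throw new Exception("Boom!")
}

// Probe.work(Array(1, 2, 3))
// If the frame that reaches the throw has a name like work$mIc$sp (the suffix
// encodes the primitive type), the specialized variant was called; a plain
// `work` frame means the generic fallback was hit.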