max() with struct() in Spark Dataset - scala

I have something like the below in Spark, where I'm grouping and then trying to find the row with the highest value from my struct.
test.map(x => tester(x._1, x._2, x._3, x._4, x._5))
.toDS
.select($"ac", $"sk", struct($"num1", struct($"time", $"num1")).as("grp"))
.groupBy($"ac", $"sk")
.agg(max($"grp")).show(false)
I am not sure how the max function figures out which struct is the max. The reason I used a nested struct is that it seemed to make the max function use num1 instead of the subsequent numbers, unlike when everything was in the same struct.

StructTypes are compared lexicographically: field by field, from left to right, and all fields have to be recursively orderable. So in your case:
It will compare the first fields of the structs.
If they are not equal, it will return the struct with the higher value.
Otherwise, it will move on to the next field.
Since the second field is a struct as well, it will repeat the procedure from the first step, this time comparing the time fields first.
Note that the nested num1 can be evaluated only if the top-level num1 fields are equal, so it doesn't affect the ordering in practice.
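The same field-by-field rule can be illustrated with plain Python tuples standing in for structs (just an analogy, no Spark involved; the values are made up):

```python
# Tuples, like Spark structs, compare lexicographically: the first field is
# compared first, and later fields are consulted only on a tie.
a = (3, (10, 3))  # num1=3, nested (time=10, num1=3)
b = (3, (12, 3))  # same top-level num1, later time
c = (5, (1, 5))   # larger top-level num1

print(max([a, b, c]))  # first fields differ, so c wins outright
print(max([a, b]))     # tie on the first field, so the nested time decides
```

Here max([a, b, c]) is (5, (1, 5)) because the first field alone decides it, while max([a, b]) is (3, (12, 3)) because the tie on num1 pushes the comparison into the nested struct.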

Related

`set_sorted` when a dataframe is sorted on multiple columns

I have some panel data in polars. The dataframe is sorted by its id column and then its date column (basically it's a bunch of time series concatenated together).
I've seen that polars has a .set_sorted method for working with expressions. I can of course use pl.col("id").set_sorted(), but I want polars to be aware that the frame is actually sorted by both the id and date columns. In pandas I know the Index has an .is_monotonic_increasing property that reflects whether all the columns of the Index are sorted; is there a way to do something similar with polars?
Have you tried
df.get_column('id').is_sorted()
and
df.get_column('date').is_sorted()
to see if they're each already known to be sorted?
For instance if I do:
df=pl.DataFrame({'a':[1,1,2,2], 'b':[1,2,3,4]})
df.get_column('a').is_sorted()
df.get_column('b').is_sorted()
Then I get two Trues even though I haven't ever told it that the columns are sorted.
In general, I don't think you want to be manually setting columns as sorted. Just sort them and it'll keep track of the fact that they're sorted.
If you do:
df=pl.DataFrame({'a':[1,2,1,2], 'b':[1,3,2,4]})
df.get_column('a').is_sorted()
df.get_column('b').is_sorted()
then you get False twice, as you'd hope. If you then do df=df.sort(['a','b']) and check the sortedness of a and b again, you'll see that it knows they're sorted.

getting number of values within reduceByKey RDD

When the reduceByKey operation is called, it receives the list of values for a particular key. My questions are:
Is the list of values it receives in sorted order?
Is it possible to know how many values it receives?
I'm trying to calculate the first quartile of the list of values for a key within reduceByKey. Is this possible to do within reduceByKey?
1. No; that would go against the whole point of a reduce operation, i.e. to parallelize an operation into an arbitrary tree of sub-operations by taking advantage of associativity and commutativity.
2. You'll need to define a new monoid by composing the integer monoid with whatever it is you're doing. Let's assume your operation is op; then
yourRdd.map(kv => (kv._1, (kv._2, 1)))
.reduceByKey((left, right) => (left._1 op right._1, left._2 + right._2))
will give you an RDD[(KeyType, (ReducedValueType, Int))] where the Int is the number of values the reduce received for each key.
3. You'll have to be more specific about what you mean by the first quartile. Given that the answer to 1 is no, you would need a known bound that defines the first quartile; then you wouldn't need the data to be sorted, because you could simply filter the values by that bound.
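A pure-Python simulation of the (value, count) monoid trick from point 2, with a plain dict standing in for the RDD and + standing in for op (no Spark required):

```python
# Simulate map + reduceByKey for counting how many values each key received.
pairs = [("k", 3), ("k", 5), ("j", 1), ("k", 2)]

# map step: kv -> (key, (value, 1))
mapped = [(k, (v, 1)) for k, v in pairs]

# reduce step: combine values with op (+ here) and add the counts.
def combine(left, right):
    return (left[0] + right[0], left[1] + right[1])

result = {}
for k, v in mapped:
    result[k] = combine(result[k], v) if k in result else v

print(result)  # {'k': (10, 3), 'j': (1, 1)}
```

The second element of each pair is the count of values the reduce saw for that key, exactly as in the RDD version above; the combine function is associative and commutative, which is what makes it safe under reduceByKey's arbitrary reduction tree.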

Calculate hash for java.sql.ResultSet

I need to know whether the results of a SQL query have changed between two executions.
The solution I came up with is to calculate and compare a hash value based on the ResultSet content.
What is the preferred way?
There is no special hashCode method for ResultSet that is calculated from all of the retrieved data; you definitely cannot use the default hashCode method.
To be 100% sure that you take all changes in the data into account,
you have to retrieve every column of every row from the ResultSet one by one and calculate a hash code over them in some way (e.g. put everything into a single String and take its hashCode).
But that is a very time-consuming operation. I would propose executing an extra query that calculates a checksum itself; for example, it could return the count of rows and the sum over all columns/rows... or something like that.
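A sketch of the row-by-row hashing idea, using Python's sqlite3 as a stand-in for JDBC (table and data invented for the example; note the ORDER BY, since a stable row order is needed for a stable hash):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])

def result_hash(conn, query):
    # Feed every column of every row into one digest.
    h = hashlib.sha256()
    for row in conn.execute(query):
        for col in row:
            h.update(repr(col).encode())
    return h.hexdigest()

before = result_hash(conn, "SELECT * FROM t ORDER BY id")
conn.execute("UPDATE t SET name = 'c' WHERE id = 2")
after = result_hash(conn, "SELECT * FROM t ORDER BY id")
print(before != after)  # the change in row 2 changes the hash
```

As the answer says, this touches every value, so for large results a server-side checksum query is usually the cheaper option.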

Overriding Ordering[Int] in Scala

I'm trying to sort an array of integers with a custom ordering.
E.g.
quickSort[Int](indices)(Ordering.by[Int, Double](value(_)))
Basically, I'm trying to sort the indices of rows by the values of a particular column. I end up with a StackOverflowError when I run this on fairly large data. If I use a more direct approach (e.g. sorting tuples), this is not a problem.
Is there a problem if you try to extend the default Ordering[Int]?
You can reproduce this like this:
val indices = (0 to 99999).toArray
val values = Array.fill[Double](100000)(math.random)
scala.util.Sorting.quickSort[Int](indices)(Ordering.by[Int, Double](values(_))) // Works
val values2 = Array.fill[Double](100000)(0.0)
scala.util.Sorting.quickSort[Int](indices)(Ordering.by[Int, Double](values2(_))) // Fails
Update:
I think I found out what the problem is (I'm answering my own question). It seems that I created a paradoxical situation by changing the ordering definition of integers.
Within the quickSort algorithm itself, array positions are also integers, and certain statements compare positions in the array. That position comparison should follow the standard integer ordering.
But because of the new definition, those position comparisons also follow the indexed-value comparator, and things get really messed up.
I suppose that, at least for the time being, I shouldn't be overriding the default ordering of a value type, as libraries may depend on it.
Update2
It turns out that the above is in fact not the problem; there's a bug in quickSort when used together with an Ordering. When a custom Ordering is defined, equality should be checked with its 'equiv' method, but quickSort uses '=='. This results in the indices themselves being compared rather than the indexed values.
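For contrast, here is the same index-sort expressed in Python, where the sort takes a key function and never compares the indices themselves, so the all-equal-keys case that broke quickSort is harmless (a sketch of the pattern, not the Scala fix):

```python
# All keys equal: the degenerate case that triggered the Scala failure.
values = [0.0] * 100000
indices = list(range(len(values)))

# Sort indices by the value they point at. The sort compares values[i],
# never i itself, and Python's sort is stable, so equal keys keep their
# original relative order.
indices.sort(key=lambda i: values[i])
print(indices[:5])  # [0, 1, 2, 3, 4]
```

The key-function style sidesteps the whole class of bugs above, because there is no way for index comparisons and value comparisons to get mixed up.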

How to do pandas groupby([multiple columns]) so its result can be looked up

I have two dataframes: tr is a training-set, ts is a test-set.
They contain columns uid (a user_id), categ (a categorical), and response.
response is the dependent variable I'm trying to predict in ts.
I am trying to compute the mean of response in tr, broken out by columns uid and categ:
avg_response_uid_categ = tr.groupby(['uid','categ']).response.mean()
This gives the result, but (unwantedly) the dataframe's index is a MultiIndex (this is the default groupby(..., as_index=True) behavior):
MultiIndex[--5hzxWLz5ozIg6OMo6tpQ SomeValueOfCateg, --65q1FpAL_UQtVZ2PTGew AnotherValueofCateg, ...
But instead I want the result to keep the two columns 'uid', 'categ' and keep them separate.
Should I use aggregate() instead of groupby()?
Trying groupby(as_index=False) is useless.
The result seems to differ depending on whether you do:
tr.groupby(['uid','categ']).response.mean()
or:
tr.groupby(['uid','categ'])['response'].mean() # RIGHT
i.e. whether you slice a single Series or a DataFrame containing a single Series. Related: Pandas selecting by label sometimes returns Series, sometimes returns DataFrame.
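A minimal runnable illustration of keeping uid and categ as ordinary columns (toy data, invented values), either by resetting the MultiIndex afterwards or by combining as_index=False with the ['response'] selection:

```python
import pandas as pd

tr = pd.DataFrame({
    "uid":      ["u1", "u1", "u2"],
    "categ":    ["a",  "b",  "a"],
    "response": [1.0,  3.0,  5.0],
})

# Option 1: let groupby build the MultiIndex, then flatten it back to columns.
out1 = tr.groupby(["uid", "categ"])["response"].mean().reset_index()

# Option 2: ask groupby not to build an index at all; with the ['response']
# bracket selection (rather than attribute access), as_index=False is honored.
out2 = tr.groupby(["uid", "categ"], as_index=False)["response"].mean()

print(out1.equals(out2))  # both have plain columns uid, categ, response
```

Both forms yield a flat DataFrame with uid, categ, and response as regular columns, which can then be merged or looked up against ts.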