Overriding Ordering[Int] in Scala

I'm trying to sort an array of integers with a custom ordering.
E.g.
quickSort[Int](indices)(Ordering.by[Int, Double](value(_)))
Basically, I'm trying to sort indices of rows by the values of a particular column. I end up with a StackOverflowError when I run this on fairly large data. If I use a more direct approach (e.g., sorting an array of tuples), this is not a problem.
Is there a problem if you try to extend the default Ordering[Int]?
You can reproduce it like this:
val indices = (0 to 99999).toArray
val values = Array.fill[Double](100000)(math.random)
scala.util.Sorting.quickSort[Int](indices)(Ordering.by[Int, Double](values(_))) // Works
val values2 = Array.fill[Double](100000)(0.0)
scala.util.Sorting.quickSort[Int](indices)(Ordering.by[Int, Double](values2(_))) // Fails
Update:
I think I found out what the problem is (I am answering my own question). It seems that I created a paradoxical situation by changing the ordering definition of integers.
Within the quickSort algorithm itself, array positions are also integers, and certain statements compare positions within the array. These position comparisons should follow the standard integer ordering.
But because of the new definition, those position comparisons now also go through the value-based comparator, and things get thoroughly mixed up.
I suppose that, at least for the time being, I shouldn't change the default ordering of a value type, since library code may depend on it.
Update 2:
It turns out that the above is in fact not the problem: there is a bug in quickSort when it is used together with an Ordering. The equality test a new Ordering defines is 'equiv', but quickSort uses '=='. This results in the indices themselves being compared for equality, rather than the values they index.
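For reference, here is a minimal sketch of the tuple-based workaround mentioned at the top (it reuses indices and values2 from the reproduction above; the sketch is illustrative, not code from the original post):

// Pair each index with its value and sort the pairs, so quickSort
// runs with the default tuple ordering instead of a custom Ordering[Int].
val pairs: Array[(Double, Int)] = indices.map(i => (values2(i), i))
scala.util.Sorting.quickSort(pairs)
val sortedIndices: Array[Int] = pairs.map(_._2)
// Alternatively, sortBy avoids the custom Ordering[Int] entirely
// and returns a new sorted array.
val sortedToo: Array[Int] = indices.sortBy(values2(_))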

Related

How does Snowflake calculate its HASH() output?

Take a look at this query
select
hash( col1, col2 ) as a,
col1||col2 as b, -- just taking a guess as to how hash can take multiple values
hash( b ) as c
from table_name
The results for a and c are different.
So, my question is: how does Snowflake calculate the hash when there are multiple fields, as in a? Is it concatenating the fields first and then hashing that concatenated result?
Thank you
More to NickW's point that HASH is proprietary:
HASH is a proprietary function that accepts a variable number of input expressions of arbitrary types and returns a signed value. It is not a cryptographic hash function and should not be used as such.
I assume the core of what you are trying to achieve is to "make a value in another system and be able to compare the two safely", for which concatenating strings together seems very dangerous, as the boundaries between the individual strings are lost: ('ab', 'c') and ('a', 'bc') both concatenate to 'abc'.
The usage notes section has some good hints:
Any two values of type NUMBER that compare equally will hash to the same hash value, even if the respective types have different precision and/or scale.
This implies that values are converted to a common form before hashing. But the usage notes also warn, regarding conversion:
Note that this guarantee does not apply to other combinations of types, even if implicit conversions exist between the types.
What would really help is for you to describe what you actually want to happen; then whether "knowing how HASH works" is the best path to that end (or not, as I would suggest) would be more answerable.
In other words, this answer is a long-form request for clarification, suggesting the question needs to be reworked.

`set_sorted` when a dataframe is sorted on multiple columns

I have some panel data in polars. The dataframe is sorted by its id column and then its date column (basically it's a bunch of time series concatenated together).
I've seen that polars has a .set_sorted method for working with expressions. I can of course set pl.col("id").set_sorted(), but I want it to be aware that it's actually sorted by both the id and date columns. In pandas, I know the Index has an .is_monotonic_increasing property that is aware of whether all the columns of the Index are sorted; is there a way to do something similar with polars?
Have you tried
df.get_column('id').is_sorted()
and
df.get_column('date').is_sorted()
to see if they're each already known to be sorted?
For instance, if I do:
import polars as pl

df = pl.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4]})
df.get_column('a').is_sorted()  # True
df.get_column('b').is_sorted()  # True
Then I get two Trues, even though I haven't ever told it that the columns are sorted.
In general, I don't think you want to be manually setting columns as sorted. Just sort them and it'll keep track of the fact that they're sorted.
If you do:
df = pl.DataFrame({'a': [1, 2, 1, 2], 'b': [1, 3, 2, 4]})
df.get_column('a').is_sorted()  # False
df.get_column('b').is_sorted()  # False
then you get False twice, as you'd hope. If you then do df = df.sort(['a', 'b']) and follow that up by checking the sortedness of a and b again, you'll see that it knows they're sorted.

How does Scala's VectorMap work, and how is it different from ListMap?

How does Scala's VectorMap work? The documentation says lookup is constant time.
I think ListMap has to iterate through everything to find an entry. Why would VectorMap be different?
Is it a hash table combined with a vector, where the hash table maps a key to an index in the vector, and the vector holds the entries?
Essentially, yes. It has a regular Map inside that maps keys to (index, value) tuples, where the index points into a Vector of keys, which is used only for in-order access (.head, .tail, .last, .iterator, etc.).
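For illustration, a small REPL-style sketch using only the standard library (VectorMap lives in scala.collection.immutable since Scala 2.13); the comments restate the answer's description rather than the class's private internals:

import scala.collection.immutable.VectorMap

val vm = VectorMap("b" -> 2, "a" -> 1, "c" -> 3)
vm("a")            // 1: lookup goes through the inner Map, effectively constant time
vm.head            // ("b", 2): in-order access follows the inner Vector of keys
vm.iterator.toList // List(("b", 2), ("a", 1), ("c", 3)): insertion order is preserved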

Error in makeClassifTask - columns to join must specify "on="

I am getting an error with makeClassifTask() from the mlr package.
task = makeClassifTask(data = data[,2:20441], target='Disease')
Entering this, I get the following error:
Provided data is not a pure data.frame but from class data.table, hence it will be converted.
Error in [.data.table(data, target) :
When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
If someone could help me out, it'd be great.
Given that you did not provide the data, I can only do some guessing and suggest reading the documentation at https://mlr3book.mlr-org.com/tasks.html.
It looks like you left out the first column of your dataset, which might be your target. Hence makeClassifTask() cannot find your target column.
As @Shreyash Gputa correctly pointed out, changing the data.table object to a data.frame object solves the issue:
task = makeClassifTask(data = as.data.frame(data[,2:20441]), target='Disease')
Given of course that data[,2:20441] contains the target variable Disease...

A type for sorted arrays in Swift

Not sure if this is science fiction, but would it be possible to create a type that represents an Array that matches a certain condition, such as being always sorted?
Or a 2-tuple where the first element is always bigger than the second?
What you're describing is called a dependent type (https://en.wikipedia.org/wiki/Dependent_type). Swift does not have these, and I'm not aware of any mainstream (non-research) language that does. You can of course create a special kind of collection that is indexed like an array and sorts itself whenever it is modified, and you can create a struct with greater and lesser properties that always reorders itself. But these criteria cannot be attached to the existing Array or tuple types.
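To make the "sorts itself whenever it is modified" idea concrete, here is a minimal sketch in Scala (the language used elsewhere on this page); the SortedBuffer name and API are invented for illustration, and the same pattern translates directly to a Swift struct wrapping an Array:

// A collection that keeps the invariant "elems is sorted"
// after every modification.
final class SortedBuffer[A](implicit ord: Ordering[A]) {
  private var elems = Vector.empty[A]

  // Insert at the binary-search position so the buffer stays sorted.
  def insert(a: A): Unit = {
    val i = elems.search(a).insertionPoint
    elems = elems.patch(i, Vector(a), 0)
  }

  def apply(i: Int): A = elems(i) // array-like indexed access
  def toVector: Vector[A] = elems
}

val buf = new SortedBuffer[Int]
buf.insert(3); buf.insert(1); buf.insert(2)
buf.toVector // Vector(1, 2, 3)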