How to customize attrs field hash - joblib

I'd like to use a Numpy array as a field value while keeping my attrs class hashable. For that purpose, I found joblib's hash() function to be a good means of hashing Numpy arrays. Is there any possibility to keep using attrs's default hash function while specifying manually how to hash each field, e.g. with something like
import attrs
import numpy as np
from joblib import hash as jbhash

@attrs.frozen
class MyClass:
    field: np.ndarray = attrs.field(hash=jbhash)  # I know this doesn't work at the moment
or do I have to write my own __hash__()?
Notes:
(I omitted a converter that makes field non-writeable, for brevity.)
Context: My goal is to use this data class as an argument of a function memoized using functools.lru_cache().
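For reference, a minimal sketch of that memoization pattern (the function name and body are illustrative, not part of my actual code):

from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_computation(obj):  # hypothetical memoized function
    # lru_cache stores obj in an internal dict, which is why
    # obj must support __hash__ and __eq__
    return obj.field.sum()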

I'm afraid that's currently not possible, primarily because nobody ever asked for it. Hashing is usually something people don't care about until it breaks...
The feature request is tracked at https://github.com/python-attrs/attrs/issues/1076
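Until then, the manual route the question mentions should work; a minimal sketch, assuming equality should compare array contents and that hashing joblib's digest string is acceptable:

import attrs
import numpy as np
from joblib import hash as jbhash

@attrs.frozen(eq=False)  # eq=False so attrs leaves our __eq__/__hash__ alone
class MyClass:
    field: np.ndarray

    def __eq__(self, other):
        if not isinstance(other, MyClass):
            return NotImplemented
        return np.array_equal(self.field, other.field)

    def __hash__(self):
        # jbhash returns a stable hex digest of the array's contents
        return hash(jbhash(self.field))

This keeps the class usable with functools.lru_cache(), since the cache looks arguments up by __hash__ and __eq__.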

Related

numba error with tuple sorting containing numpy arrays

I have a (working) function that uses the heapq module to build a priority queue of tuples, and I would like to compile it with numba; however, I get a very long and unclear error. It seems to boil down to a problem with the tuple order comparison needed for the queue. The tuples have a fixed format: the first item is a floating point number (whose order I care about), followed by a numpy array, which I need for computation but which never gets compared when running normally. This is intentional, because comparing numpy arrays yields an array, which cannot be used in conditionals and raises an exception. However, I guess numba needs a scalar-yielding comparison to be defined for all items in the tuple, hence the numba error.
I have a very minimal example:
import numpy
import numba

@numba.njit
def f():
    return 1 if (1, numpy.arange(3)) < (2, numpy.arange(3)) else 2

f()
where the numba compilation fails (without numba it works since it never needs to actually compare the arrays, as in the original code).
Here is a slightly less minimal but maybe clearer example, which shows what I am actually doing:
from heapq import heappush
import numpy
import numba

@numba.njit
def f(n):
    heap = [(1, 0, numpy.random.rand(2, 3))]
    for unique_id in range(n):
        order = numpy.random.rand()
        data = numpy.random.rand(2, 3)
        heappush(heap, (order, unique_id, data))
    return heap[0]

f(100)
Here order is the variable whose ordering I care about in the queue; unique_id is a tie-breaker that prevents the comparison from falling through to data (and raising an exception) when two order values are equal.
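To see why the tie-breaker matters, here is a plain-Python illustration outside numba (values are illustrative):

import numpy

a = (1.0, numpy.arange(3))
b = (1.0, numpy.arange(3))
# Equal first items force the comparison onto the arrays, which raises
# "The truth value of an array with more than one element is ambiguous"
try:
    a < b
except ValueError as e:
    print(e)

c = (1.0, 0, numpy.arange(3))
d = (1.0, 1, numpy.arange(3))
print(c < d)  # True; the unique second item settles it, arrays are never compared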
I tried to bypass the problem by converting the numpy array to a list inside the tuple and back to an array for computation, but while this compiles, the numba version is slower than the interpreted one, even though the array is quite small (usually 2x3). Without converting, I would need to rewrite the code as loops, which I would prefer to avoid (but is doable).
Is there a better alternative to get this working with numba, hopefully running faster than the python interpreter?
I'll try to respond based on the minimal example you provided.
I think the problem here is not related to numba's ability to compare all the elements of the tuple, but rather to where the result of such a comparison is stored. This is stated in the error log returned when trying to execute your example:
cannot store {i8*, i8*, i64, i64, i8*, [1 x i64], [1 x i64]} to i1*: mismatching types
Basically, you are trying to store the result of a comparison between a pair of floats and a pair of arrays into a single boolean, and numba doesn't know how to do that.
If you are only interested in comparing the first elements of the tuples, the quickest workaround I can think of is forcing the comparison to happen only on the first elements, e.g.
import numpy
import numba

@numba.njit
def f():
    return 1 if (1, numpy.arange(3))[0] < (2, numpy.arange(3))[0] else 2

f()
If this is not applicable to your use case, please provide more details about it.
EDIT
Given the further information you provided, I think the best way to solve this is to avoid pushing the numpy arrays onto the heap. Since you're only interested in the ordering properties of the heap, you can push just the keys onto the heap and store the corresponding numpy arrays in a separate dictionary, using the same values you push onto the heap as keys.
As a side note, when you use standard library functions in nopython-jitted functions, you are relying on numba's own re-implementations of those functions rather than the "original" Python ones. A comprehensive list of the Python features supported by numba can be found in its documentation.
OK, I found a solution to the problem: since storing the array in the heap tuple is the cause of the numba error, it is enough to store it in a separate dictionary with a unique key and keep only the key in the heap tuple. For instance, using an integer as the key:
from heapq import heappush
import numpy
import numba
@numba.njit
def f(n):
    key = 0
    array_storage = {key: numpy.random.rand(2, 3)}
    heap = [(1.0, key)]
    for _ in range(n):
        order = numpy.random.rand()
        data = numpy.random.rand(2, 3)
        key += 1
        heappush(heap, (order, key))
        array_storage[key] = data
    return heap[0]

f(100)
Now the tuples in the heap can be compared, yielding a boolean value, and I still get to associate the data with its tuple. I am not completely satisfied, since it feels like a workaround, but it works pretty well and is not overly complicated. If anyone has a better solution, please let me know!

What is good code style in Scala when calling a method on an object

I don't know how to choose between two code styles for a Scala project when calling a method on an object.
code style 1
import com.socgen.bsc.sqd.per.Load._
val pAndRDf: DataFrame = loadPandR(sqdDate)
code style 2
import com.socgen.bsc.sqd.per.Load
val pAndRDf: DataFrame = Load.loadPandR(sqdDate)
I would like to know which one is better, or whether these two styles are equivalent and we can choose whichever we like.
I prefer style 2, especially when the name of the object is short. If you have a longer name, you can add a rename on import, like:
import com.socgen.bsc.sqd.per.{LongComplexLoad => LCL}
val pAndRDf: DataFrame = LCL.loadPandR(sqdDate)
There is another style I usually do if I am in control. That is to use traits instead of objects.
class MyClass extends Load {
  val pAndRDf: DataFrame = loadPandR(sqdDate)
  ...
}
This has the advantage that you can see what you use in the class declaration. If the list gets too long, it is also a sign that you should think about separation of concerns.
There is also a discussion about that on Reddit
The best idea is to go with option 2 where possible. I have modified your value name too (as I personally think it's more readable).
import com.socgen.bsc.sqd.per.Load
val dfPAndR: DataFrame = Load.loadPandR(sqdDate)
Avoid wildcard imports where possible. They can make your code harder to debug, increase the chances of accidentally importing two Load objects (if you're doing another wildcard import somewhere), and mean you're importing a lot of stuff you don't need, potentially causing bloat.
If you need to import multiple things, then stick to curly braces: import com.org.package.{Load, Write}.
I'd (personally) also change the value name. Most of us read left to right, so now you know it's a DataFrame first, then which value it is.
It's a small change, but it can help speed up debugging, especially if you have an rddPAndR later on (for instance).
It really depends on your style, but for me it is best to minimize the use of "_". I would write it like this:
import com.socgen.bsc.sqd.per.Load.loadPandR
val pAndRDf: DataFrame = loadPandR(sqdDate)

Scala JSR 223 importing types/classes

The following example fails because the definition for Stuff can't be found:
package com.example

import javax.script.ScriptEngineManager

object Driver5 extends App {
  case class Stuff(s: String, d: Double)

  val e = new ScriptEngineManager().getEngineByName("scala")
  println(e.eval("""import Driver5.Stuff; Stuff("Hello", 3.14)"""))
}
I'm unable to find any import statement that allows me to use my own classes inside of the eval statement. Am I doing something wrong? How does one import classes to be used during eval?
EDIT: Clarified example code to elicit more direct answers.
The scripting engine does not know the calling context. It certainly can't access the local variables and imports of the enclosing code, since they are not available in the class files. (Well, variable names may optionally be available as debug information, but it is virtually impossible to use them for this purpose.)
I am not sure if there is a special API for that. Imports differ across languages, so designing an API that fits them all would be difficult.
You should be able to add the imports to the eval-ed String instead, using the fully qualified name (e.g. import com.example.Driver5.Stuff). I am not sure if there is a better way to do this.

Transitively import foo._ in Scala

I'm using a utility library for dimensional analysis that I'd like to extend with my own units, and I'd like to be able to write
import my.util.units._
in files in my project. My thought was to define
package my.util

object units {
  import squants._
  [... other definitions ...]
}
and I expected import my.util.units._ to have the same effect as import squants._, plus the other definitions. But it seems importing units._ doesn't end up bringing squants._ into scope.
Is there a way to do this in Scala?
We've dealt with this a little bit at work, and we've tried to resolve this a few ways. Here's an example of how we import rabbitmq types throughout scala-amqp:
package com.bostontechnologies

package object amqp {
  type RabbitShutdownListener = com.rabbitmq.client.ShutdownListener
  type RabbitConnection = com.rabbitmq.client.Connection
  type RabbitChannel = com.rabbitmq.client.Channel
  type RabbitConsumer = com.rabbitmq.client.Consumer
  type RabbitAddress = com.rabbitmq.client.Address
  ...
}
So now when we import com.bostontechnologies.amqp._ we get access to the rabbitmq types we've defined. I know it requires quite a bit of duplication; however, we've found it to be somewhat useful, especially since it gives us granularity over type names.
Also, you don't need to use a package object; we mainly use one for the convenience of automatically importing our types throughout a package. You could just use a normal object as well.
Imports are not transitive in Java or Scala. Probably the closest you are going to get to what you seek is to create an object (perhaps a package object) with a type definition for each type of interest.

Prevent automatic hash function for mutable classes

Python's built-in types allow hash values only for immutable objects. For example,
hash((1,2,3))
works, but
hash([1,2,3])
raises a TypeError: unhashable type: 'list'. See the Python documentation. However, when I wrap a C++ class in Boost.Python via the usual boost::python::class_<> function, every generated Python class has a default hash function, where the hash value is related to the object's location in memory. (On my 64-bit OS, the hash value is the location divided by 8.)
When I expose a class to Python whose members can be changed (any mutable data structure, so this is a very common situation!), I do not want a default hash function; instead, I want a call to hash() to raise the same TypeError that users receive for Python's own mutable data types. In particular, users shouldn't be able to accidentally use mutable objects as dictionary keys. How can I achieve this in the C++ code?
I found out how it goes:
boost::python::class_<MyClass>("MyClass")
    .setattr("__hash__", boost::python::object());
A boost::python::object initialized with no arguments corresponds to None. The procedure for disabling hash generation in the pure Python C API is a little more complicated, as described in the Python documentation, but the above snippet apparently does the job in Boost.Python.
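For comparison, the pure-Python equivalent of the snippet above is setting __hash__ to None on the class; a small sketch (class name illustrative):

class MyMutable:
    __hash__ = None  # mark instances unhashable, like list and dict

try:
    hash(MyMutable())
except TypeError as e:
    print(e)  # unhashable type: 'MyMutable'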
On a side note: the Boost.Python behaviour mirrors the default behaviour of classes in Python, where objects are hashable by default, with a hash value derived from the object's id (id(x)):
>>> hash(object())
8795488122377
>>> class MyClass(object): pass
...
>>> hash(MyClass)
878579
>>> hash(MyClass())
8795488082665
>>>