Principles of immutability and copy-on-write in polars python api - python-polars

Hi I'm working on this fan fiction project of a full feature + syntax translation of pypolars to R called "minipolars".
I understand the pypolars API e.g. DataFrame in generel elicits immutable-behavior or isch the same as 'copy-on-write' behaviour. Most methods altering the DataFrame object will return a cheap copy.
Exceptions known to me are DataFrame.extend and the #columns.setter.
In R, most API's strive for a strictly immutable-behavior. I imagine to both support a strictly immutable behavoir, and optional pypolars-like behavior.
Rust-polars API has many mutable operations + lifetimes and what not, but it is understandably all about performance and expressiveness.
Are there many more central mutable behavoirs in the pypolars-API?
Would a pypolars-API with only immutable behavior suffer in performance and expressiveness?
The R library data.table API do stray away from immutable-behavoir some times. However all such operations that are mutable are prefixed set_ or use the set-operator :=.
Is there an obvious way in pypolars to recognize if an operation is mutable or not?
By mutable-behavoir I think of e.g. executing the method .extend() after defining variable df_mutable_copy and that still affects the value df_mutable_copy.
import polars as pl
df1 = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
df2 = pl.DataFrame({"foo": [10, 20, 30], "bar": [40, 50, 60]})
df_copy = df1
df_copy_old_shape = df_copy.shape
df1.extend(df2)
df_copy_new_shape = df_copy.shape
#extend was a operation with mutable behaviour df_copy was affected.
df_copy_old_shape != df_copy_new_shape

Most of the python polars API is actually a wrapper around polars lazy.
(df.with_columns(..)
.join(..)
.groupby()
.select(..)
translates to:
```python
(df.lazy().with_columns(..).collect(no_optimization=True)
.lazy().join(..).collect(no_optimization=True)
.lazy().groupby().collect(no_optimization=True)
.lazy().select(..).collect(no_optimization=True)
That means that almost all expresions run on the polars query engine. The query engine itself determines if an operation can be done in place, or if it should clone the data (copy on write).
Polars actually has Copy on write on steroids, as it only copies if the data is not shared. If we are the only owner, we mutate in place. We can do this because Rust has a borrow checker, so if we own the data and the ref count is 1, we can mutate the data.
I would recommend you to implement your R-polars API similar to what we do in python. Then all operations can be made pure (e.g. return a new Series/Expr/DataFrame) and pollars will decide when to mutate in place.
Don't worry about copying data. All data buffers are wrapped in an Arc, so we only increment a reference count.

Related

numba error with tuple sorting containing numpy arrays

I have a (working) function that uses the heapq module to build a priority queue of tuples and I would like to compile it with numba, however I get a very long and unclear error. It seems to boil down to a problem with tuple order comparison needed for the queue. The tuples have a fixed format, where the first item is a a floating point number (whose order I care about) and then a numpy array, which I need for computation but never get compared when running normally. This is intended because comparison on numpy arrays yields an array which cannot be used in conditionals and raises an exception. However, I guess numba needs a scalar yielding comparison to be at least defined for all items in the tuple, and hence the numba error.
I have a very minimal example:
#numba.njit
def f():
return 1 if (1, numpy.arange(3)) < (2, numpy.arange(3)) else 2
f()
where the numba compilation fails (without numba it works since it never needs to actually compare the arrays, as in the original code).
Here is a slightly less minimal but maybe clearer example, which shows what I am actually doing:
from heapq import heappush
import numpy
import numba
#numba.njit
def f(n):
heap = [(1, 0, numpy.random.rand(2, 3))]
for unique_id in range(n):
order = numpy.random.rand()
data = numpy.random.rand(2, 3)
heappush(heap, (order, unique_id, data))
return heap[0]
f(100)
Here order is the variable whose order I care about in the queue, unique_id is a trick to avoid that when order is the same the comparison goes on to data and throws an exception.
I tried to bypass the problem converting the numpy array to a list when in the tuple and back to array for computation, but while this compiles, the numba version is slower than the interpreted one, even thought the array is quite small (usually 2x3). Without converting I would need to rewrite the code as loops which I would prefer to avoid (but is doable).
Is there a better alternative to get this working with numba, hopefully running faster than the python interpreter?
I'll try to respond based on the minimal example you provided.
I think that the problem here is not related to the ability of numba to perform comparison between all the elements of the tuple, but rather on where to store the result of such a comparison. This is stated in the error log returned when trying to execute your example:
cannot store {i8*, i8*, i64, i64, i8*, [1 x i64], [1 x i64]} to i1*: mismatching types
Basically, you are trying to store the result of a comparison between a pair of floats and a pair of arrays into a single boolean, and numba doesn't know how to do that.
If you are only interested in comparing the first elements of the tuples, the quickest workaround I can think of is forcing the comparison to happen only on the first elements, e.g.
#numba.njit
def f():
return 1 if (1, numpy.arange(3))[0] < (2, numpy.arange(3))[0] else 2
f()
If this is not applicable to your use case, please provide more details about it.
EDIT
According to the further information you provided, I think the best way to solve this is avoiding pushing the numpy arrays to the heap. Since you're only interested in the ordering properties of the heap, you can just push the keys to the heap and store the corresponding numpy arrays in a separate dictionary, using as keys the same values you push in the heap.
As a sidenote, when you use standard library functions in nopython-jitted functions, you are resorting on specific numba's re-implementation of those functions rather then the "original" python's ones. A comprehensive list of the available python's features in numba can be found here.
Ok, I found a solution to the problem: since storing the array in the heap tuple is the cause of the numba error, it is enough to store it in a separate dictionary with an unique key and store only the key in the heap tuple. For instance, using an integer as the key:
from heapq import heappush
import numpy
import numba
#numba.njit
def f(n):
key = 0
array_storage = {key: numpy.random.rand(2, 3)}
heap = [(1.0, key)]
for _ in range(n):
order = numpy.random.rand()
data = numpy.random.rand(2, 3)
key += 1
heappush(heap, (order, key))
array_storage[key] = data
return heap[0]
f(100)
Now the tuples in the heap can be compared yielding a boolean value and I still get to associate the data with its tuple. I am not completely satisfied since it seems a workaround, but it works pretty well and it is not overly complicated. If anyone has a better one please let me know!

Convert tuple to array in Scala

What is the best way to convert a tuple into an array in Scala? Here "best" means in as few lines of code as possible. I was shocked to search Google and StackOverflow only to find nothing on this topic, which seems like it should be trivial and common. Lists have a a toArray function; why don't tuples?
Use productIterator, immediately followed by toArray:
(42, 3.14, "hello", true).productIterator.toArray
gives:
res0: Array[Any] = Array(42, 3.14, hello, true)
The type of the result shows the main reason why it's rarely used: in tuples, the types of the elements can be heterogeneous, in arrays they must be homogeneous, so that often too much type information is lost during this conversion. If you want to do this, then you probably shouldn't have stored your information in tuples in the first place.
There is simply almost nothing you can (safely) do with an Array[Any], except printing it out, or converting it to an even more degenerate Set[Any]. Instead you could use:
Lists of case classes belonging to a common sealed trait,
shapeless HLists,
a carefully chosen base class with a bit of inheritance,
or something that at least keeps some kind of schema at runtime (like Apache Spark Datasets)
they would all be better alternatives.
In the somewhat less likely case that the elements of the "tuples" that you are processing frequently turn out to have an informative least upper bound type, then it might be because you aren't working with plain tuples, but with some kind of traversable data structure that puts restrictions on the number of substructures in the nodes. In this case, you should consider implementing something like Traverse interface for the structure, instead of messing with some "tuples" manually.

Immutable DataStructures In Scala

We know that Scala supports immutable data structures..i.e each time u update the list it will create a new object and reference in the heap.
Example
val xs:List[Int] = List.apply(22)
val newList = xs ++ (33)
So when i append the second element to a list it will create a new list which will contain both 22 and 33.This exactly works like how immutable String works in Java.
So the question is each time I append a element in the list a new object will be created each time..This ldoes not look efficient to me.
is there some special data structures like persistent data structures are used when dealing with this..Does anyone know about this?
Appending to a list has O(n) complexity and is inefficient. A general approach is to prepend to a list while building it, and finally reverse it.
Now, your question on creating new object still applies to the prepend. Note that since xs is immutable, newList just points to xs for the rest of the data after the prepend.
While #manojlds is correct in his analysis, the original post asked about the efficiency of duplicating list nodes whenever you do an operation.
As #manojlds said, constructing lists often require thinking backwards, i.e., building a list and then reversing it. There are a number of other situations where list building requires "needless" copying.
To that end, there is a mutable data structure available in Scala called ListBuffer which you can use to build up your list and then extract the result as an immutable list:
val xsa = ListBuffer[Int](22)
xsa += 33
val newList = xsa.toList
However, the fact that the list data structure is, in general, immutable means that you have some very useful tools to analyze, de-compose and re-compose the list. Many builtin operations take advantage of the immutability. By extension, your own programs can also take advantage of this immutability.

val-mutable versus var-immutable in Scala

Are there any guidelines in Scala on when to use val with a mutable collection versus using var with an immutable collection? Or should you really aim for val with an immutable collection?
The fact that there are both types of collection gives me a lot of choice, and often I don't
know how to make that choice.
Pretty common question, this one. The hard thing is finding the duplicates.
You should strive for referential transparency. What that means is that, if I have an expression "e", I could make a val x = e, and replace e with x. This is the property that mutability break. Whenever you need to make a design decision, maximize for referential transparency.
As a practical matter, a method-local var is the safest var that exists, since it doesn't escape the method. If the method is short, even better. If it isn't, try to reduce it by extracting other methods.
On the other hand, a mutable collection has the potential to escape, even if it doesn't. When changing code, you might then want to pass it to other methods, or return it. That's the kind of thing that breaks referential transparency.
On an object (a field), pretty much the same thing happens, but with more dire consequences. Either way the object will have state and, therefore, break referential transparency. But having a mutable collection means even the object itself might lose control of who's changing it.
If you work with immutable collections and you need to "modify" them, for example, add elements to them in a loop, then you have to use vars because you need to store the resulting collection somewhere. If you only read from immutable collections, then use vals.
In general, make sure that you don't confuse references and objects. vals are immutable references (constant pointers in C). That is, when you use val x = new MutableFoo(), you'll be able to change the object that x points to, but you won't be able to change to which object x points. The opposite holds if you use var x = new ImmutableFoo(). Picking up my initial advice: if you don't need to change to which object a reference points, use vals.
The best way to answer this is with an example. Suppose we have some process simply collecting numbers for some reason. We wish to log these numbers, and will send the collection to another process to do this.
Of course, we are still collecting numbers after we send the collection to the logger. And let's say there is some overhead in the logging process that delays the actual logging. Hopefully you can see where this is going.
If we store this collection in a mutable val, (mutable because we are continuously adding to it), this means that the process doing the logging will be looking at the same object that's still being updated by our collection process. That collection may be updated at any time, and so when it's time to log we may not actually be logging the collection we sent.
If we use an immutable var, we send an immutable data structure to the logger. When we add more numbers to our collection, we will be replacing our var with a new immutable data structure. This doesn't mean collection sent to the logger is replaced! It's still referencing the collection it was sent. So our logger will indeed log the collection it received.
I think the examples in this blog post will shed more light, as the question of which combo to use becomes even more important in concurrency scenarios: importance of immutability for concurrency. And while we're at it, note the preferred use of synchronised vs #volatile vs something like AtomicReference: three tools
var immutable vs. val mutable
In addition to many excellent answers to this question. Here is a simple example, that illustrates potential dangers of val mutable:
Mutable objects can be modified inside methods, that take them as parameters, while reassignment is not allowed.
import scala.collection.mutable.ArrayBuffer
object MyObject {
def main(args: Array[String]) {
val a = ArrayBuffer(1,2,3,4)
silly(a)
println(a) // a has been modified here
}
def silly(a: ArrayBuffer[Int]): Unit = {
a += 10
println(s"length: ${a.length}")
}
}
Result:
length: 5
ArrayBuffer(1, 2, 3, 4, 10)
Something like this cannot happen with var immutable, because reassignment is not allowed:
object MyObject {
def main(args: Array[String]) {
var v = Vector(1,2,3,4)
silly(v)
println(v)
}
def silly(v: Vector[Int]): Unit = {
v = v :+ 10 // This line is not valid
println(s"length of v: ${v.length}")
}
}
Results in:
error: reassignment to val
Since function parameters are treated as val this reassignment is not allowed.

Scala immutable map, when to go mutable?

My present use case is pretty trivial, either mutable or immutable Map will do the trick.
Have a method that takes an immutable Map, which then calls a 3rd party API method that takes an immutable Map as well
def doFoo(foo: String = "default", params: Map[String, Any] = Map()) {
val newMap =
if(someCondition) params + ("foo" -> foo) else params
api.doSomething(newMap)
}
The Map in question will generally be quite small, at most there might be an embedded List of case class instances, a few thousand entries max. So, again, assume little impact in going immutable in this case (i.e. having essentially 2 instances of the Map via the newMap val copy).
Still, it nags me a bit, copying the map just to get a new map with a few k->v entries tacked onto it.
I could go mutable and params.put("bar", bar), etc. for the entries I want to tack on, and then params.toMap to convert to immutable for the api call, that is an option. but then I have to import and pass around mutable maps, which is a bit of hassle compared to going with Scala's default immutable Map.
So, what are the general guidelines for when it is justified/good practice to use mutable Map over immutable Maps?
Thanks
EDIT
so, it appears that an add operation on an immutable map takes near constant time, confirming #dhg's and #Nicolas's assertion that a full copy is not made, which solves the problem for the concrete case presented.
Depending on the immutable Map implementation, adding a few entries may not actually copy the entire original Map. This is one of the advantages to the immutable data structure approach: Scala will try to get away with copying as little as possible.
This kind of behavior is easiest to see with a List. If I have a val a = List(1,2,3), then that list is stored in memory. However, if I prepend an additional element like val b = 0 :: a, I do get a new 4-element List back, but Scala did not copy the orignal list a. Instead, we just created one new link, called it b, and gave it a pointer to the existing List a.
You can envision strategies like this for other kinds of collections as well. For example, if I add one element to a Map, the collection could simply wrap the existing map, falling back to it when needed, all while providing an API as if it were a single Map.
Using a mutable object is not bad in itself, it becomes bad in a functional programming environment, where you try to avoid side-effects by keeping functions pure and objects immutable.
However, if you create a mutable object inside a function and modify this object, the function is still pure if you don't release a reference to this object outside the function. It is acceptable to have code like:
def buildVector( x: Double, y: Double, z: Double ): Vector[Double] = {
val ary = Array.ofDim[Double]( 3 )
ary( 0 ) = x
ary( 1 ) = y
ary( 2 ) = z
ary.toVector
}
Now, I think this approach is useful/recommended in two cases: (1) Performance, if creating and modifying an immutable object is a bottleneck of your whole application; (2) Code readability, because sometimes it's easier to modify a complex object in place (rather than resorting to lenses, zippers, etc.)
In addition to dhg's answer, you can take a look to the performance of the scala collections. If an add/remove operation doesn't take a linear time, it must do something else than just simply copying the entire structure. (Note that the converse is not true: it's not beacuase it takes linear time that your copying the whole structure)
I like to use collections.maps as the declared parameter types (input or return values) rather than mutable or immutable maps. The Collections maps are immutable interfaces that work for both types of implementations. A consumer method using a map really doesn't need to know about a map implementation or how it was constructed. (It's really none of its business anyway).
If you go with the approach of hiding a map's particular construction (be it mutable or immutable) from the consumers who use it then you're still getting an essentially immutable map downstream. And by using collection.Map as an immutable interface you completely remove all the ".toMap" inefficiency that you would have with consumers written to use immutable.Map typed objects. Having to convert a completely constructed map into another one simply to comply to an interface not supported by the first one really is absolutely unnecessary overhead when you think about it.
I suspect in a few years from now we'll look back at the three separate sets of interfaces (mutable maps, immutable maps, and collections maps) and realize that 99% of the time only 2 are really needed (mutable and collections) and that using the (unfortunately) default immutable map interface really adds a lot of unnecessary overhead for the "Scalable Language".