Mapping a Python dict to a Polars series

In Pandas we can use the map function to map a dict over a Series, creating another Series with the mapped values. More generally speaking, I believe it invokes the index operator of the argument, i.e. [].
import pandas as pd
dic = { 1: 'a', 2: 'b', 3: 'c' }
pd.Series([1, 2, 3, 4]).map(dic) # returns ["a", "b", "c", NaN]
I haven't found a way to do so directly in Polars, but have found a few alternatives. Would any of these be the recommended way to do so, or is there a better way?
import polars as pl
dic = { 1: 'a', 2: 'b', 3: 'c' }
# Approach 1 - apply
pl.Series([1, 2, 3, 4]).apply(lambda v: dic.get(v, None)) # returns ["a", "b", "c", null]
# Approach 2 - left join
(
    pl.Series([1, 2, 3, 4])
    .alias('key')
    .to_frame()
    .join(
        pl.DataFrame({
            'key': list(dic.keys()),
            'value': list(dic.values()),
        }),
        on='key', how='left',
    )['value']
)  # returns ["a", "b", "c", null]
# Approach 3 - to pandas and back
pl.from_pandas(pl.Series([1, 2, 3, 4]).to_pandas().map(dic)) # returns ["a", "b", "c", null]
I saw this answer on mapping a dict of expressions, but since it chains when/then/otherwise, it might not work well for huge dicts.

Mapping a Python dictionary over a Polars Series should always be considered an anti-pattern. It will be terribly slow, and what you want is semantically equal to a join.
Use joins. They are heavily optimized, multithreaded, and don't go through Python.
Example
import polars as pl
dic = { 1: 'a', 2: 'b', 3: 'c' }
mapper = pl.DataFrame({
    "keys": list(dic.keys()),
    "values": list(dic.values())
})
pl.Series([1, 2, 3, 4]).to_frame("keys").join(mapper, on="keys", how="left").to_series(1)
Series: 'values' [str]
[
"a"
"b"
"c"
null
]
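If you need this kind of lookup in several places, the join can be wrapped in a small helper. This is only a sketch of the same join shown above; the function name map_with_join and the column names "keys"/"values" are arbitrary choices for illustration, not part of any Polars API.

import polars as pl

def map_with_join(s: pl.Series, mapping: dict) -> pl.Series:
    # Build a two-column mapper frame from the dict and left-join it onto the input,
    # so keys missing from the mapping come back as null.
    mapper = pl.DataFrame({
        "keys": list(mapping.keys()),
        "values": list(mapping.values()),
    })
    return s.to_frame("keys").join(mapper, on="keys", how="left").to_series(1)

dic = {1: "a", 2: "b", 3: "c"}
print(map_with_join(pl.Series([1, 2, 3, 4]), dic))  # ["a", "b", "c", null]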

Polars is an awesome tool, but even awesome tools aren't meant for everything, and this is one of those cases. A simple Python list comprehension is going to be faster.
You could just do:
[dic[x] if x in dic else None for x in [1, 2, 3, 4]]
On my computer, timing that with %%timeit gives about 800 ns,
in contrast to
pl.Series([1, 2, 3, 4]).to_frame("keys").join(pl.DataFrame([{'keys':x, 'values':y} for x,y in dic.items()]), on="keys", how="left").to_series(1)
which takes about 434 µs.
Notice that the first is measured in nanoseconds whereas the second is in microseconds, so it's really 800 ns vs. 434,000 ns.
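If you want to reproduce the comparison yourself, a minimal timing sketch using Python's timeit module might look like the following. The absolute numbers will vary by machine, and the unit conversions at the end are only there for readability.

import timeit
import polars as pl

dic = {1: "a", 2: "b", 3: "c"}
data = [1, 2, 3, 4]
mapper = pl.DataFrame({"keys": list(dic.keys()), "values": list(dic.values())})

def via_comprehension():
    # pure Python lookup, None for missing keys
    return [dic[x] if x in dic else None for x in data]

def via_join():
    # builds a Series/DataFrame and performs a left join on every call
    return pl.Series(data).to_frame("keys").join(mapper, on="keys", how="left").to_series(1)

n = 10_000
t_list = timeit.timeit(via_comprehension, number=n)
t_join = timeit.timeit(via_join, number=n)
print(f"comprehension: {t_list / n * 1e9:.0f} ns per call")
print(f"join:          {t_join / n * 1e6:.1f} µs per call")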

Related

tuple versus Tuple

Why do I get the following different results when converting a vector using either tuple or Tuple?
julia> a = [1, 2, 3]
3-element Vector{Int64}:
1
2
3
julia> tuple(a)
([1, 2, 3],)
julia> Tuple(a)
(1, 2, 3)
Broadcasting gives the same result though:
julia> tuple.(a)
3-element Vector{Tuple{Int64}}:
(1,)
(2,)
(3,)
julia> Tuple.(a)
3-element Vector{Tuple{Int64}}:
(1,)
(2,)
(3,)
(The latter is not so surprising as it just converts single numbers to tuples.)
(This is Julia 1.6.1.)
Tuple is a type and as with all collections in Julia base, if you pass another collection to it, it creates an instance of that type from the contents of the other collection. So Tuple([1, 2, 3]) constructs a tuple of the values 1, 2 and 3 just like Set([1, 2, 3]) constructs a set of those same values. Similarly, if you write Dict([:a => 1, :b => 2, :c => 3]) you get a dict that contains the pairs :a => 1, :b => 2 and :c => 3. This also works nicely when the argument to the constructor is an iterator; some examples:
julia> Tuple(k^2 for k=1:3)
(1, 4, 9)
julia> Set(k^2 for k=1:3)
Set{Int64} with 3 elements:
4
9
1
julia> Dict(string(k, base=2, pad=2) => k^2 for k=1:3)
Dict{String, Int64} with 3 entries:
"10" => 4
"11" => 9
"01" => 1
So that's why Tuple works the way it does. The tuple function, on the other hand, is a function that makes a tuple from its arguments like this:
julia> tuple()
()
julia> tuple(1)
(1,)
julia> tuple(1, "two")
(1, "two")
julia> tuple(1, "two", 3.0)
(1, "two", 3.0)
Why have tuple at all instead of just having Tuple? You could express this last example as Tuple([1, "two", 3.0]). However, that requires constructing a temporary untyped array only to iterate it and make a tuple from its contents, which is really inefficient. If only there was a more efficient container type that the compiler can usually eliminate the construction of... like a tuple. For that we'd write Tuple((1, "two", 3.0)). Which works, but is completely redundant since (1, "two", 3.0) is already the tuple you wanted.

So why would you use tuple? Most of the time you don't, you just use the (1, "two", 3.0) syntax for constructing a tuple. But sometimes you want an actual function that you can apply to some values to get a tuple of them—and tuple is that function. You can actually make an anonymous function that does this pretty easily: (args...) -> (args...,). You can just think of tuple as a handy abbreviation for that function.

PySpark: Count pair frequency occurences

Let's say I have a dataset as follows:
1: a, b, c
2: a, d, c
3: c, d, e
I want to write a Pyspark code to count the occurrences of each of the pairs such as (a,b), (a,c), (b,c) etc.
Expected output:
(a,b) 1
(b,c) 1
(c,d) 2
etc..
Note that (c,d) and (d,c) should be treated as the same pair.
How should I go about it?
So far, I have written the code to read the data from a text file as follows:
sc = SparkContext("local", "bp")
spark = SparkSession(sc)
data = sc.textFile('doc.txt')
dataFlatMap = data.flatMap(lambda x: x.split(" "))
Any pointers would be appreciated.
I relied on the answer in this question - How to create a Pyspark Dataframe of combinations from list column
Below is the code that creates a udf in which the itertools.combinations function is applied to the list of items. The combinations in the udf are sorted to avoid double-counting occurrences such as ("a", "b") and ("b", "a"). Once you have the combinations, you can groupBy and count rows. You may want to count distinct rows in case list elements repeat, as in ("a", "a", "b"), but this depends on your requirements.
import pyspark.sql.functions as F
import itertools
from pyspark.sql.types import *
data = [(1, ["a", "b", "c"]), (2, ["a", "d", "c"]), (3, ["c", "d", "e"])]
df = spark.createDataFrame(data, schema = ["id", "arr"])
# df is
# id arr
# 1 ["a", "b", "c"]
# 2 ["a", "d", "c"]
# 3 ["c", "d", "e"]
@F.udf(returnType=ArrayType(ArrayType(StringType())))
def combinations_udf(arr):
    # all 2-element combinations, each sorted so ("b", "a") and ("a", "b") count as the same pair
    x = list(itertools.combinations(arr, 2))
    return [sorted([y[0], y[1]]) for y in x]

df1 = df.withColumn("combinations", F.explode(combinations_udf("arr")))

df_ans = (df1
          .groupBy("combinations")
          .agg(F.countDistinct("id").alias("count"))
          .orderBy(F.desc("count")))
For the given dataframe df, df_ans contains each sorted pair with its count: ["a", "c"] and ["c", "d"] appear twice, and every other pair once.
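As a quick sanity check of those counts outside Spark, the same logic can be run in plain Python on the toy data (this is not part of the Spark answer, just a verification sketch):

import itertools
from collections import Counter

rows = {1: ["a", "b", "c"], 2: ["a", "d", "c"], 3: ["c", "d", "e"]}
counts = Counter()
for _id, arr in rows.items():
    # distinct sorted pairs per row, mirroring the sorted combinations + countDistinct("id") logic
    counts.update({tuple(sorted(p)) for p in itertools.combinations(arr, 2)})

print(counts)  # ("a", "c") and ("c", "d") occur twice, every other pair once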

efficient way of transforming a dataframe to map in scala

I have a use case where I want to convert a dataframe to a map. For that I am using groupByKey and mapGroups operations, but they are running into memory issues.
Eg:
EmployeeDataFrame
EmployeeId, JobLevel, JobCode
1, 4, 50
1, 5, 60
2, 4, 70
2, 5, 80
3, 7, 90
case class EmployeeModel(EmployeeId: String, JobLevel: String, JobCode: String)

val ds: Dataset[EmployeeModel] = EmployeeDataFrame.as[EmployeeModel]

val groupedData = ds
  .groupByKey(_.EmployeeId)
  .mapGroups((key, rows) => (key, rows.toList))
  .collect()
  .toMap
Expected Map
1, [(),()]
2, [(), ()]
3, [()]
Is there a better way of doing this?

Julia convert EnumerableGroupBy to array

I have the following code that does a groupby on an array of numbers and returns an array of tuples with the numbers and respective counts:
using Query
a = [1, 2, 1, 2, 3, 4, 6, 1, 5, 5, 5, 5]
key_counts = a |> @groupby(_) |> @map(g -> (key(g), length(values(g))))
collect(key_counts)
Is there a way to complete the last step in the pipeline, converting key_counts of type QueryOperators.EnumerableMap{Tuple{Int64, Int64}, QueryOperators.EnumerableIterable{Grouping{Int64, Int64}, QueryOperators.EnumerableGroupBy{Grouping{Int64, Int64}, Int64, Int64, QueryOperators.EnumerableIterable{Int64, Vector{Int64}}, var"#12#17", var"#13#18"}}, var"#15#20"} to Vector{Tuple{Int, Int}} directly, by integrating the collect operation into the pipeline as a one-liner?
The question has been clarified. My answer is no longer intended as a solution but provides additional information.
Using key_counts |> collect instead of collect(key_counts) works on the second line, but |> collect at the end of the pipeline does not, which feels like unwanted behavior.
The response below is no longer relevant:
When I run your code I actually do receive a Vector{Tuple{Int, Int}} as output.
I'm using Julia v1.6.0 with Query v1.0.0.
using Query
a = [1, 2, 1, 2, 3, 4, 6, 1, 5, 5, 5, 5]
key_counts = a |> @groupby(_) |> @map(g -> (key(g), length(values(g))))
output = collect(key_counts)
typeof(output) # Vector{Tuple{Int64, Int64}} (alias for Array{Tuple{Int64, Int64}, 1})

Scala Set - default behavior

I want to know how Scala arranges the data in a Set.
scala> val imm = Set(1, 2, 3, "four") //immutable variable
imm : scala.collection.immutable.Set[Any] = Set(1, 2, 3, four)
scala> var mu = Set(1, 2, 3, "four", 9)
mu: scala.collection.immutable.Set[Any] = Set(four, 1, 9, 2, 3)
scala> mu += "five"
scala> mu.toString
res1: String = Set(four, 1, 9, 2, 3, five)
The order remains as inserted in the case of the immutable Set but not the mutable one.
Also, no matter how many times I create a new set with var xyz = Set(1, 2, 3, "four", 9), the order in which it is stored remains the same: Set(four, 1, 9, 2, 3). So it's not stored randomly; there is some logic behind it that I want to know. Moreover, is there any advantage to this behavior?
Sets don't have an order. An item is in the set or it isn't. That's all you can know about a set.
If you require a certain ordering, you need to use an ordered set or possibly a sorted set.
Of course, any particular implementation of a set may or may not incidentally have an ordering which may or may not be stable across multiple calls, but that would be a purely accidental implementation detail that may change at any time, during an update of Scala or even between two calls.