Polars searchsorted with a Series - python-polars

searchsorted is an incredibly useful utility in numpy and pandas for performing a binary search on every element in a list, especially for time-series data.
import numpy as np
np.searchsorted(['a', 'a', 'b', 'c'], ['a', 'b', 'c']) # Returns [0, 2, 3]
np.searchsorted(['a', 'a', 'b', 'c'], ['a', 'b', 'c'], side='right') # Returns [2, 3, 4]
I have a few questions about Polars
Is there any way to apply search_sorted on a list in polars in a vectorized manner?
Is there any way to specify side=right for search_sorted?
Can we use non-numeric data in search_sorted?
If the answer to any of these is no, what would be the recommended approach or workaround to achieve the same functionality?
(The ideal approach is if search_sorted can be used as part of an expression, e.g. pl.col('A').search_sorted(pl.col('B')))
Here's what I have tried:
import polars as pl
pl.Series(['a', 'a', 'b', 'c']).search_sorted(['a', 'b', 'c']) # PanicException: not implemented for Utf8
pl.Series([0, 0, 1, 2]).search_sorted([0, 1, 2]) # PanicException: dtype List not implemented
list(map(pl.Series([0, 0, 1, 2]).search_sorted, [0, 1, 2])) # Returns [1, 2, 3], different from numpy results
pl.DataFrame({
    'a': [0, 0, 1, 2],
    'b': [0, 1, 2, 3],
}).with_columns([
    pl.col('a').search_sorted(pl.col('b')).alias('c')
])  # Column 'c' is [1, 1, 1, 1], which is incorrect
I understand Polars is still a work in progress and some functionalities are missing, so any help is greatly appreciated!

To extend #ritchie46's answer, you need a rolling join so that missing values can be joined to their nearest neighbor. Unfortunately, rolling joins don't work on letters, or more accurately on Utf8 dtypes, so for your example you have to do an extra step.
Starting from:
df1 = (pl.Series("a", ["a", "a", "b", "c"])
       .set_sorted()
       .to_frame()
       .with_row_count("idx"))
df2 = pl.Series("a", ["a", "b", "c"]).set_sorted().to_frame()
Then we make a df to house all the possible values of a and map each one to a numeric index.
dfindx = (pl.DataFrame(pl.concat([df1.get_column('a'), df2.get_column('a')]).unique())
          .sort('a')
          .with_row_count('valindx'))
Now we add that valindx to each of df1 and df2:
df1 = df1.join(dfindx, on='a')
df2 = df2.join(dfindx, on='a')
To get almost to the finish line you'd do:
df2.join_asof(df1, on='valindx', strategy='forward')
This will leave the last value missing (the 4 from the numpy case). Essentially what's happening is that the first value 'a' doesn't find a match, but its nearest forward neighbor is a 'b', so it takes that value, and so on; but when it gets to 'e' there is nothing in df1 forward of that, so we need a minor hack of just filling in that null with the max idx + 1.
(df2
    .join_asof(df1, on='valindx', strategy='forward')
    .with_column(pl.col('idx').fill_null(df1.select(pl.col('idx').max() + 1)[0, 0]))
    .get_column('idx'))
Of course, if you're using time or numeric dtypes then you can skip the first step. Additionally, I suspect that fetching this index value is an intermediate step and that the overall process would be done more efficiently without extracting the index values at all, but that would still be through a join_asof.
If you change the strategy of join_asof, that should be largely the same as switching the side, but you'd have to change the hack at the end too.
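For instance, a rough sketch of the backward variant (untested; exactly which side this corresponds to, and what index adjustment is needed at the end, depends on how join_asof resolves ties, so treat it only as a starting point):
(df2
    .join_asof(df1, on='valindx', strategy='backward')
    .with_column(pl.col('idx').fill_null(0))  # values sorting before everything in df1 get index 0
    .get_column('idx'))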

EDIT: I added the requested functionality and it will be available in the next release: https://github.com/pola-rs/polars/pull/6083
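For reference, a sketch of what that should look like once the release is out (the exact signature may differ; as I understand it, the PR adds a side argument plus Series/expression support):
import polars as pl

s = pl.Series(['a', 'a', 'b', 'c'])
s.search_sorted(pl.Series(['a', 'b', 'c']), side='left')   # expected: [0, 2, 3]
s.search_sorted(pl.Series(['a', 'b', 'c']), side='right')  # expected: [2, 3, 4]
# and as an expression, e.g. df.select(pl.col('a').search_sorted(pl.col('b'), side='left'))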
Old answer (wrong)
For a "normal" search sorted we can use a join.
# convert to DataFrame
# provide polars with the information the data is sorted (this speeds up many algorithms)
# set a row count
df1 = (pl.Series("a", ["a", "a", "b", "c"])
       .set_sorted()
       .to_frame()
       .with_row_count("idx"))
df2 = pl.Series("a", ["a", "b", "c"]).set_sorted().to_frame()
# join
# drop duplicates
# and only show the indices that were joined
df1.join(df2, on="a", how="semi").unique(subset=["a"])["idx"]
Series: 'idx' [u32]
[
0
2
3
]

Related

print a specific partition of RDD / Dataframe

I have been experimenting with partitions and repartitioning of PySpark RDDs.
I noticed, when repartitioning a small sample RDD from 2 to 6 partitions, that a few empty partitions are simply added.
rdd = sc.parallelize([1,2,3,43,54,678], 2)
rdd.glom().collect()
>>> [[1, 2, 3], [43, 54, 678]]
rdd6 = rdd.repartition(6)
rdd6.glom().collect()
>>> [[], [1, 2, 3], [], [], [], [43, 54, 678]]
Now, I wonder if that also happens in my real data.
It seems I can't use glom() on larger data (df with 192497 rows).
df.rdd.glom().collect()
Because when I try, nothing happens. It makes sense though, the resulting print would be enormous...
So, I'd like to print each partition, to check whether they are empty, or at least the top 20 elements of each partition.
Any ideas?
PS: I found solutions for Spark, but I couldn't get them to work in PySpark...
How to print elements of particular RDD partition in Spark?
BTW: if someone can explain to me why I get those empty partitions in the first place, I'd be all ears...
Or how I know when to expect this to happen and how to avoid it.
Or does it simply not influence performance if there are empty partitions in a dataset?
Apparently (and surprisingly), rdd.repartition seems to be doing only a coalesce here, so there is no real shuffling of the elements, and it's no wonder the distribution is unequal. One way to go is to use dataframe.repartition:
rdd = sc.parallelize([1,2,3,43,54,678], 2)
rdd.glom().collect()
>>> [[1, 2, 3], [43, 54, 678]]
rdd6 = rdd.repartition(6)
rdd6.glom().collect()
>>> [[], [1, 2, 3], [], [], [], [43, 54, 678]]
from pyspark.sql import types as T  # needed for T.IntegerType()
rdd6_df = spark.createDataFrame(rdd, T.IntegerType()).repartition(6).rdd
rdd6_df.glom().collect()
[[Row(value=678)],
[Row(value=3)],
[Row(value=2)],
[Row(value=1)],
[Row(value=43)],
[Row(value=54)]]
Concerning the possibility of checking whether partitions are empty, I came across a few solutions myself:
(if there aren't that many partitions)
rdd.glom().collect()
>>>nothing happens
rdd.glom().collect()[1]
>>>[1, 2, 3]
Careful though, it will truly print the whole partition. For my data it resulted in a few thousand lines of print, but it worked!
source: How to print elements of particular RDD partition in Spark?
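If you only want the top 20 elements of each partition rather than the whole thing, here is a small sketch using mapPartitionsWithIndex plus itertools.islice (the cutoff of 20 and the variable names are just illustrative):
from itertools import islice

preview = df.rdd.mapPartitionsWithIndex(
    lambda idx, it: [(idx, list(islice(it, 20)))]  # keep at most 20 rows per partition
).collect()
for idx, rows in preview:
    print(idx, len(rows), rows)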
Count lines in each partition and show the smallest/largest number:
l = df.rdd.mapPartitionsWithIndex(lambda x, it: [(x, sum(1 for _ in it))]).collect()
min(l,key=lambda item:item[1])
>>>(2, 61705)
max(l,key=lambda item:item[1])
>>>(0, 65875)
source: Spark Dataframes: Skewed Partition after Join
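Building on the l computed above, listing exactly which partitions are empty is then a one-liner:
empty_partitions = [idx for idx, n in l if n == 0]
print(empty_partitions)  # indices of partitions with zero rows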

Encoding String to numbers so as to use it in scikit-learn

My data consists of 50 columns and most of them are strings. I have a single multi-class variable which I have to predict. I tried using LabelEncoder in scikit-learn to convert the features (not the classes) into whole numbers and feed them as input to the RandomForest model I am using for classification.
Now, when new test data comes in (a stream of new data), how will I know, for each column, what the label for each string should be, since using LabelEncoder on the new data will give me new labels independent of the labels I generated before? Am I doing this wrong? Is there anything else I should use for consistent encoding?
The LabelEncoder class has two methods that handle this distinction: fit and transform. Typically you call fit first to map some data to a set of integers:
>>> le = LabelEncoder()
>>> le.fit(['a', 'e', 'b', 'z'])
>>> le.classes_
array(['a', 'b', 'e', 'z'], dtype='<U1')
Once you've fit your encoder, you can transform any data to the label space, without changing the existing mapping:
>>> le.transform(['a', 'e', 'a', 'z', 'a', 'b'])
array([0, 2, 0, 3, 0, 1])
>>> le.transform(['e', 'e', 'e'])
array([2, 2, 2])
The use of this encoder basically assumes that you know beforehand what all the labels are in all of your data. If you have labels that might show up later (e.g., in an online learning scenario), you'll need to decide how to handle those outside the encoder.
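For example, one possible (hypothetical) way to handle that outside the encoder is to reserve an '<unknown>' token at fit time and map anything unseen to it before calling transform; the token name and helper below are just illustrative:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['paris', 'tokyo', 'amsterdam', '<unknown>'])  # reserve a token for unseen strings

def encode(values, le=le):
    # replace anything the encoder has never seen with the reserved token
    known = set(le.classes_)
    return le.transform([v if v in known else '<unknown>' for v in values])

encode(['paris', 'berlin'])  # 'berlin' falls back to '<unknown>'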
You could save the mapping (string -> label) learned from the training data for each column.
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> col_1 = ["paris", "paris", "tokyo", "amsterdam"]
>>> set_col_1 = list(set(col_1))
>>> le.fit(col_1)
>>> dict(zip(set_col_1, le.transform(set_col_1)))
{'amsterdam': 0, 'paris': 1, 'tokyo': 2}
When the testing data comes, you can use this mapping to encode the corresponding columns in the testing data; you do not have to fit the encoder again on the testing data.
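A minimal sketch of applying that saved mapping to a test column (using -1 as the fallback for unseen strings is just one possible choice):
mapping = {'amsterdam': 0, 'paris': 1, 'tokyo': 2}  # saved from the training data
test_col_1 = ['tokyo', 'paris', 'berlin']
encoded = [mapping.get(v, -1) for v in test_col_1]  # -> [2, 1, -1]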

How can I select certain rows in a dataset? Mathematica

My question is probably really easy, but I am a Mathematica beginner.
I have a dataset, let's say:
Column 1: numbers from 1 to 10
Column 2: signs
Column 3: other signs
{{1,2,3,4,5,6,7,8,9,10},{d,t,4,/,g,t,w,o,p,m},{g,h,j,k,l,s,d,e,w,q}}
Now I want to extract all rows for which column 1 contains an odd number. In other words, I want to create a new dataset.
I tried to work with Select and OddQ, as well as with the If function, but I have absolutely no clue how to put these commands together in the right way!
Taking a stab at what you might be asking..
(table = {{1, 2, 3, 4, 5, 6, 7, 8, 9, 10} ,
Characters["abcdefghij"],
Characters["ABCDEFGHIJ"]}) // MatrixForm
table[[All, 1 ;; -1 ;; 2]] // MatrixForm
or perhaps this:
Select[table, OddQ[#[[1]]] &]
{{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}}
The convention in Mathematica is the reverse of what you use in your description.
Rows are first level sublists.
Let's take your original data
mytable = {{1,2,3,4,5,6,7,8,9,10},{d,t,4,"/",g,t,w,o,p,m},{g,h,j,k,l,s,d,e,w,q}}
Just as you suggested, Select and OddQ can do what you want, but on your table, transposed. So we transpose first and back:
Transpose[Select[Transpose[mytable], OddQ[First[#]]& ]]
Another way:
Mathematica functional command MapThread can work on synchronous lists.
DeleteCases[MapThread[If[OddQ[#1], {##}] &, mytable], Null]
The inner function of MapThread gets all the elements of what you call a 'row' as variables (#1, #2, etc.). It tests the first column and outputs all the columns, or Null if the test fails. The enclosing DeleteCases removes the non-matching "rows".

no difference between [..] and [...] for array?

Edit: I made a GitHub issue; it got closed a day later by jashkenas. So the takeaway is essentially "working as intended".
coffee> arr
[ 0,
1,
2,
3,
'A',
'K' ]
coffee> arr[...]
[ 0,
1,
2,
3,
'A',
'K' ]
coffee> arr[..]
[ 0,
1,
2,
3,
'A',
'K' ]
According to the docs, those should be different.
With two dots (3..6), the range is inclusive (3, 4, 5, 6); with three dots (3...6), the range excludes the end (3, 4, 5).
The two slice statements that are produced are the same. It seems to me that .. should produce .slice(0) and ... should produce .slice(0, -1). Am I missing something, or is this a bug?
1.7.1
The documentation then goes on to say:
Slices indices have useful defaults. An omitted first index defaults
to zero and an omitted second index defaults to the size of the array.
This is consistent with what you're seeing. The length of your array is 6 so:
[..] is equivalent to [0..6] which would compile to .slice(0,7)
[...] is equivalent to [0...6] which would compile to .slice(0,6)
With an array of length 6, both .slice(0,6) and .slice(0,7) return all elements and so both are equivalent to .slice(0), which is what both [..] and [...] compile to.
What you are expecting would be the case if an omitted second index defaulted to the size of the array minus 1, but this is not the case.

Select One Element in Each Row of a Numpy Array by Column Indices [duplicate]

This question already has answers here:
NumPy selecting specific column index per row by using a list of indexes
(7 answers)
Closed 2 years ago.
Is there a better way to get the "output_array" from the "input_array" and "select_id"?
Can we get rid of range(input_array.shape[0])?
>>> import numpy
>>> input_array = numpy.array([[3, 14], [12, 5], [75, 50]])
>>> select_id = [0, 1, 1]
>>> print input_array
[[ 3 14]
[12 5]
[75 50]]
>>> output_array = input_array[ range( input_array.shape[0] ), select_id ]
>>> print output_array
[ 3 5 50]
You can choose from a given array using numpy.choose, which constructs an array from an index array (in your case select_id) and a set of arrays (in your case input_array) to choose from. However, you may first need to transpose input_array to match dimensions. The following shows a small example:
In [101]: input_array
Out[101]:
array([[ 3, 14],
[12, 5],
[75, 50]])
In [102]: input_array.shape
Out[102]: (3, 2)
In [103]: select_id
Out[103]: [0, 1, 1]
In [104]: output_array = np.choose(select_id, input_array.T)
In [105]: output_array
Out[105]: array([ 3, 5, 50])
(because I can't post this as a comment on the accepted answer)
Note that numpy.choose only works if you have 32 or fewer choices (in this case, the dimension of your array along which you're indexing must be of size 32 or smaller). Additionally, the documentation for numpy.choose says:
To reduce the chance of misinterpretation, even though the following "abuse" is nominally supported, choices should neither be, nor be thought of as, a single array, i.e., the outermost sequence-like container should be either a list or a tuple.
The OP asks:
Is there a better way to get the output_array from the input_array and select_id?
I would say, the way you originally suggested seems the best out of those presented here. It is easy to understand, scales to large arrays, and is efficient.
Can we get rid of range(input_array.shape[0])?
Yes, as shown by other answers, but the accepted one doesn't work as well in general as what the OP already suggests doing.
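For completeness, newer NumPy (1.15+) also has np.take_along_axis, which expresses the same row-wise pick without building a range; a small sketch:
import numpy as np

input_array = np.array([[3, 14], [12, 5], [75, 50]])
select_id = np.array([0, 1, 1])

# indices must have the same number of dimensions as the array, hence the [:, None]
output_array = np.take_along_axis(input_array, select_id[:, None], axis=1).ravel()
# -> array([ 3,  5, 50])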
I think enumerate is handy.
[input_array[enum, item] for enum, item in enumerate(select_id)]
How about:
[input_array[x, y] for x, y in zip(range(len(input_array[:, 0])), select_id)]