Polars: Joining on categorical column after aggregation - python-polars

I understand that when I create categorical columns in different data frames, they won't join/stack unless they were created under the same global string cache. However, when deriving a new data frame by aggregating an existing one, shouldn't it be possible to join them without a global string cache?
import polars as pl
df = pl.DataFrame(data={'column': ['a', 'a', 'b'], 'more': [1, 2, 3]}, columns=[('column', pl.Categorical), ('more', pl.Int32)])
df_agg = df.groupby('column').agg(pl.col('more').mean())
df.join(df_agg, on='column')
Can this join be done without recasting under a global string cache?
P.S. The example is just to illustrate the problem, not a best-practice example of how to add a mean over a group column ;-)

This has been implemented by ritchie46 and now works as expected.
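On older Polars releases where this join raised a string-cache error, a minimal sketch of the global string cache workaround could look like the following (the constructor and groupby names vary across versions, so treat the exact API as an assumption):
import polars as pl

# Minimal sketch: wrap creation, aggregation and join in one global string cache
# so both frames share the same categorical encoding (older Polars API names).
with pl.StringCache():
    df = pl.DataFrame(
        data={'column': ['a', 'a', 'b'], 'more': [1, 2, 3]},
        columns=[('column', pl.Categorical), ('more', pl.Int32)],
    )
    df_agg = df.groupby('column').agg(pl.col('more').mean())
    joined = df.join(df_agg, on='column')
print(joined)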

Related

Apply scikit-learn model to pyspark dataframe column

I have a trained Scikit-learn LogisticRegression model in a sklearn.pipeline.Pipeline. This is an NLP task. The model is saved as a pkl file (actually in ML Studio models, but I download it to Databricks DBFS).
I have a Hive table (delta-backed) containing some 1 million rows. The rows have, amongst other things, an id, a keyword_context column (containing the text), a modelled column (boolean, indicating whether the model has been run on this row), and a prediction column, which is an integer for the class output by the logistic regression.
My problem is how to update the prediction column.
Running locally I can do:
import pandas as pd
from sklearn.pipeline import Pipeline

def generatePredictions(data: pd.DataFrame, model: Pipeline) -> pd.DataFrame:
    data.loc[:, 'keyword_context'] = data.keyword_context.apply(lambda x: x.replace("\n", " "))
    data['prediction'] = model.predict(data.keyword_context)
    data['modelled'] = True
    return data
This actually runs fast enough (~20 s), but writing the UPDATEs back to Databricks via databricks.sql.connector takes many hours. So I want to do the same in a PySpark notebook to bypass the lengthy upload.
The trouble is that the general advice is to use built-in functions (which this isn't), and where a UDF is needed, the examples all use built-in types, not Pipelines. I'm wondering whether the model should be loaded within the function, and since I presume the function takes a single row, that would mean a lot of repeated loading. I'm really not sure how to write the function, or how to call it.
I work on the Fugue project, which aims to provide a simpler interface than Spark's for porting Python/pandas code. This is actually the first use case in our tutorial. Fugue will use the appropriate underlying Spark call (pandas_udf, udf, mapPartitions, applyInPandas, mapInPandas) based on the arguments you provide, with minimal overhead.
Here is what the code looks like.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
X = pd.DataFrame({"x_1": [1, 1, 2, 2], "x_2":[1, 2, 2, 3]})
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)
def predict(df: pd.DataFrame, model: LinearRegression) -> pd.DataFrame:
    return df.assign(predicted=model.predict(df))
input_df = pd.DataFrame({"x_1": [3, 4, 6, 6], "x_2":[3, 3, 6, 6]})
sdf = spark.createDataFrame(input_df)
# This is the start of the Fugue portion. It's minimal
from fugue import transform
result = transform(
    sdf,
    predict,
    schema="*,predicted:double",
    params=dict(model=reg),
    engine=spark
)
print(type(result))
result.show()
This code will be applied per partition. The schema is a requirement for Spark. I'm not certain, but it sounds like you were using a row-wise UDF, so I think this will be faster. It also keeps your logic easy to unit test, because you don't need Spark to test it.
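For instance, a quick local sanity check of predict with plain pandas, reusing the reg and predict defined above (just a sketch, no Spark session involved):
# Local check of the same function, no Spark required
test_df = pd.DataFrame({"x_1": [3, 4], "x_2": [3, 3]})
print(predict(test_df, reg))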
On loading the file inside the function
If you load the file inside the function, the loading happens on the workers. If you pass the model in, it has to be shipped through the scheduler, which can mean a lot of redundant transfer of data. Loading it inside might speed things up.
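As a rough sketch of that idea, assuming a hypothetical DBFS path and the column names from the question, loading the pickled Pipeline inside the per-partition function could look like this:
import pickle
import pandas as pd

MODEL_PATH = "/dbfs/models/keyword_model.pkl"  # hypothetical path, adjust to your DBFS location

def predict_partition(data: pd.DataFrame) -> pd.DataFrame:
    # The model is loaded on the worker that processes this partition,
    # so nothing large has to be shipped from the driver.
    with open(MODEL_PATH, "rb") as f:
        model = pickle.load(f)
    data = data.assign(
        keyword_context=data.keyword_context.str.replace("\n", " ", regex=False)
    )
    return data.assign(prediction=model.predict(data.keyword_context), modelled=True)

# With Fugue this can be applied per partition, e.g.
# transform(sdf, predict_partition, schema="*", engine=spark)
# assuming the prediction and modelled columns already exist in the table.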

issues in creating a new column of tuple from two dataframe columns in pyspark

I'm trying to create a column of tuples based on two other columns in a Spark dataframe.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [('A', 4, 5),
        ('B', 6, 9)]
columns = ["id", "val1", "val2"]
sdf = spark.createDataFrame(data=data, schema=columns)
sdf.withColumn('values', F.struct(F.col('val1'), F.col('val2'))).show()
What I got is a struct column, displayed as {4, 5} and {6, 9}. I need the values column to contain tuples, i.e. (4, 5) and (6, 9) rather than {4, 5} and {6, 9}. Does anyone know what I did wrong? Thanks a lot.
That's not how Spark works.
Spark is a framework developed in Scala, running on the Java JVM. It is not Python.
PySpark is a set of APIs that call the Scala methods to execute Spark, but from the Python language.
Therefore, Python types such as tuple do not exist in Spark. You have to use either:
Struct, which is close to a Python dict
Array, which is the equivalent of a list (probably what you need if you want something close to a tuple; see the sketch at the end of this answer)
The real question is: why do you need tuples?
EDIT: According to your comment, you need tuples because you want to use haversine. But if you use a list (or a Spark Array), for example, it works perfectly fine:
from haversine import haversine

# Use the haversine doc example but with lists
lyon = [45.7597, 4.8422]
paris = [48.8567, 2.3508]
haversine(lyon, paris)
> 392.2172595594006
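As a small sketch of the array approach in PySpark, reusing the sdf from the question:
from pyspark.sql import functions as F

# Build an array column instead of a struct
sdf_arr = sdf.withColumn("values", F.array(F.col("val1"), F.col("val2")))
sdf_arr.show()

# Collected back to Python, each value is a list, e.g. [4, 5]; it can be
# converted to a tuple if an API strictly requires one:
rows = [tuple(r["values"]) for r in sdf_arr.select("values").collect()]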

How can I iterate through a column of a spark dataframe and access the values in it one by one?

I have a Spark dataframe. I would like to fetch the values of a column one by one and assign them to a variable. How can this be done in PySpark? Sorry, I am a newbie to Spark as well as Stack Overflow. Please forgive the lack of clarity in the question.
col1 = df.select(df.column_of_df).collect()
list1 = [str(i[0]) for i in col1]
# after this we can iterate through the list (list1 in this case)
I don't understand exactly what you are asking, but if you want to store the values in a variable outside of the dataframes that Spark offers, the best option is to select the column you want and store it as a pandas Series (only if the values are not too many, because your driver memory is limited).
from pyspark.sql import functions as F
var = df.select(F.col('column_you_want')).toPandas()
Then you can iterate over it like a normal pandas Series (toPandas() returns a pandas DataFrame holding that single column).
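For example, a minimal sketch of iterating over the collected values:
# Iterate over the collected column (now in driver memory as pandas data)
for value in var['column_you_want']:
    print(value)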

Using MLUtils.convertVectorColumnsToML() inside a UDF?

I have a Dataset/Dataframe with an mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column of type ml.linalg.Vector to this dataset (so I will have both types of Vectors). The reason is that I am evaluating a few algorithms, some of which expect an mllib vector while others expect an ml vector. Also, I have to feed the output of one algorithm to another, and each uses a different type.
Can someone please help me convert mllib.linalg.Vector to ml.linalg.Vector and append a new column to the dataset at hand? I tried using MLUtils.convertVectorColumnsToML() inside a UDF and in regular functions, but I was not able to get it working. I am trying to avoid creating a new dataset and then doing an inner join and dropping the columns, as the dataset will eventually be huge and joins are expensive.
You can use the asML method to convert from an mllib vector to an ml vector. A UDF and usage example can look like this:
val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) => {
  mllibVec.asML
})
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
Assuming df to be the original dataframe and the column with the mllib vector to be named mllibVector.

Matlab: data from cell matrix to struct. How can I organize my data with keys?

Let me suppose I'm facing some data obtained by a SQL database query as below (of course my real case is bigger: thousands of rows and many columns).
key_names    header1    header2    header3
-------------------------------------------
key1         a          1          bar
key2         b          2          foo
key3         c          3          bla
My goal is to organize the data in Matlab (at work I must use it) in a smart and efficient way, to get the following results:
Access data by key, obtaining the whole row, like dataset(key, :)
Access data by key plus header, getting back a single value: dataset.header(key)
If possible, getting a whole column (for all keys).
First of all, I used the dataset class provided by the Statistics Toolbox because it has all these features, but I decided to move away from it because it is really slow (from what I gathered, it is basically a wrapper around cell arrays): the bottleneck of my code was getting the data rather than performing computations. In fact, I read that it is better to avoid it as much as possible.
The newer table class looks more efficient, but still not by much: from what I have understood, it is the new version of dataset, as explained in the official documentation.
I also considered containers.Map, but it does not seem to support access by both key and column.
Therefore, struct seems to be the best choice as it is really fast and it has all the features I'm looking for.
So here my questions:
Did someone face my same problem? Which way to organize data is the best one?
Let me suppose struct is the best. How can I efficiently create and fill a structure like this: mystruct.key.header?
I'd like to get something like this:
mystruct.key1.header1
ans = a
Of course I could loop, but there must be a better way. I found this to be a good starting point, but the struct is created empty:
fn1 = {'a', 'b', 'c'}; %first level
fn2 = {'d', 'e', 'f'}; %second level
s2 = cell2struct(cell(size(fn2(:))),fn2(:));
s = cell2struct(repmat({s2},size(fn1(:))),fn1(:))
In the cell2struct documentation, none of the examples rename all the levels. The deal help looks like a good way to fill in the data (depending on the Matlab version, since from 7.0 it was superseded by a new coding style), but I'm still missing how to combine creating the structure with filling it.
Any suggestion or code example is really appreciated.
If you think, or are sure, that structs are the best option for you, you can use table2struct. First, import all the data into Matlab as a table, and then convert it to a structure.
mystruct = table2struct(data);
To access your data you would use the following syntax:
mystruct(key).header
If key is an array, then you need to collect all the values into a list, using either a cell array:
values = {mystruct(key).header}
or different variables:
[v1, v2, v3] = mystruct(key).header
but the latter option is problematic if you are not sure how many outputs to expect.
I'm not sure which will be more convenient for you, but you can also convert to a scalar structure by setting the 'ToScalar' argument to true.