KMeans in PySpark with an ndarray

I'm trying to run KMeans in PySpark, feeding it an ndarray of shape (289, 768). But when I call KMeans.fit, an error occurs:
import numpy as np
from sentence_transformers import SentenceTransformer
from pyspark.ml.clustering import KMeans

# Collect the text column to the driver as a flat numpy array
text = np.array(df_low_conf.select("text").collect()).reshape(-1)

# Encode the sentences into 768-dimensional embeddings
model = SentenceTransformer('neuralmind/bert-base-english-cased')
sentence_embeddings = model.encode(text)

kmeans = KMeans(k=num_clusters, initMode='k-means||', maxIter=2000, initSteps=10)
model = kmeans.fit(sentence_embeddings)
'numpy.ndarray' object has no attribute '_jdf'
Is this possible in PySpark? I tried the same thing with pandas and it worked fine. If you have any advice or tips, please let me know.
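For context, pyspark.ml's KMeans.fit expects a Spark DataFrame with a vector column rather than a numpy ndarray, which is why it fails looking for a _jdf attribute. Below is a minimal sketch of one way to bridge the two, reusing sentence_embeddings and num_clusters from the snippet above; the spark session and the features_df name are assumptions for illustration.
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

# Wrap each embedding row in a DenseVector and build a Spark DataFrame
# with a single "features" column (name is arbitrary).
rows = [(Vectors.dense(row.tolist()),) for row in sentence_embeddings]
features_df = spark.createDataFrame(rows, ["features"])

kmeans = KMeans(k=num_clusters, featuresCol="features",
                initMode="k-means||", maxIter=2000, initSteps=10)
kmeans_model = kmeans.fit(features_df)
clustered = kmeans_model.transform(features_df)  # adds a "prediction" column with cluster ids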

Related

'SequentialFeatureSelector' object has no attribute 'ranking_'

Hi everyone,
I would like to sort the best features that I'm getting from SequentialFeatureSelector based on their ranks.
But I get this error:
'SequentialFeatureSelector' object has no attribute 'ranking_'
Can you help me to solve it and sort my features based on their importance?
import pandas as pd

df = pd.read_csv('')

# Dividing X and Y
Y = df.iloc[:, -1:]
X = df[df.columns.drop(Y)]

# Selecting the K best features
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
sfs = SequentialFeatureSelector(model, n_features_to_select=5)
fit = sfs.fit(X, Y)

# Selecting them and showing them in a dataframe
names = X.columns.values
ranking = sfs.ranking_
names_scores = list(zip(names, ranking))
ns_df = pd.DataFrame(data=names_scores, columns=['Feature_names', 'ranks'])
L = ns_df.sort_values('ranks', ascending=False)
L
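Worth noting as background: scikit-learn's SequentialFeatureSelector does not expose a ranking_ attribute (that belongs to RFE); it only reports which features were selected, via get_support(). A minimal sketch of both options, with RFE as a hypothetical substitute if an actual ranking is needed:
# Option 1: SequentialFeatureSelector only tells you which features were kept
selected_mask = sfs.get_support()          # boolean mask aligned with X.columns
selected_names = X.columns[selected_mask]

# Option 2 (assumption: a true ranking is wanted): RFE does expose ranking_
from sklearn.feature_selection import RFE
rfe = RFE(DecisionTreeRegressor(), n_features_to_select=5).fit(X, Y.values.ravel())
ns_df = pd.DataFrame({'Feature_names': X.columns, 'ranks': rfe.ranking_})
print(ns_df.sort_values('ranks'))          # rank 1 = selected / most important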

TabPy executing Python code, SCRIPT_REAL is being called with (string)

First time using TabPy, and I have the connection successfully set up. In Tableau's "Create Calculated Field" dialog, I have tried
SCRIPT_REAL("
import numpy as np
import pandas as pd
")
which results in "SCRIPT_REAL is being called with (string), did you mean (string, ...)?"
Additionally, how do I refer to the dataset, as I did in Python, to execute the following?
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("C:/Users/.../dataset.csv")
data.head()

plt.figure(figsize=(7, 7))
plt.pie(data['stroke'].value_counts(sort=True),
        explode=(0.05, 0),
        labels=data['stroke'].value_counts(sort=True).index,
        colors=["blue", "green"],
        autopct='%1.1f%%')
plt.title('Pie Chart')
plt.show()
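For reference, SCRIPT_REAL takes the Python code string plus at least one aggregated expression; the values of those expressions arrive inside the script as _arg1, _arg2, ..., and the script must return a number or a list of numbers for the partition. A minimal sketch, assuming the workbook exposes a numeric [stroke] field from the CSV above:
SCRIPT_REAL("
import numpy as np
# _arg1 is the list of SUM([stroke]) values Tableau passes in
return [float(np.mean(_arg1))] * len(_arg1)
", SUM([stroke]))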

Getting model parameters from regression in Pyspark in efficient way for large data

I have created a function that fits an OLS regression and returns just the model parameters. I used groupby with applyInPandas, but it's taking too much time. Is there a more efficient way to do this?
Note: I didn't really have anything to group by, but since applyInPandas cannot be used without a groupby, I created a dummy feature 'group' with the constant value 1.
Code
import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

pdf = pd.DataFrame({
    'x': [3, 6, 2, 0, 1, 5, 2, 3, 4, 5],
    'y': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
    'z': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})
df = sqlContext.createDataFrame(pdf)

result_schema = StructType([
    StructField('index', StringType()),
    StructField('coef', DoubleType())
])

def ols(pdf):
    y_column = ['z']
    x_column = ['x', 'y']
    y = pdf[y_column]
    X = pdf[x_column]
    model = sm.OLS(y, X).fit()
    param_table = pd.DataFrame(model.params, columns=['coef']).reset_index()
    return param_table

# Adding a dummy column so groupby can be applied
df = df.withColumn('group', lit(1))
# Applying the function
data = df.groupby('group').applyInPandas(ols, schema=result_schema)
Final output sample
index coef
x 0.183246073
y 0.770680628
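One possible direction, sketched here rather than a drop-in replacement: with a single constant group, applyInPandas funnels every row through one pandas worker. pyspark.ml's LinearRegression with no regularization fits the same no-intercept OLS as sm.OLS (no add_constant) and stays distributed; the spark session name is assumed.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=['x', 'y'], outputCol='features')
train = assembler.transform(df)

# fitIntercept=False matches sm.OLS without add_constant; regParam=0.0 keeps it plain OLS
lr = LinearRegression(featuresCol='features', labelCol='z', fitIntercept=False, regParam=0.0)
lr_model = lr.fit(train)

coef_df = spark.createDataFrame(
    list(zip(['x', 'y'], [float(c) for c in lr_model.coefficients])),
    result_schema)
coef_df.show()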

PySpark Cosine Similarity between two vectors of TF-IDF values using SparseMatrix + koalas (Pandas API on Spark)

I am trying to implement this Name Matching Cosine Similarity approach (the get_matches_df function) in pyspark and pandas_on_spark (koalas), and I am struggling to optimize this function. I want to avoid converting the dataframes with toPandas() because that would overload the driver, so I want to optimize the function and make it scale; basically, a batch approach like the example below would work perfectly, as would pandas_udfs or simple UDFs that take one vector and two dataframes:
>>> psdf = ps.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pdf):
... return pdf[pdf.a > 1] # allow arbitrary length
...
>>> psdf.pandas_on_spark.apply_batch(pandas_plus)
This is the function I am working on optimizing. Everything else I have already converted: I created a custom TfidfVectorizer, a scaled cosine similarity, and a PySpark sparse-matrix generator. All I have left to optimize is this part (because it uses loc-style indexing and I'm not sure how that scales). I don't mind if it behaves like pandas, i.e. pulls the whole dataframe to the driver, but ideally it would be distributed:
import numpy as np
import pandas as pd

def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]

    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similairity': similairity})
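As a rough sketch (not the original author's method), the Python loop can be replaced with numpy fancy indexing, which removes the element-by-element access and makes the body easier to wrap in a pandas UDF or apply_batch later; the function name below is made up for illustration.
import numpy as np
import pandas as pd

def get_matches_df_vectorized(sparse_matrix, name_vector, top=100):
    # nonzero() gives the row/column indices of every stored similarity
    rows, cols = sparse_matrix.nonzero()
    n = top if top else cols.size

    names = np.asarray(name_vector, dtype=object)
    return pd.DataFrame({
        'left_side': names[rows[:n]],
        'right_side': names[cols[:n]],
        'similairity': sparse_matrix.data[:n]})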

Percentile rank in pyspark using QuantileDiscretizer

I am wondering if it's possible to obtain the result of percentile_rank using the QuantileDiscretizer transformer in pyspark.
The reason is that I am trying to avoid computing percent_rank over the entire column, as it generates the following warning:
WARN WindowExec: No Partition Defined for Window operation!
Moving all data to a single partition, this can cause serious performance degradation.
The method I am following is to first use QuantileDiscretizer then normalize to [0,1]:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.ml.feature import QuantileDiscretizer
from scipy.stats import gamma

X1 = gamma.rvs(0.2, size=1000)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))

df = df.withColumn("perc_rank", F.percent_rank().over(Window.orderBy("x")))

df = QuantileDiscretizer(numBuckets=df.count() + 1,
                         inputCol="x",
                         outputCol="q_discretizer").fit(df).transform(df)

agg_values = df.agg(F.max(df["q_discretizer"]).alias("maxval"),
                    F.min(df["q_discretizer"]).alias("minval")).collect()[0]
xmax, xmin = agg_values.__getitem__("maxval"), agg_values.__getitem__("minval")

normalize = F.udf(lambda x: (x - xmin) / (xmax - xmin))
df = df.withColumn("perc_discretizer", normalize("q_discretizer"))
df = df.withColumn("error", F.round(F.abs(F.col("perc_discretizer") - F.col("perc_rank")), 6))

df.select(F.max("error")).show()
df.show(5)
However, it seems that the error grows as the number of data points increases, so I am not sure this is the right way to do it.
Is it possible to use QuantileDiscretizer to obtain the percentile_rank?
Alternatively is there a way to compute percentile_rank over an entire column in an efficient way?
Well you can use the below to avoid the warning message:
X1 = gamma.rvs(0.2, size=10)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
df = df.withColumn("dummyCol", F.lit("some_val"))
win = Window.partitionBy("dummyCol").orderBy("x")
df = df.withColumn("perc_rank", F.percent_rank().over(win)).drop("dummyCol")
But nonetheless, the data would still be moved to a single worker. I don't think there is any better alternative to avoid the shuffle here, since the complete column needs to be rank-ordered.
In case you have multiple windows over the same column, you can try to pre-partition the data and then apply the ranking functions.
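One way that pre-partitioning idea might look (a sketch only, assuming several ranking functions are needed over the same ordering of "x"; the dummy column trick is the one from the answer above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

win = Window.partitionBy("dummyCol").orderBy("x")

df = (df
      .withColumn("dummyCol", F.lit("some_val"))
      .repartition("dummyCol")                          # shuffle once, up front
      .withColumn("perc_rank", F.percent_rank().over(win))
      .withColumn("cum_dist", F.cume_dist().over(win))  # reuses the same window spec
      .drop("dummyCol"))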