How to show the MAE, MSE, and R2 in a tabular format using data frame for the linear regression model? - linear-regression

I have these, but I am not sure how I can show them in tabular format.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
LinearRegression_MAE = mean_squared_error(predictions, y_test)
LinearRegression_MSE = mean_absolute_error(predictions, y_test)
LinearRegression_R2 = r2_score(predictions, y_test)
Report = ?

Let's say LinarRegresion_MAE to MAE, LinearRegression_MSE to MSE, LinearRegression_R2 to R2. Firstly we create dict then we convert dict to dataframe.
Report = {"Metrics":["MAE","MSE","R2"],"Result":
[LinearRegression_MAE,LinearRegression_MSE,LinearRegression_R2]}
pd.DataFrame(Report)

Related

Kmeans PySpark ndarray

I'm trying to implement KMeans by PySpark feeding with a ndarray of dimensions: (289, 768). But when call KMEANS.fit, an error occurs:
text = np.array(df_low_conf.select("text").collect()).reshape(-1)
model = SentenceTransformer('neuralmind/bert-base-english-cased')
sentence_embeddings = model.encode(text)
kmeans = KMeans(k = num_clusters, initMode = 'k-means||', maxIter= 2000, initSteps = 10)
model = kmeans.fit(sentence_embeddings)
'numpy.ndarray' object has no attribute '_jdf'
Is it possible in PySpark? Because I tried on pandas and it was fine. If you have any advice or tip, please send me a message.

Getting model parameters from regression in Pyspark in efficient way for large data

I have created a function for applying OLS regression and just getting the model parameters. I used groupby and applyInPandas but it's taking too much of time. Is there are more efficient way to work around this?
Note: I din't had to use groupby as all features have many levels but as I cannot use applyInPandas without it so I created a dummy feature as 'group' having the same value as 1.
Code
import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import lit
pdf = pd.DataFrame({
'x':[3,6,2,0,1,5,2,3,4,5],
'y':[0,1,2,0,1,5,2,3,4,5],
'z':[2,1,0,0,0.5,2.5,3,4,5,6]})
df = sqlContext.createDataFrame(pdf)
result_schema =StructType([
StructField('index',StringType()),
StructField('coef',DoubleType())
])
def ols(pdf):
y_column = ['z']
x_column = ['x', 'y']
y = pdf[y_column]
X = pdf[x_column]
model = sm.OLS(y, X).fit()
param_table =pd.DataFrame(model.params, columns = ['coef']).reset_index()
return param_table
#adding a new column to apply groupby
df = df.withColumn('group', lit(1))
#applying function
data = df.groupby('group').applyInPandas(ols, schema = result_schema)
Final output sample
index coef
x 0.183246073
y 0.770680628

How to get support numbers in pyspark.ml like we get in sklearn classification report?

I am using Pyspark and able to get metrices like accuracy, f1, precison and recall from MulticlassClassificationEvaluator but I am not sure how to get the support numbers like we get in classification report for sklearn. rfc_pred in my case has the group of each class that I run in loop. So, will rfc_pred.count() do the trick?
Below is my current code:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")
accuracy = evaluator.evaluate(rfc_pred, {evaluator.metricName: "accuracy"})
f1 = evaluator.evaluate(rfc_pred, {evaluator.metricName: "f1"})
weightedPrecision = evaluator.evaluate(rfc_pred, {evaluator.metricName: "weightedPrecision"})
weightedRecall = evaluator.evaluate(rfc_pred, {evaluator.metricName: "weightedRecall"})

LSTM neural network with two sources of data

I have the following configuration: One lstm network that receives a text with n-grams with size 2. Below a simple schematic:
After some tests, I noticed that for some classes I have an significant incrise on accuracy when I use ngrams with size 3. Now I want to train a new LSTM neural network with both ngram sizes at same time, like the following schematic:
How can I provide the data and build this model, using keras to perform this task?
I assume you already have a function to split words into n-grams, as you already have the 2-grams and 3-grams model working? Therefor I just construct a one-sample example of the word "cool" for a working example. I had to use embedding for my example, as an LSTM layer with 26^3=17576 nodes was a little too much for my computer to handle. I expect you did the same in your 3-grams code?
Below is a complete working example:
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, concatenate
from tensorflow.keras.models import Model
import numpy as np
# c->2 o->14 o->14 l->11
np_2_gram_in = np.array([[26*2+14,26*14+14,26*14+11]])#co,oo,ol
np_3_gram_in = np.array([[26**2*2+26*14+14,26**2*14+26*14+26*11]])#coo,ool
np_output = np.array([[1]])
output_shape=1
lstm_2_gram_embedding = 128
lstm_3_gram_embedding = 192
inputs_2_gram = Input(shape=(None,))
em_input_2_gram = Embedding(output_dim=lstm_2_gram_embedding, input_dim=26**2)(inputs_2_gram)
lstm_2_gram = LSTM(lstm_2_gram_embedding)(em_input_2_gram)
inputs_3_gram = Input(shape=(None,))
em_input_3_gram = Embedding(output_dim=lstm_3_gram_embedding, input_dim=26**3)(inputs_3_gram)
lstm_3_gram = LSTM(lstm_3_gram_embedding)(em_input_3_gram)
concat = concatenate([lstm_2_gram, lstm_3_gram])
output = Dense(output_shape,activation='sigmoid')(concat)
model = Model(inputs=[inputs_2_gram, inputs_3_gram], outputs=[output])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit([np_2_gram_in, np_3_gram_in], [np_output], epochs=5)
model.predict([np_2_gram_in,np_3_gram_in])

How to apply LSTM-autoencoder to variant-length time-series data?

I read LSTM-autoencoder in this tutorial: https://blog.keras.io/building-autoencoders-in-keras.html, and paste the corresponding keras implementation below:
from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model
inputs = Input(shape=(timesteps, input_dim))
encoded = LSTM(latent_dim)(inputs)
decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
In this implementation, they fixed the input to be of shape (timesteps, input_dim), which means length of time-series data is fixed to be timesteps. If I remember correctly RNN/LSTM can handle time-series data of variable lengths and I am wondering if it is possible to modify the code above somehow to accept data of any length?
Thanks!
You can use shape=(None, input_dim)
But the RepeatVector will need some hacking taking dimensions directly from the input tensor. (The code works with tensorflow, not sure about theano)
import keras.backend as K
def repeat(x):
stepMatrix = K.ones_like(x[0][:,:,:1]) #matrix with ones, shaped as (batch, steps, 1)
latentMatrix = K.expand_dims(x[1],axis=1) #latent vars, shaped as (batch, 1, latent_dim)
return K.batch_dot(stepMatrix,latentMatrix)
decoded = Lambda(repeat)([inputs,encoded])
decoded = LSTM(input_dim, return_sequences=True)(decoded)