How to show the MAE, MSE, and R2 in a tabular format using data frame for the linear regression model?

How to show the MAE, MSE, and R2 in a tabular format using data frame for the linear regression model? - linear-regression

I have these, but I am not sure how I can show them in tabular format.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
LinearRegression_MAE = mean_squared_error(predictions, y_test)
LinearRegression_MSE = mean_absolute_error(predictions, y_test)
LinearRegression_R2 = r2_score(predictions, y_test)
Report = ?

Let's say LinarRegresion_MAE to MAE, LinearRegression_MSE to MSE, LinearRegression_R2 to R2. Firstly we create dict then we convert dict to dataframe.
Report = {"Metrics":["MAE","MSE","R2"],"Result":
[LinearRegression_MAE,LinearRegression_MSE,LinearRegression_R2]}
pd.DataFrame(Report)

Related

Kmeans PySpark ndarray

I'm trying to implement KMeans by PySpark feeding with a ndarray of dimensions: (289, 768). But when call KMEANS.fit, an error occurs:
text = np.array(df_low_conf.select("text").collect()).reshape(-1)
model = SentenceTransformer('neuralmind/bert-base-english-cased')
sentence_embeddings = model.encode(text)
kmeans = KMeans(k = num_clusters, initMode = 'k-means||', maxIter= 2000, initSteps = 10)
model = kmeans.fit(sentence_embeddings)
'numpy.ndarray' object has no attribute '_jdf'
Is it possible in PySpark? Because I tried on pandas and it was fine. If you have any advice or tip, please send me a message.

Getting model parameters from regression in Pyspark in efficient way for large data

I have created a function for applying OLS regression and just getting the model parameters. I used groupby and applyInPandas but it's taking too much of time. Is there are more efficient way to work around this?
Note: I din't had to use groupby as all features have many levels but as I cannot use applyInPandas without it so I created a dummy feature as 'group' having the same value as 1.
Code
import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import lit
pdf = pd.DataFrame({
'x':[3,6,2,0,1,5,2,3,4,5],
'y':[0,1,2,0,1,5,2,3,4,5],
'z':[2,1,0,0,0.5,2.5,3,4,5,6]})
df = sqlContext.createDataFrame(pdf)
result_schema =StructType([
StructField('index',StringType()),
StructField('coef',DoubleType())
])
def ols(pdf):
y_column = ['z']
x_column = ['x', 'y']
y = pdf[y_column]
X = pdf[x_column]
model = sm.OLS(y, X).fit()
param_table =pd.DataFrame(model.params, columns = ['coef']).reset_index()
return param_table
#adding a new column to apply groupby
df = df.withColumn('group', lit(1))
#applying function
data = df.groupby('group').applyInPandas(ols, schema = result_schema)
Final output sample
index coef
x 0.183246073
y 0.770680628

How to get support numbers in pyspark.ml like we get in sklearn classification report?

I am using Pyspark and able to get metrices like accuracy, f1, precison and recall from MulticlassClassificationEvaluator but I am not sure how to get the support numbers like we get in classification report for sklearn. rfc_pred in my case has the group of each class that I run in loop. So, will rfc_pred.count() do the trick?
Below is my current code:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")
accuracy = evaluator.evaluate(rfc_pred, {evaluator.metricName: "accuracy"})
f1 = evaluator.evaluate(rfc_pred, {evaluator.metricName: "f1"})
weightedPrecision = evaluator.evaluate(rfc_pred, {evaluator.metricName: "weightedPrecision"})
weightedRecall = evaluator.evaluate(rfc_pred, {evaluator.metricName: "weightedRecall"})

LSTM neural network with two sources of data

I have the following configuration: One lstm network that receives a text with n-grams with size 2. Below a simple schematic:
After some tests, I noticed that for some classes I have an significant incrise on accuracy when I use ngrams with size 3. Now I want to train a new LSTM neural network with both ngram sizes at same time, like the following schematic:
How can I provide the data and build this model, using keras to perform this task?

I assume you already have a function to split words into n-grams, as you already have the 2-grams and 3-grams model working? Therefor I just construct a one-sample example of the word "cool" for a working example. I had to use embedding for my example, as an LSTM layer with 26^3=17576 nodes was a little too much for my computer to handle. I expect you did the same in your 3-grams code?
Below is a complete working example:
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, concatenate
from tensorflow.keras.models import Model
import numpy as np
# c->2 o->14 o->14 l->11
np_2_gram_in = np.array([[26*2+14,26*14+14,26*14+11]])#co,oo,ol
np_3_gram_in = np.array([[26**2*2+26*14+14,26**2*14+26*14+26*11]])#coo,ool
np_output = np.array([[1]])
output_shape=1
lstm_2_gram_embedding = 128
lstm_3_gram_embedding = 192
inputs_2_gram = Input(shape=(None,))
em_input_2_gram = Embedding(output_dim=lstm_2_gram_embedding, input_dim=26**2)(inputs_2_gram)
lstm_2_gram = LSTM(lstm_2_gram_embedding)(em_input_2_gram)
inputs_3_gram = Input(shape=(None,))
em_input_3_gram = Embedding(output_dim=lstm_3_gram_embedding, input_dim=26**3)(inputs_3_gram)
lstm_3_gram = LSTM(lstm_3_gram_embedding)(em_input_3_gram)
concat = concatenate([lstm_2_gram, lstm_3_gram])
output = Dense(output_shape,activation='sigmoid')(concat)
model = Model(inputs=[inputs_2_gram, inputs_3_gram], outputs=[output])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit([np_2_gram_in, np_3_gram_in], [np_output], epochs=5)
model.predict([np_2_gram_in,np_3_gram_in])

How to apply LSTM-autoencoder to variant-length time-series data?

I read LSTM-autoencoder in this tutorial: https://blog.keras.io/building-autoencoders-in-keras.html, and paste the corresponding keras implementation below:
from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model
inputs = Input(shape=(timesteps, input_dim))
encoded = LSTM(latent_dim)(inputs)
decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
In this implementation, they fixed the input to be of shape (timesteps, input_dim), which means length of time-series data is fixed to be timesteps. If I remember correctly RNN/LSTM can handle time-series data of variable lengths and I am wondering if it is possible to modify the code above somehow to accept data of any length?
Thanks!

You can use shape=(None, input_dim)
But the RepeatVector will need some hacking taking dimensions directly from the input tensor. (The code works with tensorflow, not sure about theano)
import keras.backend as K
def repeat(x):
stepMatrix = K.ones_like(x[0][:,:,:1]) #matrix with ones, shaped as (batch, steps, 1)
latentMatrix = K.expand_dims(x[1],axis=1) #latent vars, shaped as (batch, 1, latent_dim)
return K.batch_dot(stepMatrix,latentMatrix)
decoded = Lambda(repeat)([inputs,encoded])
decoded = LSTM(input_dim, return_sequences=True)(decoded)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to show the MAE, MSE, and R2 in a tabular format using data frame for the linear regression model? - linear-regression

Let's say LinarRegresion_MAE to MAE, LinearRegression_MSE to MSE, LinearRegression_R2 to R2. Firstly we create dict then we convert dict to dataframe. Report = {"Metrics":["MAE","MSE","R2"],"Result": [LinearRegression_MAE,LinearRegression_MSE,LinearRegression_R2]} pd.DataFrame(Report)

Related

Kmeans PySpark ndarray

Getting model parameters from regression in Pyspark in efficient way for large data

How to get support numbers in pyspark.ml like we get in sklearn classification report?

LSTM neural network with two sources of data

How to apply LSTM-autoencoder to variant-length time-series data?

Categories

Resources