PySpark Loses Metadata After MinMaxScaler

I'm using the student data set from:
https://archive.ics.uci.edu/ml/machine-learning-databases/00320/
If I scale the features in the pipeline, the bulk of the metadata I need later is lost. I'm selecting the numeric and categorical columns I wish to use for the model. Below is the data setup and pipeline without scaling, so the metadata is produced; the scaling stages are commented out for easy replication.
# load data
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('student-performance').getOrCreate()
df_raw = spark.read.options(delimiter=';', header=True, inferSchema=True).csv('student-mat.csv')
# specify columns and filter
cols_cate = ['school', 'sex', 'Pstatus', 'Mjob', 'Fjob', 'famsup', 'activities', 'higher', 'internet', 'romantic']
cols_num = ['age', 'Medu', 'Fedu', 'studytime', 'failures', 'famrel', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2']
col_label = ['G3']
keep = cols_cate + cols_num + col_label
df_keep = df_raw.select(keep)
# setup pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, MinMaxScaler
cols_assembly = []
stages = []
for col in cols_cate:
    string_index = StringIndexer(inputCol=col, outputCol=col+'-indexed')
    encoder = OneHotEncoder(inputCol=string_index.getOutputCol(), outputCol=col+'-encoded')
    cols_assembly.append(encoder.getOutputCol())
    stages += [string_index, encoder]
# assemble vectors
assembler_input = cols_assembly + cols_num
assembler = VectorAssembler(inputCols=assembler_input, outputCol='features')
stages += [assembler]
# MinMaxScaler option - will need to change 'features' -> 'scaled-features' later
#scaler = MinMaxScaler(inputCol='features', outputCol='scaled-features')
#stages += [scaler]
# apply pipeline
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df_keep)
df_pipe = pipelineModel.transform(df_keep)
cols_selected = ['features'] + cols_cate + cols_num + ['G3']
df_pipe = df_pipe.select(cols_selected)
Make the training data, fit a model, and get predictions.
from pyspark.ml.regression import LinearRegression
train, test = df_pipe.randomSplit([0.7, 0.3], seed=14)
lr = LinearRegression(featuresCol='features',labelCol='G3', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(train)
lr_preds = lrModel.transform(test)
Checking the metadata of the "features" column shows a lot of information:
lr_preds.schema['features'].metadata
Output:
{'ml_attr': {'attrs': {'numeric': [{'idx': 16, 'name': 'age'},
{'idx': 17, 'name': 'Medu'},
{'idx': 18, 'name': 'Fedu'},
{'idx': 19, 'name': 'studytime'},
{'idx': 20, 'name': 'failures'},
{'idx': 21, 'name': 'famrel'},
{'idx': 22, 'name': 'goout'},
{'idx': 23, 'name': 'Dalc'},
{'idx': 24, 'name': 'Walc'},
{'idx': 25, 'name': 'health'},
{'idx': 26, 'name': 'absences'},
{'idx': 27, 'name': 'G1'},
{'idx': 28, 'name': 'G2'}],
'binary': [{'idx': 0, 'name': 'school-encoded_GP'},
{'idx': 1, 'name': 'sex-encoded_F'},
{'idx': 2, 'name': 'Pstatus-encoded_T'},
{'idx': 3, 'name': 'Mjob-encoded_other'},
{'idx': 4, 'name': 'Mjob-encoded_services'},
{'idx': 5, 'name': 'Mjob-encoded_at_home'},
{'idx': 6, 'name': 'Mjob-encoded_teacher'},
{'idx': 7, 'name': 'Fjob-encoded_other'},
{'idx': 8, 'name': 'Fjob-encoded_services'},
{'idx': 9, 'name': 'Fjob-encoded_teacher'},
{'idx': 10, 'name': 'Fjob-encoded_at_home'},
{'idx': 11, 'name': 'famsup-encoded_yes'},
{'idx': 12, 'name': 'activities-encoded_yes'},
{'idx': 13, 'name': 'higher-encoded_yes'},
{'idx': 14, 'name': 'internet-encoded_yes'},
{'idx': 15, 'name': 'romantic-encoded_no'}]},
'num_attrs': 29}}
If I add scaling after the VectorAssembler (commented-out above) in the pipeline, retrain, and make predictions again, it loses all of this metadata.
lr_preds.schema['scaled-features'].metadata
Output:
{'ml_attr': {'num_attrs': 29}}
Is there any way to get this metadata back? Thanks in advance!

the column features should remain in the dataframe lr_preds, maybe you can get it from that column instead?

mck's suggestion of using 'features' from lr_preds works to get the metadata; it's unchanged there. Thank you.
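Once retrieved, the ml_attr dict is plain Python, so pairing each vector index with its feature name (handy for labelling lrModel.coefficients later) can be sketched as follows; the meta literal here is a truncated copy of the output above:

```python
# Truncated copy of lr_preds.schema['features'].metadata from above.
meta = {
    'ml_attr': {
        'attrs': {
            'numeric': [{'idx': 16, 'name': 'age'},
                        {'idx': 27, 'name': 'G1'},
                        {'idx': 28, 'name': 'G2'}],
            'binary': [{'idx': 0, 'name': 'school-encoded_GP'},
                       {'idx': 1, 'name': 'sex-encoded_F'}],
        },
        'num_attrs': 29,
    }
}

def feature_names(metadata):
    """Flatten every attribute group and sort by vector index."""
    attrs = []
    for group in metadata['ml_attr']['attrs'].values():
        attrs.extend(group)
    return [a['name'] for a in sorted(attrs, key=lambda a: a['idx'])]

print(feature_names(meta))
# ['school-encoded_GP', 'sex-encoded_F', 'age', 'G1', 'G2']
```

On the full metadata dict this yields all 29 names in vector order, which lines up index-for-index with the model's coefficient vector.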

Related

How to use the @app.callback function with two inputs as dropdowns?

Hello everyone!
I have been trying to create an interactive dashboard in Python using the @app.callback decorator with two inputs. My dataset layout can be summarized into 4 main columns. [1]: https://i.stack.imgur.com/boMKt.png
I'd like Geography and Time Period to appear as dropdowns (i.e. use the dcc.Dropdown component). The first dropdown filters the dataset by Geography, and the second selects the time period (MAT, L12W, or L4W) within that country, so the second dropdown depends on the first.
I am familiar with both the dropdown and the @app.callback decorator, but I can't seem to find a script that fuses the two. Important note: the desired output is a pie chart showing each Manufacturer's (column 2) value share (column 4) for the selected Geography and Time Period. I am guessing the mystery resides in the app.layout structure; however, everything I have tried so far fails.
You will also find the code I have written so far below. The important part is from "#DESIGN APP LAYOUT" onwards.
I'd really appreciate a quick response. Thanks in advance for the help!
import dash
from dash import html
from dash import dcc
from dash import no_update
from dash.dependencies import Input, Output, State
import plotly.express as px
import pandas as pd

pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.options.display.width = None

data = pd.read_csv(r'C:\Users\Sara.Munoz\OneDrive - Unilever\Documents\Sarita.csv',
                   encoding="ISO-8859-1")
df = data
print(df.head())
cols = df.columns
print(cols)
###RE-ARRANGE DATASET###
df = pd.melt(df, id_vars=['Geography Node Name', 'Geography Id', 'Geography Level',
                          'Category Node Name', 'Category Id', 'Category Level',
                          'Global Manufacturer Name', 'Global Manufacturer Id',
                          'Brand Position Type', 'Brand Position Name', 'Brand Position Id',
                          'Local Brand Name', 'Local Brand Id', 'Measure',
                          'Currency or Unit of Measure', 'Latest Available Date'],
             value_vars=['MAT', 'L12W', 'L4W'], var_name='Period', value_name='Data')
for col in df.columns:
    print(col)
###CLEAN DATASET###
df.rename(columns={'Geography Node Name': 'Geography', 'Category Node Name': 'Category',
                   'Global Manufacturer Name': 'Manufacturer', 'Geography Level': 'GLevel'}, inplace=True)
df.drop(["Geography Id", "Category Id", "Global Manufacturer Id", "Brand Position Type",
         "Brand Position Name", "Brand Position Id", "Local Brand Name", "Local Brand Id",
         "Latest Available Date", "Currency or Unit of Measure"], axis=1, inplace=True)
print("SEE BELOW NEW DATASET")
print(df.head())
#####FOR VALUE SHARE
print("FOR VALUE SHARE")
df2 = df.loc[df['GLevel'] == 5]
df2 = df2.loc[df2['Measure'] == 'Value Share']
df2 = df2.loc[df2['Category'] == 'Toothpaste']
df2 = df2[df2.Manufacturer != 'ALL MANUFACTURERS']
df2 = df2[df2.Category != 'Oral Care']
df2.drop(["GLevel", "Category","Category Level"], axis = 1, inplace=True)
print(df2.head())
#####FOR VOLUME SHARE
print("FOR VOLUME SHARE")
df3 = df.loc[df['GLevel'] == 5]
df3 = df3.loc[df3['Measure'] == 'Volume Share']
df3 = df3.loc[df3['Category'] == 'Toothpaste']
df3 = df3[df3.Manufacturer != 'ALL MANUFACTURERS']
df3 = df3[df3.Category != 'Oral Care']
df3.drop(["GLevel", "Category","Category Level"], axis = 1, inplace=True)
df3=df3.sort_values(['Geography', 'Period'],ascending = [True, True])
df3 = pd.DataFrame(df3)
df3=df3[['Geography','Period','Manufacturer','Measure','Data']]
print(df3)
###############################################################################
app = dash.Dash(__name__)
app.layout = html.Div(
    [
        dcc.Dropdown(
            id="dropdown-1",
            options=[
                {'label': 'Indonesia', 'value': 'Indonesia'},
                {'label': 'France', 'value': 'France'},
                {'label': 'Vietnam', 'value': 'Vietnam'},
                {'label': 'Chile', 'value': 'Chile'},
                {'label': 'United Arab Emirates', 'value': 'United Arab Emirates'},
                {'label': 'Morocco', 'value': 'Morocco'},
                {'label': 'Russian Federation', 'value': 'Russian Federation'},
                {'label': 'China', 'value': 'China'},
                {'label': 'Greece', 'value': 'Greece'},
                {'label': 'Netherlands', 'value': 'Netherlands'},
                {'label': 'Austria', 'value': 'Austria'},
                {'label': 'Germany', 'value': 'Germany'},
                {'label': 'Switzerland', 'value': 'Switzerland'},
                {'label': 'Italy', 'value': 'Italy'},
                {'label': 'Denmark', 'value': 'Denmark'},
                {'label': 'Norway', 'value': 'Norway'},
                {'label': 'Sweden', 'value': 'Sweden'}
            ],
            multi=True,
        ),
        dcc.Dropdown(
            id="dropdown-2",
            options=[
                {'label': 'MAT', 'value': 'MAT'},
                {'label': 'L12W', 'value': 'L12W'},
                {'label': 'L4W', 'value': 'L4W'}
            ],
            multi=True,
        ),
        html.Div([], id="plot1", children=[])
    ], style={'display': 'flex'})
@app.callback(
    Output("plot1", "children"),
    [Input("dropdown-1", "value"), Input("dropdown-2", "value")],
    prevent_initial_call=True
)
def get_graph(entered_Geo, entered_Period):
    # Filter df2 with masks built from df2 itself (the original mixed df2 and df3).
    fd = df2[(df2['Geography'] == entered_Geo) &
             (df2['Period'] == entered_Period)]
    g1 = fd.groupby(['Manufacturer'], as_index=False).mean()
    plot1 = px.pie(g1, values='Data', names='Manufacturer', title="Value MS")
    return [dcc.Graph(figure=plot1)]
if __name__ == '__main__':
    app.run_server()
#DESIGN APP LAYOUT##############################################################################
app.layout = html.Div([
    html.Label("Geography:", style={'fontSize': 30, 'textAlign': 'center'}),
    dcc.Dropdown(
        id='dropdown1',
        options=[{'label': s, 'value': s} for s in sorted(df3.Geography.unique())],
        value=None,
        clearable=False
    ),
    html.Label("Period:", style={'fontSize': 30, 'textAlign': 'center'}),
    dcc.Dropdown(id='dropdown2',
                 options=[],
                 value=[],
                 multi=False),
    html.Div([
        html.Div([], id='plot1'),
        html.Div([], id='plot2')
    ], style={'display': 'flex'}),
])
##############
# Populate the Period dropdown with options and values
@app.callback(
    Output('dropdown2', 'options'),
    Output('dropdown2', 'value'),
    Input('dropdown1', 'value'),
)
def set_period_options(chosen_Geo):
    # Build the Period options from the frame filtered by the chosen Geography
    # (the original read from df3 here, so the options never narrowed).
    dff = df3[df3.Geography == chosen_Geo]
    Periods = [{'label': s, 'value': s} for s in dff.Period.unique()]
    values_selected = [x['value'] for x in Periods]
    return Periods, values_selected
# Create graph component and populate with pie chart
@app.callback([Output(component_id='plot1', component_property='children'),
               Output(component_id='plot2', component_property='children')],
              Input('dropdown2', 'value'),
              Input('dropdown1', 'value'),
              prevent_initial_call=True
              )
def update_graph(selected_Period, selected_Geo):
    if len(selected_Period) == 0:
        return no_update
    else:
        # Volume Share
        dff3 = df3[(df3.Geography == selected_Geo) & (df3.Period == selected_Period)]
        # Value Share
        dff2 = df2[(df2.Geography == selected_Geo) & (df2.Period == selected_Period)]
        fig1 = px.pie(dff2, values='Data', names='Manufacturer', title="Value MS")
        fig2 = px.pie(dff3, values='Data', names='Manufacturer', title="Volume MS")
        return [dcc.Graph(figure=fig1),
                dcc.Graph(figure=fig2)]
if __name__ == '__main__':
    app.run_server()
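The chained-dropdown logic can be rehearsed outside Dash with plain pandas: the first callback should return only the Periods present for the chosen Geography. This sketch uses a toy frame with hypothetical values, shaped like df3 after the melt and clean-up steps above:

```python
import pandas as pd

# Toy stand-in for df3 (hypothetical rows, same column names).
df3 = pd.DataFrame({
    'Geography':    ['France', 'France', 'Chile'],
    'Period':       ['MAT', 'L12W', 'MAT'],
    'Manufacturer': ['A', 'B', 'A'],
    'Data':         [10.0, 20.0, 30.0],
})

def period_options(chosen_geo):
    """What a Geography->Period callback should return: options limited
    to the Periods that actually occur for the chosen Geography."""
    dff = df3[df3.Geography == chosen_geo]
    return [{'label': p, 'value': p} for p in dff.Period.unique()]

print(period_options('France'))
# [{'label': 'MAT', 'value': 'MAT'}, {'label': 'L12W', 'value': 'L12W'}]
print(period_options('Chile'))
# [{'label': 'MAT', 'value': 'MAT'}]
```

The key point is filtering before calling unique(); building the options straight from the full frame would return every Period for every Geography.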

how to read value from Json flutter

Hello, I have a JSON object with several values such as payload, key, and card, and I want to get the data I need directly with the key "payload".
var dataFinal= tag.data.toString();
and this what i get if i print my data
[log] handle {nfca: {identifier: [12, 4, 18, 17], atqa: [4, 0], maxTransceiveLength: 253, sak: 8, timeout: 618}, mifareclassic: {identifier: [99, 4, 150, 17], blockCount: 64, maxTransceiveLength: 253, sectorCount: 16, size: 1024, timeout: 618, type: 0}, ndef: {identifier: [99, 4, 150, 17], isWritable: true, maxSize: 716, canMakeReadOnly: false, cachedMessage: {records: [{typeNameFormat: 1, type: [84], identifier: [], payload: [1,45,989]}]}, type: com.nxp.ndef.mifareclassic}}
How can I get the payload value?
You can check out the function jsonDecode(), which expects a String as a parameter and returns dynamic — in your case a Map:
import 'dart:convert';
Map<String,dynamic> data = jsonDecode(tag.data.toString());
print(data["nfca"]);
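For reference, the payload sits several levels deep in that structure. Modelled as an equivalent Python dict (truncated to the relevant keys from the log above), the chain of lookups is:

```python
# Truncated Python stand-in for the decoded tag data shown in the log.
data = {
    'ndef': {
        'cachedMessage': {
            'records': [
                {'typeNameFormat': 1, 'type': [84], 'identifier': [],
                 'payload': [1, 45, 989]},
            ]
        }
    }
}

payload = data['ndef']['cachedMessage']['records'][0]['payload']
print(payload)  # [1, 45, 989]
```

The same chain of index/key lookups applies in Dart once jsonDecode has produced a Map: data['ndef']['cachedMessage']['records'][0]['payload'].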

inserting list of dicts as JSON binary into peewee model

I have a simple list of dicts that I want to insert into my LabelModel table in my Postgres DB:
l = [{'label': 'A',
'x': 132.56338500976562,
'y': 333.7539367675781,
'width': 183.78598022460938,
'height': 404.6580505371094,
'score': 0.9848693609237671},
{'label': 'B',
'x': 179.97842407226562,
'y': 367.101318359375,
'width': 127.43386840820312,
'height': 59.047882080078125,
'score': 0.965998113155365},
{'label': 'C',
'x': 431.1368408203125,
'y': 365.9712219238281,
'width': 127.59616088867188,
'height': 60.77362060546875,
'score': 0.9622131586074829}]
import datetime
import peewee as pw
from playhouse.postgres_ext import BinaryJSONField

class TblLabelByAI(BaseModel):
    name = pw.TextField()
    labels = BinaryJSONField()
    modified_at = pw.DateField(default=datetime.datetime.utcnow)

q = {"imagename": "testname", "labels": l}
TblLabelByAI.get_or_create(**q)
Is there any reason I get the following:
ProgrammingError: operator does not exist: jsonb = record
LINE 1: ...testname') AND ("t1"."labels" = (CAST('{...
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
Peewee: 3.14.9
Python 3.8
@coleifer - amazing work btw
Here is a workaround:
from psycopg2.extras import Json
data = [{'x': 123, 'y': 123},{'x': 123, 'y': 123},{'x': 123, 'y': 123}]
data = Json(data)
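The Json wrapper tells psycopg2 to adapt the value as jsonb rather than a record. As a quick standard-library sanity check (using a hypothetical two-item subset of the list), you can confirm the list of dicts serializes cleanly before handing it to peewee:

```python
import json

# Hypothetical subset of the list of dicts from the question.
l = [{'label': 'A', 'x': 132.56, 'score': 0.98},
     {'label': 'B', 'x': 179.97, 'score': 0.97}]

encoded = json.dumps(l)           # the JSON text that ends up in the jsonb column
assert json.loads(encoded) == l   # the structure round-trips losslessly

# With the workaround, the insert would then look like (hypothetical, not run here):
# TblLabelByAI.get_or_create(name='testname', labels=Json(l))
print(encoded[:20])
```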

Fail to redirect in Flask under a post request

I am working on a small website about food written in Flask. However, I have trouble redirecting users to a new page after they type something into the search bar and click the search button.
@app.route('/', methods=['GET', 'POST'])
def home():
    if request.method == "POST":
        search = request.form['search']
        print(search)
        # Note: interpolating user input into SQL with % is vulnerable to SQL
        # injection; a bound query parameter would be safer.
        sql_search = 'SELECT Food.name,Food.area,Food.price,Food.description,Food.available_amount,Food.score,AVG_PRICE,AVG_SCORE,User.selling_score FROM Food LEFT JOIN(SELECT Food.name, Food.area ,AVG(Food.price) AS AVG_PRICE,AVG(Food.score) AS AVG_SCORE FROM Food GROUP BY Food.name, Food.area) NEW_FOOD ON Food.name = NEW_FOOD.name AND Food.area = NEW_FOOD.area LEFT JOIN User ON Food.maker_id = User.id WHERE Food.name LIKE "%s" ORDER BY Food.price,Food.score DESC,User.selling_score DESC' % search
        search_list = db.session.execute(sql_search).fetchall()
        search_l = process_dict(search_list, attr_search)
        print(search_l)
        return redirect(url_for('test'), code=303)
    return render_template('home.html')
This is the output from the terminal after I search "curry" on the homepage.
curry
[{'name': 'curry', 'area': 61801, 'price': 6.0, 'description': 'nice', 'available_amount': 1, 'score': 5.0, 'AVG_PRICE': 6.0, 'AVG_SCORE': 5.0, 'seller_score': 0.0}, {'name': 'curry', 'area': 61820, 'price': 8.0, 'description': 'ok', 'available_amount': 2, 'score': 5.0, 'AVG_PRICE': 8.666666666666666, 'AVG_SCORE': 4.0, 'seller_score': 0.0}, {'name': 'curry', 'area': 61820, 'price': 8.0, 'description': 'excellent', 'available_amount': 2, 'score': 4.0, 'AVG_PRICE': 8.666666666666666, 'AVG_SCORE': 4.0, 'seller_score': 0.0}, {'name': 'curry', 'area': 61820, 'price': 10.0, 'description': 'excellent', 'available_amount': 1, 'score': 3.0, 'AVG_PRICE': 8.666666666666666, 'AVG_SCORE': 4.0, 'seller_score': 0.0}]
10.193.20.119 - - [31/Mar/2018 20:40:41] "POST / HTTP/1.1" 303 -
10.193.20.119 - - [31/Mar/2018 20:40:41] "GET /test HTTP/1.1" 200
Although the terminal shows a GET request being sent to the page I want to redirect to, the page in the browser doesn't actually change. Does anyone have an idea what is going wrong with my code?
And this is the code for my simple test page:
@app.route('/test')
# @login_required
def test():
    return "hello"

How to use Time in Google Charts?

I'm using Google Charts. Here's an example data from Google's website:
var data = new google.visualization.DataTable(
{
cols: [{id: 'task', label: 'Employee Name', type: 'string'},
{id: 'startDate', label: 'Start Date', type: 'date'}],
rows: [{c:[{v: 'Mike'}, {v: new Date(2008, 1, 28), f:'February 28, 2008'}]},
{c:[{v: 'Bob'}, {v: new Date(2007, 5, 1)}]},
{c:[{v: 'Alice'}, {v: new Date(2006, 7, 16)}]},
{c:[{v: 'Frank'}, {v: new Date(2007, 11, 28)}]},
{c:[{v: 'Floyd'}, {v: new Date(2005, 3, 13)}]},
{c:[{v: 'Fritz'}, {v: new Date(2011, 6, 1)}]}
]
}
)
I'd like to use hours and minutes in the date values as well. I haven't found an example of the syntax on the website; has anyone tried this before? Thanks in advance.
The Mozilla Developer Network has an extensive JavaScript reference, including on the Date object. See the following constructor syntax:
new Date()
new Date(milliseconds)
new Date(dateString)
new Date(year, month, day [, hour, minute, second, millisecond ])
So for 2011-06-01 at 1:37 PM, you could do:
new Date(2011, 5, 1, 13, 37)
Please note that months are zero-based, so the 5 above represents the 6th month (i.e. June).
N.B. I have very little familiarity with Google Charts, so I would welcome edits to make this answer more relevant to the O.P.'s question.