Error when using Seaborn in Jupyter notebook (PySpark) - pyspark

I am trying to visualize data using Seaborn. I have created a dataframe using SQLContext in PySpark. However, when I call lmplot it results in an error. I am not sure what I am missing. Given below is my code (I am using a Jupyter notebook):
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.load('file:///home/cloudera/Downloads/WA_Sales_Products_2012-14.csv',
format='com.databricks.spark.csv',
header='true',inferSchema='true')
sns.lmplot(x='Quantity', y='Year', data=df)
Error trace:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-86-2a2b43993475> in <module>()
----> 2 sns.lmplot(x='Quantity', y='Year', data=df)
/home/cloudera/anaconda3/lib/python3.5/site-packages/seaborn/regression.py in lmplot(x, y, data, hue, col, row, palette, col_wrap, size, aspect, markers, sharex, sharey, hue_order, col_order, row_order, legend, legend_out, x_estimator, x_bins, x_ci, scatter, fit_reg, ci, n_boot, units, order, logistic, lowess, robust, logx, x_partial, y_partial, truncate, x_jitter, y_jitter, scatter_kws, line_kws)
557 hue_order=hue_order, size=size, aspect=aspect,
558 col_wrap=col_wrap, sharex=sharex, sharey=sharey,
--> 559 legend_out=legend_out)
560
561 # Add the markers here as FacetGrid has figured out how many levels of the
/home/cloudera/anaconda3/lib/python3.5/site-packages/seaborn/axisgrid.py in __init__(self, data, row, col, hue, col_wrap, sharex, sharey, size, aspect, palette, row_order, col_order, hue_order, hue_kws, dropna, legend_out, despine, margin_titles, xlim, ylim, subplot_kws, gridspec_kws)
255 # Make a boolean mask that is True anywhere there is an NA
256 # value in one of the faceting variables, but only if dropna is True
--> 257 none_na = np.zeros(len(data), np.bool)
258 if dropna:
259 row_na = none_na if row is None else data[row].isnull()
TypeError: object of type 'DataFrame' has no len()
Any help or pointer is appreciated. Thank you in advance:-)

sqlContext.read.load(...) returns a Spark DataFrame, not a pandas DataFrame. As the traceback shows, seaborn calls len(data), which a Spark DataFrame does not support, so seaborn does not convert it for you.
Try:
sns.lmplot(x='Quantity', y='Year', data=df.toPandas())
df.toPandas() returns a pandas DataFrame built from the Spark DataFrame.
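A minimal sketch of the conversion step, runnable without a Spark cluster. The helper as_plottable and the stand-in class FakeSparkFrame are hypothetical names made up for this illustration (neither belongs to Spark or seaborn). The sketch assumes the only relevant facts are that a Spark DataFrame has no len() but does have a toPandas() method:

```python
def as_plottable(data):
    """Return data unchanged unless it exposes a toPandas() method."""
    to_pandas = getattr(data, "toPandas", None)
    return to_pandas() if callable(to_pandas) else data


class FakeSparkFrame:
    """Stand-in for a Spark DataFrame: no __len__, but a toPandas() method."""

    def __init__(self, rows):
        self._rows = rows

    def toPandas(self):
        # A real Spark DataFrame returns a pandas DataFrame here;
        # a plain list is enough to show that len() starts working.
        return list(self._rows)


frame = FakeSparkFrame([(10, 2012), (25, 2013), (40, 2014)])

try:
    len(frame)  # the same kind of call that failed inside seaborn
except TypeError as exc:
    print("before conversion:", exc)

converted = as_plottable(frame)
print("after conversion:", len(converted), "rows")
```

With a real Spark DataFrame, toPandas() collects every row onto the driver, so for a large table it may be worth selecting only the plotted columns first, for example df.select('Quantity', 'Year').toPandas().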


how to trigger callback event on a plotly scattermapbox plot

I am trying to run the code below. The map is generated correctly (showing the locations of interest), but clicking on a location does not trigger the callback function. I have tried setting 'clickmode': 'event' in the layout; the selected point is then highlighted, but the callback is still not triggered. It must be something small, but I can't figure out what I am missing. Any help or suggestion is highly appreciated.
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import json
import ipywidgets as widgets
from ipywidgets import Output, VBox
data2 = go.Scattermapbox(lat=list(heat_df.ycoord),
lon=list(heat_df.xcoord),
mode='markers+text',
marker=dict(size=20, color='green'),
)
layout = dict(margin=dict(l=0, t=0, r=0, b=0, pad=0),
mapbox=dict(#accesstoken=mapbox_access_token,
center=dict(lat=51.6, lon=-0.2),
style='light',
zoom=8))
f2 = go.FigureWidget(data=data2, layout=layout);
f2.update_layout(mapbox_style="stamen-terrain"
, mapbox_center = {"lat": map_meanY, "lon": map_meanX}
, mapbox_zoom= 11
#, clickmode='event'
);
scatter = f2.data[0]
scatter.marker.size = [20] * 100
f2.layout.hovermode = 'closest'
out = Output()
#out.capture(clear_output=True)
def click_callback(trace, points, selector):
inds = points.point_inds
print(inds)
f2.data[0].on_click(click_callback,append=False)
#help(f2.data[0].on_click)
f2
Changing mode='markers+text' to mode='markers' raises a ValueError:
File ~\anaconda3\envs\geemap_test\lib\site-packages\plotly\basedatatypes.py:2637, in BaseFigure._perform_plotly_relayout(self, relayout_data)
2633 for key_path_str, v in relayout_data.items():
2635 if not BaseFigure._is_key_path_compatible(key_path_str, self.layout):
-> 2637 raise ValueError(
2638 """
2639 Invalid property path '{key_path_str}' for layout
2640 """.format(
2641 key_path_str=key_path_str
2642 )
2643 )
2645 # Apply set operation on the layout dict
2646 val_changed = BaseFigure._set_in(self._layout, key_path_str, v)
ValueError:
Invalid property path 'mapbox._derived' for layout

How to limit FPGrowth itemsets to just 2 or 3

I am running the FPGrowth algorithm with PySpark on Python 3.6 in a Jupyter notebook. When I try to save the association rules, the generated output is huge, so I want to limit the number of consequents. Here is the code I have tried. I also changed the Spark context parameters.
Maximum Pattern Length fpGrowth (Apache) PySpark
from pyspark.sql.functions import col, size
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
conf = SparkConf().setAppName("App")
conf = (conf.setMaster('local[*]')
.set('spark.executor.memory', '100G')
.set('spark.driver.memory', '400G')
.set('spark.driver.maxResultSize', '200G'))
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
R = Row('ID', 'items')
df=spark.createDataFrame([R(i, x) for i, x in enumerate(lol)])
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.7, minConfidence=0.9)
model = fpGrowth.fit(df)
ar=model.associationRules.where(size(col('antecedent')) == 2).where(size(col('consequent')) == 1)
ar.cache()
ar.toPandas().to_csv('output.csv')
It gives an error
TypeError Traceback (most recent call last)
<ipython-input-1-f90c7a9f11ae> in <module>
---> 73 ar=model.associationRules.where(size(col('antecedent')) ==
2).where(size(col('consequent')) == 1)
TypeError: 'str' object is not callable
Can someone help me solve this issue?
Here lol is a list of lists of transactions: [['a','b'],['c','a','e']....]
Python: 3.6.5
Pyspark
Windows 10
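The discussion above and the linked question ("'str' object is not callable TypeError") helped me resolve the problem. Importing pyspark.sql.functions under an alias and calling size and col through it works, presumably because one of those bare names had been rebound to a string elsewhere in my notebook: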
import pyspark.sql.functions as func
model.associationRules.where(func.size(func.col('antecedent')) == 1).where(func.size(func.col('consequent')) == 1).show()
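The error message can be reproduced without Spark. A minimal sketch, assuming the cause was a name such as size being rebound to a string earlier in the notebook (the question does not show that code, so this is a guess). The size function below is a local stand-in, not PySpark's:

```python
def size(values):
    """Stand-in for an imported function such as pyspark.sql.functions.size."""
    return len(values)


print(size(["a", "b"]))  # works: size is still a function

size = "large"  # an unrelated assignment silently rebinds the name

try:
    size(["a", "b"])  # now this calls a string
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Reaching the functions through a module alias (func.size, func.col) sidesteps this, because rebinding a bare name like size no longer affects the call.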

PySpark TypeError: object of type 'ParamGridBuilder' has no len()

I am trying to tune my model on Databricks using Pyspark.
I receive the following error:
TypeError: object of type 'ParamGridBuilder' has no len()
My code has been listed below.
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
als = ALS(userCol = "userId",itemCol="movieId", ratingCol="rating", coldStartStrategy="drop", nonnegative = True, implicitPrefs = False)
# Imports ParamGridBuilder package
from pyspark.ml.tuning import ParamGridBuilder
# Creates a ParamGridBuilder, and adds hyperparameters
param_grid = ParamGridBuilder().addGrid(als.rank, [5,10,20,40]).addGrid(als.maxIter, [5,10,15,20]).addGrid(als.regParam,[0.01,0.001,0.0001,0.02])
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",predictionCol="prediction")
# Imports CrossValidator package
from pyspark.ml.tuning import CrossValidator
# Creates cross validator and tells Spark what to use when training and evaluates
cv = CrossValidator(estimator = als,
estimatorParamMaps = param_grid,
evaluator = evaluator,
numFolds = 5)
model = cv.fit(training)
TypeError: object of type 'ParamGridBuilder' has no len()
Full Error Log:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-1952169986445972> in <module>()
----> 1 model = cv.fit(training)
2
3 # Extract best combination of values from cross validation
4
5 best_model = model.bestModel
/databricks/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
130 return self.copy(params)._fit(dataset)
131 else:
--> 132 return self._fit(dataset)
133 else:
134 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
/databricks/spark/python/pyspark/ml/tuning.py in _fit(self, dataset)
279 est = self.getOrDefault(self.estimator)
280 epm = self.getOrDefault(self.estimatorParamMaps)
--> 281 numModels = len(epm)
It simply means that the object has no length (unlike a list). CrossValidator calls len() on estimatorParamMaps, and you passed it the ParamGridBuilder itself rather than the grid it builds. In your line
param_grid = (ParamGridBuilder()
    .addGrid(als.rank, [5,10,20,40])
    .addGrid(als.maxIter, [5,10,15,20])
    .addGrid(als.regParam, [0.01,0.001,0.0001,0.02]))
you should add .build() as the last call in the chain to actually construct the grid.
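To see what .build() produces and how much work the grid implies, here is a plain-Python sketch that mimics the expansion with itertools.product. It only illustrates the idea (one parameter map per combination of values); it is not PySpark's implementation, and the dictionary keys are just labels chosen for this example:

```python
from itertools import product

# The three value lists from the question.
rank_values = [5, 10, 20, 40]
max_iter_values = [5, 10, 15, 20]
reg_param_values = [0.01, 0.001, 0.0001, 0.02]

# One dictionary per combination, analogous to the list that .build() returns.
param_maps = [
    {"rank": rank, "maxIter": max_iter, "regParam": reg_param}
    for rank, max_iter, reg_param in product(
        rank_values, max_iter_values, reg_param_values
    )
]

num_folds = 5
total_fits = len(param_maps) * num_folds  # len() works because this is a list

print(len(param_maps), "parameter maps")
print(total_fits, "model fits during cross-validation")
```

The three lists from the question expand to 4 x 4 x 4 = 64 parameter maps, so with numFolds = 5 the cross-validator trains 320 models while searching. That may be worth keeping in mind when choosing the grid.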

iPython TypeError: 'int' object is not callable

Python 2.7.10 / Anaconda / windows 8.1
I have a strange issue: the following code works in one solution file in the same working directory, but when I copy the exact same code into my sheet, I get this error, and I have no idea how to fix it.
Here's the code:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
x = np.zeros(20)
x[:5] = 10
x[5:15] = np.arange(12,31,2)
x[15:] = 30
plt.plot(x)
plt.plot([4,4],[8,32],'k--')
plt.plot([14,14],[8,32],'k--')
plt.ylim(8,32)
Traceback (most recent call last)
<ipython-input-65-6b573104eb1d> in <module>()
6 plt.plot([4,4],[8,32],'k--')
7 plt.plot([14,14],[8,32],'k--')
----> 8 plt.ylim(8,32)
TypeError: 'int' object is not callable

Print proper mathematical formatting

When I use sympy to get the square root of 8, the output is ugly:
2*2**(1/2)
import sympy
In [2]: sympy.sqrt(8)
Out[2]: 2*2**(1/2)
Is there any way to make sympy print output in proper mathematical notation (i.e. using the proper symbol for square root)?
UPDATE:
When I follow the suggestions from @pqnet:
from sympy import *
x, y, z = symbols('x y z')
init_printing()
init_session()
I get following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-21d886bf3e54> in <module>()
2 x, y, z = symbols('x y z')
3 init_printing()
----> 4 init_session()
/usr/lib/python2.7/dist-packages/sympy/interactive/session.pyc in init_session(ipython, pretty_print, order, use_unicode, quiet, argv)
154 # and False means don't add the line to IPython's history.
155 ip.runsource = lambda src, symbol='exec': ip.run_cell(src, False)
--> 156 mainloop = ip.mainloop
157 else:
158 mainloop = ip.interact
AttributeError: 'ZMQInteractiveShell' object has no attribute 'mainloop'
In an IPython notebook you can enable SymPy's graphical math typesetting with the init_printing function:
import sympy
sympy.init_printing(use_latex='mathjax')
After that, sympy will intercept the output of each cell and format it using math fonts and symbols. Try:
sympy.sqrt(8)
See also:
Printing section in the Sympy Tutorial.
The simplest way to do it is this:
sympy.pprint(sympy.sqrt(8))
For me (using rxvt-unicode and ipython) it gives
    ___
2⋅╲╱ 2