Py4JJavaError: wrong column type when calling PCA from pyspark.ml.feature - pyspark

I am trying to visualize word2vec word embeddings using pyspark's PCA, but I'm getting an unhelpful error message saying that the features column is of the wrong type, even though it isn't (full message below).
Background
spark-2.4.0-bin-hadoop2.7
Scala 2.12.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191).
Python 3.6.5 (Anaconda, Inc.)
Ubuntu 16.04
My Code
maxWordsVis = 15
Feat = np.load('Gab_ai_posts_W2Vmatrix.npy')
words = np.load('Gab_ai_posts_WordList.npy')
# to rdd, avoid this with big matrices by reading them directly from hdfs
Feat = sc.parallelize(Feat)
Feat = Feat.map(lambda vec: (Vectors.dense(vec),))
# to dataframe
dfFeat = sqlContext.createDataFrame(Feat,["features"])
dfFeat.head()
Row(features=DenseVector([-0.1282, 0.0699, -0.0891, -0.0437, -0.0915, -0.0557, 0.1432, -0.1564, 0.0058, -0.0603, 0.1383, -0.0359, -0.0306, -0.0415, -0.0191, 0.058, 0.0119, -0.0302, 0.0362, -0.0466, 0.0403, -0.1035, 0.0456, 0.0892, 0.0548, -0.0735, 0.1094, -0.0299, -0.0549, -0.1235, 0.0062, 0.1381, -0.0082, 0.085, -0.0083, -0.0346, -0.0226, -0.0084, -0.0463, -0.0448, 0.0285, -0.0013, 0.0343, -0.0056, 0.0756, -0.0068, 0.0562, 0.0638, 0.023, -0.0224, -0.0228, 0.0281, -0.0698, -0.0044, 0.0395, -0.021, 0.0228, 0.0666, 0.0362, 0.0116, -0.0088, 0.0949, 0.0265, -0.0293, -0.007, -0.0746, 0.0891, 0.0145, 0.0532, -0.0084, -0.0853, 0.0037, -0.055, -0.0706, -0.0296, 0.0321, 0.0495, -0.0776, -0.1339, -0.065, 0.0856, 0.0328, 0.0821, 0.036, -0.0179, -0.0006, -0.036, 0.0438, -0.0077, -0.0012, 0.0322, 0.0354, 0.0513, 0.0436, 0.0002, -0.0578, 0.1062, 0.019, 0.0346, -0.1261]))
numComponents = 3
pca = PCA(k = numComponents, inputCol = "features", outputCol = "pcaFeatures")
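The error below is raised while fitting the model (the traceback mentions o4583.fit), so presumably the code continues with something like:
model = pca.fit(dfFeat)
dfPCA = model.transform(dfFeat)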
Error Message
Py4JJavaError: An error occurred while calling o4583.fit. : java.lang.IllegalArgumentException: requirement failed:
Column features must be of type
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
at scala.Predef$.require(Predef.scala:224)
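For what it's worth, pyspark.ml estimators such as PCA expect vectors built with pyspark.ml.linalg; if Vectors was imported from the older pyspark.mllib.linalg, the printed schema looks identical but fit() fails with exactly this message. Since the imports are not shown in the post, this is only a guess, but it is worth double-checking:
from pyspark.ml.linalg import Vectors   # ml, not mllib, when feeding pyspark.ml estimators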

Related

Spark vs scikit-learn

I use pyspark for traffic classification with a decision tree model and measure the time required to train it: 2 min 17 s. Then I perform the same task with scikit-learn; there the training time is 1 min 19 s. Why is Spark slower, given that it is supposed to perform the task in a distributed way?
This is the code for pyspark:
df = (spark.read.format("csv")
      .option('header', 'true')
      .option("inferSchema", "true")
      .load("D:/PHD Project/Paper_3/Datasets_Download/IP Network Traffic Flows Labeled with 75 Apps/Dataset-Unicauca-Version2-87Atts.csv"))
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label', maxDepth = 10)
pModel = dt.fit(trainDF)
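The snippet stops at dt.fit(trainDF) without showing how trainDF and its assembled features column were created; a minimal, hypothetical sketch of that preparation step (the column selection and split ratio are assumptions, not from the original post) could be:
from pyspark.ml.feature import VectorAssembler

# hypothetical prep: assemble numeric columns into a single 'features' vector, then split
numeric_cols = [c for (c, t) in df.dtypes if t in ('int', 'double') and c != 'label']
assembler = VectorAssembler(inputCols=numeric_cols, outputCol='features')
assembled = assembler.transform(df)
trainDF, testDF = assembled.randomSplit([0.8, 0.2], seed=42)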
In scikit-learn:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
path = 'D:/PHD Project/Paper_3/Datasets_Download/IP Network Traffic Flows Labeled with 75 Apps/Dataset-Unicauca-Version2-87Atts.csv'
df = pd.read_csv(path)
#df.info()
%%time
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
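As with the Spark snippet, X_train and y_train are not defined in the post; presumably they come from something like the following (treating 'label' as the target column is an assumption):
from sklearn.model_selection import train_test_split

# hypothetical split mirroring the Spark version
X = df.drop(columns=['label'])
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)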

Using numba in a method to randomize matrices

I'm still not very familiar with numba, and my problem is that I have the piece of code below that I use to randomize the edges of graphs.
This code simply swaps some edges in a connectivity matrix, given the number of desired swaps and a seed for the random number generator.
My problem is that when I try to use numba to speed it up, I do not manage to run it. The error it returns is also pasted below.
import numpy as np
import numba as nb

@nb.jit(nopython=True)
def _randomize_adjacency_wei(A, n_swaps, seed):
    np.random.seed(seed)
    # Number of nodes
    n_nodes = A.shape[0]
    # Copy the adj. matrix
    Arnd = A.copy()
    # Choose edges that will be swapped
    edges = np.random.choice(n_nodes, size=(4, n_swaps), replace=True).T
    #itr = range(n_swaps)
    #for it in tqdm(itr) if verbose else itr:
    it = 0
    for it in range(n_swaps):
        i, j, k, l = edges[it, :]
        if len(np.unique([i, j, k, l])) < 4:
            continue
        else:
            # Old values of weights
            w_ij, w_il, w_kj, w_kl = Arnd[i, j], Arnd[i, l], Arnd[k, j], Arnd[k, l]
            # Swapping edges
            Arnd[i, j] = Arnd[j, i] = w_il
            Arnd[k, l] = Arnd[l, k] = w_kj
            Arnd[i, l] = Arnd[l, i] = w_ij
            Arnd[k, j] = Arnd[j, k] = w_kl
    return Arnd
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function unique at 0x7f1a1c03b0d0>) found for signature:
>>> unique(list(int64)<iv=None>)
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'np_unique': File: numba/np/arrayobj.py: Line 1915.
With argument(s): '(list(int64)<iv=None>)':
Rejected as the implementation raised a specific error:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Unknown attribute 'ravel' of type list(int64)<iv=None>
File "../../../home/vinicius/anaconda3/lib/python3.8/site-packages/numba/np/arrayobj.py", line 1918:
def np_unique_impl(a):
b = np.sort(a.ravel())
^
During: typing of get attribute at /home/vinicius/anaconda3/lib/python3.8/site-packages/numba/np/arrayobj.py (1918)
File "../../../home/vinicius/anaconda3/lib/python3.8/site-packages/numba/np/arrayobj.py", line 1918:
def np_unique_impl(a):
b = np.sort(a.ravel())
^
raised from /home/vinicius/anaconda3/lib/python3.8/site-packages/numba/core/typeinfer.py:1071
During: resolving callee type: Function(<function unique at 0x7f1a1c03b0d0>)
During: typing of call at <ipython-input-165-90ffd30fe0e8> (19)
File "<ipython-input-165-90ffd30fe0e8>", line 19:
def _randomize_adjacency_wei(A, n_swaps, seed):
<source elided>
i,j,k,l = edges[it,:]
if len(np.unique([i,j,k,l]))<4:
^
Thanks in advance,
Vinicius
According to the comments, you are passing a list to np.unique(), which is not supported by Numba in nopython mode.
Modifying the code this way:
i, j, k, l = e = edges[it, :]
if len(np.unique(e)) < 4:
    ...
The following example doesn't produce any errors:
>>> A = np.random.randint(0, 5, (8,8))
>>> r = _randomize_adjacency_wei(A, 4, 33)
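Putting it together, a sketch of the full function with that one change applied (otherwise the same logic as in the question) would be:
import numpy as np
import numba as nb

@nb.jit(nopython=True)
def _randomize_adjacency_wei(A, n_swaps, seed):
    np.random.seed(seed)
    n_nodes = A.shape[0]
    Arnd = A.copy()
    edges = np.random.choice(n_nodes, size=(4, n_swaps), replace=True).T
    for it in range(n_swaps):
        # pass the array slice itself to np.unique instead of a Python list
        i, j, k, l = e = edges[it, :]
        if len(np.unique(e)) < 4:
            continue
        w_ij, w_il, w_kj, w_kl = Arnd[i, j], Arnd[i, l], Arnd[k, j], Arnd[k, l]
        Arnd[i, j] = Arnd[j, i] = w_il
        Arnd[k, l] = Arnd[l, k] = w_kj
        Arnd[i, l] = Arnd[l, i] = w_ij
        Arnd[k, j] = Arnd[j, k] = w_kl
    return Arnd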

Talos --> TypeError: __init__() got an unexpected keyword argument 'grid_downsample'

I am trying to run a hyperparameter optimization with Talos. As I have a lot of parameters to test, I want to use the 'grid_downsample' argument, which should select 30% of all possible hyperparameter combinations. However, when I run my code I get: TypeError: __init__() got an unexpected keyword argument 'grid_downsample'
I tested the code below without the 'grid_downsample' option and with fewer hyperparameters.
import numpy as np
import pandas as pd
import talos as ta
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

# load data
data = pd.read_csv('data.txt', sep="\t", encoding="latin1")
# split into input (X) and output (y) variables
Y = np.array(data['Y'])
data_bis = data.drop(['Y'], axis=1)
X = np.array(data_bis)
p = {'activation': ['relu'],
     'optimizer': ['Nadam'],
     'first_hidden_layer': [12],
     'second_hidden_layer': [12],
     'batch_size': [20],
     'epochs': [10, 20],
     'dropout_rate': [0.0, 0.2]}
def dnn_model(x_train, y_train, x_val, y_val, params):
    model = Sequential()
    # input layer
    model.add(Dense(params['first_hidden_layer'], input_shape=(1024,)))
    model.add(Dropout(params['dropout_rate']))
    model.add(Activation(params['activation']))
    # hidden layer 2
    model.add(Dense(params['second_hidden_layer']))
    model.add(Dropout(params['dropout_rate']))
    model.add(Activation(params['activation']))
    # output layer with one node
    model.add(Dense(1))
    model.add(Activation(params['activation']))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer=params['optimizer'], metrics=['accuracy'])
    out = model.fit(x_train, y_train,
                    batch_size=params['batch_size'],
                    epochs=params['epochs'],
                    validation_data=[x_val, y_val],
                    verbose=0)
    return out, model
scan_object = ta.Scan(X, Y, model=dnn_model, params=p, experiment_name="test")
reporting = ta.Reporting(scan_object)
report = reporting.data
report.to_csv('./Random_search/dnn/report_talos.txt', sep = '\t')
This code works well. If I change the scan_object at the end to scan_object = ta.Scan(X, Y, model=dnn_model, grid_downsample=0.3, params=p, experiment_name="test"), it gives me the error TypeError: __init__() got an unexpected keyword argument 'grid_downsample', while I was expecting the same results format as a normal grid search but with fewer combinations. What am I missing? Did the name of the argument change? I'm using Talos 0.6.3 in a conda environment.
Thank you!
It might be too late for you now, but they've switched it to fraction_limit. For your code it would give:
scan_object = ta.Scan(X, Y, model=dnn_model, params=p, experiment_name="test", fraction_limit = 0.1)
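To get the 30% downsampling you originally wanted, the same call with a larger fraction should work (assuming fraction_limit behaves as the replacement for grid_downsample):
scan_object = ta.Scan(X, Y, model=dnn_model, params=p, experiment_name="test", fraction_limit=0.3)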
Sadly, the docs aren't well updated.
Check out their examples on GitHub:
https://github.com/autonomio/talos/blob/master/examples/Hyperparameter%20Optimization%20with%20Keras%20for%20the%20Iris%20Prediction.ipynb

Error: 'Param' object is not callable - Apache Spark for a recommendation system

I am using Apache Spark for a recommendation system. In the evaluation part, when computing precision and recall, I got an error.
The code is stated below:
from pyspark.mllib.evaluation import MulticlassMetrics

def print_metrics(predictions_and_labels):
    metrics = MulticlassMetrics(predictions_and_labels)
    print('Precision of True ', metrics.precision(1))
    print('Precision of False', metrics.precision(0))
    print('Recall of True ', metrics.recall(1))
    print('Recall of False ', metrics.recall(0))
    print('F-1 Score ', metrics.fMeasure())
    print('Confusion Matrix\n', metrics.confusionMatrix().toArray())

predictions = model.transform(testRatings)
accuracy = model.predictionCol().evaluate(predictions)
print('F1 Accuracy: %f' % accuracy)
predictions_and_labels = predictions.select("prediction", "foreclosure_status").rdd \
    .map(lambda r: (float(r[0]), float(r[1])))
print_metrics(predictions_and_labels)
And the error I got mentions that
"'Param' object is not callable".
Could anyone please suggest a solution? Thanks in advance.
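For context (this is not from the original post): model.predictionCol is a Param object describing a column name, so calling it like a method raises exactly this error. F1-style evaluation of a predictions DataFrame is normally done with an evaluator from pyspark.ml.evaluation; a sketch reusing the post's column names:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# evaluate F1 on the predictions DataFrame instead of calling the Param object
evaluator = MulticlassClassificationEvaluator(labelCol="foreclosure_status",
                                              predictionCol="prediction",
                                              metricName="f1")
accuracy = evaluator.evaluate(predictions)
print('F1 Accuracy: %f' % accuracy)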

Real time Openscoring Bad request for Linear Regression model

Using linear regression, I am trying to predict the power generated based on the values of temperature, vacuum, pressure and humidity, inspired by and adapted from "http://datascience-enthusiast.com", and to apply the model to real-time data from a Kafka topic. I properly generated a pickled .pkl.z file and converted it to PMML using JPMML, as suggested at https://github.com/jpmml/jpmml-sklearn.
The Kafka producer, a Python program (kafka_producer.py), randomly generates data within some ranges as floats, converts them to strings and sends them to a Kafka topic as bytes.
The Kafka consumer, a Python program (kafka_consumer.py) which acts as an Openscoring Python client, reads the data from the Kafka topic, converts the byte string to a string and finally to a dictionary, which forms the arguments like arguments = {"AT" : 9.2, "V" : 39.82, "AP" : 1013.19, "RH" : 91.25} for the result = os.evaluate("CCPP", arguments) statement.
It works well and predicts the power, but after showing the result correctly for 4 to 10 records, the Openscoring server throws:
INFO: Received EvaluationRequest{id=null, arguments={AT=12.12, V=41.35, AP=1031.67, RH=66.32}}
Nov 20, 2017 6:39:16 AM org.openscoring.service.ModelResource evaluate
INFO: Returned EvaluationResponse{id=null, result={PE=472.152110955029}}
Nov 20, 2017 6:39:17 AM org.openscoring.service.ModelResource evaluate
INFO: Received EvaluationRequest{id=null, arguments={AT=34.06, V=51.53, AP=1016.22, RH=91.7}}
Nov 20, 2017 6:39:17 AM org.openscoring.service.ModelResource evaluate
INFO: Returned EvaluationResponse{id=null, result={PE=444.9147880324237}}
Nov 20, 2017 6:39:18 AM org.openscoring.service.ModelResource evaluate
INFO: Received EvaluationRequest{id=null, arguments={AT=20.41, V=50.33, AP=1018.19, RH=100.18}}
Nov 20, 2017 6:39:18 AM org.openscoring.service.ModelResource doEvaluate
SEVERE: Failed to evaluate
org.jpmml.evaluator.InvalidResultException (at or around line 130)
at org.jpmml.evaluator.FieldValueUtil.performInvalidValueTreatment(FieldValueUtil.java:178)
at org.jpmml.evaluator.FieldValueUtil.prepareInputValue(FieldValueUtil.java:90)
at org.jpmml.evaluator.InputField.prepare(InputField.java:64)
The Kafka consumer then stops and shows: raise Exception(self.message)
Exception: Bad Request
kafka_producer.py
import random
import time
from kafka import KafkaProducer
from kafka.errors import KafkaError
producer = KafkaProducer(bootstrap_servers='localhost:9092')
topic = "power"
for i in range(1000):
    AT = "19.651231"
    V = "54.305804"
    AP = "1013.259078"
    RH = "73.308978"
    def getAT():
        return str(round(random.uniform(2.0, 38.0), 2))
    def getV():
        return str(round(random.uniform(26.0, 81.5), 2))
    def getAP():
        return str(round(random.uniform(993.0, 1033.0), 2))
    def getRH():
        return str(round(random.uniform(26.0, 101.0), 2))
    # arguments = {"AT" : 9.2, "V" : 39.82, "AP" : 1013.19, "RH" : 91.25}
    message = '{"AT" : ' + getAT() + ',' + '"V" : ' + getV() + ',' + '"AP" : ' + getAP() + ',' + '"RH" : ' + getRH() + '}'
    producer.send(topic, key=str.encode('key_{}'.format(i)), value=(message.encode('utf-8')))
    time.sleep(1)
producer.close()
kafka_consumer.py
import ast
from kafka import KafkaConsumer
import openscoring
import os
os = openscoring.Openscoring("http://localhost:8080/openscoring")
kwargs = {"auth" : ("admin", "adminadmin")}
os.deploy("CCPP", "/home/gopinathankm/jpmml-sklearn-master/ccpp.pmml", **kwargs)
consumer = KafkaConsumer('power', bootstrap_servers='localhost:9092')
for message in consumer:
    arguments = message.value
    argsdict = arguments.decode("utf-8")
    dict = ast.literal_eval(argsdict)
    print(dict)
    result = os.evaluate("CCPP", dict)
    print(result)
For some of the generated data it is not working, and I really don't know how the bad request is generated.
Any assistance will be highly appreciated.
Regards
Gopinathan K.M
I got assistance and a solution from Villu Ruusmann, shared below so that others may benefit:
https://github.com/jpmml/jpmml-evaluator/issues/84
"Exception type InvalidResultException means that the evaluation of a model could not be completed successfully, because one or more input field values are outside of their declared range.
The value ranges of your PMML document and Python script do not match. Either delete the PMML document value ranges (so that all input values are considered to be valid), or contract the Python script value ranges," as suggested by Villu Ruusmann.
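A minimal sketch of the second option, contracting the producer's random ranges so they stay inside whatever ranges the PMML document declares (the bounds below are placeholders, not the actual ones from ccpp.pmml), would be:
import random

# placeholder bounds -- replace with the value ranges actually declared in ccpp.pmml
def getAT():
    return str(round(random.uniform(2.0, 37.0), 2))
def getV():
    return str(round(random.uniform(26.0, 81.0), 2))
def getAP():
    return str(round(random.uniform(993.0, 1033.0), 2))
def getRH():
    return str(round(random.uniform(26.0, 100.0), 2))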