Why is sklearn giving me a ValueError in train_test_split?

ValueError: Expected 2D array, got 1D array instead:
array=[712. 3.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
```
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
import seaborn as sb
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df=sb.load_dataset('titanic')
df2=df[['survived','pclass','age','parch']]
df3=df2.fillna(df2.mean())
x=df3.drop('survived',axis=1)
y=df3['survived']
x_train,y_train,x_test,y_test=train_test_split(x,y,test_size=0.2, random_state=51)
print('x_train',x_train.shape)
sc=StandardScaler()
sc.fit(x_train.shape)
x_train=x_train.reshape(-1,1)
x_train_sc=sc.transform(x_train)
x_test_sc=sc.transform(x_test)
print(x_train_sc)
```
I would really appreciate it if you could help me find a solution. I applied train_test_split to the x and y variables and also transformed x_train. I was trying to print x_train, but it showed me this error:
```
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[712. 3.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
```

You're supposed to give your StandardScaler your x_train itself, not the shape of your x_train :)
sc=StandardScaler()
sc.fit(x_train)
x_train_sc=sc.transform(x_train)
x_test_sc=sc.transform(x_test)
If you want to normalize your data to a -1/1 range, it's better to use MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
...
sc = MinMaxScaler(feature_range=(-1, 1)).fit(x_train)
x_train_sc=sc.transform(x_train)
x_test_sc=sc.transform(x_test)
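For completeness, here is a minimal sketch of the corrected split and scaling step (an illustration, not the original poster's code). Note that train_test_split returns the pieces in the order x_train, x_test, y_train, y_test, and that no reshape is needed because x_train is already a 2D DataFrame:
```
# Sketch of the corrected split + scaling, reusing x and y from the question.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=51)

sc = StandardScaler()
sc.fit(x_train)                     # fit on the data itself, not on x_train.shape
x_train_sc = sc.transform(x_train)
x_test_sc = sc.transform(x_test)
print(x_train_sc[:5])               # first few scaled training rows
```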

Related

Passing Argument to a Generator to build a tf.data.Dataset

I am trying to build a TensorFlow dataset from a generator. I have a list of tuples called some_list, where each tuple has an integer and some text.
When I do not pass some_list as an argument to the generator, the code works fine:
import tensorflow as tf
import random
import numpy as np
some_list=[(1,'One'),[2,'Two'],[3,'Three'],[4,'Four'],
(5,'Five'),[6,'Six'],[7,'Seven'],[8,'Eight']]
def text_gen1():
    random.shuffle(some_list)
    size=len(some_list)
    i=0
    while True:
        yield some_list[i][0],some_list[i][1]
        i+=1
        if i>size:
            i=0
            random.shuffle(some_list)

#Not passing any argument
tf_dataset1 = tf.data.Dataset.from_generator(text_gen1,output_types=(tf.int32,tf.string),
                                             output_shapes = ((),()))
for count_batch in tf_dataset1.repeat().batch(3).take(2):
    print(count_batch)
(<tf.Tensor: shape=(3,), dtype=int32, numpy=array([7, 1, 2])>, <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'Seven', b'One', b'Two'], dtype=object)>)
(<tf.Tensor: shape=(3,), dtype=int32, numpy=array([3, 5, 4])>, <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'Three', b'Five', b'Four'], dtype=object)>)
However, when I try to pass some_list as an argument, the code fails
def text_gen2(file_list):
    random.shuffle(file_list)
    size=len(file_list)
    i=0
    while True:
        yield file_list[i][0],file_list[i][1]
        i+=1
        if i>size:
            i=0
            random.shuffle(file_list)

tf_dataset2 = tf.data.Dataset.from_generator(text_gen2,args=[some_list],output_types=
                                             (tf.int32,tf.string),output_shapes = ((),()))
for count_batch in tf_dataset2.repeat().batch(3).take(2):
    print(count_batch)
ValueError: Can't convert Python sequence with mixed types to Tensor.
I noticed that when I try to pass a list of integers as an argument, the code works. However, a list of tuples seems to make it crash. Can someone shed some light on it?
The problem is exactly what the error says: you cannot have heterogeneous data types (int and str) in the same tf.Tensor. I made a few changes and came up with the code below.
Separate your some_list into two lists using zip(), i.e. int_list and str_list, and make your generator function accept two lists.
I don't understand why you're manually shuffling things within the generator. You can do it in a cleaner way using tf.data.Dataset.shuffle().
import tensorflow as tf
import random
import numpy as np

some_list=[(1,'One'),[2,'Two'],[3,'Three'],[4,'Four'],
           (5,'Five'),[6,'Six'],[7,'Seven'],[8,'Eight']]

def text_gen2(int_list, str_list):
    for x, y in zip(int_list, str_list):
        yield x, y

tf_dataset2 = tf.data.Dataset.from_generator(
    text_gen2,
    args=list(zip(*some_list)),
    output_types=(tf.int32,tf.string), output_shapes=((),())
)

i = 0
for count_batch in tf_dataset2.repeat().batch(4).shuffle(buffer_size=6):
    print(count_batch)
    i += 1
    if i > 10:
        break
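One small follow-up on the shuffle suggestion (a sketch, assuming the tf_dataset2 built above): shuffle() placed after batch() shuffles whole batches, so to shuffle individual elements you would normally call it before batching:
```
# Sketch: shuffle elements rather than batches by calling shuffle() before batch().
# buffer_size=8 matches the length of some_list in this example.
for count_batch in tf_dataset2.shuffle(buffer_size=8).batch(4).take(2):
    print(count_batch)
```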

How can I reuse the dataframe and use an alternative to iloc to run an iterative imputer in Azure Databricks?

I am running an iterative imputer in a Jupyter Notebook to first mark the known incorrect values as NaN and then run the iterative imputer to impute the correct values, to achieve the required sharpness in the data. The sample code is given below:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
import pandas as pd

idx = [761, 762, 763, 764]
cols = ['11','12','13','14']

def fit_imputer():
    for i in range(len(idx)):
        for col in cols:
            dfClean.iloc[idx[i], col] = np.nan
            print('Index = {} Col = {} Defiled Value is: {}'.format(idx[i], col, dfClean.iloc[idx[i], col]))
            # Run Imputer for Individual row
            tempOut = imp.fit_transform(dfClean)
            print("Imputed Value = ", tempOut[idx[i], col])
            dfClean.iloc[idx[i], col] = tempOut[idx[i], col]
            print("new dfClean Value = ", dfClean.iloc[idx[i], col])
            origVal.append(dfClean_Orig.iloc[idx[i], col])
I get an error when I try to run this code on Azure Databricks with PySpark or Scala, because Spark DataFrames are immutable and I cannot use iloc the way I did with the pandas DataFrame.
Is there a way, or a better way, of implementing such imputation in Databricks?
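One common workaround, shown here only as a minimal sketch (it assumes the data fits in driver memory, that spark_df is the Spark DataFrame holding the data, and that spark is the active SparkSession), is to convert the Spark DataFrame to pandas, run the same scikit-learn imputation there, and convert the result back:
```
# Sketch: pull the Spark DataFrame into pandas, impute with scikit-learn, convert back.
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

pdf = spark_df.toPandas()                      # Spark -> pandas, positional indexing works again
pdf.iloc[761, 0] = np.nan                      # mark a known-bad cell, as in the question
pdf.iloc[:, :] = IterativeImputer().fit_transform(pdf)

imputed_spark_df = spark.createDataFrame(pdf)  # pandas -> Spark
```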

PySpark: Error "Cannot pickle standard input" on function map

I'm trying to learn to use PySpark.
I'm using spark-2.2.0 with Python 3.
I'm facing a problem now and I can't find where it comes from.
My project is to adapt an algorithm written by a data scientist so that it runs distributed. The code below is what I have to use to extract the features from images, and I have to adapt it to extract the features with PySpark.
import json
import sys
# Dependencies can be installed by running:
# pip install keras tensorflow h5py pillow
# Run script as:
# ./extract-features.py images/*.jpg
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np
def main():
    # Load model VGG16 as described in https://arxiv.org/abs/1409.1556
    # This is going to take some time...
    base_model = VGG16(weights='imagenet')
    # Model will produce the output of the 'fc2' layer, which is the penultimate neural network layer
    # (see the paper above for more details)
    model = Model(input=base_model.input, output=base_model.get_layer('fc2').output)
    # For each image, extract the representation
    for image_path in sys.argv[1:]:
        features = extract_features(model, image_path)
        with open(image_path + ".json", "w") as out:
            json.dump(features, out)

def extract_features(model, image_path):
    img = image.load_img(image_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    features = model.predict(x)
    return features.tolist()[0]

if __name__ == "__main__":
    main()
I have written the beginning of the code:
rdd = sc.binaryFiles(PathImages)
base_model = VGG16(weights='imagenet')
model = Model(input=base_model.input, output=base_model.get_layer('fc2').output)
rdd2 = rdd.map(lambda x : (x[0], extract_features(model, x[0][5:])))
rdd2.collect()[0]
When I try to extract the features, there is an error.
~/Code/spark-2.2.0-bin-hadoop2.7/python/pyspark/cloudpickle.py in
save_file(self, obj)
623 return self.save_reduce(getattr, (sys,'stderr'), obj=obj)
624 if obj is sys.stdin:
--> 625 raise pickle.PicklingError("Cannot pickle standard input")
626 if hasattr(obj, 'isatty') and obj.isatty():
627 raise pickle.PicklingError("Cannot pickle files that map to tty objects")
PicklingError: Cannot pickle standard input
I tried multiple things, and here is my first result. I know that the error comes from the line below in the method extract_features:
features = model.predict(x)
and when I run this line outside of a map function or PySpark, it works fine.
I think the problem comes from the "model" object and its serialization with PySpark.
Maybe I'm not using a good way to distribute this with PySpark, and if you have any clue to help me, I'll take it.
Thanks in advance.
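A common way around this kind of pickling error, sketched below rather than given as a definitive fix, is to avoid capturing the Keras model in the driver's closure and instead build it inside the function that runs on the workers, for example with mapPartitions so the model is created once per partition (this reuses extract_features, sc, and PathImages from the question):
```
# Sketch: build the model on the workers instead of pickling it from the driver.
def extract_partition(partition):
    from keras.applications.vgg16 import VGG16
    from keras.models import Model
    base_model = VGG16(weights='imagenet')    # loaded once per partition, on the worker
    model = Model(input=base_model.input, output=base_model.get_layer('fc2').output)
    for path, _content in partition:
        yield path, extract_features(model, path[5:])   # same path handling as in the question

rdd = sc.binaryFiles(PathImages)
features_rdd = rdd.mapPartitions(extract_partition)
```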

Iterate over image pixels and modify them

I am trying to move a 32x32 block of pixels through a 256x256 image in a raster pattern, and I've defined a function for that. I am using PIL.
Could you please help me?
from PIL import Image
import sys
import os
from Crypto.Cipher import AES
from collections import deque
import numpy as np
import numpy as numpy
from scipy.sparse import csr_matrix
from pylab import *
Using these libraries I have made the following function.
def raster_SCAN(img_re,pixel_1,num_scan):
    img_w,img_h=img_re.size
    for j in range(0,num_scan):
        img_2=img_re
        pixel_2=img_re.load()
        for i in range(8): # for every block of pixels:
            for j in range(8):
                print(i,j)
                if i==1 & j==1:
                    pixel_1[i:(i+1)*32,j:j*32] = pixel_2[224:256,224:256] # set the colour accordingly
                if i<8 & j<8 & i!=1 & j!=1:
                    pixel_1[(i-1)*32:i*32,j:j*32] = pixel_2[(i-2)*32:(i-1)*32,(j-2)*32:(j-1)*32]
                if i==1 & j!=1:
                    pixel_1[i:i*32,(j-1)*32:j*32] = pixel_2[224:256,(j-2)*32:j*32]
                if i==8 & j==8:
                    pixel_1[224:256,224:256] = pixel_2[1:32,1:32]
    return pixel_1
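For reference, here is a minimal sketch of visiting every 32x32 block of a 256x256 image in raster order with NumPy. It illustrates the block indexing only and is not a fix of the function above; img is assumed to be a 256x256 PIL image, and the inversion is just a placeholder modification:
```
# Sketch: visit each 32x32 block of a 256x256 image in raster order (row by row).
import numpy as np
from PIL import Image

arr = np.array(img)                 # PIL image -> array, shape (256, 256) or (256, 256, 3)
for bi in range(8):                 # 8 block rows
    for bj in range(8):             # 8 block columns
        r0, c0 = bi * 32, bj * 32
        block = arr[r0:r0 + 32, c0:c0 + 32]
        arr[r0:r0 + 32, c0:c0 + 32] = 255 - block   # placeholder modification
result = Image.fromarray(arr)
```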

Training an LSTM neural network to forecast time series in pybrain, python

I have a neural network created using PyBrain and designed to forecast time series.
I am using the sequential dataset function, and trying to use a sliding window of 5 previous values to predict the 6th. One of my problems is that I can't figure out how to create the required dataset by appending the 5 previous values to the inputs and the 6th as an output.
I am also unsure of how exactly to forecast values in the series once the network is trained.
Posting my code below:
from pybrain.datasets import SupervisedDataSet
from pybrain.datasets import SequentialDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.supervised.trainers import RPropMinusTrainer
from pylab import ion, ioff, figure, draw, contourf, clf, show, hold, plot
from pybrain.structure import RecurrentNetwork
from pybrain.structure import FeedForwardNetwork
from pybrain.structure import LinearLayer, SigmoidLayer, TanhLayer
from pybrain.structure import FullConnection
from pybrain.structure import LSTMLayer
from pybrain.structure import BiasUnit
from pybrain.rl.learners.valuebased import Q
import pybrain
import matplotlib as plt
import translate
import time
import pickle
import scipy as sp
import numpy as np
import pylab as pl
import itertools
#Opening data from database
data = translate.translate(3600)
time, price, volume = zip(*data)
#Creating data lists instead of tuples
timeList = []
priceList = []
volumeList = []
for record in time:
    timeList.append(record)
for record in price:
    priceList.append(record)
for record in volume:
    volumeList.append(record)
#Creating lookback window and target
datain = priceList[:5]
dataout = priceList[6]
print datain
print dataout
#Creating the dataset
ds = SequentialDataSet(5, 1)
for x, y in itertools.izip(datain, dataout):
    ds.newSequence()
    ds.appendLinked(tuple(x), tuple(y))
    print (x, y)
print ds
#Building the network
n = RecurrentNetwork()
#Create the network modules
n.addInputModule(SigmoidLayer(5, name = 'in'))
n.addModule(LSTMLayer(100, name = 'LSTM'))
n.addModule(LSTMLayer(100, name = 'LSTM2'))
n.addOutputModule(SigmoidLayer(1, name = 'out'))
#Add the network connections
n.addConnection(FullConnection(n['in'], n['LSTM'], name = 'c_in_to_LSTM'))
n.addConnection(FullConnection(n['in'], n['LSTM2'], name = 'c_in_to_LSTM2'))
n.addConnection(FullConnection(n['LSTM'], n['out'], name = 'c_LSTM_to_out'))
n.addConnection(FullConnection(n['LSTM2'], n['out'], name = 'c_LSTM2_to_out'))
n.sortModules()
n.randomize()
#Creating the trainer
trainer = BackpropTrainer(n, ds)
#Training the network
#for i in range (1000):
# print trainer.train()
#Make predictions
#Plotting the results
pl.plot(time, price)
pl.show()
The above code gives:
TypeError: izip argument #2 must support iteration
I have seen the question linked below, but I haven't been successful:
Event Sequences, Recurrent Neural Networks, PyBrain
First question on this great site, any help is appreciated
#Creating lookback window and target
datain = priceList[:5]
dataout = priceList[6]
Not an expert, but it seems your datain is a list (of length 5) while dataout is not a list at all.
I'd guess the TypeError says it all. Whereas priceList[:5] is a list and hence iterable, priceList[6] is a single element.
You'd probably want something like
datain = priceList[:5]
dataout = priceList[6:7]
which will make dataout a list with a single element.
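Going one step further, here is a sketch of the sliding-window construction the question is after (this is an assumption about the intended windowing, reusing the SequentialDataSet API already shown in the question): build one (window, target) pair per position, where each window holds 5 consecutive prices and the target is the next price.
```
# Sketch: build (5-value window, next value) pairs from priceList and add them to the dataset.
ds = SequentialDataSet(5, 1)
for start in range(len(priceList) - 5):
    window = priceList[start:start + 5]   # the 5 previous values
    target = priceList[start + 5]         # the 6th value to predict
    ds.newSequence()
    ds.appendLinked(tuple(window), (target,))
```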