Filling missing values using bins, .merge(), .apply(), and a lambda

bins = pd.cut(data["odometer"], bins=25)
display(bins)
data['bin'] = pd.qcut(data['odometer'], q=25)
display(data.head())

df_agg = data[["model_year", "bin"]].groupby('bin').agg(pd.Series.mode).reset_index()
df_agg.columns = ['bin', 'model_year_agg']
display(df_agg)

print(data.shape)
data = pd.merge(data, df_agg, how='left', on='bin')
print(data.shape)

data['model_year_temp'] = data.apply(
    lambda row: row['model_year_agg'] if np.isnan(row['model_year']) else row['model_year'],
    axis=1)
Above is my work. Only the last step raises an error; the KeyError message says: model_year_agg. Any tips on how to fix it?
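For what it's worth, one common cause of a KeyError like this is re-running the merge cell: if data already contains model_year_agg from a previous run, pandas renames the merged columns to model_year_agg_x / model_year_agg_y. Below is a minimal, self-contained sketch of the same bin-then-fill idea on made-up data (the column names odometer / model_year mirror the question); groupby().transform avoids the merge and the suffix pitfall entirely:

import numpy as np
import pandas as pd

# toy data standing in for the question's DataFrame
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "odometer": rng.integers(0, 300_000, size=1_000),
    "model_year": rng.integers(1990, 2020, size=1_000).astype(float),
})
data.loc[data.sample(frac=0.1, random_state=0).index, "model_year"] = np.nan

# bin the odometer, take the per-bin mode of model_year, and fill the gaps
data["bin"] = pd.qcut(data["odometer"], q=25)
per_bin_mode = data.groupby("bin")["model_year"].transform(
    lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan
)
data["model_year"] = data["model_year"].fillna(per_bin_mode)
print(data["model_year"].isna().sum())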

Related

pyspark image dimension reduction with PCA

I am using PySpark in the AWS cloud to extract the image features:
ImageSchema.imageFields
img2vec = F.udf(lambda x: DenseVector(ImageSchema.toNDArray(x).flatten()),
VectorUDT())
df_vec = df_cat.withColumn('original_vectors', img2vec("image"))
df_vec.show()
After standardizing the data:
standardizer = MinMaxScaler(inputCol="original_vectors",
                            outputCol="scaledFeatures",
                            min=-1.0,
                            max=1.0)
                            # withStd=True, withMean=True)
model_std = standardizer.fit(df_vec)
df_std = model_std.transform(df_vec)
df_std.show()
... when I apply PCA for dimension reduction, I receive an error that I have not been able to debug for a couple of weeks :(
(Error_1, Error_2: screenshots of the error messages)
Could you please help me solve this?
I use PySpark spark-3.0.3-bin-hadoop2.7.
img2vec = F.udf(lambda x : Vectors.dense(x), VectorUDT())
df = df.withColumn("data_as_vector", img2vec("data_as_resized_array"))
standardizer = StandardScaler(withMean=True, withStd=True, inputCol="data_as_vector", outputCol="scaledFeatures")
For the images, you need to resize the image data with the code below, and you must use the resized image data:
from typing import Iterator

import numpy as np
import pandas as pd
from PIL import Image

def resize_img(img_data, resize=True):
    mode = 'RGBA' if (img_data.nChannels == 4) else 'RGB'
    img = Image.frombytes(mode=mode, data=img_data.data, size=[img_data.width, img_data.height])
    img = img.convert('RGB') if (mode == 'RGBA') else img
    img = img.resize([224, 224], resample=Image.Resampling.BICUBIC) if (resize) else img
    # convert_bgr_array_to_rgb_array is a helper from the original answer (not shown here)
    arr = convert_bgr_array_to_rgb_array(np.asarray(img))
    arr = arr.reshape([224*224*3]) if (resize) else arr.reshape([img_data.width*img_data.height*3])
    return arr

def resize_image_udf(dataframe_batch_iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for dataframe_batch in dataframe_batch_iterator:
        dataframe_batch["data_as_resized_array"] = dataframe_batch.apply(resize_img, args=(True,), axis=1)
        dataframe_batch["data_as_array"] = dataframe_batch.apply(resize_img, args=(False,), axis=1)
        yield dataframe_batch

resized_df = df_image.select("image.*").mapInPandas(resize_image_udf, schema)
Then you can run the StandardScaler and PCA with:
model_std = standardizer.fit(df)
df = model_std.transform(df)

# PCA (note: inputCol here is 'data_as_vector'; use 'scaledFeatures' if you want
# PCA to run on the standardized column produced above)
pca = PCA(k=n_components, inputCol='data_as_vector', outputCol='pcaFeatures')
model_pca = pca.fit(df)

# transform the images
df = model_pca.transform(df)
I think I am too late to answer your question, sorry.
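For later readers, here is a minimal sanity check with toy vectors rather than image data, reusing the column names from the answer above. If this runs in your environment, the scaler-to-PCA wiring itself is fine and the problem is specific to the image pipeline (for example the 224*224*3-dimensional vectors, memory, or column names). Note that this sketch feeds PCA the scaled column, whereas the answer above uses data_as_vector directly:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StandardScaler, PCA

spark = SparkSession.builder.getOrCreate()

# three toy vectors standing in for the flattened image arrays
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0, 3.0]),),
     (Vectors.dense([4.0, 5.0, 6.0]),),
     (Vectors.dense([7.0, 8.0, 10.0]),)],
    ["data_as_vector"])

standardizer = StandardScaler(withMean=True, withStd=True,
                              inputCol="data_as_vector", outputCol="scaledFeatures")
df = standardizer.fit(df).transform(df)

pca = PCA(k=2, inputCol="scaledFeatures", outputCol="pcaFeatures")
pca.fit(df).transform(df).select("pcaFeatures").show(truncate=False)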

"IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices." Pls slove this

I'm trying to cluster a dataset using a hierarchical clustering algorithm, but I got an error in the last step.
plt.figure(figsize = (10,7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(df_nor,method = "ward"))
plt.axhline(y =6, color = 'black' , linestyle = '--')
(Dendrogram: plot produced by the code above)
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=2,affinity='euclidean', linkage='ward')
cluster.fit_predict(df_nor)
The array
a = (df_nor['Milk'])
b = (df_nor['Grocery'])
plt.figure(figsize=(10, 7))
plt.scatter(a,b, c=cluster.labels_)
(error: screenshot of the IndexError)
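As a side note, df_nor['Milk'] typically raises exactly this IndexError when df_nor is a NumPy array rather than a DataFrame (for example if it came out of a scaler), since NumPy arrays cannot be indexed by column name. Below is a minimal, self-contained sketch of the same workflow; the column names Milk and Grocery mirror the question, everything else is made-up data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering

# toy data standing in for the (unshown) normalized DataFrame df_nor
rng = np.random.default_rng(0)
df_nor = pd.DataFrame({"Milk": rng.normal(0, 1, 200),
                       "Grocery": rng.normal(0, 1, 200)})

plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
shc.dendrogram(shc.linkage(df_nor, method="ward"))
plt.axhline(y=6, color="black", linestyle="--")

cluster = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = cluster.fit_predict(df_nor)

plt.figure(figsize=(10, 7))
plt.scatter(df_nor["Milk"], df_nor["Grocery"], c=labels)
plt.show()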

Can someone help me with the update function in Matlab to move values to MySQL?

I'm dealing with a problem with the update function in MATLAB.
conn = database('MySQL','user','password');
selectquery_select = 'SELECT * FROM inputs WHERE i_read = 0';
data_select = select(conn,selectquery_select);
for j = 1:size(data_select)
    id_data = data_select(j,1);
    id_data = string(id_data.(1));
    time_data = data_select(j,4);
    time_data = string(time_data.(1));
    time_dataform = datetime(time_data,'InputFormat','yyyy-MM-dd HH:mm:ss');
    y0 = data_select(j,2);
    y0 = str2num(string(y0.(1)));
    r0 = data_select(j,3);
    r0 = str2num(string(r0.(1)));
    if id_data == "115"
        run("C:\Users\...\uu.m")
        update(conn,'inputs','i_read',1,'WHERE (ID_code = "115") AND WHERE (i_Time = time_data)');
    end
end
Basically, I'm reading some values from the database when i_read is equal to 0 (i_read is a boolean column in the database that should be 1 if the value has already been processed and 0 if not). After a value is read, we want to change i_read in the database from 0 to 1. We decided to use the update function, but it gave us the following error:
Error using database.odbc.connection/update
Too many input arguments.
Error in Patient_Identification (line 57)
update(conn,'inputs','i_read',1,'WHERE (ID_code = "112") AND WHERE (i_Time = ', time_data,')');
Is someone able to help us with this problem? Thank you.

Implement Louvain in pyspark using dataframes

I'm trying to implement the Louvain algorithm in PySpark using DataFrames. The problem is that my implementation is really slow. This is how I do it:
1. I collect all vertices and communityIds into simple Python lists.
2. For each vertex-communityId pair I calculate the modularity gain using DataFrames (just a fancy formula involving sums/differences of edge weights).
3. Repeat until no change.
What am I doing wrong?
I suppose that if I could somehow parallelize the for-each loop the performance would increase, but how can I do that?
LATER EDIT:
I could use vertices.foreach(changeCommunityId) instead of the for-each loop, but then I'd have to compute the modularity gain (that fancy formula) without DataFrames.
See the code sample below, and a small pre-aggregation sketch after it:
def louvain(self):
    oldModularity = 0  # since initially each node represents a community
    graph = self.graph

    # retrieve graph vertices and edges dataframes
    vertices = verticesDf = self.graph.vertices
    aij = edgesDf = self.graph.edges

    canOptimize = True
    allCommunityIds = [row['communityId'] for row in verticesDf.select('communityId').distinct().collect()]
    verticesIdsCommunityIds = [(row['id'], row['communityId']) for row in verticesDf.select('id', 'communityId').collect()]

    allEdgesSum = self.graph.edges.groupBy().sum('weight').collect()
    m = allEdgesSum[0]['sum(weight)'] / 2

    def computeModularityGain(vertexId, newCommunityId):
        # the sum of all weights of the edges within C
        sourceNodesNewCommunity = vertices.join(aij, vertices.id == aij.src) \
            .select('weight', 'src', 'communityId') \
            .where(vertices.communityId == newCommunityId)
        destinationNodesNewCommunity = vertices.join(aij, vertices.id == aij.dst) \
            .select('weight', 'dst', 'communityId') \
            .where(vertices.communityId == newCommunityId)

        k_in = sourceNodesNewCommunity.join(
            destinationNodesNewCommunity,
            sourceNodesNewCommunity.communityId == destinationNodesNewCommunity.communityId).count()

        # the rest of the formula computation goes here, I just wanted to show you an example
        # just return some value for the modularity
        return 0.9

    def changeCommunityId(vertexId, currentCommunityId):
        maxModularityGain = 0
        maxModularityGainCommunityId = None
        for newCommunityId in allCommunityIds:
            if newCommunityId != currentCommunityId:
                modularityGain = computeModularityGain(vertexId, newCommunityId)
                if modularityGain > maxModularityGain:
                    maxModularityGain = modularityGain
                    maxModularityGainCommunityId = newCommunityId
        if maxModularityGain > 0:
            return maxModularityGainCommunityId
        return currentCommunityId

    while canOptimize:
        while self.changeInModularity:
            self.changeInModularity = False
            for vertexCommunityIdPair in verticesIdsCommunityIds:
                vertexId = vertexCommunityIdPair[0]
                currentCommunityId = vertexCommunityIdPair[1]
                newCommunityId = changeCommunityId(vertexId, currentCommunityId)
                self.changeInModularity = False
        canOptimize = False
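Not a full answer, but one direction that may help with the slowdown: pre-aggregate the quantities the modularity-gain formula needs once per pass, so the inner per-vertex loop does dictionary lookups instead of launching Spark jobs for every vertex-community pair. A sketch under that assumption, reusing the verticesDf / edgesDf names from the code above:

from pyspark.sql import functions as F

# total weight incident to each vertex (k_i) and to each community (sigma_tot);
# each edge is counted once per endpoint it touches
incident = (edgesDf.select(F.col("src").alias("id"), "weight")
            .unionByName(edgesDf.select(F.col("dst").alias("id"), "weight")))

k_i = incident.groupBy("id").agg(F.sum("weight").alias("k_i"))
sigma_tot = (incident.join(verticesDf.select("id", "communityId"), on="id")
             .groupBy("communityId").agg(F.sum("weight").alias("sigma_tot")))

# collect the small aggregates to the driver once per pass; the per-vertex loop
# can then evaluate the gain formula from plain dictionaries
k_i_map = {row["id"]: row["k_i"] for row in k_i.collect()}
sigma_tot_map = {row["communityId"]: row["sigma_tot"] for row in sigma_tot.collect()}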

Unable to read file 'subjFiles(1).name'. No such file or directory

I'm using some code that calls a specific function. In line 17 of this function I get the error:
Index exceeds matrix dimensions.
Error in generateExpReport (line 17)
checkPValuesField = subjFiles(1).name;
The function up to line 20 looks like this:
function [] = generateExpReport(copyDir,resultDir,params)
% Syntax
methodNames = fieldnames(params.methods);
numMethods = length(methodNames);
for i = 1 : numMethods
    cd(resultDir)
    numTargets = length(params.methods.(methodNames{i,1}).idTargets);
    idDrivers = params.methods.(methodNames{i,1}).idDrivers;
    nameFiles = [methodNames{i,1} '*.mat'];
    subjFiles = dir(nameFiles);
    numSubj = length(subjFiles);
    significanceOnDrivers = zeros(numSubj,numTargets);
    matrixTransferEntropy = zeros(numSubj,(numTargets)+1);
    % check if the pValues matrix is present
    checkPValuesField = load(subjFiles(1).name);
    fields = fieldnames(checkPValuesField);
    nameFields = checkPValuesField.(fields{1,1});
I can't find the problem. Please help me: what's wrong with
checkPValuesField = load(subjFiles(1).name);