Cluster features based on their attributes - cluster-analysis

I have a set of 5000 points each has an attribute that is set to a value from 0 to 5. I am trying to cluster this data to reduce the amount of points drawn but wish to create clusters containing only the features that have the same attribute value. Having searched around I discovered this extended clustering example from Openlayers 2.
http://dev.openlayers.org/examples/strategy-cluster-extended.html
However I see no information on how to implement this in Openlayers 3 and above? Is there something simple I am missing in order to achieve this?
Many thanks for the help on this and any advice would be much appreciated!

You would need a cluster source and layer for each value. You could use the geometry function as a filter
cluster0 = new Cluster({
source: vectorSource,
geometryFunction: function(feature) {
if (feature.get('attribute') == '0') {
return feature.getGeometry();
}
return null;
}
});
cluster1 = new Cluster({
source: vectorSource,
geometryFunction: function(feature) {
if (feature.get('attribute') == '1') {
return feature.getGeometry();
}
return null;
}
});

This doesn't really sound like a clustering question. It seems like you just need to group by value 1, 2, 3, 4, and 5. If you really want to do clustering, you can do it like this.
import pandas as pd
import numpy as np
from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq
from sklearn.cluster import KMeans
df= pd.read_csv("filename.csv")
#format the data as a numpy array to feed into the K-Means algorithm
data = np.asarray([np.asarray(df['Value1']),np.asarray(df['Value2'])])
X = data
distorsions = []
for k in range(2, 20):
k_means = KMeans(n_clusters=k)
k_means.fit(X)
distorsions.append(k_means.inertia_)
fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, 20), distorsions)
plt.grid(True)
plt.title('Elbow curve')
# computing K-Means with K = 5 (5 clusters)
centroids,_ = kmeans(data,5)
# assign each sample to a cluster
idx,_ = vq(data,centroids)
details = [(name,cluster) for name, cluster in zip(df.index,idx)]
for detail in details:
print(detail)

Related

nearest building with open street map

I have a csv of relevant points with latitude and longitude and trying to get the nearest
building data to each point and add a column to the csv (or panda) in python. Tried using Pyrosm and various libraries but can't seem to prune the data to get the nearest building and then add the data. Thanks
This is what I have
from pyrosm import OSM
from pyrosm import get_data
import geopandas as gpd
from sklearn.neighbors import BallTree
import numpy as np
import osmnx as ox
# get rid of weird error
import shapely
import warnings
from shapely.errors import ShapelyDeprecationWarning
import csv
def get_gig_data(csv_fname):
with open(csv_fname, "r", encoding="latin-1") as gig_records:
for gig_record in csv.reader(gig_records):
yield gig_record
def main():
warnings.filterwarnings("ignore", category=ShapelyDeprecationWarning)
chicago_osm = OSM(get_data("chicago"))
#get a Point of Interest GeoDataFrame
points_of_interest = chicago_osm.get_pois() #can use a custom filter if we want to filter the types, but I think no filter might be the best
# get buildings nodes and edges
nodes, edges = chicago_osm.get_network(nodes=True, network_type="walking")
buildings = chicago_osm.get_buildings()
b_cnt = len(buildings)
G = chicago_osm.to_graph(nodes, edges)
#nodes = get_igraph_nodes(G)
buildings['geometry'] = buildings.centroid
# poi_list = np.asarray([ point.coords for point in points_of_interest['geometry'] ]) #if point.geom_type == point])
#print(poi_list.shape)
#tree = BallTree( np.asarray([ point.coords for point in points_of_interest['geometry'] if point.geom_type == point]), metric="manhattan") #Note: the scipy implementation of manhattan/cityblock distance might be faster according to the internet bc it uses a C function
#Read in the gig work data - I think the best way to do this will probably be with the CSV.reader with open thing because it will go line by line and save a ton of memory
'''for i in points_of_interest:
print('Type: ', type(i) , ' ',i)'''
gig_fp = "data_sample.csv"
#gig_data = gpd.read_file(gig_fp)
iter_gig = iter(get_gig_data(gig_fp))
next(iter_gig)
ids=dict()
for building in buildings.iterrows():
#print(type(building[1][32]) , ' ', building[1][32])
#tup = tuple(float(x) for x in [trip[17][8:-1].split()])
ids[building[1][32]] = building
#make the tree that determines closest POI
#if we use the CSV reader this for loop will be done already
for trip in iter_gig:
# Using generator so this should be efficient memory wise.
tup = tuple([float(x) for x in trip[17][8:-1].split() ])
print(type(tup), ' ', tup)
src_ids,euclidean_distance=ox.distance.nearest_nodes(G,tup)
src_ids, euclidean_distance= ox.distance.nearest_nodes(G,tup)
# find nearest node
#THEN ADD THE PICKUP AND DROPOFF IDS TO THIS TUPLE AND ADD TO A NEW NP ARRAY
if __name__ == '__main__':
main()

Can I draw a bipartite graph from every dataset?

I am trying to draw a bipartite graph for my data set, which is like below:
source target weight
reduce energy 25
reduce consumption 25
energy pennsylvania 4
energy natural 4
consumption balancing 4
the code That I am trying to plot the graph is as below:
C_2021 = nx.Graph()
C_2021.add_nodes_from(df_final_2014['source'], bipartite=0)
C_2021.add_nodes_from(df_final_2014['target'], bipartite=1)
edges = df_final_2014[['source', 'target','weight']].apply(tuple, axis=1)
C_2021.add_weighted_edges_from(edges)
But when I check with the below code whether it is bipartite or not, I get the "False" feedback.
nx.is_bipartite(C_2021)
Could you please advise what the issue is?
The previous issue is resolved, but when I want to plot the bipartite graph with the below steps, I do not get a proper result. If someone could help me, I will be appreciated it:
top_nodes_2021 = set(n for n,d in C_2021.nodes(data=True) if d['bipartite']==0)
top_nodes_2021
the output of the above is:
{'reduce'}
bottom_nodes_2021 = set(C_2021) - top_nodes_2021
bottom_nodes_2021
the output of the above is:
{'balancing', 'consumption', 'energy', 'natural', 'pennsylvania '}
then plot it by:
pos = nx.bipartite_layout(C_2021,top_nodes_2021)
plt.figure(figsize=[8,6])
# Pass that layout to nx.draw
nx.draw(C_2021,pos,node_color='#A0CBE2',edge_color='black',width=0.2,
edge_cmap=plt.cm.Blues,with_labels=True)
and the result is:
It works for me using your code. nx.is_bipartite(C_2021) returns true. Check the example below:
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
import pandas as pd
data = StringIO('''source;target;weight
reduce;energy;25
reduce;consumption;25
energy;pennsylvania ;4
energy;natural;4
consumption;balancing;4
''')
df_final_2014 = pd.read_csv(data, sep=";")
C_2021 = nx.Graph()
C_2021.add_nodes_from(df_final_2014['source'], bipartite=0)
C_2021.add_nodes_from(df_final_2014['target'], bipartite=1)
edges = df_final_2014[['source', 'target','weight']].apply(tuple, axis=1)
C_2021.add_weighted_edges_from(edges)
nx.is_bipartite(C_2021)
Finally to draw them get the bipartite sets. The data you passed during the creation is false (i.g. bipartite=0 and bipartite=1).
Use the following commands:
from networkx.algorithms import bipartite
top_nodes_2021, bottom_nodes_2021 = bipartite.sets(C_2021)
pos = nx.bipartite_layout(C_2021, top_nodes_2021)
plt.figure(figsize=[8,6])
# Pass that layout to nx.draw
nx.draw(C_2021,pos,node_color='#A0CBE2',edge_color='black',width=0.2,
edge_cmap=plt.cm.Blues,with_labels=True)
With the following result:

Getting model parameters from regression in Pyspark in efficient way for large data

I have created a function for applying OLS regression and just getting the model parameters. I used groupby and applyInPandas but it's taking too much of time. Is there are more efficient way to work around this?
Note: I din't had to use groupby as all features have many levels but as I cannot use applyInPandas without it so I created a dummy feature as 'group' having the same value as 1.
Code
import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import lit
pdf = pd.DataFrame({
'x':[3,6,2,0,1,5,2,3,4,5],
'y':[0,1,2,0,1,5,2,3,4,5],
'z':[2,1,0,0,0.5,2.5,3,4,5,6]})
df = sqlContext.createDataFrame(pdf)
result_schema =StructType([
StructField('index',StringType()),
StructField('coef',DoubleType())
])
def ols(pdf):
y_column = ['z']
x_column = ['x', 'y']
y = pdf[y_column]
X = pdf[x_column]
model = sm.OLS(y, X).fit()
param_table =pd.DataFrame(model.params, columns = ['coef']).reset_index()
return param_table
#adding a new column to apply groupby
df = df.withColumn('group', lit(1))
#applying function
data = df.groupby('group').applyInPandas(ols, schema = result_schema)
Final output sample
index coef
x 0.183246073
y 0.770680628

Gaussian Mixture Model in scala spark 1.5.1 weights are always uniformly distributed

I implemented the default gmm model provided in mllib for my algorithm.
I am repeatedly finding that the resultant weights are always equally waited no matter how many clusters i initiate. Is there any specific reason why the weights are not being adjusted ? Am I implementing it wrong ?
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.clustering.GaussianMixtureModel
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrameNaFunctions
var colnames= df.columns;
for(x<-colnames)
{
if (df.select(x).dtypes(0)._2.equals("StringType")|| df.select(x).dtypes(0)._2.equals("LongType"))
{df = df.drop(x)}
}
colnames= df.columns;
var assembler = new VectorAssembler().setInputCols(colnames).setOutputCol("features")
var output = assembler.transform(df)
var normalizer= new Normalizer().setInputCol("features").setOutputCol("normalizedfeatures").setP(2.0)
var normalizedOutput = normalizer.transform(output)
var temp = normalizedOutput.select("normalizedfeatures")
var outputs = temp.rdd.map(_.getAs[org.apache.spark.mllib.linalg.Vector]("normalizedfeatures"))
var gmm = new GaussianMixture().setK(2).setMaxIterations(10000).setSeed(25).run(outputs)
Output code :
for (i <- 0 until gmm.k) {
println("weight=%f\nmu=%s\nsigma=\n%s\n" format
(gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
}
And therefore the points are being predicted in the same cluster for all the points .
var ol=gmm.predict(outputs).toDF
I am also having this issue. The weights and the gaussians are always the same. It seems independent of K.
My code is pretty simple. My data is 39 dimensional vectors of doubles. I just train like this...
val gmm = new GaussianMixture().setK(2).run(vectors)
for (i <- 0 until gmm.k) {
println("weight=%f\nmu=%s\nsigma=\n%s\n" format
(gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
}
I tried KMeans, and it worked as expected. So I thought this has to be a bug with GaussianMixture.
But then I tried clustering just the first dimension, and it worked. Now I think it must be an EM issue with to little data... except I have lots.
Any GMM experts out there? How much data does one need GaussianMixture and 39 dimensions.
Or is this a bug after all?

Trying to balance my dataset through sample_weight in scikit-learn

I'm using RandomForest for classification, and I got an unbalanced dataset, as: 5830-no, 1006-yes. I try to balance my dataset with class_weight and sample_weight, but I can`t.
My code is:
X_train,X_test,y_train,y_test = train_test_split(arrX,y,test_size=0.25)
cw='auto'
clf=RandomForestClassifier(class_weight=cw)
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 10,fit_params={'sample_weight': sw})
But I don't get any improvement on my ratios TPR, FPR, ROC when using class_weight and sample_weight.
Why? Am I doing anything wrong?
Nevertheless, if I use the function called balanced_subsample, my ratios obtain a great improvement:
def balanced_subsample(x,y,subsample_size):
class_xs = []
min_elems = None
for yi in np.unique(y):
elems = x[(y == yi)]
class_xs.append((yi, elems))
if min_elems == None or elems.shape[0] < min_elems:
min_elems = elems.shape[0]
use_elems = min_elems
if subsample_size < 1:
use_elems = int(min_elems*subsample_size)
xs = []
ys = []
for ci,this_xs in class_xs:
if len(this_xs) > use_elems:
np.random.shuffle(this_xs)
x_ = this_xs[:use_elems]
y_ = np.empty(use_elems)
y_.fill(ci)
xs.append(x_)
ys.append(y_)
xs = np.concatenate(xs)
ys = np.concatenate(ys)
return xs,ys
My new code is:
X_train_subsampled,y_train_subsampled=balanced_subsample(arrX,y,0.5)
X_train,X_test,y_train,y_test = train_test_split(X_train_subsampled,y_train_subsampled,test_size=0.25)
cw='auto'
clf=RandomForestClassifier(class_weight=cw)
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 10,fit_params={'sample_weight': sw})
This is not a full answer yet, but hopefully it'll help get there.
First some general remarks:
To debug this kind of issue it is often useful to have a deterministic behavior. You can pass the random_state attribute to RandomForestClassifier and various scikit-learn objects that have inherent randomness to get the same result on every run. You'll also need:
import numpy as np
np.random.seed()
import random
random.seed()
for your balanced_subsample function to behave the same way on every run.
Don't grid search on n_estimators: more trees is always better in a random forest.
Note that sample_weight and class_weight have a similar objective: actual sample weights will be sample_weight * weights inferred from class_weight.
Could you try:
Using subsample=1 in your balanced_subsample function. Unless there's a particular reason not to do so we're better off comparing the results on similar number of samples.
Using your subsampling strategy with class_weight and sample_weight both set to None.
EDIT: Reading your comment again I realize your results are not so surprising!
You get a better (higher) TPR but a worse (higher) FPR.
It just means your classifier tries hard to get the samples from class 1 right, and thus makes more false positives (while also getting more of those right of course!).
You will see this trend continue if you keep increasing the class/sample weights in the same direction.
There is a imbalanced-learn API that helps with oversampling/undersampling data that might be useful in this situation. You can pass your training set into one of the methods and it will output the oversampled data for you. See simple example below
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=1)
x_oversampled, y_oversampled = ros.fit_sample(orig_x_data, orig_y_data)
Here it the link to the API: http://contrib.scikit-learn.org/imbalanced-learn/api.html
Hope this helps!