How to see ELKI DBSCAN clustering result

I am using ELKI for DBSCAN clustering of ~14,000 GPS points. It's running fine, but I want to see information about the clusters, such as how many points are in each cluster.

If you use the -resulthandler ResultWriter and output to text, the cluster sizes will be at the top of each cluster file.
The visualizer currently doesn't seem to show cluster sizes.
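For reference, a minimal command-line invocation with this handler might look like the following (a sketch only; the jar name, input file, and DBSCAN parameters are placeholders, and the exact launcher syntax can vary between ELKI versions):
java -jar elki.jar KDDCLIApplication \
  -dbc.in gps_points.csv \
  -algorithm clustering.DBSCAN \
  -dbscan.epsilon 0.01 \
  -dbscan.minpts 20 \
  -resulthandler ResultWriter \
  -out dbscan_output/
Each cluster file written to the output folder then starts with a header that includes the cluster size.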

Also, if you want to merge all those results into a single file, here is a Python script that works:
import os

clusterout_path = "path/to/where/files/all/go/"
finalout_path = "/path/for/single/merged/file/"
consol_filename = "single_merged_file.txt"

cll_file = open(finalout_path + consol_filename, "a")
cll_file.write("ClusterID" + "\t" + "Lon" + "\t" + "Lat" + "\n")

def readFile(file):
    f = open(clusterout_path + file)
    counter = 0
    cluster = ""
    lon = ""
    lat = ""
    for line in f.readlines():
        counter += 1
        if counter == 1:
            # The first line of each cluster file names the cluster
            cluster = line.split(":")[1].strip().lower()
        if counter > 4 and line.startswith("ID"):
            arr = line.split(" ")
            lon = arr[1]
            lat = arr[2]
            cll_file.write(cluster + "\t" + lon + "\t" + lat + "\n")
    f.close()

listing = os.listdir(clusterout_path)
for infile in listing:
    print "Processing file: " + infile
    readFile(infile)
cll_file.close()

Related

Performance issue with spark Dataframe, each iteration takes longer

I'm using Spark 2.2.1 with Scala 2.11.12 to implement a recursive algorithm. First, I tried an implementation using RDDs, but it took too long when I used a lot of data. I made a new version using DataFrames, but even with very little data it takes too long, even though each iteration handles less data than the previous one.
I have tried caching variables in different ways (including different persistence levels), using checkpoints at different moments, and using the repartition method with different values and in different functions, and nothing works.
The code starts by looking for the minimum distance between the points that make up the matrix (matrix is a DataFrame):
println("Finding minimum:")
val minDistRes = matrix.select(min("dist")).first().getFloat(0)
val clusterRes = matrix.where($"dist" === minDistRes)
println(s"New minimum:")
clusterRes.show(1)
Then, I save the coordinates of the points for later calculations:
val point1 = clusterRes.first().getInt(0)
val point2 = clusterRes.first().getInt(1)
Next, I build several filtered DataFrames to use for the new points generated in the next iteration (the broadcast variable is necessary to be able to access this data in a later map):
matrix = matrix.where("!(idW1 == " + point1 +" and idW2 ==" + point2 + " )").cache()
val dfPoints1 = matrix.where("idW1 == " + point1 + " or idW2 == " + point1).cache()
val dfPoints2 = matrix.where("idW1 == " + point2 + " or idW2 == " + point2).cache()
val dfPoints2Broadcast = spark.sparkContext.broadcast(dfPoints2)
val dfUnionPoints = dfPoints1.union(dfPoints2).cache()
val matrixSub = matrix.except(dfUnionPoints).cache()
I continue with the calculation of the new points and return the matrix that will be used recursively by the algorithm:
val newPoints = dfPoints1.map { r =>
  val distAux = dfPoints2Broadcast.value.where("idW1 == " + r.getInt(0) +
    " or idW1 == " + r.getInt(1) + " or idW2 == " + r.getInt(0) + " or idW2 == " +
    r.getInt(1)).first().getFloat(2)
  (newIndex.toInt, filterDF(r.getInt(0), r.getInt(1), point1, point2), math.min(r.getFloat(2), distAux))
}.asInstanceOf[Dataset[Row]]
matrix = matrixSub.union(newPoints)
Each iteration ends by caching the matrix variable and taking a checkpoint every so often:
matrix.cache()
if (a % 5 == 0)
  matrix.checkpoint()
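One detail worth flagging in this last step (an observation about the API, not a confirmed fix for the slowdown): Dataset.checkpoint() is not an in-place operation. It returns a new Dataset with truncated lineage, and it requires a checkpoint directory to be configured. A minimal sketch of the usual wiring, with a placeholder directory:
// checkpoint() returns a NEW Dataset; without reassignment, the ever-growing
// query plan of `matrix` is kept and each iteration gets slower.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // placeholder path

matrix.cache()
if (a % 5 == 0) {
  matrix = matrix.checkpoint() // eager by default in Spark 2.x
}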

Convert geographical coordinates using measurements package

I am trying to convert some data with the measurements package mentioned in the title, but I'm not succeeding.
My data:
Long Lat
62ᵒ36.080 58ᵒ52.940
61ᵒ28.020 54ᵒ59.940
62ᵒ07.571 56ᵒ48.873
62ᵒ04.929 57ᵒ33.605
63ᵒ01.419 60ᵒ30.349
63ᵒ09.555 61ᵒ29.199
63ᵒ43.499 61ᵒ23.590
64ᵒ34.175 62ᵒ30.304
63ᵒ16.342 59ᵒ16.437
60ᵒ55.090 54ᵒ49.269
61ᵒ28.013 54ᵒ59.928
62ᵒ07.868 56ᵒ48.040
62ᵒ04.719 57ᵒ32.120
62ᵒ36.083 58ᵒ51.766
63ᵒ01.644 60ᵒ30.714
64ᵒ33.897 62ᵒ30.772
63ᵒ43.604 61ᵒ23.426
63ᵒ09.288 61ᵒ29.888
63ᵒ16.722 59ᵒ16.204
What I'm trying:
library(measurements)
library(readxl)
coord = read.table('coord_converter.txt', header = T, stringsAsFactors = F)
# change the degree symbol to a space
lat = gsub('°','', coord$Lat)
long = gsub('°','', coord$Long)
# convert from decimal minutes to decimal degrees
lat = measurements::conv_unit(lat, from = 'deg_dec_min', to = 'dec_deg')
long = measurements::conv_unit(long, from = 'deg_dec_min', to = 'dec_deg')
What I'm getting with this penultimate line:
Warning messages:
In split(as.numeric(unlist(strsplit(x, " "))) * c(3600, 60), f = rep(1:length(x), : NAs introduced by coercion
In as.numeric(unlist(strsplit(x, " "))) * c(3600, 60) : longer object length is not a multiple of shorter object length
In split.default(as.numeric(unlist(strsplit(x, " "))) * c(3600, : data length is not a multiple of split variable
Can someone point my mistake or make a suggestion of how to proceed?
Thank you!
I think the issue here was that after the gsub call, degrees and minutes were not space-delimited, as required by measurements::conv_unit.
For example, this works fine (for this reproducible example I also changed "ᵒ" to "°"):
library(measurements)
#read your data
txt <-
"Long Lat
62°36.080 58°52.940
61°28.020 54°59.940
62°07.571 56°48.873
62°04.929 57°33.605
63°01.419 60°30.349
63°09.555 61°29.199
63°43.499 61°23.590
64°34.175 62°30.304
63°16.342 59°16.437
60°55.090 54°49.269
61°28.013 54°59.928
62°07.868 56°48.040
62°04.719 57°32.120
62°36.083 58°51.766
63°01.644 60°30.714
64°33.897 62°30.772
63°43.604 61°23.426
63°09.288 61°29.888
63°16.722 59°16.204"
coord <- read.table(text = txt, header = TRUE, stringsAsFactors = F)
# change the degree symbol to a space
lat = gsub('°',' ', coord$Lat)
long = gsub('°',' ', coord$Long)
# convert from decimal minutes to decimal degrees
lat = measurements::conv_unit(lat, from = 'deg_dec_min', to = 'dec_deg')
long = measurements::conv_unit(long, from = 'deg_dec_min', to = 'dec_deg')
yields...
> cbind(long, lat)
long lat
[1,] "62.6013333333333" "58.8823333333333"
[2,] "61.467" "54.999"
[3,] "62.1261833333333" "56.81455"
[4,] "62.08215" "57.5600833333333"
[5,] "63.02365" "60.5058166666667"
[6,] "63.15925" "61.48665"
[7,] "63.7249833333333" "61.3931666666667"
[8,] "64.5695833333333" "62.5050666666667"
[9,] "63.2723666666667" "59.27395"
[10,] "60.9181666666667" "54.82115"
[11,] "61.4668833333333" "54.9988"
[12,] "62.1311333333333" "56.8006666666667"
[13,] "62.07865" "57.5353333333333"
[14,] "62.6013833333333" "58.8627666666667"
[15,] "63.0274" "60.5119"
[16,] "64.56495" "62.5128666666667"
[17,] "63.7267333333333" "61.3904333333333"
[18,] "63.1548" "61.4981333333333"
[19,] "63.2787" "59.2700666666667"
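Note that conv_unit() returns character vectors here; if numeric coordinates are needed (for plotting, distance calculations, etc.), a final as.numeric() does the trick. The new column names are just for illustration:
# conv_unit() returned characters; coerce to numeric decimal degrees
coord$long_dd <- as.numeric(long)
coord$lat_dd  <- as.numeric(lat)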

How to predict in pycaffe?

I have a model that has been trained on CIFAR-10, but I don't understand how to make a prediction in pycaffe.
I got an image from LMDB, but I don't know how to load it into the net and get the predicted class.
My code:
net = caffe.Net('acc81/model.prototxt',
                'acc81/cifar10_full_iter_70000.caffemodel.h5',
                caffe.TEST)

lmdb_env = lmdb.open('cifar10_test_lmdb/')
lmdb_txn = lmdb_env.begin()
lmdb_cursor = lmdb_txn.cursor()

for key, value in lmdb_cursor:
    datum = caffe.proto.caffe_pb2.Datum()
    datum.ParseFromString(value)
    image = caffe.io.datum_to_array(datum)
    image = image.astype(np.uint8)
    # What's next with the image variable?
    # If I try:
    # out = net.forward_all(data=np.asarray([image]))
    # I get Exception: Input blob arguments do not match net inputs.
    print("Image class is " + label)
Use this Python script:
# Run the script with anaconda-python
# $ /home/<path to anaconda directory>/anaconda/bin/python LmdbClassification.py
import sys
import numpy as np
import lmdb
import caffe
from collections import defaultdict

caffe.set_mode_gpu()

# Modify the paths given below
deploy_prototxt_file_path = '/home/<username>/caffe/examples/cifar10/cifar10_deploy.prototxt'  # Network definition file
caffe_model_file_path = '/home/<username>/caffe/examples/cifar10/cifar10_iter_5000.caffemodel'  # Trained Caffe model file
test_lmdb_path = '/home/<username>/caffe/examples/cifar10/cifar10_test_lmdb/'  # Test LMDB database path
mean_file_binaryproto = '/home/<username>/caffe/examples/cifar10/mean.binaryproto'  # Mean image file

# Extract mean from the mean image file
mean_blobproto_new = caffe.proto.caffe_pb2.BlobProto()
f = open(mean_file_binaryproto, 'rb')
mean_blobproto_new.ParseFromString(f.read())
mean_image = caffe.io.blobproto_to_array(mean_blobproto_new)
f.close()

# CNN reconstruction and loading the trained weights
net = caffe.Net(deploy_prototxt_file_path, caffe_model_file_path, caffe.TEST)

count = 0
correct = 0
matrix = defaultdict(int)  # (real, pred) -> int
labels_set = set()

lmdb_env = lmdb.open(test_lmdb_path)
lmdb_txn = lmdb_env.begin()
lmdb_cursor = lmdb_txn.cursor()

for key, value in lmdb_cursor:
    datum = caffe.proto.caffe_pb2.Datum()
    datum.ParseFromString(value)
    label = int(datum.label)
    image = caffe.io.datum_to_array(datum)
    image = image.astype(np.uint8)
    # Subtract the mean image and run a forward pass
    out = net.forward_all(data=np.asarray([image]) - mean_image)
    plabel = int(out['prob'][0].argmax(axis=0))
    count += 1
    iscorrect = label == plabel
    correct += (1 if iscorrect else 0)
    matrix[(label, plabel)] += 1
    labels_set.update([label, plabel])
    if not iscorrect:
        print("\rError: key = %s, expected %i but predicted %i" % (key, label, plabel))
    sys.stdout.write("\rAccuracy: %.1f%%" % (100. * correct / count))
    sys.stdout.flush()

print("\n" + str(correct) + " out of " + str(count) + " were classified correctly")
print ""
print "Confusion matrix:"
print "(r , p) | count"
for l in labels_set:
    for pl in labels_set:
        print "(%i , %i) | %i" % (l, pl, matrix[(l, pl)])

Issues with naming ranges for charts within the Google Spreadsheet Script

I've been trying for days to create charts with a dynamic range that changes when the data in the Google spreadsheet is updated. Although I succeeded in doing that, I can't get the .setOption aspect to work. I want, for example, a title, a description, etc. with the chart. But this is not the main issue, since I can insert those by hand.
More important, however, are the range names, because there are none when I use the script. So within the chart it is not possible to see what each column represents, and I really want to fix that. I tried to use the .setNamedRange() aspects, but that is not working.
Can someone help me with that?
function check() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var end = sheet.getLastRow();
  var start = (end - 5);
  var endnew = (end - 4);
  var startnew = (end - 6);
  if (sheet.getCharts().length == 0) {
    Logger.log("Er is geen grafiek"); // "There is no chart"
    var chartBuilder = sheet.newChart()
        .asColumnChart().setStacked()
        .addRange(sheet.getRange("A" + startnew + ":" + "A" + endnew)) // should have a name
        .addRange(sheet.getRange("B" + startnew + ":" + "B" + endnew)) // should have a name
        .addRange(sheet.getRange("E" + startnew + ":" + "E" + endnew)) // should have a name
        .setOption('title', 'Effectief gebruik kantoorruimte') // not working
        .setPosition(10, 10, 0, 0);
    var chart = chartBuilder.build();
    sheet.insertChart(chart);
  }
  else {
    Logger.log("Er is wel een grafiek"); // "There is a chart"
    var charts = sheet.getCharts();
    for (var i in charts) {
      var chart = charts[i];
      var ranges = chart.getRanges();
      var builder = chart.modify();
      for (var j in ranges) {
        var range = ranges[j];
        builder.removeRange(range);
        builder
            .addRange(sheet.getRange("A" + (start) + ":" + "A" + end)) // should have a name
            .addRange(sheet.getRange("B" + (start) + ":" + "B" + end)) // should have a name
            .addRange(sheet.getRange("E" + (start) + ":" + "E" + end)) // should have a name
            .setOption('title', 'Effectief gebruik kantoorruimte')
            .build();
        sheet.updateChart(builder.build());
      }
    }
  }
}
I'm assuming that this code is the issue?
builder
    .addRange(sheet.getRange("A" + (start) + ":" + "A" + end))
Maybe try using the JavaScript toString() method to make sure that your text formula is working.
.addRange(sheet.getRange("A" + start.toString() + ":" + "A" + end.toString()))
There is a different format that you can use:
getRange(row, column, numRows, numColumns)
So, it would be:
getRange(start, 1, 1, numColumns)
That starts at row start in column A. It gets one row of data and however many columns you specify.
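For example (a sketch with a hypothetical column count), these two calls select the same cells:
// Both select one row starting at row `start`, spanning columns A through E:
sheet.getRange("A" + start + ":E" + start);
sheet.getRange(start, 1, 1, 5); // row, column, numRows, numColumns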

Read data from chart, ms as x value

I have some trouble with adding points to a chart and reading them back into an array.
With this code I am adding a new point to my chart; y_value is a normal double variable, and time_stamp is a string with the current time of day (15:56:45:799), including milliseconds:
string time_stamp = DateTime.Now.ToLongTimeString() + ":" + DateTime.Now.Millisecond.ToString();
chart_logger.Series[0].Points.AddXY(time_stamp, y_value);
After plotting the chart I want to save all data points to a txt file, so I want to read all the points back from the chart.
I tried it with
DataPoint[] asd = chart_logger.Series[0].Points.ToArray();
It reads all the Y values from the chart, but the X values are always zero.
Does anyone have an idea?
Thanks for the help,
Ralf
You need to use ToOADate() and FromOADate(double d).
chart_logger.Series[0].XValueType = ChartValueType.DateTime;
chart_logger.ChartAreas[0].AxisX.LabelStyle.Format = "MM/dd/yyyy HH:mm:ss.fff";
chart_logger.Series[0].Points.AddXY(DateTime.Now.ToOADate(), y_value);
DataPoint[] asd = chart_logger.Series[0].Points.ToArray();
var x = DateTime.FromOADate(asd[0].XValue);
Or
chart_logger.Series[0].YValuesPerPoint = 2;
var time = DateTime.Now;
string time_stamp = time.ToLongTimeString() + ":" + time.Millisecond.ToString();
chart_logger.Series[0].Points.AddXY(time_stamp, y_value, time.ToOADate());
DataPoint[] asd = chart_logger.Series[0].Points.ToArray();
var x = DateTime.FromOADate(asd[0].YValues[1]);
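To then dump the points to a txt file (the original goal), here is a minimal sketch building on the first approach; the output path is a placeholder:
// Assumes the X values were stored with ToOADate() as above,
// and `using System.IO;` is in scope.
using (var sw = new StreamWriter(@"C:\temp\points.txt")) // placeholder path
{
    foreach (DataPoint p in chart_logger.Series[0].Points)
    {
        DateTime t = DateTime.FromOADate(p.XValue);
        sw.WriteLine("{0:HH:mm:ss.fff}\t{1}", t, p.YValues[0]);
    }
}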