Mahout clustering - all text vectors in single cluster - why? - cluster-analysis
I've run the following example:
https://github.com/technobium/mahout-clustering/blob/master/src/main/java/com/technobium/ClusteringDemo.java#L64
Document 1 -> John saw a red car.
Document 2 -> Marta found a red bike.
Document 3 -> Don need a blue coat.
Document 4 -> Mike bought a blue boat.
Document 5 -> Albert wants a blue dish.
Document 6 -> Lara likes blue glasses.
Document 7 -> Donna, do you have red apples?
Document 8 -> Sonia needs blue books.
Document 9 -> I like blue eyes.
Document 10 -> Arleen has a red carpet.
and it works as expected with EuclideanDistanceMeasure. But I'm not sure why the text-intended distance measures (TanimotoDistanceMeasure and CosineDistanceMeasure) are giving me just a single cluster.
Why is this? I'm not pretending I know anything about these 2 distance measures that are giving unsatisfactory results - but what might I need to change? There are a few too many numbers in there for me to understand the effect of each. I do have the book "Mahout in Action" though I have only read 2 chapters.
EuclideanDistanceMeasure (2 clusters - good)
Clusters:
7 -> wt: 1.0 distance: 4.4960791719810365 vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
7 -> wt: 1.0 distance: 4.496079376645949 vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
7 -> wt: 1.0 distance: 4.496079576525459 vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
9 -> wt: 1.0 distance: 4.389955960700927 vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
9 -> wt: 1.0 distance: 4.389956011306051 vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
9 -> wt: 1.0 distance: 4.3899560687101395 vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
9 -> wt: 1.0 distance: 4.389956137136399 vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
7 -> wt: 1.0 distance: 5.577549042707083 vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
9 -> wt: 1.0 distance: 4.389956708176695 vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
9 -> wt: 1.0 distance: 4.389471924190491 vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]
produced by:
CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new EuclideanDistanceMeasure(), 20, 5,
true, 0, true);
FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);
CosineDistanceMeasure (just 1 cluster - bad)
Clusters:
0 -> wt: 1.0 distance: 0.6362357041216559 vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
0 -> wt: 1.0 distance: 0.6362357041216559 vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
0 -> wt: 1.0 distance: 0.636235704121656 vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
0 -> wt: 1.0 distance: 0.5876411474816594 vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]
produced by
CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new CosineDistanceMeasure(), 20, 5,
true, 0, true);
FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);
TanimotoDistanceMeasure (just 1 cluster - bad)
Clusters:
0 -> wt: 1.0 distance: 0.8637279689324617 vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
0 -> wt: 1.0 distance: 0.8637279689324617 vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
0 -> wt: 1.0 distance: 0.8637279689324617 vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
0 -> wt: 1.0 distance: 0.8723755210900389 vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]
produced via
CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new TanimotoDistanceMeasure(), 20, 5,
true, 0, true);
FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);
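One rough way to see why the same 20 and 5 thresholds behave so differently across the three measures is to compute the distances by hand for a few of the vectors printed above. The sketch below is plain Java with no Mahout dependency. The class and method names are made up for illustration, and the Tanimoto formula is the textbook 1 - a.b / (|a|^2 + |b|^2 - a.b), which I am assuming matches what TanimotoDistanceMeasure computes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of Mahout.
class DistanceRanges {
    // Builds a sparse vector (index -> weight) from alternating index, weight arguments.
    static Map<Integer, Double> vec(double... indexWeightPairs) {
        Map<Integer, Double> v = new HashMap<>();
        for (int i = 0; i < indexWeightPairs.length; i += 2) {
            v.put((int) indexWeightPairs[i], indexWeightPairs[i + 1]);
        }
        return v;
    }

    // Dot product over the indices the two sparse vectors share.
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            sum += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    static double euclidean(Map<Integer, Double> a, Map<Integer, Double> b) {
        return Math.sqrt(dot(a, a) + dot(b, b) - 2 * dot(a, b));
    }

    // Cosine distance = 1 - cosine similarity.
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        return 1.0 - dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
    }

    // Assumed Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b).
    static double tanimoto(Map<Integer, Double> a, Map<Integer, Double> b) {
        double ab = dot(a, b);
        return 1.0 - ab / (dot(a, a) + dot(b, b) - ab);
    }

    public static void main(String[] args) {
        // Weights copied from the cluster dumps above.
        Map<Integer, Double> doc1 = vec(8, 2.609, 21, 2.609, 29, 1.693, 30, 2.609);
        Map<Integer, Double> doc10 = vec(2, 2.609, 9, 2.609, 18, 2.609, 29, 1.693);
        Map<Integer, Double> doc3 = vec(4, 1.357, 10, 2.609, 13, 2.609, 27, 2.609);

        System.out.printf("doc1-doc10 euclidean=%.3f cosine=%.3f tanimoto=%.3f%n",
                euclidean(doc1, doc10), cosine(doc1, doc10), tanimoto(doc1, doc10));
        System.out.printf("doc1-doc3  euclidean=%.3f cosine=%.3f tanimoto=%.3f%n",
                euclidean(doc1, doc3), cosine(doc1, doc3), tanimoto(doc1, doc3));
    }
}
```

If these formulas hold, the Euclidean distances fall between 5 and 20, while the cosine and Tanimoto distances never exceed 1 for non-negative weights. With thresholds of 20 and 5, every vector is then "close" to the first canopy centre, which would be consistent with ending up with a single cluster.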
As Anony-Mousse said in his first response, the data I fed it belongs in a single cluster. After some soul searching in recent weeks (or more specifically, experimenting with the distance measure classes directly), I found a data set that results in more than one cluster:
1) Make sure the data is different enough
Text id1 = new Text("Document 1");
Text text1 = new Text("Atletico Madrid win");
writer.append(id1, text1);
Text id6 = new Text("Document 6");
Text text6 = new Text("Both apple and orange are fruit");
writer.append(id6, text6);
Text id7 = new Text("Document 7");
Text text7 = new Text("Both orange and apple are fruit");
writer.append(id7, text7);
2) Determine good radius values
a) Experiment with the DistanceMeasure class with your sample data
// toVector(...) is a helper that is not shown here; it turns a string into a Mahout Vector
Vector v1 = toVector("Atletico Madrid win");
Vector v2 = toVector("Both apple and orange are fruit");
Vector v3 = toVector("Both orange and apple are fruit");
// createCanopies removes points from the list it is given, so pass a mutable copy
List<Vector> vectorList = new LinkedList<>(ImmutableList.of(v1, v2, v3));
List<Canopy> canopies = CanopyClusterer.createCanopies(vectorList, new CosineDistanceMeasure(), 0.3, 0.3);
for (Canopy canopy : canopies) {
    System.out.println("DistanceMeasureMain.main() " + canopy.asFormatString());
}
produces:
DistanceMeasureMain.main() distance is 0.19193857965451055
DistanceMeasureMain.main() distance is 0.5281191379648771
DistanceMeasureMain.main() distance is 0.19193857965451055
DistanceMeasureMain.main() C0: {0:1.1,117724:1.0,378550445:1.0,1997849123:1.0}
DistanceMeasureMain.main() C1: {0:1.1,96727:1.0,96852:1.0,2076577:1.0,93029210:1.0,97711124:1.0,1008851410:1.0}
b) Use the distances as your radius values
I think the t1 and t2 values (0.2 and 0.2) for CanopyDriver.run() were also significant, though I don't know in intricate detail the effect of all the numerical parameters in the invocation below:
// CosineDistanceMeasure
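// Parameter names are as I understand the Mahout 0.9 signatures; check against your version.
// CanopyDriver: measure, t1, t2, runClustering, clusterClassificationThreshold, runSequential
CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids),
        new CosineDistanceMeasure(), 0.2, 0.2, true, 1, true);
// FuzzyKMeansDriver: convergenceDelta, maxIterations, m (fuzziness), runClustering,
// emitMostLikely, threshold, runSequential
FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
        new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);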
Output
Document 1 -> Atletico Madrid win
Document 6 -> Both apple and orange are fruit
Document 7 -> Both orange and apple are fruit
Clusters:
0 -> wt: 1.0 distance: 0.0 vec: Document 1 = [1:1.405, 4:1.405, 6:1.405]
1 -> wt: 1.0 distance: 0.0 vec: Document 6 = [0:1.000, 2:1.000, 3:1.000, 5:1.000]
1 -> wt: 1.0 distance: 0.0 vec: Document 7 = [0:1.000, 2:1.000, 3:1.000, 5:1.000]
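To get a feel for what t1 and t2 do, here is a simplified plain-Java sketch of canopy creation. It is my reading of the general canopy technique, not Mahout's actual code, and the class and method names are invented. The first remaining point becomes a canopy centre, points closer than t1 join that canopy, and points closer than t2 are removed so they cannot start a canopy of their own. The vectors are the three from the output above, written as dense arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration class, not part of Mahout.
class CanopySketch {
    // Cosine distance = 1 - cosine similarity, for dense vectors of equal length.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    // Simplified canopy creation: returns one list of member points per canopy.
    static List<List<double[]>> canopies(List<double[]> input, double t1, double t2) {
        List<double[]> points = new LinkedList<>(input); // mutable working copy
        List<List<double[]>> result = new ArrayList<>();
        while (!points.isEmpty()) {
            Iterator<double[]> it = points.iterator();
            double[] centre = it.next(); // first remaining point starts a canopy
            it.remove();
            List<double[]> canopy = new ArrayList<>();
            canopy.add(centre);
            while (it.hasNext()) {
                double[] p = it.next();
                double d = cosineDistance(centre, p);
                if (d < t1) {
                    canopy.add(p); // loosely close: joins this canopy
                }
                if (d < t2) {
                    it.remove();   // tightly close: can no longer start a new canopy
                }
            }
            result.add(canopy);
        }
        return result;
    }

    // Documents 1, 6 and 7 from the output above, as dense vectors of length 7.
    static List<double[]> sampleDocs() {
        return Arrays.asList(
                new double[] {0, 1.405, 0, 0, 1.405, 0, 1.405},
                new double[] {1, 0, 1, 1, 0, 1, 0},
                new double[] {1, 0, 1, 1, 0, 1, 0});
    }

    public static void main(String[] args) {
        System.out.println(canopies(sampleDocs(), 20, 5).size());    // prints 1
        System.out.println(canopies(sampleDocs(), 0.2, 0.2).size()); // prints 2
    }
}
```

With t2 = 5, every cosine distance (at most 1 for these vectors) is below the threshold, so the first point swallows everything. With t2 = 0.2, Document 1 is too far from the other two (distance 1.0) and they form a second canopy.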
Related
How to calculate height based on item count in swift
How do I achieve this result using a simple mathematical formula? I have an initial offset = 100 and an initial count = 0, and I want to increase the offset based on the count value. I tried the code below, but it doesn't work correctly.
Example
When count is 0 to 3, the offset should be 100.
When count is 4 to 6, the offset should be 200.
When count is 7 to 9, the offset should be 300.
When count is 10 to 12, the offset should be 400.
Attempt
func getHeight(count: Int) -> CGFloat {
    var index = 0
    for i in 0..<count {
        if (i % 2 != 0) {
            index += 1
        }
    }
    return CGFloat(100 * index)
    //return CGFloat(count + 3 / 9 * 100)
}
Testing
print("0 to 3 = \(self.getHeight(count: 0)), expected = 100")
print("0 to 3 = \(self.getHeight(count: 2)), expected = 100")
print("4 to 6 = \(self.getHeight(count: 4)), expected = 200")
print("7 to 9 = \(self.getHeight(count: 7)), expected = 300")
print("10 to 12 = \(self.getHeight(count: 12)), expected = 400")
Results
0 to 3 = 0.0, expected = 100
0 to 3 = 100.0, expected = 100
4 to 6 = 200.0, expected = 200
7 to 9 = 300.0, expected = 300
10 to 12 = 600.0, expected = 400
Formula with integer division:
let cnt = count != 0 ? count : 1
let result = 100 * ((cnt + 2) / 3)
Can I remove some element from a Range in Swift?
I have a ClosedRange from 1 to 10. Can I remove some elements from it? For example, after removing 5 and 7 from the closed range I should have
1 2 3 4 6 8 9 10
instead of
1 2 3 4 5 6 7 8 9 10
let closedRange: ClosedRange<Int> = 1...10
It ends up being an array of type [ClosedRange<Int>.Element], because once you remove elements it's not a range any more (by definition it no longer includes all the elements between the bounds).
From the Apple docs (https://developer.apple.com/documentation/swift/closedrange):
An interval from a lower bound up to, and including, an upper bound.
let closedRange: ClosedRange<Int> = 1...10
let closedRangeArrayWithElementsMissing = closedRange.filter { $0 != 5 && $0 != 7 }
// = [1, 2, 3, 4, 6, 8, 9, 10]
As the other answer states, no. symmetricDifference is good for this:
([5, 7] as Set).symmetricDifference(1...10)
Confusing Get Set behaviour
I am trying to understand computed properties. I have mostly understood the concept, but one output is confusing me:
struct SomePrices {
    var eighth: Double
    var quarter: Double
    var half: Double
    var zip: Double {
        get {
            return half * 2 - 20
        }
        set {
            eighth = newValue / 8 + 15
            quarter = newValue / 4 + 10
            half = newValue / 2 + 5
        }
    }
}
var gdp = SomePrices(eighth: 37.0, quarter: 73.0, half: 123.0)
gdp.eighth // 37
gdp.quarter // 73
gdp.half // 123
gdp.zip // 226
gdp.zip = 300
gdp.eighth // 52.5
gdp.quarter // 85
gdp.half // 155
gdp.zip // 290
I have been trying to understand how I got 290 after setting gdp.zip = 300.
You set zip to 300, so half becomes (300 / 2 + 5) = 155:
half = newValue / 2 + 5
Then you get zip, which is (155 * 2 - 20) = 290:
return half * 2 - 20
Create multiplication table with swift
I'm a beginner here. I've been stuck on a problem for some time now. I'm practicing in a playground and I need to make a multiplication table. Basically, if I input 3, I want the table to read
1 2 3
2 4 6
3 6 9
I'm confused about the loop for this, though. Any help, please?
Code so far
var x = 3
var width = 1
for x in 1...x {
    for width in 1...width {
        print(x, width*2)
    }
}
This code prints
1 2
2 2
3 2
You could do it like this.
func multiplicationTable(till limit: Int) {
    for i in 1...limit {
        for j in 1...limit {
            print(i * j, terminator: "\t")
        }
        print("")
    }
}
multiplicationTable(till: 5)
Output
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25
If conciseness is paramount:
let x = 3
let range = 1...x
for i in range {
    print(range.map { String(i * $0) }.joined(separator: "\t"))
}
You can store the multiplication table in a 2D array of Ints. First, populate the first row and first column with the numbers from 1 to the size of the multiplication table. Then, for each element in the remaining empty positions, multiply the first element of its row by the first element of its column.
func multiplicationTable(ofSize n: Int) -> [[Int]] {
    var table = Array(repeating: Array(repeating: 0, count: n), count: n)
    table[0] = Array(1...n)
    for i in 1..<n {
        table[i][0] = i + 1
        for j in 1..<n {
            table[i][j] = table[i][0] * table[0][j]
        }
    }
    return table
}
multiplicationTable(ofSize: 5).forEach { row in
    print(row, "\n")
}
Output:
[1, 2, 3, 4, 5]
[2, 4, 6, 8, 10]
[3, 6, 9, 12, 15]
[4, 8, 12, 16, 20]
[5, 10, 15, 20, 25]
How to check a particular value in all Maps in an RDD[Map[Int,String]] at a stretch using Scala?
I want to check a particular value in all Maps in an RDD[Map[Int,String]] at a stretch using Scala. My CSV file is:
Map(0 -> sunny, 1 -> hot, 2 -> high, 3 -> false, 4 -> no)
Map(0 -> sunny, 1 -> hot, 2 -> high, 3 -> true, 4 -> no)
Map(0 -> overcast, 1 -> hot, 2 -> high, 3 -> false, 4 -> yes)
Map(0 -> rainy, 1 -> mild, 2 -> high, 3 -> false, 4 -> yes)
Map(0 -> rainy, 1 -> cool, 2 -> normal, 3 -> false, 4 -> yes)
Here I want to check the last value in each map (no, no, yes, yes, yes) against a particular value (yes/no) in a single pass.
scala> val a = List(Map(0 -> "sunny", 1 -> "hot", 2 -> "high", 3 -> "false", 4 -> "no"),
     | Map(0 -> "sunny", 1 -> "hot", 2 -> "high", 3 -> "true", 4 -> "no"),
     | Map(0 -> "overcast", 1 -> "hot", 2 -> "high", 3 -> "false", 4 -> "yes"),
     | Map(0 -> "rainy", 1 -> "mild", 2 -> "high", 3 -> "false", 4 -> "yes"),
     | Map(0 -> "rainy", 1 -> "cool", 2 -> "normal", 3 -> "false", 4 -> "yes"))
a: List[scala.collection.immutable.Map[Int,String]] = List(Map(0 -> sunny, 1 -> hot, 2 -> high, 3 -> false, 4 -> no), Map(0 -> sunny, 1 -> hot, 2 -> high, 3 -> true, 4 -> no), Map(0 -> overcast, 1 -> hot, 2 -> high, 3 -> false, 4 -> yes), Map(0 -> rainy, 1 -> mild, 2 -> high, 3 -> false, 4 -> yes), Map(0 -> rainy, 1 -> cool, 2 -> normal, 3 -> false, 4 -> yes))
scala> sc.parallelize(a)
res0: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,String]] = ParallelCollectionRDD[0] at parallelize at <console>:15
scala> val l = sc.parallelize(a)
l: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,String]] = ParallelCollectionRDD[1] at parallelize at <console>:14
scala> def check( s : String) : Boolean = if (s.equals("yes")) true else false
check: (s: String)Boolean
scala> val res = l.map{ x => check(x(4)) }
res: org.apache.spark.rdd.RDD[Boolean] = MappedRDD[4] at map at <console>:18
14/11/28 00:18:47 INFO DAGScheduler: Stage 5 (take at <console>:21) finished in 0.020 s
14/11/28 00:18:47 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
14/11/28 00:18:47 INFO DAGScheduler: Job 5 finished: take at <console>:21, took 0.026501 s
false
false
true
true
true
UPDATE
The following is true only when all values are true; otherwise it is false.
scala> res.reduce( _ && _ )