Possible bug in the function hclust() of R-Project

Hi friends, the observation is the following; I don't know what the problem is.
When I make clusters with the hclust function, the labels of the object it creates are lost if the way I subset the data frame is "incorrect".
This is the data frame.
set.seed(1234)
x <- rnorm(12,mean=rep(1:3,each=4),sd=0.2)
y <- rnorm(12,mean=rep(c(1,2,1),each=4),sd=0.2)
z <- as.factor(sample(c("A","B"),12,replace=T))
df <- data.frame(x=x,y=y,z=z)
plot(df$x,df$y,col=z,pch=19,cex=2)
This chunk of code returns NULL for the labels.
df1 <- df[c("x","y")]
d <- dist(df1)
cluster <- hclust(d)
cluster$labels #NULL
This chunk of code returns NULL as well.
df2 <- df[,1:2]
d <- dist(df2)
cluster <- hclust(d)
cluster$labels #NULL
This chunk of code does not return NULL.
df3 <- df[1:12,1:2]
d <- dist(df3)
cluster <- hclust(d)
cluster$labels #Character Vector
This is a problem for me because I have code that relies on this information.
As you can see, the data frames are identical.
identical(df1, df2) # TRUE
identical(df1, df3) # TRUE
identical(df2, df3) # TRUE

Related

Speed up geomesa query

I've been testing GeoMesa with simple spatial queries and comparing it with PostGIS. For example, this SQL query runs in 30 seconds in PostGIS:
with series as (
  select generate_series(0, 5000) as i
),
points as (
  select ST_Point(i, i*2) as geom from series
)
select st_distance(a.geom, b.geom) from points as a, points as b
Now, the following GeoMesa version takes 5 minutes (using -Xmx10g):
import org.apache.spark.sql.SparkSession
import org.locationtech.geomesa.spark.jts._
import org.locationtech.jts.geom._

object HelloWorld {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .config("spark.sql.crossJoin.enabled", "true")
      .config("spark.executor.memory", "12g")
      .config("spark.driver.memory", "12g")
      .config("spark.cores.max", "4")
      .master("local")
      .appName("Geomesa")
      .getOrCreate()
    spark.withJTS
    import spark.implicits._

    val x = 0 until 5000
    val y = for (i <- x) yield i * 2
    val coords = for ((i, n) <- x.zipWithIndex) yield (i, y(n))
    val points = for (i <- coords) yield new GeometryFactory().createPoint(new Coordinate(i._1, i._2))
    val points2 = for (i <- coords) yield new GeometryFactory().createPoint(new Coordinate(i._1, i._2))
    val all_points = for {
      i <- points
      j <- points2
    } yield (i, j)

    val df = all_points.toDF("point", "point2")
    val df2 = df.withColumn("dist", st_distance($"point", $"point2"))
    df2.show()
  }
}
I'd have expected similar or better performance from GeoMesa. What can be done to tune a query like this?
FIRST EDIT
As Emilio suggests, this is not really a query but a computation.
This query could have been written without Spark. The code below runs in less than two seconds:
import org.locationtech.jts.geom._

object HelloWorld {
  def main(args: Array[String]): Unit = {
    val x = 0 until 5000
    val y = for (i <- x) yield i * 2
    val coords = for ((i, n) <- x.zipWithIndex) yield (i, y(n))
    val points = for (i <- coords) yield new GeometryFactory().createPoint(new Coordinate(i._1, i._2))
    val points2 = for {
      i <- points
      j <- points
    } yield i.distance(j)
    println(points2.slice(0, 30))
  }
}
GeoMesa is not going to be as fast as PostGIS for small amounts of data. GeoMesa is designed for distributed, NoSQL databases. If your dataset fits in PostGIS, you should probably just use PostGIS. Once you start hitting the limits of PostGIS, you should consider using GeoMesa. GeoMesa does offer integration with arbitrary GeoTools data stores (including PostGIS), which can make some of the GeoMesa Spark and command-line features available to PostGIS.
For your particular snippet, I suspect that most of the time is spent spinning up an RDD and running through the loops. There isn't really a 'query', as you are just running a pair-wise calculation. If you are querying data stored in a table, then GeoMesa has a chance to optimize the scan. However, GeoMesa isn't a SQL database and doesn't have any native support for joins. Generally the join is done in memory by Spark, although there are some things you can do to speed it up (e.g. a broadcast join or RDD partitioning). If you want to do complex spatial joins, you might want to check out GeoSpark and/or Magellan, which specialize in spatial Spark operations.
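To make the broadcast-join suggestion concrete, here is a minimal, hedged sketch. It uses plain Spark SQL (the broadcast hint from org.apache.spark.sql.functions, nothing GeoMesa-specific), and the DataFrame names are illustrative stand-ins for the point tables in the question:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("broadcast-sketch").getOrCreate()
    import spark.implicits._

    // Illustrative stand-ins for the two point DataFrames from the question.
    val left = (0 until 5000).toDF("i")
    val right = (0 until 5000).toDF("j")

    // broadcast() hints Spark to ship the (small) right side to every task,
    // so the cross join runs as a local nested loop instead of a shuffle.
    val joined = left.crossJoin(broadcast(right))
    println(joined.count()) // 25,000,000 pairs

    spark.stop()
  }
}
Since GeoMesa DataFrames are ordinary Spark DataFrames (withJTS just registers the geometry types), the same hint should apply to them unchanged.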

select multiple columns with dplyr having factor "No", "Yes" levels

I want to select all factor columns having two levels ("Yes", "No").
I want to use dplyr for this but could not solve the problem.
AB %>%
  select_if(.predicate = function(x) length(levels(x)) == 2 & unique(x) %in% c("No", "Yes"))
unique(x) %in% c('No','Yes') returns a vector the same length as unique(x), rather than a scalar. I think you're better off using setequal(levels(x), c('No','Yes')), as shown below:
library(dplyr)

# generate a data frame whose columns have different factor levels
n <- 100
no_yes       <- sample(c('No','Yes'), n, replace = TRUE)
no_yes_maybe <- sample(c('No','Yes','Maybe'), n, replace = TRUE)
no           <- sample(c('No'), n, replace = TRUE)
no_maybe     <- sample(c('No','Maybe'), n, replace = TRUE)

AB <- data.frame(
  no_yes,        # only this column should get returned
  no_yes_maybe,
  no,
  no_maybe,
  stringsAsFactors = TRUE
) %>% as.tbl

# predicate: TRUE if a column's factor levels are exactly No/Yes
desired_levels <- c('No','Yes')
predicate_function <- function(x) setequal(levels(x), desired_levels)

# use dplyr to select columns with the desired factor levels
AB %>% select_if(predicate_function)

Confused about behavior on Option between single-level and nested for comprehension

I'm new to Scala so please bear with me.
I'm confused about the behaviors below:
val l = List(Option(1))
for (i <- l; x <- i) yield x //Example 1: gives me List(1)
for(x <- Option(1)) yield x //Example 2: gives me Some(1)
Why doesn't the second for comprehension give me 1 instead? That would look more consistent to me, intuitively: the second generator in the first example, x <- i, looks like it should behave exactly the same way as the second example, since the second example has basically extracted the option out of the list to begin with.
Simply put, a for comprehension wraps its result in the type of its first generator.
for (x <- Option(1)) yield x // Returns Option
for (x <- List(1)) yield x // Returns List
for (x <- Array(1)) yield x // Returns Array
This:
for (i <- List(Some(1)); x <- i) yield x
desugars into this:
List(Some(1)).flatMap { case i => i.map { case x => x } }
flatMap on List returns a List[T]; that's why it behaves like that.
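To make this concrete, here is a small self-contained sketch (the names are mine) putting both comprehensions next to their desugared forms:
object ForDesugarSketch extends App {
  val l = List(Option(1))

  // Example 1: the first generator is a List, so the result is a List.
  // The nested generator x <- i compiles because Scala implicitly converts
  // the Option to an Iterable inside List#flatMap.
  val a: List[Int] = for (i <- l; x <- i) yield x
  val aDesugared: List[Int] = l.flatMap(i => i.map(x => x))

  // Example 2: the first generator is an Option, so the result is an Option.
  val b: Option[Int] = for (x <- Option(1)) yield x
  val bDesugared: Option[Int] = Option(1).map(x => x)

  println(a == aDesugared && a == List(1)) // true
  println(b == bDesugared && b == Some(1)) // true
}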

Haskell GHCi prints lazy sequence but Scala REPL doesn't

I would like to print out a stream of numbers, but the following code only prints the first number in the sequence:
for ( n <- Stream.from(2) if n % 2 == 0 ) yield println(n)
2
res4: scala.collection.immutable.Stream[Unit] = Stream((), ?)
In Haskell, the following keeps printing numbers until interrupted, and I would like similar behaviour in Scala:
naturals = [1..]
[n | n <- naturals, even n]
[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,
Instead of yielding println(n) (why would one want an infinite sequence of Units?), just run the side effect:
for ( n <- Stream.from(2) if n % 2 == 0 ) println(n)
If you really want that infinite sequence of Units, force the result:
val infUnit = for ( n <- Stream.from(2) if n % 2 == 0 ) yield println(n)
infUnit.force // or convert to any other non-lazy collection
Eventually, though, it will crash the program (due to the length of the materialized sequence).
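If you want to keep that experiment bounded, a take before force keeps the materialized sequence finite (a small sketch reusing the infUnit value from above):
val bounded = infUnit.take(10)
bounded.force // evaluates only the first ten elements, printing each even number once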
The result type of a for comprehension is a collection of the same type as the collection in the first clause; see the flatMap function signature.
So the result of a
for ( n <- Stream.from(2) .....
is a collection of type Stream[_], which is lazy, so you have to pull the element values or actions out yourself.
Look at the result types:
scala> :type for( n <- Stream.from(2)) yield n
scala.collection.immutable.Stream[Int]
scala> :type for( n <- List(1,2,3)) yield n
List[Int]
scala> :type for( n <- Set(1,2,3)) yield n
scala.collection.immutable.Set[Int]
To print out numbers until interrupted try this:
Stream.from(2).filter(_ % 2 == 0) foreach println
Its type guarantees that it will work:
scala> :type Stream.from(2).filter(_ % 2 == 0) foreach println
Unit
I think you meant:
for (n <- Stream.from(2) if n % 2 == 0) yield n
(because yield println(n) will always yield () with a side effect of printing n)
This gives you the collection you want. However, Scala, unlike Haskell, doesn't evaluate all members of a (lazy) list when printing it (a Stream). You can convert it into a non-lazy list using .toList, but then you won't see the same infinite printing behaviour: Scala will try to build the entire (infinite) list before printing anything at all.
Basically there is no way to get the exact same combination of semantics and behaviour in Scala compared to Haskell when printing infinite lists using the built-in toString infrastructure.
P.S.
for (n <- Stream.from(2) if n % 2 == 0) yield n
can be written more succinctly as
Stream.from(2).filter(_ % 2 == 0)
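And if the goal is just to inspect a finite prefix rather than print forever, take keeps the stream lazy and toList forces only that prefix, e.g.:
val evens = Stream.from(2).filter(_ % 2 == 0)
println(evens.take(10).toList) // List(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)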

How to select just one first or last record compliant to a where clause with ScalaQuery?

I have the following query template to select all records:
val q = for {
  a <- Parameters[Int]
  b <- Parameters[Int]
  t <- T if t.a === a && t.b === b
  _ <- Query.orderBy(t.c, t.d)
} yield t
I need to modify it to select only the very first record (minimum c, and minimum d for that c) or the very last one (maximum c, and maximum d for that c) of those matching the where condition. I'd strongly prefer that no records other than the first/last be selected, as there are hundreds of thousands of them...
There's a potential danger here in how the OP's query is currently constructed. Run as is, getting the first or last result of a 100K-row result set is not terribly efficient (unlikely, yes, but the point is that the query places no limit on the number of rows returned).
With straight SQL you would never do such a thing; instead you would tack on a LIMIT 1. In ScalaQuery, LIMIT = take(n), so add take(1) to get a single record returned from the query itself:
val q = (for {
  a <- Parameters[Int]
  b <- Parameters[Int]
  t <- T if t.a === a && t.b === b
  _ <- Query.orderBy(t.c, t.d)
} yield t) take(1)
q.firstOption
There is a method firstOption defined on the Invoker trait, and by some implicit magic it is available on the Query class. So maybe you can try it like this:
val q = for {
  a <- Parameters[Int]
  b <- Parameters[Int]
  t <- T if t.a === a && t.b === b
  _ <- Query.orderBy(t.c, t.d)
} yield t
q.firstOption