Select multiple columns with dplyr having factor "No", "Yes" levels

I want to select all factor columns having exactly two levels ("Yes", "No").
I want to use dplyr for this but could not solve the problem:
AB %>%
  select_if(.predicate = function(x) length(levels(x)) == 2 & unique(x) %in% c("No", "Yes"))

unique(x) %in% c('No','Yes') returns a vector the same length as unique(x), rather than a scalar. I think you're better off using setequal(levels(x), c('No','Yes')), as shown below:
library(dplyr)

# generate a data frame with columns of different factor levels
n <- 100
no_yes       <- sample(c('No', 'Yes'), n, replace = TRUE)
no_yes_maybe <- sample(c('No', 'Yes', 'Maybe'), n, replace = TRUE)
no           <- sample(c('No'), n, replace = TRUE)
no_maybe     <- sample(c('No', 'Maybe'), n, replace = TRUE)

AB <- data.frame(
  no_yes,        # only this column should get returned
  no_yes_maybe,
  no,
  no_maybe,
  stringsAsFactors = TRUE
) %>% as.tbl

# predicate: TRUE if the column has exactly the No/Yes factor levels
desired_levels <- c('No', 'Yes')
predicate_function <- function(x) setequal(levels(x), desired_levels)

# use dplyr to select columns with the desired factor levels
AB %>% select_if(predicate_function)
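Note that select_if() is superseded in dplyr 1.0.0 and later; on a newer version the same predicate also works via select(where(...)):
AB %>% select(where(function(x) is.factor(x) && setequal(levels(x), desired_levels)))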

Possible bug in the function hclust() of R-Project

Hi my friends, here is my observation; I don't know what the problem is.
When I make clusters with the hclust function, the labels of the object it creates are lost if the way I subset the data frame is "incorrect".
This is the data frame.
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
z <- as.factor(sample(c("A", "B"), 12, replace = TRUE))
df <- data.frame(x = x, y = y, z = z)
plot(df$x, df$y, col = z, pch = 19, cex = 2)
This chunk of code returns NULL for the labels.
df1 <- df[c("x","y")]
d <- dist(df1)
cluster <- hclust(d)
cluster$labels #NULL
This chunk of code returns NULL as well.
df2 <- df[,1:2]
d <- dist(df2)
cluster <- hclust(d)
cluster$labels #NULL
This chunk of code does not return NULL.
df3 <- df[1:12,1:2]
d <- dist(df3)
cluster <- hclust(d)
cluster$labels #Character Vector
This has been a problem for me because I have code that relies on this information.
As you can see, the data frames are identical.
identical(df1, df2) # TRUE
identical(df1, df3) # TRUE
identical(df2, df3) # TRUE
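A likely explanation (worth checking): the data frames differ in how their row names are stored internally even though identical() reports TRUE. dist() only propagates row names into the Labels attribute when they are stored as "real" values rather than in the compact "automatic" form; row subsetting with df[1:12, 1:2] materializes real row names, while df[c("x","y")] and df[, 1:2] keep the automatic form, and identical() expands both forms before comparing. This can be inspected with .row_names_info():
.row_names_info(df1)       # negative: automatic row names
.row_names_info(df3)       # positive: explicit row names
attr(dist(df1), "Labels")  # NULL
attr(dist(df3), "Labels")  # "1" "2" ... "12"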

Add columns to a DataFrame dynamically with column names as elements of a List

I have a List of N elements like the one below
val check = List("a", "b", "c", "d")
where N can be any number of elements.
I have a DataFrame with only one column, called "value". Based on the contents of value, I need to create N columns, with the column names taken from the elements of the list and the column contents given by substring(x, y).
I have tried all the ways I could think of, like withColumn and selectExpr; nothing works.
Please consider substring(X, Y), where X and Y are some numbers based on some metadata.
Below are the different pieces of code I tried; none of them worked.
val df = sqlContext.read.text("xxxxx")
// UDF: substring of the value, or "NULL" when the value is too short
val coder: (String => String) = (arg: String) => {
  val param = "NULL"
  if (arg.length() > Y) arg.substring(X, Y) else param
}
val sqlfunc = udf(coder)
val check = List("a", "b", "c", "d")
for (name <- check) { val testDF2 = df.withColumn(name, sqlfunc(df("value"))) }
testDF2 ends up with only the last column d; the columns a, b, and c are not added (each iteration rebuilds testDF2 from the original df, so only the final assignment survives).
var z: Array[String] = new Array[String](check.size)
var i = 0
for (x <- check) {
  if ((i + 1) == check.size) {
    z(i) = s""""substring(a.value,X,Y) as $x""""
    i = i + 1
  } else {
    z(i) = s""""substring(a.value,X,Y) as $x","""
    i = i + 1
  }
}
val zz = z.mkString(" ")
df.alias("a").selectExpr(s"$zz").show()
This throws an error.
Please help me understand how to add columns to a DataFrame dynamically, with the column names taken from a List.
I am expecting a DataFrame like the one below:
-----------------------------
Value| a | b | c | d | .... N
-----------------------------
|xxx|xxx|xxx|xxx|xxx|xxxxxx-
|xxx|xxx|xxx|xxx|xxx|xxxxxx-
|xxx|xxx|xxx|xxx|xxx|xxxxxx-
-----------------------------
You can dynamically add columns from your list using, for instance, this answer by user6910411 to a similar question (see their full answer for more possibilities):
val newDF = check.foldLeft(<yourdf>)((df, name) => df.withColumn(name, <yourUDF>($"value")))
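A minimal self-contained sketch of that foldLeft pattern, reusing df and check from the question; safeSub and its bounds are made-up stand-ins for the metadata-driven substring(X, Y):
import org.apache.spark.sql.functions.{col, udf}

val (x, y) = (0, 3) // hypothetical bounds
val safeSub = udf { (s: String) => if (s != null && s.length > y) s.substring(x, y) else "NULL" }

// Each step threads the accumulated DataFrame through withColumn,
// so all N columns survive (unlike the for loop above, which rebuilt from df each time).
val newDF = check.foldLeft(df)((acc, name) => acc.withColumn(name, safeSub(col("value"))))
newDF.show()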

Selective sampling in a Spark RDD

I have an RDD of logged events and I want to take a few samples of each category.
The data looks like this:
|xxx|xxxx|xxxx|type1|xxxx|xxxx
|xxx|xxxx|xxxx|type2|xxxx|xxxx|xxxx|xxxx
|xxx|xxxx|xxxx|type3|xxxx|xxxx|xxxx
|xxx|xxxx|xxxx|type3|xxxx|xxxx|xxxx
|xxx|xxxx|xxxx|type4|xxxx|xxxx|xxxx|xxxx|xxxx
|xxx|xxxx|xxxx|type1|xxxx|xxxx
|xxx|xxxx|xxxx|type6|xxxx
My attempt:
eventlist = ['type1', 'type2', ...]
originalRDD = sc.textFile("/path/to/file/*.gz").map(lambda x: x.split("|"))
samplelist = []
for event in eventlist:
    eventsample = originalRDD.filter(lambda x: x[3] == event).take(5)
    samplelist.extend(eventsample)
print(samplelist)
I have two questions on this:
1. Is there a better or more efficient way to collect samples based on a specific condition?
2. Is it possible to collect the unsplit lines instead of the split ones?
Python or Scala suggestions are welcome!
If the sample doesn't have to be random, something like this should work just fine:
n = ...  # number of elements you want to sample per key
pairs = originalRDD.map(lambda x: (x[3], x))
pairs.aggregateByKey(
    [],                                     # zero value
    lambda acc, x: (acc + [x])[:n],         # add a new value and trim to n elements
    lambda acc1, acc2: (acc1 + acc2)[:n])   # combine two accumulators and trim
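This yields an RDD of (type, list-of-samples) pairs; if the per-key counts are small, it can be flattened back into a plain list like the question's samplelist:
sampled = pairs.aggregateByKey(
    [], lambda acc, x: (acc + [x])[:n], lambda acc1, acc2: (acc1 + acc2)[:n])
samplelist = [row for _, rows in sampled.collect() for row in rows]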
Getting a random sample is a little bit harder. One possible approach is to add a random value and sort before aggregation:
import os
import random

def add_random(iter):
    seed = int(os.urandom(4).encode('hex'), 16)
    rs = random.Random(seed)
    for x in iter:
        yield (rs.random(), x)

(pairs
    .mapPartitions(add_random)
    .sortByKey()
    .values()
    .aggregateByKey(
        [],
        lambda acc, x: (acc + [x])[:n],
        lambda acc1, acc2: (acc1 + acc2)[:n]))
For a DataFrame-specific solution, see Choosing random items from a Spark GroupedData Object.
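If an approximate, fraction-based sample per category is acceptable, pair RDDs also have a built-in sampleByKey; a sketch with made-up fractions (keys absent from the dict are filtered out first), which also addresses question 2 by keeping the unsplit line as the value:
fractions = {'type1': 0.2, 'type2': 0.2}  # hypothetical per-key sampling fractions
lines = sc.textFile("/path/to/file/*.gz")
pairs_raw = (lines
    .map(lambda line: (line.split("|")[3], line))  # key by type, keep the raw line
    .filter(lambda kv: kv[0] in fractions))
sampled = pairs_raw.sampleByKey(False, fractions, seed=42).values()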

Scala: difference of two sets by key

I have two sets of (k,v) pairs:
val x = Set((1,2), (2,10), (3,5), (7,15))
val y = Set((1,200), (3,500))
How can I find the difference of these two sets by key, to get:
Set((2,10),(7,15))
Any quick and simple solution?
val ym = y.toMap
x.toMap.filterKeys(k => !(ym contains k)).toSet
Sets don't have keys, maps do. So you convert to map. Then, you can't create a difference on maps, but you can filter the keys to exclude the ones you don't want. And then you're done save for converting back to a Set. (It's not the most efficient way to do this, but it's not bad and it's easy to write.)
Let val keys = y.map(_._1) be the set of keys (the first element in each pair) that must not occur as a key in x; thus
for ( p <- x if !keys(p._1) ) yield p
as well as
x.collect { case p @ (a, b) if !keys(a) => p }
and
x.filter ( p => !keys(p._1) )
x.filterNot ( p => keys(p._1) )
You can also try this one-liner (filterNot, since we want the keys that are absent from y; filtering a Set already yields a Set):
x.filterNot(m => y.map(_._1).contains(m._1))
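A quick sanity check of these approaches against the values from the question:
val x = Set((1, 2), (2, 10), (3, 5), (7, 15))
val y = Set((1, 200), (3, 500))
val keys = y.map(_._1)
assert(x.filterNot(p => keys(p._1)) == Set((2, 10), (7, 15)))
assert(x.toMap.filterKeys(k => !(keys contains k)).toSet == Set((2, 10), (7, 15)))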

How to select just the first or last record matching a where clause with ScalaQuery?

Having the following query template to select all:
val q = for {
  a <- Parameters[Int]
  b <- Parameters[Int]
  t <- T if t.a == a && t.b == b
  _ <- Query.orderBy(t.c, t.d)
} yield t
I need to modify it to select only the very first record (minimum c, and minimum d for that c) or the very last one (maximum c, and maximum d for that c) among those matching the where condition. I'd strongly prefer that no other records be selected, as there are hundreds of thousands of them...
There's a potential danger here in how the OP's query is currently constructed. Run as is, getting the first or last result of a 100K-row result set is not terribly efficient (unlikely, yes, but the point is that the query places no limit on the number of rows returned).
With straight SQL you would never do such a thing; instead you would tack on a LIMIT 1.
In ScalaQuery, LIMIT = take(n), so add take(1) to get a single record returned from the query itself:
val q = (for {
  a <- Parameters[Int]
  b <- Parameters[Int]
  t <- T if t.a == a && t.b == b
  _ <- Query.orderBy(t.c, t.d)
} yield t) take(1)

q.firstOption
There is a method firstOption defined on the Invoker trait, and by some magic it is available on the Query class. So maybe you can try it like this:
val q = for {
  a <- Parameters[Int]
  b <- Parameters[Int]
  t <- T if t.a == a && t.b == b
  _ <- Query.orderBy(t.c, t.d)
} yield t

q.firstOption
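To fetch the very last record under the same predicate, the usual trick is to flip the sort direction and again keep a single row; a sketch, assuming ScalaQuery supports descending ordering via .desc on columns (as its successor Slick does):
val qLast = (for {
  a <- Parameters[Int]
  b <- Parameters[Int]
  t <- T if t.a == a && t.b == b
  _ <- Query.orderBy(t.c.desc, t.d.desc) // maximum c, then maximum d for that c
} yield t) take(1)

qLast.firstOption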