Consolidating a data table in Scala - scala

I am working on a small data analysis tool, and practicing/learning Scala in the process. However I got stuck at a small problem.
Assume data of type:
X Gr1 x_11 ... x_1n
X Gr2 x_21 ... x_2n
..
X GrK x_k1 ... x_kn
Y Gr1 y_11 ... y_1n
Y Gr3 y_31 ... y_3n
..
Y Gr(K-1) ...
Here I have entries (X,Y...) that may or may not exist in up to K groups, with a series of values for each group. What I want to do is pretty simple (in theory), I would like to consolidate the rows that belong to the same "entity" in different groups. so instead of multiple lines that start with X, I want to have one row with all values from x_11 to x_kn in columns.
What makes things complicated however is that not all entities exist in all groups. So wherever there's "missing data" I would like to pad with for instance zeroes, or some string that denotes a missing value. So if I have (X,Y,Z) in up to 3 groups, the type I table I want to have is as follows:
X x_11 x_12 x_21 x_22 x_31 x_32
Y y_11 y_12 N/A N/A y_31 y_32
Z N/A N/A z_21 z_22 N/A N/A
I have been stuck trying to figure this out, is there a smart way to use List functions to solve this?
I wrote this simple loop:
for {
(id, hitlist) <- hits.groupBy(_.acc)
h <- hitlist
} println(id + "\t" + h.sampleId + "\t" + h.ratios.mkString("\t"))
to able to generate the tables that look like the example above. Note that, my original data is of a different format and layout,but that has little to do with the problem at hand, thus I have skipped all steps regarding parsing. I should be able to use groupBy in a better way that actually solves this for me, but I can't seem to get there.
Then I modified my loop mapping the hits to ratios and appending them to one another:
for ((id, hitlist) <- hits.groupBy(_.acc)){
val l = hitlist.map(_.ratios).foldRight(List[Double]()){
(l1: List[Double], l2: List[Double]) => l1 ::: l2
}
println(id + "\t" + l.mkString("\t"))
//println(id + "\t" + h.sampleId + "\t" + h.ratios.mkString("\t"))
}
That gets me one step closer but still no cigar! Instead of a fully padded "matrix" I get a jagged table. Taking the example above:
X x_11 x_12 x_21 x_22 x_31 x_32
Y y_11 y_12 y_31 y_32
Z z_21 z_22
Any ideas as to how I can pad the table so that values from respective groups are aligned with one another? I should be able to use _.sampleId, which holds the "group membersip" for each "hit", but I am not sure how exactly. ´hits´ is a List of type Hit which is practically a wrapper for each row, giving convenience methods for getting individual values, so essentially a tuple which have "named indices" (such as .acc, .sampleId..)
(I would like to solve this problem without hardcoding the number of groups, as it might change from case to case)
Thanks!

This is a bit of a contrived example, but I think you can see where this is going:
case class Hit(acc:String, subAcc:String, value:Int)
val hits = List(Hit("X", "x_11", 1), Hit("X", "x_21", 2), Hit("X", "x_31", 3))
val kMax = 4
val nMax = 2
for {
(id, hitlist) <- hits.groupBy(_.acc)
k <- 1 to kMax
n <- 1 to nMax
} yield {
val subId = "x_%s%s".format(k, n)
val row = hitlist.find(h => h.subAcc == subId).getOrElse(Hit(id, subId, 0))
println(row)
}
//Prints
Hit(X,x_11,1)
Hit(X,x_12,0)
Hit(X,x_21,2)
Hit(X,x_22,0)
Hit(X,x_31,3)
Hit(X,x_32,0)
Hit(X,x_41,0)
Hit(X,x_42,0)
If you provide more information on your hits lists then we could probably come with something a little more accurate.

I have managed to solve this problem with the following code, I am putting it here as an answer in case someone else runs into a similar problem and requires some help. The use of find() from Noah's answer was definitely very useful, so do give him a +1 in case this code snippet helps you out.
val samples = hits.groupBy(_.sampleId).keys.toList.sorted
for ((id, hitlist) <- hits.groupBy(_.acc)) {
val ratios =
for (sample <- samples)
yield hitlist.find(h => h.sampleId == sample).map(_.ratios)
.getOrElse(List(Double.NaN, Double.NaN, Double.NaN, Double.NaN, Double.NaN, Double.NaN))
println(id + "\t" + ratios.flatten.mkString("\t"))
}
I figure it's not a very elegant or efficient solution, as I have two calls to groupBy and I would be interested to see better solutions to this problem.

Related

Scala Range.Double missing last element

I am trying to create a list of numBins numbers evenly spaced in the range [lower,upper). Of course, there are floating point issues and this approach is not the best. The result of using Range.Double, however, surprises me as the element missing is not close to the upper bound at all.
Setup:
val lower = -1d
val upper = 1d
val numBins = 11
val step = (upper-lower)/numBins // step = 0.18181818181818182
Problem:
scala> Range.Double(lower, upper, step)
res0: scala.collection.immutable.NumericRange[Double] = NumericRange(-1.0, -0.8181818181818182, -0.6363636363636364, -0.45454545454545453, -0.2727272727272727, -0.0909090909090909, 0.09090909090909093, 0.27272727272727276, 0.4545454545454546, 0.6363636363636364)
Issue: The list seems to be one element short. 0.8181818181818183 is one step further, and is less than 1.
Workaround:
Scala> for (bin <- 0 until numBins) yield lower + bin * step
res1: scala.collection.immutable.IndexedSeq[Double] = Vector(-1.0, -0.8181818181818181, -0.6363636363636364, -0.4545454545454546, -0.2727272727272727, -0.09090909090909083, 0.09090909090909083, 0.2727272727272727, 0.4545454545454546, 0.6363636363636365, 0.8181818181818183)
This result now contains the expected number of elements, including 0.818181..
I think the root cause of your problem is some features in implementation of toString for NumericRange
217 override def toString() = {
218 val endStr = if (length > Range.MAX_PRINT) ", ... )" else ")"
219 take(Range.MAX_PRINT).mkString("NumericRange(", ", ", endStr)
220 }
UPD: It's not about toString. Some other methods like map and foreach cut last elements from returned collection.
Anyway by checking size of collection you've got - you'll find out - all elements are there.
What you've done in your workaround example - is used different underlying datatype.

partial Distance Based RDA - Centroids vanished from Plot

I am trying to fir a partial db-RDA with field.ID to correct for the repeated measurements character of the samples. However including Condition(field.ID) leads to Disappearance of the centroids of the main factor of interest from the plot (left plot below).
The Design: 12 fields have been sampled for species data in two consecutive years, repeatedly. Additionally every year 3 samples from reference fields have been sampled. These three fields have been changed in the second year, due to unavailability of the former fields.
Additionally some environmental variables have been sampled (Nitrogen, Soil moisture, Temperature). Every field has an identifier (field.ID).
Using field.ID as Condition seem to erroneously remove the F1 factor. However using Sampling campaign (SC) as Condition does not. Is the latter the rigth way to correct for repeated measurments in partial db-RDA??
set.seed(1234)
df.exp <- data.frame(field.ID = factor(c(1:12,13,14,15,1:12,16,17,18)),
SC = factor(rep(c(1,2), each=15)),
F1 = factor(rep(rep(c("A","B","C","D","E"),each=3),2)),
Nitrogen = rnorm(30,mean=0.16, sd=0.07),
Temp = rnorm(30,mean=13.5, sd=3.9),
Moist = rnorm(30,mean=19.4, sd=5.8))
df.rsp <- data.frame(Spec1 = rpois(30, 5),
Spec2 = rpois(30,1),
Spec3 = rpois(30,4.5),
Spec4 = rpois(30,3),
Spec5 = rpois(30,7),
Spec6 = rpois(30,7),
Spec7 = rpois(30,5))
data=cbind(df.exp, df.rsp)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(SC), df.exp); ordiplot(dbRDA)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(field.ID), df.exp); ordiplot(dbRDA)
You partial out variation due to ID and then you try to explain variable aliased to this ID, but it was already partialled out. The key line in the printed output was this:
Some constraints were aliased because they were collinear (redundant)
And indeed, when you ask for details, you get
> alias(dbRDA, names=TRUE)
[1] "F1B" "F1C" "F1D" "F1E"
The F1? variables were constant within ID which already was partialled out, and nothing was left to explain.

Calculate sums of even/odd pairs on Hadoop?

I want to create a parallel scanLeft(computes prefix sums for an associative operator) function for Hadoop (scalding in particular; see below for how this is done).
Given a sequence of numbers in a hdfs file (one per line) I want to calculate a new sequence with the sums of consecutive even/odd pairs. For example:
input sequence:
0,1,2,3,4,5,6,7,8,9,10
output sequence:
0+1, 2+3, 4+5, 6+7, 8+9, 10
i.e.
1,5,9,13,17,10
I think in order to do this, I need to write an InputFormat and InputSplits classes for Hadoop, but I don't know how to do this.
See this section 3.3 here. Below is an example algorithm in Scala:
// for simplicity assume input length is a power of 2
def scanadd(input : IndexedSeq[Int]) : IndexedSeq[Int] =
if (input.length == 1)
input
else {
//calculate a new collapsed sequence which is the sum of sequential even/odd pairs
val collapsed = IndexedSeq.tabulate(input.length/2)(i => input(2 * i) + input(2*i+1))
//recursively scan collapsed values
val scancollapse = scanadd(collapse)
//now we can use the scan of the collapsed seq to calculate the full sequence
val output = IndexedSeq.tabulate(input.length)(
i => i.evenOdd match {
//if an index is even then we can just look into the collapsed sequence and get the value
// otherwise we can look just before it and add the value at the current index
case Even => scancollapse(i/2)
case Odd => scancollapse((i-1)/2) + input(i)
}
output
}
I understand that this might need a fair bit of optimization for it to work nicely with Hadoop. Translating this directly I think would lead to pretty inefficient Hadoop code. For example, Obviously in Hadoop you can't use an IndexedSeq. I would appreciate any specific problems you see. I think it can probably be made to work well, though.
Superfluous. You meant this code?
val vv = (0 to 1000000).grouped(2).toVector
vv.par.foldLeft((0L, 0L, false))((a, v) =>
if (a._3) (a._1, a._2 + v.sum, !a._3) else (a._1 + v.sum, a._2, !a._3))
This was the best tutorial I found for writing an InputFormat and RecordReader. I ended up reading the whole split as one ArrayWritable record.

Use forall instead of filter on List[A]

Am trying to determine whether or not to display an overtime game display flag in weekly game results report.
Database game results table has 3 columns (p4,p5,p6) that represent potential overtime game period score total ( for OT, Double OT, and Triple OT respectively). These columns are mapped to Option[Int] in application layer.
Currently I am filtering through game result teamA, teamB pairs, but really I just want to know if an OT game exists of any kind (vs. stepping through the collection).
def overtimeDisplay(a: GameResult, b: GameResult) = {
val isOT = !(List(a,b).filter(_.p4.isDefined).filter(_.p5.isDefined).filter(_.p6.isDefined).isEmpty)
if(isOT) {
<b class="b red">
{List( ((a.p4,a.p5,a.p6),(b.p4,b.p5,b.p6)) ).zipWithIndex.map{
case( ((Some(_),None,None), (Some(_),None,None)), i)=> "OT"
case( ((Some(_),Some(_),None), (Some(_),Some(_),None )), i)=> "Double OT"
case( ((Some(_),Some(_),Some(_)), (Some(_),Some(_),Some(_) )), i)=> "Triple OT"
}}
</b>
}
else scala.xml.NodeSeq.Empty
}
Secondarily, the determination of which type of overtime to display, currently that busy pattern match (which, looking at it now, does not appear cover all the scoring scenarios), could probably be done in a more functional/concise manner.
Feel free to lay it down if you have the better way.
Thanks
Not sure if I understand the initial code correctly, but here is an idea:
val results = List(a, b).map(r => Seq(r.p4, r.p5, r.p6).flatten)
val isOT = results.exists(_.nonEmpty)
val labels = IndexedSeq("", "Double ", "Triple ")
results.map(p => labels(p.size - 1) + "OT")
Turning score column to flat list in first line is crucial here. You have GameResult(p4: Option[Int], p5: Option[Int], p6: Option[Int]) which you can map to Seq[Option[Int]]: r => Seq(r.p4, r.p5, r.p6) and later flatten to turn Some[Int] to Int and get rid of None. This will turn Some(42), None, None into Seq(42).
Looking at this:
val isOT = !(List(a,b).filter(_.p4.isDefined).filter(_.p5.isDefined).filter(_.p6.isDefined).isEmpty)
This can be rewritten using exists instead of filter. I would rewrite it as follows:
List(a, b).exists(x => x.p4.isDefined && x.p5.isDefined && x.p6.isDefined)
In addition to using exists, I am combining the three conditions you passed to the filters into a single anonymous function.
In addition, I don't know why you're using zipWithIndex when it doesn't seem as though you're using the index in the map function afterwards. It could be removed entirely.

How to generate a list of random numbers?

This might be the least important Scala question ever, but it's bothering me. How would I generate a list of n random number. What I have so far:
def n_rands(n : Int) = {
val r = new scala.util.Random
1 to n map { _ => r.nextInt(100) }
}
Which works, but doesn't look very Scalarific to me. I'm open to suggestions.
EDIT
Not because it's relevant so much as it's amusing and obvious in retrospect, the following looks like it works:
1 to 20 map r.nextInt
But the index of each entry in the returned list is also the upper bound of that last. The first number must be less than 1, the second less than 2, and so on. I ran it three or four times and noticed "Hmmm, the result always starts with 0..."
You can either use Don's solution or:
Seq.fill(n)(Random.nextInt)
Note that you don't need to create a new Random object, you can use the default companion object Random, as stated above.
How about:
import util.Random.nextInt
Stream.continually(nextInt(100)).take(10)
regarding your EDIT,
nextInt can take an Int argument as an upper bound for the random number, so 1 to 20 map r.nextInt is the same as 1 to 20 map (i => r.nextInt(i)), rather than a more useful compilation error.
1 to 20 map (_ => r.nextInt(100)) does what you intended. But it's better to use Seq.fill since that more accurately represents what you're doing.