I've seen how to use tMap in TOS to map different fields in a SQL-like JOIN. How do I aggregate based on certain fields?
If I have two tables:
[ A, B, C, D ]
[ B, E, F, G ]
that are tMap'ped to [ B, C, F, G ],
how can I aggregate the results so that, instead of many entries for each non-unique B, I see something like:
[ B, count(B), avg(C), avg(F), avg(G) ]
Thanks!
You certainly can. Use the tAggregateRow component to do that: group by column B, then compute the different aggregations (count, sum, average, and so on) on the other columns.
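For reference, here is the same group-by aggregation sketched in plain Scala (the row values are made up for illustration; in the job itself tAggregateRow does this for you):

```scala
// Rows after the tMap join: (B, C, F, G) tuples, grouped on B.
val rows = Seq(("x", 1.0, 2.0, 3.0), ("x", 3.0, 4.0, 5.0), ("y", 2.0, 2.0, 2.0))

// Produces [ B, count(B), avg(C), avg(F), avg(G) ] per group.
val aggregated = rows.groupBy(_._1).map { case (b, group) =>
  val n = group.size
  (b, n, group.map(_._2).sum / n, group.map(_._3).sum / n, group.map(_._4).sum / n)
}
```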
I would like to split my RDD regarding commas and access a predefined set of elements.
For example, I have a RDD like that:
a, b, c, d
e, f, g, h
and I need to split then access the first and fourth element on the first line and the second and third element on the second line to get this resulting RDD:
a, d
f, g
I can't hard-code "1" and "4" in my code, which is why a solution like this won't work:
rdd.map { line =>
  val words = line.split(",")
  (words(0), words(3))
}
Let's assume I have a second RDD with the same number of lines, containing the positions of the elements I want for each line:
1,4
2,3
Is there a way to get my elements ?
If you have a second RDD that already has the numbers of the groups you want for each line, you could zip them.
From Spark docs:
<U> RDD<scala.Tuple2<T,U>> zip(RDD<U> other, scala.reflect.ClassTag<U> evidence$13)
Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc.
So in your example, a, b, c, d would be in a key-value pair with 1,4, and e, f, g, h with 2,3. So you could do something like:
val groupNumbers = lettersRDD zip numbersRDD
groupNumbers.map { case (line, nums) =>
  // parse the 1-based positions, e.g. "1,4" -> Seq(1, 4)
  val numbers: Seq[Int] = nums.split(",").map(_.trim.toInt)
  val words = line.split(",").map(_.trim)
  (words(numbers.head - 1), words(numbers(1) - 1))
}
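The same zip-and-select logic can be sketched without Spark, on plain Scala collections (the 1-based positions are an assumption taken from the example):

```scala
val lines   = Seq("a, b, c, d", "e, f, g, h")
val indices = Seq(Seq(1, 4), Seq(2, 3)) // 1-based positions, one list per line

// Pair each line with its positions, then pick out the requested elements.
val selected = lines.zip(indices).map { case (line, nums) =>
  val words = line.split(",").map(_.trim)
  nums.map(n => words(n - 1))
}
// selected == Seq(Seq("a", "d"), Seq("f", "g"))
```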
Supposed we have a db called A. The structure of A can be:
1) A( a, b, c, d).
a, b, c, d are collections.
And the element in each collection is like { _id:id, data : data }
2) A(k).
k(a, b, c, d)
k is a collection, and a, b, c, d are elements inside k.
a, b, c, d are like
{
type : 'a / b / c / d',
data : [
{_id : id1, data : data1 },
{_id : id2, data : data2},
...
]
}
The daily operations on a, b, c, and d are: get, insert an element into, and empty an element of each.
Which approach is better in terms of efficiency?
#Markus-W-Mahlberg is right about your actual use case.
Since MongoDB stores documents rather than tabular data (as MS SQL does), both of your approaches work fine, and with the right indexes you get the same performance.
But in my opinion, if your types (a, b, c, and d) have different structures (different properties, queries, update scenarios, aggregation plans, and so on), use way 1; otherwise use way 2 with the right index.
I recently encountered a DB2 table that has three different indexes that are unique.
Index 1 (Columns: A, B, C)
Index 2 (Columns: A, B, C, D)
Index 3 (Columns: A, B, C, D, E)
Is the most specific one the actual unique index? Or does the definition of uniqueness differ depending on which index DB2 uses to access the table?
I'm a bit confused, since index 1 suggests that as long as my values for A, B, C are unique, I can have duplicate values for D and E. But then there's index 3 saying that A, B, C, D, E are unique, so I can't have duplicate values for D and E after all?
Quite the opposite: the only unique index that counts (for uniqueness) is Index 1. If (A, B, C) must be unique, then (A, B, C, D) and (A, B, C, D, E) are automatically unique as well, so the wider indexes add no further constraint.
I haven't tried it, but for access purposes DB2 will use whichever index is best for the actual query you are performing.
For instance, if you are querying { A=1, B=2, C=3 }, it should use Index 1;
if you are querying { A=1, B=2, C=3, D=4 }, it should use Index 2, even though it could just use Index 1; you won't see any performance gain either way.
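The implication above can be checked with a small sketch (the rows are made up): any set of rows that is unique on (A, B, C) is necessarily unique on the wider column sets as well.

```scala
// Rows as (A, B, C, D, E) tuples, unique on the (A, B, C) prefix.
val rows = Seq((1, 2, 3, 4, 5), (1, 2, 4, 4, 5), (2, 2, 3, 9, 9))

val abc = rows.map { case (a, b, c, _, _) => (a, b, c) }
assert(abc.distinct.size == rows.size)  // unique on (A, B, C)...
assert(rows.distinct.size == rows.size) // ...hence unique on all five columns too
```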
I have 2 collections: a is a sequence of Scala objects of class C. b is a sequence of strings. C has a string field, name, that could possibly match an item in b. What I want is to loop through a and find all c.name that matches with one of the item in b. How do I do this in Scala?
Iterating through both a and b can get expensive, because one loop nested inside another yields O(n^2) time. If b is sufficiently large, you probably want to turn it into a Set first, which brings each membership test down to constant time and the whole operation down to O(n).
val bSet = b.toSet
a.filter(c => bSet.contains(c.name))
I would read this as "Apply the following filter to a: for each item c in a, include it in the result if and only if the name of c is in b."
Here's the equivalent for comprehension with yield.
for (c <- a if bSet.contains(c.name)) yield c
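Putting it together with a minimal, hypothetical version of class C (only the name field from the question is assumed):

```scala
case class C(name: String)

val a = Seq(C("foo"), C("bar"), C("baz"))
val b = Seq("bar", "baz", "qux")

val bSet = b.toSet // constant-time membership tests
val matches = a.filter(c => bSet.contains(c.name))
// matches == Seq(C("bar"), C("baz"))
```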
In Python
def cross(A, B):
"Cross product of elements in A and elements in B."
return [a+b for a in A for b in B]
returns a one-dimensional array if you call it with two arrays (or strings).
But in CoffeeScript
cross = (A, B) -> (a+b for a in A for b in B)
returns a two-dimensional array.
Do you think it's by design in CoffeeScript or is it a bug?
How do I flatten arrays in CoffeeScript?
First I would say that two array comprehensions on one line is not a very maintainable pattern. So let's break it down a little.
cross = (A, B) ->
  for a in A
    for b in B
      a + b
alert JSON.stringify(cross [1,2], [3,4])
What's happening here is that the inner loop creates a closure, which has its own comprehension collector. So it runs through all the b's, then returns the results as an array, which gets pushed onto the parent comprehension's result collector. You are effectively expecting a return value from an inner loop, which is a bit funky.
Instead I would simply collect the results myself.
cross = (A, B) ->
  results = []
  for a in A
    for b in B
      results.push a + b
  results
alert JSON.stringify(cross [1,2], [3,4])
Or if you still wanted to do some crazy comprehension magic:
cross = (A, B) ->
  results = []
  results = results.concat a+b for b in B for a in A
  results
alert JSON.stringify(cross [1,2], [3,4])
Whether this is a bug in CS or not is a bit debatable, I suppose. But I would argue it's good practice to do more explicit comprehension result handling when dealing with nested iterators.
https://github.com/jashkenas/coffee-script/issues/1191
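For comparison, a Scala for comprehension with two generators flattens the same way Python's single list comprehension does; the nesting only appears when you write two separate comprehensions, as in the CoffeeScript above. (A sketch for contrast, not from the original answers.)

```scala
// Two generators in one comprehension desugar to flatMap + map,
// so the result is a flat sequence, like Python's version of cross.
def cross(xs: Seq[String], ys: Seq[String]): Seq[String] =
  for (a <- xs; b <- ys) yield a + b
// cross(Seq("a", "b"), Seq("1", "2")) == Seq("a1", "a2", "b1", "b2")
```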