Consider the following example:
List(0, 3, 2, 1) map List("A", "B", "C", "D")
It gives List(A, D, C, B).
Is this order always maintained, or can it change from one execution to another?
I want to reorder the list by a generator function (an index generator).
Can we rely on the order in which the map method processes elements, or is it unspecified?
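For a strict sequential collection like List, map is guaranteed to traverse and emit elements in order. A minimal sketch, using the fact that a List is itself a function from index to element:
val data  = List("A", "B", "C", "D")
val order = List(0, 3, 2, 1)    // the index generator's output
val reordered = order.map(data) // List(A, D, C, B), deterministically
The same reordering works with any index-generating function, as long as it produces valid indices into data.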
Rephrasing of my question:
I am writing a program that implements a data mining algorithm. In this program I want to store the input data that is to be mined. Imagine the input data as a table with rows and columns. Each row will be represented by an instance of my Scala class (the one in question). The columns of the input data can be of different types (Integer, Double, String, whatnot), and which types occur depends on the input data. I need a way to store a row inside my Scala class instance. Thus I need an ordered collection (like a special List) that can hold (many) elements of different types, where the types may only be determined at runtime. How can I do this? A Vector or a List requires all elements to be of the same type. A Tuple can hold different types (which can be determined at runtime, if I am not mistaken), but only up to 22 elements, which is too few.
Bonus (not sure if I am asking too much now):
I would also like the rows' columns to be named and accessible by name. However, I think this problem can easily be solved by using two lists. (Although I just read about this issue somewhere - I forgot where - and I think it was solved more elegantly.)
It might be good for my collection to be random access (so Vector rather than List).
Having linear algebra (matrix multiplication etc.) capabilities would be nice.
Even more of a bonus: if I could store matrices.
Old phrasing of my question:
I would like to have something like a data.frame as we know it from R in Scala, but I am only going to need one row. This row is going to be a member of a class. The reason for this construct is that I want the methods related to each row to be close to the data itself. Each data row is also supposed to carry metadata about itself, and it will be possible to supply functions so that different rows are manipulated differently. However, I need to store the rows somehow within the class. A List or Vector comes to mind, but they require all elements to be of the same type (all Integer, all String, etc.) - whereas, as we know from data.frame, different columns (here: elements of the Vector or List) can be of different types. I would also like to store the name of each column, to be able to access the row values by column name. That seems the smallest issue, though. I hope it is clear what I mean. How can I implement this?
DataFrames in R are heterogeneous lists of homogeneous column vectors:
> df <- data.frame(c1=c(r1=1,r2=2), c2=c('a', 'b')); df
c1 c2
r1 1 a
r2 2 b
You could think of each row as a heterogeneous list of scalar values:
> as.list(df['r1',])
$c1
[1] 1
$c2
[1] a
An analogous implementation in Scala would be a tuple of lists:
scala> val df = (List(1, 2), List('a', 'b'))
df: (List[Int], List[Char]) = (List(1, 2),List(a, b))
Each row could then just be a tuple:
scala> val r1 = (1, 'a')
r1: (Int, Char) = (1,a)
If you want to name all your variables, another possibility is a case class:
scala> case class Row (col1:Int, col2:Char)
defined class Row
scala> val r1 = Row(col1=1, col2='a')
r1: Row = Row(1,a)
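Access by field name then comes for free with the case class (continuing the same REPL session):
scala> r1.col1
res0: Int = 1
scala> r1.col2
res1: Char = a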
Hope that helps bridge the R-to-Scala divide.
In Spark's RDDs and DStreams we have the reduce function for collapsing an entire RDD into one element. However, reduce takes a function of type (T, T) => T.
By contrast, if we want to reduce a List in Scala we can use foldLeft or foldRight, which take a zero of type B and a function (B, A) => B. This is very useful because you can start folding with a type other than the one in your list.
Is there a way in Spark to do something similar, where I can start with a value whose type differs from that of the elements in the RDD itself?
Use aggregate instead of reduce. It lets you specify a "zero" value of type B and a function of the kind you want: (B, A) => B. Do note that you also need to merge separate aggregations done on separate executors, so a (B, B) => B function is also required.
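A minimal sketch, assuming a SparkContext named sc; it builds a String (result type B) out of an RDD[Int] (element type A):
val rdd = sc.parallelize(Seq(1, 2, 3, 4))
val digits: String = rdd.aggregate("")(
  (acc, x) => acc + x.toString, // (B, A) => B, applied within each partition
  (a, b)   => a + b             // (B, B) => B, merges the partition results
)
Note that the order in which partition results are merged is not guaranteed, so in real code the combine function should not depend on ordering.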
Alternatively, if you want this aggregation as a side effect, an option is to use an accumulator. In particular, the Accumulable type allows the result type to differ from the accumulated type.
Also, if you ever need to do the same with a key-value RDD, use aggregateByKey.
I have a LinkedHashSet which was created from a Seq. I used a LinkedHashSet because I need to keep the order of the Seq but also ensure uniqueness, like a Set. I need to check this LinkedHashSet against another sequence to verify that various properties within them are the same. I assumed that I could loop through it using an index, i, but apparently I cannot. Here is an example of what I would like to accomplish:
// Pseudocode: I want to compare the i-th element of each collection,
// but a Set has no apply-by-index, so myLHS(i) below does not compile.
val s: Seq[Int] = 0 until mySeq.size
s.forall { i =>
  myLHS(i).something == mySeq(i).something &&
  myLHS(i).somethingelse == mySeq(i).somethingelse
}
So how do I access individual elements of the LHS?
Consider using the zip method on collections to create a collection of pairs (Tuples). The details depend on your situation: you may want mySeq.zip(myLHS) or myLHS.zip(mySeq), which create differently shaped pairs. You probably want mySeq.zip(myLHS), but I'm guessing. Also, if the collections are very large, you may want to take a view first, e.g. mySeq.view.zip(myLHS), so that the pair collection is also non-strict.
Once you have this combined collection, you can use a for-comprehension (or directly, myZip.foreach) to traverse it.
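For example, a hedged sketch reusing the names from the question (something and somethingelse are assumed to be fields on the elements):
val allMatch = mySeq.zip(myLHS).forall { case (a, b) =>
  a.something == b.something &&
  a.somethingelse == b.somethingelse
}
Zipping works here because a LinkedHashSet iterates in insertion order, so the pairs line up positionally without any index arithmetic. (zip truncates to the shorter collection, so check the sizes first if they may differ.)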
A LinkedHashSet is not necessary in this situation. Since I made it from a Seq, it is already ordered; I do not have to convert it to a LinkedHashSet just to make it unique. Seq has the distinct method, which removes duplicates from the sequence while preserving order. From there, I can access the items via their indexes.
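A quick illustration of that approach:
val mySeq  = Seq(3, 1, 3, 2, 1)
val unique = mySeq.distinct // Seq(3, 1, 2): first occurrences, order kept
unique(0)                   // 3: plain positional access, no conversion needed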
I am a new user of Pentaho, and maybe my question is very simple. I have two streams with identical columns, e.g. stream S1 has the columns A, B, C, and stream S2 has the columns A, B, C (same names, same order, same data types). I want to merge or append these two streams into a single stream containing the columns A, B, C. However, when I use a merge join (with the FULL OUTER JOIN option), the result is a stream with the columns A, B, C, A_1, B_1, C_1, which is not what I want. I tried the append streams step, but in that case nothing appeared in the preview.
As per your requirement, first create the two streams.
Here we have taken two streams, i.e. "stream1.xls" and "stream2.xls".
Then build the transformation using the "Sorted merge" step.
For a better understanding, please refer to the screenshots.
for(elt <- bufferObject: scala.collection.mutable.Buffer)
// Do something with the element of the collection
In which order will the elements in the for loop be accessed? Randomly?
From the Scala API one can see that Buffer is a subclass of Seq, in which the elements are ordered. Does this also hold for the loop above ?
Here's a selection of the super-types of mutable.Buffer[A], and the traversal guarantees they provide:
Seq[A] -- All elements have a position, with an index associated; they are always traversed one by one, from lowest index to highest index.
GenSeq[A] -- All elements have a position, with an index associated; they may be traversed one by one or in parallel; if a new collection is generated, the position of the elements in the new collection will correspond to the old collection, even if the traversal is in parallel.
Iterable[A] -- Elements may be traversed in any order, but will always be traversed in the same order (that is, it can't change from one iteration to another); they'll be traversed one by one.
GenIterable[A] -- Elements may be traversed in any order, but will always be traversed in the same order (that is, it can't change from one iteration to another); traversal may happen one by one or in parallel; if a new collection is generated, the position of the elements in the new collection will correspond to the old collection, even if the traversal is in parallel.
Traversable[A] -- Same guarantees as Iterable[A], with the additional limitation that you can interrupt traversal, but you cannot determine when the next element will be chosen (which you can in Iterable[A] and descendants, by producing an Iterator).
GenTraversable[A] -- Same guarantees as GenIterable[A], with the additional limitation that you can interrupt traversal, but you cannot determine when the next element will be chosen (which you can in GenIterable[A] and descendants, by producing an Iterator).
TraversableOnce[A] -- Same guarantees as in Traversable[A], with the additional limitation that you might not be able to traverse the elements more than once.
GenTraversableOnce[A] -- Same guarantees as in GenTraversable[A], with the additional limitation that you might not be able to traverse the elements more than once.
Now, all of these guarantees apply simultaneously, and the least limited one prevails, which effectively means that everything said about Seq[A] holds true for mutable.Buffer[A].
Now, to the for loop:
for(elt <- bufferObject: scala.collection.mutable.Buffer)
doSomething(elt)
is the same thing as:
bufferObject.foreach(elt => doSomething(elt))
And
for(elt <- bufferObject: scala.collection.mutable.Buffer)
yield makeSomething(elt)
is the same thing as:
bufferObject.map(elt => makeSomething(elt))
In fact, all for variants will be translated into methods available on Buffer (or whatever other type you have there), so the guarantees provided by the collections all apply. Note, for instance, that a GenSeq used with map (for-yield) may process all elements in parallel, but will produce a collection where newCollection(i) == makeSomething(bufferObject(i)), that is, the indices are preserved.
Yes, the for-comprehension will desugar to some combination of map, flatMap, and foreach, and these all follow the Seq's defined order.
Unless you're using parallel collections (via the par method), the order of operations (for comprehensions, map, foreach and other sequential methods) on a mutable Buffer is guaranteed.
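A small sketch of the contrast, assuming Scala 2.12 or earlier (where parallel collections are still part of the standard library):
import scala.collection.mutable.Buffer
Buffer(1, 2, 3, 4, 5).foreach(print)     // always prints 12345
Buffer(1, 2, 3, 4, 5).par.foreach(print) // unspecified order, e.g. 45123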
It should be ordered. Buffer defaults to ArrayBuffer, I believe.
scala> import scala.collection.mutable.Buffer
import scala.collection.mutable.Buffer
scala> val x = Buffer(1, 2, 3, 4, 5)
x: scala.collection.mutable.Buffer[Int] = ArrayBuffer(1, 2, 3, 4, 5)
scala> for (y <- x) println(y)
1
2
3
4
5