An SO answer by Jerry includes this use of deep:
println(k.deep)
Works as described:
scala> println(Array(10, 20, 30, 40).deep)
Array(10, 20, 30, 40)
I am looking for documentation on deep for an Array. I go to the Scala Standard Library 2.13.0 Array page and search the page for deep, but get no matches.
What is wrong with this sequence of steps?
It seems it has been removed from Scala 2.13 according to https://github.com/scala/bug/issues/10985:
It's a hacky ugly testing utility to print values in (nested) arrays.
If you feel strongly about it, we can add it deprecated.
You can still find it in the 2.12 docs and in the 2.12 branch:
/** Creates a possible nested `IndexedSeq` which consists of all the elements
* of this array. If the elements are arrays themselves, the `deep` transformation
* is applied recursively to them. The `stringPrefix` of the `IndexedSeq` is
* "Array", hence the `IndexedSeq` prints like an array with all its
* elements shown, and the same recursively for any subarrays.
*
* Example:
* {{{
* Array(Array(1, 2), Array(3, 4)).deep.toString
* }}}
* prints: `Array(Array(1, 2), Array(3, 4))`
*
* @return A possibly nested indexed sequence consisting of all the elements of the array.
*/
def deep: scala.collection.IndexedSeq[Any]
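If you need similar output on 2.13, here is a minimal replacement sketch (deepString is a made-up helper name, not a library method):
// Hypothetical recursive helper approximating the old deep output
def deepString(a: Any): String = a match {
  case arr: Array[_] => arr.map(deepString).mkString("Array(", ", ", ")")
  case other         => String.valueOf(other)
}

println(deepString(Array(Array(1, 2), Array(3, 4)))) // Array(Array(1, 2), Array(3, 4))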
I am new to Scala, and I want to get the size of each of the ArraySeqs below (split by comma):
ArraySeq(A,B,C,D) // 4
ArraySeq(A,B,C) // 3
ArraySeq(A,B) // 2
ArraySeq(A,B,C,D,E) // 5
I would also like to access each element of the ArraySeq.
Can anyone help me with that?
You can use .size on any of the Scala collections.
e.g. ArraySeq(A,B,C,D).size would return 4
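A runnable version of that, treating the placeholder elements as strings (an assumption, since A through D are not defined in the question):
import scala.collection.immutable.ArraySeq // Scala 2.13+

val letters = ArraySeq("A", "B", "C", "D")
letters.size // 4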
There are a few ways to go about accessing the elements of the sequence
val example = ArraySeq(1,2,3,4,5)
// Using the apply method
example(1) // Result 2
example(10) // ArrayIndexOutOfBoundsException is thrown
// Using .lift to avoid exceptions (Seq has no .get method; lift returns an Option)
example.lift(2) // Result Some(3)
example.lift(10) // Result None
// You can transform all elements using map
example.map(_ * 2) // Result ArraySeq(2,4,6,8,10)
// Or do something with all elements
example.foreach(println) // prints 1,2,3,4,5 separated by newlines
If you have a nested ArraySeq (i.e., ArraySeq[ArraySeq[T]]) you can get the total size using:
val exampleNested = ArraySeq(ArraySeq(1,2),ArraySeq(3,4,5), ArraySeq(6,7,8))
exampleNested.map(_.size).sum // Returns total size of 8
You can also flatten the nested collections to a single array seq for ease of use with
exampleNested.flatten // Result ArraySeq(1, 2, 3, 4, 5, 6, 7, 8)
I just encountered an issue with degrading fs2 performance using a stream of strings to be written to a file via text.utf8encode. I tried to change my source to use chunked strings to increase performance, but I observed performance degradation instead.
As far as I can see, it boils down to the following: Invoking flatMap on a stream that originates from Stream.emits() can be very expensive. Time usage seems to be exponential based on the size of the sequence passed to Stream.emits(). The code snippet below shows an example:
/*
Test done with scala 2.11.11 and fs2 version 0.10.0-M7.
*/
val rangeSize = 20000
val integers = (1 to rangeSize).toVector
// Note that the trailing flatMap calls are only added to show the extreme load for streamA.
val streamA = Stream.emits(integers).flatMap(Stream.emit(_))
val streamB = Stream.range(1, rangeSize + 1).flatMap(Stream.emit(_))
streamA.toVector // Uses approx. 25 seconds (!)
streamB.toVector // Uses approx. 15 milliseconds
Is this a bug, or should usage of Stream.emits() for large sequences be avoided?
TLDR: Allocations.
Longer answer:
Interesting question. I ran a JFR profile on both methods separately and looked at the results. The first thing that immediately caught my eye was the number of allocations.
Stream.emit: [JFR allocation profile screenshot showing a large number of Append allocations]
Stream.range: [JFR allocation profile screenshot showing far fewer allocations]
We can see that Stream.emit allocates a significant number of Append instances, the concrete implementation of Catenable[A], which is the type Stream.emit uses to fold:
private[fs2] final case class Append[A](left: Catenable[A], right: Catenable[A]) extends Catenable[A]
This actually comes from the way Catenable[A] is built up via foldLeft in the implementation:
foldLeft(empty: Catenable[B])((acc, a) => acc :+ f(a))
Where :+ allocates a new Append object for each element. This means we're at least generating 20000 such Append objects.
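To make the shape concrete, here is a simplified stand-in (not the real fs2 Catenable, just an illustration of the same pattern): every :+ wraps the whole accumulator in a fresh node, so folding n elements allocates n nodes that later have to be traversed.
// Simplified sketch of the append-based structure, for illustration only
sealed trait Cat[+A]
case object Empty extends Cat[Nothing]
final case class Single[A](a: A) extends Cat[A]
final case class Append[A](left: Cat[A], right: Cat[A]) extends Cat[A]

// Each snoc allocates one new Append node wrapping the accumulator so far
def snoc[A](c: Cat[A], a: A): Cat[A] = c match {
  case Empty => Single(a)
  case _     => Append(c, Single(a))
}

val built = (1 to 20000).foldLeft(Empty: Cat[Int])(snoc) // ~20000 Append nodes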
There is also a hint in the documentation of Stream.range: it produces the range lazily, one value at a time, whereas emits would produce the whole sequence in a single chunk, which may be bad when generating a big range:
/**
* Lazily produce the range `[start, stopExclusive)`. If you want to produce
* the sequence in one chunk, instead of lazily, use
* `emits(start until stopExclusive)`.
*
* @example {{{
* scala> Stream.range(10, 20, 2).toList
* res0: List[Int] = List(10, 12, 14, 16, 18)
* }}}
*/
def range(start: Int, stopExclusive: Int, by: Int = 1): Stream[Pure,Int] =
unfold(start){i =>
if ((by > 0 && i < stopExclusive && start < stopExclusive) ||
(by < 0 && i > stopExclusive && start > stopExclusive))
Some((i, i + by))
else None
}
You can see that there is no additional wrapping here, only the integers that get emitted as part of the range. On the other hand, Stream.emits creates an Append object for every element in the sequence, with left containing everything accumulated so far and right containing the current value.
Is this a bug? I would say no, but I would definitely raise it as a performance issue with the fs2 library maintainers.
I have a big data file loaded in Spark but wish to work on a small portion of it to run the analysis. Is there any way to do that? I tried repartition, but it causes a lot of reshuffling. Is there a good way of processing only a small chunk of a big file loaded in Spark?
In short
You can use the sample() or randomSplit() transformations on an RDD.
sample()
/**
* Return a sampled subset of this RDD.
*
* @param withReplacement can elements be sampled multiple times
* @param fraction expected size of the sample as a fraction of this RDD's size
* without replacement: probability that each element is chosen; fraction must be [0, 1]
* with replacement: expected number of times each element is chosen; fraction must be
* greater than or equal to 0
* @param seed seed for the random number generator
*
* @note This is NOT guaranteed to provide exactly the fraction of the count
* of the given [[RDD]].
*/
def sample(
withReplacement: Boolean,
fraction: Double,
seed: Long = Utils.random.nextLong): RDD[T]
Example:
val sampleWithoutReplacement = rdd.sample(false, 0.2, 2)
randomSplit()
/**
* Randomly splits this RDD with the provided weights.
*
* @param weights weights for splits, will be normalized if they don't sum to 1
* @param seed random seed
*
* @return split RDDs in an array
*/
def randomSplit(
weights: Array[Double],
seed: Long = Utils.random.nextLong): Array[RDD[T]]
Example:
val rddParts = rdd.randomSplit(Array(0.8, 0.2)) // splits the RDD in an 80-20 ratio
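A quick usage sketch (the rdd value and the seed are assumptions): randomSplit returns one RDD per weight, in order, so you can destructure the array directly.
val Array(trainingData, testData) = rdd.randomSplit(Array(0.8, 0.2), seed = 42L)
trainingData.count() // roughly 80% of rdd.count()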
You can use either of the following RDD APIs:
yourRDD.filter(on some condition)
yourRDD.sample(<with replacement>,<fraction of data>,<random seed>)
Ex: yourRDD.sample(false, 0.3, System.currentTimeMillis().toInt)
If you want a random fraction of the data, I suggest you use the second method. If you need the part of the data satisfying some condition, use the first one.
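A minimal end-to-end sketch of both approaches (the context setup, data, and filter condition are made up for illustration):
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("subset-example").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 1000000)

// Condition-based subset: deterministic, but you need to know your data
val filtered = rdd.filter(_ % 100 == 0)

// Random subset: roughly 1% sample without replacement
val sampled = rdd.sample(withReplacement = false, fraction = 0.01, seed = 42L)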
I'm new to Scala and I'm having a mental block on a seemingly easy problem. I'm using the Scala library breeze and need to take a mutable ArrayBuffer and put the results into a matrix. This should be simple, but Scala is so strictly typed that breeze seems really picky about what data types it will accept when making a DenseVector. This is just some prototype code, but can anyone help me come up with a solution?
Right now I have something like...
import scala.collection.mutable.ArrayBuffer

// 9 elements that need to go into a 3x3 matrix (1-3 as top row, 4-6 as middle row, etc.)
val numbersForMatrix: ArrayBuffer[Double] = ArrayBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9)
//the empty 3x3 matrix
var M: breeze.linalg.DenseMatrix[Double] = DenseMatrix.zeros[Double](3, 3)
In breeze you can do things like
M(0, 0) = 100, which sets the first value to 100.
You can also do stuff like:
M(0, 0 to 2) := DenseVector(1, 2, 3)
which sets the first row to 1, 2, 3
But I cannot get it to do something like...
var dummyList: List[Double] = List(1, 2, 3) //this works
var dummyVec = DenseVector[Double](dummyList) //this works
M(0, 0 to 2) := dummyVec //this does not work
and successfully change the first row to 1, 2, 3.
And that's with a List, not even an ArrayBuffer.
I am willing to change datatypes from ArrayBuffer, but I'm just not sure how to approach this at all... I could try updating the matrix values one by one, but that seems like it would be VERY hacky to code up.
Note: I'm a Python programmer who is used to using numpy and just giving it arrays. The breeze documentation doesn't provide enough examples with other datatypes for me to have been able to figure this out yet.
Thanks!
Breeze is, in addition to pickiness over types, pretty picky about vector shape: DenseVectors are column vectors, but you are trying to assign to a subset of a row, which expects a transposed DenseVector:
M(0, 0 to 2) := dummyVec.t
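For the original goal of getting an ArrayBuffer into the matrix, a hedged end-to-end sketch (variable names are from the question; note that Breeze stores DenseMatrix data column-major):
import breeze.linalg.{DenseMatrix, DenseVector}
import scala.collection.mutable.ArrayBuffer

val numbersForMatrix = ArrayBuffer[Double](1, 2, 3, 4, 5, 6, 7, 8, 9)

// The constructor fills column-major, so transpose to get 1-3 as the top row
val M = new DenseMatrix(3, 3, numbersForMatrix.toArray).t

// Row slices expect a transposed (row) vector on the right-hand side
M(0, 0 to 2) := DenseVector[Double](10.0, 20.0, 30.0).t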
I need to maintain a sorted sequence (mutable or immutable — I don't care), dynamically inserting elements into the middle of it (to keep it sorted) and removing them likewise (so, random access by index is crucial).
The best thing I came up with is using a Vector and scala.collection.Searching from 2.11, and then:
import scala.collection.Searching._ // needed for .search

var vector: Vector[A] // A is some element type with an implicit Ordering
...
val ip = vector.search(element)
Inserting:
vector = (vector.take(ip.insertionPoint) :+ element) ++ vector.drop(ip.insertionPoint)
Deleting:
vector = vector.patch(from = ip.insertionPoint, patch = Nil, replaced = 1)
Doesn't look elegant to me, and I suspect performance issues. Is there a better way? Splicing sequences seems like a very basic operation to me, but I can't find an elegant solution.
You should use SortedSet. The default implementation of SortedSet is an immutable red-black tree. There is also a mutable implementation.
SortedSet[Int]() + 5 + 3 + 4 + 7 + 1
// SortedSet[Int] = TreeSet(1, 3, 4, 5, 7)
A Set contains no duplicate elements. In case you want to count duplicate elements, you could use a SortedMap[Key, Int] with elements as keys and counts as values. See this answer for MultiSet emulation using a Map.
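A minimal sketch of that multiset idea using an immutable SortedMap (the add/remove helper names are made up here):
import scala.collection.immutable.SortedMap

// Count occurrences; the map stays sorted by key
def add[A: Ordering](m: SortedMap[A, Int], a: A): SortedMap[A, Int] =
  m.updated(a, m.getOrElse(a, 0) + 1)

// Decrement, dropping the key once its count reaches zero
def remove[A: Ordering](m: SortedMap[A, Int], a: A): SortedMap[A, Int] =
  m.get(a) match {
    case Some(1) => m - a
    case Some(n) => m.updated(a, n - 1)
    case None    => m
  }

val ms = List(5, 3, 5, 1).foldLeft(SortedMap.empty[Int, Int])((m, a) => add(m, a))
// SortedMap(1 -> 1, 3 -> 1, 5 -> 2)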