reduce from PySpark RDD returns tuple - pyspark

from pyspark.sql import functions as F
data = data.withColumn('n', F.lit(10))
result = data.select('n').rdd.reduce(lambda x, y: x + y)
print(result)
The code above returns the following output, but I was expecting a single value: the sum.
output:
(10,
10,
10,
10,
10,
10,
10,
10,.....)
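A likely explanation, sketched here without Spark: `data.select('n').rdd` is an RDD of `Row` objects rather than plain integers, and `Row` subclasses Python's `tuple`, so `x + y` concatenates rows instead of adding their values. The sketch below mimics this with plain tuples and `functools.reduce`. The `rows` list is a hypothetical stand-in for the RDD's contents, not real Spark data.

```python
from functools import reduce

# Hypothetical stand-in for data.select('n').rdd: each element is a
# one-field row, modelled here as a plain 1-tuple.
rows = [(10,)] * 4

# '+' on tuples concatenates, which reproduces the shape of the output
# shown in the question.
concatenated = reduce(lambda x, y: x + y, rows)
print(concatenated)

# Extracting the field from each row before reducing gives the numeric sum.
total = reduce(lambda x, y: x + y, (row[0] for row in rows))
print(total)
```

In PySpark itself, the same idea would probably look like `data.select('n').rdd.map(lambda row: row[0]).reduce(lambda x, y: x + y)`. Staying in the DataFrame API, `data.agg(F.sum('n')).collect()[0][0]` should also return the sum.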

Related

Scala flattening loses the desired grouping of the subsets by size

I want to count the subsets of cubes whose sum equals the target value.
For a small number of sets this code works (a target of 100 rather than 1000). When the target value increases, the system runs out of resources. I have not flattened allsets because I intend to create and process only the smaller subsets as needed.
How do I lazily create and use the subsets by size until the sums for all the Sets of one size equal or exceed the target? At that point nothing more needs to be examined, because the rest of the sums will be larger than the target.
val target = 100; val exp = 3; val maxi = math.pow(target, 1.0/exp).toInt;
target: Int = 100
exp: Int = 3
maxi: Int = 4
val allterms=(1 to maxi).map(math.pow(_,exp).toInt).to[Set];
allterms: Set[Int] = Set(1, 8, 27, 64)
val allsets = (1 to maxi).map(allterms.subsets(_).to[Vector]); allsets.mkString("\n");
allsets: scala.collection.immutable.IndexedSeq[Vector[scala.collection.immutable.Set[Int]]] = Vector(Vector(Set(1), Set(8), Set(27), Set(64)), Vector(Set(1, 8), Set(1, 27), Set(1, 64), Set(8, 27), Set(8, 64), Set(27, 64)), Vector(Set(1, 8, 27), Set(1, 8, 64), Set(1, 27, 64), Set(8, 27, 64)), Vector(Set(1, 8, 27, 64)))
res7: String =
Vector(Set(1), Set(8), Set(27), Set(64))
Vector(Set(1, 8), Set(1, 27), Set(1, 64), Set(8, 27), Set(8, 64), Set(27, 64))
Vector(Set(1, 8, 27), Set(1, 8, 64), Set(1, 27, 64), Set(8, 27, 64))
Vector(Set(1, 8, 27, 64))
allsets.flatten.map(_.sum).filter(_==target).size;
res8: Int = 1
This implementation loses the separation of the subsets by size.
You can add laziness to your calculations in two ways:
Use combinations() instead of subsets(). This creates an Iterator so the combination (collection of Int values) won't be realized until it is needed.
Use a Stream (LazyList in Scala 2.13 and later) so that each "row" (the combinations of one size) won't be realized until it is needed.
Then you can trim the number of rows to be realized by using the fact that the first combination of each row is going to have the minimum sum of that row.
val target = 125
val exp = 2
val maxi = math.pow(target, 1.0/exp).toInt //maxi: Int = 11
val allterms=(1 to maxi).map(math.pow(_,exp).toInt)
//allterms = Seq(1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121)
val allsets = Stream.range(1,maxi+1).map(allterms.combinations)
//allsets: Stream[Iterator[IndexedSeq[Int]]] = Stream(<iterator>, ?)
// 11 rows, 2047 combinations, all unrealized
allsets.map(_.map(_.sum).buffered) //Stream[BufferedIterator[Int]]
.takeWhile(_.head <= target) // 6 rows
.flatten // 1479 combinations
.count(_ == target)
//res0: Int = 5

Ensure list contains only specific values

How can I make sure a list only contains a specific set of items?
List[Int]
A function to make sure the list only contains the values 10, 20 or 30.
I'm sure this is built in but I can't find it!
Your question doesn't specify what you want to happen when the list doesn't contain the requisite items.
The following will return true if all the items in the List match your criteria, false otherwise:
val ints1: List[Int] = List(1, 2, 3, 4, 5, 6, 7)
val ints2: List[Int] = List(10, 10, 10, 10)
ints1.forall(i => List(10, 20, 30).contains(i)) // false
ints2.forall(i => List(10, 20, 30).contains(i)) // true
The following will return a List with only those items which matched the criteria:
val ints1: List[Int] = List(10, 20, 30, 40, 50, 60, 70)
val ints2: List[Int] = List(10, 10, 10)
ints1.filter(i => List(10, 20, 30).contains(i)) // List(10, 20, 30)
ints2.filter(i => List(10, 20, 30).contains(i)) // List(10, 10, 10)
forall
You may use forall with a Set containing elements which are valid or legal and you want to see in the list.
list.forall(Set(10, 20, 30).contains) //true means list only contains 10, 20, 30
Set is Function
You need not call the contains method, because Set[Int] extends Int => Boolean. You can use the Set itself as a function:
list forall Set(10, 20, 30)
Filter
You can use filter to drop the elements that are not in the given set; if nothing was dropped, the list contains only legal values. Again, you can pass the Set as a function, because Set[Int] extends Int => Boolean.
list.filter(Set(10, 20, 30)) == list //true means list only contains 10, 20 and 30
Collect if you like pattern matching
collect takes a PartialFunction. If you like pattern matching, just use collect and compare the result with the original list:
list.collect {
  case 10 => 10
  case 20 => 20
  case 30 => 30
} == list //true means list only contains 10, 20 and 30
Scala REPL
scala> val list = List(10, 20, 30, 40, 50)
list: List[Int] = List(10, 20, 30, 40, 50)
scala> list forall Set(10, 20, 30)
res6: Boolean = false
If you simply want to determine whether all of the values in your list are "legal", use forall:
def isLegal(i: Int): Boolean = ??? // e.g. is it 10, 20, or 30
val allLegal = list forall isLegal
If you want to trim down your list so that only legal values remain, use filter:
val onlyLegalValues = list filter isLegal
Note that a Set[Int] counts as an Int => Boolean function, so you could use that in place of your isLegal method:
val isLegal = Set(10, 20, 30)
val allLegal = list forall isLegal
val onlyLegalValues = list filter isLegal

ScalaCheck: choose an integer with custom probability distribution

I want to create a generator in ScalaCheck that generates numbers between say 1 and 100, but with a bell-like bias towards numbers closer to 1.
Gen.choose() distributes numbers uniformly at random between the min and max value:
scala> (1 to 10).flatMap(_ => Gen.choose(1,100).sample).toList.sorted
res14: List[Int] = List(7, 21, 30, 46, 52, 64, 66, 68, 86, 86)
And Gen.chooseNum() has an added bias for the upper and lower bounds:
scala> (1 to 10).flatMap(_ => Gen.chooseNum(1,100).sample).toList.sorted
res15: List[Int] = List(1, 1, 1, 61, 85, 86, 91, 92, 100, 100)
I'd like a choose() function that would give me a result that looks something like this:
scala> (1 to 10).flatMap(_ => choose(1,100).sample).toList.sorted
res15: List[Int] = List(1, 1, 1, 2, 5, 11, 18, 35, 49, 100)
I see that choose() and chooseNum() take an implicit Choose trait as an argument. Should I use that?
You could use Gen.frequency():
val frequencies = List(
(50000, Gen.choose(0, 9)),
(38209, Gen.choose(10, 19)),
(27425, Gen.choose(20, 29)),
(18406, Gen.choose(30, 39)),
(11507, Gen.choose(40, 49)),
( 6681, Gen.choose(50, 59)),
( 3593, Gen.choose(60, 69)),
( 1786, Gen.choose(70, 79)),
( 820, Gen.choose(80, 89)),
( 347, Gen.choose(90, 100))
)
(1 to 10).flatMap(_ => Gen.frequency(frequencies:_*).sample).toList
res209: List[Int] = List(27, 21, 31, 1, 21, 18, 9, 29, 69, 29)
I got the frequencies from https://en.wikipedia.org/wiki/Standard_normal_table#Complementary_cumulative. The code only samples the table, taking every third entry (z = 0.0, 0.3, 0.6, ...), but I think you can get the idea.
I can't take much credit for this, and will point you to this excellent page:
http://www.javamex.com/tutorials/random_numbers/gaussian_distribution_2.shtml
A lot of this depends on what you mean by "bell-like". Your example doesn't show any negative numbers, but 1 can't be the middle of the bell without producing negative numbers, unless it is a very, very tiny bell!
Forgive the mutable loop but I use them sometimes when I have to reject values in a collection build:
object Test_Stack extends App {
  val r = new java.util.Random()
  val maxBellAttempt = 102
  val stdv = maxBellAttempt / 3 // values within 3 * stdv occur about 99% of the time
  val collectSize = 100000
  val l = scala.collection.mutable.Buffer[Int]()
  // ref article above: "What are the minimum and maximum values with nextGaussian()?"
  while (l.size < collectSize) {
    // the +1 is the mean (avg) offset; it can be whatever you like
    // the abs folds the curve in half; you could remove it, but you'd need to move the +1 over more
    val temp = (r.nextGaussian() * stdv + 1).abs.round.toInt
    if (temp <= maxBellAttempt) l += temp // reject values beyond the cutoff
  }
  val res = l.toList // immutable copy of the collected values
  //println(res.mkString("\n"))
}
To see the distribution, I pasted the output into Excel and used "countif" to show the frequency of each value.

Sum array of vectors by one of fields - scala

I have an array of vectors in scala:
import scala.collection.mutable.ArrayBuffer
import org.apache.mahout.math.{ VectorWritable, Vector, DenseVector }
import org.apache.mahout.clustering.dirichlet.UncommonDistributions
val data = new ArrayBuffer[Vector]()
for (i <- 100 to num) {
  data += new DenseVector(Array[Double](
    i % 30,
    UncommonDistributions.rNorm(100, 100),
    UncommonDistributions.rNorm(100, 100)
  ))
}
Let's say I want to sum the second and third fields, grouping by the first field.
What is the best way to do that?
I would suggest using the groupBy method available on collections:
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.Vector#groupBy[K](f:A=>K):scala.collection.immutable.Map[K,Repr]
This will create a Map of Vectors based on a discriminator you specify.
Edit: Some code example:
// I created a different Array of Vector as I don't have Mahout dependencies
// But the output is similar
// A List of Vectors with 3 values inside
val num = 100
val data = (0 to num).toList.map(n => {
Vector(n % 30, n / 100, n * 100)
})
// The groupBy will create a Map of Vectors where the Key is the result of the function
// And here, the function return the first value of the Vector
val group = data.groupBy(v => { v.apply(0) })
// Also a subset of the result:
// group:
// scala.collection.immutable.Map[Int,List[scala.collection.immutable.Vector[Int]]] = Map(0 -> List(Vector(0, 0, 0), Vector(0, 0, 3000), Vector(0, 0, 6000), Vector(0, 0, 9000)), 5 -> List(Vector(5, 0, 500), Vector(5, 0, 3500), Vector(5, 0, 6500), Vector(5, 0, 9500)))
Use the groupBy function on the list, then map each group. Summing the third field takes one line of code (use index 1 in the same way for the second field):
data.groupBy(_(0)).map { case (k, v) => k -> v.map(_(2)).sum }

Scala groupBy: Want array indices satisfying predicate, not array values

val m = Array(10,20,30,30,50,60,70,80) groupBy ( s => s %30 == 0)
m(true).map { kv => println(kv) }
This prints the values 30, 30, 60.
I want the indices, i.e. 2, 3, 5, to be printed.
How do I go about this?
val m = Array(10,20,30,30,50,60,70,80).zipWithIndex.groupBy(s =>
s._1 % 30 == 0).map(e => e._1 -> (e._2.unzip._2))
Just FYI, if you only want the true values, then you could go with @missingfaktor's approach, or equally you could use partition:
val m = Array(10, 20, 30, 30, 50, 60, 70, 80).zipWithIndex.partition(s =>
s._1 % 30 == 0)._1.unzip._2
Here's another way to do it:
Array(10,20,30,30,50,60,70,80).zipWithIndex.filter{ _._1 % 30 == 0 }.map{ _._2 }
I find the .map{ _._2 } easier to comprehend than .unzip._2, but maybe that's just me. What's also interesting is that the above returns:
Array[Int] = Array(2, 3, 5)
While the unzip variant returns this:
scala.collection.mutable.IndexedSeq[Int] = ArrayBuffer(2, 3, 5)
Array(10, 20, 30, 30, 50, 60, 70, 80)
.zipWithIndex
.collect { case (element, index) if element % 30 == 0 => index }
// Array[Int] = Array(2, 3, 5)
Here's a more direct way:
val m = Array(10,20,30,30,50,60,70,80).zipWithIndex.filter(_._1 % 30 == 0).unzip
This obtains the values and indices as a pair, (ArrayBuffer(30, 30, 60), ArrayBuffer(2, 3, 5)). You can print just the indices with:
m._2.foreach(println _)
val a=Array(10,20,30,30,50,60,70,80)
println( a.indices.filter( a(_)%30==0 ) )