I have a quiz program in Swift. Everything is working fine, but I thought I would make it more efficient by using a function instead of handling each item separately!
Your help is appreciated!
In the playground, I tried to use:
func random( a:Int, b:Int, c:Int) -> (Int, Int, Int) {
    return Int(arc4random_uniform(UInt32(a, b, c)))
}
I got an error saying:
Cannot invoke initializer for type 'UInt32' with an argument list of type '(Int, Int, Int)'
You have a couple of issues.
You are only calling arc4random_uniform() once; you need to generate the three numbers separately and then return a tuple of the three results.
arc4random_uniform() takes a single UInt32 and you are trying to give it the result of passing three Ints to the UInt32 initializer (which is a non-existent initializer).
I'd suggest using Int.random(in:) instead of arc4random_uniform():
func random(a: Int, b: Int, c: Int) -> (Int, Int, Int) {
return (.random(in: 0..<a), .random(in: 0..<b), .random(in: 0..<c))
}
Note: You don't have to explicitly use the Int in the call Int.random(in: 0..<a) because Swift is able to infer the Int from the return type of the function.
Example:
for _ in 1...20 {
print(random(a: 2, b: 6, c: 100))
}
(0, 3, 5)
(0, 1, 32)
(0, 1, 90)
(0, 3, 17)
(1, 0, 34)
(0, 1, 78)
(1, 0, 71)
(0, 1, 85)
(1, 0, 27)
(1, 0, 26)
(0, 0, 93)
(0, 1, 46)
(1, 4, 47)
(1, 1, 12)
(0, 2, 21)
(1, 3, 72)
(0, 2, 62)
(0, 5, 50)
(1, 2, 23)
(1, 4, 21)
Alternate implementation
By taking a variable number of maximum inputs and returning an array, you can flexibly handle any number of randoms needed:
func random(maxVals: Int...) -> [Int] {
return maxVals.map { .random(in: 0..<$0) }
}
Example:
print(random(maxVals: 2, 6, 100))
[1, 4, 57]
print(random(maxVals: 2, 6))
[0, 5]
You have declared a function with this signature: (Int, Int, Int) -> (Int, Int, Int).
The function arc4random_uniform has to be applied to a UInt32, not an Int.
If you convert a, b, and c individually (one UInt32 conversion and one arc4random_uniform call per value), it will work!
I was learning PySpark when I encountered this.
from pyspark.sql import Row
df = spark.createDataFrame([Row([0,45,63,0,0,0,0]),
Row([0,0,0,85,0,69,0]),
Row([0,89,56,0,0,0,0])],
['features'])
+--------------------+
| features|
+--------------------+
|[0, 45, 63, 0, 0,...|
|[0, 0, 0, 85, 0, ...|
|[0, 89, 56, 0, 0,...|
+--------------------+
sample = df.rdd.map(lambda row: row[0]*2)
sample.collect()
[[0, 45, 63, 0, 0, 0, 0, 0, 45, 63, 0, 0, 0, 0],
[0, 0, 0, 85, 0, 69, 0, 0, 0, 0, 85, 0, 69, 0],
[0, 89, 56, 0, 0, 0, 0, 0, 89, 56, 0, 0, 0, 0]]
My question is: why is row[0] taken as a complete list rather than one value?
What property gives the above output?
It is taken as a complete list because you passed it in as one, and you defined it under the single column "features".
When you write
df.rdd.map(lambda row: row[0]*2)
you are just asking Spark "I want all the values in this list to occur twice" (row[0] is the whole list, and multiplying a Python list by 2 repeats it). Hence you get the output that you are getting.
Now, how to get individual values in the list:
df = spark.createDataFrame([Row(0,45,63,0,0,0,0),
Row(0,0,0,85,0,69,0),
Row(0,89,56,0,0,0,0)],
['feature1' , 'feature2' , 'feature3' , 'feature4', 'feature5' , 'feature6' , 'feature7'])
This should give you access to the individual values, each in a dedicated column.
Note: the schema syntax above is just a sketch; please refer to the Spark docs for the exact syntax.
Hope this helps :)
I have the following function, which returns either a list of at most 8 elements or a list of lists:
def orderCost(orderItems: List[Double]) = {
if (orderItems.length <= 8) orderItems else orderItems.grouped(8).toList
}
So my question is: why is my function returning List[Any] instead of List[Double] or List[List[Double]]? Is there a bug in Scala 2.11.8, which I'm using?
orderItems can be one of the below:
orderItems: List[Double] = List(4.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99)
or
orderItems: List[Double] = List(4.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 4.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99)
I want a single list if the order items' length is at most 8, or otherwise multiple lists built from the order items, where each sub-list contains 8 elements max.
Thanks
You don't need to check the length; you can do it directly like this:
def orderCost(orderItems: List[Double]) = {
orderItems.grouped(8).toList
}
Sample Input 1:
val orderItems: List[Double] = List(4.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99)
Sample Output 1:
List(List(4.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99), List(8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99), List(8.99, 8.99, 8.99, 8.99))
Sample Input 2:
val orderItems1: List[Double] = List(1,2,3,4,5.8)
Sample Output 2:
List(List(1.0, 2.0, 3.0, 4.0, 5.8))
Just change the return type of the function. grouped will take care of all cases.
def orderCost(orderItems: List[Double]): List[List[Double]] =
orderItems.grouped(8).toList
Scala REPL
scala> val l = (1 to 10)
l: scala.collection.immutable.Range.Inclusive = Range 1 to 10
scala> l.grouped(8).toList
res0: List[scala.collection.immutable.IndexedSeq[Int]] = List(Vector(1, 2, 3, 4, 5, 6, 7, 8), Vector(9, 10))
scala> val l = (1 to 4)
l: scala.collection.immutable.Range.Inclusive = Range 1 to 4
scala> l.grouped(8).toList
res1: List[scala.collection.immutable.IndexedSeq[Int]] = List(Vector(1, 2, 3, 4))
So, function looks like
scala> def orderCost(orderItems: List[Double]): List[List[Double]] = orderItems.grouped(8).toList
orderCost: (orderItems: List[Double])List[List[Double]]
scala> orderCost(List(1, 2, 3, 4))
res2: List[List[Double]] = List(List(1.0, 2.0, 3.0, 4.0))
scala> orderCost((1 to 20).toList.map(_.toDouble))
res4: List[List[Double]] = List(List(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0), List(9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0), List(17.0, 18.0, 19.0, 20.0))
The problem is that the only type that is compatible with both List[Double] and List[List[Double]] is List[Any], so that is the inferred result type of the function. There are no union types (until Scala 3), so you can't return List[Double] | List[List[Double]].
You can unpick the current return value with a match statement (but beware type erasure; a sketch of that approach follows the Either example below). Or you could return Either[List[Double], List[List[Double]]] like this:
def orderCost(orderItems: List[Double]) = {
if (orderItems.length <= 8) Left(orderItems) else Right(orderItems.grouped(8).toList)
}
orderCost(myItems) match {
case Left(ld) => // Handle List[Double]
case Right(lld) => // Handle List[List[Double]]
}
I have user data from the MovieLens ml-100k dataset.
Sample rows are:
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
I have read the data as an RDD as follows:
scala> val user_data = sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
user_data: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:29
scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))
Encode the distinct professions with zipWithIndex:
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex()
indexed_profession: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[18] at zipWithIndex at <console>:31
scala> indexed_profession.collect()
res1: Array[(String, Long)] = Array((administrator,0), (artist,1), (doctor,2), (educator,3), (engineer,4), (entertainment,5), (executive,6), (healthcare,7), (homemaker,8), (lawyer,9), (librarian,10), (marketing,11), (none,12), (other,13), (programmer,14), (retired,15), (salesman,16), (scientist,17), (student,18), (technician,19), (writer,20))
I want to do one-hot encoding for the Occupation column.
The expected output is:
userId Age Gender Occupation Zipcodes technician other writer
1 24 M technician 85711 1 0 0
2 53 F other 94043 0 1 0
3 23 M writer 32067 0 0 1
4 24 M technician 43537 1 0 0
5 33 F other 15213 0 1 0
How do I achieve this on an RDD in Scala?
I want to perform the operation on the RDD without converting it to a DataFrame.
Any help is appreciated.
Thanks
I did this in the following way:
1) Read the user data:
scala> val user_data = sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
2) Show 5 rows of data:
scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))
3) Create a map of professions by indexing:
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex().collectAsMap()
scala> indexed_profession
res35: scala.collection.Map[String,Long] = Map(scientist -> 17, writer -> 20, doctor -> 2, healthcare -> 7, administrator -> 0, educator -> 3, homemaker -> 8, none -> 12, artist -> 1, salesman -> 16, executive -> 6, programmer -> 14, engineer -> 4, librarian -> 10, technician -> 19, retired -> 15, entertainment -> 5, marketing -> 11, student -> 18, lawyer -> 9, other -> 13)
4) Create an encode function which does one-hot encoding of the profession (it looks up the map built in step 3):
scala> def encode(x: String) = {
     |   val encodeArray = Array.fill(21)(0)
     |   encodeArray(indexed_profession(x).toInt) = 1
     |   encodeArray
     | }
5) Apply the encode function to the user data:
scala> val encode_user_data = user_data.map{ x => (x(0),x(1),x(2),x(3),x(4),encode(x(3)))}
6) Show the encoded data:
scala> encode_user_data.take(6)
res71: Array[(String, String, String, String, String, Array[Int])] = Array(
  (1,24,M,technician,85711,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
  (2,53,F,other,94043,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
  (3,23,M,writer,32067,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)),
  (4,24,M,technician,43537,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
  (5,33,F,other,15213,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
  (6,42,M,executive,98101,Array(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)))
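As a small refinement (my own sketch, not part of the original steps): indexed_profession is an ordinary driver-side Map captured by the encode closure, so it can be wrapped in a broadcast variable to ship it to each executor only once:
// Sketch: broadcast the profession -> index map built in step 3.
val indexedProfessionB = sc.broadcast(indexed_profession)

def encode(x: String): Array[Int] = {
  val encodeArray = Array.fill(21)(0)
  encodeArray(indexedProfessionB.value(x).toInt) = 1   // look up the index on the executor
  encodeArray
}

val encode_user_data = user_data.map(x => (x(0), x(1), x(2), x(3), x(4), encode(x(3))))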
[My solution is for a DataFrame] The code below should help in converting categorical columns to one-hot columns. You have to create a Map object catMap with column names as keys and lists of categories as values.
import org.apache.spark.sql.functions.{lower, when}

var OutputDf = df
for (cat <- catMap.keys) {
  val categories = catMap(cat)
  for (oneHotVal <- categories) {
    OutputDf = OutputDf.withColumn(oneHotVal,
      when(lower(OutputDf(cat)) === oneHotVal, 1).otherwise(0))
  }
}
OutputDf
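A hypothetical usage sketch (the example DataFrame and catMap below are mine, not from the question; it assumes a spark-shell or an existing SparkSession named spark):
import org.apache.spark.sql.functions.{lower, when}
import spark.implicits._

// Example data: one categorical column "occupation".
val df = Seq((1, "Technician"), (2, "Other"), (3, "Writer")).toDF("userId", "occupation")

// Column name -> categories to one-hot encode (lower-case, to match lower(col) above).
val catMap = Map("occupation" -> Seq("technician", "other", "writer"))

var outputDf = df
for (cat <- catMap.keys; oneHotVal <- catMap(cat)) {
  outputDf = outputDf.withColumn(oneHotVal,
    when(lower(outputDf(cat)) === oneHotVal, 1).otherwise(0))
}
outputDf.show()   // adds 0/1 columns technician, other and writer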
I have user hobby data (RDD[Map[String, Int]]) like:
("food" -> 3, "music" -> 1),
("food" -> 2),
("game" -> 5, "twitch" -> 3, "food" -> 3)
I want to calculate stats of them, and represent the stats as Map[String, Array[Int]] while the array size is 5, like:
("food" -> Array(0, 1, 2, 0, 0),
"music" -> Array(1, 0, 0, 0, 0),
"game" -> Array(0, 0, 0, 0, 1),
"twitch" -> Array(0, 0, 1, 0 ,0))
foldLeft seems like the right tool, but an RDD does not have it, and the data is too big to convert to a List/Array just to use foldLeft. How could I do this job?
The trick is to replace the Array in your example by a class that contains the statistic you want for some part of the data, and that can be combined with another instance of the same statistic (covering other part of the data) to provide the statistic on the whole data.
For instance, if you have a statistic that covers the data 3, 3, 2 and 5, I gather it would look something like (0, 1, 2, 0, 1), and if you have another instance covering the data 3, 4, 4, it would look like (0, 0, 1, 2, 0). Now all you have to do is define a + operation that lets you combine (0, 1, 2, 0, 1) + (0, 0, 1, 2, 0) = (0, 1, 3, 2, 1), covering the data 3, 3, 2, 5 and 3, 4, 4.
Let's just do that, and call the class StatMonoid:
case class StatMonoid(flags: Seq[Int] = Seq(0,0,0,0,0)) {
def + (other: StatMonoid) =
new StatMonoid( (0 to 4).map{idx => flags(idx) + other.flags(idx)})
}
This class contains a sequence of 5 counters and defines a + operation that lets it be combined with other counters.
We also need a convenience method to build it; this could be a constructor in StatMonoid, in the companion object, or just a plain method, as you prefer:
def stat(value: Int): StatMonoid = value match {
case 1 => new StatMonoid(Seq(1,0,0,0,0))
case 2 => new StatMonoid(Seq(0,1,0,0,0))
case 3 => new StatMonoid(Seq(0,0,1,0,0))
case 4 => new StatMonoid(Seq(0,0,0,1,0))
case 5 => new StatMonoid(Seq(0,0,0,0,1))
case _ => throw new RuntimeException(s"illegal init value: $value")
}
This allows us to easily compute an instance of the statistic covering a single piece of data, for example:
scala> stat(4)
res25: StatMonoid = StatMonoid(List(0, 0, 0, 1, 0))
And it also allows us to combine them together simply by adding them:
scala> stat(1) + stat(2) + stat(2) + stat(5) + stat(5) + stat(5)
res18: StatMonoid = StatMonoid(Vector(1, 2, 0, 0, 3))
Now to apply this to your example, let's assume we have the data you mention as an RDD of Map:
val rdd = sc.parallelize(List(Map("food" -> 3, "music" -> 1), Map("food" -> 2), Map("game" -> 5, "twitch" -> 3, "food" -> 3)))
All we need to do to find the stat for each hobby is to flatten the data to get (hobby -> value) tuples, transform each value into an instance of StatMonoid above, and finally combine them all together per hobby:
import org.apache.spark.rdd.PairRDDFunctions
rdd.flatMap(_.toList).mapValues(stat).reduceByKey(_ + _).collect
Which yields:
res24: Array[(String, StatMonoid)] = Array((game,StatMonoid(List(0, 0, 0, 0, 1))), (twitch,StatMonoid(List(0, 0, 1, 0, 0))), (music,StatMonoid(List(1, 0, 0, 0, 0))), (food,StatMonoid(Vector(0, 1, 2, 0, 0))))
Now, for the side story: if you wonder why I called the class StatMonoid, it's simply because... it is a monoid :D, and a very common and handy one (a product of counter monoids). In short, monoids are just thingies that can be combined with each other in an associative fashion; they are super common when developing in Spark, since they naturally define operations that can be executed in parallel on the distributed workers and then gathered together into a final result.
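If you prefer not to define a class at all, here is a minimal alternative sketch (mine, not from the answer above) that folds the scores directly into an Array[Int] with aggregateByKey; like the code above, it assumes the scores are in 1..5:
val statsPerHobby: scala.collection.Map[String, Array[Int]] = rdd
  .flatMap(_.toList)                                   // (hobby, score) pairs
  .aggregateByKey(Array.fill(5)(0))(
    (acc, score) => { acc(score - 1) += 1; acc },      // fold one score into the counters
    (a, b) => a.zip(b).map { case (x, y) => x + y }    // merge partial counters
  )
  .collectAsMap()

// e.g. statsPerHobby("food") is Array(0, 1, 2, 0, 0) for the sample data above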