I was trying to:
1. randomly select a few columns from the DataFrame
2. shuffle the values of the columns selected in step 1
3. add these shuffled columns back to the DataFrame
The code is as follows:
# Step 0: create data frame using list and tuple
df = sqlContext.createDataFrame([
("user1", 0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
("user2", 1, 1, 0, 1, 0, 1, 1, 1, 1, 0),
("user3", 1, 1, 1, 1, 0, 0, 0, 1, 1, 0),
("user4", 0, 1, 0, 1, 1, 1, 1, 1, 0, 0),
("user5", 1, 1, 1, 1, 0, 1, 0, 1, 1, 0),
("user6", 0, 1, 0, 1, 1, 1, 1, 0, 1, 0)
], ["ID", "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"])
df.show()
The DataFrame is:
ID    x0 x1 x2 x3 x4 x5 x6 x7 x8 x9
user1 0  1  0  1  0  1  1  0  1  0
user2 1  1  0  1  0  1  1  1  1  0
user3 1  1  1  1  0  0  0  1  1  0
user4 0  1  0  1  1  1  1  1  0  0
user5 1  1  1  1  0  1  0  1  1  0
user6 0  1  0  1  1  1  1  0  1  0
import random
from pyspark.sql import functions as F
# define features
feature = [x for x in df.columns if x not in ['ID']]
# Step 1: random select a few columns from the DataFrame
random.seed(123)
random_col = random.sample(feature, 2)
print(random_col)
Step 1 works well; the randomly selected features are 'x0' and 'x4'.
# shuffle the randomly selected columns to create random noise features
for i in range(0, 2):
    # Step 2: shuffle the values of the column selected in step 1
    rnd_df = df.select(random_col[i]).orderBy(F.rand(i)).withColumnRenamed(random_col[i], 'rnd_col').rnd_col
    # Step 3: add the shuffled column back to the DataFrame
    df = df.withColumn('random' + str(i + 1), rnd_df)
Step 2 works well, but Step 3 fails with the following error. Does anyone know how to solve this problem?
I have user data from the MovieLens ml-100k dataset.
Sample rows are:
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
I have read the data as an RDD as follows:
scala> val user_data = sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
user_data: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:29
scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))
Encode the distinct professions with zipWithIndex:
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex()
indexed_profession: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[18] at zipWithIndex at <console>:31
scala> indexed_profession.collect()
res1: Array[(String, Long)] = Array((administrator,0), (artist,1), (doctor,2), (educator,3), (engineer,4), (entertainment,5), (executive,6), (healthcare,7), (homemaker,8), (lawyer,9), (librarian,10), (marketing,11), (none,12), (other,13), (programmer,14), (retired,15), (salesman,16), (scientist,17), (student,18), (technician,19), (writer,20))
I want to one-hot encode the Occupation column.
Expected output is -
userId Age Gender Occupation Zipcodes technician other writer
1 24 M technician 85711 1 0 0
2 53 F other 94043 0 1 0
3 23 M writer 32067 0 0 1
4 24 M technician 43537 1 0 0
5 33 F other 15213 0 1 0
How do I achieve this on an RDD in Scala? I want to perform the operation on the RDD without converting it to a DataFrame.
Any help is appreciated. Thanks.
I did this in the following way:
1) Read user data:
scala> val user_data = sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
2) Show 5 rows of data:
scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))
3) Create a map of professions by indexing:
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex().collectAsMap()
scala> indexed_profession
res35: scala.collection.Map[String,Long] = Map(scientist -> 17, writer -> 20, doctor -> 2, healthcare -> 7, administrator -> 0, educator -> 3, homemaker -> 8, none -> 12, artist -> 1, salesman -> 16, executive -> 6, programmer -> 14, engineer -> 4, librarian -> 10, technician -> 19, retired -> 15, entertainment -> 5, marketing -> 11, student -> 18, lawyer -> 9, other -> 13)
4) Create an encode function which one-hot encodes the profession:
scala> def encode(x: String) =
     | {
     |   val encodeArray = Array.fill(21)(0)
     |   encodeArray(indexed_profession(x).toInt) = 1
     |   encodeArray
     | }
5) Apply the encode function to the user data:
scala> val encode_user_data = user_data.map{ x => (x(0),x(1),x(2),x(3),x(4),encode(x(3)))}
6) Show the encoded data:
scala> encode_user_data.take(6)
res71: Array[(String, String, String, String, String, Array[Int])] = Array(
  (1,24,M,technician,85711,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
  (2,53,F,other,94043,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
  (3,23,M,writer,32067,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)),
  (4,24,M,technician,43537,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
  (5,33,F,other,15213,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
  (6,42,M,executive,98101,Array(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)))
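A note on step 4: the array size 21 is hardcoded, and the lookup map is captured from the driver in every task closure. Below is a minimal sketch of a slightly more robust variant, assuming the same user_data RDD and sc as above; the names bcProfessions and encodeBroadcast are mine, for illustration only.
// Derive the category count instead of hardcoding 21, and broadcast the
// profession -> index map so each executor keeps one read-only copy.
val professionMap = user_data.map(x => x(3)).distinct()
  .sortBy(x => x).zipWithIndex().collectAsMap()
val bcProfessions = sc.broadcast(professionMap)
val numProfessions = professionMap.size

def encodeBroadcast(x: String): Array[Int] = {
  val arr = Array.fill(numProfessions)(0)   // all zeros
  arr(bcProfessions.value(x).toInt) = 1     // flip the matching category on
  arr
}

val encoded = user_data.map(x => (x(0), x(1), x(2), x(3), x(4), encodeBroadcast(x(3))))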
[My solution is for DataFrames.] The snippet below helps convert categorical columns to one-hot columns. You have to create a catMap object with column names as keys and lists of categories as values.
import org.apache.spark.sql.functions.{lower, when}

var outputDf = df
for (cat <- catMap.keys) {
  val categories = catMap(cat)
  for (oneHotVal <- categories) {
    outputDf = outputDf.withColumn(oneHotVal,
      when(lower(outputDf(cat)) === oneHotVal, 1).otherwise(0))
  }
}
outputDf
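For example, for user data like the MovieLens sample in the earlier question, catMap could look like the following sketch; the column name Occupation and the category list are assumptions for illustration, and the values are lowercased because the loop compares against lower(outputDf(cat)).
// One entry per categorical column; each value becomes a new 0/1 column.
val catMap: Map[String, Seq[String]] = Map(
  "Occupation" -> Seq("technician", "other", "writer")
)
After the loop runs, a row with Occupation = technician ends up with technician = 1, other = 0 and writer = 0, matching the expected output in that question.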
I have a construct of n lists that record the beginning and ending times of something I want to monitor (say, a task). A task can be repeated multiple times (although the same task cannot overlap or run concurrently). Each task has a unique id, and its begin/end times are stored in its own list.
I'm trying to find the periods of time where all tasks were running at the same time.
As an example, below I have 3 tasks: taskId 1 happens 7 times, taskId 2 happens twice, and taskId 3 happens only once:
import org.joda.time.DateTime
case class CVT(taskId: Int, begin: DateTime, end: DateTime)
val cvt1: CVT = CVT(3, new DateTime(2015, 1, 1, 1, 0), new DateTime(2015, 1, 1, 20, 0))
val cvt2: CVT = CVT(1, new DateTime(2015, 1, 1, 2, 0), new DateTime(2015, 1, 1, 3, 0))
val cvt3: CVT = CVT(1, new DateTime(2015, 1, 1, 4, 0), new DateTime(2015, 1, 1, 6, 0))
val cvt4: CVT = CVT(2, new DateTime(2015, 1, 1, 5, 0), new DateTime(2015, 1, 1, 11, 0))
val cvt5: CVT = CVT(1, new DateTime(2015, 1, 1, 7, 0), new DateTime(2015, 1, 1, 8, 0))
val cvt6: CVT = CVT(1, new DateTime(2015, 1, 1, 9, 0), new DateTime(2015, 1, 1, 10, 0))
val cvt7: CVT = CVT(1, new DateTime(2015, 1, 1, 12, 0), new DateTime(2015, 1, 1, 14, 0))
val cvt8: CVT = CVT(2, new DateTime(2015, 1, 1, 13, 0), new DateTime(2015, 1, 1, 16, 0))
val cvt9: CVT = CVT(1, new DateTime(2015, 1, 1, 15, 0), new DateTime(2015, 1, 1, 17, 0))
val cvt10: CVT = CVT(1, new DateTime(2015, 1, 1, 18, 0), new DateTime(2015, 1, 1, 19, 0))
val combinedTasks: List[CVT] = List(cvt1, cvt2, cvt3, cvt4, cvt5, cvt6, cvt7, cvt8, cvt9, cvt10).sortBy(_.begin)
The result I'm trying to get is:
CVT(123, DateTime(2015, 1, 1, 5, 0), DateTime(2015, 1, 1, 6, 0))
CVT(123, DateTime(2015, 1, 1, 7, 0), DateTime(2015, 1, 1, 8, 0))
CVT(123, DateTime(2015, 1, 1, 9, 0), DateTime(2015, 1, 1, 10, 0))
CVT(123, DateTime(2015, 1, 1, 13, 0), DateTime(2015, 1, 1, 14, 0))
CVT(123, DateTime(2015, 1, 1, 15, 0), DateTime(2015, 1, 1, 16, 0))
Note: I don't mind what the taskId is in the result; I'm just using '123' to show that all three tasks were running between these start and end times.
I've looked at using both a recursive function and the Joda Interval .gap method, but can't seem to find a solution.
Any tips on how I could achieve this would be great.
Thanks
I have a library for sets of non-overlapping intervals at https://github.com/rklaehn/intervalset . It is going to be in the next version of spire.
Here is how you would use it:
import org.joda.time.DateTime
import spire.algebra.Order
import spire.math.Interval
import spire.math.extras.interval.IntervalSeq
// define an order for DateTimes
implicit val dateTimeOrder = Order.from[DateTime](_ compareTo _)
// create three sets of DateTime intervals
val intervals = Map[Int, IntervalSeq[DateTime]](
1 -> (IntervalSeq.empty |
Interval(new DateTime(2015, 1, 1, 2, 0), new DateTime(2015, 1, 1, 3, 0)) |
Interval(new DateTime(2015, 1, 1, 4, 0), new DateTime(2015, 1, 1, 6, 0)) |
Interval(new DateTime(2015, 1, 1, 7, 0), new DateTime(2015, 1, 1, 8, 0)) |
Interval(new DateTime(2015, 1, 1, 9, 0), new DateTime(2015, 1, 1, 10, 0)) |
Interval(new DateTime(2015, 1, 1, 12, 0), new DateTime(2015, 1, 1, 14, 0)) |
Interval(new DateTime(2015, 1, 1, 15, 0), new DateTime(2015, 1, 1, 17, 0)) |
Interval(new DateTime(2015, 1, 1, 18, 0), new DateTime(2015, 1, 1, 19, 0))),
2 -> (IntervalSeq.empty |
Interval(new DateTime(2015, 1, 1, 5, 0), new DateTime(2015, 1, 1, 11, 0)) |
Interval(new DateTime(2015, 1, 1, 13, 0), new DateTime(2015, 1, 1, 16, 0))),
3 -> (IntervalSeq.empty |
Interval(new DateTime(2015, 1, 1, 1, 0), new DateTime(2015, 1, 1, 20, 0))))
// calculate the intersection of all intervals
val result = intervals.values.foldLeft(IntervalSeq.all[DateTime])(_ & _)
// print the result
for (interval <- result.intervals)
println(interval)
Note that spire intervals are significantly more powerful than what you probably need: they distinguish between open and closed interval bounds, and can handle infinite intervals. Nevertheless, the above should be pretty fast. For the sample data, the printed intersection corresponds to the five windows in your expected output: [5:00, 6:00], [7:00, 8:00], [9:00, 10:00], [13:00, 14:00] and [15:00, 16:00].
In addition to Rüdiger's library, which I believe is powerful, fast and extensible, here is a simple implementation using the built-in collections library.
I redefined your CVT class to reflect its ability to carry intersections:
case class CVT[Id](taskIds: Id, begin: DateTime, end: DateTime)
All your individual cvt definitions now change to
val cvtN: CVT[Int] = ???
We will track events entering and leaving scope within our collection. For that algorithm we'll define the following ADT:
sealed class Event
case object Enter extends Event
case object Leave extends Event
And corresponding ordering instances:
implicit val eventOrdering = Ordering.fromLessThan[Event](_ == Leave && _ == Enter)
implicit val dateTimeOrdering = Ordering.fromLessThan[DateTime](_ isBefore _)
Now we can write the following:
val combinedTasks: List[CVT[Set[Int]]] =
  List(cvt1, cvt2, cvt3, cvt4, cvt5, cvt6, cvt7, cvt8, cvt9, cvt10)
    .flatMap { case CVT(id, begin, end) => List((id, begin, Enter), (id, end, Leave)) }
    .sortBy { case (id, time, evt) => (time, evt: Event) }
    .foldLeft((Set.empty[Int], List.empty[CVT[Set[Int]]], DateTime.now())) { (state, event) =>
      val (active, accum, last) = state
      val (id, time, evt) = event
      evt match {
        case Enter => (active + id, accum, time)
        case Leave => (active - id, CVT(active, last, time) :: accum, time)
      }
    }._2.filter(_.taskIds == Set(1, 2, 3)).reverse
The most important part here is the foldLeft. After ordering events so that Leaves come before Enters at the same timestamp, we carry the set of currently running jobs from event to event: when a job enters we add it to the set, and when a job leaves we capture an interval from the time of the previous event to the current one, tagged with the active set. The final filter keeps only the intervals during which all three tasks were running.
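For reference, printing the filtered result on the sample data should reproduce the five windows from the question. This is a sketch: Joda-Time's exact toString format and the element order inside the printed Set may differ.
combinedTasks.foreach(println)
// CVT(Set(1, 2, 3), 2015-01-01T05:00, 2015-01-01T06:00)
// CVT(Set(1, 2, 3), 2015-01-01T07:00, 2015-01-01T08:00)
// CVT(Set(1, 2, 3), 2015-01-01T09:00, 2015-01-01T10:00)
// CVT(Set(1, 2, 3), 2015-01-01T13:00, 2015-01-01T14:00)
// CVT(Set(1, 2, 3), 2015-01-01T15:00, 2015-01-01T16:00)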
#define F_GFX3D(f, s, m, n) \
	{ \
		.freq_hz = f, \
		.src_clk = &s##_clk.c, \
		.md_val = MD4(4, m, 0, n), \
		.ns_val = NS_MND_BANKED4(18, 14, n, m, 3, 0, s##_to_mm_mux), \
		.ctl_val = CC_BANKED(9, 6, n), \
		.mnd_en_mask = (BIT(8) | BIT(5)) * !!(n), \
	}

static struct clk_freq_tbl clk_tbl_gfx3d[] = {
	F_GFX3D(        0,  gnd, 0,  0),
	F_GFX3D( 27000000,  pxo, 0,  0),
	F_GFX3D( 48000000, pll8, 1,  8),
	F_GFX3D( 54857000, pll8, 1,  7),
	F_GFX3D( 64000000, pll8, 1,  6),
	F_GFX3D( 76800000, pll8, 1,  5),
	F_GFX3D( 96000000, pll8, 1,  4),
	F_GFX3D(128000000, pll8, 1,  3),
	F_GFX3D(145455000, pll2, 2, 11),
	F_GFX3D(160000000, pll2, 1,  5),
	F_GFX3D(177778000, pll2, 2,  9),
	F_GFX3D(200000000, pll2, 1,  4),
	F_GFX3D(228571000, pll2, 2,  7),
	F_GFX3D(266667000, pll2, 1,  3),
	F_GFX3D(320000000, pll2, 2,  5),
	F_END
};
I am trying to understand what the F_GFX3D macro does, but what does the ampersand mean in a macro? Is it the same as when you put an ampersand in front of a variable?
It doesn't mean anything special in the context of a macro: it is the ordinary address-of operator. As usual, the preprocessor copies-and-pastes the macro body to wherever the macro is invoked (apart from substituting macro arguments, handling ##, and so on).
Macros are processed by the preprocessor, and the & is not touched. The ## operator pastes tokens together, so for F_GFX3D(0, gnd, 0, 0) the initializer .src_clk = &s##_clk.c expands to .src_clk = &gnd_clk.c, i.e. the address of the c member of the gnd_clk variable.
I'm new to Scala; I just started learning it today. I would like to know how to initialize an array in Scala.
Example Java code
String[] arr = { "Hello", "World" };
What is the equivalent of the above code in Scala ?
scala> val arr = Array("Hello","World")
arr: Array[java.lang.String] = Array(Hello, World)
To initialize an array filled with zeros, you can use:
> Array.fill[Byte](5)(0)
Array(0, 0, 0, 0, 0)
This is equivalent to Java's new byte[5].
You can also do more dynamic initialization with fill, e.g.
Array.fill(10){scala.util.Random.nextInt(5)}
==>
Array[Int] = Array(0, 1, 0, 0, 3, 2, 4, 1, 4, 3)
In addition to Vasil's answer: if you have the values given as a Scala collection, you can write
val list = List(1,2,3,4,5)
val arr = Array[Int](list:_*)
println(arr.mkString)
But usually the toArray method is more handy:
val list = List(1,2,3,4,5)
val arr = list.toArray
println(arr.mkString)
If you know the array's length but not its content, you can use
val length = 5
val temp = Array.ofDim[String](length)
If you want a two-dimensional array but you don't know its content, you can use
val row = 5
val column = 3
val temp = Array.ofDim[String](row, column)
Of course, you can change String to any other type.
If you already know its content, you can use
val temp = Array("a", "b")
Another way of declaring multi-dimensional arrays:
Array.fill(4,3)("")
res3: Array[Array[String]] = Array(Array("", "", ""), Array("", "", ""), Array("", "", ""), Array("", "", ""))
[Consolidating all the answers]
Initializing 1-D Arrays
// With fixed values
val arr = Array("a", "ab", "c")
// With zero value of the type
val size = 13
val arrWithZeroVal = Array.ofDim[Int](size) //values = 0
val arrBoolWithZeroVal = Array.ofDim[Boolean](size) //values = false
// With default value
val defVal = -1
val arrWithDefVals = Array.fill[Int](size)(defVal)
//With random values
val rand = scala.util.Random
def randomNumberGen: Int = rand.nextInt(5)
val arrWithRandomVals = Array.fill[Int](size){randomNumberGen}
Initializing 2-D/3-D/n-D Arrays
// With zero value of the type
val arr3dWithZeroVal = Array.ofDim[Int](5, 4, 3)
// With default value
val defVal = -1
val arr3dWithDefVal = Array.fill[Int](5, 4, 3)(defVal)
//With random values
val arr3dWithRandomVals = Array.fill[Int](5, 4, 3){randomNumberGen}
Conclusion:
Use Array.ofDim[TYPE](d1, d2, d3...) to fill with the zero value of the type.
Use Array.fill[TYPE](d1, d2, d3...)(functionWhichReturnsTYPE) otherwise.
Output for reference:
scala> val arr = Array("a", "ab", "c")
arr: Array[String] = Array(a, ab, c)
scala> val size = 13
size: Int = 13
scala> val arrWithZeroVal = Array.ofDim[Int](size) //values = 0
arrWithZeroVal: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
scala> val arrBoolWithZeroVal = Array.ofDim[Boolean](size) //values = false
arrBoolWithZeroVal: Array[Boolean] = Array(false, false, false, false, false, false, false, false, false, false, false, false, false)
scala> val defVal = -1
defVal: Int = -1
scala> val arrWithDefVals = Array.fill[Int](size)(defVal)
arrWithDefVals: Array[Int] = Array(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)
scala> val rand = scala.util.Random
rand: util.Random.type = scala.util.Random$@6e3dd5ce
scala> def randomNumberGen: Int = rand.nextInt(5)
randomNumberGen: Int
scala> val arrWithRandomVals = Array.fill[Int](size){randomNumberGen}
arrWithRandomVals: Array[Int] = Array(2, 2, 3, 1, 1, 3, 3, 3, 2, 3, 2, 2, 0)
scala> val arr3dWithZeroVal = Array.ofDim[Int](5, 4, 3)
arr3dWithZeroVal: Array[Array[Array[Int]]] = Array(Array(Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0)), Array(Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0)), Array(Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0)), Array(Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0)), Array(Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0)))
scala> val arr3dWithDefVal = Array.fill[Int](5, 4, 3)(defVal)
arr3dWithDefVal: Array[Array[Array[Int]]] = Array(Array(Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1)), Array(Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1)), Array(Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1)), Array(Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1)), Array(Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1), Array(-1, -1, -1)))
scala> val arr3dWithRandomVals = Array.fill[Int](5, 4, 3){randomNumberGen}
arr3dWithRandomVals: Array[Array[Array[Int]]] = Array(Array(Array(2, 0, 0), Array(4, 1, 0), Array(4, 0, 0), Array(0, 0, 1)), Array(Array(0, 1, 2), Array(2, 0, 2), Array(0, 4, 2), Array(0, 4, 2)), Array(Array(4, 3, 0), Array(2, 2, 4), Array(4, 0, 4), Array(4, 2, 1)), Array(Array(0, 3, 3), Array(0, 0, 4), Array(4, 1, 3), Array(2, 2, 3)), Array(Array(0, 2, 3), Array(1, 4, 1), Array(1, 3, 3), Array(0, 0, 3)))