Related
I want to write a python 3.7 function that has a sorted list of numbers as an input, and a number n which is the max number each one of the integers can be repeated and modifies the list in place, so that any numbers that are repeated more than n times, would be cut to n repeats, and it should be done in O(1) space, no additional data structures allowed (e.g. set()). Special case - remove duplicates where n = 1. Example:
dup_list = [1, 1, 1, 2, 3, 7, 7, 7, 7, 12]
dedup(dup_list, n = 1)
print(dup_list)
[1, 2, 3, 7, 12]
dup_list = [1, 1, 1, 2, 3, 7, 7, 7, 7, 12]
dedup(dup_list, n = 2)
print(dup_list)
[1, 1, 2, 3, 7, 7, 12]
dup_list = [1, 1, 1, 2, 3, 7, 7, 7, 7, 12]
dedup(dup_list, n = 3)
print(dup_list)
[1, 1, 1, 2, 3, 7, 7, 7, 12]
Case n = 1 is easy, the code is below (code is taken from Elements of Prograqmming Interviews, 2008, page 49 except the last line return dup_list[:write_index]):
def dedup(dup_list):
if not dup_list:
return 0
write_index = 1
for i in range(1, len(dup_list)):
if dup_list[write_index-1] != dup_list[i]:
dup_list[write_index] = dup_list[i]
write_index += 1
return dup_list[:write_index]
This should work:
def dedup2(dup_list, n):
count = 1
list_len = len(dup_list)
i = 1
while i < list_len:
if dup_list[i - 1] != dup_list[i]:
count = 1
else:
count += 1
if count > n:
del(dup_list[i])
i -= 1
list_len -= 1
i += 1
return dup_list
print(dedup2([1, 2, 3, 3, 4, 4, 5, 5, 5, 5, 8, 9], 1))
I know this is a lengthy question :) I'm trying to implement Hamiltonian Cycle on a dataset in Scala 2.11, as part of this I'm trying to generate Adjacency matrix from a Map of values.
Explanation:
Keys 0 to 4 are the different cities, so in below "allRoads" Variable
0 -> Set(1, 2) Means city0 is connected to city1 and city2
1 -> Set(0, 2, 3, 4) Means City1 is connected to city0,city2,city3,city4
.
.
I need to generate adj Matrix, for E.g:
I need to generate 1 if the city is connected, or else I've to generate 0, meaning
for: "0 -> Set(1, 2)", I need to generate: Map(0 -> Array(0,1,1,0,0))
input-
var allRoads = Map(0 -> Set(1, 2), 1 -> Set(0, 2, 3, 4), 2 -> Set(0, 1, 3, 4), 3 -> Set(2, 4, 1), 4 -> Set(2, 3, 1))
My Code:
val n: Int = 5
val listOfCities = (0 to n-1).toList
var allRoads = Map(0 -> Set(1, 2), 1 -> Set(0, 2, 3, 4), 2 -> Set(0, 1, 3, 4), 3 -> Set(2, 4, 1), 4 -> Set(2, 3, 1))
var adjmat:Array[Int] = Map()
for( i <- 0 until allRoads.size;j <- listOfCities) {
allRoads.get(i) match {
case Some(elem) => if (elem.contains(j)) adjmat = adjmat:+1 else adjmat = adjmat :+0
case _ => None
}
}
which outputs:
output: Array[Int] = Array(0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0)
Expected output - Something like this, please suggest if there's something better to generate input to Hamiltonian Cycle
Map(0 -> Array(0, 1, 1, 0, 0),1 -> Array(1, 0, 1, 1, 1),2 -> Array(1, 1, 0, 1, 1),3 -> Array(0, 1, 1, 0, 1),4 -> Array(0, 1, 1, 1, 0))
Not sure how to store the above output as a Map or a Plain 2D Array.
Try
val cities = listOfCities.toSet
allRoads.map { case (city, roads) =>
city -> listOfCities.map(city => if ((cities diff roads).contains(city)) 0 else 1)
}
which outputs
Map(0 -> List(0, 1, 1, 0, 0), 1 -> List(1, 0, 1, 1, 1), 2 -> List(1, 1, 0, 1, 1), 3 -> List(0, 1, 1, 0, 1), 4 -> List(0, 1, 1, 1, 0))
Let's say I have a sequence of ints like this:
val mySeq = Seq(0, 1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
I want to split this by let's say 0 as a delimiter to look like this:
val mySplitSeq = Seq(Seq(0, 1, 2, 1), Seq(0, -1), Seq(0, 1, 2, 3, 2))
What is the most elegant way to do this in Scala?
This works alright
mySeq.foldLeft(Vector.empty[Vector[Int]]) {
case (acc, i) if acc.isEmpty => Vector(Vector(i))
case (acc, 0) => acc :+ Vector(0)
case (acc, i) => acc.init :+ (acc.last :+ i)
}
where 0 (or whatever) is your delimiter.
Efficient O(n) solution
Tail-recursive solution that never appends anything to lists:
def splitBy[A](sep: A, seq: List[A]): List[List[A]] = {
#annotation.tailrec
def rec(xs: List[A], revAcc: List[List[A]]): List[List[A]] = xs match {
case Nil => revAcc.reverse
case h :: t =>
if (h == sep) {
val (pref, suff) = xs.tail.span(_ != sep)
rec(suff, (h :: pref) :: revAcc)
} else {
val (pref, suff) = xs.span(_ != sep)
rec(suff, pref :: revAcc)
}
}
rec(seq, Nil)
}
val mySeq = List(0, 1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
println(splitBy(0, mySeq))
produces:
List(List(0, 1, 2, 1), List(0, -1), List(0, 1, 2, 3, 2))
It also handles the case where the input does not start with the separator.
For fun: Another O(n) solution that works for small integers
This is more of warning rather than a solution. Trying to reuse String's split does not result in anything sane:
val mySeq = Seq(0, 1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
val z = mySeq.min
val res = (mySeq
.map(x => (x - z).toChar)
.mkString
.split((-z).toChar)
.map(s => 0 :: s.toList.map(_.toInt + z)
).toList.tail)
It will fail if the integers span a range larger than 65535, and it looks pretty insane. Nevertheless, I find it amusing that it works at all:
res: List[List[Int]] = List(List(0, 1, 2, 1), List(0, -1), List(0, 1, 2, 3, 2))
You can use foldLeft:
val delimiter = 0
val res = mySeq.foldLeft(Seq[Seq[Int]]()) {
case (acc, `delimiter`) => acc :+ Seq(delimiter)
case (acc, v) => acc.init :+ (acc.last :+ v)
}
NOTE: This assumes input necessarily starts with delimiter.
One more variant using indices and reverse slicing
scala> val s = Seq(0,1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
s: scala.collection.mutable.Seq[Int] = ArrayBuffer(0, 1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
scala> s.indices.filter( s(_)==0).+:(if(s(0)!=0) -1 else -2).filter(_>= -1 ).reverse.map( {var p=0; x=>{ val y=s.slice(x,s.size-p);p=s.size-x;y}}).reverse
res173: scala.collection.immutable.IndexedSeq[scala.collection.mutable.Seq[Int]] = Vector(ArrayBuffer(0, 1, 2, 1), ArrayBuffer(0, -1), ArrayBuffer(0, 1, 2, 3, 2))
if the starting doesn't have the delimiter, then also it works.. thanks to jrook
scala> val s = Seq(1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
s: scala.collection.mutable.Seq[Int] = ArrayBuffer(1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
scala> s.indices.filter( s(_)==0).+:(if(s(0)!=0) -1 else -2).filter(_>= -1 ).reverse.map( {var p=0; x=>{ val y=s.slice(x,s.size-p);p=s.size-x;y}}).reverse
res174: scala.collection.immutable.IndexedSeq[scala.collection.mutable.Seq[Int]] = Vector(ArrayBuffer(1, 2, 1), ArrayBuffer(0, -1), ArrayBuffer(0, 1, 2, 3, 2))
UPDATE1:
More compact version by removing the "reverse" in above
scala> val s = Seq(0,1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
s: scala.collection.mutable.Seq[Int] = ArrayBuffer(0, 1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
scala> s.indices.filter( s(_)==0).+:(if(s(0)!=0) -1 else -2).filter(_>= -1 ).:+(s.size).sliding(2,1).map( x=>s.slice(x(0),x(1)) ).toList
res189: List[scala.collection.mutable.Seq[Int]] = List(ArrayBuffer(0, 1, 2, 1), ArrayBuffer(0, -1), ArrayBuffer(0, 1, 2, 3, 2))
scala> val s = Seq(1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
s: scala.collection.mutable.Seq[Int] = ArrayBuffer(1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
scala> s.indices.filter( s(_)==0).+:(if(s(0)!=0) -1 else -2).filter(_>= -1 ).:+(s.size).sliding(2,1).map( x=>s.slice(x(0),x(1)) ).toList
res190: List[scala.collection.mutable.Seq[Int]] = List(ArrayBuffer(1, 2, 1), ArrayBuffer(0, -1), ArrayBuffer(0, 1, 2, 3, 2))
scala>
Here is a solution I believe is both short and should run in O(n):
def seqSplitter[T](s: ArrayBuffer[T], delimiter : T) =
(0 +: s.indices.filter(s(_)==delimiter) :+ s.size) //find split locations
.sliding(2)
.map(idx => s.slice(idx.head, idx.last)) //extract the slice
.dropWhile(_.isEmpty) //take care of the first element
.toList
The idea is to take all the indices where the delimiter occurs, slide over them and slice the sequence at those locations. dropWhile takes care of the first element being a delimiter or not.
Here I am putting all the data in an ArrayBuffer to ensure slicing will take O(size_of_slice).
val mySeq = ArrayBuffer(0, 1, 2, 1, 0, -1, 0, 1, 2, 3, 2)
seqSplitter(mySeq, 0).toList
Gives:
List(ArrayBuffer(0, 1, 2, 1), ArrayBuffer(0, -1), ArrayBuffer(0, 1, 2, 3, 2))
A more detailed complexity analysis
The operations are:
Filter the delimiter indices (O(n))
loop over a list of indices obtained from previous step (O(num_of_delimeters)); for each pair of indices corresponding to a slice:
Copy the slice from the array and put it into the final collection (O(size_of_slice))
The last two steps sum up to O(n).
I have a user data from movielense ml-100K dataset.
Sample rows are -
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
I have read data as RDD as follows-
scala> val user_data = sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
user_data: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:29
scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))
# encode distinct profession with zipWithIndex -
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex()
indexed_profession: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[18] at zipWithIndex at <console>:31
scala> indexed_profession.collect()
res1: Array[(String, Long)] = Array((administrator,0), (artist,1), (doctor,2), (educator,3), (engineer,4), (entertainment,5), (executive,6), (healthcare,7), (homemaker,8), (lawyer,9), (librarian,10), (marketing,11), (none,12), (other,13), (programmer,14), (retired,15), (salesman,16), (scientist,17), (student,18), (technician,19), (writer,20))
I want to do one hot encoding for Occupation column.
Expected output is -
userId Age Gender Occupation Zipcodes technician other writer
1 24 M technician 85711 1 0 0
2 53 F other 94043 0 1 0
3 23 M writer 32067 0 0 1
4 24 M technician 43537 1 0 0
5 33 F other 15213 0 1 0
How do I achieve this on RDD in scala.
I want to perform operation on RDD without converting it to dataframe.
Any help
Thanks
I did this in following way -
1) Read user data -
scala> val user_data = sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
2) show 5 rows of data-
scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))
3) Create map of profession by indexing-
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex().collectAsMap()
scala> indexed_profession
res35: scala.collection.Map[String,Long] = Map(scientist -> 17, writer -> 20, doctor -> 2, healthcare -> 7, administrator -> 0, educator -> 3, homemaker -> 8, none -> 12, artist -> 1, salesman -> 16, executive -> 6, programmer -> 14, engineer -> 4, librarian -> 10, technician -> 19, retired -> 15, entertainment -> 5, marketing -> 11, student -> 18, lawyer -> 9, other -> 13)
4) create encode function which does one hot encoding of profession
scala> def encode(x: String) =
|{
| var encodeArray = Array.fill(21)(0)
| encodeArray(indexed_user.get(x).get.toInt)=1
| encodeArray
}
5) Apply encode function to user data -
scala> val encode_user_data = user_data.map{ x => (x(0),x(1),x(2),x(3),x(4),encode(x(3)))}
6) show encoded data -
scala> encode_user_data.take(6)
res71: Array[(String, String, String, String, String, Array[Int])] =
1,24,M,technician,85711,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
2,53,F,other,94043,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
3,23,M,writer,32067,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)),
4,24,M,technician,43537,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
5,33,F,other,15213,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
6,42,M,executive,98101,Array(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)))
[My solution is for Dataframe] This below should help in converting a categorical map to one-hot. You have to create a map catMap object with keys as column name and values as list of categories.
var OutputDf = df
for (cat <- catMap.keys) {
val categories = catMap(cat)
for (oneHotVal <- categories) {
OutputDf = OutputDf.withColumn(oneHotVal,
when(lower(OutputDf(cat)) === oneHotVal, 1).otherwise(0))
}
}
OutputDf
I have a construct of n lists that are being used to record the beginning and ending times associated with something I want to monitor (say a task). A task can be repeated multiple times (although the same task cannot overlap / run concurrently ). Each task has a unique id and its begin / end times are stored in it’s own list.
I’m trying to find the period of time where all tasks were running at the same time.
So as an example, below I have 3 tasks; taskId 1 happens 7 times, taskId 2 happens twice and taskId 3 happens only once;
import org.joda.time.DateTime
case class CVT(taskId: Int, begin: DateTime, end: DateTime)
val cvt1: CVT = CVT (3, new DateTime(2015, 1, 1, 1, 0), new DateTime(2015, 1, 1, 20,0) )
val cvt2: CVT = CVT (1, new DateTime(2015, 1, 1, 2, 0), new DateTime(2015, 1, 1, 3, 0) )
val cvt3: CVT = CVT (1, new DateTime(2015, 1, 1, 4, 0), new DateTime(2015, 1, 1, 6, 0) )
val cvt4: CVT = CVT (2, new DateTime(2015, 1, 1, 5, 0), new DateTime(2015, 1, 1, 11,0) )
val cvt5: CVT = CVT (1, new DateTime(2015, 1, 1, 7, 0), new DateTime(2015, 1, 1, 8, 0) )
val cvt6: CVT = CVT (1, new DateTime(2015, 1, 1, 9, 0), new DateTime(2015, 1, 1, 10, 0) )
val cvt7: CVT = CVT (1, new DateTime(2015, 1, 1, 12, 0), new DateTime(2015, 1, 1, 14,0) )
val cvt8: CVT = CVT (2, new DateTime(2015, 1, 1, 13, 0), new DateTime(2015, 1, 1, 16,0) )
val cvt9: CVT = CVT (1, new DateTime(2015, 1, 1, 15, 0), new DateTime(2015, 1, 1, 17,0) )
val cvt10: CVT = CVT (1, new DateTime(2015, 1, 1, 18, 0), new DateTime(2015, 1, 1, 19,0) )
val combinedTasks: List[CVT] = List(cvt1, cvt2, cvt3, cvt4, cvt5, cvt6, cvt7, cvt8, cvt9, cvt10).sortBy(_.begin)
The result I’m trying to get is :
CVT(123, DateTime(2015, 1, 1, 5, 0), DateTime(2005, 1, 1, 6 0) )
CVT(123, DateTime(2015, 1, 1, 7, 0), DateTime(2005, 1, 1, 8 0) )
CVT(123, DateTime(2015, 1, 1, 9, 0), DateTime(2005, 1, 1, 10 0) )
CVT(123, DateTime(2015, 1, 1, 13, 0), DateTime(2005, 1, 1, 14 0) )
CVT(123, DateTime(2015, 1, 1, 15, 0), DateTime(2005, 1, 1, 16 0) )
Note : I don’t mind what the ‘taskId’ is in the result, I’m just showing ‘123’ to try and show in this example that all three tasks were running between these start and end times.
I’ve looked at trying to use both a recursive fn and also the Joda Interval with the .gap method but can’t seem to find the solution.
Any tips on how I could achieve what I’m trying to do would be great.
Tks
I got a library for sets of non-overlapping intervals at https://github.com/rklaehn/intervalset . It is going to be in the next version of spire
Here is how you would use it:
import org.joda.time.DateTime
import spire.algebra.Order
import spire.math.Interval
import spire.math.extras.interval.IntervalSeq
// define an order for DateTimes
implicit val dateTimeOrder = Order.from[DateTime](_ compareTo _)
// create three sets of DateTime intervals
val intervals = Map[Int, IntervalSeq[DateTime]](
1 -> (IntervalSeq.empty |
Interval(new DateTime(2015, 1, 1, 2, 0), new DateTime(2015, 1, 1, 3, 0)) |
Interval(new DateTime(2015, 1, 1, 4, 0), new DateTime(2015, 1, 1, 6, 0)) |
Interval(new DateTime(2015, 1, 1, 7, 0), new DateTime(2015, 1, 1, 8, 0)) |
Interval(new DateTime(2015, 1, 1, 9, 0), new DateTime(2015, 1, 1, 10, 0)) |
Interval(new DateTime(2015, 1, 1, 12, 0), new DateTime(2015, 1, 1, 14, 0)) |
Interval(new DateTime(2015, 1, 1, 15, 0), new DateTime(2015, 1, 1, 17, 0)) |
Interval(new DateTime(2015, 1, 1, 18, 0), new DateTime(2015, 1, 1, 19, 0))),
2 -> (IntervalSeq.empty |
Interval(new DateTime(2015, 1, 1, 5, 0), new DateTime(2015, 1, 1, 11, 0)) |
Interval(new DateTime(2015, 1, 1, 13, 0), new DateTime(2015, 1, 1, 16, 0))),
3 -> (IntervalSeq.empty |
Interval(new DateTime(2015, 1, 1, 1, 0), new DateTime(2015, 1, 1, 20, 0))))
// calculate the intersection of all intervals
val result = intervals.values.foldLeft(IntervalSeq.all[DateTime])(_ & _)
// print the result
for (interval <- result.intervals)
println(interval)
Note that spire intervals are significantly more powerful than what you probably need. They distinguish between open and closed interval bounds, and can handle infinite intervals. But nevertheless the above should be pretty fast.
Additionaly to Rüdiger 's library, which I believe is powerful, fast and extensible here is simple implementation using built-in collections lib.
I did redefine your CVT class reflecting ability to carry intersections as
case class CVT[Id](taskIds: Id, begin: DateTime, end: DateTime)
All you individual cvt defs now changed to
val cvtN: CVT[Int] = ???
We will try to catch events enters scope and leaves scope within our collection. For that algo we'll define following ADT:
sealed class Event
case object Enter extends Event
case object Leave extends Event
And corresponding ordering instances:
implicit val eventOrdering = Ordering.fromLessThan[Event](_ == Leave && _ == Enter)
implicit val dateTimeOrdering = Ordering.fromLessThan[DateTime](_ isBefore _)
Now we can write following
val combinedTasks: List[CVT[Set[Int]]] = List(cvt1, cvt2, cvt3, cvt4, cvt5, cvt6, cvt7, cvt8, cvt9, cvt10)
.flatMap { case CVT(id, begin, end) => List((id, begin, Enter), (id, end, Leave)) }
.sortBy { case (id, time, evt) => (time, evt: Event) }
.foldLeft((Set.empty[Int], List.empty[CVT[Set[Int]]], DateTime.now())) { (state, event) =>
val (active, accum, last) = state
val (id, time, evt) = event
evt match {
case Enter => (active + id, accum, time)
case Leave => (active - id, CVT(active, last, time) :: accum, time)
}
}._2.filter(_.taskIds == Set(1,2,3)).reverse
The most important here foldLeft part. After ordering events where Leaves are coming before Enterings, we are just carrying set of current working jobs from event to event, adding to this set when new job enters and capturing interval, using last entering time when some job leaves.