How to group the lines from "<url>" to "</url>" into a sub-List? - scala

There is a List that was read from a file, as follows:
lines: List[String] = List(a, b, <url>, <loc>1</loc>, </url>, c, <url>, <loc>2</loc>, </url>, d)
expected:
result = List(a, b, List(<url>, <loc>1</loc>, </url>), c, List(<url>, <loc>2</loc>, </url>), d)

This appears to work.
val result = lines.foldRight(List[List[String]]()){
case (s, lls) => if (s.matches("<.+>")
&& lls.nonEmpty
&& lls.head.head.matches("<.+>"))
(s :: lls.head) :: lls.tail
else
List(s) :: lls
}
// result: List[List[String]] = List(List(a), List(b), List(<url>, <loc>1</loc>, </url>), List(c), List(<url>, <loc>2</loc>, </url>), List(d))
lines is folded from the right so that the result List and its sub-lists can be built by prepending, which is the cheapest way to grow an immutable List.
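The expected output in the question mixes plain strings with sub-lists, which a List[List[String]] cannot express directly. Below is a minimal sketch of one alternative, assuming the tags are exactly the strings "<url>" and "</url>" and are never nested. The helper name groupUrls is hypothetical, and Either is used only to keep the mixed result typed:

```scala
// Hypothetical helper: lines outside a <url> block become Left(line),
// each <url> ... </url> run becomes Right(sub-list).
def groupUrls(lines: List[String]): List[Either[String, List[String]]] =
  lines match {
    case Nil => Nil
    case "<url>" :: rest =>
      // body holds everything before the closing tag; after starts at "</url>" (if present)
      val (body, after) = rest.span(_ != "</url>")
      Right("<url>" :: body ::: after.take(1)) :: groupUrls(after.drop(1))
    case other :: rest => Left(other) :: groupUrls(rest)
  }

val sample = List("a", "b", "<url>", "<loc>1</loc>", "</url>", "c", "<url>", "<loc>2</loc>", "</url>", "d")
val grouped = groupUrls(sample)
```

Unlike the foldRight version, this keys on the two tag names rather than on anything that looks like a tag, so adjacent <url> blocks stay separate. It is not tail-recursive, so a very long input may need an accumulator-based rewrite.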

Related

How to get intersection of two RDD[(String, Iterable[String])]

The data consists of two columns
A B
A C
A D
B A
B C
B D
B E
C A
C B
C D
C E
D A
D B
D C
D E
E B
E C
E D
In the first row, think of it as A is friends with B, etc.
How do I find their common friends?
(A,B) -> (C D)
Meaning A and B have common friends C and D. I came as close as doing a groupByKey with the following result.
(B,CompactBuffer(A, C, D, E))
(A,CompactBuffer(B, C, D))
(C,CompactBuffer(A, B, D, E))
(E,CompactBuffer(B, C, D))
(D,CompactBuffer(A, B, C, E))
The code:
val rdd: RDD[String] = spark.sparkContext.textFile("twocols.txt")
val splitrdd: RDD[(String, String)] = rdd.map { s =>
val str = s.split(" ")
(str(0), str(1))
}
val group: RDD[(String, Iterable[String])] = splitrdd.groupByKey()
group.foreach(println)
First swap the elements:
val swapped = splitrdd.map(_.swap)
Then self-join and swap back:
val shared = swapped.join(swapped).map(_.swap)
Finally filter out duplicates (if needed) and groupByKey:
shared.filter { case ((x, y), _) => x < y }.groupByKey
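To see why the swap, self-join and groupByKey steps produce common friends, here is a sketch of the same three steps on plain Scala collections instead of RDDs. It is only an illustration under that assumption: groupBy plus a nested loop stands in for Spark's join, and all names (edges, swappedEdges, byFriend, sharedFriend, commonByPair) are made up for the example.

```scala
// The sample data from the question, as (person, friend) pairs.
val edges: List[(String, String)] =
  "A B,A C,A D,B A,B C,B D,B E,C A,C B,C D,C E,D A,D B,D C,D E,E B,E C,E D"
    .split(",").toList
    .map { s => val parts = s.split(" "); (parts(0), parts(1)) }

// Step 1: swap, so every record is keyed by the friend.
val swappedEdges: List[(String, String)] = edges.map(_.swap)

// Step 2: emulate the self-join: for each friend, pair up everyone who lists that friend.
val byFriend: Map[String, List[String]] =
  swappedEdges.groupBy(_._1).map { case (friend, rows) => friend -> rows.map(_._2) }

val sharedFriend: List[((String, String), String)] =
  byFriend.toList.flatMap { case (friend, people) =>
    // x < y drops (x, x) and keeps only one ordering of each pair
    for { x <- people; y <- people if x < y } yield ((x, y), friend)
  }

// Step 3: group by the pair to collect all of its shared friends.
val commonByPair: Map[(String, String), Set[String]] =
  sharedFriend.groupBy(_._1).map { case (pair, rows) => pair -> rows.map(_._2).toSet }
```

With the sample data, commonByPair(("A", "B")) is Set(C, D), matching the (A,B) -> (C D) line in the question.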
This is just an ugly attempt:
Suppose you have converted your two columns into Array[Array[String]] (or List[List[String]], it's really the same), say
val pairs=Array(
Array("A","B"),
Array("A","C"),
Array("A","D"),
Array("B","A"),
Array("B","C"),
Array("B","D"),
Array("B","E"),
Array("C","A"),
Array("C","B"),
Array("C","D"),
Array("C","E"),
Array("D","A"),
Array("D","B"),
Array("D","C"),
Array("D","E"),
Array("E","B"),
Array("E","C"),
Array("E","D")
)
Define the group for which you want to find their common friends:
val group=Array("C","D")
The following will find the friends for each member in your group
val friendsByMemberOfGroup=group.map(
i => pairs.filter(x=> x(1) contains i)
.map(x=>x(0))
)
For example, pairs.filter(x=>x(1) contains "C").map(x=>x(0)) returns the friends of "C" where "C" is being taken from the second column and its friends are taken from the first column:
scala> pairs.filter(x=> x(1) contains "C").map(x=>x(0))
res212: Array[String] = Array(A, B, D, E)
And the following loop will find the common friends of all the members in your group
var commonFriendsOfGroup=friendsByMemberOfGroup(0).toSet
for(i <- 1 to friendsByMemberOfGroup.size-1){
commonFriendsOfGroup=
commonFriendsOfGroup.intersect(friendsByMemberOfGroup(i).toSet)
}
So you get
scala> commonFriendsOfGroup.toArray
res228: Array[String] = Array(A, B, E)
If you change your group to val group=Array("A","B","E") and apply the previous lines then you will get
scala> commonFriendsOfGroup.toArray
res230: Array[String] = Array(C, D)
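The mutable loop above can also be written as a reduction over sets. This is a sketch of that variation, not the answer's own code; friendRows, friendsOf and commonFriendsOf are hypothetical names, and tuples replace the inner arrays:

```scala
// Same sample data, as (first column, second column) tuples.
val friendRows: List[(String, String)] =
  "A B,A C,A D,B A,B C,B D,B E,C A,C B,C D,C E,D A,D B,D C,D E,E B,E C,E D"
    .split(",").toList
    .map { s => val parts = s.split(" "); (parts(0), parts(1)) }

// Friends of a member: first-column values on rows whose second column equals the member.
def friendsOf(member: String): Set[String] =
  friendRows.filter(_._2 == member).map(_._1).toSet

// Intersect the friend sets of every member; an empty group yields an empty set.
def commonFriendsOf(group: Seq[String]): Set[String] =
  group.map(friendsOf).reduceOption(_ intersect _).getOrElse(Set.empty[String])
```

Comparing with == instead of contains avoids accidental substring matches once names are longer than one letter.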
Continuing from where you left off:
val group: RDD[(String, Iterable[String])] = splitrdd.groupByKey()
val group_map = group.collectAsMap
val common_friends = group
.flatMap{case (x, friends) =>
friends.map{y =>
((x,y),group_map.get(y).get.toSet.intersect(friends.toSet))
}
}
scala> common_friends.foreach(println)
((B,A),Set(C, D))
((B,C),Set(A, D, E))
((B,D),Set(A, C, E))
((B,E),Set(C, D))
((D,A),Set(B, C))
((D,B),Set(A, C, E))
((D,C),Set(A, B, E))
((D,E),Set(B, C))
((A,B),Set(C, D))
((A,C),Set(B, D))
((A,D),Set(B, C))
((C,A),Set(B, D))
((C,B),Set(A, D, E))
((C,D),Set(A, B, E))
((C,E),Set(B, D))
((E,B),Set(C, D))
((E,C),Set(B, D))
((E,D),Set(B, C))
Note: this assumes your data has the relationship in both directions like in your example: (A B and B A). If it's not the case you need to add some code to deal with the fact that group_map.get(y) might return None.
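One way to handle the one-directional case mentioned in the note is to fall back to an empty set when a key is missing. A small sketch on a plain Map, where friendMap and safeCommon are hypothetical names and the data is deliberately missing an entry for C:

```scala
// C appears as a friend but has no entry of its own.
val friendMap: Map[String, Set[String]] =
  Map("A" -> Set("B", "C"), "B" -> Set("A", "C"))

// getOrElse replaces .get(y).get, so a missing key means "no friends" instead of an exception.
def safeCommon(x: String, y: String): Set[String] =
  friendMap.getOrElse(x, Set.empty[String]) intersect friendMap.getOrElse(y, Set.empty[String])
```

The same getOrElse call can be applied to group_map inside the flatMap above.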
So I ended up doing this on the client side. DO NOT DO THIS
val arr: Array[(String, Iterable[String])] = group.collect()
//arr.foreach(println)
var arr2 = scala.collection.mutable.Set[((String, String), List[String])]()
for (i <- arr)
for (j <- arr)
if (i != j) {
val s1 = i._2.toSet
val s2 = j._2.toSet
val s3 = s1.intersect(s2).toList
//println(s3)
val pair = if (i._1 < j._1) (i._1, j._1) else (j._1, i._1)
arr2 += ((pair, s3))
}
arr2.foreach(println)
The result is
((B,E),List(C, D))
((A,C),List(B, D))
((A,B),List(C, D))
((A,D),List(B, C))
((B,D),List(A, C, E))
((C,D),List(A, B, E))
((B,C),List(A, D, E))
((C,E),List(B, D))
((D,E),List(B, C))
((A,E),List(B, C, D))
I am wondering if I can do this using transformations within Spark.

What is the best approach to get all possible sublists?

I have a List "a, b, c,d", and this is the expected result
a
ab
abc
abcd
b
bc
bcd
c
cd
d
I tried brute force, but I think there might be a more efficient solution, given that my list is very long.
Here's a one-liner.
"abcd".tails.flatMap(_.inits.toSeq.init.reverse).mkString(",")
//res0: String = a,ab,abc,abcd,b,bc,bcd,c,cd,d
The mkString() is added just so we can see the result. Otherwise the result is an Iterator[String], which is a pretty memory efficient collection type.
The reverse is only there so that it comes out in the order you specified. If the order of the results is unimportant then that can be removed.
The toSeq.init is there to remove empty elements left behind by the inits call. If those can be dealt with elsewhere then this can also be removed.
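The pieces of the one-liner are easier to follow when evaluated separately. The sketch below uses a List[Char] rather than a String (an assumption made to match the List in the question); the val names are illustrative only:

```scala
val letters = List('a', 'b', 'c', 'd')

// tails: every suffix, ending with the empty list
val suffixes: List[String] = letters.tails.map(_.mkString).toList

// inits: every prefix, longest first, also ending with the empty list
val prefixes: List[String] = letters.inits.map(_.mkString).toList

// For each suffix: take its prefixes, drop the trailing empty one, shortest first.
val sublists: List[String] =
  letters.tails.flatMap(_.inits.toSeq.init.reverse).map(_.mkString).toList
```

A list of length n has n(n+1)/2 contiguous non-empty sublists, so the output size grows quadratically regardless of how it is generated.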
This may not be the best solution, but one way of doing this is with the sliding function, as follows:
val lst = List('a', 'b', 'c', 'd')
val groupedElements = (1 to lst.size).flatMap(x =>
lst.sliding(x, 1))
groupedElements.foreach(x => println(x.mkString("")))
//output
/* a
b
c
d
ab
bc
cd
abc
bcd
abcd
*/
It may not be the best solution, but I think it is a good one, and it is tail-recursive.
First, this function returns every non-empty prefix of a List:
def getSubList[A](lista: Seq[A]): Seq[Seq[A]] = {
for {
i <- 1 to lista.length
} yield lista.take(i)
}
And then this one to perform the recursion calling the first function and obtain all the sublists possible:
def getSubListRecursive[A](lista: Seq[A]): Seq[Seq[A]] = {
@scala.annotation.tailrec
def go(acc: Seq[Seq[A]], rest: Seq[A]): Seq[Seq[A]] = {
rest match {
case Nil => acc
case l => go(acc = acc ++ getSubList(l), rest= l.tail)
}
}
go(Nil, lista)
}
The output:
getSubListRecursive(l)
res4: Seq[Seq[String]] = List(List(a), List(a, b), List(a, b, c), List(a, b, c, d), List(b), List(b, c), List(b, c, d), List(c), List(c, d), List(d))

For Expression with recursive generator

In the 3 code variations below, the For Expression produces totally different output. The recursive generator seems to be sourced from real values (A,B,C) but in version2 and version3 of the function below, none of the letters were present in the yield output. What is the reason?
def permuteV1(coll:List[Char]) : List[List[Char]] = {
if (coll.isEmpty) List(List())
else {
for {
pos <- coll.indices.toList
elem <- permuteV1(coll.filter(_ != coll(pos)))
} yield coll(pos) :: elem
}
}
permuteV1("ABC".toList)
//res1: List[List[Char]] = List(List(A, B, C), List(A, C, B), List(B, A, C), List(B, C, A), List(C, A, B), List(C, B, A))
def permuteV2(coll:List[Char]) : List[List[Char]] = {
if (coll.isEmpty) List(List())
else {
for {
pos <- coll.indices.toList
elem <- permuteV2(coll.filter(_ != coll(pos)))
} yield elem
}
}
permuteV2("ABC".toList)
//res2: List[List[Char]] = List(List(), List(), List(), List(), List(), List())
def permuteV3(coll:List[Char]) : List[List[Char]] = {
if (coll.isEmpty) List(List())
else {
for {
pos <- coll.indices.toList
elem <- permuteV3(coll.filter(_ != coll(pos)))
} yield '-' :: elem
}
}
permuteV3("ABC".toList)
//res3: List[List[Char]] = List(List(-, -, -), List(-, -, -), List(-, -, -), List(-, -, -), List(-, -, -), List(-, -, -))
In all three examples the recursion bottoms out at the empty List(), so the innermost elem is always empty. When you walk through the recursion you'll see that anything else in the result has to be added by the yield.
In the 1st case you pre-pend a meaningful value to the empty elem. Those values are saved on the stack and collected when the recursion has reached its conclusion and the stack is unwound.
A different way to get the same result: "ABC".toList.permutations.toList
Let's start with the last snippet, because it is kinda more obvious.
The only thing you ever add to the list is '-', so, it would be surprising if the result contained anything else, right?
Now, by similar reasoning, in the second example you never add anything to the list at all, so the result can only contain empty lists.
The first one works ... well, because it actually does add data to the result :)
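The three versions differ only in the yield expression, which can be made explicit by passing that step in as a parameter. This is a sketch for illustration; permuteWith and step are hypothetical names, not part of the question's code:

```scala
// Same recursion as permuteV1..V3, with the yield step supplied by the caller.
def permuteWith(coll: List[Char])(step: (Char, List[Char]) => List[Char]): List[List[Char]] =
  if (coll.isEmpty) List(List())
  else
    for {
      pos  <- coll.indices.toList
      elem <- permuteWith(coll.filter(_ != coll(pos)))(step)
    } yield step(coll(pos), elem)

val v1 = permuteWith("ABC".toList)((c, rest) => c :: rest)   // keeps the chosen element
val v2 = permuteWith("ABC".toList)((_, rest) => rest)        // discards it
val v3 = permuteWith("ABC".toList)((_, rest) => '-' :: rest) // replaces it with '-'
```

In every case the generators run six times for "ABC", which is why all three results have six entries; only step decides what each entry contains.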

Apply element to successive elements in a List

I'm looking for a way to apply each element in a List with its successive elements in Scala without writing a nested for loop. Basically I'm looking for a List comprehension that allows me to do the following:
Let
A = {a, b, c, d}
Then A' = {ab, ac, ad, bc, bd, cd}
I thought about using map for example, A.map(x => ...), but I can't figure out how the rest of the statement would look.
Hopefully this all makes sense. Any help would be greatly appreciated.
This seems a natural fit for recursive evaluation, since it amounts to prepending the first element to each element of the rest of the list, then doing the same thing for the rest of the list.
def pairs(xs: List[Char]): List[String] = xs match {
case Nil | _ :: Nil => Nil
case y :: ys => ys.map(z => s"$y$z") ::: pairs(ys)
}
pairs(a) //> res0: List[String] = List(ab, ac, ad, bc, bd, cd)
Tail recursive
def pairs2(xs: List[Char], acc:List[String]): List[String] = xs match {
case Nil | _ :: Nil => acc.reverse
case y :: ys => pairs2(ys, ys.foldLeft(acc){(acc, z) => s"$y$z"::acc})
}
pairs2(a, Nil) //> res0: List[String] = List(ab, ac, ad, bc, bd, cd)
Or if you really want a comprehension:
val res = for {(x::xs) <- a.tails
y <- xs
}
yield s"$x$y"
(returns an iterator, so force its evaluation)
res.toList //> res1: List[String] = List(ab, ac, ad, bc, bd, cd)
which suggests yet another variant, from desugaring
a.tails.collect{case(x::xs) => xs.map(y=>s"$x$y")}.flatten.toList
//> res2: List[String] = List(ab, ac, ad, bc, bd, cd)
Remembering that in Scala what we have is a 'for-comprehension' rather than a 'for-loop' in the Java sense, the construction isn't "nested" in the same sense as it would be in Java. Specifically, it would look something like:
// For a list of items of some type 'A':
val items: List[A] = ???
// and some suitable combining function (which might be inlined if simple enough):
def fn(i1: A, i2: A): A = ???
// an example for-comprehension that will achieve the output you describe:
for {
x <- items.zipWithIndex
y <- items.zipWithIndex
z <- List(fn(x._1, y._1)) if (x._2 < y._2)
} yield z
which seems clean enough to me. This de-sugars to something like:
items.zipWithIndex.flatMap( x =>
items.zipWithIndex.flatMap( y =>
List(fn(x._1, y._1)).withFilter( z => x._2 < y._2 ).map( z => z ) ) )
which while being much more along the lines of the "List comprehension" you specifically asked for, seems less clear to read to me!
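The zipWithIndex version visits every (x, y) combination and filters half of them out. A sketch of a generic alternative built on tails, which only ever pairs an element with the elements after it; successivePairs is a hypothetical name and fn is whatever combining function is needed:

```scala
// Combine each element with every later element, preserving order.
def successivePairs[A, B](items: List[A])(fn: (A, A) => B): List[B] =
  items.tails.flatMap {
    case x :: rest => rest.map(y => fn(x, y))
    case Nil       => Nil
  }.toList

val letterPairs = successivePairs(List("a", "b", "c", "d"))(_ + _)
val products    = successivePairs(List(1, 2, 3))(_ * _)
```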
If you are combining Strings you could do something like this.
scala> List("a","b","c","d").combinations(2).map(s => s.head+s.last).toList
res5: List[String] = List(ab, ac, ad, bc, bd, cd)
But you refer to it as "a way to apply each element" so maybe you mean something else? If so, perhaps this approach could get you started.

Functional equivalent of if (p(f(a), f(b)) a else b

I'm guessing that there must be a better functional way of expressing the following:
def foo(i: Any) : Int
if (foo(a) < foo(b)) a else b
So in this example f == foo and p == _ < _. There's bound to be some masterful cleverness in scalaz for this! I can see that using BooleanW I can write:
p(f(a), f(b)).option(a).getOrElse(b)
But I was sure that I would be able to write some code which only referred to a and b once. If this exists it must be on some combination of Function1W and something else but scalaz is a bit of a mystery to me!
EDIT: I guess what I'm asking here is not "how do I write this?" but "What is the correct name and signature for such a function and does it have anything to do with FP stuff I do not yet understand like Kleisli, Comonad etc?"
Just in case it's not in Scalaz:
def x[T,R](f : T => R)(p : (R,R) => Boolean)(x : T*) =
x reduceLeft ((l, r) => if(p(f(l),f(r))) r else l)
scala> x(Math.pow(_ : Int,2))(_ < _)(-2, 0, 1)
res0: Int = -2
Alternative with some overhead but nicer syntax.
class MappedExpression[T,R](i : (T,T), m : (R,R)) {
def select(p : (R,R) => Boolean ) = if(p(m._1, m._2)) i._1 else i._2
}
class Expression[T](i : (T,T)){
def map[R](f: T => R) = new MappedExpression(i, (f(i._1), f(i._2)))
}
implicit def tupleTo[T](i : (T,T)) = new Expression(i)
scala> ("a", "bc") map (_.length) select (_ < _)
res0: java.lang.String = a
I don't think that Arrows or any other special type of computation can be useful here. After all, you're calculating with normal values, and you can usually lift a pure computation into the special type of computation (using arr for arrows or return for monads).
However, one very simple arrow, arr a b, is simply a function a -> b. You could then use arrows to split your code into more primitive operations. However, there is probably no reason for doing that and it only makes your code more complicated.
You could for example lift the call to foo so that it is done separately from the comparison. Here is a simple definition of arrows in F# - it declares *** and >>> arrow combinators and also arr for turning pure functions into arrows:
type Arr<'a, 'b> = Arr of ('a -> 'b)
let arr f = Arr f
let ( *** ) (Arr fa) (Arr fb) = Arr (fun (a, b) -> (fa a, fb b))
let ( >>> ) (Arr fa) (Arr fb) = Arr (fa >> fb)
Now you can write your code like this:
let calcFoo = arr <| fun a -> (a, foo a)
let compareVals = arr <| fun ((a, fa), (b, fb)) -> if fa < fb then a else b
(calcFoo *** calcFoo) >>> compareVals
The *** combinator takes a pair of inputs and runs the first function on the first element and the second function on the second element. >>> then composes this arrow with the one that does the comparison.
But as I said - there is probably no reason at all for writing this.
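For readers following along in Scala, the F# snippet can be approximated with plain functions. This is a rough sketch, not a Scalaz arrow: both is a hypothetical stand-in for ***, andThen plays the role of >>>, and foo is assumed to be string length purely for the example.

```scala
// Stand-in for the question's foo; any function to Int would do.
def foo(s: String): Int = s.length

type Scored = (String, Int)

// Counterpart of ***: apply f to the first element of a pair and g to the second.
def both[A, B, C, D](f: A => B, g: C => D): ((A, C)) => (B, D) = {
  case (a, c) => (f(a), g(c))
}

// Pair each value with its foo result, as calcFoo does in the F# code.
val calcFoo: String => Scored = a => (a, foo(a))

// Pick the value whose foo result is smaller; ties go to the second value.
val compareVals: ((Scored, Scored)) => String = {
  case ((a, fa), (b, fb)) => if (fa < fb) a else b
}

// Counterpart of >>>: ordinary left-to-right function composition.
val pickSmaller: ((String, String)) => String = both(calcFoo, calcFoo) andThen compareVals
```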
Here's the Arrow based solution, implemented with Scalaz. This requires trunk.
You don't get a huge win from using the arrow abstraction with plain old functions, but it is a good way to learn them before moving to Kleisli or Cokleisli arrows.
import scalaz._
import Scalaz._
def mod(n: Int)(x: Int) = x % n
def mod10 = mod(10) _
def first[A, B](pair: (A, B)): A = pair._1
def selectBy[A](p: (A, A))(f: (A, A) => Boolean): A = if (f.tupled(p)) p._1 else p._2
def selectByFirst[A, B](f: (A, A) => Boolean)(p: ((A, B), (A, B))): (A, B) =
selectBy(p)(f comap first) // comap adapts the input to f with function first.
val pair = (7, 16)
// Using the Function1 arrow to apply two functions to a single value, resulting in a Tuple2
((mod10 &&& identity) apply 16) assert_≟ (6, 16)
// Using the Function1 arrow to perform mod10 and identity respectively on the first and second element of a `Tuple2`.
val pairs = ((mod10 &&& identity) product) apply pair
pairs assert_≟ ((7, 7), (6, 16))
// Select the tuple with the smaller value in the first element.
selectByFirst[Int, Int](_ < _)(pairs)._2 assert_≟ 16
// Using the Function1 Arrow Category to compose the calculation of mod10 with the
// selection of desired element.
val calc = ((mod10 &&& identity) product) ⋙ selectByFirst[Int, Int](_ < _)
calc(pair)._2 assert_≟ 16
Well, I searched Hoogle for a type signature like the one in Thomas Jung's answer, and there is a function called on. This is what I searched for:
(a -> b) -> (b -> b -> Bool) -> a -> a -> a
Where (a -> b) is the equivalent of foo, (b -> b -> Bool) is the equivalent of <. Unfortunately, the signature for on returns something else:
(b -> b -> c) -> (a -> b) -> a -> a -> c
This is almost the same, if you replace c with Bool and a in the two places it appears, respectively.
So, right now, I suspect it doesn't exist. It occurred to me that there's a more general type signature, so I tried it as well:
(a -> b) -> ([b] -> b) -> [a] -> a
This one yielded nothing.
EDIT:
Now I don't think I was that far at all. Consider, for instance, this:
Data.List.maximumBy (on compare length) ["abcd", "ab", "abc"]
The signature of maximumBy is (a -> a -> Ordering) -> [a] -> a, which, combined with on, is pretty close to what you originally specified, given that Ordering has three values -- almost a boolean! :-)
So, say you wrote on in Scala:
def on[A, B, C](f: ((B, B) => C), g: A => B): (A, A) => C = (a: A, b: A) => f(g(a), g(b))
Then you could write select like this:
def select[A](p: (A, A) => Boolean)(a: A, b: A) = if (p(a, b)) a else b
And use it like this:
select(on((_: Int) < (_: Int), (_: String).length))("a", "ab")
Which really works better with currying and dot-free notation. :-) But let's try it with implicits:
implicit def toFor[A, B](g: A => B) = new {
def For[C](f: (B, B) => C) = (a1: A, a2: A) => f(g(a1), g(a2))
}
implicit def toSelect[A](t: (A, A)) = new {
def select(p: (A, A) => Boolean) = t match {
case (a, b) => if (p(a, b)) a else b
}
}
Then you can write
("a", "ab") select (((_: String).length) For (_ < _))
Very close. I haven't figured out any way to remove the type qualifier from there, though I suspect it is possible. I mean, without going the way of Thomas's answer. But maybe that is the way. In fact, I think on (_.length) select (_ < _) reads better than map (_.length) select (_ < _).
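For comparison, here is a sketch of the same idea as a single curried function, which sidesteps the implicit conversions at the cost of one type annotation. pickBy is a hypothetical name; minBy is the standard collection method and comes close to the original expression when p is <:

```scala
// f maps each candidate to a key, p compares the keys, and the winning candidate is returned.
def pickBy[A, B](f: A => B)(p: (B, B) => Boolean)(a: A, b: A): A =
  if (p(f(a), f(b))) a else b

val shorter  = pickBy((_: String).length)(_ < _)("a", "bc")
val viaMinBy = List("a", "bc").minBy(_.length)
```

The two differ on ties: pickBy with < returns the second argument when the keys are equal, just like the original if expression, while minBy returns the first minimal element.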
This expression can be written very elegantly in the Factor programming language - a language where function composition is the way of doing things, and most code is written in a point-free manner. The stack semantics and row polymorphism facilitate this style of programming. This is what the solution to your problem will look like in Factor:
! We find the longer of two lists here. The expression returns { 4 5 6 7 8 }
{ 1 2 3 } { 4 5 6 7 8 } [ [ length ] bi@ > ] 2keep ?
! We find the shorter of two lists here. The expression returns { 1 2 3 }.
{ 1 2 3 } { 4 5 6 7 8 } [ [ length ] bi@ < ] 2keep ?
Of interest here is the combinator 2keep. It is a "preserving dataflow combinator", which means that it retains its inputs after the given function is performed on them.
Let's try to translate (sort of) this solution to Scala.
First of all, we define an arity-2 preserving combinator.
scala> def keep2[A, B, C](f: (A, B) => C)(a: A, b: B) = (f(a, b), a, b)
keep2: [A, B, C](f: (A, B) => C)(a: A, b: B)(C, A, B)
And an eagerIf combinator. if being a control structure cannot be used in function composition; hence this construct.
scala> def eagerIf[A](cond: Boolean, x: A, y: A) = if(cond) x else y
eagerIf: [A](cond: Boolean, x: A, y: A)A
Also, the on combinator. Since it clashes with a method with the same name from Scalaz, I'll name it upon instead.
scala> class RichFunction2[A, B, C](f: (A, B) => C) {
| def upon[D](g: D => A)(implicit eq: A =:= B) = (x: D, y: D) => f(g(x), g(y))
| }
defined class RichFunction2
scala> implicit def enrichFunction2[A, B, C](f: (A, B) => C) = new RichFunction2(f)
enrichFunction2: [A, B, C](f: (A, B) => C)RichFunction2[A,B,C]
And now put this machinery to use!
scala> def length: List[Int] => Int = _.length
length: List[Int] => Int
scala> def smaller: (Int, Int) => Boolean = _ < _
smaller: (Int, Int) => Boolean
scala> keep2(smaller upon length)(List(1, 2), List(3, 4, 5)) |> Function.tupled(eagerIf)
res139: List[Int] = List(1, 2)
scala> def greater: (Int, Int) => Boolean = _ > _
greater: (Int, Int) => Boolean
scala> keep2(greater upon length)(List(1, 2), List(3, 4, 5)) |> Function.tupled(eagerIf)
res140: List[Int] = List(3, 4, 5)
This approach does not look particularly elegant in Scala, but at least it shows you one more way of doing things.
There's a nice-ish way of doing this with on and Monad, but Scala is unfortunately very bad at point-free programming. Your question is basically: "can I reduce the number of points in this program?"
Imagine if on and if were differently curried and tupled:
def on2[A,B,C](f: A => B)(g: (B, B) => C): ((A, A)) => C = {
case (a, b) => f.on(g, a, b)
}
def if2[A](b: Boolean): ((A, A)) => A = {
case (p, q) => if (b) p else q
}
Then you could use the reader monad:
on2(f)(_ < _) >>= if2
The Haskell equivalent would be:
on' (<) f >>= if'
where on' f g = uncurry $ on f g
if' x (y,z) = if x then y else z
Or...
flip =<< flip =<< (if' .) . on (<) f
where if' x y z = if x then y else z