Scala function does not return a value

I think I understand the rules of implicit returns but I can't figure out why splithead is not being set. This code is run via
val m = new TaxiModel(sc, file)
and then I expect
m.splithead
to give me an array of strings. Note that head is an array of strings.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
class TaxiModel(sc: SparkContext, dat: String) {
  val rawData = sc.textFile(dat)
  val head = rawData.take(10)
  val splithead = head.slice(1, 11).foreach(splitData)

  def splitData(dat: String): Array[String] = {
    val splits = dat.split("\",\"")
    val split0 = splits(0).substring(1, splits(0).length)
    val split8 = splits(8).substring(0, splits(8).length - 1)
    Array(split0).union(splits.slice(1, 8)).union(Array(split8))
  }
}

foreach just evaluates an expression for its side effects and does not collect any data while iterating, so it returns Unit. You probably need map or flatMap (see the docs here)
head.slice(1,11).map(splitData) // gives you Array[Array[String]]
head.slice(1,11).flatMap(splitData) // gives you Array[String]

Consider also a for comprehension (which in this case desugars into map),
for (s <- head.slice(1,11)) yield splitData(s)
Note also that Scala strings are equipped with the usual ordered collection methods, so
splits(0).substring(1, splits(0).length)
proves equivalent to any of the following
splits(0).drop(1)
splits(0).tail
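For completeness, splitData itself can be written with those methods; a minimal sketch, keeping the original column layout:
def splitData(dat: String): Array[String] = {
  val splits = dat.split("\",\"")
  // drop(1) removes the leading quote, dropRight(1) removes the trailing quote
  Array(splits(0).drop(1)) ++ splits.slice(1, 8) ++ Array(splits(8).dropRight(1))
}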

Related

Return type to assign to val for RDDs

I am playing around with Spark code to learn more about shuffling. I wrote the following code to see how stages are formed if there is an if-else statement. I have declared val result so that the result can be assigned to it later in the if statement, but I am not sure what type to declare it as.
Is there an abstract class that goes with all the RDDs?
val conf = new SparkConf().setMaster("local").setAppName("spark shuffle")
val sc = new SparkContext(conf)
val d = sc.parallelize(0 until 1000).map(i => (i%1000, i))
val x = d.reduceByKey(_+_)
val count = 1
val result: RDD // What is the correct return type here?
if (count == 1) {
  result = d.rightOuterJoin(x)
  result.collect()
}
d is an RDD[(Int, Int)].
Doing reduceByKey on it gives the same type, just with the values reduced per key.
Doing a right outer join then gives you an RDD[(Int, (Option[Int], Int))], i.e. for each key the left and right values, with the left one wrapped in an Option because it may be absent.
So collect gives you an Array of the same pairs.
The API documentation is not easy to follow for all these functions; there are a lot of generic types and a lot of implicits. I would recommend that you either use an IDE that hints the types for you, or use a tool that gives you a console where you can try snippets.
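As a rough sketch of how the types line up (same vals as above; the explicit annotations are only for illustration):
import org.apache.spark.rdd.RDD

val d: RDD[(Int, Int)] = sc.parallelize(0 until 1000).map(i => (i % 1000, i))
val x: RDD[(Int, Int)] = d.reduceByKey(_ + _)
// right outer join: the left value becomes optional, the right value is always present
val joined: RDD[(Int, (Option[Int], Int))] = d.rightOuterJoin(x)
val collected: Array[(Int, (Option[Int], Int))] = joined.collect()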
You can avoid the assignment altogether (and to make the original code compile, result would have to be a var, not a val):
val conf = new SparkConf().setMaster("local").setAppName("spark shuffle")
val sc = new SparkContext(conf)
val d = sc.parallelize(0 until 1000).map(i => (i%1000, i))
val x = d.reduceByKey(_+_)
val count = 1
if (count == 1) {
d.rightOuterJoin(x).collect()
}

Applying multiple map functions to streaming database results in Play 2.6

I have a large query that seems to be a prime candidate for streaming results.
I would like to make a call to a function which returns an object that I can apply additional map transformations on, and then ultimately convert the entire result into a list. This is because the conversions will result in a set of objects much smaller than the rows in the database, and there are many different transformations that must take place sequentially. Processing one result at a time will save me significant memory.
For example, if the results from the database were a stream (though the correct thing is likely an AkkaStream or an Iteratee), then I could do something like:
def outer(converter1: String => Int, converter2: Int => Double): List[Double] = {
  val sqlIterator = getSqlIterator()
  val mappedIterator1 = sqlIterator.map(x => converter1(x.bigColumn))
  val mappedIterator2 = mappedIterator1.map(x => converter2(x))
  val retVal = mappedIterator2.toList
  retVal
}
def getSqlIterator() = {
  val selectedObjects = SQL("""SELECT * FROM table""").map { x =>
    val id = x[Long]("id")
    val tinyColumn = x[String]("tiny_column")
    val bigColumn = x[String]("big_column")
    NewObject(id, tinyColumn, bigColumn)
  }
  val transformed = UNKNOWN_FUNCTION(selectedObjects)
  transformed
}
Most of the documentation appears to provide a mechanism for applying a "reduce" function to the results rather than a "map" function, but the resulting mapped objects will be much smaller, saving me significant memory. What should I do for UNKNOWN_FUNCTION?
The following is a simple example of using Anorm's Akka Streams support to read the values from a single column of type String, applying two transformations to each element, and placing the results in a Seq. I'll leave it as an exercise for you to retrieve the values from multiple columns at a time, if that's what you need.
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import anorm._
import scala.collection.immutable.Seq
import scala.concurrent.Future
implicit val system = ActorSystem("MySystem")
implicit val materializer = ActorMaterializer()
implicit val ec = system.dispatcher
val convertStringToInt: String => Int = ???
val convertIntToDouble: Int => Double = ???
val result: Future[Seq[Double]] =
  AkkaStream.source(SQL"SELECT big_column FROM table", SqlParser.scalar[String])
    .map(convertStringToInt)
    .map(convertIntToDouble)
    .runWith(Sink.seq[Double])
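Since the stream materializes a Future, you still need to wait for it or map over it to get the Seq out; a minimal sketch of consuming the result (blocking with Await only for illustration):
import scala.concurrent.Await
import scala.concurrent.duration._

// in a Play controller you would normally map the Future into a Result instead of blocking
val doubles: Seq[Double] = Await.result(result, 30.seconds)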

How to iterate scala wrappedArray? (Spark)

I perform the following operations:
val tempDict = sqlContext.sql("""select words.pName_token, collect_set(words.pID) as docids
                                 from words
                                 group by words.pName_token""").toDF()
val wordDocs = tempDict.filter(tempDict("pName_token") === word)
val listDocs = wordDocs.map(t => t(1)).collect()
listDocs: Array[Any] = Array(WrappedArray(123, 234, 205876618, 456))
My question is how do I iterate over this wrapped array or convert this into a list?
The options I get for the listDocs are apply, asInstanceOf, clone, isInstanceOf, length, toString, and update.
How do I proceed?
Here is one way to solve this.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
val data = Seq((Seq(1,2,3),Seq(4,5,6),Seq(7,8,9)))
val df = sqlContext.createDataFrame(data)
val first = df.first
// use getAs with the expected element type to recover the WrappedArray
val mapped = first.getAs[WrappedArray[Int]](0)
// now we can use it like normal collection
mapped.mkString("\n")
// extract the array columns from every row by pattern matching on Row
val rows = df.collect.map {
  case Row(a: Seq[Any], b: Seq[Any], c: Seq[Any]) =>
    (a, b, c)
}
rows.mkString("\n")
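If the goal is a plain List rather than just iterating, the WrappedArray obtained via getAs can be converted directly; a small sketch reusing the mapped value from above:
// WrappedArray is a Seq, so the usual collection methods are available
val asList: List[Int] = mapped.toList
asList.foreach(println)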

Scala Akka Stream: How to Pass Through a Seq

I'm trying to wrap some blocking calls in Future. The return type is Seq[User], where User is a case class. The following just wouldn't compile, with complaints about various overloaded versions being present. Any suggestions? I tried almost all the variations of Source.apply without any luck.
// All I want is Seq[User] => Future[Seq[User]]
def findByFirstName(firstName: String) = {
  val users: Seq[User] = userRepository.findByFirstName(firstName)
  val sink = Sink.fold[User, User](null)((_, elem) => elem)
  val src = Source(users) // doesn't compile
  src.runWith(sink)
}
First of all, I assume that you are using version 1.0 of akka-http-experimental, since the API may have changed from previous releases.
The reason your code does not compile is that akka.stream.scaladsl.Source$.apply() requires a
scala.collection.immutable.Seq, whereas a plain Seq is the more general scala.collection.Seq.
Therefore you have to convert to an immutable sequence using the to[T] method.
Document: akka.stream.scaladsl.Source
Additionally, as you can see in the documentation, Source$.apply() also accepts a () => Iterator[T], so you can pass () => users.iterator as the argument instead.
Since Sink.fold(...) materializes the final value of its accumulator, you can give an empty Seq() as the initial value, append each element to the sequence as it passes through, and finally get the whole result.
However, there might be a nicer way to create a Sink that collects every element into a Seq, but I could not find it.
The following code works.
import akka.actor._
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Source,Sink}
import scala.concurrent.ExecutionContext.Implicits.global
case class User(name: String)

object Main extends App {
  implicit val system = ActorSystem("MyActorSystem")
  implicit val materializer = ActorMaterializer()

  val users = Seq(User("alice"), User("bob"), User("charlie"))
  val sink = Sink.fold[Seq[User], User](Seq())((seq, elem) => {
    println(s"elem => ${elem} \t| seq => ${seq}")
    seq :+ elem
  })

  val src = Source(users.to[scala.collection.immutable.Seq])
  // val src = Source(() => users.iterator) // this also works
  val fut = src.runWith(sink) // Future[Seq[User]]

  fut.onSuccess({
    case x => println(s"result => ${x}")
  })
}
The output of the code above is
elem => User(alice) | seq => List()
elem => User(bob) | seq => List(User(alice))
elem => User(charlie) | seq => List(User(alice), User(bob))
result => List(User(alice), User(bob), User(charlie))
If you just need a Future[Seq[User]], don't use Akka Streams; a plain Future is enough:
import scala.concurrent._
import ExecutionContext.Implicits.global
val session = socialNetwork.createSessionFor("user", credentials)
val f: Future[List[Friend]] = Future {
  session.getFriends()
}
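Applied to the method from the question (assuming userRepository.findByFirstName is the blocking call and an implicit ExecutionContext is in scope), that approach might look like:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Seq[User] => Future[Seq[User]]: run the blocking repository call off the calling thread
def findByFirstName(firstName: String): Future[Seq[User]] =
  Future {
    userRepository.findByFirstName(firstName)
  }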

Scala initialize collection from Java iterable

In Scala, how can I initialize a Scala collection from a Java iterable in a clean, idiomatic way?
Here's somewhat lame code taking a less functional approach for that:
var collection = Seq[MyClass]()
while (iterator.hasNext) {
  val asArray: Array[String] = iterator.next.toArray
  val val2 = asArray(2)
  val val3 = asArray(3)
  collection = collection :+ new MyClass(val2, val3)
}
How can initialization of a collection from a Java iterable take place more idiomatically?
import scala.collection.JavaConverters._
val collection = iterator.asScala.map { x =>
  val asArray = x.toArray
  new MyClass(asArray(2), asArray(3))
}.toIndexedSeq
Scala can convert to and from Java collections seamlessly, provided you have imported the conversion helpers like below:
import scala.collection.JavaConversions._
val jl = new java.util.ArrayList[String]()
jl.add("Hello")
jl.add("There")
val collection = jl.map { x => new MyClass(x(2), x(3)) }.toList
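The same ArrayList example can also be written with the explicit JavaConverters import from the first answer; a minimal sketch (asScala makes the conversion point visible):
import scala.collection.JavaConverters._

val jl = new java.util.ArrayList[String]()
jl.add("Hello")
jl.add("There")
// asScala wraps the Java list in a Scala Buffer view; toList copies it into an immutable List
val scalaList: List[String] = jl.asScala.toList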