Map Partition Iterator return - scala

Can anyone help me get the Iterator returned by the listWords() method accepted by mapPartitions?
import java.util
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionExample {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapPartitionExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val input: RDD[String] = sc.parallelize(List("ABC", "DEF", "GHU", "YHG"))
    val x = input.mapPartitions(word => listWords(word))
  }

  def listWords(words: Iterator[String]): util.Iterator[String] = {
    val arrList = new util.ArrayList[String]()
    while (words.hasNext) {
      arrList.add(words.next())
    }
    return arrList.iterator()
  }
}

The return type of the function passed to mapPartitions should be scala.collection.Iterator, not java.util.Iterator. I don't see much point in your current code, but you can use Scala mutable collections:
import scala.collection.mutable.ArrayBuffer
def listWords(words: Iterator[String]): Iterator[String] = {
  val arr = ArrayBuffer[String]()
  while (words.hasNext) {
    arr += words.next()
  }
  arr.toIterator
}
Personally I'd just map:
def listWords(words: Iterator[String]): Iterator[String] = {
  // Some init code
  words.map(someFunction)
}
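Either way, once listWords takes and returns scala.collection.Iterator, the call from the question should work unchanged; a small usage sketch:
val x: RDD[String] = input.mapPartitions(listWords) // equivalent to word => listWords(word)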

The compiler reports that Iterable[NotInferU] is expected, but you are returning java.util.Iterator[String].
You need to convert the java.util.Iterator to a Scala Iterator by importing scala.collection.JavaConversions._, as below:
def listWords(words: Iterator[String]): Iterator[String] = {
  val arrList = new util.ArrayList[String]()
  while (words.hasNext) {
    arrList.add(words.next())
  }
  import scala.collection.JavaConversions._
  arrList.toList.iterator
}
The rest of the code stays as it is.
I hope the answer is helpful.
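Note that scala.collection.JavaConversions is deprecated in newer Scala versions; if you want to keep the java.util.ArrayList, a sketch with scala.collection.JavaConverters would be:
import scala.collection.JavaConverters._

def listWords(words: Iterator[String]): Iterator[String] = {
  val arrList = new java.util.ArrayList[String]()
  while (words.hasNext) {
    arrList.add(words.next())
  }
  arrList.iterator().asScala // explicit java-to-scala conversion
}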

Related

Return Future[List[DiagnosisCode]] from fetchDiagnosisForUniqueCodes method

I am not able to return Future[List[DiagnosisCode]] from fetchDiagnosisForUniqueCodes
import scala.concurrent._
import ExecutionContext.Implicits.global

case class DiagnosisCode(rootCode: String, uniqueCode: String, description: Option[String] = None)

object Database {
  private val data: List[DiagnosisCode] = List(
    DiagnosisCode("A00", "A001", Some("Cholera due to Vibrio cholerae")),
    DiagnosisCode("A00", "A009", Some("Cholera, unspecified")),
    DiagnosisCode("A08", "A080", Some("Rotaviral enteritis")),
    DiagnosisCode("A08", "A083", Some("Other viral enteritis"))
  )

  def getAllUniqueCodes: Future[List[String]] = Future {
    Database.data.map(_.uniqueCode)
  }

  def fetchDiagnosisForUniqueCode(uniqueCode: String): Future[Option[DiagnosisCode]] = Future {
    Database.data.find(_.uniqueCode.equalsIgnoreCase(uniqueCode))
  }
}
getAllUniqueCodes returns all unique codes from data List.
fetchDiagnosisForUniqueCode returns DiagnosisCode when uniqueCode matches.
From fetchDiagnosisForUniqueCodes, I would like to return Future[List[DiagnosisCode]] using getAllUniqueCodes and fetchDiagnosisForUniqueCode(uniqueCode).
def fetchDiagnosisForUniqueCodes: Future[List[DiagnosisCode]] = {
  val xa: Future[List[Future[DiagnosisCode]]] =
    Database.getAllUniqueCodes.map { (xs: List[String]) =>
      xs.map { (uq: String) =>
        Database.fetchDiagnosisForUniqueCode(uq)
      }
    }.map(n =>
      n.map(y =>
        y.map(_.head))) // Future[List[Future[DiagnosisCode]]]
}
If I understood your post correctly, your question is: "How can I convert a Future[List[Future[DiagnosisCode]]] into a Future[List[DiagnosisCode]]?"
The answer to that question would be: use Future.sequence:
// assuming an implicit ExecutionContext is in scope:
val xa: Future[List[Future[DiagnosisCode]]] = // ... your code here
val flattened: Future[List[DiagnosisCode]] =
  xa.flatMap { listOfFutures =>
    Future.sequence(listOfFutures)
  }
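For completeness, a minimal sketch of fetchDiagnosisForUniqueCodes itself (assuming the Database object above; since fetchDiagnosisForUniqueCode returns Future[Option[DiagnosisCode]], the Options still need to be flattened at the end):
def fetchDiagnosisForUniqueCodes: Future[List[DiagnosisCode]] =
  Database.getAllUniqueCodes.flatMap { codes =>
    // traverse = map + sequence in one step
    Future.traverse(codes)(Database.fetchDiagnosisForUniqueCode)
  }.map(_.flatten) // drop the Nones, unwrap the Somes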

Run Object notebook in Databricks

I am trying to execute this code on Databricks in Scala. Everything is in an object; then I have a case class, a def main, and other def functions.
I tried working with "package cells", but I got: Warning: classes defined within packages cannot be redefined without a cluster restart. Compilation successful.
Removing the object didn't work either.
package x.y.z

import java.text.SimpleDateFormat // needed for dateFormat below (not in the original post)
import java.util.Date
import java.io.File
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration // needed for getFileSystem below (not in the original post)
import org.apache.hadoop.fs.{FileSystem, Path}

object Meter {
  val dateFormat = new SimpleDateFormat("yyyyMMdd")

  case class Forc(cust: String, Num: String, date: String, results: Double)

  def main(args: Array[String]): Unit = {
    val inputFile = "sv" //
    val outputFile = "ssv" //
    val fileSystem = getFileSystem(inputFile)
    // readLines and Results are used here, but their definitions are not shown in the post
    val inputData = readLines(fileSystem, inputFile, skipHeader = true).toSeq
    val filtinp = inputData.filter(x => x.nonEmpty)
      .map(x => Results(x(6), x(5), x(0), x(8).toDouble))

    def getTimestamp(date: String): Long = dateFormat.parse(date).getTime

    def getDate(timeStampInMills: Long): String = {
      val time = new Date(timeStampInMills)
      dateFormat.format(time)
    }

    def getFileSystem(path: String): FileSystem = {
      val hconf = new Configuration()
      new Path(path).getFileSystem(hconf)
    }

    // this override (and the line / inputData.readLine it uses) appears to belong to an
    // Iterator implementation whose enclosing definition is missing from the post
    override def next(): String = {
      val result = line
      line = inputData.readLine()
      if (line == null) {
        inputData.close()
      }
      result
    }
  }
}

Convert Any type in scala to Array[Byte] and back

I have a variable value declared as Any in my program.
I want to convert this value to Array[Byte].
How can I serialize to Array[Byte] and back? I found examples related to other types such as Double or Int, but not to Any.
This should do what you need. It's pretty similar to how one would do it in Java.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object Serialization extends App {

  def serialise(value: Any): Array[Byte] = {
    val stream: ByteArrayOutputStream = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(stream)
    oos.writeObject(value)
    oos.close()
    stream.toByteArray
  }

  def deserialise(bytes: Array[Byte]): Any = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    val value = ois.readObject
    ois.close()
    value
  }

  println(deserialise(serialise("My Test")))
  println(deserialise(serialise(List(1))))
  println(deserialise(serialise(Map(1 -> 2))))
  println(deserialise(serialise(1)))
}
Alternatively, using Apache Commons Lang's SerializationUtils (the value must actually be Serializable):
import org.apache.commons.lang3.SerializationUtils

def anyTypeToByteArray(value: Any): Array[Byte] = {
  // cast to Serializable (isInstanceOf would only serialize a Boolean)
  val valueConverted: Array[Byte] = SerializationUtils.serialize(value.asInstanceOf[java.io.Serializable])
  valueConverted
}

def ByteArrayToAny(value: Array[Byte]): Any = {
  val valueConverted: Any = SerializationUtils.deserialize(value)
  valueConverted
}
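A quick round-trip check of these helpers might look like this (assuming commons-lang3 is on the classpath):
val bytes = anyTypeToByteArray(Map("a" -> 1))
println(ByteArrayToAny(bytes)) // Map(a -> 1)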

Split key value in map scala

I don't know if it is possible, but inside my mapPartitions I'd like to split the variable "a" into two lists: a list l that stores all the numbers and another list, say b, that stores all the words. Something like a.mapPartitions((p, v) => { val l = p.toList; val b = v.toList; ... },
so that in my for loop, for example, l(i) = 1 and b(i) = "score".
import scala.io.Source
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ListBuffer

val a = sc.parallelize(List(("score", 1), ("chicken", 2), ("magnacarta", 2)))

a.mapPartitions(p => {
  val l = p.toList
  val ret = new ListBuffer[Int]
  val words = new ListBuffer[String]
  for (i <- 0 to l.length - 1) {
    words += b(i)
    ret += l(i)
  }
  ret.toList.iterator
})
Spark is a distributed computing engine: you perform operations on partitioned data across the nodes of the cluster, and then you need a reduce() step that performs a summary operation.
Please see this code that should do what you want:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object SimpleApp {

  class MyResponseObj(var numbers: List[Int] = List[Int](), var words: List[String] = List[String]()) extends java.io.Serializable {
    def +=(str: String, int: Int) = {
      numbers = numbers :+ int
      words = words :+ str
      this
    }
    def +=(other: MyResponseObj) = {
      numbers = numbers ++ other.numbers
      words = words ++ other.words
      this
    }
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val a = sc.parallelize(List(("score", 1), ("chicken", 2), ("magnacarta", 2)))

    val myResponseObj = a.mapPartitions[MyResponseObj](it => {
      var myResponseObj = new MyResponseObj()
      it.foreach {
        case (str: String, int: Int) => myResponseObj += (str, int)
        case _                       => println("unexpected data")
      }
      Iterator(myResponseObj)
    }).reduce((myResponseObj1, myResponseObj2) => myResponseObj1 += myResponseObj2)

    println(myResponseObj.words)
    println(myResponseObj.numbers)
  }
}
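If the goal is simply two local lists rather than per-partition processing, a shorter sketch (assuming the same RDD a and data small enough to collect to the driver) is:
val (words, numbers) = a.collect().toList.unzip
// words:   List(score, chicken, magnacarta)
// numbers: List(1, 2, 2)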

How to run IntellijIDEA (Spark and Scala) code in Apache Spark in terminal mode

I have written my code in IntelliJ IDEA (Scala and Spark) and I want to run this code on Linux using the terminal. How can I do this? I can't access graphical mode on this Linux server.
For example, this is code similar to my code:
package LDAv1

import java.io._
import org.apache.commons.math3.special._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd._
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.immutable._
import scala.collection.mutable._

object SparkLDA {

  implicit def arrayToVector(s: Array[Int]) = new Vector(s)
  implicit def vectorToArray(s: Vector) = s.data

  def main(args: Array[String]) {
    var numTopics: Int = 3
    var inPath: String = "data/MR.dat"
    var outPath: String = "data"
    var master: String = "local[*]"
    var iter: Int = 100
    var mem = "4g"
    var debug = false
    lda(inPath, outPath, master, numTopics, (50 / numTopics), 0.1, iter, debug, mem)
  }

  def lda(pathToFileIn: String, pathToFileOut: String, URL: String, numTopics: Int, alpha: Double, beta: Double, numIter: Int, deBug: Boolean, mem: String) {
    val (conf, sc) = initializeSpark(URL, deBug, mem)
    var (documents, dictionary, topicCount) = importText(pathToFileIn, numTopics, sc)
    val ll: MutableList[Double] = MutableList[Double]()
    for (i <- 0 to numIter) {
      var (doc, dict, tC) = step(sc, documents, numTopics, dictionary, topicCount, alpha, beta)
      documents = doc
      dictionary = dict
      topicCount = tC
      if (deBug) ll += logLikelihood(dictionary, topicCount, alpha, beta)
      System.gc()
    }
    saveAll(documents, ll, sc, dictionary, topicCount, pathToFileOut, deBug)
  }

  def initializeSpark(URL: String, debug: Boolean, mem: String) = {
    if (!debug) Logger.getLogger("org").setLevel(Level.WARN)
    val conf = new SparkConf()
      .setAppName("Spark LDA")
      .setMaster(URL)
      .set("spark.executor.memory", "4g")
    val sc = new SparkContext(conf)
    (conf, sc)
  }

  def importText(pathToFileIn: String, numTopics: Int, sc: SparkContext) = {
    val stopWords = sc.broadcast(List[String]("a", "able", "about", "above", "according", "accordingly", "across", "actually", "after"))
    val textFile = sc.textFile(pathToFileIn, 4)
    val documents = textFile.map(line => {
      val topicDistrib = new Array[Int](numTopics)
      val lineCleaned = line.replaceAll("[^A-Za-z ]", "").toLowerCase()
      (lineCleaned.split(" ").map(word => {
        var topic: Int = 0
        var wrd: String = ""
        if (word.length() > 1 && (!stopWords.value.contains(word))) {
          topic = Integer.parseInt(Math.round(Math.random() * (numTopics - 1)).toString)
          topicDistrib.increment(topic)
          wrd = word
        }
        (wrd, topic)
      }), topicDistrib)
    })
    val (dictionary, topicCount) = updateVariables(documents, numTopics)
    (documents, dictionary, topicCount)
  }

  def updateVariables(documents: RDD[(Array[(String, Int)], Array[Int])], numTopics: Int) = {
    val dictionary = documents.flatMap(line => line._1).map(tuple => {
      var value: Array[Int] = new Array[Int](numTopics)
      if (!tuple._1.equals("")) {
        value(tuple._2) += 1
      }
      (tuple._1, value)
    }).reduceByKey((a: Array[Int], b) => {
      for (i <- 0 to a.length - 1) {
        a(i) += b(i)
      }
      (a)
    }).collect().toMap
    println(dictionary.take(2))
    val topicCount: Array[Int] = new Array[Int](numTopics)
    dictionary.foreach(t => topicCount.add(t._2))
    (dictionary, topicCount)
  }

  def step(sc: SparkContext, documents: RDD[(Array[(String, Int)], Array[Int])], numTopics: Int, dict: scala.collection.immutable.Map[String, Array[Int]], tC: Array[Int], alpha: Double, beta: Double) = {
    val dictionary = sc.broadcast(dict)
    val topicCount = sc.broadcast(tC)
    val v = dict.size
    val doc = documents.map(tuple => {
      val topicDistrib = tuple._2
      val line = tuple._1
      val lineupDated = line.map(t => {
        val word = t._1
        var top = t._2
        if (!t._1.equals("")) {
          topicDistrib.decrement(top)
          top = gibbsSampling(topicDistrib, dictionary.value(word), topicCount.value, alpha, beta, v)
          topicDistrib.increment(top)
        }
        (word, top)
      })
      (lineupDated, topicDistrib)
    })
    val (dicti, topC) = updateVariables(doc, numTopics)
    (doc: RDD[(Array[(String, Int)], Array[Int])], dicti, topC)
  }

  def saveAll(documents: RDD[(Array[(String, Int)], Array[Int])], LogLikelihood: MutableList[Double], sc: SparkContext, dictionary: scala.collection.immutable.Map[String, Array[Int]], topicCount: Array[Int], path: String, deBug: Boolean) {
    removeAll(path)
    saveDocuments(documents, path)
    saveDictionary(sc, dictionary, path)
    saveTopicCount(sc, topicCount, path)
    if (deBug) saveLogLikelihood(sc, LogLikelihood, path)
  }

  def saveDocuments(documents: RDD[(Array[(String, Int)], Array[Int])], path: String) {
    removeAll(path + "/documentsTopics")
    documents.map {
      case (topicAssign, topicDist) =>
        var topicDistNorm: Array[Double] = topicDist.normalize()
        val probabilities = topicDistNorm.toList.mkString(", ")
        (probabilities)
    }.saveAsTextFile(path + "/documentsTopics")
  }

  def saveDictionary(sc: SparkContext, dictionary: scala.collection.immutable.Map[String, Array[Int]], path: String) {
    removeAll(path + "/wordsTopics")
    val dictionaryArray = dictionary.toArray
    val temp = sc.parallelize(dictionaryArray).map {
      case (word, topics) =>
        var topicsNorm: Array[Double] = topics.normalize()
        val topArray = topicsNorm.toList.mkString(", ")
        val wordCount = topics.sumAll()
        val temp2 = List(word, wordCount, topArray).mkString("\t")
        (temp2)
    }
    temp.saveAsTextFile(path + "/wordsTopics")
  }

  def saveTopicCount(sc: SparkContext, topicCount: Array[Int], path: String) {
    removeAll(path + "/topicCount")
    val temp = sc.parallelize(topicCount).map {
      case (count) =>
        (count)
    }
    temp.saveAsTextFile(path + "/topicCount")
  }

  def saveLogLikelihood(sc: SparkContext, LogLikelihood: MutableList[Double], path: String) {
    removeAll(path + "/logLikelihood")
    val temp = sc.parallelize(LogLikelihood).map {
      case (count) =>
        (count)
    }
    temp.saveAsTextFile(path + "/logLikelihood")
  }

  def gibbsSampling(docTopicDistrib: Array[Int], wordTopicDistrib: Array[Int], topicCount: Array[Int], alpha: Double, beta: Double, v: Int): Int = {
    val numTopic = docTopicDistrib.length
    var ro: Array[Double] = new Array[Double](numTopic)
    ro(0) = (docTopicDistrib(0) + alpha) * (wordTopicDistrib(0) + beta) / (topicCount(0) + v * beta)
    for (i <- 1 to numTopic - 1) {
      ro(i) = ro(i - 1) + (docTopicDistrib(i) + alpha) * (wordTopicDistrib(i) + beta) / (topicCount(i) + v * beta)
    }
    var x = Math.random() * ro(numTopic - 1)
    var i: Int = 0
    while (x > ro(i) && i < numTopic - 1) i += 1
    return i
  }

  def logLikelihood(dictionary: scala.collection.immutable.Map[String, Array[Int]], topicCount: Array[Int], alpha: Double, beta: Double): Double = {
    val V: Int = dictionary.size
    val numTopics: Int = topicCount.length - 1
    var logLikelihood: Double = numTopics * (Gamma.logGamma(V * beta) - V * Gamma.logGamma(beta))
    for (i <- 0 to numTopics) {
      var sum: Double = 0
      dictionary.foreach { t =>
        sum += Gamma.logGamma(t._2(i) + beta)
      }
      logLikelihood += sum - Gamma.logGamma(topicCount(i) + V * beta)
    }
    (logLikelihood)
  }

  def removeAll(pathDir: String) = {
    def delete(file: File): Array[(String, Boolean)] = {
      Option(file.listFiles).map(_.flatMap(f => delete(f))).getOrElse(Array()) :+ (file.getPath -> file.delete)
    }
  }
}
And it has one Scala class:
package LDAv1

class Vector(val vect: Array[Int]) {

  var data: Array[Int] = vect

  def this(size: Int) {
    this(new Array[Int](size))
  }

  def increment(index: Int) {
    data(index) += 1
  }

  def decrement(index: Int) {
    data(index) -= 1
  }

  def printIt() {
    print("[")
    for (i <- 0 to data.length - 1) print(data(i) + ",")
    print("]\n")
  }

  def forEach(callback: (Int) => Unit) = {
    for (i <- 0 to data.length - 1) callback(data(i))
  }

  def add(a: Array[Int]) {
    for (i <- 0 to data.length - 1) data(i) += a(i)
  }

  def sumAll(): Int = {
    var sum: Int = 0
    for (i <- 0 to data.length - 1) sum += data(i)
    (sum)
  }

  def normalize(): Array[Double] = {
    var temp: Array[Double] = new Array[Double](data.length)
    var sum: Double = 0
    for (i <- 0 to data.length - 1) sum += data(i)
    if (sum > 0) {
      for (i <- 0 to data.length - 1) {
        temp(i) = data(i).toDouble / sum
      }
    }
    (temp)
  }
}
You have to create a fat jar with all dependencies included (for example with the sbt-assembly plugin), and then you can run your application with Spark's built-in spark-submit script.
https://spark.apache.org/docs/latest/submitting-applications.html
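Roughly, the steps could look like this (the jar path and Scala version are placeholders, not taken from the question; the main class comes from the SparkLDA object above):
# build the fat jar (with the sbt-assembly plugin configured in the project)
sbt assembly

# run it on the server from the terminal, no GUI needed
spark-submit \
  --class LDAv1.SparkLDA \
  --master local[*] \
  target/scala-2.11/your-assembly.jar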