I am not able to return Future[List[DiagnosisCode]] from fetchDiagnosisForUniqueCodes
import scala.concurrent._
import ExecutionContext.Implicits.global
case class DiagnosisCode(rootCode: String, uniqueCode: String, description: Option[String] = None)
object Database {
private val data: List[DiagnosisCode] = List(
DiagnosisCode("A00", "A001", Some("Cholera due to Vibrio cholerae")),
DiagnosisCode("A00", "A009", Some("Cholera, unspecified")),
DiagnosisCode("A08", "A080", Some("Rotaviral enteritis")),
DiagnosisCode("A08", "A083", Some("Other viral enteritis"))
)
def getAllUniqueCodes: Future[List[String]] = Future {
Database.data.map(_.uniqueCode)
}
def fetchDiagnosisForUniqueCode(uniqueCode: String): Future[Option[DiagnosisCode]] = Future {
Database.data.find(_.uniqueCode.equalsIgnoreCase(uniqueCode))
}
}
getAllUniqueCodes returns all unique codes from the data list.
fetchDiagnosisForUniqueCode returns a DiagnosisCode when the uniqueCode matches.
From fetchDiagnosisForUniqueCodes, I would like to return a Future[List[DiagnosisCode]] using getAllUniqueCodes and fetchDiagnosisForUniqueCode(uniqueCode).
def fetchDiagnosisForUniqueCodes: Future[List[DiagnosisCode]] = {
  val xa: Future[List[Future[DiagnosisCode]]] =
    Database.getAllUniqueCodes.map { (xs: List[String]) =>
      xs.map { (uq: String) =>
        Database.fetchDiagnosisForUniqueCode(uq)
      }
    }.map(n => n.map(y => y.map(_.head))) // Future[List[Future[DiagnosisCode]]]
}
If I understood your post correctly, your question is: "How can I convert a Future[List[Future[DiagnosisCode]]] into a Future[List[DiagnosisCode]]?"
The answer to that question would be: use Future.sequence:
// assuming an implicit ExecutionContext is in scope:
val xa: Future[List[Future[DiagnosisCode]]] = // ... your code here
val flattened: Future[List[DiagnosisCode]] =
xa.flatMap { listOfFutures =>
Future.sequence(listOfFutures)
}
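Equivalently, Future.traverse does the mapping and the sequencing in one step. Here is a minimal sketch of the complete method under the definitions above (not the only way to write it); note that fetchDiagnosisForUniqueCode returns an Option, so codes with no match are flattened away:

def fetchDiagnosisForUniqueCodes: Future[List[DiagnosisCode]] =
  Database.getAllUniqueCodes.flatMap { codes =>
    // Future[List[Option[DiagnosisCode]]]
    Future.traverse(codes)(Database.fetchDiagnosisForUniqueCode)
  }.map(_.flatten) // drop the codes that were not found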
I am trying to execute this code on Databricks in Scala. Everything is in an object; I have a case class, a def main, and other def functions.
I tried working with "package cells", but I got: Warning: classes defined within packages cannot be redefined without a cluster restart. Compilation successful.
Removing the object didn't work either.
package x.y.z

import java.text.SimpleDateFormat
import java.util.Date
import java.io.File
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object Meter {
  val dateFormat = new SimpleDateFormat("yyyyMMdd")

  case class Forc(cust: String, Num: String, date: String, results: Double)

  def main(args: Array[String]): Unit = {
    val inputFile = "sv" //
    val outputFile = "ssv" //
    val fileSystem = getFileSystem(inputFile)
    // readLines is a helper defined elsewhere (not shown in the question)
    val inputData = readLines(fileSystem, inputFile, skipHeader = true).toSeq
    // use the Forc case class defined above (Results is not defined)
    val filtinp = inputData.filter(x => x.nonEmpty)
      .map(x => Forc(x(6), x(5), x(0), x(8).toDouble))
  }

  def getTimestamp(date: String): Long = dateFormat.parse(date).getTime

  def getDate(timeStampInMills: Long): String = {
    val time = new Date(timeStampInMills)
    dateFormat.format(time)
  }

  def getFileSystem(path: String): FileSystem = {
    val hconf = new Configuration()
    new Path(path).getFileSystem(hconf)
  }

  // This method appears to belong to an Iterator implementation elsewhere:
  // `line` and `inputData.readLine()` are not defined in this object.
  // override def next(): String = {
  //   val result = line
  //   line = inputData.readLine()
  //   if (line == null) {
  //     inputData.close()
  //   }
  //   result
  // }
}
I have a variable value declared as Any in my program.
I want to convert this value to Array[Byte].
How can I serialize to Array[Byte] and back? I found examples for other types such as Double or Int, but none for Any.
This should do what you need. It's pretty similar to how one would do it in Java.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
object Serialization extends App {
def serialise(value: Any): Array[Byte] = {
val stream: ByteArrayOutputStream = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(stream)
oos.writeObject(value)
oos.close()
stream.toByteArray
}
def deserialise(bytes: Array[Byte]): Any = {
val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
val value = ois.readObject
ois.close()
value
}
println(deserialise(serialise("My Test")))
println(deserialise(serialise(List(1))))
println(deserialise(serialise(Map(1 -> 2))))
println(deserialise(serialise(1)))
}
import org.apache.commons.lang3.SerializationUtils

def anyTypeToByteArray(value: Any): Array[Byte] = {
  // Cast with asInstanceOf (isInstanceOf would serialize a Boolean, not the value)
  val valueConverted: Array[Byte] = SerializationUtils.serialize(value.asInstanceOf[java.io.Serializable])
  valueConverted
}

def ByteArrayToAny(value: Array[Byte]): Any = {
  val valueConverted: Any = SerializationUtils.deserialize(value)
  valueConverted
}
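A quick round trip with these helpers might look as follows (this assumes Apache commons-lang3 is on the classpath; the sample value is just for illustration):

val bytes: Array[Byte] = anyTypeToByteArray(Map(1 -> 2))
val back: Any = ByteArrayToAny(bytes)
println(back) // Map(1 -> 2)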
I don't know if it is possible, but inside mapPartitions I'd like to split the variable "a" into two lists: a list l that stores all the numbers and another list, say b, that stores all the words, with something like a.mapPartitions((p, v) => { val l = p.toList; val b = v.toList; ... }),
so that, for example, in my for loop l(i) = 1 and b(i) = "score".
import scala.io.Source
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ListBuffer

val a = sc.parallelize(List(("score", 1), ("chicken", 2), ("magnacarta", 2)))
a.mapPartitions(p => {
  val l = p.toList
  val ret = new ListBuffer[Int]
  val words = new ListBuffer[String]
  for (i <- 0 to l.length - 1) {
    words += l(i)._1 // the word
    ret += l(i)._2   // the number
  }
  ret.toList.iterator
})
Spark is a distributed computing engine: you perform operations on partitioned data across the nodes of the cluster, and then you need a reduce() step that performs a summary operation to combine the per-partition results.
Please see this code, which should do what you want:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object SimpleApp {
class MyResponseObj(var numbers: List[Int] = List[Int](), var words: List[String] = List[String]()) extends java.io.Serializable{
def +=(str: String, int: Int) = {
numbers = numbers :+ int
words = words :+ str
this
}
def +=(other: MyResponseObj) = {
numbers = numbers ++ other.numbers
words = words ++ other.words
this
}
}
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val a = sc.parallelize(List(("score", 1), ("chicken", 2), ("magnacarta", 2)))
val myResponseObj = a.mapPartitions[MyResponseObj](it => {
var myResponseObj = new MyResponseObj()
it.foreach {
case (str :String, int :Int) => myResponseObj += (str, int)
case _ => println("unexpected data")
}
Iterator(myResponseObj)
}).reduce( (myResponseObj1, myResponseObj2) => myResponseObj1 += myResponseObj2 )
println(myResponseObj.words)
println(myResponseObj.numbers)
}
}
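As a side note, if you only need the two lists on the driver and the data comfortably fits in memory, collecting and unzipping (instead of mapPartitions) is a simpler option; a minimal sketch under that assumption:

val a = sc.parallelize(List(("score", 1), ("chicken", 2), ("magnacarta", 2)))
// collect() brings the pairs to the driver; unzip splits them into two lists
val (words, numbers) = a.collect().toList.unzip
println(words)   // List(score, chicken, magnacarta)
println(numbers) // List(1, 2, 2)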
I have written my code in IntelliJ IDEA (Scala and Spark) and I want to run this code on Linux from the terminal. How can I do this? I can't access graphical mode on this Linux server.
For example, this is code similar to my code:
package LDAv1
import java.io._
import org.apache.commons.math3.special._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd._
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.immutable._
import scala.collection.mutable._
object SparkLDA {
implicit def arrayToVector(s: Array[Int]) = new Vector(s)
implicit def vectorToArray(s: Vector) = s.data
def main(args: Array[String]){
var numTopics:Int=3
var inPath:String="data/MR.dat"
var outPath:String="data"
var master:String="local[*]"
var iter:Int=100
var mem="4g"
var debug=false
lda(inPath,outPath,master,numTopics,(50/numTopics),0.1,iter,debug,mem);
}
def lda(pathToFileIn:String,pathToFileOut:String,URL:String,numTopics:Int,alpha:Double,beta:Double,numIter:Int,deBug:Boolean,mem:String){
val (conf,sc)=initializeSpark(URL,deBug,mem)
var(documents,dictionary,topicCount)=importText(pathToFileIn,numTopics,sc)
val ll:MutableList[Double]= MutableList[Double]()
for(i<-0 to numIter){
var (doc,dict,tC)=step(sc,documents,numTopics,dictionary,topicCount,alpha,beta)
documents=doc
dictionary=dict
topicCount=tC
if(deBug)ll+=logLikelihood(dictionary,topicCount,alpha,beta)
System.gc()
}
saveAll(documents,ll,sc,dictionary,topicCount,pathToFileOut,deBug)
}
def initializeSpark(URL:String,debug:Boolean,mem:String)={
if(!debug)Logger.getLogger("org").setLevel(Level.WARN)
val conf = new SparkConf()
.setAppName("Spark LDA")
.setMaster(URL)
.set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)
(conf,sc)
}
def importText(pathToFileIn:String,numTopics:Int,sc:SparkContext)={
val stopWords =sc.broadcast(List[String]("a","able","about","above","according","accordingly","across","actually","after"));
val textFile=sc.textFile(pathToFileIn,4)
val documents=textFile.map(line=>{
val topicDistrib=new Array[Int](numTopics)
val lineCleaned=line.replaceAll("[^A-Za-z ]","").toLowerCase()
(lineCleaned.split(" ").map(word=>{
var topic:Int=0
var wrd:String=""
if(word.length()>1&&(!stopWords.value.contains(word))){
topic =Integer.parseInt(Math.round(Math.random()*(numTopics-1)).toString)
topicDistrib.increment(topic)
wrd=word
}
(wrd,topic)
})
,topicDistrib)
})
val(dictionary,topicCount)=updateVariables(documents,numTopics)
(documents,dictionary,topicCount)
}
def updateVariables(documents:RDD[(Array[(String, Int)], Array[Int])],numTopics:Int)={
val dictionary=documents.flatMap(line=>line._1).map(tuple=>{
var value:Array[Int]=new Array[Int](numTopics)
if(!tuple._1.equals("")){
value(tuple._2)+=1
}
(tuple._1,value)
}).reduceByKey((a:Array[Int],b)=>{
for(i<-0 to a.length-1){
a(i)+=b(i)
}
(a)
}).collect().toMap
println(dictionary.take(2))
val topicCount:Array[Int]=new Array[Int](numTopics)
dictionary.foreach(t=>topicCount.add(t._2))
(dictionary,topicCount)
}
def step(sc:SparkContext,documents:RDD[(Array[(String, Int)], Array[Int])],numTopics:Int,dict:scala.collection.immutable.Map[String, Array[Int]],tC: Array[Int],alpha:Double,beta:Double)={
val dictionary=sc.broadcast(dict)
val topicCount=sc.broadcast(tC)
val v=dict.size
val doc=documents.map(tuple=>{
val topicDistrib=tuple._2
val line=tuple._1
val lineupDated=line.map(t=>{
val word=t._1
var top=t._2
if(!t._1.equals("")){
topicDistrib.decrement(top)
top=gibbsSampling(topicDistrib,dictionary.value(word),topicCount.value,alpha,beta,v)
topicDistrib.increment(top)
}
(word,top)
})
(lineupDated,topicDistrib)
})
val(dicti,topC)=updateVariables(doc,numTopics)
(doc:RDD[(Array[(String, Int)], Array[Int])],dicti,topC)
}
def saveAll(documents: RDD[(Array[(String, Int)], Array[Int])],LogLikelihood:MutableList[Double],sc: SparkContext, dictionary: scala.collection.immutable.Map[String, Array[Int]], topicCount: Array[Int],path: String,deBug:Boolean){
removeAll(path)
saveDocuments(documents,path)
saveDictionary(sc,dictionary,path)
saveTopicCount(sc,topicCount,path)
if(deBug)saveLogLikelihood (sc,LogLikelihood, path)
}
def saveDocuments (documents: RDD[(Array[(String, Int)], Array[Int])], path: String) {
removeAll(path+"/documentsTopics")
documents.map {
case (topicAssign, topicDist) =>
var topicDistNorm:Array[Double] = topicDist.normalize()
val probabilities = topicDistNorm.toList.mkString(", ")
(probabilities)
}.saveAsTextFile(path+"/documentsTopics")
}
def saveDictionary(sc: SparkContext, dictionary: scala.collection.immutable.Map[String, Array[Int]], path: String) {
removeAll(path+"/wordsTopics")
val dictionaryArray = dictionary.toArray
val temp = sc.parallelize(dictionaryArray).map {
case (word, topics) =>
var topicsNorm:Array[Double] = topics.normalize()
val topArray = topicsNorm.toList.mkString(", ")
val wordCount = topics.sumAll()
val temp2 = List(word, wordCount, topArray).mkString("\t")
(temp2)
}
temp.saveAsTextFile(path+"/wordsTopics")
}
def saveTopicCount (sc: SparkContext, topicCount: Array[Int], path: String) {
removeAll(path+"/topicCount")
val temp = sc.parallelize(topicCount).map {
case (count) =>
(count)
}
temp.saveAsTextFile(path+"/topicCount")
}
def saveLogLikelihood (sc: SparkContext,LogLikelihood:MutableList[Double], path: String) {
removeAll(path+"/logLikelihood")
val temp = sc.parallelize(LogLikelihood).map {
case (count) =>
(count)
}
temp.saveAsTextFile(path+"/logLikelihood")
}
def gibbsSampling(docTopicDistrib:Array[Int],wordTopicDistrib:Array[Int],topicCount:Array[Int],alpha:Double,beta:Double,v:Int):Int={
val numTopic=docTopicDistrib.length
var ro:Array[Double]=new Array[Double](numTopic)
ro(0)=(docTopicDistrib(0)+alpha)*(wordTopicDistrib(0)+beta)/(topicCount(0)+v*beta)
for(i<-1 to numTopic-1){
ro(i)=ro(i-1)+(docTopicDistrib(i)+alpha)*(wordTopicDistrib(i)+beta)/(topicCount(i)+v*beta)
}
var x=Math.random()*ro(numTopic-1)
var i:Int=0
while(x>ro(i)&&i<numTopic-1)i+=1
return i
}
def logLikelihood(dictionary: scala.collection.immutable.Map[String, Array[Int]],topicCount:Array[Int],alpha:Double,beta:Double):Double={
val V:Int=dictionary.size
val numTopics:Int=topicCount.length-1
var logLikelihood:Double=numTopics*(Gamma.logGamma(V*beta)-V*Gamma.logGamma(beta))
for (i<-0 to numTopics){
var sum:Double=0
dictionary.foreach{t=> sum+=Gamma.logGamma(t._2(i)+beta)
}
logLikelihood+=sum-Gamma.logGamma(topicCount(i)+V*beta)
}
(logLikelihood)
}
def removeAll(pathDir: String) = {
  def delete(file: File): Array[(String, Boolean)] = {
    Option(file.listFiles).map(_.flatMap(f => delete(f))).getOrElse(Array()) :+ (file.getPath -> file.delete)
  }
  delete(new File(pathDir)) // actually invoke the recursive delete
}
}
and it has one Scala class:
package LDAv1
class Vector(val vect:Array[Int]) {
var data:Array[Int]=vect;
def this(size:Int){
this(new Array[Int](size));
}
def increment(index:Int){
data(index)+=1;
}
def decrement(index:Int){
data(index)-=1;
}
def printIt(){
print("[")
for(i<-0 to data.length-1)print(data(i)+",");
print("]\n")
}
def forEach(callback:(Int) => Unit)={
for(i<-0 to data.length-1)callback(data(i));
}
def add(a:Array[Int]){
for(i<-0 to data.length-1)data(i)+=a(i);
}
def sumAll():Int={
var sum:Int=0;
for(i<-0 to data.length-1)sum+=data(i);
(sum)
}
def normalize():Array[Double]={
var temp:Array[Double] = new Array[Double](data.length);
var sum:Double=0;
for(i<-0 to data.length-1)sum+=data(i);
if (sum>0) {
for(i<-0 to data.length-1) {
temp(i) = data(i).toDouble/sum
};
}
(temp)
}
}
You have to create a fat jar with all dependencies included; then you can submit your application from the terminal with Spark's built-in spark-submit script.
https://spark.apache.org/docs/latest/submitting-applications.html
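For example, here is a minimal sketch of the build and submit steps, assuming sbt with the sbt-assembly plugin (the versions, jar name, and paths below are illustrative, not taken from the question):

// project/plugins.sbt (illustrative plugin version)
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

// build.sbt
name := "SparkLDA"
scalaVersion := "2.12.18"

// Spark is provided by the cluster at runtime, so it is marked "provided"
// and sbt-assembly leaves it out of the fat jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.4.1" % "provided",
  "org.apache.commons" % "commons-math3" % "3.6.1"
)

// Build the fat jar with `sbt assembly`, then submit it from the terminal:
//   spark-submit --class LDAv1.SparkLDA --master local[*] target/scala-2.12/SparkLDA-assembly-0.1.jar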