Problems with spark datasets - scala

When I execute a function (executeStrategy()) inside mapPartitions on a dataset, it returns results that I can verify in the debugger, but when I call dataset.show() it displays an empty table and I do not know why.
This is for a data mining project at my school. I'm using Windows 10, Scala 2.11.12 and Spark 2.2.0, which otherwise work without problems.
case class MyState(code: util.ArrayList[Object], evaluation: util.ArrayList[java.lang.Double])

private def executeStrategy(iter: Iterator[Row]): Iterator[(String, MyState)] = {
  val listBest = new util.ArrayList[State]
  Predicate.fuzzyValues = iter.toList
  for (i <- 0 until conf.runNumber) {
    Strategy.executeStrategy(conf.iterByRun, 1, conf.algorithm("algorithm").asInstanceOf[GeneratorType])
    listBest.addAll(Strategy.getStrategy.listBest)
  }
  val result = postMining(listBest)
  result.map(x => (x.getCode.toString, MyState(x.getCode, x.getEvaluation))).iterator
}

def run(sparkSession: SparkSession, n: Int): Unit = {
  import sparkSession.implicits._
  var data0 = conf.dataBase.repartition(n).persist(StorageLevel.MEMORY_AND_DISK_SER)
  var listBest = new util.ArrayList[State]
  implicit def enc1 = Encoders.bean(classOf[(String, MyState)])
  val data1 = data0.mapPartitions(executeStrategy)
  data1.show(3)
}
I expect the dataset to contain the results of processing each partition (which I can see while debugging), but I get an empty dataset.
I have tried the same executeStrategy() function with an RDD, and that returns an RDD with the results. What is the problem with the dataset?
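For reference, one thing worth checking here (an assumption on my part, not verified against this exact code) is the Encoders.bean call: bean encoders expect a Java-bean-style class with a no-argument constructor and getters/setters, which a pair of a String and a Scala case class does not provide. A minimal sketch of supplying an explicit tuple encoder instead, built from Encoders.STRING and a Kryo encoder for MyState:

import org.apache.spark.sql.{Encoder, Encoders}

// explicit encoder for the (String, MyState) pairs produced by executeStrategy
implicit val tupleEnc: Encoder[(String, MyState)] =
  Encoders.tuple(Encoders.STRING, Encoders.kryo[MyState])

val data1 = data0.mapPartitions(executeStrategy)
data1.show(3)

With a Kryo encoder the MyState column is stored as a single binary value, so show() will not pretty-print its fields, but if the encoder is the culprit the rows should no longer come back empty.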

Related

How to batch columns of spark dataframe, process with REST API and add it back?

I have a dataframe in Spark and I need to process a particular column in that dataframe using a REST API. The API applies a transformation to a string and returns a result string, and it can process multiple strings at a time.
I could iterate over the column, collect n values at a time, call the API for each batch, add the results back to the dataframe, and continue with the next batch. But this seems like the straightforward way of doing it, without taking advantage of Spark.
Is there a better way to do this which can take advantage of spark sql optimiser and spark parallel processing?
For Spark parallel processing you can use mapPartitions
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Dataset
import org.json4s._
import org.json4s.jackson.JsonMethods.parse // or org.json4s.native.JsonMethods.parse

case class Input(col: String)
case class Output(col: String, new_col: String)

val data = spark.read.csv("/a/b/c").as[Input].repartition(n)

def declare(partitions: Iterator[Input]): Iterator[Output] = {
  val url = ""
  implicit val formats: DefaultFormats.type = DefaultFormats
  var list = new ListBuffer[Output]()
  // build one HTTP client per partition (the factory method here is a placeholder)
  val httpClient = HttpClientAcceptSelfSignedCertificate.createClient()
  try {
    while (partitions.hasNext) {
      val x = partitions.next()
      val col = x.col
      val concat_url = ""
      val apiResp = HttpClientAcceptSelfSignedCertificate.call(httpClient, concat_url)
      if (apiResp.isDefined) {
        val json = parse(apiResp.get)
        val new_col = (json \\ "value_to_take_from_api").children.head.values.toString
        list += Output(col, new_col)
      } else {
        list += Output(col, "Not Found")
      }
    }
  } catch {
    case e: Exception => println("api Exception with : " + e.getMessage)
  } finally {
    HttpClientAcceptSelfSignedCertificate.close(httpClient)
  }
  list.iterator
}

val dd: Dataset[Output] = data.mapPartitions(x => declare(x))
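A note on the design (my reading of the snippet, not something the answer spells out): mapPartitions is used so the HTTP client is created and closed once per partition rather than once per row, which keeps connection overhead proportional to the number of partitions. Triggering the computation and inspecting a few results could then look like:

dd.show(5) // runs declare() on enough partitions to produce five rows and prints them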

How to construct (key, value) list from parallelized list in scala spark?

I am new to Scala and I have some questions about how it works.
I want to do the following: given a list of values, I want to construct an imitation of a dictionary in parallel, something like this: (1,2,3,4) -> ((1,1), (2,2), (3,3), (4,4)). I know that when dealing with parallelized collections we should use accumulators. So here is my attempt:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable.ListBuffer
class DictAccumulatorV2 extends AccumulatorV2[Int, ListBuffer[(Int, Int)]] {
  private var dict: ListBuffer[(Int, Int)] = new ListBuffer[(Int, Int)]

  def reset(): Unit = {
    dict.clear()
  }

  def add(v: Int): Unit = {
    dict.append((v, v))
  }

  def value(): ListBuffer[(Int, Int)] = {
    return dict
  }

  def isZero(): Boolean = {
    return dict.isEmpty
  }

  def copy(): AccumulatorV2[Int, ListBuffer[(Int, Int)]] = {
    // I do not understand how to code it correctly
    return new DictAccumulatorV2
  }

  def merge(other: AccumulatorV2[Int, ListBuffer[(Int, Int)]]): Unit = {
    // I do not understand how to code it correctly without reinitializing dict from val to var
    dict = dict ++ other.value
  }
}

object FirstSparkApplication {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyFirstApp").setMaster("local")
    val sc = new SparkContext(conf)

    val accum = new DictAccumulatorV2()
    sc.register(accum, "mydictacc")

    val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)
    var res = distData.map(x => accum.add(x))
    res.count()
    println(accum)
  }
}
So I wonder whether I am doing this right or there are any mistakes.
In general I also have questions about how sc.parallelize works. Does it actually parallelize the job on my machine, or is it just a fictional line of code? What should I put instead of "local" in setMaster? How can I see which nodes the task is being performed on? Is the task performed on all of the nodes at the same time, or is there some sequence?
(1,2,3,4) -> ((1,1), (2,2), (3,3), (4,4) )
You can do this in Scala by doing
val list = List(1,2,3,4)
val dict = list.map(i => (i,i))
Spark accumulators are meant as a means of communication from the Spark executors back to the driver, not as a way to build up a result collection like this.
If you want to do the above in parallel, you would construct an RDD out of this list and apply a map transformation to it, as shown above.
In spark shell it would look like
val list = List(1,2,3,4)
val listRDD = sc.parallelize(list)
val dictRDD = listRDD.map(i => (i,i))
how sc.parallelize works
It creates a distributed dataset (an RDD in Spark terms) from the collection that you pass in to the function (see the RDD programming guide for more information).
It does parallelize your job.
If you are submitting your Spark job to a cluster, then you should be able to see a YARN application ID or URL after running the spark-submit command. You can visit the YARN application URL and see how many executors are processing that distributed dataset and in what order the tasks are performed.
What should I put instead of "local" in setMaster
From the Spark documentation -
The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
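For completeness, if you did want to keep the accumulator approach from the question: copy() is normally expected to return an accumulator that carries a copy of the current buffer, and merge() can append in place, which also lets dict stay exactly as it is declared. A minimal sketch (my own, following the usual AccumulatorV2 contract rather than anything stated in the answer above):

override def copy(): AccumulatorV2[Int, ListBuffer[(Int, Int)]] = {
  val newAcc = new DictAccumulatorV2
  newAcc.dict = dict.clone()   // carry the current contents over into the copy
  newAcc
}

override def merge(other: AccumulatorV2[Int, ListBuffer[(Int, Int)]]): Unit = {
  dict ++= other.value   // ListBuffer's ++= mutates in place, so no reassignment is needed
}

Also be aware that accumulator updates made inside a transformation such as map may be applied more than once if tasks are re-executed; the merged value is only reliable on the driver after an action has completed.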

Getting java.lang.ArrayIndexOutOfBoundsException: 1 in spark when applying aggregate functions

I am trying to do some transformations on a data set. After reading the data set, df.show() lists the rows in the spark shell. But when I try df.count or any aggregate function, I get
java.lang.ArrayIndexOutOfBoundsException: 1.
val itpostsrow = sc.textFile("/home/jayk/Downloads/spark-data")

import scala.util.control.Exception.catching
import java.sql.Timestamp

implicit class StringImprovements(val s: String) {
  def toIntSafe = catching(classOf[NumberFormatException]) opt s.toInt
  def toLongsafe = catching(classOf[NumberFormatException]) opt s.toLong
  def toTimeStampsafe = catching(classOf[IllegalArgumentException]) opt Timestamp.valueOf(s)
}

case class Post(commentcount: Option[Int], lastactivitydate: Option[java.sql.Timestamp], ownerUserId: Option[Long], body: String, score: Option[Int], creattiondate: Option[java.sql.Timestamp], viewcount: Option[Int], title: String, tags: String, answerCount: Option[Int], acceptedanswerid: Option[Long], posttypeid: Option[Long], id: Long)

def stringToPost(row: String): Post = {
  val r = row.split("~")
  Post(r(0).toIntSafe,
    r(1).toTimeStampsafe,
    r(2).toLongsafe,
    r(3),
    r(4).toIntSafe,
    r(5).toTimeStampsafe,
    r(6).toIntSafe,
    r(7),
    r(8),
    r(9).toIntSafe,
    r(10).toLongsafe,
    r(11).toLongsafe,
    r(12).toLong)
}

val itpostsDFcase1 = itpostsrow.map{x => stringToPost(x)}
val itpostsDF = itpostsDFcase1.toDF()
Your function stringToPost() can throw a java.lang.ArrayIndexOutOfBoundsException if the text file contains an empty row, or if a line yields fewer than 13 fields after the split.
Due to Spark's lazy evaluation, you only notice such errors when performing an action: show() evaluates just the first few rows, while count processes every row, including the malformed one.
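One common way to guard against this (a sketch of my own, assuming the "~"-delimited layout above) is to drop malformed rows before parsing them:

// keep only rows that contain at least the 13 fields stringToPost reads
val itpostsDFcase1 = itpostsrow
  .filter(row => row.split("~").length >= 13)
  .map(stringToPost)
val itpostsDF = itpostsDFcase1.toDF()

Splitting each row twice is slightly wasteful; mapping to the split array first and filtering on its length avoids that, but the idea is the same.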

Unable to convert a ResultSet into a List in Cassandra datastax driver

In the following Cassandra code, I am querying a database and expect multiple values. The function takes an id and should return Option[List[M]], where M is my model. I have a function rowToModel(row: Row): MyModel which takes a row from the ResultSet and converts it into an instance of my model.
My issue is that the List I am returning is always empty even though the ResultSet has data. I checked this by adding debug prints in rowToModel.
def getRowsByPartitionKeyId(id: I): Option[List[M]] = {
  val whereClause = whereConditions(tablename, id);
  val resultSet = session.execute(whereClause) //resultSet is an iterator
  val it = resultSet.iterator();
  val resultList: List[M] = List();
  if (it.hasNext) {
    while (it.hasNext) {
      val item: M = rowToModel(it.next())
      resultList.:+(item)
    }
    Some(resultList) //THIS IS ALWAYS List()
  }
  else
    None
}
I suspect that because resultList is a val, its value is not being changed in the while loop. I should probably use yield or something else, but I don't know what or how.
Solved it by converting the Java iterator to a Scala one and then using toList:
import collection.JavaConverters._

val it = resultSet.iterator();
if (it.hasNext) {
  val resultSetAsList: List[Row] = asScalaIterator(it).toList
  val resultSetAsModelList = resultSetAsList.map((row: Row) => rowToModel(row))
  Some(resultSetAsModelList)
}
else
  None
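As a side note (a sketch of my own, not part of the fix above): the DataStax driver's ResultSet also exposes all(), which returns a java.util.List[Row], so the same conversion can be written without touching the iterator directly:

import scala.collection.JavaConverters._

// all() drains the remaining rows into a Java list, which asScala exposes as a Scala collection
val models = resultSet.all().asScala.map(rowToModel).toList
if (models.nonEmpty) Some(models) else None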

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read the input file and compare each line with the set "123,200,300"; if a match is found, output the matching values:
200,300 (from the 1st input line)
300 (from the 2nd input line)
123 (from the 4th input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object sparkApp {
  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("CountingSheep")
  val sc = new SparkContext(conf)

  def parseLine(invCol: String): RDD[String] = {
    println(s"INPUT, $invCol")
    val inv_rdd = sc.parallelize(Seq(invCol.toString))
    val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
    return inv_rdd.intersection(bs_meta_rdd)
  }

  def main(args: Array[String]) {
    val filePathName = "hdfs://xxx/tmp/input.csv"
    val rawData = sc.textFile(filePathName)
    val datad = rawData.map{r => parseLine(r)}
  }
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt, p(1).trim())).toDF()
pDF.select("id", "pName").show()
Define UDF
val findP = udf((id: Int, pName: String) => {
  val ids = Array("123", "200", "300")
  var idsFound: String = ""
  for (id <- ids) {
    if (pName.contains(id)) {
      idsFound = idsFound + id + ","
    }
  }
  if (idsFound.length() > 0) {
    idsFound = idsFound.substring(0, idsFound.length - 1)
  }
  idsFound
})
Use the UDF in withColumn()
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For a simpler answer, why make it so complex? In this case we don't need a UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("|"))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.
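To tie this back to the code in the question (a sketch of my own, reusing the rawData name and the match set from the snippets above): the nested-RDD parseLine can be replaced by broadcasting the small lookup set and doing the intersection with plain Scala collections inside a single map:

// broadcast the lookup set to the executors instead of building RDDs inside map()
val matchSet = sc.broadcast("123,200,300".split(",").toSet)
val datad = rawData.map { line =>
  line.split(",").filter(matchSet.value.contains).mkString(",")
}
datad.foreach(println)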