Code with explanation:
val partitions = preparePartitioningDataset(dataset, "sdp_id").map { partitions =>
  val resultPartitionedDataset: Iterator[Future[Iterable[String]]] = for {
    partition <- partitions
  } yield {
    val whereStatement = s"SDP_ID = '$partition'"
    val partitionedDataset =
      datasetService.getFullDatasetResultIterable(
        dataset = dataset,
        format = format._1,
        limit = none[Int],
        where = whereStatement.some
      )
    partitionedDataset
  }
  resultPartitionedDataset
}

partitions.map { partitionedDataset =>
  for {
    partition <- partitionedDataset
  } notifyPartitionedDataset(
    bearerToken = bearerToken,
    endpoint = endpoint,
    dataset = partition
  )
}
So now:
preparePartitioningDataset(dataset, "sdp_id") returns a Future[Iterator[String]],
datasetService.getFullDatasetResultIterable itself also returns a Future[Iterable[String]],
so resultPartitionedDataset ends up being an Iterator[Future[Iterable[String]]],
and finally notifyPartitionedDataset returns a Future[Unit].
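For reference, here is a hedged sketch of the signatures implied by the description above (parameter names are my assumptions; only the return types come from the text):

import scala.concurrent.Future

// Assumed signatures, reconstructed from the description above.
def preparePartitioningDataset(dataset: String, column: String): Future[Iterator[String]] = ???

// i.e. datasetService.getFullDatasetResultIterable
def getFullDatasetResultIterable(
    dataset: String,
    format: String,
    limit: Option[Int],
    where: Option[String]
): Future[Iterable[String]] = ???

def notifyPartitionedDataset(
    bearerToken: String,
    endpoint: String,
    dataset: Iterable[String]
): Future[Unit] = ???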
Some explanation of what is happening and what I'm trying to achieve:
I have preparePartitioningDataset, which performs a SELECT DISTINCT on a single column, giving back a Future[ResultSet] (mapped to an Iterator). This is because for each distinct value I want to perform a SELECT * WHERE column = that_value. That happens in getFullDatasetResultIterable, again a Future[ResultSet], mapped in the same way.
The last step is to forward every single query result via a POST.
It works, but everything happens in parallel (well, I guess that's why I wanted to go for a Future in the first place). However, I'm now required to make each POST (notifyPartitionedDataset) happen sequentially, i.e. send one POST after another rather than all at once.
I've tried a lot of different approaches, but I still get the same outcome.
How can I move forward?
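For context (a minimal illustration, not part of the original code): a Future begins executing as soon as it is constructed, which is why yielding futures from the for-comprehension above does not sequence the POSTs.

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Both futures start running the moment they are constructed,
// so the two "POST"s below execute concurrently even though b is written after a.
val a = Future { println("POST 1") }
val b = Future { println("POST 2") }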
You can take advantage of the laziness of the IO datatype to ensure that some operations are executed in order.
import cats.effect.IO
import cats.syntax.all._
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

def preparePartitioningDatasetIO(dataset: String, foo: String): IO[List[String]] =
  IO.fromFuture(IO(
    preparePartitioningDataset(dataset, foo)
  )).map(_.toList)

def getFullDatasetResultIterableIO(dataset: String, format: String, limit: Option[Int], where: Option[String]): IO[List[String]] =
  IO.fromFuture(IO(
    datasetService.getFullDatasetResultIterable(
      dataset,
      format,
      limit,
      where
    )
  )).map(_.toList)

def notifyPartitionedDatasetIO(bearerToken: String, endpoint: String, dataset: List[String]): IO[Unit] =
  IO.fromFuture(IO(
    notifyPartitionedDataset(
      bearerToken,
      endpoint,
      dataset
    )
  ))
def program(dataset: String): IO[Unit] =
  preparePartitioningDatasetIO(dataset, "sdp_id").flatMap { partitions =>
    partitions.traverse_ { partition =>
      val whereStatement = s"SDP_ID = '$partition'"
      getFullDatasetResultIterableIO(
        dataset = dataset,
        format = format._1,
        limit = none,
        where = whereStatement.some
      ).flatMap { dataset =>
        notifyPartitionedDatasetIO(
          bearerToken = bearerToken,
          endpoint = endpoint,
          dataset = dataset
        )
      }
    }
  }

def run(dataset: String): Future[Unit] = {
  import cats.effect.unsafe.implicits.global
  program(dataset).unsafeToFuture()
}
The code needs to be carefully reviewed and fixed, especially the arguments of the functions. Still, it should help you get the result you want without having to refactor the whole codebase just yet.
If you want getFullDatasetResultIterableIO to run in parallel while notifyPartitionedDatasetIO runs serially, you can do this:
def program(dataset: String): IO[Unit] =
  preparePartitioningDatasetIO(dataset, "sdp_id").flatMap { partitions =>
    partitions.parTraverse { partition =>
      val whereStatement = s"SDP_ID = '$partition'"
      getFullDatasetResultIterableIO(
        dataset = dataset,
        format = format._1,
        limit = none,
        where = whereStatement.some
      )
    }.flatMap { datasets =>
      datasets.traverse_ { dataset =>
        notifyPartitionedDatasetIO(
          bearerToken = bearerToken,
          endpoint = endpoint,
          dataset = dataset
        )
      }
    }
  }
Note, though, that this implies all the data is kept in memory before the notifications start.
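If keeping everything in memory is a concern, a hedged alternative (my addition, assuming fs2 is on the classpath, which the answer above does not require) is to stream the partitions: fetch a few datasets concurrently but notify strictly in order, so only a bounded number of partition datasets are buffered at any time.

import cats.effect.IO
import fs2.Stream

// bearerToken, endpoint and format come from the surrounding context, as above.
def programStreaming(dataset: String): IO[Unit] =
  Stream
    .eval(preparePartitioningDatasetIO(dataset, "sdp_id"))
    .flatMap(partitions => Stream.emits(partitions))
    .parEvalMap(maxConcurrent = 4) { partition =>   // fetch up to 4 partitions concurrently, emit in order
      getFullDatasetResultIterableIO(
        dataset = dataset,
        format = format._1,
        limit = None,
        where = Some(s"SDP_ID = '$partition'")
      )
    }
    .evalMap(ds => notifyPartitionedDatasetIO(bearerToken, endpoint, ds)) // one POST at a time, in order
    .compile
    .drain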
Related
I have a dataframe in Spark and I need to process a particular column of that dataframe using a REST API. The API applies some transformation to a string and returns a result string. The API can process multiple strings at a time.
I can iterate over the column values, collect n of them in a batch, call the API, add the results back to the dataframe, and continue with the next batch. But this seems like the plain way of doing it, without taking advantage of Spark.
Is there a better way to do this that takes advantage of the Spark SQL optimiser and Spark's parallel processing?
For Spark parallel processing you can use mapPartitions
import org.apache.spark.sql.Dataset
import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse
import scala.collection.mutable.ListBuffer
import spark.implicits._ // assumes `spark` is the active SparkSession

case class Input(col: String)
case class Output(col: String, new_col: String)

val data = spark.read.csv("/a/b/c").as[Input].repartition(n)

def declare(partitions: Iterator[Input]): Iterator[Output] = {
  val url = ""
  implicit val formats: DefaultFormats.type = DefaultFormats
  val list = new ListBuffer[Output]()
  val httpClient = ??? // build your HTTP client here (left unspecified in the original answer)
  try {
    while (partitions.hasNext) {
      val x = partitions.next()
      val col = x.col
      val concat_url = ""
      val apiResp = HttpClientAcceptSelfSignedCertificate.call(httpClient, concat_url)
      if (apiResp.isDefined) {
        val json = parse(apiResp.get)
        val new_col = (json \\ "value_to_take_from_api").children.head.values.toString
        list += Output(col, new_col)
      } else {
        list += Output(col, "Not Found")
      }
    }
  } catch {
    case e: Exception => println("api Exception with : " + e.getMessage)
  } finally {
    HttpClientAcceptSelfSignedCertificate.close(httpClient)
  }
  list.iterator
}

val dd: Dataset[Output] = data.mapPartitions(x => declare(x))
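Since the question mentions that the API can process multiple strings per call, here is a hedged sketch of how the same mapPartitions approach could batch rows before calling it. callBatch is a hypothetical helper (not part of the code above) standing in for a batched API call.

// Hypothetical batched API call: returns one optional result per input string.
def callBatch(inputs: Seq[String]): Seq[Option[String]] =
  inputs.map(_ => None) // stub: replace with a real batched HTTP call

def declareBatched(partitions: Iterator[Input], batchSize: Int = 50): Iterator[Output] =
  partitions.grouped(batchSize).flatMap { batch =>
    val cols = batch.map(_.col)
    cols.zip(callBatch(cols)).map {
      case (c, Some(value)) => Output(c, value)       // value extracted from the API response
      case (c, None)        => Output(c, "Not Found") // same fallback as above
    }
  }

val ddBatched: Dataset[Output] = data.mapPartitions(p => declareBatched(p))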
def getAnimalName(dataFrame: DataFrame): List[String] = {
  dataFrame.select("animal").
    filter(col("animal").isNotNull && col("animal").notEqual("")).
    rdd.map(r => r.getString(0)).distinct().collect.toList
}
I am basically calling this function twice, to get the list for different purposes. Is there a way to retain the list in memory, so I don't have to call the same function again and again and only have to generate the list once, in Scala Spark?
Try something like the code below; you can also check the performance using the time function.
The explanation is inline as comments.
import org.apache.spark.rdd
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, functions}

object HandleCachedDF {

  var cachedAnimalDF: rdd.RDD[String] = _

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    val df = spark.read.json("src/main/resources/hugeTest.json") // Load your dataframe
    val df1 = time[rdd.RDD[String]] {
      getAnimalName(df)
    }
    val resultList = df1.collect().toList
    val df2 = time {
      getAnimalName(df)
    }
    val resultList1 = df2.collect().toList
    println(resultList.equals(resultList1))
  }

  def getAnimalName(dataFrame: DataFrame): rdd.RDD[String] = {
    if (cachedAnimalDF == null) { // First call: build and cache the RDD
      cachedAnimalDF = dataFrame.select("animal").
        filter(functions.col("animal").isNotNull && col("animal").notEqual("")).
        rdd.map(r => r.getString(0)).distinct().cache() // Cache the distinct values
    }
    cachedAnimalDF // Return the cached RDD
  }

  def time[R](block: => R): R = { // Compute the time taken by a function to execute
    val t0 = System.nanoTime()
    val result = block // call-by-name
    val t1 = System.nanoTime()
    println("Elapsed time: " + (t1 - t0) + "ns")
    result
  }
}
You would have to persist or cache at this point:
import org.apache.spark.rdd.RDD

val animals = dataFrame.select("animal").
  filter(col("animal").isNotNull && col("animal").notEqual("")).
  rdd.map(r => r.getString(0)).distinct().persist()
and then call the function as follows
def getAnimalName(animals: RDD[String]): List[String] =
  animals.collect.toList
as many times as you need, without repeating the process.
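A brief usage sketch (my addition; persist is lazy, so the first collect is what actually materialises the cached RDD):

// First call triggers the computation and fills the cache; later calls reuse the cached partitions.
val list1 = getAnimalName(animals)
val list2 = getAnimalName(animals)
assert(list1 == list2)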
I hope it helps.
I have a big dataset to transform from one structure to another. During that phase I also want to collect some info about a computed field (quadkeys for given lat/longs). I don't want to attach this info to every result row, since that would duplicate a lot of information and add memory overhead. All I need is to know which particular quadkeys are touched by the given coordinates. Is there any way to do it within one job, so I don't have to iterate over the dataset twice?
def load(paths: Seq[String]): (Dataset[ResultStruct], Dataset[String]) = {
  val df = sparkSession.sqlContext.read.format("com.databricks.spark.csv").option("header", "true")
    .schema(schema)
    .option("delimiter", "\t")
    .load(paths: _*)
    .as[InitialStruct]
  val qkSet = mutable.HashSet.empty[String]
  val result = df.map(c => {
    val id = c.id
    val points = toPoints(c.geom)
    points.foreach(p => qkSet.add(Quadkey.get(p.lat, p.lon, 6).getId))
    createResultStruct(id, points)
  })
  (result, ???) // plus some dataset built from the qkSets collected on all executors (this is the part in question)
}
You could use accumulators
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.util.AccumulatorV2

class SetAccumulator[T] extends AccumulatorV2[T, Set[T]] {
  import scala.collection.JavaConverters._

  private val items = new ConcurrentHashMap[T, Boolean]

  override def isZero: Boolean = items.isEmpty

  override def copy(): AccumulatorV2[T, Set[T]] = {
    val other = new SetAccumulator[T]
    other.items.putAll(items)
    other
  }

  override def reset(): Unit = items.clear()

  override def add(v: T): Unit = items.put(v, true)

  override def merge(other: AccumulatorV2[T, Set[T]]): Unit = other match {
    case setAccumulator: SetAccumulator[T] => items.putAll(setAccumulator.items)
  }

  override def value: Set[T] = items.keys().asScala.toSet
}
val df = Seq("foo", "bar", "foo", "foo").toDF("test")
val acc = new SetAccumulator[String]
spark.sparkContext.register(acc)

df.map {
  case Row(str: String) =>
    acc.add(str)
    str
}.count()

println(acc.value)
Prints
Set(bar, foo)
Note that map itself is lazy, so an action like count is needed to actually force the computation. Depending on the real use case, another option would be to cache the data frame and just use plain SQL functions: df.select("test").distinct().
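A hedged sketch of that second option, using the same toy data frame as above (assumes `spark` is the active SparkSession):

import org.apache.spark.sql.Row
import spark.implicits._

// Cache the frame once, then derive both the "real" transformation and the distinct
// values from the cached data; distinct() replaces the accumulator in this simple case.
val cached = df.cache()
val mapped = cached.map { case Row(str: String) => str }
val distinctValues = cached.select("test").distinct().as[String].collect().toSet
println(distinctValues) // Set(bar, foo)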
When I execute a function in mapPartitions on a dataset (executeStrategy()), it returns a result that I can check in the debugger, but when I use dataset.show() it shows me an empty table, and I do not know why this happens.
This is for a data mining job at my school. I'm using Windows 10, Scala 2.11.12 and Spark 2.2.0, which work without problems.
case class MyState(code: util.ArrayList[Object], evaluation: util.ArrayList[java.lang.Double])

private def executeStrategy(iter: Iterator[Row]): Iterator[(String, MyState)] = {
  val listBest = new util.ArrayList[State]
  Predicate.fuzzyValues = iter.toList
  for (i <- 0 until conf.runNumber) {
    Strategy.executeStrategy(conf.iterByRun, 1, conf.algorithm("algorithm").asInstanceOf[GeneratorType])
    listBest.addAll(Strategy.getStrategy.listBest)
  }
  val result = postMining(listBest)
  result.map(x => (x.getCode.toString, MyState(x.getCode, x.getEvaluation))).iterator
}

def run(sparkSession: SparkSession, n: Int): Unit = {
  import sparkSession.implicits._
  var data0 = conf.dataBase.repartition(n).persist(StorageLevel.MEMORY_AND_DISK_SER)
  var listBest = new util.ArrayList[State]
  implicit def enc1 = Encoders.bean(classOf[(String, MyState)])
  val data1 = data0.mapPartitions(executeStrategy)
  data1.show(3)
}
I expect the dataset to contain the results of processing each partition, which I can see while debugging, but I get an empty dataset.
I have tried an RDD with the same executeStrategy() function and it returns an RDD with the results. What is the problem with the dataset?
I'm reading multiple HTML files into a dataframe in Spark.
I'm converting elements of the HTML to columns in the dataframe using a custom UDF:
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("biz_name", parseDocValue(".biz-page-title")('filecontent))
  .withColumn("biz_website", parseDocValue(".biz-website a")('filecontent))
  ...

def parseDocValue(cssSelectorQuery: String) =
  udf((html: String) => Jsoup.parse(html).select(cssSelectorQuery).text())
This works perfectly; however, each withColumn call results in the HTML string being parsed again, which is redundant.
Is there a way (without using lookup tables or the like) to generate one parsed Document (Jsoup.parse(html)) per row, based on the "filecontent" column, and make it available to all withColumn calls on the dataframe?
Or shouldn't I even try using DataFrames, and just use RDDs?
So the final answer was in fact quite simple:
just map over the rows and create the objects once there.
def docValue(cssSelectorQuery: String, attr: Option[String] = None)(implicit document: Document): Option[String] = {
  val domObject = document.select(cssSelectorQuery)
  val domValue = attr match {
    case Some(a) => domObject.attr(a)
    case None => domObject.text()
  }
  domValue match {
    case x if x == null || x.isEmpty => None
    case y => Some(y)
  }
}

val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath, minPartitions = 265)
  .map {
    case (filepath, filecontent) => {
      implicit val document = Jsoup.parse(filecontent)
      val customDataJson = docJson(filecontent, customJsonRegex)

      DataEntry(
        biz_name = docValue(".biz-page-title"),
        biz_website = docValue(".biz-website a"),
        url = docValue("meta[property=og:url]", attr = Some("content")),
        ...
        filename = Some(fileName(filepath)),
        fileTimestamp = Some(fileTimestamp(filepath))
      )
    }
  }
  .toDS()
I'd probably rewrite it as follows, to do the parsing and selecting in one go and put them in a temporary column:
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("temp", parseDocValue(Array(".biz-page-title", ".biz-website a"))('filecontent))
  .withColumn("biz_name", col("temp")(0))
  .withColumn("biz_website", col("temp")(1))
  .drop("temp")

def parseDocValue(cssSelectorQueries: Array[String]) =
  udf((html: String) => {
    val j = Jsoup.parse(html)
    cssSelectorQueries.map(query => j.select(query).text())
  })
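As a small follow-up (my addition, not from the original answer): since the UDF returns an Array[String], the temporary column can also be unpacked in a single select with getItem. Here `raw` is assumed to be the DataFrame right after the .toDF("filepath", "filecontent") step above.

import org.apache.spark.sql.functions.col

// `raw` stands for the DataFrame right after .toDF("filepath", "filecontent") above.
val expanded = raw
  .withColumn("temp", parseDocValue(Array(".biz-page-title", ".biz-website a"))(col("filecontent")))
  .select(
    col("filepath"),
    col("temp").getItem(0).as("biz_name"),
    col("temp").getItem(1).as("biz_website")
  )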