Run a read-only test in Spark - Scala

I want to compare the read performance of different storage systems using Spark, e.g. HDFS vs. S3N. I have written a small Scala program for this:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val file = sc.textFile("s3n://test/wordtest")
    val splits = file.map(word => word)
    splits.saveAsTextFile("s3n://test/myoutput")
  }
}
My question is: is it possible to run a read-only test with Spark? For the program above, doesn't saveAsTextFile() cause some writing as well?

I am not sure that is possible at all. In order for a transformation to run, a subsequent action is necessary.
From the official Spark documentation:
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
Taking this into account, saveAsTextFile is not the lightest of the wide range of actions available. Several lightweight alternatives exist, for example count or first. These leave almost all of the work to the transformation phase, letting you measure the read performance of your solution.
You might want to check the available actions and choose the one that best fits your requirements.
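For example, a minimal read-only test along those lines could look like the sketch below. It reuses the S3 path from the question; the timing is just wall-clock time measured on the driver, purely for illustration.
import org.apache.spark.{SparkConf, SparkContext}

object ReadOnlyTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReadOnlyTest"))

    val start = System.nanoTime()
    // count() is an action, so it forces the whole file to be read,
    // but nothing is written back to the storage system.
    val lines = sc.textFile("s3n://test/wordtest").count()
    val elapsedMs = (System.nanoTime() - start) / 1000000

    println(s"Read $lines lines in $elapsedMs ms")
    sc.stop()
  }
}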

Yes."saveAsTextFile" writes the RDD data to text file using given path.

Related

apply map partitions on pyspark dataframe to run python logic

I would like to apply spaCy NLP to my PySpark DataFrame. I am using the mapPartitions concept on my PySpark DataFrame to apply Python logic that uses spaCy.
Spark version: 3.2.0
Below is the sample pyspark dataframe:
   token                     id
0  [This, is, java, world]    0
1  [This, is, spark, world]   1
Below is the code where I am passing data to the Python function and returning a dictionary:
def get_spacy_doc_parallel_map_func(partitionData):
    import spacy
    from tqdm import tqdm
    import pandas as pd
    nlp = spacy.load('en_core_web_sm')
    nlp.tokenizer = nlp.tokenizer.tokens_from_list
    from spacy.tokens import Doc
    if not Doc.has_extension("text_id"):
        Doc.set_extension("text_id", default=None)
    columnNames = broadcasted_source_columns.value
    partitionData = pd.DataFrame(partitionData, columns=columnNames)

    '''
    This function creates a mapper of review id and review spacy.doc.Doc type
    '''
    def get_spacy_doc_parallel(data):
        text_tuples = []
        dodo = data[['token', 'id']].drop_duplicates(['id'])
        for _, i in dodo.iterrows():
            text_tuples.append((i['token'], {'text_id': i['id']}))
        doc_tuples = nlp.pipe(text_tuples, as_tuples=True, n_process=8, disable=['tagger', 'parser', 'ner'])
        docsf = []
        for doc, context in tqdm(doc_tuples, total=len(text_tuples)):
            doc._.text_id = context["text_id"]
            docsf.append(doc)
        vv = {}
        for doc in docsf:
            vv[doc._.text_id] = doc
        return vv

    id_spacy_doc_mapper = get_spacy_doc_parallel(partitionData)
    partitionData['spacy_doc'] = id_spacy_doc_mapper
    partitionData.reset_index(inplace=True)
    partitionData_dict = partitionData.to_dict("index")
    result = []
    for key in partitionData_dict:
        result.append(partitionData_dict[key])
    return iter(result)

resultDF_tokens = data.rdd.mapPartitions(get_spacy_doc_parallel_map_func)
df = spark.createDataFrame(resultDF_tokens)
The issue I am getting here is that the length of the dictionary values does not match the length of the dataframe. Below is the error:
Error:
ValueError: Length of values (954) does not match length of index (1438)
Output:
{0: This is java word, 1: This is spark world }
The above dictionary is assigned as a column to the pandas DataFrame after applying spaCy (partitionData['spacy_doc'] = id_spacy_doc_mapper).
I don't have enough experience with spacy to figure out what the intent is here, and I'm confused by the input and output because the input already looks tokenized, but I'll take a stab at it and list my assumptions and the problems I ran into.
First off, I think Fugue can make this type of transformation significantly easier. It will use the underlying Spark UDF, pandas_udf, mapPartitions, or mapInPandas depending on what parameters you supply. The point is that Fugue handles that boilerplate. For you, it seems you have a Pandas DataFrame in (that part is clear), but the output is less clear. I think you are passing some iterable of lists to make Spark happy, but a Pandas DataFrame output might be simpler. I'm guessing here.
So first we set some stuff up. This is all native Python. The tokens_from_list portion was removed from the original because it seems like the latest versions deprecated it. Shouldn't matter for the example.
import pandas as pd
from typing import List, Any, Iterable, Dict
import spacy

nlp = spacy.load('en_core_web_sm')

from spacy.tokens import Doc
if not Doc.has_extension("text_id"):
    Doc.set_extension("text_id", default=None)

data = pd.DataFrame({"token": ["This is java world", "This is spark world"],
                     "id": [0, 1]})
Then you define your logic for one partition. I am assuming Pandas DataFrame in and Pandas DataFrame out, but Fugue can actually support many other types, such as Pandas DataFrame in and Iterable[List] out. The important thing is just that you annotate your logic so Fugue knows how to handle it. Note this code is still native Python. I edited the logic a bit just to get it to work. Again, I am pretty sure I butchered the logic somewhere, but the example can still work; I really couldn't find a way to make the original work (because I don't know spacy well enough).
def get_spacy_doc(data: pd.DataFrame) -> pd.DataFrame:
    text_tuples = []
    dodo = data[['token', 'id']].drop_duplicates(['id'])
    for _, i in dodo.iterrows():
        text_tuples.append((i['token'], {'text_id': i['id']}))
    doc_tuples = nlp.pipe(text_tuples, as_tuples=True, n_process=1, disable=['tagger', 'parser', 'ner'])
    docsf = []
    for doc, context in doc_tuples:
        doc._.text_id = context["text_id"]
        docsf.append(doc)
    vv = {}
    for doc in docsf:
        vv[doc._.text_id] = doc
    id_spacy_doc_mapper = vv.copy()
    data['space_doc'] = id_spacy_doc_mapper
    return data
Now to bring this to Spark, all you have to do with Fugue is:
from fugue import transform
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(data)
sdf = transform(sdf, get_spacy_doc, schema="*, space_doc:int", engine=spark)
sdf.show()
and the Fugue transform will handle it. This runs on Spark, but you can also run on Pandas if you don't supply an engine, like this:
df = transform(data, get_spacy_doc, schema="*, space_doc:int")
This allows you to test the logic clearly without relying on Spark. It will then work when you bring it to Spark. Schema is a requirement because it is a requirement for Spark.
On partitioning
The Fugue transform can take partitioning strategy. For example:
transform(df, func, schema="*", partition={"by":"col1"}, engine=spark)
but in this case I don't think you are partitioning on anything, so you can just use the default partitioning, which is what will happen.
On parallelization
You have this code like:
nlp.pipe(text_tuples, as_tuples=True,n_process=8,disable=['tagger','parser','ner'])
This is two-stage parallelism. The first stage is Spark mapping over partitions, and the second stage is this pipe operation (I assume). Two-stage parallelism is an anti-pattern in distributed computing because the first stage will already occupy all the available resources on the cluster. The parallelism should be done at the partition level. If you do something like this, it's very common to run into resource deadlocks when the second stage also tries to occupy resources. I would recommend setting n_process=1.
On tqdm
I may be wrong on this one, but I don't think tqdm plays well with Spark, because I don't think you can get a real-time progress bar for work that happens on worker machines. It can only work on the driver machine. The workers don't send logs to the driver for the functions they run.
If the example is clearer, I can certainly help you port this logic to Spark. Feel free to reach out. I hope at least some bit of this was useful.

How to convert Dataset to a Scala Iterable?

Is there a way to convert a org.apache.spark.sql.Dataset to a scala.collection.Iterable? It seems like this should be simple enough.
You can do myDataset.collect or myDataset.collectAsList.
But then it will no longer be distributed. If you want to be able to spread your computations out on multiple machines you need to use one of the distributed datastructures such as RDD, Dataframe or Dataset.
You can also use toLocalIterator if you just need to iterate the contents on the driver, as it has the advantage of only loading one partition at a time, instead of the entire dataset, into memory. Iterator is not an Iterable (although it is a TraversableOnce), but depending on what you are doing it may be what you want.
You could try something like this:
def toLocalIterable[T](dataset: Dataset[T]): Iterable[T] = new Iterable[T] {
  def iterator = scala.collection.JavaConverters.asScalaIterator(dataset.toLocalIterator)
}
The conversion via JavaConverters.asScalaIterator is necessary because the toLocalIterator method of Dataset returns a java.util.Iterator instead of a scala.collection.Iterator (which is what the toLocalIterator on RDD returns.) I suspect this is a bug.
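As a quick usage sketch of the helper above (assuming a Dataset[String] named myDataset is already in scope; the name is just illustrative):
val names: Iterable[String] = toLocalIterable(myDataset)
names.take(10).foreach(println) // partitions are only pulled to the driver as they are consumed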
In Scala 2.11 you can do the following:
import scala.collection.JavaConverters._
dataset.toLocalIterator.asScala.toIterable

Throttle or debounce method calls

Let's say I have a method that permits to update some date in DB:
def updateLastConsultationDate(userId: String): Unit = ???
How can I throttle/debounce that method easily so that it won't be run more than once an hour per user?
I'd like the simplest possible solution, not based on any event-bus, actor lib or persistence layer. I'd like an in-memory solution (and I am aware of the risks).
I've seen solutions for throttling in Scala based on Akka Throttler, but it really looks like overkill to me to start using actors just for throttling method calls. Isn't there a very simple way to do that?
Edit: as it seems not clear enough, here's a visual representation of what I want, implemented in JS. As you can see, throttling may not only be about filtering subsequent calls, but also postponing calls (also called trailing events in js/lodash/underscore). The solution I'm looking for can't be based on pure-synchronous code only.
This sounds like a great job for a ReactiveX-based solution. On Scala, Monix is my favorite one. Here's the Ammonite REPL session illustrating it:
import $ivy.`io.monix::monix:2.1.0` // I'm using Ammonite's magic imports; it's equivalent to adding "io.monix" %% "monix" % "2.1.0" to your libraryDependencies in SBT
import scala.concurrent.duration.DurationInt
import monix.reactive.subjects.ConcurrentSubject
import monix.reactive.Consumer
import monix.execution.Scheduler.Implicits.global
import monix.eval.Task
class DbUpdater {
  val publish = ConcurrentSubject.publish[String]
  val throttled = publish.throttleFirst(1 hour)
  val cancelHandle = throttled.consumeWith(
      Consumer.foreach(userId =>
        println(s"update your database with $userId here")))
    .runAsync

  def updateLastConsultationDate(userId: String): Unit = {
    publish.onNext(userId)
  }

  def stop(): Unit = cancelHandle.cancel()
}
Yes, and with Scala.js this code will work in the browser, too, if it's important for you.
Since you ask for the simplest possible solution, you can store a mutable lastUpdateByUser: Map[String, Long], which you would consult before allowing an update
if (lastUpdateByUser.getOrElse(userName, 0)+60*60*1000 < System.currentTimeMillis) updateLastConsultationDate(...)
and update when a user actually performs an update
lastUpdateByUser(userName) = System.currentTimeMillis
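Put together, a minimal in-memory sketch of that idea could look like the following (TrieMap is used here for thread safety; the object and method names are illustrative, and updateLastConsultationDate is the stub from the question):
import scala.collection.concurrent.TrieMap

object ConsultationThrottle {
  private val lastUpdateByUser = TrieMap.empty[String, Long]
  private val oneHourMs = 60 * 60 * 1000L

  def updateLastConsultationDate(userId: String): Unit = ??? // from the question

  // Forwards to updateLastConsultationDate at most once per hour per user.
  def throttledUpdate(userId: String): Unit = {
    val now = System.currentTimeMillis
    if (lastUpdateByUser.getOrElse(userId, 0L) + oneHourMs < now) {
      lastUpdateByUser(userId) = now
      updateLastConsultationDate(userId)
    }
  }
}
Note that the check-and-set above is not atomic, so two concurrent calls for the same user could both slip through, and like the snippet it is based on, it only drops calls rather than postponing them (no trailing calls as in the lodash example from the question).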
One way to throttle would be to maintain a count in a Redis instance. Doing so would ensure that the DB wouldn't be updated more often than intended, no matter how many Scala processes you were running, because the state is stored outside of the process.

Integrate Spark ML Model in Scala App without embedded Spark Cluster

I have trained a Spark Multilayer Perceptron Classifier to detect spam messages and would like to use it in a webservice in combination with the Play Framework.
My solution (see below) spawns an embedded local spark cluster, loads the model and classifies messages. Is there a way to use the model without an embedded Spark cluster?
Spark has some dependencies that clash with the Play Framework dependencies. I thought there might be a way to run the model in classification mode without starting an embedded spark cluster.
My second question is if I can classify a single message without putting it in a DataFrame first.
Application Loader:
lazy val sparkSession: SparkSession = {
  val conf: SparkConf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("Classifier")
    .set("spark.ui.enabled", "false")
  val session = SparkSession.builder()
    .config(conf)
    .getOrCreate()
  applicationLifecycle.addStopHook { () ⇒
    Future { session.stop() }
  }
  session
}

lazy val model: PipelineModel = {
  sparkSession
  CrossValidatorModel.load("mpc-model").bestModel.asInstanceOf[PipelineModel]
}
Classification service (model and spark session are injected):
val messageDto = Seq(MessageSparkDto(
  sender        = message.sender.email.value,
  text          = featureTransformer.cleanText(text).value,
  messagelength = text.value.length,
  isMultimail   = featureTransformer.isMultimail(message.sender.email),
))

val messageDf = messageDto.toDS()

model.transform(messageDf).head().getAs[Double]("prediction") match {
  case 1.0 ⇒ MessageEvaluationResult(MessageClass.Spam)
  case _   ⇒ MessageEvaluationResult(MessageClass.NonSpam)
}
Edit: As pointed out in the comments, one solution could be to transform the model to PMML and then use another engine to load the model and use it for classification. This sounds to me like a lot of overhead as well. Does someone have experience with running Spark in local mode with minimal overhead and dependencies in order to use the ML classifiers?
Although I like the solution proposed in the linked post, the following might also be possible. You could of course copy the model to the server on which you will deploy the webservice, install a Spark "cluster" with one machine on it and put spark-jobserver on top of it, which would handle the requests and access Spark. That would be the no-brainer solution and should work if your model does not need lots of computational power.

how to make saveAsTextFile NOT split output into multiple file?

When using Scala in Spark, whenever I dump the results out using saveAsTextFile, it seems to split the output into multiple parts. I'm just passing a parameter (the path) to it.
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")
Does the number of outputs correspond to the number of reducers it uses?
Does this mean the output is compressed?
I know I can combine the output together using bash, but is there an option to store the output in a single text file, without splitting? I looked at the API docs, but they don't say much about this.
The reason it saves it as multiple files is because the computation is distributed. If the output is small enough such that you think you can fit it on one machine, then you can end your program with
val arr = year.collect()
And then save the resulting array as a file. Another way would be to use a custom partitioner, via partitionBy, and make it so everything goes to one partition, though that isn't advisable because you won't get any parallelization.
If you require the file to be saved with saveAsTextFile, you can use coalesce(1, true).saveAsTextFile(). This basically means do the computation, then coalesce to one partition. You can also use repartition(1), which is just a wrapper for coalesce with the shuffle argument set to true. Looking through the source of RDD.scala is how I figured most of this out; you should take a look.
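Using the year RDD from the question, that would look roughly like this (the output path is just illustrative):
// Keep upstream work parallel, then shuffle everything to a single partition before writing:
year.coalesce(1, shuffle = true).saveAsTextFile("year-single")
// repartition(1) is an exact synonym, since it is coalesce(1, shuffle = true) under the hood:
// year.repartition(1).saveAsTextFile("year-single")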
For those working with a larger dataset:
rdd.collect() should not be used in this case, as it will collect all data as an Array in the driver, which is the easiest way to run out of memory.
rdd.coalesce(1).saveAsTextFile() should also not be used, as the parallelism of upstream stages would be lost: they would be performed on the single node the data is written from.
rdd.coalesce(1, shuffle = true).saveAsTextFile() is the best simple option as it will keep the processing of upstream tasks parallel and then only perform the shuffle to one node (rdd.repartition(1).saveAsTextFile() is an exact synonym).
rdd.saveAsSingleTextFile() as provided below additionally allows one to store the rdd in a single file with a specific name while keeping the parallelism properties of rdd.coalesce(1, shuffle = true).saveAsTextFile().
Something that can be inconvenient with rdd.coalesce(1, shuffle = true).saveAsTextFile("path/to/file.txt") is that it actually produces a file whose path is path/to/file.txt/part-00000 and not path/to/file.txt.
The following solution rdd.saveAsSingleTextFile("path/to/file.txt") will actually produce a file whose path is path/to/file.txt:
package com.whatever.package
import org.apache.spark.rdd.RDD
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.hadoop.io.compress.CompressionCodec
object SparkHelper {

  // This is an implicit class so that saveAsSingleTextFile can be attached to
  // RDD[String] and be called like this: rdd.saveAsSingleTextFile("path")
  implicit class RDDExtensions(val rdd: RDD[String]) extends AnyVal {

    def saveAsSingleTextFile(path: String): Unit =
      saveAsSingleTextFileInternal(path, None)

    def saveAsSingleTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit =
      saveAsSingleTextFileInternal(path, Some(codec))

    private def saveAsSingleTextFileInternal(
        path: String, codec: Option[Class[_ <: CompressionCodec]]
    ): Unit = {

      // The interface with hdfs:
      val hdfs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)

      // Classic saveAsTextFile in a temporary folder:
      hdfs.delete(new Path(s"$path.tmp"), true) // to make sure it's not there already
      codec match {
        case Some(codec) => rdd.saveAsTextFile(s"$path.tmp", codec)
        case None        => rdd.saveAsTextFile(s"$path.tmp")
      }

      // Merge the folder of resulting part-xxxxx into one file:
      hdfs.delete(new Path(path), true) // to make sure it's not there already
      FileUtil.copyMerge(
        hdfs, new Path(s"$path.tmp"),
        hdfs, new Path(path),
        true, rdd.sparkContext.hadoopConfiguration, null
      )
      // Working with Hadoop 3?: https://stackoverflow.com/a/50545815/9297144

      hdfs.delete(new Path(s"$path.tmp"), true)
    }
  }
}
which can be used this way:
import com.whatever.package.SparkHelper.RDDExtensions
rdd.saveAsSingleTextFile("path/to/file.txt")
// Or if the produced file is to be compressed:
import org.apache.hadoop.io.compress.GzipCodec
rdd.saveAsSingleTextFile("path/to/file.txt.gz", classOf[GzipCodec])
This snippet:
First stores the rdd with rdd.saveAsTextFile("path/to/file.txt") in a temporary folder path/to/file.txt.tmp as if we didn't want to store data in one file (which keeps the processing of upstream tasks parallel)
And then only, using the hadoop file system api, we proceed with the merge (FileUtil.copyMerge()) of the different output files to create our final output single file path/to/file.txt.
You could call coalesce(1) and then saveAsTextFile() - but it might be a bad idea if you have a lot of data. Separate files per split are generated, just like in Hadoop, in order to let separate mappers and reducers write to different files. Having a single output file is only a good idea if you have very little data, in which case you could do collect() as well, as @aaronman said.
As others have mentioned, you can collect or coalesce your data set to force Spark to produce a single file. But this also limits the number of Spark tasks that can work on your dataset in parallel. I prefer to let it create a hundred files in the output HDFS directory, then use hadoop fs -getmerge /hdfs/dir /local/file.txt to extract the results into a single file in the local filesystem. This makes the most sense when your output is a relatively small report, of course.
In Spark 1.6.1 the format is as shown below. It creates a single output file. It is best practice to use it if the output is small enough to handle. Basically, coalesce returns a new RDD that is reduced into numPartitions partitions. If you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1).
pair_result.coalesce(1).saveAsTextFile("/app/data/")
You can call repartition() and use it this way:
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
var repartitioned = year.repartition(1)
repartitioned.saveAsTextFile("C:/Users/TheBhaskarDas/Desktop/wc_spark00")
You will be able to do it in the next version of Spark; in the current version 1.0.0 it's not possible unless you do it manually somehow, for example, as you mentioned, with a bash script call.
I also want to mention that the documentation clearly states that users should be careful when calling coalesce with a really small number of partitions, since this can cause upstream partitions to inherit this number of partitions.
I would not recommend using coalesce(1) unless really required.
Here's my answer to output a single file. I just added coalesce(1) to the original code:
val year = sc.textFile("apat63_99.txt")
.map(_.split(",")(1))
.flatMap(_.split(","))
.map((_,1))
.reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")
With coalesce(1) added:
year.coalesce(1).saveAsTextFile("year")