I know this question is a little long, but I did my best to simplify it and make it approachable. Please provide feedback and I will try to act on it.
I am new to Flink, and I am trying to use it for a pipeline that will continuously update each item of a large (TBs) dataset. I want to update higher-priority items more often, but I want to update all items as fast as possible.
Would it be possible to have a different source for each priority tier (e.g. high, medium, low) to start the update process, and read from the higher-priority sources more often? The other approach I've thought of is a custom SourceFunction that holds a reader for each file and emits according to the rates I set. The first approach didn't seem feasible, so I am trying the second, but I am stuck.
This is what I've tried so far:
import scala.collection.mutable
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.hadoop.fs.{FileSystem, Path}
import java.util.concurrent.locks.ReadWriteLock
import java.util.concurrent.locks.ReentrantReadWriteLock
import java.util.concurrent.atomic.AtomicBoolean
/** Implements logic to ensure higher-priority values are pushed more often.
 *
 * The relative priority is the relative run frequency. For example, if tier C
 * is half the priority of tier B, then C is pushed half as often.
 *
 * @tparam T tier type (e.g. an enum)
 * @tparam V emitted value type (e.g. ItemToUpdate)
 *
 * How to read objects from CSV rows:
 * - https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/datastream/formats/csv/
 *
 * How to read objects from Parquet:
 * - https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/datastream/formats/parquet/
 */
abstract class TierPriorityEmitter[T, V](
val emitRates: mutable.Map[T, Float],
val updateCounts: mutable.Map[T, Long],
val iterators: mutable.Map[T, Iterator[V]]
) extends SourceFunction[V] {
@transient
private val running = new AtomicBoolean(false)
def this(
emitRates: Map[T, Float]
) {
this(
mutable.Map[T, Float]() ++= emitRates,
mutable.Map[T, Long]() ++= emitRates.keys.map(k => (k, 0L)),
mutable.Map.empty[T, Iterator[V]]
)
for ((k, v) <- emitRates) {
val it = getTierIterator(k)
if (it.hasNext) {
iterators.put(k, it)
} else {
updateCounts.remove(k)
this.emitRates.remove(k)
}
}
// rescale rates in case any tiers were dropped, or priorities were given instead of rates
val s = this.emitRates.foldLeft(0.0f)(_ + _._2) // sum of remaining rates
for ((k, v) <- this.emitRates) {
this.emitRates.put(k, v / s)
}
}
/** Return iterator for reading a specific priority tier from source file.
* Called again every pass over each priority tier.
*
* For example, may return an iterator to a specific file or an iterator that
* filters to the specified tier. Must avoid storing the entire file in memory
* as it is prohibitively large. TODO FIXME how to read?
*/
def getTierIterator(tier: T): Iterator[V] = ???
/** Determine which priority tier should be emitted next.
 */
def nextToEmit(): T = {
/*
```python
total_updates = sum(updateCounts.values())
if total_updates == 0:
return max(emitRates.keys(), key=lambda k: emitRates[k])
actualEmitRates = {k: v/total_updates for k, v in updateCounts.items()}
return min(emitRates.keys(), key=lambda k: actualEmitRates[k] - emitRates[k])
```
*/
val totalUpdates = updateCounts.foldLeft(0.0)(_ + _._2) // sum
if (totalUpdates == 0) {
return emitRates.maxBy(_._2)._1
}
val actualEmitRates =
(for ((k, v) <- updateCounts) yield (k, v / totalUpdates)).toMap
return emitRates.minBy(t => actualEmitRates(t._1) - t._2)._1
}
/** Emit values to trigger updates.
 *
 * @param ctx the source context used to emit values
 */
override def run(ctx: SourceFunction.SourceContext[V]): Unit = {
running.set(true)
while (running.get()) {
val tier = nextToEmit()
val it = iterators(tier)
// zero length iterators are removed in the constructor
ctx.collect(it.next())
updateCounts.put(tier, updateCounts(tier) + 1)
if (!it.hasNext) {
iterators.put(tier, getTierIterator(tier))
}
}
}
override def cancel(): Unit = {
running.set(false)
}
}
I have this unit test.
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import org.scalatest.BeforeAndAfter
import org.apache.flink.api.common.time.Time
import org.apache.flink.streaming.util.SourceOperatorTestHarness
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.common.typeinfo.TypeHint
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.operators.SourceOperatorFactory
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.runtime.operators.testutils.MockEnvironmentBuilder
import org.apache.flink.api.connector.source.Source
import org.apache.flink.api.connector.source.SourceReader
import scala.collection.mutable.Queue
import org.apache.flink.test.util.MiniClusterWithClientResource
import org.apache.flink.test.util.MiniClusterResourceConfiguration
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
/** Class for testing abstract TierPriorityEmitter
*/
class StringPriorityEmitter(priorities: Map[Char, Float]) extends TierPriorityEmitter[Char, String](priorities) with Serializable {
override def getTierIterator(tier: Char): Iterator[String] = {
return (for (i <- (1 to 9)) yield tier + i.toString()).iterator
}
}
class TestTierPriorityEmitter
extends AnyFlatSpec
with Matchers with BeforeAndAfter {
val flinkCluster = new MiniClusterWithClientResource(
new MiniClusterResourceConfiguration.Builder()
.setNumberSlotsPerTaskManager(1)
.setNumberTaskManagers(1)
.build
)
before {
flinkCluster.before()
}
after {
flinkCluster.after()
}
"TierPriorityEmitter" should "emit according to priority" in {
// be sure emit rates are floats and sum to 1
val emitter: TierPriorityEmitter[Char, String] = new StringPriorityEmitter(Map('A' -> 3f, 'B' -> 2f, 'C' -> 1f))
// should set emit rates based on our priorities
(emitter.emitRates('A') > emitter.emitRates('B') && emitter.emitRates('B') > emitter.emitRates('C') && emitter.emitRates('C') > 0) shouldBe(true)
(emitter.emitRates('A') + emitter.emitRates('B') + emitter.emitRates('C')) shouldEqual(1.0)
(emitter.emitRates('A') / emitter.emitRates('C')) shouldEqual(3.0)
// should output according to the assigned rates
// TODO use a SourceOperatorTestHarness instead. I couldn't get one to work.
val res = new Queue[String]()
for (_ <- 1 to 15) {
val tier = emitter.nextToEmit()
val it = emitter.iterators(tier)
// zero length iterators are removed in the constructor
res.enqueue(it.next())
emitter.updateCounts.put(tier, emitter.updateCounts(tier) + 1)
if (!it.hasNext) {
emitter.iterators.put(tier, emitter.getTierIterator(tier))
}
}
res shouldEqual(Queue("A1", "B1", "C1", "A2", "B2", "A3", "B3", "A4", "C2", "A5", "B4", "A6", "B5", "A7", "C3"))
}
"TierPriorityEmitter" should "be a source" in {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1) // FIXME not enough resources for 2
val emitter: TierPriorityEmitter[Char, String] = new StringPriorityEmitter(Map('A' -> 3, 'B' -> 2, 'C' -> 1))
// FIXME not serializable
val source = env.addSource(
emitter
)
source.returns(TypeInformation.of(new TypeHint[String]() {}))
val sink = new TestSink[String]("Strings")
source.addSink(sink)
// Execute program, beginning computation.
env.execute("Test TierPriorityEmitter")
sink.getResults should contain allOf ("A1", "B1", "C1", "A2", "B2", "A3", "B3", "A4", "C2", "A5", "B4", "A6", "B5", "A7", "C3")
}
}
There are a couple of obvious problems that I can't figure out how to resolve:
How do I read the input files so that the TierPriorityEmitter works as a proper (serializable) SourceFunction? I know that Iterators probably aren't the right class to use, but I am not sure what to try next. I am not very experienced with Scala. (A rough sketch of one possibility follows below.)
Any optimization advice? Although the mutex adds synchronization overhead, rate balancing should still give the best downstream performance, since later pipeline steps will be slower than the source. Any advice on how to improve this would be appreciated.
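On the first question, here is a minimal sketch (an assumption on my part, not a tested answer) of one way getTierIterator could read one file per tier lazily. The idea is that the source stores only serializable state such as file paths and emit rates; a fresh reader is opened on each pass inside getTierIterator, and anything non-serializable is rebuilt in run() (or open(), if extending RichSourceFunction) or marked @transient. The names tierIteratorFromFile and parseLine are placeholders, not part of the code above.
import scala.io.Source
// Sketch only: one newline-delimited file per tier; `parseLine` stands in for whatever
// turns a raw line into a V (e.g. a CSV row parser). The file is read lazily, line by
// line, so it is never held in memory all at once. In real code the underlying Source
// should also be closed once the iterator is exhausted.
def tierIteratorFromFile[V](path: String, parseLine: String => V): Iterator[V] =
  Source.fromFile(path).getLines().map(parseLine)
A concrete subclass could hold a Map from tier to file path and delegate to a helper like this from getTierIterator, keeping in mind that the base constructor already calls getTierIterator before the subclass's own fields are initialized.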
Related
I need to convert a Float32 into a Chisel FixedPoint, perform some computation and convert back FixedPoint to Float32.
For example, I need the following:
val a = 3.1F
val b = 2.2F
val res = a * b // REPL returns res: Float 6.82
Now, I do this:
import chisel3.experimental.FixedPoint
val fp_tpe = FixedPoint(6.W, 2.BP)
val a_fix = a.Something (fp_tpe) // convert a to FixPoint
val b_fix = b.Something (fp_tpe) // convert b to FixPoint
val res_fix = a_fix * b_fix
val res0 = res_fix.Something (fp_tpe) // convert back to Float
As a result, I'd expect the delta to be within some small epsilon, e.g.
val eps = 1e-4
assert ( abs(res - res0) < eps, "The error is too big")
Who can provide a working example for Chisel3 FixedPoint class for the pseudocode above?
Take a look at the following code:
import chisel3._
import chisel3.core.FixedPoint
import dsptools._
class FPMultiplier extends Module {
val io = IO(new Bundle {
val a = Input(FixedPoint(6.W, binaryPoint = 2.BP))
val b = Input(FixedPoint(6.W, binaryPoint = 2.BP))
val c = Output(FixedPoint(12.W, binaryPoint = 4.BP))
})
io.c := io.a * io.b
}
class FPMultiplierTester(c: FPMultiplier) extends DspTester(c) {
//
// This will PASS, there is sufficient precision to model the inputs
//
poke(c.io.a, 3.25)
poke(c.io.b, 2.5)
step(1)
expect(c.io.c, 8.125)
//
// This will FAIL, there is not sufficient precision to model the inputs
// But this is only caught on output, this is likely the right approach
// because you can't really pass in wrong precision data in hardware.
//
poke(c.io.a, 3.1)
poke(c.io.b, 2.2)
step(1)
expect(c.io.c, 6.82)
}
object FPMultiplierMain {
def main(args: Array[String]): Unit = {
iotesters.Driver.execute(Array("-fiv"), () => new FPMultiplier) { c =>
new FPMultiplierTester(c)
}
}
}
I'd also suggest looking at ParameterizedAdder in dsptools, that gives you a feel of how to write hardware modules that you pass different types. Generally you start with DspReals, confirm the model then start experimenting/calculating with FixedPoint sizes that return results with the desired precision.
For others' benefit, I provide an improved solution based on @Chick's, rewritten in more abstract Scala with variable DSP tolerances.
package my_pkg
import chisel3._
import chisel3.core.{FixedPoint => FP}
import dsptools.{DspTester, DspTesterOptions, DspTesterOptionsManager}
class FPGenericIO (inType:FP, outType:FP) extends Bundle {
val a = Input(inType)
val b = Input(inType)
val c = Output(outType)
}
class FPMul (inType:FP, outType:FP) extends Module {
val io = IO(new FPGenericIO(inType, outType))
io.c := io.a * io.b
}
class FPMulTester(c: FPMul) extends DspTester(c) {
val uut = c.io
// This will PASS, there is sufficient precision to model the inputs
poke(uut.a, 3.25)
poke(uut.b, 2.5)
step(1)
expect(uut.c, 3.25*2.5)
// This will FAIL, if you won't increase tolerance, which is eps = 0.0 by default
poke(uut.a, 3.1)
poke(uut.b, 2.2)
step(1)
expect(uut.c, 3.1*2.2)
}
object FPUMain extends App {
val fpInType = FP(8.W, 4.BP)
val fpOutType = FP(12.W, 6.BP)
// Update default DspTester options and increase tolerance
val opts = new DspTesterOptionsManager {
dspTesterOptions = DspTesterOptions(
fixTolLSBs = 2,
genVerilogTb = false,
isVerbose = true
)
}
dsptools.Driver.execute (() => new FPMul(fpInType, fpOutType), opts) {
c => new FPMulTester(c)
}
}
Here's my ultimate DSP multiplier implementation, which should support both FixedPoint and DspComplex numbers. @ChickMarkley, how do I update this class to implement a complex multiplication?
package my_pkg
import chisel3._
import dsptools.numbers.{Ring,DspComplex}
import dsptools.numbers.implicits._
import dsptools.{DspContext}
import chisel3.core.{FixedPoint => FP}
import dsptools.{DspTester, DspTesterOptions, DspTesterOptionsManager}
class FPGenericIO[A <: Data:Ring, B <: Data:Ring] (inType:A, outType:B) extends Bundle {
val a = Input(inType.cloneType)
val b = Input(inType.cloneType)
val c = Output(outType.cloneType)
override def cloneType = (new FPGenericIO(inType, outType)).asInstanceOf[this.type]
}
class FPMul[A <: Data:Ring, B <: Data:Ring] (inType:A, outType:B) extends Module {
val io = IO(new FPGenericIO(inType, outType))
DspContext.withNumMulPipes(3) {
io.c := io.a * io.b
}
}
class FPMulTester[A <: Data:Ring, B <: Data:Ring](c: FPMul[A,B]) extends DspTester(c) {
val uut = c.io
//
// This will PASS, there is sufficient precision to model the inputs
//
poke(uut.a, 3.25)
poke(uut.b, 2.5)
step(1)
expect(uut.c, 3.25*2.5)
//
// This will FAIL, there is not sufficient precision to model the inputs
// But this is only caught on output, this is likely the right approach
// because you can't really pass in wrong precision data in hardware.
//
poke(uut.a, 3.1)
poke(uut.b, 2.2)
step(1)
expect(uut.c, 3.1*2.2)
}
object FPUMain extends App {
val fpInType = FP(8.W, 4.BP)
val fpOutType = FP(12.W, 6.BP)
//val comp = DspComplex[Double] // How to declare a complex DSP type ?
val opts = new DspTesterOptionsManager {
dspTesterOptions = DspTesterOptions(
fixTolLSBs = 0,
genVerilogTb = false,
isVerbose = true
)
}
dsptools.Driver.execute (() => new FPMul(fpInType, fpOutType), opts) {
//dsptools.Driver.execute (() => new FPMul(comp, comp), opts) { // <-- this won't compile
c => new FPMulTester(c)
}
}
I am using Spark and I would like to train a machine learning model.
Because of bad results, I would like to display the error made by the model at each epoch of training (on both the train and test datasets).
I will then use this information to determine whether my model is underfitting or overfitting the data.
Question: How can I draw the learning curve of a model with Spark?
In the following example, I have implemented my own evaluator and overridden the evaluate method to print the metrics I needed, but only two values were displayed (maxIter = 1000).
MinimalRunnableCode.scala:
import org.apache.spark.SparkConf
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
import org.apache.spark.sql.SparkSession
object Min extends App {
// Open spark session.
val conf = new SparkConf()
.setMaster("local")
.set("spark.network.timeout", "800")
val ss = SparkSession.builder
.config(conf)
.getOrCreate
// Load data.
val data = ss.createDataFrame(ss.sparkContext.parallelize(
List(
(Vectors.dense(1, 2), 1),
(Vectors.dense(1, 3), 2),
(Vectors.dense(1, 2), 1),
(Vectors.dense(1, 3), 2),
(Vectors.dense(1, 2), 1),
(Vectors.dense(1, 3), 2),
(Vectors.dense(1, 2), 1),
(Vectors.dense(1, 3), 2),
(Vectors.dense(1, 2), 1),
(Vectors.dense(1, 3), 2),
(Vectors.dense(1, 4), 3)
)
))
.withColumnRenamed("_1", "features")
.withColumnRenamed("_2", "label")
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)
// Create model of linear regression.
val lr = new LinearRegression().setMaxIter(1000)
// Create parameters grid that will be used to train different version of the linear model.
val paramGrid = new ParamGridBuilder()
.addGrid(lr.regParam, Array(0.001))
.addGrid(lr.fitIntercept)
.addGrid(lr.elasticNetParam, Array(0.5))
.build()
// Create trainer using validation split to evaluate which set of parameters performs the best.
val trainValidationSplit = new TrainValidationSplit()
.setEstimator(lr)
.setEvaluator(new CustomRegressionEvaluator)
.setEstimatorParamMaps(paramGrid)
.setTrainRatio(0.8) // 80% of the data will be used for training and the remaining 20% for validation.
// Run train validation split, and choose the best set of parameters.
var model = trainValidationSplit.fit(training)
// Close spark session.
ss.stop()
}
CustomRegressionEvaluator.scala:
import org.apache.spark.ml.evaluation.{Evaluator, RegressionEvaluator}
import org.apache.spark.ml.param.{Param, ParamMap, Params}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
final class CustomRegressionEvaluator (override val uid: String) extends Evaluator with HasPredictionCol with HasLabelCol with DefaultParamsWritable {
def this() = this(Identifiable.randomUID("regEval"))
def checkNumericType(
schema: StructType,
colName: String,
msg: String = ""): Unit = {
val actualDataType = schema(colName).dataType
val message = if (msg != null && msg.trim.length > 0) " " + msg else ""
require(actualDataType.isInstanceOf[NumericType], s"Column $colName must be of type " +
s"NumericType but was actually of type $actualDataType.$message")
}
def checkColumnTypes(
schema: StructType,
colName: String,
dataTypes: Seq[DataType],
msg: String = ""): Unit = {
val actualDataType = schema(colName).dataType
val message = if (msg != null && msg.trim.length > 0) " " + msg else ""
require(dataTypes.exists(actualDataType.equals),
s"Column $colName must be of type equal to one of the following types: " +
s"${dataTypes.mkString("[", ", ", "]")} but was actually of type $actualDataType.$message")
}
var i = 0 // count the number of times the evaluate method is called
override def evaluate(dataset: Dataset[_]): Double = {
val schema = dataset.schema
checkColumnTypes(schema, $(predictionCol), Seq(DoubleType, FloatType))
checkNumericType(schema, $(labelCol))
val predictionAndLabels = dataset
.select(col($(predictionCol)).cast(DoubleType), col($(labelCol)).cast(DoubleType))
.rdd
.map { case Row(prediction: Double, label: Double) => (prediction, label) }
val metrics = new RegressionMetrics(predictionAndLabels)
val metric = "mae" match {
case "rmse" => metrics.rootMeanSquaredError
case "mse" => metrics.meanSquaredError
case "r2" => metrics.r2
case "mae" => metrics.meanAbsoluteError
}
println(s"$i $metric") // Print the metrics
i = i + 1 // Update counter
metric
}
override def copy(extra: ParamMap): RegressionEvaluator = defaultCopy(extra)
}
object RegressionEvaluator extends DefaultParamsReadable[RegressionEvaluator] {
override def load(path: String): RegressionEvaluator = super.load(path)
}
private[ml] trait HasPredictionCol extends Params {
/**
* Param for prediction column name.
* @group param
*/
final val predictionCol: Param[String] = new Param[String](this, "predictionCol", "prediction column name")
setDefault(predictionCol, "prediction")
/** @group getParam */
final def getPredictionCol: String = $(predictionCol)
}
private[ml] trait HasLabelCol extends Params {
/**
* Param for label column name.
* @group param
*/
final val labelCol: Param[String] = new Param[String](this, "labelCol", "label column name")
setDefault(labelCol, "label")
/** @group getParam */
final def getLabelCol: String = $(labelCol)
}
Here is a possible solution for the specific case of LinearRegression and any other algorithm that supports an objective history (in this case, LinearRegressionTrainingSummary does the job).
Let's first create a minimal, complete, and verifiable example:
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
import org.apache.spark.mllib.util.{LinearDataGenerator, MLUtils}
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder().getOrCreate()
import org.apache.spark.ml.evaluation.RegressionEvaluator
import spark.implicits._
val data = {
val tmp = LinearDataGenerator.generateLinearRDD(
spark.sparkContext,
nexamples = 10000,
nfeatures = 4,
eps = 0.05
).toDF
MLUtils.convertVectorColumnsToML(tmp, "features")
}
As you may have noticed, when you want to generate data for testing purposes with spark-mllib or spark-ml, it is advisable to use the built-in data generators.
Now, let's train a linear regressor:
// Create model of linear regression.
val lr = new LinearRegression().setMaxIter(1000)
// The following line will create two sets of parameters
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.001)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.5)).build()
// Create trainer using validation split to evaluate which set of parameters performs the best.
// I'm using the regular RegressionEvaluator here
val trainValidationSplit = new TrainValidationSplit()
.setEstimator(lr)
.setEvaluator(new RegressionEvaluator)
.setEstimatorParamMaps(paramGrid)
.setTrainRatio(0.8) // 80% of the data will be used for training and the remaining 20% for validation.
// To retrieve subModels, make sure to set collectSubModels to true before fitting.
trainValidationSplit.setCollectSubModels(true)
// Run train validation split, and choose the best set of parameters.
var model = trainValidationSplit.fit(data)
Now that our model is trained, all we need is to get the objective history.
The following part needs a bit of gymnastics between the model and sub-model object parameters.
If you have a Pipeline or similar, this code needs to be modified, so use it carefully. It's just an example:
val objectiveHist = spark.sparkContext.parallelize(
model.subModels.zip(model.getEstimatorParamMaps).map {
case (m: LinearRegressionModel, pm: ParamMap) =>
val history: Array[Double] = m.summary.objectiveHistory
val idx: Seq[Int] = 1 until history.length
// regParam, elasticNetParam, fitIntercept
val parameters = pm.toSeq.map(pair => (pair.param.name, pair.value.toString)) match {
case Seq(x, y, z) => (x._2, y._2, z._2)
}
(parameters._1, parameters._2, parameters._3, idx.zip(history).toMap)
}).toDF("regParam", "elasticNetParam", "fitIntercept", "objectiveHistory")
We can now examine those metrics:
objectiveHist.show(false)
// +--------+---------------+------------+-------------------------------------------------------------------------------------------------------+
// |regParam|elasticNetParam|fitIntercept|objectiveHistory |
// +--------+---------------+------------+-------------------------------------------------------------------------------------------------------+
// |0.001 |0.5 |true |[1 -> 0.4999999999999999, 2 -> 0.4038796441909531, 3 -> 0.02659222058006269, 4 -> 0.026592220340980147]|
// |0.001 |0.5 |false |[1 -> 0.5000637621421942, 2 -> 0.4039303922115196, 3 -> 0.026592220673025396, 4 -> 0.02659222039347222]|
// +--------+---------------+------------+-------------------------------------------------------------------------------------------------------+
You can notice that the training process actually stops after 4 iterations.
If you want just the number of iterations, you can do the following instead:
val objectiveHist2 = spark.sparkContext.parallelize(
model.subModels.zip(model.getEstimatorParamMaps).map {
case (m: LinearRegressionModel, pm: ParamMap) =>
val history: Array[Double] = m.summary.objectiveHistory
// regParam, elasticNetParam, fitIntercept
val parameters = pm.toSeq.map(pair => (pair.param.name, pair.value.toString)) match {
case Seq(x, y, z) => (x._2, y._2, z._2)
}
(parameters._1, parameters._2, parameters._3, history.size)
}).toDF("regParam", "elasticNetParam", "fitIntercept", "iterations")
I've changed the number of features in the generator (nfeatures = 100) for the sake of demonstration:
objectiveHist2.show
// +--------+---------------+------------+----------+
// |regParam|elasticNetParam|fitIntercept|iterations|
// +--------+---------------+------------+----------+
// | 0.001| 0.5| true| 11|
// | 0.001| 0.5| false| 11|
// +--------+---------------+------------+----------+
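As a small aside (not part of the original answer): if you only care about the best model rather than every sub-model, its training summary is directly accessible, assuming the estimator is a LinearRegression as above.
import org.apache.spark.ml.regression.LinearRegressionModel
// The best model chosen by TrainValidationSplit; the cast is safe here because the estimator was a LinearRegression.
val best = model.bestModel.asInstanceOf[LinearRegressionModel]
println(best.summary.objectiveHistory.mkString(", "))
println(s"iterations: ${best.summary.totalIterations}")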
I have a small scenario where I read a text file, calculate averages based on date, and store the summary in a MySQL database.
Following is the code:
val repo_sum = joined_data.map(SensorReport.generateReport)
repo_sum.show() // STEP 1
repo_sum.write.mode(SaveMode.Overwrite).jdbc(url, "sensor_report", prop)
repo_sum.show() // STEP 2
After calculating the average in the repo_sum dataframe, the following is the result at STEP 1:
+----------+------------------+-----+-----+
| date| flo| hz|count|
+----------+------------------+-----+-----+
|2017-10-05|52.887049194476745|10.27| 5.0|
|2017-10-04| 55.4188048943416|10.27| 5.0|
|2017-10-03| 54.1529270444092|10.27| 10.0|
+----------+------------------+-----+-----+
Then the save command is executed, and the dataset values at STEP 2 are:
+----------+-----------------+------------------+-----+
| date| flo| hz|count|
+----------+-----------------+------------------+-----+
|2017-10-05|52.88704919447673|31.578524597238367| 10.0|
|2017-10-04| 55.4188048943416| 32.84440244717079| 10.0|
+----------+-----------------+------------------+-----+
Following is the complete code:
class StreamRead extends Serializable {
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this);
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2))
val sqlContext = new SQLContext(ssc.sparkContext)
import sqlContext.implicits._
val sensorDStream = ssc.textFileStream("file:///C:/Users/M1026352/Desktop/Spark/StreamData").map(Sensor.parseSensor)
val url = "jdbc:mysql://localhost:3306/streamdata"
val prop = new java.util.Properties
prop.setProperty("user", "root")
prop.setProperty("password", "root")
val tweets = sensorDStream.foreachRDD {
rdd =>
if (rdd.count() != 0) {
val databaseVal = sqlContext.read.jdbc("jdbc:mysql://localhost:3306/streamdata", "sensor_report", prop)
val rdd_group = rdd.groupBy { x => x.date }
val repo_data = rdd_group.map { x =>
val sum_flo = x._2.map { x => x.flo }.reduce(_ + _)
val sum_hz = x._2.map { x => x.hz }.reduce(_ + _)
val sum_flo_count = x._2.size
print(sum_flo_count)
SensorReport(x._1, sum_flo, sum_hz, sum_flo_count)
}
val df = repo_data.toDF()
val joined_data = df.join(databaseVal, Seq("date"), "fullouter")
joined_data.show()
val repo_sum = joined_data.map(SensorReport.generateReport)
repo_sum.show()
repo_sum.write.mode(SaveMode.Overwrite).jdbc(url, "sensor_report", prop)
repo_sum.show()
}
}
ssc.start()
WorkerAndTaskExample.main(args)
ssc.awaitTermination()
}
case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double, flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)
object Sensor extends Serializable {
def parseSensor(str: String): Sensor = {
val p = str.split(",")
Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble, p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
}
case class SensorReport(date: String, flo: Double, hz: Double, count: Double)
object SensorReport extends Serializable {
def generateReport(row: Row): SensorReport = {
print(row)
if (row.get(4) == null) {
SensorReport(row.getString(0), row.getDouble(1) / row.getDouble(3), row.getDouble(2) / row.getDouble(3), row.getDouble(3))
} else if (row.get(2) == null) {
SensorReport(row.getString(0), row.getDouble(4), row.getDouble(5), row.getDouble(6))
} else {
val count = row.getDouble(3) + row.getDouble(6)
val flow_avg_update = (row.getDouble(6) * row.getDouble(4) + row.getDouble(1)) / count
val flow_flo_update = (row.getDouble(6) * row.getDouble(5) + row.getDouble(1)) / count
print(count + " : " + flow_avg_update + " : " + flow_flo_update)
SensorReport(row.getString(0), flow_avg_update, flow_flo_update, count)
}
}
}
As far as I understand, when the save command is executed in Spark the whole process runs again. Is my understanding correct? Please let me know.
In Spark all transformations are lazy; nothing happens until an action is called. This also means that if multiple actions are called on the same RDD or dataframe, all computations will be performed multiple times. This includes loading the data and all transformations.
To avoid this, use cache() or persist() (they do the same thing, except that persist() lets you choose a storage level, while cache() uses the default). cache() will keep the RDD/dataframe in memory after the first action runs on it, hence avoiding running the same transformations multiple times.
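A tiny illustration of the distinction (a sketch, not taken from the question's code):
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
val df = Seq(1, 2, 3).toDF("n").filter($"n" > 1)
// persist() lets you pick a storage level explicitly; cache() uses the default level.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count() // first action: the filter runs and the result is materialized in the cache
df.show()  // second action: served from the cache; the filter is not recomputed
df.unpersist()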
In this case, the unexpected behavior is caused by the two actions performed on the dataframe, so caching the dataframe solves the problem:
val repo_sum = joined_data.map(SensorReport.generateReport).cache()
Apologies in advance for the basic question. I am starting to learn Scala with http4s and, in a route handler, I am trying to insert an entry into MongoDB. As far as I can tell, insertOne returns an Observable[Completed].
Any idea how I can wait for the observable to complete before returning the response?
My code is:
class Routes {
val service: HttpService = HttpService {
case r @ GET -> Root / "hello" => {
val mongoClient: MongoClient = MongoClient()
val database: MongoDatabase = mongoClient.getDatabase("scala")
val collection: MongoCollection[Document] = database.getCollection("tests")
val doc: Document = Document("_id" -> 0, "name" -> "MongoDB", "type" -> "database",
"count" -> 1, "info" -> Document("x" -> 203, "y" -> 102))
collection.insertOne(doc)
mongoClient.close()
Ok("Hello.")
}
}
}
class GomadApp(host: String, port: Int) {
private val pool = Executors.newCachedThreadPool()
println(s"Starting server on '$host:$port'")
val routes = new Routes().service
// Add some logging to the service
val service: HttpService = routes.local { req =>
val path = req.uri
val start = System.nanoTime()
val result = req
val time = ((System.nanoTime() - start) / 1000) / 1000.0
println(s"${req.remoteAddr.getOrElse("null")} -> ${req.method}: $path in $time ms")
result
}
// Construct the blaze pipeline.
def build(): ServerBuilder =
BlazeBuilder
.bindHttp(port, host)
.mountService(service)
.withServiceExecutor(pool)
}
object GomadApp extends ServerApp {
val ip = "127.0.0.1"
val port = envOrNone("HTTP_PORT") map (_.toInt) getOrElse (8787)
override def server(args: List[String]): Task[Server] =
new GomadApp(ip, port)
.build()
.start
}
I'd recommend https://github.com/haghard/mongo-query-streams - although you'll have to fork it and bump the dependencies a bit, since scalaz 7.1 and 7.2 aren't binary-compatible.
The less-streamy (and less referentially correct) way: https://github.com/Verizon/delorean
collection.insertOne(doc).toFuture().toTask.flatMap({res => Ok("Hello")})
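For illustration, a rough sketch (my assumption, not tested) of the question's route rewritten this way, assuming the scalaz-Task-based http4s used in the question and that delorean's syntax import provides .toTask on Futures:
import org.http4s._
import org.http4s.dsl._
import org.mongodb.scala._
import delorean._ // assumption: provides Future#toTask
val mongoClient: MongoClient = MongoClient()
val collection: MongoCollection[Document] =
  mongoClient.getDatabase("scala").getCollection("tests")
val service: HttpService = HttpService {
  case GET -> Root / "hello" =>
    val doc = Document("name" -> "MongoDB", "type" -> "database")
    // The response is produced only after the insert's Future completes.
    collection.insertOne(doc).toFuture().toTask.flatMap(_ => Ok("Hello."))
}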
The latter solution looks easier, but it has some hidden pitfalls. See https://www.reddit.com/r/scala/comments/3zofjl/why_is_future_totally_unusable/
This tweet made me wonder: https://twitter.com/timperrett/status/684584581048233984
Do you consider Futures "totally unusable" or is this just hyperbole? I've never had a major problem, but I'm willing to be enlightened. Doesn't the following code make Futures effectively "lazy"? def myFuture = Future { 42 }
And, finally, I've also heard rumblings that scalaz's Tasks have some failings as well, but I haven't found much on it. Anybody have more details?
Answer:
The fundamental problem is that constructing a Future with a side-effecting expression is itself a side-effect. You can only reason about Future for pure computations, which unfortunately is not how they are commonly used. Here is a demonstration of this operation breaking referential transparency:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Random
val f1 = {
val r = new Random(0L)
val x = Future(r.nextInt)
for {
a <- x
b <- x
} yield (a, b)
}
// Same as f1, but I inlined `x`
val f2 = {
val r = new Random(0L)
for {
a <- Future(r.nextInt)
b <- Future(r.nextInt)
} yield (a, b)
}
f1.onComplete(println) // Success((-1155484576,-1155484576))
f2.onComplete(println) // Success((-1155484576,-723955400)) <-- not the same
However this works fine with Task. Note that the interesting one is the non-inlined version, which manages to produce two distinct Int values. This is the important bit: Task has a constructor that captures side-effects as values, and Future does not.
import scalaz.concurrent.Task
val task1 = {
val r = new Random(0L)
val x = Task.delay(r.nextInt)
for {
a <- x
b <- x
} yield (a, b)
}
// Same as task1, but I inlined `x`
val task2 = {
val r = new Random(0L)
for {
a <- Task.delay(r.nextInt)
b <- Task.delay(r.nextInt)
} yield (a, b)
}
println(task1.run) // (-1155484576,-723955400)
println(task2.run) // (-1155484576,-723955400)
Most of the commonly-cited differences like "a Task doesn't run until you ask it to" and "you can compose the same Task over and over" trace back to this fundamental distinction.
So the reason it's "totally unusable" is that once you're used to programming with pure values and relying on equational reasoning to understand and manipulate programs it's hard to go back to side-effecty world where things are much harder to understand.
I want to read multiple big files using Akka Streams and process each line. Imagine that each line consists of an (identifier -> value) pair. If a new identifier is found, I want to save it and its value in the database; otherwise, if the identifier has already been found while processing the stream of lines, I want to save only the value. For that, I think I need some kind of recursive stateful flow in order to keep the identifiers that have already been found in a Map. I think I'd receive in this flow a pair of (newLine, contextWithIdentifiers).
I've just started to look into Akka Streams. I guess I can manage the stateless processing on my own, but I have no clue about how to keep the contextWithIdentifiers. I'd appreciate any pointers in the right direction.
Maybe something like statefulMapConcat can help you:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import scala.util.Random._
import scala.math.abs
import scala.concurrent.ExecutionContext.Implicits.global
implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()
//encapsulating your input
case class IdentValue(id: Int, value: String)
//some random generated input
val identValues = List.fill(20)(IdentValue(abs(nextInt()) % 5, "valueHere"))
val stateFlow = Flow[IdentValue].statefulMapConcat{ () =>
//state with already processed ids
var ids = Set.empty[Int]
identValue => if (ids.contains(identValue.id)) {
//save value to DB
println(identValue.value)
List(identValue)
} else {
//save both to database
println(identValue)
ids = ids + identValue.id
List(identValue)
}
}
Source(identValues)
.via(stateFlow)
.runWith(Sink.seq)
.onSuccess { case identValue => println(identValue) }
A few years later, here is an implementation I wrote if you only need a 1-to-1 mapping (not 1-to-N):
import akka.stream.stage.{GraphStage, GraphStageLogic}
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
object StatefulMap {
def apply[T, O](converter: => T => O) = new StatefulMap[T, O](converter)
}
class StatefulMap[T, O](converter: => T => O) extends GraphStage[FlowShape[T, O]] {
val in = Inlet[T]("StatefulMap.in")
val out = Outlet[O]("StatefulMap.out")
val shape = FlowShape.of(in, out)
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new GraphStageLogic(shape) {
val f = converter
setHandler(in, () => push(out, f(grab(in))))
setHandler(out, () => pull(in))
}
}
Test (and demo):
behavior of "StatefulMap"
class Counter extends (Any => Int) {
var count = 0
override def apply(x: Any): Int = {
count += 1
count
}
}
it should "not share state among substreams" in {
val result = await {
Source(0 until 10)
.groupBy(2, _ % 2)
.via(StatefulMap(new Counter()))
.fold(Seq.empty[Int])(_ :+ _)
.mergeSubstreams
.runWith(Sink.seq)
}
result.foreach(_ should be(1 to 5))
}