How to use feature extraction with DStream in Apache Spark - scala

I have data that arrives from Kafka through a DStream. I want to perform feature extraction on it in order to obtain some keywords.
I do not want to wait for all of the data to arrive (it is intended to be a continuous stream that potentially never ends), so I hope to perform the extraction in chunks. It doesn't matter to me if the accuracy suffers a bit.
So far I have put together something like this:
def extractKeywords(stream: DStream[Data]): Unit = {
val spark: SparkSession = SparkSession.builder.getOrCreate
val streamWithWords: DStream[(Data, Seq[String])] = stream map extractWordsFromData
val streamWithFeatures: DStream[(Data, Array[String])] = streamWithWords transform extractFeatures(spark) _
val streamWithKeywords: DStream[DataWithKeywords] = streamWithFeatures map addKeywordsToData
streamWithFeatures.print()
}
def extractFeatures(spark: SparkSession)
(rdd: RDD[(Data, Seq[String])]): RDD[(Data, Array[String])] = {
val df = spark.createDataFrame(rdd).toDF("data", "words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(numOfFeatures)
val rawFeatures = hashingTF.transform(df)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(rawFeatures)
val rescaledData = idfModel.transform(rawFeatures)
import spark.implicits._
rescaledData.select("data", "features").as[(Data, Array[String])].rdd
}
However, I received java.lang.IllegalStateException: Haven't seen any document yet. I am not surprised, as I am just trying to scrape things together, and I understand that since I am not waiting for any data to arrive, the generated model might be empty when I try to use it on the data.
What would be the right approach for this problem?

I followed the advice from the comments and split the procedure into two runs:
One that computes the IDF model and saves it to a file:
def trainFeatures(idfModelFile: File, rdd: RDD[(String, Seq[String])]) = {
val session: SparkSession = SparkSession.builder.getOrCreate
val wordsDf = session.createDataFrame(rdd).toDF("data", "words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedDf)
idfModel.write.save(idfModelFile.getAbsolutePath)
}
One that reads the IDF model from the file and simply runs it on all incoming data:
val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)
val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("words")
val wordsDf = tokenizer.transform(documentDf)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val extractor = idfModel.setInputCol("rawFeatures").setOutputCol("features")
val featuresDf = extractor.transform(featurizedDf)
featuresDf.select("update", "features")
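As a minimal sketch of how the second run could be attached to the stream, the following wraps the scoring steps in a helper and calls it from foreachRDD, skipping empty batches. The object name StreamingTfIdf, the method names and the (update, document) pair layout are assumptions for illustration, not part of the original code. It also assumes HashingTF uses the same numFeatures in both runs (here both rely on the default).

```scala
import org.apache.spark.ml.feature.{HashingTF, IDFModel, Tokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper object; names are illustrative only.
object StreamingTfIdf {

  // Scores one micro-batch with an already-fitted IDF model.
  // Expects columns "update" (an id) and "document" (raw text).
  def scoreBatch(docs: DataFrame, idfModel: IDFModel): DataFrame = {
    val words = new Tokenizer()
      .setInputCol("document").setOutputCol("words")
      .transform(docs)
    val tf = new HashingTF()
      .setInputCol("words").setOutputCol("rawFeatures")
      .transform(words)
    idfModel
      .setInputCol("rawFeatures").setOutputCol("features")
      .transform(tf)
      .select("update", "features")
  }

  // Loads the saved model once and applies it to every non-empty batch.
  def attach(stream: DStream[(String, String)], idfModelPath: String): Unit = {
    val idfModel = IDFModel.load(idfModelPath)
    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) { // nothing to score in an empty batch
        val spark = SparkSession.builder.getOrCreate()
        import spark.implicits._
        scoreBatch(rdd.toDF("update", "document"), idfModel)
          .show(5, truncate = false)
      }
    }
  }
}
```

Because the model is fitted offline, no batch ever calls fit, so an empty batch can no longer trigger the "Haven't seen any document yet" error.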

Related

How to use scala SparkML model to do multiple predictions in loop

I just started learning Spark 3.3 for Scala to do some regressions.
I was able to create, fit and test a model, but I got stuck trying to predict different subsets of a dataframe by looping through one of its columns to filter the data.
This is my main function (I'm testing everything here).
The last line is what I'm trying to achieve, but I'm getting a "Task not serializable" error:
def test() = {
val data_key = "./data/EAD_HILIC_PFP_Com.csv"
val df = spark.read
.option("multiLine", true)
.option("header", "true")
.option("inferSchema", "true")
.csv(data_key)
val df2 = df.withColumn("PRECURSORMZ", $"PRECURSORMZ".cast("double").as("PRECURSORMZ"))
// to get list of samples from data
val labelDF = df2.select("mix_label").distinct
val (lrModel, test) = createModel(df2, labelDF)
println(s"Coefficients: ${lrModel.coefficients}")
println(s"Intercept: ${lrModel.intercept}")
println(s"Root Mean Squared Error (RMSE) = ${lrModel.summary.rootMeanSquaredError}")
println(s"R^2 = ${lrModel.summary.r2}")
val predictions = lrModel.transform(test)
val rmse = new RegressionEvaluator()
.setLabelCol("PRECURSORMZ")
.setPredictionCol("prediction")
.setMetricName("rmse")
val r2 = new RegressionEvaluator()
.setLabelCol("PRECURSORMZ")
.setPredictionCol("prediction")
.setMetricName("r2")
println(s"Root Mean Squared Error (RMSE) on test data (${labelDF.head.get(0)}) = " + rmse.evaluate(predictions))
println(s"R^2 on test data (${labelDF.head.get(0)}) = " + r2.evaluate(predictions))
labelDF.foreach(label => process(df2, label, lrModel, rmse, r2))
}
The createModel function does what it says: it creates and fits a linear regression model:
def createModel(df2: DataFrame, labelDF: DataFrame): (LinearRegressionModel, DataFrame) = {
val first = labelDF.head.get(0)
//istds for single sample (mix_label)
val istdsDF = df2
.filter('mix_label === first)
.select($"PRECURSORMZ", $"Average_mz")
df2.show()
istdsDF.show()
val assembler = new VectorAssembler()
.setInputCols(istdsDF.drop("msms", "mix_label").columns)
.setOutputCol("features")
val (train, test) = train_test_split(istdsDF, assembler)
val lr = new LinearRegression()
.setLabelCol("PRECURSORMZ")
.setFeaturesCol("features")
val lrModel = lr.fit(train)
(lrModel, test)
}
def train_test_split(data: DataFrame, assembler: VectorAssembler): (DataFrame, DataFrame) = {
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 30)
(assembler.transform(train), assembler.transform(test))
}
Thanks for any help
EDIT 1: adding process function:
def process(otherDF: DataFrame, otherLabel: String, lrModel: LinearRegressionModel): Unit = {
val assembler = new VectorAssembler()
.setInputCols(otherDF.drop("msms", "mix_label").columns)
.setOutputCol("features")
val others = assembler.transform(otherDF)
others.show()
val rmse = new RegressionEvaluator()
.setLabelCol("PRECURSORMZ")
.setPredictionCol("prediction")
.setMetricName("rmse")
val r2 = new RegressionEvaluator()
.setLabelCol("PRECURSORMZ")
.setPredictionCol("prediction")
.setMetricName("r2")
val otherPreds = lrModel.transform(others)
println(s"Root Mean Squared Error (RMSE) on other data (with label '${otherLabel}') = " + rmse.evaluate(otherPreds))
println(s"R^2 on other data (with label '${otherLabel}') = " + r2.evaluate(otherPreds))
}
StackTrace here: https://pastebin.com/xhUpqvnx
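A likely cause, stated as an assumption since the stack trace is not reproduced here: labelDF.foreach runs its closure on the executors, and that closure captures df2, the model and the evaluators, none of which can be used inside a task. A minimal sketch of a driver-side loop follows. The helper name forEachLabel is hypothetical, and it assumes the number of distinct labels is small enough to collect.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: calls `body` once per distinct non-null value of
// `labelCol`, on the driver, passing the rows that carry that value.
def forEachLabel(df: DataFrame, labelCol: String)(body: (String, DataFrame) => Unit): Unit = {
  val labels = df
    .select(labelCol)
    .where(col(labelCol).isNotNull)
    .distinct()
    .collect()               // small result: one row per label
    .map(_.get(0).toString)

  // A plain Scala loop, so DataFrame and model calls inside `body` are legal.
  labels.foreach { label =>
    body(label, df.filter(col(labelCol) === label))
  }
}
```

With the names from the question, a call could look like forEachLabel(df2, "mix_label") { (label, subset) => process(subset, label, lrModel) }.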

Output is not showing, spark scala

The output shows the schema, but the output of the SQL query is not visible. I don't understand what I am doing wrong.
object ex_1 {
def parseLine(line:String): (String, String, Int, Int) = {
val fields = line.split(" ")
val project_code = fields(0)
val project_title = fields(1)
val page_hits = fields(2).toInt
val page_size = fields(3).toInt
(project_code, project_title, page_hits, page_size)
}
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR)
val sc = new SparkContext("local[*]", "Weblogs")
val lines = sc.textFile("F:/Downloads_F/pagecounts.out")
val parsedLines = lines.map(parseLine)
println("hello")
val spark = SparkSession
.builder
.master("local")
.getOrCreate
import spark.implicits._
val RDD1 = parsedLines.toDF("project","page","pagehits","pagesize")
RDD1.printSchema()
RDD1.createOrReplaceTempView("logs")
val min1 = spark.sql("SELECT * FROM logs WHERE pagesize >= 4733")
val results = min1.collect()
results.foreach(println)
println("bye")
spark.stop()
}
}
As confirmed in the comments, using the show method displays the result of spark.sql(..).
Since spark.sql returns a DataFrame, calling show is the ideal way to display the data. Calling collect, as you did previously, is not advised:
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
..
..
val min1 = spark.sql("SELECT * FROM logs WHERE pagesize >= 4733")
// where `false` prevents the output from being truncated.
min1.show(false)
println("bye")
spark.stop()
Even if your DataFrame is empty, you will still see a table output including the column names (i.e., the schema), whereas .collect() followed by println would print nothing in this scenario.

Use dataframes for Decision tree classifier in spark with string fields

I have managed to get my Decision Tree classifier to work with the RDD-based API, but now I am trying to switch to the DataFrame-based API in Spark.
I have a dataset like this (but with many more fields):
country, destination, duration, label
Belgium, France, 10, 0
Bosnia, USA, 120, 1
Germany, Spain, 30, 0
First I load my csv file in a dataframe :
val data = session.read
.format("org.apache.spark.csv")
.option("header", "true")
.csv("/home/Datasets/data/dataset.csv")
Then I transform string columns into numerical columns
val stringColumns = Array("country", "destination")
val index_transformers = stringColumns.map(
cname => new StringIndexer()
.setInputCol(cname)
.setOutputCol(s"${cname}_index")
)
Then I assemble all my features into one single vector, using VectorAssembler, like this :
val assembler = new VectorAssembler()
.setInputCols(Array("country_index", "destination_index", "duration_index"))
.setOutputCol("features")
I split my data into training and test :
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
Then I create my DecisionTree Classifier
val dt = new DecisionTreeClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
Then I use a pipeline to make all the transformations
val pipeline = new Pipeline()
.setStages(Array(index_transformers, assembler, dt))
I train my model and use it for predictions :
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)
But I get some errors I don't understand.
When I run my code like that, I get this error:
[error] found : Array[org.apache.spark.ml.feature.StringIndexer]
[error] required: org.apache.spark.ml.PipelineStage
[error] .setStages(Array(index_transformers, assembler,dt))
So what I did is that I added a pipeline right after the index_transformers val, and right before val assembler :
val index_pipeline = new Pipeline().setStages(index_transformers)
val index_model = index_pipeline.fit(data)
val df_indexed = index_model.transform(data)
I then use my new df_indexed dataframe for the training and test sets, and I removed index_transformers from my pipeline, which now contains only assembler and dt:
val Array(trainingData, testData) = df_indexed.randomSplit(Array(0.7, 0.3))
val pipeline = new Pipeline()
.setStages(Array(assembler,dt))
And I get this error :
Exception in thread "main" java.lang.IllegalArgumentException: Data type StringType is not supported.
It basically says I am using VectorAssembler on a String column, whereas I told it to use df_indexed, which now has numerical _index columns. It doesn't seem to use them in VectorAssembler, and I just don't understand why.
Thank you
EDIT
Now I have almost managed to get it to work:
val data = session.read
.format("org.apache.spark.csv")
.option("header", "true")
.csv("/home/hvfd8529/Datasets/dataOINIS/dataset.csv")
val stringColumns = Array("country_index", "destination_index", "duration_index")
val stringColumns_index = stringColumns.map(c => s"${c}_index")
val index_transformers = stringColumns.map(
cname => new StringIndexer()
.setInputCol(cname)
.setOutputCol(s"${cname}_index")
)
val assembler = new VectorAssembler()
.setInputCols(stringColumns_index)
.setOutputCol("features")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a DecisionTree model.
val dt = new DecisionTreeClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("features")
.setImpurity("entropy")
.setMaxBins(1000)
.setMaxDepth(15)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels())
val stages = index_transformers :+ assembler :+ labelIndexer :+ dt :+ labelConverter
val pipeline = new Pipeline()
.setStages(stages)
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("predictedLabel", "label", "indexedFeatures").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("accuracy = " + accuracy)
val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
println("Learned classification tree model:\n" + treeModel.toDebugString)
except that now I get this error:
value labels is not a member of org.apache.spark.ml.feature.StringIndexer
I don't understand why, as I am following the examples in the Spark documentation.
Should be:
val pipeline = new Pipeline()
.setStages(index_transformers ++ Array(assembler, dt): Array[PipelineStage])
What I did for my first problem :
val stages = index_transformers :+ assembler :+ labelIndexer :+ rf :+ labelConverter
val pipeline = new Pipeline()
.setStages(stages)
For my second issue with the label, I needed to call .fit(data), because labels is a member of the fitted StringIndexerModel rather than of StringIndexer itself:
val labelIndexer = new StringIndexer()
.setInputCol("label_fraude")
.setOutputCol("indexedLabel")
.fit(data)
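Putting both fixes together, a minimal sketch of the whole pipeline might look like this. The object name TreePipeline is hypothetical. The sketch assumes the column names from the sample dataset in the question, and that duration is already numeric (for example read with inferSchema or cast beforehand), since VectorAssembler does not accept string columns.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Hypothetical wrapper; column names follow the sample dataset above.
object TreePipeline {
  def build(data: DataFrame): Pipeline = {
    val stringColumns = Array("country", "destination")

    // One (unfitted) indexer per string feature column.
    val indexers = stringColumns.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(s"${c}_index")
    }

    // Fitted up front so that `labels` is available for IndexToString.
    val labelIndexer = new StringIndexer()
      .setInputCol("label").setOutputCol("indexedLabel")
      .fit(data)

    // Indexed string columns plus the numeric `duration` column.
    val assembler = new VectorAssembler()
      .setInputCols(stringColumns.map(c => s"${c}_index") :+ "duration")
      .setOutputCol("features")

    val dt = new DecisionTreeClassifier()
      .setLabelCol("indexedLabel").setFeaturesCol("features")

    val labelConverter = new IndexToString()
      .setInputCol("prediction").setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    // Widen Array[StringIndexer] to Array[PipelineStage] when concatenating.
    val stages: Array[PipelineStage] =
      indexers ++ Array[PipelineStage](assembler, labelIndexer, dt, labelConverter)
    new Pipeline().setStages(stages)
  }
}
```

Usage would then be TreePipeline.build(data).fit(trainingData).transform(testData).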

Can I convert an incoming stream of data into an array?

I'm trying to learn streaming data and manipulating it with the telecom churn dataset provided here. I've written a method to calculate this in batch:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD, LogisticRegressionWithLBFGS, LogisticRegressionModel, NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
object batchChurn{
def main(args: Array[String]): Unit = {
//setting spark context
val conf = new SparkConf().setAppName("churn")
val sc = new SparkContext(conf)
//loading and mapping data into RDD
val csv = sc.textFile("file://filename.csv")
val data = csv.map {line =>
val parts = line.split(",").map(_.trim)
val stringvec = Array(parts(1)) ++ parts.slice(4,20)
val label = parts(20).toDouble
val vec = stringvec.map(_.toDouble)
LabeledPoint(label, Vectors.dense(vec))
}
val splits = data.randomSplit(Array(0.7,0.3))
val (training, testing) = (splits(0),splits(1))
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 6
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 7
val maxBins = 32
val model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo,numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
val labelAndPreds = testing.map {point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
}
}
I've had no problems with this. Now, I looked at the NetworkWordCount example provided on the spark website, and changed the code slightly to see how it would behave.
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("127.0.0.1", 9999)
val data = lines.flatMap(_.split(","))
My question is: is it possible to convert this DStream to an array which I can input into my analysis code? Currently, when I try to convert it to an Array using val data = lines.flatMap(_.split(",")), it says: error: value toArray is not a member of org.apache.spark.streaming.dstream.DStream[String]
Your DStream contains many RDDs. You can get access to the RDDs using the foreachRDD function:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/dstream/DStream.html#foreachRDD(scala.Function1)
Each RDD can then be converted to an array using the collect function.
This has already been shown here:
For each RDD in a DStream how do I convert this to an array or some other typical Java data type?
DStream.foreachRDD gives you an RDD[String] for each interval. Of course, you could collect these into an array:
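import scala.collection.mutable.ArrayBuffer
// Driver-side buffer; it grows with every batch interval.
val arr = new ArrayBuffer[String]()
data.foreachRDD { rdd =>
arr ++= rdd.collect()
}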
Also keep in mind you could end up having way more data than you want in your driver since a DStream can be huge.
To limit the data for your analysis, I would do it this way:
data.slice(new Time(fromMillis), new Time(toMillis)).flatMap(_.collect()).toSet
You cannot put all the elements of a DStream in an array because those elements will keep being read over the wire, and your array would have to be indefinitely extensible.
Adapting this decision tree model to a streaming mode, where training and testing data arrive continuously, is not trivial for algorithmic reasons. While the answers mentioning collect are technically correct, they are not the appropriate solution to what you're trying to do.
If you want to run decision trees on a Stream in Spark, you may want to look at Hoeffding trees.

Apache Spark - Reduce Step Output (K, (V1, V2, V3, ...))

I have a chronological sequence of events (T1,K1,V1),(T2,K2,V3),(T3,K1,V2),(T4,K2,V4),(T5,K1,V5).
Both keys and values are strings.
I am trying to achieve the following using Spark
K1,(V1,V2,V5)
K2,(V3,V4)
This is what I tried
val inputFile = args(0)
val outputFile = args(1)
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
val rdd1 = sc.textFile(inputFile, 2).cache()
val rdd2= rdd1.map {
line =>
val fields = line.split(" ")
val key = fields(1)
val v = fields(2)
(key, v)
}
// TODO : rdd2.reduce to get the output I want
rdd2.saveAsTextFile(outputFile)
Could someone please point me towards how to get the reducer to produce the output I want? Many thanks in advance.
You just need to group your rdd by key to achieve the desired output: rdd2.groupByKey
This small spark-shell session illustrates the usage:
val events = List(("t1","k1","v1"), ("t2","k2","v3"), ("t3","k1","v2"), ("t4","k2","v4"), ("t5","k1","v5"))
val rdd = sc.parallelize(events)
val kv = rdd.map{case (t,k,v) => (k,v)}
val grouped = kv.groupByKey
// show the collection ('collect' used here only to show the contents)
grouped.collect
res0: Array[(String, Iterable[String])] = Array((k1,ArrayBuffer(v1, v2, v5)), (k2,ArrayBuffer(v3, v4)))
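One caveat: the Spark API documentation for groupByKey notes that the order of elements within each group is not guaranteed. If the chronological order matters, a possible sketch is to carry the timestamp along and sort inside each group before formatting. The object and method names here are hypothetical, and the sketch assumes each line has the form "T K V" with a numeric timestamp T.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper; assumes lines like "3 K1 V2" with a numeric timestamp.
object GroupEvents {
  def groupChronologically(lines: RDD[String]): RDD[String] =
    lines
      .map(_.trim.split("\\s+"))
      // Keep only well-formed lines; key by K and keep (T, V) as the value.
      .collect { case Array(t, k, v) => (k, (t.toLong, v)) }
      .groupByKey()
      // Restore chronological order inside each group, then drop T.
      .mapValues(_.toSeq.sortBy(_._1).map(_._2))
      // Format as "K1,(V1,V2,V5)" so it can be passed to saveAsTextFile.
      .map { case (k, vs) => s"$k,(${vs.mkString(",")})" }
}
```

In the original program this could replace the TODO, for example GroupEvents.groupChronologically(rdd1).saveAsTextFile(outputFile).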