Apache Spark - Reduce Step Output (K, (V1, V2, V3, ...)) - Scala

I have a chronological sequence of events (T1,K1,V1),(T2,K2,V3),(T3,K1,V2),(T4,K2,V4),(T5,K1,V5).
Both keys and values are strings.
I am trying to achieve the following using Spark:
K1,(V1,V2,V5)
K2,(V3,V4)
This is what I tried:
val inputFile = args(0)
val outputFile = args(1)
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
val rdd1 = sc.textFile(inputFile, 2).cache()
val rdd2 = rdd1.map { line =>
  val fields = line.split(" ")
  val key = fields(1)
  val v = fields(2)
  (key, v)
}
// TODO : rdd2.reduce to get the output I want
rdd2.saveAsTextFile(outputFile)
Could someone please point me towards how to get the reducer to produce the output I want? Many thanks in advance.

You just need to group your rdd by key to achieve the desired output: rdd2.groupByKey
This small spark-shell session illustrates the usage:
val events = List(("t1","k1","v1"), ("t2","k2","v3"), ("t3","k1","v2"), ("t4","k2","v4"), ("t5","k1","v5"))
val rdd = sc.parallelize(events)
val kv = rdd.map{case (t,k,v) => (k,v)}
val grouped = kv.groupByKey
// show the collection ('collect' used here only to show the contents)
grouped.collect
res0: Array[(String, Iterable[String])] = Array((k1,ArrayBuffer(v1, v2, v5)), (k2,ArrayBuffer(v3, v4)))
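If you also want to write the grouped result back to a text file in roughly the K,(V1,V2,...) shape from the question, you can format each group before saving. A minimal sketch continuing from the rdd2 defined in the question (the exact output format string is just one possible choice):

val grouped = rdd2.groupByKey()
// render each group as "K,(V1,V2,...)" before writing it out
val formatted = grouped.map { case (k, vs) => s"$k,(${vs.mkString(",")})" }
formatted.saveAsTextFile(outputFile)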

Output is not showing, spark scala

The output shows the schema, but the output of the SQL query is not visible. I don't understand what I am doing wrong.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object ex_1 {
  def parseLine(line: String): (String, String, Int, Int) = {
    val fields = line.split(" ")
    val project_code = fields(0)
    val project_title = fields(1)
    val page_hits = fields(2).toInt
    val page_size = fields(3).toInt
    (project_code, project_title, page_hits, page_size)
  }

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val sc = new SparkContext("local[*]", "Weblogs")
    val lines = sc.textFile("F:/Downloads_F/pagecounts.out")
    val parsedLines = lines.map(parseLine)
    println("hello")
    val spark = SparkSession
      .builder
      .master("local")
      .getOrCreate
    import spark.implicits._
    val RDD1 = parsedLines.toDF("project", "page", "pagehits", "pagesize")
    RDD1.printSchema()
    RDD1.createOrReplaceTempView("logs")
    val min1 = spark.sql("SELECT * FROM logs WHERE pagesize >= 4733")
    val results = min1.collect()
    results.foreach(println)
    println("bye")
    spark.stop()
  }
}
As confirmed in the comments, using the show method displays the result of spark.sql(..).
Since spark.sql returns a DataFrame, calling show is the ideal way to display the data. Where you were previously calling collect is not advised:
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
..
..
val min1 = spark.sql("SELECT * FROM logs WHERE pagesize >= 4733")
// where `false` prevents the output from being truncated.
min1.show(false)
println("bye")
spark.stop()
Even if your DataFrame is empty you will still see a table output including the column names (i.e., the schema); whereas .collect() and println would print nothing in this scenario.
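If you do need some rows on the driver (for a test or a log message, say), a bounded alternative to collect is take, which only transfers the requested number of rows. A small sketch, not required for the fix above:

// pull at most 20 rows to the driver instead of the whole result set
min1.take(20).foreach(println)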

Scala and Spark - Logistic Regression - NullPointerException

Hello everyone!
I have a problem with my Scala and Spark code. I am trying to implement a logistic regression model. For this I had to implement two UDF functions to collect my features. The problem is that every time I call the dataframe.show() function, I get an error.
I thought that maybe I had null values in my dataframe, so I tried calling dataframe.na.drop() in order to eliminate possible null values.
The problem still exists, and the error says: failed to execute user defined function(anonfun$3: (array, array) => int).
Here is my whole code:
val sc = spark.sparkContext
val data = sc.textFile("resources/data/training_set.txt").map(line => {
  val fields = line.split(" ")
  (fields(0), fields(1), fields(2).toInt)
})
import spark.implicits._
val trainingDF = data.toDF("srcId", "dstId", "label")
val infoRDD = spark.read.option("header", "false").option("inferSchema", "true").format("csv").load("resources/data/node_information.csv")
val infoDF = infoRDD.toDF("srcId", "year", "title", "authors", "jurnal", "abstract")
println("Showing linksDF sample...")
trainingDF.show(5)
println("Rows of linksDF: ", trainingDF.count())
println("Showing infoDF sample...")
infoDF.show(2)
println("Rows of infoDF: ", infoDF.count())
println("Joining linksDF and infoDF...")
var joinedDF = trainingDF.as("a").join(infoDF.as("b"), $"a.srcId" === $"b.srcId")
println(joinedDF.count())
joinedDF = joinedDF.select($"a.srcId", $"a.dstId", $"a.label", $"b.year", $"b.title", $"b.authors", $"b.jurnal", $"b.abstract")
println("Renaming joinedDF...")
joinedDF = joinedDF
  .withColumnRenamed("srcId", "id_from")
  .withColumnRenamed("dstId", "id_to")
  .withColumnRenamed("year", "year_from")
  .withColumnRenamed("title", "title_from")
  .withColumnRenamed("authors", "authors_from")
  .withColumnRenamed("jurnal", "jurnal_from")
  .withColumnRenamed("abstract", "abstract_from")
var infoDfRenamed = joinedDF
  .withColumnRenamed("id_from", "id_from")
  .withColumnRenamed("id_to", "id_to")
  .withColumnRenamed("year_from", "year_to")
  .withColumnRenamed("title_from", "title_to")
  .withColumnRenamed("authors_from", "authors_to")
  .withColumnRenamed("jurnal_from", "jurnal_to")
  .withColumnRenamed("abstract_from", "abstract_to")
  .select("id_to", "year_to", "title_to", "authors_to", "jurnal_to", "jurnal_to")
var finalDF = joinedDF.as("a").join(infoDF.as("b"), $"a.id_to" === $"b.srcId")
finalDF = finalDF
  .withColumnRenamed("year", "year_to")
  .withColumnRenamed("title", "title_to")
  .withColumnRenamed("authors", "authors_to")
  .withColumnRenamed("jurnal", "jurnal_to")
  .withColumnRenamed("abstract", "abstract_to")
println("Dropping unused columns from joinedDF...")
finalDF = finalDF.drop("srcId")
println("Splitting title_from column into words...")
finalDF = finalDF.withColumn("title_from_words", functions.split(col("title_from"), "\\s+"))
println("Splitting title_to column into words...")
finalDF = finalDF.withColumn("title_to_words", functions.split(col("title_to"), "\\s+"))
println("Splitting authors_from column into words...")
finalDF = finalDF.withColumn("authors_from_words", functions.split(col("authors_from"), "\\s+"))
println("Splitting authors_to column into words...")
finalDF = finalDF.withColumn("authors_to_words", functions.split(col("authors_to"), "\\s+"))
println("Removing stopwords from title_from column...")
val remover = new StopWordsRemover().setInputCol("title_from_words").setOutputCol("title_from_words_f")
finalDF = remover.transform(finalDF)
println("Removing stopwords from title_to column...")
val remover2 = new StopWordsRemover().setInputCol("title_to_words").setOutputCol("title_to_words_f")
finalDF = remover2.transform(finalDF)
println("Removing stopwords from authors_from column...")
val remover3 = new StopWordsRemover().setInputCol("authors_from_words").setOutputCol("authors_from_words_f")
finalDF = remover3.transform(finalDF)
println("Removing stopwords from authors_to column...")
val remover4 = new StopWordsRemover().setInputCol("authors_to_words").setOutputCol("authors_to_words_f")
finalDF = remover4.transform(finalDF)
finalDF.count()
val udf_title_overlap = udf(findNumberCommonWordsTitle(_: Seq[String], _: Seq[String]))
val udf_authors_overlap = udf(findNumberCommonAuthors(_: Seq[String], _: Seq[String]))
println("Getting the number of common words between title_from and title_to columns using UDF function...")
finalDF = finalDF.withColumn("titles_intersection", udf_title_overlap(finalDF("title_from_words"), finalDF("title_to_words")))
println("Getting the number of common words between authors_from and authors_to columns using UDF function...")
finalDF = finalDF.withColumn("authors_intersection", udf_authors_overlap(finalDF("authors_from_words"), finalDF("authors_to_words")))
finalDF.count()
finalDF = finalDF.withColumn("time_dist", abs($"year_from" - $"year_to"))
println("Show schema of finalDF:\n")
finalDF.printSchema()
println("Dropping unused columns from finalDF...\n")
val finalCollsDF = finalDF.select("label", "titles_intersection", "authors_intersection", "time_dist")
println("Printing schema for finalDF...\n")
finalCollsDF.printSchema()
println("Creating features coll from finalDF using VectorAssembler...\n")
val assembler = new VectorAssembler()
  .setInputCols(Array("titles_intersection", "authors_intersection", "time_dist"))
  .setOutputCol("features")
val output = assembler.transform(finalCollsDF)
println("Printing final schema before training...\n")
output.printSchema()
output.na.drop()
println("Splitting dataset into trainingData and testData...\n")
val Array(trainingData, testData) = output.randomSplit(Array(0.6, 0.4))
val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setRawPredictionCol("prediction_raw")
  .setMaxIter(10)
val lr_model = lr.fit(trainingData)
val lr_results = lr_model.transform(testData)
val evaluator = new BinaryClassificationEvaluator()
  .setRawPredictionCol("prediction_raw")
  .setLabelCol("label")
  .setMetricName("areaUnderPR")
println("RESULTS FOR LOGISTIC REGRESSION:")
println(evaluator.evaluate(lr_results))
}
The two UDF functions which I am using are the following:
def findNumberCommonWordsTitle(title_from: Seq[String], title_to: Seq[String]) = {
  val intersection = title_from.intersect(title_to)
  intersection.length
}
def findNumberCommonAuthors(author_from: Seq[String], author_to: Seq[String]) = {
  val intersection = author_from.intersect(author_to)
  intersection.length
}
I searched for null values manually using a foreach statement, but that did not work either. How can I fix this problem? Maybe there is another problem which I cannot find.
Thanks
Spark allows values to be null in dataframes, so calling the UDF on columns of that dataframe may result in NullPointerExceptions if, for example, one or both arrays are null.
You should check for null values in your UDF and handle them accordingly, or make sure the columns in the dataframe cannot be null by setting null values to empty arrays.
def findNumberCommonAuthors(author_from: Seq[String], author_to: Seq[String]) = {
  if (author_from == null || author_to == null) 0
  else author_from.intersect(author_to).length
}
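If you prefer the second option, here is a minimal sketch of cleaning the dataframe before the UDFs run; the column names are taken from the question, and the same pattern would apply to the title columns:

import org.apache.spark.sql.functions.{coalesce, col, typedLit}

// replace null word arrays with empty string arrays so the UDFs never see null
val noNullsDF = finalDF
  .withColumn("authors_from_words", coalesce(col("authors_from_words"), typedLit(Seq.empty[String])))
  .withColumn("authors_to_words", coalesce(col("authors_to_words"), typedLit(Seq.empty[String])))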

converting textFile to dataFrame dynamically

I am trying to convert input from a text file to a dataframe using a schema file which is read at run time.
My input text file looks like this:
John,23
Charles,34
The schema file looks like this:
name:string
age:integer
This is what I tried:
import scala.io.Source

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object DynamicSchema {
  def main(args: Array[String]) {
    val inputFile = args(0)
    val schemaFile = args(1)
    val schemaLines = Source.fromFile(schemaFile, "UTF-8").getLines().map(_.split(":")).map(l => l(0) -> l(1)).toMap
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Dynamic Schema")
      .getOrCreate()
    import spark.implicits._
    val input = spark.sparkContext.textFile(args(0))
    val schema = spark.sparkContext.broadcast(schemaLines)
    val nameToType = {
      Seq(IntegerType, StringType)
        .map(t => t.typeName -> t).toMap
    }
    println(nameToType)
    val fields = schema.value
      .map(field => StructField(field._1, nameToType(field._2), nullable = true)).toSeq
    val schemaStruct = StructType(fields)
    val rowRDD = input
      .map(_.split(","))
      .map(attributes => Row.fromSeq(attributes))
    val peopleDF = spark.createDataFrame(rowRDD, schemaStruct)
    peopleDF.printSchema()
    // Creates a temporary view using the DataFrame
    peopleDF.createOrReplaceTempView("people")
    // SQL can be run over a temporary view created using DataFrames
    val results = spark.sql("SELECT name FROM people")
    results.show()
  }
}
Though printSchema gives the desired result, results.show errors out. I think the age field actually needs to be converted using toInt. Is there a way to achieve this when the schema is only available at runtime?
Replace
val input = spark.sparkContext.textFile(args(0))
with
val input = spark.read.schema(schemaStruct).csv(args(0))
and move it after the schema definition.
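Put together, the relevant part of main would look roughly like this (a sketch based on the code in the question; the broadcast and the manual Row.fromSeq construction are no longer needed, because the CSV reader applies the schema, including the integer conversion of age, itself):

val schemaLines = Source.fromFile(schemaFile, "UTF-8").getLines().map(_.split(":")).map(l => l(0) -> l(1)).toMap
val nameToType = Seq(IntegerType, StringType).map(t => t.typeName -> t).toMap
val fields = schemaLines.map { case (name, tpe) => StructField(name, nameToType(tpe), nullable = true) }.toSeq
val schemaStruct = StructType(fields)

// read the CSV only after the schema is known, so "age" is parsed as an integer
val peopleDF = spark.read.schema(schemaStruct).csv(args(0))
peopleDF.printSchema()
peopleDF.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()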

How to use feature extraction with DStream in Apache Spark

I have data that arrives from Kafka through a DStream. I want to perform feature extraction on it in order to obtain some keywords.
I do not want to wait for all of the data to arrive (as it is intended to be a continuous stream that potentially never ends), so I hope to perform the extraction in chunks; it does not matter to me if the accuracy suffers a bit.
So far I have put together something like this:
def extractKeywords(stream: DStream[Data]): Unit = {
  val spark: SparkSession = SparkSession.builder.getOrCreate
  val streamWithWords: DStream[(Data, Seq[String])] = stream map extractWordsFromData
  val streamWithFeatures: DStream[(Data, Array[String])] = streamWithWords transform extractFeatures(spark) _
  val streamWithKeywords: DStream[DataWithKeywords] = streamWithFeatures map addKeywordsToData
  streamWithFeatures.print()
}

def extractFeatures(spark: SparkSession)(rdd: RDD[(Data, Seq[String])]): RDD[(Data, Array[String])] = {
  val df = spark.createDataFrame(rdd).toDF("data", "words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(numOfFeatures)
  val rawFeatures = hashingTF.transform(df)
  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(rawFeatures)
  val rescaledData = idfModel.transform(rawFeatures)
  import spark.implicits._
  rescaledData.select("data", "features").as[(Data, Array[String])].rdd
}
However, I received java.lang.IllegalStateException: Haven't seen any document yet. I am not surprised, as I was just trying to put the pieces together, and I understand that since I am not waiting for any data to arrive, the generated model might be empty when I try to use it on the data.
What would be the right approach to this problem?
I used the advice from the comments and split the procedure into two runs:
one that calculates the IDF model and saves it to a file
def trainFeatures(idfModelFile: File, rdd: RDD[(String, Seq[String])]) = {
  val session: SparkSession = SparkSession.builder.getOrCreate
  val wordsDf = session.createDataFrame(rdd).toDF("data", "words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
  val featurizedDf = hashingTF.transform(wordsDf)
  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(featurizedDf)
  idfModel.write.save(idfModelFile.getAbsolutePath)
}
one that reads the IDF model from the file and simply runs it on all incoming information
val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)
val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("words")
val wordsDf = tokenizer.transform(documentDf)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val extractor = idfModel.setInputCol("rawFeatures").setOutputCol("features")
val featuresDf = extractor.transform(featurizedDf)
featuresDf.select("update", "features")
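For completeness, a rough sketch of how this second step could be applied per micro-batch of the stream; the updates DStream of (update, document) pairs and the surrounding setup are assumptions, only the transformations are taken from the snippet above:

import org.apache.spark.ml.feature.{HashingTF, IDFModel, Tokenizer}
import org.apache.spark.sql.SparkSession

// `updates: DStream[(String, String)]` of (update, document) pairs is an assumed input
updates.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val spark = SparkSession.builder.getOrCreate
    val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
    val wordsDf = new Tokenizer().setInputCol("document").setOutputCol("words").transform(documentDf)
    val featurizedDf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").transform(wordsDf)
    val featuresDf = idfModel.setInputCol("rawFeatures").setOutputCol("features").transform(featurizedDf)
    featuresDf.select("update", "features").show(5)
  }
}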

Transform input stream to key-value pairs stream

I am new to Spark and Scala, so my question is probably rather easy, but I still struggle to find an answer. I need to join two Spark streams, but I have problems converting those streams to the appropriate format. Please see my code below:
val lines7 = ssc.socketTextStream("localhost", 9997)
val pairs7 = lines7.map(line => (line.split(" ")[0], line))
val lines8 = ssc.socketTextStream("localhost", 9998)
val pairs8 = lines8.map(line => (line.split(" ")[0], line))
val newStream = pairs7.join(pairs8)
This doesn't work because the "join" function expects streams in the format DStream[String, String], while the result of the map function is DStream[(String, String)].
And now my question is: how do I write this map function to get the appropriate output (a little explanation would also be great)?
Thanks in advance.
This works as expected:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(30))
val lines7 = ssc.socketTextStream("localhost", 9997)
val pairs7 = lines7.map(line => (line.split(" ")(0), line))
val lines8 = ssc.socketTextStream("localhost", 9998)
val pairs8 = lines8.map(line => (line.split(" ")(0), line))
val newStream = pairs7.join(pairs8)
newStream.foreachRDD(rdd => println(rdd.collect.map(_.toString).mkString(",")))
ssc.start
The only issue I see is a syntax error: line.split(" ")[0] should be line.split(" ")(0), since Scala uses parentheses (the apply method) rather than square brackets for indexing; but I guess that would be caught by the compiler.
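As for the requested explanation: mapping each line to (line.split(" ")(0), line) turns a DStream[String] into a DStream[(String, String)] of (key, line) pairs, which is exactly what join on pair DStreams operates on; the joined stream then has type DStream[(String, (String, String))], pairing each key with the matching lines from both streams. A tiny sketch of consuming it (the printed format is just an example):

newStream.foreachRDD { rdd =>
  // collect is fine here only because the example data is tiny
  rdd.collect.foreach { case (key, (left, right)) =>
    println(s"$key -> $left | $right")
  }
}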