Map value empty outside a foreach in Scala

I have just started programming in Scala. I'm also using Apache Spark to read a file, moviesFile. In the following code, I'm updating a mutable map inside a foreach function. The map is updated within the foreach, but the values are no longer present once the foreach exits.
How can I make the values persist in the map variable movieMap?
val movieMap = scala.collection.mutable.Map[String, String]()

val movie = moviesFile.map(_.split("::")).foreach { x =>
  x.mkString(" ")
  val movieid = x(0)
  val title = x(1)
  val genre = x(2)
  val value = title + "," + genre
  movieMap(movieid.toString()) = value.toString()
  println(movieMap.keySet)
}

println(movieMap.keySet)
println(movieMap.get("29"))

I believe you are using Spark the wrong way here. The foreach closure runs on the executors against serialized copies of movieMap, so the updates never reach the map on the driver. If you want to utilize Spark, you will have to use Spark's distributed data structures.
I suggest staying with Spark's distributed, parallelized data structure (RDDs). RDDs that contain (key, value) pairs are implicitly provided with some Map-like functionality.
import scala.collection.mutable
import org.apache.spark.SparkContext._

// Assume sc is the SparkContext instance
val moviesFileRdd = sc.textFile("movies.txt")

// moviesRdd is an RDD[(String, String)], which acts as a Map-like collection of (key, value) pairs
val moviesRdd = moviesFileRdd.map { line =>
  val splitLine = line.split("::")
  val movieId = splitLine(0)
  val title = splitLine(1)
  val genre = splitLine(2)
  val value = title + ", " + genre
  (movieId.toString, value.toString)
}

// You see... RDD[(String, String)] offers some map-like operations.
// Get a list of all values with key 29:
val listOfValuesWithKey29 = moviesRdd.lookup("29")

// I don't know why you would need it, but if you really need a local map here, then:
val moviesMap = moviesRdd.collectAsMap

// moviesMap will be an immutable Map; in case you need a mutable Map:
val moviesMutableMap = mutable.Map(moviesMap.toList: _*)
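With the map collected on the driver, the lookups from the original question now behave as expected. A minimal usage sketch, assuming a movie with id 29 exists in movies.txt:

println(moviesMutableMap.keySet)
println(moviesMutableMap.get("29")) // Some("title, genre") if movie 29 is in the file, otherwise None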

Dynamic conversion of Array of double columns into multiple columns in nested spark dataframe

My current DataFrame looks like the below:
{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}
I want to transform this DataFrame into the DataFrame below:
{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}
That is, each key under 'values' (0.2, 0.4 and 0.6) should be multiplied by 100, prepended with the letter 'v', and extracted into a separate column.
What would the code look like to achieve this? I have tried withColumn but couldn't get it to work.
Try the code below; the inline comments explain each step.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object DynamicCol {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.read.json("src/main/resources/dyamicCol.json") // Load the JSON file

    // Temp DataFrame for fetching the nested values
    val dfTemp = df.select(col("inputs.values").as("values"))
    val index = dfTemp.schema.fieldIndex("values")
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]

    // Fold over the nested field names, adding one renamed column per field
    val dfFinal = propSchema.fields.foldLeft(df)((df, field) => {
      val colNameInt = (field.name.toDouble * 100).toInt
      val colName = s"v$colNameInt"
      df.withColumn(colName, col("inputs.values.`" + field.name + "`")) // Add the nested column mapping
    }).drop("inputs") // Drop the original nested column

    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Write the output JSON
  }
}
I would split the column-renaming logic into two cases: names that are numeric values and names that should stay unchanged.
def stringDecimalToVNumber(colName: String): String =
  "v" + (colName.toFloat * 100).toInt.toString
and then form a single function that transforms a name according to its case:
val floatRegex = """(\d+\.?\d*)""".r

def transformColumnName(colName: String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // it's a float, transform it
  case x => x // keep it as is
}
Now that we have the function to transform the column names, let's pick up the schema dynamically.
val flattenDF = df.select("id", "inputs.values.*")

val finalDF = flattenDF
  .schema.names
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })
This will dynamically rename all the columns that came from inputs.values and place them next to id.
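As a quick sanity check, the renaming function behaves like this on the sample keys (a sketch, assuming the sample record from the question):

transformColumnName("0.2") // "v20"
transformColumnName("0.6") // "v60"
transformColumnName("id")  // "id" (unchanged)

finalDF.printSchema() // expect columns like id, v20, v40, v60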

Looping through Map Spark Scala

We have two files: athletes.csv, which contains athlete names, and twitter.test, which contains tweet messages. We want to find, for every line in twitter.test, the names that match those in athletes.csv. We load the names from athletes.csv into a Map and want to check every name against every line of the test file.
import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object twitterAthlete {

  def loadAthleteNames(): Map[String, String] = {
    // Handle character encoding issues:
    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    // Create a Map of Strings to Strings, and populate it from athletes.csv.
    var athleteInfo: Map[String, String] = Map()
    val lines = Source.fromFile("../athletes.csv").getLines()
    for (line <- lines) {
      val fields = line.split(',')
      if (fields.length > 1) {
        athleteInfo += (fields(1) -> fields(7))
      }
    }
    athleteInfo
  }

  def parseLine(line: String): String = {
    val athleteInfo = loadAthleteNames()
    var hello = new String
    for ((k, v) <- athleteInfo) {
      if (line.contains(k)) {
        hello = k
      }
    }
    hello
  }

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val sc = new SparkContext("local[*]", "twitterAthlete")

    val lines = sc.textFile("../twitter.test")
    val athleteInfo = loadAthleteNames()

    val splitting = lines.map(x => x.split(";")).map(x => if (x.length == 4 && x(2).length <= 140) x(2))
    val container = splitting.map(x => for ((key, value) <- athleteInfo) if (x.toString().contains(key)) { key }).cache
    container.collect().foreach(println)

    // val mapping = container.map(x => (x,1)).reduceByKey(_+_)
    // mapping.collect().foreach(println)
  }
}
The first file looks like:
id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...
The second file looks like:
time, id , tweet .....
12:00, 03043, some message that contain some athletes names , .....
02:00, 03023, some message that contain some athletes names , .....
something like this ...
But I got an empty result after running this code; any suggestions are much appreciated.
The result I got is empty:
()....
()...
()...
But the result I expected is something like:
(name,1)
(other name,1)
You need to use yield so the for comprehension inside your map returns a value:
val container = splitting.map(x => for ((key, value) <- athleteInfo; if x.toString().contains(key)) yield (key, 1)).cache
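Note that each element of container is now a collection of (name, 1) pairs for one tweet. If you want the flat (name, count) output shown in the question, a small follow-up might look like this (a sketch, reusing the splitting and athleteInfo values defined above):

val counts = splitting
  .flatMap(x => for ((key, _) <- athleteInfo if x.toString().contains(key)) yield (key, 1))
  .reduceByKey(_ + _)
counts.collect().foreach(println) // prints (name, count) pairs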
I think you should just start with the simplest option first...
I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc.
Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending on how big or small the athlete-name data is, you'll end up with a more optimized broadcast join and avoid a shuffle.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer
import spark.implicits._ // for the 'symbol column syntax

val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)

val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")

val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()
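If the athlete list is small, you can also make the broadcast join explicit rather than relying on the optimizer. A sketch, using the broadcast function from org.apache.spark.sql.functions:

val withAthleteBroadcast = exploded.join(broadcast(athletes), 'word === 'name)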

how to merge RDD tuples

I want to use reduceByKey to merge many tuples with the same key.
Here is the code:
import breeze.linalg.DenseMatrix
import scala.util.Random

val data = Array(DenseMatrix((2.0, 1.0, 5.0), (4.0, 3.0, 6.0)),
                 DenseMatrix((7.0, 8.0, 9.0), (10.0, 12.0, 11.0)))
val init = sc.parallelize(data, 2)

// getColumn
def getColumn(v: DenseMatrix[Double]): Map[Int, IndexedSeq[(Int, Double)]] = {
  val r = Random
  val index = 0 to v.size - 1
  def func(x: Int, y: DenseMatrix[Double]): (Int, (Int, Double)) = {
    (x, (r.nextInt(10), y.valueAt(x)))
  }
  val rest = index.map { x => func(x, v) }.groupBy(x => x._1).mapValues(x => x.map(_._2))
  rest
}

val out = init.flatMap { v => getColumn(v) }
val reduceOutput = out.reduceByKey(_ ++ _)
val out2 = out.map { case (k, v) => k }.collect() // the keys here are not what I want
Here are two pictures: the first shows the (key, value) pairs I thought I would get, and the second shows the real keys. They are not what I want, so the output is not right.
What should I do?

Filtering One RDD based on another RDD using regex

I have two RDD's of the form:
data_wo_header: RDD[String], data_test_wo_header: RDD[String]
scala> data_wo_header.first
res2: String = 1,2,3.5,1112486027
scala> data_test_wo_header.first
res2: String = 1,2
RDD2 is smaller than RDD1. I am trying to filter RDD1 by removing the elements whose regex matches an entry of RDD2.
The 1,2 in the example above represents UserID,MovID. Since that pair is present in the test set, I want a new RDD with it removed from RDD1.
I have asked a similar question before, but the answer required an unnecessary split of the RDD.
I am trying to do something of this sort, but it's not working:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  var ratings_train = new ListBuffer[String]()
  data_wo_header.foreach(x => {
    data_test_wo_header.foreach(y => {
      if (x.indexOf(y) == 0) {
        ratings_train += x
      }
    })
  })
  val ratings_train_list = ratings_train.toList
  return ratings_train_list
}
How should I do a regex match and filter based on it?
You can use a broadcast variable to share the contents of rdd2 and then filter rdd1 against the broadcast value. I replicated your code and this works for me:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  // Collect the (small) test RDD and broadcast it to every executor
  val rdd2array = sparkSession.sparkContext.broadcast(data_test_wo_header.collect())
  val training_set = data_wo_header.filter {
    // Keep only the lines that match none of the broadcast patterns.
    // Note: String.matches requires the pattern to match the whole line,
    // so a prefix like "1,2" may need ".*" appended (e.g. "1,2,.*").
    case (x) => rdd2array.value.filter(y => x.matches(y)).length == 0
  }
  training_set.collect().toList
}
Also, with Scala and Spark I recommend that, where possible, you avoid foreach and use a more functional style with the map, flatMap and filter functions.
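For example, if the test rows really are plain UserID,MovID prefixes rather than arbitrary regexes (an assumption about the sample data shown above), the filter inside create_training could be written more directly with exists and startsWith (a sketch):

val training_set = data_wo_header.filter { x =>
  // drop the line if any broadcast prefix matches the start of it
  !rdd2array.value.exists(y => x.startsWith(y + ","))
}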

Deduping events using hiveContext in Spark with Scala

I am trying to dedupe event records using the hiveContext in Spark with Scala.
Going from the df to an rdd gives a compilation error saying "object Tuple23 is not a member of package scala". It is a known issue that a Scala Tuple can't have 23 or more elements.
Is there any other way to dedupe?
val events = hiveContext.table("default.my_table")
val valid_events = events.select(
events("key1"),events("key2"),events("col3"),events("col4"),events("col5"),
events("col6"),events("col7"),events("col8"),events("col9"),events("col10"),
events("col11"),events("col12"),events("col13"),events("col14"),events("col15"),
events("col16"),events("col17"),events("col18"),events("col19"),events("col20"),
events("col21"),events("col22"),events("col23"),events("col24"),events("col25"),
events("col26"),events("col27"),events("col28"),events("col29"),events("epoch")
)
//events are deduped based on latest epoch time
val valid_events_rdd = valid_events.rdd.map(t => {
((t(0),t(1)),(t(2),t(3),t(4),t(5),t(6),t(7),t(8),t(9),t(10),t(11),t(12),t(13),t(14),t(15),t(16),t(17),t(18),t(19),t(20),t(21),t(22),t(23),t(24),t(25),t(26),t(28),t(29)))
})
// reduce by key so we will only get one record for every primary key
val reducedRDD = valid_events_rdd.reduceByKey((a,b) => if ((a._29).compareTo(b._29) > 0) a else b)
//Get all the fields
reducedRDD.map(r => r._1 + "," + r._2._1 + "," + r._2._2).collect().foreach(println)
Off the top of my head:
- use case classes, which no longer have a size limit. Just keep in mind that case classes won't work correctly in the Spark REPL,
- use Row objects directly and extract only the keys (see the sketch at the end of this answer),
- or operate directly on a DataFrame:
import org.apache.spark.sql.functions.{col, max}

val maxs = df
  .groupBy(col("key1"), col("key2"))
  .agg(max(col("epoch")).alias("epoch"))
  .as("maxs")

df.as("df")
  .join(maxs,
    col("df.key1") === col("maxs.key1") &&
    col("df.key2") === col("maxs.key2") &&
    col("df.epoch") === col("maxs.epoch"))
  .drop(maxs("epoch"))
  .drop(maxs("key1"))
  .drop(maxs("key2"))
or with a window function:
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy($"key1", $"key2").orderBy($"epoch".desc) // latest epoch first
df.withColumn("rn", rowNumber.over(w)).where($"rn" === 1).drop("rn")