Convert RDD[Array[(String,String)]] to RDD[(String,String)] in Scala

I'm new to Scala and have tried multiple things to convert an RDD[Array[(String,String)]] into an RDD[(String,String)].
What I want to achieve is to select two elements (text and category) from a JSON line. For every word in the text, I just want to create a key/value pair of the form (word1, category), (word2, category), ....
My example looks like this:
import org.json4s._
import org.json4s.jackson.JsonMethods._
// Example JSON line: {"reviewText": "This was a gift!", "category": "Apps"}
val rdd = sc.textFile(PathToJSONFile)
rdd.map { row =>
  val json_row = parse(row)
  val myCategory = compact(json_row \ "category").toString
  val myText = compact(json_row \ "reviewText").toString.toLowerCase
    .split("[#&$!]").map(_.trim).filter(_.length > 1)
  myText.map { word => (word, myCategory) }
}
The output is org.apache.spark.rdd.RDD[Array[(String, String)]] and looks like this:
Array(Array((this,"Apps"), (was,"Apps"), (a,"Apps"), (gift,"Apps")))
But what I want is key/value pairs in the form of an RDD[(String,String)], where the key is a word and the value is the same category for every word in that line.
How can I achieve this? Many thanks!

The suggestion from Psidom solved the problem.
Changing rdd.map to rdd.flatMap was the solution: flatMap flattens the inner arrays, yielding the desired RDD[(String, String)].
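For reference, the corrected snippet looks like this (the same code as above, only with flatMap in place of map):
import org.json4s._
import org.json4s.jackson.JsonMethods._
// Same code as above, with map replaced by flatMap: the per-line arrays of
// (word, category) pairs are flattened into a single RDD[(String, String)].
val pairs = sc.textFile(PathToJSONFile).flatMap { row =>
  val json_row = parse(row)
  val myCategory = compact(json_row \ "category").toString
  val myText = compact(json_row \ "reviewText").toString.toLowerCase
    .split("[#&$!]").map(_.trim).filter(_.length > 1)
  myText.map { word => (word, myCategory) }
}
// pairs: org.apache.spark.rdd.RDD[(String, String)]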

Related

Scala: How to convert my input to a list of lists

I have the input below:
input
[level:1,firstFile:one,secondFile:secone,Flag:NA][level:1,firstFile:two,secondFile:sectwo,Flag:NA][level:2,firstFile:three,secondFile:secthree,Flag:NA]
I get the output below, and it works fine:
List(List(one, two), List(three))
List(List(secone, sectwo), List(secthree))
However, when I pass the input below,
[level:1,firstFile:one,four,secondFile:secone,Flag:NA][level:1,firstFile:two,secondFile:sectwo,Flag:NA][level:2,firstFile:three,secondFile:secthree,Flag:NA]
I get this output:
List(List(), List(two), List(three))
List(List(), List(sectwo), List(secthree))
But the expected output is:
List(List(one, four, two), List(three))
List(List(secone, sectwo), List(secthree))
Code:
val validJsonRdd = sc.parallelize(Seq(input)).flatMap(x =>
  x.replace(",", "\",\"").replace(":", "\":\"").replace("[", "{\"")
    .replace("]", "\"}").replace("}{", "}&{").split("&"))
import org.apache.spark.sql.functions._
val df = spark.read.json(validJsonRdd).orderBy("level").groupBy("level")
  .agg(collect_list("firstFile").as("firstFile"), collect_list("secondFile").as("secondFile"))
  .select(collect_list("firstFile").as("firstFile"), collect_list("secondFile").as("secondFile"))
val rdd = df.collect().map(row => (row(0).asInstanceOf[Seq[Seq[String]]], row(1).asInstanceOf[Seq[Seq[String]]]))
val first = rdd(0)._1.map(x => x.toList).toList
val second = rdd(0)._2.map(x => x.toList).toList
val firstInputcolumns = first.map(_.filterNot(_ == null))
val secondInputcolumns = second.map(_.filterNot(_ == null))
println(firstInputcolumns)
println(secondInputcolumns)
Kindly help me to correct the code.
It doesn't look like your replaces are quite producing valid JSON. If you run them on the second input, for the first entry you get:
{"level":"1","firstFile":"one","four","secondFile":"secone","Flag":"NA"}
But a JSON object is a set of key/value pairs. You can't just have "four" sitting out on its own like that. If you want firstFile to map to a list, one and four should be wrapped in square brackets, and the JSON should look like this:
{"level":"1","firstFile":["one","four"],"secondFile":"secone","Flag":"NA"}

How to search multiple strings in Scala Spark

I am using
val my_data = sc.textFile("abc.txt")
val my_search = my_data.filter(x => x.contains("is","work"))
where I am trying to filter the lines that contain "is" and "work" in my RDD my_data.
If you know all the strings you want to filter beforehand (as in the example you gave), you can do the following:
my_data.filter(x => Seq("is", "work").forall(x.contains))
Full words
If you want to filter on full words, you will need to tokenize each line first. The easiest way to do that is with String.split(" "). Be careful, as this doesn't work for languages like Japanese or Chinese.
my_data.filter { line =>
  val tokens = line.split(" ").toSet
  Seq("is", "work").forall(tokens.contains)
}
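A quick, self-contained check of the first approach, using hypothetical in-memory lines rather than abc.txt:
// Hypothetical sample lines standing in for abc.txt
val my_data = sc.parallelize(Seq(
  "this is how work gets done",
  "this is a test",
  "nothing to see here"))
val my_search = my_data.filter(x => Seq("is", "work").forall(x.contains))
my_search.collect()  // keeps only the first line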

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of a stackoverflow userbase. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from Reputation (in this case "11849") and the number from Age (in this example "35"), and I would like to have them as floats.
The file is located in HDFS, so it comes in as an RDD:
val linesWithAge = lines.filter(line => line.contains("Age=")) // filter out lines that don't have Age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) // split the data wherever there is a "
So when I split on quotation marks, the reputation is at index 3 and the age at index 23, but how do I assign these to a map or a variable so I can use them as floats?
Also, I need to do this for every line in the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So I've added an index to the array and successfully managed to assign one value to a variable, but I can only do it for one item in the RDD. Does anyone know how I could do it for all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name: String): Option[String] = {
  for {
    entry <- src.find(_.startsWith(name))
    value <- entry.split("\"").lift(1)
  } yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap { line =>
  val elements = line.split(" ")
  for {
    age <- findElement(elements, "Age")
    rep <- findElement(elements, "Reputation")
  } yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement returns None, so the result of the for-comprehension is None, and that record is dropped by the flatMap operation.
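For a quick illustration of how findElement behaves on a simplified row (a hypothetical check, not part of the original answer):
// Hypothetical check of findElement on a simplified row
val sample = """<row Id="42" Reputation="11849" Age="35" />""".split(" ")
findElement(sample, "Reputation")  // Some("11849")
findElement(sample, "Age")         // Some("35")
findElement(sample, "Views")       // None, so the for-comprehension yields None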
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala provides built-in support for XML, so we can do this:
import scala.xml.XML
import scala.xml.XML._
// helper function to map Strings to Option, where empty strings become None
def emptyStrToNone(str: String): Option[String] = if (str.isEmpty) None else Some(str)
val xmlReputationByAge = lines.flatMap { line =>
  val record = XML.loadString(line)
  for {
    rep <- emptyStrToNone((record \ "@Reputation").text)
    age <- emptyStrToNone((record \ "@Age").text)
  } yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
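And a quick, self-contained check of the attribute selectors on a single well-formed row (again hypothetical, not part of the original answer):
// Hypothetical single-row check of the attribute selectors
val row = """<row Id="42" Reputation="11849" Age="35" AccountId="33" />"""
val rec = scala.xml.XML.loadString(row)
(rec \ "@Reputation").text  // "11849"
(rec \ "@Age").text         // "35"
(rec \ "@Views").text       // "" -- which emptyStrToNone turns into None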
First, you need a function which extracts the value for a given key of your line (getValueForKeyAs[T]), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs[Float](line,"Age"), getValueForKeyAs[Float](line,"Reputation")))
This should give you an rdd of type RDD[(Float,Float)]
getValueForKeyAs could be implemented like this:
def getValueForKeyAs[A](line: String, key: String): A = {
  val res = line.split(key + "=")
  if (res.size == 1) throw new RuntimeException(s"no value for key $key")
  val value = res(1).split("\"")(1)
  value.asInstanceOf[A]
}

Unable to use collectAsMap() in Scala code

val titleMap = movies.map(line => line.split("\\|")).take(2)
// converting movie-id and movie name into a map (key/value pairs)
val title1 = titleMap.map(array => (array(0).toInt, array(1)))
val titles = movies.map(line => line.split("\\|").take(2))
  .map(array => (array(0).toInt, array(1)))
  .collectAsMap()
What's wrong with "title1" here? I am unable to apply the collectAsMap function to it, while the same thing works in the case of "titles".
The type of title1 is not an RDD, because .take(2) on the first line returns a plain local Array, so it doesn't have the method collectAsMap().
The type of titles is an RDD so it does have the method collectAsMap().
I'd advise reading up on types: https://en.wikipedia.org/wiki/Type_safety, https://en.wikipedia.org/wiki/Type_system
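A minimal sketch of the distinction, assuming movies is an RDD[String] with |-separated fields as in the question:
// .take(2) on an RDD returns a local Array, which has no collectAsMap()
val titleMap = movies.map(_.split("\\|")).take(2)   // Array[Array[String]]
// keeping the .take(2) inside the map leaves the data as an RDD
val titles = movies.map(_.split("\\|").take(2))     // RDD[Array[String]]
  .map(a => (a(0).toInt, a(1)))                     // RDD[(Int, String)]
  .collectAsMap()                                   // Map[Int, String]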

Create new column with function in Spark Dataframe

I'm trying to figure out the new DataFrame API in Spark. It seems like a good step forward, but I'm having trouble doing something that should be pretty simple. I have a dataframe with 2 columns, "ID" and "Amount". As a generic example, say I want to return a new column called "Code" that contains a code based on the value of "Amt". I can write a function something like this:
def coder(myAmt: Integer): String = {
  if (myAmt > 100) "Little"
  else "Big"
}
When I try to use it like this:
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
myDF.withColumn("Code", coder(myDF("Amt")))
I get type mismatch errors
found : org.apache.spark.sql.Column
required: Integer
I've tried changing the input type on my function to org.apache.spark.sql.Column, but then I start getting compile errors in the function because the if statement wants a Boolean.
Am I doing this wrong? Is there a better/another way to do this than using withColumn?
Thanks for your help.
Let's say you have an "Amt" column in your schema:
import org.apache.spark.sql.functions._
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
val coder: (Int => String) = (arg: Int) => {if (arg < 100) "little" else "big"}
val sqlfunc = udf(coder)
myDF.withColumn("Code", sqlfunc(col("Amt")))
I think withColumn is the right way to add a column
We should avoid defining udf functions as much as possible, because of the overhead of serializing and deserializing the column data.
You can achieve the same result with the built-in when function, as below:
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
myDF.withColumn("Code", when(myDF("Amt") < 100, "Little").otherwise("Big"))
Another way of doing this:
You can create any function, but according to the error above, you should define the function as a val wrapped in udf.
Example:
val coder = udf((myAmt: Integer) => {
  if (myAmt > 100) "Little"
  else "Big"
})
Now this statement works perfectly:
myDF.withColumn("Code", coder(myDF("Amt")))