Gatling: transform findAll to sorted list - scala

I'm new to scala and Gatling. I'm trying to transform the result of findAll into a sorted list and then return a String representation of the sorted list. I can't seem to do this with the following code:
http(requestTitle)
  .post(serverUrl)
  .body(ElFileBody(sendMessageFile))
  .header("correlation-id", correlationId)
  .check(status.is(200),
    jsonPath("$.data.sendMessage.targetedRecipients").findAll.transform(recipients => {
      println("recipients class: " + recipients.getClass)
      var mutable = scala.collection.mutable.ListBuffer(recipients: _*)
      var sortedRecipients = mutable.sortWith(_ < _)
      println("users sorted " + sortedRecipients)
      sortedRecipients.mkString(",")
    }).is(expectedMessageRecipients))
Recipients is of type scala.collection.immutable.Vector.
I thought I would be able to convert the immutable collection into a mutable collection using scala.collection.mutable.ListBuffer. Any help would be appreciated, thanks.

I don't think your problem is immutability, it's JSON parsing vs Gatling's .find and .findAll methods.
I'm going to make a guess that your response looks something like...
{"data":{"sendMessage":{"targetedRecipients":[1,4,2,3]}}}
in which case Gatling's .findAll method will return a Vector (it always does if it finds something), but it will only have one element, which will be "[1,4,2,3]" - i.e. a string representation of the JSON array. Sorting that single-element collection naturally achieves nothing. To get .findAll to behave the way you seem to be expecting, you would need a response something like...
{"data":
{"sendMessage":
{"targetedRecipients":
[{"recipientId":1},
{"recipientId":4},
{"recipientId":2},
{"recipientId":3}]
}}}
which you could use .jsonPath("$..recipientId").findAll to turn into a Vector[String] of the ids.
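With that response shape, the check could look something like this (a rough sketch; expectedRecipients stands for a hypothetical, already sorted Seq[String] of the expected ids):
jsonPath("$..recipientId")
  .findAll
  .transform(_.sortWith(_ < _))
  .is(expectedRecipients)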
So, assuming you are indeed just getting a single string representation of an array of values, you can use a straight transform to split it into a collection and sort (as you tried in your example).
Here's a working version:
val data = """{"data":{"sendMessage":{"targetedRecipients":[1,4,2,3]}}}"""

def sortedArray: ScenarioBuilder = scenario("sorting an array")
  .exec(http("test call")
    .post("http://httpbin.org/anything")
    .body(StringBody(data)).asJson
    .check(
      status.is(200),
      jsonPath("$.json.data.sendMessage.targetedRecipients")
        .find
        .transform(_
          .drop(1)
          .dropRight(1)
          .split(",")
          .toVector
          .sortWith(_ < _)
        )
        .saveAs("received")
    ))
  .exec(session => {
    println(s"received: ${session("received").as[Vector[String]]}")
    session
  })

There is no reason to use a mutable collection if all you want is to sort the result:
Vector(5,4,3,2,1).sortWith(_ < _).mkString(", ") // "1, 2, 3, 4, 5"
To use a ListBuffer you have to copy all the elements into a newly allocated object, so it isn't any more efficient. The same goes for vars: you can use vals, since you never update the references:
println(s"recipients class: ${recipients.getClass}")
val result = recipients.sortWith(_ < _).mkString(", ")
println(s"users sorted $result")
result
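Plugged back into the check from the question, that body would look something like this (a sketch reusing the question's own names such as expectedMessageRecipients):
jsonPath("$.data.sendMessage.targetedRecipients").findAll.transform { recipients =>
  recipients.sortWith(_ < _).mkString(", ")
}.is(expectedMessageRecipients)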

Related

Expected Map[] but Found List[Map[]]

The Scala program iterates through a list of words and, for each word, appends the word to the existing value if the key is already present, or adds a new key -> word entry otherwise. It is expected to produce a Map but produces a List[Map] instead.
val hashmap: Map[List[(Char, Int)], List[String]] = Map()
for (word <- dictionary) yield {
  val word_occ = wordOccurrences(word)
  hashmap + (if (hashmap.contains(word_occ)) (word_occ -> (hashmap(word_occ) ++ List(word))) else (word_occ -> List(word)))
}
Note that in this case you probably want to build the Map in a single pass rather than modifying a mutable Map:
val hashmap: Map[List[(Char, Int)], List[String]] =
  dictionary
    .map(x => (wordOccurrences(x), x))
    .groupBy(_._1)
    .map { case (k, v) => k -> v.map(_._2) }
In Scala 2.13 you can replace the last two lines with
.groupMap(_._1)(_._2)
You can also use a view on the dictionary to avoid creating the intermediate list if performance is a significant issue.
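Spelled out, the Scala 2.13 version would look something like this (a sketch assuming dictionary is a List[String] and wordOccurrences is the question's own function):
val hashmap: Map[List[(Char, Int)], List[String]] =
  dictionary
    .map(x => (wordOccurrences(x), x))
    .groupMap(_._1)(_._2)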
A for comprehension with a single <- generator de-sugars to a map() call on the original collection. And, as you'll recall, map() can change the elements of a collection, but it won't change the collection type itself.
So if dictionary is a List then what you end up with will be a List. The yield specifies what is to be the next element in the resulting List.
In your case the code is creating a new single-element Map for each element in the dictionary. Probably not what you want. I'd suggest you try using foldLeft().
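A minimal foldLeft sketch along those lines (again assuming the question's dictionary and wordOccurrences) could be:
val hashmap: Map[List[(Char, Int)], List[String]] =
  dictionary.foldLeft(Map.empty[List[(Char, Int)], List[String]]) { (acc, word) =>
    val wordOcc = wordOccurrences(word)
    // append the word to the existing list for this key, or start a new list
    acc + (wordOcc -> (acc.getOrElse(wordOcc, List.empty[String]) :+ word))
  }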

Reduce list of tuples to a single tuple in scala

Item is a custom type.
I have an Iterable of pairs (Item, Item). The first element in every pair is the same, so I want to reduce the list to a single pair of type (Item, Array[Item]).
// list: Iterable[(Item, Item)]
// First attempt
val res = list.foldLeft((null, Array[Item]()))((p1, p2) => {
  (p2._1, p1._2 :+ p2._2)
})
// Second attempt
val r = list.unzip
val res = (r._1.head, r._2.toArray)
1. I don't know how to correctly set up the zero value in the first ("foldLeft") solution. Is there any way to do something like this?
2. Other than the second solution, is there a better way to reduce a list of custom object tuples to a single tuple?
If you are sure the first element in every pair is the same, why don't you use that information to simplify?
(list.head._1, list.map(_._2))
should do the job.
If there are other cases where the first element is different, you may want to try:
list.groupBy(_._1).map { case (common, lst) => (common, lst.map(_._2)) }
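If you do want to keep the foldLeft approach, the zero value just needs a concrete type. One possibility (a sketch, using Option for the not-yet-seen first element) is:
val reduced: (Option[Item], Array[Item]) =
  list.foldLeft((Option.empty[Item], Array.empty[Item])) { case ((_, acc), (first, second)) =>
    (Some(first), acc :+ second)   // keep the shared first element, accumulate the second ones
  }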

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of a stackoverflow userbase. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from Reputation (in this case "11849") and the number from Age (in this example "35"), and I would like to have them as floats.
The file is located in HDFS, so it comes in as an RDD.
val linesWithAge = lines.filter(line => line.contains("Age=")) // filter out lines which don't have an age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) // split the data wherever there is a "
When I split on the quotation marks, the reputation is at index 3 and the age at index 23, but how do I assign these to a map or a variable so I can use them as floats?
Also, I need to do this for every line in the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So I added an index to the array and have now successfully managed to assign a value to a variable, but I can only do it for one item in the RDD. Does anyone know how I could do it for all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name: String): Option[String] = {
  for {
    entry <- src.find(_.startsWith(name))
    value <- entry.split("\"").lift(1)
  } yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap { line =>
  val elements = line.split(" ")
  for {
    age <- findElement(elements, "Age")
    rep <- findElement(elements, "Reputation")
  } yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None, the result of the for-comprehension will be None, and the record will be dropped by the flatMap operation.
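As a tiny illustration of that behaviour (plain Scala, independent of Spark; the values are made up):
val maybePair = for {
  a <- Some(35)            // attribute present
  b <- Option.empty[Int]   // attribute missing
} yield (a, b)
// maybePair == None, so flatMap would simply drop this record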
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala provides built-in support for XML, so we can do this:
import scala.xml.XML
// helper function to map Strings to Option, where empty strings become None
def emptyStrToNone(str: String): Option[String] = if (str.isEmpty) None else Some(str)

val xmlReputationByAge = lines.flatMap { line =>
  val record = XML.loadString(line)
  for {
    rep <- emptyStrToNone((record \ "@Reputation").text)
    age <- emptyStrToNone((record \ "@Age").text)
  } yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
First, you need a function which extracts the value for a given key from your line (getValueForKeyAs[A] below); then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs[Float](line,"Age"), getValueForKeyAs[Float](line,"Reputation")))
This should give you an RDD of type RDD[(Float, Float)].
getValueForKeyAs could be implemented like this:
def getValueForKeyAs[A](line: String, key: String): A = {
  val res = line.split(key + "=")
  if (res.size == 1) throw new RuntimeException(s"no value for key $key")
  val value = res(1).split("\"")(1)
  value.asInstanceOf[A]
}

How do I iterate RDD's in apache spark (scala)

I use the following command to fill an RDD with a bunch of arrays containing 2 strings ["filename", "content"]:
val someRDD = sc.wholeTextFiles("hdfs://localhost:8020/user/cloudera/*")
Now I want to iterate over each of those occurrences to do something with every filename and content.
I can't seem to find any documentation on how to do this, however.
So what I want is this:
foreach occurrence-in-the-rdd {
  // do stuff with the array found at location n of the RDD
}
You call various methods on the RDD that accept functions as parameters.
// set up an example -- an RDD of arrays
val sparkConf = new SparkConf().setMaster("local").setAppName("Example")
val sc = new SparkContext(sparkConf)
val testData = Array(Array(1, 2, 3), Array(4, 5, 6, 7, 8))
val testRDD = sc.parallelize(testData, 2)

// Print the size of each array in the RDD.
testRDD.collect().foreach(a => println(a.size))

// Use map() to create an RDD with the array sizes.
val countRDD = testRDD.map(a => a.size)

// Print the elements of this new RDD.
countRDD.collect().foreach(a => println(a))

// Use filter() to create an RDD with just the longer arrays.
val bigRDD = testRDD.filter(a => a.size > 3)

// Print each remaining array.
bigRDD.collect().foreach(a => {
  a.foreach(e => print(e + " "))
  println()
})
Notice that the functions you write accept a single RDD element as input, and return data of some uniform type, so you create an RDD of the latter type. For example, countRDD is an RDD[Int], while bigRDD is still an RDD[Array[Int]].
It will probably be tempting at some point to write a foreach that modifies some other data, but you should resist for reasons described in this question and answer.
Edit: Don't try to print large RDDs
Several readers have asked about using collect() and println() to see their results, as in the example above. Of course, this only works if you're running in an interactive mode like the Spark REPL (read-eval-print loop). It's best to call collect() on the RDD to get a sequential array for orderly printing. But collect() may bring back too much data, and in any case too much may be printed. Here are some alternative ways to get insight into your RDDs if they're large:
RDD.take(): This gives you fine control over the number of elements you get, but not where they come from; they are defined as the "first" ones, a concept dealt with by various other questions and answers here.
// take() returns an Array so no need to collect()
myHugeRDD.take(20).foreach(a => println(a))
RDD.sample(): This lets you (roughly) control the fraction of results you get, whether sampling uses replacement, and even optionally the random number seed.
// sample() does return an RDD so you may still want to collect()
myHugeRDD.sample(true, 0.01).collect().foreach(a => println(a))
RDD.takeSample(): This is a hybrid: it uses random sampling that you can control, but lets you specify the exact number of results and returns an Array.
// takeSample() returns an Array so no need to collect()
myHugeRDD.takeSample(true, 20).foreach(a => println(a))
RDD.count(): Sometimes the best insight comes from how many elements you ended up with -- I often do this first.
println(myHugeRDD.count())
The fundamental operations in Spark are map and filter.
val txtRDD = someRDD filter { case(id, content) => id.endsWith(".txt") }
the txtRDD will now only contain files that have the extension ".txt"
And if you want to word count those files you can say
// split the documents into words in one long list
val words = txtRDD flatMap { case (id, text) => text.split("\\s+") }
// give each word a count of 1
val wordsT = words map (x => (x, 1))
// sum up the counts for each word
val wordCount = wordsT reduceByKey((a, b) => a + b)
You want to use mapPartitions when you have some expensive initialization you need to perform -- for example, if you want to do Named Entity Recognition with a library like the Stanford coreNLP tools.
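As an illustration, a minimal mapPartitions sketch could look like the following (ExpensiveResource is a hypothetical stand-in for something like a coreNLP pipeline):
val processed = txtRDD.mapPartitions { iter =>
  // initialise the expensive resource once per partition, not once per element
  val resource = new ExpensiveResource()                        // hypothetical costly setup
  iter.map { case (id, text) => (id, resource.process(text)) }  // hypothetical per-element call
}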
Master map, filter, flatMap, and reduce, and you are well on your way to mastering Spark.
I would try making use of a partition mapping function. The code below shows how an entire RDD dataset can be processed in a loop so that each input goes through the very same function. I am afraid I have no knowledge of Scala, so everything I have to offer is Java code. However, it should not be very difficult to translate it into Scala.
JavaRDD<String> res = file.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
    @Override
    public Iterable<String> call(Iterator<String> t) throws Exception {
        ArrayList<String> tmpRes = new ArrayList<>();
        while (t.hasNext()) {
            String element = t.next(); // consume the next element of the partition
            tmpRes.add(element);       // do something with each element here
        }
        return tmpRes;
    }
}).cache();
What wholeTextFiles returns is a pair RDD:
def wholeTextFiles(path: String, minPartitions: Int): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Here is an example of reading the files at a local path then printing every filename and content.
val conf = new SparkConf().setAppName("scala-test").setMaster("local")
val sc = new SparkContext(conf)
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.collect
.foreach(t => println(t._1 + ":" + t._2));
the result:
file:/Users/leon/Documents/test/1.txt:{"name":"tom","age":12}
file:/Users/leon/Documents/test/2.txt:{"name":"john","age":22}
file:/Users/leon/Documents/test/3.txt:{"name":"leon","age":18}
Or convert the pair RDD to an RDD of just the values first:
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.map(t => t._2)
.collect
.foreach { x => println(x)}
the result:
{"name":"tom","age":12}
{"name":"john","age":22}
{"name":"leon","age":18}
And I think wholeTextFiles is better suited to small files.
for (element <- YourRDD) {
  // do what you want with element in each iteration
  // (if you want the index of the element, simply use a counter variable in this loop, starting from 0)
  println(element._1) // this will print all the filenames
}

Why does :+ appending to a Seq have no effect?

I want to have a result sequence of triples (String, Int, Int), like this:
var all_info: Seq[(String, Int, Int)] = null
Now I try adding elements to my Seq as follows:
if (all_info == null) {
  all_info = Seq((name, id, count))
} else {
  all_info :+ (name, id, count)
}
and print them
Console.println(all_info.mkString)
Unfortunately, the only triple printed is the first one, which is added by the if-branch and basically initializes a new Seq, since all_info was just null before.
All the following triples, which are supposed to be added to the Seq in the else-branch, are not added.
I also tried other methods like ++, which won't work either ("too many arguments").
Can't really figure out what I'm doing wrong here.
Thanks for any help in advance!
Greetings.
First of all, instead of using nulls you would be better off using an empty collection. Next, use :+= so that the result of :+ is not thrown away: :+ produces a new collection as its result instead of modifying the existing one. The final code would look like:
var all_info: Seq[(String, Int, Int)] = Seq.empty
all_info :+= (name, id, count)
As you can see, you no longer need the ifs, and the code should work fine.
The :+ method creates a new collection and leaves your original collection untouched.
You should use the += method. If there is no += method on all_info, the compiler will treat all_info += (name, id, count) as all_info = all_info + (name, id, count).
In contrast, if you change the type of all_info to some mutable collection, it will have a += method, so your code will work as expected: += on mutable collections modifies the target collection.
Note that there is no :+= method on mutable collections either, so all_info :+= (name, id, count) will still be rewritten as all_info = all_info :+ (name, id, count), even for a mutable collection.
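For completeness, the mutable variant could look something like this (a sketch assuming the same name, id, and count values as in the question):
import scala.collection.mutable.ListBuffer

val all_info: ListBuffer[(String, Int, Int)] = ListBuffer.empty
all_info += ((name, id, count)) // += mutates the buffer in place; double parentheses pass the tuple as a single element
Console.println(all_info.mkString)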