Create a list of Map from csv feeder in gatling - scala

I am using a Gatling Pebble template in my Gatling simulation, and it works with a Map like this:
val mapValuesFeeder = Iterator.continually(Map("mapValues" -> List(
  Map("id" -> "1", "weight" -> "10"),
  Map("id" -> "2", "weight" -> "20"),
)))
but I don't want to hardcode these values in the Map. How can I build a similar Map from CSV feeder data?
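One possible approach (a sketch, not from the original post): parse the CSV yourself and wrap all of its records into the single "mapValues" entry the Pebble template expects. The file name weights.csv and its id,weight header are assumptions; Gatling's own csv("weights.csv").readRecords can also give you the parsed records directly if your Gatling version provides it.
import scala.io.Source

// Read the whole CSV once at startup (assumed layout: header "id,weight", then data rows).
val source = Source.fromFile("weights.csv")
val records: List[Map[String, String]] =
  try {
    val lines  = source.getLines().toList
    val header = lines.head.split(",").map(_.trim)
    lines.tail.map(line => header.zip(line.split(",").map(_.trim)).toMap)
  } finally source.close()

// Same shape as the hard-coded feeder above, but built from the CSV contents.
val mapValuesFeeder = Iterator.continually(Map("mapValues" -> records))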

Flatten a map into pairs (key, value) in Scala [closed]

Say I construct the following map in Scala:
val map = Map.empty[String, Seq[String]] + ("A" -> ("1", "2", "3", "4"), "B" -> ("2", "3"), "C" -> ("3", "4"))
My output should be a sequence of single key-value pairs. Namely, it should look like this:
[("A", "1"), ("A", "2"), ("A", "3"), ("A", "4"), ("B", "2"), ("B", "3"), ("C", "3"), ("C", "4")]
How can I obtain this using flatMap?
I would guess that your original goal was to create the following map (note the added Seqs):
val map = Map.empty[String, Seq[String]] + ("A" -> Seq("1", "2", "3", "4"), "B" -> Seq("2", "3"), "C" -> Seq("3", "4"))
Then you can easily transform it with:
val result = map.toSeq.flatMap { case (k, v) => v.map((k, _)) }
Also, you can create the map directly; there is no need to append to an empty map.
val map = Map("A" -> Seq("1", "2", "3", "4"), "B" -> Seq("2", "3"), "C" -> Seq("3", "4"))

Spark 2.3: Reading dataframe inside rdd.map()

I want to iterate through each row of an RDD using .map() and I want to use a dataframe inside the map function as follows:
val rdd = ... // rdd holding seq of ids in each row
val df = ... // df with columns `id: String` and `value: Double`
rdd
  .map { case Row(listOfStrings: Seq[String]) =>
    listOfStrings.foldLeft(Seq[Double]())((temp, curr) => {
      // calling df here
      val extractValue: Double = df.filter(s"id == '$curr'").first().getDouble(1)
      temp :+ extractValue
    })
  }
Above is pseudocode which I made up, and this results in an exception because I cannot call a dataframe inside .map().
The only way I can think of overcoming this is to collect df before .map() so that it is no longer a dataframe. Is there a method in which I can do this without collecting? Note that joining the rdd and df is not suitable.
Basically you have an RDD of lists of IDs (RDD[Seq[String]]) and a dataframe of (id, value) tuples. You are trying to replace the IDs in the RDD with the corresponding values from the dataframe.
The way you are trying to do it is impossible in Spark. You cannot reference a dataframe or an RDD inside a map: they are objects you manipulate in the driver to parallelize jobs executed by the workers, whereas the code inside map runs on a worker, and a worker cannot delegate work to other workers; only the driver can. This is why (intuitively) what you are trying to do is not possible.
You say that a join is not suitable. I am not sure why, but that is exactly what I propose, in combination with a flatMap. I use the RDD API, but similar code could be written with the dataframe API (see the sketch after the example).
// generating data
import spark.implicits._   // needed for .toDF
val data = Seq(Seq("a", "b", "c"), Seq("d", "e"), Seq("f"))
val rdd = sc.parallelize(data)
val df = Seq("a" -> 1d, "b" -> 2d, "c" -> 3d,
             "d" -> 4d, "e" -> 5d, "f" -> 6d)
  .toDF("id", "value")
// Transforming the dataframe into a RDD[(String, Double)]
val rdd_df = df.rdd
  .map(row => row.getAs[String]("id") -> row.getAs[Double]("value"))
val result = rdd
  // We start with zipWithUniqueId to remember how the lists were arranged
  .zipWithUniqueId
  // we flatten the lists, remembering for each row the list id
  .flatMap { case (ids, unique_id) => ids.map(id => id -> unique_id) }
  .join(rdd_df)
  .map { case (_, (unique_id, value)) => unique_id -> value }
  // we reform the lists by grouping by list id
  .groupByKey
  .map(_._2.toArray)
scala> result.collect
res: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0), Array(6.0))
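The dataframe-API variant mentioned above could look roughly like this (a sketch only, reusing the spark.implicits._ import from above; note that collect_list makes no ordering guarantee, so the values may not come back in the exact order of the input lists):
import org.apache.spark.sql.functions._

// explode the id lists, join on the (id, value) dataframe, then regroup per list
val listsDf = rdd.zipWithUniqueId.toDF("ids", "list_id")
val resultDf = listsDf
  .select(col("list_id"), explode(col("ids")).as("id"))
  .join(df, "id")
  .groupBy("list_id")
  .agg(collect_list("value").as("values"))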

Read Data from MongoDB through Apache Spark with a query

I am able to read the data stored in MongoDB via Apache Spark using the conventional methods described in its documentation. I have a MongoDB query that I would like to apply while loading the collection. The query is simple, but I can't seem to find the correct way to specify it in the config() function of the SparkSession object.
Following is my SparkSession builder
val confMap: Map[String, String] = Map(
  "spark.mongodb.input.uri" -> "mongodb://xxx:xxx#mongodb1:27017,mongodb2:27017,mongodb3:27017/?ssl=true&replicaSet=MongoShard-0&authSource=xxx&retryWrites=true&authMechanism=SCRAM-SHA-1",
  "spark.mongodb.input.database" -> "A",
  "spark.mongodb.input.collection" -> "people",
  "spark.mongodb.output.database" -> "B",
  "spark.mongodb.output.collection" -> "result",
  "spark.mongodb.input.readPreference.name" -> "primaryPreferred"
)
val conf = new SparkConf()
conf.setAll(confMap)
val spark: SparkSession =
  SparkSession.builder().master("local[1]").config(conf).getOrCreate()
Is there a way to specify the MongoDB query in the SparkConf object so that the SparkSession reads only specific fields from the collection?
Use the .withPipeline API.
Example Code:
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.bson.Document
val readConfig = ReadConfig(Map("uri" -> MONGO_DEV_URI, "collection" -> MONGO_COLLECTION_NAME, "readPreference.name" -> "secondaryPreferred"))
MongoSpark
  .load(spark.sparkContext, readConfig)
  .withPipeline(Seq(Document.parse(query)))
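Here query is whatever aggregation stage should be pushed down; an illustrative example (not from the original post):
val query = """{ "$match": { "name": { "$exists": true } } }"""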
As per comments:
sparkSession.read.format("com.mongodb.spark.sql.DefaultSource")
  .option("pipeline", "[{ $match: { name: { $exists: true } } }]")
  .option("uri", "mongodb://127.0.0.1/mydb.mycoll")
  .load()
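Since the goal is to read only specific fields from the collection, a $project stage can be appended to the same pipeline. A sketch with placeholder field names ("name", "age") that are not from the original question:
sparkSession.read.format("com.mongodb.spark.sql.DefaultSource")
  .option("pipeline", "[{ $match: { name: { $exists: true } } }, { $project: { name: 1, age: 1, _id: 0 } }]")
  .option("uri", "mongodb://127.0.0.1/mydb.mycoll")
  .load()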

Converting JSON to CSV in Spark

I have a JSON object like the one below:
{"Event":"xyz","Name":"test","Prog":0,"AId":"367","CId":"11522"}
Using the Spark script below, I have converted it into CSV:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.load("org.apache.spark.sql.json", Map("path" -> "test1.json"))
df.save("com.databricks.spark.csv", SaveMode.ErrorIfExists, Map("path" -> "datascv", "header" -> "true"))
I am able to convert it into a CSV file. My output is:
AId,CId,Event,Name,Prog
367,11522,xyz,test,0
but here the CSV header is in ascending order, whereas I want to keep the header in a customized order, the same as in my JSON, like below:
Event,Name,Prog,AId,CId
Please help me with this.
Thanks in advance.
You can try the following.
val selectedData = df.select("Event", "Name", "Prog", "AId", "CId")
selectedData.save("com.databricks.spark.csv", SaveMode.ErrorIfExists,
  Map("path" -> "datascv", "header" -> "true"))

Creating Map values in Spark using Scala

I am new to Spark/Scala development. I am trying to create map values in Spark using Scala but am getting a type mismatch error.
scala> val nums = sc.parallelize(Map("red" -> "#FF0000","azure" -> "#F0FFFF","peru" -> "#CD853F"))
<console>:21: error: type mismatch;
found : scala.collection.immutable.Map[String,String]
required: Seq[?]
Error occurred in an application involving default arguments.
val nums = sc.parallelize(Map("red" -> "#FF0000","azure" -> "#F0FFFF","peru" -> "#CD853F"))
How should I do this?
SparkContext.parallelize transforms a Seq[T] into an RDD[T]. If you want to create an RDD[(String, String)] in which each element is an individual key-value pair from the original Map, use:
import org.apache.spark.rdd.RDD
val m = Map("red" -> "#FF0000","azure" -> "#F0FFFF","peru" -> "#CD853F")
val rdd: RDD[(String, String)] = sc.parallelize(m.toSeq)
If you want an RDD[Map[String, String]] (not that it makes any sense with a single element), use:
val rdd: RDD[Map[String,String]] = sc.parallelize(Seq(m))
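For the first (pair) variant, a quick way to get the data back on the driver, for example to verify the contents (a sketch; pairRdd stands for the RDD[(String, String)] built above):
val pairRdd: RDD[(String, String)] = sc.parallelize(m.toSeq)
pairRdd.collect()   // e.g. Array((red,#FF0000), (azure,#F0FFFF), (peru,#CD853F)), order may vary
// collectAsMap() (from PairRDDFunctions) rebuilds a Map on the driver:
val backToMap: Map[String, String] = pairRdd.collectAsMap().toMap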