Scala two list match data - scala

First List data as below
List(("A",66729122803169854198650092,"SD"),("B",14941578978240528153321786,"HD"),("C",14941578978240528153321786,"PD"))
and second list contains data as below
List(("X",14941578978240528153321786),("Y",68277588597782900503675727),("Z",14941578978240528153321786),("L"66729122803169854198650092))
using above two list I want to form following list which matched first list second number to second list second number so my output should as below
List(("X",14941578978240528153321786,"B","HD"),("X",14941578978240528153321786,"C","PD"), ("Y",68277588597782900503675727,"",""),("Z",14941578978240528153321786,"B","HD"),("Z",14941578978240528153321786,"C","PD"),
("L",66729122803169854198650092,"A","SD"))

val tuples3 = List(
("A", "66729122803169854198650092", "SD"),
("B", "14941578978240528153321786", "HD"),
("C", "14941578978240528153321786", "PD"))
val tuples2 = List(
("X", "14941578978240528153321786"),
("Y", "68277588597782900503675727"),
("Z", "14941578978240528153321786"),
("L", "66729122803169854198650092"))
Group first list by target field:
val tuples3Grouped =
tuples3
.groupBy(_._2)
.mapValues(_.map(t => (t._1, t._3)))
.withDefaultValue(List(("", "")))
Zip all data:
val result = for{ (first, second) <- tuples2
t <- tuples3Grouped(second)
} yield (first, second, t._1, t._2)

Related

SparkMap[RDD] - map toMany-One

I have a rdd in the following form:
[ ("a") -> (pos3, pos5), ("b") -> (pos1, pos7), .... ]
and
(pos1 ,pos2, ............, posn)
Q: How can I map each position to its key ?(to be something like the following)
("b", "e", "a", "d", "a" .....)
// "b" correspond to pos 1, "e" correspond to pose 2 and ...
Example(edit) :
// chunk of my data
val data = Vector(("a",(124)), ("b",(125)), ("c",(121, 123)), ("d",(122)),..)
val rdd = sc.parallelize(data)
// from rdd I can create my position rdd which is something like:
val positions = Vector(1,2,3,4,.......125) // my positions
// I want to map each position to my tokens("a", "b", "c", ....) to achive:
Vector("a", "b", "a", ...)
// a correspond to pos1, b correspond to pos2 ...
Not sure you have to use Spark to address this specific use case (starting with a Vector, ending with a Vector containing all your data characters).
Nevertheless, here's some suggestion if it suits your needs :
val data = Vector(("a",Set(124)), ("b", Set(125)), ("c", Set(121, 123)), ("d", Set(122)))
val rdd = spark.sparkContext.parallelize(data)
val result = rdd.flatMap{case (k,positions) => positions.map(p => Map(p -> k))}
.reduce(_ ++ _) //here, we aggregate the Map objects together, reducing partitions first and then merging executors results
.toVector
.sortBy(_._1) //We sort data based on position
.map(_._2) // We only keep characters
.mkString

Convert list into multiple tuples in scala

I have the following tuples: (1,"3idiots",List("Action","Adventure","Horror") Which I need to convert into a list in the following format:
List(
(1,"3idiots","Action"),
(1,"3idiots","Adventure")
)
To add to previous answers, you can also use for-comprehension in this case; it might make things clearer IMHO:
for(
(a,b,l) <- ts;
s <- l
) yield (a,b,s)
So if you have:
val ts = List(
("a","1", List("foo","bar","baz")),
("b","2", List("foo1","bar1","baz1"))
)
You will get:
List(
(a,1,foo),
(a,1,bar),
(a,1,baz),
(b,2,foo1),
(b,2,bar1),
(b,2,baz1)
)
Assuming that you have more than one tuple like this:
val tuples = List(
(1, "3idiots", List("Action", "Adventure", "Horror")),
(2, "foobar", List("Foo", "Bar"))
)
and you want result like this:
List(
(1, "3idiots", "Action"),
(1, "3idiots" , "Adventure"),
(1, "3idiots", "Horror"),
(2, "foobar", "Foo"),
(2, "foobar", "Bar")
)
the solution for you would be to use a flatMap, which can convert a list of lists to a single list:
tuples.flatMap(t =>
t._3.map(s =>
(t._1, t._2, s)
)
)
or shorter: tuples.flatMap(t => t._3.map((t._1, t._2, _)))
This should do what you want:
val input = (1,"3idiots",List("Action","Adventure","Horror"))
val result = input._3.map(x => (input._1,input._2,x))
// gives List((1,3idiots,Action), (1,3idiots,Adventure), (1,3idiots,Horror))
You can use this.
val question = (1,"3idiots",List("Action","Adventure","Horror"))
val result = question._3.map(x=> (question._1 , question._2 ,x))

Add random elements to keyed RDD from the same RDD

Imagine we have a keyed RDD RDD[(Int, List[String])] with thousands of keys and thousands to millions of values:
val rdd = sc.parallelize(Seq(
(1, List("a")),
(2, List("a", "b")),
(3, List("b", "c", "d")),
(4, List("f"))))
For each key I need to add random values from other keys. Number of elements to add varies and depends on the number of elements in the key. So that the output could look like:
val rdd2: RDD[(Int, List[String])] = sc.parallelize(Seq(
(1, List("a", "c")),
(2, List("a", "b", "b", "c")),
(3, List("b", "c", "d", "a", "a", "f")),
(4, List("f", "d"))))
I came up with the following solution which is obviously not very efficient (note: flatten and aggregation is optional, I'm good with flatten data):
// flatten the input RDD
val rddFlat: RDD[(Int, String)] = rdd.flatMap(x => x._2.map(s => (x._1, s)))
// calculate number of elements for each key
val count = rddFlat.countByKey().toSeq
// foreach key take samples from the input RDD, change the original key and union all RDDs
val rddRandom: RDD[(Int, String)] = count.map { x =>
(x._1, rddFlat.sample(withReplacement = true, x._2.toDouble / count.map(_._2).sum, scala.util.Random.nextLong()))
}.map(x => x._2.map(t => (x._1, t._2))).reduce(_.union(_))
// union the input RDD with the random RDD and aggregate
val rddWithRandomData: RDD[(Int, List[String])] = rddFlat
.union(rddRandom)
.aggregateByKey(List[String]())(_ :+ _, _ ++ _)
What's the most efficient and elegant way to achieve that?
I use Spark 1.4.1.
By looking at the current approach, and in order to ensure the scalability of the solution, probably the area of focus should be to come up with a sampling mechanism that can be done in a distributed fashion, removing the need for collecting the keys back to the driver.
In a nutshell, we need a distributed method to a weighted sample of all the values.
What I propose is to create a matrix keys x values where each cell is the probability of the value being chosen for that key. Then, we can randomly score that matrix and pick those values that fall within the probability.
Let's write a spark-based algo for that:
// sample data to guide us.
//Note that I'm using distinguishable data across keys to see how the sample data distributes over the keys
val data = sc.parallelize(Seq(
(1, List("A", "B")),
(2, List("x", "y", "z")),
(3, List("1", "2", "3", "4")),
(4, List("foo", "bar")),
(5, List("+")),
(6, List())))
val flattenedData = data.flatMap{case (k,vlist) => vlist.map(v=> (k,v))}
val values = data.flatMap{case (k,list) => list}
val keysBySize = data.map{case (k, list) => (k,list.size)}
val totalElements = keysBySize.map{case (k,size) => size}.sum
val keysByProb = keysBySize.mapValues{size => size.toDouble/totalElements}
val probMatrix = keysByProb.cartesian(values)
val scoredSamples = probMatrix.map{case ((key, prob),value) =>
((key,value),(prob, Random.nextDouble))}
ScoredSamples looks like this:
((1,A),(0.16666666666666666,0.911900315814998))
((1,B),(0.16666666666666666,0.13615047422122906))
((1,x),(0.16666666666666666,0.6292430257377151))
((1,y),(0.16666666666666666,0.23839887096373114))
((1,z),(0.16666666666666666,0.9174808344986465))
...
val samples = scoredSamples.collect{case (entry, (prob,score)) if (score<prob) => entry}
samples looks like this:
(1,foo)
(1,bar)
(2,1)
(2,3)
(3,y)
...
Now, we union our sampled data with the original and have our final result.
val result = (flattenedData union samples).groupByKey.mapValues(_.toList)
result.collect()
(1,List(A, B, B))
(2,List(x, y, z, B))
(3,List(1, 2, 3, 4, z, 1))
(4,List(foo, bar, B, 2))
(5,List(+, z))
Given that all the algorithm is written as a sequence of transformations on the original data (see DAG below), with minimal shuffling (only the last groupByKey, which is done over a minimal result set), it should be scalable. The only limitation would be the list of values per key in the groupByKey stage, which is only to comply with the representation used the question.

Dropping multiple columns from Spark dataframe by Iterating through the columns from a Scala List of Column names

I have a dataframe which has columns around 400, I want to drop 100 columns as per my requirement.
So i have created a Scala List of 100 column names.
And then i want to iterate through a for loop to actually drop the column in each for loop iteration.
Below is the code.
final val dropList: List[String] = List("Col1","Col2",...."Col100”)
def drpColsfunc(inputDF: DataFrame): DataFrame = {
for (i <- 0 to dropList.length - 1) {
val returnDF = inputDF.drop(dropList(i))
}
return returnDF
}
val test_df = drpColsfunc(input_dataframe)
test_df.show(5)
If you just want to do nothing more complex than dropping several named columns, as opposed to selecting them by a particular condition, you can simply do the following:
df.drop("colA", "colB", "colC")
Answer:
val colsToRemove = Seq("colA", "colB", "colC", etc)
val filteredDF = df.select(df.columns .filter(colName => !colsToRemove.contains(colName)) .map(colName => new Column(colName)): _*)
This should work fine :
val dropList : List[String] |
val df : DataFrame |
val test_df = df.drop(dropList : _*)
You can just do,
def dropColumns(inputDF: DataFrame, dropList: List[String]): DataFrame =
dropList.foldLeft(inputDF)((df, col) => df.drop(col))
It will return you the DataFrame without the columns passed in dropList.
As an example (of what's happening behind the scene), let me put it this way.
scala> val list = List(0, 1, 2, 3, 4, 5, 6, 7)
list: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7)
scala> val removeThese = List(0, 2, 3)
removeThese: List[Int] = List(0, 2, 3)
scala> removeThese.foldLeft(list)((l, r) => l.filterNot(_ == r))
res2: List[Int] = List(1, 4, 5, 6, 7)
The returned list (in our case, map it to your DataFrame) is the latest filtered. After each fold, the latest is passed to the next function (_, _) => _.
You can use the drop operation to drop multiple columns. If you are having column names in the list that you need to drop than you can pass that using :_* after the column list variable and it would drop all the columns in the list that you pass.
Scala:
val df = Seq(("One","Two","Three"),("One","Two","Three"),("One","Two","Three")).toDF("Name","Name1","Name2")
val columnstoDrop = List("Name","Name1")
val df1 = df.drop(columnstoDrop:_*)
Python:
In python you can use the * operator to do the same stuff.
data = [("One", "Two","Three"), ("One", "Two","Three"), ("One", "Two","Three")]
columns = ["Name","Name1","Name2"]
df = spark.sparkContext.parallelize(data).toDF(columns)
columnstoDrop = ["Name","Name1"]
df1 = df.drop(*columnstoDrop)
Now in df1 you would get the dataframe with only one column i.e Name2.

How to insert elements in list from another list using scala?

I have two lists -
A = (("192.168.1.1","private","Linux_server","str1"),
("192.168.1.2","private","Linux_server","str2"))
B = ("A","B")
I want following output
outputList = (("192.168.1.1","private","Linux_server","str1", "A"),
("192.168.1.2","private","Linux_server","str2","B"))
I want to insert second list element into first list as list sequence.
Two lists size will be always same.
How do I get above output using scala??
The short answer:
A = (A zip B).map({ case (x, y) => x :+ y })
Some compiling code to be more explicit:
val a = List(
List("192.168.1.1", "private", "Linux_server", "str1"),
List("192.168.1.2", "private", "Linux_server", "str2")
)
val b = List("A", "B")
val c = List(
List("192.168.1.1", "private", "Linux_server", "str1", "A"),
List("192.168.1.2", "private", "Linux_server", "str2", "B")
)
assert((a zip b).map({ case (x, y) => x :+ y }) == c)