Group by of list - scala

I have a list with 5 elements:
data = List((List(c1),Y), (List(c1),N), (List(c1),N), (List(c1),Y), (List(c1),Y))
And I want to produce a list like the following:
List((List(c1),Y,0.666), (List(c1),N,0.333))
Any tips on the best way to do this?
I am using Scala, if that's any help.

object Grouping {
  def main(args: Array[String]): Unit = {
    val data = List((List("c1"), "Y"), (List("c1"), "N"), (List("c1"), "N"), (List("c1"), "Y"), (List("c1"), "Y"))
    val result = data
      .groupBy(grp => (grp._1, grp._2))
      .mapValues(count => (BigDecimal(count.size) / BigDecimal(data.size))
        .setScale(3, BigDecimal.RoundingMode.HALF_UP))
      .map(k => (k._1._1, k._1._2, k._2))
      .toList
    println("result==" + result)
  }
}

def calculatePercentages(data: List[(List[String], String)]): List[((List[String], String), BigDecimal)] = {
  val (yesRows, noRows) = data.partition(_._2 == "Y")
  List(
    (yesRows(0), (BigDecimal(yesRows.length) / BigDecimal(data.length)).setScale(3, BigDecimal.RoundingMode.HALF_UP)),
    (noRows(0), (BigDecimal(noRows.length) / BigDecimal(data.length)).setScale(3, BigDecimal.RoundingMode.HALF_UP)))
}
scala> calculatePercentages(data)
res30: List[((List[String], String), BigDecimal)] = List(((List(c1),Y),0.600), ((List(c1),N),0.400))

Thank you very much for your support. Your code ran properly on my first example. However, with more complex data, such as the list below, the result is not what I expected.
List(
(List(c1, a1),Y),
(List(a1),Y),
(List(c1, a1),N),
(List(a1),N),
(List(a1),Y))
and I want the result to be
List(
(List(c1, a1),Y, 0.5),
(List(c1, a1),N, 0.5),
(List(a1),Y, 0.66),
(List(a1),N, 0.33))
I look forward to your support.
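Since in this second example the denominator is the size of each List key's group rather than the whole data set, one approach is to group by the key first and then compute the flag frequencies within each group. Here is a minimal sketch along those lines (the object name and sample data are just illustrative; rounding to three places with HALF_UP yields 0.667/0.333 rather than 0.66/0.33):
object GroupPercentages {
  def main(args: Array[String]): Unit = {
    val data = List(
      (List("c1", "a1"), "Y"),
      (List("a1"), "Y"),
      (List("c1", "a1"), "N"),
      (List("a1"), "N"),
      (List("a1"), "Y"))

    val result = data
      .groupBy(_._1)                      // group rows by the List key
      .flatMap { case (key, rows) =>
        rows.groupBy(_._2)                // within each key, group by the Y/N flag
          .map { case (flag, flagged) =>
            val pct = (BigDecimal(flagged.size) / BigDecimal(rows.size))
              .setScale(3, BigDecimal.RoundingMode.HALF_UP)
            (key, flag, pct)
          }
      }
      .toList

    println(result)
    // e.g. List((List(c1, a1),Y,0.500), (List(c1, a1),N,0.500), (List(a1),Y,0.667), (List(a1),N,0.333));
    // ordering may differ since the intermediate result is a Map
  }
}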

Related

Transform a list of objects to lists of its fields

I have a List[MyObject], with MyObject containing the fields field1, field2 and field3.
I'm looking for an efficient way of doing:
Tuple3(_.map(_.field1), _.map(_.field2), _.map(_.field3))
In Java I would do something like:
List<Field1Type> f1 = new ArrayList<>();
List<Field2Type> f2 = new ArrayList<>();
List<Field3Type> f3 = new ArrayList<>();
for (MyObject mo : myObjects) {
    f1.add(mo.getField1());
    f2.add(mo.getField2());
    f3.add(mo.getField3());
}
I would like something more functional since I'm in Scala, but I can't put my finger on it.
Get 2/3 sub-groups with unzip/unzip3
Assuming the starting point:
val objects: Seq[MyObject] = ???
You can unzip to get all 3 sub-groups:
val (firsts, seconds, thirds) =
  objects.unzip3((o: MyObject) => (o.f1, o.f2, o.f3))
What if I have more than 3 relevant sub-groups?
If you really need more sub-groups you will have to implement your own unzipN; however, instead of working with Tuple22 I would personally use an adapter:
case class MyObjectsProjection(private val objs: Seq[MyObject]) {
  lazy val f1s: Seq[String] =
    objs.map(_.f1)
  lazy val f2s: Seq[String] =
    objs.map(_.f2)
  ...
  lazy val f22s: Seq[String] =
    objs.map(_.f3)
}
val objects: Seq[MyObject] = ???
val objsProjection = MyObjectsProjection(objects)
objsProjection.f1s
objsProjection.f2s
...
objsProjection.f22s
Notes:
Change MyObjectsProjection according to your needs.
This is from a Scala 2.12/2.11 vanilla perspective.
The following will decompose your objects into three lists:
case class MyObject[T, S, R](f1: T, f2: S, f3: R)

val myObjects: Seq[MyObject[Int, Double, String]] = ???

// Note: prepending with :: builds each result list in reverse input order;
// reverse each list (or use foldRight) if the original order matters.
val (l1, l2, l3) = myObjects.foldLeft((List.empty[Int], List.empty[Double], List.empty[String]))((acc, nxt) => {
  (nxt.f1 :: acc._1, nxt.f2 :: acc._2, nxt.f3 :: acc._3)
})

Efficient way to collect HashSet during map operation on some Dataset

I have a big dataset to transform from one structure to another. During that phase I also want to collect some info about a computed field (quadkeys for given lat/longs). I don't want to attach this info to every result row, since that would duplicate a lot of information and add memory overhead. All I need to know is which particular quadkeys are touched by the given coordinates. Is there any way to do this within one job, so as not to iterate over the dataset twice?
def load(paths: Seq[String]): (Dataset[ResultStruct], Dataset[String]) = {
  val df = sparkSession.sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "true")
    .schema(schema)
    .option("delimiter", "\t")
    .load(paths: _*)
    .as[InitialStruct]

  val qkSet = mutable.HashSet.empty[String]
  val result = df.map(c => {
    val id = c.id
    val points = toPoints(c.geom)
    points.foreach(p => qkSet.add(Quadkey.get(p.lat, p.lon, 6).getId))
    createResultStruct(id, points)
  })
  return result, // some dataset created from the qkSets of all executors
}
You could use accumulators:
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.util.AccumulatorV2

class SetAccumulator[T] extends AccumulatorV2[T, Set[T]] {
  import scala.collection.JavaConverters._

  // a concurrent map used as a set; the Boolean values are ignored
  private val items = new ConcurrentHashMap[T, Boolean]

  override def isZero: Boolean = items.isEmpty

  override def copy(): AccumulatorV2[T, Set[T]] = {
    val other = new SetAccumulator[T]
    other.items.putAll(items)
    other
  }

  override def reset(): Unit = items.clear()

  override def add(v: T): Unit = items.put(v, true)

  override def merge(other: AccumulatorV2[T, Set[T]]): Unit = other match {
    case setAccumulator: SetAccumulator[T] => items.putAll(setAccumulator.items)
  }

  override def value: Set[T] = items.keys().asScala.toSet
}
val df = Seq("foo", "bar", "foo", "foo").toDF("test")
val acc = new SetAccumulator[String]
spark.sparkContext.register(acc)

df.map {
  case Row(str: String) =>
    acc.add(str)
    str
}.count()

println(acc.value)
Prints
Set(bar, foo)
Note that map itself is lazy, so something like count is needed to actually force the calculation. Depending on the real use case, another option would be to cache the data frame and just use plain SQL functions, e.g. df.select("test").distinct().
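For completeness, that second option might look roughly like this (a sketch, assuming the same single-column df as above and that spark.implicits._ is in scope for .as[String]):
import spark.implicits._

df.cache()                                   // reuse the same data for both passes

val distinctValues: Set[String] =
  df.select("test").distinct()               // plain SQL-style distinct instead of an accumulator
    .as[String]
    .collect()
    .toSet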

How to convert a Seq of tuples into Sets of individual elements in Scala

We have a sequence of tuples, Seq(department, title), called depTitleSeq, from which we would like to extract Set(department) and Set(title). We are looking for the best way to do this; so far the best we could come up with is:
val depTitleSeq = getDepTitleTupleSeq()
var departmentSeq = ArrayBuffer[String]()
var titleSeq = ArrayBuffer[String]()
for (depTitle <- depTitleSeq) yield {
  departmentSeq += depTitle._1
  titleSeq += depTitle._2
}
val depSet = departmentSeq.toSet
val titleSet = titleSeq.toSet
We are fairly new to Scala, and I'm sure there are better and more efficient ways to achieve this. If you could please point us in the right direction, it would be of great help.
If you have two Seqs of data that you want combined into a Seq of tuples, you can zip them together.
If you have a Seq of tuples and you want the elements separated, then you can unzip them.
val (departmentSeq, titleSeq) = getDepTitleTupleSeq().unzip
val depSet :Set[String] = departmentSeq.toSet
val titleSet :Set[String] = titleSeq.toSet
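For reference, zip is the reverse operation, pairing two Seqs element-wise; a tiny sketch with made-up values:
val departments = Seq("x", "y")
val titles = Seq("a", "b")
val pairs: Seq[(String, String)] = departments.zip(titles)  // List((x,a), (y,b))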
val depTitleSeq = Seq(("x","a"),("y","b"))
val depSet = depTitleSeq.map(_._1).toSet
val titleSet = depTitleSeq.map(_._2).toSet
In Scala REPL:
scala> val depTitleSeq = Seq(("x","a"),("y","b"))
depTitleSeq: Seq[(String, String)] = List((x,a), (y,b))
scala> val depSet = depTitleSeq.map(_._1).toSet
depSet: scala.collection.immutable.Set[String] = Set(x, y)
scala> val titleSet = depTitleSeq.map(_._2).toSet
titleSet: scala.collection.immutable.Set[String] = Set(a, b)
You can use foldLeft to achieve this:
val result: (Set[String], Set[String]) =
  depTitleSeq.foldLeft((Set[String](), Set[String]())) { (a, b) => (a._1 + b._1, a._2 + b._2) }

Combining files

I am new to Scala. I have two RDDs and I need to separate out my training and testing data. One file has all the data and another has just the testing data. I need to remove the testing data from my complete data set.
The complete data file is of the format (userID, MovID, Rating, Timestamp):
res8: Array[String] = Array(1, 31, 2.5, 1260759144)
The test data file is of the format (userID, MovID):
res10: Array[String] = Array(1, 1172)
How do I generate ratings_train, which should not contain the cases matched in the testing dataset?
I am using the following function but the returned list is showing empty:
def create_training(data: RDD[String], ratings_test: RDD[String]): ListBuffer[Array[String]] = {
  val ratings_split = dropheader(data).map(line => line.split(","))
  val ratings_testing = dropheader(ratings_test).map(line => line.split(",")).collect()
  var ratings_train = new ListBuffer[Array[String]]()
  ratings_split.foreach(x => {
    ratings_testing.foreach(y => {
      if (x(0) != y(0) || x(1) != y(1)) {
        ratings_train += x
      }
    })
  })
  return ratings_train
}
EDIT: changed code but running into memory issues.
This may work.
def create_training(data: RDD[String], ratings_test: RDD[String]): Array[Array[String]] = {
  val ratings_split = dropheader(data).map(line => line.split(","))
  val ratings_testing = dropheader(ratings_test).map(line => line.split(","))
  ratings_split.filter(x => {
    ratings_testing.exists(y =>
      (x(0) == y(0) && x(1) == y(1))
    ) == false
  })
}
The code snippets you posted are not logically correct. A row should only be part of the final data if it has no match at all in the test data. But in your code, a row is picked whenever it fails to match one of the test rows; instead, you should check that it does not match any of the test rows, and only then decide that it is a valid row.
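In code, that check would look more like the following (a sketch reusing the names from your snippet; it assumes ratings_testing has been collected to the driver, as in your original function):
// keep a rating row only if it matches none of the collected test rows
val ratings_train = ratings_split.filter { x =>
  !ratings_testing.exists(y => x(0) == y(0) && x(1) == y(1))
}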
You are using RDDs, but not exploring their full power. I guess you are reading the input from a CSV file. You can then structure your data in the RDD; there is no need to split the string on the comma character and process the rows manually. Take a look at the DataFrame API of Spark. These links may help: https://www.tutorialspoint.com/spark_sql/spark_sql_dataframes.htm , http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
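As a rough sketch of that DataFrame route (the paths and the header-based column names userID and MovID are assumptions based on the file formats described in the question), a left_anti join keeps only the training rows with no matching (userID, MovID) pair in the test set:
// Sketch only: assumes CSV files with headers matching the layouts described above.
val ratingsDF = spark.read.option("header", "true").csv(ratingsPath)  // userID, MovID, Rating, Timestamp
val testDF = spark.read.option("header", "true").csv(testPath)        // userID, MovID

// left_anti keeps rows of ratingsDF that have no match in testDF on (userID, MovID)
val trainDF = ratingsDF.join(testDF, Seq("userID", "MovID"), "left_anti")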
Using Regex:
def main(args: Array[String]): Unit = {
  // creating test data set
  val data = spark.sparkContext.parallelize(Seq(
    // "userID, MovID, Rating, Timestamp",
    "1, 31, 2.5, 1260759144",
    "2, 31, 2.5, 1260759144"))

  val ratings_test = spark.sparkContext.parallelize(Seq(
    // "userID, MovID",
    "1, 31",
    "2, 30",
    "30, 2"
  ))

  val result = getData(data, ratings_test).collect()
  // the result will only contain "2, 31, 2.5, 1260759144"
}
def getData(data: RDD[String], ratings_test: RDD[String]): RDD[String] = {
  val ratings = dropheader(data)
  val ratings_testing = dropheader(ratings_test)

  // Broadcast the test rating data to all Spark nodes, since we collect it beforehand.
  // The reason we collect the test data is to avoid calling collect inside the filter logic.
  val ratings_testing_bc = spark.sparkContext.broadcast(ratings_testing.collect.toSet)

  ratings.filter(rating => {
    ratings_testing_bc.value.exists(testRating => regexMatch(rating, testRating)) == false
  })
}
def regexMatch(data: String, testData: String): Boolean = {
  // Regular expression to extract the first two columns
  val regex = """^([^,]*), ([^,\r\n]*),?""".r

  val (dataCol1, dataCol2) = regex findFirstIn data match {
    case Some(regex(col1, col2)) => (col1, col2)
  }
  val (testDataCol1, testDataCol2) = regex findFirstIn testData match {
    case Some(regex(col1, col2)) => (col1, col2)
  }

  (dataCol1 == testDataCol1) && (dataCol2 == testDataCol2)
}

how to convert my pyspark code into scala?

I am a Scala beginner. Now I have to convert some code I wrote in PySpark to Scala. The code just extracts fields for modeling.
Could someone show me how to write the following code in Scala, or at least point me to where I could quickly find the answer? Thanks so much!
Here is my previous code:
{ val records = rawdata.map(x => x.split(","))
  val data = records.map(r => LabeledPoint(extract_label(r), extract_features(r)))
  ...
  def extract_features(record):
      return np.array(map(float, record[2:16]))
  def extract_label(record):
      return float(record[16])
}
It goes like this:
scala> def extract_label(record: Array[String]): Float = { record(16).toFloat }
extract_label: (record: Array[String])Float
scala> def extract_features(record: Array[String]): Array[Float] = { val newArray = new Array[Float](14); for(i <- 2 until 16) newArray(i-2)=record(i).toFloat; newArray;}
extract_features: (record: Array[String])Array[Float]
There may be a more direct method for the above logic.
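One such more direct version of extract_features could be a slice plus map (a sketch; it assumes the same record layout as above):
// fields 2 through 15 (slice excludes the upper bound), converted to Float
def extract_features(record: Array[String]): Array[Float] =
  record.slice(2, 16).map(_.toFloat)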
Test:
scala> records.map(x => extract_label(x)).take(5).foreach(println)
4.9
scala> records.map(x => extract_features(x).mkString(",")).take(5).foreach(println)
6.4,2.5,4.5,2.8,4.7,2.5,6.4,8.5,3.5,6.4,2.9,10.5,6.4,2.2