Grouping duplicates in a CSV file - Scala

I have a CSV file that I've parsed into a list of case class instances. The CSV file looks like this:
"user_id","age","liked_ad","location"
2145,34,true,USA
6786,25,true,UK
9025,21,false,USA
1145,40,false,UK
It goes on. Ultimately I am trying to find the top user_ids, i.e. those with the most liked_ad true values. I know that there are duplicates within the CSV file because I did this:
val origFile = processCSV("src/main/resources/advert-data.csv")
val origFileLength = origFile.length
val uniqueList = origFile.distinct
val uniqueListLength = uniqueList.length
The two lengths were different. My idea is to group the entries by user_id, so that I can then count how many true values each user has, but I am completely stuck on the right way to go about this.
This is my processCSV function at the moment:
final case class AdvertInfo(userId: Int, age: Int, likedAd: Boolean, location: String)
def processCSV(file: String): List[AdvertInfo] = {
  val data = io.Source.fromFile(file)
  data
    .getLines()
    .map(_.split(',').iterator.map(_.trim).toList)
    .flatMap {
      case userIdRaw :: ageRaw :: likedAdRaw :: locationRaw :: Nil =>
        for {
          userId   <- userIdRaw.toIntOption
          age      <- ageRaw.toIntOption
          likedAd  <- likedAdRaw.toBooleanOption
          location <- Some(locationRaw)
        } yield AdvertInfo(userId, age, likedAd, location)
      case _ =>
        None
    }.toList
}

Your description is a bit confusing but I think what you want is:
origFile.filter(_.likedAd)
  .groupMapReduce(_.userId)(_ => 1)(_ + _) // Scala 2.13+
The result is a Map with the user_ids as keys and the count of liked_ad == true entries as values.
From there you can .toList.sortBy(-_._2) in order to get the ranking-by-liked-count.
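Putting the pieces together, here is a minimal end-to-end sketch (the file path and the top-3 cutoff are just illustrative; groupMapReduce requires Scala 2.13):
val origFile = processCSV("src/main/resources/advert-data.csv")

val topLikers: List[(Int, Int)] =               // (userId, likedCount) pairs
  origFile
    .filter(_.likedAd)                          // keep only liked_ad == true rows
    .groupMapReduce(_.userId)(_ => 1)(_ + _)    // Map[userId, count]
    .toList
    .sortBy(-_._2)                              // descending by count
    .take(3)
If exact duplicate rows should only be counted once, insert a .distinct before the filter.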

Consolidate a list of Futures into a Map in Scala

I have two case classes P(id: String, ...) and Q(id: String, ...), and two functions returning futures:
One that retrieves a list of objects given a list of id-s:
def retrieve(ids: Seq[String]): Future[Seq[P]] = Future { ... }
The length of the result might be shorter than the input, if not all id-s were found.
One that further transforms P to some other type Q:
def transform(p: P): Future[Q] = Future { ... }
What I would like in the end is, the following. Given ids: Seq[String], calculate a Future[Map[String, Option[Q]]].
Every id from ids should be a key in the map, with id -> Some(q) when it was retrieved successfully (i.e. present in the result of retrieve) and also transformed successfully. Otherwise, the map should contain id -> None (or no entry at all).
How can I achieve this?
Is there an .id property on P or Q? You would need one to create the map. Something like this?
for {
  ps <- retrieve(ids)
  qs <- Future.sequence(ps.map(p => transform(p)))
} yield ids.map(id => id -> qs.find(_.id == id)).toMap
Keep in mind that Map[String,Option[X]] is usually not necessary, since if you have Map[String,X] the .get method on the map will give you an Option[X].
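A quick illustration of that point (made-up values):
val qByUser: Map[String, Int] = Map("a" -> 1, "b" -> 2)
qByUser.get("a") // Some(1)
qByUser.get("z") // None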
Edit: Now assumes that P has a member id that equals the original id-String, otherwise the connection between ids and ps gets lost after retrieve.
def consolidatedMap(ids: Seq[String]): Future[Map[String, Option[Q]]] = {
  for {
    ps <- retrieve(ids)
    qOpts <- Future.traverse(ps) { p =>
      transform(p).map(Option(_)).recover {
        // TODO: don't sweep `Throwable` under the rug in your real code
        case _: Throwable => None
      }
    }
  } yield {
    val qMap = (ps.map(_.id) zip qOpts).toMap
    ids.map { id => (id, qMap.getOrElse(id, None)) }.toMap
  }
}
This builds an intermediate Map from the retrieved Ps and the transformed Qs, so that building the ids-to-Q-options map works in linear time.
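To see it end to end, here is a hedged sketch with stubbed P, Q, retrieve and transform (all the stub bodies are invented for illustration):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

final case class P(id: String, raw: Int)
final case class Q(id: String, doubled: Int)

// stub: pretend only every other id can be found
def retrieve(ids: Seq[String]): Future[Seq[P]] =
  Future(ids.zipWithIndex.collect { case (id, i) if i % 2 == 0 => P(id, i) })

def transform(p: P): Future[Q] = Future(Q(p.id, p.raw * 2))

Await.result(consolidatedMap(Seq("a", "b", "c")), 5.seconds)
// Map(a -> Some(Q(a,0)), b -> None, c -> Some(Q(c,4)))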

Working Scala code using a var in a pure function. Is this possible without a var?

Is it possible (or even worthwhile) to try to write the below code block without a var? It works with a var. This is not for an interview, it's my first attempt at scala (came from java).
The problem: Fit people as close to the front of a theatre as possible, while keeping each request (eg. Jones, 4 tickets) in a single theatre section. The theatre sections, starting at the front, are sized 6, 6, 3, 5, 5... and so on. I'm trying to accomplish this by putting together all of the potential groups of ticket requests, and then choosing the best fitting group per section.
Here are the classes. A SeatingCombination is one possible combination of SeatingRequest (just the IDs) and the sum of their ticketCount(s):
class SeatingCombination(val idList: List[Int], val seatCount: Int){}
class SeatingRequest(val id: Int, val partyName: String, val ticketCount: Int){}
class TheatreSection(val sectionSize: Int, rowNumber: Int, sectionNumber: Int) {
  def id: String = rowNumber.toString + "_" + sectionNumber.toString
}
By the time we get to the below function...
1.) all of the possible combinations of SeatingRequest are in a list of SeatingCombination and ordered by descending size.
2.) all of the TheatreSection are listed in order.
def getSeatingMap(groups: List[SeatingCombination], sections: List[TheatreSection]): HashMap[Int, TheatreSection] = {
  var seatedMap = new HashMap[Int, TheatreSection]
  for (sect <- sections) {
    val bestFitOpt = groups.find(g => g.seatCount <= sect.sectionSize && !isAnyListIdInMap(seatedMap, g.idList))
    bestFitOpt.filter(_.idList.size > 0).foreach(_.idList.foreach(seatedMap.update(_, sect)))
  }
  seatedMap
}
def isAnyListIdInMap(map: HashMap[Int, TheatreSection], list: List[Int]): Boolean = {
  (for (id <- list) yield !map.get(id).isEmpty).reduce(_ || _)
}
I wrote the rest of the program without a var, but in this iterative section it seems impossible; maybe with my implementation strategy it's impossible. From what else I've read, a var inside a pure function can still be functional. But it's been bothering me that I can't think of how to remove the var, because my textbook told me to avoid them, and I don't know what I don't know.
You can use foldLeft to iterate over sections with a running state (and, inside, fold again over the chosen group's ids to add them all to the map):
sections.foldLeft(Map.empty[Int, TheatreSection]) {
  case (seatedMap, sect) =>
    val bestFitOpt = groups.find(g => g.seatCount <= sect.sectionSize && !isAnyListIdInMap(seatedMap, g.idList))
    bestFitOpt
      .filter(_.idList.size > 0)
      .toList                                // convert Option to List
      .flatMap(_.idList)                     // flatten the idLists out of the Option
      .foldLeft(seatedMap)(_ + (_ -> sect))  // add all ids to the map with sect as value
}
By the way, you can simplify the second method using exists and map.contains:
def isAnyListIdInMap(map: Map[Int, TheatreSection], list: List[Int]): Boolean = {
  list.exists(id => map.contains(id))
}
list.exists(p: Int => Boolean) returns true if the predicate p is true for at least one element of list.
map.contains(key) checks whether map is defined at key.
If you want to be even more concise, you don't need to give a name to the argument of the predicate:
list.exists(map.contains)
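For instance (illustrative values):
val seated = Map(1 -> "1_1", 2 -> "1_2") // id -> section id, made up
List(3, 4).exists(seated.contains) // false: neither id is seated yet
List(2, 5).exists(seated.contains) // true: id 2 is already seated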
Simply changing var to val should do it :)
I think, you may be asking about getting rid of the mutable map, not of the var (it doesn't need to be var in your code).
Things like this are usually written recursively in scala or using foldLeft, like other answers suggest. Here is a recursive version:
import scala.annotation.tailrec

@tailrec
def getSeatingMap(
    groups: List[SeatingCombination],
    sections: List[TheatreSection],
    result: Map[Int, TheatreSection] = Map.empty): Map[Int, TheatreSection] = sections match {
  case Nil => result
  case head :: tail =>
    val seated = groups
      .iterator
      .filter(_.idList.nonEmpty)
      .filterNot(_.idList.exists(result.contains))
      .find(_.seatCount <= head.sectionSize)
      .fold(List.empty[(Int, TheatreSection)])(_.idList.map(id => id -> head))
    getSeatingMap(groups, tail, result ++ seated)
}
By the way, I don't think you need to test every id in the list for presence in the map - it should suffice to just look at the first one. You could also make it a bit more efficient, probably, if instead of checking the map every time to see whether the group is already seated, you just dropped it from the input list as soon as its section is assigned:
@tailrec
def selectGroup(
    sect: TheatreSection,
    groups: List[SeatingCombination],
    result: List[SeatingCombination] = Nil
): (List[(Int, TheatreSection)], List[SeatingCombination]) = groups match {
  case Nil => (Nil, result.reverse)
  case head :: tail if head.idList.nonEmpty && head.seatCount <= sect.sectionSize =>
    (head.idList.map(_ -> sect), result.reverse ++ tail)
  case head :: tail => selectGroup(sect, tail, head :: result)
}
and then in getSeatingMap:
...
case head :: tail =>
  val (seated, remaining) = selectGroup(head, groups)
  getSeatingMap(remaining, tail, result ++ seated)
Here is how I was able to achieve it without using the mutable HashMap, following the comment's suggestion to use foldLeft:
class SeatingCombination(val idList: List[Int], val seatCount: Int)
class SeatingRequest(val id: Int, val partyName: String, val ticketCount: Int)
class TheatreSection(val sectionSize: Int, rowNumber: Int, sectionNumber: Int) {
  def id: String = rowNumber.toString + "_" + sectionNumber.toString
}

def getSeatingMap(groups: List[SeatingCombination], sections: List[TheatreSection]): Map[Int, TheatreSection] = {
  sections.foldLeft(Map.empty[Int, TheatreSection]) { (m, sect) =>
    val bestFitOpt = groups.find { g =>
      g.seatCount <= sect.sectionSize && !isAnyListIdInMap(m, g.idList)
    }.filter(_.idList.nonEmpty)
    val newEntries = bestFitOpt.map(_.idList.map(_ -> sect)).getOrElse(List.empty)
    m ++ newEntries
  }
}
def isAnyListIdInMap(map: Map[Int, TheatreSection], list: List[Int]): Boolean = {
  // exists is safe on an empty list, unlike reduce(_ || _), which would throw
  list.exists(map.contains)
}

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of a Stack Overflow user base. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from Reputation (in this case "11849") and the number from Age (in this example "35"), and I would like to have them as floats.
The file is located in HDFS, so it comes in as an RDD:
val linesWithAge = lines.filter(line => line.contains("Age=")) // filter out data which doesn't have an Age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) // split the data on the quotation marks
When I split on quotation marks the reputation is at index 3 and the age at index 23, but how do I assign these to a map or a variable so I can use them as floats? Also, I need to do this for every line in the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So I've added an index to the array and managed to assign one value to a variable, but I can only do it for one item in the RDD. Does anyone know how I could do it for all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name: String): Option[String] = {
  for {
    entry <- src.find(_.startsWith(name))
    value <- entry.split("\"").lift(1)
  } yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap { line =>
  val elements = line.split(" ")
  for {
    age <- findElement(elements, "Age")
    rep <- findElement(elements, "Reputation")
  } yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None, hence the result of the for-comprehension will be None, and the record will be dropped by the flatMap operation.
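The dropping behaviour is easy to see in isolation (a toy sketch):
Seq("Age=\"35\"", "NoQuotes").flatMap(_.split("\"").lift(1))
// Seq("35") -- the token without a quoted value simply disappears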
A better way to approach this problem is to recognize that we are dealing with structured XML data. Scala has XML support (in recent versions it lives in the separate scala-xml module), so we can do this:
import scala.xml.XML

// helper function mapping a String to an Option, where empty strings become None
def emptyStrToNone(str: String): Option[String] = if (str.isEmpty) None else Some(str)

val xmlReputationByAge = lines.flatMap { line =>
  val record = XML.loadString(line)
  for {
    rep <- emptyStrToNone((record \ "@Reputation").text)
    age <- emptyStrToNone((record \ "@Age").text)
  } yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
First, you need a function which extracts the value for a given key from a line (getValueForKeyAs below), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs(line, "Age")(_.toFloat), getValueForKeyAs(line, "Reputation")(_.toFloat)))
This should give you an RDD of type RDD[(Float, Float)].
getValueForKeyAs could be implemented like this:
def getValueForKeyAs[A](line: String, key: String)(parse: String => A): A = {
  val res = line.split(key + "=")
  if (res.size == 1) throw new RuntimeException(s"no value for key $key")
  // a plain asInstanceOf[A] cannot turn a String into a Float, so the
  // caller supplies the conversion (e.g. _.toFloat)
  parse(res(1).split("\"")(1))
}

Does this specific exercise lend itself well to a 'functional style' design pattern?

Say we have an array of one-dimensional JavaScript objects contained in a file Array.json, for which the key schema isn't known; that is, the keys aren't known until the file is read.
We then wish to output a CSV file whose header (first line) is the comma-delimited set of keys from all of the objects.
Each subsequent line of the file should contain the comma-separated values corresponding to each key, for one object.
Array.json
[
  {
    abc: 123,
    xy: "yz",
    s12: 13
  },
  ...
  {
    abc: 1,
    s: 133
  }
]
A valid output:
abc,xy,s12,s
123,yz,13,
1,,,133
I'm teaching myself 'functional style' programming but I'm thinking that this problem doesn't lend itself well to a functional solution.
I believe that this problem requires some state to be kept for the output header and that subsequently each line depends on that header.
I'm looking to solve the problem in a single pass. My goals are efficiency for a large data set, minimal traversals, and if possible, parallelizability. If this isn't possible then can you give a proof or reasoning to explain why?
EDIT: Is there a way to solve the problem like this functionally? Say you pass through the array once, in some particular order. Then from the start the header set looks like abc,xy,s12 for the first object, with CSV entry 123,yz,13. Then on the next object we add an additional key to the header set, so abc,xy,s12,s would be the header and the CSV entry would be 1,,,133. In the end we wouldn't need to pass through the data set a second time; we could just append extra commas to the result set. This is one way we could approach a single pass...
Are there functional tools (functions) designed to solve problems like this, and what should I be considering? (By functional tools I mean monads, flatMap, filters, etc.) Alternatively, should I be considering things like Futures?
Currently I've been trying to approach this using Java 8, but am open to solutions from Scala, etc. Ideally I would like to determine whether Java 8's functional approach can solve the problem, since that's the language I'm currently working in.
Since the CSV output will change with every new line of input, you must hold it in memory before writing it out. If you consider creating the output text from an internal representation of the CSV file another "pass" over the data (the internal representation is practically a Map[String, List[String]], which you must traverse to convert it to text), then it's not possible to do this in a single pass.
If, however, this is acceptable, then you can use a Stream to read a single item from your JSON file, merge it into the CSV file, and repeat until the stream is empty.
Assuming that the internal representation of the CSV file is
trait CsvFile {
  def merge(line: Map[String, String]): CsvFile
}
And you can represent a single item as
trait Item {
  def asMap: Map[String, String]
}
You can implement it using foldLeft:
def toCsv(items: Stream[Item]): CsvFile =
  items.foldLeft(CsvFile(Map()))((csv, item) => csv.merge(item.asMap))
or use recursion to get the same result
@tailrec def toCsv(items: Stream[Item], prevCsv: CsvFile): CsvFile =
  items match {
    case Stream.Empty => prevCsv
    case item #:: rest =>
      val newCsv = prevCsv.merge(item.asMap)
      toCsv(rest, newCsv)
  }
(In Scala 2.13, Stream is deprecated; LazyList works the same way here.)
Note: Of course you don't have to create types for CsvFile or Item, you can use Map[String,List[String]] and Map[String,String] respectively
UPDATE:
As more detail was requested for the CsvFile trait/class, here's an example implementation:
case class CsvFile(lines: Map[String, List[String]], rowCount: Int = 0) {
  def merge(line: Map[String, String]): CsvFile = {
    val orig = lines.withDefaultValue(List.fill(rowCount)(""))
    val current = line.withDefaultValue("")
    val newLines = (lines.keySet ++ line.keySet).map { k =>
      (k, orig(k) :+ current(k))
    }
    CsvFile(newLines.toMap, rowCount + 1)
  }
}
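What the answer counts as the second "pass", rendering the internal representation to text, could then be sketched like this (column order is whatever order the Map yields its keys in):
def render(csv: CsvFile): String = {
  val keys = csv.lines.keys.toList
  val header = keys.mkString(",")
  val rows = (0 until csv.rowCount).map { r =>
    keys.map(k => csv.lines(k)(r)).mkString(",")
  }
  (header +: rows).mkString("\n")
}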
This could be one approach:
val arr = Array(Map("abc" -> 123, "xy" -> "yz", "s12" -> 13), Map("abc" -> 1, "s" -> 133))
val keys = arr.flatMap(_.keys).distinct // get the distinct keys for header
arr.map(x => keys.map(y => x.getOrElse(y,""))) // get an array of rows
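To produce the CSV text from the question, one more step on top of that (a sketch):
val rows = arr.map(x => keys.map(y => x.getOrElse(y, "")).mkString(","))
(keys.mkString(",") +: rows).mkString("\n")
// abc,xy,s12,s
// 123,yz,13,
// 1,,,133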
It's completely OK to have state in functional programming, but mutable state and in-place mutation are not allowed. Functional programming advocates creating new, changed state instead of mutating state in place.
So it's OK to read and access state created in the program, as long as you are not mutating it or causing side effects.
Coming to the point:
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.map { inner => inner.map { case (k, v) => k}}.flatten
list.map { inner => inner.map { case (k, v) => v}}.flatten
REPL
scala> val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list: List[List[(String, String)]] = List(List((abc,123), (xy,yz)), List((abc,1)))
scala> list.map { inner => inner.map { case (k, v) => k}}.flatten
res1: List[String] = List(abc, xy, abc)
scala> list.map { inner => inner.map { case (k, v) => v}}.flatten
res2: List[String] = List(123, yz, 1)
or use flatMap instead of map and flatten
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.flatMap { inner => inner.map { case (k, v) => k}}
list.flatMap { inner => inner.map { case (k, v) => v}}
In functional programming, mutable state is not allowed, but immutable values are fine.
Assuming that you have read your JSON file into a value input: List[Map[String, String]], the code below will solve your problem:
val input = List(Map("abc" -> "123", "xy" -> "yz", "s12" -> "13"), Map("abc" -> "1", "s" -> "33"))
val keys = input.map(_.keys).flatten.toSet
// build each row by iterating over `keys` directly, so the column order is
// guaranteed to match the header (going through an intermediate Map and
// calling .values on it would not guarantee that)
val keyvalues = input.map(kvs => keys.toList.map(k => kvs.getOrElse(k, "")))
val result = keys.mkString(",") + "\n" + keyvalues.map(_.mkString(",")).mkString("\n")

Creating a Map by reading elements of List in Scala

I have some records in a List.
Now I want to create a new Map (a mutable Map) from that List, with a unique key for each record. I want to achieve this by reading the List and calling the higher-order method map in Scala.
records.txt is my input file
100,Surender,2015-01-27
100,Surender,2015-01-30
101,Raja,2015-02-19
Expected output:
Map(0 -> 100,Surender,2015-01-27, 1 -> 100,Surender,2015-01-30, 2 -> 101,Raja,2015-02-19)
Scala code:
import scala.io.Source

object SampleObject {
  def main(args: Array[String]) = {
    val mutableMap = scala.collection.mutable.Map[Int, String]()
    var i: Int = 0
    val myList = Source.fromFile("D:\\Scala_inputfiles\\records.txt").getLines().toList
    println(myList)
    val resultList = myList.map { x =>
      mutableMap(i) = x.toString()
      i = i + 1
    }
    println(mutableMap)
  }
}
But I am getting output like below
Map(1 -> 101,Raja,2015-02-19)
I want to understand why it is keeping the last record alone .
Could someone help me?
val mm: Map[Int, String] = Source.fromFile(filename).getLines
  .zipWithIndex
  .map({ case (line, i) => i -> line })(collection.breakOut)
Here the (collection.breakOut) is to avoid the extra pass caused by toMap.
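Note that collection.breakOut was removed in Scala 2.13; there the equivalent is .to(Map) (a sketch, with filename standing in for your path):
val mm: Map[Int, String] =
  Source.fromFile(filename).getLines()
    .zipWithIndex
    .map { case (line, i) => i -> line }
    .to(Map)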
Consider
(for {
  (line, i) <- Source.fromFile(filename).getLines.zipWithIndex
} yield i -> line).toMap
where we read each line, associate an index value starting from zero and create a map out of each association.
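And if you specifically want the mutable Map from the question, a minimal variant (sketch):
import scala.collection.mutable
import scala.io.Source

val mm: mutable.Map[Int, String] =
  mutable.Map.from( // Scala 2.13+; swap turns (line, index) into (index, line)
    Source.fromFile("D:\\Scala_inputfiles\\records.txt").getLines().zipWithIndex.map(_.swap)
  )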