How can I process this big data in Scala in IntelliJ?

I started learning Scala in IntelliJ a few days ago, and I am learning all by myself, so please bear with my rookie mistakes. I have a CSV file with more than 10,000 rows and 13 columns.
The column headings are:
Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last updated | Current Version | Android Version
I did manage to read and display the CSV file with the following code:
import scala.io.Source

object task {
  def main(args: Array[String]): Unit = {
    for (line <- Source.fromFile("D:/data.csv")) {
      println(line)
    }
  }
}
The problem with this is that the code prints one character, moves on to the next line, and prints the next character; it does not display a whole row on one line.
I want to find the best app for each category (ART_AND_DESIGN, AUTO_AND_VEHICLES, BEAUTY, …) based on assigned priorities for reviews and ratings. The priorities are 60% for the "Reviews" column and 40% for the "Rating" column. For each category, calculate a priority value using these weights; this value is what identifies the best app in each category. You can use the following priority formula:
Priority = ( (((rating/max_rating) * 100) * 0.4) + (((reviews/max_reviews) * 100) * 0.6) )
Here max_rating is the maximum rating within the same category (for example, the maximum rating in ART_AND_DESIGN is 4.7), and max_reviews is the maximum number of reviews within the same category (for ART_AND_DESIGN it is 295221). So the priority value for the first record of ART_AND_DESIGN is computed from:
Rating = 4.1, Reviews = 159,
max_rating = 4.7, max_reviews = 295221
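Plugging those numbers into the formula above gives, approximately:
Priority = ((4.1 / 4.7) * 100) * 0.4 + ((159 / 295221) * 100) * 0.6
         ≈ 34.89 + 0.03
         ≈ 34.93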
My question is: how can I store every column in an array? That is how I plan to compute the result. If there is another way to solve the above problem, I am open to suggestions.
I can upload a small chunk of the data if anyone wants it.

Source gives you a character Iterator by default. To iterate through lines, use .getLines:
Source.fromFile(fileName)
  .getLines
  .foreach(println)
To split lines into arrays, use split (assuming the column values do not contain the separator). Note that split takes a regular expression, and | is a regex metacharacter, so pass the pipe as a Char (or escape it as "\\|"):
val arrays = Source.fromFile(fileName).getLines.map(_.split('|'))
It is better to avoid using raw arrays though. Creating a case class makes for much better, readable code:
case class AppData(
  category: String,
  rating: Double,
  reviews: Int,
  size: Int,
  installs: Int,
  `type`: String,
  price: Double,
  contentRating: Int,
  genres: Seq[String],
  lastUpdated: Long,
  version: String,
  androidVersion: String
) {
  def priority(maxRating: Double, maxReviews: Int): Double =
    if (maxRating == 0 || maxReviews == 0) 0
    else (rating * 0.4 / maxRating + reviews * 0.6 / maxReviews) * 100
}
object AppData {
  def apply(str: String): AppData = {
    val fields = str.split('|')
    assert(fields.length == 12)
    AppData(
      fields(0),
      fields(1).toDouble,
      fields(2).toInt,
      fields(3).toInt,
      fields(4).toInt,
      fields(5),
      fields(6).toDouble,
      fields(7).toInt,
      fields(8).split(",").toSeq,
      fields(9).toLong,
      fields(10),
      fields(11)
    )
  }
}
Now you can do what you want pretty neatly:
// Read the data, parse it and group by category
// This gives you a map of categories to a seq of apps
val byCategory = Source.fromFile(fileName)
  .getLines
  .map(AppData(_))
  .toList
  .groupBy(_.category)
// Now, find out max ratings and reviews for each category
// This could be done even nicer with another case class and
// a monoid, but tuple/fold will do too
// It is tempting to use `.mapValues` here, but that's not a good idea
// because .mapValues is LAZY, it will recompute the max every time
// the value is accessed!
val maxes = byCategory.map { case (cat, data) =>
  cat ->
    data.foldLeft(0.0 -> 0) { case ((maxRating, maxReviews), in) =>
      (maxRating max in.rating, maxReviews max in.reviews)
    }
}.withDefault(_ => (0.0, 0))
// And finally go through your categories, and find best for each,
// that's it!
val bestByCategory = byCategory.map { case (cat, apps) =>
  cat -> apps.maxBy { app => (app.priority _).tupled(maxes(cat)) }
}
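As a quick sanity check (a small usage sketch, assuming the definitions above), you can print the winner per category:
bestByCategory.foreach { case (cat, app) =>
  println(f"$cat%-20s rating=${app.rating}%.1f reviews=${app.reviews}")
}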

Related

How to aggregate data in Spark using Scala?

I have a data set, test1.txt. It contains data like below:
2::1::3
1::1::2
1::2::2
2::1::5
2::1::4
3::1::2
3::1::1
3::2::2
I have created a DataFrame using the code below:
case class Test(userId: Int, movieId: Int, rating: Float)

def pRating(str: String): Test = {
  val fields = str.split("::")
  assert(fields.size == 3)
  Test(fields(0).toInt, fields(1).toInt, fields(2).toFloat)
}

val ratings = spark.read.textFile("C:/Users/test/Desktop/test1.txt").map(pRating).toDF()
This gives rows like:
2,1,3
1,1,2
1,2,2
2,1,5
2,1,4
3,1,2
3,1,1
3,2,2
But I want to print output like below, i.e. removing duplicate combinations and, instead of the field(2) value, printing the sum of those values for each (userId, movieId) pair, e.g. 1,1,2.0:
1,1,2.0
1,2,2.0
2,1,12.0
3,1,3.0
3,2,2.0
Please help me with this: how can I achieve it?
To drop duplicates, use df.distinct. To aggregate, you first groupBy and then agg. Putting this all together:
import org.apache.spark.sql.functions.sum
import spark.implicits._ // needed for .map on the Dataset, the 'symbol column syntax and .as[Rating]

case class Rating(userId: Int, movieId: Int, rating: Float)

def pRating(str: String): Rating = {
  val fields = str.split("::")
  assert(fields.size == 3)
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat)
}

val ratings = spark.read.textFile("C:/Users/test/Desktop/test1.txt").map(pRating)

val totals = ratings.distinct
  .groupBy('userId, 'movieId)
  .agg(sum('rating).cast("float").as("rating")) // sum produces a double; cast back so .as[Rating] works
  .as[Rating]
I am not sure whether you want the final result as a Dataset[Rating], or whether the distinct-and-sum logic is exactly what you are after, as the example in the question is not very clear, but hopefully this will give you what you need.
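For the sample data above, totals.show() should line up with the output you asked for (row order may differ):
+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   2.0|
|     1|      2|   2.0|
|     2|      1|  12.0|
|     3|      1|   3.0|
|     3|      2|   2.0|
+------+-------+------+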
ratings.groupBy("userId", "movieId").sum("rating")

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of the Stack Overflow user base. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from Reputation (in this case "11849") and the number from Age (in this example "35"), and I would like to have them as floats.
The file is located in HDFS, so it comes in as an RDD:
val linesWithAge = lines.filter(line => line.contains("Age=")) // this filters out data which doesn't have an age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) // here I am trying to split the data wherever there is a "
When I split on quotation marks, the reputation is at index 3 and the age at index 23, but how do I assign these to a map or a variable so I can use them as floats? Also, I need to do this for every line in the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So I've added an index to the array, and now I've successfully managed to assign one value to a variable, but I can only do it for one item in the RDD. Does anyone know how I could do it for all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given its name
def findElement(src: Array[String], name: String): Option[String] = {
  for {
    entry <- src.find(_.startsWith(name))
    value <- entry.split("\"").lift(1)
  } yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap { line =>
  val elements = line.split(" ")
  for {
    age <- findElement(elements, "Age")
    rep <- findElement(elements, "Reputation")
  } yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None. Hence, the result of the for-comprehension will be None and the record will be dropped by the flatMap operation.
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala provides built-in support for XML, so we can do this:
import scala.xml.XML

// helper function to map Strings to Option, where empty strings become None
def emptyStrToNone(str: String): Option[String] = if (str.isEmpty) None else Some(str)

val xmlReputationByAge = lines.flatMap { line =>
  val record = XML.loadString(line)
  for {
    rep <- emptyStrToNone((record \ "@Reputation").text)
    age <- emptyStrToNone((record \ "@Age").text)
  } yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
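As a purely illustrative follow-up (assuming the xmlReputationByAge RDD above, and a hypothetical goal of averaging reputation per age), the pairs can then be aggregated with the usual pair-RDD operations:
// average reputation per age (illustrative sketch only)
val avgRepByAge = xmlReputationByAge
  .map { case (rep, age) => (age, (rep.toLong, 1L)) }
  .reduceByKey { case ((r1, c1), (r2, c2)) => (r1 + r2, c1 + c2) }
  .mapValues { case (total, count) => total.toDouble / count }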
First, you need a function which extracts the value for a given key of your line (getValueForKeyAs below), then do:
val rdd = linesWithAge.map(line =>
  (getValueForKeyAs(line, "Age")(_.toFloat), getValueForKeyAs(line, "Reputation")(_.toFloat)))
This should give you an RDD of type RDD[(Float, Float)].
getValueForKeyAs could be implemented like this:
def getValueForKeyAs[A](line: String, key: String)(convert: String => A): A = {
  val res = line.split(key + "=")
  if (res.size == 1) throw new RuntimeException(s"no value for key $key")
  // asInstanceOf cannot turn a String into a Float, so take an explicit conversion function instead
  convert(res(1).split("\"")(1))
}

Scala - Having trouble with filter & vals

Working on an exercism.io exercise in Scala. Here is my current code:
case class Allergies() {
  def isAllergicTo(allergen: Allergen.Allergen, score: Int) = {
    if (score < allergen.value || score == 0) false
    else true
  }

  def allergies(score: Int) = {
    val list = List(Allergen.Cats, Allergen.Pollen, Allergen.Chocolate, Allergen.Tomatoes,
      Allergen.Strawberries, Allergen.Shellfish, Allergen.Peanuts, Allergen.Eggs)
    list.filter(isAllergicTo(_, score)).sortBy(v => v.value)
  }
}
object Allergen {
  sealed trait Allergen { def value: Int }
  case object Eggs extends Allergen { val value = 1 }
  case object Peanuts extends Allergen { val value = 2 }
  case object Shellfish extends Allergen { val value = 4 }
  case object Strawberries extends Allergen { val value = 8 }
  case object Tomatoes extends Allergen { val value = 16 }
  case object Chocolate extends Allergen { val value = 32 }
  case object Pollen extends Allergen { val value = 64 }
  case object Cats extends Allergen { val value = 128 }
}
Due to the way the tests are formatted, a few of these strange constructs are just simple ways of getting past syntactic requirements in the tests. With that aside, here is a quick overview of what you are seeing...
Allergies takes a score and returns all of the allergens that could add to this score. So, if a person has a score of 34, they must be allergic to Chocolate and Peanuts. isAllergicTo takes an allergen and a score and determines if it is possible that there is an allergen present.
The problem I am running into is that my filter logic is only sort of correct: right now, for an input of 34, it returns not only Chocolate and Peanuts but everything with a value less than Chocolate's. I am not sure how to solve this with a score that changes as matches are found, partly because score is a val and can't be reassigned without an intermediate variable.
I know my problem is vague, but I'm not sure where to continue on this one and would appreciate any suggestions.
The issue is simply that isAllergicTo is implemented incorrectly. If you fix it, you won't need to change score at all. As a hint: think about binary representation of score.
"Bitmasking" is often used to represent a set of items as an integer value.
If you have a collection of items (e.g. Allergens), you assign each item a value equal to a distinct power of 2. Eggs is 2^0, Peanuts is 2^1, and so on. To create a set of these items, you take the bitwise "OR" of the values of the items in that set. By using different powers of 2, when the value is represented in binary, the 1 in each item's value sits in a different place.
For example:
Peanuts: 00000010 (2)
OR Chocolate: 00100000 (32)
----------------------------
= (combined): 00100010 (34)
To check if an item is in a set (value), you use bitwise "AND" to compare the item's value with the set's value, e.g.
Set: 00100010 (34)
AND Peanuts: 00000010 (2)
---------------------------
result: 00000010 (2)
Set: 00100010 (34)
AND Shellfish: 00000100 (4)
-----------------------------
result: 00000000 (0)
If the item is not in the set (as with Shellfish above), the result is 0; if it is (as with Peanuts), the result is the item's own value.
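Applied to your code, a minimal sketch of isAllergicTo along these lines (everything else can stay as it is) would be:
def isAllergicTo(allergen: Allergen.Allergen, score: Int): Boolean =
  (score & allergen.value) != 0
With that in place, allergies(34) filters down to exactly Peanuts and Chocolate.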
Looks like I'm a bit late, but I'm posting this anyway.
Using a tail-recursive function:
def allergies(score: Int) = {
  val list = List(Allergen.Cats, Allergen.Pollen, Allergen.Chocolate, Allergen.Tomatoes,
    Allergen.Strawberries, Allergen.Shellfish, Allergen.Peanuts, Allergen.Eggs)

  def inner(maybeSmaller: List[Allergen.Allergen], score: Int, allergies: List[Allergen.Allergen]): List[Allergen.Allergen] =
    if (score == 0 || maybeSmaller.isEmpty) allergies.reverse
    else {
      val smaller = maybeSmaller.filter(isAllergicTo(_, score)) // i.e. (_.value <= score)
      inner(smaller.tail, score - smaller.head.value, smaller.head :: allergies)
    }

  inner(list.sortBy(-_.value), score, Nil)
}
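As a quick check: since 34 = 32 + 2, this greedy version also yields allergies(34) == List(Chocolate, Peanuts).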

Scala, finding max value in arrays

This is the first time I've had to ask a question here; there is not enough info on Scala out there for a newbie like me.
Basically what I have is a file filled with hundreds of thousands of lists formatted like this:
(type, date, count, object)
Rows look something like this:
(food, 30052014, 400, banana)
(food, 30052014, 2, pizza)
All I need to do is find the one row with the highest count.
I know I did this a couple of months ago but can't seem to wrap my head around it now. I'm sure I can do this without a function too. All I want to do is set a value and put that row in it but I can't figure it out.
I think basically what I want to do is a Math.max on the 3rd element in the lists, but I just can't get it.
Any help will be kindly appreciated. Sorry if my wording or formatting of this question isn't the best.
EDIT: There's some extra info I've left out that I should probably add:
All the records are stored in a TSV file. I've done this to split them:
val split_food = food.map(_.split("/t"))
so basically I think I need to use split_food... somehow
Modified version of @Szymon's answer with your edit addressed:
val split_food = food.map(_.split("\t")) // note: a tab is "\t", not "/t"
val max_food = split_food.maxBy(tokens => tokens(2).toInt)
or, analogously:
val max_food = split_food.maxBy { case Array(_, _, count, _) => count.toInt }
In case you're using Apache Spark's RDD, which has only a limited subset of the usual Scala collection methods, you have to go with reduce:
val max_food = split_food.reduce { (max: Array[String], current: Array[String]) =>
  val curCount = current(2).toInt
  val maxCount = max(2).toInt // you would probably want to preprocess all items,
                              // so that .toInt is not called again and again
  if (curCount > maxCount) current else max
}
You should use the maxBy function:
case class Purchase(category: String, date: Long, count: Int, name: String)

object Purchase {
  def apply(s: String) = s.split("\t") match {
    case Array(cat, date, count, name) => Purchase(cat, date.toLong, count.toInt, name)
  }
}

foodRows.map(row => Purchase(row)).maxBy(_.count)
Simply:
case class Record(food:String, date:String, count:Int)
val l = List(Record("ciccio", "x", 1), Record("buffo", "y", 4), Record("banana", "z", 3))
l.maxBy(_.count)
>>> res8: Record = Record(buffo,y,4)
Not sure if you got the answer yet, but I had the same issues with maxBy. I found that once I imported the package scala.io.Source, I was able to use maxBy and it worked.

scalding compare consecutive records

Does anyone know how to compare consecutive records in Scalding when creating a schema? I am looking at tutorial 6, and suppose that I want to print the age of a person if the age in record #2 is greater than that in record #1 (and so on for all consecutive records).
for example:
R1: John 30
R2: Kim 55
R3: Mark 20
If Rn.age > R(n-1).age, output the record, which here results in R2: Kim 55.
EDIT:
Looking at the code, I just realized the schema is a Scala Enumeration, so my question is: how do I compare records when the fields are defined with an Enumeration?
class Tutorial6(args: Args) extends Job(args) {
  /** When a data set has a large number of fields, and we want to specify those fields conveniently
      in code, we can use, for example, a Tuple of Symbols (as most of the other tutorials show), or a
      List of Symbols. Note that Tuples can only be used if the number of fields is at most 22, since
      Scala Tuples cannot have more than 22 elements. Another alternative is to use Enumerations,
      which we show here. **/
  object Schema extends Enumeration {
    val first, last, phone, age, country = Value // arbitrary number of fields
  }

  import Schema._

  Csv("tutorial/data/phones.txt", separator = " ", fields = Schema)
    .read
    .project(first, age)
    .write(Tsv("tutorial/data/output6.tsv"))
}
It seems the implicit conversion from Enumeration#Value is missing, so you could define it yourself:
import cascading.tuple.Fields

implicit def valueToFields(v: Enumeration#Value): Fields = v.toString

object Schema extends Enumeration {
  val first, last, phone, age, country = Value // arbitrary number of fields
}

import Schema._

var current = Int.MaxValue

Csv("tutorial/data/phones.txt", separator = " ", fields = Schema)
  .read
  .map(age -> ('current, 'previous)) { a: String =>
    val previous = current
    current = a.toInt
    current -> previous
  }
  .filter('current, 'previous) { age: (Int, Int) => age._1 > age._2 }
  .project(first, age)
  .write(Tsv("tutorial/data/output6.tsv"))
In the end, we expect the result to be the same as that of:
Csv("tutorial/data/phones.txt", separator = " ", fields = Schema)
.read
.map((new Fields("age"), (new Fields("current", "previous"))) { a: String =>
val previous = current
current = a.toInt
current -> previous
}
.filter(new Fields("current", "previous")) { age: (Int, Int) =>
age._1 > age._2
}
.project(new Fields("first", "age"))
.write(Tsv("tutorial/data/output6.tsv"))
The implicit conversions provided by Scalding allow you to write shorter versions of these new Fields(...) calls.
An implicit conversion is just a view, which gets used by the compiler when you pass arguments that are not of the expected type but can be converted to the appropriate type by that view. For example, because map() expects a pair of Fields while you're passing it a pair of Symbols, Scala will search for an implicit conversion from (Symbol, Symbol) to (Fields, Fields). A short explanation of views can be found here.
Scalding 0.8.5 introduced conversions from a product of Enumeration#Value to a Fields, but was missing conversions from a pair of values. The develop branch now also provides the latter.
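Until you are on a version that ships it, a minimal sketch of such a pair conversion (a hypothetical helper, building on the valueToFields view defined above) could be:
implicit def valuePairToFieldsPair(pair: (Enumeration#Value, Enumeration#Value)): (Fields, Fields) =
  (valueToFields(pair._1), valueToFields(pair._2))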