Data analysis on a subset with Scala

I'm new to learning Scala and exploring the ways it can do things, and am now trying to learn to implement some slightly more sophisticated data analysis.
I have weather data for various cities in different countries in a text file loaded into the program. I have so far figured out how to calculate simple things like the average temperature across a country per day, or the average temperature of each city grouped by country across the whole file, using Map/mapValues to bind keys to the values I'm looking for.
Now I would like to be able to specify a time window (say, a week) and, from there, grouped by country, figure out things like the average temperature of each city in that time window. For simplicity, I've made the dates simple Ints rather than go with MM/DD/YY format.
In another language I would likely reach for loops to do this, but I'm not quite sure of the best "Scala" way to do it. At first I thought maybe "sliding" or "grouped", but found that these would split the list entirely, and therefore I could not specify an arbitrary day to calculate the week from. I've included example code for my method which calculates the average temperature per city over the whole time period:
def citytempaverages(): Map[String, Map[String, Double]] = {
  weatherpatterns.groupBy(_.country)
    .mapValues(_.groupBy(_.city)
      .mapValues(cityavg => cityavg.map(_.temperature).sum / cityavg.length))
}
Does it even still make sense to use maps for this new problem, or perhaps another method in the collections API is more suited?
UPDATE #1: I've built a collection like so:
def dailycities(): Map[Int, Map[String, Map[String, List[Double]]]] = {
  weatherpatterns.groupBy(_.day)
    .mapValues(_.groupBy(_.country).mapValues(_.groupBy(_.city)
      .mapValues(_.map(_.temperature))))
}
Then I created a new map using filterKeys and a Set to give me back just the days I'm looking for. So I suppose now it's just a matter of formatting to get the averages out correctly.
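For example, filtering the raw records to the window first and then reusing the same grouping gives something like this (a sketch, assuming records of the form WeatherPattern(day, country, city, temperature)):
def weeklyCityAverages(startDay: Int): Map[String, Map[String, Double]] =
  weatherpatterns
    .filter(p => p.day >= startDay && p.day < startDay + 7) // the chosen week
    .groupBy(_.country)
    .mapValues(_.groupBy(_.city)
      .mapValues(temps => temps.map(_.temperature).sum / temps.length))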

I wouldn't call any single approach the best way to do this in Scala. Rather, anything that minimizes iteration is best, IMO. In that case:
def averageOfDay(country: String, city: String, day: Int) = {
  val temps = weatherPatterns.collect {
    case WeatherPattern(`day`, `country`, `city`, temp) => temp
  }
  temps.sum / temps.length
}
Edit
I just noticed you mainly need an operation that calculates averages for all cities and countries. In that case, instead of forming the hierarchical relationship of country -> city -> temp in every operation, I'd opt for building the hierarchy once beforehand and then operating on that:
case class DailyTemperature(day: Int, temperature: Double)

object DailyTemperature {
  def sequence(patterns: List[WeatherPattern]): List[DailyTemperature] =
    patterns.map(p => DailyTemperature(p.day, p.temperature))
}

case class CityTempInfo(city: String, dailyTemperatures: List[DailyTemperature])

object CityTempInfo {
  def sequence(patterns: List[WeatherPattern]): List[CityTempInfo] =
    patterns.groupBy(_.city).map {
      case (city, ps) => CityTempInfo(city, DailyTemperature.sequence(ps))
    }.toList
}

case class CountryTempInfo(country: String, citiesInfo: List[CityTempInfo])

object CountryTempInfo {
  def sequence(patterns: List[WeatherPattern]) =
    patterns.groupBy(_.country).map {
      case (country, ps) => CountryTempInfo(country, CityTempInfo.sequence(ps))
    }.toList
}
Now, to get your tree of country -> city -> temp, you call CountryTempInfo.sequence and feed it your list of WeatherPatterns. Any other method you want to operate on DailyTemperature, CityTempInfo, or CountryTempInfo can be defined on the respective class.
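As a sketch of that last point (method names here are hypothetical):
// e.g. averages defined next to the data they describe
def averageTemperature(info: CityTempInfo): Double = {
  val temps = info.dailyTemperatures.map(_.temperature)
  temps.sum / temps.length
}

def averagesByCity(info: CountryTempInfo): Map[String, Double] =
  info.citiesInfo.map(ci => ci.city -> averageTemperature(ci)).toMap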

I am not sure what exactly you mean when you say that you use "simple ints" for the date, but if it is something sensible, like, for instance, "days since epoch", you could fairly easily come up with a grouping function that maps days to weeks:
def weekOf(n: Int, start: Int) = (start + n) / 7

patterns
  .groupBy { p => (weekOf(p.day, startDay), p.country, p.city) }
  .mapValues(_.map(_.temperature))
  .mapValues { s => s.sum / s.size }
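A quick usage sketch (the case class and sample data are assumptions for illustration):
case class WeatherPattern(day: Int, country: String, city: String, temperature: Double)

val startDay = 0 // with startDay = 0, weeks are simply day / 7
val patterns = List(
  WeatherPattern(1, "DE", "Berlin", 10.0),
  WeatherPattern(2, "DE", "Berlin", 14.0),
  WeatherPattern(9, "DE", "Berlin", 6.0))

patterns
  .groupBy { p => (weekOf(p.day, startDay), p.country, p.city) }
  .mapValues(_.map(_.temperature))
  .mapValues { s => s.sum / s.size }
// roughly: Map((0,DE,Berlin) -> 12.0, (1,DE,Berlin) -> 6.0)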

Improve the efficiency of the algorithm

I'm trying to improve this algorithm.
Right now it runs in O(n) and always iterates through all the elements of the collection. My attempts to avoid a full O(n) pass led to the introduction of var variables. It would be great to do without var.
Task:
We need a class that suggests company names by prefix: from the list of all available names, output a certain number of companies that start with the entered string.
It is assumed that the class will be called when filling out a form on a website/mobile application with a high RPS (Requests per second).
My solution:
class SuggestService(companyNames: Seq[String]) {
  def suggest(input: String, numberOfSuggest: Int): Seq[String] = {
    val resultCompanyNames =
      for {
        name <- companyNames if input.equals(name.take(input.length))
      } yield name
    resultCompanyNames.take(numberOfSuggest)
  } //TODO: My code
}
Scastie: https://scastie.scala-lang.org/mIC5ZTGwRyKnuJAbhM1VlA
The solution proposed by @jwvh in the comments:
companyNames.view.filter(_.startsWith(input)).take(numberOfSuggest).toSeq
is good when you have several dozen company names, but in the worst case you'll have to check every single company name. If you have thousands of company names and many requests per second, this has a chance to become a serious bottleneck.
A better approach might be to sort the company names and use binary search to find the first potential result in O(L log N), where N is the number of names and L is the average length of a company name:
import scala.collection.immutable.ArraySeq // in Scala 2.13

class SuggestService(companyNames: Seq[String]) {
  // in Scala 2.12 use companyNames.toIndexedSeq.sorted
  private val sortedNames = companyNames.to(ArraySeq).sorted

  @annotation.tailrec
  private def binarySearch(input: String, from: Int = 0, to: Int = sortedNames.size): Int = {
    if (from == to) from
    else {
      val cur = (from + to) / 2
      if (sortedNames(cur) < input) binarySearch(input, cur + 1, to)
      else binarySearch(input, from, cur)
    }
  }

  def suggest(input: String, numberOfSuggest: Int): Seq[String] = {
    val start = binarySearch(input)
    sortedNames.view
      .slice(start, start + numberOfSuggest)
      .takeWhile(_.startsWith(input))
      .toSeq
  }
}
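A quick usage check (the company names are invented for illustration):
val service = new SuggestService(Seq("Apple", "Amazon", "Apex", "Banana"))
service.suggest("Ap", 5) // => Seq("Apex", "Apple"), found without scanning every name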
Note that the built-in binary search in Scala (sortedNames.search) returns any matching result, not necessarily the first one. And the built-in binary search from Java works either on arrays (Arrays.binarySearch) or on Java collections (Collections.binarySearch). So in the code above I've provided an explicit implementation of a lower-bound binary search.
If you need still better performance, you can use a trie data structure. After traversing the trie and finding the node corresponding to input in O(L) (this may depend on the trie implementation), you can then continue down from that node to retrieve the first numberOfSuggest results. The query time doesn't depend on N at all, so you can use this method with millions of company names.
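A minimal sketch of such a trie (a hedged, unoptimized illustration; TrieNode and TrieSuggestService are invented names, not a drop-in implementation):
import scala.collection.mutable

class TrieNode {
  val children = mutable.Map.empty[Char, TrieNode]
  var isWord = false
}

class TrieSuggestService(companyNames: Seq[String]) {
  private val root = new TrieNode
  for (name <- companyNames) {
    var node = root
    for (ch <- name) node = node.children.getOrElseUpdate(ch, new TrieNode)
    node.isWord = true
  }

  def suggest(input: String, numberOfSuggest: Int): Seq[String] = {
    // walk down to the node for `input`: O(L), independent of N
    var node = root
    for (ch <- input) {
      node.children.get(ch) match {
        case Some(next) => node = next
        case None       => return Nil // no name starts with `input`
      }
    }
    // depth-first collection of the first `numberOfSuggest` words below that node
    val results = mutable.ListBuffer.empty[String]
    def collect(n: TrieNode, prefix: String): Unit = {
      if (results.size < numberOfSuggest) {
        if (n.isWord) results += prefix
        for ((ch, child) <- n.children.toSeq.sortBy(_._1))
          collect(child, prefix + ch)
      }
    }
    collect(node, input)
    results.toList
  }
}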

Implementing the Apache Spark tutorial with FP-growth, No results on freqItemsets

This is my first question here and I hope I am doing this correctly.
So, I was trying to get into Apache Spark and its FP-growth algorithm. Therefore I tried to apply the FP-growth tutorial to the bank tutorial that comes with Spark.
I am really new to all this data-mapping stuff and Scala, so this question might seem very basic to you guys, but I appreciate your help!
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

val bank = bankTest.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
  s => Bank(s(0).toInt,
    s(1).replaceAll("\"", ""),
    s(2).replaceAll("\"", ""),
    s(3).replaceAll("\"", ""),
    s(5).replaceAll("\"", "").toInt
  )
)
val transactions: RDD[Array[Object]] = bank.map(x => Array(x))
val fpg = new FPGrowth()
.setMinSupport(0.1)
.setNumPartitions(10)
val model = fpg.run(transactions)
model.freqItemsets.collect().foreach { itemset =>
println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
This is what I coded, and I think the problem is the mapping of my Bank elements into the transactions variable. The code runs properly, but there are no results. I guess this happens because the FP-growth algorithm compares the different objects of type Bank with each other, which are contained in the transactions variable. Of course there is no whole object with a support of 20%.
So the question is: how can I make FP-growth check the COLUMNS in my data and not the whole object?
For example: The support for "job = manager" should be around 20%, so it should appear as frequent, which it does not in my results.
Thank you in advance!
An easy solution would be to create a toList method that simply returns a list with all the members of your Bank:
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer) {
  def toList(): List[String] =
    List("" + age, job, marital, education, "" + balance)
}
Note that I used a List of String as FP-growth works with "classified items". It means that if you input integers or floats as salary or age, it will treat every single salary as unique if they differ by a cent (the same for age):
val bank1 = Bank(35, "engineer", "engaged", "college", 100000)
val bank2 = Bank(35, "engineer", "engaged", "college", 100001)
Although the salary of bank1 and bank2 is very close, FP-growth will consider these two items as different. Thus you will have trouble classifying salaries, as they have high divergence.
I would recommend defining an enum for each age class and salary class, such as AGE_BETWEEN_0_18, AGE_BETWEEN_18_25, ...
That way you will shrink the histogram and let FP-growth work perfectly.
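With that toList in place, the transactions from the question could be rebuilt per column value instead of per whole object; an untested sketch along the lines of the question's code:
// Each transaction is now the list of a row's column values,
// so FP-growth can count individual values such as "manager".
val transactions: RDD[Array[String]] = bank.map(_.toList.toArray)

val fpg = new FPGrowth().setMinSupport(0.1).setNumPartitions(10)
val model = fpg.run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}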
P.S.: I am not sure the object should be called Bank; I would rather name it BankCustomer.

Building an immutable List based on conditions

I have to build a list whose members should be included or not based on a different condition for each.
Let's say I have to validate a purchase order and, depending on the price, I have to notify a number of people: if the price is more than 10, the supervisor has to be notified. If the price is more than 100 then both the supervisor and the manager. If the price is more than 1000 then the supervisor, the manager, and the director.
My function should take a price as input and output a list of people to notify. I came up with the following:
def whoToNotify(price: Int) = {
  addIf(price > 1000, "director",
    addIf(price > 100, "manager",
      addIf(price > 10, "supervisor", Nil)
    )
  )
}

def addIf[A](condition: Boolean, elem: A, list: List[A]) = {
  if (condition) elem :: list else list
}
Are there better ways to do this in plain Scala? Am I reinventing some wheel here with my addIf function?
Please note that the check on price is just a simplification. In real life, the checks would be more complex, on a number of database fields, and including someone in the organizational hierarchy will not imply including all the people below, so truncate-a-list solutions won't work.
EDIT
Basically, I want to achieve the following, using immutable lists:
def whoToNotify(po: PurchaseOrder) = {
  val people = scala.collection.mutable.Buffer[String]()
  if (directorCondition(po)) people += "director"
  if (managerCondition(po)) people += "manager"
  if (supervisorCondition(po)) people += "supervisor"
  people.toList
}
You can use List#flatten() to build a list from subelements. It would even let you add two people at the same time (I'll do that for the managers in the example below):
def whoToNotify(price: Int) =
  List(if (price > 1000) List("director") else Nil,
       if (price > 100) List("manager1", "manager2") else Nil,
       if (price > 10) List("supervisor") else Nil).flatten
Well, it's a matter of style, but I would do it this way to make all the conditions easier to manage:
case class Condition(price: Int, designation: String)

val list = List(
  Condition(10, "supervisor"),
  Condition(100, "manager"),
  Condition(1000, "director")
)

def whoToNotify(price: Int) = {
  list.filter(_.price <= price).map(_.designation)
}
You can accommodate all your conditions in the Condition class and the filter function as per your requirements.
Well, it is a matter of style, but I would prefer keeping a list of people to notify along with their rules rather than nesting functions. I don't see much value in having something like addIf in the above example.
My solution:
val notifyAbovePrice = List(
  (10, "supervisor"),
  (100, "manager"),
  (1000, "director"))

def whoToNotify(price: Int): Seq[String] = {
  notifyAbovePrice.takeWhile(price > _._1).map(_._2)
}
In the real world, you may have objects in notifyAbovePrice instead of tuples, and you would use filter instead of takeWhile if the list isn't ordered or if the order doesn't imply notifying the lower levels.
If you have an original list of members, you would probably want to consider using the filter method. If you also want to transform the member object so as to have a different type of list at the end, take a look at the collect method, which takes a partial function.
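For instance, a sketch of collect doing both steps in one pass (the Member class is invented for illustration):
case class Member(name: String, role: String)

val members = List(Member("Alice", "director"), Member("Bob", "clerk"))

// filter and transform at once: keep only directors, extract their names
val toNotify: List[String] = members.collect {
  case Member(name, "director") => name
}
// => List("Alice")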

What are good examples of: "operation of a program should map input values to output values rather than change data in place"

I came across this sentence about Scala, explaining its functional behavior:
operation of a program should map input values to output values rather than change data in place
Could somebody explain it with a good example?
Edit: Please explain or give an example for the above sentence in its context; please don't overcomplicate it and cause more confusion.
The most obvious pattern that this is referring to is the difference between how you would write code which uses collections in Java when compared with Scala. If you were writing Scala but in the idiom of Java, then you would be working with collections by mutating data in place. The idiomatic Scala code to do the same would favour the mapping of input values to output values.
Let's have a look at a few things you might want to do to a collection:
Filtering
In Java, if I have a List<Trade> and I am only interested in those trades executed with Deutsche Bank, I might do something like:
for (Iterator<Trade> it = trades.iterator(); it.hasNext();) {
  Trade t = it.next();
  if (t.getCounterparty() != DEUTSCHE_BANK) it.remove(); // MUTATION
}
Following this loop, my trades collection only contains the relevant trades. But, I have achieved this using mutation - a careless programmer could easily have missed that trades was an input parameter, an instance variable, or is used elsewhere in the method. As such, it is quite possible their code is now broken. Furthermore, such code is extremely brittle for refactoring for this same reason; a programmer wishing to refactor a piece of code must be very careful to not let mutated collections escape the scope in which they are intended to be used and, vice-versa, that they don't accidentally use an un-mutated collection where they should have used a mutated one.
Compare with Scala:
val db = trades filter (_.counterparty == DeutscheBank) //MAPPING INPUT TO OUTPUT
This creates a new collection! It doesn't affect anyone who is looking at trades and is inherently safer.
Mapping
Suppose I have a List<Trade> and I want to get a Set<Stock> for the unique stocks which I have been trading. Again, the idiom in Java is to create a collection and mutate it.
Set<Stock> stocks = new HashSet<Stock>();
for (Trade t : trades) stocks.add(t.getStock()); //MUTATION
Using scala the correct thing to do is to map the input collection and then convert to a set:
val stocks = (trades map (_.stock)).toSet //MAPPING INPUT TO OUTPUT
Or, if we are concerned about performance:
(trades.view map (_.stock)).toSet
(trades.iterator map (_.stock)).toSet
What are the advantages here? Well:
My code can never observe a partially-constructed result
The application of a function A => B to a Coll[A] to get a Coll[B] is clearer.
Accumulating
Again, in Java the idiom has to be mutation. Suppose we are trying to sum the decimal quantities of the trades we have done:
BigDecimal sum = BigDecimal.ZERO;
for (Trade t : trades) {
  sum.add(t.getQuantity()); // MUTATION
}
Again, we must be very careful not to accidentally observe a partially-constructed result! In scala, we can do this in a single expression:
val sum = (0 /: trades)(_ + _.quantity) // MAPPING INPUT TO OUTPUT
Or the various other forms:
trades.foldLeft(0)(_ + _.quantity)
(trades.iterator map (_.quantity)).sum
(trades.view map (_.quantity)).sum
Oh, by the way, there is a bug in the Java implementation! Did you spot it?
I'd say it's the difference between:
var counter = 0
def updateCounter(toAdd: Int): Unit = {
  counter += toAdd
}
updateCounter(8)
println(counter)
and:
val originalValue = 0
def addToValue(value: Int, toAdd: Int): Int = value + toAdd
val firstNewResult = addToValue(originalValue, 8)
println(firstNewResult)
This is a gross oversimplification, but fuller examples are things like using a foldLeft to build up a result rather than doing the hard work yourself: foldLeft example
What it means is that if you write pure functions like this you always get the same output from the same input, and there are no side effects, which makes it easier to reason about your programs and ensure that they are correct.
So, for example, the function:
def times2(x:Int) = x*2
is pure, while
import scala.collection.mutable.MutableList

def add5ToList(xs: MutableList[Int]) {
  xs += 5
}
is impure because it edits data in place as a side effect. This is a problem because that same list could be in use elsewhere in the program, and now we can't guarantee the behaviour because it has changed.
A pure version would use immutable lists and return a new list
def add5ToList(xs: List[Int]) = {
  5 :: xs
}
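For example, the original list is left untouched:
val xs = List(1, 2, 3)
val ys = add5ToList(xs) // ys is List(5, 1, 2, 3)
// xs is still List(1, 2, 3): nothing was changed in place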
There are plenty of examples with collections, which are easy to come by but might give the wrong impression. This concept works at all levels of the language (it doesn't at the VM level, however). One example is case classes. Consider these two alternatives:
// Java-style
class Person(initialName: String, initialAge: Int) {
  def this(initialName: String) = this(initialName, 0)
  private var name = initialName
  private var age = initialAge
  def getName = name
  def getAge = age
  def setName(newName: String) { name = newName }
  def setAge(newAge: Int) { age = newAge }
}

val employee = new Person("John")
employee.setAge(40) // we changed the object

// Scala-style
case class Person(name: String, age: Int) {
  def this(name: String) = this(name, 0)
}

val employee = new Person("John")
val employeeWithAge = employee.copy(age = 40) // employee still exists!
This concept is applied in the construction of the immutable collections themselves: a List never changes. Instead, new List objects are created when necessary. The use of persistent data structures reduces the copying that would happen with a mutable data structure.
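For instance, prepending to a list reuses the existing list rather than copying it:
val rest = List(2, 3, 4)
val full = 1 :: rest // only a new head cell; `full` shares all of `rest`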

scala: map-like structure that doesn't require casting when fetching a value?

I'm writing a data structure that converts the results of a database query. The raw structure is a java ResultSet and it would be converted to a map or class which permits accessing different fields on that data structure by either a named method call or passing a string into apply(). Clearly different values may have different types. In order to reduce burden on the clients of this data structure, my preference is that one not need to cast the values of the data structure but the value fetched still has the correct type.
For example, suppose I'm doing a query that fetches two column values, one an Int, the other a String, and the names of the resulting columns are "a" and "b" respectively. Some ideal syntax might be the following:
val javaResultSet = dbQuery("select a, b from table limit 1")
// with ResultSet, particular values can be accessed like this:
val a = javaResultSet.getInt("a")
val b = javaResultSet.getString("b")
// but this syntax is undesirable.
// since I want to convert this to a single data structure,
// the preferred syntax might look something like this:
val newStructure = toDataStructure[Int, String](javaResultSet)("a", "b")
// that is, I'm willing to state the types during the instantiation
// of such a data structure.
// then,
val a: Int = newStructure("a") // OR
val a: Int = newStructure.a
// in both cases, "val a" does not require asInstanceOf[Int].
I've been trying to determine what sort of data structure might allow this and I could not figure out a way around the casting.
The other requirement is obviously that I would like to define a single data structure used for all db queries. I realize I could easily define a case class or similar per call and that solves the typing issue, but such a solution does not scale well when many db queries are being written. I suspect some people are going to propose using some sort of ORM, but let us assume for my case that it is preferred to maintain the query in the form of a string.
Anyone have any suggestions? Thanks!
To do this without casting, one needs more information about the query, and one needs that information at compile time.
I suspect some people are going to propose using some sort of ORM, but let us assume for my case that it is preferred to maintain the query in the form of a string.
Your suspicion is right and you will not get around this. If current ORMs or DSLs like squeryl don't suit your fancy, you can create your own one. But I doubt you will be able to use query strings.
The basic problem is that you don't know how many columns there will be in any given query, and so you don't know how many type parameters the data structure should have and it's not possible to abstract over the number of type parameters.
There is however, a data structure that exists in different variants for different numbers of type parameters: the tuple. (E.g. Tuple2, Tuple3 etc.) You could define parameterized mapping functions for different numbers of parameters that returns tuples like this:
def toDataStructure2[T1, T2](rs: ResultSet)(c1: String, c2: String) =
  (rs.getObject(c1).asInstanceOf[T1],
   rs.getObject(c2).asInstanceOf[T2])

def toDataStructure3[T1, T2, T3](rs: ResultSet)(c1: String, c2: String, c3: String) =
  (rs.getObject(c1).asInstanceOf[T1],
   rs.getObject(c2).asInstanceOf[T2],
   rs.getObject(c3).asInstanceOf[T3])
You would have to define these for as many columns as you expect to have in your tables (max 22).
This of course depends on the assumption that using getObject and casting it to a given type is safe.
In your example you could use the resulting tuple as follows:
val (a, b) = toDataStructure2[Int, String](javaResultSet)("a", "b")
If you decide to go the route of heterogeneous collections, there are some very interesting posts on heterogeneously typed lists. One series, for instance, is:
http://jnordenberg.blogspot.com/2008/08/hlist-in-scala.html
http://jnordenberg.blogspot.com/2008/09/hlist-in-scala-revisited-or-scala.html
with an implementation at
http://www.assembla.com/wiki/show/metascala
A second great series of posts starts with
http://apocalisp.wordpress.com/2010/07/06/type-level-programming-in-scala-part-6a-heterogeneous-list%C2%A0basics/
and continues with parts "b", "c", and "d" linked from part "a".
Finally, there is a talk by Daniel Spiewak which touches on HOMaps:
http://vimeo.com/13518456
All this to say that perhaps you can build your solution from these ideas. Sorry that I don't have a specific example, but I admit I haven't tried these out myself yet!
Joshua Bloch has introduced a heterogeneous collection, which can be written in Java. I once adapted it a little. It now works as a value register. It is basically a wrapper around two maps. Here is the code and this is how you can use it. But this is just FYI, since you are interested in a Scala solution.
In Scala I would start by playing with tuples. Tuples are kind of heterogeneous collections. The elements can be, but don't have to be, accessed through fields like _1, _2, _3 and so on. But you don't want that, you want names. This is how you can assign names to those:
scala> val tuple = (1, "word")
tuple: (Int, String) = (1,word)
scala> val (a, b) = tuple
a: Int = 1
b: String = word
So as mentioned before I would try to build a ResultSetWrapper around tuples.
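A rough sketch of what such a wrapper could look like (hypothetical, and it relies on the same getObject-plus-cast caveat as the other answer):
import java.sql.ResultSet

// wraps the current row of a ResultSet as a named, typed pair
class ResultSetWrapper2[T1, T2](rs: ResultSet, c1: String, c2: String) {
  def row: (T1, T2) =
    (rs.getObject(c1).asInstanceOf[T1], rs.getObject(c2).asInstanceOf[T2])
}

// val (a, b) = new ResultSetWrapper2[Int, String](javaResultSet, "a", "b").row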
If you want to "extract the column value by name" on a plain bean instance, you can probably:
use reflection and casts, which you (and I) don't like.
use a ResultSetToJavaBeanMapper provided by most ORM libraries, which is a little heavy and coupled.
write a Scala compiler plugin, which is too complex to control.
So I guess a lightweight ORM with the following features may satisfy you:
support for raw SQL
a lightweight, declarative, and adaptive ResultSetToJavaBeanMapper
nothing else.
I made an experimental project based on that idea; note that it's still an ORM, but I think it may be useful to you or give you some hints.
Usage:
declare the model:
//declare DB schema
trait UserDef extends TableDef {
  var name = property[String]("name", title = Some("姓名"))
  var age1 = property[Int]("age", primary = true)
}

//declare model, and it mixes in properties as {var name = ""}
@BeanInfo class User extends Model with UserDef

//declare an object.
//it mixes in properties as {var name = Property[String]("name") }
//and, object User is a Mapper[User], thus, it can translate ResultSet to a User instance.
object `package` {
  @BeanInfo implicit object User extends Table[User]("users") with UserDef
}
Then call raw SQL; the implicit Mapper[User] works for you:
val users = SQL("select name, age from users").all[User]
users.foreach{user => println(user.name)}
Or even build a type-safe query:
val users = User.q.where(User.age > 20).where(User.name like "%liu%").all[User]
For more, see the unit test:
https://github.com/liusong1111/soupy-orm/blob/master/src/test/scala/mapper/SoupyMapperSpec.scala
project home:
https://github.com/liusong1111/soupy-orm
It uses abstract types and implicits heavily to make the magic happen; you can check the source code of TableDef, Table, and Model for details.
Several million years ago I wrote an example showing how to use Scala's type system to push and pull values from a ResultSet. Check it out; it matches up with what you want to do fairly closely.
implicit val conn = connect("jdbc:h2:f2", "sa", "")
implicit val s: Statement = conn << setup

val insertPerson = conn prepareStatement "insert into person(type, name) values(?, ?)"
for (name <- names)
  insertPerson << rnd.nextInt(10) << name <<!

for (person <- query("select * from person", rs => Person(rs, rs, rs)))
  println(person.toXML)

for (person <- "select * from person" <<! (rs => Person(rs, rs, rs)))
  println(person.toXML)
Primitive types are used to guide the Scala compiler into selecting the right functions on the ResultSet.