scalding compare consecutive records - scala

Does anyone know how to compare consecutive records in Scalding when creating a schema? I am looking at Tutorial 6, and suppose I want to print a person's age whenever record #n has a greater age than record #(n-1), for all records.
For example:
R1: John 30
R2: Kim 55
R3: Mark 20
If Rn.age > R(n-1).age, output the record, which here results in R2: Kim 55.
EDIT:
Looking at the code, I just realized it uses a Scala Enumeration, so my question is: how do I compare records when the schema is a Scala Enumeration?
class Tutorial6(args : Args) extends Job(args) {

  /** When a data set has a large number of fields, and we want to specify those fields conveniently
      in code, we can use, for example, a Tuple of Symbols (as most of the other tutorials show), or a
      List of Symbols. Note that Tuples can only be used if the number of fields is at most 22, since
      Scala Tuples cannot have more than 22 elements. Another alternative is to use Enumerations,
      which we show here **/
  object Schema extends Enumeration {
    val first, last, phone, age, country = Value // arbitrary number of fields
  }

  import Schema._

  Csv("tutorial/data/phones.txt", separator = " ", fields = Schema)
    .read
    .project(first, age)
    .write(Tsv("tutorial/data/output6.tsv"))
}

It seems the implicit conversion from Enumeration#Value is missing, so you could define it yourself:
import cascading.tuple.Fields

implicit def valueToFields(v: Enumeration#Value): Fields = v.toString

object Schema extends Enumeration {
  val first, last, phone, age, country = Value // arbitrary number of fields
}

import Schema._

var current = Int.MaxValue

Csv("tutorial/data/phones.txt", separator = " ", fields = Schema)
  .read
  .map(age -> ('current, 'previous)) { a: String =>
    val previous = current
    current = a.toInt
    current -> previous
  }
  .filter('current, 'previous) { age: (Int, Int) => age._1 > age._2 }
  .project(first, age)
  .write(Tsv("tutorial/data/output6.tsv"))
In the end, we expect the result to be the same as that of:
Csv("tutorial/data/phones.txt", separator = " ", fields = Schema)
.read
.map((new Fields("age"), (new Fields("current", "previous"))) { a: String =>
val previous = current
current = a.toInt
current -> previous
}
.filter(new Fields("current", "previous")) { age: (Int, Int) =>
age._1 > age._2
}
.project(new Fields("first", "age"))
.write(Tsv("tutorial/data/output6.tsv"))
The implicit conversions provided by scalding allow you to write the shorter versions instead of these new Fields(...) calls.
An implicit conversion is just a view, which the compiler will use when you pass arguments that are not of the expected type but can be converted to it by that view. For example, because map() expects a pair of Fields while you're passing it a pair of Symbols, Scala will search for an implicit conversion from (Symbol, Symbol) to (Fields, Fields). A short explanation of views can be found here.
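For illustration, here is a minimal sketch of what such a view could look like (this is not scalding's actual definition, just the shape of it):
import cascading.tuple.Fields

// applied by the compiler when a (Symbol, Symbol) pair is passed
// where a (Fields, Fields) pair is expected
implicit def symbolPairToFieldsPair(pair: (Symbol, Symbol)): (Fields, Fields) =
  (new Fields(pair._1.name), new Fields(pair._2.name))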
Scalding 0.8.5 introduced conversions from a product of Enumeration#Value to Fields, but was missing conversions from a pair of values. The develop branch now provides the latter as well.

Related

Scala : How to pass a class field into a method

I'm new to Scala and attempting to do some data analysis.
I have a CSV file with a few headers - let's say item no., item type, month, items sold.
I have made an Item class with the fields of the headers.
I split the CSV into a list, with each element of the list being a row of the CSV file represented by the Item class.
I am attempting to make a method that will create maps based off of the parameter I send in. For example, I may want to group the items sold by month, or by item type. However, I am struggling to pass Item.field into a method.
For example, what I am attempting is something like:
makemaps(Item.month);
makemaps(Item.itemtype);
def makemaps(Item.field) {
  if (item.field == Item.month) {}
  else if (item.field == Item.itemType) {}
}
However my logic for this appears to be wrong. Any ideas?
def makeMap[T](items: Iterable[Item])(extractKey: Item => T): Map[T, Iterable[Item]] =
  items.groupBy(extractKey)
So given this example Item class:
case class Item(month: String, itemType: String, quantity: Int, description: String)
You could have (I believe the type ascriptions are mandatory):
val byMonth = makeMap[String](items)(_.month)
val byType = makeMap[String](items)(_.itemType)
val byQuantity = makeMap[Int](items)(_.quantity)
val byDescription = makeMap[String](items)(_.description)
Note that _.month, for instance, creates a function taking an Item which results in the String contained in the month field (simplifying a little).
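For illustration, the underscore form is just shorthand for an ordinary lambda:
// these two definitions are equivalent
val monthShort: Item => String = _.month
val monthLong: Item => String = (item: Item) => item.month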
You could, if so inclined, save the functions used for extracting keys in the companion object:
object Item {
  val month: Item => String = _.month
  val itemType: Item => String = _.itemType
  val quantity: Item => Int = _.quantity
  val description: Item => String = _.description

  // Allows us to determine if using a predefined extractor or using an ad hoc one
  val extractors: Set[Item => Any] = Set(month, itemType, quantity, description)
}
Then you can pass those around like so:
val byMonth = makeMap[String](items)(Item.month)
The only real change semantically is that you explicitly avoid possible extra construction of lambdas at runtime, at the cost of having the lambdas stick around in memory the whole time. A fringe benefit is that you might be able to cache the maps by extractor if you're sure that the source Items never change: for lambdas, equality is reference equality. This might be particularly useful if you have some class representing the collection of Items as opposed to just using a standard collection, like so:
import scala.collection.immutable

object Items {
  def makeMap[T](items: Iterable[Item])(extractKey: Item => T): Map[T, Iterable[Item]] =
    items.groupBy(extractKey)
}

class Items(val underlying: immutable.Seq[Item]) {
  def makeMap[T](extractKey: Item => T): Map[T, Iterable[Item]] =
    if (Item.extractors.contains(extractKey)) {
      if (extractKey == Item.month) groupedByMonth.asInstanceOf[Map[T, Iterable[Item]]]
      else if (extractKey == Item.itemType) groupedByItemType.asInstanceOf[Map[T, Iterable[Item]]]
      else if (extractKey == Item.quantity) groupedByQuantity.asInstanceOf[Map[T, Iterable[Item]]]
      else if (extractKey == Item.description) groupedByDescription.asInstanceOf[Map[T, Iterable[Item]]]
      else throw new AssertionError("Shouldn't happen!")
    } else {
      Items.makeMap(underlying)(extractKey)
    }

  lazy val groupedByMonth = Items.makeMap[String](underlying)(Item.month)
  lazy val groupedByItemType = Items.makeMap[String](underlying)(Item.itemType)
  lazy val groupedByQuantity = Items.makeMap[Int](underlying)(Item.quantity)
  lazy val groupedByDescription = Items.makeMap[String](underlying)(Item.description)
}
(that is almost certainly a personal record for asInstanceOfs in a small block of code... I'm not sure if I should be proud or ashamed of this snippet)
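A quick usage sketch of the caching wrapper, with made-up items (the values here are hypothetical):
val items = new Items(Vector(
  Item("Jan", "Widget", 3, "small blue widget"),
  Item("Feb", "Widget", 5, "large red widget"),
  Item("Jan", "Gadget", 2, "small green gadget")
))

// hits the cached lazy val, because Item.month is one of the predefined extractors
val byMonth = items.makeMap(Item.month)

// falls back to a fresh groupBy, because this ad hoc lambda is a different function value
val byFirstWord = items.makeMap(_.description.split(" ").head)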

How to print a Monocle Lens as a property accessor style string

Using Monocle I can define a Lens to read a case class member without issue,
val md5Lens = GenLens[Message](_.md5)
This can be used to compare the value of md5 between two objects and fail with an error message that includes the field name when the values differ.
Is there a way to produce a user-friendly string from the Lens alone that identifies the field being read by the lens? I want to avoid providing the field name explicitly, as in:
val md5LensAndName = (GenLens[Message](_.md5), "md5")
If there is a solution that also works with lenses with more than one component then even better. For me it would be good even if the solution only worked to a depth of one.
This is fundamentally impossible. Conceptually, a lens is nothing more than a pair of functions: one to get a value from an object, and one to obtain a new object using a given value. Those functions may or may not be implemented by accessing the source object's fields. In fact, even the GenLens macro can use a chain of field accessors like _.field1.field2 to generate composite lenses into the fields of nested objects. That can be confusing at first, but this feature has its uses. For example, you can decouple the format of data storage from its representation:
import monocle._

case class Person private (value: String) {
  import Person._

  private def replace(array: Array[String], index: Int, item: String): Array[String] = {
    val copy = Array.ofDim[String](array.length)
    array.copyToArray(copy)
    copy(index) = item
    copy
  }

  def replaceItem(index: Int, item: String): Person = {
    val array = value.split(delimiter)
    val newArray = replace(array, index, item)
    val newValue = newArray.mkString(delimiter)
    Person(newValue)
  }

  def getItem(index: Int): String = {
    val array = value.split(delimiter)
    array(index)
  }
}

object Person {
  private val delimiter: String = ";"

  val nameIndex: Int = 0
  val cityIndex: Int = 1

  def apply(name: String, address: String): Person =
    Person(Array(name, address).mkString(delimiter))
}
val name: Lens[Person, String] =
  Lens[Person, String](
    _.getItem(Person.nameIndex)
  )(
    name => person => person.replaceItem(Person.nameIndex, name)
  )

val city: Lens[Person, String] =
  Lens[Person, String](
    _.getItem(Person.cityIndex)
  )(
    city => person => person.replaceItem(Person.cityIndex, city)
  )

val person = Person("John", "London")
val personAfterMove = city.set("New York")(person)

println(name.get(personAfterMove)) // John
println(city.get(personAfterMove)) // New York
While not very performant, that example illustrates the idea: the Person class doesn't have city or address fields, but by wrapping a data extractor and a string-rebuilding function into a Lens, we can pretend it has them. For more complex objects, lens composition works as usual: the inner lens just operates on the extracted object, relying on the outer one to pack it back.
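To make that last point concrete, here is a small composition sketch (the Company wrapper type is hypothetical, and composeLens is the pre-3.x Monocle composition method):
import monocle.Lens
import monocle.macros.GenLens

case class Company(ceo: Person)

val ceo: Lens[Company, Person] = GenLens[Company](_.ceo)

// the generated outer lens composed with the hand-written inner `city` lens from above
val ceoCity: Lens[Company, String] = ceo composeLens city

val acme = Company(Person("John", "London"))
println(ceoCity.get(acme))          // London
println(ceoCity.set("Paris")(acme)) // Company(Person(John;Paris))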

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of a stackoverflow userbase. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from Reputation (in this case "11849") and the number from Age (in this example "35"), and I would like to have them as floats.
The file is located in HDFS, so it comes in as an RDD.
val linesWithAge = lines.filter(line => line.contains("Age=")) // this filters out data which doesn't have an age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) // here I am trying to split the data where there is a "
So when I split on the quotation marks, the reputation is at index 3 and the age at index 23, but how do I assign these to a map or a variable so I can use them as floats?
Also, I need to do this for every line of the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So I added an index to the array, and now I've successfully managed to assign it to a variable, but I can only do it for one item in the RDD. Does anyone know how I could do it for all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name: String): Option[String] = {
  for {
    entry <- src.find(_.startsWith(name))
    value <- entry.split("\"").lift(1)
  } yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap { line =>
  val elements = line.split(" ")
  for {
    age <- findElement(elements, "Age")
    rep <- findElement(elements, "Reputation")
  } yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None, hence the result of the for-comprehension will be None and the record will be dropped by the flatMap operation.
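As a quick sanity check of findElement on hand-built input (the values are hypothetical):
val withAge    = Array("Reputation=\"11849\"", "Age=\"35\"")
val withoutAge = Array("Reputation=\"11849\"")

println(findElement(withAge, "Age"))    // Some(35)
println(findElement(withoutAge, "Age")) // None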
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala provides built-in support for XML, so we can do this:
import scala.xml.XML

// helper function to map Strings to Option, where empty strings become None
def emptyStrToNone(str: String): Option[String] = if (str.isEmpty) None else Some(str)

val xmlReputationByAge = lines.flatMap { line =>
  val record = XML.loadString(line)
  for {
    rep <- emptyStrToNone((record \ "@Reputation").text)
    age <- emptyStrToNone((record \ "@Age").text)
  } yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
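For instance, on a single hand-typed row (trimmed down from the sample above), the attribute lookup works like this:
import scala.xml.XML

val row = XML.loadString("""<row Id="42" Reputation="11849" Age="35" />""")
println((row \ "@Reputation").text) // 11849
println((row \ "@Age").text)        // 35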
First, you need a function which extracts and converts the value for a given key of your line (getValueForKeyAs below), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs(line, "Age")(_.toFloat), getValueForKeyAs(line, "Reputation")(_.toFloat)))
This should give you an rdd of type RDD[(Float,Float)]
getValueForKeyAs could be implemented like this:
def getValueForKeyAs[A](line: String, key: String)(convert: String => A): A = {
  val res = line.split(key + "=")
  if (res.size == 1) throw new RuntimeException(s"no value for key $key")
  // the raw value sits between the first pair of quotes after key=
  convert(res(1).split("\"")(1))
}
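A quick check against a trimmed-down sample row (hypothetical values, runnable outside Spark):
val sample = """<row Id="42" Reputation="11849" Age="35" />"""
println(getValueForKeyAs(sample, "Age")(_.toFloat))        // 35.0
println(getValueForKeyAs(sample, "Reputation")(_.toFloat)) // 11849.0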

spark scala - replace text if exists in list

I have 2 datasets.
One is a dataframe with a bunch of data, one column has comments (a string).
The other is a list of words.
If a comment contains a word in the list, I want to replace the word in the comment with ##### and return the comment in full with the replaced words.
Here's some sample data:
CommentSample.txt
1 A badword small town
2 "Love the truck, though rattle is annoying."
3 Love the paint!
4
5 "Like that you added the ""oh badword2"" handle to passenger side."
6 "badword you. specific enough for you, badword3?"
7 This car is a piece if badword2
ProfanitySample.txt
badword
badword2
badword3
Here's my code so far:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Response(UniqueID: Int, Comment: String)
val response = sc.textFile("file:/data/CommentSample.txt").map(_.split("\t")).filter(_.size == 2).map(r => Response(r(0).trim.toInt, r(1).trim.toString, r(10).trim.toInt)).toDF()
var profanity = sc.textFile("file:/data/ProfanitySample.txt").map(x => (x.toLowerCase())).toArray();
def replaceProfanity(s: String): String = {
  val l = s.toLowerCase()
  val r = "#####"
  if (profanity.contains(s))
    r
  else
    s
}

def processComment(s: String): String = {
  val commentWords = sc.parallelize(s.split(' '))
  commentWords.foreach(replaceProfanity)
  commentWords.collect().mkString(" ")
}
response.select(processComment("Comment")).show(100)
It compiles, it runs, but the words are not replaced.
I don't know how to debug in scala.
I'm totally new! This is my first project ever!
Many thanks for any pointers.
-M
First, I think the usecase you describe here won't benefit much from the use of DataFrames - it's simpler to implement using RDDs only (DataFrames are mostly convenient when your transformations can easily be described using SQL, which isn't the case here).
So - here's a possible implementation using RDDs. This assumes the list of profanities isn't too large (i.e. up to ~thousands), so we can collect it into non-distributed memory. If that's not the case, a different approach (involving a join) might be needed.
import org.apache.spark.rdd.RDD

case class Response(UniqueID: Int, Comment: String)

val mask = "#####"

val responses: RDD[Response] = sc.textFile("file:/data/CommentSample.txt")
  .map(_.split("\t"))
  .filter(_.size == 2)
  .map(r => Response(r(0).trim.toInt, r(1).trim))

val profanities: Array[String] = sc.textFile("file:/data/ProfanitySample.txt").collect()

val result = responses.map(r => {
  // using foldLeft here means we'll replace profanities one by one,
  // with the result of each replace as the input of the next,
  // starting with the original comment
  profanities.foldLeft(r.Comment)({
    case (updatedComment, profanity) => updatedComment.replaceAll(s"(?i)\\b$profanity\\b", mask)
  })
})

result.take(10).foreach(println) // just printing some examples...
Note that the case-insensitivity and the "words only" limitations are implemented in the regex itself: "(?i)\\bSomeWord\\b".
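For example (plain Scala, no Spark needed to see the effect):
// (?i) makes the match case-insensitive, \b anchors on word boundaries,
// so "Badword" is masked but "badwords" is left untouched
println("Badword you".replaceAll("(?i)\\bbadword\\b", "#####"))    // ##### you
println("badwords ahead".replaceAll("(?i)\\bbadword\\b", "#####")) // badwords ahead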

How to read input from a file and convert data lines of the file to List[Map[Int,String]] using scala?

My question is: how do I read input from a file and convert the data lines of the file to List[Map[Int,String]] using Scala? Here I give a dataset as the input. My code is:
def id3(attrs: Attributes, examples: List[Example], label: Symbol): Node = {
  level = level + 1
  // if all the examples have the same label, return a new node with that label
  if (examples.forall(x => x(label) == examples(0)(label))) {
    new Leaf(examples(0)(label))
  } else {
    for (a <- attrs.keySet - label) { // except label, take all attrs
      ("Information gain for %s is %f".format(a, informationGain(a, attrs, examples, label)))
    }
    // find the best splitting attribute - this is an argmax on a function over the list
    var bestAttr: Symbol = argmax(attrs.keySet - label, (x: Symbol) => informationGain(x, attrs, examples, label))
    // now we produce a new branch, which splits on that node, and recurse down the nodes.
    var branch = new Branch(bestAttr)
    for (v <- attrs(bestAttr)) {
      val subset = examples.filter(x => x(bestAttr) == v)
      if (subset.size == 0) {
        // println(levstr+"Tiny subset!")
        // zero subset, we replace with a leaf labelled with the most common label in the examples
        val m = examples.map(_(label))
        val mostCommonLabel = m.toSet.map((x: Symbol) => (x, m.count(_ == x))).maxBy(_._2)._1
        branch.add(v, new Leaf(mostCommonLabel))
      } else {
        // println(levstr+"Branch on %s=%s!".format(bestAttr,v))
        branch.add(v, id3(attrs, subset, label))
      }
    }
    level = level - 1
    branch
  }
}
}
object samplet {
  def main(args: Array[String]) {
    var attrs: sample.Attributes = Map()
    attrs += ('0 -> Set('abc, 'nbv, 'zxc))
    attrs += ('1 -> Set('def, 'ftr, 'tyh))
    attrs += ('2 -> Set('ghi, 'azxc))
    attrs += ('3 -> Set('jkl, 'fds))
    attrs += ('4 -> Set('mno, 'nbh))

    val examples: List[sample.Example] = List(
      Map(
        '0 -> 'abc,
        '1 -> 'def,
        '2 -> 'ghi,
        '3 -> 'jkl,
        '4 -> 'mno
      ),
      ........................
    )

    // obviously we can't use the label as an attribute, that would be silly!
    val label = 'play

    println(sample.try(attrs, examples, label).getStr(0))
  }
}
But how do I change this code to accept input from a .csv file?
I suggest you use Java's io/nio standard library to read your CSV file. I think there is no relevant drawback in doing so.
But the first question we need to answer is: where do we read the file in the code? The parsed input seems to replace the value of examples. This fact also hints at what type the parsed CSV input must have, namely List[Map[Symbol, Symbol]]. So let us declare a new class:
class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {
  def getInput(file: Path): List[Map[Symbol, Symbol]] = ???
}
Note that the Charset is only needed if we must distinguish between differently encoded CSV-files.
Okay, so how do we implement the method? It should do the following:
Create an appropriate input reader
Read all lines
Split each line at the comma-separator
Transform each substring into the symbol it represents
Build a map from the list of symbols, using the attributes as keys
Create and return the list of maps
Or expressed in code:
import java.io.BufferedReader
import java.nio.charset.Charset
import java.nio.file.{Files, Path}
import scala.collection.JavaConversions

class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {

  val Attributes = List('outlook, 'temperature, 'humidity, 'wind, 'play)
  val Separator = ","

  /** Get the desired input from the CSV file. Does not perform any checks, i.e., there are no
    * guarantees on what happens if the input is malformed. */
  def getInput(file: Path): List[Map[Symbol, Symbol]] = {
    val reader = Files.newBufferedReader(file, charset)
    /* Read the whole file and discard the first line (the header) */
    inputWithHeader(reader).tail
  }

  /** Reads all lines in the CSV file using [[java.io.BufferedReader]]. There are many ways to do
    * this and this is probably not the prettiest. */
  private def inputWithHeader(reader: BufferedReader): List[Map[Symbol, Symbol]] = {
    (JavaConversions.asScalaIterator(reader.lines().iterator()) foldLeft Nil.asInstanceOf[List[Map[Symbol, Symbol]]]) {
      (accumulator, nextLine) => parseLine(nextLine) :: accumulator
    }.reverse
  }

  /** Parse an entry. Does not verify the input: if there are fewer attributes than columns or vice
    * versa, zip creates a list of the size of the shorter list. */
  private def parseLine(line: String): Map[Symbol, Symbol] =
    (Attributes zip (line split Separator map parseSymbol)).toMap

  /** Create a symbol from a String... we could also check whether the string represents a valid symbol. */
  private def parseSymbol(symbolAsString: String): Symbol = Symbol(symbolAsString)
}
Caveat: Expecting only valid input, we are certain that the individual symbol representations do not contain the comma-separation character. If this cannot be assumed, then the code as is would fail to split certain valid input strings.
To use this new code, we could change the main-method as follows:
def main(args: Array[String]) {
  val csvInputFile: Option[Path] = args.headOption map (p => Paths get p)
  val examples = (csvInputFile map new InputFromCsvLoader().getInput).getOrElse(exampleInput)
  // ... your code
Here, examples uses the value exampleInput, which is the current, hardcoded value of examples if no input argument is specified.
Important: In the code, all error handling has been omitted for convenience. In most cases, errors can occur when reading from files, and user input must be considered invalid, so sadly, error handling at the boundaries of your program is usually not optional.
Side-notes:
Try not to use null in your code. Returning Option[T] is a better option than returning null, because it makes "nullness" explicit and provides static safety thanks to the type-system.
The return-keyword is not required in Scala, as the last value of a method is always returned. You can still use the keyword if you find the code more readable or if you want to break in the middle of your method (which is usually a bad idea).
Prefer val over var, because immutable values are much easier to understand than mutable values.
The code will fail with the provided CSV string, because it contains the symbols TRUE and FALSE, which are not legal according to your program's logic (they should be true and false instead).
Add all information to your error messages. Your error message only tells me that a value for the attribute 'wind is bad, but it does not tell me what the actual value is.
To read a CSV file:
import scala.io.Source

val datalines = Source.fromFile(filepath).getLines()
Now datalines contains all the lines of the CSV file (as an iterator of strings).
Next, convert each line into a Map[Int,String]
val datamap = datalines.map { line =>
  line.split(",").zipWithIndex.map { case (word, idx) => idx -> word }.toMap
}
Here, we split each line on ",", then construct a map with the column number as key and the word at that position as value.
Next, if we want a List[Map[Int,String]]:
val datamap = datalines.map { line =>
  line.split(",").zipWithIndex.map { case (word, idx) => idx -> word }.toMap
}.toList
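For a quick illustration of the resulting shape, with a hand-made line (hypothetical values):
val line = "sunny,hot,high,weak"
val row: Map[Int, String] =
  line.split(",").zipWithIndex.map { case (word, idx) => idx -> word }.toMap

println(row) // Map(0 -> sunny, 1 -> hot, 2 -> high, 3 -> weak)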