Scala - Expanding an argument list in a pattern matching expression

I'm very new to Scala and trying to use it as an interface to Spark. I'm running into a problem making a generic CSV to DataFrame function. For example, I've got a CSV with about 50 fields, the first three of which are task, name, and id. I can get the following to work:
val reader = new CSVReader(new StringReader(txt))
reader.readAll().map(_ match {
  case Array(task, name, id, _*) => Row(task, name, id)
  case unexpectedArrayForm =>
    throw new RuntimeException(
      "Record did not have correct number of fields: " + unexpectedArrayForm.mkString(","))
})
However, I'd rather not have to hard-code the number of fields needed to create a Spark Row. I tried this:
val reader = new CSVReader(new StringReader(txt))
reader.readAll().map(_ match {
  case Array(args @ _*) => Row(args)
  case unexpectedArrayForm =>
    throw new RuntimeException(
      "Record did not have correct number of fields: " + unexpectedArrayForm.mkString(","))
})
But it just creates a Row object with a single element. How can I make it expand the args in Row(args) so that if I have an array of N elements I'll get a Row with N elements?

Change your input to be variable length by adding _*:
Row(args: _*)
This is what Row accepts per its apply signature.
In fact, you don't even need the pattern match: each record is already of the right sequence type, so you can pass it straight to Row:
reader.readAll().map(Row(_: _*))
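To see the expansion in isolation, here is a minimal sketch with a toy row function standing in for Spark's Row (the function and values are made up for illustration):
def row(values: Any*): Seq[Any] = values.toSeq

val fields = Array("taskA", "alice", "42")
row(fields)     // Seq with a single element: the whole array
row(fields: _*) // Seq("taskA", "alice", "42"): one element per field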

This should do the trick:
val reader = new CSVReader(new StringReader(txt))
reader.readAll().map(_ match {
  case a: Array[String] => Row(a: _*)
  case unexpectedArrayForm =>
    throw new RuntimeException(
      "Record did not have correct number of fields: " + unexpectedArrayForm.mkString(","))
})

Related

Accessing parts of a split string in Scala

I am working on a Scala application. I have a string as follows:
val str = "abc,def,xyz"
I want to split this string and access the split parts separately, like abc, def and xyz. My code is as follows:
val splittedString = str.split(',')
To access each part of this split string I am trying something like splittedString._1, splittedString._2, splittedString._3. But IntelliJ is giving me an error stating "cannot resolve symbol _1", and the same error for parts 2 and 3 as well. How can I access each element of the split string?
The method split is defined over Strings to return an Array[String].
What you can do is access by (zero-based) index, splittedString(0) being the first item.
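For example, with the string from the question:
val str = "abc,def,xyz"
val splittedString = str.split(",")
splittedString(0) // "abc"
splittedString(1) // "def"
splittedString(2) // "xyz"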
Alternatively, if you know the length of resulting array you want to obtain, you can convert it to a tuple and access with the accessor methods you were referring to:
val tuple =
  str.split(",") match {
    case Array(a, b, c) => (a, b, c)
    case _ => throw new IllegalArgumentException
  }
tuple._1 will now contain abc in your example.

How to drop multiple columns in Spark with Scala using a regex?

Say I have some column names which end with '_undo'. I need to remove these columns. Instead of dropping them one by one, or storing them in a list before dropping, can I drop them all in one go?
df.drop(//drop at one go for those columns ending with _undo)
val customDF = df.select(df.columns.filter(colName => checkname(colName)).map(colName => new Column(colName)): _*)
where checkname is a simple user-defined predicate that returns true for the column names you want to keep and false for those matching your pattern. For example:
def checkname(colName: String): Boolean = {
  // keep every column that does NOT end with "_undo" (adapt the pattern as needed)
  !colName.endsWith("_undo")
}
What about this?
df.drop(df.columns.filter(c => "regexp".r.pattern.matcher(c).matches): _*)
or
df.drop(df.columns.filter(_.endsWith("_undo")): _*)
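For instance, a minimal runnable sketch (the SparkSession setup and column names here are made up for illustration):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("drop-undo-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "a", "b")).toDF("id", "name_undo", "value_undo")
val cleaned = df.drop(df.columns.filter(_.endsWith("_undo")): _*)
cleaned.printSchema() // only "id" remains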

Scala - how to filter a StructType with a list of StructField names?

I'm writing a method to parse a schema and want to filter the resulting StructType with a list of column names, which is a subset of the StructField names of the original schema.
As a result, if a flag isFilteringReq = true, I want to return a StructType containing only the StructFields with names from specialColumnNames, in the same order. If the flag is false, then return the original StructType.
val specialColumnNames = Seq("metric_1", "metric_2", "metric_3")
First I'm getting an original schema with pattern-matching.
val customSchema: StructType = schemaType match {
  case "type_1" => getType1chema()
  case "type_2" => getType2chema()
}
There are two problems:
1 - I wasn't able to apply .filter() directly to the customSchema right after the closing brace, and am getting a Cannot resolve symbol filter. So I wrote a separate method makeCustomSchema, but I don't need a separate object. Is there a more elegant way to apply the filtering in this case?
2 - I could filter the originalStruct, but only with a single hardcoded column name. How should I pass the specialColumnNames to contains()?
def makeCustomSchema(originalStruct: StructType, isFilteringReq: Boolean, updColumns: Seq[String]) =
  if (isFilteringReq) {
    originalStruct.filter(s => s.name.contains("metric_1"))
  } else {
    originalStruct
  }
val newSchema = makeCustomSchema(customSchema, isFilteringReq, specialColumnNames)
Instead of passing a Seq, pass a Set, and you can filter on whether each field is in the set or not.
Also, I wouldn't use a flag; instead, you could pass an empty Set when there's no filtering, or use Option[Set[String]].
Anyway, you could also use the copy method that comes for free with case classes.
Something like this should work.
def makeCustomSchema(originalStruct: StructType, updColumns: Set[String]): StructType = {
  updColumns match {
    case s if s.isEmpty => originalStruct
    case _ =>
      originalStruct.copy(
        fields = originalStruct.fields.filter(f => updColumns.contains(f.name)))
  }
}
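For example, a quick usage sketch (field names and types are made up for illustration):
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("metric_1", DoubleType),
  StructField("metric_2", DoubleType),
  StructField("other", StringType)))

makeCustomSchema(schema, Set("metric_1", "metric_2"))
// StructType(StructField(metric_1,DoubleType,true), StructField(metric_2,DoubleType,true))
makeCustomSchema(schema, Set.empty[String]) // returns the original schema unchanged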
Usually you don't need to build structs like this; have you tried using the drop() method on DataFrame/Dataset?

Filter a list based on a parameter

I want to filter the employees based on name and return the id of each employee
case class Company(emp: List[Employee])
case class Employee(id: String, name: String)

val emp1 = Employee("1", "abc")
val emp2 = Employee("2", "def")
val cmpy = Company(List(emp1, emp2))

val a = cmpy.emp.find(_.name == "abc")
val b = a.map(_.id)
val c = cmpy.emp.find(_.name == "def")
val d = c.map(_.id)

println(b)
println(d)
I want to create a generic function that contains the filter logic, so that I can use it with different kinds of lists and filter parameters. For example, an employeeIdByName function which takes:
- the criterion to filter on and the field to return (e.g. _.name and id)
- the list to filter (e.g. cmpy.emp)
- the value to compare against (e.g. abc / def)
Is there a better way to achieve this result? So far I have used map and find.
If you really want a "generic" filter function, that can filter any list of elements, by any property of these elements, based on a closed set of "allowed" values, while mapping results to some other property - it would look something like this:
def filter[T, P, R](
    list: List[T],          // input list of elements with type T (in our case: Employee)
    propertyGetter: T => P, // function extracting the value for comparison, in our case from Employee to String
    values: List[P],        // "allowed" values for the result of propertyGetter
    resultMapper: T => R    // function extracting the result from each item, in our case from Employee to String
): List[R] = {
  list
    // first we keep only items for which the result of
    // applying "propertyGetter" is one of the "allowed" values:
    .filter(item => values.contains(propertyGetter(item)))
    // then we map the remaining items to the result using "resultMapper"
    .map(resultMapper)
}
// for example, we can use it to filter by name and return id:
filter(
  List(emp1, emp2),
  (emp: Employee) => emp.name, // function that takes an Employee and returns its name
  List("abc"),
  (emp: Employee) => emp.id    // function that takes an Employee and returns its id
)
// List("1")
However, this is a ton of noise around a very simple Scala operation: filtering and mapping a list. This specific use case can be written as:
val goodNames = List("abc")
val input = List(emp1, emp2)
val result = input.filter(emp => goodNames.contains(emp.name)).map(_.id)
Or even:
val result = input.collect {
  case Employee(id, name) if goodNames.contains(name) => id
}
Scala's built-in map, filter, collect functions are already "generic" in the sense that they can filter/map by any function that applies to the elements in the collection.
You can use Shapeless. If you have an employees: List[Employee], you can use
import shapeless._
import shapeless.record._
employees.map(LabelledGeneric[Employee].to(_).toMap)
to convert each Employee to a map from field key to field value. Then you can apply the filters on the maps.

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of the Stack Overflow userbase. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from Reputation (in this case "11849") and the number from Age (in this example "35"), and I would like to have them as floats.
The file is located in HDFS, so it comes in as an RDD:
val linesWithAge = lines.filter(line => line.contains("Age=")) // keep only the lines that have an Age field
val repSplit = linesWithAge.flatMap(line => line.split("\"")) // split each line on the quotation marks
When I split on the quotation marks, the reputation is at index 3 and the age at index 23, but how do I assign these to a map or a variable so I can use them as floats? Also, I need to do this for every line of the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) // transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map { case (k, v) => (v, k) }
val b = indexKey.lookup(3)
println(b)
So I added an index to the array, and now I've successfully managed to assign one field to a variable, but I can only do it for one item in the RDD. Does anyone know how I could do it for all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name: String): Option[String] = {
  for {
    entry <- src.find(_.startsWith(name))
    value <- entry.split("\"").lift(1)
  } yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap { line =>
  val elements = line.split(" ")
  for {
    age <- findElement(elements, "Age")
    rep <- findElement(elements, "Reputation")
  } yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None, the result of the for-comprehension will then be None as well, and the record will be dropped by the flatMap operation.
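To see why, here is the Option behaviour on its own (made-up values):
for { age <- Some("35"); rep <- Some("11849") } yield (rep.toInt, age.toInt)           // Some((11849,35))
for { age <- Option.empty[String]; rep <- Some("11849") } yield (rep.toInt, age.toInt) // None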
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala provides built-in support for XML, so we can do this:
import scala.xml.XML

// helper function to map Strings to Option, where empty strings become None
def emptyStrToNone(str: String): Option[String] = if (str.isEmpty) None else Some(str)

val xmlReputationByAge = lines.flatMap { line =>
  val record = XML.loadString(line)
  for {
    rep <- emptyStrToNone((record \ "@Reputation").text)
    age <- emptyStrToNone((record \ "@Age").text)
  } yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
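For reference, the attribute lookup on its own (a quick sketch):
import scala.xml.XML
val rec = XML.loadString("""<row Reputation="11849" Age="35"/>""")
(rec \ "@Reputation").text // "11849"
(rec \ "@Age").text        // "35"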
First, you need a function which extracts and converts the value for a given key of your line (getValueForKeyAs below), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs(line, "Age", _.toFloat), getValueForKeyAs(line, "Reputation", _.toFloat)))
This should give you an RDD of type RDD[(Float, Float)].
getValueForKeyAs could be implemented like this:
def getValueForKeyAs[A](line: String, key: String, convert: String => A): A = {
  val res = line.split(key + "=")
  if (res.size == 1) throw new RuntimeException(s"no value for key $key")
  // take the text between the first pair of quotes after the key, then convert it
  convert(res(1).split("\"")(1))
}