Scala: dynamically joining data frames

I have data split into multiple files. I want to load and join the files.
I'd like to build a dynamic function that
1. joins n data files into a single data frame,
2. given a file location and a join column (e.g., pk) as input.
I think this can be done with foldLeft, but I am not quite sure how.
Here is my code so far:
@throws[Exception]
def dataJoin(path: String, fileNames: String*): DataFrame = {
  try {
    val dfList: ArrayBuffer[DataFrame] = new ArrayBuffer
    for (fileName <- fileNames) {
      val df: DataFrame = DataFrameUtils.openFile(spark, s"$path${File.separator}$fileName")
      dfList += df
    }
    dfList.foldLeft {
      (df, df1) => joinDataFrames(df, df1, "UID")
    }
  } catch {
    case e: Exception => throw new Exception(e)
  }
}

def joinDataFrames(df: DataFrame, df1: DataFrame, joinColumn: String): Unit = {
  df.join(df1, Seq(joinColumn))
}

foldLeft might indeed be suitable here, but it requires a "zero" element to start the folding from (in addition to the folding function). In this case, that "zero" can be the first DataFrame:
dfList.tail.foldLeft(dfList.head) { (df1, df2) => df1.join(df2, "UID") }
To avoid errors, you probably want to make sure the list isn't empty before trying to access that first item - one way of doing that would be using pattern matching.
dfList match {
  case head :: tail => tail.foldLeft(head) { (df1, df2) => df1.join(df2, "UID") }
  case Nil => spark.emptyDataFrame
}
Lastly, it's simpler, safer and more idiomatic to map over a collection instead of iterating over it and populating another (empty, mutable) collection:
val dfList = fileNames.map(fileName => DataFrameUtils.openFile(spark, s"$path${File.separator}$fileName"))
Altogether:
def dataJoin(path: String, fileNames: String*): DataFrame = {
  val dfList = fileNames
    .map(fileName => DataFrameUtils.openFile(spark, s"$path${File.separator}$fileName"))
    .toList

  dfList match {
    case head :: tail => tail.foldLeft(head) { (df1, df2) => df1.join(df2, "UID") }
    case Nil => spark.emptyDataFrame
  }
}
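For illustration, a call could look like the following; the directory and file names here are hypothetical, and DataFrameUtils.openFile is the helper referenced in the question:

// join three hypothetical files on the "UID" column
val joined = dataJoin("/data/input", "customers.parquet", "orders.parquet", "payments.parquet")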

Related

How to convert multiple strings in a list to be keys in a map

I am trying to write a function that would return a map in which every word is a key and the values are pages at which the word shows up. Currently, I am stuck at the point where I have data of the following type: List(List(words),page).
Is there any sensible way to reformat this data? If so, please explain, as I have no idea how to even begin.
object G {
  def main(args: Array[String]): Unit = {
    stwórzIndeks()
  }

  def stwórzIndeks(): Unit = {
    val linie = io.Source
      .fromResource("tekst.txt")
      .getLines
      .toList
    val zippedLinie: List[(String, Int)] = linie.zipWithIndex
    val splitt = zippedLinie.foldLeft(List.empty[(List[String], Int)])((acc, curr) => {
      curr match {
        case (arr, int) =>
          val toAdd = (arr.split("\\s+").toList, zippedLinie.length - int)
          toAdd +: acc
      }
    })
  }
}
You can replace that foldLeft with a flatMap with an inner map to get a big List of (word, page).
val wordsAndPage = zippedLinie.flatMap {
  case (line, idx) =>
    line.split("\\s+").toList.map(word => word -> (idx + 1))
}
After that you can check for one of the grouping methods in the scaladoc.
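For example, a minimal sketch of that grouping step (assuming the wordsAndPage list from the snippet above) could be:

// group the (word, page) pairs by word, then keep only the page numbers
val index: Map[String, List[Int]] =
  wordsAndPage
    .groupBy { case (word, _) => word }
    .map { case (word, pairs) => word -> pairs.map(_._2) }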

Exception handling in Spark Scala UDF

def parse_values(value: String) = {
  val values = value.split(",").map(_.trim)
  values.foldLeft(Array[(Int, Double)]()) {
    case (acc, present) =>
      val Array(k, v) = present.split(",")(0).split(":")
      acc :+ (k.trim.toInt, v.trim.toDouble)
  }
}
I am currently using the above UDF to parse a column of strings into an array of keys and values, e.g. "50:63.25,100:58.38" becomes [[50, 63.25], [100, 58.38]].
In some cases, the string is "\N" and I am unable to parse the column value.
If the string is "\N" then I should return an empty array. Can anyone help me handle this exception or add a new case? I am new to Spark and Scala.
Error: scala.MatchError: [Ljava.lang.String;@497cb6a9 (of class [Ljava.lang.String;)
You need to check that the resulting Array has two elements; a pattern match like the following avoids the parse error:
def parse_values(value: String) = {
  val values = value.split(",").map(_.trim)
  values.foldLeft(Array[(Int, Double)]()) {
    case (acc, present) =>
      val Array(k, v) = present.split(",")(0).split(":") match {
        case Array(_) => Array("0", "0.0")
        case arr => arr
      }
      acc :+ (k.trim.toInt, v.trim.toDouble)
  }
}
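If the requirement is specifically to return an empty array for a "\N" input (rather than a placeholder (0, 0.0) entry), one possible sketch is to short-circuit before folding and skip anything that does not split into exactly two parts; this is an alternative, not part of the answer above:

def parse_values(value: String): Array[(Int, Double)] =
  if (value.trim == "\\N") Array.empty[(Int, Double)]  // "\N" input -> empty array
  else value.split(",").map(_.trim).flatMap { present =>
    present.split(":") match {
      case Array(k, v) => Some(k.trim.toInt -> v.trim.toDouble)
      case _           => None                          // skip malformed entries
    }
  }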

Scala alternative to series of if statements that append to a list?

I have a Seq[String] in Scala, and if the Seq contains certain Strings, I append a relevant message to another list.
Is there a more 'scalaesque' way to do this, rather than a series of if statements appending to a list like I have below?
val result = new ListBuffer[Err]()
val malformedParamNames = // A Seq[String]
if (malformedParamNames.contains("$top")) result += IntegerMustBePositive("$top")
if (malformedParamNames.contains("$skip")) result += IntegerMustBePositive("$skip")
if (malformedParamNames.contains("modifiedDate")) result += FormatInvalid("modifiedDate", "yyyy-MM-dd")
...
result.toList
If you want to use some Scala collections sugar, I would use:
sealed trait Err
case class IntegerMustBePositive(msg: String) extends Err
case class FormatInvalid(msg: String, format: String) extends Err
val malformedParamNames = Seq[String]("$top", "aa", "$skip", "ccc", "ddd", "modifiedDate")
val result = malformedParamNames.map { v =>
  v match {
    case "$top" => Some(IntegerMustBePositive("$top"))
    case "$skip" => Some(IntegerMustBePositive("$skip"))
    case "modifiedDate" => Some(FormatInvalid("modifiedDate", "yyyy-MM-dd"))
    case _ => None
  }
}.flatten
result.toList
Be warned: if you ask for a Scala-esque way of doing things, there are many possibilities.
The map function combined with flatten can be simplified by using flatMap:
sealed trait Err
case class IntegerMustBePositive(msg: String) extends Err
case class FormatInvalid(msg: String, format: String) extends Err
val malformedParamNames = Seq[String]("$top", "aa", "$skip", "ccc", "ddd", "modifiedDate")
val result = malformedParamNames.flatMap {
  case "$top" => Some(IntegerMustBePositive("$top"))
  case "$skip" => Some(IntegerMustBePositive("$skip"))
  case "modifiedDate" => Some(FormatInvalid("modifiedDate", "yyyy-MM-dd"))
  case _ => None
}
result
The most 'Scala-esque' version I can think of while keeping it readable would be:
val map = scala.collection.immutable.ListMap(
  "$top" -> IntegerMustBePositive("$top"),
  "$skip" -> IntegerMustBePositive("$skip"),
  "modifiedDate" -> FormatInvalid("modifiedDate", "yyyy-MM-dd"))

val result = for {
  (k, v) <- map
  if malformedParamNames contains k
} yield v

// or
val result2 = map.filterKeys(malformedParamNames.contains).values.toList
Benoit's is probably the most scala-esque way of doing it, but depending on who's going to be reading the code later, you might want a different approach.
// Some type definitions omitted
val malformations = Seq[(String, Err)](
  ("$top", IntegerMustBePositive("$top")),
  ("$skip", IntegerMustBePositive("$skip")),
  ("modifiedDate", FormatInvalid("modifiedDate", "yyyy-MM-dd"))
)
If you need a list and the order is significant:
val result = (malformations.foldLeft(List.empty[Err]) { (acc, pair) =>
  if (malformedParamNames.contains(pair._1)) {
    pair._2 :: acc // prepend to the list for faster performance
  } else acc
}).reverse // and reverse since we were prepending
If the order isn't significant (although in that case you might want a Set instead of a List):
val result = (malformations.foldLeft(Set.empty[Err]) { (acc, pair) =>
  if (malformedParamNames.contains(pair._1)) {
    acc + pair._2
  } else acc
}).toList // omit the .toList if you're OK with just a Set
If the predicates in the repeated ifs are more complex/less uniform, then the type for malformations might need to change, as they would if the responses changed, but the basic pattern is very flexible.
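For instance, one way to generalize the pattern when the checks are no longer simple contains calls is to pair each Err with a predicate; this is only an illustrative sketch, not part of the original answer:

// each entry carries its own check instead of a plain key
val malformations: Seq[(Seq[String] => Boolean, Err)] = Seq(
  ((params: Seq[String]) => params.contains("$top")) -> IntegerMustBePositive("$top"),
  ((params: Seq[String]) => params.contains("$skip")) -> IntegerMustBePositive("$skip"),
  ((params: Seq[String]) => params.contains("modifiedDate")) -> FormatInvalid("modifiedDate", "yyyy-MM-dd")
)

val result: List[Err] =
  malformations.collect { case (pred, err) if pred(malformedParamNames) => err }.toList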
In this solution we define a list of mappings that pair your IF condition with its THEN result, then iterate over the input list and apply the mapping wherever it matches.
//                   IF               THEN
case class Operation(matcher: String, action: String)

def processInput(input: List[String]): List[String] = {
  val operations = List(
    Operation("$top", "integer must be positive"),
    Operation("$skip", "skip value"),
    Operation("$modify", "modify the date")
  )

  input.flatMap { in =>
    operations.find(_.matcher == in).map { _.action }
  }
}
println(processInput(List("$skip","$modify", "$skip")));
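// prints: List(skip value, modify the date, skip value)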
A breakdown
operations.find(_.matcher == in) // find an operation in our
// list matching the input we are
// checking. Returns Some or None
.map { _.action } // if some, replace input with action
// if none, do nothing
input.flatMap { in => // inputs are processed, converted
// to some(action) or none and the
// flatten removes the some/none
// returning just the strings.

Scala: removing if/else statements and writing in a functional way

Let's say I have the following tuple:
(colType, colDocV)
where colType is a Boolean and colDocV is a String.
Depending on those two values, I will apply some chunk of code that applies transformations to a DataFrame.
Now, this code works. However, I am not convinced this is the proper way to write functional code, and I don't know which of these 3 approaches would improve its quality and remove all the if / else if / else branches:
Should I apply some kind of design pattern and which one?
Should I use some kind of pattern matching?
Should I use some anonymous function?
if (colDocV) {
  val newCol = udf(UDFHashCode.udfHashCode).apply(col(columnName))
  dataframe.withColumn(columnName, newCol)
} else if (colType.contains("string") || colType.contains("text")) {
  val newCol = udf(Entropy.stringEntropyFunc).apply(col(columnName)).cast(DoubleType)
  dataframe.withColumn(columnName, newCol)
} else if (colType.contains("date")) {
  val newCol = udf(DateUtils.getTimeAsDoubleFunc).apply(col(columnName)).cast(DoubleType)
  dataframe.withColumn(columnName, newCol)
} else if (colType.contains("long")) {
  dataframe.withColumn(columnName, dataframe(columnName).cast(DoubleType))
} else {
  dataframe.drop(columnName) // dropping a column that cannot be processed
}
You can do this with a match statement and a bunch of regexps.
val str = ".*(?:string|text).*".r
val date = ".*date.*".r
val long = ".*long.*".r

def col(tuple: (Boolean, String)) = tuple match {
  case (true, _) => Some(udf(...))
  case (_, str()) => Some(udf(...))
  case (_, date()) => Some(udf(...))
  case (_, long()) => Some(udf(...))
  case _ => None
}

col(colType -> colDocV)
  .fold(dataframe.drop(columnName)) { newCol =>
    dataframe.withColumn(columnName, newCol)
  }
From what I understand of your question, the following can be a solution using match/case:
def callUdf(colDocV: String, colType: Boolean, dataframe: DataFrame) = (colDocV, colType) match {
  case x if x._1.contains("string") || x._1.contains("text") =>
    dataframe.withColumn(columnName, udf(Entropy.stringEntropyFunc).apply(col(columnName)).cast(DoubleType))
  case x if x._1.contains("date") =>
    dataframe.withColumn(columnName, udf(DateUtils.getTimeAsDoubleFunc).apply(col(columnName)).cast(DoubleType))
  case x if x._1.contains("long") =>
    dataframe.withColumn(columnName, dataframe(columnName).cast(DoubleType))
  case _ => dataframe.drop(columnName)
}

Pattern matching and RDDs

I have a very simple (n00b) question but I'm somehow stuck. I'm trying to read a set of files in Spark with wholeTextFiles and want to return an RDD[LogEntry], where LogEntry is just a case class. I want to end up with an RDD of valid entries and I need to use a regular expression to extract the parameters for my case class. When an entry is not valid I do not want the extractor logic to fail but simply write an entry in a log. For that I use LazyLogging.
object LogProcessors extends LazyLogging {

  def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[Option[CleaningLogEntry]] = {
    val pattern = "<some pattern>".r
    val logs = sc.wholeTextFiles(path, numPartitions)
    val entries = logs.map(fileContent => {
      val file = fileContent._1
      val content = fileContent._2
      content.split("\\r?\\n").map(line => line match {
        case pattern(dt, ev, seq) => Some(LogEntry(<...>))
        case _ => logger.error(s"Cannot parse $file: $line"); None
      })
    })
That gives me an RDD[Array[Option[LogEntry]]]. Is there a neat way to end up with an RDD of the LogEntrys? I'm somehow missing it.
I was thinking about using Try instead, but I'm not sure if that's any better.
Thoughts greatly appreciated.
To get rid of the Array, simply replace the map call with flatMap: flatMap treats a result of type Traversable[T] for each record as separate records of type T.
To get rid of the Option - collect only the successful ones: entries.collect { case Some(entry) => entry }.
Note that this collect(p: PartialFunction) overload (which performs something equivalent to a map and a filter combined) is very different from collect() (which sends all data to the driver).
Altogether, this would be something like:
def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[CleaningLogEntry] = {
  val pattern = "<some pattern>".r
  val logs = sc.wholeTextFiles(path, numPartitions)
  val entries = logs.flatMap(fileContent => {
    val file = fileContent._1
    val content = fileContent._2
    content.split("\\r?\\n").map(line => line match {
      case pattern(dt, ev, seq) => Some(LogEntry(<...>))
      case _ => logger.error(s"Cannot parse $file: $line"); None
    })
  })

  entries.collect { case Some(entry) => entry }
}