How to skip entries that don't match a pattern when constructing an RDD - scala

I'm doing some pattern matching on an RDD and would like to select only those rows/records matching a pattern. Here is what I currently have:
val idPattern = """Id="([^"]*)""".r.unanchored
val typePattern = """PostTypeId="([^"]*)""".r.unanchored
val datePattern = """CreationDate="([^"]*)""".r.unanchored
val tagPattern = """Tags="([^"]*)""".r.unanchored
val projectedPostsAnswers = postsAnswers.map {line => {
val id = line match {case idPattern(x) => x}
val typeId = line match {case typePattern(x) => x}
val date = line match {case datePattern(x) => x}
val tags = line match {case tagPattern(x) => x}
Post(Some(id),Some(typeId),Some(date),Some(tags))
}
}
case class Post(Id: Option[String], Type: Option[String], CreationDate: Option[String], Tags: Option[String])
I'm only interested in rows/records that match all the patterns (that is, records that have all four fields). How can I skip the rows/records that don't satisfy this requirement?

You can use RDD.collect(scala.PartialFunction f) to do the filtering and mapping in one step.
For example, if you know the order of these fields in your input, you can merge the regular expressions into one and use a single case:
val pattern = """Id="([^"]*).*PostTypeId="([^"]*).*CreationDate="([^"]*).*Tags="([^"]*)""".r.unanchored
val projectedPostsAnswers = postsAnswers.collect {
case pattern(id, typeId, date, tags) => Post(Some(id), Some(typeId), Some(date), Some(tags))
}
The returned RDD[Post] will only contain records for which this case matched. Notice that this collect(PartialFunction) has nothing to do with collect() - it does not collect the entire data set into driver memory.

Use the filter API.
val projectedPostAnswers = postsAnswers.filter(line => f(line)).map { ... }
Create a function 'f' that does your data cleansing for you. Make sure the function returns true or false, as that's what filter uses to decide whether or not to pass the record.
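A minimal sketch of such a predicate, reusing the four unanchored regexes from the question (the name f and its structure are just an assumption of what that cleansing function could look like; findFirstIn returns Some only when the pattern occurs somewhere in the line):

def f(line: String): Boolean =
  idPattern.findFirstIn(line).isDefined &&
    typePattern.findFirstIn(line).isDefined &&
    datePattern.findFirstIn(line).isDefined &&
    tagPattern.findFirstIn(line).isDefined

val projectedPostAnswers = postsAnswers
  .filter(f)   // keep only lines that contain all four fields
  .map { line =>
    val id = line match { case idPattern(x) => x }
    val typeId = line match { case typePattern(x) => x }
    val date = line match { case datePattern(x) => x }
    val tags = line match { case tagPattern(x) => x }
    Post(Some(id), Some(typeId), Some(date), Some(tags))   // safe: the filter guaranteed a match
  }

Compared to the collect(PartialFunction) approach above, this scans each line against the patterns twice (once in the filter and once in the map), but it keeps the original regexes and map body unchanged.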

Related

Hashing multiple columns of spark dataframe

I need to hash specific columns of a Spark dataframe. Some columns have specific datatypes which are basically extensions of the standard Spark DataType class. The problem is that, for some reason, some conditions inside the when clause don't work as expected.
As the hashing configuration I have a map. Let's call it tableConfig:
val tableConfig = Map("a" -> "KEEP", "b" -> "HASH", "c" -> "KEEP", "d" -> "HASH", "e" -> "KEEP")
The salt variable is concatenated with the column value before hashing:
val salt = "abc"
The function for hashing looks like this:
def hashColumns(tableConfig: Map[String, String], salt: String, df: DataFrame): DataFrame = {
  val removedColumns = tableConfig.filter(_._2 == "REMOVE").keys.toList
  val hashedColumns = tableConfig.filter(_._2 == "HASH").keys.toList
  val cleanedDF = df.drop(removedColumns: _*)
  val colTypes = cleanedDF.dtypes.toMap

  def typeFromString(s: String): DataType = s match {
    case "StringType" => StringType
    case "BooleanType" => BooleanType
    case "IntegerType" => IntegerType
    case "DateType" => DateType
    case "ShortType" => ShortType
    case "DecimalType(15,7)" => DecimalType(15,7)
    case "DecimalType(18,2)" => DecimalType(18,2)
    case "DecimalType(11,7)" => DecimalType(11,7)
    case "DecimalType(17,2)" => DecimalType(17,2)
    case "DecimalType(38,2)" => DecimalType(38,2)
    case _ => throw new TypeNotPresentException(
      "Please check types in the dataframe. The following column type is missing: ".concat(s), null
    )
  }

  val getType = colTypes.map { case (k, _) => (k, typeFromString(colTypes(k))) }

  val hashedDF = cleanedDF.columns.foldLeft(cleanedDF) { (memoDF, colName) =>
    memoDF.withColumn(
      colName,
      when(col(colName).isin(hashedColumns: _*) && col(colName).isNull, null).
        when(col(colName).isin(hashedColumns: _*) && col(colName).isNotNull,
          sha2(concat(col(colName), lit(salt)), 256)).otherwise(col(colName))
    )
  }
  hashedDF
}
I am getting an error regarding a specific column. Namely, the error is the following:
org.apache.spark.sql.AnalysisException: cannot resolve '(c IN ('a', 'b', 'd', 'e'))' due to data type mismatch: Arguments must be same type but were: boolean != string;;
Column names are changed.
My search didn't turn up any clear explanation of why the isin or isNull functions don't work as expected. I also follow a specific style of implementation and want to avoid the following approaches:
1) No UDFs. They are painful for me.
2) No for loops over Spark dataframe columns. The data could contain more than a billion samples and it would be a headache in terms of performance.
As mentioned in the comments, the first fix should be to remove the condition col(colName).isin(hashedColumns: _*) && col(colName).isNull, since this check will always be false.
As for the error, it is caused by the mismatch between the value type of col(colName) and hashedColumns. The values of hashedColumns are always strings, therefore col(colName) should be a string as well, but in your case it seems to be a boolean.
The last issue that I see here has to do with the logic of the foldLeft. If I understood correctly, what you want to achieve is to go through the columns and apply sha2 to those that exist in hashedColumns. To achieve that you must change your code to:
// 1st change: convert each element of hashedColumns from String to a Spark column
val hashArray = hashedColumns.map(lit(_))

val hashedDF = cleanedDF.columns.foldLeft(cleanedDF) { (memoDF, colName) =>
  memoDF.withColumn(
    colName,
    // 2nd change: check whether colName is in "a", "b", "c", "d", etc; if so apply sha2, otherwise leave the value as it is
    when(col(colName).isNotNull && array_contains(array(hashArray: _*), lit(colName)),
      sha2(concat(col(colName), lit(salt)), 256)
    )
  )
}
UPDATE:
Iterating through all the columns via foldLeft isn't efficient and adds extra overhead, even more so when you have a large number of columns (see the discussion with @baitmbarek below), so I added one more approach using a single select instead of foldLeft. In the next code, when is applied only to the hashedColumns. We separate the columns into nonHashedCols and transformedCols, then concatenate the two lists and pass them to the select:
val transformedCols = hashedColumns.map { c =>
  when(col(c).isNotNull, sha2(concat(col(c), lit(salt)), 256)).as(c)
}
val nonHashedCols = (cleanedDF.columns.toSet -- hashedColumns.toSet).map(col(_)).toList

cleanedDF.select((nonHashedCols ++ transformedCols): _*)
(Posted a solution on behalf of the question author, to move it from the question to the answer section).
@AlexandrosBiratsis gave a very good solution in terms of performance and elegance of implementation. So the hashColumns function will look like this:
def hashColumns(tableConfig: Map[String, String], salt: String, df: DataFrame): DataFrame = {
  val removedCols = tableConfig.filter(_._2 == "REMOVE").keys.toList
  val hashedCols = tableConfig.filter(_._2 == "HASH").keys.toList
  val cleanedDF = df.drop(removedCols: _*)
  val transformedCols = hashedCols.map { c =>
    when(col(c).isNotNull, sha2(concat(col(c), lit(salt)), 256)).as(c)
  }
  val nonHashedCols = (cleanedDF.columns.toSet -- hashedCols.toSet).map(col(_)).toList
  val hashedDF = cleanedDF.select((nonHashedCols ++ transformedCols): _*)
  hashedDF
}
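A quick usage sketch (the sample dataframe, its values and the SparkSession named spark are assumptions for illustration, not part of the original answer; tableConfig and salt are the values defined above):

import spark.implicits._   // assumes an active SparkSession named spark

val df = Seq(("1", "x", "y", "z", "w")).toDF("a", "b", "c", "d", "e")
val result = hashColumns(tableConfig, salt, df)
result.show(false)   // "b" and "d" come back as SHA-256 hex strings; the select puts non-hashed columns first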

How can I take the highest ranking filter condition that ends up matching in a dataframe?

The wording of my question might be confusing, so let me explain. Say I have an array of strings, ranked in order of best-case match. At index 0 is the pattern we always want to exist in the dataframe column, but if it doesn't, then index 1 is the next best option. I have written this logic as shown below, but I don't feel this is the most efficient way to do it. Is there a better way?
The datasets are quite small, but I fear this can't really scale very well.
val df = spark.createDataFrame(data)
val nameArray = Array[String]("Name", "Name%", "%Name%", "Person Name", "Person Name%", "%Person Name%")

nameArray.foreach(x => {
  val nameDf = df.where("text like '" + x + "'")
  if (nameDf.count() > 0) {
    nameDf.show(1)
    break()
  }
})
If the values are ordered according to preference from left (highest precedence) to right (lowest precedence), and lower-precedence patterns already cover higher-precedence ones (it doesn't look like that is the case in your example), you can generate an expression like this:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

def matched(df: DataFrame, nameArray: Seq[String], c: String = "text") = {
  val matchIdx = nameArray.zipWithIndex.foldRight(lit(-1)) {
    case ((s, i), acc) => when(col(c) like s, lit(i)).otherwise(acc)
  }
  df.select(max(matchIdx)).first match {
    case Row(-1) => None // No pattern matches all records
    case Row(i: Int) => Some(nameArray(i))
  }
}
Usage examples:
matched(Seq("Some Name", "Name", "Name Surname").toDF("text"), Seq("Name", "Name%", "%Name%"))
// Option[String] = Some(%Name%)
There are two advantages of this method:
Only one action is required.
Pattern matching can be short-circuited.
If the pre-conditions are not satisfied, you can instead count, for each pattern, the records it fails to match:
import org.apache.spark.sql.functions._
val unmatchedCount: Map[String, Long] = df.select(
  nameArray.map(s => count(when(not($"text" like s), 1)).alias(s)): _*
).first.getValuesMap[Long](nameArray)
Unlike the first approach it will check all patterns, but it requires only one action.
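Given that map, one way to pick the highest-precedence pattern that matches every record is a small sketch like the following (treating a zero unmatched count as a full match is my assumption, mirroring the semantics of the first approach):

val bestPattern: Option[String] =
  nameArray.find(p => unmatchedCount(p) == 0L)   // first pattern with no unmatched records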

How to unwrap optional tuple of options to tuple of options in Scala?

I have a list of Person and want to retrieve a person by its id:
val person = personL.find(_.id.equals(tempId))
After that, I want to get, as a tuple, the first and last elements of a list which is an attribute of Person.
val marks: Option[(Option[String], Option[String])] = person.map { p =>
  val marks = p.school.marks
  (marks.headOption.map(_.midtermMark), marks.lastOption.map(_.finalMark))
}
This works fine, but now I want to transform the Option[(Option[String], Option[String])] to a simple (Option[String], Option[String]). Is it somehow possible to do this on-the-fly by using the previous map?
I suppose:
person.map { ... }.getOrElse((None, None))
(None, None) is the default value here, used in case your option of the tuple is empty.
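Put together with the mapping from the question, that would look roughly like this:

val marks: (Option[String], Option[String]) = person.map { p =>
  val marks = p.school.marks
  (marks.headOption.map(_.midtermMark), marks.lastOption.map(_.finalMark))
}.getOrElse((None, None))   // fall back to (None, None) when no person was found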
You are probably looking for fold:
personL
  .collectFirst {
    case Person(`tempId`, _, .., school) => school.marks // `..` stands in for the remaining Person fields
  }
  .fold[(Option[String], Option[String])](None -> None) { marks =>
    marks.headOption.map(_.midtermMark) -> marks.lastOption.map(_.finalMark)
  }

Pattern matching and RDDs

I have a very simple (n00b) question but I'm somehow stuck. I'm trying to read a set of files in Spark with wholeTextFiles and want to return an RDD[LogEntry], where LogEntry is just a case class. I want to end up with an RDD of valid entries and I need to use a regular expression to extract the parameters for my case class. When an entry is not valid I do not want the extractor logic to fail but simply write an entry in a log. For that I use LazyLogging.
object LogProcessors extends LazyLogging {

  def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[Option[CleaningLogEntry]] = {
    val pattern = "<some pattern>".r
    val logs = sc.wholeTextFiles(path, numPartitions)
    val entries = logs.map(fileContent => {
      val file = fileContent._1
      val content = fileContent._2
      content.split("\\r?\\n").map(line => line match {
        case pattern(dt, ev, seq) => Some(LogEntry(<...>))
        case _ => logger.error(s"Cannot parse $file: $line"); None
      })
    })
That gives me an RDD[Array[Option[LogEntry]]]. Is there a neat way to end up with an RDD of the LogEntrys? I'm somehow missing it.
I was thinking about using Try instead, but I'm not sure if that's any better.
Thoughts greatly appreciated.
To get rid of the Array - simply replace the map command with flatMap - flatMap will treat a result of type Traversable[T] for each record as separate records of type T.
To get rid of the Option - collect only the successful ones: entries.collect { case Some(entry) => entry }.
Note that this collect(p: PartialFunction) overload (which performs something equivalent to a map and a filter combined) is very different from collect() (which sends all data to the driver).
Altogether, this would be something like:
def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[CleaningLogEntry] = {
  val pattern = "<some pattern>".r
  val logs = sc.wholeTextFiles(path, numPartitions)
  val entries = logs.flatMap(fileContent => {
    val file = fileContent._1
    val content = fileContent._2
    content.split("\\r?\\n").map(line => line match {
      case pattern(dt, ev, seq) => Some(LogEntry(<...>))
      case _ => logger.error(s"Cannot parse $file: $line"); None
    })
  })

  entries.collect { case Some(entry) => entry }
}

How to create a List of Wildcard elements Scala

I'm trying to write a function that returns a list (for querying purposes) that has some wildcard elements:
def createPattern(query: List[(String,String)]) = {
  val l = List[(_,_,_,_,_,_,_)]
  var iter = query
  while (iter != null) {
    val x = iter.head._1 match {
      case "userId" => 0
      case "userName" => 1
      case "email" => 2
      case "userPassword" => 3
      case "creationDate" => 4
      case "lastLoginDate" => 5
      case "removed" => 6
    }
    l(x) = iter.head._2
    iter = iter.tail
  }
  l
}
So, the user enters some query terms as a list. The function parses through these terms and inserts them into val l. The fields that the user doesn't specify are entered as wildcards.
Val l is causing me trouble. Am I going down the right route, or is there a better way to do this?
Thanks!
Gosh, where to start. I'd begin by getting an IDE (IntelliJ / Eclipse) which will tell you when you're writing nonsense and why.
Read up on how List works. It's an immutable linked list so your attempts to update by index are very misguided.
Don't use tuples - use case classes.
You shouldn't ever need to use null and I guess here you mean Nil.
Don't use var and while - use for-expression, or the relevant higher-order functions foreach, map etc.
Your code doesn't make much sense as it is, but it seems you're trying to return a 7-element list with the second element of each tuple in the input list mapped via a lookup to position in the output list.
To improve it... don't do that. What you're doing (as programmers have done since arrays were invented) is to use the index as a crude proxy for a Map from Int to whatever. What you want is an actual Map. I don't know what you want to do with it, but wouldn't it be nicer if it were from these key strings themselves, rather than by a number? If so, you can simplify your whole method to
def createPattern(query: List[(String,String)]) = query.toMap
at which point you should realise you probably don't need the method at all, since you can just use toMap at the call site.
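For example, at a hypothetical call site (the query values here are made up just to show the shape of the result):

val query = List("userId" -> "42", "email" -> "a@b.com")
val lookup: Map[String, String] = query.toMap
lookup.get("email")     // Some(a@b.com)
lookup.get("userName")  // None - the caller did not specify this field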
If you insist on using an Int index, you could write
def createPattern(query: List[(String,String)]) = {
  def intVal(x: String) = x match {
    case "userId" => 0
    case "userName" => 1
    case "email" => 2
    case "userPassword" => 3
    case "creationDate" => 4
    case "lastLoginDate" => 5
    case "removed" => 6
  }
  val tuples = for ((key, value) <- query) yield (intVal(key), value)
  tuples.toMap
}
Not sure what you want to do with the resulting list, but you can't create a List of wildcards like that.
What do you want to do with the resulting list, and what type should it be?
Here's how you might build something if you wanted the result to be a List[String], and if you wanted wildcards to be "*":
def createPattern(query: List[(String, String)]) = {
  val wildcard = "*"
  def orElseWildcard(key: String) = query.find(_._1 == key).getOrElse(("", wildcard))._2
  orElseWildcard("userId") ::
    orElseWildcard("userName") ::
    orElseWildcard("email") ::
    orElseWildcard("userPassword") ::
    orElseWildcard("creationDate") ::
    orElseWildcard("lastLoginDate") ::
    orElseWildcard("removed") ::
    Nil
}
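A hypothetical call using keys from the question would then produce:

createPattern(List("userId" -> "Fred", "email" -> "fred@example.com"))
// List(Fred, *, fred@example.com, *, *, *, *)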
You're not using List, Tuple, iterator, or wild-cards correctly.
I'd take a different approach - maybe something like this:
case class Pattern(valueMap: Map[String, String]) {
  def this(valueList: List[(String, String)]) = this(valueList.toMap)

  val Seq(
    userId, userName, email, userPassword, creationDate,
    lastLoginDate, removed
  ): Seq[Option[String]] = Seq(
    "userId", "userName", "email", "userPassword", "creationDate",
    "lastLoginDate", "removed"
  ).map(valueMap.get(_))
}
Then you can do something like this:
scala> val pattern = new Pattern( List( "userId" -> "Fred" ) )
pattern: Pattern = Pattern(Map(userId -> Fred))
scala> pattern.email
res2: Option[String] = None
scala> pattern.userId
res3: Option[String] = Some(Fred)
Or just use the map directly.
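For example, with the pattern instance created above:

pattern.valueMap.get("userId")   // Some(Fred)
pattern.valueMap.get("email")    // None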