How to use array_contains with 2 columns in Spark Scala?

I have an issue: I want to check whether an array of strings contains a string present in another column. I am currently using the code below, which gives an error.
.withColumn("is_designer_present", when(array_contains(col("list_of_designers"),$"dept_resp"),1).otherwise(0))
error:
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.ColumnName dept_resp
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)

You can write a UDF to get the job done:
import org.apache.spark.sql.functions._
def stringContains = udf((array: collection.mutable.WrappedArray[String], str: String) => array.contains(str))
df.withColumn("is_designer_present", when(stringContains(col("list_of_designers"), $"dept_resp"),1).otherwise(0))
You can return the appropriate value from the UDF itself so that you don't have to use the when function:
import org.apache.spark.sql.functions._
def stringContains = udf((array: collection.mutable.WrappedArray[String], str: String) => if (array.contains(str)) 1 else 0)
df.withColumn("is_designer_present", stringContains(col("list_of_designers"), $"dept_resp"))

With Spark 1.6 you can wrap your array_contains() call as a string inside the expr() function:
import org.apache.spark.sql.functions.expr
.withColumn("is_designer_present",
when(expr("array_contains(list_of_designers, dept_resp)"), 1).otherwise(0))
This form of array_contains inside the expr can accept a column as the second argument.

I know that this is a somewhat old question, but I found myself in a similar predicament, and found the following solution. It uses Spark native functions (so it doesn't suffer from UDF-related performance regressions) and it does not rely on string expressions (which are hard to maintain).
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.ArrayContains

def array_contains_column(arrayColumn: Column, valueColumn: Column): Column = {
  new Column(ArrayContains(arrayColumn.expr, valueColumn.expr))
}
// ...
df.withColumn(
  "is_designer_present",
  when(
    array_contains_column(col("list_of_designers"), col("dept_resp")),
    1
  ).otherwise(0)
)

You can do it without UDFs using explode.
.withColumn("exploCol", explode($"dept_resp"))
.withColumn("aux", when($"exploCol" === col("list_of_designers"), 1).otherwise(0))
.drop("exploCol")
.groupBy($"dep_rest") //all cols except aux
.agg(sum($"aux") as "result")
And there you go, if result > 0 then "dept_rest" contains the value.
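If you want the same 0/1 flag as in the original question, a small follow-up on that aggregated result (a sketch, untested) could be:
.withColumn("is_designer_present", when($"result" > 0, 1).otherwise(0))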

Related

Is there a Spark built-in that flattens nested arrays?

I have a DataFrame field that is a Seq[Seq[String]]. I built a UDF to transform that column into a column of Seq[String]; basically, a UDF for Scala's flatten function.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

def combineSentences(inCol: String, outCol: String): DataFrame => DataFrame = {
  def flatfunc(seqOfSeq: Seq[Seq[String]]): Seq[String] = seqOfSeq match {
    case null => Seq.empty[String]
    case _ => seqOfSeq.flatten
  }
  df: DataFrame => df.withColumn(outCol, udf(flatfunc _).apply(col(inCol)))
}
My use case is strings, but obviously, this could be generic. You can use this function in a chain of DataFrame transforms like:
df.transform(combineSentences(inCol, outCol))
Is there a Spark built-in function that does the same thing? I have not been able to find one.
There is a similar function (since Spark 2.4) and it is called flatten:
import org.apache.spark.sql.functions.flatten
From the official documentation:
def flatten(e: Column): Column
Creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.
Since: 2.4.0
To get the exact equivalent you'll have to coalesce to replace NULL.
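For example, a sketch (assuming the nested column is called inCol and the output column outCol, as in the UDF above) of the built-in equivalent could look like:
import org.apache.spark.sql.functions.{coalesce, col, flatten, typedLit}

// flatten() returns NULL for a NULL input array; coalesce with an empty
// Seq[String] literal to mimic the UDF's `case null => Seq.empty[String]` branch.
df.withColumn(outCol, coalesce(flatten(col(inCol)), typedLit(Seq.empty[String])))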

Avoid specifying schema twice (Spark/scala)

I need to iterate over a data frame in a specific order and apply some complex logic to calculate a new column.
Also, my strong preference is to do it in a generic way so I do not have to list all the columns of a row and do df.as[my_record] or case Row(...) => as shown here. Instead, I want to access row columns by their names and just add the result column(s) to the source row.
The approach below works just fine, but I'd like to avoid specifying the schema twice: the first time so that I can access columns by name while iterating, and the second time to process the output.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
val q = """
select 2 part, 1 id
union all select 2 part, 4 id
union all select 2 part, 3 id
union all select 2 part, 2 id
"""
val df = spark.sql(q)
def f_row(iter: Iterator[Row]): Iterator[Row] = {
  if (iter.hasNext) {
    def complex_logic(p: Int): Integer = if (p == 3) null else p * 10

    val head = iter.next
    val schema = StructType(head.schema.fields :+ StructField("result", IntegerType))
    val r =
      new GenericRowWithSchema((head.toSeq :+ complex_logic(head.getAs("id"))).toArray, schema)

    iter.scanLeft(r)((r1, r2) =>
      new GenericRowWithSchema((r2.toSeq :+ complex_logic(r2.getAs("id"))).toArray, schema)
    )
  } else iter
}
val schema = StructType(df.schema.fields :+ StructField("result", IntegerType))
val encoder = RowEncoder(schema)
df.repartition($"part").sortWithinPartitions($"id").mapPartitions(f_row)(encoder).show
What information is lost after applying mapPartitions so output cannot be processed without explicit encoder? How to avoid specifying it?
What information is lost after applying mapPartitions so output cannot be processed without
The information is hardly lost; it wasn't there from the beginning. Subclasses of Row or InternalRow are basically untyped, variable-shape containers which don't provide any useful type information that could be used to derive an Encoder.
The schema in GenericRowWithSchema is inconsequential, as it describes the content in terms of metadata, not types.
How to avoid specifying it?
Sorry, you're out of luck. If you want to use dynamically typed constructs (a bag of Any) in a statically typed language you have to pay the price, which here is providing an Encoder.
OK, I have checked some of my Spark code, and using .mapPartitions with the Dataset API does not require me to explicitly build or pass an encoder.
You need something like:
case class Before(part: Int, id: Int)
case class After(part: Int, id: Int, newCol: String)
import spark.implicits._
// Note column names/types must match case class constructor parameters.
val beforeDS = <however you obtain your input DF>.as[Before]
def f_row(it: Iterator[Before]): Iterator[After] = ???
beforeDS.repartition($"part").sortWithinPartitions($"id").mapPartitions(f_row).show
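For illustration only, a hypothetical f_row mirroring the question's complex_logic (here mapping id == 3 to the string "n/a", since After.newCol is a String) might look like:
def f_row(it: Iterator[Before]): Iterator[After] =
  it.map { b =>
    // Same idea as complex_logic in the question: id == 3 becomes a "missing" marker.
    val result = if (b.id == 3) "n/a" else (b.id * 10).toString
    After(b.part, b.id, result)
  }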
I found the explanation below sufficient; maybe it will be useful for others.
mapPartitions requires an Encoder because otherwise it cannot construct a Dataset from an iterator of Rows. Even though each row has a schema, that schema cannot be derived (used) by the constructor of Dataset[U].
def mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U] = {
  new Dataset[U](
    sparkSession,
    MapPartitions[T, U](func, logicalPlan),
    implicitly[Encoder[U]])
}
On the other hand, without calling mapPartitions, Spark can use the schema derived from the initial query because the structure (metadata) of the original columns is not changed.
I described alternatives in this answer: https://stackoverflow.com/a/53177628/7869491.

Spark scala data frame udf returning rows

Say I have a dataframe which contains a column (called colA) which is a Seq of Row. I want to append a new field to each record of colA. (And the new field is derived from the existing record, so I have to write a UDF.)
How should I write this udf?
I have tried to write a UDF which takes colA as input and outputs a Seq[Row] where each record contains the new field. But the problem is the UDF cannot return Seq[Row]. The exception is 'Schema for type org.apache.spark.sql.Row is not supported'.
What should I do?
The udf that I wrote:
val convert = udf[Seq[Row], Seq[Row]](blablabla...)
And the exception is java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported
Since Spark 2.0 you can create UDFs which return Row / Seq[Row], but you must provide the schema for the return type, e.g. if you work with an array of doubles:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{ArrayType, DoubleType}

val schema = ArrayType(DoubleType)

val myUDF = udf((s: Seq[Row]) => {
  s // just pass data without modification
}, schema)
But I can't really imagine where this is useful; I would rather return tuples or case classes (or a Seq thereof) from the UDFs.
EDIT: It could be useful if your row contains more than 22 fields (the limit of fields for tuples/case classes).
This is an old question; I just wanted to update it according to the new version of Spark.
Since Spark 3.0.0, the method that @Raphael Roth has mentioned is deprecated. Hence, you might get an AnalysisException. The reason is that the input closure using this method doesn't have type checking, and the behavior might differ from what we expect in SQL when it comes to null values.
If you really know what you're doing, you need to set the spark.sql.legacy.allowUntypedScalaUDF configuration to true.
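For example (a sketch, assuming a SparkSession named spark), the legacy behaviour can be re-enabled at runtime like this:
// Only do this if you really know what you're doing; it bypasses type checking for Scala UDFs.
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")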
Another solution is to use a case class instead of a schema. For example:
case class Foo(field1: String, field2: String)

val convertFunction: Seq[Row] => Seq[Foo] = input => {
  input.map {
    x => // do something with x and convert to Foo
  }
}

val myUdf = udf(convertFunction)
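A hypothetical usage against the colA column from the question (the field names in Foo are placeholders) might then be:
// Replace colA with a Seq[Foo] produced by the typed UDF.
val result = df.withColumn("colA", myUdf(col("colA")))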

Scala UDF with multiple parameters used in Pyspark

I have a UDF written in Scala that I'd like to be able to call through a Pyspark session. The UDF takes two parameters: a string column value and a second string parameter. I've been able to successfully call the UDF if it takes only a single parameter (the column value). I'm struggling to call the UDF when multiple parameters are required. Here's what I've been able to do so far in Scala and then through Pyspark:
Scala UDF:
class SparkUDFTest() extends Serializable {
  def stringLength(columnValue: String, columnName: String): Int = {
    LOG.info("Column name is: " + columnName)
    columnValue.length
  }
}
When using this in Scala, I've been able to register and use this UDF:
Scala main class:
val udfInstance = new SparkUDFTest()
val stringLength = spark.sqlContext.udf.register("stringlength", udfInstance.stringLength _)
val newDF = df.withColumn("name", stringLength(col("email"), lit("email")))
The above works successfully. Here's the attempt through Pyspark:
def testStringLength(colValue, colName):
    testpackage = "com.test.example.udf.SparkUDFTest"
    udfInstance = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(testpackage).newInstance().stringLength().apply
    return Column(udfInstance(_to_seq(sc, [colValue], _to_java_column), colName))
Call the UDF in Pyspark:
df.withColumn("email", testStringLength("email", lit("email")))
Doing the above and making some adjustments in Pyspark gives me the following errors:
py4j.Py4JException: Method getStringLength([]) does not exist
or
java.lang.ClassCastException: com.test.example.udf.SparkUDFTest$$anonfun$stringLength$1 cannot be cast to scala.Function1
or
TypeError: 'Column' object is not callable
I was able to modify the UDF to take just a single parameter (the column value) and was able to successfully call it and get back a new Dataframe.
Scala UDF Class
class SparkUDFTest() extends Serializable {
  def testStringLength(): UserDefinedFunction = udf(stringLength _)

  def stringLength(columnValue: String): Int = {
    LOG.info("Calculating length for column value: " + columnValue)
    columnValue.length
  }
}
Updating Python code:
def testStringLength(colValue, colName):
    testpackage = "com.test.example.udf.SparkUDFTest"
    udfInstance = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(testpackage).newInstance().testStringLength().apply
    return Column(udfInstance(_to_seq(sc, [colValue], _to_java_column)))
The above works successfully. I'm still struggling to call the UDF if the UDF takes an extra parameter. How can the second parameter be passed to the UDF through in Pyspark?
I was able to resolve this by using currying. First I registered the UDF as:
def testStringLength(columnName: String): UserDefinedFunction =
  udf((colValue: String) => stringLength(colValue, columnName))
Called the UDF
udfInstance = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(testpackage).newInstance().testStringLength("email").apply
df.withColumn("email", Column(udfInstance(_to_seq(sc, [col("email")], _to_java_column))))
This can be cleaned up a bit more but it's how I got it to work.
Edit: The reason I went with currying is that even when I was using 'lit' on the second argument that I wanted to pass to the UDF as a String, I kept experiencing the "TypeError: 'Column' object is not callable" error. In Scala I did not experience this issue. I am not sure why this was happening in Pyspark; it's possible it is due to some complication between the Python interpreter and the Scala code. It's still unclear, but currying works for me.

Convert RDD[String] to RDD[Row] to Dataframe Spark Scala

I am reading in a file that has many spaces and need to filter out the spaces. Afterwards we need to convert it to a dataframe. Example input below:
2017123 ¦ ¦10¦running¦00000¦111¦-EXAMPLE
My solution to this was the following function, which parses out all spaces and trims the file.
import org.apache.spark.rdd.RDD

def truncateRDD(fileName: String): RDD[String] = {
  val example = sc.textFile(fileName)
  example.map(lines => lines.replaceAll("""[\t\p{Zs}]+""", ""))
}
However, I am not sure how to get it into a dataframe. sc.textFile returns an RDD[String]. I tried the case class way, but the issue is we have an 800-field schema, and a case class cannot go beyond 22 fields.
I was thinking of somehow converting RDD[String] to RDD[Row] so I can use the createDataFrame function.
val DF = spark.createDataFrame(rowRDD, schema)
Any suggestions on how to do this?
First, split/parse your strings into their fields.
rdd.map(line => parse(line)), where parse is some parsing function. It could be as simple as split, but you may want something more robust. This will get you an RDD[Array[String]] or similar.
You can then convert to an RDD[Row] with rdd.map(a => Row.fromSeq(a)).
From there you can convert to a DataFrame using sqlContext.createDataFrame(rdd, schema), where rdd is your RDD[Row] and schema is your StructType schema.
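Putting those steps together for this question, a sketch (assuming the fields are separated by the "¦" character and treating every field as a string; adjust the delimiter and types to your data):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Split each cleaned line into fields and wrap them in a Row.
val rowRDD = truncateRDD("yourfilename").map(line => Row.fromSeq(line.split("¦").toSeq))

// Build an 800-column schema programmatically instead of a case class.
val schema = StructType((1 to 800).map(i => StructField(s"col$i", StringType, nullable = true)))

val DF = spark.createDataFrame(rowRDD, schema)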
In your case, a simple way:
val RowOfRDD = truncateRDD("yourfilename").map(r => Row.fromSeq(r.split("¦").toSeq))
How to solve the product arity issue if you are using Scala 2.10?
However, I am not sure how to get it into a dataframe. sc.textFile
returns a RDD[String]. I tried the case class way but the issue is we
have 800 field schema, case class cannot go beyond 22.
Yes, there are some limitations like product arity, but we can overcome them...
You can do it like the example below for versions < 2.11:
Prepare a case class which extends Product and overrides the following methods (a minimal sketch is shown after the links below):
productArity(): Int: returns the number of attributes. In our case, it's 33.
productElement(n: Int): Any: given an index, returns the attribute. As protection, we also have a default case, which throws an IndexOutOfBoundsException.
canEqual(that: Any): Boolean: the last of the three functions; it serves as a boundary condition when an equality check is being done against the class.
For an example implementation, you can refer to this Student case class, which has 33 fields in it.
An example student dataset description is available here.
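As a minimal sketch of the same pattern (using a hypothetical three-field Record instead of the 33-field Student; the approach scales the same way to 800 fields):
// Hypothetical class overriding the three Product methods described above.
class Record(val id: String, val status: String, val code: String) extends Product with Serializable {
  override def productArity: Int = 3

  override def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => status
    case 2 => code
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  override def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
}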