I have a DataFrame with a column of ArrayType that can contain integer values. If there are no values, the array contains exactly one element, and that element is null.
Important: note that the column itself is not null, but an array with a single value: null.
> val df: DataFrame = Seq(("foo", Seq(Some(2), Some(3))), ("bar", Seq(None))).toDF("k", "v")
df: org.apache.spark.sql.DataFrame = [k: string, v: array<int>]
> df.show()
+---+------+
| k| v|
+---+------+
|foo|[2, 3]|
|bar|[null]|
+---+------+
Question: I'd like to get the rows that have the null value.
What I have tried thus far:
> df.filter(array_contains(df("v"), 2)).show()
+---+------+
| k| v|
+---+------+
|foo|[2, 3]|
+---+------+
For null, however, it does not seem to work:
> df.filter(array_contains(df("v"), null)).show()
org.apache.spark.sql.AnalysisException: cannot resolve
'array_contains(v, NULL)' due to data type mismatch: Null typed
values cannot be used as arguments;
or
> df.filter(array_contains(df("v"), None)).show()
java.lang.RuntimeException: Unsupported literal type class scala.None$ None
It is not possible to use array_contains in this case because SQL NULL cannot be compared for equality.
You can use udf like this:
val contains_null = udf((xs: Seq[Integer]) => xs.contains(null))
df.where(contains_null($"v")).show
// +---+------+
// | k| v|
// +---+------+
// |bar|[null]|
// +---+------+
For Spark 2.4+, you can use the higher-order function exists instead of UDF:
df.where("exists(v, x -> x is null)").show
//+---+---+
//| k| v|
//+---+---+
//|bar| []|
//+---+---+
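If you are on Spark 3.0 or later, the same higher-order function is also exposed in the Scala API, so the SQL string can be avoided (a minimal sketch, assuming Spark 3.0+):
import org.apache.spark.sql.functions.exists

// keep rows whose array contains at least one null element
df.where(exists($"v", x => x.isNull)).show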
The PySpark implementation, if needed:
from pyspark.sql import functions as f
from pyspark.sql.types import BooleanType

contains_null = f.udf(lambda x: None in x, BooleanType())
df.filter(contains_null(f.col("v"))).show()
Related
I'm interested in taking a column in my dataframe called mapColumn
+-------------------+
| mapColumn |
+-------------------+
| Map(KEY -> VALUE) |
+-------------------+
and create a stringColumn that's just the key and value of the Map column where the value is "KEY,VALUE":
+-------------------+
| stringColumn |
+-------------------+
| KEY,VALUE |
+-------------------+
I have tried creating a UDF to pass this value as follows:
var getStringColumn = udf(mapToString _)

df.withColumn("stringColumn",
  when(col(mapColumn).isNotNull, getStringColumn(col(mapColumn)))
    .otherwise(lit(null: String)))
def mapToString(row: Row): String = {
  if (null == row || row.isNullAt(FirstItemIndex)) {
    return null
  }
  return row.getValuesMap[Any](row.schema.fieldNames).mkString(",")
}
I keep getting the following error:
Failed to execute user defined function($anonfun$1: (map) => string)
Cause: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to org.apache.spark.sql.Row
There is no need for a UDF. One approach is to explode the Map column into flattened key & value columns and concat the key-value elements as Strings accordingly:
val df = Seq(
  (10, Map((1, "a"), (2, "b"))),
  (20, Map((3, "c")))
).toDF("id", "map")

df.
  select($"id", explode($"map")).
  withColumn("kv_string", concat($"key".cast("string"), lit(","), $"value")).
  show
// +---+---+-----+---------+
// | id|key|value|kv_string|
// +---+---+-----+---------+
// | 10| 1| a| 1,a|
// | 10| 2| b| 2,b|
// | 20| 3| c| 3,c|
// +---+---+-----+---------+
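If, as in the original question, you want a single string column per row rather than one row per map entry, one way is to aggregate the exploded key-value strings back per id (a sketch, not from the original answer, assuming Spark 2.x where collect_list and concat_ws are available):
import org.apache.spark.sql.functions.{collect_list, concat, concat_ws, explode, lit}

df.select($"id", explode($"map"))
  .withColumn("kv_string", concat($"key".cast("string"), lit(","), $"value"))
  .groupBy("id")
  .agg(concat_ws(";", collect_list($"kv_string")).as("stringColumn"))
  .show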
I have a DataFrame with Arrays.
val DF = Seq(
("123", "|1|2","3|3|4" ),
("124", "|3|2","|3|4" )
).toDF("id", "complete1", "complete2")
.select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))
+---+---------+---------+
| id|complete1|complete2|
+---+---------+---------+
|123| [, 1, 2]|[3, 3, 4]|
|124| [, 3, 2]| [, 3, 4]|
+---+---------+---------+
How do I extract the minimum of each arrays?
+---+---------+---------+
| id|complete1|complete2|
+---+---------+---------+
|123|        1|        3|
|124|        2|        3|
+---+---------+---------+
I have tried defining a UDF to do this but I am getting an error.
def minArray(a:Array[String]) :String = a.filter(_.nonEmpty).min.mkString
val minArrayUDF = udf(minArray _)
def getMinArray(df: DataFrame, i: Int): DataFrame = df.withColumn("complete" + i, minArrayUDF(df("complete" + i)))
val minDf = (1 to 2).foldLeft(DF){ case (df, i) => getMinArray(df, i)}
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
Since Spark 2.4, you can use array_min to find the minimum value in an array. To use this function you will first have to cast your arrays of strings to arrays of integers. Casting will also take care of the empty strings by converting them into null values.
DF.select($"id",
array_min(expr("cast(complete1 as array<int>)")).as("complete1"),
array_min(expr("cast(complete2 as array<int>)")).as("complete2"))
You can define your udf function as below
def minUdf = udf((arr: Seq[String])=> arr.filterNot(_ == "").map(_.toInt).min)
and call it as
DF.select(col("id"), minUdf(col("complete1")).as("complete1"), minUdf(col("complete2")).as("complete2")).show(false)
which should give you
+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|123|1 |3 |
|124|2 |3 |
+---+---------+---------+
Updated
If the array passed to the udf function is empty, or contains only empty strings, you will encounter
java.lang.UnsupportedOperationException: empty.min
You should handle that with an if-else condition in the udf function:
def minUdf = udf((arr: Seq[String]) => {
  val filtered = arr.filterNot(_ == "")
  if (filtered.isEmpty) 0
  else filtered.map(_.toInt).min
})
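If you would rather get null than the sentinel 0 for empty arrays, the udf can return an Option[Int] instead (a sketch, not part of the original answer; Spark maps None to null):
def minUdfOpt = udf((arr: Seq[String]) => {
  val filtered = arr.filterNot(_ == "")
  // None becomes null in the resulting column
  if (filtered.isEmpty) None else Some(filtered.map(_.toInt).min)
})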
I hope the answer is helpful
Here is how you can do it without using a udf.
First explode the arrays you got from split(), then group by the same id and take the min:
import org.apache.spark.sql.types.IntegerType

val DF = Seq(
  ("123", "|1|2", "3|3|4"),
  ("124", "|3|2", "|3|4")
).toDF("id", "complete1", "complete2")
  .select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))
  .withColumn("complete1", explode($"complete1"))
  .withColumn("complete2", explode($"complete2"))
  .groupBy($"id").agg(min($"complete1".cast(IntegerType)).as("complete1"), min($"complete2".cast(IntegerType)).as("complete2"))
Output:
+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|124|2 |3 |
|123|1 |3 |
+---+---------+---------+
You don't need a UDF for this, you can use sort_array:
val DF = Seq(
("123", "|1|2","3|3|4" ),
("124", "|3|2","|3|4" )
).toDF("id", "complete1", "complete2")
.select(
$"id",
split(regexp_replace($"complete1","^\\|",""), "\\|").as("complete1"),
split(regexp_replace($"complete2","^\\|",""), "\\|").as("complete2")
)
// now select the minimum
DF.select(
    $"id",
    sort_array($"complete1")(0).as("complete1"),
    sort_array($"complete2")(0).as("complete2")
  ).show()
+---+---------+---------+
| id|complete1|complete2|
+---+---------+---------+
|123| 1| 3|
|124| 2| 3|
+---+---------+---------+
Note that I removed the leading | before splitting to avoid empty strings in the array
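One caveat not covered by the original answer: sort_array on an array<string> sorts lexicographically, so "10" would sort before "2". If values can have more than one digit, casting to integers first is safer (a sketch, assuming Spark 2.x+ where a string array can be cast to array<int>):
DF.select(
    $"id",
    sort_array($"complete1".cast("array<int>"))(0).as("complete1"),
    sort_array($"complete2".cast("array<int>"))(0).as("complete2")
  ).show()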
I am trying to insert dataframe to cassandra:
result.rdd.saveToCassandra(keyspaceName, tableName)
However some of the column values are empty and thus I get exceptions:
java.lang.NumberFormatException: empty String
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1842)
at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
at java.lang.Float.parseFloat(Float.java:451)
at scala.collection.immutable.StringLike$class.toFloat(StringLike.scala:231)
at scala.collection.immutable.StringOps.toFloat(StringOps.scala:31)
at com.datastax.spark.connector.types.TypeConverter$FloatConverter$$anonfun$convertPF$4.applyOrElse(TypeConverter.scala:216)
Is there a way to replace all EMPTY values with null in the dataframe and would that solve this issue?
For this question, let's assume this is the dataframe df:
col1 | col2 | col3
"A" | "B" | 1
"E" | "F" |
"S" | "K" | 5
How can I replace that empty value in col3 with null?
If you cast the DataFrame column to your numeric type, then any values that cannot be parsed to the appropriate type will be turned into nulls.
import org.apache.spark.sql.types.IntegerType
df.select(
$"col1",
$"col2",
$"col3" cast IntegerType
)
or if you don't have a select statement
df.withColumn("col3", df("col3") cast IntegerType)
If you have many columns that you want to apply this to and feel it would be too inconvenient to do in a select statement, or if casting won't work for your case, you can convert to an RDD to apply the transformation and then go back to a DataFrame. You may want to define a method for this.
import org.apache.spark.sql.{DataFrame, Row}

def emptyToNull(df: DataFrame): DataFrame = {
  val sqlCtx = df.sqlContext
  val schema = df.schema
  val rdd = df.rdd.map(row =>
    row.toSeq.map {
      case "" => null
      case otherwise => otherwise
    })
    .map(Row.fromSeq)
  sqlCtx.createDataFrame(rdd, schema)
}
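If you prefer to stay in the DataFrame API, a similar effect can be had by folding over the string columns with when/otherwise (a sketch, not part of the original answer):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.when
import org.apache.spark.sql.types.StringType

// turn "" into null for every string-typed column
def emptyToNullColumns(df: DataFrame): DataFrame =
  df.schema.fields.collect { case f if f.dataType == StringType => f.name }
    .foldLeft(df) { (acc, c) =>
      acc.withColumn(c, when(acc(c) === "", null).otherwise(acc(c)))
    }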
You can write a udf for this:
val df = Seq(("A", "B", "1"), ("E", "F", ""), ("S", "K", "1")).toDF("col1", "col2", "col3")
// make a udf that converts String to option[String]
val nullif = udf((s: String) => if(s == "") None else Some(s))
df.withColumn("col3", nullif($"col3")).show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| B| 1|
| E| F|null|
| S| K| 1|
+----+----+----+
You can also use when/otherwise if you want to avoid using a udf:
df.withColumn("col3", when($"col3" === "", null).otherwise($"col3")).show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| B| 1|
| E| F|null|
| S| K| 1|
+----+----+----+
Or you can use SQL nullif function to convert empty string to null:
df.selectExpr("col1", "col2", "nullif(col3, \"\") as col3").show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| B| 1|
| E| F|null|
| S| K| 1|
+----+----+----+
Before, the RDD was mapped straight to Rows with hard casts:
// map the RDD to a rowRDD
val rowRDD = personRDD.map(p => Row(p(0).trim.toLong, p(1).trim, p(2).trim, p(3).trim.toLong, p(4).trim.toLong))
Using a cast instead:
// specify the schema of each field directly with a StructType
val schema = StructType(
  StructField("id", LongType, false) ::
  StructField("name", StringType, true) ::
  StructField("gender", StringType, true) ::
  StructField("salary", LongType, true) ::
  StructField("expense", LongType, true) :: Nil
)
// allow fields to be empty (empty values become null)
val rdd = personRDD.map(row =>
  row.toSeq.map(r => {
    if (r.trim.length > 0) {
      val castValue = Util.castTo(r.trim, schema.fields(row.toSeq.indexOf(r)).dataType)
      castValue
    }
    else null
  })).map(Row.fromSeq)
Util method:
def castTo(value: String, dataType: DataType) = {
  dataType match {
    case _: IntegerType => value.toInt
    case _: LongType => value.toLong
    case _: StringType => value
  }
}
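With the cleaned RDD of Rows in hand, the DataFrame can then be rebuilt against the same schema (a one-line sketch, assuming sqlContext is in scope as in the surrounding code):
val personDF = sqlContext.createDataFrame(rdd, schema)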
I have a spark dataframe for which I need to filter nulls and spaces for a particular column.
Let's say the dataframe has two columns. col2 has both nulls and also blanks.
col1 col2
1 abc
2 null
3 null
4
5 def
I want to filter out the records which have col2 as null or blank.
Can anyone please help with this?
Version:
Spark 1.6.2
Scala 2.10
The standard logical operators are defined on Spark Columns:
scala> val myDF = Seq((1, "abc"),(2,null),(3,null),(4, ""),(5,"def")).toDF("col1", "col2")
myDF: org.apache.spark.sql.DataFrame = [col1: int, col2: string]
scala> myDF.show
+----+----+
|col1|col2|
+----+----+
| 1| abc|
| 2|null|
| 3|null|
| 4| |
| 5| def|
+----+----+
scala> myDF.filter(($"col2" =!= "") && ($"col2".isNotNull)).show
+----+----+
|col1|col2|
+----+----+
| 1| abc|
| 5| def|
+----+----+
Note: depending on your Spark version you will need !== or =!= (the latter is the more current option).
If you had n conditions to be met I would probably use a list to reduce the boolean columns together:
val conds = List(myDF("a").contains("x"), myDF("b") =!= "y", myDF("c") > 2)
val filtered = myDF.filter(conds.reduce(_&&_))
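Since the question also mentions blanks, which may contain spaces rather than being truly empty, trimming before the comparison catches whitespace-only values as well (a sketch; on Spark 1.6 use !== instead of =!=):
import org.apache.spark.sql.functions.trim

// keep rows where col2 is neither null nor blank/whitespace-only
myDF.filter($"col2".isNotNull && trim($"col2") =!= "").show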
I have a CSV file and I am processing its data.
I am working with data frames, and I calculate average, min, max, mean, sum of each column based on some conditions. The data of each column could be empty or null.
I have noticed that in some cases I get null as the max or sum instead of a number, or max() returns a number that is less than what min() returns.
I do not want to replace the null/empty values with other.
The only thing I have done is to use these 2 options in CSV:
.option("nullValue", "null")
.option("treatEmptyValuesAsNulls", "true")
Is there any way to handle this issue? Has anyone faced this problem before? Is it a problem of data types?
I run something like this:
data.agg(mean("col_name"), stddev("col_name"),count("col_name"),
min("col_name"), max("col_name"))
Otherwise I can consider that it is a problem in my code.
I have done some research on this question, and the result shows that mean, max, min functions ignore null values. Below is the experiment code and results.
Environment: Scala, Spark 1.6.1, Hadoop 2.6.0
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext}
val row1 =Row("1", 2.4, "2016-12-21")
val row2 = Row("1", None, "2016-12-22")
val row3 = Row("2", None, "2016-12-23")
val row4 = Row("2", None, "2016-12-23")
val row5 = Row("3", 3.0, "2016-12-22")
val row6 = Row("3", 2.0, "2016-12-22")
val theRdd = sc.makeRDD(Array(row1, row2, row3, row4, row5, row6))
val schema = StructType(StructField("key", StringType, false) ::
StructField("value", DoubleType, true) ::
StructField("d", StringType, false) :: Nil)
val df = sqlContext.createDataFrame(theRdd, schema)
df.show()
df.agg(mean($"value"), max($"value"), min($"value")).show()
df.groupBy("key").agg(mean($"value"), max($"value"), min($"value")).show()
Output:
+---+-----+----------+
|key|value| d|
+---+-----+----------+
| 1| 2.4|2016-12-21|
| 1| null|2016-12-22|
| 2| null|2016-12-23|
| 2| null|2016-12-23|
| 3| 3.0|2016-12-22|
| 3| 2.0|2016-12-22|
+---+-----+----------+
+-----------------+----------+----------+
| avg(value)|max(value)|min(value)|
+-----------------+----------+----------+
|2.466666666666667| 3.0| 2.0|
+-----------------+----------+----------+
+---+----------+----------+----------+
|key|avg(value)|max(value)|min(value)|
+---+----------+----------+----------+
| 1| 2.4| 2.4| 2.4|
| 2| null| null| null|
| 3| 2.5| 3.0| 2.0|
+---+----------+----------+----------+
From the output you can see that the mean, max and min functions on column 'value' for group key='1' return '2.4' instead of null, which shows that null values were ignored by these functions. However, if the column contains only null values then these functions return null.
Contrary to one of the comments it is not true that nulls are ignored. Here is an approach:
max(coalesce(col_name,Integer.MinValue))
min(coalesce(col_name,Integer.MaxValue))
This will still have an issue if there were only null values: you will need to convert Min/MaxValue to null or whatever you want to use to represent "no valid/non-null entries".
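Spelled out in the DataFrame API, that approach might look like this (a sketch; df and its numeric value column refer to the experiment DataFrame above):
import org.apache.spark.sql.functions.{coalesce, lit, max, min}

// replace nulls with sentinels so they cannot win the max/min comparison
df.agg(
  max(coalesce($"value", lit(Int.MinValue))).as("max_value"),
  min(coalesce($"value", lit(Int.MaxValue))).as("min_value")
).show()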
To add to other answers:
Remember the null and NaN are different things to spark:
NaN is not a number and numeric aggregations on a column with NaN in it result in NaN
null is a missing value and numeric aggregations on a column with null ignore it as if the row wasn't even there
import numpy as np
from pyspark.sql import functions as F

df_ = spark.createDataFrame([(1, np.nan), (None, 2.0), (3, 4.0)], ("a", "b"))
df_.show()
+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|2.0|
|   3|4.0|
+----+---+
df_.agg(F.mean("a"), F.mean("b")).collect()
[Row(avg(a)=2.0, avg(b)=nan)]
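If you want NaN to be treated like a missing value so aggregations skip it, one option is to turn NaN into null first. A Scala sketch of the idea (dfWithNan is a hypothetical DataFrame shaped like df_ above; isnan and when exist in pyspark.sql.functions as well):
import org.apache.spark.sql.functions.{col, isnan, lit, mean, when}

// replace NaN in column b with null, so mean("b") ignores it
val cleaned = dfWithNan.withColumn("b", when(isnan(col("b")), lit(null)).otherwise(col("b")))
cleaned.agg(mean("a"), mean("b")).show()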