Empty values in spark dataframe - scala

I am trying to insert a dataframe into Cassandra:
result.rdd.saveToCassandra(keyspaceName, tableName)
However, some of the column values are empty, and as a result I get exceptions:
java.lang.NumberFormatException: empty String
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1842)
at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
at java.lang.Float.parseFloat(Float.java:451)
at scala.collection.immutable.StringLike$class.toFloat(StringLike.scala:231)
at scala.collection.immutable.StringOps.toFloat(StringOps.scala:31)
at com.datastax.spark.connector.types.TypeConverter$FloatConverter$$anonfun$convertPF$4.applyOrElse(TypeConverter.scala:216)
Is there a way to replace all empty values with null in the dataframe, and would that solve this issue?
For this question, let's assume this is the dataframe df:
col1 | col2 | col3
"A" | "B" | 1
"E" | "F" |
"S" | "K" | 5
How can I replace that empty value in col3 with null?

If you cast the DataFrame column to your numeric type, then any values that cannot be parsed to the appropriate type will be turned into nulls.
import org.apache.spark.sql.types.IntegerType
df.select(
$"col1",
$"col2",
$"col3" cast IntegerType
)
or if you don't have a select statement
df.withColumn("col3", df("col3") cast IntegerType)
If you have many columns that you want to apply this to, and you feel it would be too inconvenient to do in a select statement, or if casting won't work for your case, you can convert to an RDD to apply the transformation and then go back to a DataFrame. You may want to define a method for this.
import org.apache.spark.sql.{DataFrame, Row}

def emptyToNull(df: DataFrame): DataFrame = {
  val sqlCtx = df.sqlContext
  val schema = df.schema
  val rdd = df.rdd
    .map(row =>
      row.toSeq.map {
        case "" => null
        case otherwise => otherwise
      })
    .map(Row.fromSeq)
  sqlCtx.createDataFrame(rdd, schema)
}
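For completeness, a minimal usage sketch (my own addition, not part of the original answer) that rebuilds the example df with col3 read as strings, applies emptyToNull, and then casts col3:
import org.apache.spark.sql.types.IntegerType
import spark.implicits._  // assumes a SparkSession named spark

// rebuild the example df from the question, with col3 as a string column
val df = Seq(("A", "B", "1"), ("E", "F", ""), ("S", "K", "5")).toDF("col1", "col2", "col3")

// the empty string in col3 becomes null, after which the cast to IntegerType is safe
val cleaned = emptyToNull(df)
cleaned.withColumn("col3", cleaned("col3").cast(IntegerType)).show()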

You can write a udf for this:
val df = Seq(("A", "B", "1"), ("E", "F", ""), ("S", "K", "1")).toDF("col1", "col2", "col3")
// make a udf that converts String to option[String]
val nullif = udf((s: String) => if(s == "") None else Some(s))
df.withColumn("col3", nullif($"col3")).show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| B| 1|
| E| F|null|
| S| K| 1|
+----+----+----+
You can also use when.otherwise, if you want to avoid the usage of udf:
df.withColumn("col3", when($"col3" === "", null).otherwise($"col3")).show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| B| 1|
| E| F|null|
| S| K| 1|
+----+----+----+
Or you can use SQL nullif function to convert empty string to null:
df.selectExpr("col1", "col2", "nullif(col3, \"\") as col3").show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| B| 1|
| E| F|null|
| S| K| 1|
+----+----+----+
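If you want to apply the same empty-string-to-null replacement to every column at once, here is a small sketch (my own addition, building on the when.otherwise form above; the dfNulled name is just illustrative):
import org.apache.spark.sql.functions.{col, when}

// fold over all column names, replacing "" with null in each column
val dfNulled = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col(c) === "", null).otherwise(col(c)))
}
dfNulled.show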

Before:
// map the RDD to a rowRDD
val rowRDD = personRDD.map(p => Row(p(0).trim.toLong, p(1).trim, p(2).trim, p(3).trim.toLong, p(4).trim.toLong))
After, using a cast:
// specify the schema of each field directly via StructType
val schema = StructType(
  StructField("id", LongType, false) ::
  StructField("name", StringType, true) ::
  StructField("gender", StringType, true) ::
  StructField("salary", LongType, true) ::
  StructField("expense", LongType, true) :: Nil
)
// allow fields to be null
val rdd = personRDD.map(row =>
  row.toSeq.map(r => {
    if (r.trim.length > 0) {
      val castValue = Util.castTo(r.trim, schema.fields(row.toSeq.indexOf(r)).dataType)
      castValue
    }
    else null
  })).map(Row.fromSeq)
Util method:
import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StringType}

def castTo(value: String, dataType: DataType) = {
  dataType match {
    case _: IntegerType => value.toInt
    case _: LongType => value.toLong
    case _: StringType => value
  }
}
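To round this out, a minimal sketch (my own addition, assuming the casted rdd and the schema defined above, plus a SparkSession named spark) of turning the result back into a DataFrame:
// rdd is the RDD[Row] produced by .map(Row.fromSeq) above; schema is the StructType above
val personDF = spark.createDataFrame(rdd, schema)
personDF.show()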

Related

How to add a new column to my DataFrame such that values of new column are populated by some other function in scala?

def myFunc(row: Row): String = {
  // process row
  // returns a string
}
def appendNewCol(inputDF: DataFrame): DataFrame = {
  inputDF.withColumn("newcol", myFunc(Row))
  inputDF
}
But no new column gets created in my case. My myFunc passes the row to a knowledge-base session object, which returns a string after firing rules. Can I do it this way? If not, what is the right way? Thanks in advance.
I saw many Stack Overflow solutions using expr(), sqlfunc(col(...)), udf(x) and other techniques, but here my newcol is not derived directly from an existing column.
Dataframe:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val myFunc = (r: Row) => {r.getAs[String]("col1") + "xyz"} // example transformation
val testDf = spark.sparkContext.parallelize(Seq(
(1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")
testDf.show
val rddRes = testDf
  .rdd
  .map { x =>
    val y = myFunc(x)
    Row.fromSeq(x.toSeq ++ Seq(y))
  }
val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType = StringType, nullable = false)))
spark.sqlContext.createDataFrame(rddRes, newSchema).show
Results:
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| def|
| 3| ghi|
+---+----+
+---+----+------+
| id|col1| col2|
+---+----+------+
| 1| abc|abcxyz|
| 2| def|defxyz|
| 3| ghi|ghixyz|
+---+----+------+
With Dataset:
import org.apache.spark.sql.Dataset

case class TestData(id: Int, col1: String)
case class TransformedData(id: Int, col1: String, col2: String)
val test: Dataset[TestData] = List(TestData(1, "abc"), TestData(2, "def"), TestData(3, "ghi")).toDS
val transformed: Dataset[TransformedData] = test
  .map { x: TestData =>
    val newCol = x.col1 + "xyz"
    TransformedData(x.id, x.col1, newCol)
  }
transformed.show
As you can see, the Dataset version is more readable, and it provides strong typing.
Since I'm unaware of your Spark version, I'm providing both solutions here. However, if you're using Spark >= 1.6, you should look into Datasets. Playing with RDDs is fun, but it can quickly devolve into longer job runs and a host of other issues you won't foresee.

Scala spark, input dataframe, return columns where all values equal to 1

Given a dataframe, say that it contains 4 columns and 3 rows. I want to write a function to return the columns where all the values in that column are equal to 1.
This is Scala code. I want to use some Spark transformations to transform or filter the dataframe input. This filter should be implemented in a function.
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer)
val example = Seq(
Grade(1,3,1,1),
Grade(1,1,null,1),
Grade(1,10,2,1)
)
val dfInput = spark.createDataFrame(example)
After I call the function filterColumns()
val dfOutput = dfInput.filterColumns()
it should return a dataframe with 3 rows and 2 columns, where all the values are 1.
A bit more readable approach using Dataset[Grade]
import org.apache.spark.sql.functions.col
import scala.collection.mutable
import org.apache.spark.sql.Column
val tmp = dfInput.as[Grade].map(grade => grade.dropWhenNotEqualsTo(1))
val rowsCount = dfInput.count()
val colsToRetain = mutable.Set[Column]()
for (column <- tmp.columns) {
val withoutNullsCount = tmp.select(column).na.drop().count()
if (rowsCount == withoutNullsCount) colsToRetain += col(column)
}
dfInput.select(colsToRetain.toArray:_*).show()
+---+---+
| c4| c1|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
And the case class:
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer) {
def dropWhenNotEqualsTo(n: Integer): Grade = {
Grade(nullOrValue(c1, n), nullOrValue(c2, n), nullOrValue(c3, n), nullOrValue(c4, n))
}
def nullOrValue(c: Integer, n: Integer) = if (c == n) c else null
}
grade.dropWhenNotEqualsTo(1) -> returns a new Grade with every value that does not satisfy the condition replaced by null
+---+----+----+---+
| c1| c2| c3| c4|
+---+----+----+---+
| 1|null| 1| 1|
| 1| 1|null| 1|
| 1|null|null| 1|
+---+----+----+---+
(column <- tmp.columns) -> iterate over the columns
tmp.select(column).na.drop() -> drop rows with nulls
e.g for c2 this will return
+---+
| c2|
+---+
| 1|
+---+
if (rowsCount == withoutNullsCount) colsToRetain += col(column) -> keep the column only if it contains no nulls; columns containing nulls are dropped
One of the options is a reduce on the RDD:
import spark.implicits._
val df= Seq(("1","A","3","4"),("1","2","?","4"),("1","2","3","4")).toDF()
df.show()
val first = df.first()
val size = first.length
val diffStr = "#"
val targetStr = "1"
def rowToArray(row: Row): Array[String] = {
val arr = new Array[String](row.length)
for (i <- 0 to row.length-1){
arr(i) = row.getString(i)
}
arr
}
def compareArrays(a1: Array[String], a2: Array[String]): Array[String] = {
val arr = new Array[String](a1.length)
for (i <- 0 to a1.length-1){
arr(i) = if (a1(i).equals(a2(i)) && a1(i).equals(targetStr)) a1(i) else diffStr
}
arr
}
val diff = df.rdd
.map(rowToArray)
.reduce(compareArrays)
val cols = (df.columns zip diff).filter(!_._2.equals(diffStr)).map(s=>df(s._1))
df.select(cols:_*).show()
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| A| 3| 4|
| 1| 2| ?| 4|
| 1| 2| 3| 4|
+---+---+---+---+
+---+
| _1|
+---+
| 1|
| 1|
| 1|
+---+
I would try to prepare the dataset for processing without nulls. With only a few columns, this simple iterative approach might work fine (don't forget to import the Spark implicits first: import spark.implicits._):
val example = spark.sparkContext.parallelize(Seq(
Grade(1,3,1,1),
Grade(1,1,0,1),
Grade(1,10,2,1)
)).toDS().cache()
def allOnes(colName: String, ds: Dataset[Grade]): Boolean = {
val row = ds.select(colName).distinct().collect()
if (row.length == 1 && row.head.getInt(0) == 1) true
else false
}
val resultColumns = example.columns.filter(col => allOnes(col, example))
example.selectExpr(resultColumns: _*).show()
result is:
+---+---+
| c1| c4|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
If nulls are inevitable, use untyped dataset (aka dataframe):
val schema = StructType(Seq(
StructField("c1", IntegerType, nullable = true),
StructField("c2", IntegerType, nullable = true),
StructField("c3", IntegerType, nullable = true),
StructField("c4", IntegerType, nullable = true)
))
val example = spark.sparkContext.parallelize(Seq(
Row(1,3,1,1),
Row(1,1,null,1),
Row(1,10,2,1)
))
val dfInput = spark.createDataFrame(example, schema).cache()
def allOnes(colName: String, df: DataFrame): Boolean = {
val row = df.select(colName).distinct().collect()
if (row.length == 1 && row.head.getInt(0) == 1) true
else false
}
val resultColumns= dfInput.columns.filter(col => allOnes(col, dfInput))
dfInput.selectExpr(resultColumns: _*).show()

Scala: Find the maximum value across each row of a dataframe

For each row of a DataFrame, I would like to extract the maximum value and put it in a new column.
The example code below gives me a DataFrame ('dfmax') of each maximum value:
val donuts = Seq((2.0, 1.50, 3.5), (4.2, 22.3, 10.8), (33.6, 2.50, 7.3))
val df = sparkSession
.createDataFrame(donuts)
.toDF("col1", "col2", "col3")
df.show()
import sparkSession.implicits._
val dfmax = df.map(r => r.getValuesMap[Double](df.schema.fieldNames).map(r => r._2).max)
dfmax.show
This gives me df:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 2.0| 1.5| 3.5|
| 4.2|22.3|10.8|
|33.6| 2.5| 7.3|
+----+----+----+
and dfmax:
+-----+
|value|
+-----+
| 3.5|
| 22.3|
| 33.6|
+-----+
I would like to have these two frames combined into one table, preferably using .withColumn or similar, in a style like this (which I cannot get to work):
def maxValue(data: DataFrame): DataFrame = {
val dfmax = df.map(r => r.getValuesMap[Double](df.schema.fieldNames).map(r => r._2).max)
dfmax
}
val udfMaxValue = udf(maxValue _)
df.withColumn("max", udfMaxValue(df))

Check if arraytype column contains null

I have a dataframe with a column of ArrayType that can contain integer values. If there are no values, the array will contain a single element, and that element will be the null value.
Important: note the column itself will not be null, but an array with a single value: null.
> val df: DataFrame = Seq(("foo", Seq(Some(2), Some(3))), ("bar", Seq(None))).toDF("k", "v")
df: org.apache.spark.sql.DataFrame = [k: string, v: array<int>]
> df.show()
+---+------+
| k| v|
+---+------+
|foo|[2, 3]|
|bar|[null]|
Question: I'd like to get the rows that have the null value.
What I have tried thus far:
> df.filter(array_contains(df("v"), 2)).show()
+---+------+
| k| v|
+---+------+
|foo|[2, 3]|
+---+------+
For null, it does not seem to work:
> df.filter(array_contains(df("v"), null)).show()
org.apache.spark.sql.AnalysisException: cannot resolve
'array_contains(v, NULL)' due to data type mismatch: Null typed
values cannot be used as arguments;
or
> df.filter(array_contains(df("v"), None)).show()
java.lang.RuntimeException: Unsupported literal type class scala.None$ None
It is not possible to use array_contains in this case because SQL NULL cannot be compared for equality.
You can use udf like this:
val contains_null = udf((xs: Seq[Integer]) => xs.contains(null))
df.where(contains_null($"v")).show
// +---+------+
// | k| v|
// +---+------+
// |bar|[null]|
For Spark 2.4+, you can use the higher-order function exists instead of UDF:
df.where("exists(v, x -> x is null)").show
//+---+---+
//| k| v|
//+---+---+
//|bar| []|
//+---+---+
The PySpark implementation, if needed (assuming import pyspark.sql.functions as f and from pyspark.sql.types import BooleanType):
contains_null = f.udf(lambda x: None in x, BooleanType())
df.filter(contains_null(f.col("v"))).show()

Spark - How to convert map function output (Row,Row) tuple to one Dataframe

I need to write one scenario in Spark using Scala API.
I am passing a user-defined function to a DataFrame which processes each row of the data frame one by one and returns a tuple (Row, Row). How can I change the resulting RDD[(Row, Row)] into a DataFrame of Row? See the code sample below.
Calling the map function:
val df_temp = df_outPut.map { x => AddUDF.add(x,date1,date2)}
UDF definition:
def add(x: Row, dates: String*): (Row, Row) = {
  ......................
  ........................
  var result1, result2: Row = Row()
  ..........
  return (result1, result2)
}
Now df_temp is an RDD[(Row, Row)]. My requirement is to break each tuple into individual records so that I end up with a single RDD[Row] or DataFrame of Row. Appreciate your help.
You can use flatMap to flatten your Row tuples. Say we start from this example RDD:
rddExample.collect()
// res37: Array[(org.apache.spark.sql.Row, org.apache.spark.sql.Row)] = Array(([1,2],[3,4]), ([2,1],[4,2]))
val flatRdd = rddExample.flatMap{ case (x, y) => List(x, y) }
// flatRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[45] at flatMap at <console>:35
To convert it to a data frame:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val schema = StructType(StructField("x", IntegerType, true)::
StructField("y", IntegerType, true)::Nil)
val df = sqlContext.createDataFrame(flatRdd, schema)
df.show
+---+---+
| x| y|
+---+---+
| 1| 2|
| 3| 4|
| 2| 1|
| 4| 2|
+---+---+