Adding a count column to my sequence in Scala

Given the code below, how would I go about adding a count column (e.g. .count("*").as("count"))?
The final output should look something like this:
+---+------+------+-----------------------------+-----+
| id|sum(d)|max(b)|concat_ws(,, collect_list(s))|count|
+---+------+------+-----------------------------+-----+
|  1|   1.0|  true|                            a|    1|
|  2|   4.0|  true|                          b,b|    2|
|  3|   3.0|  true|                            c|    1|
+---+------+------+-----------------------------+-----+
Current code is below:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df = Seq(
  (1, 1.0, true, "a"),
  (2, 2.0, false, "b"),
  (3, 3.0, false, "b"),
  (2, 2.0, false, "c")
).toDF("id", "d", "b", "s")

val dataTypes: Map[String, DataType] = df.schema.map(sf => (sf.name, sf.dataType)).toMap

def genericAgg(c: String): Column = dataTypes(c) match {
  case DoubleType  => sum(col(c))
  case StringType  => concat_ws(",", collect_list(col(c))) // "append"
  case BooleanType => max(col(c))
}

val aggExprs: Seq[Column] = df.columns.filterNot(_ == "id")
  .map(c => genericAgg(c))

df
  .groupBy("id")
  .agg(aggExprs.head, aggExprs.tail: _*)
  .show()

You can simply append count("*").as("count") to aggExprs.tail in your agg, as shown below:
df.
groupBy("id").agg(aggExprs.head, aggExprs.tail :+ count("*").as("count"): _*).
show
// +---+------+------+-----------------------------+-----+
// | id|sum(d)|max(b)|concat_ws(,, collect_list(s))|count|
// +---+------+------+-----------------------------+-----+
// | 1| 1.0| true| a| 1|
// | 3| 3.0| false| b| 1|
// | 2| 4.0| false| b,c| 2|
// +---+------+------+-----------------------------+-----+
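If you also want friendlier headers than sum(d) or concat_ws(,, collect_list(s)), a small variation is to alias each generated aggregate with its source column name; a sketch reusing the same genericAgg:
// alias each aggregate after its source column; count is appended as before
val namedAggExprs: Seq[Column] = df.columns.filterNot(_ == "id")
  .map(c => genericAgg(c).as(c))

df.groupBy("id")
  .agg(namedAggExprs.head, namedAggExprs.tail :+ count("*").as("count"): _*)
  .show()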

Related

How to retain the column structure of a Spark Dataframe following a map operation on rows

I am trying to apply a function to each row of a Spark DataFrame, as in the example.
val df = sc.parallelize(
Seq((1, 2, 0), (0, 0, 1), (0, 0, 0))).toDF("x", "y", "z")
df.show()
which yields
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 0|
| 0| 0| 1|
| 0| 0| 0|
+---+---+---+
Suppose I want to do something to the values in each row, for example changing 0 to 5.
val b = df.map(row => row.toSeq.map(x => x match{
case 0 => 5
case x: Int => x
}))
b.show()
+---------+
| value|
+---------+
|[1, 2, 5]|
|[5, 5, 1]|
|[5, 5, 5]|
+---------+
The function worked, but I now have one column whose entries are Lists, instead of 3 columns of Ints. I would like my named columns back.
You can define a UDF to apply this substitution. For example:
def subsDef(k: Int): Int = if(k==0) 5 else k
val subs = udf[Int, Int](subsDef)
Then you can apply the UDF to a specific column or, if you desire, to every column of the DF:
// to a single column, for example "x"
// (note: reassigning df requires it to be declared as a var)
df = df.withColumn("x", subs(col("x")))
df.show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 0|
| 5| 0| 1|
| 5| 0| 0|
+---+---+---+
// to every column of the DF (again, df must be a var for the reassignment)
df.columns.foreach(c => {
  df = df.withColumn(c, subs(col(c)))
})
df.show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 5|
| 5| 5| 1|
| 5| 5| 5|
+---+---+---+
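A slightly more functional variant of the loop above, avoiding the reassignment, is to fold over the columns; a sketch using the same subs UDF:
// foldLeft threads the DataFrame through each withColumn call without a var
val allSubstituted = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, subs(col(c)))
}
allSubstituted.show()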
Rather than transforming the DataFrame row-wise, consider using the built-in Spark API functions when/otherwise, as follows:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq((1, 2, 0), (0, 0, 1), (0, 0, 0)).toDF("x", "y", "z")
val vFrom = 0
val vTo = 5
val cols = df.columns // Filter for specific columns if necessary
df.select( cols.map( c =>
when(col(c) === vFrom, vTo).otherwise(col(c)).as(c)
): _*
).show
// +---+---+---+
// | x| y| z|
// +---+---+---+
// | 1| 2| 5|
// | 5| 5| 1|
// | 5| 5| 5|
// +---+---+---+
There are various ways to do it; here are some:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder

df.map(row => {
  val size = row.size
  var seq: Seq[Int] = Seq.empty[Int]
  for (a <- 0 until size) {
    val value: Int = row(a).asInstanceOf[Int]
    val newVal: Int = value match {
      case 0 => 5
      case _ => value
    }
    seq = seq :+ newVal
  }
  Row.fromSeq(seq)
})(RowEncoder.apply(df.schema))
val columns = df.columns
df.select(
columns.map(c => when(col(c) === 0, 5).otherwise(col(c)).as(c)): _*)
.show()
def fun: (Int => Int) = { x =>
if (x == 0) 5 else x
}
val function = udf(fun)
df.select(function(col("x")).as("x"),
function(col("y")).as("y"),
function(col("z")).as("z"))
.show()
def checkZero(a: Int): Int = if (a == 0) 5 else a
df.map {
case Row(a: Int, b: Int, c: Int) =>
Row(checkZero(a), checkZero(b), checkZero(c))
} { RowEncoder.apply(df.schema) }
.show()
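If the schema is fixed and known at compile time, one more option (not shown above) is to map over a typed Dataset, which preserves the named columns without an explicit RowEncoder; a minimal sketch, assuming the three Int columns x, y, z:
// sketch: a case class mirroring the known schema keeps the column names
case class XYZ(x: Int, y: Int, z: Int)

def subst(v: Int): Int = if (v == 0) 5 else v

import spark.implicits._ // provides the Encoder[XYZ]
df.as[XYZ]
  .map(r => XYZ(subst(r.x), subst(r.y), subst(r.z)))
  .show()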

How to update column of spark dataframe based on the values of previous record

I have three columns in df
Col1,col2,col3
X,x1,x2
Z,z1,z2
Y,
X,x3,x4
P,p1,p2
Q,q1,q2
Y
I want to do the following:
when col1 = X, store the values of col2 and col3,
and assign those column values to the next row where col1 = Y.
expected output
X,x1,x2
Z,z1,z2
Y,x1,x2
X,x3,x4
P,p1,p2
Q,q1,q2
Y,x3,x4
Any help would be appreciated.
Note: Spark 1.6
Here's one approach using a Window function, with steps as follows:
Add a row-identifying column (not needed if there is already one) and combine the non-key columns (presumably many of them) into one
Create tmp1 with conditional nulls and tmp2 using the last/rowsBetween Window function to back-fill with the last non-null value
Create newcols conditionally from cols and tmp2
Expand newcols back to individual columns using foldLeft
Note that this solution uses a Window function without partitioning, so it may not scale to a large dataset (a partitioned variant is sketched after the result below).
val df = Seq(
  ("X", "x1", "x2"),
  ("Z", "z1", "z2"),
  ("Y", "", ""),
  ("X", "x3", "x4"),
  ("P", "p1", "p2"),
  ("Q", "q1", "q2"),
  ("Y", "", "")
).toDF("col1", "col2", "col3")

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val colList = df.columns.filter(_ != "col1")

val df2 = df.select(
  $"col1",
  monotonically_increasing_id.as("id"),
  struct(colList.map(col): _*).as("cols")
)

val df3 = df2.
  withColumn("tmp1", when($"col1" === "X", $"cols")).
  withColumn("tmp2", last("tmp1", ignoreNulls = true).over(
    Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
  ))
df3.show
// +----+---+-------+-------+-------+
// |col1| id| cols| tmp1| tmp2|
// +----+---+-------+-------+-------+
// | X| 0|[x1,x2]|[x1,x2]|[x1,x2]|
// | Z| 1|[z1,z2]| null|[x1,x2]|
// | Y| 2| [,]| null|[x1,x2]|
// | X| 3|[x3,x4]|[x3,x4]|[x3,x4]|
// | P| 4|[p1,p2]| null|[x3,x4]|
// | Q| 5|[q1,q2]| null|[x3,x4]|
// | Y| 6| [,]| null|[x3,x4]|
// +----+---+-------+-------+-------+
val df4 = df3.withColumn( "newcols",
when($"col1" === "Y", $"tmp2").otherwise($"cols")
).select($"col1", $"newcols")
df4.show
// +----+-------+
// |col1|newcols|
// +----+-------+
// | X|[x1,x2]|
// | Z|[z1,z2]|
// | Y|[x1,x2]|
// | X|[x3,x4]|
// | P|[p1,p2]|
// | Q|[q1,q2]|
// | Y|[x3,x4]|
// +----+-------+
val dfResult = colList.foldLeft( df4 )(
(accDF, c) => accDF.withColumn(c, df4(s"newcols.$c"))
).drop($"newcols")
dfResult.show
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// | X| x1| x2|
// | Z| z1| z2|
// | Y| x1| x2|
// | X| x3| x4|
// | P| p1| p2|
// | Q| q1| q2|
// | Y| x3| x4|
// +----+----+----+
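As noted above, the Window has no partitionBy, so Spark pulls every row into a single partition. If the data carries a natural grouping column, the back-fill can be scoped per group; a sketch only, where groupCol is a hypothetical column not present in the example data:
// hypothetical: "groupCol" partitions the data; "id" must still order rows within each group
val w = Window.partitionBy("groupCol")
  .orderBy("id")
  .rowsBetween(Window.unboundedPreceding, 0)

val df3Partitioned = df2.
  withColumn("tmp1", when($"col1" === "X", $"cols")).
  withColumn("tmp2", last("tmp1", ignoreNulls = true).over(w))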
[UPDATE]
For Spark 1.x, last(colName, ignoreNulls) isn't available in the DataFrame API. A work-around is to fall back to Spark SQL, which supports ignore-nulls in its last() method:
df2.
  withColumn("tmp1", when($"col1" === "X", $"cols")).
  createOrReplaceTempView("df2table")
// on Spark 1.x, use registerTempTable("df2table") instead of createOrReplaceTempView
val df3 = sqlContext.sql("""
select col1, id, cols, tmp1, last(tmp1, true) over (
order by id rows between unbounded preceding and current row
) as tmp2
from df2table
""")
Yes, there is a lag function, which requires an ordering:
import org.apache.spark.sql.expressions.Window.orderBy
import org.apache.spark.sql.functions.{coalesce, lag}
case class Temp(a: String, b: Option[String], c: Option[String])
val input = ss.createDataFrame(
Seq(
Temp("A", Some("a1"), Some("a2")),
Temp("D", Some("d1"), Some("d2")),
Temp("B", Some("b1"), Some("b2")),
Temp("E", None, None),
Temp("C", None, None)
))
+---+----+----+
| a| b| c|
+---+----+----+
| A| a1| a2|
| D| d1| d2|
| B| b1| b2|
| E|null|null|
| C|null|null|
+---+----+----+
val order = orderBy($"a")
input
.withColumn("b", coalesce($"b", lag($"b", 1).over(order)))
.withColumn("c", coalesce($"c", lag($"c", 1).over(order)))
.show()
+---+---+---+
| a| b| c|
+---+---+---+
| A| a1| a2|
| B| b1| b2|
| C| b1| b2|
| D| d1| d2|
| E| d1| d2|
+---+---+---+

Spark ML StringIndexer and OneHotEncoder - empty strings error [duplicate]

I am importing a CSV file (using spark-csv) into a DataFrame which has empty String values. When applying the OneHotEncoder, the application crashes with the error requirement failed: Cannot have an empty string for name. Is there a way I can get around this?
I could reproduce the error in the example provided on the Spark ML page:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = sqlContext.createDataFrame(Seq(
(0, "a"),
(1, "b"),
(2, "c"),
(3, ""), //<- original example has "a" here
(4, "a"),
(5, "c")
)).toDF("id", "category")
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
.fit(df)
val indexed = indexer.transform(df)
val encoder = new OneHotEncoder()
.setInputCol("categoryIndex")
.setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.show()
It is annoying since missing/empty values are a highly generic case.
Thanks in advance,
Nikhil
The OneHotEncoder/OneHotEncoderEstimator does not accept an empty string for a name; otherwise you'll get the following error:
java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:33)
at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:32)
[...]
This is how I would do it (there is another way to do it, cf. @Anthony's answer):
I'll create a UDF to process the empty category:
import org.apache.spark.sql.functions._
def processMissingCategory = udf[String, String] { s => if (s == "") "NA" else s }
Then I'll apply the UDF to the column:
val df = sqlContext.createDataFrame(Seq(
(0, "a"),
(1, "b"),
(2, "c"),
(3, ""), //<- original example has "a" here
(4, "a"),
(5, "c")
)).toDF("id", "category")
.withColumn("category",processMissingCategory('category))
df.show
// +---+--------+
// | id|category|
// +---+--------+
// | 0| a|
// | 1| b|
// | 2| c|
// | 3| NA|
// | 4| a|
// | 5| c|
// +---+--------+
Now you can go back to your transformations:
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// | 0| a| 0.0|
// | 1| b| 2.0|
// | 2| c| 1.0|
// | 3| NA| 3.0|
// | 4| a| 0.0|
// | 5| c| 1.0|
// +---+--------+-------------+
// Spark <2.3
// val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
// Spark 2.3+
val encoder = new OneHotEncoderEstimator().setInputCols(Array("categoryIndex")).setOutputCols(Array("categoryVec"))
val encoded = encoder.transform(indexed)
encoded.show
// +---+--------+-------------+-------------+
// | id|category|categoryIndex| categoryVec|
// +---+--------+-------------+-------------+
// | 0| a| 0.0|(3,[0],[1.0])|
// | 1| b| 2.0|(3,[2],[1.0])|
// | 2| c| 1.0|(3,[1],[1.0])|
// | 3| NA| 3.0| (3,[],[])|
// | 4| a| 0.0|(3,[0],[1.0])|
// | 5| c| 1.0|(3,[1],[1.0])|
// +---+--------+-------------+-------------+
EDIT:
@Anthony's solution in Scala:
df.na.replace("category", Map( "" -> "NA")).show
// +---+--------+
// | id|category|
// +---+--------+
// | 0| a|
// | 1| b|
// | 2| c|
// | 3| NA|
// | 4| a|
// | 5| c|
// +---+--------+
I hope this helps!
Yep, it's a little thorny, but maybe you can just replace the empty string with something sure to be different from the other values. Note that I am using the pyspark DataFrameNaFunctions API, but Scala's should be similar.
df = sqlContext.createDataFrame([(0,"a"), (1,'b'), (2, 'c'), (3,''), (4,'a'), (5, 'c')], ['id', 'category'])
df = df.na.replace('', 'EMPTY', 'category')
df.show()
+---+--------+
| id|category|
+---+--------+
| 0| a|
| 1| b|
| 2| c|
| 3| EMPTY|
| 4| a|
| 5| c|
+---+--------+
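For reference, the Scala side of DataFrameNaFunctions looks much the same; a minimal sketch mirroring the pyspark call above, assuming df is the original Scala DataFrame with the empty category:
// Scala equivalent of the pyspark na.replace above (sketch)
val cleaned = df.na.replace("category", Map("" -> "EMPTY"))
cleaned.show()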
If the column contains null, the OneHotEncoder fails with a NullPointerException, therefore I extended the UDF to translate null values as well:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SQLContext

object OneHotEncoderExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OneHotEncoderExample Application").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // $example on$
    val df1 = sqlContext.createDataFrame(Seq(
      (0.0, "a"),
      (1.0, "b"),
      (2.0, "c"),
      (3.0, ""),
      (4.0, null),
      (5.0, "c")
    )).toDF("id", "category")

    import org.apache.spark.sql.functions.udf
    def emptyValueSubstitution = udf[String, String] {
      case ""    => "NA"
      case null  => "null"
      case value => value
    }

    val df = df1.withColumn("category", emptyValueSubstitution(df1("category")))

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
    val indexed = indexer.transform(df)
    indexed.show()

    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
      .setDropLast(false)
    val encoded = encoder.transform(indexed)
    encoded.show()
    // $example off$

    sc.stop()
  }
}
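For completeness, the same null-and-empty handling can be done without a UDF via DataFrameNaFunctions; a sketch against the df1 above, where the replacement labels "null" and "NA" are arbitrary:
// fill nulls first, then replace empty strings, both on the "category" column
val cleaned = df1
  .na.fill("null", Seq("category"))
  .na.replace("category", Map("" -> "NA"))
cleaned.show()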

Updating a column of a dataframe based on another column value

I'm trying to update the value of a column using another column's value in Scala.
This is the data in my dataframe :
+-------------------+------+------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
+-------------------+------+------+-----+------+----+--------------------+-----------+
| 1| 0| 0| Name| 0|Desc| | 0|
| 2| 2.11| 10000|Juice| 0| XYZ|2016/12/31 : Inco...| 0|
| 3|-0.500|-24.12|Fruit| -255| ABC| 1994-11-21 00:00:00| 0|
| 4| 0.087| 1222|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5| 0.087| 1222|Bread|-22.06| | | 0|
+-------------------+------+------+-----+------+----+--------------------+-----------+
Here column _c5 contains a value which is incorrect (the value in row 2 contains the string Incorrect), based on which I'd like to update its isBadRecord field to 1.
Is there a way to update this field?
You can use the withColumn API with whichever function meets your needs to fill in 1 for a bad record.
For your case you can write a UDF function:
def fillbad = udf((c5 : String) => if(c5.contains("Incorrect")) 1 else 0)
And call it as
val newDF = dataframe.withColumn("isBadRecord", fillbad(dataframe("_c5")))
Instead of reasoning about updating it, I suggest you think about it as you would in SQL; you could do the following:
import org.apache.spark.sql.functions.when
val spark: SparkSession = ??? // your spark session
val df: DataFrame = ??? // your dataframe
import spark.implicits._
df.select(
$"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
$"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")
Here is a self-contained script that you can copy and paste on your Spark shell to see the result locally:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.when
import org.apache.spark.sql.types._
sc.setLogLevel("ERROR")
val schema =
StructType(Seq(
StructField("UniqueRowIdentifier", IntegerType),
StructField("_c0", DoubleType),
StructField("_c1", DoubleType),
StructField("_c2", StringType),
StructField("_c3", DoubleType),
StructField("_c4", StringType),
StructField("_c5", StringType),
StructField("isBadRecord", IntegerType)))
val contents =
Seq(
Row(1, 0.0 , 0.0 , "Name", 0.0, "Desc", "", 0),
Row(2, 2.11 , 10000.0 , "Juice", 0.0, "XYZ", "2016/12/31 : Incorrect", 0),
Row(3, -0.5 , -24.12, "Fruit", -255.0, "ABC", "1994-11-21 00:00:00", 0),
Row(4, 0.087, 1222.0 , "Bread", -22.06, "", "2017-02-14 00:00:00", 0),
Row(5, 0.087, 1222.0 , "Bread", -22.06, "", "", 0)
)
val df = spark.createDataFrame(sc.parallelize(contents), schema)
df.show()
val withBadRecords =
df.select(
$"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
$"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")
withBadRecords.show()
The relevant output is the following:
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
| 1| 0.0| 0.0| Name| 0.0|Desc| | 0|
| 2| 2.11|10000.0|Juice| 0.0| XYZ|2016/12/31 : Inco...| 0|
| 3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00| 0|
| 4|0.087| 1222.0|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5|0.087| 1222.0|Bread|-22.06| | | 0|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
| 1| 0.0| 0.0| Name| 0.0|Desc| | 0|
| 2| 2.11|10000.0|Juice| 0.0| XYZ|2016/12/31 : Inco...| 1|
| 3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00| 0|
| 4|0.087| 1222.0|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5|0.087| 1222.0|Bread|-22.06| | | 0|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
The best option is to create a UDF that tries to convert the value to a date format.
If it can be converted then return 0, else return 1.
This works even if you have a bad date format.
import java.text.SimpleDateFormat
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local")
  .appName("test").getOrCreate()
import spark.implicits._

// create test dataframe
val data = spark.sparkContext.parallelize(Seq(
  (1, "1994-11-21 Xyz"),
  (2, "1994-11-21 00:00:00"),
  (3, "1994-11-21 00:00:00")
)).toDF("id", "date")

// create udf which tries to convert to date format
// returns 0 if success and returns 1 if failure
val check = udf((value: String) => {
  Try(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(value)) match {
    case Success(d) => 0
    case Failure(e) => 1
  }
})

// Add column
data.withColumn("badData", check($"date")).show
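Alternatively, if you'd rather avoid a UDF, the built-in unix_timestamp returns null when the string does not match the given pattern, so the same flag can be derived with when/otherwise; a sketch against the same data:
import org.apache.spark.sql.functions.{unix_timestamp, when}

// null from unix_timestamp means the date failed to parse
data.withColumn(
  "badData",
  when(unix_timestamp($"date", "yyyy-MM-dd HH:mm:ss").isNull, 1).otherwise(0)
).show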
Hope this helps!

How to combine (join) information across an Array[DataFrame]

I have an Array[DataFrame] and I want to check, for each row of each data frame, if there is any change in the values by column. Say I have the first row of three data frames, like:
(0,1.0,0.4,0.1)
(0,3.0,0.2,0.1)
(0,5.0,0.4,0.1)
The first column is the ID, and my ideal output for this ID would be:
(0, 1, 1, 0)
meaning that the second and third columns changed while the fourth did not.
I attach here a bit of data to replicate my setting
val rdd = sc.parallelize(Array((0,1.0,0.4,0.1),
(1,0.9,0.3,0.3),
(2,0.2,0.9,0.2),
(3,0.9,0.2,0.2),
(4,0.3,0.5,0.5)))
val rdd2 = sc.parallelize(Array((0,3.0,0.2,0.1),
(1,0.9,0.3,0.3),
(2,0.2,0.5,0.2),
(3,0.8,0.1,0.1),
(4,0.3,0.5,0.5)))
val rdd3 = sc.parallelize(Array((0,5.0,0.4,0.1),
(1,0.5,0.3,0.3),
(2,0.3,0.3,0.5),
(3,0.3,0.3,0.1),
(4,0.3,0.5,0.5)))
val df = rdd.toDF("id", "prop1", "prop2", "prop3")
val df2 = rdd2.toDF("id", "prop1", "prop2", "prop3")
val df3 = rdd3.toDF("id", "prop1", "prop2", "prop3")
val result:Array[DataFrame] = new Array[DataFrame](3)
result.update(0, df)
result.update(1,df2)
result.update(2,df3)
How can I map over the array and get my output?
You can use countDistinct with groupBy:
import org.apache.spark.sql.functions.{countDistinct}
val exprs = Seq("prop1", "prop2", "prop3")
.map(c => (countDistinct(c) > 1).cast("integer").alias(c))
val combined = result.reduce(_ unionAll _)
val aggregatedViaGroupBy = combined
.groupBy($"id")
.agg(exprs.head, exprs.tail: _*)
aggregatedViaGroupBy.show
// +---+-----+-----+-----+
// | id|prop1|prop2|prop3|
// +---+-----+-----+-----+
// | 0| 1| 1| 0|
// | 1| 1| 0| 0|
// | 2| 1| 1| 1|
// | 3| 1| 1| 1|
// | 4| 0| 0| 0|
// +---+-----+-----+-----+
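A small variation, if you'd rather not hard-code the property columns: derive the aggregation expressions from the schema; a sketch using the same combined DataFrame:
// build the same expressions from the schema instead of a hard-coded list
val schemaExprs = combined.columns.filter(_ != "id")
  .map(c => (countDistinct(c) > 1).cast("integer").alias(c))

combined.groupBy($"id").agg(schemaExprs.head, schemaExprs.tail: _*).show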
First we need to join all the DataFrames together.
val combined = result.reduceLeft((a,b) => a.join(b,"id"))
To compare all the columns of the same label (e.g., "prod1"), I found it easier (at least for me) to operate on the RDD level. We first transform the data into (id, Seq[Double]).
val finalResults = combined.rdd.map{
x =>
(x.getInt(0), x.toSeq.tail.map(_.asInstanceOf[Double]))
}.map{
case(i,d) =>
def checkAllEqual(l: Seq[Double]) = if(l.toSet.size == 1) 0 else 1
val g = d.grouped(3).toList
val g1 = checkAllEqual(g.map(x => x(0)))
val g2 = checkAllEqual(g.map(x => x(1)))
val g3 = checkAllEqual(g.map(x => x(2)))
(i, g1,g2,g3)
}.toDF("id", "prod1", "prod2", "prod3")
finalResults.show()
This will print:
+---+-----+-----+-----+
| id|prod1|prod2|prod3|
+---+-----+-----+-----+
| 0| 1| 1| 0|
| 1| 1| 0| 0|
| 2| 1| 1| 1|
| 3| 1| 1| 1|
| 4| 0| 0| 0|
+---+-----+-----+-----+
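The chunking step above hard-codes three property columns. A sketch of how one might generalize it to any number of source DataFrames, assuming each one shares the same (id, prop...) schema as in the question and that "spark" is the SparkSession:
// sketch: group the joined values into one chunk per source DataFrame,
// then flag each property that is not identical across all chunks
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val propCols = df.columns.filter(_ != "id")               // prop1, prop2, prop3
val joined   = result.reduceLeft((a, b) => a.join(b, "id"))

val flagsRdd = joined.rdd.map { row =>
  val id     = row.getInt(0)
  val values = row.toSeq.tail.map(_.asInstanceOf[Double])
  val perDf  = values.grouped(propCols.length).toList     // one chunk per DataFrame
  val flags  = propCols.indices.map(i => if (perDf.map(_(i)).toSet.size == 1) 0 else 1)
  Row.fromSeq(id +: flags)
}

val schema = StructType(
  StructField("id", IntegerType) +: propCols.map(c => StructField(c, IntegerType))
)
spark.createDataFrame(flagsRdd, schema).show()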