Scala filter to filter id's from list of string not working - scala

I have a DataFrame with an id (bigint) column, and I need to filter out the ids that appear in a list (of strings).
+-----------+
|id |
+-----------+
| 1231|
| 1331|
| 1431|
| 1531|
| 9431|
+-----------+
val a= List(1231,5031,1331,1441,1531)
Expected output:
+-----------+
|id |
+-----------+
| 1431|
| 9431|
+-----------+
I tried as below
df.filter(!col(("id")).isin(a : _*))
But it is not filtering out those ids. Any idea what's wrong here?

One option is to use a UDF. Check this out:
scala> val df = Seq(1231,1331,1431,1531,9431).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> val a= List(1231,5031,1331,1441,1531)
a: List[Int] = List(1231, 5031, 1331, 1441, 1531)
scala> def udf_contain(x:Int)={
| ! a.contains(x)
| }
udf_contain: (x: Int)Boolean
scala> val myudf_contain = udf ( udf_contain(_:Int):Boolean )
myudf_contain: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(IntegerType)))
scala> df.filter(myudf_contain('id)).show
+----+
| id|
+----+
|1431|
|9431|
+----+
scala>
or the RDD way.
scala> val rdd = Seq(1231,1331,1431,1531,9431).toDF("id").rdd
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[14] at rdd at <console>:32
scala> val a= List(1231,5031,1331,1441,1531)
a: List[Int] = List(1231, 5031, 1331, 1441, 1531)
scala> def udf_contain(x:Int)={
| ! a.contains(x)
| }
udf_contain: (x: Int)Boolean
scala>
scala> rdd.filter(x=>udf_contain(Row(x(0)).mkString.toInt)).collect
res29: Array[org.apache.spark.sql.Row] = Array([1431], [9431])
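For comparison, the isin expression from the question can express the same filter without a UDF. The following is a minimal sketch, assuming Spark SQL is on the classpath, a local SparkSession, and the sample ids stored as Long (bigint); the names spark, kept and keptIds are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session, used only for illustration.
val spark = SparkSession.builder().master("local[1]").appName("isin-sketch").getOrCreate()
import spark.implicits._

val df = Seq(1231L, 1331L, 1431L, 1531L, 9431L).toDF("id") // id is bigint
val a = List(1231, 5031, 1331, 1441, 1531)

// Keep only the rows whose id is NOT in the list.
val kept = df.filter(!col("id").isin(a: _*))

// Bring the surviving ids back to the driver for inspection.
val keptIds = kept.as[Long].collect().sorted
// keptIds: Array(1431, 9431)
```

If this form returns unexpected rows on real data, comparing the column type with the element type of the list is a reasonable first check.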

Related

Add new column containing an Array of column names sorted by the row-wise values

Given a dataFrame with a few columns, I'm trying to create a new column containing an array of these columns' names sorted by decreasing order, based on the row-wise values of these columns.
| a | b | c | newcol|
|---|---|---|-------|
| 1 | 4 | 3 |[b,c,a]|
| 4 | 1 | 3 |[a,c,b]|
---------------------
The names of the columns are stored in a var names:Array[String]
What approach should I go for?
Using a UDF is the simplest way to achieve custom tasks like this.
val df = spark.createDataFrame(Seq((1,4,3), (4,1,3))).toDF("a", "b", "c")
val names=df.schema.fieldNames
val sortNames = udf((v: Seq[Int]) => {v.zip(names).sortBy(-_._1).map(_._2)}) // negate the value to sort in decreasing order
df.withColumn("newcol", sortNames(array(names.map(col): _*))).show
Something like this can be an approach using Dataset:
case class Element(name: String, value: Int)
case class Columns(a: Int, b: Int, c: Int, elements: Array[String])
def function1()(implicit spark: SparkSession) = {
import spark.implicits._
val df0: DataFrame =
spark.createDataFrame(spark.sparkContext
.parallelize(Seq(Row(1, 2, 3), Row(4, 1, 3))),
StructType(Seq(StructField("a", IntegerType, false),
StructField("b", IntegerType, false),
StructField("c", IntegerType, false))))
val df1 = df0
.flatMap(row => Seq(Columns(row.getAs[Int]("a"),
row.getAs[Int]("b"),
row.getAs[Int]("c"),
Array(Element("a", row.getAs[Int]("a")),
Element("b", row.getAs[Int]("b")),
Element("c", row.getAs[Int]("c"))).sortBy(-_.value).map(_.name))))
df1
}
def main(args: Array[String]) : Unit = {
implicit val spark = SparkSession.builder().master("local[1]").getOrCreate()
function1().show()
}
gives:
+---+---+---+---------+
| a| b| c| elements|
+---+---+---+---------+
| 1| 2| 3|[a, b, c]|
| 4| 1| 3|[b, c, a]|
+---+---+---+---------+
Try something like this:
val sorted_column_names = udf((column_map: Map[String, Int]) =>
column_map.toSeq.sortBy(- _._2).map(_._1)
)
df.withColumn("column_map", map(lit("a"), $"a", lit("b"), $"b", lit("c"), $"c"))
.withColumn("newcol", sorted_column_names($"column_map"))
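The row-wise logic behind these answers can be checked without Spark. This is a small sketch on plain Scala collections, assuming three columns named a, b and c; sortNamesDesc is a hypothetical helper name, not part of any library.

```scala
// Column names, in the same order as the values of each row.
val names = Array("a", "b", "c")

// Pair each value with its column name, sort by value in decreasing
// order, and keep only the names.
def sortNamesDesc(values: Seq[Int]): Seq[String] =
  values.zip(names).sortBy(_._1)(Ordering[Int].reverse).map(_._2)

val first = sortNamesDesc(Seq(1, 4, 3))  // Seq(b, c, a)
val second = sortNamesDesc(Seq(4, 1, 3)) // Seq(a, c, b)
```

Using a reversed Ordering instead of negating the value avoids overflow for Int.MinValue; the same function body could be wrapped in a UDF.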

In spark iterate through each column and find the max length

I am new to Spark with Scala and have the following situation.
I have a table "TEST_TABLE" on a cluster (it can be a Hive table).
I am converting it to a DataFrame as follows:
scala> val testDF = spark.sql("select * from TEST_TABLE limit 10")
Now the DF can be viewed as
scala> testDF.show()
COL1|COL2|COL3
----------------
abc|abcd|abcdef
a|BCBDFG|qddfde
MN|1234B678|sd
I want an output like below
COLUMN_NAME|MAX_LENGTH
COL1|3
COL2|8
COL3|6
Is this feasible in Spark with Scala?
Plain and simple:
import org.apache.spark.sql.functions._
val df = spark.table("TEST_TABLE")
df.select(df.columns.map(c => max(length(col(c)))): _*)
You can try in the following way:
import org.apache.spark.sql.functions.{length, max}
import spark.implicits._
val df = Seq(("abc","abcd","abcdef"),
("a","BCBDFG","qddfde"),
("MN","1234B678","sd"),
(null,"","sd")).toDF("COL1","COL2","COL3")
df.cache()
val output = df.columns.map(c => (c, df.agg(max(length(df(s"$c")))).as[Int].first())).toSeq.toDF("COLUMN_NAME", "MAX_LENGTH")
+-----------+----------+
|COLUMN_NAME|MAX_LENGTH|
+-----------+----------+
| COL1| 3|
| COL2| 8|
| COL3| 6|
+-----------+----------+
I think it's a good idea to cache the input DataFrame df to make the computation faster.
Here is one more way to get the report with the column names laid out vertically:
scala> val df = Seq(("abc","abcd","abcdef"),("a","BCBDFG","qddfde"),("MN","1234B678","sd")).toDF("COL1","COL2","COL3")
df: org.apache.spark.sql.DataFrame = [COL1: string, COL2: string ... 1 more field]
scala> df.show(false)
+----+--------+------+
|COL1|COL2 |COL3 |
+----+--------+------+
|abc |abcd |abcdef|
|a |BCBDFG |qddfde|
|MN |1234B678|sd |
+----+--------+------+
scala> val columns = df.columns
columns: Array[String] = Array(COL1, COL2, COL3)
scala> val df2 = columns.foldLeft(df) { (acc,x) => acc.withColumn(x,length(col(x))) }
df2: org.apache.spark.sql.DataFrame = [COL1: int, COL2: int ... 1 more field]
scala> val df3 = df2.select( columns.map(x => max(col(x))):_* )
scala> df3.show(false)
+---------+---------+---------+
|max(COL1)|max(COL2)|max(COL3)|
+---------+---------+---------+
|3 |8 |6 |
+---------+---------+---------+
scala> df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).show(false)
+----+---+
|_1 |_2 |
+----+---+
|COL1|3 |
|COL2|8 |
|COL3|6 |
+----+---+
scala>
To get the results into Scala collections, say Map()
scala> val result = df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).as[(String,Int)].collect.toMap
result: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)
scala> result
res47: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)
scala>
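The vertical report can also be built from a single aggregation, with the reshaping done on the driver. The following is a minimal sketch, assuming Spark SQL on the classpath and a local SparkSession; maxRow, pairs and report are illustrative names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length, max}

// Assumption: a local session, used only for illustration.
val spark = SparkSession.builder().master("local[1]").appName("max-length-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("abc", "abcd", "abcdef"),
  ("a", "BCBDFG", "qddfde"),
  ("MN", "1234B678", "sd")
).toDF("COL1", "COL2", "COL3")

// One aggregation job computes the maximum length of every column.
val maxRow = df.select(df.columns.map(c => max(length(col(c))).as(c)): _*).first()

// Turn the single result row into (name, length) pairs on the driver.
// Note: getInt would fail for a column that contains only nulls.
val pairs = df.columns.zipWithIndex.map { case (c, i) => (c, maxRow.getInt(i)) }.toSeq

val report = pairs.toDF("COLUMN_NAME", "MAX_LENGTH")
report.show()
```

Compared with running one agg per column, this scans the data once, so caching matters less.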

conditional operator with groupby in spark rdd level - scala

I am using Spark 1.6.0 and Scala 2.10.5
I have a dataframe like this,
+------------------+
|id | needed |
+------------------+
|1 | 2 |
|1 | 0 |
|1 | 3 |
|2 | 0 |
|2 | 0 |
|3 | 1 |
|3 | 2 |
+------------------+
From this df I created an rdd like this,
val dfRDD = df.rdd
From my RDD, I want to group by id and count the rows where needed is > 0:
((1, 2), (2,0), (3,2))
So, I tried like this,
val groupedDF = dfRDD.map(x =>(x(0), x(1) > 0)).count.reduceByKey(_+_)
In this case, I am getting an error:
error: value > is not a member of Any
I need that at the RDD level. Any help to get my desired output would be great.
The problem is that in your map you're calling the apply method of Row, which, as you can see in its Scaladoc, returns Any. As the error says, there is no method > on Any.
You can fix it using the getAs[T] method.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
val spark =
SparkSession
.builder
.master("local[*]")
.getOrCreate()
import spark.implicits._
val df =
List(
(1, 2),
(1, 0),
(1, 3),
(2, 0),
(2, 0),
(3, 1),
(3, 2)
).toDF("id", "needed")
val rdd: RDD[(Int, Int)] = df.rdd.map(row => (row.getAs[Int](fieldName = "id"), row.getAs[Int](fieldName = "needed")))
From there you can continue with the aggregation, but there are a few mistakes in your logic.
First, you don't need the count call.
Second, if you need to count how many times "needed" was greater than zero, you can't reduce with _ + _ on the raw values, because that gives the sum of the needed values. Start from a count of 0 per key and add 1 for each positive value instead.
val grouped: RDD[(Int, Int)] = rdd.aggregateByKey(0)((acc, v) => if (v > 0) acc + 1 else acc, _ + _)
val result: Array[(Int, Int)] = grouped.collect()
// Array((1,2), (2,0), (3,2)) (order may vary)
PS: You should tell your professor to upgrade to Spark 2 and Scala 2.11 ;)
Edit
Using case classes in the above example.
final case class Data(id: Int, needed: Int)
val rdd: RDD[Data] = df.as[Data].rdd
val grouped: RDD[(Int, Int)] = rdd.map(d => d.id -> (if (d.needed > 0) 1 else 0)).reduceByKey(_ + _)
val result: Array[(Int, Int)] = grouped.collect()
// Array((1,2), (2,0), (3,2)) (order may vary)
There's no need to do the calculation at the rdd level. Aggregation with the data frame should work:
df.groupBy("id").agg(sum(($"needed" > 0).cast("int")).as("positiveCount")).show
+---+-------------+
| id|positiveCount|
+---+-------------+
| 1| 2|
| 3| 2|
| 2| 0|
+---+-------------+
If you have to work with an RDD, use row.getInt or, as in @Luis' answer, row.getAs[Int] to get the value with an explicit type, and then do the comparison and reduceByKey:
df.rdd.map(r => (r.getInt(0), if (r.getInt(1) > 0) 1 else 0)).reduceByKey(_ + _).collect
// res18: Array[(Int, Int)] = Array((1,2), (2,0), (3,2))
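Why the values should be turned into 0/1 flags before reducing can be seen on plain Scala collections, without Spark. This is only a sketch using the sample (id, needed) pairs from the question; naive and counted are illustrative names.

```scala
// Sample (id, needed) pairs from the question.
val rows = List((1, 2), (1, 0), (1, 3), (2, 0), (2, 0), (3, 1), (3, 2))

// Reducing the raw values treats the first value of each group as if it
// were already a count, so id 1 ends up with 3 instead of 2.
val naive: Map[Int, Int] = rows.groupBy(_._1).map { case (id, group) =>
  id -> group.map(_._2).reduce((acc, v) => if (v > 0) acc + 1 else acc)
}

// Mapping each value to 0 or 1 first makes a plain sum the right reduction.
val counted: Map[Int, Int] = rows
  .map { case (id, needed) => (id, if (needed > 0) 1 else 0) }
  .groupBy(_._1)
  .map { case (id, group) => id -> group.map(_._2).sum }
```

The same reasoning applies to reduceByKey, whose function combines two values of the same type and has no separate starting count.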

How to use Except function with spark Dataframe

I would like to get the differences between two DataFrames, but returning only the fields that differ in each row. For example, I have two DataFrames as follows:
val DF1 = Seq(
(3,"Chennai", "rahman",9846, 45000,"SanRamon"),
(1,"Hyderabad","ram",9847, 50000,"SF")
).toDF("emp_id","emp_city","emp_name","emp_phone","emp_sal","emp_site")
val DF2 = Seq(
(3,"Chennai", "rahman",9846, 45000,"SanRamon"),
(1,"Sydney","ram",9847, 48000,"SF")
).toDF("emp_id","emp_city","emp_name","emp_phone","emp_sal","emp_site")
The only difference between these two DataFrames is emp_city and emp_sal in the second row.
Now, I am using the except function, which gives me the entire row as follows:
DF1.except(DF2)
+------+---------+--------+---------+-------+--------+
|emp_id| emp_city|emp_name|emp_phone|emp_sal|emp_site|
+------+---------+--------+---------+-------+--------+
| 1|Hyderabad| ram| 9847| 50000| SF|
+------+---------+--------+---------+-------+--------+
However, I need the output to be like this:
+------+---------+-------+
|emp_id| emp_city|emp_sal|
+------+---------+-------+
| 1|Hyderabad| 50000|
+------+---------+-------+
This shows the cells that differ, as well as emp_id.
Edit:
If a column has changed, it should appear; if there is no change, it should be hidden or null.
The following should give you the result you are looking for.
DF1.except(DF2).select("emp_id","emp_city","emp_sal")
You should consider the comment from @user238607, as we cannot predict which columns are going to differ.
Still, you can try this workaround.
I'm assuming emp_id is unique.
scala> val diff = udf((col: String, c1: String, c2: String) => if (c1 == c2) "" else col )
scala> DF1.join(DF2, DF1("emp_id") === DF2("emp_id"))
res15: org.apache.spark.sql.DataFrame = [emp_id: int, emp_city: string ... 10 more fields]
scala> res15.withColumn("diffcolumn", split(concat_ws(",",DF1.columns.map(x => diff(lit(x), DF1(x), DF2(x))):_*),","))
res16: org.apache.spark.sql.DataFrame = [emp_id: int, emp_city: string ... 11 more fields]
scala> res16.show(false)
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
|emp_id|emp_city |emp_name|emp_phone|emp_sal|emp_site|emp_id|emp_city|emp_name|emp_phone|emp_sal|emp_site|diffcolumn |
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
|3 |Chennai |rahman |9846 |45000 |SanRamon|3 |Chennai |rahman |9846 |45000 |SanRamon|[, , , , , ] |
|1 |Hyderabad|ram |9847 |50000 |SF |1 |Sydney |ram |9847 |48000 |SF |[, emp_city, , , emp_sal, ]|
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
scala> val diff_cols = res16.select(explode($"diffcolumn")).filter("col != ''").distinct.collect.map(a=>col(a(0).toString))
scala> val exceptOpr = DF1.except(DF2)
scala> exceptOpr.select(diff_cols:_*).show
+-------+---------+
|emp_sal| emp_city|
+-------+---------+
| 50000|Hyderabad|
+-------+---------+
I found this solution, which seems to work fine:
val cols = DF1.columns.filter(_ != "emp_id").toList
val DF3 = DF1.except(DF2)
def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null ).otherwise(array($"l.$name", $"r.$name")).as(name)
val result = DF2.as("l").join(DF3.as("r"), "emp_id").select($"emp_id" :: cols.map(mapDiffs): _*)
It generates the following output:
+------+-------------------+--------+---------+--------------+--------+
|emp_id| emp_city|emp_name|emp_phone| emp_sal|emp_site|
+------+-------------------+--------+---------+--------------+--------+
| 1|[Sydney, Hyderabad]| null| null|[48000, 50000]| null|
+------+-------------------+--------+---------+--------------+--------+
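The set of differing columns can also be found with one aggregation instead of a UDF. The following is a minimal sketch, assuming Spark SQL on the classpath, a local SparkSession, a unique emp_id present in both DataFrames, and no nulls in the compared columns; valueCols, flags, changed and result are illustrative names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max, when}

// Assumption: a local session, used only for illustration.
val spark = SparkSession.builder().master("local[1]").appName("diff-columns-sketch").getOrCreate()
import spark.implicits._

val names = Seq("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")
val DF1 = Seq(
  (3, "Chennai", "rahman", 9846, 45000, "SanRamon"),
  (1, "Hyderabad", "ram", 9847, 50000, "SF")
).toDF(names: _*)
val DF2 = Seq(
  (3, "Chennai", "rahman", 9846, 45000, "SanRamon"),
  (1, "Sydney", "ram", 9847, 48000, "SF")
).toDF(names: _*)

val valueCols = DF1.columns.filter(_ != "emp_id")
val joined = DF1.as("l").join(DF2.as("r"), "emp_id")

// For each column: 1 if any matched row differs, otherwise 0.
// Null values are not treated as differences by =!= in this sketch.
val flags = joined.select(valueCols.map { c =>
  max(when(col(s"l.$c") =!= col(s"r.$c"), 1).otherwise(0)).as(c)
}: _*).first()

// Names of the columns that differ in at least one row.
val changed = valueCols.zipWithIndex.collect { case (c, i) if flags.getInt(i) == 1 => c }

// Rows of DF1 missing from DF2, restricted to emp_id plus the changed columns.
val result = DF1.except(DF2).select(("emp_id" +: changed).map(col): _*)
result.show()
```

This selects the same columns for every row, so it matches the expected output in the question but not the per-row null masking asked for in the edit.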

How to subtract one Scala Spark DataFrame from another (Normalise to the mean)

I have two Spark DataFrames:
df1 with 80 columns
CO01...CO80
+----+----+
|CO01|CO02|
+----+----+
|2.06|0.56|
|1.96|0.72|
|1.70|0.87|
|1.90|0.64|
+----+----+
and df2 with 80 columns
avg(CO01)...avg(CO80)
which is mean of each column
+------------------+------------------+
| avg(CO01)| avg(CO02)|
+------------------+------------------+
|2.6185106382978716|1.0080985915492937|
+------------------+------------------+
How can I subtract df2 from df1 for corresponding values?
I'm looking for a solution that does not require listing all the columns.
P.S
In pandas it could be simply done by:
df2=df1-df1.mean()
Here is what you can do
scala> val df = spark.sparkContext.parallelize(List(
| (2.06,0.56),
| (1.96,0.72),
| (1.70,0.87),
| (1.90,0.64))).toDF("c1","c2")
df: org.apache.spark.sql.DataFrame = [c1: double, c2: double]
scala>
scala> def subMean(mean: Double) = udf[Double, Double]((value: Double) => value - mean)
subMean: (mean: Double)org.apache.spark.sql.expressions.UserDefinedFunction
scala>
scala> val result = df.columns.foldLeft(df)( (df, col) =>
| { val avg = df.select(mean(col)).first().getAs[Double](0);
| df.withColumn(col, subMean(avg)(df(col)))
| })
result: org.apache.spark.sql.DataFrame = [c1: double, c2: double]
scala>
scala> result.show(10, false)
+---------------------+---------------------+
|c1 |c2 |
+---------------------+---------------------+
|0.15500000000000025 |-0.13749999999999996 |
|0.05500000000000016 |0.022499999999999964 |
|-0.20499999999999985 |0.1725 |
|-0.004999999999999893|-0.057499999999999996|
+---------------------+---------------------+
Hope this helps!
Please note that this will work for any number of columns, as long as all columns in the DataFrame are of a numeric type.
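The same centering can be written with plain column expressions, computing all means in one job. This is a minimal sketch, assuming Spark SQL on the classpath, a local SparkSession and Double columns only; means, centered and c1Values are illustrative names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

// Assumption: a local session, used only for illustration.
val spark = SparkSession.builder().master("local[1]").appName("mean-centering-sketch").getOrCreate()
import spark.implicits._

val df = Seq((2.06, 0.56), (1.96, 0.72), (1.70, 0.87), (1.90, 0.64)).toDF("c1", "c2")

// One aggregation job computes the mean of every column.
val means = df.select(df.columns.map(c => avg(col(c)).as(c)): _*).first()

// Subtract each column's mean with a column expression (no UDF needed).
val centered = df.select(df.columns.zipWithIndex.map { case (c, i) =>
  (col(c) - means.getDouble(i)).as(c)
}: _*)

// Centered values of the first column, collected for inspection.
val c1Values = centered.collect().map(_.getDouble(0))
```

Because the column list comes from df.columns, nothing has to be listed by hand for 80 columns.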