How to "negative select" columns in spark's dataframe - scala

I can't figure this out, but I'm guessing it's simple. I have a Spark DataFrame df with columns "A", "B" and "C". Now let's say I have an Array containing the names of this df's columns:
val column_names = Array("A","B","C")
I'd like to do a df.select() in such a way that I can specify which columns not to select.
Example: let's say I do not want to select column "B". I tried
df.select(column_names.filter(_!="B"))
but this does not work, as
org.apache.spark.sql.DataFrame
cannot be applied to (Array[String])
From what I've read, it should work with a Seq instead. However, trying
df.select(column_names.filter(_!="B").toSeq)
results in
org.apache.spark.sql.DataFrame
cannot be applied to (Seq[String]).
What am I doing wrong?

Since Spark 1.4 you can use the drop method:
Scala:
case class Point(x: Int, y: Int)
val df = sqlContext.createDataFrame(Point(0, 0) :: Point(1, 2) :: Nil)
df.drop("y")
Python:
df = sc.parallelize([(0, 0), (1, 2)]).toDF(["x", "y"])
df.drop("y")
## DataFrame[x: bigint]
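As a side note, later releases (Spark 2.0+, if I recall correctly) also accept several column names at once in drop, so a whole exclusion list can be applied in a single call. A minimal Scala sketch:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((0, 0, 1), (1, 2, 3)).toDF("x", "y", "z")
val unwanted = Seq("y", "z")

// drop(colNames: String*) is a no-op for names the schema does not contain
val slimmed = df.drop(unwanted: _*)
// slimmed: org.apache.spark.sql.DataFrame = [x: int]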

I had the same problem and solved it this way (oaffdf is a dataframe):
val dropColNames = Seq("col7","col121")
val featColNames = oaffdf.columns.diff(dropColNames)
val featCols = featColNames.map(cn => org.apache.spark.sql.functions.col(cn))
val featsdf = oaffdf.select(featCols: _*)
https://forums.databricks.com/questions/2808/select-dataframe-columns-from-a-sequence-of-string.html

OK, it's ugly, but this quick spark shell session shows something that works:
scala> val myRDD = sc.parallelize(List.range(1,10))
myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:21
scala> val myDF = myRDD.toDF("a")
myDF: org.apache.spark.sql.DataFrame = [a: int]
scala> val myOtherRDD = sc.parallelize(List.range(1,10))
myOtherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:21
scala> val myotherDF = myOtherRDD.toDF("b")
myotherDF: org.apache.spark.sql.DataFrame = [b: int]
scala> myDF.unionAll(myotherDF)
res2: org.apache.spark.sql.DataFrame = [a: int]
scala> myDF.join(myotherDF)
res3: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val twocol = myDF.join(myotherDF)
twocol: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val cols = Array("a", "b")
cols: Array[String] = Array(a, b)
scala> val selectedCols = cols.filter(_!="b")
selectedCols: Array[String] = Array(a)
scala> twocol.select(selectedCols.head, selectedCols.tail: _*)
res4: org.apache.spark.sql.DataFrame = [a: int]
Providing varargs to a function that requires them is covered in other SO questions. The signature of select is there to ensure your list of selected columns is never empty, which makes the conversion from the list of selected columns to varargs a bit more complex.
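Wrapped up as a small reusable helper, the same head/tail trick looks roughly like this (a sketch only; selectExcept is just a name made up for illustration):

import org.apache.spark.sql.DataFrame

// select every column except the ones listed, via select(head, tail: _*)
def selectExcept(df: DataFrame, exclude: Set[String]): DataFrame = {
  val keep = df.columns.filterNot(exclude.contains)
  require(keep.nonEmpty, "at least one column must remain")
  df.select(keep.head, keep.tail: _*)
}

// usage: selectExcept(twocol, Set("b"))   // => [a: int]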

For Spark v1.4 and higher, use drop(*cols) -
Returns a new DataFrame without the specified column(s).
Example -
df.drop('age').collect()
For Spark v2.3 and higher, you can also do it using colRegex(colName) -
Selects a column based on the column name specified as a regex and returns it as a Column.
Example -
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()
Reference - colRegex, drop
For older versions of Spark, take the list of columns in the dataframe, remove the columns you want to drop from it (maybe using set operations), and then use select to pick the resulting list.

import org.apache.spark.sql.functions.col

val columns = Seq("A","B","C")
df.select(columns.diff(Seq("B")).map(col): _*)

In pyspark you can do
df.select(list(set(df.columns) - set(["B"])))
Using more than one line you can also do
cols = df.columns
cols.remove("B")
df.select(cols)

It is possible to do it as follows, using Spark's ability to select columns with regular expressions and a negative look-ahead expression (?!...).
In this case the dataframe has columns a, b, c, and the regex excludes column b from the list.
Note: you need to enable regex column-name lookups with the session setting spark.sql.parser.quotedRegexColumnNames=true, and it requires Spark 2.3+.
select `^(?!b).*`
from (
select 1 as a, 2 as b, 3 as c
)
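For reference, the same idea can be driven from spark-shell rather than a SQL editor. A minimal sketch (Spark 2.3+, and it assumes the config can be set at runtime on your build):

// enable quoted-regex column names for the SQL parser
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")

spark.sql("select 1 as a, 2 as b, 3 as c").createOrReplaceTempView("t")

// negative look-ahead: keep every column whose name does not start with "b"
spark.sql("select `^(?!b).*` from t").show()
// +---+---+
// |  a|  c|
// +---+---+
// |  1|  3|
// +---+---+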

Related

Functional way of joining multiple dataframes

I'm learning Spark in Scala, coming from heavy Python abuse, and I'm getting a java.lang.NullPointerException because I'm doing things the Python way.
I have, say, 3 dataframes of shape 4x2 each; the first column is always an index 0,1,2,3 and the second column is some binary feature. The end goal is to have a 4x4 dataframe with a join of all the individual ones. In Python I would first define some master df and then loop over the intermediate ones, assigning at each loop the resulting joined dataframe to the master dataframe variable name (ugly):
dataframes = [temp1, temp2, temp3]
df = pd.DataFrame(index=[0,1,2,3]) # Master df
for temp in dataframes:
    df = df.join(temp)
In Spark this doesn't play well:
val q = "select * from table"
val df = sql(q) // works, obviously
scala> val df = df.join(sql(q))
<console>:33: error: recursive value df needs type
val df = df.join(sql(q))
Ok so:
scala> val df:org.apache.spark.sql.DataFrame = df.join(sql(q))
java.lang.NullPointerException
... 50 elided
I think it's highly likely that I'm not doing it the functional way. So I tried (ugliest!):
scala> :paste
// Entering paste mode (ctrl-D to finish)
sql(q).
join(sql(q), "device_id").
join(sql(q), "device_id").
join(sql(q), "device_id")
// Exiting paste mode, now interpreting.
res128: org.apache.spark.sql.DataFrame = [device_id: string, devtype: int ... 3 more fields]
This just looks ugly, inelegant and beginner-level. What would be a proper functional Scala way to achieve this?
foldLeft:
val dataframes: Seq[String] = ???
val df: Dataset[Row] = ???
dataframes.foldLeft(df)((acc, q) => acc.join(sql(q)))
And if you're looking for the imperative equivalent of your Python code:
val dataframes: Seq[String] = ???
var df: Dataset[Row] = ???  // IMPORTANT: var, since df is reassigned in the loop
for (q <- dataframes) { df = df.join(sql(q)) }
Even simpler:
val dataframes: Seq[String] = ???
dataframes.map(sql(_)).reduce(_ join _)
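Filling in the ??? placeholders, a self-contained sketch of the foldLeft variant; the device_id key and the temp* frames are made-up stand-ins for the question's data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// stand-ins for the three 4x2 dataframes from the question
val temp1 = Seq((0, 1), (1, 0), (2, 1), (3, 0)).toDF("device_id", "f1")
val temp2 = Seq((0, 0), (1, 1), (2, 0), (3, 1)).toDF("device_id", "f2")
val temp3 = Seq((0, 1), (1, 1), (2, 0), (3, 0)).toDF("device_id", "f3")

// start from the first frame and fold the rest in, joining on the shared key
val joined = Seq(temp2, temp3).foldLeft(temp1)((acc, next) => acc.join(next, "device_id"))
// joined: [device_id: int, f1: int, f2: int, f3: int]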

Spark: Scala - apply function to a list in a DataFrame [duplicate]

This question already has answers here:
How to slice and sum elements of array column?
(6 answers)
Closed 4 years ago.
I am trying to apply a sum function to each cell of a column of a dataframe in Spark. Each cell contains a list of integers which I would like to add up.
However, the error I am getting is:
console:357: error: value sum is not a member of
org.apache.spark.sql.ColumnName
for the example script below.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._  // needed for the $"col_1" syntax outside the shell

val df = spark.createDataFrame(Seq(
  (0, List(1, 2, 3)),
  (1, List(2, 2, 3)),
  (2, List(3, 2, 3)))).toDF("Id", "col_1")
val test = df.withColumn( "col_2", $"col_1".sum )
test.show()
You can define a UDF.
scala> def sumFunc(a: Seq[Int]): Int = a.sum
sumFunc: (a: Seq[Int])Int
scala> val sumUdf = udf(sumFunc(_: Seq[Int]))
sumUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(ArrayType(IntegerType,false))))
scala> val test = df.withColumn( "col_2", sumUdf($"col_1") )
test: org.apache.spark.sql.DataFrame = [Id: int, col_1: array<int> ... 1 more field]
scala> test.collect
res0: Array[org.apache.spark.sql.Row] = Array([0,WrappedArray(1, 2, 3),6], [1,WrappedArray(2, 2, 3),7], [2,WrappedArray(3, 2, 3),8])
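Since this was closed as a duplicate of the array-sum question: on Spark 2.4+ a UDF isn't strictly needed, because the built-in aggregate higher-order function can sum the array directly. A minimal sketch, assuming the same df as above:

import org.apache.spark.sql.functions.expr

// sum each array cell with the built-in aggregate(...) higher-order function (Spark 2.4+)
val test2 = df.withColumn("col_2", expr("aggregate(col_1, 0, (acc, x) -> acc + x)"))
test2.show()  // col_2 should come out as 6, 7, 8, matching the UDF version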

scala map sequence of tuples to vector

I have a spark dataframe like
val df = Seq((1, 2), (2, 1), (1,2)).toDF()
When collected with val local = df.collect, the result is Array([1,2], [2,1], [1,2]).
How can I instead map this to 2 separate vectors, like
val col1: Seq(1,2,1)
val col2: Seq(2,1,2)
To split before collecting - you can use two separate select operations:
val col1: Array[Int] = df.select("_1").as[Int].collect()
val col2: Array[Int] = df.select("_2").as[Int].collect()
But - beware that the calculation to create df would be executed twice unless you persist it (e.g. by calling df.cache() beforehand).
To split after collecting:
val arr = df.as[(Int, Int)].collect()
val (col1, col2) = arr.unzip
p.s. all this assumes you're using Spark 1.6 or newer.

Sum up the values of the DataFrame based on conditions

I have a DataFrame that is created as follows:
df = sc
.textFile("s3n://bucket/key/data.txt")
.map(_.split(","))
.toDF()
This is the content of data.txt:
123,2016-11-09,1
124,2016-11-09,2
123,2016-11-10,1
123,2016-11-11,1
123,2016-11-12,1
124,2016-11-13,1
124,2016-11-14,1
Is it possible to filter df in order to get the sum of 3rd column values for 123 for the last N days starting from now? I am interested in a flexible solution so that N could be defined as a parameter.
For example, if today would be 2016-11-16 and N would be equal to 5, then the sum of 3rd column values for 124 would be equal to 2.
This is my current solution:
df = sc
.textFile("s3n://bucket/key/data.txt")
.map(_.split(","))
.toDF(["key","date","qty"])
val starting_date = LocalDate.now().minusDays(x_last_days)
df.filter(col("key") === "124")
.filter(to_date(df("date")).gt(starting_date))
.agg(sum(col("qty")))
but it does not seem to work properly.
1. The line where I define the column names, ["key","date","qty"], does not compile for Scala 2.10.6 and Spark 1.6.2.
2. It also returns a dataframe, while I need an Int. Should I just do toString.toInt?
Neither of the following will compile:
scala> val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF(["key","date","qty"])
// <console>:1: error: illegal start of simple expression
// val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF(["key","date","qty"])
^
scala> val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF
// <console>:27: error: value toDF is not a member of org.apache.spark.rdd.RDD[Array[String]]
// val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF
^
The first one fails because the bracketed list is not valid Scala syntax; as for the second, it fails because, as the error says, toDF is not a member of RDD[Array[String]], in other words the operation is not supported there.
The latter will compile with Spark 2.x, but the solution below would still apply, or you'll end up with a DataFrame having a single column of type ArrayType.
Now let's solve the issue:
scala> :pa
// Entering paste mode (ctrl-D to finish)
import sqlContext.implicits._ // you don't need to import this in the shell.
val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1"))
.map{ _.split(",") match { case Array(a,b,c) => (a,b,c) }}.toDF("key","date","qty")
// Exiting paste mode, now interpreting.
// df: org.apache.spark.sql.DataFrame = [key: string, date: string, qty: string]
You can apply any filter you want and compute the aggregation needed, e.g :
scala> val df2 = df.filter(col("key") === "124").agg(sum(col("qty")))
// df2: org.apache.spark.sql.DataFrame = [sum(qty): double]
scala> df2.show
// +--------+
// |sum(qty)|
// +--------+
// | 4.0|
// +--------+
PS: The above code has been tested in Spark 1.6.2 and 2.0.0
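On the asker's follow-up point: the aggregation does return a DataFrame, but the single value can be pulled out of the first Row, and the date filter can be layered on top. A rough sketch (not tested against the exact versions above; getLong will fail if no rows match, since the sum is then null):

import java.time.LocalDate
import org.apache.spark.sql.functions._

val xLastDays = 5
val startingDate = LocalDate.now().minusDays(xLastDays)

val total: Long = df
  .filter(col("key") === "124")
  .filter(to_date(col("date")).gt(to_date(lit(startingDate.toString))))
  .agg(sum(col("qty").cast("int")))
  .first()
  .getLong(0)  // the sum of an int column comes back as a bigint / Long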

Spark Joins with None Values

I am trying to perform a join in Spark knowing that one of my keys on the left does not have a corresponding value in the other RDD.
The documentation says it should perform the join with None as an option if no key is found, but I keep getting a type mismatch error.
Any insight here?
Take these two RDDs:
val rdd1 = sc.parallelize(Array(("test","foo"),("test2", "foo2")))
val rdd2 = sc.parallelize(Array(("test","foo3"),("test3", "foo4")))
When you join them, you have a couple of options. What you do depends on what you want. Do you want an RDD only with the common keys?
val leftJoined = rdd1.join(rdd2)
leftJoined.collect
res1: Array[(String, (String, String))] = Array((test,(foo,foo3)))
If you want keys missing from rdd2 to be filled in with None, use leftOuterJoin:
val leftOuter = rdd1.leftOuterJoin(rdd2)
leftOuter.collect
res2: Array[(String, (String, Option[String]))] = Array((test2,(foo2,None)), (test,(foo,Some(foo3))))
If you want keys missing from either side to be filled in with None, use fullOuterJoin:
val fullOuter = rdd1.fullOuterJoin(rdd2)
fullOuter.collect
res3: Array[(String, (Option[String], Option[String]))] = Array((test2,(Some(foo2),None)), (test3,(None,Some(foo4))), (test,(Some(foo),Some(foo3))))
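If the Option wrappers get in the way downstream, they can be collapsed to a default value with getOrElse. A small sketch continuing from leftOuter above (the "n/a" placeholder is just an example):

// replace the missing right-hand side with a placeholder value
val filled = leftOuter.mapValues { case (left, maybeRight) => (left, maybeRight.getOrElse("n/a")) }
filled.collect
// Array((test2,(foo2,n/a)), (test,(foo,foo3)))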