Sum up the values of the DataFrame based on conditions - scala

I have a DataFrame that is created as follows:
val df = sc
  .textFile("s3n://bucket/key/data.txt")
  .map(_.split(","))
  .toDF()
This is the content of data.txt:
123,2016-11-09,1
124,2016-11-09,2
123,2016-11-10,1
123,2016-11-11,1
123,2016-11-12,1
124,2016-11-13,1
124,2016-11-14,1
Is it possible to filter df in order to get the sum of 3rd column values for 123 for the last N days starting from now? I am interested in a flexible solution so that N could be defined as a parameter.
For example, if today were 2016-11-16 and N were 5, the sum of the 3rd column values for 124 would be 2.
This is my current solution:
val df = sc
  .textFile("s3n://bucket/key/data.txt")
  .map(_.split(","))
  .toDF(["key","date","qty"])
val starting_date = LocalDate.now().minusDays(x_last_days)
df.filter(col("key") === "124")
  .filter(to_date(df("date")).gt(starting_date))
  .agg(sum(col("qty")))
but it does not seem to work properly:
1. The line where I define the column names, ["key","date","qty"], does not compile with Scala 2.10.6 and Spark 1.6.2.
2. It returns a DataFrame, while I need an Int. Should I just do toString.toInt?

Neither of the following compiles:
scala> val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF(["key","date","qty"])
// <console>:1: error: illegal start of simple expression
// val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF(["key","date","qty"])
^
scala> val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF
// <console>:27: error: value toDF is not a member of org.apache.spark.rdd.RDD[Array[String]]
// val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF
^
The first fails because ["key","date","qty"] is not valid Scala syntax; toDF takes the column names as plain varargs. The second fails because, as the error says, toDF is not a member of RDD[Array[String]]: in Spark 1.6 there is no implicit conversion to a DataFrame for an RDD of arrays.
The latter compiles with Spark 2.x, but the solution below still applies; otherwise you end up with a DataFrame with a single column of type ArrayType.
Now let's solve the issue:
scala> :pa
// Entering paste mode (ctrl-D to finish)
import sqlContext.implicits._ // you don't need to import this in the shell.
val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1"))
.map{ _.split(",") match { case Array(a,b,c) => (a,b,c) }}.toDF("key","date","qty")
// Exiting paste mode, now interpreting.
// df: org.apache.spark.sql.DataFrame = [key: string, date: string, qty: string]
You can apply any filter you want and compute the aggregation you need, e.g.:
scala> val df2 = df.filter(col("key") === "124").agg(sum(col("qty")))
// df2: org.apache.spark.sql.DataFrame = [sum(qty): double]
scala> df2.show
// +--------+
// |sum(qty)|
// +--------+
// |     4.0|
// +--------+
PS: The above code has been tested in Spark 1.6.2 and 2.0.0
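The answer above leaves two parts of the question open: restricting the rows to the last N days, and getting a plain number back instead of a DataFrame. Here is a minimal sketch of both, assuming Spark 2.x (SparkSession) and Java 8's java.time. The helper sumLastNDays and its parameters are hypothetical names, and today is passed in explicitly so that the example is reproducible. It is written to be pasted into spark-shell or run as a script.

```scala
import java.sql.Date
import java.time.LocalDate
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, sum, to_date}

// Hypothetical helper (not part of the original answer): sums qty for one key
// over the rows dated strictly after (today - n days).
def sumLastNDays(df: DataFrame, key: String, n: Int, today: LocalDate): Long = {
  val startingDate = Date.valueOf(today.minusDays(n.toLong))
  val row = df
    .filter(col("key") === key)
    .filter(to_date(col("date")) > lit(startingDate))
    .agg(sum(col("qty").cast("long"))) // cast so that the sum is a long, not a double
    .first()
  // The sum over zero rows is null, so fall back to 0.
  if (row.isNullAt(0)) 0L else row.getLong(0)
}

val spark = SparkSession.builder.master("local[1]").appName("sum-last-n-days").getOrCreate()
import spark.implicits._

// Same rows as data.txt in the question.
val df = Seq(
  ("123", "2016-11-09", "1"), ("124", "2016-11-09", "2"),
  ("123", "2016-11-10", "1"), ("123", "2016-11-11", "1"),
  ("123", "2016-11-12", "1"), ("124", "2016-11-13", "1"),
  ("124", "2016-11-14", "1")
).toDF("key", "date", "qty")

// The question's example: today = 2016-11-16, N = 5, key = 124.
val total = sumLastNDays(df, "124", 5, LocalDate.of(2016, 11, 16))
println(total)
```

Reading the value through first() and getLong avoids the toString.toInt round trip; call .toInt on the result if an Int is really required.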

Related

Scala Apache Spark and dynamic column list inside of DataFrame select method

I have the following Scala Spark code in order to parse the fixed width txt file:
val schemaDf = df.select(
df("value").substr(0, 6).cast("integer").alias("id"),
df("value").substr(7, 6).alias("date"),
df("value").substr(13, 29).alias("string")
)
I'd like to extract the following code:
df("value").substr(0, 6).cast("integer").alias("id"),
df("value").substr(7, 6).alias("date"),
df("value").substr(13, 29).alias("string")
into a dynamic loop, so that the column parsing can be defined in some external configuration. Something like this (where x will hold the config for each column, but for now is just simple numbers for demo purposes):
val x = List(1, 2, 3)
val df1 = df.select(
x.foreach {
df("value").substr(0, 6).cast("integer").alias("id")
}
)
but right now the line df("value").substr(0, 6).cast("integer").alias("id") doesn't compile and gives the following error:
type mismatch; found : org.apache.spark.sql.Column required: Int ⇒ ?
What am I doing wrong and how to properly return the dynamic Column list inside of df.select method?
select won't take a statement such as foreach as input (foreach returns Unit), but you can build the Columns you want first and then expand the list as the varargs input of select:
import org.apache.spark.sql.Column
val x = List(1, 2, 3)
val cols: List[Column] = x.map { i =>
  df("value").substr(0, 6).cast("integer").alias("id")
}
val df1 = df.select(cols: _*)
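To make the idea of an external column configuration more concrete, here is a small sketch, assuming Spark 2.x. FieldSpec, toColumns and the sample line are hypothetical names and data invented for illustration; the start positions are written 1-based, which is how Spark SQL's substring counts.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical config entry describing one fixed-width field.
case class FieldSpec(name: String, start: Int, length: Int, castTo: Option[String] = None)

// Builds one Column per config entry, applying the cast only when one is configured.
def toColumns(specs: Seq[FieldSpec]): Seq[Column] = specs.map { s =>
  val piece = col("value").substr(s.start, s.length)
  s.castTo.fold(piece)(t => piece.cast(t)).alias(s.name)
}

val spark = SparkSession.builder.master("local[1]").appName("dynamic-select").getOrCreate()
import spark.implicits._

// In practice this list would be loaded from the external configuration.
val specs = Seq(
  FieldSpec("id", 1, 6, Some("integer")),
  FieldSpec("date", 7, 6),
  FieldSpec("string", 13, 29)
)

// One made-up fixed-width line in a single column called value.
val df = Seq("000123161109hello").toDF("value")
val schemaDf = df.select(toColumns(specs): _*)
schemaDf.show()
```

Keeping the specs as plain data means the select call itself never changes when a field is added or moved.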

Functional way of joining multiple dataframes

I'm learning Spark in Scala coming from heavy Python abuse and I'm getting a java.lang.NullPointerException because I'm doing things the python way.
I have say 3 dataframes of shape 4x2 each, first column is always an index 0,1,2,3 and the second column is some binary feature. The end goal is to have a 4x4 dataframe with a join of all of individual ones. In python I would first define some master df and then loop over the intermediate ones, assigning at each loop the resulting joined dataframe to the master dataframe variable name (ugly):
dataframes = [temp1, temp2, temp3]
df = pd.DataFrame(index=[0,1,2,3]) # Master df
for temp in dataframes:
df = df.join(temp)
In Spark this doesn't play well:
val q = "select * from table"
val df = sql(q) // works, obviously
scala> val df = df.join(sql(q))
<console>:33: error: recursive value df needs type
val df = df.join(sql(q))
Ok so:
scala> val df:org.apache.spark.sql.DataFrame = df.join(sql(q))
java.lang.NullPointerException
... 50 elided
I think it's highly likely that I'm not doing it the functional way. So I tried (ugliest!):
scala> :paste
// Entering paste mode (ctrl-D to finish)
sql(q).
join(sql(q), "device_id").
join(sql(q), "device_id").
join(sql(q), "device_id")
// Exiting paste mode, now interpreting.
res128: org.apache.spark.sql.DataFrame = [device_id: string, devtype: int ... 3 more fields]
This just looks ugly and inelegant and beginner. What would be a proper functional Scala way to achieve this?
Use foldLeft:
val dataframes: Seq[String] = ??? // the queries to join in
val df: Dataset[Row] = ???        // the master DataFrame
dataframes.foldLeft(df)((acc, q) => acc.join(sql(q)))
And if you're looking for an imperative equivalent of your Python code:
var df: Dataset[Row] = ??? // IMPORTANT: var
for (q <- dataframes) { df = df.join(sql(q)) }
Even simpler, if you don't need a separate master DataFrame:
val dataframes: Seq[String] = ???
dataframes.map(q => sql(q)).reduce(_ join _)
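As background for the errors in the question: in val df: DataFrame = df.join(...), the df on the right-hand side refers to the value being defined, which is still null at that point, hence the NullPointerException. The sketch below shows the fold-style alternative on three small frames like the ones described in the question, assuming Spark 2.x. The column names idx, f1, f2 and f3 and the sample values are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.master("local[1]").appName("fold-join").getOrCreate()
import spark.implicits._

// Three 4x2 frames sharing an index column (made-up binary features).
val temp1 = Seq((0, 1), (1, 0), (2, 1), (3, 0)).toDF("idx", "f1")
val temp2 = Seq((0, 0), (1, 0), (2, 1), (3, 1)).toDF("idx", "f2")
val temp3 = Seq((0, 1), (1, 1), (2, 0), (3, 0)).toDF("idx", "f3")

// Each step joins the accumulated result with the next frame on idx and
// returns a new DataFrame, so no variable is ever reassigned.
val joined: DataFrame =
  Seq(temp1, temp2, temp3).reduce((acc, next) => acc.join(next, "idx"))

joined.orderBy("idx").show()
```

Joining on the column name keeps a single idx column in the result, so three 4x2 frames produce one 4x4 frame.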

Apply Function to DataFrame Rows and Convert Back to DataFrame Spark / Scala [duplicate]

How can I convert an RDD (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]) to a DataFrame (org.apache.spark.sql.DataFrame)? I converted a DataFrame to an RDD using .rdd. After processing it, I want it back as a DataFrame. How can I do this?
This code works with Spark 2.x and Scala 2.11.
Import the necessary classes:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
Create a SparkSession object, here named spark:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
val sc = spark.sparkContext // Just used to create test RDDs
Let's create an RDD to turn into a DataFrame:
val rdd = sc.parallelize(
Seq(
("first", Array(2.0, 1.0, 2.1, 5.4)),
("test", Array(1.5, 0.5, 0.9, 3.7)),
("choose", Array(8.0, 2.9, 9.1, 2.5))
)
)
##Method 1
Using SparkSession.createDataFrame(RDD obj).
val dfWithoutSchema = spark.createDataFrame(rdd)
dfWithoutSchema.show()
+------+--------------------+
|    _1|                  _2|
+------+--------------------+
| first|[2.0, 1.0, 2.1, 5.4]|
|  test|[1.5, 0.5, 0.9, 3.7]|
|choose|[8.0, 2.9, 9.1, 2.5]|
+------+--------------------+
##Method 2
Using SparkSession.createDataFrame(RDD obj) and specifying column names.
val dfWithSchema = spark.createDataFrame(rdd).toDF("id", "vals")
dfWithSchema.show()
+------+--------------------+
|    id|                vals|
+------+--------------------+
| first|[2.0, 1.0, 2.1, 5.4]|
|  test|[1.5, 0.5, 0.9, 3.7]|
|choose|[8.0, 2.9, 9.1, 2.5]|
+------+--------------------+
##Method 3 (Actual answer to the question)
This approach requires the input RDD to be of type RDD[Row].
val rowsRdd: RDD[Row] = sc.parallelize(
Seq(
Row("first", 2.0, 7.0),
Row("second", 3.5, 2.5),
Row("third", 7.0, 5.9)
)
)
Create the schema:
val schema = new StructType()
.add(StructField("id", StringType, true))
.add(StructField("val1", DoubleType, true))
.add(StructField("val2", DoubleType, true))
Now apply both rowsRdd and schema to createDataFrame()
val df = spark.createDataFrame(rowsRdd, schema)
df.show()
+------+----+----+
|    id|val1|val2|
+------+----+----+
| first| 2.0| 7.0|
|second| 3.5| 2.5|
| third| 7.0| 5.9|
+------+----+----+
SparkSession has a number of createDataFrame methods that create a DataFrame given an RDD. I imagine one of these will work for your context.
For example:
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
Creates a DataFrame from an RDD containing Rows using the given
schema.
Assuming your RDD is called rdd and holds tuples or case classes (toDF() is not available on an RDD[Row]), you can use:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
rdd.toDF()
Note: This answer was originally posted here
I am posting this answer because I would like to share additional details about the available options that I did not find in the other answers
To create a DataFrame from an RDD of Rows, there are two main options:
1) As already pointed out, you could use toDF() which can be imported by import sqlContext.implicits._. However, this approach only works for the following types of RDDs:
RDD[Int]
RDD[Long]
RDD[String]
RDD[T <: scala.Product]
(source: Scaladoc of the SQLContext.implicits object)
The last signature actually means that it can work for an RDD of tuples or an RDD of case classes (because tuples and case classes are subclasses of scala.Product).
So, to use this approach for an RDD[Row], you have to map it to an RDD[T <: scala.Product]. This can be done by mapping each row to a custom case class or to a tuple, as in the following code snippets:
val df = rdd.map({
case Row(val1: String, ..., valN: Long) => (val1, ..., valN)
}).toDF("col1_name", ..., "colN_name")
or
case class MyClass(val1: String, ..., valN: Long = 0L)
val df = rdd.map({
case Row(val1: String, ..., valN: Long) => MyClass(val1, ..., valN)
}).toDF("col1_name", ..., "colN_name")
The main drawback of this approach (in my opinion) is that you have to explicitly set the schema of the resulting DataFrame in the map function, column by column. Maybe this can be done programmatically if you don't know the schema in advance, but things can get a little messy there. So, alternatively, there is another option:
2) You can use createDataFrame(rowRDD: RDD[Row], schema: StructType) as in the accepted answer, which is available in the SQLContext object. Example for converting an RDD of an old DataFrame:
val rdd = oldDF.rdd
val newDF = oldDF.sqlContext.createDataFrame(rdd, oldDF.schema)
Note that there is no need to explicitly set any schema column. We reuse the old DF's schema, which is of StructType class and can be easily extended. However, this approach sometimes is not possible, and in some cases can be less efficient than the first one.
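To illustrate what extending the old schema can look like when the processing step changes the shape of the rows, here is a minimal sketch, assuming Spark 2.x. The frame oldDF, its columns name and n, and the appended column doubled are all made up for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField}

val spark = SparkSession.builder.master("local[1]").appName("rdd-round-trip").getOrCreate()
import spark.implicits._

// A made-up starting DataFrame.
val oldDF = Seq(("a", 1L), ("b", 2L)).toDF("name", "n")

// Process the rows as an RDD[Row]; here every row gets one extra value appended.
val processed: RDD[Row] =
  oldDF.rdd.map(r => Row.fromSeq(r.toSeq :+ (r.getLong(1) * 2)))

// Reuse the old schema and add a field that describes the appended value.
val newSchema = oldDF.schema.add(StructField("doubled", LongType, nullable = false))

val newDF = spark.createDataFrame(processed, newSchema)
newDF.show()
```

The schema passed to createDataFrame has to match the rows field by field, so any field added or removed in the map step needs the corresponding change in the StructType.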
Suppose you have a DataFrame and want to modify its field data by converting it to an RDD[Row]:
val aRdd = aDF.rdd.map(x=>Row(x.getAs[Long]("id"),x.getAs[Seq[String]]("role").head))
To convert the RDD back to a DataFrame, we need to define its structure type.
A Long field becomes LongType in the schema.
A String field becomes StringType in the schema.
val aStruct = new StructType(Array(StructField("id",LongType,nullable = true),StructField("role",StringType,nullable = true)))
Now you can convert the RDD to a DataFrame using the createDataFrame method:
val aNamedDF = sqlContext.createDataFrame(aRdd,aStruct)
Method 1: (Scala)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df_2 = sc.parallelize(Seq((1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c"))).toDF("x", "y", "z")
Method 2: (Scala)
case class temp(val1: String,val3 : Double)
val rdd = sc.parallelize(Seq(
Row("foo", 0.5), Row("bar", 0.0)
))
val rows = rdd.map({case Row(val1:String,val3:Double) => temp(val1,val3)}).toDF()
rows.show()
Method 1: (Python)
from pyspark.sql import Row
l = [('Alice',2)]
Person = Row('name','age')
rdd = sc.parallelize(l)
person = rdd.map(lambda r:Person(*r))
df2 = sqlContext.createDataFrame(person)
df2.show()
Method 2: (Python)
from pyspark.sql.types import *
l = [('Alice',2)]
rdd = sc.parallelize(l)
schema = StructType([StructField ("name" , StringType(), True) ,
StructField("age" , IntegerType(), True)])
df3 = sqlContext.createDataFrame(rdd, schema)
df3.show()
Extracted the value from the row object and then applied the case class to convert rdd to DF
val temp1 = attrib1.map{case Row ( key: Int ) => s"$key" }
val temp2 = attrib2.map{case Row ( key: Int) => s"$key" }
case class RLT (id: String, attrib_1 : String, attrib_2 : String)
import hiveContext.implicits._
val df = result.map{ s => RLT(s(0),s(1),s(2)) }.toDF
Here is a simple example of converting your List into Spark RDD and then converting that Spark RDD into Dataframe.
Please note that I used the Spark shell's Scala REPL to execute the following code. Here sc is an instance of SparkContext, which is implicitly available in the Spark shell. I hope it answers your question.
scala> val numList = List(1,2,3,4,5)
numList: List[Int] = List(1, 2, 3, 4, 5)
scala> val numRDD = sc.parallelize(numList)
numRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[80] at parallelize at <console>:28
scala> val numDF = numRDD.toDF
numDF: org.apache.spark.sql.DataFrame = [_1: int]
scala> numDF.show
+---+
| _1|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
+---+
On newer versions of spark (2.0+)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val spark = SparkSession
.builder()
.getOrCreate()
import spark.implicits._
val dfSchema = Seq("col1", "col2", "col3")
rdd.toDF(dfSchema: _*)
One needs to create a schema and attach it to the RDD.
Assuming val spark is a product of a SparkSession.builder...
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
/* Let's gin up some sample data:
* As RDDs and DataFrames can have columns of differing types, let's make our
* sample data a three wide, two tall, rectangle of mixed types.
* A column of Strings, a column of Longs, and a column of Doubles
*/
val arrayOfArrayOfAnys = Array.ofDim[Any](2,3)
arrayOfArrayOfAnys(0)(0)="aString"
arrayOfArrayOfAnys(0)(1)=0L
arrayOfArrayOfAnys(0)(2)=3.14159
arrayOfArrayOfAnys(1)(0)="bString"
arrayOfArrayOfAnys(1)(1)=9876543210L
arrayOfArrayOfAnys(1)(2)=2.71828
/* The way to convert anything which looks rectangular,
* (Array[Array[String]] or Array[Array[Any]] or Array[Row], ... ) into an RDD is to
* throw it into sparkContext.parallelize.
* http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext shows
* the parallelize definition as
* def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)
* so in our case our ArrayOfArrayOfAnys is treated as a sequence of ArraysOfAnys.
* Will leave the numSlices as the defaultParallelism, as I have no particular cause to change it.
*/
val rddOfArrayOfArrayOfAnys=spark.sparkContext.parallelize(arrayOfArrayOfAnys)
/* We'll be using sqlContext.createDataFrame to add a schema to our RDD.
* The RDD which goes into createDataFrame is an RDD[Row] which is not what we happen to have.
* To convert anything one tall and several wide into a Row, one can use Row.fromSeq(thatThing.toSeq)
* As we have an RDD[somethingWeDontWant], we can map each of the RDD rows into the desired Row type.
*/
val rddOfRows=rddOfArrayOfArrayOfAnys.map(f=>
Row.fromSeq(f.toSeq)
)
/* Now to construct our schema. This needs to be a StructType of 1 StructField per column in our dataframe.
* https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructField shows the definition as
* case class StructField(name: String, dataType: DataType, nullable: Boolean = true, metadata: Metadata = Metadata.empty)
* Will leave the two default values in place for each of the columns:
* nullability as true,
* metadata as an empty Map[String,Any]
*
*/
val schema = StructType(
StructField("colOfStrings", StringType) ::
StructField("colOfLongs" , LongType ) ::
StructField("colOfDoubles", DoubleType) ::
Nil
)
val df=spark.sqlContext.createDataFrame(rddOfRows,schema)
/*
* +------------+----------+------------+
* |colOfStrings|colOfLongs|colOfDoubles|
* +------------+----------+------------+
* |     aString|         0|     3.14159|
* |     bString|9876543210|     2.71828|
* +------------+----------+------------+
*/
df.show
Same steps, but with fewer val declarations:
val arrayOfArrayOfAnys=Array(
Array("aString",0L ,3.14159),
Array("bString",9876543210L,2.71828)
)
val rddOfRows=spark.sparkContext.parallelize(arrayOfArrayOfAnys).map(f=>Row.fromSeq(f.toSeq))
/* If one knows the datatypes, for instance from JDBC queries as to RDBC column metadata:
* Consider constructing the schema from an Array[StructField]. This would allow looping over
* the columns, with a match statement applying the appropriate sql datatypes as the second
* StructField arguments.
*/
val sf=new Array[StructField](3)
sf(0)=StructField("colOfStrings",StringType)
sf(1)=StructField("colOfLongs" ,LongType )
sf(2)=StructField("colOfDoubles",DoubleType)
val df=spark.sqlContext.createDataFrame(rddOfRows,StructType(sf.toList))
df.show
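The comment above suggests looping over column metadata and using a match to pick the Spark SQL type for each StructField. A minimal sketch of that idea follows; columnMeta, toSparkType and the type names VARCHAR, BIGINT and DOUBLE are hypothetical stand-ins for whatever the metadata source actually reports.

```scala
import org.apache.spark.sql.types._

// Hypothetical column metadata (name -> source type name), e.g. as it might
// be collected from a JDBC metadata query.
val columnMeta: Seq[(String, String)] = Seq(
  "colOfStrings" -> "VARCHAR",
  "colOfLongs"   -> "BIGINT",
  "colOfDoubles" -> "DOUBLE"
)

// Maps a source type name to a Spark SQL DataType; unknown names fail loudly.
def toSparkType(typeName: String): DataType = typeName match {
  case "VARCHAR" => StringType
  case "BIGINT"  => LongType
  case "DOUBLE"  => DoubleType
  case other     => throw new IllegalArgumentException(s"Unsupported type: $other")
}

// One StructField per column, with nullable and metadata left at their defaults.
val schema = StructType(columnMeta.map { case (name, tpe) => StructField(name, toSparkType(tpe)) })
println(schema.treeString)
```

The resulting schema can be passed to createDataFrame together with an RDD[Row] in the same way as the hand-written ones above.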
I'll explain the solution using the word count problem.
1. Read the file using sc
2. Produce the word count
3. Methods to create a DataFrame:
   rdd.toDF method
   rdd.toDF("word","count")
   spark.createDataFrame(rdd,schema)
Read file using spark
val rdd=sc.textFile("D://cca175/data/")
Rdd to Dataframe
val df=sc.textFile("D://cca175/data/").toDF("t1")
df.show
Method 1
Create word count RDD to Dataframe
val df=rdd.flatMap(x=>x.split(" ")).map(x=>(x,1)).reduceByKey((x,y)=>(x+y)).toDF("word","count")
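Method 2
Create a DataFrame from the word count RDD
val wordRdd=rdd.flatMap(x=>x.split(" ")).map(x=>(x,1)).reduceByKey((x,y)=>(x+y))
val df=spark.createDataFrame(wordRdd)
// with column names
val df=spark.createDataFrame(wordRdd).toDF("word","count")
df.show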
Method3
Define Schema
import org.apache.spark.sql.types._
val schema=new StructType().
add(StructField("word",StringType,true)).
add(StructField("count",IntegerType,true))
Create RowRDD
import org.apache.spark.sql.Row
val rowRdd=wordRdd.map(x=>(Row(x._1,x._2)))
Create DataFrame from RDD with schema
val df=spark.createDataFrame(rowRdd,schema)
df.show
I had the same problem and finally solved it. It's quite simple:
1. Add import sc.implicits._ to your code, where sc is your SQLContext. With this import you get the rdd.toDF() method.
2. Transform your rdd[RawData] to rdd[YourCaseClass]. For example, if you have an RDD of type rdd[(String, Integer, Long)], you can create a case class YourCaseClass(name: String, age: Integer, timestamp: Long) and map the raw RDD to it, which gives you rdd[YourCaseClass].
3. Save rdd[YourCaseClass] to a Hive table: yourRdd.toDF().write.format("parquet").mode(SaveMode.Overwrite).insertInto(yourHiveTableName). By using a case class to represent the RDD's element type, we avoid naming each column or building a StructType schema.
To convert an Array[Row] to a DataFrame or Dataset, the following works elegantly.
Say schema is the StructType for the rows; then:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
val rows: Array[Row]=...
implicit val encoder = RowEncoder.apply(schema)
import spark.implicits._
rows.toSeq.toDS
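An alternative that stays on the public API is to parallelize the array and pass the schema to createDataFrame, as in the earlier answers. This is a sketch under the assumption of Spark 2.x; the schema and the two rows are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val spark = SparkSession.builder.master("local[1]").appName("array-of-rows").getOrCreate()

// Made-up schema and rows.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("score", DoubleType)
))
val rows: Array[Row] = Array(Row("first", 2.0), Row("second", 3.5))

// Turn the local array into an RDD[Row], then attach the schema.
val df: DataFrame = spark.createDataFrame(spark.sparkContext.parallelize(rows.toSeq), schema)
df.show()
```

This avoids importing anything from the catalyst package, at the cost of one extra parallelize step.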

How to "negative select" columns in spark's dataframe

I can't figure it out, but guess it's simple. I have a spark dataframe df. This df has columns "A","B" and "C". Now let's say I have an Array containing the name of the columns of this df:
val column_names = Array("A","B","C")
I'd like to do a df.select() in such a way, that I can specify which columns not to select.
Example: let's say I do not want to select columns "B". I tried
df.select(column_names.filter(_!="B"))
but this does not work, as
org.apache.spark.sql.DataFrame
cannot be applied to (Array[String])
So, here it says it should work with a Seq instead. However, trying
df.select(column_names.filter(_!="B").toSeq)
results in
org.apache.spark.sql.DataFrame
cannot be applied to (Seq[String]).
What am I doing wrong?
Since Spark 1.4 you can use the drop method:
Scala:
case class Point(x: Int, y: Int)
val df = sqlContext.createDataFrame(Point(0, 0) :: Point(1, 2) :: Nil)
df.drop("y")
Python:
df = sc.parallelize([(0, 0), (1, 2)]).toDF(["x", "y"])
df.drop("y")
## DataFrame[x: bigint]
I had the same problem and solved it this way (oaffdf is a dataframe):
val dropColNames = Seq("col7","col121")
val featColNames = oaffdf.columns.diff(dropColNames)
val featCols = featColNames.map(cn => org.apache.spark.sql.functions.col(cn))
val featsdf = oaffdf.select(featCols: _*)
https://forums.databricks.com/questions/2808/select-dataframe-columns-from-a-sequence-of-string.html
OK, it's ugly, but this quick spark shell session shows something that works:
scala> val myRDD = sc.parallelize(List.range(1,10))
myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:21
scala> val myDF = myRDD.toDF("a")
myDF: org.apache.spark.sql.DataFrame = [a: int]
scala> val myOtherRDD = sc.parallelize(List.range(1,10))
myOtherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:21
scala> val myotherDF = myRDD.toDF("b")
myotherDF: org.apache.spark.sql.DataFrame = [b: int]
scala> myDF.unionAll(myotherDF)
res2: org.apache.spark.sql.DataFrame = [a: int]
scala> myDF.join(myotherDF)
res3: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val twocol = myDF.join(myotherDF)
twocol: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val cols = Array("a", "b")
cols: Array[String] = Array(a, b)
scala> val selectedCols = cols.filter(_!="b")
selectedCols: Array[String] = Array(a)
scala> twocol.select(selectedCols.head, selectedCols.tail: _*)
res4: org.apache.spark.sql.DataFrame = [a: int]
Providing varargs to a function that requires them is covered in other SO questions. The signature of select used here, select(col: String, cols: String*), ensures that the list of selected columns is not empty, which makes the conversion from a list of column names to varargs a bit more complex.
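The head/tail split can be avoided by mapping the names to Column objects first, because select also has an overload that takes Column varargs. A minimal sketch, assuming Spark 2.x; the one-row frame and the excluded set are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.master("local[1]").appName("negative-select").getOrCreate()
import spark.implicits._

// A made-up frame with the columns from the question.
val df = Seq((1, 2, 3)).toDF("A", "B", "C")
val excluded = Set("B")

// Option 1: keep every column whose name is not excluded, then select by Column.
val kept = df.columns.filterNot(excluded.contains)
val viaSelect = df.select(kept.map(col): _*)

// Option 2: drop the excluded columns directly.
val viaDrop = df.drop(excluded.toSeq: _*)

viaSelect.show()
viaDrop.show()
```

Filtering df.columns preserves the original column order, which set-based differences do not guarantee.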
For Spark v1.4 and higher, using drop(*cols) -
Returns a new DataFrame without the specified column(s).
Example -
df.drop('age').collect()
For Spark v2.3 and higher you could also do it using colRegex(colName) -
Selects column based on the column name specified as a regex and returns it as Column.
Example-
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()
Reference - colRegex, drop
For older versions of Spark, take the list of columns in the DataFrame, remove the columns you want to drop from it (maybe using set operations), and then use select to pick the resulting list.
import org.apache.spark.sql.functions.col
val columns = Seq("A","B","C")
df.select(columns.diff(Seq("B")).map(col): _*)
In pyspark you can do
df.select(list(set(df.columns) - set(["B"])))
Using more than one line you can also do
cols = df.columns
cols.remove("B")
df.select(cols)
It is possible to do it as follows.
This uses Spark's ability to select columns using regular expressions, together with a negative look-ahead expression (?!...).
In this case the DataFrame has columns a, b and c, and the regex excludes column b from the list.
Note: you need to enable regex column name lookups with the session setting spark.sql.parser.quotedRegexColumnNames=true. This requires Spark 2.3+.
select `^(?!b).*`
from (
select 1 as a, 2 as b, 3 as c
)

How to convert rdd object to dataframe in spark

How can I convert an RDD (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]) to a DataFrame (org.apache.spark.sql.DataFrame)? I converted a DataFrame to an RDD using .rdd. After processing it, I want it back as a DataFrame. How can I do this?
One needs to create a schema and attach it to the RDD.
Assuming val spark is a product of a SparkSession.builder...
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
/* Let's gin up some sample data:
* As RDDs and DataFrames can have columns of differing types, let's make our
* sample data a three wide, two tall, rectangle of mixed types.
* A column of Strings, a column of Longs, and a column of Doubles
*/
val arrayOfArrayOfAnys = Array.ofDim[Any](2,3)
arrayOfArrayOfAnys(0)(0)="aString"
arrayOfArrayOfAnys(0)(1)=0L
arrayOfArrayOfAnys(0)(2)=3.14159
arrayOfArrayOfAnys(1)(0)="bString"
arrayOfArrayOfAnys(1)(1)=9876543210L
arrayOfArrayOfAnys(1)(2)=2.71828
/* The way to convert anything that looks rectangular
* (Array[Array[String]] or Array[Array[Any]] or Array[Row], ...) into an RDD is to
* pass it to sparkContext.parallelize.
* http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext shows
* the parallelize definition as
* def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)
* so in our case arrayOfArrayOfAnys is treated as a sequence of Array[Any].
* I will leave numSlices at defaultParallelism, as I have no particular reason to change it.
*/
val rddOfArrayOfArrayOfAnys=spark.sparkContext.parallelize(arrayOfArrayOfAnys)
/* We'll use sqlContext.createDataFrame to add a schema to our RDD.
* The RDD that goes into createDataFrame must be an RDD[Row], which is not what we have.
* To convert anything one tall and several wide into a Row, one can use Row.fromSeq(thatThing.toSeq).
* As we have an RDD[somethingWeDontWant], we can map each of its elements into the desired Row type.
*/
val rddOfRows=rddOfArrayOfArrayOfAnys.map(f=>
Row.fromSeq(f.toSeq)
)
/* Now to construct our schema. This needs to be a StructType of 1 StructField per column in our dataframe.
* https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructField shows the definition as
* case class StructField(name: String, dataType: DataType, nullable: Boolean = true, metadata: Metadata = Metadata.empty)
* Will leave the two default values in place for each of the columns:
* nullability as true,
metadata as Metadata.empty
*
*/
val schema = StructType(
StructField("colOfStrings", StringType) ::
StructField("colOfLongs" , LongType ) ::
StructField("colOfDoubles", DoubleType) ::
Nil
)
val df=spark.sqlContext.createDataFrame(rddOfRows,schema)
/*
* +------------+----------+------------+
* |colOfStrings|colOfLongs|colOfDoubles|
* +------------+----------+------------+
* | aString| 0| 3.14159|
* | bString|9876543210| 2.71828|
* +------------+----------+------------+
*/
df.show
Same steps, but with fewer val declarations:
val arrayOfArrayOfAnys=Array(
Array("aString",0L ,3.14159),
Array("bString",9876543210L,2.71828)
)
val rddOfRows=spark.sparkContext.parallelize(arrayOfArrayOfAnys).map(f=>Row.fromSeq(f.toSeq))
/* If one knows the data types, for instance from JDBC queries on RDBMS column metadata,
* consider constructing the schema from an Array[StructField]. This allows looping over
* the columns, with a match statement supplying the appropriate SQL data type as the second
* StructField argument.
*/
val sf=new Array[StructField](3)
sf(0)=StructField("colOfStrings",StringType)
sf(1)=StructField("colOfLongs" ,LongType )
sf(2)=StructField("colOfDoubles",DoubleType)
val df=spark.sqlContext.createDataFrame(rddOfRows,StructType(sf.toList))
df.show
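The looping idea mentioned in the comment above could look roughly like this. It is only a sketch: columnMeta, the type-name strings, toDataType, and the fallback to StringType are all hypothetical stand-ins for whatever your metadata source actually returns.

```scala
import org.apache.spark.sql.types._

// Hypothetical column metadata: (column name, type name) pairs,
// e.g. as they might be collected from JDBC result-set metadata.
val columnMeta = Seq(
  ("colOfStrings", "string"),
  ("colOfLongs", "long"),
  ("colOfDoubles", "double")
)

// Map each type name to a Spark SQL DataType.
// Assumption: unknown type names fall back to StringType.
def toDataType(typeName: String): DataType = typeName match {
  case "string" => StringType
  case "long"   => LongType
  case "double" => DoubleType
  case _        => StringType
}

// One StructField per column; nullable and metadata keep their defaults.
val schemaFromMeta = StructType(
  columnMeta.map { case (name, typeName) => StructField(name, toDataType(typeName)) }
)
```

The resulting schemaFromMeta can be passed to createDataFrame in place of a hand-written StructType.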
I will explain the solution using the word count problem.
1. Read the file using sc
2. Produce the word count
3. Create the DataFrame, using one of these methods:
   rdd.toDF
   rdd.toDF("word","count")
   spark.createDataFrame(rdd,schema)
Read the file using sc
val rdd=sc.textFile("D://cca175/data/")
Rdd to Dataframe
val df=sc.textFile("D://cca175/data/").toDF("t1")
df.show
Method 1
Create word count RDD to Dataframe
val df=rdd.flatMap(x=>x.split(" ")).map(x=>(x,1)).reduceByKey((x,y)=>(x+y)).toDF("word","count")
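Method 2
Create a DataFrame from the RDD
// wordRdd is the word-count pair RDD from Method 1, before the toDF call
val wordRdd=rdd.flatMap(x=>x.split(" ")).map(x=>(x,1)).reduceByKey((x,y)=>(x+y))
val df=spark.createDataFrame(wordRdd)
// with column names
val df=spark.createDataFrame(wordRdd).toDF("word","count")
df.show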
Method3
Define Schema
import org.apache.spark.sql.types._
val schema=new StructType().
add(StructField("word",StringType,true)).
add(StructField("count",IntegerType,true)) // the counts are Int values, so the column must be IntegerType
Create RowRDD
import org.apache.spark.sql.Row
val rowRdd=wordRdd.map(x=>(Row(x._1,x._2)))
Create DataFrame from RDD with schema
val df=spark.createDataFrame(rowRdd,schema)
df.show
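For reference, here is Method 3 as one self-contained sketch. It assumes Spark 2.0+ with a local session, and two in-memory lines stand in for the text file; the app name and sample text are illustrative. The count column is declared as IntegerType because the counts are Ints.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("wordcount-sketch")
  .getOrCreate()

// Assumption: two in-memory lines stand in for the text file.
val rdd = spark.sparkContext.parallelize(Seq("a b a", "b c"))

// Word-count pair RDD of type RDD[(String, Int)].
val wordRdd = rdd.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

// The schema must match the runtime types of the Row values.
val schema = new StructType()
  .add(StructField("word", StringType, true))
  .add(StructField("count", IntegerType, true))

// Convert each (word, count) pair into a Row, then attach the schema.
val rowRdd = wordRdd.map { case (word, count) => Row(word, count) }
val df = spark.createDataFrame(rowRdd, schema)
df.orderBy("word").show()
```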
I ran into the same problem and finally solved it. It is quite simple.
1. Add the import sc.implicits._ to your code, where sc is your SQLContext instance. With this import in place, the rdd.toDF() method becomes available.
2. Transform your rdd[RawData] into rdd[YourCaseClass]. For example, if your RDD has the type rdd[(String, Integer, Long)], you can create a case class YourCaseClass(name: String, age: Integer, timestamp: Long) and map the raw RDD to it, which gives you an rdd[YourCaseClass].
3. Save the rdd[YourCaseClass] to a Hive table: yourRdd.toDF().write.format("parquet").mode(SaveMode.Overwrite).insertInto(yourHiveTableName)
Using a case class to represent the RDD's element type avoids having to name each column or build a StructType schema by hand.
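A rough sketch of the case class step, leaving out the Hive write. It assumes Spark 2.0+ and a SparkSession rather than the SQLContext used above; Person, the sample tuples, and the app name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class; its field names become the DataFrame column names.
// In compiled code (outside the REPL) define it at top level, not inside a method.
case class Person(name: String, age: Int, timestamp: Long)

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("case-class-sketch")
  .getOrCreate()
import spark.implicits._

// Raw RDD of tuples, standing in for rdd[RawData].
val raw = spark.sparkContext.parallelize(Seq(
  ("alice", 30, 1478649600000L),
  ("bob", 25, 1478736000000L)
))

// Map each tuple to the case class, then let toDF derive the schema from it.
val people = raw.map { case (name, age, ts) => Person(name, age, ts) }
val df = people.toDF()
df.printSchema()
```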
To convert an Array[Row] to a DataFrame or Dataset, the following works elegantly.
Suppose schema is the StructType for the rows; then:
val rows: Array[Row]=...
implicit val encoder = RowEncoder.apply(schema)
import spark.implicits._
rows.toDS
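If you would rather not deal with encoders, an alternative (not the method above, just a sketch) is to parallelize the array and hand the schema to createDataFrame. The schema, the sample rows, and the app name below are assumptions made for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("array-of-rows-sketch")
  .getOrCreate()

// Hypothetical schema and rows; each Row must match the schema field by field.
val schema = StructType(Seq(
  StructField("key", StringType),
  StructField("qty", IntegerType)
))
val rows: Array[Row] = Array(Row("123", 1), Row("124", 2))

// Turn the local array into an RDD[Row], then attach the schema.
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows.toSeq), schema)
df.show()
```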