Spark: How to union all DataFrames in a loop (Scala)

Is there a way to get a DataFrame that is the union of all the DataFrames built in a loop?
Here is some sample code:
var fruits = List(
  "apple",
  "orange",
  "melon"
)

for (x <- fruits) {
  var df = Seq(("aaa", "bbb", x)).toDF("aCol", "bCol", "name")
}
I would like to obtain something like this:
aCol | bCol | fruitsName
aaa  | bbb  | apple
aaa  | bbb  | orange
aaa  | bbb  | melon
Thanks again

You could create a sequence of DataFrames and then use reduce:
val results = fruits
  .map(fruit => Seq(("aaa", "bbb", fruit)).toDF("aCol", "bCol", "name"))
  .reduce(_.union(_))

results.show()
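For reference, with the three fruits above, results.show() should print something like:
+----+----+------+
|aCol|bCol|  name|
+----+----+------+
| aaa| bbb| apple|
| aaa| bbb|orange|
| aaa| bbb| melon|
+----+----+------+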

Steffen Schmitz's answer is the most concise one I believe.
Below is a more detailed answer if you are looking for more customization (of field types, etc):
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row

// initialize an empty DataFrame with the target schema
val schema = StructType(
  StructField("aCol", StringType, true) ::
  StructField("bCol", StringType, true) ::
  StructField("name", StringType, true) :: Nil)
var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)

// list to iterate through
var fruits = List(
  "apple",
  "orange",
  "melon"
)

for (x <- fruits) {
  // union returns a new Dataset
  initialDF = initialDF.union(Seq(("aaa", "bbb", x)).toDF)
}
//initialDF.show()
References:
How to create an empty DataFrame with a specified schema?
https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/Dataset.html
https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html

If you have multiple different DataFrames, you can use the code below, which is efficient:
val newDFs = Seq(DF1,DF2,DF3)
newDFs.reduce(_ union _)
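For example, a minimal sketch (DF1, DF2, and DF3 here are just toy DataFrames standing in for whatever DataFrames you have, as long as they share a schema):
// Hypothetical stand-ins for your DataFrames; any DataFrames with the same schema work
val DF1 = Seq(("aaa", "bbb", "apple")).toDF("aCol", "bCol", "name")
val DF2 = Seq(("aaa", "bbb", "orange")).toDF("aCol", "bCol", "name")
val DF3 = Seq(("aaa", "bbb", "melon")).toDF("aCol", "bCol", "name")

val newDFs = Seq(DF1, DF2, DF3)
newDFs.reduce(_ union _).show()
Note that reduce throws an exception on an empty sequence, so make sure the list contains at least one DataFrame.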

Using a for comprehension:
val fruits = List("apple", "orange", "melon")
( for(f <- fruits) yield ("aaa", "bbb", f) ).toDF("aCol", "bCol", "name")

You can first create a sequence and then use toDF to create the DataFrame.
scala> var dseq : Seq[(String,String,String)] = Seq[(String,String,String)]()
dseq: Seq[(String, String, String)] = List()
scala> for ( x <- fruits){
| dseq = dseq :+ ("aaa","bbb",x)
| }
scala> dseq
res2: Seq[(String, String, String)] = List((aaa,bbb,apple), (aaa,bbb,orange), (aaa,bbb,melon))
scala> val df = dseq.toDF("aCol","bCol","name")
df: org.apache.spark.sql.DataFrame = [aCol: string, bCol: string, name: string]
scala> df.show
+----+----+------+
|aCol|bCol| name|
+----+----+------+
| aaa| bbb| apple|
| aaa| bbb|orange|
| aaa| bbb| melon|
+----+----+------+

Well... I think your question is a bit misguided.
As per my limited understanding of what you are trying to do, you should do the following:
val fruits = List(
  "apple",
  "orange",
  "melon"
)

val df = fruits
  .map(x => ("aaa", "bbb", x))
  .toDF("aCol", "bCol", "name")
And this should be sufficient.

Related

How to convert an RDD array string to a dataframe

Please help me convert the RDD of IP addresses below into a dataframe.
(Full Disclosure: I have little experience working with RDD)
RDD CREATION:
val SCND_RDD = FIRST_RDD
  .map(kv => kv._2)
  .flatMap(r => r.get("ip"))
  .map(o => o.asInstanceOf[scala.collection.mutable.Map[String, String]])
  .flatMap(ip => ip.get("address"))
SCND_RDD.take(3)
RESULTS:
SCND_RDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[33] at flatMap at <console>:38
res87: Array[String] = Array(5.42.212.99, 51.34.21.60, 63.99.831.7)
My RDD-to-DataFrame conversion attempt:
case class X(callId: String)
val userDF = SCND_RDD.map{case Array(s0)=>X(s0)}.toDF()
This is the error I get
defined class X
<console>:40: error: scrutinee is incompatible with pattern type;
found : Array[T]
required: String
val userDF = NIPR_RDD22.map{case Array(s0)=>X(s0)}.toDF()
I left a comment with a link to a duplicate question that might help you.
But here I also leave my attempt.
val rdd = sc.parallelize(Array("test", "test2", "test3"))
rdd.take(3)
//res53: Array[String] = Array(test, test2, test3)
val df = rdd.toDF()
df.show
+-----+
|value|
+-----+
| test|
|test2|
|test3|
+-----+
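Applied to your case, a sketch (assuming SCND_RDD is the RDD[String] from your snippet): each element is already a single String, so you can map it into the case class directly instead of pattern matching on Array, which is what causes the "scrutinee is incompatible with pattern type" error.
case class X(callId: String)

// Each element of SCND_RDD is a String (an IP address), not an Array,
// so matching on Array(s0) cannot type-check; map the String directly.
val userDF = SCND_RDD.map(s => X(s)).toDF()
userDF.show(3, false)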

How to get Schema as a Spark Dataframe from a Nested Structured Spark DataFrame

I have a sample DataFrame that I create using the code below:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, DoubleType, StringType}
import org.apache.spark.sql.functions.{struct, col}

val data = Seq(
  Row(20.0, "dog"),
  Row(3.5, "cat"),
  Row(0.000006, "ant")
)
val schema = StructType(
  List(
    StructField("weight", DoubleType, true),
    StructField("animal_type", StringType, true)
  )
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)
val actualDF = df.withColumn(
  "animal_interpretation",
  struct(
    (col("weight") > 5).as("is_large_animal"),
    col("animal_type").isin("rat", "cat", "dog").as("is_mammal")
  )
)
actualDF.show(false)
+------+-----------+---------------------+
|weight|animal_type|animal_interpretation|
+------+-----------+---------------------+
|20.0 |dog |[true,true] |
|3.5 |cat |[false,true] |
|6.0E-6|ant |[false,false] |
+------+-----------+---------------------+
The schema of this Spark DF can be printed using -
scala> actualDF.printSchema
root
|-- weight: double (nullable = true)
|-- animal_type: string (nullable = true)
|-- animal_interpretation: struct (nullable = false)
| |-- is_large_animal: boolean (nullable = true)
| |-- is_mammal: boolean (nullable = true)
However, I would like to get this schema in the form of a dataframe that has 3 columns - field, type, nullable. The output dataframe from the schema would look something like this:
+-------------------------------------+--------------+--------+
|field |type |nullable|
+-------------------------------------+--------------+--------+
|weight |double |true |
|animal_type |string |true |
|animal_interpretation |struct |false |
|animal_interpretation.is_large_animal|boolean |true |
|animal_interpretation.is_mammal |boolean |true |
+-------------------------------------+--------------+--------+
How can I achieve this in Spark? I am using Scala for coding.
Here is a complete example including your code. I used the fairly common flattenSchema method, pattern matching on StructType as Shankar did to traverse the struct, but rather than having the method return the flattened schema I used an ArrayBuffer to collect the data types of the StructType and returned the ArrayBuffer. I then turned the ArrayBuffer into a Seq and finally, using Spark, converted the Seq to a DataFrame.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, DoubleType, StringType}
import org.apache.spark.sql.functions.{struct, col}
import scala.collection.mutable.ArrayBuffer

val data = Seq(
  Row(20.0, "dog"),
  Row(3.5, "cat"),
  Row(0.000006, "ant")
)
val schema = StructType(
  List(
    StructField("weight", DoubleType, true),
    StructField("animal_type", StringType, true)
  )
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)
val actualDF = df.withColumn(
  "animal_interpretation",
  struct(
    (col("weight") > 5).as("is_large_animal"),
    col("animal_type").isin("rat", "cat", "dog").as("is_mammal")
  )
)

val fieldStructs = new ArrayBuffer[(String, String, Boolean)]()

def flattenSchema(schema: StructType, fieldStructs: ArrayBuffer[(String, String, Boolean)], prefix: String = null): ArrayBuffer[(String, String, Boolean)] = {
  schema.fields.foreach(field => {
    val col = if (prefix == null) field.name else prefix + "." + field.name
    field.dataType match {
      case st: StructType =>
        // record the struct column itself, then recurse into its fields
        fieldStructs += ((col, field.dataType.typeName, field.nullable))
        flattenSchema(st, fieldStructs, col)
      case _ =>
        fieldStructs += ((col, field.dataType.simpleString, field.nullable))
    }
  })
  fieldStructs
}

val foo = flattenSchema(actualDF.schema, fieldStructs).toSeq.toDF("field", "type", "nullable")
foo.show(false)
If you run the above you should get the following.
+-------------------------------------+-------+--------+
|field |type |nullable|
+-------------------------------------+-------+--------+
|weight |double |true |
|animal_type |string |true |
|animal_interpretation |struct |false |
|animal_interpretation.is_large_animal|boolean|true |
|animal_interpretation.is_mammal |boolean|true |
+-------------------------------------+-------+--------+
You could do something like this:
def flattenSchema(schema: StructType, prefix: String = null): Seq[(String, String, Boolean)] = {
  schema.fields.flatMap(field => {
    val col = if (prefix == null) field.name else prefix + "." + field.name
    field.dataType match {
      case st: StructType =>
        // include the struct column itself, then recurse into its fields
        Seq((col, field.dataType.typeName, field.nullable)) ++ flattenSchema(st, col)
      case _ =>
        Seq((col, field.dataType.simpleString, field.nullable))
    }
  })
}
flattenSchema(actualDF.schema).toDF("field", "type", "nullable").show(false)
Hope this helps!

Conditional aggregation in scala based on data type

How do you aggregate dynamically in Scala/Spark based on data type?
For example:
SELECT ID, SUM(when DOUBLE type)
, APPEND(when STRING), MAX(when BOOLEAN)
from tbl
GROUP BY ID
Sample data
You can do this by getting the runtime schema and matching on the data type. For example:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq(
  (1, 1.0, true, "a"),
  (1, 2.0, false, "b")
).toDF("id", "d", "b", "s")

val dataTypes: Map[String, DataType] = df.schema.map(sf => (sf.name, sf.dataType)).toMap

def genericAgg(c: String) = {
  dataTypes(c) match {
    case DoubleType => sum(col(c))
    case StringType => concat_ws(",", collect_list(col(c))) // "append"
    case BooleanType => max(col(c))
  }
}

val aggExprs: Seq[Column] = df.columns
  .filterNot(_ == "id") // aggregate all columns except the grouping key
  .map(c => genericAgg(c))

df
  .groupBy("id")
  .agg(aggExprs.head, aggExprs.tail: _*)
  .show()
gives
+---+------+------+-----------------------------+
| id|sum(d)|max(b)|concat_ws(,, collect_list(s))|
+---+------+------+-----------------------------+
| 1| 3.0| true| a,b|
+---+------+------+-----------------------------+
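If your table also contains column types not covered by the three cases above, the match in genericAgg will throw a scala.MatchError when the aggregation expressions are built. A minimal sketch of a fallback case (genericAggWithDefault and the choice of first as the default are my assumptions, not part of the original answer; first comes from the already-imported org.apache.spark.sql.functions):
def genericAggWithDefault(c: String) = {
  dataTypes(c) match {
    case DoubleType  => sum(col(c))
    case StringType  => concat_ws(",", collect_list(col(c)))
    case BooleanType => max(col(c))
    case _           => first(col(c)) // fallback so unhandled types don't cause a MatchError
  }
}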

Remove words from column if present in list

I have a dataframe with a column 'text' which has many rows consisting of English sentences.
text
It is evening
Good morning
Hello everyone
What is your name
I'll see you tomorrow
I have a variable of type List which has some words such as
val removeList = List("Hello", "evening", "because", "is")
I want to remove all those words from column text which are present in removeList.
So my output should be
It
Good morning
everyone
What your name
I'll see you tomorrow
How can I do this using Spark and Scala?
I wrote some code like this:
val stopWordsList = List("Hello", "evening", "because", "is");
val df3 = sqlContext.sql("SELECT text FROM table");
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
def cleanText(x: String, stopWordsList: List[String]): Any = {
  for (str <- stopWordsList) {
    if (x.contains(str)) {
      x.replaceAll(str, "")
    }
  }
}
But I am getting these errors:
Error:(44, 12) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
Error:(44, 12) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[String])org.apache.spark.sql.Dataset[String].
Unspecified value parameter evidence$6.
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
Check this DataFrame/RDD way:
val df = Seq(("It is evening"),("Good morning"),("Hello everyone"),("What is your name"),("I'll see you tomorrow")).toDF("data")
val removeList = List("Hello", "evening", "because", "is")
val rdd2 = df.rdd.map{ x=> {val p = x.getAs[String]("data") ; val k = removeList.foldLeft(p) ( (p,t) => p.replaceAll("\\b"+t+"\\b","") ) ; Row(x(0),k) } }
spark.createDataFrame(rdd2, df.schema.add(StructField("new1",StringType))).show(false)
Output:
+---------------------+---------------------+
|data |new1 |
+---------------------+---------------------+
|It is evening |It |
|Good morning |Good morning |
|Hello everyone | everyone |
|What is your name |What your name |
|I'll see you tomorrow|I'll see you tomorrow|
+---------------------+---------------------+
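A pure DataFrame alternative is also possible; below is a minimal sketch using the built-in regexp_replace and trim functions instead of dropping to the RDD API (the word-boundary pattern and the extra whitespace collapsing are assumptions about the desired output):
import org.apache.spark.sql.functions.{col, regexp_replace, trim}
import java.util.regex.Pattern

// Build one regex such as \b(\QHello\E|\Qevening\E|\Qbecause\E|\Qis\E)\b from the word list
val pattern = removeList.map(w => Pattern.quote(w)).mkString("\\b(", "|", ")\\b")

// Remove the words, collapse any double spaces left behind, and trim the result
df.withColumn("new1", trim(regexp_replace(regexp_replace(col("data"), pattern, ""), " +", " ")))
  .show(false)
This keeps everything in the DataFrame API rather than switching to RDDs.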
This code works for me.
Spark version 2.3.0, Scala version 2.11.8.
Using Datasets
import org.apache.spark.sql.SparkSession

val data = List(
  "It is evening",
  "Good morning",
  "Hello everyone",
  "What is your name",
  "I'll see you tomorrow"
)
val removeList = List("Hello", "evening", "because", "is")

val spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

def cleanText(text: String, removeList: List[String]): String =
  removeList.fold(text) {
    case (text, termToRemove) => text.replaceAllLiterally(termToRemove, "")
  }

val df1 = sc.parallelize(data).toDS // Dataset[String]
val df2 = df1.map(text => cleanText(text, removeList)) // Dataset[String]
Using DataFrames
import org.apache.spark.sql.SparkSession

val data = List(
  "It is evening",
  "Good morning",
  "Hello everyone",
  "What is your name",
  "I'll see you tomorrow"
)
val removeList = List("Hello", "evening", "because", "is")

val spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

def cleanText(text: String, removeList: List[String]): String =
  removeList.fold(text) {
    case (text, termToRemove) => text.replaceAllLiterally(termToRemove, "")
  }

// Creates a temp table.
sc.parallelize(data).toDF("text").createTempView("table")

val df1 = spark.sql("SELECT text FROM table") // DataFrame = [text: string]
val df2 = df1.map(row => cleanText(row.getAs[String](fieldName = "text"), removeList)).toDF("text") // DataFrame = [text: string]

Spark - Make dataframe with multi column csv

origin.csv
no,key1,key2,key3,key4,key5,...
1,A1,B1,C1,D1,E1,..
2,A2,B2,C2,D2,E2,..
3,A3,B3,C3,D3,E3,..
WhatIwant.csv
1,A1,key1
1,B1,key2
1,C1,key3
...
3,A3,key1
3,B3,key2
...
I loaded the CSV with the read method (into the origin.csv dataframe), but I am unable to convert it.
val df = spark.read
  .option("header", true)
  .option("charset", "euc-kr")
  .csv(csvFilePath)
Any idea how to do this?
Try this.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

val df = Seq((1, "A1", "B1", "C1", "D1"), (2, "A2", "B2", "C2", "D2"), (3, "A3", "B3", "C3", "D2")).toDF("no", "key1", "key2", "key3", "key4")
df.show

def myUDF(df: DataFrame, by: Seq[String]): DataFrame = {
  // every column except the "by" columns becomes a (key, val) struct; they must all share a type
  val (columns, types) = df.dtypes.filter { case (clm, _) => !by.contains(clm) }.unzip
  require(types.distinct.size == 1)

  val keys = explode(array(
    columns.map(clm => struct(lit(clm).alias("key"), col(clm).alias("val"))): _*
  ))
  val byValue = by.map(col(_))

  df.select(byValue :+ keys.alias("_key"): _*)
    .select(byValue ++ Seq($"_key.val", $"_key.key"): _*)
}
val df1 = myUDF(df, Seq("no"))
df1.show
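With the sample df above, df1.show should print something like this (a sketch of the expected shape; exact row ordering may differ):
+---+---+----+
| no|val| key|
+---+---+----+
|  1| A1|key1|
|  1| B1|key2|
|  1| C1|key3|
|  1| D1|key4|
|  2| A2|key1|
|  2| B2|key2|
|  2| C2|key3|
|  2| D2|key4|
|  3| A3|key1|
|  3| B3|key2|
|  3| C3|key3|
|  3| D2|key4|
+---+---+----+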