Add identical rows to a Spark DataFrame using an integer - Scala

Assuming the following DataFrame df1:
df1:
+---------+--------+-------+
|A |B |C |
+---------+--------+-------+
|toto |tata |titi |
+---------+--------+-------+
I have an integer N = 3 that I want to use to create 3 duplicates of each row in a new DataFrame df2, built from df1:
df2:
+---------+--------+-------+
|A |B |C |
+---------+--------+-------+
|toto |tata |titi |
|toto |tata |titi |
|toto |tata |titi |
+---------+--------+-------+
Any ideas?

From Spark 2.4+ you can use the arrays_zip + array_repeat + explode functions for this case.
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("toto", "tata", "titi")).toDF("A", "B", "C")

df.withColumn("arr", explode(array_repeat(arrays_zip(array("A"), array("B"), array("C")), 3)))
  .drop("arr")
  .show(false)

// or dynamically, over all columns
val cols = df.columns.map(x => col(x))

df.withColumn("arr", explode(array_repeat(arrays_zip(array(cols: _*)), 3)))
  .drop("arr")
  .show(false)
//+----+----+----+
//|A |B |C |
//+----+----+----+
//|toto|tata|titi|
//|toto|tata|titi|
//|toto|tata|titi|
//+----+----+----+
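A simpler sketch of the same Spark 2.4+ trick (a variation, not part of the original answer): since only the row count matters, you can explode a repeated dummy literal instead of zipping the columns; the column name "dup" is just an illustrative placeholder.
import org.apache.spark.sql.functions.{array_repeat, explode, lit}

val n = 3
df.withColumn("dup", explode(array_repeat(lit(1), n)))   // one output row per repeated array element
  .drop("dup")
  .show(false)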

You can use foldLeft along with DataFrame's union:
import org.apache.spark.sql.DataFrame

object JoinDataFrames {
  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess   // your own SparkSession factory
    import spark.implicits._

    val df = List(("toto", "tata", "titi")).toDF("A", "B", "C")
    val N = 3
    // N - 1 unions of df onto the accumulator yield N copies of every row
    val resultDf = (1 until N).foldLeft(df)((dfInner: DataFrame, count: Int) => {
      df.union(dfInner)
    })
    resultDf.show()
  }
}
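A shorter variant of the same union idea, sketched here as an alternative (it reuses the df and N from main above): build N references to the DataFrame and reduce them pairwise.
// N copies of every row (assumes N >= 1)
val resultDf2 = Seq.fill(N)(df).reduce(_ union _)
resultDf2.show()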

Related

How to create a DataFrame from Array[String]?

I used rdd.collect() to create an Array, and now I want to use this Array[String] to create a DataFrame. My test file is in the following format (separated by a pipe |):
TimeStamp
IdC
Name
FileName
Start-0f-fields
column01
column02
column03
column04
column05
column06
column07
column08
column010
column11
End-of-fields
Start-of-data
G0002B|0|13|IS|LS|Xys|Xyz|12|23|48|
G0002A|0|13|IS|LS|Xys|Xyz|12|23|45|
G0002x|0|13|IS|LS|Xys|Xyz|12|23|48|
G0002C|0|13|IS|LS|Xys|Xyz|12|23|48|
End-of-data
document
The column names are between Start-of-fields and End-of-fields.
I want to store the pipe-separated (|) values in separate columns of a DataFrame,
like the example below:
column01 column02 column03 column04 column05 column06 column07 column08 column010 column11
G0002C 0 13 IS LS Xys Xyz 12 23 48
G0002x 0 13 LS MS Xys Xyz 14 300 400
My code:
val rdd = sc.textFile("the above text file")
val columns = rdd.collect.slice(5,16).mkString(",") // it will hold column names
val data = rdd.collect.slice(5,16)
val rdd1 = sc.parallelize(rdd.collect())
val df = rdd1.toDf(columns)
But this is not giving me the desired DataFrame shown above.
Could you try this?
import spark.implicits._ // needed for `toDS()` and `toDF()`

val rdd = sc.textFile("the above text file")

val columns = rdd.collect.slice(5, 15)     // the column names (`.mkString(",")` is not needed)

val dataDS = rdd.collect.slice(17, 21)     // the data lines between Start-of-data and End-of-data
  .map(_.trim())                           // remove whitespace
  .map(s => s.substring(0, s.length - 1))  // remove the trailing pipe '|'
  .toSeq
  .toDS

val df = spark.read
  .option("header", false)
  .option("delimiter", "|")
  .csv(dataDS)
  .toDF(columns: _*)

df.show(false)
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|column01|column02|column03|column04|column05|column06|column07|column08|column010|column11|
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|G0002B |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
|G0002A |0 |13 |IS |LS |Xys |Xyz |12 |23 |45 |
|G0002x |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
|G0002C |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
Calling the spark.read...csv() method without a schema can take a long time on huge data because of schema inference.
In that case, you can specify the schema like below.
/*
 column01 STRING,
 column02 STRING,
 column03 STRING,
 ...
*/
val schema = columns
  .map(c => s"$c STRING")
  .mkString(",\n")

val df = spark.read
  .option("header", false)
  .option("delimiter", "|")
  .schema(schema)      // no schema inference pass
  .csv(dataDS)
// .toDF(columns: _*) => unnecessary when the schema is specified
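As an alternative to the DDL string, the schema can also be built programmatically with StructType. A minimal sketch, assuming all fields are strings as in the question:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val structSchema = StructType(columns.map(c => StructField(c, StringType, nullable = true)))

val dfTyped = spark.read
  .option("header", false)
  .option("delimiter", "|")
  .schema(structSchema)   // no inference pass over the data
  .csv(dataDS)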
If the number of columns and the column names are fixed, then you can do it as below:
import org.apache.spark.sql.functions._

val columns = rdd.collect.slice(5, 15).mkString(",") // it will hold the column names
val data = rdd.collect.slice(17, 21)                 // the four data lines
val d = data.mkString("\n").split('\n').toSeq.toDF()

val dd = d
  .withColumn("columnX", split($"value", "\\|"))
  .withColumn("column1", $"columnX".getItem(0))
  .withColumn("column2", $"columnX".getItem(1))
  .withColumn("column3", $"columnX".getItem(2))
  .withColumn("column4", $"columnX".getItem(3))
  .withColumn("column5", $"columnX".getItem(4))
  .withColumn("column6", $"columnX".getItem(5))
  .withColumn("column7", $"columnX".getItem(6))   // this index was skipped in the original chain
  .withColumn("column8", $"columnX".getItem(7))
  .withColumn("column10", $"columnX".getItem(8))
  .withColumn("column11", $"columnX".getItem(9))
  .drop("columnX", "value")

display(dd)   // `display` works in Databricks notebooks; use dd.show(false) elsewhere
The output of display(dd) should match the desired DataFrame shown above.
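A more compact sketch of the same idea (a variation, not the original answer's code): generate the getItem calls from the column-name slice instead of writing each withColumn by hand; it assumes the d defined above and spark.implicits._ in scope.
import org.apache.spark.sql.functions.split

val names = rdd.collect.slice(5, 15)   // column01 ... column11
val wide = d.select(
  names.zipWithIndex.map { case (name, i) => split($"value", "\\|").getItem(i).as(name) }: _*
)
wide.show(false)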

DataFrame becomes empty after appending a DataFrame inside a Scala for loop

I am trying to append a DataFrame to an empty DataFrame inside a for loop in Scala,
but the appended DataFrame becomes empty every time.
Below is the code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import java.io._
import org.apache.spark.sql.DataFrame

object obj_Spark_url_Zipcode {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Spark_Url_Zip").master("local[*]").getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._
    sc.setLogLevel("Error")
    System.setProperty("http.agent", "chrome")

    val schema_str = "first,gender,state,zip,phone"
    val struct_dymic = StructType(schema_str.split(",").map(x => StructField(x, StringType, true)))
    val df_empty = spark.createDataFrame(sc.emptyRDD[Row], struct_dymic)

    for (i <- 1 to 10) {
      val url_json_data = scala.io.Source.fromURL("https://webapiusr.mue/apii/0.05/?reslts=4554").mkString
      val url_json_rdd = sc.parallelize(url_json_data :: Nil) // to convert a string to an RDD
      val url_json_df = spark.read.option("multiline", true).json(url_json_rdd)
      val zipcode_df = url_json_df.withColumn("results", explode(col("results")))
        .select("results.user.name.first", "results.user.gender", "results.user.location.state", "results.user.location.zip", "results.user.phone")
      df_empty.union(zipcode_df)
      println("Curr val : " + i)
    }
    df_empty.show()
  }
}
Result:
#######
Curr val : 1
Curr val : 2
Curr val : 3
Curr val : 4
Curr val : 5
Curr val : 6
Curr val : 7
Curr val : 8
Curr val : 9
Curr val : 10
+-----+------+-----+---+-----+
|first|gender|state|zip|phone|
+-----+------+-----+---+-----+
+-----+------+-----+---+-----+
My intention is to append all the DataFrames created inside the for loop into one DataFrame and write the final DataFrame to the target.
I don't know why it becomes empty.
I tried this approach in PySpark by appending the DataFrames to an array and unioning the array of DataFrames into one DataFrame,
but in Scala I am unable to add DataFrames to an array (an array of DataFrames).
DataFrames are immutable: union returns a new DataFrame rather than modifying df_empty, so the result of df_empty.union(zipcode_df) inside the loop is discarded. Collect the per-iteration DataFrames and union them instead. Example in Scala:
import spark.implicits._
case class ReduceUnion (id: Int, v: String)
val l = Array.range(1,10)
val d = l.map(i => Seq(ReduceUnion(i, s"Test $i")).toDF())
val resultDF = d.reduce(_ union _)
resultDF.printSchema()
resultDF.show(false)
// root
// |-- id: integer (nullable = false)
// |-- v: string (nullable = true)
//
// +---+------+
// |id |v |
// +---+------+
// |1 |Test 1|
// |2 |Test 2|
// |3 |Test 3|
// |4 |Test 4|
// |5 |Test 5|
// |6 |Test 6|
// |7 |Test 7|
// |8 |Test 8|
// |9 |Test 9|
// +---+------+
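On the "array of DataFrames" part of the question: DataFrames can be appended to an ordinary Scala ArrayBuffer and unioned afterwards. A minimal sketch, where the toy DataFrame stands in for the zipcode_df built in the loop and a SparkSession named spark is assumed:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.DataFrame
import spark.implicits._

val parts = ArrayBuffer.empty[DataFrame]
for (i <- 1 to 3) {
  parts += Seq((i, s"row $i")).toDF("id", "v")   // stand-in for zipcode_df from the question's loop
}
val combined = parts.reduce(_ union _)
combined.show()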

How to write code that creates a Dataset with columns that have the elements of an array column as values and their positions as names?

Input data:
val inputDf = Seq(Seq("a", "b", "c"), Seq("X", "Y", "Z")).toDF
println("Input:")
inputDf.show(false)
Here is how the input looks:
+---------+
|value |
+---------+
|[a, b, c]|
|[X, Y, Z]|
+---------+
Here is the expected output:
+---+---+---+
|0 |1 |2 |
+---+---+---+
|a |b |c |
|X |Y |Z |
+---+---+---+
I tried to use code like this:
val ncols = 3
val selectCols = (0 until ncols).map(i => $"arr"(i).as(s"col_$i"))
inputDf
.select(selectCols:_*)
.show()
But I get errors, because I need some :Unit.
Another way to create a DataFrame (this one in PySpark):
df1 = spark.createDataFrame([(1,[4,2, 1]),(4,[3,2])], [ "col2","col4"])
Output:
+----+---------+
|col2| col4|
+----+---------+
| 1|[4, 2, 1]|
| 4| [3, 2]|
+----+---------+
package spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ArrayToCol extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val inptDf = Seq(Seq("a", "b", "c"), Seq("X", "Y", "Z")).toDF("value")

  val d = inptDf
    .withColumn("0", col("value").getItem(0))
    .withColumn("1", col("value").getItem(1))
    .withColumn("2", col("value").getItem(2))
    .drop("value")

  d.show(false)
}

// Variant 2
val res = inptDf.select(
  $"value".getItem(0).as("col0"),
  $"value".getItem(1).as("col1"),
  $"value".getItem(2).as("col2")
)

// Variant 3
val res1 = inptDf.select(
    col("*") +: (0 until 3).map(i => col("value").getItem(i).as(s"$i")): _*
  )
  .drop("value")

Operate within a group by and populate additional columns

I have a DataFrame as below:
+------+------+---+------+
|field1|field2|id |Amount|
+------+------+---+------+
|A |B |002|10.0 |
|A |B |003|12.0 |
|A |B |005|15.0 |
|C |B |002|20.0 |
|C |B |003|22.0 |
|C |B |005|25.0 |
+------+------+---+------+
I need to convert it to:
+------+------+---+-------+---+-------+---+-------+
|field1|field2|002|002_Amt|003|003_Amt|005|005_Amt|
+------+------+---+-------+---+-------+---+-------+
|A |B |002|10.0 |003|12.0 |005|15.0 |
|C |B |002|20.0 |003|22.0 |005|25.0 |
+------+------+---+-------+---+-------+---+-------+
Please advise!
Your final DataFrame's columns depend on the id column, so you need to store the distinct ids in a separate array.
import scala.collection.mutable
import org.apache.spark.sql.functions._

val distinctIds = df.select(collect_list("id")).rdd.first().get(0).asInstanceOf[mutable.WrappedArray[String]].distinct
The next step is to filter on each of the distinctIds and join the results:
val first = distinctIds.head
var finalDF = df.filter($"id" === first).withColumnRenamed("id", first).withColumnRenamed("Amount", first + "_Amt")

for (str <- distinctIds.tail) {
  val tempDF = df.filter($"id" === str).withColumnRenamed("id", str).withColumnRenamed("Amount", str + "_Amt")
  finalDF = finalDF.join(tempDF, Seq("field1", "field2"), "left")
}
finalDF.show(false)
You should get the desired output:
+------+------+---+-------+---+-------+---+-------+
|field1|field2|002|002_Amt|003|003_Amt|005|005_Amt|
+------+------+---+-------+---+-------+---+-------+
|A |B |002|10.0 |003|12.0 |005|15.0 |
|C |B |002|20.0 |003|22.0 |005|25.0 |
+------+------+---+-------+---+-------+---+-------+
Using var is not recommended in Scala, so you can instead create a recursive function to do the above logic, as below:
def getFinalDF(first: Boolean, array: List[String], df: DataFrame, tdf: DataFrame): DataFrame = array match {
  case head :: tail =>
    if (first) {
      getFinalDF(false, tail, df, df.filter($"id" === head).withColumnRenamed("id", head).withColumnRenamed("Amount", head + "_Amt"))
    } else {
      val tempDF = df.filter($"id" === head).withColumnRenamed("id", head).withColumnRenamed("Amount", head + "_Amt")
      getFinalDF(false, tail, df, tdf.join(tempDF, Seq("field1", "field2"), "left"))
    }
  case Nil => tdf
}
Call the recursive function as:
getFinalDF(true, distinctIds.toList, df, df).show(false)
You should have the same output.
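The same per-id filter-and-join logic can also be written with foldLeft instead of explicit recursion. A sketch (a variation, not part of the answer above) that reuses df and distinctIds:
val foldedDF = distinctIds.tail.foldLeft(
  df.filter($"id" === distinctIds.head)
    .withColumnRenamed("id", distinctIds.head)
    .withColumnRenamed("Amount", distinctIds.head + "_Amt")
) { (acc, id) =>
  // one left join per remaining id, exactly as in the loop/recursion above
  val tempDF = df.filter($"id" === id).withColumnRenamed("id", id).withColumnRenamed("Amount", id + "_Amt")
  acc.join(tempDF, Seq("field1", "field2"), "left")
}
foldedDF.show(false)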

Comparing two array columns in Scala Spark

I have a DataFrame in the format given below:
movieId1 | genreList1 | genreList2
--------------------------------------------------
1 |[Adventure,Comedy] |[Adventure]
2 |[Animation,Drama,War] |[War,Drama]
3 |[Adventure,Drama] |[Drama,War]
I am trying to create another flag column which shows whether genreList2 is a subset of genreList1:
movieId1 | genreList1 | genreList2 | Flag
---------------------------------------------------------------
1 |[Adventure,Comedy] | [Adventure] |1
2 |[Animation,Drama,War] | [War,Drama] |1
3 |[Adventure,Drama] | [Drama,War] |0
I have tried this:
def intersect_check(a: Array[String], b: Array[String]): Int = {
  if (b.sameElements(a.intersect(b))) { return 1 }
  else { return 2 }
}

def intersect_check_udf =
  udf((colvalue1: Array[String], colvalue2: Array[String]) => intersect_check(colvalue1, colvalue2))

data = data.withColumn("Flag", intersect_check_udf(col("genreList1"), col("genreList2")))
But this throws an error:
org.apache.spark.SparkException: Failed to execute user defined function.
P.S. The above function (intersect_check) works for plain Scala Arrays.
We can define a UDF that calculates the length of the intersection between the two array columns and checks whether it is equal to the length of the second one; if so, the second array is a subset of the first.
Also, the inputs of your UDF need to be of class WrappedArray[String], not Array[String]:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}

val same_elements = udf { (a: WrappedArray[String], b: WrappedArray[String]) =>
  if (a.intersect(b).length == b.length) { 1 } else { 0 }
}

df.withColumn("test", same_elements(col("genreList1"), col("genreList2")))
  .show(truncate = false)
+--------+-----------------------+------------+----+
|movieId1|genreList1 |genreList2 |test|
+--------+-----------------------+------------+----+
|1 |[Adventure, Comedy] |[Adventure] |1 |
|2 |[Animation, Drama, War]|[War, Drama]|1 |
|3 |[Adventure, Drama] |[Drama, War]|0 |
+--------+-----------------------+------------+----+
Data
val df = List((1,Array("Adventure","Comedy"), Array("Adventure")),
(2,Array("Animation","Drama","War"), Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))).toDF("movieId1","genreList1","genreList2")
Here is a solution using subsetOf:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

val data = spark.sparkContext.parallelize(
  Seq(
    (1, Array("Adventure", "Comedy"), Array("Adventure")),
    (2, Array("Animation", "Drama", "War"), Array("War", "Drama")),
    (3, Array("Adventure", "Drama"), Array("Drama", "War"))
  )).toDF("movieId1", "genreList1", "genreList2")

val subsetOf = udf((col1: Seq[String], col2: Seq[String]) => {
  if (col2.toSet.subsetOf(col1.toSet)) 1 else 0
})

data.withColumn("flag", subsetOf(data("genreList1"), data("genreList2"))).show()
Hope this helps!
One solution may be to exploit Spark's built-in array functions (Spark 2.4+): genreList2 is a subset of genreList1 if the intersection of the two equals genreList2. In the code below, a sort_array operation has been added to avoid a mismatch between two arrays with different ordering but the same elements.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val spark = {
  SparkSession
    .builder()
    .master("local")
    .appName("test")
    .getOrCreate()
}

import spark.implicits._

val df = Seq(
  (1, Array("Adventure", "Comedy"), Array("Adventure")),
  (2, Array("Animation", "Drama", "War"), Array("War", "Drama")),
  (3, Array("Adventure", "Drama"), Array("Drama", "War"))
).toDF("movieId1", "genreList1", "genreList2")

df
  .withColumn("flag",
    sort_array(array_intersect($"genreList1", $"genreList2"))
      .equalTo(sort_array($"genreList2"))
      .cast("integer")
  )
  .show()
The output is
+--------+--------------------+------------+----+
|movieId1| genreList1| genreList2|flag|
+--------+--------------------+------------+----+
| 1| [Adventure, Comedy]| [Adventure]| 1|
| 2|[Animation, Drama...|[War, Drama]| 1|
| 3| [Adventure, Drama]|[Drama, War]| 0|
+--------+--------------------+------------+----+
This can also work here, and it does not use a UDF (array_except is likewise available from Spark 2.4):
import spark.implicits._
import org.apache.spark.sql.functions._

val data = Seq(
  (1, Array("Adventure", "Comedy"), Array("Adventure")),
  (2, Array("Animation", "Drama", "War"), Array("War", "Drama")),
  (3, Array("Adventure", "Drama"), Array("Drama", "War"))
).toDF("movieId1", "genreList1", "genreList2")

data
  .withColumn("size", size(array_except($"genreList2", $"genreList1")))
  .withColumn("flag", when($"size" === lit(0), 1).otherwise(0))
  .show(false)
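The same check can also be written without the intermediate size column; a minor variation on the idea above:
data
  .withColumn("flag", (size(array_except($"genreList2", $"genreList1")) === 0).cast("int"))
  .show(false)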
Spark 3.0+ (forall)
forall($"genreList2", x => array_contains($"genreList1", x)).cast("int")
Full example:
val df = Seq(
(1, Seq("Adventure", "Comedy"), Seq("Adventure")),
(2, Seq("Animation", "Drama","War"), Seq("War", "Drama")),
(3, Seq("Adventure", "Drama"), Seq("Drama", "War"))
).toDF("movieId1", "genreList1", "genreList2")
val df2 = df.withColumn("Flag", forall($"genreList2", x => array_contains($"genreList1", x)).cast("int"))
df2.show()
// +--------+--------------------+------------+----+
// |movieId1| genreList1| genreList2|Flag|
// +--------+--------------------+------------+----+
// | 1| [Adventure, Comedy]| [Adventure]| 1|
// | 2|[Animation, Drama...|[War, Drama]| 1|
// | 3| [Adventure, Drama]|[Drama, War]| 0|
// +--------+--------------------+------------+----+