When I join two DataFrames using Seq(col), I still get duplicate columns when selecting with df.*
E.g.:
'''
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("language", StringType, true),
  StructField("users", StringType, true)
))
val rowData = Seq(
  Row("Java", "20000"),
  Row("Python", "100000"),
  Row("Scala", "3000")
)
val dfFromData3 = spark.createDataFrame(spark.sparkContext.parallelize(rowData), schema)

val schema1 = StructType(Array(
  StructField("language", StringType, true),
  StructField("price", StringType, true)
))
val rowData1 = Seq(
  Row("Java", "20"),
  Row("Python", "10")
)
val dfFromData4 = spark.createDataFrame(spark.sparkContext.parallelize(rowData1), schema1)

val combined = dfFromData3.join(dfFromData4, Seq("language"), "left")
'''
'''display(combined)''' shows only one "language" column,
but
'''display(combined.as("df").select("df.*"))''' shows two "language" columns.
Can someone please explain what is happening here?
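A quick way to see the difference (a sketch that assumes the example above has been run) is to compare the column lists of the two projections:
// Columns of the plain join result vs. the aliased star-expansion.
println(combined.columns.mkString(", "))
println(combined.as("df").select("df.*").columns.mkString(", "))

// If only one copy of each column is needed, selecting explicit names avoids
// relying on how the alias expands (the names below come from the example schemas).
combined.select("language", "users", "price").show()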
Suppose I have two data frames and I would like to join them based on certain columns. The list of join columns can differ depending on the data frames being joined, but we can always count on the two data frames passed to joinDfs having columns with the same names as those in joinCols.
I am trying to figure out how to form the joinCondition given the assumptions/requirements above. Currently it produces (((a.colName1 = b.colName1) AND (a.colName2 = b.colName2)) AND (a.colName3 = b.colName3)), which is not quite what I'm expecting from the INNER JOIN in the example below.
Thank you in advance for helping me, a newbie to Scala and Spark, figure out how to form a proper joinCondition.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, Row}
import org.apache.spark.sql.types._

def joinDfs(firstDf: DataFrame, secondDf: DataFrame, joinCols: Array[String]): DataFrame = {
  val firstDfAlias: String = "a"
  val secondDfAlias: String = "b"

  // This is what I am trying to figure out and need help with
  val joinCondition: Column = joinCols
    .map(c => col(s"$firstDfAlias.$c") === col(s"$secondDfAlias.$c"))
    .reduce((x, y) => x && y)

  secondDf.as(secondDfAlias).join(
    firstDf.as(firstDfAlias),
    joinCondition,
    "inner"
  ).select(joinCols.map(c => col(s"$firstDfAlias.$c")): _*)
}
// This is an example of data frames that I'm trying to join
// But these data frames can change in terms of number of columns in each
// and data types, etc. The only thing we know for sure is that these
// data frames will contain some or all columns with the same name and
// we will use them to join the two data frames.
val firstDfSchema = StructType(List(
StructField(name = "colName1", dataType = LongType, nullable=true),
StructField(name = "colName2", dataType = StringType, nullable=true),
StructField(name = "colName3", dataType = LongType, nullable=true),
StructField(name = "market_id", dataType = LongType, nullable=true)
))
val firstDfData = Seq(
Row(123L, "123", 123L, 123L),
Row(234L, "234", 234L, 234L),
Row(345L, "345", 345L, 345L),
Row(456L, "456", 456L, 456L)
)
val firstDf = spark.createDataFrame(spark.sparkContext.parallelize(firstDfData), firstDfSchema)
val secondDfSchema = StructType(List(
StructField(name = "colName1", dataType = LongType, nullable=true),
StructField(name = "colName2", dataType = StringType, nullable = true),
StructField(name = "colName3", dataType = LongType, nullable = true),
StructField(name = "num_orders", dataType = LongType, nullable=true)
))
val secondDfData = Seq(
Row(123L, "123", 123L, 1L),
Row(234L, "234", 234L, 2L),
Row(567L, "567", 567L, 3L)
)
val secondDf = spark.createDataFrame(spark.sparkContext.parallelize(secondDfData), secondDfSchema)
// Suppose we are going to join the two data frames above on the following columns
val joinCols: Array[String] = Array("colName1", "colName2", "colName3")
val finalDf = joinDfs(firstDf, secondDf, joinCols)
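For comparison, here is a minimal sketch of the same join expressed with Spark's "using columns" join, which builds the per-column equality itself and keeps only one copy of each join column (joinDfsUsing is a hypothetical name; it assumes the same inputs as joinDfs above):
def joinDfsUsing(firstDf: DataFrame, secondDf: DataFrame, joinCols: Array[String]): DataFrame =
  // join(right, usingColumns, joinType) de-duplicates the join columns in the output.
  firstDf.join(secondDf, joinCols.toSeq, "inner")

val finalDfUsing = joinDfsUsing(firstDf, secondDf, joinCols)
finalDfUsing.printSchema()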
The following code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val data1 = Seq(
  ("Android", 1, "2021-07-24 12:01:19.000", "play"),
  ("Android", 1, "2021-07-24 12:02:19.000", "stop"),
  ("Apple", 1, "2021-07-24 12:03:19.000", "play"),
  ("Apple", 1, "2021-07-24 12:04:19.000", "stop")
)
val schema1 = StructType(Array(
  StructField("device_id", StringType, true),
  StructField("video_id", IntegerType, true),
  StructField("event_timestamp", StringType, true),
  StructField("event_type", StringType, true)
))
val spark = SparkSession.builder()
  .enableHiveSupport()
  .appName("PlayStop")
  .getOrCreate()
var transaction = spark.createDataFrame(data1, schema1)
produces the error:
Cannot resolve overloaded method 'createDataFrame'
Why?
And how to fix it?
If you can live with the schema Spark infers from the tuple elements (default nullability, inferred types), the easiest way to create a DataFrame is to simply apply toDF(), which requires import spark.implicits._:
val transaction = data1.toDF("device_id", "video_id", "event_timestamp", "event_type")
To specify a custom schema, note that createDataFrame() takes an RDD[Row] and a schema as its parameters. In your case, you could transform data1 into an RDD[Row] like below (Row.fromTuple, from org.apache.spark.sql.Row, turns each tuple into a Row):
val transaction = spark.createDataFrame(spark.sparkContext.parallelize(data1.map(t => Row.fromTuple(t))), schema1)
An alternative is to use toDF, followed by .rdd, which exposes the DataFrame (i.e. Dataset[Row]) as an RDD[Row]:
val transaction = spark.createDataFrame(data1.toDF.rdd, schema1)
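Whichever variant you choose, a quick sanity check could look like this (a sketch; transaction is the DataFrame created above):
// Confirm the column names and types, then inspect the rows.
transaction.printSchema()
transaction.show(truncate = false)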
I'm trying to write a Spark UDF in Scala, and I need to define the function's input data type.
I have a schema variable of type StructType, shown below.
import org.apache.spark.sql.types._

val relationsSchema = StructType(
  Seq(
    StructField("relation",
      ArrayType(
        StructType(Seq(
          StructField("attribute", StringType, true),
          StructField("email", StringType, true),
          StructField("fname", StringType, true),
          StructField("lname", StringType, true)
        )),
        true
      ),
      true
    )
  )
)
I'm trying to write a function like the one below:
val relationsFunc: Array[Map[String,String]] => Array[String] = _.map(do something)
val relationUDF = udf(relationsFunc)
input.withColumn("relation",relationUDF(col("relation")))
The above code throws the exception below:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(relation)' due to data type mismatch: argument 1 requires array<map<string,string>> type, however, '`relation`' is of array<struct<attribute:string,email:string,fname:string,lname:string>> type.;;
'Project [relation#89, UDF(relation#89) AS proc#273]
If I give the input type as
val relationsFunc: StructType => Array[String] =
I'm not able to implement the logic, as _.map gives me metadata, field names, etc.
Please advise how to express relationsSchema as the input data type in the function below.
val relationsFunc: ? => Array[String] = _.map(somelogic)
Each element under relation is a Row, and Spark passes an ArrayType column to a UDF as a Seq, so your function should have the following signature:
val relationsFunc: Seq[Row] => Array[String]
You can then access your data either by position or by name, i.e.:
{r: Row => r.getAs[String]("email")}
Check the mapping table in the documentation to see how Spark SQL data types map to Scala types: https://spark.apache.org/docs/2.4.4/sql-reference.html#data-types
The elements of your relation field are of the Spark SQL complex type StructType, which is represented by the Scala type org.apache.spark.sql.Row, and the enclosing ArrayType is represented as a Seq, so Seq[Row] is the input type you should be using.
I used your code to create this complete working example that extracts email values:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.Row
val relationsSchema = StructType(
Seq(
StructField("relation", ArrayType(
StructType(
Seq(
StructField("attribute", StringType, true),
StructField("email", StringType, true),
StructField("fname", StringType, true),
StructField("lname", StringType, true)
)
), true
), true)
)
)
val data = Seq(
  Row(Seq(Row("1", "johnny#example.com", "Johnny", "Appleseed")))
)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(data),
relationsSchema
)
val relationsFunc = (relation: Seq[Row]) => relation.map(_.getAs[String]("email"))
val relationUdf = udf(relationsFunc)
df.withColumn("relation", relationUdf(col("relation")))
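For comparison, the same email extraction can be done without a UDF via the transform higher-order function (a sketch; it assumes Spark 2.4+, where transform is available in SQL expressions):
import org.apache.spark.sql.functions.expr

// Pull the email field out of every struct in the relation array.
df.withColumn("emails", expr("transform(relation, r -> r.email)"))
  .show(truncate = false)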
I am new to Spark/Scala.
I have created the RDD below by loading data from multiple paths. Now I want to create a DataFrame from it for further operations.
Below is the intended schema of the DataFrame:
schema[UserId, EntityId, WebSessionId, ProductId]
rdd.foreach(println)
545456,5615615,DIKFH6545614561456,PR5454564656445454
875643,5485254,JHDSFJD543514KJKJ4
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR54545DSKJD541054
264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515
732543,8765984,UJHSG4240323545144
564574,6276832,KJDXSGFJFS2545DSAS
Can anyone please help me?
I have tried defining a schema class and mapping it against the RDD, but I get the error
"ArrayIndexOutOfBoundsException: 3"
If you treat your columns as Strings, you can create the DataFrame as follows:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rdd: RDD[Row] = ???

val df = spark.createDataFrame(rdd, StructType(Seq(
  StructField("userId", StringType, false),
  StructField("EntityId", StringType, false),
  StructField("WebSessionId", StringType, false),
  StructField("ProductId", StringType, true))))
Note that you must map your RDD to an RDD[Row] for the compiler to accept the createDataFrame call. For the missing fields you can declare the columns as nullable in the DataFrame schema.
In your example you are using spark.sparkContext.textFile(), which returns an RDD[String]: each element of the RDD is a whole line. But you need an RDD[Row], so you have to split each line on commas, like this:
val list = List(
  "545456,5615615,DIKFH6545614561456,PR5454564656445454",
  "875643,5485254,JHDSFJD543514KJKJ4",
  "545456,5615615,DIKFH6545614561456,PR5454564656445454",
  "545456,5615615,DIKFH6545614561456,PR5454564656445454",
  "545456,5615615,DIKFH6545614561456,PR54545DSKJD541054",
  "264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515",
  "732543,8765984,UJHSG4240323545144",
  "564574,6276832,KJDXSGFJFS2545DSAS")

val FilterReadClicks = spark.sparkContext.parallelize(list)

val rows: RDD[Row] = FilterReadClicks.map(_.split(",")).map { arr =>
  // Pad lines that are missing the trailing ProductId field with an empty string.
  val fields = if (arr.length == 4) arr.toSeq else arr.toSeq :+ ""
  Row.fromSeq(fields)
}

rows.foreach(el => println(el.toSeq))
val df = spark.createDataFrame(rows, StructType(Seq(
StructField("userId", StringType, false),
StructField("EntityId", StringType, false),
StructField("WebSessionId", StringType, false),
StructField("ProductId", StringType, true))))
df.show()
+------+--------+------------------+------------------+
|userId|EntityId|      WebSessionId|         ProductId|
+------+--------+------------------+------------------+
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|875643| 5485254|JHDSFJD543514KJKJ4|                  |
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|545456| 5615615|DIKFH6545614561456|PR54545DSKJD541054|
|264264| 3254564|MNXZCBMNABC5645SAD|PR5142545564542515|
|732543| 8765984|UJHSG4240323545144|                  |
|564574| 6276832|KJDXSGFJFS2545DSAS|                  |
+------+--------+------------------+------------------+
With the rows RDD you will be able to create the DataFrame.
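As a side note, if the data ultimately comes from text files, reading them directly with the CSV reader gives similar padding behaviour, since in the default PERMISSIVE mode missing trailing fields become null (a sketch; the path is hypothetical and the schema mirrors the one above):
val csvSchema = StructType(Seq(
  StructField("userId", StringType, true),
  StructField("EntityId", StringType, true),
  StructField("WebSessionId", StringType, true),
  StructField("ProductId", StringType, true)))

// Missing trailing tokens (e.g. ProductId) are read as null.
val dfFromCsv = spark.read
  .schema(csvSchema)
  .csv("/path/to/click/files/*")  // hypothetical input path
dfFromCsv.show()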
I am seeing the error "Cannot have map type columns in DataFrame which calls set operations" when using Spark's MapType.
Below is the sample code I wrote to reproduce it. I understand this is happening because MapType columns are not hashable, but I have a use case where I need to do the following.
val schema1 = StructType(Seq(
StructField("a", MapType(StringType, StringType, true)),
StructField("b", StringType, true)
))
val df = spark.read.schema(schema1).json("path")
val filteredDF = df.filter($"b" === "apple")
val otherDF = df.except(filteredDF)
Any suggestions for workarounds?
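One possible workaround (a sketch, not the only option, assuming Spark 2.4+ where to_json/from_json support map columns): serialize the map column to JSON before the set operation and parse it back afterwards.
import org.apache.spark.sql.functions.{from_json, to_json}

val mapType = MapType(StringType, StringType, true)

// Replace the map column with its JSON form so except() can compare rows,
// then restore the original map type afterwards.
val otherDF = df.withColumn("a", to_json($"a"))
  .except(filteredDF.withColumn("a", to_json($"a")))
  .withColumn("a", from_json($"a", mapType))
Note that this compares JSON strings, so two maps with the same entries in a different key order would not be treated as equal.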