How to join datasets with same columns and select one? - scala

I have two Spark dataframes which I am joining and selecting from afterwards. I want to select a specific column of one of the dataframes, but the same column name exists in the other one, so I am getting an exception about an ambiguous column.
I have tried this:
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left").select($"d1.columnName")
and this:
d1.join(d2, d1("id") === d2("id"), "left").select($"d1.columnName")
but neither works.

Which Spark version are you using? Can you share a sample of your dataframes?
Try this:
val d2prim = d2.withColumnRenamed("columnName", "d2_columnName")
d1.join(d2prim, Seq("id"), "left_outer").select("columnName")

I have two dataframes
val d1 = spark.range(3).withColumn("columnName", lit("d1"))
scala> d1.printSchema
root
|-- id: long (nullable = false)
|-- columnName: string (nullable = false)
val d2 = spark.range(3).withColumn("columnName", lit("d2"))
scala> d2.printSchema
root
|-- id: long (nullable = false)
|-- columnName: string (nullable = false)
which I am joining and selecting afterwards.
I want to select a specific column of one of the Dataframes. But the same column name exists in the other one.
val q1 = d1.as("d1")
.join(d2.as("d2"), Seq("id"), "left")
.select("d1.columnName")
scala> q1.show
+----------+
|columnName|
+----------+
| d1|
| d1|
| d1|
+----------+
As you can see, it just works.
So, why did it not work for you? Let's analyze each attempt.
// you started very well
d1.as("d1")
// but here you used $ to reference the join columns
// through their aliases
// that won't work
.join(d2.as("d2"), $"d1.id" === $"d2.id", "left")
// same here
// $ + aliased columns won't work
.select($"d1.columnName")
PROTIP: Use d1("columnName") to reference a specific column in a dataframe.
The other query was very close to being fine, but...
d1.join(d2, d1("id") === d2("id"), "left") // <-- so far so good!
.select($"d1.columnName") // <-- that's the issue, i.e. $ + aliased column
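Putting the PROTIP to work, the second query should run without the ambiguity error once the select also references the column through its DataFrame (a quick sketch against the d1 and d2 defined above):
d1.join(d2, d1("id") === d2("id"), "left")
  .select(d1("columnName"))
  .show
// +----------+
// |columnName|
// +----------+
// |        d1|
// |        d1|
// |        d1|
// +----------+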

This happens because when Spark combines the columns from the two DataFrames, it doesn't do any automatic renaming for you. You just need to rename one of the columns before joining; Spark provides a method for this. After the join you can drop the renamed column.
val d2join = d2.withColumnRenamed("id", "join_id")
val joined = d1.join(d2join, $"id" === $"join_id", "left").drop("join_id")
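Applied to the ambiguous columnName from the question, the same rename-before-join idea might look like this (a sketch against the d1 and d2 defined earlier; the new column name is arbitrary):
val d2renamed = d2.withColumnRenamed("columnName", "columnName_d2")
val renamedJoin = d1.join(d2renamed, Seq("id"), "left")
renamedJoin.select("columnName").show()      // unambiguous: comes from d1
renamedJoin.select("columnName_d2").show()   // comes from d2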

Related

Passing column information to SparkSession within withColumn

I have a simple df with 2 columns, as shown below,
+------------+---+
|file_name |id |
+------------+---+
|file1.csv |1 |
|file2.csv |2 |
+------------+---+
root
|-- file_name: string (nullable = true)
|-- id: string (nullable = true)
I wish to add a 3rd column with the count() from each file specified in the file_name column.
These are large files, so I wish to go for a Spark-based approach for getting the count() from each file.
Assuming originalDF is the above df,
I have tried:
dfWithCounts = originalDF.withColumn("counts", lit(spark.read.csv(lit(col('file_name'))).count))
but this throws an error:
Column is not iterable
Is there a way I can achieve this?
I'm using Spark 2.4.
You can't run a Spark job from within another Spark job. Assuming that the file list is not super huge, you can collect originalDF to the driver and spawn individual jobs to count the lines from there.
import spark.implicits._
val dfWithCounts = originalDF.collect.map { r =>
  // id is a string according to the schema above, hence getString(1)
  (r.getString(0), r.getString(1), spark.read.csv(r.getString(0)).count)
}.toSeq.toDF("file_name", "id", "count")
Optionally you can use Scala parallel collections to run these jobs in parallel.
val dfWithCounts = originalDF.collect.par.map { r =>
  (r.getString(0), r.getString(1), spark.read.csv(r.getString(0)).count)
}.toSeq.seq.toDF("file_name", "id", "count")

String Functions for Nested Schema in Spark Scala

I am learning Spark with the Scala programming language.
Input file ->
{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}
Schema ->
root
|-- Personal: struct (nullable = true)
| |-- ID: integer (nullable = true)
| |-- Name: array (nullable = true)
| | |-- element: string (containsNull = true)
Operation for output ->
I want to concatenate the strings of the "Name" element,
e.g. abcs|dakjdb
I am reading the file using the DataFrame API.
Please help me with this.
It should be pretty straightforward. If you are working with Spark >= 1.6.0 you can use get_json_object and concat_ws:
import org.apache.spark.sql.functions.{get_json_object, concat_ws}
import spark.implicits._

val df = Seq(
    """{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}""",
    """{"Personal":{"ID":3425,"Name":["cfg","woooww"]}}"""
  ).toDF("data")

df.select(
  concat_ws(
    "-",
    get_json_object($"data", "$.Personal.Name[0]"),
    get_json_object($"data", "$.Personal.Name[1]")
  ).as("FullName")
).show(false)
// +-----------+
// |FullName |
// +-----------+
// |abcs-dakjdb|
// |cfg-woooww |
// +-----------+
With get_json_object we go through the JSON data and extract the two elements of the Name array, which we then concatenate.
There is an inbuilt function concat_ws which should be useful here.
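A minimal sketch of that idea, assuming the file is valid JSON and is read with spark.read.json, so that Personal.Name already comes in as an array column matching the schema shown in the question (the path is a placeholder):
import org.apache.spark.sql.functions.{col, concat_ws}

// "path/to/input.json" is a placeholder for the actual input file
val parsed = spark.read.json("path/to/input.json")
parsed
  .select(concat_ws("|", col("Personal.Name")).as("FullName"))
  .show(false)
// e.g. abcs|dakjdb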
To extend Alexandros Biratsis's answer: you can first convert Name into an array[String] type before concatenating, to avoid writing out every name position. Querying by position would also fail when the value is null or when only one value exists instead of two.
import org.apache.spark.sql.functions.{get_json_object, concat_ws, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}
import spark.implicits._

val arraySchema = ArrayType(StringType)
val df = Seq(
    """{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}""",
    """{"Personal":{"ID":3425,"Name":["cfg","woooww"]}}"""
  ).toDF("data")
  .select(get_json_object($"data", "$.Personal.Name") as "name")
  .select(from_json($"name", arraySchema) as "name")
  .select(concat_ws("|", $"name"))

df.show(false)

How to get data of the second dataframe for all values of particular columns matched in the first dataframe?

I have two dataframes as below.
first_df
|-- company_id: string (nullable = true)
|-- max_dd: date (nullable = true)
|-- min_dd: date (nullable = true)
|-- mean: double (nullable = true)
|-- count: long (nullable = false)
second_df
|-- company_id: string (nullable = true)
|-- max_dd: date (nullable = true)
|-- mean: double (nullable = true)
|-- count: long (nullable = false)
I have some companies' data in second_df. I need to get data from second_df for those company ids which are listed in first_df.
What kind of Spark APIs would be useful here?
How can I do it?
Thank you.
Question extension:
If there are no stored records then first_df would be empty. Hence first_df("mean") & first_df("count") would be null, which makes "acc_new_mean" null. In that case I need to set "new_mean" to second_df("mean"); how can I do that?
I tried like this but it is not working.
Any clue how to handle the .withColumn("new_mean", ... ) here?
val acc_new_mean = (second_df("mean") + first_df("mean")) / (second_df("count") + first_df("count"))
val acc_new_count = second_df("count") + first_df("count")
val new_df = second_df.join(first_df.withColumnRenamed("company_id", "right_company_id").as("a"),
( $"a.right_company_id" === second_df("company_id") && ( second_df("min_dd") > $"a.max_dd" ) )
, "leftOuter")
.withColumn("new_mean", if(acc_new_mean == null) lit(second_df("mean")) else acc_new_mean )
APPROACH 1 :
If you are finding it difficult to join the 2 dataframes using the DataFrame join API, you could use SQL if you are more comfortable with it. For that you can register your 2 dataframes as temporary tables in Spark and write SQL on top of them.
second_df.registerTempTable("table_second_df")
first_df.registerTempTable("table_first_df")
val new_df = spark.sql("select distinct s.* from table_second_df s join table_first_df f on s.company_id=f.company_id")
new_df.show()
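As an aside, if all you need is the rows of second_df whose company_id also appears in first_df, a left_semi join does that directly, without pulling in columns from first_df or needing distinct (a sketch):
// left_semi keeps only second_df's columns and only the rows
// whose company_id has at least one match in first_df
val filtered_df = second_df.join(first_df, Seq("company_id"), "left_semi")
filtered_df.show()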
As you requested, I have added the logic.
Consider your first_df looks like below :
+----------+----------+----------+----+-----+
|company_id| max_dd| min_dd|mean|count|
+----------+----------+----------+----+-----+
| A|2019-04-05|2019-04-01| 10| 100|
| A|2019-04-06|2019-04-02| 20| 200|
| B|2019-04-08|2019-04-01| 30| 300|
| B|2019-04-09|2019-04-02| 40| 400|
+----------+----------+----------+----+-----+
Consider your second_df looks like below :
+----------+----------+----+-----+
|company_id| max_dd|mean|count|
+----------+----------+----+-----+
| A|2019-04-03| 10| 100|
| A|2019-04-02| 20| 200|
+----------+----------+----+-----+
Since company id A is there in the second table, I have taken the latest max_dd record from second_df. For company id B, which is not in second_df, I took the latest max_dd record from first_df.
Please find the code below.
first_df.registerTempTable("table_first_df")
second_df.registerTempTable("table_second_df")
val new_df = spark.sql("""
  select company_id, max_dd, min_dd, mean, count
  from (
    select distinct s.company_id, s.max_dd, null as min_dd, s.mean, s.count,
           row_number() over (partition by s.company_id order by s.max_dd desc) rno
    from table_second_df s
    join table_first_df f on s.company_id = f.company_id
  ) where rno = 1
  union
  select company_id, max_dd, min_dd, mean, count
  from (
    select distinct f.*,
           row_number() over (partition by f.company_id order by f.max_dd desc) rno
    from table_first_df f
    left join table_second_df s on s.company_id = f.company_id
    where s.company_id is null
  ) where rno = 1
""")
new_df.show()
Below is the result :
APPROACH 2 :
Instead of creating temp tables as in Approach 1, you can use the DataFrame join API. This is the same logic as Approach 1, but here I am using the DataFrame API to accomplish it. Please don't forget to import org.apache.spark.sql.expressions.Window, as I have used Window.partitionBy in the code below.
val new_df = second_df.as('s)
  .join(first_df.as('f), $"s.company_id" === $"f.company_id", "inner")
  .drop($"min_dd")
  .withColumn("min_dd", lit(""))
  .select($"s.company_id", $"s.max_dd", $"min_dd", $"s.mean", $"s.count")
  .dropDuplicates
  .withColumn("Rno", row_number().over(Window.partitionBy($"s.company_id").orderBy($"s.max_dd".desc)))
  .filter($"Rno" === 1)
  .drop($"Rno")
  .union(
    first_df.as('f)
      .join(second_df.as('s), $"s.company_id" === $"f.company_id", "left_anti")
      .select($"f.company_id", $"f.max_dd", $"f.min_dd", $"f.mean", $"f.count")
      .dropDuplicates
      .withColumn("Rno", row_number().over(Window.partitionBy($"f.company_id").orderBy($"f.max_dd".desc)))
      .filter($"Rno" === 1)
      .drop($"Rno")
  )
new_df.show()
Below is the result :
Please let me know if you have any questions.
val acc_new_mean = // new mean literal
val acc_new_count = // new count literal

val resultDf = computed_df.join(accumulated_results_df.as("a"),
    $"a.company_id" === computed_df("company_id"), "leftOuter")
  .withColumn("new_mean", when(acc_new_mean.isNull, computed_df("mean")).otherwise(acc_new_mean))
  .withColumn("new_count", when(acc_new_count.isNull, computed_df("count")).otherwise(acc_new_count))
  .select(
    computed_df("company_id"),
    computed_df("max_dd"),
    col("new_mean").as("mean"),
    col("new_count").as("count")
  )
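For reference, the two placeholder literals above can be defined as column expressions following the formula from the original attempt. This is only a sketch; it assumes computed_df plays the role of second_df and accumulated_results_df the role of first_df, so the names are illustrative.
// Column expressions; both evaluate to null whenever the accumulated side
// has no matching row in the left outer join, which the when(...isNull) checks then catch
val acc_new_mean  = (computed_df("mean") + accumulated_results_df("mean")) /
                    (computed_df("count") + accumulated_results_df("count"))
val acc_new_count = computed_df("count") + accumulated_results_df("count")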

How to add List[String] values to a single column in Dataframe

I have a dataframe and a list of values (possibly a list of strings), and I want to create a new column in my dataframe and add those list values as the column values of this new column. I tried
val x = List("def", "cook", "abc")
val c_df = null
x.foldLeft(c_df)((df, column) => df.withColumn("newcolumnname" , lit(column)))
but it throws a StackOverflow exception. I also tried iterating over the list of string values and adding to the dataframe, but the result is a list of dataframes, while all I want is a single dataframe.
Please help!
here is the sample input and output dataframe:
You can try the code below.
Create First DataFrame with Index.
from pyspark.sql.functions import *
from pyspark.sql import Window
w = Window.orderBy("Col2")
df = spark.createDataFrame([("a", 10), ("b", 20), ("c", 30)], ["Col1", "Col2"])
df1 = df.withColumn("index", row_number().over(w))
df1.show()
Create Another DataFrame from List of Values.
from pyspark.sql.types import *
newdf = spark.createDataFrame(['x', 'y', 'z'], StringType())
newdf.show()
Add Index column to DF created from List of values in step 2.
w = Window.orderBy("value")
df2 = newdf.withColumn("index", row_number().over(w))
df2.show()
Join the DataFrames df1 and df2 based on the index.
df1.join(df2, "index").show()
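Since the question itself is in Scala, here is a rough Scala equivalent of the steps above (a sketch; like the PySpark version, it assumes the list values should be attached to rows by position via a generated index):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val base = Seq(("a", 10), ("b", 20), ("c", 30)).toDF("Col1", "Col2")
val values = List("x", "y", "z").toDF("value")

// add a row_number-based index to both sides, then join on it
val indexedBase = base.withColumn("index", row_number().over(Window.orderBy("Col2")))
val indexedValues = values.withColumn("index", row_number().over(Window.orderBy("value")))

indexedBase.join(indexedValues, Seq("index")).drop("index").show()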
There is a function array in Spark 1.4 or later that takes an array of Columns and returns a new Column. Function lit takes a Scala value and returns a Column type.
import spark.implicits._
val df = Seq(1, 2, 3).toDF("col1")
df.withColumn("new_col", array(lit("def"), lit("cook"), lit("abc"))).show
+----+----------------+
|col1| new_col|
+----+----------------+
| 1|[def, cook, abc]|
| 2|[def, cook, abc]|
| 3|[def, cook, abc]|
+----+----------------+
In Spark 2.2.0 and later there is a function typedLit that takes Scala types and returns a Column. This function can handle parameterized Scala types, e.g. List, Seq and Map.
val newDF = df.withColumn("new_col", typedLit(List("def", "cook", "abc")))
newDF.show()
newDF.printSchema()
+----+----------------+
|col1| new_col|
+----+----------------+
| 1|[def, cook, abc]|
| 2|[def, cook, abc]|
| 3|[def, cook, abc]|
+----+----------------+
root
|-- col1: integer (nullable = false)
|-- new_col: array (nullable = false)
| |-- element: string (containsNull = true)
Is this what you wanted to do? You can add when to conditionally add different lists to different rows, as in the sketch below.
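A small sketch of that last idea, reusing the df from the example above, with a hypothetical condition on col1 chosen purely for illustration:
import org.apache.spark.sql.functions.{col, typedLit, when}

df.withColumn("new_col",
  when(col("col1") === 1, typedLit(List("def", "cook", "abc")))
    .otherwise(typedLit(List("x", "y", "z")))
).show()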

Pyspark - passing list/tuple to toDF function

I have a dataframe and want to rename its columns using toDF by passing the column names from a list. The column list is dynamic; when I do as below I get an error. How can I achieve this?
>>> df.printSchema()
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- dept: string (nullable = true)
columns = ['NAME_FIRST', 'DEPT_NAME']
df2 = df.toDF('ID', 'NAME_FIRST', 'DEPT_NAME')
(or)
df2 = df.toDF('id', columns[0], columns[1])
This does not work if we don't know how many columns there will be in the input data frame, so I want to pass the list itself to toDF. I tried as below:
df2 = df.toDF('id', columns)
pyspark.sql.utils.IllegalArgumentException: u"requirement failed: The number of columns doesn't match.\nOld column names (3): id, name, dept\nNew column names (2): id, name_first, dept_name"
Here it treats the list as a single item. How can I pass the columns from the list?
df2 = df.toDF(columns) does not work; add a * like below:
columns = ['NAME_FIRST', 'DEPT_NAME']
df2 = df.toDF(*columns)
"*" is the "splat" operator: It takes a list as input, and expands it into actual positional arguments in the function call
What you tried is correct, except you did not add all the columns to your "columns" list.
This will work:
columns = ['ID','NAME_FIRST', 'DEPT_NAME']
df2 = df.toDF(columns)
Updating the answer with all the steps I followed in PySpark:
data = [(1, 'a', 'b'), (2, 'c', 'd'), (3, 'e', 'f')]
df = sc.parallelize(data)
columns = ['ID', 'NAME_FIRST', 'DEPT_NAME']
# here df is an RDD, and RDD.toDF accepts a list of column names as its schema
df2 = df.toDF(columns)