Join parameterization in PySpark

I need to parameterize a join condition so that the joining columns are passed in from the CLI (command prompt), in PySpark.
My code is:
x1 = col(argv[1])
x2 = col(argv[2])
df = df1.join(df2, (df1.x1 == df2.x2))
This is how I run the script:
join.py empid empid
I get this error:
df has no such columns.
Any ideas on how to solve this?

Follow this approach; it will work even if your DataFrames are joining on columns with the same name. The key point is to access the columns with bracket notation (df1[x1]), which resolves the name at runtime, instead of df1.x1, which looks for a column literally called "x1":
# Simulating the CLI arguments; in the real script they come from sys.argv
argv = ['join.py', 'empid', 'empid']
x1 = argv[1]
x2 = argv[2]
df1 = spark.createDataFrame([(1, "A"), (2, "B")], ("empid", "c2"))
df2 = spark.createDataFrame([(1, "A"), (2, "B")], ("empid", "c2"))
# Bracket notation looks the join columns up by name at runtime
df = df1.join(df2, df1[x1] == df2[x2])
df.show()
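For reference, here is a minimal sketch of what the complete join.py could look like when the column names really do come from the command line; the use of sys.argv, spark-submit, and the app name are assumptions about how the script is invoked, not part of the original question:
# join.py -- e.g. run with: spark-submit join.py empid empid
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parameterized-join").getOrCreate()

# Join column names passed from the CLI (assumption: first two arguments)
x1 = sys.argv[1]
x2 = sys.argv[2]

df1 = spark.createDataFrame([(1, "A"), (2, "B")], ("empid", "c2"))
df2 = spark.createDataFrame([(1, "A"), (2, "B")], ("empid", "c2"))

# df1[x1] resolves the column by name at runtime, unlike df1.x1,
# which would look for a column literally named "x1"
df = df1.join(df2, df1[x1] == df2[x2])
df.show()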

Related

Pyspark set values based on column's condition

I have this dataframe
import pandas as pd
data = [['tom', 'kik', 1], ['nick', 'ken', 1], ['juli', 'ryan', 2]]
df = pd.DataFrame(data, columns=['Name', 'Name2', 'stat'])
df = spark.createDataFrame(df)
I need to make this transformation on the two columns (if stat == 1, then set both Name and Name2 to "toto"). The expected result is:
data = [['toto','toto', 1], ['toto','toto',1], ['juli','juli', 2]]
df = pd.DataFrame(data, columns=['Name','Name2', 'stat'])
df= spark.createDataFrame(df)
Use when together with otherwise, so that rows that don't match the condition keep their original values:
from pyspark.sql.functions import col, when
condition = (col("stat") == 1)
new_df = df.withColumn("Name", when(condition, "toto").otherwise(col("Name"))) \
    .withColumn("Name2", when(condition, "toto").otherwise(col("Name2")))
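If the same rule has to be applied to more columns, a small loop over the column names avoids repeating the withColumn calls. This is just a sketch under the same assumptions (df and condition as above; the column list is illustrative):
from pyspark.sql.functions import col, when

condition = col("stat") == 1
# Apply the same when/otherwise rule to each target column
for c in ["Name", "Name2"]:
    df = df.withColumn(c, when(condition, "toto").otherwise(col(c)))
df.show()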

Spark DataFrame's `except()` removes different items every time

var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.limit(3)
df2.show()
var df3 = df.except(df2)
df3.show()
Surprisingly, I found that except is not working the way it should. Here is my output:
df2 is created correctly and contains 1, 2 and 3, but my df3 still has 1, 2 and/or 3 in it. It's kind of random: if I run it multiple times, I get different results. Can anyone please help me? Thanks in advance.
You need to apply a Spark action to materialize the data required for df2 before performing the except operation. This ensures that df2 is computed beforehand and has fixed contents, which are then subtracted from df.
The randomness comes from Spark's lazy evaluation: Spark puts all of your code into one stage, so the contents of df2 are not fixed at the point where you perform the except on it. As per the Spark function definition for limit:
Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution) while limit returns a new Dataset.
Since limit returns a Dataset, it is evaluated lazily.
The code below will give you consistent output:
var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.head(3).map(_.getInt(0)).toList.toDF("num")  // head is an action, so these three values are fixed here
df2.show()
var df3 = df.except(df2)
df3.show()
The best way to test this is to simply create a new DataFrame that has the values you want to diff:
val df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
val df2 = List(1,2,3).toDF("num")
df2.show()
val df3 = df.except(df2)
df3.show()
Alternatively, just write a deterministic filter to select the rows you want:
val df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
val df2 = df.filter("num <= 3")
df2.show()
val df3 = df.except(df2)
df3.show()
Alternatively, you could use a left anti join, provided the column you are comparing on has unique values.
Example:
var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.limit(3)
df2.show()
var df3 = df.join(df2,Seq("num"),"leftanti")
df3.show()

Check whether every column in a Spark DataFrame has a certain value

Can we check whether every column in a Spark DataFrame contains a certain string (for example "Y"), using Spark SQL or Scala?
I have tried the following, but I don't think it is working properly.
df.select(df.col("*")).filter("'*' =='Y'")
Thanks,
Sai
You can do something like this to keep the rows where all columns contain 'Y':
//Get all columns
val columns: Array[String] = df.columns
//For each column, keep the rows with 'Y'
val seqDfs: Seq[DataFrame] = columns.map(name => df.filter(s"$name == 'Y'"))
//Intersect the filtered DataFrames so only rows where every column is 'Y' remain
val output: DataFrame = seqDfs.reduceRight(_ intersect _)
You can use the DataFrame method columns to get all of the column names
val columnNames: Array[String] = df.columns
and then add all of the filters in a loop
var filteredDf = df
for (name <- columnNames) {
  filteredDf = filteredDf.filter(s"$name == 'Y'")
}
or you can build a SQL query using the same approach.
If you want to keep every row in which any of the columns is equal to 1 (or anything else), you can dynamically create the query like this:
from pyspark.sql.functions import col, lit
cols = [col(c) == lit(1) for c in df.columns]
query = cols[0]
for c in cols[1:]:
    query |= c
df.filter(query).show()
It's a bit verbose, but it is very clear what is happening. A more elegant version would be:
from functools import reduce
res = df.filter(reduce(lambda x, y: x | y, (col(c) == lit(1) for c in df.columns)))
res.show()
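Note that the original question asks whether every column contains "Y"; under that reading, the same reduce pattern works with & and a string literal instead of | and 1. A sketch, assuming df is the questioner's DataFrame:
from functools import reduce
from pyspark.sql.functions import col, lit

# Keep only the rows where every single column equals 'Y'
all_y = reduce(lambda x, y: x & y, (col(c) == lit("Y") for c in df.columns))
df.filter(all_y).show()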

Dynamically join two Spark Scala DataFrames on multiple columns without hardcoding join conditions

I would like to join two Spark Scala DataFrames on multiple columns dynamically, avoiding the hard-coded column name comparisons shown in the following statement:
val joinRes = df1.join(df2, df1("col1") === df2("col1") && df1("col2") === df2("col2"))
A solution for this already exists for the PySpark version, provided in the following link:
PySpark DataFrame - Join on multiple columns dynamically
I would like to write the same code in Spark Scala.
In Scala you do it in a similar way to Python, but you need to use the map and reduce functions:
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val df1 = List("a,b", "b,c", "c,d").toDF("col1","col2")
val df2 = List("1,2", "2,c", "3,4").toDF("col1","col2")
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns
val joinExprs = columnsdf1
.zip(columnsdf2)
.map{case (c1, c2) => df1(c1) === df2(c2)}
.reduce(_ && _)
val dfJoinRes = df1.join(df2,joinExprs)

Unable to compare Spark SQL Date columns

I have a case class in scala
case class TestDate (id: String, loginTime: java.sql.Date)
I created two RDDs of type TestDate.
I wanted to do an inner join on the two RDDs where the values of the loginTime column are equal. Please find the code snippet below:
firstRDD.toDF.registerTempTable("firstTable")
secondRDD.toDF.registerTempTable("secondTable")
val res = sqlContext.sql("select * from firstTable INNER JOIN secondTable on to_date(firstTable.loginTime) = to_date(secondTable.loginTime)")
I'm not getting any exception, but I'm not getting the correct answer either. It does a Cartesian join, and some random dates are generated in the result.
The issue was due to a wrong format given while creating the date object. When the format was rectified, it worked fine.
You can try using another approach:
val df1 = firstRDD.toDF
val df2 = secondRDD.toDF
val res = df1.join(df2, Seq("loginTime"))
If it doesn't work, you can try casting your dates to string:
val df1 = firstRDD.toDF.withColumn("loginTimeStr", col("loginTime").cast("string"))
val df2 = secondRDD.toDF.withColumn("loginTimeStr", col("loginTime").cast("string"))
val res = df1.join(df2, Seq("loginTimeStr"))
Finally, maybe the problem is that you also need the ID column in the join?
val df1 = firstRDD.toDF
val df2 = secondRDD.toDF
val res = df1.join(df2, Seq("id", "loginTime"))