How to calculate age from birth date in pyspark?

I am calculating age from a birth date in PySpark:

def run(first):
    out = spark.sql("""
        SELECT
            p.birth_date
        FROM table1 p
        LEFT JOIN table2 a USING(id)
        LEFT JOIN table2 m ON m.id = p.id
        LEFT JOIN table4 i USING(id)""")
    out = out.withColumn('month', F.lit(first))
    out = out.withColumn('age',
        F.when(F.col('birth_date').isNull(), None).otherwise(
            F.floor(F.datediff(
                F.col('month'), F.col('birth_date'))/365.25)))
I get the following error at this line:
F.col('month'), F.col('birth_date'))/365.25)))
TypeError: unsupported operand type(s) for -: 'DataFrame' and 'DataFrame'
Any ideas on how to resolve this?

The problem is likely due to mixing data types. Since I am not sure what data types your columns are, here is a solution that works with both TimestampType and DateType columns:
from datetime import datetime

from pyspark.sql import functions as F
from pyspark.sql import types as T

df = spark.createDataFrame(
    data=[
        (1, "foo", datetime.strptime("1999-12-19", "%Y-%m-%d"), datetime.strptime("1999-12-19", "%Y-%m-%d").date()),
        (2, "bar", datetime.strptime("1989-12-14", "%Y-%m-%d"), datetime.strptime("1989-12-14", "%Y-%m-%d").date()),
    ],
    schema=T.StructType([
        T.StructField("id", T.IntegerType(), True),
        T.StructField("name", T.StringType(), True),
        T.StructField("birth_ts", T.TimestampType(), True),
        T.StructField("birth_date", T.DateType(), True),
    ])
)
df = df.withColumn("age_ts", F.floor(F.datediff(F.current_timestamp(), F.col("birth_ts"))/365.25))
df = df.withColumn("age_date", F.floor(F.datediff(F.current_date(), F.col("birth_date"))/365.25))
df.show()
+---+----+-------------------+----------+------+--------+
| id|name|           birth_ts|birth_date|age_ts|age_date|
+---+----+-------------------+----------+------+--------+
|  1| foo|1999-12-19 00:00:00|1999-12-19|    22|      22|
|  2| bar|1989-12-14 00:00:00|1989-12-14|    32|      32|
+---+----+-------------------+----------+------+--------+
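Applied back to the original function, a minimal sketch could look like the following. It assumes birth_date comes out of the query as a DateType column and that first is passed in as a 'YYYY-MM-DD' string (the table and join names are taken from the question). Note that datediff already returns NULL when either argument is NULL, so the explicit when guard is optional:

from pyspark.sql import functions as F

def run(first):
    # first is assumed to be a 'YYYY-MM-DD' string marking the reference date
    out = spark.sql("""
        SELECT p.birth_date
        FROM table1 p
        LEFT JOIN table2 a USING(id)
        LEFT JOIN table2 m ON m.id = p.id
        LEFT JOIN table4 i USING(id)""")
    # to_date turns the string literal into a DateType column, so datediff
    # compares two dates and returns an integer number of days (or NULL)
    out = out.withColumn('month', F.to_date(F.lit(first)))
    out = out.withColumn('age',
        F.floor(F.datediff(F.col('month'), F.col('birth_date')) / 365.25))
    return out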

Related

How to convert a 'datetime' column

In my PySpark dataframe, I have a column 'TimeStamp' which is in DateTime format. I want to convert it to 'Date' format and then use that in a 'GroupBy'.
df = spark.sql("SELECT * FROM `myTable`")
df.filter((df.somthing!="thing"))
df.withColumn('MyDate', col('Timestamp').cast('date'))
df.groupBy('MyDate').count().show()
But I get this error:
cannot resolve 'MyDate' given input columns:
Can you please help me with this?
Each time you call a transformation on df, you create a new DataFrame. df was only assigned in your first line of code, so that DataFrame object does not have the new column MyDate.
You can look at the id() of each object to see this:
df = spark.sql("SELECT * FROM `myTable`")
print(id(df))
print(id(df.filter(df.somthing!="thing")))
This is the correct syntax to chain the operations:
df = spark.sql("SELECT * FROM myTable")
df = (df
.filter(df.somthing != "thing")
.withColumn('MyDate', col('Timestamp').cast('date'))
.groupBy('MyDate').count()
)
df.show(truncate=False)
UPDATE: this is a better way to write it:

df = (
    spark.sql(
        """
        SELECT *
        FROM myTable
        """)
    .filter(col("somthing") != "thing")
    .withColumn("MyDate", col("Timestamp").cast("date"))
    .groupBy("MyDate").count()
)
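Alternatively, the cast and aggregation can be pushed into the SQL itself; a minimal sketch, reusing the table and column names from the question:

df = spark.sql("""
    SELECT CAST(Timestamp AS DATE) AS MyDate,
           COUNT(*) AS cnt
    FROM myTable
    WHERE somthing != 'thing'
    GROUP BY CAST(Timestamp AS DATE)
""")
df.show(truncate=False)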

cannot select columns in a table as one of the column name is limit

The code below is causing issues because it has a column whose name is the keyword
limit. If I remove the column 'limit' from the select list, the script works fine.
There are two tables, A and B. Table A has the following contents:

ID  Day         Name     Description  limit
1   2016-09-01  Sam      Retail       100
2   2016-01-28  Chris    Retail       200
3   2016-02-06  ChrisTY  Retail       50
4   2016-02-26  Christa  Retail       10
3   2016-12-06  ChrisTu  Retail       200
4   2016-12-31  Christi  Retail       500
Table B has the following contents:

ID  SkEY
1   1.1
2   1.2
3   1.3
Tried code:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)

ABC2 = spark.sql(
    "select * From A where day ='{0}'".format(i[0])
)

Join = ABC2.join(
    Tab2,
    ABC2.ID == Tab2.ID
).select(
    Tab2.skey,
    ABC2.Day,
    ABC2.Name,
    ABC2.limit,
).withColumn('newcol1', lit('')) \
 .withColumn('newcol2', lit('A'))

ABC2.show()

ABC = spark.sql(
    "select distinct day from A where day= '2016-01-01' "
)
Expected result: how can we amend the code so that the limit column is also selected?
It worked this way. I am not sure of the functional reason, but it was successful: rename limit to an alias beforehand, and then alias it back afterwards.
Working code:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)

ABC2 = spark.sql(
    "select ID,Day,Name,Description,limit as liu From A where day ='{0}'".format(i[0])
)

Join = ABC2.join(
    Tab2,
    ABC2.ID == Tab2.ID
).selectExpr(
    "skey as skey",
    "Day as Day",
    "Name as Name",
    "liu as limit",
).withColumn('newcol1', lit('')) \
 .withColumn('newcol2', lit('A'))

ABC2.show()
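As a side note, the reason ABC2.limit misbehaves is likely that limit is also a DataFrame method in PySpark, so attribute access returns the method rather than the column. Bracket notation or F.col() avoids the rename entirely; a minimal sketch with toy data:

from pyspark.sql import functions as F

# Toy DataFrame with a column literally named "limit"
df = spark.createDataFrame([(1, 100), (2, 200)], ["ID", "limit"])

# df.limit would resolve to the DataFrame.limit() method, not the column,
# so reference the column with bracket notation or F.col() instead
df.select(df["limit"]).show()
df.select(F.col("limit")).show()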

To create a new column based on the joining column of two data frames using scala

I have two tables with columns table1 has id,name
and table2 has only id
table 1
--------------
id name
--------------
1 sudheer
2 sandeep
3 suresh
----------------
table2
--------
id
--------
1
2
-------
The required table should be: if the "id" doesn't exist in table2, my new column value should be "N", otherwise "Y".
table3
id name IND
1 sudheer Y
2 sandeep Y
3 suresh N
I have tried the below steps to approach:
val df = hc.sql("select * from table1")
val df1 = hc.sql("select * from table2")
I tried adding one more column (phone) to table2; since my joined dataframe doesn't contain the id from table2, I tried to set the value to Y/N based on that null value:
val df2 = df.join(df1,Seq("id"),"left_outer").withColumn("IND",exp(when(df1("phone")!= "null","Y").otherwise("N")))
But this didn't work out; it fails with the error:
found : Boolean
required: org.apache.spark.sql.Column
Can anyone suggest any idea how to get the required result without adding a column to my table2?
You can add a new column to table2 with the default value "Y", join, and replace the null values with "N":
val df1 = Seq(
(1, "sudheer"),
(2, "sandeep"),
(3, "suresh")
).toDF("id", "name")
val df2 = Seq(1, 2).toDF("id")
.withColumn("IND", lit("Y"))
val df3 = df1.join(df2, Seq("id"), "left_outer")
.na.fill("N")
Or you can use when, as you did:
val df3 = df1.join(df2, Seq("id"), "left_outer")
.withColumn("IND", when($"IND".isNull, "N").otherwise("Y"))
Hope this helps!
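For reference, the same left_outer approach written in PySpark, as a minimal sketch built from the sample data in the question:

from pyspark.sql import functions as F

df1 = spark.createDataFrame(
    [(1, "sudheer"), (2, "sandeep"), (3, "suresh")], ["id", "name"]
)
df2 = spark.createDataFrame([(1,), (2,)], ["id"]).withColumn("IND", F.lit("Y"))

# The left join keeps every row of df1; IND is null where the id is missing from df2
df3 = df1.join(df2, ["id"], "left_outer").na.fill({"IND": "N"})
df3.show()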

Spark SQL 1.5.2: left excluding join

Given dataframes df_a and df_b, how can I achieve the same result as left excluding join:
SELECT df_a.*
FROM df_a
LEFT JOIN df_b
ON df_a.id = df_b.id
WHERE df_b.id is NULL
I've tried:
df_a.join(df_b, df_a("id")===df_b("id"), "left")
.select($"df_a.*")
.where(df_b.col("id").isNull)
I get an exception from the above:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()
If you wish to do it through dataframes, try the example below:
import sqlContext.implicits._
val df1 = sc.parallelize(List("a", "b", "c")).toDF("key1")
val df2 = sc.parallelize(List("a", "b")).toDF("key2")
import org.apache.spark.sql.functions._
df1.join(df2, df1.col("key1") <=> df2.col("key2"), "left")
  .filter(col("key2").isNull)
  .show
You would get this output:
+----+----+
|key1|key2|
+----+----+
| c|null|
+----+----+
You can try executing the SQL query itself, keeping it simple:
df_a.registerTempTable("TableA")
df_b.registerTempTable("TableB")
result = sqlContext.sql("SELECT * FROM TableA A \
LEFT JOIN TableB B \
ON A.id = B.id \
WHERE B.id is NULL ")
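Note that Spark 2.0 and later also support a left anti join, which expresses this pattern directly (the question targets Spark 1.5.2, where this join type is not available); a minimal PySpark sketch with toy data:

df_a = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "val"])
df_b = spark.createDataFrame([(1,), (2,)], ["id"])

# left_anti keeps only the rows of df_a whose id has no match in df_b
df_a.join(df_b, on="id", how="left_anti").show()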

Merging Dataframes in Spark

I've 2 Dataframes, say A & B. I would like to join them on a key column & create another Dataframe. When the keys match in A & B, I need to get the row from B, not from A.
For example:
DataFrame A:
Employee1, salary100
Employee2, salary50
Employee3, salary200
DataFrame B
Employee1, salary150
Employee2, salary100
Employee4, salary300
The resulting DataFrame should be:
DataFrame C:
Employee1, salary150
Employee2, salary100
Employee3, salary200
Employee4, salary300
How can I do this in Spark & Scala?
Try:
dfA.registerTempTable("dfA")
dfB.registerTempTable("dfB")
sqlContext.sql("""
SELECT coalesce(dfA.employee, dfB.employee),
coalesce(dfB.salary, dfA.salary) FROM dfA FULL OUTER JOIN dfB
ON dfA.employee = dfB.employee""")
or
sqlContext.sql("""
SELECT coalesce(dfA.employee, dfB.employee),
CASE dfB.employee IS NOT NULL THEN dfB.salary
CASE dfB.employee IS NOT NULL THEN dfA.salary
END FROM dfA FULL OUTER JOIN dfB
ON dfA.employee = dfB.employee""")
Assuming dfA and dfB have two columns, emp and sal, you can use the following:
import org.apache.spark.sql.{functions => f}
val dfB1 = dfB
.withColumnRenamed("sal", "salB")
.withColumnRenamed("emp", "empB")
val joined = dfA
.join(dfB1, 'emp === 'empB, "outer")
.select(
f.coalesce('empB, 'emp).as("emp"),
f.coalesce('salB, 'sal).as("sal")
)
NB: each DataFrame should have only one row for a given value of emp.
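For completeness, the same merge written in PySpark, as a minimal sketch using the sample data from the question (the column names emp and sal are assumed, as in the answer above):

from pyspark.sql import functions as F

dfA = spark.createDataFrame(
    [("Employee1", 100), ("Employee2", 50), ("Employee3", 200)], ["emp", "sal"]
)
dfB = spark.createDataFrame(
    [("Employee1", 150), ("Employee2", 100), ("Employee4", 300)], ["emp", "sal"]
)

# Rename B's columns so both sides survive the join unambiguously
dfB1 = dfB.withColumnRenamed("emp", "empB").withColumnRenamed("sal", "salB")

# Full outer join, preferring B's row whenever the key exists in both frames
dfC = (
    dfA.join(dfB1, dfA["emp"] == dfB1["empB"], "outer")
       .select(
           F.coalesce(dfB1["empB"], dfA["emp"]).alias("emp"),
           F.coalesce(dfB1["salB"], dfA["sal"]).alias("sal"),
       )
)
dfC.show()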