I have a select statement whose result I am storing in a dataframe:
val df = spark.sqlContext.sql("select prty_tax_govt_issu_id from CST_EQUIFAX.eqfx_prty_emp_incm_info where emp_mtch_cd = 'Y' and emp_mtch_actv_rcrd_in = 'Y' and emp_sts_in = 'A'");
I then want to take this dataframe and keep ONLY the unique records. That is, determine all duplicates on the prty_tax_govt_issu_id field, and if an id has duplicates, remove not only the duplicate(s) but every record that has that prty_tax_govt_issu_id.
So the original dataframe may look like:
+---------------------+
|prty_tax_govt_issu_id|
+---------------------+
| 000000005|
| 000000012|
| 000000012|
| 000000028|
| 000000038|
+---------------------+
The new dataframe should look like:
+---------------------+
|prty_tax_govt_issu_id|
+---------------------+
| 000000005|
| 000000028|
| 000000038|
+---------------------+
Not sure if I need to do this after I store the data in the dataframe or if I can just get that result in my select statement. Thanks :)
Count the number of rows per id and select the ones with count = 1.
val df = spark.sql("select prty_tax_govt_issu_id from CST_EQUIFAX.eqfx_prty_emp_incm_info where emp_mtch_cd = 'Y' and emp_mtch_actv_rcrd_in = 'Y' and emp_sts_in = 'A'")
// Get counts per id
val counts = df.groupBy("prty_tax_govt_issu_id").count()
// Filter for ids having only one row
counts.filter($"count" === 1).select($"prty_tax_govt_issu_id").show()
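If the query ever selects more columns than just the id, one way to keep the full rows (a minimal sketch, assuming the same df and counts as above; uniqueIds and uniqueRows are names introduced here) is to join the count = 1 ids back to the original dataframe:
// Ids that occur exactly once
val uniqueIds = counts.filter($"count" === 1).select("prty_tax_govt_issu_id")
// Keep only the rows whose id is unique
val uniqueRows = df.join(uniqueIds, Seq("prty_tax_govt_issu_id"))
uniqueRows.show()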
In SQL, you could do
val df = spark.sql("""
select prty_tax_govt_issu_id
from CST_EQUIFAX.eqfx_prty_emp_incm_info
where emp_mtch_cd = 'Y' and emp_mtch_actv_rcrd_in = 'Y' and emp_sts_in = 'A'
group by prty_tax_govt_issu_id
having count(*)=1
""")
df.show()
A GROUP BY clause with a HAVING filter would do it:
select prty_tax_govt_issu_id
from CST_EQUIFAX.eqfx_prty_emp_incm_info
where emp_mtch_cd = 'Y'
and emp_mtch_actv_rcrd_in = 'Y'
and emp_sts_in = 'A'
GROUP BY prty_tax_govt_issu_id
HAVING count(*) = 1
This is my current code:
from pyspark.sql.functions import posexplode_outer
impcomp = ['connectors', 'contract_no', 'document_confidentiality', 'document_type', 'drawing_no', 'equipment_specifications', 'external_drawings', 'is_psi', 'line_numbers', 'owner_no', 'plant', 'project_title', 'psi_category', 'revision', 'revision_date', 'revision_status', 'tags', 'unit']
for el in impcomp:
    df1 = df3.select(df[Pk[1]], posexplode_outer(df3[el]))
    df1 = df1.where(df1.pos != '1')
    df1 = df1.drop('pos')
    df1 = df1.withColumnRenamed('col', el)
    dfu = df4.join(df1, df4.DocumentNo == df1.DocumentNo, "left")
display(dfu)
What I want is for the final dataframe (dfu) to accumulate a column for each element, appending it to the main dataframe. Instead, my current code overwrites the previous element's column, leaving the final dataframe (dfu) as: dfu + the 'unit' column. Is there any way for me to store the value of each column iterated over in the for loop without overwriting the previous element?
Expected result:
Document | Author | connectors | contract_no .........|unit|
A | AA | 12 | C13 |Z12
Current result:
Document | Author | unit|
A | AA | Z12
Thanks in advance.
dfs = []
for el in impcomp:
    df1 = df3.select(df[Pk[1]], posexplode_outer(df3[el]))
    df1 = df1.where(df1.pos != '1')
    df1 = df1.drop('pos')
    df1 = df1.withColumnRenamed('col', el)
    dfs.append(df1[el])
df6 = reduce(df4.union(dfs))
I have tried this but it returns an error:
AttributeError: 'list' object has no attribute '_jdf'
I have a Spark dataframe df:
|id | year | month |
-------------------
| 1 | 2020 | 01 |
| 2 | 2019 | 03 |
| 3 | 2020 | 01 |
I have a sequence year_month = Seq((2019, 1), (2020, 1), (2021, 1)).
The year_month sequence gets generated dynamically each time the code runs.
I want to filter the dataframe df based on the year_month sequence, keeping the rows where (year = pair._1 and month = pair._2) for each value pair in year_month.
You can achieve this as follows:
Create a dataframe from year_month
Perform an inner join between it and your original dataframe on year and month
Choose the distinct records
The resulting dataframe contains the matched rows.
Working Example
Setup
import spark.implicits._
val dfData = Seq((1,2020,1),(2,2019,3),(3,2020,1))
val df = dfData.toDF()
.selectExpr("_1 as id", "_2 as year", "_3 as month")
df.createOrReplaceTempView("original_data")
val year_month = Seq((2019,1),(2020,1),(2021,1))
Step 1
// Create Temporary DataFrame
val yearMonthDf = year_month.toDF()
.selectExpr("_1 as year","_2 as month" )
yearMonthDf.createOrReplaceTempView("temp_year_month")
Step 2
val dfResult = spark.sql("select o.id, o.year, o.month from original_data o inner join temp_year_month t on o.year = t.year and o.month = t.month")
Step 3
val dfResultDistinct = dfResult.distinct()
Output
dfResultDistinct.show()
+---+----+-----+
| id|year|month|
+---+----+-----+
| 1|2020| 1|
| 3|2020| 1|
+---+----+-----+
NB: if you are interested in finding the matching records irrespective of the id, you could update the Spark SQL to the following (o.id has been removed):
select
o.year,
o.month
from
original_data o
inner join
temp_year_month t on o.year = t.year and
o.month = t.month
which would give the following result:
+----+-----+
|year|month|
+----+-----+
|2020| 1|
+----+-----+
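As an aside, the same filtering can be written with the DataFrame API instead of a temporary view; a minimal sketch, assuming the df and yearMonthDf defined above (dfResultApi is a name introduced here):
// Inner join on both columns, then de-duplicate
val dfResultApi = df.join(yearMonthDf, Seq("year", "month")).select("id", "year", "month").distinct()
dfResultApi.show()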
I have a dataframe with content like below:
scala> patDF.show
+---------+-------+-----------+-------------+
|patientID| name|dateOtBirth|lastVisitDate|
+---------+-------+-----------+-------------+
| 1001|Ah Teck| 1991-12-31| 2012-01-20|
| 1002| Kumar| 2011-10-29| 2012-09-20|
| 1003| Ali| 2011-01-30| 2012-10-21|
+---------+-------+-----------+-------------+
All the columns are strings.
I want to get the list of records whose lastVisitDate falls between a given date (in yyyy-mm-dd format) and now, so here is the script:
patDF.registerTempTable("patients")
val results2 = sqlContext.sql("SELECT * FROM patients WHERE from_unixtime(unix_timestamp(lastVisitDate, 'yyyy-mm-dd')) between '2012-09-15' and current_timestamp() order by lastVisitDate")
results2.show()
It gets me nothing; presumably, there should be records with patientID 1002 and 1003.
So I modified the query to:
val results3 = sqlContext.sql("SELECT from_unixtime(unix_timestamp(lastVisitDate, 'yyyy-mm-dd')), * FROM patients")
results3.show()
Now I get:
+-------------------+---------+-------+-----------+-------------+
| _c0|patientID| name|dateOtBirth|lastVisitDate|
+-------------------+---------+-------+-----------+-------------+
|2012-01-20 00:01:00| 1001|Ah Teck| 1991-12-31| 2012-01-20|
|2012-01-20 00:09:00| 1002| Kumar| 2011-10-29| 2012-09-20|
|2012-01-21 00:10:00| 1003| Ali| 2011-01-30| 2012-10-21|
+-------------------+---------+-------+-----------+-------------+
If you look at the first column, you will see all the months were somehow changed to 01
What's wrong with the code?
The correct format for year-month-day is yyyy-MM-dd; in the date pattern, lowercase mm means minutes, not months, which is why every month was parsed as 01:
val patDF = Seq(
(1001, "Ah Teck", "1991-12-31", "2012-01-20"),
(1002, "Kumar", "2011-10-29", "2012-09-20"),
(1003, "Ali", "2011-01-30", "2012-10-21")
).toDF("patientID", "name", "dateOtBirth", "lastVisitDate")
patDF.createOrReplaceTempView("patTable")
val result1 = spark.sqlContext.sql("""
select * from patTable where to_timestamp(lastVisitDate, 'yyyy-MM-dd')
between '2012-09-15' and current_timestamp() order by lastVisitDate
""")
result1.show
// +---------+-----+-----------+-------------+
// |patientID| name|dateOtBirth|lastVisitDate|
// +---------+-----+-----------+-------------+
// | 1002|Kumar| 2011-10-29| 2012-09-20|
// | 1003| Ali| 2011-01-30| 2012-10-21|
// +---------+-----+-----------+-------------+
You can also use the DataFrame API, if you prefer:
import org.apache.spark.sql.functions.{to_timestamp, lit, current_timestamp}
import spark.implicits._
val result2 = patDF.where(to_timestamp($"lastVisitDate", "yyyy-MM-dd").
  between(to_timestamp(lit("2012-09-15"), "yyyy-MM-dd"), current_timestamp())
).orderBy($"lastVisitDate")
I would like to get data from Hive such that a row is selected only if one of its column values is in a given List.
Example Data in Hive table is as:
Col1   | Col2 | Col3
-------+------+--------
Joe    | 32   | Place-1
Nancy  | 28   | Place-2
Shalyn | 35   | Place-1
Andy   | 20   | Place-3
I am querying the Hive table as:
val name = List("Sherley","Joe","Shalyan","Dan")
var dataFromHive = sqlCon.sql("select Col1,Col2,Col3 from default.NameInfo where Col1 in (${name})")
I know that my query is wrong, as it's throwing an error, but I am not able to find a proper replacement for where Col1 in (${name}).
A better idea is converting name to a DataFrame and joining it with the Hive table. The inner join has the same effect as the filter, keeping only the intersecting rows.
import sqlCon.implicits._
val nameDf = List("Sherley","Joe","Shalyan","Dan").toDF("Col1")
val dataFromHive = sqlCon.table("default.NameInfo").join(nameDf, "Col1").select("Col1", "Col2", "Col3")
Try to use the DataFrame API. It will make the code easy to read.
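For instance, a minimal sketch of a pure DataFrame-API filter using Column.isin (dataFromHiveIsin is a name introduced here; it assumes the same name list and sqlCon as above):
import org.apache.spark.sql.functions.col
val dataFromHiveIsin = sqlCon.table("default.NameInfo")
  .filter(col("Col1").isin(name: _*))  // keep rows whose Col1 is in the list
  .select("Col1", "Col2", "Col3")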
Convert your List to a String (with the proper format to use in the Hive query):
val name = List("Sherley","Joe","Shalyan","Dan")
val name_string = name.mkString("('","','", "')")
//name_string: String = ('Sherley','Joe','Shalyan','Dan')
val dataFromHive = sqlCon.sql("select Col1,Col2,Col3 from default.NameInfo where Col1 in " + name_string)
I use Spark 1.6. I broadcast a lookup map collected from an RDD and am not sure how to access the broadcast variable in the dataframes.
I have two dataframes employee & department.
Employee Dataframe
------------------
Emp Id | Emp Name | Emp_Age
---------------------------
1      | john     | 25
2      | David    | 35
Department Dataframe
--------------------
Dept Id | Dept Name | Emp Id
----------------------------
1       | Admin     | 1
2       | HR        | 2
import scala.collection.Map
val df_emp = hiveContext.sql("select * from emp")
val df_dept = hiveContext.sql("select * from dept")
val rdd = df_emp.rdd.map(row => (row.getInt(0),row.getString(1)))
val lkp = rdd.collectAsMap()
val bc = sc.broadcast(lkp)
print(bc.value.get(1).get)
// The statement below doesn't work
val combinedDF = df_dept.withColumn("emp_name",bc.value.get($"emp_id").get)
How do I refer the broadcast variable in the above combinedDF statement?
How to handle if the lkp doesn't return any value?
Is there a way to return multiple records from the lkp? (Let's assume there are 2 records for emp_id=1 in the lookup; I would like to get both records.)
How to return more than one value from broadcast...(emp_name & emp_age)
How do I refer the broadcast variable in the above combinedDF statement?
Use a udf. If emp_id is an Int:
import org.apache.spark.sql.functions.udf
val f = udf((emp_id: Int) => bc.value.get(emp_id))
df_dept.withColumn("emp_name", f($"emp_id"))
How to handle if the lkp doesn't return any value?
Don't call .get on the Option as shown in the print example above. The udf returns an Option, so missing keys simply become null in the resulting column; alternatively, supply a default with getOrElse.
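A minimal sketch of the getOrElse variant (the "unknown" default and the fSafe name are placeholders introduced here):
// Missing ids get a default value instead of null
val fSafe = udf((emp_id: Int) => bc.value.getOrElse(emp_id, "unknown"))
df_dept.withColumn("emp_name", fSafe($"emp_id"))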
Is there a way to return multiple records from the lkp
Use groupByKey:
val lkp = rdd.groupByKey.collectAsMap()
and explode (the udf has to be redefined against the new lookup so that it returns a Seq):
df_dept.withColumn("emp_name", f($"emp_id")).withColumn("emp_name", explode($"emp_name"))
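Putting that together, a minimal sketch assuming the rdd, sc and df_dept defined above (lkpMulti, bcMulti and fMulti are names introduced here):
import org.apache.spark.sql.functions.{udf, explode}
// Lookup now maps emp_id -> Seq of employee names
val lkpMulti = rdd.groupByKey.mapValues(_.toSeq).collectAsMap()
val bcMulti = sc.broadcast(lkpMulti)
val fMulti = udf((emp_id: Int) => bcMulti.value.getOrElse(emp_id, Seq.empty[String]))
// One output row per matching name
df_dept.withColumn("emp_name", explode(fMulti($"emp_id")))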
or just skip all the steps and broadcast:
import org.apache.spark.sql.functions._
df_emp.join(broadcast(df_dept), Seq("Emp Id"), "left")