I have a problem with duplicates in my database, but I can't SELECT DISTINCT into another table because some columns hold unique data. I want to keep the last rating.
Example:
ID | ProductName | Code | Rating
---|-------------|------|-------
1  | Bag         | 1122 | 5
2  | Car         | 1133 | 2
3  | Bag         | 1122 | 3
4  | Car         | 1133 | 1
5  | Train       | 1144 | 1
As the result of the query I want to get:
ID | ProductName | Code | Rating
---|-------------|------|-------
3  | Bag         | 1122 | 3
4  | Car         | 1133 | 1
5  | Train       | 1144 | 1
One option uses a GROUP BY to identify the ID of the most recent duplicate for each Code group (ProductName is assumed to follow from Code):
SELECT t1.*
FROM yourTable t1
INNER JOIN
(
    SELECT Code, MAX(ID) AS ID
    FROM yourTable
    GROUP BY Code
) t2
    ON t1.Code = t2.Code AND
       t1.ID = t2.ID
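On a database that supports window functions (e.g. MySQL 8.0+), a ROW_NUMBER() variant is another possible sketch, assuming, as in the sample, that ID grows with insertion order:

SELECT ID, ProductName, Code, Rating
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY Code ORDER BY ID DESC) AS rn
    FROM yourTable t
) ranked
WHERE rn = 1;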
I have two tables I'm trying to join on date and account. In table one, the date field is continuous (daily) and of type string. In table two, a portion of the date field contains end-of-month dates only.
Edit: the sample provided below is an example of the tables I am working with. Originally I had only mentioned joining the ID and Date fields, but there are other fields in both tables that I am trying to keep in the final output. On a larger scale, Table 1 has thousands of IDs with records recorded daily across multiple years. Table 2 is similar, but at some point its date field switched from end-of-month data only to the same daily dates as Table 1. The tables below have been updated.
A sample of the data set could be seen as follows:
Table1:
| ID | Date         |
| -- | ------------ |
| 1  | "2022-01-30" |
| 1  | "2022-01-31" |
| 1  | "2022-02-01" |
| 1  | "2022-02-02" |
Table2:
| ID | Date         | Field_flag |
| -- | ------------ | ---------- |
| 1  | "2021-12-31" | a          |
| 1  | "2022-01-31" | a          |
| 1  | "2022-02-01" | a          |
| 1  | "2022-02-02" | b          |
| 1  | "2022-02-03" | b          |
Desired result:
| table1.ID | table1.Date  | table2.Date  | table2.Field_flag |
| --------- | ------------ | ------------ | ----------------- |
| 1         | "2022-01-30" | "2022-01-31" | a                 |
| 1         | "2022-01-31" | "2022-01-31" | a                 |
| 1         | "2022-02-01" | "2022-02-01" | a                 |
| 1         | "2022-02-02" | "2022-02-02" | b                 |
Are there any suggestions on how to approach this kind of result?
As a temporary workaround I'm splitting the date fields into month and year helper columns, but I would like something like the following inner join to work:
SELECT table1.*,
       table2.Date AS date_table2,
       table2.Field_flag
FROM table1
INNER JOIN table2
    ON table1.ID = table2.ID
   AND (table1.Date = table2.Date OR table1.Date < table2.Date)
One way to handle this would be a correlated subquery that finds, for each date in the first table, the same or the closest greater date.
SELECT
t1.ID,
t1.Date AS Date_t1,
(SELECT t2.Date FROM Table2 t2
WHERE t2.ID = t1.ID AND t2.Date >= t1.Date
ORDER BY t2.Date LIMIT 1) AS Date_t2
FROM Table1 t1
ORDER BY t1.ID, t1.Date;
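The desired result also includes Field_flag. One possible extension, assuming the flag should come from the same closest-date row, is a second correlated subquery with the same filter and ordering:

SELECT
    t1.ID,
    t1.Date AS Date_t1,
    (SELECT t2.Date FROM Table2 t2
     WHERE t2.ID = t1.ID AND t2.Date >= t1.Date
     ORDER BY t2.Date LIMIT 1) AS Date_t2,
    (SELECT t2.Field_flag FROM Table2 t2
     WHERE t2.ID = t1.ID AND t2.Date >= t1.Date
     ORDER BY t2.Date LIMIT 1) AS Flag_t2
FROM Table1 t1
ORDER BY t1.ID, t1.Date;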
This might not be the optimal way, but here is another option to throw into the mix: join the second table twice, once for the daily dates and once for the month-end dates. The lag/lead gaps computed below (28 days or more versus exactly 1 day) are what distinguish the month-end rows from the daily rows.
data2_2_sdf = spark.sql('''
select *,
datediff(dt, lag(dt) over (partition by id order by dt)) as dt_diff_lag,
datediff(lead(dt) over (partition by id order by dt), dt) as dt_diff_lead
from data2
''')
data2_2_sdf.createOrReplaceTempView('data2_2')
# +---+----------+----+-----------+------------+
# | id| dt|flag|dt_diff_lag|dt_diff_lead|
# +---+----------+----+-----------+------------+
# | 1|2021-12-31| a| null| 31|
# | 1|2022-01-31| a| 31| 1|
# | 1|2022-02-01| a| 1| 1|
# | 1|2022-02-02| b| 1| 1|
# | 1|2022-02-03| b| 1| null|
# +---+----------+----+-----------+------------+
spark.sql('''
select a.*, coalesce(b.dt, c.dt) as dt2, coalesce(b.flag, c.flag) as flag
from data1 a
left join (select * from data2_2 where dt_diff_lag>=28 or dt_diff_lead>=28) b
on a.id=b.id and year(a.dt)*100+month(a.dt)=year(b.dt)*100+month(b.dt)
left join (select * from data2_2 where dt_diff_lag=1 and coalesce(dt_diff_lead, 1)=1) c
on a.id=c.id and a.dt=c.dt
'''). \
show()
# +---+----------+----------+----+
# | id| dt| dt2|flag|
# +---+----------+----------+----+
# | 1|2022-02-01|2022-02-01| a|
# | 1|2022-01-31|2022-01-31| a|
# | 1|2022-01-30|2022-01-31| a|
# | 1|2022-02-02|2022-02-02| b|
# +---+----------+----------+----+
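As a side note, if the month-end rows don't need to be flagged explicitly, the "closest date greater than or equal to the Table 1 date" rule can also be written as a join plus row_number. A sketch in Spark SQL, assuming the data1 and data2 temp views used above:

select id, dt, dt2, flag
from (
    select a.id, a.dt, b.dt as dt2, b.flag,
           row_number() over (partition by a.id, a.dt order by b.dt) as rn
    from data1 a
    join data2 b
        on a.id = b.id and b.dt >= a.dt
) t
where rn = 1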
I want to pivot a column and then rank the data from the pivoted column. Here is sample data:
| id | objective | metric | score |
|----|-----------|-------------|-------|
| 1 | Sales | Total Sales | 10 |
| 1 | Marketing | Total Reach | 4 |
| 2 | Sales | Total Sales | 2 |
| 2 | Marketing | Total Reach | 11 |
| 3 | Sales | Total Sales | 9 |
This would be my expected output after pivot + rank:
| id | Sales | Marketing |
|----|--------|-----------|
| 1 | 1 | 2 |
| 2 | 3 | 1 |
| 3 | 2 | 3 |
The ranking is based on sum(score) for each objective. An objective can also have multiple metrics, but that isn't included in the sample for simplicity.
I have been able to successfully pivot and count the scores like so:
pivot = (
    spark.table('scoring_table')
    .select('id', 'objective', 'metric', 'score')
    .groupBy('id')
    .pivot('objective')
    .agg(
        sf.sum('score').alias('score')
    )
)
This then lets me see the total score per objective, but I'm unsure how to rank these. I have tried the following after aggregation:
.withColumn('rank', rank().over(Window.partitionBy('id', 'objective').orderBy(sf.col('score').desc())))
However, objective is no longer available at this point, since it has been pivoted away. I then tried this instead:
.withColumn('rank', rank().over(Window.partitionBy('id', 'Sales', 'Marketing').orderBy(sf.col('score').desc())))
But the score column is no longer available either. How can I rank these scores after pivoting the data?
You just need to order by the score after pivot:
from pyspark.sql import functions as F, Window
df2 = df.groupBy('id').pivot('objective').agg(F.sum('score')).fillna(0)
df3 = df2.select(
'id',
*[F.rank().over(Window.orderBy(F.desc(c))).alias(c) for c in df2.columns[1:]]
)
df3.show()
+---+---------+-----+
| id|Marketing|Sales|
+---+---------+-----+
| 2| 1| 3|
| 1| 2| 1|
| 3| 3| 2|
+---+---------+-----+
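For reference, the same ranking can be expressed in Spark SQL over the pivoted result. A sketch, assuming the pivoted DataFrame is registered as a temp view named pivoted (a hypothetical name); swap RANK() for DENSE_RANK() if tied totals should share consecutive ranks:

SELECT id,
       RANK() OVER (ORDER BY Sales DESC)     AS Sales,
       RANK() OVER (ORDER BY Marketing DESC) AS Marketing
FROM pivoted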
I am processing the following table and would like to compute a new column (outcome) based on the distinct values of two other columns.
| id1 | id2 | outcome |
|-----|-----|---------|
| 1   | 1   | 1       |
| 1   | 1   | 1       |
| 1   | 3   | 2       |
| 2   | 5   | 1       |
| 3   | 1   | 1       |
| 3   | 2   | 2       |
| 3   | 3   | 3       |
The outcome should be numbered in incremental order starting from 1, based on the combination of id1 and id2. Any hints on how this can be accomplished in Scala? row_number doesn't seem useful here.
The logic is that for each unique value of id1, numbering restarts, with the minimum id2 for that id1 being assigned an outcome of 1.
You could try dense_rank(). With your example:
val df = sqlContext
.read
.option("sep","|")
.option("header", true)
.option("inferSchema",true)
.csv("/home/cloudera/files/tests/ids.csv") // Here we read the .csv files
.cache()
df.show()
df.printSchema()
df.createOrReplaceTempView("table")
sqlContext.sql(
"""
|SELECT id1, id2, DENSE_RANK() OVER(PARTITION BY id1 ORDER BY id2) AS outcome
|FROM table
|""".stripMargin).show()
Output:
+---+---+-------+
|id1|id2|outcome|
+---+---+-------+
| 2| 5| 1|
| 1| 1| 1|
| 1| 1| 1|
| 1| 3| 2|
| 3| 1| 1|
| 3| 2| 2|
| 3| 3| 3|
+---+---+-------+
Use a Window function to partition the data by the first id, then order each partition by the second id.
Now you just need to assign a dense_rank over each Window partition.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
df
.withColumn("outcome", dense_rank().over(Window.partitionBy("id1").orderBy("id2")))
I'm using DataFrames in pyspark. I have a table like Table 1 below and I need to obtain Table 2, where:
num_category - the number of different categories for each id
sum(count) - the sum of the third column in Table 1 for each id
Example:
Table 1
id | category | count
1  | 4        | 1
1  | 3        | 2
1  | 1        | 2
2  | 2        | 1
2  | 1        | 1
Table 2
id | num_category | sum(count)
1  | 3            | 5
2  | 2            | 2
I tried:
table1 = data.groupBy("id","category").agg(count("*"))
cat = table1.groupBy("id").agg(count("*"))
count = table1.groupBy("id").agg(func.sum("count"))
table2 = cat.join(count, cat.id == count.id)
Error:
1 table1 = data.groupBy("id","category").agg(count("*"))
---> 2 cat = table1.groupBy("id").agg(count("*"))
count = table1.groupBy("id").agg(func.sum("count"))
table2 = cat.join(count, cat.id == count.id)
TypeError: 'DataFrame' object is not callable
You can do multiple column aggregations on a single grouped DataFrame:
data.groupby('id') \
    .agg({'category': 'count', 'count': 'sum'}) \
    .withColumnRenamed('count(category)', 'num_category') \
    .show()
+---+------------+----------+
| id|num_category|sum(count)|
+---+------------+----------+
|  1|           3|         5|
|  2|           2|         2|
+---+------------+----------+
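One caveat: {'category': 'count'} counts rows per id, which equals the number of categories only if a category never repeats for an id. If it can repeat, a distinct count is safer. A Spark SQL sketch, assuming the data is registered as a temp view named table1 (a hypothetical name):

SELECT id,
       COUNT(DISTINCT category) AS num_category,
       SUM(count) AS `sum(count)`
FROM table1
GROUP BY id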
I would like to query over three tables. So far I have managed to join two of them. This is my first database project and I'm really stuck. Here are my tables:
Drivers
|DRIVER_ID|FIRST_NAME|LAST_NAME|AGE|
| 1|John |Smith |19 |
| 2|Steve |Oak |33 |
| 3|Mary |Sanchez |22 |
Drivers_in_Teams
|DRIVERS_IN_TEAMS_ID|DRIVER_ID|TEAM_ID|BEG_DATE |END_DATE |CAR |
| 1| 1| 1|18-NOV-05| |Toyota |
| 2| 3| 2|10-APR-12| |Ford |
| 3| 2| 3|19-JUL-01|02-AUG-04|Volkswagen |
Team
|TEAM_ID |NAME |COUNTRY |
| 1|Turbo |Sweden |
| 2|Rally |UK |
| 3|Baguette |France |
The BEG_DATE values are generated with "sysdate - number".
My goal is to find the drivers who drive a Ford and still have a valid contract (END_DATE is not set).
I would like to query over all three tables, so the result should display the driver's FIRST_NAME, LAST_NAME and the COUNTRY of the team.
I tried some examples I found on Stack Overflow and edited them, but I got stuck when adding the third table (TEAM) to the query.
Here's the one I used:
SELECT FIRST_NAME, LAST_NAME
FROM DRIVERS
JOIN DRIVERS_IN_TEAMS ON DRIVERS.DRIVER_ID = DRIVERS_IN_TEAMS.DRIVER_ID
WHERE DRIVERS_IN_TEAMS.CAR = 'Ford' AND DRIVERS_IN_TEAMS.END_DATE IS NOT NULL
I think this should work (join all the tables on the corresponding IDs, then apply your conditions). Note that a still-valid contract means END_DATE IS NULL, not IS NOT NULL as in your attempt:
SELECT d.FIRST_NAME, d.LAST_NAME, t.COUNTRY
FROM DRIVERS d
JOIN DRIVERS_IN_TEAMS dit ON dit.DRIVER_ID = d.DRIVER_ID
JOIN TEAM t ON dit.TEAM_ID = t.TEAM_ID
WHERE dit.END_DATE IS NULL
  AND dit.CAR = 'Ford'