I would like to query over three tables. Right now I have managed to join two tables. This is my first database project and I'm really stuck. Here are my tables:
Drivers

| DRIVER_ID | FIRST_NAME | LAST_NAME | AGE |
| --------- | ---------- | --------- | --- |
| 1         | John       | Smith     | 19  |
| 2         | Steve      | Oak       | 33  |
| 3         | Mary       | Sanchez   | 22  |

Drivers_in_Teams

| DRIVERS_IN_TEAMS_ID | DRIVER_ID | TEAM_ID | BEG_DATE  | END_DATE  | CAR        |
| ------------------- | --------- | ------- | --------- | --------- | ---------- |
| 1                   | 1         | 1       | 18-NOV-05 |           | Toyota     |
| 2                   | 3         | 2       | 10-APR-12 |           | Ford       |
| 3                   | 2         | 3       | 19-JUL-01 | 02-AUG-04 | Volkswagen |

Team

| TEAM_ID | NAME     | COUNTRY |
| ------- | -------- | ------- |
| 1       | Turbo    | Sweden  |
| 2       | Rally    | UK      |
| 3       | Baguette | France  |
BEG_DATE values are generated with "sysdate - number".
My goal is to find a driver who is driving a Ford and still has a valid contract (END_DATE is not set).
I would like to make the query over all three tables, and the result should display the driver's FIRST_NAME, LAST_NAME and the COUNTRY of the team.
I tried some examples I found on Stack Overflow and edited them, but I got stuck adding the third table, TEAM, to the query.
Here's the one I used:
SELECT FIRST_NAME, LAST_NAME
FROM DRIVERS
JOIN DRIVERS_IN_TEAMS ON DRIVERS.DRIVER_ID = DRIVERS_IN_TEAMS.DRIVER_ID
WHERE DRIVERS_IN_TEAMS.CAR = 'Ford' AND DRIVERS_IN_TEAMS.END_DATE IS NOT NULL
I think this should work (join all the tables on the corresponding IDs and then apply your conditions):
SELECT d.FIRST_NAME, d.LAST_NAME, t.COUNTRY
FROM DRIVERS d
JOIN DRIVERS_IN_TEAMS dit ON dit.DRIVER_ID = d.DRIVER_ID
JOIN TEAM t ON dit.TEAM_ID = t.TEAM_ID
WHERE dit.END_DATE IS NULL
  AND dit.CAR = 'Ford'
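With the sample data above, the only driver in a Ford with no END_DATE is driver 3 (team 2, Rally), so the query should return:

| FIRST_NAME | LAST_NAME | COUNTRY |
| ---------- | --------- | ------- |
| Mary       | Sanchez   | UK      |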
Related
I have two tables I'm trying to join on date and account. In table one the date field is continuous (daily) and stored as a string, whereas in table two a portion of the date field contains end-of-month dates only.
Edit: The sample below is an example of the tables I am working with. Originally I had only mentioned joining the ID and Date fields, but there are other fields in both tables that I am trying to keep in the final output. On a larger scale, Table 1 has thousands of IDs with records recorded daily across multiple years. Table 2 is similar, but at some point its date field switched from end-of-month data only to the same daily dates as table one. The tables below have been updated.
A sample of the data set:
Table1:
| ID | Date         |
| -- | ------------ |
| 1  | "2022-01-30" |
| 1  | "2022-01-31" |
| 1  | "2022-02-01" |
| 1  | "2022-02-02" |
Table2:
| ID | Date         | Field_flag |
| -- | ------------ | ---------- |
| 1  | "2021-12-31" | a          |
| 1  | "2022-01-31" | a          |
| 1  | "2022-02-01" | a          |
| 1  | "2022-02-02" | b          |
| 1  | "2022-02-03" | b          |
Desired result:
| table1.ID | table1.Date  | table2.date  | table2.Field_flag |
| --------- | ------------ | ------------ | ----------------- |
| 1         | "2022-01-30" | "2022-01-31" | a                 |
| 1         | "2022-01-31" | "2022-01-31" | a                 |
| 1         | "2022-02-01" | "2022-02-01" | a                 |
| 1         | "2022-02-02" | "2022-02-02" | b                 |
Are there any suggestions on how to approach this type of result?
I'm currently splitting the date fields into temporary month and year fields and subsampling on those as a workaround, but I would like something like the following inner join to work:
SELECT table1.*
      ,table2.Date AS date_table2
      ,table2.Field_flag
FROM table1
INNER JOIN table2
    ON table1.ID = table2.ID AND table1.Date <= table2.Date
One way to handle this would be to use a correlated subquery to find the same or closest but greater date for each date in the first table.
SELECT
t1.ID,
t1.Date AS Date_t1,
(SELECT t2.Date FROM Table2 t2
WHERE t2.ID = t1.ID AND t2.Date >= t1.Date
ORDER BY t2.Date LIMIT 1) AS Date_t2
FROM Table1 t1
ORDER BY t1.ID, t1.Date;
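The query above returns only the matched date; the Field_flag from the desired result can be pulled the same way with a second correlated subquery. A sketch extending the answer (my addition, assuming the same MySQL-style LIMIT as above):

SELECT
    t1.ID,
    t1.Date AS Date_t1,
    (SELECT t2.Date FROM Table2 t2
     WHERE t2.ID = t1.ID AND t2.Date >= t1.Date
     ORDER BY t2.Date LIMIT 1) AS Date_t2,
    (SELECT t2.Field_flag FROM Table2 t2
     WHERE t2.ID = t1.ID AND t2.Date >= t1.Date
     ORDER BY t2.Date LIMIT 1) AS Field_flag  -- same ordering, so both columns come from the same Table2 row
FROM Table1 t1
ORDER BY t1.ID, t1.Date;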
This might not be the optimal way, but throwing it into the mix: we can join the second table twice, once for the daily dates and once for the month-end dates.
data2_2_sdf = spark.sql('''
    select *,
           datediff(dt, lag(dt) over (partition by id order by dt)) as dt_diff_lag,
           datediff(lead(dt) over (partition by id order by dt), dt) as dt_diff_lead
    from data2
''')
data2_2_sdf.createOrReplaceTempView('data2_2')
# +---+----------+----+-----------+------------+
# | id| dt|flag|dt_diff_lag|dt_diff_lead|
# +---+----------+----+-----------+------------+
# | 1|2021-12-31| a| null| 31|
# | 1|2022-01-31| a| 31| 1|
# | 1|2022-02-01| a| 1| 1|
# | 1|2022-02-02| b| 1| 1|
# | 1|2022-02-03| b| 1| null|
# +---+----------+----+-----------+------------+
spark.sql('''
select a.*, coalesce(b.dt, c.dt) as dt2, coalesce(b.flag, c.flag) as flag
from data1 a
left join (select * from data2_2 where dt_diff_lag>=28 or dt_diff_lead>=28) b
on a.id=b.id and year(a.dt)*100+month(a.dt)=year(b.dt)*100+month(b.dt)
left join (select * from data2_2 where dt_diff_lag=1 and coalesce(dt_diff_lead, 1)=1) c
on a.id=c.id and a.dt=c.dt
'''). \
show()
# +---+----------+----------+----+
# | id| dt| dt2|flag|
# +---+----------+----------+----+
# | 1|2022-02-01|2022-02-01| a|
# | 1|2022-01-31|2022-01-31| a|
# | 1|2022-01-30|2022-01-31| a|
# | 1|2022-02-02|2022-02-02| b|
# +---+----------+----------+----+
I am rewriting legacy SAS code in PySpark. One of those blocks uses the SAS lag function. The way I understood the notes, an ID is a duplicate if it has two intake dates that are less than 4 days apart.
/* Next create flag if the same ID has two intake dates less than 4 days apart */
/* Data MUST be sorted by ID and DESCENDING IntakeDate!!! */
data duplicates (drop= lag_ID lag_IntakeDate);
    set df2;
    by ID;
    lag_ID = lag(ID);
    lag_IntakeDate = lag(IntakeDate);
    if ID = lag_ID then do;
        intake2TIME = intck('day', lag_IntakeDate, IntakeDate);
    end;
    if 0 <= abs(intake2TIME) < 4 then DUPLICATE = 1;
run;
/* If DUPLICATE = 1, then the row is a duplicate and eventually will be dropped. */
I tried meeting the condition described in the comments: I pulled the ID and intake dates via SQL, ordered by ID and descending intake date:
SELECT ID, intakeDate, col3, col4
from df order by ID, intakeDate DESC
I googled the lag equivalent and this is what I found:
https://www.educba.com/pyspark-lag/
However, I have not used window functions before, and the concept introduced by the site does not quite make sense to me yet, though I tried the following to check whether my understanding of WHERE EXISTS might work:
SELECT *
FROM df
WHERE EXISTS (
    SELECT *
    FROM df v2
    WHERE df.ID = v2.ID AND DATEDIFF(df.IntakeDate, v2.IntakeDate) > 4
) /* not sure about the second condition, though */
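For reference, an inverted version of that EXISTS idea does match the rule: keep a row only when no later intake for the same ID is less than 4 days after it. A sketch (not from the original post), assuming Spark SQL's DATEDIFF(end, start) argument order:

SELECT *
FROM df
WHERE NOT EXISTS (
    SELECT 1
    FROM df v2
    WHERE v2.ID = df.ID
      AND v2.IntakeDate > df.IntakeDate               -- a later intake for the same ID...
      AND DATEDIFF(v2.IntakeDate, df.IntakeDate) < 4  -- ...less than 4 days after this one
)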
Initial df
+-----------+------------------+
| Id| IntakeDate|
+-----------+------------------+
| 5.0| 2021-04-14|
| 5.0| 2021-05-06|
| 5.0| 2021-05-08|
| 10.0| 2021-04-21|
| 10.0| 2021-05-25|
| 14.0| 2021-03-08|
| 14.0| 2021-03-09|
| 14.0| 2021-09-30|
| 14.0| 2022-04-08|
| 15.0| 2021-04-27|
| 15.0| 2021-05-18|
| 15.0| 2022-01-17|
| 26.0| 2021-08-27|
| 26.0| 2021-09-17|
+-----------+------------------+
The expected df will have a row dropped if the next intake date is less than 4 days after the prior date:
+-----------+------------------+
| Id| IntakeDate|
+-----------+------------------+
| 5.0| 2021-04-14|
| 5.0| 2021-05-06| row to drop
| 5.0| 2021-05-08|
| 10.0| 2021-04-21|
| 10.0| 2021-05-25|
| 14.0| 2021-03-08| row to drop
| 14.0| 2021-03-09|
| 14.0| 2021-09-30|
| 14.0| 2022-04-08|
| 15.0| 2021-04-27|
| 15.0| 2021-05-18|
| 15.0| 2022-01-17|
| 26.0| 2021-08-27|
| 26.0| 2021-09-17|
+-----------+------------------+
Please try the following code:
from pyspark.sql.window import Window
import pyspark.sql.functions as F

lead_over_id = Window.partitionBy('id').orderBy('IntakeDate')

df = (df
    .withColumn('lead_1_date', F.lag('IntakeDate', -1).over(lead_over_id))  # lag with offset -1 acts as lead: the next IntakeDate per id
    .withColumn('date_diff', F.datediff('lead_1_date', 'IntakeDate'))       # days until that next intake
    .where((F.col('date_diff') >= 4) | F.col('date_diff').isNull())         # keep rows with no intake less than 4 days later
    .drop('lead_1_date', 'date_diff')
)
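If you prefer SQL, the same filter can be expressed with a lead window function once df is registered as a temp view. A sketch under the same assumptions (keep a row when the next intake for the same Id is 4 or more days away, or there is none):

SELECT Id, IntakeDate
FROM (
    SELECT *,
           datediff(lead(IntakeDate) OVER (PARTITION BY Id ORDER BY IntakeDate),
                    IntakeDate) AS gap   -- days until the next intake for this Id
    FROM df
) t
WHERE gap >= 4 OR gap IS NULL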
I would like to do a fairly simple query, but I can't figure out how to join the tables together. I am new to the world of SQL, and even after reading the documentation on the JOIN and SELECT clauses, I still can't figure this one out.
Here are my 3 tables:
Seller

| SELLER_ID | NUMBER | FIRST_NAME | LAST_NAME | TEAM_NR |
| --------- | ------ | ---------- | --------- | ------- |
| 1         | 105    | John       | Smith     | 1       |
| 2         | 106    | James      | Brown     | 3       |
| 3         | 107    | Jane       | Doe       | 3       |
| 4         | 108    | Nicole     | Sanchez   | 2       |

Service

| SERVICE_ID | CODE | NAME      | PRICE | SELLER_ID | CLIENT_ID |
| ---------- | ---- | --------- | ----- | --------- | --------- |
| 1          | 502  | BLAHBLAH  | 200   | 2         | 2         |
| 2          | 503  | BLAHBLAH2 | 175   | 1         | 3         |
| 3          | 504  | BLAHBLAH3 | 250   | 3         | 2         |
| 4          | 505  | BLAHBLAH4 | 130   | 2         | 4         |

Client

| CLIENT_ID | NUMBER | FIRST_NAME | LAST_NAME |
| --------- | ------ | ---------- | --------- |
| 1         | 51     | JOHN       | ADAMS     |
| 2         | 52     | MARY       | BRYANT    |
| 3         | 53     | FRANCIS    | JOHNSON   |
| 4         | 55     | BEN        | CASTLE    |
The goal of this query is to figure out which team (TEAM_NR from Seller) sold the most services in a month, based on the total amount sold (sum of PRICE from Service).
The result should display the FIRST_NAME, LAST_NAME and TEAM_NR of everyone on the "winning" team.
I already looked for help on Stack Overflow and Google and tried editing examples to fit my tables, but they didn't pan out.
Thank you!
SELECT S.FIRST_NAME, S.LAST_NAME, S.TEAM_NR, sum(R.PRICE) Winning
FROM Seller S
LEFT JOIN Service R ON (S.SELLER_ID=R.SELLER_ID)
GROUP BY S.TEAM_NR, S.FIRST_NAME, S.LAST_NAME
EDIT: You don't even need any join on the Client table.
EDIT 2: All fields from the SELECT have to be in the GROUP BY.
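Building on that, to return only the members of the single winning team, you can compare each seller's TEAM_NR against the top team by total sales. A sketch of one way to do it (not from the original answer; it assumes Oracle 12c+'s FETCH FIRST syntax, and leaves out the month filter since the sample Service table has no date column):

SELECT s.FIRST_NAME, s.LAST_NAME, s.TEAM_NR
FROM Seller s
WHERE s.TEAM_NR = (
    SELECT se.TEAM_NR
    FROM Seller se
    JOIN Service sv ON sv.SELLER_ID = se.SELLER_ID
    GROUP BY se.TEAM_NR
    ORDER BY SUM(sv.PRICE) DESC   -- rank teams by total amount sold
    FETCH FIRST 1 ROW ONLY        -- keep only the top team (ties would need RANK instead)
)

With the sample data this returns James Brown and Jane Doe (team 3, total 580).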
I know my problem seems better solved by an RDBMS model. But I really want to deploy it using MongoDB, because I have potentially irregular fields to add to each record in the future and also want to practice my NoSQL database skills.
PE ratio and PB ratio data provided by one vendor:
| Vendor5_ID| PE| PB|date |
|----------:|----:|-----:|:----------|
| 210| 3.90| 2.620|2017-08-22 |
| 210| 3.90| 2.875|2017-08-22 |
| 228| 3.85| 2.320|2017-08-22 |
| 214| 3.08| 3.215|2017-08-22 |
| 187| 3.15| 3.440|2017-08-22 |
| 181| 2.76| 3.460|2017-08-22 |
Price data and analyst coverage provided by another vendor:
|Symbol | Price| Analyst|date |
|:------|-----:|-------:|:----------|
|AAPL | 160| 6|2017-08-22 |
|MSFT | 160| 6|2017-08-22 |
|GOOG | 108| 4|2017-08-22 |
And I have key-conversion data:
| uniqueID|Symbol |from |to |
|--------:|:------|:----------|:----------|
| 1|AAPL |2016-01-10 |2017-08-22 |
| 2|MSFT |2016-01-10 |2017-08-22 |
| 3|GOOG |2016-01-10 |2017-08-22 |
| uniqueID| Vendor5_ID|from |to |
|--------:|----------:|:----------|:----------|
| 1| 210|2016-01-10 |2017-08-22 |
| 2| 228|2016-01-10 |2017-08-22 |
| 3| 214|2016-01-10 |2017-08-22 |
I want to execute time-range queries fast. My idea is to store each column as its own collection:
db.PE:
{
_id,
uniqueID,
Vendor5_ID,
value,
date
}
db.PB:
{
_id,
uniqueID,
Vendor5_ID,
value,
date
}
db.Price:
{
_id,
uniqueID,
Symbol,
value,
date
}
db.Analyst:
{
_id,
uniqueID,
Symbol,
value,
date
}
Is this a good solution? What model do you think is best if there is far more data to add from different vendors?
I would consider a nested-table or child-table approach. I am not sure to what extent MongoDB supports this kind of modeling. I would consider Oracle NoSQL Database for this use case, which supports nested tables with TTL and higher throughput (because of BDB as the storage engine). With nested tables you could store PE and PB with timestamps in the child/nested table, while the parent table continues to hold the symbol/vendor_id and any other details. This ensures that your queries stay on the same shard; putting them in different collections will not guarantee the same shard.
I have a problem with duplicates in my database, but I can't use a DISTINCT select into another table because I have unique data in some columns. I want to keep the last rating.
Example:
| ID | ProductName | Code | Rating |
| -- | ----------- | ---- | ------ |
| 1  | Bag         | 1122 | 5      |
| 2  | Car         | 1133 | 2      |
| 3  | Bag         | 1122 | 3      |
| 4  | Car         | 1133 | 1      |
| 5  | Train       | 1144 | 1      |
As the result of the query I want to get:

| ID | ProductName | Code | Rating |
| -- | ----------- | ---- | ------ |
| 3  | Bag         | 1122 | 3      |
| 4  | Car         | 1133 | 1      |
| 5  | Train       | 1144 | 1      |
One option uses a GROUP BY to identify the ID values of the most recent rows for each Code group (here Code determines ProductName):
SELECT t1.*
FROM yourTable t1
INNER JOIN
(
SELECT Code, MAX(ID) AS ID
FROM yourTable
GROUP BY Code
) t2
ON t1.Code = t2.Code AND
t1.ID = t2.ID
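On databases with window functions, the same keep-the-last-row logic can be written with ROW_NUMBER() instead of the self-join. A sketch (my addition), assuming the table name yourTable from above:

SELECT ID, ProductName, Code, Rating
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY Code ORDER BY ID DESC) AS rn  -- rn = 1 marks the latest row per Code
    FROM yourTable t
) ranked
WHERE rn = 1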