How to write Hive queries for the following problems

I have a table with 10 columns:
ColumnNo  Name           Example         DataType
Column1   athlete_name   Michael Phelps  STRING
Column2   age            23              INT
Column3   country        United States   STRING
Column4   year           2008            INT
Column5   closing_date   8/24/2008       STRING
Column6   sport          Swimming        STRING
Column7   gold_medals    8               INT
Column8   silver_medals  0               INT
Column9   bronze_medals  0               INT
Column10  total_medals   8               INT
I have a few queries to write:
1. Find the total number of medals won by each country in swimming.
2. Find the total number of medals won by India in each year.
3. Find the total number of medals won by each country, and also display the name of the country.
4. Find the total number of gold medals won by each country.
5. Find which countries won medals for Shooting in each year.
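One possible way to approach the five problems is with plain GROUP BY aggregates. The sketch below runs the queries through Python's sqlite3 module so it can be tried without a Hive cluster; the SELECT statements use only standard SQL that HiveQL should also accept. The table name `olympics` is an assumption (the question does not name the table), and every row except the Michael Phelps example is invented test data.

```python
import sqlite3

# Hypothetical table name `olympics`; column names follow the question's schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE olympics (
        athlete_name TEXT, age INTEGER, country TEXT, year INTEGER,
        closing_date TEXT, sport TEXT, gold_medals INTEGER,
        silver_medals INTEGER, bronze_medals INTEGER, total_medals INTEGER
    )
    """
)
# Only the first row comes from the question; the others are invented test data.
conn.executemany(
    "INSERT INTO olympics VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("Michael Phelps", 23, "United States", 2008, "8/24/2008", "Swimming", 8, 0, 0, 8),
        ("Swimmer A", 25, "Australia", 2008, "8/24/2008", "Swimming", 1, 1, 0, 2),
        ("Shooter B", 26, "India", 2008, "8/24/2008", "Shooting", 1, 0, 0, 1),
        ("Shooter C", 30, "India", 2012, "8/12/2012", "Shooting", 0, 1, 0, 1),
    ],
)

queries = {
    # 1. Total medals per country in swimming
    "medals_by_country_in_swimming": """
        SELECT country, SUM(total_medals) FROM olympics
        WHERE sport = 'Swimming' GROUP BY country ORDER BY country""",
    # 2. Total medals won by India per year
    "india_medals_by_year": """
        SELECT year, SUM(total_medals) FROM olympics
        WHERE country = 'India' GROUP BY year ORDER BY year""",
    # 3. Total medals per country, with the country name
    "medals_by_country": """
        SELECT country, SUM(total_medals) FROM olympics
        GROUP BY country ORDER BY country""",
    # 4. Total gold medals per country
    "gold_by_country": """
        SELECT country, SUM(gold_medals) FROM olympics
        GROUP BY country ORDER BY country""",
    # 5. Countries that won at least one Shooting medal, per year
    "shooting_countries_by_year": """
        SELECT DISTINCT year, country FROM olympics
        WHERE sport = 'Shooting' AND total_medals > 0 ORDER BY year, country""",
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

The ORDER BY clauses are only there to make the output deterministic; they can be dropped if ordering does not matter.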

Related

How to get a ratio of male to female medal winners where a single name can show up multiple times?

Say I have the following table:
Name    Sex  Medal
John    M    Gold
John    M    Silver
Chris   M    Bronze
Ana     F    Null
Isobel  F    Bronze
I would like to get the ratio of male to female medal winners; in this case, I need to get the number 2 (John and Chris won medals, and Isobel won a medal). I don't know how to do this.
What I have is simply listing the number of distinct medal winners, grouped by gender:
SELECT "Sex", COUNT( DISTINCT "Name" ) AS number_of_medal_winners
FROM table
WHERE "Medal" IS NOT NULL
GROUP BY "Sex";
which results in
Sex  number_of_medal_winners
F    1
M    2

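Use a case expression to filter what is counted to build your ratio. Count distinct names so that a winner with several medals (John) is counted once, and multiply by 1.0 to avoid integer division:
SELECT COUNT(DISTINCT CASE WHEN sex = 'M' THEN name END) * 1.0 / COUNT(DISTINCT CASE WHEN sex = 'F' THEN name END) AS ratio_of_male_medal_winners_to_female
FROM yourtable
WHERE Medal IS NOT NULL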
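To check the ratio against the question's sample rows, here is a minimal sketch using Python's sqlite3 module. The table name `winners` is made up for the example, and the query counts distinct names per sex so that John, who appears twice, is only counted once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# `winners` is a hypothetical table name; the rows are the question's sample data.
conn.execute("CREATE TABLE winners (name TEXT, sex TEXT, medal TEXT)")
conn.executemany(
    "INSERT INTO winners VALUES (?, ?, ?)",
    [
        ("John", "M", "Gold"),
        ("John", "M", "Silver"),
        ("Chris", "M", "Bronze"),
        ("Ana", "F", None),  # no medal
        ("Isobel", "F", "Bronze"),
    ],
)

ratio_sql = """
    SELECT COUNT(DISTINCT CASE WHEN sex = 'M' THEN name END) * 1.0
         / COUNT(DISTINCT CASE WHEN sex = 'F' THEN name END)
    FROM winners
    WHERE medal IS NOT NULL
"""
# Two distinct male winners divided by one distinct female winner.
ratio = conn.execute(ratio_sql).fetchone()[0]
print(ratio)
```

If there were no female medal winners the divisor would be zero; how that is reported (NULL or an error) depends on the database, so a real query may want to guard against it.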

SQL Subquery for each

I have the following tables:
create table players
(
name varchar(30) not null primary key
);
create table injuries
(
bId int not null primary key,
date DATE not null,
name varchar(30),
foreign key(name) references players
);
create table sportsBegins
(
cId int not null primary key,
date DATE,
sportname varchar(20),
name varchar(30),
foreign key(name) references players
);
Following example data:
players
name
John
Jane
George
shows players in db
sportsBegins
cId | date       | sportname  | name
1   | 2020-01-01 | Basketball | John
2   | 2020-02-02 | Basketball | John
3   | 2020-01-01 | Soccer     | John
4   | 2020-02-02 | Basketball | Jane
5   | 2020-01-03 | Basketball | George
6   | 2020-01-04 | Badminton  | George
shows what date players begin playing a sport
injuries
bId | date       | name
1   | 2020-01-01 | John
2   | 2020-02-03 | Jane
3   | 2020-01-05 | George
shows the date these players reported injuries.
I want to count the number of DISTINCT players that have experienced an injury in Basketball AFTER the first day they got assigned the sport (not the same day).
So for each player, I need to grab only the first date they started playing basketball. Then, for that player, I need to compare his name AND date to the name AND date in the injuries table to see if he ever reported an injury after the date he got the sport assigned.
Example
In the example data I provided this would be the output
Total basketball injuries
2
Explanation of answer
John got assigned basketball twice. Only look at first date he got assigned basketball. Then look at injuries table. He only reported an injury on that day, but never after, so ignore. Jane and George reported injuries after first day assigned basketball so count them

This should get you the desired result
SELECT count(distinct injuries.name)
FROM injuries
INNER JOIN (
    SELECT name, min(date) as startDate
    FROM sportsBegins
    WHERE sportname = 'Basketball'
    GROUP BY name
) as startDates
    ON injuries.name = startDates.name
    AND injuries.date > startDates.startDate
Quick explanation:
startDates extracts the first date each player started playing basketball
the join condition filters only injuries which happened after the first start date for each player
count(distinct injuries.name) ensures each player only gets counted once even if he/she reported more than one injury after the first start date
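The query can be checked against the question's sample data with a short sketch using Python's sqlite3 module. Dates are stored as ISO-formatted text here (an assumption of this sketch, since SQLite has no DATE type), which still compares correctly with `>`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sample data from the question; dates are ISO text so string comparison orders them correctly.
conn.executescript(
    """
    CREATE TABLE sportsBegins (cId INTEGER PRIMARY KEY, date TEXT, sportname TEXT, name TEXT);
    CREATE TABLE injuries (bId INTEGER PRIMARY KEY, date TEXT NOT NULL, name TEXT);
    INSERT INTO sportsBegins VALUES
        (1, '2020-01-01', 'Basketball', 'John'),
        (2, '2020-02-02', 'Basketball', 'John'),
        (3, '2020-01-01', 'Soccer', 'John'),
        (4, '2020-02-02', 'Basketball', 'Jane'),
        (5, '2020-01-03', 'Basketball', 'George'),
        (6, '2020-01-04', 'Badminton', 'George');
    INSERT INTO injuries VALUES
        (1, '2020-01-01', 'John'),
        (2, '2020-02-03', 'Jane'),
        (3, '2020-01-05', 'George');
    """
)

sql = """
    SELECT count(DISTINCT injuries.name)
    FROM injuries
    INNER JOIN (
        SELECT name, min(date) AS startDate
        FROM sportsBegins
        WHERE sportname = 'Basketball'
        GROUP BY name
    ) AS startDates
        ON injuries.name = startDates.name
        AND injuries.date > startDates.startDate
"""
# John's only injury is on his first basketball day, so only Jane and George count.
total_basketball_injuries = conn.execute(sql).fetchone()[0]
print(total_basketball_injuries)
```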

Create a group from two columns

I have this:
Column1 Column2 Column3
Water 1 2€
Water 2 3€
Water 2 5€
Milk 1 8€
Milk 1 4€
Milk 2 10€
Milk 3 1€
I am trying to group a column and sum the column to the side referring to the price.
And I want this:
Column1 Column2 Column3
Water 1 2€
Water 2 8€
Milk 1 12€
Milk 2 10€
Milk 3 1€
How can I do this?

Assuming that the datatype of Column2 is numeric, create a formula-field with the following content and then create a group from this field:
{Table.Column1} + "###" + CStr({Table.Column2})
Now just create the sum for Column3.
If necessary, change the string ### to something else that does not appear in Column1 or Column2.
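The idea behind the formula field, building one composite key from both columns and summing per key, can be illustrated outside the reporting tool. The following Python sketch is only an illustration of that logic on the question's data, not Crystal Reports code; prices are stored as plain numbers of euros.

```python
# Rows from the question: (Column1, Column2, Column3 price in euros).
rows = [
    ("Water", 1, 2),
    ("Water", 2, 3),
    ("Water", 2, 5),
    ("Milk", 1, 8),
    ("Milk", 1, 4),
    ("Milk", 2, 10),
    ("Milk", 3, 1),
]

SEPARATOR = "###"  # must not occur in Column1 or Column2

totals = {}
for column1, column2, price in rows:
    # Same composite key as the formula field: Column1 + "###" + CStr(Column2)
    key = column1 + SEPARATOR + str(column2)
    totals[key] = totals.get(key, 0) + price

for key, total in totals.items():
    print(key, total)
```

The separator matters because without it two different pairs could produce the same key, for example ("A1", 1) and ("A", 11).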

How do I produce a report to show the number of occurrences an employee has been absent from work

I have been asked to generate a report to show the number of occurrences an employee is absent from work sick.
If an employee is absent from work for 3 consecutive days this will be counted as 1 occurrence. If they then return to work and are then absent again for another 2 consecutive days this will be recorded as 2 occurrences.
I need to generate a report to show the number of occurrences an employee is away from work sick within a 6 month period.
I have set out an example below of the data showing an employee's absence records and how I need the report to look.
How data shows in database:
Name Absence Dates
John Smith 01-Sep-19
John Smith 02-Sep-19
John Smith 03-Sep-19
John Smith 10-Sep-19
John Smith 11-Sep-19
How I wish for the report to look:
Name Occurrences
John Smith 2
I would be grateful for any assistance with writing the code to achieve this result.

Not a full answer, as you should really do some of this yourself. However, based on what you have detailed in your question, you could use the approach below to count up any spells of absence within a 6-month period.
This assumes you are using SQL Server.
create table #absences (empid int, [abs date] date, [ret date] date);
create table #staff ([empid] int, [name1] nvarchar(50), [name2] nvarchar(50), [surname] nvarchar(50));
-- put some test values in the staff table to work with
insert into #staff
values
(1, 'John', 'Lewis', 'Smith'), -- using a unique ID here, in any good system this should be an incremental number for each new staff member added to the table
(2, 'James', 'Thomas', 'Brown')
-- put some test values in the absences table to work with
insert into #absences
values
(1, '2019-07-01', '2019-07-04'), -- userid, absence date & return date
(1, '2019-08-04', '2019-08-06'),
(2, '2019-07-02', '2019-07-05'),
(2, '2019-08-05', '2019-08-07')
select count(*) spellsoff, empid, name1, name2, surname, [days absent]
from
(
select
s.empid,
s.name1,
s.name2,
s.surname,
a.[abs date],
a.[ret date],
datediff(d,a.[abs date], a.[ret date]) [days absent]
from #staff s
left join #absences a
on s.empid = a.empid
where [abs date] >= DATEADD(M,-6,GETDATE()) -- pull back those employees that have been absent in the last 6 months from today's date
)doff
group by empid, name1, name2, surname, [days absent]
Gives you the following breakdown:
spellsoff  empid  name1  name2   surname  days absent
1          1      John   Lewis   Smith    2
1          1      John   Lewis   Smith    3
1          2      James  Thomas  Brown    2
1          2      James  Thomas  Brown    3
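If the data is stored as one row per absent day, as in the question, the occurrences can be counted by starting a new spell whenever a date does not directly follow the previous one. The Python sketch below illustrates that logic on the question's sample rows. It assumes "consecutive" means consecutive calendar days (weekends and holidays are not treated specially), and the function name `count_occurrences` is made up for this example.

```python
from collections import defaultdict
from datetime import date, timedelta


def count_occurrences(records):
    """Count absence spells per employee from (name, absence_date) rows."""
    days_by_name = defaultdict(set)
    for name, day in records:
        days_by_name[name].add(day)

    occurrences = {}
    for name, days in days_by_name.items():
        spells = 0
        previous = None
        for day in sorted(days):
            # A gap of more than one day starts a new occurrence.
            if previous is None or day - previous > timedelta(days=1):
                spells += 1
            previous = day
        occurrences[name] = spells
    return occurrences


# Sample data from the question.
records = [
    ("John Smith", date(2019, 9, 1)),
    ("John Smith", date(2019, 9, 2)),
    ("John Smith", date(2019, 9, 3)),
    ("John Smith", date(2019, 9, 10)),
    ("John Smith", date(2019, 9, 11)),
]
print(count_occurrences(records))
```

To restrict the report to a 6-month window, the rows would be filtered by date before being passed in.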

Full Outer Joins In PostgreSql [duplicate]

I've created a table students with columns student_id (the primary key), student_name and gender.
I have another table, gender, which consists of gender_id and gender.
The gender column in students refers to gender_id in the gender table.
Tables data looks like this:
Student table
STUDENT_ID STUDENT_NAME GENDER
1 Ajith 1
2 Alan 1
3 Ann 2
4 Alexa 2
5 Amith 1
6 Nisha 2
7 Rathan 1
8 Rebecca 2
9 asdf null
10 asd null
11 dbss null
Gender Table
GENDER_ID GENDER
1 Male
2 Female
3 Others
My query and its result
SELECT S.STUDENT_NAME,
G.GENDER
FROM STUDENTS S
FULL OUTER JOIN GENDER G ON G.GENDER_ID = S.GENDER
The result has 12 rows, including the Others value from the gender table.
STUDENT_ID STUDENT_NAME GENDER
1 Ajith Male
2 Alan Male
3 Ann Female
4 Alexa Female
5 Amith Male
6 Nisha Female
7 Rathan Male
8 Rebecca Female
Others
9 asdf
10 asd
11 dbss
I'm trying to restrict a particular student_id:
SELECT S.STUDENT_ID,
S.STUDENT_NAME,
G.GENDER
FROM STUDENTS S
FULL OUTER JOIN GENDER G ON G.GENDER_ID = S.GENDER
WHERE S.STUDENT_ID <> 11;
Now the total number of rows is reduced to 10.
STUDENT_ID STUDENT_NAME GENDER
1 Ajith Male
2 Alan Male
3 Ann Female
4 Alexa Female
5 Amith Male
6 Nisha Female
7 Rathan Male
8 Rebecca Female
9 asdf
10 asd
Why has the one row with Others Values disappeared from the second select query?
I'm trying to find the cause of this issue.

That's because in the Others row no student matched, so s.student_id is NULL there. NULL <> 11 is not TRUE, but NULL, and only rows where the condition is TRUE are included in the result.
You'd have to write something like
WHERE s.student_id IS DISTINCT FROM 11
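The NULL behaviour can be reproduced with a small sketch using Python's sqlite3 module. Two substitutions are made so it runs on any SQLite version, and both are assumptions of this sketch rather than part of the PostgreSQL answer: a LEFT JOIN starting from gender stands in for the FULL OUTER JOIN (it is enough to produce the unmatched Others row), and SQLite's NULL-safe `IS NOT` operator stands in for PostgreSQL's `IS DISTINCT FROM`. Only a few of the question's student rows are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gender (gender_id INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, student_name TEXT, gender INTEGER);
    INSERT INTO gender VALUES (1, 'Male'), (2, 'Female'), (3, 'Others');
    INSERT INTO students VALUES (1, 'Ajith', 1), (3, 'Ann', 2), (11, 'dbss', NULL);
    """
)

# The Others row has no matching student, so s.student_id is NULL for it.
base = """
    SELECT s.student_id, g.gender
    FROM gender g
    LEFT JOIN students s ON g.gender_id = s.gender
    WHERE {condition}
    ORDER BY g.gender_id
"""

# NULL <> 11 evaluates to NULL, so the Others row is filtered out.
with_not_equal = conn.execute(base.format(condition="s.student_id <> 11")).fetchall()
# SQLite's IS NOT treats NULL as an ordinary value, so the Others row is kept.
null_safe = conn.execute(base.format(condition="s.student_id IS NOT 11")).fetchall()

print(with_not_equal)
print(null_safe)
```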

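Your second select query returns only the rows where student_id is different (<>) from 11. The Others row has no student, so its student_id is NULL, the comparison does not evaluate to true, and the row is filtered out.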
