A data frame has two categorical variable columns (Va and Vb) along with date and time stamp columns.
Date        Time   Va  Vb
01-01-2023  05:55  A   B
01-01-2023  06:25  A
01-01-2023  17:42      B
01-01-2023  19:17  A   B
02-01-2023  05:55  A   B
02-01-2023  06:25  A   B
02-01-2023  17:42  A   B
02-01-2023  19:17  A
I want to group the data by date and count the non-empty Va and Vb values for each date.
Expected Result:
Date        Va  Vb
01-01-2023  3   3
02-01-2023  4   3
If you are using an SQL database (and those empty values are NULL):
Select Date, Count(Va) as Va, Count(Vb) as Vb
from sourceTable
group by date;
DBFiddle demo
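As a quick way to check this locally, here is a minimal sketch using Python's built-in sqlite3 module and the sample rows from the question. It assumes the empty cells are stored as NULL (None in Python); the in-memory database is only for illustration.

```python
import sqlite3

# In-memory database holding the sample rows; None stands in for the NULL cells.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sourceTable (Date TEXT, Time TEXT, Va TEXT, Vb TEXT)")
rows = [
    ("01-01-2023", "05:55", "A", "B"),
    ("01-01-2023", "06:25", "A", None),
    ("01-01-2023", "17:42", None, "B"),
    ("01-01-2023", "19:17", "A", "B"),
    ("02-01-2023", "05:55", "A", "B"),
    ("02-01-2023", "06:25", "A", "B"),
    ("02-01-2023", "17:42", "A", "B"),
    ("02-01-2023", "19:17", "A", None),
]
conn.executemany("INSERT INTO sourceTable VALUES (?, ?, ?, ?)", rows)

# COUNT(column) skips NULLs, so each count covers only the filled-in cells.
result = conn.execute(
    "SELECT Date, COUNT(Va) AS Va, COUNT(Vb) AS Vb "
    "FROM sourceTable GROUP BY Date ORDER BY Date"
).fetchall()
print(result)  # [('01-01-2023', 3, 3), ('02-01-2023', 4, 3)]
```

If the empty cells were stored as empty strings instead of NULL, COUNT would include them, so they would have to be converted first (for example with NULLIF(Va, '')).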
Try:
out = df[['Date', 'Va', 'Vb']].groupby('Date').count()
print(out)
Prints:
            Va  Vb
Date
01-01-2023   3   3
02-01-2023   4   3
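One caveat, shown as a sketch: count() only skips missing values (NaN/None). If the empty cells were read in as empty strings (an assumption about how the data was loaded), they would be counted too, so they need to be turned into missing values first.

```python
import pandas as pd

# Sample data from the question, with empty cells represented as "" (assumption).
df = pd.DataFrame({
    "Date": ["01-01-2023"] * 4 + ["02-01-2023"] * 4,
    "Time": ["05:55", "06:25", "17:42", "19:17"] * 2,
    "Va": ["A", "A", "", "A", "A", "A", "A", "A"],
    "Vb": ["B", "", "B", "B", "B", "B", "B", ""],
})
cols = ["Va", "Vb"]

# Counting directly treats "" as a value, so every row is counted.
naive = df.groupby("Date")[cols].count()

# mask() replaces entries where the condition is True with NaN,
# which count() then skips.
cleaned = df[cols].mask(df[cols] == "")
out = cleaned.groupby(df["Date"]).count()
print(out)
```

With the cells converted, the counts match the expected result (3 and 3 for the first date, 4 and 3 for the second).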
Is it possible to write PostgreSQL code that looks at the sample data and selects only the persons who have been active for the whole first quarter (01/01/2018 to 03/31/2018), as shown in the desired output? Note that person H should not be selected because they are missing January.
Sample Data
Person Start Date End Date
A 1/1/2018 1/31/2018
A 2/1/2018 2/28/2018
A 3/1/2018 3/31/2018
B 1/1/2018 2/28/2018
C 1/1/2018 2/28/2018
C 3/1/2018 3/31/2018
D 2/1/2018 3/31/2018
E 2/1/2018 2/28/2018
F 1/1/2018 3/31/2018
G 1/1/2018 4/30/2018
H 2/1/2018 4/30/2018
Desired Output
Person
A
C
F
G
Assuming your columns are proper DATE columns and there are no overlaps, you could do something like this:
select person
from the_table
group by person
having sum(end_date - start_date + 1) >= date '2018-03-31' - date '2018-01-01' + 1
order by person;
Subtracting one date from another yields the number of days between those two dates. Then the sum of all differences is compared to the difference between the start and end date of the quarter.
Online example: https://rextester.com/OIN10602
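The day-counting idea can be sketched in plain Python against the sample data. One difference from the query above, and it is my own addition rather than part of the answer: each range is clamped to the quarter before counting, so days outside the quarter are not counted (a single range from 2/1/2018 to 5/31/2018 is 120 days long but covers only 59 days of the quarter). Like the query, it assumes there are no overlapping ranges.

```python
from datetime import date

Q_START, Q_END = date(2018, 1, 1), date(2018, 3, 31)

# Sample data from the question: (person, start_date, end_date)
rows = [
    ("A", date(2018, 1, 1), date(2018, 1, 31)),
    ("A", date(2018, 2, 1), date(2018, 2, 28)),
    ("A", date(2018, 3, 1), date(2018, 3, 31)),
    ("B", date(2018, 1, 1), date(2018, 2, 28)),
    ("C", date(2018, 1, 1), date(2018, 2, 28)),
    ("C", date(2018, 3, 1), date(2018, 3, 31)),
    ("D", date(2018, 2, 1), date(2018, 3, 31)),
    ("E", date(2018, 2, 1), date(2018, 2, 28)),
    ("F", date(2018, 1, 1), date(2018, 3, 31)),
    ("G", date(2018, 1, 1), date(2018, 4, 30)),
    ("H", date(2018, 2, 1), date(2018, 4, 30)),
]

def covered_days(start, end):
    """Number of days of [start, end] that fall inside the quarter."""
    lo, hi = max(start, Q_START), min(end, Q_END)
    return max((hi - lo).days + 1, 0)

# Sum the covered days per person.
totals = {}
for person, start, end in rows:
    totals[person] = totals.get(person, 0) + covered_days(start, end)

# A person is active for the whole quarter if every day is covered.
quarter_days = (Q_END - Q_START).days + 1
active = sorted(p for p, days in totals.items() if days >= quarter_days)
print(active)  # ['A', 'C', 'F', 'G']
```

In SQL the same clamping could be expressed with greatest(start_date, ...) and least(end_date, ...) inside the sum.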
There are three tables, A, B and C, in Hive.
Table A has the following columns and is partitioned by Day. We need to extract data from 1 Jan 2016 through 31 Dec 2016. I've shown only a small sample here; the real table has millions of records for the year.
ID Day Name Description
1 2016-09-01 Sam Retail
2 2016-01-28 Chris Retail
3 2016-02-06 ChrisTY Retail
4 2016-02-26 Christa Retail
3 2016-12-06 ChrisTu Retail
4 2016-12-31 Christi Retail
Table B
ID SkEY
1 1.1
2 1.2
3 1.3
Table C
Start_Date End_Date Month_No
2016-01-01 2016-01-31 1
2016-02-01 2016-02-28 2
2016-03-01 2016-03-31 3
2016-04-01 2016-04-30 4
2016-05-01 2016-05-31 5
2016-06-01 2016-06-30 6
2016-07-01 2016-07-31 7
2016-08-01 2016-08-31 8
2016-09-01 2016-09-30 9
2016-10-01 2016-10-30 10
2016-11-01 2016-11-31 11
2016-12-01 2016-12-31 12
I tried to write this in Spark, but it didn't work: the join resulted in a Cartesian product and performance was very bad.
Df_A = spark.sql("""select * from A join B where a.day >= b.start_date
and a.day <= b.end_date and b.month_no = (I)""")
The desired result is PySpark code that joins A to B as shown above and processes every month: the value of I should be incremented automatically from 1 to 12, along with the month dates.
It should also join A to C using ID, and performance should be good.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark import HiveContext
hiveContext = HiveContext(sc)

def UDF_df(i):
    # Process one day of table A: join it to B on ID and append to the target table
    print(i[0])
    ABC2 = spark.sql("select * from A where day = '{0}'".format(i[0]))
    Join = ABC2.join(Tab2, ABC2.ID == Tab2.ID) \
        .select(Tab2.skey, ABC2.Day, ABC2.Name, ABC2.Description)
    Join.write \
        .mode("append") \
        .format("parquet") \
        .insertInto("Table")

# Distinct days of 2016 present in A
ABC = spark.sql("select distinct day from A where day >= '2016-01-01' and day <= '2016-12-31'")
Tab2 = spark.sql("select * from B where day is not null")
for i in ABC.collect():
    UDF_df(i)
This query works but takes a long time, as there are around 60 columns (I have used just 3 in the sample). I also didn't join Table C because I wasn't sure how to do so without a Cartesian join. Performance isn't good, and I'm not sure how to optimise the query.
I am trying to group data by the day of the year that it falls on. I have been able to achieve this with the code below. The issue is that I lose the information as to which day (i.e. Jan 1st, Jan 2nd etc.) each grouping represents. I am simply left with a number (e.g. 1, 2 etc.) representing the day of the year. Is there any way to convert this number back into the more descriptive date? Thanks a lot.
CREATE TABLE tmp2 AS
SELECT extract(doy from trd_exctn_dt) as day_of_year
,sum(dollar_vol) AS dollar_vol
FROM tmp
GROUP BY extract(doy from trd_exctn_dt);
Current Output:
day_of_year | dollar_vol
------------|------------
1           | 10
2           | 15
3           | 7
Desired Output: N.b. The exact format of the first column doesn't matter too much. I would be happy with DD/MM, MM/DD or any other clear output.
day_of_year | dollar_vol
------------|------------
Jan 1 | 10
Jan 2 | 15
Jan 3 | 7
Using the to_char function:
SELECT to_char(trd_exctn_dt,'MM/DD') as day_of_year ,sum(dollar_vol) AS dollar_vol
FROM tmp
GROUP BY day_of_year ;
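If you already have only the day-of-year number and want to turn it back into a label outside the database, a minimal Python sketch could look like the following. The function name and the reference year are assumptions of mine; the year matters because the same day number maps to a different date after February in a leap year.

```python
from datetime import date, timedelta

def doy_to_label(doy, year=2001):
    """Turn a 1-based day-of-year number into a label such as 'Jan 1'.

    The default reference year (2001) is an arbitrary non-leap year.
    """
    d = date(year, 1, 1) + timedelta(days=doy - 1)
    # %b gives the abbreviated month name (English in the default C locale).
    return d.strftime("%b") + " " + str(d.day)

for n in (1, 2, 3):
    print(n, doy_to_label(n))
```

Day 60, for example, is Mar 1 in 2001 but Feb 29 in 2000, which is why grouping on to_char(..., 'MM/DD') directly avoids the ambiguity.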
I want to compare two rows in a table and retrieve the records where the line 2 value for a particular year is less than the line 1 value for the same year:
Year Line Dollar
2001 1 $50
2001 2 $50
2002 1 $100
2002 2 $100
2003 1 $150
2003 2 $100
The result is
Year Line Dollar
2003 1 $150
2003 2 $100
Thanks
select a.*, b.*
from yourtable a, yourtable b
where a.year = b.year
and a.line = 1
and b.line = 2
and a.dollar > b.dollar
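To try the self-join out, here is a small sketch with Python's built-in sqlite3 module and the sample rows. It assumes the Dollar column is stored as a plain number (150 instead of $150), since the comparison needs numeric values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (year INTEGER, line INTEGER, dollar INTEGER)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?)",
    [(2001, 1, 50), (2001, 2, 50),
     (2002, 1, 100), (2002, 2, 100),
     (2003, 1, 150), (2003, 2, 100)],
)

# Pair each year's line 1 (alias a) with its line 2 (alias b) and keep
# the years where line 2 is lower than line 1.
matches = conn.execute(
    "select a.*, b.* "
    "from yourtable a, yourtable b "
    "where a.year = b.year and a.line = 1 and b.line = 2 "
    "and a.dollar > b.dollar"
).fetchall()
print(matches)  # [(2003, 1, 150, 2003, 2, 100)]
```

Note that both lines come back side by side in a single row per year, not as two separate rows.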
I'm trying to update a date dimension table from the accounting years table of our ERP System. If I run the following Query:
SELECT fcname FYName
,min(fdstart) YearStart
,max(fdend) YearEnd
,max(fnnumber) PeriodCount
FROM M2MData01.dbo.glrule GLR
GROUP BY fcname
I get the following data:
FYName YearStart YearEnd PeriodCount
FY 2000 1/1/2000 12:00:00 AM 12/31/2000 12:00:00 AM 12
FY 2001 1/1/2001 12:00:00 AM 12/31/2001 12:00:00 AM 12
FY 2002 1/1/2002 12:00:00 AM 12/31/2002 12:00:00 AM 12
FY 2003 1/1/2003 12:00:00 AM 12/31/2003 12:00:00 AM 12
FY 2004 1/1/2004 12:00:00 AM 12/31/2004 12:00:00 AM 12
FY 2005 1/1/2005 12:00:00 AM 12/31/2005 12:00:00 AM 12
FY 2006 1/1/2006 12:00:00 AM 12/31/2006 12:00:00 AM 12
FY 2007 1/1/2007 12:00:00 AM 12/31/2007 12:00:00 AM 12
FY 2008 1/1/2008 12:00:00 AM 12/31/2008 12:00:00 AM 12
FY 2009 1/1/2009 12:00:00 AM 12/31/2009 12:00:00 AM 12
FY 2010 1/1/2010 12:00:00 AM 12/31/2010 12:00:00 AM 12
In my case my company has 12 periods per year which roughly correspond to months. Basically, I am trying to create an update statement to set Fiscal Quarters which will follow this logic:
1. If PeriodCount is divisible by 4 then the number of periods in a quarter is PeriodCount/4.
2. If PeriodNumber is in the first quarter (in this case periods 1 through 3) then FiscalQuarter =1 and so on for quarters 2 through 4.
The problem is that I cannot be guaranteed that everyone uses 12 periods, some companies I support use a different number such as 10.
I started creating the following select statement:
DECLARE @QuarterSize INT
DECLARE @SemesterSize INT
SELECT TST.Date,
CASE WHEN glr.PeriodCount % 4 = 0 THEN
-- Can Be divided into quarters. Quarter size is PeriodCount/4
set @quartersize = (GLR.PeriodCount/4)
CASE
END
ELSE 0
End
FROM m2mdata01.dbo.AllDates TST
INNER JOIN (
SELECT fcname FYName
,min(fdstart) YearStart
,MAX(fdend) YearEnd
,MAX(fnnumber) PeriodCount
FROM M2MData01.dbo.glrule GLR
GROUP BY fcname ) GLR
ON TST.DATE >= GLR.YearStart AND TST.DATE <= GLR.YearEnd
Can I set the value of a variable inside a case statement like this? What's the best way to accomplish this? Am I forced to use a cursor statement and check each date in my dimension against the range in the table above?
I'm not sure what you want to do here. You can assign a variable from a CASE expression in the SELECT clause, with the assignment outside the CASE, such as:
SELECT
SomeCol,
@var = CASE
WHEN condition1 THEN some value
WHEN condition2 THEN other value
END,
OtherCol
FROM
...
Note that @var will end up holding the value evaluated for the last row. As I said, I'm not sure how you intend to use your @quartersize variable. If the value is needed on every row, then you shouldn't be using a variable at all.
It may not be the most elegant solution, but here is what I ended up with.
I joined the per-period rows to a grouped-by version of the same table.
SELECT fcname FYName, fdstart PeriodStart, fdend PeriodEnd, fnnumber PeriodNo, GLRAGG.AGGFYName,
GLRAGG.QuarterSize, GLRAGG.PeriodCount, GLRAGG.Quarterific, GLRAGG.SemesterSize, GLRAGG.Semesterific
FROM M2MData01.dbo.glrule GLR
INNER JOIN
(SELECT fcname AGGFYName, min(fdstart) YearStart,
MAX(fdend) YearEnd, MAX(fnnumber) PeriodCount,
(Max(fnnumber) / 4) QuarterSize, CASE WHEN Max(fnnumber) % 4 = 0 THEN 'Yes' ELSE 'No' END AS Quarterific,
(Max(fnnumber) / 2) SemesterSize, CASE WHEN Max(fnnumber) % 2 = 0 THEN 'Yes' ELSE 'No' END AS Semesterific
FROM M2MData01.dbo.glrule
GROUP BY fcname) GLRAGG
ON GLR.FCNAME = GLRAGG.AGGFYNAME
This isn't a big deal because that table only has 12 rows for each year, in this case only 132 total rows.
That produces every fiscal period along with the total number of periods in its fiscal year and whether that total is evenly divisible by 4 and by 2. The update statement then uses the "Quarterific" value to decide whether to assign quarters, so I can get by without using variables.
It may not be the best way, but it works and is performant given the small data set.
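For reference, the quarter rule from the question (quarter size is PeriodCount/4 when PeriodCount is divisible by 4) can be sketched in Python as below. The function name is mine, and returning 0 for years that cannot be split mirrors the ELSE 0 in the original attempt.

```python
def fiscal_quarter(period_no, period_count):
    """Return the fiscal quarter (1 to 4) for a period, or 0 if the year
    cannot be split into four equal quarters."""
    if period_count % 4 != 0:
        return 0
    quarter_size = period_count // 4
    # Periods 1..quarter_size fall in quarter 1, the next block in quarter 2, etc.
    return (period_no - 1) // quarter_size + 1

print([fiscal_quarter(p, 12) for p in range(1, 13)])
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
```

The same integer arithmetic, (PeriodNo - 1) / QuarterSize + 1, should carry over to the update statement when both columns are integers, guarded by the Quarterific check.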