How to count unique values in PySpark

Here's my dataset:
Food Day
Chicken 2
Chicken 6
Chicken 2
Beef 3
Chicken 4
Beef 6
Beef 7
My desired output:
Food Day_Count
Chicken 3
Beef 3
Chicken is 3 because it appears on days 2, 4 and 6, and Beef is 3 because it appears on days 3, 6 and 7.

Your question is vague, but I presume you want the number of unique days for each food:
from pyspark.sql.functions import countDistinct
df.groupBy('Food').agg(countDistinct('Day').alias('Day_Count')).show()
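
To see what that aggregation computes without starting a Spark session, here is a minimal plain-Python sketch of the same logic. The names rows and count_distinct_days are made up for illustration and are not part of PySpark.

```python
from collections import defaultdict

# Sample rows from the question, as (Food, Day) pairs.
rows = [
    ("Chicken", 2), ("Chicken", 6), ("Chicken", 2),
    ("Beef", 3), ("Chicken", 4), ("Beef", 6), ("Beef", 7),
]


def count_distinct_days(pairs):
    """Return {food: number of distinct days} for (food, day) pairs."""
    days_by_food = defaultdict(set)
    for food, day in pairs:
        days_by_food[food].add(day)  # a set keeps each day only once
    return {food: len(days) for food, days in days_by_food.items()}


day_count = count_distinct_days(rows)
print(day_count)
```

Chicken appears on day 2 twice, but the set stores it once, which is the same effect countDistinct has within each group.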

Related

How to add an iterative id column which goes up when a value in another column resets to 1 in Postgresql

I have a SQL table which has two columns called seq and sub_seq as seen below. I would like to add a third column called id, which goes up by 1 every time the sub_seq starts again at 1 as shown in the table below.
seq sub_seq id
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 1 2
7 2 2
8 3 2
9 1 3
10 2 3
11 3 3
12 4 3
13 5 3
14 6 3
15 7 3
I could write a solution using plpgsql, however I would like to know if there is a way of doing this in standard SQL. Any help would be greatly appreciated.
If sub_seq is always a running sequence that increases by 1 along with seq until it resets, then the difference seq - sub_seq stays constant within each run. You can therefore use the DENSE_RANK function ordered by that difference:
SELECT seq, sub_Seq, DENSE_RANK() OVER (ORDER BY seq-sub_Seq) AS id
FROM tableDemo
This solution is based on the sample data you have provided. More sample data would help to check whether it covers the whole scenario.
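
As a quick check of that idea against the sample table, here is a small Python sketch. The function names are hypothetical. The first function mimics DENSE_RANK() OVER (ORDER BY seq - sub_seq). The second is an alternative that does not come from the answer: it counts how many times sub_seq has been 1 so far, which should also work when the runs are not gap-free.

```python
# (seq, sub_seq) pairs from the question's sample table.
rows = [
    (1, 1), (2, 2), (3, 3), (4, 4), (5, 5),
    (6, 1), (7, 2), (8, 3),
    (9, 1), (10, 2), (11, 3), (12, 4), (13, 5), (14, 6), (15, 7),
]


def ids_by_dense_rank(pairs):
    """Mimic DENSE_RANK() OVER (ORDER BY seq - sub_seq)."""
    diffs = [seq - sub_seq for seq, sub_seq in pairs]
    # Dense rank: distinct differences numbered 1, 2, 3, ... in sorted order.
    rank_of = {d: r for r, d in enumerate(sorted(set(diffs)), start=1)}
    return [rank_of[d] for d in diffs]


def ids_by_reset_count(pairs):
    """Alternative (an assumption, not from the answer): running count of sub_seq == 1."""
    ids, current = [], 0
    for _, sub_seq in sorted(pairs):  # process rows in seq order
        if sub_seq == 1:
            current += 1
        ids.append(current)
    return ids


print(ids_by_dense_rank(rows))
print(ids_by_reset_count(rows))
```

Both functions return the id column shown in the question for this data. In SQL, the second approach would correspond to a running SUM(CASE WHEN sub_seq = 1 THEN 1 ELSE 0 END) OVER (ORDER BY seq).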

How to average monthly data using a specific method in Matlab?

I have the following vector of monthly values (vectorA). I put the date related info next to it to help illustrate the task but I work with just the vector itself
dates month_in_q vectorA
31/01/2020 1 10
29/02/2020 2 15
31/03/2020 3 6
30/04/2020 1 8
31/05/2020 2 4
30/06/2020 3 3
How can I create a new vector, vectorNEW, according to this algorithm?
In each quarter, the first month is the original first month.
In each quarter, the second month is the average of the first and second months.
In each quarter, the third month is the average of all three months.
That is, I want to get the following vectorNEW by manipulating the original vectorA in a loop, given the recurring pattern above:
dates month_in_q vectorA vectorNEW
31/01/2020 1 10 10
29/02/2020 2 15 AVG(10+15)
31/03/2020 3 6 AVG(10+15+6)
30/04/2020 1 8 8
31/05/2020 2 4 AVG(8+4)
30/06/2020 3 3 AVG(8+4+3)
... ... ... ...
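An elegant solution was provided by the user dpb on the MathWorks website:
vectorNEW=reshape(cumsum(reshape(vectorA,3,[]))./[1:3].',[],1);
It reshapes vectorA into one 3-row column per quarter, takes the cumulative sum down each column, divides the rows by 1, 2 and 3, and reshapes the result back into a single column. The length of vectorA must be a multiple of 3.
Further information: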
https://uk.mathworks.com/matlabcentral/answers/823055-how-to-average-monthly-data-using-a-specific-method-in-matlab
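
For readers who want to follow the arithmetic step by step, the same quarter-to-date averaging can be sketched in plain Python. This is only an illustration of the algorithm, not a translation endorsed by the MATLAB answer, and the name quarter_to_date_average is made up here.

```python
from itertools import accumulate


def quarter_to_date_average(values, period=3):
    """Average each value with the preceding values of the same quarter."""
    if len(values) % period != 0:
        raise ValueError("length must be a multiple of the period")
    result = []
    for start in range(0, len(values), period):
        block = values[start:start + period]
        # Running totals within the quarter, divided by 1, 2, 3, ...
        running_totals = accumulate(block)
        result.extend(total / n for n, total in enumerate(running_totals, start=1))
    return result


vector_a = [10, 15, 6, 8, 4, 3]
vector_new = quarter_to_date_average(vector_a)
print(vector_new)
```

For the sample data the second entry is (10 + 15) / 2 = 12.5 and the sixth is (8 + 4 + 3) / 3 = 5.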

Rolling N monthly average in Redshift with multiple entries per month

I would like to use Redshift's window aggregation functions to create an 'N' month rolling average of some data. The data will have multiple unique entries per any given month. If possible, I'd like to avoid first grouping by and averaging within months before performing rolling average as this is taking an average of an average and not ideal (as this post does: 3 Month Moving Average - Redshift SQL).
This is a sample dataset of just one account (there will be more than 1).
Quote_Date Account Value
3/24/2015 acme 3
3/25/2015 acme 7
4/1/2015 acme 12
4/3/2015 acme 17
5/15/2015 acme 1
6/30/2015 acme 3
7/30/2015 acme 9
And this is what I would like the result to look like for a 3 month rolling average (for an example).
Quote_Date Account Value Month 3M_Rolling_Average
3/24/2015 acme 3 1 3
3/25/2015 acme 7 1 5
4/1/2015 acme 12 2 7.33
4/3/2015 acme 17 2 9.75
5/15/2015 acme 1 3 8
6/30/2015 acme 3 4 8.25
7/30/2015 acme 9 5 4.33
The code I have tried looks like this:
avg(Value) over (partition by Account order by Quote_Date rows between 2 preceding and current row)
But this only averages over the current row and the 2 rows before it. That would work if I had exactly one value per month, but as stated, this is not the case. I am open to any kind of ranking solution or nested partitioning. Any help is greatly appreciated.
Since an average is just sum() / count(), you only need to group by month and compute the sum() and count() for each month. Then use lag to add up 3 months of sums and divide by the total of the same 3 months of counts. You are correct that an average of averages is not correct, but if you carry the sums and counts forward, the result is the true average over all rows in the window.
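
The idea of carrying sums and counts can be sketched in Python as follows. This is an illustration under assumptions, not Redshift code: the names quotes and rolling_monthly_average are hypothetical, and the window is looked up by calendar month, so a month with no rows simply contributes nothing.

```python
from collections import defaultdict
from datetime import date

# Sample rows from the question: (quote date, account, value).
quotes = [
    (date(2015, 3, 24), "acme", 3),
    (date(2015, 3, 25), "acme", 7),
    (date(2015, 4, 1), "acme", 12),
    (date(2015, 4, 3), "acme", 17),
    (date(2015, 5, 15), "acme", 1),
    (date(2015, 6, 30), "acme", 3),
    (date(2015, 7, 30), "acme", 9),
]


def rolling_monthly_average(rows, window=3):
    """Return {(account, year, month): average of all rows in the trailing window}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for quote_date, account, value in rows:
        # Months are numbered consecutively so that windows can cross year ends.
        key = (account, quote_date.year * 12 + quote_date.month - 1)
        sums[key] += value
        counts[key] += 1

    result = {}
    for account, month_index in sorted(sums):
        # Add up the sums and counts of the current month and the months before it.
        total = sum(sums.get((account, month_index - k), 0.0) for k in range(window))
        n = sum(counts.get((account, month_index - k), 0) for k in range(window))
        year, month0 = divmod(month_index, 12)
        result[(account, year, month0 + 1)] = total / n
    return result


averages = rolling_monthly_average(quotes)
for key in sorted(averages):
    print(key, round(averages[key], 2))
```

Each monthly figure equals the value shown in the question for the last row of that month. For example, April gives (3 + 7 + 12 + 17) / 4 = 9.75.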

Calculating MAX(DATE) for Value Groups Where Values Go Back and Forth

I have another challenge that I am trying to resolve but unable to get the solution yet. Here is the scenario. Pardon the formatting if it messes up at the time of posting.
ACCT_NUM CERT_ID Code Date Desired Output
A 1 10 1/1/2007 1/1/2008
A 1 10 1/1/2008 1/1/2008
A 1 20 1/1/2009 1/1/2010
A 1 20 1/1/2010 1/1/2010
A 1 10 1/1/2011 1/1/2012
A 1 10 1/1/2012 1/1/2012
A 2 20 1/1/2007 1/1/2008
A 2 20 1/1/2008 1/1/2008
A 2 10 1/1/2009 1/1/2010
A 2 10 1/1/2010 1/1/2010
A 2 30 1/1/2011 1/1/2011
A 2 10 1/1/2012 1/1/2013
A 2 10 1/1/2013 1/1/2013
As you can see, I need to take the MAX of the date for each group of consecutive Code values (within each ACCT_NUM and CERT_ID) before the value changes. If the same value appears again later, I need to take the MAX of the date again for that group separately. For example, for CERT_ID '1', I cannot group all four rows of Code 10 to get a MAX date of 1/1/2012. I need to get the MAX for the first two rows and then another MAX for the next two rows separately, since there is another code in between. I am trying to accomplish this in Cognos Framework Manager.
Gurus, please advise.
The syntax for getting the max value for CERT_ID is:
maximum(Date for CERT_ID)
If you want additional level/s for max you can use the following syntax:
maximum(Date for ACCT_NUM,CERT_ID,Code)
In general, it is best practice to group and summarize values in the report, not in Framework Manager.
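
For reference, the grouping the question describes (a separate maximum for each consecutive run of Code within an ACCT_NUM and CERT_ID) can be sketched outside Cognos in Python. This is only an illustration of the desired result, not Cognos syntax, and the name max_date_per_run is hypothetical.

```python
from datetime import date
from itertools import groupby

# Sample rows from the question: (ACCT_NUM, CERT_ID, Code, Date).
rows = [
    ("A", 1, 10, date(2007, 1, 1)), ("A", 1, 10, date(2008, 1, 1)),
    ("A", 1, 20, date(2009, 1, 1)), ("A", 1, 20, date(2010, 1, 1)),
    ("A", 1, 10, date(2011, 1, 1)), ("A", 1, 10, date(2012, 1, 1)),
    ("A", 2, 20, date(2007, 1, 1)), ("A", 2, 20, date(2008, 1, 1)),
    ("A", 2, 10, date(2009, 1, 1)), ("A", 2, 10, date(2010, 1, 1)),
    ("A", 2, 30, date(2011, 1, 1)),
    ("A", 2, 10, date(2012, 1, 1)), ("A", 2, 10, date(2013, 1, 1)),
]


def max_date_per_run(records):
    """Append to each row the latest date of its consecutive run of Code."""
    # Order by account, certificate and date so runs are adjacent.
    ordered = sorted(records, key=lambda r: (r[0], r[1], r[3]))
    out = []
    # groupby only merges neighbouring rows, so a Code that comes back
    # later starts a new group instead of joining the earlier one.
    for _, run in groupby(ordered, key=lambda r: (r[0], r[1], r[2])):
        run = list(run)
        run_max = max(r[3] for r in run)
        out.extend(r + (run_max,) for r in run)
    return out


result = max_date_per_run(rows)
for row in result:
    print(row)
```

The last element of each printed row corresponds to the Desired Output column in the question.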

Crystal Reports groups in multiple columns page breaking after row

I have a crystal report that I formatted to have the groups display in multiple (4) columns across the page, using the Format with Multiple Columns and Format Groups with multiple column options in the section expert.
The data driving the report is (snake draft):
round pick team name
===== ==== ==========
1 1 Charlie
1 2 Bob
1 3 Sam
1 4 Kevin
2 1 Kevin
2 2 Sam
2 3 Bob
2 4 Charlie
3 1 Charlie
3 2 Bob
3 3 Sam
3 4 Kevin
4 1 Kevin
4 2 Sam
4 3 Bob
4 4 Charlie
5 1 Charlie
5 2 Bob
5 3 Sam
5 4 Kevin
6 1 Kevin
6 2 Sam
6 3 Bob
6 4 Charlie
I want the output to look like:
Round 1 Round 2 Round 3 Round 4
1 Charlie 1 Kevin 1 Charlie 1 Kevin
2 Bob 2 Sam 2 Bob 2 Sam
3 Sam 3 Bob 3 Sam 3 Bob
4 Kevin 4 Charlie 4 Kevin 4 Charlie
Round 5 Round 6
1 Charlie 1 Kevin
2 Bob 2 Sam
3 Sam 3 Bob
4 Kevin 4 Charlie
I have 2 problems:
1) I don't need the group footer, but, if I suppress it, the groups don't display in multiple columns across the page, they just go down the page.
2) If I don't suppress the group footer, the first row of columns displays exactly like I want it, but, there is a page break between the rows of columns. So, rounds 5 and 6 appear on the next page. I have verified that all of the 'new page' options are unchecked in the section expert.
I have the "Round #" in the group header and that is showing correctly. One thing that is weird is that if I put a literal in the group footer, it shows at the very bottom of the page below each column. It's as if CR wants to use the entire page for the column height, even though it doesn't need it.
Can anyone help me out?
Thanks,
Dan
Keep the round information in the main report. Put the "Round #" in the details section and format it with multiple columns (like you've done).
Move the pick and team information to a subreport and format it as desired. Place the subreport in a second details section and link the main report to the subreport on round.