Pivot aggregation filling columns with values from the same row in PySpark

I need to do a pivot aggregation, filling the columns with the answer value from the same row.
Below is an example. Thank you!
Input

id | question | answer
1  | quest_1  | Good
1  | quest_2  | Bad
2  | quest_1  | Bad
2  | quest_2  | Good
2  | quest_3  | Quite Good

Output

id | quest_1 | quest_2 | quest_3
1  | Good    | Bad     | NULL
2  | Bad     | Good    | Quite Good

Do a pivot on the question column and take the first answer per group:

from pyspark.sql.functions import first
df.groupBy('id').pivot('question').agg(first('answer')).show()
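For reference, here is a minimal self-contained sketch that rebuilds the sample input from the question and applies the same pivot (it assumes a local SparkSession; column and variable names follow the example above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import first

spark = SparkSession.builder.getOrCreate()

# Sample input from the question
df = spark.createDataFrame(
    [(1, 'quest_1', 'Good'),
     (1, 'quest_2', 'Bad'),
     (2, 'quest_1', 'Bad'),
     (2, 'quest_2', 'Good'),
     (2, 'quest_3', 'Quite Good')],
    ['id', 'question', 'answer'])

# pivot() creates one column per distinct value of question; ids that have no
# row for a given question get NULL in that column, as in the expected output
df.groupBy('id').pivot('question').agg(first('answer')).show()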

Related

Tableau counting multiple blanks

I have a dataset containing roughly 50 columns. I want to check the data inputs and show how many blanks, if any, are in each column.
I originally started creating calculated fields per column using SUM(IF ISNULL([column]) THEN 1 ELSE 0 END). This works, but it doesn't seem very efficient if I want to do it for multiple columns.
I tried doing a pivot table as well, however I need the data in its original form for other analysis I would like to do.
Would appreciate any help.
Thanks
Fiona

How to split one column into two according to a third column in knex.js?

I have a task that I have been cracking my head over.
So I have this table transactions and it has 2 columns, bonus and type, like:

bonus | type
20    | 1
15    | -1
What I want is to have a query with the bonus column divided into two columns, bonus_spent and bonus_left, by type.
It should probably look like this one:

bonus_left | bonus_spent
20         | 15
I know I can duplicate the table and join the copies with a WHERE clause, but is there any way I can do this operation in a single query?
In vanilla SQL you would use conditional aggregation. We use the user_id column, which indicates who the bonus belongs to, and I've used SUM for aggregation to allow for there being more than one bonus of each type:
SELECT user_id,
SUM(CASE WHEN type = 1 THEN bonus ELSE 0 END) AS bonus_left,
SUM(CASE WHEN type = -1 THEN bonus ELSE 0 END) AS bonus_spent
FROM transactions
GROUP BY user_id
Output:
user_id | bonus_left | bonus_spent
1       | 20         | 15
Demo on dbfiddle
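As an aside (not part of the original answer): since the parent question on this page is about PySpark, the same conditional aggregation might be sketched there as follows, assuming a transactions DataFrame with user_id, bonus and type columns:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data mirroring the question: (user_id, bonus, type)
transactions = spark.createDataFrame([(1, 20, 1), (1, 15, -1)],
                                     ['user_id', 'bonus', 'type'])

# Conditional aggregation: sum bonus only for rows matching each type
(transactions
    .groupBy('user_id')
    .agg(F.sum(F.when(F.col('type') == 1, F.col('bonus')).otherwise(0)).alias('bonus_left'),
         F.sum(F.when(F.col('type') == -1, F.col('bonus')).otherwise(0)).alias('bonus_spent'))
    .show())
# user_id=1, bonus_left=20, bonus_spent=15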
I agree with Nick and you should mark that answer correct IMHO. For completeness and some Knex:
knex('users AS u')
  .join('transactions AS t', 'u.id', 't.user_id')
  .select('u.id', 'u.name')
  // the object form of .sum() lets Knex put the alias outside the SUM()
  .sum({ bonus_left: knex.raw('CASE WHEN t.type = 1 THEN t.bonus ELSE 0 END') })
  .sum({ bonus_spent: knex.raw('CASE WHEN t.type = -1 THEN t.bonus ELSE 0 END') })
  .groupBy('u.id', 'u.name')
Note that, lacking your table schema, this is untested. It'll look roughly like this though. You could also just embed the two SUMs as knex.raw in the select list, but this is perhaps a little more organised.
Consider creating the type as a Postgres enum. This would allow you to avoid having to remember what a 'magic number' is in your table, instead writing comparisons like:
CASE WHEN type = 'bonus_left'
It also stops you from accidentally entering some other integer, like 99, because Postgres will type-check the insertion.
I have a nagging concern that having bonus 'left' vs 'spent' in the same table reflects a wider problem with the schema (for example, why isn't the total amount of bonus remaining the only value we need to track?) but perhaps that's just my paranoia!

SUM the NUMC field in SELECT

I need to group a table by the sum of a NUMC-column, which unfortunately seems not to be possible with ABAP / OpenSQL.
My code looks like that:
SELECT z~anln1
FROM zzanla AS z
INTO TABLE gt_
GROUP BY z~anln1 z~anln2
HAVING SUM( z~percent ) <> 100 " percent unfortunately is a NUMC -> summing up not possible
What would be the best / easiest practices here as I cannot alter the table itself?
Unfortunately the NUMC type is described as numeric text, so in the end it lands in the database as VARCHAR, which is why functions like SUM or AVG cannot be used.
It all depends on how big your table is. If it is rather small, you could read the group fields and the values to be summed into an internal table, sum them using the COLLECT statement, and finally remove the rows for which the sum equals 100%.
One solution is to define the field in the table using a more appropriate type.
NUMC is often used for key fields - like document numbers, which there would never be a reason to add together.
I didn't find a smooth solution.
What I did was copy everything into an internal table and loop over it, converting the NUMC values to DEC values. Grouping and summing up worked at that point.
At the end, I converted the DEC values back to NUMC values.
It's been a while. I came back to this post because someone voted up my original answer. I was thinking about editing my old answer, but I decided to post a new one. As this question was asked in 2017 there were some restrictions, but now it can be done by using the CAST function in the new OpenSQL:
SELECT z~anln1
FROM zzanla AS z
INTO TABLE @gt_
GROUP BY z~anln1, z~anln2
HAVING SUM( CAST( z~percent AS INT4 ) ) <> 100

pentaho distinct count over date

I am currently working on Pentaho and I have the following problem:
I want to get a "rolling" distinct count on a value, which ignores the "group by" performed by Business Analytics. For instance:
Date       | Field
2013-01-01 | A
2013-02-05 | B
2013-02-06 | A
2013-02-07 | A
2013-03-02 | C
2013-04-03 | B
When I use a classical "distinct count" aggregator in my schema, sum it, and then add "month" to the columns, I get:

Month   | Count | Sum
2013-01 | 1     | 1
2013-02 | 2     | 3
2013-03 | 1     | 4
2013-04 | 1     | 5
What I would like to get would be:
Month   | Sum
2013-01 | 1
2013-02 | 2
2013-03 | 3
2013-04 | 3
which is the distinct count of all Fields seen so far. Does anyone have any idea on this topic?
My database is Postgres, and I'm looking for any solution under PDI, PSW, PBA or PME.
Thank you!
A naive approach in PDI is the following:
Sort the rows by the Field column
Add a sequence for changing values in the Field column
Map all sequence values > 1 to zero
These first 3 effectively flag the first time a value was seen (no matter the date).
Sort the rows by year/month
Sum the mapped sequence values by year+month
Get a Cumulative Sum of all the previous sums
These 3 aggregate the distinct values per month, then keep a cumulative sum. In PDI this might look something like the transformation in the Gist linked below.
I posted a Gist of this transformation here.
A more efficient solution is to parallelize the two sorts, then join at the latest point possible. I posted this one as it is easier to explain, but it shouldn't be too difficult to take this transformation and make it more parallel.
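For what it's worth, here is a small plain-Python sketch of the same end result, purely to illustrate the logic (it uses a seen-set rather than the sort-and-sequence steps described above, and the data is the sample from the question):

# Sample data from the question: (date, field)
rows = [('2013-01-01', 'A'), ('2013-02-05', 'B'), ('2013-02-06', 'A'),
        ('2013-02-07', 'A'), ('2013-03-02', 'C'), ('2013-04-03', 'B')]

seen = set()
new_per_month = {}
for date, field in sorted(rows):           # process rows in date order
    month = date[:7]
    is_new = 0 if field in seen else 1     # flag the first time a value appears
    seen.add(field)
    new_per_month[month] = new_per_month.get(month, 0) + is_new

running = 0
for month in sorted(new_per_month):        # cumulative sum over months
    running += new_per_month[month]
    print(month, running)
# Prints: 2013-01 1, 2013-02 2, 2013-03 3, 2013-04 3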

Kdb find first not null value

While doing a group by in KDB, I have to find the first non-null value in that group for a column.
For example:
t:([]a:1 1 1 2;b:0n 1 3 4 )
select first b by a from t
I found one way to achieve this is:
select first b except 0n by a from t
I am not sure if it is a correct way to do this. Please provide suggestions.
It seems a good way to do it to me.
Two alternatives would include:
select first b where not null b by a from t
The benefit is that it doesn't rely on a particular column type and perhaps expresses your intent more clearly, but it is slightly longer. Or:
select b:last fills reverse b by a from t
Which on some test runs was the quickest way.
In kdb there's always multiple ways to do things and never really a right or wrong answer.
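As a closing aside (not from the original answers): in PySpark, the topic of the parent question on this page, the analogous "first non-null value per group" can be expressed with first's ignorenulls flag. A minimal sketch mirroring the kdb table above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Same shape as the kdb example: a = 1 1 1 2, b = null 1 3 4
t = spark.createDataFrame([(1, None), (1, 1.0), (1, 3.0), (2, 4.0)],
                          'a int, b double')

# ignorenulls=True makes first() skip leading nulls within each group;
# note that first() depends on row order, which is not guaranteed without a sort
t.groupBy('a').agg(F.first('b', ignorenulls=True).alias('b')).show()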