Group by on multiple columns one by one - pyspark

I am new to pyspark, so I wanted to know: is there a better way to group by multiple columns one by one than looping over all the columns? Currently I use a loop to iterate over all the required group-by columns, but it takes a very long time. I have around 50-60 columns that I need to group by one at a time, applying the same aggregations to a fixed set of columns each time.
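Current code using a loop:
from pyspark.sql.functions import count, mean
for name in req_string_columns:
    # One aggregation job per grouping column; tmp is overwritten on each iteration
    tmp = (Selected_data.groupBy(name)
           .agg(mean("ABC"), mean("XYZ"), count("ABC"), count("XYZ"))
           .withColumnRenamed(name, "Category"))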
Is there any better way to do it?

Related

sum of multiple columns based on year condition

I am trying to add up columns to get the sum of sales made over the years.
How can I achieve this dynamically, without listing the columns manually like
F.sum(2015,2016...)
SUM() is an aggregation function: it works down a column, combining values from many rows, not across the columns of a single row. To add values across columns within a row, simply use +, e.g.
(`2015`+`2016`+...)
Having said all of that, if your objective is to support the operation in a dynamic way, I suggest normalising your data (columns to rows) so that the year becomes a single column with values of 2015, 2016, ... Doing so will allow you to use SUM() over the sales values, filtered or grouped by the year column.
Working with denormalised data is generally bad practice for several reasons (e.g. poor support for changing data, such as a new year value being added), and it is usually only employed in the final output for display/presentation purposes.
You can normalise the data using the STACK() function.
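To make the columns-to-rows idea concrete, here is a minimal plain-Python sketch of the logic only; it is not Spark code and does not call STACK(). The sample rows, the column names ("product", "year", "sales") and the helper names unpivot and total_sales are all hypothetical, chosen for illustration.

```python
from collections import defaultdict

# Hypothetical wide (denormalised) rows: one column per year.
wide_rows = [
    {"product": "A", "2015": 10, "2016": 20, "2017": 30},
    {"product": "B", "2015": 5, "2016": 15, "2017": 25},
]


def unpivot(rows, id_col):
    """Turn every non-id column into its own (id, year, sales) row."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col != id_col:
                long_rows.append(
                    {id_col: row[id_col], "year": int(col), "sales": value}
                )
    return long_rows


def total_sales(long_rows, id_col, first_year, last_year):
    """Sum sales per id over an inclusive range of years."""
    totals = defaultdict(int)
    for row in long_rows:
        if first_year <= row["year"] <= last_year:
            totals[row[id_col]] += row["sales"]
    return dict(totals)


long_rows = unpivot(wide_rows, "product")
print(total_sales(long_rows, "product", 2015, 2016))
```

Once the year is a value rather than a column name, the year condition becomes an ordinary filter, and a new year in the data needs no change to the summing code.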

DAX: Distinct and then aggregate twice

I'm trying to create a Measure in Power BI using DAX that achieves the below.
The data set has four columns, Name, Month, Country and Value. I have duplicates so first I need to dedupe across all four columns, then group by Month and sum up the value. And then, I need to average across the Month to arrive at a single value. How would I achieve this in DAX?
I figured it out. The reply by @OscarLar was very close, but nested SUMMARIZE causes problems because it cannot aggregate values calculated dynamically within the query itself (https://www.sqlbi.com/articles/nested-grouping-using-groupby-vs-summarize/).
I kept the inner SUMMARIZE from @OscarLar's answer and replaced the outer SUMMARIZE with a GROUPBY. Here's the code that worked.
AVERAGEX(
    GROUPBY(
        SUMMARIZE(Data, Data[Name], Data[Month], Data[Country], Data[Value]),
        Data[Month],
        "Month_Value", SUMX(CURRENTGROUP(), Data[Value])
    ),
    [Month_Value]
)
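For readers who want to check the three steps (dedupe, sum per month, average the monthly sums) outside Power BI, the following is a rough plain-Python sketch of the same logic. It is not DAX and is not how the engine evaluates the measure; the sample rows and the function name average_of_monthly_sums are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (Name, Month, Country, Value), with one exact duplicate.
rows = [
    ("Ann", "Jan", "US", 10),
    ("Ann", "Jan", "US", 10),  # duplicate of the row above
    ("Bob", "Jan", "UK", 20),
    ("Ann", "Feb", "US", 30),
]


def average_of_monthly_sums(rows):
    """Dedupe on all four fields, sum Value per Month, then average those sums."""
    distinct_rows = set(rows)  # step 1: drop exact duplicates

    monthly = defaultdict(float)  # step 2: one running total per month
    for _name, month, _country, value in distinct_rows:
        monthly[month] += value

    if not monthly:  # no data: nothing to average
        return None
    return sum(monthly.values()) / len(monthly)  # step 3: average across months


print(average_of_monthly_sums(rows))
```

With the duplicate removed, January sums to 30 and February to 30, so the result is 30.0; without the dedupe step January would be 40 and the average 35.0.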
I'm not sure I completely understood the question, since you didn't provide example data or any DAX code you've already tried. Please do so next time.
I'm assuming that, for whatever reason, parts of this cannot be done in Power Query, so you have to use DAX. In that case I think this will do what you described.
Create a temporary data table called Data_reduced in which duplicate rows have been removed.
Data_reduced =
SUMMARIZE(
    'Data';
    [Name];
    [Month];
    [Country];
    [Value]
)
Then create the averaging measure like this:
AveragePerMonth =
AVERAGEX(
    SUMMARIZE(
        'Data_reduced';
        'Data_reduced'[Month];
        "Sum_month"; SUM('Data_reduced'[Value])
    );
    [Sum_month]
)
Where Data is the name of the table.

Tallying unknown words across columns in Tableau (or from comma separated column)

I have an issue that I have been trying to solve for the better part of a week now. I have a large database (in Google Sheets) representing case studies. Some columns list multiple categories (in this example 'species', 'genera', and 'morphologies'), and I want to be able to tally how many times each category occurs in the data set.
I use Tableau to visualise the data, and the final output will be a large public Tableau dashboard. I know I can do a "find" based on a specific string, but I'd like the dataset to be dynamic and able to handle new data being added without having to update calculated fields. Is there a way of finding unique terms (either from a single column of comma-separated values, or from multiple columns) and tallying them?
Things I have tried so far:
1 - A pivot table in Tableau. Works well, but messes with all the other data, since it repeats lines.
2 - A pivot table on its own data source in Tableau. Also works well, and avoids the problem of messing with the other data. However, now each figure is disconnected from the others, so I can't build a large dashboard where everything is filtered by everything else (i.e. filtering species and genera by country at the same time).
3 - An SQL query() in Google Sheets, which finds all unique terms and counts them, so that they can then be plotted in Tableau. Also works well, but has a similar problem: the data is disconnected from all the other terms in the dataset.
Any ideas of a field calculation that will find, list and tally unique terms in a single comma separated column (or across multiple columns), without changing the data structure?
I have placed a sample data set here (Google Sheets), which is a smaller version of what I'm actually working on. In it I have marked the comma-separated columns in grey, and each is followed by a set of columns with the same values split out one per column. I only need to analyse one of those forms (i.e. either a calculation that separates the comma-separated values, or one that works across the multiple columns).
I've also added a sample Tableau workbook here.

How to list multiple rows within Table component specific column?

I have a Table and I want to display multiple rows in the same band, as in the image below.
I have tried adding a frame inside the table column and using a list component to list the multiple rows, but it doesn't work.
I used the following hierarchy:
Table-->Detail band-->Column-->Frame-->list component-->TextField
Can anyone help me solve this? Thanks in advance.
There are two ways of doing it.
Way 1:
Put a subreport into the last column and pass the data to it. This way there can be multiple fields in it.
Way 2:
Use groups together with custom styles (the styles and groups will have to be managed logically).
I think you should go for way 1 if the data in your last column does not span more than one page. Otherwise way 2 is applicable.

Joining two datasets to create a single tablix in report builder 3

I am attempting to join two datasets into one tablix for a report. The second dataset requires a personID from the first dataset as its parameter.
If I preview this report, only the first dataset is shown. For my final result, I would like each student's row to have a row grouping (?) of that student's modules with their month-to-month attendance. Can this be done in Report Builder?
The best practice here is to do the join within one dataset (i.e. joining in SQL).
But in cases where you need data from two separate cubes (SSAS), the way to do it is the following:
Select the main dataset for the Tablix.
Use the Lookup function to look up values from the second dataset, like this:
=Lookup(Fields!ProductID.Value, Fields!ID.Value, Fields!Name.Value, "Product")
Note: The granularity of the second dataset must match the first one.
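The granularity note is easier to see with a small model of what a first-match lookup does. The sketch below is plain Python, not SSRS expression code; the lookup helper, the field names and the sample rows are hypothetical, and it only mirrors the behaviour of returning the first matching row's value, or nothing when there is no match.

```python
def lookup(source_value, dataset, key_field, result_field):
    """Return result_field from the first row whose key_field equals source_value.

    Returns None when no row matches.
    """
    for row in dataset:
        if row[key_field] == source_value:
            return row[result_field]
    return None


# Hypothetical main dataset: one row per student.
students = [
    {"personID": 1, "name": "Ann"},
    {"personID": 2, "name": "Bob"},
]

# Hypothetical second dataset: finer granularity (several modules per student).
attendance = [
    {"personID": 1, "module": "Maths"},
    {"personID": 1, "module": "Physics"},
]

for student in students:
    module = lookup(student["personID"], attendance, "personID", "module")
    print(student["name"], module)
```

Here Ann gets only "Maths" even though she has two modules, and Bob gets None. That is why a one-value-per-row lookup fits best when the second dataset has at most one row per key.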
We had a similar issue, and it can be resolved this way.
First of all, ensure that the first dataset's query and the second dataset's query both work by executing them separately in a database client tool such as Datastudio.
Build two datasets in SSRS with the respective queries, and make sure both datasets have the same key column (personID).
In the SSRS report design, create a table from the toolbox and add the required columns from the first dataset along with the matching key column (personID). Add a new column and use the Lookup function to get the required column from the other dataset against the same key column (personID).