Spark SQL group by and sum: changing the column name? - scala

In this data frame I am finding the total salary for each group. In Oracle I'd use this code:
select job_id,sum(salary) as "Total" from hr.employees group by job_id;
I tried the same in Spark SQL, but I am facing two issues:
empData.groupBy($"job_id").sum("salary").alias("Total").show()
The alias "Total" is not displayed; instead the column is shown as "sum(salary)".
I could not use $ (the Scala column syntax, I think) inside sum(). I get a compilation error with:
empData.groupBy($"job_id").sum($"salary").alias("Total").show()
Any idea?

Use the aggregate method .agg() if you want to provide an alias name. It accepts the Scala column syntax ($"..."):
empData.groupBy($"job_id").agg(sum($"salary") as "Total").show()
If you don't want to use .agg(), an alias can also be provided using .select():
empData.groupBy($"job_id").sum("salary").select($"job_id", $"sum(salary)".alias("Total")).show()

Related

Cannot use Named Parameters with SSRS and PostgreSQL

I'm trying to add named parameters to a dataset query in an SSRS report (I'm using Report Builder), but I have had no luck discovering the correct syntax. I have tried #parameter, $1, $parameter and others, all without success. I suspect the syntax is just different for PostgreSQL versus normal SQL.
The only success I have had with passing parameters was based on this answer.
It involves using ? for every single parameter.
My query might look something like this:
SELECT address, code, remarks FROM table_1 WHERE date BETWEEN ? AND ? AND apt_num IS NULL AND ADDRESS = ?
This does work, but in the case of a query where I pass the same parameter to more than one part of the SELECT statement, I have to add the same parameter to the list multiple times as shown here. They are passed in this order, so adding a new parameter to an existing query results in having to reshuffle, and sometimes completely rebuild, the query parameters tab.
What are the proper syntax and naming requirements for adding named Parameters when using a PostgreSQL data source in SSRS?
From my comment, this is what it would look like with a regular join:
with inparms as (
select ? as from_date, ? as to_date, ? as address
)
select t.address, t.code, t.remarks
from inparms i
join table_1 t
on t.date between i.from_date and i.to_date
and t.apt_num is null
and t.address = i.address;
I said cross join in my comment because it is sometimes quicker when retrofitting somebody else's SQL instead of trying to untangle things (thinking of a friend who uses right join sometimes just to ruin my day).
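Outside SSRS, the same pattern is easy to sanity-check over plain JDBC. This is only a hypothetical sketch (the connection string, dates and address are placeholder values): each ? in the inparms CTE is bound exactly once, in order.
import java.sql.{Date, DriverManager}

val conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass")  // placeholder connection details
val sql =
  """with inparms as (select ? as from_date, ? as to_date, ? as address)
    |select t.address, t.code, t.remarks
    |from inparms i
    |join table_1 t
    |  on t.date between i.from_date and i.to_date
    | and t.apt_num is null
    | and t.address = i.address""".stripMargin

val ps = conn.prepareStatement(sql)
ps.setDate(1, Date.valueOf("2020-01-01"))   // from_date (placeholder value)
ps.setDate(2, Date.valueOf("2020-12-31"))   // to_date (placeholder value)
ps.setString(3, "123 Main St")              // address (placeholder value)
val rs = ps.executeQuery()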

Scala DataFrame columns with spaces: save as a Databricks table

I am working in a Databricks notebook (Scala) and I have a Spark query that goes something like this:
df = spark.sql("SELECT columnName AS `Column Name` FROM table")
I want to store this as a Databricks table. I tried the code below:
df.write.mode("overwrite").saveAsTable("df")
But it is giving an error because of the space in the column name. Here's the error:
Attribute name contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
I don't want to remove the space, so is there any alternative?
No, that's a limitation of the underlying technologies used by Databricks under the hood (for example, PARQUET-677). The only solution here is to rename the column, and if you need the space in the name, rename it again when reading the data back.
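A rough sketch of that workaround, with hypothetical names (the table name "my_table" and the underscore convention are just examples): strip the spaces before saving, and alias them back when reading.
// Replace spaces with underscores before saving (names here are illustrative).
val safeDf = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumnRenamed(name, name.replace(" ", "_"))
}
safeDf.write.mode("overwrite").saveAsTable("my_table")

// When reading the table back, re-introduce the space for display purposes.
val restored = spark.table("my_table").withColumnRenamed("Column_Name", "Column Name")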

PySpark: finding the mean of a variable excluding the top 1 percentile of data

I have a dataset which is grouped by multiple variables, and we are finding aggregates like mean, std dev, etc. Now I want to find the mean of a variable excluding the top 1 percentile of the data.
I am trying something like:
df_final = df.groupby(groupbyElement).agg(
    mean('value').alias('Mean'),
    stddev('value').alias('Stddev'),
    expr('percentile(value, array(0.99))')[0].alias('99_percentile'),
    mean(when(col('value') <= col('99_percentile'), col('value'))))
But it seems Spark cannot use an aggregate alias that is defined in the same .agg() statement.
I even tried this:
df_final = df.groupby(groupbyElement).agg(
    mean('value').alias('Mean'),
    stddev('value').alias('Stddev'),
    mean(when(col('value') <= expr('percentile(value, array(0.99))')[0], col('value'))))
But it throws the error below:
pyspark.sql.utils.AnalysisException: 'It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.
I hope someone will be able to answer this.
Update:
I tried doing it the other way around.
Here's a straightforward modification of your code. It will aggregate df twice. As far as I can tell, that's what is required.
from pyspark.sql.functions import col, expr, mean, stddev, when

df_final = (
    df.join(df
            .groupby(groupbyElement)
            .agg(expr('percentile(value, array(0.99))')[0].alias('99_percentile')),
            on=groupbyElement, how="left")   # join the per-group percentile back onto df
    .groupby(groupbyElement)
    .agg(mean('value').alias('Mean'),
         stddev('value').alias('Stddev'),
         mean(when(col('value') <= col('99_percentile'), col('value'))))
)

SSRS multi-value parameter - can't get it to work

First off, this is my first attempt at a multi-select. I've done a lot of searching but I can't find an answer that works for me.
I have a PostgreSQL query which has bg.revision_key in (_revision_key), which holds the parameter. As a side note, we've named all our parameters in the queries with the underscore and they all work; they are single-select in SSRS.
In my SSRS report I have a parameter called Revision Key Segment, which is the multi-select parameter. I've ticked Allow multi value, and in Available Values I have the value field pointing to revision_key in the dataset.
In my dataset parameter options I have Parameter Value [#revision_key]
In my shared dataset I also have my parameter set to Allow multi value.
For some reason I can't seem to get the multi select to work so I must be missing something somewhere but I've ran out of ideas.
Unlike with SQL Server, when you connect to a database using an ODBC connection, the parameter support is different. You cannot use named parameters and instead have to use the ? syntax.
In order to accommodate multiple values you can concatenate them into a single string and use a like statement to search them. However, this is inefficient. Another approach is to use a function to split the values into an in-line table.
In PostgreSQL you can use an expression like this:
inner join (select CAST(regexp_split_to_table(?, ',') AS int) as filter) as my on my.filter = key_column
Then in the dataset properties, under the parameters tab, use an expression like this to concatenate the values:
=Join(Parameters!Keys.Value, ",")
In other words, the report is concatenating the values into a comma-separated list. The database is splitting them into a table of integers then inner joining on the values.

sqlalchemy group_by error

The following works
s = select([tsr.c.kod]).where(tsr.c.rr=='10').group_by(tsr.c.kod)
and this does not:
s = select([tsr.c.kod, tsr.c.rr, any fields]).where(tsr.c.rr=='10').group_by(tsr.c.kod)
Why?
thx.
It doesn't work because the query isn't valid like that.
Every column needs to be in the group_by or needs an aggregate (e.g. max(), min(), whatever) according to the SQL standard. Most databases have always complied with this, but there are a few exceptions.
MySQL has always been the odd one in this regard, within MySQL this behaviour depends on the ONLY_FULL_GROUP_BY setting: https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html
I would personally recommend setting the sql_mode setting to ANSI. That way you're largely compliant with the SQL standard, which will help you in the future if you ever need to use (or migrate to) a standards-compliant database such as PostgreSQL.
What you are trying to do is somehow valid in MySQL, but invalid in standard SQL, PostgreSQL, and common sense. When you group rows by 'kod', each row in a group has the same 'kod' value but, for example, different values for 'rr'. With aggregate functions you can get some aspect of the values in such a column for each group; for example
select kod, max(rr) from table group by kod
will give you a list of 'kod's and the maximum 'rr' in each group (grouped by 'kod').
That being said, in the SELECT clause you can only put columns from the GROUP BY clause and/or aggregate functions of other columns. You can put whatever you like in WHERE - it is used for filtering. You can also add a HAVING clause after the GROUP BY containing an aggregate-function expression, which acts as post-group filtering.