Aggregate on multiple columns in a Spark DataFrame (all combinations) - scala

I want to get the count of customers based on all combinations of the columns I have in a DataFrame.
For example, suppose I have a DataFrame with 5 columns:
id, col1, col2, col3, cust_id
I need the customer count for every combination:
id, col1, count(cust_id)
id, col1, col2, count(cust_id)
id, col1, col3, count(cust_id)
id, col1, col2, col3, count(cust_id)
id, col2, count(cust_id)
id, col2, col3, count(cust_id)
And so on for all permutations and combinations.
It is very difficult to do this separately, providing each different combination to the groupBy function of the DataFrame and then aggregating the customer count.
Is there any way we can achieve this and then combine all the results into one DataFrame, so that we can write the result to one output file?
To me it looks a bit complex; I'd really appreciate it if anyone could provide a solution. Please let me know if any more details are required.
Thanks a lot.

It is possible and it is called cube:
import org.apache.spark.sql.functions.count

df.cube("id", "col1", "col2", "col3")
  .agg(count("cust_id"))
  .na.drop(minNonNulls = 3) // To exclude some combinations
The SQL version also provides GROUPING SETS, which can be more efficient than cube followed by .na.drop.
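For example, a minimal Spark SQL sketch, assuming the DataFrame has been registered as a temporary view named cust (e.g. via df.createOrReplaceTempView("cust")) and that only the combinations listed in the question are wanted:

SELECT id, col1, col2, col3, COUNT(cust_id) AS cust_count
FROM cust
GROUP BY id, col1, col2, col3
GROUPING SETS ((id, col1), (id, col1, col2), (id, col1, col3), (id, col1, col2, col3), (id, col2), (id, col2, col3))

Unlike cube, this computes only the grouping combinations you enumerate, so nothing has to be filtered out afterwards.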

Related

Can the Custom SQL Query in a Tableau Dashboard accept a list of values in a Parameter?

I have a Tableau dashboard drawing data from a Vertica Database via a Custom SQL Query.
The database table contains more than 100 million rows, with a column COL1 indicated as primary key. Each COL1 value corresponds to exactly one row of data. Therefore COL1 is unique for all rows.
The Custom SQL Query below refreshes the dashboard whenever the parameter is updated.
SELECT COL1, COL2, COL3, COL4, COL5 FROM TABLE WHERE COL1=<Parameters.Col1Param>
Can the dashboard users input more than one value to get more than 1 row of data?
I have tried using the IN condition as below:
SELECT COL1, COL2, COL3, COL4, COL5 FROM TABLE WHERE COL1 IN (<Parameters.Col1Param>)
However, I can't seem to make this work with parameter values like Param1;Param2;Param3 or Param1,Param2,Param3.
I also tried including all values of COL1 and letting the user filter on the fly, but the database table is too large (over 100 million rows) for the dashboard to load into memory.
As always, minutes after posting a question on StackOverflow, I find a reasonable answer to my question.
The answer to this can be found here: Convert comma separated string to a list
SELECT COL1, COL2, COL3, COL4, COL5
FROM TABLE
WHERE COL1 IN (
    SELECT SPLIT_PART(<Parameters.Col1Param>, ';', row_num) AS params
    FROM (SELECT ROW_NUMBER() OVER () AS row_num FROM tables) row_nums
    WHERE SPLIT_PART(<Parameters.Col1Param>, ';', row_num) <> ''
)
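As a hedged illustration of what the inner query does (using a literal 'A;B;C' in place of the parameter): SPLIT_PART pulls out one delimited piece per generated row number, so the outer IN (...) effectively becomes COL1 IN ('A', 'B', 'C').

SELECT SPLIT_PART('A;B;C', ';', row_num) AS params
FROM (SELECT ROW_NUMBER() OVER () AS row_num FROM tables) row_nums
WHERE SPLIT_PART('A;B;C', ';', row_num) <> ''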

count distinct concat in BigQuery

In PostgreSQL I have used: count(distinct (col1, col2, col3, col4, col5))
In BigQuery I have tried: count(distinct concat(col1, col2, col3, col4, col5))
My scenario is that I need to get the same result in BigQuery as in PostgreSQL.
Though this approach works for 3 columns, I am not getting the same value as PostgreSQL for 5 columns.
Sample query:
select col1,
count(distinct concat(col1, col2, col3, col4, col5))
from table A
group by col1
When I remove distinct and concat, a simple count(col1, col2, col3, col4, col5) gives exactly the value populated in PostgreSQL. But I need the distinct count over these columns. Is there any way to achieve this? And does BigQuery's concat work differently?
Below are a few options for BigQuery Standard SQL.
#standardSQL
SELECT col1,
COUNT(DISTINCT TO_JSON_STRING((col1,col2,col3,col4,col5)))
FROM A
GROUP BY col1
OR
#standardSQL
SELECT col1,
COUNT(DISTINCT FORMAT('%T', [col1,col2,col3,col4,col5]))
FROM A
GROUP BY col1
An alternative suitable for the many databases that don't support that form of COUNT DISTINCT:
SELECT COUNT(*)
FROM (
SELECT DISTINCT Origin, Dest, Reporting_Airline
FROM `fh-bigquery.flights.ontime_201908`
WHERE FlightDate_year = "2018-01-01"
)
My guess on why CONCAT didn't work in your sample: Do you have any null columns?
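If any of the five columns can be NULL, that would explain the mismatch: in BigQuery, CONCAT returns NULL when any argument is NULL, and COUNT(DISTINCT ...) ignores NULLs, while TO_JSON_STRING keeps the row. A small illustration (the column names here are just placeholders):

#standardSQL
SELECT
  CONCAT('a', CAST(NULL AS STRING)) AS concat_result,                          -- NULL
  TO_JSON_STRING(STRUCT('a' AS c1, CAST(NULL AS STRING) AS c2)) AS json_result -- '{"c1":"a","c2":null}'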

Entity Framework: View exclusion without primary key

I am using SQL Server where I have designed a view to sum the results of two tables and I want the output to be a single table with the results. My query simplified is something like:
SELECT SUM(col1), col2, col3
FROM Table1
GROUP BY col2, col3
This gives me the data I want, but when updating my EDM the view is excluded because "a primary key cannot be inferred".
With a little research I modified the query to spoof an ID column, as follows:
SELECT ROW_NUMBER() OVER (ORDER BY col2) AS 'ID', SUM(col1), col2, col3
FROM Table1
GROUP BY col2, col3
This kind of query gives me a nice increasing set of ids. However, when I attempt to update my model it still excludes my view because it cannot infer a primary key. How can we use views that aggregate records and connect them with Linq-to-Entities?
As already discussed in the comments, you can try adding MAX(id) AS id to the view. Based on your feedback, this would become:
SELECT ISNULL(MAX(id), 0) AS ID,
       SUM(col1) AS col1_sum,
       col2,
       col3
FROM Table1
GROUP BY col2, col3
Another option is to try creating an index on the view:
CREATE UNIQUE CLUSTERED INDEX idx_view1 ON dbo.View1(id)
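For that index to be accepted, the view has to meet SQL Server's indexed-view requirements. A hedged sketch of what that route could look like (the schema name, the column aliases, and the assumption that col1 is non-nullable are mine; if col1 is nullable, use SUM(ISNULL(col1, 0))):

-- Indexed views require WITH SCHEMABINDING, two-part table names,
-- and COUNT_BIG(*) in the select list when there is a GROUP BY.
CREATE VIEW dbo.View1
WITH SCHEMABINDING
AS
SELECT col2,
       col3,
       SUM(col1)    AS col1_sum,
       COUNT_BIG(*) AS row_count
FROM dbo.Table1
GROUP BY col2, col3;
GO

-- The first index on the view must be unique and clustered;
-- the grouping columns are the natural key here.
CREATE UNIQUE CLUSTERED INDEX idx_view1 ON dbo.View1 (col2, col3);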
I use this in my ALTER VIEW statement:
ISNULL(ROW_NUMBER() OVER (ORDER BY ActionDate DESC), -1) AS RowID
I use this clause in views and queries that join multiple tables.
ROW_NUMBER never returns NULL, so the -1 fallback is never actually seen; the ISNULL wrapper is just there so EF can infer a non-nullable key column.
This is all I needed to add in order to import my view into EF6.
select ISNULL(1, 1) keyField

PostgreSQL - INSERT INTO statement

What I'm trying to do is select various rows from a certain table and insert them right back into the same table. My problem is that I keep running into the whole "duplicate PK" error - is there a way to skip the PK field when executing an INSERT INTO statement in PostgreSQL?
For example:
INSERT INTO reviews SELECT * FROM reviews WHERE rev_id=14;
The rev_id in the preceding SQL is the PK, which I somehow need to skip. (To clarify: I am using * in the SELECT statement because the number of table columns can increase dynamically.)
So finally, is there any way to skip the PK field?
Thanks in advance.
You can insert only the columns you want, so your PK will get auto-incremented:
insert into reviews (col1, col2, col3) select col1, col2, col3 from reviews where rev_id=14
Please do not retrieve/insert the id-column
insert into reviews (col0, col1, ...) select col0, col1, ... from reviews where rev_id=14;
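Since the question notes that the column list can grow, one hedged option (not part of the answers above) is to generate the non-key column list from the catalog and splice it into the INSERT ... SELECT in application code or a DO block:

-- Build the column list dynamically, excluding the primary key column,
-- so the statement keeps working when columns are added later.
SELECT string_agg(quote_ident(column_name), ', ' ORDER BY ordinal_position) AS column_list
FROM information_schema.columns
WHERE table_name = 'reviews'
  AND column_name <> 'rev_id';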

Copy three columns from one database table to another

I'm updating an iPhone app with a SQLite3 database. The users currently have a database on their phone, and I need to update three of its columns with new data (stored in a separate database) when the id of the rows match.
I've been able to attach the two databases and copy an entire table, but not just three columns.
database1
table1
id, col1, col2, col3, col4
database2
table1
id, col1, col2, col3, col4
I want to copy col1, col2, & col3 (not col4) from database1, table1 to database2, table1 if the ids match.
You could use a query along the following lines:
-- Insert the selected columns from the first database's table
-- into the second database's table (with both databases attached)
INSERT INTO db2.table2 (field1, field2, field3)
SELECT field1, field2, field3
FROM db1.table1
WHERE field1 = 1;
Hope this helps (and works for you) :)
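If the goal is specifically to overwrite col1, col2, and col3 in the rows that already exist in database2 (rather than insert new rows), here is a sketch using ATTACH and correlated subqueries, assuming the connection is open on database2 and the file name used in ATTACH is a placeholder:

ATTACH DATABASE 'database1.sqlite' AS db1;  -- file name is an assumption

UPDATE table1
SET col1 = (SELECT s.col1 FROM db1.table1 s WHERE s.id = table1.id),
    col2 = (SELECT s.col2 FROM db1.table1 s WHERE s.id = table1.id),
    col3 = (SELECT s.col3 FROM db1.table1 s WHERE s.id = table1.id)
WHERE id IN (SELECT id FROM db1.table1);

The WHERE clause restricts the update to rows whose id also exists in the attached database, so rows without a match keep their current values.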