I want to group the data in such a way that, for a particular record, each of its array values is also used for grouping: if another record's name appears in this record's array, the two records should be merged into one group (and so on, transitively).
I am able to group by name only; I am not able to figure out how to do the rest.
I have tried the following query:
import pyspark.sql.functions as f
df.groupBy('name').agg(f.collect_list('data').alias('data_new')).show()
Here is the DataFrame:
|------|---------------------|
| name | data                |
|------|---------------------|
| a    | [a,b,c,d,e,f,g,h,i] |
| b    | [b,c,d,e,j,k]       |
| c    | [c,f,l,m]           |
| d    | [k,b,d]             |
| n    | [n,o,p,q]           |
| p    | [p,r,s,t]           |
| u    | [u,v,w,x]           |
| b    | [b,f,e,g]           |
| c    | [c,b,g,h]           |
| a    | [a,l,f,m]           |
|------|---------------------|
I am expecting the following output:
|------|-----------------------------|
| name | data                        |
|------|-----------------------------|
| a    | [a,b,c,d,e,f,g,h,i,j,k,l,m] |
| n    | [n,o,p,q,r,s,t]             |
| u    | [u,v,w,x]                   |
|------|-----------------------------|
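One possible approach (this is not shown in the question) is to treat the requirement as a connected-components problem. The sketch below is only an illustration under assumptions: it assumes the optional GraphFrames package is installed, that spark and df are the SparkSession and DataFrame from above, and that /tmp/cc-checkpoint is just a placeholder checkpoint directory.
import pyspark.sql.functions as f
from graphframes import GraphFrame  # assumption: the graphframes package is available

# connectedComponents() needs a checkpoint directory (placeholder path)
spark.sparkContext.setCheckpointDir('/tmp/cc-checkpoint')

# One row per (name, array element) pair
pairs = df.select('name', f.explode('data').alias('item'))

# Vertices are all names and all array elements; edges connect a name to its elements
vertices = (pairs.select(f.col('name').alias('id'))
                 .union(pairs.select(f.col('item').alias('id')))
                 .distinct())
edges = pairs.select(f.col('name').alias('src'), f.col('item').alias('dst'))

# Names reachable from each other through shared array values get the same component id
components = GraphFrame(vertices, edges).connectedComponents()

# Merge each record into its component, keep the smallest name as the key,
# and union the arrays with duplicates removed
result = (df.join(components, df['name'] == components['id'])
            .groupBy('component')
            .agg(f.min('name').alias('name'),
                 f.array_sort(f.array_distinct(f.flatten(f.collect_list('data')))).alias('data')))
result.select('name', 'data').show(truncate=False)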
I have a table like the following one:
+---------+-------+-------+-------------+--+
| Section | Group | Level | Fulfillment | |
+---------+-------+-------+-------------+--+
| A | Y | 1 | 82.2 | |
| A | Y | 2 | 23.2 | |
| A | M | 1 | 81.1 | |
| A | M | 2 | 28.2 | |
| B | Y | 1 | 89.1 | |
| B | Y | 2 | 58.2 | |
| B | M | 1 | 32.5 | |
| B | M | 2 | 21.4 | |
+---------+-------+-------+-------------+--+
And this would be my desired output:
+---------+-------+--------------------+--------------------+
| Section | Group | Level1_Fulfillment | Level2_Fulfillment |
+---------+-------+--------------------+--------------------+
| A | Y | 82.2 | 23.2 |
| A | M | 81.1 | 28.2 |
| B | Y | 89.1 | 58.2 |
| B | M | 32.5 | 21.4 |
+---------+-------+--------------------+--------------------+
Thus, for each section and group I'd like to obtain the fulfillment percentages for level 1 and level 2. To achieve this I've tried crosstab(), but the function returns an error ("The provided SQL must return 3 columns: rowid, category, and values.") because I'm using more than three columns (I need to keep both section and group as identifiers for each row). Is it possible to use crosstab() in this case?
Regards.
I find crosstab() unnecessarily complicated to use and prefer conditional aggregation:
select section,
       "group",
       max(fulfillment) filter (where level = 1) as level_1,
       max(fulfillment) filter (where level = 2) as level_2
from the_table
group by section, "group"
order by section;
I have this table in my database:
|------|------|
| id   | desc |
|------|------|
| 1    | A    |
| 2    | B    |
| NULL | C    |
| 3    | D    |
| NULL | D    |
| NULL | E    |
| 4    | F    |
|------|------|
And I want to transform this table into one that replaces the NULLs with consecutive negative ids:
|------|------|
| id   | desc |
|------|------|
| 1    | A    |
| 2    | B    |
| -1   | C    |
| 3    | D    |
| -2   | D    |
| -3   | E    |
| 4    | F    |
|------|------|
Does anyone know how I can do this in Hive?
The approach below works. Note that desc is a reserved word in Hive, so it is quoted with backticks:
select coalesce(id, concat('-', row_number() over (partition by id))) as id,
       `desc`
from database_name.table_name;
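A possible variant (not part of the answer above) that keeps the id column numeric instead of turning it into a string: all rows with a NULL id fall into the same window partition, so row_number() numbers them 1, 2, 3, ... and multiplying by -1 yields consecutive negative ids. The ordering inside the window is an assumption; any deterministic column works.
select coalesce(id, -1 * row_number() over (partition by id order by `desc`)) as id,
       `desc`
from database_name.table_name;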
I am trying to understand how to pivot data within T-SQL but can't seem to get it working. I have the following table structure:
+-------------------+-----------------------+
| Name | Value |
+-------------------+-----------------------+
| TaskId | 12417 |
| TaskUid | XX00044497 |
| TaskDefId | 23 |
| TaskStatusId | 4 |
| Notes | |
| TaskActivityIndex | 0 |
| ModifiedBy | Orange |
| Modified | /Date(1554540200000)/ |
| CreatedBy | Apple |
| Created | /Date(2121212100000)/ |
| TaskPriorityId | 40 |
| OId | 2 |
+-------------------+-----------------------+
I want to pivot the Name column so that its values become columns. Expected output:
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
| TASKID | TASKUID | TASKDEFID | TASKSTATUSID | NOTES | TASKACTIVITYINDEX | MODIFIEDBY | MODIFIED | CREATEDBY | CREATED | TASKPRIORITYID | OID |
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
| 12417 | XX00044497 | 23 | 4 | | 0 | Orange | /Date(1554540200000)/ | Apple | /Date(2121212100000)/ | 40 | 2 |
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
Is there an easy way of doing it? The columns are fixed (not dynamic).
Any help is appreciated.
Try this:
select *
from yourtable
pivot
(
    min(Value)
    for Name in ([TaskId], [TaskUid], [TaskDefId], [TaskStatusId], [Notes],
                 [TaskActivityIndex], [ModifiedBy], [Modified], [CreatedBy],
                 [Created], [TaskPriorityId], [OId])
) as pivotable
Note that you must use an aggregate function inside the PIVOT.
You can also use CASE expressions instead of PIVOT; a sketch is included at the end of this answer.
If you want to learn more, here is the reference:
https://learn.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-2017
I verified the output in a DB<>Fiddle, trying only three of the columns.
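Here is the CASE-based alternative mentioned above as a rough sketch (it assumes the key/value table is named yourtable with columns Name and Value, as in the PIVOT example): each max(case ...) picks out the value for one key, and with no GROUP BY the whole table collapses to a single row.
select max(case when Name = 'TaskId'            then Value end) as TaskId,
       max(case when Name = 'TaskUid'           then Value end) as TaskUid,
       max(case when Name = 'TaskDefId'         then Value end) as TaskDefId,
       max(case when Name = 'TaskStatusId'      then Value end) as TaskStatusId,
       max(case when Name = 'Notes'             then Value end) as Notes,
       max(case when Name = 'TaskActivityIndex' then Value end) as TaskActivityIndex,
       max(case when Name = 'ModifiedBy'        then Value end) as ModifiedBy,
       max(case when Name = 'Modified'          then Value end) as Modified,
       max(case when Name = 'CreatedBy'         then Value end) as CreatedBy,
       max(case when Name = 'Created'           then Value end) as Created,
       max(case when Name = 'TaskPriorityId'    then Value end) as TaskPriorityId,
       max(case when Name = 'OId'               then Value end) as OId
from yourtable;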
I'm trying to get the last four digits of the field "SERIAL8" and put that in a new field called "SS4". Here is the query I'm trying to use, but it isn't working. I'm new at this, so any help would be appreciated.
SELECT * FROM CUSTOMER_TABLE
SUBSTRING (SERIAL,4,4) as 'SS4'
CUSTOMER_TABLE
+-----------------------+------------+----------+--+
| "Complaint Full Date" | Source | SERIAL | |
+-----------------------+------------+----------+--+
| 02/04/16 | DAPIS_CAIR | DG540732 | |
| 04/18/16 | DAPIS_CAIR | DG553384 | |
| 03/23/17 | RO | DG559515 | |
| 03/29/16 | CAIR | DG559781 | |
| 12/10/14 | DAPIS_CAIR | DG561621 | |
+-----------------------+------------+----------+--+
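No answer is quoted for this question, so here is a minimal sketch of one way to do it (assuming SQL Server or another dialect with RIGHT(), and the CUSTOMER_TABLE/SERIAL names shown in the table; the prose mentions SERIAL8, so adjust the column name if yours differs):
select *,
       right(SERIAL, 4) as SS4   -- last four characters of SERIAL
from CUSTOMER_TABLE;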
I have one table:
|----|-------|-------------|-------|
| id | head1 | head2       | head3 |
|----|-------|-------------|-------|
| 1  | fv1   | fw1,fw2,fw3 | fv3   |
| 2  | sv2   | sw1,sw2,sw3 | sv4   |
|----|-------|-------------|-------|
And I would like to have the following:
|----|-------|
| id | head2 |
|----|-------|
| 1  | fw1   |
| 1  | fw2   |
| 1  | fw3   |
| 2  | sw1   |
| 2  | sw2   |
| 2  | sw3   |
|----|-------|
So I would like to split the comma-delimited content of some columns and copy it into a different table as rows, for search purposes.
Which Talend component should I use to achieve this? Is that possible?
tNormalize should help you with this problem.
Just select "," as the field separator and head2 as the column to normalize.