spark scala - How to create a UDT (cassandra user data type [list <UDT>]) from a CSV file


I have a CSV file with ID, ID1, ID2, COL1, COL2, COL3, and COL4 fields. I need to group the records by the ID field and convert the remaining ID fields (ID1, ID2) into a list of UDTs.
Example:
ID ID1 ID2 COL1 COL2 COL3 COL4
1 AA 01 A B C D
1 AA 02 A B C D
1 AA 02 B C D E
1 AA 03 A B C D
2 BB 01 A B C D
2 BB 02 A B C D
3 CC 01 A B C D
3 CC 01 B C D E
The output should be (grouped by ID, with the remaining ID fields collected into a list):
1,[{ID1:"AA",ID2:"01"},{ID1:"AA",ID2:"02"},{ID1:"AA",ID2:"03"}]
2,[{ID1:"BB",ID2:"01"},{ID1:"BB",ID2:"02"}]
3,[{ID1:"CC",ID2:"01"}]
I tried collect_list / collect_set to group the fields but could not convert them to an array.
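The grouping logic itself can be sketched outside Spark. The following is a minimal plain-Python illustration, assuming the file is comma-separated (the question does not state the delimiter); `CSV_TEXT` and `group_id_pairs` are hypothetical names. It keeps each distinct (ID1, ID2) pair once per ID, which is the behaviour the expected output implies. In Spark, the analogous step would presumably be a set-style aggregation over a struct of the two columns, but that is not shown or tested here.

```python
import csv
import io

# Sample data from the question, written as comma-separated text
# (the delimiter is an assumption; the question does not state it).
CSV_TEXT = """ID,ID1,ID2,COL1,COL2,COL3,COL4
1,AA,01,A,B,C,D
1,AA,02,A,B,C,D
1,AA,02,B,C,D,E
1,AA,03,A,B,C,D
2,BB,01,A,B,C,D
2,BB,02,A,B,C,D
3,CC,01,A,B,C,D
3,CC,01,B,C,D,E
"""


def group_id_pairs(text):
    """Map each ID to its distinct {ID1, ID2} records, in first-seen order."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(text)):
        pairs = grouped.setdefault(row["ID"], [])
        pair = {"ID1": row["ID1"], "ID2": row["ID2"]}
        if pair not in pairs:  # drop duplicate (ID1, ID2) combinations
            pairs.append(pair)
    return grouped


result = group_id_pairs(CSV_TEXT)
for key, pairs in result.items():
    print(key, pairs)
```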

Related

how to move data from one column to another column's row in postgres?

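How do I start coding to get the output below?
| id | Column1 |
| --- | --- |
| 1 | A1 |
| 2 | A2 |
| 3 | A1 |
| 4 | A2 |
| 5 | A1 |
| 6 | A1 |

The output should be:
| id | Column1 | Column1.1 | Column1.2 |
| --- | --- | --- | --- |
| 1 | A1 | A1 | |
| 2 | A2 | | A2 |
| 3 | A1 | A1 | |
| 4 | A2 | | A2 |
| 5 | A1 | A1 | |
| 6 | A1 | A1 | |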

We can use CASE WHEN expressions to do this. Note that PostgreSQL needs double quotes around an alias that contains a dot.
SELECT id,
       Column1,
       CASE WHEN Column1 = 'A1' THEN Column1 END AS "Column1.1",
       CASE WHEN Column1 = 'A2' THEN Column1 END AS "Column1.2"
FROM T

PostgreSQL- get records with unique column combination

I want to select the records that have a unique column combination in PostgreSQL. DISTINCT doesn't work for this, because it only removes duplicates and still returns one row for every combination.
Example
ID A B
01 1 2
02 1 2
03 1 3
04 2 4
05 1 4
06 2 4
07 2 5
08 1 3
In this example, the rows with ID 05 and 07 have a unique combination of A and B. How can I get these records?
SELECT ...

With NOT EXISTS:
select t.* from tablename t
where not exists (
  select 1 from tablename
  where id <> t.id and a = t.a and b = t.b
)
Or with COUNT() window function:
select t.id, t.a, t.b
from (
  select *, count(id) over (partition by a, b) counter
  from tablename
) t
where t.counter = 1
Or with aggregation:
select max(id) id, a, b
from tablename
group by a, b
having count(id) = 1
Or with a self LEFT join that excludes the matching rows:
select t.*
from tablename t left join tablename tt
  on tt.id <> t.id and tt.a = t.a and tt.b = t.b
where tt.id is null
Results:
| id | a | b |
| --- | --- | --- |
| 05 | 1 | 4 |
| 07 | 2 | 5 |
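
The counting idea behind the window-function and aggregation variants can be checked outside the database. This is a small sketch in plain Python using the sample rows from the question; `rows` and `unique_combinations` are hypothetical names, and it only illustrates the logic, not how PostgreSQL executes the queries.

```python
from collections import Counter

# (id, a, b) rows copied from the question's example table.
rows = [
    ("01", 1, 2), ("02", 1, 2), ("03", 1, 3), ("04", 2, 4),
    ("05", 1, 4), ("06", 2, 4), ("07", 2, 5), ("08", 1, 3),
]


def unique_combinations(rows):
    """Keep only the rows whose (a, b) combination occurs exactly once."""
    counts = Counter((a, b) for _, a, b in rows)
    return [row for row in rows if counts[(row[1], row[2])] == 1]


unique_rows = unique_combinations(rows)
print(unique_rows)
```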

KDB get substring

How can I add a column containing a substring of another column that contains symbols? That is, go from
t:flip `date`sym`pos!(`d1`d1`d1`d2;`aaaA1`bbA1`aaaA2`aaaA3;1 2 3 1)
date sym pos
d1 aaaA1 1
d1 bbA1 2
d1 aaaA2 3
d2 aaaA3 1
to
t:flip `date`sym`pos`ext!(`d1`d1`d1`d2;`aaaA1`bbA1`aaaA2`aaaA3;1 2 3 1;`aaa`bb`aaa`aaa)
date sym pos ext
d1 aaaA1 1 aaa
d1 bbA1 2 bb
d1 aaaA2 3 aaa
d2 aaaA3 1 aaa
EDIT: The substring should always contain the first len(symbol) - 2 characters, so in my example above, aaa for aaaAx and bb for bbAx.

If the substring you wish to extract has a constant length, you can do something like the following:
q)t:flip `date`sym`pos!(`d1`d1`d1`d2;`aaaA1`bbbA1`aaaA2`aaaA3;1 2 3 1)
q)update ext:`$3#'string sym from t
date sym pos ext
------------------
d1 aaaA1 1 aaa
d1 bbbA1 2 bbb
d1 aaaA2 3 aaa
d2 aaaA3 1 aaa
If that's not the case, please provide some more detail on how the substring to be extracted can be identified.
Hope this helps
Jonathon

There may be a cleverer way of doing this, but this is what I first came up with.
t:flip `date`sym`pos!(`d1`d1`d1`d2;`aaaA1`bbbA1`aaaA2`aaaA3;1 2 3 1)
t: update ctr: {-2 + count string x} each sym from t;
t:{[x] :update ext:x[`ctr]#string(x[`sym]) from x} each t;
The 2nd line applies your logic of len(symbol) - 2.
The 3rd line takes 'ctr' characters from the start of the original symbol.

You didn’t say so, but this is kdb+, so let’s assume:
your table is long
your sym column has duplicates
You don’t need to convert all the symbols to strings and back: only the distinct ones. (In this example, I’ve changed one of the symbols to create a duplicate.)
q)t:flip `date`sym`pos!(`d1`d1`d1`d2;`aaaA1`bbbA1`aaaA2`aaaA1;1 2 3 1)
q)update ext:{nub:distinct x;(`$-2 _'string nub)nub?x}sym from t
date sym pos ext
------------------
d1 aaaA1 1 aaa
d1 bbbA1 2 bbb
d1 aaaA2 3 aaa
d2 aaaA1 1 aaa
The utility .Q.fu applies a function to the distinct items.
q)update ext:.Q.fu[{`$-2 _'string x};sym] from t
date sym pos ext
------------------
d1 aaaA1 1 aaa
d1 bbbA1 2 bbb
d1 aaaA2 3 aaa
d2 aaaA1 1 aaa
This operation would be faster if the sym column were already stored as an enumeration, because the distinct values would then be available without calculation.
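
The idea of computing only on the distinct values and mapping the results back can be sketched in plain Python. This is an illustration of the general technique, not of how .Q.fu is implemented; `apply_to_distinct`, `drop_last_two` and the `calls` list are hypothetical names used to show that the function runs once per distinct symbol.

```python
def apply_to_distinct(func, values):
    """Call func once per distinct value, then map results back to every item."""
    distinct = list(dict.fromkeys(values))  # distinct values, first-seen order
    computed = {value: func(value) for value in distinct}
    return [computed[value] for value in values]


calls = []  # records which symbols the function was actually applied to


def drop_last_two(sym):
    """Drop the last two characters, as in the question's len(symbol) - 2 rule."""
    calls.append(sym)
    return sym[:-2]


syms = ["aaaA1", "bbbA1", "aaaA2", "aaaA1"]
ext = apply_to_distinct(drop_last_two, syms)
print(ext)
print(calls)
```

With a long column and few distinct symbols, the string work is proportional to the number of distinct symbols rather than the number of rows.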

Using drop:
q)t:flip `date`sym`pos!(`d1`d1`d1`d2;`aaaA1`bbA1`aaaA2`aaaA3;1 2 3 1)
q)update ext:`$-2_'string sym from t
date sym pos ext
------------------
d1 aaaA1 1 aaa
d1 bbA1 2 bb
d1 aaaA2 3 aaa
d2 aaaA3 1 aaa

Give "ID" to multiple columns

Is there a native PostgreSQL function that assigns "IDs" based on the values in a column?
column1 column2 id1 id2
aa AA 1 1
aa BB 1 2
bb BB 2 2
cc BB 3 2
cc CC 3 3
dd DD 4 4
I only want the "ID" to increment, when the value in the column changes. Otherwise, the "ID" should be the same.

SELECT o.column1, o.column2
, dense_rank() OVER (ORDER BY column1) AS id1
, dense_rank() OVER (ORDER BY column2) AS id2
FROM ordi o
;
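
What a dense rank computes can be sketched in plain Python with the sample values from the question: equal values share a rank, and the rank increases by one for each new distinct value in sort order. `dense_rank` here is a hypothetical helper, and Python compares strings by code point, which may differ from the collation PostgreSQL uses for ORDER BY.

```python
def dense_rank(values):
    """Rank each value by its position among the sorted distinct values."""
    rank_of = {value: rank
               for rank, value in enumerate(sorted(set(values)), start=1)}
    return [rank_of[value] for value in values]


# Column values copied from the question's example table.
column1 = ["aa", "aa", "bb", "cc", "cc", "dd"]
column2 = ["AA", "BB", "BB", "BB", "CC", "DD"]

id1 = dense_rank(column1)
id2 = dense_rank(column2)
for row in zip(column1, column2, id1, id2):
    print(row)
```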

Postgresql count different groups in one query

I need to calculate the counts per city, state, and country. For example, if I have the following data in my table T1:
name city state country
------------------------------
name1 c1 s1 C1
name2 c1 s1 C1
name3 c2 s1 C1
name4 c3 s1 C1
name5 c4 s2 C1
name6 c5 s2 C1
name7 c5 s3 C1
name8 c12 s12 C2
the query should result in:
city state country citycount statecount countrycount
-------------------------------------------------------------------
c1 s1 C1 2 4 7
c2 s1 C1 1 4 7
c3 s1 C1 1 4 7
c4 s2 C1 1 2 7
c5 s2 C1 1 2 7
c5 s3 C1 1 1 7
c12 s12 C2 1 1 1
If I use GROUP BY and COUNT, I need to write three different queries, but I want to do it in one query. Please help.

You can use window functions. For example, one solution could be:
SELECT DISTINCT city
,STATE
,country
,count(*) OVER (PARTITION BY city,STATE,country) AS citycount
,count(*) OVER (PARTITION BY STATE,country) AS statecount
,count(*) OVER (PARTITION BY country) AS countrycount
FROM T1
ORDER BY city
,STATE
,country
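
The three partitioned counts can be checked against the sample data with a short plain-Python sketch. It only mirrors the logic of the query (one counter per partition key, then one output row per distinct city, state, country); `counts_per_level` is a hypothetical name.

```python
from collections import Counter

# (name, city, state, country) rows copied from the question.
rows = [
    ("name1", "c1", "s1", "C1"), ("name2", "c1", "s1", "C1"),
    ("name3", "c2", "s1", "C1"), ("name4", "c3", "s1", "C1"),
    ("name5", "c4", "s2", "C1"), ("name6", "c5", "s2", "C1"),
    ("name7", "c5", "s3", "C1"), ("name8", "c12", "s12", "C2"),
]


def counts_per_level(rows):
    """Return {(city, state, country): (citycount, statecount, countrycount)}."""
    city_counts = Counter((c, s, k) for _, c, s, k in rows)
    state_counts = Counter((s, k) for _, _, s, k in rows)
    country_counts = Counter(k for _, _, _, k in rows)
    return {
        (c, s, k): (city_counts[(c, s, k)], state_counts[(s, k)], country_counts[k])
        for c, s, k in city_counts
    }


level_counts = counts_per_level(rows)
for key in sorted(level_counts):
    print(key, level_counts[key])
```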
