I have this RDD:
val resultRdd: RDD[(VertexId, String, Seq[Long])]
I want to count the distinct values in Seq of all records.
for example, if I have 3 records with Seq values as follows:
VertexId   String   Seq[Long]
1          x        1, 3
2          x        1, 5
3          x        2, 3, 6
The result should be 5, the count of the distinct values {1, 3, 5, 2, 6}.
Thanks :)
resultRdd.flatMap(_._3).distinct().count()  // flatten every Seq[Long], drop duplicates, count
I have a PySpark DataFrame for which I want to calculate summary statistics (the count of all unique categories in each column) and a cross-tabulation with one fixed column, for all string columns.
For example, my df looks like this:
col1   col2   col3
Cat1   XYZ    A
Cat1   XYZ    C
Cat1   ABC    B
Cat2   ABC    A
Cat2   XYZ    B
Cat2   MNO    A
I want something like this
VarName   Category   Count   A   B   C
col1      Cat1       3       1   1   1
col1      Cat2       3       2   0   1
col2      XYZ        3       1   1   1
col2      ABC        2       1   1   0
col2      MNO        1       1   0   0
col3      A          3       3   0   0
col3      B          2       0   2   0
col3      C          1       0   0   1
So, basically, I want the cross-tabulation of each individual column against col3, plus the total count.
I can do this in plain Python using a loop, but looping works somewhat differently in PySpark.
Here are my 2 cents.
Created a sample dataframe
df = spark.createDataFrame(
    [("Cat1", "XYZ", "A"),
     ("Cat1", "XYZ", "C"),
     ("Cat1", "ABC", "B"),
     ("Cat2", "ABC", "A"),
     ("Cat2", "XYZ", "B"),
     ("Cat2", "MNO", "A")],
    schema=['col1', 'col2', 'col3'])
Use the crosstab function, which computes the count of each col3 value per category; then compute the total row count, add a new constant column holding the source column name, and rename the crosstab key column.
Then union all of these dataframes:
import pyspark.sql.functions as fx

df_union = \
    df.crosstab('col1', 'col3').withColumn('count', fx.expr("A+B+C")).withColumn('VarName', fx.lit('col1')).withColumnRenamed('col1_col3', 'Category').union(
    df.crosstab('col2', 'col3').withColumn('count', fx.expr("A+B+C")).withColumn('VarName', fx.lit('col2')).withColumnRenamed('col2_col3', 'Category')).union(
    df.crosstab('col3', 'col3').withColumn('count', fx.expr("A+B+C")).withColumn('VarName', fx.lit('col3')).withColumnRenamed('col3_col3', 'Category'))
Print the dataframe with the columns in the desired order:
df_union.select('VarName','Category','count','A','B','C').show()
The resulting output matches the expected table shown above.
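The same approach can also be written as a loop, which is closer to what the question asks for. Below is a rough sketch of that generalization (my own addition, not part of the answer above): crosstab_all is a hypothetical helper, and it sums whatever value columns crosstab produces rather than hard-coding A+B+C.
from functools import reduce
from operator import add
import pyspark.sql.functions as fx

def crosstab_all(df, fixed_col, columns):
    # Build one crosstab per column and union them all by name.
    parts = []
    for c in columns:
        ct = df.crosstab(c, fixed_col)                    # counts of fixed_col values per category of c
        key = f'{c}_{fixed_col}'                          # crosstab names its key column '<col>_<fixed_col>'
        value_cols = [v for v in ct.columns if v != key]  # one column per distinct value of fixed_col
        parts.append(ct
                     .withColumn('count', reduce(add, [fx.col(v) for v in value_cols]))
                     .withColumn('VarName', fx.lit(c))
                     .withColumnRenamed(key, 'Category'))
    return reduce(lambda a, b: a.unionByName(b), parts)

crosstab_all(df, 'col3', ['col1', 'col2', 'col3']).select('VarName', 'Category', 'count', 'A', 'B', 'C').show()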
I have a dataset containing two columns, user_id and item_id. The DataFrame looks like this:
index   user_id   item_id
0       user1     A
1       user1     B
2       user2     A
3       user3     B
4       user4     C
I'm looking for a way to transform this table into an item-item interaction matrix where each cell is the count of distinct users the two items have in common:
    A  B  C
A   2  1  0
B   1  2  0
C   0  0  1
And another item-item interaction matrix where each cell is the count of the distinct union of the two items' users:
    A  B  C
A   2  3  3
B   3  2  3
C   3  3  1
Step 0. Define the dataframe
import pyspark.sql.functions as F
data = [(0, "user1", "A"),
(1, "user1", "B"),
(2, "user2", "A"),
(3, "user3", "B"),
(4, "user4", "C")]
df = spark.createDataFrame(data, schema=["index", "user_id", "item_id"])
Step 1. Collect user data for each item in df_collect
df_collect = (df
.select("user_id", "item_id")
.groupBy("item_id")
.agg(F.collect_set("user_id").alias("users")))
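For the sample data, df_collect should hold one row per item with the set of users who interacted with it (row and element order may vary when displayed):
df_collect.show(truncate=False)
# Expected content (order may differ):
# item_id  users
# A        [user1, user2]
# B        [user1, user3]
# C        [user4]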
Step 2. Cross join df_collect with itself to get all item-item combinations
df_crossjoin = (df_collect
    .crossJoin(df_collect  # explicit Cartesian product: every item paired with every item
        .withColumnRenamed("item_id", "item_y")
        .withColumnRenamed("users", "users_y")))
Step 3. Compute the sizes of the user union and intersection
df_ui = (df_crossjoin
    .withColumn("users_union",
                F.size(F.array_union("users", "users_y")))
    .withColumn("users_intersect",
                F.size(F.array_intersect("users", "users_y"))))
Step 4. Pivot to get the item-item matrices
df_matrix_union = (df_ui
.groupBy("item_id")
.pivot("item_y")
.agg(F.first("users_union"))
.orderBy("item_id"))
df_matrix_intrsct = (df_ui
.groupBy("item_id")
.pivot("item_y")
.agg(F.first("users_intersect"))
.orderBy("item_id"))
To get a table row count I thought I'd use a naive approach: the count 1 construct. And it works in a simple case:
q)t:([]sym:`a`a`b`b);
q)select cnt: count 1 by sym from t
sym| cnt
---| ---
a | 2
b | 2
But when I added other fields, I got a wrong result:
q)select cnt: count 1, sym by sym from t
sym| cnt sym
---| -------
a | 1 a a
b | 1 b b
Why does count 1 work (or just seem to) in the single-column case but fail with multiple columns?
Update: I expected to get something like this:
sym| cnt sym
---| -------
a | 2 a a
b | 2 b b
I don't think count 1 will produce the result you're looking for, nor even a consistent one.
I think you want to use count i instead. i is the virtual row index, so when you select by sym, count i counts the rows in each sym group.
q)t:([]sym:`a`a`b`b)
q)select cnt:count i,sym by sym from t
sym| cnt sym
---| -------
a | 2 a a
b | 2 b b
q).z.K
3.6
A point to note, however, is that this query will not run as written on kdb+ 4.0, because sym is used both as a selected column and as a grouping column:
q)t:([]sym:`a`a`b`b)
q)select cnt:count i,sym by sym from t
'dup names for cols/groups sym
[0] select cnt:count i,sym by sym from t
^
q).z.K
4f
This question already has answers here: How to aggregate values into collection after groupBy? (3 answers). Closed 4 years ago.
I have a csv file in HDFS (/hdfs/test.csv) that I'd like to group using Spark and Scala. I want to group the A1...AN columns based on the A1 column, so that all the rows are grouped as shown below.
Output:
JACK, ABCD, ARRAY("0,1,0,1", "2,9,2,9")
JACK, LMN, ARRAY("0,1,0,3", "0,4,3,T")
JACK, HBC, ARRAY("1,T,5,21", "E7,4W,5,8")
Input:
name   A1    A1  A2  A3 ... AN
------------------------------
JACK   ABCD  0   1   0   1
JACK   LMN   0   1   0   3
JACK   ABCD  2   9   2   9
JACK   HBC   1   T   5   21
JACK   LMN   0   4   3   T
JACK   HBC   E7  4W  5   8
I need to produce the output above in Spark with Scala.
You can achieve this by collecting the A columns into an array.
import org.apache.spark.sql.functions.{array, col, collect_set, concat_ws}

// Adjust the upper bound (250 here) to the number of A columns in your data.
val aCols = 1.to(250).map(x => col(s"A$x"))
val concatCol = concat_ws(",", array(aCols: _*))

val groupedDf = df.withColumn("aConcat", concatCol).
  groupBy("name", "A").
  agg(collect_set("aConcat"))
If you're okay with duplicates you can also use collect_list instead of collect_set.
Your input has two different columns called A1. I will assume the groupBy category is called A, while the element to put in that final array is A1.
If you load the data into a DataFrame, you can do this to achieve the output specified:
import org.apache.spark.sql.functions.{collect_set, concat_ws}
import spark.implicits._  // needed for the $"column" syntax

val grouped = someDF
  .groupBy($"name", $"A")
  .agg(collect_set(concat_ws(",", $"A1", $"A2", $"A3", $"A4")).alias("grouped"))
How can I extract the first n rows from each group? For example, given the table
bb: ([]sym:(4#`a),(5#`b);val: til 9)
sym val
-------
a   0
a   1
a   2
a   3
b   4
b   5
b   6
b   7
b   8
How can I select the first 2 rows of each group by sym?
Thanks
You can use fby:
q)select from bb where ({x in 2#x};i) fby sym
sym val
-------
a 0
a 1
b 4
b 5
You can try this:
q)select from t where i in raze exec 2#i by sym from t
sym val
-------
a 0
a 1
b 4
b 5