Assume I have a Spark DataFrame d1 with two columns, elements_1 and elements_2, that contain sets of integers of size k, and two further columns, value_1 and value_2, that each contain an integer value. For example, with k = 3:
d1 =
+------------+------------+
| elements_1 | elements_2 |
+------------+------------+
| (1, 4, 3)  | (3, 4, 5)  |
| (2, 1, 3)  | (1, 0, 2)  |
| (4, 3, 1)  | (3, 5, 6)  |
+------------+------------+
I need to create a new column combinations that contains, for each pair of sets in elements_1 and elements_2, a list of the sets formed from all possible combinations of their elements. These sets must have the following properties:
Their size must be k+1
They must contain either the set in elements_1 or the set in elements_2
For example, from (1, 2, 3) and (3, 4, 5) we obtain [(1, 2, 3, 4), (1, 2, 3, 5), (3, 4, 5, 1) and (3, 4, 5, 2)]. The list does not contain (1, 2, 5) because it is not of length 3+1, and it does not contain (1, 2, 4, 5) because it contains neither of the original sets.
You need to create a custom function to perform the transformation, turn it into a Spark-compatible UserDefinedFunction, and then apply it using withColumn. So really, there are two questions here: (1) how to do the set transformation you described, and (2) how to create a new column in a DataFrame using a user-defined function.
Here's a first shot at the set logic; let me know if it does what you're looking for:
def combo[A](a: Set[A], b: Set[A]): Set[Set[A]] =
  a.diff(b).map(b + _) ++ b.diff(a).map(a + _)
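A quick sanity check against the example from the question (plain Scala, no Spark needed; purely illustrative):
// combo of (1, 2, 3) and (3, 4, 5) should yield exactly the four k+1 sets listed above.
val expected = Set(Set(1, 2, 3, 4), Set(1, 2, 3, 5), Set(3, 4, 5, 1), Set(3, 4, 5, 2))
assert(combo(Set(1, 2, 3), Set(3, 4, 5)) == expected)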
Now create the UDF wrapper. Note that under the hood these sets are all represented by WrappedArrays, so we need to handle this. There's probably a more elegant way to deal with this by defining some implicit conversions, but this should work:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}

val comboWrap: (WrappedArray[Int], WrappedArray[Int]) => Array[Array[Int]] =
  (x, y) => combo(x.toSet, y.toSet).map(_.toArray).toArray

val comboUDF = udf(comboWrap)
Finally, apply it to the DataFrame by creating a new column:
import spark.implicits._ // for toDF (already in scope in spark-shell)

val data = Seq((Set(1, 2, 3), Set(3, 4, 5))).toDF("elements_1", "elements_2")
val result = data.withColumn("result",
  comboUDF(col("elements_1"), col("elements_2")))
result.show
I have a query which gives a result:
id | manager_id | level | star_level
----+------------+-------+------------
1 | NULL | 1 | 0
2 | 1 | 2 | 1
3 | 2 | 3 | 1
4 | 3 | 4 | 2
5 | 4 | 5 | 2
6 | 5 | 6 | 2
7 | 6 | 7 | 3
8 | 7 | 8 | 3
9 | 8 | 9 | 4
(9 rows)
Here is the query:
WITH RECURSIVE parents AS (
    SELECT e.id
         , e.manager_id
         , 1 AS level
         , CAST(s.is_star AS INTEGER) AS star_level
    FROM employees AS e
    INNER JOIN skills AS s
        ON e.skill_id = s.id
    WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id
         , e.manager_id
         , p.level + 1 AS level
         , p.star_level + CAST(s.is_star AS INTEGER) AS star_level
    FROM employees AS e
    INNER JOIN skills AS s
        ON e.skill_id = s.id
    INNER JOIN parents AS p
        ON e.manager_id = p.id
    WHERE e.manager_id = p.id
)
SELECT *
FROM parents
;
Can you please tell me how to change the query so that, in the same query, the level and star_level values are written to the corresponding columns?
Demo data:
create table Employees(
id INT,
name VARCHAR,
manager_id INT,
skill_id INT,
level INT,
star_level INT
);
create table Skills(
id INT,
name VARCHAR,
is_star BOOL
);
INSERT INTO Employees
(id, name, manager_id, skill_id)
VALUES
(1, 'Employee 1', NULL, 1),
(2, 'Employee 2', 1, 2),
(3, 'Employee 3', 2, 3),
(4, 'Employee 4', 3, 4),
(5, 'Employee 5', 4, 5),
(6, 'Employee 6', 5, 1),
(7, 'Employee 7', 6, 2),
(8, 'Employee 8', 7, 3),
(9, 'Employee 9', 8, 4)
;
INSERT INTO Skills
(id, name, is_star)
VALUES
(1, 'Skill 1', FALSE),
(2, 'Skill 2', TRUE),
(3, 'Skill 3', FALSE),
(4, 'Skill 4', TRUE),
(5, 'Skill 5', FALSE)
;
To summarize: I need a single query that computes the level and star_level values for the Employees table and writes them into those columns of the Employees table.
You can use an UPDATE statement together with your CTE:
with recursive parents as (
    ... your original query goes here ...
)
update employees
set level = p.level,
    star_level = p.star_level
from parents p
where employees.id = p.id;
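If you want to check what was written, a simple SELECT afterwards should show the same values as the result table at the top of the question (with the demo data above):
-- Verify the values written by the UPDATE; with the demo data this should
-- match the level/star_level result shown at the top of the question.
SELECT id, level, star_level
FROM employees
ORDER BY id;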
I am new to Spark and I am trying to extract some specific information from a couple of lists of data that I have converted into two separate DataFrames.
The two DataFrames are:
Users:
user_id | item_id
-----------------
1       | 123
2       | 223
3       | 423
2       | 1223
1       | 3213

item_Details:
item_id | item_name
-------------------
123     | phone
223     | game
423     | foo
1223    | bar
3213    | foobar
I need to find all pairs of users that have more than 50 items in common, sorted by the number of common items. There must be no duplicate pairs, meaning there should be only one row for the pair of user_id 1 and user_id 2.
The result needs to look like this:
user_id1 | user_id2 | count_of_items | list_of_items
-------------------------------------------------------------
1 | 2 | 51 | phone,foo,bar,foobar
Here's one approach:
1. assemble item pairs per distinct user-pair via a self-join
2. generate the common items from the item pairs using a UDF
3. filter the result dataset by the specified common item count
as shown below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import spark.implicits._ // for toDF and $"..." columns (already in scope in spark-shell)
val users = Seq(
(1, 123), (1, 223), (1, 423),
(2, 123), (2, 423), (2, 1223), (2, 3213),
(3, 223), (3, 423), (3, 1223), (3, 3213),
(4, 123), (4, 1223), (4, 3213)
).toDF("user_id", "item_id")
val item_details = Seq(
  (123, "phone"), (223, "game"), (423, "foo"), (1223, "bar"), (3213, "foobar")
).toDF("item_id", "item_name")
val commonItems = udf( (itemPairs: Seq[Row]) =>
itemPairs.collect{ case Row(a: Int, b: Int) if a == b => a }
)
val commonLimit = 2 // Replace this with any specific common item count
val user_common_items =
users.as("u1").join(users.as("u2"), $"u1.user_id" < $"u2.user_id").
groupBy($"u1.user_id", $"u2.user_id").agg(
collect_set(
struct($"u1.item_id".as("ui1"), $"u2.item_id".as("ui2"))
).as("item_pairs")).
withColumn("common_items", commonItems($"item_pairs")).
drop("item_pairs").
where(size($"common_items") > commonLimit)
user_common_items.show(false)
// +-------+-------+-----------------+
// |user_id|user_id|common_items |
// +-------+-------+-----------------+
// |2 |3 |[423, 3213, 1223]|
// |2 |4 |[3213, 123, 1223]|
// +-------+-------+-----------------+
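To match the question's actual requirement (more than 50 common items, sorted by the count), you would presumably set commonLimit accordingly and add an ordering. Here is a small sketch on top of the user_common_items DataFrame above; count_of_items mirrors the column name in the question's expected output:
// Sketch only: keep pairs with more than 50 common items and sort them by the
// number of common items, reusing user_common_items defined above.
val rankedPairs = user_common_items.
  withColumn("count_of_items", size($"common_items")).
  where($"count_of_items" > 50).
  orderBy($"count_of_items".desc)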
If common item names instead of item ids are wanted, you can join item_details in the above step to aggregate on the item names; or, you can explode the existing common item ids and join item_details along with a collect_list aggregation by user-pair:
user_common_items.
  withColumn("item_id", explode($"common_items")).
  join(item_details, Seq("item_id")).
  groupBy($"u1.user_id", $"u2.user_id").
  agg(collect_list($"item_name").as("common_items")).
  withColumn("item_count", size($"common_items")).
  show
// +-------+-------+--------------------+----------+
// |user_id|user_id| common_items|item_count|
// +-------+-------+--------------------+----------+
// | 2| 3| [foo, foobar, bar]| 3|
// | 2| 4|[foobar, phone, bar]| 3|
// +-------+-------+--------------------+----------+
Another solution, without using UDFs. Since we need the common items, the matching can be done in the join expression (joinExprs) itself. Check this out:
val users = Seq(
(1, 123), (1, 223), (1, 423),
(2, 123), (2, 423), (2, 1223), (2, 3213),
(3, 223), (3, 423), (3, 1223), (3, 3213),
(4, 123), (4, 1223), (4, 3213)
).toDF("user_id", "item_id")
val items = Seq(
  (123, "phone"), (223, "game"), (423, "foo"), (1223, "bar"), (3213, "foobar")
).toDF("item_id", "item_name")
val common_items =
  users.as("t1").join(users.as("t2"), $"t1.user_id" < $"t2.user_id" and $"t1.item_id" === $"t2.item_id")
    .join(items.as("it"), $"t1.item_id" === $"it.item_id", "inner")
    .groupBy($"t1.user_id", $"t2.user_id")
    .agg(collect_set('item_name).as("items"))
    .filter(size('items) > 2) // change here for count
    .withColumn("size", size('items))
common_items.show(false)
Results
+-------+-------+--------------------+----+
|user_id|user_id|items |size|
+-------+-------+--------------------+----+
|2 |3 |[bar, foo, foobar] |3 |
|2 |4 |[bar, foobar, phone]|3 |
+-------+-------+--------------------+----+
Example code:
Declare @table1 TABLE(myIndex int identity(1,1), [cal] int, Name Nvarchar(20));
Declare @range int = 5;
INSERT INTO @table1 ([cal], Name)
VALUES (1, 'A'), (3, 'B'), (4, 'C'), (2, 'D'), (3, 'E'), (4, 'F'), (6, 'G'), (2, 'H');
SELECT * FROM @table1
Output:
myIndex | cal | Name |
--------+-----+------+
1       | 1   | A    |
2       | 3   | B    |
3       | 4   | C    |
4       | 2   | D    |
5       | 3   | E    |
6       | 4   | F    |
7       | 6   | G    |
8       | 2   | H    |
I want to accumulate cal until Sum(cal) > 5, then merge the Name strings for that group.
T-SQL 2012 - expected report example:
myIndex | Sum(cal) | Name  | Description
--------+----------+-------+--------------------------------
1       | 8        | A,B,C | (Explain: first group where Sum(cal) > 5, merge strings)
2       | 9        | D,E,F | (Explain: second group where Sum(cal) > 5, merge strings)
3       | 6        | G     | (Explain: third group where Sum(cal) > 5, merge strings)
4       | 2        | H     | (Explain: last, leftover that never reaches Sum(cal) > 5)
Please help me to resolve this problem.
This is a solution using a cursor. Hope this helps. Try it and see whether the performance is acceptable.
Declare @table1 TABLE(myIndex int identity(1,1), [cal] int, Name Nvarchar(20));
Declare @range int = 5;
INSERT INTO @table1 ([cal], Name)
VALUES (1, 'A'), (3, 'B'), (4, 'C'), (2, 'D'), (3, 'E'), (4, 'F'), (6, 'G'), (2, 'H');
SELECT * FROM @table1
-----
DECLARE @aggregates TABLE (myIndex int identity(1,1), SumCal int, AggregateNames Nvarchar(MAX));
DECLARE @SumCal INT
      , @AggregateNames NVARCHAR(MAX)
      , @cal INT
      , @Name Nvarchar(20)
;
SET @SumCal = 0;
SET @AggregateNames = NULL;

DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT [cal], Name FROM @table1 ORDER BY myIndex

OPEN cur
FETCH NEXT FROM cur INTO @cal, @Name

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @SumCal = @SumCal + @cal
    SET @AggregateNames = ISNULL(@AggregateNames + ',', '') + @Name

    IF @SumCal > 5
    BEGIN
        INSERT INTO @aggregates([SumCal], AggregateNames)
        VALUES(@SumCal, @AggregateNames)

        SET @SumCal = 0
        SET @AggregateNames = NULL
    END

    FETCH NEXT FROM cur INTO @cal, @Name
END

IF @SumCal > 0
BEGIN
    INSERT INTO @aggregates([SumCal], AggregateNames)
    VALUES(@SumCal, @AggregateNames)
END

CLOSE cur
DEALLOCATE cur

SELECT * FROM @aggregates
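Side note: the question declares @range = 5, but the loop above compares against the literal 5. If the threshold should follow the variable, the check could presumably be written as:
IF @SumCal > @range -- use the declared @range instead of the hard-coded 5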
Here's my input data:
CREATE TEMP TABLE test AS SELECT * FROM (VALUES
(1, 12),
(2, 7),
(3, 8),
(4, 8),
(5, 7)
) AS rows (position, value);
I want to, in a single query (no subqueries or CTEs), assign a unique number for each distinct value. However, I also want those numbers to ascend according to the associated position -- i.e., a distinct value's number should be assigned according to its lowest position.
Assumptions:
each row will always have a unique position
value is not guaranteed unique per row
the number of a distinct value is only for ordinal purposes, e.g. it doesn't matter whether distinct_values goes 1-2-3 or 3-8-14
The desired output is:
position | value | distinct_value
----------+-------+----------------
1 | 12 | 1
2 | 7 | 2
3 | 8 | 3
4 | 8 | 3
5 | 7 | 2
I can get close using DENSE_RANK to number distinct values:
SELECT
position,
value,
DENSE_RANK() OVER (ORDER BY value) AS distinct_value
FROM test ORDER BY position;
The result obviously ignores position:
position | value | distinct_value
----------+-------+----------------
1 | 12 | 3
2 | 7 | 1
3 | 8 | 2
4 | 8 | 2
5 | 7 | 1
Is there a better window function for this?
with
  t(x, y) as (values
    (1, 12),
    (2, 7),
    (3, 8),
    (4, 8),
    (5, 7)),
  pos(i, y) as (select min(x), y from t group by y),
  ind(i, y) as (select row_number() over (order by i), y from pos)
select * from ind join t using(y) order by x;
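Adapted to the test table from the question, the same idea (number each distinct value by its lowest position, then join back) would presumably look like the sketch below; column names are taken from the question, and it still relies on CTEs:
-- pos: lowest position per distinct value; ind: ordinal number per distinct value.
with
  pos as (select value, min(position) as min_pos from test group by value),
  ind as (select value, row_number() over (order by min_pos) as distinct_value from pos)
select t.position, t.value, i.distinct_value
from test t
join ind i on i.value = t.value
order by t.position;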