Tallying in Scala DataFrame Array - scala

I have 2 column spark Scala DataFrame. The first is of one variable, the second one is an array of letters. What I am trying to do is find a way to code a tally (without using a for loop) of the variables in an array.
For example, this is what I have (I am sorry its not that neat, this is my first stack post). You have 5 computers, each person is represented by a letter. I want to find a way to find out how many computers a person (A,B,C,D,E) has used.
+-----------------+--------------+
| id | [person] |
+-----------------+--------------+
| Computer 1 | [A,B,C,D] |
| Computer 2 | [A,B] |
| Computer 3 | [A,B,E] |
| Computer 4 | [A,C,D] |
| Computer 5 | [A,B,C,D,E] |
+-----------------+--------------+
What I would like to code up or asking if anyone has a solution would be something like this:
+---------+-----------+
| Person | [Count] |
+---------+-----------+
| A | 5 |
| B | 4 |
| C | 3 |
| D | 3 |
| E | 2 |
+---------+-----------+
Somehow count the people who are in arrays within the dataframe.

There's a function called explode which will expand the arrays into one row for each item:
| id | person
+-----------------+------------------------+
| Computer 1| A |
| Computer 1| B |
| Computer 1| C |
| Computer 1| D |
....
+---+----+----+----+----+
Then you can group by the person and count. Something like:
val df2 = df.select(explode($"person").as("person"))
val result = df2.groupBy($"person").count

Related

Compare consecutive rows and extract words(excluding the subsets) in spark

I am working on a spark dataframe. Input dataframe looks like below (Table 1). I need to write a logic to get the keywords with maximum length for each session ids. There are multiple keywords that would be part of output for each sessionid. expected output looks like Table 2.
Input dataframe:
(Table 1)
|-----------+------------+-----------------------------------|
| session_id| value | Timestamp |
|-----------+------------+-----------------------------------|
| 1 | cat | 2021-01-11T13:48:54.2514887-05:00 |
| 1 | catc | 2021-01-11T13:48:54.3514887-05:00 |
| 1 | catch | 2021-01-11T13:48:54.4514887-05:00 |
| 1 | par | 2021-01-11T13:48:55.2514887-05:00 |
| 1 | part | 2021-01-11T13:48:56.5514887-05:00 |
| 1 | party | 2021-01-11T13:48:57.7514887-05:00 |
| 1 | partyy | 2021-01-11T13:48:58.7514887-05:00 |
| 2 | fal | 2021-01-11T13:49:54.2514887-05:00 |
| 2 | fall | 2021-01-11T13:49:54.3514887-05:00 |
| 2 | falle | 2021-01-11T13:49:54.4514887-05:00 |
| 2 | fallen | 2021-01-11T13:49:54.8514887-05:00 |
| 2 | Tem | 2021-01-11T13:49:56.5514887-05:00 |
| 2 | Temp | 2021-01-11T13:49:56.7514887-05:00 |
|-----------+------------+-----------------------------------|
Expected Output:
(Table 2)
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 1 | partyy |
| 2 | fallen |
| 2 | Temp |
|-----------+------------|
Solution I tried:
I added another column called col_length which captures the length of each word in value column. later on tried to compare each row with its subsequent row to see if it is of maximum lenth. But this solution only works party.
val df = spark.read.parquet("/project/project_name/abc")
val dfM = df.select($"session_id",$"value",$"Timestamp").withColumn("col_length",length($"value"))
val ts = Window
.orderBy("session_id")
.rangeBetween(Window.unboundedPreceding, Window.currentRow)
val result = dfM
.withColumn("running_max", max("col_length") over ts)
.where($"running_max" === $"col_length")
.select("session_id", "value", "Timestamp")
Current Output:
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 2 | fallen |
|-----------+------------|
Multiple columns does not work inside an orderBy clause with window function so I didn't get desired output.I got 1 output per sesison id. Any suggesions would be highly appreciated. Thanks in advance.
You can solve it by using lead function:
val windowSpec = Window.orderBy("session_id")
dfM
.withColumn("lead",lead("value",1).over(windowSpec))
.filter((functions.length(col("lead")) < functions.length(col("value"))) || col("lead").isNull)
.drop("lead")
.show

PostgreSQL - How to do a Loop on a column

I am struggling to do a loop on a Postgres, but functions on postgres are not my piece of cake.
I have the following table on postgres:
| portfolio_1 | total_risk |
|----------------|------------|
| Top 10 Bets | |
| AAPL34 | 2,06699 |
| DISB34 | 1,712684 |
| PETR4 | 0,753324 |
| PETR3 | 0,087767 |
| VALE3 | 0,086346 |
| LREN3 | 0,055108 |
| AMZO34 | 0,0 |
| Bottom 10 Bets | |
| AAPL34 | 0,0 |
What I'm trying to do is get the values after the "Top 10 Bets" and before the "Botton 10 Bets".
My goal is the following result:
| portfolio_1 | total_risk |
|-------------|------------|
| AAPL34 | 2,06699 |
| DISB34 | 1,712684 |
| PETR4 | 0,753324 |
| PETR3 | 0,087767 |
| VALE3 | 0,086346 |
| LREN3 | 0,055108 |
| AMZO34 | 0,0 |
So, my goal is to take off the "Top 10 Bets", the "Botton 10 Bets" and the AAPL34 after the "Botton 10 Bets", which was repeated.
The quantity of rows is variable (I'm importing it from an Excel file), so I need a loop to do this, right?
SQL tables and result sets represent unordered sets. There is no "before" or "after" unless rows explicitly provide that information.
Let me assume that you have such a column, which I will call id for convenience.
Then you can do this in several ways. Here is one:
select t.*
from t
where t.id > (select min(t2.id) from t t2 where t2.portfolio_1 = 'Top 10 Bets') and
t.id < (select max(t2.id) from t t2 where t2.portfolio_1 = 'Bottom 10 Bets');

Eloquent; getting rows not in other table

I'm trying to get a list of records from a table, where the PK is not in a composite table. If you see consider below table, I'm trying to get all candidates that are not in a pool, so in this case I should get only Eric.
+-------------+ +-------------+
| candidates | | pools |
+-------------+ +-------------+
| id | name | | id | name |
+----+--------| +----+--------|
| 1 | John | | 1 | Pool A |
| 2 | Richard| | 2 | Pool B |
| 3 | Eric | | 3 | Pool C |
+----+--------+ +----+--------+
+--------------------------+
| candidate_pools |
+--------------------------+
| pool_id | candidate_id |
+---------+----------------|
| 1 | 1 |
| 1 | 2 |
| 3 | 2 |
| 2 | 1 |
+---------+----------------+
I've found this answer: https://laracasts.com/discuss/channels/eloquent/get-records-collection-if-not-exists-in-another-table
$crashedCarIds = CrashedCar::pluck('car_id')->all();
$cars = Car::whereNotIn('id', $crashedCarIds)->select(...)->get();
Which should work, however it seems extremely inefficient in our fairly big and rapidly growing database, on a page that is accessed often by a lot of people.
In normal MySQL you would do something like this:
WHERE candidates.id NOT IN
( SELECT candidate_id
FROM candidate_pools
LEFT JOIN pools
ON candidate_pools.pool_id = pool.id
)
Note: The tables are just a simple illustration, but the actual database set-up and queries are a lot bigger.
I managed to do it using scope, like this:
public function scopeNotInPool($query) {
$query->whereNotIn('candidate_id', function ($query) {
$query->select('candidate_id')
->from('candidate_pool')
->join('pool', function ($join) {
$join->on('pool.id', '=', 'candidate_pool.pool_id');
}
)
;
});
return $query;
}
Then call it like:
$candidates = Models\Candidate::notInPool()->get();

Combine multiple columns to yield unique values

I'm trying to use Tableau (v10.1) to combine 5 separate columns and get a count of the distinct values for that combination. Some rows/columns are empty. For example:
+-------+-------+-------+-------+-------+
| Tag 1 | Tag 2 | Tag 3 | Tag 4 | Tag 5 |
+-------+-------+-------+-------+-------+
| A | B | C | D | E |
| B | D | E | - | - |
| - | - | - | - | - |
| E | A | - | - | - |
+-------+-------+-------+-------+-------+
I want to obtain the following in a Tableau worksheet:
+-----+-------+
| Tag | Count |
+-----+-------+
| E | 3 |
| A | 2 |
| B | 2 |
| D | 2 |
| C | 1 |
+-----+-------+
I would like to do this in Tableau (using calculated fields, etc.) and not change the original data source.
Click on the data source tab, select the five fields named Tag # and then use the pivot command to reshape the data without changing the original source

PostgreSQL simple count query

Trying to scale this down so the answer is simple. I can probably extrapolate the answers here to apply to a bigger data set.
Given the following table:
+------+-----+
| name | age |
+------+-----+
| a | 5 |
| b | 7 |
| c | 8 |
| d | 8 |
| e | 10 |
+------+-----+
I want to make a table that shows the count of people where their age is equal to or greater than x. For instance, the table about would produce:
+--------------+-------+
| at least age | count |
+--------------+-------+
| 5 | 5 |
| 6 | 4 |
| 7 | 4 |
| 8 | 3 |
| 9 | 1 |
| 10 | 1 |
+--------------+-------+
Is there a single query that can accomplish this task? Obviously, it is easy to write a simple function for it, but I'm hoping to be able to do this quickly with one query.
Thanks!
Yes, what you're looking for is a window function.
with cte_age_count as (
select age,
count(*) c_star
from people
group by age)
select age,
sum(c_star) over (order by age
range between unbounded preceding
and current row)
from cte_age_count
Not syntax checked ... let me know if it works!