Create a new DF using another two - scala

I have two DataFrames that share the column colour, and I would like to create a new DataFrame with a column holding the Code that corresponds to each row's colour, as you can see:
DF1
+------------+--------------------+
| Code | colour |
+------------+--------------------+
| 1001 | brown |
| 1201 | black |
| 1300 | green |
+------------+--------------------+
DF2
+------------+--------------------+-----------+
| Name | colour | date |
+------------+--------------------+-----------+
| Joee | brown | 20210101 |
| Jess | black | 20210101 |
| James | green | 20210101 |
+------------+--------------------+-----------+
Output:
+------------+--------------------+-----------+----------+
| Name | colour | date | Got |
+------------+--------------------+-----------+----------+
| Joee | brown | 20210101 | 1001 |
| Jess | black | 20210101 | 1201 |
| James | green | 20210101 | 1300 |
+------------+--------------------+-----------+----------+
How can I do this? With join?

As mck suggested, a simple join is enough for your case: explicitly specify equality of the colour column's values between the two DataFrames, as seen below. We drop one of the two colour columns, since they hold the same value in every row after the join:
val joined = df1.join(df2, df1("colour").equalTo(df2("colour")))
  .drop(df1("colour"))
This is what we get after showing the newly formed joined DataFrame:
+----+-----+------+--------+
|code| name|colour| date|
+----+-----+------+--------+
|1001| Joee| brown|20210101|
|1201| Jess| black|20210101|
|1300|James| green|20210101|
+----+-----+------+--------+
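If you also want the code column to appear under the name Got, as in the expected output, one option (a minimal sketch, assuming the column names shown in the question) is to join on the column name, which keeps a single colour column automatically, and then rename Code:
val got = df1
  .join(df2, Seq("colour"))            // joining on the name keeps a single colour column
  .withColumnRenamed("Code", "Got")    // match the column name used in the expected output
  .select("Name", "colour", "date", "Got")

got.show()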

Related

Compare specific rows of DataFrames in Scala

I have two Scala DataFrames which I am testing for similarities. I want to be able to pick a specific row number, and compare each value of that row between the two DataFrames. For example:
Dataframe 1: df1
+------+-----+-----------+
| Name | Age | Eye Color |
+------+-----+-----------+
| Bob | 12 | Blue |
| Bil | 17 | Red |
| Ron | 13 | Brown |
+------+-----+-----------+
Dataframe 2: df2
+------+-----+-----------+
| Name | Age | Eye Color |
+------+-----+-----------+
| Bob | 12 | Blue |
| Bil | 14 | Blue |
| Ron | 13 | Brown |
+------+-----+-----------+
Input: Row 2, output: Age, Eye Color.
What would be ideal is for the output to show the values that are different too. I have considered the option here, but the issue is that my DataFrames are very large (in excess of 200,000 rows), so this takes far too long. Is there a simpler way to select a specific row value of a DataFrame in Scala?
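One possible approach (just a sketch, assuming both DataFrames have the same schema and that their current ordering is the one you want to compare by) is to attach a positional index with zipWithIndex, pull the row at the requested position out of each DataFrame, and compare the two rows field by field:
import org.apache.spark.sql.{DataFrame, Row}

val rowIndex = 1L  // 0-based, so this is "Row 2" from the example

// Fetch the row at a given position without collecting the whole DataFrame
def rowAt(df: DataFrame, i: Long): Row =
  df.rdd.zipWithIndex.filter(_._2 == i).map(_._1).first()

val r1 = rowAt(df1, rowIndex)
val r2 = rowAt(df2, rowIndex)

// Report the columns whose values differ, together with both values
val differences = df1.columns.zipWithIndex.collect {
  case (name, i) if r1.get(i) != r2.get(i) => s"$name: ${r1.get(i)} vs ${r2.get(i)}"
}
differences.foreach(println)  // e.g. Age: 17 vs 14, Eye Color: Red vs Blue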

How to combine pyspark dataframes with different shapes and different columns

I have two DataFrames in PySpark. One has more than 1000 rows and the other only 4 rows. The columns do not match either.
df1 with more than 1000 rows:
+----+--------+--------------+-------------+
| ID | col1 | col2 | col 3 |
+----+--------+--------------+-------------+
| 1 | time1 | value_col2 | value_col3 |
| 2 | time 2 | value2_col2 | value2_col3 |
+----+--------+--------------+-------------+
...
df2 with only 4 rows:
+-----+--------------+--------------+
| key | col_c | col_d |
+-----+--------------+--------------+
| a | valuea_colc | valuea_cold |
| b | valueb_colc | valueb_cold |
+-----+--------------+--------------+
I want to create a dataframe looking like this:
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
| ID | col1 | col2 | col 3 | a_col_c | a_col_d | b_col_c | b_col_d |
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
| 1 | time1 | value_col2 | value_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
| 2 | time 2 | value2_col2 | value2_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
Can you please help with this? I prefer not to use Pandas.
Thank you!
I actually figured this out using crossJoin.
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html explains how to use crossJoin with Pyspark DataFrames.

PostgreSQL - How to do a Loop on a column

I am struggling to write a loop in Postgres, but Postgres functions are not my cup of tea.
I have the following table on postgres:
| portfolio_1 | total_risk |
|----------------|------------|
| Top 10 Bets | |
| AAPL34 | 2,06699 |
| DISB34 | 1,712684 |
| PETR4 | 0,753324 |
| PETR3 | 0,087767 |
| VALE3 | 0,086346 |
| LREN3 | 0,055108 |
| AMZO34 | 0,0 |
| Bottom 10 Bets | |
| AAPL34 | 0,0 |
What I'm trying to do is get the values after the "Top 10 Bets" row and before the "Bottom 10 Bets" row.
My goal is the following result:
| portfolio_1 | total_risk |
|-------------|------------|
| AAPL34 | 2,06699 |
| DISB34 | 1,712684 |
| PETR4 | 0,753324 |
| PETR3 | 0,087767 |
| VALE3 | 0,086346 |
| LREN3 | 0,055108 |
| AMZO34 | 0,0 |
So my goal is to remove the "Top 10 Bets" row, the "Bottom 10 Bets" row, and the repeated AAPL34 that comes after "Bottom 10 Bets".
The number of rows is variable (I'm importing the data from an Excel file), so I need a loop to do this, right?
SQL tables and result sets represent unordered sets. There is no "before" or "after" unless rows explicitly provide that information.
Let me assume that you have such a column, which I will call id for convenience.
Then you can do this in several ways. Here is one:
select t.*
from t
where t.id > (select min(t2.id) from t t2 where t2.portfolio_1 = 'Top 10 Bets') and
      t.id < (select max(t2.id) from t t2 where t2.portfolio_1 = 'Bottom 10 Bets');

Tableau - Calculated field for difference between date and maximum date in table

I have the following table loaded in Tableau (it has only one column, CreatedOnDate):
+-----------------+
| CreatedOnDate |
+-----------------+
| 1/1/2016 |
| 1/2/2016 |
| 1/3/2016 |
| 1/4/2016 |
| 1/5/2016 |
| 1/6/2016 |
| 1/7/2016 |
| 1/8/2016 |
| 1/9/2016 |
| 1/10/2016 |
| 1/11/2016 |
| 1/12/2016 |
| 1/13/2016 |
| 1/14/2016 |
+-----------------+
I want to be able to find the maximum date in the table, compare it with every date in the table, and get the difference in days. For the table above, the maximum date is 1/14/2016, so every date is compared to 1/14/2016 to find the difference.
Expected Output
+-----------------+------------+
| CreatedOnDate | Difference |
+-----------------+------------+
| 1/1/2016 | 13 |
| 1/2/2016 | 12 |
| 1/3/2016 | 11 |
| 1/4/2016 | 10 |
| 1/5/2016 | 9 |
| 1/6/2016 | 8 |
| 1/7/2016 | 7 |
| 1/8/2016 | 6 |
| 1/9/2016 | 5 |
| 1/10/2016 | 4 |
| 1/11/2016 | 3 |
| 1/12/2016 | 2 |
| 1/13/2016 | 1 |
| 1/14/2016 | 0 |
+-----------------+------------+
My goal is to create this Difference calculated field. I am struggling to find a way to do this using DATEDIFF.
Any help would be appreciated!
woodhead92, that approach would work, but it means you have to use table calculations. A much more flexible approach (available since v8) is Level of Detail (LOD) expressions:
First, define the MAX date for the whole dataset with a calculated field called MaxDate LOD:
{ FIXED : MAX([CreatedOnDate]) }
This will always calculate the maximum date in the table (it also overrides filters; if you need it to reflect them, make sure you add them to context).
Then you can use pretty much the same calculated field, but no need for ATTR or Table Calculations:
DATEDIFF('day', [CreatedOnDate], [MaxDate LOD])
Hope this helps!

org mode spreadsheet formula for the number of lines in a cell

I am looking for an org-mode spreadsheet formula to get the number of non-empty lines in a cell. Example:
| col1 | col2 |
|------+------|
| a | 3 |
| b | |
| c | |
| | |
|------+------|
| a | 1 |
| | |
|------+------|
| a | 2 |
| b | |
| | |
|------+------|
I have "col1" as input, and would like to fill "col2" automatically (the values can be anything, not just a b c).
Note that what you call "cell" is actually a group of cells delimited by horizontal separators (hlines).
The following example uses Calc's vlen function to get the size of the vector of cells in column 1, over the rows between the previous (@-I) and next (@+I) hlines.
| col1 | col2 |
|------+------|
| a | 3 |
| b | |
| c | |
| | |
|------+------|
#+TBLFM: @2$2=vlen(@-I$1..@+I$1)
You have to apply this same formula for all row groups.