Compare specific rows of DataFrames in Scala

I have two Spark DataFrames in Scala which I am testing for similarities. I want to be able to pick a specific row number and compare each value of that row between the two DataFrames. For example:
Dataframe 1: df1
+------+-----+-----------+
| Name | Age | Eye Color |
+------+-----+-----------+
| Bob  | 12  | Blue      |
| Bil  | 17  | Red       |
| Ron  | 13  | Brown     |
+------+-----+-----------+
Dataframe 2: df2
+------+-----+-----------+
| Name | Age | Eye Color |
+------+-----+-----------+
| Bob  | 12  | Blue      |
| Bil  | 14  | Blue      |
| Ron  | 13  | Brown     |
+------+-----+-----------+
Input: Row 2, output: Age, Eye Color.
What would be ideal is for the output to also show the values that differ. I have considered the option here, but the issue is that my DataFrames are very large (in excess of 200,000 rows), so this takes far too long. Is there a simpler way to select a specific row of a DataFrame in Scala?
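One possible approach, as a minimal sketch rather than a tested solution: attach a positional index with zipWithIndex on the underlying RDD, pull the n-th row out of each DataFrame, and compare the fields position by position. The rowAt helper below is hypothetical, both DataFrames are assumed to share the same schema, and since Spark does not guarantee row order you would normally sort both by the same key (e.g. Name) before indexing:
import org.apache.spark.sql.{DataFrame, Row}

// Sketch: pick row `n` (0-based) out of a DataFrame.
// Note: Spark rows have no guaranteed order, so sort both DataFrames
// by the same key beforehand if "row n" must be well defined.
def rowAt(df: DataFrame, n: Long): Row =
  df.rdd.zipWithIndex().filter(_._2 == n).map(_._1).first()

val n = 1L  // "Row 2" with 0-based indexing
val r1 = rowAt(df1, n)
val r2 = rowAt(df2, n)

// Column names whose values differ in that row, together with both values
val diffs = df1.columns.zipWithIndex.collect {
  case (name, i) if r1.get(i) != r2.get(i) => (name, r1.get(i), r2.get(i))
}
diffs.foreach { case (c, v1, v2) => println(s"$c: $v1 vs $v2") }
With the example data above (assuming the stored order matches the tables), row 2 would report Age: 17 vs 14 and Eye Color: Red vs Blue. Note that zipWithIndex still scans the data, so for repeated lookups it may be worth caching both DataFrames or joining on a key column instead.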

Related

Create a new DF using another two

I have two dataframes that share the column colour, and I would like to create a new column in a new DF with the Code that corresponds to the colour column, as you can see:
DF1
+------+--------+
| Code | colour |
+------+--------+
| 1001 | brown  |
| 1201 | black  |
| 1300 | green  |
+------+--------+
DF2
+-------+--------+----------+
| Name  | colour | date     |
+-------+--------+----------+
| Joee  | brown  | 20210101 |
| Jess  | black  | 20210101 |
| James | green  | 20210101 |
+-------+--------+----------+
Output:
+-------+--------+----------+------+
| Name  | colour | date     | Got  |
+-------+--------+----------+------+
| Joee  | brown  | 20210101 | 1001 |
| Jess  | black  | 20210101 | 1201 |
| James | green  | 20210101 | 1300 |
+-------+--------+----------+------+
How can I do this? With join?
As mck suggested, a simple join is enough for your case: explicitly specify the equality of the colour column's values between the two DataFrames, as seen below (we drop one of the two colour columns since they hold the same value in every row after the join):
// Inner join on matching colour values; keep only one copy of the colour column
val joined = df1.join(df2, df1("colour").equalTo(df2("colour")))
  .drop(df1("colour"))
This is what we get after showing the newly formed joined DataFrame:
+----+-----+------+--------+
|code| name|colour| date|
+----+-----+------+--------+
|1001| Joe| brown|20210101|
|1201| Jess| black|20210101|
|1300|James| green|20210101|
+----+-----+------+--------+
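If the new column should be called Got as in the desired output, the looked-up Code column can simply be renamed afterwards (a small optional addition; it assumes the column is named Code as in DF1's schema):
// Rename the looked-up Code column to Got and order the columns as in the desired output
val withGot = joined
  .withColumnRenamed("Code", "Got")
  .select("Name", "colour", "date", "Got")
withGot.show()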

How to derive a column based on two different merged dimensions in SAP Business Objects?

I have two tables, and I want to bring one column from Table1 into Table2 by merging on the common fields Name and Color.
Until now I was using only one merged dimension, and it was working well for me.
Now, with two merged dimensions, it is not working.
I have two tables:
Table1 Name: Fruits
+--------+--------+-------------+
| Name   | Color  | Rateperunit |
+--------+--------+-------------+
| banana | yellow | 3           |
| apple  | red    | 25          |
| apple  | green  | 30          |
+--------+--------+-------------+
Table2 Name: Purchase
+--------+--------+-------------+-----------+-----------+
| Item   | Color  | Rateperunit | Noofitems | Totalbill |
+--------+--------+-------------+-----------+-----------+
| apple  | green  | 30          | 3         | 90        |
+--------+--------+-------------+-----------+-----------+

Fast split Spark dataframe by keys in some column and save as different dataframes

I have a very big Spark 2.3 dataframe like this:
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA      | 1    | 2    |
| AB      | 2    | 1    |
| AA      | 2    | 3    |
| AC      | 1    | 2    |
| AA      | 3    | 2    |
| AC      | 5    | 3    |
-------------------------
I need to "split" this dataframe by values in col_key column and save each splitted part in separate csv file, so I have to get smaller dataframes like
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA      | 1    | 2    |
| AA      | 2    | 3    |
| AA      | 3    | 2    |
-------------------------
and
-------------------------
| col_key | col1 | col2 |
-------------------------
| AC      | 1    | 2    |
| AC      | 5    | 3    |
-------------------------
and so on.
Each resulting dataframe then needs to be saved as a separate csv file.
The number of keys is not big (20-30), but the total amount of data is (~200 million records).
I have a solution where each part of the data is selected in a loop and then saved to a file:
import org.apache.spark.sql.functions.lit
import spark.implicits._  // for the $"..." column syntax and the String encoder

// Collect the distinct keys, then filter and save the data for each key in turn
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList
keysList.foreach { k =>
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save)
}
It works correctly, but the bad side of this solution is that every selection by key causes a full pass through the whole dataframe, and it takes too much time.
I think there must be a faster solution, where we pass through the dataframe only once and, during that pass, put every record into the "right" result dataframe (or directly into a separate file). But I don't know how to do it :)
Maybe someone has ideas about it?
Also, I prefer to use Spark's DataFrame API because it provides the fastest way of processing the data (so using RDDs is not desirable, if possible).
You can partition the output by that column when writing the csv; each key's rows are then saved under their own directory.
yourDF
  .write
  .partitionBy("col_key")
  .csv("/path/to/save")
Why don't you try this?
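For reference, partitionBy creates one sub-directory per key value under the output path (e.g. /path/to/save/col_key=AA/), and the col_key column itself is not repeated inside the csv files. If you would rather end up with roughly one file per key instead of one file per Spark task, a common variation (a sketch, not strictly required) is to repartition by the key column before writing:
import org.apache.spark.sql.functions.col

// Shuffle so that all rows of a key land in the same partition before writing;
// this typically yields a single csv file per col_key=... directory.
yourDF
  .repartition(col("col_key"))
  .write
  .partitionBy("col_key")
  .csv("/path/to/save")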

Tallying in Scala DataFrame Array

I have a 2-column Spark Scala DataFrame. The first column is a single variable; the second is an array of letters. What I am trying to do is find a way to code a tally (without using a for loop) of the variables in the array.
For example, this is what I have (I am sorry it's not that neat, this is my first Stack post). You have 5 computers, and each person is represented by a letter. I want to find out how many computers a person (A, B, C, D, E) has used.
+------------+-----------------+
| id         | [person]        |
+------------+-----------------+
| Computer 1 | [A,B,C,D]       |
| Computer 2 | [A,B]           |
| Computer 3 | [A,B,E]         |
| Computer 4 | [A,C,D]         |
| Computer 5 | [A,B,C,D,E]     |
+------------+-----------------+
What I would like to code up, or am asking if anyone has a solution for, is something like this:
+--------+---------+
| Person | [Count] |
+--------+---------+
| A      | 5       |
| B      | 4       |
| C      | 3       |
| D      | 3       |
| E      | 2       |
+--------+---------+
Somehow count the people who are in arrays within the dataframe.
There's a function called explode which will expand the arrays into one row for each item:
+------------+--------+
| id         | person |
+------------+--------+
| Computer 1 | A      |
| Computer 1 | B      |
| Computer 1 | C      |
| Computer 1 | D      |
....
+------------+--------+
Then you can group by the person and count. Something like:
// explode gives one row per (id, person) pair (needs org.apache.spark.sql.functions.explode)
val df2 = df.select(explode($"person").as("person"))
val result = df2.groupBy($"person").count()
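Put together as a self-contained sketch (the sample data and the id/person column names mirror the question; the local SparkSession is only there to make the snippet runnable on its own):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder.master("local[*]").appName("tally").getOrCreate()
import spark.implicits._

// Sample data shaped like the question's table
val df = Seq(
  ("Computer 1", Seq("A", "B", "C", "D")),
  ("Computer 2", Seq("A", "B")),
  ("Computer 3", Seq("A", "B", "E")),
  ("Computer 4", Seq("A", "C", "D")),
  ("Computer 5", Seq("A", "B", "C", "D", "E"))
).toDF("id", "person")

// One row per (computer, person) pair, then count computers per person
df.select(explode($"person").as("person"))
  .groupBy($"person")
  .count()
  .orderBy($"count".desc)
  .show()
// Expected: A -> 5, B -> 4, C -> 3, D -> 3, E -> 2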

PostgreSQL simple count query

Trying to scale this down so the answer is simple. I can probably extrapolate the answers here to apply to a bigger data set.
Given the following table:
+------+-----+
| name | age |
+------+-----+
| a    | 5   |
| b    | 7   |
| c    | 8   |
| d    | 8   |
| e    | 10  |
+------+-----+
I want to make a table that shows the count of people whose age is equal to or greater than x. For instance, the table above would produce:
+--------------+-------+
| at least age | count |
+--------------+-------+
| 5            | 5     |
| 6            | 4     |
| 7            | 4     |
| 8            | 3     |
| 9            | 1     |
| 10           | 1     |
+--------------+-------+
Is there a single query that can accomplish this task? Obviously, it is easy to write a simple function for it, but I'm hoping to be able to do this quickly with one query.
Thanks!
Yes, what you're looking for is a window function.
-- Count people per age, then take a running total from the oldest age downwards,
-- so each row shows how many people are at least that age.
-- (Only ages that actually occur in the table appear; the "missing" ages 6 and 9
--  from the desired output would need something like generate_series.)
with cte_age_count as (
    select age,
           count(*) as c_star
    from people
    group by age
)
select age as at_least_age,
       sum(c_star) over (order by age desc
                         range between unbounded preceding
                                   and current row) as cnt
from cte_age_count
order by age;
Not syntax checked ... let me know if it works!