I've been struggling with this one for a while. I need to remove "duplicates" in my dataframe, which looks like this:
| X     | Y   | Z   |
|-------+-----+-----|
| team1 | ABC | DEF |
| team1 |     |     |
| team2 |     |     |
| team2 | GHK | LMN |
| team3 |     | RST |
| team4 | UVW | WYZ |
I need the outcome to be:
| X     | Y   | Z   |
|-------+-----+-----|
| team1 | ABC | DEF |
| team2 | GHK | LMN |
| team3 |     | RST |
| team4 | UVW | WYZ |
The problem is that not all rows have these empty values. I've tried using first('Y', True), and I've tried first(coalesce(col('Y')), True), but these are not nulls, just empty values. As a result I removed the duplicates, but the values I got were the empty ones. Is there a way to pick the first non-empty value if it exists?
Some rows "naturally" have empty values, and there are no duplicates there. Sorry, I'm new here, thank you so much!
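A minimal sketch of one way to do this in Spark Scala (assuming a SparkSession named spark; the PySpark version is analogous): map the empty strings to nulls first, so that first(..., ignoreNulls = true) can skip them when aggregating per team:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("team1", "ABC", "DEF"), ("team1", "", ""),
  ("team2", "", ""), ("team2", "GHK", "LMN"),
  ("team3", "", "RST"), ("team4", "UVW", "WYZ")
).toDF("X", "Y", "Z")

// "" counts as a value for first(), so turn it into null before aggregating
def emptyToNull(c: String) = when(col(c) === "", lit(null)).otherwise(col(c))

val deduped = df
  .groupBy("X")
  .agg(
    first(emptyToNull("Y"), ignoreNulls = true).as("Y"),
    first(emptyToNull("Z"), ignoreNulls = true).as("Z")
  )
  .orderBy("X")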
I am working on a Spark dataframe. The input dataframe looks like below (Table 1). I need to write logic to get the keywords with the maximum length for each session id; multiple keywords would be part of the output for each session id. The expected output looks like Table 2.
Input dataframe:
(Table 1)
|-----------+------------+-----------------------------------|
| session_id| value | Timestamp |
|-----------+------------+-----------------------------------|
| 1 | cat | 2021-01-11T13:48:54.2514887-05:00 |
| 1 | catc | 2021-01-11T13:48:54.3514887-05:00 |
| 1 | catch | 2021-01-11T13:48:54.4514887-05:00 |
| 1 | par | 2021-01-11T13:48:55.2514887-05:00 |
| 1 | part | 2021-01-11T13:48:56.5514887-05:00 |
| 1 | party | 2021-01-11T13:48:57.7514887-05:00 |
| 1 | partyy | 2021-01-11T13:48:58.7514887-05:00 |
| 2 | fal | 2021-01-11T13:49:54.2514887-05:00 |
| 2 | fall | 2021-01-11T13:49:54.3514887-05:00 |
| 2 | falle | 2021-01-11T13:49:54.4514887-05:00 |
| 2 | fallen | 2021-01-11T13:49:54.8514887-05:00 |
| 2 | Tem | 2021-01-11T13:49:56.5514887-05:00 |
| 2 | Temp | 2021-01-11T13:49:56.7514887-05:00 |
|-----------+------------+-----------------------------------|
Expected Output:
(Table 2)
|-----------+------------|
| session_id| value      |
|-----------+------------|
| 1         | catch      |
| 1         | partyy     |
| 2         | fallen     |
| 2         | Temp       |
|-----------+------------|
Solution I tried:
I added another column called col_length, which captures the length of each word in the value column. Later I tried to compare each row with its subsequent row to see whether it has the maximum length, but this solution only works partially.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = spark.read.parquet("/project/project_name/abc")
val dfM = df.select($"session_id", $"value", $"Timestamp").withColumn("col_length", length($"value"))
val ts = Window
  .orderBy("session_id")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)
val result = dfM
  .withColumn("running_max", max("col_length") over ts)
  .where($"running_max" === $"col_length")
  .select("session_id", "value", "Timestamp")
Current Output:
|-----------+------------|
| session_id| value      |
|-----------+------------|
| 1         | catch      |
| 2         | fallen     |
|-----------+------------|
Multiple columns do not work inside an orderBy clause with a window function, so I didn't get the desired output; I got one result per session id. Any suggestions would be highly appreciated. Thanks in advance.
You can solve it by using the lead function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// order each session's keywords by time so that lead() sees the next keystroke
val windowSpec = Window.partitionBy("session_id").orderBy("Timestamp")
dfM
  .withColumn("lead", lead("value", 1).over(windowSpec))
  // keep a row when the next value is shorter (a new word started) or missing (session ended)
  .filter(length(col("lead")) < length(col("value")) || col("lead").isNull)
  .drop("lead")
  .show()
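A row survives the filter exactly when the next value in its session is shorter or missing, i.e. when the current value is the longest form of the word being typed; that is why each session can contribute several rows, unlike the running-max attempt.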
I am relatively new to Spark and Scala. I have a dataframe which has the following format:
| Col1 | Col2 | Col3 | Col_4 | Col_5 | Col_TS | Col_7 |
| 1234 | AAAA | 1111 | afsdf | ewqre | 1970-01-01 00:00:00.0 | false |
| 1234 | AAAA | 1111 | ewqrw | dafda | 2017-01-17 07:09:32.748 | true |
| 1234 | AAAA | 1111 | dafsd | afwew | 2015-01-17 07:09:32.748 | false |
| 5678 | BBBB | 2222 | afsdf | qwerq | 1970-01-01 00:00:00.0 | true |
| 5678 | BBBB | 2222 | bafva | qweqe | 2016-12-08 07:58:43.04 | false |
| 9101 | CCCC | 3333 | caxad | fsdaa | 1970-01-01 00:00:00.0 | false |
What I need to do is to get the row that corresponds to the latest timestamp.
In the example above, the keys are Col1, Col2 and Col3. Col_TS represents the timestamp and Col_7 is a boolean that determines the validity of the record.
What I want to do is to find a way to group these records based on the keys and retain the one that has the latest timestamp.
So the output of the operation in the dataframe above should be:
| Col1 | Col2 | Col3 | Col_4 | Col_5 | Col_TS | Col_7 |
| 1234 | AAAA | 1111 | ewqrw | dafda | 2017-01-17 07:09:32.748 | true |
| 5678 | BBBB | 2222 | bafva | qweqe | 2016-12-08 07:58:43.04 | false |
| 9101 | CCCC | 3333 | caxad | fsdaa | 1970-01-01 00:00:00.0 | false |
I came up with a partial solution, but this way I can only return a dataframe with the key columns on which the records are grouped, and not the other columns:
val grouped = df.groupBy("Col1", "Col2", "Col3").agg(max("Col_TS"))
| Col1 | Col2 | Col3 | max(Col_TS) |
| 1234 | AAAA | 1111 | 2017-01-17 07:09:32.748 |
| 5678 | BBBB | 2222 | 2016-12-08 07:58:43.04 |
| 9101 | CCCC | 3333 | 1970-01-01 00:00:00.0 |
Can someone help me in coming up with a Scala code for performing this operation?
You can use a window function as follows; because the window is ordered by Col_TS descending, first("Col_TS") over it is the latest timestamp in each group:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy("Col1", "Col2", "Col3").orderBy(col("Col_TS").desc)
df.withColumn("maxTS", first("Col_TS").over(windowSpec))
  .where(col("maxTS") === col("Col_TS"))
  .drop("maxTS")
  .show(false)
You should get the following output:
+----+----+----+-----+-----+-----------------------+-----+
|Col1|Col2|Col3|Col_4|Col_5|Col_TS                 |Col_7|
+----+----+----+-----+-----+-----------------------+-----+
|5678|BBBB|2222|bafva|qweqe|2016-12-08 07:58:43.04 |false|
|1234|AAAA|1111|ewqrw|dafda|2017-01-17 07:09:32.748|true |
|9101|CCCC|3333|caxad|fsdaa|1970-01-01 00:00:00.0  |false|
+----+----+----+-----+-----+-----------------------+-----+
One option is to first order the dataframe by Col_TS, then group by Col1, Col2 and Col3 and take the last value of each remaining column:
import org.apache.spark.sql.functions._

// aggregate every non-key column with last(), keeping its original name
val val_columns = Seq("Col_4", "Col_5", "Col_TS", "Col_7").map(x => last(col(x)).alias(x))
df.orderBy("Col_TS")
  .groupBy("Col1", "Col2", "Col3")
  .agg(val_columns.head, val_columns.tail: _*)
  .show()
+----+----+----+-----+-----+--------------------+-----+
|Col1|Col2|Col3|Col_4|Col_5| Col_TS|Col_7|
+----+----+----+-----+-----+--------------------+-----+
|1234|AAAA|1111|ewqrw|dafda|2017-01-17 07:09:...| true|
|9101|CCCC|3333|caxad|fsdaa|1970-01-01 00:00:...|false|
|5678|BBBB|2222|bafva|qweqe|2016-12-08 07:58:...|false|
+----+----+----+-----+-----+--------------------+-----+
I have one table:
| id | head1| head2 | head3|
| 1 | fv1 | fw1,fw2,fw3| fv3 |
| 2 | sv2 | sw1,sw2,sw3| sv4 |
And I would like to have the following:
| id | head2 |
| 1 | fw1 |
| 1 | fw2 |
| 1 | fw3 |
| 2 | sw1 |
| 2 | sw2 |
| 2 | sw3 |
So I would like to split the comma-delimited content of some columns and then copy it over into a different table as rows, for search purposes.
Which Talend component should I use to achieve this? Is that possible?
tNormalize should help you with this problem.
Just select "," as field separator, and head2 as the column to normalize.
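tNormalize splits the selected column on the separator and emits one output row per token, carrying the other columns (here, id) along unchanged, which should yield exactly the row-per-value table above.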
I'm trying to use Tableau (v10.1) to combine 5 separate columns and get a count of how many times each distinct value appears across that combination. Some rows/columns are empty. For example:
+-------+-------+-------+-------+-------+
| Tag 1 | Tag 2 | Tag 3 | Tag 4 | Tag 5 |
+-------+-------+-------+-------+-------+
| A | B | C | D | E |
| B | D | E | - | - |
| - | - | - | - | - |
| E | A | - | - | - |
+-------+-------+-------+-------+-------+
I want to obtain the following in a Tableau worksheet:
+-----+-------+
| Tag | Count |
+-----+-------+
| E | 3 |
| A | 2 |
| B | 2 |
| D | 2 |
| C | 1 |
+-----+-------+
I would like to do this in Tableau (using calculated fields, etc.) and not change the original data source.
Click on the Data Source tab, select the five fields named Tag #, and then use the pivot command to reshape the data without changing the original source.
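The pivot should produce two new fields, by default named Pivot Field Names and Pivot Field Values; placing Pivot Field Values on Rows with a count of records, and filtering out the empty values, should then give the table above.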
I have been trying to simplify a semi-complex table of mine by adding named fields, which worked without a problem until I got to the vsum operator. I had the formula set to $M=vsum($3..#-4), which works; however, I am continuously having to add and remove items from those fields, which changes the numbering. This results in me having to change the field specifications of the vsum range after every update/change. I therefore tried naming the top and bottom fields, with the thought of supplying the named variables to vsum, giving me a table similar to the following:
| / | <> | <> |
|---+--------+---------|
| | Title1 | Title 2 |
|---+--------+---------|
| _ | | START |
| | name | 1000 |
| | name | 3456 |
| | name | 123 |
| ^ | | END |
|---+--------+---------|
| _ | | MT |
| # | Total | #ERROR |
| # | | |
|---+--------+---------|
#+TBLFM: $MT=vsum($START..$END)
This is the debug formula output from the above table:
Substitution history of formula
Orig: vsum($START..$END)
$xyz-> vsum((1000)..(123))
#r$c-> vsum((1000)..(123))
$1-> vsum((1000)..(123))
-----------^
Error: Expected `)'
I have tried enclosing the named field variables in parentheses, and several other ways, but have thus far not been able to get this to work. I am hoping I am just missing something and being blind, but perhaps this is not possible to do?
I have also tried the sum-up function, with no success. Thank you in advance for your assistance.
The following solution works by using @II and @III to refer to all entries between the second and third hline.
| / | <> | <> |
|---+--------+---------|
| | Title1 | Title 2 |
|---+--------+---------|
| | name | 1000 |
| | name | 3456 |
| | name | 123 |
|---+--------+---------|
| _ | | MT |
| # | Total | 4579 |
| # | | |
|---+--------+---------|
#+TBLFM: $MT=vsum(@II..@III)
Documentation: http://orgmode.org/manual/References.html#References