Dataframe GroupBy aggregate on column that contains pattern - scala

I have a dataframe with columns c1 and c2. I want to group on c1 and pick a c2 value that contains a given pattern; if no c2 in the group contains the pattern, return any one of them.
Example df:
c1 c2
1 ai_za
1 ah_px
1 ag_po
1 af_io
1 ae_aa
1 ad_iq
1 ac_on
1 ab_eh
1 aa_bs
2 aa_ab
2 aa_ac
If the pattern needed in c2 is '_io', the expected result is:
c1 c2
1 af_io
2 aa_ab
1 af_io is returned because it contains the '_io' pattern.
2 aa_ab is returned as an arbitrary row because no c2 in group 2 contains the pattern '_io'.
How can I get this using the Spark DataFrame/Dataset API?

If it doesn't matter which row is picked when there is no match, you can try:
import spark.implicits._

df.groupByKey(_.getAs[Int]("c1"))
  .reduceGroups((x, y) => if (x.getAs[String]("c2").matches(".*_io")) x else y)
  .toDF("key", "value")
  .select("value.c1", "value.c2")
  .show
+---+-----+
| c1|   c2|
+---+-----+
|  1|af_io|
|  2|aa_ac|
+---+-----+
Note: this picks the first row that matches the pattern and picks the last row in the group if there is no match.
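If you would rather stay in the untyped DataFrame API, one common alternative (a sketch of my own, not part of the answer above) is to aggregate with max over a struct whose first field flags whether c2 matches the pattern, so a matching row wins whenever one exists:

import org.apache.spark.sql.functions._

// Sketch: flag pattern matches, then take the struct-wise max per group.
// A row with hit = 1 always beats a row with hit = 0; if nothing matches,
// the fallback is simply the lexicographically largest c2 in the group.
val result = df
  .groupBy("c1")
  .agg(max(struct(col("c2").rlike("_io").cast("int").as("hit"), col("c2"))).as("best"))
  .select(col("c1"), col("best.c2").as("c2"))

result.show()

Unlike the reduceGroups version, the fallback row here is deterministic (the largest c2), which may or may not matter for your use case.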


Separating string and numeric values in PySpark

I have the following dataframe. There are several IDs which have either a numeric or a string value. If the ID has a string_value, like "B", the numeric_value is "NULL" as a string; vice versa for IDs with a numeric value, e.g. ID "D".
ID string_value numeric_value timestamp
0 B On NULL 1632733508
1 B Off NULL 1632733508
2 A Inactive NULL 1632733511
3 A Active NULL 1632733512
4 D NULL 450 1632733513
5 D NULL 431 1632733515
6 C NULL 20 1632733518
7 C NULL 30 1632733521
Now I want to separate the dataframe into a new one for each ID, using an existing list containing all the unique IDs. Afterwards each new dataframe, like "B" in this example, should drop the column with the "NULL" values. So if B has a string_value, the numeric_value column should be dropped.
ID string_value timestamp
0 B On 1632733508
1 B Off 1632733508
After that, the value column should be renamed to the ID "B" and the ID column should be dropped.
B timestamp
0 On 1632733508
1 Off 1632733508
As described, the same procedure should be applied for the numeric values, in this case ID "D":
ID numeric_value timestamp
0 D 450 1632733513
1 D 431 1632733515
D timestamp
0 450 1632733513
1 431 1632733515
It is important to preserve the original data types within the value column.
Assuming your dataframe is called df and your list of IDs is ids, you can write a function which does what you need and call it for every ID.
The function applies the required filter and then selects the needed columns, with the ID as an alias.
from pyspark.sql import functions as f

ids = ["B", "A", "D", "C"]

def split_df(df, id):
    split_df = df.filter(f.col("ID") == id).select(
        f.coalesce(f.col("string_value"), f.col("numeric_value")).alias(id),
        f.col("timestamp"),
    )
    return split_df

dfs = [split_df(df, id) for id in ids]

for df in dfs:
    df.show()

Output:
+---+----------+
| B| timestamp|
+---+----------+
| On|1632733508|
|Off|1632733508|
+---+----------+
+--------+----------+
| A| timestamp|
+--------+----------+
|Inactive|1632733511|
| Active|1632733512|
+--------+----------+
+---+----------+
| D| timestamp|
+---+----------+
|450|1632733513|
|431|1632733515|
+---+----------+
+---+----------+
| C| timestamp|
+---+----------+
| 20|1632733518|
| 30|1632733521|
+---+----------+

How can I count the null entries by column in a kdb q table?

Given a table that contains a number of null entries, how can I create a summary table that describes the number of nulls per column? Can this be done on a general table where the number of columns and column names are not known beforehand?
q)t: ([] a: 1 2 3 4; b: (2018.10.08; 0Nd; 2018.10.08; 2018.10.08); c: (0N;0N;30;40); d: `abc`def``jkl)
q)t
a b          c  d
-------------------
1 2018.10.08    abc
2               def
3 2018.10.08 30
4 2018.10.08 40 jkl
Expected result:
columnName nullCount
--------------------
a          0
b          1
c          2
d          1
While sum null t is the simplest solution in this example, it doesn't handle string (or nested) columns. To handle string or nested columns, for example, you would need something like:
q)t: ([] a: 1 2 3 4; b: (2018.10.08; 0Nd; 2018.10.08; 2018.10.08); c: (0N;0N;30;40); d: `abc`def``jkl;e:("aa";"bb";"";()," "))
q){sum$[0h=type x;0=count#'x;null x]}each flip t
a| 0
b| 1
c| 2
d| 1
e| 1
You can make such a table using
q)flip `columnName`nullCount!(key;value)#\:sum null t
columnName nullCount
--------------------
a          0
b          1
c          2
d          1
where sum null t gives a dictionary of the null values in each column
q)sum null t
a| 0
b| 1
c| 2
d| 1
and we apply the column names as keys and flip to a table.
To produce a table with the column names as the headers and the number of nulls as the values, you can use:
q)tab:enlist sum null t
which enlists a dictionary with the number of nulls as the values and the column names as keys:
a b c d
-------
0 1 2 1
If you then wanted this in your given format, you could use:
result:([]columnNames:cols tab; nullCount:raze value each tab)

How to merge two columns into a new DataFrame?

I have two DataFrames (Spark 2.2.0 and Scala 2.11.8). The first DataFrame df1 has one column called col1, and the second one, df2, also has one column, called col2. The number of rows is equal in both DataFrames.
How can I merge these two columns into a new DataFrame?
I tried join, but I think that there should be some other way to do it.
Also, I tried to apply withColumn, but it does not compile.
val result = df1.withColumn(col("col2"), df2.col1)
UPDATE:
For example:
df1 =
col1
1
2
3
df2 =
col2
4
5
6
result =
col1 col2
1 4
2 5
3 6
If there's no actual relationship between these two columns, it sounds like you need the union operator, which will return, well, just the union of these two dataframes:
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.union(df2).show
+---+
|one|
+---+
|  a|
|  b|
|  c|
|  d|
|  e|
|  f|
+---+
[edit]
Now that you've made clear that you just want two columns, with DataFrames you can use the trick of adding a row index with the function monotonically_increasing_id() and joining on that index value:
import org.apache.spark.sql.functions.monotonically_increasing_id

var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")

df1.withColumn("id", monotonically_increasing_id())
  .join(df2.withColumn("id", monotonically_increasing_id()), Seq("id"))
  .drop("id")
  .show
+---+---+
|one|two|
+---+---+
|  a|  d|
|  b|  e|
|  c|  f|
+---+---+
As far as I know, the only way to do what you want with DataFrames is by adding an index column using RDD.zipWithIndex to each and then doing a join on the index column. Code for doing zipWithIndex on a DataFrame can be found in this SO answer.
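For reference, here is a minimal sketch of that idea (my own illustration, not the code from the linked answer): append a row index to each DataFrame, join on it, and drop it.

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Sketch: add a row index to a DataFrame via RDD.zipWithIndex.
def withRowIndex(df: DataFrame): DataFrame = {
  val schema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  df.sparkSession.createDataFrame(indexed, schema)
}

val merged = withRowIndex(df1)
  .join(withRowIndex(df2), Seq("row_idx"))
  .drop("row_idx")

This pairs row i of df1 with row i of df2; the output order itself is not guaranteed unless you sort by the index before dropping it.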
But, if the DataFrames are small, it would be much simpler to collect the two DFs in the driver, zip them together, and make the result into a new DataFrame.
[Update with example of in-driver collect/zip]
val df3 = spark.createDataFrame(df1.collect().map(_.getInt(0)) zip df2.collect().map(_.getInt(0))).toDF("col1", "col2")
It depends on what you want to do.
If you want to merge two DataFrames you should use a join. The same join types exist as in relational algebra (or any DBMS).
You are saying that your DataFrames just have one column each.
In that case you might want to do a cross join (Cartesian product), which gives you a two-column table of all possible combinations of col1 and col2, or you might want the union (as referred to by @Chondrops), which gives you a one-column table with all elements.
I think all other join types can be done with specialized operations in Spark (in this case, two DataFrames with one column each).
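For completeness, a quick sketch of those two options with the one-column DataFrames from the question (my own illustration, assuming df1 and df2 as defined above):

// Cross join: every row of df1 paired with every row of df2,
// i.e. a two-column table of all combinations of col1 and col2.
val cartesian = df1.crossJoin(df2)
cartesian.show()

// Union: the rows of both DataFrames stacked into one column
// (resolved by position, so the result keeps df1's column name).
val stacked = df1.union(df2)
stacked.show()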

Dataframe groupBy, get corresponding rows value, based on result of aggregate function [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 6 years ago.
I have a dataframe with columns named c1, c2, c3, c4. I want to group it on one column, use an aggregate function such as min/max on another column, and get the corresponding values of the other columns based on the result of the aggregate function.
Example:
c1 c2 c3 c4
1 23 1 1
1 45 2 2
1 91 3 3
1 90 4 4
1 71 5 5
1 42 6 6
1 72 7 7
1 44 8 8
1 55 9 9
1 21 0 0
The result should be:
c1 c2 c3 c4
1 91 3 3
Let the dataframe be df:
df.groupBy($"c1").agg(max($"c2"), ??, ??)
Can someone please help with what should go in place of the ?? placeholders?
I know a solution to this problem using RDDs. I wanted to explore whether this can be solved in an easier way using the DataFrame/Dataset API.
You can do this in two steps:
1. calculate the aggregated data frame;
2. join the data frame back with the original data frame and filter based on the condition.
So:
val maxDF = df.groupBy("c1").agg(max($"c2").as("maxc2"))
// maxDF: org.apache.spark.sql.DataFrame = [c1: int, maxc2: int]
df.join(maxDF, Seq("c1")).where($"c2" === $"maxc2").drop($"maxc2").show
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  1| 91|  3|  3|
+---+---+---+---+
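The duplicate linked at the top lists more approaches; one that is often convenient (a sketch of my own, not part of the answer above) is a window function that numbers the rows of each c1 group by c2 descending and keeps only the top row:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sketch: rank rows within each c1 group by c2 descending and keep the first one.
val w = Window.partitionBy("c1").orderBy(col("c2").desc)

df.withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")
  .show()

Unlike the join approach, this returns exactly one row per group even if the maximum c2 is tied.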

Talend - Append two rows from a delimited file

How can I append two rows of a delimited file?
For example, I have:
a | b | c | d
e | f | g | h
and I want:
a | b | c | d | e | f | g | h
This new file may or may not be saved after the transformation.
Is there any possibility that you have a join condition or relation between these two rows? Or is it always just two rows? Let's say your file contains 4 rows; how would you like to merge them then?
a|b|c
d|e|f
x|y|z
m|g|s
If you have a way to relate these rows, then it will be easier using tMap.
OK, the information you have shared in the comment helps.
Try this:
tfileinputdelimited_1 (read all rows from file) --> filter_01 (only 'TX' rows) --> tmap (add a sequence starting with 1,1)
so the output of tMap will have all columns plus a sequence_column having value 1, 2, 3... for row 1, row 2, row 3... and so on.
Similarly, have another pipeline:
tfileinputdelimited_2 (read all rows from file) --> filter_02 (only 'RX' rows) --> tmap (add a sequence starting with 1,1)
so the output of tMap will have all columns plus a sequence_column having value 1, 2, 3... for row 1, row 2, row 3... and so on.
Now feed both of these pipelines into a tMap, join them based on the sequence column, and select all the columns you need from them into a single output.