How can I feed the array_repeat function's count value from another column?
>>> import pyspark.sql.functions as F
>>> dftmp = spark.createDataFrame([('ab',)], ['data'])
>>> dftmp.withColumn('repeat', F.array_repeat(dftmp.data, 3)).show()
+----+------------+
|data|      repeat|
+----+------------+
|  ab|[ab, ab, ab]|
+----+------------+
Is there a way to use the repeat count value based on another column? e.g.
>>> dftmp = dftmp.withColumn('len', F.length(F.col('data')))
>>> dftmp.withColumn('repeat', F.array_repeat(dftmp.data, F.col('len')))
TypeError: Column is not iterable
Expected Result
+----+------------+---+
|data|      repeat|len|
+----+------------+---+
|  ab|    [ab, ab]|  2|
+----+------------+---+
You could use F.expr:
from pyspark.sql import functions as F
dftmp.withColumn('repeat', F.expr("""array_repeat(data, len)"""))
Or you could just calculate length in there too like:
dftmp.withColumn('repeat', F.expr("""array_repeat(data, length(data))"""))
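Putting the two together, a minimal sketch (assuming the dftmp created above; recent PySpark versions may also accept a Column directly as the count argument, but F.expr works across versions):

import pyspark.sql.functions as F

dftmp = spark.createDataFrame([('ab',)], ['data'])
result = (dftmp
          .withColumn('len', F.length('data'))                  # per-row repeat count
          .withColumn('repeat', F.expr("array_repeat(data, len)")))
result.show()
+----+---+--------+
|data|len|  repeat|
+----+---+--------+
|  ab|  2|[ab, ab]|
+----+---+--------+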
I have a pyspark dataframe like this:
+-------+-------+--------------------+
|s_field|s_check|            t_filter|
+-------+-------+--------------------+
|  MANDT|   true|               !='E'|
|  WERKS|   true|0010_0020_0021_00...|
+-------+-------+--------------------+
And as a first step, I split t_filter on _ with f.split(f.col("t_filter"), "_"):
filters = filters.withColumn("t_filter_1", f.split(f.col("t_filter"), "_"))
filters.show(truncate=False)
+-------+-------+--------------------+-------------------------+
|s_field|s_check|            t_filter|               t_filter_1|
+-------+-------+--------------------+-------------------------+
|  MANDT|   true|               !='E'|                  [!='E']|
|  WERKS|   true|0010_0020_0021_00...|[0010, 0020, 0021, 00...]|
+-------+-------+--------------------+-------------------------+
What I want to achieve is to create a new column, using s_field and t_filter as the input while doing a logic check for !=.
Ultimate aim:
+-------------------------------+
|                     t_filter_2|
+-------------------------------+
|                   MANDT != 'E'|
| WERKS in ('0010', '0020', ...)|
+-------------------------------+
I have tried using withColumn, but I keep getting an error that col must be Column.
I am also not sure what the proper approach should be in order to achieve this.
Note: there is a large number of rows, around 10k. I understand that using a UDF would be quite expensive, so I'm interested to know whether there are other ways this can be done.
You can achieve this using withColumn with conditional evaluation via the when and otherwise functions. Following your example, the logic is: if t_filter contains !=, concatenate s_field and t_filter; otherwise, first convert the t_filter_1 array to a string with , as the separator, then concatenate it with s_field along with literals for in and ().
from pyspark.sql import functions as f
filters.withColumn(
    "t_filter_2",
    f.when(f.col("t_filter").contains("!="), f.concat("s_field", "t_filter")).otherwise(
        f.concat(f.col("s_field"), f.lit(" in ('"), f.concat_ws("','", "t_filter_1"), f.lit("')"))
    ),
)
Output
+-------+-------+--------------------+-------------------------+---------------------------------------+
|s_check|s_field|t_filter |t_filter_1 |t_filter_2 |
+-------+-------+--------------------+-------------------------+---------------------------------------+
|true |MANDT |!='E' |[!='E'] |MANDT!='E' |
|true |WERKS |0010_0020_0021_00...|[0010, 0020, 0021, 00...]|WERKS in ('0010','0020','0021','00...')|
+-------+-------+--------------------+-------------------------+---------------------------------------+
Complete Working Example
from pyspark.sql import functions as f
filters_data = [
    {"s_field": "MANDT", "s_check": True, "t_filter": "!='E'"},
    {"s_field": "WERKS", "s_check": True, "t_filter": "0010_0020_0021_00..."},
]
filters = spark.createDataFrame(filters_data)
filters = filters.withColumn("t_filter_1", f.split(f.col("t_filter"), "_"))
filters.withColumn(
    "t_filter_2",
    f.when(f.col("t_filter").contains("!="), f.concat("s_field", "t_filter")).otherwise(
        f.concat(f.col("s_field"), f.lit(" in ('"), f.concat_ws("','", "t_filter_1"), f.lit("')"))
    ),
).show(200, False)
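Since this uses only built-in column expressions (when, concat, concat_ws), it runs entirely inside Spark and avoids the UDF cost the question was worried about. If you prefer, the same logic can also be written as one SQL expression via expr; a sketch, intended to be equivalent to the when/otherwise version above:

filters.withColumn(
    "t_filter_2",
    f.expr("""
        CASE WHEN t_filter LIKE '%!=%'
             THEN concat(s_field, t_filter)
             ELSE concat(s_field, " in ('", concat_ws("','", t_filter_1), "')")
        END
    """),
).show(200, False)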
So I have one pyspark dataframe like so, let's call it dataframe a:
+-------------------+---------------+----------------+
| reg| val1| val2 |
+-------------------+---------------+----------------+
| N110WA| 1590030660| 1590038340000|
| N876LF| 1590037200| 1590038880000|
| N135MH| 1590039060| 1590040080000|
And another like this, let's call it dataframe b:
+-----+-------------+-----+-----+---------+----------+---+----+
| reg| postime| alt| galt| lat| long|spd| vsi|
+-----+-------------+-----+-----+---------+----------+---+----+
|XY679|1590070078549| 50| 130|18.567169|-69.986343|132|1152|
|HI949|1590070091707| 375| 455| 18.5594|-69.987804|148|1344|
|JX784|1590070110666| 825| 905|18.544968|-69.990414|170|1216|
Is there some way to create a numpy array or pyspark dataframe where, for each row in dataframe a, all the rows in dataframe b with the same reg and a postime between val1 and val2 are included?
You can try the solution below -- let us know if it works or if something else is expected.
I have modified the inputs a little in order to showcase the working solution.
Input here
from pyspark.sql import functions as F
df_a = spark.createDataFrame([('N110WA',1590030660,1590038340000), ('N110WA',1590070078549,1590070078559)],[ "reg","val1","val2"])
df_b = spark.createDataFrame([('N110WA',1590070078549)],[ "reg","postime"])
df_a.show()
df_a
+------+-------------+-------------+
| reg| val1| val2|
+------+-------------+-------------+
|N110WA| 1590030660|1590038340000|
|N110WA|1590070078549|1590070078559|
+------+-------------+-------------+
df_b
+------+-------------+
| reg| postime|
+------+-------------+
|N110WA|1590070078549|
+------+-------------+
Solution here
from pyspark.sql import types as T
from pyspark.sql import functions as F
# join on reg first so that postime from df_b is available next to val1/val2
df_a = df_a.join(df_b, 'reg', 'left')
df_a = df_a.withColumn('condition_col', F.when(((F.col('postime') >= F.col('val1')) & (F.col('postime') <= F.col('val2'))), '1').otherwise('0'))
df_a = df_a.filter(F.col('condition_col') == '1').drop('condition_col')
df_a.show()
Final Output
+------+-------------+-------------+-------------+
| reg| val1| val2| postime|
+------+-------------+-------------+-------------+
|N110WA|1590070078549|1590070078559|1590070078549|
+------+-------------+-------------+-------------+
Yes, assuming df_a and df_b are both pyspark dataframes, you can use an inner join with a range condition:
delta = 0  # optional tolerance on the time bounds; set as needed
df = df_a.join(df_b, [
    df_a.reg == df_b.reg,
    df_b.postime >= df_a.val1 - delta,
    df_b.postime <= df_a.val2 + delta
], "inner")
This keeps only the rows of df_b whose postime falls between val1 and val2 (plus or minus delta) for a matching reg.
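An equivalent way to express the range condition is Column.between (a sketch; column names are taken from the question, and note that in the sample data val1 looks like epoch seconds while val2 and postime look like epoch milliseconds, so the units may need aligning before comparing):

df = df_a.join(
    df_b,
    on=[df_a.reg == df_b.reg,
        df_b.postime.between(df_a.val1, df_a.val2)],   # inclusive range check
    how="inner",
)
df.show()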
I am comparing a condition with a pyspark join in my application using the substring function. This function is returning a Column type instead of a value.
substring(trim(coalesce(df.col1)), 13, 3) returns
Column<b'substring(trim(coalesce(col1), 13, 3)'>
I tried with expr but I still get the same Column type result:
expr("substring(trim(coalesce(df.col1)),length(trim(coalesce(df.col1))) - 2, 3)")
I want to compare the values coming from substring to the value of another dataframe column. Both are of string type
pyspark:
substring(trim(coalesce(df.col1)), length(trim(coalesce(df.col1))) -2, 3) == df2["col2"]
Let's say col1 = 'abcdefghijklmno'.
The expected output of the substring function should be 'mno' based on the above definition.
Creating sample dataframes to join:
list1 = [('ABC','abcdefghijklmno'),('XYZ','abcdefghijklmno'),('DEF','abcdefghijklabc')]
df1=spark.createDataFrame(list1, ['col1', 'col2'])
list2 = [(1,'mno'),(2,'mno'),(3,'abc')]
df2=spark.createDataFrame(list2, ['col1', 'col2'])
import pyspark.sql.functions as f
Creating a substring expression that reads the last three characters:
cond=f.substring(df1['col2'], -3, 3)==df2['col2']
newdf=df1.join(df2,cond)
>>> newdf.show()
+----+---------------+----+----+
|col1| col2|col1|col2|
+----+---------------+----+----+
| ABC|abcdefghijklmno| 1| mno|
| ABC|abcdefghijklmno| 2| mno|
| XYZ|abcdefghijklmno| 1| mno|
| XYZ|abcdefghijklmno| 2| mno|
| DEF|abcdefghijklabc| 3| abc|
+----+---------------+----+----+
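As a side note on the original confusion: a Column such as substring(...) is a lazy expression, not a value; it only gets evaluated when used inside a DataFrame operation such as select, filter, or join. A small sketch using the df1 defined above:

last3 = f.substring(f.trim(df1['col2']), -3, 3)   # just an unevaluated Column expression
print(last3)                                      # prints a Column<...> repr, not 'mno'
df1.select(last3.alias('last3')).show()           # evaluation happens here
+-----+
|last3|
+-----+
|  mno|
|  mno|
|  abc|
+-----+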
I have a streaming DataFrame for which I want to calculate min and avg over some columns.
Instead of getting separate resulting columns of min and avg after applying the operations, I want to merge the min and average output into a single column.
The dataframe looks like this:
+-----+-----+
|    1|    2|
+-----+-----+
|   24|   55|
|   20|   51|
+-----+-----+
I thought I'd use a Scala tuple for it, but that does not seem to work:
val res = List("1","2").map(name => (min(col(name)), avg(col(name))).as(s"result($name)"))
All code used:
val res = List("1","2").map(name => (min(col(name)),avg(col(name))).as(s"result($name)"))
val groupedByTimeWindowDF1 = processedDf.groupBy($"xyz", window($"timestamp", "60 seconds"))
.agg(res.head, res.tail: _*)
I'm expecting the output after applying the min and avg mathematical operations to be:
+-----------+-----------+
|  result(1)|  result(2)|
+-----------+-----------+
|     20, 22|     51, 53|
+-----------+-----------+
How should I write the expression?
Use struct standard function:
struct(colName: String, colNames: String*): Column
struct(cols: Column*): Column
Creates a new struct column that composes multiple input columns.
That gives you the values as well as the names (of the columns).
val res = List("1","2").map(name =>
struct(min(col(name)), avg(col(name))) as s"result($name)")
^^^^^^ HERE
The power of struct can be seen when you want to reference one field in the struct and you can use the name (not index).
q.select("structCol.name")
What you want to do is to merge multiple values together into a single column. For this you can use the array function. In this case it would be:
val res = List("1","2").map(name => array(min(col(name)),avg(col(name))).as(s"result($name)"))
Which will give you :
+------------+------------+
| result(1)| result(2)|
+------------+------------+
|[20.0, 22.0]|[51.0, 53.0]|
+------------+------------+
I have two columns, say id and v; id is of type Int and v is of type List[String].
The ids repeat, so to make them unique I apply groupBy("id") on my DataFrame. Now my problem is that I want to append the values to each other, and the resulting value column must be distinct.
Example: I have data like
+---+---+
| id| v |
+---+---+
| 1|[a]|
| 1|[b]|
| 1|[a]|
| 2|[e]|
| 2|[b]|
+---+---+
and I want my output like this:
+---+-----+
| id|    v|
+---+-----+
|  1|[a,b]|
|  2|[e,b]|
+---+-----+
I tried this:
val uniqueDF = df.groupBy("id").agg(collect_list("v"))
uniqueDF.map{ row => (row.getInt(0),
  row.getAs[Seq[String]](1).toList.distinct) }
Can I do the same after groupBy(), say in agg() or something? I do not want to apply a map operation.
Thanks
val uniqueDF = df.groupBy("id").agg(collect_set("v"))
collect_set will keep only unique values.