pyspark dataframe check if string contains substring - pyspark

i need help to implement below Python logic into Pyspark dataframe.
Python:
df1['isRT'] = df1['main_string'].str.lower().str.contains('|'.join(df2['sub_string'].str.lower()))
df1.show()
+--------+---------------------------+
|id | main_string |
+--------+---------------------------+
| 1 | i am a boy |
| 2 | i am from london |
| 3 | big data hadoop |
| 4 | always be happy |
| 5 | software and hardware |
+--------+---------------------------+
df2.show()
+--------+---------------------------+
|id | sub_string |
+--------+---------------------------+
| 1 | happy |
| 2 | xxxx |
| 3 | i am a boy |
| 4 | yyyy |
| 5 | from london |
+--------+---------------------------+
Final Output:
df1.show()
+--------+---------------------------+--------+
|id | main_string | isRT |
+--------+---------------------------+--------+
| 1 | i am a boy | True |
| 2 | i am from london | True |
| 3 | big data hadoop | False |
| 4 | always be happy | True |
| 5 | software and hardware | False |
+--------+---------------------------+--------+

First construct the substring list substr_list, and then use the rlike function to generate the isRT column.
df3 = df2.select(F.expr('collect_list(lower(sub_string))').alias('substr'))
substr_list = '|'.join(df3.first()[0])
df = df1.withColumn('isRT', F.expr(f'lower(main_string) rlike "{substr_list}"'))
df.show(truncate=False)

For your two dataframes,
df1 = spark.createDataFrame(['i am a boy', 'i am from london', 'big data hadoop', 'always be happy', 'software and hardware'], 'string').toDF('main_string')
df1.show(truncate=False)
df2 = spark.createDataFrame(['happy', 'xxxx', 'i am a boy', 'yyyy', 'from london'], 'string').toDF('sub_string')
df2.show(truncate=False)
+---------------------+
|main_string |
+---------------------+
|i am a boy |
|i am from london |
|big data hadoop |
|always be happy |
|software and hardware|
+---------------------+
+-----------+
|sub_string |
+-----------+
|happy |
|xxxx |
|i am a boy |
|yyyy |
|from london|
+-----------+
you can get the following result with the simple join expression.
from pyspark.sql import functions as f
df1.join(df2, f.col('main_string').contains(f.col('sub_string')), 'left') \
.withColumn('isRT', f.expr('if(sub_string is null, False, True)')) \
.drop('sub_string') \
.show()
+--------------------+-----+
| main_string| isRT|
+--------------------+-----+
| i am a boy| true|
| i am from london| true|
| big data hadoop|false|
| always be happy| true|
|software and hard...|false|
+--------------------+-----+

Related

Check if a value is between two columns, spark scala

I have two dataframes, one with my data and another one to compare. What I want to do is check if a value is in a range of two different columns, for example:
Df_player
+--------+-------+
| Baller | Power |
+--------+-------+
| John | 1.5 |
| Bilbo | 3.7 |
| Frodo | 6 |
+--------+-------+
Df_Check
+--------+--------+--------+
| First | Second | Value |
+--------+--------+--------+
| 1 | 1.5 | Bad- |
| 1.5 | 3 | Bad |
| 3 | 4.2 | Good |
| 4.2 | 6 | Good+ |
+--------+--------+--------+
The result would be:
Df_out
+--------+-------+--------+
| Baller | Power | Value |
+--------+-------+--------+
| John | 1.5 | Bad- |
| Bilbo | 3.7 | Good |
| Frodo | 6 | Good+ |
+--------+-------+--------+
You can do a join based on a between condition, but note that .between is not appropriate here because you want inequality in one of the comparisons:
val result = df_player.join(
df_check,
df_player("Power") > df_check("First") && df_player("Power") <= df_check("Second"),
"left"
).select("Baller", "Power", "Value")
result.show
+------+-----+-----+
|Baller|Power|Value|
+------+-----+-----+
| John| 1.5| Bad-|
| Bilbo| 3.7| Good|
| Frodo| 6.0|Good+|
+------+-----+-----+

Use different dataframes to create new one with information (Scala Spark)

I have one dataframe with games and three valoration for every game from different reviews, every valoration is traduced in another dataframe as you can see:
Df_reviews
+--------+-------+-------+--------+
|Game | rev_1 | rev_2 | rev_3 |
+- ------+-------+-------+--------+
|CA |XX+ | K2 | L1 |
|FT |Z- | K1+ | L3 |
Df_rev1
+----------+-------------+
| review_1 | Equivalence |
+----------+-------------+
|XX+ | 9 |
|Y | 6 |
|Z- | 3 |
Df_rev2
+----------+-------------+
| review_2 | Equivalence |
+----------+-------------+
|K2 | 7 |
|K1+ | 6 |
|K3 | 10 |
Df_rev3
+----------+-------------+
| review_3 | Equivalence |
+----------+-------------+
|L3 | 10 |
|L2 | 9 |
|L1 | 8 |
I have to traduce it in a new dataframe with the valoration traduced and add a column with the second best valoration, for this example would be:
Df_output
+--------+---------+---------+----------+-------------+
|Game | rev_1_t | rev_2_t | rev_3_t | second_best |
+--------+---------+---------+----------+-------------+
|CA | 9 | 7 | 8 | 8 |
|FT | 3 | 6 | 10 | 6 |
To traduce it, I am trying with a left join but I am so lost. How can I deal with this?
####### Second Part ######
How can I translate columns from one dataframe to others from another dataframe, joining with multiple columns vs one? for example:
Df_revuews
+--------+-------+-------+--------+
|Game | rev_1 | rev_2 | rev_3 |
+- ------+-------+-------+--------+
|CA |XX+ | K2 | L1 |
|FT |Z- | K1+ | L3 |
Df_equiv
+--------+-------+
|Valorat | num |
+- ------+-------+
|X |3 |
|XX+ |5 |
|Z |7 |
|Z- |6 |
|K1+ |6 |
|K2 |4 |
|L1 |5 |
|L2 |6 |
|L3 |7 |
Output
+--------+-------+-------+--------+
|Game | rev_1 | rev_2 | rev_3 |
+- ------+-------+-------+--------+
|CA |5 | 4 | 5 |
|FT |6 | 6 | 7 |
I am doing this as you can see:
val joined = df_reviews
.join(df_equiv, df_reviews("rev_1") === df_equiv("num") && df_reviews("rev_2") === df_equiv("num")
&& df_reviews("rev_3") === df_equiv("num"), "left")
.select(df_reviews("Game"),
df_equiv("num").as("rev_1_t"),
df_equiv("num").as("rev_2_t"),
df_equiv("num").as("rev_3_t")
)
Thanks in advance!
You can do some left joins and get the second highest column using sort_array:
val joined = df_reviews
.join(df_rev1, df_reviews("rev_1") === df_rev1("review_1"), "left")
.join(df_rev2, df_reviews("rev_2") === df_rev2("review_2"), "left")
.join(df_rev3, df_reviews("rev_3") === df_rev3("review_3"), "left")
.select(df_reviews("Game"),
df_rev1("Equivalence").as("rev_1_t"),
df_rev2("Equivalence").as("rev_2_t"),
df_rev3("Equivalence").as("rev_3_t")
)
val array_sort_udf = udf((x: Seq[Int]) => x.sortBy(_ != null))
val result = joined.withColumn(
"second_best",
coalesce(
array_sort_udf(
array(col("rev_1_t").cast("int"), col("rev_2_t").cast("int"), col("rev_3_t").cast("int"))
)(1),
greatest(col("rev_1_t").cast("int"), col("rev_2_t").cast("int"), col("rev_3_t").cast("int"))
)
)
result.show
+----+-------+-------+-------+-----------+
|Game|rev_1_t|rev_2_t|rev_3_t|second_best|
+----+-------+-------+-------+-----------+
| CA| 9| 7| 8| 8|
| FT| 3| 6| 10| 6|
+----+-------+-------+-------+-----------+
For your second question:
val joined = df_reviews.as("r1")
.join(df_equiv.as("e1"), expr("r1.rev_1 = e1.Valorat"), "left")
.selectExpr("Game", "e1.num as rev_1", "rev_2", "rev_3")
.as("r2")
.join(df_equiv.as("e2"), expr("r2.rev_2 = e2.Valorat"), "left")
.selectExpr("Game", "rev_1", "e2.num as rev_2", "rev_3")
.as("r3")
.join(df_equiv.as("e3"), expr("r3.rev_3 = e3.Valorat"), "left")
.selectExpr("Game", "rev_1", "rev_2", "e3.num as rev_3")
joined.show
+----+-----+-----+-----+
|Game|rev_1|rev_2|rev_3|
+----+-----+-----+-----+
| CA| 5| 4| 5|
| FT| 6| 6| 7|
+----+-----+-----+-----+

Mean with differents columns ignoring Null values, Spark Scala

I have a dataframe with different columns, what I am trying to do is the mean of this diff columns ignoring null values. For example:
+--------+-------+---------+-------+
| Baller | Power | Vision | KXD |
+--------+-------+---------+-------+
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
+--------+-------+---------+-------+
The output have to be:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3,33 |
+--------+-------+---------+-------+-----------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But I get the null values:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3,33 |
+--------+-------+---------+-------+-----------+
You can explode the columns and do a group by + mean, then join back to the original dataframe using the Baller column:
val result = df.join(
df.select(
col("Baller"),
explode(array(col("Power"), col("Vision"), col("KXD")))
).groupBy("Baller").agg(mean("col").as("MEAN")),
Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD| MEAN|
+------+-----+------+---+------------------+
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
+------+-----+------+---+------------------+

Scala group by with mapped keys

I have a DataFrame that has a list of countries and the corresponding data. However the countries are either iso3 or iso2.
dfJSON
.select("value.country")
.filter(size($"value.country") > 0)
.groupBy($"country")
.agg(count("*").as("cnt"));
Now this country field can have USA as the country code or US as the country code. I need to map both USA / US ==> "United States" and then do a groupBy. How do I do this in scala.
Create another DataFrame with country_name, iso_2 & iso_3 columns.
Join your actual DataFrame with this DataFrame & Apply your logic on that data.
Check below code for sample.
scala> countryDF.show(false)
+-------------------+-----+-----+
|country_name |iso_2|iso_3|
+-------------------+-----+-----+
|Afghanistan |AF |AFG |
|?land Islands |AX |ALA |
|Albania |AL |ALB |
|Algeria |DZ |DZA |
|American Samoa |AS |ASM |
|Andorra |AD |AND |
|Angola |AO |AGO |
|Anguilla |AI |AIA |
|Antarctica |AQ |ATA |
|Antigua and Barbuda|AG |ATG |
|Argentina |AR |ARG |
|Armenia |AM |ARM |
|Aruba |AW |ABW |
|Australia |AU |AUS |
|Austria |AT |AUT |
|Azerbaijan |AZ |AZE |
|Bahamas |BS |BHS |
|Bahrain |BH |BHR |
|Bangladesh |BD |BGD |
|Barbados |BB |BRB |
+-------------------+-----+-----+
only showing top 20 rows ```
scala> df.show(false)
+-------+
|country|
+-------+
|USA |
|US |
|IN |
|IND |
|ID |
|IDN |
|IQ |
|IRQ |
+-------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.show(false)
+-------+------------------------+
|country|country_name |
+-------+------------------------+
|USA |United States of America|
|US |United States of America|
|IN |India |
|IND |India |
|ID |Indonesia |
|IDN |Indonesia |
|IQ |Iraq |
|IRQ |Iraq |
+-------+------------------------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.groupBy($"country_name")
.agg(collect_list($"country").as("country_code"),count("*").as("country_count"))
.show(false)
+------------------------+------------+-------------+
|country_name |country_code|country_count|
+------------------------+------------+-------------+
|Iraq |[IQ, IRQ] |2 |
|India |[IN, IND] |2 |
|United States of America|[USA, US] |2 |
|Indonesia |[ID, IDN] |2 |
+------------------------+------------+-------------+

reorder column values pyspark

When I perform a Select operation on a DataFrame in PySpark it reduces to the following:
+-----+--------+-------+
| val | Feat1 | Feat2 |
+-----+--------+-------+
| 1 | f1a | f2a |
| 2 | f1a | f2b |
| 8 | f1b | f2f |
| 9 | f1a | f2d |
| 4 | f1b | f2c |
| 6 | f1b | f2a |
| 1 | f1c | f2c |
| 3 | f1c | f2g |
| 9 | f1c | f2e |
+-----+--------+-------+
I require the val column to be ordered group wise based on another field Feat1 like the following:
+-----+--------+-------+
| val | Feat1 | Feat2 |
+-----+--------+-------+
| 1 | f1a | f2a |
| 2 | f1a | f2b |
| 3 | f1a | f2d |
| 1 | f1b | f2c |
| 2 | f1b | f2a |
| 3 | f1b | f2f |
| 1 | f1c | f2c |
| 2 | f1c | f2g |
| 3 | f1c | f2e |
+-----+--------+-------+
NOTE that the val values don't depend on the order of Feat2 but are instead ordered based on their original val values.
Is there a command to reorder the column value in PySpark as required.
NOTE: Question exists for the same but is specific to SQL-lite.
data = [(1, 'f1a', 'f2a'),
(2, 'f1a', 'f2b'),
(8, 'f1b', 'f2f'),
(9, 'f1a', 'f2d'),
(4, 'f1b', 'f2c'),
(6, 'f1b', 'f2a'),
(1, 'f1c', 'f2c'),
(3, 'f1c', 'f2g'),
(9, 'f1c', 'f2e')]
table = sqlContext.createDataFrame(data, ['val', 'Feat1', 'Feat2'])
Edit: For this purpose, you can use window with rank function:
from pyspark.sql import Window
from pyspark.sql.functions import rank
w = Window.partitionBy('Feat1').orderBy('val')
table.withColumn('val', rank().over(w)).orderBy('Feat1').show()
+---+-----+-----+
|val|Feat1|Feat2|
+---+-----+-----+
| 1| f1a| f2a|
| 2| f1a| f2b|
| 3| f1a| f2d|
| 1| f1b| f2c|
| 2| f1b| f2a|
| 3| f1b| f2f|
| 1| f1c| f2c|
| 2| f1c| f2g|
| 3| f1c| f2e|
+---+-----+-----+