I have a pyspark dataframe with multiple map columns. I want to flatten all map columns recursively. personal and financial are map type columns. Similarly, we might have more map columns.
Input dataframe:
-------------------------------------------------------------------------------------------------------
| id | name | Gender | personal | financial |
-------------------------------------------------------------------------------------------------------
| 1 | A | M | {age:20,city:Dallas,State:Texas} | {salary:10000,bonus:2000,tax:1500}|
| 2 | B | F | {city:Houston,State:Texas,Zipcode:77001} | {salary:12000,tax:1800} |
| 3 | C | M | {age:22,city:San Jose,Zipcode:940088} | {salary:2000,bonus:500} |
-------------------------------------------------------------------------------------------------------
Output dataframe:
--------------------------------------------------------------------------------------------------------------
| id | name | Gender | age | city | state | Zipcode | salary | bonus | tax |
--------------------------------------------------------------------------------------------------------------
| 1 | A | M | 20 | Dallas | Texas | null | 10000 | 2000 | 1500 |
| 2 | B | F | null | Houston | Texas | 77001 | 12000 | null | 1800 |
| 3 | C | M | 22 | San Jose | null | 940088 | 2000 | 500 | null |
--------------------------------------------------------------------------------------------------------------
use map_concat to merge the map fields and then explode them. exploding a map column creates 2 new columns - key and value. pivot the key column with value as values to get your desired output.
data_sdf. \
withColumn('personal_financial', func.map_concat('personal', 'financial')). \
selectExpr(*[c for c in data_sdf.columns if c not in ['personal', 'financial']],
'explode(personal_financial)'
). \
groupBy([c for c in data_sdf.columns if c not in ['personal', 'financial']]). \
pivot('key'). \
agg(func.first('value')). \
show(truncate=False)
# +---+----+------+-----+-------+----+-----+--------+------+----+
# |id |name|gender|State|Zipcode|age |bonus|city |salary|tax |
# +---+----+------+-----+-------+----+-----+--------+------+----+
# |1 |A |M |Texas|null |20 |2000 |Dallas |10000 |1500|
# |2 |B |F |Texas|77001 |null|null |Houston |12000 |1800|
# |3 |C |M |null |940088 |22 |500 |San Jose|2000 |null|
# +---+----+------+-----+-------+----+-----+--------+------+----+
i need help to implement below Python logic into Pyspark dataframe.
Python:
df1['isRT'] = df1['main_string'].str.lower().str.contains('|'.join(df2['sub_string'].str.lower()))
df1.show()
+--------+---------------------------+
|id | main_string |
+--------+---------------------------+
| 1 | i am a boy |
| 2 | i am from london |
| 3 | big data hadoop |
| 4 | always be happy |
| 5 | software and hardware |
+--------+---------------------------+
df2.show()
+--------+---------------------------+
|id | sub_string |
+--------+---------------------------+
| 1 | happy |
| 2 | xxxx |
| 3 | i am a boy |
| 4 | yyyy |
| 5 | from london |
+--------+---------------------------+
Final Output:
df1.show()
+--------+---------------------------+--------+
|id | main_string | isRT |
+--------+---------------------------+--------+
| 1 | i am a boy | True |
| 2 | i am from london | True |
| 3 | big data hadoop | False |
| 4 | always be happy | True |
| 5 | software and hardware | False |
+--------+---------------------------+--------+
First construct the substring list substr_list, and then use the rlike function to generate the isRT column.
df3 = df2.select(F.expr('collect_list(lower(sub_string))').alias('substr'))
substr_list = '|'.join(df3.first()[0])
df = df1.withColumn('isRT', F.expr(f'lower(main_string) rlike "{substr_list}"'))
df.show(truncate=False)
For your two dataframes,
df1 = spark.createDataFrame(['i am a boy', 'i am from london', 'big data hadoop', 'always be happy', 'software and hardware'], 'string').toDF('main_string')
df1.show(truncate=False)
df2 = spark.createDataFrame(['happy', 'xxxx', 'i am a boy', 'yyyy', 'from london'], 'string').toDF('sub_string')
df2.show(truncate=False)
+---------------------+
|main_string |
+---------------------+
|i am a boy |
|i am from london |
|big data hadoop |
|always be happy |
|software and hardware|
+---------------------+
+-----------+
|sub_string |
+-----------+
|happy |
|xxxx |
|i am a boy |
|yyyy |
|from london|
+-----------+
you can get the following result with the simple join expression.
from pyspark.sql import functions as f
df1.join(df2, f.col('main_string').contains(f.col('sub_string')), 'left') \
.withColumn('isRT', f.expr('if(sub_string is null, False, True)')) \
.drop('sub_string') \
.show()
+--------------------+-----+
| main_string| isRT|
+--------------------+-----+
| i am a boy| true|
| i am from london| true|
| big data hadoop|false|
| always be happy| true|
|software and hard...|false|
+--------------------+-----+
I have a DataFrame that has a list of countries and the corresponding data. However the countries are either iso3 or iso2.
dfJSON
.select("value.country")
.filter(size($"value.country") > 0)
.groupBy($"country")
.agg(count("*").as("cnt"));
Now this country field can have USA as the country code or US as the country code. I need to map both USA / US ==> "United States" and then do a groupBy. How do I do this in scala.
Create another DataFrame with country_name, iso_2 & iso_3 columns.
Join your actual DataFrame with this DataFrame & Apply your logic on that data.
Check below code for sample.
scala> countryDF.show(false)
+-------------------+-----+-----+
|country_name |iso_2|iso_3|
+-------------------+-----+-----+
|Afghanistan |AF |AFG |
|?land Islands |AX |ALA |
|Albania |AL |ALB |
|Algeria |DZ |DZA |
|American Samoa |AS |ASM |
|Andorra |AD |AND |
|Angola |AO |AGO |
|Anguilla |AI |AIA |
|Antarctica |AQ |ATA |
|Antigua and Barbuda|AG |ATG |
|Argentina |AR |ARG |
|Armenia |AM |ARM |
|Aruba |AW |ABW |
|Australia |AU |AUS |
|Austria |AT |AUT |
|Azerbaijan |AZ |AZE |
|Bahamas |BS |BHS |
|Bahrain |BH |BHR |
|Bangladesh |BD |BGD |
|Barbados |BB |BRB |
+-------------------+-----+-----+
only showing top 20 rows ```
scala> df.show(false)
+-------+
|country|
+-------+
|USA |
|US |
|IN |
|IND |
|ID |
|IDN |
|IQ |
|IRQ |
+-------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.show(false)
+-------+------------------------+
|country|country_name |
+-------+------------------------+
|USA |United States of America|
|US |United States of America|
|IN |India |
|IND |India |
|ID |Indonesia |
|IDN |Indonesia |
|IQ |Iraq |
|IRQ |Iraq |
+-------+------------------------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.groupBy($"country_name")
.agg(collect_list($"country").as("country_code"),count("*").as("country_count"))
.show(false)
+------------------------+------------+-------------+
|country_name |country_code|country_count|
+------------------------+------------+-------------+
|Iraq |[IQ, IRQ] |2 |
|India |[IN, IND] |2 |
|United States of America|[USA, US] |2 |
|Indonesia |[ID, IDN] |2 |
+------------------------+------------+-------------+
I want to count the number of missing values in each row of a data frame in spark scala.
Code:
val samplesqlDF = spark.sql("SELECT * FROM sampletable")
samplesqlDF.show()
Input Dataframe:
------------------------------------------------------------------
| name | age | degree | Place |
| -----------------------------------------------------------------|
| Ram | | MCA | Bangalore |
| | 25 | | |
| | 26 | BE | |
| Raju | 21 | Btech | Chennai |
-----------------------------------------------------------------
The Output Data frame (Row Level Count) as follows:
-----------------------------------------------------------------
| name | age | degree | Place | rowcount |
| ----------------------------------------------------------------|
| Ram | | MCA | Bangalore | 1 |
| | 25 | | | 3 |
| | 26 | BE | | 2 |
| Raju | 21 | Btech | Chennai | 0 |
-----------------------------------------------------------------
I am a beginner to scala and spark. Thanks in advance.
Looks like you want to get the null count in a dynamic way. Check this out
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,null),(null,"26","BE",null),("Raju","21","Btech","Chennai")).toDF("name","age","degree","Place")
df.show(false)
val df2 = df.columns.foldLeft(df)( (df,c) => df.withColumn(c+"_null", when(col(c).isNull,1).otherwise(0) ) )
df2.createOrReplaceTempView("student")
val sql_str_null = df.columns.map( x => x+"_null").mkString(" ","+"," as null_count ")
val sql_str_full = df.columns.mkString( "select ", ",", " , " + sql_str_null + " from student")
spark.sql(sql_str_full).show(false)
Output:
+----+----+------+---------+----------+
|name|age |degree|Place |null_count|
+----+----+------+---------+----------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null |null |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+----------+
Also a possibility and checking also for "" but not using foldLeft just to demonstrate the point:
import org.apache.spark.sql.functions._
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,""),(null,"26","BE",null),("Raju","21","Btech","Chennai")).toDF("name","age","degree","place")
// Count per row the null or "" columns!
val null_counter = Seq("name", "age", "degree", "place").map(x => when(col(x) === "" || col(x).isNull , 1).otherwise(0)).reduce(_ + _)
val df2 = df.withColumn("nulls_cnt", null_counter)
df2.show(false)
returns:
+----+----+------+---------+---------+
|name|age |degree|place |nulls_cnt|
+----+----+------+---------+---------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null | |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+---------+
A simplified version of the one suggested by #stack0114106 is
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,null),
(null,"26","BE",null),("Raju","21","Btech","Chennai"))
.toDF("name","age","degree","Place")
.withColumn("null_count", lit(0))
val df2 = df.columns.foldLeft(df)((df,c) =>
df.withColumn("null_count",
when(col(c).isNull,$"null_count" + 1).otherwise($"null_count")
)
)
df2.show(false)
the output is
+----+----+------+---------+----------+
|name|age |degree|Place |null_count|
+----+----+------+---------+----------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null |null |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+----------+
I have a 2 tables. First table contains information of the object, second table contains related objects. Second table objects have 4 types( lets call em A,B,C,D).
I need a query that does something like this
|table1 object id | A |value for A|B | value for B| C | value for C|D | vlaue for D|
| 1 | 12| cat | 13| dog | 2 | house | 43| car |
| 1 | 5 | lion | | | | | | |
The column "table1 object id" in real table is multiple columns of data from table 1(for single object its all the same, just repeated on multiple rows because of table 2).
Where 2nd table is in form
|type|value|table 1 object id| id |
|A |cat | 1 | 12|
|B |dog | 1 | 13|
|C |house| 1 | 2 |
|D |car | 1 | 43 |
|A |lion | 1 | 5 |
I hope this is clear enough of the thing i want.
I have tryed using AND and OR and JOIN. This does not seem like something that can be done with crosstab.
EDIT
Table 2
|type|value|table 1 object id| id |
|A |cat | 1 | 12|
|B |dog | 1 | 13|
|C |house| 1 | 2 |
|D |car | 1 | 43 |
|A |lion | 1 | 5 |
|C |wolf | 2 | 6 |
Table 1
| id | value1 | value 2|value 3|
| 1 | hello | test | hmmm |
| 2 | bye | test2 | hmm2 |
Result
|value1| value2| value3| A| value| B |value| C|value | D | value|
|hello | test | hmmm |12| cat | 13| dog |2 | house | 23| car |
|hello | test | hmmm |5 | lion | | | | | | |
|bye | test2 | hmm2 | | | | |6 | wolf | | |
I hope this explains bit bettter of what I want to achieve.