Assign labels to categorical data in a table in PySpark

I want to assign labels to the categorical numbers in the dataframe below using PySpark SQL.
In the MARRIAGE column, 1 = Married and 2 = UnMarried. In the EDUCATION column, 1 = Grad and 2 = UnderGrad.
Current Dataframe:
+--------+---------+-----+
|MARRIAGE|EDUCATION|Total|
+--------+---------+-----+
|       1|        2|   87|
|       1|        1|  123|
|       2|        2|    3|
|       2|        1|    8|
+--------+---------+-----+
Resulting Dataframe:
+---------+---------+-----+
|MARRIAGE |EDUCATION|Total|
+---------+---------+-----+
|Married  |UnderGrad|   87|
|Married  |Grad     |  123|
|UnMarried|UnderGrad|    3|
|UnMarried|Grad     |    8|
+---------+---------+-----+
Is it possible to assign the labels using a single UDF and withColumn()? Is there any way to do it with a single UDF by passing the whole dataframe and keeping the column names as they are?
I can think of a solution that does the operation on each column with separate UDFs, as below, but I can't figure out whether there's a way to do it all together.
from pyspark.sql import functions as F

def assign_marital_names(record):
    if record == 1:
        return "Married"
    elif record == 2:
        return "UnMarried"

def assign_edu_names(record):
    if record == 1:
        return "Grad"
    elif record == 2:
        return "UnderGrad"

assign_marital_udf = F.udf(assign_marital_names)
assign_edu_udf = F.udf(assign_edu_names)

df.withColumn("MARRIAGE", assign_marital_udf("MARRIAGE")).\
    withColumn("EDUCATION", assign_edu_udf("EDUCATION")).show(truncate=False)

One UDF can return only one column, but that column can be a struct, and the UDF can apply labels to both MARRIAGE and EDUCATION at once. See the code below:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import Row

udf_result = StructType([StructField('MARRIAGE', StringType()), StructField('EDUCATION', StringType())])
marriage_dict = {1: 'Married', 2: 'UnMarried'}
education_dict = {1: 'Grad', 2: 'UnderGrad'}

def assign_labels(marriage, education):
    return Row(marriage_dict[marriage], education_dict[education])

assign_labels_udf = F.udf(assign_labels, udf_result)

df.withColumn('labels', assign_labels_udf('MARRIAGE', 'EDUCATION')).printSchema()
root
 |-- MARRIAGE: long (nullable = true)
 |-- EDUCATION: long (nullable = true)
 |-- Total: long (nullable = true)
 |-- labels: struct (nullable = true)
 |    |-- MARRIAGE: string (nullable = true)
 |    |-- EDUCATION: string (nullable = true)
But as you can see, this doesn't replace the original columns; it just adds a new one. To replace them, you will need to call withColumn twice more and then drop labels.
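For completeness, a minimal sketch of that replacement step, reusing assign_labels_udf from above (labelled is just a hypothetical intermediate name):
labelled = df.withColumn('labels', assign_labels_udf('MARRIAGE', 'EDUCATION'))
# Overwrite the original columns from the struct, then drop the helper column
labelled = labelled \
    .withColumn('MARRIAGE', labelled['labels.MARRIAGE']) \
    .withColumn('EDUCATION', labelled['labels.EDUCATION']) \
    .drop('labels')
labelled.show(truncate=False)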

Related

Is there an efficient way to return Array[Int] from a spark Dataframe without using collect()

I have a dataframe something like this.
root
 |-- key1: string (nullable = true)
 |-- value1: string (nullable = true)

+----+------+
|key1|value1|
+----+------+
|  E1|     1|
|  E3|     0|
|  E4|     1|
|  E2|     0|
 ...
+----+------+
And I convert the "value1" column to Array[Int] by using the collect() function as below. But this is not an efficient solution; it takes 10-15 seconds, because there is a lot of data in the dataframe and in each Spark Streaming cycle the data is collected to the driver.
val data = Seq(("E1", "1"),
               ("E3", "0"),
               ("E4", "1"),
               ("E2", "0"))
val columns = Seq("key1", "value1")

import spark.implicits._

val df = data.toDF(columns: _*)
val ordered_df = df.orderBy("key1").select("value1").collect().map(_(0)).toList
ordered_df.foreach(print)
Output:
1001
So, what is an efficient way to return an Array[Int] from the above dataframe without using the collect() function?
Thanks,

How to change struct dataType to Integer in pyspark?

I have a dataframe df, and one column has the data type struct<long:bigint, string:string>.
Because of this data type structure, I cannot perform addition, subtraction, etc.
How do I change struct<long:bigint, string:string> to just IntegerType?
You can use dot syntax to access parts of the struct column.
For example, if you start with this dataframe:
df = spark.createDataFrame([(1,(3,'x')),(4,(8, 'y'))]).toDF("col1", "col2")
df.show()
df.printSchema()
+----+------+
|col1|  col2|
+----+------+
|   1|[3, x]|
|   4|[8, y]|
+----+------+

root
 |-- col1: long (nullable = true)
 |-- col2: struct (nullable = true)
 |    |-- _1: long (nullable = true)
 |    |-- _2: string (nullable = true)
You can select the first part of the struct column and either create a new column or replace an existing one:
df.withColumn('col2', df['col2._1']).show()
prints
+----+----+
|col1|col2|
+----+----+
|   1|   3|
|   4|   8|
+----+----+
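If the goal is specifically an IntegerType column rather than the extracted long, the extracted field can additionally be cast; a minimal sketch, assuming the same df as above:
from pyspark.sql.types import IntegerType

# Pull the numeric field out of the struct and cast it from long to int
df.withColumn('col2', df['col2._1'].cast(IntegerType())).printSchema()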

How can I split a column containing array of some struct into separate columns?

I have the following scenario:
case class attribute(key: String, value: String)
case class entity(id: String, attr: List[attribute])

val entities = List(
  entity("1", List(attribute("name", "sasha"), attribute("home", "del"))),
  entity("2", List(attribute("home", "hyd"))))

val df = entities.toDF()
// df.show
+---+--------------------+
| id|                attr|
+---+--------------------+
|  1|[[name,sasha], [d...|
|  2|        [[home,hyd]]|
+---+--------------------+
// df.printSchema
root
 |-- id: string (nullable = true)
 |-- attr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)
What I want to produce is:
+---+-----+----+
| id| name|home|
+---+-----+----+
|  1|sasha| del|
|  2| null| hyd|
+---+-----+----+
How do I go about this? I looked at quite a few similar questions on Stack Overflow but couldn't find anything useful.
My main motive is to do a groupBy on different attributes, so I want to bring the data into the above-mentioned format.
I looked into the explode functionality, but it breaks a list down into separate rows, which I don't want; I want to create more columns from the array of attributes.
Similar things I found:
Spark - convert Map to a single-row DataFrame
Split 1 column into 3 columns in spark scala
Spark dataframe - Split struct column into 2 columns
That can easily be reduced to PySpark converting a column of type 'map' to multiple columns in a dataframe or How to get keys and values from MapType column in SparkSQL DataFrame. First convert attr to map<string, string>
import org.apache.spark.sql.functions.{explode, first, map_from_entries, map_keys}
val dfMap = df.withColumn("attr", map_from_entries($"attr"))
then it's just a matter of finding the unique keys
val keys = dfMap.select(explode(map_keys($"attr"))).as[String].distinct.collect
then selecting from the map
val result = dfMap.select($"id" +: keys.map(key => $"attr"(key) as key): _*)
result.show
+---+-----+----+
| id| name|home|
+---+-----+----+
|  1|sasha| del|
|  2| null| hyd|
+---+-----+----+
A less efficient but more concise variant is to explode and pivot:
val result = df
  .select($"id", explode(map_from_entries($"attr")))
  .groupBy($"id")
  .pivot($"key")
  .agg(first($"value"))
result.show
+---+----+-----+
| id|home| name|
+---+----+-----+
|  1| del|sasha|
|  2| hyd| null|
+---+----+-----+
but in practice I'd advise against it.
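For reference, since the thread is tagged pyspark, here is a sketch of the first approach (map_from_entries plus map_keys) in PySpark; it assumes Spark 2.4+ and a DataFrame df with the same id / attr schema as above:
from pyspark.sql import functions as F

# Convert the array of key/value structs into a map column
df_map = df.withColumn("attr", F.map_from_entries("attr"))

# Collect the distinct keys, then turn each key into its own column
keys = [row[0] for row in df_map.select(F.explode(F.map_keys("attr"))).distinct().collect()]
result = df_map.select("id", *[df_map["attr"][k].alias(k) for k in keys])
result.show()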

How to combine 2 different dataframes together?

I have 2 DataFrames:
Users (~29.000.000 entries)
 |-- userId: string (nullable = true)
Impressions (~1000 entries)
 |-- modules: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- content: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- id: string (nullable = true)
I want to walk through all the Users and attach one Impression from these ~1000 entries to each User. So effectively every ~1000th User would get the same Impression: the loop over the Impressions would start from the beginning again and assign the same ~1000 Impressions to the next ~1000 Users.
At the end I want to have a DataFrame with the combined data. The Users dataframe could be reused by adding the columns of the Impressions, or a newly created one would also work as a result.
Do you have any ideas what would be a good solution here?
What I would do is use the old trick of adding a monotonically increasing ID to both dataframes, then create a new column on your LARGER dataframe (Users) which contains the modulo of each row's ID and the size of the smaller dataframe.
This new column then provides a rolling matching key against the items in the Impressions dataframe.
This is a minimal example (tested) to give you the idea. Obviously it will also work if you have 1000 impressions to join against:
import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._

var users = Seq("user1", "user2", "user3", "user4", "user5", "user6", "user7", "user8", "user9").toDF("users")
var impressions = Seq("a", "b", "c").toDF("impressions").withColumn("id", monotonically_increasing_id())
var cnt = impressions.count

users = users.withColumn("id", monotonically_increasing_id())
  .withColumn("mod", $"id" mod cnt)
  .join(impressions, $"mod" === impressions("id"))
  .drop("mod")

users.show
+-----+---+-----------+---+
|users| id|impressions| id|
+-----+---+-----------+---+
|user1|  0|          a|  0|
|user2|  1|          b|  1|
|user3|  2|          c|  2|
|user4|  3|          a|  0|
|user5|  4|          b|  1|
|user6|  5|          c|  2|
|user7|  6|          a|  0|
|user8|  7|          b|  1|
|user9|  8|          c|  2|
+-----+---+-----------+---+
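For reference, a PySpark sketch of the same modulo trick, assuming an active SparkSession named spark; the toy data only mirrors the Scala example above:
from pyspark.sql import functions as F

# Toy data mirroring the Scala example. coalesce(1) keeps
# monotonically_increasing_id contiguous (0, 1, 2, ...); across several
# partitions the ids are increasing but not consecutive.
users = spark.createDataFrame([("user%d" % i,) for i in range(1, 10)], ["users"]).coalesce(1)
impressions = (spark.createDataFrame([("a",), ("b",), ("c",)], ["impressions"])
               .coalesce(1)
               .withColumn("id", F.monotonically_increasing_id()))
cnt = impressions.count()

# Rolling key: row id modulo the number of impressions
users = (users.withColumn("id", F.monotonically_increasing_id())
              .withColumn("mod", F.col("id") % cnt)
              .join(impressions, F.col("mod") == impressions["id"])
              .drop("mod"))
users.show()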
Sketch of idea:
Add monotonically increasing id to both dataframes Users and Impressions via
val indexedUsersDF = usersDf.withColumn("index", monotonicallyIncreasingId)
val indexedImpressionsDF = impressionsDf.withColumn("index", monotonicallyIncreasingId)
(see spark dataframe: how to add a index Column)
Determine number of rows in Impressions via count and store as int, e.g.
val numberOfImpressions = ...
Apply a UDF to the index column in indexedUsersDF that computes the modulo in a separate column (e.g. moduloIndex)
val moduloIndexedUsersDF = indexedUsersDF.select(...)
Join moduloIndexedUsersDF and indexedImpressionsDF on
moduloIndexedUsersDF("moduloIndex") === indexedImpressionsDF("index")

How to join two dataframes?

I cannot get Spark's DataFrame join to work (no result gets produced). Here is my code:
val e = Seq((1, 2), (1, 3), (2, 4))
var edges = e.map(p => Edge(p._1, p._2)).toDF()
var filtered = edges.filter("start = 1").distinct()
println("filtered")
filtered.show()
filtered.printSchema()
println("edges")
edges.show()
edges.printSchema()
var joined = filtered.join(edges, filtered("end") === edges("start"))//.select(filtered("start"), edges("end"))
println("joined")
joined.show()
It requires case class Edge(start: Int, end: Int) to be defined at top level. Here is the output it produces:
filtered
+-----+---+
|start|end|
+-----+---+
|    1|  2|
|    1|  3|
+-----+---+

root
 |-- start: integer (nullable = false)
 |-- end: integer (nullable = false)

edges
+-----+---+
|start|end|
+-----+---+
|    1|  2|
|    1|  3|
|    2|  4|
+-----+---+

root
 |-- start: integer (nullable = false)
 |-- end: integer (nullable = false)

joined
+-----+---+-----+---+
|start|end|start|end|
+-----+---+-----+---+
+-----+---+-----+---+
I don't understand why the output is empty. Why doesn't the first row of filtered get combined with the last row of edges?
val f2 = filtered.withColumnRenamed("start", "fStart").withColumnRenamed("end", "fEnd")
f2.join(edges, f2("fEnd") === edges("start")).show
I believe this is because filtered("start").equals(edges("start")); that is, since filtered is a filtered view on edges, they share the same column definitions. The columns are the same, so Spark does not understand which one you are referencing.
As such you can do things like
edges.select(filtered("start")).show