I have some example data that originated from JSON and looks similar to the following:
{ hero: axe, attribute: strength, active_abilities: [q, w, r], inactive_abilities: e }
{ hero: invoker, attribute: intelligence, active_abilities: [q, w, e, r, f, d], inactive_abilities: null }
{ hero: phantom assassin, attribute: agility, active_abilities: [q, w, e], inactive_abilities: r }
{ hero: life stealer, attribute: strength, active_abilities: [q, r], inactive_abilities: [w, e] }
The issue I'm having is that the 'inactive_abilities' column is being read as a string because of the variability of the data types that can appear in it: the value can be null, a single string (if there is only one ability), or an array (if there are multiple abilities). What I want in the end is several new columns based on the number of inactive abilities. If there is one ability or none, I want a new column inactive_ability that is populated only when there is exactly one inactive ability and is null when there are none or when there are multiple. Then I would like columns inactive_ability1, inactive_ability2, inactive_ability3, etc. for the case where the array holds more than one value. So from the example above, the end result should look like:
{ hero: axe, attribute: strength, active_abilities: [q, w, r], inactive_abilities: e, inactive_ability: e, inactive_ability1: null, inactive_ability2: null, inactive_ability3: null, inactive_ability4: null }
{ hero: invoker, attribute: intelligence, active_abilities: [q, w, e, r, f, d], inactive_abilities: null, inactive_ability: null, inactive_ability1: null, inactive_ability2: null, inactive_ability3: null, inactive_ability4: null }
{ hero: phantom assassin, attribute: agility, active_abilities: [q, w, e], inactive_abilities: r, inactive_ability: r, inactive_ability1: null, inactive_ability2: null, inactive_ability3: null, inactive_ability4: null }
{ hero: life stealer, attribute: strength, active_abilities: [q, r], inactive_abilities: [w, e], inactive_ability: null, inactive_ability1: w, inactive_ability2: e, inactive_ability3: null, inactive_ability4: null }
I can't assume there will be a fixed number of inactive abilities, but if the count exceeds 4, the rest can be ignored. The part I'm having trouble with is casting the field to an array and reading it as such when appropriate, and then creating and populating the new columns based on the conditions mentioned above.
The string inactive_abilities column can be converted to an array by removing the [ and ] characters and splitting the string on ,.
from pyspark.sql import functions as F

data = [{"hero": "axe", "attribute": "strength", "active_abilities": ["q", "w", "r"],
         "inactive_abilities": "e"},
        {"hero": "invoker", "attribute": "intelligence", "active_abilities": ["q", "w", "e", "r", "f", "d"],
         "inactive_abilities": None},
        {"hero": "phantom assassin", "attribute": "agility", "active_abilities": ["q", "w", "e"],
         "inactive_abilities": "r"},
        {"hero": "life stealer", "attribute": "strength", "active_abilities": ["q", "r"],
         "inactive_abilities": "[w, e]"}]
df = spark.createDataFrame(data)

# inactive_ability is populated only when the parsed array holds exactly one value;
# inactive_ability1..4 are populated only when it holds more than one value, taking
# elements 0..3 of the parsed array (out-of-range indices simply yield null).
array_select_expr = [
    F.when(F.size("parsed_inactive_abilities") == 1,
           F.trim(F.col("parsed_inactive_abilities")[0])).alias("inactive_ability")
    if i == 0
    else F.when(F.size("parsed_inactive_abilities") > 1,
                F.trim(F.col("parsed_inactive_abilities")[i - 1])).alias(f"inactive_ability{i}")
    for i in range(0, 5)]

# Strip the brackets and split on ",". Rows where inactive_abilities is null stay
# null all the way through, so every new column ends up null for them.
(df.withColumn("parsed_inactive_abilities",
               F.split(F.regexp_replace(F.col("inactive_abilities"), r"[\[\]]", ""), ","))
 .select("*", *array_select_expr)
 .drop("parsed_inactive_abilities")
 .show())
Output
+------------------+------------+----------------+------------------+----------------+-----------------+-----------------+-----------------+-----------------+
|  active_abilities|   attribute|            hero|inactive_abilities|inactive_ability|inactive_ability1|inactive_ability2|inactive_ability3|inactive_ability4|
+------------------+------------+----------------+------------------+----------------+-----------------+-----------------+-----------------+-----------------+
|         [q, w, r]|    strength|             axe|                 e|               e|             null|             null|             null|             null|
|[q, w, e, r, f, d]|intelligence|         invoker|              null|            null|             null|             null|             null|             null|
|         [q, w, e]|     agility|phantom assassin|                 r|               r|             null|             null|             null|             null|
|            [q, r]|    strength|    life stealer|            [w, e]|            null|                w|                e|             null|             null|
+------------------+------------+----------------+------------------+----------------+-----------------+-----------------+-----------------+-----------------+
# Value in array_select_expr
[Column<'CASE WHEN (size(parsed_inactive_abilities) = 1) THEN trim(parsed_inactive_abilities[0]) END AS inactive_ability'>,
Column<'CASE WHEN (size(parsed_inactive_abilities) > 1) THEN trim(parsed_inactive_abilities[0]) END AS inactive_ability1'>,
Column<'CASE WHEN (size(parsed_inactive_abilities) > 1) THEN trim(parsed_inactive_abilities[1]) END AS inactive_ability2'>,
Column<'CASE WHEN (size(parsed_inactive_abilities) > 1) THEN trim(parsed_inactive_abilities[2]) END AS inactive_ability3'>,
Column<'CASE WHEN (size(parsed_inactive_abilities) > 1) THEN trim(parsed_inactive_abilities[3]) END AS inactive_ability4'>]
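As a side note, the conditional expression inside the list comprehension can be a little hard to read. The same list of column expressions can be built with a plain loop; this is just a sketch of an equivalent formulation of the expressions above, not a different approach:
from pyspark.sql import functions as F

parsed = F.col("parsed_inactive_abilities")

# Same expressions as array_select_expr above, built with an explicit loop.
cols = [F.when(F.size(parsed) == 1, F.trim(parsed[0])).alias("inactive_ability")]
for i in range(1, 5):
    cols.append(F.when(F.size(parsed) > 1, F.trim(parsed[i - 1])).alias(f"inactive_ability{i}"))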
Related
I have a DF with the following structure:
Col1                 Col2                 Col3
Data1Col1,Data2Col1  Data1Col2,Data2Col2  Data1Col3,Data2Col3
I want the resultant dataset to be of the following type:
Col1       Col2       Col3
Data1Col1  Data1Col2  Data1Col3
Data2Col1  Data2Col2  Data2Col3
Please suggest how to approach this. I have tried explode, but that results in duplicate rows.
val df = Seq(("C,D,E,F","M,N,O,P","K,P,B,P")).toDF("Col1","Col2","Col3")
df.show
+-------+-------+-------+
| Col1| Col2| Col3|
+-------+-------+-------+
|C,D,E,F|M,N,O,P|K,P,B,P|
+-------+-------+-------+
val res1 = df.withColumn("Col1",split(col("Col1"),",")).withColumn("Col2",split(col("Col2"),",")).withColumn("Col3",split(col("Col3"),","))
res1.show
+------------+------------+------------+
| Col1| Col2| Col3|
+------------+------------+------------+
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|
+------------+------------+------------+
// Zip the three arrays element-wise so explode yields one row per index, not a cross product.
val zip = udf((x: Seq[String], y: Seq[String], z: Seq[String]) => z.zip(x.zip(y)))
val res14 = res1.withColumn("test", explode(zip(col("Col1"), col("Col2"), col("Col3"))))
res14.show
+------------+------------+------------+-----------+
| Col1| Col2| Col3| test|
+------------+------------+------------+-----------+
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[K, [C, M]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[P, [D, N]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[B, [E, O]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[P, [F, P]]|
+------------+------------+------------+-----------+
res14.withColumn("t3",col("test._1")).withColumn("tn",col("test._2")).withColumn("t2",col("tn._2")).withColumn("t1",col("tn._1")).select("t1","t2","t3").show
+---+---+---+
| t1| t2| t3|
+---+---+---+
| C| M| K|
| D| N| P|
| E| O| B|
| F| P| P|
+---+---+---+
res1 - intermediate DataFrame with the split (array) columns
res14 - intermediate DataFrame with the zipped column
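If you are on Spark 2.4 or later, the same reshaping can be done without a UDF using the built-in arrays_zip function. Below is a PySpark sketch of that idea (the question is Scala, but the same functions exist in the Scala org.apache.spark.sql.functions API); it assumes a SparkSession named spark and is not part of the original answer.
from pyspark.sql import functions as F

df = spark.createDataFrame([("C,D,E,F", "M,N,O,P", "K,P,B,P")], ["Col1", "Col2", "Col3"])

# Split each column into an array, zip the arrays element-wise into an array of
# structs, explode to get one row per index, then unpack the struct fields.
arrays = df.select([F.split(c, ",").alias(c) for c in df.columns])
result = (arrays
          .select(F.explode(F.arrays_zip("Col1", "Col2", "Col3")).alias("z"))
          .select("z.*"))
result.show()
This avoids both the UDF and the nested pair-of-pairs unpacking in the last step.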
Say I have a dataframe, originalDF, which looks like this
+--------+--------------+
|data_id |data_list |
+--------+--------------+
| 3| [a, b, d] |
| 2|[c, a, b, e] |
| 1| [g] |
+--------+--------------+
And I have another dataframe, extraInfoDF, which looks like this:
+--------+--------------+
|data_id |data_list |
+--------+--------------+
| 3| [q, w, x, a] |
| 2|[r, q, l, p] |
| 1| [z, k, j, f] |
+--------+--------------+
For the two data_lists in originalDF that are shorter than 4, I want to add in data from the corresponding data_lists in extraInfoDF so that each list has a length of 4.
The resulting dataframe would look like:
+--------+--------------+
|data_id |data_list |
+--------+--------------+
| 3| [a, b, d, q] |
| 2|[c, a, b, e] |
| 1|[g, z, k, j] |
+--------+--------------+
I was trying to find some way to iterate through each row in the dataframe and append to the list that way but was having trouble. Now I'm wondering if there is a simpler way to accomplish this with a UDF?
You can append the second list to the first and take the left-most N elements in a UDF, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
def padList(n: Int) = udf{ (l1: Seq[String], l2: Seq[String]) =>
(l1 ++ l2).take(n)
}
val df1 = Seq(
(3, Seq("a", "b", "d")),
(2, Seq("c", "a", "b", "e")),
(1, Seq("g"))
).toDF("data_id", "data_list")
val df2 = Seq(
(3, Seq("q", "w", "x", "a")),
(2, Seq("r", "q", "l", "p")),
(1, Seq("z", "k", "j", "f"))
).toDF("data_id", "data_list")
df1.
join(df2, "data_id").
select($"data_id", padList(4)(df1("data_list"), df2("data_list")).as("data_list")).
show
// +-------+------------+
// |data_id| data_list|
// +-------+------------+
// | 3|[a, b, d, q]|
// | 2|[c, a, b, e]|
// | 1|[g, z, k, j]|
// +-------+------------+
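On Spark 2.4 or later, the same padding can be expressed without a UDF by concatenating the two arrays and slicing off the first four elements. The following is a PySpark sketch of that variant (the Scala API has the same concat and slice functions); df1 and df2 are assumed to be PySpark DataFrames with the shape shown above.
from pyspark.sql import functions as F

padded = (df1.alias("a")
          .join(df2.alias("b"), "data_id")
          .select("data_id",
                  # append the extra list, then keep only the first 4 elements
                  F.slice(F.concat("a.data_list", "b.data_list"), 1, 4).alias("data_list")))
padded.show()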
I have two data frames in PySpark: df1
+---+-----------------+
|id1| items1|
+---+-----------------+
| 0| [B, C, D, E]|
| 1| [E, A, C]|
| 2| [F, A, E, B]|
| 3| [E, G, A]|
| 4| [A, C, E, B, D]|
+---+-----------------+
and df2:
+---+-----------------+
|id2| items2|
+---+-----------------+
|001| [B]|
|002| [A]|
|003| [C]|
|004| [E]|
+---+-----------------+
I would like to create a new column in df1 that updates the values in the
items1 column so that it only keeps values that also appear (in any row of) the items2 column of df2. The result should look as follows:
+---+-----------------+----------------------+
|id1| items1| items1_updated|
+---+-----------------+----------------------+
| 0| [B, C, D, E]| [B, C, E]|
| 1| [E, A, C]| [E, A, C]|
| 2| [F, A, E, B]| [A, E, B]|
| 3| [E, G, A]| [E, A]|
| 4| [A, C, E, B, D]| [A, C, E, B]|
+---+-----------------+----------------------+
I would normally use collect() to get a list of all the values in the items2 column and then use a udf applied to each row of items1 to get the intersection. But the data is extremely large (over 10 million rows) and I cannot use collect() to get such a list. Is there a way to do this while keeping the data in a data frame format? Or some other way without using collect()?
The first thing you want to do is explode the values in df2.items2 so that the contents of the arrays are on separate rows:
from pyspark.sql.functions import explode
df2 = df2.select(explode("items2").alias("items2"))
df2.show()
#+------+
#|items2|
#+------+
#| B|
#| A|
#| C|
#| E|
#+------+
(This assumes that the values in df2.items2 are distinct; if not, you would need to add df2 = df2.distinct().)
Option 1: Use crossJoin:
Now you can crossJoin the new df2 back to df1 and keep only the rows where df1.items1 contains an element of df2.items2. We can achieve this using pyspark.sql.functions.array_contains inside expr, which lets us pass the value of another column as the second argument.
After filtering, group by id1 and items1 and aggregate using pyspark.sql.functions.collect_list.
from pyspark.sql.functions import expr, collect_list
df1.alias("l").crossJoin(df2.alias("r"))\
.where(expr("array_contains(l.items1, r.items2)"))\
.groupBy("l.id1", "l.items1")\
.agg(collect_list("r.items2").alias("items1_updated"))\
.show()
#+---+---------------+--------------+
#|id1| items1|items1_updated|
#+---+---------------+--------------+
#| 1| [E, A, C]| [A, C, E]|
#| 0| [B, C, D, E]| [B, C, E]|
#| 4|[A, C, E, B, D]| [B, A, C, E]|
#| 3| [E, G, A]| [A, E]|
#| 2| [F, A, E, B]| [B, A, E]|
#+---+---------------+--------------+
Option 2: Explode df1.items1 and left join:
Another option is to explode the contents of items1 in df1 and do a left join. After the join, we have to do a similar group by and aggregation as above. This works because collect_list ignores the null values introduced by the non-matching rows.
df1.withColumn("items1", explode("items1")).alias("l")\
.join(df2.alias("r"), on=expr("l.items1=r.items2"), how="left")\
.groupBy("l.id1")\
.agg(
collect_list("l.items1").alias("items1"),
collect_list("r.items2").alias("items1_updated")
).show()
#+---+---------------+--------------+
#|id1| items1|items1_updated|
#+---+---------------+--------------+
#| 0| [E, B, D, C]| [E, B, C]|
#| 1| [E, C, A]| [E, C, A]|
#| 3| [E, A, G]| [E, A]|
#| 2| [F, E, B, A]| [E, B, A]|
#| 4|[E, B, D, C, A]| [E, B, C, A]|
#+---+---------------+--------------+
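If you are on Spark 2.4 or later, another collect()-free option is to collapse df2 into a single row holding all of its distinct items, cross join that one-row DataFrame to df1, and let array_intersect keep only the allowed values. This is a sketch of that idea rather than part of the original answer; array_intersect should also preserve the ordering of items1, which matches the desired items1_updated.
from pyspark.sql import functions as F

# One row containing every distinct item that appears in df2.items2.
allowed = df2.select(F.explode("items2").alias("item")) \
             .agg(F.collect_set("item").alias("allowed"))

df1.crossJoin(allowed) \
   .withColumn("items1_updated", F.array_intersect("items1", "allowed")) \
   .drop("allowed") \
   .show()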
I can't figure out how to get the names of the columns in which every value is null.
For example,
case class A(name: String, id: String, email: String, company: String)
val e1 = A("n1", null, "n1#c1.com", null)
val e2 = A("n2", null, "n2#c1.com", null)
val e3 = A("n3", null, "n3#c1.com", null)
val e4 = A("n4", null, "n4#c2.com", null)
val e5 = A("n5", null, "n5#c2.com", null)
val e6 = A("n6", null, "n6#c2.com", null)
val e7 = A("n7", null, "n7#c3.com", null)
val e8 = A("n8", null, "n8#c3.com", null)
val As = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(As).toDF
This code makes a DataFrame like this:
+----+----+---------+-------+
|name| id| email|company|
+----+----+---------+-------+
| n1|null|n1#c1.com| null|
| n2|null|n2#c1.com| null|
| n3|null|n3#c1.com| null|
| n4|null|n4#c2.com| null|
| n5|null|n5#c2.com| null|
| n6|null|n6#c2.com| null|
| n7|null|n7#c3.com| null|
| n8|null|n8#c3.com| null|
+----+----+---------+-------+
and I want to get the column names for which all of the rows are null: id, company.
I don't care about the type of the output - Array, String, RDD, whatever.
You can do a simple count on all your columns, then using the indices of the columns that return a count of 0, you subset df.columns:
import org.apache.spark.sql.functions.{count,col}
// Get column indices
val col_inds = df.select(df.columns.map(c => count(col(c)).alias(c)): _*)
.collect()(0)
.toSeq.zipWithIndex
.filter(_._1 == 0).map(_._2)
// Subset column names using the indices
col_inds.map(i => df.columns.apply(i))
//Seq[String] = ArrayBuffer(id, company)
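For reference, the same one-pass idea as the snippet above, written as a rough PySpark sketch against a DataFrame df shaped like the example: count() skips nulls, so any column whose non-null count is zero contains nothing but nulls.
from pyspark.sql import functions as F

counts = df.select([F.count(c).alias(c) for c in df.columns]).first().asDict()
all_null_cols = [name for name, cnt in counts.items() if cnt == 0]
# e.g. ['id', 'company'] for the example data above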
An alternative solution could be as follows (but I am afraid the performance might not be satisfactory).
val ids = Seq(
("1", null: String),
("1", null: String),
("10", null: String)
).toDF("id", "all_nulls")
scala> ids.show
+---+---------+
| id|all_nulls|
+---+---------+
| 1| null|
| 1| null|
| 10| null|
+---+---------+
val s = ids.columns.
map { c =>
(c, ids.select(c).dropDuplicates(c).na.drop.count) }. // <-- performance here!
collect { case (c, cnt) if cnt == 0 => c }
scala> s.foreach(println)
all_nulls
I have been using the Stanford CoreNLP wrapper for Apache Spark to do NER analysis and found it works well. However, I want to extend the simple example so that I can map the analysis back to an original dataframe id. See below; I have added two more rows to the simple example.
val input = Seq(
(1, "<xml>Apple is located in California. It is a great company.</xml>"),
(2, "<xml>Google is located in California. It is a great company.</xml>"),
(3, "<xml>Netflix is located in California. It is a great company.</xml>")
).toDF("id", "text")
input.show()
input: org.apache.spark.sql.DataFrame = [id: int, text: string]
+---+--------------------+
| id| text|
+---+--------------------+
| 1|<xml>Apple is loc...|
| 2|<xml>Google is lo...|
| 3|<xml>Netflix is l...|
+---+--------------------+
I can then run this dataframe through the Spark CoreNLP wrapper to do both sentiment and NER analysis.
val output = input
.select(cleanxml('text).as('doc))
.select(explode(ssplit('doc)).as('sen))
.select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))
However, in the output below I have lost the connection back to the original dataframe row ids.
+--------------------+--------------------+--------------------+---------+
| sen| words| nerTags|sentiment|
+--------------------+--------------------+--------------------+---------+
|Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...| 2|
|It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
|Google is located...|[Google, is, loca...|[ORGANIZATION, O,...| 3|
|It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
|Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...| 3|
|It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
+--------------------+--------------------+--------------------+---------+
Ideally, I want something like the following:
+--+---------------------+--------------------+--------------------+---------+
|id| sen| words| nerTags|sentiment|
+--+---------------------+--------------------+--------------------+---------+
| 1| Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...| 2|
| 1| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
| 2| Google is located...|[Google, is, loca...|[ORGANIZATION, O,...| 3|
| 2| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
| 3| Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...| 3|
| 3| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
+--+---------------------+--------------------+--------------------+---------+
I have tried to create a UDF but am unable to make it work.
Using the UDFs defined in the Stanford CoreNLP wrapper for Apache Spark, you can use the following code to produce the desired output:
val output = input
  .withColumn("doc", cleanxml('text))
  .withColumn("sen", explode(ssplit('doc)))
  .withColumn("words", tokenize('sen))
  .withColumn("nerTags", ner('sen))
  .withColumn("sentiment", sentiment('sen))
  .drop("text")
  .drop("doc")
output.show()
This will produce the following DataFrame:
+--+---------------------+--------------------+--------------------+---------+
|id| sen| words| nerTags|sentiment|
+--+---------------------+--------------------+--------------------+---------+
| 1| Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...| 2|
| 1| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
| 2| Google is located...|[Google, is, loca...|[ORGANIZATION, O,...| 3|
| 2| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
| 3| Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...| 3|
| 3| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
+--+---------------------+--------------------+--------------------+---------+