Convert Array to Columns and replace values - Scala

I have data of following format:
+-----+---------------+
| name| Data|
+-----+---------------+
|Alpha| [A, B, C]|
| Beta| [A, B, C, D]|
|Gamma|[A, B, C, D, E]|
+-----+---------------+
How can I transform it into the following?
+-----+----+-----+-----+-----+-----+
| name| A| B| C| D| E|
+-----+----+-----+-----+-----+-----+
|Alpha| 1| 1| 1| 0| 0|
| Beta| 1| 1| 1| 1| 0|
|Gamma| 1| 1| 1| 1| 1|
+-----+----+-----+-----+-----+-----+
Thanks to @Jarrod Baker for help with a similar transformation earlier.
Here is the code that I have:
val df = Seq(
  ("Alpha", Array("A", "B", "C")),
  ("Beta", Array("A", "B", "C", "D")),
  ("Gamma", Array("A", "B", "C", "D", "E"))
).toDF("name", "Data")
df.show()

val arrayDataSize = df
  .withColumn("arr_size", size(col("Data")))
  .agg(max("arr_size") as "maxSize")

val newDF = df.select(
  ($"name") +: (0 until arrayDataSize.first.getInt(0)).map(i => ($"Data")(i).contains("A").alias("A")): _*
)
newDF.show()
+-----+----+-----+-----+-----+-----+
| name| A| A| A| A| A|
+-----+----+-----+-----+-----+-----+
|Alpha|true|false|false| null| null|
| Beta|true|false|false|false| null|
|Gamma|true|false|false|false|false|
+-----+----+-----+-----+-----+-----+
Thanks in advance for your help.

You can use the RelationalGroupedDataset's pivot method to achieve what you want. To create a RelationalGroupedDataset, you call groupBy on a Dataset.
It would look something like this:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("Alpha", Seq("A", "B", "C")),
  ("Beta", Seq("A", "B", "C", "D")),
  ("Gamma", Seq("A", "B", "C", "D", "E"))
).toDF("name", "Data")

val output = df
  .select(df("name"), explode(col("Data")).alias("Data"))
  .groupBy("name")
  .pivot("Data")
  .count()

output.show()
+-----+---+---+---+----+----+
| name| A| B| C| D| E|
+-----+---+---+---+----+----+
| Beta| 1| 1| 1| 1|null|
|Gamma| 1| 1| 1| 1| 1|
|Alpha| 1| 1| 1|null|null|
+-----+---+---+---+----+----+
As you can see, we're first explode-ing our Sequences into separate rows. This allows us to treat each element in each sequence as a separate "entity".
Then, we're using groupBy to get our RelationalGroupedDataset, after which we pivot and count the occurrences.
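If you want 0 instead of null for the missing letters, as in your desired output, you can fill the pivoted counts afterwards. A minimal sketch building on the output Dataset above:

// Replace the nulls produced by the pivot with 0 to get the 0/1 matrix from the question.
val filled = output.na.fill(0)
filled.show()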

Related

Spark: Transform array to Column with size of Array using Map iterable

I have following data
df.show
+-------+---+-----------+-----------+
|   name|age|     tokens|tokensCount|
+-------+---+-----------+-----------+
|  Alice| 29|    [A,B,C]|          3|
|    Bob| 28|  [A,B,C,D]|          4|
|Charlie| 29|[A,B,C,D,E]|          5|
+-------+---+-----------+-----------+
I transform the data with the following command:
val newDF = df.select(($"name") +: (0 until 4).map(i => ($"tokens")(i).alias(s"token$i")): _*)
newDF.show()
+---------+-------+-------+-------+-------+
| name| token0| token1| token2| token3|
+---------+-------+-------+-------+-------+
| Alice| A| B| C| null|
| Bob| A| B| C| D|
| Charlie| A| B| C| D|
+---------+-------+-------+-------+-------+
I want to use tokensCount instead of the static value 4 in (0 until 4).
I tried a few things like $"tokensCount" and size($"tokens"), but could not get it to work.
Can anyone suggest how to loop or map according to the size of the array or the token count?
Many thanks.
You can modify your code to find the maximum length of tokens, and then use that to create the necessary columns:
val df = Seq(
  ("Alice", 29, Array("A", "B", "C")),
  ("Bob", 28, Array("A", "B", "C", "D")),
  ("Charlie", 29, Array("A", "B", "C", "D", "E"))
).toDF("name", "age", "tokens")

val maxTokenCount = df
  .withColumn("token_count", size(col("tokens")))
  .agg(max("token_count") as "mtc")

val newDF = df.select(($"name") +: (0 until maxTokenCount.first.getInt(0)).map(i => ($"tokens")(i).alias(s"token$i")): _*)
newDF.show()
Which will give you:
+-------+------+------+------+------+------+
| name|token0|token1|token2|token3|token4|
+-------+------+------+------+------+------+
| Alice| A| B| C| null| null|
| Bob| A| B| C| D| null|
|Charlie| A| B| C| D| E|
+-------+------+------+------+------+------+
It might be useful to explain why you want to do this transformation, as there might be a much more efficient way. This has the potential to create a very sparse dataframe. Imagine that most names have no tokens, but Bob has 100 tokens: all of a sudden you have one hundred columns of mostly null values.
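For comparison, if the wide layout is not a hard requirement, a long (exploded) layout avoids the sparsity entirely. A minimal sketch using the same df, assuming the usual org.apache.spark.sql.functions._ and spark.implicits._ imports are in scope:

// One row per (name, token) pair instead of one column per token position.
// posexplode keeps the position, so token0/token1/... can still be recovered if needed.
val longDF = df.select($"name", $"age", posexplode($"tokens").as(Seq("token_pos", "token")))
longDF.show()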

Fill blank rows in a column with a non-blank value above it in Spark

I have an input file having around 8.5+ Million records.
My requirement is to fill the empty row values in a column with the immediate non-blank value ABOVE it. Have a look at the example:
+-----+-----+---+------+
|FName|LName|Age|Gender|
+-----+-----+---+------+
| A| B| 29| M|
| A| C| 12| |
| B| D| 35| |
| Q| D| 85| F|
| W| R| 14| |
+-----+-----+---+------+
Desired Output:
+-----+-----+---+------+
|FName|LName|Age|Gender|
+-----+-----+---+------+
| A| B| 29| M|
| A| C| 12| M|
| B| D| 35| M|
| Q| D| 85| F|
| W| R| 14| F|
+-----+-----+---+------+
An increment column can be added, and the last function with ignoreNulls can be used over a window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{last, monotonically_increasing_id, when}

val idWindow = Window.orderBy($"id")
df
  .withColumn("id", monotonically_increasing_id())
  .withColumn("Gender",
    last(when($"Gender" === "", null).otherwise($"Gender"), ignoreNulls = true)
      .over(idWindow))
  .drop("id")
Alternatively: add a rowId column, mark the blank rows as M or F depending on whether rowId is even or odd, save the result to Gender_temp, and drop the unused columns:
import org.apache.spark.sql.functions._

object DataframeFill {
  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    import spark.implicits._

    val personDF = Seq(
      ("A", "B", 29, "M"),
      ("A", "C", 12, ""),
      ("B", "D", 35, "F"),
      ("Q", "D", 85, ""),
      ("W", "R", 14, "")
    ).toDF("FName", "LName", "Age", "Gender")
    personDF.show()

    personDF
      .withColumn("rowId", monotonically_increasing_id())
      .withColumn("Gender_temp",
        when($"Gender".isin(""), when($"rowId" % 2 === 0, "M").otherwise("F"))
          .otherwise($"Gender"))
      .drop("Gender")
      .drop("rowId")
      .withColumnRenamed("Gender_temp", "Gender")
      .show()
  }
}

Apache Spark - Scala API - Aggregate on sequentially increasing key

I have a data frame that looks something like this:
val df = sc.parallelize(Seq(
  (3,1,"A"),(3,2,"B"),(3,3,"C"),
  (2,1,"D"),(2,2,"E"),
  (3,1,"F"),(3,2,"G"),(3,3,"G"),
  (2,1,"X"),(2,2,"X")
)).toDF("TotalN", "N", "String")
+------+---+------+
|TotalN| N|String|
+------+---+------+
| 3| 1| A|
| 3| 2| B|
| 3| 3| C|
| 2| 1| D|
| 2| 2| E|
| 3| 1| F|
| 3| 2| G|
| 3| 3| G|
| 2| 1| X|
| 2| 2| X|
+------+---+------+
I need to aggregate the strings by concatenating them together based on the TotalN and the sequentially increasing ID (N). The problem is there is not a unique ID for each aggregation I can group by. So, I need to do something like "for each row look at the TotalN, loop through the next N rows and concatenate, then reset".
+------+------+
|TotalN|String|
+------+------+
| 3| ABC|
| 2| DE|
| 3| FGG|
| 2| XX|
+------+------+
Any pointers much appreciated.
Using Spark 2.3.1 and the Scala Api.
Try this:
val df = spark.sparkContext.parallelize(Seq(
  (3, 1, "A"), (3, 2, "B"), (3, 3, "C"),
  (2, 1, "D"), (2, 2, "E"),
  (3, 1, "F"), (3, 2, "G"), (3, 3, "G"),
  (2, 1, "X"), (2, 2, "X")
)).toDF("TotalN", "N", "String")

df.createOrReplaceTempView("data")

val sqlDF = spark.sql(
  """
    | SELECT TotalN, N, String, ROW_NUMBER() OVER (ORDER BY TotalN) AS rowNum
    | FROM data
  """.stripMargin)

sqlDF.withColumn("key", $"N" - $"rowNum")
  .groupBy("key")
  .agg(collect_list('String).as("texts"))
  .show()
Solution is to calculate a grouping variable using the row_number function which can be used in later groupBy.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.orderBy("TotalN")
df.withColumn("GeneratedID", $"N" - row_number.over(w)).show
+------+---+------+-----------+
|TotalN| N|String|GeneratedID|
+------+---+------+-----------+
| 2| 1| D| 0|
| 2| 2| E| 0|
| 2| 1| X| -2|
| 2| 2| X| -2|
| 3| 1| A| -4|
| 3| 2| B| -4|
| 3| 3| C| -4|
| 3| 1| F| -7|
| 3| 2| G| -7|
| 3| 3| G| -7|
+------+---+------+-----------+
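To go from GeneratedID to the concatenated strings in the desired output, one more grouped aggregation is needed. A minimal sketch (not part of the original answer) that keeps the concatenation order deterministic by sorting the collected (N, String) pairs before joining them, assuming your Spark version's sort_array accepts struct elements:

import org.apache.spark.sql.functions._

// Collect (N, String) pairs per group, sort them by N, then concatenate the String field.
df.withColumn("GeneratedID", $"N" - row_number.over(w))
  .groupBy("GeneratedID", "TotalN")
  .agg(sort_array(collect_list(struct($"N", $"String"))).as("pairs"))
  .select($"TotalN", concat_ws("", $"pairs.String").as("String"))
  .show()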

Aggregation on a derived column in Spark

DF.groupBy("id")
.agg(
sum((when(upper($"col_name") === "text", 1)
.otherwise(0)))
.alias("df_count")
.when($"df_count"> 1, 1)
.otherwise(0)
)
Can I do an aggregation on the column that was given the alias? I.e., if the sum is greater than one then return 1, else 0.
Thanks in advance.
I think you could wrap another when.otherwise around the sum result:
val df = Seq((1, "a"), (1, "a"), (2, "b"), (3, "a")).toDF("id", "col_name")
df.show
+---+--------+
| id|col_name|
+---+--------+
| 1| a|
| 1| a|
| 2| b|
| 3| a|
+---+--------+
df.groupBy("id").agg(
sum(when(upper($"col_name") === "A", 1).otherwise(0)).alias("df_count")
).show()
+---+--------+
| id|df_count|
+---+--------+
| 1| 2|
| 3| 1|
| 2| 0|
+---+--------+
df.groupBy("id").agg(
when(sum(when(upper($"col_name")==="A", 1).otherwise(0)) > 1, 1).otherwise(0).alias("df_count")
).show()
+---+--------+
| id|df_count|
+---+--------+
| 1| 1|
| 3| 0|
| 2| 0|
+---+--------+
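As a side note, an equivalent and slightly shorter variant (a sketch, not from the answer above) casts the boolean comparison directly to an integer instead of wrapping it in another when/otherwise:

// The true/false result of the "> 1" comparison becomes 1/0 after the cast.
df.groupBy("id").agg(
  (sum(when(upper($"col_name") === "A", 1).otherwise(0)) > 1).cast("int").alias("df_count")
).show()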

E-num / get Dummies in pyspark

I would like to create a function in PySpark that takes a DataFrame and a list of parameters (codes/categorical features) and returns the DataFrame with additional dummy columns for the categories of the features in the list.
Please find attached the before-and-after DataFrames:
[Before and after DataFrame example]
The code in Python looks like this:
enum = ['column1', 'column2']
for e in enum:
    print(e)
    temp = pd.get_dummies(data[e], drop_first=True, prefix=e)
    data = pd.concat([data, temp], axis=1)
    data.drop(e, axis=1, inplace=True)
data.to_csv('enum_data.csv')
First you need to collect the distinct values of TYPE and CODE. Then either add a column named after each value using withColumn, or build an expression per value and use select.
Here is sample code using the select approach:
import pyspark.sql.functions as F
df = sqlContext.createDataFrame([
(1, "A", "X1"),
(2, "B", "X2"),
(3, "B", "X3"),
(1, "B", "X3"),
(2, "C", "X2"),
(3, "C", "X2"),
(1, "C", "X1"),
(1, "B", "X1"),
], ["ID", "TYPE", "CODE"])
types = df.select("TYPE").distinct().rdd.flatMap(lambda x: x).collect()
codes = df.select("CODE").distinct().rdd.flatMap(lambda x: x).collect()
types_expr = [F.when(F.col("TYPE") == ty, 1).otherwise(0).alias("e_TYPE_" + ty) for ty in types]
codes_expr = [F.when(F.col("CODE") == code, 1).otherwise(0).alias("e_CODE_" + code) for code in codes]
df = df.select("ID", "TYPE", "CODE", *types_expr+codes_expr)
df.show()
OUTPUT
+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
| 1| A| X1| 1| 0| 0| 1| 0| 0|
| 2| B| X2| 0| 1| 0| 0| 1| 0|
| 3| B| X3| 0| 1| 0| 0| 0| 1|
| 1| B| X3| 0| 1| 0| 0| 0| 1|
| 2| C| X2| 0| 0| 1| 0| 1| 0|
| 3| C| X2| 0| 0| 1| 0| 1| 0|
| 1| C| X1| 0| 0| 1| 1| 0| 0|
| 1| B| X1| 0| 1| 0| 1| 0| 0|
+---+----+----+--------+--------+--------+---------+---------+---------+
The solutions provided by Freek Wiemkeijer and Rakesh Kumar are perfectly adequate; however, since I had already coded it up, I thought it was worth posting this generic solution, as it doesn't require hard-coding the column names.
pivot_cols = ['TYPE','CODE']
keys = ['ID','TYPE','CODE']
before = sc.parallelize([(1,'A','X1'),
                         (2,'B','X2'),
                         (3,'B','X3'),
                         (1,'B','X3'),
                         (2,'C','X2'),
                         (3,'C','X2'),
                         (1,'C','X1'),
                         (1,'B','X1')]).toDF(['ID','TYPE','CODE'])

# Helper function to recursively join a list of dataframes
# Can be simplified if you only need two columns
def join_all(dfs, keys):
    if len(dfs) > 1:
        return dfs[0].join(join_all(dfs[1:], keys), on=keys, how='inner')
    else:
        return dfs[0]

combined = []
for pivot_col in pivot_cols:
    pivotDF = before.groupBy(keys).pivot(pivot_col).count()
    new_names = pivotDF.columns[:len(keys)] + ["e_{0}_{1}".format(pivot_col, c) for c in pivotDF.columns[len(keys):]]
    df = pivotDF.toDF(*new_names).fillna(0)
    combined.append(df)

join_all(combined, keys).show()
This gives as output:
+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
| 1| A| X1| 1| 0| 0| 1| 0| 0|
| 2| C| X2| 0| 0| 1| 0| 1| 0|
| 3| B| X3| 0| 1| 0| 0| 0| 1|
| 2| B| X2| 0| 1| 0| 0| 1| 0|
| 3| C| X2| 0| 0| 1| 0| 1| 0|
| 1| B| X3| 0| 1| 0| 0| 0| 1|
| 1| B| X1| 0| 1| 0| 1| 0| 0|
| 1| C| X1| 0| 0| 1| 1| 0| 0|
+---+----+----+--------+--------+--------+---------+---------+---------+
I was looking for the same solution but in Scala; maybe this will help someone:
val list = df.select("category").distinct().rdd.map(r => r(0)).collect()
val oneHotDf = list.foldLeft(df)((acc, category) => acc.withColumn("category_" + category, when(col("category") === category, 1).otherwise(0)))
If you'd like to get the PySpark version of the pandas "pd.get_dummies" function, you can use the following function:
import itertools
import pyspark.sql.functions as F

def spark_get_dummies(df):
    categories = []
    for i, values in enumerate(df.columns):
        categories.append(df.select(values).distinct().rdd.flatMap(lambda x: x).collect())

    expressions = []
    for i, values in enumerate(df.columns):
        expressions.append([F.when(F.col(values) == cat, 1).otherwise(0).alias(str(values) + "_" + str(cat)) for cat in categories[i]])

    expressions_flat = list(itertools.chain.from_iterable(expressions))
    df_final = df.select(*expressions_flat)

    return df_final
The reproducible example is:
df = sqlContext.createDataFrame([
("A", "X1"),
("B", "X2"),
("B", "X3"),
("B", "X3"),
("C", "X2"),
("C", "X2"),
("C", "X1"),
("B", "X1"),
], ["TYPE", "CODE"])
dummies_df = spark_get_dummies(df)
dummies_df.show()
You will get a DataFrame with one 0/1 indicator column per distinct value, named TYPE_A, TYPE_B, TYPE_C, CODE_X1, CODE_X2 and CODE_X3.
The first step is to make a DataFrame from your CSV file.
See Get CSV to Spark dataframe; the first answer gives a line-by-line example.
Then you can add the columns. Assume you have a DataFrame object called df, and the columns are: [ID, TYPE, CODE].
The rest can be fixed with DataFrame.withColumn() and pyspark.sql.functions.when:
from pyspark.sql.functions import when

df_with_extra_columns = df.withColumn("e_TYPE_A", when(df.TYPE == "A", 1).otherwise(0)) \
    .withColumn("e_TYPE_B", when(df.TYPE == "B", 1).otherwise(0))
(This adds the first two columns; you get the point.)