I need to dynamically select columns from one of the two tables I am joining. The name of one of the columns to be selected is passed to a variable. Here are the details.
The table names are passed in as variables, as are the join_id and the join_type.
// Create Scala variables for each table
var table_name_a = dbutils.widgets.get("table_name_a")
var table_name_b = dbutils.widgets.get("table_name_b")
// Create Scala variable for the join id
var join_id = dbutils.widgets.get("table_name_b") + "Id"
// Define join type
var join_type = dbutils.widgets.get("join_type")
Then I join the tables. I want to select all columns from table A and only two columns from table B: one is always called "Description", regardless of which table B is passed in; the other has the same name as table B itself, e.g., if table B's name is Employee, I want to select a column named "Employee" from table B. The code below selects all columns from table A and the Description column from table B (aliased), but I still need to select the column from table B that has the same name as the table. I don't know in advance how many columns table B has, their names, or their order, since table B is passed as a parameter.
// Joining Tables
var df_joined_tables = df_a
  .join(df_b,
    df_a(join_id) === df_b(join_id),
    join_type
  ).select($"df_a.*", $"df_b.Description".alias(table_name_b + " Description"))
My question is: How do I pass the variable table_name_b as a column I am trying to select from table B?
I tried the code below, which is obviously wrong, because in $"df_b.table_name_b" the table_name_b is supposed to be the content of the variable, not the literal name of the column.
var df_joined_tables = df_a
  .join(df_b,
    df_a(join_id) === df_b(join_id),
    join_type
  ).select($"df_a.*", $"df_b.Description".alias(table_name_b + " Description"), $"df_b.table_name_b")
Then I tried the code below and it gives the error: "value table_name_b is not a member of org.apache.spark.sql.DataFrame"
var df_joined_tables = df_a
  .join(df_b,
    df_a(join_id) === df_b(join_id),
    join_type
  ).select($"df_a.*", $"df_b.Description".alias(table_name_b + " Description"), df_b.table_name_b)
How do I pass the variable table_name_b as a column I need to select from table B?
You can build a List[org.apache.spark.sql.Column] and pass it to the select function, as in the example below:
import org.apache.spark.sql.functions.col
import spark.implicits._ // for toDF on a local Seq (already in scope in spark-shell / Databricks notebooks)

// sample input:
val df = Seq(
("A", 1, 6, 7),
("B", 2, 7, 6),
("C", 3, 8, 5),
("D", 4, 9, 4),
("E", 5, 8, 3)
).toDF("name", "col1", "col2", "col3")
df.printSchema()
val columnNames = List("col1", "col2") // string column names from your params
val columnsToSelect = columnNames.map(col(_)) // convert the required column names from string to column type
df.select(columnsToSelect: _*).show() // using the list of columns
// output:
+----+----+
|col1|col2|
+----+----+
| 1| 6|
| 2| 7|
| 3| 8|
| 4| 9|
| 5| 8|
+----+----+
The same approach can be applied to the select on a join.
Update
Adding another example:
val aliasTableA = "tableA"
val aliasTableB = "tableB"
val joinField = "name"
val df1 = Seq(
("A", 1, 6, 7),
("B", 2, 7, 6),
("C", 3, 8, 5),
("D", 4, 9, 4),
("E", 5, 8, 3)
).toDF("name", "col1", "col2", "col3")
val df2 = Seq(
("A", 11, 61, 71),
("B", 21, 71, 61),
("C", 31, 81, 51)
).toDF("name", "col_1", "col_2", "col_3")
df1.alias(aliasTableA)
.join(df2.alias(aliasTableB), Seq(joinField))
.selectExpr(s"${aliasTableA}.*", s"${aliasTableB}.col_1", s"${aliasTableB}.col_2").show()
// output:
+----+----+----+----+-----+-----+
|name|col1|col2|col3|col_1|col_2|
+----+----+----+----+-----+-----+
| A| 1| 6| 7| 11| 61|
| B| 2| 7| 6| 21| 71|
| C| 3| 8| 5| 31| 81|
+----+----+----+----+-----+-----+
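Putting this back into your scenario, here is a minimal sketch, assuming df_a and df_b are the DataFrames already loaded for table_name_a and table_name_b, and that table B really does contain a Description column plus a column named after the table:
import org.apache.spark.sql.functions.col

// Alias both sides so the qualified references below are unambiguous after the join.
val df_joined_tables = df_a.alias("df_a")
  .join(df_b.alias("df_b"), col("df_a." + join_id) === col("df_b." + join_id), join_type)
  .select(
    col("df_a.*"),
    col("df_b.Description").alias(table_name_b + " Description"),
    // table_name_b is just a String, so build the Column from it with col()
    col("df_b." + table_name_b).alias(table_name_b)
  )
The key point is that col() (like df_b(...)) accepts a plain String, so a column name held in a variable can be turned into a Column at runtime.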
I have a PySpark dataframe in which each row of the column 'TAGID_LIST' is a set of numbers such as {426,427,428,430,432,433,434,437,439,447,448,450,453,460,469,469,469,469}, but I only want to keep the maximum number in each set (469 for this row). I tried to create a new column with:
wechat_userinfo.withColumn('TAG', f.when(wechat_userinfo['TAGID_LIST'] != 'null', max(wechat_userinfo['TAGID_LIST'])).otherwise('null'))
but got TypeError: Column is not iterable.
How do I correct it?
If the column for which you want to retrieve the max value is an array, you can use the array_max function:
import pyspark.sql.functions as F
new_df = wechat_userinfo.withColumn("TAG", F.array_max(F.col("TAGID_LIST")))
To illustrate with an example,
df = spark.createDataFrame( [(1, [1, 772, 3, 4]), (2, [5, 6, 44, 8, 9])], ('a','d'))
df2 = df.withColumn("maxd", F.array_max(F.col("d")))
df2.show()
+---+----------------+----+
| a| d|maxd|
+---+----------------+----+
| 1| [1, 772, 3, 4]| 772|
| 2|[5, 6, 44, 8, 9]| 44|
+---+----------------+----+
In your particular case, the column in question is not an array of numbers but a string, formatted as comma-separated numbers surrounded by { and }. What I'd suggest is turning the string into an array and then operating on that array as described above. You can use the regexp_replace function to strip the braces and then split() the comma-separated string into an array. It would look like this:
df = spark.createDataFrame([(1, "{1,2,3,4}"), (2, "{5,6,7,8}")], ('a', 'd'))
df2 = (df
       .withColumn("as_str", F.regexp_replace(F.col("d"), r'^\{|\}$', ''))  # strip the surrounding braces
       .withColumn("as_arr", F.split(F.col("as_str"), ",").cast("array<long>"))
       .withColumn("maxd", F.array_max(F.col("as_arr")))
       .drop("as_str"))
df2.show()
+---+---------+------------+----+
| a| d| as_arr|maxd|
+---+---------+------------+----+
| 1|{1,2,3,4}|[1, 2, 3, 4]| 4|
| 2|{5,6,7,8}|[5, 6, 7, 8]| 8|
+---+---------+------------+----+
As the title states, I would like to subtract the mean of a specific column from each value in that column.
Here is my code attempt:
val test = moviePairs.agg(avg(col("rating1")).alias("avgX"), avg(col("rating2")).alias("avgY"))
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - test.select("avgX").collect())
.withColumn("meanDeltaY", col("rating2") - test.select("avgY").collect())
subMean.show()
You can use either Spark's DataFrame functions or a plain SQL query against the DataFrame to compute the means of the columns you are focusing on (rating1, rating2).
import org.apache.spark.sql.functions.{avg, col}

val moviePairs = spark.createDataFrame(
Seq(
("Moonlight", 7, 8),
("Lord Of The Drinks", 10, 1),
("The Disaster Artist", 3, 5),
("Airplane!", 7, 9),
("2001", 5, 1),
)
).toDF("movie", "rating1", "rating2")
// find the means for each column and isolate the first (and only) row to get their values
val means = moviePairs.agg(avg("rating1"), avg("rating2")).head()
// alternatively, by using a simple SQL query:
// moviePairs.createOrReplaceTempView("movies")
// val means = spark.sql("select AVG(rating1), AVG(rating2) from movies").head()
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - means.getDouble(0))
.withColumn("meanDeltaY", col("rating2") - means.getDouble(1))
subMean.show()
Output for the test input DataFrame moviePairs (with the usual double-precision noise, which you can tidy up, e.g. by rounding, as sketched after the output):
+-------------------+-------+-------+-------------------+-------------------+
| movie|rating1|rating2| meanDeltaX| meanDeltaY|
+-------------------+-------+-------+-------------------+-------------------+
| Moonlight| 7| 8| 0.5999999999999996| 3.2|
| Lord Of The Drinks| 10| 1| 3.5999999999999996| -3.8|
|The Disaster Artist| 3| 5|-3.4000000000000004|0.20000000000000018|
| Airplane!| 7| 9| 0.5999999999999996| 4.2|
| 2001| 5| 1|-1.4000000000000004| -3.8|
+-------------------+-------+-------+-------------------+-------------------+
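If the long decimals are a nuisance, one simple option (purely cosmetic, it does not change how the doubles are stored) is to round the derived columns, for example to two decimal places:
import org.apache.spark.sql.functions.{col, round}

val subMeanRounded = subMean
  .withColumn("meanDeltaX", round(col("meanDeltaX"), 2)) // keep two decimal places
  .withColumn("meanDeltaY", round(col("meanDeltaY"), 2))
subMeanRounded.show()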
I'm trying to create an array of struct(col, col) in a Spark DataFrame but am getting an error.
I'm using sample data here to reproduce the same error.
Dataframe
val df = Seq((1, "One", "uno", true), (2, "Two", "Dos", true), (3, "Three", "Tres", false)).toDF("number", "English", "Spanish", "include_spanish")
scala> df.show
+------+-------+-------+---------------+
|number|English|Spanish|include_spanish|
+------+-------+-------+---------------+
| 1| One| uno| true|
| 2| Two| Dos| true|
| 3| Three| Tres| false|
+------+-------+-------+---------------+
Now, here I'm trying to create structs out of the existing columns and then build an array out of them.
val df1 = df
  .withColumn("numberToEnglish", struct(col("number"), col("English")))
  .withColumn("numberToSpanish", struct("number", "Spanish"))
  .withColumn("numberToLanguage",
    when(col("include_spanish") === true, array("numberToEnglish", "numberToSpanish"))
      .otherwise(array("numberToEnglish")))
I get the error below:
org.apache.spark.sql.AnalysisException: cannot resolve 'array(`numberToEnglish`, `numberToSpanish`)' due to data type mismatch: input to function array should all be the same type, but it's [struct<number:int,English:string>, struct<number:int,Spanish:string>];;
'Project [number#200, English#201, Spanish#202, include_spanish#203, numberToEnglish#253, numberToSpanish#259, CASE WHEN (include_spanish#203 = true) THEN array(numberToEnglish#253, numberToSpanish#259) ELSE array(numberToEnglish#253) END AS numberToLanguage#266]
What would be the best way to achieve this functionality?
In order for the array method to view struct($"number", $"English") and struct($"number", $"Spanish") as the same data type, you'll need to name the struct elements, as shown below:
val df = Seq(
(1, "One", "uno", true), (2, "Two", "Dos", true), (3, "Three", "Tres", false)
).toDF("number", "English", "Spanish", "include_spanish")
df.
withColumn("numberToEnglish", struct($"number".as("num"), $"English".as("lang"))).
withColumn("numberToSpanish", struct($"number".as("num"), $"Spanish".as("lang"))).
withColumn("numberToLanguage",
when($"include_spanish", array($"numberToEnglish", $"numberToSpanish")).
otherwise(array($"numberToEnglish"))
).
show
// +------+-------+-------+---------------+---------------+---------------+--------------------+
// |number|English|Spanish|include_spanish|numberToEnglish|numberToSpanish| numberToLanguage|
// +------+-------+-------+---------------+---------------+---------------+--------------------+
// | 1| One| uno| true| [1, One]| [1, uno]|[[1, One], [1, uno]]|
// | 2| Two| Dos| true| [2, Two]| [2, Dos]|[[2, Two], [2, Dos]]|
// | 3| Three| Tres| false| [3, Three]| [3, Tres]| [[3, Three]]|
// +------+-------+-------+---------------+---------------+---------------+--------------------+
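A quick way to see why naming the fields fixes the mismatch is to compare the struct types in the schema; a minimal check, run against the df defined above:
df.
  withColumn("numberToEnglish", struct($"number".as("num"), $"English".as("lang"))).
  withColumn("numberToSpanish", struct($"number".as("num"), $"Spanish".as("lang"))).
  select("numberToEnglish", "numberToSpanish").
  printSchema()
// both columns now report the same data type, struct<num:int,lang:string>,
// which is why array() accepts them together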
I have a dataframe as given below
ID, Code_Num, Code, Code1, Code2, Code3
10, 1, A1005*B1003, A1005, B1003, null
12, 2, A1007*D1008*C1004, A1007, D1008, C1004
I need help transposing the above dataset; the output should look like the one below.
ID, Code_Num, Code, Code_T
10, 1, A1005*B1003, A1005
10, 1, A1005*B1003, B1003
12, 2, A1007*D1008*C1004, A1007
12, 2, A1007*D1008*C1004, D1008
12, 2, A1007*D1008*C1004, C1004
Step 1: Creating the DataFrame.
values = [(10, 'A1005*B1003', 'A1005', 'B1003', None),(12, 'A1007*D1008*C1004', 'A1007', 'D1008', 'C1004')]
df = sqlContext.createDataFrame(values,['ID','Code','Code1','Code2','Code3'])
df.show()
+---+-----------------+-----+-----+-----+
| ID| Code|Code1|Code2|Code3|
+---+-----------------+-----+-----+-----+
| 10| A1005*B1003|A1005|B1003| null|
| 12|A1007*D1008*C1004|A1007|D1008|C1004|
+---+-----------------+-----+-----+-----+
Step 2: Explode the DataFrame.
from pyspark.sql.functions import explode, array, struct, lit, col

def to_transpose(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))

    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"

    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")

    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
df = to_transpose(df, ["ID","Code"]).drop('key').withColumnRenamed("val","Code_T")
df.show()
+---+-----------------+------+
| ID| Code|Code_T|
+---+-----------------+------+
| 10| A1005*B1003| A1005|
| 10| A1005*B1003| B1003|
| 10| A1005*B1003| null|
| 12|A1007*D1008*C1004| A1007|
| 12|A1007*D1008*C1004| D1008|
| 12|A1007*D1008*C1004| C1004|
+---+-----------------+------+
If you only want non-null values in column Code_T, just run the statement below:
df = df.where(col('Code_T').isNotNull())
I am trying to convert the following PySpark code into Scala. As you know, DataFrames in Scala are immutable, which is constraining me in converting this code:
PySpark code:
time_frame = ["3m", "6m", "9m", "12m", "18m", "27m", "60m", "60m_ab"]
variable_name = ["var1", "var2", "var3"....., "var30"]
train_df = sqlContext.sql("select * from someTable")

for var in variable_name:
    for tf in range(1, len(time_frame)):
        train_df = train_df.withColumn(str(time_frame[tf] + '_' + var),
                                       fn.col(str(time_frame[tf] + '_' + var)) + fn.col(str(time_frame[tf - 1] + '_' + var)))
So, as you can see above, existing columns in the table are used to derive more columns. However, the immutable nature of DataFrames in Spark/Scala is getting in the way; can you help me with a workaround?
Here's one approach that first uses a for-comprehension to generate a list of tuples consisting of column name pairs, and then traverses the list using foldLeft to iteratively transform trainDF via withColumn:
import org.apache.spark.sql.functions._
val timeframes: Seq[String] = ???
val variableNames: Seq[String] = ???
val newCols = for {
vn <- variableNames
tf <- 1 until timeframes.size
} yield (timeframes(tf) + "_" + vn, timeframes(tf - 1) + "_" + vn)
val trainDF = spark.sql("""select * from some_table""")
val resultDF = newCols.foldLeft(trainDF)( (accDF, cs) =>
accDF.withColumn(cs._1, col(cs._1) + col(cs._2))
)
To test the above code, simply provide sample input and create table some_table:
val timeframes = Seq("3m", "6m", "9m")
val variableNames = Seq("var1", "var2")
val df = Seq(
(1, 10, 11, 12, 13, 14, 15),
(2, 20, 21, 22, 23, 24, 25),
(3, 30, 31, 32, 33, 34, 35)
).toDF("id", "3m_var1", "6m_var1", "9m_var1", "3m_var2", "6m_var2", "9m_var2")
df.createOrReplaceTempView("some_table")
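For this sample input, the for-comprehension yields the following (target column, previous column) pairs, which is a quick way to sanity-check the generated names before running the foldLeft:
newCols.foreach(println)
// (6m_var1,3m_var1)
// (9m_var1,6m_var1)
// (6m_var2,3m_var2)
// (9m_var2,6m_var2)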
resultDF should look like the following:
resultDF.show
// +---+-------+-------+-------+-------+-------+-------+
// | id|3m_var1|6m_var1|9m_var1|3m_var2|6m_var2|9m_var2|
// +---+-------+-------+-------+-------+-------+-------+
// | 1| 10| 21| 33| 13| 27| 42|
// | 2| 20| 41| 63| 23| 47| 72|
// | 3| 30| 61| 93| 33| 67| 102|
// +---+-------+-------+-------+-------+-------+-------+