how to access values from a array column in scala dataframe - scala

I have a dataframe coming with scala array of tuples (index, value) like the following, index has values from 1 to 4
id | units_flag_tuples
id1 | [(3,2.0), (4,6.0)]
id2 | [(1,10.0), (2,2.0), (3,5.0)]
I would like to access the value from the array and put it into columns based on index (unit1, unit2, unit3, unit4):
ID | unit1| unit2 | unit3 | unit 4
id1 | null | null | 2.0 | 6.0
id2 | 10.0 | 2.0 | 5.0 | null
here is the code:
df
.withColumn("unit1", col("units_flag_tuples").find(_._1 == '1').get._2 )
.withColumn("unit2", col("units_flag_tuples").find(_._1 == '2').get._2 )
.withColumn("unit3", col("units_flag_tuples").find(_._1 == '3').get._2 )
.withColumn("unit4", col("units_flag_tuples").find(_._1 == '4').get._2 )
Here is the error message I am getting:
error: value find is not a member of org.apache.spark.sql.Column
How to resolve this error or any better ways to do it?

Here is my different approach, that I have used the map_from_entries function to make a map for array and get each column by choosing the key from the map.
val df = Seq(("id1", Seq((3,2.0), (4,6.0))), ("id2", Seq((1,10.0), (2,2.0), (3,5.0)))).toDF("id", "units_flag_tuples")
df.show(false)
df.withColumn("map", map_from_entries(col("units_flag_tuples")))
.withColumn("unit1", col("map.1"))
.withColumn("unit2", col("map.2"))
.withColumn("unit3", col("map.3"))
.withColumn("unit4", col("map.4"))
.drop("map", "units_flag_tuples").show
The result is:
+---+-----+-----+-----+-----+
| id|unit1|unit2|unit3|unit4|
+---+-----+-----+-----+-----+
|id1| null| null| 2.0| 6.0|
|id2| 10.0| 2.0| 5.0| null|
+---+-----+-----+-----+-----+

Related

assign values to a new column depending on old column values in dataframe

I have assigned values to 4 variables in a conf or application.properties file,
A = 1
B = 2
C = 3
D = 4
I have a dataframe as follows,
+-----+
|name |
+-----+
| A |
| C |
| B |
| D |
| B |
+-----+
I want to add a new column that has the values assigned from the conf variables declared above for A,B,C,D respectively depending on the value in the name column.
Final Dataframe should have,
+----+----------+
|name|NAME_VALUE|
+----+----------+
| A | 1 |
| C | 3 |
| B | 2 |
| D | 4 |
| B | 2 |
+----+----------+
I tried lit function in .WITHCOLUMN with conf.getint($name), not accepting Column in lit func requires string, I have to hardcode the variable names in lit. Is there anyway for me to dynamically assign those respective conf variable names in LIT so it can automatically assign values to another column in spark scala?
For this moment i dont have any ideas how to do it as you intended with dynamic usage of vals names.
My proposition is to use a seq of tuples instead of multiple vals, in such case you can create some udf and try to map this value for each row, but you can also use join which i am showing in below example:
val data = Seq(("A"),("C"), ("B"), ("D"), ("B"))
val df = data.toDF("name")
val mappings = Seq(("A",1), ("B",2), ("C",3), ("D",4))
val mappingsDf = mappings.toDF("name", "value")
df.join(broadcast(mappingsDf), df("name") === mappingsDf("name"), "left")
.select(
df("name"),
mappingsDf("value")
).show
output is as expected:
+----+-----+
|name|value|
+----+-----+
| A| 1|
| C| 3|
| B| 2|
| D| 4|
| B| 2|
+----+-----+
This solution is pretty generic as your mapping are df here so you can hardcode them as showed in my example or load them from some csv or json easily with spark api
Due to broadcast join it should be quite efficient (you should remove this hint if you want to use big amount of mappings!)
I think its easy to understand and maintain as its not udf but only Spark api

Spark Scala split column values in a dataframe to appended lists

I have data in a spark dataframe that I need to search for elements by name, append the values to a list, and split searched elements into separate columns of the dataframe.
I am using scala and the below is an example of my current code that works to get the first value but I need to append all values available not just the first.
I'm new to Scala (and python) so any help will be greatly appreciated!
val getNumber: (String => String) = (colString: String) => {
if (colString != null) {
raw"number:(\d+)".r
.findAllIn(colString)
.group(1)
}
else
null
}
val udfGetColumn = udf(getNumber)
val mydf = df.select(cols.....)
.withColumn("var_number", udfGetColumn($"var"))
Example Data:
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| key| var |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |["[number:123456 rate:111970 position:1]","[number:123457 rate:662352 position:2]","[number:123458 rate:890 position:3]","[number:123459 rate:190 position:4]"] | |
|2 |["[number:654321 rate:211971 position:1]","[number:654322 rate:124 position:2]","[number:654323 rate:421 position:3]"] |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
Desired Result:
+------+------------------------------------------------------------+
| key| var_number | var_rate | var_position |
+------+------------------------------------------------------------+
|1 | 123456 | 111970 | 1 |
|1 | 123457 | 662352 | 2 |
|1 | 123458 | 890 | 3 |
|1 | 123459 | 190 | 4 |
|2 | 654321 | 211971 | 1 |
|2 | 654322 | 124 | 2 |
|2 | 654323 | 421 | 3 |
+------+-----------------+---------------------+--------------------+
You don't need to use UDF here. You can easily transform the array column var by converting each element into a map using str_to_map after removing the square brackets ([]) with regexp_replace function. Finally, explode the transformed array and select the fields:
val df = Seq(
(1, Seq("[number:123456 rate:111970 position:1]", "[number:123457 rate:662352 position:2]", "[number:123458 rate:890 position:3]", "[number:123459 rate:190 position:4]")),
(2, Seq("[number:654321 rate:211971 position:1]", "[number:654322 rate:124 position:2]", "[number:654323 rate:421 position:3]"))
).toDF("key", "var")
val result = df.withColumn(
"var",
explode(expr(raw"transform(var, x -> str_to_map(regexp_replace(x, '[\\[\\]]', ''), ' '))"))
).select(
col("key"),
col("var").getField("number").alias("var_number"),
col("var").getField("rate").alias("var_rate"),
col("var").getField("position").alias("var_position")
)
result.show
//+---+----------+--------+------------+
//|key|var_number|var_rate|var_position|
//+---+----------+--------+------------+
//| 1| 123456| 111970| 1|
//| 1| 123457| 662352| 2|
//| 1| 123458| 890| 3|
//| 1| 123459| 190| 4|
//| 2| 654321| 211971| 1|
//| 2| 654322| 124| 2|
//| 2| 654323| 421| 3|
//+---+----------+--------+------------+
From you comment, it appears the column var is of type string not array. In this case, you can first transform it by removing [] and " characters then split by comma to get an array:
val df = Seq(
(1, """["[number:123456 rate:111970 position:1]", "[number:123457 rate:662352 position:2]", "[number:123458 rate:890 position:3]", "[number:123459 rate:190 position:4]"]"""),
(2, """["[number:654321 rate:211971 position:1]", "[number:654322 rate:124 position:2]", "[number:654323 rate:421 position:3]"]""")
).toDF("key", "var")
val result = df.withColumn(
"var",
split(regexp_replace(col("var"), "[\\[\\]\"]", ""), ",")
).withColumn(
"var",
explode(expr("transform(var, x -> str_to_map(x, ' '))"))
).select(
// select your columns as above...
)

How to check whether a the whole column in a pyspark contains a value using Expr

In pyspark how can i use expr to check whether a whole column contains the value in columnA of that row.
pseudo code below
df=df.withColumn("Result", expr(if any the rows in column1 contains the value colA (for this row) then 1 else 0))
Take an arbitrary example:
valuesCol = [('rose','rose is red'),('jasmine','I never saw Jasmine'),('lily','Lili dont be silly'),('daffodil','what a flower')]
df = sqlContext.createDataFrame(valuesCol,['columnA','columnB'])
df.show()
+--------+-------------------+
| columnA| columnB|
+--------+-------------------+
| rose| rose is red|
| jasmine|I never saw Jasmine|
| lily| Lili dont be silly|
|daffodil| what a flower|
+--------+-------------------+
Application of expr() here. How one can use expr(), just look for the corresponding SQL syntax and it should work with expr() mostly.
df = df.withColumn('columnA_exists',expr("(case when instr(lower(columnB), lower(columnA))>=1 then 1 else 0 end)"))
df.show()
+--------+-------------------+--------------+
| columnA| columnB|columnA_exists|
+--------+-------------------+--------------+
| rose| rose is red| 1|
| jasmine|I never saw Jasmine| 1|
| lily| Lili dont be silly| 0|
|daffodil| what a flower| 0|
+--------+-------------------+--------------+

Spark scala create multiple columns from array column

Creating a multiple columns from array column
Dataframe
Car name | details
Toyota | [[year,2000],[price,20000]]
Audi | [[mpg,22]]
Expected dataframe
Car name | year | price | mpg
Toyota | 2000 | 20000 | null
Audi | null | null | 22
You can try this
Let's define the data
scala> val carsDF = Seq(("toyota",Array(("year", 2000), ("price", 100000))), ("Audi", Array(("mpg", 22)))).toDF("car", "details")
carsDF: org.apache.spark.sql.DataFrame = [car: string, details: array<struct<_1:string,_2:int>>]
scala> carsDF.show(false)
+------+-----------------------------+
|car |details |
+------+-----------------------------+
|toyota|[[year,2000], [price,100000]]|
|Audi |[[mpg,22]] |
+------+-----------------------------+
Splitting the data & accessing the values in the data
scala> carsDF.withColumn("split", explode($"details")).withColumn("col", $"split"("_1")).withColumn("val", $"split"("_2")).select("car", "col", "val").show
+------+-----+------+
| car| col| val|
+------+-----+------+
|toyota| year| 2000|
|toyota|price|100000|
| Audi| mpg| 22|
+------+-----+------+
Define the list of columns that are required
scala> val colNames = Seq("mpg", "price", "year", "dummy")
colNames: Seq[String] = List(mpg, price, year, dummy)
Use pivoting on the above defined column names gives required output.
By giving new column names in the sequence makes it a single point input
scala> weDF.groupBy("car").pivot("col", colNames).agg(avg($"val")).show
+------+----+--------+------+-----+
| car| mpg| price| year|dummy|
+------+----+--------+------+-----+
|toyota|null|100000.0|2000.0| null|
| Audi|22.0| null| null| null|
+------+----+--------+------+-----+
This seems more elegant & easy way to achieve the output
you can do it like that
import org.apache.spark.functions.col
val df: DataFrame = Seq(
("toyota",Array(("year", 2000), ("price", 100000))),
("toyota",Array(("year", 2001)))
).toDF("car", "details")
+------+-------------------------------+
|car |details |
+------+-------------------------------+
|toyota|[[year, 2000], [price, 100000]]|
|toyota|[[year, 2001]] |
+------+-------------------------------+
val newdf = df
.withColumn("year", when(col("details")(0)("_1") === lit("year"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
.withColumn("price", when(col("details")(0)("_1") === lit("price"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
.drop("details")
newdf.show()
+------+----+------+
| car|year| price|
+------+----+------+
|toyota|2000|100000|
|toyota|2001| null|
+------+----+------+

Map a multimap to columns of dataframe

Simply, I want to convert a multimap like this:
val input = Map("rownum"-> List("1", "2", "3") , "plant"-> List( "Melfi", "Pomigliano", "Torino" ), "tipo"-> List("gomme", "telaio")).toArray
in the following Spark dataframe:
+-------+--------------+-------+
|rownum | plant | tipo |
+------ +--------------+-------+
| 1 | Melfi | gomme |
| 2 | Pomigliano | telaio|
| 3 | Torino | null |
+-------+--------------+-------+
replacing missing values with "null" values. My issue is apply a map function to the RDD:
val inputRdd = sc.parallelize(input)
inputRdd.map(..).toDF()
Any suggestions? Thanks in advance
Although, see my comments, I'm really not sure the multimap format is well suited to your problem (did you have a look at Spark XML parsing modules ?)
The pivot table solution
The idea is to flatten you input table into a (elementPosition, columnName, columnValue) format :
// The max size of the multimap lists
val numberOfRows = input.map(_._2.size).max
// For each index in the list, emit a tuple of (index, multimap key, multimap value at index)
val flatRows = (0 until numberOfRows).flatMap(rowIdx => input.map({ case (colName, allColValues) => (rowIdx, colName, if(allColValues.size > rowIdx) allColValues(rowIdx) else null)}))
// Probably faster at runtime to write it this way (less iterations) :
// val flatRows = input.flatMap({ case (colName, existingValues) => (0 until numberOfRows).zipAll(existingValues, null, null).map(t => (t._1.asInstanceOf[Int], colName, t._2)) })
// To dataframe
val flatDF = sc.parallelize(flatRows).toDF("elementIndex", "colName", "colValue")
flatDF.show
Will output :
+------------+-------+----------+
|elementIndex|colName| colValue|
+------------+-------+----------+
| 0| rownum| 1|
| 0| plant| Melfi|
| 0| tipo| gomme|
| 1| rownum| 2|
| 1| plant|Pomigliano|
| 1| tipo| telaio|
| 2| rownum| 3|
| 2| plant| Torino|
| 2| tipo| null|
+------------+-------+----------+
Now this is a pivot table problem :
flatDF.groupBy("elementIndex").pivot("colName").agg(expr("first(colValue)")).drop("elementIndex").show
+----------+------+------+
| plant|rownum| tipo|
+----------+------+------+
|Pomigliano| 2|telaio|
| Torino| 3| null|
| Melfi| 1| gomme|
+----------+------+------+
This might not be the best looking solution, but it is fully scalable to any number of columns.