Convert Array of String column to multiple columns in Spark Scala

I have a dataframe with following schema:
id : int,
emp_details: Array(String)
Some sample data:
1, Array(empname=xxx,city=yyy,zip=12345)
2, Array(empname=bbb,city=bbb,zip=22345)
I need to read emp_details from the array and assign each value to a new column as shown below, or otherwise split the array into multiple columns named empname, city and zip:
.withColumn("empname", xxx)
.withColumn("city", yyy)
.withColumn("zip", 12345)
Could you please advise how to achieve this using Spark 1.6 with Scala?
Really appreciate your help...
Thanks a lot

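You can use withColumn and split (from org.apache.spark.sql.functions) to get the required data, provided the array elements always appear in the same order: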
df1.withColumn("empname", split($"emp_details" (0), "=")(1))
.withColumn("city", split($"emp_details" (1), "=")(1))
.withColumn("zip", split($"emp_details" (2), "=")(1))
Output:
+---+----------------------------------+-------+----+-----+
|id |emp_details                       |empname|city|zip  |
+---+----------------------------------+-------+----+-----+
|1  |[empname=xxx, city=yyy, zip=12345]|xxx    |yyy |12345|
|2  |[empname=bbb, city=bbb, zip=22345]|bbb    |bbb |22345|
+---+----------------------------------+-------+----+-----+
UPDATE:
If the elements of the array are not in a fixed order, you can use a UDF that converts the array to a map and looks up each key by name:
val getColumnsUDF = udf((details: Seq[String]) => {
  val detailsMap = details.map(_.split("=")).map(x => (x(0), x(1))).toMap
  (detailsMap("empname"), detailsMap("city"), detailsMap("zip"))
})
Now use the udf
df1.withColumn("emp",getColumnsUDF($"emp_details"))
.select($"id", $"emp._1".as("empname"), $"emp._2".as("city"), $"emp._3".as("zip"))
.show(false)
Output:
+---+-------+----+-----+
|id |empname|city|zip  |
+---+-------+----+-----+
|1  |xxx    |yyy |12345|
|2  |bbb    |bbb |22345|
+---+-------+----+-----+
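One caveat with the map-based UDF: it throws if an element has no "=" or if one of the three keys is missing. The following is a minimal plain-Scala sketch (no Spark required) of a more tolerant parser. The name parseDetails is hypothetical and not part of the original answer; the same function body could be used inside a UDF.

```scala
// Hypothetical helper (an assumption, not from the original answer):
// parse "key=value" strings into a Map, skipping entries without '='.
def parseDetails(details: Seq[String]): Map[String, String] =
  details.flatMap { entry =>
    // Limit 2 keeps any further '=' characters inside the value.
    entry.split("=", 2) match {
      case Array(k, v) => Some(k.trim -> v.trim)
      case _           => None // malformed entry, ignore it
    }
  }.toMap

// The order of the elements does not matter.
val parsed = parseDetails(Seq("city=yyy", "empname=xxx", "zip=12345", "malformed"))

// Map#get returns an Option, so a missing key does not throw.
val empname = parsed.get("empname")
val country = parsed.get("country")
```

Using Option for the lookups lets a missing key become a null column value instead of failing the whole job.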
Hope this helps!

Related

How do I string concat two columns in Scala but order the resulting column alphabetically?

I have a dataframe like this...
val new_df =Seq(("a","b"),("b","a"),("a","c")).toDF("col1","col2")
and I want to create "col3" which is a string concatenation of "col1" and "col2". However, I want the concatenation of "ab" and "ba" to be treated the same, sorted alphabetically so that it's only "ab".
The resulting dataframe I would like to look like this:
val new_df =Seq(("a","b","ab"),("b","a","ab"),("a","c","ac")).toDF("col1","col2","col3")
thanks and have a great day!
You can use the built-in Spark SQL functions, which lets Spark apply its SQL optimizations:
import org.apache.spark.sql.functions.{array, col, concat_ws, sort_array}
new_df.withColumn("col3",
  concat_ws("",
    sort_array(array(col("col1"), col("col2")))))
Alternatively, you can create a UDF that builds the sorted string:
val concatColumns = udf((c1: String, c2: String) => {
  List(c1, c2).sorted.mkString
})
And then use it in a withColumn statement sending the desired columns to concatenate
new_df.withColumn("col3", concatColumns($"col1", $"col2")).show(false)
Result
+----+----+----+
|col1|col2|col3|
+----+----+----+
|a   |b   |ab  |
|b   |a   |ab  |
|a   |c   |ac  |
+----+----+----+
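Note that the UDF version would fail with a NullPointerException if either column contains null, because sorting compares the strings. A minimal plain-Scala sketch of a null-tolerant variant follows. The name sortedConcat is hypothetical, and dropping nulls is an assumption about the desired behaviour.

```scala
// Hypothetical helper: drop nulls, sort what remains, then concatenate.
def sortedConcat(values: String*): String =
  values.filter(_ != null).sorted.mkString

val ab  = sortedConcat("a", "b")
val ba  = sortedConcat("b", "a")
val onlyA = sortedConcat("a", null)
```

The same body could be wrapped in udf(...) in place of List(c1, c2).sorted.mkString.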

Pass list of column values to spark dataframe as new column

I am trying to add a new column to spark dataframe as below:
val abc = [a,b,c,d] --- List of columns
I want to pass the values of the columns in this list as a new column of the dataframe, apply sha2 to that new column, and cast the result to varchar(64).
source = source.withColumn("newcolumn", sha2(col(abc), 256).cast('varchar(64)'))
It compiled, but at runtime I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'abc' given input
columns:
The expected output is a dataframe with a column named newcolumn whose value is a varchar(64) holding the sha2 of the column values concatenated with ||.
Please suggest.
We can map the column names to Column objects, join their values with concat_ws using "||" as the separator, and apply sha2() to the concatenated string:
val abc = Seq("a","b","c","d")
val df=Seq(((1),(2),(3),(4))).toDF("a","b","c","d")
df.withColumn("newColumn",sha2(concat_ws("||", abc.map(c=> col(c)):_*),256)).show(false)
//+---+---+---+---+----------------------------------------------------------------+
//|a |b |c |d |newColumn |
//+---+---+---+---+----------------------------------------------------------------+
//|1 |2 |3 |4 |20a5b7415fb63243c5dbacc9b30375de49636051bda91859e392d3c6785557c9|
//+---+---+---+---+----------------------------------------------------------------+
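On the varchar(64) requirement: SHA-256 produces 256 bits, which is 64 hexadecimal characters, so the string returned by sha2(..., 256) already has exactly that length. The plain-Scala sketch below (no Spark) illustrates this with java.security.MessageDigest. The helper name sha256Hex is hypothetical and UTF-8 encoding is an assumption; it is meant to show the idea, not to reproduce Spark's implementation.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical helper: SHA-256 of a string, rendered as lowercase hex.
def sha256Hex(input: String): String =
  MessageDigest.getInstance("SHA-256")
    .digest(input.getBytes(StandardCharsets.UTF_8)) // UTF-8 is an assumption
    .map(b => "%02x".format(b & 0xff))              // two hex digits per byte
    .mkString

// Same shape as concat_ws("||", ...) over the four column values.
val joined = Seq("1", "2", "3", "4").mkString("||")
val digest = sha256Hex(joined)
```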

How to split column into multiple columns in Spark 2?

I am reading the data from HDFS into DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data as I showed.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
import org.apache.spark.sql.{functions => F}
val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
  F.regexp_extract($"value", r, 1).as("id"),
  F.regexp_extract($"value", r, 2).as("community")
).show()
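To see what the two capture groups pick up, the same pattern can be tried on a single line in plain Scala, without Spark. This is only an illustrative sketch; the sample line is the one shown by df.head() in the question, and the value names are hypothetical.

```scala
// Same pattern as above, compiled as a Scala Regex.
val pattern = """\((\d+),\{.*community:(\d+).*\}\)""".r

val line =
  "(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})"

// Group 1 is the id before the comma, group 2 the value of "community".
// "communitySigmaTot:" does not match because the pattern requires
// the literal text "community:".
val extracted: Option[(String, String)] = line match {
  case pattern(id, community) => Some((id, community))
  case _                      => None
}
```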
A bunch of regular expressions should give you required result.
df.select(
regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
then you can simply use the built-in split and regexp_replace functions to get the desired output dataframe:
import org.apache.spark.sql.functions._
df.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
I hope the answer is helpful

Spark dataframe add a row for every existing row

I have a dataframe with following columns:
groupid,unit,height
----------------------
1,in,55
2,in,54
I want to create another dataframe with additional rows where unit=cm and height=height*2.54.
Resulting dataframe:
groupid,unit,height
----------------------
1,in,55
2,in,54
1,cm,139.7
2,cm,137.16
Not sure how I can use spark udf and explode here.
Any help is appreciated.
Thanks in advance.
You can create a second dataframe with the required changes using withColumn, and then union the two dataframes:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
(1, "in", 55),
(2, "in", 54)
).toDF("groupid", "unit", "height")
val df2 = df.withColumn("unit", lit("cm")).withColumn("height", col("height")*2.54)
df.union(df2).show(false)
You should get:
+-------+----+------+
|groupid|unit|height|
+-------+----+------+
|1      |in  |55.0  |
|2      |in  |54.0  |
|1      |cm  |139.7 |
|2      |cm  |137.16|
+-------+----+------+
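Since the question mentions explode, here is the "one extra row per input row" idea sketched with flatMap on plain Scala collections (no Spark). The case class Measurement is hypothetical; a typed Dataset offers a flatMap of the same shape, though that is not tested here. Note that the row order differs from the union result, because each converted row directly follows its source row.

```scala
// Hypothetical row type for illustration.
case class Measurement(groupid: Int, unit: String, height: Double)

val rows = Seq(Measurement(1, "in", 55), Measurement(2, "in", 54))

// Emit the original row plus a converted copy for every input row.
val expanded = rows.flatMap { m =>
  Seq(m, m.copy(unit = "cm", height = m.height * 2.54))
}
```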

How to merge two columns of a `Dataframe` in Spark into one 2-Tuple?

I have a Spark DataFrame df with five columns. I want to add another column whose values are tuples of the first and second columns. When I use the withColumn() method, I get a type mismatch error, because the input is not of type Column but (Column, Column). Is there a solution other than running a for loop over the rows?
var dfCol=(col1:Column,col2:Column)=>(col1,col2)
val vv = df.withColumn( "NewColumn", dfCol( df(df.schema.fieldNames(1)) , df(df.schema.fieldNames(2)) ) )
You can use the struct function, which creates a struct (tuple) column from the provided columns:
import org.apache.spark.sql.functions.struct
val df = Seq((1,2), (3,4), (5,3)).toDF("a", "b")
df.withColumn("NewColumn", struct(df("a"), df("b"))).show(false)
+---+---+---------+
|a  |b  |NewColumn|
+---+---+---------+
|1  |2  |[1,2]    |
|3  |4  |[3,4]    |
|5  |3  |[5,3]    |
+---+---+---------+
You can use a User-defined function udf to achieve what you want.
UDF definition
object TupleUDFs {
  import org.apache.spark.sql.functions.udf
  // type tag is required, as we have a generic udf
  import scala.reflect.runtime.universe.{TypeTag, typeTag}
  def toTuple2[S: TypeTag, T: TypeTag] =
    udf[(S, T), S, T]((x: S, y: T) => (x, y))
}
Usage
df.withColumn(
"tuple_col", TupleUDFs.toTuple2[Int, Int].apply(df("a"), df("b"))
)
assuming "a" and "b" are the columns of type Int you want to put in a tuple.
You can merge multiple dataframe columns into one using array.
// $"*" will capture all existing columns
df.select($"*", array($"col1", $"col2").as("newCol"))
If you want to merge two dataframe columns into one column, just use array:
import org.apache.spark.sql.functions.array
df.withColumn("NewColumn", array("columnA", "columnB"))