Pass list of column values to spark dataframe as new column - scala

I am trying to add a new column to a Spark dataframe as below:
val abc = List("a", "b", "c", "d") // list of column names
I want to pass the above list of columns as a new column to the dataframe, apply sha2 on that new column, and cast it to varchar(64):
source = source.withColumn("newcolumn", sha2(col(abc), 256).cast("varchar(64)"))
It compiled, but at runtime I get:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'abc' given input columns:
The expected output should be a dataframe with newcolumn as the column name and, as its value, varchar(64) holding the sha2 of the ||-concatenation of the listed columns.
Please suggest.

We can map the column names to columns, concatenate them with concat_ws using the || separator, and apply sha2() on the concatenated data.
val abc = Seq("a", "b", "c", "d")
val df = Seq((1, 2, 3, 4)).toDF("a", "b", "c", "d")
df.withColumn("newColumn", sha2(concat_ws("||", abc.map(c => col(c)): _*), 256)).show(false)
//+---+---+---+---+----------------------------------------------------------------+
//|a  |b  |c  |d  |newColumn                                                       |
//+---+---+---+---+----------------------------------------------------------------+
//|1  |2  |3  |4  |20a5b7415fb63243c5dbacc9b30375de49636051bda91859e392d3c6785557c9|
//+---+---+---+---+----------------------------------------------------------------+
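As for the varchar(64) cast from the question: sha2(..., 256) already returns a 64-character hex string, so the cast adds nothing. If you want an explicit cast anyway, a minimal sketch using a plain string cast (safe across Spark versions):
df.withColumn("newColumn",
    sha2(concat_ws("||", abc.map(col): _*), 256).cast("string")) // already 64 hex chars
  .show(false)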

Related

not able to get nested json value as column

I'm trying to create a schema for JSON and see it as columns in a dataframe.
Input json
{"place":{"place_name":"NYC","lon":0,"lat":0,"place_id":1009},"region":{"region_issues":[{"key":"health","issue_name":"Cancer"},{"key":"sports","issue_name":"swimming"}]}}
code
val schemaRsvp = new StructType()
  .add("place", StructType(Array(
    StructField("place_name", DataTypes.StringType),
    StructField("lon", DataTypes.IntegerType),
    StructField("lat", DataTypes.IntegerType),
    StructField("place_id", DataTypes.IntegerType))))
val ip = spark.read.schema(schemaRsvp).json("D:\\Data\\rsvp\\inputrsvp.json")
ip.show()
It's showing all the fields in the single column place; I want the values column-wise:
place_name,lon,lat,place_id
NYC,0,0,1009
Any suggestion, how to fix this ?
You can convert a struct type into columns using ".*":
ip.select("place.*").show()
+----------+---+---+--------+
|place_name|lon|lat|place_id|
+----------+---+---+--------+
| NYC| 0| 0| 1009|
+----------+---+---+--------+
UPDATE:
with the region_issues array you can explode your data and then use the same ".*" to convert the struct type into columns:
ip.select(col("place"), explode(col("region.region_issues")).as("region_issues"))
.select("place.*", "region_issues.*").show(false)
+---+---+--------+----------+----------+------+
|lat|lon|place_id|place_name|issue_name|key |
+---+---+--------+----------+----------+------+
|0 |0 |1009 |NYC |Cancer |health|
|0 |0 |1009 |NYC |swimming |sports|
+---+---+--------+----------+----------+------+
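Note: for region.region_issues to resolve, the schema must also declare the region struct. A sketch of the extended schema, with field names taken from the input JSON shown above:
val schemaRsvp = new StructType()
  .add("place", StructType(Array(
    StructField("place_name", DataTypes.StringType),
    StructField("lon", DataTypes.IntegerType),
    StructField("lat", DataTypes.IntegerType),
    StructField("place_id", DataTypes.IntegerType))))
  .add("region", StructType(Array(
    StructField("region_issues", ArrayType(StructType(Array(
      StructField("key", DataTypes.StringType),
      StructField("issue_name", DataTypes.StringType))))))))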

convert dataframe column values and apply SHA2 masking logic

I have dataframes for the property (config) table and the main table from Hive. I want to exclude some columns and apply masking logic (SHA2) to the others.
I am reading the property config from a Postgres DB as a dataframe in the Spark/Scala job:
val propertydf = ... // the property dataframe loaded from the Postgres DB
Main Hive table:
and the output should be:
Please help me write this in Spark/Scala; I am unable to extract the List[String] from the config dataframe and pass it to a function.
You can manipulate the column names and select them as appropriate:
val masking = propertydf.head(1)(0).getAs[String]("maskingcolumns").split(",")
val exclude = propertydf.head(1)(0).getAs[String]("columnstoexclude").split(",")
val result = df.select(
  masking.map(c => sha2(col(c).cast("string"), 256).as(c)) ++
  df.columns.filterNot(c => masking.contains(c) || exclude.contains(c)).map(col)
  : _*
)
result.show(false)
result.show(false)
+----------------------------------------------------------------+----------------------------------------------------------------+---+---+
|a                                                               |b                                                               |c  |d  |
+----------------------------------------------------------------+----------------------------------------------------------------+---+---+
|ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad|6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b|11 |cbc|
+----------------------------------------------------------------+----------------------------------------------------------------+---+---+
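To make this reusable, the same logic can be wrapped in a helper function. A sketch; the function name and parameter types are assumptions, the select logic is the one above:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sha2}

// Mask the given columns with SHA2-256 and drop the excluded ones,
// keeping all remaining columns unchanged.
def maskColumns(df: DataFrame, masking: Seq[String], exclude: Seq[String]): DataFrame =
  df.select(
    masking.map(c => sha2(col(c).cast("string"), 256).as(c)) ++
    df.columns.filterNot(c => masking.contains(c) || exclude.contains(c)).map(col)
    : _*
  )

val result = maskColumns(df, masking, exclude)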

Logic to manipulate dataframe in spark scala

Take for example the following dataFrame:
x.show(false)
+-----+--------------------------------------------------------------------------------------+-------------+
|colId|hdfsPath                                                                              |timestamp    |
+-----+--------------------------------------------------------------------------------------+-------------+
|11   |hdfs://novus-nameservice/a/b/c/done/compiled-20200218050518-1-0-0-1582020318751.snappy|1662157400000|
|12   |hdfs://novus-nameservice/a/b/c/done/compiled-20200219060507-1-0-0-1582023907108.snappy|1662158000000|
+-----+--------------------------------------------------------------------------------------+-------------+
Now I am trying to update the existing DF to create a new DF based on the column hdfsPath.
The new DF should look like the following:
+-----+----------------------------------------------------------------------------------------------------+-------------+
|colId|hdfsPath                                                                                            |timestamp    |
+-----+----------------------------------------------------------------------------------------------------+-------------+
|11   |hdfs://novus-nameservice/a/b/c/target/20200218/11/compiled-20200218050518-1-0-0-1582020318751.snappy|1662157400000|
|12   |hdfs://novus-nameservice/a/b/c/target/20200219/12/compiled-20200219060507-1-0-0-1582023907108.snappy|1662158000000|
+-----+----------------------------------------------------------------------------------------------------+-------------+
So the path segment done changes to target; from the compiled-20200218050518-1-0-0-1582020318751.snappy portion I take the date 20200218, then the colId 11, and finally the snappy file name. What would be the easiest and most efficient way to achieve this?
It's not a hard requirement to create a new DF; I can also update the existing DF with a new column.
To summarize:
Current hdfsPath:
hdfs://novus-nameservice/a/b/c/done/compiled-20200218050518-1-0-0-1582020318751.snappy
Expected hdfsPath:
hdfs://novus-nameservice/a/b/c/target/20200218/11/compiled-20200218050518-1-0-0-1582020318751.snappy
based on colId.
The simplest way I can imagine doing this is converting your dataframe to a Dataset, applying a map operation, and then going back to a dataframe:
// Define a case class; the field names must match the column names.
// timestamp must be Long, since the sample values exceed the Int range.
case class MyType(colId: Int, hdfsPath: String, timestamp: Long)
dataframe.as[MyType].map(x => <<Your Transformation code>>).toDF()
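A sketch of what that map body could look like for this particular rewrite (the regex and the use of copy are assumptions based on the sample paths; spark.implicits._ must be in scope):
val dateRegex = "compiled-(\\d{8})".r // the first 8 digits after "compiled-" are the date

val newDF = dataframe.as[MyType].map { x =>
  val date = dateRegex.findFirstMatchIn(x.hdfsPath).map(_.group(1)).getOrElse("")
  // /done/ becomes /target/<date>/<colId>/
  x.copy(hdfsPath = x.hdfsPath.replace("/done/", s"/target/$date/${x.colId}/"))
}.toDF()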
Here is what you can do with regexp_replace and regexp_extract: extract the values you want and build the replacement from them.
df.withColumn("hdfsPath", regexp_replace(
  $"hdfsPath",
  lit("/done"),
  concat(
    lit("/target/"),
    regexp_extract($"hdfsPath", "compiled-([0-9]{1,8})", 1),
    lit("/"),
    $"colId")
))
Output:
+-----+----------------------------------------------------------------------------------------------------+-------------+
|colId|hdfsPath                                                                                            |timestamp    |
+-----+----------------------------------------------------------------------------------------------------+-------------+
|11   |hdfs://novus-nameservice/a/b/c/target/20200218/11/compiled-20200218050518-1-0-0-1582020318751.snappy|1662157400000|
|12   |hdfs://novus-nameservice/a/b/c/target/20200219/12/compiled-20200219060507-1-0-0-1582023907108.snappy|1662158000000|
+-----+----------------------------------------------------------------------------------------------------+-------------+
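One caveat: the regexp_replace overload used above, which takes Column pattern and replacement arguments, only exists in newer Spark releases. On versions that only have the string-pattern overload, the same result can be assembled with concat (a sketch under that assumption):
df.withColumn("date", regexp_extract($"hdfsPath", "compiled-([0-9]{8})", 1))
  .withColumn("hdfsPath", concat(
    regexp_replace($"hdfsPath", "/done/.*$", "/target/"), // everything up to /target/
    $"date", lit("/"), $"colId", lit("/"),
    regexp_extract($"hdfsPath", "(compiled-.*)$", 1)))    // original file name
  .drop("date")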
Hope this helps!

Convert Array of String column to multiple columns in spark scala

I have a dataframe with the following schema:
id : int,
emp_details: Array(String)
Some sample data:
1, Array(empname=xxx,city=yyy,zip=12345)
2, Array(empname=bbb,city=bbb,zip=22345)
This data is in a dataframe, and I need to read emp_details from the array and assign the values to new columns as below, or split the array into multiple columns named empname, city, and zip:
.withColumn("empname", xxx)
.withColumn("city", yyy)
.withColumn("zip", 12345)
Could you please guide how we can achieve this using Spark 1.6 with Scala?
I'd really appreciate your help.
Thanks a lot.
You can use withColumn and split to get the required data:
df1.withColumn("empname", split($"emp_details"(0), "=")(1))
  .withColumn("city", split($"emp_details"(1), "=")(1))
  .withColumn("zip", split($"emp_details"(2), "=")(1))
Output:
+---+----------------------------------+-------+----+-----+
|id |emp_details                       |empname|city|zip  |
+---+----------------------------------+-------+----+-----+
|1  |[empname=xxx, city=yyy, zip=12345]|xxx    |yyy |12345|
|2  |[empname=bbb, city=bbb, zip=22345]|bbb    |bbb |22345|
+---+----------------------------------+-------+----+-----+
UPDATE:
If the data in the array doesn't have a fixed order, you can use a UDF to convert it to a map and use it as:
val getColumnsUDF = udf((details: Seq[String]) => {
  val detailsMap = details.map(_.split("=")).map(x => (x(0), x(1))).toMap
  (detailsMap("empname"), detailsMap("city"), detailsMap("zip"))
})
Now use the udf:
df1.withColumn("emp", getColumnsUDF($"emp_details"))
  .select($"id", $"emp._1".as("empname"), $"emp._2".as("city"), $"emp._3".as("zip"))
  .show(false)
Output:
+---+-------+----+-----+
|id |empname|city|zip  |
+---+-------+----+-----+
|1  |xxx    |yyy |12345|
|2  |bbb    |bbb |22345|
+---+-------+----+-----+
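If some rows might be missing a key or contain a malformed entry, a defensive variant of the same UDF (a sketch; the empty-string defaults are an assumption):
val getColumnsSafeUDF = udf((details: Seq[String]) => {
  // Split each "key=value" entry at the first '='; skip malformed entries
  val detailsMap = details
    .map(_.split("=", 2))
    .collect { case Array(k, v) => (k, v) }
    .toMap
  (detailsMap.getOrElse("empname", ""), detailsMap.getOrElse("city", ""), detailsMap.getOrElse("zip", ""))
})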
Hope this helps!

remove duplicate column from dataframe using scala

I need to remove one column from a DataFrame that has another column with the same name. I need to remove only one of them and keep the other for further use.
For example, given this input DF:
sno | age | psk | psk
---------------------
1 | 12 | a4 | a4
I would like to obtain this output DF:
sno | age | psk
----------------
1 | 12 | a4
Going through an RDD is one way (but you need to know the index of the duplicate column in order to remove it on the way back to a dataframe).
If you have dataframe with duplicate columns as
+---+---+---+---+
|sno|age|psk|psk|
+---+---+---+---+
|1  |12 |a4 |a4 |
+---+---+---+---+
You know that the last two column indexes are duplicates.
The next step is to take the column names with duplicates removed and form a schema:
val columns = df.columns.distinct // unlike toSet, this preserves the column order
val schema = StructType(columns.map(name => StructField(name, StringType, true)))
The vital part is to convert the dataframe to an RDD and drop the required column index (here the 4th):
val rdd = df.rdd.map(row => Row.fromSeq(Seq(row(0).toString, row(1).toString, row(2).toString)))
The final step is to convert the RDD back to a dataframe using the schema:
sqlContext.createDataFrame(rdd, schema).show(false)
which should give you
+---+---+---+
|sno|age|psk|
+---+---+---+
|1  |12 |a4 |
+---+---+---+
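If you would rather stay in the DataFrame API and skip the RDD round-trip, one alternative sketch renames all columns to unique temporary names and keeps the first occurrence of each original name (the _<index> suffix scheme is an assumption):
import org.apache.spark.sql.functions.col

// Give every column a unique temporary name, e.g. sno_0, age_1, psk_2, psk_3
val renamed = df.toDF(df.columns.zipWithIndex.map { case (c, i) => s"${c}_$i" }: _*)

// For each original name keep only its first occurrence, preserving column order
val keep = df.columns.zipWithIndex
  .groupBy { case (name, _) => name }
  .map { case (_, occurrences) => occurrences.head }
  .toSeq.sortBy { case (_, i) => i }
  .map { case (name, i) => col(s"${name}_$i").as(name) }

renamed.select(keep: _*).show(false)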
I hope the answer is helpful.