Drop duplicate words in a long string using Scala

I would like to learn how to drop duplicate words within strings that are contained in a DataFrame column, using Scala.
By way of example, below you can find a dataframe I would like to transform.
dataframe:
val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
result:
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+
Using PySpark, I used the following code to get the above result, but I could not rewrite it in Scala. Do you have any suggestions? Thank you in advance, and have a nice day.
pyspark code:
# dataframe
l = [("66", "a,b,c,a", "4"),("67", "a,f,g,t", "0"),("70", "b,b,b,d", "4")]
#spark.createDataFrame(l).show()
df1 = spark.createDataFrame(l, ['KEY1', 'KEY2','ID'])
# function
import re
import numpy as np
# drop duplicates in a row
def drop_duplicates(row):
    # split string by ',', drop duplicates and join back with ', '
    words = re.split(',', row)
    return ', '.join(np.unique(words))
# drop duplicates
from pyspark.sql.functions import udf
drop_duplicates_udf = udf(drop_duplicates)
dataset2 = df1.withColumn('KEY2', drop_duplicates_udf(df1.KEY2))
dataset2.show()

Dataframe solution
scala> val df = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
df: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> val distinct :String => String = _.split(",").toSet.mkString(",")
distinct: String => String = <function1>
scala> val distinct_id = udf (distinct)
distinct_id: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.select('key1,distinct_id('key2).as("distinct"),'id).show
+----+--------+---+
|key1|distinct| id|
+----+--------+---+
| 66| a,b,c| 4|
| 67| a,f,g,t| 0|
| 70| b,d| 4|
+----+--------+---+
scala>
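One detail worth checking in the UDF above: it goes through toSet, and a Set gives no guarantee about element order, and it joins with "," rather than ", " as in the requested result. The following is a minimal sketch in plain Scala (no Spark needed) that compares it with distinct, which keeps the first occurrence of each word in input order. The names DedupDemo, viaSet and viaDistinct are hypothetical and only used for illustration.

```scala
object DedupDemo {
  // Same logic as the UDF above: a Set carries no ordering guarantee.
  def viaSet(s: String): String = s.split(",").toSet.mkString(",")

  // distinct keeps the first occurrence of each word, in input order.
  def viaDistinct(s: String): String = s.split(",").distinct.mkString(", ")
}
```

Either function can be wrapped with udf(...) in the same way as distinct in the transcript above.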

There could be a more optimized solution but this could help you.
val rdd2 = dataset1.rdd.map(x => x(1).toString.split(",").distinct.mkString(", "))
// and then transform it to dataset
// or
val distinctUDF = spark.udf.register("distinctUDF", (s: String) => s.split(",").distinct.mkString(", "))
dataset1.createTempView("dataset1")
spark.sql("Select KEY1, distinctUDF(KEY2), ID from dataset1").show
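If KEY2 can contain nulls or stray spaces (an assumption, the sample data has neither), the lambda passed to the UDF would throw on a null input. A possible null-safe variant of the same per-row logic, sketched in plain Scala; SafeDedup and dedupe are hypothetical names.

```scala
object SafeDedup {
  // Returns null for a null input so the function can be wrapped in a UDF
  // without throwing on missing values (assumption: nulls may occur in KEY2).
  def dedupe(s: String): String =
    Option(s)
      .map(_.split(",").map(_.trim).filter(_.nonEmpty).distinct.mkString(", "))
      .orNull
}
```

It could then be registered in place of the inline lambda, for example as spark.udf.register("distinctUDF", SafeDedup.dedupe _).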

import org.apache.spark.sql._
val dfUpdated = dataset1.rdd.map {
  case Row(x: String, y: String, z: String) => (x, y.split(",").distinct.mkString(", "), z)
}.toDF(dataset1.columns: _*)
In spark-shell:
scala> val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
dataset1: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dataset1.show
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
scala> val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
dfUpdated: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dfUpdated.show
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+

Related

How to assign keys to items in a column in Scala?

I have the following RDD:
Col1 Col2
"abc" "123a"
"def" "783b"
"abc" "674b"
"xyz" "123a"
"abc" "783b"
I need the following output where each item in each column is converted into a unique key.
for example : abc->1,def->2,xyz->3
Col1 Col2
1 1
2 2
1 3
3 1
1 2
Any help would be appreciated. Thanks!
In this case, you can rely on the hashCode of the string. The hash code is always the same for the same input string. Try this:
scala> "abc".hashCode
res23: Int = 96354
scala> "xyz".hashCode
res24: Int = 119193
scala> val df = Seq(("abc","123a"),
| ("def","783b"),
| ("abc","674b"),
| ("xyz","123a"),
| ("abc","783b")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala>
scala> def hashc(x:String):Int =
| return x.hashCode
hashc: (x: String)Int
scala> val myudf = udf(hashc(_:String):Int)
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(StringType)))
scala> df.select(myudf('col1), myudf('col2)).show
+---------+---------+
|UDF(col1)|UDF(col2)|
+---------+---------+
| 96354| 1509487|
| 99333| 1694000|
| 96354| 1663279|
| 119193| 1509487|
| 96354| 1694000|
+---------+---------+
scala>
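One caveat with the hashCode approach: equal strings always give equal hash codes, but different strings can also share one, so the resulting keys are not guaranteed to be unique. A small plain-Scala check; HashKeyDemo is a hypothetical name used only for this sketch.

```scala
object HashKeyDemo {
  // The same string always yields the same hash code.
  val stable: Boolean = "abc".hashCode == "abc".hashCode

  // Different strings can still share a hash code, so hash-based keys
  // are not guaranteed to be unique.
  val collision: Boolean = "Aa".hashCode == "BB".hashCode
}
```

Both values evaluate to true, since "Aa" and "BB" both hash to 2112.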
If you must map your columns into natural numbers starting from 1, one approach would be to apply zipWithIndex to the individual columns, add 1 to the index (as zipWithIndex always starts from 0), convert the individual RDDs to DataFrames, and finally join the converted DataFrames for the index keys:
val rdd = sc.parallelize(Seq(
("abc", "123a"),
("def", "783b"),
("abc", "674b"),
("xyz", "123a"),
("abc", "783b")
))
val df1 = rdd.map(_._1).distinct.zipWithIndex.
map(r => (r._1, r._2 + 1)).
toDF("col1", "c1key")
val df2 = rdd.map(_._2).distinct.zipWithIndex.
map(r => (r._1, r._2 + 1)).
toDF("col2", "c2key")
val dfJoined = rdd.toDF("col1", "col2").
join(df1, Seq("col1")).
join(df2, Seq("col2"))
// +----+----+-----+-----+
// |col2|col1|c1key|c2key|
// +----+----+-----+-----+
// |783b| abc| 2| 1|
// |783b| def| 3| 1|
// |123a| xyz| 1| 2|
// |123a| abc| 2| 2|
// |674b| abc| 2| 3|
// +----+----+-----+-----+
dfJoined.
select($"c1key".as("col1"), $"c2key".as("col2")).
show
// +----+----+
// |col1|col2|
// +----+----+
// | 2| 1|
// | 3| 1|
// | 1| 2|
// | 2| 2|
// | 2| 3|
// +----+----+
Note that if you're okay with having the keys start from 0, the step of map(r => (r._1, r._2 + 1)) can be skipped in generating df1 and df2.
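To see the zipWithIndex idea in isolation, here is a minimal sketch on local Scala collections (no Spark). On a local Seq, distinct keeps first-appearance order, so the keys match the ones asked for in the question; on an RDD, distinct does not preserve order, which is why the numbers in the output above differ. KeyAssignDemo and keysFor are hypothetical names.

```scala
object KeyAssignDemo {
  val rows: Seq[(String, String)] = Seq(
    ("abc", "123a"), ("def", "783b"), ("abc", "674b"),
    ("xyz", "123a"), ("abc", "783b"))

  // Map each distinct value to a 1-based key, in order of first appearance.
  def keysFor(values: Seq[String]): Map[String, Long] =
    values.distinct.zipWithIndex.map { case (v, i) => (v, i + 1L) }.toMap

  val col1Keys: Map[String, Long] = keysFor(rows.map(_._1))
  val col2Keys: Map[String, Long] = keysFor(rows.map(_._2))

  // Replace every value by its key (the local equivalent of the two joins).
  val keyed: Seq[(Long, Long)] =
    rows.map { case (c1, c2) => (col1Keys(c1), col2Keys(c2)) }
}
```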

how to concat multiple columns in spark while getting the column names to be concatenated from another table (different for each row)

I am trying to concat multiple columns in spark using concat function.
For example below is the table for which I have to add new concatenated column
table - **t**
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| b|
+---+----+
and below is the table which has the information about which columns are to be concatenated for given id (for id 1 column id and name needs to be concatenated and for id 2 only id)
table - **r**
+---+-------+
| id| att |
+---+-------+
| 1|id,name|
| 2| id |
+---+-------+
If I join the two tables and do something like the following, I am able to concatenate, but not according to table r: the new column has 1,a for the first row, but for the second row it should be only 2.
t.withColumn("new",concat_ws(",",t.select("att").first.mkString.split(",").map(c => col(c)): _*)).show
+---+----+-------+---+
| id|name| att |new|
+---+----+-------+---+
| 1| a|id,name|1,a|
| 2| b| id |2,b|
+---+----+-------+---+
I have to apply filter before the select in the above query, but I am not sure how to do that in withColumn for each row.
Something like below, if that is possible.
t.withColumn("new",concat_ws(",",t.filter("id="+this.id).select("att").first.mkString.split(",").map(c => col(c)): _*)).show
As it will require to filter each row based on the id.
scala> t.filter("id=1").select("att").first.mkString.split(",").map(c => col(c))
res90: Array[org.apache.spark.sql.Column] = Array(id, name)
scala> t.filter("id=2").select("att").first.mkString.split(",").map(c => col(c))
res89: Array[org.apache.spark.sql.Column] = Array(id)
Below is the final required result.
+---+----+-------+---+
| id|name| att |new|
+---+----+-------+---+
| 1| a|id,name|1,a|
| 2| b| id |2 |
+---+----+-------+---+
We can use a UDF.
Requirement for this logic to work:
The column names of table t should be in the same order as they appear in column att of table r.
scala> input_df_1.show
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| b|
+---+----+
scala> input_df_2.show
+---+-------+
| id| att|
+---+-------+
| 1|id,name|
| 2| id|
+---+-------+
scala> val join_df = input_df_1.join(input_df_2,Seq("id"),"inner")
join_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> val req_cols = input_df_1.columns
req_cols: Array[String] = Array(id, name)
scala> def new_col_udf = udf((cols : Seq[String],row : String,attr : String) => {
| val row_values = row.split(",")
| val attrs = attr.split(",")
| val req_val = attrs.map{at =>
| val index = cols.indexOf(at)
| row_values(index)
| }
| req_val.mkString(",")
| })
new_col_udf: org.apache.spark.sql.expressions.UserDefinedFunction
scala> val intermediate_df = join_df.withColumn("concat_column",concat_ws(",",'id,'name)).withColumn("new_col",new_col_udf(lit(req_cols),'concat_column,'att))
intermediate_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 3 more fields]
scala> val result_df = intermediate_df.select('id,'name,'att,'new_col)
result_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 2 more fields]
scala> result_df.show
+---+----+-------+-------+
| id|name| att|new_col|
+---+----+-------+-------+
| 1| a|id,name| 1,a|
| 2| b| id| 2|
+---+----+-------+-------+
Hope it answers your question.
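The UDF above splits the concatenated row on ",", so it relies on no value containing a comma itself. As a sketch of the same per-row selection without that assumption, the row can be modelled as a map from column name to value; this is plain Scala, and ConcatByAttr and pick are hypothetical names, not part of the answer above.

```scala
object ConcatByAttr {
  // row: column name -> value (as strings); att: comma-separated column names.
  // Names in att that are not present in the row are skipped.
  def pick(row: Map[String, String], att: String): String =
    att.split(",").toSeq.map(_.trim).flatMap(name => row.get(name)).mkString(",")
}
```

For the two sample rows this yields "1,a" for att "id,name" and "2" for att "id".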
This may be done in a UDF:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array, col, udf}
// dataFrame is the joined table, i.e. it contains the att column
val names: Seq[String] = dataFrame.columns.toSeq
// cast every column to string so they can be packed into one array
val cols: Seq[Column] = names.map(c => col(c).cast("string"))
val generateNew = udf((values: Seq[String]) => {
  // column names listed in att for this row
  val att = values(names.indexOf("att")).split(",").map(_.trim)
  // keep only the values of the listed columns, in the order given by att
  att.map(a => values(names.indexOf(a))).mkString(",")
})
val dfColumns = array(cols: _*)
val dNew = dataFrame.withColumn("new", generateNew(dfColumns))
This is just a sketch, but the idea is that you can pass a sequence of items to the user defined function, and select the ones that are needed dynamically.
Note that there are other collection and map types that you can pass to a UDF; for example, see "How to pass array to UDF".

Select column by name with multiple aggregate columns after pivot with Spark Scala

I am trying to aggregate multiple columns after a pivot in Scala Spark 2.0.1:
scala> val df = List((1, 2, 3, None), (1, 3, 4, Some(1))).toDF("a", "b", "c", "d")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 2 more fields]
scala> df.show
+---+---+---+----+
| a| b| c| d|
+---+---+---+----+
| 1| 2| 3|null|
| 1| 3| 4| 1|
+---+---+---+----+
scala> val pivoted = df.groupBy("a").pivot("b").agg(max("c"), max("d"))
pivoted: org.apache.spark.sql.DataFrame = [a: int, 2_max(`c`): int ... 3 more fields]
scala> pivoted.show
+---+----------+----------+----------+----------+
| a|2_max(`c`)|2_max(`d`)|3_max(`c`)|3_max(`d`)|
+---+----------+----------+----------+----------+
| 1| 3| null| 4| 1|
+---+----------+----------+----------+----------+
I am unable to select or rename those columns so far:
scala> pivoted.select("3_max(`d`)")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: 3_max(`d`);
scala> pivoted.select("`3_max(`d`)`")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_max(`d`)`;
scala> pivoted.select("`3_max(d)`")
org.apache.spark.sql.AnalysisException: cannot resolve '`3_max(d)`' given input columns: [2_max(`c`), 3_max(`d`), a, 2_max(`d`), 3_max(`c`)];
There must be a simple trick here, any ideas? Thanks.
This seems like a bug; the backticks in the column names cause the problem. One fix here would be to remove the backticks from the column names:
val pivotedNewName = pivoted.columns.foldLeft(pivoted)((df, col) =>
df.withColumnRenamed(col, col.replace("`", "")))
Now you can select by column names as normal:
pivotedNewName.select("2_max(c)").show
+--------+
|2_max(c)|
+--------+
| 3|
+--------+
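The renaming step only touches the column names, so its effect can be checked without Spark. A minimal sketch, assuming the column list shown in the error message above; ColumnNames and sanitize are hypothetical names.

```scala
object ColumnNames {
  // Strip backticks so a name can be passed to select(...) as a plain string.
  def sanitize(name: String): String = name.replace("`", "")

  // Column names as produced by the pivot in the question.
  val pivotedColumns: Seq[String] =
    Seq("a", "2_max(`c`)", "2_max(`d`)", "3_max(`c`)", "3_max(`d`)")

  val renamed: Seq[String] = pivotedColumns.map(sanitize)
}
```

The renamed list stays free of duplicates, so each column can still be selected unambiguously after the foldLeft.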

Spark - Csv data split with scala

test.csv
name,key1,key2
A,1,2
B,1,3
C,4,3
I want to change this data like this (as dataset or rdd)
whatIwant.csv
name,key,newkeyname
A,1,KEYA
A,2,KEYB
B,1,KEYA
B,3,KEYB
C,4,KEYA
C,3,KEYB
I loaded data with read method.
val df = spark.read
.option("header", true)
.option("charset", "euc-kr")
.csv(csvFilePath)
I can load each dataset like (name, key1) or (name, key2), and union them by union, but want to do this in single spark session.
Any idea of this?
The following attempts do not work:
val df2 = df.select( df("TAG_NO"), df.map { x => (x.getAs[String]("MK_VNDRNM"), x.getAs[String]("WK_ORD_DT")) })
val df2 = df.select( df("TAG_NO"), Seq(df("TAG_NO"), df("WK_ORD_DT")))
This can be accomplished with explode and a udf:
scala> var df = Seq(("A", 1, 2), ("B", 1, 3), ("C", 4, 3)).toDF("name", "key1", "key2")
df: org.apache.spark.sql.DataFrame = [name: string, key1: int ... 1 more field]
scala> df.show
+----+----+----+
|name|key1|key2|
+----+----+----+
| A| 1| 2|
| B| 1| 3|
| C| 4| 3|
+----+----+----+
scala> val explodeUDF = udf((v1: String, v2: String) => Vector((v1, "Key1"), (v2, "Key2")))
explodeUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StructType(StructField(_1,StringType,true), StructField(_2,StringType,true)),true),Some(List(StringType, StringType)))
scala> df = df.withColumn("TMP", explode(explodeUDF($"key1", $"key2"))).drop("key1", "key2")
df: org.apache.spark.sql.DataFrame = [name: string, TMP: struct<_1: string, _2: string>]
scala> df = df.withColumn("key", $"TMP".apply("_1")).withColumn("new key name", $"TMP".apply("_2"))
df: org.apache.spark.sql.DataFrame = [name: string, TMP: struct<_1: string, _2: string> ... 2 more fields]
scala> df = df.drop("TMP")
df: org.apache.spark.sql.DataFrame = [name: string, key: string ... 1 more field]
scala> df.show
+----+---+------------+
|name|key|new key name|
+----+---+------------+
| A| 1| Key1|
| A| 2| Key2|
| B| 1| Key1|
| B| 3| Key2|
| C| 4| Key1|
| C| 3| Key2|
+----+---+------------+
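The explode step amounts to a flatMap that turns each input row into two output rows. A minimal sketch of that reshaping on a local collection, using the KEYA/KEYB labels from the question rather than Key1/Key2; WideToLong and unpivoted are hypothetical names.

```scala
object WideToLong {
  val rows: Seq[(String, Int, Int)] = Seq(("A", 1, 2), ("B", 1, 3), ("C", 4, 3))

  // Each input row produces two output rows, one per key column.
  val unpivoted: Seq[(String, Int, String)] = rows.flatMap {
    case (name, key1, key2) => Seq((name, key1, "KEYA"), (name, key2, "KEYB"))
  }
}
```

The same function body should also work on a typed Dataset via flatMap, assuming spark.implicits._ is in scope for the tuple encoder.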

Aggregations in JDBCRDD or RDD

I'm brand new to Scala and Spark, and I'm trying to create a SQL query over SQL Server with Spark using JdbcRDD, and do some transformations on it with mappings and aggregations.
This is what I have, a Table with n String columns and m Number columns.
like
"A", "A1",1,2
"A", "A1",4,3
"A", "A2",3,4
"B", "B1",6,7
...
...
What I'm looking for is to create a hierarchical structure, grouping by the strings and aggregating the numeric columns, like
A
|->A1
|->(5,5)
|->A2
|->(3,4)
B
|->B1
|->(6,7)
I was able to create the hierarchy, but I'm not able to perform the aggregation on the list of numeric values.
If you're loading your data over JDBC I would simply use DataFrames:
import sqlContext.implicits._
import org.apache.spark.sql.functions.sum
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
val options: Map[String, String] = ???
val df: DataFrame = sqlContext.read
.format("jdbc")
.options(options)
.load()
.toDF("k1", "k2", "v1", "v2")
df.printSchema
// root
// |-- k1: string (nullable = true)
// |-- k2: string (nullable = true)
// |-- v1: integer (nullable = true)
// |-- v2: integer (nullable = true)
df.show
// +---+---+---+---+
// | k1| k2| v1| v2|
// +---+---+---+---+
// | A| A1| 1| 2|
// | A| A1| 4| 3|
// | A| A2| 3| 4|
// | B| B1| 6| 7|
// +---+---+---+---+
With input like the above, all you need is a basic aggregation:
df
.groupBy($"k1", $"k2")
.agg(sum($"v1").alias("v1"), sum($"v2").alias("v2")).show
// +---+---+---+---+
// | k1| k2| v1| v2|
// +---+---+---+---+
// | A| A1| 5| 5|
// | A| A2| 3| 4|
// | B| B1| 6| 7|
// +---+---+---+---+
If you have RDD like this:
val rdd: RDD[(String, String, Int, Int)] = ???
rdd.first
// (String, String, Int, Int) = (A,A1,1,2)
There is no reason to build a complex hierarchy. A simple PairRDD should be enough:
val aggregated: RDD[((String, String), breeze.linalg.Vector[Int])] = rdd
.map{case (k1, k2, v1, v2) => ((k1, k2), breeze.linalg.Vector(v1, v2))}
.reduceByKey(_ + _)
aggregated.first
// ((String, String), breeze.linalg.Vector[Int]) = ((A,A2),DenseVector(3, 4))
Keeping a hierarchical structure is inefficient, but you can group the RDD above like this:
aggregated.map{case ((k1, k2), v) => (k1, (k2, v))}.groupByKey
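The two steps (sum per key pair, then nest by the first key) can be tried on a local collection first. This is a minimal sketch using plain tuples instead of breeze vectors, with the sample rows from the question; HierarchyDemo is a hypothetical name.

```scala
object HierarchyDemo {
  val rows: Seq[(String, String, Int, Int)] = Seq(
    ("A", "A1", 1, 2), ("A", "A1", 4, 3), ("A", "A2", 3, 4), ("B", "B1", 6, 7))

  // Step 1: sum the numeric columns per (k1, k2), like reduceByKey above.
  val aggregated: Map[(String, String), (Int, Int)] =
    rows
      .groupBy { case (k1, k2, _, _) => (k1, k2) }
      .map { case (key, group) =>
        key -> group.map(r => (r._3, r._4)).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
      }

  // Step 2: nest the result by the first key, like groupByKey above.
  val hierarchy: Map[String, Map[String, (Int, Int)]] =
    aggregated
      .groupBy { case ((k1, _), _) => k1 }
      .map { case (k1, entries) =>
        k1 -> entries.map { case ((_, k2), sums) => k2 -> sums }
      }
}
```

For the sample rows, hierarchy maps A to A1 -> (5,5) and A2 -> (3,4), and B to B1 -> (6,7), which is the structure drawn in the question.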