Spark - Csv data split with scala - scala

test.csv
name,key1,key2
A,1,2
B,1,3
C,4,3
I want to change this data like this (as dataset or rdd)
whatIwant.csv
name,key,newkeyname
A,1,KEYA
A,2,KEYB
B,1,KEYA
B,3,KEYB
C,4,KEYA
C,3,KEYB
I loaded data with read method.
val df = spark.read
.option("header", true)
.option("charset", "euc-kr")
.csv(csvFilePath)
I can load each dataset like (name, key1) or (name, key2), and union them by union, but want to do this in single spark session.
Any idea of this?
Those are not working.
val df2 = df.select( df("TAG_NO"), df.map { x => (x.getAs[String]("MK_VNDRNM"), x.getAs[String]("WK_ORD_DT")) })
val df2 = df.select( df("TAG_NO"), Seq(df("TAG_NO"), df("WK_ORD_DT")))

This can be accomplished with explode and a udf:
scala> val df = Seq(("A", 1, 2), ("B", 1, 3), ("C", 4, 3)).toDF("name", "key1", "key2")
df: org.apache.spark.sql.DataFrame = [name: string, key1: int ... 1 more field]
scala> df.show
+----+----+----+
|name|key1|key2|
+----+----+----+
| A| 1| 2|
| B| 1| 3|
| C| 4| 3|
+----+----+----+
scala> val explodeUDF = udf((v1: String, v2: String) => Vector((v1, "Key1"), (v2, "Key2")))
explodeUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StructType(StructField(_1,StringType,true), StructField(_2,StringType,true)),true),Some(List(StringType, StringType)))
scala> df = df.withColumn("TMP", explode(explodeUDF($"key1", $"key2"))).drop("key1", "key2")
df: org.apache.spark.sql.DataFrame = [name: string, TMP: struct<_1: string, _2: string>]
scala> df = df.withColumn("key", $"TMP".apply("_1")).withColumn("new key name", $"TMP".apply("_2"))
df: org.apache.spark.sql.DataFrame = [name: string, TMP: struct<_1: string, _2: string> ... 2 more fields]
scala> df = df.drop("TMP")
df: org.apache.spark.sql.DataFrame = [name: string, key: string ... 1 more field]
scala> df.show
+----+---+------------+
|name|key|new key name|
+----+---+------------+
| A| 1| Key1|
| A| 2| Key2|
| B| 1| Key1|
| B| 3| Key2|
| C| 4| Key1|
| C| 3| Key2|
+----+---+------------+

Related

In Apache Spark, I have a dataframe with one column which has string (its a date) but leading zero is missing from month and day

import org.apache.spark.sql.functions.regexp_replace
val df = spark.createDataFrame(Seq(
(1, "9/11/2020"),
(2, "10/11/2020"),
(3, "1/1/2020"),
(4, "12/7/2020"))).toDF("Id", "x4")
val newDf = df
.withColumn("x4New", regexp_replace(df("x4"), "(?:(\\d{2}))/(?:(\\d{1}))/(?:(\\d{4}))", "$1/0$2/$3"))
val newDf1 = newDf
.withColumn("x4New1", regexp_replace(df("x4"), "(?:(\\d{1}))/(?:(\\d{1}))/(?:(\\d{4}))", "0$1/0$2/$3"))
.withColumn("x4New2", regexp_replace(df("x4"), "(?:(\\d{1}))/(?:(\\d{2}))/(?:(\\d{4}))", "0$1/$2/$3"))
newDf1.show
Output now
+---+----------+----------+-----------+-----------+
| Id| x4| x4New| x4New1| x4New2|
+---+----------+----------+-----------+-----------+
| 1| 9/11/2020| 9/11/2020| 9/11/2020| 09/11/2020|
| 2|10/11/2020|10/11/2020| 10/11/2020|100/11/2020|
| 3| 1/1/2020| 1/1/2020| 01/01/2020| 1/1/2020|
| 4| 12/7/2020|12/07/2020|102/07/2020| 12/7/2020|
+---+----------+----------+-----------+-----------+
`Desired Output, add a leading Zero in front of day or month is single-digit' Do not want to use a UDF for performance reasons
+---+----------+----------+
| Id| x4| date |
+---+----------+----------+
| 1| 9/11/2020|09/11/2020|
| 2|10/11/2020|10/11/2020|
| 3| 1/1/2020|01/01/2020|
| 4| 12/7/2020|12/07/2020|
+---+----------+----------+-----------+-----------+
Use from_unixtime,unix_timestamp (or) date_format,to_timestamp,(or) to_date
in built functions.
Example:(In Spark-2.4)
import org.apache.spark.sql.functions._
//sample data
val df = spark.createDataFrame(Seq((1, "9/11/2020"),(2, "10/11/2020"),(3, "1/1/2020"), (4, "12/7/2020"))).toDF("Id", "x4")
//using from_unixtime
df.withColumn("date",from_unixtime(unix_timestamp(col("x4"),"MM/dd/yyyy"),"MM/dd/yyyy")).show()
//using date_format
df.withColumn("date",date_format(to_timestamp(col("x4"),"MM/dd/yyyy"),"MM/dd/yyyy")).show()
df.withColumn("date",date_format(to_date(col("x4"),"MM/dd/yyyy"),"MM/dd/yyyy")).show()
//+---+----------+----------+
//| Id| x4| date|
//+---+----------+----------+
//| 1| 9/11/2020|09/11/2020|
//| 2|10/11/2020|10/11/2020|
//| 3| 1/1/2020|01/01/2020|
//| 4| 12/7/2020|12/07/2020|
//+---+----------+----------+
`Found a workaround, see if there is a better solution using one dataframe and no UDF'
import org.apache.spark.sql.functions.regexp_replace
val df = spark.createDataFrame(Seq(
(1, "9/11/2020"),
(2, "10/11/2020"),
(3, "1/1/2020"),
(4, "12/7/2020"))).toDF("Id", "x4")
val newDf = df.withColumn("x4New", regexp_replace(df("x4"), "(?:(\\b\\d{2}))/(?:(\\d))/(?:(\\d{4})\\b)", "$1/0$2/$3"))
val newDf1 = newDf.withColumn("x4New1", regexp_replace(newDf("x4New"), "(?:(\\b\\d{1}))/(?:(\\d))/(?:(\\d{4})\\b)", "0$1/$2/$3"))
val newDf2 = newDf1.withColumn("x4New2", regexp_replace(newDf1("x4New1"), "(?:(\\b\\d{1}))/(?:(\\d{2}))/(?:(\\d{4})\\b)", "0$1/$2/$3"))
val newDf3 = newDf2.withColumn("date", to_date(regexp_replace(newDf2("x4New2"), "(?:(\\b\\d{2}))/(?:(\\d{1}))/(?:(\\d{4})\\b)", "$1/0$2/$3"),"MM/dd/yyyy"))
val formatedDataDf = newDf3
.drop("x4New")
.drop("x4New1")
.drop("x4New2")
formatedDataDf.printSchema
formatedDataDf.show
Output looks like as follows
root
|-- Id: integer (nullable = false)
|-- x4: string (nullable = true)
|-- date: date (nullable = true)
+---+----------+----------+
| Id| x4| date|
+---+----------+----------+
| 1| 9/11/2020|2020-09-11|
| 2|10/11/2020|2020-10-11|
| 3| 1/1/2020|2020-01-01|
| 4| 12/7/2020|2020-12-07|
+---+----------+----------+

How to assign keys to items in a column in Scala?

I have the following RDD:
Col1 Col2
"abc" "123a"
"def" "783b"
"abc "674b"
"xyz" "123a"
"abc" "783b"
I need the following output where each item in each column is converted into a unique key.
for example : abc->1,def->2,xyz->3
Col1 Col2
1 1
2 2
1 3
3 1
1 2
Any help would be appreciated. Thanks!
In this case, you can rely on the hashCode of the string. The hashcode will be the same if the input and datatype is same. Try this.
scala> "abc".hashCode
res23: Int = 96354
scala> "xyz".hashCode
res24: Int = 119193
scala> val df = Seq(("abc","123a"),
| ("def","783b"),
| ("abc","674b"),
| ("xyz","123a"),
| ("abc","783b")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala>
scala> def hashc(x:String):Int =
| return x.hashCode
hashc: (x: String)Int
scala> val myudf = udf(hashc(_:String):Int)
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(StringType)))
scala> df.select(myudf('col1), myudf('col2)).show
+---------+---------+
|UDF(col1)|UDF(col2)|
+---------+---------+
| 96354| 1509487|
| 99333| 1694000|
| 96354| 1663279|
| 119193| 1509487|
| 96354| 1694000|
+---------+---------+
scala>
If you must map your columns into natural numbers starting from 1, one approach would be to apply zipWithIndex to the individual columns, add 1 to the index (as zipWithIndex always starts from 0), convert indvidual RDDs to DataFrames, and finally join the converted DataFrames for the index keys:
val rdd = sc.parallelize(Seq(
("abc", "123a"),
("def", "783b"),
("abc", "674b"),
("xyz", "123a"),
("abc", "783b")
))
val df1 = rdd.map(_._1).distinct.zipWithIndex.
map(r => (r._1, r._2 + 1)).
toDF("col1", "c1key")
val df2 = rdd.map(_._2).distinct.zipWithIndex.
map(r => (r._1, r._2 + 1)).
toDF("col2", "c2key")
val dfJoined = rdd.toDF("col1", "col2").
join(df1, Seq("col1")).
join(df2, Seq("col2"))
// +----+----+-----+-----+
// |col2|col1|c1key|c2key|
// +----+----+-----+-----+
// |783b| abc| 2| 1|
// |783b| def| 3| 1|
// |123a| xyz| 1| 2|
// |123a| abc| 2| 2|
// |674b| abc| 2| 3|
//+----+----+-----+-----+
dfJoined.
select($"c1key".as("col1"), $"c2key".as("col2")).
show
// +----+----+
// |col1|col2|
// +----+----+
// | 2| 1|
// | 3| 1|
// | 1| 2|
// | 2| 2|
// | 2| 3|
// +----+----+
Note that if you're okay with having the keys start from 0, the step of map(r => (r._1, r._2 + 1)) can be skipped in generating df1 and df2.

drop duplicate words in long string using scala

I am curious to learn how to drop duplicate words within strings that are contained in a dataframe column. I would like to accomplish it using scala.
By way of example, below you can find a dataframe I would like to transform.
dataframe:
val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
result:
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+
Using pyspark I have used the following code to get the above result. I could not rewrite such a code via scala. Do you have any suggestion? Thanking you in advance I wish you a nice day.
pyspark code:
# dataframe
l = [("66", "a,b,c,a", "4"),("67", "a,f,g,t", "0"),("70", "b,b,b,d", "4")]
#spark.createDataFrame(l).show()
df1 = spark.createDataFrame(l, ['KEY1', 'KEY2','ID'])
# function
import re
import numpy as np
# drop duplicates in a row
def drop_duplicates(row):
# split string by ', ', drop duplicates and join back
words = re.split(',',row)
return ', '.join(np.unique(words))
# drop duplicates
from pyspark.sql.functions import udf
drop_duplicates_udf = udf(drop_duplicates)
dataset2 = df1.withColumn('KEY2', drop_duplicates_udf(df1.KEY2))
dataset2.show()
Dataframe solution
scala> val df = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
df: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> val distinct :String => String = _.split(",").toSet.mkString(",")
distinct: String => String = <function1>
scala> val distinct_id = udf (distinct)
distinct_id: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.select('key1,distinct_id('key2).as("distinct"),'id).show
+----+--------+---+
|key1|distinct| id|
+----+--------+---+
| 66| a,b,c| 4|
| 67| a,f,g,t| 0|
| 70| b,d| 4|
+----+--------+---+
scala>
There could be a more optimized solution but this could help you.
val rdd2 = dataset1.rdd.map(x => x(1).toString.split(",").distinct.mkString(", "))
// and then transform it to dataset
// or
val distinctUDF = spark.udf.register("distinctUDF", (s: String) => s.split(",").distinct.mkString(", "))
dataset1.createTempView("dataset1")
spark.sql("Select KEY1, distinctUDF(KEY2), ID from dataset1").show
import org.apache.spark.sql._
val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
In spark-shell:
scala> val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
dataset1: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dataset1.show
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
scala> val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
dfUpdated: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dfUpdated.show
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+

Select column by name with multiple aggregate columns after pivot with Spark Scala

I am trying to aggregate multitple columns after a pivot in Scala Spark 2.0.1:
scala> val df = List((1, 2, 3, None), (1, 3, 4, Some(1))).toDF("a", "b", "c", "d")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 2 more fields]
scala> df.show
+---+---+---+----+
| a| b| c| d|
+---+---+---+----+
| 1| 2| 3|null|
| 1| 3| 4| 1|
+---+---+---+----+
scala> val pivoted = df.groupBy("a").pivot("b").agg(max("c"), max("d"))
pivoted: org.apache.spark.sql.DataFrame = [a: int, 2_max(`c`): int ... 3 more fields]
scala> pivoted.show
+---+----------+----------+----------+----------+
| a|2_max(`c`)|2_max(`d`)|3_max(`c`)|3_max(`d`)|
+---+----------+----------+----------+----------+
| 1| 3| null| 4| 1|
+---+----------+----------+----------+----------+
I am unable to select or rename those columns so far:
scala> pivoted.select("3_max(`d`)")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: 3_max(`d`);
scala> pivoted.select("`3_max(`d`)`")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_max(`d`)`;
scala> pivoted.select("`3_max(d)`")
org.apache.spark.sql.AnalysisException: cannot resolve '`3_max(d)`' given input columns: [2_max(`c`), 3_max(`d`), a, 2_max(`d`), 3_max(`c`)];
There must be a simple trick here, any ideas? Thanks.
Seems like a bug, the back ticks caused the problem. One fix here would be to remove the back ticks from the column names:
val pivotedNewName = pivoted.columns.foldLeft(pivoted)((df, col) =>
df.withColumnRenamed(col, col.replace("`", "")))
Now you can select by column names as normal:
pivotedNewName.select("2_max(c)").show
+--------+
|2_max(c)|
+--------+
| 3|
+--------+

Aggregations in JDBCRDD or RDD

I'm brand new in Sacla and Spark, and I'm trying to create a SQL query over SqlServer with Spark using jdbcRDD, and do some transformations on it with mappings and aggregations.
This is what I have, a Table with n String columns and m Number columns.
like
"A", "A1",1,2
"A", "A1",4,3
"A", "A2",3,4
"B", "B1",6,7
...
...
what i'm looking for is create a hierarchival structure grouping the strings and aggregating the numeric columns like
A
|->A1
|->(5,5)
|->A2
|->(3,4)
B
|->B1
|->(6,7)
I was able to create the hierarchie but I'm not able to perform the agregation on the list of numeric values.
If you're loading your data over JDBC I would simply use DataFrames:
import sqlContext.implicits._
import org.apache.spark.sql.functions.sum
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
val options: Map[(String, String)] = ???
val df: DataFrame = sqlContext.read
.format("jdbc")
.options(options)
.load()
.toDF("k1", "k2", "v1", "v2")
df.printSchema
// root
// |-- k1: string (nullable = true)
// |-- k2: string (nullable = true)
// |-- v1: integer (nullable = true)
// |-- v2: integer (nullable = true)
df.show
// +---+---+---+---+
// | k1| k2| v1| v2|
// +---+---+---+---+
// | A| A1| 1| 2|
// | A| A1| 4| 3|
// | A| A2| 3| 4|
// | B| B1| 6| 7|
// +---+---+---+---+
With input like above all you need is a basic aggregation
df
.groupBy($"k1", $"k2")
.agg(sum($"v1").alias("v1"), sum($"v2").alias("v2")).show
// +---+---+---+---+
// | k1| k2| v1| v2|
// +---+---+---+---+
// | A| A1| 5| 5|
// | A| A2| 3| 4|
// | B| B1| 6| 7|
// +---+---+---+---+
If you have RDD like this:
val rdd RDD[(String, String, Int, Int)] = ???
rdd.first
// (String, String, Int, Int) = (A,A1,1,2)
There is no reason to built complex hierarchy. Simple PairRDD should be enough:
val aggregated: RDD[((String, String), breeze.linalg.Vector[Int])] = rdd
.map{case (k1, k2, v1, v2) => ((k1, k2), breeze.linalg.Vector(v1, v2))}
.reduceByKey(_ + _)
aggregated.first
// ((String, String), breeze.linalg.Vector[Int]) = ((A,A2),DenseVector(3, 4))
Keeping hierarchical structure is ineffective but you can group above RDD like this:
aggregated.map{case ((k1, k2), v) => (k1, (k2, v))}.groupByKey