I have a Spark DataFrame as below:
+------------------------------------------------------------------------+
| domains |
+------------------------------------------------------------------------+
|["0b3642ab5be98c852890aff03b3f83d8","4d7a5a24426749f3f17dee69e13194a9", |
| "9d0f74269019ad82ae82cc7a7f2b5d1b","0b113db8e20b2985d879a7aaa43cecf6", |
| "d095db19bd909c1deb26e0a902d5ad92","f038deb6ade0f800dfcd3138d82ae9a9", |
| "ab192f73b9db26ec2aca2b776c4398d2","ff9cf0599ae553d227e3f1078957a5d3", |
| "aa717380213450746a656fe4ff4e4072","f3346928db1c6be0682eb9307e2edf38", |
| "806a006b5e0d220c2cf714789828ecf7","9f6f8502e71c325f2a6f332a76d4bebf", |
| "c0cb38016fb603e89b160e921eced896","56ad547c6292c92773963d6e6e7d5e39"] |
+------------------------------------------------------------------------+
The column contains a list of strings. I want to convert it into an Array[String], e.g.:
Array("0b3642ab5be98c852890aff03b3f83d8","4d7a5a24426749f3f17dee69e13194a9", "9d0f74269019ad82ae82cc7a7f2b5d1b","0b113db8e20b2985d879a7aaa43cecf6", "d095db19bd909c1deb26e0a902d5ad92","f038deb6ade0f800dfcd3138d82ae9a9",
"ab192f73b9db26ec2aca2b776c4398d2","ff9cf0599ae553d227e3f1078957a5d3",
"aa717380213450746a656fe4ff4e4072","f3346928db1c6be0682eb9307e2edf38",
"806a006b5e0d220c2cf714789828ecf7","9f6f8502e71c325f2a6f332a76d4bebf",
"c0cb38016fb603e89b160e921eced896","56ad547c6292c92773963d6e6e7d5e39")
I tried the following code but I am not getting the intended results:
DF.select("domains").as[String].collect()
Instead I get this:
[Ljava.lang.String;@7535f28 ...
Any ideas how I can achieve this?
You can first explode your domains column before collecting it, as follows:
import org.apache.spark.sql.functions.{col, explode}
val result: Array[String] = DF.select(explode(col("domains"))).as[String].collect()
You can then print your result array using mkString method:
println(result.mkString("[", ", ", "]"))
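Alternatively, if you would rather collect each row's array and flatten on the driver, a sketch (assuming `domains` is typed `array<string>` and `spark.implicits._` is in scope):

```scala
import spark.implicits._

// Collect each row's array as a Seq[String], then flatten on the driver.
val result: Array[String] = DF.select("domains")
  .as[Seq[String]]
  .collect()   // Array[Seq[String]], one entry per row
  .flatten     // Array[String]
```

This avoids the shuffle-free but row-expanding explode when the DataFrame is small enough to collect anyway.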
Here you are already getting an Array[String] as expected.
[Ljava.lang.String;@7535f28 --> this is the JVM's internal type descriptor, printed by an array's default toString: [ represents an array and Ljava.lang.String represents the class java.lang.String.
If you want to print the array values as a string, you can use the .mkString() method.
import spark.implicits._
val data = Seq((Seq("0b3642ab5be98c852890aff03b3f83d8","4d7a5a24426749f3f17dee69e13194a9", "9d0f74269019ad82ae82cc7a7f2b5d1b","0b113db8e20b2985d879a7aaa43cecf6", "d095db19bd909c1deb26e0a902d5ad92","f038deb6ade0f800dfcd3138d82ae9a9")))
val df = spark.sparkContext.parallelize(data).toDF("domains")
// df: org.apache.spark.sql.DataFrame = [domains: array<string>]
val array_values = df.select("domains").as[String].collect()
// array_values: Array[String] = Array([0b3642ab5be98c852890aff03b3f83d8, 4d7a5a24426749f3f17dee69e13194a9, 9d0f74269019ad82ae82cc7a7f2b5d1b, 0b113db8e20b2985d879a7aaa43cecf6, d095db19bd909c1deb26e0a902d5ad92, f038deb6ade0f800dfcd3138d82ae9a9])
val string_value = array_values.mkString(",")
print(string_value)
// [0b3642ab5be98c852890aff03b3f83d8, 4d7a5a24426749f3f17dee69e13194a9, 9d0f74269019ad82ae82cc7a7f2b5d1b, 0b113db8e20b2985d879a7aaa43cecf6, d095db19bd909c1deb26e0a902d5ad92, f038deb6ade0f800dfcd3138d82ae9a9]
You can see the same behavior if you create a plain array:
scala> val array_values : Array[String] = Array("value1", "value2")
array_values: Array[String] = Array(value1, value2)
scala> print(array_values)
[Ljava.lang.String;@70bf2681
scala> array_values.foreach(println)
value1
value2
Related
I have a list of strings which represent column names inside a dataframe.
I want to pass the arguments from these columns to a udf. How can I do it in Spark Scala?
val actualDF = Seq(
("beatles", "help|hey jude","sad",4),
("romeo", "eres mia","old school",56)
).toDF("name", "hit_songs","genre","xyz")
val column_list: List[String] = List("hit_songs","name","genre")
// example udf
val testudf = org.apache.spark.sql.functions.udf((s1: String, s2: String) => {
// lets say I want to concat all values
})
val finalDF = actualDF.withColumn("test_res",testudf(col(column_list(0))))
From the above example, I want to pass my list column_list to the udf. I am not sure how to pass a complete list of strings representing column names, though for a single element I saw I can do it with col(column_list(0)). Please advise.
Replace
testudf(col(column_list(0)))
with
testudf(column_list.map(col): _*)
This maps each column name to a Column and expands the list as individual input arguments. Note that the udf is applied to Column values, not strings, and the number of columns passed must match the udf's parameter count.
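A minimal sketch of the whole pattern (the udf body here is a placeholder concatenation, and only as many names are taken as the udf has parameters):

```scala
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical two-argument udf; the number of columns passed
// must match the udf's parameter count.
val testudf = udf((s1: String, s2: String) => s1 + s2)

// Wrap each name in col, then expand the list as varargs.
val twoCols = column_list.take(2).map(col)
val finalDF = actualDF.withColumn("test_res", testudf(twoCols: _*))
```

The `: _*` ascription is plain Scala varargs expansion; the only Spark-specific step is mapping the names through `col`.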
hit_songs is of type Seq[String], so you need to change the first parameter of your udf to Seq[String].
scala> singersDF.show(false)
+-------+-------------+----------+
|name |hit_songs |genre |
+-------+-------------+----------+
|beatles|help|hey jude|sad |
|romeo |eres mia |old school|
+-------+-------------+----------+
scala> actualDF.show(false)
+-------+----------------+----------+
|name |hit_songs |genre |
+-------+----------------+----------+
|beatles|[help, hey jude]|sad |
|romeo |[eres mia] |old school|
+-------+----------------+----------+
scala> column_list
res27: List[String] = List(hit_songs, name)
Change your UDF as below:
// s1 is of type Seq[String]
val testudf = udf((s1:Seq[String],s2:String) => {
s1.mkString.concat(s2)
})
Applying UDF
scala> actualDF
.withColumn("test_res",testudf(col(column_list.head),col(column_list.last)))
.show(false)
+-------+----------------+----------+-------------------+
|name |hit_songs |genre |test_res |
+-------+----------------+----------+-------------------+
|beatles|[help, hey jude]|sad |helphey judebeatles|
|romeo |[eres mia] |old school|eres miaromeo |
+-------+----------------+----------+-------------------+
Without UDF
scala> actualDF.withColumn("test_res",concat_ws("",$"name",$"hit_songs")).show(false) // Without UDF.
+-------+----------------+----------+-------------------+
|name |hit_songs |genre |test_res |
+-------+----------------+----------+-------------------+
|beatles|[help, hey jude]|sad |beatleshelphey jude|
|romeo |[eres mia] |old school|romeoeres mia |
+-------+----------------+----------+-------------------+
How do I handle a delimiter that appears inside the data when loading a file using a Spark RDD?
My data looks like below:
NAME|AGE|DEP
Suresh|32|BSC
"Sathish|Kannan"|30|BE
How do I convert this data into 3 columns like below?
NAME AGE DEP
suresh 32 Bsc
Sathish|Kannan 30 BE
Here is how I tried to load the data:
scala> val rdd = sc.textFile("file:///test/Sample_dep_20.txt",2)
rdd: org.apache.spark.rdd.RDD[String] = hdfs://Hive/Sample_dep_20.txt MapPartitionsRDD[1] at textFile at <console>:27
rdd.collect.foreach(println)
101|"Sathish|Kannan"|BSC
102|Suresh|DEP
scala> val rdd2=rdd.map(x=>x.split("\""))
rdd2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:29
scala> val rdd3=rdd2.map(x=>
| {
| var strarr = scala.collection.mutable.ArrayBuffer[String]()
| for(v<-x)
| {
| if(v.startsWith("\"") && v.endsWith("\""))
| strarr +=v.replace("\"","")
| else if(v.contains(","))
| strarr ++=v.split(",")
| else
| strarr +=v
| }
| strarr
| }
| )
rdd3: org.apache.spark.rdd.RDD[scala.collection.mutable.ArrayBuffer[String]] = MapPartitionsRDD[3] at map at <console>:31
scala> rdd3.collect.foreach(println)
ArrayBuffer(101|, Sathish|Kannan, |BSC)
ArrayBuffer(102|Suresh|DEP)
Maybe you need to explicitly define " as the quote character (it is the default for the CSV reader, but maybe not in your case?). So adding .option("quote","\"") to the options when reading your CSV file should work.
scala> val inputds = Seq("Suresh|32|BSC","\"Satish|Kannan\"|30|BE").toDS()
inputds: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val outputdf = spark.read.option("header",false).option("delimiter","|").option("quote","\"").csv(inputds)
outputdf: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]
scala> outputdf.show(false)
+-------------+---+---+
|_c0 |_c1|_c2|
+-------------+---+---+
|Suresh |32 |BSC|
|Satish|Kannan|30 |BE |
+-------------+---+---+
Defining the quote character makes DataFrameReader ignore delimiters found inside quoted strings; see the Spark API doc here.
EDIT
If you want to play hard and still use plain RDDs, then try modifying your split() function like this:
val rdd2=rdd.map(x=>x.split("\\|(?=([^\"]*\"[^\"]*\")*[^\"]*$)"))
It uses positive look-ahead to ignore | delimiters found inside quotes, and saves you from doing string manipulations in your second .map.
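To see what the pattern does, here is a quick check on a sample line in plain Scala (no Spark needed):

```scala
// Split on | only when an even number of quotes follows it,
// i.e. when the delimiter sits outside a quoted field.
val pattern = "\\|(?=([^\"]*\"[^\"]*\")*[^\"]*$)"

val line = "\"Sathish|Kannan\"|30|BE"
val fields = line.split(pattern)
// fields = Array("\"Sathish|Kannan\"", "30", "BE")
```

The | inside the quoted name is not split because the remainder of the line after it contains an odd number of quotes.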
I want to convert an array of strings in a DataFrame to a string with a delimiter other than a comma, also removing the array brackets. I want the "," replaced with ";#", to avoid clashing with elements that may themselves contain "," since this is a free-form text field. I am using Spark 1.6.
Examples below:
Schema:
root
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Input as Dataframe:
+--------------------+
|carLineName |
+--------------------+
|[Avalon,CRV,Camry] |
|[Model T, Model S] |
|[Cayenne, Mustang] |
|[Pilot, Jeep] |
Desired output:
+--------------------+
|carLineName |
+--------------------+
|Avalon;#CRV;#Camry |
|Model T;#Model S |
|Cayenne;#Mustang |
|Pilot;# Jeep |
Current code which produces the input above:
val newCarDf = carDf.select(col("carLineName").cast("String").as("carLineName"))
You can use native function array_join (it is available since Spark 2.4):
import org.apache.spark.sql.functions.{array_join}
val l = Seq(Seq("Avalon","CRV","Camry"), Seq("Model T", "Model S"), Seq("Cayenne", "Mustang"), Seq("Pilot", "Jeep"))
val df = l.toDF("carLineName")
df.withColumn("str", array_join($"carLineName", ";#")).show()
+--------------------+------------------+
| carLineName| str|
+--------------------+------------------+
|[Avalon, CRV, Camry]|Avalon;#CRV;#Camry|
| [Model T, Model S]| Model T;#Model S|
| [Cayenne, Mustang]| Cayenne;#Mustang|
| [Pilot, Jeep]| Pilot;#Jeep|
+--------------------+------------------+
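Since the question mentions Spark 1.6, where array_join is not yet available, concat_ws (available since Spark 1.5 and accepting array columns) should give the same result:

```scala
import org.apache.spark.sql.functions.concat_ws

// concat_ws joins the elements of an array<string> column
// with the given separator, no UDF required.
df.withColumn("str", concat_ws(";#", $"carLineName")).show()
```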
You can create a user-defined function that concatenates the elements with the ";#" separator, as in the following example:
val df1 = Seq(
("1", Array("t1", "t2")),
("2", Array("t1", "t3", "t5"))
).toDF("id", "arr")
import org.apache.spark.sql.functions.{col, udf}
def formatString: Seq[String] => String = x => x.reduce(_ ++ ";#" ++ _)
def udfFormat = udf(formatString)
df1.withColumn("formatedColumn", udfFormat(col("arr")))
+---+------------+--------------+
| id|         arr|formatedColumn|
+---+------------+--------------+
|  1|    [t1, t2]|        t1;#t2|
|  2|[t1, t3, t5]|    t1;#t3;#t5|
+---+------------+--------------+
You could simply write a user-defined function (udf) that takes an Array of String as its input parameter. Inside the udf, any operation can be performed on the array.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def toCustomString: UserDefinedFunction = udf((carLineName: Seq[String]) => {
carLineName.mkString(";#")
})
val newCarDf = df.withColumn("carLineName", toCustomString(df.col("carLineName")))
This udf could be made generic further by passing the delimiter as the second parameter.
import org.apache.spark.sql.functions.lit
def toCustomStringWithDelimiter: UserDefinedFunction = udf((carLineName: Seq[String], delimiter: String) => {
carLineName.mkString(delimiter)
})
val newCarDf = df.withColumn("carLineName", toCustomStringWithDelimiter(df.col("carLineName"), lit(";#")))
Since you are using 1.6, we can do a simple map from Row to WrappedArray.
Here is how it goes.
Input :
scala> val carLineDf = Seq( (Array("Avalon","CRV","Camry")),
| (Array("Model T", "Model S")),
| (Array("Cayenne", "Mustang")),
| (Array("Pilot", "Jeep"))
| ).toDF("carLineName")
carLineDf: org.apache.spark.sql.DataFrame = [carLineName: array<string>]
Schema:
scala> carLineDf.printSchema
root
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Then we just use Row.getAs to get a WrappedArray of String instead of a Row object, and we can manipulate it with the usual Scala built-ins:
scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray
scala> carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( a => a.mkString(";#")).toDF("carLineNameAsString").show(false)
+-------------------+
|carLineNameAsString|
+-------------------+
|Avalon;#CRV;#Camry |
|Model T;#Model S |
|Cayenne;#Mustang |
|Pilot;#Jeep |
+-------------------+
// Even an easier alternative
carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( r => r.reduce(_+";#"+_)).show(false)
That's it. You might have to use dataframe.rdd first; otherwise this should work.
I have a table which has a column containing array like this -
Student_ID | Subject_List | New_Subject
1 | [Mat, Phy, Eng] | Chem
I want to append the new subject into the subject list and get the new list.
Creating the dataframe -
val df = sc.parallelize(Seq((1, Array("Mat", "Phy", "Eng"), "Chem"))).toDF("Student_ID","Subject_List","New_Subject")
I have tried this with UDF as follows -
def append_list = (arr: Seq[String], s: String) => {
arr :+ s
}
val append_list_UDF = udf(append_list)
val df_new = df.withColumn("New_List", append_list_UDF($"Subject_List",$"New_Subject"))
With UDF, I get the required output
Student_ID | Subject_List | New_Subject | New_List
1 | [Mat, Phy, Eng] | Chem | [Mat, Phy, Eng, Chem]
Can we do it without a udf? Thanks.
In Spark 2.4 or later a combination of array and concat should do the trick,
import org.apache.spark.sql.functions.{array, concat}
import org.apache.spark.sql.Column
def append(arr: Column, col: Column) = concat(arr, array(col))
df.withColumn("New_List", append($"Subject_List",$"New_Subject")).show
+----------+---------------+-----------+--------------------+
|Student_ID| Subject_List|New_Subject| New_List|
+----------+---------------+-----------+--------------------+
| 1|[Mat, Phy, Eng]| Chem|[Mat, Phy, Eng, C...|
+----------+---------------+-----------+--------------------+
but I wouldn't expect serious performance gains here.
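On Spark 3.4 or later there is also a dedicated array_append function for exactly this; a sketch, assuming that version:

```scala
import org.apache.spark.sql.functions.array_append

// array_append adds the element (here taken from another column)
// to the end of the array column.
df.withColumn("New_List", array_append($"Subject_List", $"New_Subject")).show()
```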
val df = Seq((1, Array("Mat", "Phy", "Eng"), "Chem"),
(2, Array("Hindi", "Bio", "Eng"), "IoT"),
(3, Array("Python", "R", "scala"), "C")).toDF("Student_ID","Subject_List","New_Subject")
df.show(false)
import org.apache.spark.sql.functions.{explode, collect_list}
val final_df = df.withColumn("exploded", explode($"Subject_List"))
  .select($"Student_ID", $"exploded")
  .union(df.select($"Student_ID", $"New_Subject"))
  .groupBy($"Student_ID")
  .agg(collect_list($"exploded") as "Your_New_List")
final_df.show(false)
var columnnames= "callStart_t,callend_t" // Timestamp column names are dynamic input.
scala> df1.show()
+------+------------+--------+----------+
| name| callStart_t|personid| callend_t|
+------+------------+--------+----------+
| Bindu|1080602418 | 2|1080602419|
|Raphel|1647964576 | 5|1647964576|
| Ram|1754536698 | 9|1754536699|
+------+------------+--------+----------+
Code which I tried:
val newDf = df1.withColumn("callStart_Time", to_utc_timestamp(from_unixtime($"callStart_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
val newDf = df1.withColumn("callend_Time", to_utc_timestamp(from_unixtime($"callend_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
Here, I don't want new columns for the conversion (from_unixtime then to_utc_timestamp); I want to convert the existing columns in place.
Example Output
+------+---------------------+--------+--------------------+
| name| callStart_t |personid| callend_t |
+------+---------------------+--------+--------------------+
| Bindu|1970-01-13 04:40:02 | 2|1970-01-13 04:40:02 |
|Raphel|1970-01-20 06:16:04 | 5|1970-01-20 06:16:04 |
| Ram|1970-01-21 11:52:16 | 9|1970-01-21 11:52:16 |
+------+---------------------+--------+--------------------+
Note: The Timestamp column names are dynamic.
How can I get each column dynamically?
Just use the same name for the column and it will replace it:
val newDf = df1.withColumn("callStart_t", to_utc_timestamp(from_unixtime($"callStart_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
val newDf = df1.withColumn("callend_t", to_utc_timestamp(from_unixtime($"callend_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
To make it dynamic, just use the relevant string. For example:
val colName = "callend_t"
val newDf = df.withColumn(colName , to_utc_timestamp(from_unixtime(col(colName)/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
For multiple columns you can do:
val columns=Seq("callend_t", "callStart_t")
val newDf = columns.foldLeft(df1){ case (curDf, colName) => curDf.withColumn(colName , to_utc_timestamp(from_unixtime(col(colName)/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))}
Note: as stated in the comments, the division by 1000 is not needed.