How to handle if delimiter appears in data in Spark RDD - Scala

How do I handle the case where my delimiter also appears inside the data when loading a file using a Spark RDD?
My data looks like below:
NAME|AGE|DEP
Suresh|32|BSC
"Sathish|Kannan"|30|BE
How can I convert this into 3 columns like below?
NAME AGE DEP
Suresh 32 BSC
Sathish|Kannan 30 BE
Here is how I tried to load the data:
scala> val rdd = sc.textFile("file:///test/Sample_dep_20.txt",2)
rdd: org.apache.spark.rdd.RDD[String] = hdfs://Hive/Sample_dep_20.txt MapPartitionsRDD[1] at textFile at <console>:27
scala> rdd.collect.foreach(println)
101|"Sathish|Kannan"|BSC
102|Suresh|DEP
scala> val rdd2=rdd.map(x=>x.split("\""))
rdd2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:29
scala> val rdd3 = rdd2.map(x =>
     |   {
     |     val strarr = scala.collection.mutable.ArrayBuffer[String]()
     |     for (v <- x) {
     |       if (v.startsWith("\"") && v.endsWith("\""))
     |         strarr += v.replace("\"", "")
     |       else if (v.contains(","))
     |         strarr ++= v.split(",")
     |       else
     |         strarr += v
     |     }
     |     strarr
     |   }
     | )
rdd3: org.apache.spark.rdd.RDD[scala.collection.mutable.ArrayBuffer[String]] = MapPartitionsRDD[3] at map at <console>:31
scala> rdd3.collect.foreach(println)
ArrayBuffer(101|, Sathish|Kannan, |BSC)
ArrayBuffer(102|Suresh|DEP)

You may need to explicitly define " as the quote character (it is the default for the CSV reader, but perhaps not in your case). Adding .option("quote","\"") to the options when reading your .csv file should work.
scala> val inputds = Seq("Suresh|32|BSC","\"Satish|Kannan\"|30|BE").toDS()
inputds: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val outputdf = spark.read.option("header",false).option("delimiter","|").option("quote","\"").csv(inputds)
outputdf: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]
scala> outputdf.show(false)
+-------------+---+---+
|_c0 |_c1|_c2|
+-------------+---+---+
|Suresh |32 |BSC|
|Satish|Kannan|30 |BE |
+-------------+---+---+
Defining the quote character makes DataFrameReader ignore delimiters found inside quoted strings; see the Spark API doc here.
EDIT
If you want to play hard and still use plain RDDs, then try modifying your split() function like this:
val rdd2=rdd.map(x=>x.split("\\|(?=([^\"]*\"[^\"]*\")*[^\"]*$)"))
It uses positive look-ahead to ignore | delimiters found inside quotes, and saves you from doing string manipulations in your second .map.
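For reference, here is a minimal sketch of applying that regex to the sample lines from the question (the quote-stripping and re-joining in the second map are only there to make the result easy to inspect):
val rdd2 = rdd.map(_.split("\\|(?=([^\"]*\"[^\"]*\")*[^\"]*$)"))
// strip the quotes that remain around quoted fields, then re-join for inspection
rdd2.map(_.map(_.replace("\"", "")).mkString(" | ")).collect.foreach(println)
// 101 | Sathish|Kannan | BSC
// 102 | Suresh | DEP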

Related

Convert column containing values as List to Array

I have a spark dataframe as below:
+------------------------------------------------------------------------+
| domains |
+------------------------------------------------------------------------+
|["0b3642ab5be98c852890aff03b3f83d8","4d7a5a24426749f3f17dee69e13194a9", |
| "9d0f74269019ad82ae82cc7a7f2b5d1b","0b113db8e20b2985d879a7aaa43cecf6", |
| "d095db19bd909c1deb26e0a902d5ad92","f038deb6ade0f800dfcd3138d82ae9a9", |
| "ab192f73b9db26ec2aca2b776c4398d2","ff9cf0599ae553d227e3f1078957a5d3", |
| "aa717380213450746a656fe4ff4e4072","f3346928db1c6be0682eb9307e2edf38", |
| "806a006b5e0d220c2cf714789828ecf7","9f6f8502e71c325f2a6f332a76d4bebf", |
| "c0cb38016fb603e89b160e921eced896","56ad547c6292c92773963d6e6e7d5e39"] |
+------------------------------------------------------------------------+
The column contains a list, and I want to convert it into an Array[String], e.g.:
Array("0b3642ab5be98c852890aff03b3f83d8","4d7a5a24426749f3f17dee69e13194a9", "9d0f74269019ad82ae82cc7a7f2b5d1b","0b113db8e20b2985d879a7aaa43cecf6", "d095db19bd909c1deb26e0a902d5ad92","f038deb6ade0f800dfcd3138d82ae9a9",
"ab192f73b9db26ec2aca2b776c4398d2","ff9cf0599ae553d227e3f1078957a5d3",
"aa717380213450746a656fe4ff4e4072","f3346928db1c6be0682eb9307e2edf38",
"806a006b5e0d220c2cf714789828ecf7","9f6f8502e71c325f2a6f332a76d4bebf",
"c0cb38016fb603e89b160e921eced896","56ad547c6292c92773963d6e6e7d5e39")
I tried the following code but I am not getting the intended results:
DF.select("domains").as[String].collect()
Instead I get this:
[Ljava.lang.String;@7535f28 ...
Any ideas how I can achieve this?
You can first explode your domains column before collecting it, as follows:
import org.apache.spark.sql.functions.{col, explode}
val result: Array[String] = DF.select(explode(col("domains"))).as[String].collect()
You can then print your result array using mkString method:
println(result.mkString("[", ", ", "]"))
Here you are actually getting the Array[String] as expected.
[Ljava.lang.String;@7535f28 --> this is the kind of type descriptor the JVM uses internally in bytecode: [ represents an array and Ljava.lang.String represents the class java.lang.String.
If you want to print the array values as a string, you can use the .mkString() function.
import spark.implicits._
val data = Seq((Seq("0b3642ab5be98c852890aff03b3f83d8","4d7a5a24426749f3f17dee69e13194a9", "9d0f74269019ad82ae82cc7a7f2b5d1b","0b113db8e20b2985d879a7aaa43cecf6", "d095db19bd909c1deb26e0a902d5ad92","f038deb6ade0f800dfcd3138d82ae9a9")))
val df = spark.sparkContext.parallelize(data).toDF("domains")
// df: org.apache.spark.sql.DataFrame = [domains: array<string>]
val array_values = df.select("domains").as[String].collect()
// array_values: Array[String] = Array([0b3642ab5be98c852890aff03b3f83d8, 4d7a5a24426749f3f17dee69e13194a9, 9d0f74269019ad82ae82cc7a7f2b5d1b, 0b113db8e20b2985d879a7aaa43cecf6, d095db19bd909c1deb26e0a902d5ad92, f038deb6ade0f800dfcd3138d82ae9a9])
val string_value = array_values.mkString(",")
print(string_value)
// [0b3642ab5be98c852890aff03b3f83d8, 4d7a5a24426749f3f17dee69e13194a9, 9d0f74269019ad82ae82cc7a7f2b5d1b, 0b113db8e20b2985d879a7aaa43cecf6, d095db19bd909c1deb26e0a902d5ad92, f038deb6ade0f800dfcd3138d82ae9a9]
You can see the same behaviour if you create a plain Scala array:
scala> val array_values : Array[String] = Array("value1", "value2")
array_values: Array[String] = Array(value1, value2)
scala> print(array_values)
[Ljava.lang.String;@70bf2681
scala> array_values.foreach(println)
value1
value2

Pass arguments to a udf from columns present in a list of strings

I have a list of strings which represents column names inside a dataframe.
I want to pass the values of these columns as arguments to a udf. How can I do this in Spark Scala?
val actualDF = Seq(
("beatles", "help|hey jude","sad",4),
("romeo", "eres mia","old school",56)
).toDF("name", "hit_songs","genre","xyz")
val column_list: List[String] = List("hit_songs","name","genre")
// example udf
val testudf = org.apache.spark.sql.functions.udf((s1: String, s2: String) => {
  // let's say I want to concatenate all values
  s1 + s2
})
val finalDF = actualDF.withColumn("test_res",testudf(col(column_list(0))))
From the above example, I want to pass my whole list column_list to a udf. I am not sure how to pass a complete list of strings representing column names, though for a single element I can do it with col(column_list(0)). Please advise.
Replace
testudf(col(column_list(0)))
with
testudf(column_list.map(col): _*)
This expands the list into multiple individual Column arguments (note that the udf itself must declare as many parameters as there are names in the list).
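If the number of columns in the list can vary, one sketch (the concatAllUdf name and the concatenation logic here are illustrative, not part of the original answer) is to wrap the columns in array so the udf always receives a single Seq[String]:
import org.apache.spark.sql.functions.{array, col, udf}

// hypothetical udf: concatenates whatever string columns it is given
val concatAllUdf = udf((values: Seq[String]) => values.mkString(" "))

// wrap every column named in column_list into one array column,
// so the udf arity no longer depends on the list's length
val finalDF = actualDF.withColumn("test_res", concatAllUdf(array(column_list.map(col): _*)))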
hit_songs is of type Seq[String], so you need to change the first parameter of your udf to Seq[String].
scala> singersDF.show(false)
+-------+-------------+----------+
|name |hit_songs |genre |
+-------+-------------+----------+
|beatles|help|hey jude|sad |
|romeo |eres mia |old school|
+-------+-------------+----------+
scala> actualDF.show(false)
+-------+----------------+----------+
|name |hit_songs |genre |
+-------+----------------+----------+
|beatles|[help, hey jude]|sad |
|romeo |[eres mia] |old school|
+-------+----------------+----------+
scala> column_list
res27: List[String] = List(hit_songs, name)
Change your UDF like below.
// s1 is of type Seq[String]
val testudf = udf((s1:Seq[String],s2:String) => {
s1.mkString.concat(s2)
})
Applying UDF
scala> actualDF
.withColumn("test_res",testudf(col(column_list.head),col(column_list.last)))
.show(false)
+-------+----------------+----------+-------------------+
|name |hit_songs |genre |test_res |
+-------+----------------+----------+-------------------+
|beatles|[help, hey jude]|sad |helphey judebeatles|
|romeo |[eres mia] |old school|eres miaromeo |
+-------+----------------+----------+-------------------+
Without UDF
scala> actualDF.withColumn("test_res",concat_ws("",$"name",$"hit_songs")).show(false) // Without UDF.
+-------+----------------+----------+-------------------+
|name |hit_songs |genre |test_res |
+-------+----------------+----------+-------------------+
|beatles|[help, hey jude]|sad |beatleshelphey jude|
|romeo |[eres mia] |old school|romeoeres mia |
+-------+----------------+----------+-------------------+

Converting Array of Strings to String with different delimiters in Spark Scala

I want to convert an array of Strings in a dataframe to a String with a delimiter other than a comma, also removing the array brackets. I want the "," to be replaced with ";#". This is to avoid confusion with elements that may contain "," themselves, since it is a freeform text field. I am using Spark 1.6.
Examples below:
Schema:
root
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Input as Dataframe:
+--------------------+
|carLineName |
+--------------------+
|[Avalon,CRV,Camry] |
|[Model T, Model S] |
|[Cayenne, Mustang] |
|[Pilot, Jeep] |
Desired output:
+--------------------+
|carLineName |
+--------------------+
|Avalon;#CRV;#Camry |
|Model T;#Model S |
|Cayenne;#Mustang |
|Pilot;# Jeep |
Current code which produces the input above:
val newCarDf = carDf.select(col("carLineName").cast("String").as("carLineName"))
You can use the native function array_join (available since Spark 2.4):
import org.apache.spark.sql.functions.{array_join}
val l = Seq(Seq("Avalon","CRV","Camry"), Seq("Model T", "Model S"), Seq("Cayenne", "Mustang"), Seq("Pilot", "Jeep"))
val df = l.toDF("carLineName")
df.withColumn("str", array_join($"carLineName", ";#")).show()
+--------------------+------------------+
| carLineName| str|
+--------------------+------------------+
|[Avalon, CRV, Camry]|Avalon;#CRV;#Camry|
| [Model T, Model S]| Model T;#Model S|
| [Cayenne, Mustang]| Cayenne;#Mustang|
| [Pilot, Jeep]| Pilot;#Jeep|
+--------------------+------------------+
You can create a user-defined function that concatenates elements with the "#;" separator, as in the following example:
val df1 = Seq(
("1", Array("t1", "t2")),
("2", Array("t1", "t3", "t5"))
).toDF("id", "arr")
import org.apache.spark.sql.functions.{col, udf}
def formatString: Seq[String] => String = x => x.reduce(_ ++ "#;" ++ _)
def udfFormat = udf(formatString)
df1.withColumn("formatedColumn", udfFormat(col("arr")))
+---+------------+----------+
| id| arr| formated|
+---+------------+----------+
| 1| [t1, t2]| t1#;t2|
| 2|[t1, t3, t5]|t1#;t3#;t5|
+---+------------+----------+
You could simply write a user-defined function (udf) that takes an Array of Strings as its input parameter. Inside the udf, any operation can be performed on the array.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def toCustomString: UserDefinedFunction = udf((carLineName: Seq[String]) => {
carLineName.mkString(";#")
})
val newCarDf = df.withColumn("carLineName", toCustomString(df.col("carLineName")))
This udf can be made more generic by passing the delimiter as a second parameter.
import org.apache.spark.sql.functions.lit
def toCustomStringWithDelimiter: UserDefinedFunction = udf((carLineName: Seq[String], delimiter: String) => {
carLineName.mkString(delimiter)
})
val newCarDf = df.withColumn("carLineName", toCustomStringWithDelimiter(df.col("carLineName"), lit(";#")))
Since you are using 1.6, we can do a simple map of Row to WrappedArray.
Here is how it goes.
Input :
scala> val carLineDf = Seq( (Array("Avalon","CRV","Camry")),
| (Array("Model T", "Model S")),
| (Array("Cayenne", "Mustang")),
| (Array("Pilot", "Jeep"))
| ).toDF("carLineName")
carLineDf: org.apache.spark.sql.DataFrame = [carLineName: array<string>]
Schema:
scala> carLineDf.printSchema
root
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Then we just use Row.getAs to get a WrappedArray of Strings instead of a Row object, and we can manipulate it with the usual Scala built-ins:
scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray
scala> carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( a => a.mkString(";#")).toDF("carLineNameAsString").show(false)
+-------------------+
|carLineNameAsString|
+-------------------+
|Avalon;#CRV;#Camry |
|Model T;#Model S |
|Cayenne;#Mustang |
|Pilot;#Jeep |
+-------------------+
// Even an easier alternative
carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( r => r.reduce(_+";#"+_)).show(false)
That's it. You might have to go through dataframe.rdd, but otherwise this should do.
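If DataFrame.map is not available directly (for example on Spark 2.x, where Dataset.map needs an encoder), here is a sketch of that dataframe.rdd variant, assuming the same spark-shell implicits are still in scope for toDF:
import scala.collection.mutable.WrappedArray

carLineDf.rdd
  .map(row => row.getAs[WrappedArray[String]](0).mkString(";#"))
  .toDF("carLineNameAsString")
  .show(false)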

How to append a string column to array string column in Scala Spark without using UDF?

I have a table which has a column containing an array, like this:
Student_ID | Subject_List | New_Subject
1 | [Mat, Phy, Eng] | Chem
I want to append the new subject into the subject list and get the new list.
Creating the dataframe -
val df = sc.parallelize(Seq((1, Array("Mat", "Phy", "Eng"), "Chem"))).toDF("Student_ID","Subject_List","New_Subject")
I have tried this with UDF as follows -
def append_list = (arr: Seq[String], s: String) => {
arr :+ s
}
val append_list_UDF = udf(append_list)
val df_new = df.withColumn("New_List", append_list_UDF($"Subject_List",$"New_Subject"))
With UDF, I get the required output
Student_ID | Subject_List | New_Subject | New_List
1 | [Mat, Phy, Eng] | Chem | [Mat, Phy, Eng, Chem]
Can we do it without a udf? Thanks.
In Spark 2.4 or later, a combination of array and concat should do the trick:
import org.apache.spark.sql.functions.{array, concat}
import org.apache.spark.sql.Column
def append(arr: Column, col: Column) = concat(arr, array(col))
df.withColumn("New_List", append($"Subject_List",$"New_Subject")).show
+----------+---------------+-----------+--------------------+
|Student_ID| Subject_List|New_Subject| New_List|
+----------+---------------+-----------+--------------------+
| 1|[Mat, Phy, Eng]| Chem|[Mat, Phy, Eng, C...|
+----------+---------------+-----------+--------------------+
but I wouldn't expect serious performance gains here.
val df = Seq((1, Array("Mat", "Phy", "Eng"), "Chem"),
(2, Array("Hindi", "Bio", "Eng"), "IoT"),
(3, Array("Python", "R", "scala"), "C")).toDF("Student_ID","Subject_List","New_Subject")
df.show(false)
val final_df = df.withColumn("exploded", explode($"Subject_List")).select($"Student_ID",$"exploded")
.union(df.select($"Student_ID",$"New_Subject"))
.groupBy($"Student_ID").agg(collect_list($"exploded") as "Your_New_List").show(false)

Conversion of RDD to DataFrame using .toDF() when CSV data is read using SparkContext (not sqlContext)

I am completely new to Spark SQL. Please help me, anyone.
My specific question is whether we can convert the RDD hospitalDataText to a DataFrame (using .toDF()), where hospitalDataText has read the CSV file using SparkContext (not sqlContext.read.csv("path")).
So why can't we write header.toDF()? If I try to convert the variable header RDD to a DataFrame, it throws the error: value toDF is not a member of String. Why? My main purpose is to view the data of the variable header RDD using the .show() function, so why am I unable to convert the RDD to a DataFrame? Please check the code given below! It looks like a double standard :'(
scala> val hospitalDataText = sc.textFile("/Users/TheBhaskarDas/Desktop/services.csv")
hospitalDataText: org.apache.spark.rdd.RDD[String] = /Users/TheBhaskarDas/Desktop/services.csv MapPartitionsRDD[39] at textFile at <console>:33
scala> val header = hospitalDataText.first() //Remove the header
header: String = uhid,locationid,doctorid,billdate,servicename,servicequantity,starttime,endtime,servicetype,servicecategory,deptname
scala> header.toDF()
<console>:38: error: value toDF is not a member of String
header.toDF()
^
scala> val hospitalData = hospitalDataText.filter(a => a != header)
hospitalData: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[40] at filter at <console>:37
scala> val m = hospitalData.toDF()
m: org.apache.spark.sql.DataFrame = [value: string]
scala> println(m)
[value: string]
scala> m.show()
+--------------------+
| value|
+--------------------+
|32d84f8b9c5193838...|
|32d84f8b9c5193838...|
|213d66cb9aae532ff...|
|222f8f1766ed4e7c6...|
|222f8f1766ed4e7c6...|
|993f608405800f97d...|
|993f608405800f97d...|
|fa14c3845a8f1f6b0...|
|6e2899a575a534a1d...|
|6e2899a575a534a1d...|
|1f1603e3c0a0db5e6...|
|508a4fbea4752771f...|
|5f33395ae7422c3cf...|
|5f33395ae7422c3cf...|
|4ef07783ce800fc5d...|
|70c13902c9c9ccd02...|
|70c13902c9c9ccd02...|
|a950feff6911ab5e4...|
|b1a0d427adfdc4f7e...|
|b1a0d427adfdc4f7e...|
+--------------------+
only showing top 20 rows
scala> m.show(1)
+--------------------+
| value|
+--------------------+
|32d84f8b9c5193838...|
+--------------------+
only showing top 1 row
scala> m.show(1,true)
+--------------------+
| value|
+--------------------+
|32d84f8b9c5193838...|
+--------------------+
only showing top 1 row
scala> m.show(1,2)
+-----+
|value|
+-----+
| 32|
+-----+
only showing top 1 row
You keep saying header is an RDD, while the output you posted clearly shows that header is a String: first() does not return an RDD. You can't use show() on a String, but you can use println.
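For completeness, a minimal sketch (assuming the same spark-shell session, where the implicits needed for toDF are already imported) of viewing the header and, if you really want to call .show() on it, wrapping it in a one-row DataFrame:
// header is a plain String, so println is enough to view it
println(header)

// toDF is defined on collections and RDDs, not on String,
// so wrap the header in a Seq first if you really want a DataFrame
Seq(header).toDF("header").show(false)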