Let's say one has a number of files in a directory, each file being
Adjacent lines with the same date and value correspond to the same event.
Two lines in two separate files can't be adjacent.
Expected result:
where eN is an arbitrary value (e1 <> e2 <> e3 ...), used here to clarify the sample.
Does the following code provide a unique event id for all lines of all files?
case class Event(
  LineNumber: Long, var EventId: Long,
  Date: String, Value: String //,..
)
val lines = sc.textFile("theDirectory")
// Drop unwanted lines, tag each remaining line with a unique id, then split on '|'
val rows = lines.filter(l => !l.startsWith("someString")).zipWithUniqueId
  .map(l => l._2.toString +: l._1.split("""\|""", -1))
var lastValue: String = ""
var lastDate: String = "00010101"
var eventId: Long = 0
var rowDF = rows
  .map(c => {
    var e = Event(
      c(0).toLong, 0, c(1), c(2) //,...
    )
    // A change of date or value starts a new event
    if (e.Date != lastDate || e.Value != lastValue) {
      lastDate = e.Date
      lastValue = e.Value
      eventId = e.LineNumber
    }
    e.EventId = eventId
    e
  })
Basically, I use the unique line number given by zipWithUniqueId as a key for a sequence of adjacent lines.
I think my underlying question is: is there a chance that the second map operation splits the content of the files across multiple processes?
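To make the intended result concrete, here is a minimal plain-Scala sketch (no Spark) of the run-tagging logic applied to one ordered sequence of lines. The names `Line`, `Tagged` and `tagRuns` and the sample rows are made up for illustration. Keep in mind that in Spark the mutable variables captured by a `map` closure are copied into each task, so state such as `lastDate` is not shared across partitions; the sketch only shows what a sequential pass over a single file would produce.

```scala
// Plain Scala, no Spark: one id per run of adjacent lines with the same (date, value).
final case class Line(lineNumber: Long, date: String, value: String)
final case class Tagged(line: Line, eventId: Long)

def tagRuns(lines: Seq[Line]): Seq[Tagged] =
  lines.foldLeft(Vector.empty[Tagged]) { (acc, l) =>
    val id = acc.lastOption match {
      // Same date and value as the previous line: reuse its event id
      case Some(prev) if prev.line.date == l.date && prev.line.value == l.value =>
        prev.eventId
      // Otherwise this line starts a new event, keyed by its own line number
      case _ => l.lineNumber
    }
    acc :+ Tagged(l, id)
  }

// Hypothetical sample, already in file order
val sample = Seq(
  Line(0L, "20100101", "12.34"),
  Line(1L, "20100101", "12.34"),
  Line(2L, "20100101", "36.00"),
  Line(3L, "20100102", "36.00")
)
val tagged = tagRuns(sample)
```

The fold carries the previous line explicitly instead of relying on variables outside the closure, which is the part that does not carry over to a distributed `map`.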
Here is an idiomatic solution; I hope it helps. I used the file names to distinguish the files: a groupBy on the line value and the file name, followed by zipWithUniqueId and a join back to the original input DataFrame, produces the desired output.
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
scala> val lines = spark.read.textFile("file:///home/fsdjob/theDir").withColumn("filename", input_file_name())
scala> lines.show(false)
|value |filename |
scala> val linesGrpWithUid = lines.groupBy("value", "filename").count.drop("count").rdd.zipWithUniqueId
linesGrpWithUid: org.apache.spark.rdd.RDD[(org.apache.spark.sql.Row, Long)] = MapPartitionsRDD[135] at zipWithUniqueId at <console>:31
scala> val linesGrpWithIdRdd = linesGrpWithUid.map( x => { org.apache.spark.sql.Row(x._1.get(0),x._1.get(1), x._2) })
linesGrpWithIdRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[136] at map at <console>:31
scala> val schema =
| StructType(
| StructField("value", StringType, false) ::
| StructField("filename", StringType, false) ::
| StructField("id", LongType, false) ::
| Nil)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(value,StringType,false), StructField(filename,StringType,false), StructField(id,LongType,false))
scala> val linesGrpWithIdDF = spark.createDataFrame(linesGrpWithIdRdd, schema)
linesGrpWithIdDF: org.apache.spark.sql.DataFrame = [value: string, filename: string ... 1 more field]
scala> linesGrpWithIdDF.show(false)
|value |filename |id |
|20100101|12.34|file:///home/fsdjob/theDir/file2.txt|3 |
|20100101|36.00|file:///home/fsdjob/theDir/file2.txt|6 |
|20100102|36.00|file:///home/fsdjob/theDir/file2.txt|20 |
|20100102|36.00|file:///home/fsdjob/theDir/file1.txt|30 |
|20100101|14.00|file:///home/fsdjob/theDir/file1.txt|36 |
|20100101|14.00|file:///home/fsdjob/theDir/file2.txt|56 |
scala> val output = lines.join(linesGrpWithIdDF, Seq("value", "filename"))
output: org.apache.spark.sql.DataFrame = [value: string, filename: string ... 1 more field]
scala> output.show(false)
|value |filename |id |
|20100101|12.34|file:///home/fsdjob/theDir/file2.txt|3 |
|20100101|12.34|file:///home/fsdjob/theDir/file2.txt|3 |
|20100101|36.00|file:///home/fsdjob/theDir/file2.txt|6 |
|20100102|36.00|file:///home/fsdjob/theDir/file2.txt|20 |
|20100102|36.00|file:///home/fsdjob/theDir/file1.txt|30 |
|20100101|14.00|file:///home/fsdjob/theDir/file1.txt|36 |
|20100101|14.00|file:///home/fsdjob/theDir/file1.txt|36 |
|20100101|14.00|file:///home/fsdjob/theDir/file2.txt|56 |
|20100101|14.00|file:///home/fsdjob/theDir/file2.txt|56 |
I want to convert an array of String in a DataFrame to a single String, using a delimiter other than a comma and dropping the array brackets. I want the "," to be replaced with ";#". This avoids ambiguity with elements that may themselves contain ",", since it is a free-form text field. I am using Spark 1.6.
Examples below:
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Input as Dataframe:
|carLineName |
|[Avalon,CRV,Camry] |
|[Model T, Model S] |
|[Cayenne, Mustang] |
|[Pilot, Jeep] |
Desired output:
|carLineName |
|Avalon;#CRV;#Camry |
|Model T;#Model S |
|Cayenne;#Mustang |
|Pilot;# Jeep |
Current code which produces the input above:
val newCarDf = carDf.select(col("carLineName").cast("String").as("carLineName"))
You can use native function array_join (it is available since Spark 2.4):
import org.apache.spark.sql.functions.{array_join}
val l = Seq(Seq("Avalon","CRV","Camry"), Seq("Model T", "Model S"), Seq("Cayenne", "Mustang"), Seq("Pilot", "Jeep"))
val df = l.toDF("carLineName")
df.withColumn("str", array_join($"carLineName", ";#")).show()
| carLineName| str|
|[Avalon, CRV, Camry]|Avalon;#CRV;#Camry|
| [Model T, Model S]| Model T;#Model S|
| [Cayenne, Mustang]| Cayenne;#Mustang|
| [Pilot, Jeep]| Pilot;#Jeep|
You can create a user-defined function that concatenates the elements with a "#;" separator, as in the following example:
val df1 = Seq(
("1", Array("t1", "t2")),
("2", Array("t1", "t3", "t5"))
).toDF("id", "arr")
import org.apache.spark.sql.functions.{col, udf}
def formatString: Seq[String] => String = x => x.reduce(_ ++ "#;" ++ _)
def udfFormat = udf(formatString)
df1.withColumn("formated", udfFormat(col("arr"))).show()
| id| arr| formated|
| 1| [t1, t2]| t1#;t2|
| 2|[t1, t3, t5]|t1#;t3#;t5|
You could simply write a user-defined function (udf) that takes an array of String as its input parameter. Inside the udf, any operation can be performed on the array.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def toCustomString: UserDefinedFunction = udf((carLineName: Seq[String]) => {
  // Join the elements with the delimiter requested in the question
  carLineName.mkString(";#")
})
val newCarDf = df.withColumn("carLineName", toCustomString(df.col("carLineName")))
This udf could be made generic further by passing the delimiter as the second parameter.
import org.apache.spark.sql.functions.lit
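def toCustomStringWithDelimiter: UserDefinedFunction = udf((carLineName: Seq[String], delimiter: String) => {
  // Join the elements with the delimiter passed in as the second argument
  carLineName.mkString(delimiter)
})
val newCarDf = df.withColumn("carLineName", toCustomStringWithDelimiter(df.col("carLineName"), lit(";#")))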
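The joining step inside such a udf is ordinary Scala, so it can be tried without Spark. The sketch below is only an illustration; `joinWith` and the sample values are hypothetical names, not part of any answer above. It also shows one reason to prefer `mkString` over `reduce`: `reduce` throws on an empty collection, while `mkString` returns an empty string.

```scala
// Plain Scala: join the elements with an arbitrary delimiter.
def joinWith(items: Seq[String], delimiter: String): String =
  items.mkString(delimiter)

val joined = joinWith(Seq("Avalon", "CRV", "Camry"), ";#")

// A comma inside an element is left untouched, so it stays unambiguous.
val withComma = joinWith(Seq("Model T, red", "Model S"), ";#")

// An empty array gives an empty string instead of an exception.
val empty = joinWith(Seq.empty, ";#")
```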
Since you are using Spark 1.6, we can simply map each Row to a WrappedArray.
Here is how it goes.
Input :
scala> val carLineDf = Seq( (Array("Avalon","CRV","Camry")),
| (Array("Model T", "Model S")),
| (Array("Cayenne", "Mustang")),
| (Array("Pilot", "Jeep"))
| ).toDF("carLineName")
carLineDf: org.apache.spark.sql.DataFrame = [carLineName: array<string>]
Schema ::
scala> carLineDf.printSchema
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Then we just use Row.getAs to get a WrappedArray of String instead of a Row object, and we can manipulate it with the usual Scala built-ins:
scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray
scala> carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( a => a.mkString(";#")).toDF("carLineNameAsString").show(false)
|Avalon;#CRV;#Camry |
|Model T;#Model S |
|Cayenne;#Mustang |
|Pilot;#Jeep |
// An alternative using reduce
carLineDf.map( row => row.getAs[WrappedArray[String]](0)).map( r => r.reduce(_+";#"+_)).show(false)
That's it. You might have to go through the DataFrame's .rdd first; otherwise this should do.
I am working in Scala with Spark and I have a dataframe with two text columns.
Those columns have the format "term1, term2, term3,..." and I want to create a third column with the terms the two of them have in common.
For example
col1 | col2
orange, apple, melon | apple, apricot, watermelon
party, clouds, beach | black, yellow, white
The result would be
What I have done so far is to create a udf that splits the text and gets the intersection of the two columns.
val common_terms = udf((a: String, b: String) => if (a.isEmpty || b.isEmpty) {
  0
} else {
  split(a, ",").intersect(split(b, ",")).length
})
And then on my dataframe
val results = termsDF.withColumn("col3", common_terms(col("col1"), col("col2")))
But I have the following error
Error:(96, 13) type mismatch;
found : String
required: org.apache.spark.sql.Column
split(a, ",").intersect(split(b, ",")).length
I would appreciate any help, since I am new to Scala and just trying to learn from online tutorials.
val common_authors = udf((a: String, b: String) => if (a != null || b != null) {
  0
} else {
  val tempA = a.split(",")
  val tempB = b.split(",")
  if (tempA.isEmpty || tempB.isEmpty) {
    0
  } else {
    tempA.intersect(tempB).length
  }
})
After the edit, termsDF.show() runs. But if I do something like termsDF.orderBy(desc("col3")), I get a java.lang.NullPointerException.
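val common_terms = udf((a: String, b: String) => if (a.isEmpty || b.isEmpty) {
  0
} else {
  // String.split is plain Scala/Java, unlike the Spark column function split
  val tmp1 = a.split(",")
  val tmp2 = b.split(",")
  tmp1.intersect(tmp2).length
})
val results = termsDF.withColumn("col3", common_terms($"a", $"b")).show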
split(a, ",") is a Spark column function.
You are inside a udf, so you need to use String.split(), which is a plain Scala function.
After the edit: change the null check to use == instead of !=.
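Putting both remarks together, the body of the udf could look roughly like the plain-Scala function below. This is a sketch under a few assumptions of mine: the name `commonTerms` is hypothetical, terms are trimmed so that "apple" and " apple" match, and sets are used so that a repeated term counts once. Wrapping it as `udf(commonTerms _)` would be the Spark side, which is not shown here.

```scala
// Plain Scala: count the terms two comma-separated strings have in common.
def commonTerms(a: String, b: String): Int =
  if (a == null || b == null || a.isEmpty || b.isEmpty) 0 // null check uses ==
  else {
    val left = a.split(",").map(_.trim).toSet
    val right = b.split(",").map(_.trim).toSet
    left.intersect(right).size
  }

val one = commonTerms("orange, apple, melon", "apple, apricot, watermelon")
val none = commonTerms("party, clouds, beach", "black, yellow, white")
val fromNull = commonTerms(null, "apple")
```

Checking for null before touching the strings is what avoids the NullPointerException once Spark actually evaluates the column for every row.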
In Spark 2.4 SQL, you can get the same result without a UDF:
scala> val df = Seq(("orange,apple,melon","apple,apricot,watermelon"),("party,clouds,beach","black,yellow,white"), ("orange,apple,melon","apple,orange,watermelon")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> df.createOrReplaceTempView("tasos")
scala> spark.sql(""" select col1,col2, filter(split(col1,','), x -> array_contains(split(col2,','),x) ) a1 from tasos """).show(false)
|col1 |col2 |a1 |
|orange,apple,melon|apple,apricot,watermelon|[apple] |
|party,clouds,beach|black,yellow,white |[] |
|orange,apple,melon|apple,orange,watermelon |[orange, apple]|
If you want the size, then
scala> spark.sql(""" select col1,col2, filter(split(col1,','), x -> array_contains(split(col2,','),x) ) a1 from tasos """).withColumn("a1_size",size('a1)).show(false)
|col1 |col2 |a1 |a1_size|
|orange,apple,melon|apple,apricot,watermelon|[apple] |1 |
|party,clouds,beach|black,yellow,white |[] |0 |
|orange,apple,melon|apple,orange,watermelon |[orange, apple]|2 |
I have a Dataframe that I am trying to flatten. As part of the process, I want to explode it, so if I have a column of arrays, each value of the array will be used to create a separate row. For instance,
id | name | likes
1 | Luke | [baseball, soccer]
should become
id | name | likes
1 | Luke | baseball
1 | Luke | soccer
This is my code
private DataFrame explodeDataFrame(DataFrame df) {
    DataFrame resultDf = df;
    // Explode every array-typed column in turn
    for (StructField field : df.schema().fields()) {
        if (field.dataType() instanceof ArrayType) {
            resultDf = resultDf.withColumn(field.name(), org.apache.spark.sql.functions.explode(resultDf.col(field.name())));
        }
    }
    return resultDf;
}
The problem is that in my data, some of the array columns have nulls. In that case, the entire row is deleted. So this dataframe:
id | name | likes
1 | Luke | [baseball, soccer]
2 | Lucy | null
becomes
id | name | likes
1 | Luke | baseball
1 | Luke | soccer
instead of
id | name | likes
1 | Luke | baseball
1 | Luke | soccer
2 | Lucy | null
How can I explode my arrays so that I don't lose the null rows?
I am using Spark 1.5.2 and Java 8
Spark 2.2+
You can use explode_outer function:
import org.apache.spark.sql.functions.explode_outer
df.withColumn("likes", explode_outer($"likes")).show
// +---+----+--------+
// | id|name| likes|
// +---+----+--------+
// | 1|Luke|baseball|
// | 1|Luke| soccer|
// | 2|Lucy| null|
// +---+----+--------+
Spark <= 2.1
In Scala, but the Java equivalent should be almost identical (to import individual functions, use import static).
import org.apache.spark.sql.functions.{array, col, explode, lit, when}
val df = Seq(
(1, "Luke", Some(Array("baseball", "soccer"))),
(2, "Lucy", None)
).toDF("id", "name", "likes")
df.withColumn("likes", explode(
  when(col("likes").isNotNull, col("likes"))
    // If null explode an array<string> with a single null
    .otherwise(array(lit(null).cast("string")))))
The idea here is basically to replace NULL with an array(NULL) of the desired type. For complex types (a.k.a. structs) you have to provide the full schema:
import org.apache.spark.sql.types._
val dfStruct = Seq((1L, Some(Array((1, "a")))), (2L, None)).toDF("x", "y")
val st = StructType(Seq(
  StructField("_1", IntegerType, false), StructField("_2", StringType, true)
))
dfStruct.withColumn("y", explode(
  when(col("y").isNotNull, col("y"))
    .otherwise(array(lit(null).cast(st)))))
// or, with the element type given as a type string
dfStruct.withColumn("y", explode(
  when(col("y").isNotNull, col("y"))
    .otherwise(array(lit(null).cast("struct<_1:int,_2:string>")))))
If the array column was created with containsNull set to false, you should change this first (tested with Spark 2.1):
df.withColumn("array_column", $"array_column".cast(ArrayType(SomeType, true)))
You can use explode_outer() function.
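To make the behaviour concrete without a Spark session, here is a small plain-Scala model of what explode_outer does to the example from the question: a null or empty array yields one row with a null value instead of dropping the row. The `Person` class and `explodeOuterLikes` are hypothetical names used only for this illustration, with `Option` standing in for SQL null.

```scala
// Plain Scala model of explode_outer on the "likes" column.
final case class Person(id: Int, name: String, likes: Option[Seq[String]])

def explodeOuterLikes(rows: Seq[Person]): Seq[(Int, String, Option[String])] =
  rows.flatMap { p =>
    val likes: Seq[Option[String]] = p.likes match {
      case Some(xs) if xs.nonEmpty => xs.map(Option(_)) // one row per element
      case _                       => Seq(None)         // null or empty: keep one row
    }
    likes.map(like => (p.id, p.name, like))
  }

val people = Seq(
  Person(1, "Luke", Some(Seq("baseball", "soccer"))),
  Person(2, "Lucy", None)
)
val exploded = explodeOuterLikes(people)
```

A plain explode corresponds to dropping the `case _` branch, which is exactly how Lucy's row disappears in the question.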
Following up on the accepted answer, when the array elements are a complex type it can be difficult to define it by hand (e.g with large structs).
To do it automatically I wrote the following helper method:
def explodeOuter(df: Dataset[Row], columnsToExplode: List[String]) = {
  // Map each array column name to its ArrayType so the element type can be looked up
  val arrayFields = df.schema.fields
    .map(field => field.name -> field.dataType)
    .collect { case (name: String, arrayType: ArrayType) => (name, arrayType) }
    .toMap
  columnsToExplode.foldLeft(df) { (dataFrame, arrayCol) =>
    dataFrame.withColumn(arrayCol, explode(when(size(col(arrayCol)) =!= 0, col(arrayCol))
      .otherwise(array(lit(null).cast(arrayFields(arrayCol).elementType)))))
  }
}
Edit: it seems that spark 2.2 and newer have this built in.
To handle an empty map-type column (for Spark <= 2.1):
val df = List((1, Array(2, 3, 4), Map(1 -> "a")),
  (2, Array(5, 6, 7), Map(2 -> "b")),
  (3, Array[Int](), Map[Int, String]())).toDF("col1", "col2", "col3")
df.show()
df.select('col1, explode(when(size(map_keys('col3)) === 0, map(lit("null"), lit("null"))).otherwise('col3))).show()
from pyspark.sql.functions import *

def flatten_df(nested_df):
    # Promote every struct field to a top-level column named <struct>_<field>
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    flat_df = nested_df.select(flat_cols +
                               [col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    print("flatten_df_count :", flat_df.count())
    return flat_df

def explode_df(nested_df):
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct' and c[1][:5] != 'array']
    array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
    for array_col in array_cols:
        schema = nested_df.select(array_col).dtypes[0][1]
        # Replace a null array with a single-null array of the same type so the row survives explode
        nested_df = nested_df.withColumn(array_col, when(col(array_col).isNotNull(), col(array_col)).otherwise(array(lit(None)).cast(schema)))
    # Zip all array columns together and explode them in one pass
    nested_df = nested_df.withColumn("tmp", arrays_zip(*array_cols)).withColumn("tmp", explode("tmp")).select([col("tmp."+c).alias(c) for c in array_cols] + flat_cols)
    print("explode_dfs_count :", nested_df.count())
    return nested_df

new_df = flatten_df(myDf)
# Keep exploding and flattening until no array columns remain
while True:
    array_cols = [c[0] for c in new_df.dtypes if c[1][:5] == 'array']
    if len(array_cols):
        new_df = flatten_df(explode_df(new_df))
    else:
        break
This uses arrays_zip and explode to do it faster and to address the null issue.
I have a dataframe that stores the scores and labels for the various binary classification problems I have. For example:
| problem | score | label |
| a | 0.8 | true |
| a | 0.7 | true |
| a | 0.2 | false |
| b | 0.9 | false |
| b | 0.3 | true |
| b | 0.1 | false |
| ... | ... | ... |
Now my goal is to get binary evaluation metrics (take AreaUnderROC for example, see https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#binary-classification) for each problem, with end result being something like:
| problem | areaUnderROC |
| a | 0.83 |
| b | 0.68 |
| ... | ... |
I thought about doing something like:
but then I am not sure how to write getMetrics in terms of Aggregators (see https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html). Any suggestions?
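There's a module built just for binary metrics; see it in the Python docs.
This code should work:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
# BinaryClassificationMetrics expects an RDD of (score, label) pairs of floats
score_and_labels_a = df.filter("problem = 'a'").select("score", "label").rdd.map(lambda r: (float(r["score"]), float(r["label"])))
metrics_a = BinaryClassificationMetrics(score_and_labels_a)
score_and_labels_b = df.filter("problem = 'b'").select("score", "label").rdd.map(lambda r: (float(r["score"]), float(r["label"])))
metrics_b = BinaryClassificationMetrics(score_and_labels_b)
print(metrics_a.areaUnderROC, metrics_b.areaUnderROC)
# ... and so on for the other problems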
This seems to me to be the easiest way :)
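For a feel of what "one metric per problem" means, here is a plain-Scala sketch that groups the rows by problem and computes the area under the ROC curve with the pairwise (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive example has the higher score, with ties counting one half. This is a swapped-in technique for illustration, not Spark's BinaryClassificationMetrics implementation, and the names `Scored` and `auc` are made up. The data is the sample table from the question.

```scala
// Plain Scala: AUC per problem via pairwise comparison of positives and negatives.
final case class Scored(problem: String, score: Double, label: Boolean)

def auc(rows: Seq[Scored]): Option[Double] = {
  val pos = rows.filter(_.label).map(_.score)
  val neg = rows.filterNot(_.label).map(_.score)
  if (pos.isEmpty || neg.isEmpty) None // undefined without both classes
  else {
    val wins = for (p <- pos; n <- neg) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    Some(wins.sum / (pos.size.toLong * neg.size))
  }
}

val data = Seq(
  Scored("a", 0.8, true), Scored("a", 0.7, true), Scored("a", 0.2, false),
  Scored("b", 0.9, false), Scored("b", 0.3, true), Scored("b", 0.1, false)
)
val aucByProblem: Map[String, Option[Double]] =
  data.groupBy(_.problem).map { case (problem, rows) => problem -> auc(rows) }
```

The pairwise loop is quadratic in the group size, so it is only meant for small groups or as a cross-check of the per-problem results.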
Spark has very useful classes for getting metrics from binary or multiclass classification, but they are only available in the RDD-based API. So, with a little code that moves between DataFrames and RDDs, it is possible. A full example could look like the following:
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object TestMetrics {
  def main(args: Array[String]) : Unit = {
    // Master and application name are expected to come from spark-submit or the environment
    implicit val spark: SparkSession = SparkSession.builder().getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext
    // Test data with your schema
    val someData = Seq(
      Row("a",0.8, true),
      Row("a",0.7, true),
      Row("a",0.2, true),
      Row("b",0.9, true),
      Row("b",0.3, true),
      Row("b",0.1, true)
    )
    // Set your threshold to get a positive or negative
    val threshold : Double = 0.5
    import org.apache.spark.sql.functions._
    // First udf to convert probability in positives or negatives
    def _thresholdUdf(threshold: Double) : Double => Double = prob => if(prob > threshold) 1.0 else 0.0
    val thresholdUdf = udf { _thresholdUdf(threshold)}
    // Cast boolean to double
    val castToDouUdf = udf { (label: Boolean) => if(label) 1.0 else 0.0 }
    // Schema to build the dataframe
    val schema = List(StructField("problem", StringType), StructField("score", DoubleType), StructField("label", BooleanType))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(schema))
    // Apply first trans to get the double representation of all fields
    val df0 = df.withColumn("binarypredict", thresholdUdf('score)).withColumn("labelDouble", castToDouUdf('label))
    // First loop to get the 'problems list'. Maybe it would be possible to do all in one cycle
    val pbl = df0.select("problem").distinct().as[String].collect()
    // Get the RDD from dataframe and build the Array[(string, BinaryClassificationMetrics)]
    val dfList = pbl.map(a => (a, new BinaryClassificationMetrics(df0.select("problem", "binarypredict", "labelDouble").as[(String, Double, Double)]
      .filter(el => el._1 == a).map{ case (_, predict, label) => (predict, label)}.rdd)))
    // And the metrics for each 'problem' are available (here: area under ROC)
    val results = dfList.toMap.mapValues(metrics => metrics.areaUnderROC())
    val moreMetrics = dfList.toMap.map((metrics) => (metrics._1, metrics._2.scoreAndLabels))
    // Get Metrics by key, in your case the 'problem'
    results.foreach(element => println(element))
    // Score and labels
    moreMetrics.foreach(element => element._2.foreach { pr => println(s"${element._1} ${pr}") })
  }
}
I have data in one RDD and the data is as follows:
scala> c_data
res31: org.apache.spark.rdd.RDD[String] = /home/t_csv MapPartitionsRDD[26] at textFile at <console>:25
scala> c_data.count()
res29: Long = 45212
scala> c_data.take(2).foreach(println)
I want to split the data into another rdd and I am using:
scala> val csv_data = c_data.map{x=>
| val w = x.split(";")
| val age = w(0)
| val job = w(1)
| val marital_stat = w(2)
| val education = w(3)
| val default = w(4)
| val balance = w(5)
| val housing = w(6)
| val loan = w(7)
| val contact = w(8)
| val day = w(9)
| val month = w(10)
| val duration = w(11)
| val campaign = w(12)
| val pdays = w(13)
| val previous = w(14)
| val poutcome = w(15)
| val Y = w(16)
| }
that returns :
csv_data: org.apache.spark.rdd.RDD[Unit] = MapPartitionsRDD[28] at map at <console>:27
When I query csv_data it returns Array((), ...).
How can I get the data with the first row as the header and the rest as data?
What am I doing wrong?
Thanks in advance.
Your mapping function returns Unit, so you map to an RDD[Unit]. You can get a tuple of your values by changing your code to
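val csv_data = c_data.map{x=>
  val w = x.split(";")
  // ... the same field assignments as in the question ...
  val Y = w(16)
  // The last expression of the block is the value that map returns
  (w, age, job, marital_stat, education, default, balance, housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome, Y)
}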
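The same point, together with the header question, can be tried on a plain Scala collection. The sample lines below are hypothetical stand-ins for the file contents (only three fields, for brevity). On an RDD, the analogous calls would be `first()` to grab the header and `filter(_ != header)` to drop it, assuming the header text does not also occur as a data line.

```scala
// Hypothetical sample lines standing in for the contents of the text file.
val raw = Seq(
  "age;job;marital",
  "58;management;married",
  "44;technician;single"
)

// First line is the header, the rest is data.
val header: Seq[String] = raw.head.split(";").toSeq

val rows: Seq[(Int, String, String)] = raw.tail.map { line =>
  val w = line.split(";")
  // The tuple is the last expression, so it becomes the element type.
  (w(0).toInt, w(1), w(2))
}
```

If the block had ended with a `val` definition instead of the tuple, the element type would be `Unit`, which is where the `Array((), ...)` in the question comes from.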