How to Handle ValueError from Dataframe use Scala - scala

I am developing Spark using Scala, and I don't have any background of Scala. I don't get the ValueError Yet, but I am preparing the ValueError Handler for my code.
|location|arrDate|deptDate|
|JFK |1201 |1209 |
|LAX |1208 |1212 |
|NYC | |1209 |
|22 |1201 |1209 |
|SFO |1202 |1209 |
If we have data like this, I would like to store Third row and Fourth row into Error.dat then process the fifth row again. In the error log, I would like to put the information of the data such as which file, the number of the row, and details of error. For logger, I am using log4j now.
What is the best way to implement that function? Can you guys help me?

I am assuming all the three columns are type String. in that case I would solve this using the below snippet. I have created two udf to check for the error records.
if a field is has only numeric characters [isNumber]
and if the string field is empty [isEmpty]
code snippet
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.udf
val df = rdd.zipWithIndex.map({case ((x,y,z),index) => (index+1,x,y,z)}).toDF("row_num", "c1", "c2", "c3")
val isNumber = udf((x: String) => x.replaceAll("\\d","") == "")
val isEmpty = udf((x: String) => x.trim.length==0)
val errDF = df.filter(isNumber($"c1") || isEmpty($"c2"))
val validDF = df.filter(!(isNumber($"c1") || isEmpty($"c2")))
scala> df.show()
+-------+---+-----+-----+
|row_num| c1| c2| c3|
+-------+---+-----+-----+
| 1|JFK| 1201| 1209|
| 2|LAX| 1208| 1212|
| 3|NYC| | 1209|
| 4| 22| 1201| 1209|
| 5|SFO| 1202| 1209|
+-------+---+-----+-----+
scala> errDF.show()
+-------+---+----+----+
|row_num| c1| c2| c3|
+-------+---+----+----+
| 3|NYC| |1209|
| 4| 22|1201|1209|
+-------+---+----+----+

Related

How to efficiently select dataframe columns containing a certain value in Spark?

Suppose you have a dataframe in spark (string type) and you want to drop any column that contains "foo". In the example dataframe below, you would drop column "c2" and "c3" but keep "c1". However I'd like the solution to generalize to large numbers of columns and rows.
+-------------------+
| c1| c2| c3|
+-------------------+
| this| foo| hello|
| that| bar| world|
|other| baz| foobar|
+-------------------+
My solution is to scan every column in the dataframe then aggregate the results using the dataframe API and built in functions.
So, scanning each column could be done like this (I'm new to scala please excuse syntax mistakes):
df = df.select(df.columns.map(c => col(c).like("foo"))
Logically, I would have an intermediate dataframe like this:
+--------------------+
| c1| c2| c3|
+--------------------+
| false| true| false|
| false| false| false|
| false| false| true|
+--------------------+
Which would then be aggregated into a single row to read off which columns need to be dropped.
exprs = df.columns.map( c => max(c).alias(c))
drop = df.agg(exprs.head, exprs.tail: _*)
+--------------------+
| c1| c2| c3|
+--------------------+
| false| true| true|
+--------------------+
Now any column containing true can be dropped.
My question is: Is there better way to do this, performance wise? In this case, does spark stop scanning a column once it finds "foo"? Does it matter how data is stored (would parquet help?).
Thanks, I'm new here so please tell my how the question can be improved.
Depending on your data, for example, if you have a lot of foo values, the code below may perform more efficiently:
val colsToDrop = df.columns.filter{ c =>
!df.where(col(c).like("foo")).limit(1).isEmpty
}
df.drop(colsToDrop: _*)
UPDATE: Removed redundant .limit(1):
val colsToDrop = df.columns.filter{ c =>
!df.where(col(c).like("foo")).isEmpty
}
df.drop(colsToDrop: _*)
An answer following your logic (worked out correctly), but I think the other answer is better, more so for posterity and your improved ability with Scala. I am not sure the other answer is in fact performant, but neither is this. Not sure if parquet would help, difficult to gauge.
The other option is to write a loop on the driver and access every
column and then parquet would be of use due to columnar, stats and
push down.
import org.apache.spark.sql.functions._
def myUDF = udf((cols: Seq[String], cmp: String) => cols.map(code => if (code == cmp) true else false ))
val df = sc.parallelize(Seq(
("foo", "abc", "sss"),
("bar", "fff", "sss"),
("foo", "foo", "ddd"),
("bar", "ddd", "ddd")
)).toDF("a", "b", "c")
val res = df.select($"*", array(df.columns.map(col): _*).as("colN"))
.withColumn( "colres", myUDF( col("colN") , lit("foo") ) )
res.show()
res.printSchema()
val n = 3
val res2 = res.select( (0 until n).map(i => col("colres")(i).alias(s"c${i+1}")): _*)
res2.show(false)
val exprs = res2.columns.map( c => max(c).alias(c))
val drop = res2.agg(exprs.head, exprs.tail: _*)
drop.show(false)

substring from lastIndexOf in spark scala

I have a column in my dataframe which contains the filename
test_1_1_1_202012010101101
I want to get the string after the lastIndexOf(_)
I tried this and it is working
val timestamp_df =file_name_df.withColumn("timestamp",split(col("filename"),"_").getItem(4))
But I want to make it more generic, so that if in future if the filename can have any number of _ in it, it can split it on the basis of lastIndexOf _
val timestamp_df =file_name_df.withColumn("timestamp", expr("substring(filename, length(filename)-15,17)"))
This also is not generic as the character length can vary.
Can anyone help me in using the lastIndexOf function with withColumn.
You can use element_at function with split to get last element of array.
Example:
df.withColumn("timestamp",element_at(split(col("filename"),"_"),-1)).show(false)
+--------------------------+---------------+
|filename |timestamp |
+--------------------------+---------------+
|test_1_1_1_202012010101101|202012010101101|
+--------------------------+---------------+
You can use substring_index
scala> val df = Seq(("a-b-c", 1),("d-ef-foi",2)).toDF("c1","c2")
df: org.apache.spark.sql.DataFrame = [c1: string, c2: int]
+--------+---+
| c1| c2|
+--------+---+
| a-b-c| 1|
|d-ef-foi| 2|
+--------+---+
scala> df.withColumn("c3", substring_index(col("c1"), "-", -1)).show
+--------+---+---+
| c1| c2| c3|
+--------+---+---+
| a-b-c| 1| c|
|d-ef-foi| 2|foi|
+--------+---+---+
Per docs: When the last argument "is negative, everything to the right of the final delimiter (counting from the right) is returned"
val timestamp_df =file_name_df.withColumn("timestamp",reverse(split(reverse(col("filename")),"_").getItem(0)))
It's working with this.

Maximum of some specific columns in a spark scala dataframe

I have a dataframe like this.
+---+---+---+---+
| M| c2| c3| d1|
+---+---+---+---+
| 1|2_1|4_3|1_2|
| 2|3_4|4_5|1_2|
+---+---+---+---+
I have to transform this df should look like below. Here, c_max = max(c2,c3) after splitting with _.ie, all the columns (c2 and c3) have to be splitted with _ and then getting the max.
In the actual scenario, I have 50 columns ie, c2,c3....c50 and need to take the max from this.
+---+---+---+---+------+
| M| c2| c3| d1|c_Max |
+---+---+---+---+------+
| 1|2_1|4_3|1_2| 4 |
| 2|3_4|4_5|1_2| 5 |
+---+---+---+---+------+
Here is one way using expr and build-in array functions for Spark >= 2.4.0:
import org.apache.spark.sql.functions.{expr, array_max, array}
val df = Seq(
(1, "2_1", "3_4", "1_2"),
(2, "3_4", "4_5", "1_2")
).toDF("M", "c2", "c3", "d1")
// get max c for each c column
val c_cols = df.columns.filter(_.startsWith("c")).map{ c =>
expr(s"array_max(cast(split(${c}, '_') as array<int>))")
}
df.withColumn("max_c", array_max(array(c_cols:_*))).show
Output:
+---+---+---+---+-----+
| M| c2| c3| d1|max_c|
+---+---+---+---+-----+
| 1|2_1|3_4|1_2| 4|
| 2|3_4|4_5|1_2| 5|
+---+---+---+---+-----+
For older versions use the next code:
val c_cols = df.columns.filter(_.startsWith("c")).map{ c =>
val c_ar = split(col(c), "_").cast("array<int>")
when(c_ar.getItem(0) > c_ar.getItem(1), c_ar.getItem(0)).otherwise(c_ar.getItem(1))
}
df.withColumn("max_c", greatest(c_cols:_*)).show
Use greatest function:
val df = Seq((1, "2_1", "3_4", "1_2"),(2, "3_4", "4_5", "1_2"),
).toDF("M", "c2", "c3", "d1")
// get all `c` columns and split by `_` to get the values after the underscore
val c_cols = df.columns.filter(_.startsWith("c"))
.flatMap{
c => Seq(split(col(c), "_").getItem(0).cast("int"),
split(col(c), "_").getItem(1).cast("int")
)
}
// apply greatest func
val c_max = greatest(c_cols: _*)
// add new column
df.withColumn("c_Max", c_max).show()
Gives:
+---+---+---+---+-----+
| M| c2| c3| d1|c_Max|
+---+---+---+---+-----+
| 1|2_1|3_4|1_2| 4|
| 2|3_4|4_5|1_2| 5|
+---+---+---+---+-----+
In spark >= 2.4.0, you can use the array_max function and get some code that would work even with columns containing more than 2 values. The idea is to start by concatenating all the columns (concat column). For that, I use concat_ws on an array of all the columns I want to concat, that I obtain with array(cols.map(col) :_*). Then I split the resulting string to get a big array of strings containing all the values of all the columns. I cast it to an array of ints and I call array_max on it.
val cols = (2 to 50).map("c"+_)
val result = df
.withColumn("concat", concat_ws("_", array(cols.map(col) :_*)))
.withColumn("array_of_ints", split('concat, "_").cast(ArrayType(IntegerType)))
.withColumn("c_max", array_max('array_of_ints))
.drop("concat", "array_of_ints")
In spark < 2.4, you can define array_max yourself like this:
val array_max = udf((s : Seq[Int]) => s.max)
The previous code does not need to be modified. Note however that UDFs can be slower than predefined spark SQL functions.

Case class mapping to csv

cat department
dept_id,dept_name
1,acc
2,finance
3,sales
4,marketing
Why there is difference in output of show() when used in df.show() and rdd.toDF.show(). can someone please help?
scala> case class Department (dept_id: Int, dept_name: String)
defined class Department
scala> val dept = sc.textFile("/home/sam/Projects/department")
scala> val mappedDpt = dept.map(p => Department( p(0).toInt,p(1).toString))
scala> mappedDpt.toDF.show()
+-------+---------+
|dept_id|dept_name|
+-------+---------+
| 49| ,|
| 50| ,|
| 51| ,|
| 52| ,|
+-------+---------+
scala>
val dept_df = spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("mode","permissive")
.load("/home/sam/Projects/department")
scala> dept_df.show()
+-------+---------+
|dept_id|dept_name|
+-------+---------+
| 1| acc|
| 2| finance|
| 3| sales|
| 4|marketing|
+-------+---------+
scala>
The problem is here
val mappedDpt = dept.map(p => Department( p(0).toInt,p(1).toString))
p here is a String not a Row (as you may think). To be more precise here p is each line of the text file, you can confirm that reading the scaladoc.
"returns RDD of lines of the text file".
So, when you apply the apply method ((0)) you're accessing a character by position on the line.
That is why you end up with "49, ','" 49 from the toInt of the first char which returns the ascii value of the character and the ',' from the second character on the line.
Edit
If you need to reproduce the read method you can do the following:
object Department {
/** The Option here is to handle errors. */
def fromRawArray(data: Array[String]): Option[Department] = data match {
case Array(raw_dept_id, dept_name) => Some(Department(raw_dept_id.toInt, dept_name))
case _ => None
}
}
// We use flatMap instead of map, to unwrap the values from the Option, the Nones get removed.
val mappedDpt = dept.flatMap(line => Department.fromRawArray(line.split(",")))
However, I hope this is only for learning. On production code you should always use the read version. Since it will be more robust (handling missing values, doing a better type cast, etc).
For example, the above code will throw an exception if the first value can't be casted to Int.
Always use spark.read.* variants since that gives you the dataframe and you can infer the schema as well.
Coming to your issue, in your RDD version, you have to filter the first line and then split the lines using comma separator, then you can map it to case class Department.
Once you map it to Department, note that you are creating a typed dataframe.. so it is a dataset. So you should use createDataset
The below code worked for me.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
object RDDSample {
case class Department(dept_id: Int, dept_name: String)
def main(args: Array[String]) {
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder().appName("Spark_processing").master("local[*]").getOrCreate()
import spark.implicits._
val dept = spark.sparkContext.textFile("in/department.txt")
val mappedDpt = dept.filter(line => !line.contains("dept_id")).map(p => {
val y = p.split(","); Department(y(0).toInt, y(1).toString)
})
spark.createDataset(mappedDpt).show
}
}
Results:
+-------+---------+
|dept_id|dept_name|
+-------+---------+
| 1| acc|
| 2| finance|
| 3| sales|
| 4|marketing|
+-------+---------+

Scala-Spark(version1.5.2) Dataframes split error

I have an input file foo.txt with the following content:
c1|c2|c3|c4|c5|c6|c7|c8|
00| |1.0|1.0|9|27.0|0||
01|2|3.0|4.0|1|10.0|1|1|
I want to transform it to a Dataframe to perform some Sql queries:
var text = sc.textFile("foo.txt")
var header = text.first()
var rdd = text.filter(row => row != header)
case class Data(c1: String, c2: String, c3: String, c4: String, c5: String, c6: String, c7: String, c8: String)
Until this point everything is ok, the problem comes in the next sentence:
var df = rdd.map(_.split("\\|")).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
If I try to print df with df.show, I get an error message:
scala> df.show()
java.lang.ArrayIndexOutOfBoundsException: 7
I know that the error might be due to the split sentence. I also tried to split foo.txt using the following syntax:
var df = rdd.map(_.split("""|""")).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
And then I get something like this:
scala> df.show()
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
| c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 |
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
| 0| 0| || | || 1| .| 0|
| 0| 1| || 2| || 3| .| 0|
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
Therefore, my question is how can I correctly pass this file to a Dataframe.
EDIT: The error is in the first row due to || field without an intermediate space. This type of field definition depending on the examples works fine or crashes.
This is because one of your lines is shorter than the others:
scala> var df = rdd.map(_.split("\\|")).map(_.length).collect()
df: Array[Int] = Array(7, 8)
You can fill in the rows manually (but you need to handle each case manually):
val df = rdd.map(_.split("\\|")).map{row =>
row match {
case Array(a,b,c,d,e,f,g,h) => Data(a,b,c,d,e,f,g,h)
case Array(a,b,c,d,e,f,g) => Data(a,b,c,d,e,f,g," ")
}
}
scala> df.show()
+---+---+---+---+---+----+---+---+
| c1| c2| c3| c4| c5| c6| c7| c8|
+---+---+---+---+---+----+---+---+
| 00| |1.0|1.0| 9|27.0| 0| |
| 01| 2|3.0|4.0| 1|10.0| 1| 1|
+---+---+---+---+---+----+---+---+
EDIT:
A more generic solution would be something like this:
val df = rdd.map(_.split("\\|", -1)).map(_.slice(0,8)).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
If you assume that you always have the right number of delimiters, it is safe to use this syntax an truncate the last value.
My suggestion would be to use databrick's csv parser.
Link : https://github.com/databricks/spark-csv
To load your example :
I loaded a sample file similar to yours:
c1|c2|c3|c4|c5|c6|c7|c8|
00| |1.0|1.0|9|27.0|0||
01|2|3.0|4.0|1|10.0|1|1|
To create the dataframe use the below code:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.option("delimiter", "|") // default is ","
.load("foo.txt")
.show
I got the below output
+---+---+---+---+---+----+---+----+---+
| c1| c2| c3| c4| c5| c6| c7| c8| |
+---+---+---+---+---+----+---+----+---+
| 0| |1.0|1.0| 9|27.0| 0|null| |
| 1| 2|3.0|4.0| 1|10.0| 1| 1| |
+---+---+---+---+---+----+---+----+---+
This way you do not have to bother about parsing the file yourself. You get a dataframe directly