Spark Scala - java.util.NoSuchElementException & Data Cleaning

I have had a similar problem before, but I am looking for a generalizable answer. I am using spark-corenlp to get sentiment scores on e-mails. Sometimes, sentiment() crashes on some input (maybe it's too long, maybe it has an unexpected character). It does not tell me which instances it crashes on; it just returns the Column sentiment('email). Thus, when I try to show() beyond a certain point or save() my data frame, I get a java.util.NoSuchElementException, because sentiment() must have returned nothing at that row.
My initial code is loading the data, and applying sentiment() as shown in spark-corenlp API.
val customSchema = StructType(Array(
  StructField("contactId", StringType, true),
  StructField("email", StringType, true)
))
// Load dataframe
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")        // Delimiter is tab
  .option("parserLib", "UNIVOCITY") // Parser which deals better with the email formatting
  .schema(customSchema)             // Schema of the table
  .load("emails")                   // Input file
val sent = df.select('contactId, sentiment('email).as('sentiment)) // Add sentiment analysis output to dataframe
I tried to filter for null and NaN values:
val sentFiltered = sent.filter('sentiment.isNotNull)
.filter(!'sentiment.isNaN)
.filter(col("sentiment").between(0,4))
I even tried to do it via SQL query:
sent.registerTempTable("sent")
val test = sqlContext.sql("SELECT * FROM sent WHERE sentiment IS NOT NULL")
I don't know what input is making spark-corenlp crash. How can I find out? Otherwise, how can I filter these non-existent values from col("sentiment")? Or should I try catching the exception and ignoring the row? Is that even possible?
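On the last point, one general pattern (a sketch only, not spark-corenlp specific) is to wrap the per-row call in a UDF that catches the exception and returns None, so the failing rows become nulls that can be filtered out afterwards. Here scoreEmail is a hypothetical stand-in for whatever call is actually throwing:
import org.apache.spark.sql.functions.udf
import scala.util.Try

// Hypothetical stand-in for the per-row call that sometimes throws.
def scoreEmail(text: String): Int = ??? // replace with the real scoring call

// Failures become None, which Spark stores as null in the column.
val safeSentiment = udf { text: String => Try(scoreEmail(text)).toOption }

val sentSafe = df.select('contactId, safeSentiment('email).as("sentiment"))
  .filter('sentiment.isNotNull) // drop the rows that crashed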

Related

import data with a column of type Pig Map into spark Dataframe?

So I'm trying to import data that has a column of type Pig map into a Spark dataframe, and I couldn't find anything on how to explode the map data into 3 columns with the names street, city and state. I'm probably searching for the wrong thing. Right now I can import them into 3 columns using the StructType and StructField options.
val schema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("address", StringType, true))) // this is the part that I need to explode
val data = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ";")
  .schema(schema)
  .load("hdfs://localhost:8020/filename")
Example row of the data that I need to make 5 columns from:
328;Some Name;[street#streetname,city#Chicago,state#IL]
What do I need to do to explode the map into 3 columns so I'd essentially have a new dataframe with 5 columns? I just started Spark and I've never used Pig. I only figured out it was a Pig map by searching for the structure [key#value].
I'm using Spark 1.6 with Scala, by the way. Thank you for any help.
I'm not too familiar with the pig format (there may even be libraries for it), but some good ol' fashioned string manipulation seems to work. In practice you may have to do some error checking, or you'll get index out of range errors.
val data = spark.createDataset(Seq(
  (328, "Some Name", "[street#streetname,city#Chicago,state#IL]")
)).toDF("id", "name", "address")

data.as[(Long, String, String)].map(r => {
  val addr = r._3.substring(1, r._3.length - 1).split(",")
  val street = addr(0).split("#")(1)
  val city = addr(1).split("#")(1)
  val state = addr(2).split("#")(1)
  (r._1, r._2, street, city, state)
}).toDF("id", "name", "street", "city", "state").show()
which results in
+---+---------+----------+-------+-----+
| id| name| street| city|state|
+---+---------+----------+-------+-----+
|328|Some Name|streetname|Chicago| IL|
+---+---------+----------+-------+-----+
I'm not 100% certain of the compatibility with Spark 1.6, however. You may end up having to map over the DataFrame (as opposed to the Dataset I convert it to with the .as[] call) and extract the individual values from the Row object in your anonymous .map() function. The overall concept should be the same, though.
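For reference, a rough sketch of that Row-based variant (hedged: untested on 1.6, and it assumes a DataFrame data with the same three columns and spark.implicits._ or sqlContext.implicits._ imported for toDF):
import org.apache.spark.sql.Row

// Map over the rows directly and pull each field out of the Row.
val exploded = data.rdd.map { case Row(id: Int, name: String, address: String) =>
  val addr = address.substring(1, address.length - 1).split(",")
  (id, name, addr(0).split("#")(1), addr(1).split("#")(1), addr(2).split("#")(1))
}
val df5 = exploded.toDF("id", "name", "street", "city", "state")
df5.show()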

Spark-shell : The number of columns doesn't match

I have a CSV file separated by the pipe delimiter "|", and the dataset has 2 columns, like below.
Column1|Column2
1|Name_a
2|Name_b
But sometimes we receive only one column value and the other is missing, like below:
Column1|Column2
1|Name_a
2|Name_b
3
4
5|Name_c
6
7|Name_f
So any row with a mismatched column count is a garbage value for us; in the above example those are the rows with values 3, 4, and 6, and we want to discard them. Is there any direct way I can discard those rows without getting an exception while reading the data from spark-shell, like below?
val readFile = spark.read.option("delimiter", "|").csv("File.csv").toDF(Seq("Column1", "Column2"): _*)
When we are trying to read the file we are getting the below exception.
java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (1): _c0
New column names (2): Column1, Column2
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.Dataset.toDF(Dataset.scala:435)
... 49 elided
You can specify the schema of your data file and allow some columns to be nullable. In Scala it may look like:
val schm = StructType(
  StructField("Column1", StringType, nullable = true) ::
  StructField("Column2", StringType, nullable = true) :: Nil)
val readFile = spark.read
  .option("delimiter", "|")
  .schema(schm)
  .csv("File.csv").toDF
Then you can filter your dataset where the column is not null.
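For example (a minimal sketch of that filter, assuming the schema above):
import org.apache.spark.sql.functions.col

// Rows like "3" parse with Column2 = null, so drop them.
val cleaned = readFile.filter(col("Column2").isNotNull)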
Just add DROPMALFORMED mode to the options while reading, as below. Setting this makes Spark drop the corrupted records.
val readFile = spark.read
.option("delimiter", "|")
.option("mode", "DROPMALFORMED") // Option to drop invalid rows.
.csv("File.csv")
.toDF(Seq("Column1", "Column2"): _*)
This is documented here.

Handling schema mismatches in Spark

I am reading a csv file using Spark in Scala.
The schema is predefined and I am using it for reading.
This is the example code:
// create the schema
val schema = StructType(Array(
  StructField("col1", IntegerType, false),
  StructField("col2", StringType, false),
  StructField("col3", StringType, true)))
// Initialize Spark session
val spark: SparkSession = SparkSession.builder
.appName("Parquet Converter")
.getOrCreate
// Create a data frame from a csv file
val dataFrame: DataFrame =
spark.read.format("csv").schema(schema).option("header", false).load(inputCsvPath)
From what I read, when reading CSV with Spark using a schema there are 3 options:
Set mode to DROPMALFORMED --> this will drop the lines that don't match the schema
Set mode to PERMISSIVE --> this will set the whole line to null values
Set mode to FAILFAST --> this will throw an exception when a mismatch is discovered
What is the best way to combine the options? The behaviour I want is to get the schema mismatches, print them as errors, and ignore those lines in my data frame.
Basically, I want a combination of FAILFAST and DROPMALFORMED.
Thanks in advance
This is what I eventually did:
I added the "_corrupt_record" column to the schema, for example:
val schema = StructType(Array(
  StructField("col1", IntegerType, true),
  StructField("col2", StringType, false),
  StructField("col3", StringType, true),
  StructField("_corrupt_record", StringType, true)))
Then I read the CSV using PERMISSIVE mode (it is Spark default):
val dataFrame: DataFrame = spark.read.format("csv")
.schema(schema)
.option("header", false)
.option("mode", "PERMISSIVE")
.load(inputCsvPath)
Now my data frame has an additional column that holds the rows with schema mismatches.
I filtered out the rows with mismatched data and printed them:
val badRows = dataFrame.filter("_corrupt_record is not null")
badRows.cache()
badRows.show()
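As a possible follow-up (not part of the original answer), the clean rows can then be kept by filtering the other way and dropping the helper column; on some Spark versions you may need to cache the DataFrame first, as above, before filtering on _corrupt_record alone:
// Keep the rows that parsed cleanly and drop the helper column.
val goodRows = dataFrame
  .filter("_corrupt_record is null")
  .drop("_corrupt_record")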
Just use DROPMALFORMED and follow the log. If malformed records are present, they are dumped to the log, up to the limit set by the maxMalformedLogPerPartition option.
spark.read.format("csv")
.schema(schema)
.option("header", false)
.option("mode", "DROPMALFORMED")
.option("maxMalformedLogPerPartition", 128)
.load(inputCsvPath)

Flag faulty Row in spark dataframe with boolean values

I was trying a hands-on with Spark dataframes. I have previous knowledge of the Cascading framework, which has a trap mechanism to filter out faulty rows (rows with null values) into a separate tap called Trap. For those who are unaware, let me make that clear: when a faulty row is read from a text file, the framework either scraps the bad row from the data or stops the execution. Now in Apache Spark, I observed that the bad rows didn't hinder the execution. That is good, but when it comes to getting business insights from data, the quality of the data does matter!
So, I have a text file with a bunch of rows in it (you may pick any dataset you like), in which a few records contain null values. Now I load the text file into a Dataframe with spark.read.csv. What I want to do is analyze the Dataframe and dynamically create a column named "isMyRowBad", where the logic looks at one row at a time: if a row has a null value, isMyRowBad is flagged as true for that particular row, and for the rows that do not have null values, isMyRowBad should be false for that particular, clean row.
Giving you the overview of the incoming and outgoing datasets
INCOMING DATAFRAME
fname,lname,age
will,smith,40
Dwayne,Nunn,36
Aniruddha,Sinha,
Maria,,22
OUTGOING DATAFRAME
fname,lname,age,isMyRowBad
will,smith,40,false
Dwayne,Nunn,36,false
Aniruddha,Sinha,,true
Maria,,22,true
The above method of classifying good and bad rows might seem a little foolish, but it does make sense, since I will not need to run the filter operation multiple times. Let us take a look at how.
Suppose I have a Dataframe named inputDf as the input and analysedDf: (DataFrame, DataFrame) as the output tuple.
Now, I did try this part of code
val analyzedDf: (DataFrame, DataFrame) = (inputDf.filter(_.anyNull),inputDf.filter(!_.anyNull))
This code segregates good and bad rows. I agree! But this has a performance setback, as filter runs twice, which means filter will iterate over the whole dataset two times! (You may counter this point if you feel running filter twice does make sense, considering 50 fields and at least 584,000 rows, that is, about 250 MB of data!)
and this as well
val analyzedDf: DataFrame = inputDf.select("*").withColumn("isMyRowBad", <this point, I am not able to analyze row>
The above snippet shows where I am not able to figure out how to sweep the entire row and mark the row as bad with a boolean value.
Hope you all understand what I am aiming to achieve. Please ignore any syntactical errors you find in the snippets, since I typed them here right away (I will correct them in future edits).
Please give me a hint(a little code snippet or a pseudo code will be enough) on how to proceed with the challenge. Please reach out to me if you didn't understand what I intend to do.
Any help will be greatly appreciated. Thanks in advance!
P.S.: There are brilliant people out here in BigData/Spark/Hadoop/Scala etc. I request you to kindly correct me on any point which I might have written wrongly (conceptually).
The code below gives me a solution, by the way. Please have a look.
package aniruddha.data.quality

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.functions._

/**
 * Created by aniruddha on 8/4/17.
 */
object DataQualityCheck extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val schema: StructType = StructType(List(
    StructField("fname", StringType, nullable = true),
    StructField("lname", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true),
    StructField("pan", StringType, nullable = true),
    StructField("married", StringType, nullable = true)
  ))

  val inputDataFrame: DataFrame = spark
    .read
    .schema(schema)
    .option("header", true)
    .option("delimiter", ",")
    .csv("inputData/infile")
  //inputDataFrame.show()

  val analysedDataFrame: DataFrame = inputDataFrame.select("*")
    .withColumn("isRowBad", when($"pan".isNull || $"lname".isNull || $"married".isNull, true).otherwise(false))

  analysedDataFrame.show
}
input
fname,lname,age,pan,married
aniruddha,sinha,23,0AA22,no
balajee,venkatesh,23,0b96,no
warren,shannon,72,,
wes,borland,63,0b22,yes
Rohan,,32,0a96,no
james,bond,66,007,no
output
+---------+---------+---+-----+-------+--------+
| fname| lname|age| pan|married|isRowBad|
+---------+---------+---+-----+-------+--------+
|aniruddha| sinha| 23|0AA22| no| false|
| balajee|venkatesh| 23| 0b96| no| false|
| warren| shannon| 72| null| null| true|
| wes| borland| 63| 0b22| yes| false|
| Rohan| null| 32| 0a96| no| true|
| james| bond| 66| 007| no| false|
+---------+---------+---+-----+-------+--------+
The code works fine but I have a problem with the when function. Can't we just select all the columns without hardcoding it?
As far as I know, you can't do this with the inbuilt csv parser. You can get the parser to stop if it hits an error (failFast mode), but not annotate.
However, you could do this with a custom csv parser, that can process the data in a single pass. Unless we want to do some clever type introspection, it is easiest if we create a helper class to annotate the structure of the file:
case class CSVColumnDef(colPos: Int, colName: String, colType: String)
val columns = List(CSVColumnDef(0,"fname","String"),CSVColumnDef(1,"lname","String"),CSVColumnDef(2,"age", "Int"))
Next, we need some functions to a) split the input, b) extract data from split data, c) check if row is bad:
import scala.util.Try
def splitToSeq(delimiter: String) = udf[Seq[String],String](_.split(delimiter))
def extractColumnStr(i: Int) = udf[Option[String],Seq[String]](s => Try(Some(s(i))).getOrElse(None))
def extractColumnInt(i: Int) = udf[Option[Int],Seq[String]](s => Try(Some(s(i).toInt)).getOrElse(None))
def isRowBad(delimiter: String) = udf[Boolean,String](s => {
(s.split(delimiter).length != columns.length) || (s.split(delimiter).exists(_.length==0))
})
To use these, we first need to read in the text file (since I don't have it, and to allow people to replicate this answer, I will create an rdd):
val input = sc.parallelize(List(("will,smith,40"),("Dwayne,Nunn,36"),("Aniruddha,Sinha,"),("Maria,,22")))
input.take(5).foreach(println)
Given this input, we can create a dataframe with a single column, the raw line, and add our split column to it:
val delimiter = ","
val raw = "raw"
val delimited = "delimited"
val compDF = input.toDF(raw).withColumn(delimited, splitToSeq(delimiter)(col(raw)))
Finally, we can extract all the columns we previously defined, and check if the rows are bad:
val df = columns.foldLeft(compDF){case (acc,column) => column.colType match {
case "Int" => acc.withColumn(column.colName, extractColumnInt(column.colPos)(col(delimited)))
case _ => acc.withColumn(column.colName, extractColumnStr(column.colPos)(col(delimited)))
}}.
withColumn("isMyRowBad", isRowBad(delimiter)(col(raw))).
drop(raw).drop(delimited)
df.show
df.printSchema
The nice thing about this solution is that the Spark execution planner is smart enough to build all of those .withColumn operations into a single pass (map) over the data, with zero shuffling. The annoying thing is that it is a lot more dev work than using a nice shiny csv library, and we need to define the columns somehow. If you wanted to be a bit more clever, you could get the column names from the first line of the file (hint: .mapPartitionsWithIndex), and just parse everything as a string. We also can't define a case class to describe the entire DF, since you have too many columns to approach the solution that way. Hope this helps...
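For what it's worth, a rough sketch of that header hint (hypothetical: it assumes the data was read from a file with sc.textFile and the header really is the first line of the file, unlike the parallelized toy input above):
// Take only the first line of the first partition and treat it as the header.
val headerLine = input
  .mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter.take(1) else Iterator.empty)
  .first()
val columnNames = headerLine.split(delimiter)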
This can be done using a udf. Although the answer given by Ben Horsburgh is definitely brilliant, we can do this without getting too deep into the internal architecture behind Dataframes. The following code can give you an idea:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

/**
 * Created by vaijnath on 10/4/17.
 */
object DataQualityCheck extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val schema: StructType = StructType(List(
    StructField("fname", StringType, nullable = true),
    StructField("lname", StringType, nullable = true),
    StructField("married", StringType, nullable = true)
  ))

  val inputDataFrame: DataFrame = spark
    .read
    .schema(schema)
    .option("header", false)
    .option("delimiter", ",")
    .csv("hydrograph.engine.spark/testData/inputFiles/delimitedInputFile.txt")
  //inputDataFrame.show()

  def isBad(row: Row): Boolean = {
    row.anyNull
  }

  val simplefun = udf(isBad(_: Row))
  val cols = struct(inputDataFrame.schema.fieldNames.map(e => col(e)): _*)
  // println(cols + "******************") // for debugging

  val analysedDataFrame: DataFrame = inputDataFrame.withColumn("isRowBad", simplefun(cols))
  analysedDataFrame.show
}
Please get back to me if you face any issues. I believe this solution is appropriate, since you seem to be looking for code that uses dataframes.
Thanks.

Pass one dataframe column values to another dataframe filter condition expression + Spark 1.5

I have two input datasets
The first input dataset looks like below:
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they are going fast",
2015,Chevy,Volt
The second input dataset:
TagId,condition
1997_cars,year = 1997 and model = 'E350'
2012_cars,year=2012 and model ='S'
2015_cars ,year=2015 and model = 'Volt'
Now my requirement is to read the first dataset and, based on the filtering conditions in the second dataset, tag the rows of the first input dataset by introducing a new column TagId into it.
So the expected output should look like:
year,make,model,comment,blank,TagId
"2012","Tesla","S","No comment",2012_cars
1997,Ford,E350,"Go get one now they are going fast",1997_cars
2015,Chevy,Volt, ,2015_cars
I tried:
val sqlContext = new SQLContext(sc)
val carsSchema = StructType(Seq(
  StructField("year", IntegerType, true),
  StructField("make", StringType, true),
  StructField("model", StringType, true),
  StructField("comment", StringType, true),
  StructField("blank", StringType, true)))

val carTagsSchema = StructType(Seq(
  StructField("TagId", StringType, true),
  StructField("condition", StringType, true)))

val dfcars = sqlContext.read.format("com.databricks.spark.csv").option("header", "true")
  .schema(carsSchema).load("/TestDivya/Spark/cars.csv")
val dftags = sqlContext.read.format("com.databricks.spark.csv").option("header", "true")
  .schema(carTagsSchema).load("/TestDivya/Spark/CarTags.csv")
val Amendeddf = dfcars.withColumn("TagId", dfcars("blank"))
val cdtnval = dftags.select("condition")
val df2=dfcars.filter(cdtnval)
<console>:35: error: overloaded method value filter with alternatives:
(conditionExpr: String)org.apache.spark.sql.DataFrame <and>
(condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.sql.DataFrame)
val df2=dfcars.filter(cdtnval)
Another way:
val col = dftags.col("TagId")
val finaldf = dfcars.withColumn("TagId", col)
org.apache.spark.sql.AnalysisException: resolved attribute(s) TagId#5 missing from comment#3,blank#4,model#2,make#1,year#0 in operator !Project [year#0,make#1,model#2,comment#3,blank#4,TagId#5 AS TagId#8];
finaldf.write.format("com.databricks.spark.csv").option("header", "true").save("/TestDivya/Spark/carswithtags.csv")
I would really appreciate it if somebody could give me pointers on how I can pass the filter condition to the dataframe's filter function,
or another solution.
My apologies for such a naive question, as I am new to Scala and Spark.
Thanks
There is no simple solution to this. I think there are two general directions you can go with it:
Collect the conditions (dftags) to a local list. Then go through it one by one, executing each on the cars (dfcars) as a filter. Use the results to get the desired output.
Collect the conditions (dftags) to a local list. Implement the parsing and evaluation code for them yourself. Go through the cars (dfcars) once, evaluating the ruleset on each line in a map.
If you have a high number of conditions (so you cannot collect them) and a high number of cars, then the situation is very bad. You need to check every car against every condition, so this will be very inefficient. In this case you need to optimize the ruleset first, so it can be evaluated more efficiently. (A decision tree may be a nice solution.)
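For the first direction, a rough sketch (hedged: it assumes the rule set is small enough to collect, the condition strings are valid Spark SQL expressions, and it simply drops cars that match no rule):
import org.apache.spark.sql.functions.lit

// Collect the (small) rule set to the driver as (TagId, condition) pairs.
val rules = dftags.collect().map(r => (r.getString(0), r.getString(1)))

// Run each condition as a filter over the cars, tag the matches with a literal TagId,
// and union the per-rule results back together (unionAll in Spark 1.5).
val tagged = rules
  .map { case (tagId, condition) => dfcars.filter(condition).withColumn("TagId", lit(tagId)) }
  .reduce((a, b) => a.unionAll(b))

tagged.show()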