Spark badRecordsPath is not writing records to the path as expected - Scala

I have the following sample csv data:
id,name,salary
1,"Raju",1000
2,"Gautam",15000
3,"Kishan",30000
4,"Mike",two hundread
The salary field in the last record is corrupted.
I am trying to handle the corrupt record with badRecordsPath as shown in the code below, but it is not working. I am using Spark 3.0.3, Scala 2.12 and Windows 10.
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.ArrayType

object BadDataPathExample extends App {
  Logger.getLogger("org").setLevel(Level.ERROR)

  val sparkConf = new SparkConf()
  sparkConf.set("spark.app.name", "BadDataPathExample")
  sparkConf.set("spark.master", "local[2]")

  val spark = SparkSession.builder()
    .config(sparkConf)
    .getOrCreate()

  val schema_string = "id int, name String, salary int"

  Logger.getLogger(getClass.getName).info(">> Starting to read Data")

  // read CSV
  val badDF = spark.read
    .format("csv")
    .option("header", true)
    .schema(schema_string)
    .option("badRecordsPath", "D:/spark_practice/bad_dir")
    .option("path", "D:/spark_practice/data/bad_emp.csv")
    .load

  badDF.show()
  badDF.printSchema()
}
The output from the above code shows the corrupted record present with the salary value set to null, which comes from the default behavior of "PERMISSIVE" mode. Also, no record is written to the bad records path specified.
But the same code works as expected on Databricks.
What am I doing wrong? Or is badRecordsPath a Databricks-specific feature?

badRecordsPath is a Databricks-specific feature; it is not available in open-source Apache Spark.
We can see the logic in the source code of FailureSafeParser:
class FailureSafeParser[IN](
  def parse(input: IN): Iterator[InternalRow] = {
    try {
      rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
    } catch {
      case e: BadRecordException => mode match {
        case PermissiveMode =>
          Iterator(toResultRow(e.partialResult(), e.record))
        case DropMalformedMode =>
          Iterator.empty
        case FailFastMode =>
          throw QueryExecutionErrors.malformedRecordsDetectedInRecordParsingError(e)
      }
    }
  }
}
Hmm...
I have an idea for refactoring this code: when the badRecordsPath option is present, force the mode to DropMalformedMode and ignore whatever mode the user set. DropMalformedMode would then parse the rows, write the ones that throw an exception to badRecordsPath, and return an empty iterator for them.
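Until something like that exists in open-source Spark, a rough user-level workaround (just a sketch, not the badRecordsPath feature itself) is to keep malformed rows in a _corrupt_record column via PERMISSIVE mode and write them to your own "bad records" directory, reusing the paths from the question:

import org.apache.spark.sql.functions.col

// Sketch: add an extra string column to the schema so Spark can store the raw malformed line there
val schemaWithCorrupt = "id int, name string, salary int, _corrupt_record string"

val rawDF = spark.read
  .format("csv")
  .option("header", true)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schemaWithCorrupt)
  .load("D:/spark_practice/data/bad_emp.csv")
  .cache() // cache first: Spark disallows queries that reference only the corrupt record column on the raw source

// Split the data into bad and good rows
val badRecordsDF = rawDF.filter(col("_corrupt_record").isNotNull).select("_corrupt_record")
val goodDF       = rawDF.filter(col("_corrupt_record").isNull).drop("_corrupt_record")

// Write the raw malformed lines out ourselves - our stand-in for badRecordsPath
badRecordsDF.write.mode("append").text("D:/spark_practice/bad_dir")
goodDF.show()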

Related

Why is the error given while reading the text file in Spark?

The path given to the text file is correct, yet I am getting the error "Input path does not exist: file:/C:/Users/cmpil/Downloads/hunger_games.txt". Why is this happening?
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WordCountDataSet {
  case class Book(value: String)

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession
      .builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Another way of doing it
    val bookRDD = spark.sparkContext.textFile("C:/Users/cmpil/Downloads/hunger_games.txt")
    val wordsRDD = bookRDD.flatMap(x => x.split("\\W+"))
    val wordsDS = wordsRDD.toDS()

    val lowercaseWordsDS = wordsDS.select(lower($"value").alias("word"))
    val wordCountsDS = lowercaseWordsDS.groupBy("word").count()
    val wordCountsSortedDS = wordCountsDS.sort("count")
    wordCountsSortedDS.show(wordCountsSortedDS.count().toInt)
  }
}
On Windows you have to use '\\' in place of '/'.
Try using "C:\\Users\\cmpil\\Downloads\\hunger_games.txt".

Not able to read pipe delimited csv

Input data:
Ord_value|other_data
12345|u1=876435;u5=4356|4357|4358;u15=Mr. Noodles,n/a,Great Value;u16=0.77,4.92,7.96;u17=4,1,7;
Details of the U variables:
U1 = order id -- single value
U5 = pid -- is a list
U15 = name -- is a list
U16 = price -- is a list
U17 = quantity -- is a list
Output:
Ord_value|orderid|pid|name|price|quantity
12345|876435|4356|Mr. Noodles|0.77|4
12345|876435|4357|n/a|4.92|1
I tried reading the file using a semicolon delimiter:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._

object example3 {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\hadoop\\")

    val conf = new SparkConf().setAppName("first_demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().getOrCreate()
    spark.sparkContext.setLogLevel("Error")
    import spark.implicits._

    // val rdd1 = sc.textFile("file:///C://Users//User//Desktop//example3.txt")
    // rdd1.map(x => x.split(";")).foreach(println)

    spark.read.option("delimiter", ";").option("header", "true")
      .load("file:///C://Users//User//Desktop//example3.txt").show()
  }
}
I am not able to read the file; I am getting the error below. It looks like a complex file.
Caused by: java.io.IOException: Could not read footer for file: FileStatus{path=file:/C:/Users/User/Desktop/example3.txt; isDirectory=false; length=95; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
I tried again:
val df1=spark.read.option("delimiter","|").csv("file:///C://Users//User//Desktop//example3.txt")
+-----+-----------------+----+------------------------------------------------------------------+
|_c0  |_c1              |_c2 |_c3                                                               |
+-----+-----------------+----+------------------------------------------------------------------+
|12345|u1=876435;u5=4356|4357|4358;u15=Mr. Noodles,n/a,Great Value;u16=0.77,4.92,7.96;u17=4,1,7;|
+-----+-----------------+----+------------------------------------------------------------------+
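The "Could not read footer" error from the first attempt comes from calling .load() without .format("csv"), so Spark falls back to its default Parquet source. Reading with the csv source and delimiter "|" then splits on the pipes inside the u5 list, as the second attempt shows. One possible approach (only a sketch, using the path and column names from the question) is to read the file as plain text, pull the u-fields out with regexp_extract, and then zip and explode the list-valued fields:

import org.apache.spark.sql.functions._

// Read as plain text so the '|' characters inside u5 are not treated as delimiters
val raw = spark.read.text("file:///C:/Users/User/Desktop/example3.txt")
  .filter(!col("value").startsWith("Ord_value")) // drop the header line

val parsed = raw
  .withColumn("Ord_value", split(col("value"), "\\|").getItem(0))
  .withColumn("orderid",  regexp_extract(col("value"), "u1=([^;]*)", 1))
  .withColumn("pid",      split(regexp_extract(col("value"), "u5=([^;]*)", 1), "\\|"))
  .withColumn("name",     split(regexp_extract(col("value"), "u15=([^;]*)", 1), ","))
  .withColumn("price",    split(regexp_extract(col("value"), "u16=([^;]*)", 1), ","))
  .withColumn("quantity", split(regexp_extract(col("value"), "u17=([^;]*)", 1), ","))

// arrays_zip (Spark 2.4+) pairs the i-th pid with the i-th name, price and quantity
val result = parsed
  .withColumn("z", explode(arrays_zip(col("pid"), col("name"), col("price"), col("quantity"))))
  .select(col("Ord_value"), col("orderid"),
          col("z.pid"), col("z.name"), col("z.price"), col("z.quantity"))

result.show(false)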

How to fix the Scala error "not found: type"

I'm a newbie in Scala, just trying to learn it with Spark. I'm writing a Scala app to load a csv file from Hadoop into a dataframe, and then I want to add a new column to that dataframe. There is a function to populate the content of that new column; for testing, the function just uppercases the column from the csv file. The csv file only contains one column, emp_id, and it's a string. The function is defined in the object TestService. My IDE is Eclipse. Now I get the error: not found: type TestService
I'd appreciate it if anyone can help me.
This is the main object:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._
import com.poc.spark.service.TestService

object SparkIntTest {
  def main(args: Array[String]) {
    sys.props += (("hadoop.home.dir", "C:\\OpenSource\\Hadoop"))

    val sparkConf = new SparkConf().setMaster("local").setAppName("employee").set("spark.testing.memory", "2147480000")
    val sparkContext = new SparkContext(sparkConf)
    val spark = SparkSession.builder().appName("employee").getOrCreate()

    val df = spark.read.option("header", "true").csv(".\\src\\main\\resources\\employee.csv")
    df.show()
    println(df.schema)

    val df_Applied = df.withColumn("award_rule", runAllRulesUDF(df("emp_id")))
    df_Applied.show()
    println(df_Applied.schema)
  }

  def runAllRulesUDF = udf(new TestService().runAllRulesForUDF(_: String))
}
Here is the Object TestService:
package com.poc.spark.service

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._

object TestService {
  def runAllRulesForUDF(empid: String): String = {
    empid.toUpperCase()
  }
}
TestService is an object, which means that it is a statically created singleton. So instead of
new TestService()
you can just say
TestService
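For example, the UDF in SparkIntTest could be written against the object directly (just a sketch of that change):

import org.apache.spark.sql.functions.udf

// Call the method on the TestService object instead of instantiating it
def runAllRulesUDF = udf((empId: String) => TestService.runAllRulesForUDF(empId))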

Spark Streaming not able to use Spark SQL

I am facing an issue with Spark Streaming. I am getting empty records after the data is streamed and passed to the "parse" method.
My code:
import java.util.regex.Pattern
import java.util.regex.Matcher

import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import spark.implicits._
val conf = new SparkConf().setAppName("streamHive").setMaster("local[*]").set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(5))
val sc=ssc.sparkContext
val lines = ssc.textFileStream("file:///home/sadr/testHive")
case class Prices(name: String, age: String,sex: String, location: String)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
def parse(rdd: org.apache.spark.rdd.RDD[String]) = {
  var l = rdd.map(_.split(","))
  val prices = l.map(p => Prices(p(0), p(1), p(2), p(3)))
  val pricesDf = sqlContext.createDataFrame(prices)
  pricesDf.registerTempTable("prices")
  pricesDf.show()
  var x = sqlContext.sql("select count(*) from prices")
  x.show()
}
lines.foreachRDD { rdd => parse(rdd)}
lines.print()
ssc.start()
My input file:
cat test1.csv
Riaz,32,M,uk
tony,23,M,india
manu,33,M,china
imart,34,F,AUS
I am getting this output:
lines.foreachRDD { rdd => parse(rdd)}
lines.print()
ssc.start()
scala> +----+---+---+--------+
|name|age|sex|location|
+----+---+---+--------+
+----+---+---+--------+
I am using Spark version 2.3. I am also getting an error after adding x.show().
Not sure if you are actually able to read the streams.
textFileStream reads only the new files added to the directory after the program starts, not the existing ones. Was the file already there?
If yes, remove it from the directory, start the program, and then copy the file in again.
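You can also guard the parse call so that empty micro-batches (before any new file has arrived) don't produce the empty table, for example (sketch):

// Only parse micro-batches that actually contain data
lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) parse(rdd)
}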

value toDF is not a member of Seq[(Int,String)]

I am trying to execute the following code but am getting this error:
value toDF is not a member of Seq[(Int,String)].
I have the case class outside main and I have imported the implicits too, but I am still getting this error. Can someone help me resolve this? I am using Spark 2.1.0 (built for Scala 2.11) and Scala 2.11.8.
import org.apache.spark.sql._
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark._

final case class Email(id: Int, text: String)

object SampleKMeans {
  def main(args: Array[String]) = {
    val spark = SparkSession.builder.appName("SampleKMeans")
      .master("yarn")
      .getOrCreate()
    import spark.implicits._

    val emails = Seq(
      "This is an email from...",
      "SPAM SPAM spam",
      "Hello, We'd like to offer you")
      .zipWithIndex.map(_.swap).toDF("id", "text").as[Email]
  }
}
You already have a SparkSession, so just importing spark.implicits._ will work in your case:
val spark = SparkSession.builder.appName("SampleKMeans")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._
Now the toDF method works as expected.
If the error still exists, you need to check the versions of the Spark and Scala libraries that you are using.
Hope this helps!
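To check the library versions, the sbt dependencies should line up with the Scala version, for example (a sketch of a build.sbt using the versions mentioned in the question):

// build.sbt (sketch): Spark 2.1.0 artifacts built for Scala 2.11
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.1.0",
  "org.apache.spark" %% "spark-mllib" % "2.1.0"
)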