Select query fails on large dataset in SQLContext - Scala

My code reads data through SQLContext. The table has 20 million records in it. I want to calculate the totalCount in the table.
val finalresult = sqlContext.sql(
  "SELECT movieid, tagname, occurrence AS eachTagCount, count AS totalCount FROM result ORDER BY movieid")
I want to calculate the total count of one column without using groupBy and save it to a text file. I have also changed my saved file so that it has no additional ].
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import sqlContext._
case class DataClass(UserId: Int, MovieId:Int, Tag: String)
// Create an RDD of DataClass objects and register it as a table.
val Data = sc.textFile("file:///usr/local/spark/dataset/tagupdate").map(_.split(",")).map(p => DataClass(p(0).trim.toInt, p(1).trim.toInt, p(2).trim)).toDF()
Data.registerTempTable("tag")
val orderedId = sqlContext.sql("SELECT MovieId AS Id,Tag FROM tag ORDER BY MovieId")
orderedId.rdd
.map(_.toSeq.map(_+"").reduce(_+";"+_))
.saveAsTextFile("/usr/local/spark/dataset/algorithm3/output")
// orderedId.write.parquet("ordered.parquet")
val eachTagCount =orderedId.groupBy("Tag").count()
//eachTagCount.show()
eachTagCount.rdd
.map(_.toSeq.map(_+"").reduce(_+";"+_))
.saveAsTextFile("/usr/local/spark/dataset/algorithm3/output2")
ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 604) java.lang.ArrayIndexOutOfBoundsException: 1 at tags$$anonfun$6.apply(tags.scala:46) at tags$$anonfun$6.apply(tags.scala:46) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)

A NumberFormatException is probably being thrown at this place:
p(1).trim.toInt
It is thrown because you're trying to parse "10]", which is obviously not a valid number.
You could try to find that problematic place in your file and just remove the additional ].
You could also try to catch an error and provide a default value in case there are any problems with parsing:
import scala.util.Try
Try(p(1).trim.toInt).getOrElse(0) // return 0 in case there is a problem with parsing
Another thing you could do is to remove characters, which are not digits from the string you're trying to parse:
// filter out everything which is not a digit
p(1).filter(_.isDigit).toInt
This might also fail if everything is filtered out and an empty string is left, so it might be a good idea to wrap it in Try as well.
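Putting both ideas together, a rough sketch of a defensive version of the parsing step could look like the following (the safeInt helper, its default value of 0, and the length check are my assumptions for illustration, not part of the original code):

import scala.util.Try

// Hypothetical helper: keep only digits, then fall back to a default if parsing still fails.
def safeInt(s: String, default: Int = 0): Int =
  Try(s.filter(_.isDigit).toInt).getOrElse(default)

val Data = sc.textFile("file:///usr/local/spark/dataset/tagupdate")
  .map(_.split(","))
  .filter(_.length >= 3)                                        // drop rows with too few fields
  .map(p => DataClass(safeInt(p(0)), safeInt(p(1)), p(2).trim))
  .toDF()

This keeps the rest of the pipeline (registerTempTable, the ORDER BY query, saveAsTextFile) unchanged.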

Related

Scala withColumn only if both columns exist

I have seen some variations of this question asked but haven't found exactly what I'm looking for. Here is the question:
I have some report names that I have collected in a dataframe and pivoted. The trouble I am having is with the resilience of report_name: I can't be assured that every 90 days Rpt1, Rpt2, and Rpt3 will all be present in the data. So how do I go about creating a calculation ONLY if the column is present? I have outlined how my code looks right now. It works if all columns are there, but I'd like to future-proof it so that if a report is not present in the 90-day window the pipeline does not error out but instead just skips the .withColumn addition.
df1 = (reports.alias("r")
  .groupBy(uniqueid)
  .filter("current_date <= 90")
  .pivot(report_name))

The result would have the following columns: uniqueid, Rpt1, Rpt2, Rpt3

+---+-----+-----+-----+
|id |Rpt1 |Rpt2 |Rpt3 |
+---+-----+-----+-----+
|205|72   |36   |12   |
+---+-----+-----+-----+

df2 = (df1.alias("d1")
  .withColumn("new_calc", expr("Rpt2/Rpt3")))
You can catch the error with a Try monad and return the original dataframe if withColumn fails.
import scala.util.Try
val df2 = Try(df1.withColumn("new_calc", expr("Rpt2/Rpt3")))
.getOrElse(df1)
.alias("d1")
You can also define it as a method if you want to reuse it:
import org.apache.spark.sql.{Column, DataFrame}

def withColumnIfExist(df: DataFrame, colName: String, col: Column): DataFrame =
  Try(df.withColumn(colName, col)).getOrElse(df)
val df3 = withColumnIfExist(df1, "new_calc", expr("Rpt2/Rpt3"))
.alias("d1")
And if you need to chain multiple transformations you can use it with transform:
val df4 = df1.alias("d1")
.transform(withColumnIfExist(_, "new_calc", expr("Rpt2/Rpt3")))
.transform(withColumnIfExist(_, "new_calc_2", expr("Rpt1/Rpt2")))
Or you can implement it as an extension method with implicit class:
implicit class RichDataFrame(df: DataFrame) {
  def withColumnIfExist(colName: String, col: Column): DataFrame =
    Try(df.withColumn(colName, col)).getOrElse(df)
}
val df5 = df1.alias("d1")
.withColumnIfExist("new_calc", expr("Rpt2/Rpt3"))
.withColumnIfExist("new_calc_2", expr("Rpt1/Rpt2"))
Since withColumn works on all Datasets, you can also make withColumnIfExist generic so that it works for any Dataset, including DataFrame:
import org.apache.spark.sql.Dataset

implicit class RichDataset[A](ds: Dataset[A]) {
  def withColumnIfExist(colName: String, col: Column): DataFrame =
    Try(ds.withColumn(colName, col)).getOrElse(ds.toDF)
}
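If you would rather not rely on catching the exception, an alternative sketch (my illustration, not part of the answer above) is to check explicitly that every column the expression depends on is present before adding it; the extra required parameter is an assumption of this variant:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.expr

// Hypothetical variant: only add the column when all columns it depends on exist.
def withColumnIfColumnsExist(df: DataFrame, colName: String, col: Column, required: Seq[String]): DataFrame =
  if (required.forall(df.columns.contains)) df.withColumn(colName, col) else df

val df6 = withColumnIfColumnsExist(df1, "new_calc", expr("Rpt2/Rpt3"), Seq("Rpt2", "Rpt3"))

The trade-off is that you state the dependencies twice, but you avoid silently swallowing unrelated analysis errors.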

Spark-Scala: Parse Fixed width line into Dataframe Api with exception handling

I am a beginner learning Spark with Scala (pardon my broken English). I need to write a program that parses delimited and fixed-width files into a DataFrame using the Spark-Scala DataFrame API. If the input data is corrupted, the program must handle it in one of the ways given below:
A: ignore the corrupted input data
B: investigate the error in the input
C: stop on error
To accomplish this goal I have successfully parsed the delimited file with exception handling using the DataFrame API options, but I have no idea how to apply the same technique to a fixed-width file. I am using Spark 2.4.3.
// predefined schema used in program
val schema = new StructType()
.add("empno",IntegerType,true)
.add("ename",StringType,true)
.add("designation",StringType,true)
.add("manager",StringType,true)
.add("hire_date",StringType,true)
.add("salary",DoubleType,true)
.add("deptno",IntegerType,true)
.add("_corrupt_record", StringType, true)
// parse csv file into DataFrame Api
// option("mode","PERMISSIVE") used to handle corrupt record
val textDF =sqlContext.read.format("csv").option("header", "true").schema(schema).option("mode", "PERMISSIVE").load("empdata.csv")
textDF.show
// program for fixed width line
// lsplit splits a line into a list of tokens based on the given field widths
def lsplit(pos: List[Int], str: String): List[String] = {
  val (rest, result) = pos.foldLeft((str, List[String]())) {
    case ((s, res), curr) =>
      if (s.length() <= curr) {
        val split = s.substring(0).trim()
        val rest = ""
        (rest, split :: res)
      } else if (s.length() > curr) {
        val split = s.substring(0, curr).trim()
        val rest = s.substring(curr)
        (rest, split :: res)
      } else {
        val split = ""
        val rest = ""
        (rest, split :: res)
      }
  }
  // list is reversed
  result.reverse
}
// create case class to hold parsed data
case class EMP(empno:Int,ename:String,designation:String,manager:String,hire_dt:String,salary:Double,deptno:Int)
// create variable to hold width length
val sizeOfColumn=List(4,4,5,4,10,8,2);
// code to transform string to case class record
val ttRdd = textDF.map { x =>
  val row = lsplit(sizeOfColumn, x.mkString)
  EMP(row(0).toInt, row(1), row(2), row(3), row(4), row(5).toDouble, row(6).toInt)
}
The code works fine for proper data but fails if incorrect data comes in the file, e.g. if the "empno" column has some non-integer data the program throws a NumberFormatException.
The program must handle the case where the actual data in the file does not match the specified schema, just as it is handled for the delimited file.
Kindly help me here. I need to use the same approach for the fixed-width file as I used for the delimited file.
It's sort of obvious: you are blending your own approach with the API's "permissive" option.
Permissive mode will pick up errors such as a wrong data type, but your own lsplit process still executes afterwards and can hit a null. E.g. if I put "YYY" in empno, this is clearly observable.
If the data type is OK but the length is wrong, you process the line in most cases without an error, but the fields come out garbled.
Your lsplit needs to be more robust, and you need to check whether an error already exists in the row before deciding whether or not to invoke it.
First case
+-----+-----+---------------+
|empno|ename|_corrupt_record|
+-----+-----+---------------+
| null| null| YYY,Gerry|
| 5555|Wayne| null|
+-----+-----+---------------+
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 1 times, most recent failure: Lost task 0.0 in stage 30.0 (TID 30, localhost, executor driver): java.lang.NumberFormatException: For input string: "null"
Second case
+------+-----+---------------+
| empno|ename|_corrupt_record|
+------+-----+---------------+
|444444|Gerry| null|
| 5555|Wayne| null|
+------+-----+---------------+
res37: Array[EMP] = Array(EMP(4444,44Ger), EMP(5555,Wayne))
In short, there is some work to do, and in fact no need for a header.
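As an illustration only (this is my sketch, not the code from the answer above), one way to make the fixed-width step more defensive is to drop the rows that PERMISSIVE mode has already flagged in _corrupt_record and to wrap every numeric conversion in Try; the default values -1 and 0.0 are assumptions:

import scala.util.Try

// Keep only rows that parsed cleanly, then convert each field defensively.
val cached = textDF.cache()   // cache the parsed result before filtering on _corrupt_record (Spark 2.3+ restricts such queries on raw CSV)
val safeRdd = cached
  .filter(cached("_corrupt_record").isNull)
  .drop("_corrupt_record")    // keep only the data columns before re-joining them into one string
  .rdd
  .map { x =>
    val row = lsplit(sizeOfColumn, x.mkString)
    EMP(
      Try(row(0).toInt).getOrElse(-1),      // assumed default for a bad empno
      row(1), row(2), row(3), row(4),
      Try(row(5).toDouble).getOrElse(0.0),  // assumed default for a bad salary
      Try(row(6).toInt).getOrElse(-1)       // assumed default for a bad deptno
    )
  }

Rows you would rather investigate than repair can instead be routed to a separate output by filtering on _corrupt_record being non-null.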

How to load data into Product case class using Dataframe in Spark

I have a text file and has data like below:
productId|price|saleEvent|rivalName|fetchTS
123|78.73|Special|VistaCart.com|2017-05-11 15:39:30
123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43
123|89.52|Sale|MarketPlace.com|2017-05-11 16:07:29
678|1348.73|Regular|VistaCart.com|2017-05-11 15:58:06
678|1348.73|Special|ShopYourWay.com|2017-05-11 15:44:22
678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03
777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01
I have to find minimum price of a product across websites, e.g. my output should be like this:
productId|price|saleEvent|rivalName|fetchTS
123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43
678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03
777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01
I am trying like this:
case class Product(productId:String, price:Double, saleEvent:String, rivalName:String, fetchTS:String)
val cDF = spark.read.text("/home/prabhat/Documents/Spark/sampledata/competitor_data.txt")
val (header,values) = cDF.collect.splitAt(1)
values.foreach(x => Product(x(0).toString, x(1).toString.toDouble,
x(2).toString, x(3).toString, x(4).toString))
Getting exception while running last line:
java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.spark.sql.catalyst.expressions.GenericRow
.get(rows.scala:174)
at org.apache.spark.sql.Row$class.apply(Row.scala:163)
at
org.apache.spark.sql.catalyst.expressions.GenericRow
.apply(rows.scala:166
)
at $anonfun$1.apply(<console>:28)
at $anonfun$1.apply(<console>:28)
at scala.collection.IndexedSeqOptimized$class.foreach
(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
... 49 elided
Printing the values:
scala> values
res2: Array[org.apache.spark.sql.Row] =
Array([123|78.73|Special|VistaCart.com|2017-05-11 15:39:30 ],
[123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43 ],
[123|89.52|Sale|MarketPlace.com|2017-05-11 16:07:29 ],
[678|1348.73|Regular|VistaCart.com|2017-05-11 15:58:06 ],
[678|1348.73|Special|ShopYourWay.com|2017-05-11 15:44:22 ],
[678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03 ],
[777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01 ])
scala>
I understand that I need to split on "|".
scala> val xy = values.foreach(x => x.toString.split("|").toSeq)
xy: Unit = ()
So after splitting it gives me Unit, i.e. void, so I am unable to load the values into the Product case class. How can I load this DataFrame into the Product case class? I don't want to use Dataset for now, although Dataset is type-safe.
I'm using Spark 2.3 and Scala 2.11.
The issue is that split takes a regex, which means you need to use "\\|" instead of a plain "|". Also, the foreach needs to be changed to map to actually give a return value, i.e.:
val xy = values.map(x => x.toString.split("\\|"))
However, a better approach would be to read the data as a csv file with | as the separator. That way you do not need to treat the header in a special way, and by inferring the column types there is no need to make any conversions (here I changed fetchTS to a timestamp):
import java.sql.Timestamp
import spark.implicits._

case class Product(productId: String, price: Double, saleEvent: String, rivalName: String, fetchTS: Timestamp)

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("sep", "|")
  .csv("/home/prabhat/Documents/Spark/sampledata/competitor_data.txt")
  .as[Product]
The final line will convert the dataframe to use the Product case class. If you want to use it as an RDD instead, simply add .rdd in the end.
After this is done, use groupBy and agg to get the final results.
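For example, the last step could look roughly like this (my sketch, not part of the original answer): aggregate the minimum price per productId and join it back to recover the full rows, which matches the expected output above:

import org.apache.spark.sql.functions.min

val minPrices = df.groupBy("productId").agg(min("price").as("price"))
val cheapest  = df.join(minPrices, Seq("productId", "price"))

cheapest.show(false)

Joining on an exact Double match is fine here because the price comes from the same column; if two rivals share the exact minimum price, both rows are kept.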

NullPointerException applying a function to spark RDD that works on non-RDD

I have a function that I want to apply to a every row of a .csv file:
def convert(inString: Array[String]): String = {
  val country  = inString(0)
  val sellerId = inString(1)
  val itemID   = inString(2)
  try {
    val minidf = sqlContext.read.json(sc.makeRDD(inString(3) :: Nil))
      .withColumn("country", lit(country))
      .withColumn("seller_id", lit(sellerId))
      .withColumn("item_id", lit(itemID))
    val finalString = minidf.toJSON.collect().mkString(",")
    finalString
  } catch {
    case e: Exception =>
      println("AN EXCEPTION " + inString.mkString(","))
      "this is an exception " + e + " " + inString.mkString(",")
  }
}
This function transforms an entry of the sort:
CA 112578240 132080411845 [{"id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
Where I have 4 columns, the 4th being a json blob, into
[{"country":"CA", "seller":112578240", "product":112578240, "id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
which is the json object where the first 3 columns have been inserted into the fourth.
Now, this works:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).collect().map(x => convert(x))
or this:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).take(10).map(x => convert(x))
but this does not
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(x => convert(x))
The last one throws a java.lang.NullPointerException.
I included a try/catch clause to see where exactly this is failing, and it's failing for every single row.
What am I doing wrong here?
You cannot use sqlContext or sparkContext inside a Spark map, since those objects can only exist on the driver node; essentially they are in charge of distributing your tasks.
You could rewrite the JSON parsing bit using one of these libraries in pure Scala: https://manuel.bernhardt.io/2015/11/06/a-quick-tour-of-json-libraries-in-scala/
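For instance, a rough sketch with json4s (one of the libraries on that list; this is my illustration and it assumes the 4th column is a JSON array of objects, as in the example above, and that json4s-jackson is on the classpath):

import org.json4s._
import org.json4s.jackson.JsonMethods.{parse, compact, render}

def convertPure(inString: Array[String]): String = {
  val extras: List[JField] = List(
    "country"   -> JString(inString(0)),
    "seller_id" -> JString(inString(1)),
    "item_id"   -> JString(inString(2))
  )
  parse(inString(3)) match {
    // add the three extra fields to every object in the array
    case JArray(objs) =>
      compact(render(JArray(objs.map {
        case JObject(fields) => JObject(extras ++ fields)
        case other           => other
      })))
    case other => compact(render(other))
  }
}

val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(convertPure)

Because convertPure uses no SparkContext or SQLContext, it can safely run inside map on the executors.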

Share HDInsight SPARK SQL Table saveAsTable does not work

I want to show the data from HDInsight Spark using Tableau. I was following this video, where they describe how to connect the two systems and expose the data.
Currently my script itself is very simple, as shown below:
/* csvLines is an RDD of strings, each representing a line in the CSV file */
val csvLines = sc.textFile("wasb://mycontainer#mysparkstorage.blob.core.windows.net/*/*/*/mydata__000000.csv")
// Define a schema
case class MyData(Timestamp: String, TimezoneOffset: String, SystemGuid: String, TagName: String, NumericValue: Double, StringValue: String)
// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
s => MyData(s(0),
s(1),
s(2),
s(3),
s(4).toDouble,
s(5)
)
).toDF()
// Register as a temporary table called "processdata"
myData.registerTempTable("test_table")
myData.saveAsTable("test_table")
Unfortunately I run into the following error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: Table `test_table` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:209)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:198)
I have also tried the following code to overwrite the table if it exists:
import org.apache.spark.sql.SaveMode
myData.saveAsTable("test_table", SaveMode.Overwrite)
but it still gives me an error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
java.lang.RuntimeException: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.SparkStrategies$DDLStrategy$.apply(SparkStrategies.scala:416)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
Can someone please help me fix this issue?
I know it was my mistake, but I'll leave this as an answer, as it was not readily available in any of the blogs or forum answers; hopefully it will help someone like me who is starting with Spark.
I figured out that .toDF() actually creates a SQLContext-based DataFrame, not a HiveContext-based one, so I have now updated my code as below.
// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
s => MyData(s(0),
s(1),
s(2),
s(3),
s(4).toDouble,
s(5)
)
)
// Create a HiveContext-based DataFrame and register it as a temporary table called "mydata_stored"
val myDataFrame = hiveContext.createDataFrame(myData)
myDataFrame.registerTempTable("mydata_stored")
myDataFrame.write.mode(SaveMode.Overwrite).saveAsTable("mydata_stored")
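A note on where hiveContext comes from: the snippet above assumes it already exists. As a small sketch for completeness (assuming Spark 1.x, where HiveContext is still available, and that sc is the existing SparkContext), it can be created like this:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._   // only needed if you still want to call .toDF() elsewhere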
Also make sure that s(4) has a proper double value, or else add a try/catch to handle it. I did something like this:
def parseDouble(s: String): Double = try { s.toDouble } catch { case _: NumberFormatException => 0.00 }
parseDouble(s(4))
Regards
Kiran