Apache Spark shortest job scala


I am new to Apache Spark and Scala programming. I am writing Scala code using the Apache Spark API docs. My goal is to create a graph, place objects on it, and compute shortest paths. I have written a program to generate a CSV file of the objects I want to use. Each row consists of a vehicle ID, a source, and a destination.
It is as follows:
My sample CSV file: https://i.stack.imgur.com/KtSVz.png
My code to generate the CSV file:
import java.io.BufferedWriter
import java.io.FileWriter
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer
import scala.util.Random
import au.com.bytecode.opencsv.CSVWriter
import scala.collection.mutable
class MakeCSV() {
  def csvBuilder(dx: Int): Unit = {
    val outputfile = new BufferedWriter(new FileWriter("vehicles.csv"))
    val csvWriter = new CSVWriter(outputfile)
    val csvFields = Array("Vehicle-id", "Source", "Destination")
    val vehicleID = (0 to dx).toList
    val sourceList = mutable.MutableList[String]()
    val destinationList = mutable.MutableList[String]()
    var i, sx, sy, dsx, dsy = 0
    for (i <- 0 to dx) {
      sx = Random.nextInt(dx)
      sy = Random.nextInt(dx)
      dsx = Random.nextInt(dx)
      dsy = Random.nextInt(dx)
      sourceList.+=((sx, sy).toString())
      destinationList.+=((dsx, dsy).toString())
    }
    var listOfRecords = new ListBuffer[Array[String]]()
    listOfRecords += csvFields
    for (i <- 0 to dx) {
      listOfRecords += Array(i.toString, sourceList(Random.nextInt(sourceList.length)), destinationList(Random.nextInt(destinationList.length)))
    }
    csvWriter.writeAll(listOfRecords.asJava)
    csvWriter.close()
  }
}
My main file:
import java.io.PrintWriter
import scala.io.StdIn
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.graphx.Graph
import org.apache.spark.graphx.util.GraphGenerators
object MainFile {
  def main(args: Array[String]): Unit = {
    // Vehicle CSV file Generation
    println("Enter the number of cars")
    val input = StdIn.readInt()
    val makecsv = new MakeCSV()
    makecsv.csvBuilder(input)
    // Spark Job Configuration
    val conf = new SparkConf().setAppName("DjikstraShortestPath")
    val sc = new SparkContext(conf)
    // Graph Generation
    println("Enter the number of rows for grid")
    val row = StdIn.readInt()
    println("Enter the number of columns for grid")
    val column = StdIn.readInt()
    val graph: Graph[(Int, Int), Double] = GraphGenerators.gridGraph(sc, row, column)
    // Vehicle File opening
    // For each Vehicle compute shortest path using source destination in csv file
  }
}
Now I want to open that CSV file and, using its source and destination columns, compute the shortest path for each vehicle on the graph generated above. Can anyone help me? How do I open the CSV file, read it, and find the shortest paths?
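A minimal sketch of the first half of that task (reading a row and mapping it onto grid vertices), in plain Scala with no Spark dependency. It assumes that opencsv quotes every field, so a row looks like "0","(3,1)","(2,4)", and that gridGraph numbers its vertices row-major as r * cols + c; both assumptions should be checked against the real vehicles.csv and graph.vertices. The names GridTrips, Trip, parseLine and vertexId are hypothetical. Once the ids are known, GraphX's org.apache.spark.graphx.lib.ShortestPaths.run(graph, landmarks) may be a starting point for hop counts to the destination vertices.

```scala
object GridTrips {
  // One CSV row: vehicle id plus (row, column) source and destination cells.
  final case class Trip(vehicleId: Int, source: (Int, Int), destination: (Int, Int))

  // Matches a fully quoted row such as "0","(3,1)","(2,4)".
  // ASSUMPTION: every field is wrapped in double quotes by the CSV writer.
  private val Row = """^"(\d+)","\((\d+),(\d+)\)","\((\d+),(\d+)\)"$""".r

  // Returns None for the header line or any row that does not match.
  def parseLine(line: String): Option[Trip] = line.trim match {
    case Row(id, sr, sc, dr, dc) =>
      Some(Trip(id.toInt, (sr.toInt, sc.toInt), (dr.toInt, dc.toInt)))
    case _ => None
  }

  // ASSUMPTION: grid vertices are numbered row-major (r * cols + c).
  def vertexId(cell: (Int, Int), cols: Int): Long =
    cell._1.toLong * cols + cell._2
}
```

In a Spark job the same parseLine could be applied with sc.textFile("vehicles.csv").flatMap(line => GridTrips.parseLine(line)). Note that the generator draws coordinates from 0 until dx, so they only fit the grid if the row and column counts are at least dx.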

Related

Spark badRecordsPath is not writing records to the Path as expected

I have the following sample CSV data:
id,name,salary
1,"Raju",1000
2,"Gautam",15000
3,"Kishan",30000
4,"Mike",two hundread
The salary field in last record is corrupted.
I am trying to handle the corrupt record with badRecordsPath as shown in the code below, but it is not working. I am using Spark 3.0.3, Scala 2.12 and Windows 10.
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.ArrayType
object BadDataPathExample extends App {
  Logger.getLogger("org").setLevel(Level.ERROR)
  val sparkConf = new SparkConf()
  sparkConf.set("spark.app.name", "BadDataPathExample")
  sparkConf.set("spark.master", "local[2]")
  val spark = SparkSession.builder()
    .config(sparkConf)
    .getOrCreate()
  val schema_string = "id int, name String, salary int"
  Logger.getLogger(getClass.getName).info(">> Starting to read Data")
  // read CSV
  val badDF = spark.read
    .format("csv")
    .option("header", true)
    .schema(schema_string)
    .option("badRecordsPath", "D:/spark_practice/bad_dir")
    .option("path", "D:/spark_practice/data/bad_emp.csv")
    .load
  badDF.show()
  badDF.printSchema()
}
The Output from the above code is as below:
As we can see, the record is present with the corrupted column value set to null, which is the default behavior of "PERMISSIVE" mode. Also, no record is written to the specified bad records path.
But the same code works as expected in Databricks, as shown below.
What am I doing wrong? Or is badRecordsPath a Databricks-specific feature?

badRecordsPath is a Databricks-specific feature.
We can see the logic in the source code of FailureSafeParser:
class FailureSafeParser[IN]( /* constructor parameters omitted in this excerpt */ ) {
  def parse(input: IN): Iterator[InternalRow] = {
    try {
      rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
    } catch {
      case e: BadRecordException => mode match {
        case PermissiveMode =>
          Iterator(toResultRow(e.partialResult(), e.record))
        case DropMalformedMode =>
          Iterator.empty
        case FailFastMode =>
          throw QueryExecutionErrors.malformedRecordsDetectedInRecordParsingError(e)
      }
    }
  }
}
I have an idea for refactoring this code:
When the badRecordsPath option is present, the mode would be forced to DropMalformedMode, ignoring whatever mode the user set.
In the DropMalformedMode branch, rows that fail with an exception would be written to badRecordsPath before the empty Iterator is returned.
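The idea of dropping malformed rows while keeping them somewhere else can be sketched outside Spark. The following is not Spark or Databricks code; it is a plain-Scala illustration under the assumption that each line has the three comma-separated fields of the sample data, and the names BadRecordsSketch, Emp, parse and partitionRecords are hypothetical.

```scala
import scala.util.{Failure, Success, Try}

object BadRecordsSketch {
  final case class Emp(id: Int, name: String, salary: Int)

  // Parses one "id,name,salary" line; any conversion problem becomes a Failure.
  def parse(line: String): Try[Emp] = Try {
    val Array(id, name, salary) = line.split(",", -1).map(_.trim)
    Emp(id.toInt, name.stripPrefix("\"").stripSuffix("\""), salary.toInt)
  }

  // Returns (good rows, raw text of the bad rows). The second element is what
  // a caller could write to a "bad records" location of its own choosing.
  def partitionRecords(lines: Seq[String]): (Seq[Emp], Seq[String]) = {
    val parsed = lines.map(line => line -> parse(line))
    val good = parsed.collect { case (_, Success(emp)) => emp }
    val bad = parsed.collect { case (line, Failure(_)) => line }
    (good, bad)
  }
}
```

With open-source Spark, a roughly comparable effect might be obtained by reading in PERMISSIVE mode with a corrupt-record column in the schema and filtering on it, but that is a different mechanism from badRecordsPath.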

Not able to read pipe delimited csv

Input data:
Ord_value|other_data
12345|u1=876435;u5=4356|4357|4358;u15=Mr. Noodles,n/a,Great Value;u16=0.77,4.92,7.96;u17=4,1,7;
Details of the u variables:
u1 = order id (single value)
u5 = pid (a list)
u15 = name (a list)
u16 = price (a list)
u17 = quantity (a list)
Output:
Ord_value|orderid|pid|name|price|quantity
12345|876435|4356|Mr. Noodles|0.77|4
12345|876435|4357|n/a|4.92|1
I tried reading the file using a semicolon as the delimiter:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
object example3 {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\hadoop\\")
    val conf = new SparkConf().setAppName("first_demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().getOrCreate()
    spark.sparkContext.setLogLevel("Error")
    import spark.implicits._
    // val rdd1=sc.textFile("file:///C://Users//User//Desktop//example3.txt")
    // rdd1.map(x=>x.split(";")).foreach(println)
    spark.read.option("delimiter", ";").option("header", "true").load("file:///C://Users//User//Desktop//example3.txt").show()
  }
}
I am not able to read the file; I get the error below. It looks like a complex file.
Caused by: java.io.IOException: Could not read footer for file: FileStatus{path=file:/C:/Users/User/Desktop/example3.txt; isDirectory=false; length=95; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
I tried again with a pipe delimiter:
val df1=spark.read.option("delimiter","|").csv("file:///C://Users//User//Desktop//example3.txt")
+-----+-----------------+----+------------------------------------------------------------------+
|_c0 |_c1 |_c2 |_c3 |
+-----+-----------------+----+------------------------------------------------------------------+
|12345|u1=876435;u5=4356|4357|4358;u15=Mr. Noodles,n/a,Great Value;u16=0.77,4.92,7.96;u17=4,1,7;|
+-----+-----------------+----+------------------------------------------------------------------+
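Two observations may help here. First, load without a format uses Spark's default data source, which is Parquet unless configured otherwise, and that would explain the "Could not read footer" error on a text file. Second, because "|" appears both as the column separator and inside the u5 list, a plain delimiter option cannot split this line correctly. A minimal plain-Scala sketch of parsing one data line follows, assuming the layout shown in the input (first "|" separates Ord_value from the rest, ";" separates the u variables, u5 is "|"-separated and the other lists are ","-separated). The names UVarParser, OrderLine and parse are hypothetical. The sample line has three entries per list, so this sketch yields three rows even though the expected output above shows only two.

```scala
object UVarParser {
  final case class OrderLine(
      ordValue: String, orderId: String, pid: String,
      name: String, price: String, quantity: String)

  def parse(line: String): List[OrderLine] = {
    // Split only on the first '|': everything after it is the u-variable string.
    val Array(ordValue, rest) = line.split("\\|", 2)

    // "u1=...;u5=...;" becomes Map("u1" -> "...", "u5" -> "...", ...).
    val fields: Map[String, String] = rest.split(";").filter(_.nonEmpty).map { kv =>
      val Array(key, value) = kv.split("=", 2)
      key -> value
    }.toMap

    val pids = fields("u5").split("\\|")
    val names = fields("u15").split(",")
    val prices = fields("u16").split(",")
    val quantities = fields("u17").split(",")

    // zip truncates to the shortest list, so ragged input does not throw.
    pids.zip(names).zip(prices).zip(quantities).toList.map {
      case (((pid, name), price), qty) =>
        OrderLine(ordValue, fields("u1"), pid, name, price, qty)
    }
  }
}
```

In Spark, the file could be read as lines with spark.read.textFile(...) and the parser applied in a flatMap, skipping the header line first.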

how to fix Scala error with "Not found type"

I'm a newbie in Scala, just trying to learn it with Spark. I'm writing a Scala app that loads a CSV file from Hadoop into a DataFrame, and then I want to add a new column to that DataFrame. A function populates the content of the new column; for testing, it just uppercases the column from the CSV file. The CSV file contains only one column, emp_id, which is a string. The function is defined in the object TestService. My IDE is Eclipse. I now get the error: not found: type TestService
I would very much appreciate it if anyone could help me.
This is the main:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._
import com.poc.spark.service.TestService;
object SparkIntTest {
  def main(args: Array[String]): Unit = {
    sys.props.+=(("hadoop.home.dir", "C:\\OpenSource\\Hadoop"))
    val sparkConf = new SparkConf().setMaster("local").setAppName("employee").set("spark.testing.memory", "2147480000")
    val sparkContext = new SparkContext(sparkConf)
    val spark = SparkSession.builder().appName("employee").getOrCreate()
    val df = spark.read.option("header", "true").csv(".\\src\\main\\resources\\employee.csv")
    df.show()
    println(df.schema)
    val df_Applied = df.withColumn("award_rule", runAllRulesUDF(df("emp_id")))
    df_Applied.show()
    println(df_Applied.schema)
  }
  def runAllRulesUDF = udf(new TestService().runAllRulesForUDF(_: String))
}
Here is the object TestService:
package com.poc.spark.service
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._
object TestService {
  def runAllRulesForUDF(empid: String): String = {
    empid.toUpperCase()
  }
}

TestService is an object, which means that it is a statically created singleton. So instead of
new TestService()
You can just say
TestService
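As a minimal, Spark-free sketch of that point: a Scala object is referenced by name and its methods are called directly, with no constructor call. The wrapper SingletonDemo and its value are hypothetical names added for illustration.

```scala
// A singleton object: there is no class named TestService to instantiate,
// which is why `new TestService()` fails with "not found: type TestService".
object TestService {
  def runAllRulesForUDF(empid: String): String = empid.toUpperCase
}

object SingletonDemo {
  // Call the method on the object itself.
  val upper: String = TestService.runAllRulesForUDF("e001")
}
```

In the question's code, the UDF definition would then presumably become udf(TestService.runAllRulesForUDF(_: String)).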

Save MongoDB data to parquet file format using Apache Spark

I am a newbie with Apache Spark as well as with the Scala programming language.
What I am trying to achieve is to extract the data from my local MongoDB database and then save it in Parquet format using Apache Spark with the hadoop-connector.
This is my code so far:
package com.examples
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.{MongoInputFormat, BSONFileInputFormat}
import org.apache.spark.sql
import org.apache.spark.sql.SQLContext
object DataMigrator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Migration App").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Import statement to implicitly convert an RDD to a DataFrame
    import sqlContext.implicits._
    val mongoConfig = new Configuration()
    mongoConfig.set("mongo.input.uri", "mongodb://localhost:27017/mongosails4.case")
    val mongoRDD = sc.newAPIHadoopRDD(mongoConfig, classOf[MongoInputFormat], classOf[Object], classOf[BSONObject])
    val count = mongoRDD.count()
    // the count value is approx. 100,000
    println("================ PRINTING =====================")
    println(s"ROW COUNT IS $count")
    println("================ PRINTING =====================")
  }
}
The thing is that, in order to save the data in Parquet format, it is first necessary to convert the mongoRDD variable to a Spark DataFrame. I have tried something like this:
// convert RDD to DataFrame
val myDF = mongoRDD.toDF() // this line throws an error
myDF.write.save("my/path/myData.parquet")
The error I get is this:
Exception in thread "main" scala.MatchError: java.lang.Object (of class scala.reflect.internal.Types.$TypeRef$$anon$6)
Do you have any other idea how I could convert the RDD to a DataFrame so that I can save the data in Parquet format?
Here's the structure of one Document in the mongoDB collection : https://gist.github.com/kingtrocko/83a94238304c2d654fe4

Create a case class representing the data stored in your DBObject.
case class Data(x: Int, s: String)
Then, map the values of your rdd to instances of your case class.
val dataRDD = mongoRDD.values.map { obj => Data(obj.get("x").asInstanceOf[Int], obj.get("s").asInstanceOf[String]) }
Now with your RDD[Data], you can create a DataFrame with the sqlContext
val myDF = sqlContext.createDataFrame(dataRDD)
That should get you going. I can explain more later if needed.
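To make the mapping step concrete without a MongoDB dependency, here is a small sketch that uses a java.util.Map as a stand-in for the BSON document, since both expose get(key) returning an untyped Object. The field names x and s are the placeholders from the case class above, not fields of the real collection, and DocToCaseClass, toData and sampleDoc are hypothetical names.

```scala
import java.util.{HashMap => JHashMap, Map => JMap}

object DocToCaseClass {
  final case class Data(x: Int, s: String)

  // get(...) returns Object, so each field is cast to the type the case class expects.
  // A missing or wrongly typed field would surface here as 0, null, or a ClassCastException.
  def toData(doc: JMap[String, Object]): Data =
    Data(doc.get("x").asInstanceOf[Int], doc.get("s").asInstanceOf[String])

  // Stand-in document used only for illustration.
  def sampleDoc: JMap[String, Object] = {
    val doc = new JHashMap[String, Object]()
    doc.put("x", Integer.valueOf(7))
    doc.put("s", "seven")
    doc
  }
}
```

For the real collection, the case class would need one field per attribute that should end up as a Parquet column.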

display the content of clusters after clustering in streaming-k-means.scala code source in spark

I want to run the StreamingKMeansExample.scala source code (MLlib) on Spark. Can someone tell me how I can display the content of the clusters after clustering? For example, if I cluster the data into 3 clusters, how can I write the content of the 3 clusters to 3 files and the cluster centers to file.txt?
package org.apache.spark.examples.mllib
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StreamingKMeansExample {
  def main(args: Array[String]): Unit = {
    if (args.length != 5) {
      System.err.println("Usage: StreamingKMeansExample " +
        "<trainingDir> <testDir> <batchDuration> <numClusters> <numDimensions>")
      System.exit(1)
    }
    // "localhost" is not a valid master URL; local[*] runs Spark locally on all cores
    val conf = new SparkConf().setMaster("local[*]").setAppName("StreamingKMeansExample")
    val ssc = new StreamingContext(conf, Seconds(args(2).toLong))
    val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
    val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)
    val model = new StreamingKMeans().setK(args(3).toInt)
      .setDecayFactor(1.0)
      .setRandomCenters(args(4).toInt, 0.0)
    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
    ssc.start()
    ssc.awaitTermination()
  }
}

You would have to use the predict method on your RDD.
Then you could zip the RDD containing your values with the RDD of predicted clusters they fall into.
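The predict-then-zip idea can be sketched with plain Scala collections standing in for RDDs. This is only an illustration: predict below is a hand-written nearest-center lookup, not MLlib's implementation, and ClusterZipSketch, Point, sqDist and clusterContents are hypothetical names.

```scala
object ClusterZipSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the nearest center, standing in for a model's predict call.
  def predict(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => sqDist(centers(i), p))

  // Zip every point with its predicted cluster, then group points by cluster id.
  def clusterContents(centers: Seq[Point], points: Seq[Point]): Map[Int, Seq[Point]] = {
    val predicted = points.map(p => predict(centers, p))
    points.zip(predicted)
      .groupBy { case (_, cluster) => cluster }
      .map { case (cluster, pairs) => cluster -> pairs.map(_._1) }
  }
}
```

In Spark, each batch would presumably go through the model returned by latestModel(), whose predict accepts an RDD of vectors, and each cluster's points could then be filtered out and written with saveAsTextFile; the centers are available from clusterCenters on the same model.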