KMeansModel.clusterCenters returns NULL - scala

I am using AWS glue to execute Kmeans clustering on my dataset. I wish to find not only the cluster labels but also the cluster centers. I am failing to find the later.
In the code below model.clusterCenters returns NULL. KMeans clustering works fine, and it returns the cluster label i.e. clusterInstance variable.
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
object Clustering {
case class ObjectDay(realnumber: Double, bnumber : Double, blockednumber: Double,
creationdate : String, fname : String, uniqueid : Long, registrationdate : String,
plusnumber : Double, cvalue : Double, hvalue : Double)
case class ClusterInfo( instance: Int, centers: String)
def main(args: Array[String]): Unit = {
val sc: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(sc)
val spark: SparkSession = glueContext.getSparkSession
import spark.implicits._
// write your code here - start
// Data Catalog: database and table name
val dbName = "dbname"
val tblName = "raw"
val sqlText = "SELECT <columns removed> FROM viewname WHERE `creation_date` ="
// S3 location for output
val outputDir = "s3://blucket/path/"
// Read data into a DynamicFrame using the Data Catalog metadata
val rawDyf: DynamicFrame = glueContext.getCatalogSource(database = dbName, tableName = tblName).getDynamicFrame()
// get only single day data with only numbers
// Spark SQL on a Spark dataframe
val numberDf = rawDyf.toDF()
numberDf.createOrReplaceTempView("viewname")
def getDataViaSql(runDate : LocalDate): RDD[ObjectDay] ={
val data = spark.sql(s"${sqlText} '${runDate.toString}'")
data.as[ObjectDay].rdd
}
def getDenseVector(rddnumbers: RDD[ObjectDay]): RDD[linalg.Vector]={
rddnumbers.map(s => Vectors.dense(Array(s.realnumber, s.bnumber, s.blockednumber))).cache()
}
def getClusters( numbers: RDD[linalg.Vector] ): RDD[ClusterInfo] = {
// Trains a k-means model
val model: KMeansModel = KMeans.train(numbers, 2, 20)
val centers: Array[linalg.Vector] = model.clusterCenters
//put together unique_ids with cluster predictions
val clusters: RDD[Int] = model.predict(numbers)
clusters.map{ clusterInstance =>
ClusterInfo(clusterInstance.toInt, centers(clusterInstance).toJson)
}
}
def combineDataAndClusterInstances(rddnumbers : RDD[ObjectDay], clusterCenters: RDD[ClusterInfo]): DataFrame ={
val numbersWithCluster = rddnumbers.zip(clusterCenters)
numbersWithCluster.map(
x =>
(x._1.realnumber, x._1.bnumber, x._1.blockednumber, x._1.creationdate, x._1.fname,
x._1.uniqueid, x._1.registrationdate, x._1.plusnumber, x._1.cvalue, x._1.hvalue,
x._2.instance, x._2.centers)
)
.toDF("realnumber", "bnumber", "blockednumber", "creationdate",
"fname","uniqueid", "registrationdate", "plusnumber", "cvalue", "hvalue",
"clusterInstance", "clusterCenter")
}
def process(runDate : LocalDate): DataFrame = {
val rddnumbers = getDataViaSql( runDate)
val dense = getDenseVector(rddnumbers)
val clusterCenters = getClusters(dense)
combineDataAndClusterInstances(rddnumbers, clusterCenters)
}
val startdt = LocalDate.parse("2018-01-01", DateTimeFormatter.ofPattern("yyyy-MM-dd"))
val dfByDates = (0 to 240)
.map(days => startdt.plusDays(days))
.map(process(_))
val result = dfByDates.tail.fold(dfByDates.head)((accDF, newDF) => accDF.union(newDF))
val output = DynamicFrame(result, glueContext).withName(name="prediction")
// write your code here - end
glueContext.getSinkWithFormat(connectionType = "s3",
options = JsonOptions(Map("path" -> outputDir)), format = "csv").writeDynamicFrame(output)
}
}
I can successfully find the cluster centres using Python sklearn library on the same data.
UPDATED: Showing the complete Scala code which runs as Glue job. Also I am not getting any error while running the job. I just dont get any cluster centres.
What am I missing ?

Nevermind. It is generating cluster centres.
I didnt see the S3 output files until now.
I was running Glue Crawler and looking at the results in AWS Athena.
The crawler created a struct or array column datatype for clustercenter column and Athena failed to parse and read the JSON stored as string in the CSV output.
Sorry to bother.

Related

Receive a Kafka message through Spark Streaming and delete Phoenix/HBase data

In my project, I have the current workflow:
Kafka message => Spark Streaming/processing => Insert/Update to HBase and/or Phoenix
Both the Insert and Update operation works with HBase directly or through Phoenix (I tested both case).
Now I would like to delete data in HBase/Phoenix if I receive a specific message on Kafka. I did not find any clues/documentation on how I could do that while streaming.
I have found a way to delete data in "static"/"batch" mode with both HBase and Phoenix, but the same code does not work when on streaming (there is no error though, the data is simply not deleted).
Here is how we tried the delete part (we first create a parquet file on which we make a "readStream" to start a "fake" stream):
Main Object:
import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.functions.{col, concat_ws} import org.apache.spark.sql.streaming.DataStreamWriter
object AppMain{ def main(args: Array[String]): Unit = {
val spark : SparkSession = SparkSession.builder().getOrCreate()
import spark.implicits._
//algo_name, partner_code, site_code, indicator
val df = Seq(("FOO","FII"),
("FOO","FUU")
).toDF("col_1","col_2")
df.write.mode("overwrite").parquet("testParquetStream")
df.printSchema()
df.show(false)
val dfStreaming = spark.readStream.schema(df.schema).parquet("testParquetStream")
dfStreaming.printSchema()
val dfWithConcat = dfStreaming.withColumn("row", concat_ws("\u0000" , col("col_1"),col("col_2"))).select("row")
// using delete class
val withDeleteClass : WithDeleteClass = new WithDeleteClass(spark, dfWithConcat)
withDeleteClass.delete_data()
// using JDBC/Phoenix
//val withJDBCSink : WithJDBCSink = new WithJDBCSink(spark, dfStreaming)
//withJDBCSink.delete_data() }
WithJDBCSink class
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.streaming.DataStreamWriter
class WithJDBCSink (spark : SparkSession,dfToDelete : DataFrame){
val df_writer = dfToDelete.writeStream
def delete_data():Unit = {
val writer = new JDBCSink()
df_writer.foreach(writer).outputMode("append").start()
spark.streams.awaitAnyTermination()
}
}
JDBCSink class
import java.sql.{DriverManager, PreparedStatement, Statement}
class JDBCSink() extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {
val quorum = "phoenix_quorum"
var connection: java.sql.Connection = null
var statement: Statement = null
var ps : PreparedStatement= null
def open(partitionId: Long, version: Long): Boolean = {
connection = DriverManager.getConnection(s"jdbc:phoenix:$quorum")
statement = connection.createStatement()
true
}
def process(row: org.apache.spark.sql.Row): Unit = {
//-----------------------Method 1
connection = DriverManager.getConnection(s"jdbc:phoenix:$quorum")
val query = s"DELETE from TABLE_NAME WHERE key_1 = 'val1' and key_2 = 'val2'"
statement = connection.createStatement()
statement.executeUpdate(query)
connection.commit()
//-----------------------Method 2
//val query2 = s"DELETE from TABLE_NAME WHERE key_1 = ? and key_2 = ?"
//connection = DriverManager.getConnection(s"jdbc:phoenix:$quorum")
//ps = connection.prepareStatement(query2)
//ps.setString(1, "val1")
//ps.setString(2, "val2")
//ps.executeUpdate()
//connection.commit()
}
def close(errorOrNull: Throwable): Unit = {
connection.commit()
connection.close
}
}
WithDeleteClass
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete, HTable, RetriesExhaustedWithDetailsException, Row, Table}
import org.apache.hadoop.hbase.ipc.CallTimeoutException
import org.apache.hadoop.hbase.util.Bytes
import org.apache.log4j.Logger
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import java.util
import org.apache.spark.sql.streaming.DataStreamWriter
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.storage.StorageLevel
import java.sql.Connection
import java.sql.DriverManager
import java.sql.ResultSet
import java.sql.SQLException
import java.sql.Statement
class WithDeleteClass (spark : SparkSession, dfToDelete : DataFrame){
val finalData = dfToDelete
var df_writer = dfToDelete.writeStream
var dfWriter = finalData.writeStream.outputMode("append").format("console").start()
def delete_data(): Unit = {
deleteDataObj.open()
df_writer.foreachBatch((output : DataFrame, batchId : Long) =>
deleteDataObj.process(output)
)
df_writer.start()
}
object deleteDataObj{
var quorum = "hbase_quorum"
var zkPort = "portNumber"
var hbConf = HBaseConfiguration.create()
hbConf.set("hbase.zookeeper.quorum", quorum)
hbConf.set("hbase.zookeeper.property.clientPort", zkPort)
var tableName: TableName = TableName.valueOf("TABLE_NAME")
var conn = ConnectionFactory.createConnection(hbConf)
var table: Table = conn.getTable(tableName)
var hTable: HTable = table.asInstanceOf[HTable]
def open() : Unit = {
}
def process(df : DataFrame) : Unit = {
val rdd : RDD[Array[Byte]] = df.rdd.map(row => Bytes.toBytes(row(0).toString))
val deletions : util.ArrayList[Delete] = new util.ArrayList()
//List all rows to delete
rdd.foreach(row => {
val delete: Delete = new Delete(row)
delete.addColumns(Bytes.toBytes("0"), Bytes.toBytes("DATA"))
deletions.add(delete)
})
hTable.delete(deletions)
}
def close(): Unit = {}
}
}
Any help/pointers would be greatly appreciated

Update to the delta table in spark not working

package jobs
import io.delta.tables.DeltaTable
import model.RandomUtils
import org.apache.spark.sql.streaming.{ OutputMode, Trigger }
import org.apache.spark.sql.{ DataFrame, Dataset, Encoder, Encoders, SparkSession }
import jobs.SystemJob.Rate
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
case class Student(firstName: String, lastName: String, age: Long, percentage: Long)
case class Rate(timestamp: Timestamp, value: Long)
case class College(name: String, address: String, principal: String)
object RCConfigDSCCDeltaLake {
def getSpark(): SparkSession = {
SparkSession.builder
.appName("Delta table demo")
.master("local[*]")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.getOrCreate()
}
def main(args: Array[String]): Unit = {
val spark = getSpark()
val rate = 1
val studentProfile = "student_profile"
if (!DeltaTable.isDeltaTable(s"spark-warehouse/$studentProfile")) {
val deltaTable: DataFrame = spark.sql(s"CREATE TABLE `$studentProfile` (firstName String, lastName String, age Long, percentage Long) USING delta")
deltaTable.show()
deltaTable.printSchema()
}
val studentProfileDT = DeltaTable.forPath(spark, s"spark-warehouse/$studentProfile")
def processStream(student: Dataset[Student], college: Dataset[College]) = {
val studentQuery = student.writeStream.outputMode(OutputMode.Update()).foreachBatch {
(st: Dataset[Student], y: Long) =>
val listOfStudents = st.collect().toList
println("list of students :::" + listOfStudents)
val (o, n) = ("oldData", "newData")
val colMap = Map(
"firstName" -> col(s"$n.firstName"),
"lastName" -> col(s"$n.lastName"),
"age" -> col(s"$n.age"),
"percentage" -> col(s"$n.percentage"))
studentProfileDT.as(s"$o").merge(st.toDF.as(s"$n"), s"$o.firstName = $n.firstName AND $o.lastName = $n.lastName")
.whenMatched.update(colMap)
.whenNotMatched.insert(colMap)
.execute()
}.start()
val os = spark.readStream.format("delta").load(s"spark-warehouse/$studentProfile").writeStream.format("console")
.outputMode(OutputMode.Append())
.option("truncate", value = false)
.option("checkpointLocation", "retrieved").start()
studentQuery.awaitTermination()
os.awaitTermination()
}
import spark.implicits._
implicit val encStudent: Encoder[Student] = Encoders.product[Student]
implicit val encCollege: Encoder[College] = Encoders.product[College]
def rateStream = spark
.readStream
.format("rate") // <-- use RateStreamSource
.option("rowsPerSecond", rate)
.load()
.as[Rate]
val studentStream: Dataset[Student] = rateStream.filter(_.value % 25 == 0).map {
stu =>
Student(...., ....., ....., .....) //fill with values
}
val collegeStream: Dataset[College] = rateStream.filter(_.value % 40 == 0).map {
stu =>
College(...., ....., ......) //fill with values
}
processStream(studentStream, collegeStream)
}
}
What I am trying to do is a simple UPSERT operation with streaming datasets. But it fails with error
22/04/13 19:50:33 ERROR MicroBatchExecution: Query [id = 8cf759fd-9bee-460f-
b0d9-91889c59c524, runId = 55723708-fd3c-4425-a2bc-83d737c37589] terminated with
error
java.lang.UnsupportedOperationException: Detected a data update (for example part-
00000-d026d92e-1798-4d21-a505-67ec72d334e2-c000.snappy.parquet) in the source table
at version 4. This is currently not supported. If you'd like to ignore updates, set
the option 'ignoreChanges' to 'true'. If you would like the data update to be
reflected, please restart this query with a fresh checkpoint directory.
Dependencies :
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
--packages io.delta:delta-core_2.12:0.7.0
The update query works when the datasets are not streamed and only hardcoded.
Am I doing something wrong here ?

Spark ML insert/fit custom OneHotEncoder into a Pipeline

Say I have a few features/columns in a dataframe on which I apply the regular OneHotEncoder, and one (let, n-th) column on which I need to apply my custom OneHotEncoder. Then I need to use VectorAssembler to assemble those features, and put into a Pipeline, finally fitting my trainData and getting predictions from my testData, such as:
val sIndexer1 = new StringIndexer().setInputCol("my_feature1").setOutputCol("indexed_feature1")
// ... let, n-1 such sIndexers for n-1 features
val featureEncoder = new OneHotEncoderEstimator().setInputCols(Array(sIndexer1.getOutputCol), ...).
setOutputCols(Array("encoded_feature1", ... ))
// **need to insert output from my custom OneHotEncoder function (please see below)**
// (which takes the n-th feature as input) in a way that matches the VectorAssembler below
val vectorAssembler = new VectorAssembler().setInputCols(featureEncoder.getOutputCols + ???).
setOutputCol("assembled_features")
...
val pipeline = new Pipeline().setStages(Array(sIndexer1, ...,featureEncoder, vectorAssembler, myClassifier))
val model = pipeline.fit(trainData)
val predictions = model.transform(testData)
How can I modify the building of the vectorAssembler so that it can ingest the output from the custom OneHotEncoder?
The problem is my desired oheEncodingTopN() cannot/should not refer to the "actual" dataframe, since it would be a part of the pipeline (to apply on trainData/testData).
Note:
I tested that the custom OneHotEncoder (see link) works just as expected separately on e.g. trainData. Basically, oheEncodingTopN applies OneHotEncoding on the input column, but for the top N frequent values only (e.g. N = 50), and put all the rest infrequent values in a dummy column (say, "default"), e.g.:
val oheEncoded = oheEncodingTopN(df, "my_featureN", 50)
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.sql.Column
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
def oheEncodingTopN(df: DataFrame, colName: String, n: Int): DataFrame = {
df.createOrReplaceTempView("data")
val topNDF = spark.sql(s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")
val pivotTopNDF = topNDF.
groupBy(colName).
pivot(colName).
count().
withColumn("default", lit(1))
val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)
val oheEncodedDF = joinedTopNDF.
na.fill(0, joinedTopNDF.columns).
withColumn("default", flip(col("default")))
oheEncodedDF
}
I think the cleanest way would be to create your own class that extends spark ML Transformer so that you can play with as you would do with any other transformer (like OneHotEncoder). Your class would look like this :
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.Param
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Dataset, Column}
class OHEncodingTopN(n :Int, override val uid: String) extends Transformer {
final val inputCol= new Param[String](this, "inputCol", "The input column")
final val outputCol = new Param[String](this, "outputCol", "The output column")
; def setInputCol(value: String): this.type = set(inputCol, value)
def setOutputCol(value: String): this.type = set(outputCol, value)
def this(n :Int) = this(n, Identifiable.randomUID("OHEncodingTopN"))
def copy(extra: ParamMap): OHEncodingTopN = {
defaultCopy(extra)
}
override def transformSchema(schema: StructType): StructType = {
// Check that the input type is what you want if needed
// val idx = schema.fieldIndex($(inputCol))
// val field = schema.fields(idx)
// if (field.dataType != StringType) {
// throw new Exception(s"Input type ${field.dataType} did not match input type StringType")
// }
// Add the return field
schema.add(StructField($(outputCol), IntegerType, false))
}
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
def transform(df: Dataset[_]): DataFrame = {
df.createOrReplaceTempView("data")
val colName = $(inputCol)
val topNDF = df.sparkSession.sql(s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")
val pivotTopNDF = topNDF.
groupBy(colName).
pivot(colName).
count().
withColumn("default", lit(1))
val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)
val oheEncodedDF = joinedTopNDF.
na.fill(0, joinedTopNDF.columns).
withColumn("default", flip(col("default")))
oheEncodedDF
}
}
Now on a OHEncodingTopN object you should be able to call .getOuputCol to perform what you want. Good luck.
EDIT: your method that I just copy pasted in the transform method should be slightly modified in order to output a column of type Vector having the name given in the setOutputCol.

Best way to convert online csv to dataframe scala

I am trying to figure out the most efficient way to accomplish putting this online csv file into a data frame in Scala.
To save a download, the csv file in the code looks like this:
"Symbol","Name","LastSale","MarketCap","ADR
TSO","IPOyear","Sector","Industry","Summary Quote"
"DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"
"MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"
....
From my research, I start by downloading the csv, and placing it into a list buffer (since you can't do this with a list because it's immutable):
import scala.collection.mutable.ListBuffer
val sc = new SparkContext(conf)
var stockInfoNYSE_ListBuffer = new ListBuffer[java.lang.String]()
import scala.io.Source
val bufferedSource =
Source.fromURL("http://www.nasdaq.com/screening/companies-by-
industry.aspx?exchange=NYSE&render=download")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
stockInfoNYSE_ListBuffer += s"${cols(0)},${cols(1)},${cols(2)},${cols(3)},${cols(4)},${cols(5)},${cols(6)},${cols(7)},${cols(8)}"
}
bufferedSource.close
val stockInfoNYSE_List = stockInfoNYSE_ListBuffer.toList
So we have a list. You can basically get each value like this:
// SYMBOL : stockInfoNYSE_List(1).split(",")(0)
// COMPANY NAME : stockInfoNYSE_List(1).split(",")(1)
// IPOYear : stockInfoNYSE_List(1).split(",")(5)
// Sector : stockInfoNYSE_List(1).split(",")(6)
// Industry : stockInfoNYSE_List(1).split(",")(7)
Here is where I get stuck- how do I get this to a dataframe? The wrong approaches I have taken. I didn't put all the values in just yet- was a simple test.
case class StockMap(Symbol: String, Name: String)
val caseClassDS = Seq(StockMap(stockInfoNYSE_List(1).split(",")(0),
StockMap(stockInfoNYSE_List(1).split(",")(1))).toDS()
caseClassDS.show()
The problem with the approach above: I can only figure out how to add one sequence (row) by hard coding it. I want every Row in the list.
My second failed attempt:
val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val test = stockInfoNYSE_List.toDF
This will just give you the array, and I want to divide up the values.
Array(["Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote"], ["DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"], ["MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"],.......
case class TestClass(Symbol:String,Name:String,LastSale:String,MarketCap :String,ADR_TSO:String,IPOyear:String,Sector: String,Industry:String,Summary_Quote:String
| )
defined class TestClass
var stockDF= stockInfoNYSE_ListBuffer.drop(1)
val demoDS = stockDF.map(line => {
val fields = line.replace("\"","").split(",")
TestClass(fields(0), fields(1), fields(2),fields(3), fields(4), fields(5),fields(6), fields(7), fields(8))
})
scala> demoDS.toDS.show
+------+--------------------+--------+---------------+-------------+-------+-----------------+--------------------+--------------------+
|Symbol| Name|LastSale| MarketCap| ADR_TSO|IPOyear| Sector| Industry| Summary_Quote|
+------+--------------------+--------+---------------+-------------+-------+-----------------+--------------------+--------------------+
| DDD|3D Systems Corpor...| 18.09| 2058834640.41| n/a| n/a| Technology|Computer Software...|http://www.nasdaq...|
| MMM| 3M Company| 211.68|126423673447.68| n/a| n/a| Health Care|Medical/Dental In...|http://www.nasdaq...|
In case anyone is trying to get this example working, here is the code using the above solution:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import scala.collection.mutable.ListBuffer
import sqlContext.implicits._
var stockInfoNYSE_ListBuffer = new ListBuffer[java.lang.String]()
import scala.io.Source
val bufferedSource =
Source.fromURL("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE&render=download")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
stockInfoNYSE_ListBuffer += s"${cols(0)},${cols(1)},${cols(2)},${cols(3)},${cols(4)},${cols(5)},${cols(6)},${cols(7)},${cols(8)}"
}
bufferedSource.close
case class TestClass(Symbol:String,Name:String,LastSale:String,MarketCap :String,ADR_TSO:String,IPOyear:String,Sector: String,Industry:String,Summary_Quote:String )
var stockDF= stockInfoNYSE_ListBuffer.drop(1)
val demoDS = stockDF.map(line => {
val fields = line.replace("\"","").split(",")
TestClass(fields(0), fields(1), fields(2),fields(3), fields(4), fields(5),fields(6), fields(7), fields(8))
})
demoDS.toDF().show

GraphX not working properly Spark / Scala

I am trying to create a GraphX object in apache Spark/Scala but it doesn't seem to be working for some reason. I have attached a file of the example input file, the actual program code is:
package SGraph
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._
`
object GooglePlusGraph {
/** Our main function where the action happens */
def main(args: Array[String]) {
// Set the log level to only print errors
Logger.getLogger("org").setLevel(Level.ERROR)
// Create a SparkContext using every core of the local machine
val sc = new SparkContext("local[*]", "GooglePlusGraphX")
val lines = sc.textFile("../Example.txt")
val ratings = lines.map(x => x.toString().split(":")(0))
val verts = ratings.map(line => (line.toLong,line))
val edges = lines.flatMap(makeEdges)
val default = "Nobody"
val graph = Graph(verts, edges, default).cache()
graph.degrees.join(verts).take(10).foreach(println)
}
def makeEdges(line: String) : List[Edge[Int]] = {
import scala.collection.mutable.ListBuffer
var edges = new ListBuffer[Edge[Int]]()
val fields = line.split(",").flatMap(a => a.split(":"))
val origin = fields(0)
for (x <- 1 to (fields.length - 1)) {
// Our attribute field is unused, but in other graphs could
// be used to deep track of physical distances etc.
edges += Edge(origin.toLong, fields(x).toLong, 0)
}
return edges.toList
}
}
The first error i get is the following:
16/12/19 01:28:33 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 3)
java.lang.NumberFormatException: For input string: "935750800736168978117"
thanks for any help !
It's the same issue with the following your question.
Cannot convert string to a long in scala
The given number has 21 digits beyond the maximum number of digits of Long (19 digits).