I am trying to read data from Elasticsearch via Spark Scala:
Scala 2.11.8, Spark 2.3.0, Elasticsearch 5.6.8
To connect: spark2-shell --jars elasticsearch-spark-20_2.11-5.6.8.jar
val df = spark.read.format("org.elasticsearch.spark.sql")
  .option("es.nodes", "xxxxxxx")
  .option("es.port", "xxxx")
  .option("es.net.http.auth.user", "xxxxx")
  .option("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .option("es.net.http.auth.pass", "xxxxxx")
  .option("es.net.ssl", "true")
  .option("es.nodes.wan.only", "true")
  .option("es.net.ssl.cert.allow.self.signed", "true")
  .option("es.net.ssl.truststore.location", "xxxxx")
  .option("es.net.ssl.truststore.pass", "xxxxx")
  .option("es.read.field.as.array.include", "true")
  .option("pushdown", "true")
  .option("es.read.field.as.array.include", "a4,a4.a41,a4.a42,a4.a43,a4.a43.a431,a4.a43.a432,a4.a44,a4.a45")
  .load("<index_name>")
Schema as below
|-- a1: string (nullable = true)
|-- a2: string (nullable = true)
|-- a3: struct (nullable = true)
| |-- a31: integer (nullable = true)
| |-- a32: struct (nullable = true)
|-- a4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a41: string (nullable = true)
| | |-- a42: string (nullable = true)
| | |-- a43: struct (nullable = true)
| | | |-- a431: string (nullable = true)
| | | |-- a432: string (nullable = true)
| | |-- a44: string (nullable = true)
| | |-- a45: string (nullable = true)
|-- a8: string (nullable = true)
|-- a9: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a91: string (nullable = true)
| | |-- a92: string (nullable = true)
|-- a10: string (nullable = true)
|-- a11: timestamp (nullable = true)
I am able to read data from top-level columns and level-1 nested fields (i.e. the a9 or a3 columns) via the command:
df.select(explode($"a9").as("exploded")).select("exploded.*").show
The problem occurs when I try to read the a4 elements, as it throws the error below:
[Stage 18:> (0 + 1) / 1]20/02/28 02:43:23 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 (TID 54, xxxxxxx, executor 12): scala.MatchError: Buffer() (of class scala.collection.convert.Wrappers$JListWrapper)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:241)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:231)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$2.apply(CatalystTypeConverters.scala:164)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:164)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:154)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:60)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:57)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:381)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/02/28 02:43:23 ERROR scheduler.TaskSetManager: Task 0 in stage 18.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18.0 failed 4 times, most recent failure: Lost task 0.3 in stage 18.0 (TID 57, xxxxxxx, executor 12): scala.MatchError: Buffer() (of class scala.collection.convert.Wrappers$JListWrapper)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:241)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:231)
Am I doing anything wrong, or am I missing any steps? Please help.
Off the top of my head, this error occurs when the schema guessed by the Spark/Elasticsearch connector is not actually compatible with the data being read.
Keep in mind that ES is schemaless while Spark SQL has a "hard" schema. Bridging this gap is not always possible, so it's all just a best effort.
When connecting the two, the connector samples the documents and tries to guess a schema: "field A is a string, field B is an object with two subfields: B.1 is a date and B.2 is an array of strings", and so on.
If it guessed wrong (typically, a given column or subcolumn is guessed to be a string but in some documents is in fact an array or a number), then the JSON-to-Spark SQL conversion emits these kinds of errors.
In the words of the documentation:
Elasticsearch treats fields with single or multi-values the same; in fact, the mapping provides no information about this. As a client, it means one cannot tell whether a field is single-valued or not until is actually being read. In most cases this is not an issue and elasticsearch-hadoop automatically creates the necessary list/array on the fly. However in environments with strict schema such as Spark SQL, changing a field actual value from its declared type is not allowed. Worse yet, this information needs to be available even before reading the data. Since the mapping is not conclusive enough, elasticsearch-hadoop allows the user to specify the extra information through field information, specifically es.read.field.as.array.include and es.read.field.as.array.exclude.
So I'd advise you to check whether the schema you reported in your question (the schema guessed by Spark) is actually valid against all your documents.
If it's not, you have a few options going forward :
Correct the mapping individually. If the problem is linked to an array type not being recognized as such, you can fix it with configuration options: the es.read.field.as.array.include (resp. .exclude) option actively tells Spark which properties in the documents are arrays (resp. are not arrays); see the sketch after this list. If a field is unused, es.read.field.exclude will exclude it from Spark altogether, bypassing possible schema issues for it.
If there is no way to provide a schema that is valid for all cases (e.g. some field is sometimes a number, sometimes a string, and there is no way to tell), then you're basically stuck with going back to the RDD level (and, if need be, back to Dataset / DataFrame once the schema is well defined).
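To illustrate the first option above, a minimal sketch (connection, auth and SSL settings omitted for brevity; the field list is only an example and has to match your actual mapping, which I cannot verify from here):

val dfArrays = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "xxxxxxx")
  .option("es.port", "xxxx")
  // declare every field that is genuinely multi-valued, nested ones included
  .option("es.read.field.as.array.include", "a4,a9")
  // optional: drop an unused field entirely ("a10" is purely illustrative)
  .option("es.read.field.exclude", "a10")
  .load("<index_name>")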
Related
I have the following code that constructs a VectorAssembler:
val allColsExceptOceanProximity: Array[String] = dfRaw.drop("ocean_proximity").columns
val assembler = new VectorAssembler()
//.setInputCols(Array("longitude", "latitude", "housing_median_age", "total_rooms", "population", "households", "population", "median_income", "median_house_value"))
.setInputCols(allColsExceptOceanProximity)
.setOutputCol("features")
If I use the Array with the columns statically typed, it works. When I pass in the Array dynamically, it fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1763.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1763.0 (TID 100292) (192.168.0.35 executor driver): org.apache.spark.SparkException: Failed to execute user defined function(VectorAssembler$$Lambda$4540/1415205968: (struct<longitude:double,latitude:double,housing_median_age:double,total_rooms:double,total_bedrooms:double,population:double,households:double,median_income:double,median_house_value:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeysOutput_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator.foreach(Iterator.scala:941)
Why is this? Any ideas? Is it expecting a varargs type? I'm working in a notebook and don't have the benefit of an IDE.
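For reference, here is what I plan to try next as a diagnostic; this is purely a guess on my part (I have not confirmed that invalid values are the cause):

import org.apache.spark.ml.feature.VectorAssembler

// Inspect what the dynamically built column list actually contains
println(allColsExceptOceanProximity.mkString(", "))

// Guess: null/NaN values can make the VectorAssembler UDF fail, so skip
// invalid rows while assembling (supported since Spark 2.4)
val assembler = new VectorAssembler()
  .setInputCols(allColsExceptOceanProximity)
  .setOutputCol("features")
  .setHandleInvalid("skip")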
EDIT: Added the schema:
root
|-- longitude: double (nullable = true)
|-- latitude: double (nullable = true)
|-- housing_median_age: double (nullable = true)
|-- total_rooms: double (nullable = true)
|-- total_bedrooms: double (nullable = true)
|-- population: double (nullable = true)
|-- households: double (nullable = true)
|-- median_income: double (nullable = true)
|-- median_house_value: double (nullable = true)
|-- ocean_proximity: string (nullable = true)
I'm trying to query Elasticsearch from Spark. Here is my code:
val esURL = "ES.com"
val reader = spark.read
.format("org.elasticsearch.spark.sql")
.option("es.nodes.wan.only","true")
.option("es.port","443")
.option("es.net.ssl","true")
.option("es.nodes", esURL)
.option("es.net.http.auth.user","admin")
.option("es.net.http.auth.pass","pass")
val df = reader.load("store_products")
// df.printSchema()
df.select($"about").show()
Here is the schema:
root
|-- about: struct (nullable = true)
|-- agg: struct (nullable = true)
|-- attributes: struct (nullable = true)
| |-- spec: string (nullable = true)
| |-- capacity: string (nullable = true)
I tried to extract the about field, but it's giving the following error.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 4 times, most recent failure: Lost task 0.3 in stage 15.0 (TID 54, 10.83.228.77, executor 1): org.elasticsearch.hadoop.rest.EsHadoopParsingException: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'about' not found; typically this occurs with arrays which are not mapped as single value
Not sure where the exact problem is.
Update:
Spark version: includes Apache Spark 3.1.2, Scala 2.12
ES jar file: elasticsearch-spark-20_2.11-7.9.0.jar
ES Version: 7.9
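Based on the hint in the error message about arrays, one thing I am considering (unverified, just a sketch) is declaring about as a multi-valued field when building the reader:

val readerWithArrays = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes.wan.only", "true")
  .option("es.port", "443")
  .option("es.net.ssl", "true")
  .option("es.nodes", esURL)
  .option("es.net.http.auth.user", "admin")
  .option("es.net.http.auth.pass", "pass")
  // tell the connector that 'about' can hold multiple values
  .option("es.read.field.as.array.include", "about")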
I have a Hive table with the following structure:
CREATE TABLE gcganamrswp_work.historical_trend_result(
column_name string,
metric_name string,
current_percentage string,
lower_threshold double,
upper_threshold double,
calc_status string,
final_status string,
support_override string,
dataset_name string,
insert_timestamp string,
appid string,
currentdate string,
indicator map<string,string>)
PARTITIONED BY (
appname string,
year_month int)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");
I have a Spark DataFrame with this schema:
root
|-- metric_name: string (nullable = true)
|-- column_name: string (nullable = true)
|-- Lower_Threshold: double (nullable = true)
|-- Upper_Threshold: double (nullable = true)
|-- Current_Percentage: double (nullable = true)
|-- Calc_Status: string (nullable = false)
|-- Final_Status: string (nullable = false)
|-- support_override: string (nullable = false)
|-- Dataset_Name: string (nullable = false)
|-- insert_timestamp: string (nullable = false)
|-- appId: string (nullable = false)
|-- currentDate: string (nullable = false)
|-- indicator: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = false)
|-- appname: string (nullable = false)
|-- year_month: string (nullable = false)
When I try to insert into the Hive table using the code below, it fails.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
data_df.repartition(1)
.write.mode("append")
.format("hive")
.insertInto(Outputhive_table)
Spark Version : Spark 2.4.0
Error:
ERROR Hive:1987 - Exception when loading partition with parameters
partPath=hdfs://gcgprod/data/work/hive/historical_trend_result/.hive-staging_hive_2021-09-01_04-34-04_254_8783620706620422928-1/-ext-10000/_temporary/0,
table=historical_trend_result, partSpec={appname=, year_month=},
replace=false, listBucketingEnabled=false, isAcid=false,
hasFollowingStatsTask=false
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Partition spec is incorrect. {appname=, year_month=})
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1662)
at org.apache.hadoop.hive.ql.metadata.Hive.lambda$loadDynamicPartitions$4(Hive.java:1970)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: MetaException(message:Partition spec is incorrect. {appname=, year_month=})
at org.apache.hadoop.hive.metastore.Warehouse.makePartName(Warehouse.java:329)
at org.apache.hadoop.hive.metastore.Warehouse.makePartPath(Warehouse.java:312)
at org.apache.hadoop.hive.ql.metadata.Hive.genPartPathFromTable(Hive.java:1751)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1607)
I have specified the partition columns as the last columns of the DataFrame, so I expect it to treat the last two columns as the partition columns. I want to use the same routine for inserting into different tables, so I don't want to specify the partition columns explicitly.
Just to recap: you are using Spark to write data to a Hive table with dynamic partitions. My answer below is based on that understanding; if it is incorrect, please feel free to correct me in a comment.
While your table is dynamically partitioned (by appname and year_month), the Spark job doesn't know the partitioning fields of the destination, so you still have to tell it about the partitioning columns of the destination table.
Something like this should work
import org.apache.spark.sql.SaveMode

data_df.repartition(1)
  .write
  .partitionBy("appname", "year_month")
  .mode(SaveMode.Append)
  .saveAsTable(Outputhive_table)
Make sure that you enable support for dynamic partitions by executing something like
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
Check out this post by Itai Yaffe; it may be handy: https://medium.com/nmc-techblog/spark-dynamic-partition-inserts-part-1-5b66a145974f
I think the problem is that some records have empty appname and year_month values. At least this is suggested by:
Partition spec is incorrect. {appname=, year_month=}
Make sure the partition columns are never empty or null! Also note that the type of year_month is not consistent between the DataFrame (string) and your Hive schema (int).
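A minimal sketch of the kind of sanity check implied here (column names are taken from the question; the filter itself is only my suggestion):

import org.apache.spark.sql.functions.{col, trim}

// Count rows whose partition columns are null or blank; such rows produce a
// partition spec like {appname=, year_month=}
val badRows = data_df.filter(
  col("appname").isNull || trim(col("appname")) === "" ||
  col("year_month").isNull || trim(col("year_month")) === "")
println(s"rows with empty partition values: ${badRows.count()}")

// Optionally align year_month with the table definition (int)
val cleaned_df = data_df.withColumn("year_month", col("year_month").cast("int"))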
Trying to run:
val outputDF = hiveContext.createDataFrame(myRDD, schema)
Getting this error:
Caused by: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of struct<col1name:string,col2name:string>
myRDD.take(5).foreach(println)
[string number,[Lscala.Tuple2;#163601a5]
[1234567890,[Lscala.Tuple2;#6fa7a81c]
data of the RDD:
RDD[Row]: [string number, [(string key, string value)]]
Row(string, Array(Tuple(String, String)))
where the tuple2 contains data like this:
(string key, string value)
schema:
root
|-- col1name: string (nullable = true)
|-- col2name: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col3name: string (nullable = true)
| | |-- col4name: string (nullable = true)
import org.apache.spark.sql.types._

StructType(Seq(
  StructField("col1name", StringType, true),
  StructField("col2name", ArrayType(
    StructType(Seq(
      StructField("col3name", StringType, true),
      StructField("col4name", StringType, true)
    )),
    true
  ), true)
))
This code used to run on Spark 1.6 without problems. On Spark 2.4, it appears that Tuple2 no longer counts as a struct type? If so, what should it be changed to?
I'm assuming the easiest solution would be to adjust the schema to suit the data.
Let me know if more details are needed
The fix is to change the tuple that contained the two strings into a Row containing the two strings instead.
So for the provided schema, the incoming data structure was
Row(string, Array(Tuple(String, String)))
This was changed to
Row(string, Array(Row(String, String)))
in order to continue using the same schema.
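A minimal sketch of that change, assuming the rows are built upstream from an id plus a collection of (key, value) tuples (rawRDD and the variable names are placeholders, not from the original code):

import org.apache.spark.sql.Row

// Emit nested Rows instead of tuples so the array column matches
// ArrayType(StructType(col3name, col4name))
val myRDD = rawRDD.map { case (id: String, pairs: Seq[(String, String)]) =>
  Row(id, pairs.map { case (k, v) => Row(k, v) })
}
val outputDF = hiveContext.createDataFrame(myRDD, schema)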
I am trying to use SVMWithSGD to train my model, but I encounter a ClassCastException when the training runs.
My train_data DataFrame schema looks like:
train_data.printSchema
root
|-- label: string (nullable = true)
|-- features: vector (nullable = true)
|-- label_index: double (nullable = false)
I created a LabeledPoint RDD to use with SVMWithSGD:
val targetInd = train_data.columns.indexOf("label_index")
val featInd = Array("features").map(train_data.columns.indexOf(_))
val train_lp = train_data.rdd.map(r => LabeledPoint(r.getDouble(targetInd),
  Vectors.dense(featInd.map(r.getDouble(_)).toArray)))
But When I call
SVMWithSGD.train(train_lp, numIterations)
it gives me:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1364)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.take(RDD.scala:1337)
at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1378)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.first(RDD.scala:1377)
at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.generateInitialWeights(GeneralizedLinearAlgorithm.scala:204)
at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:234)
at org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:217)
at org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:255)
... 55 elided
Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
My train_data was created from label (the file name) and features (a JSON file representing image features).
Try using this.
Schema
train_data.printSchema
root
|-- label: string (nullable = true)
|-- features: vector (nullable = true)
|-- label_index: double (nullable = false)
Modify your code as below:
val train_lp = train_data.rdd.map(r => LabeledPoint(r.getAs("label_index"), r.getAs("features")))
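A slightly fuller sketch with the imports this relies on (mllib types assumed, since SVMWithSGD lives in mllib; numIterations is a placeholder value, and whether the features column really holds mllib vectors is an assumption the question does not confirm):

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// Pull the label and feature vector straight out of each Row by field name
val train_lp = train_data.rdd.map(r =>
  LabeledPoint(r.getAs[Double]("label_index"), r.getAs[Vector]("features")))

val numIterations = 100 // placeholder value
val model = SVMWithSGD.train(train_lp, numIterations)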