I am trying to convert a column which contains Array[String] to String, but I consistently get this error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 78.0 failed 4 times, most recent failure: Lost task 0.3 in stage 78.0 (TID 1691, ip-******): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
Here's the piece of code
val mkString = udf((arrayCol:Array[String])=>arrayCol.mkString(","))
val dfWithString=df.select($"arrayCol").withColumn("arrayString",
mkString($"arrayCol"))
A WrappedArray is not an Array (which is a plain old Java array, not a native Scala collection). You can either change the signature to:
import scala.collection.mutable.WrappedArray
(arrayCol: WrappedArray[String]) => arrayCol.mkString(",")
or use one of the supertypes like Seq:
(arrayCol: Seq[String]) => arrayCol.mkString(",")
In recent Spark versions you can use concat_ws instead:
import org.apache.spark.sql.functions.concat_ws
df.select(concat_ws(",", $"arrayCol"))
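For reference, a minimal self-contained sketch of both approaches, assuming Spark 2.x and made-up sample data (only the column name arrayCol is taken from the question):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat_ws, udf}

val spark = SparkSession.builder().appName("array-to-string").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data: one array<string> column named "arrayCol"
val df = Seq(Seq("a", "b", "c"), Seq("d", "e")).toDF("arrayCol")

// Option 1: a udf typed against Seq[String] (Spark hands the udf a WrappedArray, which is a Seq)
val mkString = udf((arrayCol: Seq[String]) => arrayCol.mkString(","))
df.withColumn("arrayString", mkString($"arrayCol")).show()

// Option 2: the built-in concat_ws, no udf needed
df.withColumn("arrayString", concat_ws(",", $"arrayCol")).show()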
This code works for me:
import scala.collection.mutable.WrappedArray
df.select("wifi_ids").rdd.map(row => row.get(0).asInstanceOf[WrappedArray[WrappedArray[String]]].toSeq.map(x => x.toSeq.apply(0)))
In your case, I guess it would be:
val mkString = udf((arrayCol: Seq[String]) => arrayCol.asInstanceOf[WrappedArray[String]].toArray.mkString(","))
val dfWithString=df.select($"arrayCol").withColumn("arrayString",mkString($"arrayCol"))
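For the nested wifi_ids snippet above, here is a slightly fuller sketch with made-up data; it assumes a SparkSession named spark with spark.implicits._ in scope, as in the earlier sketch:
import scala.collection.mutable.WrappedArray

// Hypothetical column of type array<array<string>>
val nested = Seq(Seq(Seq("ap1", "-40"), Seq("ap2", "-55"))).toDF("wifi_ids")

// Take the first element of each inner array, as in the snippet above
val firstIds = nested.select("wifi_ids").rdd.map { row =>
  row.get(0).asInstanceOf[WrappedArray[WrappedArray[String]]].toSeq.map(_.head)
}
firstIds.collect().foreach(println)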
I am programmatically trying to convert datatypes of columns and running into some coding issues.
I modified the code used here for this.
Data >> all the numbers are being read in as strings.
Code >>
import org.apache.spark.sql
raw_data.schema.fields
.collect({case x if x.dataType.typeName == "string" => x.name})
.foldLeft(raw_data)({case(dframe,field) => dframe(field).cast(sql.types.IntegerType)})
Error >>
<console>:75: error: type mismatch;
found : org.apache.spark.sql.Column
required: org.apache.spark.sql.DataFrame
(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
.foldLeft(raw_data)({case(dframe,field) => dframe(field).cast(sql.types.IntegerType)})
The problem is that the result of dframe(field).cast(sql.types.IntegerType) in the foldLeft is a Column; however, to continue the iteration a DataFrame is expected. In the link the code originally comes from, dframe.drop(field) is used, which does return a DataFrame and hence works.
To fix this, simply use withColumn, which adjusts a specific column and then returns the whole DataFrame:
.foldLeft(raw_data)({ case (dframe, field) => dframe.withColumn(field, dframe(field).cast(sql.types.IntegerType)) })
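Put together, the corrected pipeline would look roughly like this (raw_data is the DataFrame from the question):
import org.apache.spark.sql
import org.apache.spark.sql.DataFrame

// Cast every string-typed column to IntegerType, threading the whole
// DataFrame through the fold instead of returning a single Column.
val converted: DataFrame = raw_data.schema.fields
  .collect { case x if x.dataType.typeName == "string" => x.name }
  .foldLeft(raw_data) { case (dframe, field) =>
    dframe.withColumn(field, dframe(field).cast(sql.types.IntegerType))
  }

converted.printSchema()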
I've tried many times to apply a function that makes some modifications to a Spark DataFrame containing text strings. Below is the corresponding code, but it always gives me this error:
An error occurred while calling o699.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 27.0 failed 1 times, most recent failure: Lost task 0.0 in stage 27.0 (TID 29, localhost, executor driver):
import os
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf      # needed for the udf below
from pyspark.sql.types import StringType   # needed for the udf return type
#!hdfs dfs -rm -r nixon_token*
spark = SparkSession.builder \
.appName("spark-nltk") \
.getOrCreate()
data = spark.sparkContext.textFile('1970-Nixon.txt')
def word_tokenize(x):
    import nltk
    return str(nltk.word_tokenize(x))
test_tok = udf(lambda x: word_tokenize(x),StringType())
resultDF = df_test.select("spans", test_tok('spans').alias('text_tokens'))
resultDF.show()
I'm trying to implement the k-means method using Scala.
I created an RDD something like this:
val df = sc.parallelize(data).groupByKey().collect().map((chunk) => {
  sc.parallelize(chunk._2.toSeq).toDF()
})
val examples = df.map(dataframe => {
  dataframe.selectExpr(
    "avg(time) as avg_time",
    "variance(size) as var_size",
    "variance(time) as var_time",
    "count(size) as examples"
  ).rdd
})
val rdd_final = examples.reduce(_ union _)
val kmeans = new KMeans()
val model = kmeans.run(rdd_final)
With this code I get the following error:
type mismatch;
[error] found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
[error] required:org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried to cast by doing:
val rdd_final_Vector = rdd_final.map{x:Row => x.getAs[org.apache.spark.mllib.linalg.Vector](0)}
val model = kmeans.run(rdd_final_Vector)
But then I obtain an error:
java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
So I'm looking for a way to do that cast, but I can't find any method.
Any idea?
Best regards
At least a couple of issues here:
No, you really cannot cast a Row to a Vector: a Row is a collection of potentially disparate types understood by Spark SQL, and a Vector is not a native Spark SQL type.
There seems to be a mismatch between the content of your SQL statement and what you are attempting to achieve with KMeans: the SQL is performing aggregations, but KMeans expects a series of individual data points, each in the form of a Vector (which encapsulates an Array[Double]). So why are you supplying aggregates (averages and variances) to a KMeans operation?
Addressing just #1 here: you will need to do something along the lines of:
import org.apache.spark.mllib.linalg.Vectors
val doubVals = <rows rdd>.map { row => row.getAs[Double]("colname") }
val vector = Vectors.dense(doubVals.collect())
Then you have a properly encapsulated Array[Double] (within a Vector) that can be supplied to KMeans.
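On point #2, note that KMeans.run takes an RDD[Vector], one Vector per data point. A rough sketch of wiring the aggregated rows from the question into it, assuming the first three columns come back as doubles and the count as a long (the choice of k is arbitrary here):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// One Vector per Row: KMeans expects RDD[Vector], not RDD[Row]
val rddVectors: RDD[Vector] = rdd_final.map { row =>
  Vectors.dense(
    row.getDouble(0),       // avg_time
    row.getDouble(1),       // var_size
    row.getDouble(2),       // var_time
    row.getLong(3).toDouble // examples: count(...) comes back as a long
  )
}

val model = new KMeans().setK(2).run(rddVectors)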
I have a spark dataframe that has a vector in it:
org.apache.spark.sql.DataFrame = [sF: vector]
and I'm trying to convert it to an RDD of values:
org.apache.spark.rdd.RDD[(Double, Double)]
However, I haven't been able to convert it properly. I've tried:
val m2 = m1.select($"sF").rdd.map{case Row(v1, v2) => (v1.toString.toDouble, v2.toString.toDouble)}
and it compiles, but I get a runtime error:
scala.MatchError: [[-0.1111111111111111,-0.2222222222222222]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
when I do:
m2.take(10).foreach(println)
Is there something I'm doing wrong?
Assuming you want the first two values of the vectors present in the sF column, maybe this will work:
import org.apache.spark.mllib.linalg.Vector
val m2 = m1
.select($"sF")
.map { case Row(v: Vector) => (v(0), v(1)) }
You are getting an error because case Row(v1, v2) does not match the contents of the rows in your DataFrame: you are expecting two values on each row (v1 and v2), but there is only one, a Vector.
Note: you don't need to call .rdd if you are going to do a .map operation.
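A small end-to-end sketch of that pattern with made-up data; it assumes a SparkSession named spark with spark.implicits._ in scope, and calls .rdd explicitly only so the sketch also works on Spark 2.x, where Dataset.map would need an encoder:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row

// Hypothetical DataFrame with a single vector column "sF"
val m1 = Seq(
  Tuple1(Vectors.dense(-0.1111111111111111, -0.2222222222222222)),
  Tuple1(Vectors.dense(0.3333333333333333, 0.4444444444444444))
).toDF("sF")

// Each Row wraps exactly one value: the Vector itself
val m2 = m1.select($"sF").rdd.map { case Row(v: Vector) => (v(0), v(1)) }
m2.take(10).foreach(println)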
I am using a DataFrame to read in .parquet files, but then I turn them into an RDD to do the normal processing I wanted to do on them.
So I have my file:
val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd
val columnIndex = convRDD.flatMap(r => r.zipWithIndex)
I get the following error even when I convert from a DataFrame to an RDD:
:26: error: value zipWithIndex is not a member of
org.apache.spark.sql.Row
Does anyone know how to do what I am trying to do, which is essentially to get the value and the column index?
I was thinking something like:
val dataSplit = sqlContext.parquetFile(inputVal.toString)
val schema = dataSplit.schema
val columnIndex = dataSplit.flatMap(r => 0 until schema.length
but getting stuck on the last part as not sure how to do the same of zipWithIndex.
You can simply convert Row to Seq:
convRDD.flatMap(r => r.toSeq.zipWithIndex)
An important thing to note here is that extracting type information becomes tricky: Row.toSeq returns Seq[Any], so the resulting RDD is RDD[(Any, Int)].
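Putting it together with the code from the question (keeping its sqlContext.parquetFile call), plus one way to pair each value with its column name instead of a positional index:
val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd

// (value, columnIndex) pairs; the values are Any because Row.toSeq is Seq[Any]
val columnIndex = convRDD.flatMap(r => r.toSeq.zipWithIndex)

// Alternative: pair each value with its column name via the schema
val fieldNames = dataSplit.schema.fieldNames
val byName = convRDD.flatMap(r => fieldNames.zip(r.toSeq))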