I have a small dataset in csv format with two columns of integers over which I am computing summary statistics. There should be no missing or bad data:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
val raw = sc.textFile("skill_aggregate.csv")
val struct = StructType(StructField("personid", IntegerType, false)
:: StructField("numSkills", IntegerType, false) :: Nil)
val rows = raw.map(_.split(",")).map(x => Row(x(0), x(1)))
val df = sqlContext.createDataFrame(rows, struct)
df.describe().show()
The last line gives me:
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
which of course implies some bad data. The weird bit is that I can "collect" the entire data set without issue which implies each row correctly conforms to the IntegerType described in the schema. Also odd is that I can't find any NA values when I open the dataset up in R.
Why don't you use the databricks-csv reader (https://github.com/databricks/spark-csv) ? is easier and safer to create dataframes from a csv file and it allows you to define a schema of your fields (and avoid cast problems).
The code is very simple to achieve it :
myDataFrame = sqlContext.load(source="com.databricks.spark.csv", header="true", path = myFilePath)
Greetings,
JG
I found the error. It was necessary to add toInt to each row entry:
val rows = raw.map(_.split(",")).map(x => Row(x(0).toInt, x(1).toInt))
Related
I have a Scala variable "sizeFile" which contains the size in bytes of the file created for each execution.
That variable is defined as LongType in a corresponding schema to create a DataFrame.
The thing is "sizeFile" variable sometimes gets the value in bytes of an int, i.e. 500.
Then when trying to create DF with that value I get the error: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long.
I know I can make that 500 as long type adding suffix "L": 500L, but how can I add this suffix to the value recovered in variable "sizeFile" ?
In pseudo code something like:
val fileSize = args.fileBytes
val fileSizeLong = ${fileSize}L
val schema: StructType = new StructType()
.add("id", StringType, false)
.add("fileSize", Longtype, false))
spark.createDataFrame(Seq(Row("identifier",fileSizeLong)), schema)
It's not like there's a magic keyword like L which can convert anything to a Long number! This L can be placed after a literal number, and that's a hint to the compiler for type inference. Type inference is done by some rules, when you do this:
val someNumber = 5
The compiler (based on the mentioned rules) decides that the type of someNumber is is Int, unless you explicitly hint the compiler that I want a long number, that can be achieved using either of these:
val someNumberLong = 5L // translated by compiler to val someNumerLong: Long = 5
// or
val someNumberLongV2: Long = 5
Back to your problem, the val fileSize = args.fileBytes is an integer, and you need the value as a Long number, so this is what you need:
val fileSizeLong = fileSize.toLong
I have the following Scala value:
val values: List[Iterable[Any]] = Traces().evaluate(features).toList
and I want to convert it to a DataFrame.
When I try the following:
sqlContext.createDataFrame(values)
I got this error:
error: overloaded method value createDataFrame with alternatives:
[A <: Product](data: Seq[A])(implicit evidence$2: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
[A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
cannot be applied to (List[Iterable[Any]])
sqlContext.createDataFrame(values)
Why?
Thats what spark implicits object is for. It allows you to convert your common scala collection types into DataFrame / DataSet / RDD.
Here is an example with Spark 2.0 but it exists in older versions too
import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()
Edit: Just realised you were after 2d list. Here is something I tried on spark-shell. I converted a 2d List to List of Tuples and used implicit conversion to DataFrame:
val values = List(List("1", "One") ,List("2", "Two") ,List("3", "Three"),List("4","4")).map(x =>(x(0), x(1)))
import spark.implicits._
val df = values.toDF
Edit2: The original question by MTT was How to create spark dataframe from a scala list for a 2d list for which this is a correct answer. The original question is https://stackoverflow.com/revisions/38063195/1
The question was later changed to match an accepted answer. Adding this edit so that if someone else looking for something similar to the original question can find it.
As zero323 mentioned, we need to first convert List[Iterable[Any]] to List[Row] and then put rows in RDD and prepare schema for the spark data frame.
To convert List[Iterable[Any]] to List[Row], we can say
val rows = values.map{x => Row(x:_*)}
and then having schema like schema, we can make RDD
val rdd = sparkContext.makeRDD[RDD](rows)
and finally create a spark data frame
val df = sqlContext.createDataFrame(rdd, schema)
Simplest approach:
val newList = yourList.map(Tuple1(_))
val df = spark.createDataFrame(newList).toDF("stuff")
In Spark 2 we can use DataSet by just converting list to DS by toDS API
val ds = list.flatMap(_.split(",")).toDS() // Records split by comma
or
val ds = list.toDS()
This more convenient than rdd or df
The most concise way I've found:
val df = spark.createDataFrame(List("A", "B", "C").map(Tuple1(_)))
I'm trying to implement k-means method using scala.
I created a RDD something like that
val df = sc.parallelize(data).groupByKey().collect().map((chunk)=> {
sc.parallelize(chunk._2.toSeq).toDF()
})
val examples = df.map(dataframe =>{
dataframe.selectExpr(
"avg(time) as avg_time",
"variance(size) as var_size",
"variance(time) as var_time",
"count(size) as examples"
).rdd
})
val rdd_final=examples.reduce(_ union _)
val kmeans= new KMeans()
val model = kmeans.run(rdd_final)
With this code I obtain an error
type mismatch;
[error] found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
[error] required:org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried to cast doing:
val rdd_final_Vector = rdd_final.map{x:Row => x.getAs[org.apache.spark.mllib.linalg.Vector](0)}
val model = kmeans.run(rdd_final_Vector)
But then I obtain an error:
java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
So I'm looking for a way to do that cast, but I can't find any method.
Any idea?
Best regards
At least a couple of issues here:
No you really can not cast a Row to a Vector: a Row is a collection of potentially disparate types understood by Spark SQL. A Vector is not a native spark sql type
There seems to be a mismatch between the content of your SQL statement and what you are attempting to achieve with KMeans: the SQL is performing aggregations. But KMeans expects a series of individual data points in the form a Vector (which encapsulates an Array[Double]) . So then - why are you supplying sum's and average's to a KMeans operation?
Addressing just #1 here: you will need to do something along the lines of:
val doubVals = <rows rdd>.map{ row => row.getDouble("colname") }
val vector = Vectors.toDense{ doubVals.collect}
Then you have a properly encapsulated Array[Double] (within a Vector) that can be supplied to Kmeans.
I'm get into spark and I have problems with Vectors
import org.apache.spark.mllib.linalg.{Vectors, Vector}
The input of my program is a text file with contains the output of a RDD(Vector):
dataset.txt:
[-0.5069793074881704,-2.368342680619545,-3.401324690974588]
[-0.7346396928543871,-2.3407983487917448,-2.793949129209909]
[-0.9174226561793709,-0.8027635530022152,-1.701699021443242]
[0.510736518683609,-2.7304268743276174,-2.418865539558031]
So, what a try to do is:
val rdd = sc.textFile("/workingdirectory/dataset")
val data = rdd.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
I have the error because it read [0.510736518683609 as a number.
Exist any form to load directly the vector stored in the text-file without doing the second line? How I can delete "[" in the map stage ?
I'm really new in spark, sorry if it's a very obvious question.
Given the input the simplest thing you can do is to use Vectors.parse:
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala> Vectors.parse("[-0.50,-2.36,-3.40]")
res14: org.apache.spark.mllib.linalg.Vector = [-0.5,-2.36,-3.4]
It also works with sparse representation:
scala> Vectors.parse("(10,[1,5],[0.5,-1.0])")
res15: org.apache.spark.mllib.linalg.Vector = (10,[1,5],[0.5,-1.0])
Combining it with your data all you need is:
rdd.map(Vectors.parse)
If you expect malformed / empty lines you can wrap it using Try:
import scala.util.Try
rdd.map(line => Try(Vectors.parse(line))).filter(_.isSuccess).map(_.get)
Here is one way to do it :
val rdd = sc.textFile("/workingdirectory/dataset")
val data = rdd.map {
s =>
val vect = s.replaceAll("\\[", "").replaceAll("\\]","").split(',').map(_.toDouble)
Vectors.dense(vect)
}
I've just broke the map into line for readability purpose.
Note: Remember, it's simple a string processing on each line.
I am using a DataFrame to read in a .parquet files but than turning them into an rdd to do my normal processing I wanted to do on them.
So I have my file:
val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd
val columnIndex = convRDD.flatMap(r => r.zipWithIndex)
I get the following error even when I convert from a dataframe to RDD:
:26: error: value zipWithIndex is not a member of
org.apache.spark.sql.Row
Anyone know how to do what I am trying to do, essentially trying to get the value and the column index.
I was thinking something like:
val dataSplit = sqlContext.parquetFile(inputVal.toString)
val schema = dataSplit.schema
val columnIndex = dataSplit.flatMap(r => 0 until schema.length
but getting stuck on the last part as not sure how to do the same of zipWithIndex.
You can simply convert Row to Seq:
convRDD.flatMap(r => r.toSeq.zipWithIndex)
Important thing to note here is that extracting type information becomes tricky. Row.toSeq returns Seq[Any] and resulting RDD is RDD[(Any, Int)].