Scala/Spark - Create Dataset with one column from another Dataset

I am trying to create a Dataset with only one column from a case class.
Below is the code:
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

case class vectorData(value: Array[String], vectors: Vector)

def main(args: Array[String]) {
  val spark = SparkSession.builder
    .appName("Hello world!")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // blah blah and read data etc.

  val word2vec = new Word2Vec()
    .setInputCol("value").setOutputCol("vectors")
    .setVectorSize(5).setMinCount(0).setWindowSize(5)

  val dataset = spark.createDataset(data)
  val model = word2vec.fit(dataset)

  val encoder = org.apache.spark.sql.Encoders.product[vectorData]
  val result = model.transform(dataset).as(encoder)

  //val output: Dataset[Vector] = ???
}
As shown in the last (commented-out) line of the code, I want the output to contain only the second column, which has the Vector type holding the vectors data.
I tried with:
val output = result.map(o => o.vectors)
But this line is highlighted with the error: No implicit arguments of type: Encoder[Vector]
How to resolve this?

The map needs an implicit Encoder[Vector] in scope. Note that
implicit val vectorEncoder: Encoder[Vector] = org.apache.spark.sql.Encoders.product[Vector]
will not compile, because Vector is a trait rather than a case class (it is not a Product). Define the encoder another way instead, for example with ExpressionEncoder, which picks up the VectorUDT that Spark registers for org.apache.spark.ml.linalg.Vector. With such an encoder in scope,
val output = result.map(o => o.vectors)
compiles and gives a Dataset[Vector].
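A minimal sketch of that fix, reusing the result Dataset from the question (the choice of ExpressionEncoder is an assumption on my part; Encoders.kryo[Vector] also works, but stores the column as opaque binary):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Dataset, Encoder}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// ExpressionEncoder resolves the VectorUDT registered for ML vectors,
// so the resulting column keeps its vector type.
implicit val vectorEncoder: Encoder[Vector] = ExpressionEncoder()

// With the implicit encoder in scope, selecting just the vectors column compiles.
val output: Dataset[Vector] = result.map(o => o.vectors)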

Related

Output is not showing, spark scala

The output shows the schema, but the output of the SQL query is not visible. I don't understand where I am going wrong.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object ex_1 {

  def parseLine(line: String): (String, String, Int, Int) = {
    val fields = line.split(" ")
    val project_code = fields(0)
    val project_title = fields(1)
    val page_hits = fields(2).toInt
    val page_size = fields(3).toInt
    (project_code, project_title, page_hits, page_size)
  }

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val sc = new SparkContext("local[*]", "Weblogs")
    val lines = sc.textFile("F:/Downloads_F/pagecounts.out")
    val parsedLines = lines.map(parseLine)
    println("hello")
    val spark = SparkSession
      .builder
      .master("local")
      .getOrCreate
    import spark.implicits._
    val RDD1 = parsedLines.toDF("project", "page", "pagehits", "pagesize")
    RDD1.printSchema()
    RDD1.createOrReplaceTempView("logs")
    val min1 = spark.sql("SELECT * FROM logs WHERE pagesize >= 4733")
    val results = min1.collect()
    results.foreach(println)
    println("bye")
    spark.stop()
  }
}
As confirmed in the comments, using the show method displays the result of spark.sql(..).
Since spark.sql returns a DataFrame, calling show is the ideal way to display the data. Calling collect, as you were doing previously, is not advised:
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
..
..
val min1 = spark.sql("SELECT * FROM logs WHERE pagesize >= 4733")
// where `false` prevents the output from being truncated.
min1.show(false)
println("bye")
spark.stop()
Even if your DataFrame is empty you will still see a table output including the column names (i.e. the schema), whereas .collect() and println would print nothing in this scenario.

type TimestampType of column not supported

I use Spark with Scala and have a problem with the type TimestampType.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, col}

object regressionLinear {

  case class X(
    time: String, nodeID: Int, posX: Double, posY: Double,
    speed: Double, period: Int)

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    /**
     * Read the input data
     */
    var dataset = "C:\\spark\\A6-d07-h08.csv"
    if (args.length > 0) {
      dataset = args(0)
    }

    val spark = SparkSession
      .builder
      .appName("regressionsol")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    val data = spark.sparkContext.textFile(dataset)
      .map(line => line.split(","))
      .map(userRecord => (userRecord(0).trim.toString,
        userRecord(1).trim.toInt, userRecord(2).trim.toDouble, userRecord(3).trim.toDouble,
        userRecord(4).trim.toDouble, userRecord(5).trim.toInt))
      .toDF("time", "nodeID", "posX", "posY", "speed", "period")
      .withColumn("time", $"time".cast("timestamp"))

    val assembler = new VectorAssembler()
      .setInputCols(Array(
        "time", "nodeID", "posX", "posY", "speed", "period"))
      .setOutputCol("features")

    val lr = new LinearRegression()
      .setLabelCol("period")
      .setFeaturesCol("features")
      .setRegParam(0.1)
      .setMaxIter(100)
      .setSolver("l-bfgs")

    val steps = Array(assembler, lr)
    val pipeline = new Pipeline()
      .setStages(steps)

    val Array(training, test) = data.randomSplit(Array(0.75, 0.25), seed = 12345)

    val model = pipeline.fit(training)
    val holdout = model.transform(test)
    holdout.show(20)

    val prediction = holdout.select("prediction", "period", "nodeID")
      .orderBy(abs(col("prediction") - col("period")))
    prediction.show(20)

    val rm = new RegressionMetrics(prediction.rdd.map {
      x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double])
    })
    println(s"RMSE = ${rm.rootMeanSquaredError}")
    println(s"R-squared = ${rm.r2}")

    spark.stop()
  }
}
It fails with this error:
Exception in thread "main" java.lang.IllegalArgumentException: Data type TimestampType of column time is not supported.
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:124)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
at regressionLinear$.main(regressionLinear.scala:100)
at regressionLinear.main(regressionLinear.scala)
VectorAssembler accepts only numeric columns. Other types of columns have to be encoded first. And considering that you apply LinearRegression, the data has to be encoded anyway.
The exact steps will depend on domain-specific knowledge:
If you expect a linear trend based on time, cast the field to numeric first.
If you expect some type of seasonal effect, you might have to extract individual components (day of week, hour / time of day, month and so on) and typically apply StringIndexer + OneHotEncoder.
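A minimal sketch of the first option, reusing the data DataFrame from the question (representing time as epoch seconds via a cast to long is my assumption, not part of the original answer):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

// Cast the timestamp to a numeric value (seconds since the epoch) so that
// VectorAssembler will accept it; the column name "timeNumeric" is arbitrary.
val numericData = data.withColumn("timeNumeric", col("time").cast("long"))

val assembler = new VectorAssembler()
  .setInputCols(Array("timeNumeric", "nodeID", "posX", "posY", "speed", "period"))
  .setOutputCol("features")

The rest of the pipeline then works on numericData (and on "timeNumeric" instead of "time"), so randomSplit and the assembler see only numeric columns.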

DoubleType to VectorUDT

I have this code written using Spark 2.1:
val mycolumns = originalFile.schema.fieldNames
mycolumns.map(cname => stddevPerColumnName(df.select(cname), cname))

def stddevPerColumnName(df: DataFrame, cname: String): DataFrame =
  new StandardScaler()
    .setInputCol(cname)
    .setOutputCol("stddev")
    .setWithStd(true)
    .fit(df)
    .transform(df)
Every single column has type DoubleType originally inferred from a CSV file.
When I run the code I get the Exception:
Column FirstColumn must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually DoubleType.
How can I convert the column type Double to VectorUDT?
You need to pass a vector into the ML model: use an assembler to put the double values into a vector, then do your ML, then, if required, take the values back out of the vector as doubles.
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions._

// Put the double column into a vector column that the ML stages accept.
val assembler = new VectorAssembler()
  .setInputCols(Array("yourDoubleValue"))
  .setOutputCol("features")

// UDF to pull a single value back out of a vector column.
val vectorToColumn = udf { (x: DenseVector, index: Int) => x(index) }

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("featuresScaled")

Use DenseVector or SparseVector depending on your data.

Full example:

val data = spark.read....
val data_assembled = assembler.transform(data)
val assembled = scaler.fit(data_assembled).transform(data_assembled)
  .withColumn("backToMyDouble", round(vectorToColumn(col("featuresScaled"), lit(0)), 2))

overloaded method value corr with alternatives

I am trying to compute the correlation between 2 features that are being read from two separate text files, as shown below.
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.stat.Statistics
import scala.io.Source

object Corr {
  def main() {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("Correlation")
      .getOrCreate()
    val sc = sparkSession.sparkContext

    val feature_1 = Source.fromFile("feature_1.txt").getLines.toArray
    val feature_2 = Source.fromFile("feature_2.txt").getLines.toArray

    val feature_1_dist = sc.parallelize(feature_1)
    val feature_2_dist = sc.parallelize(feature_2)

    val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")
    println(s"Correlation is: $correlation")
  }
}

Corr.main()
However, I get the following error:
overloaded method value corr with alternatives:
(x: org.apache.spark.api.java.JavaRDD[java.lang.Double],y: org.apache.spark.api.java.JavaRDD[java.lang.Double],method: String)scala.Double <and>
(x: org.apache.spark.rdd.RDD[scala.Double],y: org.apache.spark.rdd.RDD[scala.Double],method: String)scala.Double
cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.rdd.RDD[String], String)
val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")
What I am trying to do looks very similar to the example here, but I can't figure it out.
As stated in the error message, you need an RDD[Double], but you have an RDD[String]. So you could do something like this (if you have one number per row):
val feature_1 = Source.fromFile("feature_1.txt").getLines.toArray.map(_.toDouble)
val feature_2 = Source.fromFile("feature_2.txt").getLines.toArray.map(_.toDouble)
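For completeness, a minimal sketch of how the corrected arrays feed into Statistics.corr, reusing the SparkSession, sc, and file names from the question:

import scala.io.Source
import org.apache.spark.mllib.stat.Statistics

// Parsing each line as a Double gives the RDD[Double] that corr expects.
val feature_1 = Source.fromFile("feature_1.txt").getLines.toArray.map(_.toDouble)
val feature_2 = Source.fromFile("feature_2.txt").getLines.toArray.map(_.toDouble)

val feature_1_dist = sc.parallelize(feature_1) // RDD[Double]
val feature_2_dist = sc.parallelize(feature_2) // RDD[Double]

val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")
println(s"Correlation is: $correlation")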

converting textFile to dataFrame dynamically

I am trying to convert input from a text file to a DataFrame using a schema file which is read at runtime.
My input text file looks like this:
John,23
Charles,34
The schema file looks like this:
name:string
age:integer
This is what I tried:
import scala.io.Source

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object DynamicSchema {
  def main(args: Array[String]) {
    val inputFile = args(0)
    val schemaFile = args(1)
    val schemaLines = Source.fromFile(schemaFile, "UTF-8").getLines()
      .map(_.split(":")).map(l => l(0) -> l(1)).toMap

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Dynamic Schema")
      .getOrCreate()
    import spark.implicits._

    val input = spark.sparkContext.textFile(args(0))
    val schema = spark.sparkContext.broadcast(schemaLines)

    val nameToType = {
      Seq(IntegerType, StringType)
        .map(t => t.typeName -> t).toMap
    }
    println(nameToType)

    val fields = schema.value
      .map(field => StructField(field._1, nameToType(field._2), nullable = true)).toSeq
    val schemaStruct = StructType(fields)

    val rowRDD = input
      .map(_.split(","))
      .map(attributes => Row.fromSeq(attributes))

    val peopleDF = spark.createDataFrame(rowRDD, schemaStruct)
    peopleDF.printSchema()

    // Creates a temporary view using the DataFrame
    peopleDF.createOrReplaceTempView("people")

    // SQL can be run over a temporary view created using DataFrames
    val results = spark.sql("SELECT name FROM people")
    results.show()
  }
}
Though printSchema gives the desired result, results.show errors out. I think the age field actually needs to be converted using toInt. Is there a way to achieve this when the schema is only available at runtime?
Replace
val input = spark.sparkContext.textFile(args(0))
with
val input = spark.read.schema(schemaStruct).csv(args(0))
and move it after the schema definition. That way the CSV reader applies the declared types while parsing, so the age column no longer needs a manual toInt conversion.
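A minimal sketch of how the relevant part of main looks after that change (dropping the rowRDD/createDataFrame step is my reading of the answer, since spark.read already returns a DataFrame; it is not spelled out in the original):

val fields = schema.value
  .map(field => StructField(field._1, nameToType(field._2), nullable = true)).toSeq
val schemaStruct = StructType(fields)

// The reader applies schemaStruct while parsing, so "age" already comes back as an integer column.
val peopleDF = spark.read.schema(schemaStruct).csv(args(0))

peopleDF.printSchema()
peopleDF.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()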