I have an RDD with about 500 columns and 200 million rows, and RDD.columns.indexOf("target", 0) shows Int = 77, which tells me that my target (dependent) variable is at column 77. But I don't know how to select a subset of columns as features (say, columns 23 to 59, 111 to 357, and 399 to 489). I am wondering if I can apply something like this:
val data = rdd.map(col => new LabeledPoint(
  col(77).toDouble, Vectors.dense(??.map(x => x.toDouble).toArray)))
Any suggestions or guidance would be much appreciated.
Maybe I am mixing up RDDs and DataFrames. I can convert the RDD to a DataFrame with .toDF(), if it is easier to accomplish the goal with a DataFrame than with an RDD.
I assume your data looks more or less like this:
import scala.util.Random.{setSeed, nextDouble}
setSeed(1)
case class Record(
  foo: Double, target: Double, x1: Double, x2: Double, x3: Double)

val rows = sc.parallelize(
  (1 to 10).map(_ => Record(
    nextDouble, nextDouble, nextDouble, nextDouble, nextDouble
  ))
)

val df = sqlContext.createDataFrame(rows)
df.registerTempTable("df")

sqlContext.sql("""
  SELECT ROUND(foo, 2) foo,
         ROUND(target, 2) target,
         ROUND(x1, 2) x1,
         ROUND(x2, 2) x2,
         ROUND(x2, 2) x3
  FROM df""").show
So we have data as below:
+----+------+----+----+----+
| foo|target| x1| x2| x3|
+----+------+----+----+----+
|0.73| 0.41|0.21|0.33|0.33|
|0.01| 0.96|0.94|0.95|0.95|
| 0.4| 0.35|0.29|0.51|0.51|
|0.77| 0.66|0.16|0.38|0.38|
|0.69| 0.81|0.01|0.52|0.52|
|0.14| 0.48|0.54|0.58|0.58|
|0.62| 0.18|0.01|0.16|0.16|
|0.54| 0.97|0.25|0.39|0.39|
|0.43| 0.23|0.89|0.04|0.04|
|0.66| 0.12|0.65|0.98|0.98|
+----+------+----+----+----+
and we want to ignore foo and x2 and extract LabeledPoint(target, Array(x1, x3)):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Map feature names to indices
val featInd = List("x1", "x3").map(df.columns.indexOf(_))
// Or if you want to exclude columns
val ignored = List("foo", "target", "x2")
val featInd = df.columns.diff(ignored).map(df.columns.indexOf(_))
// Get index of target
val targetInd = df.columns.indexOf("target")
df.rdd.map(r => LabeledPoint(
  r.getDouble(targetInd), // Get target value
  // Map feature indices to values
  Vectors.dense(featInd.map(r.getDouble(_)).toArray)
))
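If you prefer to address the column ranges from your question directly by position rather than by name, a minimal sketch could look like this (assuming every selected column is already a Double; cast first otherwise):
// Sketch only: feature columns chosen by the index ranges from the question
val featureIdx = ((23 to 59) ++ (111 to 357) ++ (399 to 489)).toArray
val targetIdx = 77

df.rdd.map(r => LabeledPoint(
  r.getDouble(targetIdx),
  Vectors.dense(featureIdx.map(r.getDouble))
))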
Related
I am a beginner in Scala and have been working on the following problem:
Example dataset, named given_dataset, with player number and points scored:
+---------+------+
|player_no|points|
+---------+------+
|        1|  25.0|
|        1|  20.0|
|        1|  21.0|
|        2|  15.0|
|        2|  18.0|
|        3|  24.0|
|        3|  25.0|
|        3|  29.0|
+---------+------+
Problem 1:
I have a dataset and need to calculate the total points scored, the average points per game, and the number of games played. I am unable to explicitly set the data type to "double", "int", or "float" when I apply the transformations. (Perhaps this is because they are untyped transformations?) Would anyone be able to help with this and correct my error?
No data type specified (but code is able to run)
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").orderBy("player_no")
Please note that I would like to retain the player number as I plan to merge total_points_dataset, games_played_dataset, and avg_points_dataset together.
Data type specified, but code crashes!
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").as[Double].orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().as[Int].orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").as[Double].orderBy("player_no")
Problem 2:
I would like to implement the above without using the spark.sql.functions library, e.g. through functions such as map, groupByKey, etc. If possible, could anyone provide an example of this and point me in the right direction?
If you don't want to use import org.apache.spark.sql.types.{FloatType, IntegerType, StructType}, then you have to cast the values either at the time of reading or with as[(Int, Double)] on the Dataset. Below is an example that reads your dataset from a CSV file:
/** A function that splits a line of input into (player_no, points) tuples. */
def parseLine(line: String): (Int, Float) = {
  // Split by commas
  val fields = line.split(",")
  // Extract the player_no and points fields, and convert to integer & float
  val player_no = fields(0).toInt
  val points = fields(1).toFloat
  // Create a tuple that is our result.
  (player_no, points)
}
And then read as below:
import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "StackOverflow75354293")
val lines = sc.textFile("data/stackoverflowdata-noheader.csv")
val dataset = lines.map(parseLine)
val total_points_dataset2 = dataset.reduceByKey((x, y) => x + y)
val total_points_dataset2_sorted = total_points_dataset2.sortByKey(ascending = true)
total_points_dataset2_sorted.foreach(println)
val games_played_dataset2 = dataset.countByKey().toList.sorted
games_played_dataset2.foreach(println)
val avg_points_dataset2 =
  dataset
    .mapValues(x => (x, 1))
    .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
    .mapValues(x => x._1 / x._2)
    .sortByKey(ascending = true)

avg_points_dataset2.collect().foreach(println)
I tried running it locally both ways and both work fine; we can check the output below:
(3,78.0)
(1,66.0)
(2,33.0)
(1,3)
(2,2)
(3,3)
(1,22.0)
(2,16.5)
(3,26.0)
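If you would rather stay in the Dataset API while still avoiding spark.sql.functions, a minimal sketch using groupByKey/mapGroups could look like the one below. It assumes given_dataset has columns player_no (Int) and points (Double) and that spark.implicits._ is in scope:
// Sketch only: typed aggregation without spark.sql.functions
case class PlayerPoints(player_no: Int, points: Double)

val stats = given_dataset.as[PlayerPoints]
  .groupByKey(_.player_no)
  .mapGroups { (playerNo, rows) =>
    val pts = rows.map(_.points).toList
    (playerNo, pts.sum, pts.size, pts.sum / pts.size)
  }
  .toDF("player_no", "total_points", "games_played", "avg_points")
  .orderBy("player_no")

stats.show()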
Regarding "Problem 1" try
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").as[(Int, Double)].orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().as[(Int, Long)].orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").as[(Int, Double)].orderBy("player_no")
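Since you mentioned that you plan to merge the three results, a minimal sketch of that step could be the following; it assumes Spark's default aggregate column names (sum(points), count, avg(points)), so only player_no is shared between the three:
// Sketch only: join the three aggregates back together on player_no
val merged = total_points_dataset
  .join(games_played_dataset, "player_no")
  .join(avg_points_dataset, "player_no")
  .orderBy("player_no")

merged.show()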
I have a parquet file which contains two columns (id, features). I want to subtract the features from a scalar, divide the output by another scalar, and save the output as a parquet file.
val df=sqlContext.read.parquet("file:///usr/local/spark/dataset/model/data/user/part-r-00000-7d55ba81-5761-4e36-b488-7e6214df2a68.snappy.parquet").toDF("id","features")
val constant1 = 2.4848911616270923
val constant2 = 1.8305483113586494
val performComputation = (s: Double, val1: Double, val2: Double) => {
  Vectors.dense((s - val1) / val2)
  df.withColumn("features", (df("features") - val1) / val2)
}

df.write.parquet("file:///usr/local/spark/dataset/output1")
The parquet file is still the same. What's wrong?
You are saving the same dataframe you have read.
Try something like:
val result = df.withColumn("features", (df("features") - constant1) / constant2)
result.write.parquet("file:///usr/local/spark/dataset/output1")
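If features is actually a Vector column (which the Vectors.dense in performComputation suggests), plain column arithmetic will not work on it; a UDF is one option in that case. A rough sketch, assuming Spark 2.x ml vectors and an illustrative output path:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Sketch only: apply (x - constant1) / constant2 to every element of the vector
val scaleFeatures = udf { (v: Vector) =>
  Vectors.dense(v.toArray.map(x => (x - constant1) / constant2))
}

val scaled = df.withColumn("features", scaleFeatures(df("features")))
scaled.write.parquet("file:///usr/local/spark/dataset/output2") // illustrative path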
I have created an RDD whose first column is the key and the remaining columns are values for that key. Every row has a unique key. I want to find the average of the values for every key. I created a key-value pair and tried the following code, but it is not producing the desired results. My code is here:
val rows = 10
val cols = 6
val partitions = 4
lazy val li1 = List.fill(rows,cols)(math.random)
lazy val li2 = (1 to rows).toList
lazy val li = (li1, li2).zipped.map(_ :: _)
val conf = new SparkConf().setAppName("First spark").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(li,partitions)
val gr = rdd.map( x => (x(0) , x.drop(1)))
val gr1 = gr.values.reduce((x,y) => x.zip(y).map(x => x._1 +x._2 )).foldLeft(0)(_+_)
gr1.take(3).foreach(println)
I want the result to be displayed like
1 => 1.1
2 => 2.7
and so on for all keys.
First, I am not sure what this line is doing:
lazy val li = (li1, li2).zipped.map(_ :: _)
Instead, you could do this:
lazy val li = li2 zip li1
This will create a List of tuples of type (Int, List[Double]).
And the solution to find the average of the values per key could be as below:
rdd.map { x => (x._1, x._2.fold(0.0)(_ + _) / x._2.length) }
  .collect
  .foreach(x => println(x._1 + " => " + x._2))
How can I create a Spark DataFrame in Scala with 100 rows and 3 columns that have random integer values in range (1, 100)?
I know how to create a DataFrame manually, but I cannot automate it:
val df = sc.parallelize(Seq((1,20, 40), (60, 10, 80), (30, 15, 30))).toDF("col1", "col2", "col3")
Generating the data locally and then parallelizing it is totally fine, especially if you don't have to generate a lot of data.
However, should you ever need to generate a huge dataset, you can always implement an RDD that does this for you in parallel, as in the following example.
import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
// Each random partition will hold `numValues` items
final class RandomPartition[A: ClassTag](val index: Int, numValues: Int, random: => A) extends Partition {
  def values: Iterator[A] = Iterator.fill(numValues)(random)
}

// The RDD will parallelize the workload across `numSlices`
final class RandomRDD[A: ClassTag](@transient private val sc: SparkContext, numSlices: Int, numValues: Int, random: => A) extends RDD[A](sc, deps = Seq.empty) {

  // Based on the item count and the number of slices, determine how many values
  // are computed in each partition; the remainder (if any) is spread evenly.
  private val valuesPerSlice = numValues / numSlices
  private val slicesWithExtraItem = numValues % numSlices

  // Just ask the partition for its data
  override def compute(split: Partition, context: TaskContext): Iterator[A] =
    split.asInstanceOf[RandomPartition[A]].values

  // Generate the partitions so that the load is spread as evenly as possible,
  // e.g. 10 partitions and 22 items -> 2 partitions with 3 items and 8 with 2
  override protected def getPartitions: Array[Partition] =
    ((0 until slicesWithExtraItem).view.map(new RandomPartition[A](_, valuesPerSlice + 1, random)) ++
      (slicesWithExtraItem until numSlices).view.map(new RandomPartition[A](_, valuesPerSlice, random))).toArray
}
Once you have this, you can use it, passing your own random data generator, to get an RDD[Int]:
val rdd = new RandomRDD(spark.sparkContext, 10, 22, scala.util.Random.nextInt(100) + 1)
rdd.foreach(println)
/*
* outputs:
* 30
* 86
* 75
* 20
* ...
*/
or an RDD[(Int, Int, Int)]
def rand = scala.util.Random.nextInt(100) + 1
val rdd = new RandomRDD(spark.sparkContext, 10, 22, (rand, rand, rand))
rdd.foreach(println)
/*
* outputs:
* (33,22,15)
* (65,24,64)
* (41,81,44)
* (58,7,18)
* ...
*/
and of course you can wrap it in a DataFrame very easily as well:
spark.createDataFrame(rdd).show()
/*
* outputs:
* +---+---+---+
* | _1| _2| _3|
* +---+---+---+
* |100| 48| 92|
* | 34| 40| 30|
* | 98| 63| 61|
* | 95| 17| 63|
* | 68| 31| 34|
* .............
*/
Notice how in this case the generated data is different every time the RDD/DataFrame is acted upon. By changing the implementation of RandomPartition to actually store the values instead of generating them on the fly, you can have a stable set of random items, while still retaining the flexibility and scalability of this approach.
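For example, a minimal sketch of that stable variant (reusing the imports above) could be the following; the values are materialized once when the partition is created, which trades memory and serialization cost for repeatability:
// Sketch only: a partition that stores its values instead of regenerating them
final class StoredRandomPartition[A: ClassTag](val index: Int, numValues: Int, random: => A) extends Partition {
  // Computed once when the partitions are built and shipped along with the partition
  private val data: Array[A] = Array.fill(numValues)(random)
  def values: Iterator[A] = data.iterator
}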
One nice property of the stateless approach is that you can generate a huge dataset even locally. The following ran in a few seconds on my laptop:
new RandomRDD(spark.sparkContext, 10, Int.MaxValue, 42).count
// returns: 2147483647
Here you go, Seq.fill is your friend:
def randomInt1to100 = scala.util.Random.nextInt(100)+1
val df = sc.parallelize(
Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)}
).toDF("col1", "col2", "col3")
You can simply use scala.util.Random to generate random numbers within the range, loop for 100 rows, and finally use the createDataFrame API:
import scala.util.Random
val data = 1 to 100 map(x => (1+Random.nextInt(100), 1+Random.nextInt(100), 1+Random.nextInt(100)))
sqlContext.createDataFrame(data).toDF("col1", "col2", "col3").show(false)
You can use the generic code below:
import scala.util.Random

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// number of rows required
val rows = 15
// number of columns required
val cols = 10
val spark = SparkSession.builder
.master("local[*]")
.appName("testApp")
.config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
.getOrCreate()
import spark.implicits._
val columns = 1 to cols map (i => "col" + i)
// create the DataFrame schema with these columns (in that order)
val schema = StructType(columns.map(StructField(_, IntegerType)))
val lstrows = Seq.fill(rows * cols)(Random.nextInt(100) + 1).grouped(cols).toList.map { x => Row(x: _*) }
val rdd = spark.sparkContext.makeRDD(lstrows)
val df = spark.createDataFrame(rdd, schema)
If you need to create a large amount of random data, Spark provides an object called RandomRDDs that can generate datasets filled with random numbers following a uniform, normal, or various other distributions.
https://spark.apache.org/docs/latest/mllib-statistics.html#random-data-generation
From their example:
import org.apache.spark.mllib.random.RandomRDDs._
// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
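To get back to the original question (100 rows, 3 columns of integers in 1 to 100), a rough sketch on top of RandomRDDs might look like this; the rescaling from uniform doubles in [0, 1) to integers is mine, and toDF assumes the usual implicits are in scope:
import org.apache.spark.mllib.random.RandomRDDs._

// Sketch only: 100 rows of 3 uniform doubles each, rescaled to integers in 1..100
val randomDf = uniformVectorRDD(sc, 100L, 3)
  .map(v => (1 + (v(0) * 100).toInt, 1 + (v(1) * 100).toInt, 1 + (v(2) * 100).toInt))
  .toDF("col1", "col2", "col3")

randomDf.show()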
I have an RDD of (Long, Vector). I want to sum over all the vectors. How can I achieve this in Spark 1.6?
For example, input data is like
(1,[0.1,0.2,0.7])
(2,[0.2,0.4,0.4])
The expected result is
[0.3,0.6,1.1]
regardless of the first (Long) value.
If you have an RDD[(Long, Vector)] like this:
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val myRdd = sc.parallelize(List((1L, Vectors.dense(0.1, 0.2, 0.7)), (2L, Vectors.dense(0.2, 0.4, 0.4))))
You can reduce the values (vectors) in order to get the sum:
val res = myRdd
  .values
  .reduce { case (a: Vector, b: Vector) =>
    Vectors.dense((a.toArray, b.toArray).zipped.map(_ + _))
  }
I get the following result with a floating point error:
[0.30000000000000004,0.6000000000000001,1.1]
source: this
You can also refer to this Spark example:
import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row

val model = pipeline.fit(df)
val documents = model.transform(df)
  .select("features")
  .rdd
  .map { case Row(features: MLVector) => Vectors.fromML(features) }
  .zipWithIndex()
  .map(_.swap)

(documents,
  // vocabulary
  model.stages(2).asInstanceOf[CountVectorizerModel].vocabulary,
  // total token count
  documents.map(_._2.numActives).sum().toLong)