JAAD cannot read m4a - scala

What I wish to do is to generate a preview for every m4a file. I am trying to do this with Java Sound and JAAD.
Here is my attempt in Scala
import java.io.{File, FileOutputStream}
import javax.sound.sampled.AudioSystem
/**
* Created by khanguyen on 7/21/15.
*/
object Main extends App {
  val filePath = "audio.m4a"
  val file = new File(filePath)
  val audio = AudioSystem.getAudioInputStream(file)
  println(audio.getFrameLength) // returns -1
  println(audio.getFormat) // returns PCM_SIGNED 0.0 Hz, 0 bit, 0 channels, 0 bytes/frame,
  val output = new FileOutputStream("outputaudio.m4a")
  var buffer = Array.fill[Byte](1024)(0)
  for (i <- 0 to 1024) {
    audio.read(buffer, 0, 1024) // read into the start of the buffer on every pass
    buffer.take(10).map(println)
    output.write(buffer)
  }
  audio.close()
  output.flush()
  output.close()
}
I cannot read anything from the audio input stream. getFrameLength returns -1, and after a read pass all the bytes in the Array[Byte] are still 0. Am I missing anything?

If you are getting -1 from audio.getFrameLength, it is because the file format is not supported.
val audio = AudioSystem.getAudioInputStream(file)
//> javax.sound.sampled.UnsupportedAudioFileException: could not get audio input
//| stream from input file
//| at javax.sound.sampled.AudioSystem.getAudioInputStream(AudioSystem.java:
//| 1187)
//| at forcomp.wc$$anonfun$main$1.apply$mcV$sp(forcomp.wc.scala:15)
//| at org.scalaide.worksheet.runtime.library.WorksheetSupport$$anonfun$$exe
//| cute$1.apply$mcV$sp(WorksheetSupport.scala:76)
//| at org.scalaide.worksheet.runtime.library.WorksheetSupport$.redirected(W
//| orksheetSupport.scala:65)
//| at org.scalaide.worksheet.runtime.library.WorksheetSupport$.$execute(Wor
//| ksheetSupport.scala:75)
//| at forcomp.wc$.main(forcomp.wc.scala:5)
//| at forcomp.wc.main(forcomp.wc.scala)
println(audio.getFrameLength) // return -1
Anyway, I've tested it with a pet m4a example file, but in the javax.sound.sampled.AudioSystem API (http://docs.oracle.com/javase/7/docs/api/javax/sound/sampled/AudioSystem.html) you will find more information about the supported file types. AAC (m4a files are AAC-encoded) should be supported once a provider such as the JAAD SPI is on the classpath.
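For reference, here is a minimal, hedged sketch (assuming an audio.m4a file in the working directory) of probing a file before streaming it; AudioSystem.getAudioFileFormat throws UnsupportedAudioFileException when no installed reader, such as the JAAD SPI, can handle the container:
import java.io.File
import javax.sound.sampled.{AudioSystem, UnsupportedAudioFileException}
import scala.util.{Failure, Success, Try}
// Probe the container first instead of failing deep inside the read loop.
val candidate = new File("audio.m4a")
Try(AudioSystem.getAudioFileFormat(candidate)) match {
  case Success(fileFormat) =>
    println(s"Readable, format: ${fileFormat.getFormat}")
  case Failure(_: UnsupportedAudioFileException) =>
    println("No installed AudioFileReader can decode this file; is the JAAD jar on the classpath?")
  case Failure(other) =>
    throw other
}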

Related

How to find the number of vertices that are reachable from a given vertex in Spark GraphX

I want to find the number of reachable vertices from a given vertex in a directed graph. E.g. for id=0L: since it connects to 1L and 2L, 1L connects to 3L, and 2L connects to 4L, the output should be 4. Following is the graph relationship data:
edgeid from to distance
0 0 1 10.0
1 0 2 5.0
2 1 2 2.0
3 1 3 1.0
4 2 1 3.0
5 2 3 9.0
6 2 4 2.0
7 3 4 4.0
8 4 0 7.0
9 4 3 5.0
I was able to set up the graph, but I am not sure how to use graph.edges.filter to get the output:
val vertexRDD: RDD[(Long, (Double))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph: Graph[(Double), Int] = Graph(vertexRDD, edgeRDD)
In your example all vertices are connected with a directed path so each vertex should result in a value of 4.
But if you were to remove the 4->0 (id=8) connection there would be a different number of course.
Since your problem relies on (recursively) traversing the graph in parallel, the GraphX Pregel API is probably the best approach.
The pregel call takes 3 functions:
vprog to initialize each vertex with a message (in your case an empty List[VertexId])
sendMsg an update step that is applied on each iteration (in your case accumulating the neighboring VertexIds and returning an Iterator with messages to send out to the next iteration)
mergeMsg to merge two messages (2 List[VertexId]s into 1)
In code it would look like:
def vprog(id: VertexId, orig: List[VertexId], newly: List[VertexId]): List[VertexId] = newly
def mergeMsg(a: List[VertexId], b: List[VertexId]): List[VertexId] = (a ++ b).distinct
def sendMsg(trip: EdgeTriplet[List[VertexId], Double]): Iterator[(VertexId, List[VertexId])] = {
  val recursivelyConnectedNeighbors = (trip.dstId :: trip.dstAttr).filterNot(_ == trip.srcId)
  if (trip.srcAttr.intersect(recursivelyConnectedNeighbors).length != recursivelyConnectedNeighbors.length)
    Iterator((trip.srcId, recursivelyConnectedNeighbors))
  else
    Iterator.empty
}
val initList = List.empty[VertexId]
val result = graph
  .mapVertices((_, _) => initList)
  .pregel(
    initialMsg = initList,
    activeDirection = EdgeDirection.Out
  )(vprog, sendMsg, mergeMsg)
  .mapVertices((_, neighbors) => neighbors.length)
result.vertices.toDF("vertex", "value").show()
Output:
+------+-----+
|vertex|value|
+------+-----+
| 0| 4|
| 1| 3|
| 2| 3|
| 3| 1|
| 4| 1|
+------+-----+
Make sure to experiment with spark.graphx.pregel.checkpointInterval if you are getting OOMs while traversing large graphs (or configure maxIterations in the pregel call).
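For reference, a minimal sketch (assuming Spark 2.2+, where the spark.graphx.pregel.checkpointInterval property exists; the interval and checkpoint directory below are illustrative values, not part of the original answer):
import org.apache.spark.sql.SparkSession
// Checkpoint the intermediate graphs every 10 Pregel iterations and give Spark a checkpoint dir.
val spark = SparkSession.builder()
  .appName("reachability")
  .config("spark.graphx.pregel.checkpointInterval", "10")
  .getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/graphx-checkpoints")
// maxIterations is the other knob, e.g.:
// .pregel(initList, maxIterations = 50, activeDirection = EdgeDirection.Out)(vprog, sendMsg, mergeMsg)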

How to generate a DataFrame with random content and N rows?

How can I create a Spark DataFrame in Scala with 100 rows and 3 columns that have random integer values in range (1, 100)?
I know how to create a DataFrame manually, but I cannot automate it:
val df = sc.parallelize(Seq((1,20, 40), (60, 10, 80), (30, 15, 30))).toDF("col1", "col2", "col3")
Generating the data locally and then parallelizing it is totally fine, especially if you don't have to generate a lot of data.
However, should you ever need to generate a huge dataset, you can always implement an RDD that does this for you in parallel, as in the following example.
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD
// Each random partition will hold `numValues` items
final class RandomPartition[A: ClassTag](val index: Int, numValues: Int, random: => A) extends Partition {
def values: Iterator[A] = Iterator.fill(numValues)(random)
}
// The RDD will parallelize the workload across `numSlices`
final class RandomRDD[A: ClassTag](@transient private val sc: SparkContext, numSlices: Int, numValues: Int, random: => A) extends RDD[A](sc, deps = Seq.empty) {
// Based on the item and executor count, determine how many values are
// computed in each executor. Distribute the rest evenly (if any).
private val valuesPerSlice = numValues / numSlices
private val slicesWithExtraItem = numValues % numSlices
// Just ask the partition for the data
override def compute(split: Partition, context: TaskContext): Iterator[A] =
split.asInstanceOf[RandomPartition[A]].values
// Generate the partitions so that the load is as evenly spread as possible
// e.g. 10 partitions and 22 items -> 2 slices with 3 items and 8 slices with 2
override protected def getPartitions: Array[Partition] =
((0 until slicesWithExtraItem).view.map(new RandomPartition[A](_, valuesPerSlice + 1, random)) ++
(slicesWithExtraItem until numSlices).view.map(new RandomPartition[A](_, valuesPerSlice, random))).toArray
}
Once you have this, you can use it, passing your own random data generator, to get an RDD[Int]
val rdd = new RandomRDD(spark.sparkContext, 10, 22, scala.util.Random.nextInt(100) + 1)
rdd.foreach(println)
/*
* outputs:
* 30
* 86
* 75
* 20
* ...
*/
or an RDD[(Int, Int, Int)]
def rand = scala.util.Random.nextInt(100) + 1
val rdd = new RandomRDD(spark.sparkContext, 10, 22, (rand, rand, rand))
rdd.foreach(println)
/*
* outputs:
* (33,22,15)
* (65,24,64)
* (41,81,44)
* (58,7,18)
* ...
*/
and of course you can wrap it in a DataFrame very easily as well:
spark.createDataFrame(rdd).show()
/*
* outputs:
* +---+---+---+
* | _1| _2| _3|
* +---+---+---+
* |100| 48| 92|
* | 34| 40| 30|
* | 98| 63| 61|
* | 95| 17| 63|
* | 68| 31| 34|
* .............
*/
Notice how in this case the generated data is different every time the RDD/DataFrame is acted upon. By changing the implementation of RandomPartition to actually store the values instead of generating them on the fly, you can have a stable set of random items, while still retaining the flexibility and scalability of this approach.
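To make that concrete, here is one hypothetical variant (not the author's code) of the partition class that materialises its values; note that the values are then generated on the driver when the partitions are created, so they must be serializable and fit in driver memory:
import scala.reflect.ClassTag
import org.apache.spark.Partition
// Each partition now holds its data instead of regenerating it on every action.
final class StableRandomPartition[A: ClassTag](val index: Int, numValues: Int, random: => A) extends Partition {
  private val data: Vector[A] = Vector.fill(numValues)(random)
  def values: Iterator[A] = data.iterator
}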
One nice property of the stateless approach is that you can generate a huge dataset even locally. The following ran in a few seconds on my laptop:
new RandomRDD(spark.sparkContext, 10, Int.MaxValue, 42).count
// returns: 2147483647
Here you go, Seq.fill is your friend:
def randomInt1to100 = scala.util.Random.nextInt(100)+1
val df = sc.parallelize(
Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)}
).toDF("col1", "col2", "col3")
You can simply use scala.util.Random to generate random numbers within the range, map over 1 to 100 for the rows, and finally use the createDataFrame API:
import scala.util.Random
val data = 1 to 100 map(x => (1+Random.nextInt(100), 1+Random.nextInt(100), 1+Random.nextInt(100)))
sqlContext.createDataFrame(data).toDF("col1", "col2", "col3").show(false)
You can use the generic code below:
import scala.util.Random
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
// number of rows required
val rows = 15
// number of columns required
val cols = 10
val spark = SparkSession.builder
.master("local[*]")
.appName("testApp")
.config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
.getOrCreate()
import spark.implicits._
val columns = 1 to cols map (i => "col" + i)
// create the DataFrame schema with these columns (in that order)
val schema = StructType(columns.map(StructField(_, IntegerType)))
val lstrows = Seq.fill(rows * cols)(Random.nextInt(100) + 1).grouped(cols).toList.map { x => Row(x: _*) }
val rdd = spark.sparkContext.makeRDD(lstrows)
val df = spark.createDataFrame(rdd, schema)
If you need to create a large amount of random data, Spark provides an object called RandomRDDs that can generate datasets filled with random numbers following a uniform, normal, or various other distributions.
https://spark.apache.org/docs/latest/mllib-statistics.html#random-data-generation
From their example:
import org.apache.spark.mllib.random.RandomRDDs._
// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
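As a hedged sketch tying this back to the original question (assuming a SparkSession named spark with its implicits imported), uniformVectorRDD gives rows of doubles in [0, 1) that can be rescaled to integers in [1, 100]:
import org.apache.spark.mllib.random.RandomRDDs._
import spark.implicits._
def toInt1to100(d: Double): Int = (d * 100).toInt + 1
// 100 rows, 3 columns, 10 partitions
val randomDF = uniformVectorRDD(spark.sparkContext, 100L, 3, 10)
  .map(v => (toInt1to100(v(0)), toInt1to100(v(1)), toInt1to100(v(2))))
  .toDF("col1", "col2", "col3")
randomDF.show(5)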

Scala : bug with getLines?

I'm facing a problem with very simple file usage in Scala. I don't understand if this comes from a bug or from a misunderstanding of what I'm doing...
It is even reproducible from a worksheet in the Scala/Eclipse IDE. I'm using IDE 4.6.1 and Scala 2.12.2.
The code is very simple:
//********************************
import scala.io.Source
import java.io.File
import java.io.PrintWriter
object Embed {
  val filename = "proteins.csv"
  val handler = Source.fromFile(filename)
  val header: String = handler.getLines().next()
  println(">" + header)
  val header2: String = handler.getLines().next()
  println(">" + header2)
  val header3: String = handler.getLines().next()
  println(">" + header3)
}
//**********************
The first 3 lines of the file are a bit long and nonsense for non-bio specialists:
Protein Group,Protein ID,Accession,Significance,Coverage (%),#Peptides,#Unique,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity,Cond_D Intensity,Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity,Group 3 Intensity,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Description
261,247,P0AFG4|ODO1_ECOL6,200.00,39,30,30,Carbamidomethylation; Deamidation (NQ); Oxidation (M),1.7E5,9.87E4,5.51E4,3.09E4,3.09:1.79:1.00:0.56,1.7E5,9.87E4,5.51E4,3.09E4,3.09:1.79:1.00:0.56,105062,2-oxoglutarate dehydrogenase E1 component OS=Escherichia coli O6:H1 (strain CFT073 / ATCC 700928 / UPEC) GN=sucA PE=3 SV=1
287,657,B7NDL4|MDH_ECOLU,200.00,54,14,1,Carbamidomethylation; Deamidation (NQ); Oxidation (M),6.27E4,4.14E4,1.81E4,1.28E4,3.47:2.29:1.00:0.71,6.27E4,4.14E4,1.81E4,1.28E4,3.47:2.29:1.00:0.71,32336,Malate dehydrogenase OS=Escherichia coli O17:K52:H18 (strain UMN026 / ExPEC) GN=mdh PE=3 SV=1
I won't go into the file's details, but it is a 3600-line file, each line containing 20 fields separated by commas and terminated by an end-of-line character. The first line is the header.
I also tried with other end-of-line conventions, with the same result.
The first line is read correctly, but the second read returns only the final part of the 8th line in the file, and so on, so I cannot read/parse my file.
The following is the result I get:
val filename = "proteins.csv"
//> filename : String = proteins.csv
val handler = Source.fromFile(filename) //> handler : scala.io.BufferedSource = non-empty iterator
val header:String = handler.getLines().next() //> header : String = Protein Group,Protein ID,Accession,Significance,Coverage
//| (%),#Peptides,#Unique,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity
//| ,Cond_D Intensity,Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity
//| ,Group 3 Intensity,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Descrip
//| tion
println (">"+header) //> >Protein Group,Protein ID,Accession,Significance,Coverage (%),#Peptides,#Uni
//| que,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity,Cond_D Intensity,
//| Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity,Group 3 Intensity
//| ,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Description
val header2:String = handler.getLines().next() //> header2 : String = TCC 700928 / UPEC) GN=fumA PE=3 SV=2
println (">"+header2) //> >TCC 700928 / UPEC) GN=fumA PE=3 SV=2
val header3:String = handler.getLines().next() //> header3 : String = n SE11) GN=zapB PE=3 SV=1
println (">"+header3) //> >n SE11) GN=zapB PE=3 SV=1
Any idea what I am doing wrong?
Many thanks for helping.
No hurry: this is part of an attempt to use Scala, and I'll now go back to Python for doing the job!
If I understand you correctly, the problem is that every time you call handler.getLines() you receive a new Iterator[String] over the same underlying source, so the separate iterators interfere with each other. You should try something like this:
val lineIterator = Source.fromFile("proteins.csv").getLines() // Get the iterator object
val firstLine = lineIterator.next()
val secondLine = lineIterator.next()
val thirdLine = lineIterator.next()
Or this:
val lines = Source.fromFile("proteins.csv").getLines().toIndexedSeq // Convert iterator to the list of lines
val n = 2
val nLine = lines(n)
println(nLine)
Your mistake is that you have called handler.getLines() three times, i.e. a BufferedLineIterator is instantiated three times, and each one calls next on the same underlying source. That is the reason you are getting seemingly random outputs.
The correct way is to create only one instance with handler.getLines() and call next on it:
val linesIterator = handler.getLines()
val header:String = linesIterator.next()
println (">"+header)
val header2:String = linesIterator.next()
println (">"+header2)
val header3:String = linesIterator.next()
println (">"+header3)
More precisely, you don't even need to call next() explicitly if you iterate over the lines directly:
for(lines <- handler.getLines()){
println(">"+lines)
}
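For completeness, a minimal sketch (assuming a proteins.csv in the working directory) that fetches the iterator once, splits off the header, and makes sure the source is closed:
import scala.io.Source
val source = Source.fromFile("proteins.csv")
try {
  val lines = source.getLines()
  val header = lines.next()
  val records = lines.map(_.split(",", -1)).toVector // materialise before closing the source
  println(s"header fields: ${header.split(",", -1).length}, data rows: ${records.size}")
} finally {
  source.close()
}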

Prepare data for MultilayerPerceptronClassifier in scala

Please keep in mind I'm new to Scala.
This is the example I am trying to follow:
https://spark.apache.org/docs/1.5.1/ml-ann.html
It uses this dataset:
https://github.com/apache/spark/blob/master/data/mllib/sample_multiclass_classification_data.txt
I have prepared my .csv using the code below to get a data frame for classification in Scala.
//imports for ML
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
//imports for transformation
import sqlContext.implicits._
import com.databricks.spark.csv._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
//load data
val data2 = sqlContext.csvFile("/Users/administrator/Downloads/ds_15k_10-2.csv")
//Rename any one column to features
//val df2 = data.withColumnRenamed("ip_crowding", "features")
val DF2 = data2.select("gst_id_matched","ip_crowding","lat_long_dist");
scala> DF2.take(2)
res6: Array[org.apache.spark.sql.Row] = Array([0,0,0], [0,0,1628859.542])
//define doublelfunc
val toDouble = udf[Double, String]( _.toDouble)
//Convert all to double
val featureDf = DF2
.withColumn("gst_id_matched",toDouble(DF2("gst_id_matched")))
.withColumn("ip_crowding",toDouble(DF2("ip_crowding")))
.withColumn("lat_long_dist",toDouble(DF2("lat_long_dist")))
.select("gst_id_matched","ip_crowding","lat_long_dist")
//Define the format
val toVec4 = udf[Vector, Double,Double] { (v1,v2) => Vectors.dense(v1,v2) }
//Format for features which is gst_id_matched
val encodeLabel = udf[Double, String]( _ match
{ case "0.0" => 0.0 case "1.0" => 1.0} )
//Transformed dataset
val df = featureDf
.withColumn("features",toVec4(featureDf("ip_crowding"),featureDf("lat_long_dist")))
.withColumn("label",encodeLabel(featureDf("gst_id_matched")))
.select("label", "features")
val splits = df.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](0, 0, 0, 0)
// create the trainer and set its parameter
val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(12).setSeed(1234L).setMaxIter(10)
// train the model
val model = trainer.fit(train)
The last line generates this error
15/11/21 22:46:23 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 15)
java.lang.ArrayIndexOutOfBoundsException: 0
My suspicions:
When I examine the dataset,it looks fine for classification
scala> df.take(2)
res3: Array[org.apache.spark.sql.Row] = Array([0.0,[0.0,0.0]], [0.0,[0.0,1628859.542]])
But the Apache example dataset is different, and my transformation does not give me what I need. Can someone please help me with the dataset transformation or with understanding the root cause of the problem?
This is what the apache dataset looks like:
scala> data.take(1)
res8: Array[org.apache.spark.sql.Row] = Array([1.0,(4,[0,1,2,3],[-0.222222,0.5,-0.762712,-0.833333])])
The source of your problems is a wrong definition of layers. When you use
val layers = Array[Int](0, 0, 0, 0)
it means you want a network with zero nodes in each layer, which simply doesn't make sense. Generally speaking, the number of neurons in the input layer should be equal to the number of features, and each hidden layer should contain at least one neuron.
Let's recreate your data, simplifying your code along the way:
import org.apache.spark.sql.functions.col
val df = sc.parallelize(Seq(
("0", "0", "0"), ("0", "0", "1628859.542")
)).toDF("gst_id_matched", "ip_crowding", "lat_long_dist")
Convert all columns to doubles:
val numeric = df
.select(df.columns.map(c => col(c).cast("double").alias(c)): _*)
.withColumnRenamed("gst_id_matched", "label")
Assemble features:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array("ip_crowding","lat_long_dist"))
.setOutputCol("features")
val data = assembler.transform(numeric)
data.show
// +-----+-----------+-------------+-----------------+
// |label|ip_crowding|lat_long_dist| features|
// +-----+-----------+-------------+-----------------+
// | 0.0| 0.0| 0.0| (2,[],[])|
// | 0.0| 0.0| 1628859.542|[0.0,1628859.542]|
// +-----+-----------+-------------+-----------------+
Train and test network:
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
val layers = Array[Int](2, 3, 5, 3) // Note 2 neurons in the input layer
val trainer = new MultilayerPerceptronClassifier()
.setLayers(layers)
.setBlockSize(128)
.setSeed(1234L)
.setMaxIter(100)
val model = trainer.fit(data)
model.transform(data).show
// +-----+-----------+-------------+-----------------+----------+
// |label|ip_crowding|lat_long_dist| features|prediction|
// +-----+-----------+-------------+-----------------+----------+
// | 0.0| 0.0| 0.0| (2,[],[])| 0.0|
// | 0.0| 0.0| 1628859.542|[0.0,1628859.542]| 0.0|
// +-----+-----------+-------------+-----------------+----------+
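As a hedged follow-up (not part of the original answer), the MulticlassClassificationEvaluator already imported in the question can score the predictions; here it is run on the same tiny data set purely for illustration:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
val predictions = model.transform(data)
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("f1") // "f1" is available across Spark versions
println(s"F1 = ${evaluator.evaluate(predictions)}")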

Stratified sampling in Spark

I have a data set which contains user and purchase data. Here is an example, where the first element is userId, the second is productId, and the third indicates a boolean.
(2147481832,23355149,1)
(2147481832,973010692,1)
(2147481832,2134870842,1)
(2147481832,541023347,1)
(2147481832,1682206630,1)
(2147481832,1138211459,1)
(2147481832,852202566,1)
(2147481832,201375938,1)
(2147481832,486538879,1)
(2147481832,919187908,1)
...
I want to make sure I only take 80% of each user's data and build an RDD, while taking the remaining 20% and building another RDD. Let's call them train and test. I would like to stay away from using groupBy to start with, since it can create memory problems with a large data set. What's the best way to do this?
I could do the following, but this will not give 80% of each user's data:
val percentData = data.map(x => ((math.random * 100).toInt, x._1, x._2, x._3))
val train = percentData.filter(x => x._1 < 80).values.repartition(10).cache()
One possible solution is in Holden's answer, and here are some other solutions:
Using RDDs :
You can use the sampleByKeyExact transformation, from the PairRDDFunctions class.
sampleByKeyExact(boolean withReplacement, scala.collection.Map fractions, long seed)
Return a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
And this is how I would do it.
Considering the following list:
val seq = Seq(
(2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),(2147481832,541023347,1),
(2147481832,1682206630,1),(2147481832,1138211459,1),(2147481832,852202566,1),(2147481832,201375938,1),
(2147481832,486538879,1),(2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1)
)
I would create an RDD Pair, mapping all the users as keys :
val data = sc.parallelize(seq).map(x => (x._1,(x._2,x._3)))
Then I'll set up fractions for each key as follows, since sampleByKeyExact takes a Map of fractions per key:
val fractions = data.map(_._1).distinct.map(x => (x,0.8)).collectAsMap
What I have done here is map over the keys to find the distinct keys, and then associate each with a fraction equal to 0.8. I collect the whole thing as a Map.
Now to sample:
import org.apache.spark.rdd.PairRDDFunctions
val sampleData = data.sampleByKeyExact(false, fractions, 2L)
or
val sampleData = data.sampleByKeyExact(withReplacement = false, fractions = fractions,seed = 2L)
You can check the count on your keys or data or data sample:
scala > data.count
// [...]
// res10: Long = 12
scala > sampleData.count
// [...]
// res11: Long = 10
Using DataFrames :
Let's consider the same data (seq) from the previous section.
val df = seq.toDF("keyColumn","value1","value2")
df.show
// +----------+----------+------+
// | keyColumn| value1|value2|
// +----------+----------+------+
// |2147481832| 23355149| 1|
// |2147481832| 973010692| 1|
// |2147481832|2134870842| 1|
// |2147481832| 541023347| 1|
// |2147481832|1682206630| 1|
// |2147481832|1138211459| 1|
// |2147481832| 852202566| 1|
// |2147481832| 201375938| 1|
// |2147481832| 486538879| 1|
// |2147481832| 919187908| 1|
// | 214748183| 919187908| 1|
// | 214748183| 91187908| 1|
// +----------+----------+------+
We will need the underlying RDD to do that, on which we create tuples of the elements by defining our key to be the first column:
val data: RDD[(Int, Row)] = df.rdd.keyBy(_.getInt(0))
val fractions: Map[Int, Double] = data.map(_._1)
.distinct
.map(x => (x, 0.8))
.collectAsMap
val sampleData: RDD[Row] = data.sampleByKeyExact(withReplacement = false, fractions, 2L)
.values
val sampleDataDF: DataFrame = spark.createDataFrame(sampleData, df.schema) // for Spark 1.6 you can use sqlContext.createDataFrame(...) instead
You can now check the count on your keys, df, or data sample:
scala > df.count
// [...]
// res9: Long = 12
scala > sampleDataDF.count
// [...]
// res10: Long = 10
Since Spark 1.5.0 you can use DataFrameStatFunctions.sampleBy method:
df.stat.sampleBy("keyColumn", fractions, seed)
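A hedged sketch of that route, reusing df and fractions from above (sampleBy is approximate per key, unlike sampleByKeyExact):
val trainDF = df.stat.sampleBy("keyColumn", fractions.toMap, 2L) // .toMap in case fractions is a collection.Map
val testDF = df.except(trainDF)
trainDF.groupBy("keyColumn").count.show() // roughly 80% of each key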
Something like this may be well suited to something like BlinkDB, but let's look at the question. There are two ways to interpret what you've asked:
1) You want 80% of your users, and you want all of the data for them.
2) You want 80% of each user's data
For #1 you could do a map to get the user ids, call distinct, and then sample 80% of them (you may want to look at kFold in MLUtils or BernoulliCellSampler). You can then filter your input data to just the set of IDs you want (see the sketch after this list).
For #2 you could look at BernoulliCellSampler and simply apply it directly.
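A minimal sketch of interpretation #1 (assuming an RDD named data of (userId, productId, flag) triples as in the question, a SparkContext sc, and that the set of distinct user ids fits on the driver):
val userIds = data.map(_._1).distinct()
val sampledIds = userIds.sample(withReplacement = false, fraction = 0.8, seed = 42L).collect().toSet
val sampledIdsBc = sc.broadcast(sampledIds)
val train = data.filter { case (user, _, _) => sampledIdsBc.value.contains(user) }
val test = data.filter { case (user, _, _) => !sampledIdsBc.value.contains(user) }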