Get range of Dataframe Row - scala

So I've loaded a dataframe from a parquet file. This dataframe now contains an unspecified number of columns. The first column is a Label, and the following are features.
I want to save each row in the dataframe as a LabeledPoint.
So far I'm thinking:
val labeledPoints: RDD[LabeledPoint] = df.map { row =>
  LabeledPoint(row.getInt(0), Vectors.dense(row.getDouble(1), row.getDouble(2)))
}
It's easy to get the column indexes, but this approach won't hold up when handling a lot of columns. I'd like to be able to load the entire row, starting from index 1 (since index 0 is the label), into a dense vector.
Any ideas?

This should do the trick:
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Row

df.map { row: Row =>
  val data = for (index <- 1 until row.length) yield row.getDouble(index)
  val vector = new DenseVector(data.toArray)
  new LabeledPoint(row.getInt(0), vector)
}
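If you are on Spark 2.x, where Dataset.map needs an Encoder for the result type, here is a minimal sketch of the same idea that maps over the underlying RDD instead (assuming, as above, that column 0 is an integer label and all remaining columns are doubles):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val labeledPoints = df.rdd.map { row =>
  // column 0 is the label; columns 1..n-1 are the features
  val features = (1 until row.length).map(row.getDouble).toArray
  LabeledPoint(row.getInt(0), Vectors.dense(features))
}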

Related

Spark find previous value on each iteration of RDD

I have the following code:
val rdd = sc.cassandraTable("db", "table").select("id", "date", "gpsdt").where("id=? and date=? and gpsdt>? and gpsdt<?", entry(0), entry(1), entry(2) , entry(3))
val rddcopy = rdd.sortBy(row => row.get[String]("gpsdt"), false).zipWithIndex()
rddcopy.foreach { records =>
  val previousRow = (records - 1)th row
  val currentRow = records
  // Some calculation based on both rows
}
So the idea is to get the previous/next row on each iteration of the RDD. I want to calculate some field in the current row based on the value present in the previous row. Thanks.
EDIT II: I misunderstood the question; the answer below gives tumbling-window semantics, but a sliding window is needed. Assuming this is a sorted RDD,
import org.apache.spark.mllib.rdd.RDDFunctions._
sortedRDD.sliding(2)
should do the trick. Note, however, that this uses a DeveloperApi.
Alternatively, you can do:
val l = sortedRdd.zipWithIndex.map(kv => (kv._2, kv._1))
val r = sortedRdd.zipWithIndex.map(kv => (kv._2-1, kv._1))
val sliding = l.join(r)
RDD joins are inner joins (IIRC), so the edge cases where the tuples would be partially null are dropped.
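As a usage sketch: after l.join(r), each record is (index, (currentRow, nextRow)), and after sliding(2) each record is an Array(previousRow, currentRow), so either way you have both rows together. Assuming the CassandraRow accessor style used in the question:
val withNeighbour = sliding.values.map { case (current, next) =>
  // hypothetical calculation: pair up the "gpsdt" fields of the two rows
  (current.get[String]("gpsdt"), next.get[String]("gpsdt"))
}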
OLD STUFF:
How do you identify the previous row? RDDs do not have any sort of stable ordering by themselves. If you have an incrementing dense key, you could add a new column calculated as if (k % 2 == 0) k / 2 else (k - 1) / 2; this gives you a key with the same value for two successive keys. Then you could just group by it.
But to reiterate: in most cases there is no really sensible notion of "previous" for an RDD (it depends on partitioning, the data source, etc.).
EDIT: now that you have zipWithIndex and an ordering on your set, you can do what I mentioned above. You now have an RDD[(Long, YourData)] and can do
rdd.map(kv => if (kv._1 % 2 == 0) (kv._1 / 2, kv._2) else ((kv._1 - 1) / 2, kv._2)).groupByKey.foreach(/* your stuff here */)
If you reduce at any point, consider using reduceByKey rather than groupByKey().reduce.
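For illustration, a sketch of that reduceByKey variant of the same pairing trick, assuming for simplicity that each record is a Double and the per-pair calculation is just a sum:
import org.apache.spark.rdd.RDD

def pairwiseSums(indexed: RDD[(Long, Double)]): RDD[(Long, Double)] =
  indexed
    .map { case (k, v) => (if (k % 2 == 0) k / 2 else (k - 1) / 2, v) }
    .reduceByKey(_ + _) // combines the two members of each consecutive pair directly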

Reduce size of Spark DataFrame by selecting only every nth element with Scala

I've got an org.apache.spark.sql.DataFrame = [t: double, S: long].
Now I want to reduce the DataFrame to every 2nd element, with val n = 2; the result should keep only every nth row.
How would you solve this problem?
I tried inserting a third column and using modulo, but I couldn't solve it.
If I understand your question correctly, you want to keep every nth row of your DataFrame and remove the rest. Assuming t is not your row index, add an index column and then filter on it:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}
val n = 2
val filteredDF = df
  .withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())))
  .filter($"index" % n === 0) // $-syntax assumes the session's implicits are imported
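An alternative sketch closer to the modulo idea mentioned in the question: index the rows through the RDD API and keep every nth one (this assumes a SparkSession named spark and reuses n from above):
val everyNthRows = df.rdd
  .zipWithIndex()
  .filter { case (_, idx) => (idx + 1) % n == 0 } // keeps rows n, 2n, 3n, ...
  .map { case (row, _) => row }
val everyNthDF = spark.createDataFrame(everyNthRows, df.schema)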

Using part of the first line of the text file as the key in RDD

I have a dataset that consists of several folders named "01" to "15", and each folder includes files named "00-00.txt" to "23-59.txt" (each folder depicting one day).
In the files I have lines like the ones below
(each entry starting with !AIVDM is a separate line, except the first one, which starts with the numeric timestamp):
1443650400.010568 !AIVDM,1,1,,B,15NOHL0P00J#uq6>h8Jr6?vN2><,0*4B
!AIVDM,1,1,,A,4022051uvOFD>RG7kDCm1iW0088i,0*23
!AIVDM,1,1,,A,23aIhd#P1#PHRwPM<U#`OvN2><,0*4C
!AIVDM,1,1,,A,13n1mSgP00Pgq3TQpibh0?vL2><,0*74
!AIVDM,1,1,,B,177nPmw002:<Tn<gk1toGL60><,0*2B
!AIVDM,1,1,,B,139eu9gP00PugK:N2BOP0?vL2><,0*77
!AIVDM,1,1,,A,13bg8N0P000E2<BN15IKUOvN2><,0*34
!AIVDM,1,1,,B,14bL20003ReKodINRret28P0><,0*16
!AIVDM,1,1,,B,15SkVl001EPhf?VQ5SUTaCnH0><,0*00
!AIVDM,1,1,,A,14eG;ihP00G=4CvL=7qJmOvN0><,0*25
!AIVDM,1,1,,A,14eHMQ#000G<cKrL=6nJ9QfN2><,0*30
I want an RDD of key-value pairs, with the long value 1443650400.010568 as the key and the lines starting with !AIVDM... as the values. How can I achieve this?
Assuming each file is small enough to be contained in a single RDD record (does not exceed 2GB), you can use SparkContext.wholeTextFiles which reads each file into a single record, and then flatMap these records:
// assuming the data/ folder contains the folders 01, 02, ..., 15
val result: RDD[(String, String)] = sc.wholeTextFiles("data/*").values.flatMap { file =>
  val lines = file.split("\n")
  val id = lines.head.split(" ").head
  lines.tail.map((id, _))
}
Alternatively, if that assumption isn't correct (each individual file might be large, i.e. hundreds of MB or more), you'll need to work a bit harder: load all data into a single RDD, add indices to the data, collect a map of "key" per index, and then find the right key for each data row using these indices:
// read files and zip with index to later match each data line to its key
val raw: RDD[(String, Long)] = sc.textFile("data/*").zipWithIndex().cache()
// separate data rows from ID rows
val dataRows: RDD[(String, Long)] = raw.filter(_._1.startsWith("!AIVDM"))
val idRows: RDD[(String, Long)] = raw.filter(!_._1.startsWith("!AIVDM"))
// collect a map of index -> ID
val idForIndex = idRows.map { case (row, index) => (index, row.split(" ").head) }.collectAsMap()
// optimization: if idForIndex is very large, consider broadcasting it or not collecting it and using a join
// map each row to its key by looking up the MAXIMUM index which is < the row index
// in other words - find the LAST id record BEFORE the row
val result = dataRows.map { case (row, index) =>
  val key = idForIndex.filterKeys(_ < index).maxBy(_._1)._2
  (key, row)
}
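A sketch of the broadcast optimization mentioned in the comment above, assuming the collected idForIndex map is small enough to ship to every executor:
// broadcast the index -> ID map once instead of capturing it in each task's closure
val idForIndexB = sc.broadcast(idForIndex)
val resultBroadcast = dataRows.map { case (row, index) =>
  val key = idForIndexB.value.filterKeys(_ < index).maxBy(_._1)._2
  (key, row)
}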

Spark RDD: Sum one column without creating SQL DataFrame

Is there an efficient way to sum up the values in a column of a Spark RDD directly? I do not want to create a SQL DataFrame just for this.
I have an RDD of LabeledPoint in which each LabeledPoint uses a sparse vector representation. Suppose I am interested in sum of the values of first feature.
The following code does not work for me:
// lp_RDD is RDD[LabeledPoint]
var total = 0.0
for (x <- lp_RDD) {
  total += x.features(0)
}
The value of total after this loop is still 0.
What you want is to extract the first element from the feature vector using RDD.map and then sum them all up using DoubleRDDFunctions.sum:
val sum: Double = rdd.map(_.features(0)).sum()
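The loop in the question fails because total is incremented inside closures that run on the executors, each working on its own copy of the variable, so the driver's copy never changes. For completeness, a sketch of the same sum using fold instead of DoubleRDDFunctions.sum:
val total: Double = lp_RDD.map(_.features(0)).fold(0.0)(_ + _)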

Efficient way of row/column sum of an IndexedRowMatrix in Apache Spark

I have a matrix in CoordinateMatrix format in Scala. The matrix is sparse and the entries look like this (upon coo_matrix.entries.collect):
Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(
MatrixEntry(0,0,-1.0), MatrixEntry(0,1,-1.0), MatrixEntry(1,0,-1.0),
MatrixEntry(1,1,-1.0), MatrixEntry(1,2,-1.0), MatrixEntry(2,1,-1.0),
MatrixEntry(2,2,-1.0), MatrixEntry(0,3,-1.0), MatrixEntry(0,4,-1.0),
MatrixEntry(0,5,-1.0), MatrixEntry(3,0,-1.0), MatrixEntry(4,0,-1.0),
MatrixEntry(3,3,-1.0), MatrixEntry(3,4,-1.0), MatrixEntry(4,3,-1.0),
MatrixEntry(4,4,-1.0))
This is only a small sample. The matrix is of size N x N (where N = 1 million), though the majority of it is sparse. What is an efficient way of getting the row sums of this matrix in Spark Scala? The goal is to create a new RDD composed of the row sums, i.e. of size N, where the 1st element is the row sum of row 1 and so on.
I can always convert this CoordinateMatrix to an IndexedRowMatrix and run a for loop to compute the row sums one iteration at a time, but it is not the most efficient approach.
Any ideas are greatly appreciated.
It will be quite expensive due to shuffling (this is the part you cannot really avoid here), but you can convert the entries to a PairRDD and reduce by key:
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix}
import org.apache.spark.rdd.RDD

val mat: CoordinateMatrix = ???
val rowSums: RDD[(Long, Double)] = mat.entries
  .map { case MatrixEntry(row, _, value) => (row, value) }
  .reduceByKey(_ + _)
Unlike a solution based on IndexedRowMatrix:
import org.apache.spark.mllib.linalg.distributed.IndexedRow

mat.toIndexedRowMatrix.rows.map {
  case IndexedRow(i, values) => (i, values.toArray.sum)
}
it requires no groupBy transformation or intermediate SparseVectors.
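A minimal usage sketch of the reduceByKey approach on a tiny matrix, assuming an active SparkContext named sc:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entries = sc.parallelize(Seq(
  MatrixEntry(0, 0, -1.0), MatrixEntry(0, 1, -1.0), MatrixEntry(1, 1, -1.0)))
val smallMat = new CoordinateMatrix(entries)
val smallRowSums = smallMat.entries
  .map { case MatrixEntry(row, _, value) => (row, value) }
  .reduceByKey(_ + _)
smallRowSums.collect().foreach(println) // expected: (0,-2.0) and (1,-1.0), in some order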