PySpark: Creating a column with number of timesteps to an event - scala

I have a dataframe that looks as follows:
|id |val1|val2|
+---+----+----+
|1 |1 |0 |
|1 |2 |0 |
|1 |3 |0 |
|1 |4 |0 |
|1 |5 |5 |
|1 |6 |0 |
|1 |7 |0 |
|1 |8 |0 |
|1 |9 |9 |
|1 |10 |0 |
|1 |11 |0 |
|2 |1 |0 |
|2 |2 |0 |
|2 |3 |0 |
|2 |4 |0 |
|2 |5 |0 |
|2 |6 |6 |
|2 |7 |0 |
|2 |8 |8 |
|2 |9 |0 |
+---+----+----+
only showing top 20 rows
I want to create a new column with the number of rows until a non-zero value appears in val2. This should be computed per 'id' (groupBy/partitionBy). If the event never happens, I need to put -1 in the steps field.
|id |val1|val2|steps|
+---+----+----+----+
|1 |1 |0 |4 |
|1 |2 |0 |3 |
|1 |3 |0 |2 |
|1 |4 |0 |1 |
|1 |5 |5 |0 | event
|1 |6 |0 |3 |
|1 |7 |0 |2 |
|1 |8 |0 |1 |
|1 |9 |9 |0 | event
|1 |10 |0 |-1 | no further events for this id
|1 |11 |0 |-1 | no further events for this id
|2 |1 |0 |5 |
|2 |2 |0 |4 |
|2 |3 |0 |3 |
|2 |4 |0 |2 |
|2 |5 |0 |1 |
|2 |6 |6 |0 | event
|2 |7 |0 |1 |
|2 |8 |8 |0 | event
|2 |9 |0 |-1 | no further events for this id
+---+----+----+----+
only showing top 20 rows

Your requirement sounds simple, but implementing it in Spark while preserving immutability is not trivial. One option is to collect the rows of each id and generate the steps column with a recursive function inside a udf, as shown below.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
// udf that populates the steps column for the rows of one id
def stepsUdf = udf((values: Seq[Row]) => {
  // sort the collected structs by val1 in descending order
  val val12 = values.sortWith(_.getAs[Int]("val1") > _.getAs[Int]("val1"))
  // first element of the sorted list, i.e. the last row of the group
  val val12Head = val12.head
  // steps value for that row: 0 if it is an event, otherwise -1
  val prevStep = if (val12Head.getAs[Int]("val2") != 0) 0 else -1
  // first output struct
  val listSteps = List(steps(val12Head.getAs[Int]("val1"), val12Head.getAs[Int]("val2"), prevStep))
  // recursive function that generates the steps column
  def recursiveSteps(vals: List[Row], previousStep: Int, listStep: List[steps]): List[steps] = vals match {
    case x :: y =>
      // the row is an event, so steps is 0
      if (x.getAs[Int]("val2") != 0) {
        recursiveSteps(y, 0, listStep :+ steps(x.getAs[Int]("val1"), x.getAs[Int]("val2"), 0))
      }
      // no event follows this row, so steps stays -1
      else if (x.getAs[Int]("val2") == 0 && previousStep == -1) {
        recursiveSteps(y, previousStep, listStep :+ steps(x.getAs[Int]("val1"), x.getAs[Int]("val2"), previousStep))
      }
      // val2 is 0 and an event follows, so increment the step count
      else {
        recursiveSteps(y, previousStep + 1, listStep :+ steps(x.getAs[Int]("val1"), x.getAs[Int]("val2"), previousStep + 1))
      }
    case Nil => listStep
  }
  // call the recursive function on the remaining rows
  recursiveSteps(val12.tail.toList, prevStep, listSteps)
})
df
  .groupBy("id") // group by the id column
  .agg(stepsUdf(collect_list(struct("val1", "val2"))).as("stepped")) // call the udf on the collected structs of val1 and val2
  .withColumn("stepped", explode(col("stepped"))) // generate one row per element of the list returned by the udf
  .select(col("id"), col("stepped.*")) // final desired output
  .sort("id", "val1") // optional, just for viewing
  .show(false)
where steps is a case class
case class steps(val1: Int, val2: Int, steps: Int)
which should give you
+---+----+----+-----+
|id |val1|val2|steps|
+---+----+----+-----+
|1 |1 |0 |4 |
|1 |2 |0 |3 |
|1 |3 |0 |2 |
|1 |4 |0 |1 |
|1 |5 |5 |0 |
|1 |6 |0 |3 |
|1 |7 |0 |2 |
|1 |8 |0 |1 |
|1 |9 |9 |0 |
|1 |10 |0 |-1 |
|1 |11 |0 |-1 |
|2 |1 |0 |5 |
|2 |2 |0 |4 |
|2 |3 |0 |3 |
|2 |4 |0 |2 |
|2 |5 |0 |1 |
|2 |6 |6 |0 |
|2 |7 |0 |1 |
|2 |8 |8 |0 |
|2 |9 |0 |-1 |
+---+----+----+-----+
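For readers who want to check the backward-scan idea without a Spark session, here is a minimal plain-Python sketch of the same logic. It is not the Scala udf above, only an illustration; the function name steps_to_next_event is made up for this example, and the input is assumed to be the val2 values of a single id, already ordered by val1.

```python
def steps_to_next_event(val2_values):
    """For each position, count the rows until the next non-zero value.

    Rows after the last event (or groups without any event) get -1.
    """
    steps = []
    next_step = -1  # -1 means "no event seen yet while scanning backwards"
    for value in reversed(val2_values):
        if value != 0:
            next_step = 0      # this row is an event
        elif next_step != -1:
            next_step += 1     # one more row away from the event after it
        steps.append(next_step)
    steps.reverse()            # restore the original (ascending val1) order
    return steps


# val2 values of the two ids from the question, ordered by val1
group_1 = [0, 0, 0, 0, 5, 0, 0, 0, 9, 0, 0]
group_2 = [0, 0, 0, 0, 0, 6, 0, 8, 0]

print(steps_to_next_event(group_1))
print(steps_to_next_event(group_2))
```

Scanning from the end means each row only needs the result of the row after it, which is the same reason the udf sorts the collected structs in descending order first.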
I hope the answer is helpful

Related

How can I make a unique match with join with two spark dataframes and different columns?

I have two Spark dataframes (Scala):
First:
+-------------------+------------------+-----------------+----------+-----------------+
|id |zone |zone_father |father_id |country |
+-------------------+------------------+-----------------+----------+-----------------+
|2 |1 |123 |1 |0 |
|2 |2 |123 |1 |0 |
|3 |3 |1 |2 |0 |
|2 |4 |123 |1 |0 |
|3 |5 |2 |2 |0 |
|3 |6 |4 |2 |0 |
|3 |7 |19 |2 |0 |
+-------------------+------------------+-----------------+----------+-----------------+
Second:
+-------------------+------------------+-----------------+-----------------+
|country |id |zone |zone_value |
+-------------------+------------------+-----------------+-----------------+
|0 |2 |1 |7 |
|0 |2 |2 |7 |
|0 |2 |4 |8 |
|0 |0 |0 |2 |
+-------------------+------------------+-----------------+-----------------+
Then I need the following logic, applied in priority order:
1 -> If => first.id = second.id && first.zone = second.zone
2 -> Else if => first.father_id = second.id && first.zone_father = second.zone
3 -> Else (neither 1 nor 2 matches), fall back to => first.country = second.zone
And the expected result would be:
+-------------------+------------------+-----------------+----------+-----------------+-----------------+
|id |zone |zone_father |father_id |country |zone_value |
+-------------------+------------------+-----------------+----------+-----------------+-----------------+
|2 |1 |123 |1 |0 |7 |
|2 |2 |123 |1 |0 |7 |
|3 |3 |1 |2 |0 |7 |
|2 |4 |123 |1 |0 |8 |
|3 |5 |2 |2 |0 |7 |
|3 |6 |4 |2 |0 |8 |
|3 |7 |19 |2 |0 |2 |
+-------------------+------------------+-----------------+----------+-----------------+-----------------+
I tried to join both dataframes, but because of the "or" in the join condition, two results are returned for each row: the last condition is true regardless of the result of the other two.
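The thread contains no answer to this question, so the following is only a plain-Python sketch of the priority logic, not a Spark solution. The helper name lookup_zone_value is hypothetical, and the sketch assumes that each (id, zone) pair and each fallback zone appears at most once in the second table. In Spark, the same idea is often expressed as one left join per rule followed by coalesce over the three candidate columns, but that variant is not shown or tested here.

```python
def lookup_zone_value(row, second):
    """Return zone_value from the first rule that matches, in priority order."""
    # index the second table by (id, zone) for rules 1 and 2
    by_id_zone = {(s["id"], s["zone"]): s["zone_value"] for s in second}
    # rule 3 compares country with zone only, so index by zone alone
    by_zone = {}
    for s in second:
        by_zone.setdefault(s["zone"], s["zone_value"])
    candidates = [
        by_id_zone.get((row["id"], row["zone"])),                # rule 1
        by_id_zone.get((row["father_id"], row["zone_father"])),  # rule 2
        by_zone.get(row["country"]),                             # rule 3
    ]
    # the first candidate that is not None wins, like COALESCE in SQL
    return next((c for c in candidates if c is not None), None)


first_cols = ("id", "zone", "zone_father", "father_id", "country")
first = [dict(zip(first_cols, r)) for r in [
    (2, 1, 123, 1, 0), (2, 2, 123, 1, 0), (3, 3, 1, 2, 0), (2, 4, 123, 1, 0),
    (3, 5, 2, 2, 0), (3, 6, 4, 2, 0), (3, 7, 19, 2, 0),
]]
second_cols = ("country", "id", "zone", "zone_value")
second = [dict(zip(second_cols, r)) for r in [
    (0, 2, 1, 7), (0, 2, 2, 7), (0, 2, 4, 8), (0, 0, 0, 2),
]]

zone_values = [lookup_zone_value(row, second) for row in first]
print(zone_values)
```

Because each rule is evaluated separately and only the first hit is kept, a row can never produce two results, which is the problem the single "or" join condition runs into.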

getting duplicate count but retaining duplicate rows in pyspark

I am trying to find the duplicate count of rows in a pyspark dataframe. I found a similar answer here,
but it only outputs a binary flag. I would like to have the actual count for each row.
To use the original post's example, if I have a dataframe like so:
+--+--+--+--+
|a |b |c |d |
+--+--+--+--+
|1 |0 |1 |2 |
|0 |2 |0 |1 |
|1 |0 |1 |2 |
|0 |4 |3 |1 |
|1 |0 |1 |2 |
+--+--+--+--+
I would like to result in something like:
+--+--+--+--+---------+
|a |b |c |d |row_count|
+--+--+--+--+---------+
|1 |0 |1 |2 |3        |
|0 |2 |0 |1 |0        |
|1 |0 |1 |2 |3        |
|0 |4 |3 |1 |0        |
|1 |0 |1 |2 |3        |
+--+--+--+--+---------+
Is this possible?
Thank You
Assuming df is your input dataframe:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
# partition by every column, so identical rows end up in the same window
w = Window.partitionBy("a", "b", "c", "d")
# count the rows of each window and attach that count to every row
df = df.select("a", "b", "c", "d", F.count(F.lit(1)).over(w).alias("row_count"))
If, as in your example, you want rows that occur only once to show 0 instead of 1:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window.partitionBy("a", "b", "c", "d")
row_count = F.count(F.lit(1)).over(w)
# replace a count of 1 (no duplicates) with 0
df = df.select("a", "b", "c", "d", F.when(row_count == 1, F.lit(0)).otherwise(row_count).alias("row_count"))
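To make the semantics of the window count concrete, here is a small plain-Python sketch that produces the same row_count column for the example data without Spark. The helper name with_row_count is made up for this illustration; it keeps every input row and, as in the question's expected output, reports 0 for rows that occur only once.

```python
from collections import Counter


def with_row_count(rows):
    """Append to each row how often that exact row occurs (0 if it is unique)."""
    counts = Counter(rows)  # one counter entry per distinct row
    return [row + (counts[row] if counts[row] > 1 else 0,) for row in rows]


rows = [
    (1, 0, 1, 2),
    (0, 2, 0, 1),
    (1, 0, 1, 2),
    (0, 4, 3, 1),
    (1, 0, 1, 2),
]

for row in with_row_count(rows):
    print(row)
```

The Counter plays the role of the window partitioned by all columns: rows are grouped by their full content, and the group size is written back onto every member instead of collapsing the group.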

How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?

In Spark 1.6.0 / Scala, is there a way to compute collect_list("colC") or collect_set("colC") over a window such as Window.partitionBy("colA").orderBy("colB")?
Given that you have a dataframe such as
+----+----+----+
|colA|colB|colC|
+----+----+----+
|1 |1 |23 |
|1 |2 |63 |
|1 |3 |31 |
|2 |1 |32 |
|2 |2 |56 |
+----+----+----+
You can use window functions as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
Result:
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23] |
|1 |2 |63 |[23, 63] |
|1 |3 |31 |[23, 63, 31]|
|2 |1 |32 |[32] |
|2 |2 |56 |[32, 56] |
+----+----+----+------------+
The result for collect_set is similar, but the elements of the resulting set are not guaranteed to be in the same order as with collect_list:
df.withColumn("colD", collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23] |
|1 |2 |63 |[63, 23] |
|1 |3 |31 |[63, 31, 23]|
|2 |1 |32 |[32] |
|2 |2 |56 |[56, 32] |
+----+----+----+------------+
If you remove orderBy as below
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA"))).show(false)
result would be
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23, 63, 31]|
|1 |2 |63 |[23, 63, 31]|
|1 |3 |31 |[23, 63, 31]|
|2 |1 |32 |[32, 56] |
|2 |2 |56 |[32, 56] |
+----+----+----+------------+
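The difference between the two collect_list results comes from the window frame: with an orderBy, the default frame runs from the start of the partition up to the current row, while without it the frame is the whole partition. The plain-Python sketch below imitates that behavior for the example data. The function name windowed_collect_list is made up for this illustration, and it assumes colB has no ties within a partition (with ties, Spark's default range frame would also include the tied rows).

```python
from itertools import groupby


def windowed_collect_list(rows, ordered=True):
    """rows are (colA, colB, colC) tuples; returns rows with colD appended.

    ordered=True  imitates partitionBy("colA").orderBy("colB")
    ordered=False imitates partitionBy("colA") alone
    """
    result = []
    rows = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, partition in groupby(rows, key=lambda r: r[0]):
        partition = list(partition)
        values = [r[2] for r in partition]
        for i, row in enumerate(partition):
            # running frame up to the current row, or the whole partition
            frame = values[: i + 1] if ordered else values
            result.append(row + (frame,))
    return result


data = [(1, 1, 23), (1, 2, 63), (1, 3, 31), (2, 1, 32), (2, 2, 56)]

for row in windowed_collect_list(data, ordered=True):
    print(row)
for row in windowed_collect_list(data, ordered=False):
    print(row)
```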
I hope the answer is helpful
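The existing answer is valid; I am just adding a different style of writing window functions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// the $"..." syntax needs the implicits of your context, e.g. import sqlContext.implicits._
// colA2 and colB2 stand for additional columns that are not in the example dataframe above
val wind_user = Window.partitionBy("colA", "colA2").orderBy($"colB", $"colB2".desc)
df.withColumn("colD_distinct", collect_set($"colC") over wind_user)
  .withColumn("colD_historical", collect_list($"colC") over wind_user).show(false)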

Group by with average function in scala

Hi, I am totally new to Spark with Scala. I need an idea or a sample solution. I have data like this:
tagid,timestamp,listner,orgid,suborgid,rssi
[4,1496745915,718,4,3,0.30]
[2,1496745915,3878,4,3,0.20]
[4,1496745918,362,4,3,0.60]
[4,1496745913,362,4,3,0.60]
[2,1496745918,362,4,3,0.10]
[3,1496745912,718,4,3,0.05]
[2,1496745918,718,4,3,0.30]
[4,1496745911,1901,4,3,0.60]
[4,1496745912,718,4,3,0.60]
[2,1496745915,362,4,3,0.30]
[2,1496745912,3878,4,3,0.20]
[2,1496745915,1901,4,3,0.30]
[2,1496745910,1901,4,3,0.30]
For each tag and each listner, I want to find the data from the last 10 seconds of timestamps. Then, for that 10-second window, I need the average of the rssi values, like this:
2,1496745918,718,4,3,0.60
2,1496745917,718,4,3,1.30
2,1496745916,718,4,1,2.20
2,1496745914,718,1,2,3.10
2,1496745911,718,1,2,6.10
4,1496745910,1901,1,2,0.30
4,1496745908,1901,1,2,1.30
..........................
..........................
This is the kind of result I need. Any solution or suggestion is appreciated.
NOTE: I am using Spark with Scala.
I tried a Spark SQL query, but it does not work properly.
val filteravg = avg.registerTempTable("avg")
val avgfinal = sqlContext.sql("SELECT tagid,timestamp,listner FROM (SELECT tagid,timestamp,listner,dense_rank() OVER (PARTITION BY _c6 ORDER BY _c5 ASC) as rank FROM avg) tmp WHERE rank <= 10")
avgfinal.collect.foreach(println)
I am also trying to do this with arrays. Any help will be appreciated.
If you already have a dataframe as
+-----+----------+-------+-----+--------+----+
|tagid|timestamp |listner|orgid|suborgid|rssi|
+-----+----------+-------+-----+--------+----+
|4 |1496745915|718 |4 |3 |0.30|
|2 |1496745915|3878 |4 |3 |0.20|
|4 |1496745918|362 |4 |3 |0.60|
|4 |1496745913|362 |4 |3 |0.60|
|2 |1496745918|362 |4 |3 |0.10|
|3 |1496745912|718 |4 |3 |0.05|
|2 |1496745918|718 |4 |3 |0.30|
|4 |1496745911|1901 |4 |3 |0.60|
|4 |1496745912|718 |4 |3 |0.60|
|2 |1496745915|362 |4 |3 |0.30|
|2 |1496745912|3878 |4 |3 |0.20|
|2 |1496745915|1901 |4 |3 |0.30|
|2 |1496745910|1901 |4 |3 |0.30|
+-----+----------+-------+-----+--------+----+
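The following should work for you. Note that it partitions by tagid only; add "listner" to both partitionBy calls if you need the average per tag and per listner:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// latest timestamp of each tag, attached to every row of that tag
df.withColumn("firstValue", first("timestamp") over Window.partitionBy("tagid").orderBy($"timestamp".desc))
  // keep only the rows within 10 seconds of the latest timestamp
  .filter($"firstValue".cast("long") - $"timestamp".cast("long") < 10)
  // average rssi over the remaining rows of each tag
  .withColumn("average", avg("rssi") over Window.partitionBy("tagid"))
  .drop("firstValue")
  .show(false)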
you should have output as
+-----+----------+-------+-----+--------+----+-------------------+
|tagid|timestamp |listner|orgid|suborgid|rssi|average |
+-----+----------+-------+-----+--------+----+-------------------+
|3 |1496745912|718 |4 |3 |0.05|0.05 |
|4 |1496745918|362 |4 |3 |0.60|0.54 |
|4 |1496745915|718 |4 |3 |0.30|0.54 |
|4 |1496745913|362 |4 |3 |0.60|0.54 |
|4 |1496745912|718 |4 |3 |0.60|0.54 |
|4 |1496745911|1901 |4 |3 |0.60|0.54 |
|2 |1496745918|362 |4 |3 |0.10|0.24285714285714288|
|2 |1496745918|718 |4 |3 |0.30|0.24285714285714288|
|2 |1496745915|3878 |4 |3 |0.20|0.24285714285714288|
|2 |1496745915|362 |4 |3 |0.30|0.24285714285714288|
|2 |1496745915|1901 |4 |3 |0.30|0.24285714285714288|
|2 |1496745912|3878 |4 |3 |0.20|0.24285714285714288|
|2 |1496745910|1901 |4 |3 |0.30|0.24285714285714288|
+-----+----------+-------+-----+--------+----+-------------------+
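As a cross-check of the averages in the table, here is a minimal plain-Python sketch of the same computation: find the latest timestamp per tag, keep the readings less than 10 seconds older than it, and average their rssi values. The function name average_last_window and the reduced (tagid, timestamp, rssi) tuples are assumptions made for this illustration; like the code in the answer, it groups by tagid only.

```python
from collections import defaultdict


def average_last_window(records, window_seconds=10):
    """records are (tagid, timestamp, rssi); returns {tagid: average rssi}."""
    by_tag = defaultdict(list)
    for tagid, timestamp, rssi in records:
        by_tag[tagid].append((timestamp, rssi))
    averages = {}
    for tagid, readings in by_tag.items():
        latest = max(ts for ts, _ in readings)
        # keep readings less than window_seconds older than the latest one
        recent = [rssi for ts, rssi in readings if latest - ts < window_seconds]
        averages[tagid] = sum(recent) / len(recent)
    return averages


records = [
    (4, 1496745915, 0.30), (2, 1496745915, 0.20), (4, 1496745918, 0.60),
    (4, 1496745913, 0.60), (2, 1496745918, 0.10), (3, 1496745912, 0.05),
    (2, 1496745918, 0.30), (4, 1496745911, 0.60), (4, 1496745912, 0.60),
    (2, 1496745915, 0.30), (2, 1496745912, 0.20), (2, 1496745915, 0.30),
    (2, 1496745910, 0.30),
]

averages = average_last_window(records)
print(averages)
```

To group per tag and per listner instead, the dictionary key would become a (tagid, listner) pair.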

Spark 1.6 VectorAssembler unexpected results

I am trying to create a label-feature DataFrame using Spark's VectorAssembler.
According to the Spark docs, it should be as simple as this:
val incidentDF = sqlContext.sql("select `is_similar`, `cosine_similarity`,..... from some.table")
//vectorassembler: compact all relevant columns into a vector
val assembler = new VectorAssembler()
assembler.setInputCols(Array("cosine_similarity", ....."))
assembler.setOutputCol("features")
val output = assembler.transform(incidentDF).select("is_similar", "features").withColumnRenamed("is_similar", "label")
However, I get unexpected results.
This:
+----------+---------------------+----------------------------+----------------------+-----------------------------+-----------------------+------------------------------+--------------------+-------------+----------------+-------------+-------------+-------------------+--------+-------------------+---------------------------+----------------------------------+----------------------------+-----------------------------------+-----------------------------+------------------------------------+--------------------+------------------------------------------+-----------------------------------+------------------------------------+-----------------------------+
|0 |0.21437323142813602 |0.08703882797784893 |0.23570226039551587 |0.10050378152592121 |0.10206207261596577 |0.0 |1 |1 |1 |1 |1 |1 |1 |0.26373626373626374|0.012967453461681464 |0.007624195465949381 |0.014425347541872306 |0.008896738386617248 |0.022695267556861232 |0.0 |1 |0.16838138468917166 |0.15434287415564008 |0.3922322702763681 |0.34874291623145787 |
|1 |0.5303300858899107 |0.5017452060042545 |0.5303300858899107 |0.5017452060042545 |0.5303300858899107 |0.5017452060042545 |1 |1 |1 |1 |1 |1 |1 |0.6870229007633588 |0.3534850108895589 |0.5857224407945156 |0.36079979664267925 |0.5853463384675868 |0.36971703925333405 |0.5814734067275937 |0 |1.0 |0.9999999999999998 |1.0 |0.9999999999999998 |
|0 |0.31754264805429416 |0.30151134457776363 |0.33541019662496846 |0.3344968040028363 |0.2867696673382022 |0.26111648393354675 |1 |1 |0 |1 |1 |1 |1 |0.41600000000000004|0.10867521883199269 |0.1920005048084368 |0.1322792942407786 |0.2477844869237889 |0.11802058757911914 |0.16554971608261862 |1 |0.0 |0.01605611773109364 |0.0 |0.16666666666666666 |
|0 |0.16169041669088866 |0.0 |0.1666666666666667 |0.0 |0.09622504486493764 |0.0 |1 |1 |1 |1 |1 |1 |1 |0.26666666666666666|0.012517205514308224 |0.0 |0.012752837227090714 |0.0 |0.021516657911501622 |0.0 |1 |0.16838138468917166 |0.15434287415564008 |0.3922322702763681 |0.34874291623145787 |
|0 |0.2750456656690116 |0.1860521018838127 |0.2858309752375147 |0.19611613513818402 |0.223606797749979 |0.1386750490563073 |1 |1 |1 |1 |1 |1 |1 |0.34862385321100914|0.06278282792172384 |0.09178430436891666 |0.06694373400084344 |0.08253907697526759 |0.07508140721703477 |0.10856631569349082 |1 |0.3014783135305502 |0.25688979598845174 |0.5590169943749475 |0.47628967220784013 |
|0 |0.2449489742783178 |0.19810721293758182 |0.26352313834736496 |0.2307692307692308 |0.21629522817435007 |0.16012815380508716 |1 |1 |0 |1 |1 |1 |1 |0.4838709677419355 |0.12209521675839743 |0.19126420671254496 |0.1475066405521753 |0.2459312750965279 |0.1242978535834829 |0.1886519686826469 |1 |0.0 |0.01605611773109364 |0.0 |0.16666666666666666 |
|0 |0.08320502943378437 |0.09642365197998375 |0.11952286093343938 |0.13912166872805048 |0.0 |0.0 |0 |0 |0 |1 |0 |0 |1 |0.12 |0.04035362208133099 |0.04456121367953338 |0.04819698770773715 |0.0538656145326838 |0.0 |0.0 |8 |0.05825659037076343 |0.05246835256923818 |0.112089707663561 |0.11278230910134424 |
|0 |0.20784609690826525 |0.1846372364689991 |0.26111648393354675 |0.24806946917841688 |0.0 |0.0 |0 |0 |0 |1 |0 |1 |1 |0.0 |0.07233915683015167 |0.0716540790026919 |0.08229370516713722 |0.08299754342027771 |0.0 |0.0 |6 |0.04977054860197747 |0.06558734556106822 |0.09607689228305229 |0.21759706994462227 |
|1 |0.8926577981869824 |0.9066143160193102 |0.914335372996105 |0.9226517385233938 |0.5477225575051661 |0.6324555320336759 |0 |0 |0 |0 |0 |0 |1 |0.5309734513274337 |0.8734996606615234 |0.8946928809168011 |0.8791317315987442 |0.8973856295754765 |0.3496004425218079 |0.48223175160299564 |0 |0.0 |0.0 |0.0 |0.0 |
|1 |0.5185629788417315 |0.8432740427115678 |0.5118906968889915 |0.8819171036881969 |0.24253562503633297 |0.3333333333333333 |1 |1 |0 |1 |1 |1 |1 |0.09375 |0.18908955158360016 |0.8022196858263557 |0.17544355300115252 |0.8474955187144462 |0.13927839835275616 |0.2838123484309787 |6 |0.0 |0.0 |0.0 |0.0 |
|1 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0 |0 |1 |1 |0 |0 |1 |0.14814814814814814|0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |1 |0.02170244443925667 |0.020410228072244255 |0.15062893357603016 |0.28922903686544305 |
|0 |0.26860765467512676 |0.06271815075053182 |0.29515063885057 |0.07485976927589244 |0.0 |0.0 |0 |0 |1 |1 |0 |0 |1 |0.08 |0.04804110216570731 |0.03027143543580809 |0.05341183077151175 |0.03431607006581793 |0.0 |0.0 |1 |0.0 |0.022192268824097448 |0.0 |0.24019223070763074 |
|1 |0.33333333333333337 |0.40824829046386296 |0.33333333333333337 |0.40824829046386296 |0.33333333333333337 |0.40824829046386296 |0 |0 |0 |1 |0 |1 |1 |0.4516129032258064 |0.3310013083604027 |0.3537516145932176 |0.3444032278588375 |0.3667764454925114 |0.3042153384207993 |0.3408010155297054 |6 |0.28297384452448776 |0.23615630148525626 |0.2182178902359924 |0.19245008972987526 |
|0 |0.0519174131651165 |0.0 |0.0917662935482247 |0.0 |0.0 |0.0 |0 |0 |1 |1 |0 |0 |1 |0.0967741935483871 |0.03050544547960052 |0.0 |0.0490339271669166 |0.0 |0.0 |0.0 |5 |0.0 |0.0 |0.0 |0.0 |
|0 |0.049160514400834666 |0.0 |0.02627034687463669 |0.0 |0.0 |0.0 |0 |0 |0 |0 |0 |0 |1 |0.1282051282051282 |0.006316709944109247 |0.0 |0.003132143258557757 |0.0 |0.0 |0.0 |3 |0.0 |0.019794166951004794 |0.0 |0.15638581054280606 |
|0 |0.07082882469748285 |0.0 |0.08494119857293758 |0.0 |0.0 |0.0 |0 |0 |0 |1 |0 |1 |1 |0.06060606060606061|0.004924318378089263 |0.0 |0.005845759285912874 |0.0 |0.0 |0.0 |4 |0.023119472246583003 |0.010659666129102227 |0.03210289415620512 |0.04420122177473814 |
|0 |0.1924976258772545 |0.038014296063485276 |0.19149207069693872 |0.02521364528296496 |0.0 |0.0 |0 |0 |0 |1 |0 |1 |1 |0.125 |0.020931167922971575 |0.00448818821863432 |0.02118543184402528 |0.0026553570889578286 |0.0 |0.0 |5 |0.02336541089352552 |0.02401310014140845 |0.11919975664202526 |0.10760330515353056 |
|1 |0.17095921484405754 |0.08434614994311695 |0.20073126386549828 |0.10085458113185984 |0.0 |0.0 |0 |0 |1 |0 |0 |1 |1 |0.07407407407407407|0.09182827200781651 |0.05443489342945772 |0.10010815165693956 |0.05842165588249673 |0.0 |0.0 |8 |0.2973721930047951 |0.168690765981807 |0.5637584095764486 |0.48478000681923245 |
|0 |0.1405456737852613 |0.049147318718299055 |0.11846977555181847 |0.08333333333333333 |0.22360679774997896 |0.0 |1 |1 |1 |1 |1 |1 |1 |0.08333333333333331|0.01937969263670974 |0.003427781939920998 |0.022922840542318093 |0.006443992956721386 |0.03572605281706383 |0.0 |5 |0.26345546669165004 |0.2557786050767472 |0.405007416909787 |0.45121260440202404 |
|1 |0.6793662204867575 |0.753778361444409 |0.5773502691896258 |0.6396021490668313 |0.5773502691896258 |0.8164965809277259 |0 |0 |1 |1 |0 |0 |1 |0.6875 |0.7466360531069871 |0.8217912018147824 |0.7034677645212848 |0.6620051533994062 |0.469853400225108 |0.9321213932723664 |6 |0.0 |0.011793139853629018 |0.0 |0.14433756729740643 |
+----------+---------------------+----------------------------+----------------------+-----------------------------+-----------------------+------------------------------+--------------------+-------------+----------------+-------------+-------------+-------------------+--------+-------------------+---------------------------+----------------------------------+----------------------------+-----------------------------------+-----------------------------+------------------------------------+--------------------+------------------------------------------+-----------------------------------+------------------------------------+-----------------------------+
Becomes this:
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0 |[0.21437323142813602,0.08703882797784893,0.23570226039551587,0.10050378152592121,0.10206207261596577,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.26373626373626374,0.012967453461681464,0.007624195465949381,0.014425347541872306,0.008896738386617248,0.022695267556861232,0.0,1.0,0.16838138468917166,0.15434287415564008,0.3922322702763681,0.34874291623145787] |
|1 |[0.5303300858899107,0.5017452060042545,0.5303300858899107,0.5017452060042545,0.5303300858899107,0.5017452060042545,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.6870229007633588,0.3534850108895589,0.5857224407945156,0.36079979664267925,0.5853463384675868,0.36971703925333405,0.5814734067275937,0.0,1.0,0.9999999999999998,1.0,0.9999999999999998] |
|0 |[0.31754264805429416,0.30151134457776363,0.33541019662496846,0.3344968040028363,0.2867696673382022,0.26111648393354675,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.41600000000000004,0.10867521883199269,0.1920005048084368,0.1322792942407786,0.2477844869237889,0.11802058757911914,0.16554971608261862,1.0,0.0,0.01605611773109364,0.0,0.16666666666666666] |
|0 |[0.16169041669088866,0.0,0.1666666666666667,0.0,0.09622504486493764,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.26666666666666666,0.012517205514308224,0.0,0.012752837227090714,0.0,0.021516657911501622,0.0,1.0,0.16838138468917166,0.15434287415564008,0.3922322702763681,0.34874291623145787] |
|0 |[0.2750456656690116,0.1860521018838127,0.2858309752375147,0.19611613513818402,0.223606797749979,0.1386750490563073,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.34862385321100914,0.06278282792172384,0.09178430436891666,0.06694373400084344,0.08253907697526759,0.07508140721703477,0.10856631569349082,1.0,0.3014783135305502,0.25688979598845174,0.5590169943749475,0.47628967220784013]|
|0 |[0.2449489742783178,0.19810721293758182,0.26352313834736496,0.2307692307692308,0.21629522817435007,0.16012815380508716,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.4838709677419355,0.12209521675839743,0.19126420671254496,0.1475066405521753,0.2459312750965279,0.1242978535834829,0.1886519686826469,1.0,0.0,0.01605611773109364,0.0,0.16666666666666666] |
|0 |[0.08320502943378437,0.09642365197998375,0.11952286093343938,0.13912166872805048,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.12,0.04035362208133099,0.04456121367953338,0.04819698770773715,0.0538656145326838,0.0,0.0,8.0,0.05825659037076343,0.05246835256923818,0.112089707663561,0.11278230910134424] |
|0 |[0.20784609690826525,0.1846372364689991,0.26111648393354675,0.24806946917841688,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.07233915683015167,0.0716540790026919,0.08229370516713722,0.08299754342027771,0.0,0.0,6.0,0.04977054860197747,0.06558734556106822,0.09607689228305229,0.21759706994462227] |
|1 |(25,[0,1,2,3,4,5,12,13,14,15,16,17,18,19],[0.8926577981869824,0.9066143160193102,0.914335372996105,0.9226517385233938,0.5477225575051661,0.6324555320336759,1.0,0.5309734513274337,0.8734996606615234,0.8946928809168011,0.8791317315987442,0.8973856295754765,0.3496004425218079,0.48223175160299564]) |
|1 |[0.5185629788417315,0.8432740427115678,0.5118906968889915,0.8819171036881969,0.24253562503633297,0.3333333333333333,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.09375,0.18908955158360016,0.8022196858263557,0.17544355300115252,0.8474955187144462,0.13927839835275616,0.2838123484309787,6.0,0.0,0.0,0.0,0.0] |
|1 |(25,[8,9,12,13,20,21,22,23,24],[1.0,1.0,1.0,0.14814814814814814,1.0,0.02170244443925667,0.020410228072244255,0.15062893357603016,0.28922903686544305]) |
|0 |(25,[0,1,2,3,8,9,12,13,14,15,16,17,20,22,24],[0.26860765467512676,0.06271815075053182,0.29515063885057,0.07485976927589244,1.0,1.0,1.0,0.08,0.04804110216570731,0.03027143543580809,0.05341183077151175,0.03431607006581793,1.0,0.022192268824097448,0.24019223070763074]) |
|1 |[0.33333333333333337,0.40824829046386296,0.33333333333333337,0.40824829046386296,0.33333333333333337,0.40824829046386296,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.4516129032258064,0.3310013083604027,0.3537516145932176,0.3444032278588375,0.3667764454925114,0.3042153384207993,0.3408010155297054,6.0,0.28297384452448776,0.23615630148525626,0.2182178902359924,0.19245008972987526]|
|0 |(25,[0,2,8,9,12,13,14,16,20],[0.0519174131651165,0.0917662935482247,1.0,1.0,1.0,0.0967741935483871,0.03050544547960052,0.0490339271669166,5.0]) |
|0 |(25,[0,2,12,13,14,16,20,22,24],[0.049160514400834666,0.02627034687463669,1.0,0.1282051282051282,0.006316709944109247,0.003132143258557757,3.0,0.019794166951004794,0.15638581054280606]) |
|0 |(25,[0,2,9,11,12,13,14,16,20,21,22,23,24],[0.07082882469748285,0.08494119857293758,1.0,1.0,1.0,0.06060606060606061,0.004924318378089263,0.005845759285912874,4.0,0.023119472246583003,0.010659666129102227,0.03210289415620512,0.04420122177473814]) |
|0 |[0.1924976258772545,0.038014296063485276,0.19149207069693872,0.02521364528296496,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.125,0.020931167922971575,0.00448818821863432,0.02118543184402528,0.0026553570889578286,0.0,0.0,5.0,0.02336541089352552,0.02401310014140845,0.11919975664202526,0.10760330515353056] |
|1 |[0.17095921484405754,0.08434614994311695,0.20073126386549828,0.10085458113185984,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.07407407407407407,0.09182827200781651,0.05443489342945772,0.10010815165693956,0.05842165588249673,0.0,0.0,8.0,0.2973721930047951,0.168690765981807,0.5637584095764486,0.48478000681923245] |
|0 |[0.1405456737852613,0.049147318718299055,0.11846977555181847,0.08333333333333333,0.22360679774997896,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.08333333333333331,0.01937969263670974,0.003427781939920998,0.022922840542318093,0.006443992956721386,0.03572605281706383,0.0,5.0,0.26345546669165004,0.2557786050767472,0.405007416909787,0.45121260440202404] |
|1 |[0.6793662204867575,0.753778361444409,0.5773502691896258,0.6396021490668313,0.5773502691896258,0.8164965809277259,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.6875,0.7466360531069871,0.8217912018147824,0.7034677645212848,0.6620051533994062,0.469853400225108,0.9321213932723664,6.0,0.0,0.011793139853629018,0.0,0.14433756729740643] |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
As you can see, the output contains two different representations instead of one uniform vector format.
Is this a bug in CDH's Spark (1.6), or am I missing something?
TL;DR This is normal behavior.
Your data contains a number of sparse rows. When assembled, these are converted to SparseVector and represented in the output as
(size, [idx1, idx2, ..., idxm], [val1, val2, ..., valm])
where idx1..idxm are the zero-based positions of the non-zero values and val1..valm are the corresponding values. So the following
(25,[8,9,12,13, ...],[1.0,1.0,1.0,0.14814814814814814, ...])
is a SparseVector of size 25, where the value at index 9 is 1.0 and the value at index 13 is about 0.148.
If a row is dense (roughly, when no more than about a third of its values are zero), you get a DenseVector, which in your output is represented as:
[val0, val1, ..., valn]
Both representations are perfectly valid, and the majority of tools will accept both just fine.
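The choice between the two representations can be imitated in plain Python. The sketch below uses the size heuristic as I understand it from Spark's Vector.compressed (sparse when 1.5 * (nnz + 1) < size); treat that exact threshold as an assumption rather than a documented guarantee. The function name assembled_representation is made up for this illustration. The two sample rows are feature rows from the question's table, one that appears as sparse in the output and one that appears as dense.

```python
def assembled_representation(values):
    """Pick a sparse or dense representation with a size-based heuristic."""
    size = len(values)
    nnz = sum(1 for v in values if v != 0.0)  # number of non-zero entries
    # assumed heuristic: sparse only when it needs clearly less storage
    if 1.5 * (nnz + 1.0) < size:
        indices = [i for i, v in enumerate(values) if v != 0.0]
        return ("sparse", size, indices, [values[i] for i in indices])
    return ("dense", list(values))


# a row with 9 non-zero values out of 25 (shown as sparse in the question)
sparse_row = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0, 0, 1, 1, 0, 0, 1,
              0.14814814814814814, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1,
              0.02170244443925667, 0.020410228072244255,
              0.15062893357603016, 0.28922903686544305]

# a row with 16 non-zero values out of 25 (shown as dense in the question)
dense_row = [0.20784609690826525, 0.1846372364689991, 0.26111648393354675,
             0.24806946917841688, 0.0, 0.0, 0, 0, 0, 1, 0, 1, 1, 0.0,
             0.07233915683015167, 0.0716540790026919, 0.08229370516713722,
             0.08299754342027771, 0.0, 0.0, 6, 0.04977054860197747,
             0.06558734556106822, 0.09607689228305229, 0.21759706994462227]

sparse_result = assembled_representation(sparse_row)
dense_result = assembled_representation(dense_row)
print(sparse_result[0], sparse_result[2])
print(dense_result[0])
```

Either way the vector holds the same 25 values; only the storage format differs, which is why downstream tools generally accept both.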