I am trying to create a label-feature DataFrame using Spark's VectorAssembler.
According to the Spark docs, it should be as simple as this:
val incidentDF = sqlContext.sql("select `is_similar`, `cosine_similarity`,..... from some.table")
// VectorAssembler: pack all relevant columns into a single vector column
// (requires: import org.apache.spark.ml.feature.VectorAssembler)
val assembler = new VectorAssembler()
assembler.setInputCols(Array("cosine_similarity", ....."))
assembler.setOutputCol("features")
val output = assembler.transform(incidentDF).select("is_similar", "features").withColumnRenamed("is_similar", "label")
However, I get unexpected results.
This:
+----------+---------------------+----------------------------+----------------------+-----------------------------+-----------------------+------------------------------+--------------------+-------------+----------------+-------------+-------------+-------------------+--------+-------------------+---------------------------+----------------------------------+----------------------------+-----------------------------------+-----------------------------+------------------------------------+--------------------+------------------------------------------+-----------------------------------+------------------------------------+-----------------------------+
|0 |0.21437323142813602 |0.08703882797784893 |0.23570226039551587 |0.10050378152592121 |0.10206207261596577 |0.0 |1 |1 |1 |1 |1 |1 |1 |0.26373626373626374|0.012967453461681464 |0.007624195465949381 |0.014425347541872306 |0.008896738386617248 |0.022695267556861232 |0.0 |1 |0.16838138468917166 |0.15434287415564008 |0.3922322702763681 |0.34874291623145787 |
|1 |0.5303300858899107 |0.5017452060042545 |0.5303300858899107 |0.5017452060042545 |0.5303300858899107 |0.5017452060042545 |1 |1 |1 |1 |1 |1 |1 |0.6870229007633588 |0.3534850108895589 |0.5857224407945156 |0.36079979664267925 |0.5853463384675868 |0.36971703925333405 |0.5814734067275937 |0 |1.0 |0.9999999999999998 |1.0 |0.9999999999999998 |
|0 |0.31754264805429416 |0.30151134457776363 |0.33541019662496846 |0.3344968040028363 |0.2867696673382022 |0.26111648393354675 |1 |1 |0 |1 |1 |1 |1 |0.41600000000000004|0.10867521883199269 |0.1920005048084368 |0.1322792942407786 |0.2477844869237889 |0.11802058757911914 |0.16554971608261862 |1 |0.0 |0.01605611773109364 |0.0 |0.16666666666666666 |
|0 |0.16169041669088866 |0.0 |0.1666666666666667 |0.0 |0.09622504486493764 |0.0 |1 |1 |1 |1 |1 |1 |1 |0.26666666666666666|0.012517205514308224 |0.0 |0.012752837227090714 |0.0 |0.021516657911501622 |0.0 |1 |0.16838138468917166 |0.15434287415564008 |0.3922322702763681 |0.34874291623145787 |
|0 |0.2750456656690116 |0.1860521018838127 |0.2858309752375147 |0.19611613513818402 |0.223606797749979 |0.1386750490563073 |1 |1 |1 |1 |1 |1 |1 |0.34862385321100914|0.06278282792172384 |0.09178430436891666 |0.06694373400084344 |0.08253907697526759 |0.07508140721703477 |0.10856631569349082 |1 |0.3014783135305502 |0.25688979598845174 |0.5590169943749475 |0.47628967220784013 |
|0 |0.2449489742783178 |0.19810721293758182 |0.26352313834736496 |0.2307692307692308 |0.21629522817435007 |0.16012815380508716 |1 |1 |0 |1 |1 |1 |1 |0.4838709677419355 |0.12209521675839743 |0.19126420671254496 |0.1475066405521753 |0.2459312750965279 |0.1242978535834829 |0.1886519686826469 |1 |0.0 |0.01605611773109364 |0.0 |0.16666666666666666 |
|0 |0.08320502943378437 |0.09642365197998375 |0.11952286093343938 |0.13912166872805048 |0.0 |0.0 |0 |0 |0 |1 |0 |0 |1 |0.12 |0.04035362208133099 |0.04456121367953338 |0.04819698770773715 |0.0538656145326838 |0.0 |0.0 |8 |0.05825659037076343 |0.05246835256923818 |0.112089707663561 |0.11278230910134424 |
|0 |0.20784609690826525 |0.1846372364689991 |0.26111648393354675 |0.24806946917841688 |0.0 |0.0 |0 |0 |0 |1 |0 |1 |1 |0.0 |0.07233915683015167 |0.0716540790026919 |0.08229370516713722 |0.08299754342027771 |0.0 |0.0 |6 |0.04977054860197747 |0.06558734556106822 |0.09607689228305229 |0.21759706994462227 |
|1 |0.8926577981869824 |0.9066143160193102 |0.914335372996105 |0.9226517385233938 |0.5477225575051661 |0.6324555320336759 |0 |0 |0 |0 |0 |0 |1 |0.5309734513274337 |0.8734996606615234 |0.8946928809168011 |0.8791317315987442 |0.8973856295754765 |0.3496004425218079 |0.48223175160299564 |0 |0.0 |0.0 |0.0 |0.0 |
|1 |0.5185629788417315 |0.8432740427115678 |0.5118906968889915 |0.8819171036881969 |0.24253562503633297 |0.3333333333333333 |1 |1 |0 |1 |1 |1 |1 |0.09375 |0.18908955158360016 |0.8022196858263557 |0.17544355300115252 |0.8474955187144462 |0.13927839835275616 |0.2838123484309787 |6 |0.0 |0.0 |0.0 |0.0 |
|1 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0 |0 |1 |1 |0 |0 |1 |0.14814814814814814|0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |1 |0.02170244443925667 |0.020410228072244255 |0.15062893357603016 |0.28922903686544305 |
|0 |0.26860765467512676 |0.06271815075053182 |0.29515063885057 |0.07485976927589244 |0.0 |0.0 |0 |0 |1 |1 |0 |0 |1 |0.08 |0.04804110216570731 |0.03027143543580809 |0.05341183077151175 |0.03431607006581793 |0.0 |0.0 |1 |0.0 |0.022192268824097448 |0.0 |0.24019223070763074 |
|1 |0.33333333333333337 |0.40824829046386296 |0.33333333333333337 |0.40824829046386296 |0.33333333333333337 |0.40824829046386296 |0 |0 |0 |1 |0 |1 |1 |0.4516129032258064 |0.3310013083604027 |0.3537516145932176 |0.3444032278588375 |0.3667764454925114 |0.3042153384207993 |0.3408010155297054 |6 |0.28297384452448776 |0.23615630148525626 |0.2182178902359924 |0.19245008972987526 |
|0 |0.0519174131651165 |0.0 |0.0917662935482247 |0.0 |0.0 |0.0 |0 |0 |1 |1 |0 |0 |1 |0.0967741935483871 |0.03050544547960052 |0.0 |0.0490339271669166 |0.0 |0.0 |0.0 |5 |0.0 |0.0 |0.0 |0.0 |
|0 |0.049160514400834666 |0.0 |0.02627034687463669 |0.0 |0.0 |0.0 |0 |0 |0 |0 |0 |0 |1 |0.1282051282051282 |0.006316709944109247 |0.0 |0.003132143258557757 |0.0 |0.0 |0.0 |3 |0.0 |0.019794166951004794 |0.0 |0.15638581054280606 |
|0 |0.07082882469748285 |0.0 |0.08494119857293758 |0.0 |0.0 |0.0 |0 |0 |0 |1 |0 |1 |1 |0.06060606060606061|0.004924318378089263 |0.0 |0.005845759285912874 |0.0 |0.0 |0.0 |4 |0.023119472246583003 |0.010659666129102227 |0.03210289415620512 |0.04420122177473814 |
|0 |0.1924976258772545 |0.038014296063485276 |0.19149207069693872 |0.02521364528296496 |0.0 |0.0 |0 |0 |0 |1 |0 |1 |1 |0.125 |0.020931167922971575 |0.00448818821863432 |0.02118543184402528 |0.0026553570889578286 |0.0 |0.0 |5 |0.02336541089352552 |0.02401310014140845 |0.11919975664202526 |0.10760330515353056 |
|1 |0.17095921484405754 |0.08434614994311695 |0.20073126386549828 |0.10085458113185984 |0.0 |0.0 |0 |0 |1 |0 |0 |1 |1 |0.07407407407407407|0.09182827200781651 |0.05443489342945772 |0.10010815165693956 |0.05842165588249673 |0.0 |0.0 |8 |0.2973721930047951 |0.168690765981807 |0.5637584095764486 |0.48478000681923245 |
|0 |0.1405456737852613 |0.049147318718299055 |0.11846977555181847 |0.08333333333333333 |0.22360679774997896 |0.0 |1 |1 |1 |1 |1 |1 |1 |0.08333333333333331|0.01937969263670974 |0.003427781939920998 |0.022922840542318093 |0.006443992956721386 |0.03572605281706383 |0.0 |5 |0.26345546669165004 |0.2557786050767472 |0.405007416909787 |0.45121260440202404 |
|1 |0.6793662204867575 |0.753778361444409 |0.5773502691896258 |0.6396021490668313 |0.5773502691896258 |0.8164965809277259 |0 |0 |1 |1 |0 |0 |1 |0.6875 |0.7466360531069871 |0.8217912018147824 |0.7034677645212848 |0.6620051533994062 |0.469853400225108 |0.9321213932723664 |6 |0.0 |0.011793139853629018 |0.0 |0.14433756729740643 |
+----------+---------------------+----------------------------+----------------------+-----------------------------+-----------------------+------------------------------+--------------------+-------------+----------------+-------------+-------------+-------------------+--------+-------------------+---------------------------+----------------------------------+----------------------------+-----------------------------------+-----------------------------+------------------------------------+--------------------+------------------------------------------+-----------------------------------+------------------------------------+-----------------------------+
Becomes this:
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0 |[0.21437323142813602,0.08703882797784893,0.23570226039551587,0.10050378152592121,0.10206207261596577,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.26373626373626374,0.012967453461681464,0.007624195465949381,0.014425347541872306,0.008896738386617248,0.022695267556861232,0.0,1.0,0.16838138468917166,0.15434287415564008,0.3922322702763681,0.34874291623145787] |
|1 |[0.5303300858899107,0.5017452060042545,0.5303300858899107,0.5017452060042545,0.5303300858899107,0.5017452060042545,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.6870229007633588,0.3534850108895589,0.5857224407945156,0.36079979664267925,0.5853463384675868,0.36971703925333405,0.5814734067275937,0.0,1.0,0.9999999999999998,1.0,0.9999999999999998] |
|0 |[0.31754264805429416,0.30151134457776363,0.33541019662496846,0.3344968040028363,0.2867696673382022,0.26111648393354675,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.41600000000000004,0.10867521883199269,0.1920005048084368,0.1322792942407786,0.2477844869237889,0.11802058757911914,0.16554971608261862,1.0,0.0,0.01605611773109364,0.0,0.16666666666666666] |
|0 |[0.16169041669088866,0.0,0.1666666666666667,0.0,0.09622504486493764,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.26666666666666666,0.012517205514308224,0.0,0.012752837227090714,0.0,0.021516657911501622,0.0,1.0,0.16838138468917166,0.15434287415564008,0.3922322702763681,0.34874291623145787] |
|0 |[0.2750456656690116,0.1860521018838127,0.2858309752375147,0.19611613513818402,0.223606797749979,0.1386750490563073,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.34862385321100914,0.06278282792172384,0.09178430436891666,0.06694373400084344,0.08253907697526759,0.07508140721703477,0.10856631569349082,1.0,0.3014783135305502,0.25688979598845174,0.5590169943749475,0.47628967220784013]|
|0 |[0.2449489742783178,0.19810721293758182,0.26352313834736496,0.2307692307692308,0.21629522817435007,0.16012815380508716,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.4838709677419355,0.12209521675839743,0.19126420671254496,0.1475066405521753,0.2459312750965279,0.1242978535834829,0.1886519686826469,1.0,0.0,0.01605611773109364,0.0,0.16666666666666666] |
|0 |[0.08320502943378437,0.09642365197998375,0.11952286093343938,0.13912166872805048,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.12,0.04035362208133099,0.04456121367953338,0.04819698770773715,0.0538656145326838,0.0,0.0,8.0,0.05825659037076343,0.05246835256923818,0.112089707663561,0.11278230910134424] |
|0 |[0.20784609690826525,0.1846372364689991,0.26111648393354675,0.24806946917841688,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.07233915683015167,0.0716540790026919,0.08229370516713722,0.08299754342027771,0.0,0.0,6.0,0.04977054860197747,0.06558734556106822,0.09607689228305229,0.21759706994462227] |
|1 |(25,[0,1,2,3,4,5,12,13,14,15,16,17,18,19],[0.8926577981869824,0.9066143160193102,0.914335372996105,0.9226517385233938,0.5477225575051661,0.6324555320336759,1.0,0.5309734513274337,0.8734996606615234,0.8946928809168011,0.8791317315987442,0.8973856295754765,0.3496004425218079,0.48223175160299564]) |
|1 |[0.5185629788417315,0.8432740427115678,0.5118906968889915,0.8819171036881969,0.24253562503633297,0.3333333333333333,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.09375,0.18908955158360016,0.8022196858263557,0.17544355300115252,0.8474955187144462,0.13927839835275616,0.2838123484309787,6.0,0.0,0.0,0.0,0.0] |
|1 |(25,[8,9,12,13,20,21,22,23,24],[1.0,1.0,1.0,0.14814814814814814,1.0,0.02170244443925667,0.020410228072244255,0.15062893357603016,0.28922903686544305]) |
|0 |(25,[0,1,2,3,8,9,12,13,14,15,16,17,20,22,24],[0.26860765467512676,0.06271815075053182,0.29515063885057,0.07485976927589244,1.0,1.0,1.0,0.08,0.04804110216570731,0.03027143543580809,0.05341183077151175,0.03431607006581793,1.0,0.022192268824097448,0.24019223070763074]) |
|1 |[0.33333333333333337,0.40824829046386296,0.33333333333333337,0.40824829046386296,0.33333333333333337,0.40824829046386296,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.4516129032258064,0.3310013083604027,0.3537516145932176,0.3444032278588375,0.3667764454925114,0.3042153384207993,0.3408010155297054,6.0,0.28297384452448776,0.23615630148525626,0.2182178902359924,0.19245008972987526]|
|0 |(25,[0,2,8,9,12,13,14,16,20],[0.0519174131651165,0.0917662935482247,1.0,1.0,1.0,0.0967741935483871,0.03050544547960052,0.0490339271669166,5.0]) |
|0 |(25,[0,2,12,13,14,16,20,22,24],[0.049160514400834666,0.02627034687463669,1.0,0.1282051282051282,0.006316709944109247,0.003132143258557757,3.0,0.019794166951004794,0.15638581054280606]) |
|0 |(25,[0,2,9,11,12,13,14,16,20,21,22,23,24],[0.07082882469748285,0.08494119857293758,1.0,1.0,1.0,0.06060606060606061,0.004924318378089263,0.005845759285912874,4.0,0.023119472246583003,0.010659666129102227,0.03210289415620512,0.04420122177473814]) |
|0 |[0.1924976258772545,0.038014296063485276,0.19149207069693872,0.02521364528296496,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.125,0.020931167922971575,0.00448818821863432,0.02118543184402528,0.0026553570889578286,0.0,0.0,5.0,0.02336541089352552,0.02401310014140845,0.11919975664202526,0.10760330515353056] |
|1 |[0.17095921484405754,0.08434614994311695,0.20073126386549828,0.10085458113185984,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.07407407407407407,0.09182827200781651,0.05443489342945772,0.10010815165693956,0.05842165588249673,0.0,0.0,8.0,0.2973721930047951,0.168690765981807,0.5637584095764486,0.48478000681923245] |
|0 |[0.1405456737852613,0.049147318718299055,0.11846977555181847,0.08333333333333333,0.22360679774997896,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.08333333333333331,0.01937969263670974,0.003427781939920998,0.022922840542318093,0.006443992956721386,0.03572605281706383,0.0,5.0,0.26345546669165004,0.2557786050767472,0.405007416909787,0.45121260440202404] |
|1 |[0.6793662204867575,0.753778361444409,0.5773502691896258,0.6396021490668313,0.5773502691896258,0.8164965809277259,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.6875,0.7466360531069871,0.8217912018147824,0.7034677645212848,0.6620051533994062,0.469853400225108,0.9321213932723664,6.0,0.0,0.011793139853629018,0.0,0.14433756729740643] |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
As you can see, the features column contains two different representations rather than one uniform vector format.
Is this a bug in CDH's Spark (1.6), or am I missing something?
TL;DR: This is normal behavior.
Your data contains a number of sparse rows. When assembled, these are converted to SparseVector and represented in the output as
(size, [idx1, idx2, ..., idxm], [val1, val2, ..., valm])
where idx1..idxm are the (0-based) positions of the non-zero values, and val1..valm are the corresponding values. So the following
(25,[8,9,12,13, ...],[1.0,1.0,1.0,0.14814814814814814, ...])
is a SparseVector of size 25, where the element at index 9 is equal to 1.0, the element at index 13 is approximately 0.148, and every index not listed is 0.0.
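To make the notation concrete, here is a minimal sketch in plain Scala (no Spark needed) that expands a (size, indices, values) triple into the equivalent dense array. The helper name expand and the 6-element example vector are made up for illustration; they are not part of Spark's API.

```scala
object SparseFormatDemo {
  // Hypothetical helper (not a Spark function): expands the
  // (size, indices, values) triple into a plain dense array.
  def expand(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must have the same length")
    val dense = Array.fill(size)(0.0) // positions not listed in indices stay 0.0
    indices.zip(values).foreach { case (i, v) => dense(i) = v } // indices are 0-based
    dense
  }

  def main(args: Array[String]): Unit = {
    // A made-up vector that Spark would print as (6,[1,4],[1.0,0.5])
    val dense = expand(6, Array(1, 4), Array(1.0, 0.5))
    println(dense.mkString("[", ",", "]")) // prints [0.0,1.0,0.0,0.0,0.5,0.0]
  }
}
```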
If a row is dense enough that the sparse encoding would not save space (in your sample, the rows with 16 or more non-zero values out of 25), you get DenseVectors, which in your output are represented as:
[val0, val1, ..., valn]
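As far as I can tell, the per-row choice comes from the compressed method on mllib vectors, which returns whichever of the two forms needs less storage; I am assuming that VectorAssembler applies it to each assembled row. The sketch below only needs the Spark 1.6 mllib classes on the classpath (no SparkContext); the object and method names are illustrative.

```scala
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}

object CompressedDemo {
  // Reports which storage format Vector.compressed picks for the given values.
  def formatOf(values: Double*): String =
    Vectors.dense(values.toArray).compressed match {
      case _: SparseVector => "sparse"
      case _: DenseVector  => "dense"
    }

  def main(args: Array[String]): Unit = {
    // One non-zero value out of eight: the sparse form should be smaller.
    println(formatOf(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0))
    // Three non-zero values out of four: the dense form should be smaller.
    println(formatOf(1.0, 2.0, 3.0, 0.0))
  }
}
```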
Both representations are perfectly valid, and the majority of tools will accept either one just fine.
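If you still prefer a uniform look (for example when inspecting the output by eye), one option is to convert every row to dense form with a UDF. This is only a sketch for Spark 1.6, where the ML pipeline uses org.apache.spark.mllib.linalg.Vector; the object name DensifyFeatures and the method densify are made up for illustration.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

object DensifyFeatures {
  // UDF that rewrites any vector (sparse or dense) in dense form.
  // The result is typed as Vector so the column keeps its vector type.
  private val toDense = udf((v: Vector) => v.toDense: Vector)

  // Replaces the given vector column with its dense equivalent.
  def densify(df: DataFrame, column: String = "features"): DataFrame =
    df.withColumn(column, toDense(col(column)))
}
```

With the DataFrame from the question this would be called as DensifyFeatures.densify(output).show(false). Keep in mind that dense storage uses more memory for mostly-zero rows, so this is usually only worth doing for display or for a tool that really requires dense input.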