Spark - Divide a dataframe into n number of records - scala

I have a dataframe with 2 or more columns and 1000 records. I want to split the data into chunks of 100 records each, randomly and without any conditions.
So the expected output, in terms of record counts, should look something like this:
[(1,2....100),(101,102,103...200),.....,(901,902...1000)]
Here's the solution that worked for my use case after trying different approaches:
https://stackoverflow.com/a/61276734/12322995

As @Shaido said, randomSplit is the popular approach for splitting a dataframe.
Another option is repartitionByRange, available in Spark >= 2.3. From the API docs:
public Dataset repartitionByRange(int numPartitions, scala.collection.Seq partitionExprs)
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is range partitioned. At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed.
Parameters: numPartitions - (undocumented); partitionExprs - (undocumented)
Returns: (undocumented)
Since: 2.3.0
package examples

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}

object RepartitionByRange extends App {
  // Quieten Spark's own logging
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder().appName(getClass.getName).master("local").getOrCreate()
  val sc = spark.sparkContext
  import spark.implicits._

  val t1 = sc.parallelize(0 until 1000).toDF("id")
  // Range-partition by id into 10 partitions, then emit one comma-separated string of ids per partition
  val repartitionedOrders: Dataset[String] = t1.repartitionByRange(10, $"id")
    .mapPartitions(rows => {
      val idsInPartition = rows.map(row => row.getAs[Int]("id")).toSeq.sorted.mkString(",")
      Iterator(idsInPartition)
    })
  repartitionedOrders.show(false)
  println("number of chunks or partitions :" + repartitionedOrders.rdd.getNumPartitions)
}
Result :
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99 |
|100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199|
|200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299|
|300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399|
|400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499|
|500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599|
|600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699|
|700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799|
|800,801,802,803,804,805,806,807,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899|
|900,901,902,903,904,905,906,907,908,909,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951,952,953,954,955,956,957,958,959,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
number of chunks or partitions : 10
UPDATE : randomsplit example :
import spark.implicits._
val t1 = sc.parallelize(0 until 1000).toDF("id")
println("With Random Split ")
val dfarray = t1.randomSplit(Array(1, 1, 1, 1, 1, 1, 1, 1, 1, 1));
println("number of dataframes " + dfarray.length + "element order is not guaranteed ")
dfarray.foreach {
df => df.show
}
Result: the data is split into 10 dataframes, and the order is not guaranteed.
With Random Split
number of dataframes 10element order is not guaranteed
+---+
| id|
+---+
| 2|
| 10|
| 16|
| 30|
| 36|
| 46|
| 51|
| 91|
|100|
|121|
|136|
|138|
|149|
|152|
|159|
|169|
|198|
|199|
|220|
|248|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 26|
| 40|
| 45|
| 54|
| 63|
| 72|
| 76|
|107|
|129|
|137|
|142|
|145|
|153|
|162|
|173|
|179|
|196|
|208|
|214|
|232|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 7|
| 12|
| 31|
| 32|
| 38|
| 42|
| 53|
| 61|
| 68|
| 73|
| 80|
| 89|
| 96|
|115|
|117|
|118|
|131|
|132|
|139|
|146|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 0|
| 24|
| 35|
| 57|
| 58|
| 65|
| 77|
| 78|
| 84|
| 86|
| 90|
| 97|
|156|
|158|
|168|
|174|
|182|
|197|
|218|
|242|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 1|
| 3|
| 17|
| 18|
| 19|
| 33|
| 70|
| 71|
| 74|
| 83|
|102|
|104|
|108|
|109|
|122|
|128|
|143|
|150|
|154|
|157|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 14|
| 15|
| 29|
| 44|
| 64|
| 75|
| 88|
|103|
|110|
|113|
|116|
|120|
|124|
|135|
|155|
|213|
|221|
|238|
|241|
|251|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 5|
| 9|
| 21|
| 22|
| 23|
| 25|
| 27|
| 47|
| 52|
| 55|
| 60|
| 62|
| 69|
| 93|
|111|
|114|
|141|
|144|
|161|
|164|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 13|
| 20|
| 39|
| 41|
| 49|
| 56|
| 67|
| 85|
| 87|
| 92|
|105|
|106|
|126|
|127|
|160|
|165|
|166|
|171|
|175|
|184|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 4|
| 34|
| 50|
| 79|
| 81|
|101|
|119|
|123|
|133|
|147|
|163|
|170|
|180|
|181|
|193|
|202|
|207|
|222|
|226|
|233|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 6|
| 8|
| 11|
| 28|
| 37|
| 43|
| 48|
| 59|
| 66|
| 82|
| 94|
| 95|
| 98|
| 99|
|112|
|125|
|130|
|134|
|140|
|183|
+---+
only showing top 20 rows
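To see why the splits above come out with uneven sizes, here is a rough plain-Python model of a weighted random split. It is only a sketch of the idea (one uniform draw per row, compared against cumulative weight boundaries), not Spark's actual implementation; the function name random_split and the seed are made up for illustration.

```python
import random

def random_split(rows, weights, seed=42):
    """Assign each row to one split by drawing a uniform number per row."""
    total = float(sum(weights))
    # Cumulative upper bounds of each split's slice of [0, 1).
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        # First split whose upper bound exceeds x; the last split catches rounding.
        idx = next((i for i, b in enumerate(bounds) if x < b), len(weights) - 1)
        splits[idx].append(row)
    return splits

splits = random_split(range(1000), [1] * 10)
sizes = [len(s) for s in splits]
print(sizes)  # ten sizes that sum to 1000 but are generally not all 100
```

Every row lands in exactly one split, but because each row is placed independently the sizes only approximate the weights.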

Since I want the data evenly distributed, and want to use the chunks separately or iteratively, randomSplit doesn't work for me: it may produce empty dataframes or an unequal distribution.
Using grouped is one of the most feasible solutions here, if you don't mind calling collect on your dataframe (which brings all rows to the driver).
E.g.: val newdf = df.collect.grouped(10)
That gives an Iterator[Array[org.apache.spark.sql.Row]] (a non-empty iterator) whose chunks hold 10 rows each; pass the desired chunk size, e.g. grouped(100) for 100-record chunks. You can also convert it into a list by adding .toList at the end.
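For anyone unsure what grouped does with its argument, this small plain-Python sketch mimics it on an in-memory list. The helper name grouped is just borrowed from Scala for illustration, and the list of integers stands in for the collected rows.

```python
def grouped(rows, size):
    """Yield consecutive chunks of at most `size` rows, like Scala's grouped."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

rows = list(range(1, 1001))          # stand-in for the collected rows
chunks = list(grouped(rows, 100))
print(len(chunks), chunks[0][0], chunks[0][-1], chunks[-1][0], chunks[-1][-1])
# prints: 10 1 100 901 1000
```

The argument is the chunk size, not the number of chunks, and only the last chunk can be shorter.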
Another possible solution, if we don't want Array chunks but still want partitions with roughly equal record counts: use countApprox, adjusting timeout and confidence as required, and divide the result by the number of records wanted per partition. The quotient can then be used as the number of partitions for repartition or coalesce.
countApprox is used instead of count because it is a less expensive operation, and the difference is noticeable when the data is huge:
val approxCount = df.rdd.countApprox(timeout = 1000L,confidence = 0.95).getFinalValue().high
val numOfPartitions = Math.max(Math.round(approxCount / 100), 1).toInt
df.repartition(numOfPartitions)
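The partition-count arithmetic above can be checked without Spark. This is a minimal sketch that assumes round-half-up behaviour like Java's Math.round; the function name num_partitions and the sample counts are made up.

```python
import math

def num_partitions(approx_count, records_per_partition=100):
    """Mirror max(round(count / size), 1) using round-half-up like Math.round."""
    rounded = math.floor(approx_count / records_per_partition + 0.5)
    return max(rounded, 1)

for count in (40, 1000, 1049, 1050):
    print(count, num_partitions(count))
# prints:
# 40 1
# 1000 10
# 1049 10
# 1050 11
```

With 1049 rows this gives 10 partitions, so each holds about 105 rows: the target of 100 records per partition is only approximate.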

Spark adding indexes to dataframe and append other dataset that doesn't have index

I have a dataset that has column userid and index values.
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
+---------+--------+
I want to append a new data frame to it and add an index to the newly added rows.
The userid is unique, and the existing data frame will not contain any of the user ids in the second data frame.
+----------+
| userid |
+----------+
| user11|
| user21|
| user41|
| user51|
| user64|
+----------+
The expected output with newly added userid and index
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
| user11| 11|
| user21| 12|
| user41| 13|
| user51| 14|
| user64| 15|
+---------+--------+
Is it possible to achieve this by passing a max index value, so that the index for the second Dataframe starts from the given value?
If the userid has some ordering, then you can use the row_number function. Even if it does not, you can add an id using monotonically_increasing_id(). For now I assume that userid can be ordered. Then you can do this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df_merge = df1.select('userid').union(df2.select('userid'))
w=Window.orderBy('userid')
df_result = df_merge.withColumn('indexid',F.row_number().over(w))
EDIT: After discussions in the comments.
#%% Test data and imports
import pyspark.sql.functions as F
from pyspark.sql import Window
df = sqlContext.createDataFrame([('a',100),('ab',50),('ba',300),('ced',60),('d',500)],schema=['userid','index'])
df1 = sqlContext.createDataFrame([('fgh',100),('ff',50),('fe',300),('er',60),('fi',500)],schema=['userid','dummy'])
#%%
#%% Merge the two dataframes, with a null column as the index
df1=df1.withColumn('index', F.lit(None))
df_merge = df.select(df.columns).union(df1.select(df.columns))
#%% Define a window that puts the newly added rows last and orders them by userid
#%% The user ids, even though random strings, can be ordered
# If possible, add a partition column here; otherwise all your data will end up in one partition (consider salting)
w = Window.orderBy(F.col('index').asc_nulls_last(), F.col('userid'))
#%% For the newly added rows, define index as the maximum existing index plus the running row count over the window
#%% (the running count also includes the existing rows, hence 506 rather than 501 in the output below)
df_final = df_merge.withColumn(
    "index_new",
    F.when(~F.col('index').isNull(), F.col('index'))
     .otherwise(F.last(F.col('index'), ignorenulls=True).over(w) + F.sum(F.lit(1)).over(w)))
#%% If the number of rows in the main dataframe is huge, then add an offset in the above line
df_final.show()
+------+-----+---------+
|userid|index|index_new|
+------+-----+---------+
| ab| 50| 50|
| ced| 60| 60|
| a| 100| 100|
| ba| 300| 300|
| d| 500| 500|
| er| null| 506|
| fe| null| 507|
| ff| null| 508|
| fgh| null| 509|
| fi| null| 510|
+------+-----+---------+
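If the new indexes should continue directly after the current maximum (11, 12, ... as in the question), the logic boils down to "max existing index + row number of each new user". Here is a plain-Python sketch of that arithmetic, using the question's data; the variable names are made up and this is not Spark code.

```python
existing = {"user%d" % i: i for i in range(1, 11)}   # userid -> index
new_users = ["user41", "user11", "user64", "user21", "user51"]

start = max(existing.values())  # 10
# Order the new users deterministically, then number them from start + 1.
appended = {uid: start + n for n, uid in enumerate(sorted(new_users), start=1)}
result = {**existing, **appended}
print(appended)
# {'user11': 11, 'user21': 12, 'user41': 13, 'user51': 14, 'user64': 15}
```

In Spark the same idea would be a row_number over only the new rows, plus the max index as a literal.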

Join 2 DFs with different dimensions - Scala

Hi, I have 2 different DataFrames:
scala> d1.show()
+--------+-------+
|   fecha|eventos|
+--------+-------+
|20180404|      3|
|20180405|      7|
|20180406|     10|
|20180409|      4|
....
scala> d1.count()
res3: Long = 60
scala> d2.show()
+--------+----------+
|   fecha|TotalEvent|
+--------+----------+
|       0|     23534|
|20180322|        10|
|20180326|        50|
|20180402|         6|
|20180403|       118|
|20180404|      1110|
...
scala> d2.count()
res7: Long = 74
But I would like to join them by fecha without losing data, and then create a new column with a math operation: (TotalEvent - eventos)*100/TotalEvent
Something like this:
+---------+-------+----------+--------+
|fecha |eventos|TotalEvent| KPI |
+---------+-------+----------+--------+
| 0| | 23534 | 100.00|
| 20180322| | 10 | 100.00|
| 20180326| | 50 | 100.00|
| 20180402| | 6 | 100.00|
| 20180403| | 118 | 100.00|
| 20180404| 3 | 1110 | 99.73|
| 20180405| 7 | 1204 | 99.42|
| 20180406| 10 | 1526 | 99.34|
| 20180407| | 14 | 100.00|
| 20180409| 4 | 1230 | 99.67|
| 20180410| 11 | 1456 | 99.24|
| 20180411| 6 | 1572 | 99.62|
| 20180412| 5 | 1450 | 99.66|
| 20180413| 7 | 1214 | 99.42|
.....
The problem is that I can't find a way to do it.
When I use:
scala> d1.join(d2,d2("fecha").contains(d1("fecha")), "left").show()
I lose the data that isn't in both tables.
+--------+-------+--------+----------+
| fecha|eventos| fecha|TotalEvent|
+--------+-------+--------+----------+
|20180404| 3|20180404| 1110|
|20180405| 7|20180405| 1204|
|20180406| 10|20180406| 1526|
|20180409| 4|20180409| 1230|
|20180410| 11|20180410| 1456|
....
Additionally, how can I add a new column with the math operation?
Thank you
I would recommend left-joining df2 with df1 and calculating KPI based on whether eventos is null or not in the joined dataset (using when/otherwise):
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and $"..."; assumes a SparkSession named spark (already imported in spark-shell)

val df1 = Seq(
  ("20180404", 3),
  ("20180405", 7),
  ("20180406", 10),
  ("20180409", 4)
).toDF("fecha", "eventos")

val df2 = Seq(
  ("0", 23534),
  ("20180322", 10),
  ("20180326", 50),
  ("20180402", 6),
  ("20180403", 118),
  ("20180404", 1110),
  ("20180405", 100),
  ("20180406", 100)
).toDF("fecha", "TotalEvent")

// Keep every row of df2; eventos is null where df1 has no matching fecha
df2.
  join(df1, Seq("fecha"), "left_outer").
  withColumn("KPI",
    round(
      when($"eventos".isNull, 100.0).
        otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"),
      2
    )
  ).show
// +--------+----------+-------+-----+
// | fecha|TotalEvent|eventos| KPI|
// +--------+----------+-------+-----+
// | 0| 23534| null|100.0|
// |20180322| 10| null|100.0|
// |20180326| 50| null|100.0|
// |20180402| 6| null|100.0|
// |20180403| 118| null|100.0|
// |20180404| 1110| 3|99.73|
// |20180405| 100| 7| 93.0|
// |20180406| 100| 10| 90.0|
// +--------+----------+-------+-----+
Note that if the more precise raw KPI is wanted instead, just remove the wrapping round( , 2).
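To sanity-check the KPI numbers by hand, the same left join and when/otherwise logic can be written in a few lines of plain Python. This is only a sketch with dictionaries standing in for the two DataFrames (a subset of the sample rows); the helper name kpi is made up.

```python
d1 = {"20180404": 3, "20180405": 7, "20180406": 10, "20180409": 4}    # fecha -> eventos
d2 = {"0": 23534, "20180322": 10, "20180404": 1110,
      "20180405": 100, "20180406": 100}                               # fecha -> TotalEvent

def kpi(total_event, eventos):
    """100.0 when there is no matching eventos, else the rounded percentage."""
    if eventos is None:
        return 100.0
    return round((total_event - eventos) * 100.0 / total_event, 2)

# Left join from d2: every fecha in d2 is kept, eventos is None when d1 has no match.
joined = [(fecha, total, d1.get(fecha), kpi(total, d1.get(fecha)))
          for fecha, total in d2.items()]
for row in joined:
    print(row)
```

As in the Spark version, 20180409 exists only in d1 and is therefore dropped by the left join from d2; an outer join would be needed to keep it.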
I would do this in several steps. First join, then add the calculated column, then fill in the nulls:
val df2a = df2.withColumnRenamed("fecha", "fecha2") // to avoid ambiguous column names after the join
val df3 = df1.join(df2a, df1("fecha") === df2a("fecha2"), "outer")
val kpi = df3.withColumn("KPI", ($"TotalEvent" - $"eventos") / $"TotalEvent" * 100).na.fill(100, Seq("KPI"))
kpi.show()
+--------+-------+--------+----------+-----------------+
| fecha|eventos| fecha2|TotalEvent| KPI|
+--------+-------+--------+----------+-----------------+
| null| null|20180402| 6| 100.0|
| null| null| 0| 23534| 100.0|
| null| null|20180322| 10| 100.0|
|20180404| 3|20180404| 1110|99.72972972972973|
|20180406| 10| null| null| 100.0|
| null| null|20180403| 118| 100.0|
| null| null|20180326| 50| 100.0|
|20180409| 4| null| null| 100.0|
|20180405| 7| null| null| 100.0|
+--------+-------+--------+----------+-----------------+
I solved the problem by combining both suggestions received:
val dfKPI = d1.join(d2, Seq("cliente", "fecha"), "outer")
  .orderBy("fecha")
  .withColumn("KPI", round(when($"eventos".isNull, 100.0).otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"), 2))

How to generate sequence on a file with millions records (daily incremental load) in Spark2

I have a business scenario where I need to generate a surrogate key on a daily incremental table or file in Spark 2.0 with Scala 2.11.8. I know about zipWithIndex, row_number and monotonically_increasing_id(), but none of them works for a daily incremental load, since for today's load my sequence should start at 1 + yesterday's last sequence value.
Accumulators also won't work, as they are write-only.
Example scenario: as of yesterday's load the last customer_sk is 1001; in today's load I want customer_sk to start from 1002 and continue to the end of the file.
Note: I will have millions of rows, and the program will be running on multiple nodes in parallel.
Thanks in advance
1) Get the max customer_sk from the table.
2) Then, when using row_number, add this max customer_sk so that your sequence continues from it.
If using an RDD instead, add the previous max to (zipWithIndex + 1).
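Since the question mentions running on multiple nodes, here is a plain-Python sketch of how consecutive keys can still be assigned across partitions: count the rows per partition first, turn the counts into starting offsets, then number rows locally. This models the general idea behind a zipWithIndex-style numbering and is not Spark code; assign_keys and the sample partitions are made up.

```python
from itertools import accumulate

def assign_keys(partitions, last_key):
    """Give every row a consecutive key that continues after last_key."""
    sizes = [len(p) for p in partitions]
    # Offset of each partition = number of rows in all earlier partitions.
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        [(row, last_key + offset + i + 1) for i, row in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]

partitions = [["a", "b", "c"], ["d"], ["e", "f"]]
keyed = assign_keys(partitions, last_key=1001)
print(keyed)
# [[('a', 1002), ('b', 1003), ('c', 1004)], [('d', 1005)], [('e', 1006), ('f', 1007)]]
```

Once the offsets are known, each partition can number its rows independently, so the work still parallelises.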
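For all those who are still looking for an answer with sample code. Note that monotonically_increasing_id() guarantees unique, increasing ids but not consecutive ones across partitions; the output below is consecutive only because this small file fits in a single partition.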
hdfs dfs -cat /user/shahabhi/test_file_2.csv
abhishek,shah,123,pune,2018-12-31,2018-11-30
abhishek,shah,123,pune,2018-12-31,2018-11-30
ravi,sharma,464,mumbai,20181231,20181130
Mitesh,shah,987,satara,2018-12-31,2018-11-30
shalabh,nagar,981,satara,2018-12-31,2018-11-30
Gaurav,mehta,235,ujjain,2018/12/31,2018/11/30
Gaurav,mehta,235,ujjain,2018-12-31,2018-11-30
vikas,khanna,123,ujjain,2018-12-31,2018-11-30
vinayak,kale,789,pune,2018-12-31,2018-11-30
Spark code--
import org.apache.spark.sql.functions.monotonically_increasing_id
val df =spark.read.csv("/user/shahabhi/test_file_2.csv").toDF("name","lname","d_code","city","esd","eed")
df.show()
+--------+------+------+------+----------+----------+
| name| lname|d_code| city| esd| eed|
+--------+------+------+------+----------+----------+
|abhishek| shah| 123| pune|2018-12-31|2018-11-30|
|abhishek| shah| 123| pune|2018-12-31|2018-11-30|
| ravi|sharma| 464|mumbai| 20181231| 20181130|
| Mitesh| shah| 987|satara|2018-12-31|2018-11-30|
| shalabh| nagar| 981|satara|2018-12-31|2018-11-30|
| Gaurav| mehta| 235|ujjain|2018/12/31|2018/11/30|
| Gaurav| mehta| 235|ujjain|2018-12-31|2018-11-30|
| vikas|khanna| 123|ujjain|2018-12-31|2018-11-30|
| vinayak| kale| 789| pune|2018-12-31|2018-11-30|
+--------+------+------+------+----------+----------+
val df_2=df.withColumn("surrogate_key", monotonically_increasing_id())
df_2.show()
+--------+------+------+------+----------+----------+-------------+
| name| lname|d_code| city| esd| eed|surrogate_key|
+--------+------+------+------+----------+----------+-------------+
|abhishek| shah| 123| pune|2018-12-31|2018-11-30| 0|
|abhishek| shah| 123| pune|2018-12-31|2018-11-30| 1|
| ravi|sharma| 464|mumbai| 20181231| 20181130| 2|
| Mitesh| shah| 987|satara|2018-12-31|2018-11-30| 3|
| shalabh| nagar| 981|satara|2018-12-31|2018-11-30| 4|
| Gaurav| mehta| 235|ujjain|2018/12/31|2018/11/30| 5|
| Gaurav| mehta| 235|ujjain|2018-12-31|2018-11-30| 6|
| vikas|khanna| 123|ujjain|2018-12-31|2018-11-30| 7|
| vinayak| kale| 789| pune|2018-12-31|2018-11-30| 8|
+--------+------+------+------+----------+----------+-------------+
val df_3=df.withColumn("surrogate_key", monotonically_increasing_id()+1000)
df_3.show()
+--------+------+------+------+----------+----------+-------------+
| name| lname|d_code| city| esd| eed|surrogate_key|
+--------+------+------+------+----------+----------+-------------+
|abhishek| shah| 123| pune|2018-12-31|2018-11-30| 1000|
|abhishek| shah| 123| pune|2018-12-31|2018-11-30| 1001|
| ravi|sharma| 464|mumbai| 20181231| 20181130| 1002|
| Mitesh| shah| 987|satara|2018-12-31|2018-11-30| 1003|
| shalabh| nagar| 981|satara|2018-12-31|2018-11-30| 1004|
| Gaurav| mehta| 235|ujjain|2018/12/31|2018/11/30| 1005|
| Gaurav| mehta| 235|ujjain|2018-12-31|2018-11-30| 1006|
| vikas|khanna| 123|ujjain|2018-12-31|2018-11-30| 1007|
| vinayak| kale| 789| pune|2018-12-31|2018-11-30| 1008|
+--------+------+------+------+----------+----------+-------------+
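A caveat worth illustrating: per the Spark documentation, the current implementation of monotonically_increasing_id() puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The plain-Python sketch below just reproduces that bit layout (the helper name monotonic_id is made up) to show what happens once there is more than one partition.

```python
def monotonic_id(partition_id, row_in_partition):
    """Partition ID in the upper bits, record number in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition

print([monotonic_id(0, i) for i in range(3)])  # [0, 1, 2]
print([monotonic_id(1, i) for i in range(3)])  # [8589934592, 8589934593, 8589934594]
```

So adding a constant such as 1000 continues the sequence only while all rows sit in partition 0; for a multi-partition daily load, row_number or zipWithIndex plus the previous max is the safer route.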

Group rows that match sub string in a column using scala

I have a fol df:
Zip | Name | id |
abc | xyz | 1 |
def | wxz | 2 |
abc | wex | 3 |
bcl | rea | 4 |
abc | txc | 5 |
def | rfx | 6 |
abc | abc | 7 |
I need to group all the names that contain 'x' based on same Zip using scala
Desired Output:
Zip | Count |
abc | 3 |
def | 2 |
Any help is highly appreciated
As @Shaido mentioned in the comment above, all you need is filter, groupBy and an aggregation:
import org.apache.spark.sql.functions._
fol.filter(col("Name").contains("x")) // keep the rows that have x in the Name column
  .groupBy("Zip") // group by the Zip column
  .agg(count("Zip").as("Count")) // count the rows in each group
  .show(false)
and you should have the desired output
+---+-----+
|Zip|Count|
+---+-----+
|abc|3 |
|def|2 |
+---+-----+
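The same filter, group and count steps can be traced on the sample data in plain Python, which makes it easy to confirm the expected counts. This is just an illustration with a list of tuples standing in for the DataFrame.

```python
from collections import Counter

rows = [
    ("abc", "xyz", 1), ("def", "wxz", 2), ("abc", "wex", 3), ("bcl", "rea", 4),
    ("abc", "txc", 5), ("def", "rfx", 6), ("abc", "abc", 7),
]

# Keep rows whose Name contains "x", then count them per Zip.
counts = Counter(zip_code for zip_code, name, _ in rows if "x" in name)
print(dict(counts))  # {'abc': 3, 'def': 2}
```

Zip bcl disappears entirely because its only name (rea) has no x, which matches the desired output.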
You want to groupBy the data frame below.
+---+----+---+
|zip|name| id|
+---+----+---+
|abc| xyz| 1|
|def| wxz| 2|
|abc| wex| 3|
|bcl| rea| 4|
|abc| txc| 5|
|def| rfx| 6|
|abc| abc| 7|
+---+----+---+
Then you can simply use the groupBy function, passing the column name, followed by count. Note that this counts all rows per zip, without filtering on names that contain 'x':
val groupedDf: DataFrame = df.groupBy("zip").count()
groupedDf.show()
// +---+-----+
// |zip|count|
// +---+-----+
// |bcl| 1|
// |abc| 4|
// |def| 2|
// +---+-----+

How to create feature vector in Scala? [duplicate]

This question already has an answer here:
How to transform the dataframe into label feature vector?
(1 answer)
Closed 5 years ago.
I am reading a csv as a data frame in scala as below:
+-----------+------------+
|x |y |
+-----------+------------+
| 0| 0|
| 0| 33|
| 0| 58|
| 0| 96|
| 0| 1|
| 1| 21|
| 0| 10|
| 0| 65|
| 1| 7|
| 1| 28|
+-----------+------------+
Then I create the label and feature vector as below:
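import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("y"))
  .setOutputCol("features")
val output = assembler.transform(daf).select($"x".as("label"), $"features")
output.show() // show() prints the table itself and returns Unit, so wrapping it in println is unnecessary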
The output is as:
+-----------+------------+
|label | features |
+-----------+------------+
| 0.0| 0.0|
| 0.0| 33.0|
| 0.0| 58.0|
| 0.0| 96.0|
| 0.0| 1.0|
| 0.0| 21.0|
| 0.0| 10.0|
| 1.0| 65.0|
| 1.0| 7.0|
| 1.0| 28.0|
+-----------+------------+
But instead of this I want the output to be like in the below format
+-----+------------------+
|label| features |
+-----+------------------+
| 0.0|(1,[1],[0]) |
| 0.0|(1,[1],[33]) |
| 0.0|(1,[1],[58]) |
| 0.0|(1,[1],[96]) |
| 0.0|(1,[1],[1]) |
| 1.0|(1,[1],[21]) |
| 0.0|(1,[1],[10]) |
| 0.0|(1,[1],[65]) |
| 1.0|(1,[1],[7]) |
| 1.0|(1,[1],[28]) |
+-----+------------------+
I tried
val assembler = new VectorAssembler()
.setInputCols(Array("y").map{x => "(1,[1],"+x+")"})
.setOutputCol("features")
But it did not work.
Any help is appreciated.
This is not how you use VectorAssembler.
You need to give it the names of your input columns, i.e.
new VectorAssembler().setInputCols(Array("features"))
You'll eventually face another issue with the data you have shared: a single column (your features column) doesn't make much of a vector.
It should be used with 2 or more columns, i.e.:
new VectorAssembler().setInputCols(Array("f1","f2","f3"))
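On the format asked for in the question: the (size,[indices],[values]) notation is how a sparse vector is printed, with zero-based indices, so it is a display form of a vector and not something to build as a string in setInputCols. The plain-Python sketch below only reproduces that notation for a dense list, to show what the three parts mean; to_sparse_repr is a made-up helper, not a Spark function.

```python
def to_sparse_repr(values):
    """Format a dense list as (size,[indices],[values]), keeping only non-zeros."""
    nonzero = [(i, v) for i, v in enumerate(values) if v != 0]
    indices = ",".join(str(i) for i, _ in nonzero)
    vals = ",".join(str(v) for _, v in nonzero)
    return "(%d,[%s],[%s])" % (len(values), indices, vals)

print(to_sparse_repr([33.0]))            # (1,[0],[33.0])
print(to_sparse_repr([0.0, 33.0, 0.0]))  # (3,[1],[33.0])
```

Note that a one-element vector can only have index 0, so a value like (1,[1],[33]) from the question would not be a valid sparse vector.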