Spark-Monotonically increasing id not working as expected in dataframe? - scala

I have a dataframe df in Spark which looks something like this:
scala> df.show()
+--------+--------+
|columna1|columna2|
+--------+--------+
| 0.1| 0.4|
| 0.2| 0.5|
| 0.1| 0.3|
| 0.3| 0.6|
| 0.2| 0.7|
| 0.2| 0.8|
| 0.1| 0.7|
| 0.5| 0.5|
| 0.6| 0.98|
| 1.2| 1.1|
| 1.2| 1.2|
| 0.4| 0.7|
+--------+--------+
I tried to include an id column with the following code
val df_id = df.withColumn("id",monotonicallyIncreasingId)
but the id column is not what I expect:
scala> df_id.show()
+--------+--------+----------+
|columna1|columna2| id|
+--------+--------+----------+
| 0.1| 0.4| 0|
| 0.2| 0.5| 1|
| 0.1| 0.3| 2|
| 0.3| 0.6| 3|
| 0.2| 0.7| 4|
| 0.2| 0.8| 5|
| 0.1| 0.7|8589934592|
| 0.5| 0.5|8589934593|
| 0.6| 0.98|8589934594|
| 1.2| 1.1|8589934595|
| 1.2| 1.2|8589934596|
| 0.4| 0.7|8589934597|
+--------+--------+----------+
As you can see, it goes fine from 0 to 5, but then the next id is 8589934592 instead of 6, and so on.
So what is wrong here? Why is the id column not indexed consecutively?

It works as expected. This function is not intended to generate consecutive values. Instead, it encodes the partition ID and the record index within each partition:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs:
0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
If you want consecutive numbers, use RDD.zipWithIndex.
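For example, a minimal sketch (assuming spark is the active SparkSession and df is the frame from the question) that attaches consecutive ids with zipWithIndex:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// zipWithIndex numbers rows 0, 1, 2, ... consecutively across all partitions
val indexedRows = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val dfWithId = spark.createDataFrame(
  indexedRows,
  StructType(df.schema.fields :+ StructField("id", LongType, nullable = false))
)
dfWithId.show()
Note that zipWithIndex triggers an extra Spark job to count records per partition when there is more than one partition, so it is somewhat more expensive than monotonically_increasing_id.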

Related

How do I find the rows in a Spark DataFrame that appear only once?

I want to eliminate rows whose value in the ‘county’ column appears only once, because they are not useful for my statistics.
I used groupBy + count to find them:
fault_data.groupBy("county").count().show()
The data looks like this:
+----------+-----+
| county|count|
+----------+-----+
| A| 117|
| B| 31|
| C| 1|
| D| 272|
| E| 1|
| F| 1|
| G| 280|
| H| 1|
| I| 1|
| J| 1|
| K| 112|
| L| 63|
| M| 18|
| N| 71|
| O| 1|
| P| 1|
| Q| 82|
| R| 2|
| S| 31|
| T| 2|
+----------+-----+
Next, I want to eliminate the data whose count is 1.
But when I wrote it like this, it failed:
fault_data.filter("count(county)=1").show()
The result is:
Aggregate/Window/Generate expressions are not valid in where clause of the query.
Expression in where clause: [(count(county) = CAST(1 AS BIGINT))]
Invalid expressions: [count(county)];
Filter (count(county#7) = cast(1 as bigint))
+- Relation [fault_id#0,fault_type#1,acs_way#2,fault_1#3,fault_2#4,province#5,city#6,county#7,town#8,detail#9,num#10,insert_time#11] JDBCRelation(fault_data) [numPartitions=1]
So I want to know the right way, thank you.
fault_data.groupBy("county").count().where(col("count")===1).show()
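If the end goal is to drop those rows from fault_data itself (not just from the aggregated counts), one possible sketch, reusing the column names from the question, is to join the surviving counties back:
import org.apache.spark.sql.functions.col

// counties that occur more than once
val keep = fault_data.groupBy("county").count().where(col("count") > 1).select("county")
// keep only the original rows whose county survived the count filter
val cleaned = fault_data.join(keep, Seq("county"), "inner")
cleaned.show()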

How to save one of the values of a column eliminated by groupBy in Spark

Here is a sample of the dataFrame I have:
+----+------+-------+-----+
| id | pool | start | end |
+----+------+-------+-----+
| 1| cat| 100| 105|
| 2| cat| 104| 110|
| 3| cat| 130| 135|
| 4| dog| 100| 110|
| 5| dog| 101| 103|
| 6| rat| 150| 151|
| 7| rat| 180| 187|
| 8| rat| 183| 184|
| 9| cat| 143| 150|
| 10| dog| 107| 120|
| ...| ...| ...| ...|
+----+------+-------+-----+
And I want to group by the pool while preserving the min and max values seen in the start and end for each pool. Great, I run
sourceDataframe.groupBy("pool").agg(min("start"), max("end"))
which gives me:
+------+-----------+---------+
| pool | min(start)| max(end)|
+------+-----------+---------+
| cat| 100| 150|
| dog| 100| 120|
| rat| 150| 187|
| ...| ...| ...|
+------+-----------+---------+
But I want one more thing, which is any id belonging to each pool. If possible, preferably the one with the maximum end (or any other arbitrary requirement I suppose). And if possible, without doing another join after the fact.
Example of success:
// If two have the same maximum end, it should just pick an arbitrary one
+------+-----------+---------+-----------+
| pool | min(start)| max(end)| idOfMaxEnd|
+------+-----------+---------+-----------+
| cat| 100| 150| 9|
| dog| 100| 120| 10|
| rat| 150| 187| 7|
| ...| ...| ...| ...|
+------+-----------+---------+-----------+
Thanks for your time and effort!
Edit: Okay so doing the following almost gives me what I want:
sourceDataframe.groupBy("pool").agg(min("start"), max("end"), first("id"))
It saves the first ID of each group that it comes across. I would like to save the ID with the maximum end per group, if possible. I know I could solve this by sorting the DataFrame but that would require too much time.
You can use window functions with two different windows, although I am not sure it will be more efficient than using a join (which you mentioned you want to avoid):
import org.apache.spark.sql.{functions, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col
import spark.implicits._ // for .toDF, assuming an active SparkSession named spark

val df: DataFrame = Seq(
  (1, "cat", 100, 1335),
  (3, "cat", 130, 135),
  (7, "rat", 180, 187),
  (8, "rat", 183, 184)
).toDF("id", "pool", "start", "end")

val windowSpecEnd = Window.partitionBy("pool").orderBy(col("end").desc)
val windowSpecStart = Window.partitionBy("pool").orderBy(col("start"))

val aggDf = df
  .withColumn("min_start", functions.first("start").over(windowSpecStart))
  .withColumn("max_end", functions.first("end").over(windowSpecEnd))
  .withColumn("idOfMaxEnd", functions.first("id").over(windowSpecEnd))
  .select("pool", "min_start", "max_end", "idOfMaxEnd")
  .distinct

aggDf.show
// output:
+----+---------+-------+----------+
|pool|min_start|max_end|idOfMaxEnd|
+----+---------+-------+----------+
| rat| 180| 187| 7|
| cat| 100| 1335| 1|
+----+---------+-------+----------+
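If you would rather avoid both the extra windows and a join, another possible sketch keeps everything in a single groupBy by aggregating over a struct; Spark compares structs field by field, so max(struct(end, id)) carries along the id of the row with the largest end:
import org.apache.spark.sql.functions.{col, max, min, struct}

val aggDf2 = sourceDataframe
  .groupBy("pool")
  .agg(
    min("start").as("min_start"),
    max("end").as("max_end"),
    // the struct is ordered by "end" first, so its max picks the row with the largest end
    max(struct(col("end"), col("id"))).getField("id").as("idOfMaxEnd")
  )
aggDf2.show()
If two rows share the maximum end, the one with the larger id wins, which satisfies the "pick an arbitrary one" requirement from the question.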

Spark: adding indexes to a dataframe and appending another dataset that doesn't have an index

I have a dataset that has column userid and index values.
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
+---------+--------+
I want to append a new data frame to it and add index values to the newly added rows.
The userid values are unique, and the existing data frame will not contain the user ids from DataFrame 2.
+----------+
| userid |
+----------+
| user11|
| user21|
| user41|
| user51|
| user64|
+----------+
The expected output, with the newly added userid and index values:
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
| user11| 11|
| user21| 12|
| user41| 13|
| user51| 14|
| user64| 15|
+---------+--------+
Is it possible to achieve this by reading the max index value and starting the index for the second DataFrame from that value?
If the userid has some ordering, then you can use the row_number function. Even if it does not, you can add an id using monotonically_increasing_id(). For now I assume that userid can be ordered. Then you can do this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df_merge = df1.select('userid').union(df2.select('userid'))
w=Window.orderBy('userid')
df_result = df_merge.withColumn('indexid',F.row_number().over(w))
EDIT: After discussions in the comments.
#%% Test data and imports
import pyspark.sql.functions as F
from pyspark.sql import Window
df = sqlContext.createDataFrame([('a',100),('ab',50),('ba',300),('ced',60),('d',500)],schema=['userid','index'])
df1 = sqlContext.createDataFrame([('fgh',100),('ff',50),('fe',300),('er',60),('fi',500)],schema=['userid','dummy'])
#%% Merge the two dataframes, with a null column as the index
df1 = df1.withColumn('index', F.lit(None))
df_merge = df.select(df.columns).union(df1.select(df.columns))

#%% Define a window that puts the newly added rows last and orders them by userid.
#%% The user ids, even though they are random strings, can be ordered.
#%% If possible add a partition column here, otherwise all data ends up in one partition; consider salting.
w = Window.orderBy(F.col('index').asc_nulls_last(), F.col('userid'))

#%% For the newly added rows, define the index as the last non-null index plus the running row number
df_final = df_merge.withColumn(
    "index_new",
    F.when(~F.col('index').isNull(), F.col('index'))
     .otherwise(F.last(F.col('index'), ignorenulls=True).over(w) + F.sum(F.lit(1)).over(w))
)
#%% If the number of rows in the main dataframe is huge, add an offset in the line above
df_final.show()
+------+-----+---------+
|userid|index|index_new|
+------+-----+---------+
| ab| 50| 50|
| ced| 60| 60|
| a| 100| 100|
| ba| 300| 300|
| d| 500| 500|
| er| null| 506|
| fe| null| 507|
| ff| null| 508|
| fgh| null| 509|
| fi| null| 510|
+------+-----+---------+
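If you would rather answer the original ask literally — read the current maximum index and continue numbering from it — a possible Scala sketch, assuming the existing frame is called existingDf(userid, index), the new userids are in newDf(userid), and spark is the active SparkSession:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, max}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// highest index already used in the existing frame
val maxIndex = existingDf.agg(max(col("index").cast("long"))).head.getLong(0)

// number the new userids consecutively, starting right after maxIndex
val newRows = newDf.rdd.zipWithIndex.map { case (row, idx) =>
  Row(row.getString(0), maxIndex + idx + 1)
}
val newWithIndex = spark.createDataFrame(
  newRows,
  StructType(Seq(StructField("userid", StringType), StructField("index", LongType)))
)

val combined = existingDf.select(col("userid"), col("index").cast("long")).union(newWithIndex)
combined.show()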

Spark - Divide a dataframe into n number of records

I have a dataframe with 2 or more columns and 1000 records. I want to split the data into chunks of 100 records each, randomly and without any conditions.
So the expected output, in terms of record counts, should be something like this:
[(1,2....100),(101,102,103...200),.....,(900,901...1000)]
Here's the solution that worked for my use case after trying different approaches:
https://stackoverflow.com/a/61276734/12322995
As @Shaido said, randomSplit is a popular approach for splitting a dataframe.
A different way to think about it is repartitionByRange, available since Spark 2.3:
repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T]
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is range partitioned. At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed.
Since: 2.3.0
package examples

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}

object RepartitionByRange extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder().appName(getClass.getName).master("local").getOrCreate()
  val sc = spark.sparkContext
  import spark.implicits._

  val t1 = sc.parallelize(0 until 1000).toDF("id")

  // range-partition into 10 partitions of ~100 consecutive ids each,
  // then collapse each partition into one comma-separated string
  val repartitionedOrders: Dataset[String] = t1.repartitionByRange(10, $"id")
    .mapPartitions(rows => {
      val idsInPartition = rows.map(row => row.getAs[Int]("id")).toSeq.sorted.mkString(",")
      Iterator(idsInPartition)
    })

  repartitionedOrders.show(false)
  println("number of chunks or partitions :" + repartitionedOrders.rdd.getNumPartitions)
}
Result :
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99 |
|100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199|
|200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299|
|300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399|
|400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499|
|500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599|
|600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699|
|700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799|
|800,801,802,803,804,805,806,807,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899|
|900,901,902,903,904,905,906,907,908,909,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951,952,953,954,955,956,957,958,959,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
number of chunks or partitions : 10
UPDATE: randomSplit example:
import spark.implicits._
val t1 = sc.parallelize(0 until 1000).toDF("id")
println("With Random Split ")
val dfarray = t1.randomSplit(Array(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))
println("number of dataframes " + dfarray.length + ", element order is not guaranteed")
dfarray.foreach { df =>
  df.show()
}
Result: the data is split into 10 dataframes, and order is not guaranteed.
With Random Split
number of dataframes 10, element order is not guaranteed
+---+
| id|
+---+
| 2|
| 10|
| 16|
| 30|
| 36|
| 46|
| 51|
| 91|
|100|
|121|
|136|
|138|
|149|
|152|
|159|
|169|
|198|
|199|
|220|
|248|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 26|
| 40|
| 45|
| 54|
| 63|
| 72|
| 76|
|107|
|129|
|137|
|142|
|145|
|153|
|162|
|173|
|179|
|196|
|208|
|214|
|232|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 7|
| 12|
| 31|
| 32|
| 38|
| 42|
| 53|
| 61|
| 68|
| 73|
| 80|
| 89|
| 96|
|115|
|117|
|118|
|131|
|132|
|139|
|146|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 0|
| 24|
| 35|
| 57|
| 58|
| 65|
| 77|
| 78|
| 84|
| 86|
| 90|
| 97|
|156|
|158|
|168|
|174|
|182|
|197|
|218|
|242|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 1|
| 3|
| 17|
| 18|
| 19|
| 33|
| 70|
| 71|
| 74|
| 83|
|102|
|104|
|108|
|109|
|122|
|128|
|143|
|150|
|154|
|157|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 14|
| 15|
| 29|
| 44|
| 64|
| 75|
| 88|
|103|
|110|
|113|
|116|
|120|
|124|
|135|
|155|
|213|
|221|
|238|
|241|
|251|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 5|
| 9|
| 21|
| 22|
| 23|
| 25|
| 27|
| 47|
| 52|
| 55|
| 60|
| 62|
| 69|
| 93|
|111|
|114|
|141|
|144|
|161|
|164|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 13|
| 20|
| 39|
| 41|
| 49|
| 56|
| 67|
| 85|
| 87|
| 92|
|105|
|106|
|126|
|127|
|160|
|165|
|166|
|171|
|175|
|184|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 4|
| 34|
| 50|
| 79|
| 81|
|101|
|119|
|123|
|133|
|147|
|163|
|170|
|180|
|181|
|193|
|202|
|207|
|222|
|226|
|233|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 6|
| 8|
| 11|
| 28|
| 37|
| 43|
| 48|
| 59|
| 66|
| 82|
| 94|
| 95|
| 98|
| 99|
|112|
|125|
|130|
|134|
|140|
|183|
+---+
only showing top 20 rows
Since I want the data to be evenly distributed, and I want to be able to use the chunks separately or iteratively, randomSplit doesn't work here: it may produce empty dataframes or an uneven distribution.
So grouped can be one of the most feasible solutions, if you don't mind calling collect on your dataframe.
Eg: val newdf = df.collect.grouped(10)
That gives an Iterator[Array[org.apache.spark.sql.Row]] (a non-empty iterator). You can also convert it into a list by adding .toList at the end, as in the sketch below.
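For instance, a minimal sketch, assuming 100-record chunks as in the question, that spark is the active SparkSession, and that the collected data comfortably fits in driver memory:
// collect to the driver, then split into chunks of 100 rows each
val chunks = df.collect().grouped(100).toList

// optionally turn each chunk back into its own DataFrame
val chunkDfs = chunks.map { chunk =>
  spark.createDataFrame(spark.sparkContext.parallelize(chunk.toSeq), df.schema)
}
Each element of chunkDfs then holds up to 100 rows, with the last chunk taking the remainder.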
Another possible solution, if we don't want Array chunks of the dataframe but still want to partition the data into roughly equal record counts, is to use countApprox, adjusting the timeout and confidence as required. Then divide that by the number of records we need per partition, and use the result as the number of partitions for repartition or coalesce.
Use countApprox instead of count because it is a less expensive operation; you can feel the difference when the data size is huge.
// approximate row count (cheaper than a full count on large data)
val approxCount = df.rdd.countApprox(timeout = 1000L, confidence = 0.95).getFinalValue().high
// aim for roughly 100 records per partition
val numOfPartitions = Math.max(Math.round(approxCount / 100), 1).toInt
df.repartition(numOfPartitions)
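Once repartitioned, each partition holds roughly 100 records and can be processed independently; a minimal sketch that walks the underlying RDD partition by partition:
df.repartition(numOfPartitions).rdd.foreachPartition { rows =>
  // each `rows` iterator is one ~100-record chunk; process it here
  println(s"records in this chunk: ${rows.size}")
}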

Create new column from existing Dataframe

I have a Dataframe and I am trying to create a new column from existing columns based on the following conditions:
Group the data by the column named event_type.
Filter only those rows where the column source has the value train, and call that subset X.
The values for the new column are X.sum / X.length.
Here is the input Dataframe:
+-----+-------------+----------+--------------+------+
| id| event_type| location|fault_severity|source|
+-----+-------------+----------+--------------+------+
| 6597|event_type 11|location 1| -1| test|
| 8011|event_type 15|location 1| 0| train|
| 2597|event_type 15|location 1| -1| test|
| 5022|event_type 15|location 1| -1| test|
| 5022|event_type 11|location 1| -1| test|
| 6852|event_type 11|location 1| -1| test|
| 6852|event_type 15|location 1| -1| test|
| 5611|event_type 15|location 1| -1| test|
|14838|event_type 15|location 1| -1| test|
|14838|event_type 11|location 1| -1| test|
| 2588|event_type 15|location 1| 0| train|
| 2588|event_type 11|location 1| 0| train|
+-----+-------------+----------+--------------+------+
and I want the following output:
+--------------+------------+-----------+
| | event_type | PercTrain |
+--------------+------------+-----------+
|event_type 11 | 7888 | 0.388945 |
|event_type 35 | 6615 | 0.407105 |
|event_type 34 | 5927 | 0.406783 |
|event_type 15 | 4395 | 0.392264 |
|event_type 20 | 1458 | 0.382030 |
+--------------+------------+-----------+
I have tried this code but it throws an error:
EventSet.withColumn("z" , when($"source" === "train" , sum($"source") / length($"source"))).groupBy("fault_severity").count().show()
Here EventSet is the input dataframe.
The Python code that gives the desired output is:
event_type_unq['PercTrain'] = event_type.pivot_table(values='source',index='event_type',aggfunc=lambda x: sum(x=='train')/float(len(x)))
I guess you want to obtain the percentage of train values. So here is my code:
val df2 = df.select($"event_type", $"source").groupBy($"event_type")
  .pivot("source").agg(count($"source"))
  .withColumn("PercTrain", $"train" / ($"train" + $"test"))
df2.show
and gives the result as follows:
+-------------+----+-----+------------------+
| event_type|test|train| PercTrain|
+-------------+----+-----+------------------+
|event_type 11| 4| 1| 0.2|
|event_type 15| 5| 2|0.2857142857142857|
+-------------+----+-----+------------------+
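A possible alternative sketch that skips the pivot entirely is to average a 0/1 flag per group, which yields the fraction of train rows directly (and also gives the count column from the desired output):
import org.apache.spark.sql.functions.{avg, col, count, when}

val percTrain = EventSet
  .groupBy("event_type")
  .agg(
    count("*").as("count"),
    // the mean of a 0/1 indicator is exactly the share of rows with source == "train"
    avg(when(col("source") === "train", 1).otherwise(0)).as("PercTrain")
  )
percTrain.show()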
Hope this is helpful.