remove first character of a spark string column - scala

As the title says, I would like to know how to remove the first character of a Spark string column, for the two following cases:
val myDF1 = Seq(("£14326"),("£1258634"),("£15626"),("£163262")).toDF("A")
val myDF2 = Seq(("a14326"),("c1258634"),("t15626"),("f163262")).toDF("A")
myDF1.show
myDF2.show
+--------+
|       A|
+--------+
|  £14326|
|£1258634|
|  £15626|
| £163262|
+--------+
+--------+
|       A|
+--------+
|  a14326|
|c1258634|
|  t15626|
| f163262|
+--------+
I would like to obtain:
+--------+-------+
|       A|      B|
+--------+-------+
|  £14326|  14326|
|£1258634|1258634|
|  £15626|  15626|
| £163262| 163262|
+--------+-------+
+--------+-------+
|       A|      B|
+--------+-------+
|  a14326|  14326|
|c1258634|1258634|
|  t15626|  15626|
| f163262| 163262|
+--------+-------+
Do you have any idea?

You can do something like this.
myDF1.show
+------+
|     A|
+------+
|£14326|
|£12586|
|£15626|
|£16326|
+------+
myDF1.withColumn("B", expr("substring(A, 2, length(A))")).show
+------+-----+
|     A|    B|
+------+-----+
|£14326|14326|
|£12586|12586|
|£15626|15626|
|£16326|16326|
+------+-----+
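If you prefer the typed Column API over a SQL expression string, the same result can be obtained with Column.substr or regexp_replace. A minimal sketch, assuming the myDF1/myDF2 from the question and that spark.implicits._ is in scope (as it is wherever toDF works):
import org.apache.spark.sql.functions._

// Column.substr is 1-based, so starting at position 2 drops the first character
val out1 = myDF1.withColumn("B", $"A".substr(lit(2), length($"A")))

// or strip exactly one leading character with a regex
val out2 = myDF2.withColumn("B", regexp_replace($"A", "^.", ""))

out1.show()
out2.show()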

Related

Spark - Divide a dataframe into n number of records

I have a dataframe with 2 or more columns and 1000 records. I want to split the data into chunks of 100 records randomly, without any conditions.
So the expected output, in terms of record counts, should be something like this:
[(1,2....100),(101,102,103...200),.....,(900,901...1000)]
Here's the solution that worked for my use case after trying different approaches:
https://stackoverflow.com/a/61276734/12322995
As @Shaido said, randomSplit is a popular approach for splitting a dataframe.
I thought about it differently and used repartitionByRange, available since Spark 2.3:
repartitionByRange
public Dataset<T> repartitionByRange(int numPartitions, scala.collection.Seq<Column> partitionExprs)
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is range partitioned. At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed.
Since: 2.3.0
package examples

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}

object RepartitionByRange extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder().appName(getClass.getName).master("local").getOrCreate()
  val sc = spark.sparkContext
  import spark.implicits._

  val t1 = sc.parallelize(0 until 1000).toDF("id")

  // Range-partition into 10 partitions by id, then collapse each partition into one CSV string
  val repartitionedOrders: Dataset[String] = t1.repartitionByRange(10, $"id")
    .mapPartitions(rows => {
      val idsInPartition = rows.map(row => row.getAs[Int]("id")).toSeq.sorted.mkString(",")
      Iterator(idsInPartition)
    })

  repartitionedOrders.show(false)
  println("number of chunks or partitions : " + repartitionedOrders.rdd.getNumPartitions)
}
Result :
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99 |
|100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199|
|200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299|
|300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399|
|400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499|
|500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599|
|600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699|
|700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799|
|800,801,802,803,804,805,806,807,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899|
|900,901,902,903,904,905,906,907,908,909,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951,952,953,954,955,956,957,958,959,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
number of chunks or partitions : 10
UPDATE: randomSplit example:
import spark.implicits._

val t1 = sc.parallelize(0 until 1000).toDF("id")
println("With Random Split ")

// 10 equal weights => 10 roughly equal random splits (exact sizes are not guaranteed)
val dfarray = t1.randomSplit(Array(1, 1, 1, 1, 1, 1, 1, 1, 1, 1))
println("number of dataframes " + dfarray.length + ", element order is not guaranteed")
dfarray.foreach { df => df.show }
Result: the data is split into 10 dataframes, and order is not guaranteed.
With Random Split 
number of dataframes 10, element order is not guaranteed
+---+
| id|
+---+
|  2|
| 10|
| 16|
| 30|
| 36|
| 46|
| 51|
| 91|
|100|
|121|
|136|
|138|
|149|
|152|
|159|
|169|
|198|
|199|
|220|
|248|
+---+
only showing top 20 rows
(The remaining nine splits look similar: each dataframe holds roughly 100 randomly assigned ids, with show truncating to the top 20 rows.)
Since I want the data to be evenly distributed and to be able to use the chunks separately or in an iterative manner, randomSplit doesn't work: it may leave empty dataframes or an uneven distribution.
So using grouped can be one of the most feasible solutions here, if you don't mind calling collect on your dataframe.
Eg: val newdf = df.collect.grouped(100) (the argument is the chunk size)
That gives an Iterator[Array[org.apache.spark.sql.Row]] (a non-empty iterator). You can also convert it into a list by adding .toList at the end.
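If you then want each chunk back as a DataFrame rather than an Array of Rows, one way is to rebuild a small dataframe per chunk. This is only a rough sketch with illustrative names, assuming df and spark are in scope, and it is only reasonable when the data fits in driver memory:
import org.apache.spark.sql.DataFrame

// collect, split into chunks of 100 rows, and rebuild one DataFrame per chunk
val chunkDfs: List[DataFrame] = df.collect().grouped(100).toList
  .map(rows => spark.createDataFrame(spark.sparkContext.parallelize(rows), df.schema))

chunkDfs.zipWithIndex.foreach { case (chunk, i) => println(s"chunk $i has ${chunk.count()} rows") }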
Another possible solution, if we don't want Array chunks of the data but still want partitions with roughly equal record counts, is to use countApprox, adjusting the timeout and confidence as required. Then divide that by the number of records we need per partition, and use the result as the number of partitions for repartition or coalesce.
countApprox is used instead of count because it is a less expensive operation, and you can feel the difference when the data size is huge.
// Approximate the row count (cheaper than a full count on large data),
// then size the number of partitions so each holds roughly 100 records.
val approxCount = df.rdd.countApprox(timeout = 1000L, confidence = 0.95).getFinalValue().high
val numOfPartitions = Math.max(Math.round(approxCount / 100), 1).toInt
df.repartition(numOfPartitions)
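Note that repartition(n) only gives roughly even partition sizes, not exactly 100 rows each. A quick way to inspect the resulting distribution (a sketch reusing the df and numOfPartitions above):
df.repartition(numOfPartitions)
  .rdd
  .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))  // (partition index, row count)
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx -> $n rows") }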

How to find the max length unique rows from a dataframe with spark?

I am trying to find the unique rows (based on id) that have the maximum length value in a Spark dataframe. Each Column has a value of string type.
The dataframe is like:
+---+----+----+----+----+
| id|   A|   B|   C|   D|
+---+----+----+----+----+
|  1|toto|tata|titi|    |
|  1|toto|tata|titi|tutu|
|  2| bla| blo|    |    |
|  3|   b|   c|    |   d|
|  3|   b|   c|   a|   d|
+---+----+----+----+----+
The expectation is:
+---+----+----+----+----+
| id|   A|   B|   C|   D|
+---+----+----+----+----+
|  1|toto|tata|titi|tutu|
|  2| bla| blo|    |    |
|  3|   b|   c|   a|   d|
+---+----+----+----+----+
I can't figure out how to do this easily with Spark...
Thanks in advance
Note: This approach takes care of any addition or deletion of columns in the DataFrame, without needing a code change.
It can be done by first finding the length of all columns concatenated together (except the first column), and then filtering out all rows except those with the maximum length per id.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val output = input
  .withColumn("rowLength", length(concat(input.columns.toList.drop(1).map(col): _*))) // total length of all columns except id
  .withColumn("maxLength", max($"rowLength").over(Window.partitionBy($"id")))         // longest row length per id
  .filter($"rowLength" === $"maxLength")                                               // keep only the longest row(s)
  .drop("rowLength", "maxLength")
scala> df.show
+---+----+----+----+----+
| id|   A|   B|   C|   D|
+---+----+----+----+----+
|  1|toto|tata|titi|    |
|  1|toto|tata|titi|tutu|
|  2| bla| blo|    |    |
|  3|   b|   c|    |   d|
|  3|   b|   c|   a|   d|
+---+----+----+----+----+

scala> df.groupBy("id")
     |   .agg(
     |     concat_ws("", collect_set(col("A"))).alias("A"),
     |     concat_ws("", collect_set(col("B"))).alias("B"),
     |     concat_ws("", collect_set(col("C"))).alias("C"),
     |     concat_ws("", collect_set(col("D"))).alias("D"))
     |   .show
+---+----+----+----+----+
| id|   A|   B|   C|   D|
+---+----+----+----+----+
|  1|toto|tata|titi|tutu|
|  2| bla| blo|    |    |
|  3|   b|   c|   a|   d|
+---+----+----+----+----+
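If you want this groupBy/collect_set version to survive column additions or deletions (the point made in the note above), the aggregations can be built from df.columns instead of being hard-coded. A sketch, assuming the first column is named id:
import org.apache.spark.sql.functions._

// one concat_ws(collect_set(...)) aggregation per non-id column
val aggExprs = df.columns.filter(_ != "id").map(c => concat_ws("", collect_set(col(c))).alias(c))
df.groupBy("id").agg(aggExprs.head, aggExprs.tail: _*).show()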

Create a Vertical Table in Spark 2 [duplicate]

This question already has answers here:
How to melt Spark DataFrame?
(6 answers)
Closed 4 years ago.
How can I create a vertical table in Spark 2 SQL?
I am building an ETL using Spark 2 / SQL / Scala. I have data in a normal table structure like:
Input Table:
| ID | A | B | C | D |
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | D2 |
Output Table:
| ID | Key | Val |
| 1 | A | A1 |
| 1 | B | B1 |
| 1 | C | C1 |
| 1 | D | D1 |
| 2 | A | A2 |
| 2 | B | B2 |
| 2 | C | C2 |
| 2 | D | D2 |
This could do the trick as well:
Input Data:
+---+---+---+---+---+
|ID |A  |B  |C  |D  |
+---+---+---+---+---+
|1 |A1 |B1 |C1 |D1 |
|2 |A2 |B2 |C2 |D2 |
|3 |A3 |B3 |C3 |D3 |
+---+---+---+---+---+
Zip the column headers with their positions (for the columns to be included):
val cols = Seq("A", "B", "C", "D") zip Range(0, 4, 1)

df.flatMap(r => cols.map(i => (r.getString(0), i._1, r.getString(i._2 + 1))))
  .toDF("ID", "KEY", "VALUE")
  .show()
Result should look like this:
+---+---+-----+
| ID|KEY|VALUE|
+---+---+-----+
|  1|  A|   A1|
|  1|  B|   B1|
|  1|  C|   C1|
|  1|  D|   D1|
|  2|  A|   A2|
|  2|  B|   B2|
|  2|  C|   C2|
|  2|  D|   D2|
|  3|  A|   A3|
|  3|  B|   B3|
|  3|  C|   C3|
|  3|  D|   D3|
+---+---+-----+
Good Luck!!
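Another option worth a look (it is the usual answer to the linked "How to melt Spark DataFrame?" question) is the built-in stack() SQL function, which avoids the typed flatMap entirely. A sketch for the four columns of this example, assuming they all share the same (string) type:
// each pair is (literal key, column value); 4 is the number of pairs
df.selectExpr("ID", "stack(4, 'A', A, 'B', B, 'C', C, 'D', D) as (KEY, VALUE)")
  .show()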

Group rows that match sub string in a column using scala

I have the following df:
Zip | Name | id |
abc | xyz | 1 |
def | wxz | 2 |
abc | wex | 3 |
bcl | rea | 4 |
abc | txc | 5 |
def | rfx | 6 |
abc | abc | 7 |
I need to group all the names that contain 'x' by Zip and count them, using Scala.
Desired Output:
Zip | Count |
abc | 3 |
def | 2 |
Any help is highly appreciated
As @Shaido mentioned in the comment above, all you need is filter, groupBy and an aggregation:
import org.apache.spark.sql.functions._

fol.filter(col("Name").contains("x"))  // keep only the rows that have an x in the Name column
  .groupBy("Zip")                      // group by the Zip column
  .agg(count("Zip").as("Count"))       // count the rows in each group
  .show(false)
and you should have the desired output
+---+-----+
|Zip|Count|
+---+-----+
|abc|3    |
|def|2    |
+---+-----+
You want to groupBy the below data frame.
+---+----+---+
|zip|name| id|
+---+----+---+
|abc| xyz|  1|
|def| wxz|  2|
|abc| wex|  3|
|bcl| rea|  4|
|abc| txc|  5|
|def| rfx|  6|
|abc| abc|  7|
+---+----+---+
Then you can simply call groupBy with the column name, followed by count, to get the result (add the Name filter from the answer above if you only want names containing 'x'):
val groupedDf: DataFrame = df.groupBy("zip").count()
groupedDf.show()
// +---+-----+
// |zip|count|
// +---+-----+
// |bcl|    1|
// |abc|    4|
// |def|    2|
// +---+-----+

Extracting array index in Spark Dataframe

I have a Dataframe with a Column of Array Type
For example :
val df = List(("a", Array(1d,2d,3d)), ("b", Array(4d,5d,6d))).toDF("ID", "DATA")
df: org.apache.spark.sql.DataFrame = [ID: string, DATA: array<double>]
scala> df.show
+---+---------------+
| ID|           DATA|
+---+---------------+
|  a|[1.0, 2.0, 3.0]|
|  b|[4.0, 5.0, 6.0]|
+---+---------------+
I wish to explode the array and have an index, like:
+---+----------+----+
| ID|DATA_INDEX|DATA|
+---+----------+----+
|  a|         1| 1.0|
|  a|         2| 2.0|
|  a|         3| 3.0|
|  b|         1| 4.0|
|  b|         2| 5.0|
|  b|         3| 6.0|
+---+----------+----+
I would like to be able to do that with Scala, sparklyr, or SparkR.
I'm using Spark 1.6.
There is a posexplode function available in Spark's functions package:
import org.apache.spark.sql.functions._
df.select($"ID", posexplode($"DATA"))
PS: This is only available in versions 2.1.0 and later.
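posexplode produces columns named pos and col, and pos is 0-based, so a little renaming and offsetting gives the exact shape asked for. A sketch (Spark 2.1+):
import org.apache.spark.sql.functions._

df.select($"ID", posexplode($"DATA"))
  .withColumnRenamed("pos", "DATA_INDEX")
  .withColumnRenamed("col", "DATA")
  .withColumn("DATA_INDEX", $"DATA_INDEX" + 1)  // posexplode indexes from 0
  .show()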
With Spark 1.6, you can register your dataframe as a temporary table and then run Hive QL over it to get the desired result.
df.registerTempTable("tab")

sqlContext.sql("""
  select
    ID, exploded.DATA_INDEX + 1 as DATA_INDEX, exploded.DATA
  from
    tab
  lateral view posexplode(tab.DATA) exploded as DATA_INDEX, DATA
""").show
+---+----------+----+
| ID|DATA_INDEX|DATA|
+---+----------+----+
|  a|         1| 1.0|
|  a|         2| 2.0|
|  a|         3| 3.0|
|  b|         1| 4.0|
|  b|         2| 5.0|
|  b|         3| 6.0|
+---+----------+----+