Scala (Spark) .txt to .csv - scala

I've got both .txt and .dat files with this structure:
Number Date Time Nns Ans Nwe Awe
1 22.07.17 08:00:23 12444 427 8183 252
2 22.07.17 08:00:24 13 312 9 278
3 22.07.17 08:00:25 162 1877 63 273
4 22.07.17 08:00:26 87 400 29 574
5 22.07.17 08:00:27 72 349 82 2047
6 22.07.17 08:00:28 79 294 63 251
7 22.07.17 08:00:29 35 318 25 248
I can't convert it to .csv using Spark/Scala. I tried:
val data = spark
  .read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv() // also tried .text and .textFile
but it doesn't work!
Please help.
Here is the file: https://github.com/CvitoyBamp/overflow

You could try
import spark.implicits._ // needed for the Dataset[String] map below

val text = spark.read.textFile(pathToFile)
val cleaned = text.map(_.replaceAll(" +", " ").trim)

val data = spark
  .read
  .option("header", true)
  .option("sep", " ")
  .option("inferSchema", true)
  .csv(cleaned)
This first reads the file as plain strings, line by line. It then replaces every run of one or more spaces with a single space and parses the result as CSV with a single space as the separator. One thing to be aware of: if one of your fields contains a run of multiple spaces, it will also be collapsed to a single space.
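Since the original goal is a .csv file, the parsed DataFrame can then be written back out. A minimal sketch of that last step (the output path is just a placeholder):
// write the parsed DataFrame as CSV; Spark creates a directory of part files
data
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("path/to/output_csv")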

Hope this helps; it's working fine for me with your A.txt test file.
First, read the file as usual:
val df = spark.read.csv("A.txt")
Get the headers from the first row and zip them with their index:
val headers = df.first.toSeq.asInstanceOf[Seq[String]].flatMap(_.split("\\s+")).zipWithIndex
RESULT
headers: Seq[(String, Int)] = ArrayBuffer((Number,0), (Date,1), (Time,2), (Nns,3), (Ans,4), (Nwe,5), (Awe,6))
Then foldLeft over the headers, retrieving the item at the index given by the second element of each header pair and assigning it the column name given by the first element. Also drop the unneeded columns and filter out the row that contains the header values:
headers.foldLeft(df.withColumn("tmp", split($"_c0", "\\s+"))) { (acc, elem) =>
  acc.withColumn(elem._1, $"tmp".getItem(elem._2))
}
  .drop("_c0", "tmp")
  .filter("Number <> 'Number'")
RESULT
+------+--------+--------+-----+----+----+----+
|Number| Date| Time| Nns| Ans| Nwe| Awe|
+------+--------+--------+-----+----+----+----+
| 1|22.07.17|08:00:23|12444| 427|8183| 252|
| 2|22.07.17|08:00:24| 13| 312| 9| 278|
| 3|22.07.17|08:00:25| 162|1877| 63| 273|
| 4|22.07.17|08:00:26| 87| 400| 29| 574|
| 5|22.07.17|08:00:27| 72| 349| 82|2047|
| 6|22.07.17|08:00:28| 79| 294| 63| 251|
| 7|22.07.17|08:00:29| 35| 318| 25| 248|
| 8|22.07.17|08:00:30| 10| 629| 12| 391|
| 9|22.07.17|08:00:31| 58| 511| 67| 525|
| 10|22.07.17|08:00:32| 72| 234| 29| 345|
| 11|22.07.17|08:00:33| 277|1181| 38| 250|
| 12|22.07.17|08:00:34| 40| 268| 31| 292|
| 13|22.07.17|08:00:35| 16| 523| 10| 368|
| 14|22.07.17|08:00:36| 319|1329| 143| 703|
| 15|22.07.17|08:00:37| 164| 311| 124| 352|
| 16|22.07.17|08:00:38| 62| 320| 116| 272|
| 17|22.07.17|08:00:39| 223| 356| 217|1630|
| 18|22.07.17|08:00:40| 50|1659| 94|1611|
| 19|22.07.17|08:00:41| 34| 431| 26| 587|
| 20|22.07.17|08:00:42| 0| 0| 5| 277|
+------+--------+--------+-----+----+----+----+
only showing top 20 rows
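Note that every column produced this way is a string (the file was read without schema inference and then split into string tokens). If typed columns are wanted before writing out a CSV, they can be cast explicitly; a small sketch, assuming the foldLeft expression above is assigned to a val named result (a name introduced here just for illustration):
// cast the numeric columns; Date and Time stay as strings
val typed = result
  .withColumn("Number", $"Number".cast("int"))
  .withColumn("Nns", $"Nns".cast("int"))
  .withColumn("Ans", $"Ans".cast("int"))
  .withColumn("Nwe", $"Nwe".cast("int"))
  .withColumn("Awe", $"Awe".cast("int"))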
Also, here is a solution close to the one from the other answer.
You can load your data as a String Dataset:
val stringDF = spark.read.textFile("Downloads/A.txt").map(_.replaceAll("\\s+", " "))
And then
val data = spark
  .read
  .option("header", true)
  .option("sep", " ")
  .option("inferSchema", true)
  .csv(stringDF)
  .drop("_c7")
RESULT
+------+--------+--------+-----+----+----+----+
|Number| Date| Time| Nns| Ans| Nwe| Awe|
+------+--------+--------+-----+----+----+----+
| 1|22.07.17|08:00:23|12444| 427|8183| 252|
| 2|22.07.17|08:00:24| 13| 312| 9| 278|
| 3|22.07.17|08:00:25| 162|1877| 63| 273|
| 4|22.07.17|08:00:26| 87| 400| 29| 574|
| 5|22.07.17|08:00:27| 72| 349| 82|2047|
| 6|22.07.17|08:00:28| 79| 294| 63| 251|
| 7|22.07.17|08:00:29| 35| 318| 25| 248|
| 8|22.07.17|08:00:30| 10| 629| 12| 391|
| 9|22.07.17|08:00:31| 58| 511| 67| 525|
| 10|22.07.17|08:00:32| 72| 234| 29| 345|
| 11|22.07.17|08:00:33| 277|1181| 38| 250|
| 12|22.07.17|08:00:34| 40| 268| 31| 292|
| 13|22.07.17|08:00:35| 16| 523| 10| 368|
| 14|22.07.17|08:00:36| 319|1329| 143| 703|
| 15|22.07.17|08:00:37| 164| 311| 124| 352|
| 16|22.07.17|08:00:38| 62| 320| 116| 272|
| 17|22.07.17|08:00:39| 223| 356| 217|1630|
| 18|22.07.17|08:00:40| 50|1659| 94|1611|
| 19|22.07.17|08:00:41| 34| 431| 26| 587|
| 20|22.07.17|08:00:42| 0| 0| 5| 277|
+------+--------+--------+-----+----+----+----+
only showing top 20 rows

Related

Spark - Divide a dataframe into n number of records

I have a dataframe with 2 or more columns and 1000 records. I want to split the data into chunks of 100 records each, randomly, without any conditions.
So the expected output in record counts should be something like this:
[(1,2....100),(101,102,103...200),.....,(900,901...1000)]
Here's the solution that worked for my use case after trying different approaches:
https://stackoverflow.com/a/61276734/12322995
As @Shaido said, randomSplit is a popular approach for splitting a dataframe.
A different way to think about it is repartitionByRange, available since Spark 2.3:
repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T]
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is range partitioned. At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed.
Since: 2.3.0
package examples

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}

object RepartitionByRange extends App {

  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder().appName(getClass.getName).master("local").getOrCreate()
  val sc = spark.sparkContext
  import spark.implicits._

  val t1 = sc.parallelize(0 until 1000).toDF("id")

  val repartitionedOrders: Dataset[String] = t1.repartitionByRange(10, $"id")
    .mapPartitions { rows =>
      val idsInPartition = rows.map(row => row.getAs[Int]("id")).toSeq.sorted.mkString(",")
      Iterator(idsInPartition)
    }

  repartitionedOrders.show(false)
  println("number of chunks or partitions : " + repartitionedOrders.rdd.getNumPartitions)
}
Result :
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99 |
|100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199|
|200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299|
|300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399|
|400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499|
|500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599|
|600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699|
|700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799|
|800,801,802,803,804,805,806,807,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899|
|900,901,902,903,904,905,906,907,908,909,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951,952,953,954,955,956,957,958,959,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
number of chunks or partitions : 10
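If the goal is to persist each chunk as its own file rather than just inspect it, the range-partitioned DataFrame can also be written out directly, since each partition becomes one part file. A minimal sketch reusing t1 from above (the output path is a placeholder):
// one part file per range partition, i.e. one file per ~100-id chunk
t1.repartitionByRange(10, $"id")
  .write
  .mode("overwrite")
  .csv("/tmp/id_chunks")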
UPDATE: randomSplit example:
import spark.implicits._

val t1 = sc.parallelize(0 until 1000).toDF("id")
println("With Random Split ")

val dfarray = t1.randomSplit(Array(1, 1, 1, 1, 1, 1, 1, 1, 1, 1))
println("number of dataframes " + dfarray.length + "element order is not guaranteed ")
dfarray.foreach { df =>
  df.show
}
Result: the data will be split into 10 dataframes, and the order is not guaranteed.
With Random Split
number of dataframes 10element order is not guaranteed
+---+
| id|
+---+
| 2|
| 10|
| 16|
| 30|
| 36|
| 46|
| 51|
| 91|
|100|
|121|
|136|
|138|
|149|
|152|
|159|
|169|
|198|
|199|
|220|
|248|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 26|
| 40|
| 45|
| 54|
| 63|
| 72|
| 76|
|107|
|129|
|137|
|142|
|145|
|153|
|162|
|173|
|179|
|196|
|208|
|214|
|232|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 7|
| 12|
| 31|
| 32|
| 38|
| 42|
| 53|
| 61|
| 68|
| 73|
| 80|
| 89|
| 96|
|115|
|117|
|118|
|131|
|132|
|139|
|146|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 0|
| 24|
| 35|
| 57|
| 58|
| 65|
| 77|
| 78|
| 84|
| 86|
| 90|
| 97|
|156|
|158|
|168|
|174|
|182|
|197|
|218|
|242|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 1|
| 3|
| 17|
| 18|
| 19|
| 33|
| 70|
| 71|
| 74|
| 83|
|102|
|104|
|108|
|109|
|122|
|128|
|143|
|150|
|154|
|157|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 14|
| 15|
| 29|
| 44|
| 64|
| 75|
| 88|
|103|
|110|
|113|
|116|
|120|
|124|
|135|
|155|
|213|
|221|
|238|
|241|
|251|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 5|
| 9|
| 21|
| 22|
| 23|
| 25|
| 27|
| 47|
| 52|
| 55|
| 60|
| 62|
| 69|
| 93|
|111|
|114|
|141|
|144|
|161|
|164|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 13|
| 20|
| 39|
| 41|
| 49|
| 56|
| 67|
| 85|
| 87|
| 92|
|105|
|106|
|126|
|127|
|160|
|165|
|166|
|171|
|175|
|184|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 4|
| 34|
| 50|
| 79|
| 81|
|101|
|119|
|123|
|133|
|147|
|163|
|170|
|180|
|181|
|193|
|202|
|207|
|222|
|226|
|233|
+---+
only showing top 20 rows
+---+
| id|
+---+
| 6|
| 8|
| 11|
| 28|
| 37|
| 43|
| 48|
| 59|
| 66|
| 82|
| 94|
| 95|
| 98|
| 99|
|112|
|125|
|130|
|134|
|140|
|183|
+---+
only showing top 20 rows
Since I want the data to be evenly distributed and to be able to use the chunks separately or iteratively, randomSplit doesn't work here: it may leave empty dataframes or an unequal distribution.
So grouped can be one of the most feasible solutions, if you don't mind calling collect on your dataframe.
E.g.: val newdf = df.collect.grouped(10)
That gives an Iterator[Array[org.apache.spark.sql.Row]] (a non-empty iterator); it can also be converted into a list by adding .toList at the end.
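If the chunks are needed as Spark DataFrames again rather than local Row collections, each group can be converted back using the original schema. A small sketch under the same assumption that the data fits on the driver (chunk size 100, matching the question):
// rebuild one DataFrame per 100-row chunk
import org.apache.spark.sql.Row
import scala.collection.JavaConverters._

val schema = df.schema
val chunkDfs = df.collect.grouped(100).map { rows =>
  spark.createDataFrame(rows.toList.asJava, schema)
}.toList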
Another option, if we don't want local Array chunks but still want partitions with roughly equal record counts, is to use countApprox, adjusting the timeout and confidence as required. Divide that count by the number of records needed per partition, and use the result as the number of partitions for repartition or coalesce.
countApprox is used instead of count because it is a cheaper operation, and the difference is noticeable when the data is huge.
val approxCount = df.rdd.countApprox(timeout = 1000L,confidence = 0.95).getFinalValue().high
val numOfPartitions = Math.max(Math.round(approxCount / 100), 1).toInt
df.repartition(numOfPartitions)
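Once repartitioned, each roughly-equal chunk can be processed on its own, for example with foreachPartition. A minimal sketch of that last step (the body here just prints the chunk size as a placeholder action):
// runs once per partition, i.e. once per ~100-record chunk
df.repartition(numOfPartitions).rdd.foreachPartition { rows =>
  println(s"chunk size: ${rows.size}")
}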

How to calculate the 5-day mean, 10-day mean & 15-day mean for given data?

Scenario:
I have the following dataframe:
```
----------------------------------
| companyId | calc_date  | mean  |
----------------------------------
| 1111      | 01-08-2002 | 15    |
| 1111      | 02-08-2002 | 16.5  |
| 1111      | 03-08-2002 | 17    |
| 1111      | 04-08-2002 | 15    |
| 1111      | 05-08-2002 | 23    |
| 1111      | 06-08-2002 | 22.6  |
| 1111      | 07-08-2002 | 25    |
| 1111      | 08-08-2002 | 15    |
| 1111      | 09-08-2002 | 15    |
| 1111      | 10-08-2002 | 16.5  |
| 1111      | 11-08-2002 | 22.6  |
| 1111      | 12-08-2002 | 15    |
| 1111      | 13-08-2002 | 16.5  |
| 1111      | 14-08-2002 | 25    |
| 1111      | 15-08-2002 | 16.5  |
----------------------------------
```
Required:
For the given data, calculate the 5-day mean, 10-day mean, and 15-day mean for every record of every company.
5-day mean --> mean over the available values from the previous 5 days
10-day mean --> mean over the available values from the previous 10 days
15-day mean --> mean over the available values from the previous 15 days
The resulting dataframe should have the calculated columns as below:
----------------------------------------------------------------------------
companyId | calc_date | mean | 5 day-mean | 10-day mean | 15-day mean |
----------------------------------------------------------------------------
Question:
How can this be achieved?
What is the best way to do this in Spark?
Here's one approach: use a Window partitioned by company to compute the n-day mean over the current row and the previous rows that fall within the specified timestamp range via rangeBetween, as shown below (using a dummy dataset):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = (1 to 3).flatMap(i => Seq.tabulate(15)(j => (i, s"${j+1}-2-2019", j+1))).
toDF("company_id", "calc_date", "mean")
df.show
// +----------+---------+----+
// |company_id|calc_date|mean|
// +----------+---------+----+
// | 1| 1-2-2019| 1|
// | 1| 2-2-2019| 2|
// | 1| 3-2-2019| 3|
// | 1| 4-2-2019| 4|
// | 1| 5-2-2019| 5|
// | ... |
// | 1|14-2-2019| 14|
// | 1|15-2-2019| 15|
// | 2| 1-2-2019| 1|
// | 2| 2-2-2019| 2|
// | 2| 3-2-2019| 3|
// | ... |
// +----------+---------+----+
def winSpec = Window.partitionBy("company_id").orderBy("ts")
def dayRange(days: Int) = winSpec.rangeBetween(-(days * 24 * 60 * 60), 0)
df.
withColumn("ts", unix_timestamp(to_date($"calc_date", "d-M-yyyy"))).
withColumn("mean-5", mean($"mean").over(dayRange(5))).
withColumn("mean-10", mean($"mean").over(dayRange(10))).
withColumn("mean-15", mean($"mean").over(dayRange(15))).
show
// +----------+---------+----+----------+------+-------+-------+
// |company_id|calc_date|mean| ts|mean-5|mean-10|mean-15|
// +----------+---------+----+----------+------+-------+-------+
// | 1| 1-2-2019| 1|1549008000| 1.0| 1.0| 1.0|
// | 1| 2-2-2019| 2|1549094400| 1.5| 1.5| 1.5|
// | 1| 3-2-2019| 3|1549180800| 2.0| 2.0| 2.0|
// | 1| 4-2-2019| 4|1549267200| 2.5| 2.5| 2.5|
// | 1| 5-2-2019| 5|1549353600| 3.0| 3.0| 3.0|
// | 1| 6-2-2019| 6|1549440000| 3.5| 3.5| 3.5|
// | 1| 7-2-2019| 7|1549526400| 4.5| 4.0| 4.0|
// | 1| 8-2-2019| 8|1549612800| 5.5| 4.5| 4.5|
// | 1| 9-2-2019| 9|1549699200| 6.5| 5.0| 5.0|
// | 1|10-2-2019| 10|1549785600| 7.5| 5.5| 5.5|
// | 1|11-2-2019| 11|1549872000| 8.5| 6.0| 6.0|
// | 1|12-2-2019| 12|1549958400| 9.5| 7.0| 6.5|
// | 1|13-2-2019| 13|1550044800| 10.5| 8.0| 7.0|
// | 1|14-2-2019| 14|1550131200| 11.5| 9.0| 7.5|
// | 1|15-2-2019| 15|1550217600| 12.5| 10.0| 8.0|
// | 3| 1-2-2019| 1|1549008000| 1.0| 1.0| 1.0|
// | 3| 2-2-2019| 2|1549094400| 1.5| 1.5| 1.5|
// | 3| 3-2-2019| 3|1549180800| 2.0| 2.0| 2.0|
// | 3| 4-2-2019| 4|1549267200| 2.5| 2.5| 2.5|
// | 3| 5-2-2019| 5|1549353600| 3.0| 3.0| 3.0|
// +----------+---------+----+----------+------+-------+-------+
// only showing top 20 rows
Note that one could use rowsBetween (as opposed to rangeBetween) directly on calc_date if the dates are guaranteed to form a contiguous per-day time series.
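For completeness, a minimal sketch of that rowsBetween variant, reusing the imports and df from above. It uses a strict trailing window of n rows per company, which only matches the calendar-based version when there is exactly one row per day:
// strict n-row trailing window per company (assumes one row per day)
def rowRange(n: Int) = Window.partitionBy("company_id").orderBy("ts").rowsBetween(-(n - 1), 0)

df.
  withColumn("ts", unix_timestamp(to_date($"calc_date", "d-M-yyyy"))).
  withColumn("mean-5", mean($"mean").over(rowRange(5))).
  withColumn("mean-10", mean($"mean").over(rowRange(10))).
  withColumn("mean-15", mean($"mean").over(rowRange(15))).
  show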

Join 2 DFs with different dimensions - scala

Hi, I have 2 different DFs:
scala> d1.show()
+--------+-------+
|   fecha|eventos|
+--------+-------+
|20180404|      3|
|20180405|      7|
|20180406|     10|
|20180409|      4|
....

scala> d1.count()
res3: Long = 60

scala> d2.show()
+--------+----------+
|   fecha|TotalEvent|
+--------+----------+
|       0|     23534|
|20180322|        10|
|20180326|        50|
|20180402|         6|
|20180403|       118|
|20180404|      1110|
...

scala> d2.count()
res7: Long = 74
But I'd like to join them by fecha without losing data, and then create a new column with the math operation (TotalEvent - eventos) * 100 / TotalEvent.
Something like this:
+---------+-------+----------+--------+
|fecha |eventos|TotalEvent| KPI |
+---------+-------+----------+--------+
| 0| | 23534 | 100.00|
| 20180322| | 10 | 100.00|
| 20180326| | 50 | 100.00|
| 20180402| | 6 | 100.00|
| 20180403| | 118 | 100.00|
| 20180404| 3 | 1110 | 99.73|
| 20180405| 7 | 1204 | 99.42|
| 20180406| 10 | 1526 | 99.34|
| 20180407| | 14 | 100.00|
| 20180409| 4 | 1230 | 99.67|
| 20180410| 11 | 1456 | 99.24|
| 20180411| 6 | 1572 | 99.62|
| 20180412| 5 | 1450 | 99.66|
| 20180413| 7 | 1214 | 99.42|
.....
The problem is that I can't find a way to do it. When I use:
scala> d1.join(d2, d2("fecha").contains(d1("fecha")), "left").show()
I lose the data that isn't in both tables.
+--------+-------+--------+----------+
| fecha|eventos| fecha|TotalEvent|
+--------+-------+--------+----------+
|20180404| 3|20180404| 1110|
|20180405| 7|20180405| 1204|
|20180406| 10|20180406| 1526|
|20180409| 4|20180409| 1230|
|20180410| 11|20180410| 1456|
....
Additionally, how can I add a new column with the math operation?
Thank you
I would recommend left-joining df2 with df1 and calculating KPI based on whether eventos is null or not in the joined dataset (using when/otherwise):
import org.apache.spark.sql.functions._
val df1 = Seq(
("20180404", 3),
("20180405", 7),
("20180406", 10),
("20180409", 4)
).toDF("fecha", "eventos")
val df2 = Seq(
("0", 23534),
("20180322", 10),
("20180326", 50),
("20180402", 6),
("20180403", 118),
("20180404", 1110),
("20180405", 100),
("20180406", 100)
).toDF("fecha", "TotalEvent")
df2.
join(df1, Seq("fecha"), "left_outer").
withColumn( "KPI",
round( when($"eventos".isNull, 100.0).
otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"),
2
)
).show
// +--------+----------+-------+-----+
// | fecha|TotalEvent|eventos| KPI|
// +--------+----------+-------+-----+
// | 0| 23534| null|100.0|
// |20180322| 10| null|100.0|
// |20180326| 50| null|100.0|
// |20180402| 6| null|100.0|
// |20180403| 118| null|100.0|
// |20180404| 1110| 3|99.73|
// |20180405| 100| 7| 93.0|
// |20180406| 100| 10| 90.0|
// +--------+----------+-------+-----+
Note that if the more precise raw KPI is wanted instead, just remove the wrapping round( , 2).
I would do this in several steps: first join, then add the calculated column, then fill in the NA values:
val df2a = df2.withColumnRenamed("fecha", "fecha2") // to avoid ambiguous column names after the join
val df3 = df1.join(df2a, df1("fecha") === df2a("fecha2"), "outer")
val kpi = df3.withColumn("KPI", (($"TotalEvent" - $"eventos") / $"TotalEvent" * 100 as "KPI")).na.fill(100, Seq("KPI"))
kpi.show()
+--------+-------+--------+----------+-----------------+
| fecha|eventos| fecha2|TotalEvent| KPI|
+--------+-------+--------+----------+-----------------+
| null| null|20180402| 6| 100.0|
| null| null| 0| 23534| 100.0|
| null| null|20180322| 10| 100.0|
|20180404| 3|20180404| 1110|99.72972972972973|
|20180406| 10| null| null| 100.0|
| null| null|20180403| 118| 100.0|
| null| null|20180326| 50| 100.0|
|20180409| 4| null| null| 100.0|
|20180405| 7| null| null| 100.0|
+--------+-------+--------+----------+-----------------+
I solved the problem by mixing both suggestions received:
val dfKPI=d1.join(right=d2, usingColumns = Seq("cliente","fecha"), "outer").orderBy("fecha").withColumn( "KPI",round( when($"eventos".isNull, 100.0).otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"),2))

Group rows that match a substring in a column using scala

I have the following df:
Zip | Name | id |
abc | xyz | 1 |
def | wxz | 2 |
abc | wex | 3 |
bcl | rea | 4 |
abc | txc | 5 |
def | rfx | 6 |
abc | abc | 7 |
I need to group all the names that contain 'x' by Zip, using scala.
Desired Output:
Zip | Count |
abc | 3 |
def | 2 |
Any help is highly appreciated
As @Shaido mentioned in the comment above, all you need is filter, groupBy, and an aggregation:
import org.apache.spark.sql.functions._
fol.filter(col("Name").contains("x")) // keep only the rows that have an 'x' in the Name column
  .groupBy("Zip")                     // group by the Zip column
  .agg(count("Zip").as("Count"))      // count the rows in each group
  .show(false)
and you should have the desired output
+---+-----+
|Zip|Count|
+---+-----+
|abc|3 |
|def|2 |
+---+-----+
You want to groupBy the below data frame.
+---+----+---+
|zip|name| id|
+---+----+---+
|abc| xyz| 1|
|def| wxz| 2|
|abc| wex| 3|
|bcl| rea| 4|
|abc| txc| 5|
|def| rfx| 6|
|abc| abc| 7|
+---+----+---+
Then you can simply use the groupBy function, passing the column name, followed by count to get the result.
val groupedDf: DataFrame = df.groupBy("zip").count()
groupedDf.show()
// +---+-----+
// |zip|count|
// +---+-----+
// |bcl| 1|
// |abc| 4|
// |def| 2|
// +---+-----+

How to create feature vector in Scala? [duplicate]

This question already has an answer here:
How to transform the dataframe into label feature vector?
(1 answer)
Closed 5 years ago.
I am reading a csv as a data frame in scala as below:
+-----------+------------+
|x |y |
+-----------+------------+
| 0| 0|
| 0| 33|
| 0| 58|
| 0| 96|
| 0| 1|
| 1| 21|
| 0| 10|
| 0| 65|
| 1| 7|
| 1| 28|
+-----------+------------+
Then I create the label and feature vector as below:
val assembler = new VectorAssembler()
.setInputCols(Array("y"))
.setOutputCol("features")
val output = assembler.transform(daf).select($"x".as("label"), $"features")
println(output.show)
The output is:
+-----------+------------+
|label | features |
+-----------+------------+
| 0.0| 0.0|
| 0.0| 33.0|
| 0.0| 58.0|
| 0.0| 96.0|
| 0.0| 1.0|
| 0.0| 21.0|
| 0.0| 10.0|
| 1.0| 65.0|
| 1.0| 7.0|
| 1.0| 28.0|
+-----------+------------+
But instead I want the output to be in the format below:
+-----+------------------+
|label| features |
+-----+------------------+
| 0.0|(1,[1],[0]) |
| 0.0|(1,[1],[33]) |
| 0.0|(1,[1],[58]) |
| 0.0|(1,[1],[96]) |
| 0.0|(1,[1],[1]) |
| 1.0|(1,[1],[21]) |
| 0.0|(1,[1],[10]) |
| 0.0|(1,[1],[65]) |
| 1.0|(1,[1],[7]) |
| 1.0|(1,[1],[28]) |
+-----+------------------+
I tried
val assembler = new VectorAssembler()
.setInputCols(Array("y").map{x => "(1,[1],"+x+")"})
.setOutputCol("features")
But it did not work.
Any help is appreciated.
This is not how you use VectorAssembler.
You need to give it the names of your existing input columns, e.g. for your data:
new VectorAssembler().setInputCols(Array("y"))
You'll eventually face another issue with the data you have shared: it's not much of a vector if it has only one component (your features column).
It should be used with 2 or more columns, e.g.:
new VectorAssembler().setInputCols(Array("f1", "f2", "f3"))
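To make that concrete, here is a minimal sketch with two numeric feature columns; the column names and toy data are made up for illustration, and it assumes an existing SparkSession named spark as elsewhere in this thread:
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

// toy data: a label column x plus two hypothetical feature columns f1 and f2
val daf = Seq((0.0, 1.0, 2.0), (1.0, 3.0, 4.0), (0.0, 5.0, 6.0)).toDF("x", "f1", "f2")

val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2")) // plain column names in, one vector column out
  .setOutputCol("features")

val output = assembler.transform(daf).select($"x".as("label"), $"features")
output.show(false) // e.g. label 0.0 with features [1.0,2.0]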