How to create DataFrame from fixed-length text file given field lengths? - scala

I am reading a fixed-position file. The final result of the file is stored in a string, and I would like to convert that string into a DataFrame for further processing. Kindly help me with this. Below is my data:
Input data:
+---------+----------------------+
|PRGREFNBR|value                 |
+---------+----------------------+
|01       |11 apple     TRUE 0.56|
|02       |12 pear      FALSE1.34|
|03       |13 raspberry TRUE 2.43|
|04       |14 plum      TRUE 1.31|
|05       |15 cherry    TRUE 1.4 |
+---------+----------------------+
data position: "3,10,5,4"
expected result with default header in data frame:
+-----+-----+----------+-----+-----+
|SeqNo|col_0|     col_1|col_2|col_3|
+-----+-----+----------+-----+-----+
| 01  | 11  |apple     |TRUE | 0.56|
| 02  | 12  |pear      |FALSE| 1.34|
| 03  | 13  |raspberry |TRUE | 2.43|
| 04  | 14  |plum      |TRUE | 1.31|
| 05  | 15  |cherry    |TRUE | 1.4 |
+-----+-----+----------+-----+-----+

Given the fixed-position file (say input.txt):
11 apple     TRUE 0.56
12 pear      FALSE1.34
13 raspberry TRUE 2.43
14 plum      TRUE 1.31
15 cherry    TRUE 1.4
and the length of every field in the input file as (say lengths):
3,10,5,4
you could create a DataFrame as follows:
// Read the text file as is
// and filter out empty lines
val lines = spark.read.textFile("input.txt").filter(!_.isEmpty)

// Define a helper function to do the split per fixed lengths
// Home exercise: should be part of a case class that describes the schema
def parseLinePerFixedLengths(line: String, lengths: Seq[Int]): Seq[String] = {
  lengths.indices.foldLeft((line, Array.empty[String])) { case ((rem, fields), idx) =>
    val len = lengths(idx)
    val fld = rem.take(len)
    (rem.drop(len), fields :+ fld)
  }._2
}

// Split the lines using the parseLinePerFixedLengths method
val lengths = Seq(3,10,5,4)
val fields = lines.
  map(parseLinePerFixedLengths(_, lengths)).
  withColumnRenamed("value", "fields") // <-- it'd be unnecessary if a case class were used
scala> fields.show(truncate = false)
+------------------------------+
|fields |
+------------------------------+
|[11 , apple     , TRUE , 0.56]|
|[12 , pear      , FALSE, 1.34]|
|[13 , raspberry , TRUE , 2.43]|
|[14 , plum      , TRUE , 1.31]|
|[15 , cherry    , TRUE , 1.4 ]|
+------------------------------+
That's what you may have had already, so let's unroll/destructure the nested sequence of fields into columns:
val answer = lengths.indices.foldLeft(fields) { case (result, idx) =>
  result.withColumn(s"col_$idx", $"fields".getItem(idx))
}
// drop the unnecessary/interim column
scala> answer.drop("fields").show
+-----+----------+-----+-----+
|col_0|     col_1|col_2|col_3|
+-----+----------+-----+-----+
|  11 |apple     |TRUE | 0.56|
|  12 |pear      |FALSE| 1.34|
|  13 |raspberry |TRUE | 2.43|
|  14 |plum      |TRUE | 1.31|
|  15 |cherry    |TRUE | 1.4 |
+-----+----------+-----+-----+
Done!
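As the "Home exercise" comment above hints, the per-line parsing could instead produce a case class, so the Dataset comes out with named, typed columns and the fold/getItem step disappears. A minimal sketch, assuming hypothetical field names (id, name, flag, amount) and spark.implicits._ in scope as in spark-shell:
// Hypothetical schema for the sample data; adjust names/types to your file
case class Record(id: String, name: String, flag: String, amount: String)

// Reuse the fixed-length splitter above and trim the padding
def parseRecord(line: String, lengths: Seq[Int]): Record = {
  val Seq(id, name, flag, amount) = parseLinePerFixedLengths(line, lengths).map(_.trim)
  Record(id, name, flag, amount)
}

val typed = lines.map(parseRecord(_, lengths))  // Dataset[Record] with columns id, name, flag, amount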

Related

Scala/Spark; Add column to DataFrame that increments by 1 when a value is repeated in another column

I have a dataframe called rawEDI that looks something like this:
| Line_number | Segment |
| ----------- | ------- |
| 1           | ST      |
| 2           | BPT     |
| 3           | SE      |
| 4           | ST      |
| 5           | BPT     |
| 6           | N1      |
| 7           | SE      |
| 8           | ST      |
| 9           | PTD     |
| 10          | SE      |
Each row represents a line in a file. Each line is called a segment and is denoted by something called a segment identifier; a short string. Segments are grouped together in chunks that start with an ST segment identifier and end with an SE segment identifier. There can be any number of ST chunks in a given file, and the size of any ST chunk is not fixed.
I want to create a new column on the dataframe that represents numerically what ST group a given segment belongs to. This will allow me to use groupBy to perform aggregate operations across all ST segments without having to loop over each individual ST segment, which is too slow.
The final DataFrame would look like this:
| Line_number | Segment | ST_Group |
| ----------- | ------- | -------- |
| 1           | ST      | 1        |
| 2           | BPT     | 1        |
| 3           | SE      | 1        |
| 4           | ST      | 2        |
| 5           | BPT     | 2        |
| 6           | N1      | 2        |
| 7           | SE      | 2        |
| 8           | ST      | 3        |
| 9           | PTD     | 3        |
| 10          | SE      | 3        |
In short, I want to create and populate a DataFrame column with a number that increments by one whenever the value "ST" appears in the Segment column.
I am using Spark 2.3.2 and Scala 2.11.8.
My initial thought was to use iteration. I collected another DataFrame, df, that contained the starting and ending line_number for each segment, looking like this:
| Start | End |
| ----- | --- |
| 1     | 3   |
| 4     | 7   |
| 8     | 10  |
Then iterate over the rows of the dataframe and use them to populate the new column like this:
var st = 1
for (row <- df.collect()) {
  val start = row(0)
  val end = row(1)
  var labelSTs = rawEDI.filter("line_number > = ${start}").filter("line_number <= ${end}").withColumn("ST_Group", lit(st))
  st = st + 1
}
However, this yields an empty DataFrame. Additionally, the use of a for loop is time-prohibitive, taking over 20s on my machine for this. Achieving this result without the use of a loop would be huge, but a solution with a loop may also be acceptable if performant.
I have a hunch this can be accomplished using a udf or a Window, but I'm not certain how to attack that.
This
val func = udf((s:String) => if(s == "ST") 1 else 0)
var labelSTs = rawEDI.withColumn("ST_Group", func(col("segment")))
Only populates the column with 1 at each ST segment start.
And this
val w = Window.partitionBy("Segment").orderBy("line_number")
val labelSTs = rawEDI.withColumn("ST_Group", row_number().over(w))
Returns a nonsense dataframe.
One way is to create an intermediate dataframe of "groups" that would tell you on which line each group starts and ends (sort of what you've already done), and then join it to the original table using greater-than/less-than conditions.
Sample data
scala> val input = Seq((1,"ST"),(2,"BPT"),(3,"SE"),(4,"ST"),(5,"BPT"),
         (6,"N1"),(7,"SE"),(8,"ST"),(9,"PTD"),(10,"SE"))
         .toDF("linenumber","segment")
scala> input.show(false)
+----------+-------+
|linenumber|segment|
+----------+-------+
|1 |ST |
|2 |BPT |
|3 |SE |
|4 |ST |
|5 |BPT |
|6 |N1 |
|7 |SE |
|8 |ST |
|9 |PTD |
|10 |SE |
+----------+-------+
Create a dataframe for groups, using Window just as your hunch was telling you:
scala> val groups = input.where("segment='ST'")
         .withColumn("endline",lead("linenumber",1) over Window.orderBy("linenumber"))
         .withColumn("groupnumber",row_number() over Window.orderBy("linenumber"))
         .withColumnRenamed("linenumber","startline")
         .drop("segment")
scala> groups.show(false)
+---------+-----------+-------+
|startline|groupnumber|endline|
+---------+-----------+-------+
|1 |1 |4 |
|4 |2 |8 |
|8 |3 |null |
+---------+-----------+-------+
Join both to get the result
scala> input.join(groups,
         input("linenumber") >= groups("startline") &&
           (input("linenumber") < groups("endline") || groups("endline").isNull))
         .select("linenumber","segment","groupnumber")
         .show(false)
+----------+-------+-----------+
|linenumber|segment|groupnumber|
+----------+-------+-----------+
|1 |ST |1 |
|2 |BPT |1 |
|3 |SE |1 |
|4 |ST |2 |
|5 |BPT |2 |
|6 |N1 |2 |
|7 |SE |2 |
|8 |ST |3 |
|9 |PTD |3 |
|10 |SE |3 |
+----------+-------+-----------+
The only problem with this is Window.orderBy() on an unpartitioned dataframe, which would collect all data to a single partition and thus could be a killer.
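One possible mitigation (a sketch, not part of the original answer): building groups still needs that global ordering, but since groups has only one row per ST segment it is tiny, so the subsequent range join can at least be broadcast to avoid shuffling the large input:
import org.apache.spark.sql.functions.broadcast

// groups has one row per ST chunk, so broadcasting it keeps the
// non-equi join from shuffling the (potentially large) input dataframe
val labelled = input.join(
    broadcast(groups),
    input("linenumber") >= groups("startline") &&
      (input("linenumber") < groups("endline") || groups("endline").isNull))
  .select("linenumber", "segment", "groupnumber")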
If you just want to add a column with a number that increments by one whenever the value "ST" appears in the Segment column, you can filter the lines with the ST segment into a separate dataframe,
var labelSTs = rawEDI.filter("Segment == 'ST'")
// then group by Segment and collect the line numbers into a list
var groupedDf = labelSTs.groupBy("Segment").agg(collect_list("Line_number").alias("Line_numbers"))
// now flatten the dataframe back, keeping track of the line number index
var flattenedDf = groupedDf.select($"Segment", explode($"Line_numbers").as("Line_number"))
// log the line_number index in your target column ST_Group
val withIndexDF = flattenedDf.withColumn("ST_Group", row_number().over(Window.partitionBy($"Segment").orderBy($"Line_number")))
and you have this as result:
+-------+-----------+--------+
|Segment|Line_number|ST_Group|
+-------+-----------+--------+
|     ST|          1|       1|
|     ST|          4|       2|
|     ST|          8|       3|
+-------+-----------+--------+
Then you can combine this back with the other segments in the initial dataframe, as in the sketch below.
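A possible sketch of that last step (hypothetical, not part of the original answer, reusing the column names above): union the labelled ST rows with the remaining segments and forward-fill ST_Group by line number.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Non-ST rows start with a null ST_Group; the last non-null value seen so far
// (ordered by Line_number) is then carried forward onto them
val others = rawEDI
  .filter($"Segment" =!= "ST")
  .select($"Line_number", $"Segment", lit(null).cast("int").as("ST_Group"))

val filled = withIndexDF.select("Line_number", "Segment", "ST_Group")
  .unionByName(others)
  .withColumn(
    "ST_Group",
    last("ST_Group", ignoreNulls = true)
      .over(Window.orderBy("Line_number").rowsBetween(Window.unboundedPreceding, 0)))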
Found a simpler way: add a column which will have 1 when the segment column value is ST, otherwise 0. Then, using a Window function, find the cumulative sum of that new column. This will give you the desired results.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val rawEDI=Seq((1,"ST"),(2,"BPT"),(3,"SE"),(4,"ST"),(5,"BPT"),(6,"N1"),(7,"SE"),(8,"ST"),(9,"PTD"),(10,"SE")).toDF("line_number","segment")
val newDf=rawEDI.withColumn("ST_Group", ($"segment" === "ST").cast("bigint"))
val windowSpec = Window.orderBy("line_number")
newDf.withColumn("ST_Group", sum("ST_Group").over(windowSpec))
.show
+-----------+-------+--------+
|line_number|segment|ST_Group|
+-----------+-------+--------+
| 1| ST| 1|
| 2| BPT| 1|
| 3| SE| 1|
| 4| ST| 2|
| 5| BPT| 2|
| 6| N1| 2|
| 7| SE| 2|
| 8| ST| 3|
| 9| PTD| 3|
| 10| SE| 3|
+-----------+-------+--------+
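Tying this back to the question's goal, a hypothetical follow-up (not part of the answer): with ST_Group in place, the per-chunk aggregations can be done with a plain groupBy instead of a loop, for example:
// e.g. count the segments in each ST chunk
val withGroup = newDf.withColumn("ST_Group", sum("ST_Group").over(windowSpec))
val perChunk = withGroup.groupBy("ST_Group").agg(count("segment").alias("segment_count"))
perChunk.show()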

Merge multiple spark rows inside dataframe by ID into one row based on update_time

We need to merge multiple rows based on ID into a single record using PySpark. If there are multiple updates to a column, then we have to select the one with the last update made to it.
Please note, NULL would mean there was no update made to the column in that instance.
So, basically we have to create a single row with the consolidated updates made to the records.
So, for example, if this is the dataframe ...
I'm looking for a similar answer to Merge rows in a spark scala Dataframe, but in PySpark.
------------------------------------------------------------
| id | column1 | column2 | updated_at |
------------------------------------------------------------
| 123 | update1 | <*no-update*> | 1634228709 |
| 123 | <*no-update*> | 80 | 1634228724 |
| 123 | update2 | <*no-update*> | 1634229000 |
expected output is -
------------------------------------------------------------
| id | column1 | column2 | updated_at |
------------------------------------------------------------
| 123 | update2 | 80 | 1634229000 |
Let's say that our input dataframe is:
+---+-------+----+----------+
|id |col1 |col2|updated_at|
+---+-------+----+----------+
|123|null |null|1634228709|
|123|null |80 |1634228724|
|123|update2|90 |1634229000|
|12 |update1|null|1634221233|
|12 |null |80 |1634228333|
|12 |update2|null|1634221220|
+---+-------+----+----------+
What we want is to convert updated_at to TimestampType, then order by id and updated_at in descending order:
df = df.withColumn("updated_at", F.col("updated_at").cast(TimestampType())).orderBy(
    F.col("id"), F.col("updated_at").desc()
)
that gives us:
+---+-------+----+-------------------+
|id |col1 |col2|updated_at |
+---+-------+----+-------------------+
|12 |null |80 |2021-10-14 18:18:53|
|12 |update1|null|2021-10-14 16:20:33|
|12 |update2|null|2021-10-14 16:20:20|
|123|update2|90 |2021-10-14 18:30:00|
|123|null |80 |2021-10-14 18:25:24|
|123|null |null|2021-10-14 18:25:09|
+---+-------+----+-------------------+
Now group by id and take the first non-null value in each column (or null if there is none):
exp = [F.first(x, ignorenulls=True).alias(x) for x in df.columns[1:]]
df = df.groupBy(F.col("id")).agg(*exp)
And the result is:
+---+-------+----+-------------------+
|id |col1 |col2|updated_at |
+---+-------+----+-------------------+
|123|update2|90 |2021-10-14 18:30:00|
|12 |update1|80 |2021-10-14 18:18:53|
+---+-------+----+-------------------+
Here's the full example code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import TimestampType
if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
    data = [
        (123, None, None, 1634228709),
        (123, None, 80, 1634228724),
        (123, "update2", 90, 1634229000),
        (12, "update1", None, 1634221233),
        (12, None, 80, 1634228333),
        (12, "update2", None, 1634221220),
    ]
    columns = ["id", "col1", "col2", "updated_at"]
    df = spark.createDataFrame(data, columns)
    df = df.withColumn("updated_at", F.col("updated_at").cast(TimestampType())).orderBy(
        F.col("id"), F.col("updated_at").desc()
    )
    exp = [F.first(x, ignorenulls=True).alias(x) for x in df.columns[1:]]
    df = df.groupBy(F.col("id")).agg(*exp)

How to assign a category to each row based on the cumulative sum of values in spark dataframe?

I have a Spark dataframe consisting of two columns [Employee and Salary] where salary is in ascending order.
Sample Dataframe:
| Employee |salary |
| -------- | ------|
| Emp1 | 10 |
| Emp2 | 20 |
| Emp3 | 30 |
| EMp4 | 35 |
| Emp5 | 36 |
| Emp6 | 50 |
| Emp7 | 70 |
I want to group the rows such that each group has less than 80 as the aggregated value, and assign a category to each group, something like this: I will keep adding the salaries of the rows until the sum becomes more than 80. As soon as it becomes more than 80, I will assign a new category.
Expected Output:
| Employee |salary | Category|
| -------- | ------|----------
| Emp1 | 10 |A |
| Emp2 | 20 |A |
| Emp3 | 30 |A |
| EMp4 | 35 |B |
| Emp5 | 36 |B |
| Emp6 | 50 |C |
| Emp7 | 70 |D |
Is there a simple way we can do this in spark scala?
To solve your problem, you can use a custom aggregate function over a window.
First, you need to create your custom aggregate function. An aggregate function is defined by an accumulator (a buffer) that is initialized (zero function), updated when treating a new row (reduce function) or when encountering another accumulator (merge function), and returned at the end (finish function).
In your case, the accumulator should keep two pieces of information:
- the current category of employees
- the sum of salaries of the previous employees belonging to the current category
To store this information, you can use a Tuple (Int, Int), where the first element is the current category and the second element is the sum of salaries of the previous employees in the current category:
- You initialize this tuple with (0, 0).
- When you encounter a new row, if the sum of the previous salaries and the current row's salary is over 80, you increment the category and reinitialize the previous salaries' sum with the current row's salary; otherwise you add the current row's salary to the previous salaries' sum.
- As you will be using a window function, rows are treated sequentially, so you don't need to implement merge with another accumulator.
- At the end, as you only want the category, you return only the first element of the accumulator.
So we get the following aggregator implementation:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

object Labeler extends Aggregator[Int, (Int, Int), Int] {
  override def zero: (Int, Int) = (0, 0)

  override def reduce(catAndSum: (Int, Int), salary: Int): (Int, Int) = {
    if (catAndSum._2 + salary > 80)
      (catAndSum._1 + 1, salary)
    else
      (catAndSum._1, catAndSum._2 + salary)
  }

  override def merge(catAndSum1: (Int, Int), catAndSum2: (Int, Int)): (Int, Int) = {
    throw new NotImplementedError("should be used only over a windows function")
  }

  override def finish(catAndSum: (Int, Int)): Int = catAndSum._1

  override def bufferEncoder: Encoder[(Int, Int)] = Encoders.tuple(Encoders.scalaInt, Encoders.scalaInt)

  override def outputEncoder: Encoder[Int] = Encoders.scalaInt
}
Once you have your aggregator, you transform it into a Spark aggregate function using the udaf function.
You then create your window over the whole dataframe, ordered by salary, and apply your Spark aggregate function over this window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, udaf}
val labeler = udaf(Labeler)
val window = Window.orderBy("salary")
val result = dataframe.withColumn("category", labeler(col("salary")).over(window))
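(For reference, the dataframe here is assumed to hold the question's sample data; a hypothetical setup, assuming spark.implicits._ is in scope as in spark-shell, could be:)
import spark.implicits._

val dataframe = Seq(
  ("Emp1", 10), ("Emp2", 20), ("Emp3", 30), ("Emp4", 35),
  ("Emp5", 36), ("Emp6", 50), ("Emp7", 70)
).toDF("employee", "salary")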
Using your example as input dataframe, you get the following result dataframe:
+--------+------+--------+
|employee|salary|category|
+--------+------+--------+
|Emp1 |10 |0 |
|Emp2 |20 |0 |
|Emp3 |30 |0 |
|Emp4 |35 |1 |
|Emp5 |36 |1 |
|Emp6 |50 |2 |
|Emp7 |70 |3 |
+--------+------+--------+
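If you want the letter categories from the expected output rather than 0-based numbers, one possible final touch (not part of the original answer) is a small UDF mapping the index to a letter:
import org.apache.spark.sql.functions.{col, udf}

// assumes fewer than 26 groups; maps 0 -> A, 1 -> B, ...
val toLetter = udf((i: Int) => ('A' + i).toChar.toString)
val lettered = result.withColumn("category", toLetter(col("category")))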

Spark Dataframe - Write a new record for a change in VALUE for a particular KEY group

Need to write a row when there is a change in the "AMT" column for a particular "KEY" group.
E.g.:
Scenario 1: For KEY=2, the first change is 90 to 20, so we need to write a record with value (20-90).
Similarly, the next change for the same key group is 20 to 30.5, so again we need to write another record with value (30.5 - 20).
Scenario 2: For KEY=1, there is only one record for this KEY group, so write it as is.
Scenario 3: For KEY=3, since the same AMT value exists twice, write it once.
How can this be implemented? Using window functions or groupBy agg functions?
Sample Input Data :
val DF1 = List((1,34.6),(2,90.0),(2,90.0),(2,20.0),(2,30.5),(3,89.0),(3,89.0)).toDF("KEY", "AMT")
DF1.show(false)
+-----+-------------------+
|KEY |AMT |
+-----+-------------------+
|1 |34.6 |
|2 |90.0 |
|2 |90.0 |
|2 |20.0 |----->[ 20.0 - 90.0 = -70.0 ]
|2 |30.5 |----->[ 30.5 - 20.0 = 10.5 ]
|3 |89.0 |
|3 |89.0 |
+-----+-------------------+
Expected Values :
scala> df2.show()
+----+--------------------+
|KEY | AMT |
+----+--------------------+
| 1 | 34.6 |-----> As Is
| 2 | -70.0 |----->[ 20.0 - 90.0 = -70.0 ]
| 2 | 10.5 |----->[ 30.5 - 20.0 = 10.5 ]
| 3 | 89.0 |-----> As Is, with one record only
+----+--------------------+
I have tried to solve it in PySpark, not in Scala.
from pyspark.sql.functions import lag
from pyspark.sql.window import Window
w1=Window().partitionBy("key").orderBy("key")
DF4 =spark.createDataFrame([(1,34.6),(2,90.0),(2,90.0),(2,20.0),(2,30.5),(3,89.0),(3,89.0)],["KEY", "AMT"])
DF4.createOrReplaceTempView('keyamt')
DF7=spark.sql('select distinct key,amt from keyamt where key in ( select key from (select key,count(distinct(amt))dist from keyamt group by key) where dist=1)')
DF8=DF4.join(DF7,DF4['KEY']==DF7['KEY'],'leftanti').withColumn('new_col',((lag('AMT',1).over(w1)).cast('double') ))
DF9=DF8.withColumn('new_col1', ((DF8['AMT']-DF8['new_col'].cast('double'))))
DF9.withColumn('new_col1', ((DF9['AMT']-DF9['new_col'].cast('double')))).na.fill(0)
DF9.filter(DF9['new_col1'] !=0).select(DF9['KEY'],DF9['new_col1']).union(DF7).orderBy(DF9['KEY'])
Output:
+---+--------+
|KEY|new_col1|
+---+--------+
| 1| 34.6|
| 2| -70.0|
| 2| 10.5|
| 3| 89.0|
+---+--------+
You can implement your logic using a window function in combination with when, lead, monotonically_increasing_id() for ordering, and the withColumn API, as below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("KEY").orderBy("rowNo")
val tempdf = DF1.withColumn("rowNo", monotonically_increasing_id())
tempdf.select($"KEY", when(lead("AMT", 1).over(windowSpec).isNull || (lead("AMT", 1).over(windowSpec)-$"AMT").as("AMT")===lit(0.0), $"AMT").otherwise(lead("AMT", 1).over(windowSpec)-$"AMT").as("AMT")).show(false)

Scala/ Spark- Multiply an Integer with each value in a Dataframe Column

I have a sample dataframe
df_that_I_have
+---------+---------+-------+
| country | members | some |
+---------+---------+-------+
| India | 50 | 1 |
+---------+---------+-------+
| Japan | 20 | 3 |
+---------+---------+-------+
| India | 20 | 1 |
+---------+---------+-------+
| Japan | 10 | 3 |
+---------+---------+-------+
and I want a dataframe that looks like this
df_that_I_want
+---------+---------+-------+
| country | members | some |
+---------+---------+-------+
| India | 70 | 10 | // 5 * Sum of "some" for India, i.e. (1 + 1)
+---------+---------+-------+
| Japan | 30 | 30 | // 5 * Sum of "some" for Japan, i.e. (3 + 3)
+---------+---------+-------+
The second dataframe has the sum of members and the sum of some multiplied by 5.
This is what I'm doing to achieve this
val df_that_I_want = df_that_I_have
  .select(df_that_I_have("country"),
    df_that_I_have.groupBy("country").sum("members"),
    5 * df_that_I_have.groupBy("country").sum("some")) //Problem here
But the compiler does not allow me to do this because apparently I can't multiply 5 with a column.
How can I multiply an Integer value with the sum of some for each country?
You can try the lit function.
scala> val df_that_I_have = Seq(("India",50,1),("India",20,1),("Japan",20,3),("Japan",10,3)).toDF("Country","Members","Some")
df_that_I_have: org.apache.spark.sql.DataFrame = [Country: string, Members: int, Some: int]
scala> val df1 = df_that_I_have.groupBy("country").agg(sum("members"), sum("some") * lit(5))
df1: org.apache.spark.sql.DataFrame = [country: string, sum(members): bigint, ((sum(some),mode=Complete,isDistinct=false) * 5): bigint]
scala> val df_that_I_want= df1.select($"Country",$"sum(Members)".alias("Members"), $"((sum(Some),mode=Complete,isDistinct=false) * 5)".alias("Some"))
df_that_I_want: org.apache.spark.sql.DataFrame = [Country: string, Members: bigint, Some: bigint]
scala> df_that_I_want.show
+-------+-------+----+
|Country|Members|Some|
+-------+-------+----+
| India| 70| 10|
| Japan| 30| 30|
+-------+-------+----+
Please try this
df_that_I_have.groupBy("country").agg(sum("members"), sum("some") * lit(5))
The lit function is used for creating a column with a literal value, which is 5 here.
When you are not able to multiply by 5 directly, lit creates a column containing 5 and the sum is multiplied by it.
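A tidier variant of the same idea (a sketch, assuming the same input dataframe): alias the aggregates directly in agg, so the generated column names never have to be referenced:
import org.apache.spark.sql.functions.{lit, sum}

val df_that_I_want = df_that_I_have
  .groupBy("country")
  .agg(
    sum("members").alias("members"),
    (sum("some") * lit(5)).alias("some"))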