I am using Scala and Spark to analyze some data.
Sorry, I am an absolute novice in this area.
I have data in the following format (below).
I want to create an RDD to filter, group, and transform the data.
Currently I have an RDD with a list of unparsed strings,
which I created from rawData, a list of strings:
val rawData // this is a ListBuffer[String]
val rdd = sc.parallelize(rawData)
How can I create a dataset to manipulate the data?
I want to have objects in the RDD with named fields like obj.name, obj.year and so on.
What is the right approach?
Should I create a DataFrame for this?
The raw data strings look like this (a list of strings with space-separated values):
Column meanings: "name", "year", "month", "tmax", "tmin", "afdays", "rainmm", "sunhours"
aberporth 1941 10 --- --- --- 106.2 ---
aberporth 1941 11 --- --- --- 92.3 ---
aberporth 1941 12 --- --- --- 86.5 ---
aberporth 1942 1 5.8 2.1 --- 114.0 58.0
aberporth 1942 2 4.2 -0.6 --- 13.8 80.3
aberporth 1942 3 9.7 3.7 --- 58.0 117.9
aberporth 1942 4 13.1 5.3 --- 42.5 200.1
aberporth 1942 5 14.0 6.9 --- 101.1 215.1
aberporth 1942 6 16.2 9.9 --- 2.3 269.3
aberporth 1942 7 17.4 11.3 12 70.2* 185.0
aberporth 1942 8 18.7 12.3 5- 78.5 141.9
aberporth 1942 9 16.4 10.7 123 146.8 129.1#
aberporth 1942 10 13.1 8.2 125 131.1 82.1l
--- means no data; I think I can put 0 in this column.
70.2*, 129.1#, 82.1l - the *, # and l here should be filtered out.
Please point me in the right direction.
I have found one of the possible solution here:
https://medium.com/@mrpowers/manually-creating-spark-dataframes-b14dae906393
This example looks good:
val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
How can I transform a list of strings into a Seq of Row?
You can read the data as a text file, replace --- with 0, and either remove the special characters or filter them out (I have replaced them in the example below).
Create a case class to represent the data:
case class Data(
  name: String, year: String, month: Int, tmax: Double,
  tmin: Double, afdays: Int, rainmm: Double, sunhours: Double
)
Read the file:
import spark.implicits._ // needed for the Dataset encoders used in the maps below

val data = spark.read.textFile("file path") // read as a text file (Dataset[String])
  .map(_.replace("---", "0").replaceAll("-|#|\\*", "")) // replace the no-data marker and strip special characters (note: this also strips the minus sign from values like -0.6)
  .map(_.split("\\s+")) // split on whitespace
  .map(x => // create a Data object for each record
    Data(x(0), x(1), x(2).toInt, x(3).toDouble, x(4).toDouble,
      x(5).toInt, x(6).toDouble, x(7).replace("l", "").toDouble)
  )
Now you get a Dataset[Data], i.e. a typed dataset parsed from the text.
Output:
+---------+----+-----+----+----+------+------+--------+
|name |year|month|tmax|tmin|afdays|rainmm|sunhours|
+---------+----+-----+----+----+------+------+--------+
|aberporth|1941|10 |0.0 |0.0 |0 |106.2 |0.0 |
|aberporth|1941|11 |0.0 |0.0 |0 |92.3 |0.0 |
|aberporth|1941|12 |0.0 |0.0 |0 |86.5 |0.0 |
|aberporth|1942|1 |5.8 |2.1 |0 |114.0 |58.0 |
|aberporth|1942|2 |4.2 |0.6 |0 |13.8 |80.3 |
|aberporth|1942|3 |9.7 |3.7 |0 |58.0 |117.9 |
|aberporth|1942|4 |13.1|5.3 |0 |42.5 |200.1 |
|aberporth|1942|5 |14.0|6.9 |0 |101.1 |215.1 |
|aberporth|1942|6 |16.2|9.9 |0 |2.3 |269.3 |
|aberporth|1942|7 |17.4|11.3|12 |70.2 |185.0 |
|aberporth|1942|8 |18.7|12.3|5 |78.5 |141.9 |
|aberporth|1942|9 |16.4|10.7|123 |146.8 |129.1 |
|aberporth|1942|10 |13.1|8.2 |125 |131.1 |82.1 |
+---------+----+-----+----+----+------+------+--------+
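Since the question starts from rawData, an in-memory ListBuffer[String], rather than a file, here is a minimal sketch of the same parsing applied to that list (assuming the Data case class above and an existing SparkSession named spark; the sample lines are just placeholders):

import scala.collection.mutable.ListBuffer

// Placeholder lines standing in for your real rawData
val rawData = ListBuffer(
  "aberporth 1941 10 --- --- --- 106.2 ---",
  "aberporth 1942 7 17.4 11.3 12 70.2* 185.0"
)

import spark.implicits._

val ds = rawData.toSeq.toDS()                           // Dataset[String] from the in-memory list
  .map(_.replace("---", "0").replaceAll("-|#|\\*", "")) // same cleanup as the file-based version
  .map(_.trim.split("\\s+"))
  .map(x => Data(x(0), x(1), x(2).toInt, x(3).toDouble, x(4).toDouble,
                 x(5).toInt, x(6).toDouble, x(7).replace("l", "").toDouble))

ds.show(false) // a Dataset[Data] with named columns, so e.g. ds.filter(_.year == "1942") works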
I hope this helps!
Related
This question is similar to one I've asked before (Pandas pivot using column as suffix), but this time I need to do it using PySpark instead of Pandas. The problem is as follows.
I have a dataframe like the following example:
Id   | Type | Value_1 | Value_2
---- | ---- | ------- | -------
1234 | A    | 1       | 2
1234 | B    | 1       | 2
789  | A    | 1       | 2
789  | B    | 1       | 2
567  | A    | 1       | 2
And I want to transform it to get the following:
Id   | Value_1_A | Value_1_B | Value_2_A | Value_2_B
---- | --------- | --------- | --------- | ---------
1234 | 1         | 1         | 2         | 2
789  | 1         | 1         | 2         | 2
567  | 1         | 1         |           |
In summary: replicate the value columns using the 'Type' column as a suffix and convert the dataframe to a wide format.
One solution I can think of is creating the columns with the suffix manually and then aggregating.
Another solution I've tried is using the PySpark GroupedData pivot function as follows:
import pandas as pd
df = spark.createDataFrame(pd.DataFrame({'Id': {0: 1234, 1: 1234, 2: 789, 3: 789, 4: 567},
                                         'Type': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A'},
                                         'Value_1': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
                                         'Value_2': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2}}))
df.groupBy("Id").pivot("Type").avg().show()
The issue with this solution is that the resulting dataframe includes the Id column repeated three times, plus the inability to name the columns with Type as a suffix, since they end up named like this:
['Id',
'A_avg(Id)',
'A_avg(Value_1)',
'A_avg(Value_2)',
'B_avg(Id)',
'B_avg(Value_1)',
'B_avg(Value_2)']
I also tried specifying the value columns in the pivot function as follows:
df.groupBy("Id").pivot("Type", values=["Value_1", "Value_2"]).avg().show()
This removes the extra Id columns, but the rest of the columns only have null values.
Is there any elegant way to do the transformation I'm attempting in PySpark?
Option 1:
If you don't mind having your Type values as column prefixes rather than suffixes, you can use a combination of agg, avg, and alias:
import pyspark.sql.functions as F
df_pivot = df \
    .groupBy("Id") \
    .pivot("Type") \
    .agg(F.avg("Value_1").alias("Value_1"), F.avg("Value_2").alias("Value_2"))
+----+---------+---------+---------+---------+
|Id |A_Value_1|A_Value_2|B_Value_1|B_Value_2|
+----+---------+---------+---------+---------+
|789 |1.0 |2.0 |1.0 |2.0 |
|567 |1.0 |2.0 |null |null |
|1234|1.0 |2.0 |1.0 |2.0 |
+----+---------+---------+---------+---------+
Separately, it's worth noting here that the values argument in the pivot method is used to limit which values you want to retain from your pivot (i.e., Type) column. For example, if you only wanted A and not B in your output, you would specify pivot("Type", values=["A"]).
Option 2:
If you do still want them as suffixes, you'll likely have to use some regex and withColumnRenamed, which could look something like this:
import pyspark.sql.functions as F
import re
df_pivot = df \
    .groupBy("Id") \
    .pivot("Type") \
    .agg(F.avg("Value_1"), F.avg("Value_2"))

for col in df_pivot.columns:
    if "avg(" in col:
        suffix = re.findall("^.*(?=_avg\()|$", col)[0]
        base_name = re.findall("(?<=\().*(?=\)$)|$", col)[0]
        df_pivot = df_pivot.withColumnRenamed(col, "_".join([base_name, suffix]))
+----+---------+---------+---------+---------+
|Id |Value_1_A|Value_2_A|Value_1_B|Value_2_B|
+----+---------+---------+---------+---------+
|789 |1.0 |2.0 |1.0 |2.0 |
|567 |1.0 |2.0 |null |null |
|1234|1.0 |2.0 |1.0 |2.0 |
+----+---------+---------+---------+---------+
I am working with a Spark dataframe containing timeseries data, and one of the columns is an indicator for an event, looking something like the dummy table below.
id | time             | timeseries_data | event_indicator
-- | ---------------- | --------------- | ---------------
a  | 2022-08-12 08:00 | 1               | 0
a  | 2022-08-12 08:01 | 2               | 0
a  | 2022-08-12 08:02 | 3               | 0
a  | 2022-08-12 08:03 | 4               | 1
a  | 2022-08-12 08:04 | 5               | 0
a  | 2022-08-12 08:05 | 6               | 0
b  | 2022-08-12 08:00 | 1               | 0
b  | 2022-08-12 08:01 | 2               | 0
b  | 2022-08-12 08:02 | 3               | 1
b  | 2022-08-12 08:03 | 4               | 0
b  | 2022-08-12 08:04 | 5               | 0
b  | 2022-08-12 08:05 | 6               | 0
I now want to select samples before and after the event (including the sample where the event occurs): to start off, one sample before and after, but also by time, i.e. everything within 4 minutes of the event for each id.
I've tried to use the window function, but I don't know how to sort it out.
The desired result for id a is shown below. The event occurs at 2022-08-12 08:03, at sample 4, and I now want to extract the following into a new dataframe.
id | time             | timeseries_data | event_indicator
-- | ---------------- | --------------- | ---------------
a  | 2022-08-12 08:02 | 3               | 0
a  | 2022-08-12 08:03 | 4               | 1
a  | 2022-08-12 08:04 | 5               | 0
Edit:
I thought there would be a very simple solution to this problem; I am just new to PySpark, which is why I can't really get it to work.
What I've tried is to use a window function per id.
windowPartition = Window.partitionBy([F.col("id")]).orderBy("time").rangeBetween(-1, 1)
test_df = df_dummy.where(F.col('event_indicator') == 1).over(windowPartition)
However, the error is that the DataFrame does not have the attribute 'over'. So I need to figure out a way to apply this window to the entire dataframe and not just to a function.
From my understanding, lag/lead only take the lagged/lead value, whereas I want a consecutive dataframe of the time around the event_indicator.
The timestamp is only dummy data; for me it currently does not matter whether the window is per minute or per second, so I've changed the question to per minute.
Currently the goal is to understand how I can extract a subset of the entire timeseries dataframe, to see how the data changes when something happens. An example could be a car driving normally, one tyre explodes, and we want to see what happened with the pressure in the x timeseries samples before and after the explosion. The next step might not be to use samples but instead to look at what happened with the data in the previous minute and the following minute.
@Emil
Solution:
from pyspark.sql import functions as F
from pyspark.sql import Window
data = [("a","2022-08-12 08:00","1","0"),
("a","2022-08-12 08:01","2","0"),
("a","2022-08-12 08:02","3","0"),
("a","2022-08-12 08:03","4","1"),
("a","2022-08-12 08:04","5","0"),
("a","2022-08-12 08:05","6","0"),
("a","2022-08-12 08:10","7","0"),
("a","2022-08-12 08:12","8","1"),
("a","2022-08-12 08:14","9","0"),
("b","2022-08-12 08:00","1","0"),
("b","2022-08-12 08:01","2","0"),
("b","2022-08-12 08:02","3","1"),
("b","2022-08-12 08:03","4","0"),
("b","2022-08-12 08:04","5","0"),
("b","2022-08-12 08:05","6","0"),
("b","2022-08-12 08:10","7","0"), # thesearemy testcase,shouldn't in output
("b","2022-08-12 08:12","8","1"), # theseare my testcase,shouldn't in output
("b","2022-08-12 08:17","9","0")] # thesearemy testcase,shouldn't in output
schema=["id","time","timeseries_data","event_indicator"]
df = spark.createDataFrame(data,schema)
df= df.withColumn("time",F.col("time").cast("timestamp"))\
.withColumn("event_indicator",F.col("event_indicator").cast("int"))
# window ordered by time within each id; a second window over each contiguous group of flagged rows
window_spec = Window.partitionBy(["id"]).orderBy("time")
window_spec_groups_all = Window.partitionBy(["id", "continous_groups"])

# keep a row if it is the event itself or the row immediately before/after it
flag_cond_ = ((F.lag(F.col("event_indicator")).over(window_spec) == 1)
              | (F.col("event_indicator") == 1)
              | (F.lead(F.col("event_indicator")).over(window_spec) == 1))

# span in seconds between the first and last timestamp of each contiguous group
four_minutes_cond_ = (F.unix_timestamp(F.last(F.col("time")).over(window_spec_groups_all))
                      - F.unix_timestamp(F.first(F.col("time")).over(window_spec_groups_all)))

df_flt = df.withColumn("flag", F.when(flag_cond_, F.lit(True)).otherwise(F.lit(False))) \
           .withColumn("org_rnk", F.row_number().over(window_spec)) \
           .filter(F.col("flag")) \
           .withColumn("flt_rnk", F.row_number().over(window_spec)) \
           .withColumn("continous_groups", (F.col("org_rnk") - F.col("flt_rnk")).cast("int")) \
           .withColumn("four_minutes_cond_", four_minutes_cond_) \
           .filter(F.col("four_minutes_cond_") <= 240) \
           .select(schema)

df_flt.show(20, 0)
Output:
+---+-------------------+---------------+---------------+
|id |time |timeseries_data|event_indicator|
+---+-------------------+---------------+---------------+
|a |2022-08-12 08:02:00|3 |0 |
|a |2022-08-12 08:03:00|4 |1 |
|a |2022-08-12 08:04:00|5 |0 |
|a |2022-08-12 08:10:00|7 |0 |
|a |2022-08-12 08:12:00|8 |1 |
|a |2022-08-12 08:14:00|9 |0 |
|b |2022-08-12 08:01:00|2 |0 |
|b |2022-08-12 08:02:00|3 |1 |
|b |2022-08-12 08:03:00|4 |0 |
+---+-------------------+---------------+---------------+
I'm trying to create a comparison matrix using a Spark dataframe, and am starting by creating a single column dataframe with one row per value:
val df = List(1, 2, 3, 4, 5).toDF
From here, what I need to do is create a new column for each row and insert (for now) a random number in each cell, like this:
Item 1 2 3 4 5
------ --- --- --- --- ---
1 0 7 3 6 2
2 1 0 4 3 1
3 8 6 0 4 4
4 8 8 1 0 9
5 9 5 3 6 0
Any assistance would be appreciated!
Consider transposing the input DataFrame df using the .pivot() function, like the following (this assumes the single column is named item, e.g. created with List(1, 2, 3, 4, 5).toDF("item")):
import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.types.DataTypes

val output = df.groupBy("item").pivot("item").agg((rand() * 100).cast(DataTypes.IntegerType))
This will generate a new DataFrame with a random Integer value in the column corresponding to the row value (null otherwise).
+----+----+----+----+----+----+
|item|1 |2 |3 |4 |5 |
+----+----+----+----+----+----+
|1 |9 |null|null|null|null|
|3 |null|null|2 |null|null|
|5 |null|null|null|null|6 |
|4 |null|null|null|26 |null|
|2 |null|33 |null|null|null|
+----+----+----+----+----+----+
If you don't want the null values, you can consider applying a UDF later, or use the simpler alternative sketched below.
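For instance, the built-in na.fill can replace those nulls with zeros in a single call (a minimal sketch, assuming the output DataFrame produced above):

// Replace the nulls introduced by the pivot with 0 in the numeric columns
val filled = output.na.fill(0)

filled.orderBy("item").show(false)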
I have a dataframe in Spark which is like:
column_A | column_B
-------- | --------
1        | 1,12,21
2        | 6,9
Both column_A and column_B are of String type.
How can I convert the above dataframe to a new dataframe which is like:
column_new_A | column_new_B
------------ | ------------
1            | 1
1            | 12
1            | 21
2            | 6
2            | 9
Both column_new_A and column_new_B should be of String type.
You need to split column_B on the comma and use the explode function, as below:
import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._

val df = Seq(
  ("1", "1,12,21"),
  ("2", "6,9")
).toDF("column_A", "column_B")
You can use withColumn or select to create the new column:
df.withColumn("column_B", explode(split( $"column_B", ","))).show(false)
df.select($"column_A".as("column_new_A"), explode(split( $"column_B", ",")).as("column_new_B"))
Output:
+------------+------------+
|column_new_A|column_new_B|
+------------+------------+
|1 |1 |
|1 |12 |
|1 |21 |
|2 |6 |
|2 |9 |
+------------+------------+
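Since the question asks for both result columns to remain of String type, here is a small sketch to confirm that: split produces an array of strings, so explode keeps the values as strings (this reuses the df and the select variant from above):

// The new columns come straight from String columns, so no casting is needed
val result = df.select(
  $"column_A".as("column_new_A"),
  explode(split($"column_B", ",")).as("column_new_B")
)

result.printSchema() // both columns are reported as string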
I am reading a fixed-position file. The final result of the file is stored in a string. I would like to convert the string into a DataFrame to process it further. Kindly help me with this. Below is my input:
Input data:
+---------+----------------------+
|PRGREFNBR|value |
+---------+----------------------+
|01 |11 apple TRUE 0.56|
|02 |12 pear FALSE1.34|
|03 |13 raspberry TRUE 2.43|
|04 |14 plum TRUE .31|
|05 |15 cherry TRUE 1.4 |
+---------+----------------------+
data position: "3,10,5,4"
expected result with default headers in the data frame:
+-----+-----+----------+-----+-----+
|SeqNo|col_0| col_1|col_2|col_3|
+-----+-----+----------+-----+-----+
| 01 | 11 |apple |TRUE | 0.56|
| 02 | 12 |pear |FALSE| 1.34|
| 03 | 13 |raspberry |TRUE | 2.43|
| 04 | 14 |plum |TRUE | 1.31|
| 05 | 15 |cherry |TRUE | 1.4 |
+-----+-----+----------+-----+-----+
Given the fixed-position file (say input.txt):
11 apple TRUE 0.56
12 pear FALSE1.34
13 raspberry TRUE 2.43
14 plum TRUE 1.31
15 cherry TRUE 1.4
and the length of every field in the input file as (say lengths):
3,10,5,4
you could create a DataFrame as follows:
// Read the text file as is
// and filter out empty lines
val lines = spark.read.textFile("input.txt").filter(!_.isEmpty)
// define a helper function to do the split per fixed lengths
// Home exercise: should be part of a case class that describes the schema
def parseLinePerFixedLengths(line: String, lengths: Seq[Int]): Seq[String] = {
  lengths.indices.foldLeft((line, Array.empty[String])) { case ((rem, fields), idx) =>
    val len = lengths(idx)
    val fld = rem.take(len)
    (rem.drop(len), fields :+ fld)
  }._2
}
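// For example (a sketch; the exact padding depends on your file's layout):
//   parseLinePerFixedLengths("11 apple     TRUE 0.56", Seq(3, 10, 5, 4))
//   // => Seq("11 ", "apple     ", "TRUE ", "0.56")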
// Split the lines using parseLinePerFixedLengths method
val lengths = Seq(3,10,5,4)
val fields = lines.
  map(parseLinePerFixedLengths(_, lengths)).
  withColumnRenamed("value", "fields") // <-- it'd be unnecessary if a case class were used
scala> fields.show(truncate = false)
+------------------------------+
|fields |
+------------------------------+
|[11 , apple , TRUE , 0.56]|
|[12 , pear , FALSE, 1.34]|
|[13 , raspberry , TRUE , 2.43]|
|[14 , plum , TRUE , 1.31]|
|[15 , cherry , TRUE , 1.4 ]|
+------------------------------+
That's what you may have had already, so let's unroll/destructure the nested sequence of fields into columns:
val answer = lengths.indices.foldLeft(fields) { case (result, idx) =>
  result.withColumn(s"col_$idx", $"fields".getItem(idx))
}
// drop the unnecessary/interim column
scala> answer.drop("fields").show
+-----+----------+-----+-----+
|col_0| col_1|col_2|col_3|
+-----+----------+-----+-----+
| 11 |apple |TRUE | 0.56|
| 12 |pear |FALSE| 1.34|
| 13 |raspberry |TRUE | 2.43|
| 14 |plum |TRUE | 1.31|
| 15 |cherry |TRUE | 1.4 |
+-----+----------+-----+-----+
Done!
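One more detail: the question gives the field positions as the string "3,10,5,4", so you could derive lengths from that spec instead of hard-coding it, and optionally trim the padding inside each field (a small sketch reusing parseLinePerFixedLengths from above):

// Turn the position spec from the question into the lengths sequence
val positionSpec = "3,10,5,4"
val lengthsFromSpec: Seq[Int] = positionSpec.split(",").map(_.trim.toInt).toSeq

// Optional: trim the fixed-width padding from every field while parsing
def parseLineTrimmed(line: String, lengths: Seq[Int]): Seq[String] =
  parseLinePerFixedLengths(line, lengths).map(_.trim)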