I have the below data, which is stored in a csv file:
1|Roy|NA|2|Marry|4.6|3|Richard|NA|4|Joy|NA|5|Joe|NA|6|Jos|9|
Now I want to read the file and store it in a Spark DataFrame. Before storing it in the DataFrame, I want to split at every 3rd | and store each chunk as a row.
Expected output:
1|Roy|NA|
2|Marry|4.6|
3|Richard|NA|
4|Joy|NA|
5|Joe|NA|
6|Jos|9|
Could anyone help me out to get the output like the above?
Start by reading your csv file
val df = spark.read.option("delimiter", "|").csv(file)
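Here, spark is the usual SparkSession and file is the path to your csv; a minimal setup sketch, with a hypothetical path:
import org.apache.spark.sql.SparkSession

// local session and a hypothetical file location, adjust to your environment
val spark = SparkSession.builder().appName("splitEveryThirdPipe").master("local[*]").getOrCreate()
val file = "/tmp/people.csv"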
This will give you the following dataframe:
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+----+
|_c0|_c1|_c2|_c3|_c4  |_c5|_c6|_c7    |_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+----+
|1  |Roy|NA |2  |Marry|4.6|3  |Richard|NA |4  |Joy |NA  |5   |Joe |NA  |6   |Jos |9   |null|
|1  |Roy|NA |2  |Marry|4.6|3  |Richard|NA |4  |Joy |NA  |5   |Joe |NA  |6   |Jos |9   |null|
|1  |Roy|NA |2  |Marry|4.6|3  |Richard|NA |4  |Joy |NA  |5   |Joe |NA  |6   |Jos |9   |null|
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+----+
The last column is created because of the trailing delimiter in your csv file, so we get rid of it:
val dataframe = df.drop(df.schema.last.name)
dataframe.show(false)
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+
|_c0|_c1|_c2|_c3|_c4 |_c5|_c6|_c7 |_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+
|1 |Roy|NA |2 |Marry|4.6|3 |Richard|NA |4 |Joy |NA |5 |Joe |NA |6 |Jos |9 |
|1 |Roy|NA |2 |Marry|4.6|3 |Richard|NA |4 |Joy |NA |5 |Joe |NA |6 |Jos |9 |
|1 |Roy|NA |2 |Marry|4.6|3 |Richard|NA |4 |Joy |NA |5 |Joe |NA |6 |Jos |9 |
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+
Then, you need to create an array that contains the column names you want in your final dataframe:
val names : Array[String] = Array("colOne", "colTwo", "colThree")
Finally, you need a function that reads the columns three at a time:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

def splitCSV(dataFrame: DataFrame, columnNames: Array[String], sparkSession: SparkSession): DataFrame = {
  import sparkSession.implicits._
  val columns = dataFrame.columns
  // start from an empty dataframe with the target three-column schema
  var finalDF: DataFrame = Seq.empty[(String, String, String)].toDF(columnNames: _*)
  // walk the columns three at a time (assumes their count is a multiple of 3) and union each triple as rows
  for (order <- 0 until columns.length by 3) {
    finalDF = finalDF.union(dataFrame.select(col(columns(order)).as(columnNames(0)), col(columns(order + 1)).as(columnNames(1)), col(columns(order + 2)).as(columnNames(2))))
  }
  finalDF
}
Then we apply this function to the dataframe, passing the spark session (spark in the shell):
val finalDF = splitCSV(dataframe, names, spark)
finalDF.show(false)
+------+-------+--------+
|colOne|colTwo |colThree|
+------+-------+--------+
|1 |Roy |NA |
|1 |Roy |NA |
|1 |Roy |NA |
|2 |Marry |4.6 |
|2 |Marry |4.6 |
|2 |Marry |4.6 |
|3 |Richard|NA |
|3 |Richard|NA |
|3 |Richard|NA |
|4 |Joy |NA |
|4 |Joy |NA |
|4 |Joy |NA |
|5 |Joe |NA |
|5 |Joe |NA |
|5 |Joe |NA |
|6     |Jos    |9       |
|6     |Jos    |9       |
|6     |Jos    |9       |
+------+-------+--------+
You can use regex for most of it. There's no straightforward regex for "split at the nth matching occurrence", so we work around it by using a match to pick out the pattern, then inserting a custom splitter that we can split on.
ds
.withColumn("value",
regexp_replace('value, "([^\\|]*)\\|([^\\|]*)\\|([^\\|]*)\\|", "$1|$2|$3||")) // 1
.withColumn("value", explode(split('value, "\\|\\|"))) // 2
.where(length('value) > 0) // 3
Explanation
1. Replace every group of 3 |'s with its components, then terminate it with ||
2. Split on each || and use explode to move each chunk to a separate row
3. Unfortunately, the split picks up the empty group at the end, so we filter it out
Output for your given input:
+------------+
|value |
+------------+
|1|Roy|NA |
|2|Marry|4.6 |
|3|Richard|NA|
|4|Joy|NA |
|5|Joe|NA |
|6|Jos|9 |
+------------+
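For completeness, here is a minimal end-to-end sketch of this approach, assuming ds comes from reading the file as plain text (spark.read.text yields a single string column named value), that the usual spark session is in scope, and a hypothetical path:
import org.apache.spark.sql.functions.{explode, length, regexp_replace, split}
import spark.implicits._ // enables the 'value column syntax used above

val ds = spark.read.text("/tmp/people.csv") // one row per input line, in a column named "value"

val result = ds
  .withColumn("value", regexp_replace('value, "([^\\|]*)\\|([^\\|]*)\\|([^\\|]*)\\|", "$1|$2|$3||")) // terminate every group of 3 fields with ||
  .withColumn("value", explode(split('value, "\\|\\|")))                                             // split on || into separate rows
  .where(length('value) > 0)                                                                         // drop the trailing empty chunk

result.show(false)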
Related
I have a pyspark dataframe-
df1 = spark.createDataFrame([
("s1", "i1", 0),
("s1", "i2", 1),
("s1", "i3", 2),
("s1", None, 3),
("s1", "i5", 4),
],
["session_id", "item_id", "pos"])
df1.show(truncate=False)
pos is the position or rank of the item in the session.
Now I want to create new sessions without any null values in them. I want to do this by starting a new session after every null item. Basically I want to break existing sessions into multiple sessions, removing the null item_id in the process.
The expected output would look something like this:
+----------+-------+---+--------------+
|session_id|item_id|pos|new_session_id|
+----------+-------+---+--------------+
|s1 |i1 |0 | s1_0|
|s1 |i2 |1 | s1_0|
|s1 |i3 |2 | s1_0|
|s1 |null |3 | None|
|s1 |i5 |4 | s1_4|
+----------+-------+---+--------------+
How do I achieve this?
I'm not sure about the configs of your Spark job, but to avoid using a collect action to build the reference of your "new" sessions in a Python built-in data structure, I would use built-in Spark SQL functions to build the new session reference. Based on your example, and assuming you have already sorted the data frame:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.window import Window
from pyspark.sql.types import *
df = spark.createDataFrame(
[("s1", "i1", 0), ("s1", "i2", 1), ("s1", "i3", 2), ("s1", None, 3), ("s1", None, 4), ("s1", "i6", 5), ("s2", "i7", 6), ("s2", None, 7), ("s2", "i9", 8), ("s2", "i10", 9), ("s2", "i11", 10)],
["session_id", "item_id", "pos"]
)
df.show(20, False)
+----------+-------+---+
|session_id|item_id|pos|
+----------+-------+---+
|s1 |i1 |0 |
|s1 |i2 |1 |
|s1 |i3 |2 |
|s1 |null |3 |
|s1 |null |4 |
|s1 |i6 |5 |
|s2 |i7 |6 |
|s2 |null |7 |
|s2 |i9 |8 |
|s2 |i10 |9 |
|s2 |i11 |10 |
+----------+-------+---+
Step 1: As the data is already sorted, we can use a lag function to bring each record's previous item_id onto the current row:
df2 = df\
.withColumn('lag_item', func.lag('item_id', 1).over(Window.partitionBy('session_id').orderBy('pos')))
df2.show(20, False)
+----------+-------+---+--------+
|session_id|item_id|pos|lag_item|
+----------+-------+---+--------+
|s1 |i1 |0 |null |
|s1 |i2 |1 |i1 |
|s1 |i3 |2 |i2 |
|s1 |null |3 |i3 |
|s1 |null |4 |null |
|s1 |i6 |5 |null |
|s2 |i7 |6 |null |
|s2 |null |7 |i7 |
|s2 |i9 |8 |null |
|s2 |i10 |9 |i9 |
|s2 |i11 |10 |i10 |
+----------+-------+---+--------+
Step 2: After using the lag function, we can see whether the item_id in the previous record is NULL or not. Therefore, we can find the boundaries of each new session by filtering and building the reference:
reference = df2\
.filter((func.col('item_id').isNotNull())&(func.col('lag_item').isNull()))\
.groupby('session_id')\
.agg(func.collect_set('pos').alias('session_id_set'))
reference.show(100, False)
+----------+--------------+
|session_id|session_id_set|
+----------+--------------+
|s1 |[0, 5] |
|s2 |[6, 8] |
+----------+--------------+
Step 3: Join the reference back to the data and write a simple UDF to find which new session each row should be in:
@func.udf(returnType=IntegerType())
def udf_find_session(item_id, pos, session_id_set):
    r_val = None
    if item_id is not None:
        # collect_set gives no ordering guarantee, so sort the session boundaries first
        for item in sorted(session_id_set):
            if pos >= item:
                r_val = item
            else:
                break
    return r_val
df3 = df2.select('session_id', 'item_id', 'pos')\
.join(reference, on='session_id', how='inner')
df4 = df3.withColumn('new_session_id', udf_find_session(func.col('item_id'), func.col('pos'), func.col('session_id_set')))
df4.show(20, False)
+----------+-------+---+--------------+
|session_id|item_id|pos|new_session_id|
+----------+-------+---+--------------+
|s1 |i1 |0 |0 |
|s1 |i2 |1 |0 |
|s1 |i3 |2 |0 |
|s1 |null |3 |null |
|s1 |null |4 |null |
|s1 |i6 |5 |5 |
|s2 |i7 |6 |6 |
|s2 |null |7 |null |
|s2 |i9 |8 |8 |
|s2 |i10 |9 |8 |
|s2 |i11 |10 |8 |
+----------+-------+---+--------------+
The last step is just concatenating the string you want to show as the new session id.
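A minimal sketch of that final step, reusing df4 and the func alias from above; rows where new_session_id is null stay null, and the helper session_id_set column is dropped:
df5 = df4.withColumn(
    'new_session_id',
    func.when(
        func.col('new_session_id').isNotNull(),
        func.concat(func.col('session_id'), func.lit('_'), func.col('new_session_id').cast('string'))
    )
).drop('session_id_set')
df5.show(20, False)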
I'm trying to find the max of a column grouped by spark partition id. I'm getting the wrong value when applying the max function though. Here is the code:
val partitionCol = uuid()
val localRankCol = "test"
df = df.withColumn(partitionCol, spark_partition_id())
val windowSpec = Window.partitionBy(partitionCol).orderBy(sortExprs:_*)
val rankDF = df.withColumn(localRankCol, dense_rank().over(windowSpec))
val rankRangeDF = rankDF.agg(max(localRankCol))
rankRangeDF.show(false)
sortExprs is applying an ascending sort on sales.
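i.e. something like this (simplified sketch):
import org.apache.spark.sql.functions.col
val sortExprs = Seq(col("sales").asc)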
And the result with some dummy data is (partitionCol is 5th column):
+--------------+------+-----+---------------------------------+--------------------------------+----+
|title |region|sales|r6bea781150fa46e3a0ed761758a50dea|5683151561af407282380e6cf25f87b5|test|
+--------------+------+-----+---------------------------------+--------------------------------+----+
|Die Hard |US |100.0|1 |0 |1 |
|Rambo |US |100.0|1 |0 |1 |
|Die Hard |AU |200.0|1 |0 |2 |
|House of Cards|EU |400.0|1 |0 |3 |
|Summer Break |US |400.0|1 |0 |3 |
|Rambo |EU |100.0|1 |1 |1 |
|Summer Break |APAC |200.0|1 |1 |2 |
|Rambo |APAC |300.0|1 |1 |3 |
|House of Cards|US |500.0|1 |1 |4 |
+--------------+------+-----+---------------------------------+--------------------------------+----+
+---------+
|max(test)|
+---------+
|5 |
+---------+
"test" column has a max value of 4 but 5 is being returned.
I need to filter a dataframe with the below criteria.
I have 2 columns: 4Wheel (Subaru, Toyota, GM, null/empty) and 2Wheel (Yamaha, Harley, Indian, null/empty).
I have to filter on 4Wheel with values (Subaru, Toyota); if 4Wheel contains empty/null, then filter on 2Wheel with values (Yamaha, Harley).
I couldn't find this type of filtering in other examples. I am new to Spark/Scala, so I could not work out how to implement this.
Thanks,
Barun.
You can use the Spark SQL built-in function when to check if a column is null or empty, and filter accordingly:
import org.apache.spark.sql.functions.{col, when}
dataframe.filter(when(col("4Wheel").isNull || col("4Wheel").equalTo(""),
col("2Wheel").isin("Yamaha", "Harley")
).otherwise(
col("4Wheel").isin("Subaru", "Toyota")
))
So if you have the following input:
+---+------+------+
|id |4Wheel|2Wheel|
+---+------+------+
|1 |Toyota|null |
|2 |Subaru|null |
|3 |GM |null |
|4 |null |Yamaha|
|5 | |Yamaha|
|6 |null |Harley|
|7 | |Harley|
|8 |null |Indian|
|9 | |Indian|
|10 |null |null |
+---+------+------+
You get the following filtered output:
+---+------+------+
|id |4Wheel|2Wheel|
+---+------+------+
|1 |Toyota|null |
|2 |Subaru|null |
|4 |null |Yamaha|
|5 | |Yamaha|
|6 |null |Harley|
|7 | |Harley|
+---+------+------+
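For reference, a minimal sketch of how the sample input above could be built for testing; the id type and the exact mix of nulls and empty strings are assumed from the table:
import spark.implicits._

// assumed sample data matching the table above; both null and "" occur in 4Wheel
val dataframe = Seq(
  ("1", "Toyota", null), ("2", "Subaru", null), ("3", "GM", null),
  ("4", null, "Yamaha"), ("5", "", "Yamaha"), ("6", null, "Harley"),
  ("7", "", "Harley"), ("8", null, "Indian"), ("9", "", "Indian"),
  ("10", null, null)
).toDF("id", "4Wheel", "2Wheel")
// applying the filter above to this dataframe reproduces the filtered output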
I have a dataframe that looks as follows:
|id |val1|val2|
+---+----+----+
|1 |1 |0 |
|1 |2 |0 |
|1 |3 |0 |
|1 |4 |0 |
|1 |5 |5 |
|1 |6 |0 |
|1 |7 |0 |
|1 |8 |0 |
|1 |9 |9 |
|1 |10 |0 |
|1 |11 |0 |
|2 |1 |0 |
|2 |2 |0 |
|2 |3 |0 |
|2 |4 |0 |
|2 |5 |0 |
|2 |6 |6 |
|2 |7 |0 |
|2 |8 |8 |
|2 |9 |0 |
+---+----+----+
only showing top 20 rows
I want to create a new column with the number of rows until a non-zero value appears in val2; this should be done grouped/partitioned by 'id'. If the event never happens, I need to put a -1 in the steps field.
|id |val1|val2|steps|
+---+----+----+-----+
|1 |1 |0 |4 |
|1 |2 |0 |3 |
|1 |3 |0 |2 |
|1 |4 |0 |1 |
|1 |5 |5 |0 | event
|1 |6 |0 |3 |
|1 |7 |0 |2 |
|1 |8 |0 |1 |
|1 |9 |9 |0 | event
|1 |10 |0 |-1 | no further events for this id
|1 |11 |0 |-1 | no further events for this id
|2 |1 |0 |5 |
|2 |2 |0 |4 |
|2 |3 |0 |3 |
|2 |4 |0 |2 |
|2 |5 |0 |1 |
|2 |6 |6 |0 | event
|2 |7 |0 |1 |
|2 |8 |8 |0 | event
|2 |9 |0 |-1 | no further events for this id
+---+----+----+-----+
only showing top 20 rows
Your requirement seems easy, but implementing it in Spark while preserving immutability is a difficult task. You would need a recursive function to generate the steps column. Below I suggest a recursive way using a udf function.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

//udf function to populate the steps column
def stepsUdf = udf((values: Seq[Row]) => {
  //sort the collected structs in descending order of the val1 column
  val val12 = values.sortWith(_.getAs[Int]("val1") > _.getAs[Int]("val1"))
  //select the first of the sorted list
  val val12Head = val12.head
  //generate the steps value for the first element of the sorted list
  val prevStep = if(val12Head.getAs[Int]("val2") != 0) 0 else -1
  //generate the first output struct
  val listSteps = List(steps(val12Head.getAs[Int]("val1"), val12Head.getAs[Int]("val2"), prevStep))
  //recursive function for generating the steps column
  def recursiveSteps(vals: List[Row], previousStep: Int, listStep: List[steps]): List[steps] = vals match {
    case x :: y =>
      //event changed, so the steps column should be 0
      if(x.getAs[Int]("val2") != 0) {
        recursiveSteps(y, 0, listStep :+ steps(x.getAs[Int]("val1"), x.getAs[Int]("val2"), 0))
      }
      //no further event after the last event change
      else if(x.getAs[Int]("val2") == 0 && previousStep == -1) {
        recursiveSteps(y, previousStep, listStep :+ steps(x.getAs[Int]("val1"), x.getAs[Int]("val2"), previousStep))
      }
      //val2 is 0 after the event change, so increment the steps column
      else {
        recursiveSteps(y, previousStep+1, listStep :+ steps(x.getAs[Int]("val1"), x.getAs[Int]("val2"), previousStep+1))
      }
    case Nil => listStep
  }
  //call the recursive function
  recursiveSteps(val12.tail.toList, prevStep, listSteps)
})
df
.groupBy("id") // grouping by id column
.agg(stepsUdf(collect_list(struct("val1", "val2"))).as("stepped")) //calling udf function after the collection of struct of val1 and val2
.withColumn("stepped", explode(col("stepped"))) // generating rows from the list returned from udf function
.select(col("id"), col("stepped.*")) // final desired output
.sort("id", "val1") //optional step just for viewing
.show(false)
where steps is a case class
case class steps(val1: Int, val2: Int, steps: Int)
which should give you
+---+----+----+-----+
|id |val1|val2|steps|
+---+----+----+-----+
|1 |1 |0 |4 |
|1 |2 |0 |3 |
|1 |3 |0 |2 |
|1 |4 |0 |1 |
|1 |5 |5 |0 |
|1 |6 |0 |3 |
|1 |7 |0 |2 |
|1 |8 |0 |1 |
|1 |9 |9 |0 |
|1 |10 |0 |-1 |
|1 |11 |0 |-1 |
|2 |1 |0 |5 |
|2 |2 |0 |4 |
|2 |3 |0 |3 |
|2 |4 |0 |2 |
|2 |5 |0 |1 |
|2 |6 |6 |0 |
|2 |7 |0 |1 |
|2 |8 |8 |0 |
|2 |9 |0 |-1 |
+---+----+----+-----+
I hope the answer is helpful
I am learning Spark and Scala, and was experimenting in the spark REPL.
When I try to convert a List to a DataFrame, it works as follows:
val convertedDf = Seq(1,2,3,4).toDF("Field1")
However, when I try to convert a list of lists to a DataFrame with two columns (Field1, Field2):
val twoColumnDf = Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3)).toDF("Field1", "Field2")
it fails with the error message:
java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match
How to convert such a List of Lists to a DataFrame in Scala?
If you are looking for ways to have each element of each sequence in a row of the respective column, then the following are the options for you.
zip
zip both sequences and then apply toDF as
val twoColumnDf =Seq(1,2,3,4,5).zip(Seq(5,4,3,2,3)).toDF("Field1", "Field2")
which should give you twoColumnDf as
+------+------+
|Field1|Field2|
+------+------+
|1 |5 |
|2 |4 |
|3 |3 |
|4 |2 |
|5 |3 |
+------+------+
zipped
Another, better way is to use zipped:
val threeColumnDf = (Seq(1,2,3,4,5), Seq(5,4,3,2,3), Seq(10,10,10,12,14)).zipped.toList.toDF("Field1", "Field2", "field3")
which should give you
+------+------+------+
|Field1|Field2|field3|
+------+------+------+
|1 |5 |10 |
|2 |4 |10 |
|3 |3 |10 |
|4 |2 |12 |
|5 |3 |14 |
+------+------+------+
Note that zipped works only for a maximum of three sequences (thanks to @Shaido for pointing that out).
Note: the number of rows is determined by the shortest sequence present
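For example, zipping a longer sequence with a shorter one simply drops the extra elements; a quick illustrative sketch:
Seq(1,2,3,4,5).zip(Seq(5,4)).toDF("Field1", "Field2").show()
// only two rows: (1,5) and (2,4)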
transpose
transpose combines all the sequences just as zip and zipped do, but it returns lists instead of tuples, so a little extra work is needed:
Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3)).transpose.map{case List(a,b) => (a, b)}.toDF("Field1", "Field2")
+------+------+
|Field1|Field2|
+------+------+
|1 |5 |
|2 |4 |
|3 |3 |
|4 |2 |
|5 |3 |
+------+------+
and
Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3), Seq(10,10,10,12,14)).transpose.map{case List(a,b,c) => (a, b, c)}.toDF("Field1", "Field2", "Field3")
+------+------+------+
|Field1|Field2|Field3|
+------+------+------+
|1 |5 |10 |
|2 |4 |10 |
|3 |3 |10 |
|4 |2 |12 |
|5 |3 |14 |
+------+------+------+
and so on ...
Note: transpose requires all sequences to be of the same length
I hope the answer is helpful
By default, each element is considered to be a Row of the DataFrame.
If you want each of the Seqs to be a different column, you need to group them inside a Tuple:
val twoColumnDf =Seq((Seq(1,2,3,4,5), Seq(5,4,3,2,3))).toDF("Field1", "Field2")
twoColumnDf.show
+---------------+---------------+
| Field1| Field2|
+---------------+---------------+
|[1, 2, 3, 4, 5]|[5, 4, 3, 2, 3]|
+---------------+---------------+
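Note that with this approach each column holds the whole sequence as an array in a single row; a quick way to confirm that (sketch):
twoColumnDf.printSchema()
// Field1 and Field2 are both array<int> columns, and the dataframe has a single row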