I was hoping somebody might know a simple solution to this problem using spark and scala.
I have some network data of animal movements in the following format (currently in a dataframe in spark):
id start end date
12 0 10 20091017
12 10 20 20091201
12 20 0 20091215
12 0 15 20100220
12 15 0 20100320
The id is the id of the animal, and start and end are the locations of a movement (i.e. the second row is a movement from location 10 to location 20). If start or end is 0, the animal is born or has died (i.e. in the first row animal 12 is born, and in row 3 the animal dies).
The problem I have is that animal IDs were re-used in the database, so after an animal has died its id may re-occur.
What I want to do is apply a unique tag to the movements belonging to each re-use of an id, so you would get something like:
id start end date
12a 0 10 20091017
12a 10 20 20091201
12a 20 0 20091215
12b 0 15 20100220
12b 15 0 20100320
I've been trying a few different approaches but can't seem to get anything that works. The database is very large (several gigabytes), so I need something that works reasonably efficiently.
Any help is much appreciated.
The only solution that may work relatively well directly on DataFrames is to use window functions, but I still wouldn't expect particularly high performance here:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{struct, sum, when}
import spark.implicits._

val df = Seq(
  (12, 0, 10, 20091017), (12, 10, 20, 20091201),
  (12, 20, 0, 20091215), (12, 0, 15, 20100220),
  (12, 15, 0, 20100320)
).toDF("id", "start", "end", "date")

// every start == 0 marks a birth, so a running count of births per id
// distinguishes the successive "lives" of a re-used id
val w = Window.partitionBy($"id").orderBy($"date")
val uniqueId = struct(
  $"id", sum(when($"start" === 0, 1).otherwise(0)).over(w))

df.withColumn("unique_id", uniqueId).show
// +---+-----+---+--------+---------+
// | id|start|end| date|unique_id|
// +---+-----+---+--------+---------+
// | 12| 0| 10|20091017| [12,1]|
// | 12| 10| 20|20091201| [12,1]|
// | 12| 20| 0|20091215| [12,1]|
// | 12| 0| 15|20100220| [12,2]|
// | 12| 15| 0|20100320| [12,2]|
// +---+-----+---+--------+---------+
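If you want string tags closer to the 12a / 12b in the question, one option (just a sketch building on the code above) is to concatenate the id with that per-id birth counter:
import org.apache.spark.sql.functions.{concat, lit, sum, when}

// same running count of births as above, appended to the id as a suffix
// (this yields tags like "12_1", "12_2"; mapping the counter to letters would need a small UDF)
val lifeNo = sum(when($"start" === 0, 1).otherwise(0)).over(w)
df.withColumn("unique_id", concat($"id".cast("string"), lit("_"), lifeNo.cast("string"))).show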
Using Spark and Scala, I have two DataFrames with data values.
I'm trying to accomplish something that would be trivial when processing serially, but seems daunting when processing in a cluster.
Let's say I have two sets of values. One of them is very regular:
Relative Time  Value1
10             1
20             2
30             3
And I want to combine it with another set of values that is very irregular:
Relative Time  Value2
1              100
22             200
And get this (driven by Value1):
Relative Time  Value1  Value2
10             1       100
20             2       100
30             3       200
Note: There are a few scenarios here. One of them is that Value1 is a massive DataFrame and Value2 only has a few hundred values. The other scenario is that they're both massive.
Also note: I depict Value2 as arriving very slowly, and it might, but it could also arrive much faster than Value1, so I may have 10 or 100 values of Value2 before my next value of Value1, and I'd want the latest one. Because of this, doing a union of them and windowing it doesn't seem practical.
How would I accomplish this in Spark?
I think you can do the following:
Full outer join between the two tables
Use the last function (with ignoreNulls) to carry forward the most recent value of value2
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

val df1 = spark.sparkContext.parallelize(Seq(
  (10, 1),
  (20, 2),
  (30, 3)
)).toDF("Relative Time", "value1")

val df2 = spark.sparkContext.parallelize(Seq(
  (1, 100),
  (22, 200)
)).toDF("Relative Time", "value2_temp")

// outer join so every timestamp from either side is present,
// then fill value2 forward with the last non-null value seen so far
val df = df1.join(df2, Seq("Relative Time"), "outer")
val window = Window.orderBy("Relative Time")
val result = df
  .withColumn("value2", last($"value2_temp", ignoreNulls = true).over(window))
  .filter($"value1".isNotNull)
  .drop("value2_temp")
result.show()
+-------------+------+------+
|Relative Time|value1|value2|
+-------------+------+------+
| 10| 1| 100|
| 20| 2| 100|
| 30| 3| 200|
+-------------+------+------+
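For the first scenario in the question (Value2 only has a few hundred rows), a broadcast lookup might be worth considering, since the unpartitioned window above moves all rows into a single partition. A rough sketch, assuming integer timestamps as in the toy data:
import org.apache.spark.sql.functions.udf

// collect the small table, sorted by time, and broadcast it to the executors
val small = df2.orderBy("Relative Time").collect().map(r => (r.getInt(0), r.getInt(1)))
val smallBc = spark.sparkContext.broadcast(small)

// for each Value1 timestamp, look up the latest value2 at or before it
val latestValue2 = udf { (t: Int) =>
  val arr = smallBc.value
  val idx = arr.lastIndexWhere(_._1 <= t)
  val out: Option[Int] = if (idx >= 0) Some(arr(idx)._2) else None
  out
}

val result2 = df1.withColumn("value2", latestValue2($"Relative Time"))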
I have some data (invoice data). Assuming id ~ date and id is what I'm sorting by:
fid, id, due, overdue
0, 1, 5, 0
0, 3, 5, 5
0, 13, 5, 10
0, 14, 5, 0
1, 5, 5, 0
1, 26, 5, 5
1, 27, 5, 10
1, 38, 5, 0
I want to:
remove all rows under some arbitrary date-id, say id = 20
group by fid and sort by id within the group
(major) aggregate a new column overdue_id that is the id of the row before the first row in the group that has a nonzero value for overdue
(minor) output a row for every fid even if all of its rows are filtered out by the first step
So the output would be (with a default value of null):
fid, overdue_id
0, 1
1, null
because for fid = 0, the first id with a nonzero overdue is id = 3, and I'd like to output the id of the row before that in id/date order, which is id = 1.
I have group_by('fid').withColumn('overdue_id', ...), and want to use functions like agg, min and when, but am not sure where to go from there as I am very new to the docs.
You can use the following steps to solve this:
import pyspark.sql.functions as F
from pyspark.sql import Window

# added fid=2 to cover the case where overdue is always 0
fid = [0, 1, 2] * 4
fid.sort()
dateId = [1, 3, 13, 14, 5, 26, 27, 28]
dateId.extend(range(90, 94))
due = [5] * 12
overdue = [0, 5, 10, 0] * 2
overdue.extend([0, 0, 0, 0])
data = list(zip(fid, dateId, due, overdue))
df = spark.createDataFrame(data, schema=["fid", "dateId", "due", "overdue"])

win = Window.partitionBy(df['fid']).orderBy(df['dateId'])
res = df\
    .filter(F.col("dateId") != 20)\
    .withColumn("lag_id", F.lag(F.col("dateId"), 1).over(win))\
    .withColumn("overdue_id", F.when(F.col("overdue") != 0, F.col("lag_id")).otherwise(None))\
    .groupBy("fid")\
    .agg(F.min("overdue_id").alias("min_overdue_id"))
res.show()
+---+--------------+
|fid|min_overdue_id|
+---+--------------+
| 0| 1|
| 1| 5|
| 2| null|
+---+--------------+
You need to use the lag and window functions. Before we begin: why is your example output showing null for fid 1? The first nonzero overdue is at id 26, so the id before that is 5, so shouldn't the result be 5? Unless you need something else, you can try this.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

tst = sqlContext.createDataFrame([(0, 1, 5, 0), (0, 20, 5, 0), (0, 30, 5, 5), (0, 13, 5, 10), (0, 14, 5, 0),
                                  (1, 5, 5, 0), (1, 26, 5, 5), (1, 27, 5, 10), (1, 38, 5, 0)],
                                 schema=["fid", "id", "due", "overdue"])
# Filter out the arbitrary id
tst_f = tst.where('id!=20')
# Define the window
w = Window.partitionBy('fid').orderBy('id')
tst_lag = tst_f.withColumn('overdue_id', F.lag('id').over(w))
# Remove rows with 0 overdue
tst_od = tst_lag.where('overdue!=0')
# The first remaining row per fid holds the id of the row before the first nonzero overdue
tst_res = tst_od.groupby('fid').agg(F.first('overdue_id').alias('overdue_id'))
tst_res.show()
+---+----------+
|fid|overdue_id|
+---+----------+
| 0| 1|
| 1| 5|
+---+----------+
If you are wary of using the first function, or just want to be confident about avoiding subtle ordering issues, you can try the more performance-expensive option below:
# Create a copy (with a dummy column, to avoid an ambiguous self-join) and take the minimum id among the nonzero-overdue rows
tst_min = tst_od.withColumn("dummy", F.lit('dummy')).groupby('fid').agg(F.min('id').alias('id_min'))
# Join this back to the dataframe to get the results
tst_join = tst_od.join(tst_min, on=tst_od.id == tst_min.id_min, how='right')
tst_join.show()
+---+---+---+-------+----------+---+------+
|fid| id|due|overdue|overdue_id|fid|id_min|
+---+---+---+-------+----------+---+------+
| 1| 26| 5| 5| 5| 1| 26|
| 0| 13| 5| 10| 1| 0| 13|
+---+---+---+-------+----------+---+------+
# This way you can see all the information
You can filter the relevant information out of this dataframe using the filter() or where() methods.
I have multiple database rows per personId, with columns that may or may not have values. I'm using colors here because the data is text, not numeric, so it doesn't lend itself to the built-in aggregation functions. A simplified example is:
PersonId ColA ColB ColC
100 red
100 green
100 gold
100 green
110 yellow
110 white
110
120
etc...
I want to be able to decide, in a function, which column's data to use per unique PersonId. A three-way join of the table against itself would be a good solution if the data didn't have multiple values (colors) per column. E.g. that join merges three of the rows into one but still produces multiple rows:
PersonId ColA ColB ColC
100 red green gold
100 green
110 white yellow
110
120
So the solution I'm looking for is something that will allow me to address all the values (colors) for a person in one place (function) so the decision can be made across all their data.
The real data of course has more columns, but the primary ones for this decision are these three. The data is being read in Scala Spark as a DataFrame, and I'd prefer using the API to SQL. I don't know whether any of the exotic window or groupBy functions will help, or if it's going to come down to plain old iterate and accumulate.
The technique used in "How to aggregate values into collection after groupBy?" might be applicable, but it's a bit of a leap.
Consider using a custom UDF for this.
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (100, "red", null, null), (100, null, "white", null),
  (100, null, null, "green"), (200, null, "red", null)
).toDF("PID", "A", "B", "C")
df.show()
+---+----+-----+-----+
|PID| A| B| C|
+---+----+-----+-----+
|100| red| null| null|
|100|null|white| null|
|100|null| null|green|
|200|null| red| null|
+---+----+-----+-----+
// pick the first non-empty value from the collected set, or null if the set is empty
val customUDF = udf((array: Seq[String]) => {
  val newts = array.filter(_.nonEmpty)
  if (newts.isEmpty) null
  else newts.head
})

df.groupBy($"PID")
  .agg(customUDF(collect_set($"A")).as("colA"),
       customUDF(collect_set($"B")).as("colB"),
       customUDF(collect_set($"C")).as("colC"))
  .show
+---+----+-----+-----+
|PID|colA| colB| colC|
+---+----+-----+-----+
|100| red|white|green|
|200|null| red| null|
+---+----+-----+-----+
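If the decision really has to see all of a person's values across the three columns at once (as the question describes), a variation on the same idea is to collect the three sets and hand them to a single UDF. This is only a sketch; the preference rule inside it is a placeholder for whatever logic you actually need:
import org.apache.spark.sql.functions.{collect_set, udf}

// hypothetical decision function: it receives every value a person has in A, B and C,
// so the whole rule lives in one place
val decide = udf((a: Seq[String], b: Seq[String], c: Seq[String]) => {
  // placeholder rule: prefer anything in A, then B, then C
  (a ++ b ++ c).headOption.orNull
})

df.groupBy($"PID")
  .agg(decide(collect_set($"A"), collect_set($"B"), collect_set($"C")).as("chosen"))
  .show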
I have a dataframe which has unique as well as repeated records on the basis of number. Now I want to split the dataframe into two dataframes. In the first dataframe I need only the unique rows, and in the second dataframe I want all the repeated rows. For example:
id name number
1 Shan 101
2 Shan 101
3 John 102
4 Michel 103
The two split dataframes should look like:
Unique
id name number
3 John 102
4 Michel 103
Repeated
id name number
1 Shan 101
2 Shan 101
The solution you tried could probably get you there.
Your data looks like this
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = sc.parallelize(Array(
  (1, "Shan", 101),
  (2, "Shan", 101),
  (3, "John", 102),
  (4, "Michel", 103)
)).toDF("id", "name", "number")
Then you yourself suggest grouping and counting. If you do it like this
val repeatedNames = df.groupBy("name").count.where(col("count")>1).withColumnRenamed("name","repeated").drop("count")
then you could actually get all the way by doing something like this afterwards:
val repeated = df.join(repeatedNames, repeatedNames("repeated")===df("name")).drop("repeated")
val distinct = df.except(repeated)
repeated.show
+---+----+------+
| id|name|number|
+---+----+------+
| 1|Shan| 101|
| 2|Shan| 101|
+---+----+------+
distinct.show
+---+------+------+
| id| name|number|
+---+------+------+
| 4|Michel| 103|
| 3| John| 102|
+---+------+------+
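If the join plus except turns out to be expensive, a possible alternative (only a sketch, keyed on number as the question describes) is a single window count:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit}

// count how many rows share each number, then split on that count
val byNumber = Window.partitionBy("number")
val withCnt = df.withColumn("cnt", count(lit(1)).over(byNumber))
val repeated2 = withCnt.filter(col("cnt") > 1).drop("cnt")
val distinct2 = withCnt.filter(col("cnt") === 1).drop("cnt")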
Hope it helps.
I want to transpose the following table using Spark Scala without the pivot function.
I am using Spark 1.5.1 and pivot is not supported in 1.5.1. Please suggest a suitable method to transpose the following table:
Customer Day Sales
1 Mon 12
1 Tue 10
1 Thu 15
1 Fri 2
2 Sun 10
2 Wed 5
2 Thu 4
2 Fri 3
Output table :
Customer Sun Mon Tue Wed Thu Fri
1 0 12 10 0 15 2
2 10 0 0 5 4 3
The following code does not work because I am using Spark 1.5.1 and the pivot function is only available from Spark 1.6:
var Trans = Cust_Sales.groupBy("Customer").Pivot("Day").sum("Sales")
Not sure how efficient that is, but you can use collect to get all the distinct days, and then add these columns, then use groupBy and sum:
// get distinct days from data (this assumes there are not too many of them):
val days: Array[String] = df.select("Day")
  .distinct()
  .collect()
  .map(_.getAs[String]("Day"))

// add column for each day with the Sale value if days match:
val withDayColumns = days.foldLeft(df) {
  case (data, day) => data.selectExpr("*", s"IF(Day = '$day', Sales, 0) AS $day")
}

// wrap it up
val result = withDayColumns
  .drop("Day")
  .drop("Sales")
  .groupBy("Customer")
  .sum(days: _*)
result.show()
Which prints (almost) what you wanted:
+--------+--------+--------+--------+--------+--------+--------+
|Customer|sum(Tue)|sum(Thu)|sum(Sun)|sum(Fri)|sum(Mon)|sum(Wed)|
+--------+--------+--------+--------+--------+--------+--------+
| 1| 10| 15| 0| 2| 12| 0|
| 2| 0| 4| 10| 3| 0| 5|
+--------+--------+--------+--------+--------+--------+--------+
I'll leave it to you to rename / reorder the columns if needed.
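For completeness, a small sketch of that renaming step, reusing the days and result values from above:
// strip the sum(...) wrappers so the columns are simply named after the days
val renamed = days.foldLeft(result) {
  case (data, day) => data.withColumnRenamed(s"sum($day)", day)
}
// reorder with a select("Customer", "Sun", "Mon", ...) if a fixed day order matters
renamed.show()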
If you are working with Python, the code below might help. Let's say you want to transpose the Spark DataFrame df:
# note: toPandas() collects the whole DataFrame onto the driver
pandas_df = df.toPandas().transpose().reset_index()
transposed_df = sqlContext.createDataFrame(pandas_df)
transposed_df.show()
Consider a data frame which has 6 columns, where we want to group by the first 4 columns, pivot on col5 and aggregate on col6 (say, sum it).
If you were able to use Spark 1.6, the code could simply be:
// col1 ... col6 are placeholders for your actual column names
val pivotedDf = df_to_pivot
  .groupBy(col1, col2, col3, col4)
  .pivot(col5)
  .agg(sum(col6))
Here is code with the same output but without using the built-in pivot function, so it also works in Spark 1.5:
import scala.collection.SortedMap
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DateType, DoubleType, StringType, StructField, StructType}

// col1 ... col6, the *Index values and the *Name values below are placeholders
// for your actual column names and positions
//Extracting the col5 distinct values to create the new columns
val distinctCol5Values = df_to_pivot
.select(col(col5))
.distinct
.sort(col5) // ensure that the output columns are in a consistent logical order
.map(_.getString(0))
.toArray
.toSeq
//Grouping by the data frame to be pivoted on col1-col4
val pivotedAgg = df_to_pivot.rdd
.groupBy{row=>(row.getString(col1Index),
row.getDate(col2Index),
row.getDate(col3Index),
row.getString(col4Index))}
//Initializing a List of tuple of (String, double values) to be filled in the columns that will be created
val pivotColListTuple = distinctCol5Values.map(ft=> (ft,0.0))
// Using Sorted Map to ensure the order is maintained
var distinctCol5ValuesListMap = SortedMap(pivotColListTuple : _*)
//Pivoting the data on col5 by opening the grouped data
val pivotedRDD = pivotedAgg.map{groupedRow=>
distinctCol5ValuesListMap = distinctCol5ValuesListMap.map(ft=> (ft._1,0.0))
groupedRow._2.foreach{row=>
//Updating the distinctCol5ValuesListMap values to reflect the changes
//Change this part accordingly to what you want
distinctCol5ValuesListMap = distinctCol5ValuesListMap.updated(row.getString(col5Index),
distinctCol5ValuesListMap.getOrElse(row.getString(col5Index),0.0)+row.getDouble(col6Index))
}
Row.fromSeq(Seq(groupedRow._1._1,groupedRow._1._2,groupedRow._1._3,groupedRow._1._4) ++ distinctCol5ValuesListMap.values.toSeq)
}
//Constructing the structFields for the new columns
val colTypesStruct = distinctCol5ValuesListMap.map(colName=>StructField(colName._1,DoubleType))
//Adding the first four column structFields with the new columns struct
val opStructType = StructType(Seq(StructField(col1Name,StringType),
StructField(col2Name,DateType),
StructField(col3Name,DateType),
StructField(col4Name,StringType)) ++ colTypesStruct )
//Creating the final data frame
val pivotedDF = sqlContext.createDataFrame(pivotedRDD,opStructType)