I have a Spark RDD whose entries I want to sort in an organized manner. Let's say the entry is a tuple with 3 elements (name,phonenumber,timestamp). I want to sort the entries first depending on the value of phonenumber and then depending on the value of timestamp while respecting and not changing the sort that was done based on phonenumber. (so timestamp only re-arranges based on the phonenumber sort). Is there a Spark function to do this?
(I am using Spark 2.x with Scala)
To sort an RDD by multiple elements, you can use the sortBy function. Below is some sample code in Python; you can implement the same in other languages as well.
tmp = [('a', 1), ('a', 2), ('1', 3), ('1', 4), ('2', 5)]
sc.parallelize(tmp).sortBy(lambda x: (x[0], x[1]), False).collect()
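Since the question is about Scala, the equivalent call there would look roughly like this (a sketch; the second argument of sortBy toggles ascending order, so false gives a descending sort just like the Python snippet above):
val tmp = Seq(("a", 1), ("a", 2), ("1", 3), ("1", 4), ("2", 5))
sc.parallelize(tmp).sortBy(x => (x._1, x._2), ascending = false).collect()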
Regards,
Neeraj
You can use the sortBy function on the RDD as below (this example sorts by name and then timestamp; use x._2 in place of x._1 to sort by phonenumber first):
val df = spark.sparkContext.parallelize(Seq(
  ("a", "1", "2017-03-10"),
  ("b", "12", "2017-03-9"),
  ("b", "123", "2015-03-12"),
  ("c", "1234", "2015-03-15"),
  ("c", "12345", "2015-03-12")
)) //.toDF("name", "phonenumber", "timestamp")
df.sortBy(x => (x._1, x._3)).collect().foreach(println)
Output:
(a,1,2017-03-10)
(b,123,2015-03-12)
(b,12,2017-03-9)
(c,12345,2015-03-12)
(c,1234,2015-03-15)
Here collect() brings the sorted result to the driver so it prints in order (foreach(println) directly on the RDD prints from the executors and does not preserve the sort order). Note also that the timestamps are plain strings and compare lexicographically, so zero-pad the day (e.g. 2017-03-09) if you need true chronological ordering.
If you have a DataFrame created with toDF("name", "phonenumber", "timestamp"), then you could simply do:
df.sort("name", "timestamp")
Hope this helps!
I would like to filter out from the allData DataFrame those records whose types appear in the wrongTypes DataFrame (it has about 2000 records). I convert the wrongTypes DataFrame to a list of Strings and then, in the filter, check whether each record is in that list.
Here is my code:
import org.apache.spark.sql.functions._
import spark.implicits._
val allData = Seq(
  ("id1", "X"),
  ("id2", "X"),
  ("id3", "Y"),
  ("id4", "A")
).toDF("id", "type")
val wrongTypes = Seq(
  ("X"),
  ("Y"),
  ("Z")
).toDF("type").select("type").map(r => r.getString(0)).collect.toList
allData.filter(col("type").isin(wrongTypes)).show()
and I get this error:
SparkRuntimeException: The feature is not supported: literal for 'List(X, Y, Z)' of class scala.collection.immutable.$colon$colon.
allData.filter(col("type").isInCollection(wrongTypes)).show()
isin() is a function with a variable number of arguments, so if you want to pass a collection to it you must use the splat operator:
col("type").isin(wrongTypes: _*)
Another option is to use isInCollection(), which takes an Iterable as its argument. I suggest using it, since you mentioned you expect about 2,000 entries.
If the wrongTypes DataFrame is large, you may want to avoid collecting it and just work with the DataFrames directly.
I am not sure how to express it at the DataFrame level, but in SQL terms you want the IN clause, like this:
allData.createOrReplaceTempView("all_data")
wrongTypes.createOrReplaceTempView("wrong_types")
spark.sql("""
SELECT
*
FROM all_data
WHERE type in (SELECT type FROM wrong_types)
""").show()
I created a DataFrame in the Spark Scala shell for SFPD incidents. I queried the data for the count per Category and the result is a DataFrame. I want to plot this data as a graph using Wisp. Here is my DataFrame:
+--------------+--------+
| Category|catcount|
+--------------+--------+
| LARCENY/THEFT| 362266|
|OTHER OFFENSES| 257197|
| NON-CRIMINAL| 189857|
| ASSAULT| 157529|
| VEHICLE THEFT| 109733|
| DRUG/NARCOTIC| 108712|
| VANDALISM| 91782|
| WARRANTS| 85837|
| BURGLARY| 75398|
|SUSPICIOUS OCC| 64452|
+--------------+--------+
I want to convert this DataFrame into an array (or list) of key-value pairs, so I want a result like this, with (String, Int) type:
(LARCENY/THEFT,362266)
(OTHER OFFENSES,257197)
(NON-CRIMINAL,189857)
(ASSAULT,157529)
(VEHICLE THEFT,109733)
(DRUG/NARCOTIC,108712)
(VANDALISM,91782)
(WARRANTS,85837)
(BURGLARY,75398)
(SUSPICIOUS OCC,64452)
I tried converting this DataFrame (t) into an RDD with val rddt = t.rdd and then used flatMapValues:
rddt.flatMapValues(x=>x).collect()
but still couldn't get the required result.
Or is there a way to directly give the dataframe output into Wisp?
In PySpark it'd be as below; Scala will be quite similar.
Creating test data
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,1), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
Map the test data, reformatting from an RDD of Rows to an RDD of tuples. Then use collect to extract all the tuples as a list.
df.rdd.map(lambda x: (x[0], x[1])).collect()
[(0, 1), (0, 1), (0, 2), (1, 2), (1, 1), (1, 20), (3, 18), (3, 18), (3, 18)]
Here's the Scala Spark Row documentation that should help you convert this to Scala Spark code
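Since the question uses Scala, here is a minimal sketch of the same conversion, assuming the DataFrame is the t from the question and that catcount is a Long (which is what count() normally produces):
// pull each Row back to the driver as a (String, Int) pair
val pairs: Array[(String, Int)] =
  t.rdd.map(r => (r.getString(0), r.getLong(1).toInt)).collect()
pairs.foreach(println)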
Is it possible to divide a DataFrame into two parts using a single filter operation? For example, let's say df has the records below:
UID Col
1 a
2 b
3 c
If I do
df1 = df.filter($"UID" <=> 2)
can I save the filtered and non-filtered records into different RDDs in a single operation?
df1 would have the records where uid = 2
df2 would have the records with uid 1 and 3
If you're interested only in saving data you can add an indicator column to the DataFrame:
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
val dfWithInd = df.withColumn("ind", $"uid" <=> 2)
and use it as a partition column for the DataFrameWriter with one of the supported formats (as of 1.6 these are Parquet, text, and JSON):
dfWithInd.write.partitionBy("ind").parquet(...)
It will create two separate directories (ind=false, ind=true) on write.
In general though, it is not possible to yield multiple RDDs or DataFrames from a single transformation. See How to split a RDD into two or more RDDs?
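If you do need both subsets as separate DataFrames in the same job, the usual workaround is two complementary filters over a cached parent. That is still two filter operations, just without recomputing the source twice; a sketch reusing the df above:
df.cache() // avoid recomputing/re-reading the source for each filter
val df1 = df.filter($"uid" <=> 2)    // rows where uid = 2
val df2 = df.filter(!($"uid" <=> 2)) // all other rows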
This question is about the duality between DataFrame and RDD when it comes to aggregation operations. In Spark SQL one can use table generating UDFs for custom aggregations but creating one of those is typically noticeably less user-friendly than using the aggregation functions available for RDDs, especially if table output is not required.
Is there an efficient way to apply pair RDD operations such as aggregateByKey to a DataFrame that has been grouped using GROUP BY or ordered using ORDER BY?
Normally, one would need an explicit map step to create key-value tuples, e.g., dataFrame.rdd.map(row => (row.getString(row.fieldIndex("category")), row)).aggregateByKey(...). Can this be avoided?
Not really. While DataFrames can be converted to RDDs and vice versa, this is a relatively complex operation, and methods like DataFrame.groupBy don't have the same semantics as their counterparts on RDDs.
The closest thing you can get is the new Dataset API introduced in Spark 1.6.0. It provides much closer integration with DataFrames and a GroupedDataset class with its own set of methods, including reduce, cogroup, and mapGroups:
case class Record(id: Long, key: String, value: Double)
val df = sc.parallelize(Seq(
(1L, "foo", 3.0), (2L, "bar", 5.6),
(3L, "foo", -1.0), (4L, "bar", 10.0)
)).toDF("id", "key", "value")
val ds = df.as[Record]
ds.groupBy($"key").reduce((x, y) => if (x.id < y.id) x else y).show
// +-----+-----------+
// | _1| _2|
// +-----+-----------+
// |[bar]|[2,bar,5.6]|
// |[foo]|[1,foo,3.0]|
// +-----+-----------+
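As a side note, in Spark 2.x GroupedDataset was renamed to KeyValueGroupedDataset and typed grouping is done with groupByKey; a rough 2.x equivalent of the call above, reusing the same ds:
ds.groupByKey(_.key)
  .reduceGroups((x, y) => if (x.id < y.id) x else y)
  .show()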
In some specific cases it is possible to leverage Orderable semantics to group and process data using structs or arrays. You'll find an example in SPARK DataFrame: select the first row of each group
I joined 2 RDDs and now, when I try to access the new RDD's fields, I need to treat them as tuples. This leads to code that is not very readable. I tried using type to create some aliases, but it doesn't work and I still need to access the fields as tuples. Any idea how to make the code more readable?
for example - when trying to filter rows in the joined RDD:
val joinedRDD = RDD1.join(RDD2).filter(x => x._2._2._5 != "temp")
I would like to use names instead of 2,5 etc.
Thanks
Use pattern matching wisely.
val rdd1 = sc.parallelize(List(("John", (28, true)), ("Mary", (22, true))))
val rdd2 = sc.parallelize(List(("John", List(100, 200, -20))))
rdd1
  .join(rdd2)
  .map {
    case (name, ((age, isProlonged), payments)) => (name, payments.sum)
  }
  .filter {
    case (name, sum) => sum > 0
  }
  .collect()
res0: Array[(String, Int)] = Array((John,280))
Another option is to use the DataFrame abstraction over RDDs and write SQL queries.
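A minimal sketch of that DataFrame route, reusing rdd1 and rdd2 from above (the column names are illustrative assumptions):
import org.apache.spark.sql.functions.sum
import spark.implicits._
// give the fields names instead of tuple positions
val df1 = rdd1.map { case (name, (age, isProlonged)) => (name, age, isProlonged) }
  .toDF("name", "age", "isProlonged")
val df2 = rdd2.flatMap { case (name, payments) => payments.map(p => (name, p)) }
  .toDF("name", "payment")
df1.join(df2, "name")
  .groupBy("name")
  .agg(sum("payment").as("totalPayments"))
  .filter($"totalPayments" > 0)
  .show()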