I have a DataFrame with rows that look like this:
[WrappedArray(1, 5DC7F285-052B-4739-8DC3-62827014A4CD, 1, 1425450997, 714909, 1425450997, 714909, {}, 2013, GAVIN, ST LAWRENCE, M, 9)]
[WrappedArray(2, 17C0D0ED-0B12-477B-8A23-1ED2C49AB8AF, 2, 1425450997, 714909, 1425450997, 714909, {}, 2013, LEVI, ST LAWRENCE, M, 9)]
[WrappedArray(3, 53E20DA8-8384-4EC1-A9C4-071EC2ADA701, 3, 1425450997, 714909, 1425450997, 714909, {}, 2013, LOGAN, NEW YORK, M, 44)]
...
Everything before the year (2013 in this example) is nonsense that should be dropped. I would like to map the data to a Name class that I have created and put it into a new dataframe.
How do I get to the data and do that mapping?
Here is my Name class:
case class Name(year: Int, first_name: String, county: String, sex: String, count: Int)
Basically, I would like to fill my dataframe with rows and columns according to the schema of the Name class. I know how to do this part, but I just don't know how to get to the data in the dataframe.
Assuming the data is an array of strings like this:
val df = Seq(Seq("1", "5DC7F285-052B-4739-8DC3-62827014A4CD", "1", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "GAVIN", "STLAWRENCE", "M", "9"),
Seq("2", "17C0D0ED-0B12-477B-8A23-1ED2C49AB8AF", "2", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "LEVI", "ST LAWRENCE", "M", "9"),
Seq("3", "53E20DA8-8384-4EC1-A9C4-071EC2ADA701", "3", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "LOGAN", "NEW YORK", "M", "44"))
.toDF("array")
You could either use a UDF that returns the case class, or you can call withColumn multiple times. The latter should be more efficient and can be done like this:
import org.apache.spark.sql.types.IntegerType

val df2 = df.withColumn("year", $"array"(8).cast(IntegerType))
  .withColumn("first_name", $"array"(9))
  .withColumn("county", $"array"(10))
  .withColumn("sex", $"array"(11))
  .withColumn("count", $"array"(12).cast(IntegerType))
  .drop($"array")
  .as[Name]
This will give you a Dataset[Name]:
+----+----------+-----------+---+-----+
|year|first_name|county     |sex|count|
+----+----------+-----------+---+-----+
|2013|GAVIN     |ST LAWRENCE|M  |9    |
|2013|LEVI      |ST LAWRENCE|M  |9    |
|2013|LOGAN     |NEW YORK   |M  |44   |
+----+----------+-----------+---+-----+
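For completeness, here is a rough sketch of the UDF alternative mentioned above (untested; it assumes the same df with a single array<string> column named array, the Name case class from the question, and spark.implicits._ in scope):
import org.apache.spark.sql.functions.udf

// Build a Name from the raw array in one pass; the UDF produces a struct column.
val toName = udf((a: Seq[String]) =>
  Name(a(8).toInt, a(9), a(10), a(11), a(12).toInt))

val ds = df.select(toName($"array").as("name"))
  .select("name.*") // flatten the struct into the Name fields
  .as[Name]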
Hope it helped!
Related
Given a Spark DataFrame with columns "id", "first", "last", "year"
val df = sc.parallelize(Seq(
(1, "John", "Doe", 1986),
(2, "Ive", "Fish", 1990),
(4, "John", "Wayne", 1995)
)).toDF("id", "first", "last", "year")
and case class
case class IdAndLastName(
  id: Int,
  last: String
)
I would like to select only the columns in the case class, which are id and last. In other words, I would like to get the output of df.select("id", "last") by using the case class, so that I avoid hardcoding the attribute names. Could you please help me achieve this in a compact way?
You can explicitly create an encoder for the case class (usually this happens implicitly). Then you can get the field names from the encoder and use them in the select statement:
import org.apache.spark.sql.Encoders

val fieldnames = Encoders.product[IdAndLastName].schema.fieldNames
df.select(fieldnames.head, fieldnames.tail:_*).show()
Output:
+---+-----+
| id| last|
+---+-----+
|  1|  Doe|
|  2| Fish|
|  4|Wayne|
+---+-----+
Alternatively, and more compactly, you can map the field names directly to columns:
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.col

val cols = Encoders.product[IdAndLastName].schema.fieldNames.map(col)
df.select(cols: _*).show()
I am trying to capitalize some words in a column in my Spark dataframe. The words are all in a list.
val wrds = List("usa", "gb")
val dF = List(
(1, "z",3, "Bob lives in the usa"),
(4, "t", 2, "gb is where Beth lives"),
(5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")
I would like to have an output of
val dF = List(
(1, "z",3, "Bob lives in the USA"),
(4, "t", 2, "GB is where Beth lives")
(5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")
It seems I have to split the string in the column and then capitalize a piece only if it is present in the list. I am mostly struggling with row 3, where I do not want to capitalize "ogb" even though it does contain "gb". Could anyone point me in the right direction?
import org.apache.spark.sql.functions._
import spark.implicits._

val words = Array("usa", "gb")
val df = List(
(1, "z",3, "Bob lives in the usa"),
(4, "t", 2, "gb is where Beth lives"),
(5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")
val replaced = words.foldLeft(df){
case (adf, word) =>
adf.withColumn("country", regexp_replace($"country", "(\\b" + word + "\\b)", word.toUpperCase))
}
replaced.show
Output:
+---+----+-----+--------------------+
| id|name|thing|             country|
+---+----+-----+--------------------+
|  1|   z|    3|Bob lives in the USA|
|  4|   t|    2|GB is where Beth ...|
|  5|   t|    2|                 ogb|
+---+----+-----+--------------------+
I'm looking for a way to do this without a UDF; I am wondering if it's possible. Let's say I have a DataFrame as follows:
Buyer_name  Buyer_state  CoBuyer_name  CoBuyer_state  Price  Date
Bob         CA           Joe           CA             20     010119
Stacy       IL           Jamie         IL             50     020419
... about 3 million more rows ...
And I want to turn it into:
Buyer_name  Buyer_state  Price  Date
Bob         CA           20     010119
Joe         CA           20     010119
Stacy       IL           50     020419
Jamie       IL           50     020419
...
Edit: Alternatively, I could also (see the sketch after this list):
Create two dataframes, removing the "Buyer" columns from one and the "CoBuyer" columns from the other.
Rename the "CoBuyer" columns to the corresponding "Buyer" column names.
Concatenate (union) both dataframes.
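A rough sketch of those steps, using the column names from the sample data above (union plays the role of concatenation):
val buyers = df.drop("CoBuyer_name", "CoBuyer_state")

val coBuyers = df.drop("Buyer_name", "Buyer_state")
  .withColumnRenamed("CoBuyer_name", "Buyer_name")
  .withColumnRenamed("CoBuyer_state", "Buyer_state")

// Both halves now share the same columns in the same order, so union lines them up.
val combined = buyers.union(coBuyers)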
You can group struct(Buyer_name, Buyer_state) and struct(CoBuyer_name, CoBuyer_state) into an Array which is then expanded using explode, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("Bob", "CA", "Joe", "CA", 20, "010119"),
("Stacy", "IL", "Jamie", "IL", 50, "020419")
).toDF("Buyer_name", "Buyer_state", "CoBuyer_name", "CoBuyer_state", "Price", "Date")
df.
withColumn("Buyers", array(
struct($"Buyer_name".as("_1"), $"Buyer_state".as("_2")),
struct($"CoBuyer_name".as("_1"), $"CoBuyer_state".as("_2"))
)).
withColumn("Buyer", explode($"Buyers")).
select(
$"Buyer._1".as("Buyer_name"), $"Buyer._2".as("Buyer_state"), $"Price", $"Date"
).show
// +----------+-----------+-----+------+
// |Buyer_name|Buyer_state|Price|  Date|
// +----------+-----------+-----+------+
// |       Bob|         CA|   20|010119|
// |       Joe|         CA|   20|010119|
// |     Stacy|         IL|   50|020419|
// |     Jamie|         IL|   50|020419|
// +----------+-----------+-----+------+
This sounds like an unpivot operation to me, which can be accomplished with the DataFrame union function:
val df = Seq(
("Bob", "CA", "Joe", "CA", 20, "010119"),
("Stacy", "IL", "Jamie", "IL", 50, "020419")
).toDF("Buyer_name", "Buyer_state", "CoBuyer_name", "CoBuyer_state", "Price", "Date")
val df_new = df.select("Buyer_name", "Buyer_state", "Price", "Date")
  .union(df.select("CoBuyer_name", "CoBuyer_state", "Price", "Date"))
df_new.show
Thanks to Leo for providing the dataframe definition which I've re-used.
I have a dataframe where I need to aggregate one column grouped by all of the other columns. I do not want to list all of those columns, comma separated, in groupBy, as I have about 30 of them. Could somebody tell me how I can do this in a way that is more readable?
Right now I am doing: df.groupBy("c1","c2","c3","c4","c5","c6","c7","c8","c9","c10",....).agg(c11)
I want to know if there is any better way.
Thanks,
John
Specifying the columns is the clean way to do it, but I believe you have quite a few options.
One of them is to go through Spark SQL and compose the query string programmatically.
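For example, a sketch of that approach (assuming, for illustration, that the aggregation is a sum of a column called c11 and that the dataframe is registered as a temp view named t):
df.createOrReplaceTempView("t")

// Build the GROUP BY list as a string from every column except c11.
val groupCols = df.columns.filterNot(_ == "c11").mkString(", ")
val result = spark.sql(s"select $groupCols, sum(c11) from t group by $groupCols")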
Another option could be to use the varargs : _* on a list of column names, like this:
val cols = ...
df.groupBy(cols: _*).agg(...)
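For instance, cols could be derived from the schema rather than typed out by hand (again just a sketch, assuming the aggregated column is c11 and using avg as the aggregation):
import org.apache.spark.sql.functions.{avg, col}

// Group by every column except the one being aggregated.
val cols = df.columns.filterNot(_ == "c11").map(col).toSeq
df.groupBy(cols: _*).agg(avg("c11"))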
Use the steps below:
Get the columns as a list.
Remove the column that needs to be aggregated from that list.
Apply groupBy and agg.
Example:
import org.apache.spark.sql.functions.avg

val seq = Seq(
  (101, "abc", 24), (102, "cde", 24), (103, "efg", 22), (104, "ghi", 21),
  (105, "ijk", 20), (106, "klm", 19), (107, "mno", 18), (108, "pqr", 18),
  (109, "rst", 26), (110, "tuv", 27), (111, "pqr", 18), (112, "rst", 28),
  (113, "tuv", 29)
)
val df = sc.parallelize(seq).toDF("id", "name", "age")
val colsList = df.columns.toList
(colsList: List[String] = List(id, name, age))
val groupByColumns = colsList.slice(0, colsList.size-1)
(groupByColumns: List[String] = List(id, name))
val aggColumn = colsList.last
(aggColumn: String = age)
df.groupBy(groupByColumns.head, groupByColumns.tail: _*).agg(avg(aggColumn)).show
+---+----+--------+
| id|name|avg(age)|
+---+----+--------+
|105| ijk|    20.0|
|108| pqr|    18.0|
|112| rst|    28.0|
|104| ghi|    21.0|
|111| pqr|    18.0|
|113| tuv|    29.0|
|106| klm|    19.0|
|102| cde|    24.0|
|107| mno|    18.0|
|101| abc|    24.0|
|103| efg|    22.0|
|110| tuv|    27.0|
|109| rst|    26.0|
+---+----+--------+
The following code gives a dataframe in which each column holds three values, as shown below.
import org.graphframes._
import org.apache.spark.sql.DataFrame
val v = sqlContext.createDataFrame(List(
("1", "Al"),
("2", "B"),
("3", "C"),
("4", "D"),
("5", "E")
)).toDF("id", "name")
val e = sqlContext.createDataFrame(List(
("1", "3", 5),
("1", "2", 8),
("2", "3", 6),
("2", "4", 7),
("2", "1", 8),
("3", "1", 5),
("3", "2", 6),
("4", "2", 7),
("4", "5", 8),
("5", "4", 8)
)).toDF("src", "dst", "property")
val g = GraphFrame(v, e)
val paths: DataFrame = g.bfs.fromExpr("id = '1'").toExpr("id = '5'").run()
paths.show()
val df=paths
df.select(df.columns.filter(_.startsWith("e")).map(df(_)) : _*).show
Output of the above code is given below:
+-------+-------+-------+
|     e0|     e1|     e2|
+-------+-------+-------+
|[1,2,8]|[2,4,7]|[4,5,8]|
+-------+-------+-------+
In the above output, we can see that each column has three values and they can be interpreted as follows.
e0: source 1, destination 2, distance 8
e1: source 2, destination 4, distance 7
e2: source 4, destination 5, distance 8
Basically, e0, e1, and e2 are the edges. I want to sum the third element of each column, i.e. add up the distance of each edge to get the total distance. How can I achieve this?
It can be done like this:
import org.apache.spark.sql.functions.col

val total = df.columns.filter(_.startsWith("e"))
  .map(c => col(s"$c.property")) // or col(c).getItem("property")
  .reduce(_ + _)
df.withColumn("total", total)
I would make a collection of the columns to sum and then use foldLeft with a UDF:
scala> val df = Seq((Array(1,2,8),Array(2,4,7),Array(4,5,8))).toDF("e0", "e1", "e2")
df: org.apache.spark.sql.DataFrame = [e0: array<int>, e1: array<int>, e2: array<int>]
scala> df.show
+---------+---------+---------+
|       e0|       e1|       e2|
+---------+---------+---------+
|[1, 2, 8]|[2, 4, 7]|[4, 5, 8]|
+---------+---------+---------+
scala> val colsToSum = df.columns
colsToSum: Array[String] = Array(e0, e1, e2)
scala> val accLastUDF = udf((acc: Int, col: Seq[Int]) => acc + col.last)
accLastUDF: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,List(IntegerType, ArrayType(IntegerType,false)))
scala> df.withColumn("dist", colsToSum.foldLeft(lit(0))((acc, colName) => accLastUDF(acc, col(colName)))).show
+---------+---------+---------+----+
|       e0|       e1|       e2|dist|
+---------+---------+---------+----+
|[1, 2, 8]|[2, 4, 7]|[4, 5, 8]|  23|
+---------+---------+---------+----+