I have a simple example
val arrayStructureData = Seq(
Row("test1|value1"),
Row("test2|value2"),
Row("test3|value3")
)
val arrayStructureSchema = new StructType()
.add("name", StringType)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)
import spark.implicits._
val distPhens = df.flatMap(row => row.getString(0).split("\\|"))
.filter(x => x.like("test[0-9]+"))
.toDF("distinct_phens")
where I'm trying to run filter after running flatMap. The desired output is :
value1
value2
value3
If I understand correctly, like expects a column but I am not sure how to "refer" to the column after flatMap has been executed.
I need this filter operation to run after flatMap.
Thanks in advance.
You can refer to the column object using col, and do an rlike filter:
val result = df.flatMap(row => row.getString(0).split("\\|")).filter(col("value").rlike("test[0-9]+"))
result.show
+-----+
|value|
+-----+
|test1|
|test2|
|test3|
+-----+
I have to map a list of columns to another column in a Spark dataset: think something like this
val translationMap: Map[Column, Column] = Map(
lit("foo") -> lit("bar"),
lit("baz") -> lit("bab")
)
And I have a dataframe like this one:
val df = Seq("foo", "baz").toDF("mov")
So I intend to perform the translation like this:
df.select(
col("mov"),
translationMap(col("mov"))
)
but this piece of code spits the following error
key not found: movs
java.util.NoSuchElementException: key not found: movs
Is there a way to perform such translation without concatenating hundreds of whens? think that translationMap could have lots of pairs key-value.
Instead of Map[Column, Column] you should use a Column containing a map literal:
import org.apache.spark.sql.functions.typedLit
val translationMap: Column = typedLit(Map(
"foo" -> "bar",
"baz" -> "bab"
))
The rest of your code can stay as-is:
df.select(
col("mov"),
translationMap(col("mov"))
).show
+---+---------------------------------------+
|mov|keys: [foo,baz], values: [bar,bab][mov]|
+---+---------------------------------------+
|foo| bar|
|baz| bab|
+---+---------------------------------------+
You can not refer a Scala collection declared on the driver like this inside a distributed dataframe. An alternative would be to use a UDF which will not be performance efficient if you have a large dataset since UDFs are not optimized by Spark.
val translationMap = Map( "foo" -> "bar" , "baz" -> "bab" )
val getTranslationValue = udf ((x: String)=>translationMap.getOrElse(x,null.asInstanceOf[String]) )
df.select(col("mov"), getTranslationValue($"mov").as("value") ).show
//+---+-----+
//|mov|value|
//+---+-----+
//|foo| bar|
//|baz| bab|
//+---+-----+
Another solution would be to load the Map as a DataSet[(String, String)] and the join the two datasets taking mov as the key.
As said in:
https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
The describe() function works for each numerical column, It is possible to do it against rows? My DF size is 53 cols and 346,143 rows, so transpose is not an option. How can I do it?
I'm using Spark 2.11
You can do your own UDF. Either you make an separate UDF for each quantity or put everything in 1 UDF returning a complex result:
val df = Seq(
(1.0,2.0,3.0,4.0,5.0)
).toDF("x1","x2","x3","x4","x5")
val describe = udf(
{ xs : Seq[Double] =>
val xmin = xs.min
val xmax = xs.max
val mean = xs.sum/xs.size.toDouble
(xmin,xmax,mean)
}
)
df
.withColumn("describe",describe(array("*")))
.withColumn("min",$"describe._1")
.withColumn("max",$"describe._2")
.withColumn("mean",$"describe._3")
.drop($"describe")
.show
gives:
+---+---+---+---+---+---+---+----+
| x1| x2| x3| x4| x5|min|max|mean|
+---+---+---+---+---+---+---+----+
|1.0|2.0|3.0|4.0|5.0|1.0|5.0| 3.0|
+---+---+---+---+---+---+---+----+
I have a DataFrame that is created as follows:
df = sc
.textFile("s3n://bucket/key/data.txt")
.map(_.split(","))
.toDF()
This is the content of data.txt:
123,2016-11-09,1
124,2016-11-09,2
123,2016-11-10,1
123,2016-11-11,1
123,2016-11-12,1
124,2016-11-13,1
124,2016-11-14,1
Is it possible to filter df in order to get the sum of 3rd column values for 123 for the last N days starting from now? I am interested in a flexible solution so that N could be defined as a parameter.
For example, if today would be 2016-11-16 and N would be equal to 5, then the sum of 3rd column values for 124 would be equal to 2.
This is my current solution:
df = sc
.textFile("s3n://bucket/key/data.txt")
.map(_.split(","))
.toDF(["key","date","qty"])
val starting_date = LocalDate.now().minusDays(x_last_days)
df.filter(col("key") === "124")
.filter(to_date(df("date")).gt(starting_date))
.agg(sum(col("qty")))
but it does not seem to work properly. 1. The line where I define column names ["key","date","qty"] does not compile for Scala 2.10.6 and Spark 1.6.2. 2. Also it returns a dataframe, while I need Int. Should I just do toString.toInt?
Both of the following won't compile :
scala> val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF(["key","date","qty"])
// <console>:1: error: illegal start of simple expression
// val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF(["key","date","qty"])
^
scala> val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF
// <console>:27: error: value toDF is not a member of org.apache.spark.rdd.RDD[Array[String]]
// val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF
^
The first one won't because it's a incorrect syntax and as for the second, it is because, like the error says, it's not a member, in other terms, the action is not supported.
The later one will compile with Spark 2.x but the following solution would also apply or you'll have a DataFrame with one column of type ArrayType.
Now let's solve the issue :
scala> :pa
// Entering paste mode (ctrl-D to finish)
import sqlContext.implicits._ // you don't need to import this in the shell.
val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1"))
.map{ _.split(",") match { case Array(a,b,c) => (a,b,c) }}.toDF("key","date","qty")
// Exiting paste mode, now interpreting.
// df: org.apache.spark.sql.DataFrame = [key: string, date: string, qty: string]
You can apply any filter you want and compute the aggregation needed, e.g :
scala> val df2 = df.filter(col("key") === "124").agg(sum(col("qty")))
// df2: org.apache.spark.sql.DataFrame = [sum(qty): double]
scala> df2.show
// +--------+
// |sum(qty)|
// +--------+
// | 4.0|
// +--------+
PS: The above code has been tested in Spark 1.6.2 and 2.0.0
I am new to spark and using spark 1.6.1. I am using the pivot function to create a new column based on a integer value. Say I have a csv file like this:
year,winds
1990,50
1990,55
1990,58
1991,45
1991,42
1991,58
I am loading the csv file like this:
var df =sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("data/sample.csv")
I want to aggregate the winds colmnn filtering those winds greater than 55 so that I get an output file like this:
year, majorwinds
1990,2
1991,1
I am using the code below:
val df2=df.groupBy("major").pivot("winds").agg(>55)->"count")
But I get this error
error: expected but integer literal found
What is the correct syntax here? Thanks in advance
In your case, if you just want output like:
+----+----------+
|year|majorwinds|
+----+----------+
|1990| 2|
|1991| 1|
+----+----------+
It's not necessary to use pivot.
You could reach this by using filter, groupBy and count:
df.filter($"winds" >= 55)
.groupBy($"year")
.count()
.withColumnRenamed("count", "majorwinds")
.show()
use this generic funtion to do pivot
def transpose(sqlCxt: SQLContext, df: DataFrame, compositeId: Vector[String], pair: (String, String), distinctCols: Array[Any]): DataFrame = {
val rdd = df.map { row => (compositeId.collect { case id => row.getAs(id).asInstanceOf[Any] }, scala.collection.mutable.Map(row.getAs(pair._1).asInstanceOf[Any] -> row.getAs(pair._2).asInstanceOf[Any])) }
val pairRdd = rdd.reduceByKey(_ ++ _)
val rowRdd = pairRdd.map(r => dynamicRow(r, distinctCols))
sqlCxt.createDataFrame(rowRdd, getSchema(compositeId ++ distinctCols))
}