With R's dplyr I would calculate variance between groups like so:
df %>% group_by(group) %>% summarise(total = sum(value)) %>% summarise(variance_between_groups = var(total))
Trying to perform the same action with Spark's DataFrame API:
df.groupBy(group).agg(sum(value).alias("total")).agg(var_samp(total).alias("variance_between_groups"))
I receive an error in the second agg saying that it can't find total. I am clearly misunderstanding something, so any help would be appreciated.
var_samp() takes a String-type column name, hence you need to provide a String as follows:
import org.apache.spark.sql.functions._
val df = Seq(
  ("a", 1.0),
  ("a", 2.5),
  ("a", 1.5),
  ("b", 2.0),
  ("b", 1.6)
).toDF("group", "value")

df.groupBy("group").
  agg(sum("value").alias("total")).
  agg(var_samp("total").alias("variance_between_groups")).
  show
// +-----------------------+
// |variance_between_groups|
// +-----------------------+
// | 0.9799999999999999|
// +-----------------------+
It can also take a column (of Column type), e.g. var_samp($"total"). See Spark's API doc for more details.
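For reference, a minimal sketch of the Column-typed variant (assuming spark.implicits._ is in scope for the $ syntax, which it is by default in spark-shell):
import spark.implicits._

// Same aggregation, passing Column objects instead of String column names.
df.groupBy($"group").
  agg(sum($"value").alias("total")).
  agg(var_samp($"total").alias("variance_between_groups")).
  show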
For one of the data cleaning steps, I would like to gather insight into how the unique values are existing as a percentage of the total row count so that I can apply a threshold and decide if I should completely remove this column / feature. For this I came up with this function as below:
def uniqueValuesAsPercentage(data: DataFrame) = {
  val (rowCount, columnCount) = shape(data)
  data.selectExpr(data.head().getValuesMap[Long](data.columns).map(elem => {
    val (columnName, uniqueCount) = elem
    val percentage = uniqueCount / rowCount * 100
    (columnName, uniqueCount, percentage)
  }))
}
But it fails with the following error:
<console>:90: error: type mismatch;
found : scala.collection.immutable.Iterable[(String, Long, Long)]
required: String
data.selectExpr(data.head().getValuesMap[Long](data.columns).map(elem => {
Unfortunately, since this is in an Apache Zeppelin notebook, I'm also missing the capabilities of an IDE. I have IntelliJ Ultimate, but the Big Data Tools support seems not to be available for my version of the IDE. Very annoying!
Any ideas as to what the problem is here? I guess I'm messing something up with the DataFrame in the selectExpr(...). As can be seen, I'm returning a tuple with the information I calculate.
You can calculate it in a much simpler way:
import org.apache.spark.sql.functions.{col, count, countDistinct, lit}
import spark.implicits._
// define a dataframe for example
val data = Seq(("1", "1"), ("1", "2"), ("1", "3"), ("1", "4")).toDF("col_a", "col_b")
data.select(data.columns.map(c => (lit(100) * countDistinct(col(c)) / count(col(c))).alias(c)): _*).show()
// output:
+-----+-----+
|col_a|col_b|
+-----+-----+
| 25.0|100.0|
+-----+-----+
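To feed this into the thresholding step mentioned in the question, one possible sketch is to collect the single result row and drop the columns that fall below a cutoff (the 5% threshold and the getAs[Double] calls are assumptions, not from the original post):
// Sketch: drop columns whose distinct-value percentage is below a chosen threshold.
// The percentages come back as Doubles because the division casts to double.
val pctRow = data
  .select(data.columns.map(c => (lit(100) * countDistinct(col(c)) / count(col(c))).alias(c)): _*)
  .head()

val threshold = 5.0
val columnsToDrop = data.columns.filter(c => pctRow.getAs[Double](c) < threshold)
val cleaned = data.drop(columnsToDrop: _*)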
I am new to Apache Spark, I have a use case to find the date gap identification between multiple dates.
e.g.
In the above example, the member had a gap from 2018-02-01 to 2018-02-14. How can I find this in Apache Spark 2.3.4 using Scala?
The expected output for the above scenario is:
You could use datediff along with the Window function lag to check for day gaps between the current and previous rows, and compute the missing date ranges with some date functions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
import java.sql.Date
val df = Seq(
  (1, Date.valueOf("2018-01-01"), Date.valueOf("2018-01-31")),
  (1, Date.valueOf("2018-02-16"), Date.valueOf("2018-02-28")),
  (1, Date.valueOf("2018-03-01"), Date.valueOf("2018-03-31")),
  (2, Date.valueOf("2018-07-01"), Date.valueOf("2018-07-31")),
  (2, Date.valueOf("2018-08-16"), Date.valueOf("2018-08-31"))
).toDF("MemberId", "StartDate", "EndDate")
val win = Window.partitionBy("MemberId").orderBy("StartDate", "EndDate")
df.
  withColumn("PrevEndDate", coalesce(lag($"EndDate", 1).over(win), date_sub($"StartDate", 1))).
  withColumn("DayGap", datediff($"StartDate", $"PrevEndDate")).
  where($"DayGap" > 1).
  select($"MemberId", date_add($"PrevEndDate", 1).as("StartDateGap"), date_sub($"StartDate", 1).as("EndDateGap")).
  show
// +--------+------------+----------+
// |MemberId|StartDateGap|EndDateGap|
// +--------+------------+----------+
// | 1| 2018-02-01|2018-02-15|
// | 2| 2018-08-01|2018-08-15|
// +--------+------------+----------+
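Note that the coalesce with date_sub($"StartDate", 1) supplies a synthetic previous end date for each member's first row, so its DayGap evaluates to 1 and the first interval is never reported as a gap. A quick way to sanity-check the window logic is to look at the intermediate columns before the filter (a sketch; output omitted):
// Inspect PrevEndDate and DayGap before filtering out the non-gaps.
df.
  withColumn("PrevEndDate", coalesce(lag($"EndDate", 1).over(win), date_sub($"StartDate", 1))).
  withColumn("DayGap", datediff($"StartDate", $"PrevEndDate")).
  orderBy("MemberId", "StartDate").
  show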
I have a dataframe like this:
val df = Seq(
  ("a", Seq(2.0)),
  ("a", Seq(1.0)),
  ("a", Seq(0.5)),
  ("b", Seq(24.0)),
  ("b", Seq(12.5)),
  ("b", Seq(6.4)),
  ("b", Seq(3.2)),
  ("c", Seq(104.0)),
  ("c", Seq(107.4))
).toDF("key", "value")
I need to use an algorithm that takes in input a DataFrame object on distinct groups.
To make this clearer, assume that I have to use StandardScaler scaling by groups.
In pandas I would do something like this (many type changes in the process):
from sklearn.preprocessing import StandardScaler
df.groupby(key) \
.value \
.transform(lambda x: StandardScaler \
.fit_transform(x \
.values \
.reshape(-1,1)) \
.reshape(-1))
I need to do this in Scala because the algorithm I need to use is not the Scaler but something else built in Scala.
So far I've tried to do something like this:
import org.apache.spark.ml.feature.StandardScaler
def f(X: org.apache.spark.sql.Column): org.apache.spark.sql.Column = {
  val scaler = new StandardScaler()
    .setInputCol("value")
    .setOutputCol("scaled")
  val output = scaler.fit(X)("scaled")
  (output)
}
df.withColumn("scaled_values", f(col("features")).over(Window.partitionBy("key")))
but of course it gives me an error:
command-144174313464261:21: error: type mismatch;
found : org.apache.spark.sql.Column
required: org.apache.spark.sql.Dataset[_]
val output = scaler.fit(X)("scaled")
So I'm trying to transform a single Column object into a DataFrame object, without success. How do I do it?
If it's not possible, is there any workaround to solve this?
UPDATE 1
It seems I made some mistakes in the code; I tried to fix it (I think I got it right):
val df = Seq(
  ("a", 2.0),
  ("a", 1.0),
  ("a", 0.5),
  ("b", 24.0),
  ("b", 12.5),
  ("b", 6.4),
  ("b", 3.2),
  ("c", 104.0),
  ("c", 107.4)
).toDF("key", "value")
def f(X: org.apache.spark.sql.DataFrame): org.apache.spark.sql.Column = {
  val assembler = new VectorAssembler()
    .setInputCols(Array("value"))
    .setOutputCol("feature")
  val scaler = new StandardScaler()
    .setInputCol("feature")
    .setOutputCol("scaled")
  val pipeline = new Pipeline()
    .setStages(Array(assembler, scaler))
  val output = pipeline.fit(X).transform(X)("scaled")
  (output)
}
someDF.withColumn("scaled_values", f(someDF).over(Window.partitionBy("key")))
I still get an error:
org.apache.spark.sql.AnalysisException: Expression 'scaled#1294' not
supported within a window function.;;
I am not sure about the reason for this error; I tried aliasing the column, but it doesn't seem to work.
So I'm trying to transform a single Column object into a DataFrame object, without success. How do I do it?
You can't. A Column just references a column of a DataFrame; it does not contain any data and is not a data structure like a DataFrame.
Your f function will also not work like this. If you want to create a custom function to be used with Window, then you need a UDAF (User-Defined Aggregate Function), which is pretty hard to write...
In your case, I would do a groupBy on key, collect_list your values, and then apply a UDF to do the scaling. Note that this only works if the data per key is not too large (i.e., it fits into one executor); otherwise you need a UDAF.
Here is an example:
import org.apache.spark.sql.functions._

// example Scala method: min-max scale each group's values to the 0-1 range
def myScaler(data: Seq[Double]) = {
  val mi = data.min
  val ma = data.max
  data.map(x => (x - mi) / (ma - mi))
}
val udf_myScaler = udf(myScaler _)
df
  .groupBy($"key")
  .agg(
    collect_list($"value").as("values")
  )
  .select($"key", explode(arrays_zip($"values", udf_myScaler($"values"))))
  .select($"key", $"col.values", $"col.1".as("values_scaled"))
  .show()
gives:
+---+------+-------------------+
|key|values| values_scaled|
+---+------+-------------------+
| c| 104.0| 0.0|
| c| 107.4| 1.0|
| b| 24.0| 1.0|
| b| 12.5|0.44711538461538464|
| b| 6.4|0.15384615384615385|
| b| 3.2| 0.0|
| a| 2.0| 1.0|
| a| 1.0| 0.3333333333333333|
| a| 0.5| 0.0|
+---+------+-------------------+
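Since the question was about StandardScaler-style scaling, a z-score variant of the same UDF pattern might look like this (a sketch: it uses the population standard deviation and assumes each group has at least two distinct values, so the deviation is non-zero):
// Sketch: z-score scaling ((x - mean) / stddev) per group, plugged into the same
// groupBy / collect_list / explode pattern as above.
def standardize(data: Seq[Double]): Seq[Double] = {
  val mean = data.sum / data.size
  val std = math.sqrt(data.map(x => math.pow(x - mean, 2)).sum / data.size)
  data.map(x => (x - mean) / std)
}
val udf_standardize = udf(standardize _)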
I have a dataframe that looks like this
+--------------------
| unparsed_data|
+--------------------
|02020sometext5002...
|02020sometext6682...
I need to get it split up into something like this
+--------------------
|fips | Name | Id ...
+--------------------
|02020 | sometext | 5002...
|02020 | sometext | 6682...
I have a list like this
val fields = List(
  ("fips", 5),
  ("Name", 8),
  ("Id", 27)
  // ...more fields
)
I need the split to take the first 5 characters in unparsed_data and map them to fips, take the next 8 characters in unparsed_data and map them to Name, then the next 27 characters and map them to Id, and so on. I need the split to use/reference the field lengths supplied in the list to do the splitting/slicing, as there are a lot of fields and the unparsed_data field is very long.
My Scala is still pretty weak, and I assume the answer would look something like this
df.withColumn("temp_field", split("unparsed_data", //some regex created from the list values?)).map(i => //some mapping to the field names in the list)
Any suggestions/ideas are much appreciated.
You can use foldLeft to traverse your fields list to iteratively create columns from the original DataFrame using substring. It applies regardless of the size of the fields list:
import org.apache.spark.sql.functions._
val df = Seq(
  ("02020sometext5002"),
  ("03030othrtext6003"),
  ("04040moretext7004")
).toDF("unparsed_data")

val fields = List(
  ("fips", 5),
  ("name", 8),
  ("id", 4)
)

val resultDF = fields.foldLeft( (df, 1) ){ (acc, field) =>
  val newDF = acc._1.withColumn(
    field._1, substring($"unparsed_data", acc._2, field._2)
  )
  (newDF, acc._2 + field._2)
}._1.
  drop("unparsed_data")
resultDF.show
// +-----+--------+----+
// | fips| name| id|
// +-----+--------+----+
// |02020|sometext|5002|
// |03030|othrtext|6003|
// |04040|moretext|7004|
// +-----+--------+----+
Note that a Tuple2[DataFrame, Int] is used as the accumulator for foldLeft to carry both the iteratively transformed DataFrame and next offset position for substring.
This can get you going. Depending on your needs, it can get more and more complicated with variable lengths etc., which you do not state. But I think you can use a column list.
import org.apache.spark.sql.functions._
val df = Seq(
  ("12334sometext999")
).toDF("X")

val df2 = df.selectExpr("substring(X, 0, 5)", "substring(X, 6, 8)", "substring(X, 14, 3)")
df2.show
In this case it gives (you can rename the columns afterwards):
+------------------+------------------+-------------------+
|substring(X, 0, 5)|substring(X, 6, 8)|substring(X, 14, 3)|
+------------------+------------------+-------------------+
| 12334| sometext| 999|
+------------------+------------------+-------------------+
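If you want to drive it from the (name, length) list rather than hardcoding the positions, one possible sketch (the column name X and the field names/lengths are just the ones from the example above) is to precompute the 1-based offsets and build the selectExpr strings:
// Sketch: build the substring expressions from a (name, length) list.
// scanLeft accumulates the 1-based start position of each field.
val fields = List(("fips", 5), ("name", 8), ("id", 3))

val exprs = fields.scanLeft(("", 1, 0)) { case ((_, pos, len), (name, width)) =>
  (name, pos + len, width)
}.tail.map { case (name, pos, width) =>
  s"substring(X, $pos, $width) as $name"
}

df.selectExpr(exprs: _*).show()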
I created a dataframe in the Spark Scala shell for SFPD incidents. I queried the data for Category count and the result is a dataframe. I want to plot this data in a graph using Wisp. Here is my dataframe:
+--------------+--------+
| Category|catcount|
+--------------+--------+
| LARCENY/THEFT| 362266|
|OTHER OFFENSES| 257197|
| NON-CRIMINAL| 189857|
| ASSAULT| 157529|
| VEHICLE THEFT| 109733|
| DRUG/NARCOTIC| 108712|
| VANDALISM| 91782|
| WARRANTS| 85837|
| BURGLARY| 75398|
|SUSPICIOUS OCC| 64452|
+--------------+--------+
I want to convert this dataframe into an ArrayList of key-value pairs. So I want a result like this, with (String, Int) type:
(LARCENY/THEFT,362266)
(OTHER OFFENSES,257197)
(NON-CRIMINAL,189857)
(ASSAULT,157529)
(VEHICLE THEFT,109733)
(DRUG/NARCOTIC,108712)
(VANDALISM,91782)
(WARRANTS,85837)
(BURGLARY,75398)
(SUSPICIOUS OCC,64452)
I tried converting this dataframe (t) into an RDD with val rddt = t.rdd, and then used flatMapValues:
rddt.flatMapValues(x=>x).collect()
but still couldn't get the required result.
Or is there a way to directly give the dataframe output into Wisp?
In PySpark it'd be as below. Scala will be quite similar.
Creating test data
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,1), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
Map the test data, reformatting from an RDD of Rows to an RDD of tuples. Then use collect to extract all the tuples as a list.
df.rdd.map(lambda x: (x[0], x[1])).collect()
[(0, 1), (0, 1), (0, 2), (1, 2), (1, 1), (1, 20), (3, 18), (3, 18), (3, 18)]
Here's the Scala Spark Row documentation; it should help you convert this to Scala Spark code.
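For completeness, a Scala sketch of the same conversion on your dataframe t (assuming Category is a String and catcount is a Long, as a count aggregation normally produces; adjust the getters to the actual schema):
// Collect the (Category, catcount) pairs to the driver as an Array of tuples.
val pairs: Array[(String, Long)] =
  t.rdd.map(row => (row.getString(0), row.getLong(1))).collect()

pairs.foreach(println)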