In Spark SQL 1.6, using DataFrames, is there a way to calculate, for a specific column and for every row, the fraction obtained by dividing the current row's value by the next row's value?
For example, if I have a table with one column, like so
Age
100
50
20
4
I'd like the following output
Fraction
2
2.5
5
The last row is dropped because it has no "next row" to divide by.
Right now I am doing it by ranking the table and joining it with itself, where the rank equals rank+1.
Is there a better way to do this?
Can this be done with a Window function?
A Window function does only part of the trick. The rest can be done by defining a udf function
def div = udf((age: Double, lag: Double) => lag/age)
First we find the lag using a Window function, and then pass that lag and age to the udf function to compute the div
import sqlContext.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val dataframe = Seq(
("A",100),
("A",50),
("A",20),
("A",4)
).toDF("person", "Age")
val windowSpec = Window.partitionBy("person").orderBy(col("Age").desc)
val newDF = dataframe.withColumn("lag", lag(dataframe("Age"), 1) over(windowSpec))
And finally call the udf function
newDF.filter(newDF("lag").isNotNull).withColumn("div", div(newDF("Age"), newDF("lag"))).drop("Age", "lag").show
Final output would be
+------+---+
|person|div|
+------+---+
| A|2.0|
| A|2.5|
| A|5.0|
+------+---+
Edited
As @Jacek has suggested, a better solution is to use .na.drop instead of .filter(newDF("lag").isNotNull) and the / operator, so we don't even need to call the udf function:
newDF.na.drop.withColumn("div", newDF("lag")/newDF("Age")).drop("Age", "lag").show
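For completeness, the same result can be computed in a single pass with lead instead of lag (a minimal sketch, assuming the same dataframe and windowSpec defined above; next_age and fractions are illustrative names):
val fractions = dataframe
  .withColumn("next_age", lead(col("Age"), 1).over(windowSpec))  // next row's Age in descending order
  .na.drop()                                                     // drops the last row, which has no next Age
  .withColumn("Fraction", col("Age") / col("next_age"))
  .drop("next_age")
fractions.show()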
I have some individual values and I have to convert them into a dataframe. I tried the code below; only one row of output should come out.
val matchingcount= 3
val notmatchingcount=5
val filename=h:/filename1
import spark.implicits._
val data=Seq("+filename+","+matchingcount+","+notmatchingcount+").toDF("ezfilename","match_count","non_matchcount")
data.show()
throwing error :
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: the number of columns doesn't match.
Old column names (1): value
New column names (8) : ezfilename,match_count,non_matchcount
Any help please
You were almost there! The code that does what you want is the following:
val matchingcount= 3
val notmatchingcount=5
val filename="h:/filename1"
import spark.implicits._
val data=Seq((filename,matchingcount,notmatchingcount)).toDF("ezfilename","match_count","non_matchcount")
data.show()
+------------+-----------+--------------+
| ezfilename|match_count|non_matchcount|
+------------+-----------+--------------+
|h:/filename1| 3| 5|
+------------+-----------+--------------+
There are 3 key differences between your code and the code above:
In Scala, a string literal has to be surrounded by " characters, so I've added them to val filename=.
You were correct in that you can use a Seq with the toDF method after importing spark.implicits._, but each element of the Seq represents one row of the dataframe. So instead of creating a dataframe with 3 columns, you were creating one with a single column and 3 rows. The way to get 3 columns is to put tuples inside your Seq: notice the difference between Seq(bla, bla, bla) and Seq((bla, bla, bla)), where the latter is the correct one. You can also create multiple rows like this: Seq((bla, bli, blu), (blo, ble, bly)), as shown in the sketch after this list.
In Scala, the way you access a variable's value is by simply writing the variable's name. So writing filename instead of "+filename+" is the correct way of doing that.
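A small runnable sketch of the multi-row case (the second row's values are made up purely for illustration):
import spark.implicits._
val multiRow = Seq(
  ("h:/filename1", 3, 5),
  ("h:/filename2", 7, 2)   // hypothetical second row
).toDF("ezfilename", "match_count", "non_matchcount")
multiRow.show()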
Hope this helps!
Let's say I have the following two dataframes:
DF1:
+----------+----------+----------+
| Place|Population| IndexA|
+----------+----------+----------+
| A| Int| X_A|
| B| Int| X_B|
| C| Int| X_C|
+----------+----------+----------+
DF2:
+----------+----------+
| City| IndexB|
+----------+----------+
| D| X_D|
| E| X_E|
| F| X_F|
| ....| ....|
| ZZ| X_ZZ|
+----------+----------+
The dataframes above are normally of much larger size.
I want to determine to which City(DF2) the shortest distance is from every Place from DF1. The distance can be calculated based on the index. So for every row in DF1, I have to iterate over every row in DF2 and look for the shortest distances based on the calculations with the indexes. For the distance calculation there is a function defined:
val distance = udf(
(indexA: Long, indexB: Long) => {
h3.instance.h3Distance(indexA, indexB)
})
I tried the following:
val output = DF1.agg(functions.min(distance(col("IndexA"), DF2.col("IndexB"))))
With this, the code compiles but I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s)
H3Index#220L missing from Places#316,Population#330,IndexAx#338L in operator !Aggregate
[min(if ((isnull(IndexA#338L) OR isnull(IndexB#220L))) null else
UDF(knownnotnull(IndexA#338L), knownnotnull(IndexB#220L))) AS min(UDF(IndexA, IndexB))#346].
So I suppose I am doing something wrong when iterating over each row in DF2 while taking one row from DF1, but I couldn't find a solution.
What am I doing wrong? And am I in the right direction?
You are getting this error because the index column you are using only exists in DF2 and not DF1 where you are attempting to perform the aggregation.
In order to make this field accessible and determine the distance from all points, you would need to:
Cross join DF1 and DF2 so that every index of DF1 is matched with every index of DF2
Determine the distance using your udf
Find the min of the distances on this new cross-joined dataframe
This may look like :
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min, udf}
val distance = udf(
(indexA: Long, indexB: Long) => {
h3.instance.h3Distance(indexA, indexB)
})
val resultDF = DF1.crossJoin(DF2)
.withColumn("distance", distance(col("IndexA"), col("IndexB")))
// Instead of using a groupBy and then matching the min distance of the aggregation with the initial df,
// I've chosen to use a window function min to determine the min_distance of each group (determined by Place)
// and then keep only the city with the min distance to each place
.withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
.where(col("distance") === col("min_distance"))
.drop("min_distance")
This will result in a dataframe with the columns from both dataframes and an additional column, distance.
NB: your current approach, which compares every item in one df to every item in another df, is an expensive operation. If you have the opportunity to filter early (e.g. joining on heuristic columns, i.e. other columns which may indicate that a place is closer to a city), this is recommended, as sketched below.
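A purely hypothetical sketch of that idea (the region column does not exist in the sample data and is only assumed here to illustrate pruning the cross product):
val prunedDF = DF1.join(DF2, Seq("region"))   // hypothetical shared column used as an equi-join key
  .withColumn("distance", distance(col("IndexA"), col("IndexB")))
  .withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
  .where(col("distance") === col("min_distance"))
  .drop("min_distance")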
Let me know if this works for you.
If you have only a few cities (less than or around 1000), you can avoid the crossJoin and Window shuffle by collecting the cities into an array and then performing the distance computation for each place using this collected array:
import org.apache.spark.sql.functions.{array_min, col, struct, transform, typedLit, udf}
val citiesIndexes = df2.select("City", "IndexB")
.collect()
.map(row => (row.getString(0), row.getLong(1)))
val result = df1.withColumn(
"City",
array_min(
transform(
typedLit(citiesIndexes),
x => struct(distance(col("IndexA"), x.getItem("_2")), x.getItem("_1"))
)
).getItem("col2")
)
This piece of code works for Spark 3.0 and greater. If you are on a Spark version older than 3.0, you should replace the array_min(...).getItem("col2") part with a user-defined function, for example as sketched below.
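A minimal sketch of such a user-defined function (assuming the same collected citiesIndexes array and the h3 distance call from the question; nearestCity and resultPre3 are illustrative names):
val nearestCity = udf { (indexA: Long) =>
  // pick the city whose index minimizes the distance to this place's index
  citiesIndexes.minBy { case (_, indexB) => h3.instance.h3Distance(indexA, indexB) }._1
}
val resultPre3 = df1.withColumn("City", nearestCity(col("IndexA")))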
I have the following question:
Actually I am working with the following csv file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a Spark dataframe (see the code below).
My aim is to check the length and type of each field in the dataframe against the set of rules below:
col type
job char10
marital char7
I started implementing the check of the length of each field, but I am getting a compilation error:
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")
data.map(line => {
val fields = line.toString.split(";")
fields(0).size
fields(1).size
})
The expected output should be:
List(10,10)
As for the type check, I don't have any idea how to implement it since we are using dataframes. Is there a function for verifying the data format?
Thanks a lot in advance for your replies.
I see you are trying to use a Dataframe, but since there are multiple double quotes you can read the file as a textFile, remove them, and convert to a Dataframe as below:
import org.apache.spark.sql.functions._
import spark.implicits._
val raw = spark.read.textFile("path to file ")
  .map(_.replaceAll("\"", ""))                 // strip every double quote
val header = raw.first                          // first line contains the column names
val data = raw.filter(row => row != header)     // drop the header line
  .map { r => val x = r.split(";"); (x(0), x(1)) }
  .toDF(header.split(";"): _*)                  // use the header fields as column names
Then data.show(false) gives:
+----------+-------+
|job |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the size you can use withColumn with the length function and adapt as you need.
data.withColumn("jobSize", length($"job"))
.withColumn("martialSize", length($"marital"))
.show(false)
Output:
+----------+-------+-------+-----------+
|job       |marital|jobSize|maritalSize|
+----------+-------+-------+-----------+
|management|married|10 |7 |
|technician|single |10 |6 |
+----------+-------+-------+-----------+
All the column types are String.
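To inspect the column types programmatically (a short sketch using the same data frame), the schema can be examined:
data.printSchema()
// or iterate over the fields yourself:
data.schema.foreach(f => println(s"${f.name} -> ${f.dataType}"))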
Hope this helps!
You are using a dataframe, so when you use the map method, you are processing a Row in your lambda.
So line is a Row.
Row.toString returns a string representation of the Row, in your case two struct fields typed as String.
If you want to use map and process your Row, you have to get the values inside the fields manually, with getAs[String] or getString(index), as in the sketch below.
Usually when you use Dataframes, you should work in column logic as in SQL, using select, where..., or directly the SQL syntax.
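A minimal sketch of the Row-based variant described above (assuming a data frame with clean string columns job and marital, like the one built in the other answer, and spark.implicits._ in scope for the tuple encoder):
val sizes = data.map { row =>
  // read each field by name and compute its length
  (row.getAs[String]("job").length, row.getAs[String]("marital").length)
}
sizes.show()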
I'm looking for a way to calculate some statistic e.g. mean over several selected columns in Spark using Scala. Given that data object is my Spark DataFrame, it's easy to calculate a mean for one column only e.g.
data.agg(avg("var1") as "mean var1").show
Also, we can easily calculate a mean cross-tabulated by values of some other columns e.g.:
data.groupBy("category").agg(avg("var1") as "mean_var1").show
But how can we calculate a mean for a List of columns in a DataFrame? I tried running something like this, but it didn't work:
scala> data.select("var1", "var2").mean().show
<console>:44: error: value mean is not a member of org.apache.spark.sql.DataFrame
data.select("var1", "var2").mean().show
^
This is what you need to do
import org.apache.spark.sql.functions._
import spark.implicits._
val data = Seq((1,2,3), (3,4,5), (1,2,4)).toDF("A", "B", "C")
data.select(data.columns.map(mean(_)): _*).show()
Output:
+------------------+------------------+------+
| avg(A)| avg(B)|avg(C)|
+------------------+------------------+------+
|1.6666666666666667|2.6666666666666665| 4.0|
+------------------+------------------+------+
This works for selected columns
data.select(Seq("A", "B").map(mean(_)): _*).show()
Output:
+------------------+------------------+
| avg(A)| avg(B)|
+------------------+------------------+
|1.6666666666666667|2.6666666666666665|
+------------------+------------------+
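Combining this with the cross-tabulation from your question (a sketch, assuming your original data frame from the question, with category, var1 and var2 columns):
val means = Seq("var1", "var2").map(mean(_))
data.groupBy("category").agg(means.head, means.tail: _*).show()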
Hope this helps!
If you already have the dataset you can do this:
ds.describe(s"age")
Which will return this:
+-------+----+
|summary| age|
+-------+----+
|  count|10.0|
|   mean|53.3|
| stddev|11.6|
|    min|18.0|
|    max|92.0|
+-------+----+
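describe also accepts several columns at once (a sketch, assuming a dataset with the var1 and var2 columns from your question):
ds.describe("var1", "var2").show()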
I just started learning how to use DataFrame and Column in Spark/Scala. I know that if I want to show something on the screen, I can just do df.show(). But how can I do this for a column? For example,
scala> val dfcol = df.apply("sgan")
dfcol: org.apache.spark.sql.Column = sgan
This finds a column called "sgan" in the dataframe df and assigns it to dfcol, so dfcol is a Column. Then, if I do
scala> abs(dfcol)
res29: org.apache.spark.sql.Column = abs(sgan)
I just get the result shown above. How can I show the result of this function on the screen like df.show() does? Or, in other words, how can I see the results of functions like abs, min and so forth?
You should always work with a dataframe; Column objects are not meant to be inspected this way. You can use select to create a dataframe with the column you're interested in, and then use show():
df.select(functions.abs(df("sgan"))).show()
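The same pattern works for other functions in org.apache.spark.sql.functions, e.g. min (a sketch, assuming the same df):
import org.apache.spark.sql.functions.min
df.select(min(df("sgan"))).show()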