Assume I have a data frame df in Spark, with a structure like the following.
Input:
amount city
10000 la
12145 ng
14000 wy
18000 la
How can I subset the data frame for amount > 10000?
Expected Output:
amount city
12145 ng
14000 wy
18000 la
In R I can do something like this:
df1 <- df[df$amount > 10000 ,]
I know I can use Spark SQL to do the same, but what is the step that is similar to the above?
From the docs:
http://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-dataset-operations-aka-dataframe-operations
val df1 = df.filter($"amount" > 10000)
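For completeness, a couple of equivalent ways to write the same filter in Spark Scala (a minimal sketch; the val names are only for illustration, filter and where are interchangeable, and col() can replace the $ syntax):
import org.apache.spark.sql.functions.col

val byCol  = df.filter(col("amount") > 10000)  // col() instead of the $ interpolator
val byExpr = df.where("amount > 10000")        // SQL-style expression string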
I am trying to convert a DataFrame into an RDD and split it into a specific number of columns, dynamically and elegantly, based on the number of columns in the DataFrame.
For example, this is sample data from a Hive table called employee:
Id Name Age State City
123 Bob 34 Texas Dallas
456 Stan 26 Florida Tampa
val temp_df = spark.sql("Select * from employee")
val temp2_rdd = temp_df.rdd.map(x => (x(0), x(1), x(2), x(3)))
I am looking to generate temp2_rdd dynamically based on the number of columns in the table.
It should not be hard coded as I did above.
Since the maximum tuple size in Scala is 22, is there another collection that can hold the RDD rows efficiently?
Coding language: Spark Scala.
Please advise.
Instead of extracting and transforming each element by index, you can use the toSeq method of the Row object:
val temp_df = spark.sql("Select * from employee")
// RDD[List[Any]]
val temp2_rdd = temp_df.rdd.map(_.toSeq.toList)
// RDD[List[String]]
val temp3_rdd = temp_df.rdd.map(_.toSeq.map(_.toString).toList)
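As a quick usage check, here is a self-contained sketch with a small in-memory DataFrame standing in for the Hive table employee (the local-mode session setup and sample values are assumptions for illustration only):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("row-to-seq").getOrCreate()
import spark.implicits._

// Stand-in for spark.sql("Select * from employee")
val temp_df = Seq(
  (123, "Bob", 34, "Texas", "Dallas"),
  (456, "Stan", 26, "Florida", "Tampa")
).toDF("Id", "Name", "Age", "State", "City")

// Works for any number of columns -- no hard-coded indices
val temp3_rdd = temp_df.rdd.map(_.toSeq.map(_.toString).toList)

temp3_rdd.collect().foreach(println)
// e.g. List(123, Bob, 34, Texas, Dallas)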
I want to perform a group by on each column of the data frame using Spark SQL. The DataFrame will have approx. 1000 columns.
I have tried iterating over all the columns in the data frame and performing groupBy on each column, but the program runs for more than 1.5 hours:
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "exp", "keyspace" -> "testdata"))
.load()
val groupedData = df.columns.map(c => df.groupBy(c).count().take(10).toList)
println("Printing grouped data: " + groupedData.toList)
If I have, for example, the columns Name and Amount in the DataFrame, then the output should look like this:
GroupBy on column Name:
Name Count
Jon 2
Ram 5
David 3
GroupBy on column Amount:
Amount Count
1000 4
2525 3
3000 3
I want the group by result for each column.
The only way I can see to speed this up is to cache the df straight after reading it.
Unfortunately, each computation is independent and you have to do all of them; there is no work-around.
Something like this can speed it up a little bit, but not that much:
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "exp", "keyspace" -> "testdata"))
.load()
.cache()
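After caching, the per-column loop from the question can be reused essentially as-is; a sketch of how that might look (keeping the 10-row sample per column from the original code):
// Each groupBy is still a separate job, but the Cassandra scan now happens only once
val groupedData = df.columns.map(c => (c, df.groupBy(c).count().take(10).toList))

groupedData.foreach { case (colName, counts) =>
  println(s"GroupBy on column $colName:")
  counts.foreach(println)
}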
I have a dataframe like below:
group value
B 2
B 3
A 5
A 6
Now I need to subtract rows based on group, i.e. 2-3 and 5-6. After the transformation it should look like this:
group value
B -1
A -1
I tried the code below but it couldn't solve my case:
val df2 = df1.groupBy("Group").agg(first("Value")-second(col("Value")))
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead
val df2 = df1.select($"group", $"value",
  ($"value" - lead("value").over(Window.partitionBy("group").orderBy("value"))).as("diff"))
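Note that lead returns null on the last row of each partition, so filtering those rows out gives exactly the expected two-row output. A minimal sketch building on the window expression above (assuming spark.implicits._ is in scope for the $ syntax):
val w = Window.partitionBy("group").orderBy("value")

val diffs = df1
  .select($"group", ($"value" - lead("value").over(w)).as("value"))
  .where($"value".isNotNull)  // drop the last row of each group, where lead is null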
I guess you're trying to subtract two neighboring values in order.
This works for me:
val df2 = df1.groupBy("Group").agg(first("Value").minus(last(col("Value"))))
I'm looking for a way to calculate some statistic, e.g. the mean, over several selected columns in Spark using Scala. Given that data is my Spark DataFrame, it's easy to calculate a mean for one column only, e.g.
data.agg(avg("var1") as "mean var1").show
Also, we can easily calculate a mean cross-tabulated by values of some other columns e.g.:
data.groupBy("category").agg(avg("var1") as "mean_var1").show
But how can we calculate a mean for a List of columns in a DataFrame? I tried running something like this, but it didn't work:
scala> data.select("var1", "var2").mean().show
<console>:44: error: value mean is not a member of org.apache.spark.sql.DataFrame
data.select("var1", "var2").mean().show
^
This is what you need to do:
import org.apache.spark.sql.functions._
import spark.implicits._
val data = Seq((1,2,3), (3,4,5), (1,2,4)).toDF("A", "B", "C")
data.select(data.columns.map(mean(_)): _*).show()
Output:
+------------------+------------------+------+
| avg(A)| avg(B)|avg(C)|
+------------------+------------------+------+
|1.6666666666666667|2.6666666666666665| 4.0|
+------------------+------------------+------+
This works for selected columns:
data.select(Seq("A", "B").map(mean(_)): _*).show()
Output:
+------------------+------------------+
| avg(A)| avg(B)|
+------------------+------------------+
|1.6666666666666667|2.6666666666666665|
+------------------+------------------+
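If the column names arrive as a List variable, as in the question, the same pattern applies (a small sketch using the same toy columns):
val cols = List("A", "B")
data.select(cols.map(mean(_)): _*).show()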
Hope this helps!
If you already have the Dataset you can do this:
ds.describe("age")
Which will return this:
summary age
count 10.0
mean 53.3
stddev 11.6
min 18.0
max 92.0
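describe also accepts several column names at once, so the same summary can be produced for a list of columns (here var1 and var2 from the question, purely as an illustration):
ds.describe("var1", "var2").show()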
I am using Spark with Scala, Spark version 1.5, and I am trying to transform an input dataframe that has name-value combinations into a new dataframe in which all the names are transposed to columns and the values become rows.
I/P DataFrame:
ID Name Value
1 Country US
2 Country US
2 State NY
3 Country UK
4 Country India
4 State MH
5 Country US
5 State NJ
5 County Hudson
Transposed DataFrame
ID Country State County
1 US NULL NULL
2 US NY NULL
3 UK NULL NULL
4 India MH NULL
5 US NJ Hudson
It seems like pivot would help in this use case, but it's not supported in the Spark 1.5.x version.
Any pointers/help?
This is really ugly data, but you can always filter and join:
val names = Seq("Country", "State", "County")
names.map(name =>
df.where($"Name" === name).select($"ID", $"Value".alias("name"))
).reduce((df1, df2) => df1.join(df2, Seq("ID"), "leftouter"))
map creates a list of three DataFrames, where each one contains only the records for a single name. Next we simply reduce this list using a left outer join, so putting it all together you get something like this:
(left-outer-join
(left-outer-join
(where df (=== name "Country"))
(where df (=== name "State")))
(where df (=== name "County")))
Note: If you use Spark >= 1.6 with Python or Scala, or Spark >= 2.0 with R, just use pivot with first:
Reshaping/Pivoting data in Spark RDD and/or Spark DataFrames
How to pivot DataFrame?
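For reference, a minimal sketch of the pivot-with-first approach mentioned in the note above (assuming Spark >= 1.6 and the same column names as in the question):
import org.apache.spark.sql.functions.first

// Pivot the Name column into Country/State/County columns, taking the first Value per cell
val transposed = df
  .groupBy("ID")
  .pivot("Name", Seq("Country", "State", "County"))
  .agg(first("Value"))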