How to create a DataFrame from one column in PySpark?

I have sliced out one column of type Column in PySpark.
x = game_reviews.groupBy("product_id_index").agg(F.count('star_rating').alias('num'))
x.num
gives
Column<b'num'>
But this
new_df = spark.createDataFrame(x.num)
new_df.show()
raises an error.

What you want to achieve is a simple one-liner. Good luck!
new_df = game_reviews.groupBy("product_id_index").agg(F.count('star_rating').alias('num')).select("num")
new_df.show()
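For completeness, here is a minimal self-contained sketch of the same fix, with toy data standing in for game_reviews (whose real schema is assumed from the question). The point is that groupBy().agg() already returns a DataFrame, so selecting the aggregated column gives a one-column DataFrame directly and createDataFrame is not needed:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# toy stand-in for game_reviews (column names assumed from the question)
game_reviews = spark.createDataFrame(
    [(1, 5), (1, 3), (2, 4)],
    ["product_id_index", "star_rating"],
)

# agg() returns a DataFrame, so select("num") yields a one-column DataFrame
new_df = (game_reviews
    .groupBy("product_id_index")
    .agg(F.count("star_rating").alias("num"))
    .select("num"))
new_df.show()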

Related

How to extract a single (column/row) value from a dataframe using PySpark?

Here's my Spark code. It works fine and returns 2517. All I want to do is print "2517 degrees"... but I'm not sure how to extract that 2517 into a variable. I can only display the dataframe, not extract values from it. Sounds super easy, but unfortunately I'm stuck! Any help will be appreciated. Thanks!
df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", "\t").load("dbfs:/databricks-datasets/power-plant/data")
df.createOrReplaceTempView("MyTable")
df = spark.sql("SELECT COUNT (DISTINCT AP) FROM MyTable")
display(df)
Here is an alternative:
df.first()['column name']
It will give you the desired output, and you can store it in a variable.
I think you're looking for collect. Something like this should get you the value:
df.collect()[0]['count(DISTINCT AP)']
assuming the column name is 'count(DISTINCT AP)'
If you want to extract a value from a specific row and column:
df.select('column name').collect()[row number][0]
For example: df.select('eye color').collect()[20][0]
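Putting this together for the original question, a sketch (the alias ap_count is introduced here only to make the result column name predictable, instead of indexing by 'count(DISTINCT AP)'):
df = spark.sql("SELECT COUNT(DISTINCT AP) AS ap_count FROM MyTable")
# first() returns a Row; index it by the aliased column name
ap_count = df.first()["ap_count"]
print(str(ap_count) + " degrees")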

How can I pretty print a data frame in Hue/Notebook/Scala/Spark?

I am using Spark 2.1 and Scala 2.11 in a HUE 3.12 notebook. I have a dataframe that I can print like this:
df.select("account_id", "auto_pilot").show(2, false)
And the output looks like this:
+--------------------+----------+
|account_id |auto_pilot|
+--------------------+----------+
|00000000000000000000|null |
|00000000000000000002|null |
+--------------------+----------+
only showing top 2 rows
Is there a way of getting the data frame to show as pretty tables (like when I query from Impala or pyspark)?
Impala example of the same query: [screenshot of a rendered result table]
You can use the magic function %table; however, it only works for Datasets, not DataFrames. One option is to convert the DataFrame to a Dataset before printing.
import org.apache.spark.sql.Dataset
import spark.implicits._
case class Account(account_id: String, auto_pilot: String)
// no collect() here: .as[Account] is defined on DataFrame, not on Array[Row]
val accountDF = df.select("account_id", "auto_pilot")
val accountDS: Dataset[Account] = accountDF.as[Account]
%table accountDS
Right now this is the best solution I can think of. Other, better solutions are always welcome; I will update this answer as soon as I find a more elegant one.
From http://gethue.com/bay-area-bike-share-data-analysis-with-spark-notebook-part-2/
This is what I did:
df = sqlContext.sql("select * from my_table")
result = df.limit(5).collect()
%table result
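If a PySpark kernel is available, another general notebook pattern (not Hue-specific; this assumes the frontend renders Pandas DataFrames as HTML tables and that pandas is installed) is to convert a small slice with toPandas():
# collect a small sample locally; most notebook frontends render
# the resulting Pandas DataFrame as a formatted table
pretty = df.select("account_id", "auto_pilot").limit(5).toPandas()
pretty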

Calculate mean for several columns in Spark scala

I'm looking for a way to calculate a statistic, e.g. the mean, over several selected columns in Spark using Scala. Given that data is my Spark DataFrame, it's easy to calculate a mean for one column only, e.g.:
data.agg(avg("var1") as "mean var1").show
Also, we can easily calculate a mean cross-tabulated by values of some other columns e.g.:
data.groupBy("category").agg(avg("var1") as "mean_var1").show
But how can we calculate a mean for a List of columns in a DataFrame? I tried running something like this, but it didn't work:
scala> data.select("var1", "var2").mean().show
<console>:44: error: value mean is not a member of org.apache.spark.sql.DataFrame
data.select("var1", "var2").mean().show
^
This is what you need to do:
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = Seq((1, 2, 3), (3, 4, 5), (1, 2, 4)).toDF("A", "B", "C")
// build one mean() expression per column, then select them all at once
df1.select(df1.columns.map(mean(_)): _*).show()
Output:
+------------------+------------------+------+
| avg(A)| avg(B)|avg(C)|
+------------------+------------------+------+
|1.6666666666666667|2.6666666666666665| 4.0|
+------------------+------------------+------+
This works for a subset of the columns:
df1.select(Seq("A", "B").map(mean(_)): _*).show()
Output:
+------------------+------------------+
| avg(A)| avg(B)|
+------------------+------------------+
|1.6666666666666667|2.6666666666666665|
+------------------+------------------+
Hope this helps!
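For readers working in PySpark rather than Scala, the same pattern is a list comprehension over the column names; a sketch with the same toy data:
import pyspark.sql.functions as F

df1 = spark.createDataFrame([(1, 2, 3), (3, 4, 5), (1, 2, 4)], ["A", "B", "C"])
# one mean() expression per column, selected in a single pass
df1.select([F.mean(c) for c in ["A", "B"]]).show()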
If you already have the Dataset, you can do this:
ds.describe("age").show()
Which will return this:
+-------+----+
|summary| age|
+-------+----+
|  count|10.0|
|   mean|53.3|
| stddev|11.6|
|    min|18.0|
|    max|92.0|
+-------+----+

Creating a new column by applying a function in an existing column in PySpark?

Say I have a dataframe
product_id  customers
1           [1,2,4]
2           [1,2]
I want to create a new column, say nb_customer by applying the function len on the column customers.
I tried
df = df.select('*', (map(len, df.customers)).alias('nb_customer'))
but it does not work.
What is the correct way to do that?
import pyspark.sql.functions as f

df = sc.parallelize([
    [1, [1, 2, 4]],
    [2, [1, 2]]
]).toDF(('product_id', 'customers'))

# size() returns the number of elements in an array (or map) column
df.withColumn('nb_customer', f.size(df.customers)).show()
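If you need to apply an arbitrary Python function rather than a built-in like size, a udf is the usual fallback. A sketch (len here only mirrors what f.size already does, so prefer the built-in in practice, since udfs bypass Spark's optimizer):
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# wrap the plain Python function; the return type must be declared explicitly
count_customers = udf(lambda customers: len(customers), IntegerType())
df.withColumn('nb_customer', count_customers(df.customers)).show()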

How to display the results brought from column functions using Spark/Scala, like what show() does for a dataframe

I just started learning how to use DataFrames and Columns in Spark/Scala. I know that if I want to show something on the screen, I can just do df.show(). But how can I do this for a column? For example,
scala> val dfcol = df.apply("sgan")
dfcol: org.apache.spark.sql.Column = sgan
this finds a column called "sgan" in the dataframe df and assigns it to dfcol, so dfcol is a Column. Then, if I do
scala> abs(dfcol)
res29: org.apache.spark.sql.Column = abs(sgan)
I just get the result shown above. How can I display the result of this function the way df.show() does? In other words, how can I see the results of functions like abs, min, and so forth?
You should always work with a DataFrame; Column objects are not meant to be inspected this way. You can use select to create a DataFrame containing the column you're interested in, and then call show():
import org.apache.spark.sql.functions
df.select(functions.abs(df("sgan"))).show()
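The same idea carries over to PySpark, assuming a dataframe with the same column: a Column is just an expression, so wrap it in a select and call show() on the resulting DataFrame:
from pyspark.sql import functions as F
df.select(F.abs(df["sgan"])).show()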