Spark DataFrame groupBy - scala

I have Spark Java code that looks like this. It pulls data from an Oracle table over JDBC and displays the groupBy output.
DataFrame jdbcDF = sqlContext.read().format("jdbc").options(options).load();
jdbcDF.show();
jdbcDF.groupBy("VA_HOSTNAME").count().show();
Long ll = jdbcDF.count();
System.out.println("ll="+ll);
When I run the code, jdbcDF.show() works, but the groupBy and count print nothing, and no errors are thrown.
The column name is correct; I tried printing just that column and it worked, but the groupBy does not.
Can someone help me get the DataFrame output? I am using Spark 1.6.3.

You can try
import org.apache.spark.sql.functions.count
jdbcDF.groupBy("VA_HOSTNAME").agg(count("*")).show()

Related

Iterating through a DataFrame using Pandas UDF and outputting a dataframe

I have a piece of code that I want to translate into a Pandas UDF in PySpark but I'm having a bit of trouble understanding whether or not you can use conditional statements.
def is_pass_in(df):
    x = list(df["string"])
    result = []
    for i in x:
        if "pass" in i:
            result.append("YES")
        else:
            result.append("NO")
    df["result"] = result
    return df
The code is super simple: each row of the column contains a sentence, and all I'm trying to do is check whether the word "pass" is in that sentence and, if so, append "YES" (otherwise "NO") to a list that will later become a column right next to the df["string"] column. I've tried to do this with a Pandas UDF, but the error messages I'm getting are something I don't understand because I'm new to Spark. Could someone point me in the right direction?
There is no need to use a UDF. This can be done directly in PySpark as follows. Even in pandas I would advise against iterating the way you have done; use np.where() instead.
from pyspark.sql.functions import col, when
df.withColumn('result', when(col('string').contains('pass'), 'YES').otherwise('NO')).show()

show single row for multiple records with total number records as count in a new column of dataframe spark scala

I have data as follows.
I want to summarize this as follows:
I want to take the first timestamp for each name and add the total record count for that name as a new column.
I am not getting any idea how to do this in Spark Scala code.
Could you please let me know how to handle this situation with a Spark Scala DataFrame?
Thanks, Bab
Spark SQL has functions that you can use to achieve this.
import org.apache.spark.sql.functions.{first, col, count}
In Scala you can do something like this:
df.groupBy(col("Name"))
.agg(first("ID").alias("ID"),
first(col("Timestamp")).alias("Timestamp"),
count(col("Name")).alias("Count")
)
If you want to group on both ID and Name you can also write it as
df.groupBy(col("ID"), col("Name"))
.agg(first(col("Timestamp")).alias("Timestamp"),
count(col("Name")).alias("Count")
)
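Since the sample data from the question isn't reproduced here, a minimal spark-shell sketch with assumed columns ID, Name and Timestamp (the values below are made up) would look like this:
import org.apache.spark.sql.functions.{col, count, first}
import spark.implicits._
// hypothetical input data
val df = Seq(
  (1, "alpha", "2021-01-01 10:00:00"),
  (2, "alpha", "2021-01-01 11:00:00"),
  (3, "beta",  "2021-01-02 09:00:00")).toDF("ID", "Name", "Timestamp")
// one row per Name: first ID, first Timestamp, and the number of records for that Name
// note: first() picks whichever row Spark encounters first unless the data is sorted beforehand
df.groupBy(col("Name"))
  .agg(first(col("ID")).alias("ID"),
       first(col("Timestamp")).alias("Timestamp"),
       count(col("Name")).alias("Count"))
  .show()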

Scala, Spark-shell, Groupby failing

I have Spark version 2.4.0 and Scala version 2.11.12. I can successfully load a DataFrame with the following code.
val df = spark.read.format("csv").option("header","true").option("delimiter","|").option("mode","DROPMALFORMED").option("maxColumns",60000).load("MAR18.csv")
However, when I attempt to do a groupby as follows, I get an error.
df.groupby("S0102_gender").agg(sum("Respondent.Serial")).show()
The error message is:
error: value groupby is not a member of org.apache.spark.sql.DataFrame
What am I missing? I'm a complete Scala and Spark newb.
You have a typo. Change groupby to groupBy.
Instead of groupby it should be groupBy, like below; it's clearly a typo:
import org.apache.spark.sql.functions.sum
df.groupBy("S0102_gender").agg(sum("Respondent.Serial")).show()
(If the column name literally contains a dot, it may also need backticks: sum("`Respondent.Serial`").)

Applying transformations with filter or map which one is faster Scala spark

I am trying to do some transformations on a dataset with Spark using Scala. I'm currently using Spark SQL but want to shift the code to the native Scala API. I want to know whether to use filter or map for operations like matching values in a column and getting a single column, after the transformation, in a different dataset.
SELECT * FROM TABLE WHERE COLUMN = ''
I used to write something like this earlier in Spark SQL. Can someone tell me an alternative way to write the same thing using map or filter on the dataset, and which of the two is faster?
You can read the documentation on the Apache Spark website; the API documentation is at https://spark.apache.org/docs/2.3.1/api/scala/index.html#package.
Here is a little example -
val df = sc.parallelize(Seq((1,"ABC"), (2,"DEF"), (3,"GHI"))).toDF("col1","col2")
// filter keeps only the rows that satisfy the predicate
val df1 = df.filter("col1 > 1")
df1.show()
// map transforms each row into a new value (here, col1 + 3)
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand your question correctly, you need to rewrite your SQL query with the DataFrame API. Your query reads all columns from table TABLE and filters the rows where COLUMN is empty. You can do this with a DataFrame in the following way:
spark.read.table("TABLE")
.where($"COLUMN".eqNullSafe(""))
.show(10)
Performance will be the same as with your SQL. Use the dataFrame.explain(true) method to see what Spark will do.
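To see that the two styles end up with the same plan, here is a minimal spark-shell sketch; the table and column names (my_table, my_column) are placeholders rather than the question's actual names:
import spark.implicits._
// hypothetical stand-ins for the question's TABLE and COLUMN
val data = Seq(("", 1), ("abc", 2), ("", 3)).toDF("my_column", "other")
data.createOrReplaceTempView("my_table")
// SQL version of the query
spark.sql("SELECT * FROM my_table WHERE my_column = ''").explain(true)
// equivalent DataFrame version
data.where($"my_column" === "").explain(true)
// both print the same optimized and physical plans, which is why the performance is the same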

How to convert dataframe to spark rows

I am looking to get Spark Row(s) from a Spark DataFrame. I have tried searching online but could not find anything.
You can do
val rows = df.collect() // returns Array[Row]; a DataFrame is a Dataset[Row], so its elements are already Rows
or keep them distributed with df.rdd, which gives an RDD[Row].
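For example, a minimal spark-shell sketch (the data and column names are just placeholders):
import spark.implicits._
// hypothetical DataFrame
val df = Seq(("a", 1), ("b", 2)).toDF("letter", "number")
// collect() brings the Rows to the driver as Array[Row]
val rows: Array[org.apache.spark.sql.Row] = df.collect()
rows.foreach(r => println(r.getAs[String]("letter") + " -> " + r.getAs[Int]("number")))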