I have Spark version 2.4.0 and Scala version 2.11.12. I can successfully load a DataFrame with the following code.
val df = spark.read.format("csv").option("header","true").option("delimiter","|").option("mode","DROPMALFORMED").option("maxColumns",60000).load("MAR18.csv")
However, when I attempt the following groupby, I get an error.
df.groupby("S0102_gender").agg(sum("Respondent.Serial")).show()
The error message is:
error: value groupby is not a member of org.apache.spark.sql.DataFrame
What am I missing? I'm a complete Scala and Spark newbie.
You have a typo: change groupby to groupBy.
Instead of groupby it should be groupBy, as shown below. It's clearly just a typo.
df.groupBy("S0102_gender").agg(sum("Respondent.Serial")).show()
If I use a Hive UDF in Spark SQL, it works, as shown below.
val df=List(("$100", "$90", "$10")).toDF("selling_price", "market_price", "profit")
df.registerTempTable("test")
spark.sql("select default.encrypt(selling_price,'sales','','employee','id') from test").show
However, the following is not working.
//following is not working. not sure if you need to register a function for this
val encDF = df.withColumn("encrypted", default.encrypt($"selling_price","sales","","employee","id"))
encDF.show
Error
error: not found: value default
A Hive UDF is only available if you access it through Spark SQL. It is not available as a Scala function, because it was never defined there. But you can still call the Hive UDF from the DataFrame API using expr:
df.withColumn("encrypted", expr("default.encrypt(selling_price,'sales','','employee','id')"))
I am new to / still learning Apache Spark and Scala. I am trying to analyze a dataset that I have loaded into Scala, but when I try to perform a basic analysis such as max, min, or average, I get an error:
error: value select is not a member of org.apache.spark.rdd.RDD[Array[String]]
Could anyone please shed some light on this? I am running Spark on the cloud lab of an organization.
Code:
// Reading in the csv file
val df = sc.textFile("/user/Spark/PortbankRTD.csv").map(x => x.split(","))
// Select Max of Age
df.select(max($"age")).show()
Error:
<console>:40: error: value select is not a member of org.apache.spark.rdd.RDD[Array[String]]
df.select(max($"age")).show()
Please let me know if you need any more information.
Thanks
Following up on my comment: the textFile method returns an RDD[String], and select is a method on DataFrame, so you will need to convert your RDD into a DataFrame. You can do this in a number of ways. One example is:
import spark.implicits._
val rdd = sc.textFile("/user/Spark/PortbankRTD.csv")
// toDF() on an RDD[String] produces a single string column named "value"
val df = rdd.toDF()
There are also built-in readers for many types of input files:
spark.read.csv("/user/Spark/PortbankRTD.csv")
returns a DataFrame immediately.
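A rough sketch of the whole flow, assuming PortbankRTD.csv has a header row and an age column whose type can be inferred as numeric:

import org.apache.spark.sql.functions.max
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

// Read the CSV with a header and let Spark infer the column types,
// so that "age" comes back as a numeric column rather than a string
val bankDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/user/Spark/PortbankRTD.csv")

bankDF.select(max($"age")).show()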
I'm trying to find the DataFrame class definition in the Scala source code, not in PySpark.
There are files like DataFrameReader, DataFrameWriter, and Dataset, but not DataFrame.
I have looked through some directories such as spark/sql and spark/core.
A DataFrame is just a Dataset[Row], defined as a type alias:
type DataFrame = Dataset[Row]
https://github.com/apache/spark/blob/50538600ec972469338370f7e2d3674ca8b3c389/sql/core/src/main/scala/org/apache/spark/sql/package.scala#L46
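So any method documented on Dataset applies directly to a DataFrame. A small sketch to illustrate the alias (the Id case class is made up, and a SparkSession named spark is assumed):

import org.apache.spark.sql.{DataFrame, Dataset, Row}
import spark.implicits._

// DataFrame and Dataset[Row] are the same type, so these two
// declarations are interchangeable at compile time
val asDataFrame: DataFrame = spark.range(3).toDF("id")
val asDatasetOfRow: Dataset[Row] = asDataFrame

// Going to a typed Dataset just picks a more specific type parameter
case class Id(id: Long)
val typed: Dataset[Id] = asDataFrame.as[Id]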
I am trying to save my DataFrame as a Parquet file with one partition per day, so I am trying to use the date column. However, I want to write one file per partition, so I am using repartition($"date"), but I keep getting errors:
I get the errors "cannot resolve symbol repartition" and "value $ is not a member of StringContext" when I use:
DF.repartition($"date")
.write
.mode("append")
.partitionBy("date")
.parquet("s3://file-path/")
I get the error "Type mismatch, expected: Column, actual: String" when I use:
DF.repartition("date")
.write
.mode("append")
.partitionBy("date")
.parquet("s3://file-path/")
However, this works fine without any error.
DF.write.mode("append").partitionBy("date").parquet("s3://file-path/")
Can't we use the date column in repartition? What's wrong here?
To use the $ symbol in place of col(), you first need to import spark.implicits._. Here spark is an instance of SparkSession, so the import must come after the SparkSession has been created. A simple example:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
This import also enables other functionality, such as converting RDDs to DataFrames or Datasets with toDF() and toDS() respectively.
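With that import in place, the column-based repartition from the question should compile. A sketch, reusing the DF name and the S3 path from the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._ // brings the $"..." column syntax into scope

// DF is the DataFrame from the question; $"date" is now equivalent to col("date")
DF.repartition($"date")
  .write
  .mode("append")
  .partitionBy("date")
  .parquet("s3://file-path/")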
I have Spark Java code that looks like this. The code pulls data from an Oracle table using JDBC and displays the groupBy output.
DataFrame jdbcDF = sqlContext.read().format("jdbc").options(options).load();
jdbcDF.show();
jdbcDF.groupBy("VA_HOSTNAME").count().show();
Long ll = jdbcDF.count();
System.out.println("ll="+ll);
When I ran the code, jdbcDF.show() worked, whereas the groupBy and count did not print anything, and no errors were thrown.
My column name is correct. I tried printing that column and it worked, but the groupBy is not working.
Can someone help me with the DataFrame output? I am using Spark 1.6.3.
You can try:
import static org.apache.spark.sql.functions.count;
jdbcDF.groupBy("VA_HOSTNAME").agg(count("*")).show();