I am trying to save my dataframe as a parquet file with one partition per day, so I am using the date column. However, I want to write one file per partition, so I am using repartition($"date"), but I keep getting errors:
I get the errors "cannot resolve symbol repartition" and "value $ is not a member of StringContext" when I use:
DF.repartition($"date")
.write
.mode("append")
.partitionBy("date")
.parquet("s3://file-path/")
I get the error "Type mismatch, expected: Column, actual: String" when I use:
DF.repartition("date")
.write
.mode("append")
.partitionBy("date")
.parquet("s3://file-path/")
However, this works fine without any error.
DF.write.mode("append").partitionBy("date").parquet("s3://file-path/")
Can't we use the date type in repartition? What's wrong here?
To use the $ symbol in place of col(), you first need to import spark.implicits._. Here spark is an instance of a SparkSession, hence the import must be done after the SparkSession has been created. A simple example:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
This import also enables other functionality, such as converting RDDs to DataFrames or Datasets with toDF() and toDS() respectively.
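For example, here is a minimal sketch of the fix applied to the code in the question (assuming the DF and its date column from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._   // brings the $"..." syntax (and toDF/toDS) into scope

DF.repartition($"date")    // $"date" now resolves to a Column
  .write
  .mode("append")
  .partitionBy("date")
  .parquet("s3://file-path/")

Alternatively, you can skip the implicits and use org.apache.spark.sql.functions.col instead, e.g. DF.repartition(col("date")).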
If I use a Hive UDF in Spark SQL it works, as shown below.
val df=List(("$100", "$90", "$10")).toDF("selling_price", "market_price", "profit")
df.registerTempTable("test")
spark.sql("select default.encrypt(selling_price,'sales','','employee','id') from test").show
However, the following is not working.
//following is not working. not sure if you need to register a function for this
val encDF = df.withColumn("encrypted", default.encrypt($"selling_price","sales","","employee","id"))
encDF.show
Error
error: not found: value default
A Hive UDF is only available if you access it through Spark SQL. It is not available as a Scala function, because it was never defined in the Scala environment. But you can still access the Hive UDF using expr:
df.withColumn("encrypted", expr("default.encrypt(selling_price,'sales','','employee','id')"))
I am trying to insert some data into a PostgreSQL database using PySpark. There is one field in the PostgreSQL table that is defined with the data type GEOGRAPHY(Point). I have written the PySpark code below to create this field using longitude and latitude:
from pyspark.sql.functions import st_makePoint
df = ...  # load the input file into a PySpark DataFrame
df = df.withColumn("Location", st_makePoint(col("Longitude"), col("Latitude")))
The next step is to load the data into PostgreSQL.
But I am getting the error:
ImportError: cannot import name 'st_makePoint'
I think st_makePoint is part of pyspark.sql.functions. Not sure why it is giving this error. Please help.
Also, if there is a better way of populating the GEOGRAPHY(Point) field in PostgreSQL from PySpark, please let me know.
Check the GeoMesa documentation:
Registering user-defined types and functions can be done manually by invoking geomesa_pyspark.init_sql() on the Spark session object:
import geomesa_pyspark
geomesa_pyspark.init_sql(spark)
Then you can use st_makePoint.
I am new to / still learning Apache Spark and Scala. I am trying to analyze a dataset and have loaded it into Spark. However, when I try to perform a basic analysis such as max, min, or average, I get an error:
error: value select is not a member of org.apache.spark.rdd.RDD[Array[String]]
Could anyone please shed some light on this? I am running Spark on the CloudLab of an organization.
Code:
// Reading in the csv file
val df = sc.textFile("/user/Spark/PortbankRTD.csv").map(x => x.split(","))
// Select Max of Age
df.select(max($"age")).show()
Error:
<console>:40: error: value select is not a member of org.apache.spark.rdd.RDD[Array[String]]
df.select(max($"age")).show()
Please let me know if you need any more information.
Thanks
Following up on my comment: the textFile method returns an RDD[String], while select is a method on DataFrame, so you will need to convert your RDD into a DataFrame. You can do this in a number of ways. One example is:
import spark.implicits._
val rdd = sc.textFile("/user/Spark/PortbankRTD.csv")
val df = rdd.toDF()
There are also built-in readers for many types of input files:
spark.read.csv("/user/Spark/PortbankRTD.csv")
returns a DataFrame immediately.
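For example, to compute the max of age directly (a sketch, assuming PortbankRTD.csv has a header row with an age column):

import org.apache.spark.sql.functions.max
import spark.implicits._

val df = spark.read
  .option("header", "true")       // use the first line as column names
  .option("inferSchema", "true")  // infer numeric types such as age
  .csv("/user/Spark/PortbankRTD.csv")

df.select(max($"age")).show()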
I want to use the values in t5 to replace some missing values in t4. I searched for code, but it doesn't work for me.
Current:
example of current
Goal:
example of target
df is a DataFrame. Code:
pdf = df.toPandas()
from pyspark.sql.functions import coalesce
pdf.withColumn("t4", coalesce(pdf.t4, pdf.t5))
Error: 'DataFrame' object has no attribute 'withColumn'
Also, I tried the following code previously; it didn't work either.
new_pdf=pdf['t4'].fillna(method='bfill', axis="columns")
Error: No axis named columns for object type
As the error indicates, .withColumn() is not a method of pandas DataFrames but of Spark DataFrames. Note that when you use .toPandas(), your pdf becomes a pandas DataFrame, so if you want to use .withColumn(), avoid that conversion.
UPDATE:
If pdf is a pandas DataFrame, you can do:
pdf['t4']=pdf['t4'].fillna(pdf['t5'])
I am running the following scala code:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df=hiveContext.sql("SELECT * FROM hl7.all_index")
val rows=df.rdd
val firstStruct=rows.first.get(4)
//I know the column with index 4 IS a StructType
val fs=firstStruct.asInstanceOf[StructType]
//now it fails
//what I'm trying to achieve is
log.println(fs.apply("name"))
I know that firstStruct is of StructType and that one of its StructFields is named "name", but it seems to fail when trying to cast.
I've been told that Spark/Hive structs differ from Scala ones, but in order to use StructType I needed to
import org.apache.spark.sql.types._
so I assume they actually should be the same type
I looked here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
in order to see how to get to the structField.
Thanks!
Schema types are logical types. They don't map one-to-one to the type of the objects in a column with that schema type.
For example, Hive/SQL uses BIGINT for 64-bit integers while Spark SQL uses LongType, but the actual type of the data in Scala is Long. This is the issue you are having.
A struct in Hive (StructType in Spark SQL) is represented by a Row in a DataFrame. So what you want to do is one of the following:
row.getStruct(4)
import org.apache.spark.sql.Row
row.getAs[Row](4)
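Putting it together with the code from the question (a sketch, assuming the struct at index 4 has a field called name, here taken to be a String):

import org.apache.spark.sql.Row

val firstStruct = rows.first.getAs[Row](4)    // the nested struct comes back as a Row
val name = firstStruct.getAs[String]("name")  // access its fields by name
println(name)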