I have a DataFrame in Scala.
I am using both PySpark and Scala in a notebook:
#pyspark
spark.read.csv(output_path + '/dealer', header = True).createOrReplaceTempView('dealer_dl')
%scala
import org.apache.spark.sql.functions._
val df = spark.sql("select * from dealer_dl")
How can I convert a string column (amount) to decimal in a Scala DataFrame?
I tried the following:
%scala
df = df.withColumn("amount", $"amount".cast(DecimalType(9,2)))
But I am getting the following error:
error: reassignment to val
I am used to PySpark and quite new to Scala, but I need to do this in Scala to proceed further. Please let me know. Thanks.
In Scala you cannot reassign a reference declared with val, because val is an immutable reference. If you really need to reassign, declare it with var instead, but the better solution is to avoid reassigning the same name and bind the result to a new val.
For example:
import org.apache.spark.sql.types.DecimalType
val dfWithDecimalAmount = df.withColumn("amount", $"amount".cast(DecimalType(9,2)))
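To make the val/var distinction concrete, here is a minimal sketch, assuming Spark SQL is on the classpath and the code runs as a script or notebook cell. The sample rows and the names sample, withDecimal and mutableDf are made up for illustration; they are not from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Reuses the notebook's session if one exists; the local master is an assumption for this sketch.
val spark = SparkSession.builder.master("local[1]").appName("cast-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the dealer data: amount arrives as a string.
val sample = Seq("12.50", "7.25").toDF("amount")

// Preferred: bind the transformed DataFrame to a new val.
val withDecimal = sample.withColumn("amount", col("amount").cast(DecimalType(9, 2)))

// Alternative: a var can be reassigned, at the cost of mutable state.
var mutableDf = sample
mutableDf = mutableDf.withColumn("amount", col("amount").cast(DecimalType(9, 2)))

// Inspect the resulting column type instead of trusting the cast blindly.
val amountType = withDecimal.schema("amount").dataType
```

Either way the original DataFrame is left untouched, since withColumn always returns a new DataFrame.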
I have a DataFrame that contains string columns, and I am planning to use it as input for k-means using Spark and Scala. I am converting the string-typed columns of the DataFrame using the method below:
val toDouble = udf[Double, String]( _.toDouble)
val analysisData = dataframe_mysql.withColumn("Event", toDouble(dataframe_mysql("event"))).withColumn("Execution", toDouble(dataframe_mysql("execution"))).withColumn("Info", toDouble(dataframe_mysql("info")))
val assembler = new VectorAssembler()
  .setInputCols(Array("execution", "event", "info"))
  .setOutputCol("features")
val output = assembler.transform(analysisData)
println(output.select("features", "execution").first())
When I print the analysisData schema, the conversion is correct, but I am getting an exception: VectorAssembler does not support the StringType type
which means that my values are still strings! How can I convert the values and not only the schema type?
Thanks
Indeed, the VectorAssembler transformer does not accept strings, so you need to make sure that your input columns are of numeric, boolean, or vector type. Check that your udf is doing the right thing and that none of the input columns has StringType.
To convert a column in a Spark DataFrame to another type, keep it simple and use the cast() DSL function, like so:
import org.apache.spark.sql.types.DoubleType
val analysisData = dataframe_mysql.withColumn("Event", dataframe_mysql("Event").cast(DoubleType))
It should work!
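The cast can be applied to every feature column before assembling. The following is a minimal sketch, assuming spark-sql and spark-mllib are on the classpath; the sample rows and the names raw and featureCols are hypothetical stand-ins for dataframe_mysql and its columns.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Local session for the sketch; a notebook would normally provide one already.
val spark = SparkSession.builder.master("local[1]").appName("assembler-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input: three string columns that hold numbers.
val raw = Seq(("1", "2.5", "3"), ("4", "5.5", "6")).toDF("event", "execution", "info")
val featureCols = Array("event", "execution", "info")

// Replace each string column with its DoubleType cast, keeping the same name.
val analysisData = featureCols.foldLeft(raw) { (df, name) =>
  df.withColumn(name, col(name).cast(DoubleType))
}

// The assembler now only sees numeric input columns.
val output = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
  .transform(analysisData)
```

Printing analysisData.printSchema() before calling transform is a quick way to confirm that no input column is still a string.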
I am reading two different .csv files, each with only one column, as below:
val dF1 = sqlContext.read.csv("some.csv").select($"ID")
val dF2 = sqlContext.read.csv("other.csv").select($"PID")
I am trying to check whether dF2("PID") exists in dF1("ID"):
val getIdUdf = udf((x:String)=>{dF1.collect().map(_(0)).toList.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
This gives me a NullPointerException.
But if I collect dF1 outside the UDF and use the resulting list inside the UDF, it works:
val dF1 = sqlContext.read.csv("some.csv").select($"ID").collect().map(_(0)).toList
val getIdUdf = udf((x:String)=>{dF1.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
I know I can use a join to get this done, but I want to know the reason for the NullPointerException here.
Thanks.
Please check this question about accessing a DataFrame inside a transformation of another DataFrame. This is exactly what you are doing with your UDF, and it is not possible in Spark: the UDF body runs on the executors, where the driver-side dF1 cannot be used. The solution is either to use a join, or to collect outside of the transformation and broadcast the result.
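Both suggested fixes can be sketched as follows. This is only an illustration, assuming Spark SQL is on the classpath; the sample data and the names ids, pids, flaggedIds, idSet and hasId are made up rather than taken from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit, udf}

val spark = SparkSession.builder.master("local[1]").appName("has-id-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the two single-column CSV files.
val ids = Seq("a", "b").toDF("ID")
val pids = Seq("a", "c").toDF("PID")

// Option 1: left join against the distinct IDs and flag the matches.
val flaggedIds = ids.distinct().withColumn("matched", lit(true))
val viaJoin = pids
  .join(flaggedIds, col("PID") === col("ID"), "left")
  .select(col("PID"), coalesce(col("matched"), lit(false)).as("hasId"))

// Option 2: collect on the driver, broadcast a plain Set, and use only that inside the UDF.
val idSet = spark.sparkContext.broadcast(ids.collect().map(_.getString(0)).toSet)
val hasId = udf((x: String) => idSet.value.contains(x))
val viaBroadcast = pids.withColumn("hasId", hasId(col("PID")))
```

The join scales with the data, while the broadcast variant is only reasonable when the ID list comfortably fits in memory on the driver and on every executor.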
This article claims that a DataFrame in Spark is equivalent to a Dataset[Row], but this blog post shows that a DataFrame has a schema.
Take the example in the blog post of converting an RDD to a DataFrame: if DataFrame were the same thing as Dataset[Row], then converting an RDD to a DataFrame should be as simple as
val rddToDF = rdd.map(value => Row(value))
But instead the post shows this:
val rddStringToRowRDD = rdd.map(value => Row(value))
val dfschema = StructType(Array(StructField("value",StringType)))
val rddToDF = sparkSession.createDataFrame(rddStringToRowRDD,dfschema)
val rDDToDataSet = rddToDF.as[String]
Clearly a DataFrame is actually a Dataset of rows plus a schema.
In Spark 2.0, the source code contains:
type DataFrame = Dataset[Row]
So a DataFrame is a Dataset[Row] simply by definition.
A Dataset also has a schema; you can print it using the printSchema() function. Normally Spark infers the schema, so you don't have to write it yourself, but it's still there ;)
You can also call createTempView(name) on a Dataset and use it in SQL queries, just like with DataFrames.
In other words, Dataset = DataFrame from Spark 1.5 + an encoder that converts rows to your classes. After the types were merged in Spark 2.0, DataFrame became just an alias for Dataset[Row], that is, a Dataset without a user-specified encoder.
About conversions: rdd.map() also returns an RDD; it never returns a DataFrame. You can do:
// Dataset[Row] = DataFrame, without encoder.
// createDataFrame(rdd) only accepts an RDD of Products (case classes, tuples),
// so for an RDD[String] use toDF (needs import sparkSession.implicits._)
val rddToDF = rdd.toDF("value")
// Now it is told that the encoder for String should be used, so it becomes Dataset[String]
val rDDToDataSet = rddToDF.as[String]
// However, it can be shortened to (the implicits import supplies the String encoder):
val dataset = sparkSession.createDataset(rdd)
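The alias can be checked directly with type annotations. Below is a minimal sketch, assuming Spark SQL 2.x or later on the classpath; the sample strings and the names df, sameThing, typed and direct are illustrative only.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

val spark = SparkSession.builder.master("local[1]").appName("alias-sketch").getOrCreate()
import spark.implicits._

// Hypothetical RDD[String] standing in for the rdd from the question.
val rdd = spark.sparkContext.parallelize(Seq("a", "b"))

// toDF comes from spark.implicits._ and yields a DataFrame with one string column.
val df: DataFrame = rdd.toDF("value")

// This assignment compiles only because DataFrame is a type alias for Dataset[Row].
val sameThing: Dataset[Row] = df

// Attaching the String encoder turns the rows back into typed values.
val typed: Dataset[String] = df.as[String]

// Going straight from the RDD to a typed Dataset skips the intermediate DataFrame.
val direct: Dataset[String] = spark.createDataset(rdd)
```

In every case the schema is present; df.printSchema() and typed.printSchema() both show the single value column.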
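Note (in addition to T Gaweda's answer) that there is a schema associated with each Row (Row.schema). However, this schema is not set until the Row is part of a DataFrame (or Dataset[Row]). In the session below, schema is a previously defined StructType with a single nullable IntegerType field named a, as the last output shows: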
scala> Row(1).schema
res12: org.apache.spark.sql.types.StructType = null
scala> val rdd = sc.parallelize(List(Row(1)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[5] at parallelize at <console>:28
scala> spark.createDataFrame(rdd,schema).first
res15: org.apache.spark.sql.Row = [1]
scala> spark.createDataFrame(rdd,schema).first.schema
res16: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true))
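The session above can be reproduced as a script. This is a sketch under the assumption that schema was defined as shown here, which is inferred from the res16 output rather than stated in the answer; bareSchema and attachedSchema are illustrative names.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession.builder.master("local[1]").appName("row-schema-sketch").getOrCreate()

// Assumed definition of the schema used in the session above.
val schema = StructType(Seq(StructField("a", IntegerType, nullable = true)))
val rdd = spark.sparkContext.parallelize(Seq(Row(1)))

// A Row built by hand carries no schema.
val bareSchema = Row(1).schema

// A Row read back from a DataFrame carries the DataFrame's schema.
val attachedSchema = spark.createDataFrame(rdd, schema).first.schema
```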
I have a DataFrame that I read from a CSV file with many columns, such as timestamp, steps, heartrate, etc.
I want to sum the values of each column, for instance the total number of steps in the "steps" column.
As far as I can see, I want to use this kind of function:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can't understand how to use the sum function.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Am I using the sum function incorrectly?
Do I need to use the map function first? And if so, how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
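Since CSV.load is specific to the question's setup, here is a self-contained sketch of the same aggregation, assuming Spark SQL is on the classpath. The in-memory sample data and the names stepsDf and totalSteps are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder.master("local[1]").appName("sum-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the CSV data: a single integer column named steps.
val stepsDf = Seq(10, 2, 3, 4).toDF("steps")

// agg returns a one-row DataFrame; summing an integer column yields a Long.
val totalSteps: Long = stepsDf.agg(sum("steps")).first.getLong(0)
```

If the column was read from CSV without schema inference it will be a string, in which case casting it to a numeric type first makes the result type explicit.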
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
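The last variant can be tried end to end as follows. This is a minimal sketch, assuming Spark SQL is on the classpath; the sample rows and the names data, row and totals are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder.master("local[1]").appName("multi-sum-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data with two numeric columns.
val data = Seq((1, 10.0), (2, 20.0), (3, 30.0)).toDF("col1", "col2")

// Build one aliased, double-typed sum expression per column name.
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))

// agg takes a first expression plus varargs, hence head and tail.
val row = data.agg(sums.head, sums.tail: _*).first

// Read the results back by alias into a plain Scala map.
val totals = cols.map(colName => colName -> row.getAs[Double]("sum_" + colName)).toMap
```

Casting every sum to double keeps the result row uniform, so the values can be read back with a single getAs[Double] regardless of the input column types.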
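If you want to sum all values of one column, an alternative is to use the DataFrame's underlying RDD and reduce. Note that going through the RDD bypasses Spark SQL's optimizer, so this is not necessarily faster than agg(sum(...)).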
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
// res1: Int = 19
Simply apply the aggregation function sum to your column (PySpark):
df.groupBy().sum('steps').show()
See the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Also check out this link: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Not sure this was around when this question was asked, but:
df.describe("columnName").show()
gives count, mean, stddev, min, and max stats for a column. It returns the stats for all numeric and string columns if you just call df.describe().show().
Using a Spark SQL query, just in case it helps anyone:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
// needed for toDF and for the Long encoder used by map
import spark.implicits._
// name the column "steps" so the SQL query below can refer to it
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps")
val sum = spark.sql("select sum(steps) as stepsSum from steps").map(row => row.getAs[Long]("stepsSum")).collect()(0)
println("steps sum = " + sum) // prints "steps sum = 28"