How to pivot columns and rows in Databricks (Apache Spark) using Scala?

I'm trying to pivot rows to columns. I'm using the pivot function, but it just gives me back the exact same table without any changes. The code runs without errors, but I would like to reshape the data into attribute and value columns as shown below. Any help is greatly appreciated!
// current database table
Census_block_group  B08007e1  B08007m1  B08007e2  B08007m2
010010201001        291       95        291       95
010010201002        678       143       663       139
// what I need
Census_block_group  attribute  value
010010201001        B08007e1   291
//code
import org.apache.spark.sql.SQLContext
spark.conf.set("spark.sql.pivotMaxValues", 999999)
val df = censusBlocks.toDF
df.groupBy("B08007e1").pivot("census_block_group")
display(df)

What you are actually trying to do is in fact an "unpivot" rather than a pivot. Spark does not have a built-in unpivot function, but you can use the stack function instead.
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for the $"colName" syntax outside a notebook/shell

val unPivotDF = df.select($"Census_block_group",
  expr("stack(4, 'B08007e1', B08007e1, 'B08007m1', B08007m1, 'B08007e2', B08007e2, 'B08007m2', B08007m2) as (attribute, value)"))
unPivotDF.show()
You can find more details about using the stack function here - https://sparkbyexamples.com/how-to-pivot-table-and-unpivot-a-spark-dataframe/
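For reference, here is a self-contained sketch of the same stack approach run against the sample rows from the question (column names and values taken from the post; in a Databricks notebook you would use the existing spark session instead of building one):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("unpivot-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Rebuild the two sample rows from the question.
val censusDF = Seq(
  ("010010201001", 291, 95, 291, 95),
  ("010010201002", 678, 143, 663, 139)
).toDF("Census_block_group", "B08007e1", "B08007m1", "B08007e2", "B08007m2")

// stack(n, name1, col1, name2, col2, ...) emits one (attribute, value) row per pair.
val unPivotDF = censusDF.select($"Census_block_group",
  expr("stack(4, 'B08007e1', B08007e1, 'B08007m1', B08007m1, 'B08007e2', B08007e2, 'B08007m2', B08007m2) as (attribute, value)"))

unPivotDF.show()
// e.g. (010010201001, B08007e1, 291), (010010201001, B08007m1, 95), ... - 8 rows in total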

You actually want a 'transpose', not a 'pivot'. Here is another solution (sorry, a bit lengthy) :-)
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

object Stackoverflow3 {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Test").master("local").getOrCreate()

    val df = <YOUR ORIGINAL DATAFRAME>

    val transposed = transform(df, Array("Census_block_group"))
    transposed
      .withColumn("Attribute", col("ColVal.col1"))
      .withColumn("Value", col("ColVal.col2"))
      .drop("ColVal")
      .show()
  }

  def transform(df: DataFrame, fixedColumns: Array[String]): DataFrame = {
    // Every column except the fixed (key) columns gets transposed.
    val colsToTranspose = df.columns.diff(fixedColumns)

    // Build a struct(name, value) column for each column to transpose.
    val createCols = {
      colsToTranspose.foldLeft(Array.empty[Column]) {
        case (acc, name) => acc :+ struct(lit(name).cast(StringType), col(name).cast(StringType))
      }
    }

    // explode turns the array of structs into one row per (name, value) pair.
    df
      .withColumn("colVal", explode(array(createCols: _*)))
      .select(Array("Census_block_group", "colVal").map(col): _*)
  }
}
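A quick usage sketch (my addition, not part of the original answer), substituting the sample rows from the question for <YOUR ORIGINAL DATAFRAME>; it assumes the imports above plus spark.implicits._ for toDF:

// Hypothetical stand-in for <YOUR ORIGINAL DATAFRAME>, built from the question's sample rows.
val censusDF = Seq(
  ("010010201001", 291, 95, 291, 95),
  ("010010201002", 678, 143, 663, 139)
).toDF("Census_block_group", "B08007e1", "B08007m1", "B08007e2", "B08007m2")

// Each non-key column becomes a struct(name, value); explode then yields one row per attribute.
transform(censusDF, Array("Census_block_group")).show(false)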

Related

Spark Scala getting null pointer exception

I'm trying to get mass elevation data from a TIFF image. I have a CSV file that contains latitude, longitude and other attributes as well. I loop through the CSV file, get the latitude and longitude, and call the elevation method; the code is given below. Reference: RasterFrames extracting location information problem
package main.scala.sample

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.locationtech.rasterframes._
import org.locationtech.rasterframes.datasource.raster._
import org.locationtech.rasterframes.encoders.CatalystSerializer._
import geotrellis.raster._
import geotrellis.vector.Extent
import org.locationtech.jts.geom.Point
import org.apache.spark.sql.functions.col

object SparkSQLExample {

  def main(args: Array[String]) {
    implicit val spark = SparkSession.builder()
      .master("local[*]").appName("RasterFrames")
      .withKryoSerialization.getOrCreate().withRasterFrames
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._

    val example = "https://raw.githubusercontent.com/locationtech/rasterframes/develop/core/src/test/resources/LC08_B7_Memphis_COG.tiff"
    val rf = spark.read.raster.from(example).load()

    val rf_value_at_point = udf((extentEnc: Row, tile: Tile, point: Point) => {
      val extent = extentEnc.to[Extent]
      Raster(tile, extent).getDoubleValueAtPoint(point)
    })

    val spark_file: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples")
      .getOrCreate()
    spark_file.sparkContext.setLogLevel("ERROR")

    println("spark read csv files from a directory into RDD")
    val rddFromFile = spark_file.sparkContext.textFile("point.csv")
    println(rddFromFile.getClass)

    def customF(str: String): String = {
      val lat = str.split('|')(2).toDouble
      val long = str.split('|')(3).toDouble
      val point = st_makePoint(long, lat)
      val test = rf.where(st_intersects(rf_geometry(col("proj_raster")), point))
        .select(rf_value_at_point(rf_extent(col("proj_raster")), rf_tile(col("proj_raster")), point) as "value")
      return test.toString()
    }

    val rdd2 = rddFromFile.map(f => customF(f))
    rdd2.foreach(t => println(t))
    spark.stop()
  }
}
When I run it, I get a null pointer exception; any help is appreciated.
java.lang.NullPointerException
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:64)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3416)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1490)
at org.apache.spark.sql.Dataset.where(Dataset.scala:1518)
at main.scala.sample.SparkSQLExample$.main$scala$sample$SparkSQLExample$$customF$1(SparkSQLExample.scala:49)
The function which is being mapped over the RDD (customF) is not null safe. Try calling customF(null) and see what happens. If it throws an exception, then you will have to make sure that rddFromFile doesn't contain any null/missing values.
It is a little hard to tell if that is exactly where the issue is. I think the stack trace of the exception is less helpful than usual because the function is being run inside Spark tasks on the workers.
If that is the issue, you could rewrite customF to handle the case where str is null, or change the parameter type to Option[String] (and tweak the logic accordingly).
By the way, the same thing applies to UDFs. They need to either:
Accept Option types as input,
Handle the case where each argument is null, or
Only be applied to data with no missing values.
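If missing values are indeed the culprit, here is a minimal sketch of a null-safe variant of customF, assuming the pipe-delimited format from the question; the actual RasterFrames query is not reproduced here but passed in as a hypothetical lookup function:

import scala.util.Try

// Returns None for null lines, short rows, or non-numeric coordinates instead of throwing.
// `lookup` is a hypothetical stand-in for the RasterFrames elevation query above.
def customFSafe(str: String)(lookup: (Double, Double) => String): Option[String] =
  Option(str)
    .map(_.split('|'))
    .filter(_.length > 3)
    .flatMap { fields =>
      for {
        lat  <- Try(fields(2).toDouble).toOption
        long <- Try(fields(3).toDouble).toOption
      } yield lookup(lat, long)
    }

// flatMap silently drops the rows that could not be parsed.
val rdd2 = rddFromFile.flatMap(line => customFSafe(line)(elevationLookup))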

Error in returning spark rdd from a function called inside the map function

I have a collection of rowkeys (plants, as shown below) from an HBase table, and I want to write a fetchData function which returns the RDD data for each rowkey in the collection. The goal is to get the union of the RDDs returned by fetchData for each element of the plants collection. I have given the relevant part of the code below. My issue is that the code gives a compilation error for the return type of fetchData:
println("PartB: "+ hBaseRDD.getNumPartitions)
error: value getNumPartitions is not a member of Option[org.apache.spark.rdd.RDD[it.nerdammer.spark.test.sys.Record]]
I am using Scala 2.11.8, Spark 2.2.0 and Maven for compilation.
import it.nerdammer.spark.hbase._
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object sys {
  case class systems(rowkey: String, iacp: Option[String], temp: Option[String])

  val spark = SparkSession.builder().appName("myApp").config("spark.executor.cores", 4).getOrCreate()
  import spark.implicits._

  type Record = (String, Option[String], Option[String])

  def fetchData(plant: String): RDD[Record] = {
    val start_index = plant
    val end_index = plant + "z"
    // The below command works fine if I run it in the main function, but to get multiple rows
    // from HBase I am using it in a separate function
    spark.sparkContext.hbaseTable[Record]("test_table").select("iacp", "temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
  }

  def main(args: Array[String]) {
    // The below elements in the collection are prefixes of relevant rowkeys in the HBase table ("test_table")
    val plants = Vector("a8", "cu", "aw", "fx")
    val hBaseRDD = plants.map(pp => fetchData(pp))
    println("Part: " + hBaseRDD.getNumPartitions)
    /*
    rest of the code
    */
  }
}
Here is the working version of the code. The problem here is that I am using a for loop, so I have to request the data corresponding to each rowkey prefix (plants) from HBase in every iteration, instead of getting all the data first and then executing the rest of the code.
import it.nerdammer.spark.hbase._
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object sys {
  case class systems(rowkey: String, iacp: Option[String], temp: Option[String])

  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("myApp").config("spark.executor.cores", 4).getOrCreate()
    import spark.implicits._

    type Record = (String, Option[String], Option[String])

    val plants = Vector("a8", "cu", "aw", "fx")

    for (plant <- plants) {
      val start_index = plant
      val end_index = plant + "z"
      val hBaseRDD = spark.sparkContext.hbaseTable[Record]("test_table").select("iacp", "temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
      println("Part: " + hBaseRDD.getNumPartitions)
      /*
      rest of the code
      */
    }
  }
}
After trying, this is where I am stuck now. So how can I cast the type to the required one?
scala> def fetchData(plant: String) = {
| val start_index = plant
| val end_index = plant + "~"
| val x1 = spark.sparkContext.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
| x1
| }
Defining the function in the REPL and running it:
scala> val hBaseRDD = plants.map( pp => fetchData(pp)).reduceOption(_ union _)
<console>:39: error: type mismatch;
found : org.apache.spark.rdd.RDD[(String, Option[String], Option[String])]
required: it.nerdammer.spark.hbase.HBaseReaderBuilder[(String, Option[String], Option[String])]
val hBaseRDD = plants.map( pp => fetchData(pp)).reduceOption(_ union _)
Thanks in Advance!
The type of hBaseRDD is Vector[_], not RDD[_], so you can't call the getNumPartitions method on it. If I understand correctly, you want to union the fetched RDDs. You can do that with plants.map(pp => fetchData(pp)).reduceOption(_ union _) (I recommend reduceOption because it won't fail on an empty list, but you can use reduce if you are confident the list is not empty).
Also, the declared return type of fetchData is RDD[U], but I didn't find any definition of U. This is probably why the compiler infers Vector[Nothing] instead of Vector[RDD[Record]]. In order to avoid subsequent errors you should also change RDD[U] to RDD[Record].
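A rough sketch of what the suggested changes might look like, assuming the spark session and imports from the question, and assuming the it.nerdammer import brings the implicit HBaseReaderBuilder-to-RDD conversion into scope so the explicit RDD[Record] return type can be satisfied:

import it.nerdammer.spark.hbase._
import org.apache.spark.rdd.RDD

type Record = (String, Option[String], Option[String])

// Explicit return type so the reader builder is materialised as an RDD.
def fetchData(plant: String): RDD[Record] =
  spark.sparkContext.hbaseTable[Record]("test_table")
    .select("iacp", "temp")
    .inColumnFamily("pp")
    .withStartRow(plant)
    .withStopRow(plant + "~")

val plants = Vector("a8", "cu", "aw", "fx")

// Union all fetched RDDs; fall back to an empty RDD if plants happens to be empty.
val hBaseRDD: RDD[Record] =
  plants.map(fetchData).reduceOption(_ union _).getOrElse(spark.sparkContext.emptyRDD[Record])

println("Part: " + hBaseRDD.getNumPartitions)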

Spark Scala | create DataFrame dynamically

I would like to create dataframe names dynamically from a collection.
Please see below:
val set1 = Set("category1","category2","category3")
The following is a function which takes a string x from the set as input and generates the DataFrame accordingly:
def catDfgen(x: String): DataFrame = {
spark.sql(s"select * from table where col1 = '$x'")
}
Now I need help here: I want to create not only the DataFrame but also have the DataFrame name generated dynamically, in order to achieve
val category1DF = catDfgen($x)
val category2DF = catDfgen($x)
...etc. Would it be possible to do it using the code below?
set1.map( x => val $x+"DF" = catDfgen($x))
If not, please suggest an effective method.
Suman, I believe the below might help your use-case
import org.apache.spark.sql.{DataFrame, SparkSession}

object Test extends App {
  val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()

  val set1 = Set("category1", "category2", "category3")

  val dfs: Map[String, DataFrame] = set1.map(x =>
    (s"${x}DF", spark.sql(s"select * from table where col1 = '$x'").alias(s"${x}DF").toDF())
  ).toMap

  dfs("category1DF").show()
  spark.stop()
}
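Scala cannot create new val names at runtime, so a Map keyed by the generated name (as above) is the usual substitute for "dynamically named" DataFrames. A brief usage sketch of that map:

// Look the frames up by their generated names, or iterate over all of them.
dfs.foreach { case (name, frame) =>
  println(s"$name: ${frame.count()} rows")  // hypothetical sanity check
}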

scala spark conversion error when creating dataframe

I am a newbie in Scala. Please be patient.
I have this code.
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.evaluation.ClusteringEvaluator
// create spark session
implicit val spark = SparkSession.builder().appName("clustering").getOrCreate()
// read file
val fileName = """file:///some_location/head_sessions_sample.csv"""
// create DF from file
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(fileName)
def inputKmeans(df: DataFrame, spark: SparkSession): DataFrame = {
  try {
    val a = df.select("id", "start_ts", "duration", "ip_dist")
      .map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3))))
      .toDF("id", "features")
    a
  } catch {
    case e: java.lang.ClassCastException => spark.emptyDataFrame
  }
}

val t = inputKmeans(df, spark).filter(_ != null)
t.foreach(r =>
  if (r.get(0) != null)
    println(r.get(0)))
For the moment, I want to ignore my conversion errors. But somehow, I still get them.
2018-09-24 11:26:22 ERROR Executor:91 - Exception in task 0.0 in stage
4.0 (TID 6) java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double
I don't think there is any point in giving a snapshot of the CSV. At this point, I just want to ignore conversion errors.
Any ideas why this is happening?
As mentioned in the comment, the issue is that the values are not of Double type.
val a = df.select("id", "start_ts", "duration", "ip_dist").map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3)))).toDF("id", "features")
Either cast to the correct data type, i.e. Long type (you can also provide the schema explicitly using a case class and apply it to the DataFrame),
or use the VectorAssembler to convert the columns into features. This is the easier and recommended approach.
import org.apache.spark.ml.feature.VectorAssembler

def inputKmeans(df: DataFrame, spark: SparkSession): DataFrame = {
  val assembler = new VectorAssembler()
    .setInputCols(Array("start_ts", "duration", "ip_dist"))
    .setOutputCol("features")
  val output = assembler.transform(df).select("id", "features")
  output
}
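For completeness, here is a rough sketch of the first option (casting the columns up front so getDouble no longer hits Long values). It reuses the DataFrame/SparkSession imports and column names from the question; the function name inputKmeansCast is just an illustration:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Sketch only: cast the measure columns to DoubleType before building the feature vector.
def inputKmeansCast(df: DataFrame, spark: SparkSession): DataFrame = {
  import spark.implicits._  // encoder for the (Int, Vector) tuple
  val casted = df.select(
    col("id").cast("int"),
    col("start_ts").cast(DoubleType),
    col("duration").cast(DoubleType),
    col("ip_dist").cast(DoubleType))
  casted.map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3))))
    .toDF("id", "features")
}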
I think I discovered the problem: the try/catch is placed at the level of the DataFrame creation, not at the level of the conversion. As a consequence, it catches problems related to creating the DataFrame, not conversion issues, which only surface later when the lazy map is actually executed.

OneHotEncoder in Spark Dataframe in Pipeline

I've been trying to get an example running in Spark and Scala with the adult dataset.
Using Scala 2.11.8 and Spark 1.6.1.
The problem (for now) lies in the number of categorical features in that dataset that all need to be encoded to numbers before a Spark ML algorithm can do its job.
So far I have this:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object Adult {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Adult example").setMaster("local[*]")
    val sparkContext = new SparkContext(conf)
    val sqlContext = new SQLContext(sparkContext)

    val data = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // Use first line of all files as header
      .option("inferSchema", "true") // Automatically infer data types
      .load("src/main/resources/adult.data")

    val categoricals = data.dtypes filter (_._2 == "StringType")
    val encoders = categoricals map (cat => new OneHotEncoder().setInputCol(cat._1).setOutputCol(cat._1 + "_encoded"))
    val features = data.dtypes filterNot (_._1 == "label") map (tuple => if (tuple._2 == "StringType") tuple._1 + "_encoded" else tuple._1)

    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)

    val pipeline = new Pipeline()
      .setStages(encoders ++ Array(lr))

    val model = pipeline.fit(training)
  }
}
However, this doesn't work. The data passed to pipeline.fit still contains the original string features, and so it throws an exception.
How can I remove these "StringType" columns in a pipeline?
Or maybe I'm doing it completely wrong, so if someone has a different suggestion I'm happy to hear any input :).
The reason I chose to follow this flow is that I have an extensive background in Python and Pandas, but am trying to learn both Scala and Spark.
There is one thing that can be rather confusing here if you're used to higher-level frameworks: you have to index the features before you can use the encoder. As explained in the API docs:
one-hot encoder (...) maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}

val df = Seq((1L, "foo"), (2L, "bar")).toDF("id", "x")

val categoricals = df.dtypes.filter(_._2 == "StringType").map(_._1)

val indexers = categoricals.map(
  c => new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
)

val encoders = categoricals.map(
  c => new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_enc")
)

val pipeline = new Pipeline().setStages(indexers ++ encoders)
val transformed = pipeline.fit(df).transform(df)
transformed.show
transformed.show
// +---+---+-----+-------------+
// | id| x|x_idx| x_enc|
// +---+---+-----+-------------+
// | 1|foo| 1.0| (1,[],[])|
// | 2|bar| 0.0|(1,[0],[1.0])|
// +---+---+-----+-------------+
As you can see, there is no need to drop string columns from the pipeline. In practice, OneHotEncoder will accept a numeric column with a NominalAttribute, BinaryAttribute or missing type attribute.
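Putting this together with the original question's code, here is a sketch of what the full pipeline might look like. It assumes a numeric label column named "label" (a string label would need its own StringIndexer) and adds a VectorAssembler, which is not in the original code, to collect the encoded and numeric columns into the features vector that LogisticRegression expects:

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Split the columns of the question's `data` DataFrame into categorical and numeric features.
val categoricals = data.dtypes.filter(_._2 == "StringType").map(_._1).filterNot(_ == "label")
val numerics     = data.dtypes.filterNot(_._2 == "StringType").map(_._1).filterNot(_ == "label")

// Index first, then one-hot encode, as shown in the answer above.
val indexers: Array[PipelineStage] = categoricals.map(
  c => new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"))
val encoders: Array[PipelineStage] = categoricals.map(
  c => new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_enc"))

// Assumption: gather everything into a single "features" column for the classifier.
val assembler = new VectorAssembler()
  .setInputCols(numerics ++ categoricals.map(c => s"${c}_enc"))
  .setOutputCol("features")

val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

val pipeline = new Pipeline()
  .setStages(indexers ++ encoders ++ Array(assembler, lr))

val model = pipeline.fit(data)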