Spark Scala CSV Column names to Lower Case - scala

Please find the code below and let me know how I can change the column names to lowercase. I tried withColumnRenamed, but then I have to call it for each column and type out every column name. There are too many columns, so I don't want to list them all; I just want to apply the change to every column at once.
Scala version: 2.11
Spark version: 2.2
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
import com.datastax
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql._

object dataframeset {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val rdd1 = sc.cassandraTable("tdata", "map3")
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)

    val spark1 = org.apache.spark.sql.SparkSession.builder()
      .master("local")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .appName("Spark SQL basic example")
      .getOrCreate()

    val df = spark1.read.format("csv")
      .option("header", "true")
      .option("inferschema", "true")
      .load("/Users/Desktop/del2.csv")
    import spark1.implicits._

    println("\nTop Records are:")
    df.show(1)

    val dfprev1 = df.select("sno", "year", "StateAbbr")
    dfprev1.show(1)
  }
}
Required output:
|sno|year|stateabbr| statedesc|cityname|geographiclevel
All of the column names should be in lowercase.
Actual output:
Top Records are:
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
|sno|year|StateAbbr| StateDesc|CityName|GeographicLevel|DataSource| category|UniqueID| Measure|Data_Value_Unit|DataValueTypeID| Data_Value_Type|Data_Value|Low_Confidence_Limit|High_Confidence_Limit|Data_Value_Footnote_Symbol|Data_Value_Footnote|PopulationCount|GeoLocation|categoryID|MeasureId|cityFIPS|TractFIPS|Short_Question_Text|
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
| 1|2014| US|United States| null| US| BRFSS|Prevention| 59|Current lack of h...| %| AgeAdjPrv|Age-adjusted prev...| 14.9| 14.6| 15.2| null| null| 308745538| null| PREVENT| ACCESS2| null| null| Health Insurance|
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
only showing top 1 row
+---+----+---------+
|sno|year|StateAbbr|
+---+----+---------+
| 1|2014| US|
+---+----+---------+
only showing top 1 row

Just use toDF:
df.toDF(df.columns.map(_.toLowerCase): _*)
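For example, applied to the df loaded in the question (a minimal sketch; toDF returns a new DataFrame, so the later select has to use the lowercased names):

val dfLower = df.toDF(df.columns.map(_.toLowerCase): _*)
dfLower.select("sno", "year", "stateabbr").show(1)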

Another way to achieve it is with foldLeft (the column names are already plain strings, so no splitting is needed):

val myDFcolNames = myDF.columns.toList
val rdoDenormDF = myDFcolNames.foldLeft(myDF)((df, c) =>
  df.withColumnRenamed(c, c.toLowerCase))
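Both approaches produce the same schema. toDF does it in a single projection, while the foldLeft variant calls withColumnRenamed once per column, which adds one projection per column to the logical plan (the optimizer collapses them, but the plan is noisier). A minimal equivalent sketch that folds over df.columns directly, without the intermediate list:

val dfLower2 = df.columns.foldLeft(df)((acc, c) => acc.withColumnRenamed(c, c.toLowerCase))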

Related

I tried to use groupBy on my DataFrame after adding a new column, but I ran into a "Task not serializable" error.

This is my code; I get the "Task not serializable" error when I call result.groupBy("value"):
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Test extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[4]")
    .appName("https://SparkByExamples.com")
    .getOrCreate()
  import spark.implicits._

  def myUDF = udf { (v: Double) =>
    if (v < 0) 100
    else 500
  }

  val central: DataFrame = Seq((1, 2014), (2, 2018)).toDF("key", "year1")
  val other1: DataFrame = Seq((1, 2016), (2, 2015)).toDF("key", "year2")

  val result = central.join(other1, Seq("key"))
    .withColumn("value", myUDF(col("year2")))
  result.show()

  val result2 = result.groupBy("value")
    .count()
  result2.show()
}
I ran the same code and did not get any "Task not serializable" error, so there must be a misconception somewhere.
import org.apache.log4j.Level
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Test extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark: SparkSession = SparkSession.builder()
    .master("local[4]")
    .appName("https://SparkByExamples.com")
    .getOrCreate()
  import spark.implicits._

  def myUDF = udf { (v: Double) =>
    if (v < 0) 100
    else 500
  }

  val central: DataFrame = Seq((1, 2014), (2, 2018)).toDF("key", "year1")
  val other1: DataFrame = Seq((1, 2016), (2, 2015)).toDF("key", "year2")

  val result = central.join(other1, Seq("key"))
    .withColumn("value", myUDF(col("year2")))
  result.show()

  val result2 = result.groupBy("value")
    .count()
  result2.show()
}
Result:
+---+-----+-----+-----+
|key|year1|year2|value|
+---+-----+-----+-----+
| 1| 2014| 2016| 500|
| 2| 2018| 2015| 500|
+---+-----+-----+-----+
+-----+-----+
|value|count|
+-----+-----+
| 500| 2|
+-----+-----+
Conclusion:
This kind of situation arises when your Spark version is not compatible with your Scala version. Check https://mvnrepository.com/artifact/org.apache.spark/spark-core for all Spark releases and the corresponding Scala versions you need to use.
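For example, with the Scala 2.11 / Spark 2.2 combination from the original question, a minimal build.sbt sketch (versions are illustrative, not prescribed by the answer) would be:

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql"  % "2.2.0"
)

The %% operator appends the Scala binary version (here _2.11) to the artifact name, which keeps the Spark artifacts in sync with the project's Scala version.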

Create a Dataset with a DataFrame from a sequence of tuples without using a case class

I have a sequence of tuples from which I made an RDD and converted that to a DataFrame, like below:
val rdd = sc.parallelize(Seq((1, "User1"), (2, "user2"), (3, "user3")))
import spark.implicits._
val df = rdd.toDF("Id", "firstname")
Now I want to create a Dataset from df. How can I do that?
Simply df.as[(Int, String)] is what you need to do. Please see the full example below.
package com.examples

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}

object SeqTuplesToDataSet {
  org.apache.log4j.Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName(this.getClass.getName).config("spark.master", "local").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    val rdd = spark.sparkContext.parallelize(Seq((1, "User1"), (2, "user2"), (3, "user3")))
    import spark.implicits._
    val df = rdd.toDF("Id", "firstname")
    val myds: Dataset[(Int, String)] = df.as[(Int, String)]
    myds.show()
  }
}
Result:
+---+---------+
| Id|firstname|
+---+---------+
| 1| User1|
| 2| user2|
| 3| user3|
+---+---------+
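As a side note, not something the question asked for: if the Seq is already available on the driver, you can skip the RDD entirely and build the Dataset directly with toDS, then rename the columns. A minimal sketch, assuming the same spark session and import spark.implicits._ as above:

val ds = Seq((1, "User1"), (2, "user2"), (3, "user3")).toDS()
val named: Dataset[(Int, String)] = ds.toDF("Id", "firstname").as[(Int, String)]
named.show()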

Why do flatten and collect_list give a "cannot resolve symbol" error in Scala?

I have a dataset that I am trying to flatten using Scala:
+---------+-----------+--------+
|visitorId|trackingIds|emailIds|
+---------+-----------+--------+
| a | 666b| 12|
| 7 | c0b5| 45|
| 7 | c0b4| 87|
| a | 666b,7p88| |
+---------+-----------+--------+
I am trying to get a DataFrame that is grouped by visitorId, in the format below:
+---------+---------------------+--------+
|visitorId| trackingIds |emailIds|
+---------+---------------------+--------+
| a | 666b,666b,7p88| 12,87|
| 7 | c0b4,c0b5 | 45|
+---------+---------------------+--------+
My code:
object flatten_data {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[5]")
      .appName("Flatten_DF")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.read.format("csv")
      .option("header", "true")
      .option("delimiter", ",")
      .load("/home/cloudera/Desktop/data.txt")
    df.show()

    val flattened = df.groupBy("visitorID").agg(collect_list("trackingIds"))
  }
}
I am using IntelliJ IDEA and I am getting an error on collect_list ("cannot resolve symbol").
I have read through many solutions on Stack Overflow where people ask how to flatten and group by key, and they use the same collect_list. I am not sure why this is not working for me. Is it because of IntelliJ?
I reworked your code and this seems to work:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object flatten_data {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
    import spark.implicits._

    val someDF = Seq(
      ("a", "666b", 12),
      ("7", "c0b5", 45),
      ("7", "666b,7p88", 10)
    ).toDF("visitorId", "trackingIds", "emailIds")

    someDF.groupBy("visitorID").agg(collect_list("trackingIds")).show()
  }
}
collect_list is a method defined in the org.apache.spark.sql.functions object, so you need to import it:
import org.apache.spark.sql.functions.collect_list
Alternatively, you can import the entire object, then you'll be able to use other functions from there as well:
import org.apache.spark.sql.functions._
Finally, the approach I personally prefer is to import functions as f, and use qualified calls:
import org.apache.spark.sql.{functions => f}
agg(f.collect_list(...))
This way, the global namespace inside the file is not polluted by the entire host of functions defined in functions.
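Applied to the reworked example above, the qualified style looks like this (a sketch assuming someDF is in scope; the extra emailIds aggregation is mine, added to mirror the desired output in the question):

import org.apache.spark.sql.{functions => f}

someDF.groupBy("visitorId")
  .agg(f.collect_list("trackingIds").as("trackingIds"),
       f.collect_list("emailIds").as("emailIds"))
  .show(false)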

Multiple Filter of Dataframe on Spark with Scala

I am trying to filter this txt file
TotalCost|BirthDate|Gender|TotalChildren|ProductCategoryName
1000||Male|2|Technology
2000|1957-03-06||3|Beauty
3000|1959-03-06|Male||Car
4000|1953-03-06|Male|2|
5000|1957-03-06|Female|3|Beauty
6000|1959-03-06|Male|4|Car
I simply want to filter every row and drop it if any column has a null element.
In my sample dataset there are three such rows.
However, I am getting an empty result when I run the code. Am I missing something?
This is my code in Scala:
import org.apache.spark.sql.SparkSession

object DataFrameFromCSVFile {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("SparkByExample")
      .getOrCreate()
    import spark.implicits._ // needed for the $"..." column syntax

    val filePath = "src/main/resources/demodata.txt"
    val df = spark.read.options(Map("inferSchema" -> "true", "delimiter" -> "|", "header" -> "true")).csv(filePath)

    df.where(!$"Gender".isNull && !$"TotalChildren".isNull).show
  }
}
The project is in IntelliJ.
Thank you a lot.
You can do this in multiple ways; below is one.
import org.apache.spark.sql.SparkSession

object DataFrameFromCSVFile2 {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()

    val filePath = "src/main/resources/demodata.txt"
    val df = spark.read.options(Map("inferSchema" -> "true", "delimiter" -> "|", "header" -> "true")).csv(filePath)

    val df2 = df.select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")
      .filter("Gender is not null")
      .filter("BirthDate is not null")
      .filter("TotalChildren is not null")
      .filter("ProductCategoryName is not null")
    df2.show()
  }
}
Output:
+------+-------------------+---------+-------------+-------------------+
|Gender| BirthDate|TotalCost|TotalChildren|ProductCategoryName|
+------+-------------------+---------+-------------+-------------------+
|Female|1957-03-06 00:00:00| 5000| 3| Beauty|
| Male|1959-03-06 00:00:00| 6000| 4| Car|
+------+-------------------+---------+-------------+-------------------+
Thanks,
Naveen
You can also just filter it directly on the DataFrame, as below:
df.where(!$"Gender".isNull && !$"TotalChildren".isNull).show

How can I lit an Option when converting from a Dataset to a DataFrame?

So this is what I've been trying:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._

val conf = new SparkConf().setMaster("local[*]").setAppName("test")
  .set("spark.ui.enabled", "false").set("spark.app.id", "testApp")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

case class B(s: String)
case class A(i: Int, b: Option[B])

val df = Seq(1, 2, 3).map(Tuple1.apply).toDF

// lit with a struct works fine
df.select(col("_1").as("i"), struct(lit("myString").as("s")).as("b")).as[A].show

/*
+---+-----------------+
|  i|                b|
+---+-----------------+
|  1|Some(B(myString))|
|  2|Some(B(myString))|
|  3|Some(B(myString))|
+---+-----------------+
*/

// lit with a null throws an exception
df.select(col("_1").as("i"), lit(null).as("b")).as[A].show

/*
org.apache.spark.sql.AnalysisException: Can't extract value from b#16;
  at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:475)
*/
Use correct types:
import org.apache.spark.sql.types._
val s = StructType(Seq(StructField("s", StringType)))
df.select(col("_1").as("i"), lit(null).cast(s).alias("b")).as[A].show
lit(null) alone is typed as NullType, so it won't match the expected struct type.
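If you would rather not spell the struct out by hand, the same schema can be derived from the case class itself. A sketch, assuming B is defined where an encoder can be resolved for it (e.g. as a top-level case class):

import org.apache.spark.sql.Encoders

val bSchema = Encoders.product[B].schema
df.select(col("_1").as("i"), lit(null).cast(bSchema).alias("b")).as[A].show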