Spark Scala CSV Column names to Lower Case - scala

Please find the code below and let me know how I can change the column names to lowercase. I tried withColumnRenamed, but then I have to call it for each column and type out every column name. There are too many columns, so I don't want to list them all; I just want to apply the change to every column at once.
Scala version: 2.11
Spark version: 2.2
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
import com.datastax
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql._

object dataframeset {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val rdd1 = sc.cassandraTable("tdata", "map3")
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)

    val spark1 = org.apache.spark.sql.SparkSession.builder()
      .master("local")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .appName("Spark SQL basic example")
      .getOrCreate()

    val df = spark1.read.format("csv")
      .option("header", "true")
      .option("inferschema", "true")
      .load("/Users/Desktop/del2.csv")
    import spark1.implicits._

    println("\nTop Records are:")
    df.show(1)

    val dfprev1 = df.select("sno", "year", "StateAbbr")
    dfprev1.show(1)
  }
}
Required output:
|sno|year|stateabbr| statedesc|cityname|geographiclevel
All of the column names should be in lowercase.
Actual output:
Top Records are:
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
|sno|year|StateAbbr| StateDesc|CityName|GeographicLevel|DataSource| category|UniqueID| Measure|Data_Value_Unit|DataValueTypeID| Data_Value_Type|Data_Value|Low_Confidence_Limit|High_Confidence_Limit|Data_Value_Footnote_Symbol|Data_Value_Footnote|PopulationCount|GeoLocation|categoryID|MeasureId|cityFIPS|TractFIPS|Short_Question_Text|
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
| 1|2014| US|United States| null| US| BRFSS|Prevention| 59|Current lack of h...| %| AgeAdjPrv|Age-adjusted prev...| 14.9| 14.6| 15.2| null| null| 308745538| null| PREVENT| ACCESS2| null| null| Health Insurance|
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
only showing top 1 row
+---+----+---------+
|sno|year|StateAbbr|
+---+----+---------+
| 1|2014| US|
+---+----+---------+
only showing top 1 row

Just use toDF:
df.toDF(df.columns.map(_.toLowerCase): _*)
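For example, applied to the df loaded in the question (a minimal sketch; toDF returns a new DataFrame, so the later select has to use the lowercased names):

val dfLower = df.toDF(df.columns.map(_.toLowerCase): _*)
dfLower.select("sno", "year", "stateabbr").show(1)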

Another way to achieve it is with foldLeft (the column names are already plain strings, so no splitting is needed):

val myDFcolNames = myDF.columns.toList
val rdoDenormDF = myDFcolNames.foldLeft(myDF)((df, c) =>
  df.withColumnRenamed(c, c.toLowerCase))
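Both approaches produce the same schema. toDF does it in a single projection, while the foldLeft variant calls withColumnRenamed once per column, which adds one projection per column to the logical plan (the optimizer collapses them, but the plan is noisier). A minimal equivalent sketch that folds over df.columns directly, without the intermediate list:

val dfLower2 = df.columns.foldLeft(df)((acc, c) => acc.withColumnRenamed(c, c.toLowerCase))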

Related

I tried to use groupBy on my DataFrame after adding a new column, but I ran into a "Task not serializable" error.

This is my code; I get the "Task not serializable" error when I call result.groupBy("value"):
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Test extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[4]")
    .appName("https://SparkByExamples.com")
    .getOrCreate()
  import spark.implicits._

  def myUDF = udf { (v: Double) =>
    if (v < 0) 100
    else 500
  }

  val central: DataFrame = Seq((1, 2014), (2, 2018)).toDF("key", "year1")
  val other1: DataFrame = Seq((1, 2016), (2, 2015)).toDF("key", "year2")

  val result = central.join(other1, Seq("key"))
    .withColumn("value", myUDF(col("year2")))
  result.show()

  val result2 = result.groupBy("value")
    .count()
  result2.show()
}
I ran the same code and did not get any "Task not serializable" error, so there must be a misconception somewhere.
import org.apache.log4j.Level
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Test extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark: SparkSession = SparkSession.builder()
    .master("local[4]")
    .appName("https://SparkByExamples.com")
    .getOrCreate()
  import spark.implicits._

  def myUDF = udf { (v: Double) =>
    if (v < 0) 100
    else 500
  }

  val central: DataFrame = Seq((1, 2014), (2, 2018)).toDF("key", "year1")
  val other1: DataFrame = Seq((1, 2016), (2, 2015)).toDF("key", "year2")

  val result = central.join(other1, Seq("key"))
    .withColumn("value", myUDF(col("year2")))
  result.show()

  val result2 = result.groupBy("value")
    .count()
  result2.show()
}
Result:
+---+-----+-----+-----+
|key|year1|year2|value|
+---+-----+-----+-----+
| 1| 2014| 2016| 500|
| 2| 2018| 2015| 500|
+---+-----+-----+-----+
+-----+-----+
|value|count|
+-----+-----+
| 500| 2|
+-----+-----+
Conclusion:
This kind of situation arises when your Spark version is not compatible with your Scala version. Check https://mvnrepository.com/artifact/org.apache.spark/spark-core for all Spark releases and the corresponding Scala versions you need to use.
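For example, with the Scala 2.11 / Spark 2.2 combination from the original question, a minimal build.sbt sketch (versions are illustrative, not prescribed by the answer) would be:

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql"  % "2.2.0"
)

The %% operator appends the Scala binary version (here _2.11) to the artifact name, which keeps the Spark artifacts in sync with the project's Scala version.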

Create a Dataset with a DataFrame from a sequence of tuples without using a case class

I have a sequence of tuples from which I made an RDD and converted that to a DataFrame, like below:
val rdd = sc.parallelize(Seq((1, "User1"), (2, "user2"), (3, "user3")))
import spark.implicits._
val df = rdd.toDF("Id", "firstname")
Now I want to create a Dataset from df. How can I do that?
Simply df.as[(Int, String)] is what you need to do. Please see the full example below.
package com.examples

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}

object SeqTuplesToDataSet {
  org.apache.log4j.Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName(this.getClass.getName).config("spark.master", "local").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    val rdd = spark.sparkContext.parallelize(Seq((1, "User1"), (2, "user2"), (3, "user3")))
    import spark.implicits._
    val df = rdd.toDF("Id", "firstname")
    val myds: Dataset[(Int, String)] = df.as[(Int, String)]
    myds.show()
  }
}
Result:
+---+---------+
| Id|firstname|
+---+---------+
| 1| User1|
| 2| user2|
| 3| user3|
+---+---------+
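As a side note, not something the question asked for: if the Seq is already available on the driver, you can skip the RDD entirely and build the Dataset directly with toDS, then rename the columns. A minimal sketch, assuming the same spark session and import spark.implicits._ as above:

val ds = Seq((1, "User1"), (2, "user2"), (3, "user3")).toDS()
val named: Dataset[(Int, String)] = ds.toDF("Id", "firstname").as[(Int, String)]
named.show()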

Why do flatten and collect_list give a "cannot resolve symbol" error in Scala?

I have a dataset that I am trying to flatten using Scala:
+---------+-----------+--------+
|visitorId|trackingIds|emailIds|
+---------+-----------+--------+
| a | 666b| 12|
| 7 | c0b5| 45|
| 7 | c0b4| 87|
| a | 666b,7p88| |
+---------+-----------+--------+
I am trying to get a DataFrame that is grouped by visitorId, in the format below:
+---------+---------------------+--------+
|visitorId| trackingIds |emailIds|
+---------+---------------------+--------+
| a | 666b,666b,7p88| 12,87|
| 7 | c0b4,c0b5 | 45|
+---------+---------------------+--------+
My code:
object flatten_data {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[5]")
      .appName("Flatten_DF")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.read.format("csv")
      .option("header", "true")
      .option("delimiter", ",")
      .load("/home/cloudera/Desktop/data.txt")
    df.show()

    val flattened = df.groupBy("visitorID").agg(collect_list("trackingIds"))
  }
}
I am using IntelliJ IDEA and I am getting an error on collect_list ("cannot resolve symbol").
I have read through many solutions on Stack Overflow where people ask how to flatten and group by key, and they use the same collect_list. I am not sure why this is not working for me. Is it because of IntelliJ?
I reworked your code and this seems to work:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object flatten_data {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
    import spark.implicits._

    val someDF = Seq(
      ("a", "666b", 12),
      ("7", "c0b5", 45),
      ("7", "666b,7p88", 10)
    ).toDF("visitorId", "trackingIds", "emailIds")

    someDF.groupBy("visitorID").agg(collect_list("trackingIds")).show()
  }
}
collect_list is a method defined in the org.apache.spark.sql.functions object, so you need to import it:
import org.apache.spark.sql.functions.collect_list
Alternatively, you can import the entire object, then you'll be able to use other functions from there as well:
import org.apache.spark.sql.functions._
Finally, the approach I personally prefer is to import functions as f, and use qualified calls:
import org.apache.spark.sql.{functions => f}
agg(f.collect_list(...))
This way, the global namespace inside the file is not polluted by the entire host of functions defined in functions.
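Applied to the reworked example above, the qualified style looks like this (a sketch assuming someDF is in scope; the extra emailIds aggregation is mine, added to mirror the desired output in the question):

import org.apache.spark.sql.{functions => f}

someDF.groupBy("visitorId")
  .agg(f.collect_list("trackingIds").as("trackingIds"),
       f.collect_list("emailIds").as("emailIds"))
  .show(false)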

Multiple Filter of Dataframe on Spark with Scala

I am trying to filter this txt file
TotalCost|BirthDate|Gender|TotalChildren|ProductCategoryName
1000||Male|2|Technology
2000|1957-03-06||3|Beauty
3000|1959-03-06|Male||Car
4000|1953-03-06|Male|2|
5000|1957-03-06|Female|3|Beauty
6000|1959-03-06|Male|4|Car
I simply want to filter every row and drop it if any column has a null element.
In my sample dataset there are three such rows.
However, I am getting an empty result when I run the code. Am I missing something?
This is my code in Scala:
import org.apache.spark.sql.SparkSession

object DataFrameFromCSVFile {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("SparkByExample")
      .getOrCreate()
    import spark.implicits._ // needed for the $"..." column syntax

    val filePath = "src/main/resources/demodata.txt"
    val df = spark.read.options(Map("inferSchema" -> "true", "delimiter" -> "|", "header" -> "true")).csv(filePath)

    df.where(!$"Gender".isNull && !$"TotalChildren".isNull).show
  }
}
The project is in IntelliJ.
Thank you a lot.
You can do this in multiple ways; below is one.
import org.apache.spark.sql.SparkSession

object DataFrameFromCSVFile2 {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()

    val filePath = "src/main/resources/demodata.txt"
    val df = spark.read.options(Map("inferSchema" -> "true", "delimiter" -> "|", "header" -> "true")).csv(filePath)

    val df2 = df.select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")
      .filter("Gender is not null")
      .filter("BirthDate is not null")
      .filter("TotalChildren is not null")
      .filter("ProductCategoryName is not null")
    df2.show()
  }
}
Output:
+------+-------------------+---------+-------------+-------------------+
|Gender| BirthDate|TotalCost|TotalChildren|ProductCategoryName|
+------+-------------------+---------+-------------+-------------------+
|Female|1957-03-06 00:00:00| 5000| 3| Beauty|
| Male|1959-03-06 00:00:00| 6000| 4| Car|
+------+-------------------+---------+-------------+-------------------+
Thanks,
Naveen
You can also just filter it directly on the DataFrame, as below:
df.where(!$"Gender".isNull && !$"TotalChildren".isNull).show

How can I lit an Option when converting from a Dataset to a DataFrame?

So this is what I've been trying:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._

val conf = new SparkConf().setMaster("local[*]").setAppName("test")
  .set("spark.ui.enabled", "false").set("spark.app.id", "testApp")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

case class B(s: String)
case class A(i: Int, b: Option[B])

val df = Seq(1, 2, 3).map(Tuple1.apply).toDF

// lit with a struct works fine
df.select(col("_1").as("i"), struct(lit("myString").as("s")).as("b")).as[A].show

/*
+---+-----------------+
|  i|                b|
+---+-----------------+
|  1|Some(B(myString))|
|  2|Some(B(myString))|
|  3|Some(B(myString))|
+---+-----------------+
*/

// lit with a null throws an exception
df.select(col("_1").as("i"), lit(null).as("b")).as[A].show

/*
org.apache.spark.sql.AnalysisException: Can't extract value from b#16;
  at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:475)
*/
Use correct types:
import org.apache.spark.sql.types._
val s = StructType(Seq(StructField("s", StringType)))
df.select(col("_1").as("i"), lit(null).cast(s).alias("b")).as[A].show
lit(null) alone is typed as NullType, so it won't match the expected struct type.
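If you would rather not spell the struct out by hand, the same schema can be derived from the case class itself. A sketch, assuming B is defined where an encoder can be resolved for it (e.g. as a top-level case class):

import org.apache.spark.sql.Encoders

val bSchema = Encoders.product[B].schema
df.select(col("_1").as("i"), lit(null).cast(bSchema).alias("b")).as[A].show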