Spark - Make dataframe with multi column csv - scala

origin.csv
no,key1,key2,key3,key4,key5,...
1,A1,B1,C1,D1,E1,..
2,A2,B2,C2,D2,E2,..
3,A3,B3,C3,D3,E3,..
WhatIwant.csv
1,A1,key1
1,B1,key2
1,C1,key3
...
3,A3,key1
3,B3,key2
...
I loaded the CSV with the read method (the origin.csv dataframe), but I am unable to convert it.
val df = spark.read
.option("header", true)
.option("charset", "euc-kr")
.csv(csvFilePath)
Any idea how to do this?

Try this.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for the $"..." column syntax and toDF

val df = Seq((1,"A1","B1","C1","D1"), (2,"A2","B2","C2","D2"), (3,"A3","B3","C3","D3")).toDF("no", "key1", "key2", "key3", "key4")
df.show

// Despite the name, this is a plain helper function, not a UDF:
// it unpivots every column except the ones listed in `by`.
def myUDF(df: DataFrame, by: Seq[String]): DataFrame = {
  val (columns, types) = df.dtypes.filter { case (clm, _) => !by.contains(clm) }.unzip
  require(types.distinct.size == 1)
  val keys = explode(array(
    columns.map(clm => struct(lit(clm).alias("key"), col(clm).alias("val"))): _*
  ))
  val byValue = by.map(col(_))
  df.select(byValue :+ keys.alias("_key"): _*)
    .select(byValue ++ Seq($"_key.val", $"_key.key"): _*)
}

val df1 = myUDF(df, Seq("no"))
df1.show
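As an alternative, a similar unpivot can be written with Spark SQL's stack generator. A minimal sketch, assuming the same four key columns as the example above (adjust the count and column list to your data):

// stack(n, label1, col1, label2, col2, ...) emits one row per (label, value) pair
val unpivoted = df.selectExpr(
  "no",
  "stack(4, 'key1', key1, 'key2', key2, 'key3', key3, 'key4', key4) as (key, val)"
).select($"no", $"val", $"key")  // reorder to match the desired output: no, value, key
unpivoted.show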

Related

I don't know how to do the same using a Parquet file.

Link to (data.csv) and (output.csv)
import org.apache.spark.sql._

object Test {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("Test")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val tempDF = spark.read.csv("data.csv")
    tempDF.coalesce(1).write.parquet("Parquet")
    val rdd = sc.textFile("Parquet")
I converted data.csv into an optimised Parquet file and then loaded it. Now I want to apply all the transformations to the Parquet file, just as I did on the CSV file given below, and then save the result as a Parquet file. Links to (data.csv) and (output.csv) are above.
    val header = rdd.first
    val rdd1 = rdd.filter(_ != header)
    val resultRDD = rdd1.map { r =>
      val Array(country, values) = r.split(",")
      country -> values
    }.reduceByKey((a, b) => a.split(";").zip(b.split(";")).map { case (i1, i2) => i1.toInt + i2.toInt }.mkString(";"))

    import spark.sqlContext.implicits._
    val dataSet = resultRDD.map { case (country: String, values: String) => CountryAgg(country, values) }.toDS()
    dataSet.coalesce(1).write.option("header", "true").csv("output")
  }

  case class CountryAgg(country: String, values: String)
}
I reckon you are trying to add up the corresponding elements from the array based on Country. I have done this using the DataFrame API, which makes the job easier.
Code for your reference:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("path", "/path/to/input/data.csv")
  .load()

val df1 = df.select(
    $"Country",
    (split($"Values", ";"))(0).alias("c1"),
    (split($"Values", ";"))(1).alias("c2"),
    (split($"Values", ";"))(2).alias("c3"),
    (split($"Values", ";"))(3).alias("c4"),
    (split($"Values", ";"))(4).alias("c5")
  )
  .groupBy($"Country")
  .agg(
    sum($"c1" cast "int").alias("s1"),
    sum($"c2" cast "int").alias("s2"),
    sum($"c3" cast "int").alias("s3"),
    sum($"c4" cast "int").alias("s4"),
    sum($"c5" cast "int").alias("s5")
  )
  .select(
    $"Country",
    concat(
      $"s1", lit(";"),
      $"s2", lit(";"),
      $"s3", lit(";"),
      $"s4", lit(";"),
      $"s5"
    ).alias("Values")
  )

df1.repartition(1)
  .write
  .format("csv")
  .option("delimiter", ",")
  .option("header", "true")
  .option("path", "/path/to/output")
  .save()
Here is the output for your reference.
scala> df1.show()
+-------+-------------------+
|Country| Values|
+-------+-------------------+
|Germany| 144;166;151;172;70|
| China| 218;239;234;209;75|
| India| 246;153;148;100;90|
| Canada| 183;258;150;263;71|
|England|178;114;175;173;153|
+-------+-------------------+
P.S.:
You can change the output format to parquet/orc or anything you wish.
I have repartitioned df1 into 1 partition just so that you get a single output file. You can choose whether to repartition based on your use case.
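Since the original goal was a Parquet output, the same aggregated result can also be written as Parquet instead of CSV; a minimal sketch (the output path is a placeholder):

// Write df1 as Parquet rather than CSV
df1.repartition(1)
  .write
  .mode("overwrite")
  .parquet("/path/to/output_parquet")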
Hope this helps.
You could just read the file as parquet and perform the same operations on the resulting dataframe:
val spark = SparkSession.builder()
.appName("Test")
.master("local[*]")
.getOrCreate()
// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("data.parquet")
If you need an rdd you can then just call:
val rdd = parquetFileDF.rdd
Then you can proceed with the transformations as before and write the result as Parquet, as you have in your question.
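One caveat, shown in a small sketch under the assumption that the Parquet file holds the two string columns country and values: the elements of parquetFileDF.rdd are Row objects rather than raw CSV lines, so field access replaces the split(",") step used for the text file.

// Rows from a DataFrame-backed RDD are Rows, not strings,
// so the fields are read by position instead of splitting on ","
val resultRDD = parquetFileDF.rdd
  .map(row => row.getString(0) -> row.getString(1))
  .reduceByKey((a, b) =>
    a.split(";").zip(b.split(";")).map { case (x, y) => x.toInt + y.toInt }.mkString(";"))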

spark Scala RDD to DataFrame Date format

Would you be able to help with this Spark problem statement?
Data -
empno|ename|designation|manager|hire_date|sal|deptno
7369|SMITH|CLERK|9902|2010-12-17|800.00|20
7499|ALLEN|SALESMAN|9698|2011-02-20|1600.00|30
Code:
val rawrdd = spark.sparkContext.textFile("C:\\Users\\cmohamma\\data\\delta scenarios\\emp_20191010.txt")
val refinedRDD = rawrdd.map( lines => {
  val fields = lines.split("\\|")
  (fields(0).toInt, fields(1), fields(2), fields(3).toInt, fields(4).toDate, fields(5).toFloat, fields(6).toInt)
})
Problem statement - fields(4).toDate is not working. What is the alternative, or what is the correct usage?
What have I tried?
1. Tried replacing it with to_date(col(fields(4)), "yyy-MM-dd") - not working.
2.
Step 1.
val refinedRDD = rawrdd.map( lines => {
val fields = lines.split("\\|")
(fields(0),fields(1),fields(2),fields(3),fields(4),fields(5),fields(6))
})
Now these tuples are all strings.
Step 2.
val mySchema = StructType(Seq(StructField("empno", IntegerType, true), StructField("ename", StringType, true), StructField("designation", StringType, true), StructField("manager", IntegerType, true), StructField("hire_date", DateType, true), StructField("sal", DoubleType, true), StructField("deptno", IntegerType, true))) // requires import org.apache.spark.sql.types._
Step 3. converting the string tuples to Rows
val rowRDD = refinedRDD.map(attributes => Row(attributes._1, attributes._2, attributes._3, attributes._4, attributes._5 , attributes._6, attributes._7))
Step 4.
val empDF = spark.createDataFrame(rowRDD, mySchema)
This is also not working and gives an error related to types. To solve this I changed step 1 to:
(fields(0).toInt,fields(1),fields(2),fields(3).toInt,fields(4),fields(5).toFloat,fields(6).toInt)
Now this is giving an error for the date type column and I am back at the main problem.
Use case - use the textFile API and convert this to a dataframe using a custom schema (StructType) on top of it.
This can be done using a case class, but with a case class I would also be stuck where I need to do fields(4).toDate (I know I can cast the string to a date later in the code, but I would prefer a solution to the above problem if one is possible).
You can use the following code snippet
import org.apache.spark.sql.functions.to_timestamp
scala> val df = spark.read.format("csv").option("header", "true").option("delimiter", "|").load("gs://otif-etl-input/test.csv")
df: org.apache.spark.sql.DataFrame = [empno: string, ename: string ... 5 more fields]
scala> val ts = to_timestamp($"hire_date", "yyyy-MM-dd")
ts: org.apache.spark.sql.Column = to_timestamp(`hire_date`, 'yyyy-MM-dd')
scala> val enriched_df = df.withColumn("ts", ts).show(2, false)
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|empno|ename|designation|manager|hire_date |sal |deptno |ts |
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|7369 |SMITH|CLERK |9902 |2010-12-17|800.00 |20 |2010-12-17 00:00:00|
|7499 |ALLEN|SALESMAN |9698 |2011-02-20|1600.00|30 |2011-02-20 00:00:00|
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
enriched_df: Unit = ()
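If a DateType column is wanted rather than a timestamp, to_date can be used the same way; a small sketch, assuming Spark 2.2+ where to_date accepts a format string:

// to_date yields a DateType column; withColumn replaces hire_date in place
import org.apache.spark.sql.functions.to_date
val withDate = df.withColumn("hire_date", to_date($"hire_date", "yyyy-MM-dd"))
withDate.printSchema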
There are multiple ways to cast your data to proper data types.
First : use InferSchema
val df = spark.read.option("delimiter", "|").option("header", true).option("inferSchema", "true").csv(path)
df.printSchema
Sometimes it doesn't work as expected; the inferred types may not be what you want.
Second : provide your own Datatype conversion template
val rawDF = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
//define schema in DF , hire_date as Date
val schemaDF = Seq(("empno", "INT"), ("ename", "STRING"), ("hire_date", "date"), ("sal", "double")).toDF("columnName", "columnType")
rawDF.printSchema
//fetch schema details
val dataTypes = schemaDF.select("columnName", "columnType")
val listOfElements = dataTypes.collect.map(_.toSeq.toList)
//creating a map friendly template
val validationTemplate = (c: Any, t: Any) => {
val column = c.asInstanceOf[String]
val typ = t.asInstanceOf[String]
col(column).cast(typ)
}
//Apply datatype conversion template on rawDF
val convertedDF = rawDF.select(listOfElements.map(element => validationTemplate(element(0), element(1))): _*)
println("Conversion done!")
convertedDF.show()
convertedDF.printSchema
Third : Case Class
Create schema from caseclass with ScalaReflection and provide this customized schema while loading DF.
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types._
case class MySchema(empno: Int, ename: String, hire_date: java.sql.Date, sal: Double)
val schema = ScalaReflection.schemaFor[MySchema].dataType.asInstanceOf[StructType]
val rawDF = spark.read.schema(schema).option("header", "true").csv(path)
rawDF.printSchema
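One more possibility worth mentioning, as a sketch assuming a pipe-delimited file with a header on Spark 2.x: pass the schema to the CSV reader together with its dateFormat option, so hire_date is parsed as DateType at load time with no manual conversion.

import org.apache.spark.sql.types._
val empSchema = StructType(Seq(
  StructField("empno", IntegerType, true),
  StructField("ename", StringType, true),
  StructField("designation", StringType, true),
  StructField("manager", IntegerType, true),
  StructField("hire_date", DateType, true),
  StructField("sal", DoubleType, true),
  StructField("deptno", IntegerType, true)
))
// schema + dateFormat lets the CSV reader produce a DateType column directly
val empDF = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .option("dateFormat", "yyyy-MM-dd")
  .schema(empSchema)
  .csv("emp_20191010.txt")  // path is illustrative
empDF.printSchema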
Hope this will help.

Remove words from column if present in list

I have a dataframe with a column 'text' which has many rows consisting of English sentences.
text
It is evening
Good morning
Hello everyone
What is your name
I'll see you tomorrow
I have a variable of type List which has some words such as
val removeList = List("Hello", "evening", "because", "is")
I want to remove all those words from column text which are present in removeList.
So my output should be
It
Good morning
everyone
What your name
I'll see you tomorrow
How can I do this using Spark Scala?
I wrote code something like this:
val stopWordsList = List("Hello", "evening", "because", "is");
val df3 = sqlContext.sql("SELECT text FROM table");
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
def cleanText(x: String, stopWordsList: List[String]): Any = {
  for (str <- stopWordsList) {
    if (x.contains(str)) {
      x.replaceAll(str, "")
    }
  }
}
But I am getting these errors:
Error:(44, 12) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
Error:(44, 12) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[String])org.apache.spark.sql.Dataset[String].
Unspecified value parameter evidence$6.
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
Check this DataFrame and RDD way.
val df = Seq(("It is evening"),("Good morning"),("Hello everyone"),("What is your name"),("I'll see you tomorrow")).toDF("data")
val removeList = List("Hello", "evening", "because", "is")
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField}

val rdd2 = df.rdd.map { x =>
  val p = x.getAs[String]("data")
  val k = removeList.foldLeft(p)((p, t) => p.replaceAll("\\b" + t + "\\b", ""))
  Row(x(0), k)
}
spark.createDataFrame(rdd2, df.schema.add(StructField("new1", StringType))).show(false)
Output:
+---------------------+---------------------+
|data |new1 |
+---------------------+---------------------+
|It is evening |It |
|Good morning |Good morning |
|Hello everyone | everyone |
|What is your name |What your name |
|I'll see you tomorrow|I'll see you tomorrow|
+---------------------+---------------------+
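A DataFrame-only alternative (a sketch, assuming the same df and removeList as above) builds a single regular expression with word boundaries from removeList and applies regexp_replace, avoiding the round trip through the RDD:

import org.apache.spark.sql.functions.{col, regexp_replace, trim}
import java.util.regex.Pattern

// quote each word, join with |, and wrap the group in word boundaries
val pattern = removeList.map(Pattern.quote).mkString("\\b(", "|", ")\\b")
val cleanedDF = df.withColumn(
  "new1",
  trim(regexp_replace(regexp_replace(col("data"), pattern, ""), "\\s{2,}", " "))
)
cleanedDF.show(false)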
This code works for me.
Spark version 2.3.0, Scala version 2.11.8.
Using Datasets
import org.apache.spark.sql.SparkSession
val data = List(
"It is evening",
"Good morning",
"Hello everyone",
"What is your name",
"I'll see you tomorrow"
)
val removeList = List("Hello", "evening", "because", "is")
val spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
def cleanText(text: String, removeList: List[String]): String =
removeList.fold(text) {
case (text, termToRemove) => text.replaceAllLiterally(termToRemove, "")
}
val df1 = sc.parallelize(data).toDS // Dataset[String]
val df2 = df1.map(text => cleanText(text, removeList)) // Dataset[String]
Using DataFrames
import org.apache.spark.sql.SparkSession
val data = List(
"It is evening",
"Good morning",
"Hello everyone",
"What is your name",
"I'll see you tomorrow"
)
val removeList = List("Hello", "evening", "because", "is")
val spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
def cleanText(text: String, removeList: List[String]): String =
removeList.fold(text) {
case (text, termToRemove) => text.replaceAllLiterally(termToRemove, "")
}
// Creates a temp table.
sc.parallelize(data).toDF("text").createTempView("table")
val df1 = spark.sql("SELECT text FROM table") // DataFrame = [text: string]
val df2 = df1.map(row => cleanText(row.getAs[String](fieldName = "text"), removeList)).toDF("text") // DataFrame = [text: string]

How to convert matrix to DataFrame with scala/spark?

I have a matrix, and the number of columns and rows is unknown.
One example Matrix is:
[5,1.3]
[1,5.2]
I want to convert it to a DataFrame; the column names can be arbitrary. How do I achieve this?
This is my expected result:
+-------------+----+
| _1 | _2 |
+-------------+----+
|5 |1.3 |
|1 |5.2 |
+-------------+----+
I suggest you convert the matrix to an RDD and then convert the RDD to a DataFrame. It is not an elegant way, but it works fine in Spark 2.0.0.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.mllib.linalg._
import org.apache.spark.rdd.RDD

object mat2df {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("mat2df").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val values = Array(5, 1, 1.3, 5.2)
    val mat = Matrices.dense(2, 2, values).asInstanceOf[DenseMatrix]
    def toRDD(m: Matrix): RDD[Vector] = {
      val columns = m.toArray.grouped(m.numRows)
      val rows = columns.toSeq.transpose
      val vectors = rows.map(row => new DenseVector(row.toArray))
      sc.parallelize(vectors)
    }
    val mat_rows = toRDD(mat) // matrix to rdd
    val mat_rdd = mat_rows.map(_.toArray).map { case Array(p0, p1) => (p0, p1) }
    val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
    val df = spark.createDataFrame(mat_rdd) // rdd to dataframe
    df.show()
  }
}
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

def matrixToDataFrame(sc: SparkContext, matrix: Matrix, m_nodeColName: String): DataFrame = {
  // each column of the matrix becomes one Row of the DataFrame
  val rdd = sc.parallelize(matrix.colIter.toSeq).map(x => Row.fromSeq(x.toArray.toSeq))
  val sqlContext = new SQLContext(sc) // do not shadow the sc parameter here
  var schema = new StructType()
  val ids = ArrayBuffer[String]()
  for (i <- 0 until matrix.numRows) {
    schema = schema.add(StructField(m_nodeColName + "_" + i.toString, DoubleType, true))
    ids.append(m_nodeColName + "_" + i.toString)
  }
  sqlContext.sparkSession.createDataFrame(rdd, schema)
}
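A quick usage sketch of the helper above (the matrix, column prefix and SparkContext are illustrative):

import org.apache.spark.mllib.linalg.Matrices
// a 2x2 dense matrix produces a DataFrame with columns m_0 and m_1
val mat = Matrices.dense(2, 2, Array(5.0, 1.0, 1.3, 5.2))
val df = matrixToDataFrame(sc, mat, "m")
df.show()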

How to create a DataFrame from a text file in Spark

I have a text file on HDFS and I want to convert it to a Data Frame in Spark.
I am using the Spark Context to load the file and then try to generate individual columns from that file.
val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x=>x.split(";"))
After doing this, I am trying the following operation.
myFile1.toDF()
I am getting an issue since the elements in the myFile1 RDD are now of array type.
How can I solve this issue?
Update - as of Spark 2.0, you can simply use the built-in csv data source:
spark: SparkSession = // create the Spark Session
val df = spark.read.csv("file.txt")
You can also use various options to control the CSV parsing, e.g.:
val df = spark.read.option("header", "false").csv("file.txt")
For Spark versions before 2.0:
The easiest way is to use spark-csv - include it in your dependencies and follow the README, it allows setting a custom delimiter (;), can read CSV headers (if you have them), and it can infer the schema types (with the cost of an extra scan of the data).
Alternatively, if you know the schema you can create a case-class that represents it and map your RDD elements into instances of this class before transforming into a DataFrame, e.g.:
case class Record(id: Int, name: String)
val myFile1 = myFile.map(x=>x.split(";")).map {
case Array(id, name) => Record(id.toInt, name)
}
myFile1.toDF() // DataFrame will have columns "id" and "name"
I have given different ways to create a DataFrame from a text file.
val conf = new SparkConf().setAppName(appName).setMaster("local")
val sc = new SparkContext(conf)
raw text file
val file = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
// toDF on an RDD needs the implicits from a SQLContext/SparkSession in scope
val fileToDf = file.map(_.split(",")).map { case Array(a, b, c) => (a, b.toInt, c) }.toDF("name", "age", "city")
fileToDf.foreach(println(_))
spark session without schema
import org.apache.spark.sql.SparkSession
val sparkSess = SparkSession.builder().appName("SparkSessionZipsExample")
  .config(conf).getOrCreate()
val df = sparkSess.read.option("header", "false").csv("C:\\vikas\\spark\\Interview\\text.txt")
df.show()
spark session with schema
import org.apache.spark.sql.types._
val schemaString = "name age city"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header", "false").schema(schema).csv("C:\\vikas\\spark\\Interview\\text.txt")
dfWithSchema.show()
using sql context
import org.apache.spark.sql.SQLContext
val sqlCtx = new SQLContext(sc)
val fileRdd = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt").map(_.split(",")).map { x => org.apache.spark.sql.Row(x: _*) }
val sqlDf = sqlCtx.createDataFrame(fileRdd, schema)
sqlDf.show()
If you want to use the toDF method, you have to convert your RDD of Array[String] into an RDD of a case class. For example, you have to do:
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
You will not be able to convert it into a data frame until you import the implicit conversions.
val sqlContext = new SQLContext(new SparkContext())
import sqlContext.implicits._
After this only you can convert this to data frame
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
val df = spark.read.textFile("abc.txt") // Dataset[String]
case class Abc(amount: Int, types: String, id: Int) // columns and data types
val df2 = df.map { rec =>
  val fields = rec.split(",") // assuming comma-separated fields
  Abc(fields(0).toInt, fields(1), fields(2).toInt)
}
df2.printSchema
root
|-- amount: integer (nullable = true)
|-- types: string (nullable = true)
|-- id: integer (nullable = true)
A txt file with a PIPE (|) delimiter can be read as:
val df = spark.read.option("sep", "|").option("header", "true").csv("s3://bucket_name/folder_path/file_name.txt")
I know I am quite late to answer this but I have come up with a different answer:
val rdd = sc.textFile("/home/training/mydata/file.txt")
val text = rdd.map(lines => lines.split(",")).map(arrays => (arrays(0), arrays(1))).toDF("id", "name").show
You can read a file into an RDD and then assign a schema to it. Two common ways of creating the schema are either using a case class or a Schema object [my preferred one]. Quick snippets of code that you may use follow.
Case Class approach
case class Test(id:String,name:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
Schema Approach
import org.apache.spark.sql.types._
val schemaString = "id name"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header","false").schema(schema).csv("file.txt")
dfWithSchema.show()
The second one is my preferred approach, since case classes were limited to a maximum of 22 fields (in Scala 2.10 and earlier) and this will be a problem if your file has more than 22 fields!