I am something of a newbie to the big data world. I have an initial CSV of about 40GB whose values are in a shifted order: if you look at the initial CSV, Jenny has no age, so the sex column value is shifted into the age column and the remaining values keep shifting until the last element of the row.
I want to clean/process this CSV using a DataFrame with Spark in Scala. I tried quite a few solutions with the withColumn() API and so on, but nothing worked for me.
Can anyone suggest some logic or an existing API to solve this in a cleaner way? I don't necessarily need a complete solution; pointers will also do. Help much appreciated!!
Initial CSV/Dataframe
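For example (illustrative; column names taken from the answers below):
name,age,sex,siblings
John,28,M,3
Jenny,M,3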
Required CSV/Dataframe
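For example (illustrative; the missing age is left empty/null):
name,age,sex,siblings
John,28,M,3
Jenny,,M,3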
EDIT:
This is how I'm reading the data:
val spark = SparkSession
  .builder
  .appName("SparkSQL")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///C:/temp")
  .getOrCreate()
import spark.implicits._
val df = spark.read.option("header", "true").csv("path/to/csv.csv")
This pretty much looks like the data itself is flawed. To handle this, I would suggest reading each line of the CSV file as a single string and then applying a map() function to handle the data:
case class MyClass(name: String, age: Integer, sex: String, siblings: Integer)

val myNewDf = myDf.map(row => {
  val myRow: String = row.getAs[String]("MY_SINGLE_COLUMN")
  val myRowValues = myRow.split(",")
  if (myRowValues.length == 4) {
    // everything as expected
    MyClass(myRowValues(0), myRowValues(1).toInt, myRowValues(2), myRowValues(3).toInt)
  } else {
    // do foo to guess the missing values
    ???   // placeholder so the example compiles
  }
})
As in your case the data is not properly formatted, it first has to be cleansed, i.e. all rows of the CSV should have the same schema, or the same number of delimiters/columns.
A basic approach to do this in Spark could be:
Load the data as text
Apply a map operation on the loaded DF/DS to clean it
Create the schema manually
Apply the schema to the cleansed DF/DS
Sample Code
//Sample CSV
John,28,M,3
Jenny,M,3
//Sample Code
import org.apache.spark.sql.types._

val schema = StructType(
List(
StructField("name", StringType, nullable = true),
StructField("age", IntegerType, nullable = true),
StructField("sex", StringType, nullable = true),
StructField("sib", IntegerType, nullable = true)
)
)
import spark.implicits._
val rawdf = spark.read.text("test.csv")
rawdf.show(10)
// with the implicits imported, this map yields a Dataset[String]
val cleansed = rawdf.map(row => {
  val raw = row.getAs[String]("value")
  // TODO: more robust data cleansing may be needed; here we just pad rows missing a value
  val values = raw.split(",")
  if (values.length != 4) {
    // assume the missing field is age and insert an empty value for it
    s"${values(0)},,${values(1)},${values(2)}"
  } else {
    raw
  }
})
val df = spark.read.schema(schema).csv(cleansed)
df.show(10)
You can try to define a case class with an Option field for age and load your CSV with the schema directly into a Dataset.
Something like this:
import org.apache.spark.sql.Encoders
import sparkSession.implicits._
case class Person(name: String, age: Option[Int], sex: String, siblings: Int)
val schema = Encoders.product[Person].schema
val dfInput = sparkSession.read
.format("csv")
.schema(schema)
.option("header", "true")
.load("path/to/csv.csv")
.as[Person]
I am trying to create a Spark Dataset and then, using mapPartitions, access each of its elements and store them in variables. I am using the piece of code below for this:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val df = spark.sql("select col1,col2,col3 from table limit 10")
val schema = StructType(Seq(
StructField("col1", StringType),
StructField("col2", StringType),
StructField("col3", StringType)))
val encoder = RowEncoder(schema)
df.mapPartitions { iterator =>
  val myList = iterator.toList
  myList.map { x =>
    val value1 = x.getString(0)
    val value2 = x.getString(1)
    val value3 = x.getString(2)
  }.iterator
}(encoder)
The error I am getting against this code is:
<console>:39: error: type mismatch;
found : org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Encoder[Unit]
val value3 = x.getString(2)}).iterator}} (encoder)
Eventually, I am aiming to store the row elements in variables and perform some operations with them. I am not sure what I am missing here. Any help would be highly appreciated!
Actually, there are several problems with your code:
Your map statement has no return value, therefore its type is Unit
If you return a tuple of Strings from mapPartitions, you don't need a RowEncoder (because you don't return a Row but a Tuple3, which does not need an explicit encoder because it is a Product)
You can write your code like this:
df
.mapPartitions{itr => itr.map(x=> (x.getString(0),x.getString(1),x.getString(2)))}
.toDF("col1","col2","col3") // Convert Dataset to Dataframe, get desired field names
But you could just use a simple select statement in the DataFrame API; there is no need for mapPartitions here:
df
.select($"col1",$"col2",$"col3")
I am trying to convert a CSV file to a DataFrame in Spark 1.5.2 with Scala without using the databricks spark-csv library, as this is a community project and that library is not available. My approach was the following:
var inputPath = "input.csv"
var text = sc.textFile(inputPath)
var rows = text.map(line => line.split(",").map(_.trim))
var header = rows.first()
var data = rows.filter(_(0) != header(0))
var df = sc.makeRDD(1 to data.count().toInt).map(i => (data.take(i).drop(i-1)(0)(0), data.take(i).drop(i-1)(0)(1), data.take(i).drop(i-1)(0)(2), data.take(i).drop(i-1)(0)(3), data.take(i).drop(i-1)(0)(4))).toDF(header(0), header(1), header(2), header(3), header(4))
This code, even though it is quite a mess, works without returning any error messages. The problem comes when trying to display the data inside df in order to verify the correctness of this method and later run some queries on df. The error code I am getting after executing df.show() is SPARK-5063. My questions are:
1) Why is it not possible to print the content of df?
2) Is there any other more straightforward method to convert a CSV to a DataFrame in Spark 1.5.2 without using the databricks library?
For Spark 1.5.x, the code snippet below can be used to convert the input into a DataFrame:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class DataClass(id: Int, name: String, surname: String, bdate: String, address: String)
// Create an RDD of DataClass objects and register it as a table.
val peopleData = sc.textFile("input.csv").map(_.split(",")).map(p => DataClass(p(0).trim.toInt, p(1).trim, p(2).trim, p(3).trim, p(4).trim)).toDF()
peopleData.registerTempTable("dataTable")
val peopleDataFrame = sqlContext.sql("SELECT * from dataTable")
peopleDataFrame.show()
Spark 1.5
You can create it like this:
SparkSession spark = SparkSession
.builder()
.appName("RDDtoDF_Updated")
.master("local[2]")
.config("spark.some.config.option", "some-value")
.getOrCreate();
StructType schema = DataTypes
.createStructType(new StructField[] {
DataTypes.createStructField("eid", DataTypes.IntegerType, false),
DataTypes.createStructField("eName", DataTypes.StringType, false),
DataTypes.createStructField("eAge", DataTypes.IntegerType, true),
DataTypes.createStructField("eDept", DataTypes.IntegerType, true),
DataTypes.createStructField("eSal", DataTypes.IntegerType, true),
DataTypes.createStructField("eGen", DataTypes.StringType,true)});
String filepath = "F:/Hadoop/Data/EMPData.txt";
JavaRDD<Row> empRDD = spark.read()
.textFile(filepath)
.javaRDD()
.map(line -> line.split("\\,"))
.map(r -> RowFactory.create(Integer.parseInt(r[0]), r[1].trim(),Integer.parseInt(r[2]),
Integer.parseInt(r[3]),Integer.parseInt(r[4]),r[5].trim() ));
Dataset<Row> empDF = spark.createDataFrame(empRDD, schema);
empDF.groupBy("edept").max("esal").show();
Using Spark with Scala.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
val inputPath = "input.csv"
val text = sc.textFile(inputPath)
val rows = text.map(line => line.split(",").map(_.trim))
val header = rows.first()
// drop the header row and convert the remaining rows to Row objects
val data = rows.filter(_(0) != header(0)).map(a => Row.fromSeq(a))
val schema = StructType(header.map(fieldName => StructField(fieldName, StringType, nullable = true)))
val df = hiveCtx.createDataFrame(data, schema)
This should work.
But for creating a DataFrame, I would recommend you to use Spark-CSV.
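For example, a minimal sketch, assuming the spark-csv package has been added to the job (e.g. via --packages com.databricks:spark-csv_2.10:1.5.0; adjust the artifact/version to your setup):
// read the CSV with spark-csv on Spark 1.x
val csvDf = hiveCtx.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // use the first line as column names
  .option("inferSchema", "true")  // infer column types with an extra pass over the data
  .load("input.csv")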
I have created a schema with the following code:
val schema= new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)
Created an RDD from:
val data = spark.sparkContext.textFile("cities.txt")
Converted it to an RDD of Row to apply the schema:
val cities = data.map(line => line.split(";")).map(row => Row.fromSeq(row.zip(schema.toSeq)))
val citiesRDD = spark.sqlContext.createDataFrame(cities, schema)
This gives me an error:
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string
You don't need a schema to create a Row; you need the schema when you create the DataFrame. You also need to introduce some logic for how to convert your split line (which produces 3 strings) into integers.
Here is a minimal solution without exception handling:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data
val schema = new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)
val cities = data.map(line => {
val Array(city,female,male) = line.split(";")
Row(
city,
female.toInt,
male.toInt
)
}
)
val citiesDF = sqlContext.createDataFrame(cities, schema)
I normally use case classes to create a DataFrame, because Spark can infer the schema from the case class:
// "schema" for dataframe, define outside of main method
case class MyRow(city:Option[String],female:Option[Int],male:Option[Int])
val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data
import sqlContext.implicits._
val citiesDF = data.map(line => {
val Array(city,female,male) = line.split(";")
MyRow(
Some(city),
Some(female.toInt),
Some(male.toInt)
)
}
).toDF()
I have a text file on HDFS and I want to convert it to a DataFrame in Spark.
I am using the Spark context to load the file and then try to generate individual columns from that file.
val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x=>x.split(";"))
After doing this, I am trying the following operation.
myFile1.toDF()
I am getting an issue since the elements in the myFile1 RDD are now of array type.
How can I solve this issue?
Update - as of Spark 2.0, you can simply use the built-in csv data source:
spark: SparkSession = // create the Spark Session
val df = spark.read.csv("file.txt")
You can also use various options to control the CSV parsing, e.g.:
val df = spark.read.option("header", "false").csv("file.txt")
For Spark versions < 2.0:
The easiest way is to use spark-csv: include it in your dependencies and follow the README. It allows setting a custom delimiter (;), can read CSV headers (if you have them), and can infer the schema types (at the cost of an extra scan of the data).
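For illustration, a rough sketch with spark-csv (assuming the package is on the classpath and a SQLContext named sqlContext, as in a Spark shell), using the ; delimiter from the question:
// Spark < 2.0: read a semicolon-delimited file with the external spark-csv package
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", ";")       // custom delimiter
  .option("header", "false")      // set to "true" if the file has a header row
  .option("inferSchema", "true")  // optional: infer the column types with an extra scan
  .load("file.txt")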
Alternatively, if you know the schema you can create a case-class that represents it and map your RDD elements into instances of this class before transforming into a DataFrame, e.g.:
case class Record(id: Int, name: String)
val myFile1 = myFile.map(x=>x.split(";")).map {
case Array(id, name) => Record(id.toInt, name)
}
myFile1.toDF() // DataFrame will have columns "id" and "name"
Here are different ways to create a DataFrame from a text file.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName(appName).setMaster("local")
val sc = new SparkContext(conf)
raw text file
val file = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
// toDF on an RDD requires the implicits of a SQLContext/SparkSession in scope,
// e.g. import sparkSess.implicits._ (sparkSess is created in the next snippet)
val fileToDf = file.map(_.split(",")).map { case Array(a, b, c) =>
  (a, b.toInt, c)
}.toDF("name", "age", "city")
fileToDf.foreach(println(_))
spark session without schema
import org.apache.spark.sql.SparkSession
val sparkSess = SparkSession.builder()
  .appName("SparkSessionZipsExample")
  .config(conf)
  .getOrCreate()

val df = sparkSess.read
  .option("header", "false")
  .csv("C:\\vikas\\spark\\Interview\\text.txt")
df.show()
spark session with schema
import org.apache.spark.sql.types._
val schemaString = "name age city"
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read
  .option("header", "false")
  .schema(schema)
  .csv("C:\\vikas\\spark\\Interview\\text.txt")
dfWithSchema.show()
using sql context
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)
val fileRdd = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
  .map(_.split(","))
  .map(x => org.apache.spark.sql.Row(x: _*))
val sqlDf = sqlCtx.createDataFrame(fileRdd, schema)
sqlDf.show()
If you want to use the toDF method, you have to convert your RDD of Array[String] into an RDD of a case class. For example, you have to do:
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
You will not be able to convert it into a DataFrame until you use the implicit conversions.
val sqlContext = new org.apache.spark.sql.SQLContext(new SparkContext())
import sqlContext.implicits._
Only after this can you convert it to a DataFrame:
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
val df = spark.read.textFile("abc.txt")
case class Abc(amount: Int, types: String, id: Int) // columns and data types
val df2 = df.map { rec =>
  val fields = rec.split(",")
  Abc(fields(0).toInt, fields(1), fields(2).toInt)
}
df2.printSchema
root
|-- amount: integer (nullable = true)
|-- types: string (nullable = true)
|-- id: integer (nullable = true)
A text file with a PIPE (|) delimiter can be read as:
df = spark.read.option("sep", "|").option("header", "true").csv("s3://bucket_name/folder_path/file_name.txt")
I know I am quite late to answer this but I have come up with a different answer:
val rdd = sc.textFile("/home/training/mydata/file.txt")
val text = rdd.map(lines => lines.split(",")).map(arrays => (arrays(0), arrays(1))).toDF("id", "name")
text.show
You can read a file into an RDD and then assign a schema to it. The two common ways of creating a schema are either using a case class or a Schema object [my preferred one]. Below are quick snippets of code that you may use.
Case Class approach
case class Test(id:String,name:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
Schema Approach
import org.apache.spark.sql.types._
val schemaString = "id name"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header","false").schema(schema).csv("file.txt")
dfWithSchema.show()
The second one is my preferred approach, since case classes have a limitation of max 22 fields (in Scala 2.10) and this will be a problem if your file has more than 22 fields!
Does Spark SQL provide any way to automatically load CSV data?
I found the following JIRA: https://issues.apache.org/jira/browse/SPARK-2360 but it was closed....
Currently I would load a CSV file as follows:
case class Record(id: String, val1: String, val2: String, ....)
sc.textFile("Data.csv")
.map(_.split(","))
.map { r =>
Record(r(0),r(1), .....)
}.registerAsTable("table1")
Any hints on the automatic schema deduction from csv files? In particular a) how can I generate a class representing the schema and b) how can I automatically fill it (i.e. Record(r(0),r(1), .....))?
Update:
I found a partial answer to the schema generation here:
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#data-sources
// The schema is encoded in a string
val schemaString = "name age"
// Generate the schema based on the string of schema
val schema =
StructType(
schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
So the only question left would be how to do the step
map(p => Row(p(0), p(1).trim)) dynamically for the given number of attributes?
Thanks for your support!
Joerg
You can use spark-csv, which saves you a few keystrokes by not having to define the column names and by automatically using the headers.
val schemaString = "name age".split(" ")
// Generate the schema based on the string of schema
val schema = StructType(schemaString.map(fieldName => StructField(fieldName, StringType, true)))
val lines = people.flatMap(x=> x.split("\n"))
val rowRDD = lines.map(line=>{
Row.fromSeq(line.split(" "))
})
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
Maybe this link will help you:
http://devslogics.blogspot.in/2014/11/spark-sql-automatic-schema-from-csv.html