I can create a Spark StructType via DDL schema like so:
val ddl = "a STRING COMMENT 'max_length=1000'"
val schema = StructType.fromDDL(ddl)
This creates a schema where the field for column a looks like so:
StructField(
name = "a",
dataType = StringType,
nullable = true,
metadata = Metadata(Map("comment" -> "max_length=1000"))
)
After that, I can do something like this to turn the comment into actual metadata entries:
val maxLengthMetadata = metadata.getString("comment") // "max_length=1000"
// split the comment string to grab the elements individually, e.g.
// key = "max_length", value = "1000"
val Array(key, value) = maxLengthMetadata.split("=", 2)
// Metadata is immutable, so the new entry has to go through a MetadataBuilder
val newMetadata = new MetadataBuilder().putString(key, value).build()
Is there a way to format the DDL so the Metadata object can be populated like above, without going through string manipulation after grabbing the data from the SQL comment? Something like this:
val ddl = "a STRING max_length='1000'"
So instead of
Metadata(Map("comment" -> "max_length=1000"))
I want
Metadata(Map("max_length" -> "1000"))
without having to go through the above roundabout way.
I've also tried running some Scala code to see if I can put some metadata and then run StructField.toDDL, like so:
val metadata: Metadata = new MetadataBuilder()
.putString("timestamp_mask", "yyyy-MM-dd")
.build()
val schema = StructType(
Seq(
StructField("c", TimestampType, nullable = true, metadata)
)
)
schema.fields.foreach(field => println(field.toDDL))
but this doesn't work either, since toDDL only picks up metadata through metadata.getString("comment").
I don't see an easy way for DDL to support this kind of behavior.
I am not sure if it's possible to do it using a DDL string, but you can use a JSON schema instead. For example:
import org.apache.spark.sql.types.{DataType, StructType}
val jsonSchema = """{"type":"struct","fields":[{"name":"col1","type":"string","nullable":true,"metadata":{"max-length": 100}}]}"""
val schema = DataType.fromJson(jsonSchema).asInstanceOf[StructType]
println(s"${schema.fields(0).name} - ${schema.fields(0).metadata}") // col1 - {"max-length":100}
Related
I am kind of a newbie to the big data world. I have an initial CSV with a data size of ~40GB, but in some kind of shifted order. I mean, if you look at the initial CSV, for Jenny there is no age, so the sex column value is shifted into age, and the remaining column values keep shifting until the last element in the row.
I want to clean/process this CSV using a DataFrame with Spark in Scala. I tried quite a few solutions with the withColumn() API and the like, but nothing worked for me.
If anyone can suggest some sort of logic or an available API to solve this in a cleaner way, that would help. I might not need a full solution; pointers will also do. Help much appreciated!!
Initial CSV/Dataframe
Required CSV/Dataframe
EDIT:
This is how I'm reading the data:
val spark = SparkSession.builder
  .appName("SparkSQL")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///C:/temp")
  .getOrCreate()

import spark.implicits._

val df = spark.read.option("header", "true").csv("path/to/csv.csv")
This pretty much looks like the data is flawed. To handle this, I would suggest reading each line of the CSV file as a single string and then applying a map() function to handle the data:
case class myClass(name: String, age: Integer, sex: String, siblings: Integer)

// requires spark.implicits._ in scope for the myClass encoder
val myNewDf = myDf.map(row => {
  val myRow: String = row.getAs[String]("MY_SINGLE_COLUMN")
  val myRowValues = myRow.split(",")
  if (myRowValues.length == 4) {
    // everything as expected
    myClass(myRowValues(0), myRowValues(1).toInt, myRowValues(2), myRowValues(3).toInt)
  } else {
    // do foo to guess missing values (this branch must also return a myClass)
    ???
  }
})
In your case the data is not properly formatted. To handle this, the data first has to be cleansed, i.e. all rows of the CSV should have the same schema, or the same number of delimiters/columns.
A basic approach to do this in Spark could be:
1. Load the data as text
2. Apply a map operation on the loaded DF/DS to clean it
3. Create the schema manually
4. Apply the schema to the cleansed DF/DS
Sample Code
//Sample CSV
John,28,M,3
Jenny,M,3
//Sample Code
import org.apache.spark.sql.types._

val schema = StructType(
  List(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true),
    StructField("sex", StringType, nullable = true),
    StructField("sib", IntegerType, nullable = true)
  )
)
import spark.implicits._
val rawdf = spark.read.text("test.csv")
rawdf.show(10)
val cleaned = rawdf.map(row => {
  val raw = row.getAs[String]("value")
  //TODO: Data cleansing has to be done.
  val values = raw.split(",")
  if (values.length != 4) {
    // short row: re-insert the missing age as an empty field
    s"${values(0)},,${values(1)},${values(2)}"
  } else {
    raw
  }
})
// Spark 2.2+ can read a Dataset[String] directly as CSV
val df = spark.read.schema(schema).csv(cleaned)
df.show(10)
You can try to define a case class with an Option field for age and load your CSV with the schema directly into a Dataset.
Something like this:
import org.apache.spark.sql.{Encoders}
import sparkSession.implicits._
case class Person(name: String, age: Option[Int], sex: String, siblings: Int)
val schema = Encoders.product[Person].schema
val dfInput = sparkSession.read
.format("csv")
.schema(schema)
.option("header", "true")
.load("path/to/csv.csv")
.as[Person]
I have a list of columns; using these columns I prepared a schema as follows.
Code:
import org.apache.spark.sql.types._
val fields = Array("col1", "col2", "col3", "col4", "col5", "col6")
val dynSchema = StructType( fields.map( field =>
new StructField(field, StringType, true, null) ) )
The schema is then prepared as:
StructType(StructField(col1,StringType,true), StructField(col2,StringType,true),
StructField(col3,StringType,true), StructField(col4,StringType,true),
StructField(col5,StringType,true), StructField(col6,StringType,true))
But I am getting a NullPointerException when I try to read the data from a JSON file using the above schema.
// reading the data
spark.read.schema(dynSchema).json("./file/path/*.json")
But it works if I add an Array of StructFields to the StructType.
Please help me generate the schema dynamically.
Edit: If I create the schema with the fields as below, I am able to read the data from JSON.
StructType(Array(
StructField("col1",StringType,true), StructField("col2",StringType,true),
StructField("col3",StringType,true), StructField("col4",StringType,true),
StructField("col5",StringType,true), StructField("col6",StringType,true)))
Simply remove the null argument from the creation of the StructField as follows:
val dynSchema = StructType( fields.map( field =>
new StructField(field, StringType, true)))
The last argument is used to define metadata about the column. Its default value is not null but Metadata.empty. See the source code for more detail. In the source code, they assume that it cannot be null and call methods on it without any checks. This is why you get a NullPointerException.
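For comparison, a minimal sketch that passes the metadata argument explicitly (assuming the same fields array from the question); Metadata.empty is exactly what the default argument supplies:
import org.apache.spark.sql.types._

val dynSchema = StructType(fields.map(field =>
  StructField(field, StringType, nullable = true, Metadata.empty)))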
Given a list of strings, is there a way to create a case class or a schema without inputting the strings manually?
For example, I have a List:
val name_list=Seq("Bob", "Mike", "Tim")
The List will not always be the same. Sometimes it will contain different names and will vary in size.
I can create a case class
case class names(Bob: Integer, Mike: Integer, Tim: Integer)
or a schema
val schema = StructType(StructField("Bob", IntegerType, true) ::
  StructField("Mike", IntegerType, true) ::
  StructField("Tim", IntegerType, true) :: Nil)
but I have to do it manually. I am looking for a method to perform this operation dynamically.
Assuming the data type of the columns are the same:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val nameList=Seq("Bob", "Mike", "Tim")
val schema = StructType(nameList.map(n => StructField(n, IntegerType, true)))
// schema: org.apache.spark.sql.types.StructType = StructType(
// StructField(Bob,IntegerType,true), StructField(Mike,IntegerType,true), StructField(Tim,IntegerType,true)
// )
spark.createDataFrame(rdd, schema)
If the data types are different, you'll have to provide them as well (in which case it might not save much time compared with assembling the schema manually):
val typeList = Array[DataType](StringType, IntegerType, DoubleType)
val colSpec = nameList zip typeList
val schema = StructType(colSpec.map(cs => StructField(cs._1, cs._2, true)))
// schema: org.apache.spark.sql.types.StructType = StructType(
// StructField(Bob,StringType,true), StructField(Mike,IntegerType,true), StructField(Tim,DoubleType,true)
// )
If all the fields have the same datatype, then you can simply create the schema as:
val name_list=Seq("Bob", "Mike", "Tim")
val fields = name_list.map(name => StructField(name, IntegerType, true))
val schema = StructType(fields)
If you have different datatypes, then create a map of field names to types and build the schema as above; see the sketch below.
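A minimal sketch of that mixed-type case (the name-to-type pairs here are hypothetical):
import org.apache.spark.sql.types._

// Hypothetical mapping of field names to their datatypes
val fieldTypes: Seq[(String, DataType)] = Seq(
  "Bob"  -> StringType,
  "Mike" -> IntegerType,
  "Tim"  -> DoubleType)

val schema = StructType(fieldTypes.map { case (name, dataType) =>
  StructField(name, dataType, nullable = true)
})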
Hope this helps!
All the answers above only covered one aspect, which is creating the schema. Here is one solution you can use to create a case class from the generated schema:
https://gist.github.com/yoyama/ce83f688717719fc8ca145c3b3ff43fd
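As a rough idea of what such generation looks like (a simplified sketch for flat schemas only; the gist handles nested types and more):
import org.apache.spark.sql.types._

// Naive mapping from a flat schema to Scala case-class source code
def toCaseClass(className: String, schema: StructType): String = {
  val fields = schema.fields.map { f =>
    val scalaType = f.dataType match {
      case IntegerType => "Int"
      case DoubleType  => "Double"
      case _           => "String"
    }
    s"${f.name}: $scalaType"
  }
  s"case class $className(${fields.mkString(", ")})"
}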
I have an issue where we load a JSON file into Spark, store it as Parquet, and then try to access the Parquet file from Impala; Impala complains about the names of the columns because they contain characters which are illegal in SQL.
One of the "features" of the JSON files is that they don't have a predefined schema. I want Spark to create the schema, and then I have to modify the field names that have illegal characters.
My first thought was to use withColumnRenamed on the names of the fields in the DataFrame, but I believe this only works on top-level fields, so I could not use that, as the JSON contains nested data.
So I created the following code to recreate the DataFrames schema, going recursively through the structure. And then I use that new schema to recreate the DataFrame.
(Code updated with Jacek's suggested improvement of using the Scala copy constructor.)
def replaceIllegal(s: String): String = s.replace("-", "_").replace("&", "_").replace("\"", "_").replace("[", "_").replace("]", "_")
def removeIllegalCharsInColumnNames(schema: StructType): StructType = {
StructType(schema.fields.map { field =>
field.dataType match {
case struct: StructType =>
field.copy(name = replaceIllegal(field.name), dataType = removeIllegalCharsInColumnNames(struct))
case _ =>
field.copy(name = replaceIllegal(field.name))
}
})
}
sparkSession.createDataFrame(df.rdd, removeIllegalCharsInColumnNames(df.schema))
This works. But is there a better / simpler way to achieve what I want to do?
And is there a better way to replace the existing schema on a DataFrame? The following code did not work:
df.select($"*".cast(removeIllegalCharsInColumnNames(df.schema)))
It gives this error:
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'cast'
I think the best bet would be to convert the Dataset (before you save as a parquet file) to an RDD and use your custom schema to describe the structure as you want.
val targetSchema: StructType = ...
val fromJson: DataFrame = ...
val targetDataset = spark.createDataFrame(fromJson.rdd, targetSchema)
See the example in SparkSession.createDataFrame as a reference; however, it uses an RDD directly, while you're going to create it from a Dataset.
val schema =
StructType(
StructField("name", StringType, false) ::
StructField("age", IntegerType, true) :: Nil)
val people =
sc.textFile("examples/src/main/resources/people.txt").map(
_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
val dataFrame = sparkSession.createDataFrame(people, schema)
dataFrame.printSchema
// root
// |-- name: string (nullable = false)
// |-- age: integer (nullable = true)
But as you mentioned in your comment (that I later merged into your question):
JSON files don't have a predefined schema.
With that said, I think your solution is a correct one. Spark does not offer anything similar out of the box and I think it's more about developing a custom Scala code that would traverse a StructType/StructField tree and change what's incorrect.
What I would suggest changing in your code is to use the copy constructor (a feature of Scala's case classes - see A Scala case class ‘copy’ method example) that changes only the incorrect name and leaves the other properties untouched.
Using copy constructor would (roughly) correspond to the following code:
// was
// case s: StructType =>
// StructField(replaceIllegal(field.name), removeIllegalCharsInColumnNames(s), field.nullable, field.metadata)
s.copy(name = replaceIllegal(field.name), dataType = removeIllegalCharsInColumnNames(s))
There are some design patterns in functional languages (in general) and Scala (in particular) that could deal with the deep nested structure manipulation, but that might be too much (and I'm hesitant to share it).
I therefore think that the question is in its current "shape" more about how to manipulate a tree as a data structure not necessarily a Spark schema.
Does Spark SQL provide any way to automatically load CSV data?
I found the following Jira: https://issues.apache.org/jira/browse/SPARK-2360 but it was closed.
Currently I would load a csv file as follows:
case class Record(id: String, val1: String, val2: String, ....)
sc.textFile("Data.csv")
.map(_.split(","))
.map { r =>
Record(r(0),r(1), .....)
}.registerAsTable("table1")
Any hints on the automatic schema deduction from csv files? In particular a) how can I generate a class representing the schema and b) how can I automatically fill it (i.e. Record(r(0),r(1), .....))?
Update:
I found a partial answer to the schema generation here:
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#data-sources
// The schema is encoded in a string
val schemaString = "name age"
// Generate the schema based on the string of schema
val schema =
StructType(
schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
So the only question left would be how to do the step
map(p => Row(p(0), p(1).trim)) dynamically for the given number of attributes?
Thanks for your support!
Joerg
You can use spark-csv, which saves a few keystrokes: you don't have to define the column names, and it can automatically use the headers.
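A minimal sketch of the spark-csv usage referred to above, assuming the com.databricks:spark-csv package is on the classpath and Spark 1.x-era APIs (the file path and table name are placeholders):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // take the column names from the header row
  .option("inferSchema", "true") // let spark-csv deduce the column types
  .load("Data.csv")

df.registerTempTable("table1")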
val schemaString = "name age".split(" ")
// Generate the schema based on the string of schema
val schema = StructType(schemaString.map(fieldName => StructField(fieldName, StringType, true)))
val lines = people.flatMap(x => x.split("\n"))
val rowRDD = lines.map { line =>
  // Row.fromSeq builds a Row from any sequence, so the number of attributes is not hard-coded
  Row.fromSeq(line.split(" "))
}
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
Maybe this link will help you:
http://devslogics.blogspot.in/2014/11/spark-sql-automatic-schema-from-csv.html