Spark Scala data cleansing

When developing a Spark job, I always run into the same data-cleansing issues. Can you give me some clues on how to implement this kind of data cleansing?
The input can be CSV/Kafka/text containing string fields, integer fields and timestamp fields. I would like to remove every line that is not compliant with the data model. Sometimes I get an IP address instead of an integer, sometimes the timestamp can't be cast because it is in the wrong format, and so on.
Additionally, I would like to avoid killing the job's performance by manipulating lots of Java objects and complex Scala structures, and I want to be able to plug in business rules.
Imagine a dataset with this model
case class Flow(a: String, b: Int, c: Timestamp)
A very simple version would be
val file = sc.textFile(...)        // the split has to happen per line, e.g. with map
  .map(_.split(","))
  .filter { l =>
    l.length > x &&
    (l(1) forall Character.isDigit)
    // l(2) ???? how to match the date format without a costly regexp ????
  }
Do you have any other suggestions? For example, would using more complex solutions such as Scalaz with monads, or the Play framework, make sense here?
What would be the difference in terms of performance?
I have also looked at the Spark CSV parser included in 2.x, but it's just a brutal try/catch/logging around type casting...
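For context, here is the kind of Try-based parsing into the case class I have been experimenting with (a rough sketch only; path, the field order and the timestamp format are my own assumptions):

import java.sql.Timestamp
import scala.util.Try

case class Flow(a: String, b: Int, c: Timestamp)

// Parse one CSV line into a Flow; None means the line is rejected.
def parseFlow(line: String): Option[Flow] = {
  val fields = line.split(",", -1)
  if (fields.length < 3) None
  else for {
    b <- Try(fields(1).trim.toInt).toOption
    c <- Try(Timestamp.valueOf(fields(2).trim)).toOption // expects "yyyy-MM-dd HH:mm:ss"
  } yield Flow(fields(0), b, c)
}

val flows = sc.textFile(path).flatMap(parseFlow) // non-compliant lines are dropped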

Related

Spark/Scala - Validate JSON document in a row of a streaming DataFrame

I have a streaming application which is processing a streaming DataFrame with a column "body" that contains a JSON string.
The body contains something like this (these are four input rows):
{"id":1, "ts":1557994974, "details":[{"id":1,"attr2":3,"attr3":"something"}, {"id":2,"attr2":3,"attr3":"something"}]}
{"id":2, "ts":1557994975, "details":[{"id":1,"attr2":"3","attr3":"something"}, {"id":2,"attr2":"3","attr3":"something"},{"id":3,"attr2":"3","attr3":"something"}]}
{"id":3, "ts":1557994976, "details":[{"id":1,"attr2":3,"attr3":"something"}, {"id":2,"attr2":3}]}
{"id":4, "ts":1557994977, "details":[]}
I would like to check that each row has the correct schema (correct data types and all attributes present). I would like to filter out the invalid records and log them somewhere (like a Parquet file). I am especially interested in the "details" array: each of the nested documents must have the specified fields and the correct data types.
So in the example above only row id = 1 is valid.
I was thinking about a case class such as:
case class Detail(
  id: Int,
  attr2: Int,
  attr3: String
)

case class Input(
  id: Int,
  ts: Long,
  details: Seq[Detail]
)
and using Try, but I'm not sure how to go about it.
Could someone help, please?
Thanks
One approach is to use JSON Schema, which can help you with schema validation of the data. The getting-started page is a good place to start if you're new to it.
The other approach would roughly work as follows:
Build models (case classes) for each of the objects, like you've attempted in your question.
Use a JSON library like Spray JSON or Play JSON to parse the input JSON.
Any input that fails to parse into a valid record is most likely invalid, and you can route that output to a different sink in your Spark code (see the sketch below). It also makes this more robust if you have an isValid method on the objects that can check whether a parsed record is correct.
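A rough sketch of that second approach with Play JSON (untested; streamingDf, the cast to string and the encoders from spark.implicits are assumptions on my side):

import scala.util.Try
import play.api.libs.json._
import spark.implicits._

implicit val detailReads: Reads[Detail] = Json.reads[Detail]
implicit val inputReads: Reads[Input] = Json.reads[Input]

// Tag every body as valid/invalid; invalid rows can then be routed to a separate sink.
val tagged = streamingDf
  .select($"body".cast("string").as[String])
  .map { body =>
    val isValid = Try(Json.parse(body).validate[Input]).toOption.exists(_.isSuccess)
    (body, isValid)
  }
  .toDF("body", "isValid")

val valid = tagged.filter($"isValid")
val invalid = tagged.filter(!$"isValid") // e.g. write this one out as Parquet for inspection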
The easiest way for me is to create a DataFrame with a schema and then filter with id == 1. This is not the most efficient way.
Here you can find an example of creating a DataFrame with a schema: https://blog.antlypls.com/blog/2016/01/30/processing-json-data-with-sparksql/
Edit
I can't find a way to pre-filter to speed up the JSON search in Scala, but you can use one of these three options:
spark.read.schema(mySchema).format("json").load("myPath").filter($"id" === 1)
or
spark.read.schema(mySchema).json("myPath").filter($"id" === 1)
or
spark.read.json("myPath").filter($"id" === 1)
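For reference, mySchema in the first two options could be built from the case classes in the question (a sketch; the nullable flags are my assumption):

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types._

// Derived from the case class...
val mySchema: StructType = Encoders.product[Input].schema

// ...or spelled out explicitly:
val detailSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("attr2", IntegerType, nullable = false),
  StructField("attr3", StringType, nullable = false)
))
val explicitSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("ts", LongType, nullable = false),
  StructField("details", ArrayType(detailSchema), nullable = true)
))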

Spark DataFrame Read And Write

I have a use case where I have to load millions of JSON-formatted records into Apache Hive tables.
My solution was simply to load them into a DataFrame and write them out as Parquet files.
Then I create an external table on top of them.
I am using Apache Spark 2.1.0 with Scala 2.11.8.
It so happens that all the messages follow a sort of flexible schema.
For example, a column "amount" can have the value 1.0 or 1.
Since I am transforming data from a semi-structured format to a structured format, but my schema is slightly variable, I compensated by assuming that the inferSchema option for data sources like JSON would help me.
spark.read.option("inferSchema", "true").json(rdd) // rdd: RDD[String]
When I used inferSchema = true while reading the JSON data:
Case 1: for smaller data, all the Parquet files have amount as double.
Case 2: for larger data, some Parquet files have amount as double and others have int64.
I tried to debug and found concepts like schema evolution and schema merging, which went over my head and left me with more doubts than answers.
My doubts/questions are:
When I try to infer the schema, does it not enforce the inferred schema onto the full dataset?
Since I cannot enforce any schema due to my constraints, I thought of casting the whole column to the double data type, as it can hold both integers and decimal numbers. Is there a simpler way?
My guess is that, since the data is partitioned, inferSchema works per partition and then gives me a general schema, but it does not do anything like enforcing that schema. Please correct me if I am wrong.
Note: the reason I am using the inferSchema option is that the incoming data is too flexible/variable to enforce a case class of my own, though some of the columns are mandatory. If you have a simpler solution, please suggest it.
Inferring the schema really just processes all the rows to find the types.
Once it has done that, it merges the results to find a schema common to the whole dataset.
For example, some of your fields may have values in some rows but not in others, so the inferred schema for that field becomes nullable.
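As a quick illustration (my own toy example, not from the question), JSON inference picks the widest compatible type it sees across rows:

val rdd = sc.parallelize(Seq("""{"amount": 1}""", """{"amount": 1.0}"""))
spark.read.json(rdd).printSchema()
// root
//  |-- amount: double (nullable = true)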
To answer your question, it's fine to infer schema for your input.
However, since you intend to use the output in Hive you should ensure all the output files have the same schema.
An easy way to do this is to use casting (as you suggest). I typically like to do a select at the final stage of my jobs and just list all the columns and types. I feel this makes the job more human-readable.
e.g.
import org.apache.spark.sql.types.{IntegerType, StringType}

df
  .coalesce(numOutputFiles)
  .select(
    $"col1".cast(IntegerType).as("col1"),
    $"col2".cast(StringType).as("col2"),
    $"someOtherCol".cast(IntegerType).as("col3")
  )
  .write.parquet(outPath)

How to encode recursive types with constraint for a typesafe avro library

Since I'm really stumped with this issue right now, I thought I'd ask here.
So here's the problem. I'm currently trying to write a library to represent Avro schemas in a typesafe manner, which should later allow structurally querying a given runtime value of a schema. E.g. does my schema contain a field of a given name within a certain path? Is the schema flat (contains no nestable types except at the top level)? Etc.
You can find the complete specification of Avro schemas here: https://avro.apache.org/docs/1.8.2/spec.html
Now I have some trouble deciding on a representation of the schema within my code. Right now I'm using an ADT like this, because it makes decoding the Avro schema (which is JSON) really easy with Circe, so you can somewhat ignore things like the Refined types for this issue.
https://gist.github.com/GrafBlutwurst/681e365ecbb0ecad2acf4044142503a9 Please note that this is not the exact implementation. I have one that is able to decode schemas correctly but is a pain to query afterwards.
Anyhow, I was wondering:
1) Does anyone have a good idea how to encode the type restriction on Avro unions? Avro unions cannot directly contain other unions, but they can, for example, contain records which in turn contain unions. So union -> union is not allowed, but union -> record -> union is fine. (One encoding I've been toying with is sketched after the PS below.)
2) Would using fixpoint recursion in the form of Fix, Free and Cofree make the querying easier later? I'm somewhat on the fence since I have no experience using these yet.
Thanks!
PS: Here's some more elaboration on why Refined is in there. In the end I want to enable some very specific uses, e.g. this pseudocode (I'm not quite sure if it is at all possible yet):
refine[Schema Refined IsFlat](schema) // because it's flat I know it can only be a record type with fields of primitives or optionals (encoded as union [null, primitive])
  .folder { // wonky name
    case AvroInt(k, i)    => k + " -> " + i.toString
    case AvroString(k, s) => k + " -> " + s
    // etc...
  } // should result in a function List[Vector[Byte]] => Either[Error, List[String]]
Basically, given a schema and assuming it satisfies the IsFlat constraint, provide a function that decodes records and converts them into lists of strings.
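And here is the rough direction I have been toying with for 1), heavily simplified (not my real ADT; the names are placeholders): unions take their members from a narrower subtype that simply has no constructor for unions.

sealed trait AvroType
sealed trait UnionMember extends AvroType // every Avro type except unions

case object AvroNull extends UnionMember
case object AvroInt extends UnionMember
case object AvroString extends UnionMember
final case class AvroRecord(fields: Map[String, AvroType]) extends UnionMember
final case class AvroUnion(members: List[UnionMember]) extends AvroType

// union -> union cannot be expressed (AvroUnion is not a UnionMember),
// but union -> record -> union is fine because record fields are AvroTypes.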

Is there a way to add extra metadata for Spark dataframes?

Is it possible to add extra metadata to DataFrames?
Reason
I have Spark DataFrames for which I need to keep extra information. Example: a DataFrame for which I want to "remember" the highest index used in an integer id column.
Current solution
I use a separate DataFrame to store this information. Of course, keeping this information separately is tedious and error-prone.
Is there a better solution to store such extra information on DataFrames?
To expand and Scala-fy nealmcb's answer (the question was tagged scala, not python, so I don't think this answer will be off-topic or redundant), suppose you have a DataFrame:
import org.apache.spark.sql
val df = sc.parallelize(Seq.fill(100) { scala.util.Random.nextInt() }).toDF("randInt")
And some way to get the max or whatever you want to memoize on the DataFrame:
val randIntMax = df.rdd.map { case sql.Row(randInt: Int) => randInt }.reduce(math.max)
sql.types.Metadata can only hold strings, booleans, some types of numbers, and other metadata structures. So we have to use a Long:
val metadata = new sql.types.MetadataBuilder().putLong("columnMax", randIntMax).build()
DataFrame.withColumn() actually has an overload that permits supplying a metadata argument at the end, but it's inexplicably marked [private], so we just do what it does — use Column.as(alias, metadata):
val newColumn = df.col("randInt").as("randInt_withMax", metadata)
val dfWithMax = df.withColumn("randInt_withMax", newColumn)
dfWithMax now has (a column with) the metadata you want!
dfWithMax.schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))
> randInt: metadata={}
> randInt_withMax: metadata={"columnMax":2094414111}
Or programmatically and type-safely (sort of; Metadata.getLong() and others do not return Option and may throw a "key not found" exception):
dfWithMax.schema("randInt_withMax").metadata.getLong("columnMax")
> res29: Long = 209341992
Attaching the max to a column makes sense in your case, but in the general case of attaching metadata to a DataFrame and not a column in particular, it appears you'd have to take the wrapper route described by the other answers.
As of Spark 1.2, StructType schemas have a metadata attribute which can hold an arbitrary mapping/dictionary of information for each column in a DataFrame. E.g. (when used with the separate spark-csv library):
customSchema = StructType([
    StructField("cat_id", IntegerType(), True,
                {'description': "Unique id, primary key"}),
    StructField("cat_title", StringType(), True,
                {'description': "Name of the category, with underscores"})])

categoryDumpDF = (sqlContext.read.format('com.databricks.spark.csv')
                  .options(header='false')
                  .load(csvFilename, schema=customSchema))

f = categoryDumpDF.schema.fields
["%s (%s): %s" % (t.name, t.dataType, t.metadata) for t in f]

["cat_id (IntegerType): {u'description': u'Unique id, primary key'}",
 "cat_title (StringType): {u'description': u'Name of the category, with underscores.'}"]
This was added in [SPARK-3569] Add metadata field to StructField - ASF JIRA, and was designed for use in machine learning pipelines to track information about the features stored in columns, such as categorical/continuous, the number of categories, and the category-to-index map. See the SPARK-3569: Add metadata field to StructField design document.
I'd like to see this used more widely, e.g. for descriptions and documentation of columns, the unit of measurement used in the column, coordinate axis information, etc.
Issues include how to appropriately preserve or manipulate the metadata information when the column is transformed, how to handle multiple sorts of metadata, how to make it all extensible, etc.
For the benefit of those thinking of expanding this functionality in Spark dataframes, I reference some analogous discussions around Pandas.
For example, see xray - bring the labeled data power of pandas to the physical sciences which supports metadata for labeled arrays.
And see the discussion of metadata for Pandas at Allow custom metadata to be attached to panel/df/series? · Issue #2485 · pydata/pandas.
See also discussion related to units: ENH: unit of measurement / physical quantities · Issue #10349 · pydata/pandas
If you want less tedious work, I think you can add an implicit conversion between DataFrame and your custom wrapper (I haven't tested it yet, though):
implicit class WrappedDataFrame(val df: DataFrame) {
  var metadata = scala.collection.mutable.Map[String, Long]()

  def addToMetaData(key: String, value: Long) {
    metadata += key -> value
  }

  // ...other methods you consider useful: getters, setters, whatever...
}
If the implicit wrapper is in the DataFrame's scope, you can just use a normal DataFrame as if it were your wrapper, i.e.:
df.addToMetaData("size", 100)
This way also makes your metadata mutable, so you should not be forced to compute it only once and carry it around.
I would store a wrapper around your dataframe. For example:
case class MyDFWrapper(dataFrame: DataFrame, metadata: Map[String, Long])
val maxIndex = df1.agg("index" -> "max").head.getLong(0)
MyDFWrapper(df1, Map("maxIndex" -> maxIndex))
A lot of people saw the word "metadata" and went straight to "column metadata". That does not seem to be what you wanted, and it was not what I wanted when I had a similar problem. Ultimately, the problem here is that a DataFrame is an immutable data structure: whenever an operation is performed on it, the data carries over to the new DataFrame, but anything else attached to the old one does not. This means you can't simply put a wrapper on it, because as soon as you perform an operation you've got a whole new DataFrame (potentially of a completely new type, especially with Scala/Spark's tendency toward implicit conversions). Finally, if the DataFrame ever escapes its wrapper, there is no way to reconstruct the metadata from the DataFrame.
I had this problem in Spark Streaming, which focuses on RDDs (the underlying datastructure of the DataFrame as well) and came to one simple conclusion: the only place to store the metadata is in the name of the RDD. An RDD name is never used by the core Spark system except for reporting, so it's safe to repurpose it. Then, you can create your wrapper based on the RDD name, with an explicit conversion between any DataFrame and your wrapper, complete with metadata.
Unfortunately, this still leaves you with the problem of immutability and new RDDs being created with every operation. The RDD name (our metadata field) is lost with each new RDD, so you need a way to re-add the name to the new RDD. This can be solved by providing a method that takes a function as an argument: it extracts the metadata before calling the function, calls the function to get the new RDD/DataFrame, and then restores the name on the result:
case class MetaDataFrame(df: DataFrame) {
  def withMetadata(fn: DataFrame => DataFrame): MetaDataFrame = {
    val meta = df.rdd.name
    val result = fn(df)
    result.rdd.setName(meta)
    MetaDataFrame(result)
  }
}
Your wrapping class (MetaDataFrame) can provide convenience methods for parsing and setting metadata values, as well as implicit conversions back and forth between Spark DataFrame and MetaDataFrame. As long as you run all your mutations through the withMetadata method, your metadata will carry along through your entire transformation pipeline. Using this method for every call is a bit of a hassle, yes, but the simple reality is that there is no first-class metadata concept in Spark.
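For example, the implicit conversions mentioned here might look like this (an untested sketch, assuming the MetaDataFrame wrapper above):

import org.apache.spark.sql.DataFrame

object MetaDataFrameImplicits {
  implicit def dfToMetaDataFrame(df: DataFrame): MetaDataFrame = MetaDataFrame(df)
  implicit def metaDataFrameToDf(mdf: MetaDataFrame): DataFrame = mdf.df
}

// import MetaDataFrameImplicits._ wherever the conversions should be in scope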

Scala: wrapper for Breeze DenseMatrix for column and row referencing

I am new to Scala. I am looking at it as an alternative to MATLAB for some applications.
I would like to write a wrapping class in Scala that lets me assign column names ("QuantityQ" and "QuantityP" -> range) and row names (dates -> range) to Breeze DenseMatrices (http://www.scalanlp.org/), in order to reference columns and rows.
The usage should resemble Python pandas or Scala Saddle (http://saddle.github.io).
Saddle is very interesting, but its usage is limited to 2D matrices, which is a huge limitation.
My ideas:
Columns:
I thought a Map would do the job for columns, but that may not be the best implementation.
Rows:
For rows, I could maintain a separate Breeze vector with timestamps and provide methods that convert dates into timestamps, doing the number crunching through Breeze. This comes with a loss of generality, as a user may want to give arbitrary string names to rows.
Concerning dates, I would use nscala-time (a Scala wrapper for Joda-Time). Something like the sketch below is what I have in mind.
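A rough, untested sketch of the kind of wrapper I am thinking about (the names are placeholders):

import breeze.linalg.DenseMatrix
import org.joda.time.DateTime

// Column labels and row labels (dates) map to indices into the underlying matrix.
case class LabeledMatrix(
    data: DenseMatrix[Double],
    rowIndex: Map[DateTime, Int],
    colIndex: Map[String, Int]
) {
  def apply(date: DateTime, col: String): Double =
    data(rowIndex(date), colIndex(col))

  def column(col: String) = data(::, colIndex(col)) // column slice (DenseVector view)
  def row(date: DateTime) = data(rowIndex(date), ::) // row slice
}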
What are the drawbacks of my implementation?
Would you design the data structure differently?
Thank you for your help.