Spark streaming - Custom receiver and dataframe infer schema - scala

Consider the code snippet below at the receiver:
val incomingMessage = subscriberSocket.recv(0)
val stringMessages = new String(incomingMessage).stripLineEnd.split(',')
store(Row.fromSeq(Array(stringMessages(0)) ++ stringMessages.drop(2)))
At the receiver, I do not want to convert each of the columns of the table (which is indicated by stringMessages(0)) to its actual column type.
In the main section of the code, when I do
val df = sqlContext.createDataFrame(eachGDNRdd,getSchemaAsStructField)
println(df.collect().length)
I get the below error
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getDouble(rows.scala:44)
Now, the schema consists of both String and Int fields. I have cross-verified that the fields match by type. However, it looks like the Spark DataFrame is not inferring the types.
Questions
1. Shouldn't Spark infer the types of the schema at run time (unless there is a contradiction)?
2. Since the table is dynamic, the schema varies based on the first element of each row (which contains the table name). Is there any simple suggested way to modify the schema on-the-fly?
Or am I missing something obvious?

I'm new to Spark and you didn't say what version you're running, but in v2.1.0, schema inference is disabled by default due to the specific reason you mentioned; if the record structure is inconsistent, Spark can't reliably infer the schema. You can enable schema inference by setting spark.sql.streaming.schemaInference to true, but I think you're better off specifying the schema yourself.
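Note that even with an explicit schema, createDataFrame does not coerce values: a Row holding only Strings will still fail against Int or Double fields, which is exactly what the ClassCastException shows. Below is a minimal sketch of converting the raw strings to the declared types before building the DataFrame; coerceToSchema is a hypothetical helper, and this assumes getSchemaAsStructField is a StructType:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Hypothetical helper: coerce raw string values into the JVM types the schema declares.
def coerceToSchema(values: Array[String], schema: StructType): Row = {
  val typed = values.zip(schema.fields).map { case (v, field) =>
    field.dataType match {
      case IntegerType => v.trim.toInt
      case LongType    => v.trim.toLong
      case DoubleType  => v.trim.toDouble
      case _           => v            // leave StringType (and anything else) as-is
    }
  }
  Row.fromSeq(typed)
}

// e.g. (assuming eachGDNRdd is an RDD[Row] of string-only rows):
// val typedRdd = eachGDNRdd.map(r => coerceToSchema(r.toSeq.map(_.toString).toArray, getSchemaAsStructField))
// val df = sqlContext.createDataFrame(typedRdd, getSchemaAsStructField)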

Related

Spark when/otherwise can't return different StructType

I have the following Spark code; depending on a condition, it tries to parse JSON, each time with a different schema:
df.withColumn("message",
  when($"foo".isNull, from_json($"value".cast("string"), schema1))
    .otherwise(from_json($"value".cast("string"), schema2))
)
It fails with: THEN and ELSE expressions should all be same type or coercible to a common type;
My aim is to apply different schemas depending on the condition.
It's impossible.
from_json converts a string value into a specific StructType according to the provided schema, which becomes the datatype of the newly created column.
Since the when condition is evaluated for every row, it can't return different StructTypes, because a DataFrame column can't be defined with multiple datatypes.
I would recommend creating two different columns, one for the condition $"foo".isNull and another for $"foo".isNotNull, along these lines:
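A minimal sketch of that workaround, using col() instead of the $ interpolator and the schema1/schema2 from the question:
import org.apache.spark.sql.functions.{col, from_json, when}

// Parse into two separate columns, one per schema; rows that don't match a
// condition simply get null in that column.
val parsed = df
  .withColumn("message_a", when(col("foo").isNull, from_json(col("value").cast("string"), schema1)))
  .withColumn("message_b", when(col("foo").isNotNull, from_json(col("value").cast("string"), schema2)))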

No implicit Ordering defined for org.apache.spark.sql.types.TimestampType

I'm playing around with Spark in Scala, using a dataset with a schema I've predefined.
The problem I'm facing is that when I try to sortBy the current RDD on a field whose type is TimestampType, the following message appears in the log:
No implicit Ordering defined for org.apache.spark.sql.types.TimestampType.
For these lines of code:
.sortBy(event => event
    .getAs("sample.timestamp")
    .asInstanceOf[TimestampType],
  ascending = true,
  1)
TimestampType is not the actual type of the values in the column. It defines the data type at the schema level (in StructType -> StructField), but the underlying value type is java.sql.Timestamp.
If you cast the value to Timestamp, the ordering should work properly.
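For example, a minimal rewrite of the snippet above; reading the value as java.sql.Timestamp and sorting on its epoch millis also sidesteps the need for an implicit Ordering[Timestamp]:
.sortBy(event => event
    .getAs[java.sql.Timestamp]("sample.timestamp")
    .getTime,                 // Long already has an implicit Ordering
  ascending = true,
  1)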

Compare spark schema from Dataframe to type T

I am trying to add some runtime type checks when writing a Spark DataFrame. Basically, I want to make sure that the DataFrame schema is compatible with a type T; compatible doesn't mean that it has to be exactly the same. Here is my code:
def save[T: Encoder](dataframe: DataFrame, url: String): Unit = {
  val encoder = implicitly[Encoder[T]]
  assert(dataframe.schema == encoder.schema, s"Unable to save schemas don't match")
  dataframe.write.parquet(url)
}
Currently I am checking that the schemas are equal; how could I check that they are compatible with the type T?
By compatible I mean that if I execute dataframe.as[T] it will work (but I don't want to execute that, because it is quite expensive).
Create an empty DataFrame with the same schema and call .as[T] on it. If it works, the schemas should be compatible!
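A minimal sketch of that idea (the isCompatible helper is hypothetical, not part of the original code); .as[T] resolves the encoder against the schema at analysis time, so on an empty DataFrame it fails fast without reading any data:
import scala.util.Try
import org.apache.spark.sql.{DataFrame, Encoder, Row}

def isCompatible[T: Encoder](dataframe: DataFrame): Boolean = {
  val spark = dataframe.sparkSession
  // Empty DataFrame that carries only the schema, so .as[T] is cheap.
  val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], dataframe.schema)
  Try(empty.as[T]).isSuccess
}
save could then assert on isCompatible[T](dataframe) instead of comparing the schemas for strict equality.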

reading from a spark.structType in scala

I am running the following scala code:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df=hiveContext.sql("SELECT * FROM hl7.all_index")
val rows=df.rdd
val firstStruct=rows.first.get(4)
//I know the column with index 4 IS a StructType
val fs=firstStruct.asInstanceOf[StructType]
//now it fails
//what I'm trying to achieve is
log.println(fs.apply("name"))
I know that firstStruct is of StructType and that one of the StructFields is named "name", but it seems to fail when trying to cast.
I've been told that Spark/Hive structs differ from Scala's, but in order to use StructType I needed to
import org.apache.spark.sql.types._
so I assume they actually should be the same type
I looked here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
in order to see how to get to the structField.
Thanks!
Schema types are logical types. They don't map one-to-one to the type of the objects in a column with that schema type.
For example, Hive/SQL use BIGINT for 64-bit integers while SparkSQL uses LongType, and the actual type of the data in Scala is Long. This is the issue you are having.
A struct in Hive (StructType in SparkSQL) is represented by Row in a dataframe. So, what you want to do is one of the following:
row.getStruct(4)
or:
import org.apache.spark.sql.Row
row.getAs[Row](4)
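Applied to the snippet in the question, a sketch (assuming the nested row keeps its schema when collected, which is the usual case for rows coming out of a DataFrame):
val structRow = rows.first.getStruct(4)
val name = structRow.getAs[String]("name")

// If the nested row carries no schema, resolve the index through the column's StructType instead:
// val structType = df.schema(4).dataType.asInstanceOf[StructType]
// val name = structRow.getString(structType.fieldIndex("name"))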

Is there a way to add extra metadata for Spark dataframes?

Is it possible to add extra metadata to DataFrames?
Reason
I have Spark DataFrames for which I need to keep extra information. Example: A DataFrame, for which I want to "remember" the highest used index in an Integer id column.
Current solution
I use a separate DataFrame to store this information. Of course, keeping this information separately is tedious and error-prone.
Is there a better solution to store such extra information on DataFrames?
To expand and Scala-fy nealmcb's answer (the question was tagged scala, not python, so I don't think this answer will be off-topic or redundant), suppose you have a DataFrame:
import org.apache.spark.sql
val df = sc.parallelize(Seq.fill(100) { scala.util.Random.nextInt() }).toDF("randInt")
And some way to get the max or whatever you want to memoize on the DataFrame:
val randIntMax = df.rdd.map { case sql.Row(randInt: Int) => randInt }.reduce(math.max)
sql.types.Metadata can only hold strings, booleans, some types of numbers, and other metadata structures. So we have to use a Long:
val metadata = new sql.types.MetadataBuilder().putLong("columnMax", randIntMax).build()
DataFrame.withColumn() actually has an overload that permits supplying a metadata argument at the end, but it's inexplicably marked [private], so we just do what it does — use Column.as(alias, metadata):
val newColumn = df.col("randInt").as("randInt_withMax", metadata)
val dfWithMax = df.withColumn("randInt_withMax", newColumn)
dfWithMax now has (a column with) the metadata you want!
dfWithMax.schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))
> randInt: metadata={}
> randInt_withMax: metadata={"columnMax":2094414111}
Or programmatically and type-safely (sort of; Metadata.getLong() and others do not return Option and may throw a "key not found" exception):
dfWithMax.schema("randInt_withMax").metadata.getLong("columnMax")
> res29: Long = 209341992
Attaching the max to a column makes sense in your case, but in the general case of attaching metadata to a DataFrame and not a column in particular, it appears you'd have to take the wrapper route described by the other answers.
As of Spark 1.2, StructType schemas have a metadata attribute which can hold an arbitrary mapping / dictionary of information for each Column in a Dataframe. E.g. (when used with the separate spark-csv library):
customSchema = StructType([
    StructField("cat_id", IntegerType(), True,
                {'description': "Unique id, primary key"}),
    StructField("cat_title", StringType(), True,
                {'description': "Name of the category, with underscores"})])

categoryDumpDF = (sqlContext.read.format('com.databricks.spark.csv')
    .options(header='false')
    .load(csvFilename, schema=customSchema))
f = categoryDumpDF.schema.fields
["%s (%s): %s" % (t.name, t.dataType, t.metadata) for t in f]
["cat_id (IntegerType): {u'description': u'Unique id, primary key'}",
"cat_title (StringType): {u'description': u'Name of the category, with underscores.'}"]
This was added in [SPARK-3569] Add metadata field to StructField - ASF JIRA, and was designed for use in Machine Learning pipelines to track information about the features stored in columns, like categorical/continuous, number of categories, and category-to-index map. See the SPARK-3569: Add metadata field to StructField design document.
I'd like to see this used more widely, e.g. for descriptions and documentation of columns, the unit of measurement used in the column, coordinate axis information, etc.
Issues include how to appropriately preserve or manipulate the metadata information when the column is transformed, how to handle multiple sorts of metadata, how to make it all extensible, etc.
For the benefit of those thinking of expanding this functionality in Spark dataframes, I reference some analogous discussions around Pandas.
For example, see xray - bring the labeled data power of pandas to the physical sciences which supports metadata for labeled arrays.
And see the discussion of metadata for Pandas at Allow custom metadata to be attached to panel/df/series? · Issue #2485 · pydata/pandas.
See also discussion related to units: ENH: unit of measurement / physical quantities · Issue #10349 · pydata/pandas
If you want less tedious work, I think you can add an implicit conversion between DataFrame and your custom wrapper (I haven't tested it yet, though).
implicit class WrappedDataFrame(val df: DataFrame) {
  var metadata = scala.collection.mutable.Map[String, Long]()
  def addToMetaData(key: String, value: Long) {
    metadata += key -> value
  }
  ...[other methods you consider useful, getters, setters, whatever]...
}
If the implicit wrapper is in the DataFrame's scope, you can just use a normal DataFrame as if it was your wrapper, i.e.:
df.addToMetaData("size", 100)
This way also makes your metadata mutable, so you should not be forced to compute it only once and carry it around.
I would store a wrapper around your dataframe. For example:
case class MyDFWrapper(dataFrame: DataFrame, metadata: Map[String, Long])

val maxIndex = df1.agg("index" -> "MAX").head.getLong(0)
MyDFWrapper(df1, Map("maxIndex" -> maxIndex))
A lot of people saw the word "metadata" and went straight to "column metadata". That does not seem to be what you wanted, and it was not what I wanted when I had a similar problem. Ultimately, the problem here is that a DataFrame is an immutable data structure: whenever an operation is performed on it, the data carries over but the rest of the DataFrame does not. This means that you can't simply put a wrapper on it, because as soon as you perform an operation you've got a whole new DataFrame (potentially of a completely new type, especially with Scala/Spark's tendency toward implicit conversions). Finally, if the DataFrame ever escapes its wrapper, there's no way to reconstruct the metadata from the DataFrame.
I had this problem in Spark Streaming, which focuses on RDDs (the underlying data structure of the DataFrame as well) and came to one simple conclusion: the only place to store the metadata is in the name of the RDD. An RDD name is never used by the core Spark system except for reporting, so it's safe to repurpose it. Then you can create your wrapper based on the RDD name, with an explicit conversion between any DataFrame and your wrapper, complete with metadata.
Unfortunately, this does still leave you with the problem of immutability and new RDDs being created with every operation. The RDD name (our metadata field) is lost with each new RDD. That means you need a way to re-add the name to your new RDD. This can be solved by providing a method that takes a function as an argument. It can extract the metadata before the function, call the function and get the new RDD/DataFrame, then name it with the metadata:
case class MetaDataFrame(df: DataFrame) {
  def withMetadata(fn: DataFrame => DataFrame): MetaDataFrame = {
    val meta = df.rdd.name      // read the metadata stored in the RDD name
    val result = fn(df)         // apply the transformation
    result.rdd.setName(meta)    // re-attach the metadata to the new RDD
    MetaDataFrame(result)
  }
}
Your wrapping class (MetaDataFrame) can provide convenience methods for parsing and setting metadata values, as well as implicit conversions back and forth between a Spark DataFrame and a MetaDataFrame. As long as you run all your mutations through the withMetadata method, your metadata will carry along through your entire transformation pipeline. Using this method for every call is a bit of a hassle, yes, but the simple reality is that there is no first-class metadata concept in Spark.
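A hypothetical usage sketch of the wrapper above (the JSON string and the id column are placeholders; it relies on df.rdd returning the same lazily-created RDD instance, which it does in current Spark versions):
import org.apache.spark.sql.functions.col

df.rdd.setName("""{"maxIndex":100}""")                      // seed the metadata once
val mdf = MetaDataFrame(df)
val filtered = mdf.withMetadata(_.filter(col("id") > 10))   // transform through the wrapper
println(filtered.df.rdd.name)                               // metadata survived the transformation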