How to read parquet files using `ssc.fileStream()`? What are the types passed to `ssc.fileStream()`? - scala

My understanding of Spark's fileStream() method is that it takes three type parameters: Key, Value, and Format. For text files, the appropriate types are LongWritable, Text, and TextInputFormat.
First, I want to understand the nature of these types. Intuitively, I would guess that the Key in this case is the line number of the file, and the Value is the text on that line. So, in the following example of a text file:
Hello
Test
Another Test
The first row of the DStream would have a Key of 1 (0?) and a Value of Hello.
Is this correct?
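For reference, this is roughly what the call looks like with those three types (a sketch, assuming an existing StreamingContext named ssc and a made-up input directory):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Sketch: key, value and InputFormat type parameters for plain text files.
val textStream = ssc.fileStream[LongWritable, Text, TextInputFormat]("/tmp/textDir")
val pairs = textStream.map { case (key, value) => (key.get(), value.toString) }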
Second part of my question: I looked at the decompiled implementation of ParquetInputFormat and I noticed something curious:
public class ParquetInputFormat<T>
extends FileInputFormat<Void, T> {
//...
public class TextInputFormat
extends FileInputFormat<LongWritable, Text>
implements JobConfigurable {
//...
TextInputFormat extends FileInputFormat of types LongWritable and Text, whereas ParquetInputFormat extends the same class of types Void and T.
Does this mean that I must create a Value class to hold an entire row of my parquet data, and then pass the types <Void, MyClass, ParquetInputFormat<MyClass>> to ssc.fileStream()?
If so, how should I implement MyClass?
EDIT 1: I have noticed a readSupportClass which is to be passed to ParquetInputFormat objects. What kind of class is this, and how is it used to parse the parquet file? Is there some documentation that covers this?
EDIT 2: As far as I can tell, this is impossible. If anybody knows how to stream in parquet files to Spark then please feel free to share...

Here is my sample for reading Parquet files in Spark Streaming.
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class", "parquet.avro.AvroReadSupport")
val stream = ssc.fileStream[Void, GenericRecord, ParquetInputFormat[GenericRecord]](
  directory, { path: Path => path.toString.endsWith("parquet") }, true, ssc.sparkContext.hadoopConfiguration)
val lines = stream.map(row => {
  println("row:" + row.toString())
  row
})
Some points:
The record type is GenericRecord.
The readSupportClass is AvroReadSupport.
Pass a Configuration to fileStream.
Set parquet.read.support.class in that Configuration.
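Each element of the stream is a (Void, GenericRecord) pair, so fields can be pulled out of the record by name. For example (a sketch; "name" is a hypothetical field, so adjust it to your Avro schema):
// The Void key carries no information; the GenericRecord holds the row.
val names = stream.map { case (_, record) => Option(record.get("name")).map(_.toString).orNull }
names.print()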
I referred to the sources below when creating this sample. I also could not find good examples, so I would welcome a better one.
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala

You can access the Parquet files by adding some Parquet-specific Hadoop settings:
val ssc = new StreamingContext(conf, Seconds(5))
val schema = StructType(Seq(
  StructField("a", StringType, nullable = false),
  ........
))
val schemaJson = schema.json
val fileDir = "/tmp/fileDir"
ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class", "org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport")
ssc.sparkContext.hadoopConfiguration.set("org.apache.spark.sql.parquet.row.requested_schema", schemaJson)
ssc.sparkContext.hadoopConfiguration.set(SQLConf.PARQUET_BINARY_AS_STRING.key, "false")
ssc.sparkContext.hadoopConfiguration.set(SQLConf.PARQUET_INT96_AS_TIMESTAMP.key, "false")
ssc.sparkContext.hadoopConfiguration.set(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key, "false")
val streamRdd = ssc.fileStream[Void, UnsafeRow, ParquetInputFormat[UnsafeRow]](
  fileDir, (t: org.apache.hadoop.fs.Path) => true, false)
streamRdd.count().print()
ssc.start()
ssc.awaitTermination()
This code was prepared with Spark 2.1.0.
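Because the values come back as UnsafeRow (Spark's internal row format), fields still have to be read out by ordinal and type. A minimal sketch, to slot in before ssc.start(), assuming the first field in the schema above is the StringType column "a":
// UnsafeRow is an InternalRow, so values are accessed by ordinal and data type.
val values = streamRdd.map { case (_, row) => row.getUTF8String(0).toString }
values.print()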

Related

Spark Scala - textFile() and sequenceFile() RDDs

I'm successfully loading my sequence files into a DataFrame with some code like this:
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val file = sc.sequenceFile[LongWritable, String](src)
val jsonRecs = file.map((record: (LongWritable, String)) => new String(record._2))
val df = sqlContext.read.json(jsonRecs)
I'd like to do the same with some text files. The text files have a similar format to the sequence files (a timestamp, a tab character, then the JSON). But the problem is that textFile() returns an RDD[String] instead of an RDD[(LongWritable, String)] like the sequenceFile() method does.
My goal is to be able to test the program with either sequence files or text files as input.
How could I convert the RDD[String] coming from textFile() into an RDD[(LongWritable, String)]? Or is there a better solution?
Assuming that your text file is a CSV file, you can use the following code to read a CSV file into a DataFrame, where spark is the SparkSession:
val df = spark.read.option("header", "false").csv("file.txt")
Like the header option, there are several other options you can provide depending on your requirements; check the CSV reader documentation for more details.
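For example, a couple of the commonly used options (a sketch; the option names are from the standard Spark CSV reader):
// A tab-separated file with a header row and schema inference enabled.
val df = spark.read
  .option("header", "true")
  .option("sep", "\t")
  .option("inferSchema", "true")
  .csv("file.txt")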
Thanks for the responses. It's not a CSV, but I guess it could be. It's just the text output of doing this on a sequence file in HDFS:
hdfs dfs -text /path/to/my/file > myFile.txt
Anyway, I found a solution that works for both sequence and text files for my use case. This code ends up setting the variable 'file' to an RDD[(String, String)] in both cases, and I can work with that.
var file = if (inputType.equalsIgnoreCase("text")) {
  sc.textFile(src).map(line => (line.split("\t")(0), line.split("\t")(1)))
} else { // Default to assuming sequence files are input
  sc.sequenceFile[String, String](src)
}
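From there the rest of the original pipeline works unchanged for either input type, e.g. (a sketch reusing the sqlContext from the question):
// Drop the timestamp key and parse the JSON payloads into a DataFrame.
val jsonRecs = file.map { case (_, json) => json }
val df = sqlContext.read.json(jsonRecs)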

Text extraction using spark and scala

I have a text extraction algorithm in Scala, and I want to use Spark on top of it. I am not sure how to do this, as I am new to both Spark and Scala.
My algorithm looks roughly like this (pseudocode):
object HelloWorld {
  val algoobject = new ObjectExtract
  var textFile = "Path to text file"
  for each sentence in textFile {
    val instances = algoobject.extract(sentence)
    save instances to an output file
  }
}
I can have multiple text files here, and there are a lot of them.
Can anyone tell me how this can be done using Spark?
My algorithm is in Scala, so I will be using Scala for this task.
Try this:
val algoobject = new ObjectExtract
val rdd = sparkContext.textFile("Path to text file")
rdd.map(sentence => algoobject.extract(sentence)).saveAsTextFile("outputDirectory")
Just make sure ObjectExtract extends Serializable, otherwise it won't work.
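If the input is spread across many files, the same code works with a directory or a glob as the path. A sketch of both points together (the extract body here is only a stand-in for the real logic):
// Making the extractor serializable lets Spark ship it to the executors.
class ObjectExtract extends Serializable {
  def extract(sentence: String): String = sentence.toUpperCase // placeholder for the real extraction logic
}

val algoobject = new ObjectExtract
// A directory or glob reads all matching text files as a single RDD of lines.
val rdd = sparkContext.textFile("path/to/textFiles/*.txt")
rdd.map(sentence => algoobject.extract(sentence)).saveAsTextFile("outputDirectory")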

How to load a csv directly into a Spark Dataset?

I have a csv file [1] which I want to load directly into a Dataset. The problem is that I always get errors like
org.apache.spark.sql.AnalysisException: Cannot up cast `probability` from string to float as it may truncate
The type path of the target object is:
- field (class: "scala.Float", name: "probability")
- root class: "TFPredictionFormat"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
Moreover, and specifically for the phrases field (see the case class in [2]), I get
org.apache.spark.sql.AnalysisException: cannot resolve '`phrases`' due to data type mismatch: cannot cast StringType to ArrayType(StringType,true);
If I define all the fields in my case class [2] as type String then everything works fine but this is not what I want. Is there a simple way to do it [3]?
References
[1] An example row
B017NX63A2,Merrell,"['merrell_for_men', 'merrell_mens_shoes', 'merrel']",merrell_shoes,0.0806054356579781
[2] My code snippet is as follows
import spark.implicits._
val INPUT_TF = "<SOME_URI>/my_file.csv"
final case class TFFormat(
  doc_id: String,
  brand: String,
  phrases: Seq[String],
  prediction: String,
  probability: Float
)

val ds = sqlContext.read
  .option("header", "true")
  .option("charset", "UTF8")
  .csv(INPUT_TF)
  .as[TFFormat]

ds.take(1).map(println)
[3] I have found ways to do it by first defining columns at the DataFrame level and then converting things to a Dataset (like here or here or here), but I am almost sure this is not the way things are supposed to be done. I am also pretty sure that Encoders are probably the answer, but I don't have a clue how to use them.
TL;DR With CSV input, transforming with standard DataFrame operations is the way to go. If you want to avoid that, you should use an input format with a more expressive type system (Parquet or even JSON).
In general, data to be converted to a statically typed Dataset must already be of the correct type. The most efficient way to do this is to provide the schema argument to the CSV reader:
val schema: StructType = ???
val ds = spark.read
  .option("header", "true")
  .schema(schema)
  .csv(path)
  .as[T]
where schema could be inferred by reflection:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
val schema = ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType]
Unfortunately it won't work with your data and class, because the CSV reader doesn't support ArrayType (though it would work for atomic types like FloatType), so you have to do it the hard way. A naive solution could be expressed as below:
import org.apache.spark.sql.functions._
val df: DataFrame = ??? // Raw data
df
  .withColumn("probability", $"probability".cast("float"))
  .withColumn("phrases",
    split(regexp_replace($"phrases", "[\\['\\]]", ""), ","))
  .as[TFFormat]
but you may need something more sophisticated depending on the content of phrases.
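For instance, if the phrases values always look like the Python-style list in [1], a UDF can strip the brackets and quotes explicitly (a sketch reusing the df and imports above; the parsing rules are assumptions based on that example row):
// A UDF that turns "['a', 'b']" into Seq("a", "b"); adjust the stripping rules to your data.
val parsePhrases = udf { raw: String =>
  raw.stripPrefix("[").stripSuffix("]")
    .split(",")
    .map(_.trim.stripPrefix("'").stripSuffix("'"))
    .toSeq
}

val ds = df
  .withColumn("probability", $"probability".cast("float"))
  .withColumn("phrases", parsePhrases($"phrases"))
  .as[TFFormat]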

Read specific column from Parquet without using Spark

I am trying to read Parquet files without using Apache Spark, and I am able to do it, but I am finding it hard to read specific columns. I cannot find any good resources on Google, as almost all the posts are about reading Parquet files using Spark. Below is my code:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.avro.AvroParquetReader
object parquetToJson {
  def main(args: Array[String]): Unit = {
    // case class Customer(key: Int, name: String, sellAmount: Double, profit: Double, state: String)
    val parquetFilePath = new Path("data/parquet/Customer/")
    val reader = AvroParquetReader.builder[GenericRecord](parquetFilePath).build() //.asInstanceOf[ParquetReader[GenericRecord]]
    val iter = Iterator.continually(reader.read).takeWhile(_ != null)
    val list = iter.toList
    list.foreach(record => println(record))
  }
}
The commented-out case class represents the schema of my file, and right now the above code reads all the columns from the file. I want to read only specific columns.
If you just want to read specific columns, then you need to set a read schema on the configuration that the ParquetReader builder accepts. (This is also known as a projection).
In your case you should be able to call .withConf(conf) on the AvroParquetReader builder class, and in the conf you pass in, invoke conf.set(ReadSupport.PARQUET_READ_SCHEMA, schema), where schema is an Avro schema in String form.
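A rough sketch of that, following the suggestion above (the projection schema here is a made-up two-column Avro schema, so match the field names and types to your file):
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.api.ReadSupport

// A projection schema that keeps only two of the columns (hypothetical field names).
val projection =
  """{"type":"record","name":"Customer","fields":[
    |  {"name":"key","type":"int"},
    |  {"name":"name","type":"string"}
    |]}""".stripMargin

val conf = new Configuration()
conf.set(ReadSupport.PARQUET_READ_SCHEMA, projection)

val reader = AvroParquetReader.builder[GenericRecord](new Path("data/parquet/Customer/"))
  .withConf(conf)
  .build()
Iterator.continually(reader.read).takeWhile(_ != null).foreach(record => println(record))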

Unit testing with Spark dataframes

I'm trying to test a part of my program which performs transformations on DataFrames.
I want to test several different variations of these DataFrames, which rules out the option of reading a specific DataFrame from a file.
And so my questions are:
Is there any good tutorial on how to perform unit testing with Spark and dataframes, especially regarding the dataframes creation?
How can I create these different multi-row DataFrames without a lot of boilerplate and without reading them from a file?
Are there any utility classes for checking for specific values inside a dataframe?
I obviously googled that before but could not find anything which was very useful. Among the more useful links I found were:
Running a basic unit test with a dataframe
Custom made assertions with DF
It would be great if examples/tutorials are in Scala but I'll take whatever language you've got
Thanks in advance
This link shows how we can programmatically create a data frame with schema. You can keep the data in separate traits and mix it in with your tests. For instance,
// This example assumes CSV data. But same approach should work for other formats as well.
trait TestData {
  val data1 = List(
    "this,is,valid,data",
    "this,is,in-valid,data"
  )
  val data2 = ...
}
Then with ScalaTest, we can do something like this.
class MyDFTest extends FlatSpec with Matchers {
  "method" should "perform this" in new TestData {
    // You can access your test data here. Use it to create the DataFrame.
    // Your test here.
  }
}
To create the DataFrame, you can have few util methods like below.
def schema(types: Array[String], cols: Array[String]) = {
  val datatypes = types.map {
    case "String" => StringType
    case "Long" => LongType
    case "Double" => DoubleType
    // Add more types here based on your data.
    case _ => StringType
  }
  StructType(cols.indices.map(x => StructField(cols(x), datatypes(x))).toArray)
}

def df(data: List[String], types: Array[String], cols: Array[String]) = {
  val lines = sc.parallelize(data)
  val parser = new CSVParser(',')
  val split = lines.map(line => parser.parseLine(line))
  val rows = split.map(arr => Row(arr(0), arr(1), arr(2), arr(3)))
  sqlContext.createDataFrame(rows, schema(types, cols))
}
I am not aware of any utility classes for checking specific values in a DataFrame. But I think it should be simple to write one using the DataFrame APIs.
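For instance, a bare-bones assertion helper can just compare schemas and collected rows (a sketch; fine for the small DataFrames typically used in tests):
import org.apache.spark.sql.DataFrame

// Order-insensitive comparison of two small DataFrames, intended for unit tests only.
def assertDataFrameEquals(actual: DataFrame, expected: DataFrame): Unit = {
  assert(actual.schema == expected.schema,
    s"Schemas differ: ${actual.schema} vs ${expected.schema}")
  assert(actual.collect().toSet == expected.collect().toSet, "Row contents differ")
}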
You could use SharedSQLContext and SharedSparkSession that Spark uses for its own unit tests. Check my answer for examples.
For those looking to achieve something similar in Java, you can start by using this project to initialize a SparkContext within your unit tests: https://github.com/holdenk/spark-testing-base
I personally had to mimic the file structure of some Avro files. So I used avro-tools (https://avro.apache.org/docs/1.8.2/gettingstartedjava.html#download_install) to extract the schema from my binary records using the following command:
java -jar $AVRO_HOME/avro tojson largeAvroFile.avro | head -3
Then, using this small helper method, you can convert the output JSON into a DataFrame to use in your unit tests.
private DataFrame getDataFrameFromList() {
  SQLContext sqlContext = new SQLContext(jsc());
  // The JSON payloads need to be Java string literals, so the quotes are escaped.
  ImmutableList<String> elements = ImmutableList.of(
    "{\"header\":{\"appId\":\"myAppId1\",\"clientIp\":\"10.22.63.3\",\"createdDate\":\"2017-05-10T02:09:59.984Z\"}}",
    "{\"header\":{\"appId\":\"myAppId1\",\"clientIp\":\"11.22.63.3\",\"createdDate\":\"2017-05-11T02:09:59.984Z\"}}",
    "{\"header\":{\"appId\":\"myAppId1\",\"clientIp\":\"12.22.63.3\",\"createdDate\":\"2017-05-11T02:09:59.984Z\"}}"
  );
  JavaRDD<String> parallelize = jsc().parallelize(elements);
  return sqlContext.read().json(parallelize);
}
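The Scala analogue of the same idea is even shorter (a sketch, assuming a SparkContext sc and an SQLContext are already available):
// Build a DataFrame from in-memory JSON strings, mirroring the Java helper above.
val elements = Seq(
  """{"header":{"appId":"myAppId1","clientIp":"10.22.63.3","createdDate":"2017-05-10T02:09:59.984Z"}}""",
  """{"header":{"appId":"myAppId1","clientIp":"11.22.63.3","createdDate":"2017-05-11T02:09:59.984Z"}}"""
)
val df = sqlContext.read.json(sc.parallelize(elements))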