How can I create a Spark DataFrame from a Thrift file's struct object in Scala?

I tried this:
val temp = Seq[ProcessAction]() // ProcessAction is declared in Thrift
val toDF = temp.toDF()
and got the error:
scala.ScalaReflectionException: none is a term
If I use a case class object rather than ProcessAction I can get the DataFrame.
Is there any way to get rid of this error?

Parquet files understand Thrift-encoded objects, so you could use ThriftParquetWriter to write a Parquet file and then use Spark SQL to get those objects into a DataFrame.
https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftParquetWriter.java
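For reference, a rough sketch of that route is shown below. It is an assumption, not code from the thread: it presumes ProcessAction is an ordinary Thrift-generated class (i.e. it extends TBase), that a SparkSession named spark already exists, and that the output path is just a placeholder.
// Sketch only: write Thrift objects with parquet-thrift, then read the
// resulting Parquet file back with Spark SQL.
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.parquet.thrift.ThriftParquetWriter

val actions: Seq[ProcessAction] = Seq()            // your Thrift objects go here
val out = new Path("/tmp/process_actions.parquet") // placeholder location

val writer = new ThriftParquetWriter[ProcessAction](
  out, classOf[ProcessAction], CompressionCodecName.SNAPPY)
actions.foreach(a => writer.write(a))
writer.close()

// The Thrift schema is carried in the Parquet metadata, so Spark can infer it.
val df = spark.read.parquet("/tmp/process_actions.parquet")
df.printSchema()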

Related

Scala load data into DataFrame

In Python, and specifically pandas, one can pass an Amazon S3 streaming object to the function pd.read_csv and easily get back a DataFrame with the streamed data. Something like this:
result = s3.get_object(Bucket=my_bucket, key= my_key)
df = pd.read_csv(result)
where result in the above is a streaming object.
Is there a way to do something similar in Scala? I have the code below for retrieving the streaming input in Scala. The object is of type S3ObjectInputStream.
val obj = amazonS3Client.getObject(bucket_name, file_name)
Then the hope is that I could do something like:
scala.read_csv(obj.getObjectContent())
I'm new to Scala, so I would appreciate the help. Thanks.
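There is no direct equivalent of scala.read_csv, but one possible approach (a sketch, not from the thread) is to read the stream into lines and hand them to Spark's CSV reader, which accepts a Dataset[String] since Spark 2.2. It assumes a SparkSession named spark and reuses amazonS3Client, bucket_name and file_name from the question.
// Sketch: parse an S3ObjectInputStream as CSV via Spark.
import scala.io.Source

import spark.implicits._

val obj = amazonS3Client.getObject(bucket_name, file_name)
val lines = Source.fromInputStream(obj.getObjectContent).getLines().toSeq

// spark.read.csv(Dataset[String]) parses in-memory CSV lines (Spark 2.2+).
val df = spark.read.option("header", "true").csv(lines.toDS())
df.show()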

How to read parquet files with Flink as Case Class (scala)?

With Spark we can easily read Parquet files and use them as a case class with the following code:
spark.read.parquet("my_parquet_table").as[MyCaseClass]
With Flink, I'm having a lot of trouble doing that. My case class comes from an Avro schema, so it is a SpecificRecord.
I tried the following:
val parquetInputFormat = new ParquetRowInputFormat(new Path(path), messageType)
env.readFile(parquetInputFormat, path)
The issue here is the messageType; I was not able to convert either my case class or the Avro schema to a valid messageType.
I tried this:
val messageType = ParquetSchemaConverter.toParquetType(TypeInformation.of(classOf[MyCaseClass]), true)
which ends with the following error:
class org.apache.flink.formats.avro.typeutils.AvroTypeInfo cannot be cast to class org.apache.flink.api.java.typeutils.RowTypeInfo
I could try to use the Table API, but that would mean having to create the whole table schema myself, and it would be a pain to maintain.
If someone could point me to an example implementation, or suggest anything that might help, it would be greatly appreciated.
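One possible way to obtain the missing messageType, sketched below under the assumption that MyCaseClass really is an Avro SpecificRecord as described: derive the Avro schema from the generated class and convert it with parquet-avro's AvroSchemaConverter, then feed the result to the ParquetRowInputFormat from the question. This yields rows of type Row, so the Row-to-case-class mapping still has to be written by hand; env and path are the ones from the question.
// Sketch, not a verified solution: build the MessageType from the Avro schema
// behind MyCaseClass (assumed to be an Avro SpecificRecord).
import org.apache.avro.specific.SpecificData
import org.apache.flink.core.fs.Path
import org.apache.flink.formats.parquet.ParquetRowInputFormat
import org.apache.parquet.avro.AvroSchemaConverter
import org.apache.parquet.schema.MessageType

val avroSchema = SpecificData.get().getSchema(classOf[MyCaseClass])
val messageType: MessageType = new AvroSchemaConverter().convert(avroSchema)

val parquetInputFormat = new ParquetRowInputFormat(new Path(path), messageType)
val rows = env.readFile(parquetInputFormat, path) // stream of Row, not MyCaseClass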

How to convert a java.io list to a DataFrame in Scala?

I'm using this code to get the list of files in a directory, and I want to call the toDF method that works when converting lists to DataFrames. However, because this is a java.io List, it's saying it won't work.
val files = Option(new java.io.File("data").list).map(_.count(_.endsWith(".csv"))).getOrElse(0)
When I try to do
files.toDF.show()
I get an error.
How can I get this to work? Can someone help me with the code to convert this java.io List to a regular list?
Thanks
val files = Option(new java.io.File("data").list).map(_.count(_.endsWith(".csv"))).getOrElse(0)
The code above returns an Int, and you are trying to convert an Int value to a DataFrame, which is not possible. If I understand correctly, you want to convert the list of .csv files to a DataFrame. Please use the code below:
val files = Option(new java.io.File("data").list).get.filter(x => x.endsWith(".csv")).toList
import spark.implicits._
files.toDF().show()
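A slightly more defensive, self-contained variant of the same idea is sketched below; the directory name "data" is carried over from the question, and the column name and SparkSession setup are assumptions.
// Variant of the answer's approach that handles a missing/unreadable directory
// (File.list returns null in that case) and names the output column.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-file-list").getOrCreate()
import spark.implicits._

val csvFiles: List[String] =
  Option(new java.io.File("data").list())  // Array[String], or null if "data" is missing
    .map(_.toList)
    .getOrElse(Nil)
    .filter(_.endsWith(".csv"))

csvFiles.toDF("file_name").show()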

How to serialize/deserialize case classes from a spark dataset to/from s3

Let's say I have a Dataset[MyData] where MyData is defined as:
case class MyData(id: String, listA: List[SomeOtherCaseClass])
I want to save the data to S3 and load it back later as the MyData case class.
I know case class data is serializable. But is it possible to do something like:
myData.write.xxxx("s3://someBucket/some")
// later
val myloadedData: Dataset[MyData] = spark.read.yyyy("s3://someBucket/some", MyData)
What does serialization mean to you?
You only need to do exactly what you showed, choosing any available format you like, e.g. CSV, JSON, Parquet, ORC, ...
(I would recommend benchmarking ORC against Parquet for your data, to see which one works better for you.)
myData.write.orc("s3://someBucket/somePath")
And, when reading, just use the same format to get a DataFrame back, which you can cast to a Dataset[MyData] using the as[T] method.
val myloadedData: Dataset[MyData] = spark.read.orc("s3://someBucket/somePath").as[MyData]
Or was your question how to connect to S3? If so: if you are running on EMR, then everything is already set up. You only need to prefix your path with s3://, as you already did.
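Putting the pieces together, a compact sketch of the round trip might look like the following; the bucket path is the placeholder from the question, and a configured s3:// (or s3a://) filesystem is assumed.
// End-to-end sketch: write a Dataset[MyData] as ORC and read it back typed.
import org.apache.spark.sql.{Dataset, SparkSession}

case class SomeOtherCaseClass(value: String)
case class MyData(id: String, listA: List[SomeOtherCaseClass])

val spark = SparkSession.builder().appName("case-class-s3").getOrCreate()
import spark.implicits._  // provides the Encoder used by .as[MyData]

val myData: Dataset[MyData] = Seq(
  MyData("a", List(SomeOtherCaseClass("x"), SomeOtherCaseClass("y")))
).toDS()

myData.write.mode("overwrite").orc("s3://someBucket/somePath")

val myLoadedData: Dataset[MyData] =
  spark.read.orc("s3://someBucket/somePath").as[MyData]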

pyspark FPGrowth doesn't work with RDD

I am trying to use the FPGrowth function on some data in Spark. I tested the example here with no problems:
https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html
However, my dataset comes from Hive:
data = hiveContext.sql('select transactionid, itemid from transactions')
model = FPGrowth.train(data, minSupport=0.1, numPartitions=100)
This failed with a "method does not exist" error:
py4j.protocol.Py4JError: An error occurred while calling o764.trainFPGrowthModel. Trace:
py4j.Py4JException: Method trainFPGrowthModel([class org.apache.spark.sql.DataFrame, class java.lang.Double, class java.lang.Integer]) does not exist
So, I converted it to an RDD:
data=data.rdd
Now I'm getting some strange pickle serializer errors:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
Then I started looking at the types. In the example, the data is run through a flatMap, which returns a different type than my RDD.
RDD type returned by flatMap: pyspark.rdd.PipelinedRDD
RDD type returned by hiveContext: pyspark.rdd.RDD
FPGrowth only seems to work with a PipelinedRDD. Is there some way I can convert a regular RDD to a PipelinedRDD?
Thanks!
Well, my query was wrong, but I changed it to use collect_set, and then I managed to get around the type error by doing:
data = data.map(lambda row: row[0])