Avro Files always read as GenericRecord - scala

I have Avro files with a specified schema.
When I load the Avro files, the records always come out as GenericData$Record even though I am specifying the schema.
val schema = Article.SCHEMA$
val job = new Job()
AvroJob.setInputKeySchema(job, schema)
val rootDir = "path-to-avro-files"
val articlesRDD = sc.newAPIHadoopFile(rootDir,
  classOf[AvroKeyInputFormat[Article]],
  classOf[AvroKey[Article]],
  classOf[NullWritable],
  job.getConfiguration)
This code works and I get an RDD with the data contained in the Avro files, but unfortunately the entries of the RDD are all of type GenericData$Record. This means that whenever I want to access a field of my specific schema, I get the following error:
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to de.uni_mannheim.desq.converters.nyt.avroschema.Article
This is the code I use to extract a field from the Avro files:
val abstracts = articlesRDD.map(tuple => {
  val `abstract` = tuple._1.datum.getAbstract // abstract is a reserved word in Scala, so it needs backticks
  `abstract`
})
Also, calling asInstanceOf after accessing the datum (in order to convert the GenericRecord to my Article) leads to the same error.

So I ended up following this tutorial (http://subprotocol.com/system/apache-spark-ec2-avro.html) and regenerating my Avro classes with a newer version of avro-tools.
The classes generated with avro-tools 1.7.x do not work with this solution, while the ones generated with 1.8.1 do.
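As a minimal sketch of the working setup (assuming the Article class has been regenerated with avro-tools 1.8.1 and is on the classpath, and reusing sc from the question; the path is a placeholder), the read looks like the original code, and datum now returns the specific record so the generated accessors work:

import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyInputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance()
AvroJob.setInputKeySchema(job, Article.SCHEMA$) // Article regenerated with avro-tools 1.8.1
val articlesRDD = sc.newAPIHadoopFile("path-to-avro-files",
  classOf[AvroKeyInputFormat[Article]],
  classOf[AvroKey[Article]],
  classOf[NullWritable],
  job.getConfiguration)

// datum is now an Article rather than a GenericData$Record, so no cast is needed
val abstracts = articlesRDD.map(_._1.datum.getAbstract)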

How to convert csv file in S3 bucket to RDD

I'm pretty new to this topic, so any help will be much appreciated.
I'm trying to read a CSV file stored in an S3 bucket and convert its data to an RDD so I can work with it directly, without needing to create a file locally.
So far I've been able to load the file using AmazonS3ClientBuilder, but the only thing I've got is the file content in an S3ObjectInputStream, and I'm not able to work with its content.
val bucketName = "bucket-name"
val credentials = new BasicAWSCredentials(
  "accessKey",
  "secretKey"
)
val s3client = AmazonS3ClientBuilder
  .standard()
  .withCredentials(new AWSStaticCredentialsProvider(credentials))
  .withRegion(Regions.US_EAST_2)
  .build()
val s3object = s3client.getObject(bucketName, "file-name.csv")
val inputStream = s3object.getObjectContent()
....
I have also tried to use a BufferedSource to work with it, but once that's done, I don't know how to convert it to a DataFrame or RDD.
val myData = Source.fromInputStream(inputStream)
....
You can do it with the S3A file system provided in the Hadoop-AWS module:
Add this dependency: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
Either define <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3a.S3AFileSystem</value></property> in core-site.xml, or add .config("fs.s3.impl", classOf[S3AFileSystem].getName) to the SparkSession builder.
Access S3 using spark.read.csv("s3://bucket/key"). If you want the RDD that was asked for, use spark.read.csv("s3://bucket/key").rdd.
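Put together, a minimal sketch (assuming the hadoop-aws dependency is on the classpath; bucket, key and credentials are placeholders) might look like:

import org.apache.hadoop.fs.s3a.S3AFileSystem
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3-csv-to-rdd")
  .config("fs.s3.impl", classOf[S3AFileSystem].getName)
  .config("fs.s3a.access.key", "accessKey") // placeholder credentials
  .config("fs.s3a.secret.key", "secretKey")
  .getOrCreate()

val df = spark.read.csv("s3://bucket-name/file-name.csv")
val rdd = df.rdd // only if you really need an RDD instead of a DataFrame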
In the end I was able to get the results I was looking for by taking a look at https://gist.github.com/snowindy/d438cb5256f9331f5eec

Read Parquet into scala without Spark

I have a Parquet file which I would like to read into my Scala program without using Spark or other Big Data Technologies.
I found the projects
https://github.com/apache/parquet-mr
https://github.com/51zero/eel-sdk
but I could not find detailed enough examples to get them working.
Parquet-MR
https://stackoverflow.com/a/35594368/4533188 mentions this, but the examples given are not complete. For example, it is not clear what path is supposed to be. It is supposed to implement InputFile, but how is that supposed to be done? Also, from the post it seems to me that Parquet-MR does not directly turn the parquet data into standard Scala classes.
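For reference, a minimal sketch of what the parquet-mr read path can look like (assuming a recent parquet-avro artifact on the classpath; HadoopInputFile is an existing implementation of InputFile, and the file path is a placeholder) would be something like:

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val conf = new Configuration()
// HadoopInputFile wraps a Hadoop Path, so you don't have to implement InputFile yourself
val inputFile = HadoopInputFile.fromPath(new Path("/path/to/1.parquet"), conf)
val reader = AvroParquetReader.builder[GenericRecord](inputFile).build()

// read() returns null once the file is exhausted; records come back as Avro GenericRecord,
// not as standard Scala classes, which matches the observation above
Iterator.continually(reader.read()).takeWhile(_ != null).foreach(println)
reader.close()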
Eel
Here I tried
import io.eels.component.parquet.ParquetSource
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val parquetFilePath = new Path("file://home/raeg/Datatroniq/Projekte/14. Witzenmann/Teilprojekt Strom und Spannung/python_witzenmann/src/data/1.parquet")
implicit val hadoopConfiguration = new Configuration()
implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration) // This is required
ParquetSource(parquetFilePath)
  .toDataStream()
  .collect
  .foreach(row => println(row))
but I get the error
java.io.IOException: No FileSystem for scheme: file
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(ParquetReaderTesting.sc:2582)
at org.apache.hadoop.fs.FileSystem.createFileSystem(ParquetReaderTesting.sc:2589)
at org.apache.hadoop.fs.FileSystem.access$200(ParquetReaderTesting.sc:87)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(ParquetReaderTesting.sc:2628)
at org.apache.hadoop.fs.FileSystem$Cache.get(ParquetReaderTesting.sc:2610)
at org.apache.hadoop.fs.FileSystem.get(ParquetReaderTesting.sc:366)
at org.apache.hadoop.fs.FileSystem.get(ParquetReaderTesting.sc:165)
at dataReading.A$A6$A$A6.hadoopFileSystem$lzycompute(ParquetReaderTesting.sc:7)
at dataReading.A$A6$A$A6.hadoopFileSystem(ParquetReaderTesting.sc:7)
at dataReading.A$A6$A$A6.get$$instance$$hadoopFileSystem(ParquetReaderTesting.sc:7)
at #worksheet#.#worksheet#(ParquetReaderTesting.sc:30)
in my worksheet.

Spark Scala file stream

I am new to Spark and Scala. I want to keep reading files from a folder and persist the file content in Cassandra. I have written a simple Scala program using file streaming to read the file content, but it is not reading files from the specified folder.
Can anybody correct my sample code below?
I am using Windows 7.
Code:
val spark = SparkHelper.getOrCreateSparkSession()
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val lines = ssc.textFileStream("file:///C:/input/")
lines.foreachRDD(file => {
  file.foreach(fc => {
    println(fc)
  })
})
ssc.start()
ssc.awaitTermination()
I think a normal Spark job is needed for this scenario rather than Spark Streaming. Spark Streaming is used in cases where your source is something like Kafka or a network port with a continuous inflow of data.
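As a rough sketch of that suggestion (assuming the spark-cassandra-connector dependency, a placeholder Cassandra host, and a hypothetical keyspace demo with a table file_lines that has a single text column line):

import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("folder-to-cassandra")
  .config("spark.cassandra.connection.host", "127.0.0.1") // placeholder Cassandra host
  .getOrCreate()

// Plain batch read of every file currently in the folder
val lines = spark.sparkContext.textFile("file:///C:/input/*")

// Hypothetical keyspace/table; each line becomes one row in the "line" column
lines.map(Tuple1(_)).saveToCassandra("demo", "file_lines", SomeColumns("line"))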

ingesting data in solr using spark scala

I am trying to ingest data into Solr using Scala and Spark; however, my code is missing something. For instance, I got the code below from a Hortonworks tutorial.
I am using Spark 1.6.2, Solr 5.2.1 and Scala 2.10.5.
Can anybody provide me with a working snippet to successfully insert data into Solr?
val input_file = "hdfs:///tmp/your_text_file"
case class Person(id: Int, name: String)
val people_df1 = sc.textFile(input_file).map(_.split(",")).map(p => Person(p(0).trim.toInt, p(1))).toDF()
val docs = people_df1.map { doc =>
  val docx = SolrSupport.autoMapToSolrInputDoc(doc.getAs[Int]("id").toString, doc, null)
  docx.setField("scala_s", "supercool")
  docx.setField("name_s", doc.getAs[String]("name"))
}
// the code below has a compilation issue somehow, although the jar file does contain these functions
SolrSupport.indexDocs("sandbox.hortonworks.com:2181", "testsparksolr", 10, docs)
val solrServer = com.lucidworks.spark.SolrSupport.getSolrServer("http://ambari.asiacell.com:2181")
solrServer.setDefaultCollection("testsparksolr")
solrServer.commit(false, false)
Thanks in advance.
Have you tried spark-solr?
The library's main focus is to provide a clean API to index documents to a Solr server as in your case.
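A rough sketch of indexing a DataFrame with spark-solr (assuming the spark-solr package is on the classpath and reusing people_df1 from the question; the ZooKeeper host and collection name are taken from the snippet above and may need adjusting):

// people_df1 is the DataFrame built in the question
people_df1.write
  .format("solr")
  .option("zkhost", "sandbox.hortonworks.com:2181")
  .option("collection", "testsparksolr")
  .save()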

Converting a sequence of Json Object to An Rdd

I currently have a JSON file, say student.json. The structure looks something like this:
{"serialNo":"1","name":"Rahul"}
{"serialNo":"2","name":"Rakshith"}
case class Student(serialNo:Int,name:String)
student.json is a huge file which I am planning to parse through a Spark job. Here is the snippet:
import play.api.libs.json.{ Json, JsObject, JsString }
.....
.....
for (jsonLine <- sc.textFile("student.json");
     student <- Json.parse(jsonLine).asOpt[Student])
yield (student.serialNo -> student.name)
Is there a better way to do this??
If student.json is a huge file and each line is just a valid JSON object, you should do:
val myRdd = sc.textFile("student.json").map(l => Json.parse(l).asOpt[Student]) // needs an implicit Reads[Student] in scope
If you want to get the RDD to your local master, you can:
val students = myRdd.collect() // then you can operate on it in the old-fashioned way
I saw you are importing play.api.libs.json, which is from the Play Framework. I don't think running a Spark program in a web application is a good idea...
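If the Play JSON dependency is the concern, a minimal sketch using Spark SQL's built-in line-delimited JSON reader instead (assuming Spark 2.x; serialNo is quoted in the sample data, so it is cast to an Int before mapping to the case class) could look like:

import org.apache.spark.sql.SparkSession

case class Student(serialNo: Int, name: String)

val spark = SparkSession.builder().appName("students").getOrCreate()
import spark.implicits._

// Each line of student.json is one JSON object, which is exactly what spark.read.json expects
val students = spark.read.json("student.json")
  .select($"serialNo".cast("int").as("serialNo"), $"name")
  .as[Student]

val pairs = students.map(s => s.serialNo -> s.name) // Dataset[(Int, String)]; use .rdd if an RDD is needed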