How to store values into a dataframe from a List using Scala for handling nested JSON data - scala

I have the below code, where I am pulling data from an API and storing it into a JSON file. Further I will be loading into an oracle table. Also, the data value of the ID column is the column name under VelocityEntries. I am able to print the data in completedEntries.values but I need help to put it into one df and add with the emedded_df
With the below code I am able to print the data in completedEntries.values but I need help to put it into one df and add with the emedded_df
val inputStream = scala.io.Source.fromInputStream(connection.getInputStream).mkString
val fileWriter1 = new FileWriter(new File(filename))
fileWriter1.write(inputStream.mkString)
fileWriter1.close()
val json_df = spark.read.option("multiLine", true).json(filename)
val embedded_df = json_df.select(explode(col("sprints")) as "x").select(("x.*"))
val list_df = json_df.select("velocityStatEntries.*").columns.toList
for( i <- list_df)
{
completed_df = json_df.select(s"velocityStatEntries.$i.completed.value")
completed_df.show()
}

Related

Hbase insert are very slow when kafka avro records are converted to Json

I am using Kafka 10 and receiving records in it from DB2 CDC. Kafka 10 uses Confluent Schema Registry to store the DB2 table schema and sends the records as Avro Array[Byte]. I want to store these records into Hbase (lets say Raw Hbase) and then run some transformation over those new records(like dropping columns, aggregation etc) using Hive and store the transformed records again into Hbase (lets say conformed Hbase). I tried 2 approaches and both are giving me some kind of issues. The records are big in length with ~500 columns(although only 10% of columns are req.) and each record is of size ~10kb.
1) I tried deserializing the records into Array[Byte] and then use the streamBulkPut method to insert it into Hbase.
Deserializer code:
def toRecord(buffer: Array[Byte]): Array[Byte] = {
var schemaRegistry: SchemaRegistryClient = null
schemaRegistry= new CachedSchemaRegistryClient(url, 10)
val bb = ByteBuffer.wrap(buffer)
bb.get() // consume MAGIC_BYTE
val schemaId = bb.getInt // consume schemaId //println(schemaId.toString)
val schema = schemaRegistry.getByID(schemaId) // consult the Schema Registry //println(schema)
val reader = new GenericDatumReader[GenericRecord](schema)
val decoder = DecoderFactory.get().binaryDecoder(buffer, bb.position(), bb.remaining(), null)
val writer = new GenericDatumWriter[GenericRecord](schema)
val baos = new ByteArrayOutputStream
val jsonEncoder = EncoderFactory.get.jsonEncoder(schema, baos)
writer.write( reader.read(null, decoder), jsonEncoder) //reader.read(null, decoder): returns Generic record
jsonEncoder.flush
baos.toByteArray
}
HBase bulkPut code:
val messages = KafkaUtils.createDirectStream[Object,Array[Byte],KafkaAvroDecoder,DefaultDecoder](ssc, kafkaParams, topicSet)
val hconf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(ssc.sparkContext, hconf)
val tableName = "your_table"
var rowKeyArray: Array[String] = null
hbaseContext.streamBulkPut(messages,TableName.valueOf(tableName),putFunction)
def putFunction(avroRecord:Tuple2[Object,Array[Byte]]):Put = {
implicit val formats = DefaultFormats
val recordKey = getKeyString(parse(avroRecord._1.toString.mkString).extract[Map[String,String]].values.mkString)
var put = new Put(Bytes.toBytes(recordKey))
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("row"), AvroDeserializer.toRecord(avroRecord._2))
put
}
def getKeyString(keystr:String):String = {
(Math.abs(keystr map (_.hashCode) reduceLeft( 31 * _ + _) ) % 10 + 48).toChar + "_" + keystr.trim
}
Now this method works but the inserts are painfully slow. I am getting a throughput of ~5k records per minute. The plan was once the records are in Raw Hbase I will use Hive to read and explode the json to run the transformation.
2) Instead of re-serializing the records while storing into Raw Hbase I thought of doing it while loading from Raw->Conformed Hbase (I can manage the slowness here as the data will be already with me i.e. out of kafka). So I tried storing Avro records as it is into Hbase and it ran very fast, I was able to insert 1.5 Million records in 2 mins. Below is code:
hbaseContext.streamBulkPut(messages,TableName.valueOf(tableName),putFunction)
def putFunction(avroRecord:Tuple2[Object,Array[Byte]]):Put = {
implicit val formats = DefaultFormats
val recordKey = parse(avroRecord._1.toString.mkString).extract[Map[String,String]]
var put = new Put(Bytes.toBytes(getKeyString(recordKey.values.mkString)))
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("row"), avroRecord._2)
put
}
The problem with this approach is Hive is not able to read Avro records from Hbase and I cannot filter the records/run any logic on it.
I would appreciate any kind of help or resource that I can follow to improve the performance. Any approach would work for me if its corresponding issue is solved. Thanks

Spark: How to write org.apache.spark.rdd.RDD[java.io.ByteArrayOutputStream]

I have an RDD that has the signature
org.apache.spark.rdd.RDD[java.io.ByteArrayOutputStream]
In this RDD, each row has its own partition.
This ByteArrayOutputStream is zip output. I am applying some processing on the data in each partition and i want to export the processed data from each partition as a single zip file. What is the best way to export each Row in the final RDD as one file per row on hdfs?
If you are interested in knowing how I ended up with such an Rdd.
val npyData = transformedTopData.select("tokenIDF", "topLevelId").rdd.repartition(2).mapPartitions(x => {
val vectors = for {
row <- x
} yield {
row.getAs[Vector](0)
}
Seq(ml2npyCSR(vectors.toSeq).zipOut)
}.iterator)
EDIT: Count works perfectly fine
scala> npyData.count()
res9: Long = 2
Spark has very little support for file system operations. You'll need to Hadoop FileSystem API to create individual files
// This method is needed as Hadoop conf object is not serializable
def createFileStream(pathStr:String) = {
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
val hadoopconf = new Configuration();
val fs = FileSystem.get(hadoopconf);
val outFileStream = fs.create(new Path(pathStr));
outFileStream
}
// Method writes to individual files.
// Needs a unique id along with object for output file naming
def writeToFile( x:(Char, Long) ) : Unit = {
val (dataStream, id) = x
val output_dir = "/tmp/del_a/"
val outFileStream = createFileStream(output_dir+id)
dataStream.writeTo(outFileStream)
outFileStream.close()
}
// zipWithIndex used for creating unique id for each item in rdd
npyData.zipWithIndex().foreach(writeToFile)
Reference:
Hadoop FileSystem example
ByteArrayOutputStream.writeTo(java.io.OutputStream)
I figured out that I should represent my data as PairRDD and implement a custom FileOutputFormat. I looked in to the implementation of SequenceFileOutputFormat for inspiration and managed to write my own version based on that.
My custom FileOutputFormat is available here

How to add source file name to each row in Spark?

I'm new to Spark and am trying to insert a column to each input row with the file name that it comes from.
I've seen others ask a similar question, but all their answers used wholeTextFile, but I'm trying to do this for larger CSV files (read using the Spark-CSV library), JSON files, and Parquet files (not just small text files).
I can use the spark-shell to get a list of the filenames:
val df = sqlContext.read.parquet("/blah/dir")
val names = df.select(inputFileName())
names.show
but that's a dataframe.
I am not sure how to add it as a column to each row (and if that result is ordered the same as the initial data either, though I assume it always is) and how to do this as a general solution for all input types.
Another solution I just found to add file name as one of the columns in DataFrame
val df = sqlContext.read.parquet("/blah/dir")
val dfWithCol = df.withColumn("filename",input_file_name())
Ref:
spark load data and add filename as dataframe column
When you create a RDD from a text file, you probably want to map the data into a case class, so you could add the input source in that stage:
case class Person(inputPath: String, name: String, age: Int)
val inputPath = "hdfs://localhost:9000/tmp/demo-input-data/persons.txt"
val rdd = sc.textFile(inputPath).map {
l =>
val tokens = l.split(",")
Person(inputPath, tokens(0), tokens(1).trim().toInt)
}
rdd.collect().foreach(println)
If you do not want to mix "business data" with meta data:
case class InputSourceMetaData(path: String, size: Long)
case class PersonWithMd(name: String, age: Int, metaData: InputSourceMetaData)
// Fake the size, for demo purposes only
val md = InputSourceMetaData(inputPath, size = -1L)
val rdd = sc.textFile(inputPath).map {
l =>
val tokens = l.split(",")
PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
}
rdd.collect().foreach(println)
and if you promote the RDD to a DataFrame:
import sqlContext.implicits._
val df = rdd.toDF()
df.registerTempTable("x")
you can query it like
sqlContext.sql("select name, metadata from x").show()
sqlContext.sql("select name, metadata.path from x").show()
sqlContext.sql("select name, metadata.path, metadata.size from x").show()
Update
You can read the files in HDFS using org.apache.hadoop.fs.FileSystem.listFiles() recursively.
Given a list of file names in a value files (standard Scala collection containing org.apache.hadoop.fs.LocatedFileStatus), you can create one RDD for each file:
val rdds = files.map { f =>
val md = InputSourceMetaData(f.getPath.toString, f.getLen)
sc.textFile(md.path).map {
l =>
val tokens = l.split(",")
PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
}
}
Now you can reduce the list of RDDs into a single one: The function for reduce concats all RDDs into a single one:
val rdd = rdds.reduce(_ ++ _)
rdd.collect().foreach(println)
This works, but I cannot test if this distributes/performs well with large files.

Spark: read csv file from s3 using scala

I am writing a spark job, trying to read a text file using scala, the following works fine on my local machine.
val myFile = "myLocalPath/myFile.csv"
for (line <- Source.fromFile(myFile).getLines()) {
val data = line.split(",")
myHashMap.put(data(0), data(1).toDouble)
}
Then I tried to make it work on AWS, I did the following, but it didn't seem to read the entire file properly. What should be the proper way to read such text file on s3? Thanks a lot!
val credentials = new BasicAWSCredentials("myKey", "mySecretKey");
val s3Client = new AmazonS3Client(credentials);
val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myFile.csv"));
val reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()));
var line = ""
while ((line = reader.readLine()) != null) {
val data = line.split(",")
myHashMap.put(data(0), data(1).toDouble)
println(line);
}
I think I got it work like below:
val s3Object= s3Client.getObject(new GetObjectRequest("myBucket", "myPath/myFile.csv"));
val myData = Source.fromInputStream(s3Object.getObjectContent()).getLines()
for (line <- myData) {
val data = line.split(",")
myMap.put(data(0), data(1).toDouble)
}
println(" my map : " + myMap.toString())
Read in csv-file with sc.textFile("s3://myBucket/myFile.csv"). That will give you an RDD[String]. Get that into a map
val myHashMap = data.collect
.map(line => {
val substrings = line.split(" ")
(substrings(0), substrings(1).toDouble)})
.toMap
You can the use sc.broadcast to broadcast your map, so that it is readily available on all your worker nodes.
(Note that you can of course also use the Databricks "spark-csv" package to read in the csv-file if you prefer.)
This can be acheived even withoutout importing amazons3 libraries using SparkContext textfile. Use the below code
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
val s3Login = "s3://AccessKey:Securitykey#Externalbucket"
val filePath = s3Login + "/Myfolder/myscv.csv"
for (line <- sc.textFile(filePath).collect())
{
var data = line.split(",")
var value1 = data(0)
var value2 = data(1).toDouble
}
In the above code, sc.textFile will read the data from your file and store in the line RDD. It then split each line with , to a different RDD data inside the loop. Then you can access values from this RDD with the index.

Scala/Spark save texts in program without save to files

My code will save val: s into result.txt
And then read the file again
I want to know is there a method My code can run directly without save to another file and read it back.
I user val textFile = sc.parallelize(s)
But the next part would have error: value contains is not a member of char
import java.io._
val s = (R.capture("lines"))
resultPath = /home/user
val pw = new PrintWriter(new File(f"$resultPath%s/result.txt"))
pw.write(s)
pw.close
//val textFile = sc.textFile(f"$resultPath%s/result.txt") old method:save into a file and read it back
val textFile = sc.parallelize(s)
val rows = textFile.map { line =>
!(line contains "[, 1]")
val fields = line.split("[^\\d.]+")
((fields(0), fields(1).toDouble))
}
I would have to say the problem you are having is that the variable s is a String data type and you are doing a parallelize on a String instead of a collection. So when you run the map function it is iterating over each character in the String.