How to recursively build up a JSON lens - Scala

I'm using json-lenses and trying to recursively search for a branch of a JSON object, something akin to:
import spray.json._
import spray.json.lenses.JsonLenses._
import DefaultJsonProtocol._
import scala.annotation.tailrec

@tailrec
def getLens(json: JsValue, lensAcc: Lens = Nil, depth: Int = 0): Lens = {
  if (depth > 10) {
    throw new IllegalStateException("Recursive search for lens exhausted.")
  }
  val thisLens = lensAcc / 'rec / element(0)
  json.extract[JsObject](thisLens / 'properties.?) match {
    case None => getLens(json, thisLens, depth + 1)
    case _ => thisLens
  }
}

val json = """{"rec": [{"rec": [{"properties": {}}]}]}""".parseJson.asJsObject
val myLens = getLens(json)
which isn't quite right. I'm relatively new to Scala and can't figure out how to fix this.

This should work:
def nextLens = 'rec / element(0)

@scala.annotation.tailrec
final def getLens(json: JsObject, lensAcc: Lens[Id] = nextLens, depth: Int = 0): Lens[Id] = {
  if (depth > 10) {
    throw new IllegalStateException("Recursive search for lens exhausted.")
  }
  json.extract[JsObject](lensAcc / 'properties.?) match {
    case None => getLens(json, lensAcc / nextLens, depth + 1)
    case _ => lensAcc
  }
}
Test:
scala> val json = JsonParser("""{"rec": [{"rec": [{"properties": {}}]}]}""").asJsObject
json: spray.json.JsObject = {"rec":[{"rec":[{"properties":{}}]}]}
scala> getLens(json)
res6: spray.json.lenses.Lens[spray.json.lenses.Id] = spray.json.lenses.JsonLenses$$anon$1@448cd79c
scala> json.extract[JsObject](res6)
res9: spray.json.JsObject = {"properties":{}}
I don't know if there is something like an empty lens...
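If an explicit "empty lens" is wanted, one workaround (a sketch only, untested, reusing the pieces above) is to carry the accumulator as an Option[Lens[Id]], so that "no lens yet" has an explicit representation:

// Sketch: Option[Lens[Id]] stands in for the missing "empty lens".
def getLens(json: JsObject, lensAcc: Option[Lens[Id]] = None, depth: Int = 0): Lens[Id] = {
  if (depth > 10) {
    throw new IllegalStateException("Recursive search for lens exhausted.")
  }
  // compose onto the accumulated lens if there is one, otherwise start a fresh one
  val thisLens = lensAcc.fold[Lens[Id]]('rec / element(0))(_ / 'rec / element(0))
  json.extract[JsObject](thisLens / 'properties.?) match {
    case None => getLens(json, Some(thisLens), depth + 1)
    case _ => thisLens
  }
}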

List all objects in S3 with a given prefix in Scala

I am trying to list all objects in an AWS S3 bucket, given an input bucket name and filter prefix, using the following code.
import scala.collection.JavaConverters._
import scala.collection.mutable
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ListObjectsV2Request, ListObjectsV2Result}

val bucket_name = "Mybucket"
val filter_prefix = "Test/a/"
val s3_client = new AmazonS3Client

def list_objects(str: String): mutable.Buffer[String] = {
  val request: ListObjectsV2Request = new ListObjectsV2Request().withBucketName(bucket_name).withPrefix(str)
  var result: ListObjectsV2Result = new ListObjectsV2Result()
  do {
    result = s3_client.listObjectsV2(request)
    val token = result.getNextContinuationToken
    System.out.println("Next Continuation Token: " + token)
    request.setContinuationToken(token)
  } while (result.isTruncated)
  result.getObjectSummaries.asScala.map(_.getKey)
}

list_objects(filter_prefix)
I have applied the continuation-token method, but I am only getting the last list of objects. For example, if the prefix has 2210 objects, I get back only 210 objects.
Regards
Mahi
listObjectsV2 returns some or all (up to 1,000) of the objects in a bucket, as stated in the documentation. You need to use the continuation token to iterate over the rest of the objects in the bucket.
There is example code for Java here.
This is the code that worked for me.
import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ListObjectsV2Request, ListObjectsV2Result}

val bucket_name = "Mybucket"
val filter_prefix = "Test/a/"

def list_objects(str: String): List[String] = {
  val s3_client = new AmazonS3Client
  var final_list: List[String] = List()
  var list: List[String] = List()
  val request: ListObjectsV2Request = new ListObjectsV2Request().withBucketName(bucket_name).withPrefix(str)
  var result: ListObjectsV2Result = new ListObjectsV2Result()
  do {
    result = s3_client.listObjectsV2(request)
    val token = result.getNextContinuationToken
    System.out.println("Next Continuation Token: " + token)
    request.setContinuationToken(token)
    list = result.getObjectSummaries.asScala.map(_.getKey).toList
    println(list.size)
    final_list = final_list ::: list
    println(final_list)
  } while (result.isTruncated)
  println("size", final_list.size)
  final_list
}

list_objects(filter_prefix)
A solution in vanilla Scala that avoids vars by using tail recursion:
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{ListObjectsV2Request, ListObjectsV2Response}
import scala.annotation.tailrec
import scala.collection.JavaConverters.asScalaBufferConverter
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

val sourceBucket = "yourbucket"
val sourceKey = "yourKey"
val subFolderPrefix = "yourprefix"

def getAllPaths(s3Client: S3Client, initReq: ListObjectsV2Request): List[String] = {
  @tailrec
  def listAllObjectsV2(
      s3Client: S3Client,
      req: ListObjectsV2Request,
      tokenOpt: Option[String],
      isFirstTime: Boolean,
      initList: ListBuffer[String]
  ): ListBuffer[String] = {
    println(s"IsFirstTime = ${isFirstTime}, continuationToken = ${tokenOpt}")
    (isFirstTime, tokenOpt) match {
      case (true, Some(_)) =>
        // this combination is not possible
        initList
      case (false, None) =>
        // no continuation token left: end of listing
        initList
      case (_, _) =>
        // remaining scenarios:
        // (true, None): first iteration
        // (false, Some(token)): second iteration onwards
        val response =
          s3Client.listObjectsV2(tokenOpt.fold(req)(token => req.toBuilder.continuationToken(token).build()))
        val keys: Seq[String] = response.contents().asScala.toList.map(_.key())
        val nextTokenOpt = Option(response.nextContinuationToken())
        listAllObjectsV2(s3Client, req, nextTokenOpt, isFirstTime = false, keys ++: initList)
    }
  }
  listAllObjectsV2(s3Client, initReq, None, true, mutable.ListBuffer.empty[String]).toList
}

val s3Client = S3Client.builder().region(Region.US_WEST_2).build()
val request: ListObjectsV2Request =
  ListObjectsV2Request.builder
    .bucket(sourceBucket)
    .prefix(sourceKey + "/" + subFolderPrefix)
    .build

val listofAllKeys: List[String] = getAllPaths(s3Client, request)

Spark doesn't conform to expected type TraversableOnce

val num_idf_pairs = rescaledData.select("item", "features")
  .rdd.map(x => (x(0), x(1)))

val itemRdd = rescaledData.select("item", "features").where("item = 1")
  .rdd.map(x => (x(0), x(1)))

val b_num_idf_pairs = sparkSession.sparkContext.broadcast(num_idf_pairs.collect())

val sims = num_idf_pairs.flatMap {
  case (key, value) =>
    val sv1 = value.asInstanceOf[SV]
    import breeze.linalg._
    val valuesVector = new SparseVector[Double](sv1.indices, sv1.values, sv1.size)
    itemRdd.map {
      case (id2, idf2) =>
        val sv2 = idf2.asInstanceOf[SV]
        val xVector = new SparseVector[Double](sv2.indices, sv2.values, sv2.size)
        val sim = valuesVector.dot(xVector) / (norm(valuesVector) * norm(xVector))
        (id2.toString, key.toString, sim)
    }
}
The error is "doesn't conform to expected type TraversableOnce".
When I modify it as follows:
val b_num_idf_pairs = sparkSession.sparkContext.broadcast(num_idf_pairs.collect())

val docSims = num_idf_pairs.flatMap {
  case (id1, idf1) =>
    val idfs = b_num_idf_pairs.value.filter(_._1 != id1)
    val sv1 = idf1.asInstanceOf[SV]
    import breeze.linalg._
    val bsv1 = new SparseVector[Double](sv1.indices, sv1.values, sv1.size)
    idfs.map {
      case (id2, idf2) =>
        val sv2 = idf2.asInstanceOf[SV]
        val bsv2 = new SparseVector[Double](sv2.indices, sv2.values, sv2.size)
        val cosSim = bsv1.dot(bsv2).asInstanceOf[Double] / (norm(bsv1) * norm(bsv2))
        (id1.toString(), id2.toString(), cosSim)
    }
}
it compiles, but it causes an OutOfMemoryError. I set --executor-memory 4G.
The first snippet:
num_idf_pairs.flatMap {
  ...
  itemRdd.map { ... }
}
is not only invalid Spark code (nested transformations are not allowed) but also, as you already know, won't type check, because an RDD is not a TraversableOnce.
The second snippet likely fails because the data you are trying to collect and broadcast is too large.
It looks like you are trying to compute all pairwise item similarities, so you'll need a Cartesian product. Structure your code roughly like this:
num_idf_pairs
  .cartesian(itemRdd)
  .filter { case ((id1, idf1), (id2, idf2)) => id1 != id2 }
  .map { case ((id1, idf1), (id2, idf2)) =>
    val cosSim = ??? // Compute similarity
    (id1.toString, id2.toString, cosSim)
  }
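For completeness, here is a hedged sketch of how the similarity could be filled in, assuming (as in the question) that the vectors are Spark MLlib SparseVectors aliased as SV and that Breeze provides dot and norm; this is illustrative, not tested code:

import breeze.linalg.{norm, SparseVector => BSV}
import org.apache.spark.mllib.linalg.{SparseVector => SV}

val sims = num_idf_pairs
  .cartesian(itemRdd)
  .filter { case ((id1, _), (id2, _)) => id1 != id2 }
  .map { case ((id1, idf1), (id2, idf2)) =>
    // convert both MLlib vectors to Breeze sparse vectors
    val sv1 = idf1.asInstanceOf[SV]
    val sv2 = idf2.asInstanceOf[SV]
    val bsv1 = new BSV[Double](sv1.indices, sv1.values, sv1.size)
    val bsv2 = new BSV[Double](sv2.indices, sv2.values, sv2.size)
    // cosine similarity = dot product divided by the product of the norms
    val cosSim = bsv1.dot(bsv2) / (norm(bsv1) * norm(bsv2))
    (id1.toString, id2.toString, cosSim)
  }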

Avoid duplicated computation in guards

Here is the code:
import java.util.{Calendar, Date, GregorianCalendar}
import com.mongodb.casbah.Imports._
import com.mongodb.casbah.commons.conversions.scala._

case class Quota(date: Date, used: Int)

object MongoDateDemo extends App {
  val client = InsertUsers.getClient
  val db = client("github")
  val quota = db("quota")
  val rand = scala.util.Random

  // quota.drop()
  // (1 to 100).foreach { _ =>
  //   quota += DBObject("date" -> new Date(), "used" -> rand.nextInt(10))
  //   Thread.sleep(1000)
  // }

  val minuteInMilliseconds = 60 * 1000
  def thresholdDate(minute: Int) = new Date(new Date().getTime - minuteInMilliseconds * minute) // `minute` minutes ago

  val fields = DBObject("_id" -> 0, "used" -> 1)
  val x = quota.find("date" $gte thresholdDate(28), fields).collect {
    case x if x.getAs[Int]("used").isDefined => x.getAs[Int]("used").get
  }
  println(x.toList.sum)

  // val y = x.map {
  //   case dbo: DBObject => Quota(dbo.getAs[Date]("date").getOrElse(new Date(0)), dbo.getAs[Int]("used").getOrElse(0))
  // }
}
It reads documents from a collection, filters out those that don't have "used" defined, and then sums up the numbers.
The x.getAs[Int]("used") call is computed twice; how can I avoid that?
Not much of a Scala programmer, but isn't that what flatMap is for?
quota
  .find("date" $gte thresholdDate(38), fields)
  .flatMap(_.getAs[Int]("used").toList)
Since this is not possible to avoid, I had to do it in two steps: map into Options, then collect. I used the view method so that the collection is not traversed twice:
import java.util.{Calendar, Date, GregorianCalendar}
import com.mongodb.casbah.Imports._
import com.mongodb.casbah.commons.conversions.scala._

case class Quota(date: Date, used: Int)

object MongoDateDemo extends App {
  val client = InsertUsers.getClient
  val db = client("github")
  val quota = db("quota")
  val rand = scala.util.Random

  // quota.drop()
  // (1 to 100).foreach { _ =>
  //   quota += DBObject("date" -> new Date(), "used" -> rand.nextInt(10))
  //   Thread.sleep(1000)
  // }

  val minuteInMilliseconds = 60 * 1000
  def thresholdDate(minute: Int) = new Date(new Date().getTime - minuteInMilliseconds * minute) // `minute` minutes ago

  val fields = DBObject("_id" -> 0, "used" -> 1)
  val usedNumbers = quota.find("date" $gte thresholdDate(38), fields).toList.view.map {
    _.getAs[Int]("used")
  }.collect {
    case Some(i) => i
  }.force

  println(usedNumbers.sum)

  // val y = x.map {
  //   case dbo: DBObject => Quota(dbo.getAs[Date]("date").getOrElse(new Date(0)), dbo.getAs[Int]("used").getOrElse(0))
  // }
}
Assuming getAs returns an Option, this should do what you want:
val x = quota.find("date" $gte thresholdDate(28), fields).flatMap { _.getAs[Int]("used") }
This is similar to doing this:
scala> List(Some(1), Some(2), None, Some(4)).flatMap(x => x)
res: List[Int] = List(1, 2, 4)
Or this:
scala> (1 to 20).flatMap(x => if(x%2 == 0) Some(x) else None)
res: scala.collection.immutable.IndexedSeq[Int] = Vector(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
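Applied to the original code, the find/collect/sum then collapses to something like this (a sketch reusing the question's quota, thresholdDate and fields):

val total = quota
  .find("date" $gte thresholdDate(28), fields)
  .flatMap(_.getAs[Int]("used"))   // keeps only documents where "used" is defined
  .sum
println(total)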

SparkContext cannot be launched in the same program with a StreamingContext

I created the following test that fits a simple linear regression model to dummy streaming data.
I use hyper-parameter optimisation to find good values of stepSize, numIterations and initialWeights for the linear model.
Everything runs fine, except for the last lines of the code, which are commented out:
// Save the evaluations for further visualization
// val gridEvalsRDD = sc.parallelize(gridEvals)
// gridEvalsRDD.coalesce(1)
// .map(e => "%.3f\t%.3f\t%d\t%.3f".format(e._1, e._2, e._3, e._4))
// .saveAsTextFile("data/mllib/streaming")
The problem is with the SparkContext sc. If I initialize it at the beginning of the test, the program throws errors. It looks like sc should be defined in some special way in order to avoid conflicts with ssc (the streaming Spark context). Any ideas?
The whole code:
// scalastyle:off
package org.apache.spark.mllib.regression

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.util.LinearDataGenerator
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{StreamingContext, TestSuiteBase}
import org.scalatest.BeforeAndAfter

class StreamingLinearRegressionHypeOpt extends TestSuiteBase with BeforeAndAfter {

  // use longer wait time to ensure job completion
  override def maxWaitTimeMillis: Int = 20000

  var ssc: StreamingContext = _

  override def afterFunction() {
    super.afterFunction()
    if (ssc != null) {
      ssc.stop()
    }
  }

  def calculateMSE(output: Seq[Seq[(Double, Double)]], n: Int): Double = {
    val mse = output
      .map {
        case seqOfPairs: Seq[(Double, Double)] =>
          val err = seqOfPairs.map(p => math.abs(p._1 - p._2)).sum
          err * err
      }.sum / n
    mse
  }

  def calculateRMSE(output: Seq[Seq[(Double, Double)]], n: Int): Double = {
    val mse = output
      .map {
        case seqOfPairs: Seq[(Double, Double)] =>
          val err = seqOfPairs.map(p => math.abs(p._1 - p._2)).sum
          err * err
      }.sum / n
    math.sqrt(mse)
  }

  def dummyStringStreamSplit(datastream: Stream[String]) =
    datastream.flatMap(txt => txt.split(" "))

  test("Test 1") {
    // create model initialized with zero weights
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.dense(0.0, 0.0))
      .setStepSize(0.2)
      .setNumIterations(25)

    // generate sequence of simulated data for testing
    val numBatches = 10
    val nPoints = 100
    val inputData = (0 until numBatches).map { i =>
      LinearDataGenerator.generateLinearInput(0.0, Array(10.0, 10.0), nPoints, 42 * (i + 1))
    }

    // Without hyper-parameters optimization
    withStreamingContext(setupStreams(inputData, (inputDStream: DStream[LabeledPoint]) => {
      model.trainOn(inputDStream)
      model.predictOnValues(inputDStream.map(x => (x.label, x.features)))
    })) { ssc =>
      val output: Seq[Seq[(Double, Double)]] = runStreams(ssc, numBatches, numBatches)
      val rmse = calculateRMSE(output, nPoints)
      println(s"RMSE = $rmse")
    }

    // With hyper-parameters optimization
    val gridParams = Map(
      "initialWeights" -> List(Vectors.dense(0.0, 0.0), Vectors.dense(10.0, 10.0)),
      "stepSize" -> List(0.1, 0.2, 0.3),
      "numIterations" -> List(25, 50)
    )

    val gridEvals = for (initialWeights <- gridParams("initialWeights");
                         stepSize <- gridParams("stepSize");
                         numIterations <- gridParams("numIterations")) yield {
      val lr = new StreamingLinearRegressionWithSGD()
        .setInitialWeights(initialWeights.asInstanceOf[Vector])
        .setStepSize(stepSize.asInstanceOf[Double])
        .setNumIterations(numIterations.asInstanceOf[Int])

      withStreamingContext(setupStreams(inputData, (inputDStream: DStream[LabeledPoint]) => {
        lr.trainOn(inputDStream)
        lr.predictOnValues(inputDStream.map(x => (x.label, x.features)))
      })) { ssc =>
        val output: Seq[Seq[(Double, Double)]] = runStreams(ssc, numBatches, numBatches)
        val cvRMSE = calculateRMSE(output, nPoints)
        println(s"RMSE = $cvRMSE")
        (initialWeights, stepSize, numIterations, cvRMSE)
      }
    }

    // Save the evaluations for further visualization
    // val gridEvalsRDD = sc.parallelize(gridEvals)
    // gridEvalsRDD.coalesce(1)
    //   .map(e => "%.3f\t%.3f\t%d\t%.3f".format(e._1, e._2, e._3, e._4))
    //   .saveAsTextFile("data/mllib/streaming")
  }
}
// scalastyle:on
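Regarding the commented-out lines and the sc/ssc conflict: outside of this test harness, the usual pattern (shown here only as a hedged, illustrative sketch with made-up names) is to create a single SparkContext and build the StreamingContext from it, or to recover the batch context via ssc.sparkContext, so that two independent contexts never coexist:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One SparkContext; the StreamingContext is derived from it,
// so both share the same underlying context.
val conf = new SparkConf().setAppName("streaming-lr-demo").setMaster("local[2]")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(1))

// The batch context stays available for ordinary RDD work,
// e.g. saving the grid-search evaluations as in the commented-out code:
// val gridEvalsRDD = ssc.sparkContext.parallelize(gridEvals)
// gridEvalsRDD.coalesce(1)
//   .map(e => "%.3f\t%.3f\t%d\t%.3f".format(e._1, e._2, e._3, e._4))
//   .saveAsTextFile("data/mllib/streaming")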

Chisel: Access to Module Parameters from Tester

How does one access the parameters used to construct a Module from inside the Tester that is testing it?
In the test below I am passing the parameters explicitly both to the Module and to the Tester. I would prefer not to have to pass them to the Tester but instead extract them from the module that was also passed in.
Also, I am new to Scala/Chisel, so any tips on bad techniques I'm using would be appreciated :)
import Chisel._
import math.pow

class TestA(dataWidth: Int, arrayLength: Int) extends Module {
  val dataType = Bits(INPUT, width = dataWidth)
  val arrayType = Vec(gen = dataType, n = arrayLength)
  val io = new Bundle {
    val i_valid = Bool(INPUT)
    val i_data = dataType
    val i_array = arrayType
    val o_valid = Bool(OUTPUT)
    val o_data = dataType.flip
    val o_array = arrayType.flip
  }
  io.o_valid := io.i_valid
  io.o_data := io.i_data
  io.o_array := io.i_array
}

class TestATests(c: TestA, dataWidth: Int, arrayLength: Int) extends Tester(c) {
  val maxData = pow(2, dataWidth).toInt
  for (t <- 0 until 16) {
    val i_valid = rnd.nextInt(2)
    val i_data = rnd.nextInt(maxData)
    val i_array = List.fill(arrayLength)(rnd.nextInt(maxData))
    poke(c.io.i_valid, i_valid)
    poke(c.io.i_data, i_data)
    (c.io.i_array, i_array).zipped foreach {
      (element, value) => poke(element, value)
    }
    expect(c.io.o_valid, i_valid)
    expect(c.io.o_data, i_data)
    (c.io.o_array, i_array).zipped foreach {
      (element, value) => expect(element, value)
    }
    step(1)
  }
}

object TestAObject {
  def main(args: Array[String]): Unit = {
    val tutArgs = args.slice(0, args.length)
    val dataWidth = 5
    val arrayLength = 6
    chiselMainTest(tutArgs, () => Module(
      new TestA(dataWidth = dataWidth, arrayLength = arrayLength))) {
      c => new TestATests(c, dataWidth = dataWidth, arrayLength = arrayLength)
    }
  }
}
If you make the arguments dataWidth and arrayLength members of TestA, you can just reference them. In Scala this can be accomplished by inserting val into the argument list:
class TestA(val dataWidth: Int, val arrayLength: Int) extends Module ...
Then you can reference them from the test as members, e.g. c.dataWidth or c.arrayLength.
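As a minimal sketch (assuming the imports and the TestA definition above, with val added to its constructor arguments), the Tester can then drop its extra parameters and read them from the module under test:

class TestATests(c: TestA) extends Tester(c) {
  // the parameters now come from the module itself
  val maxData = pow(2, c.dataWidth).toInt
  for (t <- 0 until 16) {
    val i_valid = rnd.nextInt(2)
    val i_data = rnd.nextInt(maxData)
    val i_array = List.fill(c.arrayLength)(rnd.nextInt(maxData))
    poke(c.io.i_valid, i_valid)
    poke(c.io.i_data, i_data)
    (c.io.i_array, i_array).zipped foreach { (element, value) => poke(element, value) }
    expect(c.io.o_valid, i_valid)
    expect(c.io.o_data, i_data)
    (c.io.o_array, i_array).zipped foreach { (element, value) => expect(element, value) }
    step(1)
  }
}

// chiselMainTest no longer needs to forward the parameters:
// chiselMainTest(tutArgs, () => Module(new TestA(dataWidth = 5, arrayLength = 6))) {
//   c => new TestATests(c)
// }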