I have a problem to solve where data comes in as JSON from Kinesis, like below:
{
  "datatype": "datatype_1",
  "id": "id_1",
  "data": {...}
}
Every record in the stream then needs to go through a lookup function, with datatype and id passed as arguments, to find the unique group of locations the record should be written to as JSON.
i.e.
def get_locations(id: String, datatype: String): Array[String] = //custom logic here
where the resultant array would look like
[ "s3:///example_bucket/example_folder_1", "s3:///example_bucket2/example_folder_2"]
My question is: how do I most efficiently group records coming off the stream by datatype and id, and write them to the various S3 locations? I was hoping to do something like below:
sparkSession.readStream.format("kinesis")
.option("streamName", kinesis_stream_name)
.option("initialPosition", "latest")
.option("region", aws_region)
.load()
//more transforms
.select(
col("datatype"),
col("id"),
col("data")
)
// Not sure how I can do what's below
// .write.partitionBy("id", "datatype")
// .format("json")
// .option("compression","gzip")
// .save(get_locations("id","datatype")) //saving to all locations in result array
I would advise creating the bucket from your code at runtime as a best practice; you can use the AWS S3 API for your runtime language (the Java/Scala SDK here, or e.g. the Node.js one).
As you said in your comment, you are getting the parameters at runtime.
However, as an answer to your question, here is a function that creates a bucket whose name contains the id (you can change it to whatever format you like); that bucket will then contain many files, one per partition of the DataFrame being saved:
import java.util
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.model.{AmazonS3Exception, Bucket}
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}

def get_locations(id: String, datatype: String): Bucket = {
//configure the default region to the appropriate region for your account
val s3: AmazonS3 = AmazonS3ClientBuilder.standard.withRegion(Regions.DEFAULT_REGION).build
object CreateBucket {
def getBucket(bucket_name: String): Bucket = {
var named_bucket = null.asInstanceOf[Bucket]
val buckets: util.List[Bucket] = s3.listBuckets
import scala.collection.JavaConversions._
for (b <- buckets) {
if (b.getName.equals(bucket_name)) named_bucket = b
}
named_bucket
}
def createBucket(bucket_name: String): Bucket = {
var b = null.asInstanceOf[Bucket]
if (s3.doesBucketExistV2(bucket_name)) {
System.out.format("Bucket %s already exists.\n", bucket_name)
b = getBucket(bucket_name)
}
else try b = s3.createBucket(bucket_name)
catch {
case e: AmazonS3Exception =>
System.err.println(e.getErrorMessage)
}
b
}
}
//change the bucket name format here if you like
val bucket_name = "bucket_" + id
//reuse the helper above instead of repeating the exists/create logic inline
CreateBucket.createBucket(bucket_name)
}
//I don't know how you will get those parameters
var id = " "
var datatype = " "
df.write.partitionBy("id", "datatype")
.format("json")
.option("compression", "gzip")
.save("s3://" + get_locations(id, datatype).getName) //use the bucket name to build the path; Bucket.toString is not a valid S3 URI
Don't forget to add the dependency in Maven or in build.sbt, matching the AWS SDK version you already use:
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-s3</artifactId>
<version>1.11.979</version>
</dependency>
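A side note that is not part of the original answer: the DataFrame in the question comes from readStream, and a plain df.write cannot be used on a streaming DataFrame. A common pattern for this kind of fan-out is writeStream.foreachBatch (Spark 2.4+), which hands you each micro-batch as a regular DataFrame. Below is a minimal sketch only, assuming the column names from the question, the get_locations lookup, and a streaming DataFrame I'll call parsed (the name and the checkpoint path are mine):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch only: `parsed` stands for the streaming DataFrame after the transforms in the
// question, with columns id, datatype and data; get_locations is the custom lookup.
def writeToLocations(batch: DataFrame, batchId: Long): Unit = {
  batch.persist() // the micro-batch is reused once per distinct (id, datatype) pair
  batch.select(col("id"), col("datatype")).distinct().collect().foreach { row =>
    val id = row.getString(0)
    val datatype = row.getString(1)
    val group = batch.filter(col("id") === id && col("datatype") === datatype)
    // write the group to every location returned by the lookup
    get_locations(id, datatype).foreach { location =>
      group.write.format("json").option("compression", "gzip").mode("append").save(location)
    }
  }
  batch.unpersist()
}

parsed.writeStream
  .foreachBatch(writeToLocations _)
  .option("checkpointLocation", "s3://example_bucket/checkpoints/") // hypothetical path
  .start()

This loops over the distinct keys of each micro-batch, so it works best when the number of (id, datatype) pairs per batch is small.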
I have a DataFrame in Spark and I need to process a particular column in that DataFrame using a REST API. The API does some transformation on a string and returns a result string. The API can process multiple strings at a time.
I can iterate over the column values of the DataFrame, collect n values in a batch, call the API, add the results back to the DataFrame, and continue with the next batch. But this seems like the naive way of doing it, without taking advantage of Spark.
Is there a better way to do this that takes advantage of the Spark SQL optimizer and Spark's parallel processing?
For Spark parallel processing you can use mapPartitions:
import scala.collection.mutable.ListBuffer
import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse //or org.json4s.native.JsonMethods.parse, whichever backend you use
import spark.implicits._

case class Input(col: String)
case class Output(col: String, new_col: String)

val data = spark.read.csv("/a/b/c").as[Input].repartition(n)
def declare(partitions: Iterator[Input]): Iterator[Output] = {
val url = ""
implicit val formats: DefaultFormats.type = DefaultFormats
val list = new ListBuffer[Output]()
//HttpClientAcceptSelfSignedCertificate is a custom helper; the factory method name below
//is assumed, so replace it with however you actually construct your client
val httpClient = HttpClientAcceptSelfSignedCertificate.createClient()
try {
while (partitions.hasNext) {
val x = partitions.next()
val col = x.col
val concat_url =""
val apiResp = HttpClientAcceptSelfSignedCertificate.call(httpClient, concat_url)
if (apiResp.isDefined) {
val json = parse(apiResp.get)
val new_col = (json \\ "value_to_take_from_api").children.head.values.toString
val output = Output(col,new_col)
list+=output
}
else {
val new_col = "Not Found"
val output = Output(col,new_col)
list+=output
}
}
} catch {
case e: Exception => println("api Exception with : " + e.getMessage)
}
finally {
HttpClientAcceptSelfSignedCertificate.close(httpClient)
}
list.iterator
}
val dd: Dataset[Output] = data.mapPartitions(x => declare(x))
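HttpClientAcceptSelfSignedCertificate above is a custom helper from the original answer and is not shown here. As a self-contained illustration of the same mapPartitions pattern (one HTTP client per partition, one request per row), here is a minimal sketch using Java 11's java.net.http client; the endpoint URL is hypothetical and the response body is used as-is rather than parsed:

import java.net.URI
import java.net.URLEncoder
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Sketch only: reuses the Input/Output case classes defined above; assumes Java 11+.
def callApi(partition: Iterator[Input]): Iterator[Output] = {
  // one client per partition, reused for every row in that partition
  val client = HttpClient.newHttpClient()
  partition.map { in =>
    val encoded = URLEncoder.encode(in.col, "UTF-8")
    val request = HttpRequest
      .newBuilder(URI.create(s"https://api.example.com/transform?value=$encoded")) // hypothetical endpoint
      .GET()
      .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    if (response.statusCode() == 200) Output(in.col, response.body()) // parse the body as needed
    else Output(in.col, "Not Found")
  }
}

// usage, mirroring the answer above:
// val dd: Dataset[Output] = data.mapPartitions(callApi)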
I'm implementing an application that fetches data from external systems via Apache Kafka.
This data is first mapped to objects and then passed to a ProcessWindowFunction over a TimeWindow (see the code below):
val busDataStream = env.addSource(kafkaConsumer)
.filter { _.nonEmpty}
.flatMap(line => JsonMethods.parse(line).toOption)
.map(_.extract[BusModel])
class CustomProcess() extends ProcessWindowFunction[BusModel, BusModel, String, TimeWindow] {
lazy val busState: ValueState[BusModel] = getRuntimeContext.getState(
new ValueStateDescriptor[BusModel]("BusModel state", classOf[BusModel])
)
override def process(key: String, context: Context, elements: Iterable[BusModel], out: Collector[BusModel]): Unit = {
for (e <- elements) {
if (busState.value() != null) {
out.collect(busState.value())
val result: Double = calculateSomething(e, busState.value())
}
busState.update(e)
println(s"BusState: ${busState.value()}")
}
}
}
val dataStream: DataStream[BusModel] = busDataStream
.keyBy(_.VehicleNumber)
.timeWindow(Time.seconds(10))
.process(new CustomProcess())
After the new information is prepared, I would like this data to be written to a Cassandra database. I tried to implement this with a connector, but unfortunately the new records don't show up in the database...
I also added a createTypeInformation call, which should map the fields of the selected object to the column types in the database, but that unfortunately didn't help.
createTypeInformation[(String, Double, Double, Double)]
val sinkStream = dataStream
.map(busRide => (
java.util.UUID.randomUUID.toString,
busRide.valueA,
busRide.valueB,
busRide.valueC,
))
CassandraSink.addSink(sinkStream)
.setQuery("INSERT INTO transport.bus_flink_speed(" +
"\"FirstColumn\", " +
"\"SecondColumn " +
"\"ThirdColumn\", " +
"\"ForthColumn\")" +
" values (?, ?, ?, ?);")
.setHost("localhost")
.build()
env.execute("Flink Kafka Example")
Does anyone have any idea why this doesn't work?
I'm checking out Deequ, which seems like a really nice library. I was wondering if it is possible to load constraints from a CSV file or an ORC table in HDFS?
Let's say I have a table with these types:
case class Item(
id: Long,
productName: String,
description: String,
priority: String,
numViews: Long
)
and I want to put constraints like:
val checks = Check(CheckLevel.Error, "unit testing my data")
.isComplete("id") // should never be NULL
.isUnique("id") // should not contain duplicates
But I want to load the .isComplete("id") and .isUnique("id") parts from a CSV file, so the business can add the constraints and we can run the tests based on their input:
val verificationResult = VerificationSuite()
.onData(data)
.addChecks(Seq(checks))
.run()
I've managed to get the constraints from suggestionResult.constraintSuggestions:
val allConstraints = suggestionResult.constraintSuggestions
.flatMap { case (_, suggestions) => suggestions.map { _.constraint }}
.toSeq
which gives a List like for example:
allConstraints = List(CompletenessConstraint(Completeness(id,None)), ComplianceConstraint(Compliance('id' has no negative values,id >= 0,None)))
But that list is generated from suggestionResult.constraintSuggestions, and I want to be able to create a list like that based on the inputs from a CSV file. Can anyone help me?
To sum things up:
Basically I just want to add:
val checks = Check(CheckLevel.Error, "unit testing my data")
.isComplete("columnName1")
.isUnique("columnName1")
.isComplete("columnName2")
dynamically, based on a file that has, for example:
columnName;isUnique;isComplete (header)
columnName1;true;true
columnName2;false;true
I chose to store the CSV in src/main/resources as it's very easy to read from there, and easy to maintain in parallel with the code being QA'ed.
def readCSV(spark: SparkSession, filename: String): DataFrame = {
import spark.implicits._
val inputFileStream = Try {
this.getClass.getResourceAsStream("/" + filename)
}
.getOrElse(
throw new Exception("Cannot find " + filename + " in src/main/resources")
)
val readlines =
scala.io.Source.fromInputStream(inputFileStream).getLines.toList
val csvData: Dataset[String] =
spark.sparkContext.parallelize(readlines).toDS
spark.read.option("header", true).option("inferSchema", true).csv(csvData)
}
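For example, assuming the constraints file is called constraints.csv (the name is just for illustration) and sits in src/main/resources:

val constraintsDF = readCSV(spark, "constraints.csv")
constraintsDF.show(false)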
This loads it as a DataFrame; this can easily be passed to code like gavincruick's example on GitHub, copied here for convenience:
//code to build verifier from DF that has a 'Constraint' column
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox //toolbox compilation needs scala-compiler on the classpath
import scala.util.Try
import com.amazon.deequ.VerificationResult
import org.apache.spark.sql.DataFrame

type Verifier = DataFrame => VerificationResult
def generateVerifier(df: DataFrame, columnName: String): Try[Verifier] = {
val constraintCheckCodes: Seq[String] = df.select(columnName).collect().map(_(0).toString).toSeq
def checkSrcCode(checkCodeMethod: String, id: Int): String = s"""com.amazon.deequ.checks.Check(com.amazon.deequ.checks.CheckLevel.Error, "$id")$checkCodeMethod"""
val verifierSrcCode = s"""{
|import com.amazon.deequ.constraints.ConstrainableDataTypes
|import com.amazon.deequ.{VerificationResult, VerificationSuite}
|import org.apache.spark.sql.DataFrame
|
|val checks = Seq(
| ${constraintCheckCodes.zipWithIndex
.map { (checkSrcCode _).tupled }
.mkString(",\n ")}
|)
|
|(data: DataFrame) => VerificationSuite().onData(data).addChecks(checks).run()
|}
""".stripMargin.trim
println(s"Verification function source code:\n$verifierSrcCode\n")
compile[Verifier](verifierSrcCode)
}
/** Compiles the scala source code that, when evaluated, produces a value of type T. */
def compile[T](source: String): Try[T] =
Try {
val toolbox = currentMirror.mkToolBox()
val tree = toolbox.parse(source)
val compiledCode = toolbox.compile(tree)
compiledCode().asInstanceOf[T]
}
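As a quick sanity check of the compile helper (illustration only, not part of the original example):

// compiles and evaluates a trivial expression; handy for verifying the toolbox setup
val sanityCheck: Try[Int] = compile[Int]("40 + 2")
println(sanityCheck) // Success(42)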
//example usage...
//sample test data
val testDataDF = Seq(
("2020-02-12", "England", "E10000034", "Worcestershire", 1),
("2020-02-12", "Wales", "W11000024", "Powys", 0),
("2020-02-12", "Wales", null, "Unknown", 1),
("2020-02-12", "Canada", "MADEUP", "Ontario", 1)
).toDF("Date", "Country", "AreaCode", "Area", "TotalCases")
//constraints in a DF
val constraintsDF = Seq(
(".isComplete(\"Area\")"),
(".isComplete(\"Country\")"),
(".isComplete(\"TotalCases\")"),
(".isComplete(\"Date\")"),
(".hasCompleteness(\"AreaCode\", _ >= 0.80, Some(\"It should be above 0.80!\"))"),
(".isContainedIn(\"Country\", Array(\"England\", \"Scotland\", \"Wales\", \"Northern Ireland\"))")
).toDF("Constraint")
//Build Verifier from constraints DF
val verifier = generateVerifier(constraintsDF, "Constraint").get
//Run verifier against a sample DF
val result = verifier(testDataDF)
//display results
VerificationResult.checkResultsAsDataFrame(spark, result).show()
It depends on how complicated you want to allow the constraints to be. In general, deequ allows you to use arbitrary Scala code for the validation function of a constraint, so it's difficult (and dangerous from a security perspective) to load that from a file.
I think you would have to come up with your own schema and semantics for the CSV file; this is not directly supported in deequ.
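For the simple columnName;isUnique;isComplete format from the question, a minimal sketch of such custom semantics could fold the parsed rows into a single Check without generating and compiling source code. This assumes a DataFrame with those three columns (booleans for the flags), e.g. the CSV loaded with the readCSV helper above:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}
import org.apache.spark.sql.DataFrame

// Sketch only: expects columns columnName, isUnique, isComplete in the constraints DataFrame.
def checksFromDF(constraintsDF: DataFrame): Check =
  constraintsDF.collect().foldLeft(Check(CheckLevel.Error, "checks loaded from CSV")) {
    (check, row) =>
      val column = row.getAs[String]("columnName")
      val withUnique = if (row.getAs[Boolean]("isUnique")) check.isUnique(column) else check
      if (row.getAs[Boolean]("isComplete")) withUnique.isComplete(column) else withUnique
  }

// usage, mirroring the question:
// val checks = checksFromDF(readCSV(spark, "constraints.csv"))
// val verificationResult = VerificationSuite().onData(data).addChecks(Seq(checks)).run()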
I ran the following code, and it successfully prints the URL returned by the s3Service.createUnsignedObjectUrl method. My question is: how does that value even get returned and stored in the linkForText variable? I read that Scala functions usually need something like a ": Int" return type and a "return" statement inside the defined function, but I see none of that here. How is the store function able to do this?
package com.justthor
import org.jets3t.service.impl.rest.httpclient.RestS3Service
import org.jets3t.service.security.AWSCredentials
import org.jets3t.service.model.S3Object
import org.jets3t.service.acl.{ AccessControlList, GroupGrantee, Permission }
import java.io.InputStream
object Main extends App{
val classPath = "/"
// upload a simple text file
val textFilename = "test.txt"
val linkForText = store(textFilename, getClass.getResourceAsStream(s"$classPath$textFilename"))
// upload a cat image, taken from http://imgur.com/gallery/bTiwg
// set the content type to "image/jpg"
val imageFilename = "cat.jpg"
val linkForImage = store(imageFilename, getClass.getResourceAsStream(s"$classPath$imageFilename"), "image/jpg")
println(s"Url for the text file is $linkForText")
println(s"Url for the cat image is $linkForImage")
def store(key: String, inputStream: InputStream, contentType: String = "text/plain") = {
val awsAccessKey = "YOUR_ACCESS_KEY"
val awsSecretKey = "YOUR_SECRET_KEY"
val awsCredentials = new AWSCredentials(awsAccessKey, awsSecretKey)
val s3Service = new RestS3Service(awsCredentials)
val bucketName = "test-scala-upload"
val bucket = s3Service.getOrCreateBucket(bucketName)
val fileObject = s3Service.putObject(bucket, {
// public access is disabled by default, so we have to manually set the permission to allow read access to the uploaded file
val acl = s3Service.getBucketAcl(bucket)
acl.grantPermission(GroupGrantee.ALL_USERS, Permission.PERMISSION_READ)
val tempObj = new S3Object(key)
tempObj.setDataInputStream(inputStream)
tempObj.setAcl(acl)
tempObj.setContentType(contentType)
tempObj
})
s3Service.createUnsignedObjectUrl(bucketName,
fileObject.getKey,
false, false, false)
}
}
Type inference and Scala idioms.
The return type is inferred from the return value, which is the result of the final expression in a method/function.
So, in your case, whatever is returned by the last expression, s3Service.createUnsignedObjectUrl(...), is the value returned from store. As there is no branching, the return type is inferred from that value. If there were branching, inference would take the closest common supertype (the least upper bound) of the possible return types.
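As a small illustration (not from the original post), the following methods are equivalent in effect; the first relies on inference and on the last expression being the return value, the second spells everything out:

// return type (Int) and return value are inferred from the last expression
def double(x: Int) = {
  val result = x * 2
  result // last expression in the block => this is what the method returns
}

// fully explicit version; the annotation and `return` are rarely needed in Scala
def doubleExplicit(x: Int): Int = {
  val result = x * 2
  return result
}

// with branching, the inferred type is the least upper bound of the branches
def describe(x: Int) = if (x > 0) "positive" else 0 // inferred as Any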
I have a table in HBase named "orders". It has a column family 'o' and columns {id, fname, lname, email}, with id as the row key. I am trying to get only the values of fname and email from HBase using Spark. What I am currently doing is given below:
override def put(params: scala.collection.Map[String, Any]): Boolean = {
var sparkConfig = new SparkConf().setAppName("Connector")
var sc: SparkContext = new SparkContext(sparkConfig)
var hbaseConfig = HBaseConfiguration.create()
hbaseConfig.set("hbase.zookeeper.quorum", ZookeeperQourum)
hbaseConfig.set("hbase.zookeeper.property.clientPort", zookeeperPort)
hbaseConfig.set(TableInputFormat.INPUT_TABLE, schemdto.tableName);
hbaseConfig.set(TableInputFormat.SCAN_COLUMNS, "o:fname,o:email");
var hBaseRDD = sc.newAPIHadoopRDD(hbaseConfig, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
try {
hBaseRDD.map(tuple => tuple._2).map(result => result.raw())
.map(f => KeyValueToString(f)).saveAsTextFile(sink)
true
} catch {
case _: Exception => false
}
}
def KeyValueToString(keyValues: Array[KeyValue]): String = {
var it = keyValues.iterator
var res = new StringBuilder
while (it.hasNext) {
res.append( Bytes.toString(it.next.getValue()) + ",")
}
res.substring(0, res.length-1);
}
But nothing is returned. If I try to fetch only one column, such as
hbaseConfig.set(TableInputFormat.SCAN_COLUMNS, "o:fname");
then it returns all the values of the fname column.
So my question is: how do I get multiple columns from HBase using Spark?
Any help will be appreciated.
The list of columns to scan needs to be space-delimited, according to the documentation.
hbaseConfig.set(TableInputFormat.SCAN_COLUMNS, "o:fname o:email");
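Once both columns come back, you can also pull them out of each Result explicitly instead of concatenating every KeyValue; a small sketch using the family and column names from the question:

import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes

// extract the two requested columns from each Result; a missing cell comes back as null
def resultToCsv(result: Result): String = {
  val fname = Bytes.toString(result.getValue(Bytes.toBytes("o"), Bytes.toBytes("fname")))
  val email = Bytes.toString(result.getValue(Bytes.toBytes("o"), Bytes.toBytes("email")))
  s"$fname,$email"
}

// usage, mirroring the code in the question:
// hBaseRDD.map(_._2).map(resultToCsv).saveAsTextFile(sink)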