How to pull data from S3 using Spark - Scala

I have a bunch of CSV files containing time- and space-dependent data in an AWS S3 bucket. The files are prefixed with timestamps at 5-minute granularity.
When I try to access them from AWS EMR with Apache Spark and filter them by both time and space, even beefy clusters (5 x r3.8xlarge) crash. I'm trying to do the filtering with a broadcast join.
Location is a class with a user id, a timestamp and mobile cell information, which I'm trying to join with the cell position information (segmentDF) to filter only those records that are required.
These records need further processing; here I just try to save them as Parquet. I feel there must be a more efficient way of doing this, starting from how the data is stored in the S3 bucket. Any ideas are appreciated.
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 suggests an alternative and faster way of accessing S3 buckets from Spark, which I could not get to work (see below for the code and error report).
// Scala code for the filtering
// requires import spark.implicits._ for toDF and the $"..." column syntax
val locationDF = sc.textFile("bucket/location_files/201703*")
  .map { line =>
    val l = new Location(line)
    (l.id, l.time, l.cell)
  }
  .toDF("id", "time", "cell")

val df = locationDF
  .join(broadcast(segmentDF), Seq("cell"), "inner")
  .select($"id", $"time", $"lat", $"lng", $"cellName")
  .repartition(32)

df.write.save("somewhere/201703.parquet")
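For comparison, here is a hedged sketch of the same filter using the DataFrame CSV reader instead of textFile plus the Location parser; the three-column schema is an assumption about the file layout, and spark.implicits._ is assumed to be imported.

// Sketch only: assumes each CSV row carries exactly these three columns in this order.
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.functions.broadcast

val locationSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("time", StringType),
  StructField("cell", StringType)
))

spark.read
  .schema(locationSchema)                     // explicit schema, so no inference pass over S3
  .csv("bucket/location_files/201703*")
  .join(broadcast(segmentDF), Seq("cell"), "inner")
  .select($"id", $"time", $"lat", $"lng", $"cellName")
  .repartition(32)
  .write.parquet("somewhere/201703.parquet")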
// Alternative way of accessing S3 keys
import com.amazonaws.services.s3._, model._
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import scala.collection.JavaConverters._
import scala.io.Source
import java.io.InputStream

val credentials = new DefaultAWSCredentialsProviderChain().getCredentials
val request = new ListObjectsRequest()
request.setBucketName("s3-eu-west-1.amazonaws.com/bucket")
request.setPrefix("location_files")
request.setMaxKeys(32000)
def s3 = new AmazonS3Client(new BasicAWSCredentials(credentials.getAWSAccessKeyId, credentials.getAWSSecretKey))
val objs = s3.listObjects(request)
sc.parallelize(objs.getObjectSummaries.asScala.map(_.getKey).toList)
  .flatMap { key => Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines } // bucket is the bucket name String
The latter ends up with the error com.amazonaws.services.s3.model.AmazonS3Exception: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint. (Service: Amazon S3; Status Code: 301; Error Code: PermanentRedirect; Request ID: DAE08BA90C01EB5E)
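The 301 PermanentRedirect usually means the bucket is being addressed through the wrong regional endpoint; note also that setBucketName expects the bare bucket name, not a hostname. A hedged sketch of the listing with an explicit region (bucket name and region are placeholders):

// Sketch only: "bucket" and EU_WEST_1 are placeholders for the real bucket name and region.
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsRequest
import scala.collection.JavaConverters._

val s3 = AmazonS3ClientBuilder.standard()
  .withRegion(Regions.EU_WEST_1)              // address the bucket through its own regional endpoint
  .build()

val request = new ListObjectsRequest()
request.setBucketName("bucket")               // just the bucket name, no endpoint host
request.setPrefix("location_files")

val keys = s3.listObjects(request).getObjectSummaries.asScala.map(_.getKey)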

Related

Google Cloud Storage atomic creation of a Blob

I'm using the hadoop-connectors project for writing BLOBs to Google Cloud Storage.
I'd like to make sure that a BLOB with a specific target name that is written in a concurrent context is either written in full or not visible at all in case an exception occurs while writing.
In the code below, if an I/O exception occurs, the BLOB written so far will still appear on GCS because the stream is closed in the finally block:
val stream = fs.create(path, overwrite)
try {
  actions.map(_ + "\n").map(_.getBytes(UTF_8)).foreach(stream.write)
} finally {
  stream.close()
}
The other possibility would be to not close the stream and let it "leak" so that the BLOB does not get created. However, this is not really a valid option.
val stream = fs.create(path, overwrite)
actions.map(_ + "\n").map(_.getBytes(UTF_8)).foreach(stream.write)
stream.close()
Can anybody share a recipe for writing a BLOB to GCS atomically, either with hadoop-connectors or the Cloud Storage client?
I have used reflection within hadoop-connectors to retrieve an instance of com.google.api.services.storage.Storage from the GoogleHadoopFileSystem instance
GoogleCloudStorage googleCloudStorage = ghfs.getGcsFs().getGcs();
Field gcsField = googleCloudStorage.getClass().getDeclaredField("gcs");
gcsField.setAccessible(true);
Storage gcs = (Storage) gcsField.get(googleCloudStorage);
in order to have the ability to make a call based on an input stream corresponding to the data in memory.
private static StorageObject createBlob(URI blobPath, byte[] content, GoogleHadoopFileSystem ghfs, Storage gcs)
    throws IOException {
  CreateFileOptions createFileOptions = new CreateFileOptions(false);
  CreateObjectOptions createObjectOptions = objectOptionsFromFileOptions(createFileOptions);
  PathCodec pathCodec = ghfs.getGcsFs().getOptions().getPathCodec();
  StorageResourceId storageResourceId = pathCodec.validatePathAndGetId(blobPath, false);
  StorageObject object =
      new StorageObject()
          .setContentEncoding(createObjectOptions.getContentEncoding())
          .setMetadata(encodeMetadata(createObjectOptions.getMetadata()))
          .setName(storageResourceId.getObjectName());
  InputStream inputStream = new ByteArrayInputStream(content, 0, content.length);
  Storage.Objects.Insert insert = gcs.objects().insert(
      storageResourceId.getBucketName(),
      object,
      new InputStreamContent(createObjectOptions.getContentType(), inputStream));
  // The operation succeeds only if there are no live versions of the blob.
  insert.setIfGenerationMatch(0L);
  insert.getMediaHttpUploader().setDirectUploadEnabled(true);
  insert.setName(storageResourceId.getObjectName());
  return insert.execute();
}
/**
 * Helper for converting from a Map<String, byte[]> metadata map that may be in a
 * StorageObject into a Map<String, String> suitable for placement inside a
 * GoogleCloudStorageItemInfo.
 */
@VisibleForTesting
static Map<String, String> encodeMetadata(Map<String, byte[]> metadata) {
  return Maps.transformValues(metadata, QuickstartParallelApiWriteExample::encodeMetadataValues);
}

// A function to encode metadata map values
private static String encodeMetadataValues(byte[] bytes) {
  return bytes == null ? Data.NULL_STRING : BaseEncoding.base64().encode(bytes);
}
Note that in the example above, even if multiple callers try to create a blob with the same name in parallel, one and only one will succeed in creating the blob. The other callers will receive a 412 Precondition Failed.
GCS objects (blobs) are immutable [1], which means they can be created, deleted or replaced, but not appended to.
The Hadoop GCS connector provides the HCFS interface, which gives the illusion of appendable files. But under the hood it is just one blob creation; GCS doesn't know whether the content is complete from the application's perspective, just as you mentioned in the example. There is no way to cancel a file creation.
There are two options you can consider:
Create a temp blob/file, copy it to the final blob/file, then delete the temp blob/file; see [2]. Note that there is no atomic rename operation in GCS; rename is implemented as copy-then-delete.
If your data fits into memory, first read the stream and buffer the bytes in memory, then create the blob/file; see [3] (a sketch of this option is shown below).
The GCS connector should also work with the two options above, but I think the GCS client library gives you more control.
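To illustrate the second option, here is a minimal, hedged sketch using the google-cloud-storage client library from Scala. It buffers the content in memory and uses the doesNotExist() precondition (the equivalent of ifGenerationMatch(0) above), so if several writers race, only one create succeeds; bucket and object names are placeholders.

import com.google.cloud.storage.{BlobId, BlobInfo, Storage, StorageException, StorageOptions}
import java.nio.charset.StandardCharsets.UTF_8

// Sketch only: bucket and object names are placeholders.
val storage: Storage = StorageOptions.getDefaultInstance.getService

def createBlobAtomically(bucket: String, name: String, actions: Seq[String]): Boolean = {
  val bytes = actions.map(_ + "\n").mkString.getBytes(UTF_8)          // buffer everything in memory first
  val info  = BlobInfo.newBuilder(BlobId.of(bucket, name)).setContentType("text/plain").build()
  try {
    // doesNotExist() adds ifGenerationMatch(0): the create fails if a live version already exists.
    storage.create(info, bytes, Storage.BlobTargetOption.doesNotExist())
    true
  } catch {
    case e: StorageException if e.getCode == 412 => false             // another writer won the race
  }
}

Because the object only becomes visible when the single create call completes, a failure while building the byte array leaves nothing behind on GCS.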

Calling elocation geocode API returns empty in Spark

I have this odd problem when calling the eLocations geocoding API via Spark, where I always get an empty body, even for addresses I know will return a coordinate. I am developing a geocoding app using Spark (2.3.3) and Scala, and I am using scalaj-http to call the REST API. The line of code which calls the API is as follows:
// requires: import scalaj.http.Http
def getGeoCoderLocation(sc: SparkSession, req: String, url: String, proxy_host: String, proxy_port: String, response_format: String): scala.collection.Map[String, (String, String)] = {
  import sc.implicits._

  val httpresponse = Http(url).proxy(proxy_host, proxy_port.toInt).postForm.params(("xml_request", req), ("format", response_format)).asString
  println(httpresponse.body)
  println(httpresponse.contentType.getOrElse(""))
  println(httpresponse.headers)
  println(httpresponse)

  if (!httpresponse.contentType.getOrElse("").contains("text/html")) {
    val body = httpresponse.body
    val httpresponse_body = parseJSON(Option(body).getOrElse("[{\"x\":, \"y\":}]"))
    val location = for (it <- 0 until httpresponse_body.length) yield {
      (Option(httpresponse_body(it)(0).x).getOrElse("").toString, Option(httpresponse_body(it)(0).y).getOrElse("").toString, it)
    }
    val locDF = location.toDF(Seq("LONGITUDE", "LATITUDE", "row"): _*) //.withColumn("row", monotonically_increasing_id())
    locDF.show(20, false)
    locDF.rdd.map { r => (Option(r.get(2)).getOrElse("").toString, (Option(r.get(0)).getOrElse("").toString, Option(r.getString(1)).getOrElse("").toString)) }.collectAsMap()
  } else {
    val locDF = Seq(("", "", "-")).toDF(Seq("LONGITUDE", "LATITUDE", "row"): _*) //.withColumn("row", monotonically_increasing_id())
    locDF.show(20, false)
    locDF.rdd.map { r => (Option(r.get(2)).getOrElse("").toString, (Option(r.get(0)).getOrElse("").toString, Option(r.getString(1)).getOrElse("").toString)) }.collectAsMap()
  }
}
Where
url = http://elocation.oracle.com/elocation/lbs
proxy_host = (ip of proxy)
proxy_port = (port number)
req = "<?xml version=\"1.0\" standalone=\"yes\"?>\n<geocode_request vendor=\"elocation\">\n\t(address_list)\n\t\t|<list of requests>|\n\t</address_list>\n</geocode_request>"
response_format = JSON
When I print the body it is always [{}] (i.e. an empty JSON array) when I run my app via Spark. When I run the same request without spark-submit (e.g. java -jar test.jar), I get a proper array of JSON objects.
Is there a setting in Spark which blocks the app from receiving REST responses? We are using Cloudera 5.16.x
I have also tried setting the proxy information using --conf "spark.executor.extraJavaOptions=-Dhttp.proxyHost=(ip) -Dhttp.proxyPort=(port) -Dhttps.proxyHost=(ip) -Dhttps.proxyPort=(port)" but I will get:
Exception in thread "main" org.apache.hadoop.security.KerberosAuthException: Login failure for user: (principal) from keytab (keytab) javax.security.auth.login.LoginException: Cannot locate KDC
Please help, as I don't know where to look to solve this; I have never encountered this before.
OK, found the cause: the payload was actually empty, which is why eLocation kept returning blank.
Lesson of the day: check your payload.
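A minimal sketch of the kind of payload check that would have caught this, assuming scalaj-http as in the code above; the require and logging lines are illustrative additions, not part of the original code.

import scalaj.http.Http

// Sketch: fail fast (or at least log) when the XML request body is empty before calling the API.
def callGeocoder(url: String, req: String, responseFormat: String): String = {
  require(req != null && req.trim.nonEmpty, "xml_request payload is empty - nothing to geocode")
  val response = Http(url)
    .postForm(Seq("xml_request" -> req, "format" -> responseFormat))
    .asString
  println(s"status=${response.code} bodyLength=${response.body.length}")
  response.body
}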

Spark Streaming store method only works in the Duration window but not in the foreachRDD workflow in a customized receiver

I define a receiver to read data from Redis.
Part of the receiver's simplified code:
class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {   // element type assumed; adjust to your record type
  override def onStart(): Unit = {
    while (!isStopped) {
      val res = readMethod()
      if (res != null) store(res.toIterator)
      // using res.foreach(r => store(r)) the performance is almost the same
    }
  }

  override def onStop(): Unit = ()
}
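As an aside, Spark's custom receiver guide expects onStart() to start its own thread and return immediately rather than loop in place. A hedged sketch of that shape is below; readMethod() stands in for the Redis read from the question, and its Seq[String] return type is an assumption.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch: onStart() returns quickly and the blocking read loop runs on its own thread.
class ThreadedRedisReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  private def readMethod(): Seq[String] = ???   // placeholder for the Redis read from the question

  override def onStart(): Unit = {
    new Thread("redis-receiver") {
      override def run(): Unit = {
        while (!isStopped) {
          val res = readMethod()
          if (res != null) store(res.toIterator)
        }
      }
    }.start()
  }

  override def onStop(): Unit = ()              // the loop exits once isStopped becomes true
}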
My streaming workflow:
val ssc = new StreamingContext(spark.sparkContext, new Duration(50))
val myReceiver = new MyReceiver()
val s = ssc.receiverStream(myReceiver)
s.foreachRDD { r =>
  r.persist()
  if (!r.isEmpty) {
    // some short operations, about 1s in total
    // note this line ######1
  }
}
I have a producer which produces much faster than the consumer, so there are plenty of records in Redis now; I tested with 10,000. While debugging I saw that all records can be read quickly by readMethod() above once they are in Redis. However, in each microbatch I can only get about 30 records. (If store were fast enough, it should get all 10,000.)
With this suspicion, I added Thread.sleep(10000) at ######1 above. Each microbatch still gets about 30 records, and each microbatch's processing time increases by 10 seconds. And if I increase the Duration to 200 ms, val ssc = new StreamingContext(spark.sparkContext, new Duration(200)), it gets about 120 records.
Does all of this show that Spark Streaming only generates RDDs once per Duration, and that after it hands over an RDD, the store method is effectively paused while the main workflow runs? That would be a great waste if true. I want it to keep generating RDDs (via store) while the main workflow is running.
Any ideas?
I cannot leave a comment simply because I don't have enough reputation. Is it possible that the property spark.streaming.receiver.maxRate is set somewhere in your code?
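A quick, hedged way to check whether such a rate limit is in effect is to inspect the running application's SparkConf; both settings below cap how many records a receiver hands to each batch (per the Spark docs, an unset or non-positive maxRate means no limit).

// Sketch: inspect the rate-limiting settings from the running SparkSession.
val conf = spark.sparkContext.getConf
println(conf.getOption("spark.streaming.receiver.maxRate"))        // None means no explicit cap
println(conf.getOption("spark.streaming.backpressure.enabled"))    // dynamic rate control, off by default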

Maintaining state within a stream

I have a high-load flow of user data. I want to determine whether a user is new by its id. In order to reduce calls to the db, I would rather maintain in memory a state of previously seen users.
val users = mutable.Set[String]()
// init the state from the db
users ++= db.getAllUsersIds()

val source: Source[User, NotUsed] = ???
val dbSink: Sink[User, NotUsed] = ???   // goes to db

// if the user is added to the set, add returns true
val usersFilter = Flow[User].filter(user => users.add(user.id))
Now I can create the graph:
source ~> usersFilter ~> dbSink
My problem is that the mutable state is shared and unsafe. Is there an option to maintain the state within the flow?
There are two ways of doing this.
If you are getting a stream of records and you want to deduplicate it (because some ids have already been processed), you can use the approach described here:
http://janschulte.com/2016/03/08/deduplicate-akka-stream/
The other way of doing this is via a database lookup where you check whether the ID already exists:
val alreadyExists: Flow[User, User, NotUsed] = {
  // build a cache of known ids
  val knownIdList = ... // query database and get list of IDs
  Flow[User].filterNot(user => knownIdList.contains(user.id))
}
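If the goal is to keep the state inside the flow itself (so nothing is shared across materializations), a hedged sketch using Akka Streams' statefulMapConcat is below; the User type is assumed from the question.

import akka.NotUsed
import akka.stream.scaladsl.Flow
import scala.collection.mutable

// Sketch: the mutable set is created per materialization inside the stage,
// so it is only ever touched by the stream's own processing.
val dedupeUsers: Flow[User, User, NotUsed] =
  Flow[User].statefulMapConcat { () =>
    val seen = mutable.Set.empty[String]
    // the factory could also pre-populate `seen` from the db here
    user => if (seen.add(user.id)) user :: Nil else Nil
  }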

Scala: Cannot assign values from RDD to ArrayBuffer

Below is my code:
class Data(val x: Double = 0.0, val y: Double = 0.0) {
  var cluster = 0
}
var dataList = new ArrayBuffer[Data]()
val data = sc.textFile("Path").map(line => line.split(",")).map(userRecord => (userRecord(3), userRecord(4)))
data.foreach(a => dataList += new Data(a._1.toDouble, a._2.toDouble))
When I do
dataList.size
I get 0 as the output.
But there are more than 4k records in data.
Now when I try using take
data.take(10).foreach(a => dataList += new Data(a._1.toDouble, a._2.toDouble))
Now I get the data in dataList. But I want all of my data in dataList.
Please help.
The problem is that the code inside the foreach runs on distributed workers, not in the main thread where you inspect dataList.size. Use RDD.collect() to bring the data back to the driver:
val dataList = data
  .map(a => new Data(a._1.toDouble, a._2.toDouble))
  .collect()
The problem is related to where your code is executed. Every operation made inside a transformation, i.e. map, flatMap, reduce and so on, is not performed in the main thread (or in the driver node) but in the worker nodes. These nodes run in different threads (or on different hosts) than the driver node.
Every object that is not stored inside an RDD and that is used in a worker node lives only in that worker's memory space. Your dataList object is therefore simply freshly created on each worker, and the driver node cannot retrieve any information from these remote objects.
The code in the main program and in the so-called actions, i.e. foreach, collect, take and so on, is executed on the driver node. So when you run
data.take(10).foreach(a => dataList += new Data(a._1.toDouble, a._2.toDouble))
the take method brings the first 10 records of the RDD back from the workers to the driver, all the code executes on the driver node, and the magic works.
If you want to build an RDD of Data objects, you have to apply the transformation directly to the original RDD. Try something similar to the following:
val dataList: RDD[Data] =
data.map(a => new Data(a._1.toDouble, a._2.toDouble))
Also have a look at this post: A new way to err, Apache Spark.
Hope it helps.