Calling GCP Translate API within Dataproc pyspark map

I am trying to call the language detection method of the Translate client API from PySpark for each row in a file.
I created a map method as follows, but the job just seems to freeze with no error. If I remove the call to the Translate API it executes fine. Is it possible to call Google Cloud client API methods within a PySpark map?
Mapping method to do the translation:
from google.cloud import translate

def doTranslate(data):
    translate_client = translate.Client()
    # Get the message information
    messageId = data[0]
    messageContent = data[6]
    detectedLang = translate_client.detect_language(messageContent)
    r = []
    r.append(detectedLang)
    return r

Figured it out! Your question led me in the right direction, thanks!
It turns out I was getting an exception from the call because I was exceeding the default quota for message sizes. I added a try/except block and confirmed this was the problem. Then cutting the message size down (I am just testing, so I don't want to mess with the quota) fixed the issue.
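For reference, a minimal sketch of the kind of guard described above, so a quota or size error surfaces in the row output instead of stalling the job (the field positions and the Client/detect_language calls are from the question; the error-dict return shape is just illustrative):
from google.cloud import translate

def doTranslate(data):
    translate_client = translate.Client()
    messageId = data[0]
    messageContent = data[6]
    try:
        detectedLang = translate_client.detect_language(messageContent)
        return [messageId, detectedLang]
    except Exception as e:
        # e.g. the message exceeds the default size quota
        return [messageId, {"error": str(e)}]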

Related

How to use delta trigger in flink?

I want to use DeltaTrigger in Apache Flink (Flink 1.3), but I am having trouble with this code:
.trigger(DeltaTrigger.of(100, new DeltaFunction[uniqStruct] {
override def getDelta(oldFp: uniqStruct, newFp: uniqStruct): Double = newFp.time - oldFp.time
}, TypeInformation[uniqStruct]))
And I have this error:
error: object org.apache.flink.api.common.typeinfo.TypeInformation is not a value [ERROR] }, TypeInformation[uniqStruct]))
I don't understand why DeltaTrigger needs a TypeSerializer[T], and I don't know what to do to fix this error.
Thanks a lot everyone.
I would read into this a bit: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/types_serialization.html. It sounds like you can create a serializer using typeInfo.createSerializer(config) on your type info. Note that what you're passing in currently is the type itself and NOT the type info, which is why you're getting the error you are.
You would need to do something more like:
val uniqStructTypeInfo: TypeInformation[uniqStruct] = createTypeInformation[uniqStruct]
val uniqStructTypeSerializer = uniqStructTypeInfo.createSerializer(config)
To quote the page above regarding the config parameter you need to pass to createSerializer:
The config parameter is of type ExecutionConfig and holds the information about the program's registered custom serializers. Wherever possible, try to pass the program's proper ExecutionConfig. You can usually obtain it from DataStream or DataSet via calling getExecutionConfig(). Inside functions (like MapFunction), you can get it by making the function a Rich Function and calling getRuntimeContext().getExecutionConfig().
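Putting those pieces together, a sketch of how the trigger call could end up looking (env stands for your StreamExecutionEnvironment and windowedStream for the windowed stream your original .trigger(...) call is attached to; both are assumptions, the rest comes from the question and the Flink 1.3 API):
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.triggers.DeltaTrigger
import org.apache.flink.streaming.api.functions.windowing.delta.DeltaFunction

val uniqStructTypeInfo: TypeInformation[uniqStruct] = createTypeInformation[uniqStruct]
// the ExecutionConfig carries any custom serializer registrations
val uniqStructSerializer = uniqStructTypeInfo.createSerializer(env.getConfig)

windowedStream.trigger(DeltaTrigger.of(
  100,
  new DeltaFunction[uniqStruct] {
    override def getDelta(oldFp: uniqStruct, newFp: uniqStruct): Double = newFp.time - oldFp.time
  },
  uniqStructSerializer))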
DeltaTrigger needs a TypeSerializer because it uses Flink's managed state mechanism to store each element for later comparison with the next one (it just keeps one element, the last one, which is updated as new elements arrive).
You will find an example (in Java) here.
But if all you need is a window that triggers every 100msec, then it'll be easier to just use a TimeWindow, such as
input
  .keyBy(<key selector>)
  .timeWindow(Time.milliseconds(100))
  .apply(<window function>)
Updated:
To have hour-long windows that trigger every 100msec, you could use sliding windows. However, you would have 10 * 60 * 60 windows, and every event would be placed into each of these 36000 windows. So that's not a great idea.
If you use a GlobalWindow with a DeltaTrigger, then the window will be triggered only when events are more than 100msec apart, which isn't what you've said you want.
I suggest you look at ProcessFunction. It should be straightforward to get what you want that way.
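For what it's worth, here is a rough, untested sketch of the ProcessFunction route (all names are made up, and it assumes uniqStruct.time is an epoch-millisecond timestamp): buffer elements per key in managed state and fire a processing-time timer every 100 ms that emits whatever arrived in the last hour. Timers require a keyed stream, hence the keyBy in the usage line.
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

class HourWindowEvery100ms extends ProcessFunction[uniqStruct, Seq[uniqStruct]] {

  // keyed managed state holding the elements seen so far for the current key
  private var buffer: ListState[uniqStruct] = _

  override def open(parameters: Configuration): Unit = {
    buffer = getRuntimeContext.getListState(
      new ListStateDescriptor[uniqStruct]("buffer", classOf[uniqStruct]))
  }

  override def processElement(value: uniqStruct,
                              ctx: ProcessFunction[uniqStruct, Seq[uniqStruct]]#Context,
                              out: Collector[Seq[uniqStruct]]): Unit = {
    buffer.add(value)
    // coalesce timers onto 100 ms boundaries, so each key fires at most once per interval
    val now = ctx.timerService().currentProcessingTime()
    ctx.timerService().registerProcessingTimeTimer(now - (now % 100) + 100)
  }

  override def onTimer(timestamp: Long,
                       ctx: ProcessFunction[uniqStruct, Seq[uniqStruct]]#OnTimerContext,
                       out: Collector[Seq[uniqStruct]]): Unit = {
    // emit everything from the last hour and drop older elements from state
    val oneHourAgo = timestamp - 60 * 60 * 1000
    val recent = buffer.get().asScala.filter(_.time >= oneHourAgo).toSeq
    buffer.clear()
    recent.foreach(buffer.add)
    out.collect(recent)
    // keep firing every 100 ms while there is buffered data
    if (recent.nonEmpty) ctx.timerService().registerProcessingTimeTimer(timestamp + 100)
  }
}

// usage: input.keyBy(<key selector>).process(new HourWindowEvery100ms)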

Scala Netty: is there any way to share a ReplayingDecoder?

I am looking to open multiple connections using a Netty client bootstrap in order to parse messages coming from multiple sources. The messages all have the same format; however, due to the amount of data that needs to be processed, I must run each connection on a separate thread. (This assumes Netty creates a thread per client channel, which I couldn't find a reference for - if that's not the case, how would this be achieved?)
This is the code that I use to connect to the data server:
var b = new Bootstrap()
  .group(group)
  .channel(classOf[NioSocketChannel])
  .handler(RawFeedChannelInitializer)
var ch1 = b.clone().connect(host, port).sync().channel();
var ch2 = b.clone().connect(host, port).sync().channel();
The initializer calls RawPacketDecoder, which extends ReplayingDecoder, and is defined here.
The code works well without @Sharable when opening a single connection, but for the purposes of my application I must connect to the same server multiple times.
This results in the runtime error @Sharable annotation is not allowed, pointing to my RawPacketDecoder class.
I am not entirely sure how to get past this issue, short of reimplementing my decoder in Scala directly on top of ByteToMessageDecoder instead of ReplayingDecoder.
Any help would be greatly appreciated.
Note: I am using Netty 4.0.32.Final
I found the solution in this StackExchange answer.
My issue was that I was using an object-based ChannelInitializer (a singleton), and ReplayingDecoder as well as ByteToMessageDecoder are not sharable.
My initializer was created as a Scala object, so only a single instance was allowed. Changing the initializer to a Scala class and instantiating it for each bootstrap clone solved the problem. I modified the bootstrap code above as follows:
var b = new Bootstrap()
  .group(group)
  .channel(classOf[NioSocketChannel])
  //.handler(RawFeedChannelInitializer)
var ch1 = b.clone().handler(new RawFeedChannelInitializer()).connect(host, port).sync().channel();
var ch2 = b.clone().handler(new RawFeedChannelInitializer()).connect(host, port).sync().channel();
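For reference, a sketch of what the class-based initializer itself can look like (assuming RawPacketDecoder has a no-arg constructor; the second handler name is just a placeholder for whatever consumes the decoded messages):
import io.netty.channel.ChannelInitializer
import io.netty.channel.socket.SocketChannel

class RawFeedChannelInitializer extends ChannelInitializer[SocketChannel] {
  override def initChannel(ch: SocketChannel): Unit = {
    ch.pipeline()
      .addLast(new RawPacketDecoder())       // ReplayingDecoder subclass, one instance per channel
      .addLast(new RawFeedMessageHandler())  // placeholder for the downstream business handler
  }
}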
I am not sure whether this ensures the multithreading I wanted, but it does allow the data access to be split across multiple connections to the feed server.
Edit/Update: After performing additional research on the subject, I have determined that Netty does in fact run each new channel on its own event-loop thread (up to the size of the event loop group); this was verified by printing to the console after the creation of each channel:
println("No. of active threads: " + Thread.activeCount());
The output shows the count increasing as channels are created and associated with their respective threads.
By default NioEventLoopGroup uses 2 * Num_CPU_cores threads, as defined here:
DEFAULT_EVENT_LOOP_THREADS = Math.max(1, SystemPropertyUtil.getInt(
        "io.netty.eventLoopThreads",
        Runtime.getRuntime().availableProcessors() * 2));
This value can be overridden to something else by setting
val group = new NioEventLoopGroup(16)
and then using the group to create/setup the bootstrap.

How do I read environment variables in Postman tests?

I'm using the packaged app version of Postman to write tests against my REST API. I'm trying to manage state between consecutive tests. To facilitate this, the Postman object exposed to the JavaScript test runtime has methods for setting variables, but none for reading them.
postman.setEnvironmentVariable("key", value );
Now, I can read this value in the next call via the {{key}} syntax that pulls values in from the current environment. But this doesn't work in the tests; it only works in the request-building parts.
So, is there a way to read these variables from the tests?
According to the docs here you can use
environment["foo"] OR environment.foo
globals["bar"] OR globals.bar
to access them.
i.e.:
postman.setEnvironmentVariable("foo", "bar");
tests["environment var foo = bar"] = environment.foo === "bar";
postman.setGlobalVariable("foobar", "1");
tests["global var foobar = true"] = globals.foobar == true;
postman.setGlobalVariable("bar", "0");
tests["global var bar = false"] = globals.bar == false;
Postman updated their sandbox and added a pm.* API. Although the older syntax for reading variables in the test scripts still works, according to the docs:
Once a variable has been set, use the pm.variables.get() method or, alternatively, use the pm.environment.get() or pm.globals.get() method depending on the appropriate scope to fetch the variable. The method requires the variable name as a parameter to retrieve the stored value in a script.
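With the newer sandbox, the equivalent of the earlier environment-variable test would look something like this (the variable names are just illustrative):
pm.environment.set("foo", "bar");
pm.test("environment var foo = bar", function () {
    pm.expect(pm.environment.get("foo")).to.eql("bar");
});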

Celery - error handling and data storage

I'm trying to better understand common strategies regarding results and errors in Celery.
I see that tasks have statuses/states and that Celery stores results if requested -- when would I use this data? Should error handling and data storage be contained within the task?
Here is a sample scenario, in case it helps better understand my objective:
I have a geocoding task that geocodes user addresses. Whether the task fails or succeeds, I'd like to update a field in the database letting the user know (error handling). On success, I'd like the geocoded data to be inserted into the database (data storage).
What approach should I take?
Let me preface this by saying that I'm still getting a feel for Celery myself. That being said, I have some general inclinations about how I'd go about tackling this, and since no one else has responded, I'll give it a shot.
Based on what you've written, a relatively simple (though I suspect non-optimized) solution is to follow the broad contours of the blog comment spam task example from the documentation.
app.models.py
class Address(models.Model):
    GEOCODE_STATUS_CHOICES = (
        ('pr', 'pre-check'),
        ('su', 'success'),
        ('fl', 'failed'),
    )
    address = models.TextField()
    ...
    geocode = models.TextField()
    geocode_status = models.CharField(max_length=2,
                                      choices=GEOCODE_STATUS_CHOICES,
                                      default='pr')

class AppUser(models.Model):
    name = models.CharField(max_length=100)
    ...
    address = models.ForeignKey(Address)
app.tasks.py
from celery import task
from app.models import Address, AppUser
from some_module import geocode_function  # assuming this returns a string

@task()
def get_geocode(appuser_pk):
    user = AppUser.objects.get(pk=appuser_pk)
    address = user.address
    try:
        result = geocode_function(address.address)
        address.geocode = result
        address.geocode_status = 'su'  # set address object as successful
        address.save()
        return address.geocode  # this is optional -- your task doesn't have to return anything
        # On the other hand, you could also choose to decouple the geocode
        # function from the database update for the object instance.
        # Also, if you're thinking about chaining tasks together, you might
        # think about whether it's advantageous to pass a parameter as an
        # input or partial input into the child task.
    except Exception as e:
        address.geocode_status = 'fl'  # address object fails
        address.save()
        # do something_else()
        raise  # re-raise the error, in case you want to trigger retries, etc.
app.views.py
from app.tasks import *
from app.models import *
from django.shortcuts import get_object_or_404

def geocode_for_address(request, app_user_pk):
    app_user = get_object_or_404(AppUser, pk=app_user_pk)
    # ...etc. etc. -- somewhere in here, call your tasks with the appropriate args/kwargs
I believe this meets the minimal requirements you've outlined above. I've intentionally left the view undeveloped, since I don't have a sense of how exactly you want to trigger it. It sounds like you may also want some sort of user notification when their address can't be geocoded ("I'd like to update a field in the database letting the user know"). Without knowing more about the specifics of this requirement, it sounds like something that might be best accomplished in your HTML templates (if the instance's attribute value is X, display Q in the template) or by using Django signals (set up a signal for when a user.address.geocode_status switches to failure -- say, by emailing the user to let them know, etc.).
In the comments to the code above, I mentioned the possibility of decoupling and chaining the component parts of the get_geocode task. You could also decouple the exception handling from the get_geocode task by writing a custom error-handler task and using the link_error parameter (for instance, add.apply_async((2, 2), link_error=error_handler.s()), where error_handler has been defined as a task in app.tasks.py). Also, whether you choose to handle errors via the main task (get_geocode) or via a linked error handler, I would think you'd want to get much more specific about how to handle different sorts of errors (e.g., handle connection errors differently from incorrectly formatted address data).
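To make the link_error idea concrete, here is a rough, untested sketch of what a linked error handler for get_geocode could look like (Celery passes the failed task's id to the errback, followed by any arguments pre-bound with .s(); all names below are illustrative):
# app.tasks.py
from celery import task
from celery.result import AsyncResult
from app.models import AppUser

@task()
def geocode_error_handler(uuid, appuser_pk):
    result = AsyncResult(uuid)
    exc = result.result  # the exception raised by get_geocode
    address = AppUser.objects.get(pk=appuser_pk).address
    address.geocode_status = 'fl'
    address.save()
    # log exc, notify the user, decide whether to retry, etc.

# wiring it up, e.g. from the view:
# get_geocode.apply_async((app_user.pk,), link_error=geocode_error_handler.s(app_user.pk))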
I suspect there are better approaches, and I'm just beginning to understand how inventive you can get by chaining tasks, using groups and chords, etc. Hope this helps at least get you thinking about some of the possibilities. I'll leave it to others to recommend best practices.

Making a GET request to a web service from the Play Framework 2.0

I'm trying to call a web service from the Play Framework, and I think I'm doing it wrong. I have an example call to http://www.myweather2.com/developer/forecast.ashx?uac=eDKGlpcBQN&query=52.6%2C-4.4&output=xml
A snippet of what I'm trying in the Play Framework is the following:
val response = WS.url("http://www.myweather2.com/developer/forecast.ashx?uac=eDKGlpcBQN&query=52.6%2C-4.4&output=xml").get.get()
val body = response.getBody
When I call this, the body consists of "useraccount does not exist". When I just put this URL in a browser, I get the response I'm looking for. What am I doing wrong here?
For some reason, I was getting WS from the wrong import. When I fixed the imports to import play.api.libs.ws.WS, it worked. I'm still amazed it worked halfway with the wrong import.
Don't know about "useraccount does not exist" but this seems to work:
val promise = WS.url("http://www.myweather2.com/developer/forecast.ashx?uac=eDKGlpcBQN&query=52.6%2C-4.4&output=xml").get()
val body = promise.value.get.body
Edit: Removed the space.
Also make sure your editor is not inserting a \n or \r after ?
I know this is old, but I just solved this problem while trying to do the same thing - getting the same results.
GET variables must be passed with WS.url("http://...").setQueryParameter(key, value)
Example:
val promise = WS.url("http://www.myweather2.com/developer/forecast.ashx")
  .setQueryParameter("uac", "eDKGlpcBQN")
  .setQueryParameter("query", "52.6%2C-4.4")
  .setQueryParameter("output", "xml")
  .get()
Annoying, but a relatively simple fix.