User agent parser (ua-parser) slows down Spark on EMR - scala

I am using ua-parser in my UDFs to parse User-Agent info, and I noticed that these jobs are very slow compared to the ones without the parser. Here is an example:
import org.apache.spark.sql.functions.udf
import org.uaparser.scala.Parser
import scala.util.Try

val parser: Parser = Parser.default
val parseDeviceUDF = udf((ua: String) => Try(parser.parse(ua).device.family).toOption.orNull)
The strange thing is that when I submit the job as an EMR step it is slow, but when I run the same code in Zeppelin or the Spark shell it works fine. I write the data to Parquet files, and that is the stage where it gets stuck.
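One thing worth checking, as a hedged guess rather than a diagnosis: if the parser object is rebuilt per task (or, worse, per record) when the closure is shipped to the executors, the cost of compiling ua-parser's regex definitions can dominate the job. A common mitigation is to hold the expensive object in a per-JVM lazy singleton so each executor initializes it exactly once. Below is a minimal stdlib-only sketch of that pattern in Java, where ExpensiveParser is a hypothetical stand-in for org.uaparser.scala.Parser:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ParserHolder {
    // Counts how many times the "expensive" constructor runs
    static final AtomicInteger initCount = new AtomicInteger();

    // Hypothetical stand-in for an object that is costly to build
    // (e.g. a parser that compiles a large set of regexes on construction)
    static class ExpensiveParser {
        ExpensiveParser() {
            initCount.incrementAndGet(); // simulate costly initialization
        }
        String parseDeviceFamily(String ua) {
            return ua == null ? null : ua.split("/")[0]; // toy parse
        }
    }

    // Initialization-on-demand holder idiom: the JVM guarantees
    // lazy, thread-safe, once-per-JVM initialization of INSTANCE
    private static class Holder {
        static final ExpensiveParser INSTANCE = new ExpensiveParser();
    }

    static ExpensiveParser get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        // Many "records" share the single parser instance
        for (int i = 0; i < 1000; i++) {
            ParserHolder.get().parseDeviceFamily("Mozilla/5.0");
        }
        System.out.println(initCount.get()); // parser was built exactly once
    }
}
```

On an EMR cluster the same idea applies per executor JVM: the holder is initialized on first use and then reused by every task running in that executor, instead of being rebuilt for each one.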

The answer I am about to give is not about an open-source project, but it does provide information that anyone researching how to parse User-Agent strings for device intelligence will want to know about.
WURFL is a time-honored tool to do User-Agent (and more generally HTTP request) analysis and obtain easily consumable device/browser information. ScientiaMobile has recently released a version of WURFL (called WURFL Microservice) that can be obtained from the major marketplaces of AWS, Azure and GCP (in addition to ScientiaMobile itself of course).
In the case at hand, the (Java) code that would bring a Spark user from HTTP logs to device data would look something like this:
JavaDStream<List<EnrichedEventData>> enrichedEvents = events.map(evs -> {
    WmClient wmClient = WmClientProvider.getOrCreate(wmServerHost, "80");
    for (EnrichedEventData evItem : evs) {
        // ...
        HttpServletRequestMock request = new HttpServletRequestMock(evItem.getHeaders());
        Model.JSONDeviceData device = wmClient.lookupRequest(request);
        evItem.setWurflCompleteName(device.capabilities.get("complete_device_name"));
        evItem.setWurflDeviceMake(device.capabilities.get("brand_name"));
        evItem.setWurflDeviceModel(device.capabilities.get("model_name"));
        evItem.setWurflFormFactor(device.capabilities.get("form_factor"));
        evItem.setWurflDeviceOS(device.capabilities.get("device_os") + " "
                + device.capabilities.get("device_os_version"));
        // ...
    }
    return evs;
});
More information about how Spark and WURFL are integrated can be found in this article.
Disclaimer: I am the CTO of ScientiaMobile and original creator of WURFL.

How to filter by dimension using Google Analytics Data API (GA4) Java client library?

I am trying to call the Google Analytics Data API (GA4) using the Java client library while applying a dimension filter. This call works if I don't use setDimensionFilter:
RunReportRequest request =
    RunReportRequest.newBuilder()
        .setProperty(propertyId)
        .addDimensions(com.google.analytics.data.v1beta.Dimension.newBuilder().setName("pageLocation"))
        .addMetrics(com.google.analytics.data.v1beta.Metric.newBuilder().setName("screenPageViews"))
        .addMetrics(com.google.analytics.data.v1beta.Metric.newBuilder().setName("activeUsers"))
        // .setDimensionFilter(FilterExpression.newBuilder().setFilter(Filter.newBuilder().setStringFilter(
        //     Filter.StringFilter.newBuilder()
        //         .setMatchType(Filter.StringFilter.MatchType.FULL_REGEXP)
        //         .setField(Descriptors.FieldDescriptor, "pageLocation")
        //         .setValue("MY_REGEXP")
        //         .build())))
        .addDateRanges(com.google.analytics.data.v1beta.DateRange.newBuilder()
            .setStartDate(startDate.toStringYYYYMMDDWithDashes())
            .setEndDate(endDate.toStringYYYYMMDDWithDashes()))
        .setKeepEmptyRows(true)
        .build();
I don't know how to use setDimensionFilter. If the usage commented out in the previous code is correct, then the only thing missing is the call to setField. I don't know how to create the Descriptors.FieldDescriptor instance (or even what it means).
I have reviewed the client library javadoc, and also the code samples (which are really simple and unfortunately do not show any usage of setDimensionFilter).
Descriptors.FieldDescriptor isn't part of the GA4 Data API; it is internal functionality of the protobuf framework.
If you are trying to apply this filter to the field named 'pageLocation', then instead of using setField, I think you can do something like this:
RunReportRequest request =
    RunReportRequest.newBuilder()
        .setProperty("properties/" + propertyId)
        .addDimensions(com.google.analytics.data.v1beta.Dimension.newBuilder().setName("pageLocation"))
        .addMetrics(com.google.analytics.data.v1beta.Metric.newBuilder().setName("screenPageViews"))
        .addMetrics(com.google.analytics.data.v1beta.Metric.newBuilder().setName("activeUsers"))
        .setDimensionFilter(FilterExpression.newBuilder()
            .setFilter(Filter.newBuilder()
                .setFieldName("pageLocation")
                .setStringFilter(Filter.StringFilter.newBuilder()
                    .setMatchType(Filter.StringFilter.MatchType.FULL_REGEXP)
                    .setValue("MY_REGEXP"))))
        .addDateRanges(com.google.analytics.data.v1beta.DateRange.newBuilder()
            .setStartDate("2020-03-31")
            .setEndDate("2021-03-31"))
        .build();
If you want an additional example of how to use setDimensionFilter, here is another one that might help:
RunReportRequest request =
    RunReportRequest.newBuilder()
        .setProperty("properties/" + propertyId)
        .addDimensions(Dimension.newBuilder().setName("city"))
        .addMetrics(Metric.newBuilder().setName("activeUsers"))
        .addDateRanges(DateRange.newBuilder().setStartDate("2020-03-31").setEndDate("today"))
        .setDimensionFilter(FilterExpression.newBuilder()
            .setAndGroup(FilterExpressionList.newBuilder()
                .addExpressions(FilterExpression.newBuilder()
                    .setFilter(Filter.newBuilder()
                        .setFieldName("platform")
                        .setStringFilter(Filter.StringFilter.newBuilder()
                            .setMatchType(Filter.StringFilter.MatchType.EXACT)
                            .setValue("Android"))))
                .addExpressions(FilterExpression.newBuilder()
                    .setFilter(Filter.newBuilder()
                        .setFieldName("eventName")
                        .setStringFilter(Filter.StringFilter.newBuilder()
                            .setMatchType(Filter.StringFilter.MatchType.EXACT)
                            .setValue("in_app_purchase"))))))
        .setMetricFilter(FilterExpression.newBuilder()
            .setFilter(Filter.newBuilder()
                .setFieldName("sessions")
                .setNumericFilter(Filter.NumericFilter.newBuilder()
                    .setOperation(Filter.NumericFilter.Operation.GREATER_THAN)
                    .setValue(NumericValue.newBuilder()
                        .setInt64Value(1000)))))
        .build();

Multipart Form Errors with Lagom

Most of our Lagom entrypoints don't use multipart form requests, but one does. Since Lagom doesn't currently support multipart requests natively, the general suggestion I have seen is to call the underlying Play API, using the PlayServiceCall mechanism.
We have done that, and it works most of the time. But we experience intermittent errors, especially when submitting large files. These are always cases of java.util.zip.ZipException (of various kinds), which look as if the entire file was not received for processing.
Here's how the entrypoint looks in the code; in particular, the Play wrapping mechanism:
def upload = PlayServiceCall[NotUsed, UUID] { wrapCall =>
  Action.async(multipartFormData) { request =>
    wrapCall(ServiceCall { _ =>
      val upload = request.body.file("upload")
      val input = new FileInputStream(upload.get.ref.file)
      val filename = upload.get.filename
      // ...
      // other code to actually process the file
      // ...
    })(request).run
  }
}
Here are just two examples of exceptions we're seeing:
Caused by: java.util.zip.ZipException: invalid code lengths set
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.util.zip.ZipInputStream.read(ZipInputStream.java:194)
at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:214)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
etc.
Caused by: java.util.zip.ZipException: invalid distance too far back
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.util.zip.ZipInputStream.read(ZipInputStream.java:194)
at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:214)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
etc.
We use Lagom 1.3.8, in Scala. Any suggestions, please?
Try using the new service gateway based on Akka HTTP.
You can enable this by adding the following to your build.sbt:
lagomServiceGatewayImpl in ThisBuild := "akka-http"
The new service gateway is still disabled by default in Lagom 1.3.8, but Lagom users that have experienced this problem have reported that it is resolved by enabling the akka-http gateway. This will become the default implementation in Lagom 1.4.0.

How can I run automated tests to check the average response time of a rest API?

I have a RESTful API that I would like to run some tests against at random moments of the day in order to check the average response time. I wasn't able to do this using Postman's Collection Runner. Is there another tool which allows me to do this, or maybe I'll have to write my own?
You can use services like Pingdom to make calls to your API, or you can use monitoring software (commercial or open source; is Zabbix still around?) to monitor your API, or, if you don't need many perks, you can write a script that runs as a cronjob and saves the response time of your API in a text file (or wherever you want) for further inspection.
Here's a little example in PHP, but you can easily adapt it to your favorite language.
// I don't know how long the API request will take
set_time_limit(0);

$start = microtime(true);
$result = executeApiCall();
$executionTime = microtime(true) - $start;
storeExecutionTime($executionTime);

function storeExecutionTime($time) {
    // store the data somewhere
}

How to represent instantiation of DocumentClient in Powershell

Having a hard time figuring out how to do the same thing in PowerShell as the following lines:
(in namespace Microsoft.Azure.Documents)
DocumentClient client = new DocumentClient(new Uri("endpoint"), "authKey");
Database database = client.CreateDatabaseQuery()
    .Where(d => d.Id == "collectionName")
    .AsEnumerable()
    .FirstOrDefault();
Can anyone help?
tx
Look here: https://alexandrebrisebois.wordpress.com/2014/08/23/using-powershell-to-seed-azure-documentdb-from-blob-storage/
It shows you how to use the authKey and endpoint URI in a raw REST request.
Also, study the REST API for DocumentDB here: https://msdn.microsoft.com/en-us/library/azure/dn781481.aspx?f=255&MSPPError=-2147217396.
It'll allow you to look up how to do more operations, following Alexandre's example.
There is also this PowerShell cmdlet DLL that makes many of the operations easy: https://github.com/savjani/Azure-DocumentDB-Powershell-Cmdlets
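For the raw REST route the links above describe, the fiddly part is building the Authorization header: to the best of my recollection of the DocumentDB REST docs, it is an HMAC-SHA256 signature (keyed with the base64-decoded master key) over the lowercased verb, resource type, resource link, and RFC 1123 date. Here is a sketch in Java with a made-up key and database link; verify the exact string-to-sign format against the REST API reference before relying on it:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class DocDbAuth {
    // Builds the DocumentDB REST Authorization token for one request.
    // String-to-sign: verb \n resourceType \n resourceLink \n date \n \n
    // with verb, resourceType and date lowercased.
    static String authToken(String verb, String resourceType, String resourceLink,
                            String date, String base64MasterKey) throws Exception {
        String stringToSign = verb.toLowerCase() + "\n"
                + resourceType.toLowerCase() + "\n"
                + resourceLink + "\n"
                + date.toLowerCase() + "\n"
                + "\n";
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(Base64.getDecoder().decode(base64MasterKey), "HmacSHA256"));
        String sig = Base64.getEncoder()
                .encodeToString(mac.doFinal(stringToSign.getBytes(StandardCharsets.UTF_8)));
        // The token itself must be URL-encoded before going into the header
        return URLEncoder.encode("type=master&ver=1.0&sig=" + sig, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Dummy key and database link, purely illustrative
        String key = Base64.getEncoder()
                .encodeToString("not-a-real-key".getBytes(StandardCharsets.UTF_8));
        String token = authToken("GET", "dbs", "dbs/testdb",
                "Tue, 01 Nov 2016 19:01:02 GMT", key);
        System.out.println(token.startsWith("type%3Dmaster%26ver%3D1.0%26sig%3D"));
    }
}
```

In PowerShell the same computation can be done with System.Security.Cryptography.HMACSHA256, which is what the blog post linked above uses under the hood.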

How to post a file in grails

I am trying to use HTTP to POST a file to an outside API from within a grails service. I've installed the rest plugin and I'm using code like the following:
def theFile = new File("/tmp/blah.txt")
def postBody = [myFile: theFile, foo: 'bar']
withHttp(uri: "http://picard:8080/breeze/project/acceptFile") {
    def html = post(body: postBody, requestContentType: URLENC)
}
The POST works; however, the 'myFile' param arrives as a string rather than an actual file. I have not had any success trying to google for things like "how to post a file in grails", since most of the results end up dealing with handling an uploaded file from a form.
I think I'm using the right requestContentType, but I might have missed something in the documentation.
POSTing a file is not as simple as what you have included in your question (sadly). It also depends on what the API you are calling is expecting: some APIs expect files as base64-encoded text, while others accept them as mime-multipart.
Since you are using the rest plugin, which as far as I can recall uses Apache HttpClient, I think this link should provide enough info to get you started (assuming you are dealing with mime-multipart). It shouldn't be too hard to change it around to work with your API and perhaps make it a bit 'groovy-ier'.
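To make the mime-multipart case concrete, here is a JDK-only sketch of what a multipart/form-data body looks like on the wire. The boundary, field names, and file contents are illustrative, and in practice a client library (e.g. HttpClient's multipart support) assembles this for you; the point is that the file part carries a filename and a content type, which is what makes the server treat it as a file rather than a plain string parameter:

```java
import java.nio.charset.StandardCharsets;

public class MultipartBody {
    // Builds a minimal multipart/form-data body with one text field and one file part.
    static String build(String boundary, String fieldName, String fieldValue,
                        String fileField, String filename, byte[] fileBytes) {
        String crlf = "\r\n";
        StringBuilder sb = new StringBuilder();
        // ordinary form field: just a name and a value
        sb.append("--").append(boundary).append(crlf)
          .append("Content-Disposition: form-data; name=\"").append(fieldName).append("\"")
          .append(crlf).append(crlf)
          .append(fieldValue).append(crlf);
        // file part: the filename and content type mark this as a file upload
        sb.append("--").append(boundary).append(crlf)
          .append("Content-Disposition: form-data; name=\"").append(fileField)
          .append("\"; filename=\"").append(filename).append("\"").append(crlf)
          .append("Content-Type: application/octet-stream").append(crlf).append(crlf)
          .append(new String(fileBytes, StandardCharsets.ISO_8859_1)).append(crlf);
        // closing boundary terminates the body
        sb.append("--").append(boundary).append("--").append(crlf);
        return sb.toString();
    }

    public static void main(String[] args) {
        String body = build("----myBoundary", "foo", "bar",
                "myFile", "blah.txt", "hello".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(body.contains("filename=\"blah.txt\""));
    }
}
```

When sending this yourself you would also set the request's Content-Type header to multipart/form-data; boundary=----myBoundary so the server knows how to split the parts.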