Scala API for Delta Lake OPTIMIZE command

The Databricks docs say that you can change the Z-ordering of a Delta table by doing:
spark.read.table(connRandom)
.write.format("delta").saveAsTable(connZorder)
sql(s"OPTIMIZE $connZorder ZORDER BY (src_ip, src_port, dst_ip, dst_port)")
The problem with this is the switching between the Scala and SQL APIs, which is gross. What I want to be able to do is:
spark.read.table(connRandom)
.write.format("delta").saveAsTable(connZorder)
.optimize.zorderBy("src_ip", "src_port", "dst_ip", "dst_port")
but I can't find any documentation that says this is possible.
Is there a Scala API for Delta Lake optimization commands? If so, how do I replicate the aforementioned logic in Scala?
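For reference, newer Delta Lake releases (2.0 and later) do ship a builder for this on DeltaTable; a minimal sketch, assuming a version that includes it (verify against the release you are running):
import io.delta.tables.DeltaTable

spark.read.table(connRandom)
  .write.format("delta").saveAsTable(connZorder)

// OPTIMIZE ... ZORDER BY expressed through the Scala builder, no SQL string needed
DeltaTable.forName(spark, connZorder)
  .optimize()
  .executeZOrderBy("src_ip", "src_port", "dst_ip", "dst_port")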

Related

How can I mock DynamoDB access via Spark in Scala?

I have a Spark job written in Scala that ultimately writes out to AWS DynamoDB. I want to write some unit tests around it, but the only problem is I don't have a clue how to go about mocking the bit that writes to DynamoDB. I'm making use of their emr-dynamodb-connector class, which means I'm not using any dependency injection (otherwise this would be easy).
After I read in some RDD data using Spark, I do some simple transforms on it into a Pair RDD of type (org.apache.hadoop.io.Text, org.apache.hadoop.dynamodb.DynamoDBItemWritable). So my code's only direct contact with Dynamo is creating DynamoDBItemWritable objects. That class doesn't inherently contain any logic to utilize the AWS SDK to save anything; it's essentially just a data object. My code then calls this:
val conf = new Configuration()
conf.set("dynamodb.servicename", "dynamodb")
conf.set("dynamodb.input.tableName", "MyOutputTable")
conf.set("dynamodb.output.tableName", "MyInputTable")
conf.set("dynamodb.endpoint", "https://dynamodb.us-east-1.amazonaws.com")
conf.set("dynamodb.regionid", "us-east-1")
conf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
conf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
myTransformedRdd.saveAsHadoopDataset(new JobConf(conf))
...and the connector magically registers the right classes and makes the right calls so that it effectively saves the results to DynamoDB accordingly.
I can't mock SparkSession because it has a private constructor (and that would be extremely messy anyway). And I don't have any direct way, as far as I know, to mock the DynamoDB client. Is there some magic syntax in Scala (or ScalaTest, or ScalaMock) that lets me tell it that if it ever wants to instantiate a Dynamo client class, it should use a mocked version instead?
If not, how would I go about testing this code? I suppose theoretically, perhaps there's a way to set up a local, in-memory instance of Dynamo and then change the value of dynamodb.endpoint but that sounds horribly messy just to get a unit test working. Plus I'm not sure it's possible anyway.
Take a look at LocalStack. It provides an easy-to-use test/mocking framework for developing AWS-related applications by spinning up AWS-compatible APIs on your local machine or in Docker. It supports a couple dozen AWS APIs, and DynamoDB is among them. It is really a great tool for functional testing without using a separate AWS environment for that.
If you need only DynamoDB, there is another tool: DynamoDB Local, a Docker image with Amazon DynamoDB on board.
Both are as simple as starting a Docker container:
docker run -p 8000:8000 amazon/dynamodb-local
docker run -P localstack/localstack
And if you're using JUnit 5 for the tests, let me recommend JUnit 5 extensions for AWS, a few JUnit 5 extensions that can be useful for testing AWS-related code. These extensions can be used to inject clients for AWS services provided by tools like LocalStack (or the real ones). Both AWS Java SDK v2.x and v1.x are supported.
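With DynamoDB Local running on port 8000 as above, the job configuration from the question can simply be pointed at the local endpoint in a test; a rough sketch (the endpoint, region, and table name are illustrative):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

val conf = new Configuration()
conf.set("dynamodb.servicename", "dynamodb")
conf.set("dynamodb.output.tableName", "MyTestTable")
// Point the connector at the local container instead of the real AWS endpoint
conf.set("dynamodb.endpoint", "http://localhost:8000")
conf.set("dynamodb.regionid", "us-east-1")
conf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")

myTransformedRdd.saveAsHadoopDataset(new JobConf(conf))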

How do I exclude algorithms from H2O AutoML in Sparkling Water using Scala?

I have to exclude some algorithms from the AutoML model.
I am trying this to exclude the algorithms, but it fails:
buildSpecHopper_1.build_models.exclude_algos = Array(Algo.DeepLearning,Algo.GLM)
But it throws a ClassCastException:
java.lang.ClassCastException: [Lai.h2o.automl.AutoML$algo; cannot be cast to [Lai.h2o.automl.Algo;
In Sparkling Water, the proper way to use AutoML is via our Spark wrapper:
https://github.com/h2oai/sparkling-water/blob/078f38ae5c863f7203cbdc9c35110f23c557d756/examples/pipelines/hamOrSpamMultiAlgo.script.scala#L97
That wrapper has options to specify both include and exclude algos.
I can see that you are using the Java API directly and probably hit a bug. We will have a look at that, but I advise using the higher-level API.
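For illustration, excluding algorithms through the wrapper looks roughly like the sketch below; the exact package and setter names vary between Sparkling Water releases, so treat them as assumptions and check the docs for your version:
// Assumed API: package and setter names differ across Sparkling Water versions
import ai.h2o.sparkling.ml.algos.H2OAutoML

val automl = new H2OAutoML()
  .setLabelCol("label")
  .setExcludeAlgos(Array("DeepLearning", "GLM"))  // algos are passed as names, not Java enum values

val model = automl.fit(trainingDF)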

Create Parquet file in Scala without Spark

I am trying to write streaming JSON messages directly to Parquet using Scala (no Spark). I have found only a couple of posts online, plus this post; however, the ParquetWriter API is deprecated and the solution doesn't actually provide an example to follow. I read some other posts too but didn't find any descriptive explanation.
I know I have to use the ParquetFileWriter API, but the lack of documentation is making it difficult for me to use. Can someone please provide an example of it, along with all the constructor parameters and how to create those parameters, especially the schema?
You may want to try Eel, a toolkit for manipulating data in the Hadoop ecosystem.
I recommend reading the README to gain a better understanding of the library, but to give you a sense of how it works, what you are trying to do would look somewhat like the following:
val source = JsonSource(() => new FileInputStream("input.json"))
val sink = ParquetSink(new Path("output.parquet"))
source.toDataStream().to(sink)
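If you would rather not pull in a separate toolkit, another option is the parquet-avro module's AvroParquetWriter, which builds on the lower-level ParquetFileWriter machinery and takes an Avro schema. A minimal sketch (the schema and field names are made up for illustration):
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

// Define an Avro schema describing the records (illustrative fields)
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Message","fields":[
    |  {"name":"id","type":"string"},
    |  {"name":"value","type":"long"}
    |]}""".stripMargin)

// AvroParquetWriter takes care of the Parquet file and page layout for you
val writer = AvroParquetWriter
  .builder[GenericRecord](new Path("output.parquet"))
  .withSchema(schema)
  .withCompressionCodec(CompressionCodecName.SNAPPY)
  .build()

// Write one record per parsed JSON message, then close when done
val record = new GenericData.Record(schema)
record.put("id", "abc")
record.put("value", 42L)
writer.write(record)
writer.close()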

How can one use Spark Catalyst?

According to this,
Spark Catalyst is "an implementation-agnostic framework for manipulating trees of relational operators and expressions."
I want to use Spark Catalyst to parse SQL DMLs and DDLs in order to generate custom Scala code from them. However, it is not clear to me from reading the code whether there is any wrapper class around Catalyst that I can use. The ideal wrapper would receive a SQL statement and produce the equivalent Scala code. For my use case it would look like this:
def generate("select substring(s, 1, 3) as from t1") = {
  // custom code
  // returns custom Scala code which is executable given s as List[String]
}
This is a simple example, but the idea is that I don't want to write another parser, and I need to parse a lot of SQL functionality from a legacy system that I have to write custom Scala implementations for.
As a more general question: given the lack of class-level design documentation, how can someone learn the code base and make contributions?
Spark accepts SQL queries via spark.sql. For example, you can pass the string SELECT * FROM table as an argument, as in spark.sql("SELECT * FROM table"), after having registered your DataFrame as "table". To register your DataFrame as "table" for use in SQL queries, create a temporary view using
DataFrame.createOrReplaceTempView("table")
You can see examples here:
https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#running-sql-queries-programmatically
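Putting those two pieces together, a minimal example looks like this (the input path and view name are illustrative):
// Register a DataFrame as a temporary view, then query it with SQL
val df = spark.read.json("people.json")
df.createOrReplaceTempView("people")
val result = spark.sql("SELECT * FROM people")
result.show()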
A DataFrame is automatically translated into RDD operations and the code is optimized along the way; this optimization is done by Catalyst. When a programmer writes DataFrame code, it is optimized internally. For more detail, visit
Catalyst optimisation in Spark
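Incidentally, you can see what Catalyst produces for a given query by asking Spark for its plans (reusing the people view from the snippet above):
// explain(true) prints the parsed, analyzed and optimized logical plans plus the physical plan
spark.sql("SELECT name FROM people WHERE age > 21").explain(true)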

Reading a CSV file with Flink, Scala, addSource and readCsvFile

I want to read a CSV file with Flink, using the Scala language and the addSource and readCsvFile functions. I have not found any simple examples of that. I have only found https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/scala/com/dataartisans/flinktraining/exercises/datastream_scala/cep/LongRides.scala, and that is too complex for my purpose.
Given the definition StreamExecutionEnvironment.addSource(sourceFunction), should I just use readCsvFile as the sourceFunction?
After reading, I'd want to use CEP (Complex Event Processing).
readCsvFile() is only available as part of Flink's DataSet (batch) API, and cannot be used with the DataStream (streaming) API. Here's a pretty good example of readCsvFile(), though it's probably not relevant to what you're trying to do.
readTextFile() and readFile() are methods on StreamExecutionEnvironment, and do not implement the SourceFunction interface -- they are not meant to be used with addSource(), but rather instead of it. Here's an example of using readTextFile() to load a CSV using the DataStream API.
Another option is to use the Table API, and a CsvTableSource. Here's an example and some discussion of what it does and doesn't do. If you go this route, you'll need to use StreamTableEnvironment.toAppendStream() to convert your table stream to a DataStream before using CEP.
Keep in mind that all of these approaches will simply read the file once and create a bounded stream from its contents. If you want a source that reads in an unbounded CSV stream, and waits for new rows to be appended, you'll need a different approach. You could use a custom source, or a socketTextStream, or something like Kafka.
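To make the readTextFile() route concrete, a minimal sketch with the DataStream API might look like this (the field names and types are illustrative):
import org.apache.flink.streaming.api.scala._

case class Event(id: String, ts: Long, count: Int)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val events: DataStream[Event] = env
  .readTextFile("file:///path/to/your_csv_file")
  .map { line =>
    // split each CSV line and build a typed event
    val fields = line.split(",")
    Event(fields(0), fields(1).toLong, fields(2).toInt)
  }
// events can now be fed into a CEP Pattern via CEP.pattern(events, pattern)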
If you have a CSV file with 3 fields, say String, Long, Integer, then do the below:
val input = benv.readCsvFile[(String, Long, Integer)]("hdfs:///path/to/your_csv_file")
PS: I am using the Flink shell, which is why I have benv.