Hive configuration for Spark integration tests - scala

I am looking for a way to configure Hive for Spark SQL integration testing such that tables are written either in a temporary directory or somewhere under the test root. My investigation suggests that this requires setting both fs.defaultFS and hive.metastore.warehouse.dir before HiveContext is created.
Setting only the latter, as mentioned in this answer, does not work on Spark 1.6.1.
val sqlc = new HiveContext(sparkContext)
sqlc.setConf("hive.metastore.warehouse.dir", hiveWarehouseDir)
The table metadata goes in the right place, but the written files go to /user/hive/warehouse.
If a DataFrame is saved without an explicit path, e.g.,
df.write.saveAsTable("tbl")
the location to write files to is determined via a call to HiveMetastoreCatalog.hiveDefaultTableFilePath, which uses the location of the default database. That location appears to be cached during HiveContext construction, so setting fs.defaultFS after the HiveContext is constructed has no effect.
As an aside, but very relevant for integration testing, this also means that DROP TABLE tbl only removes the table metadata but leaves the table files, which wreaks havoc with expectations. This is a known problem--see here & here--and the solution may be to ensure that hive.metastore.warehouse.dir == fs.defaultFS + user/hive/warehouse.
In short, how can configuration properties such as fs.defaultFS and hive.metastore.warehouse.dir be set programmatically before the HiveContext constructor runs?

In Spark 2.0 you can set "spark.sql.warehouse.dir" on the SparkSession builder before creating the SparkSession; it should propagate correctly.
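For example (a minimal sketch, assuming Spark 2.x and a hypothetical test directory /tmp/test-warehouse):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("integration-test")
  // must be set before the session (and thus the underlying Hive client) is created
  .config("spark.sql.warehouse.dir", "/tmp/test-warehouse")
  .enableHiveSupport()
  .getOrCreate()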
For Spark 1.6, I think your best bet might be to programmatically create a hive-site.xml.
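For instance, the test setup could generate the file into a directory that is already on the test classpath before any HiveContext is constructed (a rough sketch; the target path and warehouse directory are assumptions):
import java.nio.file.{Files, Paths}

val hiveWarehouseDir = "/tmp/test-warehouse" // hypothetical test location
val hiveSiteXml =
  s"""<?xml version="1.0"?>
     |<configuration>
     |  <property>
     |    <name>hive.metastore.warehouse.dir</name>
     |    <value>$hiveWarehouseDir</value>
     |  </property>
     |</configuration>""".stripMargin
// target/test-classes is typically on the classpath of an sbt or Maven test run
Files.write(Paths.get("target/test-classes/hive-site.xml"), hiveSiteXml.getBytes("UTF-8"))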

The spark-testing-base library has a TestHiveContext configured as part of the setup for DataFrameSuiteBaseLike. Even if you're unable to use spark-testing-base directly for some reason, you can see how they make the configuration work.
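A hedged usage sketch (trait and member names may differ between spark-testing-base versions; sqlContext is assumed to be provided by the suite base):
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.scalatest.FunSuite

class WarehouseSpec extends FunSuite with DataFrameSuiteBase {
  test("saveAsTable writes under the test-managed warehouse") {
    import sqlContext.implicits._
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
    df.write.saveAsTable("tbl")
    assert(sqlContext.table("tbl").count() === 2)
  }
}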

Related

Deploying DB2 user define functions in sequence of dependency

We have about 200 user-defined functions (UDFs) in DB2. These UDFs are generated by Data Studio into a single script file.
When we create a new database, we need to run the script file several times because some UDFs depend on other UDFs and cannot be created until the functions they depend on have been created first.
Is there a way to generate a script file so that the order in which the functions are deployed takes this dependency into account? Or is there some other technique to arrange the order efficiently?
Many thanks in advance.
That problem should only happen if the setting of auto_reval is not correct. See "Creating and maintaining database objects" for details.
Db2 allows objects to be created in an "unsorted" order. Only when an object is used (accessed) are the object and the objects it depends on checked. This behavior was introduced a long time ago. Only some old, migrated databases keep auto_reval=disabled; some environments might set it based on configuration scripts.
If you still run into issues, try setting auto_reval=DEFERRED_FORCE.
The db2look system command can generate DDL ordered by object creation time with the -ct option, so that can help if you don't want to use the auto_reval method.

How to get all the Spark config along with default config?

I'm working on a project for which I need to collect all the Spark config.
The problem is that if a parameter is not explicitly set, I would need the default value. Is there a way to get all the config, including all defaults?
I tried with:
sc.getConf.getAll, but that way I do not get the defaults.
a SparkListener, in order to get some specific information (for example the number of executors), but I couldn't find a way to get other needed information such as the number of cores per executor, memory per executor, etc.
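For reference, a minimal sketch of the two attempts described above (assuming an active SparkContext sc; getAll only returns explicitly set entries):
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}

// 1) Only explicitly set configuration entries, no defaults:
val explicitConf: Array[(String, String)] = sc.getConf.getAll

// 2) A listener sees executors as they are added, but exposes only limited detail:
sc.addSparkListener(new SparkListener {
  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
    println(s"executor ${event.executorId}: cores=${event.executorInfo.totalCores}")
  }
})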

Custom UDAF not working (KSQL: Confluent)

I am facing issues while creating a custom UDAF in KSQL. The use case is to find the "first" and "last" value of a column in a tumbling window. There is no such built-in UDAF (https://docs.confluent.io/current/ksql/docs/syntax-reference.html#aggregate-functions), so I am trying to create a custom one.
I performed the following steps, based on this document: https://www.confluent.io/blog/write-user-defined-function-udf-ksql/
i. Created the UDAF and AggregateFunctionFactory and registered it in FunctionRegistry as follows:
addAggregateFunctionFactory(new MyAggFunctionFactory());
ii. Built the ksql-engine jar and replaced it in the Confluent package at the following path: $CONFLUENT_HOME/share/java/ksql.
iii. Restarted ksql-server.
However, it seems that the function is not registered. Any suggestions?
Confluent Version: 4.1.0
Note: I tried creating a simple UDF, and that works well. The issue is only with the UDAF.
The issue was that I had named the function 'First', which seems to be a reserved keyword. After changing the function name, it worked.

Is it possible to prevent the SQL Producer from overwriting just one of the table's columns?

Scenario: A computed property needs to be available for RAW methods. The IsComputed property set in the model will not work, as its value will not be available to RAW methods.
Attempted Solution: Create a computed column directly on the SQL table as opposed to setting the IsComputed property in the model, and specify that CodeFluent Entities should not overwrite the computed column. I would then expect the BOM to read the computed SQL field no differently than if it were a normal database field.
Problem: I can't figure out how to prevent Codefluent Entities from overwriting the computed column. I attempted to use the production flags as well as setting produce="false" for the property in the .cfp. Neither worked.
Question: Is it possible to prevent Codefluent Entities from overwriting my computed column and if so, how?
The solution you're looking for is here.
You can execute whatever custom T-SQL scripts you like; the only requirement is to give the script a specific name so the Producer knows when to execute it.
I.e. if you want your custom script to execute after the tables are generated, name your script after_[ProjectName]_tables.
Save your custom T-SQL file alongside the CodeFluent-generated files and build the project.
In my specific case, I had to enable a full-text index on one of my table columns; I wrote the SQL script for that functionality and saved it as
`after_[ProjectName]_relations_add`
Alternate Solution: An alternative is to execute the following T-SQL script after the SQL Producer finishes generating.
ALTER TABLE PunchCard DROP COLUMN PunchCard_CompanyCodeCalculated
GO
ALTER TABLE PunchCard
ADD PunchCard_CompanyCodeCalculated AS CASE
WHEN PunchCard_CompanyCodeAdjusted IS NOT NULL THEN PunchCard_CompanyCodeAdjusted
ELSE PunchCard_CompanyCode
END
GO
Additional Configuration Needed to Make the Solution Work: In order for this solution to work, one must also configure the BOM so that it does not attempt to save the data associated with the computed columns. This can be done in the model using the advanced properties. In my case, I selected the CompanyCodeCalculated property, went to its advanced settings, and set the Save setting to False.
Question: Somewhere in the Knowledge Center there is a passing reference on how to automate the execution of SQL scripts after the SQL Producer finishes, but I cannot find it. Does anybody know how this is done?
Post Usage Comments: Just wanted to let people know I implemented this approach and am so far happy with the results.

Quartz scheduler RAM vs JDBC for job store

Can both RAM and JDBC job stores be used together? I wasn't able to find an answer to this in the official documentation.
No, they cannot be used together.
Usually a scheduler is obtained from the StdSchedulerFactory.
This factory uses a java.util.Properties object to determine the scheduler's settings.
I.e. you could pass the RAMJobStore as well as the JDBC job store as the value of the key org.quartz.jobStore.class, but that key can hold only one value, so only the last value set will take effect.
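A minimal sketch in Scala against the Quartz API, illustrating the point (the scheduler name and thread count here are arbitrary):
import java.util.Properties
import org.quartz.impl.StdSchedulerFactory

val props = new Properties()
props.setProperty("org.quartz.scheduler.instanceName", "TestScheduler")
props.setProperty("org.quartz.threadPool.threadCount", "3")
// Both job stores target the same single key, so the second call simply
// replaces the first; only one job store ends up configured.
props.setProperty("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX")
props.setProperty("org.quartz.jobStore.class", "org.quartz.simpl.RAMJobStore")

val scheduler = new StdSchedulerFactory(props).getScheduler()
scheduler.start()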