How to get all the Spark config along with default config? - scala

I'm working on a project for which I need to collect all the Spark config.
The problem is that if a parameter is not explicitly set, I would need the default value. Is there a way to get all the config, including all defaults?
I tried with:
sc.getConf.getAll, but this way I do not get the defaults.
a SparkListener, in order to get some specific information (for example the number of executors), but I couldn't find a way to get other needed information like the number of cores per executor, the memory per executor, etc.
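For what it's worth, here is a minimal sketch of how one might combine the two views, assuming an existing SparkSession named spark; the fallback values passed to getConf.get below are illustrative, not guaranteed Spark defaults:

val sc = spark.sparkContext

// Explicitly set configuration only (no defaults):
val explicit = sc.getConf.getAll.toMap

// SQL configuration keys with their current (or default) values:
val sqlConfs = spark.sql("SET -v").collect()
  .map(row => row.getString(0) -> row.getString(1))
  .toMap

// For individual non-SQL keys, fall back to a default of your choice:
val executorMemory = sc.getConf.get("spark.executor.memory", "1g")
val executorCores  = sc.getConf.get("spark.executor.cores", "1")

// Runtime view of the executors actually allocated (typically includes a driver entry):
val executorInfos = sc.statusTracker.getExecutorInfos

// Merge, letting explicitly set values win over SQL defaults:
val allConfs = sqlConfs ++ explicit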

Export database to multiple files in same job Spring Batch

I need to export a database of around 180k objects to JSON files so that I can retain the data structure in a certain way that suits me for a later import into another database. However, because of the amount of data, I want to separate and group the data based on some attribute value from the database records themselves. So all records that have attribute1=value1 should go to value1.json, those with value2 to value2.json, and so on.
However, I still haven't figured out how to do this kind of job. I am using RepositoryItemReader and JsonFileWriter.
I started by filtering data on that attribute and running separate exports, just to verify that this works, but I need a way to automate the whole process and let it run on its own.
Can this be done?
There are several ways to do that. Here are a couple of options:
Option 1: parallel steps
You start by creating a tasklet that calculates the distinct values of the attribute you want to group items by, and you put this information in the job execution context.
After that, you create a flow with a chunk-oriented step for each value. Each chunk-oriented step would process a distinct value and generate an output file. The item reader and writer would be step-scoped beans, dynamically configured with the information from the job execution context.
Option 2: partitioned step
Here, you would implement a Partitioner that creates a partition for each distinct value. Each worker step would then process a distinct value and generate an output file.
Both options should perform equally well in your use case. However, option 2 is easier to implement and configure in my opinion; a sketch of such a Partitioner follows below.
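For option 2, a rough sketch of such a Partitioner (written in Scala here; the class name AttributeValuePartitioner and the attributeValue/outputFile keys are made-up placeholders, not Spring Batch APIs):

import java.util.{Map => JMap}
import org.springframework.batch.core.partition.support.Partitioner
import org.springframework.batch.item.ExecutionContext
import scala.collection.JavaConverters._

class AttributeValuePartitioner(distinctValues: Seq[String]) extends Partitioner {
  // One partition per distinct attribute value; each worker step reads its
  // attributeValue from the step execution context and writes to outputFile.
  override def partition(gridSize: Int): JMap[String, ExecutionContext] = {
    distinctValues.map { value =>
      val context = new ExecutionContext()
      context.putString("attributeValue", value)
      context.putString("outputFile", s"$value.json")
      s"partition-$value" -> context
    }.toMap.asJava
  }
}

A step-scoped RepositoryItemReader and JSON writer can then pull attributeValue and outputFile from the step execution context.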

Helm Template Processing Order

I'm trying to figure out the precise processing order for templates within Helm (v3.8.1 and newer). Specifically, I'm looking for the order in which individual files are processed - if any.
I'm trying to figure out the best means to render some common-use values based on child charts' values. I have some subcharts that will need to interoperate, and I'm aware of the availability of the "global" map in order to publish values consumable by all charts. I've not tested modifying the contents of "global" programmatically, but I expect this should be possible.
Regardless, what I'm getting at is that using static variables defined in values.yaml (global, specifically, or even subcharts' value overrides) may not be enough in order to achieve the interoperation I need for all of these subcharts.
Specifically, I would like each subchart, as it is processed, to compute specific values that could be consumed for interoperation and then publish them to the "global" map (e.g. global.subchart1.someValue, global.subchart2.someOtherValue, etc.).
For instance: one of my subcharts is an LDAP provider, but an external LDAP could also be used, so the LDAP URL is something that needs to be "computed" to either be the external one (manually-specified), or the internal component's one.
This type of modification would allow me to consume those values (if available) wherever interoperation is required, via a single source of truth.
I'm looking for something like this:
Parse the top-level chart's Chart.yaml
Parse and process the top-level chart's non-template files (_* files)
Repeat the prior two steps for all subcharts, recursively
Parse the top-level chart's values.yaml
Parse and process/render the top-level chart's remaining template files
Repeat the prior two steps for all subcharts, recursively
Ideally, also including the order in which files are processed (I'm currently presuming they're processed alphabetically).
I realize this is probably not the ideal scenario, but I do want the ability for each subchart to be able to export information akin to "my service is exported on this port, with this user and password" (yes, I'm aware that secrets would likely be required here, etc...), or "this is the search DN for the LDAP users", etc.

Spark local variable broadcast to executor

var countryMap = Map("Amy" -> "Canada", "Sam" -> "US", "Bob" -> "Canada")
val names = List("Amy", "Sam", "Eric")
sc.parallelize(names).flatMap(countryMap.get).collect.foreach(println)
//output
Canada
US
I'm running this Spark job in YARN mode, and I'm sure that the driver and executors are not in the same node/JVM (see the attached pic). Since countryMap is not a broadcast variable, the executors should not see it, and this code shouldn't print anything. However, it printed Canada and US.
My question is: does Spark send local variables to the executors automatically if they are serializable? If not, how do the executors see the driver's local variables?
Hey Edwards,
when you invoke collect, the result set is brought back to the driver, which then tries to perform the mapping. That is why you find the mappings get generated.
Cheers,
Local variables: live within the driver or within a single executor; no serialization is required; they are shared within that executor/driver.
Variables from the main (driver) program: the driver and each task get separate, isolated copies, so serialization is required.
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
Reference: https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#broadcast-variables
Basically, local variables in the driver are shipped to the executors automatically (as part of the task closures). However, you need to create explicit broadcast variables when the same data is needed across multiple stages.
To have countryMap.get run on the cluster, Spark needs to serialize countryMap (as part of the task closure) and send it to every executor, so each task has the function with the data already attached to it in the form of an object instance. If you made the map's class unserializable, you wouldn't be able to run this code at all.
So Spark doesn't silently populate local variables on the executors; rather, by referencing countryMap in the closure you tell it to serialize that object and run a method of it in a distributed way.
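To make the contrast concrete, here is a small sketch (assuming an existing SparkContext sc) of the two ways the map can reach the executors, captured in the task closure versus shipped as an explicit broadcast variable:

import org.apache.spark.broadcast.Broadcast

val countryMap = Map("Amy" -> "Canada", "Sam" -> "US", "Bob" -> "Canada")
val names = List("Amy", "Sam", "Eric")

// Closure capture: countryMap is serialized into every task's closure and
// shipped to the executors automatically, once per task.
sc.parallelize(names).flatMap(name => countryMap.get(name)).collect().foreach(println)

// Explicit broadcast: the map is sent to each executor once and cached there,
// which pays off when the same data is reused across many tasks or stages.
val broadcastCountries: Broadcast[Map[String, String]] = sc.broadcast(countryMap)
sc.parallelize(names).flatMap(name => broadcastCountries.value.get(name)).collect().foreach(println)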

Jmeter - Can I change a variable halfway through runtime?

I am fairly new to JMeter so please bear with me.
I need to understand whether, whilst running a JMeter script, I can change the variable holding the details of "DB1" so that it then points at "DB2".
The reason for this is that I want to throw load at one MongoDB and then switch to another DB at a certain time (hotdb/colddb).
The easiest way is just defining 2 MongoDB Source Config elements pointing to separate database instances and giving them 2 different MongoDB Source names.
Then in your script you will be able to manipulate the MongoDB Source parameter value in the MongoDB Script test element or in JSR223 Samplers, so your queries will hit either hotdb or colddb.
See the How to Load Test MongoDB with JMeter article for detailed information.
How about reading the value from a file in a Beanshell/JavaScript sampler each iteration and storing it in a variable, then editing/saving the file when you want to switch? It's ugly, but it would work.

Get list of executions filtered by parameter value

I am using Spring-batch 3.0.4 stable. While submitting a job I add some specific parameters to its execution, say, a tag. Jobs information is persisted in the DB.
Later on I will need to retrieve all the executions marked with a particular tag.
Currently I see 2 options:
Get all job instances with org.springframework.batch.core.explore.JobExplorer#findJobInstancesByJobName. For each instance, get all available executions with org.springframework.batch.core.explore.JobExplorer#getJobExecutions. Filter the resulting collection of executions by checking their JobParameters.
Write my own JdbcTemplate-based DAO implementation to run the select query.
While the former option seems pretty inefficient, the latter one means writing extra code to deal with the Spring-specific database table structure.
Is there any option I am missing here?
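For completeness, option 1 boils down to something like the following sketch (in Scala; the job name "myJob" and the parameter name "tag" are placeholders):

import org.springframework.batch.core.JobExecution
import org.springframework.batch.core.explore.JobExplorer
import scala.collection.JavaConverters._

// Load all executions of a job via JobExplorer and filter them in memory
// by the value of the "tag" job parameter.
def executionsWithTag(explorer: JobExplorer, tag: String): Seq[JobExecution] =
  explorer.findJobInstancesByJobName("myJob", 0, Int.MaxValue).asScala
    .flatMap(instance => explorer.getJobExecutions(instance).asScala)
    .filter(execution => tag == execution.getJobParameters.getString("tag"))
    .toList

The JdbcTemplate route (option 2) would instead query the BATCH_JOB_EXECUTION_PARAMS metadata table (KEY_NAME/STRING_VAL in the 3.x schema) for the tag, which avoids loading every execution into memory but couples the code to Spring Batch's table structure.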