How to filter JBehave examples table rows based on scenario meta data

Is there a way we can filter JBehave examples table rows at runtime using the scenario meta data? For example:
Scenario: my scenario title
Meta:
#id 1
Examples:
|Meta:|col1|col2|
|id 1 |val1|val2|
|id 2 |val |val |
|id 1 |val |val |
When we run this scenario, it should iterate only over the 1st and 3rd rows, based on the meta data set on the scenario.
What I am trying to do is externalize data across scenarios/stories and use only the filtered data rows applicable to a particular scenario.
I found some similar topics on meta filtering, but nothing specific to this.
Appreciate any help. Thanks

The meta character # must be used in the examples table, in this way:
Scenario: some scenario
Meta: #id
Given I pass value '1'
Examples:
|Meta:|col1|col2|
|#id 1|val1|val2|
|#id 2|val |val |
|#id 1|val |val |
Then you need to define the filter in the configuration, for example:
configuredEmbedder().useMetaFilters(Arrays.asList("+id 1"));
More on this topic can be found here:
http://jbehave.org/reference/stable/meta-filtering.html

Related

Scala: best way to update a Delta table after filling missing values

I have the following delta table
+-+----+
|A|B |
+-+----+
|1|10 |
|1|null|
|2|20 |
|2|null|
+-+----+
I want to fill the null values in column B based on the A column.
I came up with the following to do so:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last
var df = spark.sql("select * from MyDeltaTable")
val w = Window.partitionBy("A")  // whole-partition frame per value of A
df = df.withColumn("B", last("B", ignoreNulls = true).over(w))
Which gives me the desired output:
+-+----+
|A|B |
+-+----+
|1|10 |
|1|10 |
|2|20 |
|2|20 |
+-+----+
Now, my question is:
What is the best way to write the result back to my delta table correctly?
Should I merge? Re-write with the overwrite option?
My delta table is huge and it will keep growing; I am looking for the best possible method to achieve this.
Thank you
It depends on the distribution of the rows that contain the null values you'd like to fill (i.e. are they all in one file or spread across many?).
MERGE will rewrite entire files, so you may end up rewriting enough of the table to justify simply overwriting it instead. You'll have to test this to determine what's best for your use case.
Also, to use MERGE, you need to filter the dataset down to only the changes. Your example "desired output" table has all the data, which would fail to MERGE in its current state because there are duplicate keys.
Check the Important! section in the docs for more details.
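For reference, a minimal sketch of both paths, assuming the filled df from above, that MyDeltaTable is a Delta table registered in the metastore, and a hypothetical unique key column id for the MERGE variant (the example table has no such key, so the data would first need to be keyed/deduplicated):
// Option 1: full overwrite - simplest when most files would be rewritten anyway;
// Delta's snapshot isolation keeps readers on the old version until the commit.
df.write
  .format("delta")
  .mode("overwrite")
  .saveAsTable("MyDeltaTable")

// Option 2: MERGE - only worthwhile if the changed rows touch a small fraction of files.
// Requires reducing the source to just the changed rows plus a unique key ("id" is hypothetical).
import io.delta.tables.DeltaTable
val changes = df.filter("B is not null")
DeltaTable.forName(spark, "MyDeltaTable")
  .as("t")
  .merge(changes.as("s"), "t.id = s.id")
  .whenMatched()
  .updateExpr(Map("B" -> "s.B"))
  .execute()
Benchmarking both on your real table is the only reliable way to pick between them.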

Best Way to Manage Derived Properties

I have a couple of custom NSManagedObject subclasses that have various relationships between each other. Below is a very simplified example. In production there should be ~10 instances of A, >= 10k instances of B, and <30 instances of C.
First, I'm trying to track the sum of B.value for specific categories in A. Second, I'm tracking the sum of B.value in C if B.date is between C.startDate and C.endDate. C instances form a linked-list style chain representing sequential windows in time.
If B.value changes, manually going to A and C and updating the cached value in each is fairly simple. Updating the date in B is a little tougher, as I'd have to search through the list and update it.
With all of this in mind, I've been trying to determine the best way in Core Data to keep these cached values up to date. My current thought is the mediator pattern, NotificationCenter, or KVO. The mediator pattern is not super flexible but would work. NotificationCenter seems ideal; however, I'm not sure how to ensure that all instances of C are always in memory and subscribed to the publisher. KVO seems solid, but it doesn't seem to report edits for objects already in to-many relationships. What is the best way to keep these objects in sync with each other in the most Core Data-esque way?
+---------+             +--------+             +---------+
|A        |             |B       |             |C        |
+---------+             +--------+             +---------+
|totalCatA| <------->>  |category| <<------->  |total    |
|totalCatB|             |date    |             |startDate|
+---------+             |value   |             |endDate  |
                        +--------+             |prevC    |
                                               |nextC    |
                                               +---------+

Question regarding the best choice for specifying a schema in Spark, and how to delete/remove a column from an RDD in Spark?

I have two datasets: NYC taxi data and weather data. The weather data has a huge number of columns, around 100, of which I need only 5-10. And I want them to be typed instead of strings, hence I need a schema. I know two ways for this:
RDD -> Rows -> give a schema and then convert to a DF
DataFrame inferSchema (the documentation says it makes two passes over the data)
Is inferSchema a good choice for the 100-column case, so I don't have to write a StructType or case class for 100 columns?
And the taxi data has billion+ records and also around 60 columns, but I need only around 10. What would be a suitable choice for this dataset? Writing a schema for all 60 columns?
Second question: as I mentioned, I don't need all the columns, so I'm dropping columns. From the documentation and the Internet, I got to know how to do this using a DF, just with the select function.
But in case I have to write a schema, which to my knowledge is only possible using an RDD: how do I remove/drop columns from an RDD?
Ideally, using either an RDD or a DF, I'd like to drop columns first and then specify a schema. Is this possible?
I know it's a lot of questions, but I'm a newbie with Spark; all this popped up in my mind and I want to do it the right way.
Thanks
You don't need an RDD to achieve this; it's really simple. Just load your data into a DF, then select and cast the wanted columns.
scala> val df = Seq("1","2","3").toDF("c1")
scala> df.show()
+---+
| c1|
+---+
| 1|
| 2|
| 3|
+---+
scala> df.printSchema()
root
|-- c1: string (nullable = true)
scala> val newDF = df.select('c1.cast("int"))
scala> newDF.printSchema()
root
|-- c1: integer (nullable = true)
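To avoid writing a StructType for all ~60 taxi columns, the same idea scales: read everything as strings (no inference pass), declare types only for the handful of columns you need, and select/cast by that partial schema. A sketch, with hypothetical column names and file path:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// declare types only for the columns we actually need (names and path are hypothetical)
val wanted = StructType(Seq(
  StructField("passenger_count", IntegerType),
  StructField("trip_distance", DoubleType),
  StructField("fare_amount", DoubleType)
))

val raw = spark.read.option("header", "true").csv("/data/nyc_taxi/*.csv")  // all columns read as strings
val typed = raw.select(wanted.fields.map(f => col(f.name).cast(f.dataType)): _*)  // keep and cast only the wanted ones
typed.printSchema()
inferSchema would also work, but as you noted it costs an extra pass over the data, which is worth avoiding on a billion-row dataset.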

Iterate over a data frame and use those values in a Spark SQL statement

I have a data frame, say:
DF
Animal
======
Cat
Dog
Horse
I want to iterate over these values and use them in a Spark SQL statement.
Can someone please help me with this?
Spark Dataset/DataFrame APIs are declarative rather than imperative (like SQL), which means you describe what you want the end data to be and let the Spark engine figure out the exact transformations.
What you're describing doesn't make sense as a use case for Spark.
It's a weird use case, but you can iterate over your values and do whatever you want with a foreach.
INPUT
df.show
+------+
|animal|
+------+
| cat|
| dog|
| horse|
+------+
STATEMENT
Just as I used a print here, you can apply any other function, but as said in the comments, it's a bit weird:
df.foreach(row => println(row.getAs[String](0)))
This piece gets you the actual value:
row.getAs[String](0)
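If the goal is to feed each animal into a separate Spark SQL statement, note that you cannot call spark.sql from inside an executor-side foreach; a common pattern is to collect the (small) column to the driver and loop there. A sketch, assuming a hypothetical registered view named sightings:
import spark.implicits._

// pull the values to the driver (fine here, the column is tiny)
val animals = df.select("animal").as[String].collect()

animals.foreach { a =>
  // build and run one SQL statement per value ("sightings" is a hypothetical view)
  spark.sql(s"SELECT count(*) AS n FROM sightings WHERE animal = '$a'").show()
}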

Understanding some basics of Spark SQL

I'm following http://spark.apache.org/docs/latest/sql-programming-guide.html
After typing:
val df = spark.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
I have some questions that I didn't see the answers to.
First, what is the $-notation?
As in
df.select($"name", $"age" + 1).show()
Second, can I get the data from just the 2nd row (and I don't know what the data is in the second row)?
Third, how would you read in a color image with Spark SQL?
Fourth, I'm still not sure what the difference is between a Dataset and a DataFrame in Spark. The variable df is a DataFrame, so could I change "Michael" to the integer 5? Could I do that in a Dataset?
$ is not an annotation. It is a method call (a shortcut for new ColumnName("name")).
You wouldn't. Spark SQL has no notion of row indexing.
You wouldn't. You can use the low-level RDD API with specific input formats (like the ones from the HIPI project) and then convert.
For the Dataset vs DataFrame question, see: Difference between DataSet API and DataFrame
1) For question 1, the $ sign is used as a shortcut for selecting a column and applying functions on top of it. For example:
df.select($"id".isNull).show
which can otherwise be written as
df.select(col("id").isNull)
2) Spark does not have row indexing, but for prototyping you can use df.take(10)(i), where i is the index of the element you want. Note: the result could differ between runs, as the underlying data is partitioned and unordered.
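Putting the first two answers together in a short sketch (using the df from the question; note that the $ syntax is only available after importing spark.implicits._):
import spark.implicits._   // brings the $"..." column syntax into scope

// $"age" + 1 builds a Column expression; nothing runs until an action such as show()
df.select($"name", $"age" + 1).show()

// no row indexing in Spark SQL, but for prototyping you can pull a few rows to the driver
val secondRow = df.take(2)(1)            // an org.apache.spark.sql.Row; order is not guaranteed
println(secondRow.getAs[String]("name"))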