spark dataframe handle option some none - scala

Similar to "Apache Spark: dealing with Option/Some/None in RDDs", I have a function that is applied via df.mapPartitions:
def mapToTopics(iterator: Iterator[RawRecords]): Iterator[TopicContent] = {
  iterator.map(k => {
    browser.parseString(k.content) >> elementList("doc").map(d => {
      TopicContent((d >> text("docno")).head, (d >> text("text")).head, k.path)
    })
  })
}
The following is also defined:
@transient lazy val browser = JsoupBrowser()
case class TopicContent(topic: String, content: String, filepath: String)
case class RawRecords(path: String, content: String)
The code above throws a NoSuchElementException if the expected XML tags with text are missing, which happens for some malformed documents.
How can I correct and simplify this code so that it handles the options properly?
When I tried to use util.Try as outlined in the link above and applied a flatMap, my code failed because it was operating on Char instead of Element.
edit
try {
  Some(TopicContent((d >> text("docno")).head, (d >> text("text")).head, k.path))
} catch {
  case noelem: NoSuchElementException => {
    println(d.head)
    None
  }
}
})
val flattended = results.flatten
Unfortunately, this only returns an Option[Nothing].
edit4
https://gist.github.com/geoHeil/bfb01427b88cf58ea755f912ce539712 a minimal sample without spark (and full code below as well)
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.scraper.ContentExtractors.elementList
@transient lazy val browser = JsoupBrowser()
val broken =
"""
|<docno>
| LA051089-0001
| </docno>
| <docid>
| 54901
| </docid>
| <date>
| <p> May 10, 1989, Wednesday, Home Edition </p>
| </date>
| <section>
| <p> Metro; Part 2; Page 2; Column 2 </p>
| </section>
| <graphic>
| <p> Photo, Cloudy and Clear A stormy afternoon provides a clear view of Los Angeles' skyline, with the still-emerging Library Tower rising above its companion buildings. KEN LUBAS / Los Angeles Times </p>
| </graphic>
| <type>
| <p> Wild Art </p>
| </type>
""".stripMargin
val correct =
"""
|<DOC>
|<DOCNO> FR940104-0-00001 </DOCNO>
|<PARENT> FR940104-0-00001 </PARENT>
|<TEXT>
|
|<!-- PJG FTAG 4700 -->
|
|<!-- PJG STAG 4700 -->
|
|<!-- PJG ITAG l=90 g=1 f=1 -->
|
|<!-- PJG /ITAG -->
|
|<!-- PJG ITAG l=90 g=1 f=4 -->
|Federal Register
|<!-- PJG /ITAG -->
|
|<!-- PJG ITAG l=90 g=1 f=1 -->
|&blank;/&blank;Vol. 59, No. 2&blank;/&blank;Tuesday, January 4, 1994&blank;/&blank;Rules and Regulations
|
|<!-- PJG 0012 frnewline -->
|
|<!-- PJG /ITAG -->
|
|<!-- PJG ITAG l=01 g=1 f=1 -->
|Vol. 59, No. 2
|<!-- PJG 0012 frnewline -->
|
|<!-- PJG /ITAG -->
|
|<!-- PJG ITAG l=02 g=1 f=1 -->
|Tuesday, January 4, 1994
|<!-- PJG 0012 frnewline -->
|
|<!-- PJG 0012 frnewline -->
|
|<!-- PJG /ITAG -->
|
|<!-- PJG /STAG -->
|
|<!-- PJG /FTAG -->
|</TEXT>
|</DOC>
""".stripMargin
case class RawRecords(path: String, content: String)
case class TopicContent(topic: String, content: String, filepath: String)
val raw = Seq(RawRecords("first", correct), RawRecords("second", broken))
val result = mapToTopics(raw.iterator)
// Variant 1
def mapToTopics(iterator: Iterator[RawRecords]): Iterator[TopicContent] = {
  iterator.flatMap(k => {
    val documents = browser.parseString(k.content) >> elementList("doc")
    documents.map(d => {
      val docno = d >> text("docno")
      // try {
      val textContent = d >> text("text")
      TopicContent(docno, textContent, k.path)
      // } catch {
      // case _:NoSuchElementException => TopicContent(docno, None, k.path)
      // }
    }) //.filter(_.content !=None)
  })
}
// When broken down even further, you can see that the following produces Options of strings
browser.parseString(raw(0).content) >> elementList("doc").map(d => {
  val docno = d >> text("docno")
  val textContent = d >> text("text")
  (docno.headOption, textContent.headOption)
})
// while the version below now maps to characters. What is wrong here?
val documents = browser.parseString(raw(0).content) >> elementList("doc")
documents.map(d => {
  val docno = d >> text("docno")
  val textContent = d >> text("text")
  (docno.headOption, textContent.headOption)
})

The difference between the two examples lies in the precedence of the operators. When you're doing browser.parseString(raw(0).content) >> elementList("doc").map(...), you're calling map on elementList("doc"), and not on the whole expression. In order for the first example to behave the same as the second one, you need to write either (browser.parseString(raw(0).content) >> elementList("doc")).map(...) (recommended) or browser.parseString(raw(0).content) >> elementList("doc") map(...).
In the context of scala-scraper, the library you're using, the two expressions mean very different things. With browser.parseString(raw(0).content) >> elementList("doc") you're extracting a List[Element] from a document, and calling map on that does just what you'd expect from a collection. On the other hand, elementList("doc") is an HtmlExtractor[List[Element]] and calling map on an extractor creates a new HtmlExtractor with the results of the original extractor transformed. That's the reason why you end up with two very different results.
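The same precedence effect can be reproduced outside Scala. Below is a minimal Python sketch, not scala-scraper itself: Extractor, Doc and element_list are hypothetical toy stand-ins invented for illustration. It only shows that a method call binds tighter than >>, so map lands on the extractor unless the extraction is parenthesised first.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Extractor:
    """Toy stand-in for an extractor: wraps a function from a document to a value."""
    run: Callable[[Any], Any]

    def map(self, f):
        # Builds a NEW extractor; f receives the whole extracted result at once.
        return Extractor(lambda doc: f(self.run(doc)))


class Doc:
    """Toy document holding a list of element names (hypothetical)."""

    def __init__(self, elements):
        self.elements = elements

    def __rshift__(self, extractor):
        # doc >> extractor runs the extractor against this document.
        return extractor.run(self)


element_list = Extractor(lambda doc: list(doc.elements))
doc = Doc(["a", "b", "c"])

# The method call binds tighter than >>, so map is called on the extractor
# and the lambda sees the full list exactly once.
on_extractor = doc >> element_list.map(lambda xs: len(xs))

# Parenthesised: >> runs first and yields a plain list, which is then
# transformed element by element.
on_result = [x.upper() for x in (doc >> element_list)]

print(on_extractor)
print(on_result)
```

In the first form the function is handed the whole list; in the second it is handed one element at a time, which mirrors the difference described above.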

I am unfamiliar with the API you are using, but using headOption in a for comprehension might help you:
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL.Parse._
iterator.map(k => {
  browser.parseString(k.content) >> elementList("doc").flatMap(d => {
    for {
      docno <- (d >> text("docno")).headOption
      text <- (d >> text("text")).headOption
    } yield TopicContent(docno, text, k.path)
  })
})
This way you only construct a TopicContent (really a Some(TopicContent)) when both docno and text are present, and get None otherwise. The flatMap then drops all the Nones and unwraps the Somes, leaving you with a collection of TopicContent instances for all valid XML.
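For readers less used to Scala's Option, here is a rough Python sketch of the same "build it only when every field is present, then drop the misses" pattern. It assumes each parsed document is represented as a plain dict, which is a hypothetical stand-in and not what scala-scraper returns; the helper to_topic and the sample path are also made up for illustration.

```python
from typing import NamedTuple, Optional


class TopicContent(NamedTuple):
    topic: str
    content: str
    filepath: str


def to_topic(fields: dict, path: str) -> Optional[TopicContent]:
    """Return a TopicContent only when both fields are present, else None."""
    docno = fields.get("docno")
    text = fields.get("text")
    if docno is None or text is None:
        return None
    return TopicContent(docno, text, path)


# Hypothetical parsed documents: the second one lacks a "text" field.
docs = [
    {"docno": "FR940104-0-00001", "text": "Federal Register"},
    {"docno": "LA051089-0001"},
]

# Equivalent of flatMap over Options: keep the hits, drop the Nones.
topics = [t for t in (to_topic(d, "some/path") for d in docs) if t is not None]
print(topics)
```

Only the complete document survives; the malformed one is skipped without an exception.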

How to check for level in react-testing-library?

I have a heading <h4>Big offer!</h4> on the page, when I first ran my tests I got:
Expected: "Big offer!"
Received: <h4>Big offer!</h4>
35 | const switchToggle = screen.getByRole('checkbox');
36 | expect(switchToggle.checked).toEqual(true);
> 37 | expect(titleEl).toEqual(title);
OK, it gets the correct element, so I changed my code to actually check the heading level and text, but it failed:
Expected: "Big offer!"
Received: undefined
35 | const switchToggle = screen.getByRole('checkbox');
36 | expect(switchToggle.checked).toEqual(true);
> 37 | expect(titleEl.name).toEqual(title);
| ^
38 | expect(titleEl.level).toEqual(4);
I always thought that .name is equal to getByText(). I commented out a line and tried checking for level, and it failed again:
Expected: 4
Received: undefined
36 | expect(switchToggle.checked).toEqual(true);
37 | //expect(titleEl.name).toEqual(title);
> 38 | expect(titleEl.level).toEqual(4);
I don't understand why my test cases failed. Code for test was:
const title = 'Big offer!';
render(<Component title={title}/>);
const titleEl = screen.getByRole('heading');
expect(titleEl).toEqual(title);
expect(titleEl.level).toEqual(4);
It won't work because screen.getByRole (like the other queries) returns an HTMLElement (in your case an HTMLHeadingElement, which inherits from HTMLElement), and that has no name or level properties. You can check more here.
To check the rendered text, you should use:
expect(titleEl).toHaveTextContent(title);
And to check that it is an h4, you just need to filter by level in the query:
const titleEl = screen.getByRole('heading', { level: 4 });
The full test:
it('test search input', async () => {
  const title = 'Big offer!';
  render(<SearchBox title={title} />);
  const titleEl = screen.getByRole('heading', { level: 4 });
  expect(titleEl).toBeInTheDocument();
  expect(titleEl).toHaveTextContent(title);
});

How to create map with different type in pyspark?

I have a requirement where I have to concatenate maps of different types.
Sample data looks like this:
+----------+--------------------+-------+----------+-----------+---------------+---------------+-------------+----+----+
|ADDRESS_ID|              CAS_ID|VERSION|RG_ADDRESS|IS_VERIFIED|GEOFENCE_RADIUS|ADDRESS_RECENCY|ADDRESS_SCORE| LAT| LNG|
+----------+--------------------+-------+----------+-----------+---------------+---------------+-------------+----+----+
|  75199688|           tmdsfggds|      6|          |      false|           1000|              1|           85|null|null|
+----------+--------------------+-------+----------+-----------+---------------+---------------+-------------+----+----+
I have tried the code below, but it is not working due to a type mismatch:
column_list_string = ['ADDRESS_ID', 'CAS_ID','RG_ADDRESS']
column_list_int = ['VERSION', 'GEOFENCE_RADIUS', 'ADDRESS_RECENCY', 'ADDRESS_SCORE']
column_list_double = ['LAT', 'LNG']
column_list_bool = ['IS_VERIFIED']
def convert(s, c = column_list):
    print(s,c)
    return {c[0]: s[0], c[1] : s[1], c[2]: s[2], c[3]: s[3], c[4] : s[4], c[5]: s[5] ,c[6]: s[6], c[7] : s[7], c[8]: s[8] ,c[9]: s[9]}
convert_udf_str = F.udf(convert, MapType(StringType(), StringType()))
convert_udf_int = F.udf(convert, MapType(StringType(), IntegerType()))
convert_udf_double = F.udf(convert, MapType(StringType(), DoubleType()))
convert_udf_bool = F.udf(convert, MapType(StringType(), BooleanType()))
dfs = dfs.withColumn('value_dict_string', convert_udf_str(F.struct(column_list_string)))
dfs = dfs.withColumn('value_dict_int', convert_udf_int(F.struct(column_list_int)))
dfs = dfs.withColumn('value_dict_double', convert_udf_double(F.struct(column_list_double)))
dfs = dfs.withColumn('value_dict_booleean', convert_udf_bool(F.struct(column_list_bool)))
dfs = dfs.withColumn('value_dict', map_concat(dfs['value_dict_string'],dfs['value_dict_int'],dfs['value_dict_double'],dfs['value_dict_booleean']))
The error I see is:
Py4JJavaError: An error occurred while calling o1179.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve 'map_concat(value_dict_string, value_dict_int, value_dict_double, value_dict_booleean)' due to data type mismatch: input to function map_concat should all be the same type, but it's [map<string,string>, map<string,int>, map<string,double>, map<string,boolean>];;
'Project [ADDRESS_ID#994, CAS_ID#995, VERSION#996, RG_ADDRESS#997, IS_VERIFIED#998, GEOFENCE_RADIUS#999, ADDRESS_RECENCY#1000, ADDRESS_SCORE#1001, LAT#1002, LNG#1003, LAT_LNG_ACCURACY#1004, LAT_LNG_CONFIDENCE#1005, map_concat(value_dict_string#1423, value_dict_int#1442, value_dict_double#1461, value_dict_booleean#1480) AS value_dict#1498, value_dict_string#1423, value_dict_int#1442, value_dict_double#1461, value_dict_booleean#1480]

How to remove SOAP Optional elements in the request

Using Karate, I would like to build a SOAP request.
There are optional elements in the SOAP request. How can I remove them based on the Scenario Outline examples?
I have shared an example for the purpose of discussion. If there is already sample code, please share it. Thank you.
Feature: SOAP request to get Customer Details
Background:
Given url 'http://localhost:8080/CustomerService_V2_Ws'
Scenario Outline:
* def removeElement =
"""
function(parameters, inputXml) {
  if (parameters.city == null)
    karate.remove(inputXml, '/Envelope/Header/AutHeader/ClientContext/city');
  if (parameters.zipcode == null)
    karate.remove(inputXml, '/Envelope/Header/AutHeader/ClientContext/zipcode');
  return inputXml;
}
"""
* def inputXml = read('soap-request.xml')
* def updatedXml = removeElement(parameters,inputXml)
Given request updatedXml
When soap action ''
Then status <http_code>
Examples:
| CustomerId | ZipCode | City |
| 001 | null | null |
| 002 | 41235 | null |
| 003 | null | New York |
**Contents of "soap-request.xml"**
<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Header>
<wsc1:AutHeader xmlns:wsc1="http://example.com/ws/WSCommon_v22">
<wsc1:SourceApplication>ABC</wsc1:SourceApplication>
<wsc1:DestinationApplication>SoapUI</wsc1:DestinationApplication>
<wsc1:Function>CustomerService.readDetails</wsc1:Function>
<wsc1:Version>2</wsc1:Version>
<wsc1:ClientContext>
<wsc1:customerid>10000</wsc1:customerid>
<!--Optional:-->
<wsc1:zipcode>11111</wsc1:zipcode>
<!--Optional:-->
<wsc1:city>xyz</wsc1:city>
</wsc1:ClientContext>
</wsc1:AutHeader>
</soapenv:Header>
<soapenv:Body />
</soapenv:Envelope>
Yes, my suggestion is to use embedded expressions. When the expression is prefixed with two hash signs, it means "delete if null", which was designed for your exact use case. Here is an example:
Scenario: set / remove xml chunks using embedded expressions
* def phone = '123456'
# this will remove the <acc:phoneNumberSearchOption> element
* def searchOption = null
* def search =
"""
<acc:getAccountByPhoneNumber>
<acc:phoneNumber>#(phone)</acc:phoneNumber>
<acc:phoneNumberSearchOption>##(searchOption)</acc:phoneNumberSearchOption>
</acc:getAccountByPhoneNumber>
"""
* match search ==
"""
<acc:getAccountByPhoneNumber>
<acc:phoneNumber>123456</acc:phoneNumber>
</acc:getAccountByPhoneNumber>
"""
Do note that there are many more examples here: xml.feature

django cms plugin instance related_set returns empty list

I have the following models
class NewSlide(models.Model):
    slider = models.ForeignKey('NewSliderPlugin')
    title = models.CharField(max_length=255)
    content = models.TextField(max_length=80, null=True)
    link = models.CharField(max_length=255)
    image = models.ImageField(upload_to='slides', null=True)
    visible = models.BooleanField(default=False)
    def __unicode__(self): # Python 3: def __str__(self):
        return self.title
class NewSliderPlugin(CMSPlugin):
    title = models.CharField(max_length=255)
    template = models.CharField(max_length=255, choices=(('slider.html','Top Level Slider'), ('slider2.html','Featured Slider')))
The plugin code as below:
class NewSlideInline(admin.StackedInline):
    model = NewSlide
    extra = 1
class NewCMSSliderPlugin(CMSPluginBase):
    model = NewSliderPlugin
    name = "NewSlider"
    render_template = "slider.html"
    inlines = [NewSlideInline]
    def render(self, context, instance, placeholder):
        self.render_template = instance.template
        print instance.title
        print instance.newslide_set.all(), 1111111111111111
        context.update({
            'slider': instance,
            'object': instance,
            'placeholder': placeholder
        })
        return context
I have added slides to the plugin and published the changes; however, instance.newslide_set.all() returns an empty list: [] 1111111111111111
Update:
It creates 2 records. Somehow the admin references 49, but the render code gets 63.
mysql> select * from cmsplugin_newsliderplugin;
+------------------+-----------+-------------+
| cmsplugin_ptr_id | title | template |
+------------------+-----------+-------------+
| 49 | slide | slider.html |
| 63 | slide | slider.html |
+------------------+-----------+-------------+
mysql> select * from slider_newslide;
+----+-----------+-------+---------+------+----------------+---------+
| id | slider_id | title | content | link | image | visible |
+----+-----------+-------+---------+------+----------------+---------+
| 6 | 49 | ttttt | testt | test | slides/287.jpg | 0 |
+----+-----------+-------+---------+------+----------------+---------+
By the way, I have django-reversion installed, not sure if it's because of this app.
OK according to the documentation I need to copy the related items:
class NewSliderPlugin(CMSPlugin):
    title = models.CharField(max_length=255)
    template = models.CharField(max_length=255, choices=(('slider.html','Top Level Slider'), ('slider2.html','Featured Slider')))
    def copy_relations(self, oldinstance):
        for slide in oldinstance.newslide_set.all():
            # instance.pk = None; instance.save() is the slightly odd but
            # standard Django way of copying a saved model instance
            slide.pk = None
            slide.slider = self
            slide.save()

Trouble understanding HTML::Element documentation in Perl

I was looking into the HTML::Element documentation and came across the attr_get_i method. According to the documentation:
In list context, returns a list consisting of the values of the given
attribute for $h and for all its ancestors starting from $h and
working its way up.
Now, according to the example given there:
<html lang='i-klingon'>
<head><title>Pati Pata</title></head>
<body>
<h1 lang='la'>Stuff</h1>
<p lang='es-MX' align='center'>
Foo bar baz <cite>Quux</cite>.
</p>
<p>Hooboy.</p>
</body>
</html>
If $h is the <cite> element, $h->attr_get_i("lang") in list context will return the list ('es-MX', 'i-klingon').
According to my understanding, the returned list should be ('es-MX', 'la', 'i-klingon'); that is, it should also consider <h1 lang='la'>Stuff</h1>, but according to the documentation it doesn't.
Why am I wrong here?
The 'lang' attributes here are:
+-------------+------------------+
| lang | path |
+-------------+------------------+
| i-klingon | /html |
| la | /html/body/h1 |
| es-MX | /html/body/p |
+-------------+------------------+
The <cite> node does not have <h1> as its parent (path is /html/body/p/cite), so <h1> is not its ancestor. This is why the method does not return it.
<h1 lang='la'>Stuff</h1> is not an ancestor of <cite>, it is a sibling.
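To make the ancestor walk concrete, here is a small Python sketch that mimics what the documentation describes. It is not HTML::Element; it uses Python's xml.etree.ElementTree on the same markup, and inherited_attr is a hypothetical helper written only for this illustration. It starts at <cite> and climbs parent by parent, collecting lang wherever it is set.

```python
import xml.etree.ElementTree as ET

HTML = """<html lang='i-klingon'>
<head><title>Pati Pata</title></head>
<body>
<h1 lang='la'>Stuff</h1>
<p lang='es-MX' align='center'>
Foo bar baz <cite>Quux</cite>.
</p>
<p>Hooboy.</p>
</body>
</html>"""

root = ET.fromstring(HTML)

# ElementTree has no parent pointers, so build a child -> parent lookup.
parent_of = {child: parent for parent in root.iter() for child in parent}


def inherited_attr(node, name):
    """Collect the attribute from node and each of its ancestors, innermost first."""
    values = []
    while node is not None:
        if name in node.attrib:
            values.append(node.attrib[name])
        node = parent_of.get(node)  # None once we pass the root
    return values


cite = root.find(".//cite")
langs = inherited_attr(cite, "lang")
print(langs)
```

The walk visits cite, p, body and html in that order. It never visits h1, because h1 is not on the path from cite to the root.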