PyDeequ: how to use pattern matching in PySpark

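from pydeequ.analyzers import *
# The chained calls are wrapped in parentheses so Python accepts the line breaks.
# The pattern is a raw string so that \s reaches the regex engine unchanged.
# The second character class originally read [a-za-Z], which is not a valid range.
analysisResult = (AnalysisRunner(spark)
    .onData(df)
    .addAnalyzer(Size())
    .addAnalyzer(PatternMatch(column="country", pattern_regex=r"^[a-zA-Z]+(?:[\s-][a-zA-Z]+)*$"))
    .run())
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
# This does not match any row, and I have tried several different patterns. Do I need to import a library to use regex?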
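One way to rule out the pattern itself is to test it outside Spark. The sketch below uses Python's built-in re module as a stand-in. Deequ evaluates the pattern on the JVM, so this is only an approximation of what PatternMatch sees, and the sample country names are made up for illustration. It also checks whether the character class as typed in the question, [a-za-Z], is accepted at all.

```python
import re

# Corrected pattern: letters, optionally separated by single spaces or hyphens.
COUNTRY_PATTERN = r"^[a-zA-Z]+(?:[\s-][a-zA-Z]+)*$"

# The pattern as typed in the question; "a-Z" runs backwards, so it is not a valid range.
BROKEN_PATTERN = r"^[a-zA-Z]+(?:[\s-][a-za-Z]+)*$"


def matches_country(value):
    """Return True when the whole value matches COUNTRY_PATTERN."""
    return re.match(COUNTRY_PATTERN, value) is not None


def compiles(pattern):
    """Return True when Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Hypothetical sample values, not taken from the question's DataFrame.
samples = ["India", "United States", "Guinea-Bissau", "USA1", ""]
for value in samples:
    print(repr(value), matches_country(value))
print("broken pattern compiles:", compiles(BROKEN_PATTERN))
```

If the pattern behaves as expected here but PatternMatch still reports no matches, the next things to inspect are probably the column values themselves (leading or trailing whitespace, nulls) rather than a missing import.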

How do you write to Kafka using the ABRiS library in PySpark?

Has anyone been able to write to Kafka with this library from PySpark?
I've been able to successfully read using the code from the README documentation:
import logging, traceback
import requests
from pyspark.sql import Column
from pyspark.sql.column import *
jvm_gateway = spark_context._gateway.jvm
abris_avro = jvm_gateway.za.co.absa.abris.avro
naming_strategy = getattr(getattr(abris_avro.read.confluent.SchemaManager, "SchemaStorageNamingStrategies$"), "MODULE$").TOPIC_NAME()
schema_registry_config_dict = {"schema.registry.url": schema_registry_url,
                               "schema.registry.topic": topic,
                               "value.schema.id": "latest",
                               "value.schema.naming.strategy": naming_strategy}
conf_map = getattr(getattr(jvm_gateway.scala.collection.immutable.Map, "EmptyMap$"), "MODULE$")
for k, v in schema_registry_config_dict.items():
    conf_map = getattr(conf_map, "$plus")(jvm_gateway.scala.Tuple2(k, v))
deserialized_df = data_frame.select(Column(abris_avro.functions.from_confluent_avro(data_frame._jdf.col("value"), conf_map))
                                    .alias("data")).select("data.*")
However, I am struggling to extend the behaviour by writing to topics via the to_confluent_avro function.

CSV Feeders for gatling 3

I am using Gatling 3. I have a CSV feeder with a single column titled accountIds. I need to pass this value in the body of my POST request as JSON. I have tried many different syntaxes, but nothing seems to work. I also cannot print what is actually being sent in the body. It works if I remove the "${accountIds}" placeholder and use an actual value instead. Below is my code:
val searchFeeder = csv("C://data/accountids.csv").random
val scn1 = scenario("Scenario 1")
  .feed(searchFeeder)
  .exec(http("Search")
    .post("/v3/accounts/")
    .body(StringBody("""{"accountIds": "${accountIds}"}""")).asJson)
setUp(scn1.inject(atOnceUsers(10)).protocols(httpConf))
Have you enabled the TRACE level in logback.xml to see the details of the POST request?
Also, can you confirm that the location you mentioned, "C://data/accountids.csv", is correct? Generally, the data folder resides in the project directory, and within the project you can access the data file as:
val searchFeeder = csv("data/stack.csv").random
I just created a sample script and enabled logging. I can see that accountIds is being passed:
package basicpackage
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.core.scenario.Simulation
class StackFeeder extends Simulation {
  val httpConf = http.baseUrl("http://example.com")
  val searchFeeder = csv("data/stack.csv").random
  val scn1 = scenario("Scenario 1")
    .feed(searchFeeder)
    .exec(http("Search")
      .post("/v3/accounts/")
      .body(StringBody("""{"accountIds": "${accountIds}"}""")).asJson)
  setUp(scn1.inject(atOnceUsers(1)).protocols(httpConf))
}

Scala - Get files based on name pattern

I would like to filter the files based on some patterns like :
- Team_*.txt (e.g.: Team_Orlando.txt);
- Name.*.City.txt (e.g.: Name.Robert.California.txt);
Or any name (the pattern * . *, written with spaces here because the formatting broke my text otherwise).
All the filters come from a database table and they are dynamic.
I'm trying to avoid using OS commands like cp or mv. Is it possible to filter files using patterns like the ones above?
Here is what I've tried, but I got a regex error:
def getFiles(dir:File, filter:String) = {
  (dir.isDirectory, dir.exists) match {
    case (true, true) =>
      dir.listFiles.filter(f => f.getName.matches(filter))
    case _ =>
      Array[File]()
  }
}
You can use java.nio's Files.newDirectoryStream() for that; it accepts a glob pattern in the desired format:
val stream = Files.newDirectoryStream(dir.toPath, pattern) // dir is a java.io.File in the question, so convert it to a Path
See http://docs.oracle.com/javase/tutorial/essential/io/dirs.html#glob for a detailed description.
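The regex error in the question most likely comes from treating a glob as a regular expression, and that difference is easy to check in isolation. As a language-neutral illustration, the sketch below uses Python's standard library rather than Scala. The first two file names are the examples from the question, README is a made-up non-matching name, and the helper names glob_filter and regex_filter are hypothetical.

```python
import fnmatch
import re

# Two file names from the question plus one hypothetical name without a dot.
names = ["Team_Orlando.txt", "Name.Robert.California.txt", "README"]


def glob_filter(candidates, pattern):
    """Keep the names that match a shell-style glob (case-sensitive)."""
    return [n for n in candidates if fnmatch.fnmatchcase(n, pattern)]


def regex_filter(candidates, pattern):
    """Keep the names that fully match a regular expression."""
    compiled = re.compile(pattern)
    return [n for n in candidates if compiled.fullmatch(n)]


# Glob: * stands for any run of characters.
print(glob_filter(names, "Team_*.txt"))
# Regex: _* means "zero or more underscores", so the same text selects nothing here.
print(regex_filter(names, "Team_*.txt"))
# Glob: any name that contains a dot.
print(glob_filter(names, "*.*"))

try:
    # Regex: a leading * has nothing to repeat, so compilation fails.
    regex_filter(names, "*.*")
except re.error as exc:
    print("regex error:", exc)
```

The same pattern text gives different results under the two syntaxes, which is why handing the database filters to a glob-aware API avoids translating them by hand.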

Working with banana-rdf

Does anyone have an example of how to properly integrate banana-rdf into a project?
Based on the example on how to use a SPARQL engine, I have tried to set up something for my project, but I get an error that I don't know how to resolve.
import java.net.URL
import org.w3.banana.jena.JenaModule
import org.w3.banana.{SparqlHttpModule, SparqlOpsModule, RDFOpsModule, RDFModule}
object SparqlService extends RDFModule with RDFOpsModule with SparqlOpsModule
  with SparqlHttpModule with JenaModule
import SparqlService._
import SparqlService.sparqlOps
import SparqlService.sparqlOps._
import SparqlService.sparqlHttp.sparqlEngineSyntax._
import SparqlService.ops._
val endpoint = new URL("http://dbpedia.org/sparql/")
val query = parseSelect("""
  PREFIX ont: <http://dbpedia.org/ontology/>
  SELECT DISTINCT ?language WHERE {
    ?language a ont:ProgrammingLanguage .
    ?language ont:influencedBy ?other .
    ?other ont:influencedBy ?language .
  } LIMIT 100
""").get
val answers: Rdf#Solutions = endpoint.executeSelect(query).get
val languages: Iterator[Rdf#URI] = answers.iterator map { row =>
  row("language").get.as[Rdf#URI].get
}
println(languages.to[List])
Unfortunately, I get the following error and I don't get why.
Error:(27, 26) could not find implicit value for parameter fromPG:
org.w3.banana.binder.FromPG[org.w3.banana.jena.Jena,com.hp.hpl.jena.graph.Node_URI]
row("language").get.as[Rdf#URI].get
Any idea?

PyQt4 QComboBox autocomplete without using setModel?

I have found several excellent examples of a PyQt4 QComboBox with autocomplete (e.g. "How do I Filter the PyQt QCombobox Items based on the text input?"), but they all use setModel, setSourceModel, etc.
Is it possible to create an autocomplete QComboBox in PyQt4 without using a model?
Following smitkpatel's comment, I found a setCompleter example that works. It was posted by flutefreak at "QComboBox with autocompletion works in PyQt4 but not in PySide".
from PyQt4 import QtCore
from PyQt4 import QtGui
class AdvComboBox(QtGui.QComboBox):
    def __init__(self, parent=None):
        super(AdvComboBox, self).__init__(parent)
        self.setFocusPolicy(QtCore.Qt.StrongFocus)
        self.setEditable(True)
        # add a filter model to filter matching items
        self.pFilterModel = QtGui.QSortFilterProxyModel(self)
        self.pFilterModel.setFilterCaseSensitivity(QtCore.Qt.CaseInsensitive)
        self.pFilterModel.setSourceModel(self.model())
        # add a completer, which uses the filter model
        self.completer = QtGui.QCompleter(self.pFilterModel, self)
        # always show all (filtered) completions
        self.completer.setCompletionMode(QtGui.QCompleter.UnfilteredPopupCompletion)
        self.setCompleter(self.completer)
        # connect signals
        def filter(text):
            print "Edited: ", text, "type: ", type(text)
            self.pFilterModel.setFilterFixedString(str(text))
        self.lineEdit().textEdited[unicode].connect(filter)
        self.completer.activated.connect(self.on_completer_activated)
    # on selection of an item from the completer, select the corresponding item from combobox
    def on_completer_activated(self, text):
        if text:
            index = self.findText(str(text))
            self.setCurrentIndex(index)
if __name__ == "__main__":
    import sys
    app = QtGui.QApplication(sys.argv)
    combo = AdvComboBox()
    names = ['bob', 'fred', 'bobby', 'frederick', 'charles', 'charlie', 'rob']
    combo.addItems(names)
    combo.resize(300, 40)
    combo.show()
    sys.exit(app.exec_())