Compute Metrics Using Deequ with Scala

I am new to Scala and Amazon Deequ. I have been asked to write Scala code that computes metrics (e.g. Completeness, CountDistinct) on constraints using Deequ, on source CSV files stored in S3, and loads the generated metrics into a Glue table that will then be used for reporting.
Can anyone point me in the right direction towards online resources that would help me achieve this? Since I am new to both Scala and Deequ, could anyone give me a sample Scala snippet and explain how the Deequ libraries are used?
Please let me know if additional information is required to explain my question better.

Thank you for your interest in Deequ. The GitHub page of Deequ has information on how to get started with it: https://github.com/awslabs/deequ
Additionally, there is a post on the AWS Big Data blog with some examples as well: https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/
Best,
Sebastian

You can check the examples available here: https://github.com/awslabs/deequ/tree/master/src/main/scala/com/amazon/deequ/examples
Hope that helps.
Take some time to read the documentation too.
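To make that concrete, here is a minimal sketch following the pattern from the linked examples and blog post. The S3 paths and the column name are placeholders, and loading the result into a Glue table (for example by writing the metrics to S3 and cataloguing that location with a Glue crawler) is a separate step:

import com.amazon.deequ.analyzers.{Completeness, CountDistinct, Size}
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("deequ-metrics").getOrCreate()

// Read the source CSV files from S3 (path and header option are placeholders).
val data = spark.read.option("header", "true").csv("s3://my-bucket/source/")

// Compute the metrics you care about with the AnalysisRunner.
val result: AnalyzerContext = AnalysisRunner
  .onData(data)
  .addAnalyzer(Size())
  .addAnalyzer(Completeness("customer_id"))
  .addAnalyzer(CountDistinct("customer_id"))
  .run()

// Turn the computed metrics into a DataFrame and persist them for reporting.
val metrics = successMetricsAsDataFrame(spark, result)
metrics.write.mode("overwrite").parquet("s3://my-bucket/deequ-metrics/")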

Related

How can I load data from PostgreSQL into Deeplearning4j?

Could you help me understand how I can load data from PostgreSQL into Deeplearning4j (DL4J), please?
I would appreciate it if anyone has an example.
Thanks in advance.
If you want to load your data from Postgres before vectorization, you can use the JDBCRecordReader. It lives in the datavec-jdbc Maven artifact.
For an example of how to use it, check out its unit test: https://github.com/eclipse/deeplearning4j/blob/master/datavec/datavec-jdbc/src/test/java/org/datavec/api/records/reader/impl/JDBCRecordReaderTest.java
This will give you access to your data in record form. If all of your data is already numeric: great. If not, you will have to vectorize it. Explaining how to do that is probably too long for Stack Overflow. Take a look at https://www.dubs.tech/guides/quickstart-with-dl4j/ where I explain how to do it with records coming from a CSV file.
If you're still confused beyond that, feel free to ask questions on community.konduit.ai, where follow-up questions are easier to handle than on Stack Overflow.
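For reference, a rough Scala sketch of the JDBCRecordReader route described above. The package name follows the linked test, the connection details, table name and query are placeholders, and the exact constructor and initialize behaviour may differ between DataVec versions:

import javax.sql.DataSource
import org.datavec.api.records.reader.impl.jdbc.JDBCRecordReader
import org.postgresql.ds.PGSimpleDataSource

// Any javax.sql.DataSource pointed at your Postgres instance works;
// the connection details below are placeholders.
val ds: DataSource = {
  val pg = new PGSimpleDataSource()
  pg.setServerName("localhost")
  pg.setDatabaseName("mydb")
  pg.setUser("user")
  pg.setPassword("secret")
  pg
}

// Each row comes back as a DataVec record (a java.util.List of Writable values).
val reader = new JDBCRecordReader("SELECT * FROM my_table", ds)
reader.initialize(null) // the JDBC reader does not use an InputSplit

while (reader.hasNext) {
  val record = reader.next()
  println(record) // vectorize any non-numeric columns before training
}
reader.close()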
DL4J, like any other framework, works with tensors (INDArray in DL4J's case). So in order to "load data" you actually have to convert it to tensors. That applies to any data source: text, images, MP3s; everything is vectorized before being sent into the neural network.

Create Parquet file in Scala without Spark

I am trying to write streaming JSON messages directly to Parquet using Scala (no Spark). I have only seen a couple of posts online, and this post; however, the ParquetWriter API it relies on is deprecated and the solution doesn't actually provide an example to follow. I read some other posts too but didn't find any descriptive explanation.
I know I have to use the ParquetFileWriter API, but the lack of documentation is making it difficult for me to use. Can someone please provide an example of it along with all the constructor parameters and how to create those parameters, especially the schema?
You may want to try Eel, a toolkit for manipulating data in the Hadoop ecosystem.
I recommend reading the README to gain a better understanding of the library, but to give you a sense of how it works, what you are trying to do would look somewhat like the following:
// Stream JSON records from a local file using Eel's JsonSource
val source = JsonSource(() => new FileInputStream("input.json"))
// Write them out as Parquet via Eel's ParquetSink (Path is org.apache.hadoop.fs.Path)
val sink = ParquetSink(new Path("output.parquet"))
// Pipe every record from the source into the sink
source.toDataStream().to(sink)
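If you would rather stay on the raw Parquet APIs the question mentions, one commonly used route (not part of the answer above, so treat it as an assumption) is parquet-avro's AvroParquetWriter, which takes an Avro schema rather than a ParquetFileWriter-level one. A minimal sketch with a placeholder schema:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.ParquetWriter

// Avro schema describing one JSON message; adjust the fields to your messages.
val schemaJson =
  """{"type":"record","name":"Message","fields":[{"name":"id","type":"string"},{"name":"value","type":"double"}]}"""
val schema: Schema = new Schema.Parser().parse(schemaJson)

// Build a Parquet writer backed by that Avro schema.
val writer: ParquetWriter[GenericRecord] = AvroParquetWriter
  .builder[GenericRecord](new Path("output.parquet"))
  .withSchema(schema)
  .build()

// One GenericRecord per incoming JSON message (parsing the JSON is up to you).
val record: GenericRecord = new GenericData.Record(schema)
record.put("id", "msg-1")
record.put("value", 42.0)
writer.write(record)
writer.close()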

OrientDB shortest path with dynamic weight?

I am trying to create a graph in OrientDB where the weight of edges has to be calculated on demand using data from another database. I would like to know if there is a way to do this, since all examples I've seen use static weight properties, none of which are dynamic by nature.
If I could use a stored function as a property and have it evaluated each time I call shortestPath, that would solve my problem, but I haven't found any documentation on this topic.
Help would be greatly appreciated!
This isn't supported by OrientDB out of the box, even though it would be a nice thing to have. Could you open a new issue?
As for a solution, I suggest cloning the OSQLFunctionDijkstra class, making your changes, and plugging it into the OrientDB engine under a different name.
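For illustration, registering such a cloned function would look roughly like the sketch below. MyWeightedDijkstra is a hypothetical clone of OSQLFunctionDijkstra whose weight lookup queries your other database, and OSQLEngine.registerFunction is OrientDB's hook for custom SQL functions:

import com.orientechnologies.orient.core.sql.OSQLEngine

// MyWeightedDijkstra (hypothetical) is a copy of OSQLFunctionDijkstra with the
// edge-weight lookup replaced by a call to the external database.
OSQLEngine.getInstance().registerFunction("myDijkstra", new MyWeightedDijkstra())

Once registered, the function can be called from SQL by the name you registered it under, the same way you would call dijkstra().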

Documentation or specification for .step and .stp files

I am looking for some kind of specification, documentation, explanation, etc. for .stp/.step files.
It's more about what information each line contains rather than general information.
I can't seem to figure out what each value means by myself.
Does anyone know some good readings about STEP files?
I have already searched Google, but all I found was information about the general structure rather than about each particular value.
The structure of a STEP-File, i.e. the grammar and the logic behind how the file is organized, is described in the standard ISO 10303-21.
ISO 10303, or STEP, is divided into Application Protocols (APs). Each AP defines a schema written in EXPRESS. The schemas are available on the Internet: the CAX-IF provides some, and STEPtools has some good HTML documentation.
The reference versions of the AP schemas are hosted on stepmod.
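To make the structure described above concrete, here is a small, purely illustrative ISO 10303-21 (Part 21) excerpt; the entity names and values are placeholders, and which entities may appear, and what their attributes mean, is defined by the EXPRESS schema of the chosen AP:

ISO-10303-21;
HEADER;
FILE_DESCRIPTION((''),'2;1');
FILE_NAME('example.stp','2024-01-01T00:00:00',(''),(''),'','','');
FILE_SCHEMA(('AUTOMOTIVE_DESIGN'));
ENDSEC;
DATA;
#10=CARTESIAN_POINT('',(0.,0.,0.));
#11=DIRECTION('',(0.,0.,1.));
#12=AXIS2_PLACEMENT_3D('',#10,#11,$);
ENDSEC;
END-ISO-10303-21;

Each line in the DATA section has the form #id=ENTITY_NAME(attribute_1,attribute_2,...); the attributes are positional, and their meaning comes from that entity's definition in the AP's EXPRESS schema, which is why the schema, rather than Part 21 itself, tells you what each value means.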

Solr and custom update handler

I have a question about Solr and the possibility of implementing a customized update handler.
Basically, the scenario is this:
FIELD-A: my main field
FIELD-B and FIELD-C: two copyFields with FIELD-A as their source
After FIELD-A has its value stored, I need this value to be copied into FIELD-B and FIELD-C, then processed (let's say by extracting a substring) and stored in FIELD-B and FIELD-C before indexing time. I'm not using DIH.
Edit: I'm pushing my data via Nutch (forgot to mention that).
As far as I've understood, copyField triggers after indexing (but I'm not so sure about this).
I've already read through the wiki page and I still don't understand a lot of things:
1) Is a custom update processor an alternative to the conditional copyField approach, or do both have to exist in my Solr?
2) After creating my conditional copyField JAR file, how do I declare it in my schema?
3) How do I have to modify my solrconfig.xml to use my updater?
4) If I'm choosing the wrong way, any suggestion is appreciated, preferably with some examples or well-documented links.
I have read a lot (Googling and the Lucene mailing list on Nabble), but there isn't much documentation about this. I just need to create a custom updater for my two copyFields.
Thanks all in advance!
It's not really complicated. The following is an excellent link I came across for writing a custom Solr update handler:
http://knackforge.com/blog/selvam/integrating-solr-and-mahout-classifier
I tested it in my Solr and it works just fine!
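For illustration, a custom update processor along the lines of the link above might look roughly like this in Scala; the class name, field names and the substring transformation are placeholders, not something prescribed by Solr:

import org.apache.solr.common.SolrInputDocument
import org.apache.solr.request.SolrQueryRequest
import org.apache.solr.response.SolrQueryResponse
import org.apache.solr.update.AddUpdateCommand
import org.apache.solr.update.processor.{UpdateRequestProcessor, UpdateRequestProcessorFactory}

// Copies a (placeholder) substring of FIELD-A into FIELD-B and FIELD-C
// before the document is handed on to the indexing chain.
class SubstringCopyProcessorFactory extends UpdateRequestProcessorFactory {
  override def getInstance(req: SolrQueryRequest, rsp: SolrQueryResponse,
                           next: UpdateRequestProcessor): UpdateRequestProcessor =
    new UpdateRequestProcessor(next) {
      override def processAdd(cmd: AddUpdateCommand): Unit = {
        val doc: SolrInputDocument = cmd.getSolrInputDocument
        Option(doc.getFieldValue("FIELD-A")).map(_.toString).foreach { value =>
          val processed = value.take(3) // placeholder transformation
          doc.setField("FIELD-B", processed)
          doc.setField("FIELD-C", processed)
        }
        super.processAdd(cmd)
      }
    }
}

Such a factory is not declared in schema.xml; it is declared in solrconfig.xml inside an updateRequestProcessorChain (typically followed by solr.LogUpdateProcessorFactory and solr.RunUpdateProcessorFactory), the chain is referenced by name from the update request handler (e.g. via the update.chain parameter in Solr 4), and the JAR goes into a lib directory that Solr loads.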
If you are using Solr 4 or planning to use it, http://wiki.apache.org/solr/ScriptUpdateProcessor could be an easier solution. Have fun!