How to filter rows with a column constraint in Deequ's ColumnProfilerRunner? - scala

I am new to Scala and Spark. I am exploring the Amazon Deequ library for data profiling.
How do I get the count of rows having a particular value while using ColumnProfilerRunner()?
The AnalysisRunner has a "compliance" option; I am looking for a similar option to filter rows that comply with a given column constraint.
I have multiple columns, so I want to check them dynamically instead of hard-coding column names.
Appreciate any help.
Thanks

Deequ's column profiler computes a fixed set of statistics. If you want to compute custom statistics on your data, you should use the VerificationSuite. Check out the examples on Deequ's GitHub page.
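Setting Deequ aside for a moment, a "compliance" metric is just the count (or fraction) of rows satisfying a predicate. A minimal plain-Scala sketch of applying such a check dynamically to every column, without Spark or Deequ and with made-up data, might look like:

```scala
// Sketch of a "compliance"-style check applied dynamically to every
// column; rows are modeled as Maps and the data is illustrative.
val rows: Seq[Map[String, Int]] = Seq(
  Map("a" -> 1, "b" -> 5),
  Map("a" -> 1, "b" -> 2),
  Map("a" -> 3, "b" -> 1)
)

// Discover column names from the data instead of hard-coding them.
val columns: Seq[String] = rows.headOption.map(_.keys.toSeq).getOrElse(Seq.empty)

// Count rows where the given column holds a particular value.
def countMatching(col: String, value: Int): Int =
  rows.count(r => r.get(col).contains(value))

// Apply the same check to every column dynamically.
val counts: Map[String, Int] = columns.map(c => c -> countMatching(c, 1)).toMap
```

With Deequ's VerificationSuite you would express the predicate as a compliance check per column instead; the point here is only that the column list can be derived from the data and looped over.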

Related

Using dplyr correctly to combine shared values of a row to a new column of a table

How do I combine data from two tables based on certain shared values from the row?
I already tried using the which function and it didn't work.
I think you will have the best luck using the dplyr function. Specifically, you can use right_join(). You can write it like this: right_join(df1, df2, by = "specification")
This will combine the columns from df2 with the matching rows of df1, joining on the shared specification column (and keeping every row of df2).
For future reference, it would help a lot if you included a snippet of your code, just so it is easier to know exactly what you are asking.
Anyway, let me know if this answers your question!
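For intuition, right-join semantics (keep every row of the second table, attach the matching columns from the first by the shared key) can be sketched outside R too. A minimal Scala illustration with hypothetical data:

```scala
// Sketch of right-join semantics: keep all rows of df2 and attach the
// matching value from df1 via the shared "specification" key (data is made up).
val df1 = Seq(("specA", "red"), ("specB", "blue")) // (specification, color)
val df2 = Seq(("specA", 10), ("specC", 30))        // (specification, size)

val df1ByKey: Map[String, String] = df1.toMap

// Every row of df2 survives; keys absent from df1 get None for its column.
val joined: Seq[(String, Int, Option[String])] =
  df2.map { case (spec, size) => (spec, size, df1ByKey.get(spec)) }
```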

vega: Can I create marks using information coming from two datasets?

I would like to create some marks, where the information of the size comes from one dataset and the information of the color comes from another dataset.
Is this possible?
Or can I update created marks (created with dataset 1) by using information from a second dataset?
Yes, you can do it.
You can use the lookup transform, provided there is a shared lookup key in both datasets.
For instance, a 'category' field can serve as the key on which the lookup transform joins the two datasets.
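A minimal, hypothetical fragment of such a spec, assuming a primary dataset that supplies size and a secondary dataset that supplies color, both keyed on category, might look like:

```json
{
  "data": [
    {"name": "colors", "values": [{"category": "A", "color": "steelblue"}]},
    {
      "name": "sizes",
      "values": [{"category": "A", "size": 20}],
      "transform": [
        {
          "type": "lookup",
          "from": "colors",
          "key": "category",
          "fields": ["category"],
          "values": ["color"]
        }
      ]
    }
  ]
}
```

After the transform, each row of "sizes" carries the color field pulled from "colors", so a single mark encoding can read size from one dataset and color from the other.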

Azure Data Factory Mapping Dataflow add Rownumber

I thought this would be fairly straightforward, but I can't really find a simple way of doing it. I want to add a unique row number to a source dataset in an ADF Mapping Dataflow. In SSIS I would have done this with a Script Component, but there's no option for that, as far as I can see, in ADF. I've looked for suitable functions in the Derived Column expression editor and also the Aggregate component, but there doesn't appear to be one.
Any ideas how this could be achieved?
Thanks
Many options:
Add a surrogate key transform
Hash row columns in Derived Column using SHA2
Use the rowNumber() function in a Window transformation
Give those a shot and let us know what you think
I did it like this:
Added a column with the same value for all the rows (I've used an integer with value = 1);
Added a Window transformation, using the column created in step 1 as the Over clause;
Added a window column with any name and rowNumber() as the expression.
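Outside ADF, the effect of those steps is easy to see: windowing over a constant column puts every row in one partition, so rowNumber() just assigns a sequential index. A plain-Scala sketch of the idea, with illustrative data:

```scala
// Sketch: a constant partition column plus rowNumber() amounts to
// numbering all rows sequentially, which zipWithIndex reproduces here.
val rows = Seq("alpha", "beta", "gamma")

// rowNumber() is 1-based, hence the i + 1.
val numbered: Seq[(Int, String)] =
  rows.zipWithIndex.map { case (row, i) => (i + 1, row) }
```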

Pivot data in Talend

I have some data which I need to pivot in Talend. This is a sample:
brandname,metric,value
A,xyz,2
B,xyz,2
A,abc,3
C,def,1
C,ghi,6
A,ghi,1
Now I need this data to be pivoted on the metric column like this:
brandname,abc,def,ghi,xyz
A,3,null,1,2
B,null,null,null,2
C,null,1,6,null
Currently I am using tPivotToColumnsDelimited to pivot the data to a file and then reading it back from that file. However, having to store data in an external file and read it back is messy and adds unnecessary overhead.
Is there a way to do this with Talend without writing to an external file? I tried to use tDenormalize but as far as I understand, it will return the rows as 1 column which is not what I need. I also looked for some 3rd party component in TalendExchange but couldn't find anything useful.
Thank you for your help.
Assuming that your metrics are fixed, you can use their names as columns of the output. The pivot then has two parts: first, a tMap that transposes the value of each input row (in) into the corresponding column of the output row (out), and second, a tAggregate that groups the map's output rows by brandname.
In the tMap you'd have to fill the columns conditionally, for example for the output column named "abc":
out.abc = "abc".equals(in.metric) ? in.value : null
In the tAggregate you'd have to group by out.brandname and aggregate each column as a sum, ignoring nulls.
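To see why the two steps work, the tMap-plus-tAggregate pivot can be mirrored in plain Scala: each metric becomes a conditionally filled column (nulls modeled as Option), and grouping by brand then keeps the non-null value per column:

```scala
// Plain-Scala sketch of the tMap + tAggregate pivot described above,
// using the sample data from the question.
case class Row(brand: String, metric: String, value: Int)

val input = Seq(
  Row("A", "xyz", 2), Row("B", "xyz", 2), Row("A", "abc", 3),
  Row("C", "def", 1), Row("C", "ghi", 6), Row("A", "ghi", 1)
)
val metrics = Seq("abc", "def", "ghi", "xyz")

// tMap step: one Option column per metric, filled only when it matches.
def transpose(r: Row): Map[String, Option[Int]] =
  metrics.map(m => m -> (if (m == r.metric) Some(r.value) else None)).toMap

// tAggregate step: group by brand and sum each column, ignoring "nulls"
// (an all-None column reduces to None).
val pivoted: Map[String, Map[String, Option[Int]]] =
  input.groupBy(_.brand).map { case (brand, rs) =>
    brand -> metrics.map { m =>
      m -> rs.flatMap(r => transpose(r)(m)).reduceOption(_ + _)
    }.toMap
  }
```

The result matches the desired table: brand A gets abc=3, ghi=1, xyz=2 and a null for def, and so on.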

HBase - MultiGet and Selective Columns

I can use multiget to selectively query multiple arbitrary rows from HBase:
http://hostname:port/tablename/multiget/?row=row1&row=row2
And for selecting a few columns:
http://hostname:port/tablename/rowkey/columnFamily:columnName
How can I use multiget and select only a few columns at the same time?
Looking at this (HBASE-3541) JIRA issue for multi-gets, it seems like there is no option to specify columns when using multiget. However, it also seems like it would be pretty simple to add this functionality.
EDIT: the issue I opened regarding this problem (linked here) was resolved and included in HBase's 1.3.0 release, and now it is possible to select only specific columns.