One of our clients has a requirement to build/develop data quality rules using HiveQL.
E.g., replace NULL values, change date format to YYYY-MM-DD, standardize amount column values between US and EU formats, etc.
Problem Statement:
I have a set of data quality rules in one Hive table (dq_rules). I want to execute each rule one by one and store the errors (data issues such as a NULL column or an incorrectly formatted date column) in another Hive table (dq_logging) for reporting/logging purposes.
Please suggest a solution, keeping in mind that I want to make it generic and executable for any Hive table/column (i.e., it should be parameterized).
Restriction: I cannot use existing data quality tools. I need to complete this using Hive only (the restriction comes from the client).
Schema for Tables:
dq_rules => Validation Rule ID, Rule Category, DQ Dimension, Rule Description, Date Added, Date Retired
dq_logging => Error_ID, Source_Name, Erroneous_Source_Fields, Source_File_Record, Validation Rule ID
A solution based on a shell/Python script will also work for me; I just need it to be an end-to-end process.
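To make the idea concrete, here is a minimal HiveQL sketch of what executing one rule and logging its failures could look like. The table customer, its columns customer_id and email, the rule id DQ001, and the error_id expression are illustrative assumptions, not part of the posted schema; a shell or Python driver would loop over dq_rules and substitute the target table, column, and predicate into a template like this.

-- Illustrative only: log every row of a hypothetical `customer` table whose `email` is NULL,
-- against a hypothetical rule id 'DQ001' (a NULL-check rule from dq_rules).
INSERT INTO TABLE dq_logging
SELECT concat('DQ001-', cast(unix_timestamp() AS string))   AS error_id,
       'customer'                                            AS source_name,
       'email'                                               AS erroneous_source_fields,
       concat_ws('|', cast(customer_id AS string), email)    AS source_file_record,
       'DQ001'                                               AS validation_rule_id
FROM customer
WHERE email IS NULL;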
I am using MDriven build 7.0.0.11347 for a DDD project and have a model designed in an .ecomdl file.
In this file I have a class Job with WorkDone as one of its properties. The backing SQL table has a WorkDone varchar(255) field. I wanted to increase the length of this field, so I changed the WorkDone property length from 255 to 2000, which modified the code file. However, when the application runs EvolveSchema, the evolve process doesn't recognize this change, so no scripts are generated and the database never receives the update.
Can you please help me get this change persisted to the database? I thought about increasing the length manually in the SQL table, but then whenever the database is recreated (a new environment, QA, production) it would have to be done again every time, which I don't want to do.
In MDriven we don't evolve attribute changes; we only write a warning (255->2000: this change will not be evolved).
You should take steps to alter the column in the database yourself.
We should fix this in the future, but currently it is a limitation.
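For example, if the backing database is SQL Server, a one-off script along these lines could be run against each environment (the SQL Server dialect is my assumption; the table and column names come from the question):

-- Assumes SQL Server; adjust the syntax for your database and keep the column's NULL/NOT NULL setting.
ALTER TABLE Job ALTER COLUMN WorkDone varchar(2000);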
To expand on my comment, in older versions of MySQL a VARCHAR column can only hold 0-255 characters.
Using TEXT will allow for non-binary (character) strings, and BLOB will allow for binary (byte) strings.
Your mileage may vary as to what you can do with them, as I am drawing on MySQL knowledge and knowledge bases (since you don't specify your SQL flavor).
See below for explanations of the types:
char / varchar
blobs / text
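In MySQL terms, switching such a column to TEXT could look like this (the table and column names are taken from the question above and are only placeholders here):

-- MySQL: change a VARCHAR column to TEXT to allow longer character strings.
ALTER TABLE Job MODIFY WorkDone TEXT;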
Is there any way to alter a Cassandra column from timestamp to date without data loss? For example, '2021-02-25 20:30:00+0000' to '2021-02-25'.
If not, what is the easiest way to migrate this column (timestamp) to a new column (date)?
It's impossible to change the type of an existing column, so you need to add a new column with the correct data type and perform a migration. The migration could be done via Spark + the Spark Cassandra Connector; it's the most flexible solution, and it could even be done on a single-node machine with Spark running in local master mode (the default). The code could look something like this (try it on test data first):
import pyspark.sql.functions as F
# assumes a SparkSession `spark` with the Spark Cassandra Connector on the classpath
options = { "table": "tbl", "keyspace": "ks"}
# read the table, cast the timestamp column to date, and write it back as the new column
spark.read.format("org.apache.spark.sql.cassandra").options(**options).load()\
.select("pk_col1", "pk_col2", F.col("timestamp_col").cast("date").alias("new_name"))\
.write.format("org.apache.spark.sql.cassandra").options(**options)\
.mode("append").save()  # append mode, since the target table already exists
P.S. You can also use DSBulk, for example, but you need to have enough space to offload the data (although you only need the primary key columns plus your timestamp).
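One detail to keep in mind: the new date column has to exist before the copy, and adding it is plain CQL (new_name here mirrors the alias used in the Spark snippet above):

-- CQL: add the new date column to the existing table before running the migration.
ALTER TABLE ks.tbl ADD new_name date;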
To add to Alex Ott's answer, there are validations done in Cassandra that prevent changing the data type of a column. The reason is that SSTables (Cassandra data files) are immutable -- once they are written to disk, they are never modified/edited/updated. They can only be compacted into new SSTables.
Some try to get around it by dropping the column from the table and then adding it back in with a new data type. Unlike in a traditional RDBMS, the existing data in the SSTables doesn't get updated, so if you try to read the old data you'll get a CorruptSSTableException because the CQL type of the data on disk won't match that of the schema.
For this reason, it is no longer possible to drop/recreate columns with the same name (CASSANDRA-14948). If you're interested, I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/8018/. Cheers!
You can use ToDate to convert it. For example, the table Email has a column Date with the format 2001-08-29 13:03:35.000000+0000.
Select Date, ToDate(Date) as Convert from keyspace.Email:
 date                             | convert
----------------------------------+------------
 2001-08-29 13:03:35.000000+0000  | 2001-08-29
I am pretty new to Pentaho, so my question might sound very novice.
I have written a transformation in which I am using a CSV file input step and a Table input step.
Steps I followed:
Initially, I created a parameter in the transformation properties. The parameter birthdate doesn't have any default value set.
I have used this parameter in the PostgreSQL query in the Table input step in the following manner:
select * from person where EXTRACT(YEAR FROM birthdate) > ${birthdate};
I am reading the CSV file using the CSV file input step. How do I assign the birthdate value present in my CSV file to the parameter I created in the transformation?
(OR)
Could you guide me through the process of assigning the CSV field value directly to the SQL query used in the Table input step, without the use of a parameter?
TLDR;
I recommend using a "database join" step like in my third suggestion below.
See the last image for reference
First idea - Using Table Input as originally asked
Well, you don't need any parameter for that, unless you are going to provide the value for that parameter when asking the transformation to run. If you need to read data from a CSV you can do that with this approach.
First, read your CSV and make sure your rows are ok.
After that, use a Select values step to keep only the columns to be used as parameters.
In the Table input, use a placeholder (?) to determine where to place the data, and set it to run for each row that it receives from the source step.
Just keep in mind that the order of the columns received by the Table input (the columns coming out of the Select values step) is the same order in which they will be used for the placeholders (?). This should not be a problem for your question, which uses only one placeholder, but keep it in mind as you ramp up your use of Pentaho.
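With the query from the question, the Table input SQL would then use the placeholder instead of the variable, roughly like this (if the incoming CSV field arrives as a string you may need a cast, as shown in the ?::integer example in the answer below):

-- Table input SQL using a positional placeholder instead of ${birthdate}.
SELECT * FROM person WHERE EXTRACT(YEAR FROM birthdate) > ?;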
Second idea, using a Database Lookup
This is another approach where you can't customize the query sent to the database, but you may get better performance because you can set the "Enable cache" flag. If you don't need to use a function in your WHERE clause, this approach is really recommended.
Third idea, using a Database Join
That is my recommended approach if you need a function in your WHERE clause. It looks a lot like the Table input approach, but you can skip the Select values step and choose which columns to use, repeat the same column several times, and enable an "outer join" flag that also returns the rows for which the query produced no result.
Pro tip: If the transformation is running too slowly, try using multiple copies of the step (documentation here) and, obviously, make sure the table has the appropriate indexes in place.
Yes, there is a way of assigning the value directly without the use of a parameter. Do as follows.
Use a "Block this step until steps finish" step to halt the Table input step until the CSV input step completes.
Following is how you configure each step.
Note:
The Postgres query should be select * from person where EXTRACT(YEAR FROM birthdate) > ?::integer
Check Execute for each row and Replace variables in the Table input step.
Select only the birthdate column in the CSV input step.
I have this simple flow in Talend DI 6 (simplified for posting on SO):
The last step crashes with a NullPointerException, because missing XML attributes are returned as null.
Is there a way to get empty string values instead of nulls?
For now I'm using a tReplace step to remove nulls as a work-around, but it's tedious and adds to the cost of maintenance by creating one more place where the list of attributes needs to be maintained.
In Talend DI 5.6.2 it is possible to add default data values to the schema. The column in the schema is called "Default". If you expect strings, you can set an empty string, which is set if the column value is null:
Talend schema view with Default column
This also works for other data types. Talend DI 6 should still be able to do this, although the field might have been renamed.
Here's what I'd like to do.
I'd like to put "rules" in a database table. This is sort of like the Drools XLS decision table format, except that all the rules will be rows in a table. This way I can modify the rules easily. I need to put this in a table and not an XLS because my rules could change frequently. Is this possible with Drools? Can I build a knowledge base with rules retrieved from a DB (instead of a DRL or XLS file), and every time the rules change, can I rebuild the knowledge base from scratch (or maybe just parts of it, essentially updating only the rules that changed)?
It depends on what kind of rules you have in mind. A database-backed approach makes sense if you have lots of rules that have the same structure, and which only vary according to certain 'parameters'. In this case you can write a single generic rule, and use the database to store all of the combinations that apply. For example, suppose you have rules to calculate shipping rates per country for an order, e.g.:
rule "Shipping rates to France"
when
$order : Order(country == 'fr')
then
$order.setShippingRate(10.0);
update($order);
end
// Similar rules for other countries…
You could replace these rules with data from your database, where each CountryShippingRate row specifies the rate for one country. Then you insert all of the CountryShippingRate rows as fact objects into the rule session, together with a single rule like:
rule "Shipping rates"
when
$order : Order($country : country)
CountryShippingRate($rate : rate, country == $country)
then
$order.setShippingRate($rate);
update($order);
end
In practice, it turns out that lots of decision table type rules can be rewritten this way.
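On the database side, the data backing such a generic rule could be as simple as one table per parameterized rule, e.g. (an illustrative schema, not part of the original answer):

-- Illustrative: one row per country, loaded into the session as CountryShippingRate facts.
CREATE TABLE country_shipping_rate (
    country CHAR(2)        NOT NULL,
    rate    DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (country)
);
INSERT INTO country_shipping_rate (country, rate) VALUES ('fr', 10.00);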