Masking the logs of a Kafka connector - apache-kafka

I have a property file that contains some secrets (credentials) and certificates, and I don't want to log them.
So, is there any way to store those credentials somewhere else, or to keep them out of the logs?
Is there something called masking in Apache Kafka?

If you happen to be using KSQL for streaming queries, you can use the masking function MASK().
CREATE STREAM MASKED_PURCHASES AS
  SELECT MASK(CUSTOMER_NAME) AS CUSTOMER_NAME,
         MASK_RIGHT(DATE_OF_BIRTH, 12) AS DATE_OF_BIRTH,
         ORDER_ID, PRODUCT, ORDER_TOTAL_USD, TOWN, COUNTRY
  FROM PURCHASES;

ksql> SELECT CUSTOMER_NAME, DATE_OF_BIRTH, PRODUCT, ORDER_TOTAL_USD FROM MASKED_PURCHASES LIMIT 1;
Xxxxxx-Xxxxxx | 1908-03-nnXnn-nn-nnX | Langers - Mango Nectar | 5.80
Documentation source is here.
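For the Kafka Connect side of the question, Apache Kafka (since version 2.0, via KIP-297) also supports externalizing secrets from connector configurations using config providers, so the credentials never appear in the configuration itself. A minimal sketch, assuming the secrets live in a local file /opt/connect-secrets.properties (the path and the db.password key are illustrative):
# Worker configuration: register the built-in FileConfigProvider
config.providers=file
config.providers.file.class=org.apache.kafka.common.config.provider.FileConfigProvider
# Connector configuration: reference the secret via a placeholder;
# it is resolved at runtime instead of being stored in plain text
connection.password=${file:/opt/connect-secrets.properties:db.password}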

Related

Create Table without data aggregation

I just started using the ksqlDB feature in Confluent, and it stood out that it is not possible to run the following command: CREATE TABLE AS SELECT A, B, C FROM [STREAM_A] [EMIT CHANGES];
I wonder why this is not possible, or whether there's a way of doing it?
Data aggregation here feels like a heavy process for a simple requirement.
Edit 1: The source is a STREAM, not a TABLE.
The field types are:
String
Integers
Record
Let me share an example of the executed command that returns an error:
CREATE TABLE test_table
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, timestamp
, servicename
, content->assignedcontent
FROM created_stream
WHERE content->assignedcontent IS NOT NULL
[EMIT CHANGES];
The goal is to create a table with a smaller dataset and fewer fields than the original topic.
I think the confusion here is that you talk about a TABLE, but you're actually creating a STREAM. The two are different types of object.
A STREAM is an unbounded series of events - just like a Kafka topic. The only difference is that a STREAM has a declared schema.
A TABLE is state, for a given key. It's the same as KTable in Kafka Streams if you're familiar with that.
Both are backed by Kafka topics.
So you can do this (note that it creates a STREAM, not a TABLE):
CREATE STREAM test_stream
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, timestamp
, servicename
, content->assignedcontent
FROM created_stream
WHERE content->assignedcontent IS NOT NULL;
If you really want to create a TABLE then use the LATEST_BY_OFFSET aggregation, assuming you're using id as your key:
CREATE TABLE test_table
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, LATEST_BY_OFFSET(timestamp)
, LATEST_BY_OFFSET(servicename)
, LATEST_BY_OFFSET(content->assignedcontent)
FROM created_stream
WHERE content->assignedcontent IS NOT NULL
GROUP BY id;
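Once the table is materialized, the latest values for a given key can be read back with a pull query. A sketch, assuming id is the table's key and 42 is an illustrative key value (pull queries require a reasonably recent ksqlDB version):
SELECT * FROM test_table WHERE id = 42;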

Druid query to get "latest" value from third column

I have a table in Druid, something like
Timestamp || UserId || Action
And I need to get the latest Action for each UserId. In MySQL I would do something like
SELECT * FROM users u1 INNER JOIN (
  SELECT UserId, MAX(Timestamp) AS maxt FROM users GROUP BY UserId
) u2
ON u1.UserId = u2.UserId AND u1.Timestamp = u2.maxt
But Druid can't do joins and supports only very basic sub-selects.
I know the "right" answer is probably to denormalize the data at ingestion time, but unfortunately that's not an option as I don't "own" the ingestion part.
The only solution I have come up with so far is to retrieve the results of both queries in Java code and do the join manually, but I imagine I will run into memory constraints as the dataset grows.
I looked at materialized views, but that feature is still incubating and would require a Hadoop cluster, so it isn't really viable.
I tried to do something like
SELECT * FROM users u1 WHERE CONCAT(Timestamp, UserId) IN (
  SELECT CONCAT(MAX(Timestamp), UserId) FROM users GROUP BY UserId
)
But it didn't like that either.
Any suggestions?
LATEST(expr)
Returns the latest value of expr, which must be numeric. If expr comes from a relation with a timestamp column (like a Druid datasource), then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If expr does not come from a relation with a timestamp, then it is simply the last value encountered.
https://druid.apache.org/docs/0.20.0/querying/sql.html
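Applied to the table in the question, a sketch of a Druid SQL query using this aggregator; note that for string columns such as Action, LATEST takes a second maxBytesPerString argument (the value 1024 here is an illustrative assumption):
SELECT UserId, LATEST(Action, 1024) AS LatestAction
FROM users
GROUP BY UserId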

name is null error while doing group by column_name in confluent kafka ksql

I get an error in Confluent 5.0.0:
ksql> CREATE TABLE order_per_hour AS SELECT after->order_id, count(*) FROM transaction WINDOW SESSION(60 seconds) GROUP BY after->order_id;
name is null
after is the struct field in the schema.
A simple SELECT query without GROUP BY works fine.
I've submitted a PR to add support for this to KSQL here: https://github.com/confluentinc/ksql/pull/2076
Hope this helps,
Andy
Currently you can only use column names in the GROUP BY clause. As a workaround you can write your query as follows:
CREATE STREAM foo AS SELECT after->order_id as o_id FROM transaction;
CREATE TABLE order_per_hour AS SELECT o_id,count(*) FROM foo WINDOW SESSION(60 seconds) GROUP BY o_id;

Ksql: Left Join Displays columns from stream but not tables

I have one stream and a table in KSQL, as mentioned below:
Stream name: DEAL_STREAM
Table name: EXPENSE_TABLE
When I run the query below, it displays only columns from the stream; no table columns are displayed.
Is this the expected output? If not, am I doing something wrong?
SELECT TD.EXPENSE_CODE, TD.BRANCH_CODE, TE.EXPENSE_DESC
FROM DEAL_STREAM TD
LEFT JOIN EXPENSE_TABLE TE ON TD.EXPENSE_CODE = TE.EXPENSE_CODE
WHERE TD.EXPENSE_CODE LIKE '%NL%' AND TD.BRANCH_CODE LIKE '%AM%';
The output of the query is shown below.
NL8232##0 | AM | null
NL0232##0 | AM | null
NL6232#!0 | AM | null
NL5232^%0 | AM | null
When I run the query below, it displays only columns from the stream; no table columns are displayed.
In a stream-table (left) join, the output records will contain null for the table-side columns if there is no matching record in the table at the time of the join lookup.
Is this the expected output? If not, am I doing something wrong?
Is it possible that, for example, you wrote (1) the input data into the stream before you wrote (2) the input data into the table? If so, then the stream-table join query would have attempted to perform table lookups at time (1), when no such lookup data was available in the table yet (because that only happened later, at time (2)). Because no table data was available, the join produced output records in which the table-side columns were null.
Note: This stream-table join in KSQL (and, by extension, in Apache Kafka's Streams API, on which KSQL is built) is pretty much the norm for joins in the streaming world. Only the stream side of a stream-table join triggers downstream join output, and if there is no match on the table side at the time a new stream record is joined, the table-side columns will be null. Since this is a common cause of user confusion, we are currently working on adding table-side triggering of join output to Apache Kafka's Streams API and KSQL. Once such a feature is available, the issue you describe would no longer occur.

Select a single record on basis of Group By "Having"

I have a table, say "T1", from which I have to select only the one record that was last updated by any user. The expected result should be somewhat like:
Since you want the most recent DATETIME for each transaction (based on comments in response to another answer), you actually want to be able to retrieve more than just one record - one for each group of transactionID:
SELECT transactionID, MAX(updatedDateTime) AS MostRecent
FROM T1
GROUP BY transactionID
This works with test data that includes additional transactionIDs. I'll add a SQLFiddle if the site will work for me...
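If you also need the remaining columns of that most recent row (for example userID), a common pattern is to join the grouped result back to the table. A sketch, assuming updatedDateTime is unique within each transactionID:
SELECT t.*
FROM T1 t
INNER JOIN (
  SELECT transactionID, MAX(updatedDateTime) AS MostRecent
  FROM T1
  GROUP BY transactionID
) m
ON t.transactionID = m.transactionID AND t.updatedDateTime = m.MostRecent;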
You can try this:
SELECT X.* FROM T1 X
WHERE X.updatedDateTime = (SELECT MAX(updatedDateTime) FROM T1 Y WHERE Y.userID = X.userID)