How to copy partial topic data from one cluster to another

I have a use case where I need to copy data from a topic in one cluster to a topic in a different cluster, but only starting from a given offset. What can I use for this?
I have looked into MirrorMaker, since it copies data from one cluster to another, but I can't figure out how to specify the starting offset.
Is there any utility I can use?

If, as you say, "this will be a one-time operation", you can use kafkacat with its -o option.
For example (the easiest case):
kafkacat -C -b mybroker_cluster_1:9092 -t mytopic1 -o <offset> | \
kafkacat -P -b mybroker_cluster_2:9092 -t mytopic1
You probably still need to add a few parameters to the consumer:
-X message.max.bytes=<value> -X fetch.message.max.bytes=<value> -X receive.message.max.bytes=<value>
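Putting it together, a minimal sketch (the partition number, offset, and byte limits are placeholder assumptions; -e makes the consumer exit once it reaches the end of the partition so the pipe terminates):
kafkacat -C -b mybroker_cluster_1:9092 -t mytopic1 -p 0 -o 12345 -e \
  -X fetch.message.max.bytes=10485760 -X receive.message.max.bytes=104857600 | \
kafkacat -P -b mybroker_cluster_2:9092 -t mytopic1
Note that an absolute offset is only meaningful per partition; without -p, kafkacat starts every partition of the topic at the same -o value.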

How to get the current Nodes where the Job is running

I'm developing a job where the user can choose which nodes it runs on, so the Node Filter is left open for the user's convenience.
When the job starts I need to do a calculation based on the number of nodes chosen by the user. Is there a way to get this number?
Regards,
Alejandro L
By design, that information is only available after the execution. A good approach is to call the job via the API (in a script step); with the execution ID number (available in the API call output) you can list and count the nodes, e.g.:
#!/bin/bash
# query a finished execution and extract the nodes it ran on successfully
nodes=$(curl -s -X GET "http://localhost:4440/api/41/execution/16" \
--header "Accept: application/json" \
--header "X-Rundeck-Auth-Token: your_user_token" \
| jq -r '.successfulNodes[]')
# count one node name per word
number_of_nodes=$(echo "$nodes" | wc -w)
echo "Number of nodes: $number_of_nodes"
This example needs jq to extract the nodes from the API response.
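For the first half, a hedged sketch of triggering the job and capturing the execution ID (the job UUID and token are placeholders):
#!/bin/bash
# trigger the job; the JSON response contains the new execution's id
exec_id=$(curl -s -X POST "http://localhost:4440/api/41/job/your_job_uuid/run" \
--header "Accept: application/json" \
--header "X-Rundeck-Auth-Token: your_user_token" \
| jq -r '.id')
echo "Execution ID: $exec_id"
You can then substitute $exec_id for the hard-coded 16 in the example above.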
Anyway, your request sounds good for an enhancement request, please suggest that here.
A workaround would be to use the job.filter variable.
If you reference @job.filter@ in an inline script,
it returns a string with the list of nodes, like us-east-1-0,us-east-1-1,us-east-1-2.
If you save it as a string and then split the string on ',', you get an array of nodes:
IFS=',' read -r -a array <<< "$string"
and then you can get the number of nodes with
echo ${#array[@]}
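Putting the pieces together (a minimal sketch; @job.filter@ is expanded by Rundeck before the script runs, and the sample value is hypothetical):
#!/bin/bash
# Rundeck substitutes the node filter, e.g. us-east-1-0,us-east-1-1,us-east-1-2
string="@job.filter@"
IFS=',' read -r -a array <<< "$string"
echo "Number of nodes: ${#array[@]}"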
Note: as @MegaDrive68k mentioned, this won't work if you select all nodes with the .* filter.

How to print debugging information on one/specific OpenAPI model?

According to the OpenAPI Generator docs, here is how one can print the generator's model data:
$ java -jar openapi-generator-cli.jar generate \
-g typescript-fetch \
-o out \
-i api.yaml \
-DdebugModels
which outputs 39,000 lines, and it's a little difficult to find the model of one's interest.
How can I output debug information for just one model?
Unfortunately, there's no way to generate the debug log for just one model or operation.
As a workaround, you can draft a new spec that contains only the model you want to debug.
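Another rough workaround, if drafting a new spec is impractical: dump the debug output to a file and search it for the model you care about. A hedged sketch (the model name Pet is an assumption, and how much context to show depends on the model's size):
java -jar openapi-generator-cli.jar generate \
  -g typescript-fetch \
  -o out \
  -i api.yaml \
  -DdebugModels > models-debug.txt 2>&1
# show the lines following each mention of the model name
grep -n -A 40 '"Pet"' models-debug.txt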

Kafkacat Produce message from a file with headers

I need to produce messages to Kafka in batches, so I have a file that I feed to kafkacat:
kafkacat -b localhost:9092 -t <my_topic> -T -P -l /tmp/msgs
The content of /tmp/msgs is as follows
-H "id=1"
{"key" : "value0"}
-H "id=2"
{"key" : "value1"}
When I run the kafkacat command above, it produces four messages to Kafka - one message per line in /tmp/msgs.
I need to instruct kafkacat to parse the file correctly - that is, -H "id=1" is the header for the message {"key" : "value0"}.
How do I achieve this?
Thanks
You need to pass the headers as follows.
kcat -b localhost:9092 -t topic-name -P -H key1=value1 -H key2=value2 /temp/payload.json
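Note that -H applies the same header(s) to every message produced in one invocation, so for truly per-message headers a simple (if slow) workaround is one invocation per message. A minimal sketch, assuming the two messages from the question:
kcat -b localhost:9092 -t <my_topic> -P -H "id=1" <<< '{"key" : "value0"}'
kcat -b localhost:9092 -t <my_topic> -P -H "id=2" <<< '{"key" : "value1"}'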

Mahout clustering: How to retrieve the name of a named vector

I want to cluster multiple documents using Mahout. The clustering works fine but I have no idea how to find out which documents are located in each cluster.
I read that you can use the --namedVector option when creating the sparse files, but where does it take the ID from, and how can I retrieve this ID after the clustering is completed?
Right now I am doing the following steps:
I have a directory with a file for each document. The files are in the following format with the ID of the document as filename:
filename: documentID.txt
[TITLE]
[CONTENT]
I create a sparse directory with namedVectors using:
./mahout seqdirectory -i tmp/es-out -o tmp/es-out-seqdir -c UTF-8 -chunk 64 -xm sequential
./mahout seq2sparse -i tmp/es-out-seqdir -o tmp/es-out-sparse --maxDFPercent 85 --namedVector
Then I can cluster the results and create a dump:
./mahout kmeans -i tmp/es-out-sparse/tfidf-vectors -c tmp/es-kmeans-clusters -o tmp/es-kmeans -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 10 -k 20 -ow --clustering
./mahout clusterdump -i tmp/es-kmeans/clusters-10-final -o tmp/clusterdump -d tmp/es-out-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -sp 0 --pointsDir tmp/es-kmeans/clusteredPoints
The dump looks like this:
:VL-190{n=1 c=[1:3.407, 110:6.193, 2007:3.736, about:1.762, according:2.948, account:3.507, acting:6.
Top Terms:
epa => 13.471728324890137
mountaintop => 11.364262580871582
mine => 10.942587852478027
Weight : [props - optional]: Point:
[...]
k-means in Mahout is only a toy.
You can use it for how-tos and tutorials, but for real use it is too slow, too limited, too hard to use. (Also, k-means results are not half as good as people think... most of the time they are dogfood.)
Benchmark other tools, and you'll be surprised big time.
I found a way. You can use seqdumper to extract the cluster mapping:
./mahout seqdumper -i /tmp/es-kmeans/clusteredPoints/part-m-00000 -o /tmp/cluster-points.txt
Then you can use a regex to extract the mapping of vector IDs to cluster IDs.
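For example, a hedged sketch (the exact dump format varies by Mahout version - lines typically look like Key: <clusterId>: Value: wt: ... vec: <vectorName> = [...] - so adjust the pattern to what your cluster-points.txt actually contains):
# print "<clusterId> <vectorName>" pairs from the seqdumper output
sed -nE 's/^Key: ([0-9]+): Value: .*vec: ([^=]+) =.*/\1 \2/p' /tmp/cluster-points.txt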

K-means on Mahout returning non-exclusive clusters

In my data I have users with a list of likes. I've dumped these likes into individual files for each user and would like to cluster them. Everything is working except that the output has the same likes in multiple clusters. My understanding is that k-means should be exclusive, so I figure the problem is perhaps with how I am dumping the data. I have also dumped all of the likes without spaces for the time being, until I can write a custom tokenizer. Here's what I'm running (from a Ruby script).
system("#{MAHOUT_CMD} seqdirectory -c UTF-8 -i data/users -o data/kmeans/converted")
system("#{MAHOUT_CMD} seq2sparse -i data/kmeans/converted -o data/kmeans/vectors")
system("#{MAHOUT_CMD} kmeans -i data/kmeans/vectors/tfidf-vectors -c data/kmeans/initial_clusters -o data/kmeans/kmeans_clusters -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 0.1 -k 20 -x 20")
last_cluster_folder = Dir["data/kmeans/kmeans_clusters/*"].last.gsub("data/kmeans/kmeans_clusters/", "")
system("#{MAHOUT_CMD} clusterdump -s data/kmeans/kmeans_clusters/#{last_cluster_folder}/ -d data/kmeans/vectors/dictionary.file-0 -dt sequencefile -o data/kmeans/clusters.txt -n 1000")
The output lists the "top terms" in each cluster; however, many of the likes occur in every cluster (though with different weights). Is this the normal output for clusterdump? Do I need to find out which cluster each word belongs to by its weight?
Thanks
Mahout is probably only doing approximate k-means. Plus, there might be objects that have the same distance to more than one cluster.
You should, however, be able to just take the k means and then do a 1-nearest-neighbor classification to get a unique result for each object (this is trivial to parallelize and very fast).
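Note also that clusterdump's "top terms" describe each cluster centroid, so the same like can legitimately rank highly in several centroids; it is not a point-to-cluster assignment. To see the exclusive assignment Mahout already computed, a hedged sketch (assuming you re-run kmeans with --clustering, as in the earlier Mahout question, so that clusteredPoints is written under the output directory):
./mahout kmeans -i data/kmeans/vectors/tfidf-vectors -c data/kmeans/initial_clusters -o data/kmeans/kmeans_clusters -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 0.1 -k 20 -x 20 --clustering
# each Key in the dump is a cluster id; each point appears exactly once
./mahout seqdumper -i data/kmeans/kmeans_clusters/clusteredPoints/part-m-00000 -o data/kmeans/point-assignments.txt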