I want to create a streaming Kafka consumer in PyFlink that can read tweet data after JSON deserialization.
I have PyFlink version 1.14.4 (the latest version).
Can I have an example of a Kafka producer and simple Flink streaming-consumer code in Python?
Here is an example from the PyFlink examples that shows how to read JSON data from a Kafka consumer with the PyFlink DataStream API:
import logging
import sys

from pyflink.common import Types, JsonRowDeserializationSchema, JsonRowSerializationSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaProducer, FlinkKafkaConsumer


# Make sure that the Kafka cluster is started and the topic 'test_json_topic' is
# created before executing this job.
def write_to_kafka(env):
    type_info = Types.ROW([Types.INT(), Types.STRING()])
    ds = env.from_collection(
        [(1, 'hi'), (2, 'hello'), (3, 'hi'), (4, 'hello'), (5, 'hi'), (6, 'hello'), (6, 'hello')],
        type_info=type_info)

    serialization_schema = JsonRowSerializationSchema.Builder() \
        .with_type_info(type_info) \
        .build()
    kafka_producer = FlinkKafkaProducer(
        topic='test_json_topic',
        serialization_schema=serialization_schema,
        producer_config={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'}
    )

    # note that the output type of ds must be RowTypeInfo
    ds.add_sink(kafka_producer)
    env.execute()


def read_from_kafka(env):
    deserialization_schema = JsonRowDeserializationSchema.Builder() \
        .type_info(Types.ROW([Types.INT(), Types.STRING()])) \
        .build()
    kafka_consumer = FlinkKafkaConsumer(
        topics='test_json_topic',
        deserialization_schema=deserialization_schema,
        properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group_1'}
    )
    kafka_consumer.set_start_from_earliest()

    env.add_source(kafka_consumer).print()
    env.execute()


if __name__ == '__main__':
    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")

    env = StreamExecutionEnvironment.get_execution_environment()
    # The connector jar version should match the Flink version you run
    # (e.g. the 1.14.4 Kafka SQL connector jar for PyFlink 1.14.4).
    env.add_jars("file:///path/to/flink-sql-connector-kafka-1.15.0.jar")

    print("start writing data to kafka")
    write_to_kafka(env)

    print("start reading data from kafka")
    read_from_kafka(env)
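To adapt this to tweet data, the deserialization schema just needs to mirror the fields you care about in the tweet JSON. Below is a minimal consumer sketch, assuming (hypothetically) that each Kafka message is a JSON object with id, user and text fields and that the topic is called tweets; adjust the names, types, topic and connector jar to your setup.

from pyflink.common import Types, JsonRowDeserializationSchema  # mirrors the example above; import locations vary between PyFlink versions
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer


def read_tweets():
    env = StreamExecutionEnvironment.get_execution_environment()
    # Same placeholder as above: point this at the Kafka connector jar matching your Flink version.
    env.add_jars("file:///path/to/flink-sql-connector-kafka-1.15.0.jar")

    # The field names and types are assumptions about the tweet payload; adjust them to your JSON.
    deserialization_schema = JsonRowDeserializationSchema.Builder() \
        .type_info(Types.ROW_NAMED(
            ['id', 'user', 'text'],
            [Types.LONG(), Types.STRING(), Types.STRING()])) \
        .build()

    kafka_consumer = FlinkKafkaConsumer(
        topics='tweets',  # hypothetical topic name
        deserialization_schema=deserialization_schema,
        properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'tweet_reader'}
    )
    kafka_consumer.set_start_from_earliest()

    env.add_source(kafka_consumer).print()
    env.execute()


if __name__ == '__main__':
    read_tweets()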
Related
Initially, I created a topic named "quickstart-events" and produced some messages into it. I then consumed them with the kafka-console-consumer using the consumer group "quickstartGroup", and now I want to replicate that group from the source cluster to the destination cluster.
When I run the describe command for the group in the source cluster
~/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group quickstartGroup
The output I'm getting is
Consumer group 'quickstartGroup' has no active members.
GROUP            TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST  CLIENT-ID
quickstartGroup  quickstart-events  1          9               12              3    -            -     -
quickstartGroup  quickstart-events  0          9               12              3    -            -     -
Here, the topic is getting replicated but when I run the command to describe the group in the destination cluster
kafka-consumer-groups.sh --bootstrap-server localhost:9093 --describe --group quickstartGroup
I get the error
Error: Consumer group 'quickstartGroup' does not exist
My MirrorMaker 2 properties file contents are:
# Sample MirrorMaker 2.0 top-level configuration file
# Run with ./bin/connect-mirror-maker.sh connect-mirror-maker.properties
# specify any number of cluster aliases
clusters = A, B
# connection information for each cluster
# This is a comma-separated list of host:port pairs for each cluster
# e.g. "A_host1:9092, A_host2:9092, A_host3:9092"
A.bootstrap.servers = localhost:9092
B.bootstrap.servers = localhost:9093
# enable and configure individual replication flows
A->B.enabled = true
# regex which defines which topics get replicated, e.g. "foo-.*"
A->B.topics = quickstart-events.*
A->B.groups = quickstartGroup.*
# Setting replication factor of newly created remote topics
replication.factor=2
############################# Internal Topic Settings #############################
# The replication factor for mm2 internal topics "heartbeats", "B.checkpoints.internal" and
# "mm2-offset-syncs.B.internal"
# For anything other than development testing, a value greater than 1 (such as 3) is recommended to ensure availability.
checkpoints.topic.replication.factor=1
heartbeats.topic.replication.factor=1
offset-syncs.topic.replication.factor=1
# The replication factor for connect internal topics "mm2-configs.B.internal", "mm2-offsets.B.internal" and
# "mm2-status.B.internal"
# For anything other than development testing, a value greater than 1 (such as 3) is recommended to ensure availability.
offset.storage.replication.factor=1
status.storage.replication.factor=1
config.storage.replication.factor=1
# customize as needed
# replication.policy.separator = _
# sync.topic.acls.enabled = false
# emit.heartbeats.interval.seconds = 5
groups.exclude = ''
replication.policy.class=ch.mawileo.kafka.mm2.PrefixlessReplicationPolicy
P.S.: I am using Kafka 2.8.
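Not part of the original file, but for context on what group replication involves: the consumer-group offsets only appear on the destination via MirrorMaker 2's checkpoint mechanism, and since Kafka 2.7 the translated offsets can also be written directly to the target consumer group. A hedged sketch of the additional per-flow properties this would use (names per KIP-545; verify them against the 2.8 documentation):

# Assumed additions, not part of the original config: emit checkpoints for the
# replicated groups and translate their committed offsets into cluster B.
A->B.emit.checkpoints.enabled = true
A->B.emit.checkpoints.interval.seconds = 60
A->B.sync.group.offsets.enabled = true
A->B.sync.group.offsets.interval.seconds = 60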
I need to override server.properties for Kafka to add the advertised host name, advertised listener and advertised port.
I tried changing server.properties and restarting Kafka from Ambari, but server.properties reverts to its previous values once the Kafka server is up.
Then I figured I could start ZooKeeper and Kafka from the command line. I tried
/usr/hdp/current/zookeeper-server/bin/zkServer.sh start
ZooKeeper started fine and showed as started in Ambari, but
/usr/hdp/3.0.1.0-187/kafka/bin/kafka-server-start.sh /usr/hdp/3.0.1.0-187/kafka/config/server.properties --override advertised.host.name=localhost --override advertised.listeners=PLAINTEXT://sandbox-hdp.hortonworks.com:6667 --override advertised.port=6667
This did not start the Kafka server, or at least it did not appear to work when I tried some consumers, and it was not reflected in the Ambari GUI.
The ports are mentioned in the properties file. Any leads on how to override the advertised listeners? And if so, what should the advertised host name, listener and port values be?
I am attaching the properties files and the IP addresses for both Windows (host OS) and the sandbox.
Sandbox IP: 172.18.0.2
Windows IP: 192.168.0.21
This is consumer.properties
# see org.apache.kafka.clients.consumer.ConsumerConfig for more details
# list of brokers used for bootstrapping knowledge about the rest of the cluster
# format: host1:port1,host2:port2 ...
bootstrap.servers=localhost:9092
# consumer group id
group.id=test-consumer-group
# What to do when there is no initial offset in Kafka or if the current
# offset does not exist any more on the server: latest, earliest, none
#auto.offset.reset=
This is producer.properties
# see org.apache.kafka.clients.producer.ProducerConfig for more details
############################# Producer Basics #############################
# list of brokers used for bootstrapping knowledge about the rest of the cluster
# format: host1:port1,host2:port2 ...
bootstrap.servers=localhost:9092
# specify the compression codec for all data generated: none, gzip, snappy, lz4
compression.type=none
# name of the partitioner class for partitioning events; default partition spreads data randomly
#partitioner.class=
# the maximum amount of time the client will wait for the response of a request
#request.timeout.ms=
# how long `KafkaProducer.send` and `KafkaProducer.partitionsFor` will block for
#max.block.ms=
# the producer will wait for up to the given delay to allow other records to be sent so that the sends can be batched together
#linger.ms=
# the maximum size of a request in bytes
#max.request.size=
# the default batch size in bytes when batching multiple records sent to a partition
#batch.size=
# the total bytes of memory the producer can use to buffer records waiting to be sent to the server
#buffer.memory=
This is server.properties
# Generated by Apache Ambari. Sun May 3 19:25:08 2020
auto.create.topics.enable=true
auto.leader.rebalance.enable=true
compression.type=producer
controlled.shutdown.enable=true
controlled.shutdown.max.retries=3
controlled.shutdown.retry.backoff.ms=5000
controller.message.queue.size=10
controller.socket.timeout.ms=30000
default.replication.factor=1
delete.topic.enable=true
external.kafka.metrics.exclude.prefix=kafka.network.RequestMetrics,kafka.server.DelayedOperationPurgatory,kafka.server.BrokerTopicMetrics.BytesRejectedPerSec
external.kafka.metrics.include.prefix=kafka.network.RequestMetrics.ResponseQueueTimeMs.request.OffsetCommit.98percentile,kafka.network.RequestMetrics.ResponseQueueTimeMs.request.Offsets.95percentile,kafka.network.RequestMetrics.ResponseSendTimeMs.request.Fetch.95percentile,kafka.network.RequestMetrics.RequestsPerSec.request
fetch.purgatory.purge.interval.requests=10000
kafka.ganglia.metrics.group=kafka
kafka.ganglia.metrics.host=localhost
kafka.ganglia.metrics.port=8671
kafka.ganglia.metrics.reporter.enabled=true
kafka.metrics.reporters=
kafka.timeline.metrics.host_in_memory_aggregation=
kafka.timeline.metrics.host_in_memory_aggregation_port=
kafka.timeline.metrics.host_in_memory_aggregation_protocol=
kafka.timeline.metrics.hosts=
kafka.timeline.metrics.maxRowCacheSize=10000
kafka.timeline.metrics.port=
kafka.timeline.metrics.protocol=
kafka.timeline.metrics.reporter.enabled=true
kafka.timeline.metrics.reporter.sendInterval=5900
kafka.timeline.metrics.truststore.password=
kafka.timeline.metrics.truststore.path=
kafka.timeline.metrics.truststore.type=
leader.imbalance.check.interval.seconds=300
leader.imbalance.per.broker.percentage=10
listeners=PLAINTEXT://sandbox-hdp.hortonworks.com:6667
log.cleanup.interval.mins=10
log.dirs=/kafka-logs
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.retention.bytes=-1
log.retention.check.interval.ms=600000
log.retention.hours=168
log.roll.hours=168
log.segment.bytes=1073741824
message.max.bytes=1000000
min.insync.replicas=1
num.io.threads=8
num.network.threads=3
num.partitions=1
num.recovery.threads.per.data.dir=1
num.replica.fetchers=1
offset.metadata.max.bytes=4096
offsets.commit.required.acks=-1
offsets.commit.timeout.ms=5000
offsets.load.buffer.size=5242880
offsets.retention.check.interval.ms=600000
offsets.retention.minutes=86400000
offsets.topic.compression.codec=0
offsets.topic.num.partitions=50
offsets.topic.replication.factor=1
offsets.topic.segment.bytes=104857600
port=6667
producer.metrics.enable=false
producer.purgatory.purge.interval.requests=10000
queued.max.requests=500
replica.fetch.max.bytes=1048576
replica.fetch.min.bytes=1
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.lag.max.messages=4000
replica.lag.time.max.ms=10000
replica.socket.receive.buffer.bytes=65536
replica.socket.timeout.ms=30000
sasl.enabled.mechanisms=GSSAPI
sasl.mechanism.inter.broker.protocol=GSSAPI
security.inter.broker.protocol=PLAINTEXT
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
ssl.client.auth=none
ssl.key.password=
ssl.keystore.location=
ssl.keystore.password=
ssl.truststore.location=
ssl.truststore.password=
zookeeper.connect=sandbox-hdp.hortonworks.com:2181
zookeeper.connection.timeout.ms=25000
zookeeper.session.timeout.ms=30000
zookeeper.sync.time.ms=2000
This is zookeeper.properties
# the directory where the snapshot is stored.
dataDir=/tmp/zookeeper
# the port at which the clients will connect
clientPort=2181
# disable the per-ip limit on the number of connections since this is a non-production config
maxClientCnxns=0
I cannot be sure that this is what you are hitting, but the generic explanation for 'I changed properties and my change disappeared' is as follows:
The config files used by the services are managed by Ambari.
It is possible to change these files directly, but this is not recommended, as Ambari will simply overwrite them as soon as it gets the chance.
As such, the root cause of your problem may be that you updated the configs directly in the file rather than through Ambari.
You may also want to check out this thread: https://community.cloudera.com/t5/Support-Questions/Producing-to-Kafka-HDP-Sandbox-2-6-4/td-p/189742
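As an illustration of that point (not from the original thread): instead of editing server.properties on disk, values such as the following would be entered in the Kafka broker configuration inside Ambari, which then regenerates the file on restart. The hostname is the sandbox one already used above and is an assumption; clients on the Windows host would also need to be able to resolve it, e.g. via their hosts file.

# Hypothetical values to set through Ambari's Kafka configuration, not by hand:
listeners=PLAINTEXT://sandbox-hdp.hortonworks.com:6667
advertised.listeners=PLAINTEXT://sandbox-hdp.hortonworks.com:6667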
I am facing the following task: I have individual MB-sized files stored in a Google Cloud Storage bucket, grouped in directories by date (each directory contains around 5k files). I need to look at each (XML) file, filter the proper ones, and put them into Mongo or write them back to Google Cloud Storage in, say, Parquet format. I wrote a simple PySpark program that looks like this:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = (
    SparkSession
    .builder
    .appName('myApp')
    .config("spark.mongodb.output.uri", "mongodb://<mongo_connection>")
    .config("spark.mongodb.output.database", "test")
    .config("spark.mongodb.output.collection", "test")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.dynamicAllocation.enabled", "true")
    .getOrCreate()
)

spark_context = spark.sparkContext
spark_context.setLogLevel("INFO")
sql_context = pyspark.SQLContext(spark_context)

# configure Hadoop
hadoop_conf = spark_context._jsc.hadoopConfiguration()
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoop_conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

# DataFrame schema
schema = StructType([
    StructField('filename', StringType(), True),
    StructField("date", DateType(), True),
    StructField("xml", StringType(), True)
])

# -------------------------
# Main operation
# -------------------------

# get all files
files = spark_context.wholeTextFiles('gs://bucket/*/*.gz')

# custom_checking_map is a user-defined function (not shown) that turns each
# (path, content) pair into a row matching the schema, or returns None.
rows = files \
    .map(lambda x: custom_checking_map(x)) \
    .filter(lambda x: x is not None)

# transform to DataFrame
df = sql_context.createDataFrame(rows, schema)

# write to mongo
df.write.format("mongo").mode("append").save()

# write back to Cloud Storage
df.write.parquet('gs://bucket/test.parquet')

spark_context.stop()
I tested it on a subset (a single directory, gs://bucket/20191010/*.gz) and it works. I deployed it on a Google Dataproc cluster, but I doubt anything is happening, since the logs stop after 19/11/06 15:41:40 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1573054807908_0001
I am running a 3-worker cluster with 4 cores, 15 GB RAM and a 500 GB HDD each. Spark version 2.3.3, Scala 2.11, mongo-connector-spark_2.11-2.3.3.
I am new to Spark, so any suggestions are appreciated. Normally I would write this kind of job with Python multiprocessing, but I wanted to move to something "better"; now I am not sure.
It can take a significant amount of time to list a very large number of files in GCS - most probably your job "hangs" while the Spark driver lists all the files before processing starts.
You will achieve much better performance by listing all the directories first and then processing the files in each directory - for the best performance you can process the directories in parallel, but given that each directory has around 5k files and your cluster only has 3 workers, it could be good enough to process the directories sequentially (a sketch follows below).
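A minimal sketch of that idea, reusing spark_context, sql_context, schema and custom_checking_map from the question's code above; the directory listing uses the google-cloud-storage client, which is an assumed dependency and not part of the original answer:

from google.cloud import storage  # assumed dependency: google-cloud-storage


def list_date_prefixes(bucket_name):
    # With delimiter='/', list_blobs exposes the top-level "directories"
    # (e.g. '20191010/') via iterator.prefixes once its pages are consumed.
    client = storage.Client()
    iterator = client.list_blobs(bucket_name, delimiter='/')
    for _ in iterator:
        pass
    return sorted(iterator.prefixes)


# Process one date directory at a time instead of globbing the whole bucket.
for prefix in list_date_prefixes('bucket'):
    files = spark_context.wholeTextFiles('gs://bucket/{}*.gz'.format(prefix))
    rows = files \
        .map(custom_checking_map) \
        .filter(lambda x: x is not None)
    df = sql_context.createDataFrame(rows, schema)
    df.write.format("mongo").mode("append").save()
    df.write.mode("append").parquet('gs://bucket/output.parquet')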
My Kafka cluster version is 0.10.0.0, and I want to use PySpark Streaming to read the Kafka data, but in the Spark Streaming + Kafka Integration Guide, http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html,
there is no Python code example.
So can PySpark use spark-streaming-kafka-0-10 to integrate with Kafka?
Thank you in advance for your help!
I also use Spark Streaming with a Kafka 0.10.0 cluster. After adding the following line to your configuration, you are good to go.
spark.jars.packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0
And here is a sample in Python:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Initialize SparkContext
sc = SparkContext(appName="sampleKafka")

# Initialize spark stream context
batchInterval = 10
ssc = StreamingContext(sc, batchInterval)

# Set kafka topic (topic name -> number of receiver threads)
topic = {"myTopic": 1}

# Set application groupId
groupId = "myTopic"

# Set zookeeper parameter
zkQuorum = "zookeeperhostname:2181"

# Create Kafka stream
kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, groupId, topic)

# Do as you wish with your stream

# Start stream
ssc.start()
ssc.awaitTermination()
You can use spark-streaming-kafka-0-8 even when your brokers are 0.10 or later: spark-streaming-kafka-0-8 supports newer broker versions, while spark-streaming-kafka-0-10 does not support older broker versions. As of now, spark-streaming-kafka-0-10 is still experimental and has no Python support.
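For completeness, a hedged sketch (not from the original answer) of the receiver-less alternative that the same spark-streaming-kafka-0-8 package offers from Python, connecting directly to the brokers instead of ZooKeeper; the broker address and topic name are placeholders:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="sampleKafkaDirect")
ssc = StreamingContext(sc, 10)

# createDirectStream takes a list of topics and a dict of Kafka parameters;
# "metadata.broker.list" points at the brokers rather than at ZooKeeper.
directStream = KafkaUtils.createDirectStream(
    ssc,
    ["myTopic"],
    {"metadata.broker.list": "kafkabroker:9092"}
)

# Messages arrive as (key, value) pairs; print the values.
directStream.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()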
Here is how I launch the Spark job:
./bin/spark-submit \
--class MyDriver \
--master spark://master:7077 \
--executor-memory 845M \
--deploy-mode client \
./bin/SparkJob-0.0.1-SNAPSHOT.jar
The class MyDriver accesses the Spark context using:
val sc = new SparkContext(new SparkConf())
val dataFile= sc.textFile("/data/example.txt", 1)
In order to run this within a cluster, I copy the file "/data/example.txt" to all nodes in the cluster. Is there a mechanism in Spark to share this data file between nodes without copying it manually? I don't think I can use a broadcast variable in this case?
Update:
An option is to have a dedicated file server which shares the file to be processed: val dataFile = sc.textFile("http://fileserver/data/example.txt", 1)
sc.textFile("/some/file.txt") reads a file distributed in HDFS, i.e.:
/some/file.txt is (already) split into multiple parts, which are distributed across a couple of machines each,
and each worker/task reads one part of the file. This is useful because you don't need to manage the parts yourself.
If you have copied the file to each worker node, you can read it in every task:
val myRdd = sc.parallelize(1 to 100) // 100 tasks
val fileReadEveryWhere = myRdd.map( _ => read("/my/file.txt") )
and have the code of read(...) implemented somewhere.
Otherwise, you can also use a broadcast variable that is sent from the driver to all workers:
val myObject = read("/my/file.txt") // obj instantiated on driver node
val bdObj = sc.broadcast(myObject)
val myRdd = sc.parallelize(1 to 100)
  .map { i =>
    // use bdObj in task i, ex:
    bdObj.value.process(i)
  }
In this case, myObject should be serializable and it is better if it is not too big.
Also, the method read(...) is run on the driver machine, so you only need the file on the driver. But if you don't know which machine that is (e.g. if you use spark-submit), then the file should be on all machines :-\ . In this case, it may be better to have access to some DB or an external file system.