How can Apache Beam implement Scala's zipWithIndex()? - apache-beam

PCollection<String> p1 = ... // {"a", "b", "c"}
PCollection<KV<Integer, String>> p2 = p1.apply("some operation") // {(1,"a"), (2,"b"), (3,"c")}
How can I get this result with Apache Beam? Thanks.
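As far as I know, Beam has no built-in zipWithIndex, because a PCollection has no defined element order. A hedged sketch of one workaround (Python SDK) is to gather all elements into a single list and enumerate it; this only works when the collection fits in one worker's memory, and sorting is just one way to make the assigned indices deterministic.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    p1 = pipeline | 'Create' >> beam.Create(['a', 'b', 'c'])
    p2 = (
        p1
        | 'ToList' >> beam.combiners.ToList()      # -> single element [['a', 'b', 'c']]
        | 'Enumerate' >> beam.FlatMap(
            lambda elems: [(i + 1, e) for i, e in enumerate(sorted(elems))])
    )
    p2 | 'Print' >> beam.Map(print)                # (1, 'a'), (2, 'b'), (3, 'c')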

Related

RejectedExecutionException: ReactorDispatcher instance is closed. - Azure Event Hubs & Databricks Spark

I am trying to consume data from Azure Event Hubs with Databricks PySpark and write it to an ADLS sink. Somehow, the Spark job is not able to finish and gets aborted after running for 2 hours. The error is Caused by: java.util.concurrent.RejectedExecutionException: ReactorDispatcher instance is closed.
here is a full error https://gist.github.com/kingindanord/a5f585c6ee7053c275c714d1b07c6538#file-spark_error-log
and here is my Python script:
import json
from datetime import date, timedelta, datetime
from pyspark.sql import functions as F
KEY_VAULT_NAME="KEY_VAULT_NAME"
EVENT_HUBS_SECRET_NAME="EVENT_HUBS_SECRET_NAME"
EVENT_HUBS_CONSUMER_NAME="EVENT_HUBS_CONSUMER_NAME"
BATCH_START_DATE = datetime.strptime("2022-03-22 23:00:00", "%Y-%m-%d %H:%M:%S")
BATCH_END_DATE = datetime.strptime("2022-03-23 00:00:00", "%Y-%m-%d %H:%M:%S")
CONTAINER_NAME = "CONTAINER_NAME_AZ"
HUB_NAME = "HUB_NAME"
ROOT_FOLDER = "ROOT_FOLDER"
SINK_URI = 'abfss://{CONTAINER_NAME}#.dfs.core.windows.net/{SINK_ROOT_FOLDER}'.format(CONTAINER_NAME=CONTAINER_NAME, SINK_ROOT_FOLDER=ROOT_FOLDER)
connection = dbutils.secrets.get(scope = KEY_VAULT_NAME, key = EVENT_HUBS_SECRET_NAME)
ehConf = {}
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection)
ehConf['eventhubs.consumerGroup'] = EVENT_HUBS_CONSUMER_NAME
# Create the positions
startingEventPosition = {
    "offset": None,
    "seqNo": -1,  # not in use
    "enqueuedTime": BATCH_START_DATE.strftime("%Y-%m-%dT00:00:00.000Z"),
    "isInclusive": True
}
endingEventPosition = {
    "offset": None,
    "seqNo": -1,
    "enqueuedTime": BATCH_END_DATE.strftime("%Y-%m-%dT00:00:00.000Z"),
    "isInclusive": True
}
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)
ehConf["eventhubs.endingPosition"] = json.dumps(endingEventPosition)
ehConf["eventhubs.MaxEventsPerTrigger"] = 1000
ehConf["eventhubs.UseExclusiveReceiver"] = True
df = spark.read.format("eventhubs").options(**ehConf).load()
df2 = df.withColumn("body", df["body"].cast("string")) \
    .withColumn("year", F.date_format(df["enqueuedTime"], "yyyy")) \
    .withColumn("month", F.date_format(df["enqueuedTime"], "MM")) \
    .withColumn("day", F.date_format(df["enqueuedTime"], "dd")) \
    .select("body", "year", "month", "day")
df2.write.partitionBy("year", "month", "day").mode("overwrite") \
    .format("delta") \
    .parquet(SINK_URI)
I am using a separate consumer group for this application. The Event Hub has 3 partitions; auto-inflate is enabled with a maximum of 21 throughput units.
Databricks Runtime version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12). Worker and driver type: Standard_E16_v3 (128 GB memory, 16 cores). Min workers: 1, max workers: 3.
As you can see in the code, startingEventPosition and endingEventPosition are only one hour apart, so the data should be around 3 GB; I don't know why I am not able to consume it. Can you please help me with this issue?
You can try these two workarounds:
Set a different consumer group for each stream.
Restart the Databricks cluster and then try again.
Refer to this GitHub link.
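For the first workaround, a minimal sketch (the consumer group names here are hypothetical) of giving each stream its own consumer group in the Event Hubs config:
# Illustrative only: give each concurrent read its own consumer group so the
# receivers do not compete for the same Event Hubs partitions.
ehConfStreamA = dict(ehConf)
ehConfStreamA['eventhubs.consumerGroup'] = 'consumer-group-a'   # hypothetical name

ehConfStreamB = dict(ehConf)
ehConfStreamB['eventhubs.consumerGroup'] = 'consumer-group-b'   # hypothetical name

dfA = spark.read.format("eventhubs").options(**ehConfStreamA).load()
dfB = spark.read.format("eventhubs").options(**ehConfStreamB).load()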

Py4JJavaError: wrong column type when calling PCA from pyspark.ml.feature

I am trying to visualize word2vec words using pyspark's PCA function, but I'm getting an unhelpful error message. It says the features column is of the wrong type, but it isn't. (Full message below.)
Background
spark-2.4.0-bin-hadoop2.7
Scala 2.12.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191).
Python 3.6.5 (Anaconda, Inc.)
Ubuntu 16.04
My Code
maxWordsVis = 15
Feat = np.load('Gab_ai_posts_W2Vmatrix.npy')
words = np.load('Gab_ai_posts_WordList.npy')
# to rdd, avoid this with big matrices by reading them directly from hdfs
Feat = sc.parallelize(Feat)
Feat = Feat.map(lambda vec: (Vectors.dense(vec),))
# to dataframe
dfFeat = sqlContext.createDataFrame(Feat,["features"])
dfFeat.head()
Row(features=DenseVector([-0.1282, 0.0699, -0.0891, -0.0437, -0.0915, -0.0557, 0.1432, -0.1564, 0.0058, -0.0603, 0.1383, -0.0359, -0.0306, -0.0415, -0.0191, 0.058, 0.0119, -0.0302, 0.0362, -0.0466, 0.0403, -0.1035, 0.0456, 0.0892, 0.0548, -0.0735, 0.1094, -0.0299, -0.0549, -0.1235, 0.0062, 0.1381, -0.0082, 0.085, -0.0083, -0.0346, -0.0226, -0.0084, -0.0463, -0.0448, 0.0285, -0.0013, 0.0343, -0.0056, 0.0756, -0.0068, 0.0562, 0.0638, 0.023, -0.0224, -0.0228, 0.0281, -0.0698, -0.0044, 0.0395, -0.021, 0.0228, 0.0666, 0.0362, 0.0116, -0.0088, 0.0949, 0.0265, -0.0293, -0.007, -0.0746, 0.0891, 0.0145, 0.0532, -0.0084, -0.0853, 0.0037, -0.055, -0.0706, -0.0296, 0.0321, 0.0495, -0.0776, -0.1339, -0.065, 0.0856, 0.0328, 0.0821, 0.036, -0.0179, -0.0006, -0.036, 0.0438, -0.0077, -0.0012, 0.0322, 0.0354, 0.0513, 0.0436, 0.0002, -0.0578, 0.1062, 0.019, 0.0346, -0.1261]))
numComponents = 3
pca = PCA(k = numComponents, inputCol = "features", outputCol = "pcaFeatures")
Error Message
Py4JJavaError: An error occurred while calling o4583.fit. : java.lang.IllegalArgumentException: requirement failed:
Column features must be of type
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
at scala.Predef$.require(Predef.scala:224)
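A common trigger for this exact message, where the expected and actual struct types print identically, is building the features column with pyspark.mllib.linalg vectors while pyspark.ml.feature.PCA expects pyspark.ml.linalg vectors. A minimal sketch of the ml-vector path, assuming an active SparkSession named spark:
# Sketch only: build the features column with pyspark.ml (not pyspark.mllib)
# vectors, which is the type pyspark.ml.feature.PCA expects.
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

dfFeat = spark.createDataFrame(
    [(Vectors.dense([0.1, 0.2, 0.3]),),
     (Vectors.dense([0.4, 0.0, 0.6]),),
     (Vectors.dense([0.7, 0.8, 0.1]),)],
    ["features"])

pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
pca.fit(dfFeat).transform(dfFeat).show(truncate=False)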

Latest records/messages present in a Kafka topic

Is there a way to fetch the latest 1000 records/messages present in a Kafka topic, similar to tail -n 1000 for a file in Linux?
Using Python Kafka, I found this way to get the last message.
Configure it to get the n last messages, but make sure there are enough messages and that the topic isn't empty. This looks like a job for streaming, i.e. Kafka Streams or Kafka SQL (KSQL).
#!/usr/bin/env python
from kafka import KafkaConsumer, TopicPartition

TOPIC = 'example_topic'
GROUP = 'demo'
BOOTSTRAP_SERVERS = ['bootstrap.kafka:9092']

consumer = KafkaConsumer(
    bootstrap_servers=BOOTSTRAP_SERVERS,
    group_id=GROUP,
    # enable_auto_commit=False,
    auto_commit_interval_ms=0,
    max_poll_records=1
)

candidates = []
consumer.commit()
msg = None
partitions = consumer.partitions_for_topic(TOPIC)

for p in partitions:
    tp = TopicPartition(TOPIC, p)
    consumer.assign([tp])
    committed = consumer.committed(tp)
    consumer.seek_to_end(tp)
    last_offset = consumer.position(tp)
    print(f"\ntopic: {TOPIC} partition: {p} committed: {committed} last: {last_offset} lag: {(last_offset - committed)}")
    consumer.poll(
        timeout_ms=100,
        # max_records=1
    )
    # consumer.assign([partition])
    consumer.seek(tp, last_offset - 4)
    for message in consumer:
        # print(f"Message is of type: {type(message)}")
        print(message)
        # print(f'message.offset: {message.offset}')
        # TODO find out why the number is -1
        if message.offset == last_offset - 1:
            candidates.append(message)
            # print(f' {message}')
            # comment if you don't want the messages committed
            consumer.commit()
            break

print('\n\ngooch\n\n')
latest_msg = candidates[0]
for msg in candidates:
    print(f'finalists:\n {msg}')
    if msg.timestamp > latest_msg.timestamp:
        latest_msg = msg

consumer.close()
print(f'\n\nlatest_message:\n{latest_msg}')
I know that in Java/Scala Kafka Streams it is possible to create a table, i.e. a sub-topic holding only the last entry of another topic, so the Confluent Kafka library (written in C, with Python and Java bindings, plus the kafkacat CLI) might offer a more elegant and efficient way.
You can use the seek method of the KafkaConsumer class: find the current offsets for every partition, then calculate the correct offsets to seek to.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer()
partition = TopicPartition('foo', 0)
start = 1234
end = 2345

consumer.assign([partition])
consumer.seek(partition, start)

for msg in consumer:
    if msg.offset > end:
        break
    else:
        print(msg)
source
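Alternatively, a more compact sketch of the same idea (kafka-python; the topic and broker names are placeholders): rewind every partition to its end offset minus N and read forward until each partition catches up.
from kafka import KafkaConsumer, TopicPartition

N = 1000                 # number of trailing messages to fetch per partition
TOPIC = 'example_topic'  # placeholder topic name

consumer = KafkaConsumer(bootstrap_servers=['localhost:9092'])
partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
consumer.assign(partitions)

# Rewind every partition to (end offset - N), clamped at offset 0.
end_offsets = consumer.end_offsets(partitions)
for tp in partitions:
    consumer.seek(tp, max(end_offsets[tp] - N, 0))

# Read forward until each partition reaches the end offset recorded above.
fetched = []
while any(consumer.position(tp) < end_offsets[tp] for tp in partitions):
    for records in consumer.poll(timeout_ms=1000).values():
        fetched.extend(records)

consumer.close()
print(f'fetched {len(fetched)} messages')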

Apache Spark - delay on the driver

I am attaching an image from the Spark UI, and I am asking what is causing the delay (represented by the white space), based on the description of my code below.
Description:
1) isEmpty: an action triggered on a Dataset DS1. It takes a few milliseconds: 60 ms.
2) The white space between "isEmpty" and "run at ThreadPool...".
3) "collect at graphUtil": the collection of Datasets created between 1) and 2).
The script is running on a YARN cluster.
Between 1) and 2) I am declaring Datasets which use sqlContext.implicits._; I am not collecting them here, so this is supposed to be work on the driver. Those Datasets contain joins/filters/...
Given that I am not collecting them between 1) and 2), what could be causing this delay?
Code between 1) and 2):
import sqlContext.implicits._

val intermediateInputFlowsIdsDS = intermediateInputFlowsDS
  .map(x => x.flow)
  .toDF("flowid").distinct().as[Int].repartition($"flowid")

val df_exch_flow_interm_out = df_exch_flow.filter(
  df_exch_flow("flow_type") === "PRODUCT_FLOW" &&
  df_exch_flow("is_input") === "0")

val allproducersExchDS = intermediateInputFlowsIdsDS.join(df_exch_flow_interm_out,
    intermediateInputFlowsIdsDS("flowid") === df_exch_flow_interm_out("f_flow"))
  .repartition($"f_owner")

// proc{id,name,proctype} / inter{flowid} / df_exch{exch,proc,flow,direct,amount,provider,unit} / df_flow{id,name,type} / unit{id,src,factor,dest}
df_proc.join(allproducersExchDS, df_proc("Id") === allproducersExchDS("f_owner"))
  .map(row => {
    /* (flowid, procid, value) */
    new FlowProducer(
      row.getInt(3),     // flowid output of producer
      row.getInt(0),     // the process id of the producer
      row.getDouble(8),  // value of the matrix A cell
      row.getDouble(16), // factor
      row.getString(17), // destination unit
      row.getString(2)   // process type
    )
  }).repartition($"producer_flow")

Suggestions for a Hadoop project

I am thinking of building something using big data. Ideally what I would like to do is:
take a .csv, put it into Flume, then Kafka, perform n ETL steps and put it back into Kafka; from Kafka put it into Flume and then into HDFS. Once the data is in HDFS I would like to perform a MapReduce job or some Hive queries and then chart whatever I want.
How can I put the .csv file into Flume and save it to Kafka? I have this piece of code but I am not sure if it works:
myagent.sources = r1
myagent.sinks = k1
myagent.channels = c1
myagent.sources.r1.type = spooldir
myagent.sources.r1.spoolDir = /home/xyz/source
myagent.sources.r1.fileHeader = true
myagent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
myagent.channels.c1.type = memory
myagent.channels.c1.capacity = 1000
myagent.channels.c1.transactionCapacity = 100
myagent.sources.r1.channels = c1
myagent.sinks.k1.channel = c1
Any help or suggestions? And if this piece of code is correct, how do I move on?
Thanks everyone!!
Your sink config is incomplete. Try:
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = mytopic
a1.sinks.k1.brokerList = localhost:9092
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.batchSize = 20
a1.sinks.k1.channel = c1
https://flume.apache.org/FlumeUserGuide.html#kafka-sink
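Once the agent is running, a quick sanity check (kafka-python; the topic and broker below are assumed to match the sink config above) that the spooled CSV lines actually land in Kafka:
# Sketch only: read a few records from the topic the Flume Kafka sink writes to,
# to confirm the spooled CSV lines arrive.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'mytopic',                             # assumed to match a1.sinks.k1.topic
    bootstrap_servers=['localhost:9092'],  # assumed to match a1.sinks.k1.brokerList
    auto_offset_reset='earliest',
    consumer_timeout_ms=5000,              # stop iterating after 5 s with no data
)
for record in consumer:
    print(record.value.decode('utf-8'))
consumer.close()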