Spark Kafka readStream and writeStream showing no records - Scala

I am writing a batch application to consume Kafka events and write them to a GCS location. I have tried deleting the checkpoint location and also verified that Kafka has 200 messages to consume.
Spark: 2.4.8
Scala: 2.12
Submit command: spark-shell --master yarn --packages org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.8
import org.apache.spark.sql.streaming.Trigger

val readInputKafkaDataNew = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "changefeed")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
  .writeStream
  .outputMode("append")
  .format("csv")
  .option("path", "gs://test-data-today/test_data/abc")
  .option("checkpointLocation", "gs://test-data-today/test_data/chkdir")
  .trigger(Trigger.Once())
  .start()
  .awaitTermination()
The console log prints that it is committing the offsets, but the DataFrame is empty:
22/06/24 22:25:35 INFO org.apache.spark.sql.execution.streaming.MicroBatchExecution: Streaming query made progress: {
  "id" : "df24d47d-1dbb-4124-9d4f-0e4d9e6a0275",
  "runId" : "e0f330e7-fdcf-47ba-85c6-417afbb3cff9",
  "name" : null,
  "timestamp" : "2022-06-24T22:25:23.257Z",
  "batchId" : 0,
  "numInputRows" : 0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "addBatch" : 5675,
    "getBatch" : 1,
    "getEndOffset" : 0,
    "queryPlanning" : 17,
    "setOffsetRange" : 4870,
    "triggerExecution" : 12018,
    "walCommit" : 715
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[changefeed]]",
    "startOffset" : null,
    "endOffset" : {
      "changefeed" : {
        "2" : 34,
        "5" : 28,
        "4" : 26,
        "1" : 39,
        "3" : 49,
        "0" : 30
      }
    },
    "numInputRows" : 0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "FileSink[gs://gs://test-data-today/test_data/abc]"
  }
}
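In the progress log above, startOffset is null while endOffset is populated and numInputRows is 0. One possible explanation (an assumption, not something confirmed in the post) is that the Kafka source defaults to startingOffsets = latest for streaming queries, so the very first Trigger.Once() batch starts at the current end of the topic and picks up nothing. The sink description also shows a doubled scheme (FileSink[gs://gs://...]), which may be worth checking against the actual path option. A minimal sketch of the same pipeline with explicit starting offsets:
import org.apache.spark.sql.streaming.Trigger

// Hedged sketch: identical pipeline, but reading the topic from the beginning.
// Note: startingOffsets only applies when no checkpoint exists yet; once the
// checkpoint directory has committed offsets, the query resumes from those.
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "changefeed")
  .option("startingOffsets", "earliest") // assumption: the default "latest" is why batch 0 read 0 rows
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .outputMode("append")
  .format("csv")
  .option("path", "gs://test-data-today/test_data/abc")
  .option("checkpointLocation", "gs://test-data-today/test_data/chkdir")
  .trigger(Trigger.Once())
  .start()
  .awaitTermination()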

Related

Join data-frame based on value in list of WrappedArray

I have to join two Spark DataFrames in Scala based on a custom function. Both DataFrames have the same schema.
Sample Row of data in DF1:
{
  "F1" : "A",
  "F2" : "B",
  "F3" : "C",
  "F4" : [
    {
      "name" : "N1",
      "unit" : "none",
      "count" : 50.0,
      "sf1" : "val_1",
      "sf2" : "val_2"
    },
    {
      "name" : "N2",
      "unit" : "none",
      "count" : 100.0,
      "sf1" : "val_3",
      "sf2" : "val_4"
    }
  ]
}
Sample Row of data in DF2:
{
  "F1" : "A",
  "F2" : "B",
  "F3" : "C",
  "F4" : [
    {
      "name" : "N1",
      "unit" : "none",
      "count" : 80.0,
      "sf1" : "val_5",
      "sf2" : "val_6"
    },
    {
      "name" : "N2",
      "unit" : "none",
      "count" : 90.0,
      "sf1" : "val_7",
      "sf2" : "val_8"
    },
    {
      "name" : "N3",
      "unit" : "none",
      "count" : 99.0,
      "sf1" : "val_9",
      "sf2" : "val_10"
    }
  ]
}
RESULT of Joining these sample rows:
{
  "F1" : "A",
  "F2" : "B",
  "F3" : "C",
  "F4" : [
    {
      "name" : "N1",
      "unit" : "none",
      "count" : 80.0,
      "sf1" : "val_5",
      "sf2" : "val_6"
    },
    {
      "name" : "N2",
      "unit" : "none",
      "count" : 100.0,
      "sf1" : "val_3",
      "sf2" : "val_4"
    },
    {
      "name" : "N3",
      "unit" : "none",
      "count" : 99.0,
      "sf1" : "val_9",
      "sf2" : "val_10"
    }
  ]
}
The result is:
a full outer join based on the values of "F1", "F2" and "F3", plus
a merge of "F4" keeping unique nodes (using name as the id) with the max value of "count".
I am not very familiar with Scala and have been struggling with this for more than a day now. Here is what I have gotten to so far:
val df1 = sqlContext.read.parquet("stack_a.parquet")
val df2 = sqlContext.read.parquet("stack_b.parquet")

val df4 = df1.toDF(df1.columns.map(_ + "_A"):_*)
val df5 = df2.toDF(df1.columns.map(_ + "_B"):_*)

val df6 = df4.join(df5, df4("F1_A") === df5("F1_B") && df4("F2_A") === df5("F2_B") && df4("F3_A") === df5("F3_B"), "outer")

def joinFunction(r: Row) = {
  // Need the real-deal here!
  // print(r(3)) //--> Any = WrappedArray([..])
  // also considering parsing as json to do the processing but not sure about the performance impact
  // val parsed = JSON.parseFull(r.json) // then play with parsed
  r.toSeq
}

val finalResult = df6.rdd.map(joinFunction)
finalResult.collect
I was planning to add the custom merge logic in joinFunction but I am struggling to convert the WrappedArray/Any class to something I can work with.
Any inputs on how to do the conversion or the join in a better way will be very helpful.
Thanks!
Edit (7 Mar, 2021)
The full outer join actually has to be performed only on "F1".
Hence, using @werner's answer, I am doing:
val df1_a = df1.toDF(df1.columns.map(_ + "_A"):_*)
val df2_b = df2.toDF(df2.columns.map(_ + "_B"):_*)

val finalResult = df1_a.join(df2_b, df1_a("F1_A") === df2_b("F1_B"), "full_outer")
  .drop("F1_B")
  .withColumn("F4", joinFunction(col("F4_A"), col("F4_B")))
  .drop("F4_A", "F4_B")
  .withColumn("F2", when(col("F2_A").isNull, col("F2_B")).otherwise(col("F2_A")))
  .drop("F2_A", "F2_B")
  .withColumn("F3", when(col("F3_A").isNull, col("F3_B")).otherwise(col("F3_A")))
  .drop("F3_A", "F3_B")
But I am getting this error. What am I missing?
You can implement the merge logic with the help of a udf:
// case class to define the schema of the udf's return value
case class F4(name: String, unit: String, count: Double, sf1: String, sf2: String)

val joinFunction = udf((a: Seq[Row], b: Seq[Row]) =>
  (a ++ b).map(r => F4(r.getAs[String]("name"),
      r.getAs[String]("unit"),
      r.getAs[Double]("count"),
      r.getAs[String]("sf1"),
      r.getAs[String]("sf2")))
    // group the elements from both arrays by name
    .groupBy(_.name)
    // take the element with the max count from each group
    .map { case (_, d) => d.maxBy(_.count) }
    .toSeq)

// join the two dataframes
val finalResult = df1.withColumnRenamed("F4", "F4_A").join(
    df2.withColumnRenamed("F4", "F4_B"), Seq("F1", "F2", "F3"), "full_outer")
  // call the merge function
  .withColumn("F4", joinFunction('F4_A, 'F4_B))
  // drop the intermediate columns
  .drop("F4_A", "F4_B")

Streaming query not showing any progress in Spark

I am getting status messages of the following form from a Spark Structured Streaming application:
18/02/12 16:38:54 INFO StreamExecution: Streaming query made progress: {
  "id" : "a6c37f0b-51f4-47c5-a487-8bd269b80142",
  "runId" : "061e41b4-f488-4483-a290-403f1f7eff03",
  "name" : null,
  "timestamp" : "2018-02-12T11:08:54.323Z",
  "numInputRows" : 0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "getOffset" : 30,
    "triggerExecution" : 46
  },
  "eventTime" : {
    "watermark" : "1970-01-01T00:00:00.000Z"
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "FileStreamSource[file:/home/chiralcarbon/IdeaProjects/spark_structured_streaming/args[0]]",
    "startOffset" : null,
    "endOffset" : null,
    "numInputRows" : 0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink#bcc171"
  }
}
All of the messages have numInputRows with a value of 0.
The program streams data from a Parquet file and outputs the same stream to the console. Following is the code:
def main(args: Array[String]): Unit = {
  val spark: SparkSession = SparkSession.builder
    .master("local")
    .appName("sparkSession")
    .getOrCreate()
  val schema = ..
  val in = spark.readStream
    .schema(schema)
    .parquet("args[0]")
  val query = in.writeStream
    .format("console")
    .outputMode("append")
    .start()
  query.awaitTermination()
}
}
What is the cause and how do I resolve this?
You have an error in readStream:
val in = spark.readStream
  .schema(schema)
  .parquet("args[0]")
You probably want to read from the directory provided in the first argument. In that case, pass the argument directly:
val in = spark.readStream
  .schema(schema)
  .parquet(args(0))
or, for the last line, use string interpolation if the expression is longer or needs concatenation with something else:
.parquet(s"${args(0)}")
Currently your code tries to read from a non-existent directory literally named "args[0]", so no files will be read. After this change the directory will be resolved correctly and Spark will start reading files.
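As a quick sanity check (a hedged addition, not part of the original answer, and only meaningful for local filesystem paths as used in the question), you can fail fast before starting the stream if the directory passed as the first argument does not exist:
import java.nio.file.{Files, Paths}

// Hypothetical guard: verify the first program argument points at a real local directory.
val inputDir = args(0)
require(Files.isDirectory(Paths.get(inputDir)), s"Input directory not found: $inputDir")

val in = spark.readStream
  .schema(schema)
  .parquet(inputDir)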

Apache Drill select single column crashes

I'm currently exploring Apache Drill, running in cluster mode. My data source is MongoDB, and my data source table contains 5 million documents. I can't execute a simple query:
select body from mongo.twitter.tweets limit 10;
It throws this exception:
Query Failed: An Error Occurred
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: IndexOutOfBoundsException: index: 0, length: 264 (expected: range(0, 256))
Fragment 1:2
[Error Id: 8903127a-e9e9-407e-8afc-2092b4c03cf0 on test01.css.org:31010]
  (java.lang.IndexOutOfBoundsException) index: 0, length: 264 (expected: range(0, 256))
    io.netty.buffer.AbstractByteBuf.checkIndex():1134
    io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes():272
    io.netty.buffer.WrappedByteBuf.setBytes():390
    io.netty.buffer.UnsafeDirectLittleEndian.setBytes():30
    io.netty.buffer.DrillBuf.setBytes():753
    io.netty.buffer.AbstractByteBuf.setBytes():510
    org.apache.drill.exec.store.bson.BsonRecordReader.writeString():265
    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
    org.apache.drill.exec.store.bson.BsonRecordReader.write():75
    org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
    org.apache.drill.exec.physical.impl.ScanBatch.next():178
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745
A working query which does fetch results:
select body from mongo.twitter.tweets where tweet_id = 'tag:search.twitter.com,2005:xxxxxxxxxx';
Sample document in the source:
{
  "_id" : ObjectId("58402ad5757d7fede822e641"),
  "rule_list" : [
    "x",
    "(contains:x (contains:y OR contains:y1)) OR (contains:v contains:b) OR (contains:v (contains:r OR contains:t))"
  ],
  "actor_friends_count" : 79,
  "klout_score" : 19,
  "actor_favorites_count" : 0,
  "actor_preferred_username" : "xxxxxxx",
  "sentiment" : "neg",
  "tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxx",
  "object_actor_followers_count" : 1286,
  "actor_posted_time" : "2016-07-16T14:08:25.000Z",
  "actor_id" : "id:twitter.com:xxxxxxxx",
  "actor_display_name" : "xxxxx",
  "retweet_count" : 6,
  "hashtag_list" : [
    "myhashtag"
  ],
  "body" : "my tweet body",
  "actor_followers_count" : 25,
  "actor_status_count" : 243,
  "verb" : "share",
  "posted_time" : "2016-08-01T07:49:00.000Z",
  "object_actor_status_count" : 206,
  "lang" : "ar",
  "object_actor_preferred_username" : "xxxxxx",
  "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxx",
  "gender" : "male",
  "object_actor_id" : "id:twitter.com:xxxxxxx",
  "favorites_count" : 0,
  "object_posted_time" : "2016-06-20T04:12:02.000Z",
  "object_actor_friends_count" : 2516,
  "generator_display_name" : "Twitter for iPhone",
  "object_actor_display_name" : "sdfsf",
  "actor_listed_count" : 0
}
Any help is appreciated!
set store.mongo.bson.record.reader = false;

How to filter by _id in MongoDB using Pig

I have mongo documents like this:
db.activity_days.findOne()
{
  "_id" : ObjectId("54b4ee617acf9ce0440a3185"),
  "aca" : 0,
  "ca" : 0,
  "cbdw" : true,
  "day" : ISODate("2014-12-10T00:00:00Z"),
  "dm" : 0,
  "fbc" : 0,
  "go" : 2500,
  "gs" : [ ],
  "its" : [
    {
      "_id" : ObjectId("551ac8d44f9f322e2b055d3a"),
      "at" : 2000,
      "atn" : "Running",
      "cas" : 386.514909469507,
      "dis" : 2.788989730832084,
      "du" : 1472,
      "ibr" : false,
      "ide" : false,
      "lcs" : false,
      "pt" : 0,
      "rpt" : 0,
      "src" : 1001,
      "stp" : 0,
      "tcs" : [ ],
      "ts" : 1418257729,
      "u_at" : ISODate("2015-01-13T00:32:10.954Z")
    }
  ],
  "po" : 0,
  "se" : 0,
  "st" : 0,
  "tap3c" : [ ],
  "tzo" : -21600,
  "u_at" : ISODate("2015-01-13T00:32:10.952Z"),
  "uid" : ObjectId("545eb753ae9237b1df115649")
}
I want to use Pig to filter a specific _id range. I can write the mongo query like this:
db.activity_day.find({_id:{$gt:ObjectId("54a48e000000000000000000"),$lt:ObjectId("54cd6c800000000000000000")}})
But I don't know how to write this in Pig. Does anyone know?
You could try using the mongo-hadoop connector for Pig; see mongo-hadoop: Usage with Pig.
Once you REGISTER the JARs (core, pig, and the Java driver), e.g. REGISTER /path-to/mongo-hadoop-pig-<version>.jar;, you could run the following via grunt:
SET mongo.input.query '{"_id":{"\$gt":{"\$oid":"54a48e000000000000000000"},"\$lt":{"\$oid":"54cd6c800000000000000000"}}}';
rangeActivityDay = LOAD 'mongodb://localhost:27017/database.collection' USING com.mongodb.hadoop.pig.MongoLoader();
DUMP rangeActivityDay;
You may want to use LIMIT before dumping the data as well.
The above was tested using: mongo-java-driver-3.0.0-rc1.jar, mongo-hadoop-pig-1.4.0.jar, mongo-hadoop-core-1.4.0.jar and MongoDB v3.0.9

Apache Spark: How to get Parquet output file size and records

I am a newbie in Apache Spark and I want to get the Parquet output file size.
My scenario is:
Read the file from CSV and save it as a text file
myRDD.saveAsTextFile("person.txt")
After saving the file, the Spark UI (localhost:4040) shows me inputBytes 15607801 and outputBytes 13551724,
but when I save it as a Parquet file
myDF.saveAsParquetFile("person.parquet")
the UI (localhost:4040), on the Stages tab, only shows me inputBytes 15607801 and there is nothing in outputBytes.
Can anybody help me? Thanks in advance.
Edit
When I call the REST API, it gives me the following response.
[ {
  "status" : "COMPLETE",
  "stageId" : 4,
  "attemptId" : 0,
  "numActiveTasks" : 0,
  "numCompleteTasks" : 1,
  "numFailedTasks" : 0,
  "executorRunTime" : 10955,
  "inputBytes" : 15607801,
  "inputRecords" : 1440721,
  **"outputBytes" : 0,**
  **"outputRecords" : 0,**
  "shuffleReadBytes" : 0,
  "shuffleReadRecords" : 0,
  "shuffleWriteBytes" : 0,
  "shuffleWriteRecords" : 0,
  "memoryBytesSpilled" : 0,
  "diskBytesSpilled" : 0,
  "name" : "saveAsParquetFile at ParquetExample.scala:82",
  "details" : "org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:1494)\ncom.spark.sql.ParquetExample$.main(ParquetExample.scala:82)\ncom.spark.sql.ParquetExample.main(ParquetExample.scala)",
  "schedulingPool" : "default",
  "accumulatorUpdates" : [ ]
}, {
  "status" : "COMPLETE",
  "stageId" : 3,
  "attemptId" : 0,
  "numActiveTasks" : 0,
  "numCompleteTasks" : 1,
  "numFailedTasks" : 0,
  "executorRunTime" : 2091,
  "inputBytes" : 15607801,
  "inputRecords" : 1440721,
  **"outputBytes" : 13551724,**
  **"outputRecords" : 1200540,**
  "shuffleReadBytes" : 0,
  "shuffleReadRecords" : 0,
  "shuffleWriteBytes" : 0,
  "shuffleWriteRecords" : 0,
  "memoryBytesSpilled" : 0,
  "diskBytesSpilled" : 0,
  "name" : "saveAsTextFile at ParquetExample.scala:77",
  "details" : "org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1379)\ncom.spark.sql.ParquetExample$.main(ParquetExample.scala:77)\ncom.spark.sql.ParquetExample.main(ParquetExample.scala)",
  "schedulingPool" : "default",
  "accumulatorUpdates" : [ ]
} ]
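The REST payload above reports outputBytes as 0 for the saveAsParquetFile stage, so the stage metrics alone will not give the Parquet size here. One workaround (a hedged sketch, not from the original post) is to measure the written output directly with the Hadoop FileSystem API and count the records by reading the files back:
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumes an existing SparkContext `sc` and SQLContext `sqlContext`, and that
// "person.parquet" is the directory written by saveAsParquetFile.
val fs = FileSystem.get(sc.hadoopConfiguration)
val outputPath = new Path("person.parquet")

// Total bytes under the output directory (includes _SUCCESS and metadata files).
val totalBytes = fs.getContentSummary(outputPath).getLength
println(s"Parquet output size: $totalBytes bytes")

// Record count: read the Parquet output back and count the rows.
// sqlContext.read.parquet is available from Spark 1.4; on older versions use sqlContext.parquetFile.
val records = sqlContext.read.parquet("person.parquet").count()
println(s"Parquet output records: $records")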