I am writing a batch application to consume Kafka events and write them to a GCS location. I have tried deleting the checkpoint location and also verified that Kafka has 200 messages to consume.
Spark - 2.4.8
Scala - 2.12
Submit Command : spark-shell --master yarn --packages org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.8
import org.apache.spark.sql.streaming.Trigger
val readInputKafkaDataNew = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "changefeed")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
  .writeStream
  .outputMode("append")
  .format("csv")
  .option("path", "gs://test-data-today/test_data/abc")
  .option("checkpointLocation", "gs://test-data-today/test_data/chkdir")
  .trigger(Trigger.Once())
  .start()
  .awaitTermination()
The console log shows that the query is committing offsets, but no rows are read (numInputRows is 0):
22/06/24 22:25:35 INFO org.apache.spark.sql.execution.streaming.MicroBatchExecution: Streaming query made progress: {
"id" : "df24d47d-1dbb-4124-9d4f-0e4d9e6a0275",
"runId" : "e0f330e7-fdcf-47ba-85c6-417afbb3cff9",
"name" : null,
"timestamp" : "2022-06-24T22:25:23.257Z",
"batchId" : 0,
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"addBatch" : 5675,
"getBatch" : 1,
"getEndOffset" : 0,
"queryPlanning" : 17,
"setOffsetRange" : 4870,
"triggerExecution" : 12018,
"walCommit" : 715
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaV2[Subscribe[changefeed]]",
"startOffset" : null,
"endOffset" : {
"changefeed" : {
"2" : 34,
"5" : 28,
"4" : 26,
"1" : 39,
"3" : 49,
"0" : 30
}
},
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "FileSink[gs://gs://test-data-today/test_data/abc]"
}
}
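One thing worth checking (an assumption based on the progress log, not something the log alone proves): for a streaming Kafka source the default startingOffsets is "latest", so a query started with a fresh checkpoint records the current end offsets as its starting point and reads nothing in batch 0. A minimal sketch of the same job with startingOffsets set to earliest (same topic and GCS paths as above):

import org.apache.spark.sql.streaming.Trigger

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "changefeed")
  // streaming queries default to "latest"; "earliest" picks up messages already on the topic
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .outputMode("append")
  .format("csv")
  .option("path", "gs://test-data-today/test_data/abc")
  .option("checkpointLocation", "gs://test-data-today/test_data/chkdir")
  .trigger(Trigger.Once())
  .start()
  .awaitTermination()

Note that startingOffsets only applies when the checkpoint is empty; once offsets are committed there, later runs resume from the checkpoint. For a one-shot job that needs no checkpoint at all, the batch reader (spark.read.format("kafka")) defaults to reading from earliest to latest.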
Related
I have to join two Spark DataFrames in Scala based on a custom function. Both DataFrames have the same schema.
Sample Row of data in DF1:
{
"F1" : "A",
"F2" : "B",
"F3" : "C",
"F4" : [
{
"name" : "N1",
"unit" : "none",
"count" : 50.0,
"sf1" : "val_1",
"sf2" : "val_2"
},
{
"name" : "N2",
"unit" : "none",
"count" : 100.0,
"sf1" : "val_3",
"sf2" : "val_4"
}
]
}
Sample Row of data in DF2:
{
"F1" : "A",
"F2" : "B",
"F3" : "C",
"F4" : [
{
"name" : "N1",
"unit" : "none",
"count" : 80.0,
"sf1" : "val_5",
"sf2" : "val_6"
},
{
"name" : "N2",
"unit" : "none",
"count" : 90.0,
"sf1" : "val_7",
"sf2" : "val_8"
},
{
"name" : "N3",
"unit" : "none",
"count" : 99.0,
"sf1" : "val_9",
"sf2" : "val_10"
}
]
}
RESULT of Joining these sample rows:
{
"F1" : "A",
"F2" : "B",
"F3" : "C",
"F4" : [
{
"name" : "N1",
"unit" : "none",
"count" : 80.0,
"sf1" : "val_5",
"sf2" : "val_6"
},
{
"name" : "N2",
"unit" : "none",
"count" : 100.0,
"sf1" : "val_3",
"sf2" : "val_4"
},
{
"name" : "N3",
"unit" : "none",
"count" : 99.0,
"sf1" : "val_9",
"sf2" : "val_10"
}
]
}
The result is:
a full outer join based on the values of "F1", "F2" and "F3", plus
a merge of "F4", keeping unique nodes (using "name" as the id) with the maximum value of "count".
I am not very familiar with Scala and have been struggling with this for more than a day now. Here is what I have gotten to so far:
val df1 = sqlContext.read.parquet("stack_a.parquet")
val df2 = sqlContext.read.parquet("stack_b.parquet")
val df4 = df1.toDF(df1.columns.map(_ + "_A"):_*)
val df5 = df2.toDF(df1.columns.map(_ + "_B"):_*)
val df6 = df4.join(df5, df4("F1_A") === df5("F1_B") && df4("F2_A") === df5("F2_B") && df4("F3_A") === df5("F3_B"), "outer")
def joinFunction(r:Row) = {
//Need the real-deal here!
//print(r(3)) //-->Any = WrappedArray([..])
//also considering parsing as json to do the processing but not sure about the performance impact
//val parsed = JSON.parseFull(r.json) //then play with parsed
r.toSeq //
}
val finalResult = df6.rdd.map(joinFunction)
finalResult.collect
I was planning to add the custom merge logic in joinFunction but I am struggling to convert the WrappedArray/Any class to something I can work with.
Any inputs on how to do the conversion or the join in a better way will be very helpful.
Thanks!
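For reference, a minimal sketch (not the full merge logic) of pulling the nested array out of the joined Row without going through JSON; the column names F4_A/F4_B follow the renaming above:

import org.apache.spark.sql.Row

def joinFunction(r: Row): Seq[Row] = {
  // An array-of-struct column is backed by a WrappedArray[Row], so getAs[Seq[Row]] works;
  // the outer join can leave either side null, hence the Option wrapping.
  val f4a = Option(r.getAs[Seq[Row]]("F4_A")).getOrElse(Seq.empty[Row])
  val f4b = Option(r.getAs[Seq[Row]]("F4_B")).getOrElse(Seq.empty[Row])
  // Each element can now be read field by field, e.g. elem.getAs[String]("name")
  // or elem.getAs[Double]("count"), instead of being handled as Any.
  f4a ++ f4b
}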
Edit (7 Mar, 2021)
The full-outer join actually has to be performed only on "F1".
Hence, using #werner's answer, I am doing:
val df1_a = df1.toDF(df1.columns.map(_ + "_A"):_*)
val df2_b = df2.toDF(df2.columns.map(_ + "_B"):_*)
val finalResult = df1_a.join(df2_b, df1_a("F1_A") === df2_b("F1_B"), "full_outer")
.drop("F1_B")
.withColumn("F4", joinFunction(col("F4_A"), col("F4_B")))
.drop("F4_A", "F4_B")
.withColumn("F2", when(col("F2_A").isNull, col("F2_B")).otherwise(col("F2_A")))
.drop("F2_A", "F2_B")
.withColumn("F3", when(col("F3_A").isNull, col("F3_B")).otherwise(col("F3_A")))
.drop("F3_A", "F3_B")
But I am getting this error. What am I missing?
You can implement the merge logic with the help of a udf:
//case class to define the schema of the udf's return value
case class F4(name: String, unit: String, count: Double, sf1: String, sf2: String)
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for the 'F4_A symbol column syntax
val joinFunction = udf((a: Seq[Row], b: Seq[Row]) =>
  (a ++ b).map(r => F4(r.getAs[String]("name"),
      r.getAs[String]("unit"),
      r.getAs[Double]("count"),
      r.getAs[String]("sf1"),
      r.getAs[String]("sf2")))
    //group the elements from both arrays by name
    .groupBy(_.name)
    //take the element with the max count from each group
    .map { case (_, d) => d.maxBy(_.count) }
    .toSeq)
//join the two dataframes
val finalResult = df1.withColumnRenamed("F4", "F4_A").join(
    df2.withColumnRenamed("F4", "F4_B"), Seq("F1", "F2", "F3"), "full_outer")
  //call the merge function
  .withColumn("F4", joinFunction('F4_A, 'F4_B))
  //drop the intermediate columns
  .drop("F4_A", "F4_B")
I am getting status messages of the following form from a Spark Structured Streaming application:
18/02/12 16:38:54 INFO StreamExecution: Streaming query made progress: {
"id" : "a6c37f0b-51f4-47c5-a487-8bd269b80142",
"runId" : "061e41b4-f488-4483-a290-403f1f7eff03",
"name" : null,
"timestamp" : "2018-02-12T11:08:54.323Z",
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"getOffset" : 30,
"triggerExecution" : 46
},
"eventTime" : {
"watermark" : "1970-01-01T00:00:00.000Z"
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "FileStreamSource[file:/home/chiralcarbon/IdeaProjects/spark_structured_streaming/args[0]]",
"startOffset" : null,
"endOffset" : null,
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ConsoleSink#bcc171"
}
}
All of the messages have numInputRows with value 0.
The program streams data from a Parquet file and outputs the same stream to the console. The code follows:
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder.
master("local")
.appName("sparkSession")
.getOrCreate()
val schema = ..
val in = spark.readStream
.schema(schema)
.parquet("args[0]")
val query = in.writeStream
.format("console")
.outputMode("append")
.start()
query.awaitTermination()
}
}
What is the cause and how do I resolve this?
You have an error in readStream:
val in = spark.readStream
.schema(schema)
.parquet("args[0]")
You probably want to read from the directory provided in the first argument. Use direct element access or string interpolation instead:
val in = spark.readStream
.schema(schema)
.parquet(args(0))
or, if the expression is longer or needs some concatenation, change the last line to:
.parquet(s"${args(0)}")
Currently your code tries to read from a non-existent directory (literally named args[0]), so no files are read. After the change, the directory is resolved correctly and Spark will start reading files.
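To make the difference concrete, a tiny illustration (the argument value here is hypothetical):

val args = Array("/data/input")   // hypothetical program arguments

"args[0]"           // the literal string "args[0]", which is what the original code passes
args(0)             // "/data/input", element access, what was intended
s"${args(0)}/day=1" // "/data/input/day=1", interpolation for paths built from several pieces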
I'm currently exploring Apache Drill, running in cluster mode. My data source is MongoDB, and the source table contains 5 million documents. I can't execute a simple query:
select body from mongo.twitter.tweets limit 10;
It throws this exception:
Query Failed: An Error Occurred
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: IndexOutOfBoundsException: index: 0, length: 264 (expected: range(0, 256))
Fragment 1:2
[Error Id: 8903127a-e9e9-407e-8afc-2092b4c03cf0 on test01.css.org:31010]
  (java.lang.IndexOutOfBoundsException) index: 0, length: 264 (expected: range(0, 256))
    io.netty.buffer.AbstractByteBuf.checkIndex():1134
    io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes():272
    io.netty.buffer.WrappedByteBuf.setBytes():390
    io.netty.buffer.UnsafeDirectLittleEndian.setBytes():30
    io.netty.buffer.DrillBuf.setBytes():753
    io.netty.buffer.AbstractByteBuf.setBytes():510
    org.apache.drill.exec.store.bson.BsonRecordReader.writeString():265
    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
    org.apache.drill.exec.store.bson.BsonRecordReader.write():75
    org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
    org.apache.drill.exec.physical.impl.ScanBatch.next():178
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745
A working query that fetches results:
select body from mongo.twitter.tweets where tweet_id = 'tag:search.twitter.com,2005:xxxxxxxxxx';
Sample document in source
{
"_id" : ObjectId("58402ad5757d7fede822e641"),
"rule_list" : [
"x",
"(contains:x (contains:y OR contains:y1)) OR (contains:v contains:b) OR (contains:v (contains:r OR contains:t))"
],
"actor_friends_count" : 79,
"klout_score" : 19,
"actor_favorites_count" : 0,
"actor_preferred_username" : "xxxxxxx",
"sentiment" : "neg",
"tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxx",
"object_actor_followers_count" : 1286,
"actor_posted_time" : "2016-07-16T14:08:25.000Z",
"actor_id" : "id:twitter.com:xxxxxxxx",
"actor_display_name" : "xxxxx",
"retweet_count" : 6,
"hashtag_list" : [
"myhashtag"
],
"body" : "my tweet body",
"actor_followers_count" : 25,
"actor_status_count" : 243,
"verb" : "share",
"posted_time" : "2016-08-01T07:49:00.000Z",
"object_actor_status_count" : 206,
"lang" : "ar",
"object_actor_preferred_username" : "xxxxxx",
"original_tweet_id" : "tag:search.twitter.com,2005:xxxxxx",
"gender" : "male",
"object_actor_id" : "id:twitter.com:xxxxxxx",
"favorites_count" : 0,
"object_posted_time" : "2016-06-20T04:12:02.000Z",
"object_actor_friends_count" : 2516,
"generator_display_name" : "Twitter for iPhone",
"object_actor_display_name" : "sdfsf",
"actor_listed_count" : 0
}
Any help is appreciated!
set store.mongo.bson.record.reader = false;
This disables Drill's BSON record reader for the MongoDB storage plugin (the failing frames in the stack trace are all in BsonRecordReader), so the Mongo plugin falls back to its JSON-based reader. Re-run the query after changing the option.
I have mongo documents like this:
db.activity_days.findOne()
{
"_id" : ObjectId("54b4ee617acf9ce0440a3185"),
"aca" : 0,
"ca" : 0,
"cbdw" : true,
"day" : ISODate("2014-12-10T00:00:00Z"),
"dm" : 0,
"fbc" : 0,
"go" : 2500,
"gs" : [ ],
"its" : [
{
"_id" : ObjectId("551ac8d44f9f322e2b055d3a"),
"at" : 2000,
"atn" : "Running",
"cas" : 386.514909469507,
"dis" : 2.788989730832084,
"du" : 1472,
"ibr" : false,
"ide" : false,
"lcs" : false,
"pt" : 0,
"rpt" : 0,
"src" : 1001,
"stp" : 0,
"tcs" : [ ],
"ts" : 1418257729,
"u_at" : ISODate("2015-01-13T00:32:10.954Z")
}
],
"po" : 0,
"se" : 0,
"st" : 0,
"tap3c" : [ ],
"tzo" : -21600,
"u_at" : ISODate("2015-01-13T00:32:10.952Z"),
"uid" : ObjectId("545eb753ae9237b1df115649")
}
I want to use Pig to filter a specific _id range. I can write the mongo query like this:
db.activity_day.find({_id: {$gt: ObjectId("54a48e000000000000000000"), $lt: ObjectId("54cd6c800000000000000000")}})
But I don't know how to write this in Pig. Does anyone know?
You could try using the mongo-hadoop connector for Pig; see mongo-hadoop: Usage with Pig.
Once you REGISTER the JARs (core, pig, and the Java driver) in grunt, e.g. REGISTER /path-to/mongo-hadoop-pig-<version>.jar;, you could run:
SET mongo.input.query '{"_id":{"\$gt":{"\$oid":"54a48e000000000000000000"},"\$lt":{"\$oid":"54cd6c800000000000000000"}}}';
rangeActivityDay = LOAD 'mongodb://localhost:27017/database.collection' USING com.mongodb.hadoop.pig.MongoLoader();
DUMP rangeActivityDay;
You may want to use LIMIT before dumping the data as well.
The above was tested using: mongo-java-driver-3.0.0-rc1.jar, mongo-hadoop-pig-1.4.0.jar, mongo-hadoop-core-1.4.0.jar and MongoDB v3.0.9
I am a newbie in Apache Spark and I want to get the Parquet output file size.
My scenario is:
Read the file from CSV and save it as a text file
myRDD.saveAsTextFile("person.txt")
After saving the file, the Spark UI (localhost:4040) shows me inputBytes 15607801 and outputBytes 13551724.
But when I save it as a Parquet file,
myDF.saveAsParquetFile("person.perquet")
the UI (localhost:4040) Stages tab only shows me inputBytes 15607801 and there is nothing in outputBytes.
Can anybody help me? Thanks in advance.
Edit
When I call the REST API, it gives me the following response.
[ {
"status" : "COMPLETE",
"stageId" : 4,
"attemptId" : 0,
"numActiveTasks" : 0,
"numCompleteTasks" : 1,
"numFailedTasks" : 0,
"executorRunTime" : 10955,
"inputBytes" : 15607801,
"inputRecords" : 1440721,
**"outputBytes" : 0,**
**"outputRecords" : 0,**
"shuffleReadBytes" : 0,
"shuffleReadRecords" : 0,
"shuffleWriteBytes" : 0,
"shuffleWriteRecords" : 0,
"memoryBytesSpilled" : 0,
"diskBytesSpilled" : 0,
"name" : "saveAsParquetFile at ParquetExample.scala:82",
"details" : "org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:1494)\ncom.spark.sql.ParquetExample$.main(ParquetExample.scala:82)\ncom.spark.sql.ParquetExample.main(ParquetExample.scala)",
"schedulingPool" : "default",
"accumulatorUpdates" : [ ]
}, {
"status" : "COMPLETE",
"stageId" : 3,
"attemptId" : 0,
"numActiveTasks" : 0,
"numCompleteTasks" : 1,
"numFailedTasks" : 0,
"executorRunTime" : 2091,
"inputBytes" : 15607801,
"inputRecords" : 1440721,
**"outputBytes" : 13551724,**
**"outputRecords" : 1200540,**
"shuffleReadBytes" : 0,
"shuffleReadRecords" : 0,
"shuffleWriteBytes" : 0,
"shuffleWriteRecords" : 0,
"memoryBytesSpilled" : 0,
"diskBytesSpilled" : 0,
"name" : "saveAsTextFile at ParquetExample.scala:77",
"details" : "org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1379)\ncom.spark.sql.ParquetExample$.main(ParquetExample.scala:77)\ncom.spark.sql.ParquetExample.main(ParquetExample.scala)",
"schedulingPool" : "default",
"accumulatorUpdates" : [ ]
} ]
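If the UI/REST metric stays at zero for the Parquet save, one workaround is to measure the written files directly. A minimal sketch using Hadoop's FileSystem API, assuming sc is the SparkContext (as in the shell) and the output path is the one from the question:

import org.apache.hadoop.fs.Path

// Sum the on-disk size of everything under the Parquet output directory.
val outPath = new Path("person.perquet")
val fs = outPath.getFileSystem(sc.hadoopConfiguration)
val parquetBytes = fs.getContentSummary(outPath).getLength
println(s"Parquet output size: $parquetBytes bytes")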