I am currently working on a small project where I stream machine data (JSON format) from a Kafka topic for further analysis.
The JSON in the value column should be split into multiple columns with their corresponding values. The problem is that I never see the full content of the value column; the view always appears truncated.
Reading the stream:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "20.86.42.36:9092")
.option("subscribe", "machine1")
.load()
display(df)
Result:
DataFrame with the Base64-encoded message
My first problem was that I received the data in binary, which I resolved by casting it to string, using this code:
val df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
Result: (screenshot of the DataFrame with key and value cast to strings)
Now I still have the problem that I do not see the full column values, which makes it hard for me to transform the JSON data into separate columns.
I used display(df1) to print the dataframe.
Does anybody have an idea what I am doing wrong?
Try df.show(false) to print without truncation.
I suspect that the Databricks display function behaves similarly.
Also, you have not Base64-decoded anything. Casting to a string does not parse JSON; it only deserializes the UTF-8 bytes from the topic (Databricks may render the raw binary as Base64 in display, but if simply casting returned readable JSON, then Base64 is not what is actually stored in Kafka).
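As a minimal sketch of both points, assuming a hypothetical two-field schema for the machine JSON (machineSchema below is only an illustration; replace it with the real structure), the truncation fix and the split into columns could look like this:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

// hypothetical schema -- replace with the fields your machine JSON actually contains
val machineSchema = new StructType()
  .add("machineId", StringType)
  .add("temperature", DoubleType)

val df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// print without truncating the column contents
df1.show(truncate = false)

// split the JSON payload into individual columns
val parsed = df1
  .withColumn("json", from_json(col("value"), machineSchema))
  .select("key", "json.*")

parsed.show(truncate = false)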
Related
I want to save a Spark DataFrame in Delta format to S3, but for some reason the data is not saved. I debugged all the processing steps and there was data; right before saving I ran count on the DataFrame, which returned 24 rows. But as soon as save is called, no data appears in the resulting folder. What could be the reason for this?
This is how I save the data:
df
.select(schema)
.repartition(partitionKeys.map(new ColumnName(_)): _*)
.sortWithinPartitions(sortByKeys.map(new ColumnName(_)): _*)
.write
.format("delta")
.partitionBy(partitionKeys: _*)
.mode(saveMode)
.save("s3a://etl-qa/data_feed")
There is a quick start from Databricks that explains how to read from and write to a Delta Lake.
If the DataFrame you are trying to save is called df, you need to execute:
df.write.format("delta").save(s3path)
I am using Spark 2.1.0 and Kafka 0.9.0.
I am trying to push the output of a batch Spark job to Kafka. The job is supposed to run every hour, but not as a streaming job.
While looking for an answer on the net I could only find Kafka integration with Spark Streaming, and nothing about integration with a batch job.
Does anyone know if such a thing is feasible?
Thanks
UPDATE:
As mentioned by user8371915, I tried to follow what was done in Writing the output of Batch Queries to Kafka.
I used the Spark shell:
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
Here is the simple code that I tried:
val df = Seq(("Rey", "23"), ("John", "44")).toDF("key", "value")
val newdf = df.select(to_json(struct(df.columns.map(column):_*)).alias("value"))
newdf.write.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("topic", "alerts").save()
But I get this error:
java.lang.RuntimeException: org.apache.spark.sql.kafka010.KafkaSourceProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:497)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
... 50 elided
Does anyone have an idea what this is related to?
Thanks
tl;dr You are using an outdated Spark version. Writes to Kafka are enabled in Spark 2.2 and later.
Out of the box you can use the Kafka SQL connector (the same one used with Structured Streaming). Include
spark-sql-kafka in your dependencies (a sample sbt line is sketched below).
Convert your data to a DataFrame containing at least a value column of type StringType or BinaryType.
Write data to Kafka:
df
.write
.format("kafka")
.option("kafka.bootstrap.servers", server)
// a target topic must also be provided, either via .option("topic", ...) or a "topic" column in df
.save()
Follow Structured Streaming docs for details (starting with Writing the output of Batch Queries to Kafka).
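For the dependency step mentioned above, a minimal sbt line might look like the following; the version shown is an assumption and has to match your Spark (2.2+) and Scala build:

// build.sbt -- version is an assumption; align it with your Spark and Scala versions
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"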
If you have a DataFrame and you want to write it to a Kafka topic, you first need to convert the columns into a single "value" column that contains the data in JSON format. In Scala it is:
import org.apache.spark.sql.functions._
val kafkaServer: String = "localhost:9092"
val topicSampleName: String = "kafkatopic"
df.select(to_json(struct("*")).as("value"))
.selectExpr("CAST(value AS STRING)")
.write
.format("kafka")
.option("kafka.bootstrap.servers", kafkaServer)
.option("topic", topicSampleName)
.save()
For this error
java.lang.RuntimeException: org.apache.spark.sql.kafka010.KafkaSourceProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
I think you need to convert the message into a key/value pair. Your DataFrame should have a value column.
Let's say you have a DataFrame with student_id and score columns:
df.show()
>> student_id | score
   1          | 99.00
   2          | 98.00
Then you should transform the DataFrame so that it has a single value column:
value
{"student_id":1,"score":99.00}
{"student_id":2,"score":98.00}
To convert it, you can use code like this:
df.select(to_json(struct($"student_id",$"score")).alias("value"))
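Putting the example together, a sketch of the full conversion and write in the Spark shell (or with a SparkSession named spark) might look like this; the broker address and topic name are placeholders:

import org.apache.spark.sql.functions.{struct, to_json}
import spark.implicits._

val studentDf = Seq((1, 99.00), (2, 98.00)).toDF("student_id", "score")

studentDf
  .select(to_json(struct($"student_id", $"score")).alias("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")   // placeholder broker
  .option("topic", "students")                           // placeholder topic
  .save()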
I am new to Spark Streaming. I am trying Structured Streaming with local CSV files, and I am getting the exception below while processing.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[file:///home/Teju/Desktop/SparkInputFiles/*.csv]
This is my code.
val df = spark
.readStream
.format("csv")
.option("header", "false") // Use first line of all files as header
.option("delimiter", ":") // Specifying the delimiter of the input file
.schema(inputdata_schema) // Specifying the schema for the input file
.load("file:///home/Teju/Desktop/SparkInputFiles/*.csv")
val filterop = spark.sql("select tagShortID,Timestamp,ListenerShortID,rootOrgID,subOrgID,first(rssi_weightage(RSSI)) as RSSI_Weight from my_table where RSSI > -127 group by tagShortID,Timestamp,ListenerShortID,rootOrgID,subOrgID order by Timestamp ASC")
val outStream = filterop.writeStream.outputMode("complete").format("console").start()
I created a cron job so that every 5 minutes I get one input CSV file, which I am trying to process with Spark Streaming.
(This is not a solution but more of a comment; given its length it ended up here. I am going to turn it into a proper answer once I have collected enough information for the investigation.)
My guess is that you are doing something incorrect with df that you have not included in your question.
Since the error message is about a FileSource with the path below, and that is a streaming dataset, it must be df that is in play.
FileSource[file:///home/Teju/Desktop/SparkInputFiles/*.csv]
Given the other lines, I guess that you register the streaming dataset as a temporary table (i.e. my_table), which you then use in spark.sql to execute SQL and writeStream to the console.
df.createOrReplaceTempView("my_table")
If that's correct, the code you've included in the question is incomplete and does not show the reason for the error.
Add .writeStream.start to your df, as the Exception is telling you.
Read the docs for more detail.
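If that guess is right, a sketch of the corrected end-to-end flow looks like this; inputdata_schema and the rssi_weightage UDF come from the question and are assumed to be defined and registered elsewhere:

val df = spark
  .readStream
  .format("csv")
  .option("header", "false")      // the input files do not have a header row
  .option("delimiter", ":")
  .schema(inputdata_schema)
  .load("file:///home/Teju/Desktop/SparkInputFiles/*.csv")

// register the streaming DataFrame so it can be queried with spark.sql
df.createOrReplaceTempView("my_table")

val filterop = spark.sql(
  "select tagShortID, Timestamp, ListenerShortID, rootOrgID, subOrgID, " +
  "first(rssi_weightage(RSSI)) as RSSI_Weight from my_table " +
  "where RSSI > -127 " +
  "group by tagShortID, Timestamp, ListenerShortID, rootOrgID, subOrgID " +
  "order by Timestamp ASC")

// every streaming query must be started via writeStream ... start()
val outStream = filterop.writeStream
  .outputMode("complete")
  .format("console")
  .start()

outStream.awaitTermination()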
I have a Hive table column (Json_String String) with about 1000 rows, where each row is a JSON document with the same structure. I am trying to read the JSON into a DataFrame as below:
val df = sqlContext.read.json("select Json_String from json_table")
but it throws the exception below:
java.io.IOException: No input paths specified in job
Is there any way to read all the rows into a DataFrame, as we do with JSON files using a wildcard?
val df = sqlContext.read.json("file:///home/*.json")
I think what you are asking for is to read the Hive table as usual and transform the JSON column using the from_json function.
from_json(e: Column, schema: StructType): Column Parses a column containing a JSON string into a StructType with the specified schema. Returns null, in the case of an unparseable string.
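For readers on Spark 2.1.0 or later, the usage might look like the following sketch; the schema is an assumption and has to match the actual structure of Json_String:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// hypothetical schema -- replace with the real fields of Json_String
val jsonSchema = new StructType()
  .add("id", StringType)
  .add("name", StringType)

val parsed = spark.table("json_table")
  .withColumn("parsed", from_json(col("Json_String"), jsonSchema))
  .select("parsed.*")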
Given that you use sqlContext in your code, I am afraid you are on Spark < 2.1.0, which does not offer from_json (it was added in 2.1.0).
The solution then is to use a custom user-defined function (UDF) to do the parsing yourself.
val df = sqlContext.read.json("select Json_String from json_table")
The above will not work, since the json operator expects a path or paths to JSON files on disk, not the result of executing a query against a Hive table.
json(paths: String*): DataFrame Loads a JSON file (JSON Lines text format or newline-delimited JSON) and returns the result as a DataFrame.
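On Spark versions before 2.1.0, a hedged sketch of the UDF route could look like this, using json4s (which ships with Spark); the extracted field name is purely illustrative:

import org.apache.spark.sql.functions.{col, udf}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// extract one illustrative field; extend this for the fields you actually need
val extractName = udf { json: String =>
  implicit val formats: Formats = DefaultFormats
  (parse(json) \ "name").extractOpt[String].orNull
}

val df = sqlContext.table("json_table")
  .withColumn("name", extractName(col("Json_String")))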
Is it possible to convert RDD[CassandraRow] to RDD[String]? If so, is there any disadvantage to working with the converted RDD?
You can use sqlContext to read data from a Cassandra table; it returns a DataFrame. When you read a text file using sparkContext it returns an RDD, which you can then convert to a DataFrame.
If your text files are CSV, Spark 2.0 supports a csv data source, which returns a DataFrame by default. Please see https://spark.apache.org/releases/spark-release-2-0-0.html#new-features and https://github.com/databricks/spark-csv/issues/
Update:
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
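For the original question, here is a minimal sketch of both routes, assuming the spark-cassandra-connector is on the classpath and using placeholder keyspace/table names:

import com.datastax.spark.connector._

// direct conversion: RDD[CassandraRow] -> RDD[String]
val rows = sc.cassandraTable("my_keyspace", "my_table")   // RDD[CassandraRow]
val asStrings = rows.map(_.toString)                      // e.g. "CassandraRow{col1: v1, ...}"

// the DataFrame route described above
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

The main disadvantage of the String conversion is that you lose the column structure and types, so any further processing has to parse the strings again.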