Hi I am trying to read tweets from Twitter using Apache Spark Streaming and trying to convert to a DataFrame. I have the approach that I have pasted below. However, I am not beign able to get the correct approach. Some pointers would be welcome.
As you can see converting to DF inside the foreach does not get me a single DF from tweetStream. I probably have the wrong approach as I am new to this. How do I approach this?
val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth).filter(status=>status.getLang=="en")
.map(status=>gson.toJson(status))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
tweetStream.foreachRDD({status=>val DF = status.toDF()})
I have not tried it, but maybe something like this works:
var df_tweets:DataFrame = null
dstream_tweets.foreachRDD {
rrd => if (df_tweets != null) {
df_tweets = df_tweets.unionAll(rdd.toDF) // combine previous dataframe
} else {
df_tweets = rdd.toDF() // create new dataframe
}
}
Related
I have the below code, where I am pulling data from an API and storing it into a JSON file. Further I will be loading into an oracle table. Also, the data value of the ID column is the column name under VelocityEntries. I am able to print the data in completedEntries.values but I need help to put it into one df and add with the emedded_df
With the below code I am able to print the data in completedEntries.values but I need help to put it into one df and add with the emedded_df
val inputStream = scala.io.Source.fromInputStream(connection.getInputStream).mkString
val fileWriter1 = new FileWriter(new File(filename))
fileWriter1.write(inputStream.mkString)
fileWriter1.close()
val json_df = spark.read.option("multiLine", true).json(filename)
val embedded_df = json_df.select(explode(col("sprints")) as "x").select(("x.*"))
val list_df = json_df.select("velocityStatEntries.*").columns.toList
for( i <- list_df)
{
completed_df = json_df.select(s"velocityStatEntries.$i.completed.value")
completed_df.show()
}
I get tweets from kafka topic with Avro (serializer and deserializer).
Then i create a spark consumer which extracts tweets in Dstream of RDD[GenericRecord].
Now i want to convert each rdd to a dataframe to analyse these tweets via SQL.
Any solution to convert RDD[GenericRecord] to dataframe please ?
I spent some time trying to make this work (specially how deserialize the data properly but it looks like you already cover this) ... UPDATED
//Define function to convert from GenericRecord to Row
def genericRecordToRow(record: GenericRecord, sqlType : SchemaConverters.SchemaType): Row = {
val objectArray = new Array[Any](record.asInstanceOf[GenericRecord].getSchema.getFields.size)
import scala.collection.JavaConversions._
for (field <- record.getSchema.getFields) {
objectArray(field.pos) = record.get(field.pos)
}
new GenericRowWithSchema(objectArray, sqlType.dataType.asInstanceOf[StructType])
}
//Inside your stream foreachRDD
val yourGenericRecordRDD = ...
val schema = new Schema.Parser().parse(...) // your schema
val sqlType = SchemaConverters.toSqlType(new Schema.Parser().parse(strSchema))
var rowRDD = yourGeneircRecordRDD.map(record => genericRecordToRow(record, sqlType))
val df = sqlContext.createDataFrame(rowRDD , sqlType.dataType.asInstanceOf[StructType])
As you see, I am using a SchemaConverter to get the dataframe structure from the schema that you used to deserialize (this could be more painful with schema registry). For this you need the following dependency
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>3.2.0</version>
</dependency>
you will need to change your spark version depending on yours.
UPDATE: the code above only works for flat avro schemas.
For nested structures I used something different. You can copy the class SchemaConverters, it has to be inside of com.databricks.spark.avro (it uses some protected classes from the databricks package) or you can try to use the spark-bigquery dependency. The class will not be accessible by default, so you will need to create a class inside a package com.databricks.spark.avro to access the factory method.
package com.databricks.spark.avro
import com.databricks.spark.avro.SchemaConverters.createConverterToSQL
import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType
class SchemaConverterUtils {
def converterSql(schema : Schema, sqlType : StructType) = {
createConverterToSQL(schema, sqlType)
}
}
After that you should be able to convert the data like
val schema = .. // your schema
val sqlType = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]
....
//inside foreach RDD
var genericRecordRDD = deserializeAvroData(rdd)
///
var converter = SchemaConverterUtils.converterSql(schema, sqlType)
...
val rowRdd = genericRecordRDD.flatMap(record => {
Try(converter(record).asInstanceOf[Row]).toOption
})
//To DataFrame
val df = sqlContext.createDataFrame(rowRdd, sqlType)
A combination of https://stackoverflow.com/a/48828303/5957143 and https://stackoverflow.com/a/47267060/5957143 works for me.
I used the following to create MySchemaConversions
package com.databricks.spark.avro
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.DataType
object MySchemaConversions {
def createConverterToSQL(avroSchema: Schema, sparkSchema: DataType): (GenericRecord) => Row =
SchemaConverters.createConverterToSQL(avroSchema, sparkSchema).asInstanceOf[(GenericRecord) => Row]
}
And then I used
val myAvroType = SchemaConverters.toSqlType(schema).dataType
val myAvroRecordConverter = MySchemaConversions.createConverterToSQL(schema, myAvroType)
// unionedResultRdd is unionRDD[GenericRecord]
var rowRDD = unionedResultRdd.map(record => MyObject.myConverter(record, myAvroRecordConverter))
val df = sparkSession.createDataFrame(rowRDD , myAvroType.asInstanceOf[StructType])
The advantage of having myConverter in the object MyObject is that you will not encounter serialization issues (java.io.NotSerializableException).
object MyObject{
def myConverter(record: GenericRecord,
myAvroRecordConverter: (GenericRecord) => Row): Row =
myAvroRecordConverter.apply(record)
}
Even though something like this may help you,
val stream = ...
val dfStream = stream.transform(rdd:RDD[GenericRecord]=>{
val df = rdd.map(_.toSeq)
.map(seq=> Row.fromSeq(seq))
.toDF(col1,col2, ....)
df
})
I'd like to suggest you an alternate approach. With Spark 2.x you can skip the whole process of creating DStreams. Instead, you can do something like this with structured streaming,
val df = ss.readStream
.format("com.databricks.spark.avro")
.load("/path/to/files")
This will give you a single dataframe which you can directly query. Here, ss is the instance of spark session. /path/to/files is the place where all your avro files are being dumped from kafka.
PS: You may need to import spark-avro
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
Hope this helped. Cheers
You can use createDataFrame(rowRDD: RDD[Row], schema: StructType), which is available in the SQLContext object. Example for converting an RDD of an old DataFrame:
import sqlContext.implicits.
val rdd = oldDF.rdd
val newDF = oldDF.sqlContext.createDataFrame(rdd, oldDF.schema)
Note that there is no need to explicitly set any schema column. We reuse the old DF's schema, which is of StructType class and can be easily extended. However, this approach sometimes is not possible, and in some cases can be less efficient than the first one.
this is simple " how to " question::
We can bring data to Spark environment through com.databricks.spark.csv. I do know how to create HBase table through spark, and write data to the HBase tables manually. But is that even possible to load a text/csv/jason files directly to HBase through Spark? I cannot see anybody talking about it. So, just checking. If possible, please guide me to a good website that explains the scala code in detail to get it done.
Thank you,
There are multiple ways you can do that.
Spark Hbase connector:
https://github.com/hortonworks-spark/shc
You can see lot of examples on the link.
Also you can use SPark core to load the data to Hbase using HbaseConfiguration.
Code Example:
val fileRDD = sc.textFile(args(0), 2)
val transformedRDD = fileRDD.map { line => convertToKeyValuePairs(line) }
val conf = HBaseConfiguration.create()
conf.set(TableOutputFormat.OUTPUT_TABLE, "tableName")
conf.set("hbase.zookeeper.quorum", "localhost:2181")
conf.set("hbase.master", "localhost:60000")
conf.set("fs.default.name", "hdfs://localhost:8020")
conf.set("hbase.rootdir", "/hbase")
val jobConf = new Configuration(conf)
jobConf.set("mapreduce.job.output.key.class", classOf[Text].getName)
jobConf.set("mapreduce.job.output.value.class", classOf[LongWritable].getName)
jobConf.set("mapreduce.outputformat.class", classOf[TableOutputFormat[Text]].getName)
transformedRDD.saveAsNewAPIHadoopDataset(jobConf)
def convertToKeyValuePairs(line: String): (ImmutableBytesWritable, Put) = {
val cfDataBytes = Bytes.toBytes("cf")
val rowkey = Bytes.toBytes(line.split("\\|")(1))
val put = new Put(rowkey)
put.add(cfDataBytes, Bytes.toBytes("PaymentDate"), Bytes.toBytes(line.split("|")(0)))
put.add(cfDataBytes, Bytes.toBytes("PaymentNumber"), Bytes.toBytes(line.split("|")(1)))
put.add(cfDataBytes, Bytes.toBytes("VendorName"), Bytes.toBytes(line.split("|")(2)))
put.add(cfDataBytes, Bytes.toBytes("Category"), Bytes.toBytes(line.split("|")(3)))
put.add(cfDataBytes, Bytes.toBytes("Amount"), Bytes.toBytes(line.split("|")(4)))
return (new ImmutableBytesWritable(rowkey), put)
}
Also you can use this one
https://github.com/nerdammer/spark-hbase-connector
I am trying to convert a csv file to a dataframe in Spark 1.5.2 with Scala without the use of the library databricks, as it is a community project and this library is not available. My approach was the following:
var inputPath = "input.csv"
var text = sc.textFile(inputPath)
var rows = text.map(line => line.split(",").map(_.trim))
var header = rows.first()
var data = rows.filter(_(0) != header(0))
var df = sc.makeRDD(1 to data.count().toInt).map(i => (data.take(i).drop(i-1)(0)(0), data.take(i).drop(i-1)(0)(1), data.take(i).drop(i-1)(0)(2), data.take(i).drop(i-1)(0)(3), data.take(i).drop(i-1)(0)(4))).toDF(header(0), header(1), header(2), header(3), header(4))
This code, even though it is quite a mess, works without returning any error messages. The problem comes when trying to display the data inside dfin order to verify the correctness of this method and later try to do some queries in df. The error code I am getting after executing df.show() is SPARK-5063. My questions are:
1) Why is it not possible to print the content of df?
2) Is there any other more straightforward method to convert a csv to a dataframe in Spark 1.5.2 without using the library databricks?
For spark 1.5.x can be used code snippet below to convert input into DF
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the DataClass interface with 5 fields.
case class DataClass(id: Int, name: String, surname: String, bdate: String, address: String)
// Create an RDD of DataClass objects and register it as a table.
val peopleData = sc.textFile("input.csv").map(_.split(",")).map(p => DataClass(p(0).trim.toInt, p(1).trim, p(2).trim, p(3).trim, p(4).trim)).toDF()
peopleData.registerTempTable("dataTable")
val peopleDataFrame = sqlContext.sql("SELECT * from dataTable")
peopleDataFrame.show()
Spark 1.5
You can create like this:
SparkSession spark = SparkSession
.builder()
.appName("RDDtoDF_Updated")
.master("local[2]")
.config("spark.some.config.option", "some-value")
.getOrCreate();
StructType schema = DataTypes
.createStructType(new StructField[] {
DataTypes.createStructField("eid", DataTypes.IntegerType, false),
DataTypes.createStructField("eName", DataTypes.StringType, false),
DataTypes.createStructField("eAge", DataTypes.IntegerType, true),
DataTypes.createStructField("eDept", DataTypes.IntegerType, true),
DataTypes.createStructField("eSal", DataTypes.IntegerType, true),
DataTypes.createStructField("eGen", DataTypes.StringType,true)});
String filepath = "F:/Hadoop/Data/EMPData.txt";
JavaRDD<Row> empRDD = spark.read()
.textFile(filepath)
.javaRDD()
.map(line -> line.split("\\,"))
.map(r -> RowFactory.create(Integer.parseInt(r[0]), r[1].trim(),Integer.parseInt(r[2]),
Integer.parseInt(r[3]),Integer.parseInt(r[4]),r[5].trim() ));
Dataset<Row> empDF = spark.createDataFrame(empRDD, schema);
empDF.groupBy("edept").max("esal").show();
Using Spark with Scala.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
var hiveCtx = new HiveContext(sc)
var inputPath = "input.csv"
var text = sc.textFile(inputPath)
var rows = text.map(line => line.split(",").map(_.trim)).map(a => Row.fromSeq(a))
var header = rows.first()
val schema = StructType(header.map(fieldName => StructField(fieldName.asInstanceOf[String],StringType,true)))
val df = hiveCtx.createDataframe(rows,schema)
This should work.
But for creating dataframe, would recommend you to use Spark-CSV.
I am writing a spark job, trying to read a text file using scala, the following works fine on my local machine.
val myFile = "myLocalPath/myFile.csv"
for (line <- Source.fromFile(myFile).getLines()) {
val data = line.split(",")
myHashMap.put(data(0), data(1).toDouble)
}
Then I tried to make it work on AWS, I did the following, but it didn't seem to read the entire file properly. What should be the proper way to read such text file on s3? Thanks a lot!
val credentials = new BasicAWSCredentials("myKey", "mySecretKey");
val s3Client = new AmazonS3Client(credentials);
val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myFile.csv"));
val reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()));
var line = ""
while ((line = reader.readLine()) != null) {
val data = line.split(",")
myHashMap.put(data(0), data(1).toDouble)
println(line);
}
I think I got it work like below:
val s3Object= s3Client.getObject(new GetObjectRequest("myBucket", "myPath/myFile.csv"));
val myData = Source.fromInputStream(s3Object.getObjectContent()).getLines()
for (line <- myData) {
val data = line.split(",")
myMap.put(data(0), data(1).toDouble)
}
println(" my map : " + myMap.toString())
Read in csv-file with sc.textFile("s3://myBucket/myFile.csv"). That will give you an RDD[String]. Get that into a map
val myHashMap = data.collect
.map(line => {
val substrings = line.split(" ")
(substrings(0), substrings(1).toDouble)})
.toMap
You can the use sc.broadcast to broadcast your map, so that it is readily available on all your worker nodes.
(Note that you can of course also use the Databricks "spark-csv" package to read in the csv-file if you prefer.)
This can be acheived even withoutout importing amazons3 libraries using SparkContext textfile. Use the below code
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
val s3Login = "s3://AccessKey:Securitykey#Externalbucket"
val filePath = s3Login + "/Myfolder/myscv.csv"
for (line <- sc.textFile(filePath).collect())
{
var data = line.split(",")
var value1 = data(0)
var value2 = data(1).toDouble
}
In the above code, sc.textFile will read the data from your file and store in the line RDD. It then split each line with , to a different RDD data inside the loop. Then you can access values from this RDD with the index.