Data parallelism in Spark: reading Avro data from HDFS - Scala

I am trying to read Avro data using Scala in a Spark environment. My data is not getting distributed: when the job runs it goes to only 2 nodes, even though we have 20+ nodes. Here is my code snippet:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

case class My_Class(My_ID: String) // case classes are serializable by default

val filePath = "hdfs://path"
val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](filePath)

// Pull the My_ID field out of each Avro record
val rddprsid = avroRDD.map(A => My_Class(A._1.datum.get("My_ID").toString))

val uploadFilter = rddprsid.filter(E => E.My_ID != null)
val as = uploadFilter.distinct(100).count
I am not able to use the parallelize operation on the RDD, as it complains with the following error:
<console>:30: error: type mismatch;
found : org.apache.spark.rdd.RDD[(org.apache.avro.mapred.AvroWrapper[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)]
required: Seq[?]
Can someone please help?

You are seeing only 2 nodes because a YARN submission defaults to 2 executors. You need to submit with --num-executors [NUMBER] and optionally --executor-cores [NUMBER].
As for parallelize: your data is already parallelized, which is exactly what the RDD wrapper represents. parallelize is only used to distribute in-memory data across the cluster.
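To illustrate the distinction, here is a minimal sketch (the avroRDD name comes from the question; the partition counts and example values are hypothetical):

// The Avro data is already an RDD, so it is already distributed; to get more
// tasks you change its partitioning rather than calling parallelize.
val morePartitions = avroRDD.repartition(200)

// parallelize is only for local, in-memory collections built on the driver.
val localIds = sc.parallelize(Seq("id1", "id2", "id3"), 4) // 4 partitions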

Related

Create a Neo4j graph from an HDFS file

I want to load the data that is in HDFS into a Neo4j graph. Following the suggestion given in this question, I used the neo4j-spark-connector, initializing it this way:
/usr/local/sparks/bin/spark-shell --conf spark.neo4j.url="bolt://192.xxx.xxx.xx:7687" --conf spark.neo4j.user="xxxx" --conf spark.neo4j.password="xxxx" --packages neo4j-contrib:neo4j-spark-connector:2.4.5-M1,graphframes:graphframes:0.2.0-spark2.0-s2.11
I read the file that is in HDFS and push it to Neo4j with the Neo4jDataFrame.mergeEdgeList function:
import org.neo4j.spark.dataframe.Neo4jDataFrame

object Neo4jTeste {
  def main(args: Array[String]): Unit = {
    // sc (SparkContext) and spark (SparkSession) are assumed to be in scope,
    // as they are inside spark-shell
    import spark.implicits._

    val lines = sc.textFile("hdfs://.../testeNodes.csv")
    val filteredlines = lines.map(_.split("-")).map { x => (x(0), x(1), x(2), x(3)) }
    val newNames = Seq("name", "apelido", "city", "date")
    val df = filteredlines.toDF(newNames: _*)

    Neo4jDataFrame.mergeEdgeList(sc, df, ("Name", Seq("name")), ("HAPPENED_IN", Seq.empty), ("Age", Seq("age")))
  }
}
However, as my data is very large, Neo4j is unable to represent all the nodes. I think the problem is with this function.
I've tried it this way too:
import org.neo4j.spark._
val neo = Neo4j(sc)
val rdd = neo.cypher("MATCH (n:Person) RETURN id(n) as id ").loadRowRdd
However, this way I cannot read the HDFS file or split it into columns.
Can someone help me find another solution? With the Neo4jDataFrame.mergeEdgeList function, I only see 150 nodes instead of the 500 I expect.
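For reference, a minimal sketch (assuming a SparkSession named spark is available, as it is in spark-shell) of reading the "-"-delimited HDFS file directly into a DataFrame with named columns; the resulting df could then be handed to Neo4jDataFrame.mergeEdgeList as above:

// Sketch: let Spark parse the delimiter-separated file itself,
// so no manual split/map is needed before naming the columns.
val df = spark.read
  .option("delimiter", "-")
  .csv("hdfs://.../testeNodes.csv")
  .toDF("name", "apelido", "city", "date")

df.printSchema() // verify the four columns before sending anything to Neo4j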

What is the difference between saveAsObjectFile and persist in apache spark?

I am trying to compare Java and Kryo serialization. When saving the RDD to disk with saveAsObjectFile, both give the same size, but on persist the Spark UI shows different sizes. The Kryo one is smaller than the Java one, yet ironically the processing time with Java is less than with Kryo, which I did not expect from the Spark UI.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Person is defined before it is referenced in registerKryoClasses
case class Person(name: String, age: Int)

val conf = new SparkConf()
  .setAppName("kyroExample")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Person], classOf[Array[Person]]))

val sparkContext = new SparkContext(conf)

val personList: Array[Person] = (1 to 100000).map(value => Person(value + "", value)).toArray
val rddPerson: RDD[Person] = sparkContext.parallelize(personList)
val evenAgePerson: RDD[Person] = rddPerson.filter(_.age % 2 == 0)

evenAgePerson.saveAsObjectFile("src/main/resources/objectFile")
evenAgePerson.persist(StorageLevel.MEMORY_ONLY_SER)
evenAgePerson.count()
persist and saveAsObjectFile solve different needs.
persist has a misleading name. It is not supposed to be used to permanently persist the RDD result. persist is used to temporarily keep a computed RDD around during a Spark workflow. The user has no control over the location of the persisted data. persist is just caching with a configurable strategy: memory, disk, or both. In fact, cache just calls persist with a default caching strategy.
For example, without persist:
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
// Runs again on the full dataframe of all the lines, repeating the filter above
errors.filter(col("line").like("%MySQL%")).count()
vs. with persist:
val errors = df.filter(col("line").like("%ERROR%"))
errors.persist()
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
// Runs only on the errors tmp result containing only the filtered error lines
errors.filter(col("line").like("%MySQL%")).count()
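For reference, this is roughly what the RDD API does internally (a paraphrased sketch, not an exact quote of the Spark source): cache simply delegates to persist with the MEMORY_ONLY default.

// Roughly the definitions inside org.apache.spark.rdd.RDD:
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
def cache(): this.type = persist()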
saveAsObjectFile is for permanent persistence. It is used to serialize the final result of a Spark job onto a persistent, and usually distributed, filesystem like HDFS or Amazon S3.
Spark's persist and saveAsObjectFile are not the same at all.
persist - persists your RDD at the requested StorageLevel, which means that from now on any transformation applied to this RDD will be computed from the persisted data rather than by re-running the whole DAG.
saveAsObjectFile - just saves the RDD into a SequenceFile of serialized objects.
saveAsObjectFile does not use the "spark.serializer" configuration at all, as you can see in the code below:
/**
 * Save this RDD as a SequenceFile of serialized objects.
 */
def saveAsObjectFile(path: String): Unit = withScope {
  this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
    .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
    .saveAsSequenceFile(path)
}
saveAsObjectFile uses Utils.serialize to serialize your objects, where the serialize method definition is:
/** Serialize an object using Java serialization */
def serialize[T](o: T): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(o)
  oos.close()
  bos.toByteArray
}
saveAsObjectFile always uses Java serialization. persist, on the other hand, will use whatever spark.serializer you have configured.
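Tying this back to the question's snippet, a hedged usage sketch (reusing evenAgePerson from the question; nothing here beyond the calls already shown above):

// With spark.serializer set to KryoSerializer and Person registered, a
// serialized storage level keeps Kryo-encoded blocks, so the Storage tab
// of the Spark UI reflects Kryo sizes...
evenAgePerson.persist(StorageLevel.MEMORY_ONLY_SER)
evenAgePerson.count() // the action that actually materializes the cached blocks

// ...while saveAsObjectFile goes through Utils.serialize (plain Java
// serialization), so its on-disk size is the same with or without Kryo.
evenAgePerson.saveAsObjectFile("src/main/resources/objectFile")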

Pyspark predict using kafka direct stream

I am trying to pull Kafka data into Spark Streaming, load an already built model from HDFS and then make predictions using the Kafka messages.
I tried several methods, but I'm stuck at model.predict because of a TypeError: Cannot convert type into Vector.
The data received from Kafka is comma-separated floats.
Here is my code:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LinearRegressionModel

sc = SparkContext(appName="PythonStreamingKafkaForecast")
ssc = StreamingContext(sc, 10)

# Create stream to get kafka messages
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["my_topic"], {"metadata.broker.list": "kafka_ip"})
features = directKafkaStream.foreachRDD(lambda rdd: rdd.map(lambda s: Vectors.dense(s[1].split(","))))
model = LinearRegressionModel.load(sc, "hdfs://hadoop_ip/model.model")

# Predict
predicted = model.predict(features)
I also tried this:
lines = directKafkaStream.map(lambda x: x[1])
features = lines.map(lambda data: Vectors.dense([float(c) for c in data.split(',')]))
But this time, features is of type TransformedDStream, which won't work for predictions...
Could you tell me what I'm doing wrong ?
Thank you for your help
OK, the issue was that I was trying to read data from Kafka even when the topic was empty.
This solved my problem :
def predict(rdd):
    count = rdd.count()
    if count > 0:
        features = rdd.map(lambda s: Vectors.dense(s[1].split(",")))
        return features
    else:
        print("No data received")

directKafkaStream.foreachRDD(lambda rdd: predict(rdd))

Spark: Write each record in RDD to individual files in HDFS directory

I have a requirement where I want to write each individual record in an RDD to an individual file in HDFS.
I did it for the normal filesystem, but obviously it doesn't work for HDFS.
stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.foreach { msg =>
      val value = msg._2
      println(value)
      val fname = java.util.UUID.randomUUID.toString
      val path = dir + fname
      write(path, value)
    }
  }
}
where write is a function which writes to the filesystem.
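For context, a minimal sketch of what an HDFS-aware write helper could look like (hypothetical code, not from the question; it uses the Hadoop FileSystem API and assumes the executors can see the cluster's Hadoop configuration):

import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: obtains the filesystem from the Hadoop configuration
// available on the executor and writes one record as one file.
def write(path: String, value: String): Unit = {
  val fs = FileSystem.get(new Configuration())
  val out = fs.create(new Path(path))
  try out.write(value.getBytes(StandardCharsets.UTF_8))
  finally out.close()
}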
Is there a way to do it within Spark so that for each record I can write natively to HDFS, without using any other tool like Kafka Connect or Flume?
EDIT: More Explanation
For eg:
If my DStream's RDD has the following records:
abcd
efgh
ijkl
mnop
I need different files for each record, so different file for "abcd", different for "efgh" and so on.
I tried creating an RDD within the stream's RDD, but I learnt it's not allowed, as RDDs are not serializable.
You can forcefully repartition the RDD so that the number of partitions equals the number of records, and then save:
val rddCount = rdd.count()
rdd.repartition(rddCount).saveAsTextFile("your/hdfs/loc")
You can do it in a couple of ways.
From the RDD you can get the SparkContext; once you have the SparkContext, you can use the parallelize method and pass the String as a List of Strings.
For example:
val sc = rdd.sparkContext
sc.parallelize(Seq("some string")).saveAsTextFile(path)
Also, you can use sqlContext to convert the String to a DataFrame and then write it to a file.
For example:
import sqlContext.implicits._
Seq("some string").toDF("value").write.text(path)

Map reduce to perform group by and sum in Cassandra, with spark and job server

I am creating a Spark job server job which connects to Cassandra. After getting the records, I want to perform a simple group by and sum on them. I am able to retrieve the data, but I could not print the output. I have tried Google for hours and have posted in the Cassandra Google group as well. My current code is as below, and I am getting an error at collect.
override def runJob(sc: SparkContext, config: Config): Any = {
  // sc.cassandraTable("store", "transaction").select("terminalid","transdate","storeid","amountpaid").toArray().foreach(println)
  // Printing of each record is successful

  val rdd = sc.cassandraTable("POSDATA", "transaction").select("terminalid", "transdate", "storeid", "amountpaid")
  val map1 = rdd.map(x => (x.getInt(0), x.getInt(1), x.getDate(2)) -> x.getDouble(3)).reduceByKey((x, y) => x + y)

  println(map1)
  // output is ShuffledRDD[3] at reduceByKey at Daily.scala:34

  map1.collect
  // map1.collectAsMap().map(println(_))
  // Throwing error java.lang.ClassNotFoundException: transaction.Daily$$anonfun$2
}
Your map1 is an RDD. You can try the following:
map1.foreach(r => println(r))
Spark does lazy evaluation on RDDs, so try some action, for example:
map1.take(10).foreach(println)
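A related hedged sketch (reusing map1 from the question's runJob; the summed variable is only illustrative): since the job server returns whatever runJob returns, collecting the reduced pairs and returning them makes the summed result visible to the caller rather than only inside executor logs.

// collect() is also an action: it triggers the computation and brings the
// reduced key -> summed amount pairs back to the driver.
val summed = map1.collect()
summed.foreach(println) // printed on the driver
summed                  // returned as the job result if this is the last expression of runJob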