I have data in a S3 bucket in directory /data/vw/. Each line is of the form:
| abc:2 def:1 ghi:3 ...
I want to convert it to the following format:
abc abc def ghi ghi ghi
The new converted lines should go to S3 in directory /data/spark
Basically, repeat each string the number of times that follows the colon. I am trying to convert a VW LDA input file to a corresponding file for consumption by Spark's LDA library.
The code:
import org.apache.spark.{SparkConf, SparkContext}
object Vw2SparkLdaFormatConverter {
def repeater(s: String): String = {
val ssplit = s.split(':')
(ssplit(0) + ' ') * ssplit(1).toInt
}
def main(args: Array[String]) {
val inputPath = args(0)
val outputPath = args(1)
val conf = new SparkConf().setAppName("FormatConverter")
val sc = new SparkContext(conf)
val vwdata = sc.textFile(inputPath)
val sparkdata = vwdata.map(s => s.trim().split(' ').map(repeater).mkString)
val coalescedSparkData = sparkdata.coalesce(100)
coalescedSparkData.saveAsTextFile(outputPath)
sc.stop()
}
}
When I run this (as a Spark EMR job in AWS), the step fails with exception:
18/01/20 00:16:28 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at ...
The code is run as:
spark-submit --class Vw2SparkLdaFormatConverter --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=true --executor-memory 4g s3a://mybucket/scripts/myscalajar.jar s3a://mybucket/data/vw s3a://mybucket/data/spark
I have tried specifying new output paths (/data/spark1 etc), ensuring that it does not exist before the step is run. Even then it is not working.
What am I doing wrong? I am new to Scala and Spark so I might be overlooking something here.
You could convert to a dataframe and then save with overwrite enabled.
coalescedSparkData.toDF.write.mode('overwrite').csv(outputPath)
Or if you insist on using RDD methods, you can do as described already in this answer
Related
I am trying to load in a table from SalesForce using spark. I invoked this code
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
object Sample {
def main(arg: Array[String]) {
val spark = SparkSession.builder().
appName("salesforce").
master("local[*]").
getOrCreate()
val tableName = "Opportunity"
val outputPath = "output/result" + tableName
val salesforceDf = spark.
read.
format("jdbc").
option("url", "jdbc:datadirect:sforce://login.salesforce.com;").
option("driver", "com.ddtek.jdbc.sforce.SForceDriver").
option("dbtable", tableName).
option("user", "").
option("password", "xxxxxxxxx").
option("securitytoken", "xxxxx")
.load()
salesforceDf.createOrReplaceTempView("Opportunity")
spark.sql("select * from Opportunity").collect.foreach(println)
//save the result
salesforceDf.write.save(outputPath)
}
}
And the docs I was referring to said to start a spark shell as:
spark-shell --jars /path_to_driver/sforce.jar
Which outputted a lot of lines in the terminal and this was the last line:
22/07/12 14:57:56 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a12b060d-5c82-4283-b2b9-53f9b3863b53
And then to submit spark
spark-submit --jars sforce.jar --class <Your class name> your jar file
However I am not sure where this jar file is and if that was substantiated? and how to submit that. Any help is appreciated, thank you.
I want to put the data that is in HDFS in a neo4j chart. Taking into account the suggestion given in this question, I used the neo4j-spark-connector, initializing them this way:
/usr/local/sparks/bin/spark-shell --conf spark.neo4j.url="bolt://192.xxx.xxx.xx:7687" --conf spark.neo4j.user="xxxx" --conf spark.neo4j.password="xxxx" --packages neo4j-contrib:neo4j-spark-connector:2.4.5-M1, graphframes:graphframes:0.2.0-spark2.0-s2.11
I read the file that is in the hdfs and put it in neo4j from the function Neo4jDataFrame.mergeEdgeList.
import org.neo4j.spark.dataframe.Neo4jDataFrame
object Neo4jTeste {
def main {
val lines = sc.textFile("hdfs://.../testeNodes.csv")
val filteredlines = lines.map(_.split("-")).map{x => (x(0),x(1),x(2),x(3))}
val newNames = Seq("name","apelido","city","date")
val df = filteredlines.toDF(newNames: _*)
Neo4jDataFrame.mergeEdgeList(sc, df, ("Name", Seq("name")),("HAPPENED_IN", Seq.empty), ("Age", Seq("age")))
} }
However, as my data is very large, neo4j is unable to represent all nodes. I think the problem is with the function.
I've tried it this way too:
import org.neo4j.spark._
val neo = Neo4j(sc)
val rdd = neo.cypher("MATCH (n:Person) RETURN id(n) as id ").loadRowRdd
However, this way I cannot read the HDFS file or divide it into columns
Can someone help me find another solution? With the Neo4jDataFrame.mergeEdgeList function, I only see 150 nodes instead of the 500 expected.
I would like to pass set of avro files as input to spark job and create dataframe on top of those files. (I don't want to place files in a directory and pass directory as input).
In Spark shell, I'm able to create dataframe successfully like below.
val DF = hiveContext.read.format("com.databricks.spark.avro").load("/data/year=2019/month=09/day=28/hour=01/data_1.1569650402704.avro","/data/year=2019/month=09/day=28/hour=01/data_2.1569650402353.avro")
But the same is failing when I try to run through spark-submit command.
To pass the avro files independently to spark job, I'm trying to place avro files in a text file and pass this file as input argument to Driver class.
textFile:
/data/year=2019/month=09/day=28/hour=01/data_1.1569650402704.avro
/data/year=2019/month=09/day=28/hour=01/data_2.1569650402353.avro
spark-submit --class Spark_submit_test --master yarn Spark_submit_test.jar textFile
val filename = args(0)
val files = Source.fromFile(filename).getLines
val fileList = files.mkString(",")
println("fileList : "+fileList)
=> This prints
fileList : /data/ASDS/PNR/archive/year=2019/month=09/day=28/hour=01/data_1.1569650402704.avro,/data/ASDS/PNR/archive/year=2019/month=09/day=28/hour=01/data_2.1569650402353.avro
val DF = hiveContext.read.format("com.databricks.spark.avro").load(fileList)
Getting below exception :
Exception in thread "main" java.io.FileNotFoundException: File hdfs://bdaolc01-ns/data/ASDS/PNR/archive/year=2019/month=09/day=28/hour=01/data_1.1569650402704.avro,/data/ASDS/PNR/archive/year=2019/month=09/day=28/hour=01/data_2.1569650402353.avro does not exist.
Not sure how I can avoid "hdfs://bdaolc01-ns" appending in beginning.
Please correct me if I'm doing wrong or suggest better approach for doing the same.
Note : I tried enclosing file names in double quotes, but no use.
Expected Result :
Dataframe should be created successfully and df.printSchema should list proper schema of the avro files.
You want the splat operator!
myList: _*
scala> val data = spark.read.parquet(paths: _*)
data: org.apache.spark.sql.DataFrame = [id: bigint, a: int ... 1 more field]
scala> val paths = List("/tmp/example-parquet/part-00000-38cd8823-bff7-46f0-82a0-13d1d00ecce5-c000.snappy.parquet")
paths: List[String] = List(/tmp/example-parquet/part-00000-38cd8823-bff7-46f0-82a0-13d1d00ecce5-c000.snappy.parquet)
scala> val data = spark.read.parquet(paths: _*)
data: org.apache.spark.sql.DataFrame = [id: bigint, a: int ... 1 more field]
scala> data.count
res0: Long = 12500000
Pass the input file path to spark-submit command with --files option.
Also pass the input file name as command line argument.
This way I'll be able to read the file in Drive class.
val avrofiles = Source.fromFile(inputFileName).getLines.toArray
And create a dataframe
val dF = hiveContext.read.format("com.databricks.spark.avro").load(avrofiles:_*)
I have a cluster of 9 computers with Apache Hadoop 2.7.2 and Spark 2.0.0 installed on them. Each computer runs an HDFS datanode and Spark slave. One of these computers also runs an HDFS namenode and Spark master.
I've uploaded a few TBs of gz-archives in HDFS with Replication=2. It turned out that some of the archives are corrupt. I'd want to find them. It looks like 'gunzip -t ' can help. So I'm trying to find a way to run a Spark application on the cluster so that each Spark executor tests archives 'local' (i.e. having one of the replicas located on the same computer where this executor runs) to it as long as it is possible. The following script runs but sometimes Spark executors process 'remote' files in HDFS:
// Usage (after packaging a jar with mainClass set to 'com.qbeats.cortex.CommoncrawlArchivesTester' in spark.pom
// and placing this jar file into Spark's home directory):
// ./bin/spark-submit --master spark://LV-WS10.lviv:7077 spark-cortex-fat.jar spark://LV-WS10.lviv:7077 hdfs://LV-WS10.lviv:9000/commoncrawl 9
// means testing for corruption the gz-archives in the directory hdfs://LV-WS10.lviv:9000/commoncrawl
// using a Spark cluster with the Spark master URL spark://LV-WS10.lviv:7077 and 9 Spark slaves
package com.qbeats.cortex
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.FileSplit
import org.apache.spark.rdd.HadoopRDD
import org.apache.spark.{SparkContext, SparkConf, AccumulatorParam}
import sys.process._
object CommoncrawlArchivesTester extends App {
object LogAccumulator extends AccumulatorParam[String] {
def zero(initialValue: String): String = ""
def addInPlace(log1: String, log2: String) = if (log1.isEmpty) log2 else log1 + "\n" + log2
}
override def main(args: Array[String]): Unit = {
if (args.length >= 3) {
val appName = "CommoncrawlArchivesTester"
val conf = new SparkConf().setAppName(appName).setMaster(args(0))
conf.set("spark.executor.memory", "6g")
conf.set("spark.shuffle.service.enabled", "true")
conf.set("spark.dynamicAllocation.enabled", "true")
conf.set("spark.dynamicAllocation.initialExecutors", args(2))
val sc = new SparkContext(conf)
val log = sc.accumulator(LogAccumulator.zero(""))(LogAccumulator)
val text = sc.hadoopFile(args(1), classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
val hadoopRdd = text.asInstanceOf[HadoopRDD[LongWritable, Text]]
val fileAndLine = hadoopRdd.mapPartitionsWithInputSplit { (inputSplit, iterator) =>
val fileName = inputSplit.asInstanceOf[FileSplit].getPath.toString
class FilePath extends Iterable[String] {
def iterator = List(fileName).iterator
}
val result = (sys.env("HADOOP_PREFIX") + "/bin/hadoop fs -cat " + fileName) #| "gunzip -t" !
println("Processed %s.".format(fileName))
if (result != 0) {
log.add(fileName)
println("Corrupt: %s.".format(fileName))
}
(new FilePath).iterator
}
val result = fileAndLine.collect()
println("Corrupted files:")
println(log.value)
}
}
}
What would you suggest?
ADDED LATER:
I tried another script which gets files from HDFS via textFile(). I looks like a Spark executor doesn't prefer among input files the files which are 'local' to it. Doesn't it contradict to "Spark brings code to data, not data to code"?
// Usage (after packaging a jar with mainClass set to 'com.qbeats.cortex.CommoncrawlArchiveLinesCounter' in spark.pom)
// ./bin/spark-submit --master spark://LV-WS10.lviv:7077 spark-cortex-fat.jar spark://LV-WS10.lviv:7077 hdfs://LV-WS10.lviv:9000/commoncrawl 9
package com.qbeats.cortex
import org.apache.spark.{SparkContext, SparkConf}
object CommoncrawlArchiveLinesCounter extends App {
override def main(args: Array[String]): Unit = {
if (args.length >= 3) {
val appName = "CommoncrawlArchiveLinesCounter"
val conf = new SparkConf().setAppName(appName).setMaster(args(0))
conf.set("spark.executor.memory", "6g")
conf.set("spark.shuffle.service.enabled", "true")
conf.set("spark.dynamicAllocation.enabled", "true")
conf.set("spark.dynamicAllocation.initialExecutors", args(2))
val sc = new SparkContext(conf)
val helper = new Helper
val nLines = sc.
textFile(args(1) + "/*").
mapPartitionsWithIndex( (index, it) => {
println("Processing partition %s".format(index))
it
}).
count
println(nLines)
}
}
}
SAIF C, could you explain in more detail please?
I've solved the problem by switching from Spark’s standalone mode to YARN.
Related topic: How does Apache Spark know about HDFS data nodes?
I get the following error
Exception in thread "main" java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/cloudera/linkage/out1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:180)
when launching the following command
spark-submit --class spark00.DataAnalysis1 --master local sproject1.jar linkage linkage/out1
The two last arguments (linkage and linkage/out1) are HDFS directories, the first contains several CSV files, the second doesn't exist, I assume that it will be automatically created.
The following code has been tested successfully with REPL (Spark 1.1, Scala 2.10.4), except of course the saveAsTextFile() part. I've followed the step-by-step method explained in O'Reilly's "Advanced Analytics with Spark" book.
Since it worked on REPL, I wanted to transpose this into a JAR file using Eclipse Juno, with the following code.
package spark00
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object DataAnalysis1 {
case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)
def isHeader(line:String) = line.contains("id_1")
def toDouble(s:String) = {
if ("?".equals(s)) Double.NaN else s.toDouble
}
def parse(line:String) = {
val pieces = line.split(",")
val id1 = pieces(0).toInt
val id2 = pieces(1).toInt
val scores = pieces.slice(2, 11).map(toDouble)
val matched = pieces(11).toBoolean
MatchData(id1, id2, scores, matched)
}
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local").setAppName("DataAnalysis1")
val sc = new SparkContext(conf)
// Load our input data.
val rawblocks = sc.textFile(args(0))
// CLEAN-UP
// a. calling !isHeader(): suppress header
val noheader = rawblocks.filter(!isHeader(_))
// b. calling parse(): setting feature types and renaming headers
val parsed = noheader.map(line => parse(line))
// EXPORT CLEAN FILE
parsed.coalesce(1,true).saveAsTextFile(args(1))
}
}
As you can see args(0) should be "linkage" directory, and args(1) is actually the output HDFS directory linkage/out1 based on my spark-submit command above.
I've also tried the last line without coalesce(1,true)
Here's the official RDD type for parsed
parsed: org.apache.spark.rdd.RDD[(Int, Int, Array[Double], Boolean)] = MappedRDD[3] at map at <console>:34
Thank you in advance for your support
Nov 20th: I'm adding this simple Wordcount code that works well when running the spark-submit command the same way as for the code above. Thus, my question will be: why the saveAsTextFile() worked for this one and not the for other code ?
object SpWordCount {
def main(args: Array[String]) {
// Create a Scala Spark Context.
val conf = new SparkConf().setMaster("local").setAppName("wordCount")
val sc = new SparkContext(conf)
// Load our input data.
val input = sc.textFile(args(0))
// Split it up into words.
val words = input.flatMap(line => line.split(" "))
// Transform into word and count.
val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
// Save the word count back out to a text file, causing evaluation.
counts.saveAsTextFile(args(1))
}
}