I want to put data that is in HDFS into a Neo4j graph. Taking into account the suggestion given in this question, I used the neo4j-spark-connector, initializing it this way:
/usr/local/sparks/bin/spark-shell --conf spark.neo4j.url="bolt://192.xxx.xxx.xx:7687" --conf spark.neo4j.user="xxxx" --conf spark.neo4j.password="xxxx" --packages neo4j-contrib:neo4j-spark-connector:2.4.5-M1,graphframes:graphframes:0.2.0-spark2.0-s2.11
I read the file from HDFS and load it into Neo4j with the Neo4jDataFrame.mergeEdgeList function.
import org.apache.spark.sql.SparkSession
import org.neo4j.spark.dataframe.Neo4jDataFrame

object Neo4jTeste {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._ // needed for toDF on the tuple RDD
    val lines = sc.textFile("hdfs://.../testeNodes.csv")
    val filteredlines = lines.map(_.split("-")).map { x => (x(0), x(1), x(2), x(3)) }
    val newNames = Seq("name", "apelido", "city", "date")
    val df = filteredlines.toDF(newNames: _*)
    Neo4jDataFrame.mergeEdgeList(sc, df, ("Name", Seq("name")), ("HAPPENED_IN", Seq.empty), ("Age", Seq("age")))
  }
}
However, as my data is very large, Neo4j is unable to represent all of the nodes. I think the problem is with the function.
I've tried it this way too:
import org.neo4j.spark._
val neo = Neo4j(sc)
val rdd = neo.cypher("MATCH (n:Person) RETURN id(n) as id ").loadRowRdd
However, this way I cannot read the HDFS file or split it into columns.
Can someone help me find another solution? With the Neo4jDataFrame.mergeEdgeList function, I only see 150 nodes instead of the 500 expected.
Related
I am following the spark hbase connector basic example to read an HBase table in the Spark 2 shell (version 2.2.0). The code appears to work, but when I run the df.show() command I do not see any results and it seems to run forever.
import org.apache.spark.sql.{ DataFrame, Row, SQLContext }
import org.apache.spark.sql.execution.datasources.hbase._
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
def catalog = s"""{
|"table":{"namespace":"default", "name":"testmeta"},
|"rowkey":"vgil",
|"columns":{
|"id":{"cf":"rowkey", "col":"vgil", "type":"string"},
|"col1":{"cf":"pp", "col":"dtyp", "type":"string"}
|}
|}""".stripMargin
def withCatalog(cat: String): DataFrame = {
  sqlContext.read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}
val df = withCatalog(catalog)
df.show()
df.show() will neither give any output nor any error. It will keep on running forever.
Also, how can I run a query for a range of row keys?
Here is the scan of the HBase test table.
hbase(main):001:0> scan 'testmeta'
ROW COLUMN+CELL
fmix column=pp:dtyp, timestamp=1541714925380, value=ss1
fmix column=pp:lati, timestamp=1541714925371, value=41.50
fmix column=pp:long, timestamp=1541714925374, value=-81.61
fmix column=pp:modm, timestamp=1541714925377, value=ABC
vgil column=pp:dtyp, timestamp=1541714925405, value=ss2
vgil column=pp:lati, timestamp=1541714925397, value=41.50
I have followed some of the solutions on the web, but unfortunately I am not able to get the data from HBase.
Thanks in advance for help!
Posting my answer after a lot of trial and error: I found that adding the --conf option when starting the Spark shell helped me connect to HBase.
spark2-shell --master yarn --deploy-mode client --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11,it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 --repositories http://repo.hortonworks.com/content/groups/public/ --conf spark.hbase.host=192.168.xxx.xxx --files /mnt/fs1/opt/cloudera/parcels/CDH-5.13.0-1.cdh5.13.0.p0.29/share/doc/hbase-solr-doc-1.5+cdh5.13.0+71/demo/hbase-site.xml
Then the following code snippet could fetch a value for one column qualifier.
import it.nerdammer.spark.hbase._ // provides the implicit sc.hbaseTable syntax
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Scan the 'lati' qualifier of column family 'pp' for a range of row keys
val hBaseRDD_iacp = sc.hbaseTable[(String)]("testmeta").select("lati").inColumnFamily("pp").withStartRow("vg").withStopRow("vgz")

object myschema {
  val column1 = StructField("column1", StringType)
  val struct = StructType(Array(column1))
}

val rowRDD = hBaseRDD_iacp.map(x => Row(x))
val myDf = sqlContext.createDataFrame(rowRDD, myschema.struct)
myDf.show()
I have data in an S3 bucket in the directory /data/vw/. Each line is of the form:
| abc:2 def:1 ghi:3 ...
I want to convert it to the following format:
abc abc def ghi ghi ghi
The converted lines should go to S3 in the directory /data/spark.
Basically, repeat each string the number of times that follows the colon. I am trying to convert a VW LDA input file to a corresponding file for consumption by Spark's LDA library.
The code:
import org.apache.spark.{SparkConf, SparkContext}

object Vw2SparkLdaFormatConverter {

  def repeater(s: String): String = {
    val ssplit = s.split(':')
    (ssplit(0) + ' ') * ssplit(1).toInt
  }

  def main(args: Array[String]) {
    val inputPath = args(0)
    val outputPath = args(1)
    val conf = new SparkConf().setAppName("FormatConverter")
    val sc = new SparkContext(conf)
    val vwdata = sc.textFile(inputPath)
    val sparkdata = vwdata.map(s => s.trim().split(' ').map(repeater).mkString)
    val coalescedSparkData = sparkdata.coalesce(100)
    coalescedSparkData.saveAsTextFile(outputPath)
    sc.stop()
  }
}
When I run this (as a Spark EMR job in AWS), the step fails with an exception:
18/01/20 00:16:28 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at ...
The code is run as:
spark-submit --class Vw2SparkLdaFormatConverter --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=true --executor-memory 4g s3a://mybucket/scripts/myscalajar.jar s3a://mybucket/data/vw s3a://mybucket/data/spark
I have tried specifying new output paths (/data/spark1 etc.), ensuring that the path does not exist before the step runs. Even then it does not work.
What am I doing wrong? I am new to Scala and Spark so I might be overlooking something here.
You could convert to a DataFrame and then save with overwrite mode enabled.
// toDF on an RDD requires: import spark.implicits._
coalescedSparkData.toDF.write.mode("overwrite").csv(outputPath)
Or, if you insist on using RDD methods, you can do as described already in this answer.
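For reference, here is a minimal sketch of one RDD-side approach, reusing outputPath, sc and coalescedSparkData from the question's code: delete any pre-existing output directory through the Hadoop FileSystem API before calling saveAsTextFile.
import org.apache.hadoop.fs.{FileSystem, Path}

val outPath = new Path(outputPath)
val fs = FileSystem.get(outPath.toUri, sc.hadoopConfiguration)
if (fs.exists(outPath)) {
  fs.delete(outPath, true) // recursive delete of the stale output directory
}
coalescedSparkData.saveAsTextFile(outputPath)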
I built an H2O model in R and saved the POJO code. I want to score Parquet files in HDFS using the POJO, but I'm not sure how to go about it. I plan on reading the Parquet files into Spark (Scala/SparkR/PySpark) and scoring them there. Below is the excerpt I found on H2O's documentation page.
"How do I run a POJO on a Spark Cluster?
The POJO provides just the math logic to do predictions, so you won’t find any Spark (or even H2O) specific code there. If you want to use the POJO to make predictions on a dataset in Spark, create a map to call the POJO for each row and save the result to a new column, row-by-row"
Does anyone have some example code of how I can do this? I'd greatly appreciate any assistance. I code primarily in R and SparkR, and I'm not sure how I can "map" the POJO to each line.
Thanks in advance.
I just posted a solution that actually uses DataFrame/Dataset. The post used a Star Wars dataset to build a model in R and then scored the MOJO on the test set in Spark. I'll paste only the relevant part here:
Scoring with Spark (and Scala)
You could use either spark-submit or spark-shell. If you use spark-submit, h2o-genmodel.jar needs to be put under the lib folder of the root directory of your Spark application so it can be added as a dependency during compilation. The following code assumes you're running spark-shell. In order to use h2o-genmodel.jar, you need to append the jar file when launching spark-shell by providing the --jars flag. For example:
/usr/lib/spark/bin/spark-shell \
--conf spark.serializer="org.apache.spark.serializer.KryoSerializer" \
--conf spark.driver.memory="3g" \
--conf spark.executor.memory="10g" \
--conf spark.executor.instances=10 \
--conf spark.executor.cores=4 \
--jars /path/to/h2o-genmodel.jar
Now, in the Spark shell, import the dependencies:
import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import _root_.hex.genmodel.MojoModel
Using DataFrame
val modelPath = "/path/to/zip/file"
val dataPath = "/path/to/test/data"
// Import data
val dfStarWars = spark.read.option("header", "true").csv(dataPath)
// Import MOJO model
val mojo = MojoModel.load(modelPath)
val easyModel = new EasyPredictModelWrapper(mojo)
// score
val dfScore = dfStarWars.map {
x =>
val r = new RowData
r.put("height", x.getAs[String](1))
r.put("mass", x.getAs[String](2))
val score = easyModel.predictBinomial(r).classProbabilities
(x.getAs[String](0), score(1))
}.toDF("name", "isHumanScore")
The variable score is an array of two class probabilities, for levels 0 and 1; score(1) is the probability of level 1, which is "human". By default the map function returns a DataFrame with unspecified column names "_1", "_2", etc. You can rename the columns by calling toDF.
Using Dataset
To use the Dataset API we just need to create two case classes, one for the input data, and one for the output.
case class StarWars (
name: String,
height: String,
mass: String,
is_human: String
)
case class Score (
name: String,
isHumanScore: Double
)
// Dataset
val dtStarWars = dfStarWars.as[StarWars]
val dtScore = dtStarWars.map {
x =>
val r = new RowData
r.put("height", x.height)
r.put("mass", x.mass)
val score = easyModel.predictBinomial(r).classProbabilities
Score(x.name, score(1))
}
With the Dataset API you can get the value of a column by calling x.columnName directly. Just note that the column values have to be Strings, so you may need to cast them manually if they arrive with types other than the ones defined in the case class.
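As a small illustration of that note (reusing the column names from the case class above), if the CSV columns had been read with other types you could cast them back to String before converting to the Dataset:
val dtStarWarsTyped = dfStarWars
  .select(
    $"name",
    $"height".cast("string").alias("height"),
    $"mass".cast("string").alias("mass"),
    $"is_human".cast("string").alias("is_human"))
  .as[StarWars]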
If you want to perform scoring with a POJO or MOJO in Spark, you should use RowData (provided in h2o-genmodel.jar) as the row-by-row input to the easyPredict method that generates the scores.
Your solution will be to read the Parquet file from HDFS and then, for each row, convert it to a RowData object by filling in each entry, and pass that to your POJO scoring function. Remember that POJO and MOJO both use exactly the same scoring function; the only difference is that a POJO is used as a class while a MOJO is used as a resources zip package. Since MOJOs are backward compatible and work with any newer h2o-genmodel.jar, it is best to use a MOJO instead of a POJO.
Following is the full Scala code you can use on Spark to load a MOJO model and then do the scoring:
import _root_.hex.genmodel.GenModel
import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import _root_.hex.genmodel.easy.prediction
import _root_.hex.genmodel.MojoModel
// Load Mojo
val mojo = MojoModel.load("/Users/avkashchauhan/learn/customers/mojo_bin/gbm_model.zip")
val easyModel = new EasyPredictModelWrapper(mojo)
// Get Mojo Details
var features = mojo.getNames.toBuffer
// Creating the row
val r = new RowData
r.put("AGE", "68")
r.put("RACE", "2")
r.put("DCAPS", "2")
r.put("VOL", "0")
r.put("GLEASON", "6")
// Performing the Prediction
val prediction = easyModel.predictBinomial(r).classProbabilities
Here is an example of reading Parquet files in Spark and then saving them as CSV. You can use the same code to read the Parquet from HDFS and then pass each row as RowData to the example above.
Here is a detailed example of using a MOJO model in Spark and performing scoring with RowData.
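To tie this back to the original question, here is a minimal sketch (not taken from the linked examples) that reads a Parquet file from HDFS, converts each row to RowData, and scores it with the easyModel loaded above; the feature names and paths are placeholders for illustration.
import spark.implicits._

val parquetDf = spark.read.parquet("hdfs:///path/to/input.parquet")
val featureCols = Seq("AGE", "RACE", "DCAPS", "VOL", "GLEASON") // placeholder feature names

val scored = parquetDf.map { row =>
  val r = new RowData
  // EasyPredictModelWrapper expects String values in RowData
  featureCols.foreach(c => r.put(c, row.getAs[Any](c).toString))
  easyModel.predictBinomial(r).classProbabilities(1)
}.toDF("score")

scored.write.mode("overwrite").csv("hdfs:///path/to/output_csv")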
I have a task to convert .mdb files to .csv files. With the help of the code below I am able to read only one table from an .mdb file. I am not able to read an .mdb file that contains more than one table, and I want to store each table individually. Kindly help me with this.
import java.io.File
import scala.collection.JavaConverters._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import com.healthmarketscience.jackcess.DatabaseBuilder

object mdbfiles {
  Logger.getLogger("org").setLevel(Level.ERROR)
  val spark = SparkSession.builder().appName("Positional File Reading").master("local[*]").getOrCreate()
  val sc = spark.sparkContext // Just used to create test RDDs

  def main(args: Array[String]): Unit = {
    val inputfilepath = "C:/Users/phadpa01/Desktop/InputFiles/sample.mdb"
    val outputfilepath = "C:/Users/phadpa01/Desktop/sample_mdb_output"
    val db = DatabaseBuilder.open(new File(inputfilepath))
    try {
      val table = db.getTable("table1")
      for (row <- table.asScala) { // Jackcess tables are Java Iterables
        //System.out.println(row)
        val opresult = row.values()
      }
    } finally {
      db.close()
    }
  }
}
Your problem is that you are reading in only one table with this bit of code:
val table = db.getTable("table1");
You should get a list of available tables in the db and then loop over them.
val tableNames = db.getTableNames
Then you can iterate over the tableNames. That should solve the issue of reading in more than one table, though you may need to update the rest of the code to get it how you want it, for instance along the lines of the sketch below.
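A rough sketch of that loop, reusing db and outputfilepath from the question's code and writing each table with plain java.io (the CSV quoting here is deliberately naive):
import java.io.{File, PrintWriter}
import scala.collection.JavaConverters._

for (tableName <- db.getTableNames.asScala) {
  val table = db.getTable(tableName)
  val writer = new PrintWriter(new File(s"$outputfilepath/$tableName.csv"))
  try {
    // header row from the table's column names
    writer.println(table.getColumns.asScala.map(_.getName).mkString(","))
    for (row <- table.asScala) {
      writer.println(row.values().asScala.mkString(","))
    }
  } finally {
    writer.close()
  }
}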
You should really find a JDBC driver that works with MS Access rather than manually trying to parse the file yourself.
For example UCanAccess
Then it's a simple Spark SQL command, and you have a DataFrame:
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:ucanaccess://c:/Users/phadpa01/Desktop/InputFiles/sample.mdb;memory=false")
.option("dbtable", "table1")
.load()
And one line to a CSV
jdbcDF.write.format("csv").save("table1.csv")
Don't forget to add the UCanAccess jars to the classpath:
ucanaccess-4.0.2.jar,jackcess-2.1.6.jar,hsqldb.jar
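For example, if you are using spark-shell, one way to make them available is via the --jars flag (the paths below are placeholders):
spark-shell --jars /path/to/ucanaccess-4.0.2.jar,/path/to/jackcess-2.1.6.jar,/path/to/hsqldb.jar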
Alternative solution
Run a terminal command
http://ucanaccess.sourceforge.net/site.html#clients
I am trying to read Avro data using Scala within a Spark environment. My data is not getting distributed: while running, it goes to only 2 nodes, and we have 20+ nodes. Here is my code snippet:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

case class My_Class(My_ID: String) // case classes are Serializable by default

val filePath = "hdfs://path"
val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](filePath)
val rddprsid = avroRDD.map(A => My_Class(A._1.datum.get("My_ID").toString))
val uploadFilter = rddprsid.filter(E => E.My_ID ne null)
val as = uploadFilter.distinct(100).count
I am not able to use the parallelize operation on the RDD, as it complains with the following error:
<console>:30: error: type mismatch;
found : org.apache.spark.rdd.RDD[(org.apache.avro.mapred.AvroWrapper[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)]
required: Seq[?]
Can someone please help?
You are seeing only 2 nodes because YARN submission defaults to 2 executors. You need to submit with --num-executors [NUMBER] and optionally --executor-cores [NUMBER].
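For example (the resource numbers below are only illustrative, not a recommendation):
spark-shell --master yarn --num-executors 20 --executor-cores 4 --executor-memory 4g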
As to parallelize: your data is already parallelized, hence the RDD wrapper around it. parallelize is only used to distribute in-memory data across the cluster.
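For instance, parallelize is meant for a local, in-driver collection, not for something that is already an RDD:
val localRdd = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 4) // fine: a local Seq
// sc.parallelize(avroRDD) // does not compile: an RDD is not a Seq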