NLineInputFormat not working in Spark - scala

What I want is basically to have each element of data consist of 10 lines. However, with the following code, each element is still one line. What mistake am I doing here?
val conf = new SparkConf().setAppName("MyApp")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array[Class[_]](classOf[NLineInputFormat], classOf[LongWritable],
classOf[Text]))
val sc = new SparkContext(conf)
val c = new Configuration(sc.hadoopConfiguration)
c.set("lineinputformat.linespermap", 10);
val data = sc.newAPIHadoopFile(fname, classOf[NLineInputFormat], classOf[LongWritable],
classOf[Text], c)

NLineInputFormat by design just doesn't perform operation you expect it to:
NLineInputFormat which splits N lines of input as one split. (...) that splits the input file such that by default, one line is fed as a value to one map task.
As you can see it modifies how splits (partitions in the Spark nomenclature) are computed, not how records are determined.
If description is not clear we can illustrate that with a following example:
def nline(n: Int, path: String) = {
val sc = SparkContext.getOrCreate
val conf = new Configuration(sc.hadoopConfiguration)
conf.setInt("mapreduce.input.lineinputformat.linespermap", n);
sc.newAPIHadoopFile(path,
classOf[NLineInputFormat], classOf[LongWritable], classOf[Text], conf
)
}
require(nline(1, "README.md").glom.map(_.size).first == 1)
require(nline(2, "README.md").glom.map(_.size).first == 2)
require(nline(3, "README.md").glom.map(_.size).first == 3)
As show above each partition (possibly excluding the last one) contains exactly n lines.
While you can try to retrofit this to fit your case it won't be recommended for small values of the linespermap parameter.

Related

Spark MinHashLSH Never Progresses

I am new to spark but I am attempting to produce network clusters using user supplied tags or attributes. First I am using the jaccard minhash algorithm to produce similarity scores then running it through power iteration clustering algorithm but as soon as it starts there is no CPU activity and will run for some time with zero progress. Wondering how to configure the cluster or change the code to get this to run. Below is my code
//about 10,000 rows of (id, 100 tags in binary form)
val data = spark.read.format("csv").option("header", "true").option("delimiter", ",").option("inferSchema","true").load("gs://data/*.csv")
val columnNames = data.columns
val tags = columnNames.slice(1, columnNames.size)
//put tags in a vector
val assembler = new VectorAssembler().setInputCols(tags).setOutputCol("attributes")
val newData = assembler.transform(data).select("userID","attributes")
val mh = new MinHashLSH().setNumHashTables(5).setInputCol("attributes").setOutputCol("values")
val modelMINHASH = mh.fit(goodData)
// Approximate nearest neighbor search
val fullData = modelMINHASH.approxSimilarityJoin(newData , newData , 0.9).filter("datasetA.userID < datasetB.userID")
var explodeDF = fullData.withColumn("id", fullData("datasetA.userID")).withColumn("id2", fullData("datasetB.userID")).select("id","id2","distCol")
val temp = explodeDF.rdd
val newRDD = temp.map(x => (x.getAs[Integer]("id").longValue(),x.getAs[Integer]("id2").longValue(),1-x.getAs[Double]("distCol"))).cache()
//this is where the code haults and I see no progress
val modelPIC = new PowerIterationClustering().setK(16).setMaxIterations(5).run(newRDD)
val clusters = modelPIC.assignments

Open files in spark with given timestamp

I use newAPIHadoopFile in my scala class to read text files from HDFS as below
val conf = new SparkConf
val sc = new SparkContext(conf)
val hc = new Configuration(sc.hadoopConfiguration)
val dataFilePath = "/data/sample"
val input = sc.newAPIHadoopFile(dataFilePath, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hc)
But now I just need to open files within a range of timestamp.
Any idea on how I could do that?
Thanks,
Jeff
If your files contain timestamp directly in filename, it's pretty easy:
val path = "/hdfs/some_dir/2016-07-*/*"
val data = sqlContext.jsonFile(data) // or textFile e.g.
data.count() // number of rows in all files matching pattern
This will read all dirs July of 2016 and all files in those dirs. You can do pattern matching even on filenames, e.g. val path = "/hdfs/some_dir/2016-07-01/file-*.json"
Is this helpful or you are looking for system timestamps filtering?
Edit:
In case you need to filter using system timestamp:
val path = "/hdfs/some_dir/"
val now: Long = System.currentTimeMillis / 1000
var files = new java.io.File(path).listFiles.filter(_.lastModified >= now)
Or you can construct more complex date filtering like selecting date in a "human" way. It should be easy now.

Spark ML VectorAssembler() dealing with thousands of columns in dataframe

I was using spark ML pipeline to set up classification models on really wide table. This means that I have to automatically generate all the code that deals with columns instead of literately typing each of them. I am pretty much a beginner on scala and spark. I was stuck at the VectorAssembler() part when I was trying to do something like following:
val featureHeaders = featureHeader.collect.mkString(" ")
//convert the header RDD into a string
val featureArray = featureHeaders.split(",").toArray
val quote = "\""
val featureSIArray = featureArray.map(x => (s"$quote$x$quote"))
//count the element in headers
val featureHeader_cnt = featureHeaders.split(",").toList.length
// Fit on whole dataset to include all labels in index.
import org.apache.spark.ml.feature.StringIndexer
val labelIndexer = new StringIndexer().
setInputCol("target").
setOutputCol("indexedLabel")
val featureAssembler = new VectorAssembler().
setInputCols(featureSIArray).
setOutputCol("features")
val convpipeline = new Pipeline().
setStages(Array(labelIndexer, featureAssembler))
val myFeatureTransfer = convpipeline.fit(df)
Apparently it didn't work. I am not sure what should I do to make the whole thing more automatic or ML pipeline does not take that many columns at this moment(which I doubt)?
I finally figured out one way, which is not very pretty. It is to create vector.dense for the features, and then create data frame out of this.
import org.apache.spark.mllib.regression.LabeledPoint
val myDataRDDLP = inputData.map {line =>
val indexed = line.split('\t').zipWithIndex
val myValues = indexed.filter(x=> {x._2 >1770}).map(x=>x._1).map(_.toDouble)
val mykey = indexed.filter(x=> {x._2 == 3}).map(x=>(x._1.toDouble-1)).mkString.toDouble
LabeledPoint(mykey, Vectors.dense(myValues))
}
val training = sqlContext.createDataFrame(myDataRDDLP).toDF("label", "features")
You shouldn't use quotes (s"$quote$x$quote") unless column names contain quotes. Try
val featureAssembler = new VectorAssembler().
setInputCols(featureArray).
setOutputCol("features")
For pyspark, you can first create a list of the column names:
df_colnames = df.columns
Then you can use that in vectorAssembler:
assemble = VectorAssembler(inputCols = df_colnames, outputCol = 'features')
df_vectorized = assemble.transform(df)

Spark MLib - Create LabeledPoint from RDD[Vector] features and RDD[Vector] label

I am building a training set using two text files representing documents and labels.
Documents.txt
hello world
hello mars
Labels.txt
0
1
I have read in these files and converted my document data to a tf-idf weighted term-document matrix which is represented as a RDD[Vector]. I have also read-in and created a RDD[Vector] for my labels:
val docs: RDD[Seq[String]] = sc.textFile("Documents.txt").map(_.split(" ").toSeq)
val labs: RDD[Vector] = sc.textFile("Labels.txt")
.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(docs)
tf.cache()
val idf = new IDF(minDocFreq = 3).fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
I would like to use tfidf and labsto create a RDD[LabeledPoint], but I am not sure how to apply a mapping with two different RDDs. Is this even possible/efficient, or do I need to rethink my approach?
One way to handle this is to join based on indices:
import org.apache.spark.RangePartitioner
// Add indices
val idfIndexed = idf.zipWithIndex.map(_.swap)
val labelsIndexed = labels.zipWithIndex.map(_.swap)
// Create range partitioner on larger RDD
val partitioner = new RangePartitioner(idfIndexed.partitions.size, idfIndexed)
// Join with custom partitioner
labelsIndexed.join(idfIndexed, partitioner).values

Handling unseen categorical variables and MaxBins calculation in Spark Multiclass-classification

Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code.
I am calculating the max number of categories and then giving it as a parameter to RF. This takes a lot of time! Is there a parameter to set or an easier way to make the model automatically infer the max categories?Since it can go more than 1000 and I cannot omit them.
How do I handle unseen labels on new data for prediction since StringIndexer will not work in that case. the code below is just a split of data but I will be introducing new data as well in future
// Need to predict 2 classes
val cols_to_predict=Array("Label1","Label2")
// ID col
val omit_cols=Array("Key")
// reading the csv file
val data = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("abc.csv")
.cache()
// creating a features DF by droppping the labels so that I can run all
// the cols through String Indexer
val features=data.drop("Label1").drop("Label2").drop("Key")
// Since I do not know my max categories possible, I find it out
// and use it for maxBins parameter in RF
val distinct_col_counts=features.columns.map(x => data.select(x).distinct().count ).max
val transformers: Array[org.apache.spark.ml.PipelineStage] = features.columns.map(
cname => new StringIndexer().setInputCol(cname).setOutputCol(s"${cname}_index").fit(features)
)
val assembler = new VectorAssembler()
.setInputCols(features.columns.map(cname => s"${cname}_index"))
.setOutputCol("features")
val labelIndexer2 = new StringIndexer()
.setInputCol("prog_label2")
.setOutputCol("Label2")
.fit(data)
val labelIndexer1 = new StringIndexer()
.setInputCol("orig_label1")
.setOutputCol("Label1")
.fit(data)
val rf = new RandomForestClassifier()
.setLabelCol("Label1")
.setFeaturesCol("features")
.setNumTrees(100)
.setMaxBins(distinct_col_counts.toInt)
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer1.labels)
// Split into train and test
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
trainingData.cache()
testData.cache()
// Running only for one label for now Label1
val stages: Array[org.apache.spark.ml.PipelineStage] =transformers :+ labelIndexer1 :+ assembler :+ rf :+ labelConverter //:+ labelIndexer2
val pipeline=new Pipeline().setStages(stages)
val model=pipeline.fit(trainingData)
val predictions = model.transform(testData)