Scala/Spark - Find total number of value in row based on a key - scala

I have a large text file which contains the page views of some Wikimedia projects. (You can find it here if you're really interested) Each line, delimited by a space, contains the statistics for one Wikimedia page. The schema looks as follows:
<project code> <page title> <num hits> <page size>
In Scala, using Spark RDDs or Dataframes, I wish to compute the total number of hits for each project, based on the project code.
So for example for projects with the code "zw", I would like to find all the rows that begin with project code "zw", and add up their hits. Obviously this should be done for all project codes simultaneously.
I have looked at functions like aggregateByKey etc, but the examples I found don't go into enough detail, especially for a file with 4 fields. I imagine it's some kind of MapReduce job, but how exactly to implement it is beyond me.
Any help would be greatly appreciated.

First, you have to read the file in as a Dataset[String]. Then, parse each string into a tuple, so that it can be easily converted to a DataFrame. Once you have a DataFrame, a simple .groupBy().agg() is enough to finish the computation.
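import org.apache.spark.sql.functions.sum
import spark.implicits._  // encoders for map/toDF and the $"..." syntax (already in scope in spark-shell)

// Keep only the project code and the hit count from each line
val df = spark.read.textFile("/tmp/pagecounts.gz").map { l =>
  val a = l.split(" ")
  (a(0), a(2).toLong)
}.toDF("project_code", "num_hits")

val agg_df = df.groupBy("project_code")
  .agg(sum("num_hits").as("total_hits"))
  .orderBy($"total_hits".desc)

agg_df.show(10)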
The above snippet shows the top 10 project codes by total hits.
+------------+----------+
|project_code|total_hits|
+------------+----------+
| en.mw| 5466346|
| en| 5310694|
| es.mw| 695531|
| ja.mw| 611443|
| de.mw| 572119|
| fr.mw| 536978|
| ru.mw| 466742|
| ru| 463437|
| es| 400632|
| it.mw| 400297|
+------------+----------+
It is certainly also possible to do this with the older API as an RDD map/reduce, but you lose many of the optimizations that the Dataset/DataFrame API brings.
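For comparison, the RDD map/reduce version might look roughly like the sketch below. It is only an illustration: the sample lines are made up (they are not from the real pagecounts file), and a local SparkSession is created so the snippet stands on its own. The map step emits (project_code, num_hits) pairs and reduceByKey adds up the hits per key.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell or a notebook `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("pagecounts-rdd-sketch").getOrCreate()

// Made-up sample lines in the <project code> <page title> <num hits> <page size> layout.
val lines = spark.sparkContext.parallelize(Seq(
  "en Main_Page 10 2048",
  "en Scala 5 1024",
  "zw Home 3 512",
  "zw Other 4 256"
))

// Split each line and drop anything that does not have exactly four fields.
val fields = lines.map(_.split(" ")).filter(_.length == 4)

// Map step: keep (project_code, num_hits).
val pairs = fields.map(a => (a(0), a(2).toLong))

// Reduce step: sum the hits for each project code.
val totalHits = pairs.reduceByKey(_ + _)

// The result is one row per project code, so collecting it to the driver is cheap.
val result = totalHits.collect().toMap
result.foreach { case (code, hits) => println(s"$code $hits") }
```

With a real file, `lines` would come from spark.sparkContext.textFile(...) instead of parallelize.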

Related

Combining two VCF files with differing sampleIds and locations

Good day,
How to combine multiple Variant call files (VCF) with differing subjects?
I have multiple VCF datasets with differing sampleIds and locations:
file1:
contigName |start | end | names | referenceAllele | alternateAlleles| qual| filters| splitFromMultiAllelic| genotypes
1 |792460|792461|["bla"]|G |["A"] |null|["PASS"] |false | [{"sampleId": "abba", "phased": false, "calls": [0, 0]}]
1 |792461|792462|["blaA"]|G |["A"] |null|["PASS"] |false | [{"sampleId": "abba", "phased": false, "calls": [0, 0]}]
file2:
contigName |start | end | names | referenceAllele | alternateAlleles| qual| filters| splitFromMultiAllelic| genotypes
1 |792460|792461|["bla"]|G |["A"] |null|["PASS"] |false | [{"sampleId": "baab", "phased": false, "calls": [0, 0]}]
1 |792464|792465|["blaB"]|G |["A"] |null|["PASS"] |false | [{"sampleId": "baab", "phased": false, "calls": [0, 0]}]
I need to combine these to single VCF file. I'm required to work in DataBricks (pyspark/scala) environment due to data security.
The Glow documentation had an idea, which I copied:
import pyspark.sql.functions as F
spark.read.format("vcf")\
    .option("flattenInfoFields", True)\
    .load(file_list)\
    .groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles', 'qual', 'filters', 'splitFromMultiAllelic')\
    .agg(F.sort_array(F.flatten(F.collect_list('genotypes'))).alias('genotypes'))\
    .write.mode("overwrite").format("vcf").save(my_output_destination)
This only works when the sampleIds are the same in both files. Otherwise it fails with:
Task failed while writing rows
Cannot infer sample ids because they are not the same in every row.
I'm considering creating a dummy table with NULL calls for all the IDs, but that seems silly (not to mention a huge resource sink).
Is there a simple way to combine VCF files with differing sampleIds? Or to autofill missing values with NULL calls?
Edit: I managed to do this with the bigVCF format. However, it autofills -1,-1 calls. I'd like to manually set the autofilled values to something that makes it clearer that they are not 'real'.
write.mode("overwrite").format("bigvcf").save(
The code above works if you have identical variants in both tables. I would not recommend using it to combine two distinct datasets as this would introduce batch effects.
The best practice for combining two datasets is to reprocess them from the BAM files to gVCF using the same pipeline. Then run joint-genotyping to merge the samples (instead of a custom spark-sql function).
Databricks does provide a GATK4 best practices pipeline that includes joint-genotyping. Or you can use DeepVariant to call mutations.
If it is not possible to reprocess the data, then the two datasets should be treated separately in a meta-analysis, as opposed to merging the VCFs and performing a mega-analysis.

What is the canonical way to create objects from rows of a Spark dataframe?

I am using Apache Zeppelin (0.9.0) and Scala (2.11.12). I want to pull some data out of a dataframe and store it to InfluxDB, later to be visualized in Grafana, and cannot figure it out. I'm trying a naive approach with a foreach loop. The idea is to iterate through all rows, extract the columns I need, create a Point object (from this InfluxDB client library), and either send it to InfluxDB or add it to a list and then send all the points in bulk, after the loop.
The dataframe looks like this:
+---------+---------+-------------+-----+
|timestamp|sessionId|eventDuration|speed|
+---------+---------+-------------+-----+
| 1| ses1| 0.0| 50|
| 2| ses1| 1.0| 50|
| 3| ses1| 2.0| 50|
I've tried to do what is described above:
import scala.collection.mutable.ListBuffer
import spark.implicits._
import org.apache.spark.sql._
import com.paulgoldbaum.influxdbclient._
import scala.concurrent.ExecutionContext.Implicits.global
val influxdb = InfluxDB.connect("172.17.0.4", 8086)
val database = influxdb.selectDatabase("test")
var influxData = new ListBuffer[Point]()
dfAnalyseReport.foreach(row =>
{
val point = Point("acceleration")
.addTag("speedBin", row.getLong(3).toString)
.addField("eventDuration", row.getDouble(2))
influxData += point
}
)
val influxDataList = influxData.toList
database.bulkWrite(influxDataList)
The only thing I am getting here is a cryptic java.lang.ClassCastException with no additional info, neither in the notebook output nor in the logs of the Zeppelin Docker container. The error seems to be somewhere in the foreach, as it appears even when I comment out the last two lines.
I also tried adapting approach 1 from this answer, using a case class for columns, but to no avail. I got it to run without an error, but the resulting list was empty. Unfortunately I deleted that attempt. I could reconstruct it if necessary, but I've spent so much time on this I'm fairly certain I have some fundamental misunderstanding on how this should be done.
One further question: I also tried writing each Point to the DB as it was constructed (instead of in bulk). The only difference is that instead of appending to the ListBuffer I did a database.write(point) operation. When done outside of the loop with a dummy point, it goes through without a problem - the data ends up in InfluxDB - but inside the loop it results in org.apache.spark.SparkException: Task not serializable
Could someone point me in the right way? How should I tackle this?
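foreach runs on the executors, so appending to a ListBuffer defined on the driver does not fill the driver's copy. I'd do it with the RDD map method and collect the results to a list (note getInt rather than getLong for the speed column, assuming it is an integer column; calling getLong on an Int value throws a ClassCastException):
val influxDataList = dfAnalyseReport.rdd.map(
  row => Point("acceleration")
    .addTag("speedBin", row.getInt(3).toString)
    .addField("eventDuration", row.getDouble(2))
).collect.toList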

flatten a spark data frame's column values and put it into a variable

Spark version 1.6.0, Scala version 2.10.5.
I have a spark-sql dataframe df like this,
+---------------+-------------------------------+
|address        |attributes                     |
+---------------+-------------------------------+
|1314 44 Avenue |Tours, Mechanics, Shopping     |
|115 25th Ave   |Restaurant, Mechanics, Brewery |
+---------------+-------------------------------+
From this dataframe, I would like values as below,
Tours, Mechanics, Shopping, Brewery
If I do this,
df.select(df("attributes")).collect().foreach(println)
I get,
[Tours, Mechanics, Shopping]
[Restaurant, Mechanics, Brewery]
I thought I could use flatMap instead and found this, so I tried to put the result into a variable using:
val allValues = df.withColumn(df("attributes"), explode("attributes"))
but I am getting an error:
error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: String
I was thinking if I can get an output using explode I can use distinct to get the unique values after flattening them.
How can I get the desired output?
I strongly recommend using Spark 2.x. On Cloudera, "spark-shell" launches the 1.6.x version, whereas "spark2-shell" gives you the 2.x shell. Check with your admin.
If you need to stay on Spark 1.6 and want an RDD solution, try this:
import spark.implicits._  // on Spark 1.6 use: import sqlContext.implicits._
val df = Seq(("1314 44 Avenue", Array("Tours", "Mechanics", "Shopping")),
  ("115 25th Ave", Array("Restaurant", "Mechanics", "Brewery"))).toDF("address", "attributes")
// Array columns come back as a Seq (a WrappedArray on Scala 2.10-2.12)
df.rdd.flatMap(x => x.getAs[Seq[String]]("attributes")).distinct().collect.foreach(println)
Results:
Brewery
Shopping
Mechanics
Restaurant
Tours
If the "attributes" column is not an array but a comma-separated string, use the version below, which gives the same results:
val df = Seq(("1314 44 Avenue","Tours,Mechanics,Shopping"),
("115 25th Ave","Restaurant,Mechanics,Brewery")).toDF("address","attributes")
df.rdd.flatMap( x => x.getAs[String]("attributes").split(",") ).distinct().collect.foreach(println)
The problem is that withColumn expects a String as its first argument (the name of the added column), but you're passing it a Column: df.withColumn(df("attributes"), ...).
You only need to pass "attributes" as a String.
Additionally, you need to pass a Column to the explode function, but you're passing a String. To make it a Column you can use df("columnName") or the Scala shorthand $ syntax, $"columnName".
Hope this example can help you.
import org.apache.spark.sql.functions._
val allValues = df.select(explode($"attributes").as("attributes")).distinct
Note that this will only preserve the attributes Column, since you want the distinct elements on that one.
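If the attributes column holds a comma-separated string rather than an array, the same explode approach can still be used by splitting the string first. The following is a minimal sketch, assuming Spark 2.x or later (it uses SparkSession) and sample data modelled on the question; the output column name "attribute" is just an illustrative choice.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
import spark.implicits._

// Sample data modelled on the question, with attributes as one comma-separated string.
val df = Seq(
  ("1314 44 Avenue", "Tours, Mechanics, Shopping"),
  ("115 25th Ave", "Restaurant, Mechanics, Brewery")
).toDF("address", "attributes")

// split turns the string into an array column (the pattern is a regex, so ",\\s*"
// also swallows the space after each comma); explode emits one row per element.
val distinctAttributes = df.select(explode(split(col("attributes"), ",\\s*")).as("attribute")).distinct()

// Only a handful of distinct values remain, so collecting them into a driver-side Set is safe.
val values = distinctAttributes.as[String].collect().toSet
values.foreach(println)
```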

store elements to hashset from file scala

I am playing around a bit with Scala and I want to open a text file, read each line and save some of the fields in a HashSet.
The input file will be something like this:
1 2 3
2 4 5
At first, I am just trying to store the first element of each column in a variable, but nothing seems to happen.
My code is:
var id = 0
val textFile = sc.textFile(inputFile);
val nline = textFile.map(_.split(" ")).foreach(r => id = r(0))
I am using Spark because I want to process larger amounts of data later, so I'm trying to get used to it. When I print id, I only get 0.
Any ideas?
A couple of things:
First, inside map and foreach you are running code out on your executors. The id variable you defined is on the driver. You can pass variables to your executors using closures, but not the other way around. If you think about it, when you have 10 executors running through records simultaneously which value of ID would you expect to be returned?
Edit - foreach is an action
I mistakenly described foreach as not being an action below. It is an action; it just lets you run arbitrary code against your rows. It is useful if, for example, you have your own code to save the result to a different data store. foreach does not bring any data back to the driver, so it does not help in your case.
End edit
Second, all of the spark methods you called are transformations, you haven't called an action yet. Spark doesn't actually run any code until an action is called. Instead it just builds a graph of the transformations you want to happen until you specify an action. Actions are things that require materializing a result either to provide data back to the driver or save them out somewhere like HDFS.
In your case, to get values back you will want to use an action like "collect" which returns all the values from the RDD back to the driver. However, you should only do this when you know there aren't going to be many values returned. If you are operating on 100 million records you do not want to try and pull them all back to the driver! Generally speaking you will want to only pull data back to the driver after you have processed and reduced it.
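As a rough sketch of the "reduce first, then collect" idea applied to the question's goal of filling a set: the snippet below uses the two sample lines from the question in place of sc.textFile(inputFile), and a local SparkSession so that it is self-contained. Both of those are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell `spark` and `sc` already exist.
val spark = SparkSession.builder().master("local[*]").appName("collect-sketch").getOrCreate()
val sc = spark.sparkContext

// Stand-in for sc.textFile(inputFile): the two sample lines from the question.
val textFile = sc.parallelize(Seq("1 2 3", "2 4 5"))

// Transformations: take the first field of each line and de-duplicate on the executors,
// so that only the reduced data has to travel back.
val firstFields = textFile.map(_.split(" ")(0)).distinct()

// collect() is the action that brings the (now small) result to the driver,
// where it can be turned into an ordinary Scala Set.
val ids: Set[String] = firstFields.collect().toSet
println(ids)
```

The set lives on the driver, so this only makes sense when the number of distinct values is small.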
"i am just trying to store the first element of each column to a variable but nothing seems to happen."
val file_path = "file.txt"
val ds = ss.read.textFile(file_path)
val ar = ds.map(x => x.split(" ")).first()
val (x,y,z) = (ar(0),ar(1),ar(2))
You can access the first value of the columns with x,y,z as above.
With your file, x=1, y=2, z=3.
val ar1 = ds.map(x => x.split(" "))
// select from the Dataset ar1 (not the Array ar); the columns can be named like this
val final_ds = ar1.select($"value".getItem(0).as("col1"), $"value".getItem(1).as("col2"), $"value".getItem(2).as("col3"))
Output :
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2| 3|
| 2| 4| 5|
+----+----+----+
You can run any kind of SQL-style query on final_ds, like the small sample below.
final_ds.select("col1","col2").where(final_ds.col("col1") > 1).show()
Output:
+----+----+
|col1|col2|
+----+----+
| 2| 4|
+----+----+

how to define features column in spark ml

I am trying to run the spark logistic regression function (ml not mllib). I have a dataframe which looks like (just the first row shown)
+-----+--------+
|label|features|
+-----+--------+
| 0.0| [60.0]|
(Right now just trying to keep it simple with only one dimension in the feature, but will expand later on.)
I run the following code - taken from the Spark ML documentation
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
val lrModel = lr.fit(df)
This gives me the error -
org.apache.spark.SparkException: Values to assemble cannot be null.
I'm not sure how to fix this error. I looked at sample_libsvm_data.txt which is in the spark github repo and used in some of the examples in the spark ml documentation. That dataframe looks like
+-----+--------------------+
|label| features|
+-----+--------------------+
| 0.0|(692,[127,128,129...|
| 1.0|(692,[158,159,160...|
| 1.0|(692,[124,125,126...|
Based on this example, my data looks like it should be in the right format, with one issue. Is 692 the number of features? Seems rather dumb if so - spark should just be able to look at the length of the feature vector to see how many features there are. If I do need to add the number of features, how would I do that? (Pretty new to Scala/Java)
Cheers
This error is thrown by VectorAssembler when any of the features is null. Please verify that your rows don't contain null values. If they do, you must replace the nulls with a default numeric value before running VectorAssembler.
Regarding the format of sample_libsvm_data.txt: it is stored in a sparse array/matrix form, where each row is represented as:
0 128:51 129:159 130:253 (where 0 is the label and each subsequent column is in index:numeric_feature format).
You can build your single-feature dataframe using the Vectors class as follows (I ran it on the 1.6.1 shell):
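On the question about 692: in the (692,[127,128,129...],...) display, the first number is the vector's size (its dimensionality). A sparse vector has to store it explicitly because only the non-zero entries are listed, whereas a dense vector such as [60.0] carries its length implicitly, so nothing needs to be added to it. The sketch below illustrates this with the mllib Vectors factory (the same package used in the 1.6 code in this answer); the values are taken from the sample line above, with the 1-based libsvm indices shifted to 0-based as Spark's loader does.

```scala
import org.apache.spark.mllib.linalg.Vectors

// "0 128:51 129:159 130:253" as a sparse vector: size 692, three stored entries.
// libsvm indices are 1-based, so 128/129/130 become 127/128/129 here.
val sparse = Vectors.sparse(692, Array(127, 128, 129), Array(51.0, 159.0, 253.0))

// size is the declared dimensionality, not the number of stored entries.
println(sparse.size)
println(sparse.numNonzeros)

// A dense vector needs no explicit size; it is simply the number of values given.
val dense = Vectors.dense(60.0)
println(dense.size)
```

On Spark 2.x and later, org.apache.spark.ml.linalg.Vectors offers the same sparse and dense factory methods for use with the ml package.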
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.ml.classification.LogisticRegression
val training1 = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(3.0)),
  (0.0, Vectors.dense(3.0))
)).toDF("label", "features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
val model1 = lr.fit(training1)  // fit on the dataframe defined above
For more, you can check examples at: https://spark.apache.org/docs/1.6.1/ml-guide.html#dataframe (Refer to section Code examples)