I am reading a JSON file into a Spark DataFrame, and Spark adds an extra column at the end.
var df : DataFrame = Seq(
(1.0, "a"),
(0.0, "b"),
(0.0, "c"),
(1.0, "d")
).toDF("col1", "col2")
df.write.mode(SaveMode.Overwrite).format("json").save("/home/neelesh/year=2018/")
val newDF = sqlContext.read.json("/home/neelesh/year=2018/*")
newDF.show
The output of newDF.show is:
+----+----+----+
|col1|col2|year|
+----+----+----+
| 1.0| a|2018|
| 0.0| b|2018|
| 0.0| c|2018|
| 1.0| d|2018|
+----+----+----+
However, the JSON file is stored as:
{"col1":1.0,"col2":"a"}
{"col1":0.0,"col2":"b"}
{"col1":0.0,"col2":"c"}
{"col1":1.0,"col2":"d"}
The extra column is not added if year=2018 is removed from the path. What can be the issue here?
I am running Spark 1.6.2 with Scala 2.10.5
Could you try:
val newDF = sqlContext.read.json("/home/neelesh/year=2018")
newDF.show
+----+----+
|col1|col2|
+----+----+
| 1.0| A|
| 0.0| B|
| 0.0| C|
| 1.0| D|
+----+----+
Quoting from the Spark 1.6 documentation:
Starting from Spark 1.6.0, partition discovery only finds partitions
under the given paths by default. For the above example, if users pass
path/to/table/gender=male to either SQLContext.read.parquet or
SQLContext.read.load, gender will not be considered as a partitioning
column
Spark treats a directory name of the form field=value as partition information; see https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#partition-discovery
So in your case year=2018 is interpreted as a partition on year, which is why the additional column appears.
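To make the field=value convention concrete, here is a minimal sketch in plain Scala. It is not Spark's actual implementation: PartitionPathSketch and partitionColumns are hypothetical names, and the code only mimics the idea that each name=value directory segment is read as a partition column and its value.

```scala
// Hypothetical helper, NOT Spark's real partition discovery code.
// It only illustrates how `name=value` path segments map to columns.
object PartitionPathSketch {
  def partitionColumns(path: String): List[(String, String)] =
    path.split("/").toList.flatMap { segment =>
      segment.split("=", 2) match {
        // A segment such as "year=2018" yields the pair ("year", "2018")
        case Array(name, value) if name.nonEmpty && value.nonEmpty =>
          Some(name -> value)
        // Ordinary directory and file names carry no partition information
        case _ => None
      }
    }

  def main(args: Array[String]): Unit = {
    println(partitionColumns("/home/neelesh/year=2018/part-00000")) // List((year,2018))
    println(partitionColumns("/home/neelesh/data/part-00000"))      // List()
  }
}
```

If the extra column is simply unwanted, calling newDF.drop("year") after the read should also remove it.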
I have an issue when trying to read partitioned data with Spark.
If the data in the partitioned column is in a specific format, it will show up as null in the resulting dataframe.
For example:
case class Alpha(a: String, b:Int)
val ds1 = Seq(Alpha("2020-02-11_12h32m12s", 1), Alpha("2020-05-21_10h32m52s", 2), Alpha("2020-06-21_09h32m38s", 3)).toDS
ds1.show
+--------------------+---+
| a| b|
+--------------------+---+
|2020-02-11_12h32m12s| 1|
|2020-05-21_10h32m52s| 2|
|2020-06-21_09h32m38s| 3|
+--------------------+---+
ds1.write.partitionBy("a").parquet("test")
val ds2 = spark.read.parquet("test")
ds2.show
+---+----+
| b| a|
+---+----+
| 2|null|
| 3|null|
| 1|null|
+---+----+
Do you have any idea how I could make that data show up as a String (or Timestamp) instead?
Thanks for the help.
I just needed to set the parameter spark.sql.sources.partitionColumnTypeInference.enabled to false. With type inference disabled, partition columns are read as strings.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
The first RDD, user_person, is a Hive table which records every person's information:
+---------+---+----+
|person_id|age| bmi|
+---------+---+----+
| -100| 1|null|
| 3| 4|null|
...
Below is my second RDD, a Hive table that has only 40 rows and contains only basic information:
+---+--------+------+------+
| id|startage|endage|energy|
+---+--------+------+------+
| 1| 0| 0.2| 1|
| 1| 2| 10| 3|
| 1| 10| 20| 5|
+---+--------+------+------+
I want to compute each person's energy requirement from the age range that the person's row falls into.
For example, a person aged 4 falls into the 2 to 10 range and therefore requires 3 energy. I want to add that value to the user_person RDD.
How can I do this?
First, initialize the Spark session with enableHiveSupport() and copy the Hive config files (hive-site.xml, core-site.xml, and hdfs-site.xml) to the Spark conf/ directory so that Spark can read from Hive.
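import org.apache.spark.sql.SparkSession
// Named `spark` so that the spark.sql(...) calls below resolve
val spark = SparkSession.builder()
  .appName("spark-scala-read-and-write-from-hive")
  .config("hive.metastore.warehouse.dir", params.hiveHost + "user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._ // enables the $"column" syntax used in the join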
Read the Hive tables as DataFrames:
val personDF = spark.sql("SELECT * FROM user_person")
val infoDF = spark.sql("SELECT * FROM person_info")
Join the two DataFrames on the age range:
val outputDF = personDF.join(infoDF, $"age" >= $"startage" && $"age" < $"endage")
outputDF contains all the columns of both input DataFrames, including energy. This is an inner join, so a person whose age falls into no range is dropped from the result.
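If every person should be kept even when no age range matches, a left outer join is one option. The following is a minimal sketch under assumptions: it needs Spark 2.x or later on the classpath, the sample rows are invented for illustration (they are not the asker's data), EnergyJoinSketch and withEnergy are hypothetical names, and the broadcast hint assumes the range table stays as small as the 40 rows described.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object EnergyJoinSketch {
  // Left outer join: every person is kept; energy is null when no range matches.
  def withEnergy(personDF: DataFrame, infoDF: DataFrame): DataFrame = {
    val ranges = broadcast(infoDF) // hint: the range table is assumed to be tiny
    personDF
      .join(
        ranges,
        personDF("age") >= ranges("startage") && personDF("age") < ranges("endage"),
        "left_outer")
      .select(personDF("person_id"), personDF("age"), ranges("energy"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("energy-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Invented sample rows, for illustration only
    val personDF = Seq((-100, 1), (3, 4), (7, 99)).toDF("person_id", "age")
    val infoDF = Seq((1, 0.0, 2.0, 1), (1, 2.0, 10.0, 3), (1, 10.0, 20.0, 5))
      .toDF("id", "startage", "endage", "energy")

    withEnergy(personDF, infoDF).show()
    spark.stop()
  }
}
```

The half-open condition (age >= startage and age < endage) means a boundary age such as 10 matches exactly one range, provided the ranges themselves do not overlap.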
I am using Apache Spark 2.0 Dataframe/Dataset API
I want to add a new column to my DataFrame from a list of values. The list has the same number of values as the DataFrame has rows.
val list = List(4,5,10,7,2)
val df = List("a","b","c","d","e").toDF("row1")
I would like to do something like:
val appendedDF = df.withColumn("row2", somefunc(list))
appendedDF.show()
// +----+------+
// |row1 |row2 |
// +----+------+
// |a |4 |
// |b |5 |
// |c |10 |
// |d |7 |
// |e |2 |
// +----+------+
I would be grateful for any ideas. In reality my DataFrame contains more columns.
You could do it like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// create rdd from the list
val rdd = sc.parallelize(List(4,5,10,7,2))
// rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:28
// zip the data frame with rdd
val rdd_new = df.rdd.zip(rdd).map(r => Row.fromSeq(r._1.toSeq ++ Seq(r._2)))
// rdd_new: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[33] at map at <console>:32
// create a new data frame from the rdd_new with modified schema
spark.createDataFrame(rdd_new, df.schema.add("new_col", IntegerType)).show
+----+-------+
|row1|new_col|
+----+-------+
| a| 4|
| b| 5|
| c| 10|
| d| 7|
| e| 2|
+----+-------+
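One caveat: RDD.zip assumes both RDDs have the same number of partitions and the same number of elements in each partition, which is not guaranteed for a DataFrame and a separately parallelized list. A possible alternative, sketched below under assumptions, keys both sides by position with zipWithIndex and joins on that key. AppendListColumnSketch and appendColumn are hypothetical names, and the code assumes Spark 2.x or later.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.IntegerType

object AppendListColumnSketch {
  // Pairs row i of `df` with values(i); rows without a partner are dropped (inner join).
  def appendColumn(spark: SparkSession, df: DataFrame, values: Seq[Int], name: String): DataFrame = {
    // Key both sides by position so the pairing does not depend on partitioning
    val rowsByIndex = df.rdd.zipWithIndex().map { case (row, i) => (i, row) }
    val valuesByIndex = spark.sparkContext.parallelize(values)
      .zipWithIndex()
      .map { case (v, i) => (i, v) }

    val merged = rowsByIndex.join(valuesByIndex)
      .sortByKey() // restore the original row order
      .values
      .map { case (row, v) => Row.fromSeq(row.toSeq :+ v) }

    spark.createDataFrame(merged, df.schema.add(name, IntegerType))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("append-list-column-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = List("a", "b", "c", "d", "e").toDF("row1")
    appendColumn(spark, df, List(4, 5, 10, 7, 2), "row2").show()
    spark.stop()
  }
}
```

The join adds a shuffle, so this is slower than zip; the trade-off is that it does not rely on matching partition layouts.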
Adding for completeness: the fact that the input list (which exists in driver memory) has the same size as the DataFrame suggests that this is a small DataFrame to begin with - so you might consider collect()-ing it, zipping with list, and converting back into a DataFrame if needed:
df.collect()
.map(_.getAs[String]("row1"))
.zip(list).toList
.toDF("row1", "row2")
That won't be faster, but if the data is really small it might be negligible and the code is (arguably) clearer.
Is it possible to factorize a Spark DataFrame column? By factorizing I mean mapping each unique value in the column to an ID, so that equal values always get the same ID.
Example, the original dataframe:
+----------+----------------+--------------------+
| col1| col2| col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370| A|
|1473492972|4060600988513370| A|
|1473509764|4060600988513370| B|
|1473513432|4060600988513370| C|
|1473513432|4060600988513370| A|
+----------+----------------+--------------------+
to the factorized version:
+----------+----------------+--------------------+
| col1| col2| col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370| 0|
|1473492972|4060600988513370| 0|
|1473509764|4060600988513370| 1|
|1473513432|4060600988513370| 2|
|1473513432|4060600988513370| 0|
+----------+----------------+--------------------+
In Scala itself it would be fairly simple, but since Spark distributes its DataFrames over nodes I'm not sure how to keep a mapping from A->0, B->1, C->2.
Also, assume the dataframe is pretty big (gigabytes), which means loading one entire column into the memory of a single machine might not be possible.
Can it be done?
You can use StringIndexer to encode letters into indices:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("col3")
.setOutputCol("col3Index")
val indexed = indexer.fit(df).transform(df)
indexed.show()
+----------+----------------+----+---------+
| col1| col2|col3|col3Index|
+----------+----------------+----+---------+
|1473490929|4060600988513370| A| 0.0|
|1473492972|4060600988513370| A| 0.0|
|1473509764|4060600988513370| B| 1.0|
|1473513432|4060600988513370| C| 2.0|
|1473513432|4060600988513370| A| 0.0|
+----------+----------------+----+---------+
Data:
val df = spark.createDataFrame(Seq(
(1473490929, "4060600988513370", "A"),
(1473492972, "4060600988513370", "A"),
(1473509764, "4060600988513370", "B"),
(1473513432, "4060600988513370", "C"),
(1473513432, "4060600988513370", "A"))).toDF("col1", "col2", "col3")
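StringIndexer produces a Double column next to the original one, whereas the question asks for integer IDs in col3 itself. The sketch below shows one way to get there by casting the index and replacing the column. It assumes Spark 2.x or later with the spark-mllib module available; FactorizeSketch, factorize and the temporary column suffix are hypothetical names. Note that by default StringIndexer orders labels by frequency, so the most frequent value gets index 0 rather than the first one encountered.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FactorizeSketch {
  // Replaces `column` with an integer ID per distinct value.
  def factorize(df: DataFrame, column: String): DataFrame = {
    val tmp = column + "_idx" // hypothetical temporary column name
    val indexed = new StringIndexer()
      .setInputCol(column)
      .setOutputCol(tmp)
      .fit(df)
      .transform(df)
    // Overwrite the original column with the index cast to Int, then drop the helper
    indexed.withColumn(column, col(tmp).cast("int")).drop(tmp)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("factorize-sketch")
      .getOrCreate()

    val df = spark.createDataFrame(Seq(
      (1473490929, "4060600988513370", "A"),
      (1473492972, "4060600988513370", "A"),
      (1473509764, "4060600988513370", "B"),
      (1473513432, "4060600988513370", "C"),
      (1473513432, "4060600988513370", "A"))).toDF("col1", "col2", "col3")

    factorize(df, "col3").show()
    spark.stop()
  }
}
```

Fitting the indexer runs as a distributed job and brings only the distinct labels back to the driver, so the full column does not have to fit in one machine's memory as long as the number of distinct values is manageable.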
You can use a user-defined function.
First you create the mapping you need:
val updateFunction = udf {(x: String) =>
x match {
case "A" => 0
case "B" => 1
case "C" => 2
case _ => 3
}
}
And now you only have to apply it to your DataFrame:
df.withColumn("col3", updateFunction(df.col("col3")))
How can I use aggregate functions in a where clause in Apache Spark 1.6?
Consider the following DataFrame
+---+------+
| id|letter|
+---+------+
| 1| a|
| 2| b|
| 3| b|
+---+------+
How can I select all rows where letter occurs more than once, i.e. the expected output would be
+---+------+
| id|letter|
+---+------+
| 2| b|
| 3| b|
+---+------+
This obviously does not work:
df.where(
df.groupBy($"letter").count()>1
)
My example is about count, but I'd like to be able to use the results of other aggregate functions as well.
EDIT:
Just for counting, I came up with the following solution:
df.groupBy($"letter").agg(
collect_list($"id").as("ids")
)
.where(size($"ids") > 1)
.withColumn("id", explode($"ids"))
.drop($"ids")
You can use left semi join:
df.join(
broadcast(df.groupBy($"letter").count.where($"count" > 1)),
Seq("letter"),
"leftsemi"
)
or window functions:
import org.apache.spark.sql.expressions.Window
df
.withColumn("count", count($"*").over(Window.partitionBy("letter")))
.where($"count" > 1)
In Spark 2.0 or later you can also use a Bloom filter, but it is not available in 1.x.
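The window-function approach carries over to other aggregates, which is what the question asks about beyond count. As a minimal sketch, the code below keeps only the rows whose id equals the maximum id within their letter group. AggregateFilterSketch and maxIdPerLetter are hypothetical names, and the main method assumes Spark 2.x or later because it uses SparkSession; on 1.6 the same column expressions would be used with a SQLContext or HiveContext instead.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}

object AggregateFilterSketch {
  // Keeps the rows whose id is the largest id within the same letter group.
  def maxIdPerLetter(df: DataFrame): DataFrame = {
    val byLetter = Window.partitionBy("letter")
    df.withColumn("max_id", max(col("id")).over(byLetter)) // aggregate as a window column
      .where(col("id") === col("max_id"))                  // filter on the aggregate result
      .drop("max_id")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aggregate-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "b")).toDF("id", "letter")
    maxIdPerLetter(df).show()
    spark.stop()
  }
}
```

Swapping max for avg, min or sum follows the same pattern: compute the aggregate over the window as a temporary column, filter on it, then drop it.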