Convert dataframe to json in scala - mongodb

Assume I have a word-count example where I get a dataframe with a word in one column and its count in another column. I want to collect the rows and store them as an array of JSON objects in a Mongo collection.
E.g. for the dataframe:
| Word | Count |
| abc  | 1     |
| xyz  | 23    |
I should get JSON like:
{words:[{word:"abc",count:1},{word:"xyz",count:23}]}
When I tried .toJSON on the dataframe, collected the values as a list and added them to a dataframe, what got stored in my Mongo collection was a collection of strings rather than a collection of JSON documents.
Query used:
explodedWords1.toJSON.toDF("words").agg(collect_list("words")).toDF("words")
result : "{\"words\":[{\"word\":\"abc\",\"count\":1},{\"word\":\"xyz\",\"count\":23}]}"
I am new to Scala. Any help would be appreciated. (It would be helpful if no external package is used.)

The absolute best way to store data from dataframes into Mongo is to use the
MongoDB Spark Connector (https://docs.mongodb.com/spark-connector/master/).
Just add "org.mongodb.spark" %% "mongo-spark-connector" % "2.2.0" to your sbt dependencies and check the code below.
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("test")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/dbname")
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/dbname")
  .getOrCreate()

import spark.implicits._

val explodedWords1 = List(
  ("abc", 1),
  ("xyz", 23)
).toDF("Word", "Count")

MongoSpark.save(explodedWords1.write.option("collection", "wordcount").mode("overwrite"))
However, if you do want the results as a single JSON file, then the snippet below should do it:
explodedWords1.repartition(1).write.json("/tmp/wordcount")
Finally, if you want the JSON as a list of strings in your Scala code, just use:
explodedWords1.toJSON.collect()
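On the sample dataframe above, that returns an Array[String] along the lines of the following (shown as the REPL would print it; the field names follow the dataframe's column names):
Array({"Word":"abc","Count":1}, {"Word":"xyz","Count":23})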
Update:
I didn't see that you wanted all records aggregated to one field ("words")
If you use the code below, then all three methods above still function (swapping explodedWords1 with aggregated)
import org.apache.spark.sql.functions._
val aggregated = explodedWords1.agg(
  collect_list(map(lit("word"), 'Word, lit("count"), 'Count)).as("words")
)
Option 1: explodedWords1 (one document per word).
Option 2: aggregated (a single document holding the "words" array).
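If the goal is for Mongo to hold nested documents rather than JSON strings (the problem described in the question), a minimal sketch, reusing the connector setup and collection name from above, is to save the aggregated dataframe directly:
// Stores one document whose "words" field is an array of sub-documents
MongoSpark.save(aggregated.write.option("collection", "wordcount").mode("overwrite"))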

Related

loop a sequence of s3 parquet file path with same schema and save in a single dataframe in scala

What is needed: a seq of S3 locations is given. The difference between any two locations is the value of the table's partition column.
Each parquet folder has the same schema.
So we need to loop over the sequence of S3 parquet file paths with the same schema and save them into a single dataframe in Scala.
If you have an array with all of the directories you want to import, you can iterate over that array, make a collection of dataframes and then union them into a single one.
Try something like this.
// You now have a collection of dataframes
val dataframes = directories.map(dir => spark.read.parquet(dir))

// Union them into one
val df_union = dataframes.reduce(_ union _)
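Since the question says each location corresponds to a different partition-column value, a hedged variation of the same idea keeps that value as a column before the union (the bucket, path layout, column name dt and the partition values below are hypothetical):
import org.apache.spark.sql.functions.lit

// Hypothetical partition values and path layout; adapt to the real table
val partitions = Seq("2019-01-01", "2019-01-02")
val dataframes = partitions.map { p =>
  spark.read.parquet(s"s3a://bucket/table/dt=$p/").withColumn("dt", lit(p))
}
val df_union = dataframes.reduce(_ union _)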
Alternatively, if you turn on the following options, you can simply load the files recursively.
spark.read.parquet("s3a://path/to/root/")
The options are as follows.
spark.hive.mapred.supports.subdirectories true
spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive true
This can be used as follows:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("test")
  .set("spark.hive.mapred.supports.subdirectories", "true")
  .set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true")

val spark = SparkSession.builder.config(conf).getOrCreate()
val df = spark.read.parquet("s3a://path/to/root/")

I can't fit the FP-Growth model in spark

Please, can you help me? I have a dataset of 80 CSV files and a cluster of one master and 4 slaves. I want to read the CSV files into a dataframe and parallelize it across the four slaves. After that, I want to filter the dataframe with a group by. In my Spark queries, the result contains the columns "code_ccam" and "dossier" grouped by ("code_ccam","dossier"). I want to use the FP-Growth algorithm to detect sequences of "code_ccam" that are repeated per "dossier" (folder). But when I use the FPGrowth.fit() command, I get the following error:
"error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Dataset[_]"
Here are my spark commands:
val df = spark.read.option("header", "true").csv("file:///home/ia/Projet-Spark-ace/Donnees/Fichiers CSV/*.csv")
import org.apache.spark.sql.functions.{concat, lit}
val df2 = df.withColumn("dossier", concat(col("num_immatriculation"), lit(""), col("date_acte"), lit(""), col("rang_naissance"), lit(""), col("date_naissance")))
val df3 = df2.drop("num_immatriculation").drop("date_acte").drop("rang_naissance").drop("date_naissance")
val df4 = df3.select("dossier","code_ccam").groupBy("dossier","code_ccam").count()
val transactions = df4.agg(collect_list("code_ccam").alias("codes_ccam")).rdd.map(x => x)
import org.apache.spark.ml.fpm.FPGrowth
val fpgrowth = new FPGrowth().setItemsCol("code_ccam").setMinSupport(0.5).setMinConfidence(0.6)
val model = fpgrowth.fit(transactions)
Thank you very much. It worked. I replaced collect_list with collect_set.
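Putting the error message and that comment together, a minimal sketch of what the working pipeline could look like (inferred, not the asker's confirmed code: fit expects a Dataset rather than an RDD, the items column has to match the aggregation alias, and collect_set avoids duplicate items within a transaction):
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.functions.collect_set

// One transaction per "dossier", holding the set of its codes (no duplicates)
val transactions = df4.groupBy("dossier")
  .agg(collect_set("code_ccam").alias("codes_ccam"))

val fpgrowth = new FPGrowth()
  .setItemsCol("codes_ccam")   // matches the alias above
  .setMinSupport(0.5)
  .setMinConfidence(0.6)

// fit takes a Dataset/DataFrame, not an RDD
val model = fpgrowth.fit(transactions)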

Dataframe: how to groupBy/count then order by count in Scala

I have a dataframe that contains thousands of rows. What I'm looking for is to group by and count a column and then order by the output. What I did looks something like this:
import org.apache.spark.sql.hive.HiveContext
import sqlContext.implicits._
val objHive = new HiveContext(sc)
val df = objHive.sql("select * from db.tb")
val df_count=df.groupBy("id").count().collect()
df_count.sort($"count".asc).show()
You can use sort or orderBy as below
val df_count = df.groupBy("id").count()
df_count.sort(desc("count")).show(false)
df_count.orderBy($"count".desc).show(false)
Don't use collect() since it brings the data to the driver as an Array.
Hope this helps!
// Import SparkSession, the entry point to Spark's underlying APIs
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val pathOfFile = "f:/alarms_files/"

// Create the session and hold it in the spark variable
val spark = SparkSession.builder().appName("myApp").getOrCreate()

// Read the file; this returns a DataFrame of Rows
var df = spark.read.format("csv").option("header", "true").option("delimiter", "\t").load("file://" + pathOfFile + "db.tab")

// Group by the id column, count, and order by that count
df = df.groupBy(df("id")).agg(count("*").as("columnCount")).orderBy("columnCount")

// show displays only the top 20 records
df.show

// To display more than 20 records, e.g.:
df.show(50)

Spark - UnsupportedOperationException: collect_list is not supported in a window operation

I am using Spark 1.6. I have a dataframe generated from a parquet file with 6 columns. I am trying to group (partitionBy) and order (orderBy) the rows in the dataframe, to later collect those columns into an Array.
I wasn't sure if these actions were possible in Spark 1.6, but the following answers show how it can be done:
https://stackoverflow.com/a/35529093/1773841 #zero323
https://stackoverflow.com/a/45135012/1773841 #Ramesh Maharjan
Based on those answers I wrote the following code:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, struct}

val sqlContext: SQLContext = new HiveContext(sc)
val conf = sc.hadoopConfiguration
val dataPath = "/user/today/*/*"
val dfSource: DataFrame = sqlContext.read.format("parquet").option("dateFormat", "DDMONYY").option("timeFormat", "HH24:MI:SS").load(dataPath)

val w = Window.partitionBy("code").orderBy(col("date").desc)
val dfCollec = dfSource.withColumn("collected", collect_list(struct("col1", "col2", "col3", "col4", "col5", "col6")).over(w))
So, I followed the pattern written by Ramesh, and I created the sqlContext based on Hive as Zero recommended. But I am still getting the following error:
java.lang.UnsupportedOperationException:
'collect_list(struct('col1,'col2,'col3,'col4,'col5,'col6)) is not
supported in a window operation.
at org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:191)
at org.apache.spark.sql.Column.over(Column.scala:1052)
What am I missing still?
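One thing worth noting: collect_list over a window is supported from Spark 2.0 onward, so on Spark 2.x essentially the same pattern runs as written. A minimal sketch (column names taken from the question, Spark 2.x and the dfSource dataframe from above assumed):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Spark 2.x: collect_list is allowed as a window function here
val w = Window.partitionBy("code").orderBy(col("date").desc)
val dfCollec = dfSource.withColumn(
  "collected",
  collect_list(struct("col1", "col2", "col3", "col4", "col5", "col6")).over(w)
)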

how to create multiple list from dataframe in spark?

In my case, I want to order MongoDB documents, grouping them by a specific key, and create multiple lists grouped on the basis of one key of the schema.
Please help me.
val sparkSession = SparkSession.builder().getOrCreate()
MongoSpark.load[SparkSQL.Character](sparkSession).printSchema()
val characters = MongoSpark.load[SparkSQL.Character](sparkSession)
characters.createOrReplaceTempView("characters")
val sqlstmt = sparkSession.sql("SELECT * FROM characters WHERE site = 'website'")
...
You can do something like this (where sqlstmt is the dataframe from above and key is the column you want to group on):
val columns = sqlstmt.columns.map(col)

val grouped = sqlstmt
  .groupBy(key)
  .agg(collect_list(struct(columns: _*)).as("data"))
Don't forget to import
import org.apache.spark.sql.functions._
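Collecting the aggregated frame then gives the multiple lists the question asks for, one per key value. A minimal sketch, assuming the grouped dataframe from above:
// One Scala Seq[Row] per key value, pulled back to the driver
val listsPerKey = grouped.collect().map(r => r.getSeq[org.apache.spark.sql.Row](r.fieldIndex("data")))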