How to get a List[String] from a DataFrame - Scala

I have a text file in HDFS with a list of ids that I want to read as a list of strings. When I do this
spark.read.text(filePath).collect.toList
I get a List[org.apache.spark.sql.Row] instead. How do I read this file into a list of strings?

If you use spark.read.textFile(filePath) instead, you will get a Dataset[String] instead of a DataFrame (i.e., Dataset[Row]). Then when you collect you will get an Array[String] instead of an Array[Row].
You can also convert a DataFrame with a single string column into a Dataset[String] using df.as[String]. So df.as[String].collect will give you an Array[String] from a DataFrame (assuming the DataFrame contains a single string column; otherwise this will fail).

Use map(_.getString(0)) to extract the value from the Row object:
spark.read.text(filePath).map(_.getString(0)).collect.toList
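Putting the pieces together, here is a minimal sketch of all three approaches side by side (assuming an active SparkSession named spark, one id per line, and a hypothetical filePath):

import spark.implicits._  // provides the String encoder needed by .as[String] and map

val filePath = "hdfs:///path/to/ids"  // hypothetical location of the id file

// 1) Read directly as a Dataset[String]
val ids1: List[String] = spark.read.textFile(filePath).collect().toList

// 2) Read as a DataFrame and convert its single string column
val ids2: List[String] = spark.read.text(filePath).as[String].collect().toList

// 3) Read as a DataFrame and extract the value from each Row
val ids3: List[String] = spark.read.text(filePath).map(_.getString(0)).collect().toList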

Related

I have a JSON string in my dataframe, and I already tried to extract the JSON string columns using PySpark.

df = spark.read.json("dbfs:/mnt/evbhaent2blobs", multiLine=True)
df2 = df.select(F.col('body').cast("Struct").getItem('CustomerType').alias('CustomerType'))
display(df)
I am taking a guess that your dataframe has a column "body" which is a JSON string, and you want to parse the JSON and extract an element from it.
First you need to define or infer the JSON schema. Then parse the JSON string and extract its elements as columns. From the extracted columns, you can select the desired ones.
from pyspark.sql import functions as F

json_schema = spark.read.json(df.rdd.map(lambda row: row.body)).schema
df2 = df.withColumn('body_json', F.from_json(F.col('body'), json_schema))\
.select("body_json.*").select('CustomerType')
display(df2)
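For readers working in Scala, a rough equivalent of the same two-step approach (infer the schema from the JSON strings, then parse with from_json) might look like the sketch below; the column names body and CustomerType are taken from the question above, and spark is assumed to be the active SparkSession:

import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._  // for .as[String]

// Infer the schema of the JSON strings held in the "body" column
val jsonSchema = spark.read.json(df.select("body").as[String]).schema

// Parse the JSON string into a struct column and pick the desired field
val df2 = df.withColumn("body_json", from_json(col("body"), jsonSchema))
  .select("body_json.*")
  .select("CustomerType")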

Converting a Column of a Dataframe to Seq[Column] in Scala

I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required parameter inside agg should be a Seq[Column].
I then have a dataframe "expr" containing the following:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The column sequences is of string type and I want to use the value in each row as input to agg, but I am not able to access them.
Any ideas on how to do this?
If you can change the strings in the sequences column to be valid SQL expressions, then this can be solved. Spark provides a function expr that takes a SQL string and converts it into a Column. Example dataframe with working expressions:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to a Seq[Column], do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)
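A minimal end-to-end sketch of this answer, assuming spark is the active SparkSession, df has the hypothetical columns key, A and B, and using two simple aggregation strings for illustration:

import org.apache.spark.sql.functions.expr
import spark.implicits._  // for toDF and .as[String]

// Aggregation expressions stored as SQL strings in a single-column dataframe
val exprDf = Seq("sum(A) as sumA", "count(B) as cntB").toDF("sequences")

// Collect the strings and turn each one into a Column with expr
val seqs = exprDf.as[String].collect().map(expr(_))

// Use them in the aggregation; "key" is a hypothetical grouping column
val result = df.groupBy("key").agg(seqs.head, seqs.tail: _*)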

Spark Dataframe - Add new Column from List[String]

I have a List[String] and want to add the values of these Strings as column names to an existing Dataframe.
Is there a way to do it without iterating over the List? If iterating over the List is the only way, how best can I achieve it?
Must be dumb... should have tried this before.
Got the answer after a little try:
val test: DataFrame = useCaseTagField_l.foldLeft(ds_segments)((df, tag) => df.withColumn(tag._2, lit(null)))
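For the plain List[String] case described in the question, a minimal sketch of the same foldLeft pattern (assuming an existing dataframe df; the list of names is hypothetical):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

val newCols = List("col_a", "col_b", "col_c")  // hypothetical column names to add

// Add each name as a new column, initialised to null, without an explicit loop
val result: DataFrame = newCols.foldLeft(df)((acc, name) => acc.withColumn(name, lit(null)))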

How do I convert Array[Row] to RDD[Row]

I have a scenario where I want to convert the result of a dataframe, which is in the format Array[Row], to an RDD[Row]. I have tried using parallelize, but I don't want to use it as it needs to hold the entire data in a single system, which is not feasible on a production box.
val Bid = spark.sql("select Distinct DeviceId, ButtonName from stb").collect()
val bidrdd = sparkContext.parallelize(Bid)
How do I achieve this? I tried the approach given in this link (How to convert DataFrame to RDD in Scala?), but it didn't work for me.
val bidrdd1 = Bid.map(x => (x(0).toString, x(1).toString)).rdd
It gives the error: value rdd is not a member of Array[(String, String)]
The variable Bid which you've created here is not a DataFrame; it is an Array[Row], which is why you can't call .rdd on it. If you want an RDD[Row], simply call .rdd on the DataFrame (without calling collect):
val rdd = spark.sql("select Distinct DeviceId, ButtonName from stb").rdd
Your post contains some misconceptions worth noting:
... a dataframe which is in the format Array[Row] ...
Not quite - the Array[Row] is the result of collecting the data from the DataFrame into Driver memory - it's not a DataFrame.
... I don't want to use it as it needs to contain entire data in a single system ...
Note that as soon as you use collect on the DataFrame, you've already collected the entire data into a single JVM's memory. So using parallelize is not the issue.
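A minimal sketch of the fully distributed alternative, keeping the data as an RDD instead of collecting it (assuming spark is the active SparkSession and the same stb table):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// No collect(): the rows stay distributed across the executors
val bidRdd: RDD[Row] = spark.sql("select distinct DeviceId, ButtonName from stb").rdd

// Further processing also runs on the executors, e.g. extracting the two fields
val pairs: RDD[(String, String)] = bidRdd.map(row => (row(0).toString, row(1).toString))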

Unwanted quotes appearing in data while saving dataframe as file

I first read a delimited file with multiple rows and index its rows using zipWithIndex.
Next I'm trying to write that dataframe, which is created from an RDD[Row], to a CSV file using Scala.
This is my code:
val FileDF = spark.read.csv(inputfilepath)
val rdd = FileDF.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((indexedRow._2.toLong+SEED+1) +: indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier",LongType)).++(FileDF.schema.fields))
val dataframenew = spark.createDataFrame(rdd,FileDFWithSeqNo)
dataframenew.write.format("com.databricks.spark.csv").option("delimiter","|").save("C:\\Users\\path\\Desktop\\IndexedOutput")
where dataframenew is the final dataframe.
Input data is like:
0|0001|10|1|6001825851|0|0|0000|0|003800543||2017-03-02 00:00:00|95|O|473|3.74|0.05|N|||5676|6001661630||473|1|||UPS|2017-03-02 00:00:00|0.0000||0||20170303|793358|793358115230979
0|0001|10|1|6001825853|0|0|0000|0|003811455||2017-03-02 00:00:00|95|O|90|15.14|0.55|N|||1080|6001661630||90|1|||UPS|2017-03-02 00:00:00|0.0000||0||20170303|793358|793358115230980
0|0001|10|1|6001825854|0|0|0000|0|003812898||2017-03-02 00:00:00|95|O|15|7.60|1.33|N|||720|6001661630||15|1|||UPS|2017-03-02 00:00:00|0.0000||0||20170303|793358|793358115230981
I'm zipping with index to get a unique identifier for each row.
But this gives me an output file with data like:
1001,"0|0001|10|1|6001825851|0|0|0000|PS|0|0.0000||0||20170303|793358|793358115230979",cabc
While the expected output should be :
1001,0|0001|10|1|6001825851|0|0|0000|PS|0|0.0000||0||20170303|793358|793358115230979,cabc
Why are the extra quotes getting added to the data, and how can I eliminate them?