How to rename a duplicate column using column index? - scala

I have a dataframe that has two same name columns, since the first column (agreementID) holds a value, I want to rename the second column) which holds null values to a different name, and different records. I want to use the aggrementID as a key in the future.
enter image description here
enter image description here
Please help on how to rename the column using column position ore index?

val columnIndex = 1
val newColumnName = "new_name"
val cols = df.columns
cols(columnsIndex) = newColumnName
df.toDF(cols)

This should work:
val distinctColumns = Seq("name","agreementId","dupAgreementId")
val df = df.toDF(distinctColumns:_*)

Related

How to get column values from list which contains column names in spark scala dataframe

I have a config defined which contains a list of column for each table to be used as a dedup key
for ex:
config 1 :
val lst = List(section_xid, learner_xid)
these are the column that needs to be used as a dedup keys. This list is dynamic some table will have 1 value some will have 2 or 3 values in it
what I am trying to do is build a single key column from this list
df.
.withColumn( "dedup_key_sk", uuid(md5(concat($"lst(0)",$"lst(1)"))) )
how do I make this dynamic which will work for any number of columns in list .
I tried doing this
df.withColumn("dedup_key_sk", concat(Seq($"col1", $"col2"):_*))
For this to work I had to convert list to Df and each value in list needs to be in separate columns I was not able to figure that out.
tried doing this but didn't work
val res = sc.parallelize(List((lst))).toDF
ANy input here will be appreciated . Thank you
The list of strings can be mapped to a list of columns (using functions.col). This list of columns can then be used with concat:
val lst: List[String] = List("section_xid", "learner_xid")
df.withColumn("dedup_key_sk", concat(lst.map(col):_*)).show()

How can create a new DataFrame from a list?

Hello guys i have this function that gets the row Values from a DataFrame, converts them into a list and the makes a Dataframe from it.
//Gets the row content from the "content column"
val dfList = df.select("content").rdd.map(r => r(0).toString).collect.toList
val dataSet = sparkSession.createDataset(dfList)
//Makes a new DataFrame
sparkSession.read.json(dataSet)
What i need to do to make a list with other column values so i can have another DataFrame with the other columns values
val dfList = df.select("content","collection", "h").rdd.map(r => {
println("******ROW********")
println(r(0).toString)
println(r(1).toString)
println(r(2).toString) //These have the row values from the other
//columns in the select
}).collect.toList
thanks
Approach doesn't look right, you don't need to collect dataframe to just add new columns. Try adding columns to directly to dataframe using withColumn() withColumnRenamed() https://docs.azuredatabricks.net/spark/1.6/sparkr/functions/withColumn.html.
If you want to bring columns from another dataframe try joining. In any case it's not good idea to use collect as it will bring all your data to driver.

Dataframe column substring based on the value during join

I have a dataframe with column having values like "COR//xxxxxx-xx-xxxx" or "xxxxxx-xx-xxxx"
I need to compare this column with another column in a different dataframe based on the column value.
If column value have "COR//xxxxx-xx-xxxx", I need to use substring("column", 4, length($"column")
If the column value have "xxxxx-xx-xxxx", I can compare directly without using substring.
For example:
val DF1 = DF2.join(DF3, upper(trim($"column1".substr(4, length($"column1")))) === upper(trim(DF3("column1"))))
I am not sure how to add the condition while joining. Could anyone please let me know how can we achieve this in Spark dataframe?
You can try adding a new column based on the conditions and join on the new column. Something like this.
val data = List("COR//xxxxx-xx-xxxx", "xxxxx-xx-xxxx")
val DF2 = ps.sparkSession.sparkContext.parallelize(data).toDF("column1")
val DF4 = DF2.withColumn("joinCol", when(col("column1").like("%COR%"),
expr("substring(column1, 6, length(column1)-1)")).otherwise(col("column1")) )
DF4.show(false)
The new column will have values like this.
+------------------+-------------+
|column1 |joinCol |
+------------------+-------------+
|COR//xxxxx-xx-xxxx|xxxxx-xx-xxxx|
|xxxxx-xx-xxxx |xxxxx-xx-xxxx|
+------------------+-------------+
You can now join based on the new column added.
val DF1 = DF4.join(DF3, upper(trim(DF4("joinCol"))) === upper(trim(DF3("column1"))))
Hope this helps.
Simply create a new column to use in the join:
DF2.withColumn("column2",
when($"column1" rlike "COR//.*",
$"column1".substr(lit(4), length($"column1")).
otherwise($"column1"))
Then use column2 in the join. It is also possible to add the whole when clause directly in the join but it would look very messy.
Note that to use a constant value in substr you need to use lit. And if you want to remove the whole "COR//" part, use 6 instead of 4.

Filter columns having count equal to the input file rdd Spark

I'm filtering Integer columns from the input parquet file with below logic and been trying to modify this logic to add additional validation to see if any one of the input columns have count equals to the input parquet file rdd count. I would want to filter out such column.
Update
The number of columns and names in the input file will not be static, it will change every time we get the file.
The objective is to also filter out column for which the count is equal to the input file rdd count. Filtering integer columns is already achieved with below logic.
e.g input parquet file count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
//Get array of structfields
val columns = df.schema.fields.filter(x =>
x.dataType.typeName.contains("integer"))
//Get the column names
val z = df.select(columns.map(x => col(x.name)): _*)
//Get array of string
val m = z.columns
New Logic be like
val cnt = spark.read.parquet("inputfile").count()
val d = z.column.where column count is not equals cnt
I do not want to pass the column name explicitly to the new condition, since the column having count equal to input file will change ( val d = .. above)
How do we write logic for this ?
According to my understanding of your question, your are trying filter in columns with integer as dataType and whose distinct count is not equal to the count of rows in another input parquet file. If my understanding is correct, you can add column count filter in your existing filter as
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
x.dataType.typeName.contains("string") && df.select(x.name).distinct().count() != cnt)
Rest of the codes should follow as it is.
I hope the answer is helpful.
Jeanr and Ramesh suggested the right approach and here is what I did to get the desired output, it worked :)
cnt = (inputfiledf.count())
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME").<(cnt))

Tagging a HBase Table using Spark RDD in Scala

I am trying add an extra "tag" column to an Hbase table. Tagging is done on the basis of words present in the rows of the table. Say for example, If "Dark" appears in a certain row, then its tag will be added as "Horror". I have read all the rows from the table in a spark RDD and have matched them with words based on which we would tag. A snippet to code looks like this:
var hBaseRDD2=sc.newAPIHadoopRDD(conf,classOf[TableInputFormat],classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
val transformedRDD = hBaseRDD2.map(tuple => {
(Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieName"))),
Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieSummary"))),
Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieActor")))
)
})
Here, "moviesdata" is the columnfamily of the HBase table and "MovieName"&"MovieSummary" & "MovieActor" are column names. "transformedRDD" in the above snippet is of type RDD[String,String,String]. It has been converted into type RDD[String] by:
val arrayRDD: RDD[String] = transformedRDD.map(x => (x._1 + " " + x._2 + " " + x._3))
From this, all words have been extracted by doing this:
val words = arrayRDD.map(x => x.split(" "))
The words which we would are looking for in the HBase Table rows are in a csv file. One of the column, let's say "synonyms" column, of the csv has the words which we would look for. Another column in the csv is a "target_tag" column, which has the words which would be tagged to the row corresponding to which there is match.
Read the csv by:
val csv = sc.textFile("/tag/moviestagdata.csv")
reading the synonyms column: (synonyms column is the second column, therefore "p(1)" in the below snippet)
val synonyms = csv.map(_.split(",")).map( p=>p(1))
reading the target_tag column: (target_tag is the 3rd column)
val targettag = csv.map(_.split(",")).map(p=>p(2))
Some rows in synonyms and targetag have more than one strings and are seperated by "###". The snippet to seperate them is this:
val splitsyno = synonyms.map(x => x.split("###"))
val splittarget = targettag.map(x=>x.split("###"))
Now, to match each string from "splitsyno", we need to traverse every row, and further a row might have many strings, hence, to create a set of every string, I did this:(an empty set was created)
splitsyno.map(x=>x.foreach(y=>set += y)
To match every string with those in "words" created up above, I did this:
val check = words.exists(set contains _)
Now, the problem which I am facing is that I don't exactly know that strings from what rows in csv are matching to strings from what rows in HBase table. This is needed as I would need to find corresponding target string and which row in HBase table to add to. How should I get it done? Any help would be highly appreciated.