How to create a dataframe along with schema from the individual values - scala

I have some individual values and I have to convert them into a dataframe. I tried the code below; only one row of output should come.
val matchingcount= 3
val notmatchingcount=5
val filename=h:/filename1
import spark.implicits._
val data=Seq("+filename+","+matchingcount+","+notmatchingcount+").toDF("ezfilename","match_count","non_matchcount")
data.show()
throwing error:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (1): value
New column names (3): ezfilename,match_count,non_matchcount
Any help please

You were almost there! The code that does what you want is the following:
val matchingcount= 3
val notmatchingcount=5
val filename="h:/filename1"
import spark.implicits._
val data=Seq((filename,matchingcount,notmatchingcount)).toDF("ezfilename","match_count","non_matchcount")
data.show()
+------------+-----------+--------------+
| ezfilename|match_count|non_matchcount|
+------------+-----------+--------------+
|h:/filename1| 3| 5|
+------------+-----------+--------------+
There are 3 key differences between your code and the code above:
In Scala, a string literal has to be surrounded by " characters, so I've added these characters to val filename=.
You were correct that you can use a Seq together with the toDF method after importing spark.implicits._, but each element of the Seq represents one row of the dataframe. So instead of creating a dataframe with 3 columns you were creating one with a single column holding a single value. The way to get 3 columns is to put a tuple inside your Seq, so note the difference between Seq(bla, bla, bla) and Seq((bla, bla, bla)), where the latter is the correct one. You can also create multiple rows this way: Seq((bla, bli, blu), (blo, ble, bly)).
In Scala, the way you access a variable's value is simply by writing the variable's name, so writing filename instead of "+filename+" is the correct way of doing that.
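If you also want to state the schema explicitly, as your title suggests, here is a minimal sketch using createDataFrame with a StructType; the column names and types are just taken from your example, and spark, filename, matchingcount and notmatchingcount are the values defined above.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
// Explicit schema: one string column and two integer columns
val schema = StructType(Seq(
  StructField("ezfilename", StringType, nullable = false),
  StructField("match_count", IntegerType, nullable = false),
  StructField("non_matchcount", IntegerType, nullable = false)
))
// Each Row becomes one row of the resulting dataframe
val rows = Seq(Row(filename, matchingcount, notmatchingcount))
val dataWithSchema = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
dataWithSchema.show()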
Hope this helps!

Related

check data size spark dataframes

I have the following question:
Actually I am working with the following csv file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a spark dataframe as follows:
My aim is to check the length and type of each field in the dataframe following the set of rules below:
col type
job char10
marital char7
I started implementing the check of the length of each field but I am getting a compilation error :
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")
data.map(line => {
val fields = line.toString.split(";")
fields(0).size
fields(1).size
})
The expected output should be:
List(10,10)
As for the check of the types, I don't have any idea how to implement it since we are using dataframes. Any idea about a function for verifying the data format?
Thanks a lot in advance for your replies.
I see you are trying to use a DataFrame, but since there are multiple double quotes you can read the file as text, remove them, and then convert to a DataFrame as below:
import org.apache.spark.sql.functions._
import spark.implicits._
// Read the file as plain text and strip the extra double quotes
val raw = spark.read.textFile("path to file")
  .map(_.replaceAll("\"", ""))
// Use the first line as the header and drop it from the data
val header = raw.first
val data = raw.filter(row => row != header)
  .map { r => val x = r.split(";"); (x(0), x(1)) }
  .toDF(header.split(";"): _*)
Then data.show(false) gives you:
+----------+-------+
|job |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the sizes you can use withColumn with the length function and adapt as you need:
data.withColumn("jobSize", length($"job"))
.withColumn("maritalSize", length($"marital"))
.show(false)
Output:
+----------+-------+-------+-----------+
|job |marital|jobSize|maritalSize|
+----------+-------+-------+-----------+
|management|married|10 |7 |
|technician|single |10 |6 |
+----------+-------+-------+-----------+
All the column types are String.
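If you also want to check the fields against the length rules (job char10, marital char7), here is a rough sketch reusing the data dataframe and the imports above; the rule values are just the ones from the question.
// Inspect the column types straight from the schema
data.dtypes.foreach { case (name, dtype) => println(s"$name: $dtype") }
// Flag rows whose fields exceed the allowed lengths (10 for job, 7 for marital)
data.withColumn("jobOk", length($"job") <= 10)
  .withColumn("maritalOk", length($"marital") <= 7)
  .show(false)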
Hope this helps!
You are using a dataframe, so when you use the map method you are processing a Row in your lambda.
So line is a Row.
Row.toString returns a string representing the whole Row, in your case two struct fields typed as String.
If you want to use map and process your Row, you have to get the values inside the fields manually, for example with getAs[String] or getString.
Usually when you use DataFrames you work in column logic, as in SQL, using select, where... or directly the SQL syntax.
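For completeness, a rough sketch of the map-based approach described here, reusing the cleaned data dataframe (columns job and marital) and the imports from the previous answer:
// Each element of a DataFrame is a Row; pull the fields out by name before measuring them
val sizes = data.map { row =>
  (row.getAs[String]("job").length, row.getAs[String]("marital").length)
}
sizes.show(false)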

Converting Column of Dataframe to Seq[Columns] Scala

I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required parameter inside the agg should be a Seq[Columns].
I then have a dataframe "expr" containing the following:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The column sequences is of string type and I want to use the values of each row as input to agg, but I am not able to access them.
Any idea of how to approach this?
If you can change the strings in the sequences column to be SQL expressions, then this is possible to solve. Spark provides a function expr that takes a SQL string and converts it into a Column. Example dataframe with working expressions:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to Seq[Column]s do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)
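Putting it together, a minimal end-to-end sketch; df and keys stand for the dataframe and grouping keys from the question, and the SQL strings are just examples:
import org.apache.spark.sql.functions.{col, expr}
import spark.implicits._
// Dataframe holding the aggregation expressions as SQL strings
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
// Collect the strings to the driver and turn each one into a Column
val seqs = df2.as[String].collect().map(expr(_))
// agg takes one Column plus varargs, hence head and tail
val test = df.groupBy(keys.map(col(_)): _*).agg(seqs.head, seqs.tail: _*)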

How to divide the value of current row with the following one?

In Spark SQL version 1.6, using DataFrames, is there a way to calculate, for a specific column, the result of dividing the current row by the next one, for every row?
For example, if I have a table with one column, like so
Age
100
50
20
4
I'd like the following output
Fraction
2
2.5
5
The last row is dropped because it has no "next row" to be divided by.
Right now I am doing it by ranking the table and joining it with itself, where the rank is equals to rank+1.
Is there a better way to do this?
Can this be done with a Window function?
A Window function only does part of the trick; the other part can be done by defining a udf function:
def div = udf((age: Double, lag: Double) => lag/age)
First we need to find the lag using a Window function and then pass that lag and the age to the udf function to find the div:
import sqlContext.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val dataframe = Seq(
("A",100),
("A",50),
("A",20),
("A",4)
).toDF("person", "Age")
val windowSpec = Window.partitionBy("person").orderBy(col("Age").desc)
val newDF = dataframe.withColumn("lag", lag(dataframe("Age"), 1) over(windowSpec))
And finally call the udf function:
newDF.filter(newDF("lag").isNotNull).withColumn("div", div(newDF("Age"), newDF("lag"))).drop("Age", "lag").show
Final output would be
+------+---+
|person|div|
+------+---+
| A|2.0|
| A|2.5|
| A|5.0|
+------+---+
Edited:
As @Jacek suggested, a better solution is to use .na.drop instead of .filter(newDF("lag").isNotNull) and the / operator, so we don't even need to call the udf function:
newDF.na.drop.withColumn("div", newDF("lag")/newDF("Age")).drop("Age", "lag").show
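As an alternative sketch (not part of the original answer), the lead window function looks one row ahead, which matches the "next row" wording of the question directly; it reuses the dataframe built above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead}
val windowSpec = Window.partitionBy("person").orderBy(col("Age").desc)
// lead(Age, 1) is the value of the next row; the last row gets null and is dropped
dataframe
  .withColumn("next_age", lead(col("Age"), 1) over windowSpec)
  .na.drop
  .withColumn("div", col("Age") / col("next_age"))
  .drop("Age", "next_age")
  .show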

How to display the results brought from column functions using spark/scala like what show() does to a dataframe

I just started learning how to use dataframes and columns in Spark/Scala. I know that if I want to show something on the screen I can just do df.show(). But how can I do this for a column? For example,
scala> val dfcol = df.apply("sgan")
dfcol: org.apache.spark.sql.Column = sgan
This finds a column called "sgan" in the dataframe df and assigns it to dfcol, so dfcol is a Column. Then, if I do
scala> abs(dfcol)
res29: org.apache.spark.sql.Column = abs(sgan)
I just get the result shown above. How can I show the result of this function on the screen like df.show() does? Or, in other words, how can I see the results of functions like abs, min and so forth?
You should always use a dataframe; Column objects are not meant to be inspected this way. You can use select to create a dataframe with the column you're interested in, and then use show():
df.select(functions.abs(df("sgan"))).show()
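A couple more examples, assuming the df and dfcol from the question: holding on to a Column reference is fine, you just wrap it in a dataframe operation before showing anything.
import org.apache.spark.sql.functions.{abs, min}
// Evaluate the abs expression by selecting it from the dataframe
df.select(abs(dfcol)).show()
// Aggregate functions such as min go through agg (or select) in the same way
df.agg(min(dfcol)).show()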

Is it possible to alias columns programmatically in spark sql?

In spark SQL (perhaps only HiveQL) one can do:
select sex, avg(age) as avg_age
from humans
group by sex
which would result in a DataFrame with columns named "sex" and "avg_age".
How can avg(age) be aliased to "avg_age" without using textual SQL?
Edit:
After zero323's answer, I need to add the constraint that:
The column-to-be-renamed's name may not be known/guaranteed or even addressable. In textual SQL, using "select EXPR as NAME" removes the requirement to have an intermediate name for EXPR. This is also the case in the example above, where "avg(age)" could get a variety of auto-generated names (which also vary among spark releases and sql-context backends).
Let's suppose human_df is the DataFrame for humans. Since Spark 1.3:
human_df.groupBy("sex").agg(avg("age").alias("avg_age"))
If you prefer to rename a single column it is possible to use withColumnRenamed method:
case class Person(name: String, age: Int)
val df = sqlContext.createDataFrame(
Person("Alice", 2) :: Person("Bob", 5) :: Nil)
df.withColumnRenamed("name", "first_name")
Alternatively you can use alias method:
import org.apache.spark.sql.functions.avg
df.select(avg($"age").alias("average_age"))
You can take it further with small helper:
import org.apache.spark.sql.Column
def normalizeName(c: Column) = {
val pattern = "\\W+".r
c.alias(pattern.replaceAllIn(c.toString, "_"))
}
df.select(normalizeName(avg($"age")))
Turns out def toDF(colNames: String*): DataFrame does exactly that. Pasting from 2.11.7 documentation:
def toDF(colNames: String*): DataFrame
Returns a new DataFrame with columns renamed. This can be quite
convenient in conversion from a RDD of tuples into a DataFrame
with meaningful names. For example:
val rdd: RDD[(Int, String)] = ...
rdd.toDF() // this implicit conversion creates a DataFrame
// with column name _1 and _2
rdd.toDF("id", "name") // this creates a DataFrame with
// column name "id" and "name"
Anonymous columns, such as the one that would be generated by avg(age) without AS avg_age, get automatically assigned names. As you point out in your question, the names are implementation-specific, generated by a naming strategy. If needed, you could write code that sniffs the environment and instantiates an appropriate discovery & renaming strategy based on the specific naming strategy. There are not many of them.
In Spark 1.4.1 with HiveContext, the format is "_cN" where N is the position of the anonymous column in the table. In your case, the name would be _c1.
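If the generated name really cannot be relied on, one workaround (just a sketch, not from the answers above) is to rename by position rather than by name, since the grouping columns come first and the aggregated column is appended last:
import org.apache.spark.sql.functions.avg
// Aggregate without an alias; the result column gets an auto-generated name
val agged = human_df.groupBy("sex").agg(avg("age"))
// Rename whatever the last column happens to be called, without hard-coding it
val renamed = agged.withColumnRenamed(agged.columns.last, "avg_age")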