Is anyone using the DataFrame package (https://github.com/rothnic/DataFrame)?
I use it because it also works with older MATLAB releases. However, I have a very basic question:
How do I change a value in a DataFrame?
With MATLAB's built-in table type, this is straightforward. For example, t{1,{'prior'}} = {'a = 2000'}, where t is a table and I assign a cell to it. I cannot figure out how to do the same with the DataFrame package.
The DataFrame author no longer seems to be maintaining it(?). I hope someone can give more examples of how to use its methods.
Thanks!
How do I combine data from two tables based on certain shared values from the row?
I already tried using the which function and it didn't work.
I think you will have the best luck using the dplyr package. Specifically, you can use right_join(). You can write it like this: right_join(df1, df2, by = "specification")
This keeps every row of df2 and adds the columns from df1 wherever the shared specification value matches. Rows of df2 with no match in df1 get NA in the df1 columns.
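As a rough illustration of the join described above, here is a minimal plain-Python sketch of what a right join does. It is not dplyr code: the helper right_join, the example tables, and the price and stock columns are made up for illustration, and it assumes the two tables share no column other than the key.

```python
def right_join(left, right, by):
    """Keep every row of `right`; add the columns of `left` where `by` matches."""
    # Collect the columns that only the left table contributes.
    left_cols = []
    for row in left:
        for col in row:
            if col != by and col not in left_cols:
                left_cols.append(col)

    # Index the left rows by key so each lookup is cheap.
    index = {}
    for row in left:
        index.setdefault(row[by], []).append(row)

    joined = []
    for r in right:
        # An unmatched right row is still kept, with None (R's NA) on the left side.
        matches = index.get(r[by], [{}])
        for l in matches:
            out = {by: r[by]}
            out.update({c: l.get(c) for c in left_cols})
            out.update({c: v for c, v in r.items() if c != by})
            joined.append(out)
    return joined


# Hypothetical example data.
df1 = [{"specification": "A", "price": 10}, {"specification": "B", "price": 20}]
df2 = [{"specification": "B", "stock": 5}, {"specification": "C", "stock": 0}]

result = right_join(df1, df2, by="specification")
for row in result:
    print(row)
```

Specification "A" exists only in df1, so it is dropped; "C" exists only in df2, so it is kept with an empty price.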
For future reference it would be a lot of help if you included a screenshot of code just so it is easier to know exactly what you are asking.
Anyway, let me know if this answers your question!
I need to replace null values only in selected columns of a dataframe. I know we have the df.na.fill option. How can we apply it only to selected columns, or is there a better option than df.na.fill?
According to the Spark documentation, fill is well suited to your need. You can do something like:
df.na.fill(0, Seq("colA", "colB"))
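The effect of passing a column list can be pictured with a small plain-Python sketch. This is not Spark code: the helper fill_nulls and the sample rows are hypothetical, and rows are modelled as dictionaries with None standing in for null. Only the named columns are touched; nulls elsewhere are left alone.

```python
def fill_nulls(rows, value, columns):
    """Return copies of `rows` with None replaced by `value`, only in `columns`."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay unchanged
        for col in columns:
            if col in new_row and new_row[col] is None:
                new_row[col] = value
        filled.append(new_row)
    return filled


# Hypothetical example data.
rows = [
    {"colA": None, "colB": 2, "colC": None},
    {"colA": 1, "colB": None, "colC": 3},
]

# In PySpark the analogous call would be df.na.fill(0, subset=["colA", "colB"]).
result = fill_nulls(rows, 0, ["colA", "colB"])
print(result)
```

Here colC keeps its null because it is not in the list of selected columns.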
I have a Spark dataframe.
Here it is
I would like to fetch the values of a column one by one and assign each to a variable. How can this be done in PySpark? Sorry, I am a newbie to Spark as well as Stack Overflow. Please forgive the lack of clarity in the question.
col1=df.select(df.column_of_df).collect()
list1=[str(i[0]) for i in col1]
#after this we can iterate through list (list1 in this case)
I don't understand exactly what you are asking, but if you want to store the values in a variable outside of Spark's dataframes, the best option is to select the column you want and convert it to pandas with toPandas() (only if there are not too many rows, because everything is loaded into the driver's memory).
from pyspark.sql import functions as F
var = df.select(F.col('column_you_want')).toPandas()
toPandas() returns a pandas DataFrame, so you can then iterate over var['column_you_want'] like a normal pandas Series.
I am currently working through the book Spark in Action, and I came across several different ways of referring to the same columns:
val postsIdBody = postsDf.select('id, 'body)
val postsIdBody = postsDf.select($"id", $"body")
val postsIdBody = postsDf.select("id", "body")
All three give similar results. Is there any real difference between them? Can anyone clearly explain in which situations each form should be used?
Thanks in advance
I'm sure the book covers this, but by importing the implicits in Scala, you can use the 'id and $"id" forms to create Column objects without typing out new Column(name).
You would use Column objects rather than strings because ordering and aliasing are easier to express with them in the dataframe API.
I have a Dataset/Dataframe with a mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column of type ml.linalg.Vector to this dataset (so I will have both types of Vectors). The reason is that I am evaluating a few algorithms, and some of them expect an mllib vector while others expect an ml vector. Also, I have to feed the output of one algorithm into another, and each uses a different type.
Can someone please help me convert mllib.linalg.Vector to ml.linalg.Vector and append it as a new column to the dataset at hand? I tried using MLUtils.convertVectorColumnsToML() inside a UDF and in regular functions, but could not get it working. I am trying to avoid creating a new dataset and then doing an inner join and dropping columns, because the dataset will eventually be huge and joins are expensive.
You can use the method asML to convert from an mllib vector to an ml vector. A UDF and usage example can look like this:
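import org.apache.spark.sql.functions.udf

// Wrap the mllib-to-ml conversion in a UDF so it can be applied column-wise.
val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) => {
  mllibVec.asML
})
// Append the converted vector as a new column; the original column is kept.
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))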
This assumes df is the original dataframe and the column holding the mllib vector is named mllibVector.