I have many JSON string lines in many files. They are very similar in schema, but a few differ in some cases.
I made a DataFrame from them and want to see only the rows that have a specific column, something like:
DF.filter("myColumn" is present).show
How can I do this?
You can use isNotNull in filter(). isNotNull is a method on Column, so there is nothing to import from org.apache.spark.sql.functions:
import spark.implicits._  // for the $"..." column syntax
df.filter($"myColumn".isNotNull)
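When Spark reads JSON files with slightly different schemas, fields that are missing from a record come back as null, so isNotNull keeps exactly the rows where the column is present. A minimal sketch, assuming the files live under a hypothetical path data/*.json:

val df = spark.read.json("data/*.json")  // hypothetical path; missing fields become null
df.filter($"myColumn".isNotNull).show()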
How do I combine data from two tables based on certain shared values in their rows?
I already tried using the which function and it didn't work.
I think you will have the best luck using the dplyr package. Specifically, you can use right_join(). You can write it like this: right_join(df1, df2, by = "specification")
This will combine the columns from df1 and df2, matching rows on the shared "specification" column and keeping every row of df2.
For future reference, it would help a lot if you included a screenshot of your code, just so it is easier to know exactly what you are asking.
Anyway, let me know if this answers your question!
When I call foo.show(), if the foo DataFrame contains too many columns, the result is not printed on a single row in a Jupyter notebook. Instead it is split across two rows, which is hard for a human to read. How can I solve this problem?
One solution is to display the DataFrame in a different view, using the parameters of pyspark.sql.DataFrame.show:
foo.show(vertical=True, truncate=False)
Otherwise, if you have a small DataFrame, you can use a workaround and convert it to pandas:
foo.toPandas()
I need to replace null values only in selected columns of a DataFrame. I know we have the df.na.fill option. How can we apply it only to selected columns, or is there a better option than df.na.fill?
Reading the Spark documentation, we can see that fill is well suited to your need. You can do something like:
df.na.fill(0, Seq("colA", "colB"))
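If different columns need different replacement values, fill also accepts a Map from column name to replacement. A short sketch, assuming hypothetical columns colA (numeric) and colB (string):

// Replace nulls in colA with 0 and nulls in colB with "unknown";
// all other columns are left untouched.
df.na.fill(Map("colA" -> 0, "colB" -> "unknown"))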
I have json files of the following structure:
{"names":[{"name":"John","lastName":"Doe"},
{"name":"John","lastName":"Marcus"},
{"name":"David","lastName":"Luis"}
]}
I want to read several such JSON files and deduplicate them based on the "name" field inside names.
I tried
df.dropDuplicates(Array("names.name"))
but it didn't do the magic.
This seems to be a regression introduced in Spark 2.0. If you bring the nested column up to the top level, you can drop the duplicates: create a new column from the column(s) you want to dedup on, drop the duplicates based on it, and finally drop the helper column. The following approach works for composite keys as well.
import org.apache.spark.sql.functions.{col, concat_ws}

val columns = Seq("names.name")
df.withColumn("DEDUP_KEY", concat_ws(",", columns.map(col): _*))
  .dropDuplicates("DEDUP_KEY")
  .drop("DEDUP_KEY")
Just for future reference, the solution looks like:
import org.apache.spark.sql.functions.{col, explode}

val uniqueNames = allNames
  .withColumn("DEDUP_NAME_KEY", explode(col("names.name")))
  .cache()
  .dropDuplicates("DEDUP_NAME_KEY")
  .drop("DEDUP_NAME_KEY")
I have a groupBy for a DataFrame which is based on 3 columns. I am doing something like this:
myDf.groupBy($"col1", $"col2", $"col3")
However, I am not sure exactly how this works.
Does it ignore case? For each column I need "FOO" and "foo" to be treated as the same value, and likewise "" and null.
If that is not the default behaviour, how can I add it? From the API doc I can see something with apply on a column, but I could not find any example.
Any idea?
You can call functions inside your groupBy statement, so in this case it sounds like you will want to convert the strings to lower case when grouping them. Check out the lower function:
https://spark.apache.org/docs/1.5.2/api/scala/index.html#org.apache.spark.sql.functions$
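A minimal sketch of that idea, assuming the col1/col2/col3 names from the question: lower makes "FOO" and "foo" group together, and coalesce maps null to the empty string so "" and null land in the same group.

import org.apache.spark.sql.functions.{coalesce, lit, lower}
import spark.implicits._  // for the $"..." column syntax

myDf.groupBy(
  coalesce(lower($"col1"), lit("")),
  coalesce(lower($"col2"), lit("")),
  coalesce(lower($"col3"), lit(""))
).count().show()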