I am naming a dataframe based on the file that I am ingesting. However, a file name may contain numbers; for example, a file name could be DocumentList20200101.
I am trying to remove the numbers so that the dataframe name will be DocumentList.
I would really appreciate any help on this.
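A minimal sketch using Python's built-in re module, assuming the digits can appear anywhere in the file name and should all be removed:

```python
import re

# Strip every run of digits from the file name before using it as a dataframe name.
file_name = "DocumentList20200101"
df_name = re.sub(r"\d+", "", file_name)
print(df_name)  # DocumentList
```

Note that this also removes digits in the middle of a name (e.g. "Doc2List" becomes "DocList"), so adjust the pattern if only trailing digits should go (e.g. `r"\d+$"`).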
I've just scraped a dataset of names from a website but the names are coming into the dataframe duplicated. Example:
[MarkMark, SarahSarah, BenBen]
The website I'm scraping from has images in the table, and it seems that when I pulled the table into a dataframe it duplicated each name. How would I go about cleaning this data so I only have one copy of each name?
Try splitting the name string in the middle (note the integer division `//`; plain `/` returns a float in Python 3 and breaks slicing):
df["name"] = df["name"].apply(lambda name: name[:len(name)//2])
I am iterating through csv files in a folder using for loop and performing some operations on each csv (getting the count of rows for each unique id and storing all these outputs into a pyspark dataframe). Now my requirement is to add the name of the file as well to the dataframe for each iteration. Can anyone suggest some way to do this
You can get the file name as a column using the function pyspark.sql.functions.input_file_name. If your files share the same schema and you want to apply the same processing pipeline, you don't need to loop over them; you can read them all at once using a glob pattern:
from pyspark.sql.functions import input_file_name

df = spark.read.csv("path/to/the/files/*.csv", header=True, sep=";") \
    .withColumn("file_name", input_file_name())
I am working in a Databricks notebook (Scala) and I have a Spark query that goes something like this:
df = spark.sql("SELECT columnName AS `Column Name` FROM table")
I want to store this as a databricks table. I tried below code for the same:
df.write.mode("overwrite").saveAsTable("df")
But it is giving an error because of the space in the column name. Here's the error:
Attribute name contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
I don't want to remove the space so is there any alternative for this?
No, that's a limitation of the underlying technologies Databricks uses under the hood (for example, PARQUET-677). The only solution here is to rename the column; if you need the space in the name, rename it back when reading the data.
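A sketch of that rename-and-restore approach, with hypothetical helper names (to_safe / to_pretty); the Spark calls are shown only as comments since they depend on your session, and the inverse mapping assumes the original names contain no real underscores:

```python
import re

def to_safe(name):
    # Replace the characters Parquet rejects (" ,;{}()\n\t=") with underscores.
    return re.sub(r"[ ,;{}()\n\t=]", "_", name)

def to_pretty(name):
    # Naive inverse for the space case only; assumes no genuine underscores.
    return name.replace("_", " ")

# Sketch of Spark usage (not executed here):
# safe_df = df.toDF(*[to_safe(c) for c in df.columns])
# safe_df.write.mode("overwrite").saveAsTable("df")
# raw = spark.table("df")
# restored = raw.toDF(*[to_pretty(c) for c in raw.columns])
```

The same idea works in Scala with `df.toDF(df.columns.map(toSafe): _*)`.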
I have a basic DataFrame containing all the data and several derivative DataFrames that I've been subsequently creating from the basic DF making grouping, joins etc.
Every time I want to append a column to the last DataFrame containing the most relevant data I have to do something like this:
val theMostRelevantFinalDf = olderDF.withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date"))
.cast(TimestampType), "UTC").cast(StringType)).drop($"new_date")
As you may see, I have to change the original column name to new_date_, but I want the column name to remain the same.
However, if I don't change the name, the column gets dropped, so renaming is just a not-too-pretty workaround.
How can I preserve the original column name when appending the column?
As far as I know, you cannot create two columns with the same name in a DataFrame transformation. I rename the new column back to the old name, like this:
val theMostRelevantFinalDf = olderDF.withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date"))
.cast(TimestampType), "UTC").cast(StringType)).drop($"new_date").withColumnRenamed("new_date_", "new_date")
I am reading an XML file into a Spark DataFrame using com.databricks.spark.xml and trying to generate a CSV file as output.
My input looks like this:
<id>1234</id>
<dtl>
<name>harish</name>
<age>21</age>
<class>II</class>
</dtl>
My output should be a CSV file combining the id with the rest of the XML, like:
id, xml
1234,<dtl><name>harish</name><age>21</age><class>II</class></dtl>
Is there a way to achieve the output in the above format?
Your help is very much appreciated.
Create a plain RDD by loading the XML as a text file using sc.textFile(), without parsing it.
Extract the id manually with a regex or XPath, and slice the XML fragment out of each string from the opening tag to the closing tag.
Once that's done, you will have your data in (id, "xml") pairs.
I hope this tactical solution helps you.
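A minimal sketch of those steps in plain Python; the function name to_id_xml_pair is made up, and it assumes each record's <id> and <dtl> tags fit on one line. With Spark you would map the same function over sc.textFile(...):

```python
import re

def to_id_xml_pair(line):
    # Pull the id out of its tag, and slice the <dtl>...</dtl> fragment
    # out of the record as a raw string, without parsing the XML.
    id_match = re.search(r"<id>(.*?)</id>", line)
    dtl_match = re.search(r"<dtl>.*?</dtl>", line)
    return (id_match.group(1), dtl_match.group(0))

record = "<id>1234</id><dtl><name>harish</name><age>21</age><class>II</class></dtl>"
pair = to_id_xml_pair(record)
print(pair)  # ('1234', '<dtl><name>harish</name><age>21</age><class>II</class></dtl>')
```

If records span multiple lines, you would need to read whole files (e.g. sc.wholeTextFiles) and add the re.DOTALL flag so `.` crosses newlines.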