Data cleaning for duplications within cells of a dataframe

I've just scraped a dataset of names from a website but the names are coming into the dataframe duplicated. Example:
[MarkMark, SarahSarah, BenBen]
The website I'm scraping has images in the table, and it seems that when I pull the table into a dataframe it duplicates the name. How would I go about cleaning this data so that each cell holds the name only once?

Try splitting the name string in the middle:
# use floor division (//): in Python 3, len(name)/2 is a float and slicing with it raises a TypeError
df["name"] = df["name"].apply(lambda name: name[:len(name) // 2])
This assumes every cell is an exact doubling of the name, as in your example.

Related

Filter on CassandraJoinRDD

I have applied a join between a file and an existing Cassandra table via joinWithCassandraTable. Now I want to apply a filter on the joined CassandraRDD. Here is the code I have written to extract the data:
var outrdd = sc.textFile("/usr/local/spark/bin/select_element/src/main/scala/file_small.txt")
.map(_.toString).map(Tuple1(_))
.joinWithCassandraTable(settings.keyspace, settings.table)
.select("id", "listofitems")
Here "/usr/local/spark/bin/select_element/src/main/scala/file_small.txt" is the text file which is having a list of ids. Now, I have some elements in another list, say userlistofitems=["jas", "yuk"], I need to search 'userlistofitems' sublist from 'listofitems' column of joinCassandraRDD.
We have around 2 million ids, with several user lists for which we have to extract the data from Cassandra. We are using spark=2.4.4, scala=2.11.12, and spark-cassandra-connector=spark-cassandra-connector-2.4.2-3-gda70746.jar.
Any help is highly appreciated.
References Used:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc,
https://www.youtube.com/watch?v=UsenTP029tM
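
If the containment check only needs to happen on the Spark side, a plain RDD filter over the join result may be enough. Here is a minimal sketch, assuming listofitems is a Cassandra list<text> column and reusing outrdd and userlistofitems from the question:
import com.datastax.spark.connector._

// The join result is a pair RDD of (left-side tuple, CassandraRow), so a
// standard RDD filter can inspect the Cassandra row of each pair.
val userlistofitems = Seq("jas", "yuk")

val matching = outrdd.filter { case (_, cassandraRow) =>
  // getList reads the list<text> column as a Scala collection
  val items = cassandraRow.getList[String]("listofitems")
  userlistofitems.forall(items.contains)
}
Note that this filter runs in Spark after the rows have been fetched from Cassandra; as far as I know the connector's where() pushdown cannot express collection containment, so for your 2 million ids a Spark-side filter like this is usually the practical option.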

Scala - How to append a column to a DataFrame preserving the original column name?

I have a basic DataFrame containing all the data, and several derivative DataFrames that I've subsequently created from the basic DF via groupings, joins, etc.
Every time I want to append a column to the last DataFrame containing the most relevant data I have to do something like this:
val theMostRelevantFinalDf = olderDF.withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date"))
.cast(TimestampType), "UTC").cast(StringType)).drop($"new_date")
As you can see, I have to change the original column name to new_date_, but I want the column name to remain the same. However, if I don't change the name, the column gets dropped. So renaming is just an ugly workaround.
How can I preserve the original column name when appending the column?
As far as I know you cannot create two columns with the same name in a single DataFrame transformation, so I rename the new column to the older one's name, like:
val theMostRelevantFinalDf = olderDF.withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date"))
.cast(TimestampType), "UTC").cast(StringType)).drop($"new_date").withColumnRenamed("new_date_", "new_date")
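To avoid repeating that pattern, it can be wrapped in a small helper. This is a minimal sketch of the same compute-under-a-temporary-name-then-rename trick; withReplacedColumn is a hypothetical name, not a Spark API:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, to_utc_timestamp, unix_timestamp}
import org.apache.spark.sql.types.{StringType, TimestampType}

// Hypothetical helper: compute the value under a temporary name, drop the
// original column, then rename the temporary column back to the original.
def withReplacedColumn(df: DataFrame, name: String, value: Column): DataFrame =
  df.withColumn(s"${name}_tmp", value)
    .drop(name)
    .withColumnRenamed(s"${name}_tmp", name)

val theMostRelevantFinalDf = withReplacedColumn(olderDF, "new_date",
  to_utc_timestamp(unix_timestamp(col("new_date")).cast(TimestampType), "UTC").cast(StringType))
Also worth noting: recent Spark versions document withColumn as "adding a column or replacing the existing column that has the same name", so depending on your version olderDF.withColumn("new_date", ...) may already do what you want without any renaming.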

How to merge multiple outputs from a single table using Talend?

I have an Employee table in a database with a gender column, and I want to filter the Employee data by gender into three columns in an Excel file, like this:
I'm getting this output using the Talend schema below (Structure 1):
I want to optimize the structure above and have been trying the approach below, but I'm stuck on another scenario: I'm getting the Employee data gender-wise, but in three different files. Is there any way to achieve the same Excel result from one SQL input and, after the mapping, get it in a single output Excel file?
Structure 2:
NOTE: I don't want to use the same input table many times. I want to get the same output using a single table and a single output Excel file, so please suggest any component that would be useful here.
Thanks in advance!!!

Incrementally adding to a Hive table w/Scala + Spark 1.3

Our cluster has Spark 1.3 and Hive.
There is a large Hive table that I need to add randomly selected rows to.
There is a smaller table that I read and check a condition against; if that condition is true, I grab the variables I need to then query for the random rows. What I did was run a query on that condition, table.where(value < number), then make it an array using take(numRows). Since all of these rows contain the information I need about which random rows are needed from the large Hive table, I iterate through the array.
When I run that query I use ORDER BY RAND() (via sqlContext). I created a var Hive table (to be mutable), adding a column from the larger table. In the loop, I do a unionAll: newHiveTable = newHiveTable.unionAll(random_rows).
I have tried many different ways to do this, but am not sure what is the best way to avoid CPU and temp disk use. I know that Dataframes aren't intended for incremental adds.
One thing I have thought of trying now is to create a CSV file, write the random rows to that file incrementally in the loop, and then, when the loop is finished, load the CSV file as a table and do one unionAll to get my final table.
Any feedback would be great. Thanks
I would recommend that you create an external table in Hive, defining the location, and then let Spark write the output as CSV to that directory:
in Hive:
create external table test(key string, value string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LOCATION '/SOME/HDFS/LOCATION'
Then, from Spark, with the aid of https://github.com/databricks/spark-csv , write the DataFrame to CSV files, appending to the existing ones:
import org.apache.spark.sql.SaveMode
df.write.format("com.databricks.spark.csv").mode(SaveMode.Append).save("/SOME/HDFS/LOCATION/")
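Putting it together, the loop then appends each batch of random rows straight into the external table's location instead of accumulating unionAlls. A rough sketch, where smallTable, big_table, group_key, and the numeric bounds are placeholders standing in for your actual schema (and df.write assumes the Spark 1.4+ writer API; on 1.3 the equivalent is df.save(path, source, mode)):
import org.apache.spark.sql.SaveMode

// Placeholders: smallTable and big_table stand in for the real tables.
val candidates = smallTable.filter("value < 100").take(1000)

for (row <- candidates) {
  val key = row.getString(0)
  // ORDER BY RAND() as in the question; LIMIT bounds each random sample.
  val randomRows = sqlContext.sql(
    s"SELECT * FROM big_table WHERE group_key = '$key' ORDER BY RAND() LIMIT 10")
  // Each batch lands as extra CSV part-files under the external table location,
  // so no DataFrame has to grow across iterations.
  randomRows.write
    .format("com.databricks.spark.csv")
    .mode(SaveMode.Append)
    .save("/SOME/HDFS/LOCATION/")
}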

Split a comma delimited string to a table in BIRT

I'm creating a BIRT report and I need to split a comma delimited string from a dataset into multiple columns in a table.
The data looks like:
256,1400.031,-70.014,1,4.544,0.36,10,31,30.89999962,0
256,1400,-69.984,2,4.574,1.36,10,0,0,0
...
The data is stored this way in the database and I can't change it, but I need to be able to display it as a table. I'm new to BIRT; any ideas?
I think the easiest way is to create a computed column in the dataset for each field.
For example, if the merged field from the database is named "mergedData", you can split it with this kind of expression:
First field (computed column) expression:
var tempArray=row["mergedData"].split(",");
tempArray[0];
Second field:
var tempArray=row["mergedData"].split(",");
tempArray[1];
etc.
This depends on some variables that you did not mention. If the dataset is stagnant (not updated much, or ever), open the data set with Excel, converting it from .csv to .xls, and save. Then use the Excel file as a data source; assuming you are using BIRT 4.1 or newer, this should work fine. I don't think there is any SQL code that easily converts .csv.