How to update dataframe without using loop - pyspark

I have two source dataframes:
Storeorder: {columns=Store, Type_of_carriers, No_of_carriers, Total_space_required}
Fleetplanner: {columns=Store, Truck_Type, Truck_space, Route}
The requirement is:
Create a list with {Store, Type_of_carriers, No_of_carriers, Route}.
In the Fleetplanner data, one Store can have more than one Truck_Type and
Route. Also, one Route can have multiple Stores or stops associated with it.
Each time I take a record from Storeorder, I have to assign how many carriers will go to which route.
At the same time, I have to update the Fleetplanner data with the space left for the next stores.
I have done this in Pandas using a loop, and it takes a huge amount of time.
Can anyone please suggest an alternative way to solve this problem in Spark?
I've solved the problem in Pandas, but I want to parallelize it in Spark.
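For what it's worth, a cumulative window function is one way to express this kind of "space left" bookkeeping without a row-by-row loop. The sketch below is only illustrative: it assumes DataFrames named storeorder and fleetplanner with the columns described above (hypothetical names), and the exact allocation rules would still need to be layered on top.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hedged sketch, not the exact allocation logic: storeorder and fleetplanner
# are assumed to be Spark DataFrames with the columns listed above.
orders = storeorder.join(fleetplanner, on="Store", how="inner")

# Running total of the space requested on each (Route, Truck_Type), in Store order.
w = (Window.partitionBy("Route", "Truck_Type")
           .orderBy("Store")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

allocation = (orders
    .withColumn("cum_space", F.sum("Total_space_required").over(w))
    # A store still "fits" on the truck while the running total stays within
    # Truck_space; later stores on the route would need another truck.
    .withColumn("fits", F.col("cum_space") <= F.col("Truck_space"))
    .withColumn("space_left", F.col("Truck_space") - F.col("cum_space")))

allocation.select("Store", "Type_of_carriers", "No_of_carriers",
                  "Route", "fits", "space_left").show()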

Related

In Power Query, when duplicating the source query should I duplicate the Transform File folder as well?

My apologies in advance if this question has already been asked; if so, I could not find it.
So, I have this huge database divided by country, and I need to import each country's database individually and then, in Power Query, append the queries into one.
When I imported the US files, Power Query automatically generated a Transform File folder with 4 helper queries.
Then I just duplicated the query US - Sales, named it UK - Sales, and pointed it to the UK sales folder.
The Transform File folder didn't get duplicated, though.
Everything seems to be working just fine right now; however, I'd like to know if this could become a problem in the near future, because I still have several countries to go. Should I manually import new queries as new connections instead of just duplicating them, or does it just not matter?
Many thanks!
The Transform File folder group contains the code that is called to transform a list of files. It is reusable code. You can see the Sample File, which serves as the template for the transform actions.
As long as the file used as the Sample File has the same structure as the files you are feeding into the command, you can use any query with any list of files.
One thing you need to make sure of is that the Sample File is not removed from your data source. You may want to create a new dummy file just for that purpose, make sure it won't be deleted, and then point the Sample File query to pull just that file.
The Transform Helper Queries are special queries: you may edit them, but you cannot delete them and recreate your own manually. They are automatically created by Power Query when combining a list of contents, and they are inherently linked to the parent query.
That said, you cannot replicate them; you must use the Combine function provided by Power Query to create the helper queries.
You may, however, avoid duplicating the queries: replicate your steps in the parent query and use a table union to join the lists before combining the contents with the same helper queries.

Spark Structured streaming - dropDuplicates with watermark alternate solution

I am trying to deduplicate streaming data using the dropDuplicates function with a watermark. The problem I am currently facing is that I have two timestamps for a given record:
One is the eventtimestamp - the timestamp of the record's creation at the source.
The other is a transfer timestamp - the timestamp from an intermediate process that is responsible for streaming the data.
The duplicates are introduced during the intermediate stage, so for a given record and its duplicate, the eventtimestamp is the same but the transfer timestamp is different.
For the watermark, I'd like to use the transferTimestamp, because I know the duplicates can't occur more than 3 minutes apart in transfer. But I can't use it within dropDuplicates, because it won't catch the duplicates - the duplicates have different transfer timestamps.
Here is an example:
Event 1: { "EventString":"example1", "Eventtimestamp": "2018-11-29T10:00:00.00", "TransferTimestamp": "2018-11-29T10:05:00.00" }
Event 2 (duplicate): { "EventString":"example1", "Eventtimestamp": "2018-11-29T10:00:00.00", "TransferTimestamp": "2018-11-29T10:08:00.00" }
In this case, the duplicate was created during transfer, 3 minutes after the original event.
My code is like below:
streamDataset
  .withWatermark("transferTimestamp", "4 minutes")
  .dropDuplicates("eventstring", "transferTimestamp");
The above code won't drop the duplicates, as transferTimestamp is unique for an event and its duplicate. But currently this is the only way, because Spark forces me to include the watermark column in dropDuplicates when a watermark is set.
I would really like to see a dropDuplicates implementation like the one below, which would be valid for any at-least-once stream: I wouldn't have to use the watermark field in dropDuplicates, and the watermark-based state eviction would still be honored. But that is not the case currently:
streamDataset
  .withWatermark("transferTimestamp", "4 minutes")
  .dropDuplicates("eventstring");
I can't use the eventtimestamp, as it is not ordered and its time range varies drastically (delayed events and junk events).
If anyone has an alternate solution or ideas for deduping in such a scenario, please let me know.
For your use case, you can't use the dropDuplicates API directly. You have to use an arbitrary stateful operation instead, via a Spark API like flatMapGroupsWithState.
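For illustration only, here is a minimal sketch of that idea in PySpark. flatMapGroupsWithState itself is a Scala/Java API; the closest PySpark counterpart is applyInPandasWithState (Spark 3.4+). The sketch keeps one flag per eventstring, emits only the first record seen, and uses a processing-time timeout instead of the watermark to evict state; the DataFrame name, column names, and timeout value are assumptions based on the question, not the asker's exact schema.

from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

# Sketch only: streamDataset is assumed to be a streaming DataFrame with an
# "eventstring" column; 10 minutes is an assumed timeout, comfortably above
# the ~3-minute window in which duplicates are known to arrive.
def emit_first_only(key, pdf_iter, state: GroupState):
    if state.hasTimedOut:
        state.remove()                 # no duplicate can still arrive; drop state
        return
    seen = state.get[0] if state.exists else False
    for pdf in pdf_iter:
        if not seen and len(pdf) > 0:
            seen = True
            yield pdf.iloc[[0]]        # emit the first occurrence, drop the rest
    state.update((seen,))
    state.setTimeoutDuration("10 minutes")

deduped = (streamDataset
    .groupBy("eventstring")
    .applyInPandasWithState(
        emit_first_only,
        outputStructType=streamDataset.schema,  # pass rows through unchanged
        stateStructType="seen boolean",
        outputMode="append",
        timeoutConf=GroupStateTimeout.ProcessingTimeTimeout))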

Talend Component with multiple inputs and unrelated outputs

Using Talend Open Studio, I am building a data-processing component, and I'd appreciate your advice on how to make the following possible (a) in a single component and (b) without a dirty workaround. Thanks.
Regarding part (a):
I have two different inputs:
One input (with exactly one row) defines some kind of metadata for my processing.
One input (with 1...n rows) defines the core data to process.
Currently, I solved this first requirement using two components and passing my metadata to the second component via the globalMap. But it would be nice if I could integrate both connections into one component.
Regarding part (b): After I have read all my input rows, I need to process them all at once. So far, so easy - I could use the end section. My problem comes after that processing: I need to create a number of output rows for a single output connection. The problem is that output rows can only be created in the main part, and there I don't know when the last row has been read...
Currently, I solved this by counting the input rows in advance and then, once that number is reached, creating the output. But this seems like a really dirty workaround to me, so maybe someone has a solution for that, too?
Thank you for any useful tips!

Talend: how can we estimate the tAggregateSortedRow record count parameter value

We are trying out Talend, and we wanted to aggregate some sorted data on a few keys.
Simple enough, but when we try to use tAggregateSortedRow, it asks for the exact number of rows to be specified.
I am not sure how anyone can input this on the fly. Doesn't this value change for every run? Am I missing something? Surely they can't expect us to know the total number of records before we run the job.
This has to do with the way the Talend component tAggregateSortedRow is programmed. To avoid it omitting data, you need to provide the record count. There are a few users who asked the same question:
https://www.talendforge.org/forum/viewtopic.php?id=50094
https://www.talendforge.org/forum/viewtopic.php?id=54231
https://www.talendforge.org/forum/viewtopic.php?id=7641
which I found simply by using Google.
Anyway, if you need to do sorting and aggregating, consider using the components tSortRow and tAggregateRow separately. It should work fine.

Tableau performance

I have a problem with a dashboard in Tableau. The dashboard contains many worksheets, and all the columns in the report are calculated fields. The problem is that the dashboard takes a very long time to build. The report contains approximately 2 million rows and takes about 5 minutes to generate.
What are the possible solutions in this case?
Maybe I can somehow adjust the page display so that not all the records are shown at once?
To reduce the calculation time, try to exclude data you don't need with a data source filter in Tableau. You can also hide or delete unused calculated fields. Another thing you can do is remove sheets that are not used.
Here's a link: https://www.tableau.com/about/blog/2016/1/5-tips-make-your-dashboards-more-performant-48574
Steps to follow to reduce calculation time:
Extract the data: keep the connection option as Extract instead of Live, and replace the data source with the extract.
Use a user filter so that Tableau displays only that particular user's data; this reduces calculation time.
I hope this will help to solve your problems.
I have one more idea to resolve this issue.
1) When the dashboard loads for the first time, use a dashboard action filter so that the first load excludes the data in your sheets.
Dashboard Menu -> Actions -> Add Action -> select the sheet and the Exclude option.
2) Switch the data source from Live to Extract (select the Extract radio button).
3) Use a user filter.
I am following the other answers (use an extract, dashboard action filters, ...) and I want to add one point:
Drag every field used by any worksheet on the dashboard onto "Detail" of every worksheet you are using on the dashboard. Tableau then loads all the needed data while loading the first worksheet and can reuse this data for the other sheets.
I.e., if a dashboard contains three worksheets (A, B, C), you drag every field used by A onto "Detail" of B and C, every field used by B onto "Detail" of A and C, and every field used by C onto "Detail" of A and B.
We also had a similar issue with 150 million rows, but I want to check whether you are doing the following steps. This may help you; it goes back to the fundamentals of Tableau reporting.
1/ Try to make sure your data set is in star schema format. This will help a lot in the report.
2/ Try to build the tables and views in the DB so that they contain only the columns used in Tableau. Any extra columns in the tables add to the performance issue.
3/ Make sure indexing is done properly on all the fields that are joined.
4/ In my experience, the dashboard adds extra performance lag, so try to get as much performance tuning done on the individual sheets as possible before even going to the dashboard.
5/ If required, try to use materialized views.
Hope this helps.
Try to capture performance metrics using the performance recorder option in Tableau.
Check the underlying DB tables and the joins present at the data source layer.
Try using optimized sets and parameters as required, and get rid of less relevant filters.
Try using data extracts with a scheduled refresh, along with a data source filter limited to the business years you need.