Row Compare in PCollections Apache Beam

I have to parse a large CSV file row by row, and each element of a row needs to be compared against the corresponding element of the previous row. What is the best way of doing this with Beam? I would prefer to use Java code.
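Since a PCollection has no inherent order, "previous row" has to be made explicit in the data, for example by numbering the rows and then joining each row with the one before it. Below is a minimal Java sketch of that idea, assuming the rows already carry their index (the numbering itself is a separate problem, see the record-number question further down); the class name, transform names, and compareFields helper are placeholders only.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class AdjacentRows {
  // Input: (rowIndex, csvLine) pairs. Output: one record per pair of consecutive rows.
  public static PCollection<String> compareWithPrevious(PCollection<KV<Long, String>> numberedRows) {
    return numberedRows
        // Emit each row twice: under its own index and under the next index,
        // so that row i and row i-1 end up grouped under the same key.
        .apply("EmitForSelfAndNext", ParDo.of(new DoFn<KV<Long, String>, KV<Long, KV<Long, String>>>() {
          @ProcessElement
          public void process(ProcessContext c) {
            long i = c.element().getKey();
            String row = c.element().getValue();
            c.output(KV.of(i, KV.of(i, row)));        // current row for key i
            c.output(KV.of(i + 1, KV.of(i, row)));    // previous row for key i + 1
          }
        }))
        .apply(GroupByKey.<Long, KV<Long, String>>create())
        // Each group now holds at most two rows: the one at this index and the one before it.
        .apply("ComparePair", ParDo.of(new DoFn<KV<Long, Iterable<KV<Long, String>>>, String>() {
          @ProcessElement
          public void process(ProcessContext c) {
            long index = c.element().getKey();
            String current = null;
            String previous = null;
            for (KV<Long, String> tagged : c.element().getValue()) {
              if (tagged.getKey() == index) {
                current = tagged.getValue();
              } else {
                previous = tagged.getValue();
              }
            }
            if (current != null && previous != null) {
              // compareFields stands in for the element-by-element comparison you need.
              c.output(compareFields(previous.split(","), current.split(",")));
            }
          }
        }));
  }

  private static String compareFields(String[] previous, String[] current) {
    // Placeholder: compare corresponding elements and report whatever the job requires.
    return String.join(",", current);
  }
}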

Related

Apache Druid Appending Segment without dropping or summing it

I have three JSON files with the same timestamp but different values to upload to Druid. I want to upload them separately with the same segment granularity. However, Druid drops the existing segment and uploads the new one.
I don't want to use appendToExisting: true because it sums the values of matching rows. That is the situation I want to avoid (I may be adding the same file again in the future).
Is there a way to add new data to a specific segment without dropping or summing it?

How to get the former few rows of the current row in a Structured Streaming DataFrame

I want to get the several rows preceding the current row; however, first, take, and rows between are not supported on a Structured Streaming DataFrame.
Using groupByKey and flatMapGroupsWithState can achieve the goal.
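A minimal sketch of that idea in the Java API, assuming the stream has a grouping column "key" and a value column "value" (both placeholder names); the state remembers the last row seen per key so each new row can be emitted together with the one before it, and a list kept in state would extend this to the last few rows.

import java.util.ArrayList;
import org.apache.spark.api.java.function.FlatMapGroupsWithStateFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.GroupStateTimeout;
import org.apache.spark.sql.streaming.OutputMode;

public class PreviousRowExample {
  static Dataset<String> withPreviousRow(Dataset<Row> stream) {
    // State per key: the last value seen. Each incoming row is emitted with it,
    // then becomes the new "previous" value.
    FlatMapGroupsWithStateFunction<String, Row, String, String> joinWithPrevious =
        (key, rows, state) -> {
          ArrayList<String> output = new ArrayList<>();
          String previous = state.exists() ? state.get() : null;
          while (rows.hasNext()) {
            String current = String.valueOf(rows.next().getAs("value"));
            output.add("key=" + key + ", current=" + current + ", previous=" + previous);
            previous = current;
          }
          if (previous != null) {
            state.update(previous);
          }
          return output.iterator();
        };

    return stream
        .groupByKey((MapFunction<Row, String>) r -> String.valueOf(r.getAs("key")), Encoders.STRING())
        .flatMapGroupsWithState(
            joinWithPrevious,
            OutputMode.Append(),
            Encoders.STRING(),   // state encoder
            Encoders.STRING(),   // output encoder
            GroupStateTimeout.NoTimeout());
  }
}

Note that the order of rows within a micro-batch is not guaranteed, so in practice the rows usually need a sortable column to restore their original order.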

How to add record numbers to TextIO file sources in Apache Beam or Dataflow

I am using Dataflow (and now Beam) to process legacy text files to replicate the transformations of an existing ETL tool. The current process adds a record number (the record number for each row within each file) and the filename. The reason they want to keep this additional info is so that they can tell which file and record offset the source data came from.
I want to get to a point where I have a PCollection that contains the file record number and the filename as additional fields in the value, or as part of the key.
I've seen a different article where the filename can be populated into the resulting PCollection; however, I do not have a solution for adding the record numbers per row. Currently the only way I can do it is to pre-process the files before I start the Dataflow process, which is a shame since I would want Dataflow/Beam to do it all.
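One workaround, sketched below with a placeholder file pattern and class name: match the files with FileIO instead of TextIO and read each file inside a DoFn, which keeps the filename and a per-file line counter together. The trade-off is that each file is read by a single worker rather than being split.

import java.io.IOException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class NumberedLines {
  // Output: (filename, (recordNumber, line)) for every line of every matched file.
  static PCollection<KV<String, KV<Long, String>>> read(Pipeline pipeline, String filePattern) {
    return pipeline
        .apply(FileIO.match().filepattern(filePattern))
        .apply(FileIO.readMatches())
        .apply("NumberLinesPerFile", ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, KV<Long, String>>>() {
          @ProcessElement
          public void process(ProcessContext c) throws IOException {
            FileIO.ReadableFile file = c.element();
            String filename = file.getMetadata().resourceId().toString();
            long recordNumber = 0;
            // The whole file is read here, so the counter follows the file's own order;
            // the trade-off is that a single file is not split across workers.
            for (String line : file.readFullyAsUTF8String().split("\r?\n", -1)) {
              c.output(KV.of(filename, KV.of(recordNumber++, line)));
            }
          }
        }));
  }
}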

Working with huge CSV files in Tableau

I have a large CSV file (1,000 rows x 70,000 columns) which I want to union with 2 smaller CSV files (since these CSV files will be updated in the future). In Tableau, working with such a large CSV file results in very long processing times and sometimes causes Tableau to stop responding. I would like to know what the better ways of dealing with such large CSV files are, e.g. splitting the data, converting the CSV to another file type, connecting to a server, etc. Please let me know.
The first thing you should ensure is that you are accessing the file locally and not over a network. Sometimes it is minor, but in some cases that can cause a major slowdown in Tableau reading the file.
Beyond that, your file is pretty wide and should be normalized somewhat, so that you get more rows and fewer columns. Tableau will most likely read it in faster because it has fewer columns to analyze (data types, etc.).
If you don't know how to normalize the CSV file, you can use a tool like: http://www.convertcsv.com/pivot-csv.htm
Once you have the file normalized and connected in Tableau, you may want to extract it inside of Tableau for improved performance and file compression.
The problem isn't the size of the csv file: it is the structure. Almost anything trying to digest a csv will expect lots of rows but not many columns. Usually columns define the type of data (eg customer number, transaction value, transaction count, date...) and the rows define instances of the data (all the values for an individual transaction).
Tableau can happily cope with hundreds (maybe even thousands) of columns and millions of rows (I've happily ingested 25-million-row CSVs).
Very wide tables usually emerge because you have a "pivoted" analysis, with one set of data categories along the columns and another along the rows. For effective analysis you need to undo the pivoting (or derive the data from its source unpivoted).
Cycle through the complete table (you can even do this in Excel VBA, despite the number of columns, by reading the CSV directly line by line rather than opening the file) and convert the first row (which is probably column headings) into a new column, so that each new row contains one combination of original row label and column heading plus the relevant data value from the corresponding cell of the CSV file. The new table will be 3 columns wide but will hold all the data from the CSV (assuming the CSV was structured the way I assumed). If I've misunderstood the structure of the file, you have a much bigger problem than I thought!
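For the unpivoting step itself, a small script is often enough. A rough Java sketch under the same assumptions (first row holds the column headings, first column holds the row label, no quoted commas in the data; the file names are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class UnpivotCsv {
  public static void main(String[] args) throws IOException {
    try (BufferedReader in = Files.newBufferedReader(Paths.get("wide.csv"));
         BufferedWriter out = Files.newBufferedWriter(Paths.get("long.csv"))) {
      String[] headers = in.readLine().split(",");   // first row = column headings
      out.write("row_label,column,value");
      out.newLine();
      String line;
      while ((line = in.readLine()) != null) {
        String[] cells = line.split(",", -1);
        // one output row per combination of row label and column heading
        for (int col = 1; col < cells.length && col < headers.length; col++) {
          out.write(cells[0] + "," + headers[col] + "," + cells[col]);
          out.newLine();
        }
      }
    }
  }
}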

Spark Streaming Iterative Algorithm

I want to create a Spark Streaming application coded in Scala.
I want my application to:
read an HDFS text file line by line
analyze every line as a String and, if needed, modify it
keep the state needed for the analysis in some kind of data structure (hashes, probably)
output everything to text files (any kind)
I've had no problems with the first step:
val lines = ssc.textFileStream("hdfs://localhost:9000/path/")
My analysis consists of searching for a match in the hashes for some fields of the analyzed String; that's why I need to maintain state and process the data iteratively.
The data in those hashes is also extracted from the strings analyzed.
What can I do for next steps?
Since you just have to read one HDFS text file line by line, you probably do not need Spark Streaming for that. You can just use Spark.
val lines = sparkContext.textFile("...")
Then you can use mapPartitions to do distributed processing of the whole partitioned file.
val processedLines = lines.mapPartitions { partitionAsIterator =>
  // processPartitionAndReturnNewIterator is your own function: walk the partition's
  // lines, keep per-partition state (e.g. a mutable HashMap), and return an iterator
  // of output records for this partition
  processPartitionAndReturnNewIterator(partitionAsIterator)
}
In that function, you can walk through the lines in the partition, store state in a hashmap, etc., and finally return another iterator of output records corresponding to that partition.
Now, if you want to share state across partitions, then you probably have to do further aggregations like groupByKey() or reduceByKey() on the processedLines dataset.