Ingest a big local JSON file into Druid

This is my first experience with Druid.
I have a local Druid setup on my machine.
Now I'd like to run some query performance tests. My test data is a large local JSON file of 1.2 GB.
The idea was to load it into Druid and run the required SQL queries. The file is parsed and successfully processed (I'm using the Druid web-based UI to submit an ingestion task).
The problem I run into is the datasource size. It doesn't make sense that 1.2 GB of raw JSON data results in a 35 MB datasource. Is there any limitation in a locally running Druid setup? I suspect the test data is only partially processed, but I couldn't find any relevant config to change that. I would appreciate it if someone could shed light on this.
Thanks in advance.

With Druid, 80-90 percent compression is expected. I have seen a 2 GB CSV file reduced to a 200 MB Druid datasource.
Can you query the count to make sure all the data is ingested? Also, please disable the approximate HyperLogLog algorithm to get an exact count. Druid SQL will switch to exact distinct counts if you set "useApproximateCountDistinct" to "false", either through the query context or through broker configuration (see http://druid.io/docs/latest/querying/sql.html).
Also check the logs for exceptions and error messages. If Druid has a problem ingesting a particular JSON record, it skips that record.
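A minimal sketch of that count check, assuming the local quickstart router at http://localhost:8888 and a hypothetical datasource and column named my_datasource / some_column (replace with your own):

```python
import requests

# Druid SQL query with approximate distinct counts disabled via the query context.
query = {
    "query": (
        "SELECT COUNT(*) AS row_count, "
        "COUNT(DISTINCT some_column) AS distinct_count "
        "FROM my_datasource"
    ),
    "context": {
        # Force exact counts instead of the HyperLogLog approximation.
        "useApproximateCountDistinct": False
    },
}

resp = requests.post("http://localhost:8888/druid/v2/sql", json=query)
resp.raise_for_status()
print(resp.json())
```

If row_count matches the number of records in the source file, the 35 MB datasource is just compression at work rather than partial ingestion.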

Related

Insight into Redshift Spectrum query error

I'm trying to use Redshift Spectrum to query data in S3. The data has been crawled by Glue, I've run successful data profile jobs on the files with DataBrew (so I know Glue has correctly read it), and I can see the correct tables in the query editor after creating the schema. But when I try to run simple queries I get one of two errors: if it's a small file I get "ERROR: Parsed manifest is not a valid JSON object...."; if it's a large file I get "ERROR: Manifest too large Detail:...". I suspect it's looking for, or believes that, the file in the query is a manifest, but I have no idea why or how to address it. I've followed the documentation as rigorously as possible, and I've replicated the process via a screen share with an AWS tech support rep who is also stumped.
Discovered the issue: the error was happening because I had more than one type of file (i.e., files with differing layouts) in the same S3 folder. There may be other ways to solve the problem, but isolating one type of file per S3 folder solved it and allowed Redshift Spectrum to successfully execute queries against my file(s).

How to process and insert millions of MongoDB records into Postgres using Talend Open Studio

I need to process millions of records coming from MongoDB and set up an ETL pipeline to insert that data into a PostgreSQL database. However, with all the methods I've tried, I keep getting an out-of-memory heap space exception. Here's what I've already tried:
I tried connecting to MongoDB using tMongoDBInput and added a tMap to process the records and output them over a connection to PostgreSQL. tMap could not handle it.
I tried loading the data into a JSON file and then reading from the file into PostgreSQL. The data got loaded into the JSON file, but from there on I got the same memory exception.
I tried increasing the RAM for the job in the settings and ran the above two methods again; still no change.
I specifically want to know if there's any way to stream this data or process it in batches to counter the memory issue.
Also, I know that there are some components dealing with BulkDataLoad. Could anyone please confirm whether they would be helpful here, since I want to process the records before inserting, and if so, point me to the right kind of documentation to get that set up.
Thanks in advance!
As you have already tried all those possibilities, the only way I can see to meet this requirement is to break the job down into multiple sub-jobs, or to go with an incremental load based on key columns or date columns, considering this as a one-time activity for now.
Please let me know if it helps.
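Outside of Talend, the "process it in batches" idea from the question looks roughly like the sketch below; the connection strings, collection, table, and column names are made up for illustration:

```python
from pymongo import MongoClient
import psycopg2
from psycopg2.extras import execute_values

BATCH_SIZE = 10_000

mongo = MongoClient("mongodb://localhost:27017")
pg = psycopg2.connect("dbname=target user=etl password=etl host=localhost")

# Stream documents from MongoDB in fixed-size chunks instead of loading them all.
cursor = mongo["source_db"]["events"].find({}, batch_size=BATCH_SIZE)

batch = []
with pg, pg.cursor() as cur:
    for doc in cursor:
        # Process/transform each record here before queueing it for insert.
        batch.append((str(doc["_id"]), doc.get("name"), doc.get("value")))
        if len(batch) >= BATCH_SIZE:
            execute_values(cur, "INSERT INTO events (mongo_id, name, value) VALUES %s", batch)
            batch.clear()
    if batch:  # flush the final partial batch
        execute_values(cur, "INSERT INTO events (mongo_id, name, value) VALUES %s", batch)
```

Because only one batch is held in memory at a time, the heap stays flat no matter how many records the collection contains.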

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates a name for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each of those files. How can I specify the batch size in a job?
Thanks in advance
This is a question I have also posted in the AWS Glue forum; here is a link to that thread: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports a pushdown predicates feature; however, it currently works with partitioned data on S3 only. There is a feature request to support it for JDBC connections, though.
It's not possible to specify the name of the output files. However, it looks like there is an option of renaming the files afterwards (note that renaming on S3 means copying a file from one location to another, so it's a costly and non-atomic operation).
You can't really control the size of the output files. There is an option to control the minimum number of files using coalesce, though. Also, starting from Spark 2.2, it's possible to set a maximum number of records per file via the config spark.sql.files.maxRecordsPerFile.
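For Q3, a rough sketch of those two knobs inside a Glue (PySpark) script; the catalog database, table name, and S3 path are placeholders:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Spark 2.2+: cap the number of records written into any single output file.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 100000)

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table")

# coalesce() reduces the number of partitions, and therefore the number of output files.
df = dyf.toDF().coalesce(4)

df.write.mode("overwrite").parquet("s3://my-bucket/output/")
```

Keep in mind that coalesce trades parallelism for fewer files, so very small values can slow the write down.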

Apache Spark and MongoDB integration using pymongo

I have a problem working with Apache Spark and MongoDB using the PyMongo library. I am processing thousands of records, and for each record I need to read its corresponding data from the database, update certain info, and save it back to the database. Because of these reads and writes, I chose to use PyMongo instead of the Spark-Mongo Connector, which apparently isn't well suited for this task. Unfortunately, however, when performing writes, MongoDB always reports the writes as successful, but when I check the database some updates were not performed. After debugging for over a week, I realized that by setting the server to a single-core processor, all writes were successful and written to the database, but the application has become tremendously slow.
I would like to know if anyone knows how to solve this issue. Thanks in advance.
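For context, the read-update-write pattern described above looks roughly like the sketch below when done with PyMongo from inside a Spark job (one client per partition rather than per record); the connection string, database, and field names are placeholders, and this only illustrates the pattern, not a fix for the lost writes:

```python
from pyspark.sql import SparkSession
from pymongo import MongoClient

spark = SparkSession.builder.appName("pymongo-updates").getOrCreate()
record_ids = spark.sparkContext.parallelize(["id-1", "id-2", "id-3"])

def update_partition(ids):
    # Create the client inside the partition so it is never pickled to the executors.
    client = MongoClient("mongodb://localhost:27017")
    coll = client["mydb"]["records"]
    for record_id in ids:
        doc = coll.find_one({"_id": record_id})
        if doc is None:
            continue
        # Update certain info and save it back to the database.
        coll.update_one({"_id": record_id}, {"$set": {"processed": True}})
    client.close()

record_ids.foreachPartition(update_partition)
```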

Adding user information to centralized logging with ELK stack

I am using the ELK stack (first project) to centralize the logs of a server and visualize some real-time statistics with Kibana. The logs are stored in an ES index, and I have another index with user information (IP, name, demographics). I am trying to:
Join the user information with the server logs, matching on IPs. I want to include this information in the Kibana dashboard (e.g. to show the usernames of connected users in real time).
Create new indexes with filtered and processed information (e.g. users that have visited a certain URL more than 3 times).
What is the best design to solve these problems (e.g. include the username in the Logstash stage through a filter, run scheduled jobs, ...)? If the processing task (2) gets more complex, would it be better to use MongoDB instead?
Thank you!
I recently wanted to cross-reference some log data with user data (containing IPs among other fields) and just used Elasticsearch's bulk import API. This meant extracting the data from an RDBMS, converting it to JSON, and outputting a flat file that adhered to the format expected by the bulk import API (basically prefixing each document with a row that describes the index and type).
That should work for an initial import; after that, your delta could be handled using triggers in whatever stores your user data. You might simply write to a flat file and process it like the other logs. Another option might be the JDBC River.
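A small sketch of that bulk file format, with made-up index and field names: each document is preceded by an action line naming the target index (and, on older Elasticsearch versions, the type):

```python
import json

users = [
    {"ip": "10.0.0.1", "name": "alice", "age": 31},
    {"ip": "10.0.0.2", "name": "bob", "age": 45},
]

# Write newline-delimited JSON in the bulk API format: action line, then document.
with open("users_bulk.ndjson", "w") as out:
    for user in users:
        out.write(json.dumps({"index": {"_index": "users"}}) + "\n")
        out.write(json.dumps(user) + "\n")

# The file can then be posted to the _bulk endpoint, e.g.:
#   curl -H "Content-Type: application/x-ndjson" -XPOST \
#        localhost:9200/_bulk --data-binary @users_bulk.ndjson
```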
I am also interested to know where the data is stored originally (a DB, pushed straight from a server, ...). For what it's worth, I initially used the ELK stack to pull data back from a DB server using a batch file utilizing BCP (running as a scheduled task), storing it to a flat file, monitoring the file with Logstash, and manipulating the data inside the LS config (grok filter). You may also consider a simple console/web application to manipulate the data before grokking it with Logstash.
If possible, I would attempt to pull your data via a SQL Server SPROC/BCP command and match the returned, complete message within Logstash. You can then store the information in a single index.
I hope this helps. I am by no means an expert, but I will be happy to answer more questions if you can get a little more specific about your current data storage, namely how the data is entering Logstash. RabbitMQ is another valuable tool to take a look at for your input source.