MongoDb Hadoop connector using Pig support connection issue

ERROR 2118: Unable to connect to collection.Unable to connect to collection.
My Pig Code:
REGISTER /home/auto/ykale/jars/mongo/mongo-hadoop-pig_cdh3u3-1.1.0.jar
REGISTER /home/auto/ykale/jars/mongo/mongo-hadoop-core_cdh3u3-1.1.0.jar
REGISTER /home/auto/ykale/jars/mongo/mongo-hadoop-streaming_cdh3u3-1.1.0.jar
REGISTER /home/auto/ykale/jars/mongo/com.mongodb_2.6.5.1.jar
--name1 = load 'mongodb://hfdvmprmongodb1.vm.itg.corp.us.shldcorp.com:27017/member_pricing.testData' USING com.mongodb.hadoop.pig.MongoLoader;
name1 = load 'mongodb://ykale:newpassword4@hfdvmprmongodb1.vm.itg.corp.us.shldcorp.com:27017/member_pricing.testData' USING com.mongodb.hadoop.pig.MongoLoader;
STORE name1 into '/user/ykale/mongo_dump/file1';
When I instead use the other load command (the one commented out in the code above), I get the following output, which shows that 0 MapReduce jobs were launched:
2013-12-11 05:16:24,769 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-12-11 05:16:24,769 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 0 map reduce job(s) failed!
2013-12-11 05:16:24,770 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2-cdh3u3 0.8.1-cdh3u3 ykale 2013-12-11 05:16:22 2013-12-11 05:16:24 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
Input(s):
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
null
2013-12-11 05:16:24,770 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
I am executing the Pig script with -Dmongo.input.split.create_input_splits=false.
Any help is appreciated.
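One thing that may be worth ruling out is the shape of the connection string itself. The standard MongoDB connection string format separates the credentials from the host with `@` (`mongodb://user:password@host:port/db.collection`), and reserved characters inside the user name or password need to be percent-encoded. The Python sketch below builds such a URI. The helper name `build_mongo_uri`, the host, and the credentials are hypothetical placeholders, and it is an assumption that the connector parses the URI the same way the MongoDB driver does.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, port, database, collection):
    """Build a mongodb:// URI with percent-encoded credentials.

    Hypothetical helper for illustration; not part of mongo-hadoop.
    """
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    return f"mongodb://{credentials}@{host}:{port}/{database}.{collection}"


# Placeholder values, not the real host or credentials.
uri = build_mongo_uri("someuser", "p@ss#word", "mongo.example.com", 27017,
                      "member_pricing", "testData")
print(uri)
```

The resulting string can then be pasted into the `load` statement in place of the hand-written URI.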

Hapi Fhir importing Snomed-CT

I am new to HAPI FHIR, and to FHIR in general. I'm trying to import SNOMED CT (US version) into HAPI FHIR using the command-line client, as follows.
To run the server:
java -jar hapi-fhir-cli.jar run-server
To upload SNOMED CT:
java -jar hapi-fhir-cli.jar upload-terminology -d SnomedCT_RF2Release_INT_20160131.zip -t http://localhost:8080/baseDstu3 -u http://snomed.info/sct
The code system is uploaded successfully:
{
"resourceType": "Bundle",
"id": "948f4c4b-2e28-475b-a629-3d5122d5e103",
"meta": {
"lastUpdated": "2017-09-11T11:47:56.941+02:00"
},
"type": "searchset",
"total": 1,
"link": [
{
"relation": "self",
"url": "http://localhost:8080/baseDstu3/CodeSystem?_pretty=true"
}
],
"entry": [
{
"fullUrl": "http://localhost:8080/baseDstu3/CodeSystem/1",
"resource": {
"resourceType": "CodeSystem",
"id": "1",
"meta": {
"versionId": "1",
"lastUpdated": "2017-09-11T10:25:43.282+02:00"
},
"url": "http://snomed.info/sct",
"content": "not-present"
},
"search": {
"mode": "match"
}
}
]
}
But I can't find the codes. A search for ValueSet resources returns nothing:
{
"resourceType": "Bundle",
"id": "37fff235-1229-4491-a3ab-9bdba2333d57",
"meta": {
"lastUpdated": "2017-09-11T11:49:35.553+02:00"
},
"type": "searchset",
"total": 0,
"link": [
{
"relation": "self",
"url": "http://localhost:8080/baseDstu3/ValueSet?_pretty=true"
}
]
}
This is extracted from my logs:
/hapi-fhir-cli/hapi-fhir-cli-app/target$ java -jar hapi-fhir-cli.jar run-server
------------------------------------------------------------
🔥 HAPI FHIR 3.0.0-SNAPSHOT - Command Line Tool
------------------------------------------------------------
Max configured JVM memory (Xmx): 1.7GB
Detected Java version: 1.8.0_144
------------------------------------------------------------
10:20:49 INFO ca.uhn.fhir.context.FhirContext - Creating new FHIR context for FHIR version [DSTU3]
10:20:49 INFO ca.uhn.fhir.cli.RunServerCommand - Preparing HAPI FHIR JPA server on port 8080
10:20:51 INFO ca.uhn.fhir.cli.RunServerCommand - Starting HAPI FHIR JPA server in DSTU3 mode
10:21:22 INFO ca.uhn.fhir.context.FhirContext - Creating new FHIR context for FHIR version [DSTU3]
10:21:46 WARN o.h.e.jdbc.spi.SqlExceptionHelper - SQL Warning Code: 10000, SQLState: 01J01
10:21:46 WARN o.h.e.jdbc.spi.SqlExceptionHelper - Database 'directory:target/jpaserver_derby_files' not created, connection made to existing database instead.
10:21:46 WARN o.h.e.jdbc.spi.SqlExceptionHelper - SQL Warning Code: 10000, SQLState: 01J01
10:21:46 WARN o.h.e.jdbc.spi.SqlExceptionHelper - Database 'directory:target/jpaserver_derby_files' not created, connection made to existing database instead.
10:21:46 WARN o.h.e.jdbc.spi.SqlExceptionHelper - SQL Warning Code: 10000, SQLState: 01J01
10:21:46 WARN o.h.e.jdbc.spi.SqlExceptionHelper - Database 'directory:target/jpaserver_derby_files' not created, connection made to existing database instead.
10:21:46 WARN o.h.e.jdbc.spi.SqlExceptionHelper - SQL Warning Code: 10000, SQLState: 01J01
10:21:46 WARN o.h.e.jdbc.spi.SqlExceptionHelper - Database 'directory:target/jpaserver_derby_files' not created, connection made to existing database instead.
10:21:47 INFO ca.uhn.fhir.cli.RunServerCommand - Server started on port 8080
10:21:47 INFO ca.uhn.fhir.cli.RunServerCommand - Web Testing UI : http://localhost:8080/
10:21:47 INFO ca.uhn.fhir.cli.RunServerCommand - Server Base URL: http://localhost:8080/baseDstu3/
10:22:09 INFO ca.uhn.fhir.context.FhirContext - Creating new FHIR context for FHIR version [DSTU3]
10:22:10 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Initial query result returned in 95ms for query 93e6a047-b93f-4e6c-8ee9-03b51d08bd45
10:22:10 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Query found 0 matches in 96ms for query 93e6a047-b93f-4e6c-8ee9-03b51d08bd45
10:22:10 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - The include pids are empty
10:22:10 INFO c.u.f.j.d.d.SearchParamRegistryDstu3 - Refreshed search parameter cache in 108ms
10:22:10 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Initial query result returned in 67ms for query 60cc4f7c-887c-4fe6-9ca5-32f24017f91a
10:22:10 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Query found 0 matches in 67ms for query 60cc4f7c-887c-4fe6-9ca5-32f24017f91a
10:22:10 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - The include pids are empty
10:22:10 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Initial query result returned in 2ms for query a1ff7c92-b273-4ef4-8898-870c0377a161
10:22:10 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Query found 0 matches in 3ms for query a1ff7c92-b273-4ef4-8898-870c0377a161
10:22:10 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - The include pids are empty
10:22:10 INFO c.uhn.fhir.rest.server.RestfulServer - Initializing HAPI FHIR restful server running in DSTU3 mode
10:22:10 INFO c.uhn.fhir.rest.server.RestfulServer - Added 117 resource provider(s). Total 117
10:22:10 INFO c.uhn.fhir.rest.server.RestfulServer - Scanning type for RESTful methods: class ca.uhn.fhir.jpa.rp.dstu3.AccountResourceProvider
....
10:22:11 INFO c.uhn.fhir.rest.server.RestfulServer - Scanning type for RESTful methods: class ca.uhn.fhir.jpa.rp.dstu3.TestScriptResourceProvider
10:22:11 INFO c.uhn.fhir.rest.server.RestfulServer - Scanning type for RESTful methods: class ca.uhn.fhir.jpa.rp.dstu3.ValueSetResourceProvider
10:22:11 INFO c.uhn.fhir.rest.server.RestfulServer - Scanning type for RESTful methods: class ca.uhn.fhir.jpa.rp.dstu3.VisionPrescriptionResourceProvider
10:22:11 INFO c.uhn.fhir.rest.server.RestfulServer - Added 2 plain provider(s). Total 2
10:22:11 INFO c.uhn.fhir.rest.server.RestfulServer - Scanning type for RESTful methods: class ca.uhn.fhir.jpa.provider.dstu3.JpaSystemProviderDstu3
10:22:11 INFO c.uhn.fhir.rest.server.RestfulServer - Scanning type for RESTful methods: class ca.uhn.fhir.jpa.provider.dstu3.TerminologyUploaderProviderDstu3
10:22:11 INFO c.uhn.fhir.rest.server.RestfulServer - Scanning type for RESTful methods: class org.hl7.fhir.dstu3.hapi.rest.server.ServerProfileProvider
10:22:11 INFO c.uhn.fhir.rest.server.RestfulServer - Scanning type for RESTful methods: class ca.uhn.fhir.jpa.provider.dstu3.JpaConformanceProviderDstu3
10:22:11 INFO c.uhn.fhir.rest.server.RestfulServer - Scanning type for RESTful methods: class ca.uhn.fhir.rest.server.PageProvider
10:22:11 INFO c.uhn.fhir.rest.server.RestfulServer - A FHIR has been lit on this server
10:22:12 INFO c.u.f.n.BaseThymeleafNarrativeGenerator - Initializing narrative generator
10:22:13 INFO ca.uhn.fhir.to.Controller - Request(GET //localhost:8080/)#70976daf
10:22:20 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Initial query result returned in 3ms for query fad4433e-825c-41ae-b22e-9bbb0a1624a0
10:22:20 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Query found 0 matches in 3ms for query fad4433e-825c-41ae-b22e-9bbb0a1624a0
10:22:20 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - The include pids are empty
...
10:23:50 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Initial query result returned in 3ms for query f0b25957-c166-4b23-be6e-0d06274da565
10:23:50 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Query found 0 matches in 3ms for query f0b25957-c166-4b23-be6e-0d06274da565
10:23:50 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - The include pids are empty
10:23:52 INFO c.u.f.jpa.term.TerminologyLoaderSvc - Beginning SNOMED CT processing
10:23:59 INFO c.u.f.jpa.term.TerminologyLoaderSvc - Processing file SnomedCT_RF2Release_INT_20160131/Full/Terminology/sct2_Concept_Full_INT_20160131.txt
10:23:59 INFO c.u.f.jpa.term.TerminologyLoaderSvc - * Processed 1 records in SnomedCT_RF2Release_INT_20160131/Full/Terminology/sct2_Concept_Full_INT_20160131.txt
10:23:59 INFO c.u.f.jpa.term.TerminologyLoaderSvc - * Processed 100000 records in SnomedCT_RF2Release_INT_20160131/Full/Terminology/sct2_Concept_Full_INT_20160131.txt
10:24:00 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Initial query result returned in 2ms for query 6393e886-2878-4c3a-b50a-a19139c01dd4
10:24:00 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Query found 0 matches in 9ms for query 6393e886-2878-4c3a-b50a-a19139c01dd4
10:24:00 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - The include pids are empty
...
10:25:29 INFO c.u.f.jpa.term.TerminologyLoaderSvc - * Processed 4700000 records in SnomedCT_RF2Release_INT_20160131/Full/Terminology/sct2_Relationship_Full_INT_20160131.txt
10:25:30 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Initial query result returned in 2ms for query df9f1505-e381-4eda-a04b-934293c54721
10:25:30 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Query found 0 matches in 2ms for query df9f1505-e381-4eda-a04b-934293c54721
10:25:30 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - The include pids are empty
10:25:40 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Initial query result returned in 2ms for query 1961cfd8-b2b0-4626-a397-6a7b925e4547
10:25:40 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Query found 0 matches in 2ms for query 1961cfd8-b2b0-4626-a397-6a7b925e4547
10:25:40 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - The include pids are empty
10:25:41 INFO c.u.f.jpa.term.TerminologyLoaderSvc - Looking for root codes
10:25:41 INFO c.u.f.jpa.term.TerminologyLoaderSvc - Done loading SNOMED CT files - 3 root codes, 319446 total codes
10:25:41 INFO c.u.f.jpa.term.TerminologyLoaderSvc - * Scanning for circular refs - have scanned 0 / 3 codes (0.0%)
...
10:25:43 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Initial query result returned in 81ms for query 9aa7c541-8dfa-4176-8621-046c7ba886e4
10:25:43 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Query found 0 matches in 82ms for query 9aa7c541-8dfa-4176-8621-046c7ba886e4
10:25:46 INFO ca.uhn.fhir.jpa.dao.BaseHapiFhirDao - Saving history entry CodeSystem/1/_history/1
10:25:46 INFO c.u.f.j.dao.BaseHapiFhirResourceDao - Successfully created resource "CodeSystem/1/_history/1" in 3.314ms
10:25:46 INFO c.u.f.j.term.HapiTerminologySvcDstu3 - CodeSystem resource has ID: CodeSystem/1
10:25:46 INFO c.u.f.j.term.BaseHapiTerminologySvc - Storing code system
10:25:46 INFO c.u.f.j.term.BaseHapiTerminologySvc - Deleting old code system versions
10:25:46 INFO c.u.f.j.term.BaseHapiTerminologySvc - Flushing...
10:25:46 INFO c.u.f.j.term.BaseHapiTerminologySvc - Done flushing
10:25:47 INFO c.u.f.j.term.BaseHapiTerminologySvc - Validating all codes in CodeSystem for storage (this can take some time for large sets)
10:25:47 INFO c.u.f.j.term.BaseHapiTerminologySvc - Have validated 1000 concepts
...
10:25:49 INFO c.u.f.j.term.BaseHapiTerminologySvc - Have validated 319000 concepts
10:25:49 INFO c.u.f.j.term.BaseHapiTerminologySvc - Saving version containing 319446 concepts
10:25:49 INFO c.u.f.j.term.BaseHapiTerminologySvc - Saving code system
10:25:50 INFO c.u.f.j.term.BaseHapiTerminologySvc - Setting codesystemversion on 319446 concepts...
10:25:50 INFO c.u.f.j.term.BaseHapiTerminologySvc - Saving 319446 concepts...
10:25:50 INFO c.u.f.j.term.BaseHapiTerminologySvc - Have processed 1/319446 concepts (0%)
10:25:56 INFO c.u.f.j.term.BaseHapiTerminologySvc - Have processed 10000/319446 concepts (3%)
...
10:25:56 INFO c.u.f.j.term.BaseHapiTerminologySvc - Have processed 310000/319446 concepts (97%)
10:25:56 INFO c.u.f.j.term.BaseHapiTerminologySvc - Done saving concepts, flushing to database
10:25:57 INFO c.u.f.j.term.BaseHapiTerminologySvc - Done deleting old code system versions
10:25:57 INFO c.u.f.j.term.BaseHapiTerminologySvc - Note that some concept saving was deferred - still have 317446 concepts and 472410 relationships
10:25:57 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Initial query result returned in 7003ms for query a94cc0f2-b99e-4cce-9ba1-16b0b3ce70bb
10:25:57 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Query found 0 matches in 7003ms for query a94cc0f2-b99e-4cce-9ba1-16b0b3ce70bb
10:25:57 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - The include pids are empty
10:26:00 INFO c.u.f.j.term.BaseHapiTerminologySvc - Saving 2000 deferred concepts...
10:26:04 INFO c.u.f.j.term.BaseHapiTerminologySvc - Saved 2000 deferred concepts (315645 codes remain and 472410 relationships remain) in 3936ms (1ms / code)
...
10:26:25 INFO c.u.f.j.term.BaseHapiTerminologySvc - Saving 2000 deferred concepts...
10:26:26 INFO c.u.f.j.term.BaseHapiTerminologySvc - Saved 2000 deferred concepts (305504 codes remain and 472410 relationships remain) in 852ms (0ms / code)
10:26:27 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Initial query result returned in 1ms for query c71cfc5b-02a4-4f16-a47f-15a947611b71
10:26:27 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - Query found 0 matches in 2ms for query c71cfc5b-02a4-4f16-a47f-15a947611b71
10:26:27 INFO ca.uhn.fhir.jpa.dao.SearchBuilder - The include pids are empty
UPDATE
OK, I can see those concepts in the database tables TRM_CONCEPT and TRM_CONCEPT_LINK. But is there a way to query them?
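For querying individual concepts, the FHIR specification defines a `$lookup` operation on CodeSystem that takes `system` and `code` parameters. The Python sketch below only builds the request URL, so it can be tried in a browser or with any HTTP client. The helper name `lookup_url` is hypothetical and the concept code is just an example value. Whether this 3.0.0-SNAPSHOT build answers `$lookup` from the uploaded SNOMED CT content, and whether the deferred concepts must finish saving first, are assumptions that have not been verified here.

```python
from urllib.parse import urlencode


def lookup_url(base_url, system, code):
    """Return the URL for a FHIR CodeSystem/$lookup request."""
    query = urlencode({"system": system, "code": code})
    return f"{base_url.rstrip('/')}/CodeSystem/$lookup?{query}"


# Base URL taken from the server log above; the concept code is only an example.
url = lookup_url("http://localhost:8080/baseDstu3/",
                 "http://snomed.info/sct", "73211009")
print(url)
```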

PySpark on Dataproc stops with SocketTimeoutException

We are trying to run a Spark job on a Dataproc cluster using PySpark 2.2.0, but the job stops after a seemingly random amount of time with the following error message:
17/07/25 00:52:48 ERROR org.apache.spark.api.python.PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:702)
The error can take anywhere from a couple of minutes to 3 hours to appear. In my experience, the Spark job typically runs for about 30 minutes to 1 hour before hitting it.
Once the Spark job hits the error, it just stops and outputs nothing more, no matter how long I wait. In the YARN ResourceManager, the application status is still "RUNNING" and I must press Ctrl+C to terminate the program. At that point, the application is labelled "FINISHED".
I run the Spark job from the master node's console with the command /path/to/spark/bin/spark-submit --jars /path/to/jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.0.jar spark_job.py. The JAR file is necessary because the Spark job streams messages from Kafka (running on the same cluster as the Spark job) and pushes some messages back to a different topic on the same Kafka.
I've already looked at some other answers on this site, and they have been somewhat helpful, but we haven't been able to find where in the logs it states what caused the executors to die. So far, I've monitored the nodes during the task through the YARN ResourceManager and gone through the logs in the /var/logs/hadoop-yarn directory on every node. The only "clue" I could find was org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM, which is the only line written to the dead executor's logs.
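Since the executor log itself only records the signal, one way to narrow the search is to pull out the lines written just before each RECEIVED SIGNAL TERM and then look at other logs around the same timestamp. The Python sketch below does the first part. The function name `lines_before_match` and the sample lines are made up for illustration, and it is only an assumption that anything useful is logged before the signal.

```python
from collections import deque


def lines_before_match(lines, needle="RECEIVED SIGNAL TERM", context=5):
    """Yield (line_number, preceding_lines) for each line containing needle."""
    recent = deque(maxlen=context)
    for number, line in enumerate(lines, start=1):
        if needle in line:
            yield number, list(recent)
        recent.append(line.rstrip("\n"))


# Made-up in-memory sample; in practice `lines` would be an open log file.
sample = [
    "INFO line A\n",
    "INFO line B\n",
    "ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM\n",
]
hits = list(lines_before_match(sample, context=2))
print(hits)
```

Because the function accepts any iterable of lines, it can be fed an open file object for each log file in turn.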
As a last-ditch effort, we increased the cluster's memory in the hope that the issue would just go away, but it hasn't. Originally, the cluster had 1 master and 2 workers with 4 vCPUs and 15 GB of memory. We created a new Dataproc cluster, this time with 1 master and 3 workers, with each worker having 8 vCPUs and 52 GB of memory (the master has the same specs as before).
What we would like to know is:
1. Where/how can I see the exception that is causing the executors to be terminated?
2. Is this an issue with how Spark is configured?
3. Dataproc image version is "preview". Could that possibly be the cause of the error?
and ultimately,
4. How do we resolve this issue? What other steps can we take?
This Spark job needs to continuously stream from Kafka for an indefinite amount of time so we would like this error to be fixed rather than prolonging the time it takes for the error to occur.
Here are some screenshots from the YARN ResourceManager showing what we are seeing (both taken before the Spark job stopped with the error):
[Screenshot: Cluster Metrics]
[Screenshot: Executor Summary]
And this is the Spark configuration file located in /path/to/spark/conf/spark-defaults.conf (did not change anything from the default setting by Dataproc):
spark.master yarn
spark.submit.deployMode client
spark.yarn.jars=local:/usr/lib/spark/jars/*
spark.eventLog.enabled true
spark.eventLog.dir hdfs://highmem-m/user/spark/eventlog
# Dynamic allocation on YARN
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.executor.instances 10000
spark.dynamicAllocation.maxExecutors 10000
spark.shuffle.service.enabled true
spark.scheduler.minRegisteredResourcesRatio 0.0
spark.yarn.historyServer.address highmem-m:18080
spark.history.fs.logDirectory hdfs://highmem-m/user/spark/eventlog
spark.executor.cores 2
spark.executor.memory 4655m
spark.yarn.executor.memoryOverhead 465
# Overkill
spark.yarn.am.memory 4655m
spark.yarn.am.memoryOverhead 465
spark.driver.memory 3768m
spark.driver.maxResultSize 1884m
spark.rpc.message.maxSize 512
# Add ALPN for Bigtable
spark.driver.extraJavaOptions
spark.executor.extraJavaOptions
# Disable Parquet metadata caching as its URI re-encoding logic does
# not work for GCS URIs (b/28306549). The net effect of this is that
# Parquet metadata will be read both driver side and executor side.
spark.sql.parquet.cacheMetadata=false
# User-supplied properties.
#Mon Jul 24 23:12:12 UTC 2017
spark.executor.cores=4
spark.executor.memory=18619m
spark.driver.memory=3840m
spark.driver.maxResultSize=1920m
spark.yarn.am.memory=640m
spark.executorEnv.PYTHONHASHSEED=0
I'm not quite sure where the User-supplied properties came from.
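Several keys in the file above appear twice, once in the Dataproc defaults and once under "User-supplied properties". The Python sketch below shows which value would be effective, assuming the file is read like a Java properties file in which the last occurrence of a key wins. That is believed to be how Spark loads it, but treat it as an assumption. The function name `effective_properties` is hypothetical and the parser is deliberately simplified.

```python
import re

# Key ends at the first whitespace, '=' or ':'; the rest is the value.
_LINE = re.compile(r"([^\s=:]+)[\s=:]*(.*)")


def effective_properties(text):
    """Parse spark-defaults.conf-style text; later duplicates overwrite earlier ones."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match:
            props[match.group(1)] = match.group(2).strip()
    return props


# A few lines copied from the configuration above.
conf = """\
spark.executor.cores 2
spark.executor.memory 4655m
spark.yarn.executor.memoryOverhead 465
# User-supplied properties.
spark.executor.cores=4
spark.executor.memory=18619m
"""
props = effective_properties(conf)
print(props)
```

Under that assumption, the executors would run with 4 cores and 18619m of heap, while the memory overhead stays at the earlier value because the user-supplied section does not override it.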
Edit:
Some additional information about the clusters:
I use the zookeeper, kafka, and jupyter initialization action scripts found at https://github.com/GoogleCloudPlatform/dataproc-initialization-actions in the order of zookeeper -> kafka -> jupyter (unfortunately I don't have enough reputation to post more than 2 links at the moment)
Edit 2:
Following @Dennis's insightful questions, we ran the Spark job while paying particular attention to the executors with higher On Heap Storage Memory usage. I noticed that it is always the executors on worker #0 that have significantly higher storage memory usage than the other executors. The stdout files for the executors on worker #0 are always empty. These three lines are repeated many times over in stderr:
17/07/27 16:32:01 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:01 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:01 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:04 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:04 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:04 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:07 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:07 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:07 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:09 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:09 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:09 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:10 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:10 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:10 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:13 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:13 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:13 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:14 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:14 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:14 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:15 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:15 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:15 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:18 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:18 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:18 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
They seem to repeat every 1 to 3 seconds.
As for the stdout and stderr for the other executors from other worker nodes, they are empty.
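The repetition interval can be checked directly from the timestamps in the stderr excerpt above. The Python sketch below computes the gap between consecutive "Verifying properties" lines. The function name `gaps_in_seconds` is hypothetical, and the timestamp layout (`yy/MM/dd HH:mm:ss` at the start of each line) is assumed from the excerpt.

```python
from datetime import datetime


def gaps_in_seconds(log_lines, marker="Verifying properties"):
    """Return the gaps, in seconds, between consecutive lines containing marker."""
    stamps = [
        datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")
        for line in log_lines
        if marker in line
    ]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


# Timestamps copied from the stderr excerpt above.
times = ["16:32:01", "16:32:04", "16:32:07", "16:32:09", "16:32:10",
         "16:32:13", "16:32:14", "16:32:15", "16:32:18"]
lines = [f"17/07/27 {t} INFO kafka.utils.VerifiableProperties: Verifying properties"
         for t in times]
print(gaps_in_seconds(lines))  # prints [3, 3, 2, 1, 3, 1, 1, 3]
```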
Edit 3:
As mentioned in @Dennis's comments, the Kafka topic the Spark job was consuming from had a replication factor of 1. I also found that I had forgotten to add worker #2 to zookeeper.connect in the Kafka config file, and had forgotten to give the Spark consumer that streams messages from Kafka a group ID. I've fixed those (and remade the topic with a replication factor of 3) and observed that the workload now mainly falls on worker #1. Following @Dennis's suggestions, I ran sudo jps after SSH-ing into worker #1 and got the following output:
[Removed this section to save character space; it was only the error messages from a failed call to jmap so it didn't hold any useful information]
Edit 4:
I'm now seeing this in worker #1 executors' stdout files:
2017-07-27 22:16:24
Full thread dump OpenJDK 64-Bit Server VM (25.131-b11 mixed mode):
===Truncated===
Heap
PSYoungGen total 814592K, used 470009K [0x000000063c180000, 0x000000069e600000, 0x00000007c0000000)
eden space 799744K, 56% used [0x000000063c180000,0x0000000657e53598,0x000000066ce80000)
from space 14848K, 97% used [0x000000069d780000,0x000000069e5ab1b8,0x000000069e600000)
to space 51200K, 0% used [0x0000000698200000,0x0000000698200000,0x000000069b400000)
ParOldGen total 574464K, used 180616K [0x0000000334400000, 0x0000000357500000, 0x000000063c180000)
object space 574464K, 31% used [0x0000000334400000,0x000000033f462240,0x0000000357500000)
Metaspace used 49078K, capacity 49874K, committed 50048K, reserved 1093632K
class space used 6054K, capacity 6263K, committed 6272K, reserved 1048576K
and
2017-07-27 22:06:44
Full thread dump OpenJDK 64-Bit Server VM (25.131-b11 mixed mode):
===Truncated===
Heap
PSYoungGen total 608768K, used 547401K [0x000000063c180000, 0x000000066a280000, 0x00000007c0000000)
eden space 601088K, 89% used [0x000000063c180000,0x000000065d09c498,0x0000000660c80000)
from space 7680K, 99% used [0x0000000669b00000,0x000000066a2762c8,0x000000066a280000)
to space 36864K, 0% used [0x0000000665a80000,0x0000000665a80000,0x0000000667e80000)
ParOldGen total 535552K, used 199304K [0x0000000334400000, 0x0000000354f00000, 0x000000063c180000)
object space 535552K, 37% used [0x0000000334400000,0x00000003406a2340,0x0000000354f00000)
Metaspace used 48810K, capacity 49554K, committed 49792K, reserved 1093632K
class space used 6054K, capacity 6263K, committed 6272K, reserved 1048576K
When the error happened, an executor from worker #2 received SIGNAL TERM and was labeled as dead. At this time, it was the only dead executor.
Strangely, the Spark job picked back up again after 10 minutes or so. Looking at the Spark UI, only the executors from worker #1 are active and the rest are dead. This is the first time this has happened.
Edit 5:
Again, following @Dennis's suggestions (thank you, @Dennis!), this time I ran sudo -u yarn jmap -histo <pid>. These are the top 10 classes by memory usage in CoarseGrainedExecutorBackend after about 10 minutes:
num #instances #bytes class name
----------------------------------------------
1: 244824 358007944 [B
2: 194242 221184584 [I
3: 2062554 163729952 [C
4: 746240 35435976 [Ljava.lang.Object;
5: 738 24194592 [Lorg.apache.spark.unsafe.memory.MemoryBlock;
6: 975513 23412312 java.lang.String
7: 129645 13483080 java.io.ObjectStreamClass
8: 451343 10832232 java.lang.StringBuilder
9: 38880 10572504 [Z
10: 120807 8698104 java.lang.reflect.Field
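In the histogram above, `[B`, `[I` and `[C` are the JVM's names for `byte[]`, `int[]` and `char[]` arrays. To make the byte counts easier to compare between snapshots, the rows can be parsed and converted to MiB. The Python sketch below does that for the top three rows; the function name `parse_histo` is hypothetical and the parser assumes the column layout shown above.

```python
def parse_histo(text):
    """Parse `jmap -histo` rows into (class_name, instances, bytes) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows look like: "1: 244824 358007944 [B"
        if len(parts) >= 4 and parts[0].endswith(":") and parts[0][:-1].isdigit():
            rows.append((parts[3], int(parts[1]), int(parts[2])))
    return rows


# Header and first three rows copied from the histogram above.
histo = """\
num #instances #bytes class name
1: 244824 358007944 [B
2: 194242 221184584 [I
3: 2062554 163729952 [C
"""
for name, instances, size in parse_histo(histo):
    print(f"{name:>4} {instances:>9} instances {size / 2**20:8.1f} MiB")
```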
Also, I've encountered a new type of error that caused an executor to die. It produced some failed tasks, highlighted in the Spark UI, and I found this in the executor's stderr:
17/07/28 00:44:03 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 6821.0 (TID 2585)
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:342)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
17/07/28 00:44:03 ERROR org.apache.spark.executor.Executor: Exception in task 0.1 in stage 6821.0 (TID 2586)
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:342)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
17/07/28 00:44:03 ERROR org.apache.spark.util.Utils: Uncaught exception in thread stdout writer for /opt/conda/bin/python
java.lang.AssertionError: assertion failed: Block rdd_5480_0 is not locked for reading
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:720)
at org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:516)
at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:46)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:35)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
17/07/28 00:44:03 ERROR org.apache.spark.util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout writer for /opt/conda/bin/python,5,main]
java.lang.AssertionError: assertion failed: Block rdd_5480_0 is not locked for reading
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:720)
at org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:516)
at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:46)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:35)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Edit 6:
This time, I took the jmap after 40 minutes of running:
num #instances #bytes class name
----------------------------------------------
1: 23667 391136256 [B
2: 25937 15932728 [I
3: 159174 12750016 [C
4: 334 10949856 [Lorg.apache.spark.unsafe.memory.MemoryBlock;
5: 78437 5473992 [Ljava.lang.Object;
6: 125322 3007728 java.lang.String
7: 40931 2947032 java.lang.reflect.Field
8: 63431 2029792 com.esotericsoftware.kryo.Registration
9: 20897 1337408 com.esotericsoftware.kryo.serializers.UnsafeCacheFields$UnsafeObjectField
10: 20323 975504 java.util.HashMap
These are the results of ps ux:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
yarn 601 0.8 0.9 3008024 528812 ? Sl 16:12 1:17 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dproc_nodema
yarn 6086 6.3 0.0 96764 24340 ? R 18:37 0:02 /opt/conda/bin/python -m pyspark.daemon
yarn 8036 8.2 0.0 96296 24136 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8173 9.4 0.0 97108 24444 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8240 9.0 0.0 96984 24576 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8329 7.6 0.0 96948 24720 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8420 8.5 0.0 96240 23788 ? R 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8487 6.0 0.0 96864 24308 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8554 0.0 0.0 96292 23724 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8564 0.0 0.0 19100 2448 pts/0 R+ 18:37 0:00 ps ux
yarn 31705 0.0 0.0 13260 2756 ? S 17:56 0:00 bash /hadoop/yarn/nm-local-dir/usercache/<user_name>/app
yarn 31707 0.0 0.0 13272 2876 ? Ss 17:56 0:00 /bin/bash -c /usr/lib/jvm/java-8-openjdk-amd64/bin/java
yarn 31713 0.4 0.7 2419520 399072 ? Sl 17:56 0:11 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -server -Xmx6
yarn 31771 0.0 0.0 13260 2740 ? S 17:56 0:00 bash /hadoop/yarn/nm-local-dir/usercache/<user_name>/app
yarn 31774 0.0 0.0 13284 2800 ? Ss 17:56 0:00 /bin/bash -c /usr/lib/jvm/java-8-openjdk-amd64/bin/java
yarn 31780 11.1 1.4 21759016 752132 ? Sl 17:56 4:31 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -server -Xmx1
yarn 31883 0.1 0.0 96292 27308 ? S 17:56 0:02 /opt/conda/bin/python -m pyspark.daemon
The PID of the CoarseGrainedExecutorBackend is 31780 in this case.
Edit 7:
Increasing heartbeatInterval in the Spark settings did not change anything, which makes sense in hindsight.
I created a short bash script that reads from Kafka with the console consumer for 5 seconds and writes the messages to a text file. The text file is uploaded to Hadoop, where Spark streams from it. We used this setup to test whether the timeout was related to Kafka:
Streaming from Hadoop and outputting to Kafka from Spark caused SocketTimeout
Streaming from Kafka directly and not outputting to Kafka from Spark caused SocketTimeout
Streaming from Hadoop and not outputting to Kafka from Spark caused SocketTimeout
So we moved on with the assumption that Kafka had nothing to do with the timeout.
We installed Stackdriver Monitoring to watch memory usage as the timeout occurred. The metrics showed nothing interesting; memory usage looked relatively stable throughout (hovering around 10~15% at most for the busiest nodes).
We guessed that communication between the worker nodes could be causing the issue. Right now our data traffic is very low, so even one worker can handle the whole workload with relative ease.
Running the Spark job on a single-node cluster while streaming from Kafka brokers in a different cluster seemed to stop the SocketTimeout... except the AssertionError documented above now occurs frequently.
Per @Dennis's suggestion, I created a new cluster (also single node) without the Jupyter initialization script this time, which means Spark now runs on Python v2.7.9 (without Anaconda). On the first run, Spark encountered SocketTimeoutException in just 15 seconds. The second run lasted just over 2 hours before failing with the same AssertionError. I'm starting to wonder if this is a problem with Spark's internals. The third run lasted about 40 minutes and then ran into SocketTimeoutException.

A client of mine was seeing various production PySpark jobs (Spark version 2.2.1) fail intermittently in Google Cloud Dataproc with a stack trace very similar to yours:
ERROR org.apache.spark.api.python.PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:711)
I found that disabling IPv6 on the Dataproc cluster VMs seemed to fix the issue. One way to do that is to add these lines to a Dataproc init script so they run at cluster creation time:
printf "\nnet.ipv6.conf.default.disable_ipv6 = 1\nnet.ipv6.conf.all.disable_ipv6=1\n" >> /etc/sysctl.conf
sysctl -p
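To confirm on a node that the setting actually took effect, one option is to read the kernel flag that the sysctl key maps to. This is a minimal sketch, assuming a Linux host that exposes /proc/sys/net/ipv6/conf/all/disable_ipv6; the helper name ipv6_disabled is hypothetical and not part of Dataproc or Spark.

```python
from pathlib import Path

# procfs file behind the net.ipv6.conf.all.disable_ipv6 sysctl key (Linux only).
DISABLE_IPV6_PATH = "/proc/sys/net/ipv6/conf/all/disable_ipv6"


def ipv6_disabled(path=DISABLE_IPV6_PATH):
    """Return True if the flag file contains 1, False otherwise, None if unreadable."""
    try:
        return Path(path).read_text().strip() == "1"
    except OSError:
        # The file is missing, e.g. on a non-Linux host or a kernel without IPv6.
        return None


if __name__ == "__main__":
    print("IPv6 disabled:", ipv6_disabled())
```

A result of True on every VM would suggest the init script ran as intended; None means the flag could not be read at all.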

Spark SQL + Cassandra: bad performance

I'm just starting to use Spark SQL + Cassandra and am probably missing something important, but one simple query takes ~45 seconds. I'm using the spark-cassandra-connector library and run a local web server that also hosts Spark. My setup is roughly like this:
In sbt:
"org.apache.spark" %% "spark-core" % "1.4.1" excludeAll(ExclusionRule(organization = "org.slf4j")),
"org.apache.spark" %% "spark-sql" % "1.4.1" excludeAll(ExclusionRule(organization = "org.slf4j")),
"com.datastax.spark" %% "spark-cassandra-connector" % "1.4.0-M3" excludeAll(ExclusionRule(organization = "org.slf4j"))
In code I have a singleton that hosts the SparkContext and CassandraSQLContext. It's then called from the servlet. Here's what the singleton code looks like:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.cassandra.CassandraSQLContext

object SparkModel {
  val conf =
    new SparkConf()
      .setAppName("core")
      .setMaster("local")
      .set("spark.cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext(conf)
  val sqlC = new CassandraSQLContext(sc)
  sqlC.setKeyspace("core")
  val df: DataFrame = sqlC.cassandraSql(
    "SELECT email, target_entity_id, target_entity_type " +
    "FROM tracking_events " +
    "LEFT JOIN customers " +
    "WHERE entity_type = 'User' AND entity_id = customer_id")
}
And here is how I use it:
get("/spark") {
  SparkModel.df.collect().map(r => TrackingEvent(r.getString(0), r.getString(1), r.getString(2))).toList
}
Cassandra, Spark and the web app run on the same host, in a virtual machine on my MacBook Pro with decent specs. Cassandra queries by themselves take 10-20 milliseconds.
When I call this endpoint for the first time, it takes 70-80 seconds to return the result. Subsequent queries take ~45 seconds. The log of a subsequent call looks like this:
12:48:50 INFO org.apache.spark.SparkContext - Starting job: collect at V1Servlet.scala:1146
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Got job 1 (collect at V1Servlet.scala:1146) with 1 output partitions (allowLocal=false)
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Final stage: ResultStage 1(collect at V1Servlet.scala:1146)
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Parents of final stage: List()
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Missing parents: List()
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Submitting ResultStage 1 (MapPartitionsRDD[29] at collect at V1Servlet.scala:1146), which has no missing parents
12:48:50 INFO org.apache.spark.storage.MemoryStore - ensureFreeSpace(18696) called with curMem=26661, maxMem=825564856
12:48:50 INFO org.apache.spark.storage.MemoryStore - Block broadcast_1 stored as values in memory (estimated size 18.3 KB, free 787.3 MB)
12:48:50 INFO org.apache.spark.storage.MemoryStore - ensureFreeSpace(8345) called with curMem=45357, maxMem=825564856
12:48:50 INFO org.apache.spark.storage.MemoryStore - Block broadcast_1_piece0 stored as bytes in memory (estimated size 8.1 KB, free 787.3 MB)
12:48:50 INFO o.a.spark.storage.BlockManagerInfo - Added broadcast_1_piece0 in memory on localhost:56289 (size: 8.1 KB, free: 787.3 MB)
12:48:50 INFO org.apache.spark.SparkContext - Created broadcast 1 from broadcast at DAGScheduler.scala:874
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[29] at collect at V1Servlet.scala:1146)
12:48:50 INFO o.a.s.scheduler.TaskSchedulerImpl - Adding task set 1.0 with 1 tasks
12:48:50 INFO o.a.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 1.0 (TID 1, localhost, NODE_LOCAL, 59413 bytes)
12:48:50 INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 1.0 (TID 1)
12:48:50 INFO com.datastax.driver.core.Cluster - New Cassandra host localhost/127.0.0.1:9042 added
12:48:50 INFO c.d.s.c.cql.CassandraConnector - Connected to Cassandra cluster: Super Cluster
12:49:11 INFO o.a.spark.storage.BlockManagerInfo - Removed broadcast_0_piece0 on localhost:56289 in memory (size: 8.0 KB, free: 787.3 MB)
12:49:35 INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 1.0 (TID 1). 6124 bytes result sent to driver
12:49:35 INFO o.a.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 1) in 45199 ms on localhost (1/1)
12:49:35 INFO o.a.s.scheduler.TaskSchedulerImpl - Removed TaskSet 1.0, whose tasks have all completed, from pool
12:49:35 INFO o.a.spark.scheduler.DAGScheduler - ResultStage 1 (collect at V1Servlet.scala:1146) finished in 45.199 s
As you can see from the log, the longest pauses are between these 3 lines (21 + 24 seconds):
12:48:50 INFO c.d.s.c.cql.CassandraConnector - Connected to Cassandra cluster: Super Cluster
12:49:11 INFO o.a.spark.storage.BlockManagerInfo - Removed broadcast_0_piece0 on localhost:56289 in memory (size: 8.0 KB, free: 787.3 MB)
12:49:35 INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 1.0 (TID 1). 6124 bytes result sent to driver
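The 21 + 24 second figure can be reproduced from the timestamps of the three lines above. This is only a small sketch of the arithmetic; the helper name gaps_in_seconds is made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the three log lines quoted above.
stamps = ["12:48:50", "12:49:11", "12:49:35"]


def gaps_in_seconds(times):
    """Return the pause in seconds between each consecutive pair of HH:MM:SS stamps."""
    parsed = [datetime.strptime(t, "%H:%M:%S") for t in times]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(parsed, parsed[1:])]


gaps = gaps_in_seconds(stamps)
print(gaps)       # [21, 24]
print(sum(gaps))  # 45
```

The two gaps add up to roughly the whole 45199 ms reported for the task, so nearly all of the time falls inside the task after the Cassandra connection is established.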
Clearly I'm doing something wrong. What is it, and how can I improve this?
EDIT: Important addition: the tables are tiny (~200 entries for tracking_events, ~20 for customers), so reading them entirely into memory shouldn't take any significant time. And it's a local Cassandra installation: no cluster, no networking involved.

"SELECT email, target_entity_id, target_entity_type " +
"FROM tracking_events " +
"LEFT JOIN customers " +
"WHERE entity_type = 'User' AND entity_id = customer_id")
This query will read all of the data from both the tracking_events and customers tables. I would compare the performance to just doing a SELECT COUNT(*) on both tables. If it is significantly different then there may be an issue, but my guess is that this is just the amount of time it takes to read both tables entirely into memory.
There are a few knobs for tuning how reads are done, and since the defaults are oriented towards a much bigger dataset you may want to change these.
Property | Description | Default
spark.cassandra.input.split.size_in_mb | approximate amount of data to be fetched into a Spark partition | 64 MB
spark.cassandra.input.fetch.size_in_rows | number of CQL rows fetched per driver request | 1000
I would make sure you are generating at least as many tasks as you have cores so you can take advantage of all of your resources. To do this, shrink spark.cassandra.input.split.size_in_mb.
The fetch size controls how many rows are paged at a time by an executor core, so increasing it can increase speed in some use cases.
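As a rough illustration of the "at least as many tasks as cores" advice, the arithmetic can be sketched as below. This is a back-of-the-envelope model, not the connector's actual algorithm: it assumes one partition per split-size chunk of data, whereas the connector relies on its own size estimates. The function names and the 256 MB / 8-core figures are hypothetical.

```python
import math


def estimated_partitions(table_size_mb, split_size_mb):
    """Rough model: one Spark partition per split_size_mb of data, never fewer than one."""
    return max(1, math.ceil(table_size_mb / split_size_mb))


def split_size_for_cores(table_size_mb, cores):
    """Largest whole-MB split size that still yields at least `cores` partitions.

    The result is floored at 1 MB, so very small tables may still produce
    fewer partitions than cores under this model.
    """
    return max(1, table_size_mb // cores)


# Hypothetical example: a 256 MB table read on an 8-core machine.
table_mb, cores = 256, 8
print(estimated_partitions(table_mb, 64))   # default split size -> 4 partitions
suggested = split_size_for_cores(table_mb, cores)
print(suggested)                            # 32
print(estimated_partitions(table_mb, suggested))  # 8 partitions, one per core
```

Under this model the 64 MB default would leave half of the eight cores idle, while a 32 MB split gives each core one task. For tables of a few hundred rows, as in the question, the split size cannot usefully go low enough to change the partition count.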

How to read HDF data from HDFS for Hadoop

I am working on image processing with Hadoop. I am using HDF satellite data. I can access and use JPG and other image types in Hadoop streaming, but HDF data fails with an error: Hadoop cannot read the HDF data from HDFS, and it takes more than twenty minutes for the error to appear. Each of my HDF files is larger than 150 MB.
How can I solve this problem? How can I make Hadoop read HDF data from HDFS?
Some of my output:
hadoop@master:/usr/local/master/hdf/examples$ ./runD1.sh
Buildfile: /usr/local/master/hdf/build.xml
downloader:
setup:
test_settings:
compile:
BUILD SUCCESSFUL
Total time: 0 seconds
Output HIB: /var/www/html/uploads/
14/09/26 15:28:46 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Found host successfully: 0
Repeated host: 1
Repeated host: 2
Repeated host: 3
Tried to get 2 nodes, got 1
14/09/26 15:28:46 INFO input.FileInputFormat: Total input paths to process : 1
First n-1 nodes responsible for 1592259 images
Last node responsible for 1592259 images
14/09/26 15:29:04 INFO mapred.JobClient: Running job: job_201409191212_0006
14/09/26 15:29:05 INFO mapred.JobClient: map 0% reduce 0%
14/09/26 15:39:15 INFO mapred.JobClient: Task Id : attempt_201409191212_0006_m_000000_0, Status : FAILED
Task attempt_201409191212_0006_m_000000_0 failed to report status for 600 seconds. Killing!
14/09/26 15:49:17 INFO mapred.JobClient: Task Id : attempt_201409191212_0006_m_000000_1, Status : FAILED
Task attempt_201409191212_0006_m_000000_1 failed to report status for 600 seconds. Killing!
14/09/26 15:59:19 INFO mapred.JobClient: Task Id : attempt_201409191212_0006_m_000000_2, Status : FAILED
Task attempt_201409191212_0006_m_000000_2 failed to report status for 600 seconds. Killing!
Error log is:
2014-09-26 15:38:45,133 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201409191212_0006_m_-1211757488
2014-09-26 15:38:45,133 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201409191212_0006_m_-1211757488 spawned.
2014-09-26 15:38:45,136 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /usr/local/master/temp/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201409191212_0006/attempt_201409191212_0006_m_000000_0.cleanup/taskjvm.sh
2014-09-26 15:38:45,631 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201409191212_0006_m_-1211757488 given task: attempt_201409191212_0006_m_000000_0
2014-09-26 15:38:46,145 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201409191212_0006_m_000000_0 0.0%
2014-09-26 15:38:46,198 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201409191212_0006_m_000000_0 0.0% cleanup
2014-09-26 15:38:46,200 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201409191212_0006_m_000000_0 is done.
2014-09-26 15:38:46,200 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201409191212_0006_m_000000_0 was -1
2014-09-26 15:38:46,200 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 2
2014-09-26 15:38:46,340 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201409191212_0006_m_-1211757488 exited with exit code 0. Number of tasks it ran: 1
Can anyone please help me solve this problem?

Sqoop installation export and import from postgresql

I've just installed Sqoop and was testing it. I tried to export some data from HDFS to PostgreSQL using Sqoop. When I run it, it throws the following exception: java.io.IOException: Can't export data, please check task tracker logs. I think there may also have been a problem with the installation.
The file content is:
ustNU 45
MB1bA 0
gNbCO 76
iZP10 39
B2aoo 45
SI7eG 93
5sC4k 60
2IhFV 2
u2A48 16
yvy6R 51
LNhsV 26
mZ2yn 65
80Gp3 43
Wk5Ag 85
VUfyp 93
P077j 94
f1Oj5 11
LxJkg 72
0H7NP 99
Dk406 25
g4KRp 76
Fw3U0 80
6LD59 1
07KHx 91
F1S88 72
Bnb0v 85
A2qM7 79
Z6cAt 81
0M3DO 23
m0s09 44
KIvwd 13
GNUD0 78
um93a 20
19bHv 75
4Of3s 75
5hFen 16
This is the Postgres table:
Table "public.mysort"
Column | Type | Modifiers
--------+---------+-----------
name | text |
marks | integer |
The sqoop command is:
sqoop export --connect jdbc:postgresql://localhost/testdb --username akshay --password akshay --table mysort -m 1 --export-dir MySort/input
Followed by the error:
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
14/06/11 18:28:06 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
14/06/11 18:28:06 INFO manager.SqlManager: Using default fetchSize of 1000
14/06/11 18:28:06 INFO tool.CodeGenTool: Beginning code generation
14/06/11 18:28:06 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "mysort" AS t LIMIT 1
14/06/11 18:28:06 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/local/hadoop
Note: /tmp/sqoop-hduser/compile/0402ad4b5cf7980040264af35de406cb/mysort.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
14/06/11 18:28:07 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hduser/compile/0402ad4b5cf7980040264af35de406cb/mysort.jar
14/06/11 18:28:07 INFO mapreduce.ExportJobBase: Beginning export of mysort
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hbase/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/06/11 18:28:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/06/11 18:28:22 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/06/11 18:28:23 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/06/11 18:28:23 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
14/06/11 18:28:23 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/06/11 18:28:23 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/06/11 18:28:24 INFO input.FileInputFormat: Total input paths to process : 1
14/06/11 18:28:24 INFO input.FileInputFormat: Total input paths to process : 1
14/06/11 18:28:25 INFO mapreduce.JobSubmitter: number of splits:1
14/06/11 18:28:25 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1402488523460_0003
14/06/11 18:28:25 INFO impl.YarnClientImpl: Submitted application application_1402488523460_0003
14/06/11 18:28:25 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1402488523460_0003/
14/06/11 18:28:25 INFO mapreduce.Job: Running job: job_1402488523460_0003
14/06/11 18:28:46 INFO mapreduce.Job: Job job_1402488523460_0003 running in uber mode : false
14/06/11 18:28:46 INFO mapreduce.Job: map 0% reduce 0%
14/06/11 18:29:04 INFO mapreduce.Job: Task Id : attempt_1402488523460_0003_m_000000_0, Status : FAILED
Error: java.io.IOException: Can't export data, please check task tracker logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:839)
at mysort.__loadFromFields(mysort.java:198)
at mysort.parse(mysort.java:147)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
... 10 more
14/06/11 18:29:23 INFO mapreduce.Job: Task Id : attempt_1402488523460_0003_m_000000_1, Status : FAILED
Error: java.io.IOException: Can't export data, please check task tracker logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:839)
at mysort.__loadFromFields(mysort.java:198)
at mysort.parse(mysort.java:147)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
... 10 more
14/06/11 18:29:42 INFO mapreduce.Job: Task Id : attempt_1402488523460_0003_m_000000_2, Status : FAILED
Error: java.io.IOException: Can't export data, please check task tracker logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:839)
at mysort.__loadFromFields(mysort.java:198)
at mysort.parse(mysort.java:147)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
... 10 more
14/06/11 18:30:03 INFO mapreduce.Job: map 100% reduce 0%
14/06/11 18:30:03 INFO mapreduce.Job: Job job_1402488523460_0003 failed with state FAILED due to: Task failed task_1402488523460_0003_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
14/06/11 18:30:03 INFO mapreduce.Job: Counters: 9
Job Counters
Failed map tasks=4
Launched map tasks=4
Other local map tasks=3
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=69336
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=69336
Total vcore-seconds taken by all map tasks=69336
Total megabyte-seconds taken by all map tasks=71000064
14/06/11 18:30:03 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
14/06/11 18:30:03 INFO mapreduce.ExportJobBase: Transferred 0 bytes in 100.1476 seconds (0 bytes/sec)
14/06/11 18:30:03 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
14/06/11 18:30:03 INFO mapreduce.ExportJobBase: Exported 0 records.
14/06/11 18:30:03 ERROR tool.ExportTool: Error during export: Export job failed!
This is the log file :
2014-06-11 17:54:37,601 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2014-06-11 17:54:37,602 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2014-06-11 17:54:52,678 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-11 17:54:52,777 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2014-06-11 17:54:52,846 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2014-06-11 17:54:52,847 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2014-06-11 17:54:52,855 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2014-06-11 17:54:52,855 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1402488523460_0002, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@971d0d8)
2014-06-11 17:54:52,901 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2014-06-11 17:54:53,165 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /tmp/hadoop-hduser/nm-local-dir/usercache/hduser/appcache/application_1402488523460_0002
2014-06-11 17:54:53,249 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2014-06-11 17:54:53,249 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2014-06-11 17:54:53,393 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2014-06-11 17:54:53,689 INFO [main] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2014-06-11 17:54:53,899 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: Paths:/user/hduser/MySort/input/data.txt:0+891082
2014-06-11 17:54:53,904 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file
2014-06-11 17:54:53,904 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start
2014-06-11 17:54:53,904 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length
2014-06-11 17:54:54,028 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper:
2014-06-11 17:54:54,028 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: Exception raised during data export
2014-06-11 17:54:54,028 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper:
2014-06-11 17:54:54,028 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: Exception:
java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:839)
at mysort.__loadFromFields(mysort.java:198)
at mysort.parse(mysort.java:147)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
2014-06-11 17:54:54,030 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: On input: ustNU 45
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: On input file: hdfs://localhost:9000/user/hduser/MySort/input/data.txt
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: At position 0
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper:
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: Currently processing split:
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: Paths:/user/hduser/MySort/input/data.txt:0+891082
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper:
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: This issue might not necessarily be caused by current input
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: due to the batching nature of export.
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper:
2014-06-11 17:54:54,032 INFO [Thread-12] org.apache.sqoop.mapreduce.AutoProgressMapper: Auto-progress thread is finished. keepGoing=false
2014-06-11 17:54:54,033 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hduser (auth:SIMPLE) cause:java.io.IOException: Can't export data, please check task tracker logs
2014-06-11 17:54:54,033 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: Can't export data, please check task tracker logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:839)
at mysort.__loadFromFields(mysort.java:198)
at mysort.parse(mysort.java:147)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
... 10 more
2014-06-11 17:54:54,037 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
Any help in resolving the issue is appreciated.

Here is the complete procedure for installing Sqoop, along with the import and export commands. Hopefully it will be helpful to someone. I have tried and tested it, and it works.
Download: apache.mirrors.tds.net/sqoop/1.4.4/sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz
tar -xzf sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz
sudo mv sqoop-1.4.4.bin__hadoop-2.0.4-alpha /usr/lib/sqoop
Copy and paste the following two lines into .bashrc:
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
Go to the /usr/lib/sqoop/conf folder, copy sqoop-env-template.sh to a new file sqoop-env.sh, and modify the exports (HADOOP_HOME, HBASE_HOME, etc.) to point to the installation directories.
Download the PostgreSQL connector jar file from jdbc.postgresql.org/download/postgresql-9.3-1101.jdbc41.jar
Create a directory manager.d in sqoop/conf/.
Create a file postgresql in conf/ and add the following line to it:
org.postgresql.Driver=/usr/lib/sqoop/lib/postgresql-9.3-1101.jdbc41.jar
Name the connector jar file accordingly.
For Export
Create a user in postgres:
createuser -P -s -e ace
Enter password for new role: ace
Enter it again: ace
CREATE DATABASE testdb OWNER ace TABLESPACE ace;
create table stud1(id int,name text);
Create a file student.txt
Add lines such as:
1,Ace
2,iloveapis
hadoop fs -put student.txt
sqoop export --connect jdbc:postgresql://localhost:5432/testdb --username ace --password ace --table stud1 -m 1 --export-dir student.txt
Check in Postgres: SELECT * FROM stud1;
For import:
sqoop import --connect jdbc:postgresql://localhost:5432/testdb --username ace --password ace --table stud1 --m 1
hadoop fs -ls -R stud1
Expected Output:
-rw-r--r-- 1 hduser supergroup 0 2014-06-13 18:10 stud1/_SUCCESS
-rw-r--r-- 1 hduser supergroup 21 2014-06-13 18:10 stud1/part-m-00000
hadoop fs -cat stud1/part-m-00000
Expected Output:
1,Ace
2,iloveapis
hadoop fs -copyToLocal stud1/part-m-00000 $HOME/imported_data.txt
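Note that the student.txt example above is comma-separated, which matches the default field delimiter Sqoop export expects, while the sample file in the question separates fields with whitespace. If that mismatch is the cause of the NoSuchElementException (an assumption, not something the logs state outright), one option is to rewrite the file before uploading it; Sqoop's --input-fields-terminated-by option is the alternative. Below is a minimal sketch with hypothetical helper names, assuming no field contains whitespace itself.

```python
def to_comma_separated(line):
    """Join the whitespace-separated fields of one record with commas."""
    return ",".join(line.split())


def convert_file(src_path, dst_path):
    """Rewrite a whitespace-delimited text file as comma-delimited, skipping blank lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(to_comma_separated(line) + "\n")


# One record from the question's sample data.
print(to_comma_separated("ustNU 45"))  # ustNU,45
```

After conversion, the new file would be uploaded with hadoop fs -put and exported exactly as in the steps above.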
