Troubleshooting an elastic job

The elastic job failed, but there are many logs I can view. Where do I start?
When an elastic job fails, you can troubleshoot the job by examining the logs in the following order:
  1. Execution plan. Used to debug the Scala code for the job.
  2. Session log. Used to debug the logic that compiles the job and generates the Spark execution workflow.
  3. Agent job log. Used to debug how the Secure Agent pushes the Spark execution workflow to the elastic cluster for processing.
  4. Spark driver and executor logs. Used to debug how the Serverless Spark engine runs the job.
You can download the execution plan, session log, agent job log, and Spark driver log in Monitor.
To find the Spark executor log, copy the advanced log location for a specific Spark task that failed. Then, navigate to the log location on your cloud platform and download the log.
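For example, if the advanced log location points to an Amazon S3 path and the AWS CLI is installed, a command like the following downloads the executor logs. The bucket and prefix are placeholders for the location that you copied, and the exact command depends on your cloud platform:
    aws s3 cp s3://<log location bucket>/<log location prefix>/ . --recursive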
I can't find all of the log files for the elastic job that failed. I've tried to download the logs from both Monitor and the log location on my cloud platform.
The logs that are available for an elastic job depend on the step where the job failed during processing.
For example, if the job fails before the job is pushed to the elastic cluster, the Spark driver and executor logs are not generated in the log location, and Monitor cannot query the logs from the cloud platform either.
You can recover some of the log files, but you might have to use other types of logs to troubleshoot the job.
I can't find the Spark driver and Spark executor logs. Can I recover them?
If you cannot download the Spark driver log from the user interface, you can recover the log using the Spark driver Pod. You cannot recover Spark executor logs.
When the Secure Agent pushes an elastic job to the cluster, the agent creates one Spark driver Pod and multiple Spark executor Pods to run the Spark tasks. Because the driver Pod deletes the executor Pods immediately after the elastic job succeeds or fails, the executor logs are lost, but you can use the driver Pod to recover the Spark driver log.
Note: When a job succeeds or fails, the Spark driver Pod is deleted after 5 minutes by default. If you need to increase the limit to assist troubleshooting, contact Informatica Global Customer Support.
To recover the Spark driver log, perform the following tasks (an example command sequence follows the list):
  1. Find the name of the Spark driver Pod in the agent job log. For example, see the name of the Spark driver Pod in the following message:
    2019/04/09 11:10:15.511 : INFO :Spark driver pod [spark-passthroughparquetmapping-veryvery-longlongname-1234567789-infaspark02843891945120475434-driver] was successfully submitted to the cluster.
    If you cannot download the agent job log in Monitor, the log is available in the following directory on the Secure Agent machine:
    <Secure Agent installation directory>/apps/At_Scale_Server/<version>/logs/job-logs/
    The file name of the agent job log uses the format AgentLog-<Spark job ID>.log. You can find the Spark job ID in the session log. For example, the Spark job ID is 0c2c5f47-5f0b-43af-a867-da011452c19dInfaSpark0 in the following message of the session log:
    2019-05-09T03:07:52.129+00:00 <LdtmWorkflowTask-pool-1-thread-9> INFO: Registered job to status checker with Id 0c2c5f47-5f0b-43af-a867-da011452c19dInfaSpark0
  2. Confirm that the Spark driver Pod exists. If the driver Pod was deleted, you cannot retrieve the Spark driver log.
    To confirm that the driver Pod exists, navigate to the following directory on the Secure Agent machine:
    <Secure Agent installation directory>/apps/At_Scale_Server/<version>/mercury/services/shared/kubernetes/kubernetes_1.11/bin
    In the directory, run the following command:
    ./kubectl get pods
  3. Find the cluster instance ID.
  4. Log in to the Secure Agent machine as the sudo user that started the agent.
  5. Set the environment variable KUBECONFIG on the Secure Agent machine to the following value:
    <Secure Agent installation directory>/apps/At_Scale_Server/<version>/ccs_home/<cluster ID>/.kube/kubeconfig.yaml
  6. To retrieve the Spark driver log, navigate to the following directory on the Secure Agent machine:
    <Secure Agent installation directory>/apps/At_Scale_Server/<version>/mercury/services/shared/kubernetes/kubernetes_1.11/bin
    In the directory, run the following command:
    ./kubectl logs <Spark driver pod name>
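For example, the following commands sketch steps 1 through 6 on the Secure Agent machine. The grep pattern, the ls command for locating the cluster ID, and the redirection of the log output to a file are illustrations rather than product requirements; substitute your own installation directory, version, Spark job ID, cluster ID, and Pod name:
    cd <Secure Agent installation directory>/apps/At_Scale_Server/<version>
    grep "Spark driver pod" logs/job-logs/AgentLog-<Spark job ID>.log    # step 1: find the driver Pod name
    ls ccs_home/                                                         # step 3: cluster ID, assumed to appear as a directory name
    export KUBECONFIG=$PWD/ccs_home/<cluster ID>/.kube/kubeconfig.yaml   # step 5: point kubectl at the cluster
    cd mercury/services/shared/kubernetes/kubernetes_1.11/bin
    ./kubectl get pods                                                   # step 2: confirm the driver Pod still exists
    ./kubectl logs <Spark driver pod name> > spark-driver.log            # step 6: save the Spark driver log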