Gradient job runner not showing my logs

gradient

#1

I am running the following code on the Gradient job runner (P5000). Repo: github.com/khushmeet/code-rnn. The problem is that the job produces no log output when it runs, whereas other repositories I have tested do emit logs. How can I fix this?

My container is floydhub/pytorch.

Thanks for the help.


#2

Hey, can you post the command you are using and what the CLI is saying? I.e., does it finish uploading and get stuck at “awaiting logs”, or does it terminate earlier in the process?


#3

Command
paperspace jobs create --container floydhub/pytorch:0.3.1-gpu.cuda9cudnn7-py3.27 --machineType P5000 --command 'python main.py -epochs 10 -save_model paperspace_model -train_data data.txt -cuda'

Everything runs as intended, except that nothing shows up after the “awaiting logs...” message.


#4

Could you try the same command, but with just an nvidia-smi test? So:

paperspace jobs create --container floydhub/pytorch:0.3.1-gpu.cuda9cudnn7-py3.27 --machineType P5000 --command 'nvidia-smi'


#5

This is the output I got


#6

So that looks good, which leads me to think there might be something going on with the Python file. The container and job look fine. As a next step, I would create a bash file called run.sh that looks like:

#!/bin/bash
echo "Starting logging"
python main.py -epochs 10 -save_model paperspace_model -train_data data.txt -cuda

and then start your job by calling

paperspace jobs create --container floydhub/pytorch:0.3.1-gpu.cuda9cudnn7-py3.27 --machineType P5000 --command 'bash run.sh'

As a side note, make sure the run.sh file has Unix (LF) line endings, on the off chance you are submitting your jobs from a Windows machine.
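If the file was edited on Windows, a small script can normalize it. This is only a sketch: the helper names `to_unix_line_endings` and `fix_file` are made up for illustration and are not part of the Paperspace CLI.

```python
from pathlib import Path


def to_unix_line_endings(data: bytes) -> bytes:
    """Return data with CRLF (Windows) and bare CR line endings converted to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def fix_file(path: str) -> bool:
    """Rewrite the file in place if needed; return True if it was changed."""
    p = Path(path)
    original = p.read_bytes()
    converted = to_unix_line_endings(original)
    if converted != original:
        p.write_bytes(converted)
        return True
    return False


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python fix_endings.py run.sh
    for name in sys.argv[1:]:
        status = "converted" if fix_file(name) else "already LF"
        print(f"{name}: {status}")
```

Run it against run.sh before creating the job.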

Let me know what that reports. If you see something in the logs, then I would start looking into the Python program itself and add debug statements to see where it is getting stuck.
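Here is a minimal sketch of what such debug statements could look like. The `debug` helper and the messages are hypothetical and not taken from the code-rnn repo. Each line is flushed immediately, because Python block-buffers stdout when it is not attached to a terminal, so plain print output can sit in the buffer until the process exits. Whether buffering is the cause here is an assumption, not something confirmed in this thread.

```python
import sys
import time

_START = time.time()


def debug(msg: str, stream=None) -> str:
    """Write a timestamped line to the stream and flush it right away."""
    stream = sys.stdout if stream is None else stream
    line = f"[{time.time() - _START:8.1f}s] {msg}"
    # flush=True pushes the line out immediately instead of leaving it buffered.
    print(line, file=stream, flush=True)
    return line


if __name__ == "__main__":
    # Illustrative checkpoints; place similar calls around the real steps in main.py.
    debug("parsing arguments")
    debug("loading training data")
    debug("starting epoch 1")
```

Another option that needs no code changes is to run the script with python -u (or set PYTHONUNBUFFERED=1) in run.sh, which turns off stdout buffering for the whole program.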


#7

I did what you said and here’s the output.

It’s been 7 minutes and still no logs from the program. Another curious thing: if my program hits an error during execution, all the print statements are written to the log, and only then are they shown.


#8

This indicates to me that the program is failing somewhere. Could you try running it on a K80 or P100 GPU type? Then we can narrow down whether it’s a machine-type issue or not.