How to run deep learning on Amazon Web Services with 200 GB of data on an EC2 instance


I’m running a transfer-learning algorithm (ResNet-50) on a specific dataset on an AWS EC2 instance. More specifically, I’m using one of the standard Amazon Community AMIs for deep learning on a p3.8xlarge GPU compute instance.

When I SSH into my instance, I run source activate on the deep learning conda environment. From there, I launch Jupyter notebooks and run my code in the Python 3 kernel.

When I first start running my code, it runs normally. Below is the CPU utilization %:

At some point in the code, the connection to the notebook fails. This is the only information I get from the terminal:

packet_write_wait: Connection to X.X.X.X IP address port 22: Broken pipe

How do I fix this?