Additional CPU Instructions for TensorFlow inside container


#1

I tried using TensorFlow Docker images > 1.5, but they all fail with the error Job Failed, exitCode 132; TensorFlow 1.5.0 works fine. I also tried a custom container (TF 1.8) that works on many other systems and is built with --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.1 --copt=-msse4.2, and that also fails.

These instructions are normally supported on modern CPUs, especially Xeons. Being required to use such an old version of TensorFlow is not ideal; is there any chance these instructions could be enabled? I could rebuild the container without these instructions, but that would be a hassle for every new release.

This github issue describes what I am experiencing: https://github.com/tensorflow/tensorflow/issues/17411
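
For what it’s worth, exit code 132 is 128 + signal 4 (SIGILL, illegal instruction), i.e. the binary is using an instruction the VM’s CPU does not expose. Here is a minimal sketch (Linux only, reading /proc/cpuinfo) for checking which of the flags from the build command above the machine actually reports:

    # Check which of the instruction sets from the build flags the VM's
    # CPU actually reports (flag names as they appear in /proc/cpuinfo).
    for f in avx avx2 fma sse4_1 sse4_2; do
        grep -q -w "$f" /proc/cpuinfo && echo "$f: present" || echo "$f: MISSING"
    done

On a machine where AVX is not exposed, avx/avx2/fma will show as MISSING, which matches the SIGILL.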


#2

@tkurmann We can repro the issue. There is a setting called AVX which is not initially enabled on our systems and is leveraged by TensorFlow > 1.5. Our Deepo container was recompiled to ignore the AVX flag until we update our host systems, so you can use those containers in the meantime. Alternatively, you can select a K80, P100 or TPU.

Nice find by the way and sorry for the trouble.


#3

@Daniel thanks for the quick response. Let me know when you activate AVX and I’ll give it another shot :slight_smile:


#4

@tkurmann We just rebuilt the P4000s with the AVX flag properly enabled. As of now, the following VM types are compatible with TensorFlow > 1.5:

  • P4000
  • K80
  • P100

The other instances are slated to be refreshed over the coming week or so.


#5

The GPU+ and Volta types have also been updated. All that remains are the P5000 and C types.


#6

Hi @Daniel,
I found this thread while investigating the exact same issue. I’m on a P5000 and stuck with either TensorFlow 1.5, which is several months old now, or building later releases from source, which doesn’t excite me much.
Is the AVX support for P5000 still on the cards?
Thanks!
Laurent


#7

@LaurentS All Gradient nodes should have been updated to support AVX2. Are you using Core by any chance?


#8

@Daniel That was quick!
I’m using a P5000, and yes, I think it’s Core. fast.ai template, in AMS.


#9

@LaurentS Got it. This post is under the Gradient section of the Community, hence my confusion. I would try using Gradient, or a P4000 Core VM (AVX2 has been rolled out to P4000 nodes in the NY2 and AMS regions).


#10

Sorry about that, I didn’t realise it was under a specific section. I guess I’ll run that part of the code on my P4000 for now. Does this mean the Core P5000 won’t be upgraded? I was hoping to use the 16GB of RAM on it…


#11

@Daniel My P4000 does not seem to expose the AVX instructions (judging from cat /proc/cpuinfo). My VM is a few months old. Do I need to create a new machine for them to be enabled?


#12

All machines will be upgraded over time, I just don’t have an ETA on that.

You would need to create a new machine. If you have a ton of stuff on there, I can create a ticket to migrate it.


#13

@Daniel Ok, understood. I hope the P5000 will be upgraded soon. It seems strange that one of the major ML packages can’t run out of the box on your machines. I created a fresh P4000 and moved the stuff I need over. Good to know there’s a way to migrate data if needed!

For anyone who lands here researching the same issue, I created a wheel for tensorflow 1.8.0 with CUDA 9.1 and libcudnn 7.0 on my P5000 (with AVX optimisations disabled), which others should be able to use while waiting for the fix above. https://github.com/laurentS/tensorflow-wheels
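
For completeness, the build itself is nothing exotic. Roughly what I ran from the TensorFlow 1.8 source tree after ./configure (pointing it at CUDA 9.1 and cuDNN 7); this is a sketch rather than my exact invocation:

    # Built on the P5000 itself, so the default -march=native optimisation
    # flags picked up by --config=opt only include what this CPU exposes,
    # i.e. no AVX/AVX2/FMA.
    bazel build --config=opt --config=cuda \
        //tensorflow/tools/pip_package:build_pip_package
    bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
    # the .whl lands in /tmp/tensorflow_pkg/, ready for pip install

If the resulting wheel imports on a non-AVX machine without an illegal-instruction crash, you’re set.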


#14

@Daniel Hi, I just created a P5000 machine. Does that mean I have to create another machine and pay again for the storage?


#15

@Faris_Hafizhan This post is under the Gradient section, and all Gradient nodes are AVX2-compatible at this point. This is not yet the case for Core, so it might be good to create a separate post about AVX2 in general to track the progress of that upgrade.

With respect to creating a new machine: Storage is actually prorated – if you delete a machine, you are credited for unused time. Sorry for the confusion.