Here’s an interesting note: just to try a parallel run across multiple nodes, I requested 9 nodes x 24 cores for 48 hours and the allocation went through (maybe because everyone is clearing out their jobs before the shutdown). However, I am only running a small subset; it was scheduled last night and is still running! The same subset finished faster on a single core, so this might be a window into whatever is happening with the parallel runs.
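For reference, assuming the cluster uses SLURM (the note doesn't name the scheduler), a multi-node request like the one above would look roughly like this; the partition is omitted and the job name and executable are hypothetical placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=parallel-test   # hypothetical job name
#SBATCH --nodes=9                  # 9 nodes, as requested above
#SBATCH --ntasks-per-node=24       # 24 cores per node
#SBATCH --time=48:00:00            # 48-hour wall-clock limit

# Launch one MPI rank per allocated core (executable name is a placeholder)
srun ./my_parallel_run
```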
Then, even running in parallel within a single node, I’m getting that MPI job-killed error again, but interestingly only for 2 of the 3 jobs… once again, this points to finicky hardware or connections. My colleague was able to run the same task without getting any errors.