TestDialTimeout has historically been very flaky
(#11872, #13144, #22896, and now #56876),
apparently in part due to implementation details of the socktest
package it relies on.
In reviewing CL 467335, I noticed that TestDialTimeout is the last
remaining use of testHookDialChannel (added for #5349), and that that
hook no longer has any effect for Unix and Windows.
As an experiment, I tried removing both that hook and the call to
time.Sleep in the socktest filter, and to my surprise the test
continued to pass. That greatly undermined my confidence in the test,
since it appears that the “timeout” behavior it observes is caused by
the socktest filter injecting an error rather than anything in the net
package proper actually timing out.
To restore confidence in the test, I think it should be written
against only the public API of the net package, and should test the
publicly-documented behaviors. This change implements that approach.
Notably, when a timeout is set on a Dial call, that does not guarantee
that the listener will actually call Accept on the connection before
the timeout occurs: the kernel's network stack may preemptively accept
and buffer the connection on behalf of the listener. To avoid test
flakiness, the test must tolerate (and leave open) those spurious
connections: when the kernel has accepted enough of them, it will
start to block new connections until the buffered connections have
been accepted, and the expected timeout behavior will occur.
This also allows the test to run much more quickly and in parallel:
since we are relying on real timeouts instead of injected calls to
time.Sleep, we can set the timeouts to be much shorter and run
concurrently with other public-API tests without introducing races.