-
Notifications
You must be signed in to change notification settings - Fork 981
Add TestConnectionClosureRace #709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
@rschmitt Does this really need to be a JUnit class? Does it need to be executed with each and every CI run? Can still just be a stand-alone executable with a main method? |
@ok2c Currently, this class seems to be exposing a serious bug in the HTTP/2 client. This class's tests run in about seven seconds on my machine, but I'm not planning on leaving it enabled beyond the short term, because it burns through a lot of sockets and ephemeral ports which can cause problems. |
@rschmitt I have started a major revision of the H2 protocol handling code and have been making minor corrections / improvements to the H2 stream termination logic. I am going to use your code as a reproducer for the defect you have been seeing. However I still do think there is little value in running this test with every CI run. This is a type of integration test one may want to execute prior to a GA release or after having done a major code refactoring. I would really appreciate if you could convert these tests to a pure example class or mark them as |
@ok2c OK |
@rschmitt Curiously enough I am unable to reproduce the problem. I have run the test suite with the
I was also unable to reproduce the problem with the latest core 5.3.x and the latest client 5.5.x.
It is entirely possible that apache/httpcomponents-core@a202086 has fixed the cause of the problem but I am a bit doubtful. Could you please re-run the tests in your environment with the latest core master and let me know if the problem persists. In the meantime I will be working on improvement of the stale check. |
@rschmitt I am unable to reproduce the problem neither with core |
@rschmitt I believe I have found the underlying cause of the problem (or misunderstanding if you agree to see it that way) Here is the thing. The HTTP/1.1 protocol handler by default aggressively pipelines request messages. What that basically means the client can write out n requests before the server gets a chance to process the first request and to drop the connection. By that time all n request messages may have already been committed to the execution pipeline
The stale connection check cannot help when pipelining is being employed because the connection is perfectly valid at the time of request initiation. The H2 tests do not exhibit this problem because the server limits the number of concurrent streams to 1
I do not know if this is intentional or not but with the message pipelining not used, I get 100% reliability of message exchanges with both HTTP/1.1 and H2 using code with minor modifications to ensure request messages get executed sequentially
|
That's an important detail. How does this work? Does pipelining only kick in when the max connection pool size is reached? How does this interact with the |
Testing against apache/httpcomponents-core@be71c3a, I can reproduce the HTTP/2 hang on Linux (m5.2xlarge instance running Linux 5.10). I noticed that on macOS, it's much harder to reproduce the bug by running the tests on JDK21, so make sure you're running them on JDK8. |
@rschmitt HTTP/1.1 protocol handler has been working like that since 5.0-alpha1. Requests have always been immediately written out to the connection as soon as submitted by the caller for execution. I do not see a reason why it should be different. It is fully protocol conformant while dropping connection without signalling it is not. If you insist we can add a config knob to disable it. |
@rschmitt I am on vacation at the moment and will not be able to lay my hands on a different hardware until mid September. My Linux box is all I have. I always use Java 8 by default. I have never been able to reproduce it. Not once. I will see if I can reproduce the lock-up on GitHub CI. |
I've amended the PR to rebase atop the latest |
FYI, I can also reproduce the h2 hang against httpcore 5.3.4, in addition to apache/httpcomponents-core@be71c3a. It's definitely timing-sensitive. Newer and faster JDKs are less likely to reproduce the problem, and h2c is less likely to hang than TLS 1.3, which is less likely to hang than TLS 1.2. I can also reproduce it on Windows. |
@rschmitt The |
@rschmitt I honestly do not know how to fix a problem I am unable to reproduce
|
I'm not sure what you mean. The tests are all doing what I intend them to do.
I'm really glad you posted this. I'm running with Corretto-8.462.08.1 (build 1.8.0_462-b08). Just for grins, I tried the reproducer with corretto-8.342.07.3, installed via https://github.com/jdx/mise (handy dandy!), and it's not reproducing the problem. I bet if you grab the latest version you'll be able to reproduce the problem. If you install mise, you can do this by running:
|
I bisected the Corretto 8 releases available from mise. It looks like corretto-8.392.08.1 is the most recent release that does not reproduce the problem. The next available release, corretto-8.402.07.1, exhibits the issue. |
No obvious smoking gun in the release notes: https://builds.shipilev.net/backports-monitor/release-notes-openjdk8u402.html |
@rschmitt Okay. I will have to be blunt, then
Then, in their present form the tests are meaningless / useless for HTTP/1.1 due to the request pipelining. There is no point committing them in their present shape Some do
which is correct. Others do
which is wrong in the given context due to the use of request pipleining with the HTTP/1.1 transport.
I will get the latest Java 8 JRE and will try out to reproduce the defect with it. |
Oh, duh, I see your point now. You're saying that the HTTP/1.1 tests think they're testing a race between connection leasing and half-close, but they're actually testing a race within the duplexer, between the write path (send the next pipelined request) and the read path (read and process connection closure). And that's not a meaningful race, except maybe for determining which exception ends up getting thrown. What I don't understand, then, is why the use of When the connection pool is full, does the async client ever just wait for a connection to become available, instead of pipelining the request onto an existing connection? Alternatively, is there logic within the client (independent of any I'll also check and see whether I can replicate my earlier experiment. |
@rschmitt I made a mistake. I was looking at the problem at the core protocol level but I forgot about the connection pool preventing request pipelining at the client level by leasing connections to a single request at a time. My point still stands, though. The test cases behave inconsistently. This needs to be fixed. |
@rschmitt And no luck with the reproducer
|
@rschmitt I can reliably reproduce the lock-up on one of my client's Windows machines. I will be looking into it. |
@rschmitt This apache/httpcomponents-core@ac42dec should fix the immediate problem with a race between connection status and command submission. This apache/httpcomponents-core@5de2967 should make sure that Both
|
try (final CloseableHttpAsyncClient client = asyncClient(validateConnections)) { | ||
final List<Future<SimpleHttpResponse>> futures = new ArrayList<>(); | ||
for (int i = 0; i < 2500; i++) { | ||
final Future<SimpleHttpResponse> f = sendPing(client); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rschmitt If you want my OK to merge this change-set please help me understand why do
final Future<SimpleHttpResponse> f = sendPing(client);
try {
f.get();
} catch (final ExecutionException ignore) {
}
futures.add(f);
in this instance and do
futures.add(sendPing(client));
in other cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The common idea in these tests is to submit batches of concurrent requests and then wait for them to finish:
Test case | Batch size | # of batches | Between batches |
---|---|---|---|
testSlowSequentialRequests |
1 | 10 | Sleep |
testRapidSequentialRequests |
1 | 2500 | Block |
testOneLargeBatchOfRequests |
30 | 1 | N/A |
testSpacedOutBatchesOfRequests |
3 | 5 | Sleep |
testRapidSequentialRequests
could Thread.sleep(25)
instead, but blocking on the future is faster. Similarly, the other requests could block on all the futures in a batch, instead of simply sleeping in order to allow the client to quiesce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testRapidSequentialRequests could Thread.sleep(25) instead, but blocking on the future is faster.
@rschmitt I still do not understand why all tests could not behave consistently, and why a faster future is better than Thread.sleep(25) but so be it.
This isn't a test case so much as a reproducer that tries to force a race between client-side connection reuse and server-initiated connection closure on various protocols. Since it's fairly slow and resource-intensive, I have marked it with `@Tag("slow")`. The test can be run directly from an IDE, or from the command line using the `slow-tests` profile: ./mvnw -Pslow-tests
I verified that apache/httpcomponents-core@ac42dec and apache/httpcomponents-core@5de2967 fix the lock-up on my machine. Additionally, I tested b16d6d032d7e2b7e2f2b0f94c5d01375cb1ab21b, using
|
@rschmitt Wait a minute. I did make the HTTP/1.1 duplexer read incoming data greedily with apache/httpcomponents-core@be71c3a. What I did not want to do is to add an extra read at the end of the message only to make things marginally better for endpoints with broken HTTP/1.1 connection persistence while adding an overhead of an unnecessary i/o operation to well-behaving ones. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rschmitt I leave it up to you if you want to merge the change-set in its present state or not.
I'll take another look at it. The test cases can probably be made more consistent (and therefore less confusing) without changing their behavior. |
No description provided.