Ticket #862 (closed defect: fixed)

Opened 12 years ago

Last modified 12 years ago

Pylons Server on glue regularly freezes

Reported by: spascoe Owned by: pjkersha
Priority: blocker Milestone: PROD Final
Component: security Version:
Keywords: Cc: lawrence, spascoe, domlowe

Description

Sometimes logging-in seems to freeze the pylons server. I can't reproduce the problem reliably but paster.log always ends with routes logging a call to POST /getCredentials:

DEBUG:routes.middleware:Matched POST /getCredentials
Matched POST /getCredentials
DEBUG:routes.middleware:Route path: 'getCredentials', defaults: {'action': u'getCredentials', 'controller': u'login'}
Route path: 'getCredentials', defaults: {'action': u'getCredentials', 'controller': u'login'}
DEBUG:routes.middleware:Match dict: {'action': u'getCredentials', 'controller': u'login'}
Match dict: {'action': u'getCredentials', 'controller': u'login'}
DEBUG:beaker.container:[811b6d5b49d81e1495dab81d8f345ac0] data file /var/www/ndg/data/sessions/container_file/8/81/811b6d5b49d81e1495dab81d8f345ac0.cache

[811b6d5b49d81e1495dab81d8f345ac0] data file /var/www/ndg/data/sessions/container_file/8/81/811b6d5b49d81e1495dab81d8f345ac0.cache

Change History

comment:1 Changed 12 years ago by lawrence

  • Owner changed from lawrence to pjkersha

comment:2 Changed 12 years ago by lawrence

  • Priority changed from critical to blocker

Phil this is making life quite difficult, can you make it a priority to address this!

comment:3 Changed 12 years ago by lawrence

It freezes up sandboxes too, for example, this is what my sandbox does:

Proxy Error
The proxy server received an invalid response from an upstream server.
 The proxy server could not handle the request POST /getCredentials.
 Reason: Error reading from remote server

(i.e. Stephen's analysis is correct)

comment:4 Changed 12 years ago by pjkersha

  • Status changed from new to assigned

This is difficult to reproduce. I can't reproduce on my sandbox but can get it intermittently on glue: the first time I saw it and retried with the same URL but it then worked. :(

I have experimented with changing the action on the form for login.kid. Previously this had a relative URL

action="/getCredentials?$r=${c.returnTo}"

I've changed this to:

action="$g.getCredentials?r=${c.returnTo}"

(with respective update to ows_server.config.ndgMiddleware to add new global) and have not seen the error again. I'll continue to test.

Nb. this fix is active on hacked version of ows_server glue egg but not in SVN. I'll add once I've tested some more.

comment:5 follow-up: ↓ 6 Changed 12 years ago by lawrence

Ok, I'm getting it again, and now with better logging in getCredentials it's hanging on the call :

sessID = smClnt.connect(username, passphrase=passphrase)[-1]

but if I got back a couple of pages, and try again, it works, so it's really intermittent. Next time this happens, can I see logs for what the smClnt is trying to connect to ...?

comment:6 in reply to: ↑ 5 ; follow-up: ↓ 7 Changed 12 years ago by pjkersha

Replying to lawrence:

Ok, I'm getting it again, and now with better logging in getCredentials it's hanging on the call :

sessID = smClnt.connect(username, passphrase=passphrase)[-1]

but if I got back a couple of pages, and try again, it works, so it's really intermittent. Next time this happens, can I see logs for what the smClnt is trying to connect to ...?

Will put more in. I think it needs logging insode the call to SessionMgr.connect.

I've been seeing the problem with disconnect as well called from Discovery logout but I don't see it with the Gatekeeper which also uses the Session Manager in a getAttCert call.

It could be related to the fact the SM runs over https. Something to do with timeouts with SSL. Just a thought at this stage.

comment:7 in reply to: ↑ 6 ; follow-up: ↓ 8 Changed 12 years ago by pjkersha

  • Cc lawrence added

Replying to pjkersha:

Replying to lawrence:

Ok, I'm getting it again, and now with better logging in getCredentials it's hanging on the call :

sessID = smClnt.connect(username, passphrase=passphrase)[-1]

but if I got back a couple of pages, and try again, it works, so it's really intermittent. Next time this happens, can I see logs for what the smClnt is trying to connect to ...?

Will put more in. I think it needs logging insode the call to SessionMgr.connect.

I've been seeing the problem with disconnect as well called from Discovery logout but I don't see it with the Gatekeeper which also uses the Session Manager in a getAttCert call.

It could be related to the fact the SM runs over https. Something to do with timeouts with SSL. Just a thought at this stage.

I've switched the Session Manager to run over http instead of https to test the above.

SessionMgrClient.connect in use an extra print statement which outputs to the paster log this is set immediately before the call to the ZSI generated stub.

The file is: /usr/local/lib/python2.4/site-packages/ndg_security_common-0.8.4_beta_r2855-py2.4.egg/ndg/security/common/SessionMgr/__init__.py

All ZSI security stubs are set to output SOAP request and responses. When the freeze happens there is no request message suggesting something goes wrong before the message is dispatched to the Session Manager server.

comment:8 in reply to: ↑ 7 ; follow-up: ↓ 9 Changed 12 years ago by pjkersha

  • Cc spascoe, domlowe added

Replying to pjkersha:

Replying to pjkersha:

Replying to lawrence:

Ok, I'm getting it again, and now with better logging in getCredentials it's hanging on the call :

sessID = smClnt.connect(username, passphrase=passphrase)[-1]

but if I got back a couple of pages, and try again, it works, so it's really intermittent. Next time this happens, can I see logs for what the smClnt is trying to connect to ...?

Will put more in. I think it needs logging insode the call to SessionMgr.connect.

I've been seeing the problem with disconnect as well called from Discovery logout but I don't see it with the Gatekeeper which also uses the Session Manager in a getAttCert call.

It could be related to the fact the SM runs over https. Something to do with timeouts with SSL. Just a thought at this stage.

I've switched the Session Manager to run over http instead of https to test the above.

SessionMgrClient.connect in use an extra print statement which outputs to the paster log this is set immediately before the call to the ZSI generated stub.

The file is: /usr/local/lib/python2.4/site-packages/ndg_security_common-0.8.4_beta_r2855-py2.4.egg/ndg/security/common/SessionMgr/__init__.py

All ZSI security stubs are set to output SOAP request and responses. When the freeze happens there is no request message suggesting something goes wrong before the message is dispatched to the Session Manager server.

I've now updated M2Crypto from NDG customised 0.16 -> standard 0.18. https is re-instated for the Session Manager. Initial tests show no return of the freeze problem but this is no guarantee given the intermittent nature of the problem.

Another change is that the SSL connection is now set with a 3 second timeout. The default was 600(!)s. If this is the cause of the freezing then we should now at least get an exception rather then pylons locking up.

comment:9 in reply to: ↑ 8 Changed 12 years ago by pjkersha

  • Status changed from assigned to closed
  • Resolution set to fixed

Replying to pjkersha:

Replying to pjkersha:

Replying to pjkersha:

Replying to lawrence:

Ok, I'm getting it again, and now with better logging in getCredentials it's hanging on the call :

sessID = smClnt.connect(username, passphrase=passphrase)[-1]

but if I got back a couple of pages, and try again, it works, so it's really intermittent. Next time this happens, can I see logs for what the smClnt is trying to connect to ...?

Will put more in. I think it needs logging insode the call to SessionMgr.connect.

I've been seeing the problem with disconnect as well called from Discovery logout but I don't see it with the Gatekeeper which also uses the Session Manager in a getAttCert call.

It could be related to the fact the SM runs over https. Something to do with timeouts with SSL. Just a thought at this stage.

I've switched the Session Manager to run over http instead of https to test the above.

SessionMgrClient.connect in use an extra print statement which outputs to the paster log this is set immediately before the call to the ZSI generated stub.

The file is: /usr/local/lib/python2.4/site-packages/ndg_security_common-0.8.4_beta_r2855-py2.4.egg/ndg/security/common/SessionMgr/__init__.py

All ZSI security stubs are set to output SOAP request and responses. When the freeze happens there is no request message suggesting something goes wrong before the message is dispatched to the Session Manager server.

I've now updated M2Crypto from NDG customised 0.16 -> standard 0.18. https is re-instated for the Session Manager. Initial tests show no return of the freeze problem but this is no guarantee given the intermittent nature of the problem.

Another change is that the SSL connection is now set with a 3 second timeout. The default was 600(!)s. If this is the cause of the freezing then we should now at least get an exception rather then pylons locking up.

Problem has not recurred since upgrade to M2Crypto version 0.18 - closing ticket

Note: See TracTickets for help on using tickets.