Friday, June 6, 2014

I love to debug stacks

I'm still chasing down why aufrecht.org takes over six seconds to load.  Since I have an Enterprise-Grade setup, it's an Enterprise-Grade hassle to debug.  Basically, the layers of stuff that happen to serve up a web page are:

  1. My browser
  2. The internet
  3. nginx
  4. Gunicorn
  5. python/django/django_cms
  6. the database
So the first step in debugging is to figure out which level the problem is happening at, with the caveat that, if the simplified model above is not appropriate to the actual problem, it will cause me to look in the wrong place for a very long time.

Apropos of that: I started debugging by noticing that, when I browsed to a page, it took six seconds for the nginx log to do anything.  Naturally, I suspected The Internet and wasted some time looking there.  But then I realized that I didn't know if nginx was logging when it first heard my request, or when it finished answering it.  I turned on detailed logging (a few changes to the config file and a restart), which gives me really good stuff like

2014/06/06 20:29:01 [debug] 7276#0: post event 00000000011555D0
2014/06/06 20:29:01 [debug] 7276#0: delete posted event 00000000011555D0
2014/06/06 20:29:01 [debug] 7276#0: accept on 0.0.0.0:80, ready: 0
2014/06/06 20:29:01 [debug] 7276#0: posix_memalign: 000000000110CFD0:256 @16
2014/06/06 20:29:01 [debug] 7276#0: *1091 accept: 50.174.84.3 fd:3
[300 more lines like this]
2014/06/06 20:29:02 [debug] 7276#0: *1091 writev: 374
2014/06/06 20:29:02 [debug] 7276#0: *1091 chain writer out: 0000000000000000
2014/06/06 20:29:02 [debug] 7276#0: *1091 event timer del: 8: 1402101002127
2014/06/06 20:29:02 [debug] 7276#0: *1091 event timer add: 8: 60000:1402101002127
2014/06/06 20:29:08 [debug] 7276#0: *1091 post event 00000000011556A0
2014/06/06 20:29:08 [debug] 7276#0: *1091 post event 0000000001168EB0
2014/06/06 20:29:08 [debug] 7276#0: *1091 delete posted event 0000000001168EB0
2014/06/06 20:29:08 [debug] 7276#0: *1091 http upstream request: "/foobar/?"
201
[almost 200 more lines]
2014/06/06 20:29:08 [debug] 7276#0: *1091 malloc: 0000000001115530:1024
2014/06/06 20:29:08 [debug] 7276#0: *1091 recv: fd:3 -1 of 1024
2014/06/06 20:29:08 [debug] 7276#0: *1091 recv() not ready (11: Resource temporarily unavailable)
2014/06/06 20:29:08 [debug] 7276#0: *1091 free: 0000000001115530


I've trimmed a lot to show you the exciting part.  Notice anything about the timestamps?

Yes, there's the six-second gap.  The nginx debug log responds almost immediately to the request, then waits for something for six seconds, then replies.  So I can rule out The Internet (almost certainly) and nginx (probably) as locations for the delay.  Now I have to figure out what gunicorn is doing.

That actually turns out to be equally simple.  I turn on debugging and restart it, and I can see that the log responds only a second after I make a request.  Okay, so it's Django or Python or Django-cms that's taking six seconds.  Let's turn on the next debug log and look.

Um.  Okay, it doesn't seem like this is going to be simple.  There's lots of instructions on how to write some complicated stuff to the configuration that ... does nothing more than enable you to then add your own debugging commands within your own code.  Ack.  This is not going to be easy.  I am going to need Enterprise-Grade chocolate.

UPDATE an hour later (because this is so boring I'm not going to make another post altogether for this UPDATE): when I disable memcache, the page loads in about one second.  Good enough for now; making memcache help instead of hurt goes on the backlog for later.

No comments :

Post a Comment