Postmortem

Brayan Florez
3 min read · May 23, 2020

The following is a postmortem for a project called “Web stack debugging #3”, proposed at Holberton School.

Summary

Every GET request was returning a 500 status code (Internal Server Error). As for the impact, no user could have accessed the website, since the whole server was affected by the outage. I’ll explain the timeline and the root cause in the following sections.

Timeline

Timezone: GMT-5

  • Tue, 19 May 2020, 12:00 a.m. - The project was released.
  • Tue, 19 May 2020, 6:00 p.m. - I checked it and I was like…
  • Tue, 19 May 2020, 6:30 p.m. to almost 8:00 p.m. - I read up on the problem and realized it was not that difficult with the amazing tool strace, which shows the system calls a process makes and the errors they return.
  • Tue, 19 May 2020, 10:00 p.m. - Yes, two hours later, after reading all the lines printed by strace and searching for a possible error, which was kinda hard for me (I wear glasses, you might understand me or not), I realized a typo was causing this huge error. So, the solution was fixing the typo in a file name:

class-wp-locale.phpp to class-wp-locale.php

  • Tue, 19 May 2020, 10:20 p.m. - I spent 20 minutes writing the Puppet file I was asked for. I tested it in another container, and voilà, it worked. Everything was working fine, returning a 200 status code (OK). I pushed it to my GitHub account and that was all.

Root cause and resolution

What was really causing the issue was human error: a typo. Maybe when the person was writing the config file, he or she added an extra p at the end of the .php extension, making it .phpp.

Let’s check the steps I followed to solve this problem.

  • I opened two terminals. In the first one, I ran ps -A to list all the processes running on the server, took the Apache2 PID, and attached strace to it with strace -p PID. In the other one, I ran curl -sI 127.0.0.1 to send a request to that IP.
  • I went back to the terminal where strace was running to check the list of system calls and signals received.
  • The problem was in plain sight, but it took me 2 hours to spot, as I said before.
  • The error pointed to the following path: /var/www/html/wp-includes/
  • The solution I came up with was manually changing the file name from class-wp-locale.phpp to class-wp-locale.php, as I mentioned in the timeline section.
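Reading strace output by eye is what cost me the two hours, so here is a minimal sketch of how that search could be automated. It assumes the missing file shows up as a system call failing with ENOENT ("No such file or directory"), which is how strace usually reports a path that does not exist. The function name missing_paths and the sample lines are hypothetical and only illustrate the format; they are not the real capture from the server.

```python
import re

# Matches a failed system call line in strace output, for example:
#   open("/some/path", O_RDONLY) = -1 ENOENT (No such file or directory)
# The optional "[pid 123]" prefix appears when strace follows several processes.
ENOENT_LINE = re.compile(
    r'^(?:\[pid\s+\d+\]\s+)?(\w+)\("([^"]+)".*=\s*-1\s+ENOENT'
)


def missing_paths(strace_lines):
    """Return (syscall, path) pairs for calls that failed with ENOENT."""
    found = []
    for line in strace_lines:
        match = ENOENT_LINE.match(line.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found


# Illustrative lines only (assumed format, not the actual log).
sample = [
    'lstat("/var/www/html/wp-includes/class-wp-locale.phpp", 0x7ffd0) '
    '= -1 ENOENT (No such file or directory)',
    'open("/var/www/html/index.php", O_RDONLY) = 3',
]

for syscall, path in missing_paths(sample):
    print(f"{syscall} could not find {path}")
```

One way to feed it real data would be to save the trace with strace -p PID -o trace.txt while running curl -sI 127.0.0.1 in the other terminal, and then pass the lines of trace.txt to missing_paths.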

Corrective and preventative measures

After changing any file, it is really important to check that everything still works as it did before the change. This might be a common error since it’s a typo, but that’s why tests are so helpful. It would be nice to monitor and test the servers to avoid outages like this one, for example by installing an application such as Datadog that checks server status regularly.
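As a rough idea of what such a check could look like without any external service, here is a small sketch using only the Python standard library. The names fetch_status and is_healthy are hypothetical, and treating only 200 as healthy is an assumption based on the status code this project expected. It is not a replacement for a real monitoring tool.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if the server is unreachable.

    The opener parameter only exists so the function can be tested
    without a real server.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Raised for 4xx/5xx answers, e.g. the 500 seen in this outage.
        return err.code
    except OSError:
        # Connection refused, timeout, DNS failure, and similar.
        return None


def is_healthy(status):
    """Assumption: only a 200 (OK) answer counts as healthy."""
    return status == 200
```

Calling is_healthy(fetch_status("http://127.0.0.1")) right after deploying a change, or on a schedule, would have flagged the 500 immediately instead of leaving it for a user to find.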
