Compiled Drupal core in Zephir

Some of you might be acquainted with Phalcon - "The fastest PHP MVC in the world!!!". The reason it can be so fast is that its core is compiled into a PHP extension rather than being interpreted at run time (or cached with APC/OPcache). Writing extensions usually takes a really long time, which would not really be feasible for most projects. But Phalcon also introduced Zephir, a high-level mix of C and PHP that makes it pretty easy to write fast extensions. While learning the language I thought I would try to turn some of the Drupal core functions into a PHP extension, just to see how they would perform.

tl;dr

So if you are not really interested in a wall of text, but would rather have a wall of code, it can be found here: https://github.com/ivanboring/Drupal-Zephir

Some basics first

Interpreted code

PHP is an interpreted language: the source code is translated into byte code, which is then executed to produce whatever result you want. If you add an opcode cache such as APC/OPcache on top of that, the byte code is stored in memory, so the next time the code runs it doesn't have to be translated again.

JIT (Just-in-time)

JIT compilers basically take the interpreted code you have written, compile it on the fly and then execute it as native code. This can be done for PHP by running Facebook's HHVM. This usually brings down both the CPU usage and the loading time for normal PHP scripts. The advantage of JIT compilation compared to normal compilation is that you can save your code and try it right away, without a separate compilation step.

Compiled Extensions

PHP itself is written in C, and anyone can extend it with PHP extensions by writing their own C code. For instance, the popular cURL and GD libraries are available as extensions on top of PHP. While the compiled code will most certainly be faster than the same code written in PHP, the drawbacks are that you have to learn C, you usually have to write a lot more code than in PHP, and you have to recompile every time you make a change.
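
To see which compiled extensions a given PHP build actually has, a small check like the one below can help. It only uses built-in functions (extension_loaded() and get_extension_funcs()); the helper name describe_extension() is made up for this illustration, and whether curl or gd shows up as loaded depends on your installation.

```php
<?php
// Hypothetical helper: report whether a compiled extension is loaded
// and how many functions it exposes to PHP code.
function describe_extension(string $name): string {
  if (!extension_loaded($name)) {
    return "$name: not loaded";
  }
  // get_extension_funcs() returns an array of function names, or FALSE.
  $funcs = get_extension_funcs($name);
  $count = is_array($funcs) ? count($funcs) : 0;
  return "$name: loaded, $count functions";
}

// 'standard' is part of every PHP build; curl and gd are optional.
foreach (array('curl', 'gd', 'standard') as $ext) {
  echo describe_extension($ext), PHP_EOL;
}
```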

Analogy

So to make a fairly stupid analogy for the above cases: think of yourself as being bilingual in German and English, and think of your brain as a CPU. You have a German book that you have to translate into English in front of a crowd. The analogy would be something like this:

Interpreted code - Please translate the book ad-lib. The advantage is that the crowd can listen right away. The disadvantage is that you might not use the perfect translations and that your brain will have to work full time.

Interpreted code with cache - This could be compared to having read the book once before. Still not a perfect translation, but your brain will not have to work as hard since you remember much of it from last time.

JIT - You have Google Translate at your disposal and can just run the text through that. It will still not be a perfect translation, but your brain will hardly have to work at all since you are just reading out the translation.

Compiled Extension - You have time to translate the text beforehand. The translation will be perfect, your brain will just be used when you translate, not so much when you read. The disadvantage of this is that you have to translate everything beforehand. This also applies to each change to the text.

So where do Zephir and Drupal fit into this?

Zephir is a PHP extension compiler that is really simple to use compared to writing C code. It does not require you to write the same amount of code, and it even gives you the flexibility to call into PHP's user space, meaning that you can still run your str_replace() or substr(). Zephir also reimplements some of the PHP functions as native Zephir methods, for instance array_reverse() or strlen(). The advantage of all this, of course, is that you can take the reusable and static parts of your code and rewrite them in Zephir.

A perfect example of this would be to rewrite the Drupal 7 core, since it is pretty stable today and since it is reused everywhere. This is what I started to do, as an exercise to learn some Zephir and also to do some benchmarking to see if we could speed up Drupal a bit.

There are many aspects of Zephir that you have to take into account. An important one that comes to mind when working with Drupal is not to use the PHP user space extensively (and thus not to reuse Drupal's internal functions). The problem with rewriting and benchmarking Drupal in Zephir is that Drupal is incredibly nested, so it's hard to get an overview of exactly where the bottlenecks are and what can be converted without having to convert many other functions as well.

Example of code - drupal_strip_dangerous_protocols()

So drupal_strip_dangerous_protocols() is basically a function that does what it sounds like - it strips dangerous protocols from a URL string. In Drupal's native PHP code it looks like this:

function drupal_strip_dangerous_protocols($uri) {
  static $allowed_protocols;

  if (!isset($allowed_protocols)) {
    $allowed_protocols = array_flip(variable_get('filter_allowed_protocols', array('ftp', 'http', 'https', 'irc', 'mailto', 'news', 'nntp', 'rtsp', 'sftp', 'ssh', 'tel', 'telnet', 'webcal')));
  }

  // Iteratively remove any invalid protocol found.
  do {
    $before = $uri;
    $colonpos = strpos($uri, ':');
    if ($colonpos > 0) {
      // We found a colon, possibly a protocol. Verify.
      $protocol = substr($uri, 0, $colonpos);
      // If a colon is preceded by a slash, question mark or hash, it cannot
      // possibly be part of the URL scheme. This must be a relative URL, which
      // inherits the (safe) protocol of the base document.
      if (preg_match('![/?#]!', $protocol)) {
        break;
      }
      // Check if this is a disallowed protocol. Per RFC2616, section 3.2.3
      // (URI Comparison) scheme comparison must be case-insensitive.
      if (!isset($allowed_protocols[strtolower($protocol)])) {
        $uri = substr($uri, $colonpos + 1);
      }
    }
  } while ($before != $uri);

  return $uri;
}
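
To get a feel for what the function does, here is a standalone sketch that can be run outside Drupal. It repeats the same loop but takes the allowed list as a parameter instead of calling variable_get(); the name strip_dangerous_protocols_demo(), the shortened allowed list and the sample URIs are mine, not Drupal's.

```php
<?php
// Standalone copy of the loop above for experimenting outside Drupal.
// Assumption: the allowed protocols are passed in as a parameter instead
// of being read with variable_get().
function strip_dangerous_protocols_demo($uri, array $allowed) {
  $allowed_protocols = array_flip($allowed);
  do {
    $before = $uri;
    $colonpos = strpos($uri, ':');
    if ($colonpos > 0) {
      $protocol = substr($uri, 0, $colonpos);
      // A slash, question mark or hash before the colon means this is a
      // relative URL, so there is no scheme to check.
      if (preg_match('![/?#]!', $protocol)) {
        break;
      }
      // Scheme comparison is case-insensitive.
      if (!isset($allowed_protocols[strtolower($protocol)])) {
        $uri = substr($uri, $colonpos + 1);
      }
    }
  } while ($before != $uri);
  return $uri;
}

$allowed = array('http', 'https', 'mailto');
$samples = array(
  'http://example.com/',
  'javascript:alert(1)',
  'javascript:javascript:alert(1)',
  'node/1?destination=a:b',
);
foreach ($samples as $uri) {
  echo $uri, ' => ', strip_dangerous_protocols_demo($uri, $allowed), PHP_EOL;
}
```

The allowed and relative URIs come back unchanged, while both javascript: samples are reduced to alert(1). The second one is why the function loops until nothing changes any more.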

So the first thing I tried was to translate it 1 to 1 into Zephir, except for the part that uses the PHP static variable. The code looked something like this:

public static final function drupal_strip_dangerous_protocols(string uri, array allowed_protocols) {
    // Zephir requires local variables to be declared before use.
    var before, colonpos, protocol;

    while(true) {
        let before = uri;
        let colonpos = strpos(uri, ":");
        if colonpos > 0 {
            let protocol = substr(uri, 0, colonpos);

            if(preg_match("![/?#]!", protocol)) {
                break;
            }

            if !isset allowed_protocols[strtolower(protocol)] {
                let uri = substr(uri, (colonpos + 1));
            }
        }
        if(before == uri) {
            break;
        }
    }
    return uri;
}

As you can see I'm constantly calling into the PHP user space, so the result was actually slower than running it in PHP. The user Yoghaki then helped me out by pointing me towards how you should think about Zephir code, by showing his own example. Basically you should try to treat it as a low-level language, and stay away from PHP functions where possible. So with this knowledge I wrote this instead:

public static final function drupal_strip_dangerous_protocols(string uri) {
    string tempuri = "";
    string returnuri = "";
    char ch;
    boolean foundcolon = false;
    boolean allok = false;

    if (self::drupal_strip_dangerous_protocols_allowed_protocols == null) {
        let self::drupal_strip_dangerous_protocols_allowed_protocols = ZephirHelper::array_flip(Bootstrap::variable_get("filter_allowed_protocols", ["ftp", "http", "https", "irc", "mailto", "news", "nntp", "rtsp", "sftp", "ssh", "tel", "telnet", "webcal"]));
    }
    // Note: this lowercases the whole URI, so unlike the PHP version the
    // returned string does not keep its original case.
    let uri = strtolower(uri);

    for ch in uri {
        if allok {
            let returnuri .=  ch;
        }
        elseif ch == '?' || ch == '#' || ch == '/' {
            if foundcolon == true {
                if isset self::drupal_strip_dangerous_protocols_allowed_protocols[tempuri] {
                    let returnuri .= tempuri . ":";
                    let tempuri = "";
                } else {
                    let tempuri = "";
                }
                let foundcolon = false;
            } else {
                let allok = true;
            }
            let returnuri .= ch;
        }
        elseif ch != ':' {
            if foundcolon == true {
                if !isset self::drupal_strip_dangerous_protocols_allowed_protocols[tempuri] {
                    let tempuri = "";
                }
                let foundcolon = false;
            }
            let tempuri .= ch; 
        }
        else {
            let foundcolon = true;
        }
    }
    let returnuri .= tempuri;
    return returnuri;
}

Example results

Now the results were a little bit better. I did some benchmarking and compared the two functions, and the results according to XHProf are as follows:

System   Calls  Incl. wall time  IWall%  Incl. CPU  ICpu%  Incl. MemUse  IMemUse%  Incl. MemPeakUse  IPeakMemUse%
PHP+APC  1000   8941             4.5%    13819      7.1%   10904         0.3%      10360             0.3%
Zephir   1000   4362             2.2%    3131       1.6%   140968        4.0%      10904             0.3%

So as you can see, Zephir is more than 100% faster than PHP opcode-cached with APC and uses less than 25% of the CPU. The total memory consumption is a lot higher, though, but the peak memory used at any one time is almost the same. So these are pretty good results.
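
If you want to try this kind of comparison without XHProf, a rough timing loop is enough for a first impression. The sketch below is not what produced the numbers above; time_calls() is a hypothetical helper, strtolower() is only a stand-in workload, and the class name in the commented-out lines is an assumption, so substitute whatever your compiled extension actually exposes.

```php
<?php
// Hypothetical helper: run a callable $iterations times and return the
// elapsed wall time in milliseconds. hrtime() needs PHP 7.3 or later.
function time_calls(callable $fn, array $args, int $iterations = 1000): float {
  $start = hrtime(true);
  for ($i = 0; $i < $iterations; $i++) {
    call_user_func_array($fn, $args);
  }
  // hrtime(true) returns nanoseconds; convert to milliseconds.
  return (hrtime(true) - $start) / 1e6;
}

// Stand-in workload so the sketch runs on its own.
$php_ms = time_calls('strtolower', array('HTTP://EXAMPLE.COM/'));
printf("strtolower x 1000: %.3f ms\n", $php_ms);

// With Drupal bootstrapped and the extension loaded you would compare, e.g.:
// time_calls('drupal_strip_dangerous_protocols', array($uri));
// time_calls(array('SomeExtensionClass', 'drupal_strip_dangerous_protocols'), array($uri));
```

Wall time measured this way includes the overhead of call_user_func_array() for both candidates, so it is only useful for relative comparisons.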

Conclusions

Zephir brings improvements where it is possible to write a lot of logic without calling other PHP functions. The above function was just such code. The problem is that Drupal is filled with functions that do I/O or rely on specific PHP functionality that you can't translate so easily into Zephir. I had other functions that actually became a lot slower when running them in Zephir. And since Drupal gets its modularity via hooks rather than object-oriented code, the functions still have to exist in the modules for things to work, even if a function is just calling an extension method. This is true for core as well, because the user-contributed modules call the core function API. So you would get a lot of overhead from functions calling extension methods.
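
The wrapper problem looks roughly like this. Both the class name ZephirCore and the function name mymodule_strip_protocols() are placeholders; in the sketch the class is written in plain PHP and does nothing, only so that the example runs on its own, while in a real setup it would be provided by the compiled extension.

```php
<?php
// Placeholder for a class that the compiled extension would provide.
// Written in PHP here only to keep the sketch self-contained.
final class ZephirCore {
  public static function strip_protocols($uri) {
    // The real work would happen in native code; this stub returns its input.
    return $uri;
  }
}

// Modules call plain functions, so a PHP function with the expected name
// still has to exist. Every call pays for this extra hop.
function mymodule_strip_protocols($uri) {
  return ZephirCore::strip_protocols($uri);
}

echo mymodule_strip_protocols('http://example.com/'), PHP_EOL;
```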

Another beef that I have with Zephir is that it's kind of a middle ground - it's not C and it's not PHP. So its use is pretty limited - for Phalcon it's perfect since it's OO; for Drupal's core it might also be good, but it might just as well not be. Someone would need to redo the entire core to see how much can be gained. And if in the end it's only a couple of functions that can be clearly optimized, then it might be better to rewrite them in C to get the optimal speed, even though benchmarks show that C is not that much faster. The third option is to run it all in HHVM, because that gives you a lot of speed as well without having to write and compile a lot of code (though this comes with its own set of problems). Thoughts? You can download the Drupal Zephir code here: https://github.com/ivanboring/Drupal-Zephir

I do believe that it will prove more useful in Drupal 8 than in Drupal 7 because of the new OO architecture, and if Zephir solves the pass-by-reference feature, that would cause less headache for whoever rewrites Drupal in Zephir. But that remains to be seen.