The Architect

A Technical Architect's Thoughts

March 1st, 2010

I have been looking into middleware solutions as a push mechanism between server and client.  One of the aspects I had to consider was latency, so it was important to find a lean, technology-agnostic transport format.  Most server-client platforms serialize data into a leaner format before sending it, and deserialize it on the receiving end.  So in my search for a technology I also had to consider the speed of serialization.

Many languages offer native serialization APIs, but when data is serialized using the native API, metadata about the class is written into the output too.  I needed a technology that would serialize only the data values and not the additional metadata about the serialized object.
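To make the metadata overhead concrete, here is a minimal sketch in Python, chosen only for illustration.  The Quote class, its fields and the fixed byte layout are all hypothetical; the point is simply that the native serializer writes class and attribute names into the output, while a values-only layout does not.

```python
import pickle
import struct
from dataclasses import dataclass


@dataclass
class Quote:
    """Hypothetical message type, used only for this illustration."""
    instrument_id: int
    price: float
    volume: int


quote = Quote(instrument_id=42, price=101.25, volume=500)

# Native serialization: the output carries the class name and attribute names.
native = pickle.dumps(quote)

# Values only: unsigned 32-bit int, 64-bit float, unsigned 32-bit int,
# little-endian with no padding. Both ends must already agree on this layout.
LAYOUT = struct.Struct("<IdI")
values_only = LAYOUT.pack(quote.instrument_id, quote.price, quote.volume)

print("native bytes:     ", len(native))
print("values-only bytes:", len(values_only))
print("attribute name present in native output:", b"instrument_id" in native)

# The receiver rebuilds the object from the agreed layout.
restored = Quote(*LAYOUT.unpack(values_only))
```

The trade-off is that the values-only form is meaningless without the layout, which is exactly the gap a shared schema is meant to fill.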

I also needed to identify the best data format to serialize to.  XML (SOAP), strings and data dictionaries are common data formats, but a byte array is far more compact, and is the proper serialization format when the client and server platforms are built on the same technology.
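As a rough illustration of the size difference between text formats and a byte array, the sketch below encodes the same record three ways using only the Python standard library.  The record, its field names and the binary layout are assumptions made up for the example, and this is not a benchmark.

```python
import json
import struct
import xml.etree.ElementTree as ET

# Hypothetical price tick; the field names and layout are assumptions.
tick = {"instrument_id": 42, "price": 101.25, "volume": 500}

# Text formats repeat the field names in every message.
as_json = json.dumps(tick).encode("utf-8")

root = ET.Element("tick")
for name, value in tick.items():
    ET.SubElement(root, name).text = str(value)
as_xml = ET.tostring(root, encoding="utf-8")

# Byte array: values only, in an order and width agreed on beforehand.
as_bytes = struct.pack(
    "<IdI", tick["instrument_id"], tick["price"], tick["volume"]
)

for label, payload in (("xml", as_xml), ("json", as_json), ("bytes", as_bytes)):
    print(f"{label:5} {len(payload):3} bytes")
```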

I came across two technologies: ‘Google Protocol Buffers’ and ‘Apache Avro’.

Google Protocol Buffers

Protocol Buffers is a serialisation format with an interface description language, developed by Google.  It is available under a free, open source license.  Its design goals emphasize performance and simplicity.  It is a language-neutral, platform-neutral, extensible mechanism for serializing structured data.

You define how you want your data to be structured in proto files, which are simply structured text files.  Once you have decided on the structure in your proto file, the protocol buffer compiler is run on it, and a generated class (Adobe Actionscript 3, Java, C, C++, Python) is produced.  The class can be generated for multiple technologies, which means it can be generated for both the client and the server, thus securing a type-safe data contract between the two.  Protocol Buffers also lets you update the data structure without breaking deployed programs that are compiled against the old format.
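One reason old programs keep working is that every field on the wire is tagged with its field number and a wire type, so a reader can skip fields it does not know about.  The sketch below is a hand-rolled Python illustration of that idea, not the code the compiler generates; the field names in OLD_SCHEMA are hypothetical, and the sample bytes represent a message from a newer writer that has added field 3.

```python
from typing import Dict, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_known_fields(buf: bytes, known: Dict[int, str]) -> Dict[str, object]:
    """Decode the fields listed in `known`; skip any other field number."""
    out: Dict[str, object] = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:        # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:      # length-delimited (strings, bytes, nested)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        elif wire_type == 1:      # fixed 64-bit, kept as raw bytes here
            value = buf[pos:pos + 8]
            pos += 8
        elif wire_type == 5:      # fixed 32-bit, kept as raw bytes here
            value = buf[pos:pos + 4]
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field_number in known:
            out[known[field_number]] = value
    return out


# Written by a newer program: field 1 = 150, field 2 = "testing", field 3 = 1.
new_message = bytes.fromhex("089601" "120774657374696e67" "1801")

# The old reader only knows fields 1 and 2 (hypothetical names).
OLD_SCHEMA = {1: "id", 2: "name"}
decoded = decode_known_fields(new_message, OLD_SCHEMA)
print(decoded)
```

The old reader never sees field 3, yet it still decodes the fields it understands without error.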

Google's documentation claims that a small example message takes between 100 and 200 nanoseconds to parse.  As the overhead of describing the data structure is not needed, only the object's field values are serialized, each preceded by a small numeric tag.  Protocol Buffers uses a compact encoding for each primitive data type, and only serializes fields that have been set.
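To show what a compact encoding means in practice, here is a minimal Python sketch of the variable-length integer (varint) encoding that Protocol Buffers uses for integer fields.  The helper names are my own and this is not the library's API; in real code the generated classes do this for you.  Small numbers take fewer bytes, and an unset field contributes no bytes at all.

```python
from typing import Optional


def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 bits per byte, high bit set on all but the last."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # last byte
            return bytes(out)


def encode_uint_field(field_number: int, value: Optional[int]) -> bytes:
    """Encode one integer field, or nothing at all if the field is unset."""
    if value is None:
        return b""
    key = (field_number << 3) | 0    # wire type 0 = varint
    return encode_varint(key) + encode_varint(value)


small = encode_uint_field(1, 1)
larger = encode_uint_field(1, 300)
unset = encode_uint_field(2, None)
print(len(small), len(larger), len(unset))
```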

Apache Avro

Avro is another very recent serialisation system.  It provides rich data structures and a compact binary data format.

Avro relies on a schema-based system that defines a data contract to be exchanged.  When Avro data is read, the schema used when writing it is always present.  Similar to Protocol Buffers, it is only the values in the data structure that are serialized and sent.  The strategy employed by Avro (and Protocol Buffers) means that a minimal amount of data is generated, enabling fast transport.

The schemas are equivalent to Protocol Buffers' proto files, but no code has to be generated from them.  The schemas are declared in JSON.
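For a feel of what a JSON schema looks like and how the values alone end up on the wire, here is a hand-rolled Python sketch.  It does not use the Avro library; the Tick record and its fields are hypothetical, and only the string and long types are handled.  It follows the binary encoding described in the Avro specification: longs are zig-zag encoded and written as variable-length integers, strings are written as a length followed by UTF-8 bytes, and a record is simply its fields in schema order.

```python
import json

# A record schema in Avro's JSON notation (hypothetical record and fields).
SCHEMA_JSON = """
{
  "type": "record",
  "name": "Tick",
  "fields": [
    {"name": "symbol", "type": "string"},
    {"name": "volume", "type": "long"}
  ]
}
"""


def encode_long(value: int) -> bytes:
    """Zig-zag the value, then write it as a base-128 variable-length int."""
    zigzag = (value << 1) ^ (value >> 63)
    out = bytearray()
    while True:
        low7 = zigzag & 0x7F
        zigzag >>= 7
        if zigzag:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_string(value: str) -> bytes:
    """Length as a long, followed by the UTF-8 bytes."""
    data = value.encode("utf-8")
    return encode_long(len(data)) + data


ENCODERS = {"string": encode_string, "long": encode_long}


def encode_record(schema: dict, record: dict) -> bytes:
    """Concatenate field values in schema order; no names or tags are written."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]])
        for field in schema["fields"]
    )


schema = json.loads(SCHEMA_JSON)
payload = encode_record(schema, {"symbol": "foo", "volume": 500})
print(payload.hex())
```

Because nothing in the payload identifies the fields, the reader needs the writer's schema to make sense of the bytes.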

Results

I ran a few benchmark tests and concluded that the distinction to be made between the two comes down to implementation, extensibility and compatibility.

Implementation: Protocol Buffers was a much cleaner implementation than Avro.  Avro was messy, with limited online resources available.  Avro uses a JSON object in string form to represent a schema.  Defining an Avro schema is cumbersome and difficult to maintain, and it increases the risk of runtime errors when the structure isn’t quite right.  The contract is not type safe, and it becomes very easy to set values of the wrong type against object fields.  Such errors can only be caught at runtime, rather than at compile time.

Google’s Protocol Buffers does not have such complexities.  The protocol buffer compiler reports an error in the definition as soon as it is run.  Protocol Buffers allows nullable fields (something that Avro doesn’t), which means that when Protocol Buffers is serializing, it ignores fields that are null, and thus avoids the overhead of serializing irrelevant data (unlike Avro).

Winner – Google’s Protocol Buffers

Extensibility: Google’s Protocol Buffers provides a much richer API for defining a data contract than Avro.  Below is a list of features available in Protocol Buffers and not in Avro:

  1. Declare nested types
  2. Define required, repeated and optional fields
  3. Specify default values on fields
  4. Declare enumerations and set a field’s default value from one
  5. Multiple message types in the same document
  6. Import other proto files
  7. Declare a range of field numbers in a message available for third party extensions (Extensions)
  8. Nested Extensions
  9. Define services

Winner – Google’s Protocol Buffers

Compatibility: Avro is only compatible with C, Java and Python, and hence restricts the choice of client technology, although support for other languages is planned.

Protocol Buffers is compatible with C, C++, Adobe Actionscript 3, Java and Python.  As a C++ version is available, Microsoft Silverlight and WPF are therefore compatible with Google’s Protocol Buffers, and there are projects to port a Protocol Buffer compiler to C# and other technologies.

Winner – Google’s Protocol Buffers

2 Responses to “Google Protocol Buffers vs Apache Avro”

    1. Jason Madsen says:

      Cool. Thanks for the article. I am new at development and this got me straight.

    2. Daniel Cohen says:

      “Protocol Buffers is compatible with C, C++, Adobe Actionscript 3, Java and Python.”

      Cheers to that. I do not disagree per se, for lack of knowledge on my part. However we are actively seeking the best way to use Protocol Buffers with a Flex client and have not found a solution short of rolling our own. Is there an AS3 library out there for Protocol Buffers that my team has missed?

      Much obliged.
