Archive for June, 2014

Transparently use avro schemata (.avsc) files in a python module

Sunday, June 8th, 2014

One of the cool things about avro is that it has bindings in a couple of different languages. However, I think the only one that has native code generation support for working with avro objects is Java, which makes working with avro in the other languages a bit harder. Here’s a simple way to load your schemata dynamically (and if you don’t want to write your schemata by hand, then using maven you can generate it from AVDL files using the tip from my previous post).

What this bit of code does is overrides the __getattr__ function on the module, so anytime you try to access a type on the module, it will attempt to load the avro schema from a file of the same name with the avsc extension. To use this code, create a file called __init__.py in your directory of .avsc files, and paste the following code in.

import sys
from os.path import join, dirname
import avro.schema

class AvroSchemaLoader(object):
    '''
        This object allows us to lazily load schemata files in the current
        directory and parse them as needed.
        
        It is intended to be used as a replacement of the current module in
        sys.modules, so usage of this object should be transparent to users.
        
        For example, to access the Foo wrapper object, you would do the
        following:
        
            >>> from this_dir_name import Foo
            >>> print type(Foo)
            <avro.schema.RecordSchema at ...>
            >>>
    '''
    
    def __init__(self, module):
        # things break in odd ways if you don't keep a reference to the module here
        self.__module = module  

    def __getattr__(self, name):
        if name.startswith('__'):
            return object.__getattr__(self, name)
        
        with open(join(dirname(__file__), '%s.avsc' % name), "r") as fp:
            schema = avro.schema.parse(fp.read())
        
        setattr(self, name, schema)
        return schema


# Replace this module instance with the dynamic loader
sys.modules[__name__] = AvroSchemaLoader(sys.modules[__name__])

There’s a lot you can do to make this better — like load a wrapper around the schema instead of using the schema directly. I’ll leave that as an exercise for the reader. 🙂

Automatically generating avro schemata (avsc files) using maven

Sunday, June 8th, 2014

I’ve been using avro for serialization a bit lately, and it seems like a really useful, flexible, and performant technology. To use avro containers, you have to define a schema for them — but writing out JSON files is a bit of a pain. Avro provides an IDL that you can use to specify the object types instead, and it’s much easier to work with. The avro-maven-plugin is quite useful because you can automatically generate Java objects from the IDL files — but what if you’re working with the same Avro files in a different language that can’t use the IDL?

Until they add the functionality to the maven plugin, there’s an easy way you can automate this yourself using a bit of maven magic. First thing to do is add the following dependency to your project’s pom.xml

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-tools</artifactId>
  <version>1.7.6</version>
</dependency>

Next, you need to add a simple little class that converts an entire directory from avdl files to avsc files. Avro-tools ships with a useful class called IdlToSchemataTool that will convert a single file for you, so converting an entire directory is just a simple wrapper around that. There is a bit of improvement that could be done here, but this gets the job done assuming your directory only has avdl files in it.

package main;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.tool.IdlToSchemataTool;

/**
 * Converts an entire directory from Avro IDL (.avdl) to schema (.avsc)
 */
public class ConvertIdl {

	public static void main(String [] args) throws Exception {
		IdlToSchemataTool tool = new IdlToSchemataTool();
		
		File inDir = new File(args[0]);
		File outDir = new File(args[1]);
		
		for (File inFile: inDir.listFiles()) {
			List<String> toolArgs = new ArrayList<String>();
			toolArgs.add(inFile.getAbsolutePath());
			toolArgs.add(outDir.getAbsolutePath());
			
			tool.run(System.in, System.out, System.err, toolArgs);
		}
	}
}

Finally, you add the following to the plugins section of pom.xml to actually generate the avsc files. This uses the exec-maven-plugin to run the class we created above during compilation. This configuration assumes that you are storing your avdl files in src/main/avro, and that you want to place the files in schemata. Obviously you can reconfigure this however you want.

<plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>exec-maven-plugin</artifactId>
    <version>1.3</version>
    <executions>
        <execution>
            <phase>compile</phase>
            <goals>
                <goal>java</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <mainClass>main.ConvertIdl</mainClass>
        <arguments>
            <argument>${project.basedir}/src/main/avro/</argument>
            <argument>${project.basedir}/schemata/</argument>
        </arguments>
    </configuration>
</plugin>

And that’s it! To actually convert your avdl files to avsc files, run ‘mvn compile’ and the output directory should be filled with avsc files containing the JSON schema for your avro containers. Hope this helps you out, let me know if you find any bugs or have improvements.