
At mozilla.ai we have developed encoderfiles to make it easy to deploy pure encoders with zero dependencies on multiple platforms. Although encoders are not as glamorous as LLMs, they still power many current pipelines, such as RAG and sentiment analysis. Some prebuilt encoderfiles are already available on Hugging Face, but you can easily build your own from ONNX weights and a JSON config.
Until now, encoderfiles have supported only text processing, for either embeddings or classification. We have expanded the encoderfile project to carry out image tasks, starting with image classification. Object detection and image segmentation will follow soon.
Image Inputs and Interfaces
Image tasks have somewhat different requirements from text tasks. Images are binary and often large, which makes them unsuitable as CLI parameters, so encoderfile reads them from file paths passed as arguments instead.
To reduce the attack surface, we deliberately chose not to fetch images from remote URLs. Because images are binary, they also fit poorly in a JSON body, so the HTTP/S interface takes a multipart request instead. Since gRPC handles binary data natively, no interface changes are needed there.
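To make the multipart idea concrete, here is a minimal sketch of how a client could wrap raw image bytes in a multipart/form-data body using only the Python standard library. The field name `image` and the file name `cat.png` are assumptions for illustration; the names encoderfile actually expects may differ.

```python
import uuid


def build_multipart(field_name, filename, payload, content_type="application/octet-stream"):
    """Encode one binary file as a multipart/form-data body.

    Returns a (body, content_type_header) tuple.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    # The payload is embedded byte for byte, with no text encoding step.
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


# Stand-in bytes, not a valid PNG; field and file names are hypothetical.
fake_png = b"\x89PNG\r\n\x1a\n" + bytes(range(256))
body, header = build_multipart("image", "cat.png", fake_png, "image/png")
```

The body carries the image bytes unchanged, whereas a JSON body would need something like base64, which inflates the payload by roughly a third.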
Preprocessing
Unlike text tasks, image tasks rely more heavily on preprocessing than on postprocessing. For text embeddings, the main extra step comes after the model: some kind of pooling over the individual token embeddings, for which we allow Lua scripts to be included in the encoderfile. Images, by contrast, usually need to be rescaled to a fixed size (commonly 224 x 224, for historical reasons) and normalized from the standard 0-255 byte range to floating-point values using a predefined mean and standard deviation. The number of channels matters too: 1 for grayscale images, 3 for full RGB. We therefore decided to allow preprocessing scripts in Lua as well, which keeps this step flexible, and we include a default that should work out of the box with the most common models.
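The arithmetic behind these steps is small enough to sketch. The following is plain Python rather than encoderfile's Lua scripting interface, and it makes two assumptions: the mean and standard deviation are the widely used ImageNet statistics (the right values depend on the model), and resizing uses nearest-neighbour sampling for brevity where real pipelines often use bilinear interpolation.

```python
# ImageNet channel statistics; an assumption, since they are model-specific.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of an H x W nested list of pixel tuples."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def normalize_to_chw(image, mean=MEAN, std=STD):
    """Scale bytes to [0, 1], standardize per channel, return C x H x W."""
    return [
        [[(pixel[c] / 255.0 - mean[c]) / std[c] for pixel in row] for row in image]
        for c in range(len(mean))
    ]


def preprocess(image, size=224):
    """Resize to size x size, then normalize into channel-first layout."""
    return normalize_to_chw(resize_nearest(image, size, size))


# A tiny 2 x 2 RGB image: red, green on top; blue, white below.
tiny = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (255, 255, 255)]]
tensor = preprocess(tiny)  # 3 channels, each 224 x 224
```

A grayscale model would pass a single-element mean and standard deviation and one-element pixel tuples, which is the kind of variation a scriptable preprocessing step is meant to absorb.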
What’s Next
We plan to seamlessly support special files such as standard input, Unix named pipes, and Windows named pipes, so that other components can connect to the encoderfile and feed it data.
Object detection shares many similarities with image classification, so it feels like a natural next step. The output remains small, since it is just a JSON document holding bounding boxes and class labels. Image segmentation, however, requires sending the user back a potentially large shaded image (a mask) covering each object. Over HTTP and on local standard output, we will again use multipart payloads; for gRPC, returning a blob is even easier.
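As a rough illustration of what treating these sources uniformly could look like, the sketch below reads image bytes either from a path or from standard input. The `read_image_bytes` helper and the `-` convention for standard input are assumptions borrowed from common Unix tools, not encoderfile's actual behaviour.

```python
import os
import sys
import tempfile


def read_image_bytes(source):
    """Read raw bytes from a path, or from standard input when source is "-"."""
    if source == "-":
        # sys.stdin.buffer is the binary view of stdin; no text decoding.
        return sys.stdin.buffer.read()
    # A regular file and a Unix named pipe (FIFO) open the same way;
    # for a pipe, the read finishes when the writer closes its end.
    with open(source, "rb") as handle:
        return handle.read()


# Demonstration with a regular file standing in for the image source.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    with open(path, "wb") as handle:
        handle.write(b"\x89PNG fake bytes")
    data = read_image_bytes(path)
```

Because the caller only ever sees bytes, the same downstream code can serve files, pipes, and standard input.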
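The size difference between the two outputs is easy to see with a quick comparison. The detection schema below (`label`, `score`, `box`) is purely hypothetical, and the mask is assumed to be uncompressed with one byte per pixel at 224 x 224.

```python
import json

# Hypothetical detection result; field names and values are illustrative only.
detections = [
    {"label": "cat", "score": 0.97, "box": [12, 30, 180, 200]},
    {"label": "dog", "score": 0.88, "box": [150, 40, 220, 210]},
]
detection_payload = json.dumps(detections).encode("utf-8")

# A single-channel, uncompressed mask: one byte per pixel.
height, width = 224, 224
mask_payload = bytes(height * width)

print(len(detection_payload), "bytes of JSON vs", len(mask_payload), "bytes of mask")
```

Even at this modest resolution the mask is far larger than the JSON, and it is binary, which is why a multipart payload or a gRPC blob is a better fit than a JSON field.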
This is our current roadmap for image processing in encoderfile. We are eager to explore other tasks and other types of input to help make models easy and efficient to deploy, so please let us know what features you would like to see down the road.