An introduction to DocArray, an open source AI library

DocArray is hosted by the Linux Foundation to provide an inclusive and standard multimodal data model within the open source community and beyond.

DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, and so on. It allows deep-learning engineers to efficiently process, embed, search, store, recommend, and transfer multimodal data with a Pythonic API. As of November 2022, DocArray is open source and hosted by the Linux Foundation AI & Data initiative so that there’s a neutral home for building and supporting an open AI and data community. This is the start of a new day for DocArray.

In the ten months since DocArray’s first release, its developers at Jina AI have seen more and more adoption and contributions from the open source community. Today, DocArray powers hundreds of multimodal AI applications.

Hosting an open source project with the Linux Foundation

Hosting a project with the Linux Foundation follows open governance, meaning there’s no one company or individual in control of a project. When maintainers of an open source project decide to host it at the Linux Foundation, they specifically transfer the project’s trademark ownership to the Linux Foundation.

In this article, I’ll review the history and future of DocArray. In particular, I’ll demonstrate some cool features that are already in development.

A brief history of DocArray

Jina AI introduced the concept of "DocArray" in Jina 0.8 in late 2020. It was the jina.types module, intended to complete neural search design patterns by clarifying low-level data representation in Jina. Rather than working with Protobuf directly, the new Document class offered a simpler and safer high-level API to represent multimodal data.

Image demonstrating a stream of bytes on a network.

(Jina AI, CC BY-SA 4.0)

Over time, we extended jina.types and moved beyond a simple Pythonic interface to Protobuf. We added DocumentArray to ease batch operations on multiple Documents. Then we brought in IO and pre-processing functions for different data modalities, like text, image, video, audio, and 3D meshes. The Executor class started to use DocumentArray for input and output. In Jina 2.0 (released in mid-2021) the design became stronger still. Document, Executor, and Flow became Jina's three fundamental concepts:

• Document is the data IO in Jina.
• Executor defines the logic of processing Documents.
• Flow ties Executors together to accomplish a task.
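
The division of labor between these three concepts can be sketched in plain Python. This is a toy illustration only and not Jina's API: ToyDocument, ToyExecutor, the two executor functions, and run_flow are hypothetical names made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyDocument:
    """Stand-in for a Document: the unit of data passed between steps."""
    text: str
    tags: Dict[str, int] = field(default_factory=dict)


# Stand-in for an Executor: logic that processes a batch of documents.
ToyExecutor = Callable[[List[ToyDocument]], List[ToyDocument]]


def lowercase(docs: List[ToyDocument]) -> List[ToyDocument]:
    """Normalize the text of every document to lowercase."""
    for doc in docs:
        doc.text = doc.text.lower()
    return docs


def count_words(docs: List[ToyDocument]) -> List[ToyDocument]:
    """Store the whitespace-separated word count in each document's tags."""
    for doc in docs:
        doc.tags['n_words'] = len(doc.text.split())
    return docs


def run_flow(executors: List[ToyExecutor], docs: List[ToyDocument]) -> List[ToyDocument]:
    """Stand-in for a Flow: chains executors, feeding each one's output to the next."""
    for executor in executors:
        docs = executor(docs)
    return docs


docs = run_flow([lowercase, count_words], [ToyDocument('Hello Multimodal World')])
print(docs[0].text, docs[0].tags)
```

The point of the split is that each executor only needs to know about documents, and the flow only needs to know about the order of executors.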

The community loved the new design, as it greatly improved the developer experience by hiding unnecessary complexity. This lets developers focus on the things that really matter.

Image of Jina's new design.

(Jina AI, CC BY-SA 4.0)

As jina.types grew, it became conceptually independent from Jina. While jina.types was more about building locally, the rest of Jina focused on service-ization. Trying to achieve two very different targets in one codebase created maintenance hurdles. On the one hand, jina.types had to evolve fast and keep adding features to meet the needs of the rapidly evolving AI community. On the other hand, Jina itself had to remain stable and robust as it served as infrastructure. The result? A slowdown in development.

We tackled this by decoupling jina.types from Jina in late 2021. This refactoring served as the foundation of the later DocArray. It was then that DocArray's mission crystallized for the team: to provide a data structure for AI engineers to easily represent, store, transmit, and embed multimodal data. DocArray focuses on local developer experience, optimized for fast prototyping. Jina scales things up and uplifts prototypes into services in production. With that in mind, Jina AI released DocArray 0.1 in parallel with Jina 3.0 in early 2022, independently as a new open source project.

We chose the name DocArray because we want to make something as fundamental and widely used as NumPy's ndarray. Today, DocArray is the entry point for many multimodal AI applications, like the popular DALLE-Flow and DiscoArt. DocArray developers introduced new and powerful features, such as dataclass and document store, to improve usability even more. DocArray has allied with open source partners like Weaviate, Qdrant, Redis, FastAPI, pydantic, and Jupyter for integration and, most importantly, to seek a common standard.

In DocArray 0.19 (released on Nov. 15), you can easily represent and process 3D mesh data.

Image of the DocArray 0.19 release where you can easily represent and process 3D mesh data.

(Jina AI, CC BY-SA 4.0)

The future of DocArray

Donating DocArray to the Linux Foundation marks an important milestone where we share our commitment with the open source community openly, inclusively, and constructively.

The next release of DocArray focuses on four tasks:

• Representing: support Python idioms for representing complicated, nested multimodal data with ease.
• Embedding: provide smooth interfaces for mainstream deep learning models to embed data efficiently.
• Storing: support multiple vector databases for efficient persistence and approximate nearest neighbor retrieval.
• Transiting: allow fast (de)serialization and become a standard wire protocol on gRPC, HTTP, and WebSockets.

In the following sections, DocArray maintainers Sami Jaghouar and Johannes Messner give you a taste of the next release.

All-in-dataclass

In DocArray, dataclass is a high-level API for representing a multimodal document. It follows the design and idiom of the standard Python dataclass, letting users represent complicated multimodal documents intuitively and process them easily with DocArray's API. The new release makes dataclass a first-class citizen and refactors its old implementation by using pydantic V2.

How to use dataclass

Here's how to use the new dataclass. First, you should know that a Document is a pydantic model with a random ID and the Protobuf interface:

from docarray import Document

To create your own multimodal data type, you just need to subclass Document:

from docarray import Document
from docarray.typing import Tensor
import numpy as np

class Banner(Document):
    alt_text: str
    image: Tensor

# Create a Banner with a caption and a blank 3x224x224 image tensor
banner = Banner(alt_text='DocArray is amazing', image=np.zeros((3, 224, 224)))

Once you've defined a Banner, you can use it as a building block to represent more complicated data:

from typing import List

# Banner is the Document subclass defined above
class BlogPost(Document):
    title: str
    excerpt: str
    banner: Banner
    tags: List[str]
    content: str

Adding an embedding field to BlogPost is easy. You can use the predefined Document models Text and Image, which come with the embedding field baked in:

from typing import Optional
from docarray.typing import Embedding

class Image(Document):
    src: str
    embedding: Optional[Embedding]

class Text(Document):
    content: str
    embedding: Optional[Embedding]

Then you can represent your BlogPost:

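from typing import List

# The banner's image now carries its own embedding
class Banner(Document):
    alt_text: str
    image: Image

# Each Text field (title, excerpt, content) carries an embedding too
class BlogPost(Document):
    title: Text
    excerpt: Text
    banner: Banner
    tags: List[str]
    content: Text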

This gives your multimodal BlogPost four embedding representations: title, excerpt, content, and banner.
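
To see where those four embeddings live in the nested structure, here is a minimal sketch that mirrors the classes above with the standard library's dataclasses instead of DocArray. It is an analogy under stated assumptions, not DocArray's implementation: the embedding_paths helper, the List[float] stand-in for Embedding, and the sample values (such as 'banner.png') are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import List, Optional


@dataclass
class Image:
    src: str
    embedding: Optional[List[float]] = None  # stand-in for docarray's Embedding


@dataclass
class Text:
    content: str
    embedding: Optional[List[float]] = None


@dataclass
class Banner:
    alt_text: str
    image: Image


@dataclass
class BlogPost:
    title: Text
    excerpt: Text
    banner: Banner
    tags: List[str]
    content: Text


def embedding_paths(obj, prefix=''):
    """Recursively list dotted paths of nested objects that have an `embedding` field."""
    paths = []
    for f in fields(obj):
        value = getattr(obj, f.name)
        if is_dataclass(value):
            path = prefix + f.name
            if any(sub.name == 'embedding' for sub in fields(value)):
                paths.append(path)
            # Keep descending: embeddings may sit deeper, as in banner.image
            paths.extend(embedding_paths(value, path + '.'))
    return paths


post = BlogPost(
    title=Text('An introduction to DocArray'),
    excerpt=Text('DocArray is hosted by the Linux Foundation.'),
    banner=Banner(alt_text='DocArray is amazing', image=Image(src='banner.png')),
    tags=['ai', 'python'],
    content=Text('Full article text goes here.'),
)
paths = embedding_paths(post)
print(paths)
```

In this sketch the banner's embedding is found one level down, on banner.image, while title, excerpt, and content hold theirs directly.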

Milvus support

Milvus is an open source vector database hosted under Linux Foundation AI & Data. It's highly flexible, reliable, and blazing fast, and supports adding, deleting, updating, and near real-time search of vectors on a trillion-byte scale. As the first step towards a more inclusive DocArray, developer Johannes Messner has been implementing Milvus integration.

As with other document stores, you can easily instantiate a DocumentArray with Milvus storage:

from docarray import DocumentArray
da = DocumentArray(storage='milvus', config={'n_dim': 10})

Here, config is the configuration for the new Milvus collection, and n_dim is a mandatory field that specifies the dimensionality of stored embeddings. The code below shows a minimal working example that assumes a Milvus server is running on localhost:

import numpy as np
from docarray import DocumentArray
N, D = 5, 128
da = DocumentArray.empty(
   N, storage='milvus', config={'n_dim': D, 'distance': 'IP'}
)  # init
with da:
   da.embeddings = np.random.random([N, D])
print(da.find(np.random.random(D), limit=10))

To access persisted data from another server, you need to specify collection_name, host, and port. This allows users to enjoy all the benefits that Milvus offers, through the familiar and unified API of DocArray.
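
To make the 'distance': 'IP' and limit arguments from the example above concrete, the sketch below runs an exact, brute-force inner-product search in plain Python. It is an illustration only: a vector database such as Milvus relies on index structures rather than a full scan, and the inner_product and find functions here are hypothetical helpers, not part of DocArray or Milvus.

```python
import random
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product (IP) of two equal-length vectors; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def find(query: Sequence[float], stored: List[List[float]], limit: int) -> List[Tuple[int, float]]:
    """Return up to `limit` (index, score) pairs, best inner product first."""
    scored = [(i, inner_product(query, vec)) for i, vec in enumerate(stored)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


random.seed(0)  # fixed seed so repeated runs give the same vectors
N, D = 5, 128
stored = [[random.random() for _ in range(D)] for _ in range(N)]
query = [random.random() for _ in range(D)]

matches = find(query, stored, limit=10)
for index, score in matches:
    print(index, round(score, 3))
```

In this sketch, asking for limit=10 yields only five matches, because only five vectors are stored.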

Embracing open governance

The term "open governance" refers to the way a project is governed: how decisions are made, how the project is structured, and who is responsible for what. In the context of open source software, "open governance" means the project is governed openly and transparently, and anyone is welcome to participate in that governance.

Open governance for DocArray has many benefits:

• DocArray is now democratically run, ensuring that everyone has a say.
• DocArray is now more accessible and inclusive, because anyone can participate in governance.
• DocArray will be of higher quality, because decisions are being made in a transparent and open way.

The development team is taking actions to embrace open governance, including:

• Creating a DocArray technical steering committee (TSC) to help guide the project.
• Opening up the development process to more input and feedback from the community.
• Making DocArray development more inclusive and welcoming to new contributors.

Join the project

If you're interested in open source AI, Python, or big data, then you're invited to follow along with the DocArray project as it develops. If you think you have something to contribute to it, then join the project. It's a growing community, and one that's open to everyone.


This article was originally published on the Jina AI blog and has been republished with permission.

Dr. Han Xiao
Dr. Han Xiao is the Founder & CEO of Jina AI (jina.ai), a commercial open source company building an MLOps platform for multimodal AI.

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.