Build a web application with ONNX Runtime
Contents
- Options for deployment target
- Options to obtain a model
- Bootstrap your application
- Add ONNX Runtime Web as dependency
- Consume onnxruntime-web in your code
- Pre and post processing
- Bundlers
This document explains the options and considerations for building a web application with ONNX Runtime.
Options for deployment target
- Inference in browser: the runtime and model are downloaded to the client, and inference happens inside the browser. Use onnxruntime-web in this scenario.
- Inference on server: the browser sends the user's input to the server, the server runs inference, and the result is sent back to the client. Use native ONNX Runtime for the best performance. If the server application runs on Node.js, use onnxruntime-node (the ONNX Runtime Node.js binding) on the server; see the sketch after this list.
- Electron: Electron uses a frontend (based on Chromium, technically a browser core) and a backend (based on Node.js). If possible, use onnxruntime-node for inference in the backend, which is faster. If there are security or compatibility concerns, using onnxruntime-web in the frontend is also an option.
- React Native: React Native is a framework that uses the same API as React, but builds native mobile applications instead of web apps. Use onnxruntime-react-native in this scenario.
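To illustrate the server option referenced above, the sketch below loads a model with onnxruntime-node and runs it once. The model path, tensor shape, and feed names are placeholders for your own model, not part of any specific API contract.

```ts
// Minimal sketch of server-side inference with onnxruntime-node.
// "./model.onnx" and the tensor shape are placeholders for your own model.
import * as ort from 'onnxruntime-node';

async function main() {
  // Load the model once, e.g. at server startup.
  const session = await ort.InferenceSession.create('./model.onnx');

  // Build an input tensor (shape and values are model-specific).
  const data = new Float32Array(1 * 3 * 224 * 224);
  const input = new ort.Tensor('float32', data, [1, 3, 224, 224]);

  // Feed names must match the model; session.inputNames lists them.
  const results = await session.run({ [session.inputNames[0]]: input });
  console.log(results[session.outputNames[0]].dims);
}

main();
```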
Options to obtain a model
You need to understand your web app’s scenario and get an ONNX model that is appropriate for that scenario.
ONNX models can be obtained from the ONNX model zoo, converted from PyTorch or TensorFlow, and many other places.
You can convert an ONNX format model to ORT format for a smaller binary size, faster initialization, and lower peak memory usage.
You can perform a model-specific custom build to further reduce binary size.
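As a rough example, conversion to ORT format can typically be done with the conversion tool in the onnxruntime Python package; the model filename below is a placeholder.

python -m onnxruntime.tools.convert_onnx_models_to_ort your_model.onnx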
Bootstrap your application
Bootstrap your web application according to your web framework of choice, e.g. Vue.js, React, or Angular.
You can skip this step if you already have a web application and you are adding machine learning to it with ONNX Runtime.
Add ONNX Runtime Web as dependency
Install onnxruntime-web. These commands will update the application’s package.json file.
With yarn
yarn add onnxruntime-web
With npm
npm install onnxruntime-web
Add “@dev” to the package name to use the nightly build (e.g. npm install onnxruntime-web@dev).
Consume onnxruntime-web in your code
- Import onnxruntime-web. See import onnxruntime-web.
- Initialize the inference session. See InferenceSession.create. Session initialization should only happen once.
- Run the session. See session.run. A session run happens each time there is new user input.
Refer to ONNX Runtime Web API docs for more detail.
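Putting these steps together, the following is a minimal sketch rather than a complete application: the model path, tensor shape, and input values are placeholders for your own model and pre-processing.

```ts
// Minimal sketch of consuming onnxruntime-web.
// "./model.onnx" and the tensor shape are placeholders for your own model.
import * as ort from 'onnxruntime-web';

let session: ort.InferenceSession;

// 1. Initialize the session once, e.g. at application startup.
export async function initSession() {
  session = await ort.InferenceSession.create('./model.onnx');
}

// 2. Run the session each time there is new user input.
export async function runModel(data: Float32Array) {
  const input = new ort.Tensor('float32', data, [1, 3, 224, 224]);
  // Feed names must match the model; session.inputNames lists them.
  const results = await session.run({ [session.inputNames[0]]: input });
  return results[session.outputNames[0]]; // an ort.Tensor with .dims and .data
}
```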
Pre and post processing
Raw input is usually a string (for an NLP model) or an image (for an image model), and it comes in many forms and formats.
String inputs
- Use a tokenizer in JS/wasm to pre-process the input into numeric data, create tensors from the data, and feed them to ORT for inference (see the sketch after this list).
- Use one or more custom ops to deal with strings, and build ONNX Runtime with those custom ops. The model can then process string tensor inputs directly. Refer to the onnxruntime-extensions library, which contains a set of possible custom operators.
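The sketch below illustrates the first approach. tokenize() is a hypothetical tokenizer function, and the feed names input_ids and attention_mask are assumptions that depend on the model and tokenizer library you use.

```ts
// Sketch of pre-processing a string input before inference.
// tokenize() is a hypothetical tokenizer; the feed names input_ids and
// attention_mask are assumptions that depend on the model.
import * as ort from 'onnxruntime-web';

declare function tokenize(text: string): number[];

function makeFeeds(text: string): Record<string, ort.Tensor> {
  const ids = tokenize(text);
  // Transformer-style models typically expect int64 tensors of shape [1, seqLen].
  const inputIds = new ort.Tensor(
    'int64',
    BigInt64Array.from(ids.map((id) => BigInt(id))),
    [1, ids.length]
  );
  const mask = new ort.Tensor(
    'int64',
    new BigInt64Array(ids.length).fill(BigInt(1)),
    [1, ids.length]
  );
  return { input_ids: inputIds, attention_mask: mask };
}
```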
Image input
- Use a JS/wasm library to pre-process the data and create a tensor that matches the model's expected input. See the image classification using ONNX Runtime Web tutorial, and the sketch after this list.
- Modify the model to include the pre-processing as operators inside the model. The model will then expect a specific web image format (e.g. a bitmap or a texture from a canvas).
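The following sketch illustrates the first approach: converting canvas ImageData into an NCHW float32 tensor. The resizing to the model's expected size and the [0, 1] scaling are assumptions that depend on the specific model.

```ts
// Sketch of converting canvas ImageData into an NCHW float32 tensor.
// Resizing to the model's expected size (e.g. 224x224) and any mean/std
// normalization are assumed to match your specific model.
import * as ort from 'onnxruntime-web';

function imageDataToTensor(imageData: ImageData): ort.Tensor {
  const { data, width, height } = imageData; // RGBA bytes, HWC layout
  const floats = new Float32Array(3 * width * height);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const i = (y * width + x) * 4;
      for (let c = 0; c < 3; c++) {
        // Scale to [0, 1] and reorder from HWC (canvas) to CHW (model input).
        floats[c * width * height + y * width + x] = data[i + c] / 255;
      }
    }
  }
  return new ort.Tensor('float32', floats, [1, 3, height, width]);
}
```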
Outputs
The outputs of a model vary, and most need their own post-processing code. Refer to the tutorial above for an example of JavaScript post-processing; a minimal sketch follows.
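For a classification model, post-processing often amounts to picking the highest-scoring class. The sketch below assumes a single float32 output tensor; the label list it indexes into is your own.

```ts
// Sketch of post-processing a classification output: pick the highest score.
// Assumes a single float32 output tensor; the label list is model-specific.
import * as ort from 'onnxruntime-web';

function argmax(output: ort.Tensor): number {
  const scores = output.data as Float32Array;
  let best = 0;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] > scores[best]) best = i;
  }
  return best; // index into your model's label list
}
```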
Bundlers
[This section is coming soon]