Deploying a Model with vLLM

In this section we will deploy a Granite 3.0 8B Instruct model using vLLM.

For our model server we will deploy a vLLM instance using a model packaged into an OCI container image with ModelCar.

ModelCar is Tech Preview as of OpenShift AI 2.14.

ModelCar is a great option for smaller models like our 8B model. While the resulting container image is still relatively large (roughly 15 GB), it remains reasonable to pull into a cluster.

Treating the model as an OCI artifact allows us to promote it between environments using a customer's existing image promotion processes. By contrast, promoting models between S3 instances in different environments can create new challenges.
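
As an example of what this looks like in practice, standard OCI tooling can be used to inspect or promote the ModelCar image. The commands below are a minimal sketch that assumes skopeo is available and uses registry.example.com as a placeholder for an internal target registry:

    # Inspect the ModelCar image metadata without pulling it
    skopeo inspect docker://quay.io/redhat-ai-services/modelcar-catalog:granite-3.0-8b-instruct

    # Promote (copy) the same artifact to another registry as part of an existing image promotion process
    skopeo copy \
      docker://quay.io/redhat-ai-services/modelcar-catalog:granite-3.0-8b-instruct \
      docker://registry.example.com/models/granite-3.0-8b-instruct:candidate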

Creating the vLLM Instance

  1. Open the OpenShift AI Dashboard and select the composer-ai-apps project from the list of Data Science Projects

    Composer AI Apps Project
  2. Select the Models tab and click Select single-model

    Single Model
  3. Select Deploy models

    Deploy Models
  4. Enter the following information

    Model deployment name: vllm
    Serving runtime: vLLM ServingRuntime for KServe
    Number of model server replicas to deploy: 1
    Model server size: Custom
    CPUs requested: 2 Cores
    CPUs limit: 4 Cores
    Memory requested: 16 GiB
    Memory limit: 20 GiB
    Accelerator: nvidia-gpu
    Number of accelerators: 1
    Make deployed models available through an external route: Checked
    Require token authentication: Unchecked
    Model Options
  5. In the Source model location section, choose the option to Create connection. Enter the following information:

    Connection type: URI - v1
    Connection name: granite-3-0-8b-instruct
    URI: oci://quay.io/redhat-ai-services/modelcar-catalog:granite-3.0-8b-instruct
    URI Connection

    You can find the image containing our model here alongside other ModelCar images that you can try.

    Additionally, the source for building these ModelCar images can be found on GitHub.

    For more information on ModelCar, see the KServe Serving models with OCI images documentation.

    A copy of the image has already been pulled onto the GPU node to help speed up deploying the model, but deploying LLMs can take quite some time.

    KServe uses Knative Serverless to manage the model servers, which has a default progress deadline of 10 minutes. If the model server takes longer than 10 minutes to deploy, Knative will automatically terminate the pod and mark the deployment as failed.

    You can extend the timeout by adding the following annotation to the predictor section of the InferenceService:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: vllm
    spec:
      predictor:
        annotations:
          serving.knative.dev/progress-deadline: 30m
  6. A new vLLM instance will be created in the OpenShift AI Dashboard. Return to the OpenShift Web Console and check the pods in the composer-ai-apps project. You should find a pod called vllm-predictor-00001-deployment-*. Check the pod's Events and Logs to follow its progress until it becomes ready.

  7. (Optional) The OpenShift AI Dashboard created two KServe objects, a ServingRuntime and an InferenceService. From the OpenShift Web Console, navigate to the Home > Search page and use the Resources drop-down menu to search for and select those objects. Spend a few minutes reviewing the objects created by the Dashboard; the CLI sketch after this list shows an equivalent way to inspect them.

    KServe Objects
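
You can also follow the rollout and review these objects from the command line. The commands below are a sketch that assumes the deployment name vllm and the composer-ai-apps project used in this lab; resource names in your cluster may differ slightly:

    # Watch the predictor pod come up
    oc get pods -n composer-ai-apps -l serving.kserve.io/inferenceservice=vllm -w

    # Follow the model server logs (the deployment name comes from the pod name seen above)
    oc logs -f deployment/vllm-predictor-00001-deployment -n composer-ai-apps

    # Review the KServe objects created by the Dashboard
    oc get inferenceservice vllm -n composer-ai-apps -o yaml
    oc get servingruntime -n composer-ai-apps

    # (If needed) extend the Knative progress deadline on the existing InferenceService
    oc patch inferenceservice vllm -n composer-ai-apps --type merge \
      -p '{"spec":{"predictor":{"annotations":{"serving.knative.dev/progress-deadline":"30m"}}}}'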

Testing vLLM Endpoints

Accessing the Swagger Docs

To start, we will test our vLLM endpoint to make sure it is responding by accessing the Swagger docs for vLLM.

  1. First, we will need to find the endpoint URL for the served model. From the OpenShift AI Dashboard, navigate to the Models tab and click on the Internal and external endpoint details to find the URL.

    Model endpoint

    Our vLLM instance does not create a standard OpenShift Route, so you won’t find it under the usual Networking > Routes menu.

    Instead, it creates a Knative Serving Route object, which can be found with the following command:

    oc get routes.serving.knative.dev -n composer-ai-apps
  2. Use the copy option for the route found in the previous step and paste it into a new browser tab, adding /docs at the end, to access the FastAPI Swagger docs page for vLLM.

  3. Use the Try it out option of the GET /v1/models endpoint to list the models being deployed by this server. Note that the id for our model matches the name of the model server we created in the OpenShift AI Dashboard. A curl sketch after this list shows the same calls from the command line.
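
If you prefer the command line, the same OpenAI-compatible endpoints can be exercised with curl. This is a sketch only; replace <VLLM_ROUTE> with the hostname from the external endpoint URL copied above, and note that the model id (vllm here) must match what GET /v1/models returns:

    # List the models served by this vLLM instance
    curl -sk https://<VLLM_ROUTE>/v1/models

    # Send a simple chat completion request
    curl -sk https://<VLLM_ROUTE>/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "vllm",
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
            "max_tokens": 50
          }'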

Testing the Model from the Composer AI UI

Now that we have done some basic testing, we are ready to try the model from inside the Composer AI Studio UI.

Our Composer instance is already set up to point to the vLLM endpoint we created, so no additional configuration is required.

  1. Find the chatbot-ui Route from the OpenShift Web Console and open it in a new tab; you can also look it up with the oc command shown after this list.

    Chatbot Route
  2. In the top left-hand corner, select the Default Assistant

    Default Assistant
  3. Ask a question in the UI to verify that the LLM is able to respond.

    LLM Response
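
If you prefer to look the route up from the CLI, the command below is a sketch that assumes the chatbot-ui Route lives in the composer-ai-apps project used throughout this lab:

    # Print the chatbot UI URL
    oc get route chatbot-ui -n composer-ai-apps -o jsonpath='https://{.spec.host}{"\n"}'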