Deploy a vLLM server with Hugging Face in one command

Introduction

In this article, we will explore how to deploy an OpenAI-compatible vLLM server on Hugging Face infrastructure. A single command launches a private endpoint, with no servers to provision and no Kubernetes to manage, and you pay only for the time the job runs. We will cover each step, from prerequisites to serving larger models.

Prerequisites

Before you begin, make sure you meet the following conditions:

Valid payment method or a positive prepaid credit balance.
Recent version of huggingface_hub: run pip install -U "huggingface_hub>=1.20.0" to update your installation.
Authentication completed locally with hf auth login.

Good to know

Hugging Face Jobs is billed per minute based on hardware usage. Make sure to select the hardware configuration suitable for your model.

Starting the vLLM server

To launch the server, use the following command with hf jobs run, which is the Hugging Face equivalent of docker run.

```bash
hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000
```

Command details

--flavor a10g-large: Requests an A10G GPU instance to run the command.
--expose 8000: Makes port 8000 of the container reachable through the Hugging Face proxy.
--timeout 2h: Sets the maximum execution time to 2 hours.

Once launched, the command prints the job ID, the job page, and the URL at which the exposed port is reachable:

```
✓ Job started
  id: 6a381ca1953ed90bfb947332
  url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332
Hint: Exposed ports are reachable at (requires an HF token with read access to the job):
  https://6a381ca1953ed90bfb947332--8000.hf.jobs
```

Tip

Keep the job ID (in this example: 6a381ca1953ed90bfb947332) for the next steps. It is part of the endpoint URL used in every request, and you need it to cancel the job.
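The endpoint address in the sample output appears to follow the pattern https://<job_id>--<port>.hf.jobs. The following is a minimal Python sketch that builds the base URL used in the requests later in this article, assuming that pattern holds. The helper name job_base_url is hypothetical and is not part of huggingface_hub.

```python
def job_base_url(job_id: str, port: int = 8000) -> str:
    """Return the OpenAI-style base URL for a port exposed by a job.

    Assumption: exposed ports follow the `<job_id>--<port>.hf.jobs` pattern
    seen in the sample output; this is inferred, not a documented API.
    """
    job_id = job_id.strip()
    if not job_id:
        raise ValueError("job_id must not be empty")
    return f"https://{job_id}--{port}.hf.jobs/v1"


print(job_base_url("6a381ca1953ed90bfb947332"))
# https://6a381ca1953ed90bfb947332--8000.hf.jobs/v1
```

Building the URL in one place avoids copy-paste mistakes when the same job ID is reused across curl commands and Python clients.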

Verify startup

Wait a few minutes for the weights to download and the server to start. You can follow the job logs with hf jobs logs <job_id>. Once the logs display "Application startup complete", the server is ready to accept requests.
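As an alternative to watching the logs, a script can poll the endpoint until it answers. The sketch below is one possible approach, not an official readiness mechanism. It assumes that GET /v1/models returns HTTP 200 only once the server is up, and the names server_is_up and wait_until_ready are hypothetical.

```python
import time
import urllib.request
from typing import Callable


def server_is_up(base_url: str, token: str) -> bool:
    """Return True if GET <base_url>/models answers with HTTP 200."""
    request = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except OSError:
        # Covers URLError, HTTPError and timeouts: the server is not ready yet.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 900.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example (needs a running job and huggingface_hub installed):
#   from huggingface_hub import get_token
#   url = "https://<job_id>--8000.hf.jobs/v1"
#   ready = wait_until_ready(lambda: server_is_up(url, get_token()))
```

Passing the check as a callable keeps the polling loop independent of the HTTP client, so the same loop works with any other readiness probe.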

Making requests to the server

The vLLM server exposes an OpenAI-compatible API. Here are two ways to interact with it:

Method with Curl

Use curl with the Hugging Face token to send a request:

```bash
curl https://<job_id>--8000.hf.jobs/v1/chat/completions \
  -H "Authorization: Bearer $(hf auth token)" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```

Method with Python

Access the API using the OpenAI Python client:

```python
from huggingface_hub import get_token
from openai import OpenAI

# Point the OpenAI client at the job's endpoint; replace <job_id> with your job ID.
client = OpenAI(
    base_url="https://<job_id>--8000.hf.jobs/v1",
    api_key=get_token(),  # the Hugging Face token is sent as the bearer token
)
resp = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[{"role": "user", "content": "Hello!"}],
    # Forwarded to the chat template to disable the model's thinking mode.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```

Health check

Before sending real traffic, list the served models to confirm that the server is responding:

```bash
curl https://<job_id>--8000.hf.jobs/v1/models -H "Authorization: Bearer $(hf auth token)"
```
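To use this check in a script, the response can be parsed for the model IDs. The sketch below assumes the usual OpenAI-style list shape, a top-level "data" array whose entries carry an "id" field. The sample body is illustrative and trimmed, not captured from a real server, and served_model_ids is a hypothetical helper name.

```python
import json


def served_model_ids(models_response: str) -> list[str]:
    """Extract model IDs from the JSON body of GET /v1/models."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


# Illustrative body, trimmed to the fields used here.
sample = '{"object": "list", "data": [{"id": "Qwen/Qwen3-4B", "object": "model"}]}'
print(served_model_ids(sample))
# ['Qwen/Qwen3-4B']
```

The returned ID is the value to pass as "model" in chat completion requests.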

Warning

The endpoint is not public. Each request must include a Hugging Face token with the required access rights.

Cleanup

Once you are done with the server, stop the job to avoid unnecessary charges:

```bash
hf jobs cancel <job_id>
```

Cost tip

The --timeout flag serves as a safety net, but stopping the job explicitly as soon as you are done reduces overall costs.
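When the server is driven from a script, cancellation can be tied to the end of the work so that it is not forgotten after an exception. This is a minimal sketch of that idea. The context manager cancel_on_exit is hypothetical, and the cancel callable is supplied by the caller, for example by shelling out to the hf jobs cancel command shown above.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def cancel_on_exit(job_id: str, cancel: Callable[[str], None]) -> Iterator[str]:
    """Yield the job ID and always call `cancel(job_id)` afterwards."""
    try:
        yield job_id
    finally:
        # Runs on normal exit and when the body raises.
        cancel(job_id)


# Example (assumes the `hf` CLI is installed and authenticated):
#   import subprocess
#   stop = lambda jid: subprocess.run(["hf", "jobs", "cancel", jid], check=True)
#   with cancel_on_exit("<job_id>", stop) as job_id:
#       ...  # send requests to the server
```

The --timeout value still applies as a backstop if the script itself is killed before the cleanup runs.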

Configuration for larger models

The steps described above also apply to larger models; only the hardware flavor and a few vLLM options change. For example, for Qwen/Qwen3.5-122B-A10B:

```bash
hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3.5-122B-A10B \
  --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
  --max-model-len 32768 --max-num-seqs 256
```

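In this example, --tensor-parallel-size must match the number of GPUs in the flavor (2 for h200x2). If the server fails with an out-of-memory error, reduce --max-model-len and --max-num-seqs, which lowers the memory reserved for context length and concurrent sequences.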
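A rough memory estimate helps explain why this model needs two GPUs and why little room may remain for the KV cache. The sketch below is a back-of-the-envelope calculation under stated assumptions: 122 billion total parameters (read from the model name), 16-bit weights at 2 bytes per parameter, and 141 GB of memory per H200. Real usage also includes activations, the KV cache, and framework overhead, so treat the result as an upper bound on what is left over.

```python
def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def headroom_gb(
    params_billion: float,
    gpu_count: int,
    gpu_memory_gb: float,
    bytes_per_param: float = 2.0,
) -> float:
    """Memory left across all GPUs after loading the weights (may be negative)."""
    return gpu_count * gpu_memory_gb - weights_gb(params_billion, bytes_per_param)


# Assumed figures: 122B parameters, 16-bit weights, 2 GPUs with 141 GB each.
print(weights_gb(122))           # 244.0
print(headroom_gb(122, 2, 141))  # 38.0
print(headroom_gb(122, 1, 141))  # -103.0, the weights do not fit on one GPU
```

Under these assumptions, at most about 38 GB remains for everything else, which is why lowering --max-model-len and --max-num-seqs is the first thing to try when memory runs out.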

Conclusion

Hugging Face Jobs greatly simplifies deploying LLM servers for testing, evaluation, or occasional use. For long-running, production-ready deployments, consider Inference Endpoints.

Learn more

To serve models with other stacks, such as GGUF models or SGLang, consult the Hugging Face guide "Serve Models on Jobs".


Houssem MAKHLOUF
