Introduction
In this article, we explore how to deploy an OpenAI-compatible vLLM server on Hugging Face infrastructure. A single command launches a private endpoint, with no servers to provision, no Kubernetes to manage, and per-second billing. We cover each step, from prerequisites to serving larger models.
Prerequisites
Before you begin, make sure you meet the following conditions:
- A valid payment method or a positive prepaid credit balance.
- A recent version of huggingface_hub: run pip install -U "huggingface_hub>=1.20.0" to update your installation.
- Local authentication completed with hf auth login.
Good to know
Hugging Face Jobs bills for the time the selected hardware is in use, so choose a hardware configuration sized for your model.
Starting the vLLM server
To launch the server, use the following command with hf jobs run, which is the Hugging Face equivalent of docker run.
    hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
        vllm/vllm-openai:latest \
        vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

Command details
- --flavor a10g-large: Requests an A10G GPU to run the command.
- --expose 8000: Makes port 8000 of the job reachable through the Hugging Face proxy.
- --timeout 2h: Sets the maximum execution time to 2 hours.

Once launched, the command returns a URL for reaching the server:
    ✓ Job started
      id: 6a381ca1953ed90bfb947332
      url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332
    Hint: Exposed ports are reachable at (requires an HF token with read access to the job):
      https://6a381ca1953ed90bfb947332--8000.hf.jobs

Tip
Keep the job ID (in this example: 6a381ca1953ed90bfb947332) for the next steps. It is part of the endpoint URL and is needed to cancel the job.
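Since every later request is addressed to a URL derived from the job ID, it can help to build that URL in one place. The following is a minimal sketch that assumes the pattern printed in the example output (https://<job_id>--<port>.hf.jobs) holds for other jobs as well; the helper name endpoint_url is hypothetical and not part of huggingface_hub.

```python
def endpoint_url(job_id: str, port: int = 8000, path: str = "") -> str:
    """Build the proxy URL for an exposed port of a job.

    Assumes the pattern shown in the `hf jobs run` output:
    https://<job_id>--<port>.hf.jobs
    """
    if not job_id:
        raise ValueError("job_id must not be empty")
    return f"https://{job_id}--{port}.hf.jobs{path}"


JOB_ID = "6a381ca1953ed90bfb947332"  # example job ID from this article
BASE_URL = endpoint_url(JOB_ID, path="/v1")
print(BASE_URL)  # https://6a381ca1953ed90bfb947332--8000.hf.jobs/v1
```

The resulting BASE_URL is the value an OpenAI-compatible client would take as its base URL.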
Verify startup
Wait a few minutes for the weights to be downloaded and the server to start. Once the logs display "Application startup complete", the server is operational.
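Instead of watching the logs by hand, a script can poll the server until it answers. The sketch below uses only the Python standard library and assumes that a successful GET on /v1/models means startup has finished; the function names, the 10-minute default timeout, and the 10-second interval are illustrative choices, not values from the Hugging Face or vLLM documentation.

```python
import time
import urllib.request
from typing import Callable


def probe_models(base_url: str, token: str) -> bool:
    """Return True if GET <base_url>/v1/models answers with HTTP 200."""
    request = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except OSError:
        # Covers URLError/HTTPError, refused connections, and timeouts:
        # the server is treated as not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` repeatedly until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A call such as wait_until_ready(lambda: probe_models("https://<job_id>--8000.hf.jobs", token)) returns True once the server responds, where token is your Hugging Face token. Passing the probe as a callable keeps the retry loop independent of the HTTP details.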
Making requests to the server
The vLLM server exposes an OpenAI-compatible API. Here are two ways to interact with it:
Method with curl
Use curl with the Hugging Face token to send a request:
    curl https://<job_id>--8000.hf.jobs/v1/chat/completions \
        -H "Authorization: Bearer $(hf auth token)" \
        -H "Content-Type: application/json" \
        -d '{
            "model": "Qwen/Qwen3-4B",
            "messages": [{"role": "user", "content": "Hello!"}],
            "chat_template_kwargs": {"enable_thinking": false}
        }'

Method with Python
Access the API using the OpenAI library:
    from huggingface_hub import get_token
    from openai import OpenAI

    # Point the OpenAI client at the job's proxy URL and authenticate with the local HF token.
    client = OpenAI(
        base_url="https://<job_id>--8000.hf.jobs/v1",
        api_key=get_token(),
    )
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-4B",
        messages=[{"role": "user", "content": "Hello!"}],
        # vLLM-specific field: turn off the model's thinking mode in the chat template.
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )
    print(resp.choices[0].message.content)

Health check
Before sending real requests, run a quick check that lists the models the server is serving:
    curl https://<job_id>--8000.hf.jobs/v1/models -H "Authorization: Bearer $(hf auth token)"

Warning
The endpoint is not public. Each request must include a Hugging Face token with the required access rights.
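The health check and the token requirement can also be expressed in Python. This is a rough sketch using only the standard library; it assumes the response follows the OpenAI list format (a "data" array of objects with an "id" field), and the function names are hypothetical.

```python
import json
import urllib.request


def build_models_request(base_url: str, token: str) -> urllib.request.Request:
    """Prepare an authenticated GET request for <base_url>/v1/models."""
    return urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {token}"},
    )


def served_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def is_model_served(base_url: str, token: str, model: str) -> bool:
    """Fetch the model list and check that `model` is among the served IDs."""
    request = build_models_request(base_url, token)
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return model in served_model_ids(payload)
```

For the server started earlier, is_model_served("https://<job_id>--8000.hf.jobs", token, "Qwen/Qwen3-4B") would be expected to return True. A request sent without a valid token should be rejected by the proxy, in which case urlopen raises an HTTPError.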
Cleanup
Once you are done using it, stop the job to avoid unnecessary charges:
    hf jobs cancel <job_id>

Save
The --timeout option serves as a safety net, but stopping the job explicitly as soon as you are done reduces the overall cost.
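When the server is driven from a script, for instance an evaluation run, the cancellation can be tied to the script's lifetime so it also happens when an error occurs. The following sketch wraps the hf jobs cancel command shown above in a context manager; the name cancel_on_exit and the injectable run parameter are illustrative choices, and the hf CLI is assumed to be installed and authenticated.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence


@contextmanager
def cancel_on_exit(
    job_id: str,
    run: Callable[[Sequence[str]], object] = subprocess.run,
) -> Iterator[str]:
    """Yield the job ID and always issue `hf jobs cancel <job_id>` on exit."""
    try:
        yield job_id
    finally:
        # Runs on normal exit and when the body raises an exception.
        run(["hf", "jobs", "cancel", job_id])
```

Typical use would be a with cancel_on_exit("<job_id>"): block around the code that sends requests. The --timeout value still applies if the script itself is killed before the cleanup can run.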
Configuration for larger models
The steps described above also apply to larger models. For example, for the Qwen3.5-122B-A10B model:

    hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
        vllm/vllm-openai:latest \
        vllm serve Qwen/Qwen3.5-122B-A10B \
        --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
        --max-model-len 32768 --max-num-seqs 256

In this example, --tensor-parallel-size must match the number of GPUs. If you hit errors such as "out-of-memory", reduce the values of --max-model-len and --max-num-seqs.
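The relationship between these flags can be captured in a small helper that assembles the vllm serve arguments, so that the tensor-parallel size always follows the GPU count. This is only a sketch: the function names are hypothetical, and halving both limits after an out-of-memory error is one possible strategy assumed here, not a recommendation from vLLM or Hugging Face.

```python
def build_vllm_args(
    model: str,
    num_gpus: int,
    max_model_len: int,
    max_num_seqs: int,
) -> list[str]:
    """Assemble `vllm serve` arguments with tensor parallelism tied to the GPU count."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    args = ["vllm", "serve", model, "--host", "0.0.0.0", "--port", "8000"]
    if num_gpus > 1:
        # The tensor-parallel size must match the number of GPUs.
        args += ["--tensor-parallel-size", str(num_gpus)]
    args += [
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
    ]
    return args


def halve_limits(
    max_model_len: int,
    max_num_seqs: int,
    floor_len: int = 2048,
    floor_seqs: int = 1,
) -> tuple[int, int]:
    """Assumed fallback after an out-of-memory error: halve both limits, down to a floor."""
    return max(max_model_len // 2, floor_len), max(max_num_seqs // 2, floor_seqs)


args = build_vllm_args("Qwen/Qwen3.5-122B-A10B", num_gpus=2,
                       max_model_len=32768, max_num_seqs=256)
print(" ".join(args))
```

With the values from the example, the printed command matches the vllm serve portion of the hf jobs run invocation, and halve_limits(32768, 256) gives (16384, 128) as the next pair of values to try.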
Conclusion
Hugging Face Jobs greatly simplifies the deployment of LLM servers for testing, evaluation, or occasional use. For long-running, production-ready deployments, explore Inference Endpoints.
Learn more
To serve other model formats or use other engines, such as GGUF models or SGLang, consult the Hugging Face guide "Serve Models on Jobs".



