What You Get
High-throughput LLM serving with vLLM. Deploy models with OpenAI-compatible API, PagedAttention, and continuous batching.
Features
- OpenAI-compatible REST API out of the box
- PagedAttention for 24x higher throughput
- Continuous batching for optimal GPU utilization
- Tensor parallelism for multi-GPU serving
- AWQ/GPTQ/FP8 quantization support
- Structured output with guided decoding
Requirements
- Hermes Agent or OpenClaw v2.0+
- NVIDIA GPU with 16GB+ VRAM
- CUDA 12.0+
Installation
hermes skills install vllm-serving





Reviews
There are no reviews yet.