Building a GPU-Accelerated Ollama and LangChain Workflow with RAG Agents, Multi-Session Chat, and Performance Monitoring

In this tutorial, we build a local, GPU-capable LLM stack that combines Ollama and LangChain. We install the required libraries, start the Ollama server, pull a model, and wrap it in a custom LangChain LLM, which lets us control temperature, token limits, and context.

Installing packages

```python
import os
import sys
import subprocess
import time
import threading
import queue
import json
from typing import List, Dict, Any, Optional, Tuple
from dataclasses import dataclass
from contextlib import contextmanager
import asyncio
from concurrent.futures import ThreadPoolExecutor


def install_packages():
    """Install required packages for the Colab environment."""
    packages = [
        "langchain",
        "langchain-community",
        "langchain-core",
        "chromadb",
        "sentence-transformers",
        "faiss-cpu",
        "pypdf",
        "python-docx",
        "requests",
        "psutil",
        "pyngrok",
        "gradio",
    ]

    # Install each package with the interpreter that runs this notebook.
    for package in packages:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])


install_packages()
```
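Reinstalling every package on each run is slow in a notebook. As a minimal sketch, the helper below checks which modules are already importable before calling pip. The `IMPORT_NAMES` mapping and both function names are illustrative assumptions and not part of the original tutorial; the mapping is needed because pip names and import names differ for some packages.

```python
import importlib.util
import subprocess
import sys
from typing import Dict, List

# Assumed mapping from pip package name to importable module name.
IMPORT_NAMES: Dict[str, str] = {
    "langchain": "langchain",
    "sentence-transformers": "sentence_transformers",
    "faiss-cpu": "faiss",
    "python-docx": "docx",
    "psutil": "psutil",
}


def missing_packages(import_names: Dict[str, str]) -> List[str]:
    """Return the pip names whose top-level module cannot be found."""
    return [
        pip_name
        for pip_name, module in import_names.items()
        if importlib.util.find_spec(module) is None
    ]


def install_missing(import_names: Dict[str, str]) -> None:
    """Install only the packages that are not importable yet."""
    for pip_name in missing_packages(import_names):
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])


# Offline demo: one standard-library module and one made-up name.
print(missing_packages({"json": "json", "no-such-pkg": "no_such_module_xyz"}))
```

Calling `install_missing(IMPORT_NAMES)` would then invoke pip only for the entries that `missing_packages` reports.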

Defining the `OllamaConfig` class

```python
@dataclass
class OllamaConfig:
    """Configuration for the Ollama setup."""
    model_name: str = "llama2"
    base_url: str = "http://localhost:11434"  # Default local Ollama endpoint
    max_tokens: int = 2048       # Upper bound on generated tokens
    temperature: float = 0.7     # Sampling temperature
    gpu_layers: int = -1
    context_window: int = 4096   # Context size in tokens
    batch_size: int = 512
    threads: int = 4
```
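In the code shown in this article, only `temperature` and `max_tokens` reach the Ollama request. One way the remaining fields could be forwarded is as request options. The sketch below is an assumption about how the fields map to Ollama option names (`num_ctx`, `num_batch`, `num_thread`, `num_gpu`); `to_ollama_options` is a hypothetical helper, and the dataclass is a trimmed copy so that the block runs on its own.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class OllamaConfig:
    """Trimmed copy of the tutorial's config, repeated to keep this block standalone."""
    model_name: str = "llama2"
    max_tokens: int = 2048
    temperature: float = 0.7
    gpu_layers: int = -1
    context_window: int = 4096
    batch_size: int = 512
    threads: int = 4


def to_ollama_options(config: OllamaConfig) -> Dict[str, Any]:
    """Translate config fields into Ollama request options (assumed mapping)."""
    options: Dict[str, Any] = {
        "temperature": config.temperature,
        "num_predict": config.max_tokens,
        "num_ctx": config.context_window,
        "num_batch": config.batch_size,
        "num_thread": config.threads,
    }
    # Assumption: a negative gpu_layers means "let Ollama decide", so the
    # option is only sent when an explicit layer count is configured.
    if config.gpu_layers >= 0:
        options["num_gpu"] = config.gpu_layers
    return options


print(to_ollama_options(OllamaConfig()))
```

The resulting dictionary would go under the `"options"` key of the request payload, next to `"model"` and `"prompt"`.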

Defining the `OllamaManager` class

```python
class OllamaManager:
    """Advanced Ollama manager for the Colab environment."""

    def __init__(self, config: OllamaConfig):
        self.config = config
        self.process = None          # Handle of the Ollama server process, once started
        self.is_running = False
        self.models_cache = {}
        self.performance_monitor = PerformanceMonitor()

    def install_ollama(self):
        """Install Ollama in the Colab environment."""
        try:
            # Download the install script first, then run it with bash.
            subprocess.run([
                "curl", "-fsSL", "https://ollama.com/install.sh", "-o", "/tmp/install.sh"
            ], check=True)

            subprocess.run(["bash", "/tmp/install.sh"], check=True)
            print("Ollama installed successfully")

        except subprocess.CalledProcessError as e:
            print(f"Failed to install Ollama: {e}")
            raise

    # start_server(), pull_model() and list_models(), which later sections
    # call, are not included in this excerpt.
```
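The excerpt does not show how the manager waits for the server after launching it. A common pattern is to poll a readiness check until it succeeds or a timeout expires. The sketch below is a generic version of that pattern, not the tutorial's own `start_server`; `wait_until_ready` and the injected `probe` callable are hypothetical names, and the demo uses a fake probe so that it runs without a server.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 1.0,
) -> bool:
    """Call probe() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            # A failing probe (for example a refused connection) is treated
            # as "not ready yet" and retried.
            pass
        time.sleep(interval)
    return False


# Demo with a fake probe that succeeds on the third attempt.
attempts = {"count": 0}


def fake_probe() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until_ready(fake_probe, timeout=2.0, interval=0.01))
```

Against a real server, the probe could be an HTTP GET to `{base_url}/api/tags` that returns `True` on a 200 response.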

Defining the `PerformanceMonitor` class

```python
import psutil


class PerformanceMonitor:
    """Monitor system performance and resource usage."""

    def __init__(self):
        self.monitoring = False
        self.stats = {
            "cpu_usage": [],
            "memory_usage": [],
            "gpu_usage": [],        # Declared, but not filled by the loop below
            "inference_times": []
        }
        self.monitor_thread = None

    def start(self):
        """Start performance monitoring in a background thread."""
        self.monitoring = True
        self.monitor_thread = threading.Thread(target=self._monitor_loop)
        self.monitor_thread.daemon = True
        self.monitor_thread.start()

    def stop(self):
        """Stop performance monitoring and wait for the thread to exit."""
        self.monitoring = False
        if self.monitor_thread:
            self.monitor_thread.join()

    def _monitor_loop(self):
        """Main monitoring loop."""
        while self.monitoring:
            try:
                cpu_percent = psutil.cpu_percent(interval=1)
                memory = psutil.virtual_memory()

                self.stats["cpu_usage"].append(cpu_percent)
                self.stats["memory_usage"].append(memory.percent)

                # Keep only the 100 most recent samples per metric.
                for key in ["cpu_usage", "memory_usage"]:
                    if len(self.stats[key]) > 100:
                        self.stats[key] = self.stats[key][-100:]

                time.sleep(5)

            except Exception as e:
                print(f"Monitoring error: {e}")

    def get_stats(self) -> Dict[str, Any]:
        """Get current performance statistics."""
        # Averages use the last 10 samples; max(..., 1) avoids division by zero.
        return {
            "avg_cpu": sum(self.stats["cpu_usage"][-10:]) / max(len(self.stats["cpu_usage"][-10:]), 1),
            "avg_memory": sum(self.stats["memory_usage"][-10:]) / max(len(self.stats["memory_usage"][-10:]), 1),
            "total_inferences": len(self.stats["inference_times"]),
            "avg_inference_time": sum(self.stats["inference_times"]) / max(len(self.stats["inference_times"]), 1)
        }
```
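The monitor declares a `gpu_usage` list, but the loop above never appends to it. One possible way to fill it on NVIDIA hardware is to query `nvidia-smi` and parse its CSV output. This is a sketch under that assumption, not the tutorial's implementation; `parse_gpu_csv` and `sample_gpu` are hypothetical helpers, and the sample string in the demo is illustrative input, not a measurement.

```python
import shutil
import subprocess
from typing import Dict, List

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text: str) -> List[Dict[str, float]]:
    """Parse 'utilization, used, total' rows, one row per GPU."""
    rows = []
    for line in text.strip().splitlines():
        try:
            util, used, total = (float(part) for part in line.split(","))
        except ValueError:
            continue  # Skip rows with missing or non-numeric fields.
        rows.append(
            {"util_percent": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return rows


def sample_gpu() -> List[Dict[str, float]]:
    """Return per-GPU stats, or an empty list when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


# Illustrative input, not a real measurement.
print(parse_gpu_csv("37, 5120, 15360\n"))
```

Inside `_monitor_loop`, the `util_percent` value of the first row returned by `sample_gpu()` could be appended to `self.stats["gpu_usage"]` alongside the CPU and memory samples.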

Defining the `OllamaLLM` class

```python
import requests
# Import paths match the langchain-core package; they may differ in other versions.
from langchain_core.callbacks.manager import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM


class OllamaLLM(LLM):
    """Custom LangChain LLM for Ollama."""

    model_name: str = "llama2"
    base_url: str = "http://localhost:11434"
    temperature: float = 0.7
    max_tokens: int = 2048
    performance_monitor: Optional[PerformanceMonitor] = None

    @property
    def _llm_type(self) -> str:
        return "ollama"

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        """Make an API call to Ollama and return the generated text."""
        start_time = time.time()

        try:
            payload = {
                "model": self.model_name,
                "prompt": prompt,
                "stream": False,  # Ask for one complete JSON response
                "options": {
                    "temperature": self.temperature,
                    "num_predict": self.max_tokens,
                    "stop": stop or []
                }
            }

            response = requests.post(
                f"{self.base_url}/api/generate",
                json=payload,
                timeout=120
            )

            response.raise_for_status()
            result = response.json()

            inference_time = time.time() - start_time

            # Record latency so PerformanceMonitor.get_stats() can report it.
            if self.performance_monitor:
                self.performance_monitor.stats["inference_times"].append(inference_time)

            return result.get("response", "")

        except Exception as e:
            # Errors are returned as text, so callers receive them as model output.
            print(f"Ollama API error: {e}")
            return f"Error: {str(e)}"
```
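`_call` measures wall-clock latency, which includes network and model-load time. The non-streaming `/api/generate` response also carries generation counters, `eval_count` (tokens generated) and `eval_duration` (nanoseconds), from which throughput can be derived. The sketch below assumes those two fields are present; `tokens_per_second` is a hypothetical helper, and the sample dictionary is illustrative, not a real measurement.

```python
from typing import Any, Dict, Optional

NANOSECONDS_PER_SECOND = 1_000_000_000


def tokens_per_second(result: Dict[str, Any]) -> Optional[float]:
    """Compute generation throughput from an Ollama response dict."""
    count = result.get("eval_count")
    duration_ns = result.get("eval_duration")
    if not count or not duration_ns:
        return None  # Fields missing or zero: throughput is undefined.
    return count / (duration_ns / NANOSECONDS_PER_SECOND)


# Illustrative values, not a real measurement.
sample = {"response": "Hi!", "eval_count": 100, "eval_duration": 2_000_000_000}
print(tokens_per_second(sample))  # 50.0
```

A value like this could be stored next to `inference_times` to separate slow generation from slow startup.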

Defining the `RAGSystem` class

```python
# Import paths match langchain and langchain-community; they may differ in other versions.
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS


class RAGSystem:
    """Retrieval-Augmented Generation system."""

    def __init__(self, llm: OllamaLLM, embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.llm = llm
        self.embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
        self.vector_store = None
        self.qa_chain = None
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )

    def add_documents(self, file_paths: List[str]):
        """Add documents to the vector store."""
        documents = []

        for file_path in file_paths:
            try:
                # PDFs get a dedicated loader; everything else is read as plain text.
                if file_path.endswith(".pdf"):
                    loader = PyPDFLoader(file_path)
                else:
                    loader = TextLoader(file_path)

                docs = loader.load()
                documents.extend(docs)

            except Exception as e:
                print(f"Error loading {file_path}: {e}")

        if documents:
            splits = self.text_splitter.split_documents(documents)

            # Create the index on first use, then extend it.
            if self.vector_store is None:
                self.vector_store = FAISS.from_documents(splits, self.embeddings)
            else:
                self.vector_store.add_documents(splits)

            # Rebuild the QA chain so it retrieves from the updated store.
            self.qa_chain = RetrievalQA.from_chain_type(
                llm=self.llm,
                chain_type="stuff",
                retriever=self.vector_store.as_retriever(search_kwargs={"k": 3}),
                return_source_documents=True
            )

            print(f"Added {len(splits)} document chunks to vector store")

    def query(self, question: str) -> Dict[str, Any]:
        """Query the RAG system."""
        if not self.qa_chain:
            return {"answer": "No documents loaded. Please add documents first."}

        try:
            result = self.qa_chain.invoke({"query": question})
            return {
                "answer": result["result"],
                "sources": [doc.metadata for doc in result.get("source_documents", [])]
            }
        except Exception as e:
            return {"answer": f"Error: {str(e)}"}
```
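The effect of `chunk_size` and `chunk_overlap` is easier to see on a small input. The sketch below uses a plain fixed-size sliding window, which is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces before falling back to character counts. `chunk_text` is a hypothetical helper written for this illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text.
    return chunks


chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.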

Defining the `ConversationManager` class

```python
# Import paths match the langchain package; they may differ in other versions.
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory, ConversationSummaryBufferMemory


class ConversationManager:
    """Manage conversation history and memory."""

    def __init__(self, llm: OllamaLLM, memory_type: str = "buffer"):
        self.llm = llm
        self.conversations = {}   # session_id -> ConversationChain
        self.memory_type = memory_type

    def get_conversation(self, session_id: str) -> ConversationChain:
        """Get or create the conversation for a session."""
        if session_id not in self.conversations:
            if self.memory_type == "buffer":
                # Keep the last 10 exchanges verbatim.
                memory = ConversationBufferWindowMemory(k=10)
            elif self.memory_type == "summary":
                # Summarize older turns once the buffer exceeds the token limit.
                memory = ConversationSummaryBufferMemory(
                    llm=self.llm,
                    max_token_limit=1000
                )
            else:
                # Unknown memory types fall back to the windowed buffer.
                memory = ConversationBufferWindowMemory(k=10)

            self.conversations[session_id] = ConversationChain(
                llm=self.llm,
                memory=memory,
                verbose=True
            )

        return self.conversations[session_id]

    def chat(self, session_id: str, message: str) -> str:
        """Chat within a specific session."""
        conversation = self.get_conversation(session_id)
        return conversation.predict(input=message)

    def clear_session(self, session_id: str):
        """Clear the conversation history for a session."""
        if session_id in self.conversations:
            del self.conversations[session_id]
```
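The windowed buffer and the per-session dictionary can be illustrated without LangChain. The sketch below is a simplified stand-in for the idea behind `ConversationBufferWindowMemory(k=10)`, not its implementation; `WindowedSessions` is a hypothetical class that keeps the last `k` exchanges per session and isolates sessions from each other.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple


class WindowedSessions:
    """Keep the last k (user, assistant) exchanges for each session."""

    def __init__(self, k: int = 10):
        self.k = k
        self._history: Dict[str, Deque[Tuple[str, str]]] = defaultdict(
            lambda: deque(maxlen=self.k)
        )

    def record(self, session_id: str, user: str, assistant: str) -> None:
        # deque(maxlen=k) drops the oldest exchange automatically.
        self._history[session_id].append((user, assistant))

    def context(self, session_id: str) -> List[Tuple[str, str]]:
        # Use .get so that reading an unknown session does not create it.
        return list(self._history.get(session_id, ()))

    def clear(self, session_id: str) -> None:
        self._history.pop(session_id, None)


sessions = WindowedSessions(k=2)
for i in range(3):
    sessions.record("alice", f"question {i}", f"answer {i}")
sessions.record("bob", "hello", "hi")
print(sessions.context("alice"))
```

With `k=2`, the first exchange of the `alice` session has been dropped, while the `bob` session is unaffected.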

Defining the `OllamaLangChainSystem` class

```python
# Import paths match langchain and langchain-community; they may differ in other versions.
from langchain.agents import AgentType, Tool, initialize_agent
from langchain_community.tools import DuckDuckGoSearchRun


class OllamaLangChainSystem:
    """Main system integrating all components."""

    def __init__(self, config: OllamaConfig):
        self.config = config
        self.manager = OllamaManager(config)
        self.llm = None
        self.rag_system = None
        self.conversation_manager = None
        self.tools = []
        self.agent = None

    def setup(self):
        """Complete system setup."""
        print("Setting up Ollama + LangChain system...")

        self.manager.install_ollama()
        self.manager.start_server()

        if not self.manager.pull_model(self.config.model_name):
            print("Failed to pull default model")
            return False

        # Wrap the served model and share the manager's performance monitor.
        self.llm = OllamaLLM(
            model_name=self.config.model_name,
            base_url=self.config.base_url,
            temperature=self.config.temperature,
            max_tokens=self.config.max_tokens,
            performance_monitor=self.manager.performance_monitor
        )

        self.rag_system = RAGSystem(self.llm)

        self.conversation_manager = ConversationManager(self.llm)

        self._setup_tools()

        print("System setup complete!")
        return True

    def _setup_tools(self):
        """Set up tools for the agent."""
        # DuckDuckGoSearchRun needs a DuckDuckGo search package that is not
        # in the install list above.
        search = DuckDuckGoSearchRun()

        self.tools = [
            Tool(
                name="Search",
                func=search.run,
                description="Search the internet for current information"
            ),
            Tool(
                name="RAG_Query",
                func=lambda q: self.rag_system.query(q)["answer"],
                description="Query loaded documents using RAG"
            )
        ]

        self.agent = initialize_agent(
            tools=self.tools,
            llm=self.llm,
            agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
            verbose=True
        )
```
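A zero-shot ReAct agent works by having the model emit text with `Action:` and `Action Input:` lines, which the framework parses and routes to the named tool. The sketch below illustrates that routing step with a deliberately simplified regular expression; it is not LangChain's parser. `parse_action`, `dispatch`, and the stub tools are hypothetical and return placeholder strings instead of calling a search engine or vector store.

```python
import re
from typing import Callable, Dict, Optional, Tuple

ACTION_PATTERN = re.compile(
    r"Action\s*:\s*(?P<tool>.+?)\s*\n\s*Action Input\s*:\s*(?P<input>.*)",
    re.DOTALL,
)


def parse_action(llm_output: str) -> Optional[Tuple[str, str]]:
    """Extract (tool name, tool input) from ReAct-style model output."""
    match = ACTION_PATTERN.search(llm_output)
    if match is None:
        return None
    return match.group("tool").strip(), match.group("input").strip().strip('"')


def dispatch(llm_output: str, tools: Dict[str, Callable[[str], str]]) -> str:
    """Route the parsed action to the matching tool."""
    parsed = parse_action(llm_output)
    if parsed is None:
        return "No action found"
    tool_name, tool_input = parsed
    if tool_name not in tools:
        return f"Unknown tool: {tool_name}"
    return tools[tool_name](tool_input)


# Stub tools that mirror the names registered in _setup_tools.
tools = {
    "Search": lambda q: f"[search results for: {q}]",
    "RAG_Query": lambda q: f"[document answer for: {q}]",
}

output = (
    "Thought: I should check the documents.\n"
    "Action: RAG_Query\n"
    "Action Input: What is Ollama?"
)
print(dispatch(output, tools))
```

The tool `description` strings matter for the same reason: they are the only information the model has when deciding which name to put after `Action:`.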

The `main()` function

```python
def main():
    """Main function demonstrating the system."""

    config = OllamaConfig(
        model_name="llama2",
        temperature=0.7,
        max_tokens=2048
    )

    system = OllamaLangChainSystem(config)

    # chat(), agent_chat(), get_performance_stats() and cleanup() are methods
    # of OllamaLangChainSystem that are not included in this excerpt.
    try:
        if not system.setup():
            return

        print("\nTesting basic chat:")
        response = system.chat("Hello! How are you?")
        print(f"Response: {response}")

        print("\nTesting model switching:")
        models = system.manager.list_models()
        print(f"Available models: {models}")

        print("\nTesting agent:")
        agent_response = system.agent_chat("What's the current weather like?")
        print(f"Agent Response: {agent_response}")

        print("\nPerformance Statistics:")
        stats = system.get_performance_stats()
        print(json.dumps(stats, indent=2))

    except KeyboardInterrupt:
        print("\nInterrupted by user")
    except Exception as e:
        print(f"Error: {e}")
    finally:
        # Always release resources, even after an error or interrupt.
        system.cleanup()
```

We use `ConversationManager` to manage conversation history, keeping context either as a sliding-window buffer or as a running summary. `OllamaLangChainSystem` ties everything together: it installs and starts Ollama, pulls the requested model, wraps it in a LangChain-compatible LLM, attaches the RAG system, initializes chat memory, and registers external tools such as web search.

1. Which main classes and components are used to build the GPU-accelerated Ollama and LangChain workflow?

The article describes the following classes: `OllamaConfig`, `OllamaManager`, `PerformanceMonitor`, `OllamaLLM`, `RAGSystem`, `ConversationManager`, and `OllamaLangChainSystem`.

2. Which parameters can be configured in the `OllamaConfig` class?

`OllamaConfig` exposes the following parameters: `model_name`, `base_url`, `max_tokens`, `temperature`, `gpu_layers`, `context_window`, `batch_size`, and `threads`.

3. How does the `PerformanceMonitor` class work, and what data does it collect?

`PerformanceMonitor` runs a background thread that samples CPU load and memory usage through `psutil`, keeping the 100 most recent samples of each. Inference times are appended by `OllamaLLM` after every call. The class also declares a `gpu_usage` list, but the monitoring loop shown here does not populate it.

4. Which methods does `ConversationManager` use to manage a conversation?

`ConversationManager` provides `get_conversation` to fetch or create the conversation for a session, `chat` to send a message within a specific session, and `clear_session` to delete a session's history.

5. Which tools can be registered for the agent in `OllamaLangChainSystem`?

`OllamaLangChainSystem` registers two tools: `Search`, which wraps `DuckDuckGoSearchRun` for web search, and `RAG_Query`, which queries the loaded documents through the RAG system.
