Foundational Technologies for Building sLMs
Contents
- Hugging Face Basics
- Ollama
- LangChain
- Vector DB
- RAG
1. Hugging Face Basics
Key Hugging Face features
- Models & datasets
- You can issue an API token to get write access (see the upload sketch below)
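For example, a write-scoped token lets you push files to the Hub. A minimal sketch using the huggingface_hub client (the repo id and file name here are hypothetical):
from huggingface_hub import HfApi
api = HfApi(token="hf_xxx")  # token with write permission
api.create_repo(repo_id="my-username/my-model", exist_ok=True)  # hypothetical repo id
api.upload_file(
    path_or_fileobj="./model.bin",  # hypothetical local file
    path_in_repo="model.bin",
    repo_id="my-username/my-model",
)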
Code example
HF_API_KEY: put your Hugging Face (HF) API key here
import requests
HF_API_KEY = 'hf_xxx'
BASE_URL = "https://api-inference.huggingface.co/models/{}"
headers = {"Authorization": f"Bearer {HF_API_KEY}"}
def get_url(BASE_URL, model_name):
    # Build the full Inference API URL for the given model
    return BASE_URL.format(model_name)

def query(API_URL, headers, payload):
    # POST the JSON payload and return the parsed JSON response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
(1) Summarization task
API_URL = get_url(BASE_URL, 'facebook/bart-large-cnn')
payload = {
"inputs": "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.",
"parameters": {"do_sample": False},
}
output = query(API_URL, headers, payload)
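For summarization, the Inference API typically returns a list of dicts with a summary_text field; assuming that response shape:
summary = output[0]["summary_text"]  # extract the generated summary (assumes [{"summary_text": ...}])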
(2) Question Answering task
API_URL = get_url(BASE_URL, 'deepset/roberta-base-squad2')
payload = {
"inputs": {
"question": "What's my name?",
"context": "My name is Clara and I live in Berkeley.",
}
}
output = query(API_URL, headers, payload)
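For question answering, the response is typically a dict with answer, score, start, and end fields; assuming that shape:
answer = output["answer"]  # expected: "Clara" (with a confidence value in output["score"])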
(3) Sentence Similarity task
API_URL = get_url(BASE_URL, 'sentence-transformers/all-MiniLM-L6-v2')
payload = {
"inputs": {
"source_sentence": "That is a happy person",
"sentences": ["That is a happy dog", "That is a very happy person", "Today is a sunny day"],
}
}
output = query(API_URL, headers, payload)
(4) Text Classification task
API_URL = get_url(BASE_URL, 'distilbert-base-uncased-finetuned-sst-2-english')
payload = {"inputs": "I like you. I love you"}
output = query(API_URL, headers, payload)
(5) Translation task
API_URL = get_url(BASE_URL, 'Helsinki-NLP/opus-mt-ru-en')
payload = {"inputs": "Меня зовут Вольфганг и я живу в Берлине"}
output = query(API_URL, headers, payload)
(6) Audio Automatic Speech Recognition task
import json

def query_asr(API_URL, filename):
    # Read the raw audio bytes and POST them to the Inference API
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))
API_URL = get_url(BASE_URL, 'facebook/wav2vec2-base-960h')
output = query_asr(API_URL, "sample1.flac")
2. Ollama
- A tool for easily running open-source LLMs locally
- Modelfile management: model weights, configuration, datasets, and so on are managed in a single Modelfile (see the example below)
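A minimal example Modelfile (the parameter value and system prompt are illustrative); running `ollama create my-mistral -f Modelfile` then builds a local model from it:
FROM mistral
PARAMETER temperature 0.7
SYSTEM """You are a concise technical assistant."""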
Code example
pip install ollama
ollama pull mistral
import ollama
(1) Text generation
response = ollama.chat(model="mistral",
messages=[{"role": "user", "content": "Explain quantum computing"}])
output = response["message"]["content"]
(2) Maintaining conversation context
- As shown below, the flow of the conversation can be preserved!
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
]
response = ollama.chat(model="mistral",
messages=messages)
messages.append(response["message"]) # append the model's reply to the message history
messages.append({"role": "user",
"content": "Can you give me an example?"})
response = ollama.chat(model="mistral",
messages=messages)
output = response["message"]["content"]
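Streaming is also supported; a minimal sketch assuming the ollama Python client's stream=True option, which returns an iterator of chunks:
stream = ollama.chat(model="mistral",
                     messages=[{"role": "user", "content": "Explain quantum computing"}],
                     stream=True)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)  # print tokens as they arrive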
3. LangChain
A library that makes LLMs easier to work with
- Rather than just sending a prompt and getting an answer back, it helps you carry out more complex tasks
Core concepts in LangChain
- LLMs: connect a variety of models
- Prompt Templates: manage prompts systematically
- Chains: link multiple steps together to build more complex flows
- Memory: preserve conversational context (useful when building chatbots!)
- Agents & Tools: let the LLM perform search, database access, API calls, and more on its own
Code example
pip install langchain
pip install -qU langchain-openai
import os
from langchain_openai import ChatOpenAI
os.environ["OPENAI_API_KEY"] = "xxxxx"
model = ChatOpenAI(model="gpt-3.5-turbo")
Request messages
- SystemMessage: assigns the role (task)
- HumanMessage: the content of the request
from langchain_core.messages import HumanMessage, SystemMessage
messages = [
SystemMessage(content="Translate the following from English into Italian"),
HumanMessage(content="hi!"),
]
(1) Using the model
model.invoke(messages)
AIMessage(content='Ciao!', response_metadata={'token_usage': {'completion_tokens': 3, 'prompt_tokens': 20, 'total_tokens': 23}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-c3907c33-ca0f-46ff-a502-fa46a181b84b-0', usage_metadata={'input_tokens': 20, 'output_tokens': 3, 'total_tokens': 23})
(2) OutputParsers
- The output of (1) above is an AIMessage.
- If you want a plain string response instead, use the StrOutputParser!
from langchain_core.output_parsers import StrOutputParser
parser = StrOutputParser()
result = model.invoke(messages)
parser.invoke(result)
'Ciao!'
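The model and parser can also be composed into a single chain with the | operator (LCEL), so one invoke call runs both steps:
chain = model | parser
chain.invoke(messages)  # 'Ciao!'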
(3) Prompt Templates
- Templates make prompts easier to manage!
from langchain_core.prompts import ChatPromptTemplate
system_template = "Translate the following into {language}:"
prompt_template = ChatPromptTemplate.from_messages(
[("system", system_template), ("user", "{text}")]
)
result = prompt_template.invoke({"language": "italian", "text": "hi"})
result
ChatPromptValue(messages=[SystemMessage(content='Translate the following into italian:'), HumanMessage(content='hi')])
result.to_messages()
[SystemMessage(content='Translate the following into italian:'),
HumanMessage(content='hi')]
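Combining the template, model, and parser gives a complete translation chain:
chain = prompt_template | model | parser
chain.invoke({"language": "italian", "text": "hi"})  # e.g. 'ciao!'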
(4) Vector stores and retrievers
In LangChain, vector stores and retrievers provide effective search over your data
- Especially important for RAG (Retrieval-Augmented Generation)
- Why: when an LLM generates an answer, its built-in knowledge may not be enough, so it consults external DBs or documents!
- Vector stores
  - A place where documents are converted into numbers (vectorized) and stored
  - e.g., databases such as FAISS and Chroma
- Retrievers
  - Find the documents in the vector store that are relevant to a question
  - When the user asks something, the retriever fetches the right documents and passes them to the LLM
pip install langchain langchain-chroma langchain-openai
Step 1) Enter the reference documents (content) by hand
from langchain_core.documents import Document
documents = [
Document(
page_content="Dogs are great companions, known for their loyalty and friendliness.",
metadata={"source": "mammal-pets-doc"},
),
Document(
page_content="Cats are independent pets that often enjoy their own space.",
metadata={"source": "mammal-pets-doc"},
),
Document(
page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
metadata={"source": "fish-pets-doc"},
),
Document(
page_content="Parrots are intelligent birds capable of mimicking human speech.",
metadata={"source": "bird-pets-doc"},
),
Document(
page_content="Rabbits are social animals that need plenty of space to hop around.",
metadata={"source": "mammal-pets-doc"},
),
]
Step 2) Vectorize the documents and store them!
- Chroma: the module for the vector store
- OpenAIEmbeddings: the embedding model that converts documents into vectors
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
vectorstore = Chroma.from_documents(
documents,
embedding=OpenAIEmbeddings(),
)
Step 3) Similarity search against a given query
vectorstore.similarity_search("cat")
[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'})]
vectorstore.similarity_search_with_score("cat")
[(Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
0.375326931476593),
(Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
0.375326931476593),
(Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
0.4833090305328369),
(Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
0.4833090305328369)]
Retriever
- Unlike a VectorStore, which only stores data, a Retriever is a directly executable (Runnable) object
- It can therefore be chained easily in the LangChain Expression Language (LCEL)
from typing import List
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
Option 1
retriever = RunnableLambda(vectorstore.similarity_search).bind(k=1)
Option 2
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 1},
)
retriever.batch(["cat", "shark"])
[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
[Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]
VectorStoreRetriever supports three search types (sketches of the latter two follow below):
- (1) Similarity (default)
  - Finds the documents most similar to the input query.
- (2) MMR (Maximum Marginal Relevance)
  - Among the similar documents, selects a diverse set that covers different information.
- (3) Similarity Score Threshold
  - Returns only documents whose similarity is above a given threshold.
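Sketches of the latter two, via as_retriever (the k and score_threshold values are illustrative):
# (2) MMR: diverse results among the similar ones
mmr_retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 2})
# (3) return only documents scoring above the threshold
threshold_retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5},
)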
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
message = """
Answer this question using the provided context only.
{question}
Context:
{context}
"""
prompt = ChatPromptTemplate.from_messages([("human", message)])
rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | model  # 'model' is the ChatOpenAI instance defined earlier
response = rag_chain.invoke("tell me about cats")
response.content
'Cats are independent pets that often enjoy their own space.'
4. Vector DB
Vector DB: a database that stores information as vectors
Top 5: Chroma, Weaviate, Qdrant, Milvus, Faiss
Step 1) Install & create a client instance
pip install chromadb
pip install sentence_transformers
import chromadb
chroma_client = chromadb.Client()
Step 2) Create your own DB (collection)
collection = chroma_client.create_collection(name="my_collection")
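Documents can be added to and queried from this collection directly through the chromadb client (the texts and ids below are illustrative); Chroma embeds the texts with its default embedding function:
collection.add(
    documents=["This is a document about pineapples",
               "This is a document about oranges"],
    ids=["id1", "id2"],
)
results = collection.query(query_texts=["a tropical fruit"], n_results=1)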
Step 3) Using it with LangChain
pip install langchain-chroma
pip install langchain-community
pip install langchain-text-splitters
from langchain_chroma import Chroma
from langchain_community.document_loaders import CSVLoader
from langchain_community.embeddings.sentence_transformer import (
SentenceTransformerEmbeddings,
)
from langchain_text_splitters import CharacterTextSplitter
# Step 1) Load the documents
loader = CSVLoader("/content/naver-news-summarization-ko/test_10row.csv", encoding='cp949')
documents = loader.load()
# Step 2) Chunking
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
# Step 3) Load the embedding model
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# Step 4) Embed the documents and store them in the DB
db = Chroma.from_documents(docs, embedding_function)
# Step 5) Search the DB with a query
query = "코오롱 수소 밸류체인 플랫폼"
docs = db.similarity_search(query)
# Step 6) Inspect the result
print(docs[0].page_content)
5. RAG
RAG (Retrieval-Augmented Generation)
- Retrieval component: finding information
  - Given the user's question, search a large DB for relevant information
- Generation component: producing the answer
  - Generate a natural-language answer grounded in the retrieved information
Hands-on code
pip install --upgrade --quiet langchain langchain-community langchainhub langchain-openai langchain-chroma bs4
pip install -qU langchain-openai
pip install bitsandbytes==0.40.0 einops==0.6.1
pip install accelerate
Load the required packages
import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
import os
hf_auth = 'xxxx'  # Hugging Face token (used below to download the gated Llama model)
os.environ["OPENAI_API_KEY"] = "xxxxx"  # OpenAI key (used later by OpenAIEmbeddings)
from torch import cuda, bfloat16
import transformers
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
Step 1) Load the model
# Step 1) Choose the model
model_id = 'meta-llama/Meta-Llama-3-8B'
# Step 2) 4-bit quantization (QLoRA-style) configuration
bnb_config = transformers.BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=bfloat16
)
# Step 3) Load the model
model_config = transformers.AutoConfig.from_pretrained(
model_id,
use_auth_token=hf_auth
)
model = transformers.AutoModelForCausalLM.from_pretrained(
model_id,
trust_remote_code=True,
config=model_config,
quantization_config=bnb_config,
device_map='auto',
use_auth_token=hf_auth
)
Step 2) Load data from the web
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
loader = WebBaseLoader(
web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()
# print(docs[0].page_content[:500])
Step 3) Chunk the loaded data
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)
Step 4) Store the chunks
- Chroma: we will store them in a Chroma DB!
- OpenAIEmbeddings: which embedding function to use
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
Step 5) Search the DB by similarity and retrieve the results
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
retrieved_docs = retriever.invoke("What are the approaches to Task Decomposition?")
len(retrieved_docs)  # k=6, so 6 documents are returned
print(retrieved_docs[0].page_content)
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.
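The Llama model loaded in Step 1 has not been used yet; a minimal sketch of wiring it in as the generator (assuming a tokenizer for model_id and the standard rlm/rag-prompt template from the LangChain hub):
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)
gen_pipeline = transformers.pipeline(
    "text-generation", model=model, tokenizer=tokenizer,
    max_new_tokens=256, return_full_text=False,
)
from langchain_community.llms import HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=gen_pipeline)
prompt = hub.pull("rlm/rag-prompt")  # standard RAG prompt from the LangChain hub

def format_docs(docs):
    # Concatenate the retrieved documents into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
rag_chain.invoke("What is Task Decomposition?")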
Reference
- [FastCampus] 8개의 sLM모델로 끝내는 sLM 파인튜닝 ("sLM Fine-Tuning with 8 sLM Models")