Semantic Search with LLMs
쉽고 빠르게 익히는 실전 LLM (https://product.kyobobook.co.kr/detail/S000212147276)
Exercise 1. Uploading information to and retrieving it from Pinecone
(Reference: https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/notebooks/02_semantic_search.ipynb)
Procedures
- Step 1) Install & import the required packages
- Step 2) Load Pinecone (the VectorDB)
- Step 3) Get embeddings for text (from the LLM)
- Step 4) Prepare the upload to Pinecone
- Step 5) Upload to Pinecone
- Step 6) Retrieve the top-K entries most similar to a given text from Pinecone
- Step 7) Practice with real data
Step 1) Install & import the required packages
!pip install pinecone-client openai sentence-transformers tiktoken datasets
from openai import OpenAI
from datetime import datetime
import hashlib
import re
import os
from tqdm import tqdm
import numpy as np
from torch import nn
import logging
# (1) Model (LLM)
from sentence_transformers import CrossEncoder
# (2) VectorDB
from pinecone import Pinecone, ServerlessSpec
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
Step 2) Load Pinecone (the VectorDB)
pinecone_key = os.environ.get('PINECONE_API_KEY')
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY")
)
NAMESPACE = 'default'
ENGINE = 'text-embedding-3-large' # has vector size 3072
pc = Pinecone(
    api_key=pinecone_key
)
Step 3) Get embeddings for text (from the LLM)
# v1) Multiple inputs (list)
def get_embeddings(texts, engine=ENGINE):
    response = client.embeddings.create(
        input=texts,
        model=engine
    )
    return [d.embedding for d in list(response.data)]
# v2) Single input (test)
def get_embedding(text, engine=ENGINE):
    return get_embeddings([text], engine)[0]
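A quick sanity check of the single-input helper (a minimal sketch; it assumes the OpenAI API key above is set, and that text-embedding-3-large returns 3072-dimensional vectors):
# Embed a single string and inspect the vector length
vec = get_embedding('hello world')
print(type(vec), len(vec))  # expect a list of 3072 floats for text-embedding-3-large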
Step 4) Prepare the upload to Pinecone
First, let's create an index (before uploading any embeddings)!
- It is fine to think of an index as a kind of "dictionary" that holds our entries.
INDEX_NAME = 'semantic-search-test'
if INDEX_NAME not in pc.list_indexes().names():
    print(f'Creating index {INDEX_NAME}')
    pc.create_index(
        name=INDEX_NAME,  # The name of the index
        dimension=3072,   # The dimensionality of the vectors for our OpenAI embedder
        metric='cosine',  # The similarity metric to use when searching the index
        spec=ServerlessSpec(
            cloud='aws',
            region='us-west-2'
        )
    )
index = pc.Index(name=INDEX_NAME)
index.describe_index_stats()
{'dimension': 3072,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
Now prepare the (ID, embedding, metadata) tuples that will actually be uploaded!
def my_hash(s):
    return hashlib.md5(s.encode()).hexdigest()

def prepare_for_pinecone(texts, engine=ENGINE):
    # 1. Timestamp (at upload time)
    now = datetime.utcnow()
    # 2. Embedding vectors
    embeddings = get_embeddings(texts, engine=engine)
    # 3. Tuples of (hash, embedding, metadata)
    return [
        (my_hash(text),
         embedding,
         dict(text=text, date_uploaded=now)
        )
        for text, embedding in zip(texts, embeddings)
    ]
Example) We can confirm that the three pieces of information to be uploaded are printed correctly.
texts = ['hi']
_id, embedding, metadata = prepare_for_pinecone(texts)[0]
print('ID: ',_id, '\nLEN: ', len(embedding), '\nMETA:', metadata)
ID: 49f68a5c8493ec2c0bf489821c21fc3b
LEN: 3072
META: {'text': 'hi', 'date_uploaded': datetime.datetime(2024, 7, 1, 16, 33, 16, 847362)}
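Because the ID is just the MD5 hash of the text, the same text always maps to the same ID, so uploading it again will upsert (overwrite) the existing vector rather than create a duplicate. A quick check:
# Deterministic IDs: hashing the same text twice yields the same ID (idempotent uploads)
print(my_hash('hi'))         # 49f68a5c8493ec2c0bf489821c21fc3b
print(my_hash('hi') == _id)  # True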
Step 5) Upload to Pinecone
More precisely, upload (upsert) into the Pinecone index we created earlier.
def upload_texts_to_pinecone(texts,
                             namespace=NAMESPACE,
                             batch_size=None,
                             show_progress_bar=False):
    total_upserted = 0
    if not batch_size:
        batch_size = len(texts)
    _range = range(0, len(texts), batch_size)
    for i in tqdm(_range) if show_progress_bar else _range:
        batch = texts[i: i + batch_size]
        prepared_texts = prepare_for_pinecone(batch)
        # upsert(): upload to Pinecone
        total_upserted += index.upsert(
            vectors=prepared_texts,
            namespace=namespace
        )['upserted_count']
    return total_upserted
upload_texts_to_pinecone(texts)
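For longer lists of texts, batch_size and show_progress_bar keep the embedding calls and upserts in manageable chunks. A minimal sketch (the example strings below are made up):
# Hypothetical example: upsert three short texts in batches of 2, with a progress bar
upload_texts_to_pinecone(
    ['first example text', 'second example text', 'third example text'],
    batch_size=2,
    show_progress_bar=True
)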
Step 6) Retrieve the top-K entries most similar to a given text from Pinecone
Retrieve (query) from the Pinecone index we created earlier.
def query_from_pinecone(query,
                        top_k=3,
                        include_metadata=True):
    query_embedding = get_embedding(query, engine=ENGINE)
    return index.query(
        vector=query_embedding,
        top_k=top_k,
        namespace=NAMESPACE,
        include_metadata=include_metadata  # gets the metadata (dates, text, etc.)
    ).get('matches')
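A quick usage sketch against the single 'hi' vector upserted above (the exact score will vary):
# Print the ID, similarity score, and stored text of each match
for match in query_from_pinecone('hello there', top_k=1):
    print(match['id'], round(match['score'], 3), match['metadata']['text'])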
Conversely, entries can also be deleted!
def delete_texts_from_pinecone(texts,
                               namespace=NAMESPACE):
    hashes = [hashlib.md5(text.encode()).hexdigest() for text in texts]
    return index.delete(ids=hashes, namespace=namespace)
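For example, the 'hi' vector upserted earlier can be removed by passing the same text (a small sketch):
# Deletes the vector whose ID is md5('hi') from the default namespace
delete_texts_from_pinecone(['hi'])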
Step 7) Practice with real data
from datasets import load_dataset
# rename test -> train and val -> test (as we will use them later in this chapter)
dataset = load_dataset("xtreme", "MLQA.en.en")
dataset['train'] = dataset['test']
dataset['test'] = dataset['validation']
del dataset['validation']
dataset
DatasetDict({
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 1148
    })
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11590
    })
})
print(len(dataset['train']))
print(len(dataset['test']))
11590
1148
dataset['train'][0]
{'id': 'a4968ca8a18de16aa3859be760e43dbd3af3fce9',
'title': 'Area 51',
'context': 'In 1994, five unnamed civilian contractors and the widows of contractors Walter Kasza and Robert Frost sued the USAF and the United States Environmental Protection Agency. Their suit, in which they were represented by George Washington University law professor Jonathan Turley, alleged they had been present when large quantities of unknown chemicals had been burned in open pits and trenches at Groom. Biopsies taken from the complainants were analyzed by Rutgers University biochemists, who found high levels of dioxin, dibenzofuran, and trichloroethylene in their body fat. The complainants alleged they had sustained skin, liver, and respiratory injuries due to their work at Groom, and that this had contributed to the deaths of Frost and Kasza. The suit sought compensation for the injuries they had sustained, claiming the USAF had illegally handled toxic materials, and that the EPA had failed in its duty to enforce the Resource Conservation and Recovery Act (which governs handling of dangerous materials). They also sought detailed information about the chemicals to which they were allegedly exposed, hoping this would facilitate the medical treatment of survivors. Congressman Lee H. Hamilton, former chairman of the House Intelligence Committee, told 60 Minutes reporter Lesley Stahl, "The Air Force is classifying all information about Area 51 in order to protect themselves from a lawsuit."',
'question': 'Who analyzed the biopsies?',
'answers': {'answer_start': [457],
'text': ['Rutgers University biochemists']}}
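MLQA follows the SQuAD format, so answer_start is the character offset of the answer inside context. A quick check (a sketch, assuming the standard SQuAD offset convention):
sample = dataset['train'][0]
start = sample['answers']['answer_start'][0]
answer = sample['answers']['text'][0]
# The answer text should match the corresponding slice of the context
print(sample['context'][start:start + len(answer)] == answer)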
Upload the texts from the test dataset in batches of 32.
unique_passages = list(set(dataset['test']['context']))
for idx in tqdm(range(0, len(unique_passages), 32)):
    passages = unique_passages[idx:idx + 32]
    upload_texts_to_pinecone(passages)
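To confirm the upload, the index stats can be checked again; total_vector_count should now roughly match the number of unique passages (the exact number depends on the dataset):
# Compare the number of unique passages with the vector count now in the index
print(len(unique_passages))
index.describe_index_stats()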
Retrieve information similar to a query from the Pinecone index we just populated.
query_from_pinecone('Does an infection for Sandflies go away over time?')
[{'id': '2f90090e21f19450887d5f3ff781e541',
'metadata': {'date_uploaded': '2024-02-04T15:47:20.914703',
'text': 'Pappataci fever is prevalent in the subtropical zone of '
'the Eastern Hemisphere between 20°N and 45°N, '
'particularly in Southern Europe, North Africa, the '
'Balkans, Eastern Mediterranean, Iraq, Iran, Pakistan, '
'Afghanistan and India.The disease is transmitted by the '
'bites of phlebotomine sandflies of the Genus '
'Phlebotomus, in particular, Phlebotomus papatasi, '
'Phlebotomus perniciosus and Phlebotomus perfiliewi. The '
'sandfly becomes infected when biting an infected human '
'in the period between 48 hours before the onset of '
'fever and 24 hours after the end of the fever, and '
'remains infected for its lifetime. Besides this '
'horizontal virus transmission from man to sandfly, the '
'virus can be transmitted in insects transovarially, '
'from an infected female sandfly to its '
'offspring.Pappataci fever is seldom recognised in '
'endemic populations because it is mixed with other '
'febrile illnesses of childhood, but it is more '
'well-known among immigrants and military personnel from '
'non-endemic regions.'},
'score': 0.436064631,
'values': []},
{'id': '00661b04eb84a4664717245513ea30cd',
...