LLM Fine-tuning 실습 프로젝트 - Part 3

Import Packages
PDF to Image
OCR
Chunking
LLM을 통한 (질문 & 답안) 생성
전처리
저장하기

1. Import Packages

import time
import pandas as pd
import cv2
import json
import matplotlib.pyplot as plt
import shutil
import os
import random
try:
 from PIL import Image
except ImportError:
 import Image

import requests
import uuid
import time
import json
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 5000)

fitz 패키지: pdf \(\rightarrow\) img 변환 위해

import fitz
import os

2. PDF to Image

Step 1) pdf를 이미지로 저장

file_path= './insurance_biz.pdf'
doc = fitz.open(file_path)
folder_path='./'
for i, page in enumerate(doc):
    #if i == 0: # 첫번째 페이지만 이미지로 변환하는 경우
    img = page.get_pixmap()
    img.save(f"{folder_path}/{i:04d}.jpg")

Step 2) 저장된 img 리스트 확인

img_lst= [img for img in os.listdir() if img.endswith('.jpg')]
img_lst=sorted(img_lst)
print(img_list[0])

0000.jpg

3. OCR

from openai import OpenAI
import json
import requests
import uuid
import time
import pandas as pd
 
# CLOVA OCR 호출
api_url = 'your_api_url'
secret_key = 'your_api_key'

Procedure

Step 1) JPG 파일을 불러옴
Step 2) OCR 통해 텍스트로
Step 3) 텍스트를 JSON으로 저장

string_result = ''

for image_file in img_lst:
    request_json = {
        'images': [
            {
                'format': 'jpg',
                'name': 'demo'
            }
        ],
        'requestId': str(uuid.uuid4()),
        'version': 'V2',
        'timestamp': int(round(time.time() * 1000))
    }

    payload = {'message': json.dumps(request_json).encode('UTF-8')}
    files = [
    ('file', open(image_file,'rb'))
    ]
    headers = {
    'X-OCR-SECRET': secret_key
    }

    response = requests.request("POST", api_url, headers=headers, data = payload, files = files)

    json_data = response.json()


    for i in json_data['images'][0]['fields']:
        if i['lineBreak'] == True:
            linebreak = ' '
        else:
            linebreak = ' '
        string_result = string_result + i['inferText'] + linebreak        
    print(image_file)

    json_file_path = f'./json_file{image_file}.json'
    with open(json_file_path, 'w', encoding='utf-8') as file:
        json.dump(json_data, file, ensure_ascii=False, indent=4)

생성된 text 결과 (string_result) 확인해보기

string_result=string_result.rstrip()
string_result

'보험약관 카드회원단체상해보험 목차 보통약관 특별약관 해외여행중 배상책임 특별약관 국내여행중 공공교통승용구탑승중 배상책임 특...'

4. Chunking

방법 1) Sliding window

def sliding_window_chunking(text, window_size, step_size):
    chunks = []
    for i in range(0, len(text) - window_size + 1, step_size):
        chunk = text[i:i + window_size]
        chunks.append(chunk)
    if len(text) % window_size != 0:
        chunks.append(text[-window_size:])
    return chunks

text = "이 텍스트는 윈도우 청킹을 사용하여 나누어집니다. 윈도우 청킹은 텍스트의 연속성을 유지하면서 적절한 크기의 청크로 나누는 방법입니다."
window_size = 50
step_size = 10

chunks = sliding_window_chunking(text, window_size, step_size)
for i, chunk in enumerate(chunks, 1):
    print(f"청크 {i}: {chunk}")

방법 2) 문장 단위 분할

kss 패키지 사용

from kss import Kss

split_sentences = Kss("split_sentences")
split_result=split_sentences(string_result)
print(len(split_result))

결과 확인하기

split_result

['보험약관 카드회원단체상해보험 목...',
'xxxx',
'xxxx',
...]

방법 3) token 단위 분할

from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core import Document

document = [Document(text=string_result)]
chunker = TokenTextSplitter(chunk_size=512, chunk_overlap=0)

result = chunker.get_nodes_from_documents(documents=document)
chunked_texts =  list(map(lambda x: x.text, result))

5. LLM 통한 (질문 & 답안) 생성

목적: 내용을 주고, 해당 내용을 통해 만들 법한 질문지 &답지 생성하기

위에서 생성한 (Token 단위로 분할한) 텍스트 데이터 불러오기

df6= pd.read_csv('token.csv')
df6['len']=df6['contents'].str.len()
df6.head()

deepinfra 들어가서 새로운 API key를 발급받기

openai = OpenAI(
    api_key="",
    base_url="https://api.deepinfra.com/v1/openai",)

앞서 저장했던 (보험 문서) text를 사용하여…

\(\rightarrow\) LLM을 통한 질문지 & 답지 생성하기!

def generate_training_data(corpus,openai):
    
    evolving= f"나는 너가 질문 작성자로 활동하기를 원해. 너의 목표는 주어진 문단을 활용하여 AI가 학습할 질문을 만드는거야. 질문은 합리적이어야 하고, 일상에서 나올 수 있는 질문이어야하며, 인간이 이해하고 응답할 수 있어야 해.\n\
        그리고 답변은, 근거를 들어서 논리적으로 답변을 해야해. 또한,질문과 답변을 다섯개를 만들어줘.\n\
        최대한 질문이 장황해지지 않도록 노력해야 하며, 추론이나 일상생활에서 생길 수 있는 문제들을 담아야해. 무조건 한국어로 작성해줘. 생성된 질문에는 질문이나 답변에 숫자 없이 '#질문:'이라는 단어로 시작해야하고, 생성된 답변에는 '#답변:'이라는 단어로 시작해야해.\n\
        주어진 문단: {corpus}\n\n"
    
    chat_completion = openai.chat.completions.create(
        model='microsoft/WizardLM-2-8x22B',
        messages=[{"role:":"user", "content" : evolving}],
        temperature=0.7,
        max_tokens=8192)
    
    return chat_completion.choices[0].message.content

특정 내용에 대해, 생성된 5개의 (질문 & 답변) 예시:

example1=generate_training_data(df6['contents'][1],openai)
example1.lstrip()

"  지급사유) 제1호 '사망' 에 해당하는 보험금은 사망일로부터 60일 이내에 지급합니다. 단, 사망일이 확인되지 않는 경우에는 사망이 확인된 날로부터 60일 이내에 지급합니다.\n\n#질문: 보험개발원이 공시하는 보험계약대출이율은 어떤 상황에 적용되나요?\n#답변: 보험개발원이 공시하는 보험계약대출이율은 보험회사가 보험금의 지급이나 보험료의 환급을 지연할 때 적용됩니다. 이는 정기적으로 산출되어 공시되는 이율로, 지연된 금액에 대해 적용되어 추가로 지급해야 할 금액을 계산하는 기준이 됩니다.\n\n#질문: 보험기간이란 무엇인가요?\n#답변: 보험기간은 보험 계약에 따라 보장을 받는 기간을 말합니다. 즉, 해당 기간 동안 보험 약관에 따라 보험금을 받을 수 있는 일정한 기간을 의미합니다.\n\n#질문: 영업일이 무엇을 포함하며, 어떤 날짜는 영업일에서 제외되나요?\n#답변: 영업일은 회사가 영업점에서 정상적으로 영업하는 날을 말하며, 토요일, 관공서의 공휴일에 관한 규정에 따른 공휴일 및 근로자의 날을 제외합니다. 따라서 이러한 날짜는 영업일로 계산되지 않습니다.\n\n#질문: 해외여행 보험에서 사망보험금을 받을 수 있는 조건은 무엇인가요?\n#답변: 해외여행 보험에서 사망보험금을 받기 위한 조건은 보험증권에 기재된 해외여행을 목적으로 주거지를 출발하여 여행을 마치고 주거지에 도착할 때까지의 기간 동안 발생한 상해의 직접적인 결과로 사망한 경우입니다. 단, 질병으로 인한 사망은 제외됩니다.\n\n#질문: 보험금 지급에 있어서 실종선고를 받은 경우와 관공서에서 사망을 통보한 경우, 어떤 기준을 따라야 하나요?\n#답변: 실종선고를 받은 경우, 법원에서 인정한 실종기간이 끝나는 시점에서 사망한 것으로 봅니다. 관공서에서 수해, 화재나 그 밖의 재난을 조사하고 사망을 통보하는 경우에는 가족관계등록부에 기재된 사망연월일을 기준으로 합니다. 이러한 기준에 따라 보험금 지급 사유가 인정되며, 해당 보험금이 지급됩니다." ...

print(len(example1.lstrip().split('#'))

11 = 1개의 corpus + 5개의 질문 + 5개의 답

6. 전처리

def remove_str(tmp_str):
    if '질문:' in tmp_str:
        tmp_str= tmp_str[3:]
        tmp_str=tmp_str.lstrip()
        tmp_str=tmp_str.rstrip()
    elif '답변:' in tmp_str:
        tmp_str= tmp_str[3:]
        tmp_str=tmp_str.lstrip()
        tmp_str=tmp_str.rstrip()
    elif '질문1:' in tmp_str:
        tmp_str= tmp_str[4:]
        tmp_str=tmp_str.lstrip()
        tmp_str=tmp_str.rstrip()
    elif '답변1:' in tmp_str:
        tmp_str= tmp_str[4:]
        tmp_str=tmp_str.lstrip()
        tmp_str=tmp_str.rstrip()
    elif '질문2:' in tmp_str:
        tmp_str= tmp_str[4:]
        tmp_str=tmp_str.lstrip()
        tmp_str=tmp_str.rstrip()
    elif '답변2:' in tmp_str:
        tmp_str= tmp_str[4:]
        tmp_str=tmp_str.lstrip()
        tmp_str=tmp_str.rstrip()
    elif '질문3:' in tmp_str:
        tmp_str= tmp_str[4:]
        tmp_str=tmp_str.lstrip()
        tmp_str=tmp_str.rstrip()
    elif '답변3:' in tmp_str:
        tmp_str= tmp_str[4:]
        tmp_str=tmp_str.lstrip()
        tmp_str=tmp_str.rstrip()
    elif '질문4:' in tmp_str:
        tmp_str= tmp_str[4:]
        tmp_str=tmp_str.lstrip()
        tmp_str=tmp_str.rstrip()
    elif '답변4:' in tmp_str:
        tmp_str= tmp_str[4:]
        tmp_str=tmp_str.lstrip()
        tmp_str=tmp_str.rstrip()
    elif '질문5:' in tmp_str:
        tmp_str= tmp_str[4:]
        tmp_str=tmp_str.lstrip()
        tmp_str=tmp_str.rstrip()
    elif '답변5:' in tmp_str:
        tmp_str= tmp_str[4:]
        tmp_str=tmp_str.lstrip()
        tmp_str=tmp_str.rstrip()
    else: 
        print('############## warning!!!!!!!!!!! ##############')
        print(tmp_str)
        print('#################################################')
        tmp_str=False

    return tmp_str

q_list=[]
a_list=[]

for i in range(len(df6)):
    

    tmp=True
    chunk=df6['contents'][i]

    while tmp:
        res= generate_training_data(chunk,openai)

        res_lst=res.lstrip().split('#')

        if (len(res_lst)==11) and '질문:' in res_lst[1] and '답변:' in res_lst[2] and '질문:' in res_lst[3] and '답변:' in res_lst[4] and '질문:' in res_lst[5] and '답변:' in res_lst[6] and \
        '질문:' in res_lst[7] and '답변:' in res_lst[8] and '질문:' in res_lst[9] and '답변:' in res_lst[10]: 
            q1 = remove_str(res_lst[1])
            q2 = remove_str(res_lst[3])
            q3 = remove_str(res_lst[5])
            q4 = remove_str(res_lst[7])
            q5 = remove_str(res_lst[9])

            a1 = remove_str(res_lst[2])
            a2 = remove_str(res_lst[4])
            a3 = remove_str(res_lst[6])
            a4 = remove_str(res_lst[8])
            a5 = remove_str(res_lst[10])

            if q1!=False and q2!=False and q3!=False and q4!=False and q5!=False and a1!=False and a2!=False and a3!=False and a4!=False and a5!=False:
                tmp=False
                q_list+=[q1,q2,q3,q4,q5]
                a_list+=[a1,a2,a3,a4,a5]
            else:
                print('something wrong!!!')

        else:
            print(res)

#질문1: 만약 어떤 사람이 심신상실로 인해 자신을 해쳤을 때, 그 사람은 후유장해보험금을 받을 수 있나요?
#답변1: 네, 피보험자가 심신상실 등으로 자유로운 의사결정을 할 수 없는 상태에서 자신을 해친 경우에는 보험금을 지급합니다. 따라서 심신상실로 인해 자신을 해친 경우라면 후유장해보험금을 받을 수 있습니다.

#질문2: 전쟁 중에 상해를 입어 후유장해가 발생했을 때, 해당 보험사는 보험금을 지급해야 하나요?
#답변2: 아니요, 전쟁이 발생한 경우에는 보험사가 보험금을 지급하지 않습니다. 전쟁은 보험금을 지급하지 않는 사유 중 하나로, 제5조에 따르면 전쟁, 외국의 무력행사, 혁명, 내란, 사변, 폭동에 참여하거나 이에 의해 발생한 상해는 보험금 지급 사유에서 제외됩니다.

#질문3: 어떤 사람이 해외 여행 중에 단순육체노동자로서 일하다가 상해를 입었을 때, 그 사람은 보험금을 받을 수 있나요?
#답변3: 아니요, 피보험자가 해외 여행 중에 단순육체노동자로서 일하다가 상해를 입었을 경우, 회사는 보험금을 지급하지 않습니다. 해당 행위는 제3조에 따라 보험금 지급 사유에서 제외됩니다.
...

7. 저장하기

train_df= pd.DataFrame({'question':q_list,'response':a_list})
train_df.to_excel('insurance_train.xlsx',index=False)

(추후) evolving을 위해 원본 corpus도 저장하기

cor_lst=[]

for i in range(len(df6)):
    cor_tmp = df6['contents'][i]
    cor_lst+=[cor_tmp, cor_tmp, cor_tmp, cor_tmp, cor_tmp]

train_df['corpus']=cor_lst
train_df.to_excel('insurance_train_corpus.xlsx',index=False)

Reference

https://fastcampus.co.kr/data_online_gpu

Twitter Facebook LinkedIn

LLM Fine-tuning 실습 프로젝트 - Part 3

Seunghan Lee

LLM Fine-tuning 실습 프로젝트 - Part 3

Contents

1. Import Packages

2. PDF to Image

3. OCR

4. Chunking

방법 1) Sliding window

방법 2) 문장 단위 분할

방법 3) token 단위 분할

5. LLM 통한 (질문 & 답안) 생성

6. 전처리

7. 저장하기

Reference

You May Also Enjoy