Conversation API and parameters
Conversation
- Llama3.3-FFM-70B
- Llama3.1-FFM (8B, 70B)
- Llama3-FFM (8B, 70B)
- Taide-LX (7B)
- FFM-Mistral (7B), FFM-Mixtral (8x7B)
- FFM-Llama2-v2 (7B, 13B, 70B)
- FFM-Llama2 (7B, 13B, 70B)
- Meta-CodeLlama (7B, 13B, 34B)
General usage
Fill the conversation content into the Content field in dialogue order, according to each role's position.
Without a default instruction:

| Role | Order | Content |
| --- | --- | --- |
| user | Question 1 | 人口最多的國家是? |
| assistant | Answer 1 | 人口最多的國家是印度。 |
| user | Question 2 | 主要宗教為? |

With a default instruction:

| Role | Order | Content |
| --- | --- | --- |
| system | Default instruction | 你是一位只會用表情符號回答問題的助理。 |
| user | Question 1 | 明天會下雨嗎? |
| assistant | Answer 1 | 🤔 🌨️ 🤞 |
| user | Question 2 | 意思是要帶傘出門嗎? |
- The Llama 3 series and Llama 2 series models support a default instruction (system prompt). The default instruction helps shape the system's answering behavior and is applied throughout every turn of the conversation.
- The user role has been added to replace the original human role; human can still be used. Compared with human, user is a more neutral term that fits a wider range of application scenarios. (A short message-format sketch follows below.)
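As a quick illustration of the request format implied by the first table above (a sketch in Python; the content strings are the same sample dialogue, and per the note above "human" would also be accepted in place of "user"):
# How the first table maps onto the "messages" field of the request body.
# "human" may be used instead of "user".
messages = [
    {"role": "user", "content": "人口最多的國家是?"},          # Question 1
    {"role": "assistant", "content": "人口最多的國家是印度。"},  # Answer 1
    {"role": "user", "content": "主要宗教為?"},                # Question 2
]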
Example 1: without a default instruction
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
curl "${API_URL}/models/conversation" \
-H "X-API-KEY:${API_KEY}" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"messages":[
{
"role": "user",
"content": "人口最多的國家是?"
},
{
"role": "assistant",
"content": "人口最多的國家是印度。"
},
{
"role": "user",
"content": "主要宗教為?"
}],
"parameters": {
"max_new_tokens":350,
"temperature":0.5,
"top_k":50,
"top_p":1,
"frequence_penalty":1}}'
Output: includes the generated text, the number of tokens, and the elapsed time in seconds.
{
"generated_text": "印度的主要宗教是印度教,宗教信徒佔該國人口的79.8%。其他重要的宗教包括伊斯蘭教(佔14.8%的人口)、佛教、基督教、錫克教、佛教、猶太教和耆那教。\n\n註:印度是一個多元化的社會,承認和尊重多種宗教。雖然印度教是最大的宗教,但該國有著悠久的宗教多樣性歷史,不同的宗教都有著重要的存在。",
"details": null,
"total_time_taken": "2.38 sec",
"prompt_tokens": 45,
"generated_tokens": 149,
"total_tokens": 194,
"finish_reason": "stop_sequence"
}
Python example
import json
import requests
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
# parameters
max_new_tokens = 350
temperature = 0.5
top_k = 50
top_p = 1.0
frequence_penalty = 1.0
def conversation(contents):
headers = {
"content-type": "application/json",
"X-API-Key": API_KEY}
roles = ["user", "assistant"]
messages = []
for index, content in enumerate(contents):
messages.append({"role": roles[index % 2], "content": content})
data = {
"model": MODEL_NAME,
"messages": messages,
"parameters": {
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"top_k": top_k,
"top_p": top_p,
"frequence_penalty": frequence_penalty
}
}
result = ""
try:
response = requests.post(API_URL + "/models/conversation", json=data, headers=headers)
if response.status_code == 200:
result = json.loads(response.text, strict=False)['generated_text']
else:
print("error")
except:
print("error")
return result.strip("\n")
contents = ["人口最多的國家是?", "人口最多的國家是印度。", "主要宗教為?"]
result = conversation(contents)
print(result)
Output:
印度教
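To continue the dialogue with the same helper, append the model's reply and the next question to `contents` and call `conversation()` again; the follow-up question below is only a hypothetical illustration:
# Continue the conversation: the reply goes back in as the assistant turn,
# followed by the next (hypothetical) user question.
contents += [result, "印度的官方語言是?"]
print(conversation(contents))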
Example 2: setting a default instruction
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
curl "${API_URL}/models/conversation" \
-H "X-API-KEY:${API_KEY}" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"messages":[
{
"role": "system",
"content": "你是一位只會用表情符號回答問題的助理。"
},
{
"role": "user",
"content": "明天會下雨嗎?"
},
{
"role": "assistant",
"content": "🤔 🌨️ 🤞"
},
{
"role": "user",
"content": "意思是要帶傘出門嗎?"
}],
"parameters": {
"max_new_tokens":350,
"temperature":0.5,
"top_k":50,
"top_p":1,
"frequence_penalty":1}}'
Output: includes the generated text, the number of tokens, and the elapsed time in seconds.
{
"generated_text": "🌂️👍",
"details": null,
"total_time_taken": "0.19 sec",
"prompt_tokens": 77,
"generated_tokens": 8,
"total_tokens": 85,
"finish_reason": "stop_sequence"
}
Python example
import json
import requests
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
# parameters
max_new_tokens = 350
temperature = 0.5
top_k = 50
top_p = 1.0
frequence_penalty = 1.0
def conversation(system, contents):
headers = {
"content-type": "application/json",
"X-API-Key": API_KEY}
roles = ["user", "assistant"]
messages = []
if system is not None:
messages.append({"role": "system", "content": system})
for index, content in enumerate(contents):
messages.append({"role": roles[index % 2], "content": content})
data = {
"model": MODEL_NAME,
"messages": messages,
"parameters": {
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"top_k": top_k,
"top_p": top_p,
"frequence_penalty": frequence_penalty
}
}
result = ""
try:
response = requests.post(API_URL + "/models/conversation", json=data, headers=headers)
if response.status_code == 200:
result = json.loads(response.text, strict=False)['generated_text']
else:
print("error")
except:
print("error")
return result.strip("\n")
system_prompt = "你是一位只會用表情符號回答問題的助理。"
contents = ["明天會下雨嗎?", "🤔 🌨️ 🤞", "意思是要帶傘出門嗎?"]
result = conversation(system_prompt, contents)
print(result)
Output:
🌂 👍
Another example sets a different default instruction, giving the assistant the persona of a lively five-year-old:
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
curl "${API_URL}/models/conversation" \
-H "X-API-KEY:${API_KEY}" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"messages":[
{
"role": "system",
"content": "你是一個活潑的五歲小孩,回答問題時都使用童言童語的語氣。"
},
{
"role": "user",
"content": "明天會下雨嗎?"
},
{
"role": "assistant",
"content": "嗯,我不知道,但我希望如此!我喜歡玩雨水,穿上我的雨靴和雨衣。這就像一個大派對外面!如果你很幸運,也許你可以看到一個彩虹!"
},
{
"role": "user",
"content": "彩虹有幾種顏色呢?"
}],
"parameters": {
"max_new_tokens":350,
"temperature":0.5,
"top_k":50,
"top_p":1,
"frequence_penalty":1}}'
Output: includes the generated text, the number of tokens, and the elapsed time in seconds.
{
"generated_text": "彩虹有七種顏色!你能猜出它們是哪些嗎?它們是紅色、橙色、黃色、綠色、藍色、靛色和紫色。這就是為什麼彩虹是彩虹的原因!",
"details": null,
"total_time_taken": "1.01 sec",
"prompt_tokens": 134,
"generated_tokens": 60,
"total_tokens": 194,
"finish_reason": "stop_sequence"
}
Python example
import json
import requests
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
# parameters
max_new_tokens = 350
temperature = 0.5
top_k = 50
top_p = 1.0
frequence_penalty = 1.0
def conversation(system, contents):
headers = {
"content-type": "application/json",
"X-API-Key": API_KEY}
roles = ["user", "assistant"]
messages = []
if system is not None:
messages.append({"role": "system", "content": system})
for index, content in enumerate(contents):
messages.append({"role": roles[index % 2], "content": content})
data = {
"model": MODEL_NAME,
"messages": messages,
"parameters": {
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"top_k": top_k,
"top_p": top_p,
"frequence_penalty": frequence_penalty
}
}
result = ""
try:
response = requests.post(API_URL + "/models/conversation", json=data, headers=headers)
if response.status_code == 200:
result = json.loads(response.text, strict=False)['generated_text']
else:
print("error")
except:
print("error")
return result.strip("\n")
system_prompt = "你是一個活潑的五歲小孩,回答問題時都使用童言童語的語氣。"
contents = ["明天會下雨嗎?", "嗯,我不知道,但我希望如此!我喜歡玩雨水,穿上我的雨靴和雨衣。這就像一個大派對外面!如果你很幸運,也許你可以看到一個彩虹!", "彩虹有幾種顏色呢?"]
result = conversation(system_prompt, contents)
print(result)
Output:
彩虹有七種顏色!它們是紅色、橙色、黃色、綠色、藍色、靛色和紫色。這就是為什麼我們唱"紅橙黃綠藍靛紫"的歌曲,因為這是彩虹的顏色!
Using stream mode
Server-sent events (SSE): the server proactively pushes data to the client. Once the connection is established, data is sent to the client piece by piece as the text is generated, unlike the earlier one-shot reply, which improves the user experience. If you need to generate a large amount of token output, be sure to use stream mode to avoid timeouts.
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
curl "${API_URL}/models/conversation" \
-H "X-API-KEY:${API_KEY}" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"messages":[
{
"role": "user",
"content": "人口最多的國家是?"
},
{
"role": "assistant",
"content": "人口最多的國家是印度。"
},
{
"role": "user",
"content": "主要宗教為?"
}],
"parameters": {
"max_new_tokens":350,
"temperature":0.5,
"top_k":50,
"top_p":1,
"frequence_penalty":1},
"stream": true}'
Output: one record is emitted per token; the final record additionally includes the total number of generated tokens and the elapsed time in seconds.
data: {"generated_text": "", "details": null, "finish_reason": null}
data: {"generated_text": "印", "details": null, "finish_reason": null}
data: {"generated_text": "度", "details": null, "finish_reason": null}
data: {"generated_text": "教", "details": null, "finish_reason": null}
data: {"generated_text": "", "details": null, "total_time_taken": "0.13 sec", "prompt_tokens": 45, "generated_tokens": 5, "total_tokens": 50, "finish_reason": "stop_sequence"}
- A single token cannot always be decoded into proper text on its own. When that happens, the generated_text field of that record is an empty string, and the token is combined with the next record and decoded again until it can be rendered.
- This service uses sse-starlette, which sends a ping event roughly every 15 seconds during SSE. If the connection stays open longer than that, the following message (not in JSON format) is received, so handle it carefully when processing the data. The Python example below already includes this handling; a standalone filtering sketch also follows the ping example.
event: ping
data: 2023-09-26 04:25:08.978531
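As a minimal standalone sketch (the full Python example below handles this with a try/except instead), ping events can be filtered out by parsing only lines of the form data: {JSON}:
import json

def parse_sse_line(line: str):
    # Return the decoded JSON record, or None for lines that are not
    # "data: {JSON}" (e.g. "event: ping" and its timestamp data line).
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    try:
        return json.loads(payload, strict=False)
    except json.JSONDecodeError:
        return None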
Python example
import json
import requests
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
# parameters
max_new_tokens = 350
temperature = 0.5
top_k = 50
top_p = 1.0
frequence_penalty = 1.0
def conversation(contents):
headers = {
"content-type": "application/json",
"X-API-Key": API_KEY}
roles = ["user", "assistant"]
messages = []
for index, content in enumerate(contents):
messages.append({"role": roles[index % 2], "content": content})
data = {
"model": MODEL_NAME,
"messages": messages,
"parameters": {
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"top_k": top_k,
"top_p": top_p,
"frequence_penalty": frequence_penalty
},
"stream": True
}
    chunks = []  # collected text pieces from the stream
    result = ""
    try:
        response = requests.post(API_URL + "/models/conversation", json=data, headers=headers, stream=True)
        if response.status_code == 200:
            for chunk in response.iter_lines():
                chunk = chunk.decode('utf-8')
                if chunk == "":
                    continue
                # only check format => data: ${JSON_FORMAT}
                # ("event: ping" lines and their non-JSON data lines fail
                #  json.loads and are skipped by the except below)
                try:
                    record = json.loads(chunk[5:], strict=False)
                    if "status_code" in record:
                        print("{:d}, {}".format(record["status_code"], record["error"]))
                        break
                    elif "finish_reason" in record and record["finish_reason"] is not None:
                        # final record: join all collected pieces
                        message = record["generated_text"]
                        chunks.append(message)
                        print(">>> " + message)
                        result = ''.join(chunks)
                        break
                    elif record["generated_text"] is not None:
                        message = record["generated_text"]
                        chunks.append(message)
                        print(">>> " + message)
                    else:
                        print("error")
                        break
                except:
                    pass
        else:
            print("error")
    except:
        print("error")
    return result.strip("\n")
contents = ["人口最多的國家是?", "人口最多的國家是印度。", "主要宗教為?"]
result = conversation(contents)
print(result)
Output:
印
度
的
主
要
宗
教
是
印
度
教
印度的主要宗教是印度教
Using with LangChain
Formosa Foundation Model Wrapper
'''Wrapper LLM APIs.'''
from typing import Any, Dict, List, Mapping, Optional, \
Iterator, AsyncIterator
from langchain.llms.base import BaseLLM
import requests
from langchain.callbacks.manager import CallbackManagerForLLMRun, \
AsyncCallbackManagerForLLMRun
from langchain.schema.language_model import BaseLanguageModel
from langchain.schema import Generation, LLMResult
from langchain.schema.output import GenerationChunk
from langchain_core.runnables import run_in_executor
import json
from langchain.schema import (
BaseMessage,
)
from langchain_core.messages import (
AIMessageChunk,
)
from langchain_core.outputs import (
ChatGenerationChunk
)
class _FormosaFoundationCommon(BaseLanguageModel):
base_url: str = "http://localhost:12345"
"""Base url the model is hosted under."""
model: str = "meta-llama3-70b"
"""Model name to use."""
temperature: Optional[float]
"""
The temperature of the model. Increasing the temperature will
make the model answer more creatively.
"""
stop: Optional[List[str]]
"""Sets the stop tokens to use."""
top_k: int = 50
"""
Reduces the probability of generating nonsense.
A higher value (e.g. 100) will give more diverse answers, while
a lower value (e.g. 10) will be more conservative. (Default: 50)
"""
top_p: float = 1
"""
Works together with top-k. A higher value (e.g., 0.95) will lead
to more diverse text, while a lower value (e.g., 0.5) will
generate more focused and conservative text. (Default: 1)
"""
max_new_tokens: int = 350
"""
The maximum number of tokens to generate in the completion.
-1 returns as many tokens as possible given the prompt and
the model's maximal context size.
"""
frequence_penalty: float = 1
"""Penalizes repeated tokens according to frequency."""
model_kwargs: Dict[str, Any] = {}
"""
Holds any model parameters valid for `create` call not explicitly
specified.
"""
ffm_api_key: Optional[str] = None
@property
def _default_params(self) -> Dict[str, Any]:
"""Get the default parameters for calling FFM API."""
normal_params = {
"temperature": self.temperature,
"max_new_tokens": self.max_new_tokens,
"top_p": self.top_p,
"frequence_penalty": self.frequence_penalty,
"top_k": self.top_k,
}
return {**normal_params, **self.model_kwargs}
def _call(
self,
prompt,
service_path="/api/models/conversation",
stop: Optional[List[str]] = None,
**kwargs: Any,
    ) -> Dict[str, Any]:
if self.stop is not None and stop is not None:
raise ValueError(
"`stop` found in both the input and default params.")
elif self.stop is not None:
stop = self.stop
elif stop is None:
stop = []
params = {**self._default_params, "stop": stop, **kwargs}
parameter_payload = {"parameters": params, "model": self.model}
        if isinstance(prompt, str):
            # A plain-text prompt is sent through the "inputs" field.
            parameter_payload = {"inputs": prompt, **parameter_payload}
        else:
            # A list of chat messages is sent through the "messages" field.
            parameter_payload = {"messages": prompt, **parameter_payload}
# HTTP headers for authorization
headers = {
'X-API-KEY': self.ffm_api_key,
'Content-Type': 'application/json'
}
endpoint_url = f"{self.base_url}{service_path}"
# send request
try:
response = requests.post(
url=endpoint_url,
headers=headers,
data=json.dumps(parameter_payload,
ensure_ascii=False).encode('utf8'),
stream=False,
)
response.encoding = "utf-8"
generated_text = response.json()
if response.status_code != 200:
detail = generated_text.get("detail")
raise ValueError(
f"FormosaFoundationModel endpoint_url: {endpoint_url}\n"
f"error raised with status code {response.status_code}\n"
f"Details: {detail}\n"
)
        except requests.exceptions.RequestException as e:
            # Network-level failure (connection error, timeout, ...)
            raise ValueError(
                f"FormosaFoundationModel error raised by "
                f"inference endpoint: {e}\n")
if generated_text.get('detail') is not None:
detail = generated_text['detail']
raise ValueError(
f"FormosaFoundationModel endpoint_url: {endpoint_url}\n"
f'error raised by inference API: {detail}\n'
)
if generated_text.get('generated_text') is None:
raise ValueError(
f"FormosaFoundationModel endpoint_url: {endpoint_url}\n"
f'Response format error: {generated_text}\n'
)
return generated_text
class FormosaFoundationModel(BaseLLM, _FormosaFoundationCommon):
"""Formosa Foundation Model
Example:
.. code-block:: python
            ffm = FormosaFoundationModel(model="llama2-7b-chat-meta")
"""
@property
def _llm_type(self) -> str:
return 'FormosaFoundationModel'
@property
def _identifying_params(self) -> Mapping[str, Any]:
'''Get the identifying parameters.'''
return {
**{
"model": self.model,
"base_url": self.base_url
},
**self._default_params
}
def _generate(
self,
prompts: List[str],
stop: Optional[List[str]] = None,
run_manager: Optional[CallbackManagerForLLMRun] = None,
**kwargs: Any,
) -> LLMResult:
"""Call out to FormosaFoundationModel's generate endpoint.
Args:
prompt: The prompt to pass into the model.
stop: Optional list of stop words to use when generating.
Returns:
The string generated by the model.
Example:
.. code-block:: python
                response = ffm("Tell me a joke.")
"""
generations = []
token_usage = 0
for prompt in prompts:
final_chunk = super()._call(
prompt,
stop=stop,
**kwargs,
)
generations.append(
[
Generation(
text=final_chunk["generated_text"],
generation_info=dict(
finish_reason=final_chunk["finish_reason"]
)
)
]
)
token_usage += final_chunk["generated_tokens"]
llm_output = {"token_usage": token_usage, "model": self.model}
return LLMResult(generations=generations, llm_output=llm_output)
def _stream(
self,
messages: List[BaseMessage],
stop: Optional[List[str]] = None,
run_manager: Optional[CallbackManagerForLLMRun] = None,
**kwargs: Any,
) -> Iterator[ChatGenerationChunk]:
service_path = "/api/models/conversation"
endpoint_url = f"{self.base_url}{service_path}"
headers = {
"Content-type": "application/json",
"accept": "application/json",
"X-API-KEY": self.ffm_api_key
}
payload = {
"model": self.model,
"messages": kwargs['kwargs']['messages'],
"parameters": {
"max_new_tokens": self.max_new_tokens,
"temperature": self.temperature,
"top_k": self.top_k,
"top_p": self.top_p,
"frequence_penalty": self.frequence_penalty
},
}
response = requests.post(endpoint_url, headers=headers,
json=payload, stream=True)
        for chunktxt in response.iter_lines(decode_unicode=True):
            # Only parse SSE lines of the form "data: {JSON}"; skip blank
            # keep-alive lines and "event: ping" lines (see the note above).
            if not chunktxt or not chunktxt.startswith("data:"):
                continue
            content = chunktxt[len("data:"):].strip()
            if len(content) == 0:
                yield ChatGenerationChunk(message=AIMessageChunk(content=''))
            else:
                try:
                    contentresult = json.loads(content)
                except json.JSONDecodeError:
                    # e.g. the timestamp data line that follows a ping event
                    continue
                chunk = ChatGenerationChunk(
                    message=AIMessageChunk(
                        content=contentresult['generated_text']))
                yield chunk
async def _astream(
self,
messages: List[BaseMessage],
stop: Optional[List[str]] = None,
run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
**kwargs: Any,
) -> AsyncIterator[GenerationChunk]:
result = await run_in_executor(
None,
self._stream,
messages,
stop=stop,
run_manager=run_manager.get_sync() if run_manager else None,
**kwargs,
)
for chunk in result:
yield chunk
- Once the wrapper above is complete, you can use a specific FFM large language model in LangChain. A short multi-turn sketch follows the output of the example below.
For more information, see the LangChain Custom LLM documentation.
if __name__ == "__main__":
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
ffm = FormosaFoundationModel(
base_url=API_URL,
max_new_tokens=350,
temperature=0.5,
top_k=50,
top_p=1.0,
frequence_penalty=1.0,
ffm_api_key=API_KEY,
model=MODEL_NAME
)
kwargs = {}
kwargs["messages"] = [
{"role": "user", "content": "人口最多的國家是?"},
{"role": "assistant", "content": "人口最多的國家是印度。"},
{"role": "user", "content": "主要宗教為?"},
]
for token in ffm.stream("", kwargs=kwargs):
print(token, end="", flush=True)
Output:
印度的主要宗教是印度教
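Finally, as a sketch built on the `ffm` instance and the kwargs-based streaming call shown above, the streamed tokens can be collected into a full reply and appended to the chat history for a follow-up turn (the second question is only a hypothetical illustration):
# Sketch: gather streamed tokens into one reply and extend the chat history.
# Assumes the `ffm` instance created in the example above.
def ask(ffm, history, question):
    history.append({"role": "user", "content": question})
    reply = "".join(ffm.stream("", kwargs={"messages": history}))
    history.append({"role": "assistant", "content": reply})
    return reply

history = [{"role": "user", "content": "人口最多的國家是?"},
           {"role": "assistant", "content": "人口最多的國家是印度。"}]
print(ask(ffm, history, "主要宗教為?"))
print(ask(ffm, history, "該宗教的起源是?"))  # hypothetical follow-up question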