5. Inferencing
Inference Specification
To use an LLM through the API, you submit a request that includes the input data and your API key, and receive a response containing the output generated by the model. Before using the API, configure the environment variables as outlined below. This configuration can be used in both AFS Cloud and CCS (Container Computing Service) for GAI (Generative Artificial Intelligence) Service.
export API_ENDPOINT='https://*****.afs.twcc.ai' # copy this information from your AFS Cloud or CCS for GAI service
export API_KEY='********-****-****-****-********' # copy this information after you log in to the FFM Chat interface
Replace the star (*) markers with your own values.
FFM-Bloom Inferencing
An input string is required for inference.
export ENDPOINT_FFM=${API_ENDPOINT}
export API_KEY_FFM=${API_KEY}
export MODEL_NAME=ffm-bloom-7b-chat # VERIFY THIS in your CHAT interface, please!
curl -X POST "${ENDPOINT_FFM}/text-generation/api/models/generate" \
-H "X-API-KEY: ${API_KEY_FFM}" \
-H 'Content-Type: application/json; charset=utf-8' \
-d '{"inputs":"What is the highest mountain in Taiwan?", "model": "${MODEL_NAME}"}'
The response will look like the following:
{
"generated_text":"\n\nMt. Yushan, at 3,952 m (13,406 ft)",
"total_time_taken":"2.46 sec",
"generated_tokens":18
}
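If you only need the generated text, you can pipe the response through jq (assuming jq is installed); this extraction step is a convenience sketch, not part of the API itself:
# Extract only the generated_text field from the JSON response.
curl -s -X POST "${ENDPOINT_FFM}/text-generation/api/models/generate" \
-H "X-API-KEY: ${API_KEY_FFM}" \
-H 'Content-Type: application/json; charset=utf-8' \
-d '{"inputs":"What is the highest mountain in Taiwan?", "model": "'"${MODEL_NAME}"'"}' \
| jq -r '.generated_text'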
Parameters for Inferencing
This API accepts the following parameters:
max_new_tokens
int (optional), default value is 20, value range is (0, ∞). The maximum number of tokens that can be generated at once.
temperature
float (optional), default value is 1.0, value range is (0, ∞). Controls the randomness and diversity of the generated text. A higher value leads to more creative and diverse text, while a lower value results in more conservative text closer to the model's training data.
top_p
float (optional), default value is 1.0, value range is (0, 1]. When the cumulative probability of candidate tokens reaches or exceeds this value, the selection of additional candidate tokens stops. A higher value leads to more diverse generated text, while a lower value results in more conservative text.
top_k
int (optional), default value is 50, value range is [1, 100]. Limits the model to choosing only from the top K tokens with the highest probabilities. A higher value leads to more diverse generated text, while a lower value results in more conservative text.
frequence_penalty
float (optional), default value is 1.0, value range is (0, ∞). Controls the probability of generating repeated tokens; a higher value reduces the frequency of repeated tokens. (Note the spelling "frequence_penalty": this is the name the API expects.)
Here is an example that sets these parameters:
curl -X POST "${ENDPOINT_FFM}/text-generation/api/models/generate" \
-H "X-API-KEY:${API_KEY}" \
-H 'Content-Type: application/json; charset=utf-8' \
-d '{
"inputs":"Plan one day trip in Taipei",
"parameters":{
"max_new_tokens":350,"temperature":0.5,"top_k":50,"top_p":1,"frequence_penalty":1
},
"model": "${MODEL_NAME}"
}'
The response will look like the following:
{
"generated_text":"\n\n1. Visit Taipei 101: Taipei 101 is a popular tourist attraction and offers stunning views of the city from the observation deck. You can also visit the shopping mall and enjoy the food court.2. Visit Shilin Night Market: Shilin Night Market is one of the most popular night markets in Taipei, with a variety of food and souvenirs to choose from. You can also try some traditional Taiwanese snacks such as taro balls, taro dumplings, and steamed buns.3. Visit Taipei Zoo: Taipei Zoo is a popular attraction for families, and it features a variety of animals, including pandas, elephants, and sea lions. You can also take a train ride around the zoo and see the animals up close.4. Visit Taipei Botanical Garden: Taipei Botanical Garden is a beautiful garden with a variety of flowers and plants. You can take a stroll through the garden and enjoy the scenery, or visit the cafe for a relaxing break.5. Visit Taipei City Hall: Taipei City Hall is a beautiful building with stunning architecture and a variety of art exhibitions. You can also visit the observation deck for a view of the city.6. Visit Taipei Art Museum: Taipei Art Museum features a variety of art exhibitions and is a great place to learn about Taiwanese art and culture.7. Visit Taipei Nangang Wetlands: Taipei Nangang Wetlands is a nature reserve with a variety of plants and animals. You can take a walking tour of the wetlands and enjoy the scenery.",
"total_time_taken":"37.92 sec",
"generated_tokens":316
}
Stream Mode, Using SSE (Server-sent events)
In Stream Mode, the server uses Server-sent events (SSE) to actively push data to the client. After the connection is established, the server generates the text step by step and sends each piece to the client as it is produced. This differs from the one-time responses above and can enhance the user experience. To use stream mode in the FFM API, just add "stream": true to the request body while generating results, like this:
export ENDPOINT_FFM=${API_ENDPOINT}
export API_KEY_FFM=${API_KEY}
export MODEL_NAME=ffm-llama2-7b-chat # VERIFY THIS in your CHAT interface, please!
curl -X POST "${ENDPOINT_FFM}/text-generation/api/models/generate" \
-H "X-API-KEY:${API_KEY}" \
-H 'Content-Type: application/json; charset=utf-8' \
-d '{
"inputs":"Plan one day trip in Taipei",
"parameters":{
"max_new_tokens":350,"temperature":0.5,"top_k":50,"top_p":1,"frequence_penalty":1
},
"model": "ffm-7b",
"stream":true
}'
The output will be sent token by token, like the following:
data: {"generated_text": "\n\n", "details": null, "total_time_taken": null, "generated_tokens": null, "finish_reason": null}
data: {"generated_text": "Sure", "details": null, "total_time_taken": null, "generated_tokens": null, "finish_reason": null}
data: {"generated_text": ",", "details": null, "total_time_taken": null, "generated_tokens": null, "finish_reason": null}
data: {"generated_text": " here", "details": null, "total_time_taken": null, "generated_tokens": null, "finish_reason": null}
data: {"generated_text": " is", "details": null, "total_time_taken": null, "generated_tokens": null, "finish_reason": null}
... Chunked ...
data: {"generated_text": " summer", "details": null, "total_time_taken": null, "generated_tokens": null, "finish_reason": null}
data: {"generated_text": " months", "details": null, "total_time_taken": null, "generated_tokens": null, "finish_reason": null}
data: {"generated_text": ".", "details": null, "total_time_taken": null, "generated_tokens": null, "finish_reason": null}
data: {"generated_text": "", "details": null, "total_time_taken": "7.13 sec", "generated_tokens": 267, "finish_reason": "eos_token"}
The results will be pushed to the user's client token by token. At the end of the generation process, "finish_reason": "eos_token" will be presented along with "total_time_taken" and "generated_tokens" for the entire process.
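To consume the stream from a shell script, you can strip the "data: " prefix and concatenate the generated_text fields as they arrive. Here is a minimal sketch, assuming jq and GNU sed are available and parsing the data: lines exactly as shown above; the same pipeline works for the conversation endpoint described below:
# Read the SSE stream, strip the "data: " prefix, and print tokens as they arrive.
curl -sN -X POST "${ENDPOINT_FFM}/text-generation/api/models/generate" \
-H "X-API-KEY: ${API_KEY_FFM}" \
-H 'Content-Type: application/json; charset=utf-8' \
-d '{"inputs":"Plan one day trip in Taipei", "model": "'"${MODEL_NAME}"'", "stream":true}' \
| sed -u 's/^data: //' \
| while IFS= read -r line; do
    # Skip keep-alive blank lines, then print each token without a newline.
    [ -n "$line" ] && printf '%s' "$(printf '%s' "$line" | jq -j '.generated_text // empty')"
  done; echo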
FFM-Llama2 Inferencing
FFM-Llama2 uses Conversation Mode to facilitate inferencing. For applications like chatbots, this is a very useful feature when designing situations that require the chatbot to remember user context in order to sustain a continuing dialogue. With Conversation Mode, a chatbot can keep track of key information the user provides during a conversation and understand references to earlier parts of the dialogue, even across multiple turns. By memorizing context, Conversation Mode significantly improves the user experience: multi-turn conversations feel more natural and human-like, which makes it especially useful for ongoing, contextual dialogue as opposed to single-turn interactions.
Here is an example of using the API:
curl -X "POST" "${ENDPOINT_FFM}/text-generation/api/models/conversation" \
-H 'Content-Type: application/json; charset=utf-8' \
-H "X-API-KEY: ${API_KEY}" \
-d $'{
"model": "ffm-llama2-7b-chat",
"messages": [
{
"content": "The most populous country in Africa is? Keep it short.",
"role": "human"
},
{
"content": " The most populous country in Africa is Nigeria, with a population of over 200 million people. ",
"role": "assistant"
},
{
"content": "What is the primary religion?",
"role": "human"
}
]
}'
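To carry the dialogue forward, append the model's previous reply and the next user turn to the messages array. Below is a minimal sketch; it assumes the conversation endpoint returns its reply in a generated_text field like the generate endpoint does (verify this against your actual response), and it flattens newlines so the spliced JSON stays valid:
# ASSUMPTION: the reply is in .generated_text, as with the generate endpoint;
# check your actual response and adjust the jq path if needed.
REPLY=$(curl -s -X POST "${ENDPOINT_FFM}/text-generation/api/models/conversation" \
-H 'Content-Type: application/json; charset=utf-8' \
-H "X-API-KEY: ${API_KEY}" \
-d '{"model": "ffm-llama2-7b-chat", "messages": [{"content": "The most populous country in Africa is? Keep it short.", "role": "human"}]}' \
| jq -r '.generated_text' | tr '\n' ' ')
# Send the follow-up turn with the accumulated history. If the reply can contain
# double quotes, build the body with jq instead of string splicing.
curl -s -X POST "${ENDPOINT_FFM}/text-generation/api/models/conversation" \
-H 'Content-Type: application/json; charset=utf-8' \
-H "X-API-KEY: ${API_KEY}" \
-d '{"model": "ffm-llama2-7b-chat", "messages": [
{"content": "The most populous country in Africa is? Keep it short.", "role": "human"},
{"content": "'"${REPLY}"'", "role": "assistant"},
{"content": "What is the primary religion?", "role": "human"}]}'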
Parameters for Inferencing
When using the conversation API, you can add inferencing parameters and enable SSE (Server-sent events) for better results.
For using SSE, include this parameter:
"stream": true,
To adjust the inferencing behavior, the same parameters as above are available:
max_new_tokens
temperature
top_p
top_k
frequence_penalty
Here is an example:
curl -X "POST" "${ENDPOINT_FFM}/text-generation/api/models/conversation" \
-H 'Content-Type: application/json; charset=utf-8' \
-H "X-API-KEY: ${API_KEY}" \
-d $'{
"model": "ffm-llama2-7b-chat",
"stream": true,
"parameters": {
"temperature": 0.5,
"top_p": 1,
"max_new_tokens": 350,
"top_k": 50,
"frequence_penalty": 1
},
"messages": [
{
"content": "The most populous country in Africa is? Keep it short.",
"role": "human"
}
]
}'
Prompting
Prof. Andrew Ng suggests two good principles for writing effective prompts in his course ChatGPT Prompt Engineering for Developers:
- Principle 1: Write clear and specific instructions.
- Principle 2: Give the model time to "think."
These principles for effective prompts also apply to the Formosa Foundation Model (FFM), which uses single braces as delimiters to separate the context from the inquiry. As an example of text summarization of the AFS introduction page content using FFM-176B, you can put the following string into your playground:
{
Generative AI Tailored Specifically for Enterprises
TWSC, the only provider of commercial AI supercomputing services in Asia, has unveiled the first large-scale enterprise-level language model in Traditional Chinese called "Formosa Foundation Model." This model, built on Taiwan's Taiwania-2 supercomputer, comprises an impressive 176 billion parameters. It combines semantic understanding and text generation capabilities in Traditional Chinese, offering enterprise-grade, highly secure, and deployable generative AI solutions.
Generative AI is driving a productivity revolution, and the era of Moore’s Law for artificial intelligence systems has arrived. Enterprises can leverage the “Formosa Foundation Model,” a pre-trained model with general knowledge, to enhance their expertise using enterprise data. By deploying dedicated models in a trusted environment, businesses can establish their own tailored AI solutions and create an enterprise-specific brain.
AFS provides a customized advantage by utilizing dedicated models that align with the enterprise’s culture and specific needs. It offers unique practicality while ensuring compliance with enterprise security, regulations, and privacy. With AFS, sensitive data remains within the system, so you can safely and seamlessly integrate with internal enterprise systems without any worries.
}
Identify the following items in the review above:
- Service Name:
- Service Features:
- Service Provider:
- The reason of using this service:
Response in Traditional Chinese
The output will be:
服務名稱: Formosa Foundation Model
服務特點: Formosa Foundation Model是專為企業設計的語言模型,具備176億個參數,能夠理解語意並生成繁體中文文本。透過使用Taiwania-2超級電腦,這個模型可以提供企業級高效率、高安全性的AI解決方案。
服務提供者: TWSC
使用該服務的原因: 企業可以通過使用Formosa Foundation Model,利用企業內部資料,提高其專業知識,並建立自定義的AI解決方案。同時,Formosa Foundation Model使用專為企業設計,確保符合企業安全、法規和隱私要求,並且不會將敏感資料外洩。此外,它還可以安全地整合到企業內部系統中,從而提高生產力和效率。
This example demonstrates how FFM-176B can extract relevant information from the single-brace context we provided and list the required information according to the given list. Benefiting from the cross-language ability of LLMs (Large Language Models), the output can be rendered in Traditional Chinese for easy reading.
Take-aways
In this summarization example, there are several take-aways for you:
- Good Principles: (1) write clear and specific instructions; (2) Give the model time to "think."
- FFM uses single braces to isolate the context you provide
- Place the prompting string at the end of the whole query
- Translation and preferred output format can be used together
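As a sketch of how these take-aways carry over from the playground to the generate API (the context string below is an abbreviated, illustrative stand-in for the full page content above):
# Wrap the reference context in single braces and place the instruction last.
CONTEXT='TWSC unveiled the Formosa Foundation Model, a 176-billion-parameter Traditional Chinese LLM built on the Taiwania-2 supercomputer.'
PROMPT="{ ${CONTEXT} } Identify the following items in the review above: - Service Name: - Service Provider: Response in Traditional Chinese"
# The naive string splice below is safe here because PROMPT contains no double
# quotes; for arbitrary text, build the body with jq instead.
curl -X POST "${ENDPOINT_FFM}/text-generation/api/models/generate" \
-H "X-API-KEY: ${API_KEY}" \
-H 'Content-Type: application/json; charset=utf-8' \
-d "{\"inputs\": \"${PROMPT}\", \"model\": \"${MODEL_NAME}\"}"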
General evaluation
Suggested context / semantic evaluation
You can use a tool available in AFS_tools to assess semantics with your model, and you can use your own dataset for validation.
Example:
curl -X POST "${ENDPOINT_FFM}/text-generation/api/models/generate" \
-H "X-API-KEY:${API_KEY}" \
-H "content-type: application/json" \
-d '{
"inputs":"When Jeffery doesn’t feel like cooking, he often orders pizza online and has it ____ to his house. (A) advanced (B) delivered (C) offered (D) stretched ",
"model": "${MODEL_NAME}"
}'
The response will look like the following:
{
"generated_text":"\n\nWhen Jeffery doesn’t feel like cooking, he often orders pizza online and has it delivered to his house.",
"total_time_taken":"3.70 sec",
"generated_tokens":24
}
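To run such checks over a whole validation set from the shell, here is a minimal sketch; the questions.jsonl layout (one JSON object per line with an inputs field) is an illustrative assumption, not an AFS_tools format:
# Illustrative batch run: each line of questions.jsonl is {"inputs": "..."}.
while IFS= read -r item; do
  INPUTS=$(printf '%s' "$item" | jq -r '.inputs')
  # Build the request body with jq so arbitrary text is escaped safely.
  curl -s -X POST "${ENDPOINT_FFM}/text-generation/api/models/generate" \
  -H "X-API-KEY: ${API_KEY}" \
  -H 'Content-Type: application/json; charset=utf-8' \
  -d "$(jq -n --arg i "$INPUTS" --arg m "$MODEL_NAME" '{inputs: $i, model: $m}')" \
  | jq -r '.generated_text'
done < questions.jsonl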
Case study
Case 1: Web Crawler Chatbot
In the use case webcrawler_chatbot, we combine web crawling with Scrapy, data enhancement through FFM embeddings, and interactive user engagement using Streamlit and Langchain, integrated with FFM in AFS Cloud.
We provide a step-by-step guide: open the "scrapy" folder, follow the instructions in the "README.md" file, install the necessary packages, update the parameters, run the command to create vectors for the website dataset, launch the API server, and start the chat user interface server (see the sketch below). For additional information, please visit webcrawler_chatbot.
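In shell form, the walkthrough looks roughly like the following; every file name other than "scrapy" and "README.md" is an assumption, so treat the use case's README as authoritative:
# Rough outline only -- consult README.md for the real commands.
cd scrapy                         # open the "scrapy" folder
cat README.md                     # follow the instructions there
pip install -r requirements.txt   # ASSUMPTION: package list file name
# Then, per the README: update parameters, vectorize the website dataset,
# launch the API server, and start the chat user interface server.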
Case 2: RMA case
Regarding the sales and maintenance statistics of our products, we can use FFM (Formosa Foundation Model) for integration and question answering. We can use the data from https://www.kaggle.com/c/pakdd-cup-2014 for this purpose. First, we need to organize the data in the JSONL format using the tool available at https://github.com/twcc/AFS_tools/blob/main/rma2jsonl.py.
The following steps can be followed:
- Download and prepare the data: download the dataset from https://www.kaggle.com/c/pakdd-cup-2014.
- Use the tool at https://github.com/twcc/AFS_tools/blob/main/rma2jsonl.py to organize the data in the JSONL format (see the sketch after this list).
- Upload the JSONL data to COS.
- Register the dataset in dataset management.
- Create Platform Jobs.
- Set up Cloud Service.
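The first two steps can be scripted roughly as follows; rma2jsonl.py's command-line usage is not documented here, so check the script itself before running it:
# Fetch the conversion tool. The Kaggle dataset itself must be downloaded
# manually (or with the Kaggle CLI) after accepting the competition rules.
git clone https://github.com/twcc/AFS_tools.git
# ASSUMPTION: invocation is illustrative -- check the script's usage first.
python AFS_tools/rma2jsonl.py    # convert the downloaded RMA data to JSONL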