3. Data preparation

Data format and structure

Tools (Online/Offline)

Tokenization is the process of splitting text into individual units, such as words or subwords. It is typically done offline as a preprocessing step before feeding text into a machine learning model. In our GitHub project AFS_Tools, we provide tools for tokenizing text with a variety of algorithms.

To prepare data for FFM (Formosa Foundation Model), you can use Python libraries such as transformers for tokenization. The following sample code shows how to tokenize text with transformers:

import transformers

def tokenize_text(text):
    """Tokenizes text using the transformers library."""
    # Load a pretrained tokenizer from the Hugging Face Hub.
    tokenizer = transformers.AutoTokenizer.from_pretrained("bigscience/tokenizer")
    # Encode the text and return the list of token IDs.
    tokens = tokenizer(text=text)["input_ids"]
    return tokens

if __name__ == "__main__":
    text = "This is a sample text."
    tokens = tokenize_text(text)
    print(len(tokens))

Data format

To fine-tune your model for a specific application domain, you need to gather a dataset of text that is representative of that domain. This dataset should be stored in a format that is compatible with the large language model that you are using for fine-tuning.

The Formosa Foundation Model (FFM) family offers two types of fine-tuned foundation models for you to create your own proprietary models. These fine-tuned foundation models are exceptionally well-suited for advanced domain-specific applications and possess extensive local knowledge pertaining to Taiwan and Southeast Asia. We use the JSON Lines format to structure the datasets for both, with slight variations in dataset composition between them.

You can use the online JSON Lines Validator tool for verification; a valid dataset will look like the examples below.
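If you prefer to check files offline, here is a minimal sketch (our own illustration, not part of AFS_Tools; the file name is hypothetical) that verifies every line of a file parses as a standalone JSON value:

import json

def validate_jsonl(path):
    """Checks that every line in the file is a standalone JSON value."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                print(f"line {lineno}: invalid JSON ({err})")
                return False
    return True

if __name__ == "__main__":
    validate_jsonl("dataset.jsonl")  # hypothetical file name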

For the FFM-Bloom model

The dataset uses the JSON Lines format: each line is a JSON object, and the inputs and targets keys store the input and target text, respectively. There are some additional details about the inputs and targets keys:

  1. The sum of the lengths of the inputs and targets strings must not exceed 2048 characters. This is because FFM is a large language model, and it can only process a limited amount of text at a time.
  2. The target text should be as clear and concise as possible. If it is too short, the model may not be able to understand the intended meaning.
  3. The inputs string can contain no words, but it must at least contain the end-of-sequence symbol </s>. This symbol tells the model that the input text is complete.

Here is an example dataset for fine-tuning the FFM-Bloom model:

{"inputs":"Can Rclone software synchronization tool be installed on Taiwania2?","targets":"Group: twnia2-hpc-cli\nAssistant: Taiwania2 has the latest version of Rclone installed. You can use the command `module load rclone`` to obtain the working environment for Rclone. Rclone is written in the Go programming language and can be directly used by extracting it in the home directory."}
{"inputs":"</s>","targets":"Group: twnia2-ccs\nAssistant: The development container service enables rapid deployment of development environments. Users can choose different artificial intelligence frameworks according to their needs and quickly edit programs through the built-in Jupyter Notebook. The container service will automatically mount the HFS high-speed storage system, and users can also use cloud object storage services within the container to store training data and models as needed."}

For the FFM-Llama2 model

FFM-Llama2 supports multi-turn conversations, and the dataset preparation also accommodates multi-turn dialogues.

In this JSON Lines dataset, each line contains a single JSON array, and each element of the array is an object with role and content fields. The role field takes one of three values: system, human, and assistant. Here is a definition of each role:

  • system: This role sets the general behavior for the whole conversation.
  • human: This role corresponds to the fine-tuning input text provided by the user. It is similar to inputs in the FFM-Bloom dataset preparation.
  • assistant: This role holds the suggested responses from FFM-Llama2. It is similar to targets in the FFM-Bloom dataset preparation.

Here is a single-turn example dataset for fine-tuning the FFM-Llama2 model:

[{"role":"system", "content":"Use emoji to respond to any question."}, {"role":"human", "content":"Recommend a Valentine's Day product and write a 300-word marketing copy."}, {"role":"assistant", "content":"💘 Love is in the air! 🌟\nAre you looking for a unique and special way to celebrate Valentine's Day with your loved one? 💕 Look no further! Our Valentine's Day product is the perfect way to show your love and appreciation. 💗\nOur product is a beautifully crafted heart-shaped box filled with delicious chocolates and a special surprise inside. 🍫🎁 The box is made of high-quality materials and is designed to impress. 💯"}]

In this dataset, we prepare a system prompt for the overall conversation, "Use emoji to respond to any question.", as well as a human/assistant fine-tuning pair for FFM-Llama2 to learn your context.

In the context of multi-turn dialogues, you have the option to include additional conversation within the input array, as demonstrated below.

[{"role":"system", "content":"response in zh-TW"}, {"role":"human", "content":"Where is the capital city of Taiwan?"}, {"role":"assistant", "content":"台灣的首都是台北市。"}, {"role":"human", "content":"How about India?"}, {"role":"assistant", "content":"印度的首都是新德里。"}]

In this demonstration dataset, we define a language preference as the system prompt, "response in zh-TW," followed by a series of questions inquiring about capital cities. FFM-Llama2 is capable of comprehending the context as "inquiring about the capital city" and providing a suitable response to the second query.

There is some additional information you need to know:

  1. The total length of all elements in the JSON array must not exceed 4096 characters (see the validation sketch after this list).
  2. Leave the content blank when there is nothing to fill in for the human or assistant field.
  3. Ensure that both the texts in the human and assistant fields are clear and concise for your application.
  4. You can combine single-turn and multi-turn dialogues when preparing your dataset.
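The following minimal sketch (our own illustration; the file name is hypothetical, and summing the lengths of the content strings is our reading of the 4096-character rule) checks each line against these rules:

import json

ROLES = {"system", "human", "assistant"}
MAX_LEN = 4096  # total character limit (assumption: summed over all content strings)

def check_llama2_record(turns):
    """Checks one FFM-Llama2 record (a JSON array of turns) against the rules above."""
    assert isinstance(turns, list), "each line must be a JSON array"
    total = 0
    for turn in turns:
        assert turn["role"] in ROLES, "role must be system, human, or assistant"
        total += len(turn["content"])
    assert total <= MAX_LEN, "total content exceeds 4096 characters"

with open("llama2_dataset.jsonl", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        check_llama2_record(json.loads(line))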

Suggested Flow for Preparing a Dataset

To ensure the quality of your language model, we suggest preparing your dataset according to the process suggested by MassiveWeb1:

  1. Content Text Quality Filtering. You can use Google's SafeSearch or other tools to filter out inappropriate content while avoiding unintentionally filtering out harmless content.
  2. Text Extraction. According to the MassiveWeb recommendation, storing text data in plain text format while preserving the original formatting, such as indentation, newlines and bullet points, is an efficient and accurate approach to extracting valuable information from web pages.
  3. Quality Filtering. This step applies natural language processing (NLP) techniques to eliminate meaningless textual data. It includes filtering out sentences that are too short or too long, removing excessively long English character sequences, removing excessive symbols (while maintaining a character-to-symbol ratio of 0.1), and filtering out words from a stop word list.
  4. Repetition Removal. It is recommended to exclude documents that have a significant proportion of duplicate lines, paragraphs, or n-grams, based on predefined thresholds. The specific threshold ratios may vary depending on the language being used.
  5. Document Deduplication. MassiveWeb suggests using the MinHash algorithm to compute 13-gram Jaccard similarities and randomly remove one document from each pair of near-duplicates (see the sketch after this list).
  6. Test-set Filtering. It is also recommended to compute Jaccard similarity against the high-quality test-set corpus and filter out overlapping training documents before constructing the final dataset.
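For step 5, here is a minimal sketch of near-duplicate removal using the open-source datasketch library. The 0.8 similarity threshold and word-level 13-grams are our assumptions following the Gopher paper; tune them for your corpus:

from datasketch import MinHash, MinHashLSH

NUM_PERM = 128   # number of MinHash permutations
THRESHOLD = 0.8  # assumed Jaccard similarity threshold (per Rae et al., 2021)

def minhash_13grams(text):
    """Builds a MinHash signature over word-level 13-grams of a document."""
    words = text.split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(1, len(words) - 12)):
        m.update(" ".join(words[i:i + 13]).encode("utf-8"))
    return m

def deduplicate(docs):
    """Keeps one document from each group of near-duplicates.

    docs is an iterable of (doc_id, text) pairs.
    """
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs:
        m = minhash_13grams(text)
        if lsh.query(m):  # near-duplicate of an already-kept document
            continue
        lsh.insert(doc_id, m)
        kept.append(doc_id)
    return kept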

1 Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., ... & Irving, G. (2021). Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.

Toxicity

Toxic content is a statement that has the potential to cause damage, whether physical or reputational. Such statements may include inaccurate information, offensive language, or confidential data. The presence of harmful content can result in legal consequences, financial losses, and damage to a company's reputation. Here are some examples of toxic statements:

  • [physical harm] A manufacturing plant's safety manual that contains inaccurate information about how to handle hazardous chemicals.
  • [reputational harm] A marketing proposal that contains offensive language about a particular group of people.
  • [reputational harm] A financial report that contains confidential data about a company's clients.

You can find more examples in the Social Media Toxicity Dataset and SemEval-2021 Task 5: Toxic Spans Detection.

To remove toxic content, we suggest building a lexicon of toxic tokens, following the study by Kurita et al. (2019)2, and removing them from your collected dataset. To prevent the leakage of sensitive client data, we recommend reviewing the content of all datasets before preparing them to avoid any potential disclosure of sensitive information.
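As an illustration of lexicon-based filtering, the sketch below drops any record whose text contains a token from a toxic lexicon. The lexicon file and field names are hypothetical, and a production filter would likely need text normalization and span-level handling:

import json

# Hypothetical lexicon: one lowercase toxic token per line.
with open("toxic_lexicon.txt", encoding="utf-8") as f:
    TOXIC_TOKENS = {line.strip().lower() for line in f if line.strip()}

def is_clean(text):
    """Returns True if no token of the text appears in the toxic lexicon."""
    return not any(token.lower() in TOXIC_TOKENS for token in text.split())

with open("dataset.jsonl", encoding="utf-8") as src, \
     open("dataset.clean.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        record = json.loads(line)
        # Assumes FFM-Bloom-style records with "inputs" and "targets" fields.
        if is_clean(record["inputs"]) and is_clean(record["targets"]):
            dst.write(line)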

However, when you upload data to the personal dedicated storage space of AFS (AI Foundry Service), we ensure that all data is protected in accordance with privacy and information security policies and handled in compliance with the ISO 27017 and ISO 27018 standards, guaranteeing the non-disclosure of your data.

2 Kurita, K., Belova, A., & Anastasopoulos, A. (2019). Towards robust toxic content classification. arXiv preprint arXiv:1912.06872.