How to prepare text data for Relevance AI

Before uploading a dataset, run through the checklist to make sure your data meets our recommendations and requirements.

  • The general format for uploading data to Relevance AI is CSV.
  • The maximum number of rows in a dataset is 300,000. Please contact us if your dataset is larger.
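
The row limit can be checked before uploading with a short script. The sketch below uses only Python's standard csv module; it assumes the limit counts data rows and excludes the header row, which this guide does not state explicitly. The function names and the file name in the comment are illustrative.

```python
import csv
import io

MAX_ROWS = 300_000  # row limit stated in the checklist above


def count_data_rows(csv_file):
    """Count the data rows in an open CSV file, excluding the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    # csv.reader yields [] for completely empty lines; do not count those.
    return sum(1 for row in reader if row)


def fits_row_limit(csv_file, max_rows=MAX_ROWS):
    """Return True when the file has at most `max_rows` data rows."""
    return count_data_rows(csv_file) <= max_rows


# In-memory sample; for a real file use something like
# open("my_data.csv", newline="", encoding="utf-8") (hypothetical file name).
sample = io.StringIO("id,name\n1,Jim\n2,Jack\n3,Dave\n")
print(count_data_rows(sample))  # 3
```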

File format: CSV

Your dataset should be saved in valid CSV format before being uploaded to Relevance AI's platform.

CSV files are table-like data formats, similar to what you see in an Excel sheet. Make sure all columns have a unique name/header and that the values in each column follow the same data type and format (see the "Values" section below).

No|Name|Company|Age
--|----|-------|---
1 |Jim |  ABC  |32
--|----|-------|---
2 |Jack|  XYZ  |24
--|----|-------|---
3 |Dave|  LMN  |39
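
A quick structural check can catch an invalid CSV before upload. The sketch below is one possible check, not the platform's own validation: it reports empty or duplicate column names and rows whose cell count differs from the header. The function name check_csv_structure is hypothetical.

```python
import csv
import io


def check_csv_structure(csv_file):
    """Return a list of human-readable problems found in the CSV layout."""
    problems = []
    reader = csv.reader(csv_file)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]

    # Every column needs a non-empty, unique name.
    seen = set()
    for name in header:
        if not name.strip():
            problems.append("empty column name")
        elif name in seen:
            problems.append(f"duplicate column name: {name}")
        seen.add(name)

    # Every non-empty data row should have as many cells as the header.
    for row in reader:
        if row and len(row) != len(header):
            problems.append(
                f"line {reader.line_num}: expected {len(header)} cells, found {len(row)}"
            )
    return problems


sample = io.StringIO("No,Name,Company,Age\n1,Jim,ABC,32\n2,Jack,XYZ\n")
print(check_csv_structure(sample))  # ['line 3: expected 4 cells, found 3']
```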

Headers: Names of fields/columns

  • Column names/headers are included ONLY in the FIRST row of the file
  • Column names/headers must fit on one line (i.e. multi-line headers are not accepted)
  • Names should be short but descriptive (recommended)
  • Column names must be unique (otherwise, we automatically add numbers to headers)
  • Names can only contain letters, numbers, dashes or underscores (any other character will be replaced by our upload engine)
    Note that
    • white spaces will be replaced with - by our upload engine
    • . will be replaced with - by our upload engine
    • if your dataset includes vectors, make sure the vector field name ends in _vector_
      Vector fields:
      Vector fields are representations of data in another format, namely a list of numbers (a vector). For instance, if your dataset includes a field/column named "description" that holds the description of each item as text, vectorizing each description value gives you the corresponding vectors. These vectors can be saved in the original dataset under a vector field (an example is provided under Data FAQs).
  • Combine separate columns representing multiple-choice fields in a survey. Relevance AI treats columns/fields independently of each other, so if several columns belong to one top-level category, modify your CSV file to merge them; this lets you benefit from features like filtering on the platform. The example below shows a multiple-choice question about age category:
Name |21-30|31-40|41-50|51-60                 Name | Age Category
-----|-----|-----|-----|-----                ------|-------------
Jack |  0  |  1  |  0  |  0                   Jack |    31-40
-----|-----|-----|-----|-----      ====>     ------|-------------
 Tom |  1  |  0  |  0  |  0                   Tom  |    21-30
-----|-----|-----|-----|-----                ------|-------------
Sarah|  0  |  0  |  0  |  1                  Sarah |    51-60
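
Two of the points above lend themselves to a short script: previewing how header names might be rewritten, and merging 0/1 multiple-choice columns into one field. The sketch below is an approximation, not the upload engine's actual behaviour. It assumes that every disallowed character becomes "-" (the guide confirms this only for white space and "."), that "letters" means ASCII letters, and that duplicates are numbered as _2, _3 and so on (the real numbering scheme is not documented here). All function names are hypothetical.

```python
import re


def preview_header(name, replacement="-"):
    """Approximate the header renaming rules for one column name.

    Assumption: every character other than ASCII letters, digits, "-" and "_"
    is replaced with `replacement`.
    """
    return re.sub(r"[^A-Za-z0-9_-]", replacement, name)


def make_unique(names):
    """Append _2, _3, ... to repeated names (hypothetical numbering scheme)."""
    used = set()
    result = []
    for name in names:
        candidate, n = name, 1
        while candidate in used:
            n += 1
            candidate = f"{name}_{n}"
        used.add(candidate)
        result.append(candidate)
    return result


def collapse_choices(row, choice_columns, new_column):
    """Merge the 0/1 multiple-choice columns of one row into a single field."""
    selected = [c for c in choice_columns if str(row.get(c, "")).strip() == "1"]
    merged = {k: v for k, v in row.items() if k not in choice_columns}
    # Assumption: leave the cell empty unless exactly one option is marked.
    merged[new_column] = selected[0] if len(selected) == 1 else ""
    return merged


headers = ["Customer Name", "customer.id", "Age", "Age"]
print(make_unique([preview_header(h) for h in headers]))
# ['Customer-Name', 'customer-id', 'Age', 'Age_2']

row = {"Name": "Jack", "21-30": "0", "31-40": "1", "41-50": "0", "51-60": "0"}
print(collapse_choices(row, ["21-30", "31-40", "41-50", "51-60"], "Age_Category"))
# {'Name': 'Jack', 'Age_Category': '31-40'}
```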

Values: Values under headers

  • Include only one data type and format in each column. For instance:
    • All digit values - Values under fields such as age, phone number or scores, which are composed of digits only, must be all digits in every cell. Entries like None, NA, and white spaces will break the upload. Delete such inconsistencies (i.e. delete the string values under numeric fields).
    • DATES - All date fields formatted in "YYYY-MM-DD" format (recommended)
    • None / No values - When there is no value, simply leave the cell empty in your CSV file. An empty cell is a cell with nothing typed in it: not a white space, not 0, not N/A, not None, not null. A common case is when people do not respond to a question.
    • Note 1: keep the format the same
      Example: CURRENCY - Values in digits only, without the currency sign (e.g. 119.50), as opposed to mixed formats such as price = [$119.50, 200 dollars]
    • Note 2: when values in a column are both "all digits" and "digits and characters"
      Example: POSTCODES - If your data contains a postcode field that holds both numeric (e.g. 90210) and string (e.g. SW1A 1AA) values in one field (e.g. postcodes across countries), ensure that the first postcode value (under the column header 'postcode') is a string-format postcode (e.g. SW1A 1AA) and not an all-digit one.
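
The value rules above can be spot-checked column by column. The following is a minimal sketch with hypothetical helper names: it flags cells that would break an all-digit column, tests for the recommended date format, and reduces currency strings to plain numbers. It assumes "," is a thousands separator, which may not hold for your locale.

```python
import re
from datetime import datetime

DIGITS_ONLY = re.compile(r"[0-9]+")


def non_digit_cells(values):
    """Return (position, value) for cells that are neither empty nor all digits."""
    return [
        (i, v) for i, v in enumerate(values)
        if v != "" and not DIGITS_ONLY.fullmatch(v)
    ]


def is_iso_date(value):
    """True when `value` is a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def clean_currency(value):
    """Keep only the number: '$119.50' -> '119.50', '200 dollars' -> '200'."""
    # Assumption: "," is a thousands separator and "." the decimal point.
    match = re.search(r"[0-9]+(?:\.[0-9]+)?", value.replace(",", ""))
    return match.group(0) if match else ""


print(non_digit_cells(["32", "NA", "", "24", " "]))  # [(1, 'NA'), (4, ' ')]
print(is_iso_date("2023-07-04"), is_iso_date("04/07/2023"))  # True False
print([clean_currency(v) for v in ["$119.50", "200 dollars"]])  # ['119.50', '200']
```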

Categorical measures

  • If your data has coded values (e.g. Is Member = 1/0), we recommend changing the data to natural language that business users can understand, e.g. Is Member / Is Not Member, Yes / No, or True / False.
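
Decoding a coded column is a simple lookup. In this sketch the mapping and the function name are illustrative; unknown codes and empty cells are left unchanged so that nothing is silently lost.

```python
# Hypothetical mapping for a column coded as 1/0.
IS_MEMBER_LABELS = {"1": "Is Member", "0": "Is Not Member"}


def decode_column(values, labels):
    """Replace coded values with readable labels; keep anything unrecognised as-is."""
    return [labels.get(v.strip(), v) for v in values]


print(decode_column(["1", "0", "", "1"], IS_MEMBER_LABELS))
# ['Is Member', 'Is Not Member', '', 'Is Member']
```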

Numeric measures

  • For numeric scores like NPS, we recommend including both columns: numeric scores (e.g. 0-10 scores) and the coded value/label as an additional field (e.g. detractor, passive, promoter).
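
For NPS, the label column can be derived from the score column. The sketch below uses the standard NPS bands (0-6 detractor, 7-8 passive, 9-10 promoter); the function name is illustrative, and empty cells stay empty in line with the guidance on missing values.

```python
def nps_label(score):
    """Map a 0-10 NPS score (as text) to its standard category."""
    if score == "":
        return ""  # no response: leave the label empty too
    value = int(score)
    if not 0 <= value <= 10:
        raise ValueError(f"NPS score out of range: {score}")
    if value <= 6:
        return "detractor"
    if value <= 8:
        return "passive"
    return "promoter"


print([nps_label(s) for s in ["3", "7", "10", ""]])
# ['detractor', 'passive', 'promoter', '']
```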

No values

When there is no value, simply leave the cell empty in your CSV file. An empty cell is a cell with nothing typed in it: not a white space, not 0, not N/A, not None, not null. A common case is when people do not respond to a question.
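
Placeholder values can be blanked out in one pass. The sketch below is one way to do it; the placeholder set is an assumption and should be reviewed against your own data (for example, "NA" may be a legitimate value in a country-code column). It deliberately leaves 0 alone, because 0 is often a real value.

```python
# Placeholder spellings to blank out (compared case-insensitively); adjust as needed.
PLACEHOLDERS = {"none", "na", "n/a", "null"}


def blank_if_missing(cell):
    """Return "" for white-space-only cells and known placeholders, else the cell unchanged."""
    stripped = cell.strip()
    if stripped == "" or stripped.lower() in PLACEHOLDERS:
        return ""
    return cell


print([blank_if_missing(c) for c in ["32", "N/A", " ", "None", "0"]])
# ['32', '', '', '', '0']
```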

A unique identifier field (Highly recommended)

We recommend including an id field in your CSV. The name of the field can be any string value (e.g. id, original order, customer id, number, etc.) and values should be unique per row (e.g. sequential numbers starting from 1).

Note 1: This unique identifier is very useful when you export the analysis results, as it is your way of mapping the exported data back to your original CSV.

Note 2: Every entry in a dataset on Relevance AI's platform has a unique identifier (_id). The _id field can already exist in the CSV (i.e. included in the upload CSV by the dataset owner). Otherwise, the platform automatically adds the field with unique values.
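
An id column can be added, and its uniqueness verified, before upload. This is a minimal sketch using the standard csv module; the field name "id" and both function names are illustrative choices.

```python
import csv
import io


def add_id_column(rows, id_field="id"):
    """Give every row a sequential id starting from 1, unless the field already exists."""
    out = []
    for n, row in enumerate(rows, start=1):
        out.append(dict(row) if id_field in row else {id_field: str(n), **row})
    return out


def duplicate_ids(rows, id_field="id"):
    """Return the id values that occur more than once."""
    seen, dupes = set(), []
    for row in rows:
        value = row[id_field]
        if value in seen and value not in dupes:
            dupes.append(value)
        seen.add(value)
    return dupes


rows = list(csv.DictReader(io.StringIO("Name,Company\nJim,ABC\nJack,XYZ\n")))
with_ids = add_id_column(rows)
print(with_ids[0])              # {'id': '1', 'Name': 'Jim', 'Company': 'ABC'}
print(duplicate_ids(with_ids))  # []
```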

Cleaning text data (optional)

When working with text data, it is recommended (but not required) to apply certain preprocessing steps, which can improve the analysis results. Common text preprocessing steps are:

  • Stop word removal: removing frequent but uninformative words (e.g. the, there).

  • Lemmatization: replacing words with their common root (e.g. changes or changing become change).

  • Lowercasing: converting all characters to their lowercase form.

  • Breaking into shorter pieces of text: when automatically analyzing text, processing smaller pieces of text (e.g. a sentence vs. a paragraph) often produces more precise results.

  • Noise removal: this step is completely data-specific. Popular cleaning methods are HTML, URL or hashtag removal.
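
Several of these steps can be sketched with the standard library alone. The example below covers noise removal, sentence splitting, lowercasing and stop word removal; lemmatization is left out because it normally needs a dedicated NLP library. The stop word list is a tiny illustrative sample, the sentence splitter is deliberately naive, and all function names are hypothetical.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "there", "a", "an", "is", "are", "and", "of", "to"}


def remove_noise(text):
    """Strip HTML tags, URLs and hashtags, then collapse white space."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"#\w+", " ", text)          # hashtags
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive sentence split on ., ! or ? followed by white space."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def remove_stop_words(text):
    """Lowercase the text and drop stop words."""
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)


raw = "<p>The delivery was late. There is no tracking! See https://example.com #fail</p>"
for sentence in split_sentences(remove_noise(raw)):
    print(remove_stop_words(sentence))
# delivery was late.
# no tracking!
# see
```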

     _id |            Image-URL           | project
    -----|--------------------------------|---------
      1  |  https://my-repo/my-image1.jpg |   X1
      2  |  https://my-repo/my-image2.jpg |   X2
      3  |  https://my-repo/my-image3.jpg |   X1
      4  |  https://my-repo/my-image4.jpg |   X3
      5  |  https://my-repo/my-image5.jpg |   X2
    

Useful links:

See the guide on How to update an existing dataset, which covers all of the items below:

  • adding new items (i.e. rows) to an existing dataset
  • adding new fields (columns) to an existing dataset
  • modifying existing values in a dataset

What is next?

You are ready to move on to analysing the data under the many available workflows. Two hero text processing workflows are AI Text Clustering and AI Tagging.