fooddata-vegattributes

Attributes specifying whether a food is vegan or vegetarian for certain USDA FDC FoodData datasets.

View the Project on GitHub v3gtb/fooddata-vegattributes

About

The aim of this project is to provide free and open add-on data for the USDA’s FDC FoodData datasets containing attributes that categorize foods as vegan, vegetarian, or neither.

This will hopefully enable applications and services that deal with nutrition data to take these dietary preferences into account, e.g. when displaying or suggesting foods to users. The target audience is mainly small open source projects, although nothing keeps you from using the data in a commercial project (see the License section below).

Accuracy

The data are generated using a naive heuristic based only on the descriptions of each food and those of its ingredients, which are compared to hardcoded lists of phrases and the most likely categories they suggest. Neither this approach nor the lists of phrases are perfect, so there are still many incorrectly categorized foods that will hopefully become fewer over time.
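To make the idea concrete, here is a minimal sketch of a phrase-based heuristic of this kind. It is not the project's actual script: the phrase list, the category names `VEGAN`, `VEGETARIAN` and `OMNI`, and the function `categorize` are all assumptions made up for illustration.

```python
# Hypothetical phrase list; the project's real lists are longer and differ.
PHRASE_CATEGORIES = {
    "beef": "OMNI",
    "chicken": "OMNI",
    "milk": "VEGETARIAN",
    "cheese": "VEGETARIAN",
    "egg": "VEGETARIAN",
    "tofu": "VEGAN",
    "apple": "VEGAN",
}

# Assumed ordering: a higher rank wins when matching phrases disagree.
RANK = {"VEGAN": 0, "VEGETARIAN": 1, "OMNI": 2}


def categorize(description, ingredient_descriptions=()):
    """Return the least vegan-friendly category suggested by any matching phrase."""
    texts = [description, *ingredient_descriptions]
    found = [
        category
        for text in texts
        for phrase, category in PHRASE_CATEGORIES.items()
        if phrase in text.lower()  # plain substring match, deliberately naive
    ]
    if not found:
        return None  # no phrase matched; a real pipeline would need a fallback
    return max(found, key=RANK.__getitem__)
```

With these made-up lists, `categorize("Cheeseburger", ["beef patty", "cheese"])` yields `"OMNI"`, while `categorize("Eggplant, raw")` yields `"VEGETARIAN"` because "egg" matches as a substring. The second result is the kind of miscategorization a naive approach like this can produce.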

A rough lower bound on the error rate is the percentage of known failures in the reference data, which is currently 4.7%. The real error rate will be larger than that, because known failures are likely to get fixed. Once fixed, the lines in question stay in the reference data, with the known failure mark removed, to serve as regression tests.

Note that categories listed in the reference data override those determined by the heuristic in the final exported data, so any individual known failure listed there has already been corrected.
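The relationship between the heuristic output, the reference data and the error estimate can be sketched as follows. The record layout (`fdc_id`, `expected`, `known_failure`) and both function names are hypothetical; the project's reference data may be stored quite differently.

```python
def apply_reference_overrides(heuristic_categories, reference_entries):
    """Start from the heuristic result and let reference data take precedence."""
    final = dict(heuristic_categories)
    for entry in reference_entries:
        final[entry["fdc_id"]] = entry["expected"]
    return final


def known_failure_rate(reference_entries):
    """Fraction of reference entries currently marked as known failures."""
    failures = sum(1 for entry in reference_entries if entry["known_failure"])
    return failures / len(reference_entries)


# Made-up example data, not taken from the project.
heuristic = {1: "VEGAN", 2: "OMNI", 3: "VEGAN"}
reference = [
    {"fdc_id": 1, "expected": "VEGETARIAN", "known_failure": True},
    {"fdc_id": 2, "expected": "OMNI", "known_failure": False},
]

final = apply_reference_overrides(heuristic, reference)
rate = known_failure_rate(reference)
```

Here food 1 is a known failure of the heuristic, yet the exported category is already the corrected one, while food 3 has no reference entry and keeps whatever the heuristic produced.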

If you find any mistakes, feel free to open an issue.

Strictness

For foods whose categorization as vegan/vegetarian/omni depends on one’s level of “strictness”, an attempt is made to assign an appropriate composite category. For example, wines should ideally all be categorized as VEGAN_VEGETARIAN_OR_OMNI: certain filtration methods normally used in the winemaking process involve animal products, some of which require killing the animal to extract, yet it’s plausible that a subset of vegans/vegetarians would consider such wines vegan/vegetarian regardless.

Note that these same composite categories are also used more generally in cases in which it’s impossible to tell from the available information whether something is vegan/vegetarian or not. Although this meaning is technically distinct from the strictness-dependent categorization above, in practice the two tend to overlap almost perfectly. To return to the wine example: there do exist strictly vegan wines made without resorting to animal products in any step of the process, but a description saying just “wine” could refer to either these or the non-vegan variants.
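An application consuming the data has to decide how to treat the composite categories described above. The sketch below shows one possible approach. It assumes that composite names are built from the base names `VEGAN`, `VEGETARIAN` and `OMNI`, as in `VEGAN_VEGETARIAN_OR_OMNI`; the other names and the helper functions are assumptions, not part of the project.

```python
# Assumed base category names; only VEGAN_VEGETARIAN_OR_OMNI appears in the text.
BASE_CATEGORIES = {"VEGAN", "VEGETARIAN", "OMNI"}


def possibilities(category):
    """Split a (possibly composite) category name into the base categories it allows."""
    return {token for token in category.split("_") if token in BASE_CATEGORIES}


def acceptable_for_vegans(category, strict=True):
    """Strict: the food must be unambiguously vegan. Lenient: vegan must be possible."""
    options = possibilities(category)
    if strict:
        return options == {"VEGAN"}
    return "VEGAN" in options
```

Under these assumptions, a wine categorized as `VEGAN_VEGETARIAN_OR_OMNI` is hidden from a strict vegan user but still shown to a lenient one, which covers both the strictness-dependent and the "cannot tell" interpretation with the same check.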

Supported datasets

Attributes are provided for foods in the FNDDS (“Survey”) and SR Legacy datasets. Data for both datasets are provided together in one file as foods are uniquely identified by e.g. their FDC ID and the file size is small anyway. As of now there are no plans to extend this project to the other FDC datasets, but who knows.

Download and file format

The latest generated dataset can be found on the GitHub releases page.

It is shipped as a JSON file containing a list of entries of the form

{
  "fdcId": 123,
  "vegCategory": "CATEGORY",
  "description": "FDC description/name of the food",
  # either:
  "foodCode": 456,
  # or:
  "ndbNumber": 789
}

where CATEGORY is one of the categories listed in the section below and fdcId, foodCode, ndbNumber and description correspond to the fields of the same names in the FDC datasets. foodCode only appears in the FNDDS data and ndbNumber only in the SR Legacy data, so which of the two is present in a given entry depends on which dataset the food came from. The description is included only as a debugging aid; the proper way to find it, or any other property of a given food, is to look the food up in the FDC datasets by the given IDs.
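Reading the file takes only a few lines in most languages. The following Python sketch loads the entries, works out which dataset each one came from, and groups FDC IDs by category. The function names are illustrative, and the file path passed to `load_entries` is whatever name the downloaded release file has.

```python
import json
from collections import defaultdict


def load_entries(path):
    """Load the released JSON file, which contains a list of entry objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def source_dataset(entry):
    """Infer the originating dataset from which ID field is present."""
    if "foodCode" in entry:
        return "FNDDS"
    if "ndbNumber" in entry:
        return "SR Legacy"
    return None


def group_by_category(entries):
    """Map each vegCategory value to the list of FDC IDs carrying it."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["vegCategory"]].append(entry["fdcId"])
    return dict(grouped)
```

Note that the `# either:` and `# or:` lines in the sample above are annotations only; the entries in the file are expected to be plain JSON objects.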

The FDC ID by itself is enough to uniquely identify foods, but it is my understanding that a new FDC ID is assigned to “the same” food on every release of an FDC dataset, while IDs like the Food Code and NDB Number remain the same. The idea is that the FDC ID identifies not just a food but also the specific properties (e.g. determined nutrients) associated with it in that release. So for easier cross-FDC-release compatibility, the Food Code or NDB Number is included here as well. Note, however, that I’m not sure whether e.g. the ingredient list or description can also be updated between releases, which could change the categorization as determined by this project. In that case, it might be more correct to rely only on the FDC ID, although the number of errors caused by updated descriptions or ingredients is expected to be much, much lower than that caused by failures of the heuristic.
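If you want lookups that survive a new FDC release, one option is to key the attributes by Food Code or NDB Number instead of FDC ID. The sketch below assumes, as described above, that these IDs stay stable across releases; the helper names are hypothetical.

```python
STABLE_ID_FIELDS = ("foodCode", "ndbNumber")


def stable_key(record):
    """Return (field name, value) for whichever release-independent ID is present."""
    for field in STABLE_ID_FIELDS:
        if field in record:
            return (field, record[field])
    return None


def build_lookup(entries):
    """Index vegCategory by the stable ID rather than by FDC ID."""
    return {stable_key(entry): entry["vegCategory"] for entry in entries}


def category_for(food, lookup):
    """Look up a food record from any FDC release by its stable ID."""
    return lookup.get(stable_key(food))
```

The field name is part of the key so that a Food Code and an NDB Number with the same numeric value cannot collide. A food whose FDC ID changed in a newer release still resolves, as long as its Food Code or NDB Number is unchanged.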

Web preview

For debugging and demoing purposes, the current lists of foods in each category can be viewed here:

Source code and development

The script used to generate the data released by this project from FDC data via the heuristic explained above can be found in the project’s GitHub repository.

Some incomplete notes on development can be found here.

License

Like the USDA FDC datasets themselves, the data published by this project are hereby released into the public domain or, in jurisdictions where this is not possible, the closest legal equivalent.

The script to generate the data is provided under the MIT license.