“Big data” is being talked about everywhere, increasingly in the context of food safety and food quality. For example, while only one symposium covered “big data” at the 2014 annual meeting of the International Association for Food Protection (IAFP), the recent 2015 IAFP annual meeting included at least four sessions that mentioned “big data” in the session title or abstract. While the potential of big data and data analytics to improve our ability to address food safety and quality issues is increasingly recognized, use of these tools in food safety and quality still appears to be limited. Even where “big data” are used in this space, many may argue that the amount of data used rarely qualifies as truly big data; rather, these data may often simply be large traditional datasets. While big data may only slowly be making their way into food safety and quality, food science professionals need to critically discuss and contemplate the impact of big data and associated analytics, so that these tools can be implemented and used in a timely and appropriate way to improve decision making.
Big Data Introduction
While many definitions exist for “big data,” a common definition reads along the lines of “Big data is a broad term for datasets so large or complex that traditional data processing applications are inadequate” (Wikipedia, accessed Aug. 3, 2015). Based on Douglas Laney’s definition of data by the “3Vs,” today a “4V” definition of big data is often used, which can be summarized as “Big data represents high volume, high velocity, high veracity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization.” Often, “big data” is also linked to predictive analytics, in contrast to the more typical use of data in food safety, which focuses on retrospective identification of associations and, increasingly, on real-time or near real-time monitoring of processes. Most uses of large datasets and big data analytics in food safety and quality to date focus on providing improved root cause and retrospective analyses, but the development and use of predictive analytics in food safety is likely to grow quickly in the near future.
Big Data Sources for Food
Many of the early discussions on big data have focused on the use of genomics data as well as social media-related information in food safety. Whole genome sequencing (WGS)-based subtyping has been used for more than five years to create large sets of data that can be used for high resolution subtype characterization of foodborne pathogens (and spoilage organisms), which allows for better outbreak detection and source attribution. Importantly, WGS data for foodborne pathogens are also often rapidly released by public health and regulatory agencies, allowing for use of these data by industry. For example, WGS data for Listeria monocytogenes isolates identified as having been obtained from ice cream in Kansas became publicly available soon after a listeriosis outbreak linked to ice cream (with cases in Kansas) was reported in early 2015. Other omics datasets, such as metagenomics data, have also been used to identify and characterize food spoilage issues. It is likely that these types of data sources will also increasingly become available to the food industry.
Use of social media-related information has seen considerable early enthusiasm based on initial reports that suggested that “Google Flu Trends” can allow for early detection of flu outbreaks. Subsequent studies, however, have suggested that this tool may often predict flu outbreaks inaccurately. Nevertheless, a recent CDC report suggests that mining of Yelp reviews can help public health agencies identify foodborne disease outbreaks that are linked to restaurants and might otherwise have gone undetected. Similarly, sales data, including data from shopper club cards and similar instruments, are available to many retailers and companies and can be used to help detect and identify foodborne disease outbreaks, aiding in rapid initiation of product recalls and other consumer safety actions.
In addition to the data sources briefly discussed above, food safety professionals may also be able to access a number of other structured and unstructured data sources. These include the often large amounts of data that are automatically captured by recording devices in food processing and retail environments (e.g., temperature data for heat treatment steps or refrigerated storage) and employment data (identifying the individuals who perform certain tasks, such as sanitation, on a given day). Unstructured data that could be mined for relevant information include, but are not limited to, video-captured data of facilities and employees.
It is also possible to rapidly acquire, often at no cost (other than computer and personnel time), large sets of metadata associated with samples that have been collected for microbiological or other testing. For example, public data sources are available that provide weather patterns (temperature, rain events, wind direction and speed, etc.) that are associated with a sample collection site and a specific sample collection date. These types of data can be used to rapidly determine whether out-of-spec samples (for example, samples positive for a pathogen or indicator organism) are associated with specific weather patterns (for instance, rain in the preceding day(s)), which can help in root cause analysis; for instance, associations with rain may indicate roof leaks or other water intrusions as a root cause. These same metadata could also be used for predictive analytics that may show an increased risk of pathogen findings or spoilage events after certain weather patterns, which could trigger enhanced preventive efforts.
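As a rough illustration of the weather example above, the following Python sketch cross-tabulates sample results against rain on the preceding day(s) and computes a one-sided Fisher's exact test for the association. Everything named here is an assumption made for illustration: the `Sample` record, the `rain_mm_by_day` lookup (a dictionary from date to rainfall in millimeters, as might be built from a public weather source), the 1 mm rain threshold, and the choice of Fisher's exact test itself. It is a minimal sketch, not a validated analysis method.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from math import comb


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one test result."""
    collected: date
    positive: bool  # True if the sample was out-of-spec


def rained_before(collected, rain_mm_by_day, lookback_days=1, threshold_mm=1.0):
    """True if any of the `lookback_days` days before collection had
    rainfall at or above `threshold_mm` (threshold is an assumption)."""
    return any(
        rain_mm_by_day.get(collected - timedelta(days=d), 0.0) >= threshold_mm
        for d in range(1, lookback_days + 1)
    )


def contingency(samples, rain_mm_by_day, lookback_days=1):
    """Return the 2x2 table as (rain & positive, rain & negative,
    dry & positive, dry & negative)."""
    a = b = c = d = 0
    for s in samples:
        wet = rained_before(s.collected, rain_mm_by_day, lookback_days)
        if wet and s.positive:
            a += 1
        elif wet:
            b += 1
        elif s.positive:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_one_sided(a, b, c, d):
    """Probability of seeing at least `a` positives among the rain-preceded
    samples if rain and positives were unrelated (hypergeometric tail)."""
    n = a + b + c + d
    wet = a + b        # samples collected after rain
    positives = a + c  # all positive samples
    upper = min(wet, positives)
    tail = sum(
        comb(wet, k) * comb(n - wet, positives - k) for k in range(a, upper + 1)
    )
    return tail / comb(n, positives)
```

A small p-value from `fisher_one_sided` would only flag an association worth investigating (for example, by inspecting the roof after rain); it would not by itself establish a root cause, and real datasets would also need to account for repeated sampling sites and other confounders.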
Examples of Approaches in Food
One of the most mature examples of the use of large datasets in food safety is the use of WGS-based subtyping methods by both public health and regulatory agencies. In the U.S., the CDC and state partners are performing WGS on every human clinical Listeria monocytogenes isolate. Similarly, regulatory agencies such as the U.S. FDA are currently performing WGS of foodborne pathogen isolates obtained from foods and food-associated sources. WGS determines the sequence of virtually all 3 million nucleotides (As, Ts, Cs, and Gs) in the Listeria monocytogenes genome, typically with at least 20-fold coverage, therefore creating 60 million data points per genome, which are used for extremely high resolution subtyping. Use of these WGS tools has significantly improved the ability of public health agencies to detect human listeriosis outbreaks, allowing identification of more outbreaks than with previous subtyping tools (i.e., pulsed-field gel electrophoresis), including smaller outbreaks (with fewer than five cases) that may have gone undetected previously. As these tools are applied to other pathogens, in particular Salmonella, the number of detected outbreaks caused by these other pathogens will likely increase considerably.
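The arithmetic behind the 60 million figure, and the basic idea of comparing isolates base by base, can be sketched in a few lines of Python. The function names are hypothetical, and the pairwise difference count is only a toy stand-in assumed here for illustration: it presumes two sequences that are already aligned and of equal length, and it is not a description of the pipelines agencies actually use for WGS-based subtyping.

```python
def sequenced_bases(genome_length, coverage):
    """Approximate number of base calls: genome length times fold coverage."""
    return genome_length * coverage


def nucleotide_differences(seq_a, seq_b):
    """Count positions where two pre-aligned sequences differ.

    Positions where either sequence has anything other than A, C, G, or T
    (for example, an ambiguous 'N') are skipped rather than counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x in valid and y in valid and x != y
    )


# Figures from the text: ~3 million nucleotides at 20-fold coverage.
listeria_data_points = sequenced_bases(3_000_000, 20)
```

In this simplified picture, isolates whose genomes differ at very few positions would be candidates for belonging to the same outbreak, which is why a method that reads millions of positions can resolve clusters that coarser subtyping tools would miss.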
In addition to WGS, metagenomics-based tools also provide large datasets (often providing gigabases of sequence data), which can help characterize total microbial populations in samples. These tools have allowed for detection of new or previously unrecognized pathogens in clinical and food samples and have been shown to detect pathogens that were undetected by traditional microbiological methods. These methods also can facilitate detection and identification of spoilage issues and could be used as untargeted screening tools for raw materials streams and ingredients.
Use of geographic information system (GIS)-based datasets to predict and manage food safety risks is also rapidly gaining traction. For example, recent studies have shown how GIS data can be used to predict locations and time intervals that may represent a higher risk for foodborne pathogen contamination in fields.
While there clearly is considerable potential for big data-based approaches to improve food safety and food quality, a number of challenges remain before industry can take advantage of these tools. Most of these challenges are not unique to this industry, but some of them may be more pronounced. For example, data capture in the food industry is still often manual and often involves paper records that cannot easily be used for data mining. Also, there are few trained data scientists who are familiar with food systems issues (or food systems scientists who can work with large datasets), which further limits the ability of industry to develop and implement effective systems that use large datasets to address food safety and quality issues. Based on these and other challenges, there is a clear need for the industry to act now so that it is prepared to take advantage of big-data tools and solutions for food safety and quality dilemmas.
What Could the Future Bring?
With the rapid advances in both collection and analysis of big data, it can be valuable to speculate on what the medium- and long-term future may look like as these tools are increasingly applied to food safety and quality. For example, the use of WGS for characterization of foodborne pathogen isolates by regulatory and public health agencies in the U.S. has gone hand-in-hand with rapid public release of full sequencing data. This puts industry in a position where it may soon be able to monitor subtype data for human clinical isolates and rapidly detect possible outbreaks, e.g., through comparisons with subtype data for isolates from processing facilities and with other data (e.g., distribution pathways, purchase patterns). In the processing environment, integration of diverse data sources with historical microbial testing data may allow not only for improved and accelerated root cause analysis, but also for prediction of time intervals that may present lower or higher risk for spoilage or food safety issues; this information could be used to adjust food safety and operational practices in near real-time to include additional barriers and controls, including adjustments in preventive maintenance schedules. Data sources that could be used in these analyses include weather patterns, environmental parameters in a facility (humidity, dew points, etc.), and equipment-related parameters (vibration, flow rates, etc.).
With possibilities that may seem nearly unlimited, it’s essential for industry to critically evaluate its needs and high impact areas and define specific questions and issues, rather than simply collecting increasingly large datasets and hoping that “something useful will come out of it.”
Dr. Wiedmann is the Gellert family professor of food safety in the Department of Food Science at Cornell University and a member of the Cornell Institute for Food Systems. He also serves as director of graduate studies for the Field of Food Science and Technology at Cornell. Reach him at email@example.com.
AUTHOR ACKNOWLEDGEMENTS: I acknowledge helpful and stimulating discussion with many colleagues on the topic of big data in food safety, including with Frank Yiannas, Pajau Vangay, Laura Strawn, Jamie Kaufman, Sean Leighton, Barbara Kowalcyk, Julie Stafford, Courtney Parker, and many others. This article is based on a presentation at the 2014 Cornell Food Systems Global Summit.