Pooja Rao, a research scientist at Google Research, discusses the importance of health datasets in research and medical education. Creating a dataset that accurately represents real-world conditions can be challenging, especially in fields like dermatology, where conditions vary in appearance and severity across skin tones. Existing dermatology image datasets often underrepresent everyday conditions and skew toward lighter skin tones, and race and ethnicity information is frequently missing, which makes disparities difficult to assess.
To address these limitations, Google Research and physicians at Stanford Medicine have collaborated to release the Skin Condition Image Network (SCIN) dataset. The dataset aims to reflect the broad range of skin concerns that people search for online. It supplements clinical datasets and includes images across various skin tones and body parts, to help ensure that future AI tools work well for everyone.
The SCIN dataset, which is freely available as an open-access resource, contains over 10,000 images of skin, nail, and hair conditions contributed by individuals experiencing them. Contributors in the US provided images voluntarily with informed consent, along with demographic information and details about their skin concerns. Dermatologists retrospectively labeled each image with up to five dermatology conditions, providing valuable data for model testing and training.
Compared with existing dermatology datasets that focus on tumors, the SCIN dataset predominantly features allergic, inflammatory, and infectious conditions, with a high proportion of early-stage concerns. It also includes self-reported and dermatologist-estimated skin type and tone information to support future research on representation in dermatology.
The SCIN dataset was built through crowdsourcing, an approach that yielded a low spam rate. Contributor privacy was a top priority: contributors were informed of re-identification risks, and precautions were taken to protect their privacy, including manual redaction and metadata removal.
The SCIN dataset serves as a valuable resource for inclusive dermatology research, education, and AI tool development. By demonstrating an alternative dataset creation method that prioritizes representation and privacy, SCIN sets a precedent for future research in areas where self-reported data and retrospective labeling are feasible.