The DSG Consumer Health Vocabulary currently consists of four flat files:
  1. concepts_terms_flat_file (download) - updated May 20, 2006
  2. ngrams_flat_file (download)
  3. stop_concepts_flat_file (download)
  4. incorrect_mappings_flat_file (download) - updated May 12, 2006

You can download all the files here: CHV_flat_files.zip

The values in these files are separated by tabs and are best viewed in Excel.

The Concepts Terms Flat File

The concepts_terms_flat_file is ordered by concept, with the concepts most frequently mapped to by terms in the query log data listed first. A concept may have many terms mapped to it, and each term is listed on a separate row, so a single concept can span more than one line. The terms within a concept are in no particular order. The first column is the concept ID (CUI) and the second column is the term. For each concept-term combination, flags indicate whether the term is the consumer preferred name, whether it is the UMLS preferred name, and whether it is disparaged. A disparaged term is one that is misspelled or otherwise abnormal.
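As a rough illustration of the layout described above, the following Python sketch groups terms by CUI using only the first two columns. The function name, the has_header switch, and the sample CUIs and terms are made up for illustration; whether the real file has a header row is an assumption to check against the download.

```python
import csv
import io


def group_terms_by_cui(lines, has_header=False):
    """Group terms by concept ID, keeping the file's concept order."""
    # QUOTE_NONE keeps any quote characters inside terms untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip a header row, if the file has one
    grouped = {}  # dicts keep insertion order in Python 3.7+
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        cui, term = row[0], row[1]
        grouped.setdefault(cui, []).append(term)
    return grouped


# Made-up sample rows: CUI, term (the remaining columns are omitted here).
sample = io.StringIO(
    "C0000001\theart attack\n"
    "C0000001\tmyocardial infarction\n"
    "C0000002\tflu\n"
)
terms_by_cui = group_terms_by_cui(sample)
```

To read the real file, an open file object can be passed in place of the in-memory sample, for example open("concepts_terms_flat_file", newline="", encoding="utf-8"); the encoding is also an assumption.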

Each term has a column titled "status," which may contain the value "AMBIG" or "VAGUE". An ambiguous term is one that maps to more than one concept, where one particular concept among them is in mind. A vague term is one that maps to more than one concept, where it is not clear which one is in mind.
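Both status values apply to terms that map to more than one concept. A minimal sketch of how such terms could be found from (CUI, term) pairs is shown below; the function name and the sample pairs are hypothetical. Deciding between "AMBIG" and "VAGUE" depends on what the user has in mind, so this sketch only lists candidates and does not assign a status.

```python
from collections import defaultdict


def terms_with_multiple_concepts(pairs):
    """Return {term: sorted list of CUIs} for terms mapped to 2+ concepts."""
    cuis_by_term = defaultdict(set)
    for cui, term in pairs:
        cuis_by_term[term].add(cui)
    return {
        term: sorted(cuis)
        for term, cuis in cuis_by_term.items()
        if len(cuis) > 1
    }


# Made-up (CUI, term) pairs for illustration only.
pairs = [
    ("C0000001", "cold"),
    ("C0000002", "cold"),
    ("C0000003", "flu"),
]
multi = terms_with_multiple_concepts(pairs)
```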

Each term also has a termscore, which estimates how easily understandable that term is. This number is derived from counting the frequency of the term in several large text corpora. The cuiscore is a similar number that represents how understandable the concept is. It is derived from how closely related the concept is to other concepts whose cuiscores are known. The process starts by manually assigning cuiscores to several CUIs whose scores are easy to determine by hand. The combo score attempts to combine the cuiscore and termscore to arrive at another approximation of how understandable the CUI/term combination is.

Another column is titled "reviewed". This simply refers to whether the mapping between the term and the concept has been manually reviewed. If it has not been reviewed, it may still be a valid mapping. If it has been reviewed, then it is definitely (in the opinion of the reviewers) a valid mapping.

Finally, a date column simply gives the date on which the flat file was generated.

The Ngrams Flat File

The ngrams flat file lists terms and phrases that have not mapped to the UMLS, but which, in the estimation of the reviewers, should map to medical concepts. The ngrams are not arranged in any particular order. For each ngram, flags indicate whether it is meta, mod, disparaged, or misspelled. Every misspelled ngram should automatically be disparaged as well. There is also a column which may contain a comment.
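The rule that every misspelled entry should also be disparaged can be checked mechanically. The sketch below assumes each row has already been parsed into a dictionary with boolean flags; the keys "ngram", "misspelled", and "disparaged" and the sample rows are hypothetical, since the actual column names and flag encoding are not described here.

```python
def misspelled_but_not_disparaged(rows):
    """Return ngrams flagged misspelled that are not also flagged disparaged."""
    return [
        row["ngram"]
        for row in rows
        if row["misspelled"] and not row["disparaged"]
    ]


# Made-up rows; the key names are assumptions, not the file's column titles.
rows = [
    {"ngram": "hart attack", "misspelled": True, "disparaged": True},
    {"ngram": "diabetis", "misspelled": True, "disparaged": False},
    {"ngram": "sugar level", "misspelled": False, "disparaged": False},
]
violations = misspelled_but_not_disparaged(rows)
```

An empty result means the rows are consistent with the rule.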

The Stop Concepts Flat File

The stop concepts flat file lists the CUIs and names of concepts that we have judged should be excluded from the consumer health vocabulary. No term should map to any of these concepts.

The Incorrect Mappings Flat File

The incorrect mappings flat file lists combinations of CUIs and terms that are incorrect mappings. Of course, many terms should not map to many concepts; the pairs listed here are ones that actually have been mapped under some system and that the reviewers have judged to be incorrect. This list is not exhaustive. Many terms not listed next to a particular concept will also be incorrect mappings for that concept.
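One way the stop concepts and incorrect mappings files might be used together is as a filter over candidate mappings. The sketch below assumes the two files have already been loaded into a set of CUIs and a set of (CUI, term) pairs; the function name, the exact-match comparison, and the sample data are all assumptions made for illustration.

```python
def filter_mappings(candidates, stop_cuis, incorrect_pairs):
    """Drop candidates that hit a stop concept or a known incorrect pair."""
    kept = []
    for cui, term in candidates:
        if cui in stop_cuis:
            continue  # no term should map to a stop concept
        if (cui, term) in incorrect_pairs:
            continue  # reviewers judged this exact pair incorrect
        kept.append((cui, term))
    return kept


# Made-up data for illustration only.
stop_cuis = {"C0000009"}
incorrect_pairs = {("C0000001", "cold")}
candidates = [
    ("C0000001", "cold"),
    ("C0000002", "cold"),
    ("C0000009", "thing"),
]
kept = filter_mappings(candidates, stop_cuis, incorrect_pairs)
```

Because the incorrect mappings list is not exhaustive, a pair that survives this filter is not thereby confirmed as a valid mapping.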