The gene encoding the toxin A protein of Clostridium difficile (strain VPI 10463) was cloned and sequenced. The coding region of 8,133 base pairs has a mol% G + C of 26.9 and encodes 2,710 amino acids. The deduced polypeptide has a molecular mass of ca. 308 kilodaltons. Nearly a third of the gene, at the 3' end, consists of 38 repeating sequences. The repeating units were grouped into two classes, I and II, on the basis of length and the low levels of DNA sequence similarities between them. There were seven class I repeating units, each containing 90 nucleotides, and 31 class II units, which, with two exceptions, were either 60 or 63 nucleotides in length. On the basis of DNA sequence similarities, the class II repeating units were further segregated into subclasses: 7 class IIA, 13 class IIB, 5 class IIC, and 6 class IID. The dipeptide tyrosine-phenylalanine was found in all 38 repeating units, and other amino acid sequences were unique to a specific class or subclass. This region of the protein has epitopes for the monoclonal antibody PCG-4 and includes the binding region for the Galα1-3Galβ1-4GlcNAc carbohydrate receptor. Located 1,350 base pairs upstream from the toxin A translation start site is the 3' end of the toxin B gene. Between the two toxin genes is a small open reading frame, which encodes a deduced polypeptide of ca. 16 or 19 kilodaltons. The role of this open reading frame is unknown.
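The counts reported above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis. It assumes that the 8,133-base-pair coding region includes one stop codon, which the text does not state but which is consistent with 2,710 amino acids. It also ignores the two class II units of exceptional length, since their sizes are not given. All constant and function names are hypothetical and chosen for illustration only.

```python
# Figures quoted in the text above.
CODING_REGION_BP = 8133
AMINO_ACIDS = 2710
CLASS_I_UNITS = 7
CLASS_I_LENGTH_BP = 90
CLASS_II_SUBCLASSES = {"IIA": 7, "IIB": 13, "IIC": 5, "IID": 6}
CLASS_II_LENGTHS_BP = (60, 63)  # typical lengths; two exceptions are ignored here
TOTAL_REPEATS = 38


def coding_bp_from_protein(amino_acids):
    """Base pairs needed for the protein plus one stop codon (assumption)."""
    return 3 * (amino_acids + 1)


def repeat_counts():
    """Return (number of class II units, total number of repeating units)."""
    class_ii_units = sum(CLASS_II_SUBCLASSES.values())
    return class_ii_units, CLASS_I_UNITS + class_ii_units


def repeat_fraction_bounds():
    """Rough lower and upper bounds on the fraction of the gene in repeats.

    Treats every class II unit as 60 bp (lower) or 63 bp (upper), which is
    only an approximation because two units have other, unstated lengths.
    """
    class_ii_units, _ = repeat_counts()
    class_i_bp = CLASS_I_UNITS * CLASS_I_LENGTH_BP
    low = class_i_bp + class_ii_units * min(CLASS_II_LENGTHS_BP)
    high = class_i_bp + class_ii_units * max(CLASS_II_LENGTHS_BP)
    return low / CODING_REGION_BP, high / CODING_REGION_BP


if __name__ == "__main__":
    print(coding_bp_from_protein(AMINO_ACIDS) == CODING_REGION_BP)
    print(repeat_counts())
    print(repeat_fraction_bounds())
```

Under these assumptions the subclass counts sum to 31 class II units and 38 units overall, and the repeats occupy roughly 31 to 32 percent of the coding region, in line with the description "nearly a third of the gene".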