The generation of realistic 3D models of whole cities has become a vibrant and highly competitive market through the recent activities of, most notably, Google Earth and Microsoft Virtual Earth. While the first generation of these systems only delivered high-quality zoomable images of the ground, the current trend is heavily geared towards 3D: users can access three-dimensional height fields of the terrain, and even 3D models of individual buildings. Simple building models, basically extruded polygons with different types of roofs, can today be generated completely automatically from aerial images. This is a solved problem. Far from solved, however, is the automatic generation of detailed buildings with façades.

Input data for this problem are registered range maps obtained by stereo matching from sequences of highly overlapping, thus redundant, images (taken from a car driving along the road), where each pixel has not only a color but also a depth, a z-value. Although range maps can in principle be rendered directly, the data size is huge and, more importantly, the pixels carry no semantics: a priori there is no difference between a pixel on the ground, on a wall, or on a door. Yet these shape semantics are required by all downstream applications that use the city model.

Shape grammars, on the other hand, have recently become (again) a popular research method for representing 3D buildings. Their great advantage is that they allow buildings to be parameterized, which can be used to populate virtual cities with believable architectural buildings, e.g., for 3D games. The goal of the CITYFIT project is, given highly redundant input imagery and range maps of an arbitrary building in Graz, to synthesize a shape grammar that, when evaluated, creates a clean, CAD-quality reconstruction of that building that fits the original data very closely and makes the semantics of all major architectural features explicit. These shape semantics can even be transferred back to enrich the original data, so that each semantically labeled data point can tell whether it belongs to ground, wall, or door.
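To illustrate what evaluating such a grammar means, the following is a minimal toy "split grammar" sketch in the spirit of CGA-style shape grammars. All rule names and parameters here (`Facade`, `GroundFloor`, `floors`, `tiles`, etc.) are invented for illustration; they are not the CITYFIT grammar itself.

```python
# Toy split grammar: a Facade is split vertically into floors, each floor
# horizontally into tiles, and terminal symbols (Door, Window) are emitted
# as labeled rectangles -- the explicit shape semantics mentioned above.
# This is an illustrative sketch, not the actual CITYFIT system.

def evaluate(symbol, x, y, w, h, params, out):
    """Recursively rewrite `symbol` over the axis-aligned rectangle (x, y, w, h)."""
    if symbol == "Facade":
        # Vertical split into `floors` equal floors; the lowest is the ground floor.
        fh = h / params["floors"]
        for i in range(params["floors"]):
            floor = "GroundFloor" if i == 0 else "UpperFloor"
            evaluate(floor, x, y + i * fh, w, fh, params, out)
    elif symbol in ("GroundFloor", "UpperFloor"):
        # Horizontal split into `tiles` equal tiles; ground floor gets one door.
        tw = w / params["tiles"]
        for j in range(params["tiles"]):
            tile = "Door" if symbol == "GroundFloor" and j == 0 else "Window"
            evaluate(tile, x + j * tw, y, tw, h, params, out)
    else:
        # Terminal symbol: emit a semantically labeled rectangle.
        out.append((symbol, round(x, 2), round(y, 2), round(w, 2), round(h, 2)))

# Evaluating the grammar with concrete parameters yields labeled geometry.
params = {"floors": 3, "tiles": 4}
shapes = []
evaluate("Facade", 0.0, 0.0, 12.0, 9.0, params, shapes)
print(len(shapes))      # 12 labeled terminal shapes
print(shapes[0][0])     # the first ground-floor tile is a Door
```

Fitting such a grammar to data then amounts to choosing the rules and parameters (here, `floors` and `tiles`) so that the emitted labeled rectangles match the observed façade.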