With the rapid advancement of large language models, interest in their mathematical reasoning capabilities has grown. However, existing research has focused primarily on text-based algebra problems and has neglected geometry because high-quality geometric datasets are lacking. To address this gap, this paper introduces AutoGeo, a novel approach for automatically generating geometric images to meet the demand for large-scale and diverse geometric datasets. AutoGeo is used to create AutoGeo-100k, an extensive repository comprising $100$k high-quality geometry image-text pairs. By leveraging precisely defined geometric clauses, AutoGeo-100k covers a wide variety of geometric shapes, including lines, polygons, circles, and complex spatial relationships. Furthermore, this paper demonstrates the efficacy of AutoGeo-100k in enhancing the performance of multimodal large language models through fine-tuning. Experimental results indicate significant improvements in the models' ability to handle geometric images, as evidenced by higher accuracy in tasks such as geometric captioning and mathematical reasoning. This research not only fills a critical gap in the availability of geometric datasets but also paves the way for more sophisticated AI-driven tools in education and research.
AutoGeo-100k is the largest dataset of geometric shapes, comprising 100,000 images of geometric shapes along with their corresponding natural language descriptions and the clauses that determine the semantic meaning of each shape. The dataset is generated automatically by combining 77 clauses through the AutoGeo pipeline, which includes a Rule-based Clause Selector, an image generation module capable of producing various images with the same semantic meaning, and a language generation module that generates 20 caption templates for each clause. We propose the Geometry Captioning (GC) task based on the AutoGeo-100k dataset and evaluate the performance of several state-of-the-art multimodal large language models (MLLMs) on it. Additionally, we compare baseline models with models tuned on smaller subsets of the AutoGeo dataset (AutoGeo-10k, AutoGeo-30k, AutoGeo-50k) on the Geometry Question and Answer (GQA) task. Our findings indicate that performance on both tasks increases as the volume of tuning data from the AutoGeo dataset increases, highlighting the importance of a large amount of geometry data for training and evaluation. Given the limited size of current geometry datasets and the difficulty of obtaining geometry image and caption data, our goal is to provide the research community with a comprehensive dataset for studying MLLMs in the context of geometry. We also present a methodology for generating a large quantity of new geometry data, which we believe will benefit further research and development in this field.
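The pipeline described above (rule-based clause selection followed by template-based captioning) can be illustrated with a minimal Python sketch. This is not the AutoGeo implementation: the `Clause` structure, the five clause names, their prerequisite rules, and the caption templates are all hypothetical stand-ins for the 77 clauses and 20 templates per clause, and image rendering is omitted entirely.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    """Hypothetical clause record; the real AutoGeo clause format may differ."""
    name: str
    requires: frozenset  # object kinds that must already exist in the figure
    provides: frozenset  # object kinds this clause adds to the figure
    templates: tuple     # caption templates (AutoGeo uses 20 per clause)


# Illustrative catalog only; AutoGeo itself defines 77 clauses.
CLAUSES = [
    Clause("triangle", frozenset(), frozenset({"triangle", "segment"}),
           ("A triangle is drawn.", "Three points form a triangle.")),
    Clause("circle", frozenset(), frozenset({"circle"}),
           ("A circle is drawn.", "The figure contains a circle.")),
    Clause("midpoint", frozenset({"segment"}), frozenset({"point"}),
           ("A midpoint of a segment is marked.", "One segment is bisected.")),
    Clause("incircle", frozenset({"triangle"}), frozenset({"circle"}),
           ("The triangle has an inscribed circle.",
            "A circle is tangent to all three sides.")),
    Clause("tangent", frozenset({"circle"}), frozenset({"segment"}),
           ("A tangent to the circle is drawn.", "A line touches the circle once.")),
]


def select_clauses(rng, n):
    """Rule-based selection: only pick clauses whose prerequisites are met."""
    available = set()
    chosen = []
    for _ in range(n):
        candidates = [c for c in CLAUSES
                      if c.requires <= available and c not in chosen]
        if not candidates:
            break
        clause = rng.choice(candidates)
        chosen.append(clause)
        available |= clause.provides
    return chosen


def generate_sample(seed, n_clauses=3):
    """Return the clause names and one caption built from random templates."""
    rng = random.Random(seed)
    clauses = select_clauses(rng, n_clauses)
    caption = " ".join(rng.choice(c.templates) for c in clauses)
    return {"clauses": [c.name for c in clauses], "caption": caption}


if __name__ == "__main__":
    print(generate_sample(seed=0))
```

Seeding the generator per sample keeps each clause list and caption reproducible. In this sketch the caption varies with the template choice while the clause list stays fixed, which loosely mirrors the idea of several surface forms sharing one semantic meaning.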
Please contact huanzh@zju.edu.cn for questions about AutoGeo.