<p dir="ltr">Automatic image captioning is an important research topic that connects computer vision and natural language processing. It has many uses, such as helping people with disabilities and understanding visual content better. A key aspect of the image captioning task is visual and language attention, which allows a model to focus on the most important parts of an image and text. Much like human vision, it helps the model identify key areas and generate more accurate captions. </p><p dir="ltr">Visual-language attention remains challenging, particularly in multi-modal image tasks. Existing attention algorithms in computer vision and language are often ineffective or poorly suited to image structures and scene complexity. A key limitation is their failure to consider semantic representations of image objects, leading to captions that, while grammatically correct, often lack meaningful interpretation of the image. This thesis introduces a novel semantic transformer attention model for image captioning. The proposed approach enhances the image encoder by mapping visual features into a semantic space. Leveraging transformer-based self-attention, this image encoder and language decoder framework integrates visual and semantic representations, producing more accurate and meaningful image captions. Furthermore, two kinds of language decoders were proposed to create two different models for image captioning: one generated captions sequentially, while the other processed language in parallel. The proposed models were evaluated on standard evaluation metrics using both public and private datasets to ensure their generalizability. Comparison of the proposed models with state-of-the-art models showed how effective the semantic transformer attention models were in bridging the gap between image and caption. The proposed models achieved scores of 120.9 and 132.0 on the MS-COCO dataset under the CIDEr metric. 
In addition, two new frameworks were introduced: the first was designed for Internet of Things environments, using edge and cloud computing to improve image captioning efficiency. The second integrated a pre-trained large language model to make the captioning system more user-friendly, allowing users to easily upload images and receive captions through a simple user interface. The findings underscored the importance of semantic knowledge understanding in improving the performance of image captioning models, strengthening the alignment between image features and language generation, and producing accurate image captions.</p>
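<p dir="ltr">The encoder idea described above (projecting visual region features into a semantic space, then applying transformer-style self-attention over the regions) can be illustrated with a minimal numpy sketch. This is not the thesis's actual implementation: the dimensions, random placeholder weights, and single-head scaled dot-product attention are illustrative assumptions only.</p>

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_regions, d_visual, d_model = 5, 2048, 512  # assumed sizes, for illustration

# Hypothetical learned projection mapping CNN region features
# into a shared semantic space (weights here are random placeholders).
W_sem = rng.normal(0.0, 0.02, (d_visual, d_model))
visual_feats = rng.normal(size=(num_regions, d_visual))
semantic_feats = visual_feats @ W_sem              # (5, 512)

# Single-head self-attention over the semantic region features.
W_q = rng.normal(0.0, 0.02, (d_model, d_model))
W_k = rng.normal(0.0, 0.02, (d_model, d_model))
W_v = rng.normal(0.0, 0.02, (d_model, d_model))
Q, K, V = semantic_feats @ W_q, semantic_feats @ W_k, semantic_feats @ W_v

# Scaled dot-product attention: each region attends to every other region.
attn = softmax(Q @ K.T / np.sqrt(d_model))         # (5, 5) region-to-region weights
attended = attn @ V                                # (5, 512) context-aware features
```

<p dir="ltr">A language decoder would then condition on the attended features, either autoregressively (sequential captioning) or over all positions at once (parallel captioning).</p>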
School affiliated with: School of Engineering and Physical Sciences (Theses)