Using Multi-modal LLMs from Spring AI

A use case I’m currently building revolves around understanding the “Customer Basket” element of shopping.
The goal is to encourage bank customers to upload their shopping receipts so consumer behavior can be better understood.

To implement the use case, I thought I’d use a multi-modal LLM to look at photos of receipts and tell me what’s in the basket, returning the result as JSON.

Yes, this use case could have been built using a simple OCR approach and, to some degree, that’s all I’m initially using the LLM for. But offloading the minutiae of OCR to the LLM, and being able to access the information within the image in a dynamic, prompt-driven manner, means the application can be built very quickly.
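To make "prompt-driven" concrete, here is a minimal sketch of the kind of prompt contract the endpoint could rely on. The class name, the sentinel value, and the JSON field names are all assumptions for illustration; the actual prompt and NEGATIVE_ANSWER used in the application are not shown in this post.

```java
/** Hypothetical prompt contract for the receipt use case; names and wording are illustrative. */
final class ReceiptPrompt {

    // Sentinel the model is asked to return when the image is not a receipt (assumed value).
    static final String NEGATIVE_ANSWER = "NOT_A_RECEIPT";

    // Prompt asking for a fixed JSON shape; the field names are assumptions, not the real schema.
    static final String PROMPT = """
            You are given a photo. If it is not a shopping receipt, reply with exactly %s.
            Otherwise reply with JSON only, in this shape:
            {"merchant": string, "date": "YYYY-MM-DD", "currency": string,
             "items": [{"name": string, "quantity": number, "price": number}],
             "total": number}
            """.formatted(NEGATIVE_ANSWER);

    /** True when the reply carries the sentinel, i.e. the model judged the image not to be a receipt. */
    static boolean isRejected(String reply) {
        return reply != null && reply.contains(NEGATIVE_ANSWER);
    }
}
```

Changing what gets extracted (say, adding a loyalty card number) then becomes a prompt edit rather than a change to parsing code.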

The non-AI analogy I can think of is Relational Databases and where logic resides. A table row filter can be done in the application layer, but why not leverage the power of SQL and push predicates to the database?

Here’s a quick sample of what’s being built… more soon

Dependencies

plugins {
    id 'org.springframework.boot' version '3.5.0'
    id 'io.spring.dependency-management' version '1.1.4'
    id 'java'
}

group = 'ai.someexamplesof'
version = '0.0.1-SNAPSHOT'
sourceCompatibility = '18'

repositories {
  mavenCentral()
  maven { url 'https://repo.spring.io/milestone' }
  maven { url 'https://repo.spring.io/snapshot' }
  maven {
    name = 'Central Portal Snapshots'
    url = 'https://central.sonatype.com/repository/maven-snapshots/'
  }
  
}

dependencies {
    // AI specific imports
    implementation 'org.springframework.ai:spring-ai-client-chat:1.0.0'

    // inference service implementation
    implementation 'org.springframework.ai:spring-ai-starter-model-vertex-ai-gemini:1.0.0'

    // generic
    implementation 'org.springframework.boot:spring-boot-starter-actuator'
    implementation 'org.springframework.boot:spring-boot-starter-web'
    implementation 'io.micrometer:micrometer-registry-prometheus' //prometheus exposure

    testImplementation 'org.springframework.boot:spring-boot-starter-test' 
}
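The Vertex AI Gemini starter also needs to know which Google Cloud project, region, and model to use. A minimal application.properties sketch follows; the project ID, region, and model name are placeholder assumptions, and the multipart limits are only a suggestion for photo uploads.

```properties
# Assumed values: replace with your own GCP project, region, and model.
spring.ai.vertex.ai.gemini.project-id=my-gcp-project
spring.ai.vertex.ai.gemini.location=us-central1
spring.ai.vertex.ai.gemini.chat.options.model=gemini-2.0-flash
# Receipt photos can exceed Spring Boot's default 1MB per-file multipart limit.
spring.servlet.multipart.max-file-size=10MB
spring.servlet.multipart.max-request-size=10MB
```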

The Core Code

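    // Imports used by this method (Spring AI 1.0.0, Spring Boot 3.5):
    //   java.nio.file.Files, java.nio.file.Path, java.util.HashMap, java.util.Map
    //   jakarta.servlet.http.HttpServletResponse
    //   org.springframework.ai.chat.messages.UserMessage
    //   org.springframework.ai.content.Media
    //   org.springframework.core.io.PathResource, org.springframework.core.io.Resource
    //   org.springframework.http.HttpStatus, org.springframework.http.MediaType
    //   org.springframework.util.MimeTypeUtils
    //   org.springframework.web.bind.annotation.PostMapping, RequestParam
    //   org.springframework.web.multipart.MultipartFile
    // logger, prompt, chatModel, NEGATIVE_ANSWER and vertexResponseProcessor are defined elsewhere in the enclosing class.
    @PostMapping(value="/receipt", consumes=MediaType.MULTIPART_FORM_DATA_VALUE)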
    public Map<String,Object> processReceipt(@RequestParam("receipt") MultipartFile multipartFile, HttpServletResponse response) throws Exception {
        logger.info("invoked /receipt");
        Map<String,Object> responseObject = null;
        //save the upload to a temp file
        try {
            Path tempDir = Files.createTempDirectory("receipt-upload");
            //use a generated name: the client-supplied filename may be null or contain path segments
            Path destination = Files.createTempFile(tempDir, "receipt-", ".img");
            multipartFile.transferTo(destination);
            //TODO put the receipt somewhere else
            //wrap the saved file as a Resource for the Media attachment
            Resource imageResource = new PathResource(destination);
            // UserMessage userMessage = new UserMessage(prompt,List.of(new Media(MimeTypeUtils.IMAGE_JPEG,imageResource)),null);
            UserMessage userMessage = UserMessage.builder()
                .media(new Media(MimeTypeUtils.IMAGE_JPEG,imageResource))
                .text(prompt)
                .build();
            //send the message and get the response
            String reply = chatModel.call(userMessage);
            //parse the JSON out of the response
            if (reply.contains(NEGATIVE_ANSWER)) {
                //set the 400
                response.setStatus(HttpStatus.BAD_REQUEST.value());
                responseObject = new HashMap<>();
                responseObject.put("Error","Unfortunately, this image is not a receipt");
                return responseObject;
            } else {
                responseObject = vertexResponseProcessor.parseJSON(reply);
            } //end if
        } catch (Exception e) {
            logger.error("Error processing receipt upload: ", e);
            // return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).build(); 
            response.setStatus(HttpStatus.INTERNAL_SERVER_ERROR.value());
            return null; //dump out
        }        
        //return
        return responseObject;
    }
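The vertexResponseProcessor.parseJSON call is not shown above. Models often wrap JSON in Markdown fences or add a sentence around it, so one plausible first step is to isolate the JSON object before parsing. The sketch below is an assumption about how that step might look, not the actual implementation; ReplyJson and extractJsonObject are hypothetical names.

```java
/** Hypothetical helper: isolates the JSON object inside a model reply. */
final class ReplyJson {

    /**
     * Returns the text from the first '{' to the last '}' in the reply,
     * or null when the reply is null or contains no such span.
     * This drops Markdown fences and surrounding prose but does not validate the JSON.
     */
    static String extractJsonObject(String reply) {
        if (reply == null) {
            return null;
        }
        int start = reply.indexOf('{');
        int end = reply.lastIndexOf('}');
        if (start < 0 || end < start) {
            return null;
        }
        return reply.substring(start, end + 1);
    }
}
```

The extracted string could then be handed to a JSON library, for example Jackson's ObjectMapper.readValue(json, Map.class), to produce the Map the endpoint returns. A null result is worth treating as an error response rather than passing on to the parser.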

Multi-modal LLMs Seriously Speed Things Up!
