Currently working on a little project to help with transaction data anomaly detection and ran into a data issue: there was none. To keep the project moving quickly, mock data generation was the key, and although existing generators are out there, we needed to ensure the data met a couple of rules (was a certain age, contained certain repayments and then no repayments, carried different interest rate charges, etc.)

Building a generator can take some time, but enter the LLM to assist.

Using publicly available, free (no sign-up required) LLMs, the build time for a successful Python-based dataframe generator was brought down to less than two hours. That is still a while, but mainly because of the various rules and subtleties needed.
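For flavour, here is a minimal sketch of the kind of rule-driven generator this ended up looking like. The function name, column names, and rules below are illustrative assumptions, not the actual code from the repo.

```python
import numpy as np
import pandas as pd

def generate_mock_transactions(n_accounts: int = 50, months: int = 24, seed: int = 42) -> pd.DataFrame:
    """Illustrative only: monthly loan transactions where some accounts
    repay for a while and then stop, with varying interest rates."""
    rng = np.random.default_rng(seed)
    rows = []
    start = pd.Timestamp("2022-01-01")
    for account_id in range(n_accounts):
        balance = rng.uniform(1_000, 10_000)      # opening balance
        rate = rng.choice([0.05, 0.10, 0.15])     # different interest rate charges
        stop_month = rng.integers(6, months)      # repayments, then none
        for m in range(months):
            interest = balance * rate / 12
            repayment = rng.uniform(100, 400) if m < stop_month else 0.0
            balance = balance + interest - repayment
            rows.append({
                "account_id": account_id,
                "date": start + pd.DateOffset(months=m),
                "interest_rate": rate,
                "interest_charged": round(interest, 2),
                "repayment": round(repayment, 2),
                "balance": round(balance, 2),
            })
    return pd.DataFrame(rows)
```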

Lessons Learnt

  1. The smaller the better!
    • ask the LLM to generate simple functions
    • use your own code and architecture to link them together
    • asking, in a single prompt, for a complex algorithm proved dicey
    • small, concise function asks were the key (see the sketch after this list)
  2. Abstract for hints
    • asking the LLM to write your exact code can be frustrating, and a lot of time is spent debugging
    • this can be due to language misinterpretation, as well as the very statistical “have-I-seen-that-before” nature of LLMs
    • instead, ask for structures, e.g. “write a python function that re-sorts a dataframe by the date column and then reindexes it”
    • these fragments then help assemble the code with few bugs and errors
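As an example of the small, abstract asks that worked well, here is roughly what the sort-and-reindex prompt above yields; the helper name is illustrative, not taken from the repo.

```python
import pandas as pd

def sort_by_date_and_reindex(df: pd.DataFrame) -> pd.DataFrame:
    """Re-sort a dataframe by its 'date' column and rebuild the index."""
    return df.sort_values("date").reset_index(drop=True)

# Fragments like this are then wired together in your own code,
# e.g. generate the raw rows first, then tidy them up:
# df = sort_by_date_and_reindex(generate_mock_transactions())
```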

The Code

It’s a Work In Progress but can be found [here](https://github.com/willschipp/collection-ai/blob/dev/src/txn_simulator.py)