DTC vs LLM: ML Edition
- Sam McInerney
- Mar 10
- 18 min read
With all the excitement about Anthropic's Claude being one of the best language models yet, and with a simple prompt having the power to generate a fully-fledged retro arcade game, I couldn't help but wonder if this is a sign that the end is nigh for coders like myself.
Inspired by a debate at the recent HETT (Healthcare Excellence Through Technology) conference and blown away by recent advances in the technology, I found myself pondering the question, will LLMs replace coders?
Just as in the technological revolutions of the past, where the fishing rod was retired in favour of the net, are coders about to be pushed into an idyllic, fondly remembered past?
Romantically remembered as the people who stayed up in the dark until 4 am fixing bugs in code, whilst consuming too much caffeine and listening to weird niche music...
Some think so, and I think I'm among them.
So, let's put the theory to the test!
Today's post is a showdown between Doctors That Code (i.e. The Humans) vs. LLMs (i.e. The Language Models): Machine Learning Edition.

The Rules
I can go to any language model I want and keep asking it to generate code and fix problems.
When we hit a dead end, or I'm going around in circles, or I'm bored... and I have to fix the code, humans win.
When it gets it right and I'm not needed, the language models win.
The Task
For a bit of fun, and as it's actually a project I've been working on recently, I figured let's take on the language models at their own game. Let's code some machine learning (ML)!
We're going to build a Random Forest algorithm, put data (that we're going to clean) into it, get some results, draw some outputs, and see what it all means. Fun.
Some of you may be asking: am I qualified to face off as the de facto data scientist against the mighty state-of-the-art LLMs?
No, of course not!
However, I've been attempting to skill up my data science chops in an effort to secure my financial future for when robots take over the medical profession: Robodocs™ capable of being both empathetic and of getting the answer right every time, unlike their human counterparts, who by that point will all be so burnt out and fallible that they couldn't tell a pneumonia from a lower respiratory tract infection.
It's my belief that when the prophesied general AI comes, there will be only two jobs left: data scientist, whose role is to update the AI models, and milking cows.
The latter because despite all the advances we have made as a civilisation, the ability to milk a cow is a skill with such finesse, such fine touch and caress, that even the most advanced robots are yet to master it... or so I assume.
Either way, given the choice, I'd rather do the data science bit.
Back to the task at hand.
Scenario
We have a bunch of de-identified data taken from cancer patients who have unfortunately been admitted to critical care.
For those unfamiliar, this is where you're in worse shape than your typical hospital patient and are moved to a special bit of the hospital.
In critical care, you get closer monitoring, and sometimes some very clever machines and drugs to help you breathe and keep your blood pressure up.
The dataset we have is fairly big: about 400 rows and 82 columns of data (roughly 32,000 pieces of information). The columns contain data on things like age, blood tests, cancer type, etc.
What I want our clever machine learning algorithm to do is look at this data and tell me if anything stands out when we try to answer three specific questions.
The Questions
Will they ever be fit for cancer treatment again?
Are they in a good state of fitness when / if they survive and get discharged?
Do they survive for 6 months after being discharged?
Straightforward enough.
The GitHub link is here for those who want to follow along.
So DTC (The humans) vs LLMs (The machines). Let's do this!
Round 1: The data.
The thing with patient data is, you can't feed it into an LLM.
Even if it's de-identified, because them's the rules.
Not only may it contain non-direct identifiers, but think back to the forms you filled in to get permission to work with it (Information Governance, Caldicott, Ethics etc.).
If you had said that you wanted to send this data to an LLM, with its data centre probably housed somewhere in Utah in the United States, that would have been an automatic do not pass go, do not collect $200. At least in our current regulatory environment.
So this one could be chalked up as an automatic win on the humans' side. Hurrah!!
However, I don't want to call it this early. As part of my mission to improve, inspire discussion, and educate others who'd rather become a data scientist than a milker when the time comes, I want to see if our LLMs can work around this issue. Synthetic data creation!
Before we take this to our AI counterparts, I want to take this opportunity to explain why we start with data rather than going straight to the algorithm.
When using machine learning, the data, rather than the algorithm, is where it all begins.
The algorithm choice is informed by the dataset: what is in it, how much of it you have, and the question you want to answer.
In this case, the question is which data affect the three outcomes we are interested in.
The outcomes are all binary, meaning they have either a yes or no answer.
This means we need to select a machine learning algorithm that is capable of handling binary classification outcomes.
We also have a mix of continuous numerical and categorical data in our dataset, like BMI, which is a number, and sex, which is a category. Another consideration is missing results: we have an incomplete set of blood test results and observations, and some of our patients were lost to follow-up.
So we need an algorithm that can handle these things well.
Another really important question to ask of your data is how much you have, and how complex the relationship between the data and the outcomes is likely to be.
There are some incredible algorithms out there, and you've no doubt heard of deep learning and the mighty neural network, which you probably want to use for this type of problem as it is likely a complex relationship we are looking for... But the truth is, without enough data, it just won't work.
They also take some serious compute to use, and sometimes they overcomplicate a solution that could be served by a simpler approach. You don't want to use a hedge trimmer when a pair of scissors will suffice.
That being said, let's get the LLMs to generate our synthetic data.
I first went to Anthropic's Claude...
Hello Claude
I have a dataset, but I can't upload it to you as it contains sensitive patient information. However, I have uploaded the "Credit ML Dataset LLM" so you can see all the fields. I have added 2 examples of fictitious data to it so you can see the type of values it would expect.
For the diagnosis categories and treatment categories, I have uploaded two further CSV files named appropriately so you know what they contain.
I want you to create a dataset with approximately 400 synthetic values.
To its credit, it took my prompt, ingested my three CSV files without complaint, and in a matter of seconds produced a 350-line piece of code that looked like it would create the correct fields with all the correct values. That was impressive.
However, failure struck when I came to compile it...
"Exception has occurred: ValueError: Probabilities are not non-negative"
What the hell did that mean?!
Another problem with not writing your own code is that you will have no idea why the program is doing anything with probabilities. (And why is my compiler giving me double negatives!? ... )
Before I move on, it's worth stating here that a single Python file containing 350 lines is bad code. It's monolithic and unmaintainable. It violates the SOLID principles of software development. (See attached link, well worth understanding and adhering to.)
There were no modules, just a single function and line after line of data creation.
I'm sure there are better ways of doing this. In fairness, it would have taken a human hours to write this much code, but as it stands, it isn't code that I would ever write, and it didn't work.
Claude and I had a few more interactions with me pasting the error and the affected lines each time. However, each time, new, similar errors appeared.
It wasn't taking the lessons learnt from the earlier corrections and applying them to the other regions of the code where it had made similar mistakes.
Also, one of the major problems with it generating this single monolithic piece of code is that it has to keep rewriting and updating it. In doing so, it fills up the context window (which is like the working memory of the language model). So, after a not particularly long game of back and forth, it ran out of room to debug the code and answer my questions.
When it declared that I had to wait 6 hours to continue using the premium model... Pfft... Moving on to Microsoft's Copilot.
I can't paste the whole code into Copilot; it's too big. So now I'm delving into the error message and pasting that, plus the lines of affected code, into Copilot.
A new focus of failure has emerged!
The LLM is failing to generate a synthetic Charlson Comorbidity Index (CCI) score.
This is a score that gives you a probability of being alive in 10 years based on how many medical problems (like heart disease and diabetes) you have. Cheerful.
This is the code that is failing...
# Charlson Comorbidity Index: 0-19
# Make sure probabilities sum to 1.0
charlson_probs = [0.05, 0.10, 0.15, 0.15, 0.15, 0.10, 0.10, 0.05, 0.05, 0.02, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.005, 0.005, 0.005]
# Add any remaining probability to the last element
charlson_probs.append(1.0 - sum(charlson_probs))
data['Charlson Comorbidity Index'] = np.random.choice(range(0, 20), n_samples, p=charlson_probs)
I can see what it is trying to do here: it wants a probability for each possible value of the overall CCI score. (The listed probabilities actually add up to slightly more than 1, so the "remaining probability" it appends to the end is negative, hence the error.)
It manages to fix this with a few prompts, but something has occurred to me: this code is fundamentally flawed.
The Charlson comorbidity index has 17 values in it, not 20.
Another problem with getting an LLM to generate your code is that you're making assumptions about the accuracy of the output based on what is essentially a black box if you can't code. It's time to step in and help the LLMs out.
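For what it's worth, a safer pattern is to define relative weights and let the code normalise them, so the probabilities always sum to exactly 1 and can never go negative. A rough sketch (the weights and range here are made up, and whether the range is clinically sensible still needs a human to check):
import numpy as np

n_samples = 400
# Relative weights, one per possible score value (illustrative numbers only)
weights = np.array([5, 10, 15, 15, 15, 10, 10, 5, 5, 2, 2, 1, 1, 1, 1, 1, 0.5, 0.5], dtype=float)
# Normalising guarantees the probabilities sum to 1 and none of them are negative
probs = weights / weights.sum()
cci_scores = np.random.choice(np.arange(len(weights)), size=n_samples, p=probs)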
I'm calling this round... Humans 1 - LLMs 0.

Match Review
There are some good ideas in there, but there are flaws in how it has approached things, especially when it comes to the probability distributions.
It takes some joined-up thinking when it comes to generating a data set.
For example, the Glasgow Coma Scale (GCS), which we commonly use to assess how responsive a patient is, ranges from 3 to 15 (3 is completely unresponsive, 15 is totally normal).
There is a little adage in medicine: "GCS under 8, intubate," which means that the patient is so unresponsive that they will likely need breathing support. So if you generate a GCS under 8, you need to make sure the rest of the data makes sense around that. For example, breathing support should be a 'yes' in those cases. There is no logic in this code to support that. Or anything else like that.
The DTC approach
If I were generating a data set with this many variables that needed some joined-up logic, I'd create a class that captured the patient information.
The class would have properties for each variable and reusable methods for generating probability distributions and number sets.
This would allow the logic to be re-used wherever it is needed in the data, and it would be a much cleaner, more maintainable bit of code.
You could then save this class and re-use and extend it for other datasets.
This is the core idea behind object-oriented development.
I've not done this; it would take too long :)
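But to illustrate the idea, a skeleton of such a class might look something like this (a sketch only, with invented distributions, not code from the project):
import numpy as np

rng = np.random.default_rng(42)

class SyntheticPatient:
    """One synthetic critical care patient, with joined-up logic between fields."""

    def __init__(self):
        self.age = int(rng.normal(65, 12))
        self.gcs = int(rng.choice(range(3, 16)))  # Glasgow Coma Scale, 3-15
        # Joined-up logic: a GCS under 8 should mean breathing support is a 'yes'
        self.ventilated = self.gcs < 8 or rng.random() < 0.2

    def to_row(self):
        return {
            "Age": self.age,
            "First GCS on Critical Care admission": self.gcs,
            "Was the patient ventilated?": int(self.ventilated),
        }

# Build the dataset patient by patient, rather than column by column
rows = [SyntheticPatient().to_row() for _ in range(400)]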
So I've just patched up what the LLM created to get the dataset finished for the remainder of the post...
Round 2: Pre-processing the data
Pre-processing the data is probably the most challenging part of the whole process.
It relies on applying specialist knowledge (in this case, medical knowledge) to make informed and practical decisions about what is and isn't important in the data, and about which missing values you can reasonably invent, and which you can't.
Filling in those missing values is called imputation.
There are lots of ways to go about doing this, lots of methods that can be used.
It's a common problem in medical data, and many papers have been written to reflect the approaches you can take to deal with it. So rather than just a coding test, this is also a reasoning test.
The real dataset had a lot of missing values; the synthetic set was, obviously, complete.
So, I've taken my Excel spreadsheet and done some random deleting to see how it handles the problems I've created. (Cue excessive Evil laughter....)
ChatGPT is first up this time.
Dear ChatGPT,
I've got an incomplete dataset of medical values, and I need to create a Python program that will logically impute the missing values, using methods such as mean, mode, last observation carried forward, and any other suggestions you'd make. Can you help me?
Please find the attached spreadsheet.
Wow... It read the file, listed a number of suggestions for imputation (all of which sounded completely reasonable) and picked up that there were missing values... Amazing.
Check out this response!
Absolutely! Let’s break this down into steps. First, I’ll take a look at your file (synthetic_cancer_critical_care_data_missing.csv) to understand its structure and the missing values. Then, I’ll help you build a Python program that can handle logical imputation using techniques like:
Mean imputation (for numerical variables)
Mode imputation (for categorical variables)
Last Observation Carried Forward (LOCF) (for time-series-like data)
Median imputation (as an alternative to mean if outliers are present)
Group-wise imputation (e.g., imputing based on patient groups, if you have patient IDs)
Let me load and inspect the data first:
Thanks for sharing the dataset! Here's a quick summary of what we have:
400 rows with 76 columns.
A mix of numerical (float64, int64) and categorical (object) columns.
Many columns have missing values, some more than others (e.g., "Features of sepsis?" is missing about 71 rows).
The code it generated was 55 lines long. It compiled perfectly first time.
And get this, it even generated an output spreadsheet called "Imputed data set!"... :O
For the humans, I had written about 10 functions for this part of the program, with probably a couple of hundred lines of code.
I'd handled each component separately, like blood values and mechanical ventilation, and thought about missing sex and BMI. So when I see 55 lines and a single function to do all of this, alarm bells are ringing. Too good to be true? Yes, it was!
Delving into the imputed dataset, several of my binary categories, such as "Anti-cancer therapy within 6 weeks" and "acute kidney injury", which are yes/no categories that had previously been populated with 0s and 1s, now had numbers like "0.3577235" in all the missing fields.
A quick prompt back to ChatGPT to see if it can save the day!
Oh no!!
In some of the missing fields that were previously binary (like previous anti-cancer therapy and AKI) they are now no longer binary!! Help!
Much to my amazement, with that simple prompt it replied, explaining that this is a common issue when handling mixed data (categorical, binary, numeric) and that it had misclassified certain fields.
It then re-read the data and pulled out the categories that were binary...
# Define binary columns (these should only ever have 'Yes', 'No' or 0/1)
binary_cols = [
    'Previous anti-cancer therapy',
    'AKI present on admission?',
    'Features of sepsis?',
    'Inotropes required?',
    'Was the patient ventilated?',
    'Acute admission to Critical Care?'
]
I ran the new program, which had now crept up to 80 lines long.
It compiled, and it worked. It still missed some binary columns, but once I pointed them out, it corrected them.
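The underlying fix is simple enough once the column types are pinned down. Roughly this pattern (a sketch, reusing a few of the binary column names from above): mode for the yes/no fields, median for the genuinely numeric ones.
import pandas as pd

df = pd.read_csv("synthetic_cancer_critical_care_data_missing.csv")

binary_cols = ['Features of sepsis?', 'Inotropes required?', 'Was the patient ventilated?']
numeric_cols = df.select_dtypes(include="number").columns.difference(binary_cols)

# Mode imputation for the binary fields, so they stay 0/1
for col in binary_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Median imputation for the numeric fields, which is robust to outliers
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())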
Last observation carried forward.
Last observation carried forward is a great way of handling missing values if it's done appropriately.
For example, let's imagine you're taking a heart rate every hour, and you miss a recording here and there.
If you know that the heart rate was 82 at 2pm and 80 at 4pm, in the world of machine learning you're probably okay to assume the missing value in between was in the same ballpark. So you can carry the last observation forward and fill in 3pm as 82.
This isn't foolproof, of course. There may have been a total disaster where they went into a rapid pulseless ventricular heart rhythm, an emergency call was put out... Sirens blaring as the emergency team ran in, a few chest compressions delivered in a flurry of activity whilst people put on defibrillator pads and shouted instructions to one another.
The patient was triple shocked out of it and then spontaneously recovered. Within an hour, heart rate back to normal. Could have happened. Probably didn't.
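On a genuinely time-ordered series, LOCF is a one-liner in pandas (a toy example, not the project data):
import pandas as pd

# Hourly heart rates with a missing 3pm reading
hr = pd.Series([82, None, 80], index=pd.to_datetime(["14:00", "15:00", "16:00"]))
hr_filled = hr.ffill()  # carries the 2pm value (82) forward into the 3pm gap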
In the code, it has treated each row as a different time point for the same patient, but of course, each row is a different patient.
So it is making up observations based on the previous patient's data.
This won't do at all.
# Columns to apply Last Observation Carried Forward (LOCF) (time-series like vitals)
locf_columns = [
    'Final NEWS 2 score Before Critical Care admission',
    'Final HR before Critical Care admission',
    'Final RR before Critical Care admission',
    'First GCS on Critical Care admission',
    'Lowest temp',
    'Highest HR',
    'Lowest HR',
    'Highest RR',
    'Lowest RR',
    'Lowest GCS'
]
To its credit, it's almost like it knew this could be a potential problem.
As you can see in the comments of the code above, it states it is treating these fields like a time series... and it also captures many of the fields I would consider carrying the last observation forward on.
Because I hadn't necessarily read the code closely enough to spot this at the time, and following the ad-hoc rules of the game, I took this back to ChatGPT.
It suggested that LOCF was very difficult to implement here, as it didn't know where to get the previous result from, and that we'd in fact be better off going for mode or median in this case. That is a fair compromise!
It also went a bit off-piste and threw me a curveball. It suggested a K-Nearest Neighbour imputation approach. K nearest neighbour is where you find the patients (or records within your dataset) with the most similar values to yours (something like a case-control matching situation) and apply their values to what's missing on yours.
This was awesome because it was a really good idea and I didn't think of it when I was coding, and probably should have. This is an example of where the LLMs can really come into their own. So that was a total win for team LLM.
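If you fancy trying that idea yourself, scikit-learn has it more or less off the shelf. A sketch (it only works on the numeric columns, so the categorical fields need encoding or separate handling first):
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("synthetic_cancer_critical_care_data_missing.csv")
numeric = df.select_dtypes(include="number")

# Each missing value is filled in from the 5 most similar patients (rows)
imputer = KNNImputer(n_neighbors=5)
df[numeric.columns] = imputer.fit_transform(numeric)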
The final step when it comes to the pre-processing is turning everything into a format that our random forest can read. The random forest needs numbers!
It's not able to read words, it needs everything to be turned into some sort of numerical equivalent to allow the decision trees to be built and the maths to be done.
So when we have diagnosis categories in our dataset, like breast cancer, lung cancer, bowel cancer, etc., we need to turn these categories into numbers.
The wrong way of doing this is to assign each category a number from 1 to 10, because the machine learning model might accidentally read the higher numbers as implying something, and you can mess things up that way.
The right way of doing this is something called one-hot encoding: you take all of the categories you could have and turn them into an extra set of columns, with a 1 in the relevant column and 0s in the rest.
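In pandas, this is close to a one-liner. A sketch with a made-up diagnosis column:
import pandas as pd

df = pd.DataFrame({"Diagnosis": ["Breast cancer", "Lung cancer", "Bowel cancer", "Lung cancer"]})

# One new column per category: 1 where it applies, 0 everywhere else
encoded = pd.get_dummies(df, columns=["Diagnosis"], dtype=int)
print(encoded)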
This took a bit of back-and-forth with ChatGPT, and I ended up using up my premium allowance, so I went back across to Copilot, and it smashed it.
It separated the work into relevant functions for each category. I was able to give it a few tweaks here and there to handle missing values, and it worked.
I'm impressed. It wasn't perfect, but this round goes to the machines.

Round 3: The Showdown
This is the one we've all been waiting for. The final round!
I went back to Anthropic's Claude, as what is considered to be the premium proprietary model at present. Learning from my previous prompt wars, I thought I would be very specific about what I wanted it to do. Here we go!
I have a dataset, I want you to write a python code that will perform a random forest on this dataset, perform a grid search on it and then run with the best hyperparameters, then output the results showing appropriate metrics, and a feature importance box plot and a SHAP waterfall plot. The outcome measures are "Survival 6 months post crit care", "ECOG PS: 1=<2; 0=>3" and "Oncology treatment, 0=no, 1=yes"
The code it produced was about 220 lines long and used a single mammoth function, not ideal from a context window perspective.
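For orientation, the core of what it was trying to do condenses to something like this (a heavily stripped-down sketch of the general pattern, not Claude's actual output; the filename is invented and only one of the three outcomes is shown):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_excel("synthetic_cancer_critical_care_data.xlsx")  # hypothetical filename

outcomes = ["Survival 6 months post crit care", "ECOG PS: 1=<2; 0=>3", "Oncology treatment, 0=no, 1=yes"]
target = "Survival 6 months post crit care"
X = pd.get_dummies(df.drop(columns=outcomes))
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Grid search over a small set of hyperparameters, then evaluate the best forest
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

print(search.best_params_)
print(classification_report(y_test, search.best_estimator_.predict(X_test)))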
At this point, I wonder: if I were to instruct it in the prompt to follow the SOLID principles of development, would it produce cleaner, better code?
Oh well, too far down the rabbit hole now... and I'm not rewriting this post.
The first compile failed with the error:
No module named 'seaborn'
Fair enough, this is a library I've not got installed in my machine learning environment.
It's a visualisation library. I pip install this and re-run.
The second attempt fails as well, this time because it's trying to read in a CSV file rather than the original data's Excel file. A quick prompt corrects this, and then it runs successfully.
The first part generates a beautiful confusion matrix and a feature plot, but it crashes out on the SHAP plot! Disaster.
We get an error to do with a type. Without delving into the code, I'm just going to throw this straight back to Claude and see what it does.
TypeError: only integer scalar arrays can be converted to a scalar index
Claude takes it in its stride. It makes four changes, including wrapping the whole process in a try/except block (which is a good idea, actually; if part of the program fails, it will soldier on through the rest).
I especially like this when running on my local machine, as it is rinsing my CPU every time it runs the random forest grid search.
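The pattern is worth copying: wrap just the fragile bit so a failure doesn't take the whole run down with it. A sketch (run_shap_analysis is a stand-in name for whatever your SHAP code does, not a real function from the script):
try:
    run_shap_analysis(model, X_test)  # the fragile bit
except Exception as err:
    # Log the problem and carry on with the metrics and feature importance plots
    print(f"SHAP analysis encountered an error: {err}. Skipping SHAP visualization and continuing...")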
It runs successfully, we are getting some beautiful diagrams out.
The SHAP summary plot still doesn't work, sadly.
And then, BAM, we have what I think is a concurrency error...
UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
And as if by exhaustion, Claude taps out, unable to continue until 11 pm.
So I throw this hot potato over to ChatGPT to see what it thinks.
My dearest ChatGPT,
I'm getting errors with my SHAP diagram and a concurrency error with this code - can you help?
ChatGPT handles it well, though it does some LLM-splaining (like mansplaining, but with more sirens and green ticks in the text). Condescending much?
It also produces some fixed code.
A point worth noting here: if the code being given back needs to be copied and pasted, and it uses nested try/except blocks with if/else conditions inside, especially in a language like Python where whitespace matters, you need to know how to code to a certain degree to make it work. You can't just paste it in; you've got to go through and align the code blocks correctly.
Even with a seasoned eye, doing that with someone else's code can take a bit of thought and concentration. The code may still run, but get it wrong and it won't work as intended, which is confusing as hell if you didn't write the code.
I'm starting to waver on this one; the LLM is going to need something special to pull this round out of the bag, arguably because it needs a DTC (a human coder) at this point.
But despite this, let's keep throwing the ball back. This is an important round!
SHAP analysis encountered an error: The waterfall plot can currently only plot a single explanation, but a matrix of explanations (shape (23, 2)) was passed! Skipping SHAP visualization and continuing with analysis...
Okay, it did it.
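For anyone hitting the same wall: the waterfall plot wants one explanation at a time, so you have to index down to a single patient and, for a binary classifier, a single class. Something like this (a sketch in the spirit of the earlier one, reusing a fitted forest called model and a test set called X_test; the exact indexing depends on your shap version):
import shap

explainer = shap.TreeExplainer(model)  # model = the fitted random forest
shap_values = explainer(X_test)        # an Explanation object, one row per patient

# For a binary classifier the values have shape (n_patients, n_features, n_classes),
# so pick one patient (row 0) and one class (the positive class, index 1)
shap.plots.waterfall(shap_values[0, :, 1])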
A few more rounds of back and forth and the SHAP plot came out beautifully. I went through the code, and it made sense. So, for the win, I put it into Claude after 11 pm: optimise this code for me, beastie!
Holy potatoes. It organised it into functions.
It added parallel processing and analysed multiple outcomes at the same time.
It optimised the Grid Search and improved the code across the board.
Honestly, this was better by a long way than what I had done. Yeah, I'm tapping out. The LLMs have won it.

Final Commentary
The LLMs have blown me away with what they can do.
Impressive doesn't cover it. It started off a bit ropey, but as the code got more complex and my prompts improved based on the outputs, working together with it just worked. Let's put it this way: I built in less than a day what had previously taken me over a week to do alone.
Reflection
Although the LLMs came out on top, I think this was a bit of a rigged fight.
I know how to navigate the development environment and how to interpret the errors I saw. I knew what to cut and paste and where. I knew that when it wanted to use the seaborn library, I could just install it into my environment.
The comorbidity index issue has stuck with me, though. I knew how to check the code, and that is one area that isn't lost to the AI yet, because when you're putting out stats that you might want to put in a paper, you've got to be able to check them.
That being said, I have no way of checking the mathematics that the random forest did. I trust the process. So, maybe that's where we are going to be in the future. You don't double-check the workings of a calculator.
I suppose the question is, is this a good thing?
One of the circular conversations I've had around this technology is that you're going to get a bunch of developers who just don't know how to code.
I'm sure that isn't true. Like any amazing tool, it depends on how you use it.
It's like having a teacher sitting next to you all the time while you're working on your problems. Instead of being left to struggle for too long, it can give you a nudge in the right direction if things are failing. And each time, you can learn from that nudge.
However, as I write this I realise maybe those days of sitting in purgatory trying to work out how to fix the code in front of me, that problem that sits in your head for days, until bam you're in the shower and the idea that comes to you to get over that hurdle works...
Perhaps that's where the learning was? Perhaps that's what you need to go through to learn.
Final Thoughts
What we've shown today is that you can prompt your way to a final program, probably most of the time. It will save you time. It will get you over difficult problems.
Obviously, this was a synthetic dataset. The real dataset and experiment, in which I used improvements suggested by the LLM to answer the questions, will be released shortly, and we'll attempt to get it published.
Can it replace coders? Not yet. You need to be au fait with the environment, you need to know how to troubleshoot, and you need to know how to copy and paste :D
But working together with LLMs, you might be able to turn everyone who knows how to code into a superstar, as long as they put the work in to learn from the LLMs' outputs.
And that's probably the take home message.
Much like a lot of AI solutions being peddled, they are not going to replace humans for a good long while, but they may make efficiency improvements.
Squeezing even more out of us as worker ants.
Until next time gang! :)
Sam
