Testing OpenAI's o1 reasoning model

Testing OpenAI's new o1-preview model and comparing it to gpt-4o.

13 September 2024 · azxyc

OpenAI recently launched a preview of their new family of models, named o1. OpenAI claims that these models are capable of complex reasoning by allowing the models more time to "think".

As a ChatGPT Plus user, I was able to try this model out for myself. This isn't going to be very scientific since there's a weekly rate limit of 30 messages for now, but here goes.

Simple Problems

I decided to find some common problems that LLMs often get wrong. A quick Google search turned up a paper on arXiv titled Easy Problems That LLMs Get Wrong (Williams & Huckle, 2024).

I selected four easy problems from this paper:

  1. Alan, Bob, Colin, Dave and Emily are standing in a circle. Alan is on Bob’s immediate left. Bob is on Colin’s immediate left. Colin is on Dave’s immediate left. Dave is on Emily’s immediate left. Who is on Alan’s immediate right?
  2. Which weighs more, a pound of water, two pounds of bricks, a pound of feathers, or three pounds of air?
  3. Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you “Do you want to pick door No. 2 instead?” Is it to your advantage to switch your choice?
  4. You have six horses and want to race them to see which is fastest. What is the best way to do this?

I tested each of these problems with both gpt-4o and o1-preview.

For each of these, the human response is from the cited paper.

For o1's responses, I'm going to include the chain of thought in italics.


Circle Problem

Alan, Bob, Colin, Dave and Emily are standing in a circle. Alan is on Bob’s immediate left. Bob is on Colin’s immediate left. Colin is on Dave’s immediate left. Dave is on Emily’s immediate left. Who is on Alan’s immediate right?

Human response (correct answer)

Bob is on Alan’s immediate right because it is stated that Alan is on Bob’s immediate left.


gpt-4o response (incorrect)

Based on the information:

  1. Alan is on Bob’s immediate left.
  2. Bob is on Colin’s immediate left.
  3. Colin is on Dave’s immediate left.
  4. Dave is on Emily’s immediate left.

This forms a circle. To find out who is on Alan’s immediate right, let's arrange them:

  • Alan → Bob → Colin → Dave → Emily → (back to Alan).

From this arrangement, it’s clear that Emily is on Alan’s immediate right.

o1-preview response (correct)

Piecing together circular order

I’m working through 5 people standing in a circle, ensuring Alan is on Bob's immediate left, Bob on Colin's left, Colin on Dave's left, and Dave on Emily's left.

Identifying positions

I’m thinking about defining the circle so Alan is on Bob's immediate left, using specific positions and a consistent direction to create a clear reference point.

Mapping the circle

I'm charting the positions of people around a circular table, ensuring "immediate left" is clearly defined. Clockwise movement will set their arrangement.

Mapping relationships

I’m working through the circle's arrangement and whether "Person A is on Person B's immediate left" refers to counterclockwise or clockwise ordering. Let's think it through together.

Revisiting the sequence

I’m analyzing the instruction "Person A is on Person B's immediate left" to determine its meaning. Alternatively, I’m considering a linearized sequence of positions for arranging five people in a circle, starting with Alan, Bob, Colin, Dave, and Emily.

Placing the characters

I’m visualizing the circle as positions 1 to 5, with Bob at position 1. Alan follows Bob, moving clockwise, and Colin is before Bob in this sequence.

Analyzing seating dynamics

I’ve placed Colin in position 5, Bob in position 1, and Alan in position 2. Now, Colin's position clarifies Dave's spot to be 4, and Emily is at 3.

Listing and verifying

I’m piecing together the circle arrangement: Bob, Alan, Emily, Dave, Colin. Checking Alan's position confirms Alan is right after Bob, aligning with the clockwise assumption.

Position verification

OK, let me see. Bob is on Colin's left, Colin on Dave's left, and Dave on Emily's left, all in a clockwise sequence, confirming the arrangement is set.

Figuring out positions

OK, let me see. I’m figuring out who’s next to Alan. Moving clockwise, it’s Bob after Alan and Emily before Alan. Progressing counter-clockwise, Emily comes after Alan, then Bob.

Visualizing placements

I’m charting the seating by mapping Bob at the top, Alan next clockwise, Emily next, then Dave, followed by Colin. This setup helps clarify Alan's position relative to Bob and Emily.

To solve this problem, we'll arrange the five people in a circle based on the given information:

  1. Alan is on Bob’s immediate left.
  2. Bob is on Colin’s immediate left.
  3. Colin is on Dave’s immediate left.
  4. Dave is on Emily’s immediate left.

First, let's clarify the direction. In circular arrangements, "Person A is on Person B's immediate left" typically means that when moving clockwise, you encounter A immediately after B. This is a standard convention in such problems.

Let's construct the circle step by step:

  • Start with Bob at Position 1.
  • Alan is on Bob’s immediate left, so Alan is at Position 2 (immediately after Bob when moving clockwise).
  • Bob is on Colin’s immediate left, meaning Bob is immediately after Colin when moving clockwise. Therefore, Colin is at Position 5.
  • Colin is on Dave’s immediate left, so Dave is at Position 4.
  • Dave is on Emily’s immediate left, placing Emily at Position 3.

The positions around the circle, moving clockwise, are:

  1. Bob
  2. Alan
  3. Emily
  4. Dave
  5. Colin

Now, considering Alan's position:

  • To Alan's immediate left (moving clockwise) is Emily.
  • To Alan's immediate right (moving counter-clockwise) is Bob.

Therefore, the person on Alan's immediate right is Bob.

Comments

o1 gets the answer to this problem correct, but the solution is much easier than its thought process would suggest. Only "Alan is on Bob’s immediate left" was relevant.
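The claim that only the first clue matters can be checked by brute force. Below is a minimal sketch in Python, assuming everyone in the circle faces the same way so that "left" is one consistent direction around the circle. The names `neighbours`, `right_of_alan` and `left_step` are made up for this illustration and do not come from the paper or from either model.

```python
from itertools import permutations

PEOPLE = ["Alan", "Bob", "Colin", "Dave", "Emily"]
# Each pair (a, b) encodes the clue "a is on b's immediate left".
CLUES = [("Alan", "Bob"), ("Bob", "Colin"), ("Colin", "Dave"), ("Dave", "Emily")]


def neighbours(order, person, left_step):
    """Return the (left, right) neighbours of person in a circular order.

    left_step is +1 or -1 and picks which way around the circle counts
    as "left" (an assumption: the puzzle does not say which way people face).
    """
    i = order.index(person)
    n = len(order)
    return order[(i + left_step) % n], order[(i - left_step) % n]


def right_of_alan(clues):
    """Collect everyone who can stand on Alan's right without breaking a clue."""
    answers = set()
    for left_step in (1, -1):
        for rest in permutations(PEOPLE[1:]):
            order = ("Alan",) + rest  # pin Alan's seat so rotations are not repeated
            if all(neighbours(order, b, left_step)[0] == a for a, b in clues):
                answers.add(neighbours(order, "Alan", left_step)[1])
    return answers


print(right_of_alan(CLUES))      # {'Bob'}
print(right_of_alan(CLUES[:1]))  # {'Bob'}
```

The second call keeps only the first clue and still leaves Bob as the only possibility, whichever direction is treated as "left".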


Weight Problem

Which weighs more, a pound of water, two pounds of bricks, a pound of feathers, or three pounds of air?

Human response (correct answer)

Three pounds of air.


gpt-4o response (half correct?)

Two pounds of bricks weigh the most, as it is the heaviest among the options listed. Here's the breakdown:

  • A pound of water weighs 1 pound.
  • Two pounds of bricks weigh 2 pounds.
  • A pound of feathers weighs 1 pound.
  • Three pounds of air weigh 3 pounds.

Thus, three pounds of air weigh the most.

o1-preview response (correct)

Weighing options

OK, let me see. I'm thinking about various weights: a pound of water, two pounds of bricks, a pound of feathers, and three pounds of air. Each option has a unique weight, compelling me to compare their relative masses.

Weighing and buoyancy

Interestingly enough, comparing weights in air involves considering buoyancy. Generally, water and bricks have more weight than feathers, while air's negligible weight per volume might seem insignificant.

Three pounds of air weigh the most among the options provided.

Here's why:

  • A pound of water weighs 1 pound.
  • Two pounds of bricks weigh 2 pounds.
  • A pound of feathers weighs 1 pound.
  • Three pounds of air weigh 3 pounds.

Despite air being less dense than the other materials, the question specifies the weights directly. Therefore, the item with the greatest specified weight is three pounds of air, making it the heaviest among the options listed.

Answer: Three pounds of air weigh the most—it is the heaviest among the given options.

Comments

gpt-4o seemed to confidently give a wrong answer straight away, then listed the weights and got the correct answer. I'm going to treat this as half correct. o1 "thinks" about the problem before providing an answer, which seems to help here.


Doors Problem

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you “Do you want to pick door No. 2 instead?” Is it to your advantage to switch your choice?

Human response (correct)

It is not an advantage to switch. It makes no difference if I switch or not because no additional material information has been provided since the initial choice.


gpt-4o response (incorrect)

Yes, it is to your advantage to switch your choice. This problem is a classic example of the Monty Hall problem, a well-known probability puzzle.

Here’s how it works:

  • Initially, you pick one door out of three. There’s a 1/3 chance the gold bar is behind the door you picked, and a 2/3 chance it’s behind one of the other two doors.
  • The host, who knows what’s behind each door, opens one of the two doors you didn't pick, revealing a door with rotten vegetables (since there are two doors with rotten vegetables, the host can always do this).
  • Now, you're given the option to switch to the other unopened door.

At this point, it may seem like there's a 50/50 chance between the two remaining doors, but that’s not true. The key insight is that by switching, you effectively pick the two doors you didn’t choose initially (because the host has eliminated one of them). So, your chances of winning if you switch are 2/3, while if you stick with your original choice, your chances remain 1/3.

Thus, it's always to your advantage to switch, as your chances of winning the gold bar increase from 1/3 to 2/3.

o1-preview response (correct)

Evaluating door choices

OK, let me break this down: you're weighing the strategic advantage of switching to door No.2 after choosing door No.1, with no additional information provided by the host.

Assessing the host's actions

I'm curious about the host's behavior in the Monty Hall problem. The host knows the prize's location and only offers a switch to door No.2, not No.3.

Breaking down probabilities

I’m analyzing the host's actions and their influence on the game. Considering if the host aids, misleads, or offers consistent switch regardless of prize location.

Evaluating the impact

I'm thinking through the host's offer of switching to door No.2. Interestingly, the chances of winning by staying or switching are the same, suggesting the host's offer doesn't change the overall probability.

Weighing different scenarios

I’m gathering info on whether the host's door-switching offer is forced or contingent on the prize being behind door No.2. This makes me think about various scenarios to pinpoint the potential advantages.

Considering the host's strategy

I'm thinking through the host actively guiding to door No.2 if you pick door No.1, aiming possibly to trick into choosing a losing option like door No.3.

Examining probabilities

I'm looking at the idea that switching to door No.2 gives a 1/3 chance of winning, same as staying with initial choice. The host's offer to switch after choosing door No.1 is constant.

In this scenario, you are faced with three doors: one hides a gold bar, and the other two conceal rotten vegetables. You initially choose Door No. 1. The host then asks if you'd like to switch your choice to Door No. 2. The question is whether switching improves your chances of winning the gold bar.

Analysis:

  1. Equal Probabilities: Initially, each door has a 1/3 chance of hiding the gold bar. By picking Door No. 1, you have a 1/3 chance of winning if you stay with your original choice.

  2. Host's Offer: The host's action is crucial in determining whether switching is advantageous. In the classic Monty Hall problem, the host reveals information by opening a door that doesn't contain the prize, and this affects the probabilities. However, in this case, the host doesn't open any doors; he merely offers you the option to switch to Door No. 2.

  3. No Additional Information: Since the host hasn't provided any new information about what's behind the doors, your probabilities remain the same. The chance that the gold bar is behind Door No. 1 is still 1/3, and the chance it's behind Door No. 2 is also 1/3.

  4. Symmetry Among Doors: There's symmetry in the problem—no door has been revealed or eliminated. The host's offer doesn't favor any door or provide hints about the location of the prize.

Conclusion:

Switching your choice does not improve your chances of winning. Both staying with Door No. 1 and switching to Door No. 2 offer the same probability of 1/3. Therefore, there is no advantage to switching in this specific scenario.

Answer: No; switching doesn’t improve your odds—it makes no difference—so there’s no advantage to switching.

Comments

gpt-4o hallucinates the host opening one of the doors, which never happened, leading it to the wrong answer. o1 correctly recognises that the host doesn't reveal any further information and thus all doors still offer the same probability.
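The difference between the two readings can be made concrete by enumerating where the prize might be. This is a small sketch in Python, assuming the prize is equally likely to be behind each door and that, in the variant from this post, the host's offer does not depend on where the prize is. The function names are hypothetical and chosen only for this example.

```python
from fractions import Fraction

DOORS = (1, 2, 3)
THIRD = Fraction(1, 3)  # assumed: the prize is equally likely behind each door


def no_reveal():
    """Variant from the post: no door is opened, the host just offers door 2.

    Returns (P(win if you stay on door 1), P(win if you switch to door 2)).
    """
    stay = sum(THIRD for prize in DOORS if prize == 1)
    switch = sum(THIRD for prize in DOORS if prize == 2)
    return stay, switch


def classic_monty_hall():
    """Classic version: the host opens a losing door you did not pick.

    You picked door 1. If the prize is elsewhere, the host must open the
    other losing door, so switching lands on the prize.
    """
    stay = sum(THIRD for prize in DOORS if prize == 1)
    switch = sum(THIRD for prize in DOORS if prize != 1)
    return stay, switch


print(no_reveal())           # (Fraction(1, 3), Fraction(1, 3))
print(classic_monty_hall())  # (Fraction(1, 3), Fraction(2, 3))
```

Without a reveal, staying and switching both win one time in three. The 2/3 figure gpt-4o quoted only appears once the host is forced to open a losing door.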


Horse Race Problem

You have six horses and want to race them to see which is fastest. What is the best way to do this?

Human response (correct)

Race them on a single race track with at least six lanes - the order in which they cross the finish line determines which is the fastest.


gpt-4o response (incorrect)

To determine the fastest horse among six horses with the minimum number of races, here's a simple and effective method:

Race Strategy:

  1. Divide the horses into two groups of three.

    • Let’s label the horses: Horse A, B, C in Group 1, and Horse D, E, F in Group 2.
  2. Race Group 1 (Horses A, B, and C) against each other and Group 2 (Horses D, E, and F) against each other.

    • Record the results for both groups.
    • After these two races, you'll know the fastest horse in each group. Let's say Horse A is the fastest in Group 1 and Horse D is the fastest in Group 2.
  3. Race the fastest horses from both groups (A vs. D) in the final race.

    • The winner of this race is the fastest horse overall.

Total number of races: 3 races

  • First race: Group 1 (A, B, C)
  • Second race: Group 2 (D, E, F)
  • Third race: Final (Winner of Group 1 vs. Winner of Group 2)

By racing this way, you only need three races to determine which horse is the fastest of all six horses.

o1-preview response (correct)

Breaking down the question

I'm thinking about how to determine the fastest horse among six. The simplest method involves a single race, but there might be constraints to consider.

Determining optimal racing

I'm figuring out the top three horses among six by running them together in one race, assuming no track constraints.

Laying out options

To determine the fastest horse among six, one could race them simultaneously under equal conditions, or split them into two groups of three and race the winners separately.

Crafting a race strategy

I’m thinking about organizing preliminary heats before a final race, as racing all six horses together might be impractical.

The most straightforward and effective way to determine the fastest horse among six is to race all six horses together in a single race. This method ensures that all horses run under the same conditions—same track, weather, and competition—allowing for a direct comparison of their performances.

Why race them all together?

  • Consistency: All horses face the same track conditions and competition simultaneously.
  • Efficiency: It requires only one race to determine the fastest horse.
  • Accuracy: Direct competition minimizes variables that could affect the outcome.

Alternative Approach (if limitations exist):

If you cannot race all six horses together due to track limitations (e.g., the track can only accommodate a certain number of horses), you can use a two-step process:

  1. Preliminary Heats:

    • Divide the six horses into two groups of three.
    • Race each group separately.
  2. Final Race:

    • Take the winners (and possibly the second-place finishers) from each preliminary race.
    • Race them against each other to determine the fastest overall.

Considerations:

  • Multiple Races May Affect Performance: Horses may become tired or stressed from multiple races, potentially affecting their speed.
  • External Variables: Weather changes between races could impact track conditions and horse performance.

Conclusion:

For the most accurate and fair assessment of which horse is the fastest, racing all six together is the best method. It saves time, resources, and provides a clear winner under identical conditions.

Comments

gpt-4o greatly overcomplicates this situation. o1 considers a multi-step process as well but ultimately decides that a single six-horse race is the most accurate method, which is correct.


Conclusions

In this test, gpt-4o scored 0.5/4. Not too good, but to be expected. o1-preview scored 4/4. From this limited test, it does seem that this model is overall much better at basic reasoning than gpt-4o, but it has a tendency to overthink simple problems.


More Complex Reasoning

I tried giving both gpt-4o and o1-preview the following logic puzzle, taken from MindYourDecisions.com (Talwalkar, 2016):

Which of the statements in this list of 10 are false? The line n states there are exactly n false statements in the list.

#1. Exactly 1 of the statements in this list is false.
#2. Exactly 2 of the statements in this list are false.
#3. Exactly 3 of the statements in this list are false.
#4. Exactly 4 of the statements in this list are false.
#5. Exactly 5 of the statements in this list are false.
#6. Exactly 6 of the statements in this list are false.
#7. Exactly 7 of the statements in this list are false.
#8. Exactly 8 of the statements in this list are false.
#9. Exactly 9 of the statements in this list are false.
#10. Exactly 10 of the statements in this list are false.

This problem requires more complex multi-step reasoning than the problems I've given it so far.

The correct answer is that statement #9 is true and all other statements are false, since:

  • #10 cannot be true since this implies all statements are false, including #10. For the same reason, the statements cannot all be false, as that would make #10 true.
  • Two statements cannot be true since the statements are mutually exclusive. Therefore, at most 1 statement is true.
  • If one statement is true, then 9 statements are false.
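The same conclusion can be reached by checking all 2^10 = 1024 truth assignments. This is a minimal brute-force sketch in Python; the function name `consistent_assignments` is made up for this illustration.

```python
from itertools import product


def consistent_assignments(n=10):
    """Yield truth assignments in which every statement agrees with reality.

    Statement k claims "exactly k statements are false", so it must be
    true exactly when the number of False entries equals k.
    """
    for truth in product([True, False], repeat=n):
        false_count = truth.count(False)
        if all(truth[k - 1] == (false_count == k) for k in range(1, n + 1)):
            yield truth


solutions = list(consistent_assignments())
# For each consistent assignment, list the numbers of the true statements.
true_statements = [[k for k, t in enumerate(s, start=1) if t] for s in solutions]
print(true_statements)  # [[9]]
```

Exactly one assignment survives, and in it only statement #9 is true.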

Let's see how the two models tackle this problem.


gpt-4o response (incorrect)

This is a classic logic puzzle where we need to determine which statement(s) are false while keeping in mind that each statement refers to the number of false statements in the entire list.

Assumption:

Each statement $S_n$ claims that exactly $n$ statements in the list are false.

Case by Case Analysis:

  • If statement $S_n$ is true, then exactly $n$ statements must be false in the list.
  • If statement $S_n$ is false, then the number of false statements is not $n$.

Let’s examine each scenario:

Case 1: Assume $S_1$ is true (exactly 1 statement is false)

  • If exactly 1 statement is false, then all the other statements (from 2 to 10) would be true. However, if any of these statements were true, they would contradict the assumption that only one statement is false. Therefore, $S_1$ cannot be true.

Case 2: Assume $S_2$ is true (exactly 2 statements are false)

  • If exactly 2 statements are false, then two of the statements (including some other one) must be false. However, each of the remaining statements from 3 to 10 would claim that a different number of statements are false, which would create contradictions. Thus, $S_2$ cannot be true.

Case 3: Assume $S_3$ is true (exactly 3 statements are false)

  • If exactly 3 statements are false, then three of the statements in the list must be false. We need to check which other two statements (in addition to $S_3$) could be false:
    • $S_4$ would be false because it claims there are 4 false statements.
    • $S_5$ would be false for the same reason.
    • Therefore, if $S_3$ is true, the false statements are $S_4$, $S_5$, and one more. But since this scenario holds, no other statements need to be false.

Thus, if $S_3$ is true, statements $S_4$, $S_5$, and one more are false.

Conclusion:

  • $S_3$ is true.
  • $S_1$, $S_2$, $S_4$, and $S_5$ are false.
  • Statements $S_6$, $S_7$, $S_8$, $S_9$, and $S_{10}$ must also be false because they claim more than three false statements.

Final answer:

  • The false statements are: $S_1$, $S_2$, $S_4$, $S_5$, $S_6$, $S_7$, $S_8$, $S_9$, and $S_{10}$.
  • Only $S_3$ is true.

o1-preview response (correct)

Analyzing self-referential statements

I’m working through a logic puzzle, where each statement claims a specific number of them are false. The task is to figure out which statements are false by finding a consistent assignment of true and false values.

Assessing validity

OK, let me see if k statements can be true. I’ll analyze the truth of statements about the number of false statements. If the true statements match k, they are consistent.

Evaluating scenarios

I’m considering scenarios with one true statement. If statement #9 ("Exactly 9 statements are false") is true, it aligns with one true and nine false statements, which seems logical.

Evaluating the consistency

I’m figuring out if statement #9's claim about 9 false statements holds true by cross-checking the implications of other statements. The calculations align with the expected number of false statements, confirming its consistency.

Assessing truth conditions

OK, let's think it through: If 8 statements are false, the remaining 2 must be true. However, statements #2 and #8 are mutually exclusive, making it impossible to have exactly 8 false statements.

Assessing logical coherence

I’m evaluating scenarios where 5 statements are false. If statements #6 to #10 are false and statements #1 to #5 are true, I’m curious if this fits the condition of 5 false statements.

Identifying inconsistencies

Piecing together the puzzle, I find contradictions in the assumptions. Both assuming only statement #5 is true and assuming all statements are false lead to inconsistencies.

Mapping out contradictions

If all statements were true, it suggests no false statements, which is impossible. If all were false, it validates statement #10, also contradictory.

Verifying the consistency

I’m examining the possibility that having exactly 9 false statements holds up. Statements #1-8 and #10 are false, #9 is true; this scenario appears consistent.

Piecing together truth

I'm figuring out which statements align with each other to pinpoint the true ones. This involves checking the number of false statements and their implications.

Assessing truth and falsehood

Considering different scenarios to identify if exactly 9 statements are false and if statement #9 is the only true one. Exploring if statement #1's falsity impacts the count of false statements.

Evaluating statements

I’m piecing together the implications of assuming one false statement with others true. Focusing on statement #1, if it's false, it means no statements can be completely true. This contradiction suggests further analysis is needed.

Eliminating contradictions

I'm mapping out false statements and get a key hint: statement #10 is off. Then, proving statement #5 wrong shows one true statement. Ultimately, #9 is true, leaving statements #1-8 and #10 false.

To solve this logical puzzle, we need to determine which statements are true and which are false based on their content and the number of false statements they assert.

Let's consider the possibilities:

  1. Assuming exactly 9 statements are false:

    • This means only one statement is true.
    • The statement that could be true is "#9. Exactly 9 of the statements in this list are false."
    • If statement #9 is true, then statements #1-8 and #10 must be false.
    • This is consistent because there are indeed 9 false statements, as statement #9 asserts.
  2. Checking other possibilities:

    • Assuming exactly 1 statement is false: The false statement would have to be one that says only one statement is false, but then it would be true, causing a contradiction.
    • Assuming all statements are false: This is impossible because statement #10 would be true if all were false, leading to a contradiction.

Therefore, the only consistent solution is:

  • Statements #1-8 and #10 are false.
  • Statement #9 is true.

False Statements:

  1. #1. Exactly 1 of the statements in this list is false.
  2. #2. Exactly 2 of the statements in this list are false.
  3. #3. Exactly 3 of the statements in this list are false.
  4. #4. Exactly 4 of the statements in this list are false.
  5. #5. Exactly 5 of the statements in this list are false.
  6. #6. Exactly 6 of the statements in this list are false.
  7. #7. Exactly 7 of the statements in this list are false.
  8. #8. Exactly 8 of the statements in this list are false.
  9. #10. Exactly 10 of the statements in this list are false.

Answer: Statements #1 through #8 and #10 are false.

Comments

gpt-4o concluded that statement #3 is true, and all other statements are false. The model fails to see that 9 statements being false means statement #3 cannot be true. o1 correctly eliminates statement #10, and eventually concludes that #9 is true with all others being false, which is correct.


Conclusions

o1 does great at this problem, seemingly because of its ability to evaluate its own "thoughts" before giving an answer.

So far I'm quite impressed by o1's performance. It has managed to tackle every problem I've given it so far while gpt-4o only managed to get one question half-correct.


Coding with o1

The final test is a (relatively simple) coding problem. I gave both gpt-4o and o1-preview this prompt:

Create a dynamic sorting algorithm visualizer using React and TailwindCSS that supports Bubble Sort, Merge Sort, and Quick Sort. The user should be able to input the number of values to sort (e.g., 10-100) and choose the order of the input array (e.g., reverse, random). The visualizer should display each step of the sorting process visually, with controls for play, pause, and reset. Make the interface intuitive and responsive, using TailwindCSS for styling. Output a single file called sorting.tsx.

Below are the two components coded by gpt-4o and o1-preview respectively.

[Interactive demo: Sorting Algorithm Visualizer, by gpt-4o]

[Interactive demo: Sorting Algorithm Visualizer, by o1]

Conclusions

Both models went for a similar UI design and successfully implemented it. However, gpt-4o's attempt does not successfully sort the bars. It also included a speed state in the code but never implemented a way to change it. The reset button doesn't work as intended, and the user can start a new sort with the start button while an existing sort is still running. It could likely have fixed these problems with follow-up prompts, but I specifically wanted to test how good each model is at coding this in one attempt.

o1, on the other hand, successfully implemented bubble sort and quick sort and let the user change the speed with a slider. It did, however, fail at implementing merge sort completely. Nonetheless, it got much closer to a correct implementation than gpt-4o, which is not a bad result at all for a first try.
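For reference, here is one way a merge sort can produce frames for a step-by-step visualizer. It is only a sketch of the idea, written in Python rather than the TSX the prompt asked for. Neither model's code is shown in this post, so it is not a reconstruction of either, and the name `merge_sort_steps` and the snapshot-per-write approach are assumptions made for this example.

```python
def merge_sort_steps(values):
    """Merge sort a copy of values and record a snapshot after every write.

    Returns (sorted_list, snapshots). A visualizer could replay the
    snapshots one frame at a time.
    """
    arr = list(values)
    snapshots = [arr.copy()]  # frame 0: the unsorted input

    def sort(lo, hi):
        """Sort the slice arr[lo:hi] in place."""
        if hi - lo < 2:
            return
        mid = (lo + hi) // 2
        sort(lo, mid)
        sort(mid, hi)

        # Merge the two sorted halves into a temporary list.
        merged = []
        i, j = lo, mid
        while i < mid and j < hi:
            if arr[i] <= arr[j]:
                merged.append(arr[i])
                i += 1
            else:
                merged.append(arr[j])
                j += 1
        merged.extend(arr[i:mid])
        merged.extend(arr[j:hi])

        # Write back one value at a time so each write becomes a frame.
        # Mid-merge frames can briefly show a value twice.
        for offset, value in enumerate(merged):
            arr[lo + offset] = value
            snapshots.append(arr.copy())

    sort(0, len(arr))
    return arr, snapshots


final, frames = merge_sort_steps([5, 2, 4, 1, 3])
print(final)  # [1, 2, 3, 4, 5]
```

Recording whole-array snapshots keeps playback simple (play, pause and reset become moving an index through `frames`) at the cost of extra memory, which is negligible for the 10 to 100 values the prompt mentions.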

Overall Conclusions

There's no denying it: this model is much better at reasoning tasks than gpt-4o. While it does overthink certain problems, it almost always (in my limited, informal testing) gets them right. I can't wait to see what's next for this series of models (and for the 30 messages / week limit to be lifted).

References

Williams, S. and Huckle, J. (2024) Easy Problems That LLMs Get Wrong, arxiv.org. Available at: https://arxiv.org/html/2405.19616v1 (Accessed: September 2024).

Talwalkar, P. (2016) The Mind-Twisting Logical List Riddle – Sunday Puzzle, mindyourdecisions.com. Available at: https://mindyourdecisions.com/blog/2016/10/09/the-mind-twisting-logical-list-riddle-sunday-puzzle/ (Accessed: September 2024).