Data Definition

Section 1.1 Data Definition

Subsection 1.1.1 Overview and Objectives

This section introduces the importance of data design in any program. Students often jump straight into coding—tackling a problem by writing loops, conditionals, or function calls—without stopping to clarify exactly what data they’ll process and how it’s structured. This leads to hidden confusion or brittle "quick fixes" that eventually fail when requirements change.

We follow the story of Audrey, a music lover who naively starts storing her favorite songs in code. Each time she adds a new feature, unplanned changes to her data break old parts of the program. By exploring her incremental fiasco, we’ll see why it’s crucial to explicitly design and document your data before (and during) coding—even if it seems like "extra work."

Objectives

Identify how undocumented data assumptions lead to maintenance problems and bugs
Write clear data definitions that specify structure and constraints
Document assumptions about valid values and relationships between fields
Recognize how typed languages can help enforce data definitions automatically

Subsection 1.1.2 Audrey’s Journey

Subsubsection 1.1.2.1 The Music App Dream

Meet Audrey, a CS student who loves music and programming. After learning Python last semester, she decides to build an app to track her growing music collection. "How hard could it be?" she thinks, opening her laptop one Friday evening. All she needs is to:

Store songs with ratings (1-5 stars)

Update ratings when she changes her mind

Find her favorite songs quickly

Little does she know, her "quick weekend project" is about to teach her why planning data structure is crucial...

Subsubsection 1.1.2.2 Hard-Coded Indices

Audrey decides each song should be a list [title, rating]. Having worked with lists before, she feels confident and jumps straight into coding:

Listing 1.1.1. Audrey’s Hard-Coded Indices

music_library = [
["Imagine", 5], # index 0
["Thriller", 4], # index 1
["Hey Jude", 5] # index 2
]

# Change "Thriller" rating from 4 to 3
music_library[1][1] = 3
print(music_library)

When run, Python prints [['Imagine', 5], ['Thriller', 3], ['Hey Jude', 5]]. This works for now, but only because Audrey remembers that:

"Thriller" is at position 1 in the outer list

The rating is always at position 1 in each inner list

These assumptions make the code fragile: any change to the order of the songs or to the structure of an entry could break it.
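
To see how little it takes to break these assumptions, here is a minimal sketch. It assumes a hypothetical new song, "Bohemian Rhapsody", gets inserted at the front of the list; the same index expression from Listing 1.1.1 then points at a different song.

```python
music_library = [
    ["Imagine", 5],
    ["Thriller", 4],
    ["Hey Jude", 5],
]

# Hypothetical change: a new favorite is added at the front of the list.
music_library.insert(0, ["Bohemian Rhapsody", 5])

# The same hard-coded indices as before, still meant to update "Thriller".
music_library[1][1] = 3

print(music_library[1])  # ['Imagine', 3]   <- the wrong song was changed
print(music_library[2])  # ['Thriller', 4]  <- the intended song is untouched
```

Python raises no error here, which is what makes this kind of bug easy to miss.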

Subsubsection 1.1.2.3 Incremental "Fixes"

Realizing that hard-coded indices like music_library[1][1] are risky, Audrey tries to make her code more flexible with helper functions:

Listing 1.1.2. Searching by Title Instead of Position

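def find_song_index(library, song_title):
    """Return the index of the song with this title, or None if absent."""
    for i, song in enumerate(library):
        if song[0] == song_title:  # Title is at index 0
            return i
    return None

def update_rating(library, song_title, new_rating):
    idx = find_song_index(library, song_title)
    if idx is None:
        raise ValueError(f"Song not found: {song_title}")
    library[idx][1] = new_rating  # Still assumes rating is at index 1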

music_library = [
    ["Imagine", 5],
    ["Thriller", 4],
    ["Hey Jude", 5]
]

update_rating(music_library, "Thriller", 1)
print(music_library)

This is better: we no longer depend on knowing which row contains "Thriller". But we’re still assuming that:

The title is always at index 0

The rating is always at index 1

These assumptions are buried in the code, making it hard to spot potential problems.

Subsubsection 1.1.2.4 Adding Play Counts

A week later, Audrey wants to track how often she listens to each song. She modifies each entry to include a play count:

Listing 1.1.3. Introducing a New Field

music_library = [
    ["Imagine", 5, 120],    # Now includes play count
    ["Thriller", 4, 230],
    ["Hey Jude", 5, 150]
]

This seemingly simple change creates several problems:

The update_rating function still uses index 1 for ratings, which still works... for now

If Audrey later reorders the fields to [title, play_count, rating], update_rating will silently overwrite the play count, because index 1 no longer holds the rating

New functions to update play counts might accidentally modify ratings if they guess wrong about indices

Key Lesson: Without a clear definition of the data structure, even small changes can introduce subtle bugs that might not be immediately apparent.
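
The reordering scenario from the list above can be sketched as follows. This is only an illustration: update_rating here is a condensed stand-in for the logic in Listing 1.1.2, and the reordered layout [title, play_count, rating] is a hypothetical change.

```python
def update_rating(library, song_title, new_rating):
    for song in library:
        if song[0] == song_title:   # assumes title is at index 0
            song[1] = new_rating    # assumes rating is at index 1
            return

# Hypothetical reordering: each song is now [title, play_count, rating].
music_library = [
    ["Imagine", 120, 5],
    ["Thriller", 230, 4],
    ["Hey Jude", 150, 5],
]

update_rating(music_library, "Thriller", 3)
print(music_library[1])  # ['Thriller', 3, 4]
```

The play count of 230 has been replaced by 3 and the rating is still 4, yet the program ran without any error.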

Checkpoint 1.1.4.

What could go wrong with Audrey’s latest approach? Consider:

What happens if someone reorders the fields?

How would you know what values are valid for each field?

Could you accidentally put text where a number should go?

Subsection 1.1.3 Designing Data: Why It Matters

Programs fundamentally do this: they take input data, transform or analyze it, and produce output data. Whether that data is a single integer or a complex structure with nested fields, the program’s success relies on everyone understanding exactly what the data is and how to interpret it.

If you never plan or document this structure, you’re effectively “designing” as you code—mixing two hard tasks at once. This often leads to:

Ambiguous fields: Is song[2] the rating or the play count? Could it become genre if we shift the order next week?

Misinterpreted units: Is weight in pounds or kilograms? Are dates in MM/DD/YYYY or DD/MM/YYYY format?

Silent breakage: Old code “thinks” a new field is something else, leading to bizarre outputs or corruption that can take hours to debug.
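
The date-format ambiguity is easy to reproduce. The sketch below uses Python's standard datetime module; the raw string is a made-up value with no documented format, parsed under both conventions.

```python
from datetime import datetime

raw_date = "03/04/2024"  # hypothetical value; its format was never written down

as_month_first = datetime.strptime(raw_date, "%m/%d/%Y")
as_day_first = datetime.strptime(raw_date, "%d/%m/%Y")

print(as_month_first.date().isoformat())  # 2024-03-04 (March 4)
print(as_day_first.date().isoformat())    # 2024-04-03 (April 3)
```

Both calls succeed, so nothing in the program signals which reading was intended. Only a written definition of the field can settle it.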

In earlier courses, you might have been protected from this problem because the instructor gave you a well-structured Point class, or a “student record” type, or a database schema. You simply followed it. But now, you’re the one deciding how to store data. Failing to define it clearly is a recipe for frustration.

Subsection 1.1.4 The First Step: Defining Your Data

A data definition is a concise statement of:

Which fields or attributes exist (e.g., title, rating, play_count).

How they’re typed or constrained (e.g., rating is an integer 1–5, play_count is nonnegative, etc.).

Any key assumptions (like “rating can’t exceed 5,” or “dates must be in MM/DD/YYYY format”).

For Audrey’s music library, we might define:

A Song is a Python list of the form [title, rating, play_count], where:

title is a string (e.g., "Imagine").

rating is an integer in [1..5].

play_count is a nonnegative integer (0 or greater).
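
In a Python file, this definition could live as a comment near the top, optionally paired with a small checking function. The sketch below is one possible way to write it; the helper name is_valid_song is an assumption for illustration, not something the section prescribes.

```python
# Data definition:
# A Song is a list [title, rating, play_count], where
#   title      is a str (e.g., "Imagine")
#   rating     is an int in 1..5
#   play_count is an int >= 0

def is_valid_song(song):
    """Return True if song matches the Song data definition above."""
    if not isinstance(song, list) or len(song) != 3:
        return False
    title, rating, play_count = song
    return (
        isinstance(title, str)
        and isinstance(rating, int) and 1 <= rating <= 5
        and isinstance(play_count, int) and play_count >= 0
    )

print(is_valid_song(["Imagine", 5, 120]))   # True
print(is_valid_song(["Imagine", 9, 120]))   # False: rating out of range
print(is_valid_song(["Imagine", 120]))      # False: a field is missing
```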

Why is this useful? Because if Audrey later decides “I need to store artist,” she’ll update the definition to: [title, rating, play_count, artist]. Every function that touches Song refers back to this “blueprint,” instead of guessing how many fields there are or which index is which.
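
One way to make code "refer back to the blueprint" is to give each position a name right next to the definition. This is a sketch under that assumption; the constant and function names (TITLE, RATING, PLAY_COUNT, ARTIST, get_rating, add_play) are hypothetical.

```python
# Song = [title, rating, play_count, artist]
TITLE = 0
RATING = 1
PLAY_COUNT = 2
ARTIST = 3

def get_rating(song):
    """Return the rating of a Song."""
    return song[RATING]

def add_play(song):
    """Record one more play of a Song."""
    song[PLAY_COUNT] += 1

song = ["Imagine", 5, 120, "John Lennon"]
add_play(song)
print(get_rating(song), song[PLAY_COUNT], song[ARTIST])  # 5 121 John Lennon
```

If the field order changes later, only the constants next to the definition need updating. The functions that use them stay the same.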

Assumptions & Constraints: Notice we said rating can’t exceed 5. In Python, you might or might not write code to check that. But if you do (say, with an if new_rating > 5 check), you can detect bad data early. In other languages, like Java, you could have a method that rejects any rating above 5 automatically. Either way, once you’ve written down “rating is 1–5,” it’s easier to test or enforce that assumption.
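
As a minimal sketch of such a check in Python, the hypothetical function set_rating below rejects anything outside the documented range before touching the data. It checks the lower bound as well, since the definition says 1 to 5.

```python
RATING = 1  # index of rating in [title, rating, play_count]

def set_rating(song, new_rating):
    """Update a song's rating, rejecting values outside 1..5."""
    if not isinstance(new_rating, int) or new_rating < 1 or new_rating > 5:
        raise ValueError(f"rating must be an integer from 1 to 5, got {new_rating!r}")
    song[RATING] = new_rating

song = ["Thriller", 4, 230]
set_rating(song, 3)
print(song)  # ['Thriller', 3, 230]

try:
    set_rating(song, 10)
except ValueError as err:
    print("Rejected:", err)

print(song)  # ['Thriller', 3, 230]  (unchanged by the rejected update)
```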

Micro-Exercise: If you decide play_count can’t exceed 9999, how would you reflect that in your definition? Could you detect an invalid play_count in code and raise an error or log a warning?

Subsection 1.1.5 Separating Design from Implementation

One reason novices skip data definitions is they jump right into “let’s make a loop in Python” or “let’s code a quick function to do X.” But when you do that, you’re implicitly designing the data while you implement logic—a mental juggling act that can be error-prone.

Best Practice: Pause before coding. Ask: “What data do I have? Where do these fields come from? Do they have constraints? How might they evolve over time?” By addressing these questions early, you create a stable foundation that makes the actual coding smoother and less likely to break with new requirements.

This is the first step of a broader design recipe you’ll practice throughout this course. You’ll see that once data is well-defined, you can systematically define each function’s contract (what it takes in, what it returns), write tests, and iterate your design.

Subsection 1.1.6 Looking Ahead: Java’s Built-in Protection

While Python lets us write data definitions as comments, some languages can actually enforce our rules automatically. Let’s peek at how Java helps prevent data-related mistakes. Don’t worry about the syntax—focus on what the code accomplishes.

Subsubsection 1.1.6.1 A Song in Java

Listing 1.1.5.

public class Song {
    // 'private' means these variables can only be changed
    // using the methods we provide below
    private String title;        // The song's name
    private int rating;          // Must be 1-5
    private int playCount;       // Must be >= 0
    
    // Method to safely change the rating
    public void setRating(int newRating) {
        if (newRating < 1 || newRating > 5) {
            throw new IllegalArgumentException("Rating must be 1-5");
        }
        rating = newRating;
    }
    
    // Method to safely change the play count
    public void setPlayCount(int newCount) {
        if (newCount < 0) {
            throw new IllegalArgumentException("Play count can't be negative");
        }
        playCount = newCount;
    }
    
    // Method to safely change the title
    public void setTitle(String newTitle) {
        if (newTitle == null || newTitle.isEmpty()) {
            throw new IllegalArgumentException("Title cannot be empty");
        }
        title = newTitle;
    }
}

This Java code enforces our data definition in several ways:

The variables are private, meaning they can only be changed using our special methods.

Each "setter" method checks that the new value is valid before making any changes.

The compiler ensures we can’t accidentally put a number where text should go (or vice versa).

Subsubsection 1.1.6.2 The Compiler as Your Assistant

Listing 1.1.6.

Song song = new Song();

// These won't compile - Java catches type mismatches
song.setRating("five");         // Error: rating must be a number!
song.setPlayCount("lots");      // Error: play count must be a number!

// Each of these compiles on its own but throws IllegalArgumentException when run
song.setRating(10);             // Error: rating must be 1-5
song.setPlayCount(-5);          // Error: can't have negative plays
song.setTitle("");              // Error: title can't be empty

Again, don’t stress if this Java syntax feels overwhelming—you’ll learn it step by step. The key idea is that Java’s type system acts like a helpful assistant, catching many data-related mistakes before your program even runs.
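
For contrast, here is a short Python sketch of the same kind of mistake made with Audrey's list-based songs. Nothing stops the bad assignment. The problem only shows up later, and only if some code happens to use the value in a way that fails.

```python
song = ["Thriller", 4, 230]   # [title, rating, play_count]

# Python accepts this assignment, even though the data definition
# says rating must be an integer.
song[1] = "five"

print(song[1] * 2)  # fivefive  (no error, just a nonsensical result)

try:
    average = (song[1] + 5) / 2
except TypeError as err:
    print("Caught only at run time:", err)
```

In Java, the equivalent of the assignment itself would be rejected before the program could start.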

Subsubsection 1.1.6.3 Focus on the Concepts First

For now, focus on understanding why data definitions matter. The Java examples above are just a preview of how programming languages can help enforce good data design. Feel free to:

Ask your instructor to explain any Java code that interests you

Use AI tools like ChatGPT to explore Java concepts

Discuss the code with classmates—explaining things to others often helps both parties learn

We’ll cover Java thoroughly in other chapters. For now, just appreciate how its type system can turn our data definition "rules" into automatically-checked constraints.

Subsection 1.1.7 Reflection and Next Steps

Audrey’s incremental “fixes” all rested on unwritten assumptions about field order, so each one was a single reordered or inserted field away from breaking. By contrast, a data definition such as [title, rating, play_count] serves as a stable reference point: any code that reads or modifies a field can check the blueprint to avoid silent breakage.

Takeaways:

All programs revolve around data: if you don’t define it up front, you risk layering assumptions until they collapse under new features.

Constraints become testable: once you say “rating is ≤ 5,” you can systematically check that in Python or, in a typed language like Java, write a setter that refuses any rating above 5. Both approaches reduce guesswork.

A design recipe starts with data: it might feel “extra,” but separating data layout from coding logic frees you to adapt your program without everything breaking unexpectedly.

In the next section, we’ll see how to define each function’s “Contract” once we have a clear data definition. Just as data needs a specification, so do functions: we’ll specify what inputs they take, what outputs they give, and any invariants (like not accepting a play_count less than 0). By combining these steps (data definition, then function contracts, then testing), we build a robust, iterative path to maintainable code.

Subsection 1.1.8 Data Definition Exercises

Checkpoint 1.1.7. Hard-Coded Indices Pitfalls.

Audrey’s code once did library[0][1] to change a song’s rating. Which of the following is the best explanation of why this is risky?

A. Because Python always indexes from 1, making [0][1] out of range.
   Feedback: Python is 0-based, so that’s not the root problem.
B. Changing the data structure (like adding a new field) can break old assumptions about which index stores the rating.
   Feedback: Exactly. If a new field pushes rating to [2], code referencing [1] corrupts data.
C. It is never valid to access a nested list by two indices in Python.
   Feedback: Nested indexing is valid in Python; it’s the assumption about *which* index stores what that’s the problem.
D. Hard-coding indices only matters in typed languages like Java, not in Python.
   Feedback: Even in Python, undocumented “magic” indices can cause silent breakage if data changes.

Checkpoint 1.1.8. Defining Data Early.

Based on this section, why is it crucial to define data (e.g., [title, rating, play_count]) before writing code?

A. You cannot write loops in Python without a prior data definition.
   Feedback: You *can* write loops, but they may be fragile without a clear data structure.
B. It makes the program run 100x faster automatically.
   Feedback: Defining data is about clarity and maintainability, not about huge performance boosts by itself.
C. It only matters for projects with over 1,000 lines of code.
   Feedback: Even small projects benefit from well-defined data.
D. It prevents hidden assumptions about field positions, so adding new fields later does not silently break old logic.
   Feedback: Right. Explicit data definitions reduce the chance of “magic index” errors and ease future expansion.

Checkpoint 1.1.9. Data Definitions and Typed Languages.

True.
Data definitions can help clarify structure in *any* language, typed or not.
False.
Data definitions can help clarify structure in *any* language, typed or not.

Checkpoint 1.1.10. Parsons: Audrey’s Data Definition Journey.

Arrange these statements to reflect how Audrey’s approach evolved, leading her to realize a formal data definition was necessary.

Initially stores songs as [title, rating] without documenting any structure.

---
Realizes that adding new fields like play_count breaks the code's assumptions.

---
Recognizes that “index 0 is title, index 1 is rating” is too fragile.

---
Defines data explicitly (e.g., [title, rating, play_count]) to prevent “magic index” bugs.

Checkpoint 1.1.11. Create a Data Definition.

Suppose you’re designing a system for a library of books. Propose a simple data definition (in the style of this section) describing how to store each book. Specify:

The structure (e.g., [title, author, pages])

Any constraints (e.g., pages >= 1, or title non-empty)

Write your definition below.

Solution.

Sample Data Definition:

A Book is a list of the form [title, author, page_count] where:

title is a non-empty string (e.g. "Pride and Prejudice").

author is a non-empty string (e.g. "Jane Austen").

page_count is an integer >= 1.

You could refine constraints (e.g., max pages = 9999) or store extra fields like genre and publication_year. The crucial idea is documenting each field and its valid range or type.
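
If you want to go one step further, the sample definition can be turned into a check. The sketch below is one possible approach, not part of the required answer; the function name check_book and the page count 432 are made up for illustration.

```python
# A Book is a list [title, author, page_count], where
#   title      is a non-empty str
#   author     is a non-empty str
#   page_count is an int >= 1

def check_book(book):
    """Raise ValueError if book violates the Book data definition."""
    if len(book) != 3:
        raise ValueError("a Book needs exactly 3 fields")
    title, author, page_count = book
    if not isinstance(title, str) or title == "":
        raise ValueError("title must be a non-empty string")
    if not isinstance(author, str) or author == "":
        raise ValueError("author must be a non-empty string")
    if not isinstance(page_count, int) or page_count < 1:
        raise ValueError("page_count must be an integer >= 1")

# Illustrative values only; the page count is a placeholder.
check_book(["Pride and Prejudice", "Jane Austen", 432])  # passes silently

try:
    check_book(["Pride and Prejudice", "", 432])
except ValueError as err:
    print("Invalid book:", err)  # Invalid book: author must be a non-empty string
```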
