Section 1.2 Combining Bytes into Larger Structures
Now that we have the idea of a byte as a small collection of bits (usually eight) that can be used to store and transmit things like letters and punctuation marks, we can start to build up to bigger and better things. First, it is very easy to see that we can put bytes together into lists in order to make a "string" of letters, what is often referred to as a "character string." If we have a piece of text, like "this is a piece of text" we can use a collection of bytes to represent it like this:
011101000110100001101001011100110010000001101001011100110010 000001100001001000000111000001101001011001010110001101100101 001000000110111101100110001000000111010001100101011110000111 0100
Now nobody wants to look at that, let alone encode or decode it by hand, but fortunately, the computers and software we use these days take care of the conversion and storage automatically. For example, when we tell the open source data language "R" to store "this is a piece of text" for us like this:
myText <- "this is a piece of text"
...we can be certain that inside the computer there is a long list of zeroes and ones that represent the text that we just stored. By the way, in order to be able to get our piece of text back later on, we have made a kind of storage label for it (the word "myText" above). Anytime that we want to remember our piece of text or use it for something else, we can use the label "myText" to open up the chunk of computer memory where we have put that long list of binary digits that represent our text. The left-pointing arrow <- made up out of the less-than character ("<") and the dash character ("-") gives R the command to take what is on the right hand side (the quoted text) and put it into what is on the left hand side (the storage area we have labeled "myText"). Some people call this <- the Assignment Arrow, and it is used in some computer languages like R to make it clear to the human who writes or reads it which direction the information is flowing.
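If you are curious whether the long list of zeroes and ones really matches the text, here is a small sketch that asks R to show the bits. It stores the same piece of text under the label "myText" and then uses the base R functions charToRaw() and rawToBits(). The names "textBytes," "byteToBits," and "bitString" are just labels made up for this illustration.

```r
# Store the text under the label "myText" using the assignment arrow
myText <- "this is a piece of text"

# charToRaw() gives one byte per character for plain text like this
textBytes <- charToRaw(myText)

# rawToBits() lists the bits of a byte starting with the smallest one,
# so we reverse them to get the usual left-to-right order
byteToBits <- function(b) paste(rev(as.integer(rawToBits(b))), collapse = "")

# Convert every byte and glue the pieces into one long string
bitString <- paste(sapply(textBytes, byteToBits), collapse = "")

print(length(textBytes))
print(bitString)
```

The text has 23 characters (counting the spaces), so we should expect 23 bytes, or 184 bits in all.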
From the computer's standpoint, it is even simpler to store, remember, and manipulate numbers instead of text. Remember that an eight bit byte can hold 256 combinations, so just using that very small amount we could store the numbers from 0 to 255. (Of course, we could have also done 1 to 256, but much of the counting and numbering that goes on in computers starts with zero instead of one.) Really, though, 255 is not much to work with. We couldn't count the number of houses in most towns or the number of cars in a large parking garage unless we can count higher than 255. If we put together two bytes to make 16 bits we can count from zero up to 65,535, but that is still not enough for some of the really big numbers in the world today (for example, there are more than 200 million cars in the U.S. alone).
Most of the time, if we want to be flexible in representing an Integer (a number with no decimals), we use four bytes stuck together. Four bytes stuck together is a total of 32 bits, and that allows us to store an integer as high as 4,294,967,295.
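The arithmetic behind these limits is easy to check. The sketch below uses a helper called "maxUnsigned," a name invented here for illustration, to work out the largest number that fits in a given number of bytes when every bit is used for counting upward from zero. It also looks up the largest value of R's own integer type, which is smaller because R sets aside part of the 32 bits to handle negative numbers.

```r
# Largest value that fits in nBytes when counting up from zero:
# 8 bits per byte, so 2^(8 * nBytes) combinations, minus one for zero
maxUnsigned <- function(nBytes) 2^(8 * nBytes) - 1

oneByte   <- maxUnsigned(1)
twoBytes  <- maxUnsigned(2)
fourBytes <- maxUnsigned(4)

# Show the three limits with commas to make them easier to read
print(format(c(oneByte, twoBytes, fourBytes), big.mark = ","))

# R's built-in integer type is signed, so its upper limit is lower
rIntMax <- .Machine$integer.max
print(format(rIntMax, big.mark = ","))
```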
Things get slightly more complicated when we want to store a negative number or a number that has digits after the decimal point. If you are curious, try looking up "two's complement" for more information about how signed numbers are stored and "floating point" for information about how numbers with digits after the decimal point are stored. For our purposes in this book, the most important thing to remember is that text is stored differently than numbers, and among numbers, integers are stored differently than floating point. Later we will find that it is sometimes necessary to convert between these different representations, so it is always important to know how a given piece of data is represented.
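As a small preview of what converting between representations looks like, here is a sketch using R's built-in typeof(), as.integer(), as.numeric(), and as.character() functions. The labels "ageText," "ageInt," "ageFloat," and "backToText" are made up for this example.

```r
# The same age stored three different ways
ageText  <- "43"                 # text (note the quotation marks)
ageInt   <- as.integer(ageText)  # converted from text to an integer
ageFloat <- as.numeric("43.5")   # converted from text to floating point

# typeof() reports how R is storing each one
print(typeof(ageText))
print(typeof(ageInt))
print(typeof(ageFloat))

# Arithmetic works on the numbers but would fail on the text version
nextYear <- ageInt + 1L

# We can also go the other way, from a number back to text
backToText <- as.character(ageInt)
```

R calls its floating point type "double," so that is the word typeof() reports for ageFloat.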
So far we have mainly looked at how to store one thing at a time, like one number or one letter, but when we are solving problems with data we often need to store a group of related things together. The simplest place to start is with a list of things that are all stored in the same way. For example, we could have a list of integers, where each thing in the list is the age of a person in your family. The list might look like this: 43, 42, 12, 8, 5. The first two numbers are the ages of the parents and the last three numbers are the ages of the kids. Naturally, inside the computer each number is stored in binary, but fortunately we don't have to type them in that way or look at them that way. Because there are no decimal points, these are just plain integers and a 32 bit integer (4 bytes) is more than enough to store each one. This list contains items that are all the same "type" or "mode."
In R, a "Vector" is an ordered list of data elements of the same data type (e.g., all integers or all strings). We can create a vector in R very easily by listing the data elements, separated by commas and inside parentheses:
c(43, 42, 12, 8, 5)
The letter "c" in front of the opening parenthesis stands for concatenate, which means to join things together. We can also put in some of what we learned above to store our vector in a named location:
myFamAge <- c(43, 42, 12, 8, 5)
Once we learn to run this code, we will have created our first "data set." It is very small, for sure, only five items, but also very useful for illustrating several major concepts about data. Here's a recap:
- In the heart of the computer, all data are represented in binary. One binary digit, or bit, is the smallest chunk of data that we can send from one place to another.
- Although all data are at heart binary, computers and software help to represent data in more convenient forms for people to see. Three important representations are: "character" for representing text, "integer" for representing numbers with no digits after the decimal point, and "floating point" for numbers that may have digits after the decimal point. The numbers in our tiny data set just above are integers.
- Numbers and text can be collected into lists, which the open source program "R" calls vectors. A vector has a length, which is the number of items in it, and a "mode" which is the type of data stored in the vector. The vector we were just working on has a length of 5 and a mode of integer.
- In order to be able to remember where we stored a piece of data, most computer programs, including R, give us a way of labeling a chunk of computer memory. We chose to give the 5-item vector up above the name "myFamAge." Some people might refer to this named list as a "Variable," because the value of it varies, depending upon which member of the list you are examining.
- If we gather together one or more variables into a sensible group, we can refer to them together as a "data set." Usually, it doesn't make sense to refer to something with just one variable as a data set, so usually we need at least two variables. Technically, though, even our very simple "myFamAge" counts as a data set, albeit a very tiny one.
- Later in the book we will install and run the open source "R" data program and learn more about how to create data sets, summarize the information in those data sets, and perform some simple calculations and transformations on those data sets.
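For readers who want to peek ahead, the sketch below checks the length and mode of the "myFamAge" vector from the recap and looks at one member of the list. One detail worth knowing: when numbers are typed in without any special marking, R labels their mode as "numeric" and stores them as floating point behind the scenes, even when they have no digits after the decimal point. The as.integer() function asks R to store them as true integers. The labels "vecLength," "vecMode," "firstAge," "myFamAgeInt," and "intType" are made up for this example.

```r
# Our tiny data set: the ages of the five family members
myFamAge <- c(43, 42, 12, 8, 5)

vecLength <- length(myFamAge)  # how many items are in the vector
vecMode   <- mode(myFamAge)    # the general kind of data R sees
firstAge  <- myFamAge[1]       # R numbers the positions starting at 1

print(vecLength)
print(vecMode)
print(firstAge)

# Ask R to store the ages as true integers instead
myFamAgeInt <- as.integer(myFamAge)
intType <- typeof(myFamAgeInt)
print(intType)
```

Even though much counting inside computers starts at zero, R starts counting the positions in a vector at one, so myFamAge[1] is the first age in the list.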