Computing Statistics with Kiva Data
Kiva is an international nonprofit, founded in 2005 and based in San Francisco, with a mission to connect people through lending to alleviate poverty. We celebrate and support people looking to create a better future for themselves, their families and their communities. By lending as little as $25 on Kiva, anyone can help a borrower start or grow a business, go to school, access clean energy or realize their potential. For some, it's a matter of survival; for others, it's the fuel for a life-long ambition. The following table contains some data that we will use to practice some basic descriptive statistics that are commonly used in data science.
id | loan_amount | country_name | status | time_to_raise | num_lenders_total
---|---|---|---|---|---
212763 | 1250.0 | Azerbaijan | funded | 193075.0 | 38
76281 | 500.0 | El Salvador | funded | 1157108.0 | 18
444097 | 1450.0 | Bolivia | funded | 1552939.0 | 51
402224 | 200.0 | Paraguay | funded | 244945.0 | 3
634949 | 700.0 | El Salvador | funded | 238797.0 | 21
1383386 | 100.0 | Philippines | funded | 1248909.0 | 1
351 | 250.0 | Philippines | funded | 773599.0 | 10
35651 | 225.0 | Nicaragua | funded | 116181.0 | 8
784253 | 1200.0 | Guatemala | funded | 2288095.0 | 42
1328839 | 150.0 | Philippines | funded | 51668.0 | 1
1094905 | 600.0 | Paraguay | funded | 26717.0 | 18
336986 | 300.0 | Philippines | funded | 48030.0 | 6
163170 | 700.0 | Bolivia | funded | 24078.0 | 28
1323915 | 125.0 | Philippines | funded | 71117.0 | 5
528261 | 650.0 | Philippines | funded | 580401.0 | 16
495978 | 175.0 | Madagascar | funded | 800427.0 | 7
1251510 | 1800.0 | Georgia | funded | 1156218.0 | 54
642684 | 1525.0 | Uganda | funded | 1166045.0 | 1
974324 | 575.0 | Kenya | funded | 2924705.0 | 18
7487 | 700.0 | Tajikistan | funded | 470622.0 | 22
957 | 1450.0 | Jordan | funded | 3046687.0 | 36
647494 | 400.0 | Kenya | funded | 260044.0 | 12
706941 | 200.0 | Philippines | funded | 445938.0 | 8
889708 | 1000.0 | Ecuador | funded | 201408.0 | 24
882568 | 350.0 | Kenya | funded | 2370450.0 | 8
There are some great (more advanced) tools in Python for working with massive tables of data. In fact this table is a random sample of a data set from Kiva that contains 1.4 million rows! We will move on to more and bigger data sets in time, but for now we need a simple way to work with this sample. To do that we will represent each column of the table as its own list.
To keep your coding easier and cleaner we will show you these lists here, but they will be automatically included for you in later activecodes. You can just use the list by name and it will be fine.
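Built directly from the table above, the column lists might look like the following sketch. The list names are an assumption: they simply reuse the column headers, and the id and status columns are left out because none of the questions below need them.

```python
# Column lists built from the sample table, one entry per loan (row order preserved).
# Assumption: each list is named after its column header.
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0, 1200.0,
               150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0, 1800.0, 1525.0,
               575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0, 350.0]
country_name = ['Azerbaijan', 'El Salvador', 'Bolivia', 'Paraguay', 'El Salvador',
                'Philippines', 'Philippines', 'Nicaragua', 'Guatemala', 'Philippines',
                'Paraguay', 'Philippines', 'Bolivia', 'Philippines', 'Philippines',
                'Madagascar', 'Georgia', 'Uganda', 'Kenya', 'Tajikistan', 'Jordan',
                'Kenya', 'Philippines', 'Ecuador', 'Kenya']
time_to_raise = [193075.0, 1157108.0, 1552939.0, 244945.0, 238797.0, 1248909.0,
                 773599.0, 116181.0, 2288095.0, 51668.0, 26717.0, 48030.0, 24078.0,
                 71117.0, 580401.0, 800427.0, 1156218.0, 1166045.0, 2924705.0,
                 470622.0, 3046687.0, 260044.0, 445938.0, 201408.0, 2370450.0]
num_lenders_total = [38, 18, 51, 3, 21, 1, 10, 8, 42, 1, 18, 6, 28, 5, 16, 7, 54,
                     1, 18, 22, 36, 12, 8, 24, 8]

# All lists have the same length, so index i always refers to the same loan.
print(len(loan_amount), len(country_name), len(time_to_raise), len(num_lenders_total))
print(country_name[0], loan_amount[0])
```

Because the lists are parallel, a position found in one list can be used to look up the matching value in any other.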
Level 1 Questions
What is the total amount of money loaned?
What is the average loan amount?
What is the largest/smallest loan?
What country got the largest/smallest loan?
What is the variance of the money loaned?
What is the average number of days needed to fund a loan?
The questions in the list above are the way you would probably think of them when brainstorming or having a discussion with a colleague. Answering them in code often requires more precision in the way the questions are posed. We will restate these questions below and make them more precise.
Compute the total amount of money loaned and store it in the variable loan_total
Compute the average amount of money loaned and store it in the variable loan_average
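One possible approach to the total and the average uses the accumulator pattern. This is a sketch, not the only solution; it assumes the loan amounts live in a list called loan_amount, repeated here so the snippet runs on its own.

```python
# Assumption: loan_amount holds the loan_amount column of the sample table.
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0, 1200.0,
               150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0, 1800.0, 1525.0,
               575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0, 350.0]

# Accumulator pattern: start at zero and add each loan to a running total.
loan_total = 0
for amount in loan_amount:
    loan_total = loan_total + amount

# The average is the total divided by the number of loans.
loan_average = loan_total / len(loan_amount)

print(loan_total, loan_average)
```

The built-in sum function would give the same total in one call; the explicit loop is shown because the same pattern is reused for the variance later on.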
Store the amount of the minimum loan in min_loan and the amount of the maximum loan in max_loan. Then, store the name of the country that received the largest loan in max_country and the name of the country that received the smallest loan in min_country. Hint: max and min are built-in Python functions that you can use to find the maximum value or minimum value in any sequence.
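Finding the country that goes with the largest or smallest loan relies on the lists being parallel. The sketch below is one way to do it, assuming the list names loan_amount and country_name; it uses the list method index, which returns the position of the first match.

```python
# Assumption: both lists mirror the columns of the sample table, in row order.
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0, 1200.0,
               150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0, 1800.0, 1525.0,
               575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0, 350.0]
country_name = ['Azerbaijan', 'El Salvador', 'Bolivia', 'Paraguay', 'El Salvador',
                'Philippines', 'Philippines', 'Nicaragua', 'Guatemala', 'Philippines',
                'Paraguay', 'Philippines', 'Bolivia', 'Philippines', 'Philippines',
                'Madagascar', 'Georgia', 'Uganda', 'Kenya', 'Tajikistan', 'Jordan',
                'Kenya', 'Philippines', 'Ecuador', 'Kenya']

min_loan = min(loan_amount)
max_loan = max(loan_amount)

# index() gives the position of the first occurrence of a value;
# the same position in country_name belongs to the same loan.
max_country = country_name[loan_amount.index(max_loan)]
min_country = country_name[loan_amount.index(min_loan)]

print(min_loan, min_country)
print(max_loan, max_country)
```

If several loans shared the extreme amount, index would only report the first of them, so a tie would need a loop over all positions instead.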
Compute the average number of lenders per loan and store it in a variable average_lenders
Compute the total number of loans made to the Philippines and store it in a variable philippines_count
For each unique country name, print a line that shows the name of the country and then the number of loans made in that country, like this: "Guatemala 1"
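The three counting tasks above can be sketched with nothing more than sum, len, and the list method count. The list names are assumed to match the column headers, and unique_countries is a helper name introduced only for this illustration.

```python
# Assumption: both lists mirror the columns of the sample table, in row order.
country_name = ['Azerbaijan', 'El Salvador', 'Bolivia', 'Paraguay', 'El Salvador',
                'Philippines', 'Philippines', 'Nicaragua', 'Guatemala', 'Philippines',
                'Paraguay', 'Philippines', 'Bolivia', 'Philippines', 'Philippines',
                'Madagascar', 'Georgia', 'Uganda', 'Kenya', 'Tajikistan', 'Jordan',
                'Kenya', 'Philippines', 'Ecuador', 'Kenya']
num_lenders_total = [38, 18, 51, 3, 21, 1, 10, 8, 42, 1, 18, 6, 28, 5, 16, 7, 54,
                     1, 18, 22, 36, 12, 8, 24, 8]

# Average number of lenders per loan.
average_lenders = sum(num_lenders_total) / len(num_lenders_total)

# count() returns how many times a value appears in a list.
philippines_count = country_name.count('Philippines')

# Collect each country once, in the order it first appears (hypothetical helper list).
unique_countries = []
for name in country_name:
    if name not in unique_countries:
        unique_countries.append(name)

# One line per country: name followed by its number of loans.
for name in unique_countries:
    print(name, country_name.count(name))
```

A dictionary that maps each country to a running count would avoid rescanning the list for every country, which matters once the data grows beyond a small sample.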
Level 2 Questions
What is the average amount of loans made to people in the Philippines?
In which country was the loan granted that took the longest to fund?
What is the average amount of time / dollar it takes to fund a loan?
What is the standard deviation of the money loaned? The Empirical Rule or 68-95-99.7% Rule reminds us that, for normally distributed data, about 68% of the population falls within 1 standard deviation of the mean. Does this hold for our data?
Is there a relationship between the loan amount and the number of lenders? Or the time to fund? How would we measure this? Covariance? Correlation?
The index positions for the Philippines are [5, 6, 9, 11, 13, 14, 22]. Use that information to compute the average loan amount for the Philippines. Store your result in the variable p_average
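Given those index positions, one way to average only the Philippines loans is to loop over the positions rather than over the whole list. This is a sketch; philippines_positions and p_total are names introduced here for illustration.

```python
# Assumption: loan_amount mirrors the loan_amount column of the sample table.
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0, 1200.0,
               150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0, 1800.0, 1525.0,
               575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0, 350.0]

# Index positions of the Philippines loans, as given in the text.
philippines_positions = [5, 6, 9, 11, 13, 14, 22]

# Add up only the loans at those positions.
p_total = 0
for i in philippines_positions:
    p_total = p_total + loan_amount[i]

# Divide by the number of Philippines loans, not by the length of the full list.
p_average = p_total / len(philippines_positions)

print(p_average)
```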
What is the name of the country with the loan that took the longest to raise? Store your result in the variable longest_to_fund
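The same look-up idea used for the largest loan works here: find the largest value in one list, then read the matching position in the other. A minimal sketch, assuming the list names time_to_raise and country_name:

```python
# Assumption: both lists mirror the columns of the sample table, in row order.
time_to_raise = [193075.0, 1157108.0, 1552939.0, 244945.0, 238797.0, 1248909.0,
                 773599.0, 116181.0, 2288095.0, 51668.0, 26717.0, 48030.0, 24078.0,
                 71117.0, 580401.0, 800427.0, 1156218.0, 1166045.0, 2924705.0,
                 470622.0, 3046687.0, 260044.0, 445938.0, 201408.0, 2370450.0]
country_name = ['Azerbaijan', 'El Salvador', 'Bolivia', 'Paraguay', 'El Salvador',
                'Philippines', 'Philippines', 'Nicaragua', 'Guatemala', 'Philippines',
                'Paraguay', 'Philippines', 'Bolivia', 'Philippines', 'Philippines',
                'Madagascar', 'Georgia', 'Uganda', 'Kenya', 'Tajikistan', 'Jordan',
                'Kenya', 'Philippines', 'Ecuador', 'Kenya']

# Position of the loan with the largest time_to_raise value.
longest_position = time_to_raise.index(max(time_to_raise))

# The country at the same position received that loan.
longest_to_fund = country_name[longest_position]

print(longest_to_fund)
```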
What is the arithmetic mean of the time / dollar it takes to fund a loan? The arithmetic mean is the average of the individual time/dollar calculations, not the average of the sum of time divided by the sum of dollar amounts. Store your result in the variable a_mean
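The distinction between the two kinds of average is easier to see side by side. The sketch below computes the mean of the per-loan ratios (what the exercise asks for) and, for contrast only, the ratio of the two sums. The name ratio_of_sums is introduced here for illustration.

```python
# Assumption: both lists mirror the columns of the sample table, in row order.
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0, 1200.0,
               150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0, 1800.0, 1525.0,
               575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0, 350.0]
time_to_raise = [193075.0, 1157108.0, 1552939.0, 244945.0, 238797.0, 1248909.0,
                 773599.0, 116181.0, 2288095.0, 51668.0, 26717.0, 48030.0, 24078.0,
                 71117.0, 580401.0, 800427.0, 1156218.0, 1166045.0, 2924705.0,
                 470622.0, 3046687.0, 260044.0, 445938.0, 201408.0, 2370450.0]

# Mean of the individual time/dollar values: one ratio per loan, then average.
ratio_total = 0
for i in range(len(loan_amount)):
    ratio_total = ratio_total + time_to_raise[i] / loan_amount[i]
a_mean = ratio_total / len(loan_amount)

# For contrast: total time divided by total dollars (NOT what the exercise asks for).
ratio_of_sums = sum(time_to_raise) / sum(loan_amount)

print(a_mean, ratio_of_sums)
```

The two numbers differ because the mean of ratios gives every loan equal weight, so a small loan that took a long time to fund pulls it up sharply, while the ratio of sums is dominated by the larger loans.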
For our final few exercises we are interested in exploring the distribution of the data as well as the relationships between two of our variables. To do this we need to introduce a few more statistical concepts including variance, standard deviation, covariance and correlation.
Variance looks at a single variable and measures how far a set of numbers is spread out from its average value. However, it's a bit hard to interpret because the units are squared, so it's not on the same scale as our original numbers. This is why most of the time we use the standard deviation, which is just the square root of the variance. A large standard deviation tells us that our data is quite spread out, while a small standard deviation tells us that most of our data is pretty close to the mean.
Don't let the fancy math get you down. The variance is just the sum of the squared differences between each value and the average, divided by the number of values. This is a little more complicated than what you have done before, but you can definitely do this.
Calculate the variance and the standard deviation of the loan_amount variable. Store the variance in loan_var and the standard deviation in loan_stdev.
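The description of the variance translates almost word for word into a loop. The sketch below divides by the number of values, as the text describes (the population variance), and then counts how many loans fall within one standard deviation of the mean to check the Empirical Rule question from the list above. The names within_one and share_within_one are introduced here for illustration.

```python
import math

# Assumption: loan_amount mirrors the loan_amount column of the sample table.
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0, 1200.0,
               150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0, 1800.0, 1525.0,
               575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0, 350.0]

mean = sum(loan_amount) / len(loan_amount)

# Sum of squared differences between each value and the mean.
squared_diffs = 0
for amount in loan_amount:
    squared_diffs = squared_diffs + (amount - mean) ** 2

# Population variance: divide by the number of values.
loan_var = squared_diffs / len(loan_amount)

# Standard deviation: square root of the variance, back in dollars.
loan_stdev = math.sqrt(loan_var)

# Empirical Rule check: share of loans within one standard deviation of the mean.
within_one = 0
for amount in loan_amount:
    if abs(amount - mean) <= loan_stdev:
        within_one = within_one + 1
share_within_one = within_one / len(loan_amount)

print(loan_var, loan_stdev, share_within_one)
```

Dividing by the number of values minus one instead gives the sample variance, which is the usual choice when the data is a sample from a larger population, as this table is. Python's statistics module offers both versions as pvariance and variance.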
In data science we are often most interested in two variables that seem to influence one another. That is, we can observe that as one variable grows a second grows with it, or as one variable grows another variable shrinks at a similar rate. We will look at two ways to explore the relationships between these variables.
Covariance measures the extent to which the larger values of one variable correspond to the larger values of a second variable, as well as the extent to which the smaller values of one variable correspond to the smaller values of the second. If the covariance is positive, the two variables grow together (positive correlation). If the covariance is negative, one variable grows while the other shrinks. The magnitude is hard to interpret because it depends on the scale of the variables. So most often the covariance is normalized so that the values fall between -1 and +1; this is the Pearson correlation coefficient. A value of -1 indicates a perfect negative linear correlation, a value of 0 indicates that the variables are not linearly correlated at all, and a value of +1 indicates a perfect positive linear correlation.
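To see why the size of a covariance is hard to read, the sketch below computes the covariance between loan amount and number of lenders twice: once with loans in dollars and once with the same loans expressed in cents. The covariance function is a hypothetical helper written for this illustration, using the population form (divide by the number of pairs).

```python
def covariance(xs, ys):
    """Population covariance: average product of paired deviations from the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    total = 0
    for x, y in zip(xs, ys):
        total = total + (x - mean_x) * (y - mean_y)
    return total / len(xs)

# Assumption: both lists mirror the columns of the sample table, in row order.
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0, 1200.0,
               150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0, 1800.0, 1525.0,
               575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0, 350.0]
num_lenders_total = [38, 18, 51, 3, 21, 1, 10, 8, 42, 1, 18, 6, 28, 5, 16, 7, 54,
                     1, 18, 22, 36, 12, 8, 24, 8]

cov_dollars = covariance(loan_amount, num_lenders_total)

# Same loans, different unit: the relationship is unchanged, the covariance is not.
loan_cents = [amount * 100 for amount in loan_amount]
cov_cents = covariance(loan_cents, num_lenders_total)

print(cov_dollars, cov_cents)
```

The sign stays positive in both cases, but the second value is 100 times the first purely because of the unit change. Normalizing by the two standard deviations removes that dependence.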
Historically, the Pearson correlation coefficient has been used in recommender systems to find groups of like-minded shoppers who can recommend products to each other. It was the basis of Amazon.com's recommender system from 1997 to 2000. I know this because I was part of the team that wrote that software :-)
Calculate the Pearson correlation between loan_amount and num_lenders_total, between time_to_raise and loan_amount, or between num_lenders_total and time_to_raise. If you divide up the class, you can compare values to see which pair has the strongest correlation.
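One way to compute the coefficient is to divide the covariance by the product of the two standard deviations. The pearson function below is a hypothetical helper written for this sketch, not a library call, and the result variable names are likewise illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

# Assumption: the lists mirror the columns of the sample table, in row order.
loan_amount = [1250.0, 500.0, 1450.0, 200.0, 700.0, 100.0, 250.0, 225.0, 1200.0,
               150.0, 600.0, 300.0, 700.0, 125.0, 650.0, 175.0, 1800.0, 1525.0,
               575.0, 700.0, 1450.0, 400.0, 200.0, 1000.0, 350.0]
time_to_raise = [193075.0, 1157108.0, 1552939.0, 244945.0, 238797.0, 1248909.0,
                 773599.0, 116181.0, 2288095.0, 51668.0, 26717.0, 48030.0, 24078.0,
                 71117.0, 580401.0, 800427.0, 1156218.0, 1166045.0, 2924705.0,
                 470622.0, 3046687.0, 260044.0, 445938.0, 201408.0, 2370450.0]
num_lenders_total = [38, 18, 51, 3, 21, 1, 10, 8, 42, 1, 18, 6, 28, 5, 16, 7, 54,
                     1, 18, 22, 36, 12, 8, 24, 8]

# The three pairs named in the exercise.
r_amount_lenders = pearson(loan_amount, num_lenders_total)
r_time_amount = pearson(time_to_raise, loan_amount)
r_lenders_time = pearson(num_lenders_total, time_to_raise)

print(r_amount_lenders, r_time_amount, r_lenders_time)
```

The pair whose coefficient has the largest absolute value shows the strongest linear relationship. With only 25 loans, any such comparison is suggestive rather than conclusive.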