10.5. Visualizing Data¶

In this section, we will learn how to visualize our work so far.

You can get some inspiration from https://python-graph-gallery.com/.

How can we show top combinations of two things?
Visualize the flow from department to department or from aisle to aisle.

To accomplish this, you will have to dig into some new packages that we have not used so far. But this is all part of the process.

We need to create a square adjacency matrix: Aisle to Aisle. We’ll use this to build our chord diagram and other graph-like visualizations.

Merge the order_product data frame with the aisle data frame so we have the aisle number for each product. (We can drop the aisle name to save memory.)
Iterate over each order.
Order the order by add_to_cart_order.
Increase the count in from aisle (row) to aisle (column); this is a directed graph.

aisle_mat = pd.DataFrame(0, index=range(1, 135), columns=range(1, 135))
flowdf = op.merge(products, on='product_id').merge(adf, on='aisle_id')

%%time

tco = flowdf.groupby('order_id')

for order in tco.groups.keys():
    contents = tco.get_group(order).sort_values('add_to_cart_order')
    rowit = contents.iterrows()
    start_aisle = next(rowit)[1]['aisle_id']

    for ix, row in rowit:
        #print(start_aisle, row['aisle_id'])
        try:
            aisle_mat.loc[start_aisle][row['aisle_id']] += 1
        except:
            print("bad index", start_aisle, row['aisle_id'],
                  type(start_aisle), type(row['aisle_id']))
        start_aisle = row['aisle_id']

CPU times: user 3h 4min 26s, sys: 2min 7s, total: 3h 6min 34s
Wall time: 3h 11min 18s

aisle_mat.to_csv('aisle_mat.csv')

for ix, row in contents.iterrows():
    print(row['product_id'], row['aisle_id'])

x = contents.iterrows()
next(x)[1]['aisle_id']

for i, j in x:
    print(j['product_id'])

sbn.heatmap(aisle_mat)

<matplotlib.axes._subplots.AxesSubplot at 0x22a687e48>

A heat map that represents the squere adjacency matrix. The colors range from black to red to white in accending order. Most of the map is black with some red and white.

Looks like a lot of small values. Let’s make a histogram of the whole thing and see.

for i in range(1,135):
    aisle_mat.loc[i][i] = 0

x = aisle_mat.values.flatten()
sall = aisle_mat.values.sum()
y = aisle_mat.applymap(lambda x: x/sall)
z = y.applymap(lambda x: x if x > 0.001 else np.nan)

sbn.heatmap(z)

<matplotlib.axes._subplots.AxesSubplot at 0x386ea27f0>

The same heat map as above but the smaller values (in black) are removed so that the larger values are more obvious.

aisle_mat = pd.read_csv('aisle_mat.csv',index_col='aid')
aisle_mat.head()

	1	2	3	4	5	6	7	8	9	10	...	125	126	127	128	129	130	131	132	133	134
aid
1	5151	177	621	362	83	74	56	152	336	8	...	54	20	30	401	388	205	344	8	19	12
2	216	2692	464	387	168	62	152	88	882	20	...	34	30	52	602	322	218	933	8	21	19
3	632	417	126287	1871	311	322	247	380	1455	47	...	1152	128	299	1509	1849	2106	1800	48	158	33
4	356	405	1844	20762	717	192	186	165	2519	49	...	255	122	268	1234	2176	1176	3130	32	110	27
5	90	169	266	681	2325	57	110	47	673	44	...	31	42	104	639	539	245	744	5	27	8

5 rows × 134 columns

aisle_mat['total'] = aisle_mat.apply(lambda x : x.sum(), axis=1)

aisle_mat.sort_values('total', ascending=False, inplace=True)
aisle_mat.head()

	1	2	3	4	5	6	7	8	9	10	...	126	127	128	129	130	131	132	133	134	total
aid
24	6545	6616	33754	12545	3020	2600	2829	2646	13162	443	...	709	1248	16043	13076	13474	16945	228	910	372	3324654
83	4473	8381	17158	11751	6362	2133	3818	1698	20901	615	...	693	981	17890	12322	10233	25437	219	676	294	3143603
123	4134	4197	13228	6850	2397	1204	1747	1198	9405	292	...	454	764	8630	7515	6414	11072	173	461	163	1600584
120	2354	2498	16219	5950	1080	1072	801	937	5231	99	...	306	668	5669	5561	5921	6478	106	307	78	1354392
21	1760	6626	8445	6728	2257	640	1038	700	10388	268	...	296	659	9307	6071	3891	10790	100	312	153	888985

5 rows × 135 columns

row_order = aisle_mat.index
row_order = row_order.tolist()
aisle_mat.index

Int64Index([ 24,  83, 123, 120,  21,  84, 115, 107,  91, 112,
            ...
            118, 134,  55, 109,  10,  44, 102,  82, 132, 113],
           dtype='int64', name='aid', length=134)

aisle_map = pd.merge(aisle_mat, adf, left_index=True, right_on='aisle_id')['aisle']
aisle_map.values.tolist()[:10]

['fresh fruits',
 'fresh vegetables',
 'packaged vegetables fruits',
 'yogurt',
 'packaged cheese',
 'milk',
 'water seltzer sparkling water',
 'chips pretzels',
 'soy lactosefree',
 'bread']

am = aisle_mat.values.tolist()[:20][:20]
for i in range(len(am)):
    am[i][i] = 0.0

pickle.dump(am,file=open('am.pkl', 'wb'))

depts = pd.read_csv('ecomm/departments.csv')
depts

	department_id	department
0	1	frozen
1	2	other
2	3	bakery
3	4	produce
4	5	alcohol
5	6	international
6	7	beverages
7	8	pets
8	9	dry goods pasta
9	10	bulk
10	11	personal care
11	12	meat seafood
12	13	pantry
13	14	breakfast
14	15	canned goods
15	16	dairy eggs
16	17	household
17	18	babies
18	19	snacks
19	20	deli
20	21	missing

Lesson Feedback

1. Comfort Zone
2. Learning Zone
3. Panic Zone

1. Very little time
2. A reasonable amount of time
3. More time than is reasonable

1. Don't seem worth learning
2. May be worth learning
3. Are definitely worth learning

1. Definitely within reach
2. Within reach if I try my hardest
3. Out of reach no matter how hard I try

You have attempted of activities on this page