123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573 |
- ---
- redirect_from:
- - "/chapters/08/4/joining-tables-by-columns"
- interact_link: content/chapters/08/4/Joining_Tables_by_Columns.ipynb
- kernel_name: python3
- has_widgets: false
- title: |-
- Joining Tables by Columns
- prev_page:
- url: /chapters/08/3/Cross-Classifying_by_More_than_One_Variable.html
- title: |-
- Cross-Classifying
- next_page:
- url: /chapters/08/5/Bike_Sharing_in_the_Bay_Area.html
- title: |-
- Bike Sharing in the Bay Area
- comment: "***PROGRAMMATICALLY GENERATED, DO NOT EDIT. SEE ORIGINAL FILES IN /content***"
- ---
- <div class="jb_cell tag_remove_input">
- <div class="cell border-box-sizing code_cell rendered">
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
- <div class="text_cell_render border-box-sizing rendered_html">
- <h3 id="Joining-Tables-by-Columns">Joining Tables by Columns<a class="anchor-link" href="#Joining-Tables-by-Columns"> </a></h3><p>Often, data about the same individuals is maintained in more than one table. For example, one university office might have data about each student's time to completion of degree, while another has data about the student's tuition and financial aid.</p>
- <p>To understand the students' experience, it may be helpful to put the two datasets together. If the data are in two tables, each with one row per student, then we would want to put the columns together, making sure to match the rows so that each student's information remains on a single row.</p>
- <p>Let us do this in the context of a simple example, and then use the method with a larger dataset.</p>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
- <div class="text_cell_render border-box-sizing rendered_html">
- <p>The table <code>cones</code> is one we have encountered earlier. Now suppose each flavor of ice cream comes with a rating that is in a separate table.</p>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing code_cell rendered">
- <div class="input">
- <div class="inner_cell">
- <div class="input_area">
- <div class=" highlight hl-ipython3"><pre><span></span><span class="n">cones</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
- <span class="s1">'Flavor'</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="s1">'strawberry'</span><span class="p">,</span> <span class="s1">'vanilla'</span><span class="p">,</span> <span class="s1">'chocolate'</span><span class="p">,</span> <span class="s1">'strawberry'</span><span class="p">,</span> <span class="s1">'chocolate'</span><span class="p">),</span>
- <span class="s1">'Price'</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="mf">3.55</span><span class="p">,</span> <span class="mf">4.75</span><span class="p">,</span> <span class="mf">6.55</span><span class="p">,</span> <span class="mf">5.25</span><span class="p">,</span> <span class="mf">5.75</span><span class="p">)</span>
- <span class="p">)</span>
- <span class="n">cones</span>
- </pre></div>
- </div>
- </div>
- </div>
- <div class="output_wrapper">
- <div class="output">
- <div class="jb_output_wrapper }}">
- <div class="output_area">
- <div class="output_html rendered_html output_subarea output_execute_result">
- <table border="1" class="dataframe">
- <thead>
- <tr>
- <th>Flavor</th> <th>Price</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td>strawberry</td> <td>3.55 </td>
- </tr>
- <tr>
- <td>vanilla </td> <td>4.75 </td>
- </tr>
- <tr>
- <td>chocolate </td> <td>6.55 </td>
- </tr>
- <tr>
- <td>strawberry</td> <td>5.25 </td>
- </tr>
- <tr>
- <td>chocolate </td> <td>5.75 </td>
- </tr>
- </tbody>
- </table>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing code_cell rendered">
- <div class="input">
- <div class="inner_cell">
- <div class="input_area">
- <div class=" highlight hl-ipython3"><pre><span></span><span class="n">ratings</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
- <span class="s1">'Kind'</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="s1">'strawberry'</span><span class="p">,</span> <span class="s1">'chocolate'</span><span class="p">,</span> <span class="s1">'vanilla'</span><span class="p">),</span>
- <span class="s1">'Stars'</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="mf">2.5</span><span class="p">,</span> <span class="mf">3.5</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
- <span class="p">)</span>
- <span class="n">ratings</span>
- </pre></div>
- </div>
- </div>
- </div>
- <div class="output_wrapper">
- <div class="output">
- <div class="jb_output_wrapper }}">
- <div class="output_area">
- <div class="output_html rendered_html output_subarea output_execute_result">
- <table border="1" class="dataframe">
- <thead>
- <tr>
- <th>Kind</th> <th>Stars</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td>strawberry</td> <td>2.5 </td>
- </tr>
- <tr>
- <td>chocolate </td> <td>3.5 </td>
- </tr>
- <tr>
- <td>vanilla </td> <td>4 </td>
- </tr>
- </tbody>
- </table>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
- <div class="text_cell_render border-box-sizing rendered_html">
- <p>Each of the tables has a column that contains ice cream flavors: <code>cones</code> has the column <code>Flavor</code>, and <code>ratings</code> has the column <code>Kind</code>. The entries in these columns can be used to link the two tables.</p>
- <p>The method <code>join</code> creates a new table in which each cone in the <code>cones</code> table is augmented with the Stars information in the <code>ratings</code> table. For each cone in <code>cones</code>, <code>join</code> finds a row in <code>ratings</code> whose <code>Kind</code> matches the cone's <code>Flavor</code>. We have to tell <code>join</code> to use those columns for matching.</p>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing code_cell rendered">
- <div class="input">
- <div class="inner_cell">
- <div class="input_area">
- <div class=" highlight hl-ipython3"><pre><span></span><span class="n">rated</span> <span class="o">=</span> <span class="n">cones</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s1">'Flavor'</span><span class="p">,</span> <span class="n">ratings</span><span class="p">,</span> <span class="s1">'Kind'</span><span class="p">)</span>
- <span class="n">rated</span>
- </pre></div>
- </div>
- </div>
- </div>
- <div class="output_wrapper">
- <div class="output">
- <div class="jb_output_wrapper }}">
- <div class="output_area">
- <div class="output_html rendered_html output_subarea output_execute_result">
- <table border="1" class="dataframe">
- <thead>
- <tr>
- <th>Flavor</th> <th>Price</th> <th>Stars</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td>chocolate </td> <td>6.55 </td> <td>3.5 </td>
- </tr>
- <tr>
- <td>chocolate </td> <td>5.75 </td> <td>3.5 </td>
- </tr>
- <tr>
- <td>strawberry</td> <td>3.55 </td> <td>2.5 </td>
- </tr>
- <tr>
- <td>strawberry</td> <td>5.25 </td> <td>2.5 </td>
- </tr>
- <tr>
- <td>vanilla </td> <td>4.75 </td> <td>4 </td>
- </tr>
- </tbody>
- </table>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
- <div class="text_cell_render border-box-sizing rendered_html">
- <p>Each cone now has not only its price but also the rating of its flavor.</p>
- <p>In general, a call to <code>join</code> that augments a table (say <code>table1</code>) with information from another table (say <code>table2</code>) looks like this:</p>
- <pre><code>table1.join(table1_column_for_joining, table2, table2_column_for_joining)
- </code></pre>
- <p>The new table <code>rated</code> allows us to work out the price per star, which you can think of as an informal measure of value. Low values are good – they mean that you are paying less for each rating star.</p>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing code_cell rendered">
- <div class="input">
- <div class="inner_cell">
- <div class="input_area">
- <div class=" highlight hl-ipython3"><pre><span></span><span class="n">rated</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">'$/Star'</span><span class="p">,</span> <span class="n">rated</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Price'</span><span class="p">)</span> <span class="o">/</span> <span class="n">rated</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Stars'</span><span class="p">))</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
- </pre></div>
- </div>
- </div>
- </div>
- <div class="output_wrapper">
- <div class="output">
- <div class="jb_output_wrapper }}">
- <div class="output_area">
- <div class="output_html rendered_html output_subarea output_execute_result">
- <table border="1" class="dataframe">
- <thead>
- <tr>
- <th>Flavor</th> <th>Price</th> <th>Stars</th> <th>$/Star</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td>vanilla </td> <td>4.75 </td> <td>4 </td> <td>1.1875 </td>
- </tr>
- <tr>
- <td>strawberry</td> <td>3.55 </td> <td>2.5 </td> <td>1.42 </td>
- </tr>
- <tr>
- <td>chocolate </td> <td>5.75 </td> <td>3.5 </td> <td>1.64286</td>
- </tr>
- <tr>
- <td>chocolate </td> <td>6.55 </td> <td>3.5 </td> <td>1.87143</td>
- </tr>
- <tr>
- <td>strawberry</td> <td>5.25 </td> <td>2.5 </td> <td>2.1 </td>
- </tr>
- </tbody>
- </table>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
- <div class="text_cell_render border-box-sizing rendered_html">
- <p>Though strawberry has the lowest rating among the three flavors, the less expensive strawberry cone does well on this measure because it doesn't cost a lot per star.</p>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
- <div class="text_cell_render border-box-sizing rendered_html">
- <p><strong>Side note.</strong> Does the order we list the two tables matter? Let's try it. As you see it, this changes the order that the columns appear in, and can potentially changes the order of the rows, but it doesn't make any fundamental difference.</p>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing code_cell rendered">
- <div class="input">
- <div class="inner_cell">
- <div class="input_area">
- <div class=" highlight hl-ipython3"><pre><span></span><span class="n">ratings</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s1">'Kind'</span><span class="p">,</span> <span class="n">cones</span><span class="p">,</span> <span class="s1">'Flavor'</span><span class="p">)</span>
- </pre></div>
- </div>
- </div>
- </div>
- <div class="output_wrapper">
- <div class="output">
- <div class="jb_output_wrapper }}">
- <div class="output_area">
- <div class="output_html rendered_html output_subarea output_execute_result">
- <table border="1" class="dataframe">
- <thead>
- <tr>
- <th>Kind</th> <th>Stars</th> <th>Price</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td>chocolate </td> <td>3.5 </td> <td>6.55 </td>
- </tr>
- <tr>
- <td>chocolate </td> <td>3.5 </td> <td>5.75 </td>
- </tr>
- <tr>
- <td>strawberry</td> <td>2.5 </td> <td>3.55 </td>
- </tr>
- <tr>
- <td>strawberry</td> <td>2.5 </td> <td>5.25 </td>
- </tr>
- <tr>
- <td>vanilla </td> <td>4 </td> <td>4.75 </td>
- </tr>
- </tbody>
- </table>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
- <div class="text_cell_render border-box-sizing rendered_html">
- <p>Also note that the join will only contain information about items that appear in both tables. Let's see an example. Suppose there is a table of reviews of some ice cream cones, and we have found the average review for each flavor.</p>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing code_cell rendered">
- <div class="input">
- <div class="inner_cell">
- <div class="input_area">
- <div class=" highlight hl-ipython3"><pre><span></span><span class="n">reviews</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
- <span class="s1">'Flavor'</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="s1">'vanilla'</span><span class="p">,</span> <span class="s1">'chocolate'</span><span class="p">,</span> <span class="s1">'vanilla'</span><span class="p">,</span> <span class="s1">'chocolate'</span><span class="p">),</span>
- <span class="s1">'Stars'</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
- <span class="p">)</span>
- <span class="n">reviews</span>
- </pre></div>
- </div>
- </div>
- </div>
- <div class="output_wrapper">
- <div class="output">
- <div class="jb_output_wrapper }}">
- <div class="output_area">
- <div class="output_html rendered_html output_subarea output_execute_result">
- <table border="1" class="dataframe">
- <thead>
- <tr>
- <th>Flavor</th> <th>Stars</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td>vanilla </td> <td>5 </td>
- </tr>
- <tr>
- <td>chocolate</td> <td>3 </td>
- </tr>
- <tr>
- <td>vanilla </td> <td>5 </td>
- </tr>
- <tr>
- <td>chocolate</td> <td>4 </td>
- </tr>
- </tbody>
- </table>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing code_cell rendered">
- <div class="input">
- <div class="inner_cell">
- <div class="input_area">
- <div class=" highlight hl-ipython3"><pre><span></span><span class="n">average_review</span> <span class="o">=</span> <span class="n">reviews</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="s1">'Flavor'</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">average</span><span class="p">)</span>
- <span class="n">average_review</span>
- </pre></div>
- </div>
- </div>
- </div>
- <div class="output_wrapper">
- <div class="output">
- <div class="jb_output_wrapper }}">
- <div class="output_area">
- <div class="output_html rendered_html output_subarea output_execute_result">
- <table border="1" class="dataframe">
- <thead>
- <tr>
- <th>Flavor</th> <th>Stars average</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td>chocolate</td> <td>3.5 </td>
- </tr>
- <tr>
- <td>vanilla </td> <td>5 </td>
- </tr>
- </tbody>
- </table>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
- <div class="text_cell_render border-box-sizing rendered_html">
- <p>We can join <code>cones</code> and <code>average_review</code> by providing the labels of the columns by which to join.</p>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing code_cell rendered">
- <div class="input">
- <div class="inner_cell">
- <div class="input_area">
- <div class=" highlight hl-ipython3"><pre><span></span><span class="n">cones</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s1">'Flavor'</span><span class="p">,</span> <span class="n">average_review</span><span class="p">,</span> <span class="s1">'Flavor'</span><span class="p">)</span>
- </pre></div>
- </div>
- </div>
- </div>
- <div class="output_wrapper">
- <div class="output">
- <div class="jb_output_wrapper }}">
- <div class="output_area">
- <div class="output_html rendered_html output_subarea output_execute_result">
- <table border="1" class="dataframe">
- <thead>
- <tr>
- <th>Flavor</th> <th>Price</th> <th>Stars average</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td>chocolate</td> <td>6.55 </td> <td>3.5 </td>
- </tr>
- <tr>
- <td>chocolate</td> <td>5.75 </td> <td>3.5 </td>
- </tr>
- <tr>
- <td>vanilla </td> <td>4.75 </td> <td>5 </td>
- </tr>
- </tbody>
- </table>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- </div>
- <div class="jb_cell">
- <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
- <div class="text_cell_render border-box-sizing rendered_html">
- <p>Notice how the strawberry cones have disappeared. None of the reviews are for strawberry cones, so there is nothing to which the <code>strawberry</code> rows can be joined. This might be a problem, or it might not be - that depends on the analysis we are trying to perform with the joined table.</p>
- </div>
- </div>
- </div>
- </div>
-
|