Read in CSV Files for Data Analysis¶

Date published: 2018-04-08

Category: Python

Subcategory: Beginner Concepts

Tags: input, csv

Python is a great programming language for data analysis.

CSV files are comma-separated values to store data, similar to a table format.

The CSV format is a common import and export format for spreadsheets and databases. You'll likely encounter it early on in the field of data science.

We'll cover how to read in data from a CSV in which a CSV file is in your Python program's directory/folder on your computer.

Import CSV module¶

In [53]:

                
                    Copied!
                    
import csv
import csv

Read in bike rides CSV¶

In my directory as this Python program, I created a CSV named bike_rides.csv.

open is a Python built-in function to open a file on your local computer with the argument being the file path. A default argument of open is mode=r to open the contents of a file with only permission to read the file, not write to it or perform a different operation. We'll provide an alias as file so we can easily use that name in another operation.

From the csv module in Python, we'll call the reader method and pass in file to create a csv reader object.

Once we have this csv reader object, we can iterate over the rows of our CSV file and print the output.

In [54]:

                
                    Copied!
                    
with open('bike_rides.csv') as file:
    csv_reader_object = csv.reader(file)
    for row in csv_reader_object:
        print("CSV row: {0}".format(row))
with open('bike_rides.csv') as file:
    csv_reader_object = csv.reader(file)
    for row in csv_reader_object:
        print("CSV row: {0}".format(row))

CSV row: ['ride_number', 'date', 'miles_rode', 'ride_duration']
CSV row: ['1', '4-02-2018', '5.1', '29.4']
CSV row: ['2', '4-03-2018', '3.3', '20']
CSV row: ['3', '4-04-2018', '4.5', '21.8']

Each row in our CSV is a list of strings since there are single-quotes around each value.

Calculate Descriptive Statistics on Bike Rides¶

Commonly, when we read in data from a CSV, we may want to analyze the data or perform some analysis. We can perform these calculations as we loop over all data in our CSV.

Count the number of bike rides¶

Our csv_reader_object has a method called line_num that returns the number of lines in our CSV. If we perform this operation, we're returned 4.

In [55]:

                
                    Copied!
                    
csv_reader_object.line_num
csv_reader_object.line_num

Out[55]:

There are 4 rows in this CSV; however, there may be a header row (displaying field names) so we shouldn't count this row in our calculation for the count of bike rides.

The csv class in Python has a Sniffer class to deduce the format of a CSV. With the has_header method of Sniffer, we can check if the first row contains column headers.

If the first row is a header, let's skip that row in counting the number of bike ride records in our csv. To skip that header row, we can use the built-in next function and pass in our csv_reader_object; this will retrive the next item after our header to begin our data analysis.

In [56]:

                
                    Copied!
                    
count_of_rides = 0
count_of_rides = 0

In [57]:

                
                    Copied!
                    
                        
                        
                    
                    

            
with open('bike_rides.csv') as file:
    csv_reader_object = csv.reader(file)
    if csv.Sniffer().has_header:
        next(csv_reader_object)
    for row in csv_reader_object:
        count_of_rides += 1
with open('bike_rides.csv') as file:
    csv_reader_object = csv.reader(file)
    if csv.Sniffer().has_header:
        next(csv_reader_object)
    for row in csv_reader_object:
        count_of_rides += 1

In [58]:

                
                    Copied!
                    
count_of_rides
count_of_rides

Out[58]:

Based on our manual observation above, we can confirm there are 3 records of bike rides.

Average miles rode per bike ride¶

All rows in our CSV contains strings. To perform a math operation, we need to convert the values to an arithmetic data type - either an integer or a float.

We'll convert the values in the miles_rode field to floats.

Also, note we still have to skip the header row in this calculation.

In [59]:

                
                    Copied!
                    
total_miles_rode = 0
total_miles_rode = 0

In each row, a record of a bike ride, at index 2, there's a value for miles_rode.

In [60]:

                
                    Copied!
                    
row_index_miles_rode = 2
row_index_miles_rode = 2

In [61]:

                
                    Copied!
                    
                        
                        
                    
                    

            
with open('bike_rides.csv') as file:
    csv_reader_object = csv.reader(file)
    if csv.Sniffer().has_header:
        next(csv_reader_object)
    for row in csv_reader_object:
        float_miles_rode = float(row[row_index_miles_rode])
        total_miles_rode += float_miles_rode
with open('bike_rides.csv') as file:
    csv_reader_object = csv.reader(file)
    if csv.Sniffer().has_header:
        next(csv_reader_object)
    for row in csv_reader_object:
        float_miles_rode = float(row[row_index_miles_rode])
        total_miles_rode += float_miles_rode

In [62]:

                
                    Copied!
                    
total_miles_rode
total_miles_rode

Out[62]:

12.899999999999999

In [63]:

                
                    Copied!
                    
average_miles_per_ride = total_miles_rode / count_of_rides
average_miles_per_ride = total_miles_rode / count_of_rides

In [64]:

                
                    Copied!
                    
average_miles_per_ride
average_miles_per_ride

Out[64]:

4.3