Read in CSV Files for Data Analysis¶
Date published: 2018-04-08
Category: Python
Subcategory: Beginner Concepts
Tags: input, csv
Python is a great programming language for data analysis.
CSV files are comma-separated values to store data, similar to a table format.
The CSV format is a common import and export format for spreadsheets and databases. You'll likely encounter it early on in the field of data science.
We'll cover how to read in data from a CSV in which a CSV file is in your Python program's directory/folder on your computer.
Import CSV module¶
import csv
Read in bike rides CSV¶
In my directory as this Python program, I created a CSV named bike_rides.csv
.
open
is a Python built-in function to open a file on your local computer with the argument being the file path. A default argument of open
is mode=r
to open the contents of a file with only permission to read the file, not write to it or perform a different operation. We'll provide an alias as file
so we can easily use that name in another operation.
From the csv
module in Python, we'll call the reader
method and pass in file
to create a csv reader object.
Once we have this csv reader object, we can iterate over the rows of our CSV file and print the output.
with open('bike_rides.csv') as file:
csv_reader_object = csv.reader(file)
for row in csv_reader_object:
print("CSV row: {0}".format(row))
CSV row: ['ride_number', 'date', 'miles_rode', 'ride_duration'] CSV row: ['1', '4-02-2018', '5.1', '29.4'] CSV row: ['2', '4-03-2018', '3.3', '20'] CSV row: ['3', '4-04-2018', '4.5', '21.8']
Each row in our CSV is a list of strings since there are single-quotes around each value.
Calculate Descriptive Statistics on Bike Rides¶
Commonly, when we read in data from a CSV, we may want to analyze the data or perform some analysis. We can perform these calculations as we loop over all data in our CSV.
Count the number of bike rides¶
Our csv_reader_object
has a method called line_num
that returns the number of lines in our CSV. If we perform this operation, we're returned 4
.
csv_reader_object.line_num
4
There are 4 rows in this CSV; however, there may be a header row (displaying field names) so we shouldn't count this row in our calculation for the count of bike rides.
The csv
class in Python has a Sniffer
class to deduce the format of a CSV. With the has_header
method of Sniffer
, we can check if the first row contains column headers.
If the first row is a header, let's skip that row in counting the number of bike ride records in our csv. To skip that header row, we can use the built-in next
function and pass in our csv_reader_object
; this will retrive the next item after our header to begin our data analysis.
count_of_rides = 0
with open('bike_rides.csv') as file:
csv_reader_object = csv.reader(file)
if csv.Sniffer().has_header:
next(csv_reader_object)
for row in csv_reader_object:
count_of_rides += 1
count_of_rides
3
Based on our manual observation above, we can confirm there are 3 records of bike rides.
Average miles rode per bike ride¶
All rows in our CSV contains strings. To perform a math operation, we need to convert the values to an arithmetic data type - either an integer or a float.
We'll convert the values in the miles_rode field to floats.
Also, note we still have to skip the header row in this calculation.
total_miles_rode = 0
In each row, a record of a bike ride, at index 2
, there's a value for miles_rode
.
row_index_miles_rode = 2
with open('bike_rides.csv') as file:
csv_reader_object = csv.reader(file)
if csv.Sniffer().has_header:
next(csv_reader_object)
for row in csv_reader_object:
float_miles_rode = float(row[row_index_miles_rode])
total_miles_rode += float_miles_rode
total_miles_rode
12.899999999999999
average_miles_per_ride = total_miles_rode / count_of_rides
average_miles_per_ride
4.3