If MS includes python in excel, sheets/workbooks could be crafted with python scripts and deploy the more people w/o provide support for installing excel add-ins and python virtual environments, so this will be more 'portable', efficient and less time-consuming. Using Python And Excel For Data Science. You will probably already know that Excel is a spreadsheet application developed by Microsoft. You can use this easily accessible tool to organize, analyze and store your data in tables.
Note Check out our. The latest version of Mac OS X, High Sierra, comes with Python 2.7 out of the box. You do not need to install or configure anything else to use Python. Having said that, I would strongly recommend that you install the tools and libraries described in the next section before you start building Python applications for real-world use. In particular, you should always install Setuptools, as it makes it much easier for you to install and manage other third-party Python libraries. The version of Python that ships with OS X is great for learning, but it’s not good for development.
The version shipped with OS X may be out of date from the, which is considered the stable production version. Virtual Environments A Virtual Environment (commonly referred to as a ‘virtualenv’) is a tool to keep the dependencies required by different projects in separate places, by creating virtual Python environments for them. It solves the “Project X depends on version 1.x but, Project Y needs 4.x” dilemma, and keeps your global site-packages directory clean and manageable. For example, you can work on a project which requires Django 1.10 while also maintaining a project which requires Django 1.8. To start using this and see more information: docs.
This page is a remixed version of, which is available under the same license.
Working with Excel Files Unlike the previous chapter’s data, not all the data in this and the following chapter will easily import into Python without a little work. This is because some data formats were made to be machine readable, while others, such as the ones we’ll look at next, were meant to be interacted with through desktop tools. In this chapter and the next, we will look at two example file types—Excel files and PDFs—and provide some generic instructions to follow if you encounter a different file type. So far in this book, the solutions you’ve learned about for importing data have been pretty standard. In this chapter, we will begin to learn about processes which will vary greatly each time you perform them.
Although the processes are more difficult, the end goal is the same: to extract the useful information and put it into a usable format in Python. The examples we use in this chapter and the next contain data from UNICEF’s 2014 report on. The data is available in PDF and Excel format. When you have to extract data from files in these more difficult formats, you might think there is someone out there who hates you, because it can be painful. We assure you in most cases, the person who generated the file with the data inside simply did not identify the importance of also releasing it in a machine-readable format. Installing Python Packages Before we can continue, we need to learn how to install external Python packages (or libraries). Up until this point, we were using Python libraries that came standard with Python when you installed it.
Do you remember importing the csv and json packages in? Those were packages in the standard library—they came with your Python installation. Python comes with a set of frequently used libraries. Because many libraries serve a niche purpose, you have to explicitly install them. This is so your computer doesn’t get bloated with every Python library available. Python packages are collected in an online directory called, which stores the packages along with their metadata and any documentation.
In this chapter, we are looking at Excel files. If you visit PyPI in your browser, you can search for and see lists of matching package results you can download. This is one way to explore which package you should use. We will be using pip from this point forward to install packages. There are multiple ways to, and you should have already done so in of this book. First, we will be evaluating Excel data.
Let’s install the package to do that—. To install the package, we use pip in the following way: pip install xlrd To remove the package, we would run the uninstall command: pip uninstall xlrd Try installing, uninstalling, and then reinstalling xlrd. It’s good to get a handle on the pip commands, as you’ll be using them throughout this book and your data wrangling career. Why did we choose xlrd when there are many possible packages? Choosing a Python library is an imperfect process.
There are different ways to go about your selection. Don’t worry about trying to figure out what is the right library. When you are perfecting your skills and you find a couple of options, use the library that makes sense to you.
The first thing we recommend is searching the Web to see which libraries other people recommend. If you search for, you will find the xlrd library surfaces at the top of the search results.
However, the answer is not always obvious. In, we will learn more about the selection process when looking into Twitter libraries. Parsing Excel Files Sometimes the easiest way to extract data from an Excel sheet is finding a better way to get the data.
There are times when parsing is not the answer. Before you start parsing, ask yourself the following questions:. Have you tried to find the data in another form? Sometimes other forms might be available from the same source. Have you tried to to figure out if the data is available in another form? Check out for more tips. Have you tried to export the tab or tabs into CSV form from Excel (or your document reader)?
This is a good solution if you only have a couple of tabs of data or isolated data in one tab on the Excel sheet. If you have exhausted these options and you still don’t have the data you need, you’ll need to use Python to parse your Excel file.
Getting Started with Parsing The library we identified for parsing Excel files is xlrd. This library is part of a series of. There are three main libraries for handling Excel files: xlrd Reads Excel files xlwt Writes and formats Excel files xlutils A set of tools for more advanced operations in Excel (requires xlrd and xlwt) You’ll need to install each separately if you want to use them; however, in this chapter we will only use xlrd. Because we want to read Excel files into Python, you’ll need to make sure you have xlrd installed before continuing: pip install xlrd. Warning If you get the following error, that means you don’t have pip installed: -bash: pip: command not found For installation instructions, refer to.
Set up your work environment for this Excel file by doing the following (or something like it, depending on your organizational system):. Create a folder for your Excel work. Create a new Python file called parseexcel.py and put it in the folder you created. Place the Excel file from called SOWC 2014 Stat TablesTable 9.xlsx in the same folder. From this folder, type the following command in your terminal to run the script from the command line: python parseexcel.py By the end of this chapter, we will write a script to parse child labor and marriage data stored in this Excel file. To start our script, we need to import xlrd and open our Excel workbook in Python.
We store the opened file in the book variable: import xlrd book = xlrd. Openworkbook ( 'SOWC 2014 Stat TablesTable 9.xlsx' ) Unlike CSVs, Excel books can have multiple tabs or sheets. To get at our data, we are going to pull out only the sheet with the data we want. If you have a couple of sheets, you could just guess at the index, but that won’t work if you have lots of sheets. So, you should know about the command book.sheetbyname( somename), where somename is the name of the sheet you want to access.
Let’s check out the names of the sheets we have: import xlrd book = xlrd. Openworkbook ( 'SOWC 2014 Stat TablesTable 9.xlsx' ) for sheet in book. Sheets : print sheet. Name The sheet that we are looking for is Table 9. So, let’s put that into our script: import xlrd book = xlrd. Openworkbook ( 'SOWC 2014 Stat TablesTable 9.xlsx' ) sheet = book.
Sheetbyname ( 'Table 9' ) print sheet If you run that code, it exits with an error that provides you with the following information: xlrd. XLRDError: No sheet named At this point, you might be really confused. The problem lies in the difference between what we see and what actually exists. If you open up your Excel workbook and select the name of the sheet by double-clicking it, you will find that there is an extra space at the end. This space is not visible to users in the browser.
In, we will learn how to troubleshoot this in Python. For now, update your code to reflect the space. Change this line: sheet = book. Sheetbyname ( 'Table 9' ) to this: sheet = book. Sheetbyname ( 'Table 9 ' ) Now, if we run our script it should work.
You will see output similar to this: Let’s explore what we can do with a sheet. Add the following after you assign the sheet variable and rerun your script: print dir ( sheet ) In the returned list, you’ll see a method called nrows. We will use this method to iterate over all rows. If we write print sheet.nrows, the total number of rows will be returned.
Try this now: print sheet. Nrows You should have gotten back 303. We need to iterate over each row, which means we need a for loop. As we learned in, for loops iterate over items in a list, so we need to turn 303 into a list we can iterate over 303 times. To do this, we will use the range function. What Is range?
Remember how we mentioned that Python has some helpful built-in functions? Well, range is one of those. The function will take the number as an argument and output a list of that many items. To see how it works, open up your Python interpreter and try the following: range ( 3 ) The output should be: 0, 1, 2 Three items were returned. Now, we can create a for loop to loop three times by iterating over that list.
Some things to note about range:. The list returned starts with 0. This is because Python starts list counts with 0. If you need your list to start at 1, you can set the start and end of the range. For example, range(1, 4) would return 1, 2, 3. Notice the last number is not included in the list, so to get 1, 2, 3 we had to set the end number to 4.
There is another function called xrange in Python 2.7. There are slight differences, but not anything you would notice unless you are processing very large datasets— xrange is faster. With the addition of the range function we can transform 303 into a list our for loop can iterate over, our script should look like the following: import xlrd book = xlrd. Openworkbook ( ' SOWC 2014 Stat TablesTable 9.xlsx ' ) sheet = book. Sheetbyname ( ' Table 9 ' ) for i in range ( sheet. Nrows ): print i Loops over the index i in range(303), which will be a list of 303 integers incrementing by one. Outputs i, which will be the numbers from 0 to 302.
From here, we need to do a lookup on each of the rows to pull out the contents of each row instead of just printing the number. To do a lookup, we will use i as an index reference to take the nth row.
To get each row’s values we will use rowvalues, which was another method returned by dir(sheet) earlier. We can see from that the method expects an index number and returns the corresponding row’s values. Update your for loop to reflect this and rerun your script: for i in range ( sheet.
Nrows ): print sheet. Rowvalues ( i ) Uses i as the index to look up the row’s values. Because it is in a for loop that spans the length of the sheet, we call this method for each row in our sheet. When you run this code, you will see a list for each row.
The following is a subset of the data you’ll see: ', u 'TABLE 9. CHILD PROTECTION', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ' ', ', u 'TABLEAU 9. PROTECTION DE L u2019 ENFANT', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ', ' ', ', ', u 'TABLA 9.
Tip Planning what you want the outcome to look like and writing an example of your data will save you time as you begin coding. Once you have identified how you’d like to format your data, you can ask yourself, “What do I need to do next to get there?” This is especially helpful when you feel blocked on your next step.
There are two Python constructs we are going to use to pull the data out. The first method we will use is a nested for loop, which is a for loop inside another for loop.
This is often used when you have x rows that contain y objects. To access each object you need a for loop for each row, then another for loop for each object. We also used a nested for loop in an example in. We are going to use a nested for loop to output each cell from each row. This will output the items we saw earlier, where each row was listed. For i in xrange ( sheet. Nrows ): row = sheet.
Rowvalues ( i ) for cell in row: print cell Takes the list that is each row and saves it to the row variable. This makes our code more readable. Loops over each item in the list, which represents each cell for the current row. Outputs the cell value.
If you run your complete code with the nested for loop, you will notice your output is not so helpful anymore. That brings us to the second mechanism to explore our Excel file— a counter. Warning Before you continue, output a couple of records and check the number in the dictionary. It is easy to end up one index off and ruin the rest of your data. Finally, to preview our data, we can use pprint instead of a print statement. In complicated data structures (like dictionaries), this makes it a lot easier to review the output. Add the following to the end of your file to preview the data in a formatted fashion: import pprint pprint.
Pprint ( data ) Imports the pprint library. Normally, import statements come at the beginning of the file, but we are putting it here for simplicity.
After you are done, you will want to delete these lines, because they are not critical to your script. Passes data to the pprint.pprint function. If you scroll through your output, you will notice the majority of it looks good. But there are a couple of records that seem out of place. If you look at the spreadsheet, you should note the last row for countries is Zimbabwe. So, we want to look for when the country is equal to 'Zimbabwe', and exit there.
To exit, we add a break to our code, which is how we prematurely break the for loop to continue with the rest of script. Let’s add that as our stopping point. At the end of the for loop, add the following and rerun your code: if country ' Zimbabwe ': break If the country is equal to Zimbabwe Exits out of the for loop. Warning After adding the break, did you end up with a NameError: name 'country' is not defined error?
If so, check your indentation. The if statement should be indented four spaces to be in the for loop. Stepping through code can be helpful in identifying an issue. If you need to troubleshoot to figure out what a variable, such as country, is equal to in a for loop, try adding print statements inside the for loop and watching the values before your script exits with an error. They will likely give you a hint as to what is happening.
At this point, our script’s output matches our end goal. The last thing we want to do to our script is to make sure we document it with some comments. Summary The Excel format is an odd in-between category that is kind of machine readable.
Excel files were not meant to be read by programs, but they are parsable. To handle this nonstandard format, we had to install external libraries. There are two ways to find libraries: by looking on, the Python package index, or by searching for tutorials and how-tos to see what other people have done. Once you have identified the library you want to install, use the pip install command to do it; to uninstall a library, use pip uninstall. Besides learning how to parse Excel using the xlrd library, we also learned a few new Python programming concepts, which are summarized in.
New Python programming concepts Concept Purpose This turns a number into a consecutive list of numbers. Example: range(3) will output 0, 1, 2.
Counting starts at 0, not 1 This is a computer construct to be aware of; it occurs throughout programming. It is important to note when using range, indexing, or slicing.
Use this to pull out a specific subset of a string or list. Counters Use this as a tool to control for loops. Nested for loops Use when iterating over a data structure within a data structure, such as a list of lists, a list of dictionaries, or a dictionary of dictionaries. Pprint pprint is a way to output data into the terminal in a nice format. This is good to use when programming with complicated data structures. Break You can exit from a for loop early by using break. This will stop executing the loop and continue on to the next part of the script.
Commenting It is important to keep all your code commented, so you know what is happening for future reference. As you read on and dig into PDFs, you will learn the importance of exploring alternatives to the data you have or finding alternate ways to locate and find data you need to answer your research questions. With Safari, you learn the way you learn best.
Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.