Example Usage
In this tutorial, you’ll learn a few techniques for array manipulation using mds_array_manipulation.
This package provides array manipulation functions for searching, sorting, counting non-zero elements, and finding the indices of the maximum value.
For this tutorial, we’ll be using the housing prices dataset from Kaggle. The dataset lists the price, area and other attributes for a collection of houses from different areas.
Imports
We’ll first load our library along with numpy and pandas, to make some manipulations of the dataset easier.
import sys
import os
import numpy as np
import pandas as pd
module_path = os.path.abspath(os.path.join('..'))
# Build the path with os.path.join so it works on any operating system
sys.path.append(os.path.join(module_path, "src"))
# Load package functions
from mds_array_manipulation.search_array import search_array
from mds_array_manipulation.argmax import argmax
from mds_array_manipulation.sort_array import sort_array
from mds_array_manipulation.count_nonzero_elements import count_nonzero_elements
Load Housing Prices Data
We’ll load housing price data into a pandas DataFrame and take a quick overview of the dataset’s structure and initial entries.
We are going to explore different columns in the dataframe using each of the functions in the mds_array_manipulation package.
housing_data = pd.read_csv("Housing.csv")
housing_data.head()
| | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |
Search Array
Imagine you’re searching for a house suitable for you. With many kids, you need a house with 6 bedrooms.
We can start by looking at the bedroom count (i.e., column bedrooms in the housing price dataset) for the first 30 houses.
bedroom_data = housing_data['bedrooms'].to_numpy()
bedroom_data[0:30]
array([4, 4, 3, 4, 4, 3, 4, 5, 4, 3, 3, 4, 4, 4, 3, 4, 4, 3, 3, 3, 3, 3,
3, 3, 3, 4, 3, 3, 5, 4])
Because the bedrooms column is now a numpy array, we can search it with our search_array function to see whether any house fits our criteria (i.e., 6 bedrooms). The function returns the index of a matching entry.
search_array(bedroom_data, 6)
112
We can index back into our original dataframe to find additional information on this house, to see if it’s otherwise suitable for us.
housing_data.loc[112]
price 6083000
area 4300
bedrooms 6
bathrooms 2
stories 2
mainroad yes
guestroom no
basement no
hotwaterheating no
airconditioning no
parking 0
prefarea no
furnishingstatus furnished
Name: 112, dtype: object
The information above confirms that our search result is accurate. The house with index 112 does indeed have 6 bedrooms.
What about even more bedrooms?
search_array(bedroom_data, 7)
-1
There are no houses with 7 bedrooms, so search_array returns -1 instead of an index.
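The implementation of search_array is not shown in this tutorial. As a rough sketch of the behaviour seen above (an index on a match, -1 otherwise), the same result can be reproduced with plain numpy. The helper name first_index_of is hypothetical and not part of mds_array_manipulation, and the sample array is just the first ten bedroom counts from the output above.

```python
import numpy as np


def first_index_of(arr, value):
    """Hypothetical helper: index of the first element equal to value, or -1."""
    matches = np.flatnonzero(np.asarray(arr) == value)
    return int(matches[0]) if matches.size > 0 else -1


# First ten bedroom counts, copied from the output shown earlier
sample_bedrooms = np.array([4, 4, 3, 4, 4, 3, 4, 5, 4, 3])

print(first_index_of(sample_bedrooms, 5))    # 7
print(first_index_of(sample_bedrooms, 7))    # -1
print(np.flatnonzero(sample_bedrooms == 3))  # [2 5 9]
```

The last line shows how np.flatnonzero can list every matching position, which is handy when you want all candidate houses rather than a single one.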
Sort Array
Imagine you want to know the lowest and highest house prices, as well as the areas, to get a rough idea of the housing market in that region.
We can find the house prices and areas in the price and area columns respectively. Let’s take a look at the first 30 entries.
price_data = housing_data['price'].to_numpy()
price_data[0:30]
array([13300000, 12250000, 12250000, 12215000, 11410000, 10850000,
10150000, 10150000, 9870000, 9800000, 9800000, 9681000,
9310000, 9240000, 9240000, 9100000, 9100000, 8960000,
8890000, 8855000, 8750000, 8680000, 8645000, 8645000,
8575000, 8540000, 8463000, 8400000, 8400000, 8400000])
area_data = housing_data['area'].to_numpy()
area_data[0:30]
array([ 7420, 8960, 9960, 7500, 7420, 7500, 8580, 16200, 8100,
5750, 13200, 6000, 6550, 3500, 7800, 6000, 6600, 8500,
4600, 6420, 4320, 7155, 8050, 4560, 8800, 6540, 6000,
8875, 7950, 5500])
Then, we apply the array sorting function to these two columns. (For simplicity, only the first 30 entries are shown.)
price_data_sorted = sort_array(price_data)
price_data_sorted[0:30]
array([1750000, 1750000, 1750000, 1767150, 1820000, 1855000, 1890000,
1890000, 1960000, 2100000, 2100000, 2100000, 2135000, 2233000,
2240000, 2275000, 2275000, 2275000, 2310000, 2345000, 2380000,
2380000, 2380000, 2408000, 2450000, 2450000, 2450000, 2450000,
2450000, 2450000])
area_data_sorted = sort_array(area_data)
area_data_sorted[0:30]
array([1650, 1700, 1836, 1905, 1950, 1950, 2000, 2015, 2135, 2145, 2145,
2145, 2145, 2145, 2145, 2160, 2175, 2176, 2275, 2325, 2398, 2400,
2400, 2430, 2475, 2500, 2520, 2550, 2610, 2610])
As expected, the sort_array function arranges the values of these two columns in ascending order, from smallest to largest.
We can index into each sorted array to obtain its first and last elements, which are the lowest and highest values, respectively.
print("Lowest house price: " , price_data_sorted[0], "dollars")
print("Highest house price: " , price_data_sorted[-1], "dollars")
print("Lowest house area: " , area_data_sorted[0], "sq ft")
print("Highest house area: " , area_data_sorted[-1], "sq ft")
Lowest house price: 1750000 dollars
Highest house price: 13300000 dollars
Lowest house area: 1650 sq ft
Highest house area: 16200 sq ft
Awesome! We have obtained the lowest and highest house prices, as well as the areas, which is exactly what we wanted.
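If you want to sanity-check results like these, numpy's own np.sort, min and max should agree with them. The sketch below uses only the first six area values from the output above, so it does not need Housing.csv. It also uses np.argsort, which is not part of this package, to recover the original position of the largest value.

```python
import numpy as np

# First six area values, copied from the output shown earlier
sample_areas = np.array([7420, 8960, 9960, 7500, 7420, 7500])

# np.sort returns a sorted copy in ascending order
sorted_areas = np.sort(sample_areas)
print(sorted_areas)  # [7420 7420 7500 7500 8960 9960]

# The ends of the sorted array match the minimum and maximum
print(sorted_areas[0] == sample_areas.min())   # True
print(sorted_areas[-1] == sample_areas.max())  # True

# argsort gives the original positions in sorted order
order = np.argsort(sample_areas, kind="stable")
print(order[-1])  # 2
```

Sorting the values alone loses track of which house each value belongs to, so np.argsort is useful when you need to look the row up again afterwards.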
Count Non-zero Elements
Let’s imagine that you want to know the number of houses with parking spaces to guide your planning process for making arrangements to accommodate additional cars at your own property.
To achieve this, you can pick the parking column from the housing data loaded earlier. Non-zero entries denote houses with parking, and 0 indicates no parking. Let’s take a look at the first 30 entries.
parking_data = housing_data['parking'].to_numpy()
parking_data[0:30]
array([2, 3, 2, 3, 2, 2, 2, 0, 2, 1, 2, 2, 1, 2, 0, 2, 1, 2, 2, 1, 2, 2,
1, 1, 2, 2, 0, 1, 2, 1])
To count the houses with parking spaces, you can use count_nonzero_elements from the mds_array_manipulation package.
parking_houses = count_nonzero_elements(parking_data)
parking_houses
{'Total Non-Zero Elements in Array': 246}
As expected, count_nonzero_elements returns a dict telling us how many non-zero elements there are in the array. Therefore, there are 246 houses with parking spaces.
Then, we can subtract the number of non-zero elements from the total length of the array to get the number of zero elements (i.e., the number of houses without parking spaces).
noparking_houses = len(parking_data) - parking_houses['Total Non-Zero Elements in Array']
print("Houses with Parking space : " ,parking_houses['Total Non-Zero Elements in Array'])
print("Houses with No Parking Space : " ,noparking_houses)
Houses with Parking space : 246
Houses with No Parking Space : 299
This gives us a clear picture of parking availability in Housing.csv. Out of 545 total houses, 246 have parking space and 299 do not.
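The same arithmetic can be cross-checked with numpy's built-in np.count_nonzero. This is only a sketch on the first 15 parking values from the output above, not the package's implementation, and the variable names are illustrative.

```python
import numpy as np

# First 15 parking values, copied from the output shown earlier
sample_parking = np.array([2, 3, 2, 3, 2, 2, 2, 0, 2, 1, 2, 2, 1, 2, 0])

# Count the entries that are not zero (houses with parking)
with_parking = int(np.count_nonzero(sample_parking))

# Everything else is a zero entry (houses without parking)
without_parking = sample_parking.size - with_parking

print(with_parking)     # 13
print(without_parking)  # 2
```

The two counts always add up to the length of the array, which is the same check as 246 + 299 = 545 for the full dataset.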
Finding Indices of Maximum Value (argmax)
Imagine you want to identify the house with the highest price and the one with the largest area, then use their indices to retrieve all the information for those houses.
You can find the house price and area information in the first two columns of housing_data.
Let’s convert them to a numpy array and store it under the variable price_area_data.
price_area_data = housing_data.iloc[:,0:2].to_numpy()
We can begin by examining the first 30 entries.
price_area_data[0:30]
array([[13300000, 7420],
[12250000, 8960],
[12250000, 9960],
[12215000, 7500],
[11410000, 7420],
[10850000, 7500],
[10150000, 8580],
[10150000, 16200],
[ 9870000, 8100],
[ 9800000, 5750],
[ 9800000, 13200],
[ 9681000, 6000],
[ 9310000, 6550],
[ 9240000, 3500],
[ 9240000, 7800],
[ 9100000, 6000],
[ 9100000, 6600],
[ 8960000, 8500],
[ 8890000, 4600],
[ 8855000, 6420],
[ 8750000, 4320],
[ 8680000, 7155],
[ 8645000, 8050],
[ 8645000, 4560],
[ 8575000, 8800],
[ 8540000, 6540],
[ 8463000, 6000],
[ 8400000, 8875],
[ 8400000, 7950],
[ 8400000, 5500]])
To find the house with the highest price and the one with the largest area, you can use the argmax function from the mds_array_manipulation package. Since we are looking for the maximum within each column, we specify axis=1. In the event of a tie, only the first match is returned.
argmax(price_area_data, axis=1)
[0, 7]
The function returns two indices. The first index corresponds to the house with the highest price, and the second index corresponds to the house with the largest area.
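For comparison, here is a sketch of the same lookup with numpy's own np.argmax on the first eight rows shown above. Note that in numpy the per-column maximum is requested with axis=0, so the axis convention appears to differ from this package's argmax, where axis=1 is used for the same purpose. That reading is an assumption based on the output above, not on the package source.

```python
import numpy as np

# First eight (price, area) rows, copied from the output shown earlier
sample_price_area = np.array([
    [13300000, 7420],
    [12250000, 8960],
    [12250000, 9960],
    [12215000, 7500],
    [11410000, 7420],
    [10850000, 7500],
    [10150000, 8580],
    [10150000, 16200],
])

# In numpy, axis=0 gives the row index of the maximum in each column
per_column = np.argmax(sample_price_area, axis=0)
print(per_column.tolist())  # [0, 7]

# On ties, np.argmax returns the first occurrence
print(int(np.argmax(np.array([3, 9, 9]))))  # 1
```

Both approaches report row 0 for the highest price and row 7 for the largest area.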
We can then use these indices to retrieve the full record for both the house with the highest price and the one with the largest area, for further comparison or analysis.
print("House with the highest price: ")
housing_data.loc[0]
House with the highest price:
price 13300000
area 7420
bedrooms 4
bathrooms 2
stories 3
mainroad yes
guestroom no
basement no
hotwaterheating no
airconditioning yes
parking 2
prefarea yes
furnishingstatus furnished
Name: 0, dtype: object
print("House with the largest area: ")
housing_data.loc[7]
House with the largest area:
price 10150000
area 16200
bedrooms 5
bathrooms 3
stories 2
mainroad yes
guestroom no
basement no
hotwaterheating no
airconditioning no
parking 0
prefarea no
furnishingstatus unfurnished
Name: 7, dtype: object
Great! We have obtained all the information about the house with the highest price and the house with the largest area, which is exactly what we wanted. We can now perform further analysis based on this information.