Example Usage

In this tutorial, you’ll learn a few techniques for array manipulation using mds_array_manipulation.

This package provides array manipulation functions for searching, sorting, counting non-zero elements, and finding the indices of the maximum value.

For this tutorial, we’ll be using the housing prices dataset from Kaggle. The dataset lists the price, area and other attributes for a collection of houses from different areas.

Imports

We’ll first load our library along with numpy and pandas, to make some manipulations of the dataset easier.

import sys
import os
import numpy as np
import pandas as pd
module_path = os.path.abspath(os.path.join('..'))
# Build the path with os.path.join so it works on Windows, macOS and Linux
sys.path.append(os.path.join(module_path, "src"))

# Load package functions
from mds_array_manipulation.search_array import search_array
from mds_array_manipulation.argmax import argmax
from mds_array_manipulation.sort_array import sort_array
from mds_array_manipulation.count_nonzero_elements import count_nonzero_elements

Load Housing Prices Data

We’ll load the housing price data into a pandas DataFrame and take a quick look at the dataset’s structure and first few entries.

We are going to explore different columns in the dataframe using each of the functions in the mds_array_manipulation package.

housing_data = pd.read_csv("Housing.csv")
housing_data.head()

	price	area	bedrooms	bathrooms	stories	mainroad	guestroom	basement	hotwaterheating	airconditioning	parking	prefarea	furnishingstatus
0	13300000	7420	4	2	3	yes	no	no	no	yes	2	yes	furnished
1	12250000	8960	4	4	4	yes	no	no	no	yes	3	no	furnished
2	12250000	9960	3	2	2	yes	no	yes	no	no	2	yes	semi-furnished
3	12215000	7500	4	2	2	yes	no	yes	no	yes	3	yes	furnished
4	11410000	7420	4	1	2	yes	yes	yes	no	yes	2	no	furnished

Search Array

Imagine you’re searching for a house suitable for you. With many kids, you need a house with 6 bedrooms.

We can start by looking at the bedroom count (i.e., column bedrooms in the housing price dataset) for the first 30 houses.

bedroom_data = housing_data['bedrooms'].to_numpy()
bedroom_data[0:30]

array([4, 4, 3, 4, 4, 3, 4, 5, 4, 3, 3, 4, 4, 4, 3, 4, 4, 3, 3, 3, 3, 3,
       3, 3, 3, 4, 3, 3, 5, 4])

Converting the column to a numpy array lets us search it with our search_array function, so we can check whether any house fits our criteria (i.e., 6 bedrooms).

search_array(bedroom_data, 6)

The search returns index 112. We can use it to index back into our original dataframe and find additional information on this house, to see if it’s otherwise suitable for us.

housing_data.loc[112]

price                 6083000
area                     4300
bedrooms                    6
bathrooms                   2
stories                     2
mainroad                  yes
guestroom                  no
basement                   no
hotwaterheating            no
airconditioning            no
parking                     0
prefarea                   no
furnishingstatus    furnished
Name: 112, dtype: object

The information above confirms that our search result is accurate. The house with index 112 does indeed have 6 bedrooms.

What about even more bedrooms?

search_array(bedroom_data, 7)

-1

There are no houses with 7 bedrooms, so we get -1 for the index.
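The behaviour shown above (an index on a hit, -1 on a miss) can be illustrated with a plain linear search. This is only a sketch, not the package's implementation: the helper name linear_search is hypothetical, and returning the first match when a value occurs more than once is an assumption. It uses the first 30 bedroom counts displayed earlier.

```python
def linear_search(values, target):
    """Return the index of the first element equal to target, or -1 if absent.

    Hypothetical stand-in for illustration; not the package's search_array.
    """
    for index, value in enumerate(values):
        if value == target:
            return index
    return -1


# First 30 bedroom counts, copied from the output above
bedrooms_first_30 = [4, 4, 3, 4, 4, 3, 4, 5, 4, 3, 3, 4, 4, 4, 3,
                     4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 5, 4]

five_bed_index = linear_search(bedrooms_first_30, 5)  # a 5-bedroom house is in this slice
six_bed_index = linear_search(bedrooms_first_30, 6)   # no 6-bedroom house in the first 30 rows

# Check for the -1 sentinel before indexing: in Python, -1 is a valid
# position (the last element), so using it unchecked would silently
# pick the wrong row from a list or array.
if six_bed_index == -1:
    print("No match in the first 30 rows")
else:
    print("Match at index", six_bed_index)
```

The same guard is worth applying before passing a search result on to the dataframe, so that a "not found" result is never mistaken for a real row.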

Sort Array

Imagine you want to know the lowest and highest house prices and areas, to get a rough idea of the housing market in that region.

We can find the house price and area information in the price and area columns, respectively. Let’s take a look at the first 30 entries.

price_data = housing_data['price'].to_numpy()
price_data[0:30]

array([13300000, 12250000, 12250000, 12215000, 11410000, 10850000,
       10150000, 10150000,  9870000,  9800000,  9800000,  9681000,
        9310000,  9240000,  9240000,  9100000,  9100000,  8960000,
        8890000,  8855000,  8750000,  8680000,  8645000,  8645000,
        8575000,  8540000,  8463000,  8400000,  8400000,  8400000])

area_data = housing_data['area'].to_numpy()
area_data[0:30]

array([ 7420,  8960,  9960,  7500,  7420,  7500,  8580, 16200,  8100,
        5750, 13200,  6000,  6550,  3500,  7800,  6000,  6600,  8500,
        4600,  6420,  4320,  7155,  8050,  4560,  8800,  6540,  6000,
        8875,  7950,  5500])

Then, we apply the array sorting function to these two columns. (For simplicity, only the first 30 entries are shown.)

price_data_sorted = sort_array(price_data)
price_data_sorted[0:30]

array([1750000, 1750000, 1750000, 1767150, 1820000, 1855000, 1890000,
       1890000, 1960000, 2100000, 2100000, 2100000, 2135000, 2233000,
       2240000, 2275000, 2275000, 2275000, 2310000, 2345000, 2380000,
       2380000, 2380000, 2408000, 2450000, 2450000, 2450000, 2450000,
       2450000, 2450000])

area_data_sorted = sort_array(area_data)
area_data_sorted[0:30]

array([1650, 1700, 1836, 1905, 1950, 1950, 2000, 2015, 2135, 2145, 2145,
       2145, 2145, 2145, 2145, 2160, 2175, 2176, 2275, 2325, 2398, 2400,
       2400, 2430, 2475, 2500, 2520, 2550, 2610, 2610])

As expected, the sort_array function arranges the values of these two columns in ascending order, from smallest to largest.

We can use the index to obtain the first and last elements of the sorted array, which represent the lowest and highest values, respectively.

print("Lowest house price: " , price_data_sorted[0], "dollars")
print("Highest house price: " , price_data_sorted[-1], "dollars")
print("Lowest house area: " , area_data_sorted[0], "sq ft")
print("Highest house area: " , area_data_sorted[-1], "sq ft")

Lowest house price:  1750000 dollars
Highest house price:  13300000 dollars
Lowest house area:  1650 sq ft
Highest house area:  16200 sq ft

Awesome! We have obtained the lowest and highest house prices and areas, which is exactly what we wanted.
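Sorting is convenient when the whole ordering is of interest, but if only the extremes are needed, a single pass over the data is enough. The sketch below uses Python's built-in sorted, min and max (not the package's sort_array) on the first 30 area values shown above, to show that both routes agree.

```python
# First 30 area values, copied from the output above
area_first_30 = [7420, 8960, 9960, 7500, 7420, 7500, 8580, 16200, 8100, 5750,
                 13200, 6000, 6550, 3500, 7800, 6000, 6600, 8500, 4600, 6420,
                 4320, 7155, 8050, 4560, 8800, 6540, 6000, 8875, 7950, 5500]

# Route 1: sort ascending, then read the two ends
area_sorted = sorted(area_first_30)  # returns a new list; the input is unchanged
lowest_by_sort = area_sorted[0]
highest_by_sort = area_sorted[-1]

# Route 2: one pass each with min/max, no sorting needed
lowest_direct = min(area_first_30)
highest_direct = max(area_first_30)

print(lowest_by_sort == lowest_direct, highest_by_sort == highest_direct)
```

Within these 30 rows the smallest area is 3500; the 1650 reported above comes from the full column.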

Count Non-zero Elements

Let’s imagine that you want to know how many houses have parking spaces, to help you plan for accommodating additional cars at your own property.

To do this, pick the parking column from the housing data loaded earlier. Each entry denotes the parking available for a house, with 0 indicating no parking. Let’s take a look at the first 30 entries.

parking_data = housing_data['parking'].to_numpy()
parking_data[0:30]

array([2, 3, 2, 3, 2, 2, 2, 0, 2, 1, 2, 2, 1, 2, 0, 2, 1, 2, 2, 1, 2, 2,
       1, 1, 2, 2, 0, 1, 2, 1])

To count the houses with parking spaces, you can use count_nonzero_elements from the mds_array_manipulation package.

parking_houses = count_nonzero_elements(parking_data)
parking_houses

{'Total Non-Zero Elements in Array': 246}

As expected, count_nonzero_elements returns a dict telling us how many non-zero elements there are in the array. Therefore, there are 246 houses with parking spaces.

Then, we can subtract the number of non-zero elements from the total length of the array to get the number of zero elements (i.e., the number of houses without parking spaces).

noparking_houses = len(parking_data) - parking_houses['Total Non-Zero Elements in Array']
print("Houses with Parking space : " ,parking_houses['Total Non-Zero Elements in Array'])
print("Houses with No Parking Space : " ,noparking_houses)

Houses with Parking space :  246
Houses with No Parking Space :  299

This analysis gives us a clear picture of parking spaces in Housing.csv. Out of 545 total houses, 246 have parking space and 299 do not.
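Counting non-zero entries amounts to one pass over the array. As a rough sketch (the helper count_nonzero below is hypothetical and only mimics the dictionary shape shown above; it is not the package's code), here it is applied to the first 30 parking values:

```python
def count_nonzero(values):
    """Count entries that are not zero and return the total in a dict.

    Hypothetical helper; the key mirrors the output shown above.
    """
    total = sum(1 for value in values if value != 0)
    return {"Total Non-Zero Elements in Array": total}


# First 30 parking values, copied from the output above
parking_first_30 = [2, 3, 2, 3, 2, 2, 2, 0, 2, 1, 2, 2, 1, 2, 0,
                    2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 0, 1, 2, 1]

with_parking = count_nonzero(parking_first_30)["Total Non-Zero Elements in Array"]
without_parking = len(parking_first_30) - with_parking  # zeros = length - non-zeros

print("With parking:", with_parking)
print("Without parking:", without_parking)
```

For a plain NumPy array, np.count_nonzero gives the same count directly as an integer.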

Finding Indices of Maximum Value (argmax)

Imagine you want to identify the house with the highest price and the one with the largest area, intending to use their indices to retrieve all the information for those observations.

You can find the house price and area information in the first two columns of housing_data.

Let’s convert them to a numpy array and store it under the variable price_area_data.

price_area_data = housing_data.iloc[:,0:2].to_numpy()

We can begin by examining the first 30 entries.

price_area_data[0:30]

array([[13300000,     7420],
       [12250000,     8960],
       [12250000,     9960],
       [12215000,     7500],
       [11410000,     7420],
       [10850000,     7500],
       [10150000,     8580],
       [10150000,    16200],
       [ 9870000,     8100],
       [ 9800000,     5750],
       [ 9800000,    13200],
       [ 9681000,     6000],
       [ 9310000,     6550],
       [ 9240000,     3500],
       [ 9240000,     7800],
       [ 9100000,     6000],
       [ 9100000,     6600],
       [ 8960000,     8500],
       [ 8890000,     4600],
       [ 8855000,     6420],
       [ 8750000,     4320],
       [ 8680000,     7155],
       [ 8645000,     8050],
       [ 8645000,     4560],
       [ 8575000,     8800],
       [ 8540000,     6540],
       [ 8463000,     6000],
       [ 8400000,     8875],
       [ 8400000,     7950],
       [ 8400000,     5500]])

To determine the house with the highest price and the one with the largest area, you can use the argmax function from the mds_array_manipulation package. (Since we are comparing values along columns, it is necessary to specify axis=1.) In the event of a tie, only the first match is returned.

argmax(price_area_data, axis=1)

[0, 7]

The function returns two indices. The first index corresponds to the house with the highest price, and the second index corresponds to the house with the largest area.
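To make the per-column lookup concrete, here is a minimal pure-Python sketch of finding the row index of the maximum in each column, keeping the first occurrence on ties. The function name column_argmax is hypothetical and this is not the package's implementation; the rows are the first eight price/area pairs shown above, plus a small made-up example containing ties.

```python
def column_argmax(rows):
    """Return, for each column, the row index of its first maximum value."""
    if not rows:
        return []
    n_cols = len(rows[0])
    best_index = [0] * n_cols
    for row_index, row in enumerate(rows):
        for col in range(n_cols):
            # Strict '>' keeps the earliest row when values tie
            if row[col] > rows[best_index[col]][col]:
                best_index[col] = row_index
    return best_index


# First eight (price, area) rows, copied from the output above
price_area_first_8 = [
    [13300000, 7420],
    [12250000, 8960],
    [12250000, 9960],
    [12215000, 7500],
    [11410000, 7420],
    [10850000, 7500],
    [10150000, 8580],
    [10150000, 16200],
]

first_8_result = column_argmax(price_area_first_8)

# Made-up data with ties: rows 0 and 2 share the maximum of column 0,
# rows 1 and 2 share the maximum of column 1
tie_result = column_argmax([[5, 1], [3, 9], [5, 9]])

print(first_8_result, tie_result)
```

In NumPy itself, the per-column result is obtained with np.argmax(price_area_data, axis=0), so, judging by the output above, the axis argument of this package's argmax follows a different convention from NumPy's.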

We can then use these indices to retrieve the full record for both the house with the highest price and the one with the largest area, for further comparison or analysis.

print("House with the highest price: ")
housing_data.loc[0]

House with the highest price:

price                13300000
area                     7420
bedrooms                    4
bathrooms                   2
stories                     3
mainroad                  yes
guestroom                  no
basement                   no
hotwaterheating            no
airconditioning           yes
parking                     2
prefarea                  yes
furnishingstatus    furnished
Name: 0, dtype: object

print("House with the largest area: ")
housing_data.loc[7]

House with the largest area:

price                  10150000
area                      16200
bedrooms                      5
bathrooms                     3
stories                       2
mainroad                    yes
guestroom                    no
basement                     no
hotwaterheating              no
airconditioning              no
parking                       0
prefarea                     no
furnishingstatus    unfurnished
Name: 7, dtype: object

Great! We have obtained all the information about the house with the highest price and the house with the largest area, which is exactly what we wanted. We can now perform further analysis based on this information.