A Python challenge: generate a list of emails with duplicates in random places, then find a way to prune the list, leaving the result in the same order.


Greg Gauthier 7f466bdb80 force another runner trigger (Gitea Actions Demo / Explore-Gitea-Actions, push: successful in 12s)	2024-02-15 21:16:47 +00:00
.gitea/workflows	add gitea workflow	2024-02-15 21:10:28 +00:00
.gitignore	final commit	2020-10-21 23:57:48 +01:00
Pipfile	remove specific python version requirement. CircleCI is already on 3.9, so that's good enough for now	2020-10-23 13:17:22 +01:00
README.md	force another runner trigger	2024-02-15 21:16:47 +00:00
conftest.py	final commit	2020-10-21 23:57:48 +01:00
email_pruner.py	Add option to dump the email list to the console	2020-10-22 00:14:51 +01:00
requirements.txt	circleci seems to want to install requirements the old way. We'll let him install pipenv this way, and do the rest from there	2020-10-23 13:05:05 +01:00
test_email_pruner.py	final commit	2020-10-21 23:57:48 +01:00

README.md

email-pruner

An experiment in Python lists.

Requirements

Python 3.7+ (the code uses f-strings, which require Python 3.6 or later)
Pipenv ~2020.8.13 (for virtualenv dev)
Pytest 6+

Setup

cd into the root of the project.
Run these commands:

$ python3 -m pip install pipenv 
$ pipenv --python 3.7
$ pipenv install
$ pipenv shell

This will drop you into the virtualenv, with the right packages already installed. Next, run the tests, then run the app. The output should look something like this:

(email-prune) [23:02:47][~/Projects/Coding/Python/email-prune]
gmgauthier@shackleton $ pytest -vv                         
=================== test session starts ==============================================================================
platform darwin -- Python 3.8.6, pytest-6.1.1, py-1.9.0, pluggy-0.13.1 -- /Users/gmgauthier/.local/share/virtualenvs/email-prune-6GbCapbV/bin/python
cachedir: .pytest_cache
rootdir: /Users/gmgauthier/Projects/Coding/Python/email-prune
collected 6 items

test_email_pruner.py::test_email_creation PASSED                                                                 [ 16%]
test_email_pruner.py::test_dup_list_creation PASSED                                                              [ 33%]
test_email_pruner.py::test_compare_dups_and_pruned PASSED                                                        [ 50%]
test_email_pruner.py::test_alternative_pruner PASSED                                                             [ 66%]
test_email_pruner.py::test_random_string_contents PASSED                                                         [ 83%]
test_email_pruner.py::test_random_string_len PASSED                                                              [100%]

============================ 6 passed in 0.03s ========================================================================
(email-prune) [23:02:55][~/Projects/Coding/Python/email-prune]
gmgauthier@shackleton $ python ./email_pruner.py -e 750000              
GENERATED COMPLETE LIST WITH DUPLICATES: (count = 1500000)
Elapsed Time:  0:00:57.541989
IDENTIFIED DUPLICATES IN COMPLETE LIST: (count = 750000)
Elapsed time:  0:00:00.545665

TOTAL ELAPSED TIME: 0:00:58.087654

NOTES

I have a lot of comments in the code and on the tests that explain my reasoning around certain decisions, so I'll just explain the console output here.

What you're seeing echoed out to the console is a record of how long it took to execute the two major steps in this code: (a) generating the email list (which includes the duplicates inserted at random positions), and (b) identifying those duplicates, which includes splitting the list into two separate lists: originals and duplicates.
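To make those two steps concrete, here is a minimal sketch of one way they could be done. It is not the code in email_pruner.py: the function names (`random_string`, `generate_emails_with_dups`, `split_duplicates`), the `example.com` addresses, and the choice to duplicate every email exactly once and then shuffle are all assumptions made for illustration.

```python
import random
import string


def random_string(length=8):
    """Return a random string of lowercase ASCII letters (hypothetical helper)."""
    return "".join(random.choices(string.ascii_lowercase, k=length))


def generate_emails_with_dups(count):
    """Create `count` distinct emails, duplicate each once, then shuffle."""
    # The index suffix keeps addresses distinct even if two random
    # strings happen to collide.
    emails = [f"{random_string()}{i}@example.com" for i in range(count)]
    combined = emails * 2
    random.shuffle(combined)
    return combined


def split_duplicates(emails):
    """Split into (originals, duplicates), each in order of appearance."""
    seen = set()
    originals, duplicates = [], []
    for email in emails:
        if email in seen:
            # Already encountered: this occurrence is a duplicate.
            duplicates.append(email)
        else:
            seen.add(email)
            originals.append(email)
    return originals, duplicates


if __name__ == "__main__":
    emails = generate_emails_with_dups(5)
    originals, duplicates = split_duplicates(emails)
    print(originals)
    print(duplicates)
```

The set makes each membership check cheap on average, so the split is a single pass over the list, and appending in traversal order is what keeps the originals in their first-seen order.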

As you can see, this particular execution was a sort of simple "load test" on the app. The requirements called for isolating the duplicates in 100,000 emails in less than a second. This code was able to process a list of 1.5 million entries (750,000 of them duplicates) in 546 milliseconds. Not bad!
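For anyone who wants to try a similar measurement, the sketch below times an order-preserving prune over 100,000 emails that each appear twice. It is an illustration under assumptions, not the timing code from this repo: `prune_preserving_order` and `timed` are hypothetical names, and the prune uses `dict.fromkeys`, which may differ from the approach in email_pruner.py. The elapsed value is a `timedelta`, which prints in the same `0:00:00.545665` style as the output above. Timings will vary by machine.

```python
import time
from datetime import timedelta


def prune_preserving_order(emails):
    """Drop repeats, keeping each email at its first position."""
    # dict keys keep insertion order (guaranteed from Python 3.7 onward).
    return list(dict.fromkeys(emails))


def timed(func, *args):
    """Run func(*args) and return (result, elapsed as a timedelta)."""
    start = time.perf_counter()
    result = func(*args)
    return result, timedelta(seconds=time.perf_counter() - start)


if __name__ == "__main__":
    # 100,000 distinct addresses, each appearing twice.
    emails = [f"user{i % 100_000}@example.com" for i in range(200_000)]
    pruned, elapsed = timed(prune_preserving_order, emails)
    print(f"Pruned to {len(pruned)} emails")
    print(f"Elapsed time:  {elapsed}")
```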

The tests are run with pytest. They are designed to run quickly. I'm only seeding 100 emails. The point is merely to demonstrate the functionality of the methods I wrote, and to showcase the importance of TESTING the application (and to demonstrate that I can reason good assertions from the requirements).
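As a rough idea of what assertions at this level can look like, here is a small pytest-style sketch seeded with 100 emails. These are not the tests in test_email_pruner.py: `prune` is a hypothetical stand-in for the function under test, and `make_seeded_list` and both test names are invented for the example.

```python
import random


def prune(emails):
    """Hypothetical stand-in for the pruner under test."""
    return list(dict.fromkeys(emails))


def make_seeded_list(count=100, seed=1234):
    """Build `count` emails, each appearing twice, in a repeatable shuffle."""
    rng = random.Random(seed)
    emails = [f"user{i}@example.com" for i in range(count)]
    combined = emails * 2
    rng.shuffle(combined)
    return combined


def test_pruned_list_has_no_duplicates():
    pruned = prune(make_seeded_list())
    assert len(pruned) == len(set(pruned)) == 100


def test_pruned_list_keeps_first_seen_order():
    emails = make_seeded_list()
    pruned = prune(emails)
    # Each surviving email must appear in the order of its first occurrence.
    first_positions = [emails.index(email) for email in pruned]
    assert first_positions == sorted(first_positions)
```

Saved in a file whose name starts with `test_`, functions like these are picked up by pytest's default discovery. A fixed seed keeps the shuffled input identical from run to run.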

I should mention, I could have wrapped the tests in the Behave DSL, but chose not to for this challenge because the nature of the work being done in this application is at the functional integration level, rather than at the level of user interaction. Gherkin specifications are best used in the context of a behavioral relationship between user and application, rather than as a tool for "englishifying" component level specifications. The "raw" test code is much more instructive, if you know what you're looking for.