Published on June 29, 2021 by Sreeram Ramesh
During the golden age of economic expansion in the 1950s, many fields of science and engineering grew rapidly. Two in particular, mathematical programming and Monte Carlo methods, saw exponential growth in both theory and practical applications. Alongside this academic progress in the natural sciences, a new field of science and technology, computer science, began to emerge. Since then, the platforms and tools used to analyse data have undergone a number of changes.
Today, Python has emerged as one of the most popular programming languages, fuelled by its open-source nature and a rich history of contributors from the fields of scientific computing, mathematics and engineering. With its clear and intuitive syntax, Python is arguably more user-friendly than traditional languages such as Java and C++.
For young data scientists and data science aspirants, here is our quick guide that covers the libraries, functions, models and integrated development environments (IDEs) most frequently used in Python.
Data science solutions typically have four elements:
1. Data extraction
The data sources used most often in the finance domain, and the Python libraries and functions used to extract data from them, include the following:
- Databases
- Flat files
- Web scraping
- PDF reports
  - Traditional libraries: PyPDF2, pdfminer, tika and PyMuPDF (imported as fitz)
  - OCR libraries: textract and pytesseract
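As a rough illustration of the database and flat-file sources listed above, the sketch below pulls rows from each into a common list-of-dicts shape. It uses only the standard library (`sqlite3`, `csv`) so that it runs without extra installs. The table name, column names, sample values and the two helper functions are hypothetical and are not taken from the article; a real project would more likely connect to a production database and might use pandas for the same job.

```python
import csv
import io
import sqlite3


def rows_from_database(conn, query):
    """Run a query and return each row as a dict keyed by column name."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def rows_from_flat_file(handle):
    """Read a CSV file-like object into a list of dicts (values stay strings)."""
    return list(csv.DictReader(handle))


# Hypothetical in-memory database standing in for a real data source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (ticker TEXT, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("AAA", 10.5), ("BBB", 20.0)],
)
db_rows = rows_from_database(
    conn, "SELECT ticker, close FROM prices ORDER BY ticker"
)
conn.close()

# Hypothetical CSV content standing in for a flat file on disk
flat_file = io.StringIO("ticker,volume\nAAA,100\nBBB,250\n")
file_rows = rows_from_flat_file(flat_file)
```

Returning the same structure from both sources keeps the later preparation step independent of where the data came from. Note that the CSV reader leaves every value as a string, so type conversion is still needed afterwards.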
2. Data preparation and analysis
Data preparation and analysis methods depend on the type of data:
- pandas: to perform data transformations, aggregations, validations and cleansing
- numpy: to carry out fast numerical dataset operations
- nltk: to perform a wide range of NLP pre-processing functions
- SciPy: to perform different mathematical computations
- scikit-learn: the most commonly used library for predictive analytics
- nltk: to perform NLP analysis, including sentiment scoring
- spacy: to classify entity names through entity recognition modelling
- textblob and vaderSentiment: to perform sentiment analysis on textual data
- keras: to build and transform deep-learning networks and datasets
- pytorch: to build advanced and flexible deep-learning models
- tensorflow: another advanced framework to build deep-learning models
- word2vec, glove, BERT and USE: pre-trained embedding and language models used to build advanced NLP solutions
- huggingface: open-source repository of many different kinds of pre-trained NLP models
- matplotlib, seaborn, and plotly: to visualise data through graphs
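To make the pandas and numpy entries above more concrete, here is a minimal sketch of the cleansing, validation, transformation and aggregation steps on a tiny table. The column names and values are hypothetical sample data, and this is only one possible way to arrange the steps, not a prescribed workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records containing one missing value and one duplicate row
raw = pd.DataFrame(
    {
        "ticker": ["AAA", "AAA", "BBB", "BBB", "BBB"],
        "close": [10.0, 11.0, 20.0, np.nan, 20.0],
    }
)

# Cleansing: drop rows with a missing price, then drop exact duplicate rows
clean = raw.dropna(subset=["close"]).drop_duplicates()

# Validation: stop early if any remaining price is not positive
if not (clean["close"] > 0).all():
    raise ValueError("close prices must be positive")

# Transformation: add a log-price column computed with numpy
clean = clean.assign(log_close=np.log(clean["close"]))

# Aggregation: mean closing price per ticker
summary = clean.groupby("ticker")["close"].mean()
```

Each step returns a new object rather than modifying `raw` in place, which makes it easier to inspect intermediate results in a notebook.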
3. Solution packaging
- jupyter notebook: to rapidly code, visualise and present data applications and reports
- powerbiclient: to visualise data insights and predictions
- dash: to build graphical and user interface web page components
4. IDE
Conclusion
Open-source programmers around the world are constantly improving Python and its open-source libraries and tools, making the language powerful, easy to use and agile. With leading financial institutions and corporations embracing hybrid cloud strategies to manage workflow during the pandemic and beyond, Python has become an integral catalyst for transforming global businesses.
About the Author
Sreeram Ramesh, Data Scientist, has over six years of experience in solving business problems using machine learning (ML), deep learning and statistical techniques. He has worked with multiple financial service providers and Fortune 500 companies, providing robust data science solutions. At Acuity Knowledge Partners, Sreeram creates artificial intelligence (AI)/ML engines to analyse textual data, and generate insights and proprietary scores. He holds a Bachelor of Engineering degree in Mechanical Engineering from Birla Institute of Technology and Science (BITS), Pilani.