Gamifying R: A Video Game To Unlock Potential

One of the projects I’ve been working on during the winter break of 2024/2025 is a video game for getting started with R. I’m usually thinking of new ways to communicate and share knowledge during my spare time. I’ve built a website (RGalleon.com), written a textbook (An Introduction to R for Non-Programmers), and taught short courses and university classes on various aspects of R, Python, and/or SQL.

Developing the Video Game

However, I’ve been noticing that many of my students are now on their phones – all the time! I tried to think of a way that I could better connect and share my expertise with those individuals. In the summer of 2024, I tried developing an app, but it got put on the back burner. I buckled down during the winter break and made some serious progress. I tried my hand at Ren’Py, as it is Python-based. Since I already know Python, the syntax was easier for me to pick up than that of other video game engines, which made developing the game much easier. Ren’Py also has a feature that exports your work to Xcode for iOS development. (It has some small bugs when porting the game, but does so much of the work that it is still VERY helpful.) At this point, I have a game that works and has the main functionality that I want.

Below is the opening screen:

The Basic Look and Functionality

Once the user starts the game, they will see a screen that looks like this:

The game is designed to look like a phone texting conversation. I am hoping that this makes folks who are always on their phones a lot more comfortable with the experience. By pressing the play icon, the user can progress the conversation. At certain points, the user will have different options to select, and their choice may prompt different responses. Below is an example of the user’s message and the subsequent responses:

Since the app doesn’t have R built into it, I decided to include images of output so that the user can see what code will look like in R. Here’s an example of that:

The user will be presented with questions throughout the conversation. During these points, the user will have to make different choices. If the user makes an incorrect choice, the Professor character will explain why the choice is incorrect, and will give the user another chance to select the right solution. Once the user selects the right answer, the user will be able to progress.
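The retry-until-correct flow described above can be sketched in plain Python. This is only an illustration of the logic, not the game's actual Ren’Py script; the `Question` class, the `run_question` function, and the sample question are all hypothetical names made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Question:
    prompt: str
    choices: list[str]
    correct: int                   # index of the right choice
    explanations: dict[int, str]   # why each wrong choice is wrong


def run_question(question, answers):
    """Replay a sequence of chosen indexes and collect the Professor's replies."""
    replies = []
    for choice in answers:
        if choice == question.correct:
            replies.append("Correct! Let's move on.")
            break  # the conversation only progresses after a right answer
        # Wrong choice: explain why, then the user gets another attempt.
        replies.append(question.explanations.get(choice, "Not quite, try again."))
    return replies


# Hypothetical sample question (not taken from the game).
q = Question(
    prompt="Which operator assigns a value to a variable in R?",
    choices=["<-", "==", "%%"],
    correct=0,
    explanations={
        1: "== tests equality; it does not assign.",
        2: "%% is the modulo operator.",
    },
)

transcript = run_question(q, answers=[1, 2, 0])
```

Here the user picks two wrong options before the right one, so `transcript` holds two explanations followed by the confirmation message.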

Looking Forward

At this stage, I am still working out some kinks and polishing the game up (e.g., removing some buttons, including a tutorial for navigation, etc.). I hope to be able to release it during the first quarter of 2025 – so please stay tuned! 🙂

Is the DJIA better than the S&P 500?

Disclaimer: Before making any investment decisions based on this analysis (or any financial content on the internet), consult with a financial professional (https://www.youtube.com/watch?v=ILsXSJeF9Xc). This blog post is for informational and educational purposes only.

In the world of investing, choosing the right index fund can be a daunting task. Two popular options are the Dow Jones Industrial Average (DJIA or sometimes abbreviated here as DOW) and the S&P 500. This post explores their historical performance to see if one might be a better choice.   If you are interested in learning how to perform analyses like this, consider one of my data science courses: https://wp.me/P5xMk4-5p

Methods

We used several methods to compare and contrast the indexes.  Some of these were traditional, such as computing summary statistics of the returns (such as the average return).  We also calculated different moving averages as a useful benchmark for medium- to long-term performance.  

We also used a more sophisticated approach, called the bootstrap, to understand the returns of both indexes.  To understand the bootstrap, imagine you have the historical return data for both the Dow Jones and the S&P 500. The bootstrap algorithm is a fancy way to create many “fake histories” of returns, helping us understand how reliable the statistics we calculate from the real data might be.

Here’s how it works:

  1. Resampling with Replacement: Think of randomly grabbing yearly returns from the historical data, but with a twist: you put each one back before grabbing the next! This allows you to create a new “fake history” with potentially duplicate years. 
  2. Creating Many Fake Histories: We repeat this grabbing and replacing process hundreds of thousands of times, creating a whole collection of these “fake histories” for both DJIA and S&P 500. Each fake history has the same number of years (like 30), but which years appear, and how many times each appears, changes from one fake history to the next.
  3. Analyzing Each Fake History: Now, for each fake history, we calculate statistics like average return, just like we did with the real data. This gives us a sense of how much these statistics would vary if random chance had handed us a different set of years.  With enough fake histories, we can even describe the distributions of these statistics.
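The three steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration and not the code from the analysis itself; the function name `bootstrap_statistic` and the sample returns are made up for the example.

```python
import random
import statistics


def bootstrap_statistic(returns, stat=statistics.mean, n_resamples=10_000, seed=0):
    """Return one value of `stat` per fake history."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(returns)
    results = []
    for _ in range(n_resamples):
        # Steps 1 and 2: draw n years WITH replacement to build one fake history.
        fake_history = rng.choices(returns, k=n)
        # Step 3: compute the statistic on that fake history.
        results.append(stat(fake_history))
    return results


# Made-up yearly proportion changes, for illustration only (not real index data).
sample_returns = [1.12, 0.95, 1.30, 1.08, 0.88, 1.21, 1.02, 1.15]

boot_means = bootstrap_statistic(sample_returns, statistics.mean, n_resamples=2000)
boot_medians = bootstrap_statistic(sample_returns, statistics.median, n_resamples=2000)
```

The lists `boot_means` and `boot_medians` are the bootstrap distributions; plotting a histogram of each is one way to compare two indexes.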

Why it’s Useful for Comparing Distributions:

By repeating this resampling process a large number of times, the bootstrap generates a distribution of summary statistics (mean, median, standard deviation etc.) for both DJIA and S&P 500 returns. This allows you to:

  • Compare Variability: Analyze how much the summary statistics (like mean return) vary between the two indexes. A wider spread in the bootstrap distribution suggests more variability in the statistic.
  • Distribution Shape: Visualize the distribution of these statistics using techniques like histograms or density plots. This can reveal if one index has a more skewed distribution of returns compared to the other.

By comparing the bootstrap distributions of the DJIA and S&P 500, we’ll gain a deeper understanding of how consistent their returns are, how much they might fluctuate, and any potential differences in their return distributions.  If you are interested in learning how to perform the bootstrap in situations like this, consider my data science bootcamp: https://wp.me/P5xMk4-5p

Initial Analysis

We compared the annualized historical returns of the DJIA and the S&P 500 over the same time period (1928 to 2023). While initial plots suggested the Dow Jones might even outperform the S&P 500, further analysis revealed a different story. The code and output from this analysis are provided at my GitHub link: https://github.com/billyl320/sp500_dow_compare 

The above plot represents the returns as “proportion change”.  Proportion change is defined as one plus the annualized return.  For instance, a return of 40% would equate to a proportion change of 1+0.40 = 1.40.  A return of -40% would equate to a proportion change of 1-0.40 = 0.60.  We can see that the histograms of the proportion change are fairly similar.  There is even an (albeit very, very unlikely) chance that the DJIA may have the years of greatest return, since it has the maximum value across both indexes.  
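As a quick check of the arithmetic above, here is a small Python sketch. The helper names `proportion_change` and `compound` are made up for this example. One convenient property of proportion changes is that they multiply across years.

```python
def proportion_change(return_pct):
    """Convert a percent return (e.g. 40 or -40) into a proportion change."""
    return 1 + return_pct / 100


def compound(changes):
    """Multiply yearly proportion changes to get the overall growth factor."""
    total = 1.0
    for change in changes:
        total *= change
    return total


up = proportion_change(40)      # 1.40
down = proportion_change(-40)   # 0.60
overall = compound([up, down])  # about 0.84: a +40% year then a -40% year is a net loss
```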

Average Returns Are Similar

Financial experts often cite similar average returns for both indexes across different time periods [1, 2, 3]. Our analysis confirms this. While both indexes have experienced periods of strong growth and decline, their long-term average returns tend to be close (this is especially true in more recent years).  Below are the 10 year and 30 year moving averages of both indexes.  The x axis is time, where larger values indicate more recent years.  
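For readers who have not computed one before, a moving average can be sketched as below. This assumes a simple (unweighted) trailing window, which may differ from the exact calculation in the linked repository; `moving_average` is a name made up for this example.

```python
def moving_average(values, window):
    """Simple trailing moving average: one output per full window of `window` values."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[end - window:end]) / window
        for end in range(window, len(values) + 1)
    ]


# Toy numbers for illustration only.
smoothed = moving_average([1, 2, 3, 4, 5], window=3)  # [2.0, 3.0, 4.0]
```

With yearly returns, `window=10` or `window=30` would give the 10 year and 30 year versions; a series of n years yields n - window + 1 averages.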

Median Returns Tell a Different Story

However, looking deeper, we found a larger difference in the medians. Recall that the median represents the “middle” value in a dataset, where half of the data is less than the median and the other half is greater.  The average (or mean) is not guaranteed to have half of the data on either side of it. 
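A tiny example makes the difference concrete. The numbers below are made up to include one outsized year; they are not taken from either index.

```python
import statistics

# Four ordinary years and one outsized year (made-up proportion changes).
returns = [1.02, 1.03, 1.04, 1.05, 1.60]

mean = statistics.mean(returns)      # pulled upward by the 1.60 year
median = statistics.median(returns)  # the middle value, 1.04

# Count how many years fall below each summary.
below_mean = sum(r < mean for r in returns)      # 4 of the 5 years
below_median = sum(r < median for r in returns)  # 2 of the 5 years
```

Four of the five years sit below the mean, while the median splits the data evenly. That sensitivity to extreme years is one reason the two summaries can disagree.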

The above histogram compares the distributions of the means of the DJIA and S&P 500 (in blue and red, respectively).  The distributions are very similar and would suggest that either index would give similar rates of performance.  

The above histogram compares the distributions of the medians of the DJIA and S&P 500 (in blue and red, respectively).  The distributions are very different, which suggests that the two indexes behave differently.  The DJIA has two peaks, one closer to a 5% annualized return and one closer to 15%.  The S&P 500 has one primary peak around a 15% annualized return.  This suggests that higher returns are more likely with the S&P 500 than with the DJIA, despite the medians having a similar range.  

This got me thinking, “Is there a good way to represent how different these distributions are?”  I created a plot comparing, for both indexes, the proportion of bootstrap values less than a given value (this is very similar to an estimated CDF plot).  I first looked at this plot for the means, and then for the medians.  

The lines for the means are very similar, with no noticeable differences in the plot.  Generally speaking, we would expect a better performing index to have a line closer to the x axis for a longer stretch of proportion change, since that means a smaller share of its distribution sits at low values.   

However, when we look at the median counterpart, we can see some deviations between the x axis values of about 1.06 to 1.14. The next plot shows a zoomed-in section of this part of the plot.  

This plot shows that there is a large discrepancy between the two indexes.  For example, the distribution of the S&P 500 median has less than 20% of its values below a proportion change of 1.08 (an 8% return).  Conversely, the distribution of the DJIA median has over 40% of its values below 1.08.  The distributions begin to converge after 1.14.
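The “proportion less than a given value” curves discussed above can be computed with a short helper. This is a sketch under my own naming (`proportion_below`, `ecdf_curve`), and the two lists are made-up stand-ins for bootstrap medians, not the real results.

```python
from bisect import bisect_left


def proportion_below(values, threshold):
    """Share of values strictly less than `threshold`."""
    ordered = sorted(values)
    return bisect_left(ordered, threshold) / len(ordered)


def ecdf_curve(values, grid):
    """Evaluate the proportion-below curve at each point of `grid`."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_left(ordered, x) / n for x in grid]


# Made-up bootstrap medians for two hypothetical indexes.
index_a = [1.05, 1.06, 1.07, 1.12, 1.15]
index_b = [1.09, 1.11, 1.14, 1.15, 1.16]

share_a = proportion_below(index_a, 1.08)  # 0.6
share_b = proportion_below(index_b, 1.08)  # 0.0
curve_a = ecdf_curve(index_a, [1.00, 1.08, 1.20])  # [0.0, 0.6, 1.0]
```

Plotting `ecdf_curve` for both indexes over the same grid gives the kind of comparison shown in the plots: the index whose curve stays lower has less of its distribution at small values.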

So, Which Index is Better?

Based purely on historical data and risk tolerance, the S&P 500 appears to be the better choice. Here’s why:

  • More Consistent Growth: The distribution of the S&P 500’s median return suggests steadier growth potential since it only has one major peak.
  • Greater Diversification: The S&P 500 tracks 500 companies, offering broader diversification compared to the Dow Jones’s 30 companies.

Are There Reasons to Choose the Dow Jones?

While the S&P 500 might be the analytical favorite, there are potential reasons to consider the Dow Jones:

  • Social Investing: An investor might have issues investing in many of the companies in the S&P 500.  Investing in the DJIA might provide a list of less problematic companies for the investor.  
  • Easier for Self Management: It also might be easier to create a self-driven portfolio without using investment products like ETFs using the DJIA than the S&P 500.  
  • Potential for Higher Returns (with Higher Risk): The Dow Jones has experienced some periods of higher returns than the S&P 500 (due to the one year with outsized returns for the DJIA in the historical data). However, this comes with the risk of larger potential losses.

The Final Word

The “better” index depends on your investment goals. If you prioritize consistent growth and diversification, the S&P 500 might be ideal. However, if you’re comfortable with potentially higher risk for potentially higher rewards, the Dow Jones could be an option.

Is there anything that I didn’t consider that you would have? Anything that you might have done differently? Let me know and perhaps I can do a follow up to this!

Remember: Regardless of which index you choose, consulting with a financial professional is crucial before making any investment decisions. They can help you create a personalized investment plan that aligns with your financial goals and risk tolerance.

Unlock Your Inner Sherlock Fire: Analytical Minds Need to Journal

While crunching numbers and dissecting data might be your forte, there’s a valuable tool waiting to be unlocked: journaling. Yes, journaling. Don’t let the seemingly sentimental vibe fool you; for those who thrive on logic and reason, journaling offers a surprising treasure trove of benefits.

Journaling 101:

Let’s break it down. Journaling simply means capturing your thoughts and experiences in writing. Whether you prefer the tactile satisfaction of pen and paper or the convenience of digital apps, the choice is yours. Both offer unique advantages:

  • Digital: Quick, portable, and easily searchable, perfect for on-the-go capture and organization.
  • Analog: Creates a tangible connection to your thoughts, fostering deeper reflection and mindfulness.

Unlocking the Power:

As an analytical thinker, you might wonder what journaling has to offer. Here’s the secret: it pushes you outside your comfort zone. It’s less about numbers and models, and more about exploring emotions, motivations, and personal growth. This activates a different set of skills, one that is also valuable to nurture and grow.

Journaling also helps recenter yourself. Immersing yourself in your thoughts allows you to step back, analyze patterns, and gain clarity on your life’s direction. Think of it as a mental detox, clearing the clutter to see the bigger picture.

My Journey with Journaling:

Here’s how I’ve incorporated journaling into my personal life:

  • Gratitude Journal (Daily): Just a few lines each day, expressing appreciation for the good stuff, big or small. It’s a simple practice with powerful impact, shifting focus towards positivity.
  • Analog Journaling (About 3 times a week): Longer entries, delving deeper into thoughts, experiences, and challenges. This allows for introspection and contemplation, fostering self-awareness and growth.
  • Random Thoughts Journal (As Needed): Need to clear your mind? Jot down those fleeting thoughts and ideas. This helps declutter your mental space and can spark inspiration for further exploration.

Getting Started:

Don’t feel intimidated! Start small and find what resonates with you. Here’s my advice:

  • Begin with Random Thoughts and Gratitude: These are low-pressure, easy to maintain, and require minimal time (5 minutes a day can work wonders!).
  • Gradually Explore Analog Journaling: Start with shorter entries and gradually increase the duration as you get comfortable. Even aiming for 15 minutes 3 times a week can be a great initial goal to help get started!
  • Find Your Tools: Whether it’s a simple notebook or a dedicated app, choose something you enjoy using. I find using Notability on my iPad with an Apple Pencil feels surprisingly like analog writing.

Remember, this is your journey. Experiment, discover what works for you, and unleash the hidden potential within your analytical mind. You might be surprised at the insights and growth that journaling unlocks. So, grab your pen, open your app, and embark on this exciting adventure of self-discovery!

Note: Bard was used to help write this article.  Midjourney was used to help create the images presented in this article.

Friendly Fire to Explainable AI: How to Trust Algorithms

OpenAI’s recent collaboration with the Pentagon on cybersecurity and veteran suicide prevention projects has sparked important conversations about the ethics and implications of artificial intelligence in critical, high-stakes domains. While OpenAI assures responsible development and a ban on weaponry, there’s one crucial consideration that deserves further attention: explainability.

When dealing with applications that carry life-altering consequences, opaque AI algorithms simply aren’t good enough. Take, for example, a predictive model informing veteran suicide risk. A black box churning out a binary “high risk” verdict without explanation is ethically unjustifiable. Imagine the immense psychological burden on the flagged individual, the potential harm of unwarranted interventions, and the erosion of trust in the system (see Dr. Pershing in Season 3 Episode 3 of The Mandalorian for an example of a patient receiving unhelpful care from an AI robot/droid).

Explainability transcends the “right to know.” It’s a moral imperative in critical applications. In healthcare, understanding why a diagnosis is reached guides treatment decisions.  In law enforcement, knowing the reasoning behind suspect identification ensures fairness and accountability. These principles extend to the military, where AI-powered algorithms might contribute to targeting decisions or risk assessments (To be clear, DARPA has been thinking about these sorts of issues).

The stakes are simply too high to rely on blind faith in an algorithm’s output. We need models that not only deliver accurate predictions but also offer clear, human-interpretable insights into their reasoning. This allows for:

  • Accountability: When wrong decisions are made, explainability facilitates tracing the error back to its source, enabling improvement and mitigating future harm.
  • Building trust: Transparency fosters trust between humans and the AI systems they interact with, crucial for long-term acceptance and effective collaboration.
  • Human oversight: Even with advanced AI, critical decisions ultimately lie with humans. Explainability empowers humans to understand the AI’s reasoning, challenge its conclusions, and ultimately make informed judgments.

Fortunately, advancements in AI research are paving the way for more explainable models.  From feature importance analysis to bread and butter machine learning approaches, various techniques offer glimpses into the inner workings of algorithms. While challenges remain, the pursuit of explainable AI is not only a possibility, but a pressing necessity.  I provide an overview of explainable AI in my textbook chapter, “An Overview of Explainable and Interpretable Artificial Intelligence”.

This blog post is just a starting point for a broader discussion. Do you think explainability is truly achievable in complex AI systems? How can we balance accuracy with transparency in critical applications? Let’s work together to ensure AI, with all its potential, is used for good, and explainability plays a key role in that journey.

Note: Bard was used to help write this article.  Midjourney was used to help create the image(s) presented in this article.

Statistical Sorcery and Data Alchemy: The Hidden Magic of Numbers

In the age of big data, two professions stand out as masters of making sense of it all: data scientists and statisticians. Both wear the analytical hat, but beneath it, each field differs in training and emphasis. Let’s explore the similarities and differences between these data whisperers.

Statistics: The Bedrock of Inference

Statisticians are the architects of rigorous experimentation and mathematical model building. Their toolbox brims with R, a language tailor-made for statistical analysis.  (If you are looking into getting started with R, consider looking into my intro to programming textbook using R.  If you prefer a video format, I also have a video series on the topic.) They wield traditional methods like the t-test to unveil relationships between variables and draw conclusions with confidence. Their strength lies in the solid theoretical foundation behind their methods, ensuring reliable and interpretable results.

Data Science: Adapting to the Tsunami of Data

Data scientists, on the other hand, are the agile surfers riding the wave of big data. Python helps them navigate through messy, unstructured datasets. They embrace performance-centric approaches like Support Vector Machines (SVMs) and Random Forests to build accurate predictive models.  If you are interested in getting started with building these kinds of models, I would suggest the Introduction to Statistical Learning with R (ISLR 2nd Edition Affiliate Link, Non-Affiliate Free PDF Link).  If you prefer a video format, I created an intro to machine and statistical learning video series.  While mathematical theory isn’t absent, the focus leans more towards finding the best tool for the job, regardless of its theoretical pedigree.

Bridging the Divide: Where They Converge

Despite their distinct styles, these data gurus share some vital common ground:

  • Communication: Both speak the language of insight, translating complex numbers into actionable stories for business stakeholders.
  • Visualization: Data is more than just numbers; it’s a story waiting to be told. Both statisticians and data scientists master the art of compelling visualizations to make their findings come alive.
  • Actionable Insights: Ultimately, both professions strive to use data to solve real-world problems. Whether it’s predicting greenhouse emissions or optimizing marketing campaigns, their insights drive data-driven decision making.

So, who is better equipped to unravel the patterns within data? The truth is, there’s no one-size-fits-all answer. Each profession and perspective brings unique strengths to the table, and the choice depends on the specific problem at hand. Statisticians offer theoretical rigor and interpretability, while data scientists excel at flexibility and performance.

The ideal scenario? A synergy of these two worlds. Imagine a team where statisticians provide the theoretical grounding and data scientists unleash the power of modern tools. It’s a collaboration that promises to unlock the true potential of data, transforming every industry from healthcare to finance and beyond.

So, the next time you’re drowning in data, remember, you don’t have to choose between these data heroes. Let them join forces, and watch the insights flow!

Note: Bard was used to help write this article.  Midjourney was used to help create the image(s) presented in this article.

Friendship or Firestorm: Delving into the Inferno of AI’s Mind

Artificial intelligence (AI) is the buzzword of the past year or so. From personalized shopping recommendations to self-driving cars, it feels like AI is infiltrating every facet of our lives. But with this ever-growing presence comes a critical question: is AI dangerous?

Defining the Beast:

First, let’s be clear about what we’re talking about. AI isn’t some omnipotent robot overlord (*laughs nervously*). It’s a broad term encompassing algorithms that can learn and make decisions without the need for human input. These algorithms range from simple recommendation engines to complex systems powering medical diagnosis.  The FDA has also been thinking about these ideas, and about AI’s application to medical devices, for a number of years.  

The Good, the Bad, and the Algorithmic:

AI undeniably offers countless benefits. It streamlines processes, automates tedious tasks, and even has the potential to help save lives. But beneath the gleaming surface lie potential pitfalls. One key concern is the cost of AI mistakes (I talk about this idea a bit in my textbook chapter which is available as a paperback at this affiliate link). When an algorithm makes an error, the consequences can range from mild annoyance (a bad movie recommendation) to catastrophic (a misdiagnosed illness).

Example 1: Level Up, Game Over?

Consider the world of video games. AI-powered opponents are becoming increasingly sophisticated, offering a more realistic and challenging experience. However, a poorly designed AI could lead to frustrating, unfair gameplay, pushing players away. The AI could even make a benign error that makes the environment in a particular scene look jarring, taking away from the immersive experience players expect. This, while not world-ending, demonstrates the importance of responsible AI development to ensure positive user experiences.  However, in the grand scheme of things, an AI making a mistake here doesn’t directly lead to catastrophic results.  Maybe the water doesn’t look exactly right, but it’s not like someone died.  (Quick aside: If a game was so buggy and unplayable due to a reliance on a bad AI, a whole team or company could lose their jobs, which would be a severe downside.)

Example 2: National Security on Auto-Pilot?

Now, the stakes get higher when it comes to national security. Imagine AI being used in national security applications, from analyzing intelligence to making critical decisions in high-pressure situations. While AI can process vast amounts of data and identify patterns humans might miss, the potential for unintended consequences is immense. A misattribution of enemy activity or a faulty algorithm triggering an autonomous weapon could have devastating real-world repercussions.  DARPA has been thinking about how to utilize AI in an explainable and safe manner for a number of years. Claiming that AI will solve all of our problems is a lofty claim, as implementing solutions in high stakes scenarios is extremely challenging.   

Conclusion: Not Monsters, but Tools

So, is AI dangerous? The answer isn’t a simple yes or no. It’s a potent tool, like any technology, capable of immense good and devastating harm. The key lies in responsible development, rigorous testing, and clear ethical guidelines to ensure AI serves humanity, not the other way around. We must approach AI with cautious optimism, acknowledging its potential risks while harnessing its power for a better future.

Note: Bard was used to help write this article.  Midjourney was used to help create the image(s) presented in this article.

Don’t Just Hope: Binge-Learn Data Science This New Year

Welcome, fellow data enthusiasts, to the precipice of a new year! As 2023 gracefully exits stage left, we stand poised on the threshold of 2024, a blank canvas brimming with possibilities. For many, this translates to resolutions, aspirations, and perhaps the ever-present yearning to conquer the enigmatic realm of data science.

This blog post is your armor against the inevitable doldrums, your compass through the labyrinthine world of data, and your ultimate guide to sticking with data science throughout 2024.

Charting Your Course: A Roadmap to Success

First things first, you need a roadmap. Think of it as your personal GPS, guiding you through the dense forest of algorithms and statistical models. There are plenty of excellent resources available online, but I recommend checking out these gems:

  • DataCamp: Structured learning paths with bite-sized, interactive lessons.
  • Kaggle: Learn by doing with real-world datasets and a vibrant community of data scientists.
  • Coursera: Specializations from top universities and industry leaders.
  • My content: If you are just starting out with programming, consider looking into my intro to programming textbook using R.  If you prefer a video format, I also have a video series on the topic.

Remember, the perfect roadmap is the one that works for you. Don’t be afraid to customize it, experiment with different resources, and find what ignites your inner data scientist.

Fueling the Fire: Staying Motivated

Data science is a marathon, not a sprint. There will be days when the code doesn’t compile, the models refuse to cooperate, and you feel like you’re banging your head against a statistical wall. But fear not, for even the mightiest data wranglers face these hurdles. Here’s how to stay motivated:

  • Set achievable goals: Break down your learning into smaller, manageable chunks. Completing these mini-quests will give you a sense of accomplishment and keep you moving forward. 
  • Find your community: Join online communities, forums, or local meetups to connect with other data enthusiasts. Sharing your struggles and successes can be incredibly motivating.
  • Celebrate the wins: Take the time to appreciate your progress, no matter how small. Did you finally understand the concept of p-values? High five yourself! Baked a machine learning-themed cake? Share it with your fellow data warriors!
  • Remember your “why”: Remind yourself why you embarked on this data-driven odyssey in the first place. Is it to solve real-world problems? Make a difference in the world? Fuel your passion for data and let it guide you through the tough times.

Sharpening Your Tools: Practice Makes Perfect

Data science is not a spectator sport. To truly master this craft, you need to get your hands dirty. Here are some ways to put your theoretical knowledge into practice:

  • Work on personal projects: Find a dataset that sparks your curiosity and build something cool with it. Analyze your favorite movie ratings, predict the next stock market trend, or create a tool to solve a problem you face in your daily life.
  • Participate in hackathons: These timed coding competitions are a great way to test your skills under pressure and learn from other data scientists.
  • Contribute to open-source projects: Lend your expertise to existing projects and gain valuable experience while giving back to the community.

Remember, the more you practice, the more confident and skilled you’ll become. So, don’t be afraid to experiment, make mistakes, and learn from them. Every line of code, every failed model, is a stepping stone on your path to data science mastery.

Remember, the journey of a data scientist is not a solitary one

We are a community of curious minds, united by our passion for extracting insights from the ever-growing ocean of data. So, let’s embark on this exciting adventure together, armed with our roadmaps, fueled by motivation, and ever-honing our skills through practice. Together, we can conquer the dataverse in 2024 and beyond!

Note: Bard was used to help write this article.  Midjourney was used to help create the images presented in this article.

3 Essential Python Looms for Unraveling the Data Oracle’s Destiny

For those who dare to plumb the depths of the digital unknown, fear not! Within the Python language lies a trove of libraries, ready to empower your quest for knowledge. Today, we delve into three of the most potent libraries at the data scientist’s disposal: NumPy, Pandas, and scikit-learn.

NumPy, the Swift Elixir

 Imagine swirling numbers into a shimmering vial. This is the magic of NumPy, the master of efficient calculations. Forget clunky lists and for loops! NumPy conjures multi-dimensional arrays, where data is organized in a manner that is efficient for various complex calculations. I personally use NumPy arrays as a format to organize image data in a clean manner. From vectorized calculations to matrix manipulations, NumPy is the fuel that propels your data analyses from a snail’s crawl to a cheetah’s sprint.
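To make the “no clunky for loops” point concrete, here is a minimal sketch. The numbers are made up, and the 4x4 image is just a stand-in for the kind of image array mentioned above.

```python
import numpy as np

# Vectorized arithmetic: one expression operates on every element at once.
returns_pct = np.array([12.0, -5.0, 30.0, 8.0])  # made-up percent returns
proportion_change = 1 + returns_pct / 100        # no explicit loop needed
growth = proportion_change.prod()                # overall growth factor

# A tiny RGB "image" stored as a (height, width, channel) array.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[:, :, 0] = 255                 # fill the red channel everywhere
red_mean = image[:, :, 0].mean()     # average red intensity
```

Slicing with `image[:, :, 0]` selects a whole channel without copying pixel by pixel, which is a large part of what makes arrays convenient for image data.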

Pandas, the Data Sculptor

 But raw data, like unhewn ore, requires refinement. Enter Pandas, the alchemist’s chisel. This library cleanses and shapes your data for many applications, transforming spreadsheets into glistening dataframes. Missing values vanish, inconsistencies smoothen, and columns align like soldiers under a data-driven banner. Indexing, merging, and grouping become much easier, each incantation revealing the hidden structure within your datasets. Pandas is the potter’s wheel, molding data into forms ready for analysis and prediction.
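Here is a small sketch of the cleaning and grouping steps described above. The table and its column names (`index_name`, `year`, `return_pct`) are made up for illustration.

```python
import pandas as pd

# A small made-up table with one missing value.
raw = pd.DataFrame({
    "index_name": ["DJIA", "DJIA", "SP500", "SP500"],
    "year": [2001, 2002, 2001, 2002],
    "return_pct": [5.0, None, 7.0, 9.0],
})

# Drop rows where the return is missing.
clean = raw.dropna(subset=["return_pct"])

# Group by index and average the remaining returns.
avg_by_index = clean.groupby("index_name")["return_pct"].mean()
```

Dropping rows is only one way to handle missing values; whether it is appropriate depends on the data and the question being asked.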

Scikit-Learn, the Seer of Patterns

 Now, with your data polished and primed, you yearn to peek through the veil of the unknown. This is where scikit-learn emerges, a grimoire of potent algorithms, each a key to unlock the secrets hidden within your numbers. Regression, classification, clustering – these are the algorithmic spells at your disposal. Training algorithms on these data allows the incantor to discern patterns and trends. With each line of code, you imbue these models with the hidden patterns of your data, transforming them into seers that glimpse the future, predict outcomes, and reveal correlations unseen by mortal eyes.
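As a taste of the regression “spell”, here is a minimal sketch that fits a straight line to toy data generated from y = 2x + 1. The data are made up so the right answer is known in advance.

```python
from sklearn.linear_model import LinearRegression

# Toy data that follow y = 2x + 1 exactly.
X = [[1], [2], [3], [4]]   # one feature per row
y = [3, 5, 7, 9]

model = LinearRegression().fit(X, y)   # learn slope and intercept from the data
prediction = model.predict([[5]])[0]   # should be very close to 11
```

The same `fit` then `predict` pattern applies to scikit-learn’s classification and clustering estimators as well, which keeps the learning curve gentle once the first model works.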

But remember, young alchemists, these elixirs are potent. Like any great power, data analysis demands responsibility. Master the craft, understand the algorithms, and wield these libraries with a steady hand. For within your grasp lies the potential to unravel mysteries, solve problems, and shape the course of the digital future. So, go forth, brew your own data-driven destiny, and remember: the true magic lies not in the libraries themselves, but in the questions you ask and the insights you extract from the swirling storm of information. Now, raise your flask of data, and let the data analysis begin!

Note: Bard was used to help write this article.  Midjourney was used to help create the image(s) presented in this article.

Sharing the Hope of Christmas Magic For Your Portfolio

Want to learn how to do data science over the holidays?  Once you know the basics (consider my intro to programming textbook using R or video series on the topic), it’s important to START a project! Here are a few holiday-themed ideas to get you started:

  • Most popular Christmas songs: Analyze streaming data to find the most listened-to Christmas songs over time, by region, or even by generation. You could even build a model to predict the next Christmas hit!
  • Gift-giving trends: Use e-commerce data to explore what people are buying the most for Christmas gifts. Analyze trends by age, gender, location, or price range. You could even predict the most popular gifts of the year.
  • Santa’s logistics: Use geographic data and airspeed calculations to estimate how Santa could possibly deliver all those presents in one night. Consider factors like time zones, weather conditions, and reindeer power!
  • Evolution of Christmas movies: Analyze movie ratings and release dates to see how Christmas movie trends have changed over time. You could even identify the most popular tropes or predict the next Christmas movie hit.
  • Visualize Christmas tree ornaments: Use image recognition to categorize types of Christmas tree ornaments, or build a tool that suggests ornament pairings based on color and style.
  • Identify charitable giving trends: Analyze donation data to see how people’s giving habits change around the holidays. You could explore which causes are most popular or how much is donated overall.  Further, you could try to replicate published reports from other analysts and explain any similarities or differences you observe.

Now that your creative gears are jingling, it’s your turn to take the reins! If you need some help getting started with model building, consider my intro to machine and statistical learning video series. Now – let’s build a collaborative Christmas data empire, one snowglobe-shaped insight at a time! Don’t be shy, data elves – the world needs your festive analytics magic!

Note: Bard was used to help write this article.  Midjourney was used to help create the images presented in this article. 

Unlock the Magic: Data Science with R’s Enchanting Elixirs

Forget bubbling cauldrons and cryptic chants – the modern data scientist wields R, and their laboratory brims with potent packages. Today, I unveil three essential packages for deriving data-driven insights: e1071, ggplot2, and caret. Brace yourselves, fellow data scientists, for we’re about to transmute raw data into shimmering pure gold!

If you are just starting out with programming, consider looking into my intro to programming textbook using R.  If you prefer a video format, I also have a video series on the topic.

1. Elemental Essence: e1071

Think of e1071 as your alchemist’s cabinet, overflowing with potent algorithmic elixirs. From naive Bayes classifiers to swirling support vector machines, it offers a dizzying array of tools to unravel the mysteries of your data. Whether you seek to predict customer churn with the precision of a crystal ball or cluster market segments like constellations, e1071 fuels your analytical fire.

If you are interested in getting started modeling with R, I would suggest the Introduction to Statistical Learning with R (ISLR 2nd Edition Affiliate Link, Non-Affiliate Free PDF Link).  If you prefer a video format, I created an intro to machine and statistical learning video series.

2. Crystallize Clarity: ggplot2

Data may whisper its secrets, but ggplot2 amplifies them into dazzling visual tapestries. This package is your potion for transmuting numbers into breathtaking graphs, charts, and maps. With its intuitive incantations and boundless flexibility, ggplot2 isn’t just for eye candy – it’s about weaving narratives from data that captivate both the scientist and your broader audiences.

3. The Crucible of Model Curation: caret

Crafting the perfect machine learning model can be a chaotic art. But fear not, aspiring alchemists – caret brings order to the chaos. This package orchestrates much of the process, from data preprocessing and splitting to model training and tuning. With caret, you can experiment with algorithms like alchemical ingredients, optimize hyperparameters with practiced precision, and ultimately declare the champion model, ready to unlock the secrets of your data.

So, how do these three reagents form the Data Alchemist’s ultimate elixir?

  • e1071 provides the raw power of algorithmic transmutation.
  • ggplot2 crystallizes insights into mesmerizing visual clarity.
  • caret stirs the cauldron of model creation with masterful efficiency.

Mastering these tools equips you to tackle real-world problems with the wisdom of Merlin himself. Predict stock market fluctuations, optimize resource allocation, or discover hidden patterns in social media – the possibilities are endless.

This is just the first step on our data scientist journey. Stay tuned for deeper dives into each package, secret spells for data wrangling, and thrilling adventures in the uncharted lands of data science. Now, grab your beakers, fire up R, and let’s transform the world with the alchemy of code!

Are there additional topics regarding data science you would like me to cover next? Consider reaching out to let me know what I should talk about next time!

Note: Bard was used to help write this article.  Midjourney was used to help create the image(s) presented in this article.

Essential Skills for Mastering the Arcane Art of Data Science in 2024

The US Bureau of Labor Statistics has pointed out the strong demand for skilled data scientists.  In my opinion, this is more crucial than ever as companies across industries are scrambling to harness the power of artificial intelligence (AI). But this isn’t just about weaving spells with algorithms; it’s about building bridges between raw data and people to make impactful results.

So, aspiring data wizards, what ingredients do you need to brew the perfect career potion in 2024? Let’s break down the essential skills you’ll need to master for 2024 and beyond!

1. Coding Alchemy: Python, R, and the SQL Elixir:

Think of programming languages as your incantations. Python, R, and SQL are the most potent brews in the data scientist’s cauldron. Python is prized for its versatility and vast libraries such as NumPy and Pandas. R, meanwhile, is the go-to for statisticians with its focus on statistical modeling and analysis. And don’t forget SQL, the language that unlocks the secrets hidden within databases. Mastering these languages isn’t just about writing code; it’s about understanding the logic and structure behind them, allowing you to wield them with precision and efficiency to complete tasks ranging from the mundane to the arcane.

If you are just starting out with programming, consider looking into my intro to programming textbook using R.  If you prefer a video format, I also have a video series on the topic.

2. From Raw Data to Refined Insights: Modeling the Future:

Data is the raw material, but the real magic lies in transforming it into actionable insights. This is where your analytical skills come into play. You need to be able to clean, wrangle, and explore data, identifying patterns and trends that might otherwise be elusive. Statistical modeling and machine learning algorithms are your tools for building predictive models, uncovering hidden relationships, and ultimately, understanding what the data is capturing in the world around us.

If you are interested in getting started modeling with R, I would suggest the Introduction to Statistical Learning with R (ISLR 2nd Edition Affiliate Link, Free PDF Link).  If you prefer a video format, I created an intro to machine and statistical learning video series.  The Python version of the textbook is also available (ISLP Affiliate Link, Free PDF Link). 

3. Bridging the Gap: From Geek to Guru:

Remember, data science isn’t just about interacting with machines; it’s about speaking to people. Your ability to translate complex findings into clear, concise, and compelling stories is crucial. Think of yourself as an interpreter, guiding stakeholders (such as team members, managers, or those whom you serve) through the labyrinth of data to actionable insights. Strong communication skills, both written and verbal, are essential for building trust and ensuring your work has a real-world impact.

4. The Unspoken Secrets: Soft Skills Make You a Sorcerer Supreme:

Beyond the technical wizardry, there are unspoken skills that make you a truly exceptional data scientist. Collaboration and teamwork are paramount, as you’ll often be working with engineers, analysts, and business leaders.  Further, being able to fit into the team culture is a critical component for enjoying your job.  So this isn’t something you can simply ignore and hope will work itself out.  

Remember, data science isn’t just about crunching numbers; it’s about applying creativity, critical thinking, and a collaborative spirit to solve real-world problems. So, hone your coding skills, refine your analytical abilities, and unlock the power of communication. With the right ingredients in your cauldron, you’ll be well on your way to becoming a data science sorcerer supreme in 2024 and beyond!

Are there additional topics regarding data science you would like me to cover next? Consider reaching out to let me know what I should talk about next time!

Note: Bard was used to help write this article.  Midjourney was used to help create the images presented in this article. 

JSM Wrap Up and Tips for Next Year Attendees

JSM 2017 Expo Entrance

After attending my first conference, JSM 2017, I would like to share some thoughts on what I learned conferences are about, what I wish I had known, and what to do moving forward. Let’s get started!

What JSM (or conferences overall) are about

Baltimore Inner Harbor.

JSM was about sharing new knowledge and connecting with old colleagues (and making new ones along the way). JSM had a lot of opportunities to attend interesting sessions where I could learn more about a subject or have a quick refresher course. Even big names in the field come and share their thoughts. I had the pleasure of hearing Dr. Robert Tibshirani talk about Statistical Learning.

But that’s only a small part of JSM. The other major part is connecting with others. I made several connections, some new friends and some new potential contacts. This is an important aspect of JSM, since it’s good to remind ourselves as academics and working professionals that a huge part of any profession, or life for that matter, is the human element. Sometimes that means skipping out on a lecture to (re)connect with someone – even if you don’t talk about statistics, though I’m sure the probability is high that you will! :]

What I Wish I Knew

What I wasn’t completely aware of was the job opportunities at JSM. There were a lot of interesting companies (e.g., Facebook, Amazon, SAS) and government agencies (e.g., Census, FDA) at JSM conducting interviews and staffing booths where you could talk to employees about their respective organizations. It was similar to a career fair at a university, but a lot smaller. The lines to talk to employers were basically nonexistent, it was less overwhelming, and the opportunities were focused. At a typical career fair, some companies may not be hiring in your area. But when a company is at a conference for your field, you can be certain that they will be looking for people just like you! Sometimes, you can even talk to employees who do a very similar job to what you could potentially be hired for.

What to do moving Forward

At future JSMs, I want to be more involved – and I recommend you do the same! When you are involved, it makes learning and meeting people so much easier. I chaired a session and presented a poster this year (which did help), and I hope to keep doing things like this, and even more. I also hope to get further involved by giving an oral presentation.

My poster at JSM 2017 on my website RGalleon.com

If you can, try to reach out to a section that you think is interesting (public health, education, imaging, risk, etc.) and get involved. For instance, reach out to the section’s head office and ask how you can help. A lot of times, they need volunteers to chair sessions.

Also, I want to find a balance between doing things and relaxing. Next JSM, I’ll try to do more activities to explore the city where the conference is located but also find time to recoup. Talking to lots of new people and learning many new things can be exhausting. Finding time to step away from it all can be helpful to recharge and get back into the fun!

If you have any other suggestions or comments about how to get the most out of JSM (or any conference for that matter), please feel free to start a conversation with me via Twitter.

See you at JSM 2018!

A Perspective on Googling “Health Care”: From 2008 to Now

A few days ago, Nate Silver stated here the following:

“We see that Google searches for “health care” — although not a perfect proxy for media coverage — have spiked for about a week at a time, only to fall back down again. Which could reflect the media’s short attention span for the story, or the public’s.”

This got me thinking: what has been the relative interest in the current Republican attempt at health care reform?  So I extended the time frame analyzed to be February 1, 2008, to July 6, 2017.   I recreated the plot below in R using ggplot2 (and provide the code to create it at the end of the post).

 

The figure shows the relative interest in searching “health care” in Google over time.  The x axis is the date.  The y axis is the interest relative to the most popular time “health care” was searched, which in this case was March 2010.  The scale goes from 0 to 100, where a value near 0 means the term was barely searched relative to the most popular point, a 50 means that the term was only half as popular, and 100 means that it was just as popular.  We can discuss if this is a good or bad metric, but let’s table that for another time (since it’s a long discussion).   In short, sometimes it’s good, others it’s bad.
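
To illustrate how a 0 to 100 relative scale like this works, here is a simplified sketch in Python. The raw counts are hypothetical (they are not Google data), and the function name is my own; Google’s actual normalization also adjusts for overall search volume, which this sketch ignores.

```python
def relative_interest(counts):
    """Rescale raw counts so that the largest value becomes 100."""
    peak = max(counts)
    return [round(100 * c / peak) for c in counts]


# Hypothetical raw monthly search counts (NOT real Google data).
raw = [200, 500, 1000, 290]

scaled = relative_interest(raw)
print(scaled)  # [20, 50, 100, 29]
```

The month with 500 searches lands at 50 because it had half the searches of the peak month, which is exactly how the 50 on the y axis should be read.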

The blue dot with a white triangle indicates the month when President Obama announced to a joint session of Congress he would actively pursue health care reform. The green dot with a white center dot indicates when Congress went on recess in August 2009. It was during this recess when a particularly large number of members of Congress first encountered the Tea Party. The blue dot with the cyan center indicates when the Affordable Care Act (ACA), aka Obamacare, passed in Congress. The green point with a white triangle is when the United States Supreme Court stated that the ACA was constitutional since it was considered a tax. The red and white point indicates when the House failed to pass health care reform in March 2017. The red dot and white triangle point indicates when the House passed the American Health Care Act (AHCA) to repeal and replace the ACA in May 2017.

I pointed out some of these events to give an idea of how popular searching “health care” was during some other events.  Note that the most popular moment for searching was when the ACA passed.  However, what’s interesting is that people seemed much more engaged and interested in finding out more about health care leading up to the passing of the ACA.  This does not appear to be the case for the GOP’s attempt at passing health care.  The event that appears most similar in interest to the GOP’s attempts is when the Supreme Court revealed its judgement on the constitutionality of the ACA.

In short, this means that the public has been pretty disengaged with the GOP’s attempts at health care reform!

This raises a lot of interesting questions.  Why is it that people appear less interested this time around?  Here are three (possible) ideas I have:

  1. Health care is messy.  Passing health care is complicated and confusing.  People do not want to think about reworking the health care system again! (I’m not aware of data to support this claim.  So it’s a complete shot in the dark.)
  2. There are a lot more distractions this time around.  Contention between Trump and the media and recent missile tests from North Korea are just two examples.  (Again, can’t find data.)
  3. There’s simply too little information available for the public to easily digest on the GOP’s attempts at reforming health care.  While the House’s bill is very unpopular according to Nate Silver, there is also a sizeable chunk of undecideds according to the YouGov poll.  When the ACA was in the works, the process was long and time consuming.  This attempt has been much faster (since it has come, died, and then been resurrected).  This has prevented the public from really thinking about it. (Yay! Data!)

If you have any ideas (and/or data) to investigate this further, I’d love to hear about it!  You can tweet it at me!

library(ggplot2)

dat<-read.csv(file="multiTimeline.csv", sep=",", header=FALSE) #note that I did remove the top of the csv file downloaded directly from Google

colnames(dat)<-c("Date", "Rel")

df<-as.data.frame(dat)

aca<- data.frame( Dat = "2010-03", Rel = 100 ) #when ACA was passed
oba<-data.frame(Dat = "2008-02", Rel = 29 ) #when Obama announced
                                          #intention to pass health care

house1<-data.frame(Dat="2017-03", Rel = 32)#Rep House fails to vote on AHCA
house2<-data.frame(Dat="2017-05", Rel = 29)#Rep House passes AHCA

sc<-data.frame(Dat="2012-06", Rel = 33) #SCOTUS decision on ACA and taxes

tea<-data.frame(Dat="2009-08", Rel = 57)# congress recess of aug 2009

mytheme<-theme(

	plot.title = element_text(lineheight=1.5, size=35, face="bold"),
	axis.text.x=element_text(size=23),
	axis.text.y=element_text(size=23),
	axis.title.x=element_text(size=28, face='bold'),
	axis.title.y=element_text(size=28, face='bold'),
	strip.background=element_rect(fill="gray80"),
	panel.background=element_rect(fill="gray80"),
	axis.text=element_text(colour="black")

	)

#general setup
p<-ggplot(data=df, aes(x=Date, y=Rel, group=1))+geom_line()+
   geom_point(data=df, aes( x=Date, y=Rel ))+
   xlab("Date")+
   ylab("Relative Interest")+
   ggtitle("Relative Interest in\nHealth Care Over Time")+
   theme(plot.title = element_text(hjust = 0.5) )

#important points
p<-p + geom_point(data=aca, aes(x=Dat, y=Rel), color="blue", size=4 )+
       geom_point(data=aca, aes(x=Dat, y=Rel), color="cyan" )+

       geom_point(data=oba, aes(x=Dat, y=Rel), color="blue", size=4 )+
       geom_point(data=oba, aes(x=Dat, y=Rel), color="white", shape=17)+

       geom_point(data=house1, aes(x=Dat, y=Rel), color="red", size=4 )+
       geom_point(data=house1, aes(x=Dat, y=Rel), color="orange")+

       geom_point(data=house2, aes(x=Dat, y=Rel), color="red", size=4)+
       geom_point(data=house2, aes(x=Dat, y=Rel), color="white", shape=17)+

       geom_point(data=sc, aes(x=Dat, y=Rel), color="forestgreen", size=4)+
       geom_point(data=sc, aes(x=Dat, y=Rel), color="white", shape=17)+

       geom_point(data=tea, aes(x=Dat, y=Rel), color="forestgreen", size=4)+
       geom_point(data=tea, aes(x=Dat, y=Rel), color="white")


#adding general layout
p<- p + mytheme + scale_x_discrete(breaks = c("2009-01", "2011-01",
                                               "2013-01", "2015-01", "2017-01"),
                                   labels = c("2009", "2011",
                                              "2013", "2015",  "2017")
                                              )

p

ggsave("rel_health.png")

 

Pokemon Hidden Abilities and Statistics

Disclosure: This post assumes that you are familiar with Pokemon and its newer features (in X, Y, and ORAS).

Breeding for a Pokemon with a Hidden Ability can be a major challenge, as it adds complexity to the breeding process. Moreover, when breeding a male and a female Pokemon that both have the Hidden Ability, there appears to be some discrepancy regarding the probability of the offspring having the Hidden Ability. IGN states that the probability is 80%, while sites such as Heavy state that the probability is 60%. Since there is no uniform consensus on this issue, it warrants investigation. Thus, we performed a statistical test (a more advanced one than what is typically seen in an introductory statistics course) to determine what the actual probability is. If you would like a more detailed explanation, you can see the work I did for this on my other blog here at RGalleon.com. Otherwise, I will discuss the basic summary of the results in this post.

However, why should we even care about this? Who actually cares about the probability of this certain event in a video game? Besides the fact that Pokemon is a fun game and that others and I want to fully understand it, there is a practical reason as well. In industry, it is important to be able to test your product before releasing it to the public. Nowadays, fixing problems can be much easier thanks to patches for video games, but broken games or software can still be developed and released. For example, if Apple released a broken iOS for the iPhone that essentially broke the device, users would flock to an alternative that worked, such as Android. Thus, it is still important to test software and products before releasing them.

The results of our study did indicate that the probability is more likely to be 60%.  So, IGN was probably incorrect. However, this study was done using only Eevee. If we assume that breeding probabilities are universal across all Pokemon capable of breeding, then the result generalizes. We do know, though, that gender proportions are not universal across Pokemon, so stating that all Pokemon share this probability is a liberal assumption. If we are more careful in our assumptions, we cannot generalize that the breeding probability is 60% for all Pokemon. Therefore, I would suggest that more research is needed. Still, I would not be surprised to find that the breeding probability is 60% for all Pokemon.
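
To give a flavor of how data can favor one claimed probability over another, here is a minimal sketch of an exact two-sided binomial test in Python. This is not the test used in the study (that one is described on RGalleon.com), and the counts below are hypothetical numbers chosen only for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(k, n, p):
    """Exact two-sided binomial test: add up the probabilities of every
    outcome that is no more likely than the observed one under p."""
    observed = binom_pmf(k, n, p)
    return sum(
        binom_pmf(i, n, p)
        for i in range(n + 1)
        if binom_pmf(i, n, p) <= observed * (1 + 1e-9)
    )


# Hypothetical counts, NOT the data from the original study.
n_offspring = 100  # eggs hatched
n_hidden = 62      # offspring with the Hidden Ability

p_60 = two_sided_p_value(n_hidden, n_offspring, 0.6)
p_80 = two_sided_p_value(n_hidden, n_offspring, 0.8)
print(p_60, p_80)
```

With counts like these, the p-value under 60% is large (the data are compatible with that claim), while the p-value under 80% is tiny (the data are very unlikely if 80% were true).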

Tips and Tools for Applying to Graduate School

If you want to apply to graduate school for statistics, there are certain tools that will make applying easier. (Note: While I am writing this for individuals applying for statistics, many of these tips can also be applied to other types of programs. It also assumes that you are going directly from undergrad to grad school, but again, much of it applies to people going from the workforce to grad school.)

But before you begin, ask yourself the following question: what is your reason for applying to graduate school? You need to understand why you are pursuing this path. Use this reason to guide you throughout this process so that you remain motivated. If your reason does not motivate you, find another reason.

Find a Graduate Advisor

Throughout this process, you will need someone to help you. Find someone with experience in applying to graduate schools. For me, that was a professor I had as an undergrad. You will need this person both for advice and for bouncing ideas off of. They can also help you edit your essays, since their experience tells them how these essays should be written.

What schools do you apply to?

While you must make the final decision on what schools you apply to, you should ask yourself these questions regarding each program:

  1. Does the program look like a place that you can be at for the next 5 years?
  2. Would you be excited to apply to the program?
  3. Does the program have professors actively doing research in areas that you are interested in?

You need to be able to answer these questions and understand why you are applying to each program. If you need to, write down each answer for each program.

How many schools do you apply to?

You should apply to at least:

  • 1 reach program
  • 2 or 3 competitive programs
  • 1 safety school

This is a general guideline and should always be reconsidered if you feel like you should apply to more programs. Be forewarned: it can easily get overwhelming if you apply to too many programs.

Recommenders

You should have 3 or 4 individuals in mind for writing you a letter of recommendation. They should be people that you know fairly well. They could be a favorite professor, department chair, or, if warranted, a boss from work. For myself, I tried to find a balance between professors who could speak to my theoretical and/or applied statistics skills. At a minimum, I tried to have both covered.

When you ask them, ask them in person. This is the best way. If you cannot, write a brief but polite email asking them to write you a strong letter. However, you might have to send the email multiple times, at least a week apart, because the individual might have missed it. Professors get a lot of email, and messages can be easy to miss.

If they agree, write them a thank you note with a list of schools you are thinking of applying to with the following information:

  • When the due date is for the program’s reference writers
  • How they will submit the letter (electronically, via email, etc.)

You should also give them this list with this information on a piece of paper. Also, remind them at appropriate intervals about submitting the letter. Offer to give them a resume or CV of your work. Also, keep them updated on where you are in the process of applying to graduate school. It will reinforce in their minds that you are serious about this and will also remind them to get the letter in!

Contents for Personal Statement/Statement of Purpose (SOP)

Your SOP should incorporate the following:

  • Your reason for applying to graduate school
  • Activities that you have participated in that will showcase yourself as a strong candidate (undergraduate research opportunities)
  • What field of statistics you are particularly interested in (computational statistics, nonparametrics, etc.)
  • A personal touch for the program you are applying to (e.g., mentioning a professor in the department who is engaging in research you find exciting)

Other Tools

Other useful tools that I used often were:

 

Summary

Essays

– Personal Statement

– SOP

Graduate Advisor

-find 1 and meet with him/her often
*could be department chair
*could be professor you get along with
-meet with them and have them edit essays

Recommenders

-have 3 to 4
-first ask if they can write you a strong letter
– second give them an excel sheet of what they need to know
*dates
*where
*strong letter
– remind them 2 weeks before letter due
– remind them 1 week before letter due
– remind them every day afterwards

Where to Apply

-pick at least 1 reach
-pick at least 3 medium
-pick at least 1 safety

Timeframe

-start looking during the summer before you apply (your senior year)

Personal Decisions

– how many places do you apply to
– do you apply to Applied Stat/Applied Math/Math/ Computer Science/Biostat?
– do you go for programs where you need to get master’s first?
*do you get master’s first?
– do you go for master’s at