Building a 35,000-Sample Wall Street Bets Dataset: Fine-Tuning, Nvidia Giveaway, and Data Access

Welcome back to another episode on the sentdex channel, where we dive deep into the world of Reddit datasets. Today, the team sets out to construct a dataset of 35,000 samples drawn from the dynamic realm of the Wall Street Bets subreddit. This isn't a run-of-the-mill collection: each sample captures a conversation between multiple speakers along with the bot's reply, mirroring the back-and-forth of real-life interactions. By fine-tuning a model on this dataset, the team aims to surpass previous iterations and set their sights on even better results in this data-driven odyssey.
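To make that structure concrete, here is a minimal sketch of what one such conversation sample might look like on disk. The field names and the JSONL layout are illustrative assumptions, not the exact schema used in the video:

```python
import json

# Hypothetical shape of one fine-tuning sample: a multi-speaker
# conversation plus the bot's reply. Field names are assumptions.
sample = {
    "conversation": [
        {"speaker": "user_1", "text": "NVDA earnings tomorrow, thoughts?"},
        {"speaker": "user_2", "text": "Calls. It's always calls."},
    ],
    "response": "Earnings are a coin flip; size your position accordingly.",
}

# Fine-tuning pipelines commonly store one JSON object per line (JSONL).
with open("wsb_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")
```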
But wait, there's more excitement in store! In a generous gesture, Nvidia is giving away an RTX 4080 SUPER at the GPU Technology Conference (GTC). Attendees are in for a treat as they explore cutting-edge advancements and witness the fusion of robotics with generative AI, promising a future brimming with innovation and endless possibilities. The channel founder also reminisces about working with Reddit data years ago, shedding light on how chatbots and language models have evolved and the remarkable progress the field has made since.
Venturing into the vast landscape of Reddit data sources, the team explores torrents, archive.org, and BigQuery, unearthing terabytes' worth of comments dating up to 2019. With careful attention to detail, they export this data to Google Cloud Storage, opting for JSON with compression to streamline handling. Their ultimate goal? To make the dataset accessible to all by uploading it to Hugging Face, ensuring this wealth of information is readily available to enthusiasts and researchers alike. Amidst the whirlwind of data processing and fine-tuning, the team also pre-decompresses files to speed up repeated processing passes, a crucial step in their relentless pursuit of data perfection.
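For the BigQuery-to-GCS step, an export along these lines would produce sharded, compressed newline-delimited JSON. The table and bucket names below are placeholders, not necessarily the ones used in the video:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Export as gzipped newline-delimited JSON, sharded across files.
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,
    compression=bigquery.Compression.GZIP,
)

extract_job = client.extract_table(
    "fh-bigquery.reddit_comments.2019_01",         # example public Reddit table
    "gs://your-bucket/reddit/comments-*.json.gz",  # * expands to shard numbers
    job_config=job_config,
)
extract_job.result()  # block until the export job completes
```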
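As for the pre-decompression step, one approach is to unpack each .gz shard once up front, so later processing passes read plain JSON instead of paying the decompression cost every time. Paths here are illustrative:

```python
import gzip
import shutil
from pathlib import Path

# Decompress every shard once, up front.
for gz_path in Path("reddit").glob("comments-*.json.gz"):
    out_path = gz_path.with_suffix("")  # comments-000.json.gz -> comments-000.json
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large files don't fill RAM
```

From there, publishing to Hugging Face can be as simple as loading the shards with the datasets library and calling push_to_hub (the repository id below is a placeholder):

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="reddit/comments-*.json")
ds.push_to_hub("your-username/wsb-conversations")  # placeholder repo id
```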

Watch Building an LLM fine-tuning Dataset on YouTube
Viewer Reactions for Building an LLM fine-tuning Dataset
- Viewer expresses gratitude for learning ML and coding from the channel
- Viewer shares experience of creating a binary classification model about breast cancer
- Viewer mentions starting a tech startup for logistics
- Viewer shares creating a scraper for WSB comments
- Viewer appreciates the fun learning process provided by the channel
- Viewer discusses cleaning and uploading conversation data for others to use
- Viewer points out potential issues with data filtering for model training
- Viewer plans on creating a comprehensive guide for ComfyUI and Flux
- Viewer inquires about completing the Python from scratch series
- Viewer requests a video on meta-learning with examples
Related Articles

Mastering Programming of the Inspire Robot Hands: Challenges & Successes
Join sentdex as they tackle programming the advanced Inspire robot hands, exploring challenges and successes in communicating with and controlling these cutting-edge robotic devices.

Revolutionizing Prototyping: GPT-4 Terminal Access for Efficient R&D
Explore how the sentdex team leverages GPT-4 for streamlined prototyping and R&D. Discover the potential time-saving benefits and innovative applications of granting GPT-4 access to the terminal.

Unleashing LongNet: Revolutionizing Large Language Models
Join sentdex to explore how context-length constraints limit large language models. Discover Microsoft's LongNet and its potential to scale models to billion-token contexts. Uncover the challenges and promises of dilated attention in expanding context windows for improved model performance.

Revolutionizing Programming: Function Calling and AI Integration
Explore sentdex's latest update on groundbreaking function calling capabilities and API enhancements, revolutionizing programming with speed and intelligence integration. Learn how to define functions and parameters for optimal structured data extraction and seamless interactions with GPT-4.