Building a 35,000-Sample Wall Street Bets Dataset: Fine-Tuning, Nvidia Giveaway, and Data Access

Welcome back to another episode on the sentdex channel, this time diving into the world of Reddit datasets. The goal: build a fine-tuning dataset of 35,000 samples sourced from the Wall Street Bets subreddit. This isn't a run-of-the-mill collection; each sample captures a conversation between multiple speakers followed by the bot's response, mirroring the structure of real threaded discussions. By fine-tuning on this dataset, the team aims to surpass the results of previous iterations.
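To illustrate the "conversations between multiple speakers" shape described above, here is a minimal sketch of turning a flat list of Reddit comments into ordered conversation chains by walking `parent_id` links. The field names (`id`, `parent_id`, `author`, `body`) follow Reddit's public comment schema, but the threading logic itself is an illustrative assumption, not the exact pipeline from the video.

```python
def build_chains(comments, min_turns=2):
    """Walk parent_id links to turn a flat comment list into conversations.

    Returns a list of chains, each an oldest-first list of (author, body)
    pairs with at least `min_turns` turns.
    """
    by_id = {c["id"]: c for c in comments}
    chains = []
    for c in comments:
        chain = [c]
        parent = c.get("parent_id", "")
        # Reddit prefixes comment parents with "t1_"; a "t3_" parent is the
        # submission itself, which ends the chain.
        while parent.startswith("t1_") and parent[3:] in by_id:
            chain.append(by_id[parent[3:]])
            parent = by_id[parent[3:]].get("parent_id", "")
        chain.reverse()  # oldest comment first
        if len(chain) >= min_turns:
            chains.append([(c["author"], c["body"]) for c in chain])
    return chains
```

Note that every comment seeds its own chain, so deep threads produce overlapping prefixes; a real pipeline would likely deduplicate or keep only leaf-anchored chains.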
But wait, there's more: Nvidia is giving away an RTX 4080 Super in conjunction with its GPU Technology Conference (GTC), where attendees can explore cutting-edge advances such as the fusion of robotics with generative AI. Reminiscing about working with Reddit data years ago, the channel founder also reflects on the evolution of chatbots and language models, and on how much the field has progressed since those earlier experiments.
Surveying the available Reddit data sources, the team weighs torrents, archive.org, and BigQuery, settling on a BigQuery dataset containing terabytes of comments dating up to 2019. They export this data to Google Cloud Storage as compressed JSON to streamline handling, with the ultimate goal of uploading the processed dataset to Hugging Face so that it is readily available to enthusiasts and researchers alike. To improve processing speed, they also pre-decompress the exported files, trading disk space for throughput on repeated passes over the data.
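The export-and-process flow above can be sketched as follows: stream a gzip-compressed JSONL shard line by line, and optionally pre-decompress it once so that later passes skip the CPU cost of inflating. The file names and per-line schema here are illustrative assumptions; the real shards come from the BigQuery-to-GCS export job described above.

```python
import gzip
import json
import shutil


def iter_comments(path):
    """Yield one parsed comment dict per line of a .json.gz export shard."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def predecompress(src, dst):
    """One-time decompression so later passes read plain JSONL directly."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

Pre-decompressing pays off when the same shard is scanned many times during filtering and sample construction, at the cost of roughly 5-10x the on-disk footprint for typical JSON text.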

Watch Building an LLM fine-tuning Dataset on YouTube
Viewer Reactions for Building an LLM fine-tuning Dataset
Viewer expresses gratitude for learning ML and coding from the channel
Viewer shares experience of creating a binary classification model about breast cancer
Viewer mentions starting a tech startup for logistics
Viewer shares creating a scraper for WSB comments
Viewer appreciates the fun learning process provided by the channel
Viewer discusses cleaning and uploading conversation data for others to use
Viewer points out potential issues with data filtering for model training
Viewer plans on creating a comprehensive guide for ComfyUI and Flux
Viewer inquires about completing the Python from scratch series
Viewer requests a video on meta-learning with examples
Related Articles

Unleashing LongNet: Revolutionizing Large Language Models
Explore the limitations of large language models due to context-length constraints on sentdex. Discover Microsoft's LongNet and its potential to scale models to billion-token contexts. Uncover the challenges and promises of dilated attention in expanding context windows for improved model performance.

Revolutionizing Programming: Function Calling and AI Integration
Explore sentdex's latest update on groundbreaking function calling capabilities and API enhancements, revolutionizing programming with speed and intelligence integration. Learn how to define functions and parameters for optimal structured data extraction and seamless interactions with GPT-4.

Unleashing Falcon 40B: Practical Applications and Comparative Analysis
Explore the Falcon 40B Instruct model on sentdex, a powerful large language model with 40 billion parameters. Discover its practical applications and use cases, and how it compares to models like GPT-3.5 and GPT-4. Unleash the potential of Falcon in natural language generation, math problem-solving, and understanding human emotions. Get insights on running the model locally, its licensing, and the AI team behind its development. Join the AI revolution with Falcon 40B Instruct!

Revolutionizing Sentiment Analysis: k-NN vs. BERT with Gzip Compression
Explore how a text classification method on sentdex challenges BERT in sentiment analysis using k-nearest neighbors and gzip compression. Learn about the process, implementation, efficiency improvements, and promising results of this innovative approach.