Work with Text Data (Exploitez des données textuelles)

By: OpenClassrooms

Overview

Welcome to this course on natural language processing! The goal of this course is to understand the methods that turn text into features usable by classical machine learning algorithms, as well as the architectures and models best suited to this kind of data: in this case, a collection of unstructured text documents.

The course is divided into three parts. The first covers exploring, cleaning, and normalizing text. The second is devoted to the different types of transformations that help us better understand our text data and create features we can use in classical machine learning algorithms. The last part covers text classification using machine learning.

Prerequisites:

This course is part of the Data Scientist path. It sits at the intersection of mathematics and computer science. To get the most out of it, feel free to refresh your memory, before or during the course, on:

  • Python for numerical computing (numpy) and plotting (pyplot), which we will use in the hands-on parts of the course,
  • Some linear algebra: vector manipulation, matrix multiplication, norms, and eigenvalues/eigenvectors,
  • Some probability and statistics, such as probability distributions and variance,
  • Unsupervised models, which make it possible to derive features automatically from text,
  • Nonlinear supervised models, which are essential for text processing, in particular sequential neural networks.

Syllabus

Part #1 - Process text data
1. Explore text data
2. Clean and normalize text data
Activity: Clean text data and perform an exploratory analysis
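
To give a feel for what cleaning and normalizing text can involve, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the course's own code: the function name `normalize`, the tiny `STOP_WORDS` list, and the choice of steps (lowercasing, accent stripping, letter-only tokenization, stop-word removal) are all hypothetical.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et", "est"}

def normalize(text):
    """Lowercase, strip accents, tokenize on letters, drop stop words."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Keep runs of ASCII letters; punctuation and digits act as separators.
    tokens = re.findall(r"[a-z]+", text)
    # Drop stop words and single-letter leftovers such as elided articles.
    return [tok for tok in tokens if len(tok) > 1 and tok not in STOP_WORDS]

tokens = normalize("Le traitement du langage naturel est passionnant !")
print(tokens)
```

In practice a real stop-word list and a proper tokenizer would likely replace the hand-written pieces above; the sketch only shows the order of the steps.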

Part #2 - Transform text data
1. Represent your corpus as a "bag of words"
2. Compute word embeddings
3. Model topics with unsupervised methods
Quiz: Part 2
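
As an illustration of the "bag of words" representation named in this part, the sketch below turns a small corpus into a document-term count matrix. It assumes whitespace tokenization and raw counts; the function name `bag_of_words` and the sample corpus are hypothetical, and the course itself may rely on a library instead.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in doc i."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all distinct tokens in the corpus.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # Counter returns 0 for words absent from this document.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

corpus = ["the cat sat", "the cat saw the dog"]
vocab, X = bag_of_words(corpus)
print(vocab)
for row in X:
    print(row)
```

Each row is a fixed-length feature vector, which is what lets a classical machine learning algorithm consume text; word order is discarded.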

Part #3 - Classify text data
1. Run a first naive sentiment classification
2. Go further in word classification
3. Process language with neural networks
Activity: Classify text
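
One plausible reading of a "naive" first sentiment classifier is a multinomial naive Bayes model over word counts; the source does not confirm which method the course uses, so the sketch below is an assumption. The function names `train` and `predict`, the add-one smoothing, and the four training sentences are hypothetical and chosen only to keep the example small.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label). Returns the counts needed to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-posterior score."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior: fraction of training documents carrying this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("great film loved it", "pos"),
    ("wonderful acting great story", "pos"),
    ("boring film hated it", "neg"),
    ("terrible acting boring story", "neg"),
]
model = train(training)
print(predict(model, "great story"))
```

The model treats words as independent given the label, which is the "naive" assumption; it is a common baseline before moving to the neural approaches listed later in this part.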

Taught by

Yannis Chaouche

  • OpenClassrooms
  • Free
  • French
  • Certificate available
  • Available at any time
  • All levels
  • N/A